Latest Tweets

 

Minimal Perfect Hash for Strings

A minimal perfect hash is simply a bijection between a set of N objects and the first N integers. See the wikipedia page about perfect hashing for more information.

I’m looking to produce a fast and memory efficient map implementation over String keys and I want to use a perfect hash to pack values without risking collisions.

Casting around online for a Java implementation led me to Sux4J (this rather unpromising sounding library looks excellent and I’m going to take a closer look at it soon) but the implementations are too generic for my purposes.

So I’ve fashioned my own minimal perfect hashing algorithm for Java strings.I don’t know anything about the theory of such algorithms, but my approach was to build a data-structure as follows:

  1. input an array of ‘hashable’ strings
  2. order the strings by their standard Java hashcode
  3. store the hashcodes in an integer array
  4. identify spans of duplicate hashcodes
  5. if there are no duplicates then do nothing more
  6. else for each set of strings with the same hashcode
    1. build a decision tree that can distinguish the strings
    2. store a pointer to the tree for each string in the set

The gist of this approach is outlined visually in the following diagram. Here, the hashes for the strings “cat” and “dog” collide, so a small decision tree needs to be consulted in the case that either of these two are hashed to confirm which value should be returned.

There are some nuances in the actual implementation. For example, Strings are actually ordered by length and character sequence too; this assists the building of fast decision trees. Also each duplicated string stores its offset within the span to avoid scanning back to find the start of the span (I think there are better space/time trade-offs around this that I may investigate in the future.).

However, these complexities only come into play when there are collisions of the String hash function. The common case, even for fairly large numbers of strings, is that there are no collisions. This reduces the time complexity to a binary search over the integer array of hashcodes — not as fast as only taking the hashcode, but still fast. And even the exceptional case is still fairly fast since it typically needs to test candidate string’s length to determine which minimal hash value it should be assigned.

Note that the algorithm is designed to examine as little about the string as is necessary to determine the perfect hash value (this helps to make it fast). Furthermore, the algorithm does not keep a reference to the list of potential candidates for hashing (this helps to reduce memory overhead). As a consequence, the algorithm cannot necessarily distinguish valid candidates; though they are frequently identified as such, invalid string arguments may generate valid hashes.

The source code for this implementation is fully documented and is under the Apache 2.0 licence. It is available through my website’s Google code project

  1. tomgibara posted this
blog comments powered by Disqus