Latest Tweets

 

Living in Hash Table Ignorance

I’ve been living in hash table ignorance, always accepting the received wisdom that a load factor (the ratio of elements stored to available buckets) of around 0.75 provides a good trade-off between storage and performance but never stopping to think about what’s actually going on. For example, what happens at the limit, when the number of elements approaches the number of buckets in a chained hash table such as Java’s HashMap.

As part of my development of crinch I’ve been approaching data structures from the perspective of reducing memory overhead. So when I started writing a hash map implementation for the library, the following question naturally arose: in a hash table of capacity n, that contains n elements, how many buckets will be empty?

Assuming a hashing algorithm that distributes keys with a high degree of uniformity, there will inevitably be some collisions (implying empty buckets elsewhere) but my expectation was that there would not be not too many. I guessed that maybe 10% of the buckets might be empty, and I was hoping for fewer than that.

When I found that slightly more than a third of the buckets in my hash table implementation were empty, I wish I could say I was immediately shocked into the realization that I’d never bothered to understand the real performance characteristics of the humble hash map. Instead I wasted an hour looking for a mistake — trying different hash functions and a number of different key sets — before realizing that the results I was seeing were so consistent (always hovering tightly around 36.8% of buckets empty) that I must be missing something. Then I did something more intelligent and walked away from my keyboard, picked up a pen and reasoned through the problem. It turns out it’s simple.

Let’s fix our attention on one bucket (it doesn’t matter which they’re identical under the assumption of a uniform hash). What’s the probability that the first key hashes into that bucket? Assuming a uniform hash it’s \(\frac1n\), and the probability it doesn’t is \(1-\frac1n\). This probability is constant for each key we add and furthermore each outcome is independent, so the probability \(P_n\) that none of the \(n\) elements hash to our bucket is given by

\[P_n = \left(1 - \frac {1} {n}\right)^n = \left(\frac {n-1} {n}\right)^n\]

What happens as \(n\) grows large? If limit abuse upsets you, look away now.

\[\begin{aligned} \lim_{n\to\infty} P_n = & \lim_{n\to\infty} \left(\frac {n-1} {n}\right)^n \\ = & \lim_{n\to\infty} \left(\frac {n} {n-1}\right)^{-n} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’+1} {n’}\right)^{-(n’+1)} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’} {n’+1}\right) / \left(1 + \frac {1} {n’}\right)^{n’} \\ = & 1/e \\ \approx & 0.3678 \end{aligned}\]

What this result tells us that if we take a reasonably capacious hash table that chains collisions and which is equipped with an ideal hash function that uniformly distributes all keys, and we add as many records as there are buckets, it’s almost certain that more than a third of the buckets are completely empty. All the years I’ve been stuffing objects into Java HashMaps and I never knew.

I don’t have my volumes of Knuth to hand; I’m sure all this is covered and more and that I really should have known this years ago, but at least it was an enjoyable excursion of ignorance.

Minimal Perfect Hash for Strings

A minimal perfect hash is simply a bijection between a set of N objects and the first N integers. See the wikipedia page about perfect hashing for more information.

I’m looking to produce a fast and memory efficient map implementation over String keys and I want to use a perfect hash to pack values without risking collisions.

Casting around online for a Java implementation led me to Sux4J (this rather unpromising sounding library looks excellent and I’m going to take a closer look at it soon) but the implementations are too generic for my purposes.

So I’ve fashioned my own minimal perfect hashing algorithm for Java strings.I don’t know anything about the theory of such algorithms, but my approach was to build a data-structure as follows:

  1. input an array of ‘hashable’ strings
  2. order the strings by their standard Java hashcode
  3. store the hashcodes in an integer array
  4. identify spans of duplicate hashcodes
  5. if there are no duplicates then do nothing more
  6. else for each set of strings with the same hashcode
    1. build a decision tree that can distinguish the strings
    2. store a pointer to the tree for each string in the set

The gist of this approach is outlined visually in the following diagram. Here, the hashes for the strings “cat” and “dog” collide, so a small decision tree needs to be consulted in the case that either of these two are hashed to confirm which value should be returned.

There are some nuances in the actual implementation. For example, Strings are actually ordered by length and character sequence too; this assists the building of fast decision trees. Also each duplicated string stores its offset within the span to avoid scanning back to find the start of the span (I think there are better space/time trade-offs around this that I may investigate in the future.).

However, these complexities only come into play when there are collisions of the String hash function. The common case, even for fairly large numbers of strings, is that there are no collisions. This reduces the time complexity to a binary search over the integer array of hashcodes — not as fast as only taking the hashcode, but still fast. And even the exceptional case is still fairly fast since it typically needs to test candidate string’s length to determine which minimal hash value it should be assigned.

Note that the algorithm is designed to examine as little about the string as is necessary to determine the perfect hash value (this helps to make it fast). Furthermore, the algorithm does not keep a reference to the list of potential candidates for hashing (this helps to reduce memory overhead). As a consequence, the algorithm cannot necessarily distinguish valid candidates; though they are frequently identified as such, invalid string arguments may generate valid hashes.

The source code for this implementation is fully documented and is under the Apache 2.0 licence. It is available through my website’s Google code project