Latest Tweets

 

Living in Hash Table Ignorance

I’ve been living in hash table ignorance, always accepting the received wisdom that a load factor (the ratio of elements stored to available buckets) of around 0.75 provides a good trade-off between storage and performance but never stopping to think about what’s actually going on. For example, what happens at the limit, when the number of elements approaches the number of buckets in a chained hash table such as Java’s HashMap.

As part of my development of crinch I’ve been approaching data structures from the perspective of reducing memory overhead. So when I started writing a hash map implementation for the library, the following question naturally arose: in a hash table of capacity n, that contains n elements, how many buckets will be empty?

Assuming a hashing algorithm that distributes keys with a high degree of uniformity, there will inevitably be some collisions (implying empty buckets elsewhere) but my expectation was that there would not be not too many. I guessed that maybe 10% of the buckets might be empty, and I was hoping for fewer than that.

When I found that slightly more than a third of the buckets in my hash table implementation were empty, I wish I could say I was immediately shocked into the realization that I’d never bothered to understand the real performance characteristics of the humble hash map. Instead I wasted an hour looking for a mistake — trying different hash functions and a number of different key sets — before realizing that the results I was seeing were so consistent (always hovering tightly around 36.8% of buckets empty) that I must be missing something. Then I did something more intelligent and walked away from my keyboard, picked up a pen and reasoned through the problem. It turns out it’s simple.

Let’s fix our attention on one bucket (it doesn’t matter which they’re identical under the assumption of a uniform hash). What’s the probability that the first key hashes into that bucket? Assuming a uniform hash it’s \(\frac1n\), and the probability it doesn’t is \(1-\frac1n\). This probability is constant for each key we add and furthermore each outcome is independent, so the probability \(P_n\) that none of the \(n\) elements hash to our bucket is given by

\[P_n = \left(1 - \frac {1} {n}\right)^n = \left(\frac {n-1} {n}\right)^n\]

What happens as \(n\) grows large? If limit abuse upsets you, look away now.

\[\begin{aligned} \lim_{n\to\infty} P_n = & \lim_{n\to\infty} \left(\frac {n-1} {n}\right)^n \\ = & \lim_{n\to\infty} \left(\frac {n} {n-1}\right)^{-n} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’+1} {n’}\right)^{-(n’+1)} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’} {n’+1}\right) / \left(1 + \frac {1} {n’}\right)^{n’} \\ = & 1/e \\ \approx & 0.3678 \end{aligned}\]

What this result tells us that if we take a reasonably capacious hash table that chains collisions and which is equipped with an ideal hash function that uniformly distributes all keys, and we add as many records as there are buckets, it’s almost certain that more than a third of the buckets are completely empty. All the years I’ve been stuffing objects into Java HashMaps and I never knew.

I don’t have my volumes of Knuth to hand; I’m sure all this is covered and more and that I really should have known this years ago, but at least it was an enjoyable excursion of ignorance.

A memory efficient Map

Building on the perfect minimal hash I’ve developed for Java String objects, I’ve developed an memory efficient implementation of Java’s Map interface.

View the map source code in SVN

The map implementation is not general purpose, it’s specifically targeted to situations where:

  • The map is keyed by strings; the valid keys are known ahead of time.
  • The map values are, on average, anticipated to be fairly dense (ie. you don’t have situation where the map could contain any of several thousand keys, but only contains one value)
  • There will be many map instances with the same keys.
  • Performance comparable to that of a ‘HashMap’ is acceptable.

Even given these constraints, I frequently encounter situations where all of these apply:

  • High volume HTTP services where values need to be parsed out of the query string and bundled into a map.
  • Data querying APIs where query parameters need to be supplied via Map for substitution into a query.
  • Proxy implementations of setter/getter interfaces that are backed by a Map of values.

The implementation stores the set of valid keys in an object that can be shared between ‘like’ maps. The keys are hashed perfectly so that the map values can be stored directly into a single object array; no entry objects are required. This means that the memory required to store large numbers of similar maps can . Furthermore, none of the basic map operations (put/get/contains etc.) performs any memory allocation. This means that for many applications both memory overhead and object churn is minimal.

From the javadocs:

Unlike regular Map implementations, there is no default, or copy, constructor. Instead the map must be constructed around a set of possible String keys. Only the keys supplied to a constructor may be used in the maps it constructs. Re-using constructors (by repeatedly using them to create ParameterMap instances) minimizes the memory required to store the map. Furthermore, the only map operations that allocate new objects are those methods that required to do so by the Map interface; in-short, object creation strictly minimized. This makes the class ideal for environments where reducing GC overhead is important.

The source code is under the Apache 2.0 licence, built with Maven, and is available from my website’s google code project:

http://code.google.com/p/tomgibara/