Imagine a situation where every word in a large corpus is assigned an arbitrary numeric identifier, and you need a data structure that can quickly return the word-frequency for any particular identifier. If you are not a programmer or you can’t imagine finding yourself in this situation you can safely stop reading now.
I wasn’t in exactly in this situation, but a similar one that shares the same basic constraints:
It’s not a run-of-the-mill set of requirements for a data-structure; you certainly won’t find any help in the standard Java libraries, so I thought I’d share my approach.
The underlying concept is simple - sort the map entries into key order and use a binary search on the key to find the entry and return the associated value. This requires O(log n) accesses per map lookup but the problem with this naive approach is memory consumption. Imagining that both keys and values can be accommodated in 64 bit longs, one entry requires 16 bytes. With 10 million entries, we’re looking at 160 megabytes of memory for a data structure which in my application is only subsidiary to a number of larger algorithms. Can we do better?
It turns out that we can take advantage of fact that data frequently exhibit a bias towards small values. Though some values may be very large (eg. there are a small number of very high frequency words) most values are small (eg. most words are infrequent). By padding every value to the same width (eg. 64 bits) we’re wasting a lot of space.
To trim the values we can use a universal code and this will reduce the number of bits required to store most values if most values are relatively short. But now we have a problem: how can we operate a binary search when the entries are variable length?
My solution to this was to use Fibonacci coding. This elegant code has the property that every number’s binary representation ends with 11 and there are no consecutive ones elsewhere present. How does this help?
We encode every key and value using Fibonacci coding and lay them out [key,value,key,value,…] without any padding, in ascending key order. Then we perform our binary search at a bit-level. We examine the middle bit of our data and work outwards from there looking for a pair of consecutive ones. Because of the properties of this encoding, we know that the binary sequence 11 indicates the end of one number’s encoding and we can then decode the next pair of numbers. But how do we distinguish keys from values?
I chose a simple scheme: for each key, add one and double it; for each value double it and add one. So keys are always even and values are always odd and we can easily identify if it’s a key or a value we’ve decoded based on whether it is even or odd.
So the algorithm, slightly simplified, looks like this:
With the right library support, this is very easy to code — my crinch library has excellent support for universal codes.
There may be better approaches (I’d be happy to hear recommendations), but I thought this one was interesting enough to share.