Latest Tweets

 

Guava’s hashing abstractions compared to Crinch’s

Guava recently introduced a new hashing package and I’ve spent the evening benchmarking my hash library implementation against Guava’s. I’m interested in whether I can abandon my com.tomgibara.crinch.hashing package for com.google.guava.hash.

It’s been fascinating to closely inspect an alternative abstraction of the same (fairly fundamental) concept; especially since I’ve never really liked my own.

Unsurprisingly, though class names obviously differ (Guava’s are far better in my opinion — more consistent and intelligible), there are a number of similarities, but I’ve chosen to highlight some of the differences below, with my initial observations on them.

The intention is absolutely not to knock Guava or to say whether it’s better or worse than my own library — they operate in quite different contexts — it’s simply to explore the trade-offs made in their design (as I see them).

Here are my initial observations:

  • One fundamental difference between the approach of the libraries is that Guava sees hashing (executed by HashFunctions) always as an operation on a byte array. This is something I consciously avoided in Crinch. The equivalent class (Hash) operates directly on objects to be hashed, even if objects may be collapsed to bytes internally. I felt this design choice was important because some applications may need to provide an alternative hash implementation that can’t be easily or efficiently expressed at a byte level. The Guava abstraction is cleaner but may not be as flexible.
  • Guava converts objects into bytes using a Funnel abstraction that ‘puts’ bytes into a Sink which is fluent and appears to be strongly influenced by Java’s ByteBuffer class. Conversely Crinch uses a HashSource (a truly terrible class name) to ‘write’ object bytes into a WriteStream which is loosely based on Java’s DataOutputStream class. This write/put distinction has almost no implications at a code level, but hints at different mental models. One difference that does affect the relative usability of the APIs is that Sink is fluent (as are many of Guava’s interfaces) whereas WriteStream is not. I always hesitate when faced with the decision to employ a fluent interface that may be used in performance critical sections of code; I always worry that there may be a performance hit - even though I have no evidence that this is the case; it may be irrationality on my part.
  • Guava doesn’t appear to provide an object that combines a Funnel with a HashFunction. This means that, at least in little coding I’ve so far done with Guava’s hash package, I frequently have to push both objects around myself. It’s a small niggle, but one that Crinch avoids.
  • To accommodate hash values that are larger than a Java primitive, Guava uses a HashCode interface that gives access to the hash as an int, long or byte[]. Crinch on the other hand uses Java’s BigInteger class for the same purpose and I think this benefits the developer in terms of familiarity, definiteness and usefulness. Obviously Guava’s use of an interface here provides for important performance benefits when implementing specific HashFunctions, but this is not such an issue in Crinch because intermediate object creation can generally be avoided (see next comment).
  • Most of the Crinch code has been designed to avoid unnecessary garbage; it’s designed to handle large collections of objects and as such even small amounts of garbage build up quickly. This often works against producing simpler APIs, and here is one such case. Guava exposes a single hash() method that returns a HashCode object, but Crinch has three methods hashAsInt(), hashAsLong() and hashAsBigInt(). Intermediate object creation can be avoided at the expense of a more ugly API.
  • Guava only allows hash sizes to be specified in terms of bit size, but Crinch is more specific: hashes can be specified to lie within a range of values. This distinction can be important because many interesting data structures (such as Bloom filters and compact approximators) may rely on hash codes that lie over certain ranges and there exist specific modes of hashing that can match those ranges with little bias compared to simply using mod with more bits. But whether this would ever be a practical matter for developers using the Guava libraries is debatable.
  • Crinch exposes an extended Hash abstraction for generating multiple hash values for a single object called MultiHash. Guava performs a similar task within its BloomFilter implementation but entirely hides the implementation. I think this may turn out to be unfortunate because there are several different data structures that use multiple hashes (eg. cuckoo hash tables), and it seems artificial to accumulate them in the hash package without exposing a core abstraction on which they depend.
  • Guava provides clean and useful abstract base classes for implementing new HashFunctions in the form of AbstractNonStreamingHashFunction and AbstractStreamingHashFunction the use of a ByteBuffer in the latter is a particularly good fit. Crinch provides nothing half as useful, though it should.

On measure, I think Guava certainly provides a cleaner API, whether it’s more effective from the perspective of a developer writing an application is less certain. The important question for my applications is: does the cleaner API come at an unavoidable cost in performance?

I don’t know yet and I’m not implying it does; it could take a long while to find out.

This post turned out longer than I anticipated so finishing the preliminary benchmarking will have to wait until tomorrow.

Mapping numbers to numbers compactly

Imagine a situation where every word in a large corpus is assigned an arbitrary numeric identifier, and you need a data structure that can quickly return the word-frequency for any particular identifier. If you are not a programmer or you can’t imagine finding yourself in this situation you can safely stop reading now.

I wasn’t in exactly in this situation, but a similar one that shares the same basic constraints:

  • a map that is largely static,
  • the expectation that there will be millions of keys,
  • keys and values that are positive whole numbers of arbitrary size, and
  • the requirement that the map be stored in memory for consistently fast access.

It’s not a run-of-the-mill set of requirements for a data-structure; you certainly won’t find any help in the standard Java libraries, so I thought I’d share my approach.

The underlying concept is simple - sort the map entries into key order and use a binary search on the key to find the entry and return the associated value. This requires O(log n) accesses per map lookup but the problem with this naive approach is memory consumption. Imagining that both keys and values can be accommodated in 64 bit longs, one entry requires 16 bytes. With 10 million entries, we’re looking at 160 megabytes of memory for a data structure which in my application is only subsidiary to a number of larger algorithms. Can we do better?

It turns out that we can take advantage of fact that data frequently exhibit a bias towards small values. Though some values may be very large (eg. there are a small number of very high frequency words) most values are small (eg. most words are infrequent). By padding every value to the same width (eg. 64 bits) we’re wasting a lot of space.

To trim the values we can use a universal code and this will reduce the number of bits required to store most values if most values are relatively short. But now we have a problem: how can we operate a binary search when the entries are variable length?

My solution to this was to use Fibonacci coding. This elegant code has the property that every number’s binary representation ends with 11 and there are no consecutive ones elsewhere present. How does this help?

We encode every key and value using Fibonacci coding and lay them out [key,value,key,value,…] without any padding, in ascending key order. Then we perform our binary search at a bit-level. We examine the middle bit of our data and work outwards from there looking for a pair of consecutive ones. Because of the properties of this encoding, we know that the binary sequence 11 indicates the end of one number’s encoding and we can then decode the next pair of numbers. But how do we distinguish keys from values?

I chose a simple scheme: for each key, add one and double it; for each value double it and add one. So keys are always even and values are always odd and we can easily identify if it’s a key or a value we’ve decoded based on whether it is even or odd.

So the algorithm, slightly simplified, looks like this:

  1. Start with the whole range of binary data (consisting of key-value pairs ordered by key, encoded using a Fibonacci coding, with keys incremented and doubled, and values doubled and incremented)
  2. If the range is empty return null (ie. no match)
  3. Find the midpoint of the range.
  4. Starting from the midpoint scan for a consecutive pair of one bits.
  5. Decode the number starting immediately after the 11 bit sequence.
  6. If the number is odd, decode the next number.
  7. Recover the key by dividing the number by two and subtracting one.
  8. If the recovered key equals the key we are looking for, decode the next number, subtract one, divide it by two and return the result
  9. Otherwise, if the key is greater than the key we are looking for, move the right-hand limit of the range to the start of the key and continue from (2).
  10. Otherwise the key is less than the key we are looking for so move the left-hand limit of the range to the end of the value and continue from (2).

With the right library support, this is very easy to code — my crinch library has excellent support for universal codes.

There may be better approaches (I’d be happy to hear recommendations), but I thought this one was interesting enough to share.