Bloom filters and Hashing

Across various software libraries I’ve produced, I have a small number of ad-hoc Bloom filter implementations. What started out as a small task to rationalize them into a single more general implementation has turned out to be trickier than I expected.

I already had a target interface for the BloomFilter implementation and my powerful BitVector class to build on. Even so, it turns out that the real barrier to producing a general Bloom filter implementation is designing a useful API that generalizes Object.hashCode() efficiently.

This task, one I expected to be simple, has bogged me down, and I still don’t think much of my API: there are many infelicities. One of the core requirements is that the API must permit hash values to be accessed quickly (say, passing an Object in and getting an int back) because hashing is often performance critical. On the other hand, some hashes (say, cryptographic hashes like SHA-1) need to return values larger than any Java primitive. A single method signature can’t accommodate both. In addition, many useful data structures require the generation of multiple hashes over a specific numeric range (Bloom filters are not the only example; cuckoo hashing is another), even though the common case will probably remain the creation of single hash values.
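To make the multiple-hash requirement concrete, here is a minimal sketch that derives several indices in a fixed range from a single Object.hashCode() value using double hashing. This is a common technique, not necessarily what the hashing package does; the class name MultiHashSketch and both method names are hypothetical.

```java
// Hypothetical helper for illustration; not part of the hashing package.
final class MultiHashSketch {

    // Scrambles the bits of an int (the 32-bit finalizer from MurmurHash3),
    // so that weak hashCode() implementations still spread across the range.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Derives k indices in [0, size) from one hashCode() call by double
    // hashing: index_i = (h1 + i * h2) mod size.
    static int[] indices(Object value, int k, int size) {
        if (k < 0 || size <= 0) {
            throw new IllegalArgumentException("k must be >= 0 and size > 0");
        }
        int h1 = mix(value.hashCode());
        int h2 = mix(h1 ^ 0x9e3779b9) | 1; // forced odd so the step is never zero
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            // int overflow wraps silently; floorMod keeps the result non-negative
            result[i] = Math.floorMod(h1 + i * h2, size);
        }
        return result;
    }
}
```

A Bloom filter with 1024 bits and 5 hash functions could call MultiHashSketch.indices(element, 5, 1024) and set or test the bit at each returned index. Note that this only covers hashes that fit in an int; it does nothing for the wide cryptographic case.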

It’s early days, but the source code for the hashing package is available.

At present it defines a core Hash interface that can generate individual hash values for objects. This is extended by MultiHash, which can generate a HashList to return multiple hash values. Each Hash declares a HashRange, which is the range of values that the hash may generate. To facilitate the implementation of general hashing algorithms, a HashSource interface provides a standard way to decompose Objects into byte streams. Finally, a Hashing class provides static utility methods.
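As a rough illustration of how Hash and HashRange might fit together, consider the sketch below. Only the type names Hash and HashRange come from the description above; every method name and signature here is an assumption, ObjectHash is a hypothetical implementation, and MultiHash, HashList, HashSource and Hashing are left out for brevity.

```java
import java.math.BigInteger;

// Illustrative only: the signatures are guesses, not the package's real API.

/** Inclusive range of values that a hash may produce. */
final class HashRange {
    final BigInteger min;
    final BigInteger max;

    HashRange(BigInteger min, BigInteger max) {
        if (min.compareTo(max) > 0) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.min = min;
        this.max = max;
    }

    /** True when every value in the range can be returned as an int. */
    boolean fitsInInt() {
        // bitLength() excludes the sign bit, so 31 bits or fewer fit in an int
        return min.bitLength() < 32 && max.bitLength() < 32;
    }
}

/** Assumed shape: a fast int accessor alongside a general wide accessor. */
interface Hash {
    HashRange range();

    /** Fast path, meaningful only when range().fitsInInt() is true. */
    int intHashValue(Object value);

    /** General path for hashes of any width, such as SHA-1. */
    BigInteger bigHashValue(Object value);
}

/** Hypothetical Hash backed by Object.hashCode(). */
final class ObjectHash implements Hash {
    private static final HashRange RANGE = new HashRange(
            BigInteger.valueOf(Integer.MIN_VALUE),
            BigInteger.valueOf(Integer.MAX_VALUE));

    @Override
    public HashRange range() {
        return RANGE;
    }

    @Override
    public int intHashValue(Object value) {
        return value.hashCode();
    }

    @Override
    public BigInteger bigHashValue(Object value) {
        return BigInteger.valueOf(value.hashCode());
    }
}
```

Offering two accessors is one way around the single-signature problem: callers on a hot path check the range once and then use the int method, while a 160-bit hash would report a range that does not fit and be read through the BigInteger method.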

The source code for the Bloom filter implementation is also available.
