Bloom filters are very easy to implement. Unfortunately, there doesn’t seem to be very much information on the web that describes how actually generate all those independent hash functions you’ll need, especially over arbitrarily typed key values.
A standard Java hash value is an int. If you need to generate say 10 hash values over a range of 1 million indices an int just wont do - you just don’t have enough bits (by my mental arithmetic you’d need about 200bits). But short of requiring that any developer who wants to use a Bloom filter implements their own good quality multi-hash implementation, the int returned by Object.hashCode() is all we have.
My approach in the past has been to use a SecureRandom instance seeded with the object hash code (or byte data from the object if its available), making repeated calls to nextInt(int) to obtain as many hash values as I need. This is almost the best you can do to create a multiplicity of well-distributed hash values within a specified range.
It is nevertheless slow, SecureRandom objects are expensive to create and use. An obvious alternative is to consider using a regular Random object, the questions are:
SecureRandom?This is based on a small test application I wrote that inserted increasing numbers of elements into a Bloom filter while estimating the false-positive rate at each step. Up to 10,000 integers were stored using 4 hashes in a basic Bloom filter containing approximately 57,000 bits. The test was repeated using both a regular Random implementation and a SecureRandom “SHA1PRNG” implementation.
The median runtime of the SecureRandom test was approximately 61 seconds, whereas the corresponding time for the Random test was approximately 18 seconds.
The chart below plots, for both classes, the increase in false positive rate that occurs as more elements are added to the filter. The plots are similar:

The source code used to perform these tests, and the resulting data used to compile these charts is available.
The answers appear to be:
So it is plausible that using the Random class may prove an effective way of extending the standard Object.hashCode() values to meet the requirements of probabilistic data structures such as Bloom filters.
Compact approximators are a generalization of Bloom filters, see Wikipedia for introductory reading. They replace the boolean values that compose a Bloom filter with the elements of a lattice. Lattices are algebraic structures that occur frequently in mathematics and in computing.
In the informal vocabulary of a programmer, the bit values 0 and 1 form a lattice because you can and them to get the lesser of two values and or them to get the greater of two values. Sets provide another common example of lattice algebra in programming. In this case ‘anding’ two sets is to form their intersection, and ‘oring’ them is to form their union.
In practice, a compact approximator operates something like a map, except that you can’t iterate over the keys and the values must be the elements of a lattice. As with the Bloom filter, it has the advantage that its storage size can strictly bounded. And like a Bloom filter, it’s a probabilistic data structure:
Bloom filters provide a guarantee that, though there may be false positives, there will never be a false negative. This is to say, a Bloom filter may ‘erroneously’ assert it contains an element that was never added, but it will never assert that it doesn’t contain an element that was. How does this property carry over to compact approximators?
Due to their structure, lattices impose a ‘partial ordering’ on their elements. The nature of this ordering depends wholly on the lattice structure. In the case of a lattice generated by the bits 1 and 0, its simply the case that 1 is greater than 0. Or in the case of a lattice generated over the subsets of a set: one element is larger than another if it is a superset of that other element. A compact approximator guarantees that though it may return a larger value than the one you stored for a given key, it will never return a smaller value.
What does this mean in practice?
Suppose you have distributed hash table over modestly sized cluster of machines. The hash table has redundancy, so that key values may be available from more than one machine in the cluster. Assuming, not unreasonably, that identifying which machine can respond with a key’s value is a relatively expensive operation that may involve multiple network requests, it would be useful to cache which machines have values for which keys. A compact approximator could help here.
We create a compact approximator (CA) whose potential keys range over those of the distributed hash table (remember, the CA doesn’t actually persist the keys) and whose values are taken from a lattice over the set of all machine addresses in the cluster (ie. a value is a set of machine addresses). Every time a key value lookup is successfully performed on a machine, its address is added to the CA under that key. Retrieving a key from the CA will return all the machines which have successfully returned values for the key - though it may return a few more. This data can be used as a first best effort to locate an appropriate machine, if the machine doesn’t actually have the data, a full lookup can be performed instead. The expectation is that you will have sufficiently few wrong results from the CA that the time saved by avoiding lookups will dominate the time wasted in misdirected requests.
This is just one simple scenario, there are many; compact approximators were first identified as a way to accelerate unicode text searches. There are a number of algorithms that I’m looking to apply them to so I’ve added an implementation to a Java library for compact data representations that I’m putting together. It is an early, but fully featured implementation.
Across various software libraries I’ve produced, I have a small number of ad-hoc Bloom filter implementations. What started out as a small task to rationalize them into a single more general implementation has turned out to be trickier than I expected.
Already having a target interface for the BloomFilter implementation, and though equipped with my powerful BitVector class, it turns out that the real barrier to producing a general Bloom filter implementation is producing a useful API that generalizes Object.hashCode() efficiently.
This task, one I expected to be simple, has bogged me down, and I still don’t think much of my API: there are many infidelities. One of the core requirements is that the API needs to permit hash values to be accessed quickly (say passing an Object in and getting an int back) because hashing is often performance critical. Conversely, some hashes (say cryptographic hashes like SHA-1) need to return values larger than any Java primitive. You can’t accommodate these into a single method signature. In addition, many useful data structures require the generation of multiple hashes over a specific numeric range (Bloom filters are not the only example, Cuckoo hashes provide another), even though the common case will probably remain creation of single hash values.
It’s early days, but the hashing package source-code is available.
At present it defines a core Hash interface that can generate individual hash values for objects. This is extended by MultiHash which can generate a HashList to return multiple hash values. Each Hash declares a HashRange which is the range that the hash values it generates may span. To facilitate the implementation of general hashing algorithms, a HashSource interface provides a standard way to decompose Objects into byte streams. Finally a Hashing class provides static utility methods.
The source code for the Bloom filter implementation is also available
Prior to implementing a BloomFilter along the lines of the interface described in the Guava project issue tracker, I’ve reviewed my sprawling personal collection of Java libraries to identify the all of the ad-hoc bit-processing related code I’ve written with a view to unifying it all into something more reusable.
To my surprise, much of the code didn’t actually deal with tricky bitwise processing, but with the practicalities of efficiently moving bit data between different (bytes, ints, longs, byte arrays, IO streams, Strings, BigIntegers etc). All of those bits have to come from somewhere I guess.
Of course, there was plenty of bit twiddling, but another obvious sticking point was supporting competing requirements for mutability and immutability, as is common when one is seeking to design APIs that are both robust and efficient.
So I’ve spent a bit of time combining everything I need for various binary processing tasks (bitwise operations, (im)mutability control, conversion to standard Java types) and the result is a single stand-alone class called BitVector.
The source code for the class is available under the Apache 2.0 licence from:
The class is mostly complete for my needs, though I haven’t yet begun to employ it in any of my projects.
For a better idea of what the class provides, the class Javadoc is included below.
BitVector stores fixed-length bit sequences of arbitrary length and provides a number of bit-wise operations, and methods for exposing the bit sequences as established java types.
In keeping with Java standards, bits are operated-on and exposed as big-endian. This means that, where bit sequences are input/output from methods, the least-significant bit is always on the right and the most significant bit is on the left. So, for example, the toString() method will contain the most significant bit in the character at index 0 in the string. Naturally, In the cases where this class is used without externalizing the bit representation, this distinction is irrelevant.
A consequence of this is that, in methods that are defined over ranges of bits, the from and to parameters define the rightmost and leftmost indices respectively. As per Java conventions, all from parameters are inclusive and all to parameters are exclusive.
Instances of this class may be aligned (see isAligned() and aligned(). Many operations will execute more efficiently on aligned instances. Instances may also be immutable (see isMutable(), mutable() and immutable()). In addition to a number of copying operations, the class also supports views; views create new BitVector instances that are backed by the same bit data, this allows applications to expose mutable BitVector instances ‘safely’ via immutable views.
The class extends Number which allows it to be treated as an extended length numeric type (albeit, one that doesn’t support any arithmetic operations). In addition to the regular number value methods on the interface, an bigIntValue() method is available that returns the BitVector as a positive, arbitrarily sized integer. Combined with the fromBigInteger() method, this allows, with some loss of performance, a range of arithmetic calculations to be performed.
The class also implements Iterable and provides methods to obtain ListIterator as one would for a List, but stops short of implementing the List interface. This allows a BitVector to be used with a range of Java language constructs and standard library classes.
The class is Serializable and Cloneable too (clones are shallow and essentially behave as a view of the original instance). For instances where more control over serialization is needed, read(InputStream) and write(OutputStream) methods are available, though better performance may result from calling toByteArray() and managing the writing outside this class.
In addition to all of the above methods outlined above, a full raft of bitwise operations are available, including:
Most such methods are available in both an operation specific version and operation parameterized version to cater for different application needs.
Performance should be adequate for most uses. There is scope to improve the performance of many methods, but none of the methods operate in anything more than linear time and inner loops are mostly ‘tight’. An implementation detail (which may be important on some platforms) is that, with some exceptions, none of the methods perform any object creation. The exceptions are: methods that require an object to be returned, floatValue(), doubleValue(), and situations where an operation is applied with overlapping ranges of bits, in which case it may be necessary to create a temporary copy of the BitVector.
The class is marked as final to ensure that immutable instances can be safely used in security sensitive code (eg. within a PrivilegedAction).