Bloom filters are very easy to implement. Unfortunately, there doesn’t seem to be very much information on the web that describes how actually generate all those independent hash functions you’ll need, especially over arbitrarily typed key values.
A standard Java hash value is an int. If you need to generate say 10 hash values over a range of 1 million indices an int just wont do - you just don’t have enough bits (by my mental arithmetic you’d need about 200bits). But short of requiring that any developer who wants to use a Bloom filter implements their own good quality multi-hash implementation, the int returned by Object.hashCode() is all we have.
My approach in the past has been to use a SecureRandom instance seeded with the object hash code (or byte data from the object if its available), making repeated calls to nextInt(int) to obtain as many hash values as I need. This is almost the best you can do to create a multiplicity of well-distributed hash values within a specified range.
It is nevertheless slow, SecureRandom objects are expensive to create and use. An obvious alternative is to consider using a regular Random object, the questions are:
SecureRandom?This is based on a small test application I wrote that inserted increasing numbers of elements into a Bloom filter while estimating the false-positive rate at each step. Up to 10,000 integers were stored using 4 hashes in a basic Bloom filter containing approximately 57,000 bits. The test was repeated using both a regular Random implementation and a SecureRandom “SHA1PRNG” implementation.
The median runtime of the SecureRandom test was approximately 61 seconds, whereas the corresponding time for the Random test was approximately 18 seconds.
The chart below plots, for both classes, the increase in false positive rate that occurs as more elements are added to the filter. The plots are similar:

The source code used to perform these tests, and the resulting data used to compile these charts is available.
The answers appear to be:
So it is plausible that using the Random class may prove an effective way of extending the standard Object.hashCode() values to meet the requirements of probabilistic data structures such as Bloom filters.
Compact approximators are a generalization of Bloom filters, see Wikipedia for introductory reading. They replace the boolean values that compose a Bloom filter with the elements of a lattice. Lattices are algebraic structures that occur frequently in mathematics and in computing.
In the informal vocabulary of a programmer, the bit values 0 and 1 form a lattice because you can and them to get the lesser of two values and or them to get the greater of two values. Sets provide another common example of lattice algebra in programming. In this case ‘anding’ two sets is to form their intersection, and ‘oring’ them is to form their union.
In practice, a compact approximator operates something like a map, except that you can’t iterate over the keys and the values must be the elements of a lattice. As with the Bloom filter, it has the advantage that its storage size can strictly bounded. And like a Bloom filter, it’s a probabilistic data structure:
Bloom filters provide a guarantee that, though there may be false positives, there will never be a false negative. This is to say, a Bloom filter may ‘erroneously’ assert it contains an element that was never added, but it will never assert that it doesn’t contain an element that was. How does this property carry over to compact approximators?
Due to their structure, lattices impose a ‘partial ordering’ on their elements. The nature of this ordering depends wholly on the lattice structure. In the case of a lattice generated by the bits 1 and 0, its simply the case that 1 is greater than 0. Or in the case of a lattice generated over the subsets of a set: one element is larger than another if it is a superset of that other element. A compact approximator guarantees that though it may return a larger value than the one you stored for a given key, it will never return a smaller value.
What does this mean in practice?
Suppose you have distributed hash table over modestly sized cluster of machines. The hash table has redundancy, so that key values may be available from more than one machine in the cluster. Assuming, not unreasonably, that identifying which machine can respond with a key’s value is a relatively expensive operation that may involve multiple network requests, it would be useful to cache which machines have values for which keys. A compact approximator could help here.
We create a compact approximator (CA) whose potential keys range over those of the distributed hash table (remember, the CA doesn’t actually persist the keys) and whose values are taken from a lattice over the set of all machine addresses in the cluster (ie. a value is a set of machine addresses). Every time a key value lookup is successfully performed on a machine, its address is added to the CA under that key. Retrieving a key from the CA will return all the machines which have successfully returned values for the key - though it may return a few more. This data can be used as a first best effort to locate an appropriate machine, if the machine doesn’t actually have the data, a full lookup can be performed instead. The expectation is that you will have sufficiently few wrong results from the CA that the time saved by avoiding lookups will dominate the time wasted in misdirected requests.
This is just one simple scenario, there are many; compact approximators were first identified as a way to accelerate unicode text searches. There are a number of algorithms that I’m looking to apply them to so I’ve added an implementation to a Java library for compact data representations that I’m putting together. It is an early, but fully featured implementation.