Latest Tweets

 

Multi-hashing effectiveness

Bloom filters are very easy to implement. Unfortunately, there doesn’t seem to be very much information on the web that describes how actually generate all those independent hash functions you’ll need, especially over arbitrarily typed key values.

The Problem

A standard Java hash value is an int. If you need to generate say 10 hash values over a range of 1 million indices an int just wont do - you just don’t have enough bits (by my mental arithmetic you’d need about 200bits). But short of requiring that any developer who wants to use a Bloom filter implements their own good quality multi-hash implementation, the int returned by Object.hashCode() is all we have.

A Possible Solution

My approach in the past has been to use a SecureRandom instance seeded with the object hash code (or byte data from the object if its available), making repeated calls to nextInt(int) to obtain as many hash values as I need. This is almost the best you can do to create a multiplicity of well-distributed hash values within a specified range.

It is nevertheless slow, SecureRandom objects are expensive to create and use. An obvious alternative is to consider using a regular Random object, the questions are:

  1. Does it impair the effectiveness of the Bloom filter (by significantly increasing the false positive rate)?
  2. How much faster is it than using a SecureRandom?

Results

This is based on a small test application I wrote that inserted increasing numbers of elements into a Bloom filter while estimating the false-positive rate at each step. Up to 10,000 integers were stored using 4 hashes in a basic Bloom filter containing approximately 57,000 bits. The test was repeated using both a regular Random implementation and a SecureRandom “SHA1PRNG” implementation.

The median runtime of the SecureRandom test was approximately 61 seconds, whereas the corresponding time for the Random test was approximately 18 seconds.

The chart below plots, for both classes, the increase in false positive rate that occurs as more elements are added to the filter. The plots are similar:

Chart of Bloom filter false positive rates

The source code used to perform these tests, and the resulting data used to compile these charts is available.

Conclusion

The answers appear to be:

  1. not much
  2. about 3 times faster

So it is plausible that using the Random class may prove an effective way of extending the standard Object.hashCode() values to meet the requirements of probabilistic data structures such as Bloom filters.

blog comments powered by Disqus