Latest Tweets

 

Double hashing comparison

After my previous post comparing Java’s regular random number generator to its cryptographic SHA1PNG relation, Benjamin Manes suggested using double-hashing has a way of generating multiple and provided an implementation for me to test.

The initial results appear to be excellent. It yielded by far the shortest test time (approx. 4s compared to 18s for the standard Java random number generator) and in this limited test exhibited a distribution that is almost indistinguishable from that of SHA1PNG.

Chart of Bloom filter false-positive rates

Multi-hashing effectiveness

Bloom filters are very easy to implement. Unfortunately, there doesn’t seem to be very much information on the web that describes how actually generate all those independent hash functions you’ll need, especially over arbitrarily typed key values.

The Problem

A standard Java hash value is an int. If you need to generate say 10 hash values over a range of 1 million indices an int just wont do - you just don’t have enough bits (by my mental arithmetic you’d need about 200bits). But short of requiring that any developer who wants to use a Bloom filter implements their own good quality multi-hash implementation, the int returned by Object.hashCode() is all we have.

A Possible Solution

My approach in the past has been to use a SecureRandom instance seeded with the object hash code (or byte data from the object if its available), making repeated calls to nextInt(int) to obtain as many hash values as I need. This is almost the best you can do to create a multiplicity of well-distributed hash values within a specified range.

It is nevertheless slow, SecureRandom objects are expensive to create and use. An obvious alternative is to consider using a regular Random object, the questions are:

  1. Does it impair the effectiveness of the Bloom filter (by significantly increasing the false positive rate)?
  2. How much faster is it than using a SecureRandom?

Results

This is based on a small test application I wrote that inserted increasing numbers of elements into a Bloom filter while estimating the false-positive rate at each step. Up to 10,000 integers were stored using 4 hashes in a basic Bloom filter containing approximately 57,000 bits. The test was repeated using both a regular Random implementation and a SecureRandom “SHA1PRNG” implementation.

The median runtime of the SecureRandom test was approximately 61 seconds, whereas the corresponding time for the Random test was approximately 18 seconds.

The chart below plots, for both classes, the increase in false positive rate that occurs as more elements are added to the filter. The plots are similar:

Chart of Bloom filter false positive rates

The source code used to perform these tests, and the resulting data used to compile these charts is available.

Conclusion

The answers appear to be:

  1. not much
  2. about 3 times faster

So it is plausible that using the Random class may prove an effective way of extending the standard Object.hashCode() values to meet the requirements of probabilistic data structures such as Bloom filters.

Bloom filters and Hashing

Across various software libraries I’ve produced, I have a small number of ad-hoc Bloom filter implementations. What started out as a small task to rationalize them into a single more general implementation has turned out to be trickier than I expected.

Already having a target interface for the BloomFilter implementation, and though equipped with my powerful BitVector class, it turns out that the real barrier to producing a general Bloom filter implementation is producing a useful API that generalizes Object.hashCode() efficiently.

This task, one I expected to be simple, has bogged me down, and I still don’t think much of my API: there are many infidelities. One of the core requirements is that the API needs to permit hash values to be accessed quickly (say passing an Object in and getting an int back) because hashing is often performance critical. Conversely, some hashes (say cryptographic hashes like SHA-1) need to return values larger than any Java primitive. You can’t accommodate these into a single method signature. In addition, many useful data structures require the generation of multiple hashes over a specific numeric range (Bloom filters are not the only example, Cuckoo hashes provide another), even though the common case will probably remain creation of single hash values.

It’s early days, but the hashing package source-code is available.

At present it defines a core Hash interface that can generate individual hash values for objects. This is extended by MultiHash which can generate a HashList to return multiple hash values. Each Hash declares a HashRange which is the range that the hash values it generates may span. To facilitate the implementation of general hashing algorithms, a HashSource interface provides a standard way to decompose Objects into byte streams. Finally a Hashing class provides static utility methods.

The source code for the Bloom filter implementation is also available

Minimal Perfect Hash for Strings

A minimal perfect hash is simply a bijection between a set of N objects and the first N integers. See the wikipedia page about perfect hashing for more information.

I’m looking to produce a fast and memory efficient map implementation over String keys and I want to use a perfect hash to pack values without risking collisions.

Casting around online for a Java implementation led me to Sux4J (this rather unpromising sounding library looks excellent and I’m going to take a closer look at it soon) but the implementations are too generic for my purposes.

So I’ve fashioned my own minimal perfect hashing algorithm for Java strings.I don’t know anything about the theory of such algorithms, but my approach was to build a data-structure as follows:

  1. input an array of ‘hashable’ strings
  2. order the strings by their standard Java hashcode
  3. store the hashcodes in an integer array
  4. identify spans of duplicate hashcodes
  5. if there are no duplicates then do nothing more
  6. else for each set of strings with the same hashcode
    1. build a decision tree that can distinguish the strings
    2. store a pointer to the tree for each string in the set

The gist of this approach is outlined visually in the following diagram. Here, the hashes for the strings “cat” and “dog” collide, so a small decision tree needs to be consulted in the case that either of these two are hashed to confirm which value should be returned.

There are some nuances in the actual implementation. For example, Strings are actually ordered by length and character sequence too; this assists the building of fast decision trees. Also each duplicated string stores its offset within the span to avoid scanning back to find the start of the span (I think there are better space/time trade-offs around this that I may investigate in the future.).

However, these complexities only come into play when there are collisions of the String hash function. The common case, even for fairly large numbers of strings, is that there are no collisions. This reduces the time complexity to a binary search over the integer array of hashcodes — not as fast as only taking the hashcode, but still fast. And even the exceptional case is still fairly fast since it typically needs to test candidate string’s length to determine which minimal hash value it should be assigned.

Note that the algorithm is designed to examine as little about the string as is necessary to determine the perfect hash value (this helps to make it fast). Furthermore, the algorithm does not keep a reference to the list of potential candidates for hashing (this helps to reduce memory overhead). As a consequence, the algorithm cannot necessarily distinguish valid candidates; though they are frequently identified as such, invalid string arguments may generate valid hashes.

The source code for this implementation is fully documented and is under the Apache 2.0 licence. It is available through my website’s Google code project