I did my first profiling of Crinch’s hashing code this evening at it’s been interesting. Guava’s new hash package gives me something against which I can benchmark.
My methodology, briefly, was to implement the murmur3 32 bit hash then use the hash on a million randomly populated POJOs that look like this:
private static class Pojo {
String strVal;
int intVal;
long longVal;
boolean boolVal;
}
Hashing them all with Crinch using:
private static class PojoSource implements HashSource<Pojo> {
public void sourceData(Pojo pojo, WriteStream out) {
out.writeChars(pojo.strVal);
out.writeInt(pojo.intVal);
out.writeLong(pojo.longVal);
out.writeBoolean(pojo.boolVal);
}
}
And hashing them all with Guava using:
private static class PojoFunnel implements Funnel<Pojo> {
@Override
public void funnel(Pojo pojo, Sink into) {
into.putString(pojo.strVal)
.putInt(pojo.intVal)
.putLong(pojo.longVal)
.putBoolean(pojo.boolVal);
}
}`
And storing the hashes in a million element int[], recording the median number of milliseconds required for both Crinch and Guava over 10 repetitions. Yes, this is a very rough first investigation, but you have to start somewhere. Code for the test case is available at:
The variance between repetitions was very low and on my machine (running OpenJDK 6) the times were consistently:
This is a much larger disparity than I anticipated. So I attached a profiler and used method sampling to profile the call tree (instrumentation would have provided more complete information but may have skewed the timings). It appears that much (though not all) of the disparity is due to the time spent in Crinch’s WriteStream.writeChars() method compared to Guava’s equivalent Sink.putString() method.
Here’s a barely legible screenshot of the call tree breakdown:

Both methods take a single CharSequence, iterate over its characters and hash them individually, but Crinch has an optimized path for the common case where a String is passed to the method; Guava does not. There is actually further scope for optimizing Crinch here too that I may investigate.
It’s too early to draw any conclusions about the relative overhead of hashing in Guava compared to that of Crinch. I’ll want to revisit the comparison when I’ve found time to build a patched version of Guava containing some appropriate optimizations.
Guava recently introduced a new hashing package and I’ve spent the evening benchmarking my hash library implementation against Guava’s. I’m interested in whether I can abandon my com.tomgibara.crinch.hashing package for com.google.guava.hash.
It’s been fascinating to closely inspect an alternative abstraction of the same (fairly fundamental) concept; especially since I’ve never really liked my own.
Unsurprisingly, though class names obviously differ (Guava’s are far better in my opinion — more consistent and intelligible), there are a number of similarities, but I’ve chosen to highlight some of the differences below, with my initial observations on them.
The intention is absolutely not to knock Guava or to say whether it’s better or worse than my own library — they operate in quite different contexts — it’s simply to explore the trade-offs made in their design (as I see them).
Here are my initial observations:
HashFunctions) always as an operation on a byte array. This is something I consciously avoided in Crinch. The equivalent class (Hash) operates directly on objects to be hashed, even if objects may be collapsed to bytes internally. I felt this design choice was important because some applications may need to provide an alternative hash implementation that can’t be easily or efficiently expressed at a byte level. The Guava abstraction is cleaner but may not be as flexible.Funnel abstraction that ‘puts’ bytes into a Sink which is fluent and appears to be strongly influenced by Java’s ByteBuffer class. Conversely Crinch uses a HashSource (a truly terrible class name) to ‘write’ object bytes into a WriteStream which is loosely based on Java’s DataOutputStream class. This write/put distinction has almost no implications at a code level, but hints at different mental models. One difference that does affect the relative usability of the APIs is that Sink is fluent (as are many of Guava’s interfaces) whereas WriteStream is not. I always hesitate when faced with the decision to employ a fluent interface that may be used in performance critical sections of code; I always worry that there may be a performance hit - even though I have no evidence that this is the case; it may be irrationality on my part.Funnel with a HashFunction. This means that, at least in little coding I’ve so far done with Guava’s hash package, I frequently have to push both objects around myself. It’s a small niggle, but one that Crinch avoids.HashCode interface that gives access to the hash as an int, long or byte[]. Crinch on the other hand uses Java’s BigInteger class for the same purpose and I think this benefits the developer in terms of familiarity, definiteness and usefulness. Obviously Guava’s use of an interface here provides for important performance benefits when implementing specific HashFunctions, but this is not such an issue in Crinch because intermediate object creation can generally be avoided (see next comment).hash() method that returns a HashCode object, but Crinch has three methods hashAsInt(), hashAsLong() and hashAsBigInt(). Intermediate object creation can be avoided at the expense of a more ugly API.mod with more bits. But whether this would ever be a practical matter for developers using the Guava libraries is debatable.Hash abstraction for generating multiple hash values for a single object called MultiHash. Guava performs a similar task within its BloomFilter implementation but entirely hides the implementation. I think this may turn out to be unfortunate because there are several different data structures that use multiple hashes (eg. cuckoo hash tables), and it seems artificial to accumulate them in the hash package without exposing a core abstraction on which they depend.HashFunctions in the form of AbstractNonStreamingHashFunction and AbstractStreamingHashFunction the use of a ByteBuffer in the latter is a particularly good fit. Crinch provides nothing half as useful, though it should.On measure, I think Guava certainly provides a cleaner API, whether it’s more effective from the perspective of a developer writing an application is less certain. The important question for my applications is: does the cleaner API come at an unavoidable cost in performance?
I don’t know yet and I’m not implying it does; it could take a long while to find out.
This post turned out longer than I anticipated so finishing the preliminary benchmarking will have to wait until tomorrow.
Across various software libraries I’ve produced, I have a small number of ad-hoc Bloom filter implementations. What started out as a small task to rationalize them into a single more general implementation has turned out to be trickier than I expected.
Already having a target interface for the BloomFilter implementation, and though equipped with my powerful BitVector class, it turns out that the real barrier to producing a general Bloom filter implementation is producing a useful API that generalizes Object.hashCode() efficiently.
This task, one I expected to be simple, has bogged me down, and I still don’t think much of my API: there are many infidelities. One of the core requirements is that the API needs to permit hash values to be accessed quickly (say passing an Object in and getting an int back) because hashing is often performance critical. Conversely, some hashes (say cryptographic hashes like SHA-1) need to return values larger than any Java primitive. You can’t accommodate these into a single method signature. In addition, many useful data structures require the generation of multiple hashes over a specific numeric range (Bloom filters are not the only example, Cuckoo hashes provide another), even though the common case will probably remain creation of single hash values.
It’s early days, but the hashing package source-code is available.
At present it defines a core Hash interface that can generate individual hash values for objects. This is extended by MultiHash which can generate a HashList to return multiple hash values. Each Hash declares a HashRange which is the range that the hash values it generates may span. To facilitate the implementation of general hashing algorithms, a HashSource interface provides a standard way to decompose Objects into byte streams. Finally a Hashing class provides static utility methods.
The source code for the Bloom filter implementation is also available