I’m continuing to evaluate my cluster painting algorithm, this time on a larger dataset.
This comparison with random colour assignment demonstrates the increased clarity that a good colouring provides. Without the algorithm, most of the clusters in the South West and Scotland are indistinguishable.
As I posted yesterday, I needed an algorithm that could paint clusters so as to make them distinguishable from each other. Today, I implemented a simple, “good enough” algorithm.
The algorithm is fairly general, only requiring that each cluster has:
It takes as parameters, a collection of clusters and a list of “paints” and returns a map that assigns every cluster a paint.
Here are progressive results over a small dataset using 5 colours.







Even with 200 clusters, this algorithm does a fairly good job at making all the clusters distinguishable using just 5 colours.
Ideas for a cluster colouring algorithm.
Recently, I’ve received a small flurry of enquiries about clustering algorithms, and my GVM algorithm in particular. So I’ve started investigating some algorithmic improvements and generalizations.
In doing so, I need a good way of visually checking the quality of large clusterings; simply allocating the points of each cluster a random colour results in inadequate visualizations. So I’m scouting around for a simple algorithm that can produce cluster-colourings that make it easy to distinguish clusters.
I’ve been investigating how clustering can be applied to sets of character strings in which common prefixes indicate similarity, and I came up with the following metric (probably known already, but no amount of online searching gave any useful results).
We start buy considering the set \(S\) of sequences over some alphabet \(\Sigma\). We write \(s=\{s_i\}_{i\geq0}\) with \(s_i\in\Sigma\) for \(s\in\Sigma\).
For \(s,t\in\Sigma\) with \(s\neq t\) we write \(\widehat{st}\) for the length of the longest common initial subsequence, ie. \(\widehat{st}\) is the greatest \(n\) such that \(s_k=t_k\forall{k < n}\). Then \(d:S^2\rightarrow\mathbb{R}\) defines a metric where
\[ d(s,t) = \left\{ \begin{array}{ll} \frac 1 {|\Sigma|^{\widehat{st}}} & \mbox{if } s \neq t \\ 0 & \mbox{if } s = t \end{array} \right. \]
By its definition, necessarily \(d(s,t)\geq0\) with \(d(s,t)=0\) iff \(s=t\). Furthermore, \(d(s,t)\) is symmetric as a consequence of the symmetry of \(\widehat{st}\). It only remains to demonstrate that \(d\) satisfies the triangle inequality so that, for \(s,t,u \in S\)
\[ d(s,u) \leq d(s,t) + d(t,u) \]
If any of the sequences are equal, the inequality is trivial so we only consider mutually distinct sequences. Hence \(|\Sigma| > 1\) since otherwise there would be no distinct sequences. Thus
\[ \begin{aligned} d(s,u) & \leq d(s,t) + d(t,u) \\ \frac 1 {|\Sigma|^{\widehat{su}}} & \leq \frac 1 {|\Sigma|^{\widehat{st}}} + \frac 1 {|\Sigma|^{\widehat{tu}}} \\ & \leq \frac 1 {|\Sigma|^{ \min(\widehat{st}, \widehat{tu}) }} \end{aligned} \]
so it suffices to show \(\widehat{su} \geq \min(\widehat{st},\widehat{tu})\). But this is clearly true since \(\widehat{su} < \widehat{uv} \Rightarrow \widehat{tu} < \widehat{su}\).
The metric is easily extended to finite sequences by extending \(\Sigma\) with a new null symbol \(\diamond \notin \Sigma\). Let \(\Sigma’ = \Sigma \cup \{\diamond\}\) then we can map any finite sequence \(w\) over \(\Sigma\) to a sequence \(w \in S’\) by
\[ w’_i = \left\{ \begin{array}{ll} w_i & \mbox{if } i \leq \operatorname{len}(w) \\ \diamond & \mbox{if } i > \operatorname{len}(w) \end{array} \right. \]
This map is clearly injective so we can extend \(d\) to a metric over finite sequences of \(\Sigma\).
So what properties does this metric have in the context of its practical application to clustering strings of characters?
Lots of further investigation is needed to see how useful this metric will be in practice. There are lots of avenues too explore too:
I’ve been living in hash table ignorance, always accepting the received wisdom that a load factor (the ratio of elements stored to available buckets) of around 0.75 provides a good trade-off between storage and performance but never stopping to think about what’s actually going on. For example, what happens at the limit, when the number of elements approaches the number of buckets in a chained hash table such as Java’s HashMap.
As part of my development of crinch I’ve been approaching data structures from the perspective of reducing memory overhead. So when I started writing a hash map implementation for the library, the following question naturally arose: in a hash table of capacity n, that contains n elements, how many buckets will be empty?
Assuming a hashing algorithm that distributes keys with a high degree of uniformity, there will inevitably be some collisions (implying empty buckets elsewhere) but my expectation was that there would not be not too many. I guessed that maybe 10% of the buckets might be empty, and I was hoping for fewer than that.
When I found that slightly more than a third of the buckets in my hash table implementation were empty, I wish I could say I was immediately shocked into the realization that I’d never bothered to understand the real performance characteristics of the humble hash map. Instead I wasted an hour looking for a mistake — trying different hash functions and a number of different key sets — before realizing that the results I was seeing were so consistent (always hovering tightly around 36.8% of buckets empty) that I must be missing something. Then I did something more intelligent and walked away from my keyboard, picked up a pen and reasoned through the problem. It turns out it’s simple.
Let’s fix our attention on one bucket (it doesn’t matter which they’re identical under the assumption of a uniform hash). What’s the probability that the first key hashes into that bucket? Assuming a uniform hash it’s \(\frac1n\), and the probability it doesn’t is \(1-\frac1n\). This probability is constant for each key we add and furthermore each outcome is independent, so the probability \(P_n\) that none of the \(n\) elements hash to our bucket is given by
\[P_n = \left(1 - \frac {1} {n}\right)^n = \left(\frac {n-1} {n}\right)^n\]
What happens as \(n\) grows large? If limit abuse upsets you, look away now.
\[\begin{aligned} \lim_{n\to\infty} P_n = & \lim_{n\to\infty} \left(\frac {n-1} {n}\right)^n \\ = & \lim_{n\to\infty} \left(\frac {n} {n-1}\right)^{-n} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’+1} {n’}\right)^{-(n’+1)} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’} {n’+1}\right) / \left(1 + \frac {1} {n’}\right)^{n’} \\ = & 1/e \\ \approx & 0.3678 \end{aligned}\]
What this result tells us that if we take a reasonably capacious hash table that chains collisions and which is equipped with an ideal hash function that uniformly distributes all keys, and we add as many records as there are buckets, it’s almost certain that more than a third of the buckets are completely empty. All the years I’ve been stuffing objects into Java HashMaps and I never knew.
I don’t have my volumes of Knuth to hand; I’m sure all this is covered and more and that I really should have known this years ago, but at least it was an enjoyable excursion of ignorance.
I did my first profiling of Crinch’s hashing code this evening at it’s been interesting. Guava’s new hash package gives me something against which I can benchmark.
My methodology, briefly, was to implement the murmur3 32 bit hash then use the hash on a million randomly populated POJOs that look like this:
private static class Pojo {
String strVal;
int intVal;
long longVal;
boolean boolVal;
}
Hashing them all with Crinch using:
private static class PojoSource implements HashSource<Pojo> {
public void sourceData(Pojo pojo, WriteStream out) {
out.writeChars(pojo.strVal);
out.writeInt(pojo.intVal);
out.writeLong(pojo.longVal);
out.writeBoolean(pojo.boolVal);
}
}
And hashing them all with Guava using:
private static class PojoFunnel implements Funnel<Pojo> {
@Override
public void funnel(Pojo pojo, Sink into) {
into.putString(pojo.strVal)
.putInt(pojo.intVal)
.putLong(pojo.longVal)
.putBoolean(pojo.boolVal);
}
}`
And storing the hashes in a million element int[], recording the median number of milliseconds required for both Crinch and Guava over 10 repetitions. Yes, this is a very rough first investigation, but you have to start somewhere. Code for the test case is available at:
The variance between repetitions was very low and on my machine (running OpenJDK 6) the times were consistently:
This is a much larger disparity than I anticipated. So I attached a profiler and used method sampling to profile the call tree (instrumentation would have provided more complete information but may have skewed the timings). It appears that much (though not all) of the disparity is due to the time spent in Crinch’s WriteStream.writeChars() method compared to Guava’s equivalent Sink.putString() method.
Here’s a barely legible screenshot of the call tree breakdown:

Both methods take a single CharSequence, iterate over its characters and hash them individually, but Crinch has an optimized path for the common case where a String is passed to the method; Guava does not. There is actually further scope for optimizing Crinch here too that I may investigate.
It’s too early to draw any conclusions about the relative overhead of hashing in Guava compared to that of Crinch. I’ll want to revisit the comparison when I’ve found time to build a patched version of Guava containing some appropriate optimizations.
Guava recently introduced a new hashing package and I’ve spent the evening benchmarking my hash library implementation against Guava’s. I’m interested in whether I can abandon my com.tomgibara.crinch.hashing package for com.google.guava.hash.
It’s been fascinating to closely inspect an alternative abstraction of the same (fairly fundamental) concept; especially since I’ve never really liked my own.
Unsurprisingly, though class names obviously differ (Guava’s are far better in my opinion — more consistent and intelligible), there are a number of similarities, but I’ve chosen to highlight some of the differences below, with my initial observations on them.
The intention is absolutely not to knock Guava or to say whether it’s better or worse than my own library — they operate in quite different contexts — it’s simply to explore the trade-offs made in their design (as I see them).
Here are my initial observations:
HashFunctions) always as an operation on a byte array. This is something I consciously avoided in Crinch. The equivalent class (Hash) operates directly on objects to be hashed, even if objects may be collapsed to bytes internally. I felt this design choice was important because some applications may need to provide an alternative hash implementation that can’t be easily or efficiently expressed at a byte level. The Guava abstraction is cleaner but may not be as flexible.Funnel abstraction that ‘puts’ bytes into a Sink which is fluent and appears to be strongly influenced by Java’s ByteBuffer class. Conversely Crinch uses a HashSource (a truly terrible class name) to ‘write’ object bytes into a WriteStream which is loosely based on Java’s DataOutputStream class. This write/put distinction has almost no implications at a code level, but hints at different mental models. One difference that does affect the relative usability of the APIs is that Sink is fluent (as are many of Guava’s interfaces) whereas WriteStream is not. I always hesitate when faced with the decision to employ a fluent interface that may be used in performance critical sections of code; I always worry that there may be a performance hit - even though I have no evidence that this is the case; it may be irrationality on my part.Funnel with a HashFunction. This means that, at least in little coding I’ve so far done with Guava’s hash package, I frequently have to push both objects around myself. It’s a small niggle, but one that Crinch avoids.HashCode interface that gives access to the hash as an int, long or byte[]. Crinch on the other hand uses Java’s BigInteger class for the same purpose and I think this benefits the developer in terms of familiarity, definiteness and usefulness. Obviously Guava’s use of an interface here provides for important performance benefits when implementing specific HashFunctions, but this is not such an issue in Crinch because intermediate object creation can generally be avoided (see next comment).hash() method that returns a HashCode object, but Crinch has three methods hashAsInt(), hashAsLong() and hashAsBigInt(). Intermediate object creation can be avoided at the expense of a more ugly API.mod with more bits. But whether this would ever be a practical matter for developers using the Guava libraries is debatable.Hash abstraction for generating multiple hash values for a single object called MultiHash. Guava performs a similar task within its BloomFilter implementation but entirely hides the implementation. I think this may turn out to be unfortunate because there are several different data structures that use multiple hashes (eg. cuckoo hash tables), and it seems artificial to accumulate them in the hash package without exposing a core abstraction on which they depend.HashFunctions in the form of AbstractNonStreamingHashFunction and AbstractStreamingHashFunction the use of a ByteBuffer in the latter is a particularly good fit. Crinch provides nothing half as useful, though it should.On measure, I think Guava certainly provides a cleaner API, whether it’s more effective from the perspective of a developer writing an application is less certain. The important question for my applications is: does the cleaner API come at an unavoidable cost in performance?
I don’t know yet and I’m not implying it does; it could take a long while to find out.
This post turned out longer than I anticipated so finishing the preliminary benchmarking will have to wait until tomorrow.
…say to cluster points in a map based on the zoom?
(information about the GVM Java library)
I was just asked this question by email, and thought I’d respond via a blog post in case the same query recurs. This short answer is: I certainly hope so, it’s exactly the kind of task the library was designed for, but there are some caveats:
I’ve not personally used the library on Android - there could be some minor incompatibilities due mismatches between the standard Java APIs and Android (though I don’t anticipate any). Should such an issue arise, I’m sure it could be patched very quickly.
As standard, I would typically use the implementation in the com.tomgibara.cluster.gvm.dbl package that uses double coordinates. But since double arithmetic may perform really poorly on handsets (I don’t know) it may be necessary to use alternative coordinate types for better performance (the float based com.tomgibara.cluster.gvm.flt for example). Note that whatever primitive coordinate-type you use, it will not avoid all double arithmetic.
How many points do you have device side? Making point clustering worthwhile when panning and zooming freely around a map implies to me that there are lots of points. A natural question is how do these points find their way into the application in the first place? Downloading 1M points in response to a query and then clustering them into 30 groups for display is probably not a good use of the device’s resources. Of course GVM could be used server side too, to provide a quick result to the user while lots more results streamed in for clustering in near-real-time.
Finally, I’d add the comment that GVM is very well suited to exactly this sort of application because of the flexibility that it provides around cluster keys: each time a cluster is enlarged or merged, the application is given the opportunity to modify the key - this means that the cluster key can be used to maintain all the salient state for the application (and no more, so performance can remain good).
In this online demonstration the cluster is keyed against the most populous city in the cluster but this could easily have been something else entirely.
A screenshot of heap usage under Crinch.
A single thread is streaming and decompressing about 7M records a second. Each record has 10 columns of strings and integers.
30B values read and only 20MB of garbage generated. Just 3ms per minute spent collecting garbage. Even on a laptop throughput is excellent.
It’s high time I made my open source Bloom filter implementation properly available by taking time out to introduce it properly.
Firstly, to get a copy of it, you’ll currently need to check it out of my Google code project and build it using maven:
svn checkout http://tomgibara.googlecode.com/svn/trunk/crinch
cd crinch
mvn install
It’s a multi-module build and you’ll need the following jars to make it work:
In the future I’ll probably produce a crinch-all.jar that conveniently aggregates these smaller jars.
All the Bloom filter related code is in the com.tomgibara.crinch.collections package, but here’s some code to get you off to a good start.