Latest Tweets

 

I’m continuing to evaluate my cluster painting algorithm, this time on a larger dataset.

This comparison with random colour assignment demonstrates the increased clarity that a good colouring provides. Without the algorithm, most of the clusters in the South West and Scotland are indistinguishable.

An algorithm for painting clusters

As I posted yesterday, I needed an algorithm that could paint clusters so as to make them distinguishable from each other. Today, I implemented a simple, “good enough” algorithm.

Inputs and Outputs

The algorithm is fairly general, only requiring that each cluster has:

  • has a size — the number of points it contains
  • has a radius — gives a general indication of the cluster’s spatial extent
  • has a centre — some point indicating the position of the entire cluster

It takes as parameters, a collection of clusters and a list of “paints” and returns a map that assigns every cluster a paint.

Coloring Algorithm

  • Order the clusters by the number of points they contain (most populous first)
  • For each cluster:
    • compute it’s distance from every other cluster (subtracting their combined radii from the distance between their centres)
  • Order these radius-adjusted distances (nearest first)
  • Iterate over all the clusters again, for each one:
    • if there is a paint that is not used by any of the other clusters, assign that to the cluster
    • Otherwise assign the most distant paint ie. the last paint used by any other cluster (distance order as above)

Early Results

Here are progressive results over a small dataset using 5 colours.

Even with 200 clusters, this algorithm does a fairly good job at making all the clusters distinguishable using just 5 colours.

Ideas for a cluster colouring algorithm.

Recently, I’ve received a small flurry of enquiries about clustering algorithms, and my GVM algorithm in particular. So I’ve started investigating some algorithmic improvements and generalizations.

In doing so, I need a good way of visually checking the quality of large clusterings; simply allocating the points of each cluster a random colour results in inadequate visualizations. So I’m scouting around for a simple algorithm that can produce cluster-colourings that make it easy to distinguish clusters.

Ideas for a cluster colouring algorithm.

Recently, I’ve received a small flurry of enquiries about clustering algorithms, and my GVM algorithm in particular. So I’ve started investigating some algorithmic improvements and generalizations.

In doing so, I need a good way of visually checking the quality of large clusterings; simply allocating the points of each cluster a random colour results in inadequate visualizations. So I’m scouting around for a simple algorithm that can produce cluster-colourings that make it easy to distinguish clusters.

A prefix metric on Strings

I’ve been investigating how clustering can be applied to sets of character strings in which common prefixes indicate similarity, and I came up with the following metric (probably known already, but no amount of online searching gave any useful results).

A metric over infinite sequences

We start buy considering the set \(S\) of sequences over some alphabet \(\Sigma\). We write \(s=\{s_i\}_{i\geq0}\) with \(s_i\in\Sigma\) for \(s\in\Sigma\).

For \(s,t\in\Sigma\) with \(s\neq t\) we write \(\widehat{st}\) for the length of the longest common initial subsequence, ie. \(\widehat{st}\) is the greatest \(n\) such that \(s_k=t_k\forall{k < n}\). Then \(d:S^2\rightarrow\mathbb{R}\) defines a metric where

\[ d(s,t) = \left\{ \begin{array}{ll} \frac 1 {|\Sigma|^{\widehat{st}}} & \mbox{if } s \neq t \\ 0 & \mbox{if } s = t \end{array} \right. \]

By its definition, necessarily \(d(s,t)\geq0\) with \(d(s,t)=0\) iff \(s=t\). Furthermore, \(d(s,t)\) is symmetric as a consequence of the symmetry of \(\widehat{st}\). It only remains to demonstrate that \(d\) satisfies the triangle inequality so that, for \(s,t,u \in S\)

\[ d(s,u) \leq d(s,t) + d(t,u) \]

If any of the sequences are equal, the inequality is trivial so we only consider mutually distinct sequences. Hence \(|\Sigma| > 1\) since otherwise there would be no distinct sequences. Thus

\[ \begin{aligned} d(s,u) & \leq d(s,t) + d(t,u) \\ \frac 1 {|\Sigma|^{\widehat{su}}} & \leq \frac 1 {|\Sigma|^{\widehat{st}}} + \frac 1 {|\Sigma|^{\widehat{tu}}} \\ & \leq \frac 1 {|\Sigma|^{ \min(\widehat{st}, \widehat{tu}) }} \end{aligned} \]

so it suffices to show \(\widehat{su} \geq \min(\widehat{st},\widehat{tu})\). But this is clearly true since \(\widehat{su} < \widehat{uv} \Rightarrow \widehat{tu} < \widehat{su}\).

Extending it to finite sequences

The metric is easily extended to finite sequences by extending \(\Sigma\) with a new null symbol \(\diamond \notin \Sigma\). Let \(\Sigma’ = \Sigma \cup \{\diamond\}\) then we can map any finite sequence \(w\) over \(\Sigma\) to a sequence \(w \in S’\) by

\[ w’_i = \left\{ \begin{array}{ll} w_i & \mbox{if } i \leq \operatorname{len}(w) \\ \diamond & \mbox{if } i > \operatorname{len}(w) \end{array} \right. \]

This map is clearly injective so we can extend \(d\) to a metric over finite sequences of \(\Sigma\).

Practical properties of the metric

So what properties does this metric have in the context of its practical application to clustering strings of characters?

  • The distance between two strings is always less than or equal to one.
  • The distance between from the empty string to any non-empty string is always one.
  • The distance between strings is inversely proportional to the length of their common prefix.
  • This distance is normalized by size of the character set, so for example, the distance between two string of 8 bit ASCII characters approximates (from above) their distance when represented as a two binary sequences.

Further work

Lots of further investigation is needed to see how useful this metric will be in practice. There are lots of avenues too explore too:

  • Can we create structures from this metric that can increase its applicability to existing algorithms?
  • Can we incorporate any additional knowledge we may have of ‘character similarity’ to increase its precision?
  • What is the best way of numerically representing the computed metric values?

Living in Hash Table Ignorance

I’ve been living in hash table ignorance, always accepting the received wisdom that a load factor (the ratio of elements stored to available buckets) of around 0.75 provides a good trade-off between storage and performance but never stopping to think about what’s actually going on. For example, what happens at the limit, when the number of elements approaches the number of buckets in a chained hash table such as Java’s HashMap.

As part of my development of crinch I’ve been approaching data structures from the perspective of reducing memory overhead. So when I started writing a hash map implementation for the library, the following question naturally arose: in a hash table of capacity n, that contains n elements, how many buckets will be empty?

Assuming a hashing algorithm that distributes keys with a high degree of uniformity, there will inevitably be some collisions (implying empty buckets elsewhere) but my expectation was that there would not be not too many. I guessed that maybe 10% of the buckets might be empty, and I was hoping for fewer than that.

When I found that slightly more than a third of the buckets in my hash table implementation were empty, I wish I could say I was immediately shocked into the realization that I’d never bothered to understand the real performance characteristics of the humble hash map. Instead I wasted an hour looking for a mistake — trying different hash functions and a number of different key sets — before realizing that the results I was seeing were so consistent (always hovering tightly around 36.8% of buckets empty) that I must be missing something. Then I did something more intelligent and walked away from my keyboard, picked up a pen and reasoned through the problem. It turns out it’s simple.

Let’s fix our attention on one bucket (it doesn’t matter which they’re identical under the assumption of a uniform hash). What’s the probability that the first key hashes into that bucket? Assuming a uniform hash it’s \(\frac1n\), and the probability it doesn’t is \(1-\frac1n\). This probability is constant for each key we add and furthermore each outcome is independent, so the probability \(P_n\) that none of the \(n\) elements hash to our bucket is given by

\[P_n = \left(1 - \frac {1} {n}\right)^n = \left(\frac {n-1} {n}\right)^n\]

What happens as \(n\) grows large? If limit abuse upsets you, look away now.

\[\begin{aligned} \lim_{n\to\infty} P_n = & \lim_{n\to\infty} \left(\frac {n-1} {n}\right)^n \\ = & \lim_{n\to\infty} \left(\frac {n} {n-1}\right)^{-n} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’+1} {n’}\right)^{-(n’+1)} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’} {n’+1}\right) / \left(1 + \frac {1} {n’}\right)^{n’} \\ = & 1/e \\ \approx & 0.3678 \end{aligned}\]

What this result tells us that if we take a reasonably capacious hash table that chains collisions and which is equipped with an ideal hash function that uniformly distributes all keys, and we add as many records as there are buckets, it’s almost certain that more than a third of the buckets are completely empty. All the years I’ve been stuffing objects into Java HashMaps and I never knew.

I don’t have my volumes of Knuth to hand; I’m sure all this is covered and more and that I really should have known this years ago, but at least it was an enjoyable excursion of ignorance.

Benchmarking Crinch hashing against Guava

I did my first profiling of Crinch’s hashing code this evening at it’s been interesting. Guava’s new hash package gives me something against which I can benchmark.

Methodology

My methodology, briefly, was to implement the murmur3 32 bit hash then use the hash on a million randomly populated POJOs that look like this:

private static class Pojo {
    String strVal;
    int intVal;
    long longVal;
    boolean boolVal;
}

Hashing them all with Crinch using:

private static class PojoSource implements HashSource<Pojo> {
    public void sourceData(Pojo pojo, WriteStream out) {
        out.writeChars(pojo.strVal);
        out.writeInt(pojo.intVal);
        out.writeLong(pojo.longVal);
        out.writeBoolean(pojo.boolVal);
    }
}

And hashing them all with Guava using:

private static class PojoFunnel implements Funnel<Pojo> {
@Override
    public void funnel(Pojo pojo, Sink into) {
        into.putString(pojo.strVal)
            .putInt(pojo.intVal)
            .putLong(pojo.longVal)
            .putBoolean(pojo.boolVal);
    }
}`

And storing the hashes in a million element int[], recording the median number of milliseconds required for both Crinch and Guava over 10 repetitions. Yes, this is a very rough first investigation, but you have to start somewhere. Code for the test case is available at:

http://tomgibara.googlecode.com/svn/trunk/crinch/crinch-hashing/src/test/java/com/tomgibara/crinch/hashing/perf/MurmurHashPerfTest.java

Results

The variance between repetitions was very low and on my machine (running OpenJDK 6) the times were consistently:

  • Crinch: 228ms (for 1M pojos)
  • Guava: 612ms (for 1M pojos)

This is a much larger disparity than I anticipated. So I attached a profiler and used method sampling to profile the call tree (instrumentation would have provided more complete information but may have skewed the timings). It appears that much (though not all) of the disparity is due to the time spent in Crinch’s WriteStream.writeChars() method compared to Guava’s equivalent Sink.putString() method.

Here’s a barely legible screenshot of the call tree breakdown: Call tree

Both methods take a single CharSequence, iterate over its characters and hash them individually, but Crinch has an optimized path for the common case where a String is passed to the method; Guava does not. There is actually further scope for optimizing Crinch here too that I may investigate.

Conclusions

It’s too early to draw any conclusions about the relative overhead of hashing in Guava compared to that of Crinch. I’ll want to revisit the comparison when I’ve found time to build a patched version of Guava containing some appropriate optimizations.

Guava’s hashing abstractions compared to Crinch’s

Guava recently introduced a new hashing package and I’ve spent the evening benchmarking my hash library implementation against Guava’s. I’m interested in whether I can abandon my com.tomgibara.crinch.hashing package for com.google.guava.hash.

It’s been fascinating to closely inspect an alternative abstraction of the same (fairly fundamental) concept; especially since I’ve never really liked my own.

Unsurprisingly, though class names obviously differ (Guava’s are far better in my opinion — more consistent and intelligible), there are a number of similarities, but I’ve chosen to highlight some of the differences below, with my initial observations on them.

The intention is absolutely not to knock Guava or to say whether it’s better or worse than my own library — they operate in quite different contexts — it’s simply to explore the trade-offs made in their design (as I see them).

Here are my initial observations:

  • One fundamental difference between the approach of the libraries is that Guava sees hashing (executed by HashFunctions) always as an operation on a byte array. This is something I consciously avoided in Crinch. The equivalent class (Hash) operates directly on objects to be hashed, even if objects may be collapsed to bytes internally. I felt this design choice was important because some applications may need to provide an alternative hash implementation that can’t be easily or efficiently expressed at a byte level. The Guava abstraction is cleaner but may not be as flexible.
  • Guava converts objects into bytes using a Funnel abstraction that ‘puts’ bytes into a Sink which is fluent and appears to be strongly influenced by Java’s ByteBuffer class. Conversely Crinch uses a HashSource (a truly terrible class name) to ‘write’ object bytes into a WriteStream which is loosely based on Java’s DataOutputStream class. This write/put distinction has almost no implications at a code level, but hints at different mental models. One difference that does affect the relative usability of the APIs is that Sink is fluent (as are many of Guava’s interfaces) whereas WriteStream is not. I always hesitate when faced with the decision to employ a fluent interface that may be used in performance critical sections of code; I always worry that there may be a performance hit - even though I have no evidence that this is the case; it may be irrationality on my part.
  • Guava doesn’t appear to provide an object that combines a Funnel with a HashFunction. This means that, at least in little coding I’ve so far done with Guava’s hash package, I frequently have to push both objects around myself. It’s a small niggle, but one that Crinch avoids.
  • To accommodate hash values that are larger than a Java primitive, Guava uses a HashCode interface that gives access to the hash as an int, long or byte[]. Crinch on the other hand uses Java’s BigInteger class for the same purpose and I think this benefits the developer in terms of familiarity, definiteness and usefulness. Obviously Guava’s use of an interface here provides for important performance benefits when implementing specific HashFunctions, but this is not such an issue in Crinch because intermediate object creation can generally be avoided (see next comment).
  • Most of the Crinch code has been designed to avoid unnecessary garbage; it’s designed to handle large collections of objects and as such even small amounts of garbage build up quickly. This often works against producing simpler APIs, and here is one such case. Guava exposes a single hash() method that returns a HashCode object, but Crinch has three methods hashAsInt(), hashAsLong() and hashAsBigInt(). Intermediate object creation can be avoided at the expense of a more ugly API.
  • Guava only allows hash sizes to be specified in terms of bit size, but Crinch is more specific: hashes can be specified to lie within a range of values. This distinction can be important because many interesting data structures (such as Bloom filters and compact approximators) may rely on hash codes that lie over certain ranges and there exist specific modes of hashing that can match those ranges with little bias compared to simply using mod with more bits. But whether this would ever be a practical matter for developers using the Guava libraries is debatable.
  • Crinch exposes an extended Hash abstraction for generating multiple hash values for a single object called MultiHash. Guava performs a similar task within its BloomFilter implementation but entirely hides the implementation. I think this may turn out to be unfortunate because there are several different data structures that use multiple hashes (eg. cuckoo hash tables), and it seems artificial to accumulate them in the hash package without exposing a core abstraction on which they depend.
  • Guava provides clean and useful abstract base classes for implementing new HashFunctions in the form of AbstractNonStreamingHashFunction and AbstractStreamingHashFunction the use of a ByteBuffer in the latter is a particularly good fit. Crinch provides nothing half as useful, though it should.

On measure, I think Guava certainly provides a cleaner API, whether it’s more effective from the perspective of a developer writing an application is less certain. The important question for my applications is: does the cleaner API come at an unavoidable cost in performance?

I don’t know yet and I’m not implying it does; it could take a long while to find out.

This post turned out longer than I anticipated so finishing the preliminary benchmarking will have to wait until tomorrow.

Can the GVM library be used on Android?

…say to cluster points in a map based on the zoom?

(information about the GVM Java library)

I was just asked this question by email, and thought I’d respond via a blog post in case the same query recurs. This short answer is: I certainly hope so, it’s exactly the kind of task the library was designed for, but there are some caveats:

  1. I’ve not personally used the library on Android - there could be some minor incompatibilities due mismatches between the standard Java APIs and Android (though I don’t anticipate any). Should such an issue arise, I’m sure it could be patched very quickly.

  2. As standard, I would typically use the implementation in the com.tomgibara.cluster.gvm.dbl package that uses double coordinates. But since double arithmetic may perform really poorly on handsets (I don’t know) it may be necessary to use alternative coordinate types for better performance (the float based com.tomgibara.cluster.gvm.flt for example). Note that whatever primitive coordinate-type you use, it will not avoid all double arithmetic.

  3. How many points do you have device side? Making point clustering worthwhile when panning and zooming freely around a map implies to me that there are lots of points. A natural question is how do these points find their way into the application in the first place? Downloading 1M points in response to a query and then clustering them into 30 groups for display is probably not a good use of the device’s resources. Of course GVM could be used server side too, to provide a quick result to the user while lots more results streamed in for clustering in near-real-time.

Finally, I’d add the comment that GVM is very well suited to exactly this sort of application because of the flexibility that it provides around cluster keys: each time a cluster is enlarged or merged, the application is given the opportunity to modify the key - this means that the cluster key can be used to maintain all the salient state for the application (and no more, so performance can remain good).

In this online demonstration the cluster is keyed against the most populous city in the cluster but this could easily have been something else entirely.

A screenshot of heap usage under Crinch.

A single thread is streaming and decompressing about 7M records a second. Each record has 10 columns of strings and integers.

30B values read and only 20MB of garbage generated. Just 3ms per minute spent collecting garbage. Even on a laptop throughput is excellent.

A screenshot of heap usage under Crinch.

A single thread is streaming and decompressing about 7M records a second. Each record has 10 columns of strings and integers.

30B values read and only 20MB of garbage generated. Just 3ms per minute spent collecting garbage. Even on a laptop throughput is excellent.

How to use Crinch’s BloomFilter

It’s high time I made my open source Bloom filter implementation properly available by taking time out to introduce it properly.

Firstly, to get a copy of it, you’ll currently need to check it out of my Google code project and build it using maven:

svn checkout http://tomgibara.googlecode.com/svn/trunk/crinch
cd crinch
mvn install

It’s a multi-module build and you’ll need the following jars to make it work:

  • crinch-collections.jar
  • crinch-hashing.jar
  • crinch-bits.jar
  • crinch-math.jar
  • crinch-util.jar

In the future I’ll probably produce a crinch-all.jar that conveniently aggregates these smaller jars.

All the Bloom filter related code is in the com.tomgibara.crinch.collections package, but here’s some code to get you off to a good start.