Ideas for a cluster colouring algorithm.
Recently, I’ve received a small flurry of enquiries about clustering algorithms, and my GVM algorithm in particular. So I’ve started investigating some algorithmic improvements and generalizations.
In doing so, I need a good way of visually checking the quality of large clusterings; simply allocating the points of each cluster a random colour results in inadequate visualizations. So I’m scouting around for a simple algorithm that can produce cluster-colourings that make it easy to distinguish clusters.
…say to cluster points in a map based on the zoom?
(information about the GVM Java library)
I was just asked this question by email, and thought I’d respond via a blog post in case the same query recurs. This short answer is: I certainly hope so, it’s exactly the kind of task the library was designed for, but there are some caveats:
I’ve not personally used the library on Android - there could be some minor incompatibilities due mismatches between the standard Java APIs and Android (though I don’t anticipate any). Should such an issue arise, I’m sure it could be patched very quickly.
As standard, I would typically use the implementation in the com.tomgibara.cluster.gvm.dbl package that uses double coordinates. But since double arithmetic may perform really poorly on handsets (I don’t know) it may be necessary to use alternative coordinate types for better performance (the float based com.tomgibara.cluster.gvm.flt for example). Note that whatever primitive coordinate-type you use, it will not avoid all double arithmetic.
How many points do you have device side? Making point clustering worthwhile when panning and zooming freely around a map implies to me that there are lots of points. A natural question is how do these points find their way into the application in the first place? Downloading 1M points in response to a query and then clustering them into 30 groups for display is probably not a good use of the device’s resources. Of course GVM could be used server side too, to provide a quick result to the user while lots more results streamed in for clustering in near-real-time.
Finally, I’d add the comment that GVM is very well suited to exactly this sort of application because of the flexibility that it provides around cluster keys: each time a cluster is enlarged or merged, the application is given the opportunity to modify the key - this means that the cluster key can be used to maintain all the salient state for the application (and no more, so performance can remain good).
In this online demonstration the cluster is keyed against the most populous city in the cluster but this could easily have been something else entirely.
Fifty postcode clusters generated using GVM by streaming all 1.7M postcode coordinates compressed via crinch. Identifying the clusters takes just a couple of seconds.
Currently crinch will index every postcode together with its lat/lon coordinates into a 14MB file (approximately 8 bytes per record). I’m looking to do better.
I was recently asked via email for a suggestion of how to adapt my GVM algorithm to cluster angular values. The example that was given in the email makes the problem clear:
Correct clustering: { 1°, 359°, 355° }, { 85°, 89°, 96° }
Incorrect clustering: { 1°, 85°, 89°, 96° }, { 359°, 355° }
Unfortunately, directly adapting the algorithm so that it works with angular data is not possible because the underlying mathematics requires a metric space. This assumption is violated in the case of a circle because there are generally two distances between any two points.
To use GVM (or any other spatial clustering algorithm) to cluster angular data, my approach would be to take each angle, and use it to plot a point on the unit circle in two dimensions, ie. given the angle a, plot x = cos a, y = sin a, and then cluster those coordinates.

Though the centroids of the resulting clusters may not lie on the circle (the algorithm cannot be constrained to ensure this), the centroids of tight clusters will fall close to it and using arctan (Math.atan2(y,x) in Java) you will be able to recover an angle for the cluster. This with the exception of the degenerate case where the centroid lies at the origin, but in any case you will need to discard clusters who’s centroids fall close to it since these are effectively ‘false’ clusters that arise because the algorithm is unaware of the true constraints of the problem.
It’s may be worth considering the presence of a cluster close to the origin as an indication that the number of clusters is too low and using it as a condition to re-cluster with a greater maximum number of clusters. Though I haven’t tried this myself, it might provide a good heuristic.
After a few birthing pains, Maven finally popped out a 1.0 release of my Java library for GVM clustering.
It feels great to finally tick that off my list. Hopefully someone with a very big dataset will find it useful.
I just published a Java implementation of my GVM clustering algorithm to Google Code. Later, I’ll be collating all the information I’ve been compiling about the algorithm at my website.
Until then, here are some basic build steps, you’ll need svn, Java 1.6 and Maven 2.2:
svn checkout http://tomgibara.googlecode.com/svn/trunk/cluster
cd cluster/cluster-mojo/
mvn install
cd ..
mvn install
Then, if you want to try out the included demo app, try:
java -jar cluster-demo/target/cluster-demo-0.1-SNAPSHOT-jar-with-dependencies.jar
I’ve been qualitatively comparing GVM with k-means by producing some plots using R.
As I anticipated (having used both a number of times) they both perform fairly equally, except GVM only needs to take a single pass through the dataset — a big advantage with datasets that don’t fit in memory — and has a robust upper bound on its execution time.
The fist plot in each set contains the points prior to clustering. The last plot in each set is one obtained using EM.
Waiting time between eruptions and the duration of the eruption for the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA.




Two 2D normal distributions crossing at the origin.




Points uniformly distributed within three circles.




Almost four years ago, I discovered/invented/developed* a spatial clustering algorithm that I have since used in a number of projects, but I never found the time to describe it, or publish the source code. I’m sure it could be useful to other developers so this is a silly situation. One which I intend to rectify, starting now, with this, a hasty overview.
First here are some of its characteristics (copied from the original web page):
Of course, these claims can only be verified after I’ve published a formal description of the algorithm — it’s very possible that my analysis is faulty — or after I’ve published the source code (which is much more likely and probably more useful). Until then, here’s a simple overview of the approach.
I called the algorithm “Greedy Variance Minimization” (GVM for short). It’s so named because:
The algorithm is configured with a maximum number of clusters (C) and the point dimension (D). Initially, no points have been added, so there are no clusters.

The first C points that are added form their own single point clusters. Each point has a mass (which naturally serves as a weighting in most applications) and may be assigned a key (which is entirely under the domain of the application and is not included in these diagrams)

In addition to tracking the individual clusters, the algorithm maintains a heap of all possible cluster pairings. Each pair is assigned a cost equal to the amount by which the variance of the combined cluster exceeds that of the two clusters individually.

At the addition of the C+1th point, and for all subsequent points, the algorithm must choose either to add the point to an existing cluster or to merge two existing clusters and form a new cluster from the new point. To do this, the cost (increase in variance) of adding the point to each possible cluster is computed and compared to the cost of the cluster-pair top of the heap. The lowest cost operation — either adding or merging — wins.

After every operation the heap of cluster-pairs is sifted appropriately and the algorithm proceeds in this manner until all the points have been added.
Failures of the first type can be avoided by setting C higher than the anticipated number of clusters, and then, after all the points have been added, collapsing the clusters subject to a maximum permitted total variance (i.e. merging any clusters that the algorithm has kept separate, but which can be regarded as part of the same overall cluster). But performance is sensitive to C; if you set C too high, the algorithm will proceed slowly.
The second type of failure can be avoided by randomly (or otherwise) ordering the points before they are clustered. But this may increase space/time requirements depending on your application.
One thing that makes the algorithm very practical, is the degree of control it affords the application. The algorithm assigns a key to each cluster. The application assigns keys to points, and controls how keys are selected/combined when points are added to clusters or when clusters are merged.
At an extreme, if an application doesn’t assign any keys, then memory usage is minimal since no point-level data is stored. But this has the consequence that, although the algorithm will report the location, size and general shape of each cluster, nothing else can be reported back to the application.
At the other extreme, an application may choose to store “lists of points” as its keys: each point is tagged with its own coordinates, and when clusters are merged the lists are concatenated. This means that the application will receive the coordinates of every point in each cluster (at a cost in time and space).
Other ‘in-between’ strategies are possible too. For example, the city demo uses a single city as the key for a cluster and chooses the most populous city when merging clusters. This is very cheap but quite adequate for that application.
(*) Delete as per your philosophy