Latest Tweets

 

Greedy Variance Minimization

Almost four years ago, I discovered/invented/developed* a spatial clustering algorithm that I have since used in a number of projects, but I never found the time to describe it, or publish the source code. I’m sure it could be useful to other developers so this is a silly situation. One which I intend to rectify, starting now, with this, a hasty overview.

Characteristics

First here are some of its characteristics (copied from the original web page):

  • The algorithm places no restrictions on the number of items (N), dimensions (D) and clusters (C).
  • The algorithm scales linearly with the number items clustered. Specifically the worst case time complexity of the algorithm is N⋅D⋅C log C;
  • Memory usage is constant and independent of the number of items clustered. Specifically, the space complexity is C^2 + D⋅C.

Of course, these claims can only be verified after I’ve published a formal description of the algorithm — it’s very possible that my analysis is faulty — or after I’ve published the source code (which is much more likely and probably more useful). Until then, here’s a simple overview of the approach.

I called the algorithm “Greedy Variance Minimization” (GVM for short). It’s so named because:

  • It never looks ahead - a clustering decision is made every time a new point is added.
  • It uses variance as the measure of cluster cohesion - the variance of the points within a cluster is the only attribute considered; the points inside a cluster can be discarded as the algorithm proceeds.
  • It always acts so as to minimize - specifically it attempts to minimize the sum of cluster variances over all clusters.

Description

The algorithm is configured with a maximum number of clusters (C) and the point dimension (D). Initially, no points have been added, so there are no clusters.

The first C points that are added form their own single point clusters. Each point has a mass (which naturally serves as a weighting in most applications) and may be assigned a key (which is entirely under the domain of the application and is not included in these diagrams)

In addition to tracking the individual clusters, the algorithm maintains a heap of all possible cluster pairings. Each pair is assigned a cost equal to the amount by which the variance of the combined cluster exceeds that of the two clusters individually.

At the addition of the C+1th point, and for all subsequent points, the algorithm must choose either to add the point to an existing cluster or to merge two existing clusters and form a new cluster from the new point. To do this, the cost (increase in variance) of adding the point to each possible cluster is computed and compared to the cost of the cluster-pair top of the heap. The lowest cost operation — either adding or merging — wins.

After every operation the heap of cluster-pairs is sifted appropriately and the algorithm proceeds in this manner until all the points have been added.

Benefits

  • The algorithm is linear in the number of points. This is a lifesaver when you want quick results over large datasets.
  • The ‘best current’ clustering is always immediately available (essentially because it’s the core of the algorithm’s state). This means it is possible to get preliminary/progressive results.
  • The keying of points is under application control. This provides a lot of flexibility by allowing the application to trade-off space/time/completeness (see below).

Limitations

  • The algorithm is sensitive to C. The value of C may often need to be larger than the number of clusters in the dataset. This is because it may take time for the algorithm to accumulate sufficient points to seed sensible clusters. If C is too low, the algorithm may commit itself to clustering too early.
  • The algorithm is sensitive to the order in which points are added. If the order of the points, as presented to the algorithm, correlates to one or more coordinates, the algorithm may make bad choices from which it cannot later recover. A large number of highly localized points will create a number of highly localized clusters. As more distant points are added, large numbers of points will be assigned to the wrong cluster and the algorithm may fail to discern the correct clusters at all.

Failures of the first type can be avoided by setting C higher than the anticipated number of clusters, and then, after all the points have been added, collapsing the clusters subject to a maximum permitted total variance (i.e. merging any clusters that the algorithm has kept separate, but which can be regarded as part of the same overall cluster). But performance is sensitive to C; if you set C too high, the algorithm will proceed slowly.

The second type of failure can be avoided by randomly (or otherwise) ordering the points before they are clustered. But this may increase space/time requirements depending on your application.

Flexibility

One thing that makes the algorithm very practical, is the degree of control it affords the application. The algorithm assigns a key to each cluster. The application assigns keys to points, and controls how keys are selected/combined when points are added to clusters or when clusters are merged.

At an extreme, if an application doesn’t assign any keys, then memory usage is minimal since no point-level data is stored. But this has the consequence that, although the algorithm will report the location, size and general shape of each cluster, nothing else can be reported back to the application.

At the other extreme, an application may choose to store “lists of points” as its keys: each point is tagged with its own coordinates, and when clusters are merged the lists are concatenated. This means that the application will receive the coordinates of every point in each cluster (at a cost in time and space).

Other ‘in-between’ strategies are possible too. For example, the city demo uses a single city as the key for a cluster and chooses the most populous city when merging clusters. This is very cheap but quite adequate for that application.


(*) Delete as per your philosophy