Almost four years ago, I discovered/invented/developed* a spatial clustering algorithm that I have since used in a number of projects, but I never found the time to describe it, or publish the source code. I’m sure it could be useful to other developers so this is a silly situation. One which I intend to rectify, starting now, with this, a hasty overview.
First here are some of its characteristics (copied from the original web page):
Of course, these claims can only be verified after I’ve published a formal description of the algorithm — it’s very possible that my analysis is faulty — or after I’ve published the source code (which is much more likely and probably more useful). Until then, here’s a simple overview of the approach.
I called the algorithm “Greedy Variance Minimization” (GVM for short). It’s so named because:
The algorithm is configured with a maximum number of clusters (C) and the point dimension (D). Initially, no points have been added, so there are no clusters.

The first C points that are added form their own single point clusters. Each point has a mass (which naturally serves as a weighting in most applications) and may be assigned a key (which is entirely under the domain of the application and is not included in these diagrams)

In addition to tracking the individual clusters, the algorithm maintains a heap of all possible cluster pairings. Each pair is assigned a cost equal to the amount by which the variance of the combined cluster exceeds that of the two clusters individually.

At the addition of the C+1th point, and for all subsequent points, the algorithm must choose either to add the point to an existing cluster or to merge two existing clusters and form a new cluster from the new point. To do this, the cost (increase in variance) of adding the point to each possible cluster is computed and compared to the cost of the cluster-pair top of the heap. The lowest cost operation — either adding or merging — wins.

After every operation the heap of cluster-pairs is sifted appropriately and the algorithm proceeds in this manner until all the points have been added.
Failures of the first type can be avoided by setting C higher than the anticipated number of clusters, and then, after all the points have been added, collapsing the clusters subject to a maximum permitted total variance (i.e. merging any clusters that the algorithm has kept separate, but which can be regarded as part of the same overall cluster). But performance is sensitive to C; if you set C too high, the algorithm will proceed slowly.
The second type of failure can be avoided by randomly (or otherwise) ordering the points before they are clustered. But this may increase space/time requirements depending on your application.
One thing that makes the algorithm very practical, is the degree of control it affords the application. The algorithm assigns a key to each cluster. The application assigns keys to points, and controls how keys are selected/combined when points are added to clusters or when clusters are merged.
At an extreme, if an application doesn’t assign any keys, then memory usage is minimal since no point-level data is stored. But this has the consequence that, although the algorithm will report the location, size and general shape of each cluster, nothing else can be reported back to the application.
At the other extreme, an application may choose to store “lists of points” as its keys: each point is tagged with its own coordinates, and when clusters are merged the lists are concatenated. This means that the application will receive the coordinates of every point in each cluster (at a cost in time and space).
Other ‘in-between’ strategies are possible too. For example, the city demo uses a single city as the key for a cluster and chooses the most populous city when merging clusters. This is very cheap but quite adequate for that application.
(*) Delete as per your philosophy