Note that my analysis of the algorithm here is in no way complete. This serves me as documentation; it’s public because I hope it may be of use to others. Though I have implemented this algorithm and am employing it daily, what follows may be inaccurate or just plain wrong.
Given a large table of records, stored on disk, how do we verify that every record is unique, (ie. that they are all mutually distinct) in a way that’s efficient and fast.
Let \(L\) be an external (ie. not in main computer memory) list containing \(n\) elements \((e_1 e_2…e_n)\). We are looking to demonstrate that every element of \(L\) is distinct, or to demonstrate otherwise, by exhibiting any element that occurs at more than one index in the list. We assume that the elements of \(L\) can be consistently read into memory any number of times; that \(L\) is persistent.
Let \(m(e)\) be the size of element \(e\)’s in-memory representation measured in bits. A straightforward solution to the problem is to demonstrate that the ordered elements of \(L\) are strictly monotonic. The memory required to perform this, without resorting to an external sorting algorithm is at least \(\Sigma_{i=1}^n m(e_i)\).
In theory, one can demonstrate that every element is distinct by reading the elements of \(L\) into memory \(n\) times; each time storing only \(e_i\), discarding all preceding elements and comparing it to each subsequent element. Though only \(2.\max_{i=1}^n m(e)\) bits of memory are required in such an approach, it is generally impractical due to the slowness of external access (up to \(n\) passes over \(L\) are required), and wasteful since since it is almost always the case that \(\max_{i=1}^n m(e_i)\) is significantly less than the total available memory \(M\).
Nevertheless, the approach can be adapted into a practical one by not limiting ourselves to drawing one element at a time from the list, but instead forming successive in-memory sets of \(l\) elements from \(L\). Let \(\alpha=\Sigma_{i=1}^n m(e_i)/n\) be the average bit-size of an element in \(L\) then, assuming \(\alpha\) can be readily estimated, \(l\) can be chosen to maximize usage of the available memory by choosing \(l=\lfloor \frac M \alpha \rfloor\). In the worst case, when all the elements of the list are unique, this requires \(\lceil \frac n l \rceil\) passes.
An alternative approach can be used to verify that every value of \(L\) is distinct using less memory and requiring (with a high probability) only two passes over \(L). The memory required is approximately:
\[\frac n {\log^2 2} \left(1 + \log(\alpha.\log^2 2) \right)\]
The general idea is to make two passes over the list. The first pass uses a Bloom filter to identify probable duplicates. Any recurring element will be identified by the Bloom filter and added to a candidate set which, due to the probabilistic nature of the Bloom filter, may also contain some elements that do not recur. The second pass discards the Bloom filter and confirms whether any of the elements in the candidate set are real duplicates and not just false positives.
The steps of the algorithm are:
The peak memory required \(M\), is given by the number of bits \(m\) in the Bloom filter \(B\) plus the size of the candidate set \(C\) in bits. Assuming that the false positives of \(B\) predominate, the elements of \(C\) will consist mostly of these false positives which estimates the size of \(C\) at: \[|C| = n\left(1 - e^{kn/m}\right)^k\]
Choosing approximately \(k=\frac m n \log 2\) hash functions for \(B\) is expected to be near-optimal, and this simplifies our estimate for the size of \(C\) to: \[|C| = n.2^{-\frac m n \log 2}\]
And our estimate for the memory required becomes: \[M=m + \alpha.n.2^{-\frac m n \log 2}\]
We want to choose the \(m\) which minimizes this value, differentiating and solving for zero gives: \[m=\frac {n\log(\alpha.\log^2 2)} {\log^2 2}\]
from which the Bloom filter parameters and consequently the estimated memory requirements are derived.
As presented here, the algorithm doesn’t adequately address the situation where a large number of elements occur exactly twice. In the extreme case where this is true of every element in the list, the candidate set would grow to half the size of the list; it would contain every distinct element.
In practice, this situation is easily avoided. Assuming satisfactory hashing, the number of false positives in the candidate set will not greatly exceed its estimated peak size. If the algorithm detects that the candidate set has grown significantly beyond expectations, the first pass can be terminated prematurely and a search for a witness element begun with a very high probability of finding one.
In the extremely unlikely case that every element in the oversized candidate set was a false positive, the algorithm can resume the first pass, clearing the candidate set and populating it only beyond the point at which the first pass was terminated; in this way the algorithm is guaranteed to make progress and will eventually terminate.
I’m continuing to evaluate my cluster painting algorithm, this time on a larger dataset.
This comparison with random colour assignment demonstrates the increased clarity that a good colouring provides. Without the algorithm, most of the clusters in the South West and Scotland are indistinguishable.
As I posted yesterday, I needed an algorithm that could paint clusters so as to make them distinguishable from each other. Today, I implemented a simple, “good enough” algorithm.
The algorithm is fairly general, only requiring that each cluster has:
It takes as parameters, a collection of clusters and a list of “paints” and returns a map that assigns every cluster a paint.
Here are progressive results over a small dataset using 5 colours.







Even with 200 clusters, this algorithm does a fairly good job at making all the clusters distinguishable using just 5 colours.
Ideas for a cluster colouring algorithm.
Recently, I’ve received a small flurry of enquiries about clustering algorithms, and my GVM algorithm in particular. So I’ve started investigating some algorithmic improvements and generalizations.
In doing so, I need a good way of visually checking the quality of large clusterings; simply allocating the points of each cluster a random colour results in inadequate visualizations. So I’m scouting around for a simple algorithm that can produce cluster-colourings that make it easy to distinguish clusters.
I’ve been investigating how clustering can be applied to sets of character strings in which common prefixes indicate similarity, and I came up with the following metric (probably known already, but no amount of online searching gave any useful results).
We start buy considering the set \(S\) of sequences over some alphabet \(\Sigma\). We write \(s=\{s_i\}_{i\geq0}\) with \(s_i\in\Sigma\) for \(s\in\Sigma\).
For \(s,t\in\Sigma\) with \(s\neq t\) we write \(\widehat{st}\) for the length of the longest common initial subsequence, ie. \(\widehat{st}\) is the greatest \(n\) such that \(s_k=t_k\forall{k < n}\). Then \(d:S^2\rightarrow\mathbb{R}\) defines a metric where
\[ d(s,t) = \left\{ \begin{array}{ll} \frac 1 {|\Sigma|^{\widehat{st}}} & \mbox{if } s \neq t \\ 0 & \mbox{if } s = t \end{array} \right. \]
By its definition, necessarily \(d(s,t)\geq0\) with \(d(s,t)=0\) iff \(s=t\). Furthermore, \(d(s,t)\) is symmetric as a consequence of the symmetry of \(\widehat{st}\). It only remains to demonstrate that \(d\) satisfies the triangle inequality so that, for \(s,t,u \in S\)
\[ d(s,u) \leq d(s,t) + d(t,u) \]
If any of the sequences are equal, the inequality is trivial so we only consider mutually distinct sequences. Hence \(|\Sigma| > 1\) since otherwise there would be no distinct sequences. Thus
\[ \begin{aligned} d(s,u) & \leq d(s,t) + d(t,u) \\ \frac 1 {|\Sigma|^{\widehat{su}}} & \leq \frac 1 {|\Sigma|^{\widehat{st}}} + \frac 1 {|\Sigma|^{\widehat{tu}}} \\ & \leq \frac 1 {|\Sigma|^{ \min(\widehat{st}, \widehat{tu}) }} \end{aligned} \]
so it suffices to show \(\widehat{su} \geq \min(\widehat{st},\widehat{tu})\). But this is clearly true since \(\widehat{su} < \widehat{uv} \Rightarrow \widehat{tu} < \widehat{su}\).
The metric is easily extended to finite sequences by extending \(\Sigma\) with a new null symbol \(\diamond \notin \Sigma\). Let \(\Sigma’ = \Sigma \cup \{\diamond\}\) then we can map any finite sequence \(w\) over \(\Sigma\) to a sequence \(w \in S’\) by
\[ w’_i = \left\{ \begin{array}{ll} w_i & \mbox{if } i \leq \operatorname{len}(w) \\ \diamond & \mbox{if } i > \operatorname{len}(w) \end{array} \right. \]
This map is clearly injective so we can extend \(d\) to a metric over finite sequences of \(\Sigma\).
So what properties does this metric have in the context of its practical application to clustering strings of characters?
Lots of further investigation is needed to see how useful this metric will be in practice. There are lots of avenues too explore too:
I’ve been living in hash table ignorance, always accepting the received wisdom that a load factor (the ratio of elements stored to available buckets) of around 0.75 provides a good trade-off between storage and performance but never stopping to think about what’s actually going on. For example, what happens at the limit, when the number of elements approaches the number of buckets in a chained hash table such as Java’s HashMap.
As part of my development of crinch I’ve been approaching data structures from the perspective of reducing memory overhead. So when I started writing a hash map implementation for the library, the following question naturally arose: in a hash table of capacity n, that contains n elements, how many buckets will be empty?
Assuming a hashing algorithm that distributes keys with a high degree of uniformity, there will inevitably be some collisions (implying empty buckets elsewhere) but my expectation was that there would not be not too many. I guessed that maybe 10% of the buckets might be empty, and I was hoping for fewer than that.
When I found that slightly more than a third of the buckets in my hash table implementation were empty, I wish I could say I was immediately shocked into the realization that I’d never bothered to understand the real performance characteristics of the humble hash map. Instead I wasted an hour looking for a mistake — trying different hash functions and a number of different key sets — before realizing that the results I was seeing were so consistent (always hovering tightly around 36.8% of buckets empty) that I must be missing something. Then I did something more intelligent and walked away from my keyboard, picked up a pen and reasoned through the problem. It turns out it’s simple.
Let’s fix our attention on one bucket (it doesn’t matter which they’re identical under the assumption of a uniform hash). What’s the probability that the first key hashes into that bucket? Assuming a uniform hash it’s \(\frac1n\), and the probability it doesn’t is \(1-\frac1n\). This probability is constant for each key we add and furthermore each outcome is independent, so the probability \(P_n\) that none of the \(n\) elements hash to our bucket is given by
\[P_n = \left(1 - \frac {1} {n}\right)^n = \left(\frac {n-1} {n}\right)^n\]
What happens as \(n\) grows large? If limit abuse upsets you, look away now.
\[\begin{aligned} \lim_{n\to\infty} P_n = & \lim_{n\to\infty} \left(\frac {n-1} {n}\right)^n \\ = & \lim_{n\to\infty} \left(\frac {n} {n-1}\right)^{-n} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’+1} {n’}\right)^{-(n’+1)} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’} {n’+1}\right) / \left(1 + \frac {1} {n’}\right)^{n’} \\ = & 1/e \\ \approx & 0.3678 \end{aligned}\]
What this result tells us that if we take a reasonably capacious hash table that chains collisions and which is equipped with an ideal hash function that uniformly distributes all keys, and we add as many records as there are buckets, it’s almost certain that more than a third of the buckets are completely empty. All the years I’ve been stuffing objects into Java HashMaps and I never knew.
I don’t have my volumes of Knuth to hand; I’m sure all this is covered and more and that I really should have known this years ago, but at least it was an enjoyable excursion of ignorance.
Imagine a situation where every word in a large corpus is assigned an arbitrary numeric identifier, and you need a data structure that can quickly return the word-frequency for any particular identifier. If you are not a programmer or you can’t imagine finding yourself in this situation you can safely stop reading now.
I wasn’t in exactly in this situation, but a similar one that shares the same basic constraints:
It’s not a run-of-the-mill set of requirements for a data-structure; you certainly won’t find any help in the standard Java libraries, so I thought I’d share my approach.
The underlying concept is simple - sort the map entries into key order and use a binary search on the key to find the entry and return the associated value. This requires O(log n) accesses per map lookup but the problem with this naive approach is memory consumption. Imagining that both keys and values can be accommodated in 64 bit longs, one entry requires 16 bytes. With 10 million entries, we’re looking at 160 megabytes of memory for a data structure which in my application is only subsidiary to a number of larger algorithms. Can we do better?
It turns out that we can take advantage of fact that data frequently exhibit a bias towards small values. Though some values may be very large (eg. there are a small number of very high frequency words) most values are small (eg. most words are infrequent). By padding every value to the same width (eg. 64 bits) we’re wasting a lot of space.
To trim the values we can use a universal code and this will reduce the number of bits required to store most values if most values are relatively short. But now we have a problem: how can we operate a binary search when the entries are variable length?
My solution to this was to use Fibonacci coding. This elegant code has the property that every number’s binary representation ends with 11 and there are no consecutive ones elsewhere present. How does this help?
We encode every key and value using Fibonacci coding and lay them out [key,value,key,value,…] without any padding, in ascending key order. Then we perform our binary search at a bit-level. We examine the middle bit of our data and work outwards from there looking for a pair of consecutive ones. Because of the properties of this encoding, we know that the binary sequence 11 indicates the end of one number’s encoding and we can then decode the next pair of numbers. But how do we distinguish keys from values?
I chose a simple scheme: for each key, add one and double it; for each value double it and add one. So keys are always even and values are always odd and we can easily identify if it’s a key or a value we’ve decoded based on whether it is even or odd.
So the algorithm, slightly simplified, looks like this:
With the right library support, this is very easy to code — my crinch library has excellent support for universal codes.
There may be better approaches (I’d be happy to hear recommendations), but I thought this one was interesting enough to share.
A global postal coding system.
I’ve barely been at my home computer during the past couple of months as I work towards the conclusion of a project for a multinational company that has now occupied me for over a year.
Writing good software at this scale requires much effort, and sometimes the complexities almost become frustrations; modelling logistics and delivery in countries without postal codes is one such complexity.
And although I haven’t been at a computer, it hasn’t stopped me filling my notebooks with ideas that would resolve some of my frustrations. Like this, a system for coding small regions of the globe with a focus on human communicability: readily identified, unambiguous, 6 or 8 character codes which include a checksum that can catch the majority of basic transcription errors.
The fact that it’s based on a single continuous space-filling curve that loops through every region of the earth’s surface is a bonus.
Almost four years ago, I discovered/invented/developed* a spatial clustering algorithm that I have since used in a number of projects, but I never found the time to describe it, or publish the source code. I’m sure it could be useful to other developers so this is a silly situation. One which I intend to rectify, starting now, with this, a hasty overview.
First here are some of its characteristics (copied from the original web page):
Of course, these claims can only be verified after I’ve published a formal description of the algorithm — it’s very possible that my analysis is faulty — or after I’ve published the source code (which is much more likely and probably more useful). Until then, here’s a simple overview of the approach.
I called the algorithm “Greedy Variance Minimization” (GVM for short). It’s so named because:
The algorithm is configured with a maximum number of clusters (C) and the point dimension (D). Initially, no points have been added, so there are no clusters.

The first C points that are added form their own single point clusters. Each point has a mass (which naturally serves as a weighting in most applications) and may be assigned a key (which is entirely under the domain of the application and is not included in these diagrams)

In addition to tracking the individual clusters, the algorithm maintains a heap of all possible cluster pairings. Each pair is assigned a cost equal to the amount by which the variance of the combined cluster exceeds that of the two clusters individually.

At the addition of the C+1th point, and for all subsequent points, the algorithm must choose either to add the point to an existing cluster or to merge two existing clusters and form a new cluster from the new point. To do this, the cost (increase in variance) of adding the point to each possible cluster is computed and compared to the cost of the cluster-pair top of the heap. The lowest cost operation — either adding or merging — wins.

After every operation the heap of cluster-pairs is sifted appropriately and the algorithm proceeds in this manner until all the points have been added.
Failures of the first type can be avoided by setting C higher than the anticipated number of clusters, and then, after all the points have been added, collapsing the clusters subject to a maximum permitted total variance (i.e. merging any clusters that the algorithm has kept separate, but which can be regarded as part of the same overall cluster). But performance is sensitive to C; if you set C too high, the algorithm will proceed slowly.
The second type of failure can be avoided by randomly (or otherwise) ordering the points before they are clustered. But this may increase space/time requirements depending on your application.
One thing that makes the algorithm very practical, is the degree of control it affords the application. The algorithm assigns a key to each cluster. The application assigns keys to points, and controls how keys are selected/combined when points are added to clusters or when clusters are merged.
At an extreme, if an application doesn’t assign any keys, then memory usage is minimal since no point-level data is stored. But this has the consequence that, although the algorithm will report the location, size and general shape of each cluster, nothing else can be reported back to the application.
At the other extreme, an application may choose to store “lists of points” as its keys: each point is tagged with its own coordinates, and when clusters are merged the lists are concatenated. This means that the application will receive the coordinates of every point in each cluster (at a cost in time and space).
Other ‘in-between’ strategies are possible too. For example, the city demo uses a single city as the key for a cluster and chooses the most populous city when merging clusters. This is very cheap but quite adequate for that application.
(*) Delete as per your philosophy