Latest Tweets

 

Identifying uniqueness in O(n log α) bits

Note that my analysis of the algorithm here is in no way complete. This serves me as documentation; it’s public because I hope it may be of use to others. Though I have implemented this algorithm and am employing it daily, what follows may be inaccurate or just plain wrong.

Problem

Given a large table of records, stored on disk, how do we verify that every record is unique, (ie. that they are all mutually distinct) in a way that’s efficient and fast.

Possible Approaches

Let \(L\) be an external (ie. not in main computer memory) list containing \(n\) elements \((e_1 e_2…e_n)\). We are looking to demonstrate that every element of \(L\) is distinct, or to demonstrate otherwise, by exhibiting any element that occurs at more than one index in the list. We assume that the elements of \(L\) can be consistently read into memory any number of times; that \(L\) is persistent.

Let \(m(e)\) be the size of element \(e\)’s in-memory representation measured in bits. A straightforward solution to the problem is to demonstrate that the ordered elements of \(L\) are strictly monotonic. The memory required to perform this, without resorting to an external sorting algorithm is at least \(\Sigma_{i=1}^n m(e_i)\).

In theory, one can demonstrate that every element is distinct by reading the elements of \(L\) into memory \(n\) times; each time storing only \(e_i\), discarding all preceding elements and comparing it to each subsequent element. Though only \(2.\max_{i=1}^n m(e)\) bits of memory are required in such an approach, it is generally impractical due to the slowness of external access (up to \(n\) passes over \(L\) are required), and wasteful since since it is almost always the case that \(\max_{i=1}^n m(e_i)\) is significantly less than the total available memory \(M\).

Nevertheless, the approach can be adapted into a practical one by not limiting ourselves to drawing one element at a time from the list, but instead forming successive in-memory sets of \(l\) elements from \(L\). Let \(\alpha=\Sigma_{i=1}^n m(e_i)/n\) be the average bit-size of an element in \(L\) then, assuming \(\alpha\) can be readily estimated, \(l\) can be chosen to maximize usage of the available memory by choosing \(l=\lfloor \frac M \alpha \rfloor\). In the worst case, when all the elements of the list are unique, this requires \(\lceil \frac n l \rceil\) passes.

Algorithm

An alternative approach can be used to verify that every value of \(L\) is distinct using less memory and requiring (with a high probability) only two passes over \(L). The memory required is approximately:

\[\frac n {\log^2 2} \left(1 + \log(\alpha.\log^2 2) \right)\]

The general idea is to make two passes over the list. The first pass uses a Bloom filter to identify probable duplicates. Any recurring element will be identified by the Bloom filter and added to a candidate set which, due to the probabilistic nature of the Bloom filter, may also contain some elements that do not recur. The second pass discards the Bloom filter and confirms whether any of the elements in the candidate set are real duplicates and not just false positives.

The steps of the algorithm are:

  1. Create a Bloom filter \(B\) of with size \(m\) and number of hash functions \(k\) given by \[m=\frac {n\log(\alpha.\log^2 2)} {\log^2 2}\] \[k=\frac {\log(\alpha.\log^2 2)} {\log 2}\]
  2. Create an empty set \(C\) (the candidate set).
  3. Iterate over the elements of the list: if the element is not matched by the \(B\) then add it to \(B\), otherwise, if the element is not in \(C\) add it to \(C\), otherwise terminate; the element is a duplicate
  4. Discard \(B\) and create a new set \(W\) (the witnesses set).
  5. Iterate over the elements of the list: if the element is not in \(C\) ignore it, otherwise if the element is not in \(W\) add it to \(W\), otherwise terminate; the element is a duplicate
  6. The list contains no duplicate elements; terminate.

Sketch Analysis

The peak memory required \(M\), is given by the number of bits \(m\) in the Bloom filter \(B\) plus the size of the candidate set \(C\) in bits. Assuming that the false positives of \(B\) predominate, the elements of \(C\) will consist mostly of these false positives which estimates the size of \(C\) at: \[|C| = n\left(1 - e^{kn/m}\right)^k\]

Choosing approximately \(k=\frac m n \log 2\) hash functions for \(B\) is expected to be near-optimal, and this simplifies our estimate for the size of \(C\) to: \[|C| = n.2^{-\frac m n \log 2}\]

And our estimate for the memory required becomes: \[M=m + \alpha.n.2^{-\frac m n \log 2}\]

We want to choose the \(m\) which minimizes this value, differentiating and solving for zero gives: \[m=\frac {n\log(\alpha.\log^2 2)} {\log^2 2}\]

from which the Bloom filter parameters and consequently the estimated memory requirements are derived.

Notes

As presented here, the algorithm doesn’t adequately address the situation where a large number of elements occur exactly twice. In the extreme case where this is true of every element in the list, the candidate set would grow to half the size of the list; it would contain every distinct element.

In practice, this situation is easily avoided. Assuming satisfactory hashing, the number of false positives in the candidate set will not greatly exceed its estimated peak size. If the algorithm detects that the candidate set has grown significantly beyond expectations, the first pass can be terminated prematurely and a search for a witness element begun with a very high probability of finding one.

In the extremely unlikely case that every element in the oversized candidate set was a false positive, the algorithm can resume the first pass, clearing the candidate set and populating it only beyond the point at which the first pass was terminated; in this way the algorithm is guaranteed to make progress and will eventually terminate.

I’m continuing to evaluate my cluster painting algorithm, this time on a larger dataset.

This comparison with random colour assignment demonstrates the increased clarity that a good colouring provides. Without the algorithm, most of the clusters in the South West and Scotland are indistinguishable.

An algorithm for painting clusters

As I posted yesterday, I needed an algorithm that could paint clusters so as to make them distinguishable from each other. Today, I implemented a simple, “good enough” algorithm.

Inputs and Outputs

The algorithm is fairly general, only requiring that each cluster has:

  • has a size — the number of points it contains
  • has a radius — gives a general indication of the cluster’s spatial extent
  • has a centre — some point indicating the position of the entire cluster

It takes as parameters, a collection of clusters and a list of “paints” and returns a map that assigns every cluster a paint.

Coloring Algorithm

  • Order the clusters by the number of points they contain (most populous first)
  • For each cluster:
    • compute it’s distance from every other cluster (subtracting their combined radii from the distance between their centres)
  • Order these radius-adjusted distances (nearest first)
  • Iterate over all the clusters again, for each one:
    • if there is a paint that is not used by any of the other clusters, assign that to the cluster
    • Otherwise assign the most distant paint ie. the last paint used by any other cluster (distance order as above)

Early Results

Here are progressive results over a small dataset using 5 colours.

Even with 200 clusters, this algorithm does a fairly good job at making all the clusters distinguishable using just 5 colours.

Ideas for a cluster colouring algorithm.

Recently, I’ve received a small flurry of enquiries about clustering algorithms, and my GVM algorithm in particular. So I’ve started investigating some algorithmic improvements and generalizations.

In doing so, I need a good way of visually checking the quality of large clusterings; simply allocating the points of each cluster a random colour results in inadequate visualizations. So I’m scouting around for a simple algorithm that can produce cluster-colourings that make it easy to distinguish clusters.

Ideas for a cluster colouring algorithm.

Recently, I’ve received a small flurry of enquiries about clustering algorithms, and my GVM algorithm in particular. So I’ve started investigating some algorithmic improvements and generalizations.

In doing so, I need a good way of visually checking the quality of large clusterings; simply allocating the points of each cluster a random colour results in inadequate visualizations. So I’m scouting around for a simple algorithm that can produce cluster-colourings that make it easy to distinguish clusters.

A prefix metric on Strings

I’ve been investigating how clustering can be applied to sets of character strings in which common prefixes indicate similarity, and I came up with the following metric (probably known already, but no amount of online searching gave any useful results).

A metric over infinite sequences

We start buy considering the set \(S\) of sequences over some alphabet \(\Sigma\). We write \(s=\{s_i\}_{i\geq0}\) with \(s_i\in\Sigma\) for \(s\in\Sigma\).

For \(s,t\in\Sigma\) with \(s\neq t\) we write \(\widehat{st}\) for the length of the longest common initial subsequence, ie. \(\widehat{st}\) is the greatest \(n\) such that \(s_k=t_k\forall{k < n}\). Then \(d:S^2\rightarrow\mathbb{R}\) defines a metric where

\[ d(s,t) = \left\{ \begin{array}{ll} \frac 1 {|\Sigma|^{\widehat{st}}} & \mbox{if } s \neq t \\ 0 & \mbox{if } s = t \end{array} \right. \]

By its definition, necessarily \(d(s,t)\geq0\) with \(d(s,t)=0\) iff \(s=t\). Furthermore, \(d(s,t)\) is symmetric as a consequence of the symmetry of \(\widehat{st}\). It only remains to demonstrate that \(d\) satisfies the triangle inequality so that, for \(s,t,u \in S\)

\[ d(s,u) \leq d(s,t) + d(t,u) \]

If any of the sequences are equal, the inequality is trivial so we only consider mutually distinct sequences. Hence \(|\Sigma| > 1\) since otherwise there would be no distinct sequences. Thus

\[ \begin{aligned} d(s,u) & \leq d(s,t) + d(t,u) \\ \frac 1 {|\Sigma|^{\widehat{su}}} & \leq \frac 1 {|\Sigma|^{\widehat{st}}} + \frac 1 {|\Sigma|^{\widehat{tu}}} \\ & \leq \frac 1 {|\Sigma|^{ \min(\widehat{st}, \widehat{tu}) }} \end{aligned} \]

so it suffices to show \(\widehat{su} \geq \min(\widehat{st},\widehat{tu})\). But this is clearly true since \(\widehat{su} < \widehat{uv} \Rightarrow \widehat{tu} < \widehat{su}\).

Extending it to finite sequences

The metric is easily extended to finite sequences by extending \(\Sigma\) with a new null symbol \(\diamond \notin \Sigma\). Let \(\Sigma’ = \Sigma \cup \{\diamond\}\) then we can map any finite sequence \(w\) over \(\Sigma\) to a sequence \(w \in S’\) by

\[ w’_i = \left\{ \begin{array}{ll} w_i & \mbox{if } i \leq \operatorname{len}(w) \\ \diamond & \mbox{if } i > \operatorname{len}(w) \end{array} \right. \]

This map is clearly injective so we can extend \(d\) to a metric over finite sequences of \(\Sigma\).

Practical properties of the metric

So what properties does this metric have in the context of its practical application to clustering strings of characters?

  • The distance between two strings is always less than or equal to one.
  • The distance between from the empty string to any non-empty string is always one.
  • The distance between strings is inversely proportional to the length of their common prefix.
  • This distance is normalized by size of the character set, so for example, the distance between two string of 8 bit ASCII characters approximates (from above) their distance when represented as a two binary sequences.

Further work

Lots of further investigation is needed to see how useful this metric will be in practice. There are lots of avenues too explore too:

  • Can we create structures from this metric that can increase its applicability to existing algorithms?
  • Can we incorporate any additional knowledge we may have of ‘character similarity’ to increase its precision?
  • What is the best way of numerically representing the computed metric values?

Living in Hash Table Ignorance

I’ve been living in hash table ignorance, always accepting the received wisdom that a load factor (the ratio of elements stored to available buckets) of around 0.75 provides a good trade-off between storage and performance but never stopping to think about what’s actually going on. For example, what happens at the limit, when the number of elements approaches the number of buckets in a chained hash table such as Java’s HashMap.

As part of my development of crinch I’ve been approaching data structures from the perspective of reducing memory overhead. So when I started writing a hash map implementation for the library, the following question naturally arose: in a hash table of capacity n, that contains n elements, how many buckets will be empty?

Assuming a hashing algorithm that distributes keys with a high degree of uniformity, there will inevitably be some collisions (implying empty buckets elsewhere) but my expectation was that there would not be not too many. I guessed that maybe 10% of the buckets might be empty, and I was hoping for fewer than that.

When I found that slightly more than a third of the buckets in my hash table implementation were empty, I wish I could say I was immediately shocked into the realization that I’d never bothered to understand the real performance characteristics of the humble hash map. Instead I wasted an hour looking for a mistake — trying different hash functions and a number of different key sets — before realizing that the results I was seeing were so consistent (always hovering tightly around 36.8% of buckets empty) that I must be missing something. Then I did something more intelligent and walked away from my keyboard, picked up a pen and reasoned through the problem. It turns out it’s simple.

Let’s fix our attention on one bucket (it doesn’t matter which they’re identical under the assumption of a uniform hash). What’s the probability that the first key hashes into that bucket? Assuming a uniform hash it’s \(\frac1n\), and the probability it doesn’t is \(1-\frac1n\). This probability is constant for each key we add and furthermore each outcome is independent, so the probability \(P_n\) that none of the \(n\) elements hash to our bucket is given by

\[P_n = \left(1 - \frac {1} {n}\right)^n = \left(\frac {n-1} {n}\right)^n\]

What happens as \(n\) grows large? If limit abuse upsets you, look away now.

\[\begin{aligned} \lim_{n\to\infty} P_n = & \lim_{n\to\infty} \left(\frac {n-1} {n}\right)^n \\ = & \lim_{n\to\infty} \left(\frac {n} {n-1}\right)^{-n} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’+1} {n’}\right)^{-(n’+1)} \\ = & \lim_{{n’}\to\infty} \left(\frac {n’} {n’+1}\right) / \left(1 + \frac {1} {n’}\right)^{n’} \\ = & 1/e \\ \approx & 0.3678 \end{aligned}\]

What this result tells us that if we take a reasonably capacious hash table that chains collisions and which is equipped with an ideal hash function that uniformly distributes all keys, and we add as many records as there are buckets, it’s almost certain that more than a third of the buckets are completely empty. All the years I’ve been stuffing objects into Java HashMaps and I never knew.

I don’t have my volumes of Knuth to hand; I’m sure all this is covered and more and that I really should have known this years ago, but at least it was an enjoyable excursion of ignorance.

Mapping numbers to numbers compactly

Imagine a situation where every word in a large corpus is assigned an arbitrary numeric identifier, and you need a data structure that can quickly return the word-frequency for any particular identifier. If you are not a programmer or you can’t imagine finding yourself in this situation you can safely stop reading now.

I wasn’t in exactly in this situation, but a similar one that shares the same basic constraints:

  • a map that is largely static,
  • the expectation that there will be millions of keys,
  • keys and values that are positive whole numbers of arbitrary size, and
  • the requirement that the map be stored in memory for consistently fast access.

It’s not a run-of-the-mill set of requirements for a data-structure; you certainly won’t find any help in the standard Java libraries, so I thought I’d share my approach.

The underlying concept is simple - sort the map entries into key order and use a binary search on the key to find the entry and return the associated value. This requires O(log n) accesses per map lookup but the problem with this naive approach is memory consumption. Imagining that both keys and values can be accommodated in 64 bit longs, one entry requires 16 bytes. With 10 million entries, we’re looking at 160 megabytes of memory for a data structure which in my application is only subsidiary to a number of larger algorithms. Can we do better?

It turns out that we can take advantage of fact that data frequently exhibit a bias towards small values. Though some values may be very large (eg. there are a small number of very high frequency words) most values are small (eg. most words are infrequent). By padding every value to the same width (eg. 64 bits) we’re wasting a lot of space.

To trim the values we can use a universal code and this will reduce the number of bits required to store most values if most values are relatively short. But now we have a problem: how can we operate a binary search when the entries are variable length?

My solution to this was to use Fibonacci coding. This elegant code has the property that every number’s binary representation ends with 11 and there are no consecutive ones elsewhere present. How does this help?

We encode every key and value using Fibonacci coding and lay them out [key,value,key,value,…] without any padding, in ascending key order. Then we perform our binary search at a bit-level. We examine the middle bit of our data and work outwards from there looking for a pair of consecutive ones. Because of the properties of this encoding, we know that the binary sequence 11 indicates the end of one number’s encoding and we can then decode the next pair of numbers. But how do we distinguish keys from values?

I chose a simple scheme: for each key, add one and double it; for each value double it and add one. So keys are always even and values are always odd and we can easily identify if it’s a key or a value we’ve decoded based on whether it is even or odd.

So the algorithm, slightly simplified, looks like this:

  1. Start with the whole range of binary data (consisting of key-value pairs ordered by key, encoded using a Fibonacci coding, with keys incremented and doubled, and values doubled and incremented)
  2. If the range is empty return null (ie. no match)
  3. Find the midpoint of the range.
  4. Starting from the midpoint scan for a consecutive pair of one bits.
  5. Decode the number starting immediately after the 11 bit sequence.
  6. If the number is odd, decode the next number.
  7. Recover the key by dividing the number by two and subtracting one.
  8. If the recovered key equals the key we are looking for, decode the next number, subtract one, divide it by two and return the result
  9. Otherwise, if the key is greater than the key we are looking for, move the right-hand limit of the range to the start of the key and continue from (2).
  10. Otherwise the key is less than the key we are looking for so move the left-hand limit of the range to the end of the value and continue from (2).

With the right library support, this is very easy to code — my crinch library has excellent support for universal codes.

There may be better approaches (I’d be happy to hear recommendations), but I thought this one was interesting enough to share.

A global postal coding system.

I&#8217;ve barely been at my home computer during the past couple of months as I work towards the conclusion of a project for a multinational company that has now occupied me for over a year.

Writing good software at this scale requires much effort, and sometimes the complexities almost become frustrations; modelling logistics and delivery in countries without postal codes is one such complexity.

And although I haven&#8217;t been at a computer, it hasn&#8217;t stopped me filling my notebooks with ideas that would resolve some of my frustrations. Like this, a system for coding small regions of the globe with a focus on human communicability: readily identified, unambiguous, 6 or 8 character codes which include a checksum that can catch the majority of basic transcription errors.

The fact that it&#8217;s based on a single continuous space-filling curve that loops through every region of the earth&#8217;s surface is a bonus.

A global postal coding system.

I’ve barely been at my home computer during the past couple of months as I work towards the conclusion of a project for a multinational company that has now occupied me for over a year.

Writing good software at this scale requires much effort, and sometimes the complexities almost become frustrations; modelling logistics and delivery in countries without postal codes is one such complexity.

And although I haven’t been at a computer, it hasn’t stopped me filling my notebooks with ideas that would resolve some of my frustrations. Like this, a system for coding small regions of the globe with a focus on human communicability: readily identified, unambiguous, 6 or 8 character codes which include a checksum that can catch the majority of basic transcription errors.

The fact that it’s based on a single continuous space-filling curve that loops through every region of the earth’s surface is a bonus.

Greedy Variance Minimization

Almost four years ago, I discovered/invented/developed* a spatial clustering algorithm that I have since used in a number of projects, but I never found the time to describe it, or publish the source code. I’m sure it could be useful to other developers so this is a silly situation. One which I intend to rectify, starting now, with this, a hasty overview.

Characteristics

First here are some of its characteristics (copied from the original web page):

  • The algorithm places no restrictions on the number of items (N), dimensions (D) and clusters (C).
  • The algorithm scales linearly with the number items clustered. Specifically the worst case time complexity of the algorithm is N⋅D⋅C log C;
  • Memory usage is constant and independent of the number of items clustered. Specifically, the space complexity is C^2 + D⋅C.

Of course, these claims can only be verified after I’ve published a formal description of the algorithm — it’s very possible that my analysis is faulty — or after I’ve published the source code (which is much more likely and probably more useful). Until then, here’s a simple overview of the approach.

I called the algorithm “Greedy Variance Minimization” (GVM for short). It’s so named because:

  • It never looks ahead - a clustering decision is made every time a new point is added.
  • It uses variance as the measure of cluster cohesion - the variance of the points within a cluster is the only attribute considered; the points inside a cluster can be discarded as the algorithm proceeds.
  • It always acts so as to minimize - specifically it attempts to minimize the sum of cluster variances over all clusters.

Description

The algorithm is configured with a maximum number of clusters (C) and the point dimension (D). Initially, no points have been added, so there are no clusters.

The first C points that are added form their own single point clusters. Each point has a mass (which naturally serves as a weighting in most applications) and may be assigned a key (which is entirely under the domain of the application and is not included in these diagrams)

In addition to tracking the individual clusters, the algorithm maintains a heap of all possible cluster pairings. Each pair is assigned a cost equal to the amount by which the variance of the combined cluster exceeds that of the two clusters individually.

At the addition of the C+1th point, and for all subsequent points, the algorithm must choose either to add the point to an existing cluster or to merge two existing clusters and form a new cluster from the new point. To do this, the cost (increase in variance) of adding the point to each possible cluster is computed and compared to the cost of the cluster-pair top of the heap. The lowest cost operation — either adding or merging — wins.

After every operation the heap of cluster-pairs is sifted appropriately and the algorithm proceeds in this manner until all the points have been added.

Benefits

  • The algorithm is linear in the number of points. This is a lifesaver when you want quick results over large datasets.
  • The ‘best current’ clustering is always immediately available (essentially because it’s the core of the algorithm’s state). This means it is possible to get preliminary/progressive results.
  • The keying of points is under application control. This provides a lot of flexibility by allowing the application to trade-off space/time/completeness (see below).

Limitations

  • The algorithm is sensitive to C. The value of C may often need to be larger than the number of clusters in the dataset. This is because it may take time for the algorithm to accumulate sufficient points to seed sensible clusters. If C is too low, the algorithm may commit itself to clustering too early.
  • The algorithm is sensitive to the order in which points are added. If the order of the points, as presented to the algorithm, correlates to one or more coordinates, the algorithm may make bad choices from which it cannot later recover. A large number of highly localized points will create a number of highly localized clusters. As more distant points are added, large numbers of points will be assigned to the wrong cluster and the algorithm may fail to discern the correct clusters at all.

Failures of the first type can be avoided by setting C higher than the anticipated number of clusters, and then, after all the points have been added, collapsing the clusters subject to a maximum permitted total variance (i.e. merging any clusters that the algorithm has kept separate, but which can be regarded as part of the same overall cluster). But performance is sensitive to C; if you set C too high, the algorithm will proceed slowly.

The second type of failure can be avoided by randomly (or otherwise) ordering the points before they are clustered. But this may increase space/time requirements depending on your application.

Flexibility

One thing that makes the algorithm very practical, is the degree of control it affords the application. The algorithm assigns a key to each cluster. The application assigns keys to points, and controls how keys are selected/combined when points are added to clusters or when clusters are merged.

At an extreme, if an application doesn’t assign any keys, then memory usage is minimal since no point-level data is stored. But this has the consequence that, although the algorithm will report the location, size and general shape of each cluster, nothing else can be reported back to the application.

At the other extreme, an application may choose to store “lists of points” as its keys: each point is tagged with its own coordinates, and when clusters are merged the lists are concatenated. This means that the application will receive the coordinates of every point in each cluster (at a cost in time and space).

Other ‘in-between’ strategies are possible too. For example, the city demo uses a single city as the key for a cluster and chooses the most populous city when merging clusters. This is very cheap but quite adequate for that application.


(*) Delete as per your philosophy

A basic logo design combined with some genetically generated textures.

A basic logo design combined with some genetically generated textures.