Latest Tweets

 

Identifying uniqueness in O(n log α) bits

Note that my analysis of the algorithm here is in no way complete. This serves me as documentation; it’s public because I hope it may be of use to others. Though I have implemented this algorithm and am employing it daily, what follows may be inaccurate or just plain wrong.

Problem

Given a large table of records, stored on disk, how do we verify that every record is unique, (ie. that they are all mutually distinct) in a way that’s efficient and fast.

Possible Approaches

Let \(L\) be an external (ie. not in main computer memory) list containing \(n\) elements \((e_1 e_2…e_n)\). We are looking to demonstrate that every element of \(L\) is distinct, or to demonstrate otherwise, by exhibiting any element that occurs at more than one index in the list. We assume that the elements of \(L\) can be consistently read into memory any number of times; that \(L\) is persistent.

Let \(m(e)\) be the size of element \(e\)’s in-memory representation measured in bits. A straightforward solution to the problem is to demonstrate that the ordered elements of \(L\) are strictly monotonic. The memory required to perform this, without resorting to an external sorting algorithm is at least \(\Sigma_{i=1}^n m(e_i)\).

In theory, one can demonstrate that every element is distinct by reading the elements of \(L\) into memory \(n\) times; each time storing only \(e_i\), discarding all preceding elements and comparing it to each subsequent element. Though only \(2.\max_{i=1}^n m(e)\) bits of memory are required in such an approach, it is generally impractical due to the slowness of external access (up to \(n\) passes over \(L\) are required), and wasteful since since it is almost always the case that \(\max_{i=1}^n m(e_i)\) is significantly less than the total available memory \(M\).

Nevertheless, the approach can be adapted into a practical one by not limiting ourselves to drawing one element at a time from the list, but instead forming successive in-memory sets of \(l\) elements from \(L\). Let \(\alpha=\Sigma_{i=1}^n m(e_i)/n\) be the average bit-size of an element in \(L\) then, assuming \(\alpha\) can be readily estimated, \(l\) can be chosen to maximize usage of the available memory by choosing \(l=\lfloor \frac M \alpha \rfloor\). In the worst case, when all the elements of the list are unique, this requires \(\lceil \frac n l \rceil\) passes.

Algorithm

An alternative approach can be used to verify that every value of \(L\) is distinct using less memory and requiring (with a high probability) only two passes over \(L). The memory required is approximately:

\[\frac n {\log^2 2} \left(1 + \log(\alpha.\log^2 2) \right)\]

The general idea is to make two passes over the list. The first pass uses a Bloom filter to identify probable duplicates. Any recurring element will be identified by the Bloom filter and added to a candidate set which, due to the probabilistic nature of the Bloom filter, may also contain some elements that do not recur. The second pass discards the Bloom filter and confirms whether any of the elements in the candidate set are real duplicates and not just false positives.

The steps of the algorithm are:

  1. Create a Bloom filter \(B\) of with size \(m\) and number of hash functions \(k\) given by \[m=\frac {n\log(\alpha.\log^2 2)} {\log^2 2}\] \[k=\frac {\log(\alpha.\log^2 2)} {\log 2}\]
  2. Create an empty set \(C\) (the candidate set).
  3. Iterate over the elements of the list: if the element is not matched by the \(B\) then add it to \(B\), otherwise, if the element is not in \(C\) add it to \(C\), otherwise terminate; the element is a duplicate
  4. Discard \(B\) and create a new set \(W\) (the witnesses set).
  5. Iterate over the elements of the list: if the element is not in \(C\) ignore it, otherwise if the element is not in \(W\) add it to \(W\), otherwise terminate; the element is a duplicate
  6. The list contains no duplicate elements; terminate.

Sketch Analysis

The peak memory required \(M\), is given by the number of bits \(m\) in the Bloom filter \(B\) plus the size of the candidate set \(C\) in bits. Assuming that the false positives of \(B\) predominate, the elements of \(C\) will consist mostly of these false positives which estimates the size of \(C\) at: \[|C| = n\left(1 - e^{kn/m}\right)^k\]

Choosing approximately \(k=\frac m n \log 2\) hash functions for \(B\) is expected to be near-optimal, and this simplifies our estimate for the size of \(C\) to: \[|C| = n.2^{-\frac m n \log 2}\]

And our estimate for the memory required becomes: \[M=m + \alpha.n.2^{-\frac m n \log 2}\]

We want to choose the \(m\) which minimizes this value, differentiating and solving for zero gives: \[m=\frac {n\log(\alpha.\log^2 2)} {\log^2 2}\]

from which the Bloom filter parameters and consequently the estimated memory requirements are derived.

Notes

As presented here, the algorithm doesn’t adequately address the situation where a large number of elements occur exactly twice. In the extreme case where this is true of every element in the list, the candidate set would grow to half the size of the list; it would contain every distinct element.

In practice, this situation is easily avoided. Assuming satisfactory hashing, the number of false positives in the candidate set will not greatly exceed its estimated peak size. If the algorithm detects that the candidate set has grown significantly beyond expectations, the first pass can be terminated prematurely and a search for a witness element begun with a very high probability of finding one.

In the extremely unlikely case that every element in the oversized candidate set was a false positive, the algorithm can resume the first pass, clearing the candidate set and populating it only beyond the point at which the first pass was terminated; in this way the algorithm is guaranteed to make progress and will eventually terminate.

  1. tomgibara posted this
blog comments powered by Disqus