Latest Tweets

 

Compact Approximators

Compact approximators are a generalization of Bloom filters, see Wikipedia for introductory reading. They replace the boolean values that compose a Bloom filter with the elements of a lattice. Lattices are algebraic structures that occur frequently in mathematics and in computing.

In the informal vocabulary of a programmer, the bit values 0 and 1 form a lattice because you can and them to get the lesser of two values and or them to get the greater of two values. Sets provide another common example of lattice algebra in programming. In this case ‘anding’ two sets is to form their intersection, and ‘oring’ them is to form their union.

In practice, a compact approximator operates something like a map, except that you can’t iterate over the keys and the values must be the elements of a lattice. As with the Bloom filter, it has the advantage that its storage size can strictly bounded. And like a Bloom filter, it’s a probabilistic data structure:

Bloom filters provide a guarantee that, though there may be false positives, there will never be a false negative. This is to say, a Bloom filter may ‘erroneously’ assert it contains an element that was never added, but it will never assert that it doesn’t contain an element that was. How does this property carry over to compact approximators?

Due to their structure, lattices impose a ‘partial ordering’ on their elements. The nature of this ordering depends wholly on the lattice structure. In the case of a lattice generated by the bits 1 and 0, its simply the case that 1 is greater than 0. Or in the case of a lattice generated over the subsets of a set: one element is larger than another if it is a superset of that other element. A compact approximator guarantees that though it may return a larger value than the one you stored for a given key, it will never return a smaller value.

What does this mean in practice?

Suppose you have distributed hash table over modestly sized cluster of machines. The hash table has redundancy, so that key values may be available from more than one machine in the cluster. Assuming, not unreasonably, that identifying which machine can respond with a key’s value is a relatively expensive operation that may involve multiple network requests, it would be useful to cache which machines have values for which keys. A compact approximator could help here.

We create a compact approximator (CA) whose potential keys range over those of the distributed hash table (remember, the CA doesn’t actually persist the keys) and whose values are taken from a lattice over the set of all machine addresses in the cluster (ie. a value is a set of machine addresses). Every time a key value lookup is successfully performed on a machine, its address is added to the CA under that key. Retrieving a key from the CA will return all the machines which have successfully returned values for the key - though it may return a few more. This data can be used as a first best effort to locate an appropriate machine, if the machine doesn’t actually have the data, a full lookup can be performed instead. The expectation is that you will have sufficiently few wrong results from the CA that the time saved by avoiding lookups will dominate the time wasted in misdirected requests.

This is just one simple scenario, there are many; compact approximators were first identified as a way to accelerate unicode text searches. There are a number of algorithms that I’m looking to apply them to so I’ve added an implementation to a Java library for compact data representations that I’m putting together. It is an early, but fully featured implementation.

blog comments powered by Disqus