As promised in an earlier post about intents I’m sharing a tip on how to return dynamically generated resources from a ContentProvider — though what’s written here applies equally well to an Activity returning a result.
The basic scenario is this: You have implemented a ContentProvider and you want to use it to return something that is too big to fit into a cursor, say an image. It’s clear from the relevant Android platform javadocs, though perhaps not from the general ContentProvider documentation, that this is done by implementing openFile and returning a ParcelFileDescriptor.
But now suppose that the image resource is being rendered on the fly. How do you implement openFile?
As per the contact of openFile you need to return a ParceFileDescriptor which means that the resource must be returned via an open file (or possibly a socket, which we’ll ignore). This means that the image must first be written to a file, which is then opened as a ParceFileDescriptor. The interesting question here is when do you delete the file?
You can’t wait until the calling process has finished reading the file because (a) you won’t get notified when that occurs and (b) it may never exhaust the stream anyway. You can’t rely on the calling process to delete the file for you either, also, the whole idea behind wrapping the resource in this way is to guard it from direct access by other processes. You could delete the file after a fixed time period — assuming that the caller will have finished reading the file by that time. But this is unnecessarily complex, because the answer turns out to be very simple:
You delete the file immediately after you’ve opened the ParcelFileDescriptor but before you return it from the openFile method. Here are the relevant lines of the PlantProvider class from my Daisy Garden application; the try/finally block is there to ensure the file gets deleted no matter what — the last thing we want to do is leak precious storage space.
The reason this works is down to the way that Linux filesystems operate: directories maintain links to files, when a process opens a file a new link is created, closing a file or removing it from a directory removes a link. When there are no links to a file, the file is deleted.
So by opening the file in our application, we create a link to it. Then ‘deleting’ the file actually unlinks it, but the file won’t really be deleted until the file descriptor is discarded (ie. the last link is removed). This will happen automatically at some point after the calling application has finished accessing the file and any associated ParcelFileDescriptor objects have become garbage.
So it’s a very simple solution, but only if you know something about how the filesystem can be expected to work on Android devices.
One of my favourite things about the Android platform is how easily application can share data using Intents and ContentProviders. They provide a universal basis for Android applications to interoperate and I try to expose them whenever it’s practical, even for something as singular as my Daisy Garden application.
I’ve just published a new version of Daisy Garden and with it, Intents and ContentProviders that enable the users of other Android applications to choose flowers they have grown in Daisy Garden — assuming that the developers choose to support it. And the great thing is that supporting third-party intents is very easy to do (and even if no other developers use them, they will be heavily used to integrate my own applications).

The Java code embedded below is a self-contained Activity that responds to a tap on the screen by inviting the user to pick a flower and then displaying the flower as a bitmap. There are two basic parts: forming an Intent for the user to pick a flower and forming a Uri to the flower’s bitmap. Both of these aspects can be customized; messaging to the user, and the bitmap size for example.
It’s very simple — more than half the code is boilerplate and comments. And this is what’s really good: in a well structured Android application this functionality is simple to expose too — I’d go as far as saying it’s almost automatic, so there is very little reason not to share (non-sensitive) data between applications.
Having said all this, there is an interesting technical detail here that meant that I had to give some thought to exposing the bitmaps. Daisy Garden is generating the flower bitmaps “on-the-fly” when they are requested from the ContentProvider and it’s not obvious how to share these unless you know a little about how Linux filesystems work. I’ll try to find some time to outline the technique in a subsequent post.
Java code follows:
After my previous post comparing Java’s regular random number generator to its cryptographic SHA1PNG relation, Benjamin Manes suggested using double-hashing has a way of generating multiple and provided an implementation for me to test.
The initial results appear to be excellent. It yielded by far the shortest test time (approx. 4s compared to 18s for the standard Java random number generator) and in this limited test exhibited a distribution that is almost indistinguishable from that of SHA1PNG.

Bloom filters are very easy to implement. Unfortunately, there doesn’t seem to be very much information on the web that describes how actually generate all those independent hash functions you’ll need, especially over arbitrarily typed key values.
A standard Java hash value is an int. If you need to generate say 10 hash values over a range of 1 million indices an int just wont do - you just don’t have enough bits (by my mental arithmetic you’d need about 200bits). But short of requiring that any developer who wants to use a Bloom filter implements their own good quality multi-hash implementation, the int returned by Object.hashCode() is all we have.
My approach in the past has been to use a SecureRandom instance seeded with the object hash code (or byte data from the object if its available), making repeated calls to nextInt(int) to obtain as many hash values as I need. This is almost the best you can do to create a multiplicity of well-distributed hash values within a specified range.
It is nevertheless slow, SecureRandom objects are expensive to create and use. An obvious alternative is to consider using a regular Random object, the questions are:
SecureRandom?This is based on a small test application I wrote that inserted increasing numbers of elements into a Bloom filter while estimating the false-positive rate at each step. Up to 10,000 integers were stored using 4 hashes in a basic Bloom filter containing approximately 57,000 bits. The test was repeated using both a regular Random implementation and a SecureRandom “SHA1PRNG” implementation.
The median runtime of the SecureRandom test was approximately 61 seconds, whereas the corresponding time for the Random test was approximately 18 seconds.
The chart below plots, for both classes, the increase in false positive rate that occurs as more elements are added to the filter. The plots are similar:

The source code used to perform these tests, and the resulting data used to compile these charts is available.
The answers appear to be:
So it is plausible that using the Random class may prove an effective way of extending the standard Object.hashCode() values to meet the requirements of probabilistic data structures such as Bloom filters.
Compact approximators are a generalization of Bloom filters, see Wikipedia for introductory reading. They replace the boolean values that compose a Bloom filter with the elements of a lattice. Lattices are algebraic structures that occur frequently in mathematics and in computing.
In the informal vocabulary of a programmer, the bit values 0 and 1 form a lattice because you can and them to get the lesser of two values and or them to get the greater of two values. Sets provide another common example of lattice algebra in programming. In this case ‘anding’ two sets is to form their intersection, and ‘oring’ them is to form their union.
In practice, a compact approximator operates something like a map, except that you can’t iterate over the keys and the values must be the elements of a lattice. As with the Bloom filter, it has the advantage that its storage size can strictly bounded. And like a Bloom filter, it’s a probabilistic data structure:
Bloom filters provides a guarantee that, though there may be false positives, there will never be a false negative. This is to say, a Bloom filter may ‘erroneously’ assert it contains an element that was never added, but it will never assert that it doesn’t contain an element that was. How does this property carry over to compact approximators?
Due to their structure, lattices impose a ‘partial ordering’ on their elements. The nature of this ordering depends wholly on the lattice structure. In the case of a lattice generated by the bits 1 and 0, its simply the case that 1 is greater than 0. Or in the case of a lattice generated over the subsets of a set: one element is larger than another if it is a superset of that other element. A compact approximator guarantees that though it may return a larger value than the one you stored for a given key, it will never return a smaller value.
What does this mean in practice?
Suppose you have distributed hash table over modestly sized cluster of machines. The hash table has redundancy, so that key values may be available from more than one machine in the cluster. Assuming, not unreasonably, that identifying which machine can respond with a key’s value is a relatively expensive operation that may involve multiple network requests, it would be useful to cache which machines have values for which keys. A compact approximator could help here.
We create a compact approximator (CA) whose potential keys range over those of the distributed hash table (remember, the CA doesn’t actually persist the keys) and whose values are taken from a lattice over the set of all machine addresses in the cluster (ie. a value is a set of machine addresses). Every time a key value lookup is successfully performed on a machine, its address is added to the CA under that key. Retrieving a key from the CA will return all the machines which have successfully returned values for the key - though it may return a few more. This data can be used as a first best effort to locate an appropriate machine, if the machine doesn’t actually have the data, a full lookup can be performed instead. The expectation is that you will have sufficiently few wrong results from the CA that the time saved by avoiding lookups will dominate the time wasted in misdirected requests.
This is just one simple scenario, there are many; compact approximators were first identified as a way to accelerate unicode text searches. There are a number of algorithms that I’m looking to apply them to so I’ve added an implementation to a Java library for compact data representations that I’m putting together. It is an early, but fully featured implementation.
Across various software libraries I’ve produced, I have a small number of ad-hoc Bloom filter implementations. What started out as a small task to rationalize them into a single more general implementation has turned out to be trickier than I expected.
Already having a target interface for the BloomFilter implementation, and though equipped with my powerful BitVector class, it turns out that the real barrier to producing a general Bloom filter implementation is producing a useful API that generalizes Object.hashCode() efficiently.
This task, one I expected to be simple, has bogged me down, and I still don’t think much of my API: there are many infidelities. One of the core requirements is that the API needs to permit hash values to be accessed quickly (say passing an Object in and getting an int back) because hashing is often performance critical. Conversely, some hashes (say cryptographic hashes like SHA-1) need to return values larger than any Java primitive. You can’t accommodate these into a single method signature. In addition, many useful data structures require the generation of multiple hashes over a specific numeric range (Bloom filters are not the only example, Cuckoo hashes provide another), even though the common case will probably remain creation of single hash values.
It’s early days, but the hashing package source-code is available.
At present it defines a core Hash interface that can generate individual hash values for objects. This is extended by MultiHash which can generate a HashList to return multiple hash values. Each Hash declares a HashRange which is the range that the hash values it generates may span. To facilitate the implementation of general hashing algorithms, a HashSource interface provides a standard way to decompose Objects into byte streams. Finally a Hashing class provides static utility methods.
The source code for the Bloom filter implementation is also available
Prior to implementing a BloomFilter along the lines of the interface described in the Guava project issue tracker, I’ve reviewed my sprawling personal collection of Java libraries to identify the all of the ad-hoc bit-processing related code I’ve written with a view to unifying it all into something more reusable.
To my surprise, much of the code didn’t actually deal with tricky bitwise processing, but with the practicalities of efficiently moving bit data between different (bytes, ints, longs, byte arrays, IO streams, Strings, BigIntegers etc). All of those bits have to come from somewhere I guess.
Of course, there was plenty of bit twiddling, but another obvious sticking point was supporting competing requirements for mutability and immutability, as is common when one is seeking to design APIs that are both robust and efficient.
So I’ve spent a bit of time combining everything I need for various binary processing tasks (bitwise operations, (im)mutability control, conversion to standard Java types) and the result is a single stand-alone class called BitVector.
The source code for the class is available under the Apache 2.0 licence from:
The class is mostly complete for my needs, though I haven’t yet begun to employ it in any of my projects.
For a better idea of what the class provides, the class Javadoc is included below.
BitVector stores fixed-length bit sequences of arbitrary length and provides a number of bit-wise operations, and methods for exposing the bit sequences as established java types.
In keeping with Java standards, bits are operated-on and exposed as big-endian. This means that, where bit sequences are input/output from methods, the least-significant bit is always on the right and the most significant bit is on the left. So, for example, the toString() method will contain the most significant bit in the character at index 0 in the string. Naturally, In the cases where this class is used without externalizing the bit representation, this distinction is irrelevant.
A consequence of this is that, in methods that are defined over ranges of bits, the from and to parameters define the rightmost and leftmost indices respectively. As per Java conventions, all from parameters are inclusive and all to parameters are exclusive.
Instances of this class may be aligned (see isAligned() and aligned(). Many operations will execute more efficiently on aligned instances. Instances may also be immutable (see isMutable(), mutable() and immutable()). In addition to a number of copying operations, the class also supports views; views create new BitVector instances that are backed by the same bit data, this allows applications to expose mutable BitVector instances ‘safely’ via immutable views.
The class extends Number which allows it to be treated as an extended length numeric type (albeit, one that doesn’t support any arithmetic operations). In addition to the regular number value methods on the interface, an bigIntValue() method is available that returns the BitVector as a positive, arbitrarily sized integer. Combined with the fromBigInteger() method, this allows, with some loss of performance, a range of arithmetic calculations to be performed.
The class also implements Iterable and provides methods to obtain ListIterator as one would for a List, but stops short of implementing the List interface. This allows a BitVector to be used with a range of Java language constructs and standard library classes.
The class is Serializable and Cloneable too (clones are shallow and essentially behave as a view of the original instance). For instances where more control over serialization is needed, read(InputStream) and write(OutputStream) methods are available, though better performance may result from calling toByteArray() and managing the writing outside this class.
In addition to all of the above methods outlined above, a full raft of bitwise operations are available, including:
Most such methods are available in both an operation specific version and operation parameterized version to cater for different application needs.
Performance should be adequate for most uses. There is scope to improve the performance of many methods, but none of the methods operate in anything more than linear time and inner loops are mostly ‘tight’. An implementation detail (which may be important on some platforms) is that, with some exceptions, none of the methods perform any object creation. The exceptions are: methods that require an object to be returned, floatValue(), doubleValue(), and situations where an operation is applied with overlapping ranges of bits, in which case it may be necessary to create a temporary copy of the BitVector.
The class is marked as final to ensure that immutable instances can be safely used in security sensitive code (eg. within a PrivilegedAction).
Building on the perfect minimal hash I’ve developed for Java String objects, I’ve developed an memory efficient implementation of Java’s Map interface.
View the map source code in SVN
The map implementation is not general purpose, it’s specifically targeted to situations where:
Even given these constraints, I frequently encounter situations where all of these apply:
Map for substitution into a query.Proxy implementations of setter/getter interfaces that are backed by a Map of values.The implementation stores the set of valid keys in an object that can be shared between ‘like’ maps. The keys are hashed perfectly so that the map values can be stored directly into a single object array; no entry objects are required. This means that the memory required to store large numbers of similar maps can . Furthermore, none of the basic map operations (put/get/contains etc.) performs any memory allocation. This means that for many applications both memory overhead and object churn is minimal.
From the javadocs:
Unlike regular
Mapimplementations, there is no default, or copy, constructor. Instead the map must be constructed around a set of possibleStringkeys. Only the keys supplied to a constructor may be used in the maps it constructs. Re-using constructors (by repeatedly using them to createParameterMapinstances) minimizes the memory required to store the map. Furthermore, the only map operations that allocate new objects are those methods that required to do so by theMapinterface; in-short, object creation strictly minimized. This makes the class ideal for environments where reducing GC overhead is important.
The source code is under the Apache 2.0 licence, built with Maven, and is available from my website’s google code project:
A minimal perfect hash is simply a bijection between a set of N objects and the first N integers. See the wikipedia page about perfect hashing for more information.
I’m looking to produce a fast and memory efficient map implementation over String keys and I want to use a perfect hash to pack values without risking collisions.
Casting around online for a Java implementation led me to Sux4J (this rather unpromising sounding library looks excellent and I’m going to take a closer look at it soon) but the implementations are too generic for my purposes.
So I’ve fashioned my own minimal perfect hashing algorithm for Java strings.I don’t know anything about the theory of such algorithms, but my approach was to build a data-structure as follows:
The gist of this approach is outlined visually in the following diagram. Here, the hashes for the strings “cat” and “dog” collide, so a small decision tree needs to be consulted in the case that either of these two are hashed to confirm which value should be returned.

There are some nuances in the actual implementation. For example, Strings are actually ordered by length and character sequence too; this assists the building of fast decision trees. Also each duplicated string stores its offset within the span to avoid scanning back to find the start of the span (I think there are better space/time trade-offs around this that I may investigate in the future.).
However, these complexities only come into play when there are collisions of the String hash function. The common case, even for fairly large numbers of strings, is that there are no collisions. This reduces the time complexity to a binary search over the integer array of hashcodes — not as fast as only taking the hashcode, but still fast. And even the exceptional case is still fairly fast since it typically needs to test candidate string’s length to determine which minimal hash value it should be assigned.
Note that the algorithm is designed to examine as little about the string as is necessary to determine the perfect hash value (this helps to make it fast). Furthermore, the algorithm does not keep a reference to the list of potential candidates for hashing (this helps to reduce memory overhead). As a consequence, the algorithm cannot necessarily distinguish valid candidates; though they are frequently identified as such, invalid string arguments may generate valid hashes.
The source code for this implementation is fully documented and is under the Apache 2.0 licence. It is available through my website’s Google code project
The project that’s been taking up so much of my time recently is to architect a complex B2B e-commerce system. This naturally requires some form of “monetary abstraction” which is something that Java doesn’t provide (quite a surprise really, when you think about some of the classes that have made it into the system libraries over the years).
This is also something that the Java community doesn’t appear to have addressed in any significant way. Faced with this situation on the current project, I contributed my own small collection of classes for dealing with monetary amounts via the expedient of open-sourcing them and then treating them as I would any other third-party library.
This is a one line example that just about sums-up the whole library. Here’s how you make $100:
new MoneyType(Locale.US).money(10).calc().multiply(BigDecimal.TEN).money();
A page containing all relevant links to downloads, documentation etc. is available on my website: