Crinch has had another minor bump to version 0.7.2. This release fixes an issue that can prevent some character sequences being returned from tries.
A quick post to say that the Crinch libraries have had a minor bump from 0.7 to 0.7.1.
It was motivated by an urgent fix for a bug that impacted one of my clients. The bug was in the incomplete and undocumented record library, so it shouldn’t have any impact for other users of the library at the moment.
The only new aspect to this release is that the source code for each library is now being put into jars and published together with a new distribution jar that aggregates all of the Crinch libraries. This jar is available for download from Google Code.
Another weekend and another code release. I’ve released a new version of my crinch libraries. It now includes a documented API for binary coding. It currently supports a range of universal and non-universal codings. All I/O for the package is directed via the Crinch bits library which makes the library very flexible.
The library currently supports the following universal codings.
All universal codings support arbitrarily large numbers via BigInteger.
In addition to universal codings, the following non-universal codings are supported.
The Huffman implementation in particular has seen lots of work: it accelerates encoding through canonicalization and block reads bits for faster decoding (without a some degree of read-ahead, I’m not sure there’s a faster approach). It can also return a dictionary which contains the minimal state needed to reconstruct the encoding. This can be efficiently transmitted ahead of a compressed message.
A number of helpful classes are provided in addition to the core coding implementations.
ExtendedCoding wraps a coding to provide additionalCodedReader / CodedWriter conveniently pairs an ExtendedCoding with a BitReader/BitWriterCodedStreams provides static utility methods for common encoding and decoding tasks.CharFrequencyRecorder accumulates character frequencies from Strings and other sources of character data; useful for Huffman coding.CodingFrequencies calculates ‘zero-order’ information entropy from data arrays.The API is mostly finalized, though there are some functions which are likely to be added:
writer.chain().encodeInt(x).encodeLong(y).flush().If anyone has an coding they want/need implemented and that isn’t currently supported, you can let me know with a comment.
Crinch provides a number of different BitWriter implementations that can be used to accumulate written bits in memory. They each have advantages in different situations:
I ran a few benchmarks to verify the performance of these classes for (a) writing individual bits (b) writing multiple bits and (c) writing padding bits. I included the NullBitWriter to provide a baseline which gives some indication of the intrinsic API overhead.

As one might expect, BitWriter implementations that operate on greater numbers of bits simultaneously perform slightly worse when only single bits a written. A single ternary operator (writeBit(bit ? 1 : 0) vs write(bit, 1)) accounts for the significant difference in performance between WriteBoolean and WriteBit.

Although the ByteArrayBitWriter again performs well, observe how the performance of writing to BitVector improves significantly as bits are written in larger groups

The poor performance of IntArrayBitWriter for these two operations has highlighted a poorly optimized implementation. BitVector maintains a ‘flat’ performance profile; it provides the only writer that pads 255 bit intervals as quickly as word aligned 256 intervals.
My encodePositiveInt method is now going to allow calls with zero. So how should I change the method name to reflect this?
encodeNonNegativeInt longencodeNonNegative overloadedencodeNonNegInt uglyencodeUnsignedInt misleadingencodePositiveInt inaccurateGoddamit. What do I call this method?
Following on from my post about the design of the BitVector class yesterday, the question was asked, do you really need operations over ranges when you could just use views?
For example, the following two lines of code will have exactly the same result:
v.setRange(100, 200, false);
v.rangeView(100,200).set(false);
Except that the second line will create an intermediate object. Putting aside the fact that not every environment that executes Java code has performance to match the standard Java runtime, the question of whether intermediate views should replace ranges hinges on the degree to which they impair performance.
To ascertain this, I put together a hasty benchmark this evening. It repeats a large number of different operations over a BitVector, using both ranges and views. For stable results, the fastest and slowest third of each run was discarded, and the remainder averaged for each operation.
The tests were performed on a 32 bit dual core machine running
OpenJDK Runtime Environment (IcedTea6 1.11pre) (6b23~pre11-0ubuntu1.11.10.1)
OpenJDK Server VM (build 20.0-b11, mixed mode)
The following chart gives a summary of the results:

It shows there is a clear and consistent cost to the object allocations (about 100ms in each of these tests). Given the performance sensitive nature of bit handling, this isn’t something I’d necessarily want to give away for the benefit of fewer methods.
In addition to these operations, I also measured the performance of the much slower rotate operation. This threw up an interesting anomaly: in contrast to all other operations, creating the intermediate view is always faster!

From the perspective of a developer who doesn’t have visibility of the algorithms applied by the JIT, this is inexplicable. Here is the rotateRange method which is being called:
public void rotateRange(int from, int to, int distance) {
if (from < 0) throw new IllegalArgumentException();
if (from > to) throw new IllegalArgumentException();
from += start;
to += start;
if (to > finish) throw new IllegalArgumentException();
rotateAdj(from, to, distance);
}
In the other case we are calling rangeView() followed by rotate(). Here is one subcall of the rangeView method:
public BitVector duplicate(int from, int to, boolean copy, boolean mutable) {
if (mutable && !copy && !this.mutable) throw new IllegalStateException();
if (from < 0) throw new IllegalArgumentException();
if (to < from) throw new IllegalArgumentException();
from += start;
to += start;
if (to > finish) throw new IllegalArgumentException();
return duplicateAdj(from, to, copy, mutable);
}
Notice the basic similarity in structure. Both methods are doing the same checks except that the rangeView method actually ends up doing more because it then needs to create a new BitVector. Finally, here’s the rotate method that will be called on the newly created view:
public void rotate(int distance) {
rotateAdj(start, finish, distance);
}
These code snippets show that the the rotateRange method cannot do more work than a call to rangeView and rotate combined. So how can it be consistently slower?
I don’t even have the start of explanation for this anomaly (except that it may be a bug or shortcoming in the JIT), but as I observed in my comments yesterday, JITs can be capricious. Writing APIs that anticipate good results from them is fine, but there’s no escaping the fact that it actually passes risk onto the developers who use the library. The size of that risk is a function of performance: the degree to which JIT optimizations influence it and its importance to the developer.
This is a post about the API design of BitVector. I don’t think I’ve ever written a class which carried so many methods and here I try to give some justifications.
Binary data, in the form of an order sequence of bits, is an extremely fundamental software abstraction. Perhaps as a consequence of this there’s frequently a lot of it, and there are often many operations that need to be perform on it.
The large number of useful operations possible over binary data is one reason that BitVector has many methods, but there are others.
Developers spend most of their time processing binary data as byte sequences (generally byte arrays in most languages). Grouping bits into bytes, and in more generally into longer words (such as ints and longs), has the significant benefit of improving performance. An operation on a subsequence of 64 bits could be done with a single operation on a long. Addresses are shorter too; I can’t imagine any developer hankering for bit level memory addresses. And when a given word length neatly accommodates the data, everything is much simpler.
But some situations arise where fixed word lengths just don’t fit. Universal codes are one example: bit-length can vary dramatically from one value to the next, and though lengths are frequently less than a single byte long, they are unbounded. In this case, and in others, there isn’t an easy way to avoid bit-level processing.
So we’re still in the situation where we might have lots of data, and lots of things to do with it, but now have the added pressure of reduced performance. Take the example of a simple loop over a byte array which sets every byte to zero. The same loop at bit level will take eight times longer. What is more, the language (Java in this case) isn’t going to give you any core primitives to help: exposing direct access to byte arrays can boost efficiency, but bit-level operations are necessarily going to be forced through a method call. And, even more significantly for performance, the work of setting a single bit will also require significantly more code than the simple assignment needed for a language primitive like byte.
This pressure on performance significantly changes the trade-offs made when designing a general purpose API for bit-level processing. Whereas, the value of providing a method that, say, performs an element-wise XOR between two byte arrays is questionable (given the simplicity of implementing such a method), the value of the same method at bit level is much higher because very significant optimizations become available.
So performance is another reason that BitVector has so many methods. Simplistically: more specialized methods allow for more optimizations(*). But there is another reason there are so many methods.
Prior to starting development of the bit package, and BitVector in particular. I reviewed my various code bases to see what I needed from such a package. What I found was that my code (which often exposed bit level data in thinly wrapped byte arrays) was frequently forced to make array copies solely to guarantee encapsulation. And additionally, bytes were being copied into other intermediate forms, and then copied back just so that certain operations could be performed by methods that didn’t support the originating data representation.
I realized that, by creating a single class that supported encapsulation and provided these disparate methods, I could eliminate a great deal of the copying in my applications. And when there’s a lot of data, this gives a very large gain in performance.
Now, as one would expect when designing a class which is becoming overloaded with methods, I investigated ways of splitting the functionality into separate classes, and superficially this appears easy - the functionality of BitVector can be split along a number of clear lines and an API looking something like this is obvious:
v.asBits().setRange(0, 5, true);
v.asNumber().intValue();
v.asSequence().firstZero();
But a difficulty arises when BitVectors are aggregated in large numbers. For example, I had one small application that did lots of processing with 128 bit numbers. It stored a great many of these values, (representing them as a class with two long fields). For BitVector to be usable in this context, and others like it, per-instance memory usage becomes a consideration: Every ‘sub-view’ of a BitVector would require one of: storage of a reference to the view in a field and references back OR creating a new view on each call OR something more complicated.
I won’t go into detail, but however you tackle these options, nothing very good comes out of them: memory swells, or uneven performance occurs in ways that a developer wouldn’t anticipate. So I went for the low-tech approach - clear and logical method names together with documentation (still ongoing).
I’ll have to wait and see how it turns out.
(*) I also chose to provide more generalized methods too - this decision is more questionable - and was made on the basis that there is utility in the generalized form for some applications.
I’m pleased to announce that version 0.6 of Crinch has been released. To accompany this release, I’ve added a dedicated Crinch project page on my website.
This is an important release because it’s the first in which any of the Crinch libraries is stable enough to be properly documented and supported. This is the Crinch bits library.
The good news is that the entire bit handling library is now available to Java developers who want to be relieved of the drudgery of efficiently handling data at bit-level. Working with entropy encoding, universal codes, error correcting codes and the like without a rich, optimized and well abstracted API for handling the generated bits is horrible (I speak from experience).
The only bad news is that BitVector has not survived as a standalone class.
Seeing a Gray code being used yesterday, in turn made me think about De Bruijn sequences. Both enumerate all fixed-length words over set of symbols (frequently the binary symbols 0 and 1). But where Gray codes change one symbol at a time to generate the next word, De Bruijn sequences shift the whole word by one symbol.
I’ve posted about De Bruijn sequences before. They have a special significance for me because I independently discovered them when I was 18 or 19 together with an algorithm for generating them (which I later learned to be the “prefer one” algorithm).
It now occurs to me that this algorithm provides a good demonstration of how BitVector can make code simple and concise. The following code generates a binary De Bruijn sequence, see the links above for pictures and pointers to theory.
Observe how the use of the memory BitVector as a Set clarifies its role within the method. Also, since BitVector packs its bits into primitives, memory usage is essentially minimal. Note that the method returns extra bits to make it simpler to enumerate all the sequence values. These extra bits might have complicated its use in some applications, but because BitVector provides views to efficiently window in on specific bits, the concern is alleviated from the implementation.
It’s been a while since I first shared my BitVector class and in that time it has grown larger and even more useful. Yet it remains a self contained class that’s easy to use in any Java project. The source code is available under the Apache 2.0 licence.
It’s killer benefit for me is that it makes it really easy to cast bits from one form into another and to operate on them, and safely pass them around: all very cheaply. It’s so useful, I’d really like to see other developers using it, so I’ve started trying to write a code based introduction to the class.
I’m not even half done, but I think the results so far are worth sharing. Don’t be intimidated! There are lots of methods but any given application will only need a small number of them. The benefit of locating all of the methods on a single class is that it makes things convenient & efficient.