Mark Davis and Markus Scherer
One must have studied a great deal to know but little. — Montesquieu
(from Pensées diverses)
BOCU is a general compression format for Unicode. It has a number of very useful characteristics compared to either UTF-8 or SCSU.
The principal form of BOCU, used in most cases, is BOCU-1.
- Compared to UTF-8
- Compared to SCSU
In addition, variants of BOCU allow for environments where the usable byte values are restricted.
The basic structure of BOCU is simple. In compressing a sequence of code points, you subtract the last code point from the current code point, producing a signed delta value that can range from -10FFFF₁₆ to 10FFFF₁₆. The delta is then encoded in a series of bytes. Small differences are encoded in a small number of bytes; larger differences are encoded in a successively larger number of bytes.
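For instance, here is a minimal sketch of the core differencing step, before any of the refinements described below (`encodeDelta` is a hypothetical stand-in for the byte encoding that the rest of this section develops):

```java
import java.io.ByteArrayOutputStream;

// Minimal sketch of the differencing step; encodeDelta is a
// hypothetical stand-in for the byte encoding described below.
static void compress(int[] codePoints, ByteArrayOutputStream output) {
    int last = 0;
    for (int cp : codePoints) {
        int delta = cp - last; // signed; in the range -0x10FFFF..0x10FFFF
        encodeDelta(delta, output);
        last = cp;
    }
}
```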
So far, this is by no means new. Differencing is an old technique, and has been used many times before (for example, see "Ultracode: A Barcode Symbology Encoding Unicode 2.0" at http://www.unicode.org/iuc/iuc10/program.html).
The first enhancement on top of this basic differencing technique is to choose the lead (first) byte of each sequence so that the binary order is correct. This is done by clustering the single-byte differences around 128, then moving outward to the double-byte differences, then the triple-byte ones. The following table shows sample values for how this could be done.
The Lead Bytes are the values of the first byte of the sequence used to represent the delta. The Number of Leads indicates how many such values there are. The Byte Count indicates the total number of bytes in the sequence, including the lead byte and any trailing bytes (for the single-byte forms there are no trail bytes). The values are for illustration; they are not the actual values used in BOCU.
Deltas | Lead Bytes | No. of Leads | Byte Count |
---|---|---|---|
2F40..10FFFF | EF..FF | 17 | 3 |
40..2F3F | C0..EE | 47 | 2 |
0..3F | 80..BF | 64 | 1 |
-40..-1 | 40..7F | 64 | 1 |
-2F40..-41 | 11..3F | 47 | 2 |
-10FFFF..-2F41 | 00..10 | 17 | 3 |
The values need to be chosen so that small alphabets fit in the single-byte range, while the double-byte range covers at least the 20,902 CJK Ideograph values. The sequences must, as we said before, cover the range of values from -10FFFF₁₆ to 10FFFF₁₆. Note that this is different from UTF-8, for example, which only needs to cover half that range.
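As an illustration, classifying a delta with the sample boundaries above (again, illustrative values, not the actual BOCU parameters) could look like this:

```java
// Map a delta to its byte count, using the sample table above.
static int byteCountFor(int delta) {
    if (delta >= 0x2F40)  return 3; // leads EF..FF
    if (delta >= 0x40)    return 2; // leads C0..EE
    if (delta >= -0x40)   return 1; // leads 40..7F and 80..BF
    if (delta >= -0x2F40) return 2; // leads 11..3F
    return 3;                       // leads 00..10
}
```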
However, this needs a further enhancement. We form our delta by looking at the last code point, which we will call the base value. Look at the single-byte values in the above table, for example. They range from 0 to 63 (3F₁₆) on the positive side, and from -1 to -64 (-40₁₆) on the negative side. While most small alphabets fit within a range of 128 values, two of their characters may still be more than 64 code points apart. So whenever the distance between the base code point and the new one exceeds the difference that can be stored in a single byte, we would end up using two bytes.
Since the vast majority of small alphabets are aligned on 128-code-point boundaries, the solution here is to adjust the base to always be in the middle of each 128-code-point range. This is done with:

```java
baseCp = (baseCp & ~0x7F) + 0x40;
```
With this, values from xxxx00 to xxxx7F become xxxx40, and values from xxxx80 to xxxxFF become xxxxC0.
When we then compute the difference from the adjusted base code point, any value within the same 128-code-point block is reachable with a single byte.
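For instance (a worked example, using Greek as an illustration): every code point in the Greek block 0380₁₆..03FF₁₆ adjusts to the same base, so every other code point in that block is a single-byte delta away.

```java
// Worked illustration of the half-block adjustment.
int baseCp = 0x03B1;                    // GREEK SMALL LETTER ALPHA
int adjusted = (baseCp & ~0x7F) + 0x40; // == 0x03C0, the middle of the block
int delta = 0x03C9 - adjusted;          // GREEK SMALL LETTER OMEGA: delta == 9
```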
The same logic can apply to the two-byte case. In particular, we would like sequences of CJK characters to take only two bytes each. We can do this with special range checks. For an example, see the sample code below.
The second enhancement is to use more than just the last code point in computing the base value. Many non-Latin alphabets use spaces between words. As long as the text stays in that alphabet, the characters are encoded in a single byte. However, every time a space is hit, it forces a "long jump" down to pick up 0020, then another "long jump" back up to the alphabet. BOCU saves half of this cost by looking back at more than just the last code point. Instead, the last three code points are used. If the adjusted values of any two of them are the same, then that value is used. Otherwise, the adjusted value of the last code point is used. As it turns out, this is simple to implement:
```java
int last = simpleAdjust(lastCp);        // adjusted value of the last code point
int result = wasSame ? prelast : last;  // reuse the repeated value when two of
                                        // the last three adjusted values match
wasSame = prelast == last;              // remember whether the last two matched
prelast = last;
return result;
```
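To see where the savings come from, here is a hand trace of this heuristic on alphabetic text with spaces (an illustration, assuming the half-block adjustment above, under which Greek letters adjust to 03C0₁₆ and the space to 40₁₆):

```java
// Hand trace (illustration only):
//   encode 'β' after 'α': base 0x03C0 -> small delta
//   encode ' ' after 'β': base 0x03C0 -> one long jump down to 0x0020
//   encode 'γ' after ' ': the adjusted values of the two code points
//                         before the space were both 0x03C0, so the
//                         base stays 0x03C0 -> small delta again
// Only the jump down to the space costs extra bytes; the jump back up
// is avoided, which is where half the cost is saved.
```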
Once these adjustments are made, we can use a modified distribution of lead bytes. This provides for 128 single-byte sequences, and over 31 thousand (2 × 62 × 256 = 31,744) double-byte sequences.
No. Leads | 1 | 1 | 62 | 64 | 64 | 62 | 1 | 1 |
---|---|---|---|---|---|---|---|---|
Length | 4 | 3 | 2 | 1 | 1 | 2 | 3 | 4 |
Deltas Handled | 16,777,216 | 65,536 | 15,872 | 64 | 64 | 15,872 | 65,536 | 16,777,216 |
Even with all of these manipulations, the resulting byte sequences still have the same binary order as the original code point sequences. Let's see how this works. Suppose we have two code point sequences C and D which are identical up to and including code point number n, and differ at code point n+1. Since the sequence C[0]..C[n] equals the sequence D[0]..D[n], the byte sequences generated from them will be the same.
So now look at the code points C[n+1] and D[n+1]. Since C[n] = D[n], baseCp will also be the same for both sequences. This is true even if we adjust baseCp as above: as long as it is completely determined by the sequence of preceding code points C[0]..C[n] (which equals D[0]..D[n]), we can choose whatever value we want. Subtracting the baseCp value gives us two delta values x and y, and x > y if and only if C[n+1] > D[n+1].
From x and y we will generate two byte sequences bx and by. By the way in which we assign values, described above, if bx and by have different lead bytes, then those bytes are in the correct binary order and will correctly determine the order of the sequences. If bx and by have the same lead byte, then they must have the same number of bytes. In that case, one of the trailing bytes will differ, and the first differing trail byte determines the correct order.
BOCU can be extended to cover two other interesting cases:
There are situations where the actual byte values used may be restricted, as mentioned above. To adapt BOCU to those cases, we may need to modify the number of lead bytes to still get good coverage of the delta ranges. For example, if we were working with a very limited repertoire of 37 allowable bytes, we might use the following table:
No. Leads | 1 | 0 | 8 | 2 | 7 | 8 | 2 | 8 | 0 | 1 |
Length | 5 | 4 | 3 | 2 | 1 | 1 | 2 | 3 | 4 | 5 |
Deltas Handled | 1,874,161 | 0 | 10,952 | 74 | 7 | 8 | 74 | 10,952 | 0 | 1,874,161 |
As before, we try to make sure that CJK and the small alphabets are covered. This is done by allocating 148 two-byte values for small alphabets and 21,904 three-byte values for CJK coverage. We distribute the rest so that most of the BMP is covered in four bytes, with the five-byte forms ensuring that the entire code point range is covered. The remainder goes to single bytes, to pick up as many small-delta cases as we can.
When computing the actual bytes, the simplest case is where the usable byte values are in a contiguous range. For example, in ICU collation the byte values 03..FF are usable (00, 01, and 02 cannot be used). In that case, we find the number of allowable values (count = 253) and an offset to the first value (offset = 3).
To compute the trailing bytes, we use successive divisions and modulos, going from the last byte to the first byte:
```java
output[m] = delta % trailCount + trailOffset;
delta /= trailCount;
```
After all the trailing bytes are output, the remaining delta is used in computing the lead byte.
If the allowable bytes are discontiguous, then instead of using an offset, we use an array, such as:
```java
output[m] = map[delta % trailCount];
delta /= trailCount;
```
The map function (or the offset) is used to get the proper byte values. For more details on this, see the code sample below.
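As a self-contained sketch of the contiguous case (using the ICU collation values above, count = 253 and offset = 3; choosing the lead byte from the remaining delta is shown in the sample code at the end):

```java
// Trail-byte arithmetic for the contiguous case (usable bytes 03..FF).
// Assumes delta has already been offset to be non-negative, as in the
// sample code at the end of this document.
static int[] trailBytes(int delta, int trailByteCount) {
    final int trailCount = 253, trailOffset = 3;
    int[] output = new int[trailByteCount];
    for (int m = trailByteCount - 1; m >= 0; --m) {
        output[m] = delta % trailCount + trailOffset; // last byte first
        delta /= trailCount;
    }
    // the remaining, divided-down delta would now select the lead byte
    return output;
}
```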
Both UTF-8 and UTF-16 have the very important property of self-synchronization, also called non-overlap: the lead bytes (lead 16-bit units, in the case of UTF-16) do not overlap in value with the trail bytes. This allows a number of important characteristics:
BOCU can be modified to also have characteristics (1) and (2), if desired. While this is not generally a design goal for compression, there may be environments where this is necessary. The downside is that the compression is not nearly as good.
Self-synchronization can be done by allocating certain byte values to leads and the rest to trails. In the general case, this means having different leadMap and trailMap arrays, and redistributing the lead bytes and the number of trail bytes to get the correct coverage. For an example of an extreme case, suppose that we take the limited repertoire of 37 allowable byte values. We could allocate 22 values to trail bytes, and 15 values to lead bytes.
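To check that such a split can still reach every delta (a back-of-the-envelope calculation; the particular distribution here is an assumption, not from the source): a lead byte assigned to n-byte sequences covers 22ⁿ⁻¹ deltas, so, for example, five lead values assigned to five-byte sequences would cover 5 × 22⁴ = 1,171,280 deltas, more than the 10FFFF₁₆ = 1,114,111 needed in each direction.

```java
// Coverage of a group of lead bytes: leads * trailCount^(length - 1).
static long coverage(int leads, int sequenceLength, int trailCount) {
    long deltas = leads;
    for (int i = 1; i < sequenceLength; ++i) deltas *= trailCount;
    return deltas;
}
// coverage(5, 5, 22) == 1_171_280, which exceeds 0x10FFFF == 1_114_111
```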
The following describes test results when comparing BOCU against UTF-8 and UTF-16. The sample files are taken from http://www.unicode.org/unicode/standard/WhatIsUnicode.html (for the plain files) and http://www.unhchr.ch/udhr/navigate/alpha.htm (for the files marked with "HR"). The columns for BOCU, UTF-16 and UTF-8 contain the respective number of bytes of storage for each form for that file. The column heading B:16 is the size of BOCU compared to UTF-16, and B:8 is the size compared to UTF-8.
File | BOCU | UTF-16 | UTF-8 | B:16 | B:8 |
---|---|---|---|---|---|
german.txt | 3,472 | 6,846 | 3,474 | 51% | 100% |
germanHR.txt | 12,339 | 24,298 | 12,328 | 51% | 100% |
french.txt | 3,493 | 6,830 | 3,491 | 51% | 100% |
frenchHR.txt | 12,521 | 24,258 | 12,495 | 52% | 100% |
greek.txt | 4,246 | 7,148 | 6,317 | 59% | 67% |
greekHR.txt | 15,232 | 25,164 | 22,832 | 61% | 67% |
hindi.txt | 3,725 | 5,986 | 7,706 | 62% | 48% |
japanese.txt | 2,510 | 3,168 | 4,403 | 79% | 57% |
japaneseHR.txt | 8,640 | 9,456 | 12,971 | 91% | 67% |
chinese.txt | 1,978 | 2,112 | 2,549 | 94% | 78% |
chineseHR.txt | 6,569 | 6,324 | 8,809 | 104% | 75% |
korean.txt | 3,490 | 3,174 | 3,662 | 110% | 95% |
koreanHR.txt | 11,195 | 10,066 | 11,721 | 111% | 96% |
koreanJamoHR.txt | 13,871 | 20,064 | 26,454 | 69% | 52% |
Total: | 103,281 | 154,894 | 139,212 | 67% | 74% |
Overall, for these samples, BOCU takes about one-third less storage than UTF-16, and about one-quarter less storage than UTF-8.
Note that in the case of Korean, BOCU is not quite as good as UTF-16. This is because Korean uses a large character set plus spaces. The spaces require 3 bytes in those circumstances, which detracts from the compression; this does not happen with Japanese or Chinese. For comparison, the koreanHR.txt file is also given in Jamo form (with the Hangul Syllables decomposed). That would get much better compression, except that the layout of the Jamo encoding makes the alphabet too large to really benefit.
For comparison, here is the variant BOCU using only 37 allowable bytes. The storage costs are roughly the same as UTF-16 for small alphabets, and somewhat more than UTF-8 for large alphabets. Not too bad, given only 37 allowable byte values.
File | BOCU | UTF-16 | UTF-8 | B:16 | B:8 |
---|---|---|---|---|---|
german.txt | 6,793 | 6,846 | 3,474 | 99% | 196% |
germanHR.txt | 24,206 | 24,298 | 12,328 | 100% | 196% |
french.txt | 6,855 | 6,830 | 3,491 | 100% | 196% |
frenchHR.txt | 24,588 | 24,258 | 12,495 | 101% | 197% |
greek.txt | 7,195 | 7,148 | 6,317 | 101% | 114% |
greekHR.txt | 25,832 | 25,164 | 22,832 | 103% | 113% |
hindi.txt | 6,304 | 5,986 | 7,706 | 105% | 82% |
japanese.txt | 4,383 | 3,168 | 4,403 | 138% | 100% |
japaneseHR.txt | 15,172 | 9,456 | 12,971 | 160% | 117% |
chinese.txt | 3,374 | 2,112 | 2,549 | 160% | 132% |
chineseHR.txt | 10,889 | 6,324 | 8,809 | 172% | 124% |
korean.txt | 5,531 | 3,174 | 3,662 | 174% | 151% |
koreanHR.txt | 17,809 | 10,066 | 11,721 | 177% | 152% |
koreanJamoHR.txt | 23,903 | 20,064 | 26,454 | 119% | 90% |
Total: | 182,834 | 154,894 | 139,212 | 118% | 131% |
The following are extracts from sample code of the encoding and decoding. This code is for illustrative purposes only, and is not optimized. The arrays used in the computation are provided elsewhere.
```java
/**
 * Convert a Java string (UTF-16) to BOCU compressed format.
 * @return number of bytes actually written.
 */
// Note: the following code is designed for simplicity;
// it can be recoded for speed.
public int toBOCU(char[] input, int iStart, int iEnd,
                  byte[] output, int oStart, int oEnd) {
    int oldCodePoint = 0;
    int iPosition;
    int cp;
    int oPosition = oStart;
    adjuster.reset();
    for (iPosition = iStart; iPosition < iEnd; iPosition += UTF32.count16(cp)) {
        // get code point, and adjust
        cp = UTF32.char32At(input, iPosition, iEnd);
        int delta = cp - adjuster.adjust(oldCodePoint);
        oldCodePoint = cp;
        // now encode delta
        for (int i = 0; ; ++i) { // find the proper range; always terminates
            if (delta < boundary[i]) {
                // set up to store values from the end backwards
                delta += offset[i]; // important: delta is now positive!
                int bCount = count[i];
                oPosition += bCount;
                if (oPosition > oEnd) {
                    throw new IllegalArgumentException("Output buffer too short!");
                }
                int backup = oPosition;
                // store trailing bytes (bCount - 1 bytes)
                while (--bCount > 0) {
                    output[--backup] = toTrailBOCU[delta % radix];
                    delta /= radix;
                }
                // store lead byte
                output[--backup] = toLeadBOCU[delta];
                break;
            }
        }
    }
    return oPosition;
}

// =============================================================

/**
 * Converts from BOCU format back into a Java (UTF-16) character string.
 * @return number of characters (UTF-16 code units) actually written.
 */
public int fromBOCU(byte[] input, int iStart, int iEnd,
                    char[] output, int oStart, int oEnd) {
    // Note: bytes are masked with FF, since Java doesn't have unsigned byte
    int oldCodePoint = 0;
    int iPosition;
    int oPosition = oStart;
    int normByte;
    adjuster.reset();
    for (iPosition = iStart; iPosition < iEnd;) {
        // the first byte tells us the count and the initial part of the delta
        // (as an optimization, 3 arrays could be combined to 2)
        int leadByte = input[iPosition++] & 0xFF;
        int delta = fromLeadBOCU[leadByte];
        if (delta == Short.MIN_VALUE) {
            throw new IllegalArgumentException("Invalid Lead Byte: " + Utility.hex(leadByte));
        }
        int group = groupFromValue[delta];
        int bCount = count[group];
        // get the rest of the parts of the delta
        while (--bCount > 0) {
            delta *= radix;
            if (iPosition >= iEnd) {
                throw new IllegalArgumentException("Byte sequence cut off!");
            }
            int trailByte = input[iPosition++] & 0xFF;
            normByte = fromTrailBOCU[trailByte];
            if (normByte == Short.MIN_VALUE) {
                throw new IllegalArgumentException("Invalid Trail Byte: " + Utility.hex(trailByte));
            }
            delta += normByte;
        }
        // fix by offset, then adjust, since we just have a delta at this point
        delta -= offset[group];
        int cp = adjuster.adjust(oldCodePoint) + delta;
        oldCodePoint = cp;
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("Illegal code point: " + Utility.hex(cp));
        }
        // store cp in the UTF-16 array
        oPosition = UTF32.setChar32At(output, oPosition, oEnd, cp);
    }
    return oPosition;
}

// =============================================================

// used in the adjuster
int simpleAdjust(int cp) {
    if (style == HALF_BLOCK) return (cp & ~0x7F) + 0x40;
    // for large sets (CJK), align to a multibyte range of over 21K
    if (cp >= 0x4E00 && cp <= 0x9FA5) { // CJK
        return alignTo(cp, 0x3000, cjkLowSpan, cjkHighSpan);
    }
    if (cp >= 0xAC00 && cp <= 0xD7AF) { // Hangul
        return alignTo(cp, 0xAC00, cjkLowSpan, cjkHighSpan);
    }
    // handle badly aligned blocks
    int offset = 0;
    if (cp > 0x3040 && cp <= 0x30A0) offset = 0x40;      // Hiragana
    else if (cp > 0x30A0 && cp <= 0x30FF) offset = 0x20; // Katakana
    else if (cp > 0xFF60 && cp <= 0xFF9F) offset = 0x60; // HW Katakana
    // align to the start of the half-block, plus the offset
    return alignTo(cp, (cp & ~0x7F) + offset, alphabetLowSpan, alphabetHighSpan);
}

/**
 * Utility used to align a code point.
 */
private int alignTo(int cp, int start, int lowSpan, int highSpan) {
    int factor = lowSpan + highSpan;
    return ((cp - start) / factor) * factor + start + lowSpan;
}
```
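As a usage illustration (hypothetical: it assumes the boundary, offset, count, and map arrays and the adjuster are initialized elsewhere, and sizes the output for the four-byte maximum of the lead-byte distribution shown earlier):

```java
// Hypothetical round-trip through the methods above.
char[] text = "Ελληνικά".toCharArray();
byte[] bocu = new byte[text.length * 4]; // 4 = longest sequence in the table above
int byteLen = toBOCU(text, 0, text.length, bocu, 0, bocu.length);

char[] back = new char[text.length];
int charLen = fromBOCU(bocu, 0, byteLen, back, 0, back.length);
assert new String(back, 0, charLen).equals("Ελληνικά");
```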