Binary string comparison with UTF-16

How to efficiently compare UTF-16 strings in code point sort order

Markus W. Scherer
Unicode Software Engineer, IBM
April 2000, updated August 2001

Originally published on http://www.ibm.com/developerWorks/unicode/

Contents:
UTF-32
UTF-8
UTF-16
Comparison examples
Sample C code
UTF-8/32 string comparison in UTF-16 code unit order
Resources

This "how-to" article discusses binary string comparison. If "code point sort order" is required, then binary comparisons of UTF-16 strings need special attention. An efficient algorithm and sample code are provided.

In many applications, where culturally correct string comparisons (collation) are not necessary, strings are compared "binary" for performance reasons. Binary string comparison is a lexical comparison of the code unit values (the base units of a string).

Often, the particular sort order does not matter, as in an internal table lookup with a binary search. Sometimes, however, the sort order needs to match the sort order of a lexical comparison of the code point values (the character set codes encoded in each string).

UTF-32

In Unicode, code points are 21-bit integers. UTF-32 strings store code points directly as code units, so that the binary-comparison sort order is naturally the same as the code point sort order.

The other, more memory-efficient, UTFs would -- with a na´┐Że approach -- need to assemble the code point values from the code units, which would be relatively slow. This is not necessary because of their design: string comparison in code point sort order can be done with UTF-8 and UTF-16 efficiently and based only on code unit comparisons as follows:

UTF-8

A comparison of strings in UTF-8 is done entirely as a comparison of the byte sequences. This is possible because of the structure of UTF-8: All lead bytes have higher byte values than the single-byte characters, and within lead, single, and trail byte ranges, the order of code points is preserved as well.

UTF-16

In UTF-16, the binary order of code units is only "almost the same as code point order". UTF-16 was designed after characters had been assigned to code points at the upper end of the original, 16-bit, UCS-2 range. Leading and trailing code units ("surrogates") have smaller values than some single-unit characters. Otherwise, within single, lead, and trail code unit value ranges, the order is preserved, too.

This opens the door for a "fix-up" of code unit values that is faster than assembling 21-bit code point values. It does not require an extra branch inside the comparison loop. The basis for this is that the "surrogate" values are in an area that is aligned along blocks of 2k (0x800) values. In other words, the most significant five bits in the 16-bit values are always the same for surrogates, and are unique to them. The "fix-up" simply rotates the block of surrogates to the top of the 16-bit value range (up by 0x2000) and the higher blocks down by one block (subtract 0x800, or add 0xf800). This could be done for each code unit, using a little table of deltas for each 2k block, and adding three machine operations: a shift, a lookup in the 32-element table, and an addition. (Note that this table approach uses common C language arithmetic: Instead of getting an overflow error when adding 0xf800 to large numbers, unsigned 16-bit arithmetic in C wraps around to smaller values. This is not used for the optimized approach described in the following paragraph.)

In order to make such a string comparison with "fix-up" as fast as the regular one, the following optimizations are used (See Sample C code below.):

Note: This "fix-up" may yield unexpected results for malformed UTF-16 strings with single, unpaired surrogate code units. Surrogates only encode characters if they are used in pairs.

Comparison examples

The following table shows several comparison examples. Some of them work in UTF-16 without special code, but comparisons of a surrogate value with a single value of 0xe000 or higher, shown in bold, need a local "fix-up." The "corrected" values are shown in italics. All values are shown in hexadecimal.

Code point/UTF-32 comparison UTF-16 comparison without "fix-up" UTF-16 comparison with "fix-up" UTF-8 comparison
61 < 20ac 61 < 20ac (OK, no surrogates) 61 < 20ac 61 < e2 82 ac
61 < 10002 61 < d800 dc02 (OK, non-surrogate is below d800) 61 < f800 fc02 (OK, fix-up keeps order) 61 < f0 90 80 82
ff61 < 10002 ff61 > d800 dc02 (not code point order) f761 < f800 fc02 (OK, code units "fixed up") ef bd a1 < f0 90 80 82
10002 < 23456 d800 dc02 < d84d dc56 (OK, surrogates on both sides) d800 dc02 < d84d dc56 f0 90 80 82 < f0 a3 91 96

Sample C code

The following Sample C function compares two zero-terminated UTF-16 strings. The additional code is after the loop and skipped if both code units are below 0xd800.

    /* compare zero-terminated UTF-16 strings */
    int utf16_strcmp(unsigned short *s1, unsigned short *s2) {
        unsigned short c1, c2; /* 16 bits */

        /* compare identical prefixes - they do not need to be fixed up */
        for(;;) {
            c1=*s1;
            c2=*s2;
            if(c1==c2) {
                if(c1==0) {
                    return 0;
                }
                ++s1;
                ++s2;
            } else {
                break;
            }
        }

        /* c1!=c2, fix up each one if they're both in or above the surrogate range, then compare them */
        if(c1>=0xd800 && c2>=0xd800) {
            if(c1>=0xe000) {
                c1-=0x800;
            } else {
                c1+=0x2000;
            }
            if(c2>=0xe000) {
                c2-=0x800;
            } else {
                c2+=0x2000;
            }
        }

        /* now c1 and c2 are in code point order */
        return (int)c1-(int)c2; /* int must be 32 bits wide */
    }

UTF-8/32 string comparison in UTF-16 code unit order

Sometimes, the opposite may be desired: Comparing UTF-8/32 strings in the binary order of UTF-16 code units. This can be done with the opposite rotation of code unit values in UTF-8/32, similarly reversing the relative order of the code point ranges U+e000..U+ffff and U+10000..U+10ffff.

Note that both UTF-8 and UTF-32 strings should never contain single-surrogate code points. Also, since in both cases there are unused code unit values, a true rotation is not necessary; only one of the two code point ranges needs to be moved.

UTF-32 code point rotation for UTF-16 code unit order

It is sufficient to move the range U+e000..U+ffff to an arbitrary, unused range after the supplementary code points:

        unsigned int c1, c2; /* 32 bits, values 0..0x10ffff */
        if(c1>=0xe000 && c2>=0xe000) {
            if(c1<0x10000) {
                c1+=0x1f0000;
            }
            if(c2<0x10000) {
                c2+=0x1f0000;
            }
        }
  

UTF-8 code point rotation for UTF-16 code unit order

The lead bytes for the range U+e000..U+ffff are distinct (they are exactly 0xee and 0xef) and can be moved to the unused byte values 0xfe and 0xff.

        unsigned char c1, c2; /* 8 bits, values 0..0xfd */
        if((c1>=0xee && c2>=0xee) {
            if((c1&0xfe)==0xee) {
                c1+=0x10;
            }
            if((c2&0xfe)==0xee) {
                c2+=0x10;
            }
        }
  

Resources