Efficient Text Searching in Java
By Laura Werner
(appeared in Java Report, February 1999)
Text searching and sorting is one of the most well-researched areas in computer science. It is covered in an introductory algorithms course in nearly every engineering school, and there are entire books devoted to the subject. Why, you might ask, am I writing yet another article about searching?
The answer is that most of the well-known, efficient search algorithms don't work
very well in Unicode, which is the character set behind the char
type in Java. Algorithms such as Knuth-Morris-Pratt and Boyer-Moore utilize
tables that tell them what to do when a particular character is seen in the text being
searched. That's fine for a traditional character set such as ASCII or ISO Latin-1,
where there are only 128 or 256 possible characters.
Java, however, uses Unicode as its character set. In Unicode, there are 65,535 distinct characters that cover all modern languages of the world, including ideographic languages such as Chinese. In general, this is good; it makes the task of developing global applications a great deal easier. However, algorithms like Boyer-Moore that rely on an array indexed by character codes are very wasteful of memory and take a long time to initialize in this environment.
And it gets worse. Sorting and searching non-English text presents a number of challenges that many English speakers are not even aware of. The primary source of difficulty is accents, which have very different meanings in different languages, and sometimes even within the same language.
Other difficulties arise when multiple characters compare as if they were one, such as in traditional Spanish where "ch" is treated as a single letter that sorts just after "c" and before "d", or when single characters such as the "Æ" ligature or the German "ß" are treated as if they were spelled out as "AE" or "ss".
All of the above might make the problem of sorting and searching international text
seem hopeless. While not impossible, it is a difficult problem. Do not despair,
however. Starting in JDK 1.1, the class java.text.Collator
is here to help.
Collators Made Simple
For JDK 1.1, Taligent, which has since been absorbed into IBM, contributed a number of international support classes to the Java Class Libraries. These classes are now maintained and enhanced by engineers at the IBM Center for Java Technology in Cupertino, California. In combination with work done by Sun, they provide a powerful foundation for developing truly global applications.
Among the new classes introduced in JDK 1.1 were Collator
and RuleBasedCollator
, both in the java.text
package. Collator
provides an abstract interface for comparing text, while RuleBasedCollator
is a concrete subclass that implements the comparison using a
rule-driven algorithm.
Using these classes to compare two strings is quite straightforward:
Collator c = Collator.getInstance();
if (c.compare(string1, string2) == 0) {
    // Strings are the same
}
In a real program, of course, you would only call Collator.getInstance
once and would then use the same collator object to perform as many
comparisons as you wanted. Behind the scenes, getInstance
determines the current default Locale, loads the appropriate
collation rules, and constructs a Collator
object that it returns. Currently, this object is always a RuleBasedCollator
. However, in the future different locales may require different
classes with customized behavior, which is why you use a factory method to create
instances of the more abstract Collator
class.
Under the Hood
Internally, RuleBasedCollator.compare
has an awful lot of bookkeeping to do. A byte-by-byte string comparison function like
C's strcmp can walk through strings
one character at a time and compare the bytes, but if Collator did something that simple
it would rapidly get out of sync the first time it saw a contracting character like the
Spanish "ch" or an expanding character like "Æ".
To keep track of this, RuleBasedCollator
first translates strings into a series of collation elements, which correspond to
single entities in the input string. In English, each character in the input maps to a
collation element, but "Æ" produces two elements and the Spanish "ch"
produces just one. This translation is done by the utility class CollationElementIterator
, which uses a set of mapping tables built from the rules passed to
the collator's constructor.
CollationElementIterator
is a public class, and you
can use it yourself to do searches, as we will see below. As an introduction, let's
use it to iterate over the elements for a simple string:
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator iter = c.getCollationElementIterator("Foo");
int element;
while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
    System.out.println("Collation element is: " + Integer.toString(element, 16));
}
As you can see from the above example, a collation element is a fairly simple creature;
it's an int
that describes where a
character or group of characters falls in the sorting sequence. Higher-numbered elements
are sorted after lower-numbered ones.
Of course, it's really a bit more complicated than that. Each collation element can be broken down into three components (also known as weights or orders): primary, secondary, and tertiary. The primary component of a collation element corresponds to which base letter the order represents, so "A" and "B" will have different primary weights. The secondary components typically correspond to accents, so "á" and "é" will have the same secondary weight, which is different from the secondary weight of "a". The tertiary components usually represent the case of a character, so "a" and "A" will have different tertiary weighting. (There is a fourth level, the normalized original string itself, which can be used for a final comparison, but you usually don't need to worry about this level.)
Character   Primary   Secondary   Tertiary
a           1         0           0
á           1         1           0
A           1         0           1
...
B           2         0           1
...
é           5         1           0
The above table uses simple, made-up numbers to illustrate the components of a collation element. In practice, the numbers are usually larger, since most collators have many more possible elements. To see what real collation orders look like, you can modify the last code example to print out the components of each collation element for the string "Foo":
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator iter = c.getCollationElementIterator("Foo");
int element;
while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
    System.out.println("Collation element is: " + element);
    System.out.println("  primary:   " + Integer.toString(
        CollationElementIterator.primaryOrder(element), 16));
    System.out.println("  secondary: " + Integer.toString(
        CollationElementIterator.secondaryOrder(element), 16));
    System.out.println("  tertiary:  " + Integer.toString(
        CollationElementIterator.tertiaryOrder(element), 16));
}
While this notion of three ordering levels seems complicated at first, it actually
makes some tasks easier. If you want to do a case-insensitive comparison, you simply
ignore the tertiary component of each collation element. If you also don't want to
include accents, you can ignore the secondary component too. Collator
handles this by allowing you to set different strength levels using
the setStrength
method and constants such
as Collator.PRIMARY
.
Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY);    // Ignore case and accents
if (c.compare(string1, string2) == 0) {
    ... // Strings matched
}
Text Searching in JDK 1.1
Now that you're familiar with the concepts behind collators and collation
elements, we can put some of that knowledge to use. The same collation elements that RuleBasedCollator
uses to perform string comparisons can be used to do string
searching as well. The basic concept is quite simple: instead of searching through the
characters in the string, we'll search through its collation elements.
Fast string-searching algorithms such as KMP and Boyer-Moore require the ability to
back up or perform random access in the text being searched. Unfortunately, you can't
do this with international text in JDK 1.1, because CollationElementIterator does not
allow random access. It only has a next
method and is lacking setOffset
and previous
. This means that Boyer-Moore searching cannot be implemented
without a complicated buffering scheme that is very tricky to get right.
However, a traditional, brute-force string search is quite possible using the JDK 1.1 API. Essentially this comes down to comparing the search pattern against each individual character position in the text being searched. If the collation elements for the search pattern match the collation elements for that substring of text being searched, we've found a match. The outer loop looks like this:
String pattern = "for";    // What to search for
String text = "Now is the time for all good men";    // Text being searched
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator patIter = c.getCollationElementIterator(pattern);

for (int i = 0; i < text.length(); i++) {
    String substr = text.substring(i);
    CollationElementIterator textIter = c.getCollationElementIterator(substr);
    patIter.reset();
    if (match(patIter, textIter)) {
        // They matched!  Do something.
    }
}
Of course, I left out the hard part, the function match that decides whether two sequences of collation elements are equivalent. A simple, naive implementation would loop through both iterators and ensure that the elements they return are the same:
boolean match(CollationElementIterator text, CollationElementIterator pattern) {
    while (true) {
        int patternElement = pattern.next();
        int targetElement = text.next();

        if (patternElement == CollationElementIterator.NULLORDER) {
            break;              // End of the pattern
        } else if (patternElement != targetElement) {
            return false;       // Mismatch
        }
    }
    return true;                // No mismatches
}
This will work, but only if you want to treat any difference, be it primary,
secondary, or tertiary, as significant. In most applications, that is not enough; users
will want an "Ignore Case" option, and possibly an "Ignore Case and
Accents" option as well. Fortunately, this is not very hard to do. Collator
provides the constants PRIMARY
, SECONDARY
, and TERTIARY
that you can use to represent the level of comparison you want, and
CollationElementIterator
provides methods
to break down a collation order into its three components.
All we need to do is create a variable, e.g. weight
, that stores the desired level of comparison. If weight
is PRIMARY
, we
check only the primary component of each collation element. If weight
is SECONDARY
, we
check both the primary and secondary components, and if it is TERTIARY
we check all three. Fortunately, the values of these constants are in
ascending numerical order, so we can use simple comparisons such as
"if (weight > Collator.PRIMARY) ...".
To add this extra functionality we have to modify our match
function a bit, but it is still fairly simple. Since the
documentation for CollationElementIterator promises that "the first 16 bits of a
collation order is its primary order; the next 8 bits is the secondary order and the last
8 bits is the tertiary order," we can simply mask away the portions of the collation
element that we're not interested in.
// Return a mask for the part of the order we're interested in
static final int getMask(int weight) {
    switch (weight) {
        case Collator.PRIMARY:
            return 0xFFFF0000;
        case Collator.SECONDARY:
            return 0xFFFFFF00;
        default:
            return 0xFFFFFFFF;
    }
}

boolean match(CollationElementIterator text, CollationElementIterator pattern) {
    int mask = getMask(weight);
    int done = CollationElementIterator.NULLORDER & mask;

    while (true) {
        int patternElement = pattern.next() & mask;
        int targetElement = text.next() & mask;

        if (patternElement == done) {
            break;              // End of pattern
        } else if (patternElement != targetElement) {
            return false;       // Mismatch
        }
    }
    return true;                // No mismatches
}
Ignore That Character!
As you've probably guessed, I left something out again. There's one last complication: ignorable characters. In Unicode, an accented character can be represented in two different ways. The single Unicode character \u00e1 represents "á", but the pair of Unicode values \u0061\u0301 also represents "á". The \u0061 is just a lowercase "a", but the \u0301 is special. It's a "combining acute accent" that combines with the value before it to create the single "user-level" character "á".
These combining characters need special processing during a comparison. Since \u0301 is only an accent, its collation element has a secondary component but no primary or tertiary component. In a comparison that does not involve accents, we must ignore this element entirely. If we did not, we might end up comparing an accent in one string to a base letter in another string, which would give invalid results. For example, when doing an accent-insensitive comparison of "a\u0301b" and "ab", we want to skip the "\u0301" and go on to the next character; otherwise we'd compare "\u0301" and "b".
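If you are curious, you can watch these ignorable elements go by with a CollationElementIterator. The sketch below (the class name IgnorableDemo is mine, not part of the JDK) iterates over "a" followed by a combining acute accent; the exact numeric weights depend on your JDK's collation tables, but the accent's element should come back with a primary component of zero:

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class IgnorableDemo {
    public static void main(String[] args) {
        RuleBasedCollator c =
            (RuleBasedCollator) Collator.getInstance(Locale.US);

        // "a" followed by a combining acute accent: two char values,
        // but only one user-level character
        CollationElementIterator iter =
            c.getCollationElementIterator("a\u0301");

        int element;
        while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
            System.out.println("primary="
                + CollationElementIterator.primaryOrder(element)
                + " secondary="
                + CollationElementIterator.secondaryOrder(element));
        }
    }
}
```

The first element printed belongs to the base letter "a"; the second belongs to the accent alone, which is why an accent-insensitive comparison must skip it.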
This logic is relatively straightforward, but it does make the code a bit more
complicated. The boolean
variables getTarget
and getPattern
are
used to decide whether to fetch the next collation element in the text and the pattern
each time through the loop. Normally both variables are true
, but one or the other of them can be set to false
if we want to skip an element. For example, setting getPattern
to false
and getTarget
to true
causes the
current pattern element to be re-used and compared with the next text element, thus
skipping the current text element.
boolean match(CollationElementIterator text, CollationElementIterator pattern) {
    int mask = getMask(weight);
    int done = CollationElementIterator.NULLORDER & mask;

    boolean getPattern = true, getTarget = true;
    int patternElement = 0, targetElement = 0;

    while (true) {
        if (getPattern) patternElement = pattern.next() & mask;
        if (getTarget) targetElement = text.next() & mask;
        getTarget = getPattern = true;   // By default get both

        if (patternElement == done) {
            break;                       // End of pattern
        } else if (targetElement == 0) {
            getPattern = false;          // skip targetElement
        } else if (patternElement == 0) {
            getTarget = false;           // skip patternElement
        } else if (targetElement != patternElement) {
            return false;                // Mismatch
        }
    }
    return true;                         // No mismatches
}
This is about the best you can do with the JDK 1.1 API. You can add bells and whistles,
such as searching backward through text by reversing the order of the outer loop that calls
match, but you can't really implement
a more efficient search without a lot of work.
It's Better in 1.2
In JDK 1.2, we are making quite a few improvements to the international classes in java.text
and java.util
.
Among them are enhancements to CollationElementIterator
that make it possible to write faster and more powerful search
routines. These changes are present in JDK 1.2beta3 and later and will be in JDK 1.2 final
when it ships. If you'd like to try them out now, you can download the latest beta
from the Java Developer Connection at java.sun.com.
There are two major problems with the searching mechanism outlined above. First, it uses an inefficient algorithm that can, at worst, compare every character of the pattern against every character of the target, requiring a number of comparisons that is proportional to the size of the text being searched multiplied by the size of the pattern. (In practice, it's usually not quite that bad, however.) In computer science terms, if the size of the text is T and the size of the pattern is P, the search time is proportional to T·P; that is, it is O(TP). Modern searching algorithms can do much better.
The second, and more obvious problem is that there is an awful lot of object creation
going on in the last few examples. Every time through the outer loop we call substring
, which creates a new String
object, and then we create a new CollationElementIterator
. This happens at every single position in the target string, which
is woefully inefficient given the cost of object creation in Java (and in most other
languages, for that matter).
This second problem is solved by two new CollationElementIterator methods that we have added in JDK 1.2:
public void setOffset(int newOffset)
public int getOffset()
These methods allow you to change the text position to which an iterator points and to
retrieve its current position. With this flexibility, we can avoid all of the calls to substring
and all of the extra iterator objects that we were creating before.
The outer searching loop now looks like this:
String pat = "for";    // What to search for
String text = "Now is the time for all good men";    // Text to search in
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator patIter = c.getCollationElementIterator(pat);
CollationElementIterator targIter = c.getCollationElementIterator(text);

for (int i = 0; i < text.length(); i++) {
    targIter.setOffset(i);
    patIter.reset();    // Or setOffset(0)
    if (match(patIter, targIter)) {
        // They matched!  Do something.
    }
}
This will be much faster, because we're no longer creating new objects each time through the loop. The algorithm is still O(TP), but the overhead per iteration is considerably lower, so the running time will be a lot better.
Optimized Searching
We've solved the easier of our two efficiency problems; now it's time for the hard one. As I explained above, the brute-force algorithm we're using is O(TP). String searching is a well-researched area, and there are algorithms that can do considerably better. Perhaps the best is the Boyer-Moore method, which is never worse than O(T+P) and in practice is often proportional to T/P. That's right: the size of the text divided by the size of the pattern. Rather than forcing us to examine characters in the text multiple times, this algorithm actually lets us skip characters.
Boyer-Moore can be a little bit tricky to explain, but once you "get" it, it seems almost too obvious. The trick is that instead of comparing the strings starting at the beginning of the pattern, you compare them starting at the end. If the characters don't match, we've still gleaned a bit of useful information: we now know what character occurs at that position in the text being searched. Often, we can take advantage of that information to skip several characters in the target text rather than simply sliding the pattern along by one position and trying again.
An example will make this more clear. Imagine that you're searching for the word "string" inside the phrase "silly spring string". To start off, you line up the beginning of the pattern with the beginning of the target, but you start comparing at the end, like so:
silly spring string
string
We compare the g in the pattern with a
space character in the target, and there's no match. So far, there's nothing
special. However, we know something else as well: the character at index 5 in the target
is a space, and there are no space characters anywhere in the pattern we're searching
for. Combining these two facts, we can slide the pattern over by six characters and try
again:
silly spring string
      string
This time, there is a match between the pattern and the text. Since we're going
backwards, we now compare the n, i, and r and find
that they match too. However, the p and
the t do not. We know that there is not a p
anywhere in the pattern, so we can slide it over again:
silly spring string
        string
This time, we see an s in the text.
It's not a match, but we do know that there is an s
at the beginning of the pattern. Therefore, we slide the pattern
over five spaces. Now we have the following, which gives us a match.
silly spring string
             string
To implement this efficiently, you need to have a table that, for each possible character in the text, tells you how far from the end of the pattern that character first occurs. This is the distance by which the pattern can be shifted when that particular character is seen in the input. For the above example, the table would look like this:
s  t  r  i  n  g  others
5  4  3  2  1  0     6
This table can be computed once, before the search is started, by making a single pass through the pattern. After that, it can be used each time we search for that pattern, leading to a huge performance win.
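That single pass can be sketched in a few lines of Java. The version below works on plain ASCII characters rather than collation elements (the names ShiftTable and buildShifts are mine, purely for illustration); a later occurrence of a repeated letter simply overwrites an earlier one, which leaves the smaller, safer shift distance in the table:

```java
public class ShiftTable {
    // Build the bad-character shift table for a plain ASCII pattern:
    // shifts[ch] is how far the pattern may slide when character 'ch'
    // causes a mismatch in the text being searched.
    static int[] buildShifts(String pattern) {
        int[] shifts = new int[128];
        int len = pattern.length();

        // Characters that never occur in the pattern allow the
        // maximum shift: the full pattern length
        for (int i = 0; i < shifts.length; i++) {
            shifts[i] = len;
        }

        // For each pattern character, record its distance from the
        // end of the pattern. Later occurrences overwrite earlier
        // ones, keeping the smaller shift for repeated letters.
        for (int i = 0; i < len; i++) {
            shifts[pattern.charAt(i)] = len - 1 - i;
        }
        return shifts;
    }

    public static void main(String[] args) {
        int[] shifts = buildShifts("string");
        System.out.println("s=" + shifts['s'] + " g=" + shifts['g']
            + " others=" + shifts[' ']);
    }
}
```

Running this for the pattern "string" reproduces the table above: s maps to 5, g to 0, and any character not in the pattern to 6.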
Boyer-Moore vs. Unicode
But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible character values, which would make the table too large. Actually, it's worse, because we're concerned with collation elements, which are 32-bit integers, not with the Unicode values themselves. That's true, but (of course) there's another trick...
First, consider what happens when a letter occurs twice in the search pattern. There are two possible shift distances for that letter, one for each occurrence. To make the Boyer-Moore algorithm work, we always want to enter the smaller of the two shift distances in the table. If we used the larger one, we might shift the pattern too far and miss a match. In a sense, the shift table is not required to be perfectly accurate, and conservative estimates of shift distances are OK. As long as we don't shift the pattern too far, we're fine.
This realization leads to a simple technique for applying the algorithm to Java collation elements: simply map all possible elements down to a much smaller set of shift table indices (say, 256). If two or more elements in your pattern happen to collide and end up with the same index, it's not a problem as long as you enter the smaller of the shift distances in the table.
A simple way to map the collation elements to 256 values is to use the low byte of the element's primary component. There are other approaches that will lead to a better distribution of values throughout the shift table and will thus give slightly better performance, but in practice this approach is usually good enough.
To implement Boyer-Moore searching with JDK 1.2, we first need to construct a shift
table that tells us how far to shift the pattern when a particular collation element is
seen in the text. The hash
function is
used to map from a 32-bit collation order down to an index in the 256-element array.
// Member variables for storing precomputed pattern data
private int patLen;
private int[] patElem;
private int[] shifts;

// Map a collation element to an array index
int hash(int order) {
    return CollationElementIterator.primaryOrder(order) % 256;
}

// Initialize the Boyer-Moore shift tables
void initialize(RuleBasedCollator c, String pat) {
    // First find out how many elements we're dealing with
    patLen = 0;
    CollationElementIterator iter = c.getCollationElementIterator(pat);
    while (iter.next() != CollationElementIterator.NULLORDER) {
        patLen++;
    }

    // Allocate space to store the pattern elements and the shift tables
    patElem = new int[patLen];
    shifts = new int[256];

    // Elements not found in the pattern get the maximum shift distance
    for (int i = 0; i < 256; i++) {
        shifts[i] = patLen;
    }

    // Now compute the shift distances for the elements in the pattern.
    // While we're at it, save the elements themselves for quick access.
    // The "-1" is in the calculation because Java indices are 0-based.
    iter.reset();
    for (int i = 0; i < patLen; i++) {
        patElem[i] = iter.next();
        int index = hash(patElem[i]);
        shifts[index] = Math.min(shifts[index], patLen - i - 1);
    }
}
Once we have the tables, the search routine is straightforward. It uses another new JDK
1.2 method: CollationElementIterator.previous
. Also note that there is no longer an outer loop that calls a separate match
method, since that only worked well when we were marching through
the text one character at a time. Now that we can skip ahead an arbitrary distance through
the text, it is easier to combine all of the logic into one method.
public int find(String text, String pattern) {
    RuleBasedCollator coll = (RuleBasedCollator)Collator.getInstance();
    CollationElementIterator targIter = coll.getCollationElementIterator(text);

    // Build the shift table and the constants we need
    initialize(coll, pattern);
    int mask = getMask(weight);
    int done = CollationElementIterator.NULLORDER & mask;

    // Start at the text position corresponding to the end of the pattern
    int textIndex = pattern.length();

    while (textIndex <= text.length()) {
        boolean getPattern = true, getTarget = true;
        int targetElement = 0, patternElement = 0;

        targIter.setOffset(textIndex);
        int patIndex = patLen;

        // Iterate backward until we hit the beginning of the pattern
        while (patIndex > 0) {
            if (getTarget) targetElement = targIter.previous() & mask;
            if (getPattern) patternElement = patElem[--patIndex] & mask;
            getTarget = getPattern = true;

            if (targetElement == 0) {
                getPattern = false;     // skip targetElement
            } else if (patternElement == 0) {
                getTarget = false;      // skip patternElement
            } else if (targetElement != patternElement) {
                // There is a mismatch at this position.  Decide how far
                // over to shift the pattern, then try again.
                textIndex = targIter.getOffset() + shifts[hash(targetElement)];
                break;
            }
        }
        if (patIndex == 0) {
            // We made it back to the beginning of the pattern,
            // which means we matched it all.  Return the location.
            return targIter.getOffset();
        }
        // Otherwise, we're here because of a mismatch, so keep going....
    }
    return -1;    // No match.
}
There you have it: a way to do fast, linear-time, international-friendly string searching in Java.
The Real Stuff
I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applications. It is not that hard, and the benefits can be enormous, because global markets are becoming increasingly important to the computer industry. For example, according to IBM's first-quarter 1998 financial statement, over half of IBM's revenue came from outside North America. Using the international features of the JDK can help you begin to tap into this huge market.
The code examples in this article were intended primarily for their educational value,
but they do work. However, I have ignored a few remaining issues to keep the
code easier to understand. Expanding ("ä" to "ae") characters in the
pattern are the chief difficulty. If the shorter, non-expanded version of the character
occurs in the text being searched, you can end up shifting too far and missing a possible
match. However, it is not too hard to compensate for this by using the getMaxExpansion
method (also new in JDK 1.2) of CollationElementIterator
to decrease the shift distances when expanded characters are seen
in the pattern.
The other major feature I have left out is the ability to tell how much of the text
matched the pattern. All of the code examples search for location in the text where a
match starts, but they do not return the length of the text that matched. You would need
to know this if you were writing a search and replace function in an editor, for example.
In JDK 1.2, it is easy to tell where the match ends: just call the iterator's getOffset
method. With the JDK 1.1 API, it is harder; you basically have to
resort to the brute-force technique of comparing longer and longer substrings of the text
to the pattern and stopping when the collator says that they are equal.
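That JDK 1.1 brute-force technique can be sketched as follows. The class and method names (MatchLength, matchLength) are mine, not part of the JDK; the idea is simply to grow a substring from the match position until the collator reports equality with the pattern:

```java
import java.text.Collator;

public class MatchLength {
    // Given a match starting at 'start', return the number of
    // characters of 'text' that the pattern matched, or -1 if no
    // prefix of the remaining text compares equal to the pattern.
    static int matchLength(Collator c, String text, int start,
                           String pattern) {
        for (int end = start + 1; end <= text.length(); end++) {
            if (c.compare(text.substring(start, end), pattern) == 0) {
                return end - start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Collator c = Collator.getInstance();
        String text = "Now is the time for all good men";
        // The match for "for" starts at index 16 and is 3 chars long
        System.out.println(matchLength(c, text, 16, "for"));
    }
}
```

Note how expensive this is compared to a single getOffset call: every candidate length allocates a substring and runs a full collator comparison.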
If you want to see real, production-quality code that uses the same algorithms and solves these last few problems, visit www.alphaworks.ibm.com and download our fully-functional text searching class based on the JDK collation API. It supports case- and accent-insensitive searches, backward searches, and "whole word" searching, among other features. AlphaWorks also contains several other Java internationalization utilities that you might find useful, as well as a large number of JavaBeans, utility classes, and even a free XML parser written in Java.
Acknowledgements
The patented technique for applying the Boyer-Moore search algorithm to collation elements was developed by Dr. Mark Davis of IBM. Kathleen Wilson, the manager of the text and international groups at IBM's Center for Java Technology in Silicon Valley, was very indulgent of the time I spent working on this article and the accompanying code. I would also like to thank Mark, Kathleen, Michael Pogue, John Raley, and Helena Shih Chapman for reviewing drafts of this article.
References
Algorithms in C, 3rd edition, by Robert Sedgewick (Addison-Wesley, 1997) has a good overview of the Boyer-Moore search algorithm as applied to small character sets. He also covers a number of other search algorithms.
The Java Class Libraries, 2nd Edition, vol. 1, by Chan, Lee, and Kramer (Addison-Wesley, 1998) has a nice description of CollationElementIterator.
Making your Java/C++/C Applications Global, at www.ibm.com/java/education/international-unicode/unicode1.html is a good overview of some of the issues involved in writing global applications.
The Unicode Standard, Version 2.0, (Addison-Wesley, 1996) has lots of useful information on processing international text. In particular, section 5.15 covers sorting and searching.
The Unicode Consortium web site at www.unicode.org has lots of good information. If you want an even more detailed explanation of Unicode collation, see www.unicode.org/unicode/reports/tr10/.
Laura Werner is a Senior Software Engineer and Project Leader for the Java Internationalization effort at the IBM Center for Java Technology in Cupertino, CA. After receiving Bachelor's degrees in Geological Sciences and Integrated Science (an honors, interdisciplinary program) from Northwestern University, she worked at SPSS, Inc. and UC Berkeley before joining Taligent in 1994. There, she helped port Taligent's C++ frameworks to Windows NT, wrote the OpenClass file system framework, and participated in the initial design of the Java international frameworks. Now at IBM, she is responsible for maintaining and enhancing the JDK collation classes, leading other Java international projects, and working on occasional side projects such as this paper.