Efficient Text Searching in Java
By Laura Werner
(appeared in Java Report, February 1999)
Text searching and sorting is one of the most well-researched areas in computer science. It is covered in an introductory algorithms course in nearly every engineering school, and there are entire books devoted to the subject. Why, you might ask, am I writing yet another article about searching?
The answer is that most of the well-known, efficient search algorithms don't work
very well in Unicode, which is the character set behind the char
type in Java. Algorithms such as Knuth-Morris-Pratt and Boyer-Moore utilize
tables that tell them what to do when a particular character is seen in the text being
searched. That's fine for a traditional character set such as ASCII or ISO Latin-1,
where there are only 128 or 256 possible characters.
Java, however, uses Unicode as its character set. In Unicode, there are 65,535 distinct characters that cover all modern languages of the world, including ideographic languages such as Chinese. In general, this is good; it makes the task of developing global applications a great deal easier. However, algorithms like Boyer-Moore that rely on an array indexed by character codes are very wasteful of memory and take a long time to initialize in this environment.
And it gets worse. Sorting and searching non-English text presents a number of challenges that many English speakers are not even aware of. The primary source of difficulty is accents, which have very different meanings in different languages, and sometimes even within the same language.
Other difficulties arise when multiple characters compare as if they were one, such as in traditional Spanish where "ch" is treated as a single letter that sorts just after "c" and before "d", or when single characters such as the "Æ" ligature or the German "ß" are treated as if they were spelled out as "AE" or "ss".
All of the above might make the problem of sorting and searching international text
seem hopeless. While not impossible, it is a difficult problem. Do not despair,
however. Starting in JDK 1.1, the class java.text.Collator
is here to help.
Collators Made Simple
For JDK 1.1, Taligent, which has since been absorbed into IBM, contributed a number of international support classes to the Java Class Libraries. These classes are now maintained and enhanced by engineers at the IBM Center for Java Technology in Cupertino, California. In combination with work done by Sun, they provide a powerful foundation for developing truly global applications.
Among the new classes introduced in JDK 1.1 were Collator
and RuleBasedCollator
, both in the java.text
package. Collator
provides an abstract interface for comparing text, while RuleBasedCollator
is a concrete subclass that implements the comparison using a
rule-driven algorithm.
Using these classes to compare two strings is quite straightforward:
Collator c = Collator.getInstance();
if (c.compare(string1, string2) == 0) {
    // Strings are the same
}
In a real program, of course, you would only call Collator.getInstance
once and would then use the same collator object to perform as many
comparisons as you wanted. Behind the scenes, getInstance
determines the current default Locale, loads the appropriate
collation rules, and constructs a Collator
object that it returns. Currently, this object is always a RuleBasedCollator
. However, in the future different locales may require different
classes with customized behavior, which is why you use a factory method to create
instances of the more abstract Collator
class.
Under the Hood
Internally, RuleBasedCollator.compare
has an awful lot of bookkeeping to do. A byte-by-byte string comparison function like
C's strcmp can walk through strings
one character at a time and compare the bytes, but if Collator did something that simple
it would rapidly get out of sync the first time it saw a contracting character like the
Spanish "ch" or an expanding character like "Æ".
To keep track of this, RuleBasedCollator
first translates strings into a series of collation elements, which correspond to
single entities in the input string. In English, each character in the input maps to a
collation element, but "Æ" produces two elements and the Spanish "ch"
produces just one. This translation is done by the utility class CollationElementIterator
, which uses a set of mapping tables built from the rules passed to
the collator's constructor.
CollationElementIterator
is a public class, and you
can use it yourself to do searches, as we will see below. As an introduction, let's
use it to iterate over the elements for a simple string:
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator iter = c.getCollationElementIterator("Foo");
int element;
while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
    System.out.println("Collation element is: " + Integer.toString(element, 16));
}
As you can see from the above example, a collation element is a fairly simple creature;
it's an int
that describes where a
character or group of characters falls in the sorting sequence. Higher-numbered elements
are sorted after lower-numbered ones.
Of course, it's really a bit more complicated than that. Each collation element can be broken down into three components (also known as weights or orders): primary, secondary, and tertiary. The primary component of a collation element corresponds to which base letter the order represents, so "A" and "B" will have different primary weights. The secondary components typically correspond to accents, so "á" and "é" will have the same secondary weight, which is different from the secondary weight of "a". The tertiary components usually represent the case of a character, so "a" and "A" will have different tertiary weighting. (There is a fourth level, the normalized original string itself, which can be used for a final comparison, but you usually don't need to worry about this level.)
Character   Primary   Secondary   Tertiary
a           1         0           0
á           1         1           0
A           1         0           1
...
B           2         0           1
...
é           5         1           0
The above table uses simple, made-up numbers to illustrate the components of a collation element. In practice, the numbers are usually larger, since most collators have many more possible elements. To see what real collation orders look like, you can modify the last code example to print out the components of each collation element for the string "Foo":
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator iter = c.getCollationElementIterator("Foo");
int element;
while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
    System.out.println("Collation element is: " + element);
    System.out.println("  primary:   " + Integer.toString(
        CollationElementIterator.primaryOrder(element), 16));
    System.out.println("  secondary: " + Integer.toString(
        CollationElementIterator.secondaryOrder(element), 16));
    System.out.println("  tertiary:  " + Integer.toString(
        CollationElementIterator.tertiaryOrder(element), 16));
}
While this notion of three ordering levels seems complicated at first, it actually
makes some tasks easier. If you want to do a case-insensitive comparison, you simply
ignore the tertiary component of each collation element. If you also don't want to
include accents, you can ignore the secondary component too. Collator
handles this by allowing you to set different strength levels using
the setStrength
method and constants such
as Collator.PRIMARY
.
Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY);    // Ignore case and accents
if (c.compare(string1, string2) == 0) {
    ... // Strings matched
}
Text Searching in JDK 1.1
Now that you're familiar with the concepts behind collators and collation
elements, we can put some of that knowledge to use. The same collation elements that RuleBasedCollator
uses to perform string comparisons can be used to do string
searching as well. The basic concept is quite simple: instead of searching through the
characters in the string, we'll search through its collation elements.
Fast string-searching algorithms such as KMP and Boyer-Moore require the ability to
back up or perform random access in the text being searched. Unfortunately, you can't
do this with international text in JDK 1.1, because CollationElementIterator does not
allow random access. It only has a next
method and is lacking setOffset
and previous
. This means that Boyer-Moore searching cannot be implemented
without a complicated buffering scheme that is very tricky to get right.
However, a traditional, brute-force string search is quite possible using the JDK 1.1 API. Essentially this comes down to comparing the search pattern against each individual character position in the text being searched. If the collation elements for the search pattern match the collation elements for that substring of text being searched, we've found a match. The outer loop looks like this:
String pattern = "for";    // What to search for
String text = "Now is the time for all good men";    // Text being searched
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator patIter = c.getCollationElementIterator(pattern);

for (int i = 0; i < text.length(); i++) {
    String substr = text.substring(i);
    CollationElementIterator textIter = c.getCollationElementIterator(substr);
    patIter.reset();
    if (match(patIter, textIter)) {
        // They matched!  Do something.
    }
}
Of course, I left out the hard part, the function match that decides whether two sequences of collation elements are equivalent. A simple, naive implementation would loop through both iterators and ensure that the elements they return are the same:
boolean match(CollationElementIterator text, CollationElementIterator pattern) {
    while (true) {
        int patternElement = pattern.next();
        int targetElement = text.next();

        if (patternElement == CollationElementIterator.NULLORDER) {
            break;              // End of the pattern
        } else if (patternElement != targetElement) {
            return false;       // Mismatch
        }
    }
    return true;                // No mismatches
}
This will work, but only if you want to treat any difference, be it primary,
secondary, or tertiary, as significant. In most applications, that is not enough; users
will want an "Ignore Case" option, and possibly an "Ignore Case and
Accents" option as well. Fortunately, this is not very hard to do. Collator
provides the constants PRIMARY
, SECONDARY
, and TERTIARY
that you can use to represent the level of comparison you want, and
CollationElementIterator
provides methods
to break down a collation order into its three components.
All we need to do is create a variable, e.g. weight
, that stores the desired level of comparison. If weight
is PRIMARY
, we
check only the primary component of each collation element. If weight
is SECONDARY
, we
check both the primary and secondary components, and if it is TERTIARY
we check all three. Fortunately, the values of these constants are in
ascending numerical order, so we can use simple comparisons such as
"if (weight > Collator.PRIMARY) ...".
To add this extra functionality we have to modify our match
function a bit, but it is still fairly simple. Since the
documentation for CollationElementIterator promises that "the first 16 bits of a
collation order is its primary order; the next 8 bits is the secondary order and the last
8 bits is the tertiary order," we can simply mask away the portions of the collation
element that we're not interested in.
// Return a mask for the part of the order we're interested in
static final int getMask(int weight) {
    switch (weight) {
        case Collator.PRIMARY:
            return 0xFFFF0000;
        case Collator.SECONDARY:
            return 0xFFFFFF00;
        default:
            return 0xFFFFFFFF;
    }
}

boolean match(CollationElementIterator text, CollationElementIterator pattern) {
    int mask = getMask(weight);
    int done = CollationElementIterator.NULLORDER & mask;

    while (true) {
        int patternElement = pattern.next() & mask;
        int targetElement = text.next() & mask;

        if (patternElement == done) {
            break;              // End of pattern
        } else if (patternElement != targetElement) {
            return false;       // Mismatch
        }
    }
    return true;                // No mismatches
}
Ignore That Character!
As you've probably guessed, I left something out again. There's one last complication: ignorable characters. In Unicode, an accented character can be represented in two different ways. The single Unicode character \u00e1 represents "á", but the pair of Unicode values \u0061\u0301 also represents "á". The \u0061 is just a lowercase "a", but the \u0301 is special. It's a "combining acute accent" that combines with the value before it to create the single "user-level" character "á".
These combining characters need special processing during a comparison. Since \u0301 is only an accent, its collation element has a secondary component but no primary or tertiary component. In a comparison that does not involve accents, we must ignore this element entirely. If we did not, we might end up comparing an accent in one string to a base letter in another string, which would give invalid results. For example, when doing an accent-insensitive comparison of "a\u0301b" and "ab", we want to skip the "\u0301" and go on to the next character; otherwise we'd compare "\u0301" and "b".
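If you are curious, you can watch these ignorable elements go by with a CollationElementIterator. The sketch below (the class name IgnorableDemo is mine, not part of the JDK) iterates over "a" followed by a combining acute accent; the exact numeric weights depend on your JDK's collation tables, but the accent's element should come back with a primary component of zero:

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class IgnorableDemo {
    public static void main(String[] args) {
        RuleBasedCollator c =
            (RuleBasedCollator) Collator.getInstance(Locale.US);

        // "a" followed by a combining acute accent: two char values,
        // but only one user-level character
        CollationElementIterator iter =
            c.getCollationElementIterator("a\u0301");

        int element;
        while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
            System.out.println("primary="
                + CollationElementIterator.primaryOrder(element)
                + " secondary="
                + CollationElementIterator.secondaryOrder(element));
        }
    }
}
```

The first element printed belongs to the base letter "a"; the second belongs to the accent alone, which is why an accent-insensitive comparison must skip it.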
This logic is relatively straightforward, but it does make the code a bit more
complicated. The boolean
variables getTarget
and getPattern
are
used to decide whether to fetch the next collation element in the text and the pattern
each time through the loop. Normally both variables are true
, but one or the other of them can be set to false
if we want to skip an element. For example, setting getPattern
to false
and getTarget
to true
causes the
current pattern element to be re-used and compared with the next text element, thus
skipping the current text element.
boolean match(CollationElementIterator text, CollationElementIterator pattern) {
    int mask = getMask(weight);
    int done = CollationElementIterator.NULLORDER & mask;

    boolean getPattern = true, getTarget = true;
    int patternElement = 0, targetElement = 0;

    while (true) {
        if (getPattern) patternElement = pattern.next() & mask;
        if (getTarget) targetElement = text.next() & mask;
        getTarget = getPattern = true;   // By default get both

        if (patternElement == done) {
            break;                       // End of pattern
        } else if (targetElement == 0) {
            getPattern = false;          // skip targetElement
        } else if (patternElement == 0) {
            getTarget = false;           // skip patternElement
        } else if (targetElement != patternElement) {
            return false;                // Mismatch
        }
    }
    return true;                         // No mismatches
}
This is about the best you can do with the JDK 1.1 API. You can add bells and whistles,
such as searching backward through text by reversing the order of the outer loop that calls
match, but you can't really implement
a more efficient search without a lot of work.
It's Better in 1.2
In JDK 1.2, we are making quite a few improvements to the international classes in java.text
and java.util
.
Among them are enhancements to CollationElementIterator
that make it possible to write faster and more powerful search
routines. These changes are present in JDK 1.2beta3 and later and will be in JDK 1.2 final
when it ships. If you'd like to try them out now, you can download the latest beta
from the Java Developer Connection at java.sun.com.
There are two major problems with the searching mechanism outlined above. First, it uses an inefficient algorithm that can, at worst, compare every character of the pattern against every character of the target, requiring a number of comparisons that is proportional to the size of the text being searched multiplied by the size of the pattern. (In practice, it's usually not quite that bad, however.) In computer science terms, if the size of the text is T and the size of the pattern is P, the search time is proportional to T·P; that is, it is O(TP). Modern searching algorithms can do much better.
The second, and more obvious problem is that there is an awful lot of object creation
going on in the last few examples. Every time through the outer loop we call substring
, which creates a new String
object, and then we create a new CollationElementIterator
. This happens at every single position in the target string, which
is woefully inefficient given the cost of object creation in Java (and in most other
languages, for that matter).
This second problem is solved by two new CollationElementIterator methods that we have added in JDK 1.2:
public void setOffset(int newOffset)
public int getOffset()
These methods allow you to change the text position to which an iterator points and to
retrieve its current position. With this flexibility, we can avoid all of the calls to substring
and all of the extra iterator objects that we were creating before.
The outer searching loop now looks like this:
String pat = "for";    // What to search for
String text = "Now is the time for all good men";    // Text to search in
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator patIter = c.getCollationElementIterator(pat);
CollationElementIterator targIter = c.getCollationElementIterator(text);

for (int i = 0; i < text.length(); i++) {
    targIter.setOffset(i);
    patIter.reset();    // Or setOffset(0)
    if (match(patIter, targIter)) {
        // They matched!  Do something.
    }
}
This will be much faster, because we're no longer creating new objects each time through the loop. The algorithm is still O(TP), but the overhead per iteration is considerably lower, so the running time will be a lot better.
Optimized Searching
We've solved the easier of our two efficiency problems; now it's time for the hard one. As I explained above, the brute-force algorithm we're using is O(TP). String searching is a well-researched area, and there are algorithms that can do considerably better. Perhaps the best is the Boyer-Moore method, which is never worse than O(T+P) and in practice is often proportional to T/P. That's right: the size of the text divided by the size of the pattern. Rather than forcing us to examine characters in the text multiple times, this algorithm actually lets us skip characters.
Boyer-Moore can be a little bit tricky to explain, but once you "get" it, it seems almost too obvious. The trick is that instead of comparing the strings starting at the beginning of the pattern, you compare them starting at the end. If the characters don't match, we've still gleaned a bit of useful information: we now know what character occurs at that position in the text being searched. Often, we can take advantage of that information to skip several characters in the target text rather than simply sliding the pattern along by one position and trying again.
An example will make this more clear. Imagine that you're searching for the word "string" inside the phrase "silly spring string". To start off, you line up the beginning of the pattern with the beginning of the target, but you start comparing at the end, like so:
silly spring string
string
We compare the g in the pattern with a
space character in the target, and there's no match. So far, there's nothing
special. However, we know something else as well: the character at index 5 in the target
is a space, and there are no space characters anywhere in the pattern we're searching
for. Combining these two facts, we can slide the pattern over by six characters and try
again:
silly spring string
      string
This time, there is a match between the pattern and the text. Since we're going
backwards, we now compare the n, i, and r and find
that they match too. However, the p and
the t do not. We know that there is not a p
anywhere in the pattern, so we can slide it over again:
silly spring string
        string
This time, we see an s in the text.
It's not a match, but we do know that there is an s
at the beginning of the pattern. Therefore, we slide the pattern
over five spaces. Now we have the following, which gives us a match.
silly spring string
             string
To implement this efficiently, you need to have a table that, for each possible character in the text, tells you how far from the end of the pattern that character first occurs. This is the distance by which the pattern can be shifted when that particular character is seen in the input. For the above example, the table would look like this:
s  t  r  i  n  g  others
5  4  3  2  1  0     6
This table can be computed once, before the search is started, by making a single pass through the pattern. After that, it can be used each time we search for that pattern, leading to a huge performance win.
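That single pass can be sketched in a few lines of Java. The version below works on plain ASCII characters rather than collation elements (the names ShiftTable and buildShifts are mine, purely for illustration); a later occurrence of a repeated letter simply overwrites an earlier one, which leaves the smaller, safer shift distance in the table:

```java
public class ShiftTable {
    // Build the bad-character shift table for a plain ASCII pattern:
    // shifts[ch] is how far the pattern may slide when character 'ch'
    // causes a mismatch in the text being searched.
    static int[] buildShifts(String pattern) {
        int[] shifts = new int[128];
        int len = pattern.length();

        // Characters that never occur in the pattern allow the
        // maximum shift: the full pattern length
        for (int i = 0; i < shifts.length; i++) {
            shifts[i] = len;
        }

        // For each pattern character, record its distance from the
        // end of the pattern. Later occurrences overwrite earlier
        // ones, keeping the smaller shift for repeated letters.
        for (int i = 0; i < len; i++) {
            shifts[pattern.charAt(i)] = len - 1 - i;
        }
        return shifts;
    }

    public static void main(String[] args) {
        int[] shifts = buildShifts("string");
        System.out.println("s=" + shifts['s'] + " g=" + shifts['g']
            + " others=" + shifts[' ']);
    }
}
```

Running this for the pattern "string" reproduces the table above: s maps to 5, g to 0, and any character not in the pattern to 6.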
Boyer-Moore vs. Unicode
But wait! At the very beginning of this article, I said that this kind of algorithm doesn't work well with Unicode because it has 65,535 possible character values, which would make the table too large. Actually, it's worse, because we're concerned with collation elements, which are 32-bit integers, not with the Unicode values themselves. That's true, but (of course) there's another trick...
First, consider what happens when a letter occurs twice in the search pattern. There are two possible shift distances for that letter, one for each occurrence. To make the Boyer-Moore algorithm work, we always want to enter the smaller of the two shift distances in the table. If we used the larger one, we might shift the pattern too far and miss a match. In a sense, the shift table is not required to be perfectly accurate, and conservative estimates of shift distances are OK. As long as we don't shift the pattern too far, we're fine.
This realization leads to a simple technique for applying the algorithm to Java collation elements: simply map all possible elements down to a much smaller set of shift table indices (say, 256). If two or more elements in your pattern happen to collide and end up with the same index, it's not a problem as long as you enter the smaller of the shift distances in the table.
A simple way to map the collation elements to 256 values is to use the low byte of the element's primary component. There are other approaches that will lead to a better distribution of values throughout the shift table and will thus give slightly better performance, but in practice this approach is usually good enough.
To implement Boyer-Moore searching with JDK 1.2, we first need to construct a shift
table that tells us how far to shift the pattern when a particular collation element is
seen in the text. The hash
function is
used to map from a 32-bit collation order down to an index in the 256-element array.
// Member variables for storing precomputed pattern data
private int patLen;
private int[] patElem;
private int[] shifts;

// Map a collation element to an array index
int hash(int order) {
    return CollationElementIterator.primaryOrder(order) % 256;
}

// Initialize the Boyer-Moore shift tables
void initialize(RuleBasedCollator c, String pat) {
    // First find out how many elements we're dealing with
    patLen = 0;
    CollationElementIterator iter = c.getCollationElementIterator(pat);
    while (iter.next() != CollationElementIterator.NULLORDER) {
        patLen++;
    }

    // Allocate space to store the pattern elements and the shift tables
    patElem = new int[patLen];
    shifts = new int[256];

    // Elements not found in the pattern get the maximum shift distance
    for (int i = 0; i < 256; i++) {
        shifts[i] = patLen;
    }

    // Now compute the shift distances for the elements in the pattern.
    // While we're at it, save the elements themselves for quick access.
    // The "-1" is in the calculation because Java indices are 0-based.
    iter.reset();
    for (int i = 0; i < patLen; i++) {
        patElem[i] = iter.next();
        int index = hash(patElem[i]);
        shifts[index] = Math.min(shifts[index], patLen - i - 1);
    }
}
Once we have the tables, the search routine is straightforward. It uses another new JDK
1.2 method: CollationElementIterator.previous
. Also note that there is no longer an outer loop that calls a separate match
method, since that only worked well when we were marching through
the text one character at a time. Now that we can skip ahead an arbitrary distance through
the text, it is easier to combine all of the logic into one method.
public int find(String text, String pattern) {
    RuleBasedCollator coll = (RuleBasedCollator)Collator.getInstance();
    CollationElementIterator targIter = coll.getCollationElementIterator(text);

    // Build the shift table and the constants we need
    initialize(coll, pattern);
    int mask = getMask(weight);
    int done = CollationElementIterator.NULLORDER & mask;

    // Start at the text position corresponding to the end of the pattern
    int textIndex = pattern.length();

    while (textIndex <= text.length()) {
        boolean getPattern = true, getTarget = true;
        int targetElement = 0, patternElement = 0;

        targIter.setOffset(textIndex);
        int patIndex = patLen;

        // Iterate backward until we hit the beginning of the pattern
        while (patIndex > 0) {
            if (getTarget) targetElement = targIter.previous() & mask;
            if (getPattern) patternElement = patElem[--patIndex] & mask;
            getTarget = getPattern = true;

            if (targetElement == 0) {
                getPattern = false;     // skip targetElement
            } else if (patternElement == 0) {
                getTarget = false;      // skip patternElement
            } else if (targetElement != patternElement) {
                // There is a mismatch at this position.  Decide how far
                // over to shift the pattern, then try again.
                textIndex = targIter.getOffset() + shifts[hash(targetElement)];
                break;
            }
        }
        if (patIndex == 0) {
            // We made it back to the beginning of the pattern,
            // which means we matched it all.  Return the location.
            return targIter.getOffset();
        }
        // Otherwise, we're here because of a mismatch, so keep going....
    }
    return -1;    // No match.
}
There you have it: a way to do fast, linear-time, international-friendly string searching in Java.
The Real Stuff
I hope this article has given you a good idea of how you can use collators to add language-sensitive sorting and searching to your own Java applications. It is not that hard, and the benefits can be enormous, because global markets are becoming increasingly important to the computer industry. For example, according to IBM's first-quarter 1998 financial statement, over half of IBM's revenue came from outside North America. Using the international features of the JDK can help you begin to tap into this huge market.
The code examples in this article were intended primarily for their educational value,
but they do work. However, I have ignored a few remaining issues to keep the
code easier to understand. Expanding ("ä" to "ae") characters in the
pattern are the chief difficulty. If the shorter, non-expanded version of the character
occurs in the text being searched, you can end up shifting too far and missing a possible
match. However, it is not too hard to compensate for this by using the getMaxExpansion
method (also new in JDK 1.2) of CollationElementIterator
to decrease the shift distances when expanded characters are seen
in the pattern.
The other major feature I have left out is the ability to tell how much of the text
matched the pattern. All of the code examples search for location in the text where a
match starts, but they do not return the length of the text that matched. You would need
to know this if you were writing a search and replace function in an editor, for example.
In JDK 1.2, it is easy to tell where the match ends: just call the iterator's getOffset
method. With the JDK 1.1 API, it is harder; you basically have to
resort to the brute-force technique of comparing longer and longer substrings of the text
to the pattern and stopping when the collator says that they are equal.
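That JDK 1.1 brute-force technique can be sketched as follows. The class and method names (MatchLength, matchLength) are mine, not part of the JDK; the idea is simply to grow a substring from the match position until the collator reports equality with the pattern:

```java
import java.text.Collator;

public class MatchLength {
    // Given a match starting at 'start', return the number of
    // characters of 'text' that the pattern matched, or -1 if no
    // prefix of the remaining text compares equal to the pattern.
    static int matchLength(Collator c, String text, int start,
                           String pattern) {
        for (int end = start + 1; end <= text.length(); end++) {
            if (c.compare(text.substring(start, end), pattern) == 0) {
                return end - start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Collator c = Collator.getInstance();
        String text = "Now is the time for all good men";
        // The match for "for" starts at index 16 and is 3 chars long
        System.out.println(matchLength(c, text, 16, "for"));
    }
}
```

Note how expensive this is compared to a single getOffset call: every candidate length allocates a substring and runs a full collator comparison.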
If you want to see real, production-quality code that uses the same algorithms and solves these last few problems, visit www.alphaworks.ibm.com and download our fully-functional text searching class based on the JDK collation API. It supports case- and accent-insensitive searches, backward searches, and "whole word" searching, among other features. AlphaWorks also contains several other Java internationalization utilities that you might find useful, as well as a large number of JavaBeans, utility classes, and even a free XML parser written in Java.
Acknowledgements
The patented technique for applying the Boyer-Moore search algorithm to collation elements was developed by Dr. Mark Davis of IBM. Kathleen Wilson, the manager of the text and international groups at IBM's Center for Java Technology in Silicon Valley, was very indulgent of the time I spent working on this article and the accompanying code. I would also like to thank Mark, Kathleen, Michael Pogue, John Raley, and Helena Shih Chapman for reviewing drafts of this article.
References
Algorithms in C, 3rd edition, by Robert Sedgewick (Addison-Wesley, 1997) has a good overview of the Boyer-Moore search algorithm as applied to small character sets. He also covers a number of other search algorithms.
The Java Class Libraries, 2nd Edition, vol. 1, by Chan, Lee, and Kramer (Addison-Wesley, 1998) has a nice description of CollationElementIterator.
Making your Java/C++/C Applications Global, at www.ibm.com/java/education/international-unicode/unicode1.html is a good overview of some of the issues involved in writing global applications.
The Unicode Standard, Version 2.0, (Addison-Wesley, 1996) has lots of useful information on processing international text. In particular, section 5.15 covers sorting and searching.
The Unicode Consortium web site at www.unicode.org has lots of good information. If you want an even more detailed explanation of Unicode collation, see www.unicode.org/unicode/reports/tr10/.
Laura Werner is a Senior Software Engineer and Project Leader for the Java Internationalization effort at the IBM Center for Java Technology in Cupertino, CA. After receiving Bachelor's degrees in Geological Sciences and Integrated Science (an honors, interdisciplinary program) from Northwestern University, she worked at SPSS, Inc. and UC Berkeley before joining Taligent in 1994. There, she helped port Taligent's C++ frameworks to Windows NT, wrote the OpenClass file system framework, and participated in the initial design of the Java international frameworks. Now at IBM, she is responsible for maintaining and enhancing the JDK collation classes, leading other Java international projects, and working on occasional side projects such as this paper.