The Java International API: Beyond JDK 1.1

Helena Shih Chapman, Mark Davis
October 1998

Originally published on http://www-106.ibm.com/developerworks/java/library/j-intljava.html

Contents:
Date and time support
Locales and resources
Formatting and parsing
Conclusion
Acknowledgements
About the authors

The designers of Java made the important design decision that all text would be stored in Unicode. This solves the problem inherent in most other text-handling schemes, of always having to juggle multiple, limited character encodings. It puts all languages on an equal footing, and makes the whole process of designing for worldwide products far easier. But proper support of international text requires far more than just storing characters in Unicode.

IBM's wholly owned subsidiary, Taligent, had a great deal of previous experience in Unicode software internationalization. In 1996, Sun contracted with Taligent to design and develop classes for the proper handling of multilingual text for JDK 1.1. Our goals were to provide an architecture that supplied the required functionality, was fully object oriented, could easily be extended to add additional features or to support additional countries, and would scale well across both small and large projects written in Java. There are some aspects of the architecture that we frequently get questions or complaints about, so we'll explain why we made some of the decisions we did.

In JDK 1.1 we focused on the "server level" support: that is, on the mid-level internationalization services. Most of the low-level text services were already in JDK 1.0 and only needed some enhancement. The high-level international services (for input and output) utilized the host platform services in JDK 1.1. In JDK 1.2 the high-level international services have been greatly improved and no longer depend on the host platform services.

In this paper, we will preview some of the most important new features under development in the mid-level internationalization services. Many of these features are coming in JDK 1.2. IBM is making others available through different channels, including classes available on the IBM AlphaWorks website at http://www.alphaWorks.ibm.com/. (Note: IBM also has C and C++ versions of its Unicode internationalization services; see http://www.ibm.com/developer/unicode/ for more information.)

Date and time support
The Calendar class contains API that allows you to interpret a Date according to a local calendar system, even non-Gregorian ones. It also contains routines to support GUI requirements, such as rolling, adding and subtracting dates and times. The TimeZone class enables the conversion between universal time (UTC) and local time. It also contains rules for figuring out the daylight savings time according to the local conventions.

Why is January zero?
We probably get more complaints about this than any other issue. Here's what happened.

The JDK 1.0 Date API and implementation were very specific to the Gregorian calendar, and were not terribly Y2K friendly. Although the Gregorian calendar is used in most of the world, many countries use other calendar systems. For instance, businesses in Europe often use a calendar that measures by day, week and year, rather than day, month, and year. There are also a large number of traditional calendars in widespread use in the Middle East and Asia. To deal with these problems, we split out the date computations into another class, Calendar, and retained Date purely as a storage class.

The zero-based month numbers in Date were a vestige of old C-style programming: originally month names were stored in a zero-based array, and the months were numbered accordingly for convenience. JavaSoft felt that consistency with the old Date APIs was important, so we needed to keep this convention in Calendar. So calling calendar.set(1998, 3, 5) gives you April 5th, not March 5th.
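The zero-based convention is easy to sidestep by using the named month constants on Calendar. The following is a minimal sketch; the class name ZeroBasedMonths and the helper dateOf are made up for illustration.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

final class ZeroBasedMonths {
    /** Builds a calendar for the given date; the month is zero-based, as in Calendar. */
    static Calendar dateOf(int year, int zeroBasedMonth, int day) {
        Calendar cal = new GregorianCalendar();
        cal.clear();                         // drop the current time-of-day fields
        cal.set(year, zeroBasedMonth, day);
        return cal;
    }

    public static void main(String[] args) {
        Calendar literal = dateOf(1998, 3, 5);              // 3 means April
        Calendar named = dateOf(1998, Calendar.APRIL, 5);   // same date, clearer intent
        System.out.println(literal.get(Calendar.MONTH) == Calendar.APRIL);  // true
        System.out.println(literal.getTime().equals(named.getTime()));      // true
        // Convert to the 1-based number a user expects to see.
        System.out.println(literal.get(Calendar.MONTH) + 1);                // 4
    }
}
```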

Time zone display names
Every abstract class in the internationalization frameworks except for TimeZone has a getDisplayName() function. This means there's no easy way to get the displayable name for a time zone in JDK 1.1. This has led to some confusion, since some people thought that TimeZone.getID() returned a displayable name, when it actually returns an internal programmatic ID, one that should not be displayed to end users. Moreover, the internal IDs themselves were too short and confusing: "AST" could stand for either "Atlantic Standard Time" or "Alaska Standard Time". To remedy this, the new method getDisplayName() has been added to TimeZone in JDK 1.2, and longer, more descriptive internal IDs are available.


// In JDK 1.1
TimeZone zone = TimeZone.getTimeZone("EST");
// "zzzz" requests the full zone name; a single "z" yields the abbreviation
SimpleDateFormat fmt = new SimpleDateFormat("zzzz", Locale.ENGLISH);
fmt.getCalendar().setTimeZone(zone);
String name = fmt.format(new Date());
// name is "Eastern Standard Time"

// In JDK 1.2
TimeZone zone = TimeZone.getTimeZone("America/New_York");
String name = zone.getDisplayName(Locale.ENGLISH);
// name is "Eastern Standard Time"

Better Y2K support
Should "01/01/00" be year 2000 or year 1900? JDK 1.1 used the 80-20 rule. This amounts to adding 1900 to the two-digit year, and if the result was more than 80 years in the past, adding another 100 (for the Gregorian calendar). The JDK 1.2 method SimpleDateFormat.set2DigitYearStart() provides more specific control. This method sets the exact start of a 100-year range in which 2-digit years are interpreted.


// In JDK 1.1: There is no way to specify when the 2 digit year starts.

// In JDK 1.2:
GregorianCalendar cal = new GregorianCalendar(1952, Calendar.SEPTEMBER, 13);
SimpleDateFormat fmt = new SimpleDateFormat("M-d-yy");
fmt.set2DigitYearStart(cal.getTime()); // range runs from 9-13-1952 through 9-12-2052
fmt.parse("9-12-52"); // returns 9-12-2052
fmt.parse("9-14-52"); // returns 9-14-1952

Improved Daylight Savings switchover
In JDK 1.1, SimpleTimeZone allows the start and end dates for Daylight Savings Time to be specified in only one way, as the Nth or Nth-from-last occurrence of a given weekday in a given month, e.g., the last Sunday in October. However, some time zones have more complicated rules for the switchover dates. For example, in Brazil Eastern Time, DST ends on the first Sunday on or after February 11th, which cannot be expressed with the JDK 1.1 APIs.

This was resolved in JDK 1.2 by adding several new types of DST start and end rules. The following rule types will handle all known modern and historical time zones and provide more flexibility for the future:

  1. A fixed date in a given month, e.g. the 1st of April.
  2. The first occurrence of a given day of the week on or after a certain date in the month, e.g. the first Sunday on or after February 18th, or equivalently, the first Sunday after the second Thursday.
  3. The first occurrence of a given day of the week on or before a certain date in the month.
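The rule types above map onto the setStartRule() and setEndRule() overloads of SimpleTimeZone. The sketch below is illustrative only: the zone ID, the UTC-3 offset, and the October 15 start date are invented, and only the "first Sunday on or after February 11th" end rule comes from the text.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.SimpleTimeZone;

final class DstRuleSketch {
    static final int HOUR = 60 * 60 * 1000;

    /** A made-up zone at UTC-3; only the February end rule comes from the text. */
    static SimpleTimeZone hypotheticalZone() {
        SimpleTimeZone zone = new SimpleTimeZone(-3 * HOUR, "Hypothetical/East");
        // Rule type 1: a fixed date, here October 15 at midnight (illustrative).
        zone.setStartRule(Calendar.OCTOBER, 15, 0);
        // Rule type 2: first Sunday on or after February 11, at midnight wall time.
        zone.setEndRule(Calendar.FEBRUARY, 11, Calendar.SUNDAY, 0, true);
        return zone;
    }

    /** Returns local noon on the given day, interpreted in the given zone. */
    static Date noon(SimpleTimeZone zone, int year, int month, int day) {
        Calendar cal = new GregorianCalendar(zone);
        cal.clear();
        cal.set(year, month, day, 12, 0, 0);
        return cal.getTime();
    }

    public static void main(String[] args) {
        SimpleTimeZone zone = hypotheticalZone();
        // February 11, 1999 was a Thursday, so the rule selects Sunday, February 14.
        System.out.println(zone.inDaylightTime(noon(zone, 1999, Calendar.FEBRUARY, 13))); // true
        System.out.println(zone.inDaylightTime(noon(zone, 1999, Calendar.FEBRUARY, 15))); // false
    }
}
```

Passing false as the last argument of setEndRule() selects the third rule type, the last matching weekday on or before the given date.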

Correct rolling
There is no way to implement a date widget with arrow buttons correctly with the Calendar class in JDK 1.1. Suppose the user has selected the MONTH value for Jan 30, 1997, and hits the up arrow twice. The first click correctly sets the control to February 28, 1997, but the second click sets it to March 28, 1997, instead of March 30, 1997.

The correct implementation is to remember the original date and roll the month field the proper number of steps from that original date for each click of the arrows. Unfortunately, in JDK 1.1, you can only roll a field a single unit at a time. JDK 1.2 fixes this problem by adding the ability to roll a field multiple units in a single operation.


// In JDK 1.1: no workaround

// In JDK 1.2:
myCalendar.setTime(aDate);   // always start from the remembered original date
myCalendar.roll(Calendar.MONTH, numberOfArrowClicks);
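To make the difference concrete, here is a small self-contained sketch that compares rolling one unit at a time with a multi-unit roll from the remembered original date. The class and helper names are hypothetical.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

final class RollSketch {
    /** Rolls MONTH by the given number of clicks, always starting from the original date. */
    static Calendar rollFromOriginal(Calendar original, int clicks) {
        Calendar copy = (Calendar) original.clone();
        copy.roll(Calendar.MONTH, clicks);
        return copy;
    }

    /** Formats the date as year-month-day with a 1-based month. */
    static String describe(Calendar cal) {
        return cal.get(Calendar.YEAR) + "-" + (cal.get(Calendar.MONTH) + 1)
                + "-" + cal.get(Calendar.DAY_OF_MONTH);
    }

    public static void main(String[] args) {
        Calendar original = new GregorianCalendar(1997, Calendar.JANUARY, 30);

        // One unit at a time: the day is pinned to 28 in February and stays there.
        Calendar stepwise = (Calendar) original.clone();
        stepwise.roll(Calendar.MONTH, 1);
        stepwise.roll(Calendar.MONTH, 1);
        System.out.println(describe(stepwise));                      // 1997-3-28

        // Multi-unit roll from the remembered original keeps the 30th.
        System.out.println(describe(rollFromOriginal(original, 1))); // 1997-2-28
        System.out.println(describe(rollFromOriginal(original, 2))); // 1997-3-30
    }
}
```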

International Calendar classes
Although the Calendar class is architected to allow for multiple calendars, both JDK 1.1 and 1.2 only include support for the Gregorian calendar. However, IBM is previewing a large set of international calendars on the AlphaWorks web site, currently including Hebrew, Islamic, Buddhist, and Japanese calendars.

Locales and resources
A locale in the JDK is merely an identifier. This identifier is made up of the ISO language code and country code, plus optional variants (for information on the ISO codes, see http://www.unicode.org/unicode/onlinedat/). Since Locale is just a lightweight identifier, there is no need for validity checking when you construct a locale. Whenever you construct an international object, you have the opportunity to supply an explicit Locale, or you can use whatever the current default locale is on your system:


Collator col = Collator.getInstance(Locale.FRANCE);
if (col.compare(string1, string2) < 0) {
  ... // based on the French locale's sort sequence
}

Collator col = Collator.getInstance();
if (col.compare(string1, string2) < 0) {
  ... // based on the default locale's sort sequence
}

The ResourceBundle class provides a way to isolate translatable text or localizable objects from your core source code. For example, resource bundles can be used for translatable error messages, or building translatable components. The JDK also uses resource bundles to hold its own localized data. For example, when you ask for a NumberFormat object, the necessary formatting information is retrieved from a resource bundle.

Why can't you set the default locale in Applets?
People frequently ask for the ability to call Locale.setDefault() within an applet. The problem is that a single JVM can run more than one applet at a time in the same address space. Locale.setDefault() would change the default locale for the whole address space, which means that all of the applets would be affected; this is considered a security violation. To work around this, set the applet's locale instead of using Locale.setDefault(). When you need an international class, supply the locale explicitly:

 
NumberFormat nf = NumberFormat.getInstance(myApplet.getLocale());

ResourceBundle fallback detection
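The ResourceBundle implementation currently includes a fallback mechanism. If the specified resource can't be found in the specified locale, ResourceBundle searches progressively less specific locales: first the requested language, country, and variant; then the language and country; then the language alone; then the same sequence for the default locale; and finally the base bundle with no locale suffix.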

Sometimes this is not what you want, or at least you may want to be able to detect when a particular piece of data came from a fallback locale rather than the specified one. For example, suppose you wanted a specific resource from the French Belgian locale, and there is only a French locale installed--you'll get the wrong resource. In JDK 1.2 we added a method, getLocale(), to find out the actual locale that a resource bundle comes from, so that you can determine if a fallback was used.


// In JDK 1.2
Locale frBE_Locale = new Locale("fr", "BE");

ResourceBundle rb = ResourceBundle.getBundle("MyResources", frBE_Locale);

if (!rb.getLocale().equals(frBE_Locale)) {
    // French Belgian resources not available, report an error
}
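The order in which bundle names are tried can be inspected directly. The sketch below uses ResourceBundle.Control, an API added to the JDK well after 1.2, so it illustrates the search order rather than anything available in the releases discussed here. It lists only the candidates derived from the requested locale; the additional fallback to the default locale is not shown. The class name is hypothetical, and "MyResources" is the same placeholder bundle name used above.

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

final class FallbackChainSketch {
    /** Lists the locales tried for a bundle, most specific first. */
    static List<Locale> candidates(String baseName, Locale requested) {
        ResourceBundle.Control control =
                ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        return control.getCandidateLocales(baseName, requested);
    }

    public static void main(String[] args) {
        Locale frBE = Locale.forLanguageTag("fr-BE");
        for (Locale candidate : candidates("MyResources", frBE)) {
            String suffix = candidate.toString();   // empty for the root locale
            System.out.println(suffix.isEmpty() ? "MyResources" : "MyResources_" + suffix);
        }
        // Prints MyResources_fr_BE, MyResources_fr, MyResources
    }
}
```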

Comparison and boundaries
In JDK 1.1, Collator allows you to compare strings in a language-sensitive way. The standard comparison in String will just do a binary comparison. For strings that will be displayed to the user, this is almost always incorrect! Wherever the ordering or equality of strings is important to the user, such as when presenting an alphabetized list, then use a Collator instead. Otherwise a German, for example, will find that you don't equate two strings that she thinks are equal.


 if (string1.compareTo(string2) < 0) {... // bitwise comparison
 Collator col = Collator.getInstance();
 if (col.equals(string1, string2)) {...
 ...
 if (col.compare(string1, string2) < 0) {...
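The effect is easiest to see when sorting a list. In this sketch (class and method names are illustrative), the binary ordering puts every uppercase word before every lowercase one, while the Collator produces the alphabetical order a user expects.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class CollatorSortSketch {
    /** Sorts by String.compareTo, i.e. by raw character values. */
    static String[] binarySort(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Sorts with the collation rules of the given locale. */
    static String[] localeSort(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"Zebra", "apple", "Mango"};
        System.out.println(Arrays.toString(binarySort(words)));
        // [Mango, Zebra, apple]
        System.out.println(Arrays.toString(localeSort(words, Locale.ENGLISH)));
        // [apple, Mango, Zebra]
    }
}
```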

Why have CharacterIterator?
The CharacterIterator class is used in BreakIterator and a few other places in the JDK, and is used even more in JDK 1.2. Sometimes we are asked why we didn't use String or StringBuffer instead.

String and StringBuffer are simple classes that store their characters contiguously. Insertion or deletion of characters in a StringBuffer ends up shifting all the characters that follow, which works fine for reasonably small numbers of characters. However, this model doesn't scale well. Consider a word processor, for example, where shifting many kilobytes of characters just to insert or delete one character involves far too much extra work. For acceptable performance in these circumstances, text needs to be stored in data structures that use internally discontiguous chunks of storage.

We needed some way to have a more abstract representation of text that could be used both for String and for larger-scale text models. Unfortunately, we couldn't change String and StringBuffer to descend from an abstract class that would provide this sort of representation. To resolve this problem, we added a very minimal interface, CharacterIterator. This interface allows both sequential (forward and backward) and random access to characters from any source, not just from a String or StringBuffer.
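As a rough illustration of the three access styles, the sketch below walks a CharacterIterator forward, backward, and by index. It uses StringCharacterIterator, the String-backed implementation in java.text; the same methods would work unchanged against an iterator over a chunked text store. The class and helper names are made up.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

final class CharacterIteratorSketch {
    /** Reads every character from first to last. */
    static String forward(CharacterIterator it) {
        StringBuilder out = new StringBuilder();
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            out.append(c);
        }
        return out.toString();
    }

    /** Reads every character from last to first. */
    static String backward(CharacterIterator it) {
        StringBuilder out = new StringBuilder();
        for (char c = it.last(); c != CharacterIterator.DONE; c = it.previous()) {
            out.append(c);
        }
        return out.toString();
    }

    /** Random access: moves the iterator to index and returns the character there. */
    static char charAt(CharacterIterator it, int index) {
        return it.setIndex(index);
    }

    public static void main(String[] args) {
        CharacterIterator it = new StringCharacterIterator("abc");
        System.out.println(forward(it));    // abc
        System.out.println(backward(it));   // cba
        System.out.println(charAt(it, 1));  // b
    }
}
```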

Rule-based BreakIterator
The BreakIterator class finds character, word, line and sentence boundaries, which may vary depending on the locale. The JDK 1.1 BreakIterator implementation uses a state machine, which makes it very fast. However, it does not allow the behavior to vary depending on the locale. If the built-in classes don't support behavior the clients want, they must create a completely new BreakIterator subclass of their own--they can't leverage the JDK code at all.

Therefore, we undertook an extensive revision of the BreakIterator framework. The new RuleBasedBreakIterator class essentially works the same way the old class did, but it builds the category and state tables from a textual description, which is essentially a string of regular expressions. This description can be loaded from a resource--allowing different breaking rules for different languages--or supplied by the client--allowing runtime customization. This class is provided on the AlphaWorks web site.
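For readers unfamiliar with the basic API, this is a minimal sketch of boundary iteration using the standard BreakIterator.getWordInstance() factory, not the rule-based class from alphaWorks. The class name and the filter that keeps only pieces starting with a letter or digit are illustrative choices.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBreakSketch {
    /** Returns the words in text, skipping the punctuation and space segments. */
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> result = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH)); // [Hello, world]
    }
}
```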

Locale-sensitive searching
The CollationElementIterator class is intended for use in locale-sensitive text searching. However, it is missing several methods in JDK 1.1, which makes it impossible to use with fast string searching algorithms such as Boyer-Moore. The following new methods were added in JDK 1.2 to fix this:

  1. The getOffset() method tells where a collation element was found.
  2. The previous() and setOffset() methods enable backing up and moving around in the text being searched.
  3. The new setText() method allows reuse of a CollationElementIterator. When collating or searching a large number of strings, it is much faster to reuse one CollationElementIterator than to construct a new one each time.
  4. The isIgnorable() method tells whether a collation element is ignorable.
  5. The getMaxExpansion() method returns the maximum length of any expansion sequence producing a given character. A fast search algorithm needs to know the maximum "shift" distance in looking for possible match sites. This is complicated by the fact that in natural language, a match can occur with different numbers of characters. If a search pattern for German text contains "oe", for example, it can match the single character "ö" in the text being searched. With the maximum expansion length, a fast search algorithm can compute the correct lower limit on shift distances.
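A search routine built on these methods compares collation elements rather than characters. The sketch below shows only that first step: it reuses one iterator via setText() and extracts primary-order keys, so that "abc" and "ABC" produce the same key sequence. It assumes the Collator returned for English is a RuleBasedCollator, which holds for the standard JDK but is not guaranteed by the API. The class and method names are hypothetical.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class CollationElementSketch {
    /** Creates a reusable iterator; assumes the English collator is rule-based. */
    static CollationElementIterator englishIterator() {
        RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        return collator.getCollationElementIterator("");
    }

    /** Collects the primary-order keys of text, skipping elements with no primary weight. */
    static List<Integer> primaryKeys(CollationElementIterator it, String text) {
        it.setText(text);   // reuse one iterator instead of constructing a new one
        List<Integer> keys = new ArrayList<>();
        for (int element = it.next(); element != CollationElementIterator.NULLORDER;
                element = it.next()) {
            int primary = CollationElementIterator.primaryOrder(element);
            if (primary != 0) {
                keys.add(primary);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        CollationElementIterator it = englishIterator();
        // Case is not a primary difference, so the key sequences match.
        System.out.println(primaryKeys(it, "abc").equals(primaryKeys(it, "ABC"))); // true
        System.out.println(primaryKeys(it, "abc").equals(primaryKeys(it, "abd"))); // false
    }
}
```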

Unicode normalization
Unicode is more than just "wide ASCII". One of the principal operations on Unicode is to normalize text, ensuring that you have a unique spelling for a given text. Text normalization includes decomposition and composition forms of characters. Text can be normalized to be a canonical equivalent to the original unnormalized text, or to be a compatibility equivalent to the original unnormalized text. For more information, please see Unicode technical report #15 on http://www.unicode.org/unicode/reports/tr15/.

One of the Unicode normalization forms is used internally as a part of JDK 1.1, but it is not public. The Normalizer class incorporates this technology, and allows either batch or incremental normalization of text. This class is provided on the AlphaWorks web site.
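The four normalization forms can be tried out with java.text.Normalizer, which entered the standard library in a much later release than those discussed here. Its API is not necessarily the same as that of the alphaWorks class, so treat this as a sketch of the concept only. The class and method names are illustrative.

```java
import java.text.Normalizer;

final class NormalizationSketch {
    /** Canonical composition (NFC). */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Canonical decomposition (NFD). */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Compatibility composition (NFKC). */
    static String compatibilityCompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String combining = "e\u0301";   // 'e' followed by a combining acute accent
        String precomposed = "\u00e9";  // the single character e-acute
        System.out.println(compose(combining).equals(precomposed));    // true
        System.out.println(decompose(precomposed).equals(combining));  // true
        // The "fi" ligature is only a compatibility equivalent of "fi".
        System.out.println(compose("\uFB01").equals("fi"));              // false
        System.out.println(compatibilityCompose("\uFB01").equals("fi")); // true
    }
}
```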

Formatting and parsing
JDK 1.1 provides a rich set of functionality for formatting values into strings and parsing strings into values in a locale-sensitive way. These include numbers, dates, times, and messages.

Number formatting supports spreadsheet-style patterns. For example, a format such as "#,##0.00#" will produce output like "1,234.567" or "5.00"; the pattern specifies that you have at least 2 decimal digits, but no more than 3. You can also reset the decimals and other characteristics of the pattern programmatically. Number formatting also provides powerful pattern parsing support for proportional font decimal alignment.

Date/Time formatting supports similar features, and is fully integrated with Calendar. Message formatting allows access to number, date, and time formatting within the context of a localizable string.
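The number pattern described above can be exercised as follows. This is a minimal sketch; the class name is made up, and the locale is passed explicitly so the output does not depend on the machine's default.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class NumberPatternSketch {
    /** Formats value with the given pattern, using the separators of the given locale. */
    static String format(String pattern, double value, Locale locale) {
        DecimalFormat fmt = new DecimalFormat(pattern, new DecimalFormatSymbols(locale));
        return fmt.format(value);
    }

    public static void main(String[] args) {
        String pattern = "#,##0.00#";   // at least 2 and at most 3 decimal digits
        System.out.println(format(pattern, 1234.567, Locale.US));      // 1,234.567
        System.out.println(format(pattern, 5, Locale.US));             // 5.00
        // The same pattern picks up locale-specific separators.
        System.out.println(format(pattern, 1234.567, Locale.GERMANY)); // 1.234,567
    }
}
```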

Substitutable currencies
NumberFormat provides the factory method getCurrencyInstance(), which creates an object that can convert numbers to and from strings in the currency format of a given locale. In JDK 1.1, these formats were treated just like any other number formats. They were constructed from strings that were fetched from ResourceBundles. In 1.2, the currency symbol can be specified independently from the rules for decimal places, thousands separator, and so on, and is supplied in the pattern with the international currency symbol ("¤" = "\u00A4").


// In JDK 1.1: can't change currency symbols

// In JDK 1.2
DecimalFormat fmt = (DecimalFormat) NumberFormat.getCurrencyInstance(Locale.US);
DecimalFormatSymbols us_syms = fmt.getDecimalFormatSymbols();
us_syms.setCurrencySymbol("US$ ");
fmt.setDecimalFormatSymbols(us_syms);
String result = fmt.format(1234.56);   // result is "US$ 1,234.56"

ISO currency codes
Additionally, we added an API to retrieve the 3-letter international currency codes defined in ISO 4217. These are necessary in an application that deals with many different currencies, because the regular, one-character currency symbols are often shared by many different currencies. For example, both the US and Canada use "$" in their default currency format. An application dealing with both currencies will probably want to use "USD" and "CAD" instead. In JDK 1.2, this is now possible, using a sequence of two international currency symbols ("¤¤" = "\u00A4\u00A4") in the pattern.

 
// In JDK 1.1: can't get 3-letter currency codes
 
// In JDK 1.2: 
fmt = new DecimalFormat("\u00a4\u00a4 #,##0.00;(\u00a4\u00a4 #,##0.00)");
result = fmt.format(1234.56);       // result is "USD 1,234.56".
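The US and Canada case mentioned above can be sketched with explicit locales, so that the result does not depend on the default locale. The class and method names are hypothetical; the pattern is a shortened form of the one shown above.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class CurrencyCodeSketch {
    /** Formats an amount with the locale's ISO 4217 code in place of its symbol. */
    static String withIsoCode(double amount, Locale locale) {
        // A doubled currency sign in the pattern selects the international code.
        DecimalFormat fmt = new DecimalFormat("\u00a4\u00a4 #,##0.00",
                new DecimalFormatSymbols(locale));
        return fmt.format(amount);
    }

    public static void main(String[] args) {
        System.out.println(withIsoCode(1234.56, Locale.US));     // USD 1,234.56
        System.out.println(withIsoCode(1234.56, Locale.CANADA)); // CAD 1,234.56
    }
}
```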

Parse error information
The abstract method parseObject() in java.text.Format is used to parse strings and turn them into objects. In JDK 1.1, the program can find out how far the parse got so that it can continue from that point on. However, it cannot find out how far it got if there was an error. In JDK 1.2 a new field, errorOffset, now contains that information. If an error occurs during parsing, the formatters set this value before returning an error or throwing an exception.

In the following example, a text field is parsed for a number. If an error is found, the text beyond the error is highlighted, a message is displayed, and a beep is played.


// In JDK 1.2:
String contents = textField.getText();
try {
    NumberFormat fmt = NumberFormat.getInstance();
    Number value = fmt.parse(contents);
} catch (ParseException foo) {
    errorLabel.getToolkit().beep();
    errorLabel.setText(myResourceBundle.getString("invalid number"));
    textField.select(foo.getErrorOffset(), contents.length());
}
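When parsing with a ParsePosition instead of catching an exception, the failure position is available from ParsePosition.getErrorIndex(). We assume here that this accessor is the public face of the error-offset information described above. The following sketch uses hypothetical class and method names.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

final class ParseErrorSketch {
    /** Returns the index where parsing failed, or -1 if a number was parsed. */
    static int errorIndex(String text) {
        NumberFormat fmt = NumberFormat.getInstance(Locale.US);
        ParsePosition pos = new ParsePosition(0);
        Number value = fmt.parse(text, pos);   // returns null on failure, never throws
        return value == null ? pos.getErrorIndex() : -1;
    }

    /** Returns how many leading characters were consumed as a number. */
    static int consumed(String text) {
        NumberFormat fmt = NumberFormat.getInstance(Locale.US);
        ParsePosition pos = new ParsePosition(0);
        fmt.parse(text, pos);
        return pos.getIndex();
    }

    public static void main(String[] args) {
        System.out.println(errorIndex("abc"));  // 0: nothing could be parsed
        System.out.println(errorIndex("12x"));  // -1: a number was found
        System.out.println(consumed("12x"));    // 2: parsing stopped before 'x'
    }
}
```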

Number format enhancements
On the alphaWorks Web site we provide a class that correctly supports exponentials in number formatting and parsing. The new number formatter supports formats such as "1.2345E3", as well as engineering exponents, where the exponent is always a power of 3. It also supports formatting and parsing BigInteger or BigDecimal values without loss of precision, and "nickel-rounding": the ability to round to multiples of a specified number, such as $0.05. (This is important for some countries whose smallest coin is 5 units instead of 1. The implementation is not restricted to nickels, however, and can be used to round to multiples of any given value.)

Here is an example using the class on the alphaWorks website.


DecimalFormat fmt = new DecimalFormat("0.0000E00");
String result = fmt.format(123456789);
// result is "1.2346E08"
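The idea behind "nickel-rounding" can be sketched with plain BigDecimal arithmetic: divide by the increment, round to a whole number of steps, and multiply back. This is not the alphaWorks implementation, only an illustration of the arithmetic. The class name and the choice of half-even rounding are assumptions.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class IncrementRoundingSketch {
    /** Rounds value to the nearest multiple of increment (ties go to the even multiple). */
    static BigDecimal roundToIncrement(BigDecimal value, BigDecimal increment) {
        BigDecimal steps = value.divide(increment, 0, RoundingMode.HALF_EVEN);
        return steps.multiply(increment);
    }

    public static void main(String[] args) {
        BigDecimal nickel = new BigDecimal("0.05");
        System.out.println(roundToIncrement(new BigDecimal("1.23"), nickel)); // 1.25
        System.out.println(roundToIncrement(new BigDecimal("1.22"), nickel)); // 1.20
        // Any increment works, not just nickels.
        System.out.println(roundToIncrement(new BigDecimal("7.3"), new BigDecimal("0.25"))); // 7.25
    }
}
```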

Number formats in words
The ability to take a numeric value (such as 12,345) and translate it into words (such as "twelve thousand three hundred forty-five") is often needed in business applications, for example, to write out the amount on a check. Number spellout in English is a relatively easy thing to do; good algorithms for this are well-known and widely used. A number-spellout engine that can be customized for any language is another thing altogether.

It's not enough to simply take the algorithm for English and read the literal string values from a resource file. English separates all component parts of a number with spaces; Italian and German do not. Some languages, such as Spanish and Italian, drop the word for "one" from the phrases "one hundred" or "one thousand". There are many other examples that show translating a number into words is not a trivial task.
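For contrast, here is what a hard-coded English-only spellout might look like. It is a sketch with a made-up class name, limited to 0 through 999,999. The spaces, the hyphen, and the explicit "one" before "hundred" and "thousand" are all baked into the code, which is exactly why swapping the word tables for another language is not enough.

```java
final class EnglishSpelloutSketch {
    private static final String[] SMALL = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"
    };
    private static final String[] TENS = {
        "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"
    };

    /** Spells out an integer from 0 to 999,999 in English. */
    static String spell(int n) {
        if (n < 0 || n > 999_999) {
            throw new IllegalArgumentException("out of range: " + n);
        }
        if (n < 20) {
            return SMALL[n];
        }
        if (n < 100) {
            String tens = TENS[n / 10];
            return n % 10 == 0 ? tens : tens + "-" + SMALL[n % 10];
        }
        if (n < 1000) {
            String hundreds = SMALL[n / 100] + " hundred";
            return n % 100 == 0 ? hundreds : hundreds + " " + spell(n % 100);
        }
        String thousands = spell(n / 1000) + " thousand";
        return n % 1000 == 0 ? thousands : thousands + " " + spell(n % 1000);
    }

    public static void main(String[] args) {
        System.out.println(spell(12345)); // twelve thousand three hundred forty-five
        System.out.println(spell(100));   // one hundred
    }
}
```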

To solve these issues, we developed a class called RuleBasedNumberFormat. It's a general, rule-based mechanism for converting numbers to spelled-out strings. This class is available on the alphaWorks web site, along with information on the usage and rule syntax.


   
RuleBasedNumberFormat fmt = new RuleBasedNumberFormat(rules);

String result = fmt.format(1234);
// result is "one thousand two hundred thirty four"

Conclusion
The internationalization services in Java 1.1 provide a wide range of functionality, and are easily extended to add additional features and to support additional countries. We've had an opportunity to discuss some of the design decisions taken in developing these classes, and some of the enhancements that are being included in future releases. A more detailed discussion is available at http://www.ibm.com/developer/unicode/, and includes more about the JDK i18n classes and possible future internationalization improvements that IBM is discussing with Sun. These future possibilities include the following:

We are working on many future enhancements, some of which are available right now on IBM's alphaWorks website at http://www.alphaWorks.ibm.com/. We encourage those interested to download versions from there--any comments on the design and implementation are welcome!

Acknowledgements
Our thanks to Kathleen Wilson, Rich Gillam, and Laura Werner for their extensive review and suggestions for organization of the document. Many other people in IBM and Sun contributed to the Java internationalization efforts.


About the authors
Dr. Mark Davis is a Senior Technical Staff Member responsible for international software architecture. Mark co-founded the Unicode effort, and is the president of the Unicode Consortium. He is a principal co-author and editor of the Unicode Standard, Versions 1.0 and 2.0. At various times, his department has included software groups covering text, international, operating system services, Windows porting, and technical communications. Technically, he specializes in object-oriented programming and in the architecture and implementation of international and text software.

Helena Shih Chapman is the technical lead of the IBM Classes for Unicode at IBM's Center for Java Technology, Cupertino. She previously was a member of the Java i18n team at Taligent, and contributed to the JDK 1.1 international classes. Helena has also worked for Dataware Technologies and Apple's Advanced Technology Group. She holds an MSc. degree from University of Massachusetts. She is a native of Taipei, Taiwan.