Helena Shih Chapman, Mark Davis
October 1998
Originally published on http://www-106.ibm.com/developerworks/java/library/j-intljava.html
Contents:
Date and time support
Locales and resources
Formatting and parsing
Conclusion
Acknowledgements
About the authors
The designers of Java made the important design decision that all text would be stored in Unicode. This solves the problem, inherent in most other text-handling schemes, of always having to juggle multiple, limited character encodings. It puts all languages on an equal footing, and makes the whole process of designing worldwide products far easier. But proper support of international text requires far more than just storing characters in Unicode.
IBM's wholly owned subsidiary, Taligent, had a great deal of previous experience in Unicode software internationalization. In 1996, Sun contracted with Taligent to design and develop classes for the proper handling of multilingual text for JDK 1.1. Our goals were to provide an architecture that supplied the required functionality, was fully object oriented, could easily be extended to add additional features or to support additional countries, and would scale well across both small and large projects written in Java. There are some aspects of the architecture that we frequently get questions or complaints about, so we'll explain why we made some of the decisions we did.
In JDK 1.1 we focused on the "server level" support: that is, on the mid-level internationalization services. Most of the low-level text services were already in JDK 1.0 and only needed some enhancement. The high-level international services (for input and output) utilized the host platform services in JDK 1.1. In JDK 1.2 the high-level international services have been greatly improved and no longer depend on the host platform services.
In this paper, we will preview some of the most important new features under development in the mid-level internationalization services. Many of these features are coming in JDK 1.2. IBM is making others available through different channels, including classes available on the IBM AlphaWorks website at http://www.alphaWorks.ibm.com/. (Note: IBM also has C and C++ versions of its Unicode internationalization services; see http://www.ibm.com/developer/unicode/ for more information.)
Date and time support
The Calendar class contains API that allows you to interpret a Date according to a local calendar system, even a non-Gregorian one. It also contains routines to support GUI requirements, such as rolling, adding, and subtracting dates and times. The TimeZone class enables the conversion between universal time (UTC) and local time. It also contains rules for figuring out daylight savings time according to the local conventions.
Why is January zero?
We probably get more complaints about this than any other issue. Here's what
happened.
The JDK 1.0 Date API and implementation were very specific to the Gregorian calendar, and were not terribly Y2K friendly. Although the Gregorian calendar is used in most of the world, many countries use other calendar systems. For instance, businesses in Europe often use a calendar that measures by day, week, and year, rather than day, month, and year. There are also a large number of traditional calendars in widespread use in the Middle East and Asia. To deal with these problems, we split out the date computations into another class, Calendar, and retained Date purely as a storage class.
The zero-based month numbers in Date were a vestige of old C-style programming: originally month names were stored in a zero-based array, and the months were numbered accordingly for convenience. JavaSoft felt that consistency with the old Date APIs was important, so we needed to keep this convention in Calendar. So calling calendar.set(1998, 3, 5) gives you April 5th, not March 5th.
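A minimal sketch of the zero-based month convention follows. The class name ZeroBasedMonths and the helper humanMonth are hypothetical names chosen for illustration; the Calendar calls are the standard ones.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class ZeroBasedMonths {
    /** Converts Calendar's zero-based month into the 1-based number people expect. */
    static int humanMonth(Calendar cal) {
        return cal.get(Calendar.MONTH) + 1;
    }

    public static void main(String[] args) {
        Calendar cal = new GregorianCalendar();
        cal.clear();

        // Month 3 is April, because January is month 0.
        cal.set(1998, 3, 5);
        System.out.println(cal.get(Calendar.MONTH) == Calendar.APRIL); // true

        // The named constants avoid the off-by-one trap.
        cal.set(1998, Calendar.MARCH, 5);
        System.out.println(humanMonth(cal)); // 3
    }
}
```

Using the Calendar.JANUARY through Calendar.DECEMBER constants instead of literals keeps the intent visible at the call site.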
Time zone display names
Every abstract class in the internationalization frameworks except for TimeZone has a getDisplayName() function. This means there's no easy way to get the displayable name for a time zone in JDK 1.1. This has led to some confusion, since some people thought that TimeZone.getID() returned a displayable name, when it actually returns an internal programmatic ID, one that should not be displayed to end users. Moreover, the internal IDs themselves were too short and confusing: "AST" could stand for either "Atlantic Standard Time" or "Alaska Standard Time". To remedy this, the new method getDisplayName() has been added to TimeZone in JDK 1.2, and longer, more descriptive internal IDs are available.
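The difference between the programmatic ID and the display name can be sketched as follows. ZoneNames and displayName are hypothetical names; the zone ID "America/Los_Angeles" is only an example of the longer ID style, and the exact display text depends on the locale data shipped with the runtime.

```java
import java.util.Locale;
import java.util.TimeZone;

class ZoneNames {
    /** Returns the long, standard-time display name of a zone for the given locale. */
    static String displayName(String zoneId, Locale locale) {
        TimeZone zone = TimeZone.getTimeZone(zoneId);
        // false = standard (non-daylight) name; LONG = spelled-out form.
        return zone.getDisplayName(false, TimeZone.LONG, locale);
    }

    public static void main(String[] args) {
        TimeZone zone = TimeZone.getTimeZone("America/Los_Angeles");
        System.out.println("ID (internal only): " + zone.getID());
        System.out.println("Display name:       " + displayName(zone.getID(), Locale.US));
    }
}
```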
Better Y2K support
Should "01/01/00" be year 2000 or year 1900? JDK 1.1 used the 80-20 rule. This amounts to adding 1900 to the two-digit year and, if the result is more than 80 years in the past, adding another 100 (for the Gregorian calendar). The JDK 1.2 method SimpleDateFormat.set2DigitYearStart() provides more specific control. This method sets the exact start of a 100-year range in which 2-digit years are interpreted.
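The sketch below shows how the 100-year window changes the result. It uses SimpleDateFormat.set2DigitYearStart(), the name under which this control is available in the released java.text API. The class TwoDigitYears, the helper parseYear, and the choice of UTC and the "MM/dd/yy" pattern are assumptions made for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class TwoDigitYears {
    /** Parses an MM/dd/yy string, interpreting yy inside the century starting Jan 1 of windowStartYear. */
    static int parseYear(String text, int windowStartYear) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        SimpleDateFormat fmt = new SimpleDateFormat("MM/dd/yy", Locale.US);
        fmt.setTimeZone(utc);

        // Start of the 100-year range used for two-digit years.
        Calendar start = new GregorianCalendar(utc);
        start.clear();
        start.set(windowStartYear, Calendar.JANUARY, 1);
        fmt.set2DigitYearStart(start.getTime());

        try {
            Calendar parsed = new GregorianCalendar(utc);
            parsed.setTime(fmt.parse(text));
            return parsed.get(Calendar.YEAR);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a date: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseYear("01/01/00", 1950)); // 2000 (window is 1950 through 2049)
        System.out.println(parseYear("01/01/00", 1900)); // 1900 (window is 1900 through 1999)
    }
}
```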
Improved Daylight Savings switchover
In JDK 1.1, SimpleTimeZone allows the start and end dates for Daylight Savings
Time to be specified in only one way, as the Nth or Nth-from-last occurrence
of a given weekday in a given month, e.g., the last Sunday in October. However,
some time zones have more complicated rules for the switchover dates. For example,
in Brazil Eastern Time, DST ends on the first Sunday on or after February 11th,
which cannot be expressed with the JDK 1.1 APIs.
This was resolved in JDK 1.2 by adding several new types of DST start and end rules. The new rule types handle all known modern and historical time zones and provide more flexibility for the future.
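As a sketch of the "first Sunday on or after February 11th" case, the five-argument SimpleTimeZone.setEndRule() takes a day of month, a weekday, and a flag selecting "on or after". The zone below is hypothetical: the ID, the -3 hour offset, the October start rule, and the midnight switchover times are assumptions for illustration, not the actual Brazilian rules.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.SimpleTimeZone;

class OnOrAfterRule {
    /** Builds a hypothetical southern-hemisphere zone whose DST ends on the first Sunday on or after Feb 11. */
    static SimpleTimeZone exampleZone() {
        SimpleTimeZone zone = new SimpleTimeZone(-3 * 60 * 60 * 1000, "Example/OnOrAfter");
        // Assumed start: first Sunday in October at midnight (the JDK 1.1 style of rule).
        zone.setStartRule(Calendar.OCTOBER, 1, Calendar.SUNDAY, 0);
        // End: first Sunday on or after February 11 at midnight (true = "on or after").
        zone.setEndRule(Calendar.FEBRUARY, 11, Calendar.SUNDAY, 0, true);
        return zone;
    }

    /** Reports whether local noon on the given date falls in daylight time. */
    static boolean inDaylightTime(SimpleTimeZone zone, int year, int month, int day) {
        Calendar cal = new GregorianCalendar(zone);
        cal.clear();
        cal.set(year, month, day, 12, 0, 0);
        return zone.inDaylightTime(cal.getTime());
    }

    public static void main(String[] args) {
        SimpleTimeZone zone = exampleZone();
        // In 1998, February 11 is a Wednesday, so the rule selects Sunday, February 15.
        System.out.println(inDaylightTime(zone, 1998, Calendar.FEBRUARY, 14)); // true
        System.out.println(inDaylightTime(zone, 1998, Calendar.FEBRUARY, 16)); // false
    }
}
```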
Correct rolling
There is no way to implement a date widget with arrow buttons correctly with
the Calendar class in JDK 1.1. Suppose the user has selected the MONTH value
for Jan 30, 1997, and hits the up arrow twice. The
first click correctly sets the control to February 28, 1997, but the second
click sets it to March 28, 1997, instead of March 30, 1997.
The correct implementation is to remember the original date and roll the month field the proper number of steps from that original date for each click of the arrows. Unfortunately, in JDK 1.1, you can only roll a field a single unit at a time. JDK 1.2 fixes this problem by adding the ability to roll a field multiple units in a single operation.
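The approach described above can be sketched with the two-argument roll(field, amount). MonthSpinner and rollFromOriginal are hypothetical names for a widget helper; the point is that each click rolls from a copy of the remembered date rather than from the previous result.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class MonthSpinner {
    /** Rolls the month of a copy of the original date by the total number of clicks so far. */
    static Calendar rollFromOriginal(Calendar original, int clicks) {
        Calendar result = (Calendar) original.clone();
        result.roll(Calendar.MONTH, clicks);
        return result;
    }

    public static void main(String[] args) {
        Calendar original = new GregorianCalendar(1997, Calendar.JANUARY, 30);

        // One unit at a time: the day stays pinned at 28 after passing through February.
        Calendar naive = (Calendar) original.clone();
        naive.roll(Calendar.MONTH, true);
        naive.roll(Calendar.MONTH, true);
        System.out.println(naive.get(Calendar.DAY_OF_MONTH)); // 28

        // Rolling two units from the remembered date keeps the 30th.
        Calendar correct = rollFromOriginal(original, 2);
        System.out.println(correct.get(Calendar.DAY_OF_MONTH)); // 30
    }
}
```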
International Calendar classes
Although the Calendar class is architected to allow for multiple calendars,
both JDK 1.1 and 1.2 only include support for the Gregorian calendar. However,
IBM is previewing a large set of international calendars on the AlphaWorks web
site, currently including Hebrew, Islamic, Buddhist, and Japanese calendars.
Locales and resources
A locale in the JDK is merely an identifier. This identifier is made up of the
ISO language code and country code, plus optional variants (for information
on the ISO codes, see http://www.unicode.org/unicode/onlinedat/).
Since Locale is just a lightweight identifier, there is no need for validity
checking when you construct a locale. Whenever you construct an international
object, you have the opportunity to supply an explicit Locale, or you can use
whatever the current default locale is on your system.
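A small sketch of the two choices, using NumberFormat as the international object. ExplicitLocale and format are hypothetical names; the factory methods are the standard ones.

```java
import java.text.NumberFormat;
import java.util.Locale;

class ExplicitLocale {
    /** Formats a value with the conventions of an explicitly supplied locale. */
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        // Explicit locales give the same result on every machine.
        System.out.println(format(1234.5, Locale.US));      // 1,234.5
        System.out.println(format(1234.5, Locale.GERMANY)); // 1.234,5

        // No locale argument: the result follows the system's default locale.
        System.out.println(NumberFormat.getNumberInstance().format(1234.5));
    }
}
```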
The ResourceBundle class provides a way to isolate translatable text or localizable objects from your core source code. For example, resource bundles can be used for translatable error messages, or building translatable components. The JDK also uses resource bundles to hold its own localized data. For example, when you ask for a NumberFormat object, the necessary formatting information is retrieved from a resource bundle.
Why can't you set the default locale in Applets?
People frequently ask for the ability to call Locale.setDefault() within an applet. The problem is that a single JVM can run more than one applet at a time in the same address space. Locale.setDefault() would change the default locale for the whole address space, which means that all of the applets would be affected; this is considered a security violation. To work around this, set the applet's locale instead of using Locale.setDefault(). When you need an international class, supply the locale explicitly.
ResourceBundle fallback detection
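The ResourceBundle implementation currently includes a fallback mechanism: if the specified resource can't be found in the specified locale, ResourceBundle searches progressively less specific locales (for example, dropping the country and keeping only the language) and finally the base bundle.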
Sometimes this is not what you want, or at least you may want to be able to detect when a particular piece of data came from a fallback locale rather than the specified one. For example, suppose you wanted a specific resource from the French Belgian locale, and there is only a French locale installed--you'll get the wrong resource. In JDK 1.2 we added a method, getLocale(), to find out the actual locale that a resource bundle comes from, so that you can determine if a fallback was used.
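The French Belgian case can be sketched with two in-memory bundles. Everything named here (FallbackDetection, Messages, Messages_fr, usedFallback, and the "greeting" key) is hypothetical; only ResourceBundle.getBundle() and getLocale() are the JDK's.

```java
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

class FallbackDetection {
    /** Base bundle, used when nothing more specific exists. */
    public static class Messages extends ListResourceBundle {
        public Messages() {}
        @Override protected Object[][] getContents() {
            return new Object[][] {{"greeting", "Hello"}};
        }
    }

    /** French bundle. There is deliberately no Messages_fr_BE. */
    public static class Messages_fr extends ListResourceBundle {
        public Messages_fr() {}
        @Override protected Object[][] getContents() {
            return new Object[][] {{"greeting", "Bonjour"}};
        }
    }

    /** True when the bundle that was found is for a different locale than the one requested. */
    static boolean usedFallback(Locale requested) {
        ResourceBundle bundle = ResourceBundle.getBundle(Messages.class.getName(), requested);
        return !bundle.getLocale().equals(requested);
    }

    public static void main(String[] args) {
        System.out.println(usedFallback(Locale.forLanguageTag("fr-BE"))); // true
        System.out.println(usedFallback(Locale.FRENCH));                  // false
    }
}
```

An application could use such a check to log missing localizations or to decline a resource that is not specific enough.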
Comparison and boundaries
In JDK 1.1, Collator allows you to compare strings in a language-sensitive way.
The standard comparison in String will just do a binary comparison. For strings
that will be displayed to the user, this is almost always incorrect! Wherever
the ordering or equality of strings is important to the user, such as when presenting
an alphabetized list, then use a Collator instead. Otherwise a German, for example,
will find that you don't equate two strings that she thinks are equal.
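The following sketch contrasts binary ordering with Collator ordering. CollatedSort and its two helpers are hypothetical names; the word list is only sample data.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollatedSort {
    /** Returns a copy of the words sorted the way a user of the given locale expects. */
    static String[] sortForDisplay(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    /** Compares at SECONDARY strength, which ignores case differences. */
    static boolean sameIgnoringCase(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.SECONDARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String[] words = {"Zebra", "apfel", "Birne"};

        String[] binary = words.clone();
        Arrays.sort(binary); // String.compareTo: all uppercase letters sort before lowercase
        System.out.println(Arrays.toString(binary));                               // [Birne, Zebra, apfel]
        System.out.println(Arrays.toString(sortForDisplay(words, Locale.GERMAN))); // [apfel, Birne, Zebra]
    }
}
```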
Why have CharacterIterator?
The CharacterIterator interface is used in BreakIterator and a few other places in the JDK, and is used even more in JDK 1.2. Sometimes we are asked why we didn't use String or StringBuffer instead.
String and StringBuffer are simple classes that store their characters contiguously. Insertion or deletion of characters in a StringBuffer ends up shifting all the characters that follow, which works fine for reasonably small numbers of characters. However, this model doesn't scale well. Consider a word processor, for example, where shifting many kilobytes of characters just to insert or delete one character involves far too much extra work. For acceptable performance in these circumstances, text needs to be stored in data structures that use internally discontiguous chunks of storage.
We needed some way to have a more abstract representation of text that could be used both for String and for larger-scale text models. Unfortunately, we couldn't change String and StringBuffer to descend from an abstract class that would provide this sort of representation. To resolve this problem, we added a very minimal interface, CharacterIterator. This interface allows both sequential (forward and backward) and random access to characters from any source, not just from a String or StringBuffer.
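A brief sketch of the sequential and random access described above. IteratorDemo and its helpers are hypothetical names; StringCharacterIterator is the JDK's String-backed implementation, and any other text store could implement the same interface.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

class IteratorDemo {
    /** Walks the text backward, without knowing how it is stored. */
    static String reverse(CharacterIterator it) {
        StringBuilder out = new StringBuilder();
        for (char c = it.last(); c != CharacterIterator.DONE; c = it.previous()) {
            out.append(c);
        }
        return out.toString();
    }

    /** Random access: jumps straight to an offset from the start of the text. */
    static char charAt(CharacterIterator it, int offset) {
        return it.setIndex(it.getBeginIndex() + offset);
    }

    public static void main(String[] args) {
        CharacterIterator it = new StringCharacterIterator("abc");
        System.out.println(reverse(it));   // cba
        System.out.println(charAt(it, 1)); // b
    }
}
```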
Rule-based BreakIterator
The BreakIterator class finds character, word, line, and sentence boundaries, which may vary depending on the locale. The JDK 1.1 BreakIterator implementation uses a state machine, which makes it very fast. However, it does not allow the behavior to vary depending on the locale. If the built-in classes don't support the behavior clients want, they must create a completely new BreakIterator subclass of their own--they can't leverage the JDK code at all.
Therefore, we undertook an extensive revision of the BreakIterator framework. The new RuleBasedBreakIterator class essentially works the same way the old class did, but it builds the category and state tables from a textual description, which is essentially a string of regular expressions. This description can be loaded from a resource--allowing different breaking rules for different languages--or supplied by the client--allowing runtime customization. This class is provided on the AlphaWorks web site.
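For reference, this is how a client walks boundaries with the standard BreakIterator factory. It is a sketch of the client-side API only and does not use the RuleBasedBreakIterator class described above. WordBoundaries and words are hypothetical names, and filtering on the first character is a simplification for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBoundaries {
    /** Returns the pieces between word boundaries that start with a letter or digit. */
    static List<String> words(String text, Locale locale) {
        BreakIterator breaks = BreakIterator.getWordInstance(locale);
        breaks.setText(text);

        List<String> result = new ArrayList<>();
        int start = breaks.first();
        for (int end = breaks.next(); end != BreakIterator.DONE; start = end, end = breaks.next()) {
            String piece = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation; keep only word-like pieces.
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.US)); // [Hello, world]
    }
}
```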
Locale-sensitive searching
The CollationElementIterator class is intended for use in locale-sensitive text searching. However, it is missing several methods in JDK 1.1, which makes it impossible to use with fast string searching algorithms such as Boyer-Moore. New methods were added in JDK 1.2 to fix this.
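Algorithms such as Boyer-Moore compare from the end of a pattern, so they need backward iteration and repositioning. The sketch below uses previous() and setOffset(), both documented in the JDK as available since 1.2. CollationElements and its helpers are hypothetical names, and the cast assumes Collator.getInstance() returns a RuleBasedCollator, as the standard JDK collators do.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CollationElements {
    static CollationElementIterator iterator(String text) {
        // Assumption: the default US collator is rule-based.
        RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.US);
        return collator.getCollationElementIterator(text);
    }

    /** Primary orders of the text, read front to back. */
    static List<Integer> forward(String text) {
        CollationElementIterator it = iterator(text);
        List<Integer> keys = new ArrayList<>();
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            keys.add(CollationElementIterator.primaryOrder(e));
        }
        return keys;
    }

    /** Primary orders of the text, read back to front after positioning at the end. */
    static List<Integer> backward(String text) {
        CollationElementIterator it = iterator(text);
        it.setOffset(text.length());
        List<Integer> keys = new ArrayList<>();
        for (int e = it.previous(); e != CollationElementIterator.NULLORDER; e = it.previous()) {
            keys.add(CollationElementIterator.primaryOrder(e));
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(forward("abc"));
        System.out.println(backward("abc"));
    }
}
```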
Unicode normalization
Unicode is more than just "wide ASCII". One of the principal operations on Unicode
is to normalize text, ensuring that you have a unique spelling for a given text.
Text normalization includes decomposition and composition forms of characters.
Text can be normalized to be a canonical equivalent to the original unnormalized
text, or to be a compatibility equivalent to the original unnormalized
text. For more information, please see Unicode technical report #15 on http://www.unicode.org/unicode/reports/tr15/.
One of the Unicode normalization forms is used internally as a part of JDK 1.1, but it is not public. The Normalizer class incorporates this technology, and allows either batch or incremental normalization of text. This class is provided on the AlphaWorks web site.
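The canonical and compatibility forms can be illustrated with java.text.Normalizer, the normalization class that later became part of the standard Java library. It is used here as a stand-in and is not the alphaWorks class, whose API may differ. NormalizeDemo and its helpers are hypothetical names.

```java
import java.text.Normalizer;

class NormalizeDemo {
    /** Canonical composition (NFC): base letter plus combining mark becomes one character. */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Canonical decomposition (NFD): precomposed characters are split apart. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Compatibility composition (NFKC): also replaces compatibility characters such as ligatures. */
    static String compatibilityCompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(compose("e\u0301").equals("\u00E9"));         // true
        System.out.println(decompose("\u00E9").length());                 // 2
        System.out.println(compatibilityCompose("\uFB01").equals("fi")); // true
    }
}
```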
Formatting and parsing
JDK 1.1 provides a rich set of functionality for formatting values into
strings and parsing strings into values in a locale-sensitive way. These include
numbers, dates, times, and messages.
Number formatting supports spreadsheet-style patterns. For example, a format such as "#,##0.00#" will produce output like "1,234.567" or "5.00"; the pattern specifies that you have at least 2 decimal digits, but no more than 3. You can also reset the decimals and other characteristics of the pattern programmatically. Number formatting also provides powerful pattern parsing support for proportional font decimal alignment.
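A minimal sketch of the "#,##0.00#" pattern follows. PatternDemo and format are hypothetical names, and the US symbols are fixed explicitly so the separators do not depend on the machine's default locale.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class PatternDemo {
    /** Formats with grouping, at least 2 and at most 3 fraction digits. */
    static String format(double value) {
        DecimalFormat fmt = new DecimalFormat("#,##0.00#", new DecimalFormatSymbols(Locale.US));
        return fmt.format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.567)); // 1,234.567
        System.out.println(format(5));        // 5.00

        // The same characteristics can be reset programmatically.
        DecimalFormat fmt = new DecimalFormat("#,##0.00#", new DecimalFormatSymbols(Locale.US));
        fmt.setMinimumFractionDigits(0);
        System.out.println(fmt.format(5));    // 5
    }
}
```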
Date/Time formatting supports similar features, and is fully integrated with Calendar. Message formatting allows access to number, date, and time formatting within the context of a localizable string.
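As a sketch of message formatting, the pattern below embeds two number formats in one translatable string. MessageDemo, describe, and the message text are hypothetical; in a real program the pattern would typically come from a resource bundle. The two-argument MessageFormat constructor used here was added to the JDK after the releases discussed in this article.

```java
import java.text.MessageFormat;
import java.util.Locale;

class MessageDemo {
    /** Fills a localizable pattern; each argument is formatted by the locale's rules. */
    static String describe(int fileCount, double fractionUsed, Locale locale) {
        MessageFormat fmt = new MessageFormat(
                "{0,number,integer} files use {1,number,percent} of the disk.", locale);
        return fmt.format(new Object[] {fileCount, fractionUsed});
    }

    public static void main(String[] args) {
        System.out.println(describe(1234, 0.25, Locale.US)); // 1,234 files use 25% of the disk.
    }
}
```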
Substitutable currencies
NumberFormat provides the factory method getCurrencyInstance(), which creates an object that can convert numbers to and from strings in the currency format of a given locale. In JDK 1.1, these formats were treated just like any other number formats. They were constructed from strings that were fetched from ResourceBundles. In 1.2, the currency symbol can be specified independently from the rules for decimal places, thousands separator, and so on, and is supplied in the pattern with the international currency symbol ("¤" = "\u00A4").
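The sketch below keeps one pattern and swaps only the symbol that replaces the "\u00A4" placeholder. CurrencySymbolDemo and format are hypothetical names; the US separators are an assumption made to keep the output predictable.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class CurrencySymbolDemo {
    /** Formats an amount with US digit rules but a caller-chosen currency symbol. */
    static String format(double amount, String symbol) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.US);
        symbols.setCurrencySymbol(symbol);
        // \u00A4 in the pattern marks where the currency symbol goes.
        DecimalFormat fmt = new DecimalFormat("\u00A4#,##0.00", symbols);
        return fmt.format(amount);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5, "$"));      // $1,234.50
        System.out.println(format(1234.5, "\u00A3")); // £1,234.50
    }
}
```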
ISO currency codes
Additionally, we added an API to retrieve the 3-letter international currency
codes defined in ISO 4217. These are necessary in an application that deals
with many different currencies, because the regular, one-character currency
symbols are often shared by many different currencies. For example, both the
US and Canada use "$" in their default currency format. An application dealing
with both currencies will probably want to use "USD" and "CAD" instead. In JDK
1.2, this is now possible, using a sequence of two international currency symbols
("¤¤" = "\u00A4\u00A4") in the pattern.
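A sketch of the doubled currency sign, which selects the ISO 4217 code instead of the local symbol. IsoCurrencyDemo and format are hypothetical names, and the space between the code and the number is a choice made for this example.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class IsoCurrencyDemo {
    /** Formats an amount prefixed with the ISO 4217 code of the locale's currency. */
    static String format(double amount, Locale locale) {
        // Two currency signs in a row request the international (3-letter) code.
        DecimalFormat fmt = new DecimalFormat("\u00A4\u00A4 #,##0.00", new DecimalFormatSymbols(locale));
        return fmt.format(amount);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5, Locale.US));     // USD 1,234.50
        System.out.println(format(1234.5, Locale.CANADA)); // CAD 1,234.50
    }
}
```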
Parse error information
The abstract method parseObject() in java.text.Format is used to parse strings and turn them into objects. In JDK 1.1, the program can find out how far the parse got so that it can continue from that point on. However, it cannot find out how far it got if there was an error. In JDK 1.2 a new ParsePosition field, errorIndex (read through getErrorIndex()), now contains that information. If an error occurs during parsing, the formatters set this value before returning an error or throwing an exception.
A typical use is parsing a text field for a number: if an error is found, the text beyond the error is highlighted, a message is displayed, and a beep is played.
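A non-GUI sketch of the error check follows, using ParsePosition.getErrorIndex(), the accessor under which the error position is exposed in the released API. ParseErrors and firstProblemOffset are hypothetical names; a real text field would use the returned offset to decide what to highlight.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

class ParseErrors {
    /**
     * Returns the offset of the first character that could not be parsed as part
     * of a number, or -1 if the whole text was consumed.
     */
    static int firstProblemOffset(String text) {
        NumberFormat fmt = NumberFormat.getNumberInstance(Locale.US);
        ParsePosition pos = new ParsePosition(0);
        Number value = fmt.parse(text, pos);

        if (value == null) {
            return pos.getErrorIndex(); // parsing failed outright
        }
        if (pos.getIndex() < text.length()) {
            return pos.getIndex();      // a number was read, but trailing text remains
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstProblemOffset("1,234.5")); // -1
        System.out.println(firstProblemOffset("12abc"));   // 2
        System.out.println(firstProblemOffset("abc"));     // 0
    }
}
```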
Number format enhancements
On the alphaWorks web site we provide a class that correctly supports exponentials in number formatting and parsing. The new number formatter supports formats such as "1.2345E3", as well as engineering exponents, where the exponent is always a multiple of 3. It also supports formatting and parsing BigInteger or BigDecimal values without loss of precision, and "nickel-rounding": the ability to round to multiples of a specified number, such as $0.05. (This is important for some countries whose smallest coin is 5 units instead of 1. The implementation is not restricted to nickels, however, and can be used to round to multiples of any given value.)
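The ideas above can be approximated with classes in the standard library; the sketch below does not use the alphaWorks class, whose API is not shown here. The exponent patterns are those of java.text.DecimalFormat, and the rounding helper is plain BigDecimal arithmetic. ExponentAndRounding and its method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class ExponentAndRounding {
    private static final DecimalFormatSymbols US = new DecimalFormatSymbols(Locale.US);

    /** Scientific notation: one integer digit, up to four fraction digits. */
    static String scientific(double value) {
        return new DecimalFormat("0.####E0", US).format(value);
    }

    /** Engineering notation: up to three integer digits forces exponents that are multiples of 3. */
    static String engineering(double value) {
        return new DecimalFormat("##0.#####E0", US).format(value);
    }

    /** Rounds to the nearest multiple of the increment, for example 0.05. */
    static BigDecimal roundToMultiple(BigDecimal value, BigDecimal increment) {
        return value.divide(increment, 0, RoundingMode.HALF_UP).multiply(increment);
    }

    public static void main(String[] args) {
        System.out.println(scientific(1234.5));  // 1.2345E3
        System.out.println(engineering(12345));  // 12.345E3
        System.out.println(roundToMultiple(new BigDecimal("1.23"), new BigDecimal("0.05"))); // 1.25
    }
}
```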
Number formats in words
The ability to take a numeric value (such as 12,345) and translate it into words
(such as "twelve thousand three hundred forty-five") is often needed in business
applications, for example, to write out the amount on a check. Number spellout
in English is a relatively easy thing to do; good algorithms for this are well-known
and widely used. A number-spellout engine that can be customized for any language
is another thing altogether.
It's not enough to simply take the algorithm for English and read the literal string values from a resource file. English separates all component parts of a number with spaces; Italian and German do not. Some languages, such as Spanish and Italian, drop the word for "one" from the phrases "one hundred" or "one thousand". There are many other examples that show translating a number into words is not a trivial task.
To solve these issues, we developed a class called RuleBasedNumberFormat. It's a general, rule-based mechanism for converting numbers to spelled-out strings. This class is available on the alphaWorks web site, along with information on the usage and rule syntax.
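For contrast, here is the kind of English-only algorithm mentioned earlier, written as a short sketch. It is not RuleBasedNumberFormat and has none of its rule syntax: the spaces, hyphens, and the explicit "one" in "one hundred" are all hard-coded, which is exactly what makes it unsuitable for other languages. EnglishSpellout and spell are hypothetical names, and the range is limited to 0 through 999,999 for brevity.

```java
class EnglishSpellout {
    private static final String[] ONES = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"
    };
    private static final String[] TENS = {
        "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"
    };

    /** Spells out a number from 0 to 999,999 in English. */
    static String spell(int n) {
        if (n < 0 || n > 999_999) {
            throw new IllegalArgumentException("out of range: " + n);
        }
        if (n < 20) {
            return ONES[n];
        }
        if (n < 100) {
            String tens = TENS[n / 10];
            return n % 10 == 0 ? tens : tens + "-" + ONES[n % 10];
        }
        if (n < 1000) {
            String hundreds = ONES[n / 100] + " hundred";
            return n % 100 == 0 ? hundreds : hundreds + " " + spell(n % 100);
        }
        String thousands = spell(n / 1000) + " thousand";
        return n % 1000 == 0 ? thousands : thousands + " " + spell(n % 1000);
    }

    public static void main(String[] args) {
        System.out.println(spell(12345)); // twelve thousand three hundred forty-five
    }
}
```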
Conclusion
The internationalization services in Java 1.1 provide a wide range of functionality, and are easily extended to add additional features and to support additional countries. We've had an opportunity to discuss some of the design decisions taken in developing these classes, and some of the enhancements that are being included in future releases. A more detailed discussion is available at http://www.ibm.com/developer/unicode/, and includes more about the JDK i18n classes and possible future internationalization improvements that IBM is discussing with Sun.
We are working on many future enhancements, some of which are available right now on IBM's alphaWorks website at http://www.alphaWorks.ibm.com/. We encourage those interested to download versions from there--any comments on the design and implementation are welcome!
Acknowledgements
Our thanks to Kathleen Wilson, Rich Gillam, and Laura Werner for their extensive
review and suggestions for organization of the document. Many other people in
IBM and Sun contributed to the Java internationalization efforts.
About the authors
Dr. Mark Davis is a Senior Technical Staff Member responsible for international
software architecture. Mark co-founded the Unicode effort, and is the president
of the Unicode Consortium. He is a principal co-author and editor of the Unicode
Standard, Versions 1.0 and 2.0. At various times, his department has included
software groups covering text, international, operating system services, Windows
porting, and technical communications. Technically, he specializes in object-oriented
programming and in the architecture and implementation of international and
text software.
Helena Shih Chapman is the technical lead of the IBM Classes for Unicode at IBM's Center for Java Technology, Cupertino. She previously was a member of the Java i18n team at Taligent, and contributed to the JDK 1.1 international classes. Helena has also worked for Dataware Technologies and Apple's Advanced Technology Group. She holds an MSc. degree from University of Massachusetts. She is a native of Taipei, Taiwan.