Adding internationalization support to the base standard for JavaScript

It all started innocently enough. In the spring of 1998, when IBM's National Language Support (NLS) organization wanted someone to look at something called ECMAScript and make sure that the technology using it would be able to conform to IBM's NLS guidelines, I thought no big deal. Little did I know that 15 months later I'd be in the middle of a large effort to add internationalization support to ECMAScript. While this case study will be most applicable to API-library and language designers, the lessons described in this article will be useful to anyone implementing or considering internationalization support.

Having now been through the process of adding internationalization support to two low-level APIs (the other was the Java programming language), I'd like to share with you the main lessons we've learned:

Internationalization support must be designed into a project from the very beginning. Bolting it on after the fact will always require significant changes in design and API, and usually require deprecating a large amount of a system's original API.
The work isn't always where you think it is. For example, some kind of message-catalog facility and some kind of locale-identification facility are usually very high on the list of internationalization features. In the ECMAScript project, however, both of these fell lower on the priority list. In fact, message catalogs fell off the priority list entirely. Instead, we spent our energy for this iteration solving much more fundamental problems in the handling and generation of Unicode text.
Know your audience. The design of the internationalization facilities in the new version of ECMAScript is dramatically different from the corresponding libraries in Java. This is because there are significant differences between the two languages' overall design goals and what their user constituencies expect out of them.

A brief history of ECMAScript
Before we look specifically at internationalization issues, let's drop back and look at just what ECMAScript is. Several years back, Netscape devised a scripting language for the Netscape Navigator browser. This would allow Web page designers to add various interactive features to their Web pages without having to resort to something complicated like writing a Java applet or using a server-side solution. Capitalizing on all the Java hype going around at the time, they dubbed it JavaScript.

JavaScript was designed to have a syntax and feel that would be comfortable for Java and C++ programmers, but be simple and flexible enough to be non-intimidating, easy to learn, and productive for scriptwriters who either had worked only in other "scripting" languages or who were just learning computer programming.

By all accounts, it was successful, and other players began developing scripting languages having similar features and structure. For example, Microsoft developed a scripting language for their Internet Explorer browser that was quite similar to JavaScript and named it JScript.

As Web scripting languages began to proliferate, Microsoft, Netscape, and the other players began to realize that a whole bunch of similar, yet gratuitously different, Web scripting languages would be a serious deterrent to the growth of Web scripting in general, and agreed to form a standards committee. The group formed under the auspices of ECMA, an international industry association based in Europe that sets standards for information and communication systems, and became known as ECMA Technical Committee #39. ("ECMA" originally stood for "European Computer Manufacturers' Association".)

ECMA TC39 is the committee responsible for Web scripting languages. Their primary product is something called ECMAScript, a standard defining the basic language structure for Web scripting languages. This was published as ECMA-262 in June 1997. It was accepted as an international standard and published as ISO/IEC 16262 in April 1998. The latest version of ECMA-262 is in its final stages and will be published sometime around the end of 1999. This latest version of ECMAScript, the third edition of the standard, includes a lot of features that have been present in both JavaScript and JScript for some time, along with a few completely new features.

JavaScript and JScript are, of course, fully ECMAScript-conformant. The current ECMAScript standard corresponds roughly to JavaScript 1.1, and the scripting engines in both Netscape and Microsoft's current browsers implement a superset of the ECMAScript standard). Netscape and Microsoft continue to be the main driving forces behind it, but there is also heavy involvement from HP, Sun, IBM, and the World Wide Web Consortium. (In fact, the current chairman of the main ECMAScript working group is from IBM -- there's an informal rule against the chair of the working group being from Netscape or Microsoft.)

ECMAScript design concepts
ECMAScript is designed to be simple to implement and easy to use. It has five primitive types: Null, Undefined, Boolean, Number, and String. There is only one type of number (corresponding to a double-precision float in most other languages) and one type of string. Strings are variable-length and always in Unicode. A string is not simply an array of characters, as in many other languages; in fact, there is no data type for individual characters.

Everything else, including functions, arrays, and exceptions, is an object. There are no classes in ECMAScript: an object is simply a list of key-value pairs called properties. A property has an arbitrary name selected by the programmer. Every object also has a prototype, which is an object that the name-binding mechanism should look to for any properties it doesn't find in the object itself. In other words, every object has not only the properties that have been attached to it, but all of the properties attached to the objects in its prototype chain (property names are unique, so when an object and an object in its prototype chain both define a property with the same name, the prototype's property is hidden or "overridden"). Properties can be added or removed from an object at will. Inheritance takes place through the property-lookup mechanism.

Variables, function parameters and return values, and properties are untyped. There is no built-in mechanism for restricting the data types that can be stored in a variable or property. Functions can perform type checking on their own, but most of the built-in functions in ECMAScript perform conversions on their arguments instead. For this reason, the type-conversion mechanism in the language is very rigorously defined.

"I support the Unicode standard"
AScript working group in the middle of development for this current edition when someone from one of the other member companies pointed out some internationalization issues. IBM sent me to represent their international interests and see whether the other company's complaints were valid. They were, and there were also additional internationalization issues to be resolved.

When I joined the discussion, ECMAScript made two concessions to international support: they had a function to translate a date value into a string in a locale-specific manner, and they were storing their text in Unicode. That was it. There wasn't really any problem so serious that it would prevent a determined scriptwriter from writing internationalized code at all. The scriptwriter could either write the international support or call through to external libraries. But the language didn't really provide any help.

We spent most of our time for this edition of the language cleaning up string handling and various other aspects of Unicode support. These are issues that are far more important than message catalogs, but tend to slip unnoticed by all but the serious Unicode mavens.

The primary thing to remember is that "I support the Unicode standard" is an often-misunderstood statement. It means both more and less than people think it does. It means more in the sense that there's a lot more involved in supporting Unicode than making the individual elements in a string 16 bits wide. It means less in the sense that implementors are not required to do every last little thing mentioned in the big Unicode book.

The gist of the Unicode standard's conformance clause is this:

If you pass text on from an outside source to an outside destination, you're not allowed to mangle the text you're passing through in a way that damages its intelligibility to the destination.
You're not obligated to support any particular characters mentioned in the Unicode standard. However, you're required to state specifically which characters you do support, and for the characters you do claim to support, you have to follow all the rules in the Unicode standard that apply to those characters. For instance, you're not required to implement the Unicode bidirectional line layout algorithm if you don't actually display text or if you don't claim to support the Hebrew and Arabic writing systems. But if your application claims it can display Hebrew or Arabic text, it has to do so according to the exact specifications in the Unicode standard.

Support for Unicode is in the details
In other words, the big problem with "I support the Unicode standard" is that it isn't specific enough. You have to specify which characters you handle correctly. ECMAScript gets around this by leaving the choice of supported characters up to the implementor.

But there are numerous other things that have to be nailed down. For instance, which version of the Unicode standard do you conform to? Except for the big change to the Korean characters between versions 1 and 2, the Unicode standard hasn't and won't change the assignments of existing characters, nor will it remove characters from the standard altogether. Successive versions can (and do) add characters, so to some extent support for a particular version of Unicode is a choice of whether or not to support the new characters.

Character properties, however, can also change between versions of Unicode. Non-normative properties, such as the mappings between uppercase and lowercase letters, can change at will. Normative properties usually don't change, but do when there's a strong enough consensus that a particular property assignment was a mistake. This means that applications can have slightly different behavior depending on the Unicode version. Depending on the situation, changing to a new version of Unicode may fix a problem "for free," or it may not make a difference, or it may introduce a portability problem.

Do you require only that version of Unicode, or can a conforming implementation use newer versions of Unicode instead? As mentioned above, this can introduce portability problems in some cases. Can you live with this, or do you need to nail yourself to one version and one version only?

The original version of ECMAScript didn't specify a Unicode version. We changed it to say we require Unicode 2.1 or later. The Unicode Consortium is recommending that other standards base themselves on Unicode 3.0, which makes eminent sense, but we decided that none of the current ECMAScript implementations, nor the internal libraries they use currently, support Unicode 3.0 right now, nor are they likely to do so in the immediate future. We felt that the potential benefit in allowing future implementations to use Unicode 3.0 far outweighed the potential portability problems it might create.

Unicode transformation formats
Unicode is an abstract encoding that can be realized in bits in a number of forms, called Unicode transformation formats. The use of a particular transformation format is basically a storage-level issue, but can appear in the programming model. Obviously, it matters in I/O-related situations (as does dealing with non-Unicode systems), but it can also surface in other parts of the programming model in interesting and subtle ways.

The Unicode standard has over a million possible abstract code-point values (called Unicode scalar values), of which some 45,000 are defined in the standard. A Unicode scalar value is a 21-bit value ranging from $000000 to $10FFFF. The three main transformation formats are:

UCS-4, which takes the Unicode scalar value as-is and stores it in a 32-bit storage unit.
UTF-8, which stores the scalar values from $0000 to $007F as-is in an 8-bit storage unit and all other values in a transformed form that uses anywhere from two to four 8-bit units.
UTF-16, which stores the scalar values from $0000 to $FFFF untransformed in a 16-bit storage unit and those above $FFFF in a transformed form that uses two 16-bit units (often called a surrogate pair)

UCS-4 represents everything with a 32-bit value, making counting and indexing characters easy but wasting lots of memory. UTF-16 saves memory but potentially makes counting and indexing characters harder. The question is whether the elements in a string that can be indexed are full-blown Unicode characters, or whether they're UTF-16 code-point values. In UTF-16, certain single characters will be treated as two characters. You could actually store the characters in either format independently of how they are indexed, but this makes the process of counting and indexing characters much harder and slower for any implementation that chooses a different storage format from the one the language uses for counting and indexing. For all intents and purposes, the choice that is made of how to index and count characters in the API drives the internal storage format.

UTF-8 is usually used when ASCII compatibility is especially important, and therefore is typically a file-storage or file-transfer format. UTF-16 and UCS-4 are both also used for internal string storage. Because the ECMAScript standard doesn't include a description of I/O facilities, the difference between storage formats should be irrelevant, but it's not. It surfaces in the programming model in the way characters in strings are counted and indexed.

Choosing which transformation format to use
On the surface, UCS-4 would seem to be the obvious choice, because characters and code points have a one-to-one mapping. However, a Unicode scalar value has only 21 bits, so at least 11 bits of each 32-bit unit are wasted. The Unicode standard also currently doesn't assign any characters to the range above $FFFF and plans to do so only for very rare characters. Current use of this range is limited to private-use characters for certain operating environments (which are inherently non-portable). Therefore, the vast majority of characters in standard Unicode text will use only 16 bits per character, wasting 16 bits per character. Because UTF-16 represents the Unicode values from $FFFF on down untransformed, this would seem to be a very compelling argument for UTF-16.

But doesn't this mean you get the wrong answer when you ask a string for its length? This depends on what you think you're asking for. It gives the right answer if you want the number of storage positions the string takes up in memory. It gives the wrong answer if you want the number of display positions that the string takes up on the screen, but Unicode already declares that there isn't a one-to-one correspondence between Unicode code-point values and marks on the screen (and in these days of proportional fonts, that number is irrelevant except in East Asian typesetting).

But doesn't numbering UTF-16 units mean you can split up a character? Yes, it does. But unlike many variable-width encodings, this destroys only that character, not the whole document as in Shift-JIS. And a program usually doesn't just blindly index to some arbitrary position within a string. If it does this, that's because it's already imposing semantics on the text in the string, in which case the programmer can declare the syntax to make sure this operation works. Otherwise, the character offsets are coming from a higher-level protocol, such as a text-editing engine or a search engine, and that's the party responsible for making sure characters don't get broken up.

Counting things that the user thinks of as a single character already means you're not counting storage units. There are many cases other than UTF-16 surrogate pairs where a single character is represented using multiple Unicode code points. For example, Korean and Hindi syllables are usually thought of as single characters, although they're often stored internally broken up into individual "letters." This type of interpretation is like counting words: language-specific and language-dependent. This already requires a higher-level facility than the ones normally used to count and index characters in strings.

The original ECMAScript didn't address this issue. We decided to go with UTF-16. The wasted-memory issue with UCS-4 was a compelling argument in favor of UTF-16, as was the fact that Java uses UTF-16 and the Unicode Consortium recommends it.

Multiple representations
Unicode includes the concept of a combining character, which is a character that (generally) modifies the character before it in some way rather than showing up as a character on its own. For example, the letter ?an be represented using a regular letter e followed by a combining acute-accent character. This occupies two storage positions in memory.Unicode defines these characters to allow flexibility in its use.

If you need a certain type of accented character, Unicode can give you the base character and the accent rather than having to assign a whole new code-point value to the actual combination you want to display. This greatly expands the effective number of characters that Unicode can encode. In some cases, such as Korean, "characters" are broken up into smaller units that can be combined into the actual characters. This was done to save on code-point assignments.

In many cases, including the two examples above, Unicode actually does have a single code point representing the entire unit. The letter ?an be represented using its own single code-point value, and all Korean syllables (including many that don't naturally occur in Korean) have single code-point values. This is because most constituencies prefer the precomposed versions. Storing ?s two characters would both complicate processing and increase storage size. This introduces a significant pro-English bias into what's supposed to be an international standard. Similarly, requiring 6 bytes for each Korean syllable when only 2 bytes are required for each Japanese ideograph introduces an anti-Korean bias.

As a result, many characters in Unicode have multiple possible representations. In fact, many characters that don't have a specific code point in Unicode (for example, many letters with two diacritical marks on them) have multiple sequences of Unicode characters that can represent them. This can make two strings that appear to be the same to the user appear to be different to the computer.

The Unicode standard generally requires that implementations treat alternative "spellings" of the same sequence of characters as identical. Unfortunately, this can be impractical.

Most processes just want to do bitwise equality on two strings. Doing anything else imposes a huge overhead both in executable size and in performance. But because many of these processes can't control the source of the text, they're stuck. The traditional way of handling this is normalization -- picking a preferred representation for every character or sequence of characters that can be represented multiple ways in Unicode. The Unicode standard does this by declaring a preferred ordering for multiple accent marks on a single base character and by declaring a "canonical decomposition" into multiple characters for every single character that can also be represented as two or more characters.

Four normalization forms
The Unicode standard also defines a set of "compatibility decompositions" for characters that can be decomposed into other characters but only with the loss of some information. Newer versions of the standard also define "compatibility compositions" and "canonical compositions." This actually gives you a choice of four "normalized forms" for a string.

A program can make bitwise equality comparison work right by declaring that all strings must be in a particular normalized form. The program has the choice of requiring that all text fed to it be normalized in order to work right (delegating the work to an outside entity), or normalizing things itself. The World Wide Web Consortium ran into this very problem and solved it by requiring all applications that produce text on the Internet to produce it in normalized form. Software that merely transfers text from one place to another or displays it can choose to normalize it again on receipt, but doesn't have to -- if everybody's followed the rules, it's already normalized.

The ECMAScript standard didn't say anything about this issue, either. In fact, this was the main issue that inspired all the internationalization activity in the first place. We eventually adopted canonical composition ("Unicode Normalization Form C") as our normalization format, just as the W3C had.

Who does the normalization?
The who-does-the-normalization question was solved automatically because ECMAScript doesn't specify an I/O interface. The expectation is that there's a separate layer of some kind around the environment of a running ECMAScript program that ensures text coming in from outside is normalized. Because the language already required this layer to translate incoming text from its native encoding into Unicode, the extra normalization requirement isn't that onerous. Most mappings from one encoding to Unicode already produce normalized text, and those that don't will have to normalize only a limited set of characters. (There's extra overhead in translating from one form of Unicode, such as UTF-8, to the internal format, however.)

By delegating normalization to an outside layer, the ECMAScript engine itself (and any running ECMAScript programs) don't have to worry about the normalization issue. For applications that have connections only to a single user, the outer layer can be designed to produce only normalized text. Applications that get text from the Internet can depend on the source to follow the W3C rules. So, for many systems, an actual normalization implementation isn't required.

We've adopted a "trust the programmer" attitude toward text produced by a running ECMAScript program. We felt it was reasonable to expect a program to produce only normalized text. We thought about providing a normalize() function to help the programmer along, but this would have required that every ECMAScript implementation carry the normalization tables along with it, which we definitely didn't want to do (they're about 40 KB). You can't pare down the implementation here to handle only a subset of characters. Unlike other Unicode processes, normalization doesn't have the option of supporting only some of the Unicode characters; it has to handle all of them.

Simple string manipulation
Of course, all kinds of bitwise manipulations on stings can produce unnormalized or just plain wrong text. Inserting a character into the middle of a combining-character sequence or a surrogate pair messes it up, for example, as does removing a single character from a combining-character sequence or surrogate pair. In fact, you can take two normalized strings, concatenate them, and end up with an unnormalized result. This will happen if the second string starts with a combining character.

Again, imposing a requirement that all strings be normalized at all times would be a huge performance burden, and would require that everyone carry around the normalization tables. It also would mess up any application that wanted to use the String data type to store something other than textual data.

We had some long discussions over this issue, but I think we were in "violent agreement" most of the time. No one thought forcing strings into normalized form all the time was a good idea. We opted to have all the basic string manipulations treat strings as sequences of arbitrary 16-bit values. Again, this was an area where the "trust the programmer" approach seemed reasonable, given the cost of protecting the programmer from himself.

So as you can see, there are a lot of very basic issues associated with storing and manipulating Unicode text that must be specifically dealt with when designing a programming language or API. You can't leave these to chance and expect implementations to agree with each other. Again, merely making characters in strings 16 bits wide doesn't hack it.

Program syntax
The program syntax itself was the next big area where we had to worry about text handling. ECMAScript source code must be in Unicode, but non-ASCII characters were allowed only in comments and string literals.

This seemed like rather halfhearted Unicode support, but there's a reasonable where-do-you-draw-the-line question to be asked here. It's possible to go totally over the top and try to make Unicode work everywhere. We tried to aim for an answer in the middle, taking the following considerations into account:

The most important issue was that a truly Unicode-savvy text editor (or character-code converter) would not produce compilable source code. This is because Unicode defines new line and paragraph separators (because the ASCII CR, LF, and VT characters all have inconsistent semantics between systems). Because more and more text editors over time will deal in the correct Unicode characters, we wanted to make sure they were legal. We updated the standard to allow the Unicode line and paragraph separators to be treated as line separators by the compiler in addition to the ASCII ones.
By the same token, we extended the definition of "whitespace" to include all of the Unicode whitespace characters (including the Latin1 non-breaking space, which had also been excluded). This includes zero-width spaces. This is actually more important in numeric parsing than it is in parsing program source, but we wanted to be consistent.
Unicode defines a number of formatting-control characters, invisible characters that control the formatting of the characters around them. These include characters to control bidirectional reordering (there are a number of ambiguous situations where the correct line layout can be determined only with "hints" from the source of the text), characters to control joining of adjacent characters in cursive scripts, characters to control which shapes the ASCII digit characters should display, and characters to control in-line annotations (for example, Ruby in Japanese). Some of these are deprecated now, but may still occur in text.The W3C explicitly disallows all of these characters in structured text because they are more appropriately the province of a markup language such as XML or HTML. We opted to follow this guideline as well, because the invisible formatting characters could cause equality-comparison problems just as multiple spellings can. We modified the specification for the parser to ignore these characters.
Allowing only ASCII characters in identifiers is an extremely U.S.-centric approach for a language that claims to support Unicode. We agreed to expand the identifier syntax to allow non-ASCII characters. Because the Unicode standard includes non-normative guidelines for identifier syntax, we decided to adopt them, extending them to allow the dollar sign ($), and to allow the underscore (_) at the beginning of an identifier. We also required incoming source text to be normalized, helping again to cut down on gratuitous identifier mismatches.We also added a Java-like escaping mechanism so that users of ASCII-only editors could still edit code that included non-ASCII identifier characters.
However, we stopped short of making all Unicode digit characters legal in numeric literals or allowing the typographic variants of things like the minus sign or the quotation marks as legal syntax. This seemed to create new problems: for example, if we allow alternative digits in numeric literals, shouldn't we allow alternative decimal-point characters? Allowing the comma as a decimal point would introduce ambiguity in number literals and conflict with the existing syntactic meanings of the comma.

Language-sensitive string operations
So we had to get through all of those low-level issues before we could even begin to look at some of the more obvious internationalization issues. There are obviously a whole host of operations on strings that care about the characters in the string or produce strings based on something else.

When a language or an API isn't designed to be language-sensitive in the first place, bolting it on later is generally problematic. The main reason is that once the first version of something has been released, you can't just replace the API with one that's language-sensitive. Instead, you're stuck with adding a new API that is language-sensitive and trying to encourage programmers to use it.

For example, in Java you have separate objects that do number formatting and string comparison in a locale-sensitive way, but the locale-insensitive way of doing these operations is right there on the objects being operated on. If you want to do a string comparison, you'll tend to prefer String.compareTo() even though it does only a bitwise compare. To do a language-sensitive comparison, you have to look in a whole different package for the Collator object, create one, and then use Collator.compare() to compare two strings to each other. In an ideal world, String.compareTo() should have taken an optional locale parameter or an optional Collator object to do a locale-sensitive compare. We couldn't add this after the fact. The same applied to number formatting and other processes.

So we had to decide how to add locale-sensitive functions to an existing API. This posed two interesting questions: 1) How should we add new functions? and 2) What new functions should we add?

How should we add new functions?
For this question, we opted not to add any new objects to the system, but instead to add minimal API to the existing objects. In doing so, we tried to think ahead to future things that we might want to do. It's still unclear how the future internationalization library will present its API because we still don't know exactly what its relationship will be to the rest of the system. But we tried to do the most rational things possible.

What new functions should we add?
Deciding what new functions to add to the API was a tricky issue. Because ECMAScript was a small, simple language, we didn't want to load down the whole thing with a full-blown internationalization library like the one in Java. This could have easily more than doubled the size of the runtime. We decided to take a two-pronged approach: Add the bare minimum necessary to provide decent internationalization support in the core language and have a separate, detachable internationalization library. We concentrated on the core language for the next release and are working on the internationalization library for a future release.

A minimalistic approach for the core language
The approach we took for the core language was as minimalistic as we could reasonably make it. We added no more than the basic functionality, eschewing all frills. We also took steps to ensure that we imposed no more of an implementation burden on implementors than was absolutely necessary. In support of that philosophy, we designed everything so that implementors could use any facilities available to them from their host environments. They could also declare that they support only a single locale and hard-code its behavior, or just fall back on the locale-independent algorithms. This could introduce portability problems, but the idea was that these functions were really only for producing user output and other things like that. In other words, we wanted everything to produce a "reasonable result" for whatever locale the user was using. That was it.

We also deliberately avoided adding functions that would allow the user to specify a target locale or customize the behavior of something. These would have gone against the no-frills policy. However, we recognized that this is something we'll definitely want in the future, so we tried to take steps to make this possible. (Because ECMAScript allows variable numbers of function parameters, we just warned implementors against adding parameters to the locale-sensitive functions.)

We couldn't delete the existing locale-independent API, nor could we change its behavior. Not only would that break existing applications, it could also break other parts of the runtime. The current ECMAScript API relies heavily on type conversions, and we didn't want to disturb that behavior.

Instead, because everything had a toString() method, we added a toLocaleString() method that would be parallel with it. We added toLocaleString() to Object so that all objects would have the API. In the default case, toLocaleString() just calls toString(). We overrode that behavior in Number, Date, and Array. (We might want to consider overriding it in Null, Undefined, and Boolean too, but we're not planning to for this release.)

In all three cases, the new functions are defined simply to have implementation-defined behavior, but we give guidelines for what implementations should do.

Some number and date formatting "frills" are necessary
For number and date formatting, we also decided that some "frills" were actually so basic as to be necessary. We added several new methods to Number that allow you to control whether you get fixed-point or scientific notation and how many decimal places things have. These functions won't be locale-sensitive in the upcoming edition, but will be in future ones. We also added functions that allow formatting of only the "time" or "date" portion of Date, instead of the whole thing. These functions exist in both locale-sensitive and locale-independent versions.

Number parsing only
Normally, there's some kind of parsing interface that goes along with a formatting interface. ECMAScript doesn't have this. The only parsing facility defined in ECMAScript is for the Number type. We talked about adding it for other types, but there were too many complexities involved, so we tabled the discussion. Even for number parsing, we extended its grammar to allow non-ASCII whitespace to be treated as whitespace, but we didn't add locale-sensitive number parsing, as this also seemed to have too many complexities. The problem in both cases was that there isn't a good way to communicate to the user what format to enter things like numbers and dates in, and trying to do the best with whatever we get is complicated and error-prone.

Locale-sensitive string comparison
We also decided we needed some kind of locale-sensitive string comparison. We wound up adding a localeCompare() method to String to handle this. Again, the function is minimal: it works only on the system default locale, and it doesn't allow setting of collation strength or decomposition level. We made it possible for implementors to provide a strength capability, but didn't require it. And again, we said it had implementation-dependent results, but we provided guidelines for those results.

toUpperCase() and toLowerCase() functions
Finally, String had toUpperCase() and toLowerCase() functions. These were originally defined only to do the one-to-one mappings from the Unicode Character Database. This excludes some mappings that go from one character to several (such as "? to "SS") or have context-dependent results. We added those back. We also added toLocaleUpperCase() and toLocaleLowerCase(). Most of the time these functions will have exactly the same behavior as toUpperCase() and toLowerCase(), but in some special cases (such as the Turkish "I" and "i") it has language-specific behavior. (Originally, we weren't going to add the new functions, but someone pointed out that all other locale-sensitive functions in the system had "Locale" somewhere in their names and that we should keep to this convention.)

Regular expressions
ECMAScript includes a regular-expression engine based on Perl, and the working group has taken care to avoid unnecessary inconsistencies with Perl as much as possible. The new standard will support Unicode in its regular-expression engine to the extent that Perl does now. Unfortunately, this means there won't be a way to test characters for membership in any arbitrary Unicode category. We're planning to add that in the next version. And because regular-expression matching on natural-language text can be extremely difficult, we specifically limited the scope of the regular-expression engine to exclude natural-language text from its design goals.

We also wound up defining case-insensitive regular-expression searches to work differently from the behavior of the toUpperCase() and toLowerCase() functions. Because we had defined regular expressions to be optimized for program text and the like, we felt that the full natural-language generality of normal Unicode case mapping would actually stand in our way. This means we purposely decided to forgo all one-to-many mappings and all context-sensitive mappings. In a few cases where a single character had more than one potential case mapping (such as I, which could map to either i or ?) we decided to support only one. Doing otherwise would have caused some weird problems with the expression "[a-z]" in a case-insensitive regular expression.

The Unicode standard actually defines a set of regular-expression guidelines. We're not following these right now, but will upgrade to support at least the minimal guidelines in the future. A full-blown language-sensitive regular expression requires much more full-featured support for locale-sensitive comparison than we have right now. All of these things are under discussion for future versions of ECMAScript.

Date and time handling
Date and time handling is another major category of internationalization support. The first version of ECMAScript already had a surprisingly full set of date- and time-handling facilities when I came to the project. It was already storing dates in a locale- and calendar-independent fashion and providing interfaces to extract certain pieces of the value (the day of the month, for example).

Unfortunately, the Date APIs were designed very rigidly and without thinking about international calendar support. All behavior was pegged not only to the Gregorian calendar, but to one specific algorithm that extrapolates Gregorian dates back into the past, prohibiting even a Gregorian-calendar implementation that correctly handles the switch from Julian to Gregorian. They did have time-zone support, but it was rather clumsy and constraining.

The problem with this kind of solution, again, is that the only way to fix it is to deprecate the whole API and replace it with something more internationalization-friendly. In fact, the working group already had to do this once when they discovered a Y2K problem in the original getYear() function.

The same thing happened in Java: the original Java Date class had a large number of functions for getting various pieces of the date, but did it in a way that was completely tied to the Gregorian calendar. All of those APIs had to be deprecated and replaced with a new Calendar class that operates on Dates. So Date was yet another case where an API that was designed without any thought about internationalization support will have to be replaced to allow it.

We didn't try to address this in the upcoming version of the standard because we're not going to support any international calendars yet, but we plan on reexamining this whole thing in subsequent versions.

Message-catalog facility made no sense
One thing we explicitly didn't address was the message-catalog issue. In our case, it didn't make any sense. This is because dynamic binding isn't a feature of ECMAScript. ECMAScript programs are completely self-contained: the only way to incorporate third-party code into a script is to copy and paste it. ECMAScript also has no I/O facilities, so there's no standard way of getting UI elements out of an external file.

All of this would really allow any particular program to support only one locale. A single resource bundle would be easy to implement as a regular Object whose properties are the resource names and whose property values are the resources. This is good programming practice, but didn't seem to necessitate the addition of anything to the standard.

Another approach that scriptwriters writing HTML could use is to embed the localizable data in the enclosing HTML document. Again, though, there's no standardized way in the language to access that data, so this would be an implementation-specific solution.

Most solutions to having localizable text will involve having the HTML server serve different pages depending on the user's locale. All of the pages would be identical except for the localized user-interface elements. There's a lot of repetition here, but there's no API that we could add within the confines of the ECMAScript standard that would make this any easier.

Without a significant change in ECMAScript's usage model, we agreed that a resource-bundle or a message-catalog facility doesn't make any sense in ECMAScript.

We also decided not to implement a word-boundary-detection feature. Although a facility that detects user-character boundaries, word boundaries, and UCS-4 character boundaries would be useful, especially to support more advanced text searching, we decided we could do without it for now. But we do plan to add something in the future.

Conclusions
Not many of the traditional internationalization library features got into the ECMAScript language this time around, but we still did a lot of work. The work wasn't always in the obvious places: the standard lacked many things that were necessary for full Unicode compatibility. Remember, Unicode support means more than 16-bit characters.

The first cardinal rule of internationalization is to make sure that you plan for it during the initial design. Failing to do so usually results in an API that has to be seriously overhauled, making it more complicated, more problematic for programmers, and more embarrassing. And internationalization is complicated stuff to do right. On the other hand, it may be reasonable to defer internationalization support to later versions of a project.
The second rule, if you're working on a project like this one, is to think carefully about the various details that we looked at and addressed. These are the most important things to get right.
The third rule is to know your audience. We wound up making very different choices about how to add internationalization support to ECMAScript than the Java team made when adding it to Java. This is because Java and ECMAScript are very different languages serving very different constituencies. The full-blown Java solution would have been overkill in ECMAScript, and some of the most fundamental features of the Java internationalization libraries, such as ResourceBundle, aren't even relevant to ECMAScript. Be careful about slapping a prefabricated list of features into a new project without considering carefully what it really needs.

Internationalization is something that never seems to be taken seriously until after problems have arisen. I hope some of the lessons we've learned can help you to avoid costly mistakes in your project.

About the author
Richard Gillam is a longtime member of IBM's Unicode Technology group, where he has contributed code or architecture to virtually every project the group has undertaken. He is currently a member of the Java Internationalization team, which works under contract to Sun to provide text-analysis facilities and other low-level utilities to the JDK. He is a columnist for C++ Report magazine, a regular contributor to various industry publications and conferences, and a member of the ECMA working group on scripting languages. Rich holds a bachelor's degree in percussion performance from the Eastman School of Music. You can reach him at rgillam@us.ibm.com.