The Unicode Standard, Version 2.0 (TUS2.0) provides different ways to encode accented characters: decomposed (as a combining character sequence [CCS]) or composed (as a single precomposed character). For example, the following are equivalent:
Precomposed | Decomposed |
---|---|
Ã | A + ~ |
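As an illustration, this equivalence can be checked with Python's standard `unicodedata` module (whose NFC and NFD normalized forms postdate TUS2.0 but correspond to the composed and decomposed forms discussed here):

```python
import unicodedata

precomposed = "\u00C3"   # Ã, LATIN CAPITAL LETTER A WITH TILDE
decomposed = "A\u0303"   # A followed by COMBINING TILDE

# The two encodings are canonically equivalent: decomposing one
# yields the other, and composing the other yields the one.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
```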
TUS2.0 specifies an algorithm for determining whether any two sequences of Unicode characters are canonically equivalent (see TUS2.0, pages 3-9 through 3-10). The algorithm decomposes any precomposed characters, then sorts the combining characters according to special rules based on each character's combining class. This produces a normalized form.
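The sorting step can be sketched in Python (this is an illustration, not the standard's own pseudocode; `unicodedata.combining` returns a character's combining class, with 0 for non-combining characters):

```python
import unicodedata

def canonical_order(text: str) -> str:
    """Bubble-sort combining characters (nonzero combining class) into
    canonical order; class-0 characters act as barriers and never move."""
    chars = list(text)
    for i in range(1, len(chars)):
        j = i
        while j > 0:
            prev_cc = unicodedata.combining(chars[j - 1])
            cur_cc = unicodedata.combining(chars[j])
            # Swap only when the current character is combining and the
            # preceding one has a strictly higher combining class.
            if cur_cc != 0 and prev_cc > cur_cc:
                chars[j - 1], chars[j] = chars[j], chars[j - 1]
                j -= 1
            else:
                break
    return "".join(chars)

# COMBINING TILDE (class 230) after COMBINING DOT BELOW (class 220) is
# already in canonical order; the reverse order gets swapped into place.
canonical_order("A\u0303\u0323")   # -> "A\u0323\u0303"
```

Note that characters with equal combining classes are never reordered; this is what makes the sort stable and the normalized form unique.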
Two common operations on Unicode text are to fully decompose it and to fully compose it (in each case, as far as possible). In both cases, the correct result can only be achieved if the text is first converted to a normalized form.
The following describes mechanisms for composing and decomposing Unicode text that do not require fully normalizing the text, yet still produce the correct results. By avoiding the normalization phase, they offer a significant performance advantage.
Note: | In the following discussion, Unicode character names are abbreviated: LATIN CAPITAL LETTER G WITH BREVE is written G-breve. A plus sign indicates a sequence of characters. |
The following discussion requires that the reader have first read Chapter 3 of TUS2.0.
The simple method for producing a normalized decomposed form is to replace each character by its decomposition, then normalize the entire string. However, this does more work than necessary, especially in the common cases. The optimized method avoids bubble-sorting all of the combining marks in the string and is tuned for the common cases. It works as follows:
Each character's decomposition is guaranteed to already be in normalized order, so as each successive combining character is appended, it is bubble-sorted up into place in the result. Since the sequence starts in normalized order, and remains in normalized order after each successive character is added, the final result is in normalized order.
Source | Step 1 | Step 2 |
---|---|---|
Ã` | A + ~ | A + ~ + ` |
Ã. | A + ~ | A + . + ~ |
The simple method for producing a normalized composed form is to match each possible CCS against a database, then replace the CCS with its precomposed equivalent. However, this does more work than necessary, especially in the common cases. The optimized method works as follows:
The following algorithm depends on the fact that, except for one anomalous case, every CCS of length greater than two that is canonically equivalent to a precomposed character is also equivalent to a CCS of length exactly two. For example, C + cedilla + acute is equivalent to C-cedilla + acute, and C + acute + cedilla is equivalent to C-acute + cedilla.
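These particular equivalences can be checked directly (again with Python's `unicodedata`, as an illustration):

```python
import unicodedata

# C + cedilla + acute (length 3)  ==  C-cedilla + acute (length 2)
assert (unicodedata.normalize("NFC", "C\u0327\u0301")
        == unicodedata.normalize("NFC", "\u00C7\u0301"))

# C + acute + cedilla (length 3)  ==  C-acute + cedilla (length 2)
assert (unicodedata.normalize("NFC", "C\u0301\u0327")
        == unicodedata.normalize("NFC", "\u0106\u0327"))
```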
Since all combinations of characters that could combine are in the mapping table, in every order in which they could occur, all the precomposed forms will be generated. Since we scan for illegal reversals, non-canonical equivalents are eliminated. At each point in this process, the result string contains a valid composition of the initial portion of the source string.
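A sketch of this composition pass in Python. Here NFC on a two-character string is used as a stand-in for looking the pair up in the mapping table, and the combining-class comparison implements the scan for intervening (blocking) characters; the function name `compose` is illustrative:

```python
import unicodedata

def compose(source: str) -> str:
    """Compose a normalized (decomposed, canonically ordered) string by
    pairing each combining character with the last starter, unless an
    intervening character of equal or higher class blocks it."""
    if not source:
        return source
    result = [source[0]]
    starter = 0                              # index of last starter in result
    last_cc = unicodedata.combining(source[0])
    for ch in source[1:]:
        cc = unicodedata.combining(ch)
        if last_cc == 0 or last_cc < cc:
            # Not blocked: stand-in for a pair lookup in the mapping table.
            combined = unicodedata.normalize("NFC", result[starter] + ch)
            if len(combined) == 1:
                result[starter] = combined   # ch absorbed into the starter
                continue
        result.append(ch)
        if cc == 0:
            starter = len(result) - 1
        last_cc = cc
    return "".join(result)

# C + cedilla + acute: cedilla combines to give C-cedilla, and the acute
# (not blocked) then combines to give C-cedilla-acute.
compose("C\u0327\u0301")   # -> "\u1E08"
```

When a combining character fails to combine, its class is remembered so that any later character of equal or lower class is treated as blocked from the starter.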
Notes: | If we didn't scan the intervening combining characters, then we could end up with a non-canonically-equivalent sequence. For example, consider the sequence G + acute + breve. If we didn't scan, this would produce G-breve + acute, since G-breve is a precomposed Unicode character, but G-acute is not. When decomposed, this represents G + breve + acute, which is not canonically equivalent to the original string, since breve and acute have the same combining class. |
The one anomalous precomposed character does require a special case in this algorithm; for simplicity of presentation, this complication is omitted. |