The Unicode Standard, Version 2.0 (TUS2.0) provides different ways to encode accented characters, either decomposed (a combining character sequence [CCS]) or composed (as a single precomposed character). For example, the following are equivalent:

Precomposed | Decomposed |
---|---|

Ã | A + ~ |

The TUS2.0 specifies an algorithm for determining whether any two sequences
of Unicode characters are canonical equivalent (see TUS2.0, pages 3-9 through
3-10). This algorithm basically decomposes any precomposed characters, then
sorts them according to special rules, based on each character's *combining
class*. This produces a normalized form.

Two common functions on Unicode text are to fully decompose the text (as far as possible), and to fully compose the text (as far as possible). In both cases, the correct result can only be achieved if the text is first converted to a normalized form.

The following describes mechanisms for composing and decomposing Unicode text that do not require fully normalizing the text, and yet produce the correct results. By avoiding the normalization phase, they represent significant performance advantages.

Note: |
In the following discussion, we will abbreviate the Unicode names for brevity.
Thus LATIN CAPITAL LETTER G WITH BREVE will be represented as G-breve.
A plus sign will be used to indicate a sequence of characters. |

The following discussion requires that the reader have first read Chapter 3 of TUS2.0.

The simple method for producing a normalized decomposed form is to replace each character by its decomposition, then normalize the entire string. However, this does more work than is necessary, especially in the common cases. The optimized method works as follows:

- Get the decomposition data from http://unicode.org
- Process the data to put all the decompositions in normalized order.
- The result is a mapping from characters to their normalized decompositions.

- Iterate through each character C in the source string. Process C according
to its type:
- C has a decomposition, where the first character of the decomposition
is a base character:
- Append the decomposition to the result string.

- C is a combining mark:
- Append the combining mark to the result string, then bubble-sort the combining mark through the result string, from the end forward, according to the description on TUS2.0, p 3-10.

- C has a decomposition, where the first character of the decomposition
is not a base character:
- For each character in the decompostion, treat as in the last step.

- C is none of the above:
- Add to the result string.

- C has a decomposition, where the first character of the decomposition
is a base character:

This method avoids bubble-sorting all of the combining marks in a string, and optimizes for the common cases:

- there are no, or few additional combining characters in the string.
- the string is already in normalized form.

Since you are guaranteed that the decomposition is already in normalized order, as each successive combining character is appended, it is bubble-sorted up in the decomposition. Since the sequence starts in normalized order, and after each successive character the result is in normalized order, then the final result is in normalized order.

Source | 1 | 2 |
---|---|---|

Ã` | A + ~ | A + ~ + ` |

Ã. | A + ~ | A + . + ~ |

The simple method for producing a normalized composed form is to match each possible CCS against a database to see what matches, then replace the CCS with the result. However, this does more work than is necessary, especially in the common cases. The optimized method works as follows:

The following algorithm depends on the fact that except for one anomolous
case, every CCS of length greater than two (which is canonical equivalent
to a precomposed character) is also equivalent to a CCS of length exactly
two. For example, `C + cedilla + acute` is equivalent to `C-cedilla
+ acute`, and `C + acut`e` + cedilla` is equivalent to
`C-acute + cedilla`.

- Get the decomposition data from http://unicode.org
- Process the data to generate the set of all possible orders of combining
marks.

For example, from the CCS`A + tilde + acute + dot_below`, generate the following:

(This CCS does not actually have a precomposed character in Unicode--it is only used for illustration.)`A + tilde + acute + dot_below``A + tilde + dot_below + acute``A + acute + tilde + dot_below``A + acute + dot_below + tilde``A + dot_below + tilde + acute``A + dot_below + acute + tilde`

- Filter out any combinations that are not canonically equivalent to
the original.

For example, for`A + tilde + acute + dot_below`, tilde and acute cannot be reversed and remain canonical equivalent, so you end up with just:`A + tilde + acute + dot_below``A + tilde + dot_below + acute``A + dot_below + tilde + acute`

- For every combination left, if the combination has length greater than
two, replace all but the last character with the corresponding precomposed
character, thus ending with combinations of exactly length 2.

For example:`A + tilde + acute + dot_below`becomes`A-tilde-acute + dot_below` - To the mapping table, add the mappings from all these pairs to the
fully precomposed character:

`A-tilde-acute + dot_below => A-tilde-acute-dot_below`

A-tilde-dot_below + acute => A-tilde-acute-dot_below

A-dot_below-tilde + acute => A-tilde-acute-dot_below

- Iterate through each character C in the source string.
- If C is a base character, add it to the result string, remember its position, and continue iterating
- If C is a combining mark, see if the last base character + C have a mapping M in the mapping table.

- If not, append C to the result string, and continue iterating
- If so, scan the combining characters after the last base character
(if any)
- If any such characters have the same combining class as C, then append C to the result string, and continue iterating
- Otherwise, replace the base character by M, and continue iterating

Since all combinations of characters that could combine are in the mapping table, in every order that they could occur in, all the precomposed forms will be generated. Since we scan for illegal reversals, we eliminate non-canonical equivalents. At each point in this process, the result string contains a valid composition of the initial portion of the source string.

Notes: |
If we didn't scan the intervening combining characters, then we could end
up with a non-canonical equivalent sequence. For example, consider the following
sequence: G + acute + breve. If we didn't scan, then this would
produce G-breve + acute, since G-breve is a precomposed
Unicode character, but G-acute is not. When decomposed, this represents
G + brev + acute, which is not a cononical equivalent to the orginal
string, since breve and acute have the same canonical class. |

The one anomolous precomposed character does require a special case in this algorithm--for simplicity of presentation, this complication is omitted. |