Optimized Unicode Composition and Decomposition

by Mark Davis

Background

The Unicode Standard, Version 2.0 (TUS2.0) provides different ways to encode accented characters, either decomposed (a combining character sequence [CCS]) or composed (as a single precomposed character). For example, the following are equivalent:

Precomposed Decomposed
 Ã  A + ~

The TUS2.0 specifies an algorithm for determining whether any two sequences of Unicode characters are canonical equivalent (see TUS2.0, pages 3-9 through 3-10). This algorithm basically decomposes any precomposed characters, then sorts them according to special rules, based on each character's combining class. This produces a normalized form.

Two common functions on Unicode text are to fully decompose the text (as far as possible), and to fully compose the text (as far as possible). In both cases, the correct result can only be achieved if the text is first converted to a normalized form.

Method

The following describes mechanisms for composing and decomposing Unicode text that do not require fully normalizing the text, and yet produce the correct results. By avoiding the normalization phase, they represent significant performance advantages.

Note: In the following discussion, we will abbreviate the Unicode names for brevity. Thus LATIN CAPITAL LETTER G WITH BREVE will be represented as G-breve. A plus sign will be used to indicate a sequence of characters.

The following discussion requires that the reader have first read Chapter 3 of TUS2.0.

Decomposition

The simple method for producing a normalized decomposed form is to replace each character by its decomposition, then normalize the entire string. However, this does more work than is necessary, especially in the common cases. The optimized method works as follows:

Preprocessing the composition data

  1. Get the decomposition data from http://unicode.org
  2. Process the data to put all the decompositions in normalized order.
  3. The result is a mapping from characters to their normalized decompositions.

Processing a particular source string

  1. Iterate through each character C in the source string. Process C according to its type:
    1. C has a decomposition, where the first character of the decomposition is a base character:
      • Append the decomposition to the result string.
    2. C is a combining mark:
      • Append the combining mark to the result string, then bubble-sort the combining mark through the result string, from the end forward, according to the description on TUS2.0, p 3-10.
    3. C has a decomposition, where the first character of the decomposition is not a base character:
      • For each character in the decompostion, treat as in the last step.
    4. C is none of the above:
      • Add to the result string.

This method avoids bubble-sorting all of the combining marks in a string, and optimizes for the common cases:

Since you are guaranteed that the decomposition is already in normalized order, as each successive combining character is appended, it is bubble-sorted up in the decomposition. Since the sequence starts in normalized order, and after each successive character the result is in normalized order, then the final result is in normalized order.

Example

Source 1 2
 Ã`  A + ~  A + ~ + `
 Ã.  A + ~  A + . + ~

Composition

The simple method for producing a normalized composed form is to match each possible CCS against a database to see what matches, then replace the CCS with the result. However, this does more work than is necessary, especially in the common cases. The optimized method works as follows:

Preprocessing the composition data

The following algorithm depends on the fact that except for one anomolous case, every CCS of length greater than two (which is canonical equivalent to a precomposed character) is also equivalent to a CCS of length exactly two. For example, C + cedilla + acute is equivalent to C-cedilla + acute, and C + acute + cedilla is equivalent to C-acute + cedilla.

  1. Get the decomposition data from http://unicode.org
  2. Process the data to generate the set of all possible orders of combining marks.
    For example, from the CCS A + tilde + acute + dot_below, generate the following:
    (This CCS does not actually have a precomposed character in Unicode--it is only used for illustration.)
  3. Filter out any combinations that are not canonically equivalent to the original.
    For example, for A + tilde + acute + dot_below, tilde and acute cannot be reversed and remain canonical equivalent, so you end up with just:
  4. For every combination left, if the combination has length greater than two, replace all but the last character with the corresponding precomposed character, thus ending with combinations of exactly length 2.
    For example: A + tilde + acute + dot_below becomes A-tilde-acute + dot_below
  5. To the mapping table, add the mappings from all these pairs to the fully precomposed character:
    A-tilde-acute + dot_below => A-tilde-acute-dot_below
    A-tilde-dot_below + acute => A-tilde-acute-dot_below
    A-dot_below-tilde + acute => A-tilde-acute-dot_below

Processing a particular source string

  1. Iterate through each character C in the source string.
    1. If C is a base character, add it to the result string, remember its position, and continue iterating
    2. If C is a combining mark, see if the last base character + C have a mapping M in the mapping table.

Since all combinations of characters that could combine are in the mapping table, in every order that they could occur in, all the precomposed forms will be generated. Since we scan for illegal reversals, we eliminate non-canonical equivalents. At each point in this process, the result string contains a valid composition of the initial portion of the source string.

Notes: If we didn't scan the intervening combining characters, then we could end up with a non-canonical equivalent sequence. For example, consider the following sequence: G + acute + breve. If we didn't scan, then this would produce G-breve + acute, since G-breve is a precomposed Unicode character, but G-acute is not. When decomposed, this represents G + brev + acute, which is not a cononical equivalent to the orginal string, since breve and acute have the same canonical class.
  The one anomolous precomposed character does require a special case in this algorithm--for simplicity of presentation, this complication is omitted.