One of the advantages of Unicode is its consistent interpretation across many computer systems (also known as platforms). Unfortunately, the interpretation of many legacy codepages is not consistent from one platform to another. For various reasons, many organizations and computer manufacturers have made small, incompatible changes to some codepage interpretations. This causes portability problems when legacy codepage data is transferred between platforms.
This Converter Explorer will allow you to "explore" the aliases and properties of each ICU converter. More details about ICU converters can be found on our Charset Repository page, the ICU API reference, and the Conversion section of the ICU User's Guide. All data from this explorer comes directly from ICU.
If you are wondering why some alias names or byte sequences are mapped certain ways, you can always view the ICU alias table directly. The alias table is not meant to be easily read by newcomers to ICU, which is the main reason why the Converter Explorer exists, but it does contain comments that some people may find helpful. The alias table from CVS may contain information that is more current than your copy of ICU or what is currently available in Converter Explorer. The bottom of each Converter Explorer page always describes the version of ICU it is using.
IANA is the main source of converter aliases on the Internet. Since IANA does not specify the Unicode mappings for every codepage and alias, and every platform supports other aliases besides the IANA aliases, ICU provides a way to target the codepage conversion based upon the standard or platform. This allows you to use the right converter name and implementation based upon which standard you are targeting.
You can change the view of aliases for each standard by selecting the appropriate standard at the top of the page. This will allow you to see the subset of aliases that a standard or platform can recognize. For example, if you select IANA and ALL and then select the "View Results" button, you will see all aliases recognized by IANA and ICU. You will notice that the IANA set of aliases is a subset of all ICU aliases.
The column marked as "Internal Converter Name" is also known as the canonical name. The canonical name is a unique ICU converter name, and it is usually based upon the UTR #22 naming scheme. Within a particular ICU release, the canonical name is guaranteed to identify the correct converter, but mapping tables are sometimes updated between ICU releases, and the converter that a name refers to may change at that time. API functions like ucnv_getCanonicalName() and ucnv_getName() will return this value. The ucnv_getStandardName() function requires this name as an argument.
The "All Aliases" column is not a real standard. It is just a special way to see all of the aliases for a specific converter regardless of which standards support the converter's alias names.
The "Untagged Aliases" column is also not a real standard. It is a special way to see all of the aliases that are not associated with any particular standard. An alias appears in this column either because it names an alternate mapping table that has the same name under a different standard, or because it is a rarely used alias whose use is discouraged.
Once you have selected a converter to view, you can get all of the details about that converter. Here is a list of things that you will find on that page:
The ucnv_getType() API will return this value. See the API reference for details.
The value that ucnv_isAmbiguous() returns. When this value is TRUE, it usually implies that this is a non-ASCII compatible codepage and an ASCII compatible codepage is available.
ucnv_getCanonicalName() or ucnv_getStandardName().
When this value is TRUE, then a conversion from this codepage to Unicode will always generate Unicode in Normalization Form C (NFC). When this value is UNKNOWN, then this converter may generate Unicode text that is not in NFC depending on the input, and applying an NFC transformation may change the original text. This value is derived by creating a Unicode Set with the value "[[:NFC_Quick_Check=yes:]&[:ccc=0:]]", and confirming that it is a full superset of the codepage's Unicode Set. More details about Unicode Normalization can be found in Unicode Standard Annex #15.
When this value is TRUE, then a conversion to or from this codepage can contain bidirectional characters. These are right-to-left characters, like Hebrew and Arabic characters. When displaying data from this codepage, you may need to apply the BiDi algorithm described in Unicode Standard Annex #9.
ucnv_getUnicodeSet() and ulocdata_getExemplarSet(), and making sure that the returned UnicodeSet for the language is a complete subset of the given codepage. The list of languages comes from uloc_getAvailable().
ucnv_getUnicodeSet().