utf.h File Reference

C API: Code point macros.

#include "unicode/utypes.h"
#include "unicode/utf8.h"
#include "unicode/utf16.h"
#include "unicode/utf_old.h"

Defines

#define U_SENTINEL   (-1)
 This value is intended for sentinel values for APIs that (take or) return single code points (UChar32).
#define U_IS_UNICODE_NONCHAR(c)
 Is this code point a Unicode noncharacter?
#define U_IS_UNICODE_CHAR(c)
 Is c a Unicode code point value (0..U+10ffff) that can be assigned a character?
#define U_IS_BMP(c)   ((uint32_t)(c)<=0xffff)
 Is this code point a BMP code point (U+0000..U+ffff)?
#define U_IS_SUPPLEMENTARY(c)   ((uint32_t)((c)-0x10000)<=0xfffff)
 Is this code point a supplementary code point (U+10000..U+10ffff)?
#define U_IS_LEAD(c)   (((c)&0xfffffc00)==0xd800)
 Is this code point a lead surrogate (U+d800..U+dbff)?
#define U_IS_TRAIL(c)   (((c)&0xfffffc00)==0xdc00)
 Is this code point a trail surrogate (U+dc00..U+dfff)?
#define U_IS_SURROGATE(c)   (((c)&0xfffff800)==0xd800)
 Is this code point a surrogate (U+d800..U+dfff)?
#define U_IS_SURROGATE_LEAD(c)   (((c)&0x400)==0)
 Assuming c is a surrogate code point (U_IS_SURROGATE(c)), is it a lead surrogate?
#define U_IS_SURROGATE_TRAIL(c)   (((c)&0x400)!=0)
 Assuming c is a surrogate code point (U_IS_SURROGATE(c)), is it a trail surrogate?

Detailed Description

C API: Code point macros.

This file defines macros for checking whether a code point is a surrogate, a noncharacter, and so on.

The UChar and UChar32 data types for Unicode code units and code points are defined in umachine.h because they can be machine-dependent.

utf.h is included by utypes.h and itself includes utf8.h and utf16.h after some common definitions. Those files define macros for efficiently getting code points in and out of UTF-8/16 strings. utf16.h macros have "U16_" prefixes. utf8.h defines similar macros with "U8_" prefixes for UTF-8 string handling.

ICU processes 16-bit Unicode strings. Most of the time, such strings are well-formed UTF-16. Single, unpaired surrogates must be handled as well, and are treated in ICU like regular code points where possible. (Pairs of surrogate code points are indistinguishable from supplementary code points encoded as pairs of surrogate code units.)

In fact, almost all Unicode code points in normal text (>99%) are on the BMP (<=U+ffff) and even <=U+d7ff. ICU functions handle supplementary code points (U+10000..U+10ffff) but are optimized for the much more frequently occurring BMP code points.

UChar is defined to be an unsigned 16-bit integer. If wchar_t is also an unsigned 16-bit type, then UChar is defined to be exactly wchar_t, otherwise uint16_t.

UChar32 is defined to be a signed 32-bit integer (int32_t), large enough for a 21-bit Unicode code point (0..0x10ffff). Before ICU 2.4, the definition of UChar32 was platform-dependent, like the definition of UChar. For details see the documentation for UChar32 itself.

utf.h also defines a small number of C macros for single Unicode code points. These are simple checks for surrogates and non-characters. For actual Unicode character properties see uchar.h.

By default, string operations must be done with error checking in case a string is not well-formed UTF-16. The macros will detect if a surrogate code unit is unpaired (lead unit without trail unit or vice versa) and just return the unit itself as the code point. (It is an accidental property of Unicode and UTF-16 that all malformed sequences can be expressed unambiguously with a distinct subrange of Unicode code points.)

The regular "safe" macros require that the initial, passed-in string index is within bounds. They only check the index when they read more than one code unit. This is usually done with code similar to the following loop:

while(i<length) {
    U16_NEXT(s, i, length, c);
    // use c
}
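
As a rough, self-contained sketch of what the safe loop above does for each step, the following function reads one code point and returns an unpaired surrogate as itself. This is not ICU's U16_NEXT implementation: demo_u16_next, demo_count_code_points and the DEMO_ macros are hypothetical names, and the two surrogate tests are local copies of the definitions documented in this file.

```c
#include <stdint.h>

/* Local copies of the surrogate tests documented in this file,
   renamed so they cannot clash with the real ICU macros. */
#define DEMO_IS_LEAD(c)  (((c)&0xfffffc00)==0xd800)
#define DEMO_IS_TRAIL(c) (((c)&0xfffffc00)==0xdc00)

/* Hypothetical helper: read the code point at s[*pi] and advance *pi.
   Precondition (as for the safe macros): *pi < length. */
int32_t demo_u16_next(const uint16_t *s, int32_t *pi, int32_t length) {
    int32_t c = s[(*pi)++];
    /* The index is checked again only when a second code unit is needed. */
    if (DEMO_IS_LEAD(c) && *pi < length && DEMO_IS_TRAIL(s[*pi])) {
        /* Well-formed pair: combine lead and trail into a supplementary code point. */
        c = ((c - 0xd800) << 10) + (s[(*pi)++] - 0xdc00) + 0x10000;
    }
    /* Otherwise c is a BMP code point or an unpaired surrogate, returned as is. */
    return c;
}

/* Count the code points in s[0..length), mirroring the loop shown above. */
int32_t demo_count_code_points(const uint16_t *s, int32_t length) {
    int32_t i = 0, n = 0;
    while (i < length) {
        (void)demo_u16_next(s, &i, length);
        ++n;
    }
    return n;
}
```

With the code units 0x0041, 0xd83d, 0xde00, 0xd800, 0x0042, this sketch yields four code points: U+0041, U+1f600, the unpaired surrogate U+d800, and U+0042.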

When it is safe to assume that text is well-formed UTF-16 (does not contain single, unpaired surrogates), then one can use U16_..._UNSAFE macros. These do not check for proper code unit sequences or truncated text and may yield wrong results or even cause a crash if they are used with "malformed" text. In practice, U16_..._UNSAFE macros will produce slightly less code but should not be faster because the processing is only different when a surrogate code unit is detected, which will be rare.

Similarly for UTF-8, there are "safe" macros without a suffix, and U8_..._UNSAFE versions. The performance differences are much larger here because UTF-8 provides so many opportunities for malformed sequences. The unsafe UTF-8 macros are entirely implemented inside the macro definitions and are fast, while the safe UTF-8 macros call functions for all but the trivial (ASCII) cases. (ICU 3.6 optimizes U8_NEXT() and U8_APPEND() to handle most other common characters inline as well.)

Unlike with UTF-16, malformed UTF-8 sequences cannot be expressed with distinct code point values (0..U+10ffff). They are indicated with negative values instead.
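
To illustrate the negative-value convention, here is a minimal sketch of a checking UTF-8 reader. It is not ICU's U8_NEXT: demo_u8_next is a hypothetical function, the choice of -1 as the negative value is an assumption, and how far the index advances past a malformed sequence is a decision of this sketch only.

```c
#include <stdint.h>

/* Hypothetical helper, not an ICU API: decode one UTF-8 sequence starting at
   s[*pi] (precondition: *pi < length), advance *pi, and return the code
   point, or a negative value if the sequence is malformed. */
int32_t demo_u8_next(const uint8_t *s, int32_t *pi, int32_t length) {
    uint8_t b = s[(*pi)++];
    int32_t c, min;
    int trail;
    if (b < 0x80) {
        return b;                      /* ASCII: the trivial case */
    } else if (b >= 0xc2 && b <= 0xdf) {
        c = b & 0x1f; trail = 1; min = 0x80;
    } else if ((b & 0xf0) == 0xe0) {
        c = b & 0x0f; trail = 2; min = 0x800;
    } else if (b >= 0xf0 && b <= 0xf4) {
        c = b & 0x07; trail = 3; min = 0x10000;
    } else {
        return -1;                     /* stray trail byte or invalid lead byte */
    }
    while (trail-- > 0) {
        if (*pi >= length || (s[*pi] & 0xc0) != 0x80) {
            return -1;                 /* truncated text or missing trail byte */
        }
        c = (c << 6) | (s[(*pi)++] & 0x3f);
    }
    if (c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff)) {
        return -1;                     /* overlong form, out of range, or surrogate */
    }
    return c;
}
```

A caller can test the result with c<0 to detect malformed input, since no valid code point is negative.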

For more information see the ICU User Guide Strings chapter (http://icu-project.org/userguide/strings.html).

Usage: ICU coding guidelines for if() statements should be followed when using these macros. Compound statements (curly braces {}) must be used for if, else, and while bodies, and all macro statements should be terminated with a semicolon.
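
The reason for the brace rule can be sketched with a hypothetical statement macro. DEMO_NEXT_UNIT and demo_sum_units are invented for illustration, and the sketch assumes a macro that expands to a braced block, which is how statement-like macros such as U16_NEXT are commonly written.

```c
#include <stdint.h>

/* Hypothetical statement macro: it expands to a braced block, so
   "DEMO_NEXT_UNIT(...);" is a block followed by an empty statement. */
#define DEMO_NEXT_UNIT(s, i, c) { (c) = (s)[(i)++]; }

/* Sum the code units of s[0..length), skipping zero units. */
int32_t demo_sum_units(const uint16_t *s, int32_t length) {
    int32_t i = 0, sum = 0;
    int32_t c;
    while (i < length) {
        /* Braces around both branches keep the if/else well-formed.
           Without them, "if (x) DEMO_NEXT_UNIT(s, i, c); else ..." would not
           compile, because the ';' after the macro's block ends the if
           statement before the else is reached. */
        if (s[i] != 0) {
            DEMO_NEXT_UNIT(s, i, c);
            sum += c;
        } else {
            ++i;
        }
    }
    return sum;
}
```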

Stable:
ICU 2.4

Definition in file utf.h.


Define Documentation

#define U_IS_BMP(c)     ((uint32_t)(c)<=0xffff)

Is this code point a BMP code point (U+0000..U+ffff)?

Parameters:
c 32-bit code point
Returns:
TRUE or FALSE
Stable:
ICU 2.8

Definition at line 176 of file utf.h.

#define U_IS_LEAD(c)     (((c)&0xfffffc00)==0xd800)

Is this code point a lead surrogate (U+d800..U+dbff)?

Parameters:
c 32-bit code point
Returns:
TRUE or FALSE
Stable:
ICU 2.4

Definition at line 192 of file utf.h.

#define U_IS_SUPPLEMENTARY(c)     ((uint32_t)((c)-0x10000)<=0xfffff)

Is this code point a supplementary code point (U+10000..U+10ffff)?

Parameters:
c 32-bit code point
Returns:
TRUE or FALSE
Stable:
ICU 2.8

Definition at line 184 of file utf.h.
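
The range tests for BMP and supplementary code points each use a single unsigned comparison. Subtracting 0x10000 maps U+10000..U+10ffff onto 0..0xfffff, and the cast to uint32_t turns anything below the range (including negative values such as U_SENTINEL) into a very large number that fails the test. The sketch below uses local copies of the two definitions under hypothetical DEMO_ names; demo_plane_class is likewise invented for illustration.

```c
#include <stdint.h>

/* Local copies of the definitions documented in this file. */
#define DEMO_IS_BMP(c)           ((uint32_t)(c)<=0xffff)
#define DEMO_IS_SUPPLEMENTARY(c) ((uint32_t)((c)-0x10000)<=0xfffff)

/* Classify a 32-bit value:
   0 = BMP code point, 1 = supplementary code point, -1 = not a code point. */
int demo_plane_class(int32_t c) {
    if (DEMO_IS_BMP(c)) {
        return 0;
    }
    if (DEMO_IS_SUPPLEMENTARY(c)) {
        return 1;
    }
    return -1;
}
```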

#define U_IS_SURROGATE(c)     (((c)&0xfffff800)==0xd800)

Is this code point a surrogate (U+d800..U+dfff)?

Parameters:
c 32-bit code point
Returns:
TRUE or FALSE
Stable:
ICU 2.4

Definition at line 208 of file utf.h.

#define U_IS_SURROGATE_LEAD(c)     (((c)&0x400)==0)

Assuming c is a surrogate code point (U_IS_SURROGATE(c)), is it a lead surrogate?

Parameters:
c 32-bit code point
Returns:
TRUE or FALSE
Stable:
ICU 2.4

Definition at line 217 of file utf.h.

#define U_IS_SURROGATE_TRAIL(c)     (((c)&0x400)!=0)

Assuming c is a surrogate code point (U_IS_SURROGATE(c)), is it a trail surrogate?

Parameters:
c 32-bit code point
Returns:
TRUE or FALSE
Draft:
This API may change in future versions. It was introduced in ICU 4.2.

Definition at line 226 of file utf.h.
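
U_IS_SURROGATE_LEAD and U_IS_SURROGATE_TRAIL test a single bit, so their result is only meaningful once U_IS_SURROGATE has already succeeded. A minimal sketch of the intended calling order follows, using local copies of the three definitions under hypothetical DEMO_ names; demo_surrogate_kind is an invented helper.

```c
#include <stdint.h>

/* Local copies of the definitions documented in this file. */
#define DEMO_IS_SURROGATE(c)       (((c)&0xfffff800)==0xd800)
#define DEMO_IS_SURROGATE_LEAD(c)  (((c)&0x400)==0)
#define DEMO_IS_SURROGATE_TRAIL(c) (((c)&0x400)!=0)

/* Returns 0 for a non-surrogate, 1 for a lead surrogate,
   2 for a trail surrogate. */
int demo_surrogate_kind(int32_t c) {
    if (!DEMO_IS_SURROGATE(c)) {
        return 0;   /* the one-bit test below would be meaningless here */
    }
    if (DEMO_IS_SURROGATE_LEAD(c)) {
        return 1;
    } else {
        return 2;
    }
}
```

Applied directly to a non-surrogate such as U+0041, the one-bit lead test is true even though the code point is not a surrogate at all, which is why the range check has to come first.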

#define U_IS_TRAIL(c)     (((c)&0xfffffc00)==0xdc00)

Is this code point a trail surrogate (U+dc00..U+dfff)?

Parameters:
c 32-bit code point
Returns:
TRUE or FALSE
Stable:
ICU 2.4

Definition at line 200 of file utf.h.

#define U_IS_UNICODE_CHAR(c)
Value:
((uint32_t)(c)<0xd800 || \
        ((uint32_t)(c)>0xdfff && \
         (uint32_t)(c)<=0x10ffff && \
         !U_IS_UNICODE_NONCHAR(c)))

Is c a Unicode code point value (0..U+10ffff) that can be assigned a character?

Code points that are not characters include:

  • single surrogate code points (U+d800..U+dfff, 2048 code points)
  • the last two code points on each plane (U+__fffe and U+__ffff, 34 code points)
  • U+fdd0..U+fdef (new with Unicode 3.1, 32 code points)
  • values above U+10ffff, which is the highest Unicode code point value

This means that all code points below U+d800 are character code points, and that boundary is tested first for performance.

Parameters:
c 32-bit code point
Returns:
TRUE or FALSE
Stable:
ICU 2.4

Definition at line 164 of file utf.h.
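
The counts in the list above can be checked by brute force. The sketch below uses local copies of the two definitions under hypothetical DEMO_ names and counts the values in 0..0x10ffff that fail the test; demo_count_non_chars is an invented helper.

```c
#include <stdint.h>

/* Local copies of the definitions documented in this file. */
#define DEMO_IS_UNICODE_NONCHAR(c) \
    ((c)>=0xfdd0 && \
     ((uint32_t)(c)<=0xfdef || ((c)&0xfffe)==0xfffe) && \
     (uint32_t)(c)<=0x10ffff)

#define DEMO_IS_UNICODE_CHAR(c) \
    ((uint32_t)(c)<0xd800 || \
        ((uint32_t)(c)>0xdfff && \
         (uint32_t)(c)<=0x10ffff && \
         !DEMO_IS_UNICODE_NONCHAR(c)))

/* Count the values in 0..0x10ffff for which the character test is false. */
int32_t demo_count_non_chars(void) {
    int32_t c, n = 0;
    for (c = 0; c <= 0x10ffff; ++c) {
        if (!DEMO_IS_UNICODE_CHAR(c)) {
            ++n;
        }
    }
    return n;
}
```

The function returns 2114, which is 2048 surrogates plus 34 plane-final code points plus the 32 code points U+fdd0..U+fdef. Values outside 0..0x10ffff, such as -1 or 0x110000, also fail the test.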

#define U_IS_UNICODE_NONCHAR(c)
Value:
((c)>=0xfdd0 && \
     ((uint32_t)(c)<=0xfdef || ((c)&0xfffe)==0xfffe) && \
     (uint32_t)(c)<=0x10ffff)

Is this code point a Unicode noncharacter?

Parameters:
c 32-bit code point
Returns:
TRUE or FALSE
Stable:
ICU 2.4

Definition at line 142 of file utf.h.

#define U_SENTINEL   (-1)

This value is intended for sentinel values for APIs that (take or) return single code points (UChar32).

It is outside of the Unicode code point range 0..0x10ffff.

For example, a "done" or "error" value in a new API could be indicated with U_SENTINEL.

ICU APIs designed before ICU 2.4 usually define service-specific "done" values, mostly 0xffff. Those may need to be distinguished from actual U+ffff text contents by calling functions like CharacterIterator::hasNext() or UnicodeString::length().

Returns:
-1
See also:
UChar32
Stable:
ICU 2.4

Definition at line 134 of file utf.h.
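
As an illustration of why a value outside 0..0x10ffff makes a better "done" marker than 0xffff, consider the sketch below. DemoIter and demo_iter_next are hypothetical and not part of ICU; DEMO_SENTINEL simply has the same value as U_SENTINEL.

```c
#include <stdint.h>

#define DEMO_SENTINEL (-1)   /* same value as U_SENTINEL */

/* Hypothetical iterator over an array of code points. */
typedef struct {
    const int32_t *text;
    int32_t length;
    int32_t pos;
} DemoIter;

/* Return the next code point, or DEMO_SENTINEL when the text is exhausted. */
int32_t demo_iter_next(DemoIter *it) {
    if (it->pos >= it->length) {
        return DEMO_SENTINEL;
    }
    return it->text[it->pos++];
}
```

Because the sentinel is negative, text that really contains U+ffff is returned unchanged and the caller needs no extra hasNext()-style call to tell it apart from the end of the text.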

Generated on Sat Jan 23 15:17:40 2010 for ICU 4.3.4 by  doxygen 1.6.1