DB2 Version 10.1 for Linux, UNIX, and Windows

Unicode character encoding

The Unicode character encoding standard is a universal character encoding scheme that includes characters from almost all of the living languages of the world. It defines both fixed-length and variable-length encoding forms, described in the sections that follow.

Information about Unicode can be found in the latest edition of The Unicode Standard and on the Unicode Consortium website at www.unicode.org.

Unicode uses two encoding forms, 8-bit and 16-bit, based on the data type of the data being encoded. The default encoding form is 16-bit: each character is 16 bits (two bytes) wide and is usually shown as U+hhhh, where hhhh is the hexadecimal code point of the character. While the resulting 65 536 code elements are sufficient for encoding most of the characters of the major languages of the world, the Unicode standard also provides an extension mechanism that allows the encoding of 1 048 576 additional characters. The extension mechanism uses a pair of high and low surrogate characters to encode one extended, or supplementary, character. The first (or high) surrogate character has a code value between U+D800 and U+DBFF, and the second (or low) surrogate character has a code value between U+DC00 and U+DFFF.
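
For example, the supplementary character U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range and is encoded as the surrogate pair U+D834 U+DD1E. The following minimal Java sketch shows the split; it is an illustration only, not part of DB2, and the class name SurrogateDemo is hypothetical.

    public class SurrogateDemo {
        public static void main(String[] args) {
            int codePoint = 0x1D11E;                     // MUSICAL SYMBOL G CLEF, a supplementary character
            char[] units = Character.toChars(codePoint); // two 16-bit code units (a surrogate pair)
            System.out.printf("high surrogate: U+%04X%n", (int) units[0]); // U+D834, in U+D800..U+DBFF
            System.out.printf("low surrogate:  U+%04X%n", (int) units[1]); // U+DD1E, in U+DC00..U+DFFF
        }
    }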

UCS-2

The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) standard 10646 (ISO/IEC 10646) specifies the Universal Multiple-Octet Coded Character Set (UCS), which has a 16-bit (two-byte) version (UCS-2) and a 32-bit (four-byte) version (UCS-4). UCS-2 is identical to the Unicode 16-bit form without surrogates. UCS-2 can encode all the (16-bit) characters defined in the Unicode version 3.0 repertoire. Two UCS-2 characters (a high surrogate followed by a low surrogate) are required to encode each of the supplementary characters introduced starting in Unicode version 3.1. These supplementary characters are defined outside the original 16-bit Basic Multilingual Plane (BMP or Plane 0).

UTF-16

ISO/IEC 10646 also defines an extension technique for encoding some UCS-4 characters using two UCS-2 characters. This extension, called UTF-16, is identical to the Unicode 16-bit encoding form with surrogates. In summary, the UTF-16 character repertoire consists of all the UCS-2 characters plus the additional one million characters accessible via the surrogate pairs.
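
Recombining a surrogate pair back into the supplementary character it encodes follows directly. A minimal Java sketch (illustrative only; the class name SurrogatePairDemo is hypothetical):

    public class SurrogatePairDemo {
        public static void main(String[] args) {
            char high = '\uD800'; // first (high) surrogate
            char low  = '\uDC00'; // second (low) surrogate
            if (Character.isHighSurrogate(high) && Character.isLowSurrogate(low)) {
                int codePoint = Character.toCodePoint(high, low); // 0x10000, the first supplementary character
                System.out.printf("U+%05X%n", codePoint);         // prints U+10000
            }
        }
    }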

When serializing 16-bit Unicode characters into bytes, some processors place the most significant byte in the initial position (known as big-endian order), while others place the least significant byte first (known as little-endian order). The default byte ordering for Unicode is big-endian.
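
The difference is visible when the same character is serialized under each byte order, as in this minimal Java sketch (illustrative only; the class name ByteOrderDemo is hypothetical):

    import java.nio.charset.StandardCharsets;

    public class ByteOrderDemo {
        public static void main(String[] args) {
            String s = "A"; // U+0041
            byte[] be = s.getBytes(StandardCharsets.UTF_16BE); // most significant byte first
            byte[] le = s.getBytes(StandardCharsets.UTF_16LE); // least significant byte first
            System.out.printf("big-endian:    %02X %02X%n", be[0], be[1]); // 00 41
            System.out.printf("little-endian: %02X %02X%n", le[0], le[1]); // 41 00
        }
    }

Note that the big-endian form of 'A' begins with a zero byte; this is the problem for byte-oriented applications described in the next section.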

UTF-8

Sixteen-bit Unicode characters pose a major problem for byte-oriented, ASCII-based applications and file systems. For example, applications that are not Unicode aware can misinterpret the leading 8 zero bits of the uppercase character 'A' (U+0041) as the single-byte ASCII NULL character.

UTF-8 (UCS Transformation Format 8) is an algorithmic transformation that converts fixed-length Unicode characters into variable-length, ASCII-safe byte strings. In UTF-8, ASCII and control characters are represented by their usual single-byte codes, and other characters become two or more bytes long. UTF-8 can encode both non-supplementary and supplementary characters.

UTF-8 characters can be up to 4 bytes long. Non-supplementary characters are up to 3 bytes long, and supplementary characters are 4 bytes long.
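
The variable lengths can be observed by encoding one code point from each range, as in this minimal Java sketch (illustrative only; the class name Utf8Lengths is hypothetical):

    import java.nio.charset.StandardCharsets;

    public class Utf8Lengths {
        public static void main(String[] args) {
            String[] samples = {
                "A",                                    // U+0041 encodes as 1 byte
                "\u00E9",                               // U+00E9 encodes as 2 bytes
                "\u20AC",                               // U+20AC encodes as 3 bytes
                new String(Character.toChars(0x10400))  // U+10400 (supplementary) encodes as 4 bytes
            };
            for (String s : samples) {
                byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
                System.out.printf("U+%04X -> %d byte(s)%n", s.codePointAt(0), utf8.length);
            }
        }
    }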

The number of bytes required for each UTF-16 character in UTF-8 format can be determined from Table 1.

Table 1. UTF-8 Bit Distribution

Code Value (binary)       UTF-16 (binary)     1st byte   2nd byte   3rd byte   4th byte
                                              (binary)   (binary)   (binary)   (binary)
00000000 0xxxxxxx         00000000 0xxxxxxx   0xxxxxxx
00000yyy yyxxxxxx         00000yyy yyxxxxxx   110yyyyy   10xxxxxx
zzzzyyyy yyxxxxxx         zzzzyyyy yyxxxxxx   1110zzzz   10yyyyyy   10xxxxxx
uuuuu zzzzyyyy yyxxxxxx   110110ww wwzzzzyy   11110uuu   10uuzzzz   10yyyyyy   10xxxxxx
                          110111yy yyxxxxxx
                          (where uuuuu = wwww+1)

In each of the code values listed in the previous table, the series of u's, w's, x's, y's, and z's is the bit representation of the character. For example, U+0080 transforms into 11000010 10000000 in binary format, and the surrogate character pair U+D800 U+DC00 becomes 11110000 10010000 10000000 10000000 in binary format.
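
The transformation in the last row of Table 1 can be reproduced directly with bit operations. The following Java sketch (illustrative only; the class name Utf8BitDistribution is hypothetical) converts the surrogate pair U+D800 U+DC00 into the same four UTF-8 bytes shown in the example above:

    public class Utf8BitDistribution {
        public static void main(String[] args) {
            int high = 0xD800, low = 0xDC00; // surrogate pair for U+10000
            int wwww = (high >> 6) & 0xF;    // wwww bits of 110110ww wwzzzzyy
            int uuuuu = wwww + 1;            // per Table 1: uuuuu = wwww+1
            int zzzz = (high >> 2) & 0xF;    // zzzz bits of 110110ww wwzzzzyy
            int yyyyyy = ((high & 0x3) << 4) | ((low >> 6) & 0xF); // top yy bits from the high surrogate, lower yyyy from the low
            int xxxxxx = low & 0x3F;         // xxxxxx bits of 110111yy yyxxxxxx
            int b1 = 0xF0 | (uuuuu >> 2);                // 11110uuu
            int b2 = 0x80 | ((uuuuu & 0x3) << 4) | zzzz; // 10uuzzzz
            int b3 = 0x80 | yyyyyy;                      // 10yyyyyy
            int b4 = 0x80 | xxxxxx;                      // 10xxxxxx
            System.out.printf("%02X %02X %02X %02X%n", b1, b2, b3, b4); // prints F0 90 80 80
        }
    }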