Character conversion
A string is a sequence of bytes that may represent characters. Within a string, all the characters are represented by a common coding representation. In some cases, it might be necessary to convert these characters to a different coding representation. The process of conversion is known as character conversion.
Character conversion can occur when an SQL statement is executed remotely. Consider, for example, these two cases:
- The values of variables sent from the application requester to the current server.
- The values of result columns sent from the current server to the application requester.
In either case, the string could have a different representation at the sending and receiving systems. Conversion can also occur during string operations on the same system.
Note that SQL statements are strings and are therefore subject to character conversion.
The following list defines some of the terms used when discussing character conversion.
- character set
- A defined set of characters. For example, the following character set appears in several code pages:
- 26 non-accented letters A through Z
- 26 non-accented letters a through z
- digits 0 through 9
- . , : ; ? ( ) ' " / - _ & + = < >
- code page
- A set of assignments of characters to code points. In EBCDIC, for example, "A" is assigned code point X'C1' and "B" is assigned code point X'C2'. Within a code page, each code point has only one specific meaning.
- code point
- A unique bit pattern that represents a character within a code page.
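The distinction between a character, a code page, and a code point can be sketched in Python, whose built-in codecs include both EBCDIC and ASCII-based code pages. The codec names below (`cp037` for one common EBCDIC code page, `latin-1` for an ASCII-based one) are illustrative assumptions; the actual code pages involved depend on the systems performing the conversion.

```python
# Sketch: the same character set, assigned different code points in
# different code pages. cp037 is one EBCDIC code page; latin-1 is
# an ASCII-based single-byte code page.

def code_point(ch: str, codepage: str) -> int:
    """Return the code point assigned to a character in the given code page."""
    return ch.encode(codepage)[0]

print(hex(code_point("A", "cp037")))    # EBCDIC: 0xc1
print(hex(code_point("B", "cp037")))    # EBCDIC: 0xc2
print(hex(code_point("A", "latin-1")))  # ASCII-based: 0x41

# Character conversion: reinterpret the bytes from the source code page,
# then re-encode them in the target code page.
ebcdic_bytes = "HELLO".encode("cp037")
ascii_bytes = ebcdic_bytes.decode("cp037").encode("latin-1")
print(ascii_bytes)  # b'HELLO'
```

The underlying bit patterns change during conversion even though the characters they represent do not.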
- coded character set
- A set of unambiguous rules that establish a character set and the one-to-one relationships between the characters of the set and their coded representations.
- encoding scheme
- A set of rules used to represent character data. For example:
- Single-byte EBCDIC
- Single-byte ASCII
- Double-byte EBCDIC
- Mixed single- and double-byte ASCII
- Unicode (UTF-8, UCS-2, and UTF-16 universal coded character sets).
- substitution character
- A unique character that is substituted during character conversion for any characters in the source coding representation that do not have a match in the target coding representation.
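Substitution can be sketched with Python's `errors="replace"` encoding mode, which plays the role of the substitution character when a source character has no match in the target coding representation. Python substitutes `?` here; the actual substitution character depends on the target code page.

```python
# Sketch: "é" has no code point in ASCII, so it is replaced by a
# substitution character during conversion.
source = "café"
converted = source.encode("ascii", errors="replace")
print(converted)  # b'caf?'
```

A strict conversion (`errors="strict"` in Python) would instead raise an error, which is why substitution characters exist: they let conversion complete at the cost of losing the unmatched characters.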
- Unicode
- A universal encoding scheme for written characters and text that enables the exchange of data internationally. It provides a character set standard that can be used all over the world. It uses a 16-bit encoding form that provides code points for more than 65,000 characters and an extension called UTF-16 that allows for encoding as many as a million more characters. It provides the ability to encode all characters used for the written languages of the world and treats alphabetic characters, ideographic characters, and symbols equivalently because it specifies a numeric value and a name for each of its characters. It includes punctuation marks, mathematical symbols, technical symbols, geometric shapes, and dingbats. Three encoding forms are supported:
- UTF-8: Unicode Transformation Format, an 8-bit encoding form designed for ease of use with existing ASCII-based systems. UTF-8 data is stored in character data types. The CCSID value for data in UTF-8 format is 1208.
A UTF-8 character can be 1, 2, 3, or 4 bytes in length. A UTF-8 data string can contain any combination of SBCS and DBCS data, including surrogates and combining characters.
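The variable byte lengths of UTF-8 can be sketched in Python; the sample characters below are illustrative choices, one from each length class.

```python
# Sketch: UTF-8 characters range from 1 to 4 bytes in length.
samples = {
    "A": 1,   # basic Latin: 1 byte
    "é": 2,   # Latin-1 supplement: 2 bytes
    "€": 3,   # Basic Multilingual Plane symbol: 3 bytes
    "😀": 4,  # outside the BMP (a surrogate pair in UTF-16): 4 bytes
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")
    assert len(encoded) == expected
```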
- UCS-2: Universal Character Set coded in 2 octets, which means that characters are represented in 16 bits per character. UCS-2 data is stored in graphic data types. The CCSID value for data in UCS-2 format is 13488.
UCS-2 is a subset of UTF-16: it is identical to UTF-16 except that UTF-16 also supports combining characters and surrogates. Because UCS-2 is a simpler form of UTF-16, UCS-2 data will typically perform better than UTF-16.
- UTF-16: Unicode Transformation Format, a 16-bit encoding form designed to provide code values for over a million characters. UTF-16 data is stored in graphic data types. The CCSID value for data in UTF-16 format is 1200.
Both UTF-8 and UTF-16 data can contain combining characters. Combining character support allows a resulting character to be composed of more than one character. After the first character, hundreds of different non-spacing accent characters (umlauts, accents, and so on) can follow in the data string. The resulting character may already be defined in the character set; in this case, there are multiple representations for the same character. For example, in UTF-16, an é can be represented either by X'00E9' (the normalized representation) or by X'00650301' (the non-normalized combining character representation).
Because multiple representations of the same character will not compare equal, it is usually not a good idea to store both forms of a character in the database. Normalization is a process that replaces a string of combining characters with equivalent characters that do not include combining characters. After normalization has occurred, only one representation of any specific character exists in the data. For example, in UTF-16, any instance of X'00650301' (the non-normalized combining character representation of é) is converted to X'00E9' (the normalized representation of é).
To handle UTF-8 data properly in predicates, normalization may occur.
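The normalization described above can be sketched with Python's standard `unicodedata` module; NFC is Python's name for the composed normalization form that replaces combining-character sequences with their equivalent single characters.

```python
import unicodedata

# Sketch: the combining-character form of é (U+0065 "e" followed by
# U+0301, the combining acute accent) is replaced during normalization
# by the single normalized code point U+00E9.
non_normalized = "\u0065\u0301"   # e + combining acute accent
normalized = unicodedata.normalize("NFC", non_normalized)

assert non_normalized != "\u00e9"  # the two forms do not compare equal
assert normalized == "\u00e9"      # after normalization, one representation
print(normalized.encode("utf-16-be").hex())  # 00e9
```

This is exactly the comparison problem the text describes: before normalization, the two representations of é are unequal byte sequences even though they render as the same character.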
Both UTF-8 and UTF-16 can contain 4-byte characters called surrogates. Surrogates are 4-byte sequences that can address one million more characters than would be available in a 2-byte character set.
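A surrogate pair can be sketched in Python; the character chosen below is an illustrative example of a code point outside the 2-byte range.

```python
# Sketch: U+1F600 lies beyond the 65,536 code points addressable in
# 2 bytes, so UTF-16 encodes it as two 2-byte code units (4 bytes
# total): a high surrogate followed by a low surrogate.
ch = "\U0001F600"                  # 😀
utf16 = ch.encode("utf-16-be")
print(utf16.hex())                 # d83dde00: high surrogate D83D, low surrogate DE00
assert len(utf16) == 4

# In UTF-8, the same character is also a 4-byte sequence.
assert len(ch.encode("utf-8")) == 4
```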