Character conversion
A string is a sequence of bytes that may represent characters. Within a string, all the characters are represented by a common coding representation. In some cases, it might be necessary to convert these characters to a different coding representation. The process of conversion is known as character conversion.
Character conversion can occur when an SQL statement is executed remotely. Consider, for example, these two cases:
- The values of variables sent from the application requester to the current server.
- The values of result columns sent from the current server to the application requester.
In either case, the string could have a different representation at the sending and receiving systems. Conversion can also occur during string operations on the same system.
Note that SQL statements are strings and are therefore subject to character conversion.
The following list defines some of the terms used when discussing character conversion.
- character set
- A defined set of characters. For example, the following character set appears in several code pages:
- 26 non-accented letters A through Z
- 26 non-accented letters a through z
- digits 0 through 9
- . , : ; ? ( ) ' " / - _ & + = < >
- code page
- A set of assignments of characters to code points. In EBCDIC, for example, "A" is assigned code point X'C1' and "B" is assigned code point X'C2'. Within a code page, each code point has only one specific meaning.
- code point
- A unique bit pattern that represents a character within a code page.
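The distinction between a character, a code page, and a code point can be sketched in Python, whose built-in codecs include both EBCDIC and ASCII-based code pages. The codec names below (`cp037` for one common EBCDIC code page, `latin-1` for an ASCII-based one) are illustrative assumptions; the actual code pages involved depend on the systems performing the conversion.

```python
# Sketch: the same character set, assigned different code points in
# different code pages. cp037 is one EBCDIC code page; latin-1 is
# an ASCII-based single-byte code page.

def code_point(ch: str, codepage: str) -> int:
    """Return the code point assigned to a character in the given code page."""
    return ch.encode(codepage)[0]

print(hex(code_point("A", "cp037")))    # EBCDIC: 0xc1
print(hex(code_point("B", "cp037")))    # EBCDIC: 0xc2
print(hex(code_point("A", "latin-1")))  # ASCII-based: 0x41

# Character conversion: reinterpret the bytes from the source code page,
# then re-encode them in the target code page.
ebcdic_bytes = "HELLO".encode("cp037")
ascii_bytes = ebcdic_bytes.decode("cp037").encode("latin-1")
print(ascii_bytes)  # b'HELLO'
```

The underlying bit patterns change during conversion even though the characters they represent do not.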
- coded character set
- A set of unambiguous rules that establish a character set and the one-to-one relationships between the characters of the set and their coded representations.
- encoding scheme
- A set of rules used to represent character data. For example:
- Single-byte EBCDIC
- Single-byte ASCII
- Double-byte EBCDIC
- Mixed single- and double-byte ASCII
- Unicode (UTF-8, UCS-2, and UTF-16 universal coded character sets).
- substitution character
- A unique character that is substituted during character conversion for any characters in the source coding representation that do not have a match in the target coding representation.
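Substitution can be sketched with Python's `errors="replace"` encoding mode, which plays the role of the substitution character when a source character has no match in the target coding representation. Python substitutes `?` here; the actual substitution character depends on the target code page.

```python
# Sketch: "é" has no code point in ASCII, so it is replaced by a
# substitution character during conversion.
source = "café"
converted = source.encode("ascii", errors="replace")
print(converted)  # b'caf?'
```

A strict conversion (`errors="strict"` in Python) would instead raise an error, which is why substitution characters exist: they let conversion complete at the cost of losing the unmatched characters.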
- Unicode
- A universal encoding scheme for written characters and text that enables the exchange of data internationally. It provides a character set standard that can be used all over the world. It uses a 16-bit encoding form that provides code points for more than 65,000 characters and an extension called UTF-16 that allows for encoding as many as a million more characters. It provides the ability to encode all characters used for the written languages of the world and treats alphabetic characters, ideographic characters, and symbols equivalently because it specifies a numeric value and a name for each of its characters. It includes punctuation marks, mathematical symbols, technical symbols, geometric shapes, and dingbats. Three encoding forms are supported:
- UTF-8: Unicode Transformation Format, an 8-bit encoding form designed for ease of use with existing ASCII-based systems. UTF-8 data is stored in character data types. The CCSID value for data in UTF-8 format is 1208.
A UTF-8 character can be 1, 2, 3, or 4 bytes in length. A UTF-8 data string can contain any combination of SBCS and DBCS data, including surrogates and combining characters.
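The variable byte lengths of UTF-8 can be sketched in Python; the sample characters below are illustrative choices, one from each length class.

```python
# Sketch: UTF-8 characters range from 1 to 4 bytes in length.
samples = {
    "A": 1,   # basic Latin: 1 byte
    "é": 2,   # Latin-1 supplement: 2 bytes
    "€": 3,   # Basic Multilingual Plane symbol: 3 bytes
    "😀": 4,  # outside the BMP (a surrogate pair in UTF-16): 4 bytes
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")
    assert len(encoded) == expected
```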
- UCS-2: Universal Character Set coded in 2 octets, which means that characters are represented in 16 bits per character. UCS-2 data is stored in graphic data types. The CCSID value for data in UCS-2 format is 13488.
UCS-2 is a subset of UTF-16: it is identical to UTF-16 except that UTF-16 also supports combining characters and surrogates. Because UCS-2 is a simpler form of UTF-16, UCS-2 data will typically perform better than UTF-16.
- UTF-16: Unicode Transformation Format, a 16-bit encoding form designed to provide code values for over a million characters. UTF-16 data is stored in graphic data types. The CCSID value for data in UTF-16 format is 1200.
Both UTF-8 and UTF-16 data can contain combining characters. Combining character support allows a resulting character to be composed of more than one character. After the first character, hundreds of different non-spacing accent characters (umlauts, accents, and so on) can follow in the data string. The resulting character may already be defined in the character set; in this case, there are multiple representations for the same character. For example, in UTF-16, an é can be represented either by X'00E9' (the normalized representation) or by X'00650301' (the non-normalized combining character representation).
Because multiple representations of the same character will not compare equal, it is usually not a good idea to store both forms of a character in the database. Normalization is a process that replaces a string of combining characters with equivalent characters that do not include combining characters. After normalization has occurred, only one representation of any specific character exists in the data. For example, in UTF-16, any instance of X'00650301' (the non-normalized combining character representation of é) is converted to X'00E9' (the normalized representation of é).
To handle UTF-8 data properly in predicates, normalization may occur.
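The normalization described above can be sketched with Python's standard `unicodedata` module; NFC is Python's name for the composed normalization form that replaces combining-character sequences with their equivalent single characters.

```python
import unicodedata

# Sketch: the combining-character form of é (U+0065 "e" followed by
# U+0301, the combining acute accent) is replaced during normalization
# by the single normalized code point U+00E9.
non_normalized = "\u0065\u0301"   # e + combining acute accent
normalized = unicodedata.normalize("NFC", non_normalized)

assert non_normalized != "\u00e9"  # the two forms do not compare equal
assert normalized == "\u00e9"      # after normalization, one representation
print(normalized.encode("utf-16-be").hex())  # 00e9
```

This is exactly the comparison problem the text describes: before normalization, the two representations of é are unequal byte sequences even though they render as the same character.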
Both UTF-8 and UTF-16 can contain 4-byte characters called surrogates. Surrogates are 4-byte sequences that can address one million more characters than would be available in a 2-byte character set.
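A surrogate pair can be sketched in Python; the character chosen below is an illustrative example of a code point outside the 2-byte range.

```python
# Sketch: U+1F600 lies beyond the 65,536 code points addressable in
# 2 bytes, so UTF-16 encodes it as two 2-byte code units (4 bytes
# total): a high surrogate followed by a low surrogate.
ch = "\U0001F600"                  # 😀
utf16 = ch.encode("utf-16-be")
print(utf16.hex())                 # d83dde00: high surrogate D83D, low surrogate DE00
assert len(utf16) == 4

# In UTF-8, the same character is also a 4-byte sequence.
assert len(ch.encode("utf-8")) == 4
```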