Chinese (both simplified and traditional) Hanzi, Japanese Kanji, and Korean Hanja characters are based on conceptual pictures called ideographs, rather than on phonetics. A complete collection of ideographs for the scripts of these languages contains many thousands of characters, so any representation of Asian text must use two or more bytes to uniquely represent a particular character. The double-byte character set (DBCS) is the most popular choice on Asian platforms. The notion of uppercase and lowercase does not exist in these Asian scripts.
As with single-byte character sets (SBCS), the same character can have different code points in different DBCS CCSIDs.
Example: The Kanji character 美, meaning beauty, is represented by X'457D' on IBM Japanese mainframes using the EBCDIC DBCS encoding scheme (CCSID 00930), or by X'C8FE' on IBM Japanese AIX computers using the IBM EUC encoding scheme (CCSID 05050).
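The point that one character has different code points under different encodings can be checked with a short Python sketch. It assumes that Python's standard `euc_jp` codec agrees with the EUC value quoted above for this character. The `shift_jis` codec is added only as a second encoding for comparison and is not mentioned in the text. The Python standard library does not ship a codec for the EBCDIC CCSID 00930, so the X'457D' value is not reproduced here. The helper name `code_point_hex` is hypothetical.

```python
def code_point_hex(char: str, codec: str) -> str:
    """Return the encoded bytes of one character as an X'..' hex literal."""
    return "X'" + char.encode(codec).hex().upper() + "'"


# U+7F8E is the Kanji meaning "beauty".
KANJI_BEAUTY = "\u7f8e"

if __name__ == "__main__":
    # Same character, two different encodings, two different code points.
    for codec in ("euc_jp", "shift_jis"):
        print(codec, code_point_hex(KANJI_BEAUTY, codec))
```

Both results are two bytes long, but the byte values differ, which is why data must be converted between CCSIDs rather than copied byte for byte.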
In an Asian data stream, ideographic characters in names and addresses are encoded in more than one byte because these characters are not found in single-byte character sets. Entities such as yes/no responses, English text, and numbers are encoded in SBCS for ease of entry and lower storage cost. Control codes, such as alert and line feed, are also encoded in SBCS because they are absent from non-single-byte coded character sets. Thus a typical Asian data stream consists of characters represented by a mixture of one byte, two bytes, and sometimes three or more bytes. Such multibyte encoding has no effect on numeric calculations. This publication uses the more accurate and inclusive term multibyte character set (MBCS), instead of the industry norm DBCS, to mean a mixture of single-byte and non-single-byte characters. The term DBCS is used for pure double-byte characters.
In an IBM EBCDIC MBCS data stream, all bytes bracketed by the Shift Out (X'0E') and Shift In (X'0F') control codes represent double-byte characters. Shift Out (X'0E') shifts out from SBCS to DBCS while Shift In (X'0F') shifts into SBCS from DBCS. Some database products, however, classify pure double-byte IBM EBCDIC characters as GRAPHIC and store the data without the Shift Out and Shift In control codes.
Example: In the data string below, 'S' represents a single-byte character, 'DD' represents a double-byte character, 'SO' is the Shift Out control code (X'0E'), and 'SI' is the Shift In control code (X'0F').
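As a rough illustration of the Shift Out / Shift In bracketing, the following Python sketch splits a mixed stream into single-byte and double-byte runs. The function name `split_mixed_stream` and the sample bytes are hypothetical. The sample has the form S S SO DD SI S and reuses X'457D' from the earlier Kanji example as its one double-byte character. The sketch assumes that X'0E' and X'0F' never occur inside a double-byte character, and it treats an unmatched Shift Out as an error.

```python
SHIFT_OUT = 0x0E  # shifts out from SBCS to DBCS
SHIFT_IN = 0x0F   # shifts into SBCS from DBCS


def split_mixed_stream(data: bytes) -> list[tuple[str, bytes]]:
    """Split an SO/SI-bracketed stream into ("SBCS", ...) and ("DBCS", ...) runs."""
    runs: list[tuple[str, bytes]] = []
    i = 0
    while i < len(data):
        if data[i] == SHIFT_OUT:
            # Everything up to the next Shift In is double-byte data.
            end = data.find(bytes([SHIFT_IN]), i + 1)
            if end == -1:
                raise ValueError("Shift Out without a matching Shift In")
            dbcs = data[i + 1:end]
            if len(dbcs) % 2:
                raise ValueError("odd number of bytes in a DBCS run")
            runs.append(("DBCS", dbcs))
            i = end + 1  # skip past the Shift In
        else:
            # Collect single-byte data up to the next Shift Out (or the end).
            start = i
            while i < len(data) and data[i] != SHIFT_OUT:
                i += 1
            runs.append(("SBCS", data[start:i]))
    return runs


if __name__ == "__main__":
    # Illustrative bytes only: S S SO DD SI S
    sample = b"\xC1\xC2" + b"\x0E\x45\x7D\x0F" + b"\xC3"
    for kind, run in split_mixed_stream(sample):
        print(kind, run.hex())
```

A product that stores pure double-byte data as GRAPHIC would keep only the bytes of the DBCS runs and drop the two control codes.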
In an IBM PC MBCS data stream, the first byte of every double-byte character must lie within one of several predefined ranges. These predefined ranges must be retrieved dynamically by the product using platform-supplied functions. Do not hard-code them in your product because they are subject to change.
Example: The PC data string below shows a mixture of single-byte Latin characters and double-byte Kanji characters.
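A minimal sketch of lead-byte scanning in Python follows. The names `is_lead_byte` and `split_characters` are hypothetical. The lead-byte ranges are passed in as a parameter because a real product would obtain them from a platform-supplied function at run time. The ranges in the usage lines (X'81' to X'9F' and X'E0' to X'FC') are illustrative values commonly cited for Japanese PC code page 932 and are an assumption, not something the text specifies.

```python
def is_lead_byte(byte: int, lead_ranges) -> bool:
    """Return True if the byte falls in any (low, high) lead-byte range."""
    return any(low <= byte <= high for low, high in lead_ranges)


def split_characters(data: bytes, lead_ranges) -> list[bytes]:
    """Split a PC MBCS byte string into one- and two-byte characters."""
    chars: list[bytes] = []
    i = 0
    while i < len(data):
        # A lead byte starts a two-byte character; anything else is one byte.
        width = 2 if is_lead_byte(data[i], lead_ranges) else 1
        if i + width > len(data):
            raise ValueError("truncated double-byte character")
        chars.append(data[i:i + width])
        i += width
    return chars


if __name__ == "__main__":
    # Illustrative ranges only; retrieve the real ones from the platform.
    ranges = [(0x81, 0x9F), (0xE0, 0xFC)]
    # Two Latin letters, one double-byte Kanji, one more Latin letter.
    sample = b"AB" + b"\x94\xfc" + b"C"
    print([c.hex() for c in split_characters(sample, ranges)])
```

In this sample the second byte of the Kanji (X'FC') also falls inside a lead-byte range, so the string has to be scanned from its start; testing a byte in isolation cannot tell whether it begins a character.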