Guideline G: Introducing Asian ideographic scripts


Overview



Chinese Hanzi (both simplified and traditional), Japanese Kanji, and Korean Hanja characters are based on conceptual pictures called ideographs rather than on phonetics. A complete collection of ideographs for the scripts of these languages contains many thousands of characters, so any representation of Asian text must use two or more bytes to represent each character uniquely. The double-byte character set (DBCS) is the most popular choice on Asian platforms. The notion of uppercase and lowercase does not exist in these scripts.

As in the SBCS (single-byte character set) situation, the same character can have different code points in different DBCS CCSIDs.

Example: The Kanji character 美, meaning beauty, is represented by X'457D' on the IBM Japanese mainframe using the EBCDIC DBCS encoding scheme (CCSID 00930), or by X'C8FE' on IBM Japanese AIX computers using the IBM EUC encoding scheme (CCSID 05050).
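The effect of the encoding scheme on the code point can be checked with a short Python sketch. It assumes that the character in the example is 美 (U+7F8E) and uses Python's standard `euc_jp` and `shift_jis` codecs as rough stand-ins for the IBM CCSIDs; they are not identical to them. The EBCDIC value for CCSID 00930 is not reproduced because the standard library has no codec for it.

```python
# Assumption: the "beauty" character is U+7F8E. Python's built-in codecs
# approximate the IBM CCSIDs named in the text; they are not the same thing.
BEAUTY = "\u7f8e"


def code_point_hex(text: str, codec: str) -> str:
    """Return the bytes of `text` in `codec` as an uppercase hex string."""
    return text.encode(codec).hex().upper()


euc_value = code_point_hex(BEAUTY, "euc_jp")      # EUC form, as in CCSID 05050
sjis_value = code_point_hex(BEAUTY, "shift_jis")  # PC form, for comparison

print(f"EUC-JP:    X'{euc_value}'")
print(f"Shift JIS: X'{sjis_value}'")
```

The two printed values differ even though both identify the same character, which is why data must always be interpreted together with its CCSID.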

In an Asian data stream, ideographic characters in names and addresses are encoded in more than one byte because these characters are not found in single-byte character sets. Entities such as yes/no responses, English text, and numbers are encoded in SBCS for ease of entry and lower storage cost. Control codes, such as alert and line feed, are also encoded in SBCS because they are absent from non-single-byte coded character sets. Thus a typical Asian data stream consists of characters represented by a mixture of one byte, two bytes, and sometimes three or more bytes. Such multibyte encoding has no effect on numeric calculations. This publication uses the more accurate and inclusive term multibyte character set (MBCS), instead of the industry norm DBCS, to mean a mixture of single-byte and non-single-byte characters. The term DBCS is reserved for pure double-byte characters.

In an IBM EBCDIC MBCS data stream, all bytes bracketed by the Shift Out (X'0E') and Shift In (X'0F') control codes represent double-byte characters. Shift Out (X'0E') shifts out from SBCS to DBCS while Shift In (X'0F') shifts into SBCS from DBCS. Some database products, however, classify pure double-byte IBM EBCDIC characters as GRAPHIC and store the data without the Shift Out and Shift In control codes.

Example: In the data string below, 'S' represents a single-byte character, 'DD' represents a double-byte character, 'SO' is the Shift Out control code (X'0E'), and 'SI' is the Shift In control code (X'0F').

Sample data string
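The Shift Out and Shift In bracketing described above can be sketched in Python as follows. This is a minimal illustration, not a product-quality parser: the function name `split_ebcdic_mbcs` and the 'S'/'DD' labels are chosen here to mirror the notation of the example, and the input bytes are made up for the demonstration.

```python
SO = 0x0E  # Shift Out: switch from SBCS to DBCS
SI = 0x0F  # Shift In: switch from DBCS back to SBCS


def split_ebcdic_mbcs(data: bytes):
    """Split a mixed stream into ('S', 1 byte) and ('DD', 2 bytes) units.

    The SO and SI control codes themselves are consumed and not returned.
    """
    units = []
    in_dbcs = False
    i = 0
    while i < len(data):
        byte = data[i]
        if not in_dbcs and byte == SO:
            in_dbcs = True
            i += 1
        elif in_dbcs and byte == SI:
            in_dbcs = False
            i += 1
        elif in_dbcs:
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character")
            units.append(("DD", data[i:i + 2]))
            i += 2
        else:
            units.append(("S", data[i:i + 1]))
            i += 1
    if in_dbcs:
        raise ValueError("stream ended without Shift In")
    return units


# Hypothetical stream: S, SO, DD, DD, SI, S
stream = bytes([0xC1, SO, 0x45, 0x7D, 0x45, 0x7D, SI, 0xC2])
units = split_ebcdic_mbcs(stream)
```

Data stored as GRAPHIC, without the shift codes, would not need this scan: every pair of bytes is one character.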

In an IBM PC MBCS data stream, the first byte of every double-byte character must lie within a certain predefined range. The product must retrieve these predefined ranges dynamically using platform-supplied functions. Do not hard-code them in your product, because they are subject to change.

Example: The PC data string ABC123 below shows a mixture of single-byte Latin characters and double-byte Kanji characters.

PC data string


You can recognize the double-byte characters by checking the range of code positions of the first byte using the sample table below.



Language             Mixed code page        Range of code positions of 1st byte of DBCS
Japanese             CCSID 00943            X'81' - X'9F', X'E0' - X'FC'
Korean               CCSID 01363            X'8F' - X'FE'
Simplified Chinese   CCSID 01386            X'8C' - X'FE'
Traditional Chinese  IBM BIG-5 CCSID 00950  X'81' - X'FE'

Note that existing APIs should be used to determine the range of 1st byte positions rather than hard-coding the information in the table above.


Example: On an IBM Japanese PC running CCSID 00943, a byte that lies within the range X'81' to X'9F' or X'E0' to X'FC' is the first byte of a double-byte character, and the byte that follows it is the second byte of that character. Any other byte is a single-byte character.


Data stream   X'41   42   8D5C                           5C'
Character     A      B    Japanese character X'8D5C'     ¥
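The first-byte test in this example can be sketched in Python. The ranges are passed in as a parameter so that a product can supply values obtained from the platform; the constant `SAMPLE_LEAD_RANGES_943` below is copied from the sample table purely for illustration and should not be hard-coded in real code. The function names are hypothetical.

```python
# Illustration only: first-byte ranges for CCSID 00943 taken from the sample
# table. A real product should query the platform for these values.
SAMPLE_LEAD_RANGES_943 = [(0x81, 0x9F), (0xE0, 0xFC)]


def is_lead_byte(byte: int, lead_ranges) -> bool:
    """Return True if `byte` falls in any of the first-byte ranges."""
    return any(low <= byte <= high for low, high in lead_ranges)


def split_pc_mbcs(data: bytes, lead_ranges):
    """Split a PC MBCS stream into single-byte and double-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        width = 2 if is_lead_byte(data[i], lead_ranges) else 1
        if i + width > len(data):
            raise ValueError("truncated double-byte character")
        chars.append(data[i:i + width])
        i += width
    return chars


stream = bytes.fromhex("41428D5C5C")
chars = split_pc_mbcs(stream, SAMPLE_LEAD_RANGES_943)
```

Note that the byte X'5C' appears twice in the stream: once as the second byte of the double-byte character and once as a single-byte character. Scanning from the start of the stream is what distinguishes the two; a search for X'5C' that ignores character boundaries would match both.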

In an EUC (Extended UNIX Code) data stream, usually encountered on workstations running the UNIX operating system, MBCS data is the norm. Characters of different sizes are demarcated using techniques from both the IBM EBCDIC and PC environments: shift control characters and predefined regions of the leading byte. Characters from up to four character sets (referred to as G0, G1, G2, and G3, or as code sets 0, 1, 2, and 3) can be included in the data stream, all obeying the following guidelines:


Example: CCSID 05050 - Japanese EUC



CCSID 05050

G set  Character set  Code page  CCSID
G0     CS01120        CP00895    CCSID895
G1     CS01058        CP00952    CCSID952
G2     CS01121        CP00896    CCSID13184
G3     CS01060        CP00953    CCSID9145

Example: In Japanese EUC, the following four character sets are used:


Character set  Notation  Content
First          X'c1'     JIS X 0201 Roman
Second         X'c2c2'   Double-byte Japanese characters (JIS X 0208)
Third          X'c3'     Single-byte Katakana characters
Fourth         X'c4c4'   Double-byte extended Japanese characters (JIS X 0212)

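The character sequence X'c1 c2c2 c3 c4c4 c4c4 c1' is represented in the data stream by X'c1 c2c2 8E c3 8F c4c4 8F c4c4 c1': each character from the third set is preceded by the single-shift code X'8E', and each character from the fourth set by X'8F'.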
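A Japanese EUC stream can be taken apart with a sketch like the one below. The widths after X'8E' and X'8F' follow the table above. The rule used to tell the first set from the second (bytes below X'80' are single-byte, bytes with the high bit set start a double-byte character) is the usual EUC convention and is an assumption here, since the text does not state it. The function name and the G0 to G3 labels in the output are illustrative.

```python
SS2 = 0x8E  # single shift: the next character comes from the third set (G2)
SS3 = 0x8F  # single shift: the next character comes from the fourth set (G3)


def split_japanese_euc(data: bytes):
    """Split a Japanese EUC stream into (set name, character bytes) pairs.

    The single-shift bytes are consumed and not included in the output.
    Simplification: any other byte >= X'80' is treated as the start of a
    double-byte G1 character; no further validation is done.
    """
    units = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == SS2:
            name, start, width = "G2", i + 1, 1
        elif byte == SS3:
            name, start, width = "G3", i + 1, 2
        elif byte >= 0x80:
            name, start, width = "G1", i, 2
        else:
            name, start, width = "G0", i, 1
        end = start + width
        if end > len(data):
            raise ValueError("truncated character")
        units.append((name, data[start:end]))
        i = end
    return units


# One character from each set, then a trailing single-byte character.
sample = bytes.fromhex("41 C8FE 8EB1 8FB0A1 41")
units = split_japanese_euc(sample)
```

Unlike the EBCDIC Shift Out and Shift In codes, which switch mode until the opposite code appears, each single shift here applies to one character only, so the scanner carries no state between characters.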