Recognizing multibyte characters

To support MBCS data streams correctly, your product must know whether a given byte is a single-byte character or part of a multibyte character.

Guideline G2

Do not interpret an individual byte of a multibyte character as a single-byte character.

If proper care is not taken, individual bytes of a multibyte character can be misinterpreted as valid single-byte characters. A multibyte character must be handled as a unit, not as separate bytes.

Example: The Kanji character meaning beauty is represented by X'457D' on an IBM mainframe using the EBCDIC MBCS CCSID 00930. However, in CCSID 00500, X'45' is the a acute character á and X'7D' is the single quotation mark character '.

Example: On an IBM Japanese PC running CCSID 00942, X'8D5C' is the Kanji character meaning configuration. However, in IBM CCSID 00850, X'8D' is the i grave character ì and X'5C' is the backslash character \.
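The second example can be sketched in C. The following is a minimal illustration, not a definitive implementation: it searches a PC DBCS byte string for a single-byte character while skipping each double-byte character as a unit. The function names are hypothetical, and the lead-byte ranges (X'81' to X'9F' and X'E0' to X'FC') are an assumption typical of Shift-JIS-style Japanese PC code pages; verify them against the CCSID you actually support.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: lead-byte ranges of a Shift-JIS-style PC DBCS code page.
   Check the real ranges for your target CCSID before relying on this. */
static int is_dbcs_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Returns the byte offset of the first single-byte character equal to
   target, or -1 if there is none. Trail bytes of double-byte characters
   are never compared. A lead byte with no trail byte at the very end of
   the buffer is treated as a single byte. */
long find_single_byte_char(const unsigned char *s, size_t len,
                           unsigned char target)
{
    size_t i = 0;
    assert(s != NULL);
    while (i < len) {
        if (is_dbcs_lead_byte(s[i]) && i + 1 < len) {
            i += 2;                 /* skip the double-byte character as a unit */
        } else {
            if (s[i] == target)
                return (long)i;
            i += 1;
        }
    }
    return -1;
}

/* Naive byte-by-byte search, shown only for contrast: it can report a
   trail byte as if it were a single-byte character. */
long find_byte(const unsigned char *s, size_t len, unsigned char target)
{
    size_t i;
    assert(s != NULL);
    for (i = 0; i < len; i++) {
        if (s[i] == target)
            return (long)i;
    }
    return -1;
}
```

For the byte sequence 'A', X'8D', X'5C', 'B', X'5C', the naive search for X'5C' stops inside the Kanji character at offset 2, whereas the lead-byte-aware search reports the real backslash at offset 4.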

To parse an MBCS data stream correctly, you must either start from the very beginning, if it is an IBM PC data stream, or recognize special control characters such as Shift Out and Single Shift 2, for data streams in the IBM EBCDIC and EUC encoding schemes, respectively.
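For the EBCDIC case, the idea can be sketched as a small state machine that switches between single-byte and double-byte mode on the shift controls. This is an illustrative sketch only: the function name is hypothetical, it assumes Shift Out is X'0E' and Shift In is X'0F', and it does not validate that the bytes inside a double-byte section are legal code points.

```c
#include <assert.h>
#include <stddef.h>

#define EBCDIC_SO 0x0E  /* Shift Out: following bytes are double-byte characters */
#define EBCDIC_SI 0x0F  /* Shift In: return to single-byte characters */

/* Counts the characters in a mixed EBCDIC stream; the shift controls
   themselves are not counted. Returns -1 if the stream is malformed:
   a stray Shift In, a double-byte character cut in half, or a
   double-byte section that is never closed. */
long count_ebcdic_mixed_chars(const unsigned char *s, size_t len)
{
    long count = 0;
    int in_dbcs = 0;
    size_t i = 0;

    assert(s != NULL);
    while (i < len) {
        unsigned char b = s[i];
        if (!in_dbcs) {
            if (b == EBCDIC_SO) {
                in_dbcs = 1;        /* enter double-byte mode */
            } else if (b == EBCDIC_SI) {
                return -1;          /* Shift In without a matching Shift Out */
            } else {
                count++;            /* one byte, one character */
            }
            i += 1;
        } else if (b == EBCDIC_SI) {
            in_dbcs = 0;            /* leave double-byte mode */
            i += 1;
        } else {
            if (i + 1 >= len)
                return -1;          /* second byte of the character is missing */
            count++;                /* two bytes, one character */
            i += 2;
        }
    }
    return in_dbcs ? -1 : count;
}
```

A stream consisting of one single-byte character, Shift Out, X'457D', Shift In, and another single-byte character is six bytes long but contains three characters.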

To aid in the parsing process, X/Open and ISO/IEC have defined a C data type called wchar_t to contain one (wide) character, regardless of its size, and a series of C runtime library functions that convert between MBCS and wchar_t data streams and process wchar_t data streams. Because every wchar_t character has the same width (the width itself is implementation dependent), product developers can concentrate on working with individual wchar_t characters instead of worrying about individual bytes. Refer to the X/Open Portability Guide for more information about X/Open and the set of standardized wide-character C functions.
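As a minimal sketch of this approach, the following helper converts a multibyte string to wchar_t with the standard mbstowcs function. The helper name to_wide is hypothetical. The conversion depends on the current locale, so a real program would normally call setlocale(LC_ALL, "") first; in the default "C" locale only the basic character set is guaranteed to convert.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Converts the multibyte string s (in the current locale's encoding) to
   wide characters in out, which has room for out_cap elements.
   Returns the number of wide characters written, not counting the
   terminating null, or (size_t)-1 if s contains an invalid sequence.
   Input that does not fit is truncated to out_cap - 1 characters. */
size_t to_wide(const char *s, wchar_t *out, size_t out_cap)
{
    size_t n;

    assert(s != NULL && out != NULL && out_cap > 0);
    n = mbstowcs(out, s, out_cap - 1);
    if (n == (size_t)-1)
        return n;                   /* invalid multibyte sequence */
    out[n] = L'\0';                 /* mbstowcs does not terminate on truncation */
    return n;
}
```

After the conversion, each element of out holds exactly one character, so indexing, length counting with wcslen, and searching with wcschr work per character rather than per byte.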

To parse an MBCS data stream correctly in a Windows environment, you can use the Win32 APIs CharNext and CharPrev.

Guideline G2-1

Recognize the unit used for character string length.

In a pure SBCS string, the number of bytes equals the number of characters; in an MBCS string, this equality does not hold. When calling a string function or communicating a string length to another product, ensure that the sender and the receiver use the same counting unit, converting one of the lengths if necessary.

Example: Some database products classify stored character strings as one of two data types:

  1. CHARACTER for MBCS strings where the counting unit is bytes;
  2. GRAPHIC for pure DBCS strings where the counting unit is characters.

For the string X'ssdddds' where s represents a single-byte character and dd represents a double-byte character:

Data type    Length
CHARACTER    7 bytes
GRAPHIC      Cannot be classified as GRAPHIC because the string contains non-double-byte characters

For the string X'dddd' where dd represents a double-byte character:

Data type    Length
CHARACTER    4 bytes
GRAPHIC      2 characters
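The difference between the two counting units can be made concrete with a short sketch. It works on the schematic notation used in this example, where s stands for one single-byte character and dd for one double-byte character, not on real encoded data; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct length_info {
    size_t bytes;      /* length when the counting unit is bytes */
    size_t chars;      /* length when the counting unit is characters */
    int    pure_dbcs;  /* 1 if no single-byte character was seen */
};

/* Measures a schematic pattern such as "ssdddds".
   Returns 0 on success, or -1 if the pattern contains anything other
   than 's' and complete "dd" pairs. */
int measure_pattern(const char *pattern, struct length_info *out)
{
    size_t i = 0;

    assert(pattern != NULL && out != NULL);
    out->bytes = 0;
    out->chars = 0;
    out->pure_dbcs = 1;

    while (pattern[i] != '\0') {
        if (pattern[i] == 's') {
            out->pure_dbcs = 0;     /* a single-byte character rules out GRAPHIC */
            out->bytes += 1;
            i += 1;
        } else if (pattern[i] == 'd' && pattern[i + 1] == 'd') {
            out->bytes += 2;        /* one double-byte character */
            i += 2;
        } else {
            return -1;              /* unknown symbol or half of a "dd" pair */
        }
        out->chars += 1;
    }
    return 0;
}
```

For "ssdddds" the sketch reports 7 bytes, 5 characters, and not pure DBCS; for "dddd" it reports 4 bytes, 2 characters, and pure DBCS. A length passed between products therefore needs to be converted whenever one side counts bytes and the other counts characters.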