G2: Recognizing multibyte characters
To support MBCS data streams correctly, your product must know whether a given byte is a single-byte character or part of a multibyte character.
Guideline G2
If proper care is not taken, individual bytes of a multibyte character can be misinterpreted as valid single-byte characters. Multibyte characters must be handled as a unit, not as individual bytes.
Example: The Kanji character meaning beauty is represented by X'457D' on the IBM mainframe using the EBCDIC MBCS CCSID 00930. However, in CCSID 00500, X'45' is the a-acute character á and X'7D' is the single quotation mark character '.
Example: On an IBM Japanese PC running CCSID 00942, X'8D5C' is the Kanji character meaning configuration, but in IBM CCSID 00850, X'8D' is the i-grave character ì and X'5C' is the backslash character \.
To parse an MBCS data stream correctly, you must do one of two things. For an IBM PC data stream, start parsing from the very beginning of the stream. For data streams in the IBM EBCDIC and EUC encoding schemes, recognize special control characters such as Shift Out and Single Shift 2, respectively.
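The shift-control approach for EBCDIC can be sketched as a small state machine. This is a minimal illustration, not a complete parser: the function name `find_single_byte` is hypothetical, and the sketch assumes that Shift Out (X'0E') opens a run of double-byte characters and Shift In (X'0F') closes it, with every character inside the run occupying exactly two bytes.

```c
#include <assert.h>
#include <stddef.h>

#define SHIFT_OUT 0x0E  /* assumed: enters double-byte mode */
#define SHIFT_IN  0x0F  /* assumed: returns to single-byte mode */

/*
 * Hypothetical helper: returns the index of the first occurrence of
 * `target` that is a genuine single-byte character, or -1 if none.
 * Bytes inside a Shift Out / Shift In run are skipped two at a time,
 * so a trailing byte such as X'7D' in X'457D' is never matched.
 */
long find_single_byte(const unsigned char *s, size_t len, unsigned char target)
{
    assert(s != NULL);
    int in_dbcs = 0;   /* 1 while between Shift Out and Shift In */
    size_t i = 0;

    while (i < len) {
        if (s[i] == SHIFT_OUT) {
            in_dbcs = 1;
            i += 1;
        } else if (s[i] == SHIFT_IN) {
            in_dbcs = 0;
            i += 1;
        } else if (in_dbcs) {
            i += 2;    /* one double-byte character, handled as a unit */
        } else {
            if (s[i] == target) {
                return (long)i;
            }
            i += 1;
        }
    }
    return -1;
}
```

For the byte sequence X'0E457D0F7D', a plain byte search for X'7D' would stop at offset 2, inside the Kanji character, whereas this scan reports offset 4, the real single quotation mark.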
To aid in the parsing process, X/Open and ISO/IEC have defined a C data type called wchar_t to contain one (wide) character, regardless of its size, together with a series of C runtime library functions that convert between MBCS and wchar_t data streams and process wchar_t data streams. Because every wchar_t character has the same width (the width itself is implementation dependent), product developers can concentrate on working with individual wchar_t characters instead of worrying about individual bytes. Refer to the X/Open Portability Guide for more information about X/Open and the set of standardized wide-character C functions.
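As a rough illustration of the wide-character approach, the sketch below walks a multibyte string one character at a time with the standard C function `mbrtowc`. The function name `count_wide_chars` is hypothetical. The result depends on the encoding of the current locale, so a caller would normally select one first (for example with `setlocale(LC_ALL, "")`); which MBCS locales are available is system dependent and is not assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: counts the characters in a NUL-terminated
 * multibyte string, using the encoding of the current locale.
 * Returns (size_t)-1 if the string contains an invalid or
 * incomplete multibyte sequence.
 */
size_t count_wide_chars(const char *mbs)
{
    assert(mbs != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */

    const char *p = mbs;
    size_t remaining = strlen(mbs);
    size_t count = 0;

    while (remaining > 0) {
        wchar_t wc;                    /* one character, whatever its byte length */
        size_t n = mbrtowc(&wc, p, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2) {
            return (size_t)-1;         /* invalid or truncated sequence */
        }
        if (n == 0) {
            break;                     /* embedded NUL: stop counting */
        }
        p += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

The loop never inspects individual bytes itself; the library decides how many bytes make up each character, which is the point of the wchar_t interface.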
To parse an MBCS data stream correctly in a Windows environment, you can use the Win32 APIs CharNext and CharPrev.
Guideline G2-1
In a pure SBCS string, the number of bytes equals the number of characters, but in an MBCS string this equality does not hold. When calling a string function or communicating a string length to another product, ensure that the sender and the receiver use the same counting unit, converting one of the counts if necessary.
Example: Some database products classify stored character strings as one of two data types:
- CHARACTER for MBCS strings where the counting unit is bytes;
- GRAPHIC for pure DBCS strings where the counting unit is characters.
For the string X'ssdddds' where s represents a single-byte character and dd represents a double-byte character:
| Data type | Length |
|---|---|
| CHARACTER | 7 bytes |
| GRAPHIC | Cannot be classified as GRAPHIC because the string contains non-double-byte characters |
For the string X'dddd' where dd represents a double-byte character:
| Data type | Length |
|---|---|
| CHARACTER | 4 bytes |
| GRAPHIC | 2 characters |
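The difference between the two counting units can be checked with a small counter. This is a sketch under stated assumptions: the names `is_lead_byte` and `count_characters` are hypothetical, and the lead-byte ranges X'81' to X'9F' and X'E0' to X'FC' are the Shift-JIS-style ranges, assumed here to stand in for an IBM Japanese PC code page. A real product should take the ranges from the definition of the CCSID in use.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed lead-byte ranges for a Shift-JIS-style PC code page. */
static int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/*
 * Hypothetical helper: counts characters in a mixed single-byte and
 * double-byte string of `len` bytes, scanning from the first byte.
 * A lead byte with no byte after it is counted as one character.
 */
size_t count_characters(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    size_t chars = 0;
    size_t i = 0;

    while (i < len) {
        if (is_lead_byte(s[i]) && i + 1 < len) {
            i += 2;    /* double-byte character: two bytes, one character */
        } else {
            i += 1;    /* single-byte character */
        }
        chars++;
    }
    return chars;
}
```

For a string of the form ssdddds the byte length is 7 while this function returns 5, and for dddd the byte length is 4 while it returns 2. Passing the wrong one of these numbers to a receiver that expects the other unit is exactly the mismatch this guideline warns about.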