G6: Switching character interpretation
MBCS data streams differ from SBCS data streams in that the characters in an MBCS data stream can differ in width. Parsing and processing such a stream correctly therefore requires additional work. An application needs a mechanism to switch its character interpretation mode, that is, whether it assumes the data stream consists of single-byte or multibyte characters.
Support a mechanism to switch the character interpretation mode in order to correctly interpret each character.
The implementation of this mechanism varies among different platforms and applications.
Example: The IBM XL C/C++ compiler for AIX requires the user to specify the compiler option -qmbcs or -qdbcs to enable the recognition of multibyte characters in the source code. If this option is omitted and the default -qnombcs or -qnodbcs compiler option is in effect, the compiler will treat all multibyte character literals as individual single-byte literals.
Example: Browsers on a Japanese Windows platform assume that files are encoded in the platform's active CCSID. When the user browses an SBCS file encoded in CCSID 01252, spurious Japanese characters appear on the screen because some characters in CCSID 01252 have code points that the browser misinterprets as the first byte of a double-byte character. The second example in Guideline G2 - Recognizing multibyte characters illustrates this problem. To synchronize the character interpretation mode of the browser with the CCSID of the SBCS file, the user first needs to issue the MS-DOS CHCP command to switch the platform's active CCSID to 01252.
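The misinterpretation described in the browser example can be reproduced in a few lines. The following is a sketch under stated assumptions: Python's `cp1252` codec stands in for CCSID 01252, `cp932` stands in for the Japanese platform encoding, and the sample word is hypothetical.

```python
# Illustrative only: the sample word is an assumption, chosen so that each
# accented letter is followed by a byte that can serve as a trail byte.
original = "bénéfice"

# The file on disk: one byte per character in Windows-1252.
raw = original.encode("cp1252")

# Wrong mode: a Japanese double-byte interpretation treats 0xE9 as a lead
# byte and consumes the following ASCII letter as its trail byte.
# errors="replace" substitutes U+FFFD for any pair the codec cannot map.
misread = raw.decode("cp932", errors="replace")

# Right mode: the interpretation matches the file's actual encoding.
correct = raw.decode("cp1252")

print("bytes on disk:", len(raw))
print("misread code points:", [f"U+{ord(c):04X}" for c in misread])
print("correct code points:", [f"U+{ord(c):04X}" for c in correct])
```

The bytes are identical in both cases; only the interpretation mode changes, which is why the fix is to switch the mode rather than to alter the file.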
Example: Character encodings such as ISO-2022-JP, ISO-2022-CN and ISO-2022-KR use ISO 2022 mechanisms, namely escape sequences or ASCII shift characters, to declare which character set is in use at each point in the data stream. The character sets selected this way can use 1 byte or 2 bytes per character.
Example: Many IBM z/OS applications have their own mechanisms. Some examples are:
|CICS/BMS|SOSI=YES or PS=8|
One solution that satisfies this guideline is to convert the input data stream to wide character form (the wchar_t data type in C) before processing, because every wide character has the same width. Another solution is to use the -encoding option of the native2ascii Java tool to convert files that use other character encodings into files containing Latin-1 and/or Unicode-encoded (\uXXXX escaped) characters. The same -encoding option can also be used with the javac compiler to set the source file encoding name.