Guideline G: Introducing Asian ideographic scripts


G3: Manipulating MBCS data



Manipulating MBCS data stream

Some string manipulation functions operate on bytes rather than characters. When characters are treated as bytes, it is easy to inadvertently truncate a multibyte character, split a multibyte character into its individual bytes, or lose one half of a control code pair, such as the Shift In control code that closes a Shift Out.


Guideline G3


Never split a multibyte character into bytes.

There are several ways to repair the resulting string so that every character remains intact. If the string length is fixed, you can pad the string with neutral single-byte characters such as SPACE or NULL. Otherwise, you can shorten the resulting string.

Example: To extract a mixed single-byte and double-byte string starting at the second byte for six consecutive bytes on the IBM mainframe host computer, you can do the following:


Function:

Extract( "ssSOd1d2d3SIs", 2, 6 )

where:

- s is a single-byte character
- d1 to d3 are three double-byte characters
- SO is the Shift Out control character
- SI is the Shift In control character

Intermediate result:

sSOd1d2

Final result:

sSOd1SI_ or
sSOd1SI\0 or
sSOd1SI if the resultant string length is not fixed to 6 bytes

where:

- _ is the single-byte SPACE character
- \0 is the single-byte NULL character
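
The host example above can be sketched in C roughly as follows. This is a minimal illustration and not a definitive implementation: the function name host_extract and its parameters are hypothetical, and it assumes the usual EBCDIC values for Shift Out (0x0E) and Shift In (0x0F). The sketch writes SO only when a whole double-byte character follows it, always reserves one byte for the closing SI, and pads with a caller-supplied byte (pass a negative value to shorten the result instead).

```c
#include <stddef.h>

/* Assumption: EBCDIC shift codes for mixed SBCS/DBCS host data. */
enum { SHIFT_OUT = 0x0E, SHIFT_IN = 0x0F };

/*
 * Hypothetical helper: extract up to `len` bytes of the `n`-byte string
 * `src`, starting at 1-based byte position `start`, without splitting a
 * double-byte character and without leaving SO/SI unbalanced.
 * `dst` must have room for `len` bytes. If `pad` >= 0, the result is
 * filled up to `len` bytes with that byte value.
 * Returns the number of bytes written to `dst`.
 */
size_t host_extract(unsigned char *dst, const unsigned char *src, size_t n,
                    size_t start, size_t len, int pad)
{
    size_t first = (start > 0) ? start - 1 : 0;  /* 0-based window start */
    size_t end = first + len;                    /* one past window end  */
    size_t out = 0, i = 0;
    int src_dbcs = 0;  /* source is currently between SO and SI     */
    int out_dbcs = 0;  /* output has an SO that still needs an SI   */

    while (i < n && i < end) {
        unsigned char c = src[i];
        if (c == SHIFT_OUT) {
            src_dbcs = 1;  /* SO is emitted later, with the first DBCS char */
            i += 1;
        } else if (c == SHIFT_IN) {
            src_dbcs = 0;
            if (out_dbcs) { dst[out++] = SHIFT_IN; out_dbcs = 0; }
            i += 1;
        } else if (src_dbcs) {
            /* Room needed: optional SO + 2 data bytes + closing SI. */
            size_t need = out_dbcs ? 3 : 4;
            if (i + 2 > n || i + 2 > end) break;  /* would be split */
            if (i >= first) {
                if (out + need > len) break;
                if (!out_dbcs) { dst[out++] = SHIFT_OUT; out_dbcs = 1; }
                dst[out++] = src[i];
                dst[out++] = src[i + 1];
            }
            i += 2;
        } else {
            if (i >= first) {
                if (out + 1 > len) break;
                dst[out++] = c;
            }
            i += 1;
        }
    }
    if (out_dbcs) dst[out++] = SHIFT_IN;  /* restore the lost SI */
    if (pad >= 0)
        while (out < len) dst[out++] = (unsigned char)pad;
    return out;
}
```

Called with start 2, length 6, and the EBCDIC SPACE (0x40) as the pad byte on the string from the example, this sketch produces the six bytes s SO d1 SI SPACE. With a negative pad value it produces the five bytes s SO d1 SI.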

Example: To extract a mixed single-byte and double-byte string starting at the second byte for six consecutive bytes on the IBM PC, you can do the following:


Function:

Extract( "ssd1d2d3s", 2, 6 )

where:

- s is a single-byte character
- d1 to d3 are three double-byte characters

Intermediate result:

sd1d2d (the trailing d is the first byte of d3, which must be dropped, leaving sd1d2)

Final result:

sd1d2_ or
sd1d2\0 or
sd1d2 if the resultant string length is not fixed to 6 bytes

where:

- _ is the single-byte SPACE character
- \0 is the single-byte NULL character
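
The PC example above can be sketched in C along the following lines. This is only an illustration under stated assumptions: the names dbcs_extract and is_lead_byte are hypothetical, and the lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are those of a Shift-JIS style code page. Other DBCS code pages use different ranges, so the predicate would need to change accordingly.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: Shift-JIS style lead bytes; adjust for other code pages. */
static int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/*
 * Hypothetical helper: copy up to `len` bytes of the NUL-terminated
 * string `src`, starting at 1-based byte position `start`, without
 * splitting a double-byte character. A character that straddles either
 * edge of the window is dropped. If `pad` is nonzero, the result is
 * padded with SPACE to exactly `len` bytes.
 * `dst` must have room for `len` + 1 bytes.
 * Returns the number of bytes written, excluding the terminator.
 */
size_t dbcs_extract(char *dst, const char *src, size_t start, size_t len,
                    int pad)
{
    size_t n = strlen(src);
    size_t first = (start > 0) ? start - 1 : 0;  /* 0-based window start */
    size_t end = first + len;                    /* one past window end  */
    size_t out = 0, i = 0;

    /* Always scan from the start of the string: a byte can only be
       classified as lead or trail by walking character by character. */
    while (i < n && i < end) {
        size_t w = (is_lead_byte((unsigned char)src[i]) && i + 1 < n) ? 2 : 1;
        if (i + w > end)
            break;                 /* character would be split at the end */
        if (i >= first) {
            memcpy(dst + out, src + i, w);
            out += w;
        }
        i += w;
    }
    if (pad)
        while (out < len) dst[out++] = ' ';
    dst[out] = '\0';
    return out;
}
```

With start 2 and length 6 on the string from the example, the sketch drops the first byte of d3 and returns s d1 d2 followed by a SPACE when padding is requested, or the five bytes s d1 d2 when it is not.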

To aid the developer in manipulating MBCS strings, X/Open and ISO/IEC have defined a series of C runtime library string manipulation functions that process wchar_t strings, as opposed to the regular C string functions that process char (or byte) strings.
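
As a rough illustration of why wide-character functions help, the sketch below extracts by character position instead of byte position, using the standard wcslen and wmemcpy functions from <wchar.h>. The helper name wide_extract is hypothetical. It assumes the data has already been converted to wchar_t (for example with mbstowcs under an appropriate locale) and that each character fits in one wchar_t, which may not hold for every character on platforms with a 16-bit wchar_t.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: copy up to `count` characters of the wide string
 * `src`, starting at 1-based character position `start`.
 * `dst` must have room for `count` + 1 wide characters.
 * Returns the number of characters copied, excluding the terminator.
 */
size_t wide_extract(wchar_t *dst, const wchar_t *src, size_t start,
                    size_t count)
{
    size_t n = wcslen(src);                      /* length in characters */
    size_t first = (start > 0) ? start - 1 : 0;  /* 0-based start index  */

    if (first > n)
        first = n;
    if (count > n - first)
        count = n - first;                       /* clamp to the string  */
    wmemcpy(dst, src + first, count);
    dst[count] = L'\0';
    return count;
}
```

Because every element of a wchar_t string is a whole character under these assumptions, no padding or shortening step is needed to keep characters intact.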