Encoding rules

The definition of a coded character set follows the encoding rules of its associated encoding scheme. The rules specify the form and permitted values of the numbers or bit patterns that can be used, together with the rules for assigning characters to these bit patterns (see Figure 3).

Figure 3. [Diagram labels: code space; assignment rules; specifications for code points or bit patterns; encoding scheme; coded character set.]

The encoding scheme prescribes the information that enables you to parse textual data and extract the code point representing each individual character defined by the coded character set. You need this information to establish where the boundaries between individual characters fall in a text data stream. When you generate textual data, you have to produce the appropriate code point or bit pattern in the format prescribed by the encoding scheme associated with that character set. Any character string manipulation function, as opposed to a byte string manipulation function, must respect the encoding scheme definition behind the coded character set used to represent the text.
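The difference between byte string and character string manipulation can be illustrated with a minimal Python sketch. The helper names (cut_as_bytes, cut_as_characters, is_well_formed) are hypothetical and chosen only for this illustration, and UTF-8 is assumed as the encoding scheme.

```python
def cut_as_bytes(text, n, encoding="utf-8"):
    """Byte-string operation: keep the first n bytes of the encoded text."""
    return text.encode(encoding)[:n]


def cut_as_characters(text, n, encoding="utf-8"):
    """Character-string operation: keep the first n characters, then encode."""
    return text[:n].encode(encoding)


def is_well_formed(data, encoding="utf-8"):
    """Report whether data decodes cleanly under the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


word = "na\u00efve"  # U+00EF occupies two bytes in UTF-8

by_bytes = cut_as_bytes(word, 3)       # cuts through the two-byte character
by_chars = cut_as_characters(word, 3)  # stops on a character boundary

print(by_bytes, is_well_formed(by_bytes))
print(by_chars, is_well_formed(by_chars))
```

The byte-oriented cut ignores the encoding scheme and can leave a partial character at the end of the result, while the character-oriented cut always ends on a character boundary.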

If you are writing a character data converter, you have to know the encoding definitions of each of the coded character sets you are dealing with. Only then can you correctly parse the source byte strings and assemble the target byte strings without breaking character boundaries, while maintaining character integrity during conversion. Even if you use someone else's converter, an understanding of the encoding definitions helps you follow what the converter is doing and allocate the necessary input and output data buffer space accordingly. If the conversion results don't look correct, you will be able to investigate the problem with a better understanding of the mechanics involved.
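As a rough illustration of a converter that respects character boundaries, the following Python sketch converts a stream that arrives in arbitrary chunks. It relies on the incremental decoder and encoder classes from the standard codecs module; the function name convert_chunks and the choice of EUC-JP as source and UTF-8 as target are assumptions made for this example only.

```python
import codecs


def convert_chunks(chunks, source_encoding, target_encoding):
    """Convert an iterable of byte chunks from one encoding to another.

    The incremental decoder holds back any bytes that end in the middle
    of a character until the next chunk supplies the remainder.
    """
    decoder = codecs.getincrementaldecoder(source_encoding)()
    encoder = codecs.getincrementalencoder(target_encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield encoder.encode(text)
    # final=True raises an error if the input ended inside a character.
    tail = encoder.encode(decoder.decode(b"", final=True), final=True)
    if tail:
        yield tail


# Three kanji, two bytes each in EUC-JP.
source = "\u65e5\u672c\u8a9e".encode("euc_jp")

# Feed the converter one byte at a time, so that every second chunk
# boundary falls in the middle of a character.
one_byte_chunks = [source[i:i + 1] for i in range(len(source))]
target = b"".join(convert_chunks(one_byte_chunks, "euc_jp", "utf-8"))

# The target needs more space than the source: 6 bytes in, 9 bytes out.
print(len(source), len(target))
```

The size difference between source and target is the reason output buffers cannot simply be allocated at the size of the input; the growth factor depends on the pair of encoding schemes involved.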

Code points or bit patterns are non-negative integers. In the definitions for coded character sets that are in current use, the integer widths can be 7 bits (as in ASCII), 8 bits (as in single-byte EBCDIC or the ISO/IEC 8859 series), 16 bits (as in UCS-2 of ISO/IEC 10646), or 32 bits (as in UTF-32 of Unicode or UCS-4 of ISO/IEC 10646). Each encoding definition also specifies the allowed values for these numbers and the partitions reserved for graphic characters or for other uses. In the pre-digital computer era, these were patterns of punched holes on cards or paper tapes, and were defined using the equivalents of 5, 6 or 7 bits for each code position. The width selected is based both on technology and on the number of characters to be assigned code points.
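The effect of the chosen width can be seen by encoding one character under several schemes. This is a small Python sketch using codec names from the Python standard library (cp500 is one single-byte EBCDIC code page); the function name encoded_form is hypothetical. Note that Python stores the 7-bit ASCII code in an 8-bit byte.

```python
def encoded_form(character, encoding):
    """Return the bytes that represent one character in the given encoding."""
    return character.encode(encoding)


# Fixed-width encodings of the same character, from narrow to wide.
for name in ("ascii", "cp500", "latin-1", "utf-16-be", "utf-32-be"):
    data = encoded_form("A", name)
    print(f"{name:10} {len(data)} byte(s)  {data.hex()}")
```

The same character has a different numeric value in ASCII and EBCDIC, and the same value padded to a wider unit in the 16-bit and 32-bit forms.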

Some definitions, especially those for encoding the large character sets of the Far Eastern scripts, call for a mixture of two or more sets of numbers, each set having a different bit length. IBM's Host mixed EBCDIC encoding scheme uses a mixture of single and double 8-bit bytes with a Shift-Out/Shift-In locking switch. The Extended UNIX Code for Japan uses a collection of two sets of the single-byte kind and two sets of the double-byte kind, with single-shift controls to select the sets. The mixed single-byte and double-byte PC encodings take the first byte of a double-byte code point from a predefined range. The most recent Chinese standard, GB18030, extends the PC mixed single-byte and double-byte scheme to define 4-byte code points by reserving another range to signal the beginning of a 4-byte code point. Other definitions call for transformations into variable-length sequences of 8-bit bytes, such as UTF-8 of Unicode. UTF-16 of Unicode uses pairs of 16-bit values, called surrogate pairs, in addition to single 16-bit code points. ISO/IEC 2022 defines escape sequences to announce the type of bit patterns that follow the sequence.
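As one concrete instance of these mixed-width schemes, the UTF-16 surrogate pair mechanism can be sketched in Python as follows. The arithmetic follows the UTF-16 definition in the Unicode Standard; the function name to_utf16_units is hypothetical, and the sketch does not reject code points that fall inside the surrogate range itself.

```python
def to_utf16_units(code_point):
    """Return the 16-bit code units that represent one code point in UTF-16."""
    if code_point < 0x10000:
        # A single 16-bit unit is enough.
        return [code_point]
    if code_point > 0x10FFFF:
        raise ValueError("code point is outside the Unicode range")
    offset = code_point - 0x10000      # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # upper 10 bits go to the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits go to the low surrogate
    return [high, low]


print([hex(unit) for unit in to_utf16_units(0x0041)])
print([hex(unit) for unit in to_utf16_units(0x1F600)])
```

A parser that sees a value in the range D800 to DBFF therefore knows that the next 16-bit value belongs to the same character.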

Parsing coded character strings is straightforward only for the simpler cases of fixed-width coded character sets such as ASCII or the ISO/IEC 8859 series. For strings that use mixed-byte and variable-length bit patterns (including UTF-8 and the big-endian and little-endian forms of UTF-16, the more popular encoding forms of Unicode), the parsing logic has to be more intelligent to ensure that the character boundaries in the text are identified correctly.
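The kind of parsing logic needed for a variable-length form can be sketched for UTF-8, where the first byte of each sequence announces the sequence length. This Python example is a simplified illustration: the function names are hypothetical, and the check covers only lead-byte ranges and continuation bytes, not every ill-formed sequence that a production decoder must reject.

```python
def utf8_sequence_length(lead_byte):
    """Return the length of the UTF-8 sequence announced by its first byte."""
    if lead_byte < 0x80:
        return 1
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"byte 0x{lead_byte:02X} cannot start a character")


def split_utf8(data):
    """Split UTF-8 bytes into a list with one bytes object per character."""
    characters = []
    position = 0
    while position < len(data):
        length = utf8_sequence_length(data[position])
        sequence = data[position:position + length]
        # Every byte after the first must have the bit pattern 10xxxxxx.
        if len(sequence) < length or any(
            (byte & 0xC0) != 0x80 for byte in sequence[1:]
        ):
            raise ValueError(f"broken character at byte offset {position}")
        characters.append(sequence)
        position += length
    return characters


sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
print(split_utf8(sample))
```

For a fixed-width set the same function would reduce to cutting the data every n bytes, which is why the fixed-width cases are the simple ones.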