Character Data Representation Architecture

Method 2 for pure DBCS

This method has the following characteristics:

Figure 55. Method 2: Double-Byte to Double-Byte Conversion

To understand the machine-readable format of the double-byte tables, you must first understand the concept of a "ward". A ward is a section of a double-byte coded character set in which every code point has the same first byte; that shared first-byte value is the ward value. A ward is populated if the coded character set contains at least one character whose first byte is the ward value. Conversely, a ward is not populated if no character in the coded character set has that ward value as its first byte. Each of the Far East double-byte coded character sets contains many unpopulated wards.
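The ward concept can be illustrated with a short sketch. The function names and the sample code points below are hypothetical and chosen only for illustration; they are not part of CDRA.

```python
def ward_of(code_point):
    """Return the ward value: the first (high-order) byte of a double-byte code point."""
    return (code_point >> 8) & 0xFF


def populated_wards(code_points):
    """Return the sorted ward values that contain at least one of the given code points."""
    return sorted({ward_of(cp) for cp in code_points})


# Hypothetical sample of code points, not taken from any real coded character set.
sample = [0x41C1, 0x41C2, 0x4341, 0x43C4]
wards = populated_wards(sample)

# Every ward value from X'00' to X'FF' that is absent from the list is unpopulated.
unpopulated_count = 256 - len(wards)
```

In this sample only wards X'41' and X'43' are populated, so the remaining 254 ward values are unpopulated.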

The CDRA machine-readable format of the double-byte conversion tables is a structure composed of 512-byte records: a subtable pointer record, a substitution character record, and one record for each populated ward. The subtable pointer record is the first record in the structure, and it is used as the index into the other records of the structure.

Only the first 256 bytes of the subtable pointer record contain information; the second 256 bytes contain zeros. Each of the 256 assigned bytes corresponds to one possible value of the first byte (the ward number) of an input code point, and the byte value stored at that location is a pointer to the subsequent record in the structure that contains the output code points for that ward. Each subsequent record contains information in all 512 bytes, in the form of 256 double-byte code point values. To select the appropriate record, the first byte of the input code point is used as an index into the subtable pointer record, and the value obtained there points to the required record. The second byte of the input code point is then multiplied by two to calculate the offset into the selected record. Each output code point is two bytes long and begins at offset n into the record, where n is 2 times the value of the second byte of the input code point.
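The lookup described above can be sketched as a small function. Two details are assumptions made for this sketch and are not stated in the text: that a pointer value p selects the record starting at byte offset p times 512 (counting the subtable pointer record as record 0), and that each output code point is stored with its high-order byte first. The name convert is hypothetical.

```python
RECORD_SIZE = 512  # every record in the structure is 512 bytes long


def convert(table, code_point):
    """Map one double-byte input code point to its output code point.

    `table` holds the whole structure as bytes, with the subtable
    pointer record first.
    """
    ward = (code_point >> 8) & 0xFF   # first byte of the input code point
    second = code_point & 0xFF        # second byte of the input code point

    # Index the subtable pointer record with the ward number.
    pointer = table[ward]

    # Assumption: pointer value p selects the record at byte offset p * 512.
    # The offset inside that record is n = 2 * second byte.
    start = pointer * RECORD_SIZE + 2 * second

    # Assumption: the two output bytes are stored high-order byte first.
    return int.from_bytes(table[start:start + 2], "big")
```

Because every ward has an entry in the subtable pointer record, the function needs no range checks beyond requiring a 16-bit input value.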

In Figure 55, the conversion process may be described as follows:

  1. Input code value X'41C1'
  2. Use X'41' as an index into the subtable pointer record
  3. Retrieve record pointer "pp"
  4. Use the second input byte, X'C1', multiplied by two, as the offset into record "pp" to retrieve the output value X'43C4'.

Thus, the input code point X'41C1' maps to the output code point X'43C4'.
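The four steps can be traced on a hand-built table. The figure does not give a numeric value for "pp", so the value 2 used here is hypothetical, as are the assumed record placement (pointer value times 512) and the high-order-byte-first storage of the output value.

```python
RECORD_SIZE = 512

# Hypothetical stand-in for the record pointer "pp" shown in Figure 55.
PP = 2

# Record 0: the subtable pointer record. Only the entry for ward X'41' is set.
pointer_record = bytearray(RECORD_SIZE)
pointer_record[0x41] = PP

# Record 1: zero-filled placeholder so that record "pp" lands at index 2.
record_1 = bytearray(RECORD_SIZE)

# Record "pp": 256 two-byte entries; only the entry for second byte X'C1' is set.
record_pp = bytearray(RECORD_SIZE)
record_pp[2 * 0xC1:2 * 0xC1 + 2] = bytes([0x43, 0xC4])

table = bytes(pointer_record + record_1 + record_pp)

# Step 1: input code value X'41C1'.
first, second = 0x41, 0xC1

# Steps 2 and 3: index the subtable pointer record and retrieve "pp".
pp = table[first]

# Step 4: the second byte, times two, is the offset into record "pp".
offset = pp * RECORD_SIZE + 2 * second
output = table[offset:offset + 2]

print(f"X'{first:02X}{second:02X}' -> X'{output.hex().upper()}'")
```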

The third record type in the structure is the substitution record. It is a 512-byte record constructed from 256 two-byte values, all of which are the code point value for the defined Substitute character for the target coded character set. The value retrieved from the subtable pointer record for each unpopulated ward will point to this substitution record.
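A minimal sketch of how such a structure might be assembled shows the role of the substitution record. The record order used here (pointer record, then substitution record, then one record per populated ward), the high-order-byte-first storage, the filling of unmapped positions in a populated ward with the substitute value, and the substitute value itself are all assumptions made for illustration. The name build_table is hypothetical.

```python
RECORD_SIZE = 512


def build_table(mapping, substitute):
    """Assemble pointer record, substitution record, and one record per populated ward.

    `mapping` is a dict of input code point -> output code point.
    """
    # 256 copies of the two-byte substitute code point fill exactly 512 bytes.
    substitution_record = substitute.to_bytes(2, "big") * 256

    # Assumed layout: record 0 = pointers, record 1 = substitution record.
    # Every ward initially points at the substitution record; the second
    # 256 bytes of the pointer record stay zero.
    pointer_record = bytearray(RECORD_SIZE)
    pointer_record[:256] = bytes([1]) * 256

    records = [substitution_record]
    wards = sorted({cp >> 8 for cp in mapping})  # populated wards only
    for index, ward in enumerate(wards, start=2):
        pointer_record[ward] = index
        # Assumption: unmapped positions inside a populated ward also
        # hold the substitute value.
        record = bytearray(substitution_record)
        for cp, out in mapping.items():
            if cp >> 8 == ward:
                low = cp & 0xFF
                record[2 * low:2 * low + 2] = out.to_bytes(2, "big")
        records.append(bytes(record))

    return bytes(pointer_record) + b"".join(records)


SUBSTITUTE = 0xFEFE  # placeholder value for illustration only
table = build_table({0x41C1: 0x43C4}, SUBSTITUTE)
```

With a single populated ward the structure is three records long, and all 255 unpopulated wards share the one substitution record instead of each needing a record of its own.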
