Character Data Representation Architecture

EUC and 2022 TCP/IP conversion tables

The Extended Unix Code (EUC) conversion tables are used to convert EUC encoded graphic character data from an EUC platform to or from a host or PC platform. The ISO 2022 TCP/IP (TCP) conversion tables are used to convert encoded graphic characters from the specific ISO 2022 format used by TCP/IP to or from a host or PC platform.

The reader is assumed to have some knowledge of the code extension techniques defined in the ISO 2022 standard. Both EUC and 2022 TCP/IP data streams make use of these techniques.

The conversion tables are constructed to preserve as much character integrity as possible after data conversion by matching GCGIDs between the source and target encodings. Both schemes can define up to four character set and code page pairs, which improves character matching between source and target CCSIDs.

The EUC and TCP conversion tables use a normalized form of data: both the input passed to the conversion tables and the output they generate are normalized. PC code points are normalized by placing a leading X'00' in front of each single-byte code point to yield a two-byte form. Host (EBCDIC) data must have the SO and SI control characters deleted during normalization and reinserted during denormalization. As with PC data, a leading X'00' is inserted in front of each single-byte code point. EUC and TCP code points are normalized to four-byte values.
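The two-byte normalization of PC and host data described above can be sketched as follows. This is a minimal illustration, not a CDRA-supplied service: the function names are hypothetical, the lead-byte test for PC data is passed in by the caller because it depends on the code page, and the denormalizer assumes that no double-byte host code point begins with X'00'.

```python
SO = 0x0E  # EBCDIC shift-out: a double-byte run starts
SI = 0x0F  # EBCDIC shift-in: back to single-byte data


def normalize_pc(data: bytes, is_lead_byte) -> list[bytes]:
    """Two-byte form of mixed PC data; is_lead_byte is code page specific."""
    out, i = [], 0
    while i < len(data):
        if is_lead_byte(data[i]):
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte code point")
            out.append(data[i:i + 2])
            i += 2
        else:
            out.append(bytes([0x00, data[i]]))  # leading X'00' for single bytes
            i += 1
    return out


def normalize_host(data: bytes) -> list[bytes]:
    """Two-byte form of mixed EBCDIC data with SO and SI deleted."""
    out, i, double = [], 0, False
    while i < len(data):
        b = data[i]
        if not double and b == SO:
            double, i = True, i + 1
        elif double and b == SI:
            double, i = False, i + 1
        elif double:
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte code point")
            out.append(data[i:i + 2])
            i += 2
        else:
            out.append(bytes([0x00, b]))
            i += 1
    return out


def denormalize_host(points: list[bytes]) -> bytes:
    """Reinsert SO and SI around runs of double-byte code points."""
    out, double = bytearray(), False
    for p in points:
        is_double = p[0] != 0x00  # assumption: DBCS first byte is never X'00'
        if is_double and not double:
            out.append(SO)
        elif not is_double and double:
            out.append(SI)
        double = is_double
        out += p if is_double else p[1:]
    if double:
        out.append(SI)  # close a double-byte run left open at end of data
    return bytes(out)
```

Keeping the shift state in the normalizer rather than in the table is what lets each table entry stay a fixed-width value.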

Figure 60. EUC and ISO 2022 TCP Conversion

Figure 60 shows the general use of the EUC and TCP conversion tables. The input byte or bytes, up to a maximum of four bytes per code point, are first normalized and used as input to the conversion table. The output (again, four bytes per code point maximum) from the conversion table is also normalized data, which must be denormalized prior to subsequent processing.

Normalization and denormalization services are not part of the CDRA-supplied conversion tables.

EUC conversions

The EUC encoding technique uses up to four coded graphic character sets. Each must be predefined, as the information is not carried in the text data stream. In CDRA, the CCSID determines the group of coded graphic character sets being used. Code points from the left half of the 8-bit encoding space (the high-order bit is OFF) are in the set G0. Code points that lie in the right half of the encoding space (the high-order bit is ON) are in the set G1. The single-shift control characters, called SS2 and SS3, are used to invoke the other sets G2 and G3.

EUC conversions require that the EUC input or output contain not only the character to be converted but also the shift control character, when applicable. All input EUC data must have its code point values padded with leading zeros to create a fixed-length, normalized, four-byte encoding. For example, a G3 code point whose character occupies two bytes is passed to the conversion tables as four bytes: X'00', the SS3 control character, and the two bytes of the character.

This means that when dealing with EUC data, the parser must recognize which G-set each character belongs to in order to build the correct normalized input for the conversion process. When converting from EUC, the denormalizing process must strip off the leading zeros and concatenate the converted characters to build the correct output string.
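The G-set recognition and zero padding described above can be sketched as follows. The single-shift values SS2 = X'8E' and SS3 = X'8F' are the ones used by EUC; the per-set character widths are an assumption modeled on a Japanese EUC layout and would in practice be determined by the CCSID. The function names are hypothetical, and C1 controls other than SS2 and SS3 are not treated specially here.

```python
SS2 = 0x8E  # single shift 2: next character comes from G2
SS3 = 0x8F  # single shift 3: next character comes from G3

# Assumed bytes per character in each G-set (Japanese EUC style layout).
WIDTHS = {"G0": 1, "G1": 2, "G2": 1, "G3": 2}


def normalize_euc(data: bytes, widths=WIDTHS) -> list[bytes]:
    """Split an EUC byte string into four-byte, zero-padded code points."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b == SS2:
            n = 1 + widths["G2"]   # the shift character is kept in the value
        elif b == SS3:
            n = 1 + widths["G3"]
        elif b & 0x80:
            n = widths["G1"]       # high-order bit ON: right half, G1
        else:
            n = widths["G0"]       # high-order bit OFF: left half, G0
        chunk = data[i:i + n]
        if len(chunk) < n:
            raise ValueError("truncated EUC code point")
        out.append(chunk.rjust(4, b"\x00"))  # pad with leading zeros
        i += n
    return out


def denormalize_euc(points: list[bytes]) -> bytes:
    """Strip the leading zeros and concatenate the code points."""
    # A value of all zeros is the NUL character, so keep one byte of it.
    return b"".join(p.lstrip(b"\x00") or b"\x00" for p in points)
```

For instance, under these assumed widths the input X'8FB0A1' becomes the single normalized value X'008FB0A1'.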

2022 TCP/IP conversions

Conversion tables for TCP/IP are very similar to those for EUC, except that only one G-set, G0, is used. Switching from one coded graphic character set to another cannot be accomplished with the EUC single-shift technique. Instead, an explicit escape sequence designates the new coded graphic character set into G0. The escape sequence itself cannot be carried in the conversion table entries, so the high-order byte of the TCP/IP code point value is set to the position of the CGCSID within the CCSID. For example, CCSID 956 is defined for Japanese TCP/IP and contains four CGCSIDs corresponding to JIS X 0201 Roman, JIS X 0208-1983, JIS X 0201 Katakana, and JIS X 0212. For Japanese host to TCP/IP conversion, a code point value from the JIS X 0201 Katakana set, the third CGCSID, would contain X'03' in the high-order byte.
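A minimal sketch of this tagging follows. It scans a 2022 TCP/IP byte string, tracks the set currently designated into G0, and places the position of that set in the high-order byte of each four-byte value. Several details are assumptions made for illustration: the function name, the designation list (the ISO 2022 escape sequences registered for these four sets, which a real table may or may not use), the initial designation of set 1, the zero padding of the middle bytes, and the choice to give control characters X'00' in the high-order byte.

```python
ESC = 0x1B

# Hypothetical designation list for a CCSID shaped like 956:
# (escape sequence, bytes per character). List position + 1 is the id.
DESIGNATIONS = [
    (b"\x1b(J", 1),   # JIS X 0201 Roman
    (b"\x1b$B", 2),   # JIS X 0208-1983
    (b"\x1b(I", 1),   # JIS X 0201 Katakana
    (b"\x1b$(D", 2),  # JIS X 0212
]


def normalize_tcp(data: bytes, designations=DESIGNATIONS) -> list[bytes]:
    """Return four-byte values whose first byte identifies the G0 set."""
    out, i = [], 0
    current = 1  # assumption: the first set is designated at the start
    while i < len(data):
        if data[i] == ESC:
            for ident, (seq, _) in enumerate(designations, start=1):
                if data.startswith(seq, i):
                    current = ident   # the escape sequence is consumed,
                    i += len(seq)     # not passed to the conversion table
                    break
            else:
                raise ValueError(f"unknown escape sequence at offset {i}")
            continue
        if data[i] < 0x20:
            # Control character: not tied to any designated graphic set.
            out.append(bytes([0x00, 0x00, 0x00, data[i]]))
            i += 1
            continue
        width = designations[current - 1][1]
        chunk = data[i:i + width]
        if len(chunk) < width:
            raise ValueError("truncated code point")
        out.append(bytes([current]) + chunk.rjust(3, b"\x00"))
        i += width
    return out
```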

In ISO 2022, control characters are not part of the coded graphic character set; therefore, loading a new coded character set into G0 does not affect the set of control characters in C0. It thus does not make sense for the high-order byte of a control character to indicate a specific coded graphic character set. Control characters are therefore normalized as follows:

  1. For Host or PC to 2022 TCP/IP, the converted value will have X'00' in the high-order byte. This will indicate to the denormalization routine that it does not matter what the currently designated coded graphic character set is -- the control character may simply be placed directly into the output data stream.
  2. For 2022 TCP/IP to Host or PC, the input normalized value may have X'00' as the high-order byte, or it may contain any valid value for the table. The normalization routine can therefore either recognize the value as a control character and place an X'00' in the high-order byte, or use the currently active value established by the previous escape sequence.
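The first rule can be sketched as a denormalization routine for the Host or PC to 2022 TCP/IP direction. It emits an escape sequence only when the high-order byte changes and passes control characters straight through without touching the current designation. The function name, the escape sequence table, the initial designation of set 1, and the position of the control character in the low-order byte are assumptions made for illustration.

```python
# Hypothetical mapping: high-order byte id -> (escape sequence, bytes per character).
ESCAPES = {
    1: (b"\x1b(J", 1),   # JIS X 0201 Roman
    2: (b"\x1b$B", 2),   # JIS X 0208-1983
    3: (b"\x1b(I", 1),   # JIS X 0201 Katakana
    4: (b"\x1b$(D", 2),  # JIS X 0212
}


def denormalize_tcp(points, escapes=ESCAPES, initial=1) -> bytes:
    """Build a 2022 TCP/IP byte string from four-byte normalized values."""
    out = bytearray()
    current = initial  # assumption: set 1 is designated at the start
    for p in points:
        ident = p[0]
        if ident == 0x00:
            # Control character: goes directly into the output, whatever
            # graphic set is currently designated.
            out.append(p[3])
            continue
        seq, width = escapes[ident]
        if ident != current:
            out += seq          # designate the new set into G0
            current = ident
        out += p[4 - width:]    # drop the id byte and the zero padding
    return bytes(out)
```

Because a control character never changes `current`, two characters from the same set separated by a line break need only one escape sequence.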

Although the identifiers coded in the high-order bytes correspond to the position of the coded graphic character set within the CCSID, each table also contains the set of ESC sequences used to designate each coded graphic character set into G0. This mapping is contained in the first record of the conversion table in the following format:

L1 ESC1 L2 ESC2 ... Ln ESCn

where Li is a one-byte unsigned field containing the length of the following ESC sequence, and ESCi is the ESC sequence associated with the high-order byte id "i" in the table.
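Reading this record can be sketched as below. The function name is hypothetical, and the routine assumes that the bytes passed in are the first record only and that a zero length byte marks trailing padding; the source does not say how unused space in the record is filled.

```python
def parse_escape_record(record: bytes) -> dict[int, bytes]:
    """Map each high-order byte id (1, 2, ...) to its ESC sequence."""
    table = {}
    i, ident = 0, 1
    while i < len(record):
        length = record[i]              # Li: one-byte unsigned length
        if length == 0:
            break                       # assumption: zero length is padding
        seq = record[i + 1:i + 1 + length]
        if len(seq) < length:
            raise ValueError("truncated ESC sequence in record")
        table[ident] = seq              # ESCi belongs to id "i"
        ident += 1
        i += 1 + length
    return table
```

A denormalization routine could use the resulting mapping to look up the sequence to emit whenever the high-order byte of the converted value changes.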
