Conversion tables alone do not ensure the transfer or sharing of data objects between different computing environments: the proper selection and use of these tables is essential. Conversion methods, as described in the following sections, are used with the tables found in the CDRA Conversion Resources to ensure that the desired results are obtained. As with the selection of a table, the conversion method that is best for one application may not be appropriate for another.
It is the responsibility of the person designing the conversion method to respect the characteristics and requirements of the input and output data. An appropriate method can be selected based on the encoding schemes (ESs) and string types (STs) of the input and output data. The conversion method models described in the following sections are specifically for coded graphic character strings whose semantics follow the respective ES definitions for the character encodings. Enhancements needed to deal with the following string types are also briefly described:
- Input null-terminated
- Output null-terminated
- Output SPACE-padded.
Conversion methods are not supplied by CDRA, but are described here in the context of use with the conversion tables created and supplied by CDRA.
Figure 54. Use of Conversion Methods
Figure 54 shows the use of the conversion methods and tables within the overall conversion process. The conversion method first parses the input data string and, if necessary, performs any required substring operation. A substring operation may be required if the input data string contains embedded code-extension controls, such as SO/SI controls in EBCDIC-mixed SBCS/DBCS data. The rules for parsing the specified string type should also be followed. Each resulting substring should contain code points that possess similar characteristics; that is, they all belong to the same identified CS, CP pair. Each substring is converted from input code points to output code points using the appropriate conversion table. This table selection is based on the characteristics of the input data and the desired characteristics of the output data, including the CS, CP pairs and ESs. Finally, the conversion method assembles the resulting output substrings into the final output string. This process should include the insertion of any code extension control characters that are required by the output ES. Rules for assembling the specified output string type (ST) should also be followed.
Method 1 for SBCS
This method has the following characteristics:
- It is used for conversions between two pure single-byte CCSIDs.
- The valid encoding schemes for the input and output data are X'1100', X'2100', X'3100', X'4100', X'4105', X'4155', X'6100' and X'8100'; this method can also be used with ES X'5100' and X'5150' (single-byte 7-bit code) with considerations for the 7-bit limit.
- The conversion table selected by this method will be a single-byte code point to single-byte code point table from the input CS, CP pair to the output CS, CP pair (known as a TYPE 1 table). Figure 69 shows a model for a TYPE 1 conversion table.
- The contents of the table will reflect the criterion used for mismatch management.
- All control characters are treated as pure single-byte controls, and are mapped according to the mismatch management criterion.
- Handling of control function sequences is beyond the scope of this method.
The machine-readable format of the single-byte to single-byte conversion table is a file containing a single 256-byte record. This allows for 256 single-byte output values. Each byte position in the record corresponds to one input code point, X'00' through X'FF'. The byte value found at the position corresponding to the input code point value is the output code point.
In the example shown in Figure 69, to find the output code point for the input code point X'53', we convert it to a decimal value of 83 and look at offset 83 in the record. (The first position is offset 0.) The value that we find at offset 83 is the corresponding output code point. In this example it is X'67'.
These tables are the standard distribution and storage format of CDRA.
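The offset lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not a CDRA-supplied routine: the function name `convert_sbcs` and the table contents are assumptions made for the example, with only the X'53' to X'67' entry taken from the text.

```python
def convert_sbcs(data: bytes, table: bytes) -> bytes:
    """Convert single-byte data with a 256-byte TYPE 1 table."""
    if len(table) != 256:
        raise ValueError("a TYPE 1 table must be exactly 256 bytes")
    # bytes.translate replaces each byte b with table[b], which is the
    # offset lookup described above (the first position is offset 0).
    return data.translate(table)


# Hypothetical table for illustration: identity mapping, except that
# input X'53' maps to output X'67' as in the Figure 69 example.
example_table = bytearray(range(256))
example_table[0x53] = 0x67
example_table = bytes(example_table)

print(convert_sbcs(b"\x53", example_table).hex())
```

Because the whole table is one 256-byte record, it can be passed to the lookup unchanged once it has been read from its file.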
Method 2 for Pure DBCS
This method has the following characteristics:
- It is used for conversions between two pure double-byte CCSIDs.
- The valid encoding schemes for the input and output data are X'1200', X'2200', X'3200', X'5200', X'6200', X'7200', X'8200' and X'9200'.
- The conversion table selected by this method will be a double-byte code point to double-byte code point table from the input CS, CP pair to the output CS, CP pair (known as a TYPE 2 table). Figure 55 shows a model for a TYPE 2 conversion table.
- Most double-byte codes do not use all of the available first bytes as valid ward numbers. Thus, the tables are organized as several subtables, each containing 256 double-byte code point entries.
- The contents of the table will reflect the criterion used for mismatch management.
- Any fragmented double-bytes or first-bytes that do not have an entry in the conversion table are treated as errors in the data.
Figure 55. Method 2: Double-Byte to Double-Byte Conversion
In order to understand the machine-readable format of the double-byte tables, you must first understand the concept of a "ward". A ward is a section of a double-byte coded character set in which the first byte of every code point has the same value. A ward is populated if there are any characters in the double-byte coded character set whose first byte is the ward value. Conversely, a ward is not populated if there are no characters in the double-byte coded character set that have that ward value as the first byte. Each of the Far East double-byte coded character sets contains many unpopulated wards.
The CDRA machine-readable format of the double-byte conversion tables is a structure composed of many 512-byte records: a subtable pointer record, a substitution character record, and one record for each populated ward. The subtable record is the first record in the structure, and it is used as the index into the other character records of the structure.
The first 256 bytes of the subtable pointer record contain information, whereas the second 256 bytes contain zeros. Each of the 256 assigned bytes corresponds to the first byte (the ward number) of an input code point. The byte values found in these locations are pointers to subsequent records in the structure that contain the output code points. Each of the subsequent records in the structure contains information in all 512 bytes in the form of 256 double-byte code point values. The appropriate subsequent record is selected from the structure using the value of the first byte of the input code point as an index into the subtable pointer record. The value obtained from the subtable pointer record points to the subsequent record required. The second byte of the input code point is then multiplied by two to calculate the correct offset into the selected record. Each output code point is two bytes in length, beginning at offset n into the record, where n is 2 times the value of the second byte of the input code point.
In Figure 55, the conversion process may be described as follows:
- Input code value X'41C1'
- Use X'41' as an index into the subtable pointer record
- Retrieve record pointer "pp"
- Use the second input byte, X'C1', as an index into record "pp" to retrieve the output value X'43C4'.
Thus, the input code point X'41C1' maps to the output code point of X'43C4'.
The third record type in the structure is the substitution record. It is a 512-byte record constructed from 256 two-byte values, all of which are the code point value for the defined Substitute character for the target coded character set. The value retrieved from the subtable pointer record for each unpopulated ward will point to this substitution record.
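The two-step lookup can be sketched as follows. This is an illustration under stated assumptions, not the CDRA implementation: it assumes that a pointer value is a record number counted from the start of the structure (the subtable pointer record being record 0), that the substitution record is record 1, and that two-byte values are stored first byte first. The helper `build_type2_table` and the substitute value X'FEFE' used in the example are hypothetical.

```python
RECORD_SIZE = 512


def build_type2_table(wards: dict[int, dict[int, int]], sub: int) -> bytes:
    """Build a toy TYPE 2 table (hypothetical helper, for illustration only).

    wards maps a first byte (ward) to {second byte: output code point}.
    """
    sub_entry = sub.to_bytes(2, "big")
    pointer = bytearray(RECORD_SIZE)        # second 256 bytes stay zero
    pointer[0:256] = bytes([1]) * 256       # unpopulated wards -> record 1
    records = [sub_entry * 256]             # record 1: substitution record
    for ward, mapping in wards.items():
        record = bytearray(sub_entry * 256)
        for second, out in mapping.items():
            record[2 * second:2 * second + 2] = out.to_bytes(2, "big")
        records.append(bytes(record))
        pointer[ward] = len(records)        # record number of this ward
    return bytes(pointer) + b"".join(records)


def convert_dbcs(data: bytes, table: bytes) -> bytes:
    """Convert pure double-byte data with a TYPE 2 table."""
    if len(data) % 2:
        raise ValueError("fragmented double-byte code point")
    out = bytearray()
    for i in range(0, len(data), 2):
        first, second = data[i], data[i + 1]
        record_no = table[first]                    # subtable pointer record
        start = record_no * RECORD_SIZE + 2 * second
        out += table[start:start + 2]
    return bytes(out)


toy = build_type2_table({0x41: {0xC1: 0x43C4}}, sub=0xFEFE)
print(convert_dbcs(b"\x41\xC1", toy).hex())
```

With this toy table, a code point in an unpopulated ward resolves through the substitution record, so no special case is needed in the lookup itself.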
Method 3 for EBCDIC Mixed to PC Mixed
This method has the following characteristics:
- It is used for conversion between an input EBCDIC mixed CCSID (with SO-SI code extension controls) and a PC mixed CCSID.
- The valid encoding scheme for the input data is X'1301', and for the output data the encoding schemes are X'2300', X'2305' and X'3300'.
- The input parser separates the double-byte strings contained within the SO-SI pairs from the single-byte substrings. The SO-SI pair is discarded from the input string. The single-byte and the double-byte substrings are converted separately (shown in Figure 56).
- The input single-byte substrings are converted to corresponding output single-byte substrings using the appropriate Type 1 table (shown in Figure 69).
- The input double-byte substrings are converted to corresponding output double-byte substrings using the appropriate Type 2 table (shown in Figure 55).
- The output generator concatenates the converted substrings in the same order as their corresponding input substrings.
- The contents of the conversion tables used govern the accuracy of the output data.
- Handling of the single-byte controls within the input double-byte substrings is beyond the scope of this method.
- The removal of the SO-SI code extension controls generally results in an output string that is shorter in length than the corresponding input string.
Figure 56. Method 3: Host Mixed Single/Double-Byte to PC Mixed Single/Double-Byte
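The parsing and reassembly steps of Method 3 can be sketched as below. This is a minimal sketch: the function names are assumptions, the two converters are passed in as callables standing in for the Type 1 and Type 2 table lookups, and single-byte controls inside double-byte substrings are not handled, as noted above. SO and SI are taken to be X'0E' and X'0F'.

```python
SO, SI = 0x0E, 0x0F   # shift-out / shift-in code extension controls


def split_ebcdic_mixed(data: bytes) -> list[tuple[str, bytes]]:
    """Split mixed data into ('sbcs', ...) and ('dbcs', ...) substrings.

    The SO-SI pairs themselves are discarded.
    """
    parts = []
    i, n = 0, len(data)
    while i < n:
        if data[i] == SO:
            end = data.find(bytes([SI]), i + 1)
            if end == -1:
                raise ValueError("SO without matching SI")
            dbcs = data[i + 1:end]
            if len(dbcs) % 2:
                raise ValueError("fragmented double-byte substring")
            if dbcs:
                parts.append(("dbcs", dbcs))
            i = end + 1
        else:
            end = data.find(bytes([SO]), i)
            if end == -1:
                end = n
            parts.append(("sbcs", data[i:end]))
            i = end
    return parts


def convert_mixed(data: bytes, convert_sbcs, convert_dbcs) -> bytes:
    """Convert each substring separately, then concatenate in input order."""
    return b"".join(
        (convert_dbcs if kind == "dbcs" else convert_sbcs)(chunk)
        for kind, chunk in split_ebcdic_mixed(data)
    )


sample = b"\xC1\x0E\x41\xC1\x0F\xC2"
print(split_ebcdic_mixed(sample))
```

Passing identity functions for both converters shows the length effect mentioned above: the six-byte sample yields a four-byte result because the SO-SI pair is dropped.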
Method 4 for PC Mixed to EBCDIC Mixed
This method has the following characteristics:
Method 5 for Single-byte to Double-byte
This method has the following characteristics:
- It is used for conversion between an input single-byte and an output double-byte CCSID.
- The valid encoding schemes for the input data are X'1100', X'2100', X'3100', X'4100', X'4105', X'4155' and X'6100', and for the output data are X'1200', X'2200', X'3200', X'5200', X'6200', X'7200', X'8200' and X'9200' (21).
- It uses a TYPE 4 conversion table (see Figure 58) consisting of one 512-byte record (this allows for 256 double-byte output values).
- The possible input, single-byte code point values are in the range X'00' through X'FF'.
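- The input code point is used as an index into the conversion table. Because each entry is two bytes long, the entry begins at byte offset 2 times the input code point value. The 2-byte entry beginning at this location is the actual output double-byte code point.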
- The following steps are taken in order to convert a single-byte X'40' to a UCS-2 (double-byte) value, as shown in Figure 58:
  - The input value of X'40' is used as an index into the conversion table.
  - The X'40'th entry is found (remembering that each entry in the conversion table is two bytes long).
  - The two bytes found at this location comprise the output double-byte code point value.
- The resultant output string will be twice as long as the input string (each single-byte is converted to a double-byte).
- The content of the conversion table used governs the accuracy of the output data.
Figure 58. Method 5: SBCS to UCS Conversion Table
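The TYPE 4 lookup amounts to indexing a 512-byte record with a stride of two bytes. The sketch below illustrates this; the function name and the table contents are assumptions (the toy table maps X'40' to U+0020 and fills every other entry with U+FFFD), and two-byte values are assumed to be stored first byte first.

```python
def convert_sbcs_to_dbcs(data: bytes, table: bytes) -> bytes:
    """Convert single-byte data with a 512-byte TYPE 4 table."""
    if len(table) != 512:
        raise ValueError("a TYPE 4 table must be exactly 512 bytes")
    out = bytearray()
    for code_point in data:
        offset = 2 * code_point          # each entry is two bytes long
        out += table[offset:offset + 2]
    return bytes(out)


# Hypothetical table: every entry is U+FFFD except X'40', which maps to
# U+0020 (EBCDIC X'40' is the space character).
toy_table = bytearray(b"\xFF\xFD" * 256)
toy_table[2 * 0x40:2 * 0x40 + 2] = b"\x00\x20"
toy_table = bytes(toy_table)

print(convert_sbcs_to_dbcs(b"\x40\x40", toy_table).hex())
```

The output is always twice the length of the input, which matches the characteristic listed above.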
Method 6 for Double-byte to Single-byte
This method has the following characteristics:
- It is used for conversion between an input double-byte and an output single-byte CCSID.
- The valid encoding schemes for the input data are X'1200', X'2200', X'3200', X'5200', X'6200', X'7200', X'8200' and X'9200' (21) and for the output data are X'1100', X'2100', X'3100', X'4100', X'4105', X'4155' and X'6100'.
- It uses a TYPE 5 conversion table (see Figure 59) consisting of:
  - A 256-byte subtable pointer record
  - A pool of 256-byte subtables.
- The method takes each input double-byte code point and separates it into a first and second byte.
- The first byte is used as an offset into the subtable pointer record.
- The value found at this location "points" to the appropriate record in the pool of subtables.
- The second byte is then used as an offset into the selected subtable record.
- The value found at this location is the single-byte output code point.
- In the example shown in Figure 59 the following takes place:
  - The first byte of the input value, X'00', is taken and used as the offset.
  - At location X'00' in the subtable pointer record the value 03 is found.
  - The method locates record 03 in the subtable pool and uses the second byte of the input value, X'41', as the offset.
  - The value found at this location, X'C1', is the output single-byte value.
- The resultant output string will be half as long as the input string (each double-byte is converted to a single-byte).
- The content of the conversion table used governs the accuracy of the output data.
Figure 59. Method 6: UCS to SBCS Conversion Table
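The TYPE 5 lookup can be sketched as follows. The record numbering is an assumption: here a pointer value is treated as a 256-byte record number counted from the start of the table, with the subtable pointer record as record 0. The function name, the toy table, and the use of X'3F' as the filler substitute value are likewise illustrative.

```python
RECORD_SIZE = 256


def convert_dbcs_to_sbcs(data: bytes, table: bytes) -> bytes:
    """Convert double-byte data with a TYPE 5 table."""
    if len(data) % 2:
        raise ValueError("fragmented double-byte code point")
    out = bytearray()
    for i in range(0, len(data), 2):
        first, second = data[i], data[i + 1]
        record_no = table[first]                 # subtable pointer record
        out.append(table[record_no * RECORD_SIZE + second])
    return bytes(out)


# Hypothetical four-record table: record 0 is the pointer record, and
# records 1 to 3 are subtables filled with X'3F'. All first bytes point
# to record 1 except X'00', which points to record 03 as in the example.
toy_table = bytearray([0x3F]) * (4 * RECORD_SIZE)
toy_table[0:256] = bytes([1]) * 256
toy_table[0x00] = 3
toy_table[3 * RECORD_SIZE + 0x41] = 0xC1
toy_table = bytes(toy_table)

print(convert_dbcs_to_sbcs(b"\x00\x41", toy_table).hex())
```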
Method 7 for Mixed Single/Double-byte to Double-byte
This method has the following characteristics:
- It is used for conversion between an input mixed single/double-byte and an output double-byte CCSID.
- The valid encoding schemes for the input data are X'1301', X'2300', X'2305' and X'3300', and for the output data are X'1200', X'2200', X'3200', X'5200', X'6200', X'7200', X'8200' and X'9200'.
- It uses a TYPE 2 conversion table, as described earlier for double-byte to double-byte conversions. See "Method 2 for Pure DBCS" for a description of the table and how it works.
- This method requires that the input data is normalized such that each input code point is two bytes long. This is done by prefixing each single-byte code point with a X'00'.
- Any code extension controls are also removed from the input data stream.
- The conversion then proceeds as any normal double-byte to double-byte conversion.
- This method is primarily used for converting data to UCS-2 (encoding scheme X'7200').
- The resultant output string will not necessarily be the same length as the input string (each single-byte code point from the mixed input string is converted to a double-byte).
- The content of the conversion table used governs the accuracy of the output data.
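The normalization step of Method 7 can be sketched for the EBCDIC mixed case (encoding scheme X'1301'). This is an illustration only: the function name is an assumption, SO and SI are taken to be X'0E' and X'0F', and PC mixed input, which needs code-page-specific lead-byte ranges rather than SO-SI, is not covered.

```python
SO, SI = 0x0E, 0x0F   # shift-out / shift-in code extension controls


def normalize_ebcdic_mixed(data: bytes) -> bytes:
    """Return data with every code point two bytes long and SO/SI removed."""
    out = bytearray()
    in_dbcs = False
    i = 0
    while i < len(data):
        b = data[i]
        if not in_dbcs and b == SO:
            in_dbcs = True                 # drop the control, switch mode
            i += 1
        elif in_dbcs and b == SI:
            in_dbcs = False
            i += 1
        elif in_dbcs:
            if i + 1 >= len(data) or data[i + 1] == SI:
                raise ValueError("fragmented double-byte code point")
            out += data[i:i + 2]           # already two bytes long
            i += 2
        else:
            out += bytes([0x00, b])        # prefix single byte with X'00'
            i += 1
    if in_dbcs:
        raise ValueError("SO without matching SI")
    return bytes(out)


print(normalize_ebcdic_mixed(b"\xC1\x0E\x41\xC1\x0F\xC2").hex())
```

The normalized result can then be passed to a double-byte to double-byte lookup such as the one described for Method 2.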
Method 8 for Double-byte to Mixed Single/Double-byte
This method has the following characteristics:
- It is used for conversion between an input double-byte and an output mixed single/double-byte CCSID.
- The valid encoding schemes for the input data are X'1200', X'2200', X'3200', X'5200', X'6200', X'7200', X'8200' and X'9200', and for the output data are X'1301', X'2300', X'2305' and X'3300'.
- It uses a TYPE 2 conversion table, as described earlier for double-byte to double-byte conversions. See "Method 2 for Pure DBCS" for a description of the table and how it works.
- This method takes the two byte input code points and uses the TYPE 2 conversion table to produce normalized (two byte) output code points.
- The output data is then denormalized by removing the leading X'00' found on the normalized single-byte code points.
- Any necessary code extension controls are also added to the output data stream. The resultant string must be well formed as defined by the appropriate encoding structure. For more information see Appendix A.
- This method is primarily used for converting data from UCS-2 (encoding scheme X'7200').
- The resultant output string will not necessarily be the same length as the input string (some of the input double-byte code points may map to single-byte code points in the output mixed CCSID).
- The content of the conversion table used governs the accuracy of the output data.
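The denormalization step of Method 8 can be sketched for an EBCDIC mixed target (encoding scheme X'1301'). This is a sketch under assumptions: the function name is hypothetical, SO and SI are taken to be X'0E' and X'0F', and a leading X'00' is treated as the only marker of a normalized single-byte code point.

```python
SO, SI = 0x0E, 0x0F   # shift-out / shift-in code extension controls


def denormalize_to_ebcdic_mixed(data: bytes) -> bytes:
    """Strip the leading X'00' from single bytes and insert SO/SI pairs."""
    if len(data) % 2:
        raise ValueError("normalized data must be a whole number of pairs")
    out = bytearray()
    in_dbcs = False
    for i in range(0, len(data), 2):
        high, low = data[i], data[i + 1]
        if high == 0x00:                   # normalized single-byte code point
            if in_dbcs:
                out.append(SI)             # close the double-byte substring
                in_dbcs = False
            out.append(low)
        else:                              # true double-byte code point
            if not in_dbcs:
                out.append(SO)             # open a double-byte substring
                in_dbcs = True
            out += bytes([high, low])
    if in_dbcs:
        out.append(SI)                     # keep the output well formed
    return bytes(out)


print(denormalize_to_ebcdic_mixed(b"\x00\xC1\x41\xC1\x00\xC2").hex())
```

Closing any open double-byte substring at the end keeps every SO matched by an SI, which is one part of producing a well-formed string.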
EUC and 2022 TCP/IP Conversion Tables
The Extended Unix Code (EUC) conversion tables are used to convert EUC encoded graphic character data from an EUC platform to or from a host or PC platform. The ISO 2022 TCP/IP (TCP) conversion tables are used to convert encoded graphic characters from the specific ISO 2022 format used by TCP/IP to or from a host or PC platform.
It is assumed at this point that the reader has some knowledge of the code extension techniques defined in the ISO 2022 standard. Both EUC and 2022 TCP/IP data streams make use of these techniques.
The conversion tables are constructed to achieve optimum character integrity after data conversion by using GCGID matching between source and target encodings. Both schemes can define up to four character set and code page pairs to improve the character matching between source and target CCSIDs.
The EUC and TCP conversion tables use a normalized form of data. Input passed to and output generated from the conversion tables is normalized. The PC code points are normalized by placing a leading X'00' in front of each single-byte to yield a two-byte form. Host (EBCDIC) data must have the SO-SI control characters deleted during normalization and reinserted afterwards during denormalization. As with the PC data, a leading X'00' is inserted in front of any single-byte data. The EUC and TCP code points are normalized to four-byte values.
Figure 60. EUC and ISO 2022 TCP Conversion
Figure 60 shows the general use of the EUC and TCP conversion tables. The input byte or bytes, up to a maximum of four bytes per code point, are first normalized and used as input to the conversion table. The output (again, four bytes per code point maximum) from the conversion table is also normalized data, which must be denormalized prior to subsequent processing.
Normalization and denormalization services are not part of the CDRA-supplied conversion tables.
The EUC encoding technique uses up to four coded graphic character sets. Each must be predefined, as the information is not carried in the text data stream. In CDRA, the CCSID determines the group of coded graphic character sets being used. Code points from the left half of the 8-bit encoding space (the high-order bit is OFF) are in the set G0. Code points that lie in the right half of the encoding space (the high-order bit is ON) are in the set G1. The single-shift control characters, called SS2 and SS3, are used to invoke the other sets G2 and G3.
EUC conversions require that the EUC input or output contain not only the character to be converted, but the shift control character when applicable. All input EUC data needs to have the code point values padded with leading zeros to create a fixed length, normalized, four-byte encoding. The following example shows how a code point in G3 must be formatted for the conversion tables.
- Input code point X'A2C3' in set G3
- SS3 character X'8F' must be present and included with the input value
- Normalized input value becomes X'008FA2C3' (padded to a length of four bytes).
This means that when dealing with EUC data, the parser must recognize which G-set each character belongs to in order to build the correct normalized input for the conversion process. When converting from EUC, the denormalizing process must strip off the leading zeros and concatenate the converted characters to build the correct output string.
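A parser that builds the normalized four-byte form can be sketched as follows. The widths of the G1, G2, and G3 sets are predefined per CCSID and are not carried in the data, so the `WIDTHS` values below (two, one, and two bytes) are assumptions chosen for illustration, as is the function name. SS2 is taken to be X'8E', alongside the SS3 value X'8F' given above.

```python
SS2, SS3 = 0x8E, 0x8F

# Assumed bytes per character in each set; the real values are
# determined by the CCSID.
WIDTHS = {"G1": 2, "G2": 1, "G3": 2}


def normalize_euc(data: bytes, widths: dict = WIDTHS) -> list[bytes]:
    """Split EUC data into code points padded to four bytes each."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SS2:
            length = 1 + widths["G2"]      # shift character is kept
        elif b == SS3:
            length = 1 + widths["G3"]
        elif b >= 0xA1:
            length = widths["G1"]          # high-order bit ON: set G1
        else:
            length = 1                     # G0 or a control character
        if i + length > len(data):
            raise ValueError("truncated EUC code point")
        # Pad with leading zeros to a fixed length of four bytes.
        out.append(data[i:i + length].rjust(4, b"\x00"))
        i += length
    return out


print([cp.hex() for cp in normalize_euc(b"\x8F\xA2\xC3")])
```

For the G3 example above, the three input bytes X'8F', X'A2', X'C3' become the single normalized value X'008FA2C3'.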
Method 9 for PC to EUC Conversions
The method shown in Figure 61 has the following characteristics:
- It is used for conversion between an input PC CCSID and an output EUC CCSID.
- The valid encoding schemes for input are X'2100', X'3100', X'2200', X'2300', X'2305', X'3200', and X'3300' while X'4403' is valid for the output CCSID.
- The PC input data is always normalized to two bytes per code point.
- The conversion table created will handle either single-byte or double-byte code points from the input CS, CP pair to a possible single-byte, double-byte or triple-byte output CS, CP, as determined by the EUC encoding scheme.
- The content of the table will reflect:
  - CS, CP pair priorities for the EUC CCSID
  - Matched GCGID priority within a CS, CP pair
  - Mismatch management criteria
  - Space character management.
- Since many double-byte encodings do not use all available first-byte values as ward numbers, the conversion table will contain one record for each valid ward and one additional record for all invalid wards. Each record will contain 256 four-byte entries.
- Invalid single-byte code points will be mapped into the single-byte G0 set character SUB, at code point X'1A'. Invalid double-byte values will be mapped into the double-byte G1 set as a SUB.
- Each of the four-byte values will contain the appropriate single-shift character (SS2 or SS3), whenever the output is in G2 or G3.
Figure 61. Method 9: PC to EUC Conversion
Method 10 for EUC to PC Conversions
The input values for the conversion in this case are four bytes, rather than the two bytes used for the PC input in the previous example. This results in a conversion table construction that is more complex than the PC-to-EUC case. The following description applies equally to all tables dealing with four-byte input values, namely those of EUC and TCP/IP CCSIDs.
There are four levels of tables within the constructed conversion table, where each table corresponds to one input byte value of the four input bytes per character.
- Level 0 tables (B0): Only one table can be constructed at this level. Byte 0 (the first byte) of the input code point is used to index into the B0 table and retrieve a pointer to the B1 level tables. Table B0 is 256 bytes long.
- Level 1 tables (B1): There is one B1 table for each valid entry in the B0 table, plus one table to contain all of the invalid entries for B0. The first four bytes of each B1 table are used as a pointer (23), b2pt, to a corresponding group of B2 tables. The second byte of the input code point (byte 1) is used as an index into B1 to retrieve the index number for the B2 table within the group of B2 tables pointed to by the b2pt value. Each B1 table is 260 bytes long.
- Level 2 tables (B2): There is one group of B2 tables for each B1 table. The first four bytes of each B2 table are used as a pointer (23), b3pt, to a corresponding group of B3 tables. The third byte of the input code point (byte 2) is used as an index into B2 to retrieve the index number for the B3 table within the group of B3 tables pointed to by the b3pt value. Each B2 table is 260 bytes long.
- Level 3 tables (B3): There is one group of B3 tables for each B2 table. Use the fourth byte of the input code point (byte 3, where byte 0 is the first byte) to index into the B3 table to retrieve the final conversion value. Each B3 table is 512 bytes in length.
An index value of 0 corresponds to the first table in the group.
Figure 62. Method 10: EUC to PC Conversion
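The four-level walk can be sketched with in-memory structures. This sketch deliberately does not model the serialized layout: the four-byte b2pt and b3pt pointers are represented as direct references to the corresponding group of tables, because their on-disk meaning is not described here. The class names, the toy table, the substitute value X'FCFC', and the sample mapping are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class B2:
    b3_group: list[bytes]   # stands in for b3pt; each B3 table is 512 bytes
    index: bytes            # 256 index bytes, selected by input byte 2


@dataclass
class B1:
    b2_group: list[B2]      # stands in for b2pt
    index: bytes            # 256 index bytes, selected by input byte 1


@dataclass
class Table:
    b0: bytes               # 256 bytes, selected by input byte 0
    b1_tables: list[B1]


def lookup(table: Table, cp: bytes) -> bytes:
    """Map one normalized four-byte input code point to two output bytes."""
    if len(cp) != 4:
        raise ValueError("input must be normalized to four bytes")
    b1 = table.b1_tables[table.b0[cp[0]]]
    b2 = b1.b2_group[b1.index[cp[1]]]      # index 0 is the first table
    b3 = b2.b3_group[b2.index[cp[2]]]
    offset = 2 * cp[3]                     # B3 holds 256 two-byte entries
    return b3[offset:offset + 2]


# Toy table: everything maps to a hypothetical SUB except X'0000A4A2'.
SUB = b"\xFC\xFC"
sub_b3 = SUB * 256
valid_b3 = bytearray(sub_b3)
valid_b3[2 * 0xA2:2 * 0xA2 + 2] = b"\x82\xA0"

invalid_b2 = B2([sub_b3], bytes(256))
invalid_b1 = B1([invalid_b2], bytes(256))

b2_index = bytearray(256); b2_index[0xA4] = 1
valid_b2 = B2([sub_b3, bytes(valid_b3)], bytes(b2_index))
b1_index = bytearray(256); b1_index[0x00] = 1
valid_b1 = B1([invalid_b2, valid_b2], bytes(b1_index))
b0 = bytearray(256); b0[0x00] = 1
toy = Table(bytes(b0), [invalid_b1, valid_b1])

print(lookup(toy, b"\x00\x00\xA4\xA2").hex())
```

In this toy table, index 0 at every level leads to the substitute entries, so an invalid byte at any position ends at the SUB value without a separate error path.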
The method shown in Figure 62 has the following characteristics:
- It is used for the conversion of data between an input EUC CCSID and output PC CCSID.
- The valid encoding scheme for input data is X'4403', while the valid schemes for output data are X'2100', X'3100', X'2200', X'2300', X'2305', X'3200', and X'3300'.
- The input bytes are always received in a normalized four-byte format.
- The conversion table will accept single-byte, double-byte, or triple-byte code points from the input CS, CP pair, as defined by the EUC encoding scheme, and convert them to a possible single-byte or double-byte CS, CP code point output value.
- The content of the table will reflect the criterion used for:
  - Matched GCGID priority within the target CS, CP
  - Mismatch management
  - Space character management.
- For most EUC four-byte codes, only a certain range of code point values is valid for the three high-order bytes; therefore the tables are organized as several subtables. Subtable pointer tables contain entries that point to a pool of subtables. The lowest-level subtable points to a series of records containing 256 double-byte code point values used as output.
- Invalid single-byte input values will be mapped to the single-byte SUB character for the PC, which is X'7F'. All other invalid input values will be mapped to the double-byte SUB for the respective PC mixed CCSID.
- Only a triple-byte CS, CP pair will use all four bytes of the input code point.
Method 11 for Host to EUC Conversions
The method shown in Figure 63 has the following characteristics:
- It is used for conversion between an input Host CCSID and an output EUC CCSID.
- The valid encoding schemes for input data are X'1100', X'1200', and X'1301'. The valid output encoding scheme is X'4403'.
- Input is always expected in a normalized two-byte format.
- The conversion table created will handle either single-byte or double-byte code points from the input CS, CP pair to a possible single-byte, double-byte or triple-byte output CS, CP, as determined by the EUC encoding scheme.
- The content of the table will reflect:
  - CS, CP pair priorities for the EUC CCSID
  - Matched GCGID priority within a CS, CP pair
  - Mismatch management criteria
  - Space character management.
- Since many double-byte encodings do not use all available first-byte values as ward numbers, the conversion table will contain one record for each valid ward and one additional record for all invalid wards. Each record will contain 256 four-byte entries.
- Invalid single-byte code points (X'00xx') will be mapped into the single-byte G0 set character SUB, at code point X'1A'. Invalid double-byte values will be mapped into the double-byte G1 set as a SUB.
- Host double-byte control characters will be mapped to the single-byte control characters after denormalization.
Figure 63. Method 11: Host to EUC Conversion
Method 12 for EUC to Host Conversions
The method shown in Figure 64 has the following characteristics:
- See "Method 10 for EUC to PC Conversions" for a description of the table format.
- It is used for conversion between an input EUC CCSID and an output Host CCSID.
- The valid encoding scheme for input data is X'4403'. The valid output encoding schemes are X'1100', X'1200', and X'1301'.
- Input is always expected in a normalized four-byte format.
- The conversion table created will handle either single-byte, double-byte or triple-byte code points from the input CS, CP pair to a possible single-byte or double-byte CS, CP code point output.
- The content of the table will reflect:
  - CS, CP pair priorities for the EUC CCSID
  - Matched GCGID priority within a CS, CP
  - Mismatch management criteria
  - Space character management.
- Since most EUC four-byte encodings only use a certain range for the three high-order bytes, the conversion table is organized into several levels of subtables. These subtables in turn point to a pool of records containing 256 double-byte entries. Each valid input code point has a corresponding entry in one of these data records.
- Invalid single-byte code points (X'00xx') will be mapped into the single-byte G0 set character SUB, at code point X'3F'. Invalid multi-byte values will be mapped into the double-byte host SUB, at X'FEFE'.
- Only a triple-byte CS, CP pair will use the high-order byte of the four-byte encoding space.
Figure 64. Method 12: EUC to Host Conversion
2022 TCP/IP Conversions
This section documents the formats of the tables used for converting to and from the national standard coded graphic character sets that are used by TCP/IP.
Conversion tables for TCP/IP are very similar to those for EUC, except that only one G-set is used, namely G0. Switching from one coded graphic character set to another cannot be accomplished using the EUC technique of a single shift. Instead, an explicit escape sequence is used to designate a new coded graphic character set being loaded into the G0 set. The escape sequence value itself cannot be carried in the conversion table entries, so the high-order byte of the TCP/IP code point value will be set to correspond to the position of the CGCSID within the CCSID. For example, CCSID 956 is defined for Japanese TCP/IP and contains four CGCSIDs corresponding to JIS X 201 Roman, JIS X 208-1983, JIS X 201 Katakana, and JIS 212. For Japanese host to TCP/IP conversion, a code point value from the JIS X 201 Katakana set would contain a value of X'03' in the high-order byte.
In ISO 2022, control characters are not part of the coded graphic character set; therefore, loading a new coded character set into G0 does not affect the set of control characters in C0. It thus does not make sense to have the high-order byte setting indicate a specific coded graphic character set. Control characters will then be normalized as follows:
- For Host or PC to 2022 TCP/IP, the converted value will have X'00' in the high-order byte. This will indicate to the denormalization routine that it does not matter what the currently designated coded graphic character set is -- the control character may simply be placed directly into the output data stream.
- For 2022 TCP/IP to Host or PC, the input normalized value may have X'00' as the high-order byte, or it may contain any valid value for the table. The normalization routine can then recognize the value as a control character and place a X'00' in the high-order byte or it may use the current active value as a result of the previous escape sequence.
Although the identifiers coded in the high-order bytes will correspond to the position of the coded graphic character set within the CCSID, each table will also contain the set of ESC sequences used to designate the coded graphic character set in the G0. This mapping will be contained in the first record of the conversion table in the following format:
L1 ESC1 L2 ESC2 ... Ln ESCn
where Li is a one-byte unsigned field containing the length of the following ESC sequence, and ESCi is the ESC sequence associated with the high-order byte id "i" in the table.
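Reading the length-prefixed list of ESC sequences can be sketched as below. The text does not state how the number of entries n is determined, so this sketch assumes the caller supplies it (for example, the number of CGCSIDs in the CCSID) and that identifiers start at 1. The function name and the sample record are illustrative; the sample uses well-known ISO 2022 designations for Japanese sets and is not taken from an actual CDRA table.

```python
def parse_esc_record(record: bytes, count: int) -> dict[int, bytes]:
    """Parse 'L1 ESC1 L2 ESC2 ... Ln ESCn' into {identifier: ESC sequence}."""
    result = {}
    pos = 0
    for ident in range(1, count + 1):
        if pos >= len(record):
            raise ValueError("record ends before entry %d" % ident)
        length = record[pos]                    # one-byte unsigned length
        esc = record[pos + 1:pos + 1 + length]
        if len(esc) != length:
            raise ValueError("truncated ESC sequence for id %d" % ident)
        result[ident] = esc
        pos += 1 + length
    return result


# Hypothetical first record holding four designation sequences.
sample_record = b"\x03\x1b(J" + b"\x03\x1b$B" + b"\x03\x1b(I" + b"\x04\x1b$(D"
print(parse_esc_record(sample_record, 4))
```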
Method 13 for PC to TCP Conversions
The method shown in Figure 65 has the following characteristics:
- It is used for conversion between an input PC CCSID and an output TCP CCSID.
- The valid encoding schemes for input data are X'2100', X'3100', X'2200', X'2300', X'2305', X'3200', and X'3300'. The valid output encoding scheme is X'5404'.
- Input is always expected in a normalized two-byte format.
- The conversion table created will handle either single-byte or double-byte code points from the input CS, CP pair to a possible single-byte, double-byte or triple-byte output CS, CP, as determined by the TCP encoding scheme.
- The content of the table will reflect:
  - CS, CP pair priorities for the TCP CCSID
  - Matched GCGID priority within a CS, CP pair
  - Mismatch management criteria
  - Space character management.
- Since many double-byte encodings do not use all available first-byte values as ward numbers, the conversion table will contain one record for each valid ward and one additional record for all invalid wards. Each record will contain 256 four-byte entries.
- Invalid single-byte code points (X'00xx') will be mapped into the single-byte character SUB, at code point X'1A'. Invalid double-byte values will be mapped into the following:
  - Japan - X'747E'
  - Korea - X'2F7E'
  - Traditional Chinese - X'7D7E'
  - Simplified Chinese - X'2121'(24)
- The high-order byte of each output code point value will contain the identifier 1 to 4 for graphics or 0 for a control character.
Figure 65. Method 13: PC to TCP Conversion
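Denormalizing the four-byte output of such a table into a 2022 TCP/IP data stream can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the ESC sequences are supplied as a mapping from identifier to byte string, no set is assumed to be designated at the start of the stream, and the three low-order bytes are assumed to hold the code point padded with leading zeros.

```python
def denormalize_tcp(code_points: list[bytes], esc_by_id: dict[int, bytes]) -> bytes:
    """Build an output stream from normalized four-byte TCP code points."""
    out = bytearray()
    current = None                         # assumption: nothing designated yet
    for cp in code_points:
        if len(cp) != 4:
            raise ValueError("code points must be normalized to four bytes")
        ident, body = cp[0], cp[1:]
        if ident == 0:
            # Control character: placed directly into the output stream,
            # whatever set is currently designated.
            out.append(body[-1])
            continue
        if ident != current:
            out += esc_by_id[ident]        # designate the new set into G0
            current = ident
        out += body.lstrip(b"\x00")        # drop the padding zeros
    return bytes(out)


# Hypothetical identifiers and ESC sequences for illustration.
escapes = {1: b"\x1b(J", 2: b"\x1b$B"}
stream = denormalize_tcp(
    [b"\x01\x00\x00\x41", b"\x02\x00\x24\x22", b"\x00\x00\x00\x0A", b"\x02\x00\x24\x24"],
    escapes,
)
print(stream.hex())
```

In the sample, the control character between the two double-byte code points does not trigger a second escape sequence, because the designated set is unchanged.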
Method 14 for TCP to PC Conversions
The method shown in Figure 66 has the following characteristics:
- See "Method 10 for EUC to PC Conversions" for a description of the type of table format used in this conversion method.
- It is used for conversion between an input TCP CCSID and an output PC CCSID.
- The valid encoding scheme for input data is X'5404'. The valid output encoding schemes are X'2100', X'3100', X'2200', X'2300', X'2305', X'3200', and X'3300'.
- Input is always expected in a normalized four-byte format, and it includes the identifier in the high-order byte indicating which coded graphic character set the code point was taken from.
- The conversion table created will handle either single-byte, double-byte, or triple-byte code points from the input CS, CP pair as defined by the TCP encoding scheme to a possible single-byte, or double-byte output CS, CP code point.
- The content of the table will reflect:
  - Matched GCGID priority within a CS, CP
  - Mismatch management criteria
  - Space character management.
- For most TCP four-byte codes, only a certain range of values is valid for the three high-order bytes. To handle this situation, the table is organized as a series of subtables. Each subtable level points to a lower level subtable, until the last subtable level points to the actual output code point records. Each of the records contain 256 double-byte code point values.
- Invalid single-byte code points will be mapped into the single-byte character SUB, at code point X'7F'. All other invalid values will be mapped to the double-byte SUB character for the respective country version of the encoding scheme.
- Only a triple-byte CS, CP pair will use all four bytes of the input code point.
Figure 66. Method 14: TCP to PC Conversion
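The subtable chain described for this method can be sketched as a nested lookup. This is an illustrative sketch only: the nested-dictionary layout, the names `lookup`, `ROOT`, and `SUB_RECORD`, and the sample code point values are assumptions made for the example, not the CDRA table format.

```python
# Placeholder double-byte SUB; the real value depends on the country version.
SUB_DBCS = b"\xfc\xfc"

# Record returned for any invalid combination of the three high-order bytes.
SUB_RECORD = [SUB_DBCS] * 256


def lookup(root: dict, code: bytes) -> bytes:
    """Convert one normalized four-byte input code point to two output bytes."""
    if len(code) != 4:
        raise ValueError("input must be in normalized four-byte format")
    level = root
    # Each of the three high-order bytes selects the next lower-level subtable.
    for byte in code[:3]:
        level = level.get(byte)
        if level is None:
            # No subtable for this value: fall back to the substitution record.
            return SUB_RECORD[code[3]]
    # The last level is a record of 256 double-byte output code points.
    return level[code[3]]


# Hypothetical table with a single valid path X'010030xx'.
_record = list(SUB_RECORD)
_record[0x21] = b"\x88\x9f"
ROOT = {0x01: {0x00: {0x30: _record}}}
```

With this sample table, `lookup(ROOT, bytes.fromhex("01003021"))` follows the valid path, while any input whose high-order bytes have no subtable resolves to the substitution value.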
Method 15 for Host to TCP Conversions
The method shown in Figure 67 has the following characteristics:
- It is used for conversion between an input Host CCSID and an output TCP CCSID.
- The valid encoding schemes for input data are X'1100', X'1200', and X'1301'. The valid output encoding scheme is X'5404'.
- Input is always expected in a normalized two-byte format.
- The conversion table created will handle either single-byte or double-byte code points from the input CS, CP pair to a possible single-byte, double-byte, or triple-byte output CS, CP code point as defined by the TCP encoding scheme.
- The content of the table will reflect:
- CS, CP pair priorities for the TCP CCSID
- Matched GCGID priority within a CS, CP pair
- Mismatch management criteria
- Space character management.
- Most double-byte encodings do not use all of the available first bytes as valid ward numbers. To handle this situation and make effective use of table resource space, the table is organized as a series of subtables. Each subtable contains 256 four-byte code point entries. There is a subtable of output code points for each valid ward number of the input code points, and a single subtable for the substitution entries for all of the invalid first-byte values.
- Invalid single-byte code points (X'00xx') will be mapped into the single-byte character SUB, at code point X'1A'.
- Invalid double-byte values will be mapped as follows:
- Japan - X'747E'
- Korea - X'2F7E'
- Traditional Chinese - X'7D7E'
- Simplified Chinese - X'2121'(24)
- The high-order byte of each output code point contains the identifier from 1 to 4 for graphic characters, or 0 for control characters.
- Host double-byte control characters will be mapped to single-byte control characters after denormalization.
Figure 67. Method 15: Host to TCP Conversion
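The ward-indexed organization described for this method can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the identifier values placed in the high-order byte, and the sample mappings are hypothetical. Only the SUB code points X'1A' and X'747E' come from the description above.

```python
# Four-byte substitution entries (identifier byte plus three code bytes).
SBCS_SUB = bytes.fromhex("0000001A")  # identifier 0 (control), SUB at X'1A'
DBCS_SUB = bytes.fromhex("0200747E")  # hypothetical identifier 2, Japan X'747E'


def build_table(wards, sbcs_sub=SBCS_SUB, dbcs_sub=DBCS_SUB):
    """Return 256 subtable references, indexed by the input first byte."""
    shared_sub = [dbcs_sub] * 256   # single subtable for all invalid wards
    table = [shared_sub] * 256      # every first byte starts out invalid
    for ward, mappings in wards.items():
        # Ward X'00' holds the single-byte code points (X'00xx').
        default = sbcs_sub if ward == 0x00 else dbcs_sub
        subtable = [default] * 256
        for second_byte, entry in mappings.items():
            subtable[second_byte] = entry
        table[ward] = subtable
    return table


def convert(table, code: bytes) -> bytes:
    """Map one normalized two-byte input code point to a four-byte entry."""
    if len(code) != 2:
        raise ValueError("input must be in normalized two-byte format")
    first, second = code
    return table[first][second]


# Hypothetical sample mappings for one single-byte and one double-byte ward.
TABLE = build_table({
    0x00: {0xC1: bytes.fromhex("01000041")},
    0x42: {0xC1: bytes.fromhex("0200A3C1")},
})
```

Because every invalid first byte refers to the same substitution subtable, the table needs one 256-entry subtable per valid ward plus one shared subtable, rather than 256 full subtables.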
Method 16 for TCP to Host Conversions
The method shown in Figure 68 has the following characteristics:
- See "Method 10 for EUC to PC Conversions" for a description of the type of table format used in this conversion method.
- It is used for conversion between an input TCP CCSID and an output Host CCSID.
- The valid encoding scheme for input data is X'5404'. The valid output encoding schemes are X'1100', X'1200', and X'1301'.
- Input is always expected in a normalized four-byte format, and includes the identifier in the high-order byte that indicates the coded graphic character set of the output code point.
- The conversion table created will handle either single-byte, double-byte, or triple-byte code points from the input CS, CP pair as defined by the TCP encoding scheme to a possible single-byte, double-byte, or triple-byte output CS, CP code point.
- The content of the table will reflect:
- Matched GCGID priority within a CS, CP
- Mismatch management criteria
- Space character management.
- For most TCP four-byte codes, only a certain range of values is valid for the three high-order bytes, so the table is organized as several subtables. Each of the subtables contains pointers to subsequent records in the table. Each of the subsequent records contains 256 double-byte output code point entries.
- Invalid single-byte code points will be mapped into the single-byte character SUB, at code point X'3F'.
- Invalid multi-byte input values will be mapped to the host double-byte SUB value of X'FEFE'.
- Only a triple-byte CS, CP pair will use all four bytes of the four-byte input code point value.
Figure 68. Method 16: TCP to Host Conversion
Conversion Methods in Support of GB18030
Chinese Standard GB18030 defines a complex code composed of one-, two-, and four-byte components. To convert between this standard and other encodings, including Unicode and mixed single-byte and double-byte codes, a number of conversion table methods and structures have been defined. The specifics of the content and use of these structures are documented in the GB18030 Readme file that is part of the GB18030 conversion package.
Use of Shadow Flags - An Example
"GCCT Shadow Flags Element" describes the shadow flags associated with a conversion table. The shadow flag table indicates whether there is an exact match of input and output characters, so it can be used to verify that no character replacement has occurred when converting a coded graphic character string. The shadow flag tables have the same structure as the associated conversion tables, and a shadow flag value is obtained with the same method, except that each entry in a shadow flag subtable is always a single byte holding one of the shadow flag values X'00' or X'FF', where:
- X'00' - Characters match for this code point.
- X'01' to X'FE' - Reserved.
- X'FF' - Characters do not match for this code point.
Figure 69. Method 1: Single-Byte to Single-Byte Conversion with Shadow Flag
Figure 69 shows how the shadow flags are accessed and used in conjunction with a Type 1 table. In this example the input code point X'53' is converted to X'67' as the output code point using the conversion table. Using the associated shadow flag table, the shadow flag entry of X'FF' indicates that the output character is not the same as the input character. The GCGIDs do not match.
Figure 70. Method 2: Double-Byte to Double-Byte Conversion with Shadow Flags
Figure 70 shows how the shadow flags are accessed and used in conjunction with the Type 2 tables. In this example the input code point X'41C1' is converted to X'43C4' as the output code point using the conversion tables. Using the associated shadow flag table, the shadow flag entry of X'00' indicates that the output character is the same as the input character. The GCGIDs match.
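The single-byte case from Figure 69 can be sketched as two parallel 256-entry tables indexed by the input code point. This is an illustration only: the identity mapping used for every other entry, and the names `convert_with_flags`, `TABLE`, and `FLAGS`, are assumptions made for the example. Only the X'53' to X'67' entry and its X'FF' flag come from the description above.

```python
MATCH = 0x00      # characters match for this code point
MISMATCH = 0xFF   # characters do not match for this code point

# Hypothetical Type 1 table: identity everywhere except X'53' -> X'67'.
_table = bytearray(range(256))
_table[0x53] = 0x67
TABLE = bytes(_table)

# Shadow flag table with the same indexing as the conversion table.
_flags = bytearray([MATCH] * 256)
_flags[0x53] = MISMATCH
FLAGS = bytes(_flags)


def convert_with_flags(data: bytes, table: bytes, flags: bytes):
    """Return (converted bytes, True if every character matched exactly)."""
    converted = data.translate(table)
    # The flag is looked up with the same index as the conversion entry.
    exact = all(flags[code_point] == MATCH for code_point in data)
    return converted, exact
```

A caller that must guarantee a lossless conversion could reject any string for which the second element of the result is false.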
Enhancements to Support String Types
Each of the methods described above can be enhanced to deal with different input and output string types.
When the input is a null-terminated string, the input parsing step will be enhanced with the necessary logic to sense the null-termination and take appropriate action.
Similarly, when the output string type is null-terminated, the output assembly step of conversion will append a null-termination character to the end of the string. Error-checking logic should also be in place to ensure that the string produced by the conversion table does not itself contain a null-termination character.
See "Null-terminated string" for the semantics of a null-terminated string.
For SPACE-padded output string types, padding logic appends the appropriate number of SPACE code points from the relevant CP to the end of the converted string. This enhancement has to be added to the output assembly step.
See "Padded string" for the semantics of a SPACE-padded string.
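The string-type enhancements to the parsing and assembly steps can be sketched as follows. This is a minimal sketch assuming single-byte code points and a null-termination character of X'00'. The function names are hypothetical, and the SPACE code point is passed in by the caller because it depends on the output CP.

```python
NULL = b"\x00"  # assumed single-byte null-termination character


def parse_null_terminated(buffer: bytes) -> bytes:
    """Input parsing: return the code points that precede the terminator."""
    end = buffer.find(NULL)
    if end < 0:
        raise ValueError("input string is not null-terminated")
    return buffer[:end]


def assemble_null_terminated(converted: bytes) -> bytes:
    """Output assembly: reject embedded nulls, then append the terminator."""
    if NULL in converted:
        raise ValueError("converted string contains a null-termination character")
    return converted + NULL


def assemble_space_padded(converted: bytes, length: int, space: bytes) -> bytes:
    """Output assembly: pad with SPACE code points up to a fixed byte length."""
    if len(converted) > length:
        raise ValueError("converted string is longer than the output length")
    count, remainder = divmod(length - len(converted), len(space))
    if remainder:
        raise ValueError("padding area is not a whole number of SPACE code points")
    return converted + space * count
```

For example, padding a two-byte converted string to four bytes with an EBCDIC SPACE would be written as `assemble_space_padded(b"\xc1\xc2", 4, b"\x40")`.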
In addition to these two special string types, a number of string types have been defined in support of bidirectional text. See "Types of Strings" for specific information.