EBCDIC Double-Byte Structure
The structure of IBM double-byte coded character sets is specified in IBM standards.
The double-byte EBCDIC code is called DBCS-HOST code. The basic EBCDIC structure has allocated coding space for control characters and graphic characters separately. The following describes the graphic character range of hexadecimal codes in the DBCS-HOST structure. Figure 46 illustrates the DBCS-HOST graphic character coding space. There are no 16-bit codes for control characters in the EBCDIC structure definition. A DBCS-HOST graphic character code has the following characteristics:
- The first byte is in the range X'41' to X'FE'
- The second byte is also in the range X'41' to X'FE', for all currently defined code pages
- X'4040' represents DBCS-HOST SPACE
- All other undefined 16-bit patterns are invalid as graphic characters.
Figure 46. EBCDIC DBCS Graphic Character Coding Space (ES = X'12zz' or X'13zz')
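The characteristics listed above amount to a simple range check. The following Python sketch is illustrative only: the function name `is_dbcs_host_graphic` is hypothetical, and the check covers only the structural ranges, not whether a code point is actually assigned in a given code page.

```python
def is_dbcs_host_graphic(code: bytes) -> bool:
    """Structural check for a two-byte DBCS-HOST graphic code (sketch)."""
    if len(code) != 2:
        return False
    if code == b"\x40\x40":              # X'4040' is DBCS-HOST SPACE
        return True
    first, second = code
    # Both bytes must fall in X'41'..X'FE'; anything else is invalid.
    return 0x41 <= first <= 0xFE and 0x41 <= second <= 0xFE
```

For example, `is_dbcs_host_graphic(b"\x40\x41")` is false because X'40' is only valid as the first byte of the DBCS-HOST SPACE.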
A DBCS Ward
A section of a DBCS where the first bytes of all the code points belonging to it are the same is called a ward. (19) A set of wards can be registered with a unique character set identifier, GCSGID, and the associated CPGID of the DBCS. This unique identifier, CGCSGID, defines the valid graphic character code points belonging to that set of wards.
EBCDIC Single/Double-Byte Mixed Encoding Structure
The coding space for EBCDIC Single/Double-byte mixed graphic characters is shown in Figure 47.
The encoding scheme is a hybrid of the two EBCDIC schemes: EBCDIC SBCS, described earlier in Figure 39, and EBCDIC DBCS, described above and in Figure 46. It is a stateful encoding that uses a code extension technique to switch between SBCS mode and DBCS mode. The control codes that mark the change of state are X'0E' (shift out of SBCS mode) and X'0F' (shift into SBCS mode). The default starting state for a string encoded using this scheme is single-byte. For a mixed string to begin in DBCS mode, the first double-byte character must be preceded by X'0E' to 'shift out' of SBCS mode. A well formed mixed host string must have matching shift-out, shift-in (SO, SI) pairs, so every well formed mixed host string ends in single-byte mode. Within either mode, the behavior of the encoding is as prescribed by the respective encoding scheme, and all the semantics of the two individual encoding schemes apply.
The following are examples of well formed mixed EBCDIC strings. In these examples SO - represents a shift-out control, SI - represents a shift-in control, s - represents a single-byte character and dd - represents a double-byte character.
ssssSOddddddddSIsssss - in this example the string begins in single-byte mode, shifts to double-byte mode for 4 characters and then returns to single-byte mode.
SOddddddddSIsssss - in this example the string begins with double-byte characters, thus the first character of the string must be the shift-out, following the double-byte characters there is a shift-in to change to the single-byte state for the last 5 characters in the string.
ssssSOddddddddSI - in this example the string begins in single-byte mode and shifts to double-byte mode. Even though the last character of the string is a double-byte character, the shift-in control is required to create a well formed string.
SOSIssssSOddddddddSIsssss - in this example the SOSI at the beginning of the string is treated as a no-op. This is true for a SOSI pair found anywhere in a mixed EBCDIC string.
Figure 47. EBCDIC Mixed Single/Double-Byte Code Structure (ES = X'1301')
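The well-formedness rules described above can be sketched as a small state machine. This Python example is a minimal illustration, not a reference implementation: the function name `split_mixed_ebcdic` and the ("SBCS" | "DBCS", bytes) result shape are assumptions made for this sketch, and the DBCS check covers only the structural byte ranges.

```python
SO, SI = 0x0E, 0x0F  # shift-out / shift-in controls

def split_mixed_ebcdic(data: bytes):
    """Split a mixed host string into ("SBCS" | "DBCS", bytes) runs.

    Raises ValueError if the string is not well formed.
    """
    segments = []
    run = bytearray()
    in_dbcs = False              # default starting state is single-byte
    i = 0

    def flush(mode):
        nonlocal run
        if run:                  # an empty SO SI pair produces no run (no-op)
            segments.append((mode, bytes(run)))
            run = bytearray()

    while i < len(data):
        b = data[i]
        if not in_dbcs:
            if b == SO:
                flush("SBCS")
                in_dbcs = True
            elif b == SI:
                raise ValueError(f"SI without matching SO at offset {i}")
            else:
                run.append(b)
            i += 1
        elif b == SI:
            flush("DBCS")
            in_dbcs = False
            i += 1
        else:
            pair = data[i:i + 2]
            is_space = pair == b"\x40\x40"
            in_range = len(pair) == 2 and all(0x41 <= x <= 0xFE for x in pair)
            if not (is_space or in_range):
                raise ValueError(f"invalid DBCS code at offset {i}")
            run += pair
            i += 2
    if in_dbcs:
        raise ValueError("string ends in DBCS mode (missing SI)")
    flush("SBCS")
    return segments
```

A leading SO SI pair, as in the last example string, simply produces no run, which matches its treatment as a no-op.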
IBM-PC Double-Byte Code Structure
The coding space for DBCS-PC graphic characters is shown in Figure 48. The DBCS-PC graphic character code has the following characteristics:
Note: It is not advised to rely on the specific values above X'40' (second byte value) to denote the presence or absence of DBCS characters. These values will be encoding scheme specific and can change over time.
Figure 48. IBM-PC DBCS Graphic Character Coding Space (ES = X'22zz', X'32zz')
Note: In practice, the graphic characters of DBCS-PC are used with a single-byte PC coded character set. The specific values in the allocated range to be used as the first byte of a double-byte are detailed when the coded character set is registered. Other values from this range may be defined to be used as single-byte code points, and when so defined are not available for use as the first byte of a double-byte. Similarly, when a code point is declared to be the first byte of a double-byte code point, it cannot be used as a single-byte code point.
The control characters are all single-byte codes, as defined earlier for the IBM-PC Display and IBM-PC Data code structure. The definition of a ward given above also applies to DBCS-PC.
IBM-PC Mixed Single- and Double-Byte Structure
In the PC-Mixed scheme, both single-byte and double-byte code points may exist in the same data stream, without any explicit demarcation points between them.
Each specific use of a PC-Mixed scheme (ES=X'23zz' or X'33zz') must have an associated declaration of the specific single-byte code points to be used as the first byte of the double-byte code point. This set of code points is equivalent to a set of specific single-shift control code points in ISO (for example, Single-shift-2 (X'8E') as defined in ISO 6429). Each single-shift control causes the meaning of the following single-byte code point to be taken from a specific ward. The value of the first byte, besides being a single-shift control, is equal to the ward number. Figure 49 illustrates this definition.
Figure 49. IBM-PC Mixed Single/Double-Byte Graphic Character Coding Space (ES = X'2300', or X'3300')
Note: Application developers are cautioned to not rely on the absolute code point range values as they may change in the future. The begin and end values may be CCSID dependent.
The double-byte codes starting with a valid first byte follow the definition for the IBM-PC Double-Byte code structure. All bytes that are not in the list of valid first bytes have their single-byte code points assigned per the IBM PC Single-Byte Data or Display structure definition. In comparison, in a pure PC-DBCS scheme the single-byte graphic code points of the base PC encoding structure that are not used as the first byte of a double-byte code point cannot be assigned a graphic character.
Note: The size of the maximal character set of the double-byte code page determines the size of the double-byte coding space needed. This in turn governs the number of wards needed, and the corresponding number of code points to be reserved for use as the first byte of a double-byte code point. The character set of the single-byte code page also influences the maximum number of single-byte code points needed, by trading off with the maximum number of wards possible. The net result is that when a specific single-byte code page and a specific double-byte code page are used with the mixed encoding structure of the PC, the list of valid first bytes also gets fixed.
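Because there are no explicit demarcation points, a PC-Mixed string can only be split into characters by scanning from the start and consulting the declared list of first bytes. The sketch below assumes a hypothetical function `split_pc_mixed`; the lead-byte set used in the example (X'81'-X'9F' and X'E0'-X'FC') is chosen purely for illustration, since the real list is fixed by the registered coded character set.

```python
def split_pc_mixed(data: bytes, lead_bytes) -> list:
    """Split PC-Mixed data into one- and two-byte code points (sketch).

    lead_bytes: the set of byte values declared as first bytes of
    double-byte code points for the coded character set in use.
    """
    chars = []
    i = 0
    while i < len(data):
        if data[i] in lead_bytes:
            if i + 1 >= len(data):
                raise ValueError(f"truncated double-byte code at offset {i}")
            chars.append(data[i:i + 2])   # first byte selects the ward
            i += 2
        else:
            chars.append(data[i:i + 1])   # ordinary single-byte code point
            i += 1
    return chars

# Illustrative lead-byte list only (an assumption, not from the text above).
EXAMPLE_LEAD_BYTES = frozenset(range(0x81, 0xA0)) | frozenset(range(0xE0, 0xFD))
```

Note that a second byte may coincide with a single-byte value (for example X'41'), which is why the scan cannot start in the middle of a string.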
IBM Extended Unix Code (IBM EUC)
IBM's adaptation of Extended Unix* (20) Code (EUC) is called IBM EUC. It is also known (in IBM AIX documentation) as Multiple Byte Character Set (MBCS). The structure of IBM EUC coded character sets is specified in IBM Corporate Standard, Double-Byte Character Set (DBCS), Terminology and Coding Scheme, C-S 3-3220-102, 1992-07. The encoding scheme used in IBM EUC is shown in Figure 50.
Figure 50. Designation and Invocation of IBM-EUC (ES = X'4403')
IBM EUC is an adaptation of one of the several code extension techniques defined in ISO 2022. It uses the 8-bit coding environment. The coded graphic character sets used are a national version of ISO 646 designated as the G0 set, and at most three additional G sets (G1, G2, and G3). The graphic character sets used correspond to the national standards of the different countries in the Far East.
The 8-bit environment of ISO 2022 implicitly designates the G0 set into the left half and the G1 set into the right half of the ISO-8 encoding structure (see section "ISO 8-bit Structure"). Encoding scheme X'8100' has been defined to describe a G1 set in the right-hand side of the ISO 8-bit encoding space when it is being used as a standalone portion of an EUC encoding. The single-shift controls, Single-shift 2 (SS2) and Single-shift 3 (SS3), are used for invoking the G2 and G3 sets into the right half of the 8-bit code. IBM EUC omits all the announcer, invocation, and designation sequences of ISO 2022.
The resultant complete coded graphic character sets are often called EUC_J (for use in Japan), EUC_K (for use in Korea), EUC_T (for use with Traditional Chinese), or EUC_S (for use with Simplified Chinese).
The EUC scheme combines up to four coded graphic character sets. The collection includes a basic character set (the G0 set of a national version of ISO 646), and one or more of the following coded graphic character sets:
- ISO 7/8 bit -- SBCS-EUC
- double-byte -- DBCS-EUC
- triple-byte -- TBCS-EUC.
The valid ranges of graphic character code points for each one of these sets when used in IBM EUC are given below:
- Basic Character Set is the G0 set of a national version of ISO 646, and is implicitly designated and invoked into the code point range X'21' to X'7E' for graphic characters.
- SBCS-EUC is a single-byte code page used with the code extension technique of IBM EUC. Each graphic character code point can be in the range X'A0' to X'FF' (called a 96-character G set in ISO 2022).
- DBCS-EUC is a double-byte coded graphic character set, which is used with the code extension technique of IBM EUC. The valid set of graphic character code points of DBCS-EUC is shown in Figure 51. Any graphic character code point of DBCS-EUC meets the following criteria:
- Both bytes are in the range X'A0' to X'FF'
- Any two-byte pattern that includes a byte value outside the above range is invalid.
Figure 51. IBM-EUC Double Byte Code Structure
(ES = X'9200' standalone or as part of ES = X'4403')
- TBCS-EUC is a triple-byte coded graphic character set, which is used with the code extension technique of IBM EUC. The valid set of graphic character code points of TBCS-EUC is shown in Figure 52. Any graphic character code point of TBCS-EUC meets the following criteria:
- All three bytes are in the range X'A0' to X'FF'
- Any three-byte pattern that includes a byte value outside the above range is invalid.
Any one of the SBCS-EUC, DBCS-EUC, or TBCS-EUC can be used as any of G1, G2, or G3 sets.
- Each code point invoked from G2 is preceded by an SS2 control that has been assigned X'8E' in C1.
- Each code point invoked from G3 is preceded by an SS3 control that has been assigned X'8F' in C1.
The remaining code points in the space X'00' to X'1F', and X'7F' to X'9F', follow the rules for an ISO 8-bit code (see "ISO 8-bit Structure").
- The default SPACE (X'20'), DELETE (X'7F'), and control code point assignments for C0 and C1 sets as defined in ISO 6429.
- The SS2 and SS3 controls are from the C1 set (X'8E' and X'8F').
- The default Substitute (SUB) code point is from the C0 set (X'1A').
- Additional SPACE and SUB control code points may also be specified to be used with G1, G2, or G3 sets.
Figure 52. IBM-EUC Triple Byte Code Structure (ES = X'5700' standalone or as part of ES = X'4403')
Notes:
- EUC does not specify what happens when a control set that is designated and invoked as a C1 set has SS2 and SS3 controls assigned to code points other than X'8E' and X'8F' -- for example, the control sets of CCITT T.61 for Telematic Services.
- IBM EUC specifies that the right half of the 8-bit coding space (GR) is the single-shift area. The following announcer sequences of ISO 2022 correspond to the EUC adaptation:
- ESC 20 43 announces an ISO-8 environment, with G0 on the left side and G1 on the right side of the 8-bit code
- ESC 20 5A announces an additional G2 invoked using SS2
- ESC 20 5B announces an additional G3 invoked using SS3.
- ESC 20 5C announces the single-shift to be GR.
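The invocation rules for G0 through G3 can be sketched as a byte scanner. This Python example is an illustration under stated assumptions: the function name `split_ibm_euc` and its width parameters are hypothetical, and the widths of the G1, G2, and G3 sets (1, 2, or 3 bytes) must be supplied by the caller because they depend on the specific EUC coded character set.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift controls from the C1 set

def split_ibm_euc(data: bytes, g1_width: int, g2_width: int = 0, g3_width: int = 0):
    """Split IBM EUC data into (set_name, code_bytes) items (sketch).

    A width of 0 means the corresponding G set is not in use.
    """
    def take(start, width, name):
        if width <= 0:
            raise ValueError(f"{name} set is not in use")
        chunk = data[start:start + width]
        # Every byte of an SBCS/DBCS/TBCS-EUC code is in X'A0'..X'FF'.
        if len(chunk) != width or any(not 0xA0 <= b <= 0xFF for b in chunk):
            raise ValueError(f"invalid {name} code at offset {start}")
        return chunk

    items = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SS2:                       # next code comes from G2
            items.append(("G2", take(i + 1, g2_width, "G2")))
            i += 1 + g2_width
        elif b == SS3:                     # next code comes from G3
            items.append(("G3", take(i + 1, g3_width, "G3")))
            i += 1 + g3_width
        elif b >= 0xA0:                    # right half: G1
            items.append(("G1", take(i, g1_width, "G1")))
            i += g1_width
        elif 0x21 <= b <= 0x7E:            # left half: G0 (ISO 646 version)
            items.append(("G0", data[i:i + 1]))
            i += 1
        else:                              # SPACE, DELETE, other C0/C1 codes
            items.append(("CTL", data[i:i + 1]))
            i += 1
    return items
```

For instance, with a hypothetical configuration of a double-byte G1, a single-byte G2, and a double-byte G3, the call would be `split_ibm_euc(data, 2, 1, 2)`.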
Unicode Code Structure
Unicode is a universal character encoding scheme that provides a means of encoding all of the characters used for the written languages of the world. It has the capability to encode up to 2^16 x 17 (1,114,112) characters. Unicode has been accepted by many as the strategic direction towards multilingual computing.
The basic encoding structure of Unicode is shown in Figure 52a. Unicode is made up of 17 planes of 256 rows and 256 columns. Plane 0 is the Basic Multilingual Plane (BMP). It contains the majority of the currently encoded characters. Plane 0 includes an area reserved for Private Use Characters (PUA) and an area used for surrogate characters. Plane 1 is the Supplementary Multilingual Plane. Its purpose is to encode characters from archaic or obsolete writing systems. Plane 2 is the Supplementary Ideographic Plane and is used for encoding rare and unusual Han characters (Chinese, Japanese, Korean and Vietnamese unified Ideographs). Planes 3 through 13 are currently (and expected to remain) unassigned. Plane 14 is reserved for special purpose characters and is thus called the Supplementary Special-Purpose Plane. The final two planes, 15 and 16, are Private Use Planes to be used as an extension of the private use area found in the BMP. Encoding scheme X'7209' has been defined to represent an individual plane within the Unicode structure. This encoding scheme is used when referencing a plane. Additional information about the Unicode encoding structure can be found on the Unicode home page.
Figure 52a. Unicode Basic Code Structure
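The plane, row, and column of a code point follow directly from its numeric value. The sketch below is illustrative: the function name `locate` is hypothetical, and the plane labels are taken from the description above.

```python
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-Purpose Plane",
    15: "Private Use Plane",
    16: "Private Use Plane",
}

def locate(code_point: int):
    """Return (plane, row, column, plane_name) for a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the 17-plane Unicode code space")
    plane = code_point >> 16           # 17 planes: 0..16
    row = (code_point >> 8) & 0xFF     # 256 rows per plane
    column = code_point & 0xFF         # 256 columns per row
    return plane, row, column, PLANE_NAMES.get(plane, "unassigned")

# 17 planes of 256 x 256 positions each.
CODE_SPACE_SIZE = 17 * 256 * 256
```

`CODE_SPACE_SIZE` evaluates to 1,114,112, which is 17 times 2 to the power 16.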
Unicode Encoding Formats
Unicode is unique and different from most character encodings in that there are several formats defined for the encoding (see Table 2.3 in Unicode V4.0). While the encoding space is well structured and clearly defined, the Unicode Standard allows a number of different encoding formats. Characters may be encoded in one, two or four byte formats. Each of these formats is briefly described below. For more detailed information refer to the Unicode Standard V4.0 documentation or the Unicode home page. The Unicode encoding structure can easily be defined using the standard CDRA Encoding Scheme Identifiers (ESIDs). A number of ESIDs have been defined in order to accurately define the various encoding formats. In order to accurately interpret Unicode encoded data it is essential that the encoding be known and clearly defined.
UTF-8 (ES = X'7807')
Unicode Transformation Format 8 (UTF-8) is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the standard 7-bit ASCII set have the same values as ASCII, and that Unicode characters transformed into UTF-8 can often be used with existing software without extensive software rewrites. The main disadvantage of this encoding form is the overhead required to perform the transformation from one of the other encoding formats into UTF-8. UTF-8 is commonly used for file storage and as the default by the Internet Engineering Task Force (IETF) and the World Wide Web Consortium (W3C) protocols. The CDRA defined encoding scheme identifier for UTF-8 is 7807.
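Both properties, ASCII compatibility and variable length, can be observed with Python's built-in `utf-8` codec. The snippet below is a small illustration; the helper name `utf8_length` and the sample characters are chosen arbitrarily.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# ASCII text is byte-for-byte identical in UTF-8.
ascii_unchanged = "Hello".encode("utf-8") == "Hello".encode("ascii")

# One to four bytes depending on the code point:
# U+0041, U+00E9, U+20AC, U+1F600
lengths = [utf8_length(c) for c in ("A", "\u00e9", "\u20ac", "\U0001F600")]
```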
UTF-16 (ES = X'7200', X'720B', X'720F', X'8200')
Unicode Transformation Format 16 (UTF-16) is a reasonably compact encoding in which all the heavily used characters fit into a single 16-bit code unit. All other characters are available via pairs of 16-bit code units (surrogates). UTF-16 is the most commonly used encoding form for internal processing. When UTF-16 is serialized as bytes, the order of the bytes of each code unit can be either most-significant-byte-first (big-endian, BE order) or least-significant-byte-first (little-endian, LE order). CDRA defines four ESIDs for UTF-16. The first is 7200, which indicates UTF-16 with BE order. The second is 720B, which indicates UTF-16 LE. The third is 720F, which indicates UTF-16 where the endian order is determined by a byte order mark (BOM). If present, a byte order mark will be found as the first two bytes of a data string. The value of the BOM indicating BE order is X'FEFF' and indicating LE order is X'FFFE'. If no BOM is found, the data is assumed to be big endian. The final encoding scheme defined for UTF-16 is 8200. This encoding scheme is called 'Unicode Presentation'. It is defined to be BE order in the absence of a BOM and is used exclusively by IBM printing systems. 8200 is a derivation of Unicode. It defines the C0 and C1 space of Unicode to be used for graphic characters.
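The byte-order rules for the first three ESIDs can be sketched as follows. The function name `decode_utf16` is hypothetical, and ES X'8200' is deliberately left out because its handling of the C0 and C1 space is outside the scope of this sketch. The explicit BOM test is used so that the big-endian default for 720F does not depend on a codec's own default.

```python
def decode_utf16(data: bytes, esid: int) -> str:
    """Decode UTF-16 bytes according to a CDRA ESID (sketch)."""
    if esid == 0x7200:                       # UTF-16 BE
        return data.decode("utf-16-be")
    if esid == 0x720B:                       # UTF-16 LE
        return data.decode("utf-16-le")
    if esid == 0x720F:                       # order given by BOM, if any
        if data[:2] == b"\xfe\xff":
            return data[2:].decode("utf-16-be")
        if data[:2] == b"\xff\xfe":
            return data[2:].decode("utf-16-le")
        return data.decode("utf-16-be")      # no BOM: assume big endian
    raise ValueError(f"ESID {esid:04X} is not handled by this sketch")
```

A character outside the BMP occupies two code units: U+1F600, for example, is serialized in BE order as the surrogate pair X'D83D' X'DE00'.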
UTF-32 (ES = X'7500', X'750B', X'750F')
Unicode Transformation Format 32 (UTF-32) provides fixed width, single code unit access to all of the characters. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32. As is the case with UTF-16, UTF-32 can also be byte-serialized in either big-endian (BE) or little-endian (LE) order. CDRA defines three encoding schemes for the UTF-32 format. 7500 is defined for UTF-32 BE. Encoding scheme 750B explicitly defines the data to be UTF-32 LE. The third ESID is 750F, which indicates UTF-32 where the endian order is determined by a byte order mark (BOM). The byte order mark for UTF-32 is X'0000FEFF' for indicating BE order and X'FFFE0000' for indicating LE order. If no BOM is found, the data is assumed to be in BE order.
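Because every character occupies exactly one 32-bit code unit, decoding reduces to reading four bytes at a time in the right order. The following sketch uses a hypothetical function `decode_utf32` and reads the code units by hand to make the fixed width visible.

```python
def decode_utf32(data: bytes, esid: int) -> str:
    """Decode UTF-32 bytes according to a CDRA ESID (sketch)."""
    if esid == 0x7500:                           # UTF-32 BE
        order, start = "big", 0
    elif esid == 0x750B:                         # UTF-32 LE
        order, start = "little", 0
    elif esid == 0x750F:                         # order given by BOM, if any
        if data[:4] == b"\x00\x00\xfe\xff":
            order, start = "big", 4
        elif data[:4] == b"\xff\xfe\x00\x00":
            order, start = "little", 4
        else:
            order, start = "big", 0              # no BOM: assume BE order
    else:
        raise ValueError(f"ESID {esid:04X} is not a UTF-32 scheme")
    if (len(data) - start) % 4:
        raise ValueError("length is not a multiple of four bytes")
    # One character per 32-bit code unit.
    return "".join(
        chr(int.from_bytes(data[i:i + 4], order))
        for i in range(start, len(data), 4)
    )
```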
UTF-EBCDIC (ES = X'1808')
Unicode Transformation Format EBCDIC (UTF-EBCDIC) provides an EBCDIC friendly way of encoding Unicode. UTF-EBCDIC is defined in Unicode Technical Report 16. UTF-EBCDIC defines a means of transforming Unicode characters to a form that is safe for EBCDIC systems for the control characters and invariant characters. CDRA defines encoding scheme 1808 for UTF-EBCDIC. UTF-EBCDIC is intended to be used inside EBCDIC systems or in closed networks where there is a dependency on EBCDIC hard-coding assumptions. UTF-EBCDIC is unsuitable for use over the Internet or for data interchange.
Standard Compression Scheme for Unicode (SCSU) (ES = X'7B0C')
The Unicode Standard defines a compression scheme for storing and transmitting Unicode data. The details of this encoding form can be found in Unicode Technical Standard 6. The CDRA defined ESID for Unicode SCSU is 7B0C.
Binary Ordered Compression for Unicode (BOCU-1) (ES = X'7B0E')
The Unicode Standard defines this MIME compatible compression for Unicode. The details of this encoding form can be found in Unicode Technical Note #6. The CDRA defined ESID for Unicode BOCU-1 is 7B0E.
Compatibility Encoding Scheme for UTF-16: 8-Bit (ES = X'780D')
Unicode Technical Report 26 specifies an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU) that is intended for internal use within systems processing Unicode in order to provide an ASCII-compatible 8-bit encoding that is similar to UTF-8 but preserves UTF-16 binary collation. It is not intended nor recommended as an encoding used for open information exchange. The CDRA defined ESID for Unicode CESU-8 is 780D.
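The difference from UTF-8 shows up only for characters outside the BMP: CESU-8 encodes each of the two UTF-16 surrogate code units separately, giving six bytes instead of four. The sketch below is a minimal illustration; the function name `cesu8_encode` is hypothetical, and it relies on Python's `surrogatepass` error handler to serialize the individual surrogates.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (sketch following the description in UTR 26)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")            # same as UTF-8 in the BMP
        else:
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)            # leading surrogate
            low = 0xDC00 | (v & 0x3FF)           # trailing surrogate
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)
```

With this encoding U+10400 sorts before U+FFFF in byte order, as it does in UTF-16 code unit order, whereas in UTF-8 it sorts after.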
ISO Universal Coded Character Set
ISO/IEC 10646-1, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane, defines a large character set containing most of the graphic characters used in languages and scripts throughout the world today.
Each code point is represented in a four-octet canonical form. A two-octet subset (UCS-2) contains almost all the currently used graphic characters in the world's scripts. This is called the Basic Multilingual Plane (BMP). The four octets are called the Group Octet (most significant), the Plane Octet (next significant), the Row Octet (the next significant), and the Cell Octet (least significant), following the structure of the code.
Figure 53. UCS-2 Structure (ES = X'7200')
Figure 53 shows the allocation of Zones for graphic characters within the Basic Multilingual Plane. Two areas, C0 and C1, are reserved for the C0 and C1 control codes defined in ISO 6429. Further information may be obtained from the previously mentioned publication.
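The four-octet canonical form and its two-octet BMP subset can be sketched as follows. The function names `ucs_octets` and `to_ucs2` are hypothetical, and the big-endian octet order of the UCS-2 result is an assumption of this sketch.

```python
def ucs_octets(code_point: int) -> dict:
    """Split a code point into Group, Plane, Row and Cell octets."""
    if not 0 <= code_point <= 0xFFFFFFFF:
        raise ValueError("does not fit in four octets")
    # Most significant octet first: Group, Plane, Row, Cell.
    group, plane, row, cell = code_point.to_bytes(4, "big")
    return {"group": group, "plane": plane, "row": row, "cell": cell}

def to_ucs2(code_point: int) -> bytes:
    """Two-octet form (Row, Cell), valid only inside the BMP."""
    octets = ucs_octets(code_point)
    if octets["group"] != 0 or octets["plane"] != 0:
        raise ValueError("code point is outside the Basic Multilingual Plane")
    return bytes([octets["row"], octets["cell"]])
```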
Chinese Standard - GB18030
GB18030 is a Chinese Standard which was defined as a superset of previously defined standards including GB 2312-80. It was defined in order to give customers the capability of using and processing a greater number of Chinese characters, which are necessary for many applications used in organizations such as banks, insurance companies and the postal service. It currently contains all of the characters defined in Unicode 3.0, including more than 27,000 Chinese characters. This standard provides solutions for the urgent needs of Chinese characters used in names and addresses.
GB 18030 uses a combination of one-byte, two-byte and four-byte codes and has a capacity of over 1.5 million code positions. The determination of character width (one, two or four-byte) is handled implicitly through the use of code point ranges as shown in Figure 53a.
Figure 53a. GB18030 Structure (ES = X'2A00')
| Number of Bytes | Valid Byte Ranges | Number of Codes |
|---|---|---|
| One-byte | X'00'-X'80' | 129 codes |
| Two-byte | First byte: X'81'-X'FE'. Second byte: X'40'-X'7E', X'80'-X'FE' | 23,940 codes |
| Four-byte | First byte: X'81'-X'FE'. Second byte: X'30'-X'39'. Third byte: X'81'-X'FE'. Fourth byte: X'30'-X'39' | 1,587,600 codes |
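The implicit width determination can be sketched as a scan over the byte ranges in Figure 53a. The function name `gb18030_char_lengths` is hypothetical; the sketch checks structure only and does not verify that a code position is assigned.

```python
def gb18030_char_lengths(data: bytes) -> list:
    """Return the byte length (1, 2 or 4) of each GB18030 code in data."""
    lengths = []
    i = 0
    while i < len(data):
        b1 = data[i]
        if b1 <= 0x80:                                   # one-byte code
            n = 1
        elif b1 <= 0xFE and i + 1 < len(data):
            b2 = data[i + 1]
            if 0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE:
                n = 2                                    # two-byte code
            elif 0x30 <= b2 <= 0x39:
                tail = data[i + 2:i + 4]                 # third and fourth bytes
                if (len(tail) != 2 or not 0x81 <= tail[0] <= 0xFE
                        or not 0x30 <= tail[1] <= 0x39):
                    raise ValueError(f"invalid four-byte code at offset {i}")
                n = 4
            else:
                raise ValueError(f"invalid second byte at offset {i + 1}")
        else:
            raise ValueError(f"invalid or truncated code at offset {i}")
        lengths.append(n)
        i += n
    return lengths

# Code counts implied by the byte ranges in the table.
ONE_BYTE = 0x80 - 0x00 + 1                               # 129
TWO_BYTE = 126 * (63 + 127)                              # 23,940
FOUR_BYTE = 126 * 10 * 126 * 10                          # 1,587,600
```

The three counts sum to 1,611,669, consistent with the stated capacity of over 1.5 million code positions.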
Lotus Multi-byte Character Set (LMBCS) (ES = X'9300')
LMBCS encoding is used exclusively by Lotus. It is defined as a multi-byte encoding made up of one, two and three byte values. The first byte is the Group Byte. The Group Byte is a value between X'00' and X'1F' with meaning as described in Figure 53b. Following the group byte will be either one or two bytes identifying the character. For optimization purposes, the group byte is omitted in Notes for single-byte values between X'20' and X'FF'. For example, LMBCS is always optimized to group 0x01, which means that any character where the first byte is greater than 0x1F, has an implicit group byte of 0x01.
Figure 53b. LMBCS Structure (ES = X'9300')
| Group byte | Character size (bytes) | Description |
|---|---|---|
| 0x00 | | Reserved for future use |
| 0x01 | 2 | Byte 2 = Codepage 850, i.e. Multilingual DOS |
| 0x02 | 2 | Byte 2 = CP 851 (Greek DOS) |
| 0x03 | 2 | Byte 2 = CP 1255 (Hebrew Windows) |
| 0x04 | 2 | Byte 2 = CP 1256 (Arabic Windows) |
| 0x05 | 2 | Byte 2 = CP 1251 (Cyrillic Windows) |
| 0x06 | 2 | Byte 2 = CP 852 (Latin-2 DOS) |
| 0x07 | 1 | BEL |
| 0x08 | 2 | Byte 2 = CP 1254 (Turkish Windows) |
| 0x09 | 1 | TAB |
| 0x0A | 1 | NL |
| 0x0B | | Reserved for future use |
| 0x0C | | Reserved for future use |
| 0x0D | 1 | CR |
| 0x0E | | Reserved for future use |
| 0x0F | | Reserved for future use |
| 0x10 | 3 | Bytes 2 & 3 = CP 932 |
| 0x11 | 3 | Bytes 2 & 3 = CP 949 |
| 0x12 | 3 | Bytes 2 & 3 = CP 950 |
| 0x13 | 3 | Bytes 2 & 3 = CP 936 |
| 0x14 | 3 | Bytes 2 & 3 = UTF-16 bytes |
| 0x15 - 0x1F | | Reserved for future use |
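The group-byte table can be applied as a lookup when splitting LMBCS data into characters. The sketch below is an illustration under stated assumptions: the function name `split_lmbcs` is hypothetical, the character size is taken to include the group byte itself, and any byte greater than 0x1F at a character boundary is treated as a one-byte character with the implicit group 0x01, as described for the optimized form.

```python
# Total size in bytes (group byte included) for each defined group byte.
GROUP_SIZE = {
    0x01: 2, 0x02: 2, 0x03: 2, 0x04: 2, 0x05: 2, 0x06: 2, 0x08: 2,
    0x07: 1, 0x09: 1, 0x0A: 1, 0x0D: 1,            # BEL, TAB, NL, CR
    0x10: 3, 0x11: 3, 0x12: 3, 0x13: 3, 0x14: 3,   # double-byte payloads
}

def split_lmbcs(data: bytes) -> list:
    """Split LMBCS data into per-character byte strings (sketch)."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b > 0x1F:
            size = 1                    # group byte omitted: implicit 0x01
        else:
            size = GROUP_SIZE.get(b)
            if size is None:
                raise ValueError(f"reserved group byte 0x{b:02X} at offset {i}")
        chunk = data[i:i + size]
        if len(chunk) != size:
            raise ValueError(f"truncated character at offset {i}")
        chars.append(chunk)
        i += size
    return chars
```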