Character Data Representation Architecture


Appendix A. Encoding Schemes

This appendix describes the encoding structures defined by a number of encoding schemes. The descriptions summarize definitions taken from the relevant standards or system documentation.

EBCDIC Single-byte Structures

IBM Extended Binary Coded Decimal Interchange Code (EBCDIC) is based on an 8-bit-per-byte structure. The basic EBCDIC structure is shown in Figure 39.

Figure 39. Basic Structure of EBCDIC Code (ES = X'1yzz')

EBCDIC structure implies the following:


The EBCDIC Presentation Structure

IBM Extended Binary Coded Decimal Interchange Code (EBCDIC) for presentation is based on an 8-bit-per-byte structure. The basic EBCDIC presentation structure is shown in Figure 40.

Figure 40. Basic Structure of EBCDIC Presentation Code (ES = X'6100')

Some products have modified the EBCDIC structure for presentation purposes. The following describes the semantics of this encoding structure.

All the code points in the range X'00' to X'FF' are assignable to graphic characters with the following considerations:

IBM PC Single-byte Structures

The IBM-PC structure extends the ISO 646 (ANSI version) 7-bit code structure to an 8-bit structure. Unlike the EBCDIC and ISO structures, it is ill-defined, especially in how it distinguishes control character codes from graphic character codes in a context-independent manner.

The valid hexadecimal codes are in the range X'00' to X'FF'. When the codes are used to represent graphic characters on displays, all the code points are allocated for graphic characters. The range X'00' to X'1F' is reserved for control characters, following the ISO 646 scheme, except for the code points X'14' and X'15', which are used for graphic characters in some PC codes. Two basic structures, called IBM-PC Data Code and IBM-PC Display Code, are described below.

More than one byte per code point can also be used with the IBM-PC structures (see "IBM-PC Double-Byte Code Structure").


IBM-PC Data Code Structure

The IBM-PC Data code is shown in Figure 41.

Figure 41. IBM-PC Data Code Structure (ES = X'2yzz')

It has the following characteristics:


IBM-PC Display Code Structure

The IBM-PC Display Code, shown in Figure 42, has the following characteristics:

Figure 42. IBM-PC Display Code Structure (ES = X'3yzz')

ISO Single-byte Structures

The international standard ISO 2022, Information Processing - ISO 7-bit and 8-bit Coded Character Sets - Code Extension Techniques specifies the general structures and code extension schemes in the ISO 2022 environments. Other ISO standards, such as ISO 646, ISO 4873, ISO 6429, ISO 6937, and ISO 8859, define further specific use of subsets of the environments prescribed by ISO 2022. CCITT recommendations on Telematics, such as T.61 and T.100, also use ISO 2022 techniques. (American Standard Code for Information Interchange, ASCII, is the US national version of ISO 646 code; it is defined in the ANSI X3.4 standard.)

There are other encoding schemes outside ISO 2022, such as in the International Telegraphic Alphabet Number 2 (ITA2), a 5-bit code with an Alpha-shift and a Numeric-shift, which is used in international Telex services. Picture coding is another example. ISO 2022 has defined a scheme to switch to such non-ISO 2022 codes.


ISO 7-bit Structure

The ISO 7-bit structure (see Figure 43) is characterized by:

Figure 43. ISO 7-Bit Code Structure (ES = X'5yzz')


ISO 8-bit Structure

The ISO 8-bit structure is shown in Figure 44.

It has the following characteristics:

Figure 44. ISO 8-Bit Code Structure (ES = X'4yzz')

The range of code positions X'20' to X'7F' is often referred to as the Left Half (GL), and the range X'A0' to X'FF' as the Right Half (GR), of an ISO-8 code.

Figure 45 shows the invariance of the syntactic character set found in the basic single-byte (SBCS) encoding structures.

Figure 45. Invariance of the Syntactic Character Set in Basic SBCS Encoding Structures

CAUTION: There are some coded character sets in use in which the invariant property is not guaranteed. Among the ISO-7 and derived codes, one more character, the exclamation mark (SP020000), is allocated the invariant code point X'21'. It is not included in this table, since it is not in the syntactic character set (CS 640).

Character GCGID PC, ISO-7, ISO-8 EBCDIC
" (double quote) SP040000 22 7F
% (percent) SM020000 25 6C
& (ampersand) SM030000 26 50
' (apostrophe) SP050000 27 7D
( (left parenthesis) SP060000 28 4D
) (right parenthesis) SP070000 29 5D
* (asterisk) SM040000 2A 5C
+ (plus) SA010000 2B 4E
, (comma) SP080000 2C 6B
- (hyphen) SP100000 2D 60
. (period) SP110000 2E 4B
/ (slash) SP120000 2F 61
0 ND100000 30 F0
1 ND010000 31 F1
2 ND020000 32 F2
3 ND030000 33 F3
4 ND040000 34 F4
5 ND050000 35 F5
6 ND060000 36 F6
7 ND070000 37 F7
8 ND080000 38 F8
9 ND090000 39 F9
: (colon) SP130000 3A 7A
; (semi-colon) SP140000 3B 5E
< (less than) SA030000 3C 4C
= (equal) SA040000 3D 7E
> (greater than) SA050000 3E 6E
? (question mark) SP150000 3F 6F
A LA020000 41 C1
B LB020000 42 C2
C LC020000 43 C3
D LD020000 44 C4
E LE020000 45 C5
F LF020000 46 C6
G LG020000 47 C7
H LH020000 48 C8
I LI020000 49 C9
J LJ020000 4A D1
K LK020000 4B D2
L LL020000 4C D3
M LM020000 4D D4
N LN020000 4E D5
O LO020000 4F D6
P LP020000 50 D7
Q LQ020000 51 D8
R LR020000 52 D9
S LS020000 53 E2
T LT020000 54 E3
U LU020000 55 E4
V LV020000 56 E5
W LW020000 57 E6
X LX020000 58 E7
Y LY020000 59 E8
Z LZ020000 5A E9
_ (underscore) SP090000 5F 6D
a LA010000 61 81
b LB010000 62 82
c LC010000 63 83
d LD010000 64 84
e LE010000 65 85
f LF010000 66 86
g LG010000 67 87
h LH010000 68 88
i LI010000 69 89
j LJ010000 6A 91
k LK010000 6B 92
l LL010000 6C 93
m LM010000 6D 94
n LN010000 6E 95
o LO010000 6F 96
p LP010000 70 97
q LQ010000 71 98
r LR010000 72 99
s LS010000 73 A2
t LT010000 74 A3
u LU010000 75 A4
v LV010000 76 A5
w LW010000 77 A6
x LX010000 78 A7
y LY010000 79 A8
z LZ010000 7A A9
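
The invariance shown in Figure 45 can be spot-checked with Python's standard codecs. This is only an illustrative sketch: it assumes that Python's `ascii`, `cp037`, and `cp500` codecs are reasonable stand-ins for an ISO-7 code and two EBCDIC code pages. That choice is an assumption made for this example and is not something CDRA prescribes.

```python
# Syntactic character set from Figure 45 (CS 640), written out explicitly.
SYNTACTIC = (
    "\"%&'()*+,-./0123456789:;<=>?"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ_"
    "abcdefghijklmnopqrstuvwxyz"
)


def code_points(codec):
    """Map each syntactic character to its single-byte code point in `codec`."""
    return {ch: ch.encode(codec)[0] for ch in SYNTACTIC}


iso7 = code_points("ascii")       # stand-in for the PC, ISO-7, ISO-8 column
ebcdic_037 = code_points("cp037")  # stand-in for the EBCDIC column
ebcdic_500 = code_points("cp500")  # a second EBCDIC code page for comparison

# True when both EBCDIC code pages agree on every syntactic character.
invariant_in_ebcdic = ebcdic_037 == ebcdic_500

for ch in '"%Aa9':
    print(f"{ch!r}: ISO-7 X'{iso7[ch]:02X}'  EBCDIC X'{ebcdic_037[ch]:02X}'")
```

Code pages outside this small sample may not preserve the property, as the CAUTION accompanying the table notes.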

EBCDIC Double and Mixed-byte Structures

The structure of IBM double-byte coded character sets is specified in IBM standards.

The double-byte EBCDIC code is called DBCS-HOST code. The basic EBCDIC structure has allocated coding space for control characters and graphic characters separately. The following describes the graphic character range of hexadecimal codes in the DBCS-HOST structure. Figure 46 illustrates the DBCS-HOST graphic character coding space. There are no 16-bit codes for control characters in the EBCDIC structure definition. A DBCS-HOST graphic character code has the following characteristics:

Figure 46. EBCDIC DBCS Graphic Character Coding Space (ES = X'12zz' or X'13zz')

A DBCS Ward

A section of a DBCS where the first bytes of all the code points belonging to it are the same is called a ward. A set of wards can be registered with a unique character set identifier, GCSGID, and the associated CPGID of the DBCS. This unique identifier, CGCSGID, defines the valid graphic character code points belonging to that set of wards.

EBCDIC Single/Double-Byte Mixed Encoding Structure

The coding space for EBCDIC Single/Double-byte mixed graphic characters is shown in Figure 47.

The encoding scheme is a hybrid of the two EBCDIC schemes: EBCDIC SBCS, described earlier in Figure 39, and EBCDIC DBCS, described in Figure 46. This encoding scheme is stateful and uses a code extension technique to change between SBCS mode and DBCS mode. The control codes used to identify this change of state are X'0E' (shift out of SBCS mode) and X'0F' (shift into SBCS mode). The default starting state for a string encoded using this encoding scheme is single-byte. For a mixed string to begin in DBCS mode, the first double-byte character must be preceded by an X'0E' in order to 'shift out' of SBCS mode. A well-formed mixed host string must have matching shift-out, shift-in (SO, SI) pairs. All well-formed mixed host strings end in single-byte mode. In either mode, the behavior of this encoding is as prescribed by the respective encoding scheme. All the semantics of the two individual encoding schemes apply in this case as well.

The following are examples of well-formed mixed EBCDIC strings. In these examples, SO represents a shift-out control, SI represents a shift-in control, s represents a single-byte character, and dd represents a double-byte character.

ssssSOddddddddSIsssss - The string begins in single-byte mode, shifts to double-byte mode for 4 characters, and then returns to single-byte mode.

SOddddddddSIsssss - The string begins with double-byte characters, so the first byte of the string must be the shift-out. After the double-byte characters, a shift-in changes to the single-byte state for the last 5 characters in the string.

ssssSOddddddddSI - The string begins in single-byte mode and shifts to double-byte mode. Even though the data ends with double-byte characters, the shift-in control is required to create a well-formed string.

SOSIssssSOddddddddSIsssss - The SOSI pair at the beginning of the string is treated as a no-op. This is true for a SOSI pair found anywhere in a mixed EBCDIC string.
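
The well-formedness rules above can be sketched as a small scanner. This is a minimal illustration and not a CDRA-defined algorithm: the function name `split_mixed_ebcdic` and the `("sbcs" | "dbcs", bytes)` segment representation are invented for this example. In DBCS mode the sketch consumes bytes in pairs and only looks for SI at a pair boundary.

```python
SO = 0x0E  # shift out of SBCS mode (enter DBCS mode)
SI = 0x0F  # shift into SBCS mode (leave DBCS mode)


def split_mixed_ebcdic(data):
    """Split a mixed string into (mode, bytes) runs, or raise ValueError."""
    segments = []
    run = bytearray()
    in_dbcs = False  # the default starting state is single-byte
    i = 0

    def flush(mode):
        # An empty run (for example an SO immediately followed by SI) is a no-op.
        if run:
            segments.append((mode, bytes(run)))
            run.clear()

    while i < len(data):
        b = data[i]
        if not in_dbcs:
            if b == SO:
                flush("sbcs")
                in_dbcs = True
            elif b == SI:
                raise ValueError("SI without a matching SO")
            else:
                run.append(b)
            i += 1
        elif b == SI:
            flush("dbcs")
            in_dbcs = False
            i += 1
        elif b == SO:
            raise ValueError("SO while already in DBCS mode")
        else:
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character")
            run += data[i:i + 2]
            i += 2

    if in_dbcs:
        raise ValueError("string ends in DBCS mode (missing SI)")
    flush("sbcs")
    return segments
```

For instance, `split_mixed_ebcdic(b"\x0e\x0f\x81")` yields a single SBCS run, reflecting the no-op treatment of an empty SO/SI pair.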

Figure 47. EBCDIC Mixed Single/Double-Byte Code Structure (ES = X'1301')

IBM PC Double and Mixed-byte Structures

The coding space for DBCS-PC graphic characters is shown in Figure 48.
The DBCS-PC graphic character code has the following characteristics:
 


Note: It is not advised to rely on the specific values above X'40' (second byte value) to denote the presence or absence of DBCS characters. These values will be encoding scheme specific and can change over time.

Figure 48. IBM-PC DBCS Graphic Character Coding Space (ES = X'22zz', X'32zz')

Note: In practice, the graphic characters of DBCS-PC are used with a single-byte PC coded character set. The specific values in the allocated range to be used as the first byte of a double-byte code point are detailed when the coded character set is registered. Other values from this range may be defined to be used as single-byte code points, and when so defined are not available for use as the first byte of a double-byte code point. Similarly, when a code point is declared to be the first byte of a double-byte code point, it cannot be used as a single-byte code point.

The control characters are all single-byte codes, as defined earlier for the IBM-PC Display and IBM-PC Data code structure. The definition of a ward given above also applies to DBCS-PC.

IBM-PC Mixed Single- and Double-Byte Structure

In the PC-Mixed scheme, both single-byte and double-byte code points may exist in the same data stream, without any explicit demarcation points between them.

Each specific use of a PC-Mixed scheme (ES=X'23zz' or X'33zz') must have an associated declaration of the specific single-byte code points to be used as the first byte of the double-byte code point. This set of code points is equivalent to a set of specific single-shift control code points in ISO (for example, Single-shift-2 (X'8E') as defined in ISO 6429). Each single-shift control causes the meaning of the following single-byte code point to be taken from a specific ward. The value of the first byte, besides being a single-shift control, is equal to the ward number. Figure 49 illustrates this definition.


Figure 49. IBM-PC Mixed Single/Double-Byte Graphic Character Coding Space (ES = X'2300', or X'3300')

Note: Application developers are cautioned not to rely on the absolute code point range values, as they may change in the future. The begin and end values may be CCSID dependent.

The double-byte codes starting with a valid first byte follow the definition for the IBM-PC Double-byte code structure. All the bytes that are not in the valid list of first bytes will have their single-byte code points assigned per the IBM PC Single-Byte Data or Display structure definition. In comparison, in a pure PC-DBCS scheme the single-byte graphic code points of the base PC Encoding structure that are not used as the first byte of a double-byte code point cannot be assigned a graphic character.

Note: The size of the maximal character set of the double-byte code page determines the size of the double-byte coding space needed. This in turn governs the number of wards needed, and the corresponding number of code points to be reserved for use as the first byte of a double-byte code point. The character set of the single-byte code page also influences the maximum number of single-byte code points needed, by trading off with the maximum number of wards possible. The net result is that when a specific single-byte code page and a specific double-byte code page are used with the mixed encoding structure of the PC, the list of valid first bytes also gets fixed.
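
Because a PC-Mixed stream has no explicit demarcation, a reader has to consult the declared list of valid first bytes to find character boundaries. The sketch below illustrates that idea only. The function name `split_pc_mixed` is invented here, and the lead-byte set used in the usage lines is a hypothetical placeholder: the real list is fixed per CCSID, as the notes above explain.

```python
def split_pc_mixed(data, lead_bytes):
    """Split a PC-Mixed byte string into single- and double-byte code points.

    `lead_bytes` is the declared set of valid first bytes for the coded
    character set in use; it must be supplied by the caller.
    """
    chars = []
    i = 0
    while i < len(data):
        if data[i] in lead_bytes:
            # A valid first byte always introduces a double-byte code point.
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte code point")
            chars.append(data[i:i + 2])
            i += 2
        else:
            chars.append(data[i:i + 1])
            i += 1
    return chars


# Hypothetical lead-byte declaration, for illustration only.
EXAMPLE_LEAD_BYTES = frozenset(range(0x81, 0xA0))
example = split_pc_mixed(b"A\x82\xa0B", EXAMPLE_LEAD_BYTES)
```

Passing the lead-byte set as a parameter, instead of hard-coding a range, follows the caution against relying on absolute range values.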

IBM Extended Unix Code (EUC)

IBM's adaptation of Extended Unix Code (EUC) is called IBM EUC. It is also known (in IBM AIX documentation) as Multiple Byte Character Set (MBCS). The structure of IBM EUC coded character sets is specified in IBM Corporate Standard, Double-Byte Character Set (DBCS), Terminology and Coding Scheme, C-S 3-3220-102, 1992-07. The encoding scheme used in IBM EUC is shown in Figure 50.

Figure 50. Designation and Invocation of IBM-EUC (ES = X'4403')

IBM EUC is an adaptation of one of the several code extension techniques defined in ISO 2022. It uses the 8-bit coding environment. The coded graphic character sets used are a national version of ISO 646 designated as the G0 set, and at most three additional G sets (G1, G2, and G3). The graphic character sets used correspond to the national standards of the different countries in the Far East.

The 8-bit environment of ISO 2022 implicitly designates the G0 set into the left half and the G1 set into the right half of the ISO-8 encoding structure (see section "ISO 8-bit Structure"). Encoding scheme X'8100' has been defined to describe a G1 set in the right-hand side of the ISO 8-bit encoding space when it is being used as a standalone portion of an EUC encoding. The single-shift controls, Single-shift 2 (SS2) and Single-shift 3 (SS3), are used for invoking the G2 and G3 sets into the right half of the 8-bit code. IBM EUC omits all the announcer, invocation, and designation sequences of ISO 2022.

The resultant complete coded graphic character sets are often called EUC_J (for use in Japan), EUC_K (for use in Korea), EUC_T (for use with Traditional Chinese), or EUC_S (for use with Simplified Chinese).
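
The role of the first byte in selecting a G set can be illustrated with a short sketch. It assumes that Python's standard `euc_jp` codec is an acceptable stand-in for the EUC_J layout; IBM EUC code pages may differ in detail from that codec, so treat the sample bytes as illustrative. The helper name `g_set_of` is invented for this example, and the SS3 value X'8F' is taken from ISO 6429.

```python
SS2 = 0x8E  # Single-shift 2: the next character comes from G2
SS3 = 0x8F  # Single-shift 3: the next character comes from G3


def g_set_of(encoded):
    """Name the G set selected by the first byte of one encoded character."""
    first = encoded[0]
    if first < 0x80:
        return "G0"  # left half of the 8-bit code
    if first == SS2:
        return "G2"
    if first == SS3:
        return "G3"
    if first >= 0xA0:
        return "G1"  # right half of the 8-bit code
    raise ValueError("other C1 control, not a graphic character")


latin = "A".encode("euc_jp")          # ISO 646 character, G0
hiragana = "\u3042".encode("euc_jp")  # HIRAGANA LETTER A, G1
katakana = "\uff71".encode("euc_jp")  # HALFWIDTH KATAKANA LETTER A, G2 via SS2

for sample in (latin, hiragana, katakana):
    print(sample.hex(), g_set_of(sample))
```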

The EUC scheme combines up to four coded graphic character sets. The collection includes a basic character set (the G0 set of a national version of ISO 646), and one or more of the following coded graphic character sets:


The valid ranges of graphic character code points for each one of these sets when used in IBM EUC are given below:


Figure 51. IBM-EUC Double Byte Code Structure (ES = X'9200' standalone or as part of ES = X'4403')

The remaining code points in the space X'00' to X'1F', and X'7F' to X'9F', follow the rules for an ISO 8-bit code (see "ISO 8-bit Structure").

 

Figure 52. IBM-EUC Triple Byte Code Structure (ES = X'5700' standalone or as part of ES = X'4403')

Notes:
 

Unicode

Unicode is a universal character encoding scheme developed by a consortium made up of members of the worldwide IT community. The consortium is committed to maintaining synchronization between Unicode and ISO/IEC 10646, Information technology - Universal Coded Character Set (UCS). The encoding structure defined here for use in CDRA is applicable to both Unicode and ISO/IEC 10646.

Unicode provides a means of encoding all of the characters used for the written languages of the world. It has the capability to encode up to 2^16 x 17 (1,114,112) characters. Unicode has been accepted by many as the strategic direction towards multilingual computing.

The basic encoding structure of Unicode is shown in Figure 52a. Unicode is made up of 17 planes, each of 256 rows and 256 columns. Plane 0 is the Basic Multilingual Plane (BMP). It contains the majority of the currently encoded characters. Plane 0 includes an area reserved for private use characters (the Private Use Area, PUA) and an area used for surrogate characters. Plane 1 is the Supplementary Multilingual Plane. Its purpose is to encode characters from archaic or obsolete writing systems. Plane 2 is the Supplementary Ideographic Plane and is used for encoding rare and unusual Han characters (Chinese, Japanese, Korean and Vietnamese unified ideographs). Planes 3 through 13 are currently (and expected to remain) unassigned. Plane 14 is reserved for special purpose characters and is thus called the Supplementary Special-Purpose Plane. The final two planes, 15 and 16, are Private Use Planes to be used as an extension of the private use area found in the BMP. Encoding scheme X'7209' has been defined to represent an individual plane within the Unicode structure. This encoding scheme is used when referencing a plane. Additional information about the Unicode encoding structure can be found on the Unicode web site.

Figure 52a. Unicode Basic Code Structure
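
The plane, row, and column layout described for the basic code structure maps directly onto the bits of a code point value. The following sketch shows the arithmetic; the function name `locate` is invented for this example.

```python
PLANES = 17
CODE_SPACE = PLANES * 2**16  # 17 planes of 256 rows x 256 columns


def locate(code_point):
    """Return the (plane, row, column) position of a Unicode code point."""
    if not 0 <= code_point < CODE_SPACE:
        raise ValueError("outside the Unicode code space")
    plane = code_point >> 16         # 0 = BMP, 1 = SMP, 2 = SIP, ...
    row = (code_point >> 8) & 0xFF   # 256 rows per plane
    column = code_point & 0xFF       # 256 columns per row
    return plane, row, column


print(locate(0x0041))    # a BMP character
print(locate(0x1F600))   # a character in plane 1
print(locate(0x10FFFF))  # the last position of plane 16
```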

Unicode Encoding Formats

Unicode differs from most character encodings in that several formats are defined for the encoding. While the encoding space is well structured and clearly defined, the Unicode Standard allows a number of different encoding formats, which are identified and described in chapter 2 of the Unicode Standard. Characters may be encoded in one-, two-, or four-byte formats. Each of these formats is briefly described below. For more detailed information, refer to the Unicode Standard V4.0 documentation or the Unicode web site. The Unicode encoding structure can be defined using the standard CDRA Encoding Scheme Identifiers (ESIDs), and a number of ESIDs have been defined to describe the various encoding formats accurately. To interpret Unicode encoded data correctly, it is essential that the encoding be known and clearly defined.

UTF-8 (ES = X'7807')

Unicode Transformation Format 8 (UTF-8) is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the standard 7-bit ASCII set have the same byte values as in ASCII, and that Unicode characters transformed into UTF-8 can often be used with existing software without extensive software rewrites. The main disadvantage of this encoding form is the overhead required to perform the transformation from one of the other encoding formats into UTF-8. UTF-8 is commonly used for file storage and as the default by the Internet Engineering Task Force (IETF) and the World Wide Web Consortium (W3C) protocols. The CDRA defined encoding scheme identifier for UTF-8 is 7807.
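
Both properties mentioned above, ASCII compatibility and variable length, are easy to observe with Python's built-in `utf-8` codec. The sample characters are arbitrary choices for this illustration.

```python
# One sample each for the 1-, 2-, 3-, and 4-byte forms of UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")

# An ASCII character keeps its ASCII byte value in UTF-8.
ascii_preserved = "A".encode("utf-8") == "A".encode("ascii")
```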

UTF-16 (ES = X'7200', X'720B', X'720F', X'8200')

Unicode Transformation Format 16 (UTF-16) is a reasonably compact encoding in which all the heavily used characters fit into a single 16-bit code unit. All other characters are available via pairs of 16-bit code units (surrogates). UTF-16 is the most commonly used encoding form for internal processing. When UTF-16 is byte-serialized, the order of the bytes of each code unit can be either most-significant-byte-first (big-endian, BE order) or least-significant-byte-first (little-endian, LE order). CDRA defines four ESIDs for UTF-16. The first, 7200, indicates UTF-16 with BE order. The second, 720B, indicates UTF-16 LE. The third, 720F, indicates UTF-16 where the endian order is determined by a byte order mark (BOM). If present, the BOM is found in the first two bytes of a data string. The value of the BOM indicating BE order is X'FEFF' and the value indicating LE order is X'FFFE'. If no BOM is found, the data is assumed to be big-endian. The final encoding scheme defined for UTF-16 is 8200, called 'Unicode Presentation'. It is defined to be BE order in the absence of a BOM and is used exclusively by IBM printing systems. 8200 is a derivation of Unicode in which the C0 and C1 space of Unicode is used for graphic characters.
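
The BOM rule described for ES 720F can be sketched as follows, together with a look at how a surrogate pair appears in byte-serialized form. The function name `decode_utf16_bom` is invented for this example; it relies only on Python's built-in `utf-16-be` and `utf-16-le` codecs.

```python
BOM_BE = b"\xfe\xff"  # byte order mark as serialized in BE order
BOM_LE = b"\xff\xfe"  # byte order mark as serialized in LE order


def decode_utf16_bom(data):
    """Decode UTF-16 whose order is given by a BOM, defaulting to BE."""
    if data.startswith(BOM_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(BOM_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")  # no BOM: assume big-endian


# A BMP character occupies one 16-bit code unit ...
bmp = "A".encode("utf-16-be")
# ... while a supplementary character needs a surrogate pair (two units).
supplementary = "\U0001F600".encode("utf-16-be")
print(bmp.hex(), supplementary.hex())
```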

UTF-32 (ES = X'7500', X'750B', X'750F')

Unicode Transformation Format 32 (UTF-32) provides fixed-width, single code unit access to all of the characters. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32. As is the case with UTF-16, UTF-32 can be byte-serialized in either big-endian (BE) or little-endian (LE) order. CDRA defines three encoding schemes for the UTF-32 format. 7500 is defined for UTF-32 BE. Encoding scheme 750B explicitly defines the data to be UTF-32 LE. The third ESID, 750F, indicates UTF-32 where the endian order is determined by a byte order mark (BOM). The byte order mark for UTF-32 is X'0000FEFF' for indicating BE order and X'FFFE0000' for indicating LE order. If no BOM is found, the data is assumed to be in BE order.
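
One way to picture the three UTF-32 ESIDs is as a selector for the byte order used when decoding. The sketch below is an assumption-laden illustration: the mapping table `FIXED_ORDER` and the function `decode_utf32` are invented here and are not part of any CDRA interface.

```python
# ESIDs with a fixed byte order, mapped to Python's built-in codec names.
FIXED_ORDER = {0x7500: "utf-32-be", 0x750B: "utf-32-le"}
ESID_BOM = 0x750F  # byte order determined by a BOM, BE if absent


def decode_utf32(data, esid):
    """Decode UTF-32 data according to the given encoding scheme ID."""
    if esid in FIXED_ORDER:
        return data.decode(FIXED_ORDER[esid])
    if esid != ESID_BOM:
        raise ValueError("not a UTF-32 encoding scheme")
    if data[:4] == b"\x00\x00\xfe\xff":
        return data[4:].decode("utf-32-be")
    if data[:4] == b"\xff\xfe\x00\x00":
        return data[4:].decode("utf-32-le")
    return data.decode("utf-32-be")


# Fixed width: every character takes exactly four bytes.
text = "A\U0001F600"
fixed_width = len(text.encode("utf-32-be")) == 4 * len(text)
```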

UTF-EBCDIC (ES = X'1808')

Unicode Transformation Format EBCDIC (UTF-EBCDIC) provides an EBCDIC friendly way of encoding Unicode. UTF-EBCDIC is defined in Unicode Technical Report 16. UTF-EBCDIC defines a means of transforming Unicode characters to a form that is safe for EBCDIC systems for the control characters and invariant characters. CDRA defines encoding scheme 1808 for UTF-EBCDIC. UTF-EBCDIC is intended to be used inside EBCDIC systems or in closed networks where there is a dependency on EBCDIC hard-coding assumptions. UTF-EBCDIC is unsuitable for use over the Internet or for data interchange.

Standard Compression Scheme for Unicode (SCSU) (ES = X'7B0C')

The Unicode Standard defines a compression scheme for storing and transmitting Unicode data. The details of this encoding form can be found in Unicode Technical Standard 6. The CDRA defined ESID for Unicode SCSU is 7B0C.

Binary Ordered Compression for Unicode (BOCU-1) (ES = X'7B0E')

The Unicode Standard defines this MIME compatible compression for Unicode. The details of this encoding form can be found in Unicode Technical Note #6. The CDRA defined ESID for Unicode BOCU-1 is 7B0E.

Compatibility Encoding Scheme for UTF-16: 8-Bit (ES = X'780D')

Unicode Technical Report 26 specifies an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU) that is intended for internal use within systems processing Unicode in order to provide an ASCII-compatible 8-bit encoding that is similar to UTF-8 but preserves UTF-16 binary collation. It is not intended nor recommended as an encoding used for open information exchange. The CDRA defined ESID for Unicode CESU-8 is 780D.
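
The relationship between CESU-8, UTF-8, and UTF-16 can be made concrete with a short encoder sketch. Python has no built-in CESU-8 codec, so the function `cesu8_encode` below is written for this illustration: it serializes each UTF-16 code unit, including each half of a surrogate pair, with the UTF-8 bit layout.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: each UTF-16 code unit is encoded separately."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets a lone surrogate code unit use the 3-byte form.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp_char = "\uff5e"         # a BMP character near the top of plane 0
supp_char = "\U00010000"    # the first supplementary character

# In UTF-16 binary order the surrogate pair sorts before U+FF5E;
# CESU-8 keeps that order, whereas UTF-8 reverses it.
same_as_utf16 = (cesu8_encode(supp_char) < cesu8_encode(bmp_char)) == (
    supp_char.encode("utf-16-be") < bmp_char.encode("utf-16-be")
)
```

For characters in the BMP the output is identical to UTF-8; only supplementary characters differ, taking six bytes instead of four.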

Chinese Standard GB18030

GB18030 is a Chinese standard defined as a superset of previously defined standards, including GB 2312-80. It was defined to give customers the capability of using and processing a greater number of Chinese characters, which are necessary for many applications used in organizations such as banks, insurance companies, and the postal service. It currently contains all of the characters defined in Unicode 3.0, including more than 27,000 Chinese characters. This standard addresses the urgent need for Chinese characters used in names and addresses.


GB 18030 uses a combination of one-byte, two-byte and four-byte codes and has a capacity of over 1.5 million code positions. The determination of character width (one, two or four-byte) is handled implicitly through the use of code point ranges as shown in Figure 53a.

Figure 53a. GB18030 Structure (ES = X'2A00')

Number of Bytes    Valid Byte Ranges    Number of Codes
One-byte    X'00'-X'80'    129 codes
Two-byte    First byte X'81'-X'FE'; second byte X'40'-X'7E' or X'80'-X'FE'    23,940 codes
Four-byte    First byte X'81'-X'FE'; second byte X'30'-X'39'; third byte X'81'-X'FE'; fourth byte X'30'-X'39'    1,587,600 codes
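
The implicit width determination can be sketched directly from the byte ranges in Figure 53a. The function name `split_gb18030` is invented for this example. It only finds character boundaries and does not check whether a code position is actually assigned; the comparison against Python's built-in `gb18030` codec is included purely as a sanity check.

```python
def split_gb18030(data):
    """Split GB18030 bytes into one-, two-, and four-byte codes."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 0x80:
            width = 1
        elif first <= 0xFE:
            if i + 1 >= len(data):
                raise ValueError("truncated multi-byte code")
            second = data[i + 1]
            if 0x30 <= second <= 0x39:
                width = 4  # second byte in the digit range selects four bytes
            elif 0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE:
                width = 2
            else:
                raise ValueError("invalid second byte")
        else:
            raise ValueError("X'FF' is not a valid first byte")
        if i + width > len(data):
            raise ValueError("truncated multi-byte code")
        chars.append(data[i:i + width])
        i += width
    return chars


# The code counts in the table follow from the sizes of the byte ranges.
two_byte_codes = 126 * (63 + 127)
four_byte_codes = 126 * 10 * 126 * 10

encoded = "A\u4e2d\U0001F600".encode("gb18030")
widths = [len(c) for c in split_gb18030(encoded)]
```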

Lotus Multi-byte Character Set (LMBCS)

LMBCS encoding is used exclusively by Lotus. It is defined as a multi-byte encoding made up of one-, two-, and three-byte values. The first byte is the Group Byte, a value between X'00' and X'1F' with the meaning described in Figure 53b. Following the group byte are either one or two bytes identifying the character. For optimization purposes, the group byte is omitted in Notes for single-byte values between X'20' and X'FF'. For example, LMBCS is always optimized to group 0x01, which means that any character whose first byte is greater than 0x1F has an implicit group byte of 0x01.

Figure 53b. LMBCS Structure (ES = X'9300')

Group byte Character size (bytes) Description
0x00    -    Reserved for future use
0x01    2    Byte 2 = CP 850 (Multilingual DOS)
0x02    2    Byte 2 = CP 851 (Greek DOS)
0x03    2    Byte 2 = CP 1255 (Hebrew Windows)
0x04    2    Byte 2 = CP 1256 (Arabic Windows)
0x05    2    Byte 2 = CP 1251 (Cyrillic Windows)
0x06    2    Byte 2 = CP 852 (Latin-2 DOS)
0x07    1    BEL
0x08    2    Byte 2 = CP 1254 (Turkish Windows)
0x09    1    TAB
0x0A    1    NL
0x0B    -    Reserved for future use
0x0C    -    Reserved for future use
0x0D    1    CR
0x0E    -    Reserved for future use
0x0F    -    Reserved for future use
0x10    3    Bytes 2 & 3 = CP 932
0x11    3    Bytes 2 & 3 = CP 949
0x12    3    Bytes 2 & 3 = CP 950
0x13    3    Bytes 2 & 3 = CP 936
0x14    3    Bytes 2 & 3 = UTF-16 bytes
0x15 - 0x1F    -    Reserved for future use
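
The group-byte rules in Figure 53b can be sketched as a boundary scanner. This is an illustration built only from the table and the optimization rule described above, not Lotus's implementation: the function name `split_lmbcs` and the `(group, payload)` tuples are invented here, and one-byte controls are reported with a group of `None`.

```python
TWO_BYTE_GROUPS = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08}
THREE_BYTE_GROUPS = {0x10, 0x11, 0x12, 0x13, 0x14}
CONTROL_BYTES = {0x07, 0x09, 0x0A, 0x0D}  # BEL, TAB, NL, CR
IMPLICIT_GROUP = 0x01  # applies when the group byte is omitted


def split_lmbcs(data):
    """Split LMBCS bytes into (group, payload) tuples."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b >= 0x20:
            group, header, payload = IMPLICIT_GROUP, 0, 1
        elif b in CONTROL_BYTES:
            group, header, payload = None, 0, 1
        elif b in TWO_BYTE_GROUPS:
            group, header, payload = b, 1, 1
        elif b in THREE_BYTE_GROUPS:
            group, header, payload = b, 1, 2
        else:
            raise ValueError("reserved group byte 0x%02X" % b)
        end = i + header + payload
        if end > len(data):
            raise ValueError("truncated LMBCS character")
        chars.append((group, data[i + header:end]))
        i = end
    return chars


example = split_lmbcs(b"A\x05\xe0\x09\x10\x88\x9f")
```

Turning each payload into text would additionally require decoding it with the code page named in the table for its group.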