The long-form identification consists of an Encoding Scheme Identifier, one
or more Coded Graphic Character Set Global Identifiers (each consisting of a
Graphic Character Set Global Identifier and a Code Page Global Identifier),
and any Additional Coding-related Required Information that is required to complete
the specification of the representation.
Encoding Scheme Identifier
The Encoding Scheme Identifier, ESID, is a 4-digit hexadecimal number
that identifies the scheme used to code graphic character data. The following
three elements are used, where possible, in ESID definitions:
- The basic encoding structure (x)
- This element identifies the basic structural characteristic that differentiates
various encoding schemes such as EBCDIC, ISO-8, IBM-PC Data, or others.
- The number of bytes per code point (y)
- When the encoding scheme permits a different number of 7-bit or 8-bit
bytes per code point, this element identifies the selection used.
- The code extension method (zz)
- Code extensions are techniques used to encode more characters than can
be accommodated in the basic encoding structure. An example is the use of
SO (Shift-Out) and SI (Shift-In) as controls to access an alternative assignment
of graphic characters to code points, and to show whether one byte or two
bytes of the data constitute a code point, in the EBCDIC mixed single-byte
and double-byte encoding.
- This element of the ESID identifies the particular method of code extension
used from among the many that may be allowed in the encoding scheme.
Note to developers: While efforts have been made to define ESIDs using these
elements, not all ESIDs follow the above pattern. It is essential that all encoding
scheme identifiers be defined by the owner of CDRA before they are used.
Figure 8 shows the three components of the ESID.
The component values and their meanings are detailed in the following sections.
Figure 8. Encoding Scheme Identifier Format
The ESID makes the following possible:
- The selection of the correct algorithms (such as parsing) to be invoked
to process graphic character data.
- Identification of the code point(s) reserved for some of the most frequently
used characters, such as SPACE (GCGID SP010000).
The ESID also determines the number and types of other CDRA identifiers needed
in the long form.
The term Encoding Scheme (ES) is synonymous with ESID.
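As an illustration of the x, y, zz layout described above, the following minimal sketch splits a 16-bit ESID into its three fields. The names `EsidParts` and `split_esid` are hypothetical and not part of CDRA. As the note to developers cautions, not every registered ESID follows this pattern, so the result is a structural reading only.

```python
from typing import NamedTuple


class EsidParts(NamedTuple):
    structure: int  # x: first nibble, basic encoding structure
    num_bytes: int  # y: second nibble, number-of-bytes indicator
    extension: int  # zz: second byte, code extension method


def split_esid(esid: int) -> EsidParts:
    """Split a 16-bit ESID into its x, y, and zz fields (hypothetical helper)."""
    if not 0 <= esid <= 0xFFFF:
        raise ValueError(f"ESID must fit in 16 bits, got {esid:#x}")
    return EsidParts(
        structure=(esid >> 12) & 0xF,
        num_bytes=(esid >> 8) & 0xF,
        extension=esid & 0xFF,
    )


print(split_esid(0x1301))  # EsidParts(structure=1, num_bytes=3, extension=1)
```

For ESID 1301, the fields read as x = 1, y = 3, and zz = 01.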
Basic Encoding Structure (x)
The following values are defined for the first nibble (10)
of the ESID to identify the structure. The properties of each structure are
detailed in Appendix A. Encoding Schemes.
Hex   Structure
0     Defaults to higher level in hierarchy
1     EBCDIC
2     IBM-PC Data
3     IBM-PC Display
4     ISO 8
5     ISO 7
6     EBCDIC presentation
7     UCS
8     UCS Display
9     8 bit, for a standalone, 7-bit EUC G-set that has been shifted into the right half of the encoding space
A-C   Reserved for future allocation by CDRA
D     Unique encoding. Details of the encoding structure are found in the related CP and CCSID definitions.
E     Reserved for extending ES id, when needed
F     For Private Use. Use of this value must be accompanied by a specification of the structure, and the rules for usage with specific values of the other parts of the ESID. Definition of the Private Use values is outside the scope of CDRA.
Number of Bytes Indicator (y)
An encoding scheme may permit specific variations in the number of bytes associated
with a code point (for example, EBCDIC single-byte versus EBCDIC double-byte).
These variations are shown using the second nibble of the ESID. The value of
this nibble is not itself the number of bytes per code point; it is a pointer
to the definition. The values representing a variable number of bytes identify
what is allowed to appear in a string, not what actually appears. The encoding
scheme defines the permitted values of this nibble for the encoding structure used.
If the value of the first nibble defining the basic encoding structure element
is zero, the second nibble identifying the number of bytes must also be zero.
The following values are defined:
Hex   Number of Bytes per Code Point
0     Reserved for use with zero value for the basic encoding structure
1     Fixed single-byte, SBCS
2     Fixed double-byte, DBCS (including ISO/IEC 10646-1 UCS-2)
3     IBM Far East style, mixed single-byte and double-byte
4     ISO 2022 schemes (EUC, TCP/IP)
5     UCS-4 or UTF-32
6     Reserved for future allocation by CDRA
7     Fixed triple-byte
8     UTF-n variable number of bytes, self describing (37)
9     Fixed 4-byte
A     Mixed 1-byte, 2-byte, 4-byte (for GB 18030)
B     BOCU-1, SCSU and similar Stateful Compression Schemes
C-E   Reserved for future allocation by CDRA
F     For Private Use. The specification of Private Use must include the values (and the specific meaning) of the encoding structure nibble with which it can be used. Definition of the Private Use values is outside the scope of CDRA.
Code Extension Method (zz)
The code extension method is described by the second byte of the ES identifier.
This byte indicates that a code point from an extended coded character set may
appear in the data; it does not mean that the extension method has
actually been used in a specific character string.
When the first two nibbles of the ESID are zeros, the code extension byte value
must be zero.
The following values are defined:
Hex        Code Extension Method
00         No extensions are specified
01         Locking Shifts (SO and SI, or LS1 and LS0 (11) or UC and LC locking controls)
02         Reserved for future allocation by CDRA
03         IBM EUC scheme (ISO-2022-based)
04         TCP/IP scheme (ISO-2022-based)
05         ISO-8 with possible graphics in C1 area (X'80' to X'9F')
06         Reserved for future allocation by CDRA
07         UTF-8 Universal Transformation Format
08         UTF-EBCDIC Universal Transformation Format
09         Used for an individual Unicode plane
0A         Reserved for future allocation by CDRA
0B         Used to indicate Little Endian Order for UCS
0C         Unicode Standard Code Compression Scheme
0D         Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)
0E         Binary Ordered Compression for Unicode (BOCU-1)
0F         UCS with Byte Order Mark (BOM) to indicate Endianness; BE is assumed in absence of BOM
10 to 49   Reserved for future allocation by CDRA
50         ISO-7 with possible graphics in the C0 area (X'00' to X'1F')
51 to 54   Reserved for future allocation by CDRA
55         ISO-8 with possible graphics in C0 and C1 areas (X'00' to X'1F' and X'80' to X'9F')
56 to FD   Reserved for future allocation by CDRA
FE         Reserved for Private Use of Code Extension. Definition of the Private Use value is outside the scope of CDRA.
FF         Code Extension consideration does not apply.
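The two zero-value rules stated in the preceding sections (a zero structure nibble requires a zero number-of-bytes nibble, and two zero nibbles require a zero code extension byte) can be checked mechanically. The sketch below is one possible reading of those rules, together with a test for the Private Use values F (either nibble) and FE (code extension byte). The function names are hypothetical. Passing these checks does not mean an ESID is registered, since ESIDs must be defined by the owner of CDRA.

```python
def check_esid_zero_rules(esid: int) -> list:
    """Return a list of rule violations for the zero-value constraints."""
    x = (esid >> 12) & 0xF   # basic encoding structure
    y = (esid >> 8) & 0xF    # number-of-bytes indicator
    zz = esid & 0xFF         # code extension method
    problems = []
    if x == 0 and y != 0:
        problems.append("structure nibble is 0, so number-of-bytes nibble must be 0")
    if x == 0 and y == 0 and zz != 0:
        problems.append("first two nibbles are 0, so code extension byte must be 0")
    return problems


def is_private_use(esid: int) -> bool:
    """True if any ESID field carries its Private Use value (F, F, or FE)."""
    x = (esid >> 12) & 0xF
    y = (esid >> 8) & 0xF
    zz = esid & 0xFF
    return x == 0xF or y == 0xF or zz == 0xFE


print(check_esid_zero_rules(0x1301))  # []
print(is_private_use(0x21FE))         # True
```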
Code Extension States
When an encoding scheme uses an extension technique, it uses more than one
elementary coded character set to create a composite coded character set. The
scheme specifies one code extension switching state for each coded
character set used. While in a given state, the associated coded character set
is used for representing and interpreting the character data. The method for
switching between these states can be implicit or explicit, locking or single
shifting. The number of switching states and the method of switching between
the states in a coded character set are specified by the encoding scheme. State
numbering begins at 1 and increases by 1 for each coded character set. For example,
in mixed single-byte, double-byte encodings there are 2 states; the single-byte
coded character set is state 1 and the double-byte coded character set is state
2. Encoding schemes that define a single coded character set have a single
state, state 1.
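The two-state behavior of a mixed single-byte and double-byte encoding can be sketched as a small state machine. The example below assumes the EBCDIC control values SO = X'0E' and SI = X'0F', which are not stated in this section, and it assumes that SI is recognized only on a double-byte boundary. The function name `split_mixed_ebcdic` is hypothetical. It segments a byte string into code points without interpreting them.

```python
SO = 0x0E  # Shift-Out (assumed EBCDIC value): switch to state 2, double-byte
SI = 0x0F  # Shift-In (assumed EBCDIC value): switch back to state 1, single-byte


def split_mixed_ebcdic(data: bytes) -> list:
    """Return (state, code_point_bytes) pairs for an SO/SI mixed string."""
    state = 1  # state numbering begins at 1 (single-byte coded character set)
    i = 0
    out = []
    while i < len(data):
        b = data[i]
        if state == 1 and b == SO:
            state = 2          # locking shift into the double-byte set
            i += 1
        elif state == 2 and b == SI:
            state = 1          # locking shift back to the single-byte set
            i += 1
        elif state == 1:
            out.append((1, data[i:i + 1]))
            i += 1
        else:
            if i + 2 > len(data):
                raise ValueError("truncated double-byte code point")
            out.append((2, data[i:i + 2]))
            i += 2
    return out


sample = bytes([0xC1, SO, 0x42, 0x43, 0x44, 0x45, SI, 0xC2])
for state, code_point in split_mixed_ebcdic(sample):
    print(state, code_point.hex())
```

The shift controls themselves are consumed and do not appear in the output; only the state recorded with each code point shows which coded character set applies.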
The second nibble and the last byte of the ESID together identify the number
of switching states. The last byte of the ESID identifies the switching method
employed in an encoding scheme. The first and second nibbles identify the nature
of the elementary code structures used in the resulting composite structure.
ESID Values
ESID values and their semantics are listed in Figure 9.
Figure 9. ESID values
| ESID hex | Interpretation |
| 1100 | EBCDIC, SBCS, No code extension is allowed. Number of States = 1. |
| 2100 | IBM-PC Data, SBCS, No code extension is allowed. Number of States = 1. |
| 3100 | IBM-PC Display, SBCS, No code extension is allowed. Number of States = 1. |
| 4100 | ISO 8, SBCS, No code extension is allowed. Number of States = 1. |
| 4105 | ISO 8 (ASCII code), SBCS, Graphics in C1. Note that graphic characters may be present in the area normally reserved for the C1 control codes (that is, X'80' to X'9F'). Number of States = 1. |
| 4155 | ISO 8 Presentation (ASCII code), SBCS, Graphics in C0 and C1. Number of States = 1. |
| 5100 | ISO 7 (ASCII code), SBCS, No code extension is allowed. Number of States = 1. |
| 5150 | ISO 7 Presentation (ASCII code), SBCS, Graphics in C0. Number of States = 1. |
| 6100 | EBCDIC Presentation, SBCS, No code extension is allowed. Number of States = 1. |
| 8100 | 8 bit, SBCS, used with a 7-bit code page, characters are shifted into the right hand side of the encoding space, used only for single-byte EUC G-sets when each G-set is treated as a standalone code. Number of States = 1. |
| D100 | PTTC/BCDIC, 6 bit encoding, no code extension is allowed. Number of States = 1. |
| D101 | Paper Tape Transmission Code (PTTC), 6 bit encoding, uppercase/lowercase support using UC/LC code extension method. Number of States = 2. |
| 1200 | EBCDIC, DBCS, No code extension is allowed. Number of States = 1. |
| 2200 | IBM-PC Data, DBCS, No code extension is allowed. Number of States = 1. |
| 3200 | IBM-PC Display, DBCS, No code extension is allowed. Number of States = 1. |
| 5200 | ISO 7 (ASCII code), DBCS, No code extension is allowed. Number of States = 1. |
| 6200 | EBCDIC Double-byte Presentation. Number of States = 1. |
| 7200 | Unicode, UCS-2, including UTF-16 to allow for support of surrogates, Big Endian order. No code extension is allowed. Number of States = 1. |
| 7209 | Unicode pure double-byte. Used for any standalone, individual Unicode plane. Number of States = 1. |
| 720B | Unicode, UCS-2, including UTF-16 to allow for support of surrogates, Little Endian order. No code extension is allowed. Number of States = 1. |
| 720F | Unicode, UCS-2, including UTF-16 to allow for support of surrogates, endianness is determined by byte order mark (BOM), assumed to be Big Endian in absence of BOM. No code extension is allowed. Number of States = 1. |
| 8200 | Unicode Display. Number of States = 1. |
| 9200 | 8 bit, DBCS, used with a 7-bit code page, characters are shifted into the right hand side of the encoding space, used only for double-byte EUC G-sets when each G-set is treated as a standalone code. Number of States = 1. |
| 1301 | EBCDIC, Mixed single-byte and double-byte, using SO/SI code extension method. Number of States = 2. |
| 2300 | IBM-PC Data, Mixed single-byte and double-byte, with implicit code extension. Number of States = 2. |
| 2305 | PC Data, Mixed single-byte and double-byte, with implicit code extension, single-byte is Windows encoding. Number of States = 2. |
| 3300 | IBM-PC Display, Mixed single-byte and double-byte, with implicit code extension. Number of States = 2. |
| 4403 | IBM EUC. Number of States = 2-4. |
| 5404 | ISO 2022 TCP/IP using ESC sequences to designate code sets to G0. Number of States = 2-4. |
| 5409 | ISO 2022 TCP/IP using SO/SI. Number of States = 2. |
| 540A | ISO 2022 TCP/IP using SO, SI, SS2, and SS3. Number of States = 3-4. |
| 7500 | Unicode UTF-32, Big Endian order. No code extension is allowed. Number of States = 1. |
| 750B | Unicode UTF-32, Little Endian order. No code extension is allowed. Number of States = 1. |
| 750F | Unicode UTF-32, endianness is determined by byte order mark (BOM), assumed to be Big Endian in absence of BOM. No code extension is allowed. Number of States = 1. |
| 5700 | ISO 7 Triple-byte Code Set, No code extension is allowed. Number of States = 1. |
| 1808 | UTF-EBCDIC, as defined in Unicode Technical Report 16. Number of States = 1. |
| 7807 | UTF-8, UCS-2 transform, No code extension is allowed. Number of States = 1. |
| 780D | Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8), as defined in Unicode Technical Report #26. Number of States = 1. |
| 2900 | PC Data, fixed 4-byte. Number of States = 1. |
| 2A00 | PC Data, mixed single-, double- and four-byte (Note: IBM PC or Windows code pages may be used as the single-byte component of a CCSID using this ESID.) Number of States = 3. |
| 7B0C | Standard Compression Scheme for Unicode (SCSU) as defined in Unicode Technical Standard 6. |
| 7B0E | Binary Ordered Compression for Unicode (BOCU-1) as defined in Unicode Technical Note 6. |
| Fxxx | Private Use. User-defined encoding scheme. |
| xFxx | Private Use. User-defined encoding scheme. |
| xxFE | Private Use. User-defined encoding scheme. |
Coded Graphic Character Set Global Identifier
The Coded Graphic Character Set Global Identifier, CGCSGID,
is a ten digit decimal number representing the concatenation of the Graphic
Character Set Global Identifier (GCSGID) followed by the Code Page Global Identifier
(CPGID). GCSGID and CPGID are described in the following sections. CGCSGID identifies
a specific collection of graphic characters and their assigned code points using
an encoding scheme.
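Because the CGCSGID is defined as the ten-digit concatenation of a five-digit GCSGID and a five-digit CPGID, composing and splitting it is simple string arithmetic. The following sketch shows one way to do this. The function names are hypothetical, and the values 697 and 37 are used purely as illustrative inputs.

```python
def make_cgcsgid(gcsgid: int, cpgid: int) -> str:
    """Concatenate GCSGID and CPGID as two zero-padded 5-digit decimals."""
    for name, value in (("GCSGID", gcsgid), ("CPGID", cpgid)):
        if not 0 <= value <= 0xFFFF:
            raise ValueError(f"{name} must be in 0..65535, got {value}")
    return f"{gcsgid:05d}{cpgid:05d}"


def split_cgcsgid(cgcsgid: str) -> tuple:
    """Split a ten-digit CGCSGID back into (GCSGID, CPGID)."""
    if len(cgcsgid) != 10 or not (cgcsgid.isascii() and cgcsgid.isdecimal()):
        raise ValueError("CGCSGID must be exactly ten decimal digits")
    return int(cgcsgid[:5]), int(cgcsgid[5:])


print(make_cgcsgid(697, 37))        # 0069700037
print(split_cgcsgid("0069700037"))  # (697, 37)
```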
Many architectures and supporting implementations, such as Document Interchange
Architecture (DIA), have traditionally supported the CGCSGID. It has been assumed
that the encoding scheme information can always be reliably derived from the
code page identifier alone, but this assumption is not true for many registered
PC code pages. It will also be invalid if schemes such as the mixed single-byte
and double-byte encodings used by the IBM PCs in the Far East have to be represented.
The term GCID, used in some IBM architectures, is synonymous with
CGCSGID.
Graphic Character Set Global Identifier
A Graphic Character Set Global Identifier, GCSGID, is a
5-digit decimal identifier assigned to a collection of characters that is to
be processed as an entity (a Graphic Character Set). It uniquely
identifies a specific collection of GCGIDs that are valid in the set.
The range of GCSGID values is 00001 (X'0001') to 65534 (X'FFFE').
The values X'FE00' to X'FEFF' are reserved for Request for Price Quotation
(RPQ) use by IBM products. The values X'FF00' to X'FFFE' are reserved for customer
use.
A GCSGID is assigned to every Registered Graphic Character Set
by IBM (or by a customer organization). An explanation of CGCSGIDs is provided
in Appendix E. Graphic Character Global Identifiers.
See Special-Purpose Values for GCSGID and CPGID
for use of 00000 (X'0000'), 65535 (X'FFFF') and other special-purpose values
for GCSGID.
The term Character Set (CS) is synonymous with GCSGID.
SPACE as a special character
By itself, the GCSGID does not specify either the inclusion or the exclusion
of the SPACE (GCGID = SP010000) character. Each encoding scheme reserves one
or more code points for allocation to the SPACE character. There are two possible
code points for it when using mixed SBCS and DBCS encoding schemes.
Code Page Global Identifier
A Code Page Global Identifier, CPGID, is a 5-digit decimal
number assigned to a code page.
A code page is a specification of code points from a defined encoding
structure for each graphic character (12)
in a collection of one or more graphic character sets.
A CPGID identifies a unique assignment of the graphic code points in an encoding
scheme to a specific set of GCGIDs. Many character sets may be contained in
a code page. When all of the code points in the graphic encoding space of a
code page have been assigned, then the character set containing this collection
of GCGIDs is defined to be full. Often, when a code page is first
created and registered, some of the assignable graphic code points may not have
assigned GCGIDs. The character set containing these assigned characters is defined
to be maximal. As more code point assignments are made, the maximal
character set will change. Once all code points have been assigned, the maximal
set will be the full set.
A CPGID is assigned to every Registered Code Page by IBM. In some
cases, the same CPGIDs have been used when the encoding structures are similar.
(13) An explanation of CPGIDs is provided
in Appendix E. Graphic Character Global Identifiers.
The range of CPGID values is 00001 (X'0001') to 65534 (X'FFFE').
The values X'FE00' to X'FEFF' are reserved for Request for Price Quotation
(RPQ) use by IBM products. The values X'FF00' to X'FFFE' are reserved for customer
use.
The term Code Page (CP) is synonymous with CPGID.
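The value ranges given for GCSGID and CPGID are identical, so a single classifier covers both. The sketch below encodes only the ranges stated in the range paragraphs; the function name and the returned labels are hypothetical. It does not account for the additional special-purpose values (such as X'FDFF' or X'FFF0') described in the next section.

```python
def classify_identifier(value: int) -> str:
    """Classify a GCSGID or CPGID value by the ranges given in the text."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError(f"identifier must fit in 16 bits, got {value}")
    if value in (0x0000, 0xFFFF):
        return "special-purpose"      # outside the 00001..65534 range
    if 0xFE00 <= value <= 0xFEFF:
        return "RPQ use by IBM products"
    if 0xFF00 <= value <= 0xFFFE:
        return "customer use"
    return "general range"


for v in (37, 0xFE10, 0xFF00, 0xFFFF):
    print(f"{v:05d} (X'{v:04X}'): {classify_identifier(v)}")
```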
Special-Purpose Values for GCSGID and
CPGID
IBM standards reserve the values X'0000' and X'FFFF' for future assignments.
In practice, these identifier values have been used for a number of different
special purposes. Some values other than X'0000' and X'FFFF' that have been
reserved for special-purpose use are also included in this section. In the interest
of providing consistency between various implementations, the semantics of use
of these values, either in current use or for future use, are defined here.
Some known definitions are listed below, along with their semantics. Others
will be added as they become known to CDRA.
The CS value of X'0000' is used in several IBM architectures, such as Formatted
Data Object Content Architecture (FD:OCA), Mixed Object Document
Content Architecture (MO:DCA), and Document Interchange Architecture
(DIA) Profiles, to facilitate migration and coexistence between the use of only
a CGCSGID (CS, CP pair) prior to the advent of CDRA and the use of the CCSID
identifier in different architecture definitions.
In these architectures, if the CS portion of a structured field carrying a
CGCSGID has a value of X'0000', the value of the CP portion is interpreted as
a CCSID. The following definitions then apply:
- CS X'0000' with CP X'0000'
The CP value of X'0000' is interpreted as CCSID X'0000'. This CCSID value
means that the tag value is to be inherited from a higher level in a hierarchical
structure.
- CS X'0000' with CP X'FFFE'
The CP value of X'FFFE' is interpreted as CCSID X'FFFE'. This CCSID value
means that the tag value is to be obtained from a lower level in a hierarchical
structure.
- CS X'0000' with CP X'FFFF'
The CP value of X'FFFF' is interpreted as CCSID X'FFFF'. This CCSID value
means that the tagged data is to be interpreted as "not graphic character
data" or "actual representation is unknown".
- CS X'0000' with all other CP values
The CP value is interpreted as a CCSID.
The CS value of X'FFFF' can have the following special-purpose definitions.
- CS X'FFFF' with CP X'0000'
Reserved for future definition in CDRA.
- CS X'FFFF' with CP X'FFFF'
In FD:OCA, the combination is used to indicate inheritance from a higher
level in the structured object.
- CS X'FFFF' with all other CP values
A CS value of X'FFFF' used with CP values from X'0001' to X'FFFE' identifies
a growing character set.
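The CS X'0000' and CS X'FFFF' cases listed above amount to a small decision table. The following sketch expresses it as a function; the name `interpret_cs_cp` and the wording of the returned strings are hypothetical. It models only the architectures in which the CS, CP pair is carried as a CGCSGID, and not the IPDS and MO:DCA usages, which treat the two values separately.

```python
def interpret_cs_cp(cs: int, cp: int) -> str:
    """Interpret a (CS, CP) pair carried in a CGCSGID structured field."""
    if cs == 0x0000:
        # A zero CS means the CP portion is read as a CCSID.
        if cp == 0x0000:
            return "CCSID X'0000': inherit tag from a higher level"
        if cp == 0xFFFE:
            return "CCSID X'FFFE': obtain tag from a lower level"
        if cp == 0xFFFF:
            return "CCSID X'FFFF': not graphic character data, or representation unknown"
        return f"CCSID {cp:05d}"
    if cs == 0xFFFF:
        if cp == 0x0000:
            return "reserved for future definition in CDRA"
        if cp == 0xFFFF:
            return "FD:OCA: inherit from a higher level in the structured object"
        return f"growing character set on code page {cp:05d}"
    return f"CGCSGID {cs:05d}{cp:05d}"


print(interpret_cs_cp(0x0000, 37))   # CCSID 00037
print(interpret_cs_cp(0xFFFF, 850))  # growing character set on code page 00850
```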
In the Intelligent Printer Data Stream (IPDS), both the GCSGID
and CPGID are carried but are not treated as a CGCSGID construct. In this case,
the following special-purpose values for GCSGID and CPGID are defined:
- CS X'0000'
The CS value of X'0000' means that no value is supplied.
- CP X'0000'
The CP value of X'0000' means that no value is supplied.
- CP X'FFFF'
The CP value of X'FFFF' implies that the device default code page should
be used.
In IPDS and in MO:DCA the following special-purpose value is defined:
- CS X'FFFF'
The CS value of X'FFFF' implies that the set of characters with assigned
code points in the resource definition of the selected code page is to be
used.
Special CS and CP values are used to indicate "No CS, No CP" in the ACRI-EUC
structure defined in "ACRI Type EUC (ACRI-EUC)". The
following special-purpose values are defined:
- CS X'FDFF'
The CS value of X'FDFF' implies that there is no character set; that is,
the corresponding G set is not used for this particular EUC CCSID.
- CP X'FDFF'
The CP value of X'FDFF' implies that there is no code page; that is,
the corresponding G set is not used for this particular EUC CCSID.
Within CDRA the following CS/CP pair has been used in the definition of Unicode CCSIDs.
- CS X'FFF0' (65520) and CP X'FFF0' (65520)
This CS/CP pair is used to represent an empty plane of Unicode. By definition, CS 65520 is an empty set containing no characters and CP 65520 is a Unicode plane with no characters defined.
Additional Coding-Related Required Information
Some encoding schemes require specifications beyond the CS and CP elements
to complete their definitions. Such specifications are called Additional
Coding-related Required Information (ACRI) elements.
Three types of ACRI are defined below.
ACRI PC Mixed Byte (ACRI-PCMB)
This type of ACRI applies to ES values X'2300', X'2305' or X'3300' (see semantics
of these ES values in Figure 9). It cannot be specified
with any other ES values. It consists of the specification of ranges of valid
first bytes of double-bytes associated with particular CS, CP pairs that are
used with this encoding scheme. An ACRI-PCMB has the following format:
| N | S1 | E1 | S2 | E2 | -- -- | Sk | Ek | -- -- | Sn | En |
where N is the number of ranges of valid first bytes, Sk is the starting byte
and Ek is the ending byte in the kth range, for all values of k from 1 to N.
Sk and Ek are each in the range 128 to 255.
For example, the ACRI-PCMB associated with CCSID 00942 in the CCSID Registry (see
Appendix C: CCSID Repository) will be represented as:
2 129 159 224 252
In this example, there are two sets of valid first bytes (shown as their decimal
values). The first set of 31 values is in the range 129 to 159 (X'81' to X'9F'),
and the second set of 29 values is in the range 224 to 252 (X'E0' to X'FC'). Thus
a total of 60 double-byte wards (14)
can be defined using this ACRI-PCMB.
Other formats, such as a bit-pattern representation, are also possible.
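The ACRI-PCMB layout and the CCSID 00942 example above can be checked with a short parser. The sketch below assumes the ACRI is supplied as a list of decimal integers in the order N, S1, E1, ..., SN, EN; the function names are hypothetical. It also shows how the ranges might be used to test whether a byte is a valid first byte of a double-byte code point.

```python
def parse_acri_pcmb(fields: list) -> list:
    """Parse [N, S1, E1, ..., SN, EN] into a list of (start, end) ranges."""
    if not fields:
        raise ValueError("empty ACRI-PCMB")
    n = fields[0]
    if len(fields) != 1 + 2 * n:
        raise ValueError(f"expected {2 * n} range bytes, got {len(fields) - 1}")
    ranges = []
    for k in range(n):
        start, end = fields[1 + 2 * k], fields[2 + 2 * k]
        # Sk and Ek are each in the range 128 to 255.
        if not 128 <= start <= end <= 255:
            raise ValueError(f"invalid range {start}..{end}")
        ranges.append((start, end))
    return ranges


def is_lead_byte(byte: int, ranges: list) -> bool:
    """True if byte is a valid first byte of a double-byte code point."""
    return any(start <= byte <= end for start, end in ranges)


ranges = parse_acri_pcmb([2, 129, 159, 224, 252])
lead_count = sum(end - start + 1 for start, end in ranges)
print(ranges)      # [(129, 159), (224, 252)]
print(lead_count)  # 60
```

The count of 60 matches the total number of double-byte wards stated in the example.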
ACRI Type EUC (ACRI-EUC)
This type of ACRI applies to ES value X'4403' only. It specifies the number
of coded character sets and the width of each. It has the following format:
where N is the number of coded graphic character sets, and Wn is the width
of the nth set. If a G set is not used then the value of W is 0 and the corresponding
CS/CP entries will be X'FDFF'.
ACRI Type TCP (ACRI-TCP)
This type of ACRI applies to ES value X'5404' only. It specifies the number
of coded character sets followed by a triplet for each consisting of the width
of the code points for the set, the length of the escape sequence used to designate
the set into G0, and the actual escape sequence. The format is as follows:
n W1 LD1 D1 W2 LD2 D2 ... Wn LDn Dn
where
n  = number of CGCSGIDs associated with the CCSID
W  = width of code points in the code page
LD = length of the designation escape sequence
D  = actual designation sequence
For example, the ACRI-TCP for CCSID 00965 (TCP for Traditional Chinese) is:
03 01 03 ESC 28 42 02 04 ESC 24 29 30 02 04 ESC 24 29 31
In this example the ESC mnemonic is shown, rather than the hex value 1B, for
readability.
The format of the Escape Sequences is defined in ISO 2022. The "final byte,"
which defines the actual coded character set to be used, is defined
in the ISO document International Register of Coded Character
Sets to be Used With Escape Sequences.
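The ACRI-TCP layout can be illustrated by parsing the CCSID 00965 example. The sketch below assumes that n, W, and LD each occupy one byte, as the example suggests, and that ESC is the byte X'1B'. The function name `parse_acri_tcp` is hypothetical.

```python
def parse_acri_tcp(data: bytes) -> list:
    """Parse n followed by n (W, LD, D) triplets into (width, sequence) pairs."""
    if not data:
        raise ValueError("empty ACRI-TCP")
    n = data[0]
    pos = 1
    entries = []
    for _ in range(n):
        if pos + 2 > len(data):
            raise ValueError("truncated triplet header")
        width, length = data[pos], data[pos + 1]
        pos += 2
        sequence = data[pos:pos + length]
        if len(sequence) != length:
            raise ValueError("truncated designation sequence")
        pos += length
        entries.append((width, sequence))
    if pos != len(data):
        raise ValueError("unexpected trailing bytes")
    return entries


# The CCSID 00965 example, with ESC written as its hex value 1B.
example = bytes.fromhex("03 01 03 1B 28 42 02 04 1B 24 29 30 02 04 1B 24 29 31")
for width, sequence in parse_acri_tcp(example):
    print(width, sequence.hex(" "))
```

Read this way, the example describes three coded character sets: one single-byte set designated by a 3-byte escape sequence, and two double-byte sets each designated by a 4-byte escape sequence.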