Character Data Representation Architecture


Chapter 3. CDRA Identifiers

Character Data Representation Architecture deals primarily with graphic character data, and to a lesser extent with control character data. Graphic character data can include imbedded code extension controls that influence the interpretation of the data that follows. This chapter defines several identifiers related to graphic-character data representation, and what it means to tag with these identifiers.

The following identifiers are defined:

These identifiers form the architectural basis for unique identification and interpretation of coded graphic character data.

Coding of Graphic Character Data

Character data is represented in machines as code points, consisting of one or more 7-bit bytes (septets) or 8-bit bytes (octets) of data. Underlying each code is an encoding scheme. In the terminology of coded character set standards, a code is a system of bit patterns to which a specific graphic or control meaning has been assigned. Each unique bit pattern defined by a code is called a code point. CDRA identifiers provide the means by which the graphic character associated with a code point can be determined unambiguously.

Elements of Character Data Representation

The identifiers associated with graphic character representation (see Figure 7) are:

These identifiers are detailed in the following sections.

Figure 7. CDRA Identifier Forms

Diagram.

Graphic Character Global Identifier

A graphic character global identifier (GCGID) or a graphic character UCS identifier (GCUID) is used to convey the meaning of a graphic character in a code-independent manner. These identifiers are used primarily in the representation of characters in objects such as Character Set or Code Page resources. They are not used to tag the data directly.

A GCGID is a 4- to 8-character alphanumeric identifier assigned to a graphic character. A GCUID is an 8-character identifier of the form Unnnnnnn where nnnnnnn is a 7-digit hexadecimal value.
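
The two identifier shapes described above can be checked mechanically. The following is a minimal sketch, not part of CDRA: the function name classify_graphic_char_id is hypothetical, and the restriction of GCGIDs to uppercase letters and digits is an assumption. Shape alone may not always distinguish the two forms, so the sketch simply tests the GCUID shape first.

```python
import re

# GCUID: the letter U followed by exactly seven hexadecimal digits.
GCUID_RE = re.compile(r"U[0-9A-Fa-f]{7}")
# GCGID: 4 to 8 alphanumeric characters (uppercase letters and digits assumed).
GCGID_RE = re.compile(r"[A-Z0-9]{4,8}")


def classify_graphic_char_id(identifier: str) -> str:
    """Return 'GCUID', 'GCGID', or 'invalid' based only on the identifier's shape."""
    if GCUID_RE.fullmatch(identifier):
        return "GCUID"
    if GCGID_RE.fullmatch(identifier):
        return "GCGID"
    return "invalid"
```

A shape check of this kind says nothing about whether the identifier is actually registered; that can only be determined from the registry.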

Each graphic character that is to be assigned a code point must have a GCGID or a GCUID. They are used wherever a graphic character is referenced in a code-independent manner. They are also the basis of establishing correspondences between code points in different representations.

The GCGID and GCUID are fully described on the IBM Graphic Character Identifiers web page.

In CDRA, the terms graphic character identifier, character identifier, meaning of graphic character, and rendering of graphic character are synonymous with GCGID.

IBM's GCGID system provides for distinguishing between two renderings of a graphic character. When no specific rendering is indicated, a "nominal" rendering is assumed with the character. Specific renderings can further be specified using another identifier such as Font Global Identifier, FGID.

The rendering part of a GCGID is of significance primarily for presentation processing such as formatting, displaying, or printing. GCGIDs with different renderings typically appear in character sets and code pages that are primarily presentation-oriented. However, some of these character sets and code pages are used for all aspects of processing of graphic characters. If they are encountered by functions such as comparison, depending on the context of use, graphic characters with two different renderings may be equated.

When graphic characters with two different renderings are part of a character set and are included in the same coded character set, different code points are assigned different GCGIDs to represent the different renderings. Some examples of use of GCGIDs for graphic characters with different renderings are:

Long-Form Identification

The long-form identification consists of an Encoding Scheme Identifier, one or more Coded Graphic Character Set Global Identifiers (each consisting of a Graphic Character Set Global Identifier and a Code Page Global Identifier), and any Additional Coding-related Required Information that is required to complete the specification of the representation.

Encoding Scheme Identifier
The Encoding Scheme Identifier, ESID, is a 4-digit hexadecimal number that identifies the scheme used to code graphic character data. The following three elements have been used where possible in ESID definitions.

The basic encoding structure (x)
This element identifies the basic structural characteristic that differentiates various encoding schemes such as EBCDIC, ISO-8, IBM-PC Data, or others.
The number of bytes per code point (y)
When the encoding scheme permits a different number of 7-bit or 8-bit bytes per code point, this element identifies the selection used.
The code extension method (zz)
Code extensions are techniques used to encode more characters than can be accommodated in the basic encoding structure. An example is the use of SO (Shift-Out) and SI (Shift-In) as controls to access an alternative assignment of graphic characters to code points, and to show whether one byte or two bytes of the data constitute a code point, in the EBCDIC mixed single-byte and double-byte encoding. This element of the ESID identifies the particular method of code extension used from among the many that may be allowed in the encoding scheme.

Note to developers: While efforts have been made to define ESIDs using these elements, not all ESIDs follow the above pattern. It is essential that all encoding scheme identifiers be defined by the owner of CDRA prior to being used.

Figure 8 shows the three components of the ESID. The component values and their meanings are detailed in the following sections.

Figure 8. Encoding Scheme Identifier Format

Diagram.

The ESID makes the following possible:

The ESID also determines the number and types of other CDRA identifiers needed in the long form.

The term Encoding Scheme (ES) is synonymous with ESID.
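
As an illustration of the x, y, and zz elements, the sketch below splits a 16-bit ESID into its first nibble, second nibble, and second byte. The names EsidParts and split_esid are hypothetical. Because not all ESIDs follow the pattern, the result should be treated as a hint and not as a substitute for the registered definition.

```python
from typing import NamedTuple


class EsidParts(NamedTuple):
    structure: int       # x: basic encoding structure (first nibble)
    byte_indicator: int  # y: number-of-bytes indicator (second nibble)
    extension: int       # zz: code extension method (second byte)


def split_esid(esid: int) -> EsidParts:
    """Split a 16-bit ESID into its x, y, and zz elements."""
    if not 0 <= esid <= 0xFFFF:
        raise ValueError("ESID must fit in 16 bits")
    return EsidParts((esid >> 12) & 0xF, (esid >> 8) & 0xF, esid & 0xFF)
```

For example, split_esid(0x1301) yields structure 1, byte indicator 3, and extension 01.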

Basic Encoding Structure (x)

The following values are defined for the first nibble of the ESID to identify the structure. The properties of each structure are detailed in Appendix A. Encoding Schemes.

Hex Structure
0 Defaults to higher level in hierarchy
1 EBCDIC
2 IBM-PC Data
3 IBM-PC Display
4 ISO 8
5 ISO 7
6 EBCDIC presentation
7 UCS
8 UCS Display
9 8 bit, for a standalone, 7-bit EUC G-set that has been shifted into the right half of the encoding space
A-C Reserved for future allocation by CDRA
D Unique encoding. Details of the encoding structure are found in the related CP and CCSID definitions.
E Reserved for extending ES id, when needed
F For Private Use. Use of this value must be accompanied by a specification of the structure, and the rules for usage with specific values of the other parts of the ESID. Definition of the Private Use values is outside the scope of CDRA.

Number of Bytes Indicator (y)

An encoding scheme may permit specific variations in the number of bytes associated with a code point (for example, EBCDIC single-byte versus EBCDIC double-byte). These variations are shown using the second nibble of the ESID. The value of this nibble is not itself the number of bytes per code point; it is a pointer to the definition listed below. The values representing a variable number of bytes identify what is allowed to appear in a string, not what actually appears. The encoding scheme defines permitted values of this nibble for the encoding structure used.

If the value of the first nibble defining the basic encoding structure element is zero, the second nibble identifying the number of bytes must also be zero.

The following values are defined:

Hex Number of Bytes per Code Point
0 Reserved for use with zero value for the basic encoding structure
1 Fixed single-byte, SBCS
2 Fixed double-byte, DBCS (including ISO/IEC 10646-1 UCS-2)
3 IBM Far East style, mixed single-byte and double-byte
4 ISO 2022 schemes (EUC, TCP/IP)
5 UCS-4 or UTF-32
6 Reserved for future allocation by CDRA
7 Fixed triple-byte
8 UTF-n variable number of bytes, self describing
9 Fixed 4-byte
A Mixed 1-byte, 2-byte, 4-byte (for GB 18030)
B BOCU-1, SCSU and similar Stateful Compression Schemes
C-E Reserved for future allocation by CDRA
F For Private Use. The specification of Private Use must include the values (and the specific meaning) of the encoding structure nibble with which it can be used. Definition of the Private Use values is outside the scope of CDRA.

Code Extension Method (zz)

The code extension method is described by the second byte of the ES identifier. This byte indicates that a code point from an extended coded character set may appear in the data; it does not mean that the extension method has actually been used in a specific character string.

When the first two nibbles of the ESID are zeros, the code extension byte value must be zero.
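
The two zero-value constraints (a zero structure nibble forces a zero byte indicator, and two zero nibbles force a zero extension byte) can be expressed as a small check. This is a sketch only; the function name esid_zero_rules_ok is hypothetical, and the check covers just these two rules, not the full set of valid ESIDs.

```python
def esid_zero_rules_ok(esid: int) -> bool:
    """Check the two zero-value constraints on the parts of a 16-bit ESID."""
    x = (esid >> 12) & 0xF   # basic encoding structure
    y = (esid >> 8) & 0xF    # number-of-bytes indicator
    zz = esid & 0xFF         # code extension method
    if x == 0 and y != 0:
        return False         # zero structure requires zero byte indicator
    if x == 0 and y == 0 and zz != 0:
        return False         # two zero nibbles require a zero extension byte
    return True
```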

The following values are defined:

Hex Code Extension Method
00 No extensions are specified
01 Locking Shifts (SO and SI, or LS1 and LS0, or UC and LC locking controls)
02 Reserved for future allocation by CDRA
03 IBM EUC scheme (ISO-2022-based)
04 TCP/IP scheme (ISO-2022-based)
05 ISO-8 with possible graphics in C1 area (X'80' to X'9F')
06 Reserved for future allocation by CDRA
07 UTF-8 Universal Transformation Format
08 UTF-EBCDIC Universal Transformation Format
09 Used for an individual Unicode plane
0A Reserved for future allocation by CDRA
0B Used to indicate Little Endian Order for UCS
0C Unicode Standard Code Compression Scheme
0D Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)
0E Binary Ordered Compression for Unicode (BOCU-1)
0F UCS with Byte Order Mark (BOM) to indicate Endianness; BE is assumed in absence of BOM
10 to 49 Reserved for future allocation by CDRA
50 ISO-7 with possible graphics in the C0 area (X'00' to X'1F')
51 to 54 Reserved for future allocation by CDRA
55 ISO-8 with possible graphics in C0 and C1 areas (X'00' to X'1F' and X'80' to X'9F')
56 to FD Reserved for future allocation by CDRA
FE Reserved for Private Use of Code Extension. Definition of the Private Use value is outside the scope of CDRA.
FF Code Extension consideration does not apply.

Code Extension States

When an encoding scheme uses an extension technique, it uses more than one elementary coded character set to create a composite coded character set. The scheme specifies one code extension switching state for each coded character set used. While in a given state, the associated coded character set is used for representing and interpreting the character data. The method for switching between these states can be implicit or explicit, locking or single shifting. The number of switching states and the method of switching between the states in a coded character set are specified by the encoding scheme. State numbering begins at 1 and increases by 1 for each coded character set. For example, in mixed single-byte, double-byte encodings there are 2 states; the single-byte coded character set is state 1 and the double-byte coded character set is state 2. Encoding schemes which define a single coded character set have a single state; state 1.
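
The two-state case can be illustrated with a small state machine. The sketch below assumes explicit locking shifts with SO = X'0E' and SI = X'0F', as in EBCDIC mixed single-byte and double-byte data; the function name split_code_points is hypothetical, and no validation of the code points themselves is attempted.

```python
SO = 0x0E  # Shift-Out: switch to state 2 (double-byte coded character set)
SI = 0x0F  # Shift-In: switch back to state 1 (single-byte coded character set)


def split_code_points(data: bytes) -> list[tuple[int, bytes]]:
    """Return (state, code point) pairs for SO/SI mixed single/double-byte data."""
    state = 1  # state numbering begins at 1 (single-byte)
    i = 0
    result = []
    while i < len(data):
        b = data[i]
        if state == 1 and b == SO:
            state = 2
            i += 1
        elif state == 2 and b == SI:
            state = 1
            i += 1
        elif state == 1:
            result.append((1, data[i:i + 1]))
            i += 1
        else:
            if i + 2 > len(data):
                raise ValueError("truncated double-byte code point")
            result.append((2, data[i:i + 2]))
            i += 2
    return result
```

The shift controls themselves are consumed and not returned, since they select the state and carry no graphic meaning.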

The second nibble and the last byte of the ESID together identify the number of switching states. The last byte of the ESID identifies the switching method employed in an encoding scheme. The first and second nibbles identify the nature of the elementary code structures used in the resulting composite structure.

ESID Values

ESID values and their semantics are listed in Figure 9.

Figure 9. ESID values

ESID hex Interpretation
1100 EBCDIC, SBCS, No code extension is allowed. Number of States = 1.
2100 IBM-PC Data, SBCS, No code extension is allowed. Number of States = 1.
3100 IBM-PC Display, SBCS, No code extension is allowed. Number of States = 1.
4100 ISO 8, SBCS, No code extension is allowed. Number of States = 1.
4105 ISO 8 (ASCII code), SBCS, Graphics in C1. Note that graphic characters may be present in the area normally reserved for the C1 control codes (i.e., X'80' to X'9F'). Number of States = 1.
4155 ISO 8 Presentation (ASCII code), SBCS, Graphics in C0 and C1. Number of States = 1.
5100 ISO 7 (ASCII code), SBCS, No code extension is allowed. Number of States = 1.
5150 ISO 7 Presentation (ASCII code), SBCS, Graphics in C0. Number of States = 1.
6100 EBCDIC Presentation, SBCS, No code extension is allowed. Number of States = 1.
8100 8 bit, SBCS, used with a 7-bit code page, characters are shifted into the right hand side of the encoding space, used only for single-byte EUC G-sets when each G-set is treated as a standalone code. Number of States = 1.
D100 PTTC/BCDIC – 6 bit encoding, no code extension is allowed. Number of States = 1.
D101 Paper Tape Transmission Code (PTTC), 6 bit encoding, uppercase/lowercase support using UC/LC code extension method. Number of States = 2.
1200 EBCDIC, DBCS, No code extension is allowed. Number of States = 1.
2200 IBM-PC Data, DBCS, No code extension is allowed. Number of States = 1.
3200 IBM-PC Display, DBCS, No code extension is allowed. Number of States = 1.
5200 ISO 7 (ASCII code), DBCS, No code extension is allowed. Number of States = 1.
6200 EBCDIC Double-byte Presentation Number of States = 1.
7200 Unicode, UCS-2, including UTF-16 to allow for support of surrogates, Big Endian order. No code extension is allowed. Number of States = 1.
7209 Unicode pure double-byte. Used for any standalone, individual Unicode plane. Number of States = 1.
720B Unicode, UCS-2, including UTF-16 to allow for support of surrogates, Little Endian order. No code extension is allowed. Number of States = 1.
720F Unicode, UCS-2, including UTF-16 to allow for support of surrogates, endianness is determined by byte order mark (BOM), assumed to be Big Endian in absence of BOM. No code extension is allowed. Number of States = 1.
8200 Unicode Display Number of States = 1.
9200 8 bit, DBCS, used with a 7-bit code page, characters are shifted into the right hand side of the encoding space, used only for double-byte EUC G-sets when each G-set is treated as a standalone code. Number of States = 1.
1301 EBCDIC, Mixed single-byte and double-byte, using SO/SI code extension method. Number of States = 2.
2300 IBM-PC Data, Mixed single-byte and double-byte, with implicit code extension. Number of States = 2.
2305 PC Data, Mixed single-byte and double-byte, with implicit code extension, single-byte is Windows encoding. Number of States = 2.
3300 IBM-PC Display, Mixed single-byte and double-byte, with implicit code extension. Number of States = 2.
4403 IBM EUC Number of States = 2-4.
5404 ISO 2022 TCP/IP using ESC sequences to designate code sets to G0. Number of States = 2-4.
5409 ISO 2022 TCP/IP using SO/SI Number of States = 2.
540A ISO 2022 TCP/IP using SO, SI, SS2, and SS3. Number of States = 3-4.
7500 Unicode UTF-32, Big Endian order. No code extension is allowed. Number of States = 1.
750B Unicode UTF-32, Little Endian order. No code extension is allowed. Number of States = 1.
750F Unicode UTF-32, endianness is determined by byte order mark (BOM), assumed to be Big Endian in absence of BOM. No code extension is allowed. Number of States = 1.
5700 ISO 7 Triple-byte Code Set, No code extension is allowed. Number of States = 1.
1808 UTF-EBCDIC, as defined in Unicode Technical Report 16. Number of States = 1.
7807 UTF-8, UCS-2 transform, No code extension is allowed. Number of States = 1.
780D Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8), as defined in Unicode Technical Report #26. Number of States = 1.
2900 PC Data, fixed 4-byte Number of States = 1.
2A00 PC Data, mixed single-, double- and four-byte (Note: IBM PC or Windows code pages may be used as the single-byte component of a CCSID using this ESID.) Number of States = 3.
7B0C Standard Compression Scheme for Unicode (SCSU) as defined in Unicode Technical Standard 6.
7B0E Binary Ordered Compression for Unicode (BOCU-1) as defined in Unicode Technical Note 6.
Fxxx Private Use. User-defined encoding scheme.
xFxx Private Use. User-defined encoding scheme.
xxFE Private Use. User-defined encoding scheme.
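
The three Private Use patterns at the end of Figure 9 (Fxxx, xFxx, and xxFE) can be tested with simple masks. The function name is_private_use_esid is hypothetical; this is a sketch of the pattern match only.

```python
def is_private_use_esid(esid: int) -> bool:
    """True if a 16-bit ESID matches one of the patterns Fxxx, xFxx, or xxFE."""
    x = (esid >> 12) & 0xF   # basic encoding structure nibble
    y = (esid >> 8) & 0xF    # number-of-bytes nibble
    zz = esid & 0xFF         # code extension byte
    return x == 0xF or y == 0xF or zz == 0xFE
```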

Coded Graphic Character Set Global Identifier

The Coded Graphic Character Set Global Identifier, CGCSGID, is a ten-digit decimal number representing the concatenation of the Graphic Character Set Global Identifier (GCSGID) followed by the Code Page Global Identifier (CPGID). GCSGID and CPGID are described in the following sections. CGCSGID identifies a specific collection of graphic characters and their assigned code points using an encoding scheme.
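
The concatenation can be sketched as follows. The function names make_cgcsgid and split_cgcsgid are hypothetical, and the values 697 and 500 in the usage note are illustrative inputs only.

```python
def make_cgcsgid(gcsgid: int, cpgid: int) -> str:
    """Concatenate a 5-digit GCSGID and a 5-digit CPGID into a ten-digit CGCSGID."""
    for value in (gcsgid, cpgid):
        if not 0 <= value <= 0xFFFF:
            raise ValueError("identifier must fit in 16 bits")
    return f"{gcsgid:05d}{cpgid:05d}"


def split_cgcsgid(cgcsgid: str) -> tuple[int, int]:
    """Split a ten-digit CGCSGID back into (GCSGID, CPGID)."""
    if len(cgcsgid) != 10 or not (cgcsgid.isascii() and cgcsgid.isdigit()):
        raise ValueError("CGCSGID must be ten decimal digits")
    return int(cgcsgid[:5]), int(cgcsgid[5:])
```

With these definitions, make_cgcsgid(697, 500) gives the string 0069700500, and split_cgcsgid reverses it.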

Many architectures and supporting implementations, such as Document Interchange Architecture (DIA), have traditionally supported the CGCSGID. It has been assumed that the encoding scheme information can always be reliably derived from the code page identifier alone, but this assumption is not true for many registered PC code pages. It will also be invalid if schemes such as the mixed single-byte and double-byte encodings used by the IBM PCs in the Far East have to be represented.

The term GCID, used in some IBM architectures, is synonymous with CGCSGID.

Graphic Character Set Global Identifier

A Graphic Character Set Global Identifier, GCSGID, is a 5-digit decimal identifier assigned to a collection of characters that is to be processed as an entity (a Graphic Character Set). It uniquely identifies a specific collection of GCGIDs that are valid in the set.

The range of GCSGID values is 00001 (X'0001') to 65534 (X'FFFE'). The values X'FE00' to X'FEFF' are reserved for Request for Price Quotation (RPQ) use by IBM products. The values X'FF00' to X'FFFE' are reserved for customer use.

A GCSGID is assigned to every Registered Graphic Character Set by IBM (or by a customer organization).

See Special-Purpose Values for GCSGID and CPGID for use of 00000 (X'0000'), 65535 (X'FFFF') and other special-purpose values for GCSGID.

The term Character Set (CS) is synonymous with GCSGID.

SPACE as a special character

By itself, the GCSGID does not specify either the inclusion or the exclusion of the SPACE (GCGID = SP010000) character. Each encoding scheme reserves one or more code points for allocation to the SPACE character. There are two possible code points for it when using mixed SBCS and DBCS encoding schemes.

Code Page Global Identifier

A Code Page Global Identifier, CPGID, is a 5-digit decimal number assigned to a code page.

A code page is a specification of code points from a defined encoding structure for each graphic character in a collection of one or more graphic character sets.

A CPGID identifies a unique assignment of the graphic code points in an encoding scheme to a specific set of GCGIDs. Many character sets may be contained in a code page. When all of the code points in the graphic encoding space of a code page have been assigned, then the character set containing this collection of GCGIDs is defined to be full. Often, when a code page is first created and registered, some of the assignable graphic code points may not have assigned GCGIDs. The character set containing these assigned characters is defined to be maximal. As more code point assignments are made, the maximal character set will change. Once all code points have been assigned, the maximal set will be the full set.

A CPGID is assigned to every Registered Code Page by IBM. In some cases, the same CPGIDs have been used when the encoding structures are similar.

The range of CPGID values is 00001 (X'0001') to 65534 (X'FFFE'). The values X'FE00' to X'FEFF' are reserved for Request for Price Quotation (RPQ) use by IBM products. The values X'FF00' to X'FFFE' are reserved for customer use.
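
Because GCSGID and CPGID share the same range layout, one coarse classifier covers both. The sketch below uses only the ranges stated in these two sections; the function name and the returned labels are hypothetical, and individual special-purpose values inside these ranges are not distinguished.

```python
def classify_cs_or_cp(value: int) -> str:
    """Coarsely classify a GCSGID or CPGID value by the ranges given in the text."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("identifier must fit in 16 bits")
    if value in (0x0000, 0xFFFF):
        return "special-purpose"   # outside the normal 00001 to 65534 range
    if 0xFE00 <= value <= 0xFEFF:
        return "RPQ"               # reserved for RPQ use by IBM products
    if 0xFF00 <= value <= 0xFFFE:
        return "customer"          # reserved for customer use
    return "general"               # remaining values X'0001' to X'FDFF'
```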

The term Code Page (CP) is synonymous with CPGID.

Special-Purpose Values for GCSGID and CPGID

IBM standards reserve the values X'0000' and X'FFFF' for future assignments. In practice, these identifier values have been used for a number of different special purposes. Some values other than X'0000' and X'FFFF' that have been reserved for special-purpose use are also included in this section. In the interest of providing consistency between various implementations, the semantics of use of these values, either in current use or for future use, are defined here.

Some known definitions are listed below, along with their semantics. Others will be added as they become known to CDRA.

The CS value of X'0000' is used in several IBM architectures, such as Formatted Data Object Content Architecture (FD:OCA), Mixed Object Document Content Architecture (MO:DCA), and Document Interchange Architecture (DIA) Profiles, to facilitate migration and coexistence between the use of only a CGCSGID (CS, CP pair) prior to the advent of CDRA and the use of the CCSID identifier in different architecture definitions.

In these architectures, if the CS portion of a structured field carrying a CGCSGID has a value of X'0000', the value of the CP portion is interpreted as a CCSID. The following definitions then apply:

 
  1. CS X'0000' with CP X'0000'
    The CP value of X'0000' is interpreted as CCSID X'0000'. This CCSID value means that the tag value is to be inherited from a higher level in a hierarchical structure.
  2. CS X'0000' with CP X'FFFE'
    The CP value of X'FFFE' is interpreted as CCSID X'FFFE'. This CCSID value means that the tag value is to be obtained from a lower level in a hierarchical structure.
  3. CS X'0000' with CP X'FFFF'
    The CP value of X'FFFF' is interpreted as CCSID X'FFFF'. This CCSID value means that the tagged data is to be interpreted as "not graphic character data" or "actual representation is unknown".
  4. CS X'0000' with all other CP values
    The CP value is interpreted as a CCSID.
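
The four cases above amount to a small dispatch on the CS and CP values. The following sketch assumes both values arrive as 16-bit integers taken from the structured field; the function name interpret_cs_cp and the returned strings are hypothetical.

```python
def interpret_cs_cp(cs: int, cp: int) -> str:
    """Interpret a (CS, CP) pair, treating CS X'0000' as 'CP carries a CCSID'."""
    if cs != 0x0000:
        return f"CGCSGID: CS {cs:05d}, CP {cp:05d}"
    if cp == 0x0000:
        return "CCSID X'0000': inherit from higher level"
    if cp == 0xFFFE:
        return "CCSID X'FFFE': obtain from lower level"
    if cp == 0xFFFF:
        return "CCSID X'FFFF': not graphic character data or unknown"
    return f"CCSID {cp:05d}"
```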
 

The CS value of X'FFFF' can have the following special-purpose definitions.

 
  1. CS X'FFFF' with CP X'0000'
    Reserved for future definition in CDRA.
  2. CS X'FFFF' with CP X'FFFF'
    In FD:OCA, the combination is used to indicate inheritance from a higher level in the structured object.
  3. CS X'FFFF' with all other CP values
    A CS value of X'FFFF' used with CP values from X'0001' to X'FFFE' identifies a growing character set.
 

In the Intelligent Printer Data Stream* (IPDS*), both the GCSGID and CPGID are carried but are not treated as a CGCSGID construct. In this case, the following special-purpose values for GCSGID and CPGID are defined:

 
  1. CS X'0000'
    The CS value of X'0000' means that no value is supplied.
  2. CP X'0000'
    The CP value of X'0000' means that no value is supplied.
  3. CP X'FFFF'
    The CP value of X'FFFF' implies that the device default code page should be used.
 

In IPDS and in MO:DCA the following special-purpose value is defined:

 
  1. CS X'FFFF'
    The CS value of X'FFFF' implies that the set of characters with assigned code points in the resource definition of the selected code page is to be used.
 

Special CS and CP values are used to indicate "No CS, No CP" in the ACRI-EUC structure defined in ACRI Type EUC (ACRI-EUC). The following special-purpose values are defined:

 
  1. CS X'FDFF'
    The CS value of X'FDFF' implies that there is no character set, that is that the corresponding G set is not used for this particular EUC CCSID.
  2. CP X'FDFF'
    The CP value of X'FDFF' implies that there is no code page, that is that the corresponding G set is not used for this particular EUC CCSID.
 

Within CDRA the following CS/CP pair has been used in the definition of Unicode CCSIDs.

 
  1. CS X'FFF0' (65520) and CP X'FFF0' (65520)
    This CS/CP pair is used to represent an empty plane of Unicode. By definition CS 65520 is an empty set containing no characters and CP 65520 is a Unicode plane with no characters defined.

Additional Coding-Related Required Information

Some encoding schemes require specifications beyond the CS and CP elements to complete their definitions. Such specifications are called Additional Coding-related Required Information (ACRI) elements.

Three types of ACRI are defined below.

ACRI PC Mixed Byte (ACRI-PCMB)

This type of ACRI applies to ES values X'2300', X'2305' or X'3300' (see semantics of these ES values in Figure 9). It cannot be specified with any other ES values. It consists of the specification of ranges of valid first bytes of double-bytes associated with particular CS, CP pairs that are used with this encoding scheme. An ACRI-PCMB has the following format:

N S1 E1 S2 E2 ... Sk Ek ... SN EN

where N is the number of ranges of valid first bytes, Sk is the starting byte and Ek is the ending byte in the kth range, for all values of k from 1 to N. Sk and Ek are each in the range 128 to 255.

For example, ACRI-PCMB associated with CCSID 00942 in the CCSID Registry (see Appendix C: CCSID Repository) will be represented as:

2 129 159 224 252

In this example, there are two sets of valid first bytes (shown as their decimal values). The first set of 31 values is in the range 129 to 159 (X'81' to X'9F'), and the second set of 29 values is in the range 224 to 252 (X'E0' to X'FC'). Thus a total of 60 double-byte wards can be defined using this ACRI-PCMB.
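
The arithmetic in this example can be reproduced with a short parser. This is a sketch that assumes the ACRI-PCMB is supplied as a list of decimal integers in the format shown above; the function names parse_acri_pcmb and count_wards are hypothetical.

```python
def parse_acri_pcmb(values: list[int]) -> list[tuple[int, int]]:
    """Parse 'N S1 E1 ... SN EN' into a list of (start, end) first-byte ranges."""
    if not values:
        raise ValueError("empty ACRI-PCMB")
    n = values[0]
    if len(values) != 1 + 2 * n:
        raise ValueError("length does not match the declared number of ranges")
    ranges = []
    for k in range(n):
        start, end = values[1 + 2 * k], values[2 + 2 * k]
        if not 128 <= start <= end <= 255:
            raise ValueError("range bounds must lie within 128 to 255")
        ranges.append((start, end))
    return ranges


def count_wards(ranges: list[tuple[int, int]]) -> int:
    """Count the valid first bytes, one double-byte ward per first byte."""
    return sum(end - start + 1 for start, end in ranges)
```

Applied to the values 2 129 159 224 252, the parser yields the ranges (129, 159) and (224, 252), and count_wards returns 31 + 29 = 60.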

Other formats, such as a bit-pattern representation, are also possible.

ACRI Type EUC (ACRI-EUC)

This type of ACRI applies to ES value X'4403' only. It specifies the number of coded character sets and the width of each. It has the following format:

N W1 W2 W3 W4

where N is the number of coded graphic character sets, and Wn is the width of the nth set. If a G set is not used then the value of W is 0 and the corresponding CS/CP entries will be X'FDFF'.

ACRI Type TCP (ACRI-TCP)

This type of ACRI applies to ES value X'5404' only. It specifies the number of coded character sets followed by a triplet for each consisting of the width of the code points for the set, the length of the escape sequence used to designate the set into G0, and the actual escape sequence. The format is as follows:

n W1 LD1 D1 W2 LD2 D2 ... Wn LDn Dn

where

n = number of CGCSGIDs associated with the CCSID

W = width of code points in the code page

LD = length of the designation escape sequence

D = actual designation sequence

For example, the ACRI-TCP for CCSID 00965 (TCP for Traditional Chinese) is:

03 01 03 ESC 28 42 02 04 ESC 24 29 30 02 04 ESC 24 29 31

In this example the ESC mnemonic is shown, rather than the hex value 1B, to allow for ease of readability.
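
The triplet structure can be unpacked as shown in the sketch below. It assumes the ACRI-TCP is given as whitespace-separated tokens exactly as printed above, with every numeric token read as hexadecimal and the ESC mnemonic standing for X'1B'; the function name parse_acri_tcp is hypothetical.

```python
def parse_acri_tcp(text: str) -> list[tuple[int, bytes]]:
    """Parse 'n W1 LD1 D1 ... Wn LDn Dn' into (width, designation sequence) pairs."""
    # Assumption: all numeric tokens are hexadecimal; ESC stands for X'1B'.
    values = [0x1B if tok == "ESC" else int(tok, 16) for tok in text.split()]
    if not values:
        raise ValueError("empty ACRI-TCP")
    n = values[0]
    pos = 1
    entries = []
    for _ in range(n):
        if pos + 2 > len(values):
            raise ValueError("truncated triplet")
        width, length = values[pos], values[pos + 1]
        pos += 2
        seq = values[pos:pos + length]
        if len(seq) != length:
            raise ValueError("truncated designation sequence")
        entries.append((width, bytes(seq)))
        pos += length
    if pos != len(values):
        raise ValueError("unexpected trailing values")
    return entries
```

For the CCSID 00965 example, this yields one single-byte set designated by ESC 28 42 and two double-byte sets designated by ESC 24 29 30 and ESC 24 29 31.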

The format of the Escape Sequences is defined in ISO 2022. The "final byte," which defines the actual coded character set to be used, is defined in the ISO document International Register of Coded Character Sets to be Used With Escape Sequences.

Short-Form Identification

Many implementations and architectures cannot accommodate variable-length tags like the long-form identifier. To address this problem, an alternative short-form fixed-length identifier called the Coded Character Set Identifier (CCSID) is defined.

Coded Character Set Identifier

A CCSID is a 16-bit identifier defined by CDRA. A CCSID by definition uniquely defines a data encoding. Given a CCSID tag and a valid code point, the character associated with that code point can be precisely identified. This is because the definition of the CCSID is linked to the definition of the code page in the IBM corporate registry. The definition of the control characters associated with a CCSID is inherited from the definition of controls for the related encoding scheme.

CCSIDs can be defined as growing. A growing CCSID is defined when the related code page is expected to be expanded. When the CCSID grows (that is, more characters are added to the related code page and character set), a non-growing, fixed CCSID is defined for the existing resources and the growing CCSID takes on the characteristics of the expanded resources. The range of CCSID values is 00000 (X'0000') to 65535 (X'FFFF'). The bit allocations in a CCSID are shown in Figure 10.

Figure 10. Bit Allocations in the Coded Character Set Identifier (CCSID)

Diagram.

Each CCSID has a corresponding long-form identifier or has a predefined special meaning. Figure 11 shows the allocation of CCSID values.

Figure 11. Allocation of CCSID values

Value Purpose/Meaning
X'0000' Inheritance This value is reserved to show that the value of CCSID is defaulted and is to be taken from the next higher level in a defined hierarchy. It cannot be used if there is no hierarchy or no higher level. The highest level in the hierarchy cannot use this value. If a CCSID value of X'0000' is used when there is no higher level or no hierarchy, it will resolve to X'FFFF' (CCSID is not applicable).
X'0001' to X'DFFF' IBM Registered CCSIDs These values are for IBM use. They will be registered and published in the CDRA documentation.
X'E000' to X'EFFF' Private-use CCSIDs These values are reserved for private use. Customers must maintain their own organizational registries.
X'F000' to X'F0FF' Reserved for future allocation by CDRA
X'F100' to X'F1FF' Global Use CCSIDs These values are reserved for global use common character sets, such as the Syntactic character set, associated with specific encoding structures. This avoids the need to issue specific CCSIDs for usage of these character sets with every code page registered. See the CCSID Repository for a list of CCSIDs. Note: The use of Global Use CCSIDs is optional; it is determined individually by each implementation.
X'F200' to X'F2FF' Reserved for RPQ use by products. Values in this range are specific to a product and must be completely defined by that product.
X'F300' to X'FFEF' Reserved for future allocation by CDRA
X'FFF0' CCSID for Empty Code Page
X'FFF1' to X'FFFB' Reserved for future allocation by CDRA
X'FFFC' to X'FFFD' Special value CCSIDs reserved for use in DB2.
X'FFFE' Lower Level in Hierarchy This value is reserved to show that a value for CCSID at this level is not relevant. It should be obtained from the tag fields of elements at a lower level in the defined hierarchy. If a hierarchy does not exist, or if a CCSID value of X'FFFE' is specified at the lowest level, then the CCSID resolves to X'FFFF' (CCSID is not applicable).
X'FFFF' CCSID is Not Applicable This value means that the tagged data is to be interpreted as "not graphic character data" or "actual representation is unknown".
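
The allocation in Figure 11 maps directly onto a range table. The sketch below transcribes those ranges; the names CCSID_RANGES and classify_ccsid and the short labels are hypothetical.

```python
# (low, high, purpose) rows transcribed from Figure 11.
CCSID_RANGES = [
    (0x0000, 0x0000, "inheritance from higher level"),
    (0x0001, 0xDFFF, "IBM registered"),
    (0xE000, 0xEFFF, "private use"),
    (0xF000, 0xF0FF, "reserved"),
    (0xF100, 0xF1FF, "global use"),
    (0xF200, 0xF2FF, "RPQ use by products"),
    (0xF300, 0xFFEF, "reserved"),
    (0xFFF0, 0xFFF0, "empty code page"),
    (0xFFF1, 0xFFFB, "reserved"),
    (0xFFFC, 0xFFFD, "special values for DB2"),
    (0xFFFE, 0xFFFE, "lower level in hierarchy"),
    (0xFFFF, 0xFFFF, "not applicable"),
]


def classify_ccsid(ccsid: int) -> str:
    """Return the Figure 11 purpose for a 16-bit CCSID value."""
    for low, high, purpose in CCSID_RANGES:
        if low <= ccsid <= high:
            return purpose
    raise ValueError("CCSID must fit in 16 bits")
```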

Representation of CDRA Identifiers

Internal Representation of CCSID, GCSGID, and CPGID

The representation of the identifier values, the syntax, is specified by providers of the tag fields that hold these identifier values. Each of these CDRA identifiers is a 16-bit binary number. The CDRA recommendation is that the internal representations be unsigned binary integers, rather than numeric character strings. If they are stored as alphanumeric strings, they must be tagged (implicitly or explicitly) like any other graphic character data.
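
A 16-bit unsigned binary field of the kind recommended here could be written and read as follows. Byte order is not specified in this section, so the big-endian order used below is an assumption, as are the function names pack_tag and unpack_tag.

```python
import struct


def pack_tag(identifier: int) -> bytes:
    """Store a CCSID, GCSGID, or CPGID as a 16-bit unsigned integer.

    Assumption: big-endian byte order; the tag field definition decides this.
    """
    return struct.pack(">H", identifier)  # raises struct.error if out of range


def unpack_tag(raw: bytes) -> int:
    """Read a 16-bit unsigned big-endian identifier back from two bytes."""
    return struct.unpack(">H", raw)[0]
```

Storing the value 1200 this way produces the two bytes X'04' X'B0', with no need to tag the field as character data.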

Internal Representation of GCGID

The GCGID values are made up of uppercase A to Z, the digits 0 to 9, and a SPACE (trailing). The method of encoding them in a particular object must be identified in the object definition.

Internal Representation of ACRI

A variable-length array containing the value of each ACRI is needed to store the information. Each element in the array is a positive integer with a maximum value of 255. These numbers should be stored as binary values rather than strings of digits, to eliminate the need for tagging.

External Representation of Identifiers

CDRA identifiers may appear in documentation, display panels, program statements, or other textual strings. For consistency, CDRA recommends:

As an aid to users, a descriptive name associated with each of the identifiers can also be presented.

CCSID Values

The CCSID values are categorized as follows:

  1. Interoperable CCSIDs:
    Interoperable CCSIDs have the following characteristics: Supporting interoperable CCSIDs allows for:
  2. Global Use CCSIDs
    Some CCSIDs are defined with character sets that are globally applicable. These typically use the Syntactic Character Set (CS 640). See Appendix C: CCSID Repository for access to a complete list of CCSIDs.
  3. Universal
    This category encompasses all the encoding forms of UCS. UCS is a large multi-script character set covering all the living languages of today; it is the character set of the World Wide Web and is expected to be supported in all computing environments. Its character set is a superset of the character sets of the non-UCS CCSIDs.
  4. Coexistence and Migration CCSIDs
    All other CCSIDs are classified as Coexistence and Migration CCSIDs. They may be widely used within a country or environment but not have the properties of an interoperable CCSID, or they may have a very specific, limited use such as a 7-bit symbols set. See Appendix C: CCSID Repository for access to a complete list of CCSIDs.

Tagging in CDRA

When data is tagged with a CCSID, the GCGIDs assigned to the graphic character code points must be those defined by the CCSID.

When a graphic character is represented in data using a CCSID tag:

When data is to be interpreted according to a CCSID value:

When data with no assigned graphic character meaning is found, it should be treated as bytes.

These concepts are explained using two examples.

Example 1: Pure Single-Byte Case

In this example, let ESa, CSa, and CPa (in a single-byte encoding scheme) be the Encoding Scheme, Character Set, and Code Page elements of CCSIDa. See Figure 12.

Figure 12. Meaning of Tagging: A Single Byte Example

Diagram.

The encoding space defined by ESa is composed of C and G1, where C is the control area and G1 is the graphic area. G1a represents all code points that have been assigned to the GCGIDs found in character set CSa. Only code points found within G1a can have graphic character meaning according to the definition of CCSIDa. G1m represents all code points within CPa that have assigned GCGID values.

Example 2: Case of Mixed Single-Byte Double-Byte in PC

The example shown in Figure 13 uses a PC mixed single-byte and double-byte encoding. Let the elements of CCSIDa be ESa, CSa1, CPa1, CSa2, CPa2, and Fa (=ACRI-PCMB, ranges of valid first bytes).

Figure 13. Meaning of Tagging: A PC Mixed SB/DB Example

Diagram.

The encoding space defined by ESa is composed of C, G1, and G2, where C is the control area, G1 is the single-byte graphic area, and G2 is the double-byte graphic area. G1a represents all of the single-byte code points that have been assigned to the GCGIDs found in character set CSa1. G2a represents all of the double-byte code points that have been assigned to the GCGIDs found in character set CSa2. Only code points found within G1a or G2a can have graphic character meaning according to the definition of CCSIDa. G1m and G2m represent all code points within CPa1 and CPa2, respectively, that have assigned GCGID values. Fa represents the set of valid first bytes for double-byte code points found in G2a.
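
The role of Fa in separating single-byte from double-byte code points can be sketched as follows. The function name split_pc_mixed is hypothetical, and the sketch only segments the byte string; it does not check whether each resulting code point actually falls within G1a or G2a.

```python
def split_pc_mixed(data: bytes, first_byte_ranges: list[tuple[int, int]]) -> list[bytes]:
    """Segment PC mixed SB/DB data into code points using valid first-byte ranges."""

    def is_first_byte(b: int) -> bool:
        # A byte inside any range of Fa starts a double-byte code point.
        return any(start <= b <= end for start, end in first_byte_ranges)

    code_points = []
    i = 0
    while i < len(data):
        if is_first_byte(data[i]):
            if i + 2 > len(data):
                raise ValueError("truncated double-byte code point")
            code_points.append(data[i:i + 2])
            i += 2
        else:
            code_points.append(data[i:i + 1])
            i += 1
    return code_points
```

Here the switch between the two states is implicit: the value of each leading byte alone decides whether one or two bytes form the code point.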

Meaning of Tagging in CDRA

CDRA has a dependency on other architectures, processes, or functions to provide proper graphic character data processing. The tags can be used to set the meaning or derive the meaning of code points in data to the extent defined above, when:

Relationship of Tags to Data Path

The data along with its tag may traverse many different systems through networks. In the process the tag value may get changed to reflect any conversion of the data. The tag values do not have any relationship to the data path.
