Character Data Representation Architecture

Appendix K. CDRA and Unicode



The Character Data Representation Architecture (CDRA) defines a set of identifiers that uniquely identify graphic character data. The Coded Character Set Identifier (CCSID) is a 16-bit integer that can be expanded to a long-form identifier containing the following information: an Encoding Scheme (ES), a list of character set and code page pairs (CSn, CPn), and, optionally, Additional Coding-Related Required Information (ACRI).

In the case of Unicode, several encoding schemes are defined and for each Unicode CCSID there are multiple Code Page and Character Set pairs. The following figure shows how the Unicode CCSIDs are formulated.

Figure K-1. Unicode CCSID Structure
[Figure: Unicode CCSIDs]

Unicode Identifiers

While Unicode is unusual in how it is defined and in the several formats in which it can be used, it can still be well defined using the standard CDRA identifiers listed above. The following sections describe how the IBM and CDRA identifiers have been applied to handle Unicode.

Unicode Encoding Schemes

The Unicode encoding space is well defined within the standard, as are the various formats in which Unicode data may be encoded. Refer to the section on the Unicode Code Structure for detailed information on the encoding structure and the related CDRA encoding schemes.

IBM Code Page Identifiers in Support of Unicode

Within the IBM corporate code page registry the values in the range 01400 through 01499 have all been reserved for assignment to the individual components of Unicode (ISO 10646). The following table shows which values have been assigned or reserved for a specific purpose. The Unicode Standard deals with the Unicode character repertoire as a whole. For the purpose of managing such a large set of characters, CDRA defines a unique code page for each plane of Unicode.

Figure K-2. Code Page Identifiers in Support of Unicode

Code Page | Plane | Comments
1400 | 0 - BMP | Basic Multilingual Plane (does not include PUA area)
1401 | 1 - SMP | Supplementary Multilingual Plane
1402 | 2 - SIP | Supplementary Ideographic Plane
1403-1413 | 3 - 13 | Currently unassigned - reserved for planes 3 through 13
1414 | 14 - SSP | Supplementary Special-Purpose Plane
1415-1444 | - | Not currently assigned, reserved for future assignment for Unicode code page components
1445 | IBM advanced function print PUA | Registered for private use area 1 of plane 15. See code page definition for details.
1446 | 15 | Private Use Plane 15
1447 | 16 | Private Use Plane 16
1448 | PUA | Reserved for PUA area of BMP including corporate zone
1449 | IBM default PUA | Registered IBM default for the PUA area of BMP
65520 | Special Value Empty Plane | Registered IBM Special value used to indicate an empty Unicode Plane
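
The plane-to-code-page assignments in Figure K-2 can be expressed as a small lookup. The following is a minimal sketch only; the function name `code_page_for_plane` and the constant `EMPTY_PLANE` are illustrative and are not part of CDRA.

```python
# Sketch of the plane-to-code-page assignments listed in Figure K-2.
# The names used here are hypothetical; only the numeric values come
# from the table.

EMPTY_PLANE = 65520  # special value registered for an empty Unicode plane


def code_page_for_plane(plane: int) -> int:
    """Return the code page identifier reserved for a Unicode plane (0-16)."""
    if not 0 <= plane <= 16:
        raise ValueError(f"Unicode has planes 0 through 16, got {plane}")
    if plane <= 14:
        # 1400 is the BMP (without its PUA); 1401-1414 follow plane by plane.
        return 1400 + plane
    # Planes 15 and 16 are the private use planes: 1446 and 1447.
    return 1446 + (plane - 15)


for plane in (0, 1, 2, 14, 15, 16):
    print(plane, code_page_for_plane(plane))
```

The PUA of the BMP is not a plane of its own, so code pages 1448 and 1449 are deliberately left out of this lookup.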

Additionally, the values 1200 through 1249 in the code page registry have been marked as reserved. This is to prevent these values from being used for code page assignments as the corresponding values are used for the Unicode CCSIDs.

IBM Character Set Identifiers in Support of Unicode

The IBM corporate character set registry contains the definitions of all graphic character sets used within IBM, as well as the definitions of some special-purpose values. One of these special-purpose values is X'FFFF', or 65535. When used in a CCSID definition in conjunction with a valid code page value, it indicates that the character set (CS) for this CCSID is growing: from time to time more characters will be added to the set, and the character set to be used with this CCSID is the current maximal set associated with the code page.

This value is very useful for Unicode identifiers, since the Unicode character set is still growing at regular intervals. A product that supports Unicode can use a CCSID with a growing character set and does not have to change the CCSID value every time characters are added to Unicode. There is also a fixed character set that corresponds to the growing set at a given point in time, usually a specific version of Unicode. This allows products that are concerned with precise definitions to use exact identifiers, while others can use the less specific growing values.

The following character set identifiers are used with the various code pages assigned for use in Unicode CCSID definitions.

Figure K-3. Character Set Identifiers in Support of Unicode

Character Set | Plane Number | Comments
3001 | 0 - BMP | Unicode 2.0 character repertoire
3002-3003 | - | Reserved for future Unicode definitions
3004 | 0 - BMP | Unicode 3.0 character repertoire
3005 | 0 - BMP | Unicode 4.0 character repertoire
3006 | 1 | Unicode 4.0 character repertoire for Plane 1
3007 | 2 | Unicode 4.0 character repertoire for Plane 2
3008 | 14 | Unicode 4.0 character repertoire for Plane 14
3009 | 0 - BMP | Unicode 4.1 character repertoire
3010 | 1 | Unicode 4.1 character repertoire for Plane 1
3011 | 0 - BMP | Unicode 5.0 character repertoire
3012 | 1 | Unicode 5.0 character repertoire for Plane 1
3013 | 0 - BMP | Unicode 5.1 character repertoire for Plane 0
3014 | 1 | Unicode 5.1 character repertoire for Plane 1
3015 | 0 - BMP | Unicode 5.2 character repertoire for Plane 0
3016 | 1 | Unicode 5.2 character repertoire for Plane 1
3017 | 2 | Unicode 5.2 character repertoire for Plane 2
3018 | 0 - BMP | Unicode 6.0 character repertoire for Plane 0
3019 | 1 | Unicode 6.0 character repertoire for Plane 1
3020 | 2 | Unicode 6.0 character repertoire for Plane 2
3021 | 0 - BMP | Unicode 6.2 character repertoire for Plane 0
3022 | 1 | Unicode 6.1 character repertoire for Plane 1
3023-3094 | - | Reserved for future Unicode definitions
3095 | 15* | IBM Advanced Function Printing private use area no. 1 (*for use in row FF of PUA plane 15)
3096 | 15 | Unicode 4.0 generic PUA definition for Plane 15
3097 | 16 | Unicode 4.0 generic PUA definition for Plane 16
3098 | PUA of BMP | Reserved for BMP PUA full character set of CP 1448
3099 | PUA of BMP | IBM Default PUA definition
65535 | any | Growing character set, use the current maximal set
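
The distinction between the growing value and the fixed, version-specific character sets can be illustrated with a short lookup. This is a sketch under stated assumptions: the names `GROWING_CS`, `FIXED_BMP_SETS` and `describe_character_set` are hypothetical, and only the BMP rows of Figure K-3 are included to keep the example short.

```python
# Sketch: telling a growing character set apart from a fixed one.
# Names are illustrative; values are the BMP rows of Figure K-3.

GROWING_CS = 0xFFFF  # 65535: use the current maximal set for the code page

# Fixed BMP character sets, keyed by identifier, mapped to Unicode version.
FIXED_BMP_SETS = {
    3001: "2.0",
    3004: "3.0",
    3005: "4.0",
    3009: "4.1",
    3011: "5.0",
    3013: "5.1",
    3015: "5.2",
    3018: "6.0",
    3021: "6.2",
}


def describe_character_set(cs: int) -> str:
    """Describe a character set identifier used with the BMP code page."""
    if cs == GROWING_CS:
        return "growing (current maximal set)"
    version = FIXED_BMP_SETS.get(cs)
    if version is None:
        return "not a BMP repertoire in this sketch"
    return f"fixed, Unicode {version} BMP repertoire"


print(describe_character_set(65535))
print(describe_character_set(3011))
```

A product that needs an exact repertoire would carry one of the fixed identifiers, while a product that simply tracks Unicode would carry 65535 and never need to change it.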

CCSIDs Defined in Support of Unicode

The basic principle of CDRA is to identify data unambiguously by means of a unique, well-defined identifier. The CDRA Coded Character Set Identifier (CCSID) can be used to do this for Unicode data. Figure K-1 above shows how each Unicode CCSID is composed. Each CCSID can be expanded to a long form consisting of an Encoding Scheme (ES), a list of character set and code page pairs (CSn, CPn), and, optionally, Additional Coding-Related Required Information (ACRI).

Each CCSID also has a string type (ST) characteristic associated with it, which may be specified. For Unicode CCSIDs, an unspecified ST defaults to ST 10. String types cannot be enforced on incoming data; however, any data originating within IBM should comply with the string type properties. For more information on string types, see "Types of Strings".

In the case of Unicode, each full CCSID definition has 18 CS, CP pairs. The first pair is for the Basic Multilingual Plane (BMP, or plane 0), not including the private use area (PUA). The second pair is for the PUA of the BMP. The remaining 16 pairs represent the character sets and code pages associated with each of planes 1 through 16. Special CS and CP values of 65520 have been defined to represent an 'empty' Unicode plane and are used for all planes that are unpopulated. Empty planes may be omitted from any CCSID definition, so long as the implementation has a well-defined means of determining which planes are included in the definition and which have been omitted because they are unpopulated.
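
One possible way to hold a long-form definition while omitting empty planes is to key each pair by the slot it occupies, so that the omitted slots can always be restored. The sketch below assumes hypothetical names throughout (`LongForm`, `SLOTS`, `full_pairs`); the sample values are those given in this appendix for the initial definition of CCSID 1200, and the encoding scheme is stored exactly as it is written in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

EMPTY = 65520  # special CS and CP value for an unpopulated plane

# Hypothetical slot names for the 18 (CS, CP) pairs of a full definition:
# the BMP, the PUA of the BMP, then planes 1 through 16.
SLOTS: List[str] = ["BMP", "BMP-PUA"] + [f"plane-{n}" for n in range(1, 17)]


@dataclass
class LongForm:
    encoding_scheme: str                       # ES, as written in the registry
    pairs: Dict[str, Tuple[int, int]] = field(default_factory=dict)  # slot -> (CS, CP)
    string_type: int = 10                      # ST defaults to 10 for Unicode CCSIDs
    acri: Optional[str] = None                 # optional ACRI, unused in this sketch

    def full_pairs(self) -> List[Tuple[int, int]]:
        """Expand to all 18 pairs, filling omitted slots with the empty value."""
        return [self.pairs.get(slot, (EMPTY, EMPTY)) for slot in SLOTS]


# Initial definition of CCSID 1200: only the two BMP pairs are populated.
example = LongForm("7200", {"BMP": (65535, 1400), "BMP-PUA": (3099, 1449)})
print(len(example.full_pairs()), example.string_type)
```

Keying by slot is only one design choice; any scheme works as long as the omitted planes can be determined unambiguously.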

CDRA has used a combination of 'growing' and 'fixed' CCSIDs for Unicode. CCSID 1200 was the first Unicode CCSID defined. It is a growing CCSID with an encoding scheme of 7200. It was initially defined using code page 1400 with a growing character set (CS 65535) for the BMP (without the PUA), and code page 1449 with the fixed character set 3099 for the PUA of the BMP. This character set has the IBM-defined default PUA characters in the last 256 positions of the PUA and generic characters in all other PUA positions. Planes 1 through 16 were all 'empty'.

Because this is a growing CCSID, its definition expanded over time as the definition of Unicode expanded. Today CCSID 1200 still includes the initial two code page and character set pairs, and has been expanded to include code pages 1401, 1402 and 1414 with growing character sets for planes 1, 2 and 14 respectively. It also includes code pages 1446 and 1447 for planes 15 and 16, with default character set definitions of 3096 and 3097. Planes 3 through 13 inclusive remain undefined and use the special 65520 code page and character set in the full definition.
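
The current definition of CCSID 1200 described above can be written out as its 18 pairs. This is a minimal sketch built only from the values in the preceding paragraphs; the function name `ccsid_1200_pairs` and the constant names are illustrative, and the authoritative definition remains the one in the CCSID Repository.

```python
GROWING = 65535  # growing character set
EMPTY = 65520    # empty plane, used for both CS and CP


def ccsid_1200_pairs():
    """Return the 18 (CS, CP) pairs of CCSID 1200 as described in the text."""
    pairs = [
        (GROWING, 1400),  # BMP without the PUA
        (3099, 1449),     # PUA of the BMP, IBM default definition
    ]
    populated = {
        1: (GROWING, 1401),
        2: (GROWING, 1402),
        14: (GROWING, 1414),
        15: (3096, 1446),  # generic PUA definition for plane 15
        16: (3097, 1447),  # generic PUA definition for plane 16
    }
    # Planes 3 through 13 are not populated and fall back to the empty value.
    for plane in range(1, 17):
        pairs.append(populated.get(plane, (EMPTY, EMPTY)))
    return pairs


pairs = ccsid_1200_pairs()
for index, (cs, cp) in enumerate(pairs):
    print(index, cs, cp)
```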

The following table presents a list of the Unicode CCSIDs currently defined. The full definition for each of these CCSIDs can be found in the CDRA CCSID Repository.

Figure K-4. Unicode CCSIDs

CCSID (decimal) | Description | CCSID (decimal) | Description
1200 | UTF-16 BE with IBM PUA | 17593 | Unicode 4.0, UTF-8
1201 | UTF-16 BE | 17616 | Unicode 5.1, UTF-32 BE with IBM PUA
1202 | UTF-16 LE with IBM PUA | 17784 | Unicode 4.1, BMP
1203 | UTF-16 LE | 17785 | Unicode 5.1, Plane 1
1204 | UTF-16 with IBM PUA | 21680 | Unicode 4.0, UTF-16 with IBM PUA
1205 | UTF-16 | 21681 | Unicode 4.0, UTF-16
1208 | UTF-8 with IBM PUA | 21682 | Unicode 4.0, UTF-16 LE with IBM PUA
1209 | UTF-8 | 21683 | Unicode 4.0, UTF-16 LE
1210 | UTF-EBCDIC with IBM PUA | 21688 | Unicode 4.1, UTF-8 with IBM PUA
1211 | UTF-EBCDIC | 21689 | Unicode 4.1, UTF-8
1212 | SCSU with IBM PUA | 21712 | Unicode 5.2, UTF-32 BE with IBM PUA
1213 | SCSU | 21880 | Unicode 5.0, BMP
1214 | BOCU-1 with IBM PUA | 21881 | Unicode 5.2, Plane 1
1215 | BOCU-1 | 25776 | Unicode 4.1, UTF-16 with IBM PUA
1232 | UTF-32 BE with IBM PUA | 25777 | Unicode 4.1, UTF-16
1233 | UTF-32 BE | 25778 | Unicode 4.1, UTF-16 LE with IBM PUA
1234 | UTF-32 LE with IBM PUA | 25779 | Unicode 4.1, UTF-16 LE
1235 | UTF-32 LE | 25784 | Unicode 5.0, UTF-8 with IBM PUA
1236 | UTF-32 with IBM PUA | 25785 | Unicode 5.0, UTF-8
1237 | UTF-32 | 25808 | Unicode 6.0, UTF-32 BE with IBM PUA
1400 | Unicode BMP | 25976 | Unicode 5.1, BMP
1401 | Unicode Plane 1 | 25977 | Unicode 6.0, Plane 1
1402 | Unicode Plane 2 | 29872 | Unicode 5.0, UTF-16 with IBM PUA
1414 | Unicode Plane 14 | 29873 | Unicode 5.0, UTF-16
1446 | Unicode Plane 15 | 29874 | Unicode 5.0, UTF-16 LE with IBM PUA
1447 | Unicode Plane 16 | 29875 | Unicode 5.0, UTF-16 LE
1448 | Unicode, Generic PUA of BMP | 29880 | Unicode 5.1, UTF-8 with IBM PUA
1449 | Unicode, PUA of BMP, IBM Default | 29881 | Unicode 5.1, UTF-8
5304 | Unicode 2.0, UTF-8 with IBM PUA | 29904 | Unicode 6.2, UTF-32 BE with IBM PUA
5305 | Unicode 2.0, UTF-8 | 30072 | Unicode 5.2, BMP
5328 | Unicode 4.0, UTF-32 BE with IBM PUA | 30073 | Unicode 6.1, Plane 1
5496 | Unicode 2.0, BMP | 33968 | Unicode 5.1, UTF-16 BE with IBM PUA
5497 | Unicode 4.0, Plane 1 | 33969 | Unicode 5.1, UTF-16 BE
5498 | Unicode 4.0, Plane 2 | 33970 | Unicode 5.1, UTF-16 LE with IBM PUA
5510 | Unicode 4.0, Plane 14 | 33971 | Unicode 5.1, UTF-16 LE
9400 | CESU-8 with IBM PUA | 33976 | Unicode 5.2, UTF-8 with IBM PUA
9424 | Unicode 4.1, UTF-32 BE with IBM PUA | 33977 | Unicode 5.2, UTF-8
9592 | Unicode 3.0, BMP | 34168 | Unicode 6.0, BMP
9593 | Unicode 4.1, Plane 1 | 38064 | Unicode 5.2, UTF-16 BE with IBM PUA
9594 | Unicode 5.2, Plane 2 | 38065 | Unicode 5.2, UTF-16
13488 | Unicode 2.0, UTF-16 with IBM PUA | 38066 | Unicode 5.2, UTF-16 LE with IBM PUA
13489 | Unicode 2.0, UTF-16 | 38067 | Unicode 5.2, UTF-16 LE
13490 | Unicode 2.0, UTF-16 LE with IBM PUA | 38072 | Unicode 6.0, UTF-8 with IBM PUA
13491 | Unicode 2.0, UTF-16 LE | 38073 | Unicode 6.0, UTF-8
13496 | Unicode 3.0, UTF-8 with IBM PUA | 38264 | Unicode 6.2, BMP
13497 | Unicode 3.0, UTF-8 | 42160 | Unicode 6.0, UTF-16 BE with IBM PUA
13520 | Unicode 5.0, UTF-32 with IBM PUA | 42161 | Unicode 6.0, UTF-16 BE
13688 | Unicode 4.0, BMP | 42162 | Unicode 6.0, UTF-16 LE with IBM PUA
13689 | Unicode 5.0, Plane 1 | 42163 | Unicode 6.0, UTF-16 LE
13690 | Unicode 6.0, Plane 2 | 42168 | Unicode 6.2, UTF-8 with IBM PUA
17584 | Unicode 3.0, UTF-16 with IBM PUA | 42169 | Unicode 6.2, UTF-8
17585 | Unicode 3.0, UTF-16 | 46256 | Unicode 6.2, UTF-16 BE with IBM PUA
17586 | Unicode 3.0, UTF-16 LE with IBM PUA | 46257 | Unicode 6.2, UTF-16 BE
17587 | Unicode 3.0, UTF-16 LE | 46258 | Unicode 6.2, UTF-16 LE with IBM PUA
17592 | Unicode 4.0, UTF-8 with IBM PUA | 46259 | Unicode 6.2, UTF-16 LE
65520 | Unicode, empty plane | |
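
As an illustration of how the table might be consulted, the sketch below selects a CCSID from the entries in Figure K-4 that carry no Unicode version. Only CCSID 1200 is explicitly described in this appendix as growing; treating the other unversioned entries the same way is an assumption of this example, and the names `UNVERSIONED_CCSIDS` and `unversioned_ccsid` are hypothetical.

```python
from typing import Optional

# Unversioned Unicode CCSIDs from Figure K-4.
# Each encoding maps to (CCSID with IBM PUA, CCSID without IBM PUA).
UNVERSIONED_CCSIDS = {
    "UTF-16 BE": (1200, 1201),
    "UTF-16 LE": (1202, 1203),
    "UTF-16": (1204, 1205),
    "UTF-8": (1208, 1209),
    "UTF-EBCDIC": (1210, 1211),
    "SCSU": (1212, 1213),
    "BOCU-1": (1214, 1215),
    "UTF-32 BE": (1232, 1233),
    "UTF-32 LE": (1234, 1235),
    "UTF-32": (1236, 1237),
}


def unversioned_ccsid(encoding: str, ibm_pua: bool) -> Optional[int]:
    """Look up the unversioned CCSID for an encoding, or None if not listed."""
    pair = UNVERSIONED_CCSIDS.get(encoding)
    if pair is None:
        return None
    return pair[0] if ibm_pua else pair[1]


print(unversioned_ccsid("UTF-8", ibm_pua=True))
print(unversioned_ccsid("UTF-16 LE", ibm_pua=False))
```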

Detailed information (including the character set, code page pairs) for these CCSIDs can be found in the CCSID Repository.

In addition to the above CCSIDs, a number of 'special' CCSIDs have been defined for exclusive use by several IBM customers. These CCSID values have been assigned from the customer use range and are not intended for general use. The special CCSIDs are used in order to allow customers to define their own character assignments for the private use area (PUA).
