Character Data Representation Architecture

Appendix K. CDRA and Unicode

Character Data Representation Architecture (CDRA) defines a set of identifiers that uniquely identify graphic character data. The Coded Character Set Identifier (CCSID) is a 16-bit integer that can be expanded to a long-form identifier containing the following information: an Encoding Scheme (ES), a list of character set and code page pairs (CSn, CPn), and optionally Additional Coding-Related Required Information (ACRI).

In the case of Unicode, several encoding schemes are defined, and each Unicode CCSID has multiple code page and character set pairs. The following figure shows how the Unicode CCSIDs are composed.

Figure K-1. Unicode CCSID Structure


Unicode Identifiers

Although Unicode is unusual in how it is defined and in the number of formats in which it can be used, it can still be fully described using the standard CDRA identifiers listed above. The following sections describe how the IBM and CDRA identifiers have been applied to Unicode.


Unicode Encoding Schemes

The Unicode encoding space is well defined within the standard, as are the various formats in which Unicode data may be encoded. Refer to the section on the Unicode Code Structure for detailed information on the encoding structure and the related CDRA encoding schemes.


IBM Code Page Identifiers in Support of Unicode

Within the IBM corporate code page registry, the values in the range 01400 through 01499 have all been reserved for assignment to the individual components of Unicode (ISO 10646). The Unicode Standard deals with the Unicode character repertoire as a whole. To manage such a large set of characters, CDRA defines a unique code page for each plane of Unicode. The following table shows which values have been assigned or reserved for a specific purpose.

Figure K-2. Code Page Identifiers in Support of Unicode

Code Page  Plane                      Comments
1400       0 - BMP                    Basic Multilingual Plane (does not include PUA area)
1401       1 - SMP                    Supplementary Multilingual Plane
1402       2 - SIP                    Supplementary Ideographic Plane
1403-1413  3 - 13                     Currently unassigned - reserved for planes 3 through 13
1414       14 - SSP                   Supplementary Special-Purpose Plane
1415-1445  -                          Not currently assigned, reserved for future assignment for Unicode code page components
1446       15                         Private Use Plane 15
1447       16                         Private Use Plane 16
1448       PUA                        Reserved for PUA area of BMP including corporate zone
1449       IBM default PUA            Registered IBM default for the PUA area of BMP
65520      Special Value Empty Plane  Registered IBM Special value used to indicate an empty Unicode Plane

Additionally, the values 1200 through 1249 in the code page registry have been marked as reserved. This prevents these values from being used for code page assignments, because the corresponding values are used for the Unicode CCSIDs.
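
The plane-to-code-page assignments in Figure K-2 can be sketched as a small lookup. This is an illustration only, not an IBM API: the function names are hypothetical, the BMP private use range U+E000 through U+F8FF is taken from the Unicode Standard rather than from this appendix, and the choice of code page 1449 for the BMP PUA is an assumption that suits CCSIDs using the IBM default PUA.

```python
# Illustrative lookup based on Figure K-2; names are hypothetical, not an IBM API.
BMP_PUA_FIRST, BMP_PUA_LAST = 0xE000, 0xF8FF  # BMP private use area (Unicode Standard)


def code_page_for_plane(plane: int) -> int:
    """Return the code page assigned or reserved for a Unicode plane (0-16)."""
    if not 0 <= plane <= 16:
        raise ValueError(f"Unicode has planes 0 through 16, got {plane}")
    if plane <= 14:
        # 1400 (BMP) through 1414 (SSP); 1403-1413 are reserved for planes 3-13.
        return 1400 + plane
    # Private use planes 15 and 16 map to 1446 and 1447.
    return 1446 + (plane - 15)


def code_page_for_code_point(code_point: int, bmp_pua_code_page: int = 1449) -> int:
    """Return the code page covering a code point.

    The BMP PUA has its own code page; 1449 (IBM default PUA) is assumed
    unless the caller passes another value such as 1448.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    if BMP_PUA_FIRST <= code_point <= BMP_PUA_LAST:
        return bmp_pua_code_page
    return code_page_for_plane(code_point >> 16)
```

For example, `code_page_for_code_point(0x1F600)` falls in plane 1 and therefore resolves to code page 1401.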


IBM Character Set Identifiers in Support of Unicode

The IBM corporate character set registry contains the definitions of all graphic character sets used within IBM, as well as the definitions of some special-purpose values. One of these special-purpose values is X'FFFF' (65535). When used in a CCSID definition in conjunction with a valid code page value, it indicates that the character set (CS) for this CCSID is growing: characters will be added to the set from time to time, and the character set to be used with this CCSID is the current maximal set associated with the code page.

This value is very useful for Unicode identifiers because the Unicode character set is still growing at regular intervals. A product that supports Unicode can use a CCSID with a growing character set and not have to change the CCSID value every time characters are added to Unicode. There is also a fixed character set that corresponds to the growing set at a given point in time, usually a specific version of Unicode. This allows products that are concerned with precise definitions to use exact identifiers, while others can use the less specific growing values. The following character set identifiers are used with the various code pages assigned for use in Unicode CCSID definitions.

Figure K-3. Character Set Identifiers in Support of Unicode

Character Set  Plane Number  Comments
3001           0 - BMP       Unicode 2.0 character repertoire
3002-3003      -             Reserved for future Unicode definitions
3004           0 - BMP       Unicode 3.0 character repertoire
3005           0 - BMP       Unicode 4.0 character repertoire
3006           1             Unicode 4.0 character repertoire for Plane 1
3007           2             Unicode 4.0 character repertoire for Plane 2
3008           14            Unicode 4.0 character repertoire for Plane 14
3009           0 - BMP       Unicode 4.1 character repertoire
3010           1             Unicode 4.1 character repertoire for Plane 1
3011           0 - BMP       Unicode 5.0 character repertoire
3012           1             Unicode 5.0 character repertoire for Plane 1
3013-3095      -             Reserved for future Unicode definitions
3096           15            Unicode 4.0 generic PUA definition for Plane 15
3097           16            Unicode 4.0 generic PUA definition for Plane 16
3098           PUA of BMP    Reserved for BMP PUA full character set of CP 1448
3099           PUA of BMP    IBM Default PUA definition
65535          any           Growing character set, use the current maximal set
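
The choice between a growing and a fixed character set can be illustrated with a minimal sketch. The BMP values below are copied from Figure K-3; the function names and the use of version strings as keys are assumptions made for this example only.

```python
# Illustrative only: BMP character set identifiers copied from Figure K-3.
GROWING_CS = 65535  # X'FFFF': use the current maximal set for the code page

FIXED_BMP_CS = {
    "2.0": 3001,
    "3.0": 3004,
    "4.0": 3005,
    "4.1": 3009,
    "5.0": 3011,
}


def bmp_character_set(unicode_version=None):
    """Pick a BMP character set identifier.

    With no version, return the growing value; with a version, return the
    fixed identifier registered for that version of Unicode.
    """
    if unicode_version is None:
        return GROWING_CS
    try:
        return FIXED_BMP_CS[unicode_version]
    except KeyError:
        raise ValueError(
            f"no fixed BMP character set listed for Unicode {unicode_version}"
        ) from None


def is_growing(character_set: int) -> bool:
    """True when the identifier is the special growing value."""
    return character_set == GROWING_CS
```

A product that needs an exact repertoire would call `bmp_character_set("4.1")`, while one that simply tracks Unicode as it grows would call `bmp_character_set()`.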

CCSIDs Defined in Support of Unicode

The basic principle of CDRA is to identify data unambiguously by means of a unique, well-defined identifier. The CDRA Coded Character Set Identifier (CCSID) can be used to do this for Unicode data. Figure K-1 above shows how each Unicode CCSID is composed. Each CCSID can be expanded to a long form consisting of an Encoding Scheme (ES), a list of character set and code page pairs (CSn, CPn), and optionally ACRI (Additional Coding-Related Required Information).

Each CCSID also has a string type (ST) characteristic associated with it, which may be specified. For Unicode CCSIDs, an unspecified ST defaults to ST 10. These string types cannot be enforced on incoming data; however, any data originating within IBM should comply with the string type properties. For more information on string types, see "Types of Strings".

In the case of Unicode, each full CCSID definition has 18 CS, CP pairs. The first pair is for the Basic Multilingual Plane (BMP, or plane 0), not including the private use area (PUA). The second pair is for the PUA of the BMP. The subsequent pairs represent the character sets and code pages associated with each of planes 1 through 16. Special CS and CP values of 65520 have been defined to represent an 'empty' Unicode plane and are used for all planes that are unpopulated. Empty planes may be omitted from any CCSID definition, provided that the implementation has a well-defined means of determining which planes are included in the definition and which have been omitted because they are unpopulated.

CDRA has used a combination of 'growing' and 'fixed' CCSIDs for Unicode. CCSID 1200 was the first Unicode CCSID defined. It is a growing CCSID with an encoding scheme of 7200. It was initially defined using code page 1400 with a growing character set (CS 65535) for the BMP (without the PUA), and code page 1449 with the fixed character set 3099 for the PUA of the BMP. Character set 3099 has the IBM-defined default PUA characters in the last 256 positions of the PUA and generic characters in all other PUA positions. Planes 1 through 16 were all 'empty'.

Because this is a growing CCSID, its definition expanded over time as the definition of Unicode expanded. Today CCSID 1200 includes the initial two code page and character set pairs, and also code pages 1401, 1402 and 1414 with growing character sets for planes 1, 2 and 14 respectively. It also includes code pages 1446 and 1447 for planes 15 and 16, with default character set definitions of 3096 and 3097. Planes 3 through 13 inclusive remain unpopulated and are represented in the full definition by the special 65520 code page and character set values.
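
The current long form of CCSID 1200 described above can be written out as a list of 18 pairs. The sketch below follows the prose in this section; the `Pair` type, the function names and the slot numbering (slot 0 for the BMP, slot 1 for the BMP PUA, slots 2 through 17 for planes 1 through 16) are assumptions for illustration, and the authoritative definition is the one in the CCSID Repository.

```python
from typing import Dict, List, NamedTuple

EMPTY = 65520    # special CS and CP value for an unpopulated plane
GROWING = 65535  # growing character set


class Pair(NamedTuple):
    cs: int  # character set identifier
    cp: int  # code page identifier


def ccsid_1200_pairs() -> List[Pair]:
    """Build the 18 (CS, CP) pairs described for today's CCSID 1200."""
    pairs = [
        Pair(GROWING, 1400),  # BMP without the PUA
        Pair(3099, 1449),     # PUA of the BMP, IBM default definition
    ]
    populated: Dict[int, Pair] = {
        1: Pair(GROWING, 1401),
        2: Pair(GROWING, 1402),
        14: Pair(GROWING, 1414),
        15: Pair(3096, 1446),
        16: Pair(3097, 1447),
    }
    for plane in range(1, 17):
        # Planes 3 through 13 are unpopulated and use the empty value.
        pairs.append(populated.get(plane, Pair(EMPTY, EMPTY)))
    return pairs


def populated_slots(pairs: List[Pair]) -> Dict[int, Pair]:
    """Compact form: keep only non-empty pairs, keyed by their slot index."""
    return {i: p for i, p in enumerate(pairs) if p != Pair(EMPTY, EMPTY)}
```

Keying the compact form by slot index is one way to satisfy the requirement that an implementation can tell which planes were omitted because they are unpopulated.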

The following table presents a list of the Unicode CCSIDs currently defined. The full definition for each of these CCSIDs can be found in the CDRA CCSID Repository.

Figure K-4. Unicode CCSIDs

CCSID Decimal  CCSID Hex  ESID  Endian Order            Comments
1200           04B0       7200  BE                      UTF-16 BE with IBM PUA
1201           04B1       7200  BE                      UTF-16 BE
1202           04B2       720B  LE                      UTF-16 LE with IBM PUA
1203           04B3       720B  LE                      UTF-16 LE
1204           04B4       720F  BE (in absence of BOM)  UTF-16 with IBM PUA
1205           04B5       720F  BE (in absence of BOM)  UTF-16
1208           04B8       7807  NA                      UTF-8 with IBM PUA
1209           04B9       7807  NA                      UTF-8
1210           04BA       1808  NA                      UTF-EBCDIC with IBM PUA
1211           04BB       1808  NA                      UTF-EBCDIC
1212           04BC       7B0C  NA                      SCSU with IBM PUA
1213           04BD       7B0C  NA                      SCSU
1214           04BE       7B0E  NA                      BOCU-1 with IBM PUA
1215           04BF       7B0E  NA                      BOCU-1
1232           04D0       7500  BE                      UTF-32 BE with IBM PUA
1233           04D1       7500  BE                      UTF-32 BE
1234           04D2       750B  LE                      UTF-32 LE with IBM PUA
1235           04D3       750B  LE                      UTF-32 LE
1236           04D4       750F  BE (in absence of BOM)  UTF-32 with IBM PUA
1237           04D5       750F  BE (in absence of BOM)  UTF-32
5304           14B8       7807  NA                      Unicode 2.0, UTF-8 with IBM PUA
5305           14B9       7807  NA                      Unicode 2.0, UTF-8
9400           24B8       720C  NA                      CESU-8 with IBM PUA
13488          34B0       7200  BE                      Unicode 2.0, UTF-16 with IBM PUA
13489          34B1       7200  BE                      Unicode 2.0, UTF-16
13490          34B2       720B  LE                      Unicode 2.0, UTF-16 LE with IBM PUA
13491          34B3       720B  LE                      Unicode 2.0, UTF-16 LE
13496          34B8       7807  NA                      Unicode 3.0, UTF-8 with IBM PUA
13497          34B9       7807  NA                      Unicode 3.0, UTF-8
17584          44B0       7200  BE                      Unicode 3.0, UTF-16 with IBM PUA
17585          44B1       7200  BE                      Unicode 3.0, UTF-16
17586          44B2       720B  LE                      Unicode 3.0, UTF-16 LE with IBM PUA
17587          44B3       720B  LE                      Unicode 3.0, UTF-16 LE
17592          44B8       7807  NA                      Unicode 4.0, UTF-8 with IBM PUA
17593          44B9       7807  NA                      Unicode 4.0, UTF-8
21680          54B0       7200  BE                      Unicode 4.0, UTF-16 with IBM PUA
21681          54B1       7200  BE                      Unicode 4.0, UTF-16
21682          54B2       720B  LE                      Unicode 4.0, UTF-16 LE with IBM PUA
21683          54B3       720B  LE                      Unicode 4.0, UTF-16 LE
25776          64B0       7200  BE                      Unicode 4.1, UTF-16 with IBM PUA
25777          64B1       7200  BE                      Unicode 4.1, UTF-16
25778          64B2       720B  LE                      Unicode 4.1, UTF-16 LE with IBM PUA
25779          64B3       720B  LE                      Unicode 4.1, UTF-16 LE
29872          74B0       7200  BE                      Unicode 5.0, UTF-16 with IBM PUA
29873          74B1       7200  BE                      Unicode 5.0, UTF-16
29874          74B2       720B  LE                      Unicode 5.0, UTF-16 LE with IBM PUA
29875          74B3       720B  LE                      Unicode 5.0, UTF-16 LE

Detailed information (including the character set, code page pairs) for these CCSIDs can be found in the CCSID Repository.
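
The "BE (in absence of BOM)" entries for CCSIDs 1204 and 1205 in Figure K-4 imply a decoding rule that a generic UTF-16 decoder may not follow, since some decoders fall back to the platform's native byte order. A minimal sketch of that rule follows. The function name is hypothetical, and stripping the byte order mark from the result is an assumption based on the usual treatment of the UTF-16 encoding scheme, not something stated in this appendix.

```python
import codecs


def decode_ccsid_1204(data: bytes) -> str:
    """Decode UTF-16 data tagged with CCSID 1204 or 1205.

    A leading byte order mark selects the byte order and is removed
    (assumption); without one, big-endian is used as Figure K-4 indicates.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No BOM present: default to big-endian.
    return data.decode("utf-16-be")
```

The explicit big-endian fallback is the point of the sketch: data tagged with 1200 through 1203 needs no such check because the byte order is fixed by the CCSID itself.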

In addition to the above CCSIDs, a number of 'special' CCSIDs have been defined for exclusive use by several IBM customers. These CCSID values have been assigned from the customer use range and are not intended for general use. The special CCSIDs are used in order to allow customers to define their own character assignments for the private use area (PUA).

