Appendix K. CDRA and Unicode
Character Data Representation Architecture (CDRA) defines a set of identifiers that uniquely identify graphic character data. The Coded Character Set Identifier (CCSID) is a 16-bit integer that can be expanded to a long-form identifier containing the following information: an Encoding Scheme (ES), a list of character set and code page pairs (CSn, CPn), and, optionally, Additional Coding-Related Required Information (ACRI).
In the case of Unicode, several encoding schemes are defined, and each Unicode CCSID has multiple code page and character set pairs. The following figure shows how the Unicode CCSIDs are constructed.
While Unicode is unusual in how it is defined and in the number of formats in which it can be used, it can still be fully described using the standard CDRA identifiers listed above. The following sections describe how the IBM and CDRA identifiers have been applied to Unicode.
Unicode Encoding Schemes
The Unicode encoding space is well defined within the standard, as are the various formats in which Unicode data may be encoded. Refer to the section on the Unicode Code Structure for detailed information on the encoding structure and the related CDRA encoding schemes.
IBM Code Page Identifiers in Support of Unicode
Within the IBM corporate code page registry, the values in the range 01400 through 01499 have been reserved for assignment to the individual components of Unicode (ISO 10646). The Unicode Standard deals with the Unicode character repertoire as a whole. To manage such a large set of characters, CDRA defines a unique code page for each plane of Unicode. The following table shows which values have been assigned or reserved for a specific purpose.
Figure K-2. Code Page Identifiers in Support of Unicode
|Code Page||Plane Number||Comments|
|1400||0 - BMP||Basic Multilingual Plane (does not include PUA area)|
|1401||1 - SMP||Supplementary Multilingual Plane|
|1402||2 - SIP||Supplementary Ideographic Plane|
|1403-1413||3 - 13||Currently unassigned - reserved for planes 3 through 13|
|1414||14 - SSP||Supplementary Special-Purpose Plane|
|1415-1445||-||Not currently assigned, reserved for future assignment for Unicode code page components|
|1446||15||Private Use Plane 15|
|1447||16||Private Use Plane 16|
|1448||PUA||Reserved for PUA area of BMP including corporate zone|
|1449||IBM default PUA||Registered IBM default for the PUA area of BMP|
|65520||Special Value Empty Plane||Registered IBM Special value used to indicate an empty Unicode Plane|
Additionally, the values 1200 through 1249 in the code page registry have been marked as reserved. This is to prevent these values from being used for code page assignments as the corresponding values are used for the Unicode CCSIDs.
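The plane-to-code-page assignments in Figure K-2 can be sketched as a small lookup. This is an illustration only, not part of CDRA: the function name `code_page_for` is hypothetical, the one-to-one mapping of planes 3 through 13 onto 1403 through 1413 is an assumption based on the reserved range, and the BMP private use area boundaries (U+E000 through U+F8FF) come from the Unicode Standard rather than from this appendix.

```python
# Code page identifiers per Unicode plane, transcribed from Figure K-2.
PLANE_CODE_PAGES = {0: 1400, 1: 1401, 2: 1402, 14: 1414, 15: 1446, 16: 1447}

# Assumption: the reserved values 1403-1413 map one-to-one onto planes 3-13.
PLANE_CODE_PAGES.update({plane: 1400 + plane for plane in range(3, 14)})

# Code page 1448 covers the full PUA of the BMP; 1449 is the IBM default PUA.
BMP_PUA_CODE_PAGE = 1448
BMP_PUA = range(0xE000, 0xF900)  # U+E000..U+F8FF per the Unicode Standard


def code_page_for(code_point: int) -> int:
    """Return the Figure K-2 code page covering a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    if code_point in BMP_PUA:
        # The BMP code page 1400 excludes the PUA, which has its own identifier.
        return BMP_PUA_CODE_PAGE
    plane = code_point >> 16  # each plane holds 65536 code points
    return PLANE_CODE_PAGES[plane]
```

For example, `code_page_for(0x1F600)` selects the code page for plane 1, while any code point in the BMP private use area selects 1448 instead of 1400.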
IBM Character Set Identifiers in Support of Unicode
The IBM corporate character set registry contains the definitions of all graphic character sets used within IBM, as well as the definitions of some special-purpose values. One of these special-purpose values is X'FFFF' (65535). When used in a CCSID definition in conjunction with a valid code page value, this value indicates that the character set (CS) for the CCSID is growing: from time to time more characters will be added, and the character set to be used with the CCSID is the current maximal set associated with the code page.
This value is particularly useful for Unicode identifiers because the Unicode character set is still growing at regular intervals. A product that supports Unicode can use a CCSID with a growing character set and does not have to change the CCSID value every time characters are added to Unicode. There is also a fixed character set that corresponds to the growing set at a given point in time, usually a specific version of Unicode. This allows products that need precise definitions to use exact identifiers, while others can use the less specific growing values.
The following character set identifiers are used with the various code pages assigned for use in Unicode CCSID definitions.
Figure K-3. Character Set Identifiers in Support of Unicode
|Character Set||Plane Number||Comments|
|3001||0 - BMP||Unicode 2.0 character repertoire|
|3002-3003||-||Reserved for future Unicode definitions|
|3004||0 - BMP||Unicode 3.0 character repertoire|
|3005||0 - BMP||Unicode 4.0 character repertoire|
|3006||1||Unicode 4.0 character repertoire for Plane 1|
|3007||2||Unicode 4.0 character repertoire for Plane 2|
|3008||14||Unicode 4.0 character repertoire for Plane 14|
|3009||0 - BMP||Unicode 4.1 character repertoire|
|3010||1||Unicode 4.1 character repertoire for Plane 1|
|3011||0 - BMP||Unicode 5.0 character repertoire|
|3012||1||Unicode 5.0 character repertoire for Plane 1|
|3013 - 3095||-||Reserved for future Unicode definitions|
|3096||15||Unicode 4.0 generic PUA definition for Plane 15|
|3097||16||Unicode 4.0 generic PUA definition for Plane 16|
|3098||PUA of BMP||Reserved for BMP PUA full character set of CP 1448|
|3099||PUA of BMP||IBM Default PUA definition|
|65535||any||Growing character set, use the current maximal set|
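The choice between a growing and a fixed character set can be illustrated with a lookup over the version-specific rows of Figure K-3. This is a minimal sketch: the function name `character_set_for` and the use of version strings as keys are assumptions made for the example, and only the rows that name a Unicode version and a plane are included.

```python
GROWING_CS = 65535  # X'FFFF': growing character set, current maximal set

# (Unicode version, plane number) -> fixed character set, from Figure K-3.
FIXED_CS = {
    ("2.0", 0): 3001,
    ("3.0", 0): 3004,
    ("4.0", 0): 3005, ("4.0", 1): 3006, ("4.0", 2): 3007, ("4.0", 14): 3008,
    ("4.1", 0): 3009, ("4.1", 1): 3010,
    ("5.0", 0): 3011, ("5.0", 1): 3012,
}


def character_set_for(plane, version=None):
    """Pick a character set identifier for a plane.

    With no version, return the growing value; with a version, return the
    fixed identifier registered for that version and plane.
    """
    if version is None:
        return GROWING_CS
    try:
        return FIXED_CS[(version, plane)]
    except KeyError:
        raise LookupError(
            f"Figure K-3 lists no fixed character set for Unicode {version}, plane {plane}"
        ) from None
```

A product that only needs "whatever Unicode currently contains" would call `character_set_for(0)`, while one that must pin a repertoire would pass the version explicitly, for example `character_set_for(1, "4.1")`.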
CCSIDs Defined in Support of Unicode
The basic principle of CDRA is to identify data unambiguously by means of a unique, well-defined identifier. The CDRA Coded Character Set Identifier (CCSID) can be used to do this for Unicode data. Figure K-1 above shows how each Unicode CCSID is composed. Each CCSID can be expanded to a long form consisting of an Encoding Scheme (ES), a list of character set and code page pairs (CSn, CPn), and, optionally, ACRI (Additional Coding-Related Required Information).
Each CCSID also has a string type (ST) characteristic associated with it, which may be specified. For Unicode CCSIDs, an unspecified ST defaults to ST 10. These string types cannot be enforced on incoming data; however, any data originating within IBM should comply with the string type properties. For more information on string types, see "Types of Strings".
In the case of Unicode, each full CCSID definition has 18 CS, CP pairs. The first pair is for the Basic Multilingual Plane (BMP, or plane 0), not including the private use area (PUA). The second pair is for the PUA of the BMP. The remaining 16 pairs represent the character sets and code pages associated with planes 1 through 16. The special CS and CP value 65520 has been defined to represent an 'empty' Unicode plane and is used for all planes that are unpopulated. Empty planes may be omitted from a CCSID definition, provided the implementation has a well-defined means of determining which planes are included in the definition and which have been omitted because they are unpopulated.
CDRA uses a combination of 'growing' and 'fixed' CCSIDs for Unicode. CCSID 1200 was the first Unicode CCSID defined. It is a growing CCSID with an encoding scheme of 7200, and it was initially defined using code page 1400 with a growing character set (CS 65535) for the BMP (without the PUA) and code page 1449 with the fixed character set 3099 for the PUA of the BMP. Character set 3099 has the IBM-defined default PUA characters in the last 256 positions of the PUA and generic characters in all other PUA positions. Planes 1 through 16 were all 'empty'.
Because CCSID 1200 is a growing CCSID, its definition has expanded over time along with the definition of Unicode. Today CCSID 1200 still includes the initial two code page and character set pairs, and it has been expanded to include code pages 1401, 1402 and 1414 with growing character sets for planes 1, 2 and 14 respectively. It also includes code pages 1446 and 1447 for planes 15 and 16, with default character set definitions 3096 and 3097. Planes 3 through 13 inclusive remain empty and are represented in the full definition by the special code page and character set value 65520.
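The initial and current definitions of CCSID 1200 described above can be written out as lists of 18 pairs. The following sketch assumes a slot order of BMP, BMP PUA, then planes 1 through 16, as the text describes; the `Pair` type and the function names are hypothetical, and the authoritative definitions are those in the CDRA CCSID Repository.

```python
from typing import NamedTuple

EMPTY = 65520    # special CS and CP value for an unpopulated plane
GROWING = 65535  # growing character set


class Pair(NamedTuple):
    cs: int  # character set identifier
    cp: int  # code page identifier


def slot_for_plane(plane: int) -> int:
    """Slot 0 is the BMP, slot 1 the BMP PUA, slots 2..17 are planes 1..16."""
    if not 1 <= plane <= 16:
        raise ValueError("only planes 1 through 16 have their own slot")
    return plane + 1


def ccsid_1200_initial() -> list:
    pairs = [Pair(EMPTY, EMPTY)] * 18  # start with every slot empty
    pairs[0] = Pair(GROWING, 1400)     # BMP without the PUA
    pairs[1] = Pair(3099, 1449)        # IBM default PUA of the BMP
    return pairs


def ccsid_1200_today() -> list:
    pairs = ccsid_1200_initial()
    for plane, cp in ((1, 1401), (2, 1402), (14, 1414)):
        pairs[slot_for_plane(plane)] = Pair(GROWING, cp)
    pairs[slot_for_plane(15)] = Pair(3096, 1446)  # generic PUA, plane 15
    pairs[slot_for_plane(16)] = Pair(3097, 1447)  # generic PUA, plane 16
    return pairs
```

Comparing the two lists slot by slot shows which planes were populated as the growing CCSID expanded, while the CCSID value itself stayed at 1200.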
The following table lists the Unicode CCSIDs currently defined.
Figure K-4. Unicode CCSIDs
|CCSID Decimal||CCSID Hex||ESID||Endian Order||Comments|
|1200||04B0||7200||BE||UTF-16 BE with IBM PUA|
|1202||04B2||720B||LE||UTF-16 LE with IBM PUA|
|1204||04B4||720F||BE (in absence of BOM)||UTF-16 with IBM PUA|
|1205||04B5||720F||BE (in absence of BOM)||UTF-16|
|1208||04B8||7807||NA||UTF-8 with IBM PUA|
|1210||04BA||1808||NA||UTF-EBCDIC with IBM PUA|
|1212||04BC||7B0C||NA||SCSU with IBM PUA|
|1214||04BE||7B0E||NA||BOCU-1 with IBM PUA|
|1232||04D0||7500||BE||UTF-32 BE with IBM PUA|
|1234||04D2||750B||LE||UTF-32 LE with IBM PUA|
|1236||04D4||750F||BE (in absence of BOM)||UTF-32 with IBM PUA|
|1237||04D5||750F||BE (in absence of BOM)||UTF-32|
|5304||14B8||7807||NA||Unicode 2.0, UTF-8 with IBM PUA|
|5305||14B9||7807||NA||Unicode 2.0, UTF-8|
|9400||24B8||720C||NA||CESU-8 with IBM PUA|
|13488||34B0||7200||BE||Unicode 2.0, UTF-16 with IBM PUA|
|13489||34B1||7200||BE||Unicode 2.0, UTF-16|
|13490||34B2||720B||LE||Unicode 2.0, UTF-16 LE with IBM PUA|
|13491||34B3||720B||LE||Unicode 2.0, UTF-16 LE|
|13496||34B8||7807||NA||Unicode 3.0, UTF-8 with IBM PUA|
|13497||34B9||7807||NA||Unicode 3.0, UTF-8|
|17584||44B0||7200||BE||Unicode 3.0, UTF-16 with IBM PUA|
|17585||44B1||7200||BE||Unicode 3.0, UTF-16|
|17586||44B2||720B||LE||Unicode 3.0, UTF-16 LE with IBM PUA|
|17587||44B3||720B||LE||Unicode 3.0, UTF-16 LE|
|17592||44B8||7807||NA||Unicode 4.0, UTF-8 with IBM PUA|
|17593||44B9||7807||NA||Unicode 4.0, UTF-8|
|21680||54B0||7200||BE||Unicode 4.0, UTF-16 with IBM PUA|
|21681||54B1||7200||BE||Unicode 4.0, UTF-16|
|21682||54B2||720B||LE||Unicode 4.0, UTF-16 LE with IBM PUA|
|21683||54B3||720B||LE||Unicode 4.0, UTF-16 LE|
|25776||64B0||7200||BE||Unicode 4.1, UTF-16 with IBM PUA|
|25777||64B1||7200||BE||Unicode 4.1, UTF-16|
|25778||64B2||720B||LE||Unicode 4.1, UTF-16 LE with IBM PUA|
|25779||64B3||720B||LE||Unicode 4.1, UTF-16 LE|
|29872||74B0||7200||BE||Unicode 5.0, UTF-16 with IBM PUA|
|29873||74B1||7200||BE||Unicode 5.0, UTF-16|
|29874||74B2||720B||LE||Unicode 5.0, UTF-16 LE with IBM PUA|
|29875||74B3||720B||LE||Unicode 5.0, UTF-16 LE|
Detailed information (including the character set, code page pairs) for these CCSIDs can be found in the CCSID Repository.
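As an illustration of how the Endian Order column in Figure K-4 might be applied, the sketch below decodes a byte string according to a few of the growing CCSIDs. The mapping from CCSIDs to Python codec names is an assumption made for this example, not something CDRA defines, and Python's codecs do not distinguish the "with IBM PUA" variants. For the BOM-sensitive CCSIDs the byte order mark is checked explicitly, because the table specifies big endian in the absence of a BOM.

```python
import codecs

# Assumed mapping from Figure K-4 CCSIDs to Python codec names.
FIXED_ORDER_CODECS = {
    1200: "utf-16-be",
    1202: "utf-16-le",
    1208: "utf-8",
    1232: "utf-32-be",
    1234: "utf-32-le",
}

# BOM-sensitive CCSIDs: (little-endian BOM, big-endian BOM, LE codec, BE codec).
BOM_SENSITIVE = {
    1204: (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE, "utf-16-le", "utf-16-be"),
    1205: (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE, "utf-16-le", "utf-16-be"),
    1236: (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE, "utf-32-le", "utf-32-be"),
    1237: (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE, "utf-32-le", "utf-32-be"),
}


def ccsid_hex(ccsid: int) -> str:
    """Format a CCSID as in the 'CCSID Hex' column, e.g. 1200 -> '04B0'."""
    return format(ccsid, "04X")


def decode(data: bytes, ccsid: int) -> str:
    """Decode data tagged with one of the CCSIDs handled by this sketch."""
    if ccsid in BOM_SENSITIVE:
        bom_le, bom_be, codec_le, codec_be = BOM_SENSITIVE[ccsid]
        if data.startswith(bom_le):
            return data[len(bom_le):].decode(codec_le)
        if data.startswith(bom_be):
            return data[len(bom_be):].decode(codec_be)
        return data.decode(codec_be)  # big endian in the absence of a BOM
    return data.decode(FIXED_ORDER_CODECS[ccsid])
```

Python's generic `utf-16` and `utf-32` codecs fall back to the platform's native byte order when no BOM is present, which is why the sketch selects the big-endian codec itself rather than relying on them.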
In addition to the above CCSIDs, a number of 'special' CCSIDs have been defined for exclusive use by several IBM customers. These CCSID values have been assigned from the customer use range and are not intended for general use. The special CCSIDs allow customers to define their own character assignments for the private use area (PUA).