Unicode CCSID Notes

Introduction
The basic principle of CDRA (Character Data Representation Architecture) is to identify data unambiguously by means of a unique, well-defined identifier. Each CCSID can be expanded to a long form consisting of an Encoding Scheme (ES), a list of character set and code page pairs (CSn, CPn), and, optionally, Additional Coding-Related Required Information (ACRI). For CDRA to manage all of the planes of Unicode, each plane is assigned a unique code page identifier. Thus, each full Unicode CCSID definition has 18 (CS, CP) pairs. The first pair is for the Basic Multilingual Plane (BMP, or plane 0), excluding the private use area (PUA). The second pair is for the PUA of the BMP. The remaining 16 pairs represent the character sets and code pages associated with planes 1 through 16. A special CP value, 65520, has been defined to represent an 'empty' code page for any Unicode plane that is unpopulated. Empty planes may be omitted from a CCSID definition as long as the implementation has a well-defined means of determining which planes are included in the definition and which have been omitted because they are unpopulated.
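The 18-pair layout described above can be illustrated with a small sketch that maps a Unicode code point to the position of the (CS, CP) pair that would cover it. The function name pair_index and the zero-based numbering are illustrative assumptions, not part of CDRA; the BMP private use area range U+E000 to U+F8FF comes from the Unicode Standard.

```python
# Illustrative sketch only: pair_index and its zero-based numbering are
# assumptions made for this example, not CDRA-defined names.
BMP_PUA_FIRST = 0xE000  # start of the BMP private use area (Unicode Standard)
BMP_PUA_LAST = 0xF8FF   # end of the BMP private use area (Unicode Standard)
MAX_CODE_POINT = 0x10FFFF


def pair_index(code_point: int) -> int:
    """Return the zero-based position (0..17) of the (CS, CP) pair for a code point.

    0      -> BMP (plane 0) excluding its private use area
    1      -> private use area of the BMP
    2..17  -> planes 1 through 16
    """
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    plane = code_point >> 16  # each plane holds 0x10000 code points
    if plane == 0:
        in_pua = BMP_PUA_FIRST <= code_point <= BMP_PUA_LAST
        return 1 if in_pua else 0
    return plane + 1  # plane 1 is the third pair, hence the offset


if __name__ == "__main__":
    for cp in (0x0041, 0xE000, 0x1F600, 0x10FFFF):
        print(f"U+{cp:04X} -> pair {pair_index(cp)}")
```

Under these assumptions the highest index is 17, which matches the count of 18 pairs: one for the BMP, one for its PUA, and sixteen for planes 1 through 16.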

Growing and Fixed
CDRA uses a combination of 'growing' and 'fixed' CCSIDs for each of the various encoding formats of Unicode. The growing CCSIDs are designed to represent the 'current maximal' character repertoire for a specific encoding format. For example, CCSID 1200 is defined as UTF-16 in Big Endian (BE) order; it has growing character sets for each of planes 0, 1, 2, and 14, and fixed maximal character sets for the private use area (PUA) of plane 0 (BMP) and for planes 15 and 16. Growing CCSIDs have been defined for a number of Unicode encoding formats, such as Unicode Transformation Format 32 (UTF-32), Unicode Transformation Format 8 (UTF-8), and Unicode Transformation Format EBCDIC (UTF-EBCDIC), as well as for the Standard Compression Scheme for Unicode (SCSU) and the Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8). As Unicode continues to evolve, these CCSIDs will evolve as well, always representing the latest published version of Unicode. A product or platform implementation would use a growing CCSID if it always needs the latest supported version of Unicode and should not have to change the CCSID value each time a new version of Unicode is published.

Fixed CCSIDs have been assigned for the various versions of Unicode, and again there are multiple CCSIDs for each version. For example, 17584 is the CCSID for UTF-16 BE with a fixed character repertoire as defined in the Unicode Standard, Version 3.0, with the IBM-defined default for the PUA of the BMP. Additional CCSIDs are defined for Unicode Version 3.0 with a different default PUA definition or a different endianness. Similar sets of fixed CCSIDs exist for Unicode 2.0 and 4.0. A product or platform implementation would use a fixed CCSID if it depends on the exact repertoire of Unicode. The characters included in a fixed Unicode CCSID will never change.

Currently, planes 3 through 13 inclusive have the special 'empty plane' code page associated with them. Code page 65520 has been reserved in the Code Page Registry to mean an empty plane of the Unicode encoding space. Code page 65520 is always associated with Character Set 0, which carries the 'empty' or 'not applicable' meaning. Implementations may choose to ignore these planes and deal only with the planes that contain assigned characters.
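As a minimal sketch of how an implementation might skip empty planes, the following keeps each (CS, CP) pair under an explicit label, so that dropping the empty ones still leaves a well-defined record of which planes are present. The labels, the populated_slots helper, and the -1 placeholder identifiers are hypothetical; only the values 65520 and 0 come from the text.

```python
# Values taken from the text above.
EMPTY_CP = 65520  # code page reserved to mean 'empty plane'
EMPTY_CS = 0      # character set always paired with code page 65520

# Hypothetical placeholder: the real CS and CP identifiers of the populated
# planes are not listed in this document, so (-1, -1) stands in for them.
PLACEHOLDER = (-1, -1)


def populated_slots(pairs: dict) -> dict:
    """Keep only the labelled (CS, CP) pairs whose code page is not the empty marker."""
    return {label: (cs, cp) for label, (cs, cp) in pairs.items() if cp != EMPTY_CP}


# Build a full 18-pair definition with planes 3 through 13 marked empty.
definition = {"BMP": PLACEHOLDER, "BMP PUA": PLACEHOLDER}
for plane in range(1, 17):
    is_empty = 3 <= plane <= 13
    definition[f"plane {plane}"] = (EMPTY_CS, EMPTY_CP) if is_empty else PLACEHOLDER

kept = populated_slots(definition)

if __name__ == "__main__":
    print(len(definition), "pairs in the full definition")
    print("populated:", ", ".join(kept))
```

Keying the pairs by label rather than by list position is one possible way to keep track of which planes were omitted; other bookkeeping schemes would work equally well.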

In addition to CCSIDs for the complete Unicode character repertoires, CCSIDs have been defined for each of the individual planes. These CCSIDs may be used by some products or platforms that are not capable of handling the entire Unicode repertoire as one large entity.

Unicode Control Codes
The CCSID repository identifies the Space, Substitute, New Line, Line Feed, Carriage Return, and End of File control character values associated with each CCSID. The 'value' is the value as it would appear in a data stream tagged with the corresponding CCSID. For example, the Substitute (SUB) control value for CCSID 17584 (Unicode 3.0, UTF-16 in BE order) is U'001A', while the corresponding value for CCSID 17586 (Unicode 3.0, UTF-16 in LE order) is U'1A00'. The latter value, as the CCSID indicates, is in LE order, with the low-order byte first.
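The byte-order difference in the SUB example can be reproduced with Python's built-in UTF-16 codecs. The mapping from CCSID to codec name below is an assumption made for illustration (it reflects only the encoding and byte order stated above, not the fixed Unicode 3.0 repertoire), and control_bytes is a hypothetical helper.

```python
# Assumed mapping for illustration: the two CCSIDs differ only in byte order,
# so Python's standard UTF-16 codecs stand in for them. The fixed Unicode 3.0
# repertoire of these CCSIDs is not modelled here.
CODEC_FOR_CCSID = {
    17584: "utf-16-be",  # Unicode 3.0, UTF-16, big endian
    17586: "utf-16-le",  # Unicode 3.0, UTF-16, little endian
}

SUB = "\u001a"  # the Substitute control character


def control_bytes(ccsid: int, char: str) -> bytes:
    """Return the bytes of char as they would appear in a stream tagged with ccsid."""
    return char.encode(CODEC_FOR_CCSID[ccsid])


if __name__ == "__main__":
    for ccsid in (17584, 17586):
        print(ccsid, control_bytes(ccsid, SUB).hex().upper())
```

The two results are the same code unit with its bytes swapped, which is why the repository lists a separate control value for each CCSID.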

String Types
CDRA has defined a number of string types (ST) to describe various characteristics of a data string. For Unicode CCSIDs, if the ST is not specified, it defaults to ST 10. These string types cannot be enforced on incoming data; however, any data originating within IBM should conform to the string type properties. More information on string types is available in the CDRA Reference document.
