Guideline F: Coded character sets

Overview



People work with characters from many different scripts, but computers operate only on binary bits, B'1' and B'0'. A mapping between characters and bit patterns is therefore required to perform any computational task.

A collection of graphic characters to be processed as an entity is grouped into a character set. One or more of these character sets are usually established to meet the minimum requirements of a country or a product. Sometimes supersets are created as a collection of related character sets. A unique binary bit pattern, called a code point, is assigned to every character in a set, obeying the rules of a chosen encoding scheme.

Some popular encoding schemes are IBM-PC, IBM EBCDIC (Extended Binary Coded Decimal Interchange Code), ASCII (American Standard Code for Information Interchange), ISO/IEC 7-bit and 8-bit, ISO/IEC 10646 UCS (Universal multiple-octet coded Character Set) and Unicode encoding formats such as UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE. An encoding scheme allocates the ranges of code points for control and graphic characters, and can preassign the code point for the SPACE character.
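As a small illustration of how the Unicode encoding formats listed above differ, the following Python sketch serializes one character, the Euro Sign (U+20AC), in several of them. The codec names are those of Python's standard library rather than IBM identifiers, and the helper name hex_bytes is hypothetical.

```python
# Serialize one character in several Unicode encoding formats.
# Codec names are Python's; hex_bytes is a hypothetical display helper.

def hex_bytes(text: str, codec: str) -> str:
    """Return the encoded bytes of `text` as space-separated hex pairs."""
    return text.encode(codec).hex(" ").upper()

EURO = "\u20ac"  # EURO SIGN

for codec in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{codec:10} {hex_bytes(EURO, codec)}")
```

The character is the same in every case; only the byte sequence differs. The BE and LE variants fix the byte order explicitly.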

The entire collection of code points that encompasses a particular character set forms a code page. A code page and its character set constitute a coded character set. Coded character set is synonymous with code set on the workstation platforms, and is loosely equivalent to a code page in the personal computer world. A coded character set identifies both a character set and a code page. Some literature uses the terms code page and coded character set interchangeably.

In any code page, each code point has no more than one graphic character associated with it, but a graphic character can be assigned to more than one code point. A code page can contain more than one character set.

Each character set, code page, and coded character set registered by IBM has a unique 5-digit decimal identifier with its leading zeros frequently omitted.

The same character may be represented by different code points in different coded character sets. The assignment of characters to code points varies from code page to code page. While two coded character sets may share a common set of characters, their code point assignments may be very different. Even when two coded character sets have a common character set and a common encoding scheme, there is no guarantee that the code points for all characters will be the same. The situation gets further complicated when different character sets and encoding schemes are involved. All of these situations create problems when two products try to communicate with each other or simply share data. Any time a conversion is done between code pages, there is a risk of data loss or corruption.
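The point that one character can sit at different code points can be checked with a short Python sketch. It assumes Python's built-in codecs cp037 (an EBCDIC code page), latin-1 (ISO/IEC 8859-1) and cp437 (an IBM PC code page) as sample code pages; the helper name code_point is hypothetical.

```python
# Look up the code point of one character in several code pages.
# Codec names are Python's: cp037 (an EBCDIC code page), latin-1
# (ISO/IEC 8859-1) and cp437 (an IBM PC code page).

def code_point(char: str, codec: str) -> int:
    """Return the single-byte code point of `char` in `codec`."""
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {codec}")
    return encoded[0]

for char in ("A", "\u00e9"):  # LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE
    for codec in ("cp037", "latin-1", "cp437"):
        print(f"{char} in {codec:8} X'{code_point(char, codec):02X}'")
```

In this sketch the letter A shares a code point in latin-1 and cp437 but not in cp037, while the accented letter has a different code point in each of the three.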

Code page conversion becomes necessary when products or software communicate with other operating systems or integrate with applications that use different encodings. When a product sends data to external sources, it must correctly use the code page conversion functions provided by the operating system or by conversion tools or services.
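A minimal sketch of such a conversion follows, assuming Python's built-in codecs stand in for the conversion services of a real platform. The helper name convert is hypothetical. It also shows how a conversion can lose data when the target code page lacks a character.

```python
# Convert between code pages by decoding to Unicode and re-encoding.
# Python's built-in codecs are a stand-in for platform conversion
# services; convert() is a hypothetical helper name.

def convert(data: bytes, source: str, target: str, errors: str = "strict") -> bytes:
    """Decode `data` from `source`, then encode it in `target`."""
    return data.decode(source).encode(target, errors)

ebcdic = "Caf\u00e9".encode("cp037")       # "Café" in an EBCDIC code page
print(convert(ebcdic, "cp037", "cp437"))    # same text, different bytes

# The Euro Sign has no code point in ISO/IEC 8859-1 (latin-1).
price = "10 \u20ac".encode("utf-8")
try:
    convert(price, "utf-8", "latin-1")      # strict mode refuses to lose data
except UnicodeEncodeError as exc:
    print("conversion failed:", exc.reason)

lossy = convert(price, "utf-8", "latin-1", errors="replace")
print(lossy)                                # the Euro Sign has been substituted
```

Failing loudly in strict mode is usually the safer default; a substitution policy such as "replace" silently changes the data and should be a deliberate choice.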

The Unicode Standard provides an alternative for computer interchange of data. It provides a unique number for every character, no matter what the platform, program or language. Products using Unicode eliminate the need for conversion, and thus the possible loss or corruption of data, when sharing data with other Unicode-based products.

One of the open source projects available today that provides full-featured Unicode services on a wide variety of platforms is IBM's International Components for Unicode (ICU). One of the services offered by ICU is the conversion service, which converts data between Unicode and many non-Unicode encodings or code pages.

In addition to conversion problems, unassigned code points in a code page may be assigned over time as requirements grow. Often the code page keeps its original identifier, creating the situation where products claim to support the same code page or coded character set and yet the character repertoire is different.

Example: Product A and Product B both claim to support the 8-bit ISO/IEC 8859 Part 7 Latin/Greek alphabet code page. When Product A sends the Euro Sign, Drachma Sign and Greek Ypogegrammeni characters (in code points X'A4', X'A5' and X'AA') to Product B, it discovers Product B cannot interpret the characters because Product B supports only the 1987 version of the coded character set, where X'A4', X'A5' and X'AA' were unassigned.
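The mismatch in this example can be simulated in Python. The sketch assumes that Python's iso8859_7 codec includes the three later additions, and it models Product B with a hypothetical decode_1987 function that rejects the code points named above; neither the function nor the UNASSIGNED_1987 set is part of any real product.

```python
# Simulate a receiver that implements only the 1987 repertoire of
# ISO/IEC 8859-7. UNASSIGNED_1987 and decode_1987 are hypothetical;
# the three code points are the ones named in the example.

UNASSIGNED_1987 = {0xA4, 0xA5, 0xAA}

def decode_1987(data: bytes) -> str:
    """Decode Greek text, rejecting code points unassigned in 1987."""
    for offset, byte in enumerate(data):
        if byte in UNASSIGNED_1987:
            raise ValueError(f"X'{byte:02X}' at offset {offset} is unassigned")
    return data.decode("iso8859_7")

# Product A encodes Euro Sign, Drachma Sign and Greek Ypogegrammeni.
message = "\u20ac\u20af\u037a".encode("iso8859_7")
print(message.hex(" ").upper())

try:
    decode_1987(message)   # Product B, 1987 version
except ValueError as exc:
    print("Product B cannot interpret the data:", exc)
```

Both sides name the same code page, so the identifier alone is not enough to detect the problem; the receiver only finds out when it meets the byte.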

Data processing done in Asia must support data streams containing a mixture of single-byte and non-single-byte characters. The data stream requires at least two character sets and two corresponding code pages, one for single-byte characters and the other for N-byte characters (where N is greater than one). In practice, many people erroneously use the term code page to indicate both code pages and character sets.
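A mixed stream of this kind can be inspected with a short Python sketch. Python's standard library has no codec named after an IBM mixed code page, so shift_jis is used here purely as an illustrative mixed single-byte/double-byte encoding; the helper name byte_widths is hypothetical.

```python
# Walk a mixed single-byte/double-byte stream. shift_jis is an
# illustrative stand-in, not an IBM code page; byte_widths is a
# hypothetical helper.

def byte_widths(text: str, codec: str = "shift_jis") -> list[tuple[str, int]]:
    """Return each character paired with its encoded length in bytes."""
    return [(char, len(char.encode(codec))) for char in text]

sample = "IBM\u65e5\u672c"   # "IBM" followed by two kanji
for char, width in byte_widths(sample):
    print(f"{char!r}: {width} byte(s)")

print(len(sample), "characters,", len(sample.encode("shift_jis")), "bytes")
```

Because character count and byte count differ, code that truncates or indexes such data by bytes can split a double-byte character in half.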

Example: The IBM mixed code page or combined code page 00942 used in IBM Japanese PC actually consists of two character sets and two corresponding code pages:


Character set                                    Code page
01172 (containing single-byte characters only)   01041
00370 (containing double-byte characters only)   00301

To address these problems, IBM has defined an architecture known as Character Data Representation Architecture (CDRA). The objective of CDRA is to achieve consistent processing, interchange, and representation of graphic character data within and across heterogeneous environments, and to preserve graphic character data integrity.

CDRA consists of the following components:

  1. Tags to uniquely and reliably identify the representation of graphic character data.
  2. A set of services and functions.
  3. Resources in support of the tags and services.
  4. Conventions for the use of the tags and services.

A coded character set identifier (CCSID) is defined by CDRA to concisely and precisely identify the coded character set, its underlying character set or sets, the corresponding code page or pages, the encoding scheme, and any other related information as required.
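The idea of tagging data with a CCSID can be sketched in Python as follows. This is an illustration only: TaggedData is a hypothetical type, not a structure defined by CDRA, and the table that maps CCSIDs to Python codec names is an assumption made for the example.

```python
# Tag character data with a CCSID so a receiver knows how to read it.
# The CCSID-to-codec table is an assumption made for this sketch, and
# TaggedData is a hypothetical type, not a CDRA-defined structure.
from dataclasses import dataclass

CCSID_TO_CODEC = {
    37: "cp037",      # an EBCDIC code page
    437: "cp437",     # an IBM PC code page
    819: "latin-1",   # ISO/IEC 8859-1
    1208: "utf-8",    # UTF-8
}

@dataclass(frozen=True)
class TaggedData:
    ccsid: int
    data: bytes

    def label(self) -> str:
        """Format the identifier as 5 decimal digits with leading zeros."""
        return f"{self.ccsid:05d}"

    def text(self) -> str:
        """Decode the bytes using the codec associated with the tag."""
        return self.data.decode(CCSID_TO_CODEC[self.ccsid])

same_bytes = b"\xc1\xc2\xc3"
for ccsid in (37, 819):
    tagged = TaggedData(ccsid, same_bytes)
    print(tagged.label(), repr(tagged.text()))
```

The same three bytes decode to different text under the two tags, which is why the tag has to travel with the data.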

Example: CCSID 00942 used in the IBM Japanese PC consists of:

Example: The figure below shows an ISO/IEC 8859-1 coded character set that has an IBM CCSID 00819, and consists of:

Figure 1: ISO/IEC 8859-1 Coded Character Set (IBM CCSID 00819)