The strategy used in CDRA:
- Categorizes and orders the overall problem of different character sets and the associated architectural and development solutions
- Provides a starting base for implementations from which support for other larger sets can be added in a controlled manner
- Recognizes the significant development effort needed to overcome the widespread single-byte-per-character limitation and to address large character sets on a global scale.
The three components of CDRA strategy -- Architecture Base, Character Set Groups, and Levels -- are shown in Figure 4, and are detailed below.
Figure 4. CDRA Strategy.
CDRA strategy encompasses the four basic elements of CDRA, the character set groupings, and the levels of the architecture itself.
Architecture Base
The first component of CDRA strategy is the architecture base. This component provides a framework to solve current problems, and can be extended to cover future requirements. It consists of:
- A comprehensive identification system and a set of identifiers for currently used character sets; can be extended as required
- An initial set of services facilitating the use of CDRA identifiers; additional services can be defined as the architecture evolves
- A set of resources required by the services; additional resources can be defined as the architecture evolves
- A set of processing guidelines for functions that are affected by the representation aspects of graphic character data, as an aid to users.
Character Set Groups
The second component of CDRA strategy is the concept of character set groups. Graphic character sets used in different countries to support different languages have been grouped into sets with common properties. A selected few of these are defined as Interoperable Character Sets within each group.
To reduce the proliferation of graphic character sets and code pages in use, IBM and various standards organizations have collected and classified commonly used graphic characters into a few specific sets. Each of these sets has the following characteristics:
- It is a superset of many existing smaller graphic character sets
- It contains a base set of graphic characters required in a group of countries or in a group of national languages having some common characteristics
- It can be used in a broad range of common applications
- It permits preservation of graphic character integrity for interworking applications within a specific group of countries that use the set
- It is the target for convergence and migration in each country or group of countries.
Special graphic character sets supporting specific applications (such as APL, scientific word processing, or desktop publishing) are treated as extensions to the base sets.
With a few exceptions, every graphic character set in every country contains a common set of graphic characters: the uppercase English letters A to Z, the lowercase English letters a to z (4), the numerals 0 to 9, and 19 miscellaneous symbols (5). See Figure 45 in Appendix A for a complete list.
The types of services and resources needed differ from one character set group to another, and so do the implications of supporting each group. Character set groups are shown in Figure 5, and are described in the following sections.
Figure 5. CDRA's Character Set Groupings.
Commonly Used Character Sets
- Group Universal covers all of the 'large' character sets used in supporting Unicode and ISO-10646.
Large Character Sets
Over the past few years the IT industry has been very active in pursuing support of new, large-repertoire character sets such as Unicode. CDRA has been enhanced to support these recent additions to the existing body of character set encodings. The new coded character sets are supersets of the many character sets in use today. The large-repertoire character sets are ISO/IEC 10646-1, Information Technology -- Universal Multiple-Octet Coded Character Set (UCS), developed by the international standards bodies, and Unicode, developed by an industry consortium.
ISO/IEC 10646 specifies the universal coded character set. This character set is applicable to the presentation, processing, storage, transmission, interchange, and representation of the written forms of all of the world's languages and symbols. It is an architected definition for coded character set representation endorsed by the international community. The architecture itself:
- describes the general structure of the coded character set
- specifies an encoding space known as the Basic Multilingual Plane (BMP) and a set of graphic characters contained in this space (basically all of the world's currently used characters)
- specifies two forms of encoding, a four-byte canonical form known as UCS-4 and a two-byte BMP form known as UCS-2
- specifies the coded representations for control functions
- specifies how future additions to the coded character set will be managed
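As a rough illustration of the two encoding forms listed above, the following Python sketch packs a single code position into its two-byte (UCS-2) and four-byte (UCS-4) forms. Big-endian byte order and the helper names `to_ucs2` and `to_ucs4` are assumptions made for this example; they are not defined by CDRA.

```python
import struct

BMP_MAX = 0xFFFF  # last code position of the Basic Multilingual Plane


def to_ucs2(char: str) -> bytes:
    """Two-byte BMP form of one character (big-endian assumed)."""
    code = ord(char)
    if code > BMP_MAX:
        raise ValueError("UCS-2 can only represent BMP code positions")
    return struct.pack(">H", code)


def to_ucs4(char: str) -> bytes:
    """Four-byte canonical form of one character (big-endian assumed)."""
    return struct.pack(">I", ord(char))


print(to_ucs2("A").hex())  # 0041
print(to_ucs4("A").hex())  # 00000041
```

The same code position occupies two bytes in one form and four in the other; only the width of the container changes, not the value assigned to the character.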
The major interest in UCS-2 centers on the BMP. This plane of 256 rows by 256 cells (65,536 code positions) is divided into four zones, known as the A, I, O, and R zones. In the BMP the A-zone is used for alphabetic and syllabic scripts as well as various symbols. This area contains what is commonly referred to as the Latin-based scripts.
The I-zone contains Chinese, Japanese and Korean unified scripts.
The O-zone has been reserved to contain future characters as they are defined and standardized.
The R-zone has been deemed the restricted zone. It contains private-use characters (those which can be defined and used without the endorsement of any standards body), various presentation forms (as required for the Arabic scripts), and compatibility characters (used to bridge to some existing encoding standards).
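The zone layout can be sketched as a simple range lookup. The boundaries below follow the BMP layout published in ISO/IEC 10646-1:1993; they are an assumption brought in for this example, since the text above names the zones without giving their ranges. The function name `bmp_zone` is hypothetical.

```python
# Zone boundaries as in ISO/IEC 10646-1:1993 (assumed; not stated in this text).
BMP_ZONES = [
    ("A", 0x0000, 0x4DFF),  # alphabetic and syllabic scripts, symbols
    ("I", 0x4E00, 0x9FFF),  # unified Chinese, Japanese, and Korean ideographs
    ("O", 0xA000, 0xDFFF),  # reserved for future standardization
    ("R", 0xE000, 0xFFFF),  # private use, presentation forms, compatibility
]


def bmp_zone(code: int) -> str:
    """Return the zone letter for a BMP code position."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code position is outside the BMP")
    for name, first, last in BMP_ZONES:
        if first <= code <= last:
            return name
    raise AssertionError("zone table does not cover the BMP")


print(bmp_zone(ord("A")))  # A
print(bmp_zone(0x4E2D))    # I
print(bmp_zone(0xE000))    # R
```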
ISO/IEC 10646 can be implemented at three different levels. Level 1 allows no combining characters (9), Level 2 allows the use of some combining characters, and Level 3 allows all defined characters to be used.
The Unicode standard defines a large character set and specifies a number of different encoding formats. For detailed information on how CDRA handles Unicode, see Appendix K, CDRA and Unicode.
CDRA Levels
The third component of CDRA strategy shown in Figure 4 is the concept of Levels. Levels are used to distinguish between specific sets of available elements from the architecture base, as the architecture and the supporting implementations evolve over time. The relationship between the levels is depicted in Figure 6. Level 1 provided the initial seed of CDRA, which was substantially extended with the release of Level 2. The growth in Level 2, noted as extensions in the diagram, has been a series of enhancements rather than the pronounced change that was seen from Level 1 to Level 2.
CDRA Level 1
CDRA Level 1 defined an initial set of elements from the architecture base. It consisted of:
- A comprehensive identification system
- A set of CDRA identifiers for commonly used character sets
- A subset of these CDRA identifiers specifically for interoperability, to assist customers in identifying the strategic direction for coded character sets
- Generic concepts of tagging and difference management.
Character Data Representation Architecture - Registry, SC09-1391, contained the following:
- A registry of CDRA identifiers
- A set of graphic character data conversion tables for selected pairs of identifiers
- The principles used in creating these tables, and the specific mismatch management criteria associated with each.
CDRA Level 1 addressed all of the commonly used character sets within:
- Latin Alphabet Number 1 in Group 1
- Single-byte graphic character sets in Group 1a
- Single- and double-byte graphic character sets in Group 2.
CDRA Level 1 satisfied these objectives:
- To achieve consistent character data processing and interworking between different systems or system components within a country or within a specific group of countries having a common character set. The graphic character sets within each country will be those that apply across a wide range of basic applications.
- To allow coexistence between countries or groups of countries with different character sets. Interchange of data for storage and retrieval purposes or for data pass-through will be possible.
- To allow limited interworking with systems outside the country or group of countries. The extent of correct interworking between two countries will be limited to the subset of characters that is common between the two character set groups or subgroups.
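The third objective can be illustrated with a small sketch: text interworks correctly between two systems only if every character belongs to both repertoires. The example uses Python's built-in `cp437` and `latin_1` codecs purely as stand-ins for two single-byte code pages; it does not use CDRA identifiers or conversion tables, and the helper names `repertoire` and `interworks` are hypothetical.

```python
def repertoire(codec: str) -> set:
    """Collect the printable characters a single-byte codec can represent."""
    chars = set()
    for byte in range(256):
        ch = bytes([byte]).decode(codec, errors="ignore")
        if ch and ch.isprintable():
            chars.add(ch)
    return chars


def interworks(text: str, codec_a: str, codec_b: str) -> bool:
    """True if every character of text is in the subset common to both codecs."""
    common = repertoire(codec_a) & repertoire(codec_b)
    return all(ch in common for ch in text)


print(interworks("café", "cp437", "latin_1"))  # True
print(interworks("São", "cp437", "latin_1"))   # False
```

In the second call the character ã exists in Latin-1 but not in code page 437, so it falls outside the common subset and cannot be interchanged without loss.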
CDRA is primarily concerned with the coded graphic character set boundaries within and between different groups, rather than with political or geographical boundaries. However, these different types of boundaries are indirectly related to each other through the requirements for resources such as fonts, keyboards, and conversion tables.
CDRA Level 2
CDRA Level 2 included all Level 1 elements. In addition, it included definitions of functions called CDRA-Defined Services, along with the syntax for accessing these functions. These APIs were designed to be callable from any supported high-level language. A number of CDRA resources were needed to support the functions defined in Level 2. Level 2 included descriptions of the elements of those resources and some general principles for managing them. The resource data structures and the resource maintenance functions are implementation-specific.
Extensions
CDRA extensions now include support for:
- Seven new application programming interfaces (APIs)
- Encoding scheme, character set, and code page identifiers for Extended UNIX Code (EUC), Transmission Control Protocol (TCP), and Universal Multiple-Octet Coded Character Set (UCS)
- New conversion tables
- New conversion method definitions
- New encoding scheme identifier (ESID) definitions
- Newly registered coded character set identifiers (CCSIDs)
Figure 6. Architecture Levels.