This chapter introduces readers to the Character Data Representation Architecture. The architecture objectives, challenges, coverage, and concepts are presented to form the basis for understanding the following chapters.
Character Data Representation Architecture (CDRA) is an IBM* architecture that defines a set of identifiers, services, supporting resources, and conventions to achieve consistent representation, processing, and interchange of graphic character data (1) in data processing environments.
The overall objective of CDRA is to define a method of assigning and preserving the meaning and rendering of coded graphic characters through various stages of processing and interchange.
Specific objectives are to:
- Define an architected method that:
  - reliably distinguishes the machine representations of the various coded graphic character sets from each other
  - allows the unique derivation of the meaning and rendering of a graphic character in a given representation
  - covers a selection of coded graphic character sets that are supported on IBM and non-IBM systems or defined in existing or forthcoming standards
  - allows coexistence with existing character sets and migration to new convergence character sets
- Define the necessary supporting services (such as tagging) to allow consistent support of various coded graphic character sets, both within and across environments
- Select a minimum number of coded graphic character sets that satisfies most of today's text and data-processing applications and provide the necessary definitions for them so that they can be consistently supported in all applications, services, and devices within and across different environments
Character Data Representation Architecture is used by products and systems to address the following data integrity challenges:
- Proliferation of Different Character Codes
The primary problem in handling coded character sets is the variety of sets of characters and encoding schemes used to represent them. Technology has increased the variety of applications using computers, and the coded character support for these applications has been provided without an overall strategy.
- Graphic Character Data Recognition
The abilities to properly distinguish graphic character data in a universal manner and to attach a tag to the data are available only in some specific architected environments. The available architected methods are often inconsistent, have lagged behind worldwide requirements, and are constrained by the supporting products.
- Inconsistent or Incomplete Set of Identifiers
Few applications have consistent character support. Operating system environments rarely provide the necessary services to identify coded character data, leaving this responsibility to the applications.
- Use of Absolute Values (Hard-coding)
Many applications were designed to operate in specific environments with specific terminal characteristics. The internal representations of frequently used characters have been coded using absolute values. Character data misinterpretation occurs when environment changes are made and the initial character processing functions are no longer valid.
- Non-tagged Data
Traditional data processing environments were closed systems, and the coded character support was primarily governed by the device character handling capabilities. Data was rarely tagged.
The most visible end-user impact of all these concerns is in data interchange (2) within and across system environments. For example:
- The set of graphic characters supported on the IBM Personal Computer (PC) cannot be fully processed in non-PC environments. (Character set mismatch.)
- When a "dollar symbol" is sent in text from a U.S. mainframe computer to a U.K. mainframe computer, it often appears as a "pound sterling symbol" to the U.K. user. (Conversion based on byte integrity.)
- When a "lowercase a" is sent to a Katakana terminal in Japan, it appears as a "Katakana character" or gets irreversibly converted to an "uppercase A".
- Application and device code page differences may lead to users entering characters from the keyboard that are different from those specified by the application. For example, in Switzerland, the programmer must key "y-acute capital" ( Ý ) and "diaeresis" ( ¨ ) instead of "left square bracket" ( [ ) and "right square bracket" ( ] ), even though the bracket symbols are supported and are engraved on the Swiss keyboard.
- The variety of existing code point conversion tables produces inconsistent and often unpredictable results between different environments.
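The difference between preserving bytes and preserving characters, which underlies several of the examples above, can be illustrated with a minimal sketch. It assumes Python's built-in `cp037` and `cp500` codecs as stand-ins for two EBCDIC environments; the variable names are illustrative and are not part of CDRA.

```python
# Two EBCDIC code pages assign different code points to the square brackets.
text = "[x]"

# The sender encodes the text in code page 037.
sent = text.encode("cp037")

# Byte integrity: the receiver keeps the bytes unchanged but interprets
# them under code page 500, so the brackets change meaning.
misread = sent.decode("cp500")

# Character integrity: the bytes are converted between the two code pages,
# so the receiver sees the same characters even though the bytes differ.
converted = sent.decode("cp037").encode("cp500")
restored = converted.decode("cp500")

print(repr(text), repr(misread), repr(restored))
```

Only the second path preserves meaning, and it is possible only when the receiver knows which code page the sender used.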
Figure 1. Character Data Platform Domains
A character data domain may be described as an environment in which all character data has the same coded representation. This can be shown in a broad sense with respect to each category of system: midrange, mainframe, workstation, and personal. Data domains may be the same within a system, or may differ within a system, but typically the view is one of a character data domain per system category. There are well-known examples of the problems encountered as character data traverses data domains, leading to character data misrepresentation.
Businesses may spend a significant portion of their information technology budgets on workarounds, repairs, and education to resolve these data integrity problems.
Character Data Representation Architecture defines:
- An identification or tagging system to uniquely and reliably identify the representation of graphic character data
- A set of portable Application Programming Interfaces (APIs)
- A set of resources in support of the tags and services
- A set of conventions on the use of the tags and services
- A strategy for coded character set convergence.
This coverage is depicted in Figure 2.
Figure 2. Components of CDRA
CDRA components are categorized as the identification mechanism, functions, resources and processing guidelines.
Data can be classified in many ways, such as character data, byte strings, integer numbers, or floating-point numbers. Character data is further classifiable into control character data and graphic character data. Control characters include, for example, Horizontal Tab and Line Feed, which perform specific functions. Graphic characters include uppercase and lowercase letters (with and without accent marks), numeric digits 0 to 9, ideographs, and other symbols. Graphic character data streams can include embedded code extension controls, such as Shift-Out or Shift-In, used in the interpretation of the data following the controls. Figure 3 shows an example of these classifications.
CDRA deals with character data: primarily with graphic character data and, to a nominal extent, with control character data.
Figure 3. Types of data in a string
Various types of data may be contained in a data string. CDRA focuses on the coded graphic character data.
Control functions, defined by either single control characters or sequences of code points, can appear intermixed with graphic character data. From the CDRA point of view, there are two categories of control functions:
- Code extension
These functions modify the interpretation of subsequent code points representing graphic characters. Examples are: Shift-Out (SO), Shift-In (SI), Single-Shift 2 (SS2).
- All other control functions
Applications or architectures are responsible for handling specific control functions. CDRA provides an interface to query a set of control character encodings and uses the code point assignments for SPACE and SUB in its difference management functions.
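As a rough illustration of how code extension controls change the interpretation of the code points that follow them, the sketch below splits a mixed byte string into single-byte and double-byte segments at Shift-Out and Shift-In. It assumes the EBCDIC code points 0x0E (SO) and 0x0F (SI), and it assumes that neither value occurs inside a double-byte code point. The function name `split_segments` is hypothetical and is not a CDRA interface.

```python
SO = 0x0E  # Shift-Out: following code points are double-byte (assumed value)
SI = 0x0F  # Shift-In: following code points are single-byte (assumed value)


def split_segments(data: bytes) -> list[tuple[str, bytes]]:
    """Split a mixed string into ('single' | 'double', bytes) segments."""
    segments: list[tuple[str, bytes]] = []
    mode = "single"
    current = bytearray()
    for byte in data:
        if byte == SO and mode == "single":
            # Close the single-byte run and switch interpretation.
            if current:
                segments.append((mode, bytes(current)))
            current = bytearray()
            mode = "double"
        elif byte == SI and mode == "double":
            # Close the double-byte run and switch back.
            if current:
                segments.append((mode, bytes(current)))
            current = bytearray()
            mode = "single"
        else:
            current.append(byte)
    if current:
        segments.append((mode, bytes(current)))
    return segments


mixed = b"\xc1\xc2\x0e\x42\x43\x44\x45\x0f\xc3"
print(split_segments(mixed))
```

The controls themselves are dropped from the output because they carry no graphic character; they only select how the surrounding bytes are read.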
CDRA conversion methods and functions support the concept of string types in order to handle space-padded and null-terminated strings. All other aspects of control functions are outside the scope of CDRA.
Tagging
Tagging is the primary method to identify the meaning and rendering of coded graphic characters. It is the method by which:
- One or more CDRA identifiers can be associated with a coded graphic character in a data object (such as a file, a database table, or a data stream)
- The graphic character handling capability of a device (such as a display terminal) can be identified or selected
- The graphic character handling capability associated with a piece of processing logic can be identified.
The tag field may be in a data structure that is logically associated with the data object (explicit tagging), or it may be inherited from tag fields associated with other objects or with the computing environment (implicit tagging).
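The distinction between explicit and implicit tagging can be sketched as a simple lookup order. This is a minimal illustration, not a CDRA-defined interface: the `DataObject` class, the `effective_ccsid` function, and the environment default are hypothetical, and the tag is assumed to be a single CCSID value.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical default tag supplied by the computing environment.
ENVIRONMENT_CCSID = 500


@dataclass
class DataObject:
    payload: bytes
    ccsid: Optional[int] = None  # explicit tag, if the object carries one


def effective_ccsid(
    obj: DataObject,
    parent_ccsid: Optional[int] = None,
    environment_ccsid: int = ENVIRONMENT_CCSID,
) -> int:
    """Resolve the tag that applies to obj."""
    if obj.ccsid is not None:
        return obj.ccsid  # explicit tagging
    if parent_ccsid is not None:
        return parent_ccsid  # implicit: inherited from another object
    return environment_ccsid  # implicit: inherited from the environment


column = DataObject(b"\xc1\xc2\xc3", ccsid=37)
untagged = DataObject(b"\xc1\xc2\xc3")
print(effective_ccsid(column), effective_ccsid(untagged, parent_ccsid=37),
      effective_ccsid(untagged))
```

Resolving the tag in one place keeps the inheritance order visible and avoids each consumer guessing at a default.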
Underlying each code used to represent graphic characters is an encoding scheme. Encoding scheme definitions specify the coding space (number and allowable values of code points), the allocation of the code space for control and graphic characters, and other characteristics such as the number of bytes per code point and code extension methods permitted in that scheme.
The term integrity in CDRA means the preservation of a graphic character's meaning and rendering as identified by its graphic character global identifier (GCGID)(3) or graphic character UCS identifier (GCUID).
A character set is a specific collection of characters. There are many character sets in use today and the content of these sets may be quite similar or vastly different. CDRA recognizes two categories of character sets: interoperable sets and coexistence and migration sets.
Interoperable sets are the largest character sets for a specific set of languages and countries that:
- Do not contain environment-specific characters
- Do not contain application-unique characters
- Do not contain device-specific characters
- Ensure a high level of processing environment interoperability.
Coexistence and migration sets are those that:
- May contain environment-specific characters
- May contain application-unique characters
- May contain device-specific characters
- May be a subset or superset of an interoperable set
- May not be widely supported.
Services in support of CDRA are collections of functions such as setting and querying of tag values, manipulating tag values, defaulting tag values, or detecting differences in tag values. These services are not architected interfaces defined by CDRA.
The CDRA-defined functions have architected call interfaces, called CDRA Application Programming Interfaces (CDRA APIs), that facilitate application code portability across environments. These functions are callable using the conventions of any of the supported high-level languages.
Difference management is the process of managing different representations of graphic character data. It involves the ability to determine if a difference exists, and to deal with the difference in a predictable and consistent manner.
CDRA describes the general principles of how to manage the representation differences in coded graphic characters, and the criteria for creating character-data conversion tables. For consistency, a set of default conversion tables and conversion methods has been defined. Further, to minimize the differences and thereby minimize the potential data loss and data corruption problems, CDRA has identified character sets for interoperability.
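The two steps of difference management, detecting a difference and then handling it predictably, might look like the following sketch. It is not the CDRA conversion API: the `convert` function and the `CCSID_TO_CODEC` table are hypothetical, Python's built-in codecs stand in for CDRA conversion tables, and replacing unmappable characters with SUB is one assumed policy among several possible ones.

```python
# Hypothetical mapping from CCSID tags to Python codec names.
CCSID_TO_CODEC = {37: "cp037", 437: "cp437", 500: "cp500"}

SUB = "\x1a"  # substitute character used for unmappable characters


def convert(data: bytes, source_ccsid: int, target_ccsid: int) -> bytes:
    """Convert data between two tagged representations."""
    # Step 1: determine whether a difference exists.
    if source_ccsid == target_ccsid:
        return data
    # Step 2: deal with the difference character by character.
    text = data.decode(CCSID_TO_CODEC[source_ccsid])
    target = CCSID_TO_CODEC[target_ccsid]
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target)
        except UnicodeEncodeError:
            # The target character set has no such character.
            out += SUB.encode(target)
    return bytes(out)


# "A" exists in both sets; the PC shading character U+2591 does not
# exist in code page 037 and is replaced by SUB.
pc_data = "A\u2591".encode("cp437")
print(convert(pc_data, 437, 37).hex())
```

Substitution keeps the conversion total, but it is irreversible, which is why the architecture also tries to reduce how often such differences arise.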
Resources are machine representations of definitions associated with CDRA identifiers and supporting data for CDRA services. Collections of such CDRA resources are called CDRA Resource Repositories. The internal representation of the resources is implementation-specific.
Coexistence and migration refers to the current customer environment containing various levels of tagged and non-tagged data, and different levels of application support. CDRA provides the following means by which the current environments can coexist, and at the same time allow for a reasonable migration to a more architected environment:
- Wherever possible, the CDRA-defined Coded Character Set Identifier (CCSID) values are assigned to be the same as the corresponding code page identifiers.
- CDRA has defined CCSIDs for many coded character sets that are currently in use but have not been identified as interoperable. These CCSIDs are called Coexistence and Migration CCSIDs.
- CDRA provides many conversion tables that convert between the Coexistence and Migration CCSIDs and the Interoperable CCSIDs.
Some existing architectures and implementations have provisions for tagging. Some of these recognize code page identifiers (CP) only, while others recognize both character set identifiers (CS) and code page identifiers (CP). These identification methods are considered intermediate forms of CDRA's long-form identification, which is composed of an encoding scheme, character set and code page pairs, and additional required coding-related information.