Character Data Representation Architecture

Chapter 6. Difference Management



One of the key challenges in heterogeneous environments is to be able to deal with different coded graphic character sets and code pages in a consistent manner. Differences exist for reasons such as the origins of operating systems, the provision of national language support in different countries, or an application's requirements.

Migration to interoperable character sets and code pages for different countries and groups of countries will minimize, but not eliminate, the differences to be dealt with. Applications will continue to face this challenge, but now with assistance from CDRA.

Four functions, CDRCVRT, CDRMSCI, CDRMSCP, and CDRMSCC, were defined earlier (in "Chapter 5. CDRA Interface Definitions") in support of difference management. The present chapter describes the concepts behind difference management, the principles and criteria in designing the contents of conversion tables, and some aspects of managing the selection of the required conversion methods and tables.

Concepts

Difference management is a process by which differences in coded graphic character representation of data are recognized and dealt with. Difference-detection mechanisms must be placed in appropriate locations within a system in order to determine if such differences do exist.

Differences in the data representation and the processing capabilities of an application are used to trigger a difference management action. The choices may be to convert the graphic character data, to leave it as it is, or to terminate the current function altogether. CDRA provides assistance when a choice to convert the data has been made. Conversion is viewed as a tool in difference management. The results of difference management within CDRA will be dependent upon the conversion tables and the conversion methods chosen.
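The three choices described above can be illustrated with a minimal sketch. The function name, the `Action` enum, and the two CCSID sets are hypothetical and are not part of CDRA; the policy shown (leave supported data alone, convert when a conversion is available, otherwise terminate) is only one plausible way an application might decide.

```python
from enum import Enum


class Action(Enum):
    CONVERT = "convert"
    LEAVE_AS_IS = "leave as is"
    TERMINATE = "terminate"


def choose_action(data_ccsid, supported_ccsids, convertible_ccsids):
    """Pick a difference management action for data tagged with data_ccsid.

    supported_ccsids:   CCSIDs the application can process directly (assumed input)
    convertible_ccsids: CCSIDs for which a conversion is available (assumed input)
    """
    if data_ccsid in supported_ccsids:
        return Action.LEAVE_AS_IS      # no difference to manage
    if data_ccsid in convertible_ccsids:
        return Action.CONVERT          # a difference exists and can be converted
    return Action.TERMINATE            # a difference exists and cannot be handled
```

For instance, an application that processes only CCSID 00850 and can convert from CCSID 00500 would convert data tagged 00500 and terminate on any other tag.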

A general view of the difference management process is shown in Figure 19.

The query function (see "Querying Tag Values") assists in finding the relevant tag values to decide if a difference exists.

Figure 19. Difference Management Process Flow


Do Not Convert

Potential reasons for not converting the data are as follows:

Convert

The simplest form of conversion is the case where the input and output character sets are equivalent, but the code point assignments for the characters are different. Here, all the matched characters will only need to have their code points mapped.
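As a rough illustration of this simplest case, the sketch below builds a 256-entry code point map between two single-byte EBCDIC encodings. It uses Python's built-in `cp037` and `cp500` codecs as stand-ins for CCSIDs 00037 and 00500; that substitution is an assumption of this example, not a CDRA-defined table. Because both codecs cover the same characters, every code point has a match and the table is a pure rearrangement.

```python
# All 256 single-byte code points, in order.
ALL_BYTES = bytes(range(256))

# Entry i of the table is the cp500 code point of the character that
# cp037 assigns to code point i (Python codecs used as stand-ins).
TABLE_037_TO_500 = ALL_BYTES.decode("cp037").encode("cp500")


def convert_037_to_500(data: bytes) -> bytes:
    """Map each input code point to the output code point of the same character."""
    return data.translate(TABLE_037_TO_500)
```

The characters themselves are unchanged by this conversion; only the byte values that represent them differ.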

The more general form of conversion must deal with input and output character sets within which only a subset of the characters are equivalent.

During a conversion, only the common set of coded graphic characters can be preserved. Management of the remaining unmatched characters depends on the nature and context of the data. Conversion with mismatches can generate converted data that may not have an assigned graphic character meaning in the output. Such results of conversion may not be acceptable to an application.

CDRA has defined criteria for dealing with mismatches during the conversion process. The specific criterion to be used is reflected in the content of the conversion tables (and the logic that uses them) that are used in the conversion process. A set of default conversion tables has been defined to map between specific pairs of CCSIDs according to the most appropriate criterion, with consistency as the goal. The use of such tables enhances the consistency among implementing products when performing coded graphic character data conversion.

One of CDRA's goals is to minimize the loss of coded graphic character data during conversion. The interoperable character sets and associated CDRA-defined conversion tables help to address this goal by maximizing graphic character integrity within a character-set group or subgroup.

Generic data conversion process

Once the decision to convert has been made, a generic data conversion process can be used.

A generic data conversion process contains many elements, one of which is the graphic character data conversion process. Figure 20 shows the elements of a typical data conversion procedure.

The different elements and their functions are:

Figure 20. Generic Data Conversion Process


CDRA defines only the graphic character data conversion part of the overall data conversion process. A limited number of control characters are addressed as part of handling different string types (see "Types of Strings") and as part of control character mappings (see "Pairings of Code Points"). Other control characters are treated as bytes, and are dealt with according to mismatch management criteria.

Separation of Graphic Characters

For correct results, the caller of the CDRA conversion function should ensure that the input string does not contain characters other than graphic character data.

Each one of these conversion modules may permit direct access by an application. Here, the application assumes the responsibility for the functions of parsing and output generation. For example, when an application creates a sequential file on a PC, only the application knows where the string of bytes is broken into logical substrings and which of these substrings represent graphic character data. Conventions for organizing a file, such as CR, LF to mark the end of a record, must be known and handled by the parsing logic and the output generator. Handling of the data organization for output is not performed by the graphic character conversion function.

Misinterpretation of Data

If the separation of graphic character data from other classes of data is not done, the graphic character conversion function can find byte strings that may or may not have graphic character meanings. The criterion selected for mismatch management specifies how to convert such byte strings if they appear in the input string. However, the problem of possible misinterpretation cannot be entirely dealt with using the CDRA conversion criteria alone.

For example, suppose a data byte holds a counter value of 74, which has the same bit configuration as code point X'4A', the left square bracket in System/370 CCSID 00500. Conversion maps this byte to X'5B', the code point for the left square bracket in CCSID 00850 on a PC. If the PC then interprets the byte as a count, the value is now 91. Neither the CDRA identifiers nor the graphic character conversion process can deal with this kind of misinterpretation.
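The counter example can be reproduced with a short sketch. Python's `cp500` and `cp850` codecs are used here as stand-ins for CCSIDs 00500 and 00850, and the one-byte-counter record layout is invented for illustration.

```python
# A record made of a one-byte binary counter followed by text in cp500.
counter = 74                                   # same bits as X'4A'
record = bytes([counter]) + "[x]".encode("cp500")

# Incorrect: the whole record is passed through character conversion,
# so the counter byte is treated as a left square bracket.
converted_all = record.decode("cp500").encode("cp850")

# Correct: the caller separates the graphic character data first and
# converts only that part, copying the counter byte unchanged.
converted_text_only = record[:1] + record[1:].decode("cp500").encode("cp850")
```

In the first case the counter byte becomes X'5B' and reads as 91 on the receiving side; in the second it is still 74. Only the caller knows which bytes are text, which is why the separation cannot be done by the conversion function itself.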

Types of Strings

A graphic character string may have a number of characteristics or properties associated with it. Some of these, such as the number of bytes per character, are inherited from the encoding scheme. Others, such as how a string is terminated, the orientation of the string, or whether the characters are shaped or unshaped, cannot be determined from the CCSID tag or encoding scheme alone. The following string types are defined for use within CDRA.

String Type 0: CDRA Default

If there is no string type specified in a CCSID definition or as a parameter on an API call then the string type is zero. A string type of 0 means that the character data string is semantically defined by the CCSID. All of the characteristics of the string can be determined from the CCSID definition alone. No additional information is needed.

String Type 1: Null-terminated string

A variable-length graphic character string, which is terminated by a character whose code point has a binary value of zero. The number of bits in the code point used to represent the terminating character (the null terminator) is the smallest number of bits allowed for code points in the encoding scheme used.

The above definition is used in the following examples to determine the null-termination character:

The above definitions reflect the current usage and definitions of a null-terminated string in the C programming language. A length value may additionally be provided for the string; however, the null terminator takes precedence over the length value.

A null-terminated string is given a string type identifier of 1 in CDRA function calls and in the Graphic Character Conversion Selection Table (GCCST).
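A minimal sketch of locating the null terminator follows. The function name and parameters are hypothetical; `unit_width` stands for the smallest code point size of the encoding scheme, expressed in bytes, and must be supplied by the caller. The scan steps in whole units so that a zero byte inside a wider code point is not mistaken for the terminator, and a terminator found before the supplied length takes precedence over that length.

```python
from typing import Optional


def terminated_length(buf: bytes, unit_width: int,
                      length: Optional[int] = None) -> int:
    """Return the byte length of the string up to, not including, the terminator.

    If no terminator is found, the supplied length (or the buffer size) is used.
    """
    limit = len(buf) if length is None else min(length, len(buf))
    null = bytes(unit_width)                   # unit_width zero bytes
    for offset in range(0, limit - unit_width + 1, unit_width):
        if buf[offset:offset + unit_width] == null:
            return offset                      # terminator wins over length
    return limit
```

With a two-byte unit, the buffer `b"\x00A\x00\x00\x00B"` ends after the first two bytes; scanning the same buffer with a one-byte unit would wrongly stop at offset 0.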

String Type 2: Padded string

A graphic character string that is padded with one or more space characters. Padding is done only when there is unused storage space available in an area containing the unpadded string, and when it can be done without violating the semantics of the encoding scheme of the CCSID of the string. The resultant space padded string will be a well-formed string following the semantics of its encoding scheme.

Caution: When space padding is done as part of graphic character conversion, it is not possible to distinguish (in the resultant output buffer) the space pad characters that are generated as a result of conversion maps from those generated by the padding process. If a subsequent string operation removes the space characters, there can be a potential loss of the converted pad characters.

String Type 3: "Special Newline Nextline Handling"

String type 3 has special meaning in certain IBM products. If a character data string is defined as string type 3, then it is semantically defined by the CCSID with the additional property that any newline control characters in the string should be treated as linefeed control characters and, likewise, any linefeed control characters should be treated as newline control characters.

String Types 4 - 15: String Types for Bidirectional Languages

In the case of bidirectional languages, the string type is used to describe characteristics that are not implied by the CCSID or Encoding Scheme. The string characteristics which are defined for the bidirectional string types are:

Following is a brief description of each of these characteristics and their possible values.

Text Type
The text type characteristic states what kind of algorithm is to be used when transforming the text layout. The text type can be visual (reading sequence), implicit (typing sequence), or explicit (includes directional control characters in the text segments explicitly). A visual algorithm copies entire lines of text as they appear without bothering about existing embedded directional segments. An implicit algorithm recognizes directional segments based on the natural directionality of the characters (i.e., right to left for Arabic characters and left to right for English characters) and performs segment inversions accordingly. An explicit algorithm recognizes directional segments and performs inversions based on special, explicit, directional controls embedded in the text.

Example:
Visual, shaped text: [Figure: Arabic visual string]

Implicit, unshaped text: [Figure: Arabic implicit string]

Numeric Shaping
The numeric shaping characteristic states whether the numbers embedded in a text string will have the shapes that are used in English (called Arabic digits), or the national numerical shapes. Possible values for this characteristic are Arabic, Hindi, or passthrough. When passthrough is specified, numeric digits are left as they appear in the data string (no numeric shaping occurs).

Orientation
The orientation of a data string, together with the text type, indicates the storage or display sequence of the Arabic and English characters. The possible values for this characteristic are left to right (LTR), right to left (RTL), Contextual LTR, and Contextual RTL. The term contextual indicates that the orientation should be taken from the context of the data. The data may contain "strong" characters that are either orientation left or orientation right. The term following contextual (LTR or RTL) specifies the default orientation when the data is orientation-neutral (i.e., there are no strong characters).

Text Shaping
The text shaping characteristic of a bidirectional string type indicates whether text shaping is performed. This is relevant for the scripts of Arabic languages (including Farsi and Urdu), where characters assume different shapes (initial, medial, final, or isolated) according to their position in a word and the connectivity traits of the character and its surroundings.

Symmetrical Swapping
The symmetrical swapping characteristic states whether, in a right-to-left text phrase, some directional pairs of characters (such as left and right parentheses, greater-than and less-than signs, left and right brackets, left and right braces) will be interchanged in order to preserve the logical meaning of the inverted text.
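The effect of symmetrical swapping can be shown with a deliberately simplified sketch. It inverts a single text segment by plain string reversal and, if swapping is on, interchanges the directional pairs listed above. Real bidirectional layout transformation involves far more than this, and the function name and pair list are assumptions of the example.

```python
# Each directional character maps to its mirror-image partner.
SWAP_PAIRS = str.maketrans("()<>[]{}", ")(><][}{")


def invert_segment(segment: str, symmetric_swapping: bool) -> str:
    """Reverse a segment, optionally interchanging directional pairs."""
    inverted = segment[::-1]
    if symmetric_swapping:
        inverted = inverted.translate(SWAP_PAIRS)
    return inverted
```

With swapping off, inverting `(abc)` gives `)cba(`, where the parentheses no longer enclose the text. With swapping on, the result is `(cba)`, which keeps the logical meaning.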

Each CCSID that is defined in support of a bidirectional language may have a default string type associated with it. If a string is tagged with a CCSID for a bidirectional language and no string type is explicitly specified, then the default string type is to be used. If no default string type has been specified, then the string type is defined to be 0.

The following table shows the specific characteristics of each bidirectional string type that have been defined to date.

String Type  Text Type  Numeric Shaping       Orientation     Text Shaping  Symmetrical Swapping
4            Visual     Passthrough           LTR             Shaped        Off
5            Implicit   Arabic                LTR             Unshaped      On
6            Implicit   Arabic                RTL             Unshaped      On
7            Visual     Passthrough           Contextual      Unshaped-Lig  Off
8            Visual     Passthrough           RTL             Shaped        Off
9            Visual     Passthrough           RTL             Shaped        On
10           Implicit   Arabic                Contextual LTR  Unshaped      On
11           Implicit   Arabic                Contextual RTL  Unshaped      On
12           Implicit   Arabic                RTL             Shaped        Off
13           Visual     Hindi / Arabic-Indic  LTR             Shaped        Off
14           Visual     Hindi / Arabic-Indic  RTL             Shaped        Off
15           Visual     Hindi / Arabic-Indic  RTL             Shaped        On
16           Visual                           Contextual LTR  Shaped        Off
17           Visual                           Contextual RTL  Shaped        Off

Such strings are often interchanged in heterogeneous (or distributed) environments between applications that can support these string types. If the data conversion methods used for graphic character mapping are enhanced to deal with the parsing and assembly aspects of converting between specific types of strings, a degree of efficiency in performance can be attained. With this in view, provisions are made in the graphic conversion functions of CDRA to allow string-type specifications to select conversion methods that can deal with various string types besides converting the graphic characters.

Generic Graphic Character Conversion

A generic graphic character conversion function (see "Conversion Functions") converts an input graphic character string represented in a CCSID (the input CCSID) to an output string according to the CCSID specified for the output (the output CCSID). The interpretation of the input character string and the generation of the code points of the output character string adhere to the definitions of CCSIDs (see "Tagging in CDRA").

The results of the conversion process will be the following:

Conversion of strings between some CCSIDs cannot maintain the same byte-length between the input and output strings. For example, the coded representation of a string containing a mixture of Katakana characters (single-byte code points) and Japanese ideographic characters (double-byte code points):

A function that converts the data between the two coding methods in this example will find a byte-length difference of at least two bytes. Provisions must be made to accommodate differences in byte lengths when developing and using conversion functions.

The designer of the conversion program can reference the CCSID elements and their definitions from CDRA documents. The logical steps in performing the conversion are:

  1. Select an appropriate conversion method (see "Appendix B. Conversion Methods") based on the encoding schemes associated with the input and the output.
  2. Select one or more conversion tables based on the CS and CP elements of the input and output CCSIDs. The following section describes the criteria that can be used for defining the contents of the conversion tables.

The various steps involved in selecting the conversion methods and the associated tables for different conversion criteria are described in "Graphic Character Conversion Selection Table (GCCST) Resource".
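The two selection steps can be sketched as a pair of lookups. Everything here is illustrative: the `Ccsid` record, the method and table names, and the registry contents are hypothetical and do not reproduce the GCCST resource or any CDRA-defined method identifier. The ES, CS, and CP values shown for CCSIDs 00500 and 00850 are given as example data only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ccsid:
    ccsid: int
    es: int   # encoding scheme identifier
    cs: int   # character set identifier
    cp: int   # code page identifier


# Example entries (illustrative values).
CCSID_500 = Ccsid(500, 0x1100, 697, 500)
CCSID_850 = Ccsid(850, 0x2100, 980, 850)

# Step 1 registry: (input ES, output ES) -> conversion method (hypothetical names).
METHODS = {(0x1100, 0x2100): "single-byte table lookup"}

# Step 2 registry: ((input CS, CP), (output CS, CP)) -> table (hypothetical names).
TABLES = {((697, 500), (980, 850)): "table_00500_to_00850"}


def select_conversion(src: Ccsid, dst: Ccsid):
    """Return (method, table) for converting from src to dst."""
    method = METHODS[(src.es, dst.es)]                        # step 1
    table = TABLES[((src.cs, src.cp), (dst.cs, dst.cp))]      # step 2
    return method, table
```

A missing entry raises `KeyError`, which a caller could treat as "no conversion available" for that pair.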

Defining the contents of conversion tables

The input and output CCSIDs identify the CS, CP pairs. The content of a conversion table is determined by the input and output CS, CP pairs to be mapped. When there is more than one set of CS, CP pairs in the input to be matched with more than one set in the output, the principles described in "Pairings of Code Points" are used to determine the mapping.

If the input CS, CP pair has some common graphic characters that are split between two output CSs, then corresponding support is needed in the conversion method and in tables of the appropriate type (see "Appendix B. Conversion Methods").

After the particular characters and their code point assignments are examined, they are categorized, and decisions are made about pairing the input and output code points.

A code point can be placed into one of the following categories:

  1. SPACE: the code point is assigned to the SPACE character GCGID SP010000
  2. Valid Graphic: the code point is assignable to a graphic character in the encoding structure, and is assigned a graphic character in the identified character set
  3. Code Extension: the code point is assignable to a control character, and its assigned value is a valid code extension control character or the first character of a multiple-character code extension control as determined by the encoding scheme identified
  4. Invalid Graphic: the code point is assignable to a graphic character in the encoding structure, but either it is not assigned any graphic character or it is assigned one that is not in the character set identified
  5. Single Control: the code point is assignable to a control character, and is assigned a permitted control character for the application
  6. Start of Control: the code point is assignable to a control character, and is assigned a permitted start of control sequence for the application
  7. Invalid Control: the code point is assignable to a control character but is not assigned any control character, or it is assigned a character that is valid neither for the application nor as a code extension control defined in the encoding scheme.
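The seven categories can be written down as an enumeration, which is convenient when annotating a code page before pairing decisions are made. The classifier below is a toy: it assumes a 7-bit layout with SPACE at X'20', graphic positions X'21' to X'7E', and all other positions assignable to controls, and it does not attempt to detect code extension controls or start-of-control sequences. The names are hypothetical.

```python
from enum import Enum


class Category(Enum):
    SPACE = 1
    VALID_GRAPHIC = 2
    CODE_EXTENSION = 3
    INVALID_GRAPHIC = 4
    SINGLE_CONTROL = 5
    START_OF_CONTROL = 6
    INVALID_CONTROL = 7


def classify(code_point: int, assigned_graphics: set,
             permitted_controls: set) -> Category:
    """Classify a code point in the assumed 7-bit layout described above."""
    if code_point == 0x20:
        return Category.SPACE
    if 0x21 <= code_point <= 0x7E:             # graphic positions
        if code_point in assigned_graphics:
            return Category.VALID_GRAPHIC
        return Category.INVALID_GRAPHIC
    if code_point in permitted_controls:       # control positions
        return Category.SINGLE_CONTROL
    return Category.INVALID_CONTROL
```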

Pairings of Code Points

The following general principles are used in pairing the input and output code points:

Criteria for character set mismatch management

Character set mismatch management is necessarily context- or application-sensitive: what is best for one application may not be appropriate for another. Sometimes arbitrary decisions have to be made, depending on the specific set of mismatched characters. Some general criteria for mismatch management are:

The application of these criteria results in different pairings of input and output code points for mismatched characters in conversion tables.

The above criteria are discussed in the following sections.

Round Trip Integrity

The objective of this criterion is to send data from one system to another one that has different representations of character data, and retrieve it without loss. Often the "do not convert" choice is not available. For example, data stored in a System/370 database is configured to have all its graphic character data in one CCSID. If it acts as a remote repository for data from a PC application, or from an application in another System/370 using a different CCSID, the data must be converted to the configured CCSID. The data is intended to be retrieved by the same application without loss when it is converted back for use in its original CCSID.

Interpretation of Converted Data in the Output CCSID

The tag associated with the converted data will be the CCSID of the output. The data will be interpreted -- possibly misinterpreted -- in the output environment. In the absence of any validation or filtering services, data that has been converted using the round trip criterion cannot be distinguished from data that has been created locally in the system, or that has been converted from another CCSID using the round trip criterion. Data conversion is only one of the possible generators of code points that have no graphic meaning in a data object tagged with a CCSID. An application that generates hexadecimal constants and stores them along with other textual data is another possible generator.

Feasibility of Round Trip

Round trip mapping is always feasible for a common set of graphic characters or for a set of control characters with the same mnemonics, assuming there are no control sequences involved. The common sets of graphic and control characters within the initial input and output CCSIDs can be preserved irrespective of how many intermediate CCSIDs may be involved, provided that all the intermediate CCSIDs contain the same common sets.
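For single-byte data, round trip integrity amounts to the forward table being a one-to-one mapping of all 256 code points, so that an inverse table exists. The sketch below builds such a pair of tables using Python's `cp500` and `cp850` codecs as stand-ins for the input and output CCSIDs. Matched characters are paired with each other; unmatched input code points are assigned to the output code points left over. That leftover assignment is arbitrary here and is an assumption of the example; the CDRA default tables make specific choices that this sketch does not reproduce.

```python
def build_round_trip_table(src_codec: str, dst_codec: str) -> bytes:
    """Build a 256-entry one-to-one table from src_codec to dst_codec."""
    table = [None] * 256
    used = set()          # output code points already paired
    unmatched = []        # input code points with no matching character
    for cp in range(256):
        try:
            out = bytes([cp]).decode(src_codec).encode(dst_codec)
        except UnicodeError:
            out = b""
        if len(out) == 1 and out[0] not in used:
            table[cp] = out[0]
            used.add(out[0])
        else:
            unmatched.append(cp)
    # Pair the remaining inputs with the remaining outputs (arbitrary order).
    free = [cp for cp in range(256) if cp not in used]
    for cp, target in zip(unmatched, free):
        table[cp] = target
    return bytes(table)


def invert_table(table: bytes) -> bytes:
    """Return the inverse of a one-to-one 256-entry table."""
    inverse = bytearray(256)
    for source, target in enumerate(table):
        inverse[target] = source
    return bytes(inverse)


FORWARD = build_round_trip_table("cp500", "cp850")
BACKWARD = invert_table(FORWARD)
```

Any byte string converted with `FORWARD` and then with `BACKWARD` returns to its original value, while characters common to both encodings keep their graphic meaning in between.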

The round trip of all remaining code points from a particular input to an output and back is feasible only under the following conditions:

When round trip mapping is not feasible or not desirable for a specific application, other criteria must be used.

Pairing of Code Points Using Round Trip

In addition to the general principles described in "Pairings of Code Points", the following principles are used when the round trip integrity criterion is chosen:

Character Replacements

When round trip integrity is not feasible or desired, an alternative is to permanently replace each mismatched character in the input character set with its nearest equivalent in the output character set. The criterion for determining the nearest equivalent depends on the context within which the converted data will be used. For display and printing purposes, the nearest visual representation may be chosen; for processing purposes, a character with the nearest meaning may be selected. If neither criterion applies, an arbitrary character may be chosen from the output character set.
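A sketch of the character replacement criterion follows. It uses Python's `cp850` and `cp437` codecs as stand-ins for an input and an output CCSID, and a small hand-written table of nearest visual equivalents. The replacement choices and the fallback character are assumptions made for this example, not CDRA-defined pairings.

```python
# Nearest visual equivalents chosen for this example only.
REPLACEMENTS = {
    "\u00f8": "o",   # o with stroke  -> o
    "\u00d8": "O",   # O with stroke  -> O
    "\u00d7": "x",   # multiplication sign -> x
}


def convert_with_replacement(data: bytes, src_codec: str, dst_codec: str,
                             fallback: str = "?") -> bytes:
    """Convert data, replacing mismatched characters permanently."""
    out = []
    for ch in data.decode(src_codec):
        try:
            ch.encode(dst_codec)               # character exists in the output set
            out.append(ch)
        except UnicodeEncodeError:
            out.append(REPLACEMENTS.get(ch, fallback))
    return "".join(out).encode(dst_codec)
```

The replacement is one-way: after conversion, a replaced character cannot be told apart from an original `o`, so the input cannot be recovered.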

Pairing of Code Points Using Character Replacements

In addition to the general principles described in "Pairings of Code Points", the following additional principles are used when the character replacement criterion is chosen:

Enforced Subset Match

The enforced subset match criterion guarantees the preservation of the subset of characters that are common to both the input and output character sets. Any character not in this common subset will be replaced with a unique character that indicates that a substitution has occurred.

Wherever possible, CDRA recommends that the standardized control character SUB (substitute) be used for this purpose. Alternatives for "substitution character" may be declared as part of the CCSID resource definitions. The default SUB definition for each CCSID is included as part of the CCSID definition found in "Appendix C. CCSID Repository".

In environments using the PC-Data or PC-Display encoding structures, X'7F' is recommended as the default SUB. In single-byte EBCDIC environments, the defined SUB is X'3F', and in ISO-7 and ISO-8 environments it is X'1A'.
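The enforced subset match criterion can be sketched as follows, again with Python's `cp850` and `cp437` codecs standing in for the input and output CCSIDs. The function name, the dictionary keys, and the returned flag are hypothetical; the SUB values are the defaults named above. Handling of control code points is left out of this sketch.

```python
# Default SUB code points by environment, as listed in the text.
DEFAULT_SUB = {"pc": 0x7F, "ebcdic": 0x3F, "iso": 0x1A}


def enforced_subset_convert(data: bytes, src_codec: str, dst_codec: str,
                            sub: int):
    """Convert data, replacing every unmatched character with SUB.

    Returns (converted bytes, flag), where flag is True if any
    substitution occurred.
    """
    out = bytearray()
    substituted = False
    for ch in data.decode(src_codec):
        try:
            out += ch.encode(dst_codec)        # character is in the common subset
        except UnicodeEncodeError:
            out.append(sub)                    # unmatched: emit SUB
            substituted = True
    return bytes(out), substituted
```

Calling it on the cp850 encoding of "Søren" with `DEFAULT_SUB["pc"]` yields the bytes `S`, X'7F', `r`, `e`, `n` and a flag of True, so the caller can tell that data was lost.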

Visualization of SUB Character

The SUB character should be visually represented by a uniquely distinguishable character on presentation media. A warning flag should be returned to the caller of the mapping service to show that a substitution has occurred.

Default SUB-Visualization Character

Some presentation devices and data streams specify a unique character to be presented when a SUB code point is encountered in the presentation data. For example: the 3270 Data Stream defines a "filled circle" as default; the PC displays it as an "empty house symbol"; some printers print it as a "filled square".

When a presentation medium or a component interfacing to the presentation medium is not capable of replacing the SUB character with a unique non-SPACE visual character, the application sending data to be presented needs to convert the SUB character to an appropriate graphic character. For consistency among different implementations that do such a conversion, the Uppercase X (LX020000) (or its equivalent) is defined as the CDRA-recommended default.

Products that perform such SUB character replacement should also provide a means by which customers can select another graphic character of their choice as an alternative.
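A minimal sketch of SUB visualization for single-byte data is shown below. The function and parameter names are hypothetical. The default replacement assumes an ASCII-compatible output encoding, where uppercase X is X'58'; for other encodings the caller must pass the code point of the chosen character in that encoding, which also provides the customer-selectable alternative.

```python
def visualize_sub(data: bytes, sub: int = 0x7F,
                  replacement: bytes = b"X") -> bytes:
    """Replace each SUB code point with a visible graphic character."""
    return data.replace(bytes([sub]), replacement)
```

For example, `visualize_sub(b"S\x7fren")` gives `b"SXren"`, and passing `replacement=b"#"` selects a different character.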

Pairing of Code Points Using Enforced Subset Match

In addition to the general principles described in "Pairings of Code Points", the following additional principle is used when the enforced subset criterion is chosen:

All unmatched input graphic code points and mnemonically unmatched input control code points are converted to the "substitution character" code point prescribed for the output CCSID.

Conversion tables for CDRA level 2

Default conversion tables to be used for specific pairs of CCSIDs in different groups are available. For information on how to obtain these tables, see "Appendix J. CDRA Conversion Resources". The pairs of CCSIDs are those that are required within each character set group, and include both interoperable and coexistence and migration sets.

Each table has its own difference management criterion. Where possible, the round trip integrity criterion has been used; in other instances, the enforced subset match and character replacement criteria have been used.

Exceptions

The following exceptions to the basic mapping principles exist in some of the tables:

Alternatives to conversion defaults

The default tables defined in CDRA are based on specified criteria for mismatch management. These tables may not suit all application requirements; IBM products have used different tables for data conversion based on the criteria most suited to their customers. It may be necessary for the products to continue to support such tables.

Customers may have the need to continue using existing conversion tables or methods. Such methods or tables may produce conversion results that are different from those obtained using the default conversion tables.

Based on individual product and customer requirements, the ability to select alternative conversion methods or tables for a pair of CCSIDs may be supported by products as an option. If a product supports custom modifications, its documentation should describe the procedure for selecting the alternative method or table. Guidelines to prevent undesirable effects caused by such modifications should also be documented by the individual products.

Conversion functions

All the concepts described above can be incorporated into a collection of conversion methods and related conversion tables. The management aspects can also be embodied along with this collection. A single-step convert function and a three-part multiple-step conversion are defined in "Chapter 5. CDRA Interface Definitions".
