Character Data Representation Architecture

Footnotes


(1) Graphic character data is not to be confused with "graphic" data type used to represent double-byte data in some programming languages.

(2) Many of these problems are detailed in SHARE Report SSD No. 366: ASCII and EBCDIC Character Set and Code Issues in Systems Application Architecture, The ASCII/EBCDIC Character Set Task Force, edited by Edwin Hart, The Johns Hopkins University, Applied Physics Laboratory, Laurel, Maryland, USA. Published by Share Inc., 111 East Wacker Drive, Chicago, Illinois, USA 60601; June 1989.

(3) A GCGID is an alphanumeric identifier assigned to a graphic character. See entry in the glossary.

(4) The Katakana set used in Japan does not include lowercase a to z. An expanded set including these characters has been defined, providing a larger set for use in the different single-byte codes in Japanese environments.

(5) This set is called the Syntactic Character Set, and has been assigned the character set identifier 00640 by IBM.

(6) Latin Alphabet Number 1, Latin Alphabet Number 2, and others are the character sets defined in ISO 8859-1, ISO 8859-2, and other parts of ISO 8859.

(7) Thailand has two sets: a large one for devices, and a small one for interchange and storage purposes. Thai is currently listed as part of the Group 2 set, along with other Far East sets that have similar character sets. Unlike Japanese or Chinese, Thai processing does not involve mixed single- and double-byte coding.

(8) Extended single-byte character sets have been defined for interoperability for each country in Group 2 to maximize the number of common graphic characters between PC and EBCDIC coded character sets. These extended sets will simplify the graphic character data interchange without character data loss. For example, the Character Set 01172 that is defined for Japanese contains lowercase alphabets, Katakana, and other symbols, and is represented in the PC and EBCDIC encoding schemes.

(9) Combining characters form a sequence consisting of a base character followed by one or more non-spacing marks. For example, the sequence of a lowercase a followed by a non-spacing acute accent is processed as an a acute.
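
As an illustration only, the behavior described in note (9) can be sketched with Unicode, which is an assumption here since the note does not name an encoding. The sketch uses Python's standard unicodedata module to compose a base character and a non-spacing mark into a single precomposed character.

```python
import unicodedata

# Assumption: Unicode is used as the example encoding; the note itself
# is encoding-neutral.
base_plus_mark = "a\u0301"  # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# NFC normalization composes the two-character sequence into a single
# precomposed character where one exists.
composed = unicodedata.normalize("NFC", base_plus_mark)

print(len(base_plus_mark), len(composed))
print(unicodedata.name(composed))
```

The sequence holds two characters before normalization and one after it, yet both forms are meant to be processed as the same a acute.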

(10) A nibble is a bit-pattern consisting of four bits.

(11) LS0 (LOCKING-SHIFT ZERO) and LS1 (LOCKING-SHIFT ONE) are synonyms for SI (SHIFT IN) and SO (SHIFT OUT), respectively, and are defined in ISO 2022.

(12) The IBM Dictionary of Computing definition differs from the above in that it includes control characters in the definition of a code page. CDRA follows the IBM standards terminology for coded graphic character set definitions.

(13) For example, CPGID 00850 can be used with PC-Display, PC-Data, or ISO-8 encoding structures. In each use, the range of control code points and the range of graphic code points vary, as do the corresponding maximal character sets.

(14) A ward is a section of a double-byte coded character set in which the first byte of every code point has the same value.
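
A minimal sketch of the ward concept in note (14), assuming double-byte code points are held as 16-bit integers. The function names ward_of and group_by_ward and the sample values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def ward_of(code_point: int) -> int:
    """Return the first (high-order) byte of a double-byte code point."""
    return (code_point >> 8) & 0xFF

def group_by_ward(code_points):
    """Group double-byte code points by the value of their first byte."""
    wards = defaultdict(list)
    for cp in code_points:
        wards[ward_of(cp)].append(cp)
    return dict(wards)

# Hypothetical sample values: two code points share ward X'41'.
sample = [0x4141, 0x4142, 0x4241]
for ward, members in group_by_ward(sample).items():
    print(f"ward X'{ward:02X}':", [f"X'{cp:04X}'" for cp in members])
```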

(15) An example of a hierarchy is a logically structured file, where there are logical fields in logical records, and logical records in the file. If the tag value is not specified or set to X'0000' at a field level, it may inherit the tag value from the file level.
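
The inheritance described in note (15) might be sketched as follows. This is an illustration under stated assumptions, not the architecture's defined algorithm: the function name effective_tag is hypothetical, a missing tag is modeled as None, and the record level is included as an intermediate step even though the note mentions only the field and file levels.

```python
from typing import Optional

UNSPECIFIED = 0x0000  # tag value meaning "not set at this level"

def effective_tag(field_tag: Optional[int],
                  record_tag: Optional[int],
                  file_tag: Optional[int]) -> Optional[int]:
    """Return the first usable tag, searching from field up to file level.

    Assumption: a level with no tag (None) or with X'0000' defers to the
    next level up in the hierarchy.
    """
    for tag in (field_tag, record_tag, file_tag):
        if tag is not None and tag != UNSPECIFIED:
            return tag
    return None  # no level supplied a usable tag

# A field tagged X'0000' inherits the file-level tag (sample value 500).
print(effective_tag(0x0000, None, 500))
```

Whether a missing tag is actually inherited is a decision for the implementing product; the note says only that it "may" be.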

(16) The terms "country" and "language" are used synonymously when referring to character sets.

(17) The term "character set mismatch management" is often shortened to "mismatch management" in this document.

(18) The ISO notation for code points is of the form xx/yy, where "xx" and "yy" are decimal numbers in the range 00 to 15. In the 7-bit code, "xx" can be from 00 to 07, or 0 to 7. An ISO notation such as 2/13 maps to a hexadecimal representation such as X'2D'; the numbers before and after the slash are the decimal equivalents of the first and second hexadecimal digits, respectively. Hexadecimal notation is used throughout this document for consistency.
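
The mapping in note (18) can be sketched as a small conversion routine. The function name iso_to_hex is hypothetical; the X'..' output format follows the hexadecimal notation used in this document.

```python
def iso_to_hex(notation: str) -> str:
    """Convert ISO xx/yy notation (decimal column/row) to X'..' notation."""
    column_text, row_text = notation.split("/")
    column, row = int(column_text), int(row_text)
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError(f"column and row must be in 0..15: {notation!r}")
    # Each decimal number becomes one hexadecimal digit.
    return f"X'{column:X}{row:X}'"

print(iso_to_hex("2/13"))   # X'2D'
print(iso_to_hex("02/13"))  # leading zeros are accepted: X'2D'
```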

(19) DBCS Section is a synonym for DBCS Ward. It is used in IBM Advanced Function Printing product publications and associated IBM architecture documents.

(20) Unix is a registered trademark of UNIX System Laboratories, Inc. in the U.S.A. and other countries. EUC is a code defined by Unix International Asia Pacific Office.

(21) The only single-byte to double-byte conversion tables currently available are those mapping single-byte data to UCS-2 (encoding scheme X'7200'); others may be available in the future.

(22) The only double-byte to single-byte conversion tables currently available are those mapping UCS-2 data (encoding scheme X'7200') to single-byte; others may be available in the future.

(23) In the distribution files, each single table is in a separate record and the "pointers" contain the relative record numbers of the associated tables (where B0 is considered to be record 0). Tables within a group follow in sequence from the initial table in the group.

(24) Simplified Chinese uses the DBCS "space" character as a SUB since there are no additional code points available.

(25) The values given in these tables for EUC have the high order bit turned on when appropriate, even though the actual value in the code page may not have that bit on. This reflects the usage of the code page in the right-hand side of the 8-bit code space. So, for example, the space in the code page 952 component of CCSID 954 is given as X'A1A1' even though the space in the underlying code page 952 is actually X'2121'.
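
The adjustment described in note (25) amounts to setting the high-order bit of each byte. A minimal sketch, assuming a double-byte value held as a 16-bit integer; the function name to_right_hand_side is hypothetical.

```python
def to_right_hand_side(code_point: int) -> int:
    """Set the high-order bit of both bytes of a double-byte code point."""
    return code_point | 0x8080

# The space in code page 952 is X'2121'; as used in CCSID 954 it is X'A1A1'.
value = to_right_hand_side(0x2121)
print(f"X'{value:04X}'")  # X'A1A1'
```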

(26) Note that all values listed are hex values. "ESC" is the Escape Character, X'1B'.

(27) The assignment of the remaining seven characters of a user-defined GCGID is product-unique, and does not follow the definition presented in this document.

(28) These control points do not follow the definitions of ASCII in ANSI X3.4.

(29) These code points are in the graphic character space for IBM-PC code pages. The actual graphic characters vary from code page to code page. These code points are used for mapping control code points for consistency. (Note that a graphic character match will override the control character mapping.)

(30) These control points do not follow the definitions of ASCII in ANSI X3.4.

(31) These control points do not follow the definitions of ASCII in ANSI X3.4.

(32) These code points are in the graphic character space for IBM-PC code pages. The actual graphic characters vary from code page to code page. These code points are used for mapping control code points for consistency. (Note that a graphic character match will override the control character mapping.)

(33) These control points do not follow the definitions of ASCII in ANSI X3.4.

(34) These code points are in the graphic character space for IBM-PC code pages. The actual graphic characters vary from code page to code page. These code points are used for mapping control code points for consistency. (Note that a graphic character match will override the control character mapping.)

(35) Prior to 1986, ISO-8 X'9F' (APC) mapped to EBCDIC X'E1'. This control code point is a graphic code point. It was previously used as the numeric space character in many EBCDIC SBCS coded character sets; in the latest revised CECPs, the numeric space character has been replaced with DIVISION SYMBOL. The mapping shown here is to the EBCDIC Eight Ones control, which is used as a filler character.

(36) The resource for the UCS-2 code page is not included on the CD. This resource is available in the ISO document: ISO/IEC IS 10646-1, Information Technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane; 1993.

(37) Universal Transformation Format (UTF): a mechanism used to handle Universal Coded Character Set data in a manner that does not conflict with system conventions on ASCII-based systems.
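
As a hedged illustration of note (37), the sketch below uses UTF-8, which is an assumption since the note does not name a specific transformation format. It shows that ASCII characters keep their single-byte values and that a non-ASCII character is encoded entirely with bytes outside the ASCII range, so it cannot be mistaken for ASCII data.

```python
# Assumption: UTF-8 stands in for the transformation format in the note.
ascii_text = "A/"
non_ascii_text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

ascii_bytes = ascii_text.encode("utf-8")
non_ascii_bytes = non_ascii_text.encode("utf-8")

# ASCII characters are unchanged by the transformation.
print(ascii_bytes == ascii_text.encode("ascii"))

# Every byte of the non-ASCII character has its high-order bit set.
print(all(b >= 0x80 for b in non_ascii_bytes))
```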

(38) Two mnemonics are specified when the standard has changed over time or the control code may be used for different purposes depending on the context of use. Both mnemonics are acceptable abbreviations.

(39) The mnemonic for the Start of Significance control character in EBCDIC has been modified to include a dot (.) at the end (SOS.). This has been done to distinguish it from the SOS mnemonic used in ISO-8 for the Start of String control character. The dot does not alter the property of the control in any way.
