Characters and their identity

The character is the unit that is assigned a code point in a coded character set.

Figure 5. [Figure not reproduced. Labels: short identifiers, long names, nominal shapes; scripts, symbols, notations; universe of all characters; control functions; selection; character set; encoding scheme; coded character set.]

From the processing point of view, the identity of each character is important. The properties of a character depend on its identity: whether it is a letter or a number, whether it has an inherent right-to-left writing direction (as in Middle Eastern scripts), whether it has case, and whether it is a combining character. Each character must therefore have a unique name or identifier that distinguishes it from all others.
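As a rough illustration of how properties follow from identity, the sketch below looks up a few properties with Python's standard `unicodedata` module. The helper name `describe` and the choice of sample characters are assumptions made for this example, not part of the original text.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return a few identity-dependent properties of one character."""
    return {
        "id": f"U+{ord(ch):04X}",                  # code position as short identifier
        "category": unicodedata.category(ch),      # e.g. Lu = uppercase letter, Nd = decimal digit
        "bidi": unicodedata.bidirectional(ch),     # e.g. L = left-to-right, R = right-to-left
        "combining": unicodedata.combining(ch),    # 0 means "not a combining mark"
        "has_case": ch.lower() != ch.upper(),      # simple test for cased letters
    }

# Latin letter, digit, Hebrew letter (right-to-left), combining acute accent
for ch in ["A", "7", "\u05D0", "\u0301"]:
    print(describe(ch))
```

The same lookup applied to two characters with identical glyphs can return different properties, which is why the identity, and not the shape, has to drive processing.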

As shown in Figure 1, a character can be shown by its shape, or glyph, within each cell of the code page documentation. However, because some glyphs are similar (for example, Latin "A", Cyrillic "A", and Greek "A"), some means other than the glyph is needed to distinguish between them, especially when they are encoded in a single coded character set. To enable this distinction, each character is given a descriptive name. Such descriptive names can be long and of variable length, and can be in different languages. For some characters, such as the ideographs used in Japan, descriptive naming is not practical.

The answer to this requirement is some form of short identifier. IBM standards have a registry of Graphic Character Global Identifiers for this purpose. For example, LA020000 appears along with "A" at code point x'C' in Figure 1. Earlier editions of ISO/IEC 6937 used, for example, LA02 for the Latin "A". While these identifiers served well for small character sets, they are inadequate for dealing with all the world's scripts and notations.

The ideographs in the Far Eastern coded character sets were identified by a scheme based on their code positions in a coded character set chosen as the reference. For example, the code points from the IBM EBCDIC Double Byte code page are used to identify the ideographs in all the Japanese code pages in the IBM registry. Unicode and ISO/IEC 10646 have defined a short identifier (UID) for each code position. When a character is assigned to a code position, that code position identifier becomes a language-independent short identifier for the character, for example, U+0041 for LATIN CAPITAL LETTER A. This is in addition to the unique long name assigned to each character in Unicode. Unicode can be viewed as the largest single catalog of all characters, showing their long names and nominal glyphs, with their code positions acting as short identifiers.
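A small sketch of how short identifiers and long names separate look-alike glyphs, using Python's standard `unicodedata` module. The helper name `short_id` is an assumption introduced for illustration.

```python
import unicodedata

def short_id(ch: str) -> str:
    """Format a character's code position in the U+XXXX style."""
    return f"U+{ord(ch):04X}"

# Three characters whose nominal glyphs look alike: Latin, Cyrillic, Greek
lookalikes = ["\u0041", "\u0410", "\u0391"]

for ch in lookalikes:
    # The glyph may be indistinguishable; the identifier and name are not.
    print(short_id(ch), unicodedata.name(ch))

# The long name can also be used to find the character again.
assert unicodedata.lookup("LATIN CAPITAL LETTER A") == "\u0041"
```

The three lines printed carry different identifiers and names even though a font may draw all three characters with the same shape.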

Such short identifiers are used in machine-readable descriptions of coded character sets. Such a resource lists each code point with its associated short identifier. The descriptive name that corresponds to a short identifier can be kept in a separate reference registry or in resources such as the 'Names List' of Unicode. These machine-readable descriptions are called charmaps in the UNIX and Linux world. Charmaps are used to generate conversion resources between two coded character sets, and to bind locale source definitions into binary locale resources for consumption by globalized, locale-sensitive functions. All the ISO/IEC 7-bit and 8-bit coded character sets have been revised to include these universal short identifiers. IBM's conversion resources are built using such coded character set resources. Also refer to the ICU character mapping tables.
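The idea of deriving a conversion resource from two charmaps can be sketched as follows. This is a minimal illustration, not the format of real charmap files: it uses Python's built-in single-byte codecs (`cp037` for an EBCDIC code page and `latin-1`) as stand-ins for charmap sources, and the function names `charmap` and `conversion_table` are assumptions made for the example.

```python
def charmap(codec: str) -> dict:
    """Map each single-byte code point of `codec` to a short identifier (UID)."""
    mapping = {}
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # code point not assigned in this coded character set
        mapping[byte] = f"U+{ord(ch):04X}"
    return mapping

def conversion_table(src: str, dst: str) -> dict:
    """Join two charmaps on the shared UID to get a src -> dst code point table."""
    dst_by_uid = {uid: cp for cp, uid in charmap(dst).items()}
    return {
        cp: dst_by_uid[uid]
        for cp, uid in charmap(src).items()
        if uid in dst_by_uid  # skip characters the target set does not contain
    }

table = conversion_table("cp037", "latin-1")
print(f"x'C1' -> x'{table[0xC1]:02X}'")  # EBCDIC "A" to Latin-1 "A"
```

Because both charmaps are keyed to the same universal identifiers, neither side needs to know anything about the other's code point layout.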

Often you will encounter characters that are not part of the character set encoded in a given coded character set, but that still need to be generated or included in the text. These are handled in several ways.

ISO/IEC 10646 also defines a UCS Sequence Identifier (USI). USIs are used to identify specific sequences of characters that represent items such as ligatures, accented letters, and the conjuncts that appear in Indic scripts. Such items may be defined as a single character in some coded character sets (for example, the latest definition of JIS X0213), but are encoded as sequences in Unicode. A USI is expressed as a sequence of short character identifiers in the order in which the characters appear in the sequence. The text data will contain the sequence of code points assigned to these characters.
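The following sketch shows how an item can be expressed as an ordered list of short identifiers. The angle-bracket notation produced by the hypothetical helper `usi` is an assumption made for illustration; consult ISO/IEC 10646 for the normative form.

```python
import unicodedata

def usi(sequence: str) -> str:
    """Render a character sequence as an ordered list of short identifiers."""
    return "<" + ", ".join(f"{ord(ch):04X}" for ch in sequence) + ">"

# An accented letter expressed as base letter + combining mark
# (canonical decomposition of U+00E9, LATIN SMALL LETTER E WITH ACUTE)
decomposed = unicodedata.normalize("NFD", "\u00E9")
print(usi(decomposed))

# A Tamil conjunct written as consonant + virama + consonant
print(usi("\u0B95\u0BCD\u0BB7"))

# The text data itself holds the code points of the sequence, in order.
print(decomposed.encode("utf-8"))
```

The identifier names the sequence as a unit, while the stored text remains an ordinary run of individually encoded characters.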

The character entity reference is another method of short identification of characters, found in documents that use SGML, HTML, and XML. These specifications make a distinction between the document character set and the document character encoding. For example, a document can be encoded using only the ASCII coded character set, while its document character set is much larger: '&Alpha;' can be used to represent the Greek letter Alpha in an ASCII-encoded HTML document. Character entity collections are defined for specific character subsets of Unicode.

Numeric character references (NCRs) overcome the limitation of character entity references, which exist only for characters that have a defined entity name. They look similar to character entity references but use either decimal or hexadecimal digits. For example, the character U+0D15 MALAYALAM LETTER KA has an NCR of &#x0d15; in hexadecimal form and &#3349; in decimal form.
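A minimal sketch of generating and interpreting these references with Python's standard `html` module. The helper `to_ncr` is a hypothetical name introduced for this example; `html.unescape` and `html.entities.name2codepoint` are standard library facilities for HTML named and numeric references.

```python
import html
from html.entities import name2codepoint

def to_ncr(ch: str, hexadecimal: bool = True) -> str:
    """Build a numeric character reference for a single character."""
    cp = ord(ch)
    return f"&#x{cp:04x};" if hexadecimal else f"&#{cp};"

ka = "\u0D15"  # MALAYALAM LETTER KA
print(to_ncr(ka))                      # hexadecimal form
print(to_ncr(ka, hexadecimal=False))   # decimal form

# Interpreting references: both NCR forms resolve to the same character.
assert html.unescape("&#x0d15;") == html.unescape("&#3349;") == ka

# A named entity works only where a name has been defined.
assert html.unescape("&Alpha;") == "\u0391"
assert name2codepoint["Alpha"] == 0x0391
```

The numeric forms need no lookup table of names, which is why they can address any code position.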

Text containing such short identifiers will have sequences of code points corresponding to each character in the entity or numeric character reference string: '&', '#', the digits or name, and ';'.

You will need to generate and interpret these character entity and numeric character references in web documents. Browsers and other rendering software interpret these short identifiers and display the appropriate glyph from a font available on the system. These short identifiers can also play an important role in interchanging data between incompatible coded character sets without loss.
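One way to picture the non-lossy interchange is a round trip through an ASCII-only channel. The sketch below assumes the receiving side understands HTML-style references; the function names `to_ascii` and `from_ascii` are hypothetical. It relies on Python's standard `xmlcharrefreplace` error handler and on `html.escape` to protect literal ampersands.

```python
import html

def to_ascii(text: str) -> bytes:
    """Encode text as ASCII, replacing unencodable characters with NCRs."""
    # Escape literal '&', '<', '>' first so they cannot be mistaken for references.
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", errors="xmlcharrefreplace")

def from_ascii(data: bytes) -> str:
    """Reverse to_ascii by resolving all references."""
    return html.unescape(data.decode("ascii"))

original = "R&D: \u0D15 and \u0391"
wire = to_ascii(original)
print(wire)  # every byte is plain ASCII: '&', '#', digits, ';', and so on

assert from_ascii(wire) == original
```

Characters outside the target repertoire travel as reference strings made only of characters the channel does support, and are restored on the other side.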