Collection or character set

The character set or collection is either the entire set or a subset of characters that are assigned code points in a coded character set. It is typically a collection of all the characters needed for a given language or application. Knowledge of the character set contained in a coded character set is needed in some situations. See Figure 4.

[Figure 4. Diagram labels: Collection Identifiers, Descriptive Names; Character Set; Encoding Scheme; Coded Character Set.]

Terms like the ASCII repertoire, the Latin-1 or Latin-5 character set, the portable character set in the UNIX and Linux worlds, or the EBCDIC invariant or syntactic character set refer to character sets without associating any specific encoding with them. These character sets have been encoded using different encoding schemes to create different compatible or interchangeable coded character sets. Some of the character sets are specified for portability or for programming language syntax reasons, and must be contained in any coded character set used for that environment.
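As an illustration of one character set being encoded under different encoding schemes, the following minimal sketch maps a few characters to their code points in ASCII and in an EBCDIC code page. It assumes Python's built-in "ascii" and "cp037" codecs as stand-ins for the two encodings; the names REPERTOIRE and code_points are hypothetical and chosen only for this example.

```python
# A small, encoding-independent character set: only the characters themselves.
# REPERTOIRE and code_points are illustrative names, not part of any standard.
REPERTOIRE = ["A", "Z", "a", "0"]


def code_points(chars, codec):
    """Map each character to its single-byte code point under the given codec."""
    return {ch: ch.encode(codec)[0] for ch in chars}


# The same character set, encoded two ways, yields two coded character sets.
ascii_points = code_points(REPERTOIRE, "ascii")
ebcdic_points = code_points(REPERTOIRE, "cp037")

for ch in REPERTOIRE:
    print(f"{ch!r}: ASCII 0x{ascii_points[ch]:02X}, "
          f"EBCDIC cp037 0x{ebcdic_points[ch]:02X}")
```

The characters are identical in both mappings; only the assigned code points differ. For example, "A" is 0x41 in ASCII and 0xC1 in code page 37.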

Character set identification helps differentiate one version of a coded character set from the next, especially when the set expands over time. If a coded character set has reached its maximum possible size (per the encoding scheme definition), its maximum character set is fixed. If there is still room, the coded character set can grow. Often, the same code page identifier is retained for the new coded character set, and the previous assignments of characters remain unchanged. In such cases, the code page identifier alone cannot distinguish the old from the new; the character set identifier can. IBM standards call for using both character set and code page identifiers (resulting in different CCSIDs), and for retaining the same code page identifier for the new coded character set. Knowledge of the character set defined in the coded character set helps in managing the flow of characters from a system that supports an expanded set to a system that is still at a back level. It also helps in differentiating conversion resources created from the old and the new definitions.
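The versioning situation described above can be sketched as a pair of identifiers attached to a table of assignments. This is a toy model, not IBM's registry format: the class CodedCharacterSet, the helper functions, and all identifier values (9999, 1000, 1001) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CodedCharacterSet:
    """Toy model: a code page identifier paired with a character set identifier."""
    code_page_id: int
    character_set_id: int
    assignments: dict = field(default_factory=dict)  # code point -> character

    def repertoire(self):
        """The character set: the characters alone, without their code points."""
        return set(self.assignments.values())


def is_compatible_extension(old, new):
    """True if every assignment in the old definition is unchanged in the new one."""
    return all(new.assignments.get(cp) == ch for cp, ch in old.assignments.items())


def unsupported_on(target, text):
    """Characters in text that are missing from the target's character set."""
    return sorted(set(text) - target.repertoire())


# Hypothetical identifiers: same code page identifier, different character sets.
OLD = CodedCharacterSet(9999, 1000, {0x41: "A", 0x42: "B"})
NEW = CodedCharacterSet(9999, 1001, {**OLD.assignments, 0x80: "\u20ac"})

print(OLD.code_page_id == NEW.code_page_id)          # code page alone cannot tell them apart
print(OLD.character_set_id != NEW.character_set_id)  # the character set identifier can
print(unsupported_on(OLD, "AB\u20ac"))               # what a back-level system cannot accept
```

In this model, a sender holding the expanded set could call unsupported_on against the back-level definition before transmitting, and substitute or reject the characters it reports.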

The identification of character sets has been dealt with to different degrees of precision in the industry, from loosely identifying closely related sets to distinguishing sets that differ by even a single character.

IBM standards permit character sets to be registered with their identifiers, and IBM has registered them. In ISO/IEC 10646, collection numbers identify open as well as fixed subsets. Collection numbers have also been assigned to the fixed repertoire of each major edition or amendment to the standard (the equivalent of a selected version identifier for both ISO/IEC 10646 and Unicode) as the standard grows in size.

While the term character set is usually applied to the set of graphic characters, such as the letters A to Z, coded character set definitions also include control characters used in plain text, such as Horizontal Tab or Carriage Return.

Typically, a set of control characters is also included in a coded character set definition. In most cases, the set of control characters is associated with the encoding scheme definition. ISO/IEC 2022 provides a mechanism for invoking and using different control character sets from the ISO Registry. Such control character sets can be found in some terminal data stream specifications and their emulations.
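The distinction between graphic and control characters can be sketched with the Unicode general category, where control characters carry the category "Cc". Using Unicode categories for this split is an assumption made for the example, not the classification rule of any particular coded character set definition; the function name split_graphic_and_control is hypothetical.

```python
import unicodedata


def split_graphic_and_control(text):
    """Separate characters into (graphic, control) lists, preserving order.

    Assumption for this sketch: a character is a control character when its
    Unicode general category is "Cc"; everything else is treated as graphic.
    """
    graphic, control = [], []
    for ch in text:
        if unicodedata.category(ch) == "Cc":
            control.append(ch)
        else:
            graphic.append(ch)
    return graphic, control


# Horizontal Tab and Carriage Return appear in plain text but are not graphic.
graphic, control = split_graphic_and_control("A\tZ\r")
print(graphic)
print([f"U+{ord(ch):04X}" for ch in control])
```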

Character set identification is also useful for knowing which characters are generated from a particular keyboard layout or supported by a font resource in a printer or display. These are examples of the use of character sets outside the world of coded character sets.