Attributes of coded character sets

A coded character set has several attributes whose importance depends on what you are doing with textual data: its identifier, its encoding rules, its character set, and the identity of each character encoded in it. These attributes are described next.

When you deal with different coded character sets, you must be able to tell them apart. Identifying them is an important element of data interchange in any environment that mixes coded character sets, such as the World Wide Web. Even if you are simply going to employ a data conversion service, you must be able to supply it with the appropriate identifiers to get correct conversion results.
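As an illustration of why a conversion service needs the right identifiers, here is a minimal sketch using Python's built-in codecs. The function name `convert` and the sample bytes are assumptions made for this example; they do not come from any particular conversion service.

```python
def convert(data: bytes, source_label: str, target_label: str) -> bytes:
    """Re-encode bytes from one coded character set to another."""
    # Decoding interprets the bytes according to the source label.
    text = data.decode(source_label)
    # Encoding writes the same characters using the target label.
    return text.encode(target_label)


# The byte 0xE9 is "e with acute accent" in ISO 8859-1.
latin1_bytes = b"caf\xe9"

# Correct source label: the accented character survives the conversion.
utf8_bytes = convert(latin1_bytes, "iso-8859-1", "utf-8")

# Wrong source label: the same byte is read as a different character,
# so the converted output silently differs.
wrong_bytes = convert(latin1_bytes, "cp437", "utf-8")
```

No error is raised in the second call; the result is simply wrong, which is why the identifier has to travel with the data.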

Each coded character set is given a name and a number or short label to reference it and distinguish it from other coded character sets (see Figure 2).

Figure 2. A coded character set with its label, identifier, and descriptive name.

The identifier or label associated with a coded character set is probably the attribute of most interest to developers. Identifiers appear as parameters in environment setups, and in calls that query or set the text encoding in character handling service components. The label also helps to identify resources such as fonts, input methods, and keyboard maps that can handle the characters in that particular coded character set. For locale-sensitive processing, it enables the identification of a specific binding of a locale resource definition matching the bit patterns assigned to characters referenced in the locale resource.

In a Web environment, the label helps identify the document encoding used in a Web page or e-mail so that the receiver can take the necessary measures to display it properly to the user. It also identifies the coded character set of the data in a database, and it conveys the capabilities of the client to the server, so that the server software can tailor the information to suit those capabilities.

Unfortunately, this parameter is omitted in many Web pages and protocols, so Web browsers have to rely on user settings, on hierarchical inheritance rules, or on default assumptions. You will need to understand what this fallback action is in each component you work with.
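The fallback behavior described above can be sketched as follows, assuming the label arrives in a MIME-style Content-Type header. The helper name `charset_label` and the default value are hypothetical; real browsers and services choose their own defaults.

```python
from email.message import Message

# Hypothetical fallback, chosen only for this sketch.
ASSUMED_DEFAULT = "windows-1252"


def charset_label(content_type: str, default: str = ASSUMED_DEFAULT) -> str:
    """Return the charset label from a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter,
    # or None when the parameter is absent.
    label = msg.get_content_charset()
    return label if label is not None else default


declared = charset_label("text/html; charset=UTF-8")
omitted = charset_label("text/html")
```

Here `declared` carries the label the sender supplied, while `omitted` falls back to the assumed default, which may or may not match the actual encoding of the data.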

There is no single global standard for labeling or identifying coded character sets, and the collections of labels that do exist are not complete. As a result, implementations have to maintain cross-reference or alias tables between the different labels, at least for the frequently encountered coded character sets.
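Python's standard library is one implementation that maintains such an alias table. The sketch below relies on it to decide whether two labels refer to the same coded character set; the function name is an assumption for this example, and the result depends on which aliases the Python codec registry happens to know.

```python
import codecs


def same_coded_character_set(label_a: str, label_b: str) -> bool:
    """Report whether two labels resolve to the same Python codec."""
    try:
        # codecs.lookup() resolves aliases and returns a CodecInfo whose
        # .name is the canonical name used by the registry.
        return codecs.lookup(label_a).name == codecs.lookup(label_b).name
    except LookupError:
        # At least one label is unknown to this particular alias table.
        return False


iana_vs_short = same_coded_character_set("ISO-8859-1", "latin1")
ibm_vs_cp = same_coded_character_set("IBM819", "cp819")
different = same_coded_character_set("utf-8", "latin1")
```

A label missing from the table is reported as a mismatch here, which reflects the incompleteness noted above rather than a real difference between the character sets.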

In IBM standards, these identifiers have evolved from code page identifiers to Coded Character Set Identifiers (CCSIDs). These numbers are used either directly or in a label of the form IBM-xxxxx, where xxxxx is the CCSID. Some other vendors, such as Microsoft Corporation, continue to use the term code page for their coded character sets, along with code page numbers. The IETF calls them charsets and relies on the IANA registry of labels, which points to their definitions in other standards, IETF RFCs, or vendor documentation. The UNIX world calls them code sets and uses labels for ISO standards and vendor definitions. ISO/IEC 2022 defines an escape sequence mechanism for identification, to be used with numbers from the ISO Registry.
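A small cross-reference between these naming schemes might look like the sketch below. The type and function names are hypothetical, the IBM-xxxxx label is built without zero padding as an assumption, and the two sample rows should be checked against vendor documentation before being relied on.

```python
from typing import NamedTuple, Optional


class CharsetLabels(NamedTuple):
    ccsid: int               # IBM Coded Character Set Identifier
    iana_charset: str        # IANA charset label
    windows_code_page: int   # Microsoft code page number


# Illustrative entries only; a real table would be much larger.
CROSS_REFERENCE = [
    CharsetLabels(819, "ISO-8859-1", 28591),
    CharsetLabels(1208, "UTF-8", 65001),
]


def ibm_label(ccsid: int) -> str:
    """Build a label of the form IBM-xxxxx (no padding assumed)."""
    return f"IBM-{ccsid}"


def ccsid_from_label(label: str) -> int:
    """Extract the CCSID from a label of the form IBM-xxxxx."""
    prefix, _, digits = label.partition("-")
    if prefix.upper() != "IBM" or not digits.isdigit():
        raise ValueError(f"not an IBM-xxxxx label: {label!r}")
    return int(digits)


def labels_for_ccsid(ccsid: int) -> Optional[CharsetLabels]:
    """Look up the other vendors' labels for a CCSID, if known."""
    for entry in CROSS_REFERENCE:
        if entry.ccsid == ccsid:
            return entry
    return None
```

For example, `labels_for_ccsid(ccsid_from_label("IBM-819"))` yields the row that pairs the CCSID with its IANA and Windows counterparts, and an unknown CCSID yields `None`.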
