Conversion between code sets

A character is any symbol that is used for the organization, control, or representation of data. A group of such symbols that is used to describe a particular language makes up a character set. A code set contains the encoding values for a character set. The encoding values in a code set provide an interface between the system and its input and output devices. Multicultural support supplies converters that translate character-encoding values between different code sets.
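The text above does not name the converter interface. As an illustration only, the following sketch assumes that converters are reached through the POSIX iconv_open, iconv, and iconv_close subroutines. The helper name convert_codeset is hypothetical, and the code set names that a system accepts (for example "ISO-8859-1" or "ISO8859-1") vary by platform.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a system API): converts the null-terminated
 * string `in` from code set `from` to code set `to` and stores a
 * null-terminated result in `out`. Intended for targets such as UTF-8 or
 * single-byte code sets, where one zero byte terminates a string.
 * Returns the number of bytes written (excluding the terminator), or
 * (size_t)-1 on failure.
 */
size_t convert_codeset(const char *from, const char *to,
                       const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *src = (char *)in;   /* iconv takes char **; the input bytes are not modified */
    char *dst = out;
    size_t src_left = strlen(in);
    size_t dst_left;
    size_t rc;

    if (out_size == 0)
        return (size_t)-1;
    dst_left = out_size - 1;  /* reserve one byte for the terminator */

    cd = iconv_open(to, from);  /* argument order: target code set first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;      /* this pair of code sets is not supported */

    rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);  /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;      /* invalid input or output buffer too small */

    *dst = '\0';
    return (size_t)(dst - out);
}
```

For example, converting the ISO 8859-1 string "caf\xE9" to UTF-8 with this helper produces the 5 bytes "caf\xC3\xA9", because the single byte 0xE9 becomes a 2-byte UTF-8 sequence.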

Historically, encoding efforts were directed at the English alphabet. A 7-bit encoding method was sufficient for this purpose because the number of English characters is not large. To support larger character sets, such as those of Chinese, Japanese, and Korean, additional code sets were developed that use multibyte encoding. Now Unicode, a character set that supports worldwide information processing, is used as the basic interchange format at the operating system level. The UTF-8, UTF-16, and UTF-32 code sets are the major Unicode encoding schemes for system applications.

A globalized program must correctly read and process data that is generated in different code set environments. Knowing the current code set can aid in code set conversion. You can use the nl_langinfo(CODESET) subroutine to obtain the current code set in a process. The return value is a pointer to a character string that contains the name of the code set for the current locale.
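The following is a minimal sketch of querying the current code set. The helper names get_current_codeset and print_current_codeset are hypothetical. The sketch assumes that the program adopts the locale from its environment with setlocale(LC_ALL, ""); without that call a process remains in the default "C" locale, and the reported code set reflects that locale instead.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a system API): copies the name of the current
 * locale's code set into buf. Returns 0 on success, or -1 if the
 * arguments are invalid or the name does not fit in the buffer.
 */
int get_current_codeset(char *buf, size_t size)
{
    const char *name;

    if (buf == NULL || size == 0)
        return -1;

    /* The returned string is owned by the library and may be overwritten
     * by later calls to nl_langinfo or setlocale, so copy it out. */
    name = nl_langinfo(CODESET);
    if (name == NULL || strlen(name) >= size)
        return -1;

    strcpy(buf, name);
    return 0;
}

/* Hypothetical helper: adopts the environment's locale, then prints the code set name. */
int print_current_codeset(void)
{
    char codeset[64];

    /* Without this call the process stays in the default "C" locale. */
    if (setlocale(LC_ALL, "") == NULL)
        fprintf(stderr, "warning: could not set locale from environment\n");

    if (get_current_codeset(codeset, sizeof codeset) != 0)
        return -1;

    printf("Current code set: %s\n", codeset);
    return 0;
}
```

The name that is printed depends on the locale and the platform, so a program that passes it to a converter should treat it as an opaque string rather than compare it against a fixed spelling.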