Introduction to character conversion

In computers, all characters are encoded according to the rules of a particular encoding scheme and code page. If your database and applications handle data from multiple code pages, that data might be converted at certain times from one code page to another. This conversion process is called character conversion.
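As a rough illustration of the idea, the following sketch uses Python codec names as stand-ins for code pages: `cp037` for EBCDIC code page 37 and `latin-1` for ISO 8859-1. The `convert` helper is hypothetical and exists only for this example. It is not how DB2 performs conversions internally.

```python
# Illustrative sketch only: Python codecs stand in for code pages.
text = "Hello"

# The same characters are stored as different numbers in each code page.
ebcdic_bytes = text.encode("cp037")    # EBCDIC code page 37
latin1_bytes = text.encode("latin-1")  # ISO 8859-1 (Latin-1)


def convert(data: bytes, source: str, target: str) -> bytes:
    """Hypothetical helper: convert bytes from one code page to another."""
    return data.decode(source).encode(target)


# Character conversion: the bytes change, the characters do not.
converted = convert(ebcdic_bytes, "cp037", "latin-1")

print(ebcdic_bytes.hex(), latin1_bytes.hex())
print(converted == latin1_bytes)
```

The two byte strings differ even though they represent the same text, which is why data that is read with the wrong code page appears corrupted.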

You are likely to handle data from multiple code pages if your database and applications contain international data or data from multiple character sets, such as Latin-1 and Katakana. In that case, character conversions are likely to occur.

The problem with character conversions is that they can degrade performance and potentially cause data loss. Therefore, you should avoid these conversions if possible. One way to avoid these conversions is to have all of your data in one code page. If you use multiple character sets, you might consider using Unicode. The Unicode character repertoire is designed to cover the characters of all major character sets. If you use Unicode for all of your data, conversions can be avoided. However, converting all of your data to Unicode is not a simple process.
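The data-loss risk can be sketched with Python codecs, again as stand-ins for code pages. The example assumes Python's `errors="replace"` policy, which substitutes a question mark for each unmappable character. DB2 has its own substitution and error rules, so treat this only as an analogy.

```python
# Illustrative sketch only: Python's error handling is not DB2's.
katakana = "\u30ab\u30bf\u30ab\u30ca"  # Katakana text "カタカナ"

# A strict conversion fails because Latin-1 has no Katakana characters.
try:
    katakana.encode("latin-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A substituting conversion succeeds but loses the original characters.
lossy = katakana.encode("latin-1", errors="replace")
round_trip = lossy.decode("latin-1")

# Unicode (here UTF-8) can hold Latin-1 and Katakana text together,
# so no conversion between code pages is needed.
mixed = "caf\u00e9 " + katakana
utf8_round_trip = mixed.encode("utf-8").decode("utf-8")

print(strict_ok, round_trip == katakana, utf8_round_trip == mixed)
```

After the substituting conversion, the original text cannot be recovered from the stored bytes, whereas the Unicode round trip preserves every character.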

This information discusses basic principles about character conversion and general recommendations that you can apply to your environment for optimal performance and storage.

Character conversion terminology
To understand the concept of character conversion, you should know the meaning of some basic related terms.
Code pages and CCSIDs
Because computers store only numbers, they store letters and other characters by assigning a number to them. Which number is mapped to each character depends on the CCSID and code page that is associated with that character.
Encoding schemes
An encoding scheme standardizes the encoding of character sets by defining a set of rules for representing character data. Each encoding scheme consists of a number of code pages that adhere to its rules. For example, code pages 37, 500, and 1047 are all part of the EBCDIC encoding scheme.
Endianness
Endianness is a data attribute that describes byte order. When applications exchange data, they need to know the ordering convention for multi-byte data. Otherwise, data can be misinterpreted.
Situations in which character conversion occurs
Character conversion is the process of converting data from one CCSID to another CCSID. This process can occur when data is transferred between a remote and local system or when data is manipulated within the local system.
Possible consequences of character conversion
You should try to avoid character conversions when possible, because conversions can potentially slow performance and sometimes cause data loss. The way to avoid conversions is to use the same CCSID for all of your data.
Types of character conversion
Character conversions can be characterized by their effect on the length of the string. Conversions can be expanding, contracting, or neither. Character conversions can also be characterized by how they handle characters that do not exist in the target CCSID. They can be round-trip conversions or enforced subset conversions.
How DB2 for z/OS uses Unicode
Even if you do not use the Unicode encoding scheme for your data, you should be aware that DB2® uses Unicode in many of its internal processes. This use might affect your applications, queries, storage, and performance.
Setting up DB2 to ensure that it interprets characters correctly
You need to make sure that DB2 uses the correct code page (which is identified by a CCSID) to interpret your data. Otherwise, DB2 might store or use incorrect data. This situation is most likely to occur when characters are converted or transferred between systems.
Storing Unicode data
DB2 for z/OS® supports the full Unicode character repertoire, or set of characters. You can store DB2 data as UTF-8 or UTF-16.
Application programming with Unicode data and multiple CCSIDs
If your application handles Unicode data or data that is in different encoding schemes, you should be aware of several programming techniques and recommendations in DB2.
Debugging CCSID and Unicode problems
Some errors are obviously caused by a problem with a CCSID or Unicode object. In other cases, DB2 returns unexpected data and you need to check whether a CCSID is the cause of the problem. In these cases, you might not be using Unicode data or doing anything with CCSIDs other than accepting the default values.
DB2 utilities and Unicode support
You can run DB2 utilities on Unicode data, request that DB2 utilities return data in Unicode, and write utility control statements in Unicode.
EXPLAIN Unicode support
You can use DB2 EXPLAIN to capture access path information for your queries. This information is stored in the DB2 EXPLAIN tables, which are encoded in UTF-8.
DB2 ODBC Unicode support
Your DB2 for z/OS ODBC programs can manipulate Unicode data and report the CCSID settings of the subsystem.
IBM DB2 Tools for z/OS Unicode support
You can use IBM® DB2 Tools for z/OS on Unicode data, objects, and applications.
The International Components for Unicode
The International Components for Unicode (ICU) is a set of C/C++ and Java libraries for Unicode support and software internationalization. ICU is an open source project that is sponsored by IBM and provides Unicode services on many platforms.
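Two of the terms in the preceding list, endianness and expanding or contracting conversions, can be illustrated with a short sketch. It uses Python codec names as stand-ins for CCSIDs, and the `byte_length` helper is hypothetical. The byte counts apply to these particular characters and encodings only.

```python
# Illustrative sketch only: Python codecs stand in for CCSIDs.

# Endianness: the same UTF-16 character in two byte orders.
big = "A".encode("utf-16-be")     # most significant byte first
little = "A".encode("utf-16-le")  # least significant byte first

# Reading big-endian bytes as little-endian yields a different character.
misread = big.decode("utf-16-le")


def byte_length(text: str, codec: str) -> int:
    """Hypothetical helper: number of bytes that text occupies in a codec."""
    return len(text.encode(codec))


e_acute = "\u00e9"  # Latin small letter e with acute
ka = "\u30ab"       # Katakana letter KA

# Expanding: Latin-1 to UTF-8 grows this character from 1 byte to 2.
expanding = (byte_length(e_acute, "latin-1"), byte_length(e_acute, "utf-8"))

# Contracting: UTF-8 to UTF-16 shrinks this character from 3 bytes to 2.
contracting = (byte_length(ka, "utf-8"), byte_length(ka, "utf-16-be"))

# Neither: an unaccented letter is 1 byte in both Latin-1 and UTF-8.
neither = (byte_length("A", "latin-1"), byte_length("A", "utf-8"))

print(big.hex(), little.hex(), misread == "A")
print(expanding, contracting, neither)
```

Because a conversion can change the byte length of a string, a column or buffer that is sized for the source encoding might be too small for the converted data.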