IBM Support

Troubleshooting codepage conversion issues

Troubleshooting


Problem

Troubleshooting data conversion like truncation or when one character appears correctly in source database but incorrectly after importing data into the target database. XML is not covered.

Symptom

Source database displays MONTRÉAL (correct) while target database displays MONTRÃAL

Environment


For this example the source database is DB2 v9.7 (AIX) database in which data has been moved to DB2 v10.1 (Linux) via DB2's EXPORT/IMPORT utilities. Both source and target databases are defined as UTF-8 (codepage 1208) in DB2.

DB CFG
Database code page = 1208
Database code set = UTF-8

Data was imported from source (AIX) to target (Linux) database via
EXPORT/IMPORT utilities using IXF format. IXF format is suggested as per docs:

"PC/IXF import should be used to move data between databases. If character data containing row separators is exported to a delimited ASCII (DEL) file and processed by a text transfer program, fields containing the row separators will shrink or expand."

Diagnosing The Problem


Troubleshooting should be performed from bottom up approach for codepage problems:
a) Verify data is stored correctly in database via DB2 CLP
b) Check method used to transfer data or application dispalying the data. Codepage conversion is performed by the receiving application i.e. Java, .NET, etc..., not DB2 database server.
c) Verify the correct codepage was used during EXPORT (if applicable).

Resolving The Problem


Verify the data in the database is correct by using the hex() function to determine
what is actually stored in DB2 database. The data is stored in test.tb1.

Source database on AIX

db2 "select col, hex(col) from test.tb1"

COL 2
-------- -----------
MONTRÉAL 4D4F4E5452C389414C

1 record(s) selected.

Target database on Linux
db2 "select col, hex(col) from test.tb1"

COL 2
-------- ------------------
MONTRÃAL 4D4F4E5452C389414C

1 record(s) selected.

From output the hex values match so we can conclude the data is stored correctly on the DB2 database by comparing the hex to character mapping via UTF-8 encoding table on internet http://www.utf8-chartable.de/

If the hex values do not match between the source and target database do not match then the problem lies with how the data was initially transferred. If using EXPORT/IMPORT utilities with delimited ASCII files, data may have been corrupted during transfer since it is not in a binary format.

For our example, the hex values are expected to be the same in both source and target databases because they are both defined in DB2 as UTF-8 databases.

If databases were using different codepages then the hex values are expected to be different. Thus verification would be performed using two different codepage tables to determine if hex values map to the same character in both source and target databases.
For example source database in codepage 1252 (usually used on Windows) and target database in 1208 (UTF-8). In codeage 1252 É maps to 00C9 and in codepage 1208 (UTF-8) maps to C389. Each set of hex digits should be checked to ensure conversion was properly done.

Back to the example, the characters displayed for the same column are incorrect, but the hex values are correct.

UTF-8
4D = M
4F = O
4E = N
54 = T
52 = R
C3 89 = É (some unicode chars encoded using multiple bytes)
41 = A
4C = L

Thus next step is to investigate the application which receives the data and displays it. The receiving application is responsible for converting the hex values received into human readable characters. For this example it is PuTTy terminal application using DB2 CLP. Depending on the application, it may be Java, Windows .NET, performing the codepage conversion.

PuTTy is commonly used to access remote UNIX/Linux systems. This example problem was caused by PuTTy not configured to display unicode. By default PuTTY is configured to interpret data as codepage 819. Thus it was translating C389 to a different character. It can be resolved by following the link below.

What you need to do is to set the font (Figure 1) and encoding (Figure 2) in PuTTY.

For SecureCRT (vandyke.com) unicode is enabled via Session Options / Terminal / Appearance "Use Unicode line-drawing characters"

For applications outside of DB2 CLP, try the same query to see if the results match DB2 CLP

select col, hex(col) from test.tb1

Java:
JDBC driver uses UTF-8 by default.

.NET:
IBM Data Server Driver uses UTF-8 by default if supported.

XML:
Not covered in this technote, please see "Related Links"

Additional Considerations:

Data Growth/Truncation when converting from single byte code page to Unicode:

For unicode, the column would have to be redefined by increasing the number of characters or using the CODEUNITS16 (UTF-16) or CODEUNITS32 (UTF-32) keywords since some characters require multiple bytes. See String unit attributes improve handling of multi byte characters

If the column is not redefined there is a chance of data truncation if there is character from single byte codepage that requires multiple bytes of storage in unicode.

Before on single byte codepage:
CREATE TABLE TEST(C1 VARCHAR(1)) /* 1 CHAR = 1 Byte */

After converting to Unicode:
CREATE TABLE TEST(C1 VARCHAR(2)) /* 1 CHAR = 1 Byte */
OR
CREATE TABLE TEST(C1 VARCHAR(1 CODEUNITS32)) /* 1 CHAR = 4 Bytes */

This would result in data growth since multiple bytes may be required to represent the same character in Unicode vs. single byte code page. For example in Federation or non-Federation when converting from CP 297 to unicode the character below consumes 2 Bytes in unicode, thus if the column was defined as CHAR(2) in CP297, it should be redefined as CHAR(4) assuming UTF-16 (2 Byte chars) is used. Another alternative is to use the CODEUNITS16/CODEUNITS32 parameter.

CODEUNITS16 is supported only with GRAPHIC string types.

à = 0xC3A0 (CP 1208/unicode)
à = 0x7C (CP 297)

Unicode table: http://www.utf8-chartable.de/

When EXPORTing data:

By default DB2 will convert codepage to what the client is using. For example when connecting from Windows to remote DB2 z (mainframe) database to export data, by default EXPORT utility will export to Windows codepage 1252 even though MODIFIED BY CODEPAGE=1208 is used.

EXPORT TO C:\MYLOBS\test.ixf OF IXF LOBS TO
C:\MYLOBS modified by lobsinfile modified by codepage=1208 select * from
db2inst1.employee with ur;

For these situations explicitly set DB2CODEPAGE registry variable to the codepage required.

db2set db2codepage=1208
db2 terminate
db2 "connect to <DB2 z database alias> user <user id>"
(Prompts for password)

EXPORT TO C:\MYLOBS\test.ixf OF IXF LOBS TO
C:\MYLOBS modified by lobsinfile modified by codepage=1208 select * from
db2inst1.employee with ur;

[{"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Component":"Database Objects\/Config - Database - Codepage and Territory","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF010","label":"HP-UX"},{"code":"PF016","label":"Linux"},{"code":"PF027","label":"Solaris"},{"code":"PF033","label":"Windows"}],"Version":"9.7;9.5;10.1;10.5;11.1","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
16 June 2018

UID

swg21984868