IBM Support

Troubleshooting codepage conversion issues

Technote (troubleshooting)


Problem(Abstract)

Troubleshooting data conversion issues when one character appears correctly in source database but incorrectly after importing data into the target database. XML is not covered.

Symptom

Source database displays MONTRÉAL (correct) while target database displays MONTRÃAL


Environment

For this example the source database is DB2 v9.7 (AIX) database in which data has been moved to DB2 v10.1 (Linux) via DB2's EXPORT/IMPORT utilities. Both source and target databases are defined as UTF-8 (codepage 1208) in DB2.

DB CFG
Database code page = 1208
Database code set = UTF-8

Data was imported from source (AIX) to target (Linux) database via
EXPORT/IMPORT utilities using IXF format. IXF format is suggested as per docs:

"PC/IXF import should be used to move data between databases. If character data containing row separators is exported to a delimited ASCII (DEL) file and processed by a text transfer program, fields containing the row separators will shrink or expand."

Diagnosing the problem

Troubleshooting should be performed from bottom up approach for codepage problems:
a) Verify data is stored correctly in database via DB2 CLP
b) Check method used to transfer data or application dispalying the data. Codepage conversion is performed by the receiving application i.e. Java, .NET, etc..., not DB2 database server.
c) Verify the correct codepage was used during EXPORT (if applicable).

Resolving the problem

Verify the data in the database is correct by using the hex() function to determine
what is actually stored in DB2 database. The data is stored in test.tb1.

Source database on AIX

db2 "select col, hex(col) from test.tb1"

COL 2
-------- -----------
MONTRÉAL 4D4F4E5452C389414C

1 record(s) selected.

Target database on Linux
db2 "select col, hex(col) from test.tb1"

COL 2
-------- ------------------
MONTRÃAL 4D4F4E5452C389414C

1 record(s) selected.

From output the hex values match so we can conclude the data is stored correctly on the DB2 database by comparing the hex to character mapping via UTF-8 encoding table on internet http://www.utf8-chartable.de/

If the hex values do not match between the source and target database do not match then the problem lies with how the data was initially transferred. If using EXPORT/IMPORT utilities with delimited ASCII files, data may have been corrupted during transfer since it is not in a binary format.

For our example, the hex values are expected to be the same in both source and target databases because they are both defined in DB2 as UTF-8 databases.

If databases were using different codepages then the hex values are expected to be different. Thus verification would be performed using two different codepage tables to determine if hex values map to the same character in both source and target databases.
For example source database in codepage 1252 (usually used on Windows) and target database in 1208 (UTF-8). In codeage 1252 É maps to 00C9 and in codepage 1208 (UTF-8) maps to C389. Each set of hex digits should be checked to ensure conversion was properly done.

Back to the example, the characters displayed for the same column are incorrect, but the hex values are correct.

UTF-8
4D = M
4F = O
4E = N
54 = T
52 = R
C3 89 = É (some unicode chars encoded using multiple bytes)
41 = A
4C = L

Thus next step is to investigate the application which receives the data and displays it. The receiving application is responsible for converting the hex values received into human readable characters. For this example it is PuTTy terminal application using DB2 CLP. Depending on the application, it may be Java, Windows .NET, performing the codepage conversion.

PuTTy is commonly used to access remote UNIX/Linux systems. This example problem was caused by PuTTy not configured to display unicode. By default PuTTY is configured to interpret data as codepage 819. Thus it was translating C389 to a different character. It can be resolved by following the link below.

What you need to do is to set the font (Figure 1) and encoding (Figure 2) in PuTTY.

For SecureCRT (vandyke.com) unicode is enabled via Session Options / Terminal / Appearance "Use Unicode line-drawing characters"

For applications outside of DB2 CLP, try the same query to see if the results match DB2 CLP

select col, hex(col) from test.tb1

Java:
JDBC driver uses UTF-8 by default.

.NET:
IBM Data Server Driver uses UTF-8 by default if supported.

XML:
Not covered in this technote, please see "Related Links"

Additional Considerations:

When defining columns:

For multiple byte unicode, the column would have to be redefined as CODEUNITS16 (UTF-16) or CODEUNITS32 (UTF-32). See String unit attributes improve handling of multibyte characters

CREATE TABLE TEST(C1 VARCHAR(254 CODEUNITS32))

When EXPORTing data:

By default DB2 will convert codepage to what the client is using. For example when connecting from Windows to remote DB2 z (mainframe) database to export data, by default EXPORT utility will export to Windows codepage 1252 even though MODIFIED BY CODEPAGE=1208 is used.

EXPORT TO C:\MYLOBS\test.ixf OF IXF LOBS TO
C:\MYLOBS modified by lobsinfile modified by codepage=1208 select * from
db2inst1.employee with ur;

For these situations explicitly set DB2CODEPAGE registry variable to the codepage required.

db2set db2codepage=1208
db2 terminate
db2 "connect to <DB2 z database alias> user <user id>"
(Prompts for password)

EXPORT TO C:\MYLOBS\test.ixf OF IXF LOBS TO
C:\MYLOBS modified by lobsinfile modified by codepage=1208 select * from
db2inst1.employee with ur;

Related information

Encoding considerations when storing or passing XML dat

Document information

More support for: DB2 for Linux, UNIX and Windows
Database Objects/Config - Database - Codepage and Territory

Software version: 9.7, 10.1, 10.5, 11.1

Operating system(s): AIX, HP-UX, Linux, Solaris, Windows

Reference #: 1984868

Modified date: 08 June 2016


Translate this page: