Beginning with Version 9.5, XML data can be stored and retrieved from a database that does not use a Unicode code page.
Internally, XML data is always managed by the DB2® database server in a Unicode format, regardless of the database code page. Non-XML, relational data is managed in the database code page. In cases where SQL or XQuery statements involve both XML data and SQL relational data, such as in the casting of one data type to another, or in comparisons involving both the XML data type and SQL data types, code page conversion is often required. Comparisons of XML data to XML data do not require code page conversion because both sets of data are already in UTF-8 format. Similarly, comparisons of SQL data to SQL data do not require code page conversion because both sets of data are already in the database code page.
For operations involving XML data and SQL data, the need for code page conversion is eliminated in a Unicode database because the database uses the same encoding for all data types. In a non-Unicode database, however, operations that involve code page conversion can potentially result in the corruption or loss of data. If the XML data undergoing conversion contains characters with code points that are not part of the database code page, character substitution occurs. As a result, casting or comparison operations can have an unexpected outcome, and XML data retrieved from the database may contain incorrect values. Below are discussed different means of avoiding code page conversion problems so as to ensure the integrity of stored XML data and the operations that involve it.
Whenever XML data is inserted into a DB2 database server through a host variable or parameter marker that has a character data type (a data type of CHAR, VARCHAR or CLOB that is not a FOR BIT DATA type), code page conversion occurs if the database code page differs from that of the client or application issuing the request. A second conversion occurs as the inserted character data is converted from the database code page to Unicode, the format in which XML data is managed internally.
Scenario | XML document encoding | Database encoding | Code pages match? |
---|---|---|---|
1. | Unicode (UTF-8) | Unicode (UTF-8) | yes |
2. | Non-Unicode | Unicode (UTF-8) | no |
3. | Non-Unicode | Non-Unicode | yes |
4. | Unicode (UTF-8) | Non-Unicode | no |
5. | Non-Unicode | Non-Unicode | no |
The safest way to ensure the integrity of XML data is to use a Unicode database. However, if this is not possible, there are other ways to prevent character substitution from occurring. The following list describes various methods of inserting XML data safely, whether or not a Unicode database is used:
When a Unicode database cannot be used, code page conversion of XML data can also be avoided by binding the XML data using a host variable or parameter marker of type XML or any binary data type. That is, specifying a data type other than CHAR, VARCHAR, or CLOB for the XML data allows it to be passed directly from the client or application code page to Unicode, bypassing conversion to the database code page.
The ENABLE_XMLCHAR configuration parameter allows you to control whether or not inserting is allowed via character data types. Setting ENABLE_XMLCHAR to “NO” will block the usage of character data types during XML document insertion, preventing possible character substitution and ensuring the integrity of stored XML data. The BLOB and FOR BIT DATA types are still allowed, since these data types are safe from code page conversion. By default, ENABLE_XMLCHAR is set to “YES” so that insertion of character data types is allowed.
When a Unicode database is used code page conversion is never a problem, so in this case the ENABLE_XMLCHAR configuration parameter has no effect; character data types can be used for XML document insertion regardless of the setting for ENABLE_XMLCHAR.
As with inserting XML data into a database, the safest way to ensure data integrity during a query involving XML data is to use a Unicode database. If this is not possible, character substitution is avoided by making sure that all XML data is representable in the database code page, or by using character entity references for characters that are not in the database code page.
Examples
The following examples demonstrate the possible effects of code page conversion when a client or application with a UTF-8 encoding is used to query XML data in a non-Unicode database. In these examples, assume that the database is created using code page ISO8859-7 (Greek). XQuery expressions are used to match XML data stored in table T1, where the stored XML data consists of the Unicode Greek sigma character (ΣG) and the Unicode mathematical sigma character (ΣM). The code point 0xD3 identifies the sigma character in the ISO8859-7 database.
CREATE TABLE T1 (DOCID BIGINT NOT NULL, XMLCOL XML);
INSERT INTO T1 VALUES (1, XMLPARSE(
document '<?xml version="1.0" encoding="utf-8" ?> <Specialchars>
<sigma>ΣG</sigma>
<summation>ΣM</summation>
</Specialchars>'
preserve whitespace));
XQUERY for $test in db2-fn:xmlcolumn("T1.XMLCOL")//*[. = "ΣG"] return $test
This
expression produces the desired result: <sigma>ΣG</sigma>
In
this case, the expression ΣG begins at the client as the
Unicode code point for the Greek sigma character (U+03A3), is converted
to the sigma character in the Greek database code page (0xD3), and
is then converted back to the correct Unicode character for XML processing.
Because the Greek sigma character is representable in the database
code page, the expression matches correctly. This character conversion
is shown in the following table:Client (UTF-8) | Database (ISO8859-7) | XML parser (UTF-8) | |||
---|---|---|---|---|---|
Character | U+03A3 (Greek sigma) | → | 0xD3 (Greek sigma) | → | U+03A3 (Greek sigma) |
XQUERY for $test in db2-fn:xmlcolumn("T1.XMLCOL")//*[. = "ΣM"] return $test
This
expression does not produce the desired result:<sigma>ΣG</sigma>
In
this case, the expression ΣM begins at the client as the
Unicode code point for the mathematical symbol sigma (U+2211), is
converted to the sigma character in the Greek database code page (0xD3)
and then matches the ΣG character when the XML comparison
occurs. For the return expression, the process is identical to that
in Example 1. The Unicode XML character ΣG converts first
to the sigma character in the Greek database code page (ΣA),
and then back to the Greek sigma character in the client UTF-8 code
page (ΣG). This character conversion is shown in the following
table:Client (UTF-8) | Database (ISO8859-7) | XML parser (UTF-8) | |||
---|---|---|---|---|---|
Character | U+2211 (Mathematical sigma) | → | 0xD3 (Greek sigma) | → | U+03A3 (Greek sigma) |
XQUERY for $test in db2-fn:xmlcolumn("T1.XMLCOL")//*[. = "ࢣ"]
return $test
This expression produces the desired result:<summation>ΣM</summation>
In
this case, the expression ΣM begins at the client as the
Unicode code point for the mathematical symbol sigma (U+2211), and
because it is escaped as character reference ࢣ, the Unicode
code point is preserved when it is passed to the XML parser, allowing
for successful comparison against the stored XML value ΣM.
The bypassing of character conversion is shown in the following table:Client (UTF-8) | Database (ISO8859-7) | XML parser (UTF-8) | |||
---|---|---|---|---|---|
Character | U+2211 (character reference for mathematical sigma) | → | "ࢣ" (character reference for mathematical sigma) | → | U+2211 (mathematical sigma) |
XQUERY for $test in db2-fn:xmlcolumn("T1.XMLCOL")//*[. = "ΣG"] return $test
This
query fails to match the correct value in table T1. In this case,
the Unicode character U+2211 (Greek sigma) does not have a matching
code point in the ASCII code page so a default substitution character
is introduced, in this case the question mark character ('?'). This
character conversion is shown in the following table:Client (UTF-8) | Database (ISO8859-1) | XML parser (UTF-8) | |||
---|---|---|---|---|---|
Character | U+2211 (mathematical sigma) | → | 0x003F ('?') | → | 0x003F ('?') |