DB2 Version 9.7 for Linux, UNIX, and Windows

Japanese and traditional-Chinese extended UNIX code (EUC) considerations

Extended UNIX Code (EUC) for Japanese and Traditional-Chinese defines a set of encoding rules that can support from 1 to 4 character sets. In some cases, such as Japanese EUC (eucJP) and Traditional-Chinese EUC (eucTW), a character may be encoded using more than two bytes. Use of such an encoding scheme has implications when used as the code page of the database server or the database client. The key considerations involve the following:

Expansion or contraction of strings when converting between EUC code pages and double-byte code pages
Use of Universal Character Set-2 (UCS-2) as the code page for graphic data stored in a database server defined with the eucJP (Japanese) or eucTW (Traditional-Chinese) code pages.

With the exception of these considerations, the use of EUC is consistent with the double-byte character set (DBCS) support. References to double-byte have been changed to multibyte to reflect support for encoding rules that allow for character representations requiring more than 2 bytes. Detailed considerations for support of Japanese and Traditional-Chinese EUC are included here. This information should be considered by anyone using SQL with an EUC database server or an EUC database client, and used in conjunction with application development information.

Characters

Each multibyte character is considered a letter with the exception of the double-byte blank character which is considered a special character.

Tokens

multibyte lowercase alphabetic letters are not folded to uppercase. This differs from the single byte lowercase alphabetic letters in tokens which are generally folded to uppercase.

SQL identifiers

Conversion between a double-byte code page and an EUC code page may result in the conversion of double-byte characters to multibyte characters encoded with more than 2 bytes. As a result, an identifier that fits the length maximum in the double-byte code page may exceed the length in the EUC code page. Selecting identifiers for this type of environment must be done carefully to avoid expansion beyond the maximum identifier length.

Character strings

In an MBCS database, character strings may contain a mixture of characters from a single-byte character set (SBCS) and from multibyte character sets (MBCS). When using such strings, operations may provide different results if they are character based (treat the data as characters) or byte based (treat the data as bytes). Check the function or operation description to determine how mixed strings are processed.

Graphic strings

A graphic string is defined as a sequence of double-byte character data. In order to allow Japanese or Traditional-Chinese EUC data to be stored in graphic columns, EUC characters are encoded in UCS-2. Characters that are not double-byte characters under all supported encoding schemes (for example, PC or EBCDIC DBCS) should not be used with graphic columns. The results of using other than double-byte characters may result in replacement by substitution characters during conversion. Retrieval of such data will not return the same value as was entered.

String assignments

Conversion of a string is performed prior to the assignment. In cases involving an eucJP/eucTW code page and a DBCS code page, a character string may become longer (DBCS to eucJP/eucTW) or shorter (eucJP/eucTW to DBCS). This may result in errors on storage assignment and truncation on retrieval assignment. When the error on storage assignment is due to expansion during conversion, SQLSTATE 22524 is returned instead of SQLSTATE 22001.

Similarly, assignments involving graphic strings may result in the conversion of a UCS-2 encoded double-byte character to a substitution character in a PC or EBCDIC DBCS code page for characters that do not have a corresponding double-byte character. Assignments that replace characters with substitution characters will indicate this by setting the SQLWARN10 field of the SQLCA to 'W'.

In cases of truncation during retrieval assignment involving multibyte character strings, the point of truncation may be part of a multibyte character. In this case, each byte of the character fragment is replaced with a single-byte blank. This means that more than one single-byte blank may appear at the end of a truncated character string.

String comparisons

String comparisons are performed on a byte basis. Character strings also use the collating sequence defined for the database. Graphic strings do not use the collating sequence and, in an eucJP or eucTW database, are encoded using UCS-2. Thus, the comparison of two mixed character strings may have a different result from the comparison of two graphic strings even though they contain the same characters. Similarly, the resulting sort order of a mixed character column and a graphic column may be different.

Rules for result data types

The resulting data type for character strings is not affected by the possible expansion of the string. For example, a union of two CHAR operands will still be a CHAR. However, if one of the character string operands will be converted such that the maximum expansion makes the length attribute the largest of the two operands, then the resulting character string length attribute is affected. For example, consider the result expressions of a CASE expression that have data types of VARCHAR(100) and VARCHAR(120). Assume the VARCHAR(100) expression is a mixed string host variable (that may require conversion) and the VARCHAR(120) expression is a column in the eucJP database. The resulting data type is VARCHAR(200) since the VARCHAR(100) is doubled to allow for possible conversion. The same scenario without the involvement of an eucJP or eucTW database would have a result type of VARCHAR(120).

Notice that the doubling of the host variable length is based on the fact that the database server is Japanese EUC or Traditional-Chinese EUC. Even if the client is also eucJP or eucTW, the doubling is still applied. This allows the same application package to be used by double-byte or multibyte clients.

Rules for string conversions

The types of operations listed in the corresponding section of the SQL Reference may convert operands to either the application or the database code page.

If such operations are done in a mixed code page environment that includes Japanese or Traditional-Chinese EUC, expansion or contraction of mixed character string operands can occur. Therefore, the resulting data type has a length attribute that accommodates the maximum expansion, if possible. In cases where there are restrictions on the length attribute of the data type, the maximum allowed length for the data type is used. For example in an environment where maximum growth is double, a VARCHAR(200) host variable is treated as if it is a VARCHAR(400), but CHAR(200) host variable is treated as if it is a CHAR(254). A run-time error may occur when conversion is performed if the converted string would exceed the maximum length for the data type. For example, the union of CHAR(200) and CHAR(10) would have a result type of CHAR(254). If more than 254 bytes are required when the value from the left side of the UNION is converted, an error is returned.

In some cases, allowing for the maximum growth for conversion will cause the length attribute to exceed a limit. For example, UNION only allows columns up to 254 bytes. Thus, a query with a union that included a host variable in the column list (call it :hv1) that was a DBCS mixed character string defined as a varying length character string 128 bytes long, would set the data type to VARCHAR(256) resulting in an error preparing the query, even though the query in the application does not appear to have any columns greater than 254. In a situation where the actual string is not likely to cause expansion beyond 254 bytes, the following can be used to prepare the statement.

   SELECT CAST(:hv1 CONCAT ' AS VARCHAR(254)), C2 FROM T1
   UNION
   SELECT C1, C2 FROM T2

The concatenation of the null string with the host variable will force the conversion to occur before the cast is done. This query can be prepared in the DBCS to eucJP/eucTW environment although a truncation error may occur at run time.

This technique (null string concat with cast) can be used to handle the similar 254-byte limit for SELECT DISTINCT or use of the column in ORDER BY or GROUP BY clauses.

Graphic string constants

Japanese or Traditional-Chinese EUC client, may contain single or multibyte characters (like a mixed character string). The string should not contain more than 2000 bytes. It is recommended that only characters that convert to double-byte characters in all related PC and EBCDIC double-byte code pages be used in graphic constants. A graphic string constant in an SQL statement is converted from the client code page to the double-byte encoding at the database server. For a Japanese or Traditional-Chinese EUC server, the constant is converted to UCS-2, the double-byte encoding used for graphic strings. For a double-byte server, the constant is converted from the client code page to the DBCS code page of the server.

Functions

The design of user-defined functions should consider the impact of supporting Japanese or Tradition-Chinese EUC on the parameter data types. One part of function resolution considers the data types of the arguments to a function call. Mixed character string arguments involving a Japanese or Traditional-Chinese EUC client may require additional bytes to specify the argument. This may require that the data type change to allow the increased length. For example, it may take 4001 bytes to represent a character string in the application (a LONG VARCHAR) that fits into a VARCHAR(4000) string at the server. If a function signature is not included that allows the argument to be a LONG VARCHAR, function resolution will fail to find a function.

Some functions exist that do not allow long strings for various reasons. Use of LONG VARCHAR or CLOB arguments with such functions will not succeed. For example, LONG VARCHAR as the second argument of the built-in POSSTR function, will fail function resolution (SQLSTATE 42884).

Expressions with the concatenation operator

The potential expansion of one of the operands of concatenation may cause the data type and length of concatenated operands to change when in an environment that includes a Japanese or Traditional-Chinese EUC database server. For example, with an EUC server where the value from a host variable may double in length, consider the following example.

     CHAR200 CONCAT :char50

The column CHAR200 is of type CHAR(200). The host variable char50 is defined as CHAR(50). The result type for this concatenation operation would normally be CHAR(250). However, given an eucJP or eucTW database server, the assumption is that the string may expand to double the length. Hence char50 is treated as a CHAR(100) and the resulting data type is VARCHAR(300). Note that even though the result is a VARCHAR, it will always have 300 bytes of data including trailing blanks. If the extra trailing blanks are not desired, define the host variable as VARCHAR(50) instead of CHAR(50).

LIKE predicate

For a LIKE predicate involving mixed character strings in an EUC database:

An SBCS halfwidth underscore character refers to one SBCS character.
A non-SBCS fullwidth underscore character refers to one non-SBCS character.
An SBCS halfwidth or non-SBCS fullwidth percent sign character refers to zero or more SBCS or non-SBCS characters.

The escape character must be one SBCS or non-SBCS character. In a character column, the escape character can also be a binary string containing exactly one byte.

Note that use of the underscore character may produce different results, depending on the code page of the LIKE operation. For example, Katakana characters in Japanese EUC are multibyte characters (CS2) but in the Japanese DBCS code page they are single-byte characters. A query with the single-byte underscore in the pattern-expression would return occurrences of Katakana character in the position of the underscore from a Japanese DBCS server. However, the same rows from the equivalent table in a Japanese EUC server would not be returned, since the Katakana characters will only match with a double-byte underscore.

For a LIKE predicate involving graphic strings in an EUC database:

A fullwidth underscore character (U+FF3F) refers to one Unicode character.
A fullwidth percent sign character (U+FF05) refers to zero or more Unicode characters.

LENGTH function

The processing of this function is no different for mixed character strings in an EUC environment. The value returned is the length of the string in the code page of the argument. As of Version 8, if the argument is a host variable, the value returned is the length of the string in the database code page. When using this function to determine the length of a value, careful consideration should be given to how the length is used. This is especially true for mixed string constants since the length is given in bytes, not characters. For example, the length of a mixed string column in a DBCS database returned by the LENGTH function may be less than the length of the retrieved value of that column on an eucJP or eucTW client due to the conversion of some DBCS characters to multibyte eucJP or eucTW characters.

SUBSTR function

The SUBSTR function operates on mixed character strings on a byte basis. The resulting string may therefore include fragments of multibyte characters at the beginning or end of the resulting string. No processing is provided to detect or process fragments of characters.

TRANSLATE function

The TRANSLATE function supports mixed character strings including multibyte characters. The corresponding characters of the to-string-exp and the from-string-exp must have the same number of bytes and cannot end with part of a multibyte character.

The pad-char-exp must result in a single-byte character when the char-string-exp is a character string. Since TRANSLATE is performed in the code page of the char-string-exp, the pad-char-exp may be converted from a multibyte character to a single-byte character.

A char-string-exp that ends with part of a multibyte character will not have those bytes translated.

VARGRAPHIC function

The VARGRAPHIC function on a character string operand in a Japanese or Traditional-Chinese EUC code page returns a graphic string in the UCS-2 code page.

Single-byte characters are converted first to their corresponding double-byte character in the code set to which they belong (eucJP or eucTW). Then they are converted to the corresponding UCS-2 representation. If there is no double-byte representation, the character is converted to the double-byte substitution character defined for that code set before being converted to UCS-2 representation.
Characters from eucJP that are Katakana (eucJP CS2) are actually single byte characters in some encoding schemes. They are thus converted to corresponding double-byte characters in eucJP or to the double-byte substitution character before converting to UCS-2.
multibyte characters are converted to their UCS-2 representations.

CONNECT statement

The processing of a successful CONNECT statement returns information in the SQLCA that is important when the possibility exists for applications to process data in an environment that includes a Japanese or Traditional-Chinese EUC code page at the client or server. The SQLERRD(1) field gives the maximum expansion of a mixed character string when converted from the application code page to the database code page. The SQLERRD(2) field gives the maximum expansion of a mixed character string when converted from the database code page to the application code page. The value is positive if expansion could occur and negative if contraction could occur. If the value is negative, the value is always -1 since the worst case is that no contraction occurs and the full length of the string is required after conversion. Positive values may be as large as 2, meaning that in the worst case, double the string length may be required for the character string after conversion.

The code page of the application server and the application client are also available in the SQLERRMC field of the SQLCA.

PREPARE statement

The data types determined for untyped parameter markers are not changed in an environment that includes Japanese or Traditional-Chinese EUC. As a result, it may be necessary in some cases to use typed parameter markers to provide sufficient length for mixed character strings in eucJP or eucTW. For example, consider an insert to a CHAR(10) column. Preparing the statement:

   INSERT INTO T1 (CH10) VALUES (?)

would result in a data type of CHAR(10) for the parameter marker. If the client was eucJP or eucTW, more than 10 bytes may be required to represent the string to be inserted but the same string in the DBCS code page of the database is not more than 10 bytes. In this case, the statement to prepare should include a typed parameter marker with a length greater than 10. Thus, preparing the statement:

   INSERT INTO T1 (CH10) VALUES (CAST(? AS VARCHAR(20))

would result in a data type of VARCHAR(20) for the parameter marker.