DB2 Version 9.7 for Linux, UNIX, and Windows

Character strings

A character string is a sequence of bytes. The length of the string is the number of bytes in the sequence. If the length is zero, the value is called the empty string. This value should not be confused with the null value.

Fixed-length character string (CHAR)

All values in a fixed-length string column have the same length, which is determined by the length attribute of the column. The length attribute must be between 1 and 254 inclusive.

Varying-length character strings

There are two types of varying-length character strings:

A VARCHAR value can be up to 32 672 bytes long.
A CLOB (character large object) value can be up to 2 gigabytes minus 1 byte (2 147 483 647 bytes) long. A CLOB is used to store large SBCS or mixed (SBCS and MBCS) character-based data (such as documents written with a single character set) and, therefore, has an SBCS or mixed code page associated with it.

Special restrictions apply to expressions resulting in a CLOB data type, and to structured type columns; such expressions and columns are not permitted in:

A SELECT list preceded by the DISTINCT clause
A GROUP BY clause
An ORDER BY clause
A subselect of a set operator other than UNION ALL
A basic, quantified, BETWEEN, or IN predicate
An aggregate function
VARGRAPHIC, TRANSLATE, and datetime scalar functions
The pattern operand in a LIKE predicate, or the search string operand in a POSSTR function
The string representation of a datetime value.

The functions in the SYSFUN schema taking a VARCHAR as an argument will not accept VARCHARs greater than 4 000 bytes long as an argument. However, many of these functions also have an alternative signature accepting a CLOB(1M). For these functions, the user may explicitly cast the greater than 4 000 VARCHAR strings into CLOBs and then recast the result back into VARCHARs of desired length.

NUL-terminated character strings found in C are handled differently, depending on the standards level of the precompile option.

Each character string is further defined as one of:

Bit data: Data that is not associated with a code page.
Single-byte character set (SBCS) data: Data in which every character is represented by a single byte.
Mixed data: Data that may contain a mixture of characters from a single-byte character set and a multi-byte character set (MBCS).

Note: The LONG VARCHAR data type continues to be supported but is deprecated, not recommended, and might be removed in a future release.

String units in built-in functions

The ability to specify string units for certain built-in functions allows you to process string data in a more "character-based manner" than a "byte-based manner". The string unit determines the length in which an operation is to occur. You can specify CODEUNITS16, CODEUNITS32, or OCTETS as the string unit for an operation.

CODEUNITS16: Specifies that Unicode UTF-16 is the unit for the operation. CODEUNITS16 is useful when an application is processing data in code units that are two bytes in width. Note that some characters, known as supplementary characters, require two UTF-16 code units to be encoded. For example, the musical symbol G clef requires two UTF-16 code units (X'D834' and X'DD1E' in UTF-16BE).
CODEUNITS32: Specifies that Unicode UTF-32 is the unit for the operation. CODEUNITS32 is useful for applications that process data in a simple, fixed-length format, and that must return the same answer regardless of the storage format of the data (ASCII, UTF-8, or UTF-16).

Start of change Since CHAR is a fixed-length type, DB2 pads the string with blanks to complete the defined length. The defined length here is 5 * CODEUNITS32, which is 20 bytes and is stored on the lower level / page - disk storage. VARCHAR(x CODEUNITS16) would mean x 2-byte characters, VARCHAR(x CODEUNITS32) would mean x 4-byte characters, and VARCHAR(x OCTETS) would mean x bytes. End of change

Start of change For example, CHAR(5 CODEUNITS32), storing 'abc' is handled in memory as 'abc ' (two spaces) to meet the logical definition of the column. However, on disk it is stored as 'abc ' (17 spaces) since the DB2 Data Management Services layer requires the column to meet the physical definition. End of change

Note:

The storage on disk uses UTF-8 encoding. End of change

OCTETS: Specifies that bytes are the units for the operation. OCTETS is often used when an application is interested in allocating buffer space or when operations need to use simple byte processing.

The calculated length of a string computed using OCTETS (bytes) might differ from that computed using CODEUNITS16 or CODEUNITS32. When using OCTETS, the length of the string is determined by simply counting the number of bytes in the string. When using CODEUNITS16 or CODEUNITS32, the length of the string is determined by counting the number of 16-bit or 32-bit code units necessary to represent the string in UTF-16 or UTF-32, respectively. The length determined using CODEUNITS16 and CODEUNITS32 will be identical unless the data contains supplementary characters (see Difference between CODEUNITS16 and CODEUNITS32).

For example, assume that NAME, a VARCHAR(128) column encoded in Unicode UTF-8, contains the value 'Jürgen'. The following two queries, which count the length of the string in CODEUNITS16 and CODEUNITS32, respectively, return the same value (6).

   SELECT CHARACTER_LENGTH(NAME,CODEUNITS16) FROM T1
     WHERE NAME = 'Jürgen'

   SELECT CHARACTER_LENGTH(NAME,CODEUNITS32) FROM T1
     WHERE NAME = 'Jürgen'

The next query, which counts the length of the string in OCTETS, returns the value 7.

   SELECT CHARACTER_LENGTH(NAME,OCTETS) FROM T1
     WHERE NAME = 'Jürgen'

These values represent the length of the string expressed in the specified string unit.

The following table shows the UTF-8, UTF-16BE (big-endian), and UTF-32BE (big-endian) representations of the name 'Jürgen':

Format    Representation of the name 'Jürgen'
--------  --------------------------------------
UTF-8	  X'4AC3BC7267656E'
UTF-16BE 	X'004A00FC007200670065006E'
UTF-32BE 	X'0000004A000000FC0000007200000067000000650000006E'

The representation of the character 'ü' differs among the three string units:

The UTF-8 representation of the character 'ü' is X'C3BC'.
The UTF-16BE representation of the character 'ü' is X'00FC'.
The UTF-32BE representation of the character 'ü' is X'000000FC'.

Specifying string units for a built-in function does not affect the data type or the code page of the result of the function. If necessary, DB2® converts the data to Unicode for evaluation when CODEUNITS16 or CODEUNITS32 is specified.

When OCTETS is specified for the LOCATE or POSITION function, and the code pages of the string arguments differ, DB2 converts the data to the code page of the source-string argument. In this case, the result of the function is in the code page of the source-string argument. When OCTETS is specified for functions that take a single string argument, the data is evaluated in the code page of the string argument, and the result of the function is in the code page of the string argument.

Difference between CODEUNITS16 and CODEUNITS32

When CODEUNITS16 or CODEUNITS32 is specified, the result is the same except when the data contains Unicode supplementary characters. This is because a supplementary character is represented by two UTF-16 code units or one UTF-32 code unit. In UTF-8, a non-supplementary character is represented by 1 to 3 bytes, and a supplementary character is represented by 4 bytes. In UTF-16, a non-supplementary character is represented by one CODEUNITS16 code unit or 2 bytes, and a supplementary character is represented by two CODEUNITS16 code units or 4 bytes. In UTF-32, a character is represented by one CODEUNITS32 code unit or 4 bytes.

For example, the following table shows the hexadecimal values for the mathematical bold capital A and the Latin capital letter A. The mathematical bold capital A is a supplementary character that is represented by 4 bytes in UTF-8, UTF-16, and UTF-32.

Character	UTF-8 representation	UTF-16BE representation	UTF-32BE representation
Unicode value X'1D400' - 'A'; mathematical bold capital A	X'F09D9080'	X'D835DC00'	X'0001D400'
Unicode value X'0041' - 'A'; latin capital letter A	X'41'	X'0041'	X'00000041'

Assume that C1 is a VARCHAR(128) column, encoded in Unicode UTF-8, and that table T1 contains one row with the value of the mathematical bold capital A (X'F09D9080'). The following queries return different results:

Query                                                Returns
-----                                                -------
SELECT CHARACTER_LENGTH(C1,CODEUNITS16) FROM T1      2

SELECT CHARACTER_LENGTH(C1,CODEUNITS32) FROM T1      1

SELECT CHARACTER_LENGTH(C1,OCTETS) FROM T1           4