DB2 10.5 for Linux, UNIX, and Windows

DB2 Cancun Release: String unit attributes improve handling of multibyte characters

Starting in DB2® Cancun Release 10.5.0.4 database columns that hold character data have a string unit attribute that you use to define their string data type.

Historically, database columns that hold character data have been defined as storing a number of bytes. Within a single-byte character set, the number of bytes is equivalent to the number of characters that can be stored in the column. The string unit of the column data type is a byte, referred to as octets. This approach can cause problems when moving out of a single-byte character set and into multibyte character sets because the number of characters that you can store in a column depends on the encoding of those characters.

Define CHAR, VARCHAR, CLOB, GRAPHIC, VARGRAPHIC, and DBCLOB data types in a Unicode database by using one of the following string units:
OCTETS
Indicates that the units for the length attribute are bytes. This unit of length applies to all character string data types in a non-Unicode database. In a Unicode database, you can explicitly specify OCTETS, or it can be determined based on an environment setting.
CODEUNITS16
Indicates that the units for the length attribute are Unicode UTF-16 code units which are the same as counting in double bytes. This unit of length does not affect the underlying code page of the data type. You can use a string unit of CODEUNITS16 only with graphic string data types in a Unicode database.
CODEUNITS32
Indicates that the units for the length attribute are Unicode UTF-32 code units which approximate counting in characters. This unit of length does not affect the underlying code page of the data type. The actual length of a data value is determined by counting the UTF-32 code units that would apply if the data were converted to UTF-32. You can use a string unit of CODEUNITS32 only in a Unicode database.
The string data continues to be stored and processed according to the code page of the data type.

The string unit attribute is supported in the definition of table columns, routine arguments, array types, distinct types, row types, and variables and in the target data type of a CAST specification. You can use the string_units database configuration parameter or the NLS_STRING_UNITS global variable to make data types in string units CODEUNITS32 the default, or you can add them to a schema as needed by using the CODEUNITS32 keyword. You can use the ALTER TABLE statement to modify the string unit of table columns.

Since CHAR is a fixed-length type, DB2 pads the string with blanks to complete the defined length. The defined length here is 5 * CODEUNITS32, which is 20 bytes and is stored on the lower level / page - disk storage. VARCHAR(x CODEUNITS16) would mean x 2-byte characters, VARCHAR(x CODEUNITS32) would mean x 4-byte characters, and VARCHAR(x OCTETS) would mean x bytes.

For example, CHAR(5 CODEUNITS32), storing 'abc' is handled in memory as 'abc ' (two spaces) to meet the logical definition of the column. However, on disk it is stored as 'abc ' (17 spaces) since the DB2 Data Management Services layer requires the column to meet the physical definition.
Note: The storage on disk uses UTF-8 encoding.