Collating sequence

A collating sequence (also called a sort sequence) defines how characters in a character set relate to each other when they are compared and ordered.

Different collating sequences are useful for those who want their data ordered for a specific language. For example, lists can be ordered as they are normally seen for a specific language. A collating sequence can also be used to treat certain characters as equivalent, for instance, a and A. A collating sequence works on all comparisons that involve:

SBCS character data (including bit data)
the SBCS portion of mixed data
Unicode data (UTF-8, UCS-2, or UTF-16).

SBCS collating sequence support is implemented using a 256-byte table. Each byte in the table corresponds to a code point or character in a SBCS code page. Because the collating sequence is applicable to character data, a CCSID must be associated with the table. The bytes in the collating sequence table are set based on how each code point is to compare to other code points in that code page. For example, if the characters a and A are to be treated as equivalents for comparisons, the bytes in the collating sequence table for their code points contain the same value, or weight.

UCS-2 collating sequence support is implemented using a multi-byte table. A pair of bytes within the table corresponds to a character in the UCS-2 code page. Only a subset of the thousands of characters in UCS-2 are typically represented in the table. Only those characters that are to compare differently (and possibly other characters in the same ward) will be represented in the table. The bytes in the collating sequence table are set based on how each character is to compare with other characters in UCS-2.

When two or more bytes (or pair of bytes for UCS-2) in a collating sequence table have the same value, the collating sequence is a shared-weight collating sequence. If every byte (or pair of bytes for UCS-2) in a collating sequence table has a unique value, the collating sequence is a unique-weight collating sequence. For many languages, unique- and shared-weight collating sequences are shipped on the system as part of the operating system. If you need collating sequences for other languages or needs, you define them using the Create Table (CRTTBL) command.

UTF-8 and UTF-16 collating sequence support is implemented using ICU (International Components for Unicode). This is a standard API to sort Unicode. The API produces the same result for normalized and non-normalized data and returns a sort weight based on language specific rules. The IBM® i operating system supports ICU 2.3.1, ICU 3.4, and ICU 4.0 collating sequences, but ICU 3.4 or ICU 4.0 should be used. The ICU collating sequence table I34en_us (United States locale) will sort data differently than I34fr_FR (French locale).

If ICU is used, the LIKE predicate and the LOCATE, POSITION, POSSTR, and POSITION scalar functions are not supported.

If ICU 2.3.1 is used, the query cannot contain:

EXCEPT or INTERSECT operations,
VALUES in a fullselect,
OLAP specifications,
recursive common table expressions,
ORDER OF,
scalar fullselects (scalar subselects are supported),
full outer join,
LOBs in a GROUP BY,
grouping sets or super groups,
ORDER BY or FETCH FIRST n ROWS clause in a subselect,
OFFSET clause, or FETCH clause with a variable for N rows,
LISTAGG aggregate function,
VERIFY_GROUP_FOR_USER, LOCATE_IN_STRING, LTRIM or RTRIM with 2 arguments, EXTRACT function with EPOCH,
BSON_TO_JSON, JSON_ARRAY, JSON_ARRAYAGG, JSON_OBJECT, JSON_OBJECTAGG, JSON_QUERY, JSON_TABLE, JSON_TO_BSON, and JSON_VALUE functions, and the IS JSON and JSON_EXISTS predicates,
CONTAINS or SCORE functions,
XMLAGG, XMLATTRIBUTES, XMLCOMMENT, XMLCONCAT, XMLDOCUMENT, XMLELEMENT, XMLFOREST, XMLGROUP, XMLNAMESPACES, XMLPI, XMLROW, or XMLTEXT functions,
default values for user defined functions,
global variables, or
references to arrays.

An ICU collating sequence table will generally produce results that are more culturally correct, however:

The performance of SQL statements that use an ICU collating sequence table will generally perform worse than when using either an SBCS or UCS-2 collating sequence table. Indexes can be created with an ICU collating sequence table, however, to improve performance. In this case, the index key values will contain the ICU weighted value which will greatly reduce the number of times the system's ICU support is called.
The storage necessary for indexes that use an ICU collating sequence table will generally be greater than when using either an SBCS or UCS-2 collating sequence table. The key values can be up to 3 times longer than the length of SBCS data used to produce the key and up to 6 times longer than the length of DBCS data used to produce the key.

It is important to remember that the data itself is not altered by the collating sequence. Instead, a weighted representation of the data is used for the comparison. In SQL, a collating sequence is specified on the CRTSQLxxx, STRSQL, and RUNSQLSTM commands. The SET OPTION statement can be used to specify the collating sequence within the source of a program containing embedded SQL. The collating sequence applies to all character comparisons performed in the SQL statements. The default collating sequence on the system is the internal sequence that occurs when the hexadecimal representation of characters are used. This is the sequence you get when the SRTSEQ(*HEX) is specified. For programs precompiled with a release of the product that is earlier than Version 2 Release 3, the collating sequence is *HEX.

Collating sequences do not apply to FOR BIT DATA or binary string columns.

The collating sequence is not allowed for an index or unique constraint that contains a key column with a field procedure.

The collating sequence is explicitly specified through the following interfaces:

Table 1. Collating Sequence Interfaces
SQL Interface	Specification
Embedded SQL	SRTSEQ parameter on the Create SQL Program (CRTSQLxxx) commands. The SET OPTION statement can also be used to set the SRTSEQ values. (For more information about CRTSQLxxx commands, see Embedded SQL Programming.)
Run SQL Statements	SRTSEQ parameter on the Run SQL Statements (RUNSQLSTM) command. (For more information about the RUNSQLSTM command, see SQL Programming.)
Call Level Interface (CLI) on the server	SQL_ATTR_JOB_SORT_SEQUENCE environment variable (For more information about CLI, see SQL Call Level Interfaces (ODBC).)
JDBC or SQLJ on the server using IBM IBM Developer Kit for Java™	job.sort.sequence property object (For more information about JDBC and SQLJ, see IBM Developer Kit for Java.)
ODBC on a client using the IBM i Access Family ODBC Driver	Sort Type in ODBC Setup (For more information about ODBC, see IBM i Access.)
JDBC on a client using the IBM Toolbox for Java	Sort Sequence Table in JDBC Setup (For more information about JDBC, see IBM i Access.) (For more information about the IBM Toolbox for Java, see IBM Toolbox for Java.)
OLE DB on a client using the IBM i Access Family OLE DB Provider	Sort Sequence Connection Object Properties (For more information about OLE DB, see IBM i Access.)
ADO .NET on a client using the IBM i Access Family ADO .NET Provider	SortSequence in Connection Object Properties (For more information about ADO .NET, see IBM i Access.)

For more information about CCSIDs, see the Work with CCSIDs topic in the Globalization section of the IBM i Information Center. For more information about collating sequences and the sequences shipped with the system, see the DB2® and SQL collating sequence topic in the IBM i Information Center.