LC_COLLATE category

The LC_COLLATE category defines character or string collation information. Within LC_COLLATE, you can use the cpysyscol keyword to specify an existing sort sequence. When cpysyscol is specified, its value is used in place of the LC_COLLATE category definitions.

A collation element is the unit of comparison for collation. A collation element may be a character or a sequence of characters. Every collation element in the locale has a set of weights, which determine whether it collates before, equal to, or after the other collation elements in the locale. The CRTLOCALE command assigns these collation weights when it creates the locale from the locale definition source file. Application programs that compare strings then use these weights.

Every character defined in the CCSID that is specified in the CRTLOCALE command is itself a collating element. Additional collating elements can be defined using the collating-element statement. The syntax is:

collating-element character-symbol from string

The LC_COLLATE category begins with the LC_COLLATE keyword and ends with the END LC_COLLATE keyword.

The following keywords are recognized in the LC_COLLATE category:

cpysyscol
This statement specifies that a system collating sequence table is to be used for the collation information for the category. If the locale is intended to set the sort sequence table for the job, the CPYSYSCOL keyword is required. If the CPYSYSCOL keyword is specified, no other keyword can be specified. The syntax for the CPYSYSCOL keyword is:
CPYSYSCOL sort sequence path name;langid

The sort sequence path name is a string specifying the fully expanded path name of an existing sort sequence table to use as the definition for this category. The path name delimiter must be a slash (/). Alternatively, the string can contain one of the following special values:

*JOB
The sort sequence of the job.
*LANGIDUNQ
The unique-weighted sort sequence table that is associated with the requested language identifier (the langid value).
*LANGIDSHR
The shared-weighted sort sequence table that is associated with the requested language identifier (the langid value).
*HEX
The sort sequence according to the hexadecimal value of the characters.
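
The difference between *HEX ordering and a shared-weight table can be pictured with a small Python model. This is only a sketch, not how the system implements sort sequence tables: the cp037 codec (EBCDIC, CCSID 37) is an assumed encoding, and the use of lowercasing to stand in for shared weights is a simplification chosen for illustration.

```python
# Illustrative model only; the encoding and the weighting scheme are assumptions.
words = ["b", "A", "1", "a", "B"]

# *HEX: order by the hexadecimal (byte) value of the encoded characters.
# cp037 is an EBCDIC codec shipped with Python, assumed here for the sketch.
hex_order = sorted(words, key=lambda s: s.encode("cp037"))

# Shared weights: the upper- and lowercase forms of a letter receive the same
# weight, modeled here by lowercasing before encoding. Python's sort is stable,
# so items with equal weight keep their input order.
shared_order = sorted(words, key=lambda s: s.lower().encode("cp037"))

print(hex_order)     # ['a', 'b', 'A', 'B', '1']
print(shared_order)  # ['A', 'a', 'b', 'B', '1']
```

In this model, *HEX places all lowercase letters before all uppercase letters because of their EBCDIC byte values, while the shared-weight ordering keeps "A" and "a" together.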

The langid is a string specifying the language identifier of the sort sequence table to be used. The langid value must be in uppercase. Valid values are strings containing one of the following language identifiers:

*JOB
Use the language identifier of the job.
language id
A valid 3-character language identifier. For example, DAN for Danish.

collating-element
The collating-element statement specifies multi-character collating elements. The syntax for the collating-element statement is:
collating-element symbolic-name from string

The symbolic-name value names a string of characters that is to be treated as a single collating element. It cannot duplicate any system predefined symbolic name or any other symbolic name defined in this collation definition. The string value specifies the two or more characters or character symbols that make up the element. The following are examples of the syntax for the collating-element statement:

collating-element <ch> from "<c><h>"
collating-element <e-acute> from "<acute><e>"
collating-element <11> from "<1><1>"

A symbolic-name value defined by the collating-element statement is recognized only within the LC_COLLATE category.
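
To picture what a multi-character collating element means for string comparison, the following Python sketch splits a string into collating elements. It is a model under stated assumptions, not the system's algorithm: the function name split_collating_elements is hypothetical, and the longest-match rule is an assumption made for the illustration.

```python
def split_collating_elements(text, multi_char_elements):
    """Split text into collating elements (hypothetical helper).

    Multi-character elements are matched before single characters,
    trying longer candidates first (an assumed longest-match rule).
    """
    candidates = sorted(multi_char_elements, key=len, reverse=True)
    elements = []
    i = 0
    while i < len(text):
        for cand in candidates:
            if text.startswith(cand, i):
                elements.append(cand)
                i += len(cand)
                break
        else:
            # No multi-character element starts here; take one character.
            elements.append(text[i])
            i += 1
    return elements


print(split_collating_elements("chico", {"ch"}))  # ['ch', 'i', 'c', 'o']
```

With "ch" defined as one element, the string is compared as four units rather than five characters.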

order_start
The order_start statement may be followed by one or more collation order statements, assigning collation weights to collating elements. This statement is required. The syntax for the order_start statement is:
order_start  sort-rules;sort-rules;...sort-rules collation-order-statements order_end

The sort-rules have the following syntax:

directive, directive,...directive

where each directive is one of forward, backward, or position.

The sort-rules directives are optional. If present, they define the rules to apply during string comparison. The number of specified sort-rules directives defines the number of weights each collating element is assigned (that is, the number of collation orders in the locale). If no sort-rules directives are present, one forward directive is assumed.

If present, the first sort-rules directive applies when comparing strings using primary weight, the second when comparing strings using the secondary weight, and so on. Each set of sort-rules directives is separated by a ; (semicolon). A sort-rules directive consists of one or more comma-separated directives. The following directives are supported:

forward
Specifies that collation weight comparisons proceed from the beginning of a string toward the end of the string.
backward
Specifies that collation weight comparisons proceed from the end of a string toward the beginning of the string.
position
Specifies that collation weight comparisons consider the relative position of non-ignored elements in the string. That is, if strings compare equal, the element with the shortest distance from the starting point of the string collates first.
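
The effect of the forward and backward directives on a two-level comparison can be sketched in Python. The weight table and the sort_key function are hypothetical and exist only for this illustration; the position directive is not modeled.

```python
# Hypothetical two-level weights: (primary, secondary).
# "a" and "A" share a primary weight and differ only at the secondary level.
WEIGHTS = {"a": (1, 1), "A": (1, 2), "b": (2, 1), "B": (2, 2)}


def sort_key(text, rules=("forward", "backward")):
    """Build one weight sequence per level, reversed for 'backward' levels."""
    key = []
    for level, rule in enumerate(rules):
        weights = [WEIGHTS[ch][level] for ch in text]
        if rule == "backward":
            weights.reverse()  # compare from the end of the string
        key.append(tuple(weights))
    return tuple(key)


words = ["AB", "aB", "Ab", "ab"]

forward_backward = sorted(words, key=sort_key)
forward_forward = sorted(words, key=lambda w: sort_key(w, ("forward", "forward")))

print(forward_backward)  # ['ab', 'Ab', 'aB', 'AB']
print(forward_forward)   # ['ab', 'aB', 'Ab', 'AB']
```

All four strings tie on primary weight, so the secondary level decides. Scanning the secondary weights from the end of the string makes a difference in the last character count more than a difference in the first.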

The forward and backward directives are mutually exclusive. The following example shows the syntax for the sort-rules directives:

order_start  forward;backward

order_end
This keyword ends collating order entries introduced by the order_start keyword.

The order of the characters and elements specified between the order_start and order_end keywords defines the character order used in range expressions and regular expressions. If no weights are assigned to the characters, then the character order also becomes the collation sequence weight.

Special symbols

Special symbols must be written in uppercase. The following special symbols can be used in the LC_COLLATE category:

  • IGNORE

    The optional operands for each collation element are used to define the primary, secondary, or subsequent weights for the collating element. The special symbol IGNORE is used to indicate a collating element that is to be ignored when strings are compared.

  • UNDEFINED

    All characters in the character set must be placed in the collation order, either explicitly or implicitly, by using the UNDEFINED symbol. The UNDEFINED symbol includes all coded character set values not specified explicitly. These characters are inserted in the character collation order at the point indicated by the UNDEFINED symbol, in the order of their character code page values. If a collating weight is not explicitly specified for the UNDEFINED symbol, then by default, all of the undefined characters are assigned the same collating weight, equal to the relative order of the first undefined character in the collating sequence. If no UNDEFINED special symbol exists and the collation order does not specify all collation elements from the coded character set, a warning is issued, and all undefined characters are placed at the end of the character collation order and are given the same collating weight.
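
The placement of undefined characters can be modeled with a short Python sketch. The build_order function is hypothetical, and Python code point order stands in for the code page order of the locale's CCSID, which is an assumption of this illustration. Weight assignment and the warning are not modeled.

```python
def build_order(explicit_order, charset):
    """Return the character order with undefined characters filled in.

    explicit_order: entries as listed between order_start and order_end,
                    optionally containing the marker "UNDEFINED".
    charset:        every character in the (assumed) coded character set.
    """
    defined = {entry for entry in explicit_order if entry != "UNDEFINED"}
    # Undefined characters keep their code value order (modeled by sorted()).
    undefined = sorted(ch for ch in charset if ch not in defined)

    result = []
    placed = False
    for entry in explicit_order:
        if entry == "UNDEFINED":
            result.extend(undefined)  # insert at the point of the symbol
            placed = True
        else:
            result.append(entry)
    if not placed:
        # No UNDEFINED symbol: undefined characters go at the end.
        result.extend(undefined)
    return result


print(build_order(["a", "UNDEFINED", "b"], "ab!#"))  # ['a', '!', '#', 'b']
print(build_order(["a", "b"], "ab#!"))               # ['a', 'b', '!', '#']
```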

Example 1

Here is an example of a collation order statement in the LC_COLLATE locale definition source file category.

The text that follows the LC_COLLATE keywords has been added for clarity and does not appear in the locale source file.

order_start  forward;backward
#           The order_start has two sort rules specified:
#           forward and backward
 
UNDEFINED   IGNORE;IGNORE
#           The UNDEFINED special symbol indicates that
#           all characters in the CCSID of the locale
#           that are not specified in the definition
#           are ignored for collation purposes.
 
<LOW>
#           <LOW> is a collating symbol that is ordered
#           after all undefined characters. For example, if there
#           were only two undefined characters, then the <LOW> symbol
#           would be third in the order.
 
#           All collating elements between <space> and <a> have the
#           same primary equivalence class and individual secondary
#           weights based on their coded character set values.
 
<a>       <a>;<a>
<a-acute> <a>;<a-acute>
<a-grave> <a>;<a-grave>
<A>       <a>;<A>
<A-acute> <a>;<A-acute>
<A-grave> <a>;<A-grave>
#           All characters between <a> and <A-grave> belong to the
#           same primary equivalence class because they have the same
#           primary weight.
 
<ch>      <ch>;<ch>
<Ch>      <ch>;<Ch>
#           The <c><h> multi-character collating element is
#           represented by the <ch> collating symbol and belongs to the
#           same primary equivalence class as the <Ch> multi-character
#           collating element.
 
<s>       <s>;<s>
<eszet>   "<s><s>";<s>
#           A one-to-many mapping is indicated by the <eszet>
#           character collated as an <s><s> string. That is, one
#           <eszet> character is expanded to <s><s> characters before
#           comparing.
 
<HIGH>
order_end
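
The one-to-many mapping for <eszet> in the example above can be pictured with the following Python sketch. The PRIMARY table and its numeric weights are hypothetical; only the primary level is modeled, and "\u00df" stands in for the <eszet> character.

```python
# Hypothetical primary weights. A character may map to several weights,
# which models a one-to-many mapping such as <eszet> -> "<s><s>".
PRIMARY = {
    "a": [1],
    "s": [2],
    "t": [3],
    "\u00df": [2, 2],  # <eszet> expands to the primary weights of "ss"
}


def primary_key(text):
    """Concatenate the primary weights of each character in text."""
    key = []
    for ch in text:
        key.extend(PRIMARY[ch])
    return key


print(primary_key("ass"))       # [1, 2, 2]
print(primary_key("a\u00df"))   # [1, 2, 2]
```

In this model the two strings are equal at the primary level, so any difference between them would have to come from the secondary weights.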

Example 2

Here is an example of a CPYSYSCOL statement in the LC_COLLATE locale definition source file category.

LC_COLLATE 
 
CPYSYSCOL "//QSYS.LIB//QLA10025S.TBL";"ENU"
 
END LC_COLLATE