The fn:tokenize function
breaks a string into a sequence of substrings.
Syntax
>>-fn:tokenize(--source-string--,--pattern--+----------+--)----><
                                            '-,--flags-'
- source-string
- A string that is to be broken into a sequence of substrings.
source-string is
an xs:string value or the empty sequence.
- pattern
- The delimiter between substrings in source-string.
pattern is
an xs:string value that contains a regular expression. A regular expression
is a set of characters, wildcards, and operators that define a string
or group of strings in a search pattern.
- flags
- An xs:string value that can contain any of the following values
that control how pattern is matched to characters in source-string:
- s
- Indicates that the dot (.) in the regular expression matches any
character, including the new-line character (X'0A').
If
the s flag is not specified, the dot (.) matches any character except
the new-line character (X'0A').
- m
- Indicates that the caret (^) matches the start of a line (the
position after a new-line character), and the dollar sign ($) matches
the end of a line (the position before a new-line character).
If the m flag is not specified, the caret (^) matches the start of the
string, and the dollar sign ($) matches the end of the string.
- i
- Indicates that matching is case-insensitive.
If the i flag is
not specified, case-sensitive matching is done.
- x
- Indicates that whitespace characters within pattern are
ignored.
If the x flag is not specified, whitespace characters are
used for matching.
- Limitation of length
The length of source-string and pattern is limited
to 32000 bytes.
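The effect of each flag can be tried out in any regular expression engine that offers similar switches. The following is a minimal sketch in Python, not XQuery: the FLAG_MAP table and the to_re_flags helper are hypothetical names introduced for illustration, and Python's re module is only an approximation of the XQuery regular expression dialect (for example, re.VERBOSE also treats # as the start of a comment, which the x flag described above does not).

```python
import re

# Hypothetical mapping from the flag letters described above
# to the closest Python re flags (an approximation, not the XQuery engine).
FLAG_MAP = {
    "s": re.DOTALL,      # dot also matches the new-line character
    "m": re.MULTILINE,   # ^ and $ match at line boundaries
    "i": re.IGNORECASE,  # case-insensitive matching
    "x": re.VERBOSE,     # whitespace in the pattern is ignored
}

def to_re_flags(flags):
    """Combine a flags string such as "si" into a Python re flag value."""
    result = 0
    for letter in flags:
        result |= FLAG_MAP[letter]
    return result

# s: without it the dot does not match X'0A'; with it the dot does.
dot_default = re.search("a.b", "a\nb")
dot_with_s = re.search("a.b", "a\nb", to_re_flags("s"))

# m: the caret matches after a new-line only when m is set.
caret_default = re.findall("^b", "a\nb")
caret_with_m = re.findall("^b", "a\nb", to_re_flags("m"))

# i: the delimiter "x" also matches "X" when i is set.
case_default = re.split("x", "1X2")
case_with_i = re.split("x", "1X2", flags=to_re_flags("i"))

# x: the blank inside the pattern "a b" is ignored when x is set.
space_default = re.split("a b", "1ab2")
space_with_x = re.split("a b", "1ab2", flags=to_re_flags("x"))
```

Flags can be combined in one string, so to_re_flags("si") in this sketch corresponds to passing "si" as the flags argument.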
Returned value
If source-string is not the empty sequence or a zero-length string,
the returned value is a sequence that results when the following
operations are performed on source-string:
- source-string is searched for characters that match pattern.
- If pattern contains two or more alternative sets of characters,
the first set of characters in pattern that matches characters
in source-string is considered to be the matching pattern.
- Each set of characters that does not match pattern becomes
an item in the result sequence.
- If pattern matches characters at the beginning of source-string,
the first item in the returned sequence is a string of length 0.
- If two successive matches for pattern are found within source-string,
a string of length 0 is added to the sequence.
- If pattern matches characters at the end of source-string,
the last item in the returned sequence is a string of length 0.
If pattern is not found in source-string,
source-string is returned as the only item in the sequence.
If source-string is the empty sequence,
or is the zero-length string, the result is the empty sequence.
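The splitting rules above can be sketched outside of XQuery. The following Python function is an illustration only: the name tokenize, the use of None for the empty sequence, and the use of Python's re module in place of the XQuery regular expression engine are all assumptions of this sketch. When nothing matches, it returns the whole input as a single item, which is how the XQuery fn:tokenize specification defines that case. The sketch also assumes that the pattern never matches a zero-length string.

```python
import re

def tokenize(source, pattern, flags=0):
    """Sketch of the splitting rules; not the actual fn:tokenize implementation."""
    # Empty sequence (modeled as None) or zero-length string: empty result.
    if not source:
        return []
    tokens = []
    start = 0
    # Each stretch of text between two matches becomes one item.
    # A match at the very beginning, two adjacent matches, or a match
    # at the very end each contribute a string of length 0.
    for match in re.finditer(pattern, source, flags):
        tokens.append(source[start:match.start()])
        start = match.end()
    # Text after the last match (or the whole string if nothing matched).
    tokens.append(source[start:])
    return tokens

leading_and_trailing = tokenize(",a,,b,", ",")
alternatives = tokenize("abracadabra", "(ab)|(a)")
no_match = tokenize("abc", ",")
empty_input = tokenize("", ",")
```

In the alternatives call, "ab" is listed first in the pattern, so it wins over "a" wherever both could match at the same position.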
Example
The following function call creates a
sequence from the string "Tokenize this sentence, please." The pattern "\s+"
is a regular expression that matches one or more whitespace characters.
fn:tokenize("Tokenize this sentence, please.", "\s+")
The
returned value is the sequence ("Tokenize", "this", "sentence,", "please.").
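For readers who want to experiment with the choice of delimiter pattern, the same split can be approximated in Python. This is a sketch under the assumption that Python's re.split behaves like fn:tokenize for these simple patterns; the variable names are hypothetical, and the second pattern is an added variation that is not part of the example above.

```python
import re

sentence = "Tokenize this sentence, please."

# Same pattern as the example: one or more whitespace characters.
by_whitespace = re.split(r"\s+", sentence)

# Variation: treat commas as delimiters too, so the comma
# is no longer part of the third item.
by_whitespace_or_comma = re.split(r"[\s,]+", sentence)
```

With the first pattern the comma stays attached to "sentence,"; with the second pattern it is consumed as part of the delimiter.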