Related information
Many z/OS shell commands
match strings of text in text files using a type of pattern
known as a regular expression. A regular expression lets you
find strings
in text files not only by direct match, but also by extended matches,
similar to, but much more powerful than the file name patterns described
in sh.
The newline character at the
end of each input line is never explicitly matched by any regular
expression or part thereof.
expr and ed take basic
regular expressions;
all other shell commands accept extended regular expressions. grep and sed accept
basic regular expressions, but will accept extended regular expressions
if the -E option is used.
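The difference between the two flavors can be sketched with grep. This is a minimal illustration, assuming a POSIX-style grep; the sample lines and variable names are invented for the example.

```shell
# Hypothetical sample input: three lines.
input='ab
aab
a+b'

# Basic regular expression: + is an ordinary character,
# so only the line containing the literal text "a+b" is selected.
bre_matches=$(printf '%s\n' "$input" | grep 'a+b')

# Extended regular expression (-E): + means "one or more",
# so lines containing one or more a followed by b are selected.
ere_matches=$(printf '%s\n' "$input" | grep -E 'a+b')

printf 'BRE:\n%s\n' "$bre_matches"
printf 'ERE:\n%s\n' "$ere_matches"
```

With the basic form only the third line is selected; with -E the first two lines are selected instead.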
Regular
expressions can be made up of normal characters or special
characters, sometimes called metacharacters. Basic and extended
regular expressions differ only in the metacharacters they can contain.
The
basic regular expression metacharacters are:
^ $ . * \( \) [ \{ \} \
The
extended regular expression metacharacters are:
| ^ $ . * + ? ( ) [ { } \
These
have the following meanings:
- .
- A dot character matches any single character of the input line.
- ^
- The ^ character does not match any character
but represents the beginning of the input line. For example, ^A is
a regular expression matching the letter A at the
beginning of a line. The ^ character is only special
at the beginning of a regular expression, or after a ( or |.
- $
- This does not match any character but represents the end of the
input line. For example, A$ is a regular expression
matching the letter A at the end of a line. The $ character
is only special at the end of a regular expression, or before a ) or |.
- [bracket-expression]
- A bracket expression enclosed in square brackets is a regular expression
that matches a single character, or collation element. This bracket
expression applies not only to regular expressions, but also to pattern
matching as performed by the fnmatch() function
(used in file name expansion).
- If the initial character is a circumflex (^),
then this bracket expression is complemented. It matches any character
or collation-element except for the expressions specified in the bracket
expression. For pattern matching, as performed by the fnmatch function,
this initial character is instead ! (the exclamation
mark).
- If the first character after any potential circumflex is either
a dash (-), or a closing square bracket (]),
then that character matches exactly that character—that is, a literal
dash or closing square bracket.
- You can specify collation sequences by enclosing their names inside square brackets and periods. For example, [.ch.] matches
the multicharacter collation sequence ch (if the
current language supports that collation sequence). Any single character
stands for itself. Do not give a collation sequence that is not part of the
current locale.
- Equivalence classes can be specified by enclosing a character
or collation sequence inside
square bracket equals. For example, [=a=] matches
any character in the same equivalence class as a.
This normally expands to all the variants of a in
the current locale, for example, a, ä, á,
and so on. In some locales it might include both the uppercase and lowercase
forms of a given character. In the POSIX locale, this always expands to
only the character given.
- Within a character class expression (one made with square
brackets), the following constructs can be used to represent sets
of characters. These constructs are used for globalization and handle
the different collation sequences as required by POSIX.
- [:alpha:]
- Any alphabetic character.
- [:lower:]
- Any lowercase alphabetic character.
- [:upper:]
- Any uppercase alphabetic character.
- [:digit:]
- Any digit character.
- [:alnum:]
- Any alphanumeric character (alphabetic or digit).
- [:space:]
- Any white space character (blank, horizontal tab, vertical tab).
- [:graph:]
- Any printable character, except the blank character.
- [:print:]
- Any printable character, including the blank character.
- [:punct:]
- Any printable character that is not white space or alphanumeric.
- [:cntrl:]
- Any nonprintable character.
For example, given the character class expression:
[:alpha:]
you need to enclose the expression
within another set of square brackets, as in:
/[[:alpha:]]/
- Character ranges are specified by a dash (-)
between two characters or collation sequences. A range indicates all
characters or collation sequences that collate between the two
endpoints. It does not refer to the native character
set. For example, in the POSIX locale, [a-z] means
all the lowercase alphabetics, even if they don't agree with the binary
machine ordering. However, because many other locales do not collate
in this manner, the use of ranges is not recommended, and ranges are not used
in strictly conforming POSIX.2 applications. An endpoint of a range
can explicitly be a collation sequence; for example, [[.ch.]-[.ll.]] is
valid. However, equivalence classes or character classes are not: [[=a=]-z] is
not permitted.
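The bracket-expression rules above can be tried out with grep. This is a sketch only, assuming a POSIX-style grep; it forces the POSIX locale so that classes and ranges behave as described for that locale. The sample lines and variable names are invented.

```shell
# Use the POSIX locale so character classes and ranges are predictable.
LC_ALL=C
export LC_ALL

# Hypothetical sample input: four lines.
input='abc
123
a-b
x]y'

# Character class inside a bracket expression: lines containing a digit.
digit_lines=$(printf '%s\n' "$input" | grep '[[:digit:]]')

# Leading circumflex complements the set: lines containing a
# character that is not alphanumeric.
nonalnum_lines=$(printf '%s\n' "$input" | grep '[^[:alnum:]]')

# A closing square bracket placed first is a literal ].
bracket_lines=$(printf '%s\n' "$input" | grep '[]]')

# A dash placed first is a literal dash; 0-9 is still a range.
dash_or_digit=$(printf '%s\n' "$input" | grep '[-0-9]')

printf 'digit:     %s\n' "$digit_lines"
printf 'non-alnum: %s\n' "$nonalnum_lines"
printf 'bracket:   %s\n' "$bracket_lines"
printf 'dash/digit: %s\n' "$dash_or_digit"
```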
- \
- This character turns off the special meaning of metacharacters.
For example, \. only matches a dot character. Note
that \\ matches a literal \ character.
Also note the special case of “\d” described
in the following paragraph.
- \d
- For d representing any single decimal
digit (from 1 to 9), this pattern is equivalent to the string matching
the dth expression enclosed within the () characters
(or \(\) for some commands) found at an earlier
point in the regular expression. Parenthesized expressions are
numbered by counting ( characters from the left.
Constructs of this form can be used in the replacement strings
of substitution commands (for example, the sub function
of awk), to stand for constructs matched
by parts of the regular expression.
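As an illustration of \d, the following sketch uses a back-reference first inside a grep pattern and then inside a sed replacement string. It assumes POSIX-style grep and sed with basic regular expressions; the input text and variable names are invented.

```shell
# \1 inside the pattern: the text matched by the first \( \) group
# must appear again later in the line.
repeated=$(printf '%s\n' 'abcab' 'abcxy' | grep '\(ab\)c\1')

# \1 and \2 inside a sed replacement string stand for the text
# matched by the first and second groups.
swapped=$(printf '%s\n' 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')

printf '%s\n' "$repeated"
printf '%s\n' "$swapped"
```

Only the line abcab is selected by grep, and sed prints the two words in reversed order.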
- regexp*
- A regular expression regexp followed
by * matches a string of zero or more strings
that match regexp. For example, A* matches A, AA, AAA and
so forth. It also matches the null string (zero occurrences of A).
- regexp+
- A regular expression regexp followed
by + matches a string of one or more strings
that match regexp.
- regexp?
- A regular expression regexp followed
by ? matches zero or one occurrence
of a string that matches regexp.
- char{n} | char\{n\}
- In this expression (and the ones to follow), char is
a regular expression that stands for a single character—for example,
a literal character or a period (.). Such a regular
expression followed by a number in brace brackets stands for that
number of repetitions of a character. For example, X\{3\} stands
for XXX. In basic regular expressions, in order to
reduce the number of special characters, { and } must
be escaped by the \ character to make them special,
as shown in the second form (and the ones to follow).
- char{min,} | char\{min,\}
- When a number, min, followed by a comma
appears in braces following a single-character regular expression,
it stands for at least min repetitions of
a character. For example, X\{3,\} stands for at least
three repetitions of X.
- char{min,max} | char\{min,max\}
- When a single-character regular expression is followed by a pair
of numbers in braces, it stands for at least min repetitions
and no more than max repetitions of a character.
For example, X\{3,7\} stands for three to seven repetitions
of X.
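The three interval forms can be compared side by side. This is a sketch assuming a POSIX-style grep; the anchors ^ and $ are added so that the whole line must consist of the repeated character, and the sample lines and variable names are invented.

```shell
# Hypothetical sample input: lines of 2, 3, and 8 X characters.
input='XX
XXX
XXXXXXXX'

# Basic form, braces escaped: exactly three X on the line.
exactly3=$(printf '%s\n' "$input" | grep '^X\{3\}$')

# Extended form: at least three X on the line.
atleast3=$(printf '%s\n' "$input" | grep -E '^X{3,}$')

# Extended form: between three and seven X on the line.
three_to_seven=$(printf '%s\n' "$input" | grep -E '^X{3,7}$')

printf 'exactly 3:\n%s\n' "$exactly3"
printf 'at least 3:\n%s\n' "$atleast3"
printf '3 to 7:\n%s\n' "$three_to_seven"
```

Without the anchors, X\{3\} would also select the eight-character line, because that line contains three consecutive X characters.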
- regexp1 | regexp2
- This expression matches either regular expression regexp1 or regexp2.
- (regexp) | \(regexp\)
- This lets you group parts of regular expressions. Except where
overridden by parentheses, repetition operators bind most tightly,
followed by concatenation; alternation (|) binds least tightly.
In basic regular expressions, in order to reduce the number of special
characters, ( and ) must be escaped
by the \ character to make them special, as shown
in the second form.
Several regular expressions can be concatenated
to form a larger regular expression.
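The repetition, alternation, and grouping operators described above can be exercised as follows. This is a sketch assuming a POSIX-style grep -E; the -x option (select only lines matched in their entirety) is used for brevity and is assumed to be available. Sample data and variable names are invented.

```shell
# Alternation versus grouping.
words='ab
cd
abd
acd'

# Without parentheses the alternatives are the whole strings ab and cd.
alt=$(printf '%s\n' "$words" | grep -E -x 'ab|cd')

# With parentheses only b and c are alternatives; a and d are required.
grouped=$(printf '%s\n' "$words" | grep -E -x 'a(b|c)d')

# Repetition operators applied to the single character b.
runs='ac
abc
abbc'

star=$(printf '%s\n' "$runs" | grep -E -x 'ab*c')      # zero or more b
plus=$(printf '%s\n' "$runs" | grep -E -x 'ab+c')      # one or more b
optional=$(printf '%s\n' "$runs" | grep -E -x 'ab?c')  # zero or one b

printf 'ab|cd:\n%s\n' "$alt"
printf 'a(b|c)d:\n%s\n' "$grouped"
printf 'ab*c:\n%s\n' "$star"
printf 'ab+c:\n%s\n' "$plus"
printf 'ab?c:\n%s\n' "$optional"
```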
Summary
The commands that use basic and
extended regular expressions are as
follows:
- Basic
- ed, expr, grep, sed
- Extended
- awk, grep with the -E option, sed with
the -E option
Table 1 summarizes
the features supported by each of these shell commands.
Table 1. Regular Expression Features (regexp)
Notation | awk | ed  | grep -E | expr | sed
.        | Yes | Yes | Yes     | Yes  | Yes
^        | Yes | Yes | Yes     | No   | Yes
$        | Yes | Yes | Yes     | Yes  | Yes
[…]      | Yes | Yes | Yes     | Yes  | Yes
[::]     | Yes | Yes | Yes     | Yes  | Yes
re*      | Yes | Yes | Yes     | Yes  | Yes
re+      | Yes | No  | Yes     | No   | No
re?      | Yes | No  | Yes     | No   | No
re|re    | Yes | No  | Yes     | No   | No
\d       | Yes | Yes | Yes     | Yes  | Yes
(…)      | Yes | No  | Yes     | No   | No
\(…\)    | No  | Yes | No      | Yes  | Yes
\<       | No  | No  | No      | No   | No
\>       | No  | No  | No      | No   | No
\{ \}    | Yes | No  | Yes     | No   | Yes
Examples
The
following patterns are given as illustrations, along with descriptions
of what they match:
- abc
- Matches any line of text containing the three letters abc in
that order.
- a.c
- Matches any string beginning with the letter a,
followed by any character, followed by the letter c.
- ^.$
- Matches any line containing exactly one character (the newline
is not counted).
- a(b*|c*)d
- Matches any string beginning with a letter a,
followed by either zero or more of the letter b,
or zero or more of the letter c, followed by the
letter d.
- .* [a-z]+ .*
- Matches any line containing a word, consisting of lowercase
alphabetic characters, delimited by at least one space on each side.
- (morty).*\1
- morty.*morty
- These expressions both match lines containing at least two occurrences
of the string morty.
- [[:space:][:alnum:]]
- Matches any character that is either a white space character or
alphanumeric.
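Several of the patterns above can be checked against a small sample file. This is a sketch assuming a POSIX-style grep in the POSIX locale, and assuming that grep -E accepts \1 as Table 1 indicates; the sample lines and variable names are invented.

```shell
# Use the POSIX locale so that [a-z] means the lowercase letters.
LC_ALL=C
export LC_ALL

# Hypothetical sample input: six lines.
input='abc
axc
x
morty and morty
morty alone
say hello now'

# a.c : an a, any single character, then a c.
a_dot_c=$(printf '%s\n' "$input" | grep 'a.c')

# ^.$ : lines containing exactly one character.
single=$(printf '%s\n' "$input" | grep '^.$')

# (morty).*\1 : lines containing morty at least twice.
twice=$(printf '%s\n' "$input" | grep -E '(morty).*\1')

# .* [a-z]+ .* : a lowercase word with a space on each side.
word=$(printf '%s\n' "$input" | grep -E '.* [a-z]+ .*')

printf 'a.c:\n%s\n' "$a_dot_c"
printf '^.$:\n%s\n' "$single"
printf '(morty).*\\1:\n%s\n' "$twice"
printf '.* [a-z]+ .*:\n%s\n' "$word"
```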