Actions taken by the lexical analyzer
When the lexical analyzer matches one of the extended regular expressions in the rules section of the specification file, it executes the action that corresponds to that expression. Any input that no rule matches is copied to standard output. Therefore, do not create a rule that only copies the input to the output; the default output helps you find gaps in the rules.
When using the lex command to process input for a parser that the yacc command produces, provide rules to match all input strings. Those rules must generate output that the yacc command can interpret.
Null action
To ignore the input that an extended regular expression matches, use a ; (the C language null statement) as the action. The following rule ignores blanks, tabs, and new-line characters:
[ \t\n] ;
Same as next action
The | (vertical bar) action means that a rule uses the same action as the rule that follows it. The following rules have the same effect as the null action example:
" " |
"\t" |
"\n" ;
The quotation marks that surround \n and \t are not required.
Printing a matched string
[a-z]+ printf("%s",yytext);
The C language printf subroutine accepts a format argument and data to be printed. In this example, the arguments to the printf subroutine have the following meanings:
- %s
- A symbol that converts the data to type string before printing
- %S
- A symbol that converts the data to wide character string (wchar_t) before printing
- yytext
- The name of the array containing the data to be printed
- yywtext
- The name of the array containing the wide character (wchar_t) data to be printed
The ECHO action prints the contents of the yytext array, so the following two rules are equivalent:
[a-z]+ ECHO;
[a-z]+ printf("%s",yytext);
To change the representation of yytext, use one of the following directives in the definitions section of the specification file:
- %array
- Defines yytext as a null-terminated character array. This is the default.
- %pointer
- Defines yytext as a pointer to a null-terminated character string.
Finding the length of a matched string
- yyleng
- Tracks the number of bytes that are matched.
- yywleng
- Tracks the number of wide characters in the matched string. Multibyte characters have a length greater than 1.
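To count both the words in the input and the characters in those words, use the following action:
[a-zA-Z]+ {words++; chars += yyleng;}
This action counts each matched word in words and adds the length of that word to chars, so chars holds the total number of characters in the matched words.
The following expression refers to the last character of the matched string:
yytext[yyleng-1]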
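As a rough illustration of what the word-counting action computes, the following plain C function mimics the rule [a-zA-Z]+ without lex. The names count_words and word_totals are hypothetical and are not part of lex. The sketch assumes ASCII input in the C locale and treats each maximal run of letters as one match, with the length of the run standing in for yyleng.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical totals, standing in for the words and chars variables. */
struct word_totals {
    int words;
    int chars;
};

/* Scan text the way the rule [a-zA-Z]+ would: each maximal run of
 * letters is one match, and its length plays the role of yyleng. */
struct word_totals count_words(const char *text)
{
    struct word_totals totals = {0, 0};
    size_t i = 0;

    while (text[i] != '\0') {
        if (isalpha((unsigned char)text[i])) {
            int run_length = 0;           /* stands in for yyleng */
            while (isalpha((unsigned char)text[i])) {
                run_length++;
                i++;
            }
            totals.words++;
            totals.chars += run_length;
        } else {
            i++;                          /* characters outside words are skipped */
        }
    }
    return totals;
}
```

For the input "hello big world 42", this sketch reports 3 words and 13 characters, because the digits never match the letter pattern.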
Matching strings within strings
she {s++; REJECT;}
he {h++;}
\n |
. ;
After counting an occurrence of she, the lexical analyzer rejects that match and rescans the same input, so it also counts the he contained in she. Because he does not contain she, a REJECT action is not necessary on he.
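The effect of these rules can be checked against a plain C model. The following sketch counts every occurrence of a substring, including occurrences that lie inside a longer string, which is the result the REJECT rules produce for she and he. The function name count_occurrences is hypothetical. The equivalence is assumed only for patterns such as these two, which cannot overlap themselves.

```c
#include <string.h>

/* Count every occurrence of needle in text, including occurrences
 * that lie inside a longer matched string. */
int count_occurrences(const char *text, const char *needle)
{
    int count = 0;
    const char *p = text;

    if (*needle == '\0')
        return 0;                 /* avoid looping forever on an empty needle */
    while ((p = strstr(p, needle)) != NULL) {
        count++;
        p++;                      /* advance one character so nested occurrences are not skipped */
    }
    return count;
}
```

For the input "she sells the shells", the sketch finds she twice and he three times, because each she also contains a he.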
Adding results to the yytext array
Typically, the next string from the input stream overwrites the current entry in the yytext array. If you use the yymore subroutine, the next string from the input stream is added to the end of the current entry in the yytext array.
%s instring
%%
<INITIAL>\" { /* start of string */
    BEGIN instring;
    yymore();
}
<instring>\" { /* end of string */
    printf("matched %s\n", yytext);
    BEGIN INITIAL;
}
<instring>. {
    yymore();
}
<instring>\n {
    printf("Error, new line in string\n");
    BEGIN INITIAL;
}
Even though a string may be recognized by matching several rules, repeated calls to the yymore subroutine ensure that the yytext array will contain the entire string.
Returning characters to the input stream
To return characters to the input stream, use the following call:
yyless(n)
where n is the number of characters of the current string to keep. Characters in the string beyond this number are returned to the input stream. The yyless subroutine provides the same type of look-ahead function as the / (slash) operator, but it allows more control over its usage.
=-[a-zA-Z] {
    printf("Operator (=-) ambiguous\n");
    yyless(yyleng-1);
    /* ... action for =- ... */
}
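The bookkeeping that yyless performs can be pictured with a small model in plain C. This is a sketch under stated assumptions, not the code that lex generates: the scanner_model structure and the model_match and model_yyless functions are hypothetical names, with text and leng standing in for yytext and yyleng.

```c
#include <string.h>

#define TEXT_MAX     64
#define PUSHBACK_MAX 200

/* Hypothetical stand-in for the scanner state that yyless changes. */
struct scanner_model {
    char text[TEXT_MAX];          /* plays the role of yytext */
    int  leng;                    /* plays the role of yyleng */
    char pushback[PUSHBACK_MAX];  /* returned characters; the last one stored is read first */
    int  pushed;
};

/* Record a match, as the scanner does before it runs an action. */
void model_match(struct scanner_model *s, const char *matched)
{
    strncpy(s->text, matched, TEXT_MAX - 1);
    s->text[TEXT_MAX - 1] = '\0';
    s->leng = (int)strlen(s->text);
}

/* Keep the first n characters of the match and return the rest to the
 * input. The tail is stored in reverse so that it is read back in its
 * original order. */
void model_yyless(struct scanner_model *s, int n)
{
    if (n < 0 || n > s->leng)
        return;
    for (int i = s->leng - 1; i >= n && s->pushed < PUSHBACK_MAX; i--)
        s->pushback[s->pushed++] = s->text[i];
    s->text[n] = '\0';
    s->leng = n;
}
```

With a match of =-a, calling model_yyless with leng - 1 leaves =- as the current string and makes a the next character to be scanned.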
Input/Output subroutines
- input()
- Returns the next input character
- output(c)
- Writes the character c on the output
- unput(c)
- Pushes the character c back onto the input stream to be read later by the input subroutine
- winput()
- Returns the next multibyte input character
- woutput(C)
- Writes the multibyte character C to the output stream
- wunput(C)
- Pushes the multibyte character C back onto the input stream to be read by the winput subroutine
The lex program provides these subroutines as macro definitions. The subroutines are coded in the lex.yy.c file. You can override them and provide other versions.
The winput, wunput, and woutput macros are defined to use the yywinput, yywunput, and yywoutput subroutines. For compatibility, the yy subroutines in turn use the input, unput, and output subroutines to read, replace, and write the necessary number of bytes in a complete multibyte character.
If you provide your own versions of these subroutines, follow these rules:
- All subroutines must use the same character set.
- The input subroutine must return a value of 0 to indicate end of file.
- Do not change the relationship of the unput subroutine to the input subroutine; otherwise, the look-ahead functions do not work.
The lex.yy.c file allows the lexical analyzer to back up a maximum of 200 characters.
To read a file containing nulls, create a different version of the input subroutine. In the normal version of the input subroutine, the returned value of 0 (from the null characters) indicates the end of file and ends the input.
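The rules for replacement subroutines can be illustrated with a minimal sketch that reads from a string in memory. The names my_input, my_unput, and set_source are hypothetical, and the sketch does not show how to install the functions in place of the lex macros. It assumes a push-back limit of 200 characters to mirror the limit described above, and it returns 0 at the end of input.

```c
#include <stddef.h>

#define PUSHBACK_MAX 200   /* mirrors the 200-character backup limit */

static const char *source = "";   /* hypothetical in-memory input */
static size_t source_pos = 0;
static char pushback[PUSHBACK_MAX];
static int pushed = 0;

/* Point the sketch at a new input string and discard pushed-back characters. */
void set_source(const char *text)
{
    source = text;
    source_pos = 0;
    pushed = 0;
}

/* Return the next character, taking pushed-back characters first.
 * A return value of 0 means end of input. */
int my_input(void)
{
    if (pushed > 0)
        return (unsigned char)pushback[--pushed];
    if (source[source_pos] == '\0')
        return 0;
    return (unsigned char)source[source_pos++];
}

/* Push c back so that the next my_input call returns it.
 * Returns 0 on success, or -1 if the push-back buffer is full. */
int my_unput(int c)
{
    if (pushed >= PUSHBACK_MAX)
        return -1;
    pushback[pushed++] = (char)c;
    return 0;
}
```

Because my_input always consults the push-back buffer before the source, a character handed to my_unput is the next one returned, which is the relationship that look-ahead depends on.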
Character set
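The lex command represents each character internally as a small integer, which is normally the value that the standard I/O subroutines use for that character. If you provide I/O subroutines that use a different character representation, put a translation table in the definitions section of the specification file. The table begins and ends with a line that contains only the following entry:
%T
Each line between those two entries gives an integer value and the characters associated with it, in the following form:
%T
{integer} {character string}
{integer} {character string}
{integer} {character string}
%T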
End-of-file processing
When the lexical analyzer reaches the end of a file, it calls the yywrap library subroutine, which returns a value of 1 to indicate to the lexical analyzer that it should continue with normal wrap-up at the end of input.
However, if the lexical analyzer receives input from more than one source, change the yywrap subroutine. The new function must get the new input and return a value of 0 to the lexical analyzer. A return value of 0 indicates that the program should continue processing.
In a new version of the yywrap subroutine, you can also include code that prints summary reports and tables when the lexical analyzer ends. The yywrap subroutine is the only way to force the yylex subroutine to recognize the end of input.
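A replacement yywrap subroutine for several input files might look like the following sketch. It assumes that the lexical analyzer reads from the yyin file pointer. The set_input_files helper and the pending_files list are hypothetical names introduced for illustration. The sketch defines yyin itself so that it compiles alone; in a real lexical analyzer, the lex.yy.c file already provides yyin, so remove that definition there.

```c
#include <stdio.h>

/* Defined here only so the sketch compiles alone; lex.yy.c normally provides it. */
FILE *yyin;

/* Hypothetical list of input files that have not been read yet. */
static const char **pending_files = NULL;
static int pending_count = 0;

/* Register the files to read after the current input is exhausted. */
void set_input_files(const char **files, int count)
{
    pending_files = files;
    pending_count = (files != NULL && count > 0) ? count : 0;
}

/* Called at end of input. Returns 0 after switching yyin to the next
 * file that can be opened, or 1 when no more input is available. */
int yywrap(void)
{
    while (pending_count > 0) {
        const char *name = *pending_files++;
        FILE *next;

        pending_count--;
        next = fopen(name, "r");
        if (next == NULL) {
            fprintf(stderr, "cannot open %s\n", name);
            continue;                 /* skip this file and try the next one */
        }
        if (yyin != NULL && yyin != stdin)
            fclose(yyin);
        yyin = next;
        return 0;                     /* tell the analyzer to continue processing */
    }
    return 1;                         /* normal wrap-up at the end of input */
}
```

Code that prints a summary report could be placed just before the final return statement, because that point is reached only when all input is exhausted.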