
Starting with Spirit V2 we added a module for generating code aimed at the lexical analysis of the input: Spirit.Lex (a lexer module, also called a scanner). Lexical analysis is the process of preprocessing the stream of input characters and separating it into strings called tokens, most of the time delimited by whitespace. Most compiler texts start here and devote several chapters to discussing various ways to build scanners. Spirit.Lex is a library built to take care of the complexities of creating a lexer for your grammar.
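
To make this more concrete, here is a minimal sketch (the names example_tokens, identifier, etc. are mine, not from the Spirit documentation) of how tokens are described in Spirit.Lex: each token is defined by a regular expression and registered with the lexer.

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <string>

    namespace lex = boost::spirit::lex;

    // Each token is described by a regular expression; the lexer splits the
    // character stream into identifier, integer, and whitespace tokens.
    template <typename Lexer>
    struct example_tokens : lex::lexer<Lexer>
    {
        example_tokens()
          : identifier("[a-zA-Z_][a-zA-Z0-9_]*")
          , integer("[0-9]+")
          , whitespace("[ \t\n]+")
        {
            // associate the token definitions with the lexer instance
            this->self = identifier | integer | whitespace;
        }

        lex::token_def<std::string> identifier;  // exposes the matched string
        lex::token_def<int> integer;             // exposes the matched value as an int
        lex::token_def<> whitespace;             // carries no attribute
    };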

We know the documentation of Spirit.Lex is not complete yet, so I will write more about it here from now on, both to fill in the missing pieces and to show a couple of tricks demonstrating how to use it best.

Lexical analysis is done in a separate module from the parser, feeding the parser with a stream of input tokens only. The following picture visualizes this process.

[Figure: the common flow control implemented when parsing is combined with lexical analysis]

Instead of directly getting the next character from the input, the parser asks the lexer for the next input token. The lexer in turn looks at the next input characters to decide which token pattern matches best. Only after matching the next token does the lexer return it to the parser.
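
As a rough sketch of this flow, the following hypothetical example wires a small lexer to a Qi grammar. The call to lex::tokenize_and_parse drives both sides: the parser requests tokens, and the lexer matches them from the character stream on demand. The token and rule names are mine.

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/mpl/vector.hpp>
    #include <iostream>
    #include <string>

    namespace lex = boost::spirit::lex;
    namespace qi  = boost::spirit::qi;

    // the lexer: identifiers, numbers, and whitespace
    template <typename Lexer>
    struct toy_tokens : lex::lexer<Lexer>
    {
        toy_tokens()
          : identifier("[a-zA-Z_]+"), number("[0-9]+"), ws("[ \t\n]+")
        {
            this->self = identifier | number | ws;
        }
        lex::token_def<std::string> identifier;
        lex::token_def<int> number;
        lex::token_def<> ws;
    };

    // the parser: written in terms of tokens, not characters
    template <typename Iterator>
    struct toy_grammar : qi::grammar<Iterator>
    {
        template <typename TokenDef>
        toy_grammar(TokenDef const& tok)
          : toy_grammar::base_type(start)
        {
            // an identifier, a whitespace token, and a number
            start = tok.identifier >> tok.ws >> tok.number;
        }
        qi::rule<Iterator> start;
    };

    int main()
    {
        // the token type lists the attribute types the tokens may carry
        typedef lex::lexertl::token<
            char const*, boost::mpl::vector<std::string, int> > token_type;
        typedef lex::lexertl::lexer<token_type> lexer_type;
        typedef toy_tokens<lexer_type>::iterator_type iterator_type;

        toy_tokens<lexer_type> lexer;
        toy_grammar<iterator_type> parser(lexer);

        std::string input("answer 42");
        char const* first = input.c_str();
        char const* last  = first + input.size();

        // tokenize_and_parse drives the parser, which pulls tokens from the lexer
        bool ok = lex::tokenize_and_parse(first, last, lexer, parser);
        std::cout << (ok ? "matched" : "did not match") << std::endl;
    }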

Theoretically, this separation is not necessary, as in the end the language is defined by exactly one set of syntactical rules, so we could write the whole parser in one module. In fact, Spirit.Qi allows you to write parsers without using a lexer, parsing the input character stream directly, and for the most part this is the way Spirit has been used since its invention.
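
For comparison, here is a minimal Qi-only sketch that parses a comma-separated list of numbers straight from the character stream; a skipper does the job that whitespace tokens would otherwise do.

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;
    namespace ascii = boost::spirit::ascii;

    int main()
    {
        std::string input("1, 2, 3, 42");
        std::string::const_iterator first = input.begin();
        std::string::const_iterator last  = input.end();

        // no lexer involved: the grammar works directly on characters,
        // and the skipper takes care of the whitespace
        std::vector<double> numbers;
        bool ok = qi::phrase_parse(first, last,
            qi::double_ % ',',      // a double, followed by any number of ", double"
            ascii::space,
            numbers);

        if (ok && first == last)
            std::cout << "parsed " << numbers.size() << " numbers" << std::endl;
    }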

So the question is: “When does it make sense to invest the additional time and effort to create a separate lexer for your grammar and when is it sufficient to exclusively utilize Spirit.Qi?” Unfortunately, there is no single answer to this question, and all I can try is to highlight some of the advantages and disadvantages of a separate lexer module. The concrete decision whether to utilize a lexer or to parse the grammar in one step has to be made on a case-by-case basis.

Spirit.Lex gives you the ability to create lexical analyzers based on patterns. These patterns are regular expressions used to define the different tokens to be recognized in the character input sequence. The lexer generates internal tables from all token definitions – the so-called deterministic finite automata (DFAs). It is well known that matching an input sequence using a DFA is as efficient as it can get. Spirit.Lex is fast! Measurements prove that lexers generated with Spirit.Lex are faster than comparable lexers generated with flex, the most widely used lexer generator today.

The input sequence is expected to be provided to the lexical analyzer as an arbitrary standard forward iterator. The lexical analyzer itself exposes a standard forward iterator as well. The difference here is that the exposed iterator provides access to the token sequence instead of to the character sequence. The tokens in this sequence are constructed on the fly by analyzing the underlying character sequence and matching this to the patterns as defined by the application.
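
A sketch of this iterator interface could look like the following (the names are mine and error handling is left out; with well-formed input the token iterator simply reaches the end iterator).

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <iostream>
    #include <string>

    namespace lex = boost::spirit::lex;

    template <typename Lexer>
    struct toy_tokens : lex::lexer<Lexer>
    {
        toy_tokens()
          : identifier("[a-zA-Z_]+"), number("[0-9]+"), ws("[ \t\n]+")
        {
            this->self = identifier | number | ws;
        }
        lex::token_def<> identifier, number, ws;
    };

    int main()
    {
        typedef lex::lexertl::token<char const*> token_type;
        typedef lex::lexertl::lexer<token_type> lexer_type;

        toy_tokens<lexer_type> lexer;

        std::string input("answer 42");
        char const* first = input.c_str();
        char const* last  = first + input.size();

        // begin()/end() expose the token stream as a forward iterator range;
        // each token is matched lazily as the iterator is advanced
        lexer_type::iterator_type it  = lexer.begin(first, last);
        lexer_type::iterator_type end = lexer.end();

        for (/**/; it != end; ++it)
            std::cout << "token id: " << (*it).id() << std::endl;
    }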

Here is a list of additional advantages Spirit.Lex provides you with:

  • The definition of tokens is done using regular expressions, where the token definitions can refer to special substitution strings (pattern macros) simplifying the token definitions.
  • The generated lexer may have multiple start states (we will talk about lexer states in a separate post).
  • It is possible to attach code to any of the token definitions; this code gets executed whenever the corresponding token pattern has been matched (see the sketch after this list).
  • The iterator exposed by the lexer buffers the last emitted tokens. This significantly speeds up parsing of grammars which require backtracking.
  • The tokens created at runtime can carry arbitrary token specific data items which are available from the parser as attributes. The input character sequence is converted to the token data items exactly once, regardless of how often the value is accessed.
  • Token based parsing simplifies the Qi grammar necessary to describe the matching rules. This lessens the runtime overhead introduced by the parser itself.
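
To illustrate two of the items above, pattern macros and code attached to token definitions, here is a hedged sketch. The names counting_tokens, words, and lines are mine, and semantic actions in the lexer require one of the actor-enabled lexer types, here lex::lexertl::actor_lexer.

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/spirit/include/phoenix_core.hpp>
    #include <boost/spirit/include/phoenix_operator.hpp>
    #include <iostream>
    #include <string>

    namespace lex = boost::spirit::lex;

    // Counts words and lines purely inside the lexer: the code attached to a
    // token definition runs every time that pattern matches.
    template <typename Lexer>
    struct counting_tokens : lex::lexer<Lexer>
    {
        counting_tokens()
          : words(0), lines(0)
        {
            using boost::phoenix::ref;

            // a pattern macro, referenced below as {WORD}
            this->self.add_pattern("WORD", "[^ \t\n]+");

            word = "{WORD}";
            eol  = "\n";
            any  = ".";

            this->self
                =   word [++ref(words)]
                |   eol  [++ref(lines)]
                |   any
                ;
        }

        std::size_t words, lines;
        lex::token_def<> word, eol, any;
    };

    int main()
    {
        // semantic actions in the lexer need the actor_lexer
        typedef lex::lexertl::token<char const*> token_type;
        typedef lex::lexertl::actor_lexer<token_type> lexer_type;

        counting_tokens<lexer_type> lexer;

        std::string input("counting words\nand lines\n");
        char const* first = input.c_str();
        char const* last  = first + input.size();

        // walk the token stream; the attached actions fire as tokens are matched
        lexer_type::iterator_type it  = lexer.begin(first, last);
        lexer_type::iterator_type end = lexer.end();
        while (it != end)
            ++it;

        std::cout << "words: " << lexer.words
                  << ", lines: " << lexer.lines << std::endl;
    }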

But at the same time this feature list hints at the disadvantages of a separate lexer module:

  • The definition of tokens is done using regular expressions. To some, regular expressions are more difficult to understand compared to EBNF or PEG rules.
  • Additional effort is required to develop and debug the lexer and its token descriptions.
  • The lexer definition may add measurable compile time overhead, depending on what features are employed.
  • Additional runtime overhead is required in order to generate the lexer tables and construct the tokens.

The last bullet point touches on the most interesting characteristic: the speed of parsing. As already mentioned, generally it is not possible to predict whether the overhead introduced by the lexer is amortized by savings on grammar-specific things like required backtracking, repeated attribute conversion, required symbol lookup, etc. But experience shows that lexers tend to pay off for moderately sized or large grammars, such as full language implementations or data format descriptions with a lot of dynamic structural alternatives. When in doubt, you will have to do measurements based on your concrete grammar and input data to make a sound decision. You know the drill…
