Spirit has supported skipper-based parsing since its very invention, so this is definitely not something new to Spirit V2. Nevertheless, the recent discussion on the Spirit mailing list around the semantics of Qi’s lexeme directive shows the need for some clarification. Today I will try to answer questions like: “What does it mean to use a skipper while parsing?”, or “When do I want to use a skipper and when not?”.
While parsing a formatted data stream it is very often desirable to ignore some parts of the input. A common example is the need to skip whitespace and comments while parsing a computer language. It is certainly possible to account explicitly for the tokens to skip (such as the whitespace or the comments) while writing the grammar. But this gets very tedious, because those tokens may appear at any point in the input.
For the sake of simplicity, let us assume we want to parse a simple key/value expression: key=value, where we want to allow for any number of space characters before, in between, or after the key or the value. A naive grammar matching the plain key/value pair without whitespace skipping would look like (see Parsing a List of Key-Value Pairs Using Spirit.Qi for more details):
pair = key >> '=' >> value;
key = qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
value = +qi::char_("a-zA-Z_0-9");
If we want to explicitly accommodate the rule pair to match any interspersed space characters we get:
pair = *space >> key >> *space >> '=' >> *space >> value >> *space;
which, while it produces the desired result, is not only error prone, but also difficult to write, to understand, and to maintain. If we look closer, we see that the process of skipping the whitespace tokens is easily automated. It seems to be sufficient to insert a repeated invocation of the space parser (or, more generally, any skip parser) in between the elements of the user defined parser expression sequences.
In fact, that is exactly what Spirit can do for you! The library invokes any supplied skip parser upon entry to the parse member function of any parser conforming to the PrimitiveParser concept. The skip parser has to be supplied by calling a special API function: phrase_parse:
#include <boost/spirit/include/qi.hpp>
#include <string>
namespace qi = boost::spirit::qi;
typedef std::string::const_iterator iterator;
// key and value are declared without a skipper type
qi::rule<iterator> key = qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
qi::rule<iterator> value = +qi::char_("a-zA-Z_0-9");
// pair is declared with the type of the skipper it is used with
qi::rule<iterator, qi::space_type> pair = key >> '=' >> value;
std::string input(" key = value ");
iterator begin = input.begin();
iterator end = input.end();
qi::phrase_parse(begin, end, pair, qi::space);
This code snippet illustrates several important things:
- The function qi::phrase_parse is equivalent to the API function qi::parse except for its additional parameter, the skip parser. Our example utilizes qi::space, but it is possible to use any other, even more complex parser expression as the skipper instead.
- All rules which we want to perform the skip parsing need to be declared with the type of the skip parser they are going to be used with. Our example specifies the type of the qi::space parser expression, which is qi::space_type. For more complex parser expressions you might want to use a (mini) grammar or take advantage of BOOST_TYPEOF to let the compiler deduce the actual type.
- All rules which should not perform skip parsing have to be declared without an additional skip parser type. These rules behave like an implicit lexeme directive (for more information about lexeme, see below): they inhibit the invocation of the skip parser even if they are executed as part of a rule with an associated skipper.
In the example above we suppressed skipping while matching either the key or the value; otherwise our grammar would match any additional space character inside the key or value as well. Remember, the expression char_ conforms to the PrimitiveParser concept, so it executes the skip parser on each of its invocations. With skipping enabled, the skip parser would therefore be executed in between any two of the matched characters.
Sometimes it is necessary to turn off skipping for a smaller part of the grammar only. For this purpose Spirit implements the lexeme directive. This directive inhibits skipping during the execution of the embedded parser. For instance, parsing a quoted string of alphanumeric characters would look like this:
string = lexeme['"' >> *alnum >> '"'];
Here the lexeme directive disables skipping while matching the string, which avoids ‘losing’ characters otherwise matched by the skipper. Please note: lexeme performs a pre-skip step, even though it is not a PrimitiveParser itself (it is essentially considered to be a logical primitive by design). If this is undesired, you can utilize the no_skip directive instead:
string = '"' >> no_skip[*alnum] >> '"';
This parser will match all the characters in between the quotes, even if the string starts with a character sequence matched by the applied skip parser. The no_skip directive is semantically equivalent to lexeme except it does not perform a pre-skip before executing the embedded parser. Note: the no_skip directive has been added only recently. It will be available starting with the next release (Boost V1.43).
This short article would not be complete without mentioning the skip directive. This directive is the counterpart to lexeme. It enables skipping for the embedded parser. Without any argument it can be used inside a lexeme or no_skip directive only. In this case it just re-enables the outer skipper:
string = lexeme['"' >> *(alpha | skip[digit]) >> '"'];
This (purely hypothetical) parser would enable skipping inside a string as long as it matches digits. But the skip directive can do more. It may take an additional argument that specifies a new skipper, for instance:
string = skip(space)[*alnum];
which will skip spaces while executing the embedded *alnum parser. This form of the directive serves two purposes. It can be used either to change the current skip parser or to establish skipping inside a context that otherwise does no skipping at all (even if invoked with the qi::parse() API function).
For more detailed information about all the mentioned directives please see the corresponding documentation.