Feb 24, 2010

Spirit has supported skipper-based parsing since its very inception, so this is definitely not something new to Spirit V2. Nevertheless, the recent discussion on the Spirit mailing list about the semantics of Qi’s lexeme[] directive shows the need for some clarification. Today I will try to answer questions like “What does it mean to use a skipper while parsing?” or “When do I want to use a skipper and when not?”.

While parsing a formatted data stream it is very often desirable to ignore some parts of the input. A common example is the need to skip whitespace and comments while parsing a computer language. It is certainly possible to account explicitly for the tokens to skip (such as whitespace or comments) while writing the grammar, but this gets tedious very quickly because those tokens may appear at any point in the input.

For the sake of simplicity, let us assume we want to parse a simple key/value expression: key=value, where we want to allow for any number of space characters before, in between, or after the key or the value. A naive grammar matching the plain key/value pair without whitespace skipping would look like (see Parsing a List of Key-Value Pairs Using Spirit.Qi for more details):

pair  =  key >> '=' >> value;
key   =  qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
value = +qi::char_("a-zA-Z_0-9");

If we want to explicitly accommodate the rule pair to match any interspersed space characters we get:

pair  = *space >> key >> *space >> '=' >> *space >> value >> *space;

which, while it produces the desired result, is not only error-prone but also difficult to write, understand, and maintain. If we look closer, we see that the process of skipping the whitespace tokens is easily automated. It seems sufficient to insert a repeated invocation of the space parser (or, more generally, any skip parser) in between the elements of the user-defined parser expression sequences.

In fact, that is exactly what Spirit can do for you! The library invokes any supplied skip parser upon entry to the parse member function of any parser conforming to the PrimitiveParser concept. The skip parser has to be supplied by calling a special API function: phrase_parse:

namespace qi = boost::spirit::qi;
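typedef std::string::const_iterator iterator_type;

// key and value carry no skipper type: no skipping happens inside them
qi::rule<iterator_type> key = qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
qi::rule<iterator_type> value = +qi::char_("a-zA-Z_0-9");
// pair is declared with the skipper type: spaces are skipped between its elements
qi::rule<iterator_type, qi::space_type> pair = key >> '=' >> value;

std::string input(" key = value ");
iterator_type begin = input.begin();
iterator_type end = input.end();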
qi::phrase_parse(begin, end, pair, qi::space);

This code snippet illustrates several important things:

  • The function qi::phrase_parse is equivalent to the API function qi::parse except for its additional parameter, the skip parser. Our example utilizes qi::space, but it is possible to use any other, even more complex parser expression as the skipper instead.
  • All rules which we want to perform the skip parsing need to be declared with the type of the skip parser they are going to be used with. Our example specifies the type of the qi::space parser expression, which is qi::space_type. For more complex parser expressions you might want to use a (mini) grammar or take advantage of BOOST_TYPEOF to let the compiler deduce the actual type.
  • All rules which should not perform skip parsing have to be declared without an additional skip parser type. These rules behave like an implicit lexeme[] directive (for more information about lexeme[], see below): they inhibit the invocation of the skip parser even if they are executed as part of a rule with an associated skipper.

In the key/value example above we suppressed skipping while matching either the key or the value; otherwise our grammar would match any additional space character inside the key or value as well. Remember that the expression char_ conforms to the PrimitiveParser concept, so with a skipper in effect it would execute the skip parser on each of its invocations. In this case the skip parser would be executed in between any two of the matched characters.

Sometimes it is necessary to turn off skipping for a smaller part of the grammar only. For this purpose Spirit implements the lexeme[] directive. This directive inhibits skipping during the execution of the embedded parser. For instance, parsing a quoted string of alphanumeric characters would look like this:

string = lexeme['"' >> *alnum >> '"'];

Here the lexeme directive disables skipping while matching the string, which avoids ‘losing’ characters otherwise matched by the skipper. Please note: lexeme[] performs a pre-skip step, even though it is not a PrimitiveParser itself (it is essentially considered to be a logical primitive by design). If this is undesired, you can use the no_skip[] directive instead:

string = '"' >> no_skip[*alnum] >> '"';

This parser will match all the characters in between the quotes, even if the string starts with a character sequence matched by the applied skip parser. The no_skip[] directive is semantically equivalent to lexeme[] except it does not perform a pre-skip before executing the embedded parser. Note: the no_skip[] directive has been added only recently. It will be available starting with the next release (Boost V1.43).

This short article would not be complete without mentioning the skip[] directive. This directive is the counterpart to lexeme[]. It enables skipping for the embedded parser. Without any argument it can be used inside a lexeme or no_skip directive only. In this case it just re-enables the outer skipper:

string = lexeme['"' >> *(alpha | skip[digit]) >> '"'];

This (purely hypothetical) parser would enable skipping inside a string as long as it matches digits. But the skip directive can do more. It may take an additional argument that specifies a new skipper, for instance:

skip(qi::space)[*alnum]

which will skip spaces while executing the embedded *alnum parser. This form of the directive can be applied for two purposes: either to change the current skip parser, or to establish skipping inside a context that otherwise does no skipping at all (even if invoked with the qi::parse() API function).

For more detailed information about all the mentioned directives please see the corresponding documentation.

29 Responses to “Parsing Skippers and Skipping Parsers”

  1. Henry says:

    Sometimes what we want is a no_skip[..] between rules. For example, sometimes I want a grammar ‘=’ where there is no space between ParameterName/Value and the equal sign, for example “foo=bar” instead of “foo =bar” or “foo= bar”. I don’t think lexeme[..] works for a grammar like the following:

    rule NameValuePair = lexeme[ ParameterName >> char_('=') >> ParameterValue ];
    rule ParameterName = qi::string("name");
    rule ParameterValue = qi::string("value");
    

    Will no_skip[ ... ] work for the above grammar?

    • Hartmut Kaiser says:

      Henry,

      in short, the answer is yes. That’s exactly what lexeme[] and no_skip[] are designed for.

      Regards Hartmut

      • Henry says:

        Hi Hartmut:

        If you have rules inside the lexeme[...], it does not compile for me. For me lexeme[...] only works if you put a primitive parser inside, like qi::alnum, qi::print, etc.

        For example, the following grammar won’t compile, unless I remove lexeme [...] from the Root rule.

        struct MyGrammar : grammar
        {
            MyGrammar()
                : base_type(Root)
            {
                Root             = lexeme
                                       [
                                           ParameterName
                                           >> char_(":")
                                           >> ParameterValue
                                       ];
        
                ParameterName  = lexeme[ +alnum ];
        
                ParameterValue  = lexeme[ +alnum ];
            }
        
            rule Root;
            rule ParameterName;
            rule ParameterValue;
        };
        
        • Hartmut Kaiser says:

          Henry,

          the template parameters for grammar and rule got lost in your post above, but I assume you’ve been specifying a skipper even for the rules included inside the lexeme[]. But rules used inside lexeme[] or no_skip[] should not have a skipper, otherwise they will not compile.

          Regards Hartmut

          • Henry says:

            Hi Hartmut, yes the template argument for my rules got lost in my post. But thanks for your great intuition that I was specifying skipper in my rule and indeed that is the problem!

  2. Henry says:

    You mentioned the possibility of using a more complex skip parser than just the default ascii::space_type in the rule. I have been wanting to do that. Could you give us a little example of how to use a custom skipper? Let’s say I want my skipper to match any of the following characters: “:=()[]\"”.

  3. Henry says:

    Hartmut,

    I define a SkipperGrammar g.

    How do I use g with skip(g) [...]

    I got a compilation error when I pass g to skip(..)[].

    \boost\spirit\home\qi\nonterminal\grammar.hpp(107) : error C2248: ‘boost::noncopyable_::noncopyable::noncopyable’ : cannot access private member declared in class ‘boost::noncopyable_::noncopyable’

    • Hartmut Kaiser says:

      Henry,
      yes, grammars are non-copyable, you can’t store them directly. Either use a rule and call it as: skip(r.alias())[], or wrap your grammar into a phoenix::ref(): skip(ref(g))[].

      HTH
      Regards Hartmut

  4. Gustavo says:

    Hartmut,
    Do you know why skip parsers don’t work if qi is combined with lex?
    I’m trying everything I can to make it skip a certain token defined by a lexer. I’m an experienced spirit user, but recently I’ve upgraded to spirit v2.1 (which is amazing, btw). Using qi alone, everything works perfectly, but when the input is tokenized first, the parser just ignores the skipper.

    • Hartmut Kaiser says:

      Gustavo,

      do you mind sending a small example to the mailing list (or just to me) helping me to reproduce your problem?

      Regards Hartmut

      • Gustavo says:

        I think I’ve realized what went wrong. I’m posting it right now at spirit-general mailing list. Hooray to mr. René Descartes (breaking the big problem into several small problems). Anyway, breaking it up to send you the sample led me to experiment some variations and figuring it out.

  5. joel says:

    2nd post, this time properly formatted:

    Hi Hartmut,

    I’ve defined a space parser in the rule header_entity (see below) and no space parser (lexeme implicit) on keyword (header_entity = keyword) but if I have a space *in front* of keyword it will not parse. Any thoughts??

    Thanks,
    joel

    struct tagExpressParser : qi::grammar<const_iterator, space_type>
    {
       ...
       rule<const_iterator> digit;
       rule<const_iterator> lower;
       rule<const_iterator> upper;
       rule<const_iterator, space_type> standard_keyword;
       ...
       rule<const_iterator, space_type> exchange_file;
       rule<const_iterator, space_type> header_section;
       rule<const_iterator, space_type> header_entity;
       rule<const_iterator, space_type> header_entity_list;
       ...
       tagExpressParser() : tagExpressParser::base_type(exchange_file)
       {
          digit = char_("0-9");
          lower = char_("a-z");
          upper = char_("A-Z_");
          standard_keyword = lexeme [upper >> *(upper | digit)];
          exchange_file = lit("ISO-10303-21;") > header_section >
          data_section > *data_section > lit("END-ISO-10303-21;");
          header_section = lit("HEADER;") > header_entity > header_entity >
          header_entity > -header_entity_list > lit("ENDSEC;");
          header_entity = standard_keyword > '(' > -parameter_list > ')' > ';';
          ...
       }
    };
    int main ()
    {
       ...
       phrase_parse (begin, str.end(), expressParser, ascii::space);
    }
    
    • Hartmut Kaiser says:

      Joel,

      rules which are implicit lexemes execute an implicit pre-skip to be consistent with the behavior of an explicit lexeme (which pre-skips as well).

      HTH
      Regards Hartmut

  6. David Rajaratnam says:

    Hi Hartmut,

    I’m very new to Spirit and wondering if you can clarify some points about the example in the article. I ran a few tests on the key-value-pair rules using “key=value” and adding a space at various points. Hopefully I didn’t make some stupid error (always possible) but here are the results:

    1) “key=value” – parsed
    2) “ key=value” – failed to parse
    3) “key=value ” – parsed
    4) “key =value” – parsed
    5) “key= value” – failed to parse

    The article mentions that “the library invokes any supplied skip parser upon entry to the parse member function of any parser conforming to the PrimitiveParser concept”.

    In light of this I understand that 1) and 4) parsed successfully. Also 5) failed to parse since the value rule wasn’t declared with a skip parser so there was no skip parser to suck up the space after the ‘=’.

    This leaves 2) and 3) that I don’t understand. In the case of 2) shouldn’t the entry to the pair rule remove the space at the start? Finally, with 3) what is sucking up the space at the end of the string?

    Sorry to ask such pedantic questions but I’m trying to get a better understanding about how these things work. To provide some context I’m trying to do something similar to your example above and the skip parser didn’t do what I expected (i.e., pseudo-magically remove all the whitespace that I wanted it to remove :) ). To get it to work I ended up declaring my rule with a skip parser and then wrapping the thing with a lexeme, something like:

    qi::rule<iterator, qi::space_type> value = qi::lexeme[+qi::char_("a-zA-Z_0-9")];
    

    Regards,
    Dave

  7. anders li says:

    hey, may I ask a question?
    in the following cpp code:

    int a1;
    uint err = CFs.Connect(); // line 2
    int a2;
    User::Leave(err); // line 4
    int a3;
    int a4;
    

    I only want to focus on line2 and line4, for other sentences, I want to ignore them.
    I write such a group of rules:

    term = lexeme[ *alnum % ' '] >> '=' >> lit("CFs.Connect()") >> ';' ;

    other = *(char_ - "Connect" - "Leave" - ';') >> ';' ;

    leave = lit("User::Leave") >> '(' >> qi::lit("err") >> char_(')') >> ';' ;
    
    start = *other
    >> term
    >> *other
    >> leave
    >> *other
    ;
    

    Please see the start rule; it is very much like the one mentioned in the article:

    *space >> something >> *space >> something >> *space
    

    It looks like this because I cannot come up with a good model for the other rule.
    The other rule does not seem very beautiful:

    other = *(char_ - "Connect" - "Leave" - ';') >> ';' ;
    

    I want to ask how could I make the start rule more clear ?

  8. tb says:

    In the above example the following change to the first pre-space -handling code does not compile even if qi::space is used instead of space:

    pair  = *space >> key >> *space >> '=' *space >> value >> *space;
    
  9. Felipe says:

    “The library invokes any supplied skip parser upon entry to the parse member function of any parser conforming to the PrimitiveParser concept.”

    Shouldn’t that say that it also skips after a successful match of the parser?

    • Hartmut Kaiser says:

      Felipe,

      the PrimitiveParser concept requires any conforming parser to skip only before doing the actual work. This is sufficient, as the next parser in any sequence does the same. The top-level API functions allow you to specify whether or not to do a last post-skip, though. But the default is not to do any post-skip, IIRC.

      Regards Hartmut

  10. Felipe says:

    Thanks, Hartmut, but I believe I still need some clarification. This is the code I’m testing:

    test_phrase_parser_attr(
        "123\n\n\n\n",
        int_ >> no_skip[eol | eoi],
        output
    );
    

    test_phrase_parser_attr is the same one found in the Spirit distribution.

    It uses ascii::space as its skip parser, but I thought that because of no_skip it would match the int and wouldn’t produce a full match, consuming only one newline char. Instead a full match happens, unless I explicitly use

    skip_flag::dont_postskip

    inside test_phrase_parser_attr.

    I’ve got the same behaviour both on 1.44 and trunk, I’m a bit confused.

    • Hartmut Kaiser says:

      The no_skip[] directive inhibits skipping for the parser it embeds, in this case the eol | eoi. Indeed, in your case this will match exactly one newline character. It seems I was wrong in my first reply: by default the API functions always execute a post-skip step. Therefore the phrase_parse() API function will invoke an additional post-skip step after the no_skip[] directive returns, which eats all remaining newlines in your input (space matches all whitespace, including newlines). See the docs for the corresponding description. This is consistent with your observation that an explicit dont_postskip exposes the expected behavior.

      Regards Hartmut

  11. Tomas says:

    hi, can I have a question?

    I would like to write a parser, that parses some list of strings.

    valuefoo and valuebar are some string parsers.

    list = *space >> qi::lit("foo") >> +space >> valuefoo >> +space >> qi::lit("bar") >> +space >> valuebar >> *space;
    

    So as you can see, I would like to have at least one space between tokens. Is it possible to do this using some skipper, or do I have to write the space explicitly in the code?

    Thanks.
    Tomas

    • Rob Stewart says:

      Just use phrase_parse() with space as your skip parser.

      • Tomas says:

        Thanks. but I already have ascii::space used in my phrase_parse.

        My grammar looks like this:

        template
        class AttributeParser: public qi::grammar

        and phrase_parse:

        parsingSucceeded = phrase_parse(iter, end, attributeParser, ascii::space);

        The problem is, that my parser ignores all spaces. I would like to achieve something like this:

        qi::lit("foo") >> qi::lit("bar")

        and I want the parser to accept this: foo bar
        but fail on this: foobar

        Thanks.

        • Rob Stewart says:

          Your grammar also needs a skipper.

          • Tomas says:

            Thanks, but I still cannot figure out what kind of skipper. When I use ascii::space as the skipper, my grammar accepts “foobar”, which I do not want. I want my grammar to accept “foo bar” with one or more spaces between the two words.

        • Rob Stewart says:

          Please post a complete program illustrating your problem as described on the Support page. Be sure to include example input, within the program rather than as a separate file so it is fully self-contained.

          We’ve reached the comment nesting level and comments are not the place for support.

  12. Haitham Gad says:

    Question: Does specifying a skipper in a qi::grammar template parameter list imply that this skipper is going to be used by all the grammar rules (even if they don’t specify it as a skipper in their template parameter list)?
