Jun 23

Best Practices

By Joel de Guzman

This page is a compilation of best practices using Spirit.

  1. Separate grammar construction from parsing. I am not entirely sure this merits an entry, since it is pretty much C++ 101 and not directly related to Spirit. Since it is short, let’s have it as our first entry anyway. Examples speak volumes, and Spirit has lots of them. For brevity, parsing immediately follows construction of the grammar in the examples. Example (example/qi/roman.cpp):
    roman roman_parser; // Our grammar
    /*...*/
    bool r = parse(iter, end, roman_parser, result);
    

    In real-world usage, this is not efficient. Grammars are meant to be constructed once and used many times. It is always a good idea to separate construction from parsing.

    There are exceptions, for sure. Daniel James noted (see comments below) that for non-context-free grammars that require a reference to some state, the easiest way is to construct a new grammar each time.

  2. Avoid complex rules. Rules with complex definitions hurt the compiler badly. We’ve seen rules that are more than a hundred lines long and take a couple of minutes to compile. On some compilers, experience shows that compile time grows exponentially with the length of the RHS expression. C++ compilers were not designed to handle such big expressions, and some simply cannot cope and crash. It is always best to break complex rules into more manageable, easier-to-digest parts. Doing so also makes the rules more readable.
  3. Avoid complex grammars. Try as much as possible to modularize big grammars into smaller sub-grammars. Spirit grammars are composable. Try to identify the grammar parts, especially those that can be reused, and separate them into their own sub-grammars. Reusable grammars are a real advantage. For example, how often have you written a rule for identifiers?
  4. Take things one step at a time.  Don’t try to write a grammar that covers all the complexity of your input.  Start with the simplest piece of the input and write a parser for that.  Gradually add more rules to your grammar as you cover more complexity in the input.

    You can either develop the whole parsing bit first and then work on the semantic actions associated with each rule, or you can ping-pong back and forth between parsing and actions.  I don't know which is better yet.  I've been doing a little bit of both.

17 Responses to “Best Practices”

  1. Daniel says:

    It’s worth mentioning – quickbook is constantly constructing new grammars. One problem is that when the grammar isn’t context free, it needs a reference to some state. The easiest way to do that is to construct a new grammar each time.

    I think it’s also worth mentioning splitting complicated grammars into several files; that seems to come up a lot.

    • Daniel says:

      On second thoughts, I don’t think grammars with state should be mentioned since that’s a more advanced topic. Best practices should be established from the start, knowing when and how to break them can come later.

      I haven’t actually used it yet, but error handling is probably worth a mention. Perhaps also some simple ways to keep the grammar efficient (maybe avoiding excessive backtracking?), although that might be best left for another article.

    • Josh says:

      I agree that a lot of people ask for help on how to reduce compile times by splitting their grammars across several translation units. After doing some reading in the mailing list archives and piecing together a few hints in the documentation, I was able to pull it off. I feel that this is requested often enough that it could be explained here.

      I could contribute some example code if that would help.

      It may be a little on the advanced side, but it would encourage people to create smaller, simpler grammars.

  2. Olaf Peter says:

    Spirit’s mini_c example shows how to separate!

    Well, grammars are meant to be constructed once and used many times. For me (and others) it would be useful to have an example of this best practice in practice, maybe a simple threaded parse function with a ‘global’ grammar. Where should the grammar live, and how should it be accessed? Anonymous namespace, static global grammar, foo_init(), singleton, etc.?

    Thanks,
    Olaf

  3. Olaf Peter says:

    BTW, even in Spirit’s scheme example the grammar isn’t reused, is it?

    • Olaf Peter says:

      Sorry, after thinking about it, even more ideas and questions come up. All the examples I’ve seen (scheme, for example) bind the error handler to the grammar, and the grammar takes the filename so the handler can report errors. What are the recommendations on this? It may prevent simple reuse (or I have to rebind the phoenix::function again, or similar). Maybe the scheme and mini_c examples should follow this best practice too, since they serve as inspiration for non-professionals like me.

      Regards,
      Olaf

      • Joel de Guzman says:

        Hmm, I’m not sure how binding the error handler to the grammar prevents reuse? What am I missing?

  4. anders li says:

    Hey guys,
    I recently started using Spirit; my platform is MinGW.
    Because the grammar includes lots of rules, compilation takes a long time:
    1. Debug target: it cannot compile. The errors are:

    ————– Build: Debug in jack —————

    Compiling: main.cpp
    cc1plus.exe: out of memory allocating 38714 bytes
    Process terminated with status 1 (9 minutes, 33 seconds)
    0 errors, 0 warnings

    2. Release target: it compiles, but takes a long time (>5 minutes).

    I know there is a lot of compile-time computation involved, and it consumes a lot of RAM. In my case the compile needs more than 2 GB of RAM, but my computer has exactly 2 GB. Once the RAM is used up, both the compile and the whole system slow down considerably.

    Any suggestions on how to write rules and grammars more efficiently?
    C++ templates are good, but they have a downside as well.

    anders

    • Joel de Guzman says:

      Well, as it says in #2, avoid very complex rules. It is best to keep rules as simple as possible. Do you have rules spanning multiple lines? Break them up into smaller rules. As for #3, I’ll be providing a simple example on modular grammar construction (Josh gave me a small example, but I am not sure yet how to best proceed) or you might want to look at the mini_c example.

  5. anders li says:

    Thanks, Joel.
    After using Spirit for about two months, I find that char_ is sometimes an invisible killer. Say someone wants to ignore something while parsing; they may write something like this:

    expr = *( char_ - "!=" ) ;
    if_statement = blablabla ; 
    

    Then, when parsing a fragment like this:

    if(iFs->Open() != KErrNone )
    { 
       User::Leave(whatever);
    }
    else   if( from here goes wrong... )
    { 
        intL();
    } 
    else
    {
        if( KErrNone != xyz->Open())
       { 
           hereL();
       }
    }
    

    This cannot be parsed successfully. Why? Because expr eats the whole of:

     if( from here goes wrong... )
       { 
           intL();
       } 
    else
    {
        if( KErrNone         
    

    up to the point where it sees the != sign. It is terrible! Once this happens, the unlucky programmer will spend a lot of time finding the root cause. I am one of those poor souls.
    One solution is, for example:

    expr = *( char_ - "!=" - ';' ) ;  
    

    At the very least, do not let it run across C++ statements.

    So I suggest the following might count as best practices:

    1. Parse as precisely as you can. That is, if a certain identifier can be recognized as

    alpha   >> *alnum
    

    then it’s a bad idea to use something like this:

    *(char_ - "_") 
    

    2. If a rule starts like this:

    ruleA = *(char_ - "}") >> ruleB >> ruleC;
    

    then it is a good idea to take a second look at it. This is definitely a bug candidate.

  6. Leo Goodstadt says:

    Dear Joel,
    You say “Avoid complex rules”. But what is the performance penalty for breaking up complex grammars and rules into smaller composable pieces?
    I have been using smaller “sub-rules”, each with its own name, for better error handling, but I am unsure how much of a performance penalty I am paying.
    Leo

  7. Xavi Gratal says:

    Well, that’s not my experience.

    Whenever you break a rule, you are creating a qi::rule object, which calls the enclosed parser through a virtual function, which is slower. For a parser that is called for every word in a text, this can make the parsing 2x slower. For a parser called for every character it would be even worse.

    Of course you can break the rule using auto (or BOOST_AUTO), but this just clarifies the code, it doesn’t really break the rule from the compiler’s perspective.

    Am I missing something?

  8. anders li says:

    hi,
    There is a discussion on the Spirit user mailing list about what I think is a problem when you want your Spirit grammar to be modular.
    Could you please have a look at it? I think this problem matters more to the wider use of Spirit than many of the details.
    http://news.gmane.org/gmane.comp.parsers.spirit.general
    The subject is: “questions on modular grammars”

    anders

  9. anders li says:

    The Scalpel project models the standard C++ grammar, but it writes all the rules together in one grammar (struct), so there is no modularity in Scalpel. I think modularity should mean separating the grammars into different files and different grammar structs.
    But that is where the “object recursively constructed” problems lie.
    So, can the Spirit developers give some ideas on how modularity can be achieved? (Please have a look at the discussions.)
