Nov 17

Nabialek trick

By Joel de Guzman Add comments

This sample is a port of the original “Nabialek trick” from classic Spirit. See the cpp file here. For now, it lives in SVN. Later, this will become part of the distribution.

This technique, I’ll call the “Nabialek trick” (from the name of its inventor, Sam Nabialek), can improve the rule dispatch from linear non-deterministic to deterministic. The trick applies to grammars where a keyword (operator, etc), precedes a production. There are lots of grammars similar to this:

r =
        keyword1 >> production1
    |   keyword2 >> production2
    |   keyword3 >> production3
    |   keyword4 >> production4
    |   keyword5 >> production5
    /*** etc ***/
    ;
;

The cascaded alternatives are tried one at a time through trial and error until something matches. The Nabialek trick takes advantage of the symbol table’s search properties to optimize the dispatching of the alternatives. For an example, consider this grammar:

one = id;
two = id >> ',' >> id;

start = *("one" >> one | "two" >> two);

There are two rules (one and two). When “one” is recognized, rule ‘one’ is invoked. When “two” is recognized, rule ‘two’ is invoked.

Here’s the corresponding grammar using the Nabialek trick (see nabialek.cpp):

one = id;
two = id >> ',' >> id;

keyword.add
    ("one", one)
    ("two", two)
    ;

start = *(keyword[_a = _1] >> lazy(_a));

The grammar works as follows. keyword is a symbol table with two slots containing rules:

qi::symbols<char, qi::rule<Iterator, ascii::space_type> > keyword;

This is initialized to contain the rules “one” and “two”.

one, two, id, start are rules which are declared as follows:

qi::rule<Iterator, ascii::space_type> id, one, two;
qi::rule<Iterator, ascii::space_type,
    qi::locals<qi::rule<Iterator,
    ascii::space_type> > > start;

The start rule has a local variable —a rule. Spirit2′s rules are full fledged C++ objects and can be copied and stored just like any other object. In this example, when we enter the start rule, we create a temporary local variable. When a valid keyword is matched, notice that the semantic action stores the rule we initially stored in the symbol table into the local variable with this code:

keyword[_a = _1]

_a refers to the local variable and _1 refers to the attribute of keyword —the rule contained in the symbol slots.

Now that the local variable (referred to by _a) contains the rule we want to continue parsing with, the next thing to do is call it. This is done by the lazy pseudo parser:

>> lazy(_a)

The dynamic grammar parses inputs such as:

"one only"
"one again"
"two first, second"

The cool part is that the rule we want to parse with is dynamically chosen at runtime depending on what the symbol table contains. If it got a “one” then we use the ‘one’ rule. If it got “two”, then we use the ‘two’ rule. Very nifty! This technique should be very fast, especially when there are lots of keywords. It would be nice to add special facilities to make this easy to use. I imagine:

r = keywords >> rest;

where keywords is a special parser (based on the symbol table) that automatically sets its RHS (rest) depending on the acquired symbol. This, I think, is mighty cool! Someday perhaps…

Now, that last line was written in 2003. Will someone come forward and finally do it? :-)

GD Star Rating
loading...
Nabialek trick, 4.0 out of 5 based on 2 ratings
  • Reddit
  • Facebook
  • del.icio.us
  • Digg
  • Twitter
  • Print

8 Responses to “Nabialek trick”

  1. Andy Stevenson says:

    It is a very cool technique.
    I have a grammar that requires keywords to use the distinct[] directive out of the repository. Is there a way to adapt the symbols construct in the technique to match the keyword using the distinct directive?

    GD Star Rating
    loading...
    • Joel de Guzman says:

      Sure. Just wrap the keywords in the distinct directive. In my case, I always prefer to spell it out explicitly. For example, if I do not want my keywords (and all identifiers for that matter) to be followed by alpha or underscore, I write is this way:

      lexeme[keywords >> !(alnum | '_')]  // make sure we have whole words
      

      The predicate makes sure we don’t have an alpha or underscore after a keyword.

      GD Star Rating
      loading...
  2. Andy Stevenson says:

    Thanks Joel – perfect.

    GD Star Rating
    loading...
  3. Raindog says:

    This took me a while to understand what was going on, perhaps you can explain it in terms of the original grammar.

    r = keyword1 >> production1 | keyword2 >> production2 ....
    

    and in the example given of the Nabialek trick,

    rule1 =
    rule2 =
    

    Then you have a symbol table used to dispatch the rule associated with the symbol.

    keyword.add
        ("one", one)
        ("two", two)
        ;
    

    The *items* in the symbol table correspond to the keywordX list from the original example grammar, which is the part I had trouble figuring out, perhaps the spirit version of the example grammar could have be changed to:

    production1 = id;
    production2 = id >> ',' >> id;
    
    keyword1 = production1;
    keyword2 = production2;
    
    keyword.add
        ("keyword1", keyword1)
        ("keyword2", keyword2)
        ;
    

    It’s a little more verbose, but you can then provide the “real” example.

    GD Star Rating
    loading...
  4. Thomas Bernard says:

    I’m working on a rather complex grammar which makes an extensive use of keywords. I have a working state but somehow I wasn’t happy with the effecency of the parsing.

    I started out with the first contruct using alternatives and tried to get something better using the permutation. That lead me to making a small benchmark to compare three approches I came up with (with some tips from the Mailing list).

    I wrote a small benchmark program with some generated keywords (5, 10, 20 and 50) to get a better appreciation of the best technique depending on the number of keywords.

    The first implementation was using alternates *( | | | ) style parsing. The second implemtation used permutations ( ^ ^ ^ ) with the drawback of not being able to parse multiple occurences of the same keyword. The last implementation is the Nabialek trick as described on this page.

    My benchmarks showed that somehow the nabialek trick was the slowest. Even slower than the alternative parser.

    I was very disappointed because nabialek trick is advertised as being very fast. But I finally found the clue !
    There’s a small catch in the proposed implementation which produces expensive copying of the parser being evaluated.

    Here’s how you will get the expected lightning speed out of the nabialek trick.
    Instead of declaring the local variable as a parser you must use a parser pointer.
    Replace:

    qi::rule<Iterator, ascii::space_type, qi::locals<qi::rule> > start;
    

    with:

    qi::rule<Iterator, ascii::space_type, qi::locals<qi::rule *> > start;
    

    And in the grammar:

    keyword[_a = _1] >> lazy(_a)
    

    with:

    keyword[_a = &_1] >> lazy(*_a)
    

    That will even beat the permutation parser.

    GD Star Rating
    loading...
    • Thomas Bernard says:

      Please forget the last part on optimizing the Nabialek trick. I was a bit hasty and didn’t check that the parsing succeeded. The solution is worhless. I’ll give it another try as soon as I can.

      GD Star Rating
      loading...
  5. Thomas Bernard says:

    The Nabialek trick doesn’t present real performance benefits unless you have a very large number of keywords. The simple fact that semantic actions are necessary to get the trick working slows everything down.

    The second issue I see is the data extraction. Here too you’re bound to using some semantic actions to get the data into some usable data structures. It’s tricky and you very quickly loose all the performance benefits presented by the symbols parser.

    Unless you habe a rather large number of keywords (more than 30) you’ll get better parsing speed if you write your grammar without using the nabialek trick.

    GD Star Rating
    loading...

Leave a Reply

preload preload preload