Qi Example – Boost.Spirit

How to Optimize Qi

Hartmut Kaiser — Sat, 23 Jul 2011 15:49:28 +0000

Mike Lewis posted a marvelous experience report dubbed ‘Optimizing Boost Spirit – Blazing fast AST generation using boost::spirit’. He describes how he took an old compiler for the Epoch programming language (which was based on Spirit.Classic) and tuned it for performance using Spirit.Qi and Spirit.Lex. His results are exceptional: he got roughly a thousandfold speedup compared to the old version. The complete code for his compiler can be downloaded from here.

He writes:

This code illustrates several advanced techniques for parsing large inputs with complex Spirit grammars:

Deferred construction and minimal copying of attribute values

Lexical analysis for faster backtracking

A special directive for using qi::symbols alongside a lexer

Linear allocators for faster AST node allocation

Intrusive reference counting for even faster AST node allocation/copying

Grammar transformations for general optimality

Abuse of the &-predicate for skipping expensive productions

Dividing grammars into multiple implementation files for minimal recompilation times

Thanks Mike for sharing your work! I’m sure many Spirit developers will find it very enlightening and encouraging to read about your work. Keep up the excellent work!

Rating: 4.7/5 (6 votes cast)

Spirit.Qi in the Real World

Joel de Guzman — Wed, 08 Jun 2011 21:34:32 +0000

This is the first time I missed attending BoostCon (May 15-20, 2011 – Aspen, Colorado). Fortunately, for those of us who were not able to attend, Marshall Clow uploaded some videos. Here’s one that’s relevant to Spirit: “Spirit.Qi in the Real World”, by Robert Stewart. Watch the presentation here:

http://blip.tv/boostcon/spirit-qi-in-the-real-world-5254335

You can find the slides here: https://github.com/boostcon/2011_presentations/raw/master/tue/spirit_qi_in_the_real_world.pdf

Past sessions on Spirit have focused on introducing Spirit or showing extracts of real use, intermingled with tutorial highlights. Upon writing real Spirit.Qi parsers, however, one quickly discovers that “the devil is in the details.” There are special cases, tricks, and idioms that one must discover by trial and error or, perhaps, by following the Spirit mailing list, all of which take time and may not be convenient. In this session, we’ll walk through the development of a Spirit.Qi parser for printf()-style format strings. The result will be a replacement for printf() that is typesafe and efficient.

Rating: 5.0/5 (2 votes cast)

The Keyword parser

teajay — Sat, 16 Apr 2011 17:27:14 +0000

The keyword parser construct has recently been added to Spirit’s repository (available in 1.47 or from svn). Here’s a small introduction to help you get started using the keyword parsers.

Those of you familiar with the Nabialek trick will recognize it at work under the hood. What you can achieve with the keywords parser can also be achieved with the Nabialek trick, but not always as elegantly or as efficiently.

The two examples presented below are included in the Spirit repository and can be found in the folder:

libs/spirit/repository/example/qi

Data members marked by keywords (options.cpp)

For this small introduction we’ll consider parsing a program command line.

Options are commonly passed to applications delimited by option keywords :


mySuperCompiler --include includePath --define newSymbol=10 --output output.txt --define newSymbol2=20 --source mySourceFile

The order in which the options are specified doesn’t matter at all. The task of the parser we are going to write is to extract the individual options into some internal data structure we will use to control the program.

Here are the structures we could use to hold the options passed to our command line :

// A basic preprocessor symbol
typedef std::pair<std::string, int> preprocessor_symbol;

struct program_options {
    // symbol container type definition
    typedef std::vector< preprocessor_symbol > preprocessor_symbols_container;

    // include paths
    std::vector<std::string> includes;
    // preprocessor symbols
    preprocessor_symbols_container preprocessor_symbols;
    // output file name
    boost::optional<std::string> output_filename;
    // input file name
    std::string source_filename;
};

Of course the structures are adapted to be compatible with fusion in order to get the data pulled into the structures easily.

Now let’s define our options rule:


rule kwd_rule;

kwd_rule %= kwd("--include")[
                parse_string
            ]
          / kwd("--define") [
                parse_string
                >> (
                    (lit('=') > int_) | attr(1)
                   )
            ]
          / kwd("--output",0,1)[
                parse_string
            ]
          / kwd("--source",1)[
                parse_string
            ]
          ;

The first thing to notice here is that we used the %= operator. This means that the parsing construct we just wrote has an attribute type compatible with the attribute type of our adapted structure!

This is one spot where the keyword parsing construct surpasses the Nabialek trick. The Nabialek trick just can’t do that.

On the next lines we define our keyword parsing constructs. Writing

kwd("--include")[ parse_string ]

is equivalent to writing:

lit("--include") > parse_string

The word “--include” must be followed by a string.

kwd directives can be combined using the / operator. The kwd directive and the / operator work tightly together to achieve attribute compatibility while using the Nabialek trick.

One last thing to notice is the occurrence constraints that can be associated with a kwd directive. They work like the repeat directive and make it possible to add validation checks inside the keyword parsing loop.

Writing

kwd("--output",0,1)[ parse_string ]

means that the keyword “--output” may occur at most once (0 or 1 times). If it occurs more than once the parser will fail.

Writing

kwd("--source",1)[ parse_string ]

means that the keyword “--source” must occur once and only once. This works just like the repeat directive.

Occurrence constraints cost little in runtime performance and make it easy to enforce constraints that would otherwise be much more difficult to formulate.
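To make the occurrence checks concrete, here is a minimal hand-written sketch in plain C++ (no Spirit involved) of the bookkeeping a keyword loop with occurrence constraints has to do. The names keyword_spec, option_specs, split and check_occurrences are hypothetical and are not part of the Spirit repository. The sketch also assumes that a keyword without explicit constraints may occur zero or more times, and it simplifies the input to alternating "keyword value" tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <sstream>
#include <string>
#include <vector>

namespace kwd_sketch {

// Hypothetical description of one keyword: allowed number of occurrences.
struct keyword_spec {
    std::size_t min_count;
    std::size_t max_count;
};

typedef std::map<std::string, keyword_spec> spec_map;

// Constraints mirroring the options rule above. Keywords without explicit
// constraints are assumed to mean "zero or more".
inline spec_map option_specs() {
    std::size_t const unlimited = std::numeric_limits<std::size_t>::max();
    spec_map specs;
    specs["--include"] = keyword_spec{0, unlimited};
    specs["--define"]  = keyword_spec{0, unlimited};
    specs["--output"]  = keyword_spec{0, 1};
    specs["--source"]  = keyword_spec{1, 1};
    return specs;
}

// Splits a command line on whitespace.
inline std::vector<std::string> split(std::string const& line) {
    std::istringstream in(line);
    std::vector<std::string> tokens;
    for (std::string t; in >> t;) tokens.push_back(t);
    return tokens;
}

// Expects alternating "keyword value" tokens. Returns true only if every
// keyword is known, has a value, and satisfies its occurrence constraints.
inline bool check_occurrences(spec_map const& specs,
                              std::vector<std::string> const& tokens) {
    std::map<std::string, std::size_t> seen;
    for (std::size_t i = 0; i < tokens.size(); i += 2) {
        spec_map::const_iterator it = specs.find(tokens[i]);
        if (it == specs.end() || i + 1 >= tokens.size())
            return false;                      // unknown keyword or missing value
        assert(it->second.min_count <= it->second.max_count);
        if (++seen[tokens[i]] > it->second.max_count)
            return false;                      // keyword occurs too often
    }
    for (spec_map::const_iterator it = specs.begin(); it != specs.end(); ++it)
        if (seen[it->first] < it->second.min_count)
            return false;                      // required keyword is missing
    return true;
}

} // namespace kwd_sketch
```

Calling check_occurrences(option_specs(), split("--output a --output b --source s")) returns false because "--output" appears twice. Writing this by hand for every grammar is exactly the kind of effort the occurrence constraints are meant to save.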

The kwd directive also exists in a case-insensitive variant: ikwd. You can combine kwd and ikwd freely inside the same keyword block at the cost of a small runtime overhead.

Derived structures (derived.cpp)

A recent post in the mailing list gave me the idea to provide an example of how the keyword parser can be used to produce different derived structures depending on keywords placed in the input.

Here’s the problem as described by MM:

“I have a case where I have a prefix string that will distinguish what will follow it.

prefix string - struct members

this is what is read from the input stream. I have a base struct and 5 derived D1..D5, each derived has a different prefix as a static const std::string member. Parsing the prefix string tells me which struct D1..D5 I should parse after. All these derived structs are fusion adapted. There is a rule for each of the derived.”

To keep the example simple here are the classes we could consider:


struct base_type {
    base_type(const std::string &name) : name(name)  {}

    std::string name;
    virtual std::ostream &output(std::ostream &os) const {
        os<<"Base : "<

Our parse result must be a vector of pointers to our base class:

std::vector<base_type*>

To get that done, we’ll use semantic actions inside the kwd directive:

kwd_rule = kwd("derived1")[
              ('=' > parse_string > int_ )
              [phx::push_back(_val,phx::new_<derived1>(_1,_2))]
           ]
         / kwd("derived2")[
              ('=' > parse_string > int_ )
              [phx::push_back(_val,phx::new_<derived2>(_1,_2))]
           ]
         / kwd("derived3")[
              ('=' > parse_string > int_ > double_)
              [phx::push_back(_val,phx::new_<derived3>(_1,_2,_3))]
           ]
           ;


This rule will construct new derived classes and append them to our result vector during parsing. The input parsed by this construct is of the form:

derived2 = "object1" 10 derived3= "object2" 40 20.0

Keywords vs Nabialek trick

Here’s a small table comparing the features of the keyword parsing constructs and the Nabialek trick to help you decide which solution better suits your needs.



	
                                          Nabialek trick            Keywords parser
Attribute propagation                     no                        yes
Runtime modification of the keyword set   yes                       no
Occurrence constraints                    not easily implemented    yes
Limit on the number of keywords           available runtime memory  BOOST_VARIANT_LIMIT_TYPES
	



The keywords parsing construct can save a lot of typing over the Nabialek trick and in many cases even performs better. Because it supports attribute propagation, it also makes it much easier to get the parsed data into structures the program can use. The main limitation of the keyword parser is the number of keywords a keyword block may contain (limited by the maximum size of the variant type, BOOST_VARIANT_LIMIT_TYPES).

Rating: 4.0/5 (1 vote cast)

Dispatching on Expectation Point Failures

Rob Stewart — Mon, 28 Feb 2011 14:23:06 +0000

When using expectation points, a parsing failure results in an exception that generically indicates the failure, but probably doesn’t explain the problem in the most meaningful way. It is possible to attach an error handler to react to the failed match in a more specialized way:

rule = alpha > '!';
on_error<fail>(rule,
   std::cerr << val("Expected '!' at offset ") << (_3 - _1)
      << " in \"" << construct<std::string>(_1, _2) << '"'
      << std::endl);

That will produce a message like the following on stderr:

Expected '!' at offset 7 in "Some input"

However, if there’s more than one expectation point in a rule, then the diagnostic may be unhelpfully generic. To do otherwise, one must distinguish which expectation point failed. While it is certainly possible to factor the grammar into additional rules in order to have at most one expectation point per rule, that’s not necessary and can make the grammar less readable than otherwise. Instead, the what parameter (_4) of the error handler can be used:

rule = alpha > '!';
on_error<fail>(rule,
   std::cerr << val("Expected ") << _4 << " at offset "
      << (_3 - _1) << " in \"" << construct<std::string>(_1, _2) << '"'
      << std::endl);

The what parameter describes the failure. In the case of an expectation point match failure, it is the name of the parser that failed to match or, if the parser is to match literal text, like '!' in the preceding example, the what parameter will be "literal-char" or similar. In this case, _4 will be "literal-char" (in the form of a boost::spirit::utf8_string which is a specialization of std::basic_string), and thus not terribly useful in a diagnostic.

To make the error message more helpful, and especially in rules with more than one literal parser to distinguish, create distinct, named rules:

exclamation = lit('!');
exclamation.name("!");
rule = alpha > exclamation;
on_error<fail>(rule,
   std::cerr << val("Expected ") << _4 << " at offset "
      << (_3 - _1) << " in \"" << construct<std::string>(_1, _2) << '"'
      << std::endl);

This will report Expected ! at offset... when the exclamation rule fails to match.

Since an expectation point failure is distinguished by the what parameter, it follows that the what parameter can be used to dispatch to different behavior in the error handler based upon which expectation point failed to match. Doing so can be as simple as passing the what parameter to an error handling function which can use normal C++ techniques for dispatch such as cascading if-else’s or a map lookup, using the what string as the key to find a function to call. However, Phoenix offers power to do that work within the context of the on_error() call:

semicolon = lit(';');
semicolon.name(";");
rule = alpha > semicolon > alpha;
on_error<fail>(rule,
   let(_a = bind(&boost::spirit::info::tag, _4))
   [
      if_(";" == _a)
      [
         report_missing(_4, _1, _2, _3)
      ]
      .else_
      [
         if_("alpha" == _a)
         [
            report_missing("second word", _1, _2, _3)
         ]
         .else_
         [
            report_error(_4, _1, _2, _3)
         ]
      ]
   ]);

For the last example to compile, a number of include and using directives are necessary beyond the basics you are probably accustomed to seeing:

#include 
#include 
#include 
#include 
using namespace boost::phoenix::local_names;

It would seem, at first blush, that comparing to _4 directly should work, but it doesn’t because _4 is a Phoenix actor. Instead, a string type is needed to support the comparisons against the string literals for dispatching. In this example, a local Phoenix variable, _a, is declared and assigned the result of binding _4 to boost::spirit::info::tag, the field of the boost::spirit::info struct that contains the what string. Thus, _a is a variable local to the error handler that is bound to the boost::spirit::utf8_string that describes the error and supports comparisons. Note the use of Phoenix’s let construct to declare a local variable scope. (This _a, which is boost::phoenix::local_names::_a, can be ambiguous with boost::spirit::qi::_a, depending upon using directives and declarations.)

The two functions, report_missing() and report_error() are not defined here, but presumably would report on stderr or raise an exception to indicate that a parsing error occurred, and would report the error context from the input range [_1,_2) and would note the error location, within that range, as given by _3.

When dispatching in this manner, there can be other parsing errors besides expectation point match failures, hence the final .else_ branch in the example error handler. For lack of a better response, the example just reports a generic error message that includes the what parameter’s text to give some sort of explanation. A real world rule would possibly provide a more context-specific diagnostic.

A final caution regarding this technique: the compile time, maintenance burden, and code size increase with each additional expectation point to be handled. Using a map-based dispatch may well be better when the number of expectation points grows. However, the diagnostic text may then get out of sync with the point in the grammar that triggers it, because the two are located in different parts of the code.
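A map-based dispatch of the kind mentioned above might look like the following minimal sketch. It is plain C++ and independent of Spirit: the error_dispatcher class, the handler signature and make_dispatcher() are all hypothetical names chosen for illustration, and the what string is assumed to have been extracted from the info object already (for example inside an error handling function called from on_error()).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

namespace dispatch_sketch {

// Hypothetical handler signature: receives the what string and the offset
// of the failure, returns the diagnostic text.
typedef std::function<std::string(std::string const&, std::size_t)> handler;

class error_dispatcher {
public:
    explicit error_dispatcher(handler fallback) : fallback_(fallback) {
        assert(static_cast<bool>(fallback_));   // a fallback is mandatory
    }

    // Registers the handler to call when `what` names the failed parser.
    void add(std::string const& what, handler h) { handlers_[what] = h; }

    // Looks up the handler by the what string; unknown tags use the fallback.
    std::string dispatch(std::string const& what, std::size_t offset) const {
        std::map<std::string, handler>::const_iterator it = handlers_.find(what);
        return (it != handlers_.end() ? it->second : fallback_)(what, offset);
    }

private:
    std::map<std::string, handler> handlers_;
    handler fallback_;
};

// Example setup matching the rule `alpha > semicolon > alpha`.
inline error_dispatcher make_dispatcher() {
    error_dispatcher d([](std::string const& what, std::size_t offset) {
        return "Invalid syntax: expected " + what + " at offset " + std::to_string(offset);
    });
    d.add(";", [](std::string const&, std::size_t offset) {
        return "Missing semicolon at offset " + std::to_string(offset);
    });
    d.add("alpha", [](std::string const&, std::size_t offset) {
        return "Missing second word at offset " + std::to_string(offset);
    });
    return d;
}

} // namespace dispatch_sketch
```

With this setup, make_dispatcher().dispatch(";", 7) yields "Missing semicolon at offset 7", while an unregistered tag such as "digit" falls through to the generic message. The registration calls live wherever the dispatcher is built, which is precisely where the risk of drifting away from the grammar comes from.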

There is another way to keep the diagnostic text near the rule triggering an error, while avoiding a great deal of code within the grammar. It involves collecting the rule name and corresponding diagnostic in a structure stored in an array that is then passed to an error handler that uses the what parameter to select a diagnostic from the array. If that was as clear as mud, don’t worry. The code should make it clear. Let’s start with the rule name to diagnostic mapping which combines the structure and array within a class template:

template <size_t N>
class diagnostics
{
public:
   diagnostics();

   // Adds a tag and diagnostic message pair to self.
   void
   add(char const * _tag, char const * _diagnostic);

   // Returns the diagnostic, if any, for _tag.
   char const *
   operator [](char const * _tag) const;

private:
   struct entry
   {
      char const * tag;
      char const * diagnostic;
   };

   entry  entries_[N];
   size_t size_;
};

diagnostics, as written, simply saves pointers to string literals. For more flexibility, it could store real strings (std::basic_string<>s, for instance), but this design is useful and simpler for exposition. To use diagnostics, one must create a grammar data member for each rule that will use it, and then populate it as needed by the rule:

semicolon = lit(';');
semicolon.name(";");
rule = alpha > semicolon > alpha;
diags.add(";", "Missing semicolon after first word");
diags.add("alpha", "Missing second word");
on_error<fail>(rule,
   error_handler(ref(diags), _1, _2, _3, _4));

Notice how the first expectation point is identified by a named rule for the required semicolon, which will produce an error message or exception containing the diagnostic text "Missing semicolon after first word". Similarly, if there is no word after a semicolon, then the diagnostic "Missing second word" will be used because the second alpha will fail to match. In each case, the expectation is that the error handler will use _4 to indicate which rule failed to satisfy an expectation point.

To round out this example, here’s how error_handler() might look:

struct error_handler_impl
{
   template <typename, typename, typename, typename, typename>
   struct result { typedef void type; };

   template <typename D, typename B, typename E, typename W, typename I>
   void
   operator ()(D const & _diagnostics, B _begin, E _end,
      W _where, I const & _info) const
   {
      utf8_string const & tag(_info.tag);
      char const * const what(tag.c_str());
      char const * diagnostic(_diagnostics[what]);
      std::string scratch;
      if (!diagnostic)
      {
         scratch.reserve(25 + tag.length());
         scratch = "Invalid syntax: expected ";
         scratch += tag;
         diagnostic = scratch.c_str();
      }
      raise_parsing_error(diagnostic, _begin, _end,
         _where);
   }
};
phx::function<error_handler_impl> error_handler;

You’re probably wondering where the implementation of diagnostics’ member functions are to be found. Here they are:

template <size_t N>
inline
diagnostics<N>::diagnostics()
   : size_(0)
{
}

template <size_t N>
void
diagnostics<N>::add(char const * const _tag,
   char const * const _diagnostic)
{
   assert(size_ < N);
   entry & e(entries_[size_++]);
   e.tag = _tag;
   e.diagnostic = _diagnostic;
}

template <size_t N>
char const *
diagnostics<N>::operator [](char const * const _tag) const
{
   for (size_t i(0); i < size_; ++i)
   {
      entry const & e(entries_[i]);
      if (0 == std::strcmp(e.tag, _tag))
      {
         return e.diagnostic;
      }
   }
   return 0;
}

It should now be apparent that there are numerous ways to dispatch error handling when using expectation points, but all revolve around decoding the what parameter. In the end, factor your grammar to be functional and readable and then consider which expectation point failure dispatching technique fits best without sacrificing readability or performance.

Rating: 5.0/5 (3 votes cast)

Parsing Escaped String Input Using Spirit.Qi

Hartmut Kaiser — Sat, 13 Nov 2010 20:28:40 +0000

Jeroen Habraken (a.k.a VeXocide) sent an article about parsing escaped strings using Qi, which we happily publish for everybody to read. Thanks Jeroen!

Continue reading here.

Rating: 4.5/5 (2 votes cast)

S-expressions and variant

Joel de Guzman — Fri, 12 Mar 2010 00:24:42 +0000

I have a mixed relationship with variant…

I just wrote a parser for S-expressions (that will be the basis of ASTs and intermediate types in my planned “write-a-compiler” article series). The parser itself is easy, but as always, I spent more time on the underlying data structures.

What are S-expressions? S-expressions, also called sexps, are recursive, list based, data structures. Being recursive, they can represent hierarchical information. S-expressions are parenthesized prefix expressions, known for their use in LISP (and its sibling Scheme). Here’s a simple sexp:

(* 2 (+ 3 4))

The sexp above corresponds to this infix expression:

(2 * (3 + 4))
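To see how the prefix form maps onto a recursive structure, here is a minimal sketch of an evaluator for such expressions. It is plain C++ written for illustration only: the names in sexp_sketch are hypothetical, it handles only non-negative integers with the + and * operators, and it has nothing to do with the utree-based parser described in this post.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

namespace sexp_sketch {

inline void skip_spaces(std::string const& s, std::size_t& pos) {
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos])))
        ++pos;
}

// Evaluates the expression starting at s[pos] and advances pos past it.
inline long eval(std::string const& s, std::size_t& pos) {
    skip_spaces(s, pos);
    if (pos >= s.size())
        throw std::runtime_error("unexpected end of input");

    if (s[pos] != '(') {                      // atom: a non-negative integer
        if (!std::isdigit(static_cast<unsigned char>(s[pos])))
            throw std::runtime_error("expected integer");
        long value = 0;
        while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
            value = value * 10 + (s[pos++] - '0');
        return value;
    }

    ++pos;                                    // consume '('
    skip_spaces(s, pos);
    if (pos >= s.size() || (s[pos] != '+' && s[pos] != '*'))
        throw std::runtime_error("expected '+' or '*'");
    char const op = s[pos++];

    long result = (op == '+') ? 0 : 1;        // identity element of the operator
    for (;;) {
        skip_spaces(s, pos);
        if (pos >= s.size())
            throw std::runtime_error("missing ')'");
        if (s[pos] == ')') {
            ++pos;                            // consume ')'
            return result;
        }
        long const operand = eval(s, pos);    // operands may be nested lists
        result = (op == '+') ? result + operand : result * operand;
    }
}

inline long eval_sexp(std::string const& s) {
    std::size_t pos = 0;
    long const value = eval(s, pos);
    assert(pos <= s.size());
    return value;
}

} // namespace sexp_sketch
```

Here sexp_sketch::eval_sexp("(* 2 (+ 3 4))") evaluates to 14, the same value as the infix form. Each list is handled by one recursive call, which is why no precedence rules are needed.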

S-expressions are simple and infinitely powerful beasts as evident in applications that use LISP as their scripting language. They can represent code and data. Some people even use S-expressions as a suitable (and terser!) replacement for XML. The in-memory data structures are very easy to use, transform and manipulate, traverse and compile or accumulate results from.

The plan is to use S-expressions as our AST representation and embed a minimal LISP/Scheme interpreter IN the compiler. This implies that along the way, we’ll be building an S-expression parser and a LISP/Scheme interpreter. How cool is that? … We’re talking about scripting the compiler with an interpreter!

I needed a dynamic data type that can represent the S-expressions. I called it utree, short for universal-tree. I want it to be as simple as it can be and fast and tight in memory footprint. Boost variant was simply out of the question (I used it in one of my early prototypes). For one, it failed a basic requirement (tight memory footprint). The padding and the way it aligns the “what-type” integer member are quite wasteful. It uses a conservative alignment using the worst alignment of the types in the union. Thus if you have a type in there that aligns to 8 bytes, variant requires another 8 bytes just for the type discriminator! Try it out:

struct x { void* a; void* b; void* c; };
/***/
std::cout << sizeof(x) << std::endl;
std::cout << sizeof(boost::variant) << std::endl;

I get: 12 and 24 respectively (32 bit system).

I ended up with 40 bytes in my initial prototype (using STL containers and variant) and later squeezed that to 24 (minimum). I did away with variant in my latest version and got 16 bytes. In this case, I “stole” unused padding bits from the data to store the discriminator. With these 16 bytes, I have nil, bool, int, double, string and (doubly linked) list. The string itself steals memory when it can (i.e. it stores the string in the union when it fits and only uses the heap when needed). So, on 32 bit systems, it can store as much as 14 bytes in-situ. That’s a lot for storing simple strings like symbols and identifiers. On 64 bit systems, you can store a lot more in-situ and minimize heap usage even more.
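The idea of keeping the discriminator inside the same 16 bytes as the data can be illustrated with a much simplified sketch. This is not the utree implementation: the small_value class below is hypothetical, it simply reserves the last byte of a 16-byte buffer for the type tag instead of stealing padding bits, and it rejects strings that do not fit in-situ rather than falling back to the heap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>

namespace tight_sketch {

class small_value {
public:
    enum kind { nil_kind, int_kind, double_kind, string_kind };

    small_value() { reset(nil_kind); }
    explicit small_value(int i) { reset(int_kind); std::memcpy(buf_, &i, sizeof i); }
    explicit small_value(double d) { reset(double_kind); std::memcpy(buf_, &d, sizeof d); }
    explicit small_value(char const* s) {
        std::size_t const len = std::strlen(s);
        if (len > max_in_situ)               // a real design would use the heap here
            throw std::length_error("string too long for in-situ storage");
        reset(string_kind);
        std::memcpy(buf_, s, len);           // buffer was zeroed, so it stays NUL-terminated
    }

    // The discriminator lives in the last byte of the buffer.
    kind which() const { return static_cast<kind>(buf_[tag_index]); }

    int as_int() const {
        assert(which() == int_kind);
        int i;
        std::memcpy(&i, buf_, sizeof i);
        return i;
    }
    double as_double() const {
        assert(which() == double_kind);
        double d;
        std::memcpy(&d, buf_, sizeof d);
        return d;
    }
    char const* as_string() const {
        assert(which() == string_kind);
        return reinterpret_cast<char const*>(buf_);
    }

private:
    static constexpr std::size_t tag_index = 15;    // last byte holds the tag
    static constexpr std::size_t max_in_situ = 14;  // 14 chars + NUL + tag = 16

    void reset(kind k) {
        std::memset(buf_, 0, sizeof buf_);
        buf_[tag_index] = static_cast<unsigned char>(k);
    }

    unsigned char buf_[16];
};

static_assert(sizeof(small_value) == 16, "the tag must not enlarge the object");

} // namespace tight_sketch
```

A value such as tight_sketch::small_value("identifier") stores its characters directly in the object, and sizeof stays at 16 regardless of which kind is held. The 14-character limit falls out of the layout: one byte for the terminating NUL and one for the tag.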

At this point, I feel like writing my own variant type that can do such things (intrusive variant?). Barring the use of Boost.Variant, I needed to write my own data structures (doubly linked list). I really wanted to use Boost.Intrusive which is quite efficient, but because I had to squeeze my own variant in there, I had to make use of unions which require PODs!

Here’s the work in progress:
http://boost-spirit.com/dl_more/scheme/scheme_v0.2/

Here’s the utree API:


    ///////////////////////////////////////////////////////////////////////////
    // The main utree (Universal Tree) class
    // The utree is a hierarchical, dynamic type that can store:
    //  - a nil
    //  - a bool
    //  - an integer
    //  - a double
    //  - a string (textual or binary)
    //  - a (doubly linked) list of utree
    //  - a reference to a utree
    //
    // The utree has minimal memory footprint. The data structure size is
    // 16 bytes on a 32-bit platform. Being a container of itself, it can
    // represent tree structures.
    ///////////////////////////////////////////////////////////////////////////
    class utree
    {
    public:

        typedef utree value_type;
        typedef detail::list::node_iterator<utree> iterator;
        typedef detail::list::node_iterator<utree const> const_iterator;
        typedef utree& reference;
        typedef utree const& const_reference;
        typedef std::ptrdiff_t difference_type;
        typedef std::size_t size_type;

        struct nil {};

        utree();
        explicit utree(bool b);
        explicit utree(unsigned int i);
        explicit utree(int i);
        explicit utree(double d);
        explicit utree(char const* str);
        explicit utree(char const* str, std::size_t len);
        explicit utree(std::string const& str);
        explicit utree(boost::reference_wrapper<utree> ref);

        utree(utree const& other);
        ~utree();

        utree& operator=(utree const& other);
        utree& operator=(bool b);
        utree& operator=(unsigned int i);
        utree& operator=(int i);
        utree& operator=(double d);
        utree& operator=(char const* s);
        utree& operator=(std::string const& s);
        utree& operator=(boost::reference_wrapper<utree> ref);

        template <typename F>
        typename F::result_type
        static visit(utree const& x, F f);

        template <typename F>
        typename F::result_type
        static visit(utree& x, F f);

        template <typename F>
        typename F::result_type
        static visit(utree const& x, utree const& y, F f);

        template <typename F>
        typename F::result_type
        static visit(utree& x, utree const& y, F f);

        template <typename F>
        typename F::result_type
        static visit(utree const& x, utree& y, F f);

        template <typename F>
        typename F::result_type
        static visit(utree& x, utree& y, F f);

        template <typename T>
        void push_back(T const& val);

        template <typename T>
        void push_front(T const& val);

        template <typename T>
        iterator insert(iterator pos, T const& x);

        template <typename T>
        void insert(iterator pos, std::size_t, T const& x);

        template <typename Iter>
        void insert(iterator pos, Iter first, Iter last);

        template <typename Iter>
        void assign(Iter first, Iter last);

        void clear();
        void pop_front();
        void pop_back();
        iterator erase(iterator pos);
        iterator erase(iterator first, iterator last);

        utree& front();
        utree& back();
        utree const& front() const;
        utree const& back() const;

        utree& operator[](std::size_t i);
        utree const& operator[](std::size_t i) const;

        void swap(utree& other);

        iterator begin();
        iterator end();
        const_iterator begin() const;
        const_iterator end() const;

        bool empty() const;
        std::size_t size() const;
    };

    bool operator==(utree const& a, utree const& b);
    bool operator<(utree const& a, utree const& b);
    bool operator!=(utree const& a, utree const& b);
    bool operator>(utree const& a, utree const& b);
    bool operator<=(utree const& a, utree const& b);
    bool operator>=(utree const& a, utree const& b);

Rating: 4.3/5 (4 votes cast)

Tracking the Input Position While Parsing

Peter Schüller — Fri, 05 Mar 2010 16:21:53 +0000

The following article is about tracking the parsing position with Spirit V2. This is useful for generating error messages which tell the user exactly where an error has occurred. We also show how to use Spirit V2 to parse from an input stream without first reading the whole stream into a std::string.

Rating: 5.0/5 (1 vote cast)

Parsing Skippers and Skipping Parsers

Hartmut Kaiser — Wed, 24 Feb 2010 13:32:08 +0000

Spirit has supported skipper-based parsing since its very inception, so this is definitely not something new to Spirit V2. Nevertheless, the recent discussion on the Spirit mailing list around the semantics of Qi’s lexeme[] directive shows the need for some clarification. Today I try to answer questions like: “What does it mean to use a skipper while parsing?”, or “When do I want to use a skipper and when not?”.

While parsing some formatted data stream it is very often desirable to ignore some parts of the input. A common example would be the need to skip whitespace and comments while parsing some computer language. Certainly it is possible to explicitly account for the tokens to skip (such as the whitespace or the comments) while writing the grammar. But this can get very tedious as those tokens are valid to appear at any point in the input.

For the sake of simplicity, let us assume we want to parse a simple key/value expression: key=value, where we want to allow for any number of space characters before, in between, or after the key or the value. A naive grammar matching the plain key/value pair without whitespace skipping would look like (see Parsing a List of Key-Value Pairs Using Spirit.Qi for more details):

pair  =  key >> '=' >> value;
key   =  qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
value = +qi::char_("a-zA-Z_0-9");

If we want to explicitly accommodate the rule pair to match any interspersed space characters we get:

pair  = *space >> key >> *space >> '=' >> *space >> value >> *space;

which, while it produces the desired result, is not only error-prone but also difficult to write, understand, and maintain. If we look closer, we see that the process of skipping the whitespace tokens is easily automated. It seems to be sufficient to insert a repeated invocation of the space parser (or, generally, any skip parser) in between the elements of the user-defined parser expression sequences.

In fact, that is exactly what Spirit can do for you! The library invokes any supplied skip parser upon entry to the parse member function of any parser conforming to the PrimitiveParser concept. The skip parser has to be supplied by calling a special API function: phrase_parse:

namespace qi = boost::spirit::qi;
typedef std::string::const_iterator iterator_type;

// key and value are declared without a skipper; pair uses qi::space_type
qi::rule<iterator_type> key = qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
qi::rule<iterator_type> value = +qi::char_("a-zA-Z_0-9");
qi::rule<iterator_type, qi::space_type> pair = key >> '=' >> value;

std::string input(" key = value ");
iterator_type begin = input.begin();
iterator_type end = input.end();
qi::phrase_parse(begin, end, pair, qi::space);

This code snippet illustrates several important things:

The function qi::phrase_parse is equivalent to the API function qi::parse except for its additional parameter, the skip parser. Our example utilizes qi::space, but it is possible to use any other, even more complex parser expression as the skipper instead.
All rules which we want to perform the skip parsing need to be declared with the type of the skip parser they are going to be used with. Our example specifies the type of the qi::space parser expression, which is qi::space_type. For more complex parser expressions you might want to use a (mini) grammar or take advantage of BOOST_TYPEOF to let the compiler deduce the actual type.
All rules which should not perform skip parsing have to be declared without an additional skip parser type. These rules behave like an implicit lexeme[] directive (for more information about lexeme[], see below), they inhibit the invocation of the skip parser even if they are executed as part of a rule with an associated skipper.

In the example above we suppressed skipping while matching either the key or the value, otherwise our grammar would match any additional space character inside the key or value as well. Remember, the expression char_ conforms to the PrimitiveParser concept, it will execute the skip parser for each of its invocations. In this case any skip parser would be executed in between any two of the matched characters.
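The interplay between pre-skipping and rules without a skipper can be imitated by hand, which may help in building intuition. The following is a minimal sketch in plain C++ with no Spirit involved; all names in skip_sketch are hypothetical, and for brevity both key and value are treated as runs of letters, digits and underscores. Each "primitive" skips whitespace once on entry, while the character loop inside word() never skips, much like a rule declared without a skipper.

```cpp
#include <cassert>
#include <cctype>
#include <string>

namespace skip_sketch {

typedef std::string::const_iterator iter;

// The "skipper": consumes any run of whitespace.
inline void skip(iter& first, iter last) {
    while (first != last && std::isspace(static_cast<unsigned char>(*first)))
        ++first;
}

inline bool is_word_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Pre-skips once, then matches word characters with no skipping in
// between, so embedded spaces end the word.
inline bool word(iter& first, iter last, std::string& out) {
    skip(first, last);
    iter const start = first;
    while (first != last && is_word_char(*first))
        ++first;
    out.assign(start, first);
    return !out.empty();
}

// Pre-skips, then matches a single literal character.
inline bool literal(iter& first, iter last, char c) {
    skip(first, last);
    if (first == last || *first != c)
        return false;
    ++first;
    return true;
}

// Matches: key '=' value, with optional whitespace around each element.
inline bool parse_pair(std::string const& input,
                       std::string& key, std::string& value) {
    iter first = input.begin();
    iter const last = input.end();
    if (!word(first, last, key) || !literal(first, last, '=')
        || !word(first, last, value))
        return false;
    skip(first, last);                 // consume trailing whitespace
    return first == last;              // require a full match
}

} // namespace skip_sketch
```

With this sketch, " key = value " is accepted and yields "key" and "value", whereas "ke y = value" is rejected: the word stops at the space, and the following literal finds 'y' instead of '='. If word() skipped between characters, the second input would be silently accepted as the key "key".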

Sometimes it is necessary to turn off skipping for a smaller part of the grammar only. For this purpose Spirit implements the lexeme[] directive. This directive inhibits skipping during the execution of the embedded parser. For instance, parsing a quoted string of alphanumeric characters would look like this:

string = lexeme['"' >> *alnum >> '"'];

Here the lexeme directive disables skipping while matching the string, which avoids ‘losing’ characters otherwise matched by the skipper. Please note: lexeme[] performs a pre-skip step, even if it is not a PrimitiveParser itself (it is essentially considered to be a logical primitive by design). If this is undesired, you can utilize the no_skip[] directive instead:

string = '"' >> no_skip[*alnum] >> '"';

This parser will match all the characters in between the quotes, even if the string starts with a character sequence matched by the applied skip parser. The no_skip[] directive is semantically equivalent to lexeme[] except it does not perform a pre-skip before executing the embedded parser. Note: the no_skip[] directive has been added only recently. It will be available starting with the next release (Boost V1.43).

This short article would not be complete without mentioning the skip[] directive. This directive is the counterpart to lexeme[]. It enables skipping for the embedded parser. Without any argument it can be used inside a lexeme or no_skip directive only. In this case it just re-enables the outer skipper:

string = lexeme['"' >> *(alpha | skip[digit]) >> '"'];

This (purely hypothetical) parser would enable skipping inside a string as long as it matches digits. But the skip directive can do more. It may take an additional argument specifying a new skipper, for instance:

skip(qi::space)[*alnum]

which will skip spaces while executing the embedded *alnum parser. This form of the directive can be applied for two purposes. It can be used either for changing the current skip parser or to establish skipping inside a context otherwise not doing skipping at all (even if invoked with the qi::parse() API function).

For more detailed information about all the mentioned directives please see the corresponding documentation.

Parsing Arbitrary Things in Any Sequence

Hartmut Kaiser — Wed, 17 Feb 2010 15:45:25 +0000

Recently, there have been a couple of questions on the Spirit mailing list asking how to parse a set of things known in advance in any sequence and any combination. A simple example would be a list of key/value pairs with known keys, where the keys may appear in any order. This use case seems to be quite common. Fortunately, Spirit provides you with a predefined parser component designed for exactly that purpose: the permutation parser.

Spirit’s permutation parser a ^ b matches either a, b, a >> b, or b >> a, where a and b can be arbitrary parser expressions. Just like normal sequences, this operator can be utilized to combine more than two operands. For instance, the expression a ^ b ^ c will match a or b or c (or any combination thereof) in any sequence. The attribute propagation rule for the permutation parser is

a: A, b: B --> (a ^ b): tuple<optional<A>, optional<B> >

As usual, if one or more operands of the expression do not expose any attribute (or expose unused_type as their attribute, which is equivalent), these operands disappear from attribute handling:

a: A, b: Unused --> (a ^ b): optional<A>

The permutation parser works out of the box whenever you do not need to match all of the elements in the input. But what if you want a strict permutation (each operand gets matched exactly once)? As so often, you have two possibilities: a simple but less versatile solution and a more complex but universally applicable one. The simple solution is to parse the input and to check afterward whether all optionals in the resulting attribute have been filled. I will leave that solution as an exercise for the reader.

The more versatile solution attaches a semantic action to the permutation parser. If we assume the attribute to be a (Fusion) tuple of optionals, containing one optional for each of the parser components in the permutation parser, we can write the following code (thanks to Carl Barron for the initial idea).

This code defines a Phoenix function (a lazy function encapsulating some custom functionality) checking whether one or more of the optionals in a given Fusion sequence are empty. The Fusion algorithm find_if iterates over the given sequence of optionals, invoking optional_empty::operator() for each of the elements. fusion::find_if stops iterating on the first invocation returning true and returns the iterator to the element it stopped on. This is very similar to the well known std::find_if algorithm.

namespace phoenix = boost::phoenix;
namespace fusion = boost::fusion;
namespace qi = boost::spirit::qi;

class no_empties_impl
{
    // helper function object to be invoked by fusion::find_if
    struct optional_empty
    {
        template <typename T>
        bool operator ()(T const& val) const
        {
            return !val;    // return true if 'val' is empty.
        }
    };

public:
    template <typename T>
    struct result { typedef bool type; };

    // This operator will get called from the semantic action attached
    // to the permutation parser. The parameter refers to its overall
    // attribute: the fusion tuple of optionals.
    template <typename T>
    bool operator ()(T const& t) const
    {
        // look for an empty optional, if any return false.
        return fusion::find_if<optional_empty>(t) == fusion::end(t);
    }
};

// define the Phoenix function
phoenix::function<no_empties_impl> const no_empties = no_empties_impl();

The overall Phoenix function no_empties will return false if it finds at least one non-initialized optional in the passed sequence. The following code snippet illustrates how everything fits together:

std::string input("BCA");
std::string::const_iterator begin = input.begin();
std::string::const_iterator end = input.end();
qi::parse(begin, end,
    (qi::char_('A') ^ 'B' ^ 'C')[qi::_pass = no_empties(qi::_0)]);

We assign the result of the invocation of no_empties to Qi’s predefined placeholder _pass. If we assign false, then the parser the semantic action is attached to will be forced to fail retroactively (even if it matched the input successfully before). As a result the overall parser expression will succeed as long as a) the permutation parser matches its input and b) the Phoenix function inside the semantic action returns true.

For more information about the permutation parser please consult its documentation here. Overall, this example is a bit more complex than the average parser you might usually write. It utilizes three libraries: Spirit, Phoenix, and Fusion in a seamless manner. But for sure, once you understand the idea, it will be easier for you to come up with similar solutions. Spirit has been designed with Phoenix and Fusion in mind, and in fact it relies on Fusion heavily itself. As a result, the integration of those libraries is almost perfect.

How to Adapt Templates as a Fusion Sequence

Hartmut Kaiser — Mon, 08 Feb 2010 15:59:13 +0000

Here is another question raised from time to time: “I know how to use a plain struct as an attribute for a sequence parser in Qi by adapting it with BOOST_FUSION_ADAPT_STRUCT. Unfortunately this does not work if the struct is a template. What can I do in this case?”.

There have been plans for a while to create a separate Fusion facility BOOST_FUSION_ADAPT_TPL_STRUCT for adapting templated data types, but this is not in place yet. Today I will describe a trick you can apply to adapt your templates into ‘proper’ Fusion sequences anyway.

We will use the fact that a Qi grammar is already a template in most cases, and even if it is not a template yet, it can be easily converted into one. Further, we will use the built-in capability of rules to invoke a custom attribute transformation if the attribute type of the right hand side does not exactly match the left hand side’s attribute type.

Let us assume this to be our data structure we want to fill while parsing:

template <typename A, typename B>
struct data
{
    A a;
    B b;
};

We would like to be able to directly utilize this template type as an attribute for our grammar. A possible way of adapting the template type to make it usable as a Fusion sequence is to define a fusion::vector and initialize it with references to the data members of our template type. If we then pass this Fusion vector as the attribute to the actual parser expression, we effectively supply our original data members as the attributes to the parsing process.

namespace qi = boost::spirit::qi;
namespace fusion = boost::fusion;

template <typename Iterator, typename A, typename B>
struct data_grammar : qi::grammar<Iterator, data<A, B>()>
{
    data_grammar() : data_grammar::base_type(start)
    {
        // the implicit attribute transformation 'adapts' data<> to
        // the Fusion vector
        start = real_start;

        // do the actual parsing of the data<> members
        real_start = qi::auto_ >> ',' >> qi::auto_;
    }

    qi::rule<Iterator, data<A, B>()> start;
    qi::rule<Iterator, fusion::vector<A&, B&>()> real_start;
};

The signature of the grammar’s start rule has to match the signature of the grammar itself. To accommodate for this we introduce a second rule ‘real_start’ dedicated to the parsing of our data members. At the same time this allows us to inject the needed transformation of our data<> attribute to the Fusion vector. As the left hand side’s and right hand side’s attribute types do not match, the parser expression start = real_start will invoke Spirit’s customization point transform_attribute. But since the default implementation of this customization point does not handle our special data types the way we want, we are required to implement our own specialization:

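namespace boost { namespace spirit { namespace traits
{
    // Note: later Boost releases declare this customization point with an
    // additional Domain template argument (qi::domain for Qi parsers).
    template <typename A, typename B>
    struct transform_attribute<data<A, B>, fusion::vector<A&, B&> >
    {
        typedef fusion::vector<A&, B&> type;

        // called before the right hand side is invoked
        static type pre(data<A, B>& val) { return type(val.a, val.b); }
        // called after the right hand side succeeded
        static void post(data<A, B>&, fusion::vector<A&, B&> const&) {}
        // called after the right hand side failed
        static void fail(data<A, B>&) {}
    };
}}}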

The function pre() is called before the right hand side parser expression is invoked. It gets passed the left hand side’s attribute (the data<> instance) and is required to return the attribute to be passed to the rule’s right hand side expression. The returned Fusion vector is initialized with the references to the data members of our original data<> instance. The functions post() and fail() can be left empty in our case. For more information about this customization point please see the corresponding documentation here.

I added a new example to Spirit demonstrating this technique. Currently, it can be accessed from the Boost SVN only (see adapt_template_struct.cpp), but in the future it will be released as part of Spirit.

Just in case you were wondering: yes, this trick works equally well for Karma generators. The only difference is that the members of the created Fusion vector will have to be constant references instead.
