Nov 15

Parsing a List of Key-Value Pairs Using Spirit.Qi

By Hartmut Kaiser Add comments

One of the goals of this blog is to provide shrink wrapped solutions for small everyday parsing and output generation problems. We hope to show how Spirit may be used to simplify the life for you as a C++ programmer related to data conversion problems from some external representation to your internal data structures and vice versa.

One of the tasks often to be solved is to parse arbitrary key/value pairs delimited by some separator into a std::map. Parsing the URL query format is one example: key1=value1;key2=value2;…;keyN=valueN, and that’s what I would like to provide a reusable solution for.

Let’s start with the corresponding grammar (written using the notation of Parsing Expression Grammars):

query ← pair ((';' / '&') pair)*
pair  ← key ('=' value)?
key   ← [a-zA-Z_][a-zA-Z_0-9]*
value ← [a-zA-Z_0-9]+

We assume that a key has to start with any letter or an underscore, but otherwise might consist of letters, digits or underscores. A value can not be empty and may consist of letters, digits or underscores only as well. Further, we assume a query to be a sequence of at least one pair delimited by semicolons (‘;’) or ampersands (‘&’), and a pair to be a sequence of a mandatory key optionally followed by a ‘=’ and a value.

Converting any Parsing Expression Grammar (PEG) into an equivalent grammar for Spirit.Qi is a purely mechanical step. Let me provide the result first and explain some of the details later on:

query =  pair >> *((qi::lit(';') | '&') >> pair);
pair  =  key >> -('=' >> value);
key   =  qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
value = +qi::char_("a-zA-Z_0-9");

All differences we see are caused by limitations of the C++ language Spirit has to live with. Direct juxtaposition is expressed using the right shift operator (‘>>’), postfix operators as the Kleene Star (‘*’) and the Plus (‘+’) are moved to the front of the corresponding expression and written as prefix operators, and the optional operator (question mark) is rewritten using the unary minus (‘-’). Otherwise the two grammars look fairly similar. The char_ is a predefined Qi parser primitive matching exactly one character based on the description provided as its argument. The lit is very similar to char_ except that it does not expose the matched value as its attribute.

The next required step is to understand what attribute type each of our defined grammar rules should expose. These attribute types will allow us to map the external representation onto the internal C++ types we will use to store the parsing results. Let us store the keys and the values as std::string’s (assuming we have to deal with narrow character representations). The result of a parsed key/value pair can be conveniently stored into a std::pair<std::string, std::string>. Finally, the overall result of parsing the query string should be stored into a std::map<std::string, std::string>. This knowledge allows to write the declaration of our rule variables as used above:

qi::rule<Iterator, std::map<std::string, std::string>()> query;
qi::rule<Iterator, std::pair<std::string, std::string>()> pair;
qi::rule<Iterator, std::string()> key, value;

The type qi::rule<> is a predefined non-terminal parser provided by Spirit usable for storing the grammar definitions above. We need to provide the iterator type of the underlying input data (Iterator) and the attribute type, which is the type of the data the rule is supposed to store its parsed data in.

A word about the unusual function declaration syntax used to specify the attribute type of the rule. Non-terminals in recursive descent parsers can be seen as being very similar to functions. They return a value, their (synthesized) attribute, while they optionally may take arguments, their (inherited) attributes. Spirit uses the function declaration syntax in order to emphasize this similarity.

In the beginning I promised to provide a shrink wrapped solution, so what’s still left is to encapsulate the whole functionality. Again Spirit has some recommended way of doing that: grammar<>’s. Spirit grammar’s are special non-terminals acting as containers for one or more rules allowing to encapsulate more complex parsers:

namespace qi = boost::spirit::qi;

template <typename Iterator>
struct keys_and_values
  : qi::grammar<Iterator, std::map<std::string, std::string>()>
{
    keys_and_values()
      : keys_and_values::base_type(query)
    {
        query =  pair >> *((qi::lit(';') | '&') >> pair);
        pair  =  key >> -('=' >> value);
        key   =  qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
        value = +qi::char_("a-zA-Z_0-9");
    }
    qi::rule<Iterator, std::map<std::string, std::string>()> query;
    qi::rule<Iterator, std::pair<std::string, std::string>()> pair;
    qi::rule<Iterator, std::string()> key, value;
};

The derivation from Qi’s grammar type converts the keys_and_values type into a parser. Its member rules define a grammar which makes it usable for recognizing URL query strings. The base class constructor gets passed the start rule, which is the top most rule of the grammar to be executed when the grammar is invoked. The type keys_and_values has a template parameter allowing to utilize this grammar with arbitrary iterator types.

The last missing code piece shows how to invoke the newly created parser.

std::string input("key1=value1;key2;key3=value3");  // input to parse
std::string::iterator begin = input.begin();
std::string::iterator end = input.end();

keys_and_values<std::string::iterator> p;    // create instance of parser
std::map<std::string, std::string> m;        // map to receive results
bool result = qi::parse(begin, end, p, m);   // returns true if successful

The function qi::parse() is one of Spirit’s main API functions. In the simplest case it takes a pair of iterators pointing to the input sequence to parse (begin, end), an instance of the parser to invoke (p), and the attribute instance to be filled with the converted data (m). This function executes the actual parsing operation and returns true if it was successful.

The fact that attributes of certain types are getting filled on the fly might look like magic to you, or you might think I left out some essential code snippets. But neither it is magic nor did I leave out anything. In Spirit.Qi all parser components (such as char_ or lit) expose specific attribute types and all compound parsers (such as sequences and alternatives) implement well defined rules for attribute propagation and merging. Our example uses the knowledge about these rules, and if you want to understand how this works in more detail, please refer to Spirit’s documentation.

Here is an example: as you might expect char_ exposes the matched character as its attribute. The Kleene Star (‘*’) and the Plus (‘+’) are compound parsers collecting the attributes of their embedded parser in a (STL compatible) container (in our case a std::string, which is a container of char). The non-terminal rule normally inherits the attribute of the parser expression on the right hand side of its assignment. Last but not least, sequences match the attributes of their elements to the corresponding parts of their attribute data structure. Since literals (such as ‘=’ or lit(‘;’)) do not expose any attribute, the expression

pair  =  key >> -('=' >> value);

naturally combines the two string attributes of key and value into the pair of strings as expected by the non-terminal pair, initializing the attribute of value with an empty string if it is not matched.

If you want to try it out for yourself, the complete source code for this example is available from the Boost SVN here. In the future this example will be distributed as part of the Spirit distribution, but for now it lives in the SVN only.

21 Responses to “Parsing a List of Key-Value Pairs Using Spirit.Qi”

  1. Bernard says:

    Thanks for this excellent example !
    It does a great job at explaining how to use Spirit.
    However, I think that the choice of an std::map is not obvious. One might also want to allow duplicates and preserve the original order when parsing lists of key[,values]. For better reusability, how about adding a “bool Ordered” template parameter to the grammar storing the result in a std::vector< std::pair<std::string, boost::optional > when true ?

    Best regards,
    Bernard

    • Hartmut Kaiser says:

      Bernard.

      Both suggestions are valid. The example uses a std::map in order to be as straightforward as possible. But for the sake of completeness let me provide two different versions accomodating your requirements.

      Here is the first iteration which just changes the map to vector<pair<string, string> >. No other modifications are required to implement a version preserving both, the sequence of elements and optional duplicates:

      typedef std::pair<std::string, std::string> pair_type;
      typedef std::vector<pair_type> pairs_type;
      
      template <typename Iterator>
      struct key_value_sequence
        : qi::grammar<Iterator, pairs_type()>
      {
          key_value_sequence()
            : key_value_sequence::base_type(query)
          {
              query =  pair >> *((qi::lit(';') | '&') >> pair);
              pair  =  key >> -('=' >> value);
              key   =  qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
              value = +qi::char_("a-zA-Z_0-9");
          }
      
          qi::rule<Iterator, pairs_type()> query;
          qi::rule<Iterator, pair_type()> pair;
          qi::rule<Iterator, std::string()> key, value;
      };
      

      In order to allow for and distinguish empty values (i.e. key=; versus key;) you need to change the definition of pair_type:

      typedef 
          std::pair<std::string, boost::optional<std::string> > 
      pair_type;
      

      and you need to (slightly) change the grammar:

      pair  =  key >> -('=' >> -value);

      (note the additional unary ‘-’).

      Both versions are in Boost SVN here and here.

      HTH
      Regards Hartmut

  2. Daniel James says:

    That’s very helpful, I’m trying to do something like that. Is there any way to automatically check for repeated keys?

    • Hartmut Kaiser says:

      Daniel,

      The solution I outlined in my response to Bernard above which gives you the full list of parsed items seems to answer your question as well. This certainly requires a second pass over the parsed data to eliminate/diagnose duplicate entries.

      Another approach would be to use some custom semantic action checking whether the current key is duplicate, while handling errors appropriately:

      void f (pairs_type& c, pair_type const& p) 
      {
          // check for duplicates and throw on error
          c.insert(c.end(), p); 
      }
      
      using boost::phoenix::bind;
      using boost::spirit::_val;
      using boost::spirit::_1;
      query = pair[bind(f, _val, _1)] 
          >> *((qi::lit(';') | '&') >> pair[bind(f, _val, _1)]);
      

      Here _val and _1 are predefined placeholders referring to the attribute of the left hand side (query) and to the attribute of the parser component (pair) respectively.

      HTH
      Regards Hartmut

  3. Marshall says:

    I’ve been playing with this example, and ran into a problem. I have data that separates the keys with a ‘;’, with a final ‘;’ at the end.

    I wrote the grammar like this:

       pairList     =  +(header) >> qi::lit (';');
       pair         = token >> ':' >> value >> qi::lit (';');
    

    and it worked fine. But when I defined a rule for the separator, the compiler barfed all over it (pages and pages of errors)

       pairList     =  +(header) >> sep;
       pair         = token >> ':' >> value >> sep;
       sep          = qi::lit (';');
    

    This seems to me to be a simple transformation of the grammar, yet one works and the other does not. What’s up with that?

  4. Marshall says:

    Please replace “header” with “pair” in my comment above. Sorry.

    • Hartmut Kaiser says:

      Marshall,

      What attribute type has your new rule ‘sep’? If it exposed no attribute at all the code should actually compile, but if it exposed a std::string things would break because the attribute of ‘pair’ would not match the overall attribute of its right hand side anymore.

      Regards Hartmut

  5. Marshall says:

    Thanks, Hartmut – that was it.

    I changed:

        qi::rule<Iterator, std::string ()> sep;
    

    to

        qi::rule<Iterator> sep;
    

    And it worked!
    Now for the next problem. In my real data, the values can be “continued” across a separator (when it is followed by a whitespace). I want to “eat” the separator + whitespace and add the following data into the value.

    I tried this:

            valuePart = +qi::char_("a-zA-Z_0-9\\./()");
            value     = valuePart >> *( sep >> ( qi::char_(" ") | '\t' ) >> valuePart[ _val += _1 ] );
    

    but that doesn’t compile :-(
    it complains about a missing “operator +=”.
    (Strictly speaking, the semantic action should be _val += std::string ( ” ” ) + _1, but I thought that I would get the basic stuff working first )

    • Hartmut Kaiser says:

      Marshall,

      I think the problem is the char_(‘ ‘) which exposes ‘char’ as its attribute, messing up things. Try using lit(‘ ‘) instead (matches a space as well, but doesn’t expose any attribute).

      Regards Hartmut

  6. Thomas Heller says:

    I just wanted to try that example. However it doesn’t compile if you just include

    #include <boost/spirit/include/qi.hpp>.
    

    You will need to make Boost.Fusion aware of std::pair. To do this you have to add the following include directive:

    #include <boost/fusion/include/std_pair.hpp>
    
  7. Gabor Balazsfalvi says:

    Hi,

    This sample works fine with boost 1.42. Unfortunately I have limited boost access at the place I need this, the newest boost release is 1.36. There I cannot compile (or even rewrite) it because the compiler complains that map doesn’t have push_back member. I tried to write my own action, but I failed to do it. Can you please point out how to solve this problem in boost 1.36?
    Thanks,
    Gabor

  8. Marcos says:

    Hi,

    I’ve been trying to replace a buggy parser in some code I’ve inherited with something and trying to use something more efficient. However, I’ve actually been finding Spirit rather difficult to debug. This example, along with the boost documentation’s XML parser have been my main reference points.

    If I wanted to make a valid value be a new key value pair with braces. For example: k1=v1|k2=v2 and k1={k2=v2} and {k1=v1} should all be valid.

    This is what is the modifications I made to the example code:

                query =  pair >> *('|' >> pair);
                pair  %=  key >> '=' >> -(value) | '{' >> pair >> '}';
                key   =  qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");
                value = +qi::char_("a-zA-Z_0-9") | '{' >> query >> '}';
    

    This produces pages and pages of error messages that I’ve been unable to decipher. What am I missing, or what documentation section should I be reading to better understand what is going on?

    Thanks,

    Marcos

  9. Glen Staring says:

    Thanks for your excellent work.

    I’m trying to put together some of the features you have discussed, but I’m runnign into difficulty. Specifically, I am having a hard time constructing a pair consisting of a string key and a synthesized attribute for the value.

    Consider the grammar I make in the code I quote below ( the struct block_grammar). I gather g++ (4.3.4) fails at the line:
    “pair = key >> value”, with the errnor:

    “error: no matching function for call to ‘std::pair<std::basic_string<char, std::char_traits, std::allocator >, TinWriter>::pair(const std::basic_string<char, std::char_traits, std::allocator >&)”.

    This looks to me like it’s trying to construct a pair with only the string, and is not passing an instance of the TinWriter struct, which should be the synthesized attribute of the ‘value’ rule.

    I would appreciate any light you can share on the subject.

    Best regards. Code to reproduce the error follows:

    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    namespace qi = boost::spirit::qi;
    namespace ascii = boost::spirit::ascii;

    enum CastTypes{
    STRING,
    INT,
    FLOAT,
    VECTOR_OF_INTS
    };

    std::ostream& operator<< (std::ostream& lhs, const CastTypes ct){
    switch (ct){
    case (STRING): lhs<<"STRING";return lhs; break;
    case (INT): lhs<<"INT";return lhs; break;
    case (FLOAT): lhs<<"FLOAT";return lhs; break;
    case (VECTOR_OF_INTS): lhs<<"VECTOR_OF_INTS";return lhs; break;
    default: lhs<<"UNKNOWN CastTypes";break;
    }
    return lhs;
    }

    struct castings_ : qi::symbols{
    castings_()
    {
    add
    (“INT”, INT)
    (“STRING”, STRING)
    (“FLOAT”, FLOAT)
    (“VECTOR_OF_INTS”, FLOAT)
    ;
    }
    } castings;

    struct mappings_ : qi::symbols{
    mappings_()
    {
    add
    (“FI_FID”, 1)
    (“FI_VAL”, 5)
    (“FI_FIELD_NAME”, 15)
    (“FI_GMS_MAPPING”, 10065)
    (“FI_GMS_AGENCY”, 10066)
    (“FI_GMS_SRCTYP”, 10067)
    (“FI_GMS_KEYNAME”, 10068)
    (“FI_GMS_TKCODEA”, 10069)
    (“FI_GMS_VAL1″, 10070)
    ;
    }
    } mappings;

    struct TinWriter
    {
    CastTypes m_castType;
    unsigned m_fid;
    };

    std::ostream& operator<< (std::ostream& lhs, const TinWriter rhs){
    lhs << rhs.m_castType << "," <1){
    confFilename = argv[1];
    }
    else {
    confFilename = “test.conf”;
    }
    std::cout << " Parsing out comments from file " << confFilename <<std::endl;

    /// Load file into a single string:
    std::ifstream fStream;
    fStream.open(confFilename.c_str());
    if(!fStream.good()){
    std::cerr << "couldn't open file" <<std::endl;
    exit(-1);
    }
    std::stringstream sStream;
    while(!fStream.eof()) {
    sStream.put(fStream.get());
    }
    return sStream.str();
    }

    typedef std::vector<std::pair > ResultType;
    //——————————————–
    template
    struct block_grammar : qi::grammar
    {
    block_grammar() : block_grammar::base_type(start)
    {
    value = castings >> mappings;
    key = qi::char_(“a-zA-Z_”) >> *qi::char_(“a-zA-Z0-9_”);
    pair = key >> value;
    start = pair % qi::eol ;
    }
    qi::rule key;
    qi::rule value;
    qi::rule<Iterator, std::pair(), SkipperType> pair;
    qi::rule start;
    };

    int main (int argc, const char* argv[]) {
    std::string input = read_file(argc, argv);
    std::cout << input <<std::endl;

    IterType first = input.begin();
    IterType last = input.end();

    ResultType res;
    typedef ascii::blank_type SkipType;
    SkipType skipper;
    block_grammar g;
    bool r;
    r = qi::phrase_parse(first, last, g, skipper, res);
    std::cout << "parsing " << (r?"passed":"failed")<<std::endl;
    std::cout<<res.size()<<std::endl;

    ResultType::const_iterator siter = res.begin();
    while(siter!= res.end()){
    std::cout<first<<std::endl;
    ++siter;
    }
    return r;
    }

  10. Adrian Miller says:

    I am able to successfully place these key_value_combinations in a rule.

    const qi::rule key_value_combination = key >> ‘=’ >> value;
    const qi::rule key_value_combinations = key_value_combination >> qi::repeat['|' >> key_value_combination];
    const qi::rule dictionary = qi::lit(‘{‘) >> key_value_combinations >> ‘}’;

    I want to be able to do the following:
    const qi::rule dictionaries = dictionary >> qi:repeat['|' >> dictionary];

    but it doesn’t work…what am I messing? I assume that the problem is that qi doesn’t know how to insert KeyValueMap into a vector….is this the case? If so how do I fix it? fusion?

    • Adrian Miller says:

      for some reason the attributes didn’t get added.

      dictionary has KeyValueMap attribute and I want dictionaries to have VectorOfKeyValueMaps, where KeyValueMap is a map and VectorOfkeyValueMaps is vector.

  11. Thomas Bretgeld says:

    Hi,

    first of all I would like to say thanks for this example. It works great and helped me a lot. However I am a little confused about why it why it works – especially why the attributes of the rhs side are automatically assigned to the lhs. I thought it would have to use the “%=” operator to do this automatically, but here it somehow magically happens. Why?

    To clarify a little further, this is what I would have expected it to be:

    query %= pair >> *((qi::lit(‘;’) | ‘&’) >> pair);
    pair %= key >> -(‘=’ >> value);
    key %= qi::char_(“a-zA-Z_”) >> *qi::char_(“a-zA-Z_0-9″);
    value %= +qi::char_(“a-zA-Z_0-9″);

    Did I miss something in the documentation?

  12. Jurek says:

    Thank you for this example.

    I have modified it to parse lines of key-value pairs. I simply added ‘>> qi::eol’ to the query rule, and created a rule ‘start = +query’. I use std::vector< std::map > to store the parsed data. Is it possible to make a parser that put these pairs into `std::vector< boost::shared_ptr<std::map >`?

  13. Łukasz says:

    What if I wanted the container not be a map or vector, but MyContainer<pair<string, string>>? What interface should it expose or what traits do I need to specialize?

Leave a Reply

To use RetinaPost you must register at http://www.RetinaPost.com/register