Nov 13

Parsing Escaped String Input Using Spirit.Qi

By Jeroen Habraken Add comments

In a previous article you’ve learned how to generated escaped strings, today we’re going to do the reverse. We’re going to describe a Qi grammar you can use to parse quoted strings in which special characters are escaped.

The purpose of the unescaped_string grammar is to removed the enclosing quotes and un-escape the characters which were previously escaped (i.e. convert “\\n” back to ‘\n’ again). Unsurprisingly, the Parsing Expression Grammar (PEG) is quite like the one in the Karma article:

unesc_str = '"' (unesc_char / . / "\\x" hex)* '"'
unesc_char = &"\\a" '\a' / &"\\b" '\b' / &"\\f" '\f' /
             &"\\n" '\n' / &"\\r" '\r' / &"\\t" '\t' /
             &"\\v" '\v' / &"\\\\" '\\' /
             &"\\\'" '\'' / &"\\\"" '"'

An escaped strings starts and ends with the quoting character (in this case ‘”‘) and all characters of the sequence are read either as an escaped character (unesc_char), a printable character or a “\\x” followed by the hexadecimal representation of a character code. The unesc_char will handle any of the listed character codes by parsing a backslash followed by the corresponding C-style encoding. Like in the Karma example, each of the listed alternatives (such as &”\\a” ‘\a’) reads as: if the input being parsed starts with “\\a”, return ‘\a’.

Now let’s convert this to a Spirit grammar:

unesc_str = '"' >> *(unesc_char | qi::alnum | "\\x" >> qi::hex) >> '"';
unesc_char.add("\\a", '\a')("\\b", '\b')("\\f", '\f')("\\n", '\n')
              ("\\r", '\r')("\\t", '\t')("\\v", '\v')("\\\\", '\\')
              ("\\\'", '\'')("\\\"", '\"');

Just like in the Karma example we’re using predefined facilities, but in this case Qi facilities. Where we were previously using karma::symbols<> to map special characters to their C-style representation, we’re doing the reverse now: mapping C-style representations to the special characters using qi::symbols<>. This too will conveniently fail if the representation is not in the symbol table. The next parser that will be tried is qi::alnum, which successfully accepts an alphanumeric character and fails on all other input. At this point you might be wondering why I’m not using qi::print, like we were using karma::print, I’ll come back to this later. Last but not least we’ll try converting a hexadecimal character representation to the actual character code if it’s prefixed with a “\\x”.

Generally, what is true for Karma generators holds for their Qi parser counterparts. They too have the ability to fail parsing if their preconditions are not met, allowing them to be used in alternatives just like the Karma generators. When one fails, the next alternative is tried until we’re out of alternatives and have to fail overall. The next alternative here is qi::alnum, which as we said accepts an alphanumeric character and fails otherwise.

Since we now have the grammar, the next step is to figure out which attributes to use for the rules. The attributes in Qi are the types and values we gat as the result of converting the input, thus it’s quite straightforward:

qi::rule<InputIterator, std::string()> unesc_str;
qi::symbols<char const, char const> unesc_char;

Note that we’re assuming narrow character representation here, for wide character representation one would have used std::wstring and wchar_t const.

Akin to the Karma example, we use the Qi counterpart qi::rule<> as the non-terminal for storing the parsed data for the unesc_str, for which we again assume std::string to be its attribute. Again two of the parser alternatives inside the Kleene Star expose a single character as their attribute, the other (“\\x” >> qi::hex) exposing an unsigned int which can be implicitly casted to a char. If we strictly followed the rules we’ll see the rule actually exposes a vector<char>, but this is compatible with a std::string. Secondly the C-style representations with their character matches are stored in a qi::symbols<> instance. Now you might wonder why it’s a qi::symbols<char const, char const> and not qi::symbols<char const *, char const> since the Karma example was using karma<char const, char const *> the first parameter is the character type of the strings stored, not the type of the strings.

That’s all there’s to it!

Since the Karma example allows you to specify the quoting character, we’re going to do so too, and I’ll use this opportunity to show just how similar Karma and Qi really are. Now here’s the Karma code presented in the previous article:

karma::rule<OutputIterator, std::string(char const*)> esc_str =
        karma::lit(karma::_r1)
    << *(esc_char | karma::print | "\\x" << karma::hex)
    <<  karma::lit(karma::_r1);

Since we’re parsing data instead of generating it, we should use >> instead of <<, and obviously we’re using Qi thus we need to use that namespace, which gives us:

qi::rule<InputIterator, std::string(char const*)> unesc_str =
        qi::lit(qi::_r1)
    >> *(unesc_char | qi::alnum | "\\x" >> qi::hex)
    >>  qi::lit(qi::_r1);

Easy huh?

This leads us back to the reason we’re using qi::alnum instead of qi::print, there are two reasons:

  1. The alternatives are are tried in the order they are in, thus if we had *(unesc_char | qi::print | “\\x” >> qi::hex) the “\\x” >> qi::hex option would never be tried since “\\x” and hexadecimal characters are printable, and thus parsed using qi::print. Simply re-arranging the order solves this problem, thus we could use *(unesc_char | “\\x” >> qi::hex | qi::print).
  2. The quoted character used in the example is “”'”, which means we’re actually parsing as follows
unesc_str = qi::lit("'''")
    >> *(unesc_char | "\\x" >> qi::hex | qi::print)
    >>  qi::lit("'''");

but the Kleene Star is greedy (an so is the unary +, which means “one or more”), thus it would parse the closing “”'” too! The solution here is not to use qi::print, but (qi::print – “‘”) which means “all printable characters, except “‘”.

Putting this all together and wrapping it in a qi::grammar we get:

template <typename InputIterator>
struct unescaped_string
  : qi::grammar<InputIterator, std::string(char const*)>
{
    unescaped_string()
      : unescaped_string::base_type(unesc_str)
    {
        unesc_char.add("\\a", '\a')("\\b", '\b')("\\f", '\f')("\\n", '\n')
                      ("\\r", '\r')("\\t", '\t')("\\v", '\v')
                      ("\\\\", '\\')("\\\'", '\'')("\\\"", '\"')
        ;

        unesc_str = qi::lit(qi::_r1)
            >> *(unesc_char | qi::alnum | "\\x" >> qi::hex)
            >>  qi::lit(qi::_r1)
        ;
    }

    qi::rule<InputIterator, std::string(char const*)> unesc_str;
    qi::symbols<char const, char const> unesc_char;
};

which can be called as such:

typedef std::string::const_iterator iterator_type;

std::string parsed;
std::string str("'''string to unescape:\\x20\\n\\r\\t\\\"\\'\\x41'''");
char const* quote = "'''";
iterator_type iter = str.begin();
iterator_type end = str.end();
client::unescaped_string<iterator_type> p;

qi::parse(iter, end, p(quote), parsed);

    // this will result in: parsed == "string to unescape: \n\r\t\"\'A"

Which uses qi::parse, one of Qi its main API functions, similar to karma::generate for generation. It takes the beginning and the end of the string to parse, an instance of the parser (p) and an attribute instance to which the parsed data is saved (parsed). We pass the quoting character sequence (the “”'” in quote) as an inherited attribute when invoking the grammar. When successful qi::parse returns true, while it returns false otherwise and one can compare iter to end to check if the complete input string has been parsed.

This leaves one minor enhancement, one can use parsed.reserve(str.length()) to reserve enough space in parsed to fit the un-escaped string, which encompasses a small speed enhancement.

If you want to try out this example for yourself, the complete source code is available from the Boost SVN here.

8 Responses to “Parsing Escaped String Input Using Spirit.Qi”

  1. Jamboree says:

    Good, as I saw the Karma example before, I considered the reverse would be done in Qi.

    But making it deal with custom quoting character sequence may involve more things than what is depicted here, and this doesn’t parse with punctuations, space…

    BTW, I use g++3.4.5 and have problem with get_begin, get_end ambiguity in , so I just commented the lines between 203-216 (in file “boost/spirit/home/support/string_traits.hpp”).

    • Jamboree says:

      Instead of using qi::alnum to avoid the problem,
      how about:

          unesc_str = qi::lit(qi::_r1)
              >> *(unesc_char | "\\x" >> qi::hex | (qi::print - qi::lit(qi::_r1)))
              >>  qi::lit(qi::_r1)
              ;
      
      • Jeroen Habraken says:

        This is indeed a valid solution, there is an even better one though. In essence you have:

        qi::lit(_r1) >> "A bunch of characters" >> qi::lit(_r1) >> qi::eoi // End Of Input

        Thus if you want to accept any printable character after having handled the escaped characters (both C-style and hexadecimal) you really want to say, "Any printable character, except the string qi::lit(_r1) followed by qi::eoi". This can be done using look-ahead, as follows:

        qi::lit(qi::_r1) >> *(unesc_char | "\\x" >> qi::hex | (!(qi::lit(qi::_r1) >> qi::eoi) >> qi::print)) >> qi::lit(qi::_r1)

        This allows you to use "'" in your string as you normally would (no need to escape it), and still allows the closing quote to be "'''".

        • Jamboree says:

          But that’s not what we normally want.
          In most cases, we still need the escape character as we usually do not expect nothing will follow the string.

    • Jeroen Habraken says:

      You’re right, it does however show the general approach. Exactly which characters are escaped, which are represented in a hexadecimaal format and which can be parsed literally depend on what you’re trying to achieve, thus the solution will have to be adapted to your use-case.

      To be honest, the custom quoting was added as an afterthought because the Karma example has it, and I wanted to show it’s done in pretty much the same way in Qi as in Karma.

  2. Moritz Hassert says:

    I have adopted your example for my own string parser, but qi::lit() behaves strange: it does not suppress the quoting characters but includes them in the parser result string. (Note that I’m using “>” instead of “>>”)

    const char Q1 = '"'; //double quotes
    const char Q2 = '\''; //single quotes
    // ...
    my_string  =  unesc_str(Q1)  |  unesc_str(Q2);
    unesc_str = qi::lit(qi::_r1) > *(my_characters - qi::lit(_r1) ) >  qi::lit(qi::_r1)
    my_characters = ... //allowed characters
    

    Using char literals directly I get a result string without quotes as expected:

    unesc_str = '"' > *(my_characters - '"' ) >  '"'
    

    Any ideas why this is happening? I’m using the version of spirit that is included in boost 1.45.

    • Sven says:

      I am having the same problem, with a much simpler example here … but it seems to be the same I will call it BUG or misunderstanding:

      quotedQString %= lexeme[lit(_r1) >> +(char_ – lit(_r1)) >> lit(_r1)];

      expressionRegEx = (valueGetter >> “=~” >> quotedQString(‘/’)) ….

      qi::rule quotedQString;

      Exchanging lit(_r1) with a ‘”‘ will do the job as expected.

Leave a Reply

preload preload preload