Nov 27

Generate Escaped String Output Using Spirit.Karma

By Hartmut Kaiser

This is another article in the series of “How To’s” providing you with shrink wrapped grammars directly usable in any project. I’m going to describe a Karma grammar you can use to generate output for quoted strings, where all contained special characters are properly escaped.

The previous installment of this series (see here) already gives a high-level overview of Karma, which allows us to skip the introductions and start right away. The purpose of the escaped_string grammar is to generate output for any given character sequence while enclosing it in quotes (i.e. ‘"’ or ‘'’) and making sure all relevant characters get escaped by prepending a backslash (i.e. ‘\n’ will be generated as “\\n”, etc.). Let’s start with the reverse Parsing Expression Grammar (PEG) for this:

esc_str → '"' (esc_char / . / "\\x" hex)* '"'
esc_char → &'\a' "\\a" / &'\b' "\\b" / &'\f' "\\f" /
           &'\n' "\\n" / &'\r' "\\r" / &'\t' "\\t" /
           &'\v' "\\v" / &'\\' "\\\\" /
           &'\'' "\\\'" / &'"' "\\\""

Any escaped string (esc_str) starts and ends with the quoting character (in this case ‘”’) and all characters of the sequence are printed either as an escaped character (esc_char) or a printable character or a “\\x” followed by the hexadecimal representation of the corresponding character code. The esc_char will handle any of the listed character codes by generating a backslash followed by the corresponding C-style encoding. Each of the listed alternatives (such as &‘\a’ “\\a”) reads as: if the character has the code ‘\a’ print it as “\\a” (a backslash followed by ‘a’).
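For comparison, here is a minimal hand-written sketch of the same decision order in plain C++, without Spirit. The function name escape_reference is hypothetical, and the unpadded lowercase hex output is an assumption chosen to mirror the \x19 result shown later in this article. It is only meant to make the three alternatives (escape table, printable character, hex fallback) concrete.

```cpp
#include <cctype>
#include <string>

// Hypothetical reference helper (not part of Spirit): follows the reverse PEG
// by trying the escape table first, then printable characters, then hex.
std::string escape_reference(std::string const& input)
{
    static char const hex_digits[] = "0123456789abcdef";

    std::string out = "\"";                       // opening quote
    for (std::string::size_type i = 0; i != input.size(); ++i) {
        unsigned char const c = static_cast<unsigned char>(input[i]);
        switch (c) {
        // esc_char: backslash plus the C-style encoding
        case '\a': out += "\\a";  break;
        case '\b': out += "\\b";  break;
        case '\f': out += "\\f";  break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '\t': out += "\\t";  break;
        case '\v': out += "\\v";  break;
        case '\\': out += "\\\\"; break;
        case '\'': out += "\\'";  break;
        case '"':  out += "\\\""; break;
        default:
            if (std::isprint(c)) {
                out += static_cast<char>(c);      // printable: emit as is
            } else {
                out += "\\x";                     // fallback: hex code
                if (c >= 16)                      // assumption: no zero padding
                    out += hex_digits[c >> 4];
                out += hex_digits[c & 0x0f];
            }
        }
    }
    out += '"';                                   // closing quote
    return out;
}
```

Calling escape_reference("a\nb") yields the six characters "a\nb" (including the quotes), with the newline replaced by a backslash followed by ‘n’.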

We have seen this before: converting the PEG into a Spirit grammar is a simple and formal step (we discussed that briefly here):

esc_str = '"' << *(esc_char | karma::print | "\\x" << karma::hex) << '"';
esc_char.add('\a', "\\a")('\b', "\\b")('\f', "\\f")('\n', "\\n")
            ('\r', "\\r")('\t', "\\t")('\v', "\\v")('\\', "\\\\")
            ('\'', "\\\'")('"', "\\\"");

Note we use predefined Karma facilities here, replacing a straight reverse PEG translation. We could have defined the rule esc_char in a very similar way as done in the reverse PEG notation above, but decided to use a predefined Karma component, the karma::symbols<>, instead. This gives us a nice way of mapping the special characters to their C-style representation. Conveniently, the symbols generator fails to generate anything if the supplied character is not contained in the symbols table. The karma::print is a primitive generator which succeeds generating output for characters satisfying std::isprint() while failing for all others. The karma::hex is a predefined numeric generator emitting its attribute in hexadecimal integer representation. 

Generally, Karma generators have the ability to fail generating if some of their preconditions are not met. This allows them to be used in alternatives (as shown above): if the current generator does not handle the output, the next alternative is tried. In the example above, the symbols generator gets the first chance to handle each character. If the character is not in the list of those handled by the symbols generator, the next alternative (karma::print) is tried. If the character does not satisfy std::isprint() (causing the karma::print primitive to fail), we use the last alternative as a catch-all fallback and emit the character’s hexadecimal representation.

The overall rule esc_str reads as: ‘generate any supplied string embedded inside quotes (‘”’), convert all special characters into a sequence of a backslash and the corresponding C-style encoding and represent all non-printable characters in C-style hexadecimal character encoding’.

Now that we have the grammar, the next step is to figure out the proper attribute types for the rules. As written before, attributes in Qi are the types and values we get as the result of converting the input. Attributes in Karma are the types and values we want to generate output from. For our example the choice is straightforward and already implied by the goal. If we assume narrow character representation we get:

karma::rule<OutputIterator, std::string()> esc_str;
karma::symbols<char, char const*> esc_char;

We use karma::rule<> as the non-terminal for storing the output format for the esc_str. As this is the top level rule we assume std::string to be its attribute. As alluded to earlier, esc_char is a karma::symbols<> instance using one character in the string as the key while storing the C-style representation of that character as its value. Two of the generator alternatives inside the Kleene Star expose a single char as their attribute. The third alternative, the (“\\x” << karma::hex) exposes an int as its attribute, but this is perfectly compatible with a single character. This allows the std::string to be naturally split into single characters, one at a time.

Essentially we are done. That was not too hard, was it?

But before we write the (reusable) grammar and demonstrate how it needs to be called, I thought I would make things a bit more interesting. So far our output format is bound to generate strings quoted using the double quote character (‘”’). Wouldn’t it be nice to be able to customize the quoting character as well? This is interesting, for instance, if you need to generate Python style strings, which come in 3 flavors: quoted with ‘”’, “’”, or “’’’”. So I decided to introduce another feature provided by Spirit’s non-terminals: inherited attributes.

Non-terminals in recursive descent parsers and generators can be seen as being very similar to functions. Parsers return a value, their synthesized attribute, while generators require a special consumed attribute. Both optionally may take arguments, their inherited attributes. Spirit uses the function declaration syntax in order to specify all attributes in a very compact form: in std::string(char const*), std::string is the consumed attribute and char const* is the inherited attribute. So we make the grammar customizable by ‘passing’ the quoting character sequence as an inherited attribute. We modify the rule esc_str to expect a single inherited attribute (the quoting sequence) as a plain char const*, and change the rule definition to use this attribute:

karma::rule<OutputIterator, std::string(char const*)> esc_str =
        karma::lit(karma::_r1)
    << *(esc_char | karma::print | "\\x" << karma::hex)
    <<  karma::lit(karma::_r1);

The predefined primitive karma::lit emits its argument as a literal and karma::_r1 is a special predefined placeholder expression referring to the first inherited attribute of the rule on the left hand side of the expression.

Now that we have all the required pieces in place, I’ll show you how to wrap everything into a karma::grammar to make it reusable:

template <typename OutputIterator>
struct escaped_string
  : karma::grammar<OutputIterator, std::string(char const*)>
{
    escaped_string()
      : escaped_string::base_type(esc_str)
    {
        esc_char.add('\a', "\\a")('\b', "\\b")('\f', "\\f")('\n', "\\n")
                    ('\r', "\\r")('\t', "\\t")('\v', "\\v")('\\', "\\\\")
                    ('\'', "\\\'")('"', "\\\"")
            ;
        esc_str =   karma::lit(karma::_r1)
                << *(esc_char | karma::print | "\\x" << karma::hex)
                <<  karma::lit(karma::_r1)
            ;
    }
    karma::rule<OutputIterator, std::string(char const*)> esc_str;
    karma::symbols<char, char const*> esc_char;
};

Deriving from Karma’s grammar type turns escaped_string into a generator. Its member rules define a grammar that emits quoted strings using arbitrary quoting characters. The base class constructor gets passed the esc_str rule, which is the topmost rule of the grammar and the one executed when the grammar is invoked. The template parameter of escaped_string allows this grammar to be used with arbitrary output iterator types.

The last missing code piece shows how to invoke the newly created generator.

typedef std::back_insert_iterator<std::string> sink_type;

std::string generated;
sink_type sink(generated);

std::string str("string to escape: \n\r\t\"'\x19");
char const* quote = "'''";

client::escaped_string<sink_type> g;
karma::generate(sink, g(quote), str);
    // this will emit: '''string to escape: \n\r\t\"\'\x19'''

The function karma::generate() is another of Spirit’s main API functions. In the simplest case it takes an output iterator representing the output target (sink), an instance of the generator to invoke (g), and the attribute instance holding the data (str). We pass the quoting character sequence (here “'''”) as an inherited attribute while invoking the grammar. The function executes the actual generator operation and returns true if it was successful.

If you want to try out this example for yourself, the complete source code is available from the Boost SVN here. In the future this example will be distributed as part of the Spirit distribution, but for now it lives in the SVN only. Additionally, the karma::symbols<> generator is not yet part of the latest released code (Spirit V2.1). You either need to check out the current version of Spirit from the SVN or download the related file here. In the latter case you need to explicitly include it into the example in order to be able to compile the code.

8 Responses to “Generate Escaped String Output Using Spirit.Karma”

  1. Eddward says:

    The example looks good and the library looks great. I just have a minor nit with the example. What if the quote string is not ', " or '''? That may be too complex to want to deal with for this example though. Say:

    ...
    char const* quote = "/";
    ...
    
    // want /^\/(foo|bar)\/(.*)$/ but I think will produce  /^/(foo|bar)/(.*)$/
    std::string str("^/(foo|bar)/(.*)$");  
    

    Edd

    (PS: Sorry if this comes out bad. There’s no preview.)

    • Hartmut Kaiser says:

      Edd,

      yes, you’re right. This example works well for the mentioned quoting characters only. I didn’t want to complicate matters with modifying the symbols generator depending on the supplied quoting character. Especially as I wanted to highlight inherited attributes, which would not work anymore for this.

      An alternative implementation could take the quoting character as a constructor parameter for the grammar (instead of as the inherited attribute), allowing to modify the symbols generator at initialization time. I.e. something like:

      template <typename OutputIterator>
      struct escaped_string
        : karma::grammar<OutputIterator, std::string()>
      {
          escaped_string(char quote)
            : escaped_string::base_type(esc_str), quoted("\\")
          {
              quoted += quote;
              esc_char.add
                      ('\a', "\\a")('\b', "\\b")('\f', "\\f")('\n', "\\n")
                      ('\r', "\\r")('\t', "\\t")('\v', "\\v")
                      ('\'', "\\\'")('"', "\\\"")
                      (quote, quoted.c_str())
                  ;
              esc_str =   karma::lit(quote)
                      << *(esc_char | karma::print | "\\x" << karma::hex)
                      <<  karma::lit(quote)
                  ;
          }
          karma::rule<OutputIterator, std::string()> esc_str;
          karma::symbols<char, char const*> esc_char;
          std::string quoted;
      };
      

      Regards Hartmut

  2. Ben Voigt says:

    Are these examples supposed to be representative of the escape characters used in C and C++ string literals? You need to protect the escape character (backslash) itself!

    Otherwise you’d get the same result from the strings “\n” and “\\n” — definitely not good.

  3. michael shiplett says:

    is there a way to conditionally quote/escape output, i.e., not quote it if there are no special characters? adding rules such as

        unesc_str = +karma::alnum;
        str = 
            unesc_str
            | esc_str
            ;
    

    only seems to eliminate non-alphanumeric characters from the output.

    • Hartmut Kaiser says:

      Michael,

      it doesn’t work as you suggested because the plus operator will always succeed as long as at least one character is an alnum. You will need to rewrite your rule as:

      str = +(
               karma::alnum 
          |    esc_char 
          |    karma::print 
          |    "\\x" << karma::hex
          );
      

      This will try to output a character as an alnum first, and when that fails (because it’s not an alphanumeric character), it will invoke the other alternatives.

      Regards Hartmut

      • michael shiplett says:

        thank you. that does indeed escape the characters and explain the behaviour i’m seeing with the plus operator (though i’m surprised it continues generating from a string once it encounters a non-alphanum).

        i’m trying to quote/escape strings only when necessary, e.g., given a sequence of strings, quote (and internally escape) only those strings which are not entirely alphanumeric:
        atom1 atom2 "atom 3" "atom \"4\""
        instead of
        "atom1" "atom2" "atom 3" "atom \"4\""

        it may be i’m trying to force the switching logic into karma while karma expects it to be done before the generation.
