Jan 13

While looking through the mailing list archives I realized that small issues often get in our way, and that small snippets of information can make the difference. So I decided to start a (more or less regular) series of small tips to help you get your work done with Spirit. Even though I sloppily call this series ‘Tip of the Day’, I by no means plan to post a tip every day.

This time I’m going to highlight the difference between three different ways to parse or generate a single character: ‘a’, lit(‘a’), and char_(‘a’).

The first two forms, ‘a’ and lit(‘a’), are semantically equivalent. Both create a component (a parser in Qi and a generator in Karma) handling the specified literal character, in this case the ‘a’. Because ‘a’ and lit(‘a’) expose unused_type as their attribute, they can be used to match or generate a single ‘a’ without exposing or consuming an attribute value (remember, unused_type is Spirit’s fancy way of saying ‘I don’t care’). This property makes the literal components very useful for matching and emitting characters that do not contribute to the semantics, such as comments in parsed source code, commas in the list of arguments to a function, etc.

You might ask: ‘Why do you have two forms of expressing literals, then?’ The answer is readability. The first form (‘a’) is simpler, takes fewer keystrokes to write, and clearly expresses the intent to handle this character. So it is the preferred way of doing things. Unfortunately, this isn’t syntactically possible all the time. Spirit implements its domain specific embedded languages (DSELs) by directly utilizing the C++ language. This defines the constraints we have to work with. Let us have a look at the following example, where we want to match two characters in a row:

'a' >> 'b'

Because both operands are built-in types, the compiler picks the built-in shift operator: this expression simply right shifts the bit pattern representing the character ‘a’ by the number of bits defined by the bit pattern representing the character ‘b’. That is clearly not what we want! We need a way to tell the compiler that the expression has to be interpreted as a Qi parser. Spirit solves this dilemma by ‘tainting’ at least one of the terms:

boost::spirit::lit('a') >> 'b'

Now we have a valid parser expression as the compiler is forced to invoke the proper overload of the right-shift operator as exposed by Spirit.

The third form, char_(‘a’), distinguishes itself from the former ones by exposing the character type of its literal as its attribute. Used as a Qi parser, this expression exposes the ‘a’ as its attribute value if it matched an ‘a’ in the input. Used as a Karma generator, it succeeds in generating an ‘a’ if either no attribute is supplied or the attribute value is ‘a’, and fails otherwise.

4 Responses to “What’s the difference between the Spirit components ‘a’, lit(‘a’), and char_(‘a’)?”

  1. Stuart Dootson says:

    Thanks for writing that, Hartmut – it certainly caused a small ‘ah, of course!’ moment for me 🙂 I hadn’t realised that a bare character (effectively) has no attribute before, so isn’t equivalent to a character within a char_().

  2. Hartmut Kaiser says:

    Stuart,

    Glad it helped! That’s exactly what I intended to achieve: to write about small bits of apparently not so obvious information necessary for understanding Spirit.

    Regards Hartmut

  3. S.J.W. says:

    I wrote a json spirit with Qi and Phoenix, it performs really well.
    But I still have a question about charset namespaces,
    for example:
    it seems as if lit(“null”) can match both “null” in ascii and L”null” in unicode, and ‘a’ or lit(‘a’) the same way. I wondered why lit is not located in different namespaces ascii or standard_wide, and why it can match strings in different charsets?
    thanks.

    • Hartmut Kaiser says:

      S.J.W.,

      The lit() component and plain character/string literals are character set agnostic, as they represent just the bit pattern of their literal argument. No ambiguity is possible. At the same time, char_ et al. potentially encapsulate character set specific functionality, such as char_(“a-z”) or isalpha. Surely, not all of the char_ components are character set specific. But we thought it might be confusing to have two different char_’s: one representing a given character set and one being character set agnostic. So we went for the character set specific char_ only.

      Regards Hartmut
