Mar 05

Tracking the Input Position While Parsing

By Peter Schüller Add comments

The following example is about tracking the parsing position with Spirit V2. This is useful for generating error messages which tell the user exactly where an error has occurred. We also show how to use Spirit V2 to parse from an input stream without first reading the whole stream into a std::string.

In our example, we want to parse a list of real numbers and return them in a std::vector and throw an exception if an error occurs. We will start with a naive way of achieving our goal and then improve the program in two steps. This naive way could be the following small program. It provides basic and not very helpful error handling and scales bad for large inputs:

#include <vector>
#include <istream>
#include <sstream>
#include <iostream>
#include <boost/include/spirit/qi.hpp>

namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;

// parse list of doubles from input stream
// throw exception (perhaps including filename) on errors
std::vector<double>
  parse(std::istream& input, const std::string& filename);

// main function
int main(int, char**)
{
  try
  {
    parse(std::cin, "STDIN");
  }
  catch(const std::exception& e)
  {
    std::cerr << "Exception: " << e.what() << std::endl;
    return -1;
  }
  return 0;
}

// implementation
std::vector<double>
  parse(std::istream& input, const std::string& filename)
{
  // get input into string
  std::ostringstream buf;
  buf << input.rdbuf();
  std::string str = buf.str();

  // prepare iterators
  std::string::iterator first = str.begin();
  std::string::iterator last = str.end();

  // parse
  std::vector<double> output;
  bool r = qi::phrase_parse(
    // iterators over input
    first, last,
    // recognize list of doubles
    qi::double_ >> *(',' >> qi::double_) >> qi::eoi,
    // comment skipper
    ascii::space | '#' >> *(ascii::char_ - qi::eol) >> qi::eol,
    // store into this object
    output);

  // error detection
  if(!r || first != last)
    throw std::runtime_error("parse error in " + filename);

  // return result
  return output;
}

On the following input, parsing succeeds (and there is no output):

# good input
1.3,2.3, 5

With the following faulty input

# bad input
123,42.0,a,1.4

we get an exception:

Exception: parse error in STDIN

If you want to try out this version of the program you may download it here: naive_parsing.cpp, the two input files are goodinput.txt and badinput.txt.

To make the above scalable for large files, we now change the example to parse from an STL stream, and not buffer everything into a std::string. For that, we need the boost::spirit::multi_pass class which wraps a stream iterator and exposes an iterator usable by Qi parsers. The reason we need this adaptor is, that a std::istream_iterator is an input iterator, and Qi requires a forward iterator which allows certain backtracking on the input.

So we include the following file

#include <boost/include/spirit/support_multi_pass.hpp>

and replace

   // get input into string
   std::ostringstream buf;
   buf << input.rdbuf();
   std::string str = buf.str();

   // prepare iterators
   std::string::iterator first = str.begin();
   std::string::iterator last = str.end();

by

   // iterate over stream input
   typedef std::istreambuf_iterator<char> base_iterator_type;
   base_iterator_type in_begin(input);

   // convert input iterator to forward iterator, usable by spirit parser
   typedef boost::spirit::multi_pass<base_iterator_type> forward_iterator_type;
   forward_iterator_type fwd_begin =
       boost::spirit::make_default_multi_pass(in_begin);
   forward_iterator_type fwd_end;

Now we only need to rename all occurrences of first to fwd_begin and all occurrences of last to fwd_end and we are done! (The complete second program can be downloaded from here: stream_iterator_parsing.cpp.)

This new program has the same behavior as the first one, but it does not need to buffer the whole input.

The final task is to get more useful error messages if a parsing problem occurs. To that end we wrap the iterator yet another time, using the boost::spirit::classic::position_iterator2 class which records line number, column number, and filename of the parsed input.

First we need one of additional include file:

#include <boost/include/spirit/classic_position_iterator.hpp>
namespace classic = boost::spirit::classic;

This file in fact provides several possibilities for position iterators: boost::spirit::classic::position_iterator is a bit more efficient than boost::spirit::classic::position_iterator2, but the latter allows to extract the whole line of input from an iterator in that line (and we will use this feature). Additionally, one can specify a ‘position iterator policy’ to only store line numbers and not column numbers. This increases efficiency but might decrease usefulness of the position information.

Wrapping the forward iterator with the position iterator is not difficult, we create another typedef and initialize the iterators as follows:

   // wrap forward iterator with position iterator, to record the position
   typedef classic::position_iterator2<forward_iterator_type>
     pos_iterator_type;
   pos_iterator_type position_begin(fwd_begin, fwd_end, filename);
   pos_iterator_type position_end;

Hint: make sure there are no brackets after position_end – this often happens and creates compile errors which are difficult to find: with brackets you do not declare a variable of type pos_iterator_type, but a function returning pos_iterator_type!

We now use the new position iterator for parsing, but the grammar needs to be adjusted a bit. We will replace the following fragment:

  // parse
  bool r = qi::phrase_parse(
    // iterators over input
    fwd_begin, fwd_end,
    // recognize list of doubles
    qi::double_ >> *(',' >> qi::double_) >> qi::eoi,
    // comment skipper
    ascii::space | '#' >> *(ascii::char_ - qi::eol) >> qi::eol,
    // doubles are stored into this object
    output);

  // error detection
  if (!r || fwd_begin != fwd_end)
    throw std::runtime_error("parse error in " + filename);

In our new code, we first call qi::phrase_parse as before, but this time we disallow backtracking by using > instead of >> for certain parser sequence operators in the grammar: if the parser would backtrack over such a sequence, it throws a qi::expectation_failure<pos_iterator_type> exception.

   try
   {
     qi::phrase_parse(
       // iterators over input
       position_begin, position_end,
       // recognize list of doubles
       qi::double_ > *(',' > qi::double_) >> qi::eoi,
       // comment skipper
       ascii::space | '#' >> *(ascii::char_ - qi::eol) >> qi::eol,
       // store into this object
       output);
   }
   catch(const qi::expectation_failure<pos_iterator_type>&amp; e)
   {
   }

To process the caught exception we add the following code:

     const classic::file_position_base<std::string>& pos =
         e.first.get_position();
     std::stringstream msg;
     msg <<
         "parse error at file " << pos.file <<
         " line " << pos.line << " column " << pos.column << std::endl <<
         "'" << e.first.get_currentline() << "'" << std::endl <<
         std::setw(pos.column) << " " << "^- here";
     throw std::runtime_error(msg.str());

(The complete final program can be downloaded from here: stream_iterator_errorposition_parsing.cpp.)

This code creates an error message from the filename, the line and column number and the currently parsed line.

The faulty input now creates the following useful error message:

Exception: parse error at file STDIN line 1 column 10
'123,42.0,a,1.4'
          ^- here

The programs of this example can be downloaded here: naive_parsing.cpp, stream_iterator_parsing.cpp, and stream_iterator_errorposition_parsing.cpp. The two input files are goodinput.txt and badinput.txt.

GD Star Rating
loading...
Tracking the Input Position While Parsing, 5.0 out of 5 based on 4 ratings
  • Reddit
  • Facebook
  • del.icio.us
  • Digg
  • Twitter
  • Print

20 Responses to “Tracking the Input Position While Parsing”

  1. Olaf says:

    stream_iterator_errorposition_parsing.cpp doesn’t compile at least on g++4.1.1

    What’s wrong here?

    GD Star Rating
    loading...
    • Which version of boost are you using?

      Due to a bug which crept into 1.42, this example will work for 1.41 and for the current trunk, but – unfortunately – not for version 1.42.

      If this does not solve the problem: What is the error message?

      Does the second example “stream_iterator_parsing.cpp” compile?

      GD Star Rating
      loading...
    • Finally I got myself a g++-4.1.3 to test, and I can confirm that it does not work.

      I will forward this issue to the mailing list.

      It works with g++-4.2.4 though, between these versions I have not tested.

      Is your error message similar to

      /usr/include/c++/4.1.3/bits/stl_iterator_base_types.h:129:
       error: no type named ‘iterator_category’ in ‘_ForwardIterator2’
      /usr/include/c++/4.1.3/bits/stl_iterator_base_types.h:130:
       error: no type named ‘value_type’ in ‘_ForwardIterator2’
      /usr/include/c++/4.1.3/bits/stl_iterator_base_types.h:131:
       error: no type named ‘difference_type’ in ‘_ForwardIterator2’
      /usr/include/c++/4.1.3/bits/stl_iterator_base_types.h:132:
       error: no type named ‘pointer’ in ‘_ForwardIterator2’
      /usr/include/c++/4.1.3/bits/stl_iterator_base_types.h:133:
       error: no type named ‘reference’ in ‘_ForwardIterator2’
      
      GD Star Rating
      loading...
  2. Thomas Heller says:

    Hi,
    I was wondering if something like this is possible when using an iterator provided by the lexer?

    GD Star Rating
    loading...
    • This is a good question – in fact I wrote this example to build something connected to a lexer.

      As a lexer (at the moment) requires a random access iterator and not a forward iterator, the first technique (i.e., parsing from a stream using multi_pass) is not applicable to lexers.

      However, the input position tracking should be possible with a lexer, so the error message output should also be possible.

      Best Regards,
      Peter

      GD Star Rating
      loading...
      • Hartmut Kaiser says:

        Peter,

        FWIW, the iterator type restriction for a lexer has already been relaxed for Boost V1.41. The lexer requires a forward iterator as it’s underlying data stream, only. Does the documentation still say you need a random access iterator?

        Regards Hartmut

        GD Star Rating
        loading...
        • Sorry, I recently saw this “random access requirement” on the mailing list, so I assumed it still is in the same state.

          Although, this must have been a post from quite some while ago if it has been fixed in v1.41.

          GD Star Rating
          loading...
    • Hartmut Kaiser says:

      Thomas,

      I’m planning to write an example demonstrating how this can be done. Stay tuned!

      Regards Hartmut

      GD Star Rating
      loading...
      • Thomas Heller says:

        Thanks. I am really looking forward to this. As all my experiments with that ended up in long compile errors.

        GD Star Rating
        loading...
  3. Wieger says:

    Hi Peter,

    your example is much appreciated! When playing with it I noted the following inconsistent behavior (with Visual C++ express 2008):

    ’1, 2 a’ parses OK
    ’1, 2 aa’ results in the error message

    Exception: parse error at file STDIN line 1 column 2
    ’1, 2 a’
    ^- here

    Is there a logical explanation for this?

    Regards,

    Wieger

    GD Star Rating
    loading...
    • With g++-4.4 it is as follows (and as I would expect it to be):

      ’1, 2 a’ returns the error at the ‘,’
      ’1, 2 aa’ likewise returns the error at the ‘,’

      Both errors say “line 1 column 2″ which is expected in the following way:

      The *(‘,’ > qi::double_) Kleene-star is entered and traversed correctly for the “, 2″ part.

      But there is no subsequent ‘,’, and as ‘a’ does not match sp::eoi there is an attempted backtracking out of the Kleene-star to the previous double_ which fails and produces the above error message.

      Best,
      Peter

      GD Star Rating
      loading...
  4. Wieger says:

    It seem that on Windows the last character gets lost. If I do ‘type input.txt | stream_iterator_errorposition_parsing.exe’ the last character of input.txt is ignored. But even if I modify the program as follows the behavior on Windows and linux is different(!)

    std::string s = “1, 2, a”;
    std::istringstream in(s);
    try
    {
    parse(in, “STDIN”);
    }

    On linux I get

    Exception: parse error at file STDIN line 1 column 2
    ’1, 2 a’
    ^- here

    and on Windows I get

    Exception: parse error at file STDIN line 1 column 7
    ’1, 2, ‘
    ^- here

    GD Star Rating
    loading...
    • This is really strange. Are you sure that you recompiled your testcase correctly on Linux?

      Because it seems that the second ‘,’ character gets lost on Linux and this would be a very “untypical” kind of error (compared to losing the last character).

      GD Star Rating
      loading...
      • Wieger says:

        I just did an svn update of boost on Windows and now all differences are gone. Sorry for the noise.

        GD Star Rating
        loading...
  5. Korval says:

    So, how do you combine this with qi::on_error? While you can print the line and even point to the character causing the offense, you can’t describe the actual problem in anything more than the most general terms: something went wrong exactly here.

    qi::on_error is supposed to be the way to produce more descriptive results. But how do you combine on_error with these stateful iterators?

    GD Star Rating
    loading...
    • I would say that the “>” operator describes exactly the situation which is reported – that something unexpected occured.

      However, you can derive from the iterator class and add new data members, so you could use a semantic action to store additional information like “parsed shopping list, now expecting weather forecast” which could be displayed by the on_error handler as it can access the iterators.

      GD Star Rating
      loading...
  6. Patrick says:

    Hi Peter,

    Thanks a lot for this nice example. When I try your third example code with the files you posted, everything works as decribed.
    If I introduce now an error before the first number in the file the code simple exits, without any message. Is this behaviour the normal one? And is there a way to catch this error with a meaningfull message aswell?

    Thanks a lot!

    System: Standart Ubuntu 10.04, 64 bit, boost-1.43.0, g++ 4.4.3

    GD Star Rating
    loading...
    • Hi Patrick,

      I think you found some bug in my example, because I simply removed the following code and replaced it by the catch(…):

        // error detection
        if (!r || fwd_begin != fwd_end)
          throw std::runtime_error("parse error in " + filename);
      

      This “error detection” detects if the parse failed because of a non-complete match. The expectation_failure-exception only happens if a forbidden backtracking occurred.

      So if you add something unparseable at the beginning, you will get a non-complete match, so the (forgotten) code above will throw a generic error.

      If you want to track the position of this problem, I think the easiest way would be to add an epsilon parser “eps_p” at the beginning of your grammar, and afterwards a non-backtracking operator “>”. Then at least the first part of the grammar will match any input and you will get an exception if something wrong is in the input at the beginning.

      However, you will also need the above code, in case something wrong is at the end of your input (after some successful parse).

      Please let me know if this solved your problem. :)

      Cheers,
      Peter

      GD Star Rating
      loading...
      • Patrick says:

        Hi Peter,

        That was, what I was looking for it works perfect for me!

        Thanks a lot for your fast response
        Patrick

        GD Star Rating
        loading...

Leave a Reply

preload preload preload