Nov 07

The Spirit mailing list is still the place where active discussion takes place. I will be posting some excerpts from the mailing list here every once in a while. Here’s an interesting post from Leo Goodstadt:
Are Qi parsers thread-safe?

The answer, of course, is yes.

I have been trying to speed up some parsing code that goes through around 25 GB of data. This had been taking 6 hours using Python code. Rewriting it using Qi took it down to 4 minutes. And finally, with the help of Intel Threading Building Blocks, I am down to 37 seconds, a > 500x speedup!

The final code runs optimally with around 15 threads. The process seems to be entirely bounded by the physical limits of pulling 25 GB off the hard-disk RAID over the network.
Intel TBB works really very well with Spirit / Qi.
I divide the parsing tasks into three parts (see the sketch after this list):

  1. The data is in a one-item-per-line format, so I can divide up the file as it is being read into chunks of lines of around (<=) 64 KB at a time. This is a serial process.
  2. These chunks are then parsed by Qi and analysed in parallel.
  3. TBB then takes the output from (2) and feeds it serially, in the original order, to the final stage.
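A minimal sketch of how such a three-stage pipeline can be expressed with tbb::parallel_pipeline (the actual code is not shown in the post; read_chunk, parse_chunk and collect_results are hypothetical placeholders, and the classic tbb/pipeline.h API is assumed):

```cpp
// Hypothetical sketch, not the original code from the post.
#include <tbb/pipeline.h>
#include <cstddef>

struct Chunk;    // one <= 64 KB block of whole lines
struct Parsed;   // whatever the Qi grammar produces for one chunk

Chunk*  read_chunk();               // placeholder: serial read from disk
Parsed* parse_chunk(Chunk* c);      // placeholder: Qi parse of one chunk
void    collect_results(Parsed* p); // placeholder: serial, in-order output

void run_pipeline(std::size_t max_live_tokens)
{
    tbb::parallel_pipeline(max_live_tokens,
        // stage 1: serial input, hands out one chunk at a time
        tbb::make_filter<void, Chunk*>(tbb::filter::serial_in_order,
            [](tbb::flow_control& fc) -> Chunk* {
                Chunk* c = read_chunk();
                if (!c) fc.stop();          // end of input
                return c;
            })
        &
        // stage 2: parallel, each chunk parsed independently by Qi
        tbb::make_filter<Chunk*, Parsed*>(tbb::filter::parallel,
            [](Chunk* c) -> Parsed* { return parse_chunk(c); })
        &
        // stage 3: serial and in order, TBB re-serialises the results
        tbb::make_filter<Parsed*, void>(tbb::filter::serial_in_order,
            [](Parsed* p) { collect_results(p); }));
}
```

The max_live_tokens argument bounds how many chunks are in flight at once, which is also what makes a fixed set of reusable buffers (below) workable.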

I had to write some classes to provide a set of reusable buffers for the data flowing through the pipeline, to avoid lots of memory allocations and thrashing the cache.

At the moment, each thread has its own lexer. Is there data held statically within each instance of the lexer or are they thread-safe?

Hartmut replies: There is no static data; everything should be perfectly thread-safe as long as you do not share either the lexer objects or the iterators between threads while simultaneously accessing them.
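In other words, as long as each parallel task parses from its own iterators (and its own lexer/grammar objects, if any), nothing is shared. A minimal Qi illustration, assuming a hypothetical per-field helper called from the parallel stage:

```cpp
// Hypothetical illustration: nothing here is shared between threads.
#include <boost/spirit/include/qi.hpp>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

// Called concurrently from the parallel pipeline stage; 'field' and the
// iterators are private to the calling task, so no synchronisation is needed.
std::vector<int> parse_subfields(std::string const& field)
{
    std::vector<int> out;
    std::string::const_iterator first = field.begin(), last = field.end();
    qi::parse(first, last, qi::int_ % ':', out);   // e.g. "12:34:56:78"
    return out;
}
```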

I should say, thank you so much for Qi. It is the one library I can point to in C++ which trumps anything out there for any other language.

Leo Goodstadt

8 Responses to “Multi-threaded Qi: 6 hours -> 37 seconds”

  1. Sohail says:

    \o/ Spirit!

  2. Arash Partow says:

    Leo, can you please provide an example of the format you’re parsing?

    Furthermore, can you please provide timings that DO NOT include noise from hard-disk access. In short, how long does it take Qi to parse the data if it were in memory (say, a 1-2 GB block of it)?

    • Leo Goodstadt says:

      I only get a small 10% increase in speed on the second pass through the data (37s -> 30s), when the whole file would be cached. (The machine has 80 GB of RAM.)
      Unfortunately, the test server has only 2 quad-core CPUs, so with hyperthreading I see 16 threads of execution. With Intel TBB, I see >> 1500% CPU usage, so the code is CPU-limited. I need a faster machine!!

      The big improvement has been the use of Intel Threading Building Blocks. This library is the other piece of the puzzle and has been a joy to use. My code has NO synchronisation primitives. All parallelisation and re-serialisation is taken care of by TBB. This was the trick to get from ~3.5 minutes to 37s. (Don’t ask me why I don’t get a 16x speed up: because hyperthreading is “faking it”?)

      Leo

  3. Leon Sit says:

    I second the request for an example of this task. I am interested in parsing a large, embarrassingly parallel structure into memory too.

    • Arash Partow says:

      Leon, can you please provide the format of the data parsed, perhaps with an example line if possible? I ask because you’ve seen a nearly 600x increase in throughput; I think many people like myself would be very interested in the details.

      • Leon Sit says:

        You asked the wrong person

      • Leo Goodstadt says:

        This is genetics data in the Variant Call Format (VCF), as standardised by the “1000 Genomes Project” (http://en.wikipedia.org/wiki/1000_Genomes_Project).

        Each line of my files has a header of 9 tab-delimited fields of various data types (strings, limited-vocabulary symbols, integers, etc.), followed by XX tab-delimited data fields. XX is constant for any given file.

        Each of the data fields has a number of “:”-separated inner fields, which again can be of a variety of data types.

        My test file of 25 GB is not particularly large for its type. I had > 65 million lines, each with 9 header + 17 data fields. Each data field had 5 sub-members, 4 of which were integers. So there are around 4.5 billion string->integer conversions at the outset.
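        For concreteness, a Qi grammar for a line of roughly that shape might look like the sketch below. This is not the grammar from the thread; the field types are simplified guesses, with the first sub-member of each data column kept as a string and the remaining members parsed as integers.

```cpp
// Hypothetical sketch of a grammar for: 9 tab-delimited header columns,
// then a run of data columns, each made of ':'-separated sub-fields.
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/include/std_pair.hpp>
#include <string>
#include <utility>
#include <vector>

namespace qi = boost::spirit::qi;

// one data column, e.g. "0/1:12:34:56:78" -> ("0/1", {12, 34, 56, 78})
typedef std::pair<std::string, std::vector<int> > data_field;
typedef std::pair<std::vector<std::string>, std::vector<data_field> > record;

template <typename Iterator>
struct vcf_like_line : qi::grammar<Iterator, record()>
{
    vcf_like_line() : vcf_like_line::base_type(line)
    {
        header = +(qi::char_ - '\t');               // one header column, kept as text
        sub    = +(qi::char_ - ':' - '\t');         // leading non-integer sub-field
        ints   = qi::int_ % ':';                    // the integer sub-fields
        field  = sub >> ':' >> ints;                // one whole data column
        line   = qi::repeat(9)[header >> '\t']      // 9 header columns
               >> (field % '\t');                   // then the data columns
    }
    qi::rule<Iterator, std::string()>      header, sub;
    qi::rule<Iterator, std::vector<int>()> ints;
    qi::rule<Iterator, data_field()>       field;
    qi::rule<Iterator, record()>           line;
};
```

        It would be used per line (or per chunk of lines), e.g. record r; qi::parse(first, last, grammar, r); constructing one grammar object per thread and reusing it avoids paying its setup cost on every line.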

        The Python code was slow, especially in the split() and string->int conversions. Python is so much easier than C++ to code rapidly and flexibly that I was really reluctant to rewrite. However, the speed increases were dramatic enough to make it worth my while.

        The last bit of the speed-up involved *either* using a (Boost) memory-mapped file (my test machine has 80 GB of memory) *or*, eventually, the Intel TBB library.

        To be fair to Python, I could have had some degree of speed-up if I had rewritten the slow bits in Cython etc., but once I decided to tune things for speed, I thought I might as well bite the bullet and program in C++, where I would have a much better idea of where the bottlenecks are. Also, learning Spirit has been a great pleasure.
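        On the memory-mapped variant: a minimal sketch using boost::iostreams::mapped_file_source (the grammar is a placeholder; this shows the shape of the approach, not the code from the thread):

```cpp
// Hypothetical sketch: map the file read-only and hand the raw character
// range straight to Qi, with no intermediate read/copy into buffers.
#include <boost/iostreams/device/mapped_file.hpp>
#include <string>

void parse_whole_file(std::string const& path)
{
    boost::iostreams::mapped_file_source file(path);
    char const* first = file.data();
    char const* last  = first + file.size();
    // ... run the grammar over [first, last), e.g.:
    // qi::parse(first, last, my_grammar, records);   // my_grammar: placeholder
    (void)first; (void)last;
}
```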

  4. Leo Goodstadt says:

    Here are some details on the final ~7x speed-up of the parsing using Intel TBB with Spirit.

    It took a little work to get TBB working happily with Spirit. This was partly because I couldn’t get the critical example code on the TBB website / book to even compile! (My incompetence or out of date documentation. Hmmm!)

    This is my Spirit/TBB pipeline:

    1) The TBB pipeline parcels out sets of successive lines which fit into chunks of a certain (runtime-determined) size (e.g. 64 KB); see the chunking sketch after this list.

    2) Each set of lines in the block is parsed by Spirit in parallel.

    3) The resulting data is recombined in the correct order (if necessary).
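    A sketch of what step 1 amounts to, assuming plain std::istream input (next_chunk is a hypothetical name, and the real code reads into reusable buffers rather than fresh strings): read roughly one block, then trim it back to the last newline so that no line ever straddles two chunks.

```cpp
#include <cstddef>
#include <istream>
#include <string>

// Hypothetical helper: returns the next block of *whole* lines, carrying any
// trailing partial line over to the following call via 'carry'.
std::string next_chunk(std::istream& in, std::string& carry,
                       std::size_t block_size = 64 * 1024)
{
    std::string chunk;
    chunk.swap(carry);                         // start with last call's partial line
    std::size_t const old_size = chunk.size();
    chunk.resize(old_size + block_size);
    in.read(&chunk[old_size], static_cast<std::streamsize>(block_size));
    chunk.resize(old_size + static_cast<std::size_t>(in.gcount()));

    std::size_t const cut = chunk.rfind('\n'); // last complete line in this block
    if (cut == std::string::npos || !in)       // no newline, or end of file:
        return chunk;                          // take everything that is left
    carry.assign(chunk, cut + 1, std::string::npos);
    chunk.resize(cut + 1);                     // keep only whole lines
    return chunk;
}
```

    The serial input stage would call this until it returns an empty chunk at end of file; because chunk boundaries always fall on newlines, each block can be handed to the parallel Qi stage independently.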

    The block size is a balance between fitting things on the CPU data cache and minimising thread setup overhead. In practice, as suggested by the docs, the block size is not that critical to get exactly right.

    The slightly tricky thing is that I used a reusable / circular buffer of memory blocks to avoid data cache / memory allocation thrashing.
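    One way to arrange that (a guess at the idea, not the actual classes): preallocate a small ring of buffers, at least as many as the pipeline's maximum number of live tokens, and let the serial input stage hand them out round-robin. Because TBB caps how many chunks are in flight, no buffer is reused while it is still being parsed, so no locking is needed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical fixed ring of reusable buffers for the pipeline's input stage.
class buffer_ring
{
public:
    buffer_ring(std::size_t n_buffers, std::size_t capacity)
      : buffers_(n_buffers), next_(0)
    {
        for (std::size_t i = 0; i != n_buffers; ++i)
            buffers_[i].reserve(capacity);     // allocate once, reuse thereafter
    }

    // Called only from the serial input stage; safe without locks as long as
    // n_buffers >= the pipeline's maximum number of live tokens.
    std::string& acquire()
    {
        std::string& buf = buffers_[next_];
        next_ = (next_ + 1) % buffers_.size();
        buf.clear();                           // keep capacity, drop old contents
        return buf;
    }

private:
    std::vector<std::string> buffers_;
    std::size_t next_;
};
```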

    This scheme would have to be adapted if the data format is not line-by-line, i.e. it would be necessary to make sure that data fields do not span the discrete chunks that Spirit is parsing in parallel.

    The other nice part of this scheme is that I could still use standard Spirit error messages to point out the precise lines and fields which were failing to parse.
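    For example (a sketch, not the grammar from the thread), expectation points in Qi throw qi::expectation_failure, whose iterator can be combined with the chunk's starting line number to report the exact failing line:

```cpp
#include <boost/spirit/include/qi.hpp>
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;

// Hypothetical diagnostic wrapper around the parse of one chunk.
void parse_chunk_with_diagnostics(std::string const& chunk, std::size_t first_line_no)
{
    typedef std::string::const_iterator It;
    It first = chunk.begin(), last = chunk.end();
    try
    {
        // '>' (expectation) instead of '>>' makes failures throw; the real
        // grammar is not shown in the thread, so a toy rule stands in here.
        qi::parse(first, last, (qi::int_ > ':' > qi::int_) % qi::eol);
    }
    catch (qi::expectation_failure<It> const& e)
    {
        // line number in the whole file = first line of this chunk
        // + newlines seen before the failure position
        std::size_t line = first_line_no
                         + std::count(chunk.begin(), e.first, '\n');
        std::cerr << "parse error at line " << line
                  << ", expected " << e.what_ << '\n';
    }
}
```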

    Leo

