Mar 11

I have a mixed relationship with variant…

I just wrote a parser for S-expressions (that will be the basis of ASTs and intermediate types in my planned “write-a-compiler” article series). The parser itself is easy, but as always, I spent more time on the underlying data structures.

What are S-expressions? S-expressions, also called sexps, are recursive, list based, data structures. Being recursive, they can represent hierarchical information. S-expressions are parenthesized prefix expressions, known for their use in LISP (and its sibling Scheme). Here’s a simple sexp:

(* 2 (+ 3 4))

The sexp above corresponds to this infix expression:

(2 * (3 + 4))

S-expressions are simple and infinitely powerful beasts as evident in applications that use LISP as their scripting language. They can represent code and data. Some people even use S-expressions as a suitable (and terser!) replacement for XML. The in-memory data structures are very easy to use, transform and manipulate, traverse and compile or accumulate results from.

The plan is to use S-expressions as our AST representation and embed a minimal LISP/Scheme interpreter IN the compiler. This implies that along the way, we’ll be building an S-expression parser and a LISP/Scheme interpreter. How cool is that? … We’re talking about scripting the compiler with an interpreter! :-)

I needed a dynamic data type that can represent the S-expressions. I called it utree, short for universal-tree. I want it to be as simple as it can be and fast and tight in memory footprint. Boost variant was simply out of the question (I used it in one early prototypes). For one, it failed a basic requirement (tight memory footprint). The padding and the way it aligns the “what-type” integer member is quite wasteful. It uses a conservative alignment using the worst alignment of the types in the union. Thus if you have a type in there that aligns to 8 bytes, variant requires another 8 bytes just for the type discriminator! Try it out:

struct x { void* a; void* b; void* c; };
/***/
std::cout << sizeof(x) << std::endl;
std::cout << sizeof(boost::variant<x, int, double>) << std::endl;

I get: 12 and 24 respectively (32 bit system).

I ended up with 40 bytes in my initial prototype (using STL containers and variant) and later squeezed that to 24 (minimum). I did away with variant in my latest version and got 16 bytes. In this case, I “stole” unused padding bits from the data to store the discriminator.   With this 16 bytes, I have nil, bool, int, double, string and (double linked) list. The string itself steals memory when it can (i.e. it stores the string in the union when it can and only uses the heap when needed). The string steals as much as it can. So, on 32 bit systems, it can store in-situ as much as 14 bytes. That’s a lot for storing simple strings like symbols and identifiers. On 64 bit systems, you can store a lot more in-situ and minimize heap usage more.

At this point, I feel like writing my own variant type that can do such things (intrusive variant?). Barring the use of Boost.Variant, I needed to write my own data structures (double linked list). I really wanted to use Boost.Intrusive which is quite efficient, but because I had to squeeze my own variant in there, I had to make use of unions which require PODs!

Here’s the work in progress:
http://boost-spirit.com/dl_more/scheme/scheme_v0.2/

Here’s the utree API:


    ///////////////////////////////////////////////////////////////////////////
    // The main utree (Universal Tree) class
    // The utree is a hierarchical, dynamic type that can store:
    //  - a nil
    //  - a bool
    //  - an integer
    //  - a double
    //  - a string (textual or binary)
    //  - a (doubly linked) list of utree
    //  - a reference to a utree
    //
    // The utree has minimal memory footprint. The data structure size is
    // 16 bytes on a 32-bit platform. Being a container of itself, it can
    // represent tree structures.
    ///////////////////////////////////////////////////////////////////////////
    class utree
    {
    public:

        typedef utree value_type;
        typedef detail::list::node_iterator<utree> iterator;
        typedef detail::list::node_iterator<utree const> const_iterator;
        typedef utree& reference;
        typedef utree const& const_reference;
        typedef std::ptrdiff_t difference_type;
        typedef std::size_t size_type;

        struct nil {};

        utree();
        explicit utree(bool b);
        explicit utree(unsigned int i);
        explicit utree(int i);
        explicit utree(double d);
        explicit utree(char const* str);
        explicit utree(char const* str, std::size_t len);
        explicit utree(std::string const& str);
        explicit utree(boost::reference_wrapper<utree> ref);

        utree(utree const& other);
        ~utree();

        utree& operator=(utree const& other);
        utree& operator=(bool b);
        utree& operator=(unsigned int i);
        utree& operator=(int i);
        utree& operator=(double d);
        utree& operator=(char const* s);
        utree& operator=(std::string const& s);
        utree& operator=(boost::reference_wrapper<utree> ref);

        template <typename F>
        typename F::result_type
        static visit(utree const& x, F f);

        template <typename F>
        typename F::result_type
        static visit(utree& x, F f);

        template <typename F>
        typename F::result_type
        static visit(utree const& x, utree const& y, F f);

        template <typename F>
        typename F::result_type
        static visit(utree& x, utree const& y, F f);

        template <typename F>
        typename F::result_type
        static visit(utree const& x, utree& y, F f);

        template <typename F>
        typename F::result_type
        static visit(utree& x, utree& y, F f);

        template <typename T>
        void push_back(T const& val);

        template <typename T>
        void push_front(T const& val);

        template <typename T>
        iterator insert(iterator pos, T const& x);

        template <typename T>
        void insert(iterator pos, std::size_t, T const& x);

        template <typename Iter>
        void insert(iterator pos, Iter first, Iter last);

        template <typename Iter>
        void assign(Iter first, Iter last);

        void clear();
        void pop_front();
        void pop_back();
        iterator erase(iterator pos);
        iterator erase(iterator first, iterator last);

        utree& front();
        utree& back();
        utree const& front() const;
        utree const& back() const;

        utree& operator[](std::size_t i);
        utree const& operator[](std::size_t i) const;

        void swap(utree& other);

        iterator begin();
        iterator end();
        const_iterator begin() const;
        const_iterator end() const;

        bool empty() const;
        std::size_t size() const;
    };

    bool operator==(utree const& a, utree const& b);
    bool operator<(utree const& a, utree const& b);
    bool operator!=(utree const& a, utree const& b);
    bool operator>(utree const& a, utree const& b);
    bool operator<=(utree const& a, utree const& b);
    bool operator>=(utree const& a, utree const& b);
GD Star Rating
loading...
S-expressions and variant, 5.0 out of 5 based on 1 rating
  • Reddit
  • Facebook
  • del.icio.us
  • Digg
  • Twitter
  • Print

14 Responses to “S-expressions and variant”

  1. Darid says:

    What is an S-expression?
    would you please be kind enough to explain it to a layman like myself.

    GD Star Rating
    loading...
  2. Larry Evans says:

    You’re storing the discriminator in the padding. Couldn’t the same be
    done by changing boost/variant/variant.hpp:215-216 from:

    which_t which_;
    storage_t storage_;

    to:

    storage_t storage_;
    which_t which_;

    ? IOW, the compiler should be able to figure out how to best fit
    which_ after storage_.

    GD Star Rating
    loading...
    • Joel de Guzman says:

      Nope. I don’t think so. The padding will be the same either way. Try it out.

      GD Star Rating
      loading...
      • Larry Evans says:

        OK, but now I can’t figure how you store the discriminant inside the
        padding at the end of your union of POD’s without somehow
        misaligning the discriminant. I guess just saying I wonder why
        the compiler writers couldn’t figure how to do this most efficiently.
        Do you know of some language rules which require this wasted
        space?

        GD Star Rating
        loading...
        • Joel de Guzman says:

          I did nothing special. I just stole one byte from the fast_string (the string that places its data in-situ) for the discriminator. Check out the implementation of fast_string:

          http://boost-spirit.com/dl_more/scheme/scheme_v0.1/detail/utree_detail1.hpp

          GD Star Rating
          loading...
        • Larry Evans says:

          I tried using a union with a discriminant inside a struct. I get the same
          size of for composite_tagged_seq:

          IOW, the following code:

          #include
          #include
          #include 
          
          #include 
          
          struct x { void* a; void* b; void* c; };
          
          struct x_union
          {
                union
                { x u_x
                ; int u_int
                ; double u_dbl
                ;}
              storage
              ;
                char
              tag
              ;
          };
          struct x_oneof
          : boost::composite_tagged_seq
            < boost::composite_tags::one_of_maybe
            , boost::mpl::integral_c
            , boost::mpl::list
          
            >
          {
          };
          
          int main(void)
          {
              std::cout<<"sizeof(x)="<<sizeof(x)<<"\n";
              std::cout<<"sizeof(x_union)="<<sizeof(x_union)<<"\n";
              std::cout<<"sizeof(x_oneof)="<<sizeof(x_oneof)<<"\n";
              return 0;
          }
          

          produces:

          sizeof(x)=24
          sizeof(x_union)=32
          sizeof(x_oneof)=32

          GD Star Rating
          loading...
          • Joel de Guzman says:

            Yep. You can’t get away with the extra padding. In my case, I just store the discriminant (intrusively) in the largest union type in a place where it can not be overwritten by other inhabitants of the union. That is a caveat. I place this at the right/bottom-most of the largest struct but there can be some obscure OS where some weird alignment can place data such that this is overwritten. This can be tested at compile time though. For a generic variant, one can perhaps use the same trick by having an array of chars with 1+ the largest struct and place the discriminator there at the extra byte.

            GD Star Rating
            loading...
  3. Dainis says:

    Sometimes for tree is needed some function on its all elements .
    For u-tree its possible to doo for_each?

    GD Star Rating
    loading...
    • Joel de Guzman says:

      Definitely. It’s an STL container. You can also do unary and binary visitation like you do with variant, allowing you to traverse the tree any way you want.

      GD Star Rating
      loading...
  4. Larry Evans says:

    The reason for this:

    Thus if you have a type in there that aligns to 12 bytes, variant
    requires another 12 bytes just for the type discriminator!

    is to preserve the 12 byte alignment for the oddly aligned type when the
    variant is put into an array. By storing the type discriminant in the
    1st 12 bytes of the containing structure, then when the containing structure
    is put into a 2 element array, the 2nd array element will have the properly
    aligned oddly aligned type. Now I suppose you could avoid this restriction
    if you’re sure never to put utree into an array.

    GD Star Rating
    loading...
    • Joel de Guzman says:

      Yes, Larry, that is correct. 12 bytes alignment is quite odd though. 64 bit machines I am aware of typically aligns to 8 bytes (64 bits). Correct me if I’m wrong. Anyway, the point really is to make use of the padding bits and optimize the packing of the data.

      GD Star Rating
      loading...

Leave a Reply

preload preload preload