I have a mixed relationship with variant…
I just wrote a parser for S-expressions (that will be the basis of ASTs and intermediate types in my planned “write-a-compiler” article series). The parser itself is easy, but as always, I spent more time on the underlying data structures.
What are S-expressions? S-expressions, also called sexps, are recursive, list-based data structures. Being recursive, they can represent hierarchical information. S-expressions are parenthesized prefix expressions, known for their use in LISP (and its sibling Scheme). Here’s a simple sexp:
(* 2 (+ 3 4))
The sexp above corresponds to this infix expression:
(2 * (3 + 4))
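To make the prefix notation concrete, here is a minimal sketch of a recursive evaluator for sexps like the one above. It is not the parser from this article: the `sexp_sketch` namespace and the `eval` functions are hypothetical names, and the sketch assumes only non-negative integer atoms and the operators `+` and `*`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of utree: evaluates a prefix expression
// over non-negative integers with the operators + and *.
namespace sexp_sketch {

inline bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
inline bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

inline void skip_spaces(std::string const& s, std::size_t& pos)
{
    while (pos < s.size() && is_space(s[pos]))
        ++pos;
}

inline long eval(std::string const& s, std::size_t& pos)
{
    assert(pos <= s.size());
    skip_spaces(s, pos);
    if (pos >= s.size())
        throw std::runtime_error("unexpected end of input");

    if (s[pos] != '(')
    {
        // Atom: an integer literal.
        if (!is_digit(s[pos]))
            throw std::runtime_error("expected a number");
        long value = 0;
        while (pos < s.size() && is_digit(s[pos]))
            value = value * 10 + (s[pos++] - '0');
        return value;
    }

    ++pos; // consume '('
    skip_spaces(s, pos);
    if (pos >= s.size() || (s[pos] != '+' && s[pos] != '*'))
        throw std::runtime_error("expected + or *");
    char const op = s[pos++];

    long result = (op == '+') ? 0 : 1; // identity element of the operator
    for (;;)
    {
        skip_spaces(s, pos);
        if (pos >= s.size())
            throw std::runtime_error("missing ')'");
        if (s[pos] == ')')
        {
            ++pos; // consume ')'
            return result;
        }
        long const operand = eval(s, pos); // recurse into nested lists
        result = (op == '+') ? result + operand : result * operand;
    }
}

// Convenience overload that evaluates a whole string from the start.
inline long eval(std::string const& s)
{
    std::size_t pos = 0;
    return eval(s, pos);
}

} // namespace sexp_sketch
```

The operator comes first and the operands follow, so the evaluator never needs precedence rules: each nested list is just another recursive call.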
S-expressions are simple yet remarkably powerful beasts, as is evident in applications that use LISP as their scripting language. They can represent both code and data. Some people even use S-expressions as a suitable (and terser!) replacement for XML. The in-memory data structures are easy to traverse, transform and compile, and easy to accumulate results from.
The plan is to use S-expressions as our AST representation and embed a minimal LISP/Scheme interpreter IN the compiler. This implies that along the way, we’ll be building an S-expression parser and a LISP/Scheme interpreter. How cool is that? … We’re talking about scripting the compiler with an interpreter!
I needed a dynamic data type that can represent the S-expressions. I called it utree, short for universal-tree. I want it to be as simple as it can be, fast, and tight in memory footprint. Boost variant was simply out of the question (I used it in one early prototype). For one, it failed a basic requirement (tight memory footprint). The padding and the way it aligns the “what-type” integer member is quite wasteful. It aligns conservatively, using the strictest alignment of the types in the union. Thus if you have a type in there that aligns to 8 bytes, variant requires another 8 bytes just for the type discriminator! Try it out:
#include <iostream>
#include <boost/variant.hpp>
struct x { void* a; void* b; void* c; };
int main()
{
    std::cout << sizeof(x) << std::endl;
    std::cout << sizeof(boost::variant<x, int, double>) << std::endl;
}
I get: 12 and 24 respectively (32 bit system).
I ended up with 40 bytes in my initial prototype (using STL containers and variant) and later squeezed that down to 24 (minimum). I did away with variant in my latest version and got 16 bytes. In this case, I “stole” unused padding bits from the data to store the discriminator. With these 16 bytes, I have nil, bool, int, double, string and (doubly linked) list. The string itself steals as much memory as it can: it stores the characters in the union when they fit and uses the heap only when needed. So, on 32 bit systems, it can store as much as 14 bytes in-situ. That’s a lot for storing simple strings like symbols and identifiers. On 64 bit systems, you can store a lot more in-situ and minimize heap usage even further.
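The byte-stealing idea can be sketched roughly as follows. This is not the actual utree or fast_string code: `tag_sketch::node` and all of its members are hypothetical, the sketch assumes C++11 (`alignas`, `static_assert`), and it simply refuses strings that do not fit where a real implementation would fall back to the heap. The last byte of a 16-byte buffer holds the discriminator, and every payload is small enough never to touch it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

namespace tag_sketch {

enum kind { nil_kind, int_kind, double_kind, small_string_kind };

// Hypothetical 16-byte node: payloads live at the front of the buffer,
// the discriminator lives in the very last byte.
class node
{
public:
    static std::size_t const total_size = 16;
    // One byte is reserved for the tag and one for the string terminator.
    static std::size_t const max_in_situ = total_size - 2;

    node() { set_kind(nil_kind); }

    void set_int(int i)
    {
        std::memcpy(storage_, &i, sizeof i);
        set_kind(int_kind);
    }

    void set_double(double d)
    {
        std::memcpy(storage_, &d, sizeof d);
        set_kind(double_kind);
    }

    // Returns false and leaves the node unchanged when the string is too
    // long to fit in situ; a real implementation would use the heap here.
    bool set_string(char const* s)
    {
        std::size_t const len = std::strlen(s);
        if (len > max_in_situ)
            return false;
        std::memcpy(storage_, s, len + 1); // copy the terminator too
        set_kind(small_string_kind);
        return true;
    }

    kind which() const
    {
        return static_cast<kind>(storage_[total_size - 1]);
    }

    int as_int() const
    {
        assert(which() == int_kind);
        int i;
        std::memcpy(&i, storage_, sizeof i);
        return i;
    }

    double as_double() const
    {
        assert(which() == double_kind);
        double d;
        std::memcpy(&d, storage_, sizeof d);
        return d;
    }

    char const* as_string() const
    {
        assert(which() == small_string_kind);
        return storage_;
    }

private:
    void set_kind(kind k) { storage_[total_size - 1] = static_cast<char>(k); }

    alignas(double) char storage_[total_size];
};

static_assert(sizeof(node) == node::total_size, "the tag must not add to the size");
static_assert(sizeof(double) < node::total_size, "payloads must leave the tag byte free");

} // namespace tag_sketch
```

Because the tag sits inside the buffer rather than next to it, no extra alignment padding is needed, and a string of up to 14 characters plus its terminator still fits alongside the tag.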
At this point, I feel like writing my own variant type that can do such things (intrusive variant?). Having ruled out Boost.Variant, I also needed to write my own data structures (a doubly linked list). I really wanted to use Boost.Intrusive, which is quite efficient, but because I had to squeeze my own variant in there, I had to make use of unions, which require PODs!
Here’s the work in progress:
http://boost-spirit.com/dl_more/scheme/scheme_v0.2/
Here’s the utree API:
///////////////////////////////////////////////////////////////////////////
// The main utree (Universal Tree) class
// The utree is a hierarchical, dynamic type that can store:
// - a nil
// - a bool
// - an integer
// - a double
// - a string (textual or binary)
// - a (doubly linked) list of utree
// - a reference to a utree
//
// The utree has minimal memory footprint. The data structure size is
// 16 bytes on a 32-bit platform. Being a container of itself, it can
// represent tree structures.
///////////////////////////////////////////////////////////////////////////
class utree
{
public:
    typedef utree value_type;
    typedef detail::list::node_iterator<utree> iterator;
    typedef detail::list::node_iterator<utree const> const_iterator;
    typedef utree& reference;
    typedef utree const& const_reference;
    typedef std::ptrdiff_t difference_type;
    typedef std::size_t size_type;

    struct nil {};

    utree();
    explicit utree(bool b);
    explicit utree(unsigned int i);
    explicit utree(int i);
    explicit utree(double d);
    explicit utree(char const* str);
    explicit utree(char const* str, std::size_t len);
    explicit utree(std::string const& str);
    explicit utree(boost::reference_wrapper<utree> ref);

    utree(utree const& other);
    ~utree();

    utree& operator=(utree const& other);
    utree& operator=(bool b);
    utree& operator=(unsigned int i);
    utree& operator=(int i);
    utree& operator=(double d);
    utree& operator=(char const* s);
    utree& operator=(std::string const& s);
    utree& operator=(boost::reference_wrapper<utree> ref);

    template <typename F>
    typename F::result_type
    static visit(utree const& x, F f);

    template <typename F>
    typename F::result_type
    static visit(utree& x, F f);

    template <typename F>
    typename F::result_type
    static visit(utree const& x, utree const& y, F f);

    template <typename F>
    typename F::result_type
    static visit(utree& x, utree const& y, F f);

    template <typename F>
    typename F::result_type
    static visit(utree const& x, utree& y, F f);

    template <typename F>
    typename F::result_type
    static visit(utree& x, utree& y, F f);

    template <typename T>
    void push_back(T const& val);

    template <typename T>
    void push_front(T const& val);

    template <typename T>
    iterator insert(iterator pos, T const& x);

    template <typename T>
    void insert(iterator pos, std::size_t, T const& x);

    template <typename Iter>
    void insert(iterator pos, Iter first, Iter last);

    template <typename Iter>
    void assign(Iter first, Iter last);

    void clear();
    void pop_front();
    void pop_back();
    iterator erase(iterator pos);
    iterator erase(iterator first, iterator last);

    utree& front();
    utree& back();
    utree const& front() const;
    utree const& back() const;

    utree& operator[](std::size_t i);
    utree const& operator[](std::size_t i) const;

    void swap(utree& other);

    iterator begin();
    iterator end();
    const_iterator begin() const;
    const_iterator end() const;

    bool empty() const;
    std::size_t size() const;
};
bool operator==(utree const& a, utree const& b);
bool operator<(utree const& a, utree const& b);
bool operator!=(utree const& a, utree const& b);
bool operator>(utree const& a, utree const& b);
bool operator<=(utree const& a, utree const& b);
bool operator>=(utree const& a, utree const& b);
What is an S-expression?
Would you please be kind enough to explain it to a layman like myself?
Try googling s-expression.
I got several hits.
I’ll add a short summary anyway. Thanks, Darid.
Done
You’re storing the discriminator in the padding. Couldn’t the same be
done by changing boost/variant/variant.hpp:215-216 from:
which_t which_;
storage_t storage_;
to:
storage_t storage_;
which_t which_;
? IOW, the compiler should be able to figure out how to best fit
which_ after storage_.
Nope. I don’t think so. The padding will be the same either way. Try it out.
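As a rough way to check this without touching Boost, here is a small sketch that puts the discriminator before and after a union. The types `tag_first` and `tag_last` are hypothetical stand-ins, not the actual variant internals, and the layout checks assume a typical ABI where the union is aligned at least as strictly as an int.

```cpp
#include <cassert>
#include <cstddef>

namespace order_sketch {

struct payload { void* a; void* b; void* c; };
union storage { payload p; int i; double d; };

// Hypothetical stand-ins for the two member orders discussed above.
struct tag_first { int which; storage data; };
struct tag_last  { storage data; int which; };

// Either order costs the same: the tag is padded out to the alignment of
// the union, whether the padding sits after the tag or after the struct.
static_assert(sizeof(tag_first) == sizeof(tag_last),
              "member order does not change the size");
static_assert(sizeof(tag_last) > sizeof(storage),
              "the tag always adds to the size of the union");

// Runtime view of where the members actually land.
inline void check_layout()
{
    // Tag last: the tag starts right after the union.
    assert(offsetof(tag_last, which) == sizeof(storage));
    // Tag first: the union is pushed out to its own alignment boundary.
    assert(offsetof(tag_first, data) == alignof(storage));
}

} // namespace order_sketch
```

In both orders the total size has to be a multiple of the strictest alignment, so the compiler has no room to tuck the tag in for free.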
OK, but now I can’t figure out how you store the discriminant inside the
padding at the end of your union of PODs without somehow
misaligning the discriminant. I guess I’m just wondering why
the compiler writers couldn’t figure out how to do this most efficiently.
Do you know of some language rules which require this wasted
space?
I did nothing special. I just stole one byte from the fast_string (the string that places its data in-situ) for the discriminator. Check out the implementation of fast_string:
http://boost-spirit.com/dl_more/scheme/scheme_v0.1/detail/utree_detail1.hpp
I tried using a union with a discriminant inside a struct. I get the same
size as for composite_tagged_seq.
IOW, the following code:
#include
#include
#include
#include
struct x { void* a; void* b; void* c; };
struct x_union
{
    union { x u_x; int u_int; double u_dbl; } storage;
    char tag;
};
struct x_oneof
  : boost::composite_tagged_seq
    < boost::composite_tags::one_of_maybe
    , boost::mpl::integral_c
    , boost::mpl::list
    >
{
};
int main(void)
{
    std::cout<<"sizeof(x)="<<sizeof(x)<<"\n";
    std::cout<<"sizeof(x_union)="<<sizeof(x_union)<<"\n";
    std::cout<<"sizeof(x_oneof)="<<sizeof(x_oneof)<<"\n";
    return 0;
}
produces:
sizeof(x)=24
sizeof(x_union)=32
sizeof(x_oneof)=32
Yep. You can’t get away from the extra padding. In my case, I just store the discriminant (intrusively) in the largest union type, in a place where it cannot be overwritten by the other inhabitants of the union. That is a caveat: I place it at the right/bottom-most end of the largest struct, but there could be some obscure OS where a weird alignment places data such that it is overwritten. This can be tested at compile time, though. For a generic variant, one could perhaps use the same trick with an array of chars one byte larger than the largest struct, placing the discriminator in the extra byte.
Sometimes you need to apply a function to all the elements of a tree.
Is it possible to do a for_each on a utree?
Definitely. It’s an STL container. You can also do unary and binary visitation like you do with variant, allowing you to traverse the tree any way you want.
The reason for this:
Thus if you have a type in there that aligns to 12 bytes, variant
requires another 12 bytes just for the type discriminator!
is to preserve the 12 byte alignment for the oddly aligned type when the
variant is put into an array. By storing the type discriminant in the
1st 12 bytes of the containing structure, then when the containing structure
is put into a 2 element array, the 2nd array element will have the properly
aligned oddly aligned type. Now I suppose you could avoid this restriction
if you’re sure never to put utree into an array.
Yes, Larry, that is correct. A 12-byte alignment is quite odd, though. The 64 bit machines I am aware of typically align to 8 bytes (64 bits). Correct me if I’m wrong. Anyway, the point really is to make use of the padding bits and optimize the packing of the data.
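The array argument can be illustrated with a small sketch. The `tagged` struct and the helper names below are hypothetical and use an ordinary double rather than an oddly aligned type; the point is only that the size of a struct is rounded up to a multiple of its alignment so that every array element stays aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace align_sketch {

// Hypothetical tagged value: a double payload plus a 1-byte discriminator.
struct tagged
{
    double value;
    char tag;
};

// The size is rounded up to a whole multiple of the alignment, which is
// what keeps the second, third, ... array elements aligned.
static_assert(sizeof(tagged) % alignof(tagged) == 0,
              "size must be a multiple of the alignment");

// Bytes in the struct that hold neither the payload nor the tag.
constexpr std::size_t padding_bytes =
    sizeof(tagged) - sizeof(double) - sizeof(char);

// Checks the address of the second element of a two-element array.
inline bool second_element_is_aligned()
{
    tagged pair[2] = {};
    std::uintptr_t const address =
        reinterpret_cast<std::uintptr_t>(&pair[1]);
    assert(address % alignof(tagged) == 0);
    return address % alignof(tagged) == 0;
}

} // namespace align_sketch
```

The `padding_bytes` left over here are the kind of space an intrusive discriminator tries to reuse instead of leaving empty.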