Jan 02

Thank you all for your warm feedback on the “Build a Compiler” post. It seems this has become very popular indeed. I guess it’s time to start. Your overwhelming feedback and comments are enough motivation to carry on with the article series.

In general, an imperative OO language seems to be the way to go. It’s not surprising that C++ is very popular. People want a C++ parser! Barring that, due to its complexity, a subset or a sanitized/re-syntaxed C++ (e.g. SPECS) is also a popular request. Go is also quite popular. That language indeed looks good and modern. FP, especially LISP/Scheme and even Haskell(!), is also quite popular. And hey: Javascript! and Python! Life would not be complete without these fun languages :-).

I’ll need some time to think about all these. In the meantime, it’s still not too late to chime in with your comments. This particular post will be dynamic. I will add to it over time based on my thoughts and your feedback.

Here’s a tally of languages suggested for the project, sorted by popularity:

C++ 4
Lisp/Scheme 4
C 3
Go 3
Haskell 3
C++ SPECS 2
Javascript 2
Python 2
ALGOL-like 1
C++ Subset 1
Chapel 1
Clojure 1
EBNF 1
GIMPLE 1
Forth 1
Java 1
Kaleidoscope 1
Lua 1
OCC 1
Pascal 1
Perl 1
R 1
SML 1
Scala 1
TinyC 1
VB 1
Requirements:

Here are the requirements thus far:

  1. It will be a statically typed, imperative language of the ALGOL/Pascal/C++ family.
  2. It shall be embeddable in your application.
  3. Seamless integration with C++ will be provided with full round-trip calls to and from C++.
  4. There will be zero runtime overhead and zero requirements/dependencies apart from the C++ standard lib and the LLVM back-end.
  5. It shall target LLVM with its JIT support.
  6. It shall be a practical tool, not a toy for instructional purposes only.

34 Responses to “Build a Compiler, What to Build?”

  1. OvermindDL1 says:

    Lisp/Scheme is very powerful and rather easy to set up, but it will not take full advantage of Spirit due to the simplicity of its parser (a boon for Lisp, but a negative for demonstrating the power of Spirit). It is still possible, but we would have to make some new terminals specifically for it (yes, have to).
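    To make the “simple parser” point concrete, an entire S-expression reader fits in a few dozen lines of plain C++ with no Spirit at all. The sketch below is purely illustrative; the `Value` layout and `parse` signature are assumptions, not a proposal for the project:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A Lisp value is either an atom or a list of values: that is the
// entire grammar, which is why a Lisp reader exercises so little of Spirit.
struct Value {
    bool is_atom = false;
    std::string atom;           // meaningful when is_atom is true
    std::vector<Value> list;    // meaningful when is_atom is false
};

// Parse one S-expression starting at src[i]; advances i past it.
Value parse(const std::string& src, std::size_t& i) {
    while (i < src.size() && std::isspace((unsigned char)src[i])) ++i;
    Value v;
    if (i < src.size() && src[i] == '(') {      // list: recurse until ')'
        ++i;
        for (;;) {
            while (i < src.size() && std::isspace((unsigned char)src[i])) ++i;
            if (i >= src.size() || src[i] == ')') { ++i; break; }
            v.list.push_back(parse(src, i));
        }
    } else {                                    // atom: read until a delimiter
        v.is_atom = true;
        while (i < src.size() && !std::isspace((unsigned char)src[i]) &&
               src[i] != '(' && src[i] != ')')
            v.atom += src[i++];
    }
    return v;
}
```

    Parsing `(and (+ x 3) 17)` yields a three-element list whose second element is itself a list, which is essentially all the structure the language has.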

    Go has a very well defined language spec and looks quite fascinating. It also has a lot of modern features and still is C++’ish in its syntax while being *VASTLY* easier to parse.

    Javascript is easy’ish to parse, but requires a bit of look-ahead, and its dynamic typing is *extremely* dynamic, making it very hard to JIT, so probably not a good one to attempt either.

    Python would be fascinating to parse, but the dynamic typing again makes it very difficult to JIT.

    Kaleidoscope (I am assuming you mean the LLVM example language) would be very easy to put in Spirit (I already mostly have). But the final page of the tutorial gives a lot of recommendations for enhancements to the language (a type system, and a lot more), so if we went with it we should continue with those enhancements and recommendations as well, instead of just stopping where the LLVM tutorial does (which is very easy up to that point).

    The rest of the 1’s I shall mostly ignore, although I do not really like the looks of them anyway.

    And I have never really looked closely at:
    Haskell, SPECS, Chapel, Clojure, GIMPLE, FORTH, OCC, and R

  2. s1n says:

    Given the type safety requirement, I think there are a number of languages on that list that can be eliminated. The practicality requirement can eliminate Kaleidoscope and put a higher emphasis on certain languages (such as C++ Subset, Haskell, or Go).

  3. OvermindDL1 says:

    The requirements were just added, so commenting on them as well.

    1. It will be a statically typed, imperative language of the ALGOL/Pascal/C++ family.
    I agree. Even people who use dynamic languages tend to use them as if they were statically typed, just without the safety.

    2. It shall be embeddable in your application.
    Definitely

    3. Seamless integration with C++ will be provided with full round-trip calls to and from C++.
    Again, definitely.
    4. There will be zero runtime overhead and zero requirements/dependencies apart from the C++ standard lib.
    Yes, plus LLVM and such for the back-end…

    5. It shall target LLVM with interpreter and JIT support.
    You should *NOT* target the LLVM interpreter: it was only a testing scaffold from before the JIT existed. The interpreter does not work on the current version of LLVM and they plan to remove it soon; you should only target the JIT.

    6. It shall be a practical tool, not a toy for instructional purposes only.
    Again, definitely. I think Go would be useful in that regard, and people do want an LLVM version of it.

    • Joel de Guzman says:

      Updated.

      Now I wonder about the size of LLVM and the possibility of a zero-overhead interpreter as an optional back-end in addition to LLVM (i.e. evaluating the AST — just make the AST some form of SExpr). This makes me wonder too about having a Scheme/LISP-ish language as a bootstrap. Hmmm… talk about embedding a language in an embedded language. (!)

      Actually, I’ve long been toying with the idea of embedding an interpreter in the compiler.
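      The “just evaluate the AST” option is small enough to sketch: a tree-walking interpreter over an SExpr-like node, with no LLVM dependency at all. The `Node` layout and operator set below are assumptions for illustration, not the eventual design:

```cpp
#include <cstddef>
#include <vector>

// One AST node, SExpr style: a literal or an operator applied to children.
struct Node {
    char op = 0;                // 0 means integer literal
    int value = 0;              // used when op == 0
    std::vector<Node> kids;     // operands when op != 0
};

// Zero-dependency evaluation: walk the tree, fold children left to right.
// No LLVM, no codegen; the "optional interpreter back-end" in miniature.
int eval(const Node& n) {
    if (n.op == 0) return n.value;
    int acc = eval(n.kids[0]);
    for (std::size_t i = 1; i < n.kids.size(); ++i) {
        int rhs = eval(n.kids[i]);
        if      (n.op == '+') acc += rhs;
        else if (n.op == '-') acc -= rhs;
        else if (n.op == '*') acc *= rhs;
    }
    return acc;
}
```

      Evaluating `(+ (* 2 3) 4)` with this walker returns 10; a real version would sit behind the same front-end as the LLVM back-end, selected when embedding.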

      • OvermindDL1 says:

        That is definitely one of the perks of a LISP’ish language: it can practically become anything. I had started making a LISP’ish language (before school got too busy and killed my time; so glad I finally graduated), but strongly typed (with *much* more sensible C-like names for functions, instead of crap like car/cdr/etc…). It really is easier than you may think. The only really difficult part of the whole thing was the type resolution system: I made it so you could define functions that could be matched at any point, which definitely made the language more unique, with more constructs that made more sense for their use, but also definitely more difficult. For example:

        Lisp style:  (and (+ x 3) 17)
        Mine:  ((x + 3) and 17)
        

        Or you could still write mine in the Lisp style too, depending on how the functions were defined. The “+” is a symbol, the x is an integer variable, and the 3 is an integral constant, so it would look for a function that takes an integer, a symbol of “+”, and an integer. It finds that function and sees that it returns an integer, so it then pattern-matches for the next function that takes an integer, the symbol “and”, and an integer constant, and it matches the function that takes an integer, a symbol of “and”, and an integer. That was mostly an experimentation, but it worked. Pure LISP, though, is a lot easier to handle in that way, since everything is basically a simple function look-up by identifier.
        I did have an interpreter in mine, just to make sure the JIT ran the same way: a simple tree traversal (lots of visitors/variants/etc…). My interpreter was *worlds* faster than the LLVM interpreter, but the JIT was still *worlds* faster than my interpreter.
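        The (type, symbol, type) matching described above can be approximated with a signature-keyed table. This toy sketch is a guess at the scheme, not the original code; the type strings, the `Resolver` name, and the choice of bitwise AND for “and” are all made up:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <tuple>

// Key: (left operand type, infix symbol, right operand type),
// mirroring the "integer, symbol, integer" lookup described above.
using Signature = std::tuple<std::string, std::string, std::string>;

struct Resolver {
    std::map<Signature, std::function<int(int, int)>> table;

    // Resolve the infix form (lhs sym rhs) by exact signature match;
    // the result type would feed the next left-to-right match.
    int apply(int lhs, const std::string& sym, int rhs) const {
        auto it = table.find(Signature{"int", sym, "int"});
        assert(it != table.end() && "no matching function");
        return it->second(lhs, rhs);
    }
};
```

        With `+` and `and` registered over integers, `((x + 3) and 17)` becomes two chained lookups, each keyed by the result type of the previous match.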

        • Joel de Guzman says:

          Interesting approach!

          • OvermindDL1 says:

            If we remove some of the ‘super’ features of LISP by making it more directly compiled, it would vastly simplify parsing.

            I worked around that by JIT’ing functions as I found them. In my version I did not quite have the power of the LISP-style ‘frames’ for things like actually jumping between function internals (their form of exception handling, but we could make that more explicit if necessary to add it back in). Removing that *one* feature (a big feature for LISP admittedly, but still not necessary) vastly simplified the compilation complexity.

            I like LISP, a lot actually, but there are still things I would change about it, like making it statically typed (although I still added in a variant type, modeled on Boost.Variant actually; and after reading the Go language specs, I came up with an Any type that could work as well within the language).

      • Jason says:

        Pretty cool! How about a statically typed lisp/sexp based language:

        eg.

        (defun add (x::int y::int)
          (let (z::int)
            (setf z (+ x y))
            z))

        as an AST middle layer that translates directly to LLVM bitcode. If the LLVM interpreter is too heavy, a basic interpreter for said AST wouldn’t be too difficult to tack on top of the JIT for when fast machine code isn’t needed. You would also get the nice ability to skin any syntax you desire onto the AST via Spirit once it’s set up… which would be pretty awesome!

        Jason

        • OvermindDL1 says:

          Yep, that is exactly what I had been proposing. I like statically typed languages: fewer issues.

          Although no need to keep annoyances like setf, if it is going to be redesigned, may as well give it decent wording (no crap like cdr and such either).

  4. Troy Straszheim says:

    Let me play devil’s advocate here a minute.

    A statically typed imperative language, embeddable in your C++ application, that integrates seamlessly with C++… why not use C++ itself? I get lost when combining ‘statically typed’ with ’embeddable’ and ‘targets LLVM’. Am I missing something, or are there conflicting design goals? Why would I embed a PASCAL that is generating and jitting LLVM IR in my C++ application? And it is going to call back into the application that it is embedded in?

    So I’ll argue the following: If you want to build a compiler, build a compiler.
    I’ll also argue that you should do something that makes a strong case for spirit and boost from the quality of its implementation of language L compared to other implementations of L. By quality of implementation I mean lines of code, performance, etc. People who are serious about using spirit for large compiler projects *will* make this comparison to whatever language you implement. If you can say “we get better performance in 50% as many LOC” you’ll have made this case convincingly. So my first argument is that these strategic ‘metacriteria’ should have more weight than the bucket o’ requirements.

    C++ is clearly too hard. The next best thing: D is the closest thing to C++ that exists and has a context-free grammar. D has an implementation (in C++) that you can examine to gain confidence that you’re going to get your “our way works better” goals before you invest a lot of time.

    D1’s frontend is 95k lines of C++ (no boost, no STL), 10k lines of which is “longhand” parsing and lexing; I bet the parse/lex stuff could be done in 10% as many lines. The backend is another 91k lines that would be mostly offloaded to LLVM. And you wouldn’t have to define any new languages or subsets of languages.

    • Joel de Guzman says:

      “A statically typed imperative language, embeddable in your C++ application, that integrates seamlessly with C++… why not use C++ itself?”

      Short answer: I want a compiled scripting language.

      Here’s a scenario (I’m sure there are others): someone wants a high-performance real-time application, written in Visual Studio C++, deployed with an embedded scripting language to make it extensible. They want the embedded language to be 1) as fast as C++ (no slow Python here) and 2) without the complexity of C++: something simple enough to be learned, possibly by non-programmers.

      “Why would I embed a PASCAL that is generating and jitting LLVM IR in my C++ application?”

      Because you want it to be deployed like a scripting language for the purpose of extending the application at runtime with user-supplied code. Change “PASCAL” to Your-Favorite-Scripting-Language here and you see what I mean. Imagine, say, a graphics application with a scriptable plug-in architecture, or a program like Autocad, where you start with a minimal extensible core and a DSL like AutoLISP, but with sheer frightening speed.

      “And it is going to call back into the application that it is embedded in?”

      Yes. It should be 2-way, through a specific controlled environment for security reasons. You only expose what’s necessary for the application.

      • Troy Straszheim says:

        Fascinating. I follow you now. Instead of embedding an interpreter (as is typical), you’re embedding a compiler.

        Your statically-typed, compiled “scripting” language will be fascinating, I’m sure… I watch with interest…

  5. Alex Ott says:

    I think that implementing C++ would be a nightmare. Lisp/Scheme/Clojure/… would be much simpler (I have a Lisp-like DSL implemented in ~1000 lines of code in Spirit.Classic), but would still show most of Spirit’s functionality….

  6. Troy Straszheim says:

    I wasn’t advocating C++; dunno if you missed that.

  7. Jonas Persson says:

    If it’s not too late to propose another language, then Felix is an interesting statically typed, ALGOL-family, C++-compatible scripting language that may fit your requirements well.

  8. Michael Caisse says:

    I personally would find Javascript useful.

  9. Felipe Magno de Almeida says:

    How about a new language which is value-based, statically-typed with generic programming, but simpler than C++ (context-free for example)?

    • Joel de Guzman says:

      Any particular language you have in mind?

      • Felipe Magno de Almeida says:

        Unfortunately no.
        But I envision something easier to work with generic-programming and “elements of programming” programming style.
        Though I understand that it may be too much work to design *and* write a compiler.

  10. Michael S. says:

    I would go for C++. And I do not even need a compiler for it. I believe the C++ community needs to (finally) see a C++ parser that covers the entire language 100% (incl C++0x thank you) and can lay the foundation for a lot of useful tools.

    If Spirit can show that it can cover parsing of C++ (which is not an easy grammar) then I believe it will show how powerful it really is.

    /Michael

    PS. My priority of tools using such a C++ parser (and I believe that the grammar’s AST should also cover whitespace and comments, as these are also important):

    * C++ Layout Checker – checks source code against a specific layout style.
    * C++ Layout Converter – converts (forces) the source code layout to a specified layout style.
    * C++ Analyzer – static analysis, dependency analysis, etc.
    * C++ Transformation – refactoring, optimizer, etc.
    * …
    * C++ Compiler

  11. Korval says:

    “It shall be a practical tool, not a toy for instructional purposes only.”

    I thought the purpose of this project was to show how to actually *use* Spirit. Many of the uses of a general-purpose parsing framework involve parsing a real programming language. So having something more complicated than the “calc” tutorials would be very useful.

    However, I just don’t see who would actually use the language you’re poking about with. And spending time to make it into something useful would divert attention from its primary purpose: showing how to use Spirit. All of the LLVM stuff sounds nice and all, but in terms of actually showing someone how to go about parsing a language, it is 100% useless.

    • Korval says:

      I would also strongly advise that the code be structured to generate data structures, rather than doing what that one “calc” compiler did, which directly generated opcodes. It would also allow you to exercise Karma, by using Karma to convert the data structures to an opcode sequence.
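      The “data structures first, opcodes second” split might look like the sketch below. The `Expr` node and the stack-machine opcode strings are invented for illustration; the second half is exactly the part a generator like Karma could take over:

```cpp
#include <string>
#include <vector>

// Expression node as the parser would build it: a data structure first,
// with no opcodes anywhere in the grammar actions.
struct Expr {
    char op = 0;                // 0 = integer literal
    int value = 0;
    std::vector<Expr> kids;
};

// A separate pass linearizes the tree into stack-machine opcodes:
// the half of the job that Karma (or any generator) would own.
void emit(const Expr& e, std::vector<std::string>& out) {
    if (e.op == 0) {
        out.push_back("push " + std::to_string(e.value));
        return;
    }
    for (const Expr& k : e.kids) emit(k, out);
    out.push_back(std::string("op ") + e.op);
}
```

      Keeping `emit` out of the grammar means the parser’s output can be reused by any number of generators, which is the point of the suggestion.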

    • Joel de Guzman says:

      100% disagree. It will be useful, at least for me, and I am sure for others as well. Case in point: Niklaus Wirth intended Pascal for instructional purposes, yet at the same time it is a practical language. I don’t agree that instruction and practicality are mutually exclusive goals.

      • Korval says:

        OK, let’s say I’m trying to use Spirit to parse a programming language. And I see this example of exactly that. And I take a look at it.

        What good to me is any of the code that deals with LLVM? I’m not using LLVM, so it just adds complexity to what I’m actually interested in: the *parser*.

        What Spirit 2.1 needs isn’t an LLVM compiler for some language that happens to use Spirit. What Spirit 2.1 needs the most is a fully-functioning example of how to use Spirit to parse and build complex data structures from languages (and use Karma to un-build them) with good quality error messages that point directly to the line number that generated the error. A compiler is a good project that needs to do that.

        The specifics of what language it is, using LLVM to output that language, etc are irrelevant. What we need is to have a real, live, definitive example that says “*This* is how you use Spirit to get stuff done,” that answers everyone’s questions about how to use Qi, Karma, and Lex.

        *Time* is what is mutually exclusive. And every minute spent on LLVM or whatever is a minute *not* spent on getting that definitive example finished. The sooner this example is completed and made available, the better off Spirit will be. And taking the time to make this not a “toy implementation” is not time that will help people who actually use the example to write the code they’re actually interested in.

        I think you’re getting too caught up in the coolness of writing a functioning language compiler, and forgetting the reason for starting this project to begin with.

        • Joel de Guzman says:

          Well said. Indeed, time is crucial! All the more reason why LLVM is a good target: we do not have to worry about the backend, since everything is done for you. I want a practical example; we need not argue about that. LLVM happens to be a practical target for compiler writers in this day and age. I can’t even imagine having to write by hand what these guys give for free.

          What good to me is any of the code that deals with LLVM? I’m not using LLVM, so it just adds complexity to what I’m actually interested in: the *parser*.

          If it’s just the parser you are interested in, then this “Build a compiler” series is not for you; as the title makes very clear, we are building a compiler. What Spirit needs are examples, and lots of ’em, covering a wide spectrum. This is just one of them.

          I think you’re getting too caught up in the coolness of writing a functioning language compiler, and forgetting the reason for starting this project to begin with.

          Well, 1) Cool is good. I don’t want yet another boring example. 2) The focus of the project is to build a practical compiler. I don’t think I am deviating from that goal by choosing LLVM. On the contrary, I consider building your own compiler backend by hand impractical. That is a strong message that I want to advocate.

        • OvermindDL1 says:

          For note, some of us are already pretty good with LLVM (I am one such person), so it should not affect time that much, since there will be multiple people working on various parts.

  12. coach says:

    C++ all the way. If spirit can parse C++ it will show how powerful it really is.

  13. Yee Wang says:

    I propose:
    a generic EBNF parser in Qi and then a Karma to write out the C++ source code using Qi to parse the language the EBNF describes.

    • Wieger says:

      Yes, that would be very useful. I have several grammars in EBNF that I would like to automatically convert to Spirit. Currently I’m relying on python scripts to do things like that.

  14. Ignatich says:

    Since C-style syntax is the most widespread, I think C or a subset of C++ would be a great choice. D and Go would be nice choices too.

    Also, a static code analyzer (like Microsoft PREfast or VCC) for C/C++ would qualify both as a great demonstration of Spirit 2 and as an extremely useful tool.

  15. Matthew says:

    I would love to see a SPECS compiler. I recently discovered the syntax, and while I think some things need to be changed–mostly to take advantage of C++11 features, which I think could be expressed better by a syntax independent of C–it’s refreshing to see a syntactical re-imagining of C++ that doesn’t try to add the sorts of language features that slow down runtime and defeat the whole purpose of using C++.

  16. Austin says:

    URL for SPECS?

