Dec ’09 02

I am considering writing a series of articles on compiler development using Spirit-2. It will be based on the series of BoostCon talks from 07 to 09. From the humblest calculator to a full blown programming language. Details are still sketchy at this point. All I can ascertain is that the final language shall use LLVM as a back-end. There are lots of questions I need to ask in order to get the bases covered. I’d like to solicit feedback and comments. Will it be an imperative language like C or a functional language like Scheme? I love the line oriented syntax of Python, but is a free format syntax with the all too familiar braces or begin/end blocks be better? Will it be statically typed like C++ or dynamically typed like Python or LISP? Will it be OOP? How complex or simple should it be? Remember, this is meant for instruction. Is a toy language good enough? How about basing it on a simple ISO-standard language like Pascal? Or fun languages like Javascript? How about features? Type inference? Lambda? Operator overloading? Etc., etc., etc.

Let me know what you think…

63 Responses to “Build a Compiler”

  1. Richard says:

    I’d like to see the back end target a real processor. Perhaps an 8 bit microprocessor like a 6502 or a 6809 or something more recent. (Although 6502/6809 are so well documented on the web that they are still useful as an example of a real processor.)

    Also, presumably the point of this exercise is to teach about Spirit and how to use it to implement a language, not teach language design or theory. So stick with a well known language with well known semantics. Pascal or TinyC would be my choice. They are sufficiently rich to demonstrate the necessary principles, but not so esoteric that people won’t relate to the language and the compiler implementation. They are also sufficiently simple that you don’t lose people in the scope of the full implementation.

    • Joel de Guzman says:

      Re: Targeting a real processor, I’d like to focus more on the front end. LLVM is an easy target, is modern, and can target a multitude of processors. In this day and age, you no longer directly target a specific processor. The back end itself is a whole new ballgame with all the intricacies (e.g. JIT, optimizations, etc.). In as much as you use a parser like Spirit to do your front end parsing, the best practice today, IMO, is to use another tool or library to do the back end. So, bottom line: I’d rather stick to the plan and use LLVM. Hey, it’s the 21st century! :-)

      Re: All the rest. Agreed.

      • Richard says:

        I have just spent some time reading through http://www.llvm.org to learn about LLVM. At first I was under the misimpression (inferred from its name) that it was just a virtual machine and not a native code generator. Now I know better! I’m glad I bothered to read about it because I’m working on an OSS fractal generator that has a native language for expressing various fractal iterating formulas. My intention was to rewrite the parser using Spirit. Now I see that I can get an efficient back end using LLVM, so thanks for bringing that to my attention.

  2. Carl says:

    What level of compiler writing knowledge are you expecting of your reader? If next to none then I’d first do a statically typed reserved keyword language that is mostly context-free. A toy bigger than mini-c is probably the first compiler to write, a full language might be C89 [not the lib, the language itself ]is fairly simple, for a real language I am not that familiar with C99 as I was into C++ by then:) but as I recall C89 is mostly context free.

    • Joel de Guzman says:

      C is a “dirty” language. If I were to go imperative, I’d choose a clean syntax that is *fully* context free.

  3. QPlace says:

    That will be fantastic. I am using spirit for parsing text data, but my ultimate goal is to have my own compiler for the mix of vb and javascript-like language that is used in one of the vertical industries. I am just a “user”, not a compiler specialist, and without specialized knowledge I think that such task is too big of a goal for me, but hopefully with the guidance of this planned series, I can do it.

    Ideally this series will demonstrate a workflow of taking the source code and produce something that can be executed on desktop OS. Or linked with something to produce the executable.

    • Joel de Guzman says:

      It gets better than that. How about a compiled scripting language or DSL (Domain Specific Language), with the syntax of your choosing, that you can embed in your application and can be interpreted, JIT compiled, or compiled into tight, fast, highly optimized machine code. Interested?

  4. OvermindDL1 says:

    What about Google’s language ( http://golang.org/ ). It is designed to be C-like, syntax is slightly different to make it fast and easy to parse (it does parse *fast*, would be fun to benchmark with a Spirit version), and the language is fully defined. No templates or anything sadly.

    Personally I would prefer a Common LISP style language, but that is *too* easy to parse in Spirit. ;-)

    • Joel de Guzman says:

      Hmmm interesting… Re: LISP, yeah, it’s cool. I like LISP too. Perhaps that is a low hanging fruit. Come to think of it, perhaps I shouldn’t stick to one language only.

      • bytecolor says:

        Low hanging fruit is good, yes?

        If Scheme were the target you would have a spec to work from; R5RS or 6.
        A repl with an LLVM backend would be something useful, *not* a toy.
        Numbers require a bit more parsing than most any other language.
        Same with Identifiers.
        Skipping: space, blank, custom? (cons . syntax)
        A sexp is a tree -> AST.
        Pretty printing with Karma?

        If I weren’t a poor student, I’d pay to see that ;)


        S. Edward Dolan

      • OvermindDL1 says:

        What about Go though? I think if we made an embedded LISP compiler (easy to drop into programs to use as a compiled scripting language) would be pretty awesome though, we could even go on a tangent and make an Actor variant of LISP. :p

        Go does seem useful though, a lot of people want it on LLVM, its spec is very well set, and it would get a lot of use out of it.

        • Jason says:

          A sexp based language ala Clojure would also be pretty cool! I’m partial to common lisp but that is a serious undertaking if you want to conform to the standard which also has a lot of cruft that could/should be *snipped*. Besides there are already some very good compilers for common lisp such as sbcl/clozure/ecl/etc. Maybe a common lisp subset including reader macros… that would be very cool!

  5. Descender says:

    I would be content with Tiny C but would really like C++. Everything need not be done; only enough to show all that is involved in object oriented languages.

    • Joel de Guzman says:

      Right. A simple OOP language. I think this is becoming the consensus. Perhaps incremental evolution of that base language would yield something cool and never been seen before. A good strategy is to start something simple yet practically useful by itself (read: not a toy), then let people innovate.

      • OvermindDL1 says:

        That is what I am doing in my LISP’ish language. I is only similar to LISP in syntax and its macro and read-macro support, but mine is strongly typed, concurrent, etc… Even most of the function names are different (like really, car and cdr? Who uses ancient register names to handle lists anymore). Along with complete function overload based on types (with an auto (template like) and variant support, although now that I have seen Go I may have to alter it further). I do need to completely rewrite mine now that I learned more, so it would be fun. :)

        For something with a defined Spec, I still think Go would be fascinating, and a *lot* of people out there want an llvmGo compiler.

  6. sfrehse says:

    What about the GIMPLE-Language from the internal representation of the GCC. The language is C-like.
    You can get an impression of the language by typing g++ -fdump-tree-gimple test.C.

  7. C++ is too hard and Pascal might be too “easy”. JavaScript sounds like a good idea — compile to LLVM and apply optimization strategies?

    Of course if you can pull of C++ then that would be the ultimate. ;)

  8. Alex Ott says:

    May be it will be good idea to take language, implemented in LLVM Tutorial, and re-implement parsing in Spirit?

  9. Ram says:

    Joel,
    I would say Python or Haskell
    It would be great to see some examples of source to source translations also, even a pretty printer would be a good start.
    cheers
    Ram,

    • Joel de Guzman says:

      Haskell is lovely. I look at Haskell programs like poetry. I don’t think it’s good for teaching compiler development though. Thoughts? I love Python too, but I am not quite fond of dynamic typing, to be honest.

      • OvermindDL1 says:

        Nor am I, which is why I have been making a strongly-statically-typed Lisp variant. Most bugs with dynamically typed programming languages are because of the dynamic typing, and everyone treats dynamically typed languages as static anyway, thus I see no point. I like a language where you can set a variable without needing to specify a type, based on what you assign to it (Google’s Go does that well for a C++’ish syntax).

      • Ram says:

        I think Haskell is a fairly simple language to understand. It offers a lot of more complex possibilities but fundamentally, as a language it is easy to define and explain to others. And hence is a good language to target when teaching compiler development.
        There is a plethora of examples of LISP or its variants, in general. a different functional language would be very welcome :) .

        As for Python, again, its easy to explain and hence easy to discuss teach how to write a compiler for.

        If you are willing to take on a bigger project, this might appeal to you
        http://www.csse.monash.edu.au/~damian/papers/PDF/ModestProposal.pdf

  10. Jack says:

    I know I would enjoy reading such a series.

    As to language, something like C++, Scala, Haskell, or Perl (!) would be interesting but quite a lot of effort. But at the same time, I do think it would be nice if it were not a ‘toy’ language but instead something truly useful, because (ideally) it could eventually become a used-in-the-real-world compiler instead of just a learning exercise. Languages that come to mind that might be the right fit between too hard and too easy – Google’s Go, C89, Clojure, or SML, say.

    I think using LLVM was the backend would be an excellent demonstration, since LLVM is a quite reasonable target for anyone actually implementing a new language.

  11. Jason says:

    How about the R programming language (http://www.r-project.org/)? It has the curly bracket syntax but is rather scheme-like in scope etc. It is a very commonly used language for data analysis/simulation that has been gaining a lot of traction in recent years. The interesting thing about this language is that its target market has very computationally intensive demands yet the implementation is a pure interpreter. If you could produce a repl with JIT and/or a AOT compiler for R with an llvm backend that would be marvelous!

  12. Lee R says:

    So many languages!

    I’m voting for Lua:
    http://www.lua.org/manual/5.1/manual.html#2

    • s1n says:

      I’m just learning about Lua and I would second this recommendation. It’s supposed to be a small lightweight language that easily embeds. Sounds like a good language for this project.

      • Joel de Guzman says:

        Ah Lua! Hmmm…

        • OvermindDL1 says:

          I dealt with Lua for years, implemented it as a scripting language in a variety of things. I can truly say I hate it… If you want reasons why then just ask and I will do one of my normal very long posts as to why.

          Plus, if you hate dynamic typing, you will hate Lua too. :)

  13. While supporting all of C++’s features would be an unreasonable task, SPECS is a really nice re-syntax of the language with a simpler (and freely available) grammar.

    I imagine things like templates and inheritance would fall outside the scope of this tutorial series, but a simpler C++ with a sane grammar isn’t a bad place to start.

    Other ideas: OOC (ooc-lang.org) is really nice, and Google’s Go is getting popular.

  14. C++0X says:

    Why not python?

  15. jose says:

    >It gets better than that. How about a compiled scripting language or DSL >(Domain Specific Language), with the syntax of your choosing, that you can >embed in your application and can be interpreted, JIT compiled, or compiled >into tight, fast, highly optimized machine code. Interested?

    YES !!. This would be brilliant. It would help for the DSL to be able to support class-like definitions for arbitrary types, e.g.

    myobjectidea MyNumber { attributes: number x; string locale; conditions: X < 10; … }

    • OvermindDL1 says:

      Good interoperability and intermixability with C++ is always sadly lacking with embeddable languages. Boost.Python and such helps a lot with regards to Python, but it is *so* heavyweight, especially since there is no need to be when using LLVM. Hence why I have a very light intermixing and type-safe layer between my C++ and LLVM scripting language, we should do the same. :)

      • Joel de Guzman says:

        You read my mind!

        • Jason says:

          Seamless C++ interface (i.e. no boilerplate), an llvm backend and Spirit for lexing/parsing/AST to twiddle with the syntax to ones hearts content… you had me at… “Build a Compiler” ;-) !

          One more suggestion as to syntax. Take a look at Chapel:

          http://chapel.cray.com/

          which looks pretty cool although I have not experimented with it.

  16. ww says:

    Hi,

    I would like to see a C++ compiler. It would be extremely useful to have something that is able to understand at least interfaces, so (for example) writing Mock object support for Boost Test could be supported by it. Current approaches with C++ (since we have no introspection and whatever else it is called in Java) are clumsy, hard to read, non-intuitive.

    I would also like to be able to do an “interactive compiler”, something that would be able to support an IDE that could turn code into some sort of p-code, and support template metaprogramming debugging, and “magical” refactoring automation and so on. I guess it is quite close to what LLVM does, but I want it to work instantly as I type. So if I type:

    std::string s;

    I want it to tell me: hey, for this you need to #include , even if string is included by coincidence by another header (but string is not part of the interfaces specified by those headers, only the implementation, such as a member variable)…

    • OvermindDL1 says:

      ww said:
      > I would like to see a C++ compiler. It would be extremely useful to have
      > something that is able to understand at least interfaces, so (for example)
      > writing Mock object support for Boost Test could be supported by it. Current
      > approaches with C++ (since we have no introspection and whatever else it is
      > called in Java) are clumsy, hard to read, non-intuitive.

      However, a C++ compiler is one of the most difficult compilers you can write, and would not at all be any kind of a good intro into writing a full language.

      ww said:
      I would also like to be able to do an “interactive compiler”, something that
      > would be able to support an IDE that could turn code into some sort of
      > p-code, and support template metaprogramming debugging, and “magical”
      > refactoring automation and so on. I guess it is quite close to what LLVM
      > does, but I want it to work instantly as I type. So if I type:
      >
      >std::string s;
      >
      >I want it to tell me: hey, for this you need to #include , even if string is
      > included by coincidence by another header (but string is not part of the
      > interfaces specified by those headers, only the implementation, such as a
      > member variable)…

      That is pretty near impossible because an implementation file can potentially include files from anywhere on the entire filesystem, as well as it is quite possible for their to be multiple, non-compatible definitioins of something like std::string (unlikely for this specific one, but possible). Remember, in C/C++ includes files into other files, it does not import definitions like Java does, this gives C/C++ much more power, but makes it quite literally impossible to do what you ask.

      I have no clue what “p-code” is though. :p

      As for template metaprogramming debugging and ‘magical’ refactoring automation, I find Visual Assist fills that need quite well.

  17. bytecolor says:

    I was thinking a fun project might be to create an EBNF (14977) parser that generates Qi code. You could have all sorts of fun with the ?special sequence?

    Here’s a list of all languages mentioned so far, just for the hell of it.

    C
    C++
    C++ Resyntaxed
    C++ SPECS
    C++ Subset
    Chapel
    Clojure
    Common Lisp
    GIMPLE
    Go
    Haskell
    Javascript
    Kaleidoscope (LLVM Tutorial Language)
    Lua
    OCC
    Pascal
    Perl
    Python
    Python
    R
    SML
    Scala
    Scheme
    Scheme
    TinyC

  18. Joel de Guzman says:

    Yeah! RAD Spirit

    We’re thinking about that as well (ISO EBNF). It would be nice to have it emit: 1) interpreter that can be immediately executable 2) LLVM parser code 3) C++ Spirit code. This would be solve the remaining issues with Spirit: 1) Hard to debug error messages and 2) Long compile times.

  19. Jonas Persson says:

    RAD spirit would be cool. A small language where you could connect a qi grammar and a karma grammar and have it typechecked for compatible attributes could make for a really nice and simple text transformation tool.

  20. bytecolor says:

    Joel
    What did you have in mind when you said:

    interpreter that can be immediately executable

    I’ve been working on an ISO EBNF parser. It works pretty good. I think I have all
    of the syntax parsing correctly, including the alternate symbols (/ /) / ! etc.
    I’ve got a visitor written that just writes the data structures back out as
    semi-formatted EBNF.

    ?> foo = ["this" | "that"], 4*”x”;
    foo
        = ['this' | 'that'], 4 * ‘x’
        ;
    ?> foo = (/’this’ / “that”/), 4*”x” .
    foo
        = ['this' | 'that'], 4 * ‘x’
        ;

  21. Ben Allan says:

    I’ll nominate Java, since I’d really love to see a grammar I can then use to derive open-source source-to-source tools for taking code out of Java and several other java-like languages and into C/C++ without a tool-chain dependence on llvm or a java compiler/runtime.

    Or +1 for C, and particularly for just updating Hartmut’s example from http://spirit.sourceforge.net/ as a full example of up-porting to 2.1 from 1.x. Separately, the tutorial aspects of walking a C ast, handling crufty bits in “real” languages, etc could be built on top of the updated base of parsing C.

  22. Glen says:

    Hey Joel

    I’ve only just casually come across your site, so I might not understand what you are thinking about doing, but you said:

    “C++ would be the ultimate. But that would mean spending the next N years of my life working on it ”

    Well Bjarne has just published this work which seems like a paper that might help?:
    http://www2.research.att.com/~bs/macis09.pdf

    Wouldn’t his paper make your job a heck of a lot easier? So maybe you can do it after all, waddya think? Maybe he’d even help ;)

    Anyway, the link had a security server error when I tried to open it but I saw an older version of this document (I think) so when it’s back up take a look. I found it via Bjarne’s usual papers page:

    http://www2.research.att.com/~bs/papers.html

    Thanks.

  23. Jesse says:

    I guess it comes back to the goal of your project. If the goal is to introduce Spirit by example, I think something like Scheme that is quite small yet fairly widely-known is not a bad idea.

    On the other hand, something more ALGOL like could have more appeal, since a lot of people use those kinds of languages and could see familiar constructs. The advantage might be that learners could focus less on trying to figure out the language being implemented, and more on how Spirit is used to implement familiar constructs.

    On the other other hand (am I out of hands, yet?) I don’t think anyone’s mentioned Forth. ;-)

  24. Jesse says:

    lol! I used to know a fellow (Brian Prentice) who had worked for Boeing, and he told me that whenever they got a new system the first thing he’d do was to implement a Forth for it, since as far as he was concerned that was the quickest way to bootstrap pretty much everything.

  25. Troy Straszheim says:

    There’s “D”, which was designed from the beginning to be C-like with a context-free grammar….

  26. Troy Straszheim says:

    While I’m at it, that’s from the ‘major goals for D’ section:

    - Make D substantially easier to implement a compiler for than C++.
    - Be compatible with the local C application binary interface.
    - Have a context-free grammar.

    of this page: http://www.digitalmars.com/d/2.0/overview.html

  27. kyunghoon says:

    Lisp like languages are too easy to parse, therefore would not be a very good exercise in that area. My vote is for JavaScript. Building something on par with the other speedy JavaScript engines out there would be very interesting.

Leave a Reply

preload preload preload