Saturday, September 16, 2006

Compiler issues

I have been working on a compiler for Hyper since February 2004. It is difficult and a lot of work, but if I ever want a usable language I will need one. And of course no one else will write it for me.

The language I am using to write the compiler is C++. C++ is one of the languages I like most, and writing the compiler in Hyper itself is unfortunately not possible. It would be great if it were, but it isn't. This is the bootstrapping problem: you cannot write the very first compiler for language X in language X, because nothing exists yet that can compile or run that source. Unless, of course, you already have an interpreter for language X. But for Hyper that's not the case.

The compiler is already open source, under the GPL. It is not publicly downloadable yet, though; for now it is only available to fellow students and people I know personally. I will make it available on my website in the near future. Using the GPL for everything will have to change: I will need the LGPL for the runtime library, and some other, less restrictive license (maybe the BSD license?) for the class library. The GPL is OK for the compiler itself; any modified version of the compiler (or anything derived from it) will also need to be released under the GPL. The runtime library needs the LGPL because it must be linkable into proprietary programs. And the class library needs a less restrictive license, to allow proprietary classes to be derived from classes in the library. Because inheritance counts as 'making a derivative work', the GPL and LGPL are not an option here. (Someone correct me if I am wrong about this.)

The compiler for Hyper is still only a front end: it checks its input files for errors, but does not yet generate an executable for valid source files. The task of writing a back end for code generation still lies ahead of me. I will not write my own code generators for every machine architecture that exists. This leaves me two options: (1) let the compiler generate source code in another language, most likely C++; (2) use a library or source code from another project for code generation. Of these two, I have a slight preference for the second option. For that option there are again two possibilities. The first is writing my compiler as yet another front end for GCC. The other is to use LLVM, an open-source compiler infrastructure. I strongly prefer LLVM. An important reason is that LLVM is written in C++ (as is the front end of my compiler), while GCC is written in C. I am more familiar with C++ than I am with C, and I consider C++ a better choice. Also, in my opinion, the GCC source code is difficult to understand and contains lots of macros.
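To make option (1) a bit more concrete, here is a minimal sketch of a back end that emits C++ source text from a checked syntax tree. Everything in it is an assumption made for illustration: the `Expr` node type and the `number`, `binary`, `emitExpr` and `emitProgram` helpers are hypothetical and are not taken from the Hyper compiler, and the tree covers only integer addition and multiplication.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical expression node; not the Hyper compiler's real AST.
struct Expr {
    enum class Kind { Number, Add, Mul };
    Kind kind = Kind::Number;
    int value = 0;                   // used when kind == Number
    std::unique_ptr<Expr> lhs, rhs;  // used when kind is Add or Mul
};

// Build a leaf node holding an integer literal.
std::unique_ptr<Expr> number(int v) {
    auto e = std::make_unique<Expr>();
    e->value = v;
    return e;
}

// Build an Add or Mul node from two operands.
std::unique_ptr<Expr> binary(Expr::Kind k,
                             std::unique_ptr<Expr> l,
                             std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->kind = k;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Translate one expression into C++ source text, fully parenthesised
// so that operator precedence in the target language cannot matter.
std::string emitExpr(const Expr& e) {
    switch (e.kind) {
    case Expr::Kind::Number:
        return std::to_string(e.value);
    case Expr::Kind::Add:
    case Expr::Kind::Mul: {
        assert(e.lhs && e.rhs);  // binary nodes must have both operands
        const char* op = (e.kind == Expr::Kind::Add) ? " + " : " * ";
        return "(" + emitExpr(*e.lhs) + op + emitExpr(*e.rhs) + ")";
    }
    }
    return "";  // unreachable for valid trees
}

// Wrap the expression in a complete C++ program that prints its value.
std::string emitProgram(const Expr& e) {
    return "#include <iostream>\nint main() { std::cout << "
           + emitExpr(e) + " << '\\n'; }\n";
}
```

Calling `emitProgram` on a tree and writing the resulting string to a file gives a source file that an existing C++ compiler can turn into an executable. That is the appeal of this route: the existing compiler handles every target architecture. The price is that the generated code is only as good as the mapping from Hyper constructs to C++ constructs.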

Another issue I will have to deal with: some time ago I switched to CMake for building the compiler. LLVM uses the GNU autotools for building. I will need to find a way to let those two build systems cooperate.

The compiler development is progressing well. The compiler accepts a simple subset of the language (so without inheritance, interfaces, generic programming, etc.) and it already does most of the semantic checking that needs to be done. But this does not come without some complexity: I do a regular line count on the compiler sources, and today the number of lines of code (headers + implementation) exceeded 30,000. It surprises even me that I already have such a large codebase. Some refactoring could make the number drop a bit, but many language features still need to be added, so you can expect the number to rise even more.
