Sunday, June 15, 2008

About the 0.4.0 release and the back end

The compiler had a new release last month, with some big improvements. Because of that, the version number did not go up to 0.3.39 but jumped to 0.4.0.

I did a major internal restructuring of the compiler's semantic checking, as I wrote earlier. These changes allow me to detect circular dependencies between classes, and fix some nasty bugs. I no longer keep a separate symbol table; the symbol names and symbol lookup are now embedded in the AST data structures directly.
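To give an idea of what detecting circular dependencies involves, here is a minimal sketch in Python (not the compiler's own code). It assumes the dependencies have already been collected into a plain dictionary mapping each class name to the names it depends on; that dictionary, the function name find_cycle, and the example class names are all made up for illustration. The technique is an ordinary depth-first search that reports a cycle when it runs into a class that is still being visited.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of class names, or None.

    `deps` is a hypothetical stand-in for information the compiler would
    read from its AST: {class_name: [names of classes it depends on]}.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {name: UNVISITED for name in deps}
    path = []  # classes on the current depth-first search path

    def visit(name):
        state[name] = IN_PROGRESS
        path.append(name)
        for dep in deps.get(name, ()):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path: we found a cycle.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[name] = DONE
        return None

    for name in deps:
        if state[name] == UNVISITED:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Made-up example: Engine -> Car -> Wheel -> Engine is circular.
    example = {"Car": ["Wheel"], "Wheel": ["Engine"], "Engine": ["Car"]}
    print(find_cycle(example))
```

Returning the whole cycle rather than just a yes/no answer makes it possible to name the classes involved in an error message.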

And the compiler now finally has full Unicode support! Source files are assumed to be UTF-8, because that is the only encoding supported at this time. UTF-8 also has the advantage that it's a superset of ASCII, so you can open a source file with a dumb text editor that doesn't know about Unicode and still have readable source, apart from the non-ASCII characters, which won't be displayed correctly. Other encodings like UTF-16 are on my to-do list, but do not have a very high priority. And though the compiler now has Unicode support, the language itself still needs to be partially adapted to it. The string type, for instance, will need a different interface because it will use UTF-8: the random access (array) operator needs to go, since characters no longer have a fixed width in bytes, and a replacement mechanism must be provided to iterate over the characters in the string.
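To illustrate why indexing into a UTF-8 string is awkward and what iterating over its characters looks like instead, here is a small Python sketch. It is not the language's future string interface, which has not been designed yet; the name iter_code_points is made up, and the decoder is deliberately minimal (it checks lead and continuation bytes but does not reject overlong encodings or surrogates).

```python
def iter_code_points(data):
    """Yield the Unicode code points stored in a UTF-8 byte string.

    Hypothetical helper for illustration only. Each character takes
    1 to 4 bytes, so the only way to reach the next character is to
    decode the current one and skip over its bytes.
    """
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                # 0xxxxxxx: plain ASCII, 1 byte
            length, cp = 1, lead
        elif lead >> 5 == 0b110:       # 110xxxxx: 2-byte sequence
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:      # 1110xxxx: 3-byte sequence
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:     # 11110xxx: 4-byte sequence
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1:i + length]:
            if b >> 6 != 0b10:         # continuation bytes are 10xxxxxx
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += length


if __name__ == "__main__":
    encoded = "naïve €".encode("utf-8")
    # 7 characters, but 10 bytes: encoded[2] is only half of the 'ï'.
    print(len(encoded), [hex(cp) for cp in iter_code_points(encoded)])
```

The point of the example is that the byte at index 2 is not the third character but only the first half of it, which is exactly why a plain array operator on such a string would be misleading.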

The version of the compiler with a back end has gotten some of my attention as well. It has inherited the new features from the 0.4.0 release and can do LLVM code generation for most expressions and statements, as long as they involve only the built-in types. User-defined classes, strings, floating-point types, and the 'iterate' statement are not supported yet. I have added support for printing single characters to stdout, so that I don't have to wait for full string support before I can see simple test programs working. The next thing I would like to be able to compile is the eight-queens program.

My efforts will mostly go into the compiler back end now. I would like to have complete code generation for everything the front end currently supports, and then I can merge the llvm-branch into the main-branch so that the main release is no longer front-end only. But I guess I will need to implement a solution for the iterate problem first.

The compiler sources are getting pretty big now. The main branch now has almost 44000 lines of code, and the llvm branch has about 49000 lines (not counting the LLVM sources, of course). That's a simple count, including header files, blank lines and comments. It's still quite impressive to me, though. My biggest project ever :-)