Sunday, December 09, 2007

Status update & future directions

It has been some time since I last wrote here. In the meantime there have been two compiler releases, 0.3.37 and 0.3.38. The most important new features in those releases are:
  • double pointers (pointer to pointer)
  • pointer L-values and class L-values
  • sorted diagnostics, with common directory prefix printed separately
  • a new type of comment
  • first steps of platform detection (32-bit or 64-bit)
The new comment type was introduced because of an annoying property of the existing line comments. Suppose you have a small procedure (one or two lines) that you want to comment out. The fastest way is to put a hash character ('#') in front of each line. But this does not work if one of the lines already has a line comment behind it: the hash mark in front of the line cancels out the comment that was already there, so that text is no longer treated as a comment. So I created a new type of line comment that simply reaches until the end of the line, ignoring any hash characters that are already there. You just type two hash characters (with no space in between). Isn't it simple?
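
The difference between the two comment types can be sketched in a few lines of Python. This is only an illustration, not the compiler's actual lexer: it assumes that a single '#' comment is ended by the next '#' on the same line (which is what "cancels out" above suggests), it ignores string literals, and the name strip_comments is made up for the example.

```python
def strip_comments(line: str) -> str:
    """Return the code part of one source line (illustrative sketch only)."""
    code = []
    in_comment = False
    i = 0
    while i < len(line):
        # A double hash outside a comment hides the rest of the line,
        # no matter how many hash characters follow.
        if not in_comment and line.startswith("##", i):
            break
        if line[i] == "#":
            # Assumption: a single hash toggles comment mode on and off.
            in_comment = not in_comment
        elif not in_comment:
            code.append(line[i])
        i += 1
    return "".join(code)


print(repr(strip_comments("x = 1 # note")))     # 'x = 1 '
print(repr(strip_comments("# x = 1 # note")))   # ' note'  (old comment leaks out)
print(repr(strip_comments("## x = 1 # note")))  # ''       (whole line is hidden)
```

The second call shows the problem described above, and the third shows how the double hash avoids it.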

The platform detection that was added is limited. If you have an amd64 (or derived) CPU, it assumes the HOST platform is 64-bit. If it is not, and you have an x86 or derived CPU, the HOST is 32-bit. And otherwise it produces an error, because your platform is not supported yet. Of course, support for more architectures will come later.
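
The decision described above is small enough to sketch directly. The Python below is a rough illustration under assumptions, not the compiler's real code: the function name host_bits and the exact lists of CPU names are invented for the example, and the real detection may look at different information.

```python
import platform

# Assumed spellings of the CPU names; the real compiler may use others.
AMD64_NAMES = {"amd64", "x86_64"}
X86_NAMES = {"x86", "i386", "i486", "i586", "i686"}


def host_bits(machine: str) -> int:
    """Map a CPU name to the HOST word size, or fail if it is unsupported."""
    name = machine.lower()
    if name in AMD64_NAMES:
        return 64
    if name in X86_NAMES:
        return 32
    raise ValueError(f"platform not supported yet: {machine}")


if __name__ == "__main__":
    try:
        print(host_bits(platform.machine()))
    except ValueError as err:
        print(err)
```

Supporting another architecture would then only mean adding one more set of names and one more branch.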

I have decided to require that all source files be in UTF-8 format, unless specified otherwise (by a magic number in the file). This has not been implemented yet. The char type will most likely be represented by a 32-bit number (UTF-32/UCS-4). And string will probably use UTF-8, which unfortunately requires me to remove the random access functionality, because a character in UTF-8 does not have a fixed width in bytes.
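
To see why random access is lost with UTF-8 strings, here is a small Python sketch. It is not taken from the compiler; the helper names utf8_char_len and nth_char are made up, and the input is assumed to be valid UTF-8. Finding the n-th character means walking over all earlier characters, because each one takes between 1 and 4 bytes.

```python
def utf8_char_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def nth_char(data: bytes, n: int) -> str:
    """Return character number n (0-based) by scanning from the start."""
    pos = 0
    for _ in range(n):
        if pos >= len(data):
            raise IndexError("character index out of range")
        pos += utf8_char_len(data[pos])
    if pos >= len(data):
        raise IndexError("character index out of range")
    end = pos + utf8_char_len(data[pos])
    return data[pos:end].decode("utf-8")


text = "héllo".encode("utf-8")
print(len(text))          # 6 bytes for 5 characters
print(nth_char(text, 1))  # é
print(nth_char(text, 2))  # l
```

With a 32-bit char, by contrast, character number n always sits at byte offset 4 * n.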

As you probably already know, I am working on a compiler with a back end. Most basic features are implemented, such as statements and expressions, integral types and arrays. Some exceptions are the iterate statement and the chained comparison expression. I would like to see some simple programs compiled completely, but that is not possible yet. A simple program could calculate something and then print the result. But: I haven't implemented standard output yet, and I haven't implemented string types yet. The standard output only accepts strings, so strings are a dependency in this case. And if I have to implement strings, I need the char type supported as well.

Unfortunately there are some things I will have to change in the current implementation. I will need to change how the symbol table works, because it needs to support source files importing each other, making symbols visible and invisible again, etc. The current operator overloading mechanism for binary operators needs to be changed as well, because it currently requires maintaining a global list of all binops. So I will need to change their semantics; looking up a binary operator should be possible by looking at the operands instead of a global list.
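
One possible shape for operand-based lookup is sketched below in Python. It is a guess at the idea, not the planned design: the Type class, declare_binop and lookup_binop are all hypothetical names, and it assumes each type keeps its own table of the operators declared on it.

```python
class Type:
    """A type that owns the binary operators declared on it (hypothetical)."""

    def __init__(self, name: str):
        self.name = name
        # (operator, left type name, right type name) -> result Type
        self.binops = {}

    def declare_binop(self, op: str, left: "Type", right: "Type", result: "Type"):
        self.binops[(op, left.name, right.name)] = result


def lookup_binop(op: str, left: Type, right: Type) -> Type:
    """Find the result type by asking the two operand types only."""
    key = (op, left.name, right.name)
    for owner in (left, right):
        if key in owner.binops:
            return owner.binops[key]
    raise LookupError(f"no operator {op} for {left.name} and {right.name}")


int_t = Type("int")
vector_t = Type("vector")
# Declared on vector, but usable with an int on the left-hand side.
vector_t.declare_binop("*", int_t, vector_t, vector_t)
print(lookup_binop("*", int_t, vector_t).name)  # vector
```

Since only the two operand types are consulted, no global list of binops has to be maintained.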

The compiler's semantic processing will need to be restructured. This is how the front end currently works, in 4 phases:
  1. Reading the source from file, lexing, parsing and AST building.
  2. Doing resolve1 recursively: looking up type names and calculating compile-time array sizes
  3. Doing resolve2 recursively: checking for duplicate overloads and creating default copy constructors
  4. Doing resolve3 recursively: checking all other semantics
I have thought of a better way of doing things. The first phase can remain, of course, but the other three need to be changed. I would create a single phase for all semantics. Each AST member would have two functions for semantic checking: one to check interface semantics only, and one to do all checks. Every function would have to check that it is not in a circular dependency (which would require an error diagnostic, of course) and whether it has been completed before (in that case it would not do the checks again). If the compiler is resolving some code that uses another class, it would only need to resolve the interface of that other class, i.e. only procedure and constructor headers, and fields. If a procedure is called, the compiler would only need to make sure that the procedure's interface was resolved. An expression's interface is its result type plus other result properties (L-value or not, compile-time value or not, etc.) and those require a complete check, so for an expression there would be no distinction between 'interface resolve' and 'complete resolve'.
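
The single-phase idea can be sketched roughly as follows. This Python is only an illustration under assumptions: the Node class, its fields and the log list are invented for the example, and a real AST member would perform actual semantic checks where the sketch merely records a log entry.

```python
from enum import Enum


class State(Enum):
    NOT_STARTED = 0
    IN_PROGRESS = 1
    DONE = 2


class CircularDependency(Exception):
    pass


class Node:
    """Hypothetical AST member with an interface check and a complete check."""

    def __init__(self, name, interface_deps=(), body_deps=()):
        self.name = name
        self.interface_deps = list(interface_deps)  # needed by the headers
        self.body_deps = list(body_deps)            # used inside the body
        self.interface_state = State.NOT_STARTED
        self.full_state = State.NOT_STARTED

    def resolve_interface(self, log):
        if self.interface_state is State.DONE:
            return  # completed before, do not check again
        if self.interface_state is State.IN_PROGRESS:
            raise CircularDependency(self.name)
        self.interface_state = State.IN_PROGRESS
        for dep in self.interface_deps:
            dep.resolve_interface(log)
        log.append(f"interface {self.name}")
        self.interface_state = State.DONE

    def resolve_full(self, log):
        if self.full_state is State.DONE:
            return
        # In this sketch bodies only request interfaces, so this guard would
        # fire only if a complete check recursed into itself.
        if self.full_state is State.IN_PROGRESS:
            raise CircularDependency(self.name)
        self.full_state = State.IN_PROGRESS
        self.resolve_interface(log)
        for dep in self.body_deps:
            dep.resolve_interface(log)  # only the interface of what is used
        log.append(f"full {self.name}")
        self.full_state = State.DONE


a = Node("A")
b = Node("B", body_deps=[a])
log = []
b.resolve_full(log)
print(log)  # ['interface B', 'interface A', 'full B']
```

Note that resolving B completely never triggers the complete check of A; that happens only when A itself is resolved.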

There are other things to be solved as well: how to manage binary compatibility with the standard library, how to create an interface to libraries written in other languages, etc. These things will need lots of thinking.
