Wednesday, December 19, 2007

iterate revisited

I am going to talk a bit about the iterate statement in this post. If you don't know what it is, look at the web page behind the hyperlink.

The iterate statement was derived from the FOR statement from PASCAL and BASIC (and their derivatives). I had to generalize it in order to allow it being used with user-defined types instead of just numbers; my intention is to have it support the behavior of 'foreach' in some other languages. So I came up with some scheme to support the use of iterators. I defined some operators that need to be supported by the types used in an iterate statement. For example:

The variable should be initializable by the from-expression. The variable should support increment when using "step ++". The variable must be able to be compared to the to-expression. In an "iterate v : t from x to y step z" the compiler would first compare x and y, to see which one is the largest value. If x <= y then the step would be considered to be an incremental value, and otherwise a decrementing value. And there are operators needed that can decide when the to-value is reached for both cases; so you would need comparisons v <= y (for an incrementing step) and y<= v (for a decrementing step) to be available. But this scheme is not good enough. When I came up with this I assumed the use of a "begin" and "end" iterator, where the "end" iterator should point at the last element of the collection. In C++ the 'end' iterator points one past the last element. I thought that my change would not be a problem, but I forgot to think of what would happen when iterating over an empty collection. Such a collection does not have a first or last element iterator!

So I will have to change the current way of doing things. One possible way is having the 'end' iterator should be done as in C++, and point to one past the last element. This means an iterate will look as follows:
iterate v : c.iterator from c.begin to --c.end
# do something
end
It would require the programmer to remember using a "--" in front of the "end" iterator, because the "end" iterator does not point to a valid element in the collection. The iterator also must support the predecessor operator, but what does it return when used on the "end" iterator in an empty collection?

Another possibility is having iterate extract some other info from the collection before starting the loop. It could first query the iterators for the instance of the collection class they belong to (and both iterators must return the same value) and then ask the collection class whether it's non-empty and skip the loop if it's not. But this would require allowing invalid iterators returned from the collection in order to return their collection object when the collection is empty.

And a third option would be to specify the collection object explicitly in the iterate statement itself. Something like:
iterate i in c
# do something for each element
end

# - or -

iterate i in c from someStartIterator # optional "to" or "step" omitted here
# do something for each element, starting from some given iterator
end
This would need an extra keyword "in" (already silently supported by the latest compilers), and some predefined methods in the collection, required by the iterate statement for determining the non-emptyness of the collection and its start and end values. It would not change the existing iterate behaviour but only extend it.

I don't know yet what I'll choose as the solution. Maybe you'll see the change as it appears in a new compiler release, or maybe I'll write about it before I implement it.

Sunday, December 09, 2007

Status update & future directions

It has been some time since my last writing here. In the mean time there have been two compiler releases, 0.3.37 and 0.3.38. Most important new features from those are:
  • double pointers (pointer to pointer, look here)
  • pointer L-values and class L-values
  • sorted diagnostics, with common directory prefix printed separately
  • a new type of comment
  • first steps of platform detection (32 bit or 64 bit)
The new comment type was introduced because of an annoying property that the existing line comments have. Suppose you have a small procedure (one ore two lines lines) you want to comment out. The fastest way is to put a hash character ('#') in front of each line. But this does not work if one of the lines already has a line comment behind it; putting a hash mark in front of the line cancels out the comment that already was there, it will no longer be treated as comment text. So I created a new type of line comment that simply reaches until the end of the line, ignoring any hash characters that are already there. You just type two hash characters (with no space in between). Isn't it simple?

The platform detection that was added is limited. If you have an amd64 (or derived) CPU, it assumes the HOST platform is 64-bit. If it is not, and you have an x86 or derived CPU, the HOST is 32-bit. And otherwise it produces an error, because your platform is not supported yet. Of course, support for more architectures will come later.

I have decided to require all source files be in UTF-8 format, unless specified otherwise (by a magic number in the file). This has not been implemented yet. The char type will most likely be represented by a 32 bit number (UTF-32/UCS-4). And string will probably use UTF-8, which unfortunately requires me to remove the random access functionality.

As you probably already know, I am working on a compiler with back end. Most basic features are implemented, such as statements and expressions, integral types and arrays. Some exceptions are the iterate statement and the chained comparison expression. I would like to see some simple programs compiled completely, but that is not possible yet. A simple program could calculate something and then print the result. But: I haven't implemented standard output yet, and I haven't implemented string types yet. The standard output only accepts strings, so strings are a dependency in this case. And if I have to implement strings, I need the char type supported as well.

Unfortunately there are some things I will have to change in the current implementation. I will need to change how the symbol table works, because it needs to support sourcefiles importing each other, making symbols visible and invisible again, etc. The current operator overloading mechanism for binary operators needs to be changed as well, because it currently requires to maintain a global list of all binops. So I will need to change their semantics; looking up a binary operator should be possible by looking at the operands instead of a global list.

The compiler's semantic processing will need to be restructured. This is how the front end currently works, in 4 phases:
  1. Reading the source from file, lexing, parsing and AST building.
  2. Doing resolve1 recursively: looking up typenames and calculating compile time array sizes
  3. Doing resolve2 recursively: check for duplicate overloads, create default copy constructors
  4. Doing resolve3 recursively: check all other semantics
I have thought of a better way of doing things. The first phase can remain, of course, but the other three need to be changed. I would create a single phase for all semantics. Each AST member would have two functions for semantic checking: one to check interface semantics only, and one to do all checks. Every function would have to check if it isn't in a circular dependency (such would require an error diagnostic of course) and if it hasn't been completed before (in that case it would not do the checks again). If the compiler is resolving some code that uses another class, it would only need to resolve the interface of that other class, i.e. only procedure and constructor headers, and fields. If a procedure is called, it would only need to make sure its interface was resolved. An expression's interface is its result type plus other result properties (L-value or not, compile time value or not, etc.) and those require a complete check, so for an expression there would not be a distinction between 'interface resolve' and 'complete resolve'.

There are other things to be solved as well: how to manage binary compatibility with the standard library, how to create an interface to libraries written in other language etc... These things will need lots of thinking.