Wednesday, December 19, 2007

iterate revisited

I am going to talk a bit about the iterate statement in this post. If you don't know what it is, look at the web page behind the hyperlink.

The iterate statement was derived from the FOR statement from PASCAL and BASIC (and their derivatives). I had to generalize it in order to allow it being used with user-defined types instead of just numbers; my intention is to have it support the behavior of 'foreach' in some other languages. So I came up with some scheme to support the use of iterators. I defined some operators that need to be supported by the types used in an iterate statement. For example:

The variable should be initializable by the from-expression. The variable should support increment when using "step ++". The variable must be able to be compared to the to-expression. In an "iterate v : t from x to y step z" the compiler would first compare x and y, to see which one is the largest value. If x <= y then the step would be considered to be an incremental value, and otherwise a decrementing value. And there are operators needed that can decide when the to-value is reached for both cases; so you would need comparisons v <= y (for an incrementing step) and y<= v (for a decrementing step) to be available. But this scheme is not good enough. When I came up with this I assumed the use of a "begin" and "end" iterator, where the "end" iterator should point at the last element of the collection. In C++ the 'end' iterator points one past the last element. I thought that my change would not be a problem, but I forgot to think of what would happen when iterating over an empty collection. Such a collection does not have a first or last element iterator!

So I will have to change the current way of doing things. One possible way is having the 'end' iterator should be done as in C++, and point to one past the last element. This means an iterate will look as follows:
iterate v : c.iterator from c.begin to --c.end
# do something
end
It would require the programmer to remember using a "--" in front of the "end" iterator, because the "end" iterator does not point to a valid element in the collection. The iterator also must support the predecessor operator, but what does it return when used on the "end" iterator in an empty collection?

Another possibility is having iterate extract some other info from the collection before starting the loop. It could first query the iterators for the instance of the collection class they belong to (and both iterators must return the same value) and then ask the collection class whether it's non-empty and skip the loop if it's not. But this would require allowing invalid iterators returned from the collection in order to return their collection object when the collection is empty.

And a third option would be to specify the collection object explicitly in the iterate statement itself. Something like:
iterate i in c
# do something for each element
end

# - or -

iterate i in c from someStartIterator # optional "to" or "step" omitted here
# do something for each element, starting from some given iterator
end
This would need an extra keyword "in" (already silently supported by the latest compilers), and some predefined methods in the collection, required by the iterate statement for determining the non-emptyness of the collection and its start and end values. It would not change the existing iterate behaviour but only extend it.

I don't know yet what I'll choose as the solution. Maybe you'll see the change as it appears in a new compiler release, or maybe I'll write about it before I implement it.

Sunday, December 09, 2007

Status update & future directions

It has been some time since my last writing here. In the mean time there have been two compiler releases, 0.3.37 and 0.3.38. Most important new features from those are:
  • double pointers (pointer to pointer, look here)
  • pointer L-values and class L-values
  • sorted diagnostics, with common directory prefix printed separately
  • a new type of comment
  • first steps of platform detection (32 bit or 64 bit)
The new comment type was introduced because of an annoying property that the existing line comments have. Suppose you have a small procedure (one ore two lines lines) you want to comment out. The fastest way is to put a hash character ('#') in front of each line. But this does not work if one of the lines already has a line comment behind it; putting a hash mark in front of the line cancels out the comment that already was there, it will no longer be treated as comment text. So I created a new type of line comment that simply reaches until the end of the line, ignoring any hash characters that are already there. You just type two hash characters (with no space in between). Isn't it simple?

The platform detection that was added is limited. If you have an amd64 (or derived) CPU, it assumes the HOST platform is 64-bit. If it is not, and you have an x86 or derived CPU, the HOST is 32-bit. And otherwise it produces an error, because your platform is not supported yet. Of course, support for more architectures will come later.

I have decided to require all source files be in UTF-8 format, unless specified otherwise (by a magic number in the file). This has not been implemented yet. The char type will most likely be represented by a 32 bit number (UTF-32/UCS-4). And string will probably use UTF-8, which unfortunately requires me to remove the random access functionality.

As you probably already know, I am working on a compiler with back end. Most basic features are implemented, such as statements and expressions, integral types and arrays. Some exceptions are the iterate statement and the chained comparison expression. I would like to see some simple programs compiled completely, but that is not possible yet. A simple program could calculate something and then print the result. But: I haven't implemented standard output yet, and I haven't implemented string types yet. The standard output only accepts strings, so strings are a dependency in this case. And if I have to implement strings, I need the char type supported as well.

Unfortunately there are some things I will have to change in the current implementation. I will need to change how the symbol table works, because it needs to support sourcefiles importing each other, making symbols visible and invisible again, etc. The current operator overloading mechanism for binary operators needs to be changed as well, because it currently requires to maintain a global list of all binops. So I will need to change their semantics; looking up a binary operator should be possible by looking at the operands instead of a global list.

The compiler's semantic processing will need to be restructured. This is how the front end currently works, in 4 phases:
  1. Reading the source from file, lexing, parsing and AST building.
  2. Doing resolve1 recursively: looking up typenames and calculating compile time array sizes
  3. Doing resolve2 recursively: check for duplicate overloads, create default copy constructors
  4. Doing resolve3 recursively: check all other semantics
I have thought of a better way of doing things. The first phase can remain, of course, but the other three need to be changed. I would create a single phase for all semantics. Each AST member would have two functions for semantic checking: one to check interface semantics only, and one to do all checks. Every function would have to check if it isn't in a circular dependency (such would require an error diagnostic of course) and if it hasn't been completed before (in that case it would not do the checks again). If the compiler is resolving some code that uses another class, it would only need to resolve the interface of that other class, i.e. only procedure and constructor headers, and fields. If a procedure is called, it would only need to make sure its interface was resolved. An expression's interface is its result type plus other result properties (L-value or not, compile time value or not, etc.) and those require a complete check, so for an expression there would not be a distinction between 'interface resolve' and 'complete resolve'.

There are other things to be solved as well: how to manage binary compatibility with the standard library, how to create an interface to libraries written in other language etc... These things will need lots of thinking.

Tuesday, September 04, 2007

Memory management

First I would like to let you know there has been a new release of the compiler for Hyper, version 0.3.36. I did not announce that version on my blog.

Now, about memory management. As you might already now, the language will use a garbage collector. As a consequence of that, classes cannot have a destructor because it cannot be guaranteed that such a destructor would really be executed when an object instance's memory is reclaimed. When a short running program exits, the garbage collector might even not have been run. This way, the garbage collector actually serves to pretend we have an infinity amount of memory. But people don't always like it that way, they want deterministic construction/destruction for some things like resources managed by the operating system. And I agree, though Hyper is not a language for doing low level OS stuff it can be necessary to have deterministic resource management for some cases. So I will introduce a way to do that.

First we need a bit of background. There is a feature I thought I have already explained on this blog, but I can't find it anywhere. So I will describe it briefly, and write a full explanation later. This feature is called "restricted pointers", and a shorter name might be "references". Such a reference is like a pointer, except that it cannot be stored outside of the stack and the guarantee that it will never be null. References can be used to point to a temporary object on the stack. A pointer can be implicitly converted to a reference but not the other way around. Temporaries can no longer be converted to pointers but only to references. This implies that when you see a pointer it has to point to an object on the free store. References are not yet implemented in the compiler, but they will have to be in order to get headache free memory management; they prevent the use of a pointer that points to a no longer existing temporary.

I already wrote something about RAII some time ago, but I realized that it would not be good enough for safe deterministic destruction, because the user can use the object without "scope". My new solution is called "auto classes". Such a class is declared with the "auto" keyword, and it must have a destructor. The syntax for the destructor could be "procedure auto()" or "procedure ~new()", I'm not sure yet. Any instance of an "auto class" can only be stored on the stack, or in a (non-static) field of another "auto class". It cannot be created on the free store, as it would not be possible to provide a guarantee for the execution of the destructor then. An object of such a class would need to count internally how many copies of it are still live in order to decide when its resource can be released. My plan is to provide an abstract base class that implements this behaviour so that users don't have to write a reference counter for every auto class they need; they would just inherit from the base class and implement a procedure that specifies what to do when the last instance dies.

A slightly different approach would be a resource-managing class that cannot be changed. So I propose the "auto const class", an auto class that is always const, it does not have an (accessible) assignment operator and all its procedure must be const as well. Its copy constructor can be made private in order to have a RAII class that does not need a reference count because it is the only reference.

Aside from memory management I also propose a (non-auto) "const class". This type of class is always const (thus immutable) and is in the style "create and never modify", like the Java "String" class. It can only be constructed on the free store.

Monday, July 23, 2007

An extension to the type system

Why change Hyper's type system? The main reason is arrays, or more specifically the array index operator. When you have an array of 'int', the return type of the index operator needs to be a pointer to an 'int' in order for you to be able to change the numeric value in the array. (Currently this is not the case yet; the array index operator returns a plain 'int' for such an array. That will be fixed of course.) But what when you have an array of pointers to 'int'? What will be the index operator's return type then? Logically it has to be something like pointer to pointer to int, because of the extra level of indirection that is needed. After all, you should be able to change the int pointer. But at this time the language disallows multiple levels of pointers!

My first idea to solve this problem was to introduce "inout return types". It would provide an extra (hidden) reference to the actual type, as it is done for inout parameters. But I decided not to do that, as it would be a quirky solution. And then someone pointed out that the current pointer system is weird and makes source code unreadable because of the implicit (de)referencing done by the compiler. So I thought of changing the type system to be more like C++, with explicit referencing/dereferencing for most things. Unfortunately it seemed to me that the changes would lead me to something that was almost identical to C++ but a bit more complicated. I decided that this wasn't an option either.

I looked back to the actual problem and my first idea of 'inout return types'. What I have come up with is a bit similar to that solution, only less quirky. The main idea is introducing a second pointer level, thus allowing a pointer-to-pointer-to-class type to be declared in some places. This introduces an ambiguity, namely: to what pointer level does the pointer assignment etc. apply to? Well, the second (i.e. top-level) pointer is only used as a reference to the single pointer; you need to be able to change the single pointer as in the array-of-pointer-to-int example. That means all pointer operations would need to work on the single pointer, the one that points to the 'int' in the example. The double pointer would serve like a reference in C++, only initialized once and always dereferenced when used in an expression. The usage of such a double pointer is not universal; it would not be allowed for parameter types, because input parameters don't need a second level of indirection and because inout parameters already provide an implicit second pointer for a pointer parameter. A double pointer can be useful for return types, for variable types, and maybe for fields as well.

Some examples now:
var i : int = 22
var p: *int = i # points to i
var pp: **int = p # points to p
var b : bool
b = (p = 22) # yes, p equals 22
b = (pp = 22) # yes, pp equals 22
b = (pp $= p) # yes, pp equals p
p = 47 # change i to 47
pp = 84 # change i to 84
p $= new int(55) # p no longer points to i
pp $= new int(31) # p changed again, points to 31
As you see in the example pp can not be changed anymore; it is a reference to p and all pointer assignments done on pp will be therefore applied to p.

This new functionality allows us to write a class that serves as a pointer-to-int:
class IntPtrArray
public:
var fP : [10] * int # array has a fixed size of 10 ints

const procedure size() : nat
return 10
end

operator[](index:nat) : **int
# return a reference to the int pointer
return fP[index]
end

const operator[](index:nat) : *int
# no double pointer return needed (array is const)
return fP[index]
end
end
I'm not sure when I will implement this feature, but unless I get some serious objections against this the feature will be added to the language.

Thursday, July 19, 2007

website moved

My website has moved recently. The new URL for Hyper is: http://users.edpnet.be/hyperquantum/hyper/

The project is somewhat stalled right now. I don't have much time these days, and the language's type system needs to be reworked a bit.

Thursday, May 31, 2007

Hyper compiler 0.3.35 released

It has been a long time, but here's another release. This one is nothing spectacular; most changes and improvements are internal and not visible to a user. I started to use the Boost libraries. The compiler and highlighter programs now use Boost's program_options library for parsing command-line parameters.

The lack of interesting new features in the front end is because I have worked on the back end as well. This work is in a very premature state, so you won't see a full compiler anytime soon.

Saturday, March 17, 2007

Hyper compiler 0.3.34 released

There it is again, another release. It has some very nice new features. The most exciting new thing of this release is ironically not the compiler, but the syntax highlighter.

The syntax highlighter was, before this release, a very simple program. You gave it two file parameters: the first was an existing source file and the second was the name for the HTML file that needed to become the syntax-highlighted version of the original source code. The output file was HTML 4.01 transitional. The new version of the highlighter generates valid strict XHTML 1.0! It was an easy improvement, almost nothing more than replacing the doctype. OK, this change is not very important to most people. But a totally different functionality has been added as well.

The idea for a new feature came from the fact that writing documentation on my website is difficult. Explaining the language is a difficult and time consuming task. But writing example code to illustrate things was just horrible. It required manual construction of a "pre" tag with some sourcecode, interleaved with "span" tags for highlighting. And when an example needed to be changed the highlighting had to be corrected. I did not like this at all, so I figured out a way to have the syntax highlighter do this boring and error-prone task. Now I can write the documentation, have the examples as plain text in the HTML file in a special comment (so they are still very readable), and the syntax highlighter processes the HTML file and generates the highlighted version directly inside the file. Now when I change an example I can just change the sourcecode in the HTML file and then run the highlighter on it to have the desired result.

The changes in the compiler itself are mostly directed towards the back end that needs to be written. First, it now supports dynamic arrays. This is needed to be able to write container classes in the language. Checking for 'static' things is now complete as well. This is of course needed for code generation because for every non-static procedure call an object instance is required, so the compiler must be able to find it. Other important changes are mostly internal. Constant folding is now implemented for a small part of the types, and the operators of the built-in types are now represented as classes specialized for each built-in type. And also some bugfixes this time. I discover little of them, because I am (as far as I know) the only tester. You can of course improve this!

Saturday, February 24, 2007

Hyper compiler 0.3.33 released

I have released a new version of the compiler again, we're at 0.3.33 right now. The version number is actually increasing less fast than before I did any releases. The first private release was 0.3.26, the first public release was 0.3.29, so this is the fifth release I have done. Before I started releasing versions I also incremented the version number, but this was a lot faster than it is now. For the same amount of changes that are now in one new release, I would have probably done about 6 or 7 version increments back then. I cannot do the same thing anymore of course, it would be silly to release a new version each time I have made a couple of trivial changes.

The changes I have made in the latest version of the compiler are steps towards a working code generator. I am giving priority to the front end things that are needed by the compiler back end. The first such thing in the release of today is checking for the program entry point. This includes checking a 'begin' specification if it is present, finding the 'static procedure main' and doing the necessary checks on it. Another improvement is a basic support for constant folding, except that no folding is done at the moment, but the compiler can already use unsigned integer literals in type checking. This feature is used by another one: checking of array sizes. You can declare arrays with fixed size and the compiler will check their compatibility. And the third important change is passing of parameters that are not to be changed, i.e. 'in' parameters. They are now passed by reference and completely read-only.

The next things on my 'to do'-list are real constant folding and constructor initializer lists. I suppose those are sufficient to allow me to start working on the code generator. I still have compilation problems with the LLVM tools as I wrote in my previous blog post. I don't think this will be a problem because the compiler front end linked with LLVM works perfectly. I can still compile the official version of LLVM and use its tools on the LLVM bytecode generated by my compiler.

Another thing: I don't really get any feedback of users yet. So if you try the compiler, please let me know what you think! Tell me the things you like, not just what you don't like. I have installed a visitor counter on my website some time ago, and it seems that I do have some visitors looking around. Yay! :-)

Wednesday, February 07, 2007

type deduction for variables + back end progress

I have an idea for a new feature for variables: automatic type deduction. A new reserved keyword "auto" is used in place of the type for the variable, and the compiler will deduce the type from the type of the initializer expression. The type will always be a pointer or a reference, to avoid unneeded copies of the initializer.
procedure test(x : int, y : const int, z : * int)
var a : auto = x # var a: & int = x
var b : auto = y # var b : & const int = y
var c : auto = z # var c : * int = z
var d : auto = a + b # var d : & int = a + b
end
I am still working on the compiler back end. Progress is very slow, because I don't have much time to work on it. I got most of the LLVM libraries compiled with CMake (LLVM normally uses GNU autotools), and linked with my front end. The front end currently emits a simple hello world program as LLVM assembly code (regardless of what sourcefile you 'compile'). I did not get the LLVM tools compiled yet; for "llc" I am stuck on a link error about some symbols from libtool. I have never used libtool myself so I don't know how it's used in a program. I will have to take another look and if I can't solve it I will have to ask for some help on the CMake mailing list. I wonder why my compiler gets linked with LLVM without errors and why "llc" fails. Other work in this matter is on the front end. There is some functionality that is needed for code generation that is still missing in the front end. For example, I still need to finish constant folding, because this is required for determining the size of array types. After that I can start implementing real code generation.

Monday, January 15, 2007

Hyper compiler 0.3.32 released

Today I have released a new version of the compiler. The most important new thing is support for the new namespace system. This means that finally all example programs are accepted by the compiler, including the new Hello World! Imports are not yet really supported; imports of user-defined sourcefiles are ignored, and the only allowed standard library import is the import of 'system.stdio'. The new compiler now also supports chained comparison operators, so you can now use code like:
if x = y = z then
# TODO : implement this
end
Another nice thing to have is that the compiler will generate a warning by default for this code. Something like:
file.hyp:3:2: warning: TODO: implement this
The compiler will actually notice 'TODO' or 'FIXME' in comments and generate warnings for them. In my opinion warnings provide useful information so they are enabled by default. But you can turn them off individually if you like with a commandline switch like "-W-no-todo".

As you can see the compiler still isn't at version 0.4.0 yet. Well, I have chosen for a different approach. I will keep releasing the front end of the compiler as I am doing now, and I will develop a version of the compiler with LLVM back end in parallel. The version 0.4.0 will be given to a release that marks an important event for the front end; maybe when I am ready to start the implementation of more advanced things like inheritance. I do not plan to release the version-with-back-end to the website soon because it will be completely unusable until some time in the future. And if no one cares anyway, then why bother releasing it?