HyperQuantum's blog

Using data-flow information in the language

2010-09-16T12:46:00.008+02:00

It's been a long time since I wrote the previous post on this blog. The project has been mostly inactive for the last year, since I don't have much time anymore to work on it. But I've been on holiday this summer and thought about some new features I'd like to add to the language. This post will start with a simple version of a new concept, and I will expand it in future posts.

What is 'data-flow information'? Well, it's something that is normally used in compilers for optimization. The compiler examines how the source code will be executed, and predicts what value(s) a variable can have at various locations in the source code; this is called data-flow analysis.
But I've been thinking that it can be useful in the language itself. How?

Well, for example, the compiler can check if you are possibly dereferencing a null-pointer. What happens in other languages when you dereference a null-pointer? Some programming languages, like C or C++, assume that the programmer is smart enough to prevent that from happening, and if it does happen, the program will simply crash. Other languages insert run-time checks that look at the value of a pointer before it is dereferenced, and make sure that an exception is thrown when it is null. Both approaches have the disadvantage that bugs regarding null-pointers are only discovered at run-time, and not always before the product is shipped to the customer.

My idea is to have the compiler check if a pointer can be null when you dereference it. So when you have a pointer that needs to be dereferenced, the compiler forces you to write a check in your code to see if the pointer is null or not. An example:

procedure test1()
  var i : int = 10
  var p : * int = SomeExternalClass.getFooBar() # returns a pointer

  i += p  # ERROR: p can be null

  if (p !$ null) then
    i += p  # OK
  end

  i = p * 3  # ERROR: p can be null

  if (p =$ null) then
    return
  end

  i = p * 3 # OK, p cannot be null
end
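
To make the idea a bit more concrete, here is a minimal sketch in Python of the kind of tracking the compiler would do for test1. This is not the Hyper compiler; the tiny statement format ("assign_unknown", "deref", "if_not_null", "return_if_null") is invented for this illustration only.

```python
# Hypothetical mini-IR for one procedure body; all statement names are invented.
def check(stmts, known_non_null=frozenset()):
    """Return (errors, variables known to be non-null after stmts)."""
    known = set(known_non_null)
    errors = []
    for stmt in stmts:
        kind, var = stmt[0], stmt[1]
        if kind == "assign_unknown":       # p $= someCall(): result may be null
            known.discard(var)
        elif kind == "deref":              # any use that dereferences var
            if var not in known:
                errors.append(var + " can be null")
        elif kind == "if_not_null":        # if (var !$ null) then ... end
            body_errors, body_known = check(stmt[2], known | {var})
            errors.extend(body_errors)
            # The body may have been skipped, so keep only the facts that
            # hold on both paths.
            known &= body_known
        elif kind == "return_if_null":     # if (var =$ null) then return end
            known.add(var)
    return errors, known


# The body of test1, written in the invented format.
test1 = [
    ("assign_unknown", "p"),
    ("deref", "p"),                          # reported
    ("if_not_null", "p", [("deref", "p")]),  # accepted
    ("deref", "p"),                          # reported
    ("return_if_null", "p"),
    ("deref", "p"),                          # accepted
]
errors, known = check(test1)
```

The sketch only handles the two shapes of 'if' used in the example; a real analysis would have to deal with loops, 'else' branches and early returns inside a body.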

The compiler can track the value of a pointer variable inside a function, and decide at any point if that value can be null or not. But it doesn't work for all cases. Look at the following example:

class Test2
  var m : * Foo

  procedure test2()
    var p : * Foo = SomeExternalClass.getPropertyX()
  
    p.doSomething()  # ERROR: p can be null
  
    if (p !$ null) then
      p.doSomething()  # OK

      p $= SomeExternalClass.getPropertyX()
      p.doSomething()  # ERROR: p can be null again
    end
  
    m $= SomeExternalClass.getPropertyX()
    m.doSomething()  # ERROR: m can be null

    if (m !$ null) then
      this.someOtherFunction()
      m.doSomething()  # NOT SURE, m might have been manipulated
    end
  end
end

Example function test2 shows some limitations. First, the compiler has to assume that calling "SomeExternalClass.getPropertyX()" returns ANY possible pointer value, even if that function returns the same value over and over again. That's not optimal. Second, the compiler cannot know for sure that class variables aren't changed by other functions (or even code in another thread). So you'd have to assign the value of the class variable to a local variable and work with the local variable if you want to be sure about its value.
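
The second limitation is easy to demonstrate in any language with mutable fields. The following Python sketch mirrors Test2; the method names check_field and check_local are made up for this illustration.

```python
class Foo:
    def do_something(self):
        return "done"


class Test2:
    def __init__(self):
        self.m = Foo()

    def some_other_function(self):
        # Nothing stops another method from resetting the field.
        self.m = None

    def check_field(self):
        # The null check on the field is no longer valid after the call.
        if self.m is not None:
            self.some_other_function()
            return self.m is not None
        return False

    def check_local(self):
        # The local copy cannot be changed by other methods.
        p = self.m
        if p is not None:
            self.some_other_function()
            return p.do_something()
        return None


field_still_valid = Test2().check_field()
local_result = Test2().check_local()
```

After the call, the field is null again while the local copy still points to the object, which is why working on a local variable lets the compiler keep its knowledge.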

What if you're sure that a variable isn't null, but the compiler doesn't know that? I propose a new mechanism for telling the compiler about that:

procedure test3()
  var p : * Foo = SomeExternalClass.getPropertyX()
  
  # We know that the function didn't return a null pointer,
  # so we tell the compiler about that.
  assert (p !$ null)

  p.doSomething()  # OK, the compiler trusts you

end

An assert could be used for debug purposes as well, to verify that your assumptions are valid. The compiler will insert a runtime check to see if you're telling the truth, and the compiler will likely get an option to turn off such checks for release builds.
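
As a rough model of that behaviour, here is a small Python helper. The name assume_non_null and the checks_enabled switch are assumptions for this sketch; they only stand in for the proposed assert and the compiler option.

```python
def assume_non_null(value, checks_enabled=True):
    """Model of 'assert (p !$ null)' with a switchable runtime check."""
    if checks_enabled and value is None:
        raise AssertionError("assumed non-null value was null")
    # From here on the caller treats the value as non-null.
    return value


checked = assume_non_null("some object")
unchecked = assume_non_null(None, checks_enabled=False)  # release build: no check
```

With the check turned off, a wrong assumption goes unnoticed at the assert and only shows up later, so disabling it is a trade-off.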

That's it for now. I will expand the concept in future writings.

changing the 'iterate' statement

2009-02-25T21:51:00.007+01:00

I have been thinking about changing the iterate statement for some time now, and I've written about it before. Well, I have decided to finally do something about it, because the current 'iterate' is useless to me at this time (I cannot implement it in my llvm-new branch of the compiler if I know that the statement will change completely anyway).

To be honest, I still haven't figured out a complete solution. So I will provide a temporary solution to get a usable 'iterate' statement for now, in a way that will (hopefully) be compatible with the final solution. What kind of final solution am I looking for? It will involve some kind of range type, likely combined with an iterator concept. Iterators will be needed soon for the new interface of the 'string' type.

The new syntax is as follows:

iterate variable in [begin .. end]
  (...)
end

This syntax is a hardcoded version of something that the new range feature will provide. I have intentionally kept it very simple; it does not provide stepping or combining multiple iterates into one. A simple example of how it can be used:

procedure fac(n : nat) : nat
  var f : nat = 1

  iterate i in [2 .. ++n]
    f *= i
  end

  return f
end

As you can see, there is an important semantic difference from the old syntax: the end specification is no longer inclusive. Another important difference is that the iteration process will be strictly incremental, so the body will not be executed at all if begin is not smaller than end. When iterating an index variable through the elements of a container, you will need to use the size of the container in end:

iterate i in [0 .. v.size]
  v[i].doSomething
end
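
For readers who want to play with these semantics, they can be modelled in a few lines of Python. The generator name iterate is just borrowed from the statement; "n + 1" stands in for "++n" under the assumption that "++n" yields the successor of n.

```python
def iterate(begin, end):
    """Yield begin, begin + 1, ... up to but not including end."""
    value = begin
    while value < end:   # strictly incremental: nothing happens if begin >= end
        yield value
        value += 1


def fac(n):
    f = 1
    for i in iterate(2, n + 1):  # the end is exclusive, so pass n + 1
        f *= i
    return f


fac_of_five = fac(5)
empty_run = list(iterate(3, 3))
indices = list(iterate(0, 3))
```

With an exclusive end, iterating over the indices of an empty container needs no special case: begin and end are both 0 and the body never runs.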

I will try to implement the new syntax and semantics as soon as possible in the front end (trunk), and then add codegen for it in the llvm-new branch.

Website moved again

2009-02-15T12:32:00.002+01:00

My website has moved again. The new location:

http://hyperquantum.be

And the pages about Hyper are here:

http://hyperquantum.be/hyper

I hope it's the last time I need to move my website. Having to update all links to it is not that much fun really.

Having "pure" functions

2009-02-07T14:42:00.009+01:00

I have been thinking on this new feature for some time, but I'm not sure if it's a good idea. Just in case, the keyword for it has already been reserved in the 0.4.2 release of the compiler front end.

The idea is to introduce "pure" functions. Any procedure that is qualified as "pure" would guarantee to have no side effects and to have its result only depend on the values of its parameters. The main purpose of this is to enable extra optimizations by the compiler, and to guarantee that some functions that do not need side-effects will not have them. So if you have some container class, its "count()" procedure (that returns the number of elements in the container) can be qualified as "pure" because it doesn't (and shouldn't) alter the state of the container, and because as long as the container is not modified, it will keep returning the same value over and over again. This implies that we consider the "this" pointer an implicit parameter for each non-static procedure, and that "pure" is a more restrictive version of the "const" qualifier for procedures. If you need a "pure" procedure that doesn't depend on its hidden "this" parameter, then you should also qualify it as "static".
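
One kind of optimization that purity allows is reusing an earlier result instead of calling the function again. Python does not check purity, so in this sketch the standard functools.lru_cache decorator only stands in for what a compiler could do automatically; the function triangle is a made-up example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def triangle(n):
    # No side effects, and the result depends on n only.
    return n * (n + 1) // 2


values = [triangle(100) for _ in range(3)]
info = triangle.cache_info()  # one real call, two results served from the cache
```

Because the function is pure, nobody can tell the difference between three real calls and one call whose result is reused twice.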

Now the implications of this new feature:

1. Because "pure" is actually a more restrictive version of "const", you can have a normal and a "pure" version of the same procedure, just like you can put a normal and a "const" procedure with the same parameters in a class.
2. You can override a "const" procedure with a "pure" one.
3. You can only override a "pure" procedure with a "pure" one.
4. The compiler needs to check if a "pure" procedure really has no side-effects and if it only uses its parameters to calculate the result (!).

Of those four implications, the last two have a large impact.

The third one could make things a bit more difficult than they are now. If you design a class that will be inherited from, you will have to be careful that any procedure you mark as "pure" will never have to be overridden with one that isn't pure. Could be trivial for functions like "count()", but for other cases it might be a (very) difficult decision.

The fourth one might even be a show-stopper for including the feature. If the compiler needs to check that a procedure is really "pure", then it would need to put restrictions on each statement and expression used in that procedure. You would not be allowed to access any class fields, from the same class or from any other class. All expressions used should be pure as well. Any assignment statements should only modify local variables. If you declare variables in the procedure, their constructors should also be pure (no access to outside data and no side-effects elsewhere). It would be possible to have the compiler check those things, but it would put a major burden on the programmer. He would need to have "pure" in his mind every time he writes a constructor for a class, because it might need to be called at some point somewhere in a "pure" procedure. He would need to mark all (or most) of his user-defined operators as "pure", unless I make "pure" the default or even a requirement for all binary (and unary?) operators. All these things would cause the programmer's code to be filled with things marked as "pure", just in case. Most programmers would probably just avoid "pure" entirely because it makes things too difficult.

So I'm unsure whether or not to include the feature. It could certainly be useful. But would it be worth the effort?

Another possibility is to use "pure" just as an attribute that informs the compiler about potential optimisations. But in that case someone could mark a procedure as "pure" while it does have side-effects.
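
The risk of an unchecked attribute can be shown with a small Python experiment. The standard functools.lru_cache decorator again plays the role of an optimizer that trusts the "pure" claim; settings and scaled are invented names.

```python
from functools import lru_cache

settings = {"rate": 2}


@lru_cache(maxsize=None)  # we claim the function is pure, but nothing checks it
def scaled(x):
    return x * settings["rate"]  # depends on outside state, so it is not pure


first = scaled(10)
settings["rate"] = 3
second = scaled(10)               # the cached result is reused
actual = scaled.__wrapped__(10)   # calling the undecorated function directly
```

The second call returns a stale value because the optimizer believed the claim, and that is exactly the kind of bug an unchecked "pure" would allow.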

At this time, I am still thinking about the feature. Feedback is welcome, as usual.

Compiler development update

2009-02-02T21:32:00.005+01:00

So here's an update on how things went after my previous post ("Compiler development roadmap").

After the "typerefactor" branch was more or less finished, I tried to merge it into (a copy of) the "llvm" branch. But the merge didn't work, so I decided to do things in a different way.

I created a completely new branch called "llvm-new", starting from the code of the "typerefactor" branch. Then I added the LLVM 2.4 sources to it, using the CMake build system from LLVM itself. To get the thing completely compiled and working I had to update to a SVN version of the LLVM code, though. And then I started writing my LLVM back end from scratch.

Wanting to avoid the mistakes of the first "llvm" branch, I started immediately on the implementation of a complete mechanism to manage types and to do type conversions. This means dealing with real class types, that are passed as a (this-)pointer, with primitive types that are passed as a value directly, passing parameters by reference or by value, passing values to "inout" parameters, referencing/dereferencing, indirectly returning values (using an extra hidden function parameter), etc. Eventually this implementation turned out pretty good, so it's no longer a PITA to write back end code like passing a value to a parameter.

Now I'm working on the implementation of codegen for all types (classes actually), expressions and statements. For most things implementation is easy, because I can look at the old implementation in the "llvm" branch and port that code to the new way of doing things. This new way is not just the typepassing/-conversion infrastructure; I've also changed the way I emit instructions. Previously I created the LLVM IR directly, but now I use the IRBuilder utility class from LLVM. And I now have a utility class that makes it a lot easier to emit basic blocks and branches to them.

So things are going forward slowly. I think that my "llvm-new" branch now has about 75% of the codegen functionality that was in the "llvm" branch. But things are still primitive; the compiler spits out a lot of debug output followed by LLVM IR. You can try it if you want to, the code is available on the Launchpad project page:

https://code.launchpad.net/hyper

So most of my time is currently spent on the "llvm-new" branch. But that doesn't mean that the other branches are dead. The "typerefactor" branch now acts as a merge bridge between "llvm-new" and "main" (the 'official' front end branch). All front end related changes in "llvm-new" are regularly merged into "typerefactor". And those improvements are then merged into the "main" branch. The reason for doing things this way is that I like to keep the differences between "llvm-new" and "typerefactor" (in their front end code, at least) as small as possible. And now I can do some more or less invasive changes in the "main" branch without worrying about breaking the back end code.

As you might have noticed, progress has been going rather slowly these days. That's because I found a job as a software developer, using .NET (C# and VB). But at home it's still only Linux (Gentoo) for me. I'd prefer not to depend on Microsoft personally, but I need something to pay the bills of course.

Compiler development roadmap

2008-10-26T11:43:00.004+01:00

You probably already know that there are two important versions of the compiler in development: the "main" branch with only a front-end (called "trunk" on Launchpad), and the "llvm" branch with an experimental LLVM-based back-end. Their source code can be found on the Launchpad project page:

https://code.launchpad.net/hyper

That page contains another development branch as well, called "typerefactor". This branch is a heavily modified version of the main branch with the purpose of refactoring the handling of types and type conversions. I started the typerefactor branch because the code of the llvm branch has become a bit difficult and fragile; the typerefactor branch will improve the front-end part of the compiler so that developing the LLVM back-end will be much easier.

Another thing on my TO-DO list is the upgrade of the back-end to LLVM 2.4. This means that I will have to replace my custom CMake build system for LLVM by the official one. Yes, you read that correctly: LLVM now has an (experimental) CMake-based build system of its own. I posted my code for building LLVM with CMake on the LLVM mailing list quite a while ago, and now someone has used that code as a start for writing a real CMake build system for LLVM (mine was very Unix-oriented and was just enough for using LLVM with my compiler front-end).

So currently I'm planning to do the following:

1. Finish the typerefactor code.
2. Create another branch, "llvm-experimental", based on the llvm branch, and merge the typerefactor code into it. It's possible that this will require some extra changes to the front-end code, and those will be merged into the typerefactor branch again.
3. Upgrade one of the llvm branches to LLVM 2.4. Which branch I will use will depend on how long it takes to finish item 2 and on how difficult the build system transition will be.
4. When item 2 is done, merge the typerefactor branch into the main branch.
5. When items 2 and 3 are done, merge the llvm-experimental branch into the llvm branch.
6. Implement codegen for everything the front-end currently supports (in the llvm branch).
7. Merge the llvm branch into the main branch.

This is what I have in mind for the future, but it is not guaranteed that development will truly follow this roadmap. And it doesn't account for any front-end-only changes that I might do in the meantime.

Some things that will need to be done to the front-end:

Add support for 'references' (or restricted pointers)
Create a Unicode interface for the "string" type (no more random access)
Change the handling of operator overloading for binary operators (no more global list)

Other things are on my possible TO-DO list as well, such as adding support for "pure" procedures, but that feature will need to be thought out (and published here) first.

Launchpad

2008-09-19T22:15:00.003+02:00

As the title of this posting says, I have registered a project on Launchpad for my programming language Hyper:

https://launchpad.net/hyper

The main website for the language is still here:

http://users.edpnet.be/hyperquantum/hyper/

Why Launchpad instead of, say, Sourceforge? Because Launchpad is one of the few websites that has support for the Bazaar version control system. So I can easily upload the code to Launchpad with a simple "bzr push", and anyone can easily branch from it and make his own changes. And it's always nice to have an extra backup of my code :)

The project will probably slow down in the near future, as I have graduated now (master's degree, computer science) and am looking for a job.

About the 0.4.0 release and the back end

2008-06-15T13:13:00.004+02:00

The compiler had a new release last month, with some big improvements. So the version number was not increased to 0.3.39, but to 0.4.0.

I did a major internal restructuring of the compiler's semantic checking, as I wrote earlier. These changes allow me to detect circular dependencies between classes, and fix some nasty bugs. I no longer keep a separate symbol table; the symbol names and symbol lookup are now embedded in the AST data structures directly.

And the compiler now finally has full Unicode support! Source files are assumed to be in UTF-8 format now, because only that format is supported at this time. It also has the advantage that it's a superset of ASCII, so you can open a source file with a dumb text editor that doesn't know about Unicode and still have readable source, except for some special characters that aren't displayed correctly because they're not ASCII. Other formats like UTF-16 are on my TO-DO list, but do not have a very high priority. And though the compiler now has Unicode support, the language still needs to be adapted partially to it. The string type, for instance, will need to have its interface changed because it will use UTF-8; the random access (array) operator needs to go, and a replacement mechanism must be provided to be able to iterate over the characters in the string.
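
To illustrate why random access has to go, here is a small Python sketch that works on raw UTF-8 bytes. The function code_points is written for this post only; it assumes valid UTF-8 input and does no error recovery.

```python
def code_points(data):
    """Yield the characters of valid UTF-8 bytes, one code point at a time."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:               # 0xxxxxxx: single byte (ASCII)
            length = 1
        elif lead >> 5 == 0b110:      # 110xxxxx: two-byte sequence
            length = 2
        elif lead >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
            length = 3
        elif lead >> 3 == 0b11110:    # 11110xxx: four-byte sequence
            length = 4
        else:
            raise ValueError("not at the start of a character")
        yield data[i:i + length].decode("utf-8")
        i += length


text = "h\u00e9llo\u20ac"         # 6 characters, two of them non-ASCII
data = text.encode("utf-8")       # 9 bytes
second_byte = data[1]             # first half of the 2-byte character, not a character
characters = list(code_points(data))
```

Indexing by byte position can land in the middle of a character, and finding the n-th character requires walking the string from the start, so a sequential iteration mechanism fits UTF-8 much better than an array operator.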

The compiler with back end has gotten some of my attention as well. It has inherited the new features from the 0.4.0 release and can do LLVM code generation for most expressions and statements, involving only the built-in types. User-defined classes, strings, floating-point types, and the 'iterate' statement are not supported yet. I have added support for printing single characters to stdout, so I wouldn't have to wait for full string support until I could see simple test programs working. Next thing I would like to be compilable is the eight-queens program.

My efforts will be mostly on the compiler back end now. I would like to have complete code generation for what the front end currently supports, and then I can merge the llvm-branch to the main-branch so that the main release is no longer front-end only. But I guess I will need to implement a solution for the iterate problem first.

The compiler sources are getting pretty big now. The main branch now has almost 44000 lines of code, and the llvm branch has about 49000 lines (not counting the LLVM sources, of course). That's a simple count, including header files, blank lines and comments. It's still quite impressive to me, though. My biggest project ever :-)

iterate revisited

2007-12-19T18:41:00.000+01:00

I am going to talk a bit about the iterate statement in this post. If you don't know what it is, look at the web page behind the hyperlink.

The iterate statement was derived from the FOR statement from PASCAL and BASIC (and their derivatives). I had to generalize it in order to allow it being used with user-defined types instead of just numbers; my intention is to have it support the behavior of 'foreach' in some other languages. So I came up with some scheme to support the use of iterators. I defined some operators that need to be supported by the types used in an iterate statement. For example:

The variable should be initializable by the from-expression.
The variable should support increment when using "step ++".
The variable must be able to be compared to the to-expression.

In an "iterate v : t from x to y step z" the compiler would first compare x and y, to see which one is the largest value. If x <= y then the step would be considered to be an incremental value, and otherwise a decrementing value. And there are operators needed that can decide when the to-value is reached for both cases; so you would need comparisons v <= y (for an incrementing step) and y <= v (for a decrementing step) to be available.

But this scheme is not good enough. When I came up with this I assumed the use of a "begin" and "end" iterator, where the "end" iterator should point at the last element of the collection. In C++ the 'end' iterator points one past the last element. I thought that my change would not be a problem, but I forgot to think of what would happen when iterating over an empty collection. Such a collection does not have a first or last element iterator!
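
The difference between the two conventions can be sketched in Python. Integer positions stand in for iterators here, and the explicit emptiness check models the fact that no valid "last element" iterator exists for an empty collection; both function names are invented.

```python
def visit_inclusive(items):
    """'to' names the last element, as in my original scheme."""
    if not items:
        # There is no element that 'from' or 'to' could refer to.
        raise ValueError("an empty collection has no last element")
    position, last = 0, len(items) - 1
    seen = []
    while position <= last:
        seen.append(items[position])
        position += 1
    return seen


def visit_half_open(items):
    """'end' is one past the last element, as in C++."""
    position, end = 0, len(items)
    seen = []
    while position < end:   # for an empty collection begin == end, so no iterations
        seen.append(items[position])
        position += 1
    return seen


full = visit_inclusive([1, 2, 3])
empty = visit_half_open([])
```

With the half-open convention the empty collection is not a special case at all, while the inclusive convention cannot even express its bounds.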

So I will have to change the current way of doing things. One possible way is to do the 'end' iterator as in C++, and have it point to one past the last element. This means an iterate will look as follows:

iterate v : c.iterator from c.begin to --c.end
# do something
end

It would require the programmer to remember using a "--" in front of the "end" iterator, because the "end" iterator does not point to a valid element in the collection. The iterator also must support the predecessor operator, but what does it return when used on the "end" iterator in an empty collection?

Another possibility is having iterate extract some other info from the collection before starting the loop. It could first query the iterators for the instance of the collection class they belong to (and both iterators must return the same value) and then ask the collection class whether it's non-empty and skip the loop if it's not. But this would require allowing invalid iterators returned from the collection in order to return their collection object when the collection is empty.

And a third option would be to specify the collection object explicitly in the iterate statement itself. Something like:

iterate i in c
 # do something for each element
end

# - or -

iterate i in c from someStartIterator  # optional "to" or "step" omitted here
 # do something for each element, starting from some given iterator
end

This would need an extra keyword "in" (already silently supported by the latest compilers), and some predefined methods in the collection, required by the iterate statement for determining the non-emptiness of the collection and its start and end values. It would not change the existing iterate behaviour but only extend it.
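
A rough Python model of this third option follows. The method names is_empty, begin, end and at are assumptions made up for the sketch; nothing has been decided about what the predefined methods in Hyper would be called or whether "end" would be exclusive.

```python
class NumberList:
    """A toy collection exposing the hooks that 'iterate ... in' would need."""

    def __init__(self, values):
        self._values = list(values)

    def is_empty(self):
        return not self._values

    def begin(self):
        return 0

    def end(self):
        return len(self._values)  # assumed here: one past the last element

    def at(self, position):
        return self._values[position]


def iterate_in(collection, start=None):
    """Model of 'iterate i in c' and 'iterate i in c from someStartIterator'."""
    if collection.is_empty():
        return                     # skip the loop without touching any iterator
    position = collection.begin() if start is None else start
    while position < collection.end():
        yield collection.at(position)
        position += 1


everything = list(iterate_in(NumberList([1, 2, 3])))
from_second = list(iterate_in(NumberList([1, 2, 3]), start=1))
nothing = list(iterate_in(NumberList([])))
```

Because the statement receives the collection itself, it can ask about emptiness first and never needs an iterator into an empty collection.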

I don't know yet what I'll choose as the solution. Maybe you'll see the change as it appears in a new compiler release, or maybe I'll write about it before I implement it.

Status update & future directions

2007-12-09T14:37:00.000+01:00

It has been some time since my last writing here. In the meantime there have been two compiler releases, 0.3.37 and 0.3.38. The most important new features in those are:

double pointers (pointer to pointer, look here)
pointer L-values and class L-values
sorted diagnostics, with common directory prefix printed separately
a new type of comment
first steps of platform detection (32 bit or 64 bit)

The new comment type was introduced because of an annoying property that the existing line comments have. Suppose you have a small procedure (one or two lines) you want to comment out. The fastest way is to put a hash character ('#') in front of each line. But this does not work if one of the lines already has a line comment behind it; putting a hash mark in front of the line cancels out the comment that already was there, so it will no longer be treated as comment text. So I created a new type of line comment that simply reaches until the end of the line, ignoring any hash characters that are already there. You just type two hash characters (with no space in between). Isn't it simple?
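
Here is a small Python model of the two comment types. It assumes that a single '#' comment ends at the next '#' or at the end of the line, which is how the problem above arises; it ignores string literals, and strip_comments is not the real lexer.

```python
def strip_comments(line):
    """Return the part of a source line that is treated as code."""
    code = []
    in_comment = False
    i = 0
    while i < len(line):
        if not in_comment and line.startswith("##", i):
            break                         # '##': the rest of the line is comment
        if line[i] == "#":
            in_comment = not in_comment   # '#': assumed to open or close a comment
        elif not in_comment:
            code.append(line[i])
        i += 1
    return "".join(code)


plain = strip_comments("x = 1 # set x")
broken = strip_comments("# x = 1 # set x")    # old comment text becomes code
fixed = strip_comments("## x = 1 # set x")    # whole line is commented out
```

In the second case the text " set x" ends up outside any comment, which is the annoyance the double hash removes.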

The platform detection that was added is limited. If you have an amd64 (or derived) CPU, it assumes the HOST platform is 64-bit. If it is not, and you have an x86 or derived CPU, the HOST is 32-bit. And otherwise it produces an error, because your platform is not supported yet. Of course, support for more architectures will come later.
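
The decision logic is roughly the following, written here as a Python sketch. The function name host_bits and the exact list of CPU names are assumptions; the compiler's own detection may look different.

```python
def host_bits(machine):
    """Map a CPU name to the word size assumed for the HOST platform."""
    name = machine.lower()
    if name in ("amd64", "x86_64"):
        return 64
    if name in ("x86", "i386", "i486", "i586", "i686"):
        return 32
    # Anything else is not supported yet.
    raise ValueError("unsupported platform: " + machine)


on_amd64 = host_bits("x86_64")
on_x86 = host_bits("i686")
```

In Python the CPU name could come from platform.machine() in the standard library, for example.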

I have decided to require all source files be in UTF-8 format, unless specified otherwise (by a magic number in the file). This has not been implemented yet. The char type will most likely be represented by a 32 bit number (UTF-32/UCS-4). And string will probably use UTF-8, which unfortunately requires me to remove the random access functionality.

As you probably already know, I am working on a compiler with back end. Most basic features are implemented, such as statements and expressions, integral types and arrays. Some exceptions are the iterate statement and the chained comparison expression. I would like to see some simple programs compiled completely, but that is not possible yet. A simple program could calculate something and then print the result. But: I haven't implemented standard output yet, and I haven't implemented string types yet. The standard output only accepts strings, so strings are a dependency in this case. And if I have to implement strings, I need the char type supported as well.

Unfortunately there are some things I will have to change in the current implementation. I will need to change how the symbol table works, because it needs to support sourcefiles importing each other, making symbols visible and invisible again, etc. The current operator overloading mechanism for binary operators needs to be changed as well, because it currently requires to maintain a global list of all binops. So I will need to change their semantics; looking up a binary operator should be possible by looking at the operands instead of a global list.

The compiler's semantic processing will need to be restructured. This is how the front end currently works, in 4 phases:

1. Reading the source from file, lexing, parsing and AST building.
2. Doing resolve1 recursively: looking up typenames and calculating compile time array sizes.
3. Doing resolve2 recursively: check for duplicate overloads, create default copy constructors.
4. Doing resolve3 recursively: check all other semantics.

I have thought of a better way of doing things. The first phase can remain, of course, but the other three need to be changed. I would create a single phase for all semantics. Each AST member would have two functions for semantic checking: one to check interface semantics only, and one to do all checks. Every function would have to check if it isn't in a circular dependency (such would require an error diagnostic of course) and if it hasn't been completed before (in that case it would not do the checks again). If the compiler is resolving some code that uses another class, it would only need to resolve the interface of that other class, i.e. only procedure and constructor headers, and fields. If a procedure is called, it would only need to make sure its interface was resolved. An expression's interface is its result type plus other result properties (L-value or not, compile time value or not, etc.) and those require a complete check, so for an expression there would not be a distinction between 'interface resolve' and 'complete resolve'.
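
The bookkeeping for "already completed" and "circular dependency" could look like the following Python sketch. It models only one level of checking (not the split between interface and complete resolve), and Node, depends_on and resolve are names invented for the illustration.

```python
NOT_STARTED, IN_PROGRESS, DONE = range(3)


class Node:
    """Stand-in for an AST member that needs semantic checking."""

    def __init__(self, name):
        self.name = name
        self.depends_on = []
        self.state = NOT_STARTED


def resolve(node, errors):
    if node.state == DONE:
        return                      # completed before: do not check again
    if node.state == IN_PROGRESS:
        # We came back to a node that is still being resolved.
        errors.append("circular dependency involving " + node.name)
        return
    node.state = IN_PROGRESS
    for dependency in node.depends_on:
        resolve(dependency, errors)
    # The real semantic checks for this node would run here.
    node.state = DONE


x, y = Node("x"), Node("y")
x.depends_on.append(y)
y.depends_on.append(x)
cycle_errors = []
resolve(x, cycle_errors)
```

Resolution is driven on demand from whatever code is being checked, so the order of the classes in the source files no longer matters.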

There are other things to be solved as well: how to manage binary compatibility with the standard library, how to create an interface to libraries written in other languages, etc. These things will need lots of thinking.

Memory management

2007-09-04T16:46:00.000+02:00

First I would like to let you know there has been a new release of the compiler for Hyper, version 0.3.36. I did not announce that version on my blog.

Now, about memory management. As you might already know, the language will use a garbage collector. As a consequence of that, classes cannot have a destructor because it cannot be guaranteed that such a destructor would really be executed when an object instance's memory is reclaimed. When a short running program exits, the garbage collector might not even have been run. This way, the garbage collector actually serves to pretend we have an infinite amount of memory. But people don't always like it that way; they want deterministic construction/destruction for some things like resources managed by the operating system. And I agree: though Hyper is not a language for doing low level OS stuff, it can be necessary to have deterministic resource management for some cases. So I will introduce a way to do that.

First we need a bit of background. There is a feature I thought I had already explained on this blog, but I can't find it anywhere. So I will describe it briefly, and write a full explanation later. This feature is called "restricted pointers", and a shorter name might be "references". Such a reference is like a pointer, except that it cannot be stored outside of the stack and is guaranteed never to be null. References can be used to point to a temporary object on the stack. A pointer can be implicitly converted to a reference but not the other way around. Temporaries can no longer be converted to pointers but only to references. This implies that when you see a pointer it has to point to an object on the free store. References are not yet implemented in the compiler, but they will have to be in order to get headache free memory management; they prevent the use of a pointer that points to a no longer existing temporary.

I already wrote something about RAII some time ago, but I realized that it would not be good enough for safe deterministic destruction, because the user can use the object without "scope". My new solution is called "auto classes". Such a class is declared with the "auto" keyword, and it must have a destructor. The syntax for the destructor could be "procedure auto()" or "procedure ~new()", I'm not sure yet. Any instance of an "auto class" can only be stored on the stack, or in a (non-static) field of another "auto class". It cannot be created on the free store, as it would not be possible to provide a guarantee for the execution of the destructor then. An object of such a class would need to count internally how many copies of it are still live in order to decide when its resource can be released. My plan is to provide an abstract base class that implements this behaviour so that users don't have to write a reference counter for every auto class they need; they would just inherit from the base class and implement a procedure that specifies what to do when the last instance dies.
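
The counting behaviour of such a base class can be modelled in Python as follows. The class name CountedResource and its methods copy and release are invented; in Hyper the copy constructor and the destructor would do this work implicitly.

```python
class CountedResource:
    """Model of the planned base class: copies share one reference count."""

    def __init__(self, on_last_release):
        self._shared = {"count": 1, "on_last_release": on_last_release}
        self._live = True

    def copy(self):
        # Plays the role of the copy constructor.
        if not self._live:
            raise RuntimeError("cannot copy a released handle")
        other = CountedResource.__new__(CountedResource)
        other._shared = self._shared
        other._live = True
        self._shared["count"] += 1
        return other

    def release(self):
        # Plays the role of the destructor; releasing twice is harmless.
        if not self._live:
            return
        self._live = False
        self._shared["count"] -= 1
        if self._shared["count"] == 0:
            self._shared["on_last_release"]()  # the last instance died


log = []
a = CountedResource(lambda: log.append("closed"))
b = a.copy()
a.release()
after_first = list(log)   # one copy is still live, nothing released yet
b.release()
```

The user of the base class would only supply the part passed here as on_last_release, which is the procedure that specifies what to do when the last instance dies.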

A slightly different approach would be a resource-managing class that cannot be changed. So I propose the "auto const class": an auto class that is always const, has no (accessible) assignment operator, and whose procedures must all be const as well. Its copy constructor can be made private in order to have a RAII class that does not need a reference count because it is the only reference.

Aside from memory management I also propose a (non-auto) "const class". This type of class is always const (thus immutable) and is in the style "create and never modify", like the Java "String" class. It can only be constructed on the free store.

An extension to the type system

2007-07-23T13:21:00.000+02:00

Why change Hyper's type system? The main reason is arrays, or more specifically the array index operator. When you have an array of 'int', the return type of the index operator needs to be a pointer to an 'int' in order for you to be able to change the numeric value in the array. (Currently this is not the case yet; the array index operator returns a plain 'int' for such an array. That will be fixed of course.) But what if you have an array of pointers to 'int'? What will be the index operator's return type then? Logically it has to be something like pointer to pointer to int, because of the extra level of indirection that is needed. After all, you should be able to change the int pointer. But at this time the language disallows multiple levels of pointers!

My first idea to solve this problem was to introduce "inout return types". It would provide an extra (hidden) reference to the actual type, as it is done for inout parameters. But I decided not to do that, as it would be a quirky solution. And then someone pointed out that the current pointer system is weird and makes source code unreadable because of the implicit (de)referencing done by the compiler. So I thought of changing the type system to be more like C++, with explicit referencing/dereferencing for most things. Unfortunately it seemed to me that the changes would lead me to something that was almost identical to C++ but a bit more complicated. I decided that this wasn't an option either.

I looked back to the actual problem and my first idea of 'inout return types'. What I have come up with is a bit similar to that solution, only less quirky. The main idea is introducing a second pointer level, thus allowing a pointer-to-pointer-to-class type to be declared in some places. This introduces an ambiguity, namely: to which pointer level do pointer assignment and the other pointer operations apply? Well, the second (i.e. top-level) pointer is only used as a reference to the single pointer; you need to be able to change the single pointer as in the array-of-pointer-to-int example. That means all pointer operations would need to work on the single pointer, the one that points to the 'int' in the example. The double pointer would serve like a reference in C++, only initialized once and always dereferenced when used in an expression. The usage of such a double pointer is not universal; it would not be allowed for parameter types, because input parameters don't need a second level of indirection and because inout parameters already provide an implicit second pointer for a pointer parameter. A double pointer can be useful for return types, for variable types, and maybe for fields as well.

Some examples now:

var i : int = 22
var p: *int = i  # points to i
var pp: **int = p  # points to p
var b : bool
b = (p = 22)  # yes, p equals 22
b = (pp = 22)  # yes, pp equals 22
b = (pp $= p)  # yes, pp equals p
p = 47  # change i to 47
pp = 84  # change i to 84
p $= new int(55)  # p no longer points to i
pp $= new int(31)  # p changed again, points to 31

As you can see in the example, pp itself cannot be re-pointed anymore; it is a reference to p, and all pointer assignments done on pp are therefore applied to p.
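The same behaviour can be modelled in Python to check the reasoning. This is only a sketch of the semantics, not how the compiler would implement it; the classes Cell, Ptr and Ref and their method names are invented for the illustration. Ptr plays the role of '*int' and Ref the role of '**int'.

```python
class Cell:
    """A storage location holding an int (models 'var i : int')."""
    def __init__(self, value):
        self.value = value


class Ptr:
    """Models '*int': value operations go to the target, $= re-points."""
    def __init__(self, target):
        self.target = target

    def get(self):
        return self.target.value

    def set(self, value):        # models 'p = 47'
        self.target.value = value

    def point_to(self, cell):    # models 'p $= new int(55)'
        self.target = cell


class Ref:
    """Models '**int': bound once to a Ptr, everything is forwarded to it."""
    def __init__(self, ptr):
        self._ptr = ptr

    def get(self):
        return self._ptr.get()

    def set(self, value):        # models 'pp = 84'
        self._ptr.set(value)

    def point_to(self, cell):    # models 'pp $= new int(31)', re-points p
        self._ptr.point_to(cell)

    def same_target(self, ptr):  # models 'pp $= p' used as a comparison
        return self._ptr.target is ptr.target


i = Cell(22)
p = Ptr(i)
pp = Ref(p)
initial = (p.get(), pp.get(), pp.same_target(p))
p.set(47)
after_p_set = i.value
pp.set(84)
after_pp_set = i.value
p.point_to(Cell(55))
after_p_repoint = (p.get(), i.value)
pp.point_to(Cell(31))
after_pp_repoint = (p.get(), pp.get(), i.value)
```

Note that Ref has no way to be bound to a different Ptr after construction, which is the "only initialized once" property.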

This new functionality allows us to write a class that serves as an array of pointers to int:

class IntPtrArray
public:
var fP : [10] * int  # array has a fixed size of 10 int pointers

const procedure size() : nat
  return 10
end

operator[](index:nat) : **int
  # return a reference to the int pointer
  return fP[index]
end

const operator[](index:nat) : *int
  # no double pointer return needed (array is const)
  return fP[index]
end
end

I'm not sure when I will implement this feature, but unless I get some serious objections against this the feature will be added to the language.

website moved

2007-07-19T13:23:00.000+02:00

My website has moved recently. The new URL for Hyper is: http://users.edpnet.be/hyperquantum/hyper/

The project is somewhat stalled right now. I don't have much time these days, and the language's type system needs to be reworked a bit.

Hyper compiler 0.3.35 released

2007-05-31T16:08:00.000+02:00

It has been a long time, but here's another release. This one is nothing spectacular; most changes and improvements are internal and not visible to a user. I started to use the Boost libraries. The compiler and highlighter programs now use Boost's program_options library for parsing command-line parameters.

The lack of interesting new features in the front end is because I have worked on the back end as well. This work is at a very early stage, so you won't see a full compiler anytime soon.

Hyper compiler 0.3.34 released

2007-03-17T18:07:00.000+01:00

There it is again, another release. It has some very nice new features. The most exciting new thing of this release is ironically not the compiler, but the syntax highlighter.

The syntax highlighter was, before this release, a very simple program. You gave it two file parameters: the first was an existing source file and the second was the name for the HTML file that needed to become the syntax-highlighted version of the original source code. The output file was HTML 4.01 transitional. The new version of the highlighter generates valid strict XHTML 1.0! It was an easy improvement, almost nothing more than replacing the doctype. OK, this change is not very important to most people. But a totally different functionality has been added as well.

The idea for a new feature came from the fact that writing documentation on my website is difficult. Explaining the language is a difficult and time consuming task. But writing example code to illustrate things was just horrible. It required manual construction of a "pre" tag with some sourcecode, interleaved with "span" tags for highlighting. And when an example needed to be changed the highlighting had to be corrected. I did not like this at all, so I figured out a way to have the syntax highlighter do this boring and error-prone task. Now I can write the documentation, have the examples as plain text in the HTML file in a special comment (so they are still very readable), and the syntax highlighter processes the HTML file and generates the highlighted version directly inside the file. Now when I change an example I can just change the sourcecode in the HTML file and then run the highlighter on it to have the desired result.

The changes in the compiler itself are mostly directed towards the back end that needs to be written. First, it now supports dynamic arrays. This is needed to be able to write container classes in the language. Checking for 'static' things is now complete as well. This is of course needed for code generation because for every non-static procedure call an object instance is required, so the compiler must be able to find it. Other important changes are mostly internal. Constant folding is now implemented for a small subset of the types, and the operators of the built-in types are now represented as classes specialized for each built-in type. There are also some bugfixes this time. I discover few bugs myself, because I am (as far as I know) the only tester. You can of course improve this!

Hyper compiler 0.3.33 released

2007-02-24T16:12:00.000+01:00

I have released a new version of the compiler again; we're at 0.3.33 right now. The version number is actually increasing more slowly than before I did any releases. The first private release was 0.3.26, the first public release was 0.3.29, so this is the fifth release I have done. Before I started releasing versions I also incremented the version number, but this went a lot faster than it does now. For the same amount of changes that are now in one new release, I would have probably done about 6 or 7 version increments back then. I cannot do the same thing anymore of course; it would be silly to release a new version each time I have made a couple of trivial changes.

The changes I have made in the latest version of the compiler are steps towards a working code generator. I am giving priority to the front end things that are needed by the compiler back end. The first such thing in today's release is checking for the program entry point. This includes checking a 'begin' specification if it is present, finding the 'static procedure main' and doing the necessary checks on it. Another improvement is basic support for constant folding: no actual folding is done at the moment, but the compiler can already use unsigned integer literals in type checking. This feature is used by another one: checking of array sizes. You can declare arrays with a fixed size and the compiler will check their compatibility. And the third important change is the passing of parameters that are not to be changed, i.e. 'in' parameters. They are now passed by reference and are completely read-only.

The next things on my 'to do'-list are real constant folding and constructor initializer lists. I suppose those are sufficient to allow me to start working on the code generator. I still have compilation problems with the LLVM tools as I wrote in my previous blog post. I don't think this will be a problem because the compiler front end linked with LLVM works perfectly. I can still compile the official version of LLVM and use its tools on the LLVM bytecode generated by my compiler.

Another thing: I don't really get any feedback from users yet. So if you try the compiler, please let me know what you think! Tell me the things you like, not just what you don't like. I installed a visitor counter on my website some time ago, and it seems that I do have some visitors looking around. Yay! :-)

type deduction for variables + back end progress

2007-02-07T17:57:00.000+01:00

I have an idea for a new feature for variables: automatic type deduction. A new reserved keyword "auto" is used in place of the type for the variable, and the compiler will deduce the type from the type of the initializer expression. The type will always be a pointer or a reference, to avoid unneeded copies of the initializer.

procedure test(x : int, y : const int, z : * int)
var a : auto = x   # var a: & int = x
var b : auto = y   # var b : & const int = y
var c : auto = z   # var c : * int = z
var d : auto = a + b  # var d : & int = a + b
end
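The rule behind these four examples can be written down as a tiny Python function. This is just a sketch inferred from the examples above; deduce_auto is a made-up name, types are handled as plain strings, and the real compiler would of course work on its own type representation.

```python
def deduce_auto(initializer_type: str) -> str:
    """Return the type an 'auto' variable would get for this initializer type.

    Assumed rule, read off the examples: pointers and references are kept
    as they are, anything else becomes a reference so no copy is made.
    """
    t = " ".join(initializer_type.split())  # normalise whitespace
    if t.startswith("*") or t.startswith("&"):
        return t
    return "& " + t


deduced = [deduce_auto(t) for t in ("int", "const int", "* int")]
```

For the fourth example the initializer 'a + b' has type 'int', so the same rule gives '& int'.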

I am still working on the compiler back end. Progress is very slow, because I don't have much time to work on it. I got most of the LLVM libraries compiled with CMake (LLVM normally uses GNU autotools), and linked with my front end. The front end currently emits a simple hello world program as LLVM assembly code (regardless of what sourcefile you 'compile'). I did not get the LLVM tools compiled yet; for "llc" I am stuck on a link error about some symbols from libtool. I have never used libtool myself so I don't know how it's used in a program. I will have to take another look and if I can't solve it I will have to ask for some help on the CMake mailing list. I wonder why my compiler gets linked with LLVM without errors and why "llc" fails. Other work in this matter is on the front end. There is some functionality that is needed for code generation that is still missing in the front end. For example, I still need to finish constant folding, because this is required for determining the size of array types. After that I can start implementing real code generation.

Hyper compiler 0.3.32 released

2007-01-15T21:48:00.000+01:00

Today I have released a new version of the compiler. The most important new thing is support for the new namespace system. This means that finally all example programs are accepted by the compiler, including the new Hello World! Imports are not yet really supported; imports of user-defined sourcefiles are ignored, and the only allowed standard library import is the import of 'system.stdio'. The new compiler now also supports chained comparison operators, so you can now use code like:

if x = y = z then
# TODO : implement this
end

Another nice thing to have is that the compiler will generate a warning by default for this code. Something like:

file.hyp:3:2: warning: TODO: implement this

The compiler will actually notice 'TODO' or 'FIXME' in comments and generate warnings for them. In my opinion warnings provide useful information so they are enabled by default. But you can turn them off individually if you like with a commandline switch like "-W-no-todo".
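For readers who want to see roughly what such a check involves, here is a small Python sketch. It is not the compiler's code: the function name todo_warnings, the 'enabled' flag and the choice to report the column of the comment start are all assumptions, and it naively treats every '#' as the start of a comment.

```python
import re

# Matches a TODO or FIXME marker, an optional colon, and the rest of the line.
MARKER = re.compile(r"\b(TODO|FIXME)\b\s*:?\s*(.*)")


def todo_warnings(filename, source, enabled=True):
    """Return one warning string per comment that contains TODO or FIXME."""
    if not enabled:  # models switching the warning off on the command line
        return []
    warnings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        hash_pos = line.find("#")  # naive: ignores '#' inside string literals
        if hash_pos < 0:
            continue
        match = MARKER.search(line, hash_pos)
        if match:
            kind, text = match.group(1), match.group(2).strip()
            warnings.append(
                f"{filename}:{line_no}:{hash_pos + 1}: warning: {kind}: {text}"
            )
    return warnings


example = "if x = y = z then\n # TODO : implement this\nend\n"
found = todo_warnings("demo.hyp", example)
```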

As you can see the compiler still isn't at version 0.4.0 yet. Well, I have chosen a different approach. I will keep releasing the front end of the compiler as I am doing now, and I will develop a version of the compiler with LLVM back end in parallel. The version 0.4.0 will be given to a release that marks an important event for the front end; maybe when I am ready to start the implementation of more advanced things like inheritance. I do not plan to release the version-with-back-end to the website soon because it will be completely unusable until some time in the future. And if no one cares anyway, then why bother releasing it?

progress...

2006-12-27T22:19:00.000+01:00

It's been a while since I wrote something the last time. But I have made progress since then.

First of all, the Hello world example has changed again. Here is the new (and hopefully final) version:

import system.stdio

class Hello
 static procedure main()
  system.Out.line("Hello, World!")
 end
end

The 'StdIO' class has been replaced by 'Out', which will provide simple console output. The 'printLn' procedures have been renamed to 'line'.

Implementation of the compiler is now much further. The compiler on SVN trunk now supports the new namespace system, and supports importing 'system.stdio'. This means that all sample programs in the directory tests/programs now compile successfully. A new fun feature is that the compiler detects TODO and FIXME in comments and emits warnings for them. Of course new compiler options are provided to turn these warnings off, but they are enabled by default.

The compiler has been restructured internally as well. I have completed 3 major refactorings. But this is not the end of the road; many other improvements will be done in the future.

There is a good chance that the next release of the compiler will have version number 0.4.0 because of all the improvements that have been done. The milestone for 0.4.0 will then be "front-end mature enough to start implementing the back-end".

I have also created a new SVN branch where I will start to work on the compiler back-end. As said earlier I will use LLVM for this. I have imported LLVM 1.9 into the branch. The first thing to do is to get LLVM compiled with CMake, as the LLVM developers use GNU autotools to build it. But I don't, I use CMake for the front-end. Then I will try to get the Hello World program compiled with it.

Today I have written a bunch of docs again. You can find class references for the built-in types on the website.

I currently use Subversion, but I am thinking about switching to Bazaar.

Hyper compiler 0.3.31 released

2006-12-14T19:15:00.000+01:00

I have released a new version of the compiler. This release is quite a bit bigger than the previous one. It contains more new features and has undergone internal structure improvements.

The compiler now checks for the presence of return statements in procedures that return something. Another very important new feature is public/private access checking. Some small things are not checked yet, like for example the usage of private conversion constructors for passing arguments to a procedure. And const checking is now complete (at least to my knowledge), which means that the compiler does additional checks for const procedures and calling procedures on a const object.

Some things that are not listed in the changelog: a new test program was added, the Hello World test program. This already uses the new import and namespace semantics so the compiler currently rejects it. And the compiler now allows import directives but currently ignores them.

dynamic arrays and array sizes

2006-12-11T22:47:00.000+01:00

Arrays have been supported for some time, but dynamically creating an array wasn't possible yet. Time to change that. The syntax is simple:

procedure xxxx(a & b : nat)
 var x : * [5] int = new [5] int()
 var y : * [][] real = new [a][b] real
end

As you can see, the syntax is "new" followed by the type of the array and an optional empty pair of parentheses. Dynamic arrays initialize their elements with their default value. The size of the new array must be fully specified (but for its elements this is not required):

 new [] int      # illegal
 new [][10] int  # illegal
 new [10] * [] int  # allowed

This brings us to the compatibility of array sizes. When pointing an array variable to an array, the sizes that ARE specified must be evaluatable at compile time and be equal. But you don't HAVE to specify the sizes of course. Open arrays accept any size; this means no size specified or a size that isn't known at compile time.

const globalC : nat = 17  # constant field

procedure test(i : * [] int, j : * [17] int, n : nat)
 var a : * [9] int = j  # ERROR: 9 != 17
 var b : * [17] int = j  # OK
 var c : * [17] int = i  # ERROR: unknown size of i
 var d : * [globalC] int = j  # OK, globalC = 17
 var e : * [n] int = j  # ERROR : value of n unknown
 var f : * [] int = i  # OK
 var g : * [] int = j  # OK
end
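The rules behind these seven cases can be summarised in a few lines of Python. This is only a model of the check as the examples suggest it, not the compiler's implementation; check_array_assignment is a hypothetical name, and sizes are encoded as None (open array), an int (size known at compile time) or a str (name of a value that is not known at compile time).

```python
def check_array_assignment(target_size, source_size):
    """Return None when the assignment is accepted, else an error message."""
    if target_size is None:
        return None  # an open array accepts any size
    if isinstance(target_size, str):
        return f"value of {target_size} unknown"
    if not isinstance(source_size, int):
        return "unknown size of source"
    if target_size != source_size:
        return f"{target_size} != {source_size}"
    return None


GLOBAL_C = 17  # stands for the constant field, already folded to its value

results = {
    "a": check_array_assignment(9, 17),
    "b": check_array_assignment(17, 17),
    "c": check_array_assignment(17, None),
    "d": check_array_assignment(GLOBAL_C, 17),
    "e": check_array_assignment("n", 17),
    "f": check_array_assignment(None, None),
    "g": check_array_assignment(None, 17),
}
```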

For now you cannot specify an initializer for an array. That's why you don't specify arguments between the parentheses when creating a dynamic array. And that's why an array variable or field can't have an initializer part.

sourcefiles and namespaces again

2006-12-04T16:07:00.000+01:00

I was a little brief in my previous post about sourcefiles, namespaces and imports. I'll try to explain it a bit more here. So here's an example of multiple sourcefiles working together.

# File "someDir/MyApp/GUI/mainwindow.hyp"
namespace MyApp.GUI

class MainWindow
# (...)
end


# File "someDir/MyApp/Data/store.hyp"
namespace MyApp.Data

class DBStorage
# (...)
end


# File "someDir/MyApp/Core/main.hyp"
namespace MyApp.Core

import MyApp.GUI.mainwindow
import MyApp.Data.store

static class Main
static procedure main()
 # main program's entry point
 var dbs : MyApp.Data.DBStorage
 dbs.open()
 var win : MyApp.GUI.MainWindow
 win.show()
 # (...)
end
end

I hope this clears it up. Every file is in a namespace. When you import a file you need to specify the namespace AND the name of the file you want to use. The classes inside a file are in the namespace of that file, so that's why the "main()" code uses the full names like "MyApp.Data.DBStorage" instead of just "DBStorage". To get rid of the long names I will support 'using' declarations (in fact aliases), but that's for later.

Something we don't support now is having public/private members in a sourcefile. The Main class of the example above does not have to be public. For now all direct source file members are simply public.

A program often needs to use libraries outside of its own codebase. Therefore Hyper will support some variation of the 'class path' concept from Java but with a different name. I suggest something like 'codebase path' or 'code path'. The default 'code path' will be empty and this means the compiler will only look at the sources you are compiling now (including the imported files from the same codebase). How about closed source libraries? I am thinking of using a concept similar to D's interface files (see the end of this page). This means having a second type of sourcefile that only contains the interface parts of each class (the procedure headers etc...).

Off-topic:
* I am tempted to have the compiler generate C++. This would be somewhat easier than using LLVM, but it would require an extra compilation to get a working program. It could be an acceptable temporary solution.
* The next compiler release will support public/private access checking, full 'const' checking and maybe also 'static' checking. But I am unsure about going for a small or a major release. A major release takes much longer to be released, but allows for large internal improvements in the compiler. A minor release would be version 0.3.31, and a major release would be 0.4.0.
* Restricted pointers will definitely be part of the language. I just don't know yet when to start implementing them and whether or not to wait until after the next major release (0.4.0). I would like to have a better name for them. "References" is a candidate, but it could be too confusing for C++ users that don't know yet what they really are, since they differ a lot from the 'references' from C++.
* Strict in/inout parameters will probably be added when restricted pointers are already implemented.

importing other sourcefiles

2006-12-03T14:06:00.000+01:00

I think I have finally found a way to have multiple sourcefiles working together. I have based it mostly on the packages system from Java, but I don't call it packages anymore. I have decided to keep the 'namespace' keyword for this purpose. In Java each file that is not in the default package is in some specified package, in Hyper each file is in some namespace. This means that namespaces are no longer declared in blocks like classes are, but they are declared in one line on top of the sourcefile. It will also be possible to have a sourcefile that is not in a namespace; this will be useful for one-file test programs. More about that later. Each sourcefile is in a directory structure that corresponds to the namespace of that file (such as for packages in Java). So a file in namespace "Foo.Bar.Baz" could be named "Foo/Bar/Baz/filename.hyp". A file can import other files by specifying the namespace and name of that file, without the extension. So this will be something like:

namespace Abc.Defg.Stuv.Xy
import Foo.Bar.Baz.filename
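The mapping from an import to a file on disk is mechanical, and a short Python sketch shows it. The function name import_to_path and the 'root' parameter are assumptions made for the illustration; only the ".hyp" extension and the directory layout come from the description above.

```python
from pathlib import PurePosixPath


def import_to_path(import_spec, root=""):
    """Map 'Foo.Bar.Baz.filename' to '<root>/Foo/Bar/Baz/filename.hyp'."""
    parts = import_spec.split(".")
    # An importable file is always in a namespace, so at least two parts.
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"not a valid import: {import_spec!r}")
    *namespace, file_name = parts
    return PurePosixPath(root, *namespace, file_name + ".hyp")


path = import_to_path("Foo.Bar.Baz.filename", root="someDir")
```

The last component names the file and everything before it is the namespace, which is why an import always needs at least two parts.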

There are public and private imports, and an import is private by default. If file 1 is publicly imported in file 2, then any file that imports file 2 will have file 1 imported with it. This is not the case if file 1 is privately imported in file 2. For private imports the compiler will have to check that there are no things from the private import exposed to the outside.

Imports are allowed to be circular; this means that file 1 can import file 2 while file 2 also imports file 1. Such things are of course to be used as little as possible. Disallowing circularity is not feasible because these things are not always avoidable, and the language currently does not allow for forward declarations as C++ does.
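Public imports and circular imports interact, so here is a small Python sketch of how the set of visible files could be computed. It is an illustration under assumptions, not the planned implementation: visible_imports is a made-up name, and the import table is a plain dict from a file name to a list of (imported file, is_public) pairs.

```python
def visible_imports(file_name, imports):
    """Return every file that is imported into file_name.

    Direct imports are always visible to the importing file. Beyond that,
    only the public imports of an imported file are passed on.
    """
    visible = set()
    pending = [target for target, _public in imports.get(file_name, [])]
    while pending:
        current = pending.pop()
        # Skipping files we have already seen is what makes cycles harmless.
        if current in visible or current == file_name:
            continue
        visible.add(current)
        pending.extend(
            target for target, public in imports.get(current, []) if public
        )
    return visible


table = {
    "app": [("gui", False)],
    "gui": [("widgets", True), ("helpers", False), ("app", True)],
    "widgets": [("gui", True)],
}
seen_by_app = visible_imports("app", table)
```

In this made-up table 'gui' and 'widgets' import each other and 'gui' imports 'app' back, yet the loop terminates; 'helpers' stays hidden from 'app' because 'gui' imports it privately.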

A sourcefile that is not in a namespace will not be able to import things from other sourcefiles but only from the standard libraries (system.****). And it cannot be imported by any other sourcefiles. This is to minimize its usage. Files not in a namespace are not in some 'default' namespace as Java does it, but aren't in any namespace at all. So there would be no relation to the directory such a file resides in.

The standard library will use the 'system' keyword as the root namespace. Standard input/output will be available with "import system.stdio". (I think I will use the convention of using a lowercase identifier for the name of a sourcefile) This file contains a static class "StdIO" with a number of procedures for stdout printing. There are "print" procedures for literal printing and "printLn" procedures for printing with an additional newline. This would make the "Hello World"-example look like this:

import system.stdio

class Hello
static procedure main()
system.StdIO.printLn("Hello, World!")
end
end

It sure looks better than the current version.

scopes and RAII

2006-11-28T17:16:00.000+01:00

Classes in Hyper cannot have a destructor. I wanted it to be this way because the language uses a garbage collector, and then the execution of a destructor cannot always be guaranteed. So destructors would not be reliable anyway if you want them to release acquired resources. And without destructors there is currently no way to do RAII. This is not acceptable in my opinion. So we need at least one way to do it.

This is my proposal. I would like to introduce a new block statement, "scope". This would take care of resource acquisition and disposal. You would give it an object that supports two methods: "enterScope()" and "leaveScope()". The scope block would call the first method upon entering the scope, and would make sure that the second one is called whenever execution leaves the scope (i.e. normal exit at the end, plus exceptions and jump statements like "return", "break", etc.). The first form of "scope" would support the declaration of a variable, with or without an initializer. But I think it would also be useful to have an anonymous variable (i.e. to give it no name) and/or to have no variable at all and simply use a temporary object on the stack.

Examples:

procedure test(x : * Xyz)
# indented output
scope i : Indenter = x.getIndenter() # increase indentation
i.printLn("Hello world.")
# ...
end # get indentation back to the previous level

scope this.getMutex() # lock mutex
# ...
end # unlock mutex
end test

The first example uses an explicit variable and the second uses no variable at all. In the second example the mutex returned by "getMutex()" is used as the scope object. In this case the result of the scope expression is probably a pointer (or a reference) to a field, but it does not necessarily have to be a pointer.
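The intended behaviour is close to what Python offers with its "with" statement, so the proposal can be tried out there. The sketch below is a model only: the helper 'scope' and the Python-style method names enter_scope, leave_scope and print_ln are stand-ins for the proposed "enterScope()", "leaveScope()" and "printLn", and the Indenter class is invented for the example.

```python
from contextlib import contextmanager


@contextmanager
def scope(obj):
    """Call enter_scope() on entry and always call leave_scope() on exit."""
    obj.enter_scope()
    try:
        yield obj
    finally:
        # Runs on normal exit, on exceptions, and on return/break.
        obj.leave_scope()


class Indenter:
    """Made-up example object: one extra indentation level per scope."""

    def __init__(self):
        self.level = 0
        self.lines = []

    def enter_scope(self):
        self.level += 1

    def leave_scope(self):
        self.level -= 1

    def print_ln(self, text):
        self.lines.append("  " * self.level + text)


out = Indenter()
with scope(out) as i:          # form with a named variable
    i.print_ln("Hello world.")
try:
    with scope(out):           # form without a variable
        raise RuntimeError("boom")
except RuntimeError:
    pass
```

After both blocks the indentation level is back at zero, even though the second block was left through an exception.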

This reminds me about how to treat temporaries. I suggest letting them exist until the function they appear in returns. This would be ideal for use with restricted pointers. Any temporary could then be pointed to by a restricted pointer. And those could be stored in variables, so it would be necessary to keep the temporaries alive until the function returns.

operations on numeric types

2006-11-27T17:15:00.000+01:00

First a bit of info about the built-in numeric types, in case you never saw them before or in case you have forgotten. The floating point types: 'single', 'double' and 'real' (not the subject of this article). The integral types: 'byte', 'nat16', 'nat32', 'nat64', 'int16', 'int32', 'int64' and 'int'. A 'byte' is unsigned and 8 bits. The 'int*' types are signed integer types (specified size or native int type), 'nat*' are unsigned integer types (again, of the specified size or else the native size). The native ones, 'int' and 'nat' are equal in size and have the size of the target platform (32 or 64 bit).

I am still somewhat unsure about the operations on those types. For example: what type should the result of a unary minus on a 'nat16' be? I would say 'int32', because the result can require 17 bits and that would not fit into an 'int16'. And I want to avoid unexpected overflows as much as I can. But this leaves me with some unpleasant consequences. I have no type to use for the unary minus on a 'nat64'. Nor do I have one for the native 'nat', because it would require one more bit than the 'int' has. So at this time there is no unary minus available for those two types. This problem of course doesn't exist for the 'int*' types; for example the unary minus operation on an 'int16' returns an 'int16' again.
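The bit counts in this argument are easy to verify with a few lines of Python, whose integers have no fixed width. The function names are made up for the illustration.

```python
def signed_bits_needed(value):
    """Smallest two's-complement width (in bits) that can hold value."""
    bits = 1
    while not (-(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


def negation_result_bits(unsigned_bits):
    """Bits needed to negate the largest value of an unsigned type."""
    largest = (1 << unsigned_bits) - 1
    return signed_bits_needed(-largest)


needed = {bits: negation_result_bits(bits) for bits in (8, 16, 32, 64)}
```

Negating an unsigned value always needs one bit more than the unsigned type has, so 'nat16' needs 17 bits (hence 'int32') and 'nat64' would need 65 bits, which no built-in type provides.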

I propose the following 'solution': I give the 'nat*' types a member function 'truncateToSigned', or something like that, which discards the most significant bit and then returns a signed type of the same size as the original. Like this:

class nat16
public:
# (...)
const procedure truncateToSigned() : int16
# truncate and convert to signed
end
end

This allows the programmer to do what he/she wants, but allows for data loss. But at least the 'corruption' is visible by looking at the name of the function!
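What "discards the most significant bit" means for the values can be shown with a short Python sketch. It models the proposed member as a free function; the 'bits' parameter is an assumption added so the same code covers the other 'nat*' sizes.

```python
def truncate_to_signed(value, bits=16):
    """Drop the top bit of an unsigned value so it fits the signed type."""
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit the unsigned type")
    return value & ((1 << (bits - 1)) - 1)  # keep only the low bits-1 bits


small = truncate_to_signed(1234)    # fits already, so it is unchanged
large = truncate_to_signed(40000)   # top bit set, so 32768 is lost
```

Values below 32768 come through unchanged; anything larger silently loses 32768, which is exactly the data loss the name is supposed to warn about.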

I already provided the 'int*' types with a function to turn the sign. This is a way to write assignments like 'delta = -delta' like this: 'delta.turnSign' (will be available in the next compiler version). This function doesn't return a result and changes the value of the object itself. (So it is different from the unary minus, which does not change its object and instead returns a value.) I wasn't completely sure about the name, maybe I could have used something like 'negate' instead but I am not sure if that means what's intended (I am not a native English speaker).

I am thinking about a signed byte type. At this time the literal -1 has type int16. That looks like a bit of a waste for such a small number. So it would maybe be nice to add a type (e.g. 'sbyte') for signed 8-bit values. Then I could add to 'byte' also a member 'truncateToSigned' to return a signed equivalent. But to avoid loss the unary minus of 'byte' would still return an 'int16'.

Another proposal. We now have a way to make a signed value from an unsigned one but not the other way around. So I would like to add a member 'abs' (= absolute value) that gives the unsigned value. I don't really like 'abs' because it looks too short. But on the other hand, something like 'absoluteValue' looks too long. Suggestions and arguments for a name are of course welcome :-)

P.S.: I converted my blog to the new Blogger Beta, and it looks like the RSS of the old articles is a bit messed up now