Thursday, September 16, 2010

Using data-flow information in the language

It's been a long time since I've written the previous post on this blog. The project has been mostly inactive for the last year, since I don't have much time anymore to work on it. But I've been on holiday this summer and thought about some new features I'd like to add to the language. This post will start with a simple version of a new concept, and I will expand it in future posts.

What is 'data-flow information'? Well, it's something that is normally used in compilers for optimization. The compiler examines how the source code will be executed, and predicts what value(s) a variable can have at various locations in the source code; this is called data-flow analysis.
But I've been thinking that it can be useful in the language itself. How?

Well, for example, the compiler can check if you are possibly dereferencing a null-pointer. What happens in other languages when you dereference a null-pointer? Some programming languages, like C or C++, assume that the programmer is smart enough to prevent that from happening, and if it does happen, the program will simply crash. Other languages insert run-time checks that look at the value of a pointer before it is dereferenced, and make sure that an exception is thrown when it is null. Both approaches have the disadvantage that bugs regarding null-pointers are only discovered at run-time, and not always before the product is shipped to the customer.

My idea is to have the compiler check if a pointer can be null when you dereference it. So when you have a pointer that needs to be dereferenced, the compiler forces you to write a check in your code to see if the pointer is null or not. An example:

procedure test1()
var i : int = 10
var p : * int = SomeExternalClass.getFooBar() # returns a pointer

i += p # ERROR: p can be null

if (p !$ null) then
i += p # OK
end

i = p * 3 # ERROR: p can be null

if (p =$ null) then
return
end

i = p * 3 # OK, p cannot be null
end

The compiler can track the value of a pointer variable inside a function, and decide at any point if that value can be null or not. But it doesn't work for all cases. Look at the following example:

class Test2
var m : * Foo

procedure test2()
var p : * Foo = SomeExternalClass.getPropertyX()

p.doSomething() # ERROR: p can be null

if (p !$ null) then
p.doSomething() # OK

p $= SomeExternalClass.getPropertyX()
p.doSomething() # ERROR: p can be null again
end

m $= SomeExternalClass.getPropertyX()
m.doSomething() # ERROR: m can be null

if (m !$ null) then
this.someOtherFunction()
m.doSomething() # NOT SURE, m might have been manipulated
end
end
end

Example function test2 shows some limitations. First, the compiler has to assume that calling "SomeExternalClass.getPropertyX()" returns ANY possible pointer value, even if that function returns the same value over and over again. That's not optimal. Second, the compiler cannot know for sure that class variables aren't changed by other functions (or even code in another thread). So you'd have to assign the value of the class variable to a local variable and work with the local variable if you want to be sure about its value.

What if you're sure that a variable isn't null, but the compiler doesn't know that? I propose a new mechanism for telling the compiler about that:

procedure test3()
var p : * Foo = SomeExternalClass.getPropertyX()

# We know that the function didn't return a null pointer,
# so we tell the compiler about that.
assert (p !$ null)

p.doSomething() # OK, the compiler trusts you

end

An assert could be used for debug purposes as well, to verify that your assumptions are valid. The compiler will insert a runtime check to see if you're telling the truth, and the compiler will likely get an option to turn off such checks for release builds.

That's it for now. I will expand the concept in future writings.

Wednesday, February 25, 2009

changing the 'iterate' statement

I have been thinking about changing the iterate statement for some time now, and I've written about it before. Well, I have decided to finally do something about it, because the current 'iterate' is useless to me at this time (I cannot implement it in my llvm-new branch of the compiler if I know that the statement will change completely anyway).

To be honest, I still haven't completely figured out a complete solution yet. So I will provide a temporary solution to get a usable 'iterate' statement for now, in a way that will (hopefully) be compatible with the final solution. What kind of final solution am I looking for? It will involve some kind of range type, likely combined with an iterator concept. Iterators will be needed soon for the new interface of the 'string' type.

The new syntax is as follows:
iterate variable in [begin .. end]
(...)
end
This syntax is a hardcoded version of something that the new range feature will provide. I have intentionally kept it very simple; it does not provide stepping or combining multiple iterates into one. A simple example of how it can be used:
procedure fac(n : nat) : nat
var f : nat = 1

iterate i in [2 .. ++n]
f *= i
end

return f
end
As you can see, there is an important semantical difference with the old syntax: the end specification is no longer inclusive. Another important difference is that the iteration process will be strictly incremental, so the body will not be executed at all if begin is not smaller than end. When iterating an index variable through the elements of a container, you will need to use the size of the container in end:
iterate i in [0 .. v.size]
v[i].doSomething
end
I will try to implement the new syntax and semantics as soon as possible in the front end (trunk), and then add codegen for it in the llvm-new branch.

Sunday, February 15, 2009

Website moved again

My website has moved again. The new location:

http://hyperquantum.be

And the pages about Hyper are here:

http://hyperquantum.be/hyper

I hope it's the last time I need to move my website. Having to update all links to it is not that much fun really.

Saturday, February 07, 2009

Having "pure" functions

I have been thinking on this new feature for some time, but I'm not sure if it's a good idea. Just in case, the keyword for it has already been reserved in the 0.4.2 release of the compiler front end.

The idea is to introduce "pure" functions. Any procedure that is qualified as "pure" would guarantee to have no side effects and to have it's result only depend on the values of its parameters. The main purpose of this is to enable extra optimizations by the compiler, and to guarantee that some functions that do not need side-effects will not have them. So if you have some container class, its "count()" procedure (that returns the number of elements in the container) can be qualified as "pure" because it doesn't (and shouldn't) alter the state of the container, and because as long as the container is not modified, it will keep returning the same value over and over again. This implies that we consider the "this" pointer an implicit parameter for each non-static procedure, and that "pure" is a more restrictive version of the "const" qualifier for procedures. If you need a "pure" procedure that doesn't depend on its hidden "this" parameter, then you should also qualify it as "static".

Now the implications of this new feature:
  1. Because "pure" is actually a more restrictive version of "const", you can have a normal and a "pure" version of the same procedure, just like you can put a normal and a "const" procedure with the same parameters in a class.
  2. You can override a "const" procedure with a "pure" one.
  3. You can only override a "pure" procedure with a "pure" one.
  4. The compiler needs to check if a "pure" procedure really has no side-effects and if it only uses its parameters to calculate the result (!).
Of those four implications, the last two have a large impact.

The third one could make things a bit more difficult than they are now. If you design a class that will be inherited from, you will have to be careful that any procedure you mark as "pure" will never have to be overridden with one that isn't pure. Could be trivial for functions like "count()", but for other cases it might be a (very) difficult decision.

The fourth one might even be a show-stopper for including the feature. If the compiler needs to check that a procedure is really "pure", then it would need to put restrictions on each statement and expression used in that procedure. You would not be allowed to access any class fields, from the same class or from any other class. All expressions used should be pure as well. Any assignment statements should only modify local variables. If you declare variables in the procedure, their constructors should also be pure (no access to outside data and no side-effects elsewhere). It would be possible to have the compiler check those things, but it would put a major burden on the programmer. He would need to have "pure" in his mind everytime he writes a constructor for a class, because it might need to be called at some point somewhere in a "pure" procedure. He would need to mark all (or most) of his user-defined operators as "pure", unless I make "pure" the default or even a requirement for all binary (and unary?) operators. All these things would cause the programmer's code to be filled with things marked as "pure", just in case. Most programmers would probably just avoid "pure" entirely because it makes things too difficult.

So I'm unsure wether or not to include the feature. It could certainly be useful. But would it be worth the effort?

Another possibility is to use "pure" just as an attribute that informs the compiler about potential optimisations. But in that case someone could mark a procedure as "pure" while it does have side-effects.

At this time, I am still thinking about the feature. Feedback is welcome, as usual.