Thursday, September 16, 2010

Using data-flow information in the language

It's been a long time since I wrote the previous post on this blog. The project has been mostly inactive for the last year, because I no longer have much time to work on it. But I've been on holiday this summer and thought about some new features I'd like to add to the language. This post starts with a simple version of a new concept, and I will expand on it in future posts.

What is 'data-flow information'? Well, it's something that is normally used in compilers for optimization. The compiler examines how the source code will be executed, and predicts what value(s) a variable can have at various locations in the source code; this is called data-flow analysis.
But I've been thinking that it can be useful in the language itself. How?

Well, for example, the compiler can check if you are possibly dereferencing a null pointer. What happens in other languages when you dereference a null pointer? Some programming languages, like C or C++, assume that the programmer is smart enough to prevent that from happening; if it does happen, the behavior is undefined and the program will typically crash. Other languages insert run-time checks that look at the value of a pointer before it is dereferenced, and make sure that an exception is thrown when it is null. Both approaches have the disadvantage that bugs involving null pointers are only discovered at run-time, and not always before the product is shipped to the customer.

My idea is to have the compiler check if a pointer can be null when you dereference it. So when you have a pointer that needs to be dereferenced, the compiler forces you to write a check in your code to see if the pointer is null or not. An example:

procedure test1()
    var i : int = 10
    var p : * int = SomeExternalClass.getFooBar() # returns a pointer

    i += p # ERROR: p can be null

    if (p !$ null) then
        i += p # OK
    end

    i = p * 3 # ERROR: p can be null

    if (p =$ null) then
        return
    end

    i = p * 3 # OK, p cannot be null
end
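
The ERROR and OK markers in test1 can be reproduced by a very small checker. What follows is only a sketch in TypeScript, not the language's real compiler: the statement shapes in Stmt, the names Nullness and checkNullness, and the line numbers are all invented for illustration, and a real implementation would work on a full control-flow graph instead of a nested statement list.

```typescript
// Hypothetical mini-IR, invented for this sketch.
type Stmt =
  | { kind: "assignUnknown"; target: string }            // p = someCall(), may return null
  | { kind: "deref"; target: string; line: number }      // any use that dereferences p
  | { kind: "ifNotNull"; target: string; then: Stmt[] }  // if (p !$ null) then ... end
  | { kind: "ifNullReturn"; target: string };            // if (p =$ null) then return end

type Nullness = "maybeNull" | "nonNull";

// Walks the statements in order, tracking what is known about each pointer,
// and returns the lines where a possibly-null pointer is dereferenced.
function checkNullness(body: Stmt[], known: Map<string, Nullness> = new Map()): number[] {
  const errors: number[] = [];
  for (const stmt of body) {
    switch (stmt.kind) {
      case "assignUnknown":
        known.set(stmt.target, "maybeNull");
        break;
      case "deref":
        if (known.get(stmt.target) !== "nonNull") errors.push(stmt.line);
        break;
      case "ifNotNull": {
        // Inside the branch the pointer is known to be non-null.
        const inside = new Map(known).set(stmt.target, "nonNull");
        errors.push(...checkNullness(stmt.then, inside));
        // After the branch both paths merge: a "nonNull" fact survives only
        // if it holds on the path that skipped the branch and the one that took it.
        known.forEach((state, name) => {
          if (state === "nonNull" && inside.get(name) !== "nonNull") {
            known.set(name, "maybeNull");
          }
        });
        break;
      }
      case "ifNullReturn":
        // Execution only continues past this statement when the pointer is not null.
        known.set(stmt.target, "nonNull");
        break;
    }
  }
  return errors;
}

// test1 from above, with made-up line numbers 1..4 for the four uses of p.
const test1: Stmt[] = [
  { kind: "assignUnknown", target: "p" },
  { kind: "deref", target: "p", line: 1 },                                              // ERROR
  { kind: "ifNotNull", target: "p", then: [{ kind: "deref", target: "p", line: 2 }] },  // OK
  { kind: "deref", target: "p", line: 3 },                                              // ERROR
  { kind: "ifNullReturn", target: "p" },
  { kind: "deref", target: "p", line: 4 },                                              // OK
];
const flagged = checkNullness(test1); // flagged is [1, 3]
```

The only interesting step is the merge after the if: whatever was learned inside the branch is forgotten again, unless it was already known before the branch.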

The compiler can track the value of a pointer variable inside a function, and decide at any point if that value can be null or not. But it doesn't work for all cases. Look at the following example:

class Test2
    var m : * Foo

    procedure test2()
        var p : * Foo = SomeExternalClass.getPropertyX()

        p.doSomething() # ERROR: p can be null

        if (p !$ null) then
            p.doSomething() # OK

            p $= SomeExternalClass.getPropertyX()
            p.doSomething() # ERROR: p can be null again
        end

        m $= SomeExternalClass.getPropertyX()
        m.doSomething() # ERROR: m can be null

        if (m !$ null) then
            this.someOtherFunction()
            m.doSomething() # NOT SURE, m might have been manipulated
        end
    end
end

Example function test2 shows some limitations. First, the compiler has to assume that every call to "SomeExternalClass.getPropertyX()" can return ANY possible pointer value, including null, even if that function actually returns the same value over and over again. That's not optimal, because anything you learned about p is lost as soon as you assign to it again. Second, the compiler cannot know for sure that class variables aren't changed by other functions (or even by code in another thread), so a check on m is no longer reliable after the call to someOtherFunction. If you want to be sure about the value, you'd have to assign the class variable to a local variable and work with the local variable instead.
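
The local-variable workaround can be sketched as follows. This is written in TypeScript, which has a comparable flow-based null check when strict null checking is enabled, and not in the language itself. The bodies of Foo and someOtherFunction are invented for this sketch; someOtherFunction deliberately overwrites the member to show why the local copy matters.

```typescript
// Invented stand-in for the Foo class used in test2.
class Foo {
  calls = 0;
  doSomething(): void {
    this.calls += 1;
  }
}

class Test2 {
  m: Foo | null = new Foo();

  // Assumption for this sketch: some other function that changes the member.
  someOtherFunction(): void {
    this.m = null;
  }

  test2(): number {
    const local = this.m;        // copy the class variable into a local
    if (local !== null) {
      this.someOtherFunction();  // may overwrite this.m, but cannot touch `local`
      local.doSomething();       // safe: `local` is still the object that was checked
      return local.calls;
    }
    return 0;
  }
}
```

Because nothing outside test2 can assign to the local, the null check on it stays valid for the whole if block, no matter what the called function does to m.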

What if you're sure that a variable isn't null, but the compiler doesn't know that? I propose a new mechanism for telling the compiler about that:

procedure test3()
    var p : * Foo = SomeExternalClass.getPropertyX()

    # We know that the function didn't return a null pointer,
    # so we tell the compiler about that.
    assert (p !$ null)

    p.doSomething() # OK, the compiler trusts you
end

An assert could be used for debug purposes as well, to verify that your assumptions are valid. The compiler will insert a runtime check to see if you're telling the truth, and the compiler will likely get an option to turn off such checks for release builds.
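
For comparison, here is roughly how the same idea looks in TypeScript, using one of its assertion functions. It is a sketch of the concept and not the planned implementation: assertNotNull, checksEnabled, and the body of getPropertyX are names and code made up for this example, and the checksEnabled variable merely stands in for a compiler option that would remove the checks from release builds.

```typescript
// Hypothetical switch for this sketch; a real compiler would use a build option.
let checksEnabled = true;

// Tells the type checker that `value` is not null after the call,
// and verifies that claim at run-time while checks are enabled.
function assertNotNull<T>(value: T, what: string): asserts value is NonNullable<T> {
  if (checksEnabled && (value === null || value === undefined)) {
    throw new Error(`assertion failed: ${what} is null`);
  }
}

// Invented stand-in for the Foo class used in test3.
class Foo {
  doSomething(): string {
    return "done";
  }
}

// Stand-in for SomeExternalClass.getPropertyX(); the body is invented.
function getPropertyX(): Foo | null {
  return new Foo();
}

function test3(): string {
  const p = getPropertyX();
  assertNotNull(p, "p");  // run-time check; from here on p is treated as non-null
  return p.doSomething(); // OK, the compiler trusts the assert
}
```

With checksEnabled set to false the assert still silences the compiler, but a wrong assumption is no longer caught at the assert itself. That is the trade-off of turning the checks off.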

That's it for now. I will expand the concept in future writings.