July 2010

How to avoid the assignment statement

I said in an earlier essay that the assignment statement is harmful. Using it tend to produce longer and more confusing programs. The catch is, the assignment is so basic and fundamental that limiting its usage looks unrealistic. Fortunately, it’s not, and I’ll show here how to do it.

Before we begin:

My examples are written in a C-like syntax but bear in mind that much of this advice is difficult to apply in C and C++, because they don’t manage memory automatically.
Don’t take the words “correct” and “wrong” literally. They essentially mean “follows my advice” and “doesn’t follow my advice”.

Initialise right away

// wrong      |  // correct
int x;        |
// ...        |  // ...
x = init();   |  int x = init();

The main advantage is that it is easy to see the value of x just by looking at its definition. It also guarantees that x will always contain an actual value, instead of a random one, though most compilers check that for you. Finally, it makes the code less verbose.

The reason why the “wrong” way is used at all is because many old programming languages forced you to declare variables at the beginning of blocks. It was easier for compilers. It’s no longer an issue, however. Not even in C.

Declare new variables

// wrong         |  // correct
int x = init();  |  int x = init();
// ...           |  // ...
x = something(); |  int y = something();

Instead of changing the content of a variable, you can just declare a new one. By avoiding assignment, you can guarantee that your variables won’t change. You can guarantee that the current value of x is the one it has been initialised with: init().

I see two possible reasons why one would want to do it the “wrong” way: efficiency, and conciseness. Efficiency is no longer an issue because modern compilers can optimise away the allocation of the new variable. Conciseness just isn’t worth the confusion it causes.

Make functions, not procedures

// wrong                 |  // correct
void to_utf8(string s);  |  string to_utf8(string s);
                         |
// ...                   |  // ...
                         |
string s1 = latin();     |  use_string(to_utf8(latin()))
to_utf8(s1);             |
use_string(s1);          |

Here the “correct” way uses an ordinary mathematical function: it takes something as its input, and return a result. The “wrong” way, on the other hand, use a procedure. Unlike functions, procedures can have effects beyond their result, like modifying their arguments. That makes them more delicate to use.

When you can write a function instead of a procedure, do so. It makes your program simpler. The trick is to stop thinking about what to do (convert a string) but just about what to produce (a converted string). Try to think about the end result, instead of how to obtain it.

The reason the “wrong” way is used at all is because many old programming languages makes it hard to return more than a simple number. A more convenient way to produce an interesting result was through effects. And the actual result is often a simple error code. Now, in many languages, returning complex results is no longer a problem.

Make your objects immutable

In many introductory courses on Object Oriented Programming, you are introduced to the famous 2D point:

// wrong
class Point
{
public:
  // constructor
  Point() { x = 0; y = 0; }

  float get_x() { return x; }
  float get_y() { return y; }

  void set_x(float new_x) { x = new_x; }
  void set_y(float new_y) { y = new_y; }

  move(Point p) {
    x = x + p.x;
    y = y + p.y;
  }
private:
  float x; float y;
};

The idea behind this design is simple: you can create new points with the constructor, and initialise it with set_x() and set_y(). The internal state is encapsulated (private), and access is done through get_x() and get_y(). As a nice bonus, you can also move the point directly with move().

This may be good for didactic purposes, but for practical ones, this code is needlessly complicated, and makes several mistakes:

The constructor initialises x and y to zero. If you want another values, you have to manually set them. You can’t initialise right away.
The default way to manipulate a point is to modify it. That is a form of assignment. You are not encouraged to declare new variables.
The set_x(), set_y(), and move() methods modify the object in place. They are procedures, not functions.
x (and y) is private, but you can get_x() and set_x() anyway. So, you think you have encapsulation, but you don’t.
The move() method doesn’t need to be in the Point class. Putting it there makes the class larger, and a bit more difficult to read and use.

The correct design is much simpler, and no less capable:

// correct
class Point
{
public:
  // constructor
  Point (float x, float y) {
    _x = x; _y = y;
  }

  x() { return _x; }
  y() { return _y; }

private:
  float _x; float _y
}

Point move(Point p1, Point p2) {
  return Point(p1.x() + p2.x(),
               p1.y() + p2.y());
}

Alternatively, _x and _y could be declared public and constant, if possible. Also, the move() function don’t really move anything any more, so you may want to rename it.

Use purely functional data structures

You remember you should declare new variables? Well this advice applies even with huge data structures. Surprisingly, it is not as inefficient as you might think. The idea is to avoid copying the entire data structure every time you want to make a modification. (This rules out arrays and hash tables.)

Instead you should use what we call purely functional data structures. If you want to know about them, Chris Okasaki’s thesis (and book of the same name) is a great resource. Here, I will just give you a taste with the linked list.

A linked list is either the empty list, or a cell containing an element and a pointer to another list.

┌───┬───┐   ┌───┬───┐
│ x │  ───> │ y │  ───> empty
└───┴───┘   └───┴───┘

Such a data structure is very easy to design in ML languages, a bit less so in class-based ones:

-- Haskell
-- A list is either the Empty list,
-- or it contains an Int and a List
data List = Empty
          | NotEmpty Int List

-- utility functions

is_empty Empty         = true
is_empty NotEmpty x xs = false

head Empty         = error
head NotEmpty x xs = x

tail Empty         = error
tail notEmpty x xs = xs

// Java(ish)
class List
{
public:
  // constructors
  List() { _is_empty = true; }
  List(int i, List next) {
    _i        = i;
    _next     = next;
    _is_empty = false;
  }

  bool is_empty() { return _is_empty; }

  int head() {
    if (_is_empty) error();
    return _i;
  }

  List tail() {
     if (_is_empty) error();
     return _next;
  }

private:
  int  _i;
  List _next;
  bool _is_empty;
}

Now you will note that my List class is immutable. We can’t modify List objects. We can only build new lists out of existing ones. Cheaply. Because when you construct a new list, it shares most of its elements with existing ones. So say we have a list l, and an integer i:

    ┌───┬───┐   ┌───┬───┐
l = │ x │  ───> │ y │  ───> empty
    └───┴───┘   └───┴───┘

i = 42

So, to add i to the beginning of l, we just create a new list l2:

     ┌───┬───┐
l2 = │ i │   │
     └───┴─│─┘
           │
           │   ┌───┬───┐   ┌───┬───┐
l  =       └──>│ x │  ───> │ y │  ───> empty
               └───┴───┘   └───┴───┘

Or, in code:

List l  = List(x, List(y, List()));
int  i  = 42;

List l2 = List(i, l); // cheap

l still exist, unmodified, and the creation of the entire l2 list only involved the creation of one cell. Likewise, removing the top element from a list is just as cheap.

When you can’t help it

Sometimes, you just can’t avoid the assignment statement or other side effects. Maybe you need so much efficiency that you have to mutate state to optimise your program (like in video encoders). Maybe you have to interact with a user, like in video games. Maybe your language doesn’t handle memory automatically, effectively preventing you to use purely functional data structures. Maybe you need to write a loop, and you can’t use list comprehensions, higher order functions, or recursion.

The best you can do here is to isolate your impure code (which uses the assignment statement) from the rest of the program. Say for instance that you want to sort an array. Fast. A good choice could be quicksort. The problem is that quicksort relies heavily on mutation. But you can hide it:

array pure_sort (array a)
{
  array a2 = copy(a);
  quicksort(a2); // modify a2, nothing else
  return a2;
}

So, while the internal code of my pure_sort() function doesn’t follow my own advice, it makes little matter, because no effect leaks from it. In the end, pure_sort() behaves like an ordinary function.

Conversely, when you interact with a user, be careful to separate the interaction part and the computation part. Say for instance that you write a program that draws a point on the screen, and moves it according to the mouse movement. It may look like this:

// wrong

Point p(0, 0);
while(true) // loop forever
{
  p = move(p, get_mouse_movement());

  if (p.x() < 0   ) p = Point(0    , p.y());
  if (p.x() > 1024) p = Point(1024 , p.y());
  if (p.y() < 0   ) p = Point(p.x(), 0    );
  if (p.y() > 768 ) p = Point(p.x(), 768  );

  draw(p);
}

The mistake here is to check for outbound coordinates in the main program. This check is not really part of the interaction, so it should be separated from it:

// correct

float inside(float x, float min, float max)
{
  return x < min ? min
       : x > max ? max
       :           x;
}

point smart_move(point p, point mouse_movement)
{
  Point q = move(p, mouse_movement);
  return Point(inside(q.x(), 0, 1024),
               inside(q.y(), 0,  768));
}

// main program
Point p(0, 0);
while(true) // loop forever
{
  p = smart_move(p, get_mouse_movement());
  draw(p);
}

Now, the main program became much simpler. The computation part, smart_move() and inside(), may be tested separately, and even re-used elsewhere. Now, if you didn’t like the ternary operator’s syntax, you may bend the rules:

// not too bad

float inside(float x, float min, float max)
{
  float result = x;
  if (result < min) result = min;
  if (result > max) result = max;

  return result;
}

But whatever you do, inside() should still be a mere function.

Conclusion

It’s all about decoupling. Programs should have clear internal boundaries. Modules’ interfaces should be small, easy to understand, and easy to use. Avoiding the assignment statement and sticking to immutable objects help the interfaces to be clear and explicit. This is no silver bullet, but it helps. A lot.