June 2009. [New version here.](assignment)

Assignment Statement Considered Harmful
=======================================

In his essay "Go To Statement Considered Harmful", Edsger W. Dijkstra demonstrated how the use of `goto` made programming harder. Now, `goto` _is_ considered harmful, and has been replaced by more reasonable constructs.

I attempt here to demonstrate the same about the assignment statement. Simply put, assignment is rarely mandatory, confuses the word "variable", and impedes program analysis, variable naming, refactoring, and even performance.

This is old news. Any programmer who has been exposed to functional languages and practices knows this. It has also been said, more than once. However, I didn't find anyone who said it in those terms, as a direct attack on this basic feature.

Clarification
-------------

- A _variable_ is a place in which a value is stored. Not to be confused with the value itself.
- An _assignment_ is the replacement of the value currently stored in a variable by another. Not to be confused with _initialization_, which is the placement of a value in a new variable.

Assignment is not mandatory
---------------------------

Assignment has two purposes. The first is storing a value for later use. Initializing a new variable can do that, and is less disruptive. The second is the construction of loops. Recursive function calls can do that, but many programmers find plain loops more readable. Fortunately, even loops are rarely mandatory. The factorial is a case in point (here in pseudo-code):

    fac n = product [1..n]

This is just a definition. The natural way to implement it in an imperative language is to use a loop (here in C):

    int fac(int n)
    {
        int acc = 1;
        while (n > 0)
            acc *= n--;
        return acc;
    }

It hardly looks like the definition of a factorial. The product and the sequence of integers are there, but interleaved, somewhat hidden. There is a better way (here in Haskell):

    fac n = product [1..n]

This is actual, working code.
`[1..n]` denotes a list of integers, ranging from 1 to n. `product` is a function (not a primitive), which takes a list as argument and returns the product of its elements.

This was just an example. In real code, there are all sorts of loops. However, they follow a few well-known patterns, just like the uses of `goto` did. These patterns have been captured in recent programming languages like Haskell, just like the patterns of `goto` had been captured in imperative languages. Now, in a reasonable programming language, loops are hardly needed, and neither is assignment.

Assignment makes the term "variable" confusing
----------------------------------------------

Compare these two programs:

    (* OCaml *)          | /* C */
    let x = ref 1  in    | int x = 1;
    let y = ref 42 in    | int y = 42;
    x := !y;             | x = y;
    print_int !x;        | printf("%d", x);

In the OCaml program, the "`ref`" function and the "`!`" operator make a clear distinction between a variable and its value. In C, such disambiguation is made from context. All popular imperative programming languages are like C in this respect.

This leads to many language abuses, like "`x` is equal to `1`, until it is changed to be equal to `y`". Taking this sentence literally means making three mistakes:

1. `x` is a variable. It can't be equal to `1`, which is an integer. A glass is not the drink it contains. Likewise, a variable is not the value it contains.
2. `x` and `y` are not equal, and never will be. They represent two distinct places. They can hold the same value, though. That two different glasses contain the same drink doesn't make them one and the same.
3. `x` doesn't change. Ever. The value it holds is merely replaced by another. A glass doesn't change when you replace its water with wine.

The gap between language abuse and actual misconception is small. If we have any misconception about variables, even a temporary one, how can we hope to write correct programs?

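
The distinction between a variable and the value it holds can be made explicit even in C, by comparing contents and addresses separately. The following is only a minimal sketch: the helper names `same_value` and `same_place` are illustrative and do not come from the programs above.

```c
#include <stdbool.h>

/* True when the two variables currently hold the same value
 * (the two glasses contain the same drink). */
bool same_value(const int *a, const int *b)
{
    return *a == *b;
}

/* True only when both pointers designate the same variable
 * (it is one and the same glass). */
bool same_place(const int *a, const int *b)
{
    return a == b;
}
```

After `int x = 1; int y = 42; x = y;`, the call `same_value(&x, &y)` holds while `same_place(&x, &y)` does not: `x` and `y` contain the same value, yet they remain two distinct places.
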

Assignment makes program analysis harder
----------------------------------------

Compiler writers have understood that for quite some time. Now, a typical compiler for an imperative language will first transform the source code into [SSA][4], an intermediate form where assignment is basically banned. This makes optimization simpler.

[4]: http://en.wikipedia.org/wiki/Static_single_assignment_form

This also applies to manual analysis. Imagine the everyday situation of trying to understand the code of a colleague:

    int x = 42;
    /*
     * Big Blob of Code
     */
    printf("%d", x); /* What does it print? */

Maybe it prints 42. Maybe not, because `x` may have changed (whoops, sorry, _may not contain `42` any more_). To be sure, we have to look at that big blob of code. Forgetting to do so may introduce a bug. Without assignment, the dependency chain is obvious, and can't be ignored.

Assignment makes variable naming harder
---------------------------------------

When you know a variable will always hold the same value, you name it after that value. If this value can change, you have to consider _all_ possible values. A name rarely scales that well. That makes code harder to understand.

For instance, in my C implementation of the factorial, I named the accumulator of the loop "acc". As a name for an accumulator, this is accurate. However, the last value of `acc` was the factorial of `n`. A good name to reflect that would have been "fac_n". Neither name is satisfactory, because each misses something important.

Assignment makes refactoring harder
-----------------------------------

There is a very important, often overlooked, rule about programming: the more you allow, the more you prevent. This isn't as contradictory as you might think. For each thing you allow in a program, you have to drop a set of assumptions about it. As a result, some manipulations become unsafe or impossible.

In high school, a definition like "let a = x + 1" meant that any occurrence of "a" or "(x + 1)" could be replaced by the other without changing the meaning of what is written. They are equivalent, and therefore substitutable. Imperative programs are more complicated:

    int x = 42;
    x = 1;
    printf("%d", x); // try and replace x by 42!

Without assignment, "`x`" and "`42`" would be equivalent and substitutable. Because they are not, refactoring is harder.

Assignment hurts performance
----------------------------

Optimization during compilation can be seen as a form of refactoring. Harder refactoring means harder optimizations. It complicates the compiler and makes it generate less efficient code. SSA form mitigates this problem, but doesn't eliminate it.

Another thing you lose when you allow assignment is sharing. It becomes important when you manipulate relatively complicated data structures, such as associative maps. There are three obvious ways to manipulate a data structure:

1. Directly modify it (assignment allows that).
2. Create a new structure by copying the whole thing.
3. Create a new structure by referencing the old one. (The unchanged parts are shared.)

Each way has a specific problem. Way 1 is effectively an assignment, with all the disadvantages mentioned above. Way 2 wastes time and memory. Way 3 is unsafe if you ever use way 1.

If assignment is allowed, way 2 is often your only safe choice. If it is not, way 3 is safe _and_ convenient _and_ efficient.

These problems are pervasive
----------------------------

Using assignment sparingly is one thing. Compromises must be made, for instance to achieve the best possible performance on a critical section of the code, or in high-performance applications like device drivers.

Using assignment everywhere is another thing. The result is often a [big ball of mud][5].

[5]: http://www.laputan.org/mud/

Call to action
--------------

1. Learn a purely functional language (I suggest Haskell).
   It will show you how you can do without assignment, and what the advantages of not having it are. Beware, though: it may be addictive.
2. Avoid making functions which modify their arguments, or object methods which modify the object. Using them is often verbose and error-prone.
3. Try to use better languages. At the very least, demand garbage collection. Without it, rule 2 is difficult to follow.
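
Rule 2, together with the sharing described as "way 3" earlier, can be sketched even in C. This is only an illustration under stated assumptions: `struct node`, `cons`, and `sum` are hypothetical names, and the nodes are meant to live in automatic storage, which sidesteps the manual memory management that the absence of garbage collection would otherwise require.

```c
#include <stddef.h>

/* An immutable singly linked list: a node is never modified once built. */
struct node {
    int head;
    const struct node *tail; /* NULL marks the empty list */
};

/* Build a new list by referencing an existing one.
 * The list pointed to by `tail` is shared, not copied, and left untouched. */
struct node cons(int head, const struct node *tail)
{
    const struct node n = { head, tail };
    return n;
}

/* Sum the elements recursively: no loop, no assignment. */
int sum(const struct node *list)
{
    return list == NULL ? 0 : list->head + sum(list->tail);
}
```

For example, given `const struct node base = cons(2, NULL);`, both `cons(1, &base)` and `cons(9, &base)` yield lists that share `base` as their tail. Their sums are 3 and 11 respectively, while `base` still sums to 2. Because nothing ever modifies a node, the sharing is safe.
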