July 2015
What is good code? (part 2)
In part 1, I defined good code in the obvious way: first, it has to meet our needs (that is, satisfy the requirements, be they explicit or unspoken, domain-specific or computer-specific); and it must be cheap. Cheap to write in the first place, cheap to maintain…
Simply put, good code is cheap code.
After a little thinking, I concluded that the only way to make cheap code is to keep it short, modular, and not too reliant on huge dependencies or fancy concepts.
This part will focus on how language mechanisms relate to good code. Basically, how they help us write shorter, more modular code. I will mostly cover the mainstream ones, and omit those I don’t know well enough —someone else will have to discuss them.
Functions
Also called subroutines, procedures, methods… Functions are the abstraction mechanism for applicative languages (that is, almost all languages in existence). At its core, a function is a reusable, parametrised piece of code. Obviously, it helps you write much less code. It also increases modularity.
In most languages, functions can’t be inspected from the outside. You can only call them. (If your functions are first class, you can also copy them, pass them around… but you still can’t inspect them.)
Some functions are “pure”: the same input will have them produce the same result, and they won’t touch anything else. In this case you only care about the relation between the input and the result. For instance:
uint factorial(uint n)
{
    uint fac = 1;
    while (n > 0)
    {
        fac *= n;
        n--;
    }
    return fac;
}
While this function mutates local state to perform its computation, it does not leak any. From the outside, I might as well have written:
uint factorial(uint n)
{
    return n == 0 ? 1
                  : n * factorial(n - 1);
}
(Let’s ignore the space leak, shall we?)
Other procedures modify their input:
void bubble_sort(int* arr, size_t size)
{
    for (uint i = 1; i < size; i++)
        for (uint j = 0; j < size-i; j++)
            if (arr[j] > arr[j+1])
                swap(arr + j, arr + j+1);
}
That’s the moral equivalent of a fancy assignment statement —it makes the function a bit less modular than a pure one. But not such a big deal. We can isolate the side effect:
int* bubble_wrapper(int* arr, size_t size)
{
    int* sorted = malloc(size * sizeof(int)); // gotta free that
    memcpy(sorted, arr, size * sizeof(int));
    bubble_sort(sorted, size);
    return sorted;
}
Yet other procedures maintain persistent state, and don’t return the same result every time. Like this:
int incr(int n)
{
    static int sum = 0;
    sum += n;
    return sum;
}
That’s much less modular. The result of this function doesn’t just depend on its input. It also depends on whether the function has been called somewhere else, how, and how many times. This can add lots and lots of implicit dependencies, making the code that much harder to understand. This is nothing however, compared to procedures that touch global mutable state:
int sum = 0;

int wtf(int n)
{
    sum += n;
    return sum;
}
Now not only does the order in which wtf() is called matter, what we did to sum from the outside also matters. And of course, sum depends on the usage of wtf(), and who knows how many other uses it may have. One little innocent-looking global variable, and the number of potential inter-dependencies just keeps piling up. Next thing you know, you end up in a tangled mess where any change may trigger some unexpected behaviour. Not modular at all.
But you already knew global variables were bad, right?
First class functions
Like functions, only better.
In every single language I know, integers are first class values: you can put them in a variable, use them as arguments, return them to the caller, build compound types with them… They have the full flexibility you would expect from ordinary values.
Functions are mathematical objects like any other. As such, they should enjoy the same flexibility. Unfortunately, that would make them harder to implement. So, many languages put restrictions around what you can do with functions:
- You can’t define them anywhere you’d like. It has to be at the top level, or inside a module or class… You often can’t define a function inside another function.
- You can’t put them in a variable.
- You can't use them as arguments, or return them to the caller.
C for instance has a way around the last two, with function pointers. Those pointers are indeed first class values, but the function they point to still can’t be defined inside another function, then pulled out of its context. To do that, you need to implement lexical closures.
First class functions are all about lifting those restrictions, so you can manipulate them willy nilly. The main advantage is that you can now treat behaviour as data. In particular, you can now have behavioural parameters.
The importance of this cannot be overstated. Every modern language supports behavioural parameters in one form or another. Even C, as evidenced by qsort() from the standard library. C++, Java, C#, and many others had subtyping from the start. And of course, functional languages have first class functions.
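To make this concrete, here is a minimal sketch of qsort() taking its ordering behaviour as a parameter (the comparator compare_ints is my own helper, not part of the standard library):
#include <stdio.h>
#include <stdlib.h>

/* The ordering is a behavioural parameter: a plain function pointer. */
static int compare_ints(const void* a, const void* b)
{
    int x = *(const int*)a;
    int y = *(const int*)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int numbers[] = { 42, 7, 13, 1 };
    /* qsort() owns the sorting loop; compare_ints() decides the order. */
    qsort(numbers, 4, sizeof numbers[0], compare_ints);
    for (int i = 0; i < 4; i++)
        printf("%d ", numbers[i]);
    printf("\n");
    return 0;
}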
Behavioural parameters are a major boon to both concision and modularity. They help you write shorter code by giving you other ways to factor it: write the parts that don't change once, and use behavioural parameters for the parts that do. They also help increase modularity, by letting you push more dependencies into the parameters.
The map function is a prime example of this. It has 2 arguments: a function, and a list. It loops over the list and creates a new one, but leaves the transformation to its functional argument. Concerns are cleanly separated: map does the loop, its functional parameter does the transformation. Compared to writing your own loops manually, this is more modular and more concise.
Mutable state
Of this I have already spoken, here and before. Mutable state is powerful, but dangerous. If you use it wisely, it will give you less code and a faster program. But if you don’t pay attention, it can easily creep everywhere and kill your modularity.
If you can’t avoid mutable state, at least try to hide it behind
referentially transparent interfaces, like I kinda did above with
bubble_wrapper()
.
Parametric polymorphism
Write once, use on everything. See the map function, for instance:
map : (a -> b) -> List a -> List b
map f []        = []
map f (x :: xs) = (f x) :: (map f xs)
This will work on any types a and b. This obviously helps code reuse a lot, letting you write shorter programs. Note that dynamic languages have parametric polymorphism by default. For a long time, this was a big part of their appeal in my opinion.
Less obviously, parametric polymorphism also contributes to modularity. See, when you work on any type, you can’t assume anything about the values of those types. This forces the generic function to ignore the specifics of the generic values, and therefore focus on more general concerns.
Overall, parametric polymorphism is so useful that failing to include it in a modern language is a mistake, plain and simple.
Java should have had generics from the start. Had it looked at ML back in 1995, it could have done so. Maybe it wasn't OOP enough? Good thing it corrected that mistake.
The Go language, arguably for simplicity reasons, chose not to have generics. In 2009. This is reason enough to ignore the language. We have better alternatives.
Modules
At their core, modules are packages in which you put things: functions, constants, data types… Then you export some of those for external use. The interfaces of the exported items combine to form the module's interface. The rest is encapsulated.
Many module systems are more capable, but this is the bulk of it. Also, languages that don't have modules can often emulate them. C for instance uses compilation units: you use the .h file to declare what you want to export, and hide the implementation details in the .c file.
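As a rough sketch (the file names and the checksum function are invented for the example), such a module might look like this:
/* checksum.h -- the interface: this is all the module exports. */
#ifndef CHECKSUM_H
#define CHECKSUM_H
#include <stddef.h>
#include <stdint.h>

uint32_t checksum(const uint8_t* bytes, size_t size);

#endif

/* checksum.c -- the implementation: anything 'static' stays hidden. */
#include "checksum.h"

static uint32_t mix(uint32_t sum, uint8_t byte)   /* internal helper */
{
    return sum * 31 + byte;
}

uint32_t checksum(const uint8_t* bytes, size_t size)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < size; i++)
        sum = mix(sum, bytes[i]);
    return sum;
}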
Modules have little effect on the size of code. They add some administrative overhead, and often help with naming (with namespaces), but that’s about it. (Some module systems are more advanced, but that’s a different topic.)
On the other hand, modules are very good at modularity: since it is not possible to use the internals of a module directly, understanding its interface is enough to use it. Conversely, changing the implementation of a module without touching its interface has no impact on the rest of the code. Just like functions, really. Except modules manage larger chunks of code.
Which is a big help when we want our code to be a sparse dependency graph. Remember dependency graphs from part 1? A cheap graph is a sparse graph, with relatively few edges. This is called “low coupling”.
Modules help you build a sparse graph, under one condition: make sure each module is a neat, self-contained unit of code. Modules should have a small interface, and must depend on few other modules. If you do that, your graph will be very sparse, and easy to visualise as a two-level view: in the large, the whole program can be seen as a graph of modules; in the small, each module is a graph of functions (plus a few outbound edges to other modules).
If your program gets real big (big is bad, but we may not have a choice), you may add another level. Java for instance does this with its package system. Methods are put in classes (classes are the rough equivalent of modules), and those are put in packages. Encapsulation happens at each level (classes hide private methods, and packages hide private classes). That way you still have a relatively neat graph, but with 3 levels instead of just 2.
Some module systems even let you nest modules arbitrarily. But I’m not sure what kind of leviathan would require more than 3 levels.
Compound types
There are many kinds of compound types. Some, such as structs, classes, records, and tuples, are based on the Cartesian product. Others, such as arrays or lists, are more like vectors (though we often have neither scalar product, nor vector addition). We also have more specialised derived types, such as pointers or references. More rarely, we can find sum types, loosely based on the union. Values of those types are generally first class citizens. You can create new ones, pass them to functions, assign them to variables…
Compound types let you express entire concepts with just one value.
Take for instance arrays in C. In my bubble_sort() above, I have represented a vector of integers with a pointer to an integer and a size. They can be put together in the same value:
struct vector {
    int*   arr;
    size_t size;
};
Once I have my new data type, I can use it directly.
void bubble_sort2(struct vector v)
{
    for (uint i = 1; i < v.size; i++)
        for (uint j = 0; j < v.size-i; j++)
            if (v.arr[j] > v.arr[j+1])
                swap(v.arr + j, v.arr + j+1);
}
There, only one argument. That should shorten some function calls.
Compound types won’t make your code graph any sparser, but they will help you make sense of it. They can also help you reduce the amount of code a bit. Then, each compound type has its own uses: structs let you put related data together, pointers give you an indirection layer, sum types give you a taste of dynamic typing while preserving static checks…
But the real power of compound types comes to light when you combine them with modules. Together, they give you abstract data types.
Abstract data types
The idea is simple. You make a module, in which you define some compound type, and some functions that manipulate values of that type. Then you export only those functions. You keep the definition of your compound type hidden.
This gives you complete control over how your new data type will be manipulated. You can enforce arbitrary invariants. You can define what your data type will look like from the outside. You can hide (or show) as much as you want to.
This is a step up from mere modules. Those used to be a set of related functions. An abstract data type is a single concept, and thus easier to comprehend and recall.
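A minimal sketch in C (the names are mine, not an established API): the header exports the type's name and its operations, while the definition, and the invariants it carries, stay in the .c file.
/* vector.h -- callers see the name of the type, but not its definition. */
#include <stddef.h>

struct vector;                                /* opaque: definition hidden */
struct vector* vector_make(size_t size);
int            vector_get(const struct vector* v, size_t i);

/* vector.c -- the definition and its invariants live here, out of reach. */
#include <stdlib.h>

struct vector {
    int*   arr;
    size_t size;
};

struct vector* vector_make(size_t size)
{
    struct vector* v = malloc(sizeof *v);
    v->arr  = calloc(size, sizeof *v->arr);
    v->size = size;
    return v;
}

int vector_get(const struct vector* v, size_t i)
{
    return i < v->size ? v->arr[i] : 0;       /* bounds invariant enforced */
}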
Objects
There are several ways to define objects. One way is to view them as independent entities that can send and receive messages. Personally, I prefer to call such things processes. The Erlang programming language for instance is all about them. Having no experience with these, I won’t discuss them any further.
More often, when we talk about objects, we mean something that groups some data and some functions together. Those functions typically operate on the data of the object. For some reason, we tend to call them “methods”.
Many languages provide special support for objects, giving the impression they're something special. In reality, if you have the Cartesian product (structs, records…) and first class functions, then you have objects.
Which is why objects provide little value over first class functions. They are duals of each other, and isolated closures are generally easier to work with. When you don't have first class functions however, objects are often invaluable.
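Here is a crude sketch of that duality, with a made-up counter example: an "object" is nothing more than a struct holding its data together with a first class function that operates on it.
/* An "object": some data and the function that acts on it, in one value. */
struct counter {
    int sum;                                    /* the data (member)      */
    int (*incr)(struct counter* self, int n);   /* the behaviour (method) */
};

static int counter_incr(struct counter* self, int n)
{
    self->sum += n;
    return self->sum;
}

/* Usage: struct counter c = { 0, counter_incr };
 *        c.incr(&c, 5);   // no global or static state this time */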
Prototypes
A prototype is an object that is used as a template to create other objects. 2 mechanisms are used for this: copy and delegation. When creating a derived object from a prototype, you can copy part of the prototype. The part that isn't copied can still be accessed through a delegation pointer stored in the derived object. When you try to access a member of the derived object and it isn't there, the lookup walks up the delegation pointer to ask the prototype instead.
Prototype based programming has lots of interesting implications, which I have no experience with. I don't know if it helps you write shorter or more modular code, especially compared to other mechanisms. I will just note that prototypes are easily implemented (or emulated) in any dynamic language with first class functions and associative maps.
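Still, here is a very crude sketch of delegation in C (a fixed table of named slots stands in for a proper associative map, and functions could be stored just as well as numbers):
#include <stddef.h>
#include <string.h>

/* A toy prototype-style object: a few named slots, plus a delegation
 * pointer to fall back on when a slot is missing. */
struct object {
    const char*    names[8];
    double         values[8];
    int            count;
    struct object* proto;                      /* delegation pointer */
};

double get_slot(const struct object* obj, const char* name)
{
    for (; obj != NULL; obj = obj->proto)      /* walk up the chain  */
        for (int i = 0; i < obj->count; i++)
            if (strcmp(obj->names[i], name) == 0)
                return obj->values[i];
    return 0;                                  /* slot not found     */
}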
Classes
A class is a data type whose values are objects. In dynamic languages such as Python, classes tend to behave kinda like prototypes. This section is about those found in statically typed languages. Now brace yourselves.
All the talk about grouping data and functions together is a lie.
There is a sharp distinction between a member variable, and a method. The value of a member variable is determined at run time, and can even change over the lifetime of its object —if it’s mutable. Methods on the other hand are always fixed at compile time.
This difference puts members and methods in 2 different realms. When you instantiate an object from a class, only the members are represented in memory. It would be wasteful to represent the methods as well, since they're known at compile time, and the same for every object of the same type.
So, on the one hand you have the equivalent of a struct (or record) containing your data. And on the other hand, you have functions that operate on this struct. And of course, that private & public business.
In other words, classes are abstract data types.
At their core, classes are just like modules. The only significant differences are the presence of a "main type" (the class itself), and some syntax sugar so you can write obj.f() instead of f(obj).
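A hypothetical sketch of what this boils down to, with an invented account class: a struct for the members, ordinary functions for the methods, and the dot syntax papering over the difference.
/* What a one-member, one-method class roughly compiles down to. */
struct account {
    int balance;                       /* member: stored in every object */
};

void account_deposit(struct account* self, int amount)
{                                      /* method: one shared function,   */
    self->balance += amount;           /* the object passed explicitly   */
}

/* obj.deposit(10) is just f(obj) in disguise: account_deposit(&obj, 10); */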
As far as code size and modularity are concerned, classes are just as useful as modules. No more, no less… until we unveil the real power of classes: subtyping.
Subtyping
“Type B is a subtype of type A” means that every value of type B is also of type A. That is, every B is also an A.
Most object oriented languages give you a form of subtyping. For a class B to be a subtype of class A, it must expose a compatible public interface. Generally, this means that for each public method in A, you can find a public method in B that has the same type.
Not necessarily the same method. Just one that has the same type. Its implementation could be different. Its behaviour could be different. Hence the killer feature of C++/Java style OOP: subtype polymorphism.
It goes like this: you use an object of type A (though it might really be of type B). When you call one of its methods, the implementation might come from A, or it might come from B, and behave differently.
Until recently, subtype polymorphism was the only way to parametrise over behaviour in many languages. If there is one advantage of OOP over plain procedural programming, this is it. As I have said already, behavioural parameters can help you make shorter, more modular programs.
On the other hand, subtype polymorphism has no significant advantage over first class functions. Both achieve the same thing, but subtyping is heavier. If you already have first class functions, adding subtyping won't really help.
A note on implementation. With subtype polymorphism, code that manipulates objects of type A cannot always call the methods of A directly. If that A is really a B, we must have a way to call B's methods instead. The obvious way to do this is to really put the methods and the data together in the same package: each object carries its own methods, and we call those instead of some static function somewhere.
But this takes memory, so language implementers came up with the vtable. The idea is to exploit the fact that, as before, methods are defined with the class, and never change. We still have a fixed set of methods for A, B, C, and D. So, what we do is put a single pointer in each object, which will point to the right set of methods depending on its type.
That way, the functions and data are still separated, but the illusion that they are bundled together is much more convincing. (I've gotta admit, it's the illusion that matters.)
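Here is a rough sketch of the trick, with invented types A and B (in this toy version B shares A's data layout and merely swaps the method table):
/* One shared method table per class... */
struct A;
struct A_vtable {
    int (*get)(const struct A* self);
};

/* ...and a single pointer to it in every object. */
struct A {
    const struct A_vtable* vtable;
    int data;
};

static int A_get(const struct A* self) { return self->data; }
static int B_get(const struct A* self) { return self->data * 2; }

static const struct A_vtable A_methods = { A_get };
static const struct A_vtable B_methods = { B_get };

/* The caller only knows it has an A; the vtable picks the right code.
 * An object "of type B" is simply { &B_methods, ... }.                */
int get(const struct A* obj)
{
    return obj->vtable->get(obj);
}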
Inheritance
Inheritance lets you define things in terms of differences. It is mostly used with classes, though it applies to other things, such as context free grammars, or shopping lists.
It is easiest to imagine when the "things" we operate on are lists of stuff. A class is a list of methods and members. A grammar is a list of rules. A shopping list is… well. So we write "this new stuff is just like the old one, except we change this element, and add that element…".
(Just so we’re clear: I’m talking about implementation inheritance. Interface inheritance is the same as subtyping, which we’ve covered already.)
Inheritance used to be marketed as a major boon for code reuse. In reality, it doesn’t work. It doesn’t work for code reuse, because we can achieve the same thing with plain functions —we call that composition; and it doesn’t work for modularity —it breaks it.
Inheritance may work on other things, such as context free grammars and shopping lists, but it doesn’t work with code. Avoid it if you can.
On the other hand, many languages that provide subtyping bundle inheritance with it. If you want a class to be a subtype of another, it must also inherit from it. Try to only inherit from classes that have no implementation: Java has interfaces, C++ has abstract classes…
Static typing
That’s a big one. I’ll make it short.
Static typing is all about compile time information. Basically, it gives you clues about what is going on before you even run the program. This gives you a number of advantages:
Types don’t lie. They document your code in a way the compiler can check. (Type inference may let you hide such documentation, but it is never hard to dig out.)
Types warn you about many stupid mistakes you make in the course of writing programs. This is an even tighter feedback loop than the one you get with a REPL. This speeds up development a little bit.
Types verify, and often prove, many things about your code. That's tests you won't need to write, making your code shorter. Sometimes, these proofs even increase modularity.
On the other hand, static type systems sometimes reject perfectly valid programs. This forces you to work around them, either by using more or less unsafe escape hatches, or by finding some contrived way that obscures your code. Such workarounds can make your code longer and less modular.
Me, I have settled on what I prefer. Some static type systems are too inflexible, not powerful enough, or too unsafe. I'd rather use dynamic typing instead. Some static type systems however are quite good. The variations around System F that we can see in ML languages and Haskell reject very few programs I actually want to write, and often catch subtle mistakes. Given a choice, I'd choose them over dynamic typing any day.
Dynamic typing
The other side of the coin.
Dynamic typing is all about run time type information. It checks the validity of various operations at run time, where it can know for sure whether a given operation makes sense. This has a number of advantages:
Dynamic types will never reject correct programs. And if they are sound (most are), invalid operations will error out instead of triggering some scary undefined behaviour. No need to work around anything here.
Run time type information can be inspected, used, and dispatched over at run time. Some type systems can do that as well, by actually using dynamic typing. Sum types in ML are typical: they add a type tag that you can inspect at run time to deduce the actual type of the underlying data.
Dynamic typing is way easier to implement than a decent static type system. The concepts involved are simpler and more obvious to most programmers. Implementing a good static type system on the other hand requires some background knowledge in type theory, and needs a type checker.
Dynamic typing is more approachable than static typing. This might be important for beginners and non-programmers (who sometimes write code anyway).
Obvious disadvantages are the various overheads associated with run time checks, and the absence of the advantages of static typing. You need to be more disciplined about your documentation, you have to write more tests…
Lazy evaluation
Data often flows from a producer to a consumer. An obvious example is Unix filters, such as grep, which are both producers and consumers. In our quest to increase modularity, we like to separate the consumer and the producer: reading a file and producing a checksum from its contents are 2 separate things, that should stay separate.
The obvious way is writing a procedure that reads the whole file and returns a giant list of bytes. Then you sum over this list. Wait a minute… That is a huge waste of memory!
Indeed. So we have to resort to less obvious choices:
- We could read and perform the sum in the same procedure. So much for modularity.
- We could have the file reader take a callback as a parameter, to be called each time a byte is read. The sum is then performed in the callback. But that's cumbersome. First, the callback must maintain some non-local state, to remember its previous invocation. Second, control flow is dictated by the file reader. We don't want to know about that, we just want to do a sum!
- We could have the file reader read bytes one by one when requested. This gives us a better control flow, but we still leak mutable state from the file reader.
The cleanest solution is lazy evaluation. Instead of producing an ordinary list, you produce a lazy list. To consume the list, you just look at its elements, which will auto-magically be conjured into existence.
Producing the list is a bit more complicated. Instead of directly giving it the values it will hold, you give it a closure that, when evaluated, will yield the part of the list you need. (This is easier when your language has special support for lazy evaluation.)
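Here is a crude sketch of such a lazy list in C, with made-up names (a language with built-in laziness or closures makes this far less noisy):
#include <stdint.h>
#include <stdlib.h>

/* A lazy list: the head is a plain value, the tail is a thunk that
 * conjures the next cell into existence only when asked. */
struct lazy_list {
    int                head;
    struct lazy_list* (*force_tail)(void* env);  /* the closure's code  */
    void*              env;                      /* its captured state  */
};

/* Producer: the endless list n, n+1, n+2, ... Nothing is computed ahead. */
static struct lazy_list* from(void* env)
{
    int n = (int)(intptr_t)env;
    struct lazy_list* cell = malloc(sizeof *cell);
    cell->head       = n;
    cell->force_tail = from;
    cell->env        = (void*)(intptr_t)(n + 1);
    return cell;
}

/* Consumer: sums the first k elements, forcing one cell at a time.
 * (Cells are leaked here; a real implementation would free or cache.) */
static int sum_first(struct lazy_list* list, int k)
{
    int sum = 0;
    while (k-- > 0) {
        sum += list->head;
        list = list->force_tail(list->env);      /* demand the next cell */
    }
    return sum;
}

/* Usage: sum_first(from((void*)(intptr_t)0), 10) sums 0 through 9. */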
Of course, we don’t have to limit ourselves with lists. This can be generalised to all kinds of data. Quite a boon for modularity. You might even want to make it the default, as Haskell does. The consequences are quite far reaching however, and I won’t cover them. Instead, I recommend you take a look at John Hughes’s famous paper, Why Functional Programming Matters.