February 2011

Class-based Programming as Syntactic Sugar

Class based programming in statically typed languages like Java and C++ is syntactic sugar over functions, records, and closures. It means that a good deal of class based programming is naturally done even in C, and the rest of it can easily be emulated in ML.

This tutorial will show how.

Defining class based programming

I basically mean classes, inheritance, and subtype polymorphism, in statically type languages. I do not mean genericity which, while tremendously usefull, is orthogonal to class based programming (It wasn’t implemented in Java for a long time, after all).

Classes

A class is a record. Like with a C struct, an instance of a class (let’s call it “object”) contains a bunch of “data” and a bunch of “methods”. It is implied that a piece of data can be anything but a function. Methods are just functions (whose first argument, the object they come from, is usually implicit).

Typically, the methods are defined when the class itself is defined. The data, on the other hand, is typically defined at instantiation time, and is often modified after that.

Inheritance

This is a way of defining a class in terms of another class. A derived class is identical to the base class it is build from, except that:

It may specify additional data and methods.
It may specify that some methods are different.

There is also the notion of subtyping, but I’ll sweep that under the rug, because subtyping is only interesting for…

subtype polymorphism

Subtype polymorphism is when you use an object of type A, but its actual methods might be those of a derived class of A. This is useful because one often want flexibility about the behaviour (methods) of an object, as well as about its content (data).

abstraction and encapsulation

I didn’t mention them as necessary components for class based programming, but some programmers insist they are. They’re not:

Abstraction is not a fundamental mechanism of class based programming. It is an effect, that arises from the various mechanisms. If abstract data types can be provided by classes, and classes are syntax sugar, then abstract data types don’t go away when we remove the sugar.
Encapsulation mostly refers to the ability to “hide” an object’s data, most often by using the private keyword. This is a mechanism, but a minor one, that can be enforced by convention, as in Python.

Classes alone

We will start by emulating class based programming without inheritance nor subtype polymorphism. Let’s begin with the genuine thing (in C++). First, the interface of an object:

// point.hpp

class Point {
public:
  Point();
  void move(int x, int y);
  void lift();
  void draw(void) const;
private:
  int x_, y_; // This line doesn't belong in the interface
};

Then, its implementation:

// point.cpp
#include "point.hpp"
#include <cstdio>

Point::Point(): x_(0), y_(0) {}

void Point::move(int x, int y) {
  x_ += x;
  y_ += y;
}

void Point::lift() {
  move(1, 0);
}

void Point::draw(void) const {
  std::printf("(%d, %d)\n", x_, y_);
}

Finally, a use case of that object.

// main.cpp
#include point.hpp

int main(int argc, char* argv[])
{
  Point p;      p.draw();
  p.move(4, 2); p.draw();
  p.lift();     p.draw();
  return 0;
}

(This is a just an example, not acceptable production code. I just wanted more evocative names than fooBarBaz.) Now, here’s how we could do the same thing in C. First the interface:

// point.h

typedef struct Point_t Point;

Point * make_point(void);
void move_point(Point* this, int x, int y);
void lift_point(Point* this);
void draw_point(const Point* this);
void unmake_point(Point* this);

Then its implementation:

// point.c
#include "point.h"
#include <stdio.h>
#include <stdlib.h>

struct Point_t { int x, y; };

Point * make_point(void) {
  Point * this = malloc(sizeof(Point));
  p->x = 0;
  p->y = 0;
  return this;
}

void move_point(Point* this, int x, int y) {
  this->x += x;
  this->y += y;
}

void lift_point(Point* this) {
  move_point(this, 1, 0);
}

void draw_point(const Point * this) {
  // Let's pretend it draws the point.
  printf("x = %d; y = %d\n", this->x, this->y);
}

void unmake_point(Point* this) {
  free(this);
}

Finally, a use case.

// main.c
#include "point.h"

int main(int argc, char* argv[])
{
  Point* p = make_point(); draw_point(p);
  move_point(p, 4, 2);     draw_point(p);
  lift_point(p);           draw_point(p);
  unmake_point(p);
  return 0;
}

Note that this is basically how modular programming is naturally done in C. An old example is FILE* in stdio.h. Yet those two listings are very similar. Indeed, most of their differences lie in syntactic sugar:

The methods and the constructor are declared outside the class.
Their names changed to avoid name clashes.
The public and private keywords have been removed, class became struct, and cstdio became stdio.h.
Methods became ordinary functions.

Some of the changes, however, are a bit deeper.

The this pointer became explicit.
The point is allocated on the heap.
The destructor became explicit.
The constructor got an explicit allocation, and lost the initialization list.

ML, thanks to its garbage collection, can yield purer sugar. Here’s how classes could look like if I was to implement them in Ocaml. First the interface:

interface point = {
  move : int  -> int -> unit;
  lift : unit -> unit;
  draw : unit -> unit;
}

Then, the implementation:

class point
{
members:
  x_ : int ref;
  y_ : int ref;

constructors:
  point () = class point (ref 0) (ref 0);

methods:
  move = (fun x y ->
            x_ := !x_ + x;
            y_ := !y_ + y);
  lift = (fun () -> this.move 1 0);
  draw = (fun () -> print_string
            ("("  ^ string_of_int !x_ ^
             ", " ^ string_of_int !y_ ^ ")\n"));
}

Finally, a use case:

(* Main program (happens to be valid Ocaml) *)
let _ =
  let p = point () in
    p.draw ();
    p.move 4 2; p.draw ();
    p.lift ();  p.draw ()

Quick reference for C/C++ programmers who don’t read Ocaml:

unit in Ocaml means void in C.
move : int -> int -> unit; is a function prototype. The C equivalent would be void move(int, int);.
int ref in Ocaml means int * const in C. You can also think of it as a variable that contains an int value.
x_ := !x_ + x; in Ocaml means *x_ = *x + x; in C.
(fun x y -> (2 * x) + y) is a litteral function, in the same way that 42 is a litteral integer. In english it reads “the function which given x and y gives backs 2x + y).
(* This is a comment. *)
let _ = is line noise. (it declares a variable that won’t be used).
let p = point () in is a variable declaration and initialization.

Note that the interface, class, members, constructors, and methods keywords are my own invention. Think of them as “Ocaml++” features.

A fun fact about the way I wrotes the methods, is that methods don’t have to be functions (much like the actual methods in the actual Ocaml classes). I could have added answer: int; in my interface, and answer = 42; in my implementation with no problem.

Now let’s see the de-sugared version, in pure Ocaml. It is so similar to the sugared version above that I think we shouldn’t bother in real programs. First the interface:

type point = {
  move : int  -> int -> unit;
  lift : unit -> unit;
  draw : unit -> unit;
}

Then, the implementation:

let class_point
    (x_ : int ref)
    (y_ : int ref) =
  let rec this = {
    move = (fun x y ->
              x_ := !x_ + x;
              y_ := !y_ + y);
    lift = (fun () -> this.move 1 0);
    draw = (fun () -> print_string
              ("("  ^ string_of_int !x_ ^
               ", " ^ string_of_int !y_ ^ ")\n"));
  } in this

let point () = class_point (ref 0) (ref 0)

The use case doesn’t change (it was valid ocaml already).

So now we have pure sugar. There are still some notable differences with the C++ program, though:

The this pointer is explicit when methods refer to each other (private member variables are referred to directly). Heavier sugar could make it implicit, but this example is easier to follow.
The type of the class is completely separated from the definition of the class. Actually, the definition of the class itself is something like a class constructor.

On the other hand, I got rid of allocation and destruction problems (thanks to Ocaml’s garbage collection), and had my initialization list back (they’re not essential to class based programming, but I like them for other reasons).

Let’s implement polymorphism

We have polymorphism when two classes share the same type, but have different methods. In C++ and Java, one would use subtyping. No need to here.

Here, a class is just a record (a struct in C), and its methods are just fields of the record. There is no real distinction between a method and member variable: they are both values stored in the record. Functions and integers work the same:

(* numbers *)           (* functions *)

type foo = {            type bar {
  v1 : int;               f1 : int -> int;
  v2 : float;             f2 : float -> float;
}                       }

let foo1 = {            let bar1 = {
  v1 = 1                  f1 = (fun x -> x + 1);
  v2 = 1.                 f2 = (fun x -> x +. 1.);
}                       }

let foo2 = {            let bar1 = {
  v1 = 2                  f1 = (fun x -> x + 2);
  v2 = 2.                 f2 = (fun x -> x +. 2.);
}                       }

The same goes for the point “class”. It is very easy to make different objects of type point, but with different methods:

let fancy_point = {
  move = (fun _ _ -> print_string "No move ");
  lift = (fun ()  -> print_string "No lift ");
  draw = (fun ()  -> print_string "No_draw\n");
}

(* Main program (with the fancy point) *)
let _ =
  let p = if true
          then fancy_point ()
          else point()
  in
    p.draw ();
    p.move 4 2; p.draw ();
    p.lift ();  p.draw ()

The bottom line is, first closures give you polymorphism for free. Conversely, closures can be emulated with subtype polymorphism:

class Computer {
public:
  virtual int compute(int) = 0;
}

Closures and subtype polymorphism are kinda equivalent. In both cases, it’s about using functions or procedures without specifying it at compile time. In practice however, closures are simpler: they let you parametrize over behaviour directly. The need to enclose that behaviour in a class tend to break locality of code. That can render functions like map and filter unusable. C++ algorithm library for instance was a usability nightmare untill C++11 added closures.

Let’s implement inheritance

Ocaml has a nice syntax to define a record in terms of another:

let foo3 = { foo1 with
             v2 = 3.;
           }

So, we could just use that for inheritance:

let class_fast_point super =
  let rec this = { super with
                     move = (fun x y ->
                               super.move x y;
                               super.move x y);
                 } in this

let _ =
  let p = class_fast_point (point ()) in
    p.draw ();
    p.move 4 2; p.draw ();
    p.lift ();  p.draw ()

Note that we derive objects (not classes), and note the explicit reference to the base object (super). Anyway, this is a minor point: it is still a matter of taking the same group of methods, except for the ones we explicitly override.

Alas, there is a fatal flaw in my program. When we look at its output, we see:

(0, 0)
(8, 4)
(9, 4)

The first two lines are fine: the point moves twice as fast, as intended. The third line, however, is off. the lift method is defined in terms of the move methods, so if the point moves twice as fast, it ought to lift twice as high. Here, you can see that lift does not “see” that we changed the move method. In other words, our inheritance scheme implements early binding.

Late binding needs more sugar:

type point = {
  move : int  -> int -> unit;
  lift : unit -> unit;
  draw : unit -> unit;
}

type class_point = {
  move_ : class_point -> int  -> int -> unit;
  lift_ : class_point -> unit -> unit;
  draw_ : class_point -> unit -> unit;
}

let rec make_point cl = {
  move = cl.move_ cl;
  lift = cl.lift_ cl;
  draw = cl.draw_ cl;
}

let class_point
    (x_ : int ref)
    (y_ : int ref) =
  {
    move_ = (fun this x y ->
               x_ := !x_ + x;
               y_ := !y_ + y);
    lift_ = (fun this () -> (this.move_ this) 1 0);
    draw_ = (fun this () -> print_string
               ("("  ^ string_of_int !x_ ^
                ", " ^ string_of_int !y_ ^ ")\n"));
  }

let point () = class_point (ref 0) (ref 0)


let fast_point super =
  let rec this = { super with
                     move_ = (fun this x y ->
                               (super.move_ this) x y;
                               (super.move_ this) x y)
                 } in this

let _ =
  let p = make_point (point ()) in
    p.draw ();
    p.move 4 2; p.draw ();
    p.lift ();  p.draw ()

let _ = print_newline ()

let _ =
  let p = make_point (fast_point (point ())) in
    p.draw ();
    p.move 4 2; p.draw ();
    p.lift ();  p.draw ()

The added de-sugaring is not overwhelming:

The name of the methods are changed to avoid name conflicts.
The definition of the methods have this as their first argument.
this.M becomes (this.M_ this)
super.M becomes (super.M_ this)
We generate one class_I type and one make_I function for every I interface.
Every constructor call is embedded in a make_I call (this is not strictly sugar but it’s still light).

I could have lightened the sugar (no underscores, no class_point class, no make_point function), but it would have added sugar in the main program: p.M method calls would become (p.M p). Another way would be using lazy evaluation, but I couldn’t make Ocaml lazy enough.

Anyway, the output of the program now correctly demonstrates late binding:

(0, 0)
(4, 2)
(5, 2)

(0, 0)
(8, 4)
(10, 4)