Easy String Interning

String interning is an optimisation that speeds up string comparisons, which are frequent in compilers and language runtimes. Given my interests, I need this.

I searched the web for ways to implement it, but the only answer I got back was “hash tables”. Hash tables? That’s a bit early for the nuclear option, don’t you think? I bet I can come up with a simpler solution.

So I did.

(Update: In the course of writing a couple benchmarks for my “neat” idea, and… oops. Turns out, hash tables are simple (even hand-rolling your own is no big deal) and they perform really well. Use them.)

The problem

String interning involves 2 fundamental operations:

Put the string in an intern pool, and get back a unique ID (that is, unique to that particular pool). Of course, if I intern the same string twice, I want the same ID.
Query the intern pool with an ID, and get back the corresponding string (if any). This operation is not always necessary: you can often remember the string by other means, if you need it at all.

The solution: tries

A trie is a clever way to store a bunch of strings in memory. (Or any sequence of bytes.) And guess what, they can replace hash tables!

My solution is a variation on tries. Imagine the strings we have to handle only contain the characters ‘a’, ‘b’, ‘c’, and ‘d’. Here is a trie that represents some of these strings:

           ┌───┬───┬───┬───┐
           │ a │ b │ c │ d │
           └───┴─│─┴─│─┴───┘
        ┌────────┘   └────────┐
┌───┬───┼───┬───┐     ┌───┬───┼───┬───┐
│ a │ b │ c │ d │     │ a │ b │ c │ d │
└───┴───┴───┴───┘     └───┴───┴───┴─│─┘
                                    │
                            ┌───┬───┼───┬───┐
                            │ a │ b │ c │ d │
                            └───┴───┴───┴───┘

I wrote the characters for clarity, but I don’t explicitly store them. I only store pointers to children nodes. For the record, this trie represents the following strings: ““,”a”, “b”, “c”, “d”, “ba”, “bb”, “bc”, “bd”, “ca”, “cb”, “cc”, “cd”, “cda”, “cdb”, “cdc”, “cdd”.

As you may have noted, I don’t have a way to exclude some strings from the trie. In this representation, the empty string is always represented, and if I want to represent, say, “ac”, then I have to represent “a”, “b”, “c”,“d”, “aa”, “ab”, and “ad” as well.

Turns out, it doesn’t matter. I don’t need a way to test whether a given string has already been interned. I just want a unique ID. I don’t really care about interning several strings at once. Let’s for instance add the strings “bcd” do our trie:

           ┌───┬───┬───┬───┐
           │ a │ b │ c │ d │
           └───┴─│─┴─│─┴───┘
        ┌────────┘   └────────┐
┌───┬───┼───┬───┐     ┌───┬───┼───┬───┐
│ a │ b │ c │ d │     │ a │ b │ c │ d │
└───┴───┴─│─┴───┘     └───┴───┴───┴─│─┘
          │                         │
  ┌───┬───┼───┬───┐         ┌───┬───┼───┬───┐
  │ a │ b │ c │ d │         │ a │ b │ c │ d │
  └───┴───┴───┴───┘         └───┴───┴───┴───┘

So we have added “bcd”, and, “bca”, and “bcb”, and “bcc”. And that’s okay.

The unique ID

Once I have my trie, getting a unique ID is easy: just take the address of the pointer corresponding to the end of the string. For instance, the string “cc” would be represented by this pointer:

┌──────────────┐
│┌────────────┐│
││ ID of "cc" │─────────────────┐
│└────────────┘│                │
└──────────────┘                │
           ┌───┬───┬───┬───┐    │
           │ a │ b │ c │ d │    │
           └───┴─│─┴─│─┴───┘    │
        ┌────────┘   └────────┐ │
┌───┬───┼───┬───┐     ┌───┬───┼─V─┬───┐
│ a │ b │ c │ d │     │ a │ b │ c │ d │
└───┴───┴─│─┴───┘     └───┴───┴───┴─│─┘
          │                         │
  ┌───┬───┼───┬───┐         ┌───┬───┼───┬───┐
  │ a │ b │ c │ d │         │ a │ b │ c │ d │
  └───┴───┴───┴───┘         └───┴───┴───┴───┘

The empty string just gets an invalid pointer.

Optimising the trie

The scheme above have two problems. First, since it relies on explicit pointers, it is not easy to port to a language that uses garbage collection. Second, allocating all these nodes with a general purpose allocator would be slow.

So we just put it all in an array. Instead of pointers, we get integer indices. In the abstract, it looks like this:

          ┌─────────────────────────┐       ┌───────────────>...
          │                         │       │
┌───┬───┬─│─┬───┐┌───┬───┬───┬───┐┌─V─┬───┬─│─┬───┐┌───┬───┬─
│ a │ b │ c │ d ││ a │ b │ c │ d ││ a │ b │ c │ d ││ a │ b │ ...
└───┴─│─┴───┴───┘└─^─┴───┴───┴─│─┘└───┴───┴───┴───┘└─^─┴───┴─
      │            │           │                     │
      └────────────┘           └─────────────────────┘

The actual indices are:

           ┌──────────────────────────────┐         ┌─────────────>...
           │                              │         │
┌────┬───┬─│─┬────┐┌────┬────┬────┬────┐┌─V──┬────┬─│──┬────┐┌────┬
│ -1 │ 4 │ 8 │ -1 ││ -1 │ -1 │ -1 │ 12 ││ -1 │ -1 │ 16 │ -1 ││ -1 │...
└────┴─│─┴───┴────┘└─^──┴────┴────┴─│──┘└────┴────┴────┴────┘└─^──┴
       │             │              │                          │
       └─────────────┘              └──────────────────────────┘

To allocate a new node, just append 4 elements a the end of the array, and set them all to -1. This might involve reallocating some more memory for the array. You can do this allocation manually, or use a dynamic array. And of course, this new node will have a parent, that you need to update.

Now for each ID, you will have a string:

-1  ->  ""
 0  ->  "a"
 1  ->  "b"
 2  ->  "c"
 3  ->  "d"
 4  ->  "ba"
 5  ->  "bb"
 6  ->  "bc"
 ...

Source code

Here is the full solution in C++:

class Intern {
public:
    Intern() :fwd(1, 0)  {}

    uint add(const char* s)
    {
        uint index = 0;
        char c;
        while ((c = *(s++)) != 0)
        {
            uint block = fwd[index];;
            if (block == 0)
            {
                block = fwd.size();
                fwd.resize(fwd.size() + 256, 0);
                fwd[index] = block;
            }
            index = block + c;
        }
        return index;
    }

private:
    std::vector<uint> fwd;
};

Here I intern strings only, but you could intern any sequence of bytes that way. Note that since a byte can range from 0 to 255, blocs have 256 elements. That much memory, to the point of slowing things down. Instead, we can add a little more complexity by breaking up each byte in several parts (here in 4 parts):

class Intern {
public:
    Intern() :fwd(1, 0)  {}

    uint step(uint index, uint offset)
    {
        uint block = fwd[index];
        if (block == 0)
        {
            block = fwd.size();
            fwd.resize(fwd.size() + 4, 0);
            fwd[index] = block;
        }
        return block + offset;
    }

    uint add(const char* s)
    {
        uint index = 0;
        char c;
        while ((c = *(s++)) != 0)
        {
            index = step(index, c >> 6);
            index = step(index, (c >> 4) & 0x03);
            index = step(index, (c >> 3) & 0x03);
            index = step(index, c & 0x03);
        }
        return index;
    }

private:
    std::vector<uint> fwd;
};

Now the nodes are only 4 elements wide. There’s more bookeeping but that performs better on my machine anyway. I suspect cache locality.

Retrieving strings

As it is, our data structure doesn’t let us look up the string with an ID. There are two ways to do this:

We could maintain an array of strings that is as big as our trie. When we intern a string, we just add it to this array. The empty string doesn’t get stored, though.
Or we can be clever, and just store an array of back pointers. This makes the retrieval slower, but it consumes less memory.

Let’s see the clever solution:

class Intern {
public:
    uint add(const std::string& s)
    {
        uint index = -1;
        uint block = 0;
        for (uint c : s)
        {
            if (block >= fwd.size())
            {
                block = fwd.size();
                fwd.resize(fwd.size() + 256, -1);
                bwd.push_back(index); // back reference
                if (index != (uint)-1)
                    fwd[index] = block;
            }
            index = block + c;
            block = fwd[index];
        }
        return index;
    }

    std::string get(uint index)
    {
        std::string res;
        while (index != (unsigned)-1)
        {
            res.insert(0, 1, index % 256); // TODO: fix that O(N²) insertion
            index = bwd[index / 256];
        }
        return res;
    }

private:
    std::vector<uint> fwd;
    std::vector<uint> bwd;
};

I have added 3 thing:

An array of back references (bwd).
A method to retrieve the string (get())
A line of code to add the back references when I intern a string (bwd.push_back(index);).

Essentially, we represents strings as linked lists of characters, in reverse order. Each block in the trie matches a back reference in the other array. That back reference is the index of the parent string.

Type safety

As you may have noted, my unique IDs aren’t exactly secure. The point of a unique ID is to be unique, and that’s it. The only operation it should support is equality. The rest is just a bug waiting to happen.

In the example above, you can see we don’t check the validity of the index at all. We could. To be valid, the index only have to be -1, or to be less than the size of our trie. It doesn’t even have to match a string we interned explicitly. (Remember: when we intern a string, we often intern many strings along with it, because we don’t bother not to).

If we break up the bytes of our strings, that check becomes a tad more complex. We have to make sure the lower bits of the index match the right pattern, lest we hit a node that doesn’t correspond to the end of a byte.

Another path to safety is to get the type system to help you. In C++ for instance, you can easily hide the ID in an abstract data type, this way:

class ID {
public:
    ID(uint id) :id(id) {}
    bool operator==(ID other) {
        return other.id == id;
    }
    uint get_index() {
        return id;
    }
private:
    uint id;
};

That will help, but you can still screw this up:

You could create an ID outside of the intern pool. That’s easy to avoid though: just don’t use the constructor outside of the intern pool.
You could extract the integer from the ID. There is no good reason to ever do this. But again, easy to avoid: just don’t use the get_index() method outside of the intern pool.

If you don’t trust yourself not to make those mistakes on a busy day, you can be more paranoid:

class ID {
    friend Intern;
public:
    bool operator==(ID other) {
        return other.index == index;
    }
    ID(const ID& source) :index(source.index) {}
private:
    ID(uint index) :index(index) {}
    uint index;
};

Now only the intern pool can access the internals of an ID. There is still one mistake you could make though:

You could get a key from one pool, then use it to retrieve a string on another pool. Unlike the other mistakes, this one is not obvious at all, and worse, I don’t know how to fix this in C++. That said, if you stick to one intern pool (you probably will), you’ll be safe.

If you must solve this problem, you may want to take a look at Haskell’s existential types.