Trie vs Hash Table Deathmatch!

A few days ago, I have submitted my simple string interning implementation on Reddit. I got a few internet points, and 2 major objections to my approach:

Hash tables are actually simple. They can fit in less than 100 lines of code.
I didn’t measure the performance of my code. String interning is an optimisation after all.

If hash tables were the scary data structure I thought they were, performance would have been a secondary concern. They’re not, so I have to measure.

Which I did.

The data

I have used the GitHub Java Corpus. For each project, I have concatenated all .java files together, then used this program to remove everything but the keywords and identifiers. The result is a long list of newline separated identifiers as they appear in the project. Of course, there is a lot of repetition.

The test were performed on some of the smallest projects (a batch of 2000 of them, most of which where only a few Kb worth of source code), and on 50 of the biggest projects. The biggest of all weighted 65Mb, including 85,000 unique identifiers (3 million if you count the repetitions).

The programs

I have tested 4 implementations:

A control in pure C,
a Trie in pure C,
a hand rolled hash table in pure C,
and an std::unordered_map (hash table) wrapper, in C++.

All 4 program have the same main() function. The only difference is which data structure is used to store the strings.

The control program just reads the files. The other 3 put all the identifiers of each project in an intern pool (either a trie based one, or a hash table based one). A new pool is created for each project.

All happen in a single thread.

The protocol

All my timings measure total CPU time (user + system).

Each program was run about half a dozen time against the same 2 batches of data (many small projects, or a few big ones). Timings were very stable and easy to measure, so I just ran the same tests a couple times to check for outliers.

The timings of the “control” program were used to deduce more precise timings for the other 3. By subtracting the time doing IO, we get the time actually spent on adding identifiers to the pool.

The only operation tested here was adding a string to the pool. There is no support for deletion (which hardly matters anyway) or retrieval. (Tries don’t support retrieval by default, hash tables do.)

The rig

Intel Core i5, Debian, i686 (32 bits mode).

The results

Here are the timings (in seconds):

	IO alone	Hash table	Hash table (STL)	Trie
big projects	3.50s	5.76s	12.35s	10.12s
small projects	0.21s	0.30s	0.70s	0.53s

big - control	0	2.26s	8.85s	6.62s
small - control	0	0.09s	0.49s	0.32s

Slowdown (big)		×1	×3.9	×2.9
Slowdown (small)		×1	×5.4	×3.6

(Update: I redid those tests a day later on the same system, and the timings were significantly different. My main points stand, but I’m a bit puzzled.)

My hand rolled hash table was fastest by a wide margin. My trie was significantly slower, and the STL wrapper was the slowest of all.

It seems the terrible performance of the STL can be explained by std::string: this thing hits the general purpose allocator every time a new string is constructed. In this benchmark, that means every time we insert a string, possibly more. Not good for such an inner loop. There are ways to speed things up, but that would complicate the code, and defeat the purpose of leaning on the standard library.

Finally, an important word on memory consumption. I have measured that on the biggest project only:

39Mb worth of identifiers.
3 million identifiers total (39Mb).
85,000 unique identifiers (1.4Mb).

My hand rolled hash table needed 3Mb to process it. My trie, 65Mb.

And that trie didn’t even support the retrieval operation. If it did, it would have taken 80Mb (“clever”, slow retrieval), or even 132Mb (fast retrieval). And of course, the additional bookkeeping would have slowed down even insertions.

Conclusion

Tries are slower than hash tables, and consume a lot more memory. Sure, they’re simple, but then so are hash tables.

Use hash tables.