Trie vs Hash Table Deathmatch!
A few days ago, I submitted my simple string interning implementation to Reddit. I got a few internet points, and 2 major objections to my approach:
- Hash tables are actually simple. They can fit in less than 100 lines of code.
- I didn't measure the performance of my code. String interning is an optimisation after all.
If hash tables were the scary data structure I thought they were, performance would have been a secondary concern. They're not, so I have to measure.
Which I did.
I used the GitHub Java Corpus. For each project, I concatenated all the
.java files together, then used this program
to remove everything but the keywords and identifiers. The result is a
long list of newline separated identifiers as they appear in the
project. Of course, there is a lot of repetition.
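For readers who want to reproduce the input format, here is a minimal sketch of such a filter. It is not the program used for the benchmark: the name `extract_identifiers` is hypothetical, and it makes simplifying assumptions (ASCII identifiers only, no `$`, and no special handling of comments or string literals, so words inside them are emitted too). Keywords come out along with identifiers, since they look the same lexically.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Copy every identifier-like token from `in` to `out`, one per line.
 * Assumption: identifiers match [A-Za-z_][A-Za-z0-9_]* (ASCII only). */
void extract_identifiers(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);
    int c = getc(in);
    while (c != EOF) {
        if (isalpha(c) || c == '_') {
            /* Start of an identifier or keyword: copy it whole. */
            do {
                putc(c, out);
                c = getc(in);
            } while (c != EOF && (isalnum(c) || c == '_'));
            putc('\n', out);
        } else if (isdigit(c)) {
            /* Skip numeric literals whole, so "0x1F" or "10L"
             * do not produce bogus identifiers. */
            do {
                c = getc(in);
            } while (c != EOF && (isalnum(c) || c == '_' || c == '.'));
        } else {
            /* Punctuation, whitespace, operators: drop. */
            c = getc(in);
        }
    }
}
```

Called as `extract_identifiers(stdin, stdout)`, this would turn the concatenated sources of a project into the kind of newline separated list described above.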
The tests were performed on some of the smallest projects (a batch of 2000 of them, most of which were only a few Kb worth of source code), and on 50 of the biggest projects. The biggest of all weighed 65Mb, including 85,000 unique identifiers (3 million if you count the repetitions).
I have tested 4 implementations:
- A control in pure C,
- a Trie in pure C,
- a hand rolled hash table in pure C,
- and an std::unordered_map (hash table) wrapper, in C++.
All 4 programs have the same main() function. The only difference is
which data structure is used to store the strings.
The control program just reads the files. The other 3 put all the identifiers of each project in an intern pool (either a trie based one, or a hash table based one). A new pool is created for each project.
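To give an idea of what a hash table based intern pool can look like, here is a minimal sketch. It is not the hand rolled table that was benchmarked: the names (`pool`, `pool_intern`, and so on), the FNV-1a hash, the linear probing, and the 50% load factor are all assumptions chosen for brevity. Allocation failures simply abort through `assert`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical intern pool: open addressing with linear probing. */
typedef struct {
    char  **slots;  /* NULL means the slot is empty     */
    size_t  cap;    /* number of slots, a power of two  */
    size_t  count;  /* number of unique strings stored  */
} pool;

/* 32-bit FNV-1a hash (an assumption, any decent hash would do). */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

void pool_init(pool *p)
{
    p->cap   = 64;
    p->count = 0;
    p->slots = calloc(p->cap, sizeof *p->slots);
    assert(p->slots != NULL);
}

void pool_free(pool *p)
{
    for (size_t i = 0; i < p->cap; i++) free(p->slots[i]);
    free(p->slots);
}

/* Find the slot holding s, or the empty slot where it belongs. */
static size_t probe(char **slots, size_t cap, const char *s)
{
    size_t i = fnv1a(s) & (cap - 1);
    while (slots[i] != NULL && strcmp(slots[i], s) != 0)
        i = (i + 1) & (cap - 1);
    return i;
}

/* Double the table and re-insert the existing string pointers. */
static void grow(pool *p)
{
    size_t  new_cap   = p->cap * 2;
    char  **new_slots = calloc(new_cap, sizeof *new_slots);
    assert(new_slots != NULL);
    for (size_t i = 0; i < p->cap; i++)
        if (p->slots[i] != NULL)
            new_slots[probe(new_slots, new_cap, p->slots[i])] = p->slots[i];
    free(p->slots);
    p->slots = new_slots;
    p->cap   = new_cap;
}

/* Return the pool's canonical copy of s, adding it if needed. */
const char *pool_intern(pool *p, const char *s)
{
    if ((p->count + 1) * 2 > p->cap) grow(p);  /* keep load <= 50% */
    size_t i = probe(p->slots, p->cap, s);
    if (p->slots[i] == NULL) {
        size_t n = strlen(s) + 1;
        p->slots[i] = malloc(n);
        assert(p->slots[i] != NULL);
        memcpy(p->slots[i], s, n);
        p->count++;
    }
    return p->slots[i];
}
```

Two equal strings interned in the same pool yield the same pointer, so later comparisons are a single pointer comparison. Creating a new pool per project, as the benchmark does, amounts to a `pool_init` / `pool_free` pair around each project.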
Everything happens in a single thread.
All my timings measure total CPU time (user + system).
Each program was run about half a dozen times against the same 2 batches of data (many small projects, or a few big ones). Timings were very stable and easy to measure, so I just ran the same tests a couple of times to check for outliers.
The timings of the "control" program were used to deduce more precise timings for the other 3. By subtracting the time doing IO, we get the time actually spent on adding identifiers to the pool.
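As an illustration of this kind of measurement, here is a sketch of how total CPU time (user + system) can be read from inside a process on a POSIX system, using `getrusage`. This is an assumption about tooling, not a description of how the numbers in this post were obtained (an external timer such as the shell's `time` gives the same kind of figure). The helper names `cpu_seconds` and `insertion_seconds` are hypothetical.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Total CPU time (user + system) used so far by this process,
 * in seconds. */
double cpu_seconds(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_stime.tv_sec
         + (double)(ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1e6;
}

/* Time attributed to the intern pool alone: the total CPU time of a
 * run, minus the CPU time of the IO-only control run on the same
 * batch. */
double insertion_seconds(double total, double control)
{
    return total - control;
}
```

Subtracting the control only makes sense if both runs read exactly the same data, which is why all the programs share the same reading code.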
The only operation tested here was adding a string to the pool. There is no support for deletion (which hardly matters anyway) or retrieval. (Tries don't support retrieval by default, hash tables do.)
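To show what insertion into a trie involves, and why retrieval does not come for free, here is a minimal sketch. It is an assumed design, not necessarily the trie that was benchmarked: each node holds 16 child indices (one per 4-bit half of a byte), all nodes live in one growable array, and the identifier of a string is the index of its last node. The names `trie` and `trie_intern` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical trie: node n owns nodes[16*n .. 16*n + 15].
 * Index 0 is the root; since no node points back to the root,
 * 0 also means "no child". */
typedef struct {
    uint32_t *nodes;
    size_t    count;  /* nodes in use    */
    size_t    cap;    /* nodes allocated */
} trie;

void trie_init(trie *t)
{
    t->cap   = 64;
    t->count = 1;     /* the root */
    t->nodes = calloc(t->cap * 16, sizeof *t->nodes);
    assert(t->nodes != NULL);
}

void trie_free(trie *t) { free(t->nodes); }

/* Follow (or create) the child of `node` for one 4-bit nibble. */
static uint32_t step(trie *t, uint32_t node, unsigned nibble)
{
    uint32_t child = t->nodes[16 * (size_t)node + nibble];
    if (child == 0) {
        if (t->count == t->cap) {
            /* Double the array and zero the new half. */
            size_t    new_cap = t->cap * 2;
            uint32_t *grown   = realloc(t->nodes,
                                        new_cap * 16 * sizeof *grown);
            assert(grown != NULL);
            memset(grown + t->cap * 16, 0, t->cap * 16 * sizeof *grown);
            t->nodes = grown;
            t->cap   = new_cap;
        }
        child = (uint32_t)t->count++;
        t->nodes[16 * (size_t)node + nibble] = child;
    }
    return child;
}

/* Return the identifier of s (the index of its last node),
 * creating nodes along the way if needed. */
uint32_t trie_intern(trie *t, const char *s)
{
    uint32_t node = 0;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        node = step(t, node, c >> 4);   /* high nibble */
        node = step(t, node, c & 15u);  /* low nibble  */
    }
    return node;
}
```

In this sketch nodes only point downwards, so there is no way to get from an identifier back to its string without extra bookkeeping (parent links, or a copy of each string). Each byte not shared with an earlier string also costs up to two 64-byte nodes, which gives a feel for how memory use can grow.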
Test machine: Intel Core i5, Debian, i686 (32-bit mode).
Here are the timings (in seconds):
|Batch||IO alone||Hash table||Hash table (STL)||Trie|
|big - control||0||2.26s||8.85s||6.62s|
|small - control||0||0.09s||0.49s||0.32s|
(Update: I redid those tests a day later on the same system, and the timings were significantly different. My main points stand, but I'm a bit puzzled.)
My hand rolled hash table was fastest by a wide margin. My trie was significantly slower, and the STL wrapper was the slowest of all.
It seems the terrible performance of the STL can be explained by
std::string: this thing hits the general purpose allocator every
time a new string is constructed. In this benchmark, that means every
time we insert a string, possibly more. Not good for such an inner
loop. There are ways to speed things up, but that would complicate
the code, and defeat the purpose of leaning on the standard library.
Finally, an important word on memory consumption. I have measured that on the biggest project only:
- 3 million identifiers in total (39Mb worth).
- 85,000 unique identifiers (1.4Mb).
My hand rolled hash table needed 3Mb to process it. My trie, 65Mb.
And that trie didn't even support the retrieval operation. If it did, it would have taken 80Mb ("clever", slow retrieval), or even 132Mb (fast retrieval). And of course, the additional bookkeeping would have slowed down even insertions.
Tries are slower than hash tables, and consume a lot more memory. Sure, they're simple, but then so are hash tables.
Use hash tables.