August 2017

How I implemented my own crypto

At long last, Monocypher, my crypto library, is done and ready for production. It was more work than expected, and I’ve learned more than I cared to, but we now have a crypto library that could displace Libsodium itself.

This is how I did it.

Why

It all started with a simple complaint: the company I was working for relied on a proprietary file encryption tool to exchange files over email. Without the source code, it was of course impossible to properly audit it, and their technical specs are a bit sketchy to begin with. I didn’t trust them.

This got me interested in this kind of software, and looked around for Free alternatives. I haven’t found anything I really liked. I wanted something small and self-contained. I mostly found big multi-purpose tools. (7-zip for instance).

Maybe I should write my own, which got me interested in crypto libraries. The NaCl family then caught my attention: the primitives were fast, easy to implement, and easy to protect against timing attacks. The following advice did not surprise me:

Use libsodium. Use libsodium. Use libsodium.

I would have heeded it, if I didn’t known about TweetNaCl: its 700 lines of code makes Libsodium look positively bloated (about 30K lines). I figured I could do better, and ignored all the warnings about never implementing your own crypto.

I believe the result is close to the best of both worlds: Monocypher is almost as small as TweetNaCl, not too shabby compared to Libsodium, and easier to use than either. Call me partial, but Monocypher is now my new favourite crypto library.

Timeline

2016-09-04: first commit.
2016-12-20: published Rolling your Own Crypto to justify myself. (Discussions on Reddit and Hacker News.)
2017-01-06: found an error in the Argon2 reference implementation.
2017-01-10: Published a piece on Chacha20’s design.
2017-01-30: Published a piece on Poly1305’s design.
2017-02-13: First Monocypher publication.
2017-03-06: 3 bugs found by external reviewers.
2017-03-16: 1 bug found by an external reviewer.
2017-07-15: optimised EdDSA signatures.
2017-07-18: 1 bug found by Clang sanitisers.
2017-07-18: 1.0 is out! At last, we have a crypto library.
2017-07-23: Optimised symmetric crypto, published 1.0.1.

Lessons learned

Crypto is hard. The stakes can be insanely high, and a single error may break everything. Correctness is not enough, we need confidence. To make things worse, compatibility and performance constraints often force us to use notoriously unsafe languages, such as C or x86 assembly. (Rust isn’t as ubiquitous as C, unfortunately.)

The first time I released Monocypher, I was wildly over-confident:

Monocypher is probably already bug-free.

Yeah, right. Since then five bugs were found, four of them by external reviewers. Three undefined behaviours, one misinterpretation of the spec, and a critical memory error. Yet here I was, thinking my tests vectors and Valgrind were enough. Thomas Ptacek did warn me:

passing test vectors doesn’t make something more trustworthy; it’s table stakes.

I guess I lost the table stakes, then. Still, those bugs taught me a number of lessons regarding proper testing. The first, and I believe most important lesson, is this:

Don’t do it alone. It’s almost impossible not to miss anything, especially if you lack special training like I did. Outside help is crucial.

Bug 1: use of uninitialised value

That bug was caught by Andre Maroneze, using Frama-C. This was the first time I’ve heard of this tool.

To save a line of code, I let a loop go one word past its boundary. That word was never used in the final result, so I figured it was okay. But it was used uninitialised in some intermediate computation, and that’s undefined behaviour. My clever “intentional off-by-one” appeared to work, yet didn’t.

Lesson learned: don’t be clever. Crypto code is complex enough already. C and C++ are unsafe enough already.

Bug 2: left shifting a negative value

Again caught by Andre Maroneze with Frama-C.

Carry propagation on elliptic curve arithmetic often rely on arithmetic shifts to multiply and divide stuff by powers of 2. With unsigned integers, this is perfectly defined behaviour. Signed integers are more complicated:

Overflows are undefined.
Right shifting negative integers is implementation defined. Monocypher, like Libsodium and TweetNaCl, chose to rely on it.
Left shifting negative integers is undefined. WTF?

This bug was a complete surprise. I double checked the spec to make sure the reporter wasn’t on drugs, but he was right. And again, this bug was silent on my platform.

Lesson learned: you don’t know all undefined behaviours. Use tools to catch them, you may have a couple surprises.

Bug 3: critical memory error

This one is pure negligence. I would have caught it myself if I had listen to Thomas Ptacek. It was caught by Mike Pechkin, by comparing Monocypher with another implementation.

My mistake was simple: my test vectors were incomplete. There was that code path that was only hit when we had more than 128 blocks per segment, so I made sure I had a test with lanes longer than 128 blocks.

Big mistake.

Argon2 has four segments per lane. I needed to exceed 512 blocks, not 128. I ended up with a completely untested code path, which read uninitialised memory. Worse, I claimed 100% coverage, yet used no coverage tool to confirm it. I believe this is a firing offence in some settings. Never. Again.

I vowed at this point to end this carelessness, and write an iron-clad test suite before I released Monocypher for real. With better tests vectors, but also comparisons against other implementations such as Libsodium. Oh, and let’s use all the sanitisers I could get my hands on while we’re at it. And I need this Frama-C to confirm the absence of undefined behaviours.

I will implement my own crypto, dammit.

Bug 4: misinterpretation of the specs

This one was again caught by Mike Pechkin, by comparing Monocypher with Libsodium. Simply put, Monocypher and Libsodium yielded different hashes on empty input.

Again, my test vectors were incomplete (no empty input). Again, this was an untested code path. Funnily enough, this incompatibility happened to preserve security. I guess I was lucky.

Lesson learned: test edge cases, such as empty inputs. Obvious in retrospect, but I’ve been stupid enough to forget it, so…

Bug 5: signed integer overflow

This one was sneaky. I wouldn’t have caught it without UBSan.

I was shifting a uint8_t, 24 bits to the left. I failed to realise that integer promotion means this unsigned byte would be converted to a signed integer, and overflow if the byte exceeded 127. (Also, on crazy platforms where integers are smaller than 32 bits, this would never have worked.) An explicit conversion to uint32_t did the trick.

At this point, I was running the various sanitisers just to increase confidence. Since I used Valgrind already, I didn’t expect to actually catch a bug. Good thing I did it anyway.

Lesson learned: Never try anything serious in C or C++ without sanitisers. They’re not just for theatrics, they catch real bugs.

The test suite

The implementation of a cipher suite is almost less important than its test suite. While the implementation does have to be correct, only the test suite can give us any confidence in this correctness.

Monocypher’s test suite comprises 3 parts: test vectors, property based tests, and comparisons against other implementations. There’s also a speed benchmark, but I do not count it as a real test.

The test vectors were taken from official standards, or other sources. When such sources were nonexistent or unsuitable (as was the case for Argon2i), I gathered them myself, using the reference implementation. I did my best to catch as many code paths as I could with the test vectors, but that isn’t enough in some cases. Table stakes, remember?
The property based tests check various invariants. For instance, computing a hash in chunks or all at once should yield the same results. This is sometimes the only way to hit some code paths. Also, property based tests tend to be more thorough than test vectors. Edge cases are easier to hit.
The comparison tests try to be as exhaustive as possible, from empty inputs to several time the size of the block used by the internal round function.

The entire test suite is run normally, under ASan, MSan, UBSan, and Valgrind. Clang’s coverage tool is also used to ascertain coverage, which is near perfect. The few statements that aren’t covered are easily reviewed by hand.

I call this the bare minimum.

To get more serious, I had Frama-C. While it looks very good, it currently generates a lot of false-positives. The TIS-Interpreter however does exactly what I want. It interprets C programs, and warns of any undefined behaviour it can catch —it claims to catch more than the likes of UBSan. Used in conjunction with a coverage tool, we can ensure everything is thoroughly tested. For real, this time.

This should be good enough to call it a crypto library.

Design goals

Monocypher’s goals shifted a bit over time, but here is what I have settled on:

It must be small. Its size must be commensurate with that of TweetNaCl, to facilitate deployment and audits. Audits are especially important, because unlike libsodium and NaCl, I cannot compensate bloat with fame.

Monocypher is a little under 1300 lines. Bigger than TweetNaCl’s 700 lines, but still manageable, and small enough to fit in a single source file (easier to deploy that way).
It must be portable. Many environments are pretty conservative. Making Monocypher compatible with various C and C++ dialects wasn’t very hard, so I made sure it could compile as C99, C11, and C++ (any version).
It must be easy to use. While NaCl and Libsodium are much easier to use than the likes of OpenSSL, they aren’t ideal just yet. I believe Monocypher is a little better still.
It should be fast. Speed matters more than I initially realised. If Monocypher doesn’t run fast enough, many people won’t even look at it. Ideally, it should be as fast as libsodium —though we can’t really compete with optimised assembly.

Monocypher does pretty well there: it is not ridiculous at all next to Libsodium, and much faster than TweetNaCl.
It must allow upgrades. Again, a speed issue. Many people will reject Monocypher if it doesn’t allow them to upgrade to faster implementations themselves. (Who am I to tell them they won’t scale like Google?) The chosen primitives must therefore have efficient implementations somewhere.

Overall, I think I’ve hit a pretty sweet spot.

Choice of primitives

The first thing I have done for Monocypher was choosing primitives. As recommended in my Rolling your Own Crypto article, I have chosen Chacha20, Poly1305, Blake2b, Argon2i, x25519, and EdDSA. Some choices need a bit of clarification, though.

XChacha20

The stream cipher used for authenticated encryption wasn’t Chacha20, or XSalsa20, but XChacha20. The construction is the same as XSalsa20, applied to Chacha20. I have explained it in detail here. It has a bigger nonce than Chacha20 (big enough to be chosen at random), at the cost of a slightly slower initialisation.

This could have been controversial a few months ago, because nobody else seemed to have released it publicly. I had nothing to compare to. Thankfully, Libsodium now have an implementation of its own, and I was able to compare the two. I’m now glad I stuck to my guns.

Blake2b

It has been chosen over the obvious standard (SHA-3, also known as Keccak), for two reasons: Blake2b is faster, and can be reused to implement Argon2i.

Argon2i

It was chosen mainly because it won the Password Hashing Competition. I settled on Argon2i because of its immunity to timing attacks. Argon2d is not implemented, though it could be if there is demand.

To avoid dependencies, Monocypher doesn’t support threads. This limits Argon2 to one lane. While multiple lanes are possible without multiple threads, it would make Argon2 less secure. Better not let the user make such a mistake.

EdDSA

By default, Monocypher does not use the classic Ed25519, but another variant of EdDSA where SHA-512 is replaced by Blake2b. This was to make Monocypher more orthogonal, and avoid the bloat of another hash function. SHA-512 is still supported as an option, but the user will have to go out of its way to get it.

Not everybody agreed with this. Some were afraid this would block their path to more efficient implementations, should they ever need it. I believe their fears are unfounded for two reasons:

I don’t believe such an upgrade will ever be needed: on my laptop, Monocypher can sign almost 7.000 documents per second, on a single thread. Even Let’s Encrypt never goes that far (their biggest day was 15 certificates per second on average).
The upgrade path isn’t blocked anyway. Ed25519-donna provides an insanely fast implementation, which can be used with a custom hash. I used it for my tests. (Had to make sure both implementations agreed.)

API design

The proper design of a crypto API is more delicate than I anticipated. Users will need it for various reasons, and the API must serve them without giving in to bloat or encouraging misuse. There were many details to consider.

Naming & constants

NaCl, and inevitably Libsodium, chose to name many functions after their purpose. I’d rather name them after their nature. There are two reasons for this:

First, changing a primitive (encryption, hash, signature…) breaks compatibility. Old data needs to be converted to the new encryption, re-hashed, or signed again. The API should reflect this, and make such compatibility breaks visible. Besides, it can help making the two versions coexist during the transition.

Second, the user needs to know which primitive is being used. Different primitives offer different safety guarantees and performance characteristics. There is no point in hiding that to the user.

Likewise, there is no need for named constants. First, the size of a key or a hash can influence the design of the final software. One may not handle a 64-byte hash the same way one stores a 16-byte tag. Second, those are not magic numbers that may change at the whim of the requirements. Such change only happens when a primitive changes, which is pretty rare.

For Monocypher, this also simplifies things quite a bit:

x25519 private key: 32 bytes.
x25519 public key: 32 bytes.
EdDSA private key: 32 bytes.
EdDSA public key: 32 bytes.
Chacha20 encryption key: 32 bytes.
Poly1305 authentication key: 32 bytes.

Simply put, keys are 32 bytes long. I’d rather remember that, than looking up 6 long names.

High level API: keep it orthogonal

I thought for a time I’d write functions like crypto_box(), provide combined and detached modes, and overall think of every possible use case. Ultimately though, it bloats the API for little value.

Instead, Monocypher only provides functions that are easy to screw up, such as authenticated encryption. The rest can be left to the user. This makes the API smaller and easier to learn. Contrary to what one might expect, this hardly pushes any complexity up the application code. For instance, one need only 3 functions to exchange a message:

crypto_key_exchange() or crypto_argon2i() to get a session key.
crypto_lock() to encrypt and authenticate the message.
crypto_unlock() to verify then decrypt the encrypted message.

Barely more complex than crypto_box() and crypto_box_open() (or their easy counterparts).

Key pair generation can likewise be simplified. For x25519 key pairs for instance, Libsodium provides crypto_box_keypair() and crypto_box_seed_keypair(). This is one function too many. Monocypher instead only provides crypto_x25519_public_key(), which generates a public key with a private key. The user can handle the generation of the private key, it’s just 32 random bytes.

Low-level API: facilitate incremental processing.

I got this idea from a Blake2b implementation. The idea is to keep track of what we have done so far in a context, to process the data bit by bit. This is especially useful wen processing huge amounts of data. it works like this:

context_t ctx
init  (ctx, parameters);
update(ctx, input1);
update(ctx, input2);
update(ctx, input3);
final (ctx, result);

This is very flexible, yet easy to use correctly. Applying this to Blake2b and Poly1305 was obvious, but even Chacha20 benefited from it: using a context is easier than explicitly changing the counter, and prevents accidental reuse of part of the stream. (One can still change that counter, though.)

Performance

Overall, performance considerations are secondary to ease of use, ease of deployment and source code readability. Still, it doesn’t mean I cannot pluck the low hanging fruits.

curve25519: difficult choices

I wasn’t able to implement curve25519 on my own at the time (it would be hard even now), so I was forced to pick existing code. At a first approximation, this meant TweetNaCl or ref10.

As it stands, ref10 is too big, so my first choice went to TweetNaCl. Then I noticed x25519’s ref10 can be compacted pretty well. Adapting TweetNaCl’s EdDSA to ref10’s field arithmetic was a bit tricky, but I managed.

The one thing I couldn’t do was going full ref10 for EdDSA: it comes with a huge pre-computed table that would have bloated Monocypher beyond my ability to tolerate it. I didn’t like the performance implications, but I had decided to live with them.

Then, just before releasing Monocypher, I had a breakthrough. A few months before, I had tried to compute the EdDSA scalar multiplication in Montgomery space to make it faster. Just convert, multiply, then convert back. There’s a dirty trick to it, but that’s not too bad.

Still, it didn’t work. I scratched my head for a while, asked around, then gave up. Then, when I vented my frustration on Hacker News, someone answered (thank you, pbsd!). I was missing two multiplications by a constant, because I was blindly relying on formulas from Wikipedia, instead of reading the RFC 7748 more carefully.

This doubled the speed of signatures and checks, making them commensurate with Libsodium (2 or 3 times as slow). Now we do have a worthy trade-off.

symmetric crypto: unexpected bottleneck

I discovered this one by accident.

Symmetric crypto (Chacha20 and Blake2b) used to be twice as slow as Libsodium. Poly1305 was slower still. I figured this was because of the assembly routines against which portable C cannot compete.

But when I benchmarked Monocypher against TweetNaCl using the -O3 optimisation level (I used to stop at -O2), I noticed something fishy: in this particular setting, Chacha20 and SHA-512 were significantly slower than TweetNaCl.

Wait a minute, Monocypher doesn’t sacrifice speed, how could it go slower than TweetNaCl? Well… in version 1.0 and before, Monocypher loaded and unloaded data byte by byte. Maybe we could start there?

I didn’t expect much. The bottleneck is obviously the round function, where we scramble the data like crazy. Loading stuff byte by byte cannot cause any significant slowdown.

Or so I thought.

I modified SHA-512’s loading code to load data word by word instead, just to see what would happen. It’s more complex (new edge cases to deal with), but who knows, I might gain a few percents.

“A few”, turned out to be more than fifty.

I couldn’t let it slide, so I applied the same principles to Chacha20, Poly1305, and Blake2b. Now their speed actually competes with Libsodium’s (except Chacha20 on x86-64, but I still doubled its speed). I guess portable C doesn’t have to be slow.

By the way, this most probably explains why Monocypher’s Argon2i is faster than Libsodium’s reference implementation: inefficiencies can be hard to spot. I know I was very lucky to find about the loading code. Without TweetNaCl suddenly becoming faster than Monocypher, I would never have noticed.

Implementation tribulations

I have covered public key crypto already. Chacha20 was easy as pie, Blake2b was fine, and SHA-512 didn’t cause any problems.

This leaves Poly1305 and Argon2i.

Poly1305: scary, but doable

I was originally scared of Poly1305 because of its big-number arithmetic, about which I knew nothing. I’ve also read about hard to detect limb overflows, so I didn’t dare touch it initially.

Nevertheless, I wasn’t satisfied with the obvious inefficiency of TweetNaCl, or the verbosity of Poly1305-donna. I was wondering why it didn’t use 32-bytes limbs, which would simplify everything. So I studied the subject, and figured I’d give 32-bit limbs a try.

32-bit limbs are generally a bad idea, because they don’t allow you to delay carry propagation. As soon as you multiply 2 limbs, you have to propagate the carry upward, or you’ll overflow your 64-bit temporary variables. Data flow becomes less parallel, slowing down out of order CPUs.

Smaller limbs on the other hand let you delay carry propagation. It increase parallelism, which makes everything faster.

Poly1305 however is special. When performing the multiplication, one of its operands has a restricted value: some of its bits are chopped off. So instead of having to multiply two 32-bit integers, we multiply a 32-bit integer and a 28-bit integer. This leaves enough room to delay carry propagation.

This performs pretty well.

Argon2i: from scratch

I didn’t mean to implement Argon2i myself. I first thought I’d scavenge a reference implementation. But the one I found in the RFC draft was hopelessly incomplete, and the reference implementation itself was quite big. Too big in fact to fit in Monocypher. I figured I could do simpler.

And I did. I even managed to run faster than the portable C reference implementation. It wasn’t trivial though. And no, I’m not talking about my blunder. Once discovered, that bug was fixed in less than an hour. I’m talking about something completely outside my control:

There was a bug in the reference implementation.

Test vectors just didn’t work. So I verified, dug in, debugged, looked at intermediary results (thank goodness the reference implementation could generate vectors for those), then finally found the error:

According to the spec, any previously non-overwritten block of the same lane may be referenced. But the implementation made that choice among previous segments that haven’t been overwritten yet, effectively treating the current lane just like the others.

Different reference sets, different results. No surprise there. The real problem was shrinking the reference set by half a segment, on average. This does hurt memory hardness a tiny little bit. The authors have since decided it wasn’t worth breaking compatibility. I agree with them, though I lack the expertise required to have a strong opinion anyway.

There’s something worrying about this bug: I was the first to discover it, in January 2017. According to Khovratovich himself, it was two years old. Now I understand why the authors themselves didn’t find it: unlike me, they didn’t have a reference implementation to compare to.

What I don’t understand is, how come nobody else discovered and reported this bug? If you implement Argon2 from spec, you cannot miss it. Test vectors won’t match, and searching for the cause will naturally lead to this bug. I can draw only one conclusion from this:

I’m the first to independently re-implement Argon2.

This “never implement your own crypto” business went a little too far.