@ Loup's Impossible? Like that would stop me.

January 2017

(Update: thanks Thomas Ptacek for pointing out the origins of Chacha20.)

The design of Chacha20

Chacha20 is a secure, fast, and amazingly simple encryption algorithm. It’s author Daniel J. Bernstein explains it well in his Salsa20 and Chacha20 design papers (which I recommend), but did not dwell on details experts already know. Filling the gap took me a while.

Quick summary: Chacha20 is ARX-based hash function, keyed, running in counter mode. It embodies the idea that one can use a hash function to encrypt data.

Stream ciphers

While Chacha20 is mainly used for encryption, its core is a pseudo-random number generator. The cipher text is obtained by XOR’ing the plain text with a pseudo-random stream:

cipher_text = plain_text XOR chacha_stream(key, nonce)

Provided you never use the same nonce with the same key twice, you can treat that stream as a one time pad. This makes it very simple: unlike block ciphers, you don’t have to worry about padding, and decryption is the same operation as encryption:

plain_text = cipher_text XOR chacha_stream(key, nonce)

Now we just have to get that stream.

The quarter-round

Chacha20 works by scrambling the hell out of 512-bit blocks, using 80 applications of the quarter-round. The quarter round itself scrambles 4 32-bit numbers (that’s a fourth of a block) thus:

a += b;  d ^= a;  d <<<= 16;
c += d;  b ^= c;  b <<<= 12;
a += b;  d ^= a;  d <<<=  8;
c += d;  b ^= c;  b <<<=  7;

In this pseudo-C code, the <<< operator means bit-wise rotation to the left. For instance, 0xaabbccdd <<< 8 yields 0xbbccddaa. x86 has a dedicated instruction for this, and compilers know how to optimise accordingly.

The expert will immediately notice this quarter-round is an ARX based design (ARX stands for, Addition, Rotation, Xor) which despite its simplicity can be made as good as a regular permutation-substitution network.

In any case, do not be fooled by this simplicity. Daniel Bernstein put a lot of thought in this quarter-round, for both performance and scrambling quality. For instance, it works on 4 words at a time to minimise memory access. It is very cache friendly. It also scrambles data a bit better than his earlier Salsa20 quarter-round, because each word is updated twice here, and every word has a chance to influence the three others. This makes Chacha20 a little stronger than Salsa20 in practice.

A complete round

A quarter-round scrambles 4 32-bit words. A 512 block has 16. Now imagine the block is a 4×4 matrix of 32-bit integers:

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

To scramble it all properly, we’re first going to apply 4 quarter-rounds in parallel, column by column:

 0 |  1 |  2 |  3
 4 |  5 |  6 |  7
 8 |  9 | 10 | 11
12 | 13 | 14 | 15

Then 4 more, diagonal by diagonal:

0 \  1 \  2 \  3
 5 \  6 \  7 \  4
 10 \ 11 \  8 \  9
  15 \ 12 \ 13 \ 14

Repeat the process 9 times more, and you get your 20 rounds (10 column rounds and 10 diagonal rounds). Here:

#define ROT_L32(x, n) x = (x << n) | (x >> (32 - n))
#define QUARTERROUND(a, b, c, d)       \
    a += b;  d ^= a;  ROT_L32(d, 16);  \
    c += d;  b ^= c;  ROT_L32(b, 12);  \
    a += b;  d ^= a;  ROT_L32(d,  8);  \
    c += d;  b ^= c;  ROT_L32(b,  7)

for (int i = 0; i < 10; i++) { // 20 rounds, 2 rounds per loop.
    QUARTERROUND(block[0], block[4], block[ 8], block[12]); // column 0
    QUARTERROUND(block[1], block[5], block[ 9], block[13]); // column 1
    QUARTERROUND(block[2], block[6], block[10], block[14]); // column 2
    QUARTERROUND(block[3], block[7], block[11], block[15]); // column 3
    QUARTERROUND(block[0], block[5], block[10], block[15]); // diagonal 1
    QUARTERROUND(block[1], block[6], block[11], block[12]); // diagonal 2
    QUARTERROUND(block[2], block[7], block[ 8], block[13]); // diagonal 3
    QUARTERROUND(block[3], block[4], block[ 9], block[14]); // diagonal 4
}

You may wonder why the second round is not row by row. It’s certainly simpler, right? Sure enough, Salsa20 was designed that way. However, implementers noticed that while column rounds were very easy to vectorise, row rounds were not. So they transformed the Salsa block before each round, so they could have “column” rounds every time more cheaply.

Daniel Bernstein picked up on this, and specified the Chacha rounds so implementers don’t have to do this trick. Now we just rotate 3 rows to transform the diagonal rounds into “column” rounds —like I did in my ASCII drawing. As for the straightforward implementations, they work just as well.

Invertibility

The 20 rounds shown there are not enough. While they do scramble the block beyond recognition, they’re invertible. It seems there are good reasons for this, I’m not sure I understand them. I think this has something to do with conservation of entropy, which should help with keeping your outputs unpredictable.

Anyway, applying the reverse of each operation in the reverse order get you the original block —and with it, the ability to predict the entire stream, and recover the original plain-text.

That won’t do.

Chacha20 prevents this disaster very simply: it adds the original block to the scrambled block, word wise, and use that as a pseudo-random block:

// Save the block in a temporary buffer
uint32_t tmp[16];
for (int i = 0; i < 16; i++) {
    tmp[i] = block[i];
}

// Scramble the block
// [20 rounds]

// Add the unscrambled block to the scrambled block
for (int i = 0; i < 16; i++) {
    block[i] += tmp[i];
}

The idea is, if you don’t know the contents of the block to begin with, you won’t be able to reverse the computation —I think we can prove this. Chacha20 relies on a weaker gamble however: we hope the attacker cannot reverse the computation if he doesn’t know half the block. (This conjecture is expected to hold for the foreseeable future, don’t worry.)

The Chacha block

The structure of the unencrypted Chacha block ensures just that: half the space there is occupied by a random secret key. Here:

block[ 0]: "expa"    block[ 8]: key[4]
block[ 1]: "nd 3"    block[ 9]: key[5]
block[ 2]: "2-by"    block[10]: key[6]
block[ 3]: "te k"    block[11]: key[7]
block[ 4]: key[0]    block[12]: counter[0]
block[ 5]: key[1]    block[13]: counter[1]
block[ 6]: key[2]    block[14]: nonce[0]
block[ 7]: key[3]    block[15]: nonce[1]

Besides the key, a quarter of the block is occupied by the nonce and counter (that the attacker may control), and a quarter is filled with a constant:

"expand 32-byte k"

At a first glance, this constant doesn’t do much. The first time I looked at it, it felt like wasted space. We could use this as extra nonce or counter space! The benefits would be obvious: if used as a nonce, we get a 192-bit nonce, big enough to be selected at random without fear of collision. If used as a counter, the size of the message would become virtually unlimited (though they arguably are already). Let’s invent our own crypto, shall we? It’s just a little tweak, it couldn’t harm right?

Wrong.

The constant has a very important purpose: reducing the amount of attacker controlled input. If this was a nonce or a counter, the attacker would control half the block. With the constant, he only gets a quarter. Reducing the attack surface this way has been shown to increase security in the past.

Another benefit is, it prevents the all-zero block. When you scramble an all-zero block you get an all-zero block. Take a look at the quarter round, this is easy to check. Anyway, an all-zero stream isn’t very effective at hiding your data.

Finally, this constant has some “asymmetry” (I’m not sure what that means) that should help the scrambling. And it’s readable ASCII text, so you can be pretty sure there’s no back door in there.

A bigger nonce

Okay, we need that constant. Still, that bigger nonce would have been a good thing, especially for stateless systems: if you can’t chose a random nonce, you kinda have to keep track of previously nonces, and that’s not always practical, or even possible.

I know of 2 approaches. One comes from the IETF, in the rfc7539. You basically use 3 words for the nonce, and one for the counter:

block[ 0]: "expa"    block[ 8]: key[4]
block[ 1]: "nd 3"    block[ 9]: key[5]
block[ 2]: "2-by"    block[10]: key[6]
block[ 3]: "te k"    block[11]: key[7]
block[ 4]: key[0]    block[12]: counter
block[ 5]: key[1]    block[13]: nonce[0]
block[ 6]: key[2]    block[14]: nonce[1]
block[ 7]: key[3]    block[15]: nonce[2]

This makes it easier to deal with nonces, and even allows wasting a bunch of them. Still, random nonces aren’t very safe yet, and now the message length limitation starts getting real. Though not many people send over 128Gb messages over the wire.

Still, we can do even better…

A much bigger nonce: XChacha20

This scheme allows a 192-bit nonce. Now that can be picked at random. And it does so without sacrificing either the counter or the constant. The attacker still won’t control more than a fourth of your block.

The idea is to use a cascading scheme. First, we initialise one block almost as usual.

block[ 0]: "expa"    block[ 8]: key[4]
block[ 1]: "nd 3"    block[ 9]: key[5]
block[ 2]: "2-by"    block[10]: key[6]
block[ 3]: "te k"    block[11]: key[7]
block[ 4]: key[0]    block[12]: nonce[0]
block[ 5]: key[1]    block[13]: nonce[1]
block[ 6]: key[2]    block[14]: nonce[2]
block[ 7]: key[3]    block[15]: nonce[3]

Yep, no counter. We won’t need it because we’ll be using this block only once: to construct the real initial block. First, we will apply the 20 rounds on the block, without the addition trick.

kcolb[ 0]: stuff    kcolb[ 8]: stuff
kcolb[ 1]: stuff    kcolb[ 9]: stuff
kcolb[ 2]: stuff    kcolb[10]: stuff
kcolb[ 3]: stuff    kcolb[11]: stuff
kcolb[ 4]: stuff    kcolb[12]: stuff
kcolb[ 5]: stuff    kcolb[13]: stuff
kcolb[ 6]: stuff    kcolb[14]: stuff
kcolb[ 7]: stuff    kcolb[15]: stuff

One might object to this reversible computation. But we don’t have to fear. The addition trick works because it makes sure the attacker doesn’t know half the block. Adding a random secret key works, but so does not revealing half the block at all.

Picture an attacker trying to reverse a regular Chacha20 computation. He may try to subtract the initial block from the scrambled result —if he can do that, we’re screwed. With the first 4 words (constant), and the last 4 (nonce and counter), he can, because he knows them. Words 4 to 11 however contain the secret key. The attacker doesn’t have that, so he’s stuck.

We’ll use the information the attacker could have deduced anyway, and throw away the rest. This is just as secure as the addition trick, and we can prove it. (Such a proof is called a security reduction: if Chacha20 is secure, then so is this trick.)

The words the attacker could have deduced are the first 4 (previously the constant), and the last 4 (previously the nonce). We’ll use them as a new pseudo-random key to construct a regular Chacha Block:

block'[ 0]: "expa"      block'[ 8]: kcolb[12]
block'[ 1]: "nd 3"      block'[ 9]: kcolb[13]
block'[ 2]: "2-by"      block'[10]: kcolb[14]
block'[ 3]: "te k"      block'[11]: kcolb[15]
block'[ 4]: kcolb[0]    block'[12]: counter[0]
block'[ 5]: kcolb[1]    block'[13]: counter[1]
block'[ 6]: kcolb[2]    block'[14]: nonce[4]
block'[ 7]: kcolb[3]    block'[15]: nonce[5]

One may argue that we could have used whatever half of the block we wanted —it is most probably secure. Alas, doing so would void the security reduction, so… don’t.

Another alternative would be to rely on the regular Chacha20 computation, with the addition trick. It’s just as secure, and simpler since we need to code that computation anyway. It would be a bit less efficient however, so XChacha20 is easier to sell.

That said, those extra rounds still make XChacha20 less efficient than plain Chacha20. If a 64-bit nonce is enough, you may prefer Chacha20.

Counter mode

So far, we only generated one Chacha block, and we don’t even know what to put in the counter. It’s actually pretty simple: the first unencrypted block gets a counter initialised to 0:

block'[ 0]: "expa"    block'[ 8]: key[4]
block'[ 1]: "nd 3"    block'[ 9]: key[5]
block'[ 2]: "2-by"    block'[10]: key[6]
block'[ 3]: "te k"    block'[11]: key[7]
block'[ 4]: key[0]    block'[12]: 0
block'[ 5]: key[1]    block'[13]: 0
block'[ 6]: key[2]    block'[14]: nonce[0]
block'[ 7]: key[3]    block'[15]: nonce[1]

The second one will have an incremented counter:

block'[ 0]: "expa"    block'[ 8]: key[4]
block'[ 1]: "nd 3"    block'[ 9]: key[5]
block'[ 2]: "2-by"    block'[10]: key[6]
block'[ 3]: "te k"    block'[11]: key[7]
block'[ 4]: key[0]    block'[12]: 1
block'[ 5]: key[1]    block'[13]: 0
block'[ 6]: key[2]    block'[14]: nonce[0]
block'[ 7]: key[3]    block'[15]: nonce[1]

And so on. Each block correspond to 64 bytes of the message. For each 64 byte chunk, scramble the right block, and XOR the result with the message. And that’s pretty much it.

Yes, the difference between 2 blocks is often only one bit. But after the scrambling, that single bit will have wrecked so much havoc it won’t matter: the scrambled blocks will look unrelated to the attacker.

Source code

Just look at Monocypher.