January 2017

(Update: thanks Thomas Ptacek for pointing out the origins of Chacha20.)

# The design of Chacha20

Chacha20 is a secure, fast, and amazingly simple encryption algorithm. It's author Daniel J. Bernstein explains it well in his Salsa20 and Chacha20 design papers (which I recommend), but did not dwell on details experts already know. Filling the gap took me a while.

Quick summary: Chacha20 is ARX-based hash function, keyed, running in counter mode. It embodies the idea that one can use a hash function to encrypt data.

## Stream ciphers

While Chacha20 is mainly used for encryption, its core is a pseudo-random number generator. The cipher text is obtained by XOR'ing the plain text with a pseudo-random stream:

```
cipher_text = plain_text XOR chacha_stream(key, nonce)
```

Provided you never use the same nonce with the same key twice, you can treat that stream as a one time pad. This makes it very simple: unlike block ciphers, you don't have to worry about padding, and decryption is the same operation as encryption:

```
plain_text = cipher_text XOR chacha_stream(key, nonce)
```

Now we just have to get that stream.

## The quarter-round

Chacha20 works by scrambling the hell out of 512-bit blocks, using 80
applications of the *quarter-round*. The quarter round itself
scrambles 4 32-bit numbers (that's a fourth of a block) thus:

```
a += b; d ^= a; d <<<= 16;
c += d; b ^= c; b <<<= 12;
a += b; d ^= a; d <<<= 8;
c += d; b ^= c; b <<<= 7;
```

*In this pseudo-C code, the <<< operator means bit-wise rotation to
the left. For instance, 0xaabbccdd <<< 8 yields 0xbbccddaa. x86
has a dedicated instruction for this, and compilers know how to
optimise accordingly.*

The expert will immediately notice this quarter-round is an ARX based design (ARX stands for, Addition, Rotation, Xor) which despite its simplicity can be made as good as a regular permutation-substitution network.

In any case, do not be fooled by this simplicity. Daniel Bernstein
put a lot of thought in this quarter-round, for both performance and
scrambling quality. For instance, it works on 4 words at a time to
minimise memory access. It is very cache friendly. It also scrambles
data a bit better than his earlier Salsa20 quarter-round, because each
word is updated *twice* here, and every word has a chance to influence
the three others. This makes Chacha20 a little stronger than Salsa20
in practice.

## A complete round

A quarter-round scrambles 4 32-bit words. A 512 block has 16. Now imagine the block is a 4×4 matrix of 32-bit integers:

```
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
```

To scramble it all properly, we're first going to apply 4 quarter-rounds in parallel, column by column:

```
0 | 1 | 2 | 3
4 | 5 | 6 | 7
8 | 9 | 10 | 11
12 | 13 | 14 | 15
```

Then 4 more, diagonal by diagonal:

```
0 \ 1 \ 2 \ 3
5 \ 6 \ 7 \ 4
10 \ 11 \ 8 \ 9
15 \ 12 \ 13 \ 14
```

Repeat the process 9 times more, and you get your 20 rounds (10 column rounds and 10 diagonal rounds). Here:

```
#define ROT_L32(x, n) x = (x << n) | (x >> (32 - n))
#define QUARTERROUND(a, b, c, d) \
a += b; d ^= a; ROT_L32(d, 16); \
c += d; b ^= c; ROT_L32(b, 12); \
a += b; d ^= a; ROT_L32(d, 8); \
c += d; b ^= c; ROT_L32(b, 7)
for (int i = 0; i < 10; i++) { // 20 rounds, 2 rounds per loop.
QUARTERROUND(block[0], block[4], block[ 8], block[12]); // column 0
QUARTERROUND(block[1], block[5], block[ 9], block[13]); // column 1
QUARTERROUND(block[2], block[6], block[10], block[14]); // column 2
QUARTERROUND(block[3], block[7], block[11], block[15]); // column 3
QUARTERROUND(block[0], block[5], block[10], block[15]); // diagonal 1
QUARTERROUND(block[1], block[6], block[11], block[12]); // diagonal 2
QUARTERROUND(block[2], block[7], block[ 8], block[13]); // diagonal 3
QUARTERROUND(block[3], block[4], block[ 9], block[14]); // diagonal 4
}
```

You may wonder why the second round is not row by row. It's certainly simpler, right? Sure enough, Salsa20 was designed that way. However, implementers noticed that while column rounds were very easy to vectorise, row rounds were not. So they transformed the Salsa block before each round, so they could have "column" rounds every time more cheaply.

Daniel Bernstein picked up on this, and specified the Chacha rounds so implementers don't have to do this trick. Now we just rotate 3 rows to transform the diagonal rounds into "column" rounds —like I did in my ASCII drawing. As for the straightforward implementations, they work just as well.

## Invertibility

The 20 rounds shown there are not enough. While they do scramble the
block beyond recognition, they're *invertible*. It seems there are
good reasons for this, I'm not sure I understand them. I *think* this
has something to do with conservation of entropy, which should help
with keeping your outputs unpredictable.

Anyway, applying the reverse of each operation in the reverse order get you the original block —and with it, the ability to predict the entire stream, and recover the original plain-text.

That won't do.

Chacha20 prevents this disaster very simply: it adds the original
block to the scrambled block, word wise, and use *that* as a
pseudo-random block:

```
// Save the block in a temporary buffer
uint32_t tmp[16];
for (int i = 0; i < 16; i++) {
tmp[i] = block[i];
}
// Scramble the block
// [20 rounds]
// Add the unscrambled block to the scrambled block
for (int i = 0; i < 16; i++) {
block[i] += tmp[i];
}
```

The idea is, if you don't know the contents of the block to begin
with, you won't be able to reverse the computation —I think we can
prove this. Chacha20 relies on a weaker gamble however: we hope the
attacker cannot reverse the computation if he doesn't know *half* the
block. (This conjecture is expected to hold for the foreseeable
future, don't worry.)

## The Chacha block

The structure of the unencrypted Chacha block ensures just that: half the space there is occupied by a random secret key. Here:

```
block[ 0]: "expa" block[ 8]: key[4]
block[ 1]: "nd 3" block[ 9]: key[5]
block[ 2]: "2-by" block[10]: key[6]
block[ 3]: "te k" block[11]: key[7]
block[ 4]: key[0] block[12]: counter[0]
block[ 5]: key[1] block[13]: counter[1]
block[ 6]: key[2] block[14]: nonce[0]
block[ 7]: key[3] block[15]: nonce[1]
```

Besides the key, a quarter of the block is occupied by the nonce and counter (that the attacker may control), and a quarter is filled with a constant:

```
"expand 32-byte k"
```

At a first glance, this constant doesn't do much. The first time I looked at it, it felt like wasted space. We could use this as extra nonce or counter space! The benefits would be obvious: if used as a nonce, we get a 192-bit nonce, big enough to be selected at random without fear of collision. If used as a counter, the size of the message would become virtually unlimited (though they arguably are already). Let's invent our own crypto, shall we? It's just a little tweak, it couldn't harm right?

Wrong.

The constant has a very important purpose: reducing the amount of attacker controlled input. If this was a nonce or a counter, the attacker would control half the block. With the constant, he only gets a quarter. Reducing the attack surface this way has been shown to increase security in the past.

Another benefit is, it prevents the all-zero block. When you scramble an all-zero block you get an all-zero block. Take a look at the quarter round, this is easy to check. Anyway, an all-zero stream isn't very effective at hiding your data.

Finally, this constant has some "asymmetry" (I'm not sure what that means) that should help the scrambling. And it's readable ASCII text, so you can be pretty sure there's no back door in there.

## A bigger nonce

Okay, we need that constant. Still, that bigger nonce would have been a good thing, especially for stateless systems: if you can't chose a random nonce, you kinda have to keep track of previously nonces, and that's not always practical, or even possible.

I know of 2 approaches. One comes from the IETF, in the rfc7539. You basically use 3 words for the nonce, and one for the counter:

```
block[ 0]: "expa" block[ 8]: key[4]
block[ 1]: "nd 3" block[ 9]: key[5]
block[ 2]: "2-by" block[10]: key[6]
block[ 3]: "te k" block[11]: key[7]
block[ 4]: key[0] block[12]: counter
block[ 5]: key[1] block[13]: nonce[0]
block[ 6]: key[2] block[14]: nonce[1]
block[ 7]: key[3] block[15]: nonce[2]
```

This makes it easier to deal with nonces, and even allows wasting a bunch of them. Still, random nonces aren't very safe yet, and now the message length limitation starts getting real. Though not many people send over 128Gb messages over the wire.

Still, we can do even better…

## A much bigger nonce: XChacha20

This scheme allows a 192-bit nonce. Now *that* can be picked at
random. And it does so without sacrificing either the counter or the
constant. The attacker still won't control more than a fourth of your
block.

The idea is to use a cascading scheme. First, we initialise one block
*almost* as usual.

```
block[ 0]: "expa" block[ 8]: key[4]
block[ 1]: "nd 3" block[ 9]: key[5]
block[ 2]: "2-by" block[10]: key[6]
block[ 3]: "te k" block[11]: key[7]
block[ 4]: key[0] block[12]: nonce[0]
block[ 5]: key[1] block[13]: nonce[1]
block[ 6]: key[2] block[14]: nonce[2]
block[ 7]: key[3] block[15]: nonce[3]
```

Yep, no counter. We won't need it because we'll be using this block
only once: to construct the *real* initial block. First, we will
apply the 20 rounds on the block, *without* the addition trick.

```
kcolb[ 0]: stuff kcolb[ 8]: stuff
kcolb[ 1]: stuff kcolb[ 9]: stuff
kcolb[ 2]: stuff kcolb[10]: stuff
kcolb[ 3]: stuff kcolb[11]: stuff
kcolb[ 4]: stuff kcolb[12]: stuff
kcolb[ 5]: stuff kcolb[13]: stuff
kcolb[ 6]: stuff kcolb[14]: stuff
kcolb[ 7]: stuff kcolb[15]: stuff
```

One might object to this reversible computation. But we don't have to fear. The addition trick works because it makes sure the attacker doesn't know half the block. Adding a random secret key works, but so does not revealing half the block at all.

Picture an attacker trying to reverse a regular Chacha20 computation. He may try to subtract the initial block from the scrambled result —if he can do that, we're screwed. With the first 4 words (constant), and the last 4 (nonce and counter), he can, because he knows them. Words 4 to 11 however contain the secret key. The attacker doesn't have that, so he's stuck.

We'll use the information the attacker could have deduced anyway, and throw away the rest. This is just as secure as the addition trick, and we can prove it. (Such a proof is called a security reduction: if Chacha20 is secure, then so is this trick.)

The words the attacker could have deduced are the first 4 (previously the constant), and the last 4 (previously the nonce). We'll use them as a new pseudo-random key to construct a regular Chacha Block:

```
block'[ 0]: "expa" block'[ 8]: kcolb[12]
block'[ 1]: "nd 3" block'[ 9]: kcolb[13]
block'[ 2]: "2-by" block'[10]: kcolb[14]
block'[ 3]: "te k" block'[11]: kcolb[15]
block'[ 4]: kcolb[0] block'[12]: counter[0]
block'[ 5]: kcolb[1] block'[13]: counter[1]
block'[ 6]: kcolb[2] block'[14]: nonce[4]
block'[ 7]: kcolb[3] block'[15]: nonce[5]
```

One may argue that we could have used whatever half of the block we wanted —it is most probably secure. Alas, doing so would void the security reduction, so… don't.

Another alternative would be to rely on the regular Chacha20
computation, *with* the addition trick. It's just as secure, and
simpler since we need to code that computation anyway. It would be a
bit less efficient however, so XChacha20 is easier to sell.

That said, those extra rounds still make XChacha20 less efficient than plain Chacha20. If a 64-bit nonce is enough, you may prefer Chacha20.

## Counter mode

So far, we only generated *one* Chacha block, and we don't even know
what to put in the counter. It's actually pretty simple: the first
unencrypted block gets a counter initialised to 0:

```
block'[ 0]: "expa" block'[ 8]: key[4]
block'[ 1]: "nd 3" block'[ 9]: key[5]
block'[ 2]: "2-by" block'[10]: key[6]
block'[ 3]: "te k" block'[11]: key[7]
block'[ 4]: key[0] block'[12]: 0
block'[ 5]: key[1] block'[13]: 0
block'[ 6]: key[2] block'[14]: nonce[0]
block'[ 7]: key[3] block'[15]: nonce[1]
```

The second one will have an incremented counter:

```
block'[ 0]: "expa" block'[ 8]: key[4]
block'[ 1]: "nd 3" block'[ 9]: key[5]
block'[ 2]: "2-by" block'[10]: key[6]
block'[ 3]: "te k" block'[11]: key[7]
block'[ 4]: key[0] block'[12]: 1
block'[ 5]: key[1] block'[13]: 0
block'[ 6]: key[2] block'[14]: nonce[0]
block'[ 7]: key[3] block'[15]: nonce[1]
```

And so on. Each block correspond to 64 bytes of the message. For each 64 byte chunk, scramble the right block, and XOR the result with the message. And that's pretty much it.

Yes, the difference between 2 blocks is often only one bit. But after the scrambling, that single bit will have wrecked so much havoc it won't matter: the scrambled blocks will look unrelated to the attacker.

## Source code

Just look at Monocypher.