January 2017

# The design of Poly1305

Poly1305 is a fast, provably secure, and surprisingly simple one time
authenticator. Its author, Daniel J. Bernstein explains it well in
his paper, *if* you're already an expert. The rest of us is kinda
left in the dust.

## Security guarantee

Poly1305 generates a Message Authentication Code (MAC) from a message and a one time key. In C, it would look like this:

```
void poly1305(uint8_t mac[16],
const uint8_t *message,
size_t message_size ,
const uint8_t key[32])
```

Assuming the key is a uniform, *independent* number unknown to the
attacker, the probability of forgery is close enough to zero to be
considered flat out impossible. The problem is coming up with such a
secret. Having to generate a whole new 256-bit shared secret *for
each message* is a bit unwieldy. Which is why in practice, we don't.

Instead, we derive the key from a longer-term key and and nonce with a symmetric cipher —usually that same cipher that provides encryption. In this context, you only have to generate a random shared secret once. You still need a unique nonce per message, but that nonce doesn't have to be random nor secret. To break this construction, you have to break the underlying cipher.

## Polynomial evaluation MACs

The idea seems to be derived from Carter & Wegman *universal hashing*.
We treat the message as a series of numbers `m1`

, `m2`

, `m3`

… and use
those numbers and the secrets to compute a series of additions and
multiplications to generate the MAC.

There are many variations on this theme, though most are slow or require long keys. The variation we're interested in is a high-degree polynomial over a prime field.

The idea is to evaluate your polynomial over ℤp (All natural numbers
between 0 and p-1), for some big prime number p. You basically need 2
secrets, `r`

and `k`

. You then compute your high-degree polynomial
with your message and `r`

, and finally add your secret key `k`

to the
result. Assuming your message has 4 parts, you would compute

```
mac = (m1 r⁴ + m2 r³ + m3 r² + m4 r + k) mod p
```

To limit the number of multiplications, you can use Horner's rule:

```
mac = (r (r (r (r m1 + m2) + m3) + m4) + k) mod p
```

Which is easy to compute iteratively (here in pseudo C):

```
h = 0;
h += m1; h *= r;
h += m2; h *= r;
h += m3; h *= r;
h += m4; h *= r;
mac = (h + k) % p;
```

Now we can hash an n-chunk message with n multiplications and n+1 additions. Should be a snap, right?

Not quite. First, we are dealing with relatively big numbers here. The message chunks won't fit in your vanilla 32-bit register, and neither will the prime p. To get enough security, you want something closer to 128 bits.

Second, If we keep adding and multiplying numbers without thinking,
we would have to handle bigger and bigger numbers as we iterate. The
last value of `h`

would be as big as the message itself! So we need
to keep reducing those numbers modulo p, thus:

```
h = 0;
h += m1; h *= r; h %= p;
h += m2; h *= r; h %= p;
h += m3; h *= r; h %= p;
h += m4; h *= r; h %= p;
mac = (h + k) % p;
```

But if we do *that*, we end up performing n *divisions* as well as n
multiplications, and dividing big numbers is bloody expensive.

## Cheating at modular arithmetic

*Learn this one weird trick to modular multiplication. Cryptographers
don't want you to know!*

Modular addition is easy: as long as the operands are both less than p, the total has to be less than 2p. So we just test if the sum is greater than p, and subtract p if it is.

Multiplication is more problematic: even if both operands are between
0 and p-1, their product can go up to p *squared*. We're not getting
away with successive subtractions, especially if p is big.

Turns out, you can cheat if you select the right prime: just make sure
it is of the form 2^n - q, where q is a small number (often less than
10). 2^{130} - 5 is a perfect candidate for this —a big reason why
Daniel Bernstein chose this particular prime.

To understand this trick, we need to go back to kindergarten. Here, we will work in base 10, and chose a prime just below a power of 10: 9973 (10⁴ - 27), and perform modulo arithmetic over that. The structure of such a multiplication looks like this:

```
8 7 6 5
× 4 3 2 1
-------------
8 7 6 5 8765 × 1
+ 16 14 12 10 . 8765 × 2 × 10
+ 24 21 18 15 . . 8765 × 3 × 100
+ 32 28 24 20 . . . 8765 × 4 × 1000
--------------------
32 52 61 60 34 16 5 before carry propagation
3 7 8 7 3 5 6 5 after carry propagation
```

It may be a little different from what you're used to, since I did not compute the carry until the very end. It's a bit harder to do by hand, and a bit easier to program. This is sometimes called the Comba multiplication.

We're not interested in 37873565, though. We want 37873565 *modulo
9973*. Here's the trick:

```
10,000 mod 9973 = 27
```

Which means, multiplying by 10,000 is the same as multiplying by 27! Now watch me perform the modular reduction using only additions and multiplications. No hard division in there:

```
p = 9973
37873565 mod p = (3787 × 10,000 + 3565) mod p
= (3787 × 27 + 3565) mod p
= (105814 ) mod p
```

Once more:

```
105814 mod p = (10 × 10,000 + 5814) mod p
= (10 × 27 + 5814) mod p
= (270 + 5814) mod p
= 6084
```

6084 is smaller than 9973, so 37873565 mod 9973 = 6084.

Once we understand the trick, we can use it in to simplify the schoolbook multiplication. First, we need to separate the numbers on the left from the numbers on the right:

```
8 7 6 5
× 4 3 2 1
-------------
| 8 7 6 5
16 | 14 12 10 .
24 21 | 18 15 . .
32 28 24 | 20 . . .
```

Second, we divide the numbers on the left by 10,000:

```
8 7 6 5
× 4 3 2 1
-------------
8 7 6 5
14 12 10 .
18 15 . .
20 . . .
16
24 21
32 28 24
```

Third, we multiply them by 27 to compensate (Remember, 10,000 = 27):

```
8 7 6 5
× 4 3 2 1
----- --- -- ---
8 7 6 5
14 12 10 .
18 15 . .
20 . . .
432
648 567
864 756 648
```

This gives us a nice matrix of numbers to sum:

```
8 7 6 5
× 4 3 2 1
-------------------
8 7 6 5
+ 14 12 10 432
+ 18 15 648 567
+ 20 864 756 648
-------------------
60 898 1420 1652 before carry propagation
16 5 6 5 2 after carry propagation
```

We still have a big digit dangling around. We can deal with it with the same trick.

```
5 6 5 2
+ 432 160,000 ÷ 10,000 × 27
-------------
5 6 5 434 before carry propagation
6 0 8 4 after carry propagation
```

Finally, we're down to 4 digits. We can stop there if we want, or we can complete the reduction by testing if it is bigger than 9973, and subtracting 9973 if it is. Here, 6084 happens to be smaller than 9973, so we won't need to subtract anything for the full reduction.

## Poly1305's prime: 2^{130} - 5

*(130… 5… 1305. Of course. How could I have missed it?)*

Now computers don't count in base 10. They work with base 2. As I
said, Daniel Bernstein chose 2^{130} - 5 in part because 5 is such
a small number, which allows to cheat modular reduction.

Another reason is, 2^{130} - 5 is a little *above*
2^{128}. This allows a very nice encoding of the message into
`m1`

, `m2`

etc: just chop 16 bytes off the message, append a 1 byte to
that, then treat those 17 bytes as a little endian number. 16 bytes is
a very nice power-of-2 size, but you need a large enough prime to deal
with it.

The reason for appending a 1 at the end of each block is to detect trailing zeros. Without that, those 2 messages:

```
68e5904d20dd271607a3658d0e49a77302000000
68e5904d20dd271607a3658d0e49a77302
```

would have the same MAC, making some forgeries very easy.

Yet another reason for choosing 2^{130} - 5 is, 130 is a
multiple of 26. This allows an implementation to encode its number in
five 26-bit "digits", or *limbs*. From there, we perform our schoolbook
multiplication with enormous "digits" that range from 0 to 2²⁶-1. With
five limbs, we span over exactly 130 bits.

Now since our prime is a power of 2²⁶ *minus 5*, we need to use the
trick we learned in base 10: when we move the overflowing limbs to the
right, we need to multiply them by 5 to compensate:

```
a4 a3 a2 a1 a0
× b4 b3 b2 b1 b0
---------------------------------------
a4×b0 a3×b0 a2×b0 a1×b0 a0×b0
+ a3×b1 a2×b1 a1×b1 a0×b1 5×a4×b1
+ a2×b2 a1×b2 a0×b2 5×a4×b2 5×a3×b2
+ a1×b3 a0×b3 5×a4×b3 5×a3×b3 5×a2×b3
+ a0×b4 5×a4×b4 5×a3×b4 5×a2×b4 5×a1×b4
```

Translating that in C is fairly mechanical:

```
void mult(uint64_t p[5], const uint32_t a[5], const uint32_t b[5])
{
uint64_t a0 = a[0]; uint64_t b0 = b[0];
uint64_t a1 = a[1]; uint64_t b1 = b[1];
uint64_t a2 = a[2]; uint64_t b2 = b[2];
uint64_t a3 = a[3]; uint64_t b3 = b[3];
uint64_t a4 = a[4]; uint64_t b4 = b[4];
p[0] = a0*b0 + a1*b4*5 + a2*b3*5 + a3*b2*5 + a4*b1*5;
p[1] = a0*b1 + a1*b0 + a2*b4*5 + a3*b3*5 + a4*b2*5;
p[2] = a0*b2 + a1*b1 + a2*b0 + a3*b4*5 + a4*b3*5;
p[3] = a0*b3 + a1*b2 + a2*b1 + a3*b0 + a4*b4*5;
p[4] = a0*b4 + a1*b3 + a2*b2 + a3*b1 + a4*b0 ;
}
```

Here we see another advantage of having limbs smaller than 32 bits:
all those products *and sums* fit in 64 bits without overflowing. If
the vectors a and b fit in 26 bits, then b×5 fits in 28 bits, and the
products all fits in 55 bits. Make 4 sums with *that*, and the result
fits in 57-bits, much less than the 64 available.

Now we must do carry propagation. Again, we must make sure this won't overflow:

```
uint64_t propagate_carry(uint64_t p[5], uint64_t carry)
{
p[0] += carry * 5; carry = p[0] >> 26; p[0] -= carry << 26;
p[1] += carry ; carry = p[1] >> 26; p[1] -= carry << 26;
p[2] += carry ; carry = p[2] >> 26; p[2] -= carry << 26;
p[3] += carry ; carry = p[3] >> 26; p[3] -= carry << 26;
p[4] += carry ; carry = p[4] >> 26; p[4] -= carry << 26;
return carry;
}
```

The number we're reducing here is p + 2^{130}×carry. This is
why I multiply carry by 5 the first time around: I have to compensate
the implicit division by 2^{130}. When this is finished, every
limb in p are in the proper range (26-bits), and we may have a remaining
carry dangling around (the returned value). We can then repeat the
operation until there is no more carry:

```
uint64_t carry = 0;
do {
carry = propagate_carry(p, carry);
} while (carry != 0);
```

Once the carry is null, the full modular reduction is just a matter of
checking whether the remaining number is higher than 2^{130} - 5,
and subtract 2^{130} - 5 if it is.

Careful how you do it however: this loop is prone to timing attacks. Don't use it in real crypto code. Perform a constant number of carry propagations instead. (My experiments suggest that if the operands of the multiplication have all their limbs in range, 3 iterations are enough.) Also make sure the final test and subtraction are done in a constant time fashion too. It can be done with bit twiddling.

Now we don't have to perform a full modular reduction at the end of each multiplication. Our 26-bit limbs don't really have to fit in 26 bits, as long as we make sure we don't overflow our 32-bit and 64-bit registers. There are also cleverer ways to perform carry propagation. To see a good example, I suggest you take a look at Floodyberry's 32-bit poly1305-donna implementation.

## Parallelism

Horner's rule has one limitation: it's a single threaded algorithm. Each intermediate result depends on the previous (let's assume multiplication also includes partial modular reduction):

```
h0 = 0
h1 = (h0 + m1) * r
h2 = (h1 + m2) * r
h3 = (h2 + m3) * r
h4 = (h3 + m4) * r
...
```

Let's assume the message is only 4 blocks long, so the final hash is
`h4`

. Let's write h4 in terms of h0:

```
h4 = (h3 + m4) * r
h4 = ((h2 + m3) * r + m4) * r
h4 = (((h1 + m2) * r + m3) * r + m4) * r
h4 = ((((h0 + m1) * r + m2) * r + m3) * r + m4) * r
h4 = (((m1 * r + m2) * r + m3) * r + m4) * r
```

Now let's rewrite the polynomial as a sum of products:

```
h4 = ((m1 * r + m2) * r + m3) * r² + m4 * r
h4 = (m1 * r + m2) * r³ + m3 * r² + m4 * r
h4 = m1 * r⁴ + m2 * r³ + m3 * r² + m4 * r
```

This form is the most parallel possible. We could compute every product
in parallel, then sum everything, map-reduce style. There's only one
problem: we need to compute `r`

squared, cubed, and so on. To avoid
this, we can take a hybrid approach, with *two* applications of Horner's
rule instead of just one:

```
h4 = m1 * r⁴ + m3 * r² + m2 * r³ + m4 * r
h4 = ((m1*r² + m3) * r²) + ((m2*r² + m4) * r)
```

In exchange for having to compute `r`

squared, we can now hash two lanes
in parallel:

```
h0 = 0
h1 = (h0 + m1) * r² h2 = (h0 + m2) * r²
h3 = (h1 + m3) * r² h4 = (h2 + m4) * r
h = h3 + h4
```

This generalises to bigger messages:

```
h0 = 0
h1 = (h0 + m1) * r² h2 = (h0 + m2) * r²
h3 = (h1 + m3) * r² h4 = (h2 + m4) * r²
h5 = (h3 + m5) * r² h6 = (h2 + m6) * r²
h7 = (h5 + m7) * r² h8 = (h2 + m8) * r
h = h7 + h8
```

As well as wider parallelism:

```
h0 = 0
h1 = (h0 + m1) * r³ h2 = (h0 + m2) * r³ h3 = (h0 + m3) * r³
h4 = (h1 + m4) * r³ h5 = (h2 + m5) * r³ h6 = (h3 + m6) * r³
h7 = (h4 + m7) * r³ h8 = (h5 + m8) * r² h9 = (h6 + m9) * r
h = h7 + h8 + h9
```

This is quite SIMD friendly. A modern CPU with 256-bit vector units can compute 4 lanes in parallel in a single thread. My benchmarks with Libsodium suggest this is twice as fast as the regular Horner's rule.

## Restricting r

Daniel Bernstein could have stopped there. But he went further, and
restricted the value of the secret `r`

. Instead of using all 128 bits
of the random secret, he cleared 22 bits, sacrificing a bit of
security *"to simplify and accelerate implementations […] in various
contexts"*.

The nature of this simplification becomes apparent when you encode `r`

in 4 32-bit words. There, the 22 bits are cleared thus:

```
r[0] &= 0x0fffffff; // clear top 4 bits
r[1] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
r[2] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
r[3] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
```

This all makes sense when you think about the unthinkable: having the
limbs span *32 bits*.

32 bit limbs are generally a bad idea. They make carry propagation impossible to delay, forcing the code to be more complex and slower. Imagine your machine has 64-bit registers: when you multiply 32-bit limbs, the result spans the whole register. You can't sum it with another product without risking overflow.

Poly1305 in its straightforward Horner's rule implementation however
only makes *one* kind of product, and that product uses `r`

:

```
h = 0;
for each message block mi {
h += mi;
h *= r;
}
mac = (h + k) % (2^130 - 5);
```

Since `r`

limbs span 28 bits instead of the full 32, the limb products
with `h`

will span 60 bits at most (well, 61 because of the modulo
trick). We can now sum those products without overflowing the 64-bit
register. (Modern processors have 128-bit registers, so they can use
64 bits limbs instead. The basic principles are the same.)

That's not the only problem however. 32 is *not* a multiple of 130.
The closest multiple is 128. Performing the modulo trick requires us
to shift 4 limbs to the right, then 2 bits *to the right*. Doing so
loses the 2 least significant bits.

But we don't care, because those bits are already cleared! Well, 2 of
them aren't, but we can easily recover them. In the end, we save a
couple multiplications and simplify (de)serialisation code a good
deal. Carry propagation requires a bit of care (the one I showed
above would overflow), but the code is still pretty simple. Here is a
multiplication, with a subsequent reduction below 2^{130} (not a full
modular reduction, that one can be deferred to the end):

```
// Multiplies h and r, put the result in h
static void poly_mul(uint32_t h[5], const uint32_t r[4])
{
// These would fit in 32 bits, but we need 64 bit multiplications
const uint64_t r0 = r[0];
const uint64_t r1 = r[1];
const uint64_t r2 = r[2];
const uint64_t r3 = r[3];
const uint64_t rr0 = (r[0] >> 2) * 5; // lose 2 bottom bits...
const uint64_t rr1 = (r[1] >> 2) * 5; // 2 bottom bits already cleared
const uint64_t rr2 = (r[2] >> 2) * 5; // 2 bottom bits already cleared
const uint64_t rr3 = (r[3] >> 2) * 5; // 2 bottom bits already cleared
// school book modular multiplication (without carry propagation)
const uint64_t x0 = h[0]*r0 + h[1]*rr3 + h[2]*rr2 + h[3]*rr1 + h[4]*rr0;
const uint64_t x1 = h[0]*r1 + h[1]*r0 + h[2]*rr3 + h[3]*rr2 + h[4]*rr1;
const uint64_t x2 = h[0]*r2 + h[1]*r1 + h[2]*r0 + h[3]*rr3 + h[4]*rr2;
const uint64_t x3 = h[0]*r3 + h[1]*r2 + h[2]*r1 + h[3]*r0 + h[4]*rr3;
const uint64_t x4 = h[4] * (r0 & 3); // ...recover those 2 bits
// carry propagation (put the result back in h)
const uint64_t msb = x4 + (x3 >> 32);
uint64_t u = (msb >> 2) * 5; // lose 2 bottom bits...
u += (x0 & 0xffffffff) ; h[0] = u & 0xffffffff; u >>= 32;
u += (x1 & 0xffffffff) + (x0 >> 32); h[1] = u & 0xffffffff; u >>= 32;
u += (x2 & 0xffffffff) + (x1 >> 32); h[2] = u & 0xffffffff; u >>= 32;
u += (x3 & 0xffffffff) + (x2 >> 32); h[3] = u & 0xffffffff; u >>= 32;
u += msb & 3 /* ...recover them */ ; h[4] = u;
}
```

I wonder why Daniel Bernstein's own reference implementation doesn't use this trick. I mean, he designed the damn authenticator, and this trick is barely more verbose than TweetNaCl, especially without the unrolled loops. My current guess is that he wanted the code to work even with old C89 compilers, which don't necessarily provide 64-bit variables.

## Source code

Just look at Monocypher.