# @ Loup's


January 2017

# The design of Poly1305

Poly1305 is a fast, provably secure, and surprisingly simple one-time authenticator. Its author, Daniel J. Bernstein, explains it well in his paper, if you're already an expert. The rest of us are kinda left in the dust.

## Security guarantee

Poly1305 generates a Message Authentication Code (MAC) from a message and a one time key. In C, it would look like this:

```
void poly1305(uint8_t        mac[16],
              const uint8_t *message,
              size_t         message_size,
              const uint8_t  key[32]);
```

Assuming the key is a uniformly random number, independent of everything else and unknown to the attacker, the probability of forgery is close enough to zero to be considered flat out impossible. The problem is coming up with such a secret. Having to generate a whole new 256-bit shared secret for each message is a bit unwieldy. Which is why in practice, we don't.

Instead, we derive the key from a longer-term key and a nonce, with a symmetric cipher (usually the same cipher that provides encryption). In this context, you only have to generate a random shared secret once. You still need a unique nonce per message, but that nonce doesn't have to be random or secret. To break this construction, you have to break the underlying cipher.

## Polynomial evaluation MACs

The idea seems to be derived from Carter & Wegman universal hashing. We treat the message as a series of numbers `m1`, `m2`, `m3`… and use those numbers and the secrets to compute a series of additions and multiplications to generate the MAC.

There are many variations on this theme, though most are slow or require long keys. The variation we're interested in is a high-degree polynomial over a prime field.

The idea is to evaluate your polynomial over ℤp (the integers from 0 to p-1), for some big prime number p. You basically need two secrets, `r` and `k`. You then compute your high-degree polynomial from your message chunks and `r`, and finally add your secret key `k` to the result. Assuming your message has 4 chunks, you would compute

```
mac = (m1 r⁴ + m2 r³ + m3 r² + m4 r + k) mod p
```

To limit the number of multiplications, you can use Horner's rule:

```
mac = (r (r (r (r m1 + m2) + m3) + m4) + k) mod p
```

Which is easy to compute iteratively (here in pseudo C):

```
h = 0;
h += m1; h *= r;
h += m2; h *= r;
h += m3; h *= r;
h += m4; h *= r;
mac = (h + k) % p;
```

Now we can hash an n-chunk message with n multiplications and n+1 additions. Should be a snap, right?

Not quite. First, we are dealing with relatively big numbers here. The message chunks won't fit in your vanilla 32-bit register, and neither will the prime p. To get enough security, you want something closer to 128 bits.

Second, if we keep adding and multiplying numbers without thinking, we have to handle bigger and bigger numbers as we iterate. The last value of `h` would be about as big as the message itself! So we need to keep reducing those numbers modulo p, thus:

```
h = 0;
h += m1; h *= r; h %= p;
h += m2; h *= r; h %= p;
h += m3; h *= r; h %= p;
h += m4; h *= r; h %= p;
mac = (h + k) % p;
```
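To make this concrete, here is a toy version in C (my own sketch; `toy_poly_mac` is a made-up name, and the prime is deliberately tiny so everything fits in 64 bits; real Poly1305 works modulo 2¹³⁰ - 5):

```c
#include <stddef.h>
#include <stdint.h>

// Toy polynomial MAC: evaluates (m1 r^n + ... + mn r + k) mod p by
// Horner's rule, reducing at every step.  Illustration only: with a
// tiny prime like 9973 this is trivially breakable.
uint64_t toy_poly_mac(const uint64_t *m, size_t n,
                      uint64_t r, uint64_t k, uint64_t p)
{
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++) {
        h = (h + m[i]) % p; // absorb one message chunk
        h = (h * r)    % p; // multiply by the secret r
    }
    return (h + k) % p;     // finally, add the secret key k
}
```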

But if we do that, we end up performing n divisions as well as n multiplications, and dividing big numbers is bloody expensive.

## Cheating at modular arithmetic

Learn this one weird trick to modular multiplication. Cryptographers don't want you to know!

Modular addition is easy: as long as both operands are less than p, their sum has to be less than 2p. So we just test whether the sum is at least p, and subtract p if it is.

Multiplication is more problematic: even if both operands are between 0 and p-1, their product can go up to p squared. We're not getting away with successive subtractions, especially if p is big.

Turns out, you can cheat if you select the right prime: just make sure it is of the form 2^n - q, where q is a small number (often less than 10). 2¹³⁰ - 5 is a perfect candidate for this —a big reason why Daniel Bernstein chose this particular prime.

To understand this trick, we need to go back to kindergarten. Here, we will work in base 10, choose a prime just below a power of 10: 9973 (10⁴ - 27), and perform modular arithmetic with it. The structure of such a multiplication looks like this:

```
              8  7  6  5
×             4  3  2  1
------------------------
              8  7  6  5   8765 × 1
+         16 14 12 10  .   8765 × 2 × 10
+      24 21 18 15  .  .   8765 × 3 × 100
+   32 28 24 20  .  .  .   8765 × 4 × 1000
------------------------
    32 52 61 60 34 16  5   before carry propagation
  3  7  8  7  3  5  6  5   after carry propagation
```

It may be a little different from what you're used to, since I did not compute the carries until the very end. It's a bit harder to do by hand, and a bit easier to program. This is sometimes called Comba multiplication.

We're not interested in 37873565, though. We want 37873565 modulo 9973. Here's the trick:

```
10,000 mod 9973 = 27
```

Which means, multiplying by 10,000 is the same as multiplying by 27 (modulo 9973)! Now watch me perform the modular reduction using only additions and multiplications. No hard division in there:

```
p = 9973
37873565 mod p = (3787 × 10,000 + 3565) mod p
               = (3787 × 27     + 3565) mod p
               = (105814              ) mod p
```

Once more:

```
105814 mod p = (10 × 10,000 + 5814) mod p
             = (10 × 27     + 5814) mod p
             = (270         + 5814) mod p
             = 6084
```

6084 is smaller than 9973, so 37873565 mod 9973 = 6084.
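In C, the folding might look like this (a toy helper of my own, not from Poly1305; note that the division and modulo by 10,000 merely model the digit split, which becomes a cheap shift and mask when the prime is 2ⁿ - q):

```c
#include <stdint.h>

// Reduce x modulo 9973 = 10^4 - 27 by folding: since
// 10,000 ≡ 27 (mod 9973), the high digits can be multiplied by 27
// and added back onto the low digits.  The / and % by 10,000 stand
// in for the digit split; with a binary prime 2^n - q, they would be
// a shift and a mask, not real divisions.
uint64_t fold_9973(uint64_t x)
{
    while (x >= 10000) {
        x = (x / 10000) * 27 + x % 10000;
    }
    // at most one subtraction finishes the reduction
    return x >= 9973 ? x - 9973 : x;
}
```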

Once we understand the trick, we can use it to simplify the schoolbook multiplication. First, we need to separate the numbers on the left (weight 10⁴ and above) from the numbers on the right:

```
              8  7  6  5
×             4  3  2  1
------------------------
           |  8  7  6  5
        16 | 14 12 10  .
     24 21 | 18 15  .  .
  32 28 24 | 20  .  .  .
```

Second, we divide the numbers on the left by 10,000:

```
              8  7  6  5
×             4  3  2  1
------------------------
              8  7  6  5
             14 12 10  .
             18 15  .  .
             20  .  .  .
                      16
                   24 21
                32 28 24
```

Third, we multiply them by 27 to compensate (remember, 10,000 mod 9973 = 27):

```
              8    7    6    5
×             4    3    2    1
------------------------------
              8    7    6    5
             14   12   10    .
             18   15    .    .
             20    .    .    .
                           432
                      648  567
                 864  756  648
```

This gives us a nice matrix of numbers to sum:

```
              8    7    6    5
×             4    3    2    1
------------------------------
              8    7    6    5
+            14   12   10  432
+            18   15  648  567
+            20  864  756  648
------------------------------
             60  898 1420 1652   before carry propagation
        16    5    6    5    2   after carry propagation
```

We still have a big digit dangling around. We can deal with it with the same trick.

```
   5   6   5   2
+            432   160,000 ÷ 10,000 × 27
----------------
   5   6   5 434   before carry propagation
   6   0   8   4   after carry propagation
```

Finally, we're down to 4 digits. We can stop there if we want, or we can complete the reduction by testing if it is bigger than 9973, and subtracting 9973 if it is. Here, 6084 happens to be smaller than 9973, so we won't need to subtract anything for the full reduction.
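The whole walkthrough condenses into one toy function (again my own sketch, with hypothetical names; digits are stored little endian, unlike the big-endian matrices above):

```c
#include <stdint.h>

// Toy "schoolbook multiplication with folding" modulo 9973 = 10^4 - 27.
// a and b are 4-digit numbers, stored as little-endian digit arrays.
uint64_t mul_mod_9973(const uint8_t a[4], const uint8_t b[4])
{
    // Schoolbook products.  Columns 4 and above would weigh 10^4 or
    // more, so they are multiplied by 27 and folded onto columns 0..2.
    uint64_t col[4] = {0, 0, 0, 0};
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            uint64_t prod = (uint64_t)a[i] * b[j];
            if (i + j < 4) { col[i + j]     += prod;      }
            else           { col[i + j - 4] += prod * 27; }
        }
    }
    // Carry propagation, then final folding and subtraction.
    uint64_t x = col[0] + col[1]*10 + col[2]*100 + col[3]*1000;
    while (x >= 10000) {
        x = (x / 10000) * 27 + x % 10000;
    }
    return x >= 9973 ? x - 9973 : x;
}
```

With a = 8765 and b = 4321 (the example above), the columns come out as 1652, 1420, 898, 60, and the final result is 6084, as computed by hand.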

## Poly1305's prime: 2¹³⁰ - 5

(130… 5… 1305. Of course. How could I have missed it?)

Now, computers don't count in base 10; they work in base 2. As I said, Daniel Bernstein chose 2¹³⁰ - 5 in part because 5 is such a small number, which lets us cheat at modular reduction.

Another reason is that 2¹³⁰ - 5 is a little above 2¹²⁸. This allows a very nice encoding of the message into `m1`, `m2`, etc.: just chop 16 bytes off the message, append a byte with value 1, then treat those 17 bytes as a little-endian number. 16 bytes is a very nice power-of-2 size, but you need a large enough prime to deal with it.

The reason for appending a 1 at the end of each block is to detect trailing zeroes. Without it, these two messages:

```
68e5904d20dd271607a3658d0e49a77302000000
68e5904d20dd271607a3658d0e49a77302
```

would have the same MAC, making some forgeries very easy.

Yet another reason for choosing 2¹³⁰ - 5 is, 130 is a multiple of 26. This allows an implementation to encode its number in five 26-bit "digits", or limbs. From there, we perform our schoolbook multiplication with enormous "digits" that range from 0 to 2²⁶-1.

```
void mult(uint64_t p[5], const uint32_t a[5], const uint32_t b[5])
{
    uint64_t a0 = a[0];  uint64_t b0 = b[0];
    uint64_t a1 = a[1];  uint64_t b1 = b[1];  uint64_t b51 = b[1] * 5;
    uint64_t a2 = a[2];  uint64_t b2 = b[2];  uint64_t b52 = b[2] * 5;
    uint64_t a3 = a[3];  uint64_t b3 = b[3];  uint64_t b53 = b[3] * 5;
    uint64_t a4 = a[4];  uint64_t b4 = b[4];  uint64_t b54 = b[4] * 5;
    p[0] = a0*b0 + a1*b54 + a2*b53 + a3*b52 + a4*b51;
    p[1] = a0*b1 + a1*b0  + a2*b54 + a3*b53 + a4*b52;
    p[2] = a0*b2 + a1*b1  + a2*b0  + a3*b54 + a4*b53;
    p[3] = a0*b3 + a1*b2  + a2*b1  + a3*b0  + a4*b54;
    p[4] = a0*b4 + a1*b3  + a2*b2  + a3*b1  + a4*b0 ;
}
```

It may not look like it, but this is really the modified schoolbook multiplication we saw earlier (note the numbers above were represented in little endian):

```
        a4      a3      a2      a1      a0
×       b4      b3      b2      b1      b0
------------------------------------------
     a4×b0   a3×b0   a2×b0   a1×b0   a0×b0
+    a3×b1   a2×b1   a1×b1   a0×b1 5×a4×b1
+    a2×b2   a1×b2   a0×b2 5×a4×b2 5×a3×b2
+    a1×b3   a0×b3 5×a4×b3 5×a3×b3 5×a2×b3
+    a0×b4 5×a4×b4 5×a3×b4 5×a2×b4 5×a1×b4
```

Here we see another advantage of having limbs smaller than 32 bits: all those products and sums fit in 64 bits without overflowing. If the limbs of a and b fit in 26 bits, then the 5×b terms fit in 29 bits. So each product fits in 55 bits. Add 5 of those together, and the result fits in 58 bits, much less than the 64 available.

Now we must do carry propagation. Again, we must make sure this won't overflow:

```
uint64_t propagate_carry(uint64_t p[5], uint64_t carry)
{
    p[0] += carry * 5;  carry = p[0] >> 26;  p[0] -= carry << 26;
    p[1] += carry    ;  carry = p[1] >> 26;  p[1] -= carry << 26;
    p[2] += carry    ;  carry = p[2] >> 26;  p[2] -= carry << 26;
    p[3] += carry    ;  carry = p[3] >> 26;  p[3] -= carry << 26;
    p[4] += carry    ;  carry = p[4] >> 26;  p[4] -= carry << 26;
    return carry;
}
```
``````

The number we're reducing here is p + 2¹³⁰×carry. This is why I multiply carry by 5 the first time around: I have to compensate for the implicit division by 2¹³⁰. When this is finished, every limb of p is in the proper range (26 bits), and we may have a remaining carry dangling around (the returned value). We can then repeat the operation until there is no carry left:

```
uint64_t carry = 0;
do {
    carry = propagate_carry(p, carry);
} while (carry != 0);
```
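Putting the two together, here is what a complete "multiply, then reduce below 2¹³⁰" looks like (mult() and propagate_carry() are copied from above so the sketch is self-contained; the mul_mod wrapper is my own). As a sanity check, 2⁶⁵ × 2⁶⁵ = 2¹³⁰, which should reduce to 5:

```c
#include <stdint.h>

// mult() and propagate_carry() are the two functions shown above,
// reproduced verbatim so this example stands alone.
static void mult(uint64_t p[5], const uint32_t a[5], const uint32_t b[5])
{
    uint64_t a0 = a[0];  uint64_t b0 = b[0];
    uint64_t a1 = a[1];  uint64_t b1 = b[1];  uint64_t b51 = b[1] * 5;
    uint64_t a2 = a[2];  uint64_t b2 = b[2];  uint64_t b52 = b[2] * 5;
    uint64_t a3 = a[3];  uint64_t b3 = b[3];  uint64_t b53 = b[3] * 5;
    uint64_t a4 = a[4];  uint64_t b4 = b[4];  uint64_t b54 = b[4] * 5;
    p[0] = a0*b0 + a1*b54 + a2*b53 + a3*b52 + a4*b51;
    p[1] = a0*b1 + a1*b0  + a2*b54 + a3*b53 + a4*b52;
    p[2] = a0*b2 + a1*b1  + a2*b0  + a3*b54 + a4*b53;
    p[3] = a0*b3 + a1*b2  + a2*b1  + a3*b0  + a4*b54;
    p[4] = a0*b4 + a1*b3  + a2*b2  + a3*b1  + a4*b0 ;
}

static uint64_t propagate_carry(uint64_t p[5], uint64_t carry)
{
    p[0] += carry * 5;  carry = p[0] >> 26;  p[0] -= carry << 26;
    p[1] += carry    ;  carry = p[1] >> 26;  p[1] -= carry << 26;
    p[2] += carry    ;  carry = p[2] >> 26;  p[2] -= carry << 26;
    p[3] += carry    ;  carry = p[3] >> 26;  p[3] -= carry << 26;
    p[4] += carry    ;  carry = p[4] >> 26;  p[4] -= carry << 26;
    return carry;
}

// Multiply two numbers in 26-bit limbs, reduce the result below 2^130.
// The variable-time loop is fine for a demo, not for real crypto code.
void mul_mod(uint32_t out[5], const uint32_t a[5], const uint32_t b[5])
{
    uint64_t p[5];
    mult(p, a, b);
    uint64_t carry = 0;
    do {
        carry = propagate_carry(p, carry);
    } while (carry != 0);
    for (int i = 0; i < 5; i++)
        out[i] = (uint32_t)p[i];
}
```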

Once the carry is zero, the full modular reduction is just a matter of checking whether the remaining number is at least 2¹³⁰-5, and subtracting 2¹³⁰-5 if it is.

Careful how you do it however: this loop is prone to timing attacks. Don't use it in real crypto code. Perform a constant number of carry propagations instead. (My experiments suggest that if the operands of the multiplication have all their limbs in range, 3 iterations are enough.) Also make sure the final test and subtraction are done in a constant time fashion too. It can be done with bit twiddling.
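The bit twiddling can look like this (a toy 64-bit version of the idea, my own sketch with a made-up name; a real implementation does the same thing limb by limb):

```c
#include <stdint.h>

// Constant-time "subtract p if x >= p", valid for x < 2^63.
// No branch or memory access depends on the secret value of x.
uint64_t cond_sub(uint64_t x, uint64_t p)
{
    uint64_t diff = x - p;          // wraps around when x < p
    uint64_t mask = -(diff >> 63);  // all ones when x < p, zero otherwise
    return (x & mask) | (diff & ~mask);
}
```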

Now we don't have to perform a full modular reduction at the end of each multiplication. Our 26-bit limbs don't really have to fit in 26 bits, as long as we make sure we don't overflow our 32-bit and 64-bit registers. There are also cleverer ways to perform carry propagation. To see a good example, I suggest you take a look at Floodyberry's 32-bit poly1305-donna implementation.

## Restricting r

Daniel Bernstein could have stopped there. But he went further, and restricted the value of the secret `r`. Instead of using all 128 bits of the random secret, he cleared 22 bits, sacrificing a bit of security "to simplify and accelerate implementations […] in various contexts".

The nature of this simplification becomes apparent when you encode `r` in 4 32-bit words. There, the 22 bits are cleared thus:

```
r[0] &= 0x0fffffff; // clear top 4 bits
r[1] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
r[2] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
r[3] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
```

This all makes sense when you think about the unthinkable: having the limbs span 32 bits.

32 bit limbs are generally a bad idea. They make carry propagation impossible to delay, forcing the code to be more complex and slower. Imagine your machine has 64-bit registers: when you multiply 32-bit limbs, the result spans the whole register. You can't sum it with another product without risking overflow.

However, a straightforward Horner's rule implementation of Poly1305 performs only one kind of product, and that product always involves `r`:

```
h = 0;
for each message block mi {
    h += mi;
    h *= r;
}
mac = (h + k) % (2^130 - 5);
```

Since the limbs of `r` span 28 bits instead of the full 32, the limb products with `h` span 60 bits at most (well, 61, because of the modulo trick). We can now sum those products without overflowing a 64-bit register. (Processors with 64×64→128-bit multipliers can use 64-bit limbs instead. The basic principles are the same.)

That's not the only problem however. 32 is not a multiple of 130. The closest multiple is 128. Performing the modulo trick requires us to shift 4 limbs to the right, then 2 bits to the right. Doing so loses the 2 least significant bits.

But we don't care, because those bits are already cleared! Well, the bottom 2 bits of `r[0]` aren't, but we can easily recover them. In the end, we save a couple of multiplications and simplify the (de)serialisation code a good deal. Carry propagation requires a bit of care (the one I showed above would overflow), but the code is still pretty simple. Here is a multiplication, with a subsequent reduction below 2¹³⁰ (not a full modular reduction; that one can be deferred to the end):

```
// Multiplies h and r, puts the result in h
static void poly_mul(uint32_t h[5], const uint32_t r[4])
{
    // These would fit in 32 bits, but we need 64 bit multiplications
    const uint64_t r0 = r[0];
    const uint64_t r1 = r[1];
    const uint64_t r2 = r[2];
    const uint64_t r3 = r[3];
    const uint64_t rr0 = (r[0] >> 2) * 5; // lose 2 bottom bits...
    const uint64_t rr1 = (r[1] >> 2) * 5; // 2 bottom bits already cleared
    const uint64_t rr2 = (r[2] >> 2) * 5; // 2 bottom bits already cleared
    const uint64_t rr3 = (r[3] >> 2) * 5; // 2 bottom bits already cleared

    // school book modular multiplication (without carry propagation)
    const uint64_t x0 = h[0]*r0 + h[1]*rr3 + h[2]*rr2 + h[3]*rr1 + h[4]*rr0;
    const uint64_t x1 = h[0]*r1 + h[1]*r0  + h[2]*rr3 + h[3]*rr2 + h[4]*rr1;
    const uint64_t x2 = h[0]*r2 + h[1]*r1  + h[2]*r0  + h[3]*rr3 + h[4]*rr2;
    const uint64_t x3 = h[0]*r3 + h[1]*r2  + h[2]*r1  + h[3]*r0  + h[4]*rr3;
    const uint64_t x4 = h[4] * (r0 & 3); // ...recover those 2 bits

    // carry propagation (put the result back in h)
    const uint64_t msb = x4 + (x3 >> 32);
    uint64_t       u   = (msb >> 2) * 5; // lose 2 bottom bits...
    u += (x0 & 0xffffffff)             ;  h[0] = u & 0xffffffff;  u >>= 32;
    u += (x1 & 0xffffffff) + (x0 >> 32);  h[1] = u & 0xffffffff;  u >>= 32;
    u += (x2 & 0xffffffff) + (x1 >> 32);  h[2] = u & 0xffffffff;  u >>= 32;
    u += (x3 & 0xffffffff) + (x2 >> 32);  h[3] = u & 0xffffffff;  u >>= 32;
    u += msb & 3 /* ...recover them */ ;  h[4] = u;
}
```

I wonder why Daniel Bernstein's own reference implementation doesn't use this trick. I mean, he designed the damn authenticator, and this trick is barely more verbose than TweetNaCl, especially without the unrolled loops. My current guess is that he wanted the code to work even with old C89 compilers, which don't necessarily provide 64-bit variables.

## Source code

For reference. It took me a long while to understand this stuff enough to implement it myself.

Do not use it in production. While this code is probably bug-free, I cannot guarantee there isn't some rarely occurring limb overflow lurking in there. Big number arithmetic is error prone; even world class experts can screw it up.

Some day, this code will get properly reviewed and audited. Until then, use TweetNaCl or libsodium. After that, still do not use it —not directly. You want authenticated encryption, for instance ChaCha20-Poly1305. Libsodium and TweetNaCl have an interface for that.

For those who like an official statement:

Copying and distribution of the following code, with or without modification, are permitted in any medium without royalty. This code is offered as-is, without any warranty.

### poly1305.h

```
#ifndef POLY1305_H
#define POLY1305_H

#include <inttypes.h>
#include <stddef.h>

typedef struct {
    uint32_t r[4];   // key
    uint32_t h[5];   // accumulated hash
    uint32_t c[5];   // current input block
    uint32_t pad[5]; // second half of the key
    size_t   c_index;
} crypto_poly1305_ctx;

// Initializes the poly1305 context with the secret key.
// Call first (obviously).
// WARNING: NEVER AUTHENTICATE 2 MESSAGES WITH THE SAME KEY.
// This is a ONE TIME authenticator.  If you authenticate 2 messages
// with the same key, the attacker may deduce your secret key and
// forge messages at will.
void crypto_poly1305_init(crypto_poly1305_ctx *ctx, const uint8_t key[32]);

// Updates the poly1305 context with a chunk of the message.
// Can be called multiple times, once for each chunk.
// Make sure the chunks are processed in order, without overlap or hole...
void crypto_poly1305_update(crypto_poly1305_ctx *ctx,
                            const uint8_t *m, size_t bytes);

// Authenticates the message munched through previous update() calls.
// Call last (obviously).
void crypto_poly1305_finish(crypto_poly1305_ctx *ctx, uint8_t mac[16]);

// Convenience all-in-one function
void crypto_poly1305_auth(uint8_t        mac[16],
                          const uint8_t *m,
                          size_t         msg_length,
                          const uint8_t  key[32]);

// Constant time equality verification
// returns 0 if it matches, something else otherwise.
int crypto_memcmp_16(const uint8_t mac1[16], const uint8_t mac2[16]);

#endif // POLY1305_H
```

### poly1305.c

```
#include "poly1305.h"

static uint32_t load32_le(const uint8_t s[4])
{
    return (uint32_t)s[0]
        | ((uint32_t)s[1] <<  8)
        | ((uint32_t)s[2] << 16)
        | ((uint32_t)s[3] << 24);
}

static void store32_le(uint8_t output[4], uint32_t input)
{
    output[0] =  input        & 0xff;
    output[1] = (input >>  8) & 0xff;
    output[2] = (input >> 16) & 0xff;
    output[3] = (input >> 24) & 0xff;
}

static void poly_load(uint32_t out[4], const uint8_t in[16])
{
    for (int i = 0; i < 4; i++)
        out[i] = load32_le(in + i*4);
}

static void poly_add(uint32_t out[5], const uint32_t a[5], const uint32_t b[5])
{
    uint64_t carry = 0;
    for (int i = 0; i < 5; i++) {
        carry  += (int64_t)(a[i]) + b[i];
        out[i]  = carry & 0xffffffff; // lower 32 bits right there.
        carry >>= 32;                 // retain the carry
    }
}

// h = (h + c) * r
static void poly_block(crypto_poly1305_ctx *ctx)
{
    // h + c, without carry propagation
    const uint64_t h0 = ctx->h[0] + (uint64_t)ctx->c[0];
    const uint64_t h1 = ctx->h[1] + (uint64_t)ctx->c[1];
    const uint64_t h2 = ctx->h[2] + (uint64_t)ctx->c[2];
    const uint64_t h3 = ctx->h[3] + (uint64_t)ctx->c[3];
    const uint64_t h4 = ctx->h[4] + (uint64_t)ctx->c[4];

    // Local all the things!
    const uint64_t r0 = ctx->r[0];
    const uint64_t r1 = ctx->r[1];
    const uint64_t r2 = ctx->r[2];
    const uint64_t r3 = ctx->r[3];
    const uint64_t rr0 = (ctx->r[0] >> 2) * 5; // lose 2 bottom bits...
    const uint64_t rr1 = (ctx->r[1] >> 2) * 5; // 2 bottom bits already cleared
    const uint64_t rr2 = (ctx->r[2] >> 2) * 5; // 2 bottom bits already cleared
    const uint64_t rr3 = (ctx->r[3] >> 2) * 5; // 2 bottom bits already cleared

    // (h + c) * r, without carry propagation
    const uint64_t x0 = h0*r0 + h1*rr3 + h2*rr2 + h3*rr1 + h4*rr0;
    const uint64_t x1 = h0*r1 + h1*r0  + h2*rr3 + h3*rr2 + h4*rr1;
    const uint64_t x2 = h0*r2 + h1*r1  + h2*r0  + h3*rr3 + h4*rr2;
    const uint64_t x3 = h0*r3 + h1*r2  + h2*r1  + h3*r0  + h4*rr3;
    const uint64_t x4 = h4 * (r0 & 3); // ...recover those 2 bits

    // carry propagation, put ctx->h under 2^130
    const uint64_t msb = x4 + (x3 >> 32);
    uint64_t       u   = (msb >> 2) * 5; // lose 2 bottom bits...
    u += (x0 & 0xffffffff)             ;  ctx->h[0] = u & 0xffffffff;  u >>= 32;
    u += (x1 & 0xffffffff) + (x0 >> 32);  ctx->h[1] = u & 0xffffffff;  u >>= 32;
    u += (x2 & 0xffffffff) + (x1 >> 32);  ctx->h[2] = u & 0xffffffff;  u >>= 32;
    u += (x3 & 0xffffffff) + (x2 >> 32);  ctx->h[3] = u & 0xffffffff;  u >>= 32;
    u += msb & 3 /* ...recover them */ ;  ctx->h[4] = u;
}

// (re-)initializes the input counter and input buffer
static void poly_clear_c(crypto_poly1305_ctx *ctx)
{
    for (int i = 0; i < 4; i++)
        ctx->c[i] = 0;
    ctx->c_index = 0;
}

void crypto_poly1305_init(crypto_poly1305_ctx *ctx, const uint8_t key[32])
{
    // initial h: zero
    for (int i = 0; i < 5; i++)
        ctx->h[i] = 0;
    // initial r: first half of the key, minus a few bits
    poly_load(ctx->r, key);
    ctx->r[0] &= 0x0fffffff; // clear top 4 bits
    ctx->r[1] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
    ctx->r[2] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
    ctx->r[3] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
    // the "1" appended to each full 16-byte block (the 17th byte)
    ctx->c[4]  = 1;
    // second half of the key, saved for later
    poly_load(ctx->pad, key + 16);
    ctx->pad[4] = 0;
    // buffer and counter
    poly_clear_c(ctx);
}

void crypto_poly1305_update(crypto_poly1305_ctx *ctx,
                            const uint8_t *m, size_t bytes)
{
    while (bytes > 0) {
        if (ctx->c_index == 16) {
            poly_block(ctx);
            poly_clear_c(ctx);
        }
        // feed the input buffer
        ctx->c[ctx->c_index / 4] |= (uint32_t)*m << ((ctx->c_index % 4) * 8);
        ctx->c_index++;
        m++;
        bytes--;
    }
}

void crypto_poly1305_finish(crypto_poly1305_ctx *ctx, uint8_t mac[16])
{
    // move the final 1 according to remaining input length
    ctx->c[4] = 0;
    ctx->c[ctx->c_index / 4] |= 1 << ((ctx->c_index % 4) * 8);
    // one last hash update...
    poly_block(ctx);
    // ... this time with full modular reduction
    // We only need to conditionally subtract 2^130-5,
    // using bit twiddling to prevent timing attacks.
    static const uint32_t minus_p[5] = { 5, 0, 0, 0, 0xfffffffc };
    uint32_t h_minus_p[5];
    poly_add(h_minus_p, ctx->h, minus_p);
    uint32_t negative = ~(-(h_minus_p[4] >> 31)); // 0 or -1 (2's complement)
    for (int i = 0; i < 5; i++) {
        ctx->h[i] ^= negative & (ctx->h[i] ^ h_minus_p[i]);
    }
    // add the secret pad (second half of the key) to the hash
    poly_add(ctx->h, ctx->h, ctx->pad);
    for (int i = 0; i < 4; i++)
        store32_le(mac + i*4, ctx->h[i]);
}

void crypto_poly1305_auth(uint8_t mac[16], const uint8_t *m,
                          size_t  m_size , const uint8_t  key[32])
{
    crypto_poly1305_ctx ctx;
    crypto_poly1305_init  (&ctx, key);
    crypto_poly1305_update(&ctx, m, m_size);
    crypto_poly1305_finish(&ctx, mac);
}

int crypto_memcmp_16(const uint8_t mac1[16], const uint8_t mac2[16])
{
    unsigned diff = 0;
    for (int i = 0; i < 16; i++) {
        diff |= (mac1[i] ^ mac2[i]);
    }
    return diff;
}
```