@ Loup's Impossible? Like that would stop me.

January 2017

The design of Poly1305

Poly1305 is a fast, provably secure, and surprisingly simple one time authenticator. Its author, Daniel J. Bernstein explains it well in his paper, if you’re already an expert. The rest of us is kinda left in the dust.

Security guarantee

Poly1305 generates a Message Authentication Code (MAC) from a message and a one time key. In C, it would look like this:

void poly1305(uint8_t        mac[16],
              const uint8_t *message,
              size_t         message_size ,
              const uint8_t  key[32])

Assuming the key is a uniform, independent number unknown to the attacker, the probability of forgery is close enough to zero to be considered flat out impossible. The problem is coming up with such a secret. Having to generate a whole new 256-bit shared secret for each message is a bit unwieldy. Which is why in practice, we don’t.

Instead, we derive the key from a longer-term key and and nonce with a symmetric cipher —usually that same cipher that provides encryption. In this context, you only have to generate a random shared secret once. You still need a unique nonce per message, but that nonce doesn’t have to be random nor secret. To break this construction, you have to break the underlying cipher.

Polynomial evaluation MACs

The idea seems to be derived from Carter & Wegman universal hashing. We treat the message as a series of numbers m1, m2, m3… and use those numbers and the secrets to compute a series of additions and multiplications to generate the MAC.

There are many variations on this theme, though most are slow or require long keys. The variation we’re interested in is a high-degree polynomial over a prime field.

The idea is to evaluate your polynomial over ℤp (All natural numbers between 0 and p-1), for some big prime number p. You basically need 2 secrets, r and k. You then compute your high-degree polynomial with your message and r, and finally add your secret key k to the result. Assuming your message has 4 parts, you would compute

mac = (m1 r⁴ + m2 r³ + m3 r² + m4 r + k) mod p

To limit the number of multiplications, you can use Horner’s rule:

mac = (r (r (r (r m1 + m2) + m3) + m4) + k) mod p

Which is easy to compute iteratively (here in pseudo C):

h = 0;
h += m1; h *= r;
h += m2; h *= r;
h += m3; h *= r;
h += m4; h *= r;
mac = (h + k) % p;

Now we can hash an n-chunk message with n multiplications and n+1 additions. Should be a snap, right?

Not quite. First, we are dealing with relatively big numbers here. The message chunks won’t fit in your vanilla 32-bit register, and neither will the prime p. To get enough security, you want something closer to 128 bits.

Second, If we keep adding and multiplying numbers without thinking, we would have to handle bigger and bigger numbers as we iterate. The last value of h would be as big as the message itself! So we need to keep reducing those numbers modulo p, thus:

h = 0;
h += m1; h *= r; h %= p;
h += m2; h *= r; h %= p;
h += m3; h *= r; h %= p;
h += m4; h *= r; h %= p;
mac = (h + k) % p;

But if we do that, we end up performing n divisions as well as n multiplications, and dividing big numbers is bloody expensive.

Cheating at modular arithmetic

Learn this one weird trick to modular multiplication. Cryptographers don’t want you to know!

Modular addition is easy: as long as the operands are both less than p, the total has to be less than 2p. So we just test if the sum is greater than p, and subtract p if it is.

Multiplication is more problematic: even if both operands are between 0 and p-1, their product can go up to p squared. We’re not getting away with successive subtractions, especially if p is big.

Turns out, you can cheat if you select the right prime: just make sure it is of the form 2^n - q, where q is a small number (often less than 10). 2130 - 5 is a perfect candidate for this —a big reason why Daniel Bernstein chose this particular prime.

To understand this trick, we need to go back to kindergarten. Here, we will work in base 10, and chose a prime just below a power of 10: 9973 (10⁴ - 27), and perform modulo arithmetic over that. The structure of such a multiplication looks like this:

            8  7  6  5
         ×  4  3  2  1
         -------------
            8  7  6  5  8765 × 1
+       16 14 12 10  .  8765 × 2 × 10
+    24 21 18 15  .  .  8765 × 3 × 100
+ 32 28 24 20  .  .  .  8765 × 4 × 1000
  --------------------
  32 52 61 60 34 16  5  before carry propagation
3  7  8  7  3  5  6  5  after carry propagation

It may be a little different from what you’re used to, since I did not compute the carry until the very end. It’s a bit harder to do by hand, and a bit easier to program. This is sometimes called the Comba multiplication.

We’re not interested in 37873565, though. We want 37873565 modulo 9973. Here’s the trick:

10,000 mod 9973 = 27

Which means, multiplying by 10,000 is the same as multiplying by 27! Now watch me perform the modular reduction using only additions and multiplications. No hard division in there:

p = 9973
37873565 mod p = (3787 × 10,000 + 3565) mod p
               = (3787 × 27     + 3565) mod p
               = (105814              ) mod p

Once more:

105814   mod p = (10 × 10,000 + 5814) mod p
               = (10 × 27     + 5814) mod p
               = (270         + 5814) mod p
               = 6084

6084 is smaller than 9973, so 37873565 mod 9973 = 6084.

Once we understand the trick, we can use it in to simplify the schoolbook multiplication. First, we need to separate the numbers on the left from the numbers on the right:

              8  7  6  5
           ×  4  3  2  1
           -------------
           |  8  7  6  5
        16 | 14 12 10  .
     24 21 | 18 15  .  .
  32 28 24 | 20  .  .  .

Second, we divide the numbers on the left by 10,000:

   8  7  6  5
×  4  3  2  1
-------------
   8  7  6  5
  14 12 10  .
  18 15  .  .
  20  .  .  .
           16
        24 21
     32 28 24

Third, we multiply them by 27 to compensate (Remember, 10,000 = 27):

   8   7   6   5
×  4   3   2   1
----- --- -- ---
   8   7   6   5
  14  12  10   .
  18  15   .   .
  20   .   .   .
             432
         648 567
     864 756 648

This gives us a nice matrix of numbers to sum:

     8    7    6    5
  ×  4    3    2    1
  -------------------
     8    7    6    5
  + 14   12   10  432
  + 18   15  648  567
  + 20  864  756  648
  -------------------
    60  898 1420 1652   before carry propagation
16   5    6    5    2   after carry propagation

We still have a big digit dangling around. We can deal with it with the same trick.

  5   6   5   2
+           432   160,000 ÷ 10,000 × 27
  -------------
  5   6   5 434   before carry propagation
  6   0   8   4   after carry propagation

Finally, we’re down to 4 digits. We can stop there if we want, or we can complete the reduction by testing if it is bigger than 9973, and subtracting 9973 if it is. Here, 6084 happens to be smaller than 9973, so we won’t need to subtract anything for the full reduction.

Poly1305’s prime: 2130 - 5

(130… 5… 1305. Of course. How could I have missed it?)

Now computers don’t count in base 10. They work with base 2. As I said, Daniel Bernstein chose 2130 - 5 in part because 5 is such a small number, which allows to cheat modular reduction.

Another reason is, 2130 - 5 is a little above 2128. This allows a very nice encoding of the message into m1, m2 etc: just chop 16 bytes off the message, append a 1 byte to that, then treat those 17 bytes as a little endian number. 16 bytes is a very nice power-of-2 size, but you need a large enough prime to deal with it.

The reason for appending a 1 at the end of each block is to detect trailing zeros. Without that, those 2 messages:

68e5904d20dd271607a3658d0e49a77302000000
68e5904d20dd271607a3658d0e49a77302

would have the same MAC, making some forgeries very easy.

Yet another reason for choosing 2130 - 5 is, 130 is a multiple of 26. This allows an implementation to encode its number in five 26-bit “digits”, or limbs. From there, we perform our schoolbook multiplication with enormous “digits” that range from 0 to 2²⁶-1. With five limbs, we span over exactly 130 bits.

Now since our prime is a power of 2²⁶ minus 5, we need to use the trick we learned in base 10: when we move the overflowing limbs to the right, we need to multiply them by 5 to compensate:

     a4      a3      a2      a1      a0
×    b4      b3      b2      b1      b0
---------------------------------------
  a4×b0   a3×b0   a2×b0   a1×b0   a0×b0
+ a3×b1   a2×b1   a1×b1   a0×b1 5×a4×b1
+ a2×b2   a1×b2   a0×b2 5×a4×b2 5×a3×b2
+ a1×b3   a0×b3 5×a4×b3 5×a3×b3 5×a2×b3
+ a0×b4 5×a4×b4 5×a3×b4 5×a2×b4 5×a1×b4

Translating that in C is fairly mechanical:

void mult(uint64_t p[5], const uint32_t a[5], const uint32_t b[5])
{
    uint64_t a0 = a[0];  uint64_t b0 = b[0];
    uint64_t a1 = a[1];  uint64_t b1 = b[1];
    uint64_t a2 = a[2];  uint64_t b2 = b[2];
    uint64_t a3 = a[3];  uint64_t b3 = b[3];
    uint64_t a4 = a[4];  uint64_t b4 = b[4];
    p[0] = a0*b0 + a1*b4*5 + a2*b3*5 + a3*b2*5 + a4*b1*5;
    p[1] = a0*b1 + a1*b0   + a2*b4*5 + a3*b3*5 + a4*b2*5;
    p[2] = a0*b2 + a1*b1   + a2*b0   + a3*b4*5 + a4*b3*5;
    p[3] = a0*b3 + a1*b2   + a2*b1   + a3*b0   + a4*b4*5;
    p[4] = a0*b4 + a1*b3   + a2*b2   + a3*b1   + a4*b0  ;
}

Here we see another advantage of having limbs smaller than 32 bits: all those products and sums fit in 64 bits without overflowing. If the vectors a and b fit in 26 bits, then b×5 fits in 28 bits, and the products all fits in 55 bits. Make 4 sums with that, and the result fits in 57-bits, much less than the 64 available.

Now we must do carry propagation. Again, we must make sure this won’t overflow:

uint64_t propagate_carry(uint64_t p[5], uint64_t carry)
{
    p[0] += carry * 5;  carry = p[0] >> 26;  p[0] -= carry << 26;
    p[1] += carry    ;  carry = p[1] >> 26;  p[1] -= carry << 26;
    p[2] += carry    ;  carry = p[2] >> 26;  p[2] -= carry << 26;
    p[3] += carry    ;  carry = p[3] >> 26;  p[3] -= carry << 26;
    p[4] += carry    ;  carry = p[4] >> 26;  p[4] -= carry << 26;
    return carry;
}

The number we’re reducing here is p + 2130×carry. This is why I multiply carry by 5 the first time around: I have to compensate the implicit division by 2130. When this is finished, every limb in p are in the proper range (26-bits), and we may have a remaining carry dangling around (the returned value). We can then repeat the operation until there is no more carry:

uint64_t carry = 0;
do {
    carry = propagate_carry(p, carry);
} while (carry != 0);

Once the carry is null, the full modular reduction is just a matter of checking whether the remaining number is higher than 2130 - 5, and subtract 2130 - 5 if it is.

Careful how you do it however: this loop is prone to timing attacks. Don’t use it in real crypto code. Perform a constant number of carry propagations instead. (My experiments suggest that if the operands of the multiplication have all their limbs in range, 3 iterations are enough.) Also make sure the final test and subtraction are done in a constant time fashion too. It can be done with bit twiddling.

Now we don’t have to perform a full modular reduction at the end of each multiplication. Our 26-bit limbs don’t really have to fit in 26 bits, as long as we make sure we don’t overflow our 32-bit and 64-bit registers. There are also cleverer ways to perform carry propagation. To see a good example, I suggest you take a look at Floodyberry’s 32-bit poly1305-donna implementation.

Parallelism

Horner’s rule has one limitation: it’s a single threaded algorithm. Each intermediate result depends on the previous (let’s assume multiplication also includes partial modular reduction):

h0 = 0
h1 = (h0 + m1) * r
h2 = (h1 + m2) * r
h3 = (h2 + m3) * r
h4 = (h3 + m4) * r
...

Let’s assume the message is only 4 blocks long, so the final hash is h4. Let’s write h4 in terms of h0:

h4 = (h3 + m4) * r
h4 = ((h2 + m3) * r + m4) * r
h4 = (((h1 + m2) * r + m3) * r + m4) * r
h4 = ((((h0 + m1) * r + m2) * r + m3) * r + m4) * r
h4 = (((m1 * r + m2) * r + m3) * r + m4) * r

Now let’s rewrite the polynomial as a sum of products:

h4 = ((m1 * r + m2) * r + m3) * r² + m4 * r
h4 =  (m1 * r + m2) * r³ + m3 * r² + m4 * r
h4 =   m1 * r⁴ + m2 * r³ + m3 * r² + m4 * r

This form is the most parallel possible. We could compute every product in parallel, then sum everything, map-reduce style. There’s only one problem: we need to compute r squared, cubed, and so on. To avoid this, we can take a hybrid approach, with two applications of Horner’s rule instead of just one:

h4 =  m1 * r⁴ + m3  * r²  +   m2 * r³ + m4 * r
h4 = ((m1*r² + m3) * r²)  +  ((m2*r² + m4) * r)

In exchange for having to compute r squared, we can now hash two lanes in parallel:

h0 = 0
h1 = (h0 + m1) * r²     h2 = (h0 + m2) * r²
h3 = (h1 + m3) * r²     h4 = (h2 + m4) * r
h  = h3 + h4

This generalises to bigger messages:

h0 = 0
h1 = (h0 + m1) * r²     h2 = (h0 + m2) * r²
h3 = (h1 + m3) * r²     h4 = (h2 + m4) * r²
h5 = (h3 + m5) * r²     h6 = (h2 + m6) * r²
h7 = (h5 + m7) * r²     h8 = (h2 + m8) * r
h  = h7 + h8

As well as wider parallelism:

h0 = 0
h1 = (h0 + m1) * r³     h2 = (h0 + m2) * r³     h3 = (h0 + m3) * r³
h4 = (h1 + m4) * r³     h5 = (h2 + m5) * r³     h6 = (h3 + m6) * r³
h7 = (h4 + m7) * r³     h8 = (h5 + m8) * r²     h9 = (h6 + m9) * r
h  = h7 + h8 + h9

This is quite SIMD friendly. A modern CPU with 256-bit vector units can compute 4 lanes in parallel in a single thread. My benchmarks with Libsodium suggest this is twice as fast as the regular Horner’s rule.

Restricting r

Daniel Bernstein could have stopped there. But he went further, and restricted the value of the secret r. Instead of using all 128 bits of the random secret, he cleared 22 bits, sacrificing a bit of security “to simplify and accelerate implementations […] in various contexts”.

The nature of this simplification becomes apparent when you encode r in 4 32-bit words. There, the 22 bits are cleared thus:

r[0] &= 0x0fffffff; // clear top 4 bits
r[1] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
r[2] &= 0x0ffffffc; // clear top 4 & bottom 2 bits
r[3] &= 0x0ffffffc; // clear top 4 & bottom 2 bits

This all makes sense when you think about the unthinkable: having the limbs span 32 bits.

32 bit limbs are generally a bad idea. They make carry propagation impossible to delay, forcing the code to be more complex and slower. Imagine your machine has 64-bit registers: when you multiply 32-bit limbs, the result spans the whole register. You can’t sum it with another product without risking overflow.

Poly1305 in its straightforward Horner’s rule implementation however only makes one kind of product, and that product uses r:

h = 0;
for each message block mi {
    h += mi;
    h *= r;
}
mac = (h + k) % (2^130 - 5);
mac = mac % 2^128;  // ignore 2 bits for the final output

Since r limbs span 28 bits instead of the full 32, the limb products with h will span 60 bits at most (well, 61 because of the modulo trick). We can now sum those products without overflowing the 64-bit register. (Modern processors have 128-bit registers, so they can use 64 bits limbs instead. The basic principles are the same.)

That’s not the only problem however. 32 is not a multiple of 130. The closest multiple is 128. Performing the modulo trick requires us to shift 4 limbs to the right, then 2 bits to the right. Doing so loses the 2 least significant bits.

But we don’t care, because those bits are already cleared! Well, 2 of them aren’t, but we can easily recover them. In the end, we save a couple multiplications and simplify (de)serialisation code a good deal. Carry propagation requires a bit of care (the one I showed above would overflow), but the code is still pretty simple. Here is a multiplication, with a subsequent reduction below 2130 (not a full modular reduction, that one can be deferred to the end):

// Multiplies h and r, put the result in h
static void poly_mul(uint32_t h[5], const uint32_t r[4])
{
    // These would fit in 32 bits, but we need 64 bit multiplications
    const uint64_t r0 = r[0];
    const uint64_t r1 = r[1];
    const uint64_t r2 = r[2];
    const uint64_t r3 = r[3];
    const uint64_t rr0 = (r[0] >> 2) * 5; // lose 2 bottom bits...
    const uint64_t rr1 = (r[1] >> 2) * 5; // 2 bottom bits already cleared
    const uint64_t rr2 = (r[2] >> 2) * 5; // 2 bottom bits already cleared
    const uint64_t rr3 = (r[3] >> 2) * 5; // 2 bottom bits already cleared

    // school book modular multiplication (without carry propagation)
    const uint64_t x0 = h[0]*r0 + h[1]*rr3 + h[2]*rr2 + h[3]*rr1 + h[4]*rr0;
    const uint64_t x1 = h[0]*r1 + h[1]*r0  + h[2]*rr3 + h[3]*rr2 + h[4]*rr1;
    const uint64_t x2 = h[0]*r2 + h[1]*r1  + h[2]*r0  + h[3]*rr3 + h[4]*rr2;
    const uint64_t x3 = h[0]*r3 + h[1]*r2  + h[2]*r1  + h[3]*r0  + h[4]*rr3;
    const uint64_t x4 = h[4] * (r0 & 3); // ...recover those 2 bits

    // carry propagation (put the result back in h)
    const uint64_t msb = x4 + (x3 >> 32);
    uint64_t       u   = (msb >> 2) * 5; // lose 2 bottom bits...
    u += (x0 & 0xffffffff)             ;  h[0] = u & 0xffffffff;  u >>= 32;
    u += (x1 & 0xffffffff) + (x0 >> 32);  h[1] = u & 0xffffffff;  u >>= 32;
    u += (x2 & 0xffffffff) + (x1 >> 32);  h[2] = u & 0xffffffff;  u >>= 32;
    u += (x3 & 0xffffffff) + (x2 >> 32);  h[3] = u & 0xffffffff;  u >>= 32;
    u += msb & 3 /* ...recover them */ ;  h[4] = u;
}

I wonder why Daniel Bernstein’s own reference implementation doesn’t use this trick. I mean, he designed the damn authenticator, and this trick is barely more verbose than TweetNaCl, especially without the unrolled loops. My current guess is that he wanted the code to work even with old C89 compilers, which don’t necessarily provide 64-bit variables.

Source code

Just look at Monocypher.