Fast Multiplication with Slow Additions
Public key cryptography routinely involves scalar multiplication over a group with pretty slow additions. RSA for instance needs modular exponentiation with big numbers (in this case the "addition" is actually a modular multiplication).
Here we will focus on elliptic curves (which is what my own crypto library uses), but most of this will apply to any group. Elliptic curves have this notion of "point addition", where adding two points gives a third point. We will talk about "additions" and "doublings":
P + Q   (addition)
P + P   (doubling)
(While doublings are ostensibly just a kind of addition, they're often implemented differently, and are generally faster.)
The problem we want to solve is that of scalar multiplication:
s × P = P + P + ... + P (s times)
where s is some huge number, typically over several hundred bits. The brute force approach of adding P to itself 2^256 times will obviously not work. We need a faster addition chain.
A bit of notation
For the remainder of this tutorial, we will represent the scalar s as an array of digits, like this:
s = s[0] + s[1]×2 + s[2]×4 + ... + s[n-1]×2^(n-1)
Note that this is a little endian representation. For instance, the number thirteen (13) can be represented this way:
thirteen = 1101 = 1 + 0×2 + 1×4 + 1×8
thirteen[0] = 1
thirteen[1] = 0
thirteen[2] = 1
thirteen[3] = 1
This is a straightforward binary representation. But it is not the only one. If we allow the binary digits to take other values than 0 or 1, we can represent s in other ways. For instance:
thirteen = 0301 = 1 + 0×2 + 3×4 + 0×8
thirteen = 1005 = 5 + 0×2 + 0×4 + 1×8
(Note that even with digits bigger than 1, as in 0301 and 1005, the digits still sit in binary positions: each one multiplies a power of two.)
Such exotic number representations are at the root of most optimisations we will see here.
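To make the notation concrete, here is a quick Python sketch (the helper name evaluate is mine, not part of the article): it evaluates a little endian digit array, and confirms that all three representations above denote thirteen.

```python
# Evaluate a little endian digit array: digits[i] is the coefficient of 2^i.
def evaluate(digits):
    return sum(d << i for i, d in enumerate(digits))

# The arrays are little endian, so they read right to left compared to
# the big-endian strings 1101, 0301 and 1005:
print(evaluate([1, 0, 1, 1]))  # 13  (plain binary 1101)
print(evaluate([1, 0, 3, 0]))  # 13  (0301)
print(evaluate([5, 0, 0, 1]))  # 13  (1005)
```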
Naive approach: double and add
When s is in straightforward binary form over n digits, the scalar multiplication is pretty simple:
D <- P
R <- zero
for i : 0, n-1
    if s[i] = 1
        R <- R + D
    D <- D + D
return R
We just compute s[i] × 2^i × P for each i, and add it to the result. We start from the least significant bit, and double our way up to the most significant bit. Alternatively, we could start from the most significant bit, and double the result instead. This is more flexible.
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] = 1
        R <- R + P
return R
On average, the naive approach will require n doublings and n/2 additions. Not bad compared to the impossible 2^n additions, but not quite optimal either.
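Both variants can be sketched in Python. To keep the sketch self contained, I use plain integers under addition as a stand-in for the group, so s × P should come out as s·P (the function names are mine):

```python
def double_and_add_lsb(s, P, n):
    # least significant bit first: D walks through P, 2P, 4P, ...
    D = P
    R = 0                          # the group's neutral element
    for i in range(n):
        if (s >> i) & 1:
            R = R + D              # addition
        D = D + D                  # doubling
    return R

def double_and_add_msb(s, P, n):
    # most significant bit first: double the result instead
    R = 0
    for i in reversed(range(n)):
        R = R + R                  # doubling
        if (s >> i) & 1:
            R = R + P              # addition
    return R

print(double_and_add_lsb(13, 7, 8), double_and_add_msb(13, 7, 8))  # 91 91
```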
A word of warning: the above does not run in constant time, and will leak information about the scalar. If the scalar must remain secret, we need to tweak this a bit:
R <- zero
for i : n-1, 0
    R <- R + R
    T <- R + P
    ccopy(R, T, s[i])
return R
ccopy() is an operation that copies the second point into the
first if the third argument is one. By using arithmetic tricks, this
operation can be done in constant time. The cost however is that we now
perform n additions instead of just n/2. (We still have n doublings).
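Here is how the ccopy() trick looks in the same integer stand-in (a real implementation would apply the mask word by word to the point's coordinates; cond is assumed to be exactly 0 or 1):

```python
def ccopy(r, t, cond):
    # returns t when cond is 1, r when cond is 0, without branching
    mask = -cond                   # all ones when cond = 1, zero otherwise
    return r ^ ((r ^ t) & mask)

def ct_double_and_add(s, P, n):
    R = 0
    for i in reversed(range(n)):
        R = R + R
        T = R + P                  # always compute the addition
        R = ccopy(R, T, (s >> i) & 1)
    return R

print(ct_double_and_add(13, 7, 8))  # 91
```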
Fewer additions with windows
The idea is to batch the additions a bit. Instead of processing digits one by one, we could process them, say… 4 by 4. So instead of adding 0 or P, we could add anywhere between 0 and 15×P. To do this, we need to change the representation of the scalar so that only a fourth of the digits are non-zero:
s = 1001 1010 0011 1011 1100 1010 0011 0001
s = 0009 000a 0003 000b 000c 000a 0003 0001
(The digits are hexadecimal, from 0 to f)
Transforming s into this fixed windows representation is pretty straightforward:
for i : 0, (n/4)-1
    s[i×4]   <- s[i×4] + 2 × s[i×4+1] + 4 × s[i×4+2] + 8 × s[i×4+3]
    s[i×4+1] <- 0
    s[i×4+2] <- 0
    s[i×4+3] <- 0
To handle this representation, we must build a table to hold the multiples of P first (this costs 15 additions), then modify the main loop a bit:
-- Build the look up table
Px[0] <- zero
for i : 1, 15
    Px[i] <- Px[i-1] + P

-- Main loop
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] ≠ 0
        R <- R + Px[s[i]]
return R
This algorithm is nearly identical to the naive double and add, except it can handle any digit between 0 and 15. The speed up here comes from the representation, which guarantees that 3 digits out of 4 are zeroes. So instead of n/2 additions, we perform a little less than n/4 + 15.
If we were using a 5-bit table instead, we would have n/5 + 31 additions. 6-bit table, n/6 + 63. And so on. Note that eventually, the table costs more than it saves, so there's a limit to how few additions you can perform. In practice, for a 256-bit scalar, the fastest tables are 4 or 5 bits wide.
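A 4-bit fixed window version, again over the integer stand-in group: build the table with 15 additions, then pay one table addition per 4 doublings (the function is my own sketch; n is assumed to be a multiple of the window width):

```python
def fixed_window_mult(s, P, n, w=4):
    # look up table: table[d] = d × P for 0 <= d < 2^w
    table = [0]
    for _ in range((1 << w) - 1):
        table.append(table[-1] + P)
    R = 0
    for i in reversed(range(0, n, w)):   # one window per w bits
        for _ in range(w):
            R = R + R                    # w doublings per window
        R = R + table[(s >> i) & ((1 << w) - 1)]
    return R

s = 0b10011010001110111100101000110001  # the 32-bit scalar from the text
print(fixed_window_mult(s, 3, 32) == s * 3)  # True
```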
If we need to do this in constant time, we can again tweak the algorithm a little:
R <- zero
for i : n-1, 0
    R <- R + R
    if i mod 4 = 0   -- always add 1 time in 4
        R <- R + clookup(Px, s[i])
return R
clookup() is a constant time look up which scans the table
linearly (to guarantee constant memory access patterns), and uses bit
twiddling tricks to produce the right value (to avoid branching). Such
a lookup gets slower as the table gets bigger, which again limits how
big a table can be.
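clookup() can be sketched like this. Python cannot give real timing guarantees, so this only shows the structure: scan every entry, select with a mask instead of indexing directly.

```python
def clookup(table, index):
    r = 0
    for i, entry in enumerate(table):
        # in C the mask would come from branchless comparison tricks
        mask = -(i == index)      # -1 when selected, 0 otherwise
        r |= entry & mask
    return r

print(clookup([5, 10, 15, 20], 2))  # 15
```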
Sliding windows: smaller tables, larger windows
(Not constant time. Don't use if the scalar must remain secret.)
The size of the table is a pretty big problem. If only we could find a way to get smaller tables while keeping the windows just as wide…
Turns out that we can. The core idea is pretty simple: instead of placing windows at fixed intervals, we place them such that the least significant bit of each window is a 1. The scalar s above, for instance, can be chopped up like this instead of using fixed width windows:
s = 1 001101 000111 01111 00101 00011 0001
s = 1 00000d 000007 0000f 00005 00003 0001
This has two advantages: first, we skip over zeroes, so we have fewer windows. Second, the windows always end with a 1, so the look up table only needs the odd elements, and can therefore be smaller.
The sliding windows transformation is a bit more complex than the fixed windows one.
for i : 0, n-1
    if s[i] = 1
        s[i]   <- s[i] + 2 × s[i+1] + 4 × s[i+2] + 8 × s[i+3]
        s[i+1] <- 0
        s[i+2] <- 0
        s[i+3] <- 0
(Real implementations should mind the edge cases at the end, or make sure s has enough zeroes at the most significant end.)
The sliding windows representation uses the same digits as the fixed windows representation, so the same algorithm can handle it. The construction of the window however only costs half as many additions, since it can skip the even slots:
-- Build the look up table (odd elements only)
P2    <- P + P
Px[1] <- P
for i : 1, 7
    Px[2×i+1] <- Px[2×i-1] + P2

-- Main loop
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] ≠ 0
        R <- R + Px[s[i]]
return R
(Real implementations would use a smaller table instead of leaving half elements uninitialised.)
The average cost of a 4-bit sliding windows is n/5 + 8 additions, instead of n/4 + 15. Sliding windows are effectively smaller and wider than fixed windows. On the other hand, not being constant time makes sliding windows less widely applicable.
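The whole sliding window pipeline, sketched over the integer stand-in group (recode, build the odd-multiples table, multiply); all names are mine:

```python
def sliding_window_digits(s, n, w=4):
    bits = [(s >> i) & 1 for i in range(n)] + [0] * w  # zero padding
    digits = [0] * (n + w)
    i = 0
    while i < n:
        if bits[i] == 0:
            i += 1                       # slide over zeroes
        else:
            # window whose least significant bit is a 1: an odd digit
            digits[i] = sum(bits[i + j] << j for j in range(w))
            i += w
    return digits

def sliding_window_mult(s, P, n, w=4):
    # odd multiples only: 1×P, 3×P, ..., (2^w - 1)×P
    table = {1: P}
    P2 = P + P
    for d in range(3, 1 << w, 2):
        table[d] = table[d - 2] + P2
    R = 0
    for d in reversed(sliding_window_digits(s, n, w)):
        R = R + R
        if d != 0:
            R = R + table[d]
    return R

s = 0b10011010001110111100101000110001
print(sliding_window_mult(s, 3, 32) == s * 3)  # True
```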
This should be as far as windows can go. But some groups also have cheap subtraction. Twisted Edwards curves used in EdDSA are such a group. When subtraction doesn't cost more than addition, we can use signed sliding windows. But first, we need to make a little detour.
Non adjacent form: negative digits, fewer digits
(Not constant time. Don't use if the scalar must remain secret.)
Non adjacent form is a representation where the digits can be either 0, 1, or -1. We use -1 to minimise non-zero digits, so we can have fewer additions (and subtractions). Take this number for instance:
00111011
In non adjacent form, it would look like this instead:
01000-10-1
The way it works is pretty simple. We work from right to left, and look at the last 2 digits. If they are both ones, we replace the rightmost 1 by a -1 (effectively subtracting 2), and propagate a carry to compensate (effectively adding 2). For the number above, this works like this:
00111011
0011110-1
01000-10-1
Here is the complete algorithm. Note that we could overflow s that way, so make sure you have enough room for one extra digit.
for i : 0, n-1
    if s[i] = 1 and s[i+1] = 1
        s[i] <- -1
        j <- i + 1
        while s[j] = 1
            s[j] <- 0
            j <- j + 1
        s[j] <- 1
To handle this, we need to tweak the double and add algorithm a little:
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] =  1 then R <- R + P
    if s[i] = -1 then R <- R - P
return R
On average, this costs the usual n doublings, but only n/3 additions, without any look up table. Forget about running in constant time, though.
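Both steps in Python (the integer stand-in group again, with cheap subtraction; function names are mine):

```python
def naf(s, n):
    digits = [(s >> i) & 1 for i in range(n)] + [0, 0]  # room for the carry
    for i in range(n):
        if digits[i] == 1 and digits[i + 1] == 1:
            digits[i] = -1
            j = i + 1
            while digits[j] == 1:        # propagate the carry
                digits[j] = 0
                j += 1
            digits[j] = 1
    return digits

def naf_mult(s, P, n):
    R = 0
    for d in reversed(naf(s, n)):
        R = R + R
        if d ==  1: R = R + P
        if d == -1: R = R - P            # subtraction assumed to be cheap
    return R

print(naf(0b00111011, 8))       # [-1, 0, -1, 0, 0, 0, 1, 0, 0, 0]
print(naf_mult(0b00111011, 7, 8))  # 413
```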
Signed sliding windows: non adjacent sliding windows
(Still not constant time.)
Signed sliding windows are a generalisation of non adjacent form. When we find a 1, we don't just look at two bits, but at a whole window (four bits for a 4-bit window). If the most significant bit of the window is a 1, we put a negative digit instead, and propagate the carry just like non adjacent form would. For instance, our scalar would be transformed thus:
s = 100 110 1 000111 0111 1 00101 00011 0001 (binary)
s = 100 110 1 000111 0111 1 00101 00003 0001
s = 100 110 1 000111 0111 1 00005 00003 0001
s = 100 110 1 000111 1000-1 00005 00003 0001
s = 100 110 1 000003 1000-1 00005 00003 0001
s = 101 000-3 000003 1000-1 00005 00003 0001
s = 005 000-3 000003 1000-1 00005 00003 0001 (signed windows)
The carry propagation works the same as non adjacent form, we just look at a whole window instead of just 2 bits:
for i : 0, n-1
    if s[i] ≠ 0
        -- 4-bit unsigned window
        for j : 1, 3
            s[i]   <- s[i] + s[i+j] × 2^j
            s[i+j] <- 0
        -- convert to signed window, propagate carry
        if s[i] > 8
            s[i] <- s[i] - 16
            j <- i + 4
            while s[j] ≠ 0
                s[j] <- 0
                j <- j + 1
            s[j] <- 1
The double and add algorithm is almost the same as the one used for non adjacent form:
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] > 0 then R <- R + Px[ s[i]]
    if s[i] < 0 then R <- R - Px[-s[i]]
return R
The average cost is n doublings and n/5 + 4 additions (assuming a 4-bit signed sliding window). Signed sliding windows are effectively half as big (or one bit wider) than their unsigned counterparts. In practice though, this doesn't make a huge difference. While they do require fewer additions, having to handle subtraction as well as addition offsets this gain somewhat.
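A sketch of the whole signed sliding window pipeline (4-bit windows, digits odd and in [-7, 7], integer stand-in group; all names are mine, and my window boundaries may differ from the worked trace above, which is fine since many recodings denote the same scalar):

```python
def signed_window_digits(s, n, w=4):
    digits = [(s >> i) & 1 for i in range(n)] + [0] * (2 * w)
    i = 0
    while i < n:
        if digits[i] == 0:
            i += 1
            continue
        for j in range(1, w):                # gather an unsigned window
            digits[i] += digits[i + j] << j
            digits[i + j] = 0
        if digits[i] > (1 << (w - 1)):       # e.g. > 8 for 4-bit windows
            digits[i] -= 1 << w              # make the digit negative...
            j = i + w
            while digits[j] != 0:            # ...and propagate the carry
                digits[j] = 0
                j += 1
            digits[j] = 1
        i += w
    return digits

def signed_window_mult(s, P, n, w=4):
    # half-size table: odd multiples 1×P, 3×P, ..., (2^(w-1) - 1)×P
    table = {1: P}
    P2 = P + P
    for d in range(3, 1 << (w - 1), 2):
        table[d] = table[d - 2] + P2
    R = 0
    for d in reversed(signed_window_digits(s, n, w)):
        R = R + R
        if d > 0: R = R + table[d]
        if d < 0: R = R - table[-d]          # cheap subtraction assumed
    return R

s = 0b10011010001110111100101000110001
print(signed_window_mult(s, 3, 32) == s * 3)  # True
```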
Now we've reduced the number of additions as much as we could. While this does speed things up a bit, we only went from 3n/2 operations to something like 7n/6. That's better, but we won't go below n without reducing the number of doublings.
And that requires a little bit of cheating.
Cheat 1: double scalar multiplication
EdDSA verification doesn't exactly need a scalar multiplication. It needs the sum of two scalar multiplications. That is, something like:
s1×P1 + s2×P2
Computing that with the naive approach requires 3n operations (2n doublings and n additions). We can use sliding windows, but this still requires 2n + n/3 + 16 operations.
The trick here is to fuse the two scalar multiplications. First, let's write the two of them separately:
s1×P1 = s1[0]×P1 + s1[1]×2×P1 + s1[2]×4×P1 + ... + s1[n-1]×2^(n-1)×P1
s2×P2 = s2[0]×P2 + s2[1]×2×P2 + s2[2]×4×P2 + ... + s2[n-1]×2^(n-1)×P2
Then the sum:
s1×P1 = s1[0]×P1 + s1[1]×2×P1 + s1[2]×4×P1 + ... + s1[n-1]×2^(n-1)×P1
            +           +            +         +           +
s2×P2 = s2[0]×P2 + s2[1]×2×P2 + s2[2]×4×P2 + ... + s2[n-1]×2^(n-1)×P2
The fused double and add algorithm is then very simple:
R <- zero
for i : n-1, 0
    R <- R + R
    if s1[i] = 1 then R <- R + P1
    if s2[i] = 1 then R <- R + P2
return R
This only requires n doublings and n additions on average, for a total of 2n operations. Such a simple trick, and already we're exceeding the performance of sliding windows. We can also make it run in constant time, but this will cost 3n operations instead (n doublings, 2n additions).
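The fused loop, in the same integer stand-in as before:

```python
def double_scalar_mult(s1, P1, s2, P2, n):
    R = 0
    for i in reversed(range(n)):
        R = R + R                        # one shared doubling
        if (s1 >> i) & 1: R = R + P1
        if (s2 >> i) & 1: R = R + P2
    return R

# 13×7 + 10×5 = 91 + 50
print(double_scalar_mult(13, 7, 10, 5, 8))  # 141
```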
Of course, we can use this with windows. For instance, we can transform s1 and s2 in signed sliding windows form, compute the usual look up table, and finally:
R <- zero
for i : n-1, 0
    R <- R + R
    if s1[i] > 0 then R <- R + P1x[ s1[i]]
    if s1[i] < 0 then R <- R - P1x[-s1[i]]
    if s2[i] > 0 then R <- R + P2x[ s2[i]]
    if s2[i] < 0 then R <- R - P2x[-s2[i]]
return R
With our 5-bit signed windows, this only costs 4n/3 + 16 operations, well below 2n. Not constant time at all, but still has worthy applications, such as EdDSA verification.
Cheat 2: knowing the point in advance
EdDSA signatures only multiply by the base point. We know that point in advance, so we can precompute some stuff. We could of course precompute the window, thus saving the additional additions it requires. But this wouldn't reduce the number of doublings.
Combs however, do.
The idea behind a window is to add 4 bits at a time (for a 4-bit window). To avoid adding the same bit twice, we shift by 4 bits by doubling 4 times.
The idea behind a comb is to again add 4 bits at a time (for a 4-bit comb), except this time the bits are spread apart. That way we only have to shift the comb once before adding again. A 4-bit comb not only divides the number of additions by 4, it divides the number of doublings by 4.
Here's an example with our trusty 32-bit scalar:
s = 10011010001110111100101000110001
A 4-bit comb would divide it like this:
1 0 1 0 -> comb 7
0 0 1 0 -> comb 6
0 1 0 1 -> comb 5
1 1 0 1 -> comb 4
1 1 1 0 -> comb 3
0 0 0 0 -> comb 2
1 1 1 0 -> comb 1
0 1 0 1 -> comb 0
First, we add those bits:
comb 7
┌───────┬───────┬───────┐
│       │       │       │
10011010001110111100101000110001
Then we double once, and add those:
comb 6
 ┌───────┬───────┬───────┐
 │       │       │       │
10011010001110111100101000110001
Double once again, add those:
comb 5
  ┌───────┬───────┬───────┐
  │       │       │       │
10011010001110111100101000110001
The look up table required to support this is a little different from a window look up table. At first glance, a 4-bit window table simply contains all multiples of P, from zero to 15×P. Another way to look at it is that it represents all binary numbers from 0 to 15, where each bit represents P, 2P, 4P and 8P respectively. The other elements are then obtained by summing those base elements.
Instead of looking at the table like this:
0       1       2       3       4       5       6       7
8       9       10      11      12      13      14      15
We look at it like that:
0       1       2       2+1     4       4+1     4+2     4+2+1
8       8+1     8+2     8+2+1   8+4     8+4+1   8+4+2   8+4+2+1
That's the real structure of the look up table. With windows, sure, 4+2 also happens to be 2+2+2, and we take advantage of this when constructing window tables. Comb tables are different. Their individual bits have nothing to do with each other, and the usual algebraic properties do not hold. We must add from individual bits.
To construct a 4-bit comb, we need 4 constants:
P0 = P
P1 = P × 2^(  n/4)
P2 = P × 2^(2×n/4)
P3 = P × 2^(3×n/4)
In the case of our 32-bit number, this means:
P0 = P
P1 = P × 2^8
P2 = P × 2^16
P3 = P × 2^24
Those are obtained by repeatedly doubling the point P. Then we can construct the table which will be hard coded into the main program.
for i : 1, 15
    Px[i] <- zero
    if i   mod 2 = 1 then Px[i] <- Px[i] + P0
    if i/2 mod 2 = 1 then Px[i] <- Px[i] + P1
    if i/4 mod 2 = 1 then Px[i] <- Px[i] + P2
    if i/8 mod 2 = 1 then Px[i] <- Px[i] + P3
Another step is transforming s in comb form. This is almost the same as the fixed window transformation we saw earlier, only with interleaved bits instead:
for i : 0, (n/4)-1
    comb[i] <- s[i]
             + 2 × s[i + n/4]
             + 4 × s[i + n/4 × 2]
             + 8 × s[i + n/4 × 3]
return comb
The main loop is then very simple:
R <- zero
for i : (n/4)-1, 0
    R <- R + R
    R <- R + Px[comb[i]]
return R
This requires a total of n/2 operations. Note that in most cases, the scalar is secret, so we require constant time lookup:
R <- zero
for i : (n/4)-1, 0
    R <- R + R
    R <- R + clookup(Px, comb[i])
return R
Since such lookups scan the table linearly, this limits how big a table can be before we waste more time on scanning the table than we save performing fewer additions and doublings.
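All three comb steps (teeth points, table, recoding, main loop) can be put together in one Python sketch; names are mine, the group is the usual integer stand-in, and n is assumed to be a multiple of the tooth count t:

```python
def comb_mult(s, P, n, t=4):
    step = n // t                        # teeth are n/t bits apart
    # teeth points: P, P×2^step, P×2^(2×step), ... by repeated doubling
    teeth = []
    Pk = P
    for _ in range(t):
        teeth.append(Pk)
        for _ in range(step):
            Pk = Pk + Pk
    # look up table: Px[i] sums the teeth selected by the bits of i
    Px = [sum(teeth[k] for k in range(t) if (i >> k) & 1)
          for i in range(1 << t)]
    # comb form: comb[i] gathers bits i, i + step, i + 2×step, ...
    comb = [sum(((s >> (i + k * step)) & 1) << k for k in range(t))
            for i in range(step)]
    R = 0
    for i in reversed(range(step)):      # only n/t doublings
        R = R + R
        R = R + Px[comb[i]]
    return R

s = 0b10011010001110111100101000110001
print(comb_mult(s, 3, 32) == s * 3)  # True
```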
Thankfully, there's a way to further reduce the number of doublings without scanning bigger tables.
Instead of having bigger tables, we could have more tables. The core idea is to chop up the scalar differently (here with two combs):
s = 10011010001110111100101000110001
    1   1   0   1   1   1   0   0
     0   0   0   0   1   0   0   0
      0   1   1   1   0   1   1   0
       1   0   1   1   0   0   1   1
Now we could use an 8-bit comb for this:
┌───┬───┬───┬───┬───┬───┬───┐
│   │   │   │   │   │   │   │
 ┌───┬───┬───┬───┬───┬───┬───┐
 │   │   │   │   │   │   │   │
  ┌───┬───┬───┬───┬───┬───┬───┐
  │   │   │   │   │   │   │   │
   ┌───┬───┬───┬───┬───┬───┬───┐
   │   │   │   │   │   │   │   │
10011010001110111100101000110001
But 256 elements in the look up table is way too big in most settings. Instead, we could use two 4-bit combs:
┌───┬───┬───┐   ┌───┬───┬───┐
│   │   │   │   │   │   │   │
 ┌───┬───┬───┐   ┌───┬───┬───┐
 │   │   │   │   │   │   │   │
  ┌───┬───┬───┐   ┌───┬───┬───┐
  │   │   │   │   │   │   │   │
   ┌───┬───┬───┐   ┌───┬───┬───┐
   │   │   │   │   │   │   │   │
10011010001110111100101000110001
With this, we need to do everything twice: we need two look up tables, constructed from two sets of teeth:
        P0x         |         P1x
--------------------+---------------------
P00 = P             | P10 = P × 2^(4×n/8)
P01 = P × 2^(  n/8) | P11 = P × 2^(5×n/8)
P02 = P × 2^(2×n/8) | P12 = P × 2^(6×n/8)
P03 = P × 2^(3×n/8) | P13 = P × 2^(7×n/8)
Then we need two transformations to comb form:
for i : 0, (n/8)-1
    comb0[i] <- s[i]       + 2 × s[i + n/8]       + 4 × s[i + n/8 × 2]       + 8 × s[i + n/8 × 3]
    comb1[i] <- s[i + n/2] + 2 × s[i + n/8 + n/2] + 4 × s[i + n/8 × 2 + n/2] + 8 × s[i + n/8 × 3 + n/2]
return comb0, comb1
Finally, we perform two additions per loop iteration (though we have half as many loop iterations):
R <- zero
for i : (n/8)-1, 0
    R <- R + R
    R <- R + clookup(P0x, comb0[i])
    R <- R + clookup(P1x, comb1[i])
return R
There, we just halved the number of doublings. We have two tables instead of just one, but the scans were just as short. More generally, the number of doublings is divided by the number of combs.
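Two 4-bit combs, sketched the same way as the single comb (each comb covers one half of the scalar; teeth within a comb are n/8 bits apart; names mine, integer stand-in group):

```python
def two_comb_mult(s, P, n, t=4):
    step = n // (2 * t)                  # n/8 for two 4-bit combs
    def teeth_table(first_bit):
        # teeth points for one comb, starting at bit position first_bit
        Pk = P
        for _ in range(first_bit):
            Pk = Pk + Pk                 # Pk = P × 2^first_bit
        teeth = []
        for _ in range(t):
            teeth.append(Pk)
            for _ in range(step):
                Pk = Pk + Pk
        return [sum(teeth[k] for k in range(t) if (i >> k) & 1)
                for i in range(1 << t)]
    P0x = teeth_table(0)                 # lower half of the scalar
    P1x = teeth_table(n // 2)            # upper half of the scalar
    R = 0
    for i in reversed(range(step)):      # only n/8 doublings
        R = R + R
        d0 = sum(((s >> (i + k * step)) & 1) << k for k in range(t))
        d1 = sum(((s >> (i + k * step + n // 2)) & 1) << k for k in range(t))
        R = R + P0x[d0] + P1x[d1]
    return R

s = 0b10011010001110111100101000110001
print(two_comb_mult(s, 3, 32) == s * 3)  # True
```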
Of course, more combs will put more stress on the CPU data cache. One should also consider using a wider comb instead, which reduces the number of both doublings and additions, though at the cost of a longer constant time scan. Still, multiple combs are worth considering whenever performance matters more than compactness.
Besides, there are ways to halve the size of the look up tables.
Sliding combs are not very interesting because most cryptographic applications of fixed point scalar multiplication involve a secret scalar. Also, the idea is pretty trivial: just apply the sliding ideas to combs. We can also make signed combs to reduce the size of the precomputed tables.
Their speed on the other hand is most impressive: on average, a 4-bit signed comb will have a very small table (8 points), and will cost an average of n/3 operations (n/6 doublings, n/6 additions). And with fully precomputed tables and variable time lookups, we can easily use larger tables at relatively little cost.
Still, we ultimately want a constant time way to halve the size of our look up tables.
Cheat 2.1: odd order groups
The first instance of this cheat I know of comes from Mike Hamburg (2012). The idea is: what if there was a representation of s where every digit is either 1 or -1? If we were to make a 4-bit look up table for such a representation, it would look like this ("+" means 1, "-" means -1):
---- | -8 - 4 - 2 - 1 | -15
---+ | -8 - 4 - 2 + 1 | -13
--+- | -8 - 4 + 2 - 1 | -11
--++ | -8 - 4 + 2 + 1 |  -9
-+-- | -8 + 4 - 2 - 1 |  -7
-+-+ | -8 + 4 - 2 + 1 |  -5
-++- | -8 + 4 + 2 - 1 |  -3
-+++ | -8 + 4 + 2 + 1 |  -1
+--- |  8 - 4 - 2 - 1 |   1
+--+ |  8 - 4 - 2 + 1 |   3
+-+- |  8 - 4 + 2 - 1 |   5
+-++ |  8 - 4 + 2 + 1 |   7
++-- |  8 + 4 - 2 - 1 |   9
++-+ |  8 + 4 - 2 + 1 |  11
+++- |  8 + 4 + 2 - 1 |  13
++++ |  8 + 4 + 2 + 1 |  15
Notice how the upper half of the look up table is the opposite of the
lower half. By flipping all bits of an index, we get the index to the
opposite value. Assuming our group can subtract as efficiently as it
can add, we could omit one half of the table altogether: If we get
+??? as an index, we just add the corresponding value. If we get
-???, we first flip all +/- then subtract the corresponding value.
That way we never need the upper half of the table. For instance, if we get -+-- instead of just adding -7, we can first flip all bits (getting +-++), then subtract 7.
The look up table for this comb can be constructed from the unsigned look up table (the part after the -- is just comments):
Sx[0] <- Px[ 8] - Px[ 7] -- 1000 - 0111 = +---
Sx[1] <- Px[ 9] - Px[ 6] -- 1001 - 0110 = +--+
Sx[2] <- Px[10] - Px[ 5] -- 1010 - 0101 = +-+-
Sx[3] <- Px[11] - Px[ 4] -- 1011 - 0100 = +-++
Sx[4] <- Px[12] - Px[ 3] -- 1100 - 0011 = ++--
Sx[5] <- Px[13] - Px[ 2] -- 1101 - 0010 = ++-+
Sx[6] <- Px[14] - Px[ 1] -- 1110 - 0001 = +++-
Sx[7] <- Px[15] - Px[ 0] -- 1111 - 0000 = ++++
Another way to look at it:
for i : 0, 7
    Sx[i] <- Px[i+8] - Px[flip_bits(i+8)]
Now the problem is obtaining that all bits set form to begin with. This is where the odd order helps. Most groups we deal with in cryptography are actually modular, with some order L (where L is some huge number). That is, for any scalar s and point P, the scalar products s×P and (L+s)×P produce the same result. So the all bits set form doesn't have to be equal to s, only congruent to it (modulo L). When L is odd, such a form always exists.
Let's reason backwards, and pretend we already have our all bits set form abs. For ease of manipulation, we'd like to replace its digits +1 and -1 by the more regular 1 and 0 (let's call the result hbs, for "half bits set"). That way handling the look up table is easier (recall that we only need the lower half):
+--- ->  1 | 1000 = 8 + 000 = 8 + 0
+--+ ->  3 | 1001 = 8 + 001 = 8 + 1
+-+- ->  5 | 1010 = 8 + 010 = 8 + 2
+-++ ->  7 | 1011 = 8 + 011 = 8 + 3
++-- ->  9 | 1100 = 8 + 100 = 8 + 4
++-+ -> 11 | 1101 = 8 + 101 = 8 + 5
+++- -> 13 | 1110 = 8 + 110 = 8 + 6
++++ -> 15 | 1111 = 8 + 111 = 8 + 7
hbs <- 0
for i : 0, n-1
    if abs[i] =  1 then hbs[i] <- 1
    if abs[i] = -1 then hbs[i] <- 0
return hbs
The same can be achieved with a simple bit of linear arithmetic:
hbs <- 0
for i : 0, n-1
    hbs[i] <- (abs[i] + 1) / 2
return hbs
Dividing each digit by 2 is the same as dividing the whole number by 2. Adding one to each digit is the same as adding 2^n - 1. So, hbs is the same as (abs + 2^n - 1) / 2. But then abs is the same as the scalar s, by definition. Which means hbs is the same as (s + 2^n - 1) / 2.
When the order L of the group is odd, the division by 2 is always possible (just multiply by the inverse of 2 modulo L). With those groups, the hbs form always exists, and so does the abs form. Odd order groups are not uncommon. Curve25519 for instance (used for Ed25519 signatures) has a prime order, and all interesting primes are odd.
The whole operation is done in 3 steps. First, transform the scalar into an all bits set form (or rather, the half bits set form).
-- This is all done modulo L
hbs <- (s + 2^n - 1) / 2
return hbs
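The arithmetic of step one can be checked numerically. The sketch below uses a deliberately tiny, hypothetical odd order L = 1019 (n = 10 bits), turns the hbs bits back into abs digits (2×bit - 1, i.e. ±1), and verifies that the abs form is congruent to s modulo L:

```python
L = 1019                          # hypothetical small odd order
n = 10                            # 2^n > L, so hbs fits in n bits
s = 600
inv2 = pow(2, -1, L)              # 1/2 modulo L (needs Python 3.8+)
hbs = ((s + (1 << n) - 1) * inv2) % L
abs_digits = [2 * ((hbs >> i) & 1) - 1 for i in range(n)]  # all +1 or -1
value = sum(d << i for i, d in enumerate(abs_digits))
print(hbs, value, value % L == s)  # 302 -419 True
```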
Second, turn the hbs form into comb form.
for i : 0, (n/4)-1
    comb[i] <- hbs[i]
             + 2 × hbs[i + n/4]
             + 4 × hbs[i + n/4 × 2]
             + 8 × hbs[i + n/4 × 3]
return comb
Third, use the comb form to actually perform the scalar multiplication (note we're using the signed look up table Sx instead of the usual Px):
R <- zero
for i : (n/4)-1, 0
    R <- R + R
    if comb[i] >= 8
    then R <- R + Sx[comb[i] - 8]
    else R <- R - Sx[flip_bits(comb[i]) - 8]
return R
The constant time version of this still has to scan the table linearly, and we need a constant time negation function, but it only scans 8 elements instead of 16:
R <- zero
for i : (n/4)-1, 0
    R <- R + R
    teeth <- comb[i]
    high  <- teeth / 8                    -- rounded down
    low   <- 1 - high                     -- not high
    index <- (teeth XOR (high - 1)) AND 7 -- bitwise XOR and AND
    P     <- clookup(Sx, index)           -- 0 <= index < 8
    P     <- cneg(P, low)                 -- conditional negation
    R     <- R + P
return R
Both clookup() and cneg() are supposed to work in constant time.