# Fast Multiplication with Slow Additions

Public key cryptography routinely involves scalar multiplication over a group with pretty slow additions. RSA for instance needs modular exponentiation with big numbers (in this case the "addition" is actually a modular multiplication).

Here we will focus on elliptic curves (which is what my own crypto library uses), but most of this will apply to any group. Elliptic curves have this notion of "point addition", where adding two points gives a third point. We will talk about "additions" and "doublings":

```
P + Q (addition)
P + P (doubling)
```

*(While doublings are ostensibly just a kind of addition, they're often
implemented differently, and are generally faster.)*

The problem we want to solve is that of scalar multiplication:

```
s × P = P + P + ... + P (s times)
```

where s is some huge number, typically over several hundred bits. The
brute force approach of adding P to itself 2^{256} times will
obviously not work. We need a faster addition chain.

## A bit of notation

For the remainder of this tutorial, we will represent the scalar `s` in an array, like this:

```
s = s[0] + s[1]×2 + s[2]×4 + ... + s[n-1]×2^(n-1)
```

Note that this is a little endian representation. For instance, the number thirteen (13) can be represented this way:

```
thirteen = 1101 = 1 + 0×2 + 1×4 + 1×8
thirteen[0] = 1
thirteen[1] = 0
thirteen[2] = 1
thirteen[3] = 1
```

This is a straightforward binary representation. But it is not the only one. If we allow the binary digits to take other values than 0 or 1, we can represent s in other ways. For instance:

```
thirteen = 0301 = 1 + 0×2 + 3×4 + 0×8
thirteen = 1005 = 5 + 0×2 + 0×4 + 1×8
```

*(In 0301 and 1005, the digits are still in binary positions).*

Such exotic number representations are at the root of most optimisations we will see here.
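These representations are easy to experiment with. Here is a small Python sketch (the helper names are mine) that evaluates a little endian digit array and checks the three representations of thirteen above:

```python
def from_digits(digits):
    """Evaluate s = digits[0] + digits[1]*2 + digits[2]*4 + ..."""
    return sum(d << i for i, d in enumerate(digits))

def to_bits(s, n):
    """Standard little endian binary digits of s, padded to n entries."""
    return [(s >> i) & 1 for i in range(n)]

# The three representations of thirteen, as little endian arrays:
assert to_bits(13, 4) == [1, 0, 1, 1]      # 1101
assert from_digits([1, 0, 3, 0]) == 13     # 0301
assert from_digits([5, 0, 0, 1]) == 13     # 1005
```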

## Naive approach: double and add

Assuming `s` is in straightforward binary form over `n` digits, the scalar multiplication is pretty simple:

```
D <- P
R <- zero
for (i : 0, n-1)
    if (s[i] = 1)
        R <- R + D
    D <- D + D
return R
```

We just add s[i] × 2^{i} × P to the result for each i. We start from
the least significant bit, and double our way up to the most
significant bit. Alternatively, we could start from the most
significant bit, and double the *result* instead. This is more
flexible.

```
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] = 1
        R <- R + P
return R
```

On average, the naive approach will require n doublings and n/2
additions. Not bad compared to the impossible 2^{n} additions,
but not quite optimal either.
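We can check both variants with a toy group. The sketch below (mine, not from any particular library) uses plain integers: "point addition" is ordinary addition and zero is the neutral element, so s × P should come out as s*P:

```python
def double_and_add_lsb(s, P, n):
    """LSB first: double the addend D, add it when the bit is set."""
    D, R = P, 0                    # 0 plays the role of the neutral element
    for i in range(n):
        if (s >> i) & 1:
            R = R + D              # "point addition"
        D = D + D                  # "doubling"
    return R

def double_and_add_msb(s, P, n):
    """MSB first: double the result R instead of the addend."""
    R = 0
    for i in reversed(range(n)):
        R = R + R
        if (s >> i) & 1:
            R = R + P
    return R
```

Replacing the integers with actual curve points (and `0` with the point at infinity) gives the real algorithm.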

A word of warning: the above does not run in constant time, and will leak information about the scalar. If the scalar must remain secret, we need to tweak this a bit:

```
R <- zero
for i : n-1, 0
    R <- R + R
    T <- R + P
    ccopy(R, T, s[i])
return R
```

Where `ccopy()` is an operation that copies the second point into the
first if the third argument is one. By using arithmetic tricks, this
operation can be done in constant time. The cost however is that we now
perform n additions instead of just n/2. (We still have n doublings.)
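The arithmetic trick behind `ccopy()` can be sketched in Python (as a function returning the selected value rather than mutating its argument). Caveat: a real constant time implementation would be written in C or assembly over fixed-size words; an interpreter gives no timing guarantees, so this only illustrates the mask idiom:

```python
def ccopy(r, t, bit):
    """Return t when bit is 1, r when bit is 0, without branching."""
    mask = -bit                    # 0b000...0 or 0b111...1
    return (t & mask) | (r & ~mask)

def ct_double_and_add(s, P, n):
    """Always compute the addition, keep it only when the bit is set."""
    R = 0
    for i in reversed(range(n)):
        R = R + R
        T = R + P
        R = ccopy(R, T, (s >> i) & 1)
    return R
```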

## Fewer additions with windows

The idea is to batch the additions a bit. Instead of processing digits one by one, we could process them, say… 4 by 4. So instead of adding 0 or P, we could add anywhere between 0 and 15×P. To do this, we need to change the representation of the scalar so that only a fourth of the digits are non-zero:

```
s = 1001 1010 0011 1011 1100 1010 0011 0001
s = 0009 000a 0003 000b 000c 000a 0003 0001
```

*(The digits are hexadecimal, from 0 to f)*

Transforming s into this fixed windows representation is pretty straightforward:

```
for i : 0, (n/4)-1
    s[i×4] <- s[i×4  ]
            + 2 × s[i×4+1]
            + 4 × s[i×4+2]
            + 8 × s[i×4+3]
    s[i×4+1] <- 0
    s[i×4+2] <- 0
    s[i×4+3] <- 0
```

To handle this representation, we must build a table to hold the multiples of P first (this costs 15 additions), then modify the main loop a bit:

```
-- Build the look up table
Px[0] <- zero
for i : 1, 15
    Px[i] <- Px[i-1] + P
-- Main loop
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] ≠ 0
        R <- R + Px[s[i]]
return R
```

This algorithm is nearly identical to the naive double and add, except it can handle any digit between 0 and 15. The speed up here comes from the representation, which guarantees that 3 digits out of 4 are zeroes. So instead of n/2 additions, we perform a little less than n/4 + 15.

If we were using a 5-bit table instead, we would have n/5 + 31 additions. 6-bit table, n/6 + 63. And so on. Note that eventually, the table costs more than it saves, so there's a limit to how few additions you can perform. In practice, for a 256-bit scalar, the fastest tables are 4 or 5 bits wide.
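The fixed window scheme fits in a few lines of Python over the same integer toy group (my naming; the loop doubles w times per window, which is equivalent to doubling once per bit and adding at window boundaries):

```python
def fixed_window_mult(s, P, n, w=4):
    """s*P with a 2^w entry table; n must be a multiple of w."""
    Px = [0]                       # Px[k] = k*P, built with 2^w - 1 additions
    for _ in range((1 << w) - 1):
        Px.append(Px[-1] + P)
    R = 0
    for i in reversed(range(0, n, w)):
        for _ in range(w):         # shift the result by one window
            R = R + R
        R = R + Px[(s >> i) & ((1 << w) - 1)]
    return R
```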

If we need to do this in constant time, we can again tweak the algorithm a little:

```
R <- zero
for i : n-1, 0
    R <- R + R
    if i mod 4 = 0 -- always add 1 time in 4
        R <- R + clookup(Px, s[i])
return R
```

Where `clookup()` is a constant time look up which scans the table
linearly (to guarantee constant memory access patterns), and uses bit
twiddling tricks to produce the right value (to avoid branching). Such
a lookup gets slower as the table gets bigger, which again limits how
big a table can be.
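The classic bit twiddling idiom behind such a lookup looks like this (a Python sketch of the C technique; again, Python itself makes no timing promises):

```python
def clookup(table, index):
    """Scan the whole table, keep only the entry whose index matches."""
    r = 0
    for i, v in enumerate(table):
        # (i XOR index) - 1 is negative only when i == index; the
        # arithmetic shift turns that into an all-ones mask.
        mask = ((i ^ index) - 1) >> 63
        r |= v & mask
    return r
```

Every element is read exactly once regardless of `index`, and no branch depends on it.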

## Sliding windows: smaller tables, larger windows

*(Not constant time. Don't use if the scalar must remain secret.)*

The size of the table is a pretty big problem. If we could find ways to get smaller tables but keeping the windows just as wide…

Turns out that we can. The core idea is pretty simple: instead of
placing windows at fixed intervals, we place them such that the least
significant bit of each window is a 1. The `s` scalar above for
instance can be chopped up like this instead of using fixed width
windows:

```
s = 1 001101 000111 01111 00101 00011 0001
s = 1 00000d 000007 0000f 00005 00003 0001
```

This has two advantages: first, we skip over zeroes, so we have fewer windows. Second, the windows always end in 1, so the look up table only needs the odd elements, and can therefore be smaller.

The sliding windows transformation is a bit more complex than the fixed windows one.

```
for i : 0, n-1
    if s[i] = 1
        s[i] <- s[i  ]
              + 2 × s[i+1]
              + 4 × s[i+2]
              + 8 × s[i+3]
        s[i+1] <- 0
        s[i+2] <- 0
        s[i+3] <- 0
```

*(Real implementations should mind the edge cases at the end, or make
sure s has enough zeroes at the most significant end.)*

The sliding windows representation uses the same digits as the fixed windows representation, so the same algorithm can handle it. The construction of the window however only costs half as many additions, since it can skip the even slots:

```
-- Build the look up table (odd elements only)
P2 <- P + P
Px[1] <- P
for i : 1, 7
    Px[2×i+1] <- Px[2×i-1] + P2
-- Main loop
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] ≠ 0
        R <- R + Px[s[i]]
return R
```

*(Real implementations would use a smaller table instead of leaving half
elements uninitialised.)*

The average cost of a 4-bit sliding windows is n/5 + 8 additions,
instead of n/4 + 15. Sliding windows are effectively smaller *and*
wider than fixed windows. On the other hand, not being constant time
makes sliding windows less widely applicable.
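The whole sliding window scheme, transformation included, can be sketched in Python over the integer toy group (my naming; the digit array gets `w` extra zero entries at the top to absorb the edge case mentioned above):

```python
def sliding_window_digits(s, n, w=4):
    """Recode n bits so every non-zero digit is odd and below 2^w."""
    digits = [(s >> i) & 1 for i in range(n + w)]   # zero padding at the top
    i = 0
    while i < n:
        if digits[i]:                # window starts on a 1
            for j in range(1, w):
                digits[i] += digits[i + j] << j
                digits[i + j] = 0
            i += w
        else:                        # skip over zeroes
            i += 1
    return digits

def sliding_window_mult(s, P, n, w=4):
    digits = sliding_window_digits(s, n, w)
    P2 = P + P
    Px = {1: P}                      # odd multiples only: 1, 3, ..., 2^w - 1
    for k in range(3, 1 << w, 2):
        Px[k] = Px[k - 2] + P2
    R = 0
    for i in reversed(range(len(digits))):
        R = R + R
        if digits[i]:
            R = R + Px[digits[i]]
    return R
```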

This should be as far as windows can go. But some groups also have
cheap *subtraction*. Twisted Edwards curves used in EdDSA are such a
group. When subtraction doesn't cost more than addition, we can use
*signed* sliding windows. But first, we need to make a little detour.

## Non adjacent form: negative digits, fewer digits

*(Not constant time. Don't use if the scalar must remain secret.)*

Non adjacent form is a representation where the digits can be either 0, 1, or -1. We use -1 to minimise non-zero digits, so we can have fewer additions (and subtractions). Take this number for instance:

```
00111011
```

In non adjacent form, it would look like this instead:

```
01000-10-1
```

The way it works is pretty simple. We work from right to left, and look at the last 2 digits. If they are both ones, we replace the rightmost 1 by a -1 (effectively subtracting 2), and propagate a carry to compensate (effectively adding 2). For the number above, it works like this:

```
00111 01 1
00111 10-1
01000-10-1
```

Here is the complete algorithm. Note that the carry can overflow past the most significant digit of s, so make sure there is room for one extra digit.

```
for i : 0, n-1
    if s[i] = 1 and s[i+1] = 1
        s[i] <- -1
        j <- i + 1
        while s[j] = 1
            s[j] <- 0
            j <- j + 1
        s[j] <- 1
```

To handle this, we need to tweak the double and add algorithm a little:

```
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] =  1 then R <- R + P
    if s[i] = -1 then R <- R - P
return R
```

On average, this costs the usual n doublings, but only n/3 additions, without any look up table. Forget about running in constant time, though.
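Here is the same recoding and multiplication in Python over the integer toy group (my own sketch; the digit array gets extra room at the top for the carry):

```python
def to_naf(s, n):
    """Recode s into digits in {-1, 0, 1}, no two adjacent non-zeroes."""
    d = [(s >> i) & 1 for i in range(n + 2)]   # room for the carry
    for i in range(n):
        if d[i] == 1 and d[i + 1] == 1:
            d[i] = -1
            j = i + 1                          # propagate the carry
            while d[j] == 1:
                d[j] = 0
                j += 1
            d[j] = 1
    return d

def naf_mult(s, P, n):
    """Double and add/subtract, driven by the NAF digits."""
    d = to_naf(s, n)
    R = 0
    for i in reversed(range(len(d))):
        R = R + R
        if d[i] == 1:
            R = R + P
        if d[i] == -1:
            R = R - P
    return R
```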

## Signed sliding windows: non adjacent sliding windows

*(Still not constant time.)*

Signed sliding windows are a generalisation of non-adjacent form. When we find a 1, we don't just look at two bits, but at a whole window (four bits for a 4-bit window). If the most significant bit of the window is a 1, then we put a negative number, and propagate the carry just like non adjacent form. For instance, our scalar would be transformed thus:

```
s = 100 110 1 000111 0111 1 00101 00011 0001 (binary)
s = 100 110 1 000111 0111 1 00101 00003 0001
s = 100 110 1 000111 0111 1 00005 00003 0001
s = 100 110 1 000111 1000-1 00005 00003 0001
s = 100 110 1 000003 1000-1 00005 00003 0001
s = 101 000-3 000003 1000-1 00005 00003 0001
s = 005 000-3 000003 1000-1 00005 00003 0001 (signed windows)
```

The carry propagation works the same as non adjacent form, we just look at a whole window instead of just 2 bits:

```
for i : 0, n-1
    if s[i] ≠ 0
        -- 4-bit unsigned window
        for j : 1, 3
            s[i  ] <- s[i] + s[i+j] × 2^j
            s[i+j] <- 0
        if s[i] > 8
            -- convert to signed window, propagate carry
            s[i] <- s[i] - 16
            j <- i + 4
            while s[j] ≠ 0
                s[j] <- 0
                j <- j + 1
            s[j] <- 1
```

The double and add algorithm is almost the same as the one used for non adjacent form:

```
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] > 0 then R <- R + Px[ s[i]]
    if s[i] < 0 then R <- R - Px[-s[i]]
return R
```

The average cost is n doublings and n/5 + 4 additions (assuming a 4-bit signed sliding window). Signed sliding windows are effectively half as big as (or one bit wider than) their unsigned counterpart. In practice though, this doesn't make a huge difference. While they do require fewer additions, having to handle subtraction as well as addition offsets this gain somewhat.

Now we've reduced the number of additions as much as we could. While this does speed things up a bit, we only went from 3n/2 operations to something like 7n/6. That's better, but we won't go below n without reducing the number of doublings.

And that requires a little bit of cheating.

## Cheat 1: double scalar multiplication

EdDSA verification doesn't exactly need a scalar multiplication. It
needs the *sum* of two scalar multiplications. That is, something like:

```
s1×P1 + s2×P2
```

Computing that with the naive approach requires 3n operations (2n doublings and n additions). We can use sliding windows, but this still requires 2n + 2n/5 + 16 operations.

The trick here is to fuse the two scalar multiplications. First, let's write the two of them separately:

```
s1×P1 = s1[0]×P1 + s1[1]×2×P1 + s1[2]×4×P1 + ... + s1[n-1]×2^(n-1)×P1
s2×P2 = s2[0]×P2 + s2[1]×2×P2 + s2[2]×4×P2 + ... + s2[n-1]×2^(n-1)×P2
```

Then the sum:

```
s1×P1 = s1[0]×P1 + s1[1]×2×P1 + s1[2]×4×P1 + ... + s1[n-1]×2^(n-1)×P1
           +            +            +                       +
s2×P2 = s2[0]×P2 + s2[1]×2×P2 + s2[2]×4×P2 + ... + s2[n-1]×2^(n-1)×P2
```

The fused double and add algorithm is then very simple:

```
R <- zero
for i : n-1, 0
    R <- R + R
    if s1[i] = 1 then R <- R + P1
    if s2[i] = 1 then R <- R + P2
return R
```

This only requires n doublings and n additions on average, for a total of 2n operations. Such a simple trick, and we're already exceeding the performance of sliding windows. We can also make it run in constant time, but this will cost 3n operations instead (n doublings, 2n additions).
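The fused loop (sometimes attributed to Shamir) is easy to verify with the integer toy group (sketch, my naming):

```python
def double_scalar_mult(s1, P1, s2, P2, n):
    """Compute s1*P1 + s2*P2 with a single, shared run of doublings."""
    R = 0
    for i in reversed(range(n)):
        R = R + R                  # one doubling serves both scalars
        if (s1 >> i) & 1:
            R = R + P1
        if (s2 >> i) & 1:
            R = R + P2
    return R
```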

Of course, we can use this with windows. For instance, we can transform s1 and s2 in signed sliding windows form, compute the usual look up table, and finally:

```
R <- zero
for i : n-1, 0
    R <- R + R
    if s1[i] > 0 then R <- R + P1x[ s1[i]]
    if s1[i] < 0 then R <- R - P1x[-s1[i]]
    if s2[i] > 0 then R <- R + P2x[ s2[i]]
    if s2[i] < 0 then R <- R - P2x[-s2[i]]
return R
```

With our 5-bit signed windows, this only costs 4n/3 + 16 operations, well below 2n. Not constant time at all, but it still has worthy applications, such as EdDSA verification.

## Cheat 2: knowing the point in advance

EdDSA signatures only multiply by the base point. We know that point in advance, so we can precompute some stuff. We could of course precompute the window, thus saving the additional additions it requires. But this wouldn't reduce the number of doublings.

*Combs* however, do.

The idea behind a window is to add 4 bits at a time (for a 4-bit window). To avoid adding the same bit twice, we shift by 4 bits by doubling 4 times.

The idea behind a comb is to again add 4 bits at a time (for a 4-bit
comb), except this time the bits are *spread apart*. That way we only
have to shift the comb *once* before adding again. A 4-bit comb not
only divides the number of additions by 4, it divides the number of
*doublings* by 4.

Here's an example with our trusty 32-bit scalar:

```
s = 10011010001110111100101000110001
```

A 4-bit comb would divide it like this:

```
1 0 1 0 -> comb 7
0 0 1 0 -> comb 6
0 1 0 1 -> comb 5
1 1 0 1 -> comb 4
1 1 1 0 -> comb 3
0 0 0 0 -> comb 2
1 1 1 0 -> comb 1
0 1 0 1 -> comb 0
```

First, we add those bits:

```
comb 7
┌───────┬───────┬───────┐
│       │       │       │
10011010001110111100101000110001
```

Then we double *once*, and add those:

```
comb 6
 ┌───────┬───────┬───────┐
 │       │       │       │
10011010001110111100101000110001
```

Double once again, add those:

```
comb 5
  ┌───────┬───────┬───────┐
  │       │       │       │
10011010001110111100101000110001
```

Etc.

The look up table required to support this is a little different from a window look up table. At first glance, a 4-bit window table simply contains all multiples of P, from zero to 15×P. Another way to look at it is that it represents all binary numbers from 0 to 15, where each bit represents P, 2P, 4P and 8P respectively. The other elements are then obtained by summing those base elements.

Instead of looking at the table like this:

```
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15
```

We look at it like *that*:

```
0       1       2       2+1
4       4+1     4+2     4+2+1
8       8+1     8+2     8+2+1
8+4     8+4+1   8+4+2   8+4+2+1
```

That's the real structure of the look up table. With windows, sure, 4+2 also happens to be 2+2+2, and we take advantage of this when constructing window tables. Comb tables are different. Their individual bits have nothing to do with each other, and the usual algebraic properties do not hold. We must add from individual bits.

To construct a 4-bit comb, we need 4 constants:

```
P0 = P
P1 = P × 2^( n/4)
P2 = P × 2^(2×n/4)
P3 = P × 2^(3×n/4)
```

In the case of our 32-bit number, this means:

```
P0 = P
P1 = P × 2^8
P2 = P × 2^16
P3 = P × 2^24
```

Those are obtained by repeatedly doubling the point P. Then we can construct the table which will be hard coded into the main program.

```
for i : 1, 15
    Px[i] <- zero
    if i   mod 2 = 1 then Px[i] <- Px[i] + P0
    if i/2 mod 2 = 1 then Px[i] <- Px[i] + P1
    if i/4 mod 2 = 1 then Px[i] <- Px[i] + P2
    if i/8 mod 2 = 1 then Px[i] <- Px[i] + P3
```

Another step is transforming s into comb form. This is almost the same as the fixed window transformation we saw earlier, only with interleaved bits instead:

```
for i : 0, (n/4)-1
    comb[i] <- s[i          ]
             + 2 × s[i + n/4    ]
             + 4 × s[i + n/4 × 2]
             + 8 × s[i + n/4 × 3]
return comb
```

The main loop is then very simple:

```
R <- zero
for i : (n/4)-1, 0
    R <- R + R
    R <- R + Px[comb[i]]
return R
```

This requires a total of n/2 operations. Note that in most cases, the scalar is secret, so we require constant time lookup:

```
R <- zero
for i : (n/4)-1, 0
    R <- R + R
    R <- R + clookup(Px, comb[i])
return R
```

Since such lookups scan the table linearly, this limits how big a table can be before we waste more time on scanning the table than we save performing fewer additions and doublings.
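All the comb machinery (tooth constants, table, recoding, main loop) fits in one Python sketch over the integer toy group (my naming; in real code the table would be precomputed and hard coded):

```python
def comb_mult(s, P, n, t=4):
    """s*P with one t-tooth comb; n must be a multiple of t."""
    c = n // t                          # spacing between teeth
    teeth = [P]                         # teeth[k] = 2^(k*c) * P
    for _ in range(t - 1):
        Q = teeth[-1]
        for _ in range(c):
            Q = Q + Q                   # doublings, done in advance
        teeth.append(Q)
    # Table entry m = sum of teeth[k] over the set bits k of m
    Px = [0] * (1 << t)
    for m in range(1, 1 << t):
        for k in range(t):
            if (m >> k) & 1:
                Px[m] = Px[m] + teeth[k]
    # Comb form of the scalar: digit i gathers bits i, i+c, i+2c, ...
    comb = [sum(((s >> (i + k * c)) & 1) << k for k in range(t))
            for i in range(c)]
    # Main loop: only c doublings and c additions
    R = 0
    for i in reversed(range(c)):
        R = R + R
        R = R + Px[comb[i]]
    return R
```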

Thankfully, there's a way to further reduce the number of doublings without scanning bigger tables.

## Multiple combs

Instead of having bigger tables, we could have *more* tables. The core
idea is to chop up the scalar differently (here with two combs):

```
s = 10011010001110111100101000110001
    1   1   0   1   1   1   0   0
     0   0   0   0   1   0   0   0
      0   1   1   1   0   1   1   0
       1   0   1   1   0   0   1   1
```

Now we *could* use an 8-bit comb for this:

```
┌───┬───┬───┬───┬───┬───┬───┐
│   │   │   │   │   │   │   │
 ┌───┬───┬───┬───┬───┬───┬───┐
 │   │   │   │   │   │   │   │
  ┌───┬───┬───┬───┬───┬───┬───┐
  │   │   │   │   │   │   │   │
   ┌───┬───┬───┬───┬───┬───┬───┐
   │   │   │   │   │   │   │   │
10011010001110111100101000110001
```

But 256 elements in the look up table is way too big in most settings. Instead, we could use two 4-bit combs:

```
┌───┬───┬───┐   ┌───┬───┬───┐
│   │   │   │   │   │   │   │
 ┌───┬───┬───┐   ┌───┬───┬───┐
 │   │   │   │   │   │   │   │
  ┌───┬───┬───┐   ┌───┬───┬───┐
  │   │   │   │   │   │   │   │
   ┌───┬───┬───┐   ┌───┬───┬───┐
   │   │   │   │   │   │   │   │
10011010001110111100101000110001
```

With this, we need to do everything twice: we need two look up tables, constructed from two sets of teeth:

```
        P0x          |         P1x
---------------------+---------------------
P00 = P              | P10 = P × 2^(4×n/8)
P01 = P × 2^(  n/8)  | P11 = P × 2^(5×n/8)
P02 = P × 2^(2×n/8)  | P12 = P × 2^(6×n/8)
P03 = P × 2^(3×n/8)  | P13 = P × 2^(7×n/8)
```

Then we need two combs:

```
for i : 0, (n/8)-1
    comb0[i] <- s[i]
              + 2 × s[i + n/8]
              + 4 × s[i + n/8 × 2]
              + 8 × s[i + n/8 × 3]
    comb1[i] <- s[i + n/2]
              + 2 × s[i + n/8 + n/2]
              + 4 × s[i + n/8 × 2 + n/2]
              + 8 × s[i + n/8 × 3 + n/2]
return comb0, comb1
```

Finally, we perform two additions per loop iteration (though we have half as many iterations):

```
R <- zero
for i : (n/8)-1, 0
    R <- R + R
    R <- R + clookup(P0x, comb0[i])
    R <- R + clookup(P1x, comb1[i])
return R
```

There, we just halved the number of doublings. We have two tables instead of just one, but the scans are just as short. More generally, the number of doublings is divided by the number of combs.
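To check the bookkeeping, here is a hedged Python sketch of the two-comb scheme, again using plain integers as a stand-in group (function and variable names are mine):

```python
def two_comb_mult(s, P, n, t=4):
    """s*P with two t-tooth combs; n must be a multiple of 2*t."""
    c = n // (2 * t)                    # teeth spacing within each half
    # Tooth constants, by repeated doubling:
    # half h, tooth k covers bit h*n/2 + k*c
    teeth = [[0] * t for _ in range(2)]
    Q = P
    for h in range(2):
        for k in range(t):
            teeth[h][k] = Q
            for _ in range(c):
                Q = Q + Q
    # One table per comb, built from individual teeth
    tables = []
    for h in range(2):
        Px = [0] * (1 << t)
        for m in range(1, 1 << t):
            for k in range(t):
                if (m >> k) & 1:
                    Px[m] = Px[m] + teeth[h][k]
        tables.append(Px)
    # Main loop: c doublings, 2*c additions
    R = 0
    for i in reversed(range(c)):
        R = R + R
        for h in range(2):
            digit = sum(((s >> (h * n // 2 + i + k * c)) & 1) << k
                        for k in range(t))
            R = R + tables[h][digit]
    return R
```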

Of course, more combs will put more stress on the CPU data cache. One should also consider using a wider comb instead, which reduces the number of both doublings and additions, though at the cost of a longer constant time scan. Still, multiple combs are worth considering whenever performance matters more than compactness.

Besides, there are ways to halve the size of the look up tables.

## Sliding combs

Sliding combs are not very interesting because most cryptographic applications of fixed point scalar multiplication involve a secret scalar. Also, the idea is pretty trivial: just apply the sliding ideas to combs. We can also make signed combs to reduce the size of the precomputed tables.

Their speed on the other hand is most impressive: on average, a 4-bit signed comb will have a very small table (8 points), and will cost an average of n/3 operations (n/6 doublings, n/6 additions). And with fully precomputed tables and variable time lookups, we can easily use larger tables at relatively little cost.

Still, we ultimately want a *constant time* way to halve the size of our
look up tables.

## Cheat 2.1: odd order groups

The first instance of this cheat I know of comes from Mike Hamburg (2012). The idea is: what if there were a representation of s where every bit is either 1 or -1? If we were to make a 4-bit look up table for such a representation, it would look like this ("+" means 1, "-" means -1):

```
---- | -8 - 4 - 2 - 1 | -15
---+ | -8 - 4 - 2 + 1 | -13
--+- | -8 - 4 + 2 - 1 | -11
--++ | -8 - 4 + 2 + 1 |  -9
-+-- | -8 + 4 - 2 - 1 |  -7
-+-+ | -8 + 4 - 2 + 1 |  -5
-++- | -8 + 4 + 2 - 1 |  -3
-+++ | -8 + 4 + 2 + 1 |  -1
+--- |  8 - 4 - 2 - 1 |   1
+--+ |  8 - 4 - 2 + 1 |   3
+-+- |  8 - 4 + 2 - 1 |   5
+-++ |  8 - 4 + 2 + 1 |   7
++-- |  8 + 4 - 2 - 1 |   9
++-+ |  8 + 4 - 2 + 1 |  11
+++- |  8 + 4 + 2 - 1 |  13
++++ |  8 + 4 + 2 + 1 |  15
```

Notice how the upper half of the look up table is the opposite of the
lower half. By flipping all bits of an index, we get the index of the
opposite value. Assuming our group can subtract as efficiently as it
can add, we can omit one half of the table altogether: if we get
`+???` as an index, we just add the corresponding value. If we get
`-???`, we first flip all the signs, then *subtract* the corresponding
value. That way we never need the upper half of the table. For
instance, if we get `-+--`, instead of adding -7 directly, we first
flip all the signs (`+-++`), then subtract 7.

The look up table for this comb can be constructed from the unsigned
look up table (the part after the `--` is just comments):

```
Sx[0] <- Px[ 8] - Px[7] -- 1000 - 0111 = +---
Sx[1] <- Px[ 9] - Px[6] -- 1001 - 0110 = +--+
Sx[2] <- Px[10] - Px[5] -- 1010 - 0101 = +-+-
Sx[3] <- Px[11] - Px[4] -- 1011 - 0100 = +-++
Sx[4] <- Px[12] - Px[3] -- 1100 - 0011 = ++--
Sx[5] <- Px[13] - Px[2] -- 1101 - 0010 = ++-+
Sx[6] <- Px[14] - Px[1] -- 1110 - 0001 = +++-
Sx[7] <- Px[15] - Px[0] -- 1111 - 0000 = ++++
```

Another way to look at it:

```
for i : 0, 7
    Sx[i] <- Px[i+8] - Px[flip_bits(i + 8)]
```
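We can check the sign flipping claims numerically, once more with integers standing in for points (a toy sketch, names mine):

```python
P  = 5                                  # toy "point": an integer
Px = [k * P for k in range(16)]         # unsigned window table
flip_bits = lambda x: x ^ 0b1111        # flip all four bits

# Build the signed table from the unsigned one:
Sx = [Px[i + 8] - Px[flip_bits(i + 8)] for i in range(8)]

# Sx[i] holds the value of the all-bits-set pattern with index i+8,
# which is the odd number 2*(i+8) - 15 = 2*i + 1:
assert Sx == [(2 * i + 1) * P for i in range(8)]

# A pattern with the top bit clear is handled by flip-then-subtract:
for c in range(8):
    value = 2 * c - 15                  # signed value of the pattern
    assert value * P == -Sx[flip_bits(c) - 8]
```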

Now the problem is obtaining that all bits set form to begin with. This
is where the odd order will help. Most groups we deal with in
cryptography are modular, with some order L (where L is some huge
number). That is, for any scalar s and point P, the scalar products
s×P and (L+s)×P produce the same result. So the all bits set form
doesn't have to be equal to s, only *congruent* to it (modulo L). When
L is odd, such a form always exists.

Let's reason backwards, and pretend we already have our all bits set
form *abs*. For ease of manipulation, we'd like to replace its digits
+1 and -1 by the more regular 1 and 0 (let's call the result *hbs*,
for "half bits set"). That way handling the look up table is easier
(recall that we only need the lower half):

```
+--- ->  1 | 1000 = 8 + 000 = 8 + 0
+--+ ->  3 | 1001 = 8 + 001 = 8 + 1
+-+- ->  5 | 1010 = 8 + 010 = 8 + 2
+-++ ->  7 | 1011 = 8 + 011 = 8 + 3
++-- ->  9 | 1100 = 8 + 100 = 8 + 4
++-+ -> 11 | 1101 = 8 + 101 = 8 + 5
+++- -> 13 | 1110 = 8 + 110 = 8 + 6
++++ -> 15 | 1111 = 8 + 111 = 8 + 7
```

So:

```
hbs <- 0
for i : 0, n-1
    if abs[i] =  1 then hbs[i] <- 1
    if abs[i] = -1 then hbs[i] <- 0
return hbs
```

The same can be achieved with a simple bit of linear arithmetic:

```
hbs <- 0
for i : 0, n-1
    hbs[i] <- (abs[i] + 1) / 2
return hbs
```

Dividing each digit by 2 is the same as dividing the whole thing
by 2. Adding one to each digit is the same as adding 2^{n}-1. So hbs
is the same as (abs + 2^{n}-1) / 2. But abs is congruent to the scalar
s *by definition*, which means hbs is the same as
*(s + 2^{n}-1) / 2*.

When the order L of the group is odd, the division by 2 is always
possible (just multiply by 1/2 modulo L). With those groups, the hbs
form always exists, and so does the abs form. Odd order groups are not
uncommon. Curve25519 for instance (used for Ed25519 signatures) has a
*prime* order subgroup, and all interesting primes are odd.
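Here is the congruence argument in action, with a toy odd group order (the numbers below are illustrative values I picked, not real curve parameters):

```python
L = 1019                     # toy odd (prime) group order
n = 10                       # 2^10 = 1024 > L, enough digits
inv2 = (L + 1) // 2          # 1/2 modulo L, since L is odd

s = 300
hbs = ((s + (1 << n) - 1) * inv2) % L

# Read hbs as n binary digits; abs[i] = 2*hbs[i] - 1 is then +/-1:
abs_digits = [2 * ((hbs >> i) & 1) - 1 for i in range(n)]
abs_value  = sum(d << i for i, d in enumerate(abs_digits))

assert all(d in (1, -1) for d in abs_digits)     # all bits "set"
assert abs_value % L == s                        # congruent to s mod L
```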

The whole operation is done in 3 steps. First, transform the scalar
into an all bits set form (or rather, the *half* bits set form).

```
-- This is all done modulo L
hbs <- (s + 2^n - 1) / 2
return hbs
```

Second, turn the hbs form into comb form.

```
for i : 0, (n/4)-1
    comb[i] <- hbs[i]
             + 2 × hbs[i + n/4]
             + 4 × hbs[i + n/4 × 2]
             + 8 × hbs[i + n/4 × 3]
return comb
```

Third, use the comb form to actually perform the scalar multiplication
(note we're using the signed look up table *Sx* instead of the usual
*Px*):

```
R <- zero
for i : (n/4)-1, 0
    R <- R + R
    if comb[i] >= 8 then R <- R + Sx[comb[i] - 8]
    else                 R <- R - Sx[flip_bits(comb[i]) - 8]
    -- flip_bits sets the top bit, hence the - 8
return R
```

The constant time version of this still has to scan the table linearly, and we need a constant time negation function, but it only scans 8 elements instead of 16:

```
R <- zero
for i : (n/4)-1, 0
    R     <- R + R
    high  <- comb[i] / 8                      -- top bit (rounded down)
    low   <- 1 - high                         -- not high
    index <- (comb[i] XOR (high - 1)) AND 7   -- bitwise XOR and AND
    P     <- clookup(Sx, index)               -- 0 <= index < 8
    P     <- cneg(P, low)                     -- conditional negation
    R     <- R + P
return R
```

Again, `clookup()` and `cneg()` are supposed to work in constant time.