@ Loup's

Impossible? Like that would stop me.

Fast Multiplication with Slow Additions

Public key cryptography routinely involves scalar multiplication over a group with pretty slow additions. RSA for instance needs modular exponentiation with big numbers (in this case the "addition" is actually a modular multiplication).

Here we will focus on elliptic curves (which is what my own crypto library uses), but most of this will apply to any group. Elliptic curves have this notion of "point addition", where adding two points gives a third point. We will talk about "additions" and "doublings":

P + Q  (addition)
P + P  (doubling)

(While doublings are ostensibly just a kind of addition, they're often implemented differently, and are generally faster.)

The problem we want to solve is that of scalar multiplication:

s × P = P + P + ... + P (s times)

where s is some huge number, typically over several hundred bits. The brute force approach of adding P to itself 2256 times will obviously not work. We need a faster addition chain.

A bit of notation

For the remainder of this tutorial, we will represent the scalar s in an array, like this:

s = s[0] + s[1]×2 + s[2]×4 + ... + s[n-1]×2^n-1

Note that this is a little endian representation. For instance, the number thirteen (13) can be represented this way:

thirteen = 1101 = 1 + 0×2 + 1×4 + 1×8
thirteen[0] = 1
thirteen[1] = 0
thirteen[2] = 1
thirteen[3] = 1

This is a straightforward binary representation. But it is not the only one. If we allow the binary digits to take other values than 0 or 1, we can represent s in other ways. For instance:

thirteen = 0301 = 1 + 0×2 + 3×4 + 0×8
thirteen = 1005 = 5 + 0×2 + 0×4 + 1×8

(In 0301 and 1005, the digits are still in binary positions).

Such exotic number representations are at the root of most optimisations we will see here.

Naive approach: double and add

Assuming s is in straightforward binary form over n digits, the scalar multiplication is pretty simple:

D <- P
R <- zero
for (i : 0, n-1)
    if (s[i] = 1)
        R <- R + D
    D <- D + D
return R

We just compute s[i] × 2^i for each i, and add it to the result. We start from the least significant bit, and double our way up to the most significant bit. Alternatively, we could start from the most significant bit, and double the result instead. This is more flexible.

R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] = 1
        R <- R + P
return R

On average, the naive approach will require n doublings and n/2 additions. Not bad compared to the impossible 2n additions, but not quite optimal either.

A word of warning: the above does not run in constant time, and will leak information about the scalar. If the scalar must remain secret, we need to tweak this a bit:

R <- zero
for i : n-1, 0
    R <- R + R
    T <- R + P
    ccopy(R, T, s[i])
return R

Where ccopy() is an operation that copies the second point into the first if the third argument is one. By using arithmetic tricks, this operation can be done in constant time. The cost however is that we now perform n additions instead of just n/2. (We still have n doublings).

Less additions with windows

The idea is to batch the additions a bit. Instead of processing digits one by one, we could process them, say… 4 by 4. So instead of adding 0 or P, we could add anywhere between 0 and 15×P. To do this, we need to change the representation of the scalar so that only a fourth of the digits is non zero:

s = 1001 1010 0011 1011 1100 1010 0011 0001
s = 0009 000a 0003 000b 000c 000a 0003 0001

(The digits are hexadecimal, from 0 to f)

Transforming s into this fixed windows representation is pretty straightforward:

for i : 0, n/4
    s[i×4] <-     s[i×4  ]
            + 2 × s[i×4+1]
            + 4 × s[i×4+2]
            + 8 × s[i×4+3]
    s[i×4+1] <- 0
    s[i×4+2] <- 0
    s[i×4+3] <- 0

To handle this representation, we must build a table to hold the multiples of P first (this costs 15 additions), then modify the main loop a bit:

-- Build the look up table
Px[0] <- zero
for i : 1, 15
    Px[i] = Px[i-1] + P

-- Main loop
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] ≠ 0
        R <- R + Px[s[i]]
return R

This algorithm is nearly identical to the naive double and add, except it can handle any digit between 0 and 15. The speed up here comes from the representation, which guarantees that 3 digits out of 4 are zeroes. So instead of n/2 additions, we perform a little less than n/4 + 15.

If we were using a 5-bit table instead, we would have n/5 + 31 additions. 6-bit table, n/6 + 63. And so on. Note that eventually, the table costs more than it saves, so there's a limit to how few additions you can perform. In practice, for a 256-bit scalar, the fastest tables are 4 or 5 bits wide.

If we need to do this in constant time, we can again tweak the algorithm a little:

R <- zero
for i : n-1, 0
    R <- R + R
    if i mod 4 = 0 -- always add 1 time in 4
        R <- R + clookup(Px, s[i])
return R

Where clookup() is a constant time look up which scans the table linearly (to guarantee constant memory access patterns), and uses bit twiddling tricks to produce the right value (to avoid branching). Such a lookup gets slower as the table gets bigger, which again limits how big a table can be.

Sliding windows: smaller tables, larger windows

(Not constant time. Don't use if the scalar must remain secret.)

The size of the table is a pretty big problem. If we could find ways to get smaller tables but keeping the windows just as wide…

Turns out that we can. The core idea is pretty simple: instead of placing windows at fixed intervals, we place them such that the least significant bit of each window is a 1. The s scalar above for instance can be chopped up like this instead of using fixed width windows:

s = 1 001101 000111 01111 00101 00011 0001
s = 1 00000d 000007 0000f 00005 00003 0001

This has two advantages: first, we skip over zeroes, so we have fewer windows. Second, the windows always end by 1, so the look up table only needs the odd elements, so it can be smaller.

The sliding windows transformation is a bit more complex than the fixed windows one.

i <- 0
for i : 0, n-1
    if s[i] = 1
        s[i] <-     s[i  ]
              + 2 × s[i+1]
              + 4 × s[i+2]
              + 8 × s[i+3]
        s[i+1] <- 0
        s[i+2] <- 0
        s[i+3] <- 0

(Real implementations should mind the edge cases at the end, or make sure s has enough zeroes at the most significant end.)

The sliding windows representation uses the same digits as the fixed windows representation, so the same algorithm can handle it. The construction of the window however only costs half as many additions, since it can skip the even slots:

-- Build the look up table (odd elements only)
P2    <- P + P
Px[1] <- P
for i : 1, 7
    Px[2×i+1] = Px[2×i-1] + P2

-- Main loop
R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] ≠ 0
        R <- R + Px[s[i]]
return R

(Real implementations would use a smaller table instead of leaving half elements uninitialised.)

The average cost of a 4-bit sliding windows is n/5 + 8 additions, instead of n/4 + 15. Sliding windows are effectively smaller and wider than fixed windows. On the other hand, not being constant time makes sliding windows less widely applicable.

This should be as far as windows can go. But some groups also have cheap subtraction. Twisted Edwards curves used in EdDSA are such a group. When subtraction doesn't cost more than addition, we can use signed sliding windows. But first, we need to make a little detour.

Non adjacent form: negative digits, fewer digits

(Not constant time. Don't use if the scalar must remain secret.)

Non adjacent form is a representation where the digits can be either 0, 1, or -1. We use -1 to minimise non-zero digits, so we can have fewer additions (and subtractions). Take this number for instance:

00111011

In non adjacent form, it would look like this instead:

01000-10-1

The way it works is pretty simple. We work from right to left, and look at the last 2 digits. if they are both ones, we replace the rightmost 1 by a -1 (effectively subtracting 2), and propagate a carry to compensate (effectively adding 2). For the number above, this worked like this:

00111 01 1
00111 10-1
01000-10-1

Here is the complete algorithm. Note that we could overflow s that way, make sure you have enough room for one extra digit.

for i : 0, n-1
    if s[i] = 1 and s[i+1] = 1
        s[i] <- -1
        j    <- i
        while s[j] = 1
            s[j] <- 0
            j    <- j+1
        s[j] <- 1

To handle this, we need to tweak the double and add algorithm a little:

R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] =  1  then  R <- R + P
    if s[i] = -1  then  R <- R - P
return R

On average, this costs the usual n doublings, but only n/3 additions, without any look up table. Forget about running in constant time, though.

Signed sliding windows: non adjacent sliding windows

(Still not constant time.)

Signed sliding windows are a generalisation of non-adjacent form. When we find a 1, we don't just look at two bit, but at a whole window (four bits for a 4-bit window). If the most significant bit of the window is a 1, then we put a negative number, and propagate the carry just like non adjacent form. For instance, our scalar would be transformed thus:

s = 100 110 1 000111 0111 1 00101 00011 0001 (binary)
s = 100 110 1 000111 0111 1 00101 00003 0001
s = 100 110 1 000111 0111 1 00005 00003 0001
s = 100 110 1 000111 1000-1 00005 00003 0001
s = 100 110 1 000003 1000-1 00005 00003 0001
s = 101 000-3 000003 1000-1 00005 00003 0001
s = 005 000-3 000003 1000-1 00005 00003 0001 (signed windows)

The carry propagation works the same as non adjacent form, we just look at a whole window instead of just 2 bits:

for i : 1, n-1
    if s[i] ≠ 0
        -- 4-bit unsigned window
        for j : 1, 3
            s[i  ] <- s[i] + s[i+j] × 2^j
            s[i+j] <- 0
        if s[i] > 8
            -- convert to signed window, propagate carry
            s[i] <- s[i] - 16
            j <- i + 4
            while s[j] != 0
                s[j] <- 0
                j    <- j + 1
            s[j] = 1

The double and add algorithm is almost the same as the one used for non adjacent form:

R <- zero
for i : n-1, 0
    R <- R + R
    if s[i] > 0  then  R <- R + Px[ s[i]]
    if s[i] < 0  then  R <- R - Px[-s[i]]
return R

The average cost is n doublings and n/5 + 4 additions (assuming a 4-bit signed sliding window). Signed sliding windows are effectively half as big, (or one bit wider), than their unsigned counterpart. In practice though, this doesn't make a huge difference. While they do require fewer additions, having to handle subtraction as well as addition offsets this gain somewhat.

Now we've reduced the number of additions as much as we could. While this does speed things up a bit, we only went from 3n/2 operations to something like 7n/6. That's better, but we won't go below n without reducing the number of doublings.

And that requires a little bit of cheating.

Cheat 1: double scalar multiplication

EdDSA verification doesn't exactly need a scalar multiplication. It needs the sum of two scalar multiplications. That is, something like:

s1×P1 + s2×P2

Computing that with the naive approach requires 3n operations (2n doublings and n additions). We can use sliding windows, but this still requires 2n + n/3 + 16 operations.

The trick here is to fuse the two scalar multiplications. First, let's write the two of them separately:

s1×P1 = s1[0]×P1 + s1[1]×2×P1 + s1[2]×4×P1 + ... + s1[n-1]×2^n-1×P1
s2×P2 = s2[0]×P2 + s2[1]×2×P2 + s2[2]×4×P2 + ... + s2[n-1]×2^n-1×P2

Then the sum:

s1×P1 = s1[0]×P1 + s1[1]×2×P1 + s1[2]×4×P1 + ... + s1[n-1]×2^n-1×P1
  +          +            +            +                        +
s2×P2 = s2[0]×P2 + s2[1]×2×P2 + s2[2]×4×P2 + ... + s2[n-1]×2^n-1×P2

The fused double and add algorithm is then very simple:

R <- zero
for i : n-1, 0
    R <- R + R
    if s1[i] = 1  then  R <- R + P1
    if s2[i] = 1  then  R <- R + P2
return R

This only requires n doublings and n additions on average, for a total of 2n operations. Such a simple trick, and already we're already exceeding the performance of sliding windows. We can also make it run in constant time, but this will cost 3n operations instead (n doublings, 2n additions).

Of course, we can use this with windows. For instance, we can transform s1 and s2 in signed sliding windows form, compute the usual look up table, and finally:

R <- zero
for i : n-1, 0
    R <- R + R
    if s1[i] =  1  then  R <- R + P1x[i]
    if s1[i] = -1  then  R <- R - P1x[i]
    if s2[i] =  1  then  R <- R + P2x[i]
    if s2[i] = -1  then  R <- R - P1x[i]
return R

With our 5-bit signed windows, this only costs 4n/3 + 16 operations, well below 2n. Not constant time at all, but still has worthy applications, such as EdDSA verification.

Cheat 2: knowing the point in advance

EdDSA signatures only multiply by the base point. We know that point in advance, so we can precompute some stuff. We could of course precompute the window, thus saving the additional additions it requires. But this wouldn't reduce the number of doublings.

Combs however, do.

The idea behind a window is to add 4 bits at a time (for a 4-bit window). To avoid adding the same bit twice, we shift by 4 bits by doubling 4 times.

The idea behind a a comb is to again add 4 bits at a time (for a 4-bit comb), except this time the bits are spread apart. That way we only have to shift the comb once before adding again. A 4-bit comb not only divides the number of additions by 4, it divides the number of doublings by 4.

Here's an example with our trusty 32-bit scalar:

s = 10011010001110111100101000110001

A 4-bit comb would divide it like this:

    1       0       1       0          -> comb 7
     0       0       1       0         -> comb 6
      0       1       0       1        -> comb 5
       1       1       0       1       -> comb 4
        1       1       1       0      -> comb 3
         0       0       0       0     -> comb 2
          1       1       1       0    -> comb 1
           0       1       0       1   -> comb 0

First, we add those bits:

         comb 7
┌───────┬───────┬───────┐
│       │       │       │
10011010001110111100101000110001

Then we double once, and add those:

          comb 6
 ┌───────┬───────┬───────┐
 │       │       │       │
10011010001110111100101000110001

Double once again, add those:

           comb 5
  ┌───────┬───────┬───────┐
  │       │       │       │
10011010001110111100101000110001

Etc.

The look up table required to support this is a little different from a window look up table. At a first glance, a 4-bit window simply contains all multiples of P, from zero to 15×P. Another way to look at it is that it represents all binary numbers from 0 to 15, where each bit represent P, 2P, 4P and 8P respectively. The other elements are then obtained by summing those base elements.

Instead of looking at the table like this:

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

We look at it like that:

0        1      2      2+1
4      4+1    4+2    4+2+1
8    8+  1  8+  2  8+  2+1
8+4  8+4+1  8+4+2  8+4+2+1

That's the real structure of the look up table. With windows, sure, 4+2 also happens to be 2+2+2, and we take advantage of this when constructing window tables. Comb tables are different. Their individual bits have nothing to do with each other, and the usual algebraic properties do not hold. We must add from individual bits.

To construct a 4-bit comb, we need 4 constants:

P0 = P
P1 = P × 2^(  n/4)
P2 = P × 2^(2×n/4)
P3 = P × 2^(3×n/4)

In the case of our 32-bit number, this means:

P0 = P
P1 = P × 2^8
P2 = P × 2^16
P3 = P × 2^24

Those are obtained by repeatedly doubling the point P. Then we can construct the table which will be hard coded into the main program.

for i : 1, 15;
    Px[i] <- zero
    if i   mod 2 = 1  then  Px[i] <- Px[i] + P0
    if i/2 mod 2 = 1  then  Px[i] <- Px[i] + P1
    if i/4 mod 2 = 1  then  Px[i] <- Px[i] + P2
    if i/8 mod 2 = 1  then  Px[i] <- Px[i] + P3

Another step is transforming s in comb form. This is almost the same as the fixed window transformation we saw earlier, only with interleaved bits instead:

for i : 0, (n/4)-1
    comb[i] <-     s[i          ]
             + 2 × s[i + n/4    ]
             + 4 × s[i + n/4 × 2]
             + 8 × s[i + n/4 × 3]
return comb

The main loop is then very simple:

R <- zero
for i : (n/4)-1, 0
    R <- R + R
    R <- R + Px[comb[i]]
return R

This requires a total of n/2 operations. Note that in most cases, the scalar is secret, so we require constant time lookup:

R <- zero
for i : (n/4)-1, 0
    R <- R + R
    R <- R + clookup(Px, comb[i])
return R

Since such lookups scan the table linearly, this limits how big a table can be before we waste more time on scanning the table than we save performing fewer additions and doublings.

Thankfully, there's a way to further reduce the number of doublings without scanning bigger tables.

Multiple combs

Instead of having bigger tables, we could have more tables. The core idea is to chop up the scalar differently (here with two combs):

s = 10011010001110111100101000110001

    1   1   0   1   1   1   0   0
     0   0   0   0   1   0   0   0
      0   1   1   1   0   1   1   0
       1   0   1   1   0   0   1   1

Now we could use an 8-bit comb for this:

┌───┬───┬───┬───┬───┬───┬───┐
│   │   │   │   │   │   │   │
 ┌───┬───┬───┬───┬───┬───┬───┐
 │   │   │   │   │   │   │   │
  ┌───┬───┬───┬───┬───┬───┬───┐
  │   │   │   │   │   │   │   │
   ┌───┬───┬───┬───┬───┬───┬───┐
   │   │   │   │   │   │   │   │
10011010001110111100101000110001

But 256 elements in the look up table is way too big in most settings. Instead, we could use two 4-bit combs:

┌───┬───┬───┐   ┌───┬───┬───┐
│   │   │   │   │   │   │   │
 ┌───┬───┬───┐   ┌───┬───┬───┐
 │   │   │   │   │   │   │   │
  ┌───┬───┬───┐   ┌───┬───┬───┐
  │   │   │   │   │   │   │   │
   ┌───┬───┬───┐   ┌───┬───┬───┐
   │   │   │   │   │   │   │   │
10011010001110111100101000110001

With this, we need to do everything twice: we need two look up tables, constructed from two set of teeth:

        P0x          |         P1x
---------------------+--------------------
P00 = P              |  P10 = P × 2^(4×n/8)
P01 = P × 2^(  n/8)  |  P11 = P × 2^(5×n/8)
P02 = P × 2^(2×n/8)  |  P12 = P × 2^(6×n/8)
P03 = P × 2^(3×n/8)  |  P13 = P × 2^(7×n/8)

Then we need two transformations to comb form:

for i : 0, (n/8)-1
    comb0[i] <-     s[i          ]
              + 2 × s[i + n/8    ]
              + 4 × s[i + n/8 × 2]
              + 8 × s[i + n/8 × 3]
    comb1[i] <-     s[i           + n/2]
              + 2 × s[i + n/8     + n/2]
              + 4 × s[i + n/8 × 2 + n/2]
              + 8 × s[i + n/8 × 3 + n/2]
return comb0, comb1

Finally, we perform two additions per loop iterations (though we have half as many loop iterations):

R <- zero
for i : (n/8)-1, 0
    R <- R + R
    R <- R + clookup(P0x, comb0[i])
    R <- R + clookup(P1x, comb1[i])
return R

There, we just halved the number of doublings. We have two tables instead of just one, but the scans were just as short. More generally, the number of doublings is divided by the number of combs.

Of course, more combs will put more stress on the CPU data cache. Also, one should also consider using a wider comb instead, which reduces the number of both the doublings and additions, though at the cost of a longer constant time scan for additions. Still, multiple combs are worth considering whenever performance matters more than compactness.

Besides, there are ways to halve the size of the look up tables.

Sliding combs

Sliding combs are not very interesting because most cryptographic applications of fixed point scalar multiplication involve a secret scalar. Also, the idea is pretty trivial: just apply the sliding ideas to combs. We can also make signed combs to reduce the size of the precomputed tables.

Their speed on the other hand is most impressive: on average, a 4-bit signed comb will have a very small table (8 points), and will cost an average of n/3 operations (n/6 doublings, n/6 additions). And with fully precomputed tables and variable time lookups, we can easily use larger tables at relatively little cost.

Still, we ultimately want a constant time way to halve the size of our look up tables.

Cheat 2.1: odd order groups

The first instance of this cheat I know of comes from Mike Hamburg (2012). The idea is, what if there's a representation of S where every bit would be either 1 or -1? If we were to make a 4-bit look up table for such a representation, it would look like this ("+" means 1, "-" means -1):

----  | -8 - 4 - 2 - 1  | -15
---+  | -8 - 4 - 2 + 1  | -13
--+-  | -8 - 4 + 2 - 1  | -11
--++  | -8 - 4 + 2 + 1  |  -9
-+--  | -8 + 4 - 2 - 1  |  -7
-+-+  | -8 + 4 - 2 + 1  |  -5
-++-  | -8 + 4 + 2 - 1  |  -3
-+++  | -8 + 4 + 2 + 1  |  -1
+---  |  8 - 4 - 2 - 1  |   1
+--+  |  8 - 4 - 2 + 1  |   3
+-+-  |  8 - 4 + 2 - 1  |   5
+-++  |  8 - 4 + 2 + 1  |   7
++--  |  8 + 4 - 2 - 1  |   9
++-+  |  8 + 4 - 2 + 1  |  11
+++-  |  8 + 4 + 2 - 1  |  13
++++  |  8 + 4 + 2 + 1  |  15

Notice how the upper half of the look up table is the opposite of the lower half. By flipping all bits of an index, we get the index to the opposite value. Assuming our group can subtract as efficiently as it can add, we could omit one half of the table altogether: If we get +??? as an index, we just add the corresponding value. If we get -???, we first flip all +/- then subtract the corresponding value. That way we never need the upper half of the table. For instance, if we get -+-- instead of just adding -5, we can first flip all bits (+-++) then subtract 7.

The look up table for this comb can be constructed from the unsigned look up table (the part after the -- is just comments):

Sx[0] <- Px[ 8] - Px[7]  --  1000 - 0111 = +---
Sx[1] <- Px[ 9] - Px[6]  --  1001 - 0110 = +--+
Sx[2] <- Px[10] - Px[5]  --  1010 - 0101 = +-+-
Sx[3] <- Px[11] - Px[4]  --  1011 - 0100 = +-++
Sx[4] <- Px[12] - Px[3]  --  1100 - 0011 = ++--
Sx[5] <- Px[13] - Px[2]  --  1101 - 0010 = ++-+
Sx[6] <- Px[14] - Px[1]  --  1110 - 0001 = +++-
Sx[7] <- Px[15] - Px[0]  --  1111 - 0000 = ++++

Another way to look at it:

for i : 0, 7
    Sx[i] <- Px[i+8] - Px[flip_bits(i + 8)]

Now the problem is obtaining that all bits set form to begin with. This is where the odd order will help. Most groups we deal with in cryptography are actually modular, with some order L (where L is some huge number). That is, for any scalar s and point P, the scalar products s×P and (L+S)×P produce the same result. So the all bits set form doesn't have to be equal to s, only congruent (modulo L). When L is odd, such a form always exist.

Let's reason backwards, and pretend we already have our all bits set form abs. For ease of manipulation, we'd like to replace its digits +1 and -1 by the more regular 1 and 0 (let's call the result hbs or "half bits set". That way handling the look up table is easier (recall that we only need the lower half):

+---  ->    1  |  1000 = 8 + 000 = 8 + 0
+--+  ->    3  |  1001 = 8 + 001 = 8 + 1
+-+-  ->    5  |  1010 = 8 + 010 = 8 + 2
+-++  ->    7  |  1011 = 8 + 011 = 8 + 3
++--  ->    9  |  1100 = 8 + 100 = 8 + 4
++-+  ->   11  |  1101 = 8 + 101 = 8 + 5
+++-  ->   13  |  1110 = 8 + 110 = 8 + 6
++++  ->   15  |  1111 = 8 + 111 = 8 + 7

So:

hbs <- 0
for i : 0, n-1
    if abs[i] =  1  then  hbs[i] <- 1
    if abs[i] = -1  then  hbs[i] <- 0
return hbs

The same can be achieved with a simple bit of linear arithmetic:

hbs <- 0
for i : 0, n-1
    hbs[i] <- (abs[i] + 1) / 2
return hbs

Dividing each bit by 2 is the same as dividing the whole thing by 2. Adding one to each bit is the same as adding 2n-1. So, hbs is the same as (abs + 2n-1) / 2. But then abs is the same as the scalar s, by definition. Which means hbs is the same as (s + 2n-1) / 2.

When the order of the group L is odd, the division by 2 is always possible (just multiply by 1/2 modulo L). With those groups, the hbs form always exist, and so does the abs form. Odd order groups are not uncommon. Curve25519 for instance (used for Ed25519 signatures) have a prime order, and all interesting primes are odd.

The whole operation is done in 3 steps. First, transform the scalar into an all bits set form (or rather, the half bits set form).

-- This is all done modulo L
hbs <- (s + 2^n - 1) / 2
return hbs

Second, turn the hbs form into comb form.

for i : 0, n/4
    comb[i] <-     hbs[i          ]
             + 2 × hbs[i + n/4    ]
             + 4 × hbs[i + n/4 × 2]
             + 8 × hbs[i + n/4 × 3]
return comb

Third, use the comb form to actually perform the scalar multiplication (note we're using the signed look up table Sx instead of the usual Px):

R <- zero
for i : (n/4)-1, 0
    R     <- R + R
    if comb[i] >= 8  then  R <- R + Sx[comb[i] - 8       ]
                     else  R <- R - Sx[flip_bits(comb[i])]
return R

The constant time version of this still has to scan the table linearly, and we need a constant time negation function, but it only scans 8 elements instead of 16:

R <- zero
for i : (n/4)-1, 0
    R     <- R + R
    high  <- comb[i] / 8                  -- rounded down
    low   <- 1 - high                     -- not high
    index <- (teeth XOR (high - 1)) AND 7 -- bitwise XOR and AND
    P     <- clookup(Sx, index)           -- 0 < index < 8
    P     <- cneg(P, low)                 -- conditional negation
return R

Again, clookup() and cneg() are supposed to work in constant time.