Metamorphosis III (detail) by M.C. Escher. Source: The Berkshire Museum

Logjumps is a recently discovered technique for modular reduction over large prime fields. Unlike Montgomery reduction, which requires $n^2 + n$ multiplications, Logjumps only needs $n^2 + 1$.

Although Montgomery reduction is four decades old, Yuval Domb of Ingonyama first discovered this improvement when optimising client-side proving. To our knowledge, no-one else found this method before Yuval.

This research note offers a low-level explanation of Logjumps. It aims to help cryptography engineers gain intuition not only about the algebraic principles behind Logjumps and classic Montgomery reduction, but also how these algebraic principles translate to code.

Existing literature typically focuses on either high-level concepts or low-level optimisations, often leaving a gap between theory and implementation. To bridge it, I will first present the algebra behind Montgomery reduction, then explain how these principles apply to multiprecision modular reduction. Finally, I will do the same for Logjumps, and discuss its performance.

Reference code for both algorithms can be found here.

Background

The Montgomery reduction algorithm takes an input $c \bmod p$ and returns $cR^{-1} \bmod p$. It is closely related to Montgomery multiplication, which takes $aR \bmod p$ and $bR \bmod p$, and returns $abR \bmod p$.

We use Montgomery form to describe field elements $xR \bmod p$ for any $x$, where $R$ is the Montgomery radix, a power of 2 that is greater than $p$. For a more detailed explanation, refer to this post.

This technique is essential for optimising cryptographic schemes that require consecutive multiplication-heavy field operations, such as elliptic curve arithmetic or algebraic hash functions. The performance gains from Montgomery multiplication should ideally far outweigh the overhead of converting the inputs to and from Montgomery form.

In this note, we use the following symbols:

| Symbol | Description | Notes |
| --- | --- | --- |
| $p$ | The field modulus. | A prime number. |
| $c$ | The value to reduce. | $0 \le c < p^2$ |
| $w$ | The number of bits per word. | Must fit within a single CPU-native data type, such as a 32 or 64-bit unsigned integer. |
| $n$ | The number of words. | Must be large enough that $wn$ is equal to or greater than the bit-length of $p$. |
| $R$ | The Montgomery radix. | Coprime to and greater than $p$. Must be a power of 2, usually $2^{wn}$. |
| $\mu$ | The precomputed Montgomery constant. | $-p^{-1} \bmod R$, computed using the extended Euclidean algorithm. |
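
For concreteness, here is a minimal sketch (Python 3.8+) of how $R$ and $\mu$ might be derived. The modulus $2^{255} - 19$ is just an example, and Python's built-in modular inverse stands in for the extended Euclidean algorithm.

w, n = 64, 4
R = 1 << (w * n)              # the Montgomery radix
p = 2**255 - 19               # example prime; any odd p smaller than R works here
mu = (-pow(p, -1, R)) % R     # mu = -p^{-1} mod R
assert (mu * p) % R == R - 1  # i.e. mu * p = -1 mod R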

The steps of the reduction algorithm are:

  1. $q \leftarrow \mu c \bmod R$
  2. $t \leftarrow (c + pq) / R$
    • Note: division by $R$ is a right-shift by $wn$ bits.
  3. If $t \ge p$ then $t \leftarrow t - p$.
  4. Return $t$.

Substitute the terms μ and q into steps 1 and 2:

$$t \leftarrow (c + p(-p^{-1}c)) / R$$

Since $p \cdot p^{-1} \equiv 1 \pmod{R}$, the numerator cancels modulo $R$:

$$c + p(-p^{-1}c) \equiv c - c \equiv 0 \pmod{R}$$

This means that the numerator $c + pq$ is an exact multiple of $R$, so the division in step 2 is exact. Since $R$ is coprime to $p$, dividing by $R$ corresponds to multiplying by $R^{-1} \bmod p$. In the word-based implementation, we can therefore divide by $R$ simply by discarding the lowermost $n$ words, which are all zero at that point. A final conditional subtraction is all that is left to derive $cR^{-1} \bmod p$.
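
To make this concrete, here is a hedged sketch of the four steps with plain Python integers and a toy Mersenne prime; it illustrates the algebra rather than a performant implementation.

p  = 2**61 - 1                       # toy prime
R  = 2**64                           # a power of two greater than p
mu = (-pow(p, -1, R)) % R            # -p^{-1} mod R

def mont_redc_highlevel(c):
    q = (mu * c) % R                 # step 1
    t = (c + p * q) // R             # step 2: the division is exact
    if t >= p:                       # step 3
        t -= p
    return t                         # step 4: t = c * R^{-1} mod p

a, b = 0x1234567 % p, 0x89abcdef % p
c = a * b
assert mont_redc_highlevel(c) == (c * pow(R, -1, p)) % p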

The multiprecision algorithm

To implement this algorithm in a CPU, we use arrays of unsigned integers to represent w-bit words. For instance, if w is 32, we must work with arrays of at least 32-bit values.

| Variable | Description |
| --- | --- |
| `p[n]` | The field modulus as an $n$-length array of words. |
| `c[n*2]` | The value to reduce as a $2n$-length array of words. |
| `w` | The number of bits per word. |
| `n` | The number of words. |
| `mu` | The least significant $w$-bit word of $-p^{-1} \bmod R$. |

The pseudocode which implements this algorithm looks like the following.

def mont_redc(c, p, mu, n, w):
    mask = (2 ** w) - 1
    t = list(c) # copy c to t

    for i in range(n):
        q = (t[i] * mu) & mask

        # 1st inner loop: multiply p by q, word by word, while propagating carries.
        # Note that `& mask` is equivalent to mod 2^w, and
        # `>> w` is equivalent to division by 2^w.
        pq = [0] * (n + 1)
        carry = 0
        for j in range(n):
            prod  = q * p[j] + carry
            pq[j] = prod & mask
            carry = prod >> w
        pq[n] = carry

        # 2nd inner loop: add q * p to t word by word, while propagating carries.
        carry = 0
        for j in range(n + 1):
            s        = t[i + j] + pq[j] + carry
            t[i + j] = s & mask
            carry    = s >> w

        # Propagate the carry to the rest of t.
        k = i + n + 1
        while carry > 0 and k < (2 * n):
            s     = t[k] + carry
            t[k]  = s & mask
            carry = s >> w
            k += 1

    # At this point, t begins with n zero words, followed by the n unreduced
    # words of the result.
    lhs = t[n:n*2]

    # Assume gte and sub are multiprecision greater-than-or-equal and
    # subtraction functions.
    if gte(lhs, p):
        return sub(lhs, p, w)
    else:
        return lhs
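
One possible way to exercise the pseudocode above, with quick stand-ins for gte and sub that simply round-trip through Python integers, and toy parameters (w = 16, n = 4, p = 10^9 + 7) chosen purely for illustration:

w, n = 16, 4
p_int = 10**9 + 7                                  # toy prime, well below 2**(w*n)

def to_words(x, k, w):
    return [(x >> (w * i)) & ((1 << w) - 1) for i in range(k)]

def from_words(words, w):
    return sum(word << (w * i) for i, word in enumerate(words))

def gte(a, b):
    return from_words(a, w) >= from_words(b, w)

def sub(a, b, w):
    return to_words(from_words(a, w) - from_words(b, w), len(a), w)

p  = to_words(p_int, n, w)
mu = (-pow(p_int, -1, 1 << w)) % (1 << w)          # LSW of -p^{-1} mod R
a, b = 0x123456789 % p_int, 0x987654321 % p_int
c  = to_words(a * b, 2 * n, w)

res = from_words(mont_redc(c, p, mu, n, w), w)
assert res == (a * b * pow(1 << (w * n), -1, p_int)) % p_int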

Now, let us understand exactly how the pseudocode implements the reduction algorithm. Recall the first two steps of its algebraic description:

  1. $q \leftarrow \mu c \bmod R$
  2. $t \leftarrow (c + pq) / R$

Consider that c has 2n words:

$$c \leftarrow [c_0, c_1, c_2, \ldots, c_{2n-1}]$$

Also recall that the integer $c$ is the dot product of its component words and successive powers of $2^w$:

$$c = \sum_{j=0}^{2n-1} c_j \cdot 2^{wj}$$

Our goal is to reduce c to n words, one word at a time.

We use $c^{(1)}$ to represent what $c$ becomes after one loop iteration, $c^{(2)}$ after two iterations, and so on:

$$c^{(1)} \leftarrow [c_1, c_2, \ldots, c_{2n-1}]$$

$$c^{(2)} \leftarrow [c_2, \ldots, c_{2n-1}]$$

Ultimately, the result should have $n$ words:

$$c^{(n)} \leftarrow [c_n, \ldots, c_{2n-1}]$$

We need to show that after each outer loop iteration:

a. the least significant word (LSW) is zeroed out,

b. and each $c^{(i)}$ is congruent to $c^{(i+1)} \cdot 2^w \bmod p$.

Proving that the LSW is zeroed out

We compute $pq$ using the first inner loop:

$$pq \leftarrow (c_0 \mu \bmod 2^w) \cdot p$$

$pq$ is an $(n+1)$-word array, since $p$ has $n$ words and multiplying it by a single word yields $n+1$ words.

Next, recall that when we multiply a multi-digit value by a single word, the LSW of the result is the product of the LSWs of the multiplicand and the multiplier, modulo $2^w$. Also recall that $\mu p \equiv -1 \pmod{2^{wn}}$, so its LSW is $-1 \bmod 2^w$. As such, when we multiply $p$ by $q = c_0 \mu \bmod 2^w$, the LSW of $pq$ is $-c_0 \bmod 2^w$, and we can rewrite $pq$ as:

$$pq \leftarrow [-c_0 \bmod 2^w,\ pq_1, \ldots, pq_n]$$

The second inner loop adds $pq$ to $c$:

$$[c_0, \ldots] + [-c_0 \bmod 2^w,\ pq_1, \ldots, pq_n] \rightarrow [0, \ldots]$$

The LSW is zero because $c_0 - c_0 \equiv 0 \pmod{2^w}$. This corresponds to the $c - c$ cancellation in the algebraic explanation above (albeit for one out of $n$ iterations).
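
A quick numeric sanity check of this claim, using a toy word size and modulus (only oddness of the modulus is required for these identities):

w, p = 16, 10**9 + 7
mask = (1 << w) - 1
mu0 = (-pow(p, -1, 1 << w)) & mask       # -p^{-1} mod 2^w
c0 = 0xbeef
q  = (c0 * mu0) & mask
assert (q * p) & mask == (-c0) & mask    # the LSW of pq is -c_0 mod 2^w
assert (c0 + q * p) & mask == 0          # so adding pq to c zeroes the LSW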

Proving that $c^{(i)}$ is congruent to $c^{(i+1)} \cdot 2^w \bmod p$

Next, we must prove the inductive step, which shows that the lowest $n$ words of the result are zero after $n$ iterations. We can then efficiently divide by $R$ simply by retaining the remaining $n$ words.

Rewrite the high-level algorithm to reduce $c^{(i)}$ one word at a time:

  1. $q^{(i)} \leftarrow (\mu \bmod 2^w) \cdot c^{(i)}_0 \bmod 2^w$
  2. $c^{(i+1)} \leftarrow (c^{(i)} + p \cdot q^{(i)}) / 2^w$

Multiply both sides by $2^w$:

$$c^{(i+1)} \cdot 2^w = c^{(i)} + p \cdot q^{(i)}$$

To see that both sides of the above equation are congruent modulo $p$, note that $p \cdot q^{(i)}$ is an integer multiple of $p$. Rewrite the above:

$$c^{(i+1)} \cdot 2^w \equiv c^{(i)} \pmod{p}$$

This is equivalent to stating that:

$$[0, c_{i+1}, \ldots, c_{2n-1}] \equiv [c_i, c_{i+1}, \ldots, c_{2n-1}] \pmod{p}$$

which is exactly what we want to prove.
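
The invariant can also be checked numerically. The sketch below performs a few word-by-word reduction steps with toy parameters and asserts the congruence proved above:

w, p = 16, 10**9 + 7
mask = (1 << w) - 1
mu0 = (-pow(p, -1, 1 << w)) & mask
c = (123456789 % p) * (987654321 % p)
for _ in range(4):
    q = ((c & mask) * mu0) & mask
    c_next = (c + q * p) >> w              # one word of reduction
    assert (c_next << w) % p == c % p      # c^(i+1) * 2^w = c^(i) mod p
    c = c_next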

By now, the reader should be familiar with the core principles behind the various multiprecision Montgomery multiplication algorithms, such as SOS, CIOS, and FIOS, described in Acar (1996). The differences between these variants lie only in how the multiplication and reduction steps are interleaved. Their apparent complexity vanishes once one grasps how word-by-word reduction corresponds to the high-level algebraic reasoning.

The Logjumps algorithm

Now let us examine Logjumps. It uses the precomputed constants rho and μ:

| Symbol | Description | Notes |
| --- | --- | --- |
| rho | $2^{-w} \bmod p$ | The modular inverse of 2 raised to the power of the word size. It has $n$ words. It should not be confused with $R^{-1}$. Also note that although Yuval's writeup uses $\rho$ (the Greek letter rho), I will use rho instead to avoid visual confusion with the prime modulus $p$. |
| $\mu$ | The precomputed Montgomery constant. | $-p^{-1} \bmod R$, computed using the extended Euclidean algorithm. |

The pseudocode for this method, assuming that $n = 4$ and $w = 64$, is as follows:

# Calculates low * rho (a 4-word constant), returning a 5-word result.
def calc_m(low):
    carry = 0
    res = [0] * 5
    for i in range(4):
        prod = low * rho[i] + carry
        res[i] = prod & ((1 << 64) - 1)
        carry = prod >> 64
    res[4] = carry
    return res

# The Logjumps algorithm.
def mul_logjumps_sos(c):
    res = [0] * 8
    mask = (1 << 64) - 1

    # Step 1: the first Logjumps iteration
    m = calc_m(c[0])
    carry = 0
    for i in range(5):
        s = c[i + 1] + m[i] + carry
        res[i] = s & mask
        carry = s >> 64

    # Add the remaining words of c and propagate the carry into res[5..7]
    s = c[6] + carry
    res[5] = s & mask
    carry = s >> 64

    s = c[7] + carry
    res[6] = s & mask
    carry = s >> 64

    res[7] = carry
    carry = 0

    # Step 2: the second Logjumps iteration
    m = calc_m(res[0])
    carry = 0
    for i in range(5):
        s = res[i + 1] + m[i] + carry
        res[i] = s & mask
        carry = s >> 64

    # Add the remaining word and propagate the carry into res[5] and res[6]
    s = res[6] + carry
    res[5] = s & mask
    carry = s >> 64

    res[6] = carry
    carry = 0

    # Step 3: the third Logjumps iteration
    m = calc_m(res[0])
    carry = 0
    for i in range(5):
        s = res[i + 1] + m[i] + carry
        res[i] = s & mask
        carry = s >> 64

    res[5] = carry
    carry = 0

    # Step 4: use one iteration of Montgomery reduction to reduce res[] by one word
    q = (res[0] * U64_MU0) & mask
    pq = [0] * 5
    for i in range(4):
        prod = q * U64_P[i] + carry
        pq[i] = prod & mask
        carry = prod >> 64
    pq[4] = carry
    carry = 0

    for i in range(5):
        s = res[i] + pq[i] + carry
        res[i] = s & mask
        carry = s >> 64

    return [res[1], res[2], res[3], res[4]]
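
One possible way to exercise the pseudocode above. The globals rho, U64_P and U64_MU0 are assumptions about how the precomputed constants are laid out; here they are derived from an arbitrary odd modulus just below $2^{254}$, since the fixed-width carry handling above is written with moduli of roughly that size in mind:

import random

mask64 = (1 << 64) - 1
p = (1 << 253) | random.getrandbits(252) | 1       # arbitrary odd ~254-bit modulus
U64_P   = [(p >> (64 * i)) & mask64 for i in range(4)]
U64_MU0 = (-pow(p, -1, 1 << 64)) & mask64          # -p^{-1} mod 2^64
rho_int = pow(2, -64, p)                           # 2^{-64} mod p
rho     = [(rho_int >> (64 * i)) & mask64 for i in range(4)]

c_int = random.randrange(p) * random.randrange(p)
c = [(c_int >> (64 * i)) & mask64 for i in range(8)]
res = mul_logjumps_sos(c)
res_int = sum(word << (64 * i) for i, word in enumerate(res))
assert res_int % p == (c_int * pow(1 << 256, -1, p)) % p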

Let $c$ be the $2n$-word multiprecision integer we want to reduce, one $w$-bit word at a time:

$$c \leftarrow [c_0, c_1, \ldots, c_{2n-1}]$$

We can rewrite $c$ as:

$$c = H \cdot 2^w + c_0$$

where $c_0$ is the least significant word, and $H$ represents all the bits above the lowest $w$ bits:

$$H \leftarrow [c_1, \ldots, c_{2n-1}]$$

Next, compute $M$:

$$M \leftarrow c_0 \cdot \mathrm{rho}$$

and add it to the higher words, but in an aligned way: that is, we shift $M$ one word to the left and add the result to $H \cdot 2^w$:

$$(H + M) \cdot 2^w$$

This produces a value whose least significant word is zero; dropping that zero word (a right-shift by one word) is what reduces $c$ by one word.

To show that the above is correct, we need to prove the following equivalence:

$$(H + M) \cdot 2^w + 0 \equiv H \cdot 2^w + c_0 \pmod{p}$$

The proof is simple. First, we substitute the definition of $M$ (recalling that $\mathrm{rho} \equiv 2^{-w} \pmod{p}$), and simplify:

$$(H + c_0 \cdot 2^{-w}) \cdot 2^w \pmod{p}$$

Next, we expand the terms in the parentheses, and observe that we get the right-hand side as desired:

$$H \cdot 2^w + c_0 \pmod{p}$$

We can repeat this until we have $n + 1$ words, and finish with one iteration of classic Montgomery reduction, which zeroes the least significant word, to get down to $n$ words. Discarding that zeroed word yields the reduced result $cR^{-1} \bmod p$.
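
The whole procedure is easier to see with plain Python integers, where each step above is a single line. The sketch below is word-size agnostic and only illustrates the algebra; the modulus is an arbitrary example:

def logjumps_redc(c, p, w, n):
    mask = (1 << w) - 1
    rho  = pow(2, -w, p)                   # 2^{-w} mod p
    mu0  = (-pow(p, -1, 1 << w)) & mask    # -p^{-1} mod 2^w
    for _ in range(n - 1):
        c = (c >> w) + (c & mask) * rho    # c <- H + c_0 * rho
    q = ((c & mask) * mu0) & mask          # one final Montgomery iteration
    return (c + q * p) >> w                # exact division; result = c * R^{-1} mod p
                                           # (the result may still need a conditional subtraction)

w, n = 64, 4
p = 2**255 - 19                            # example odd modulus below 2^(w*n)
a, b = 3**100 % p, 5**90 % p
R = 1 << (w * n)
assert logjumps_redc(a * b, p, w, n) % p == (a * b * pow(R, -1, p)) % p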

Performance

Each outer loop iteration of the Montgomery reduction algorithm performs one multiplication to compute q, and $n$ multiplications to compute q * p[j]. Since there are $n$ outer loop iterations, the total number of multiplications is $n^2 + n$.

The Logjumps algorithm, however, only takes $n^2 + 1$ multiplications. It involves $n - 1$ invocations of calc_m() (each of which does $n$ multiplications), and a final Montgomery reduction step, which does $n + 1$ multiplications. The total number of multiplications is therefore $n(n-1) + n + 1 = n^2 + 1$.

For 256-bit primes (used in common elliptic-curve cryptographic schemes) on 64-bit CPUs, $n$ is 4, so we are looking at 17 multiplications for Logjumps versus 20 for traditional Montgomery reduction. On 32-bit CPUs, where $n$ is 8, that would be 65 versus 72 multiplications.

Yuval's writeup discusses a further optimisation to Logjumps: compute $c_0 \rho_3$, $c_1 \rho_2$, and $c_2 \rho_1$ in parallel (where $\rho_i \equiv 2^{-wi} \bmod p$), and add these three products to c[3..7] to reduce multiple words at a time. This parallelism can be exploited at the CPU pipeline level and, depending on the hardware, may yield additional performance gains.
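
As a sanity check of this idea, the sketch below folds $c_0$, $c_1$ and $c_2$ into the upper words in one go, assuming $\rho_i \equiv 2^{-wi} \bmod p$ (my reading of the description; see Yuval's writeup for the precise formulation):

w = 64
p = 2**255 - 19
rho = [pow(2, -w * i, p) for i in range(4)]         # rho_0 .. rho_3
c = (3**200 % p) * (5**180 % p)
c_words = [(c >> (w * i)) & ((1 << w) - 1) for i in range(8)]
H = c >> (3 * w)                                    # the words c[3..7]
folded = H + c_words[0] * rho[3] + c_words[1] * rho[2] + c_words[2] * rho[1]
assert (folded << (3 * w)) % p == c % p             # folded = c * 2^{-3w} mod p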

Implications

Field arithmetic is widespread in modern cryptography, so Logjumps can speed up a wide range of applications. However, its performance depends heavily on the hardware platform on which it is implemented, so one should be cautious about making concrete predictions of performance gains. For instance, the ability of a CPU to efficiently compute the independent products described above (such as $c_0 \rho_3$, and so on) may be limited by how many dedicated multipliers it has at the hardware level. Other platforms, however, such as GPUs or FPGAs, may be able to take advantage of this parallel technique, albeit with tradeoffs which would take many more articles to explore.

Further research is therefore needed to benchmark this technique in production systems and refine its implementation for different hardware platforms. Its potential for parallelism may be even more significant for very large p values. The most practical next step is simply to implement it in existing libraries so that existing applications can immediately benefit from this new technique. I hope that this research note has provided clarity and intuition to all involved.

Credits

Many thanks to Yuval Domb, Kobi Gurkan, Guillermo Angeris, and Nicolas Mohnblatt for their valuable comments and feedback on this article.