Metamorphosis III (detail) by M.C. Escher. Source: The Berkshire Museum

Logjumps is a recently discovered technique for modular reduction over large prime fields. Unlike Montgomery reduction, which requires $n^2 + n$ multiplications, Logjumps only needs $n^2 + 1$.

Although Montgomery reduction is four decades old, Yuval Domb of Ingonyama first discovered this improvement when optimising client-side proving. To our knowledge, no-one else found this method before Yuval.

This research note offers a low-level explanation of Logjumps. It aims to help cryptography engineers gain intuition not only about the algebraic principles behind Logjumps and classic Montgomery reduction, but also how these algebraic principles translate to code.

Existing literature typically focuses on either high-level concepts or low-level optimisations, often leaving a gap between theory and implementation. To bridge it, I will first present the algebra behind Montgomery reduction, then explain how these principles apply to multiprecision modular reduction. Finally, I will do the same for Logjumps, and discuss its performance.

Reference code for both algorithms can be found here.

Background

The Montgomery reduction algorithm takes an input $c \bmod p$ and returns $cR^{-1} \bmod p$. It is closely related to Montgomery multiplication, which takes $aR \bmod p$ and $bR \bmod p$, and returns $abR \bmod p$.

We use Montgomery form to describe field elements $xR \bmod p$ for any $x$, where $R$ is the Montgomery radix, a power of 2 that is greater than $p$. For a more detailed explanation, refer to this post.

This technique is essential for optimising cryptographic schemes that require consecutive multiplication-heavy field operations, such as elliptic curve arithmetic or algebraic hash functions. The performance gains from Montgomery multiplication should ideally far outweigh the overhead of converting the inputs to and from Montgomery form.

In this note, we use the following symbols:

| Symbol | Description | Notes |
| --- | --- | --- |
| $p$ | The field modulus. | A prime number. |
| $c$ | The value to reduce. | $0 \le c < p^2$ |
| $w$ | The number of bits per word. | Must fit within a single CPU-native data type, such as a 32 or 64-bit unsigned integer. |
| $n$ | The number of words. | Must be large enough that $wn$ is equal to or greater than the bit-length of $p$. |
| $R$ | The Montgomery radix. | Coprime to and greater than $p$. Must be a power of 2, usually $2^{wn}$. |
| $\mu$ | The precomputed Montgomery constant. | $-p^{-1} \bmod R$, computed using the extended Euclidean algorithm. |
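
For concreteness, here is a minimal sketch (Python 3.8+) of how $R$ and $\mu$ might be derived. The modulus $2^{255} - 19$ is just an example, and Python's built-in modular inverse stands in for the extended Euclidean algorithm.

w, n = 64, 4
R = 1 << (w * n)              # the Montgomery radix
p = 2**255 - 19               # example prime; any odd p smaller than R works here
mu = (-pow(p, -1, R)) % R     # mu = -p^{-1} mod R
assert (mu * p) % R == R - 1  # i.e. mu * p = -1 mod R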

The steps of the reduction algorithm are:

  1. $q \leftarrow \mu c \bmod R$
  2. $t \leftarrow (c + pq) / R$
    • Note: division by $R$ is a right-shift by $wn$ bits.
  3. If $t \ge p$ then $t \leftarrow t - p$.
  4. Return $t$.

Substitute the terms μ and q into steps 1 and 2:

$$t \leftarrow (c + p(-p^{-1}c)) / R$$

Since $p \cdot p^{-1} \equiv 1 \pmod{R}$, the numerator cancels modulo $R$:

$$c + p(-p^{-1}c) \equiv c - c \equiv 0 \pmod{R}$$

This means that the numerator $c + pq$ is an exact multiple of $R$, so the division in step 2 is exact. Since $R$ is coprime to $p$, dividing by $R$ corresponds to multiplying by $R^{-1} \bmod p$. In the word-based implementation, we can therefore divide by $R$ simply by discarding the lowermost $n$ words, which are all zero at that point. A final conditional subtraction is all that is left to derive $cR^{-1} \bmod p$.
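
To make this concrete, here is a hedged sketch of the four steps with plain Python integers and a toy Mersenne prime; it illustrates the algebra rather than a performant implementation.

p  = 2**61 - 1                       # toy prime
R  = 2**64                           # a power of two greater than p
mu = (-pow(p, -1, R)) % R            # -p^{-1} mod R

def mont_redc_highlevel(c):
    q = (mu * c) % R                 # step 1
    t = (c + p * q) // R             # step 2: the division is exact
    if t >= p:                       # step 3
        t -= p
    return t                         # step 4: t = c * R^{-1} mod p

a, b = 0x1234567 % p, 0x89abcdef % p
c = a * b
assert mont_redc_highlevel(c) == (c * pow(R, -1, p)) % p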

The multiprecision algorithm

To implement this algorithm in a CPU, we use arrays of unsigned integers to represent w-bit words. For instance, if w is 32, we must work with arrays of at least 32-bit values.

| Variable | Description |
| --- | --- |
| `p[n]` | The field modulus as an $n$-length array of words. |
| `c[n*2]` | The value to reduce as a $2n$-length array of words. |
| `w` | The number of bits per word. |
| `n` | The number of words. |
| `mu` | The least significant $w$-bit word of $-p^{-1} \bmod R$. |

The pseudocode which implements this algorithm looks like the following.

def mont_redc(c, p, mu, n, w):
    mask = (2 ** w) - 1
    t = list(c) # copy c to t

    for i in range(n):
        q = (t[i] * mu) & mask

        # 1st inner loop: multiply p by q, word by word, while propagating carries.
        # Note that `& mask` is equivalent to mod 2^w, and
        # `>> w` is equivalent to division by 2^w.
        pq = [0] * (n + 1)
        carry = 0
        for j in range(n):
            prod  = q * p[j] + carry
            pq[j] = prod & mask
            carry = prod >> w
        pq[n] = carry

        # 2nd inner loop: add q * p to t word by word, while propagating carries.
        carry = 0
        for j in range(n + 1):
            s        = t[i + j] + pq[j] + carry
            t[i + j] = s & mask
            carry    = s >> w

        # Propagate the carry to the rest of t.
        k = i + n + 1
        while carry > 0 and k < (2 * n):
            s     = t[k] + carry
            t[k]  = s & mask
            carry = s >> w
            k += 1

    # At this point, t begins with n zero words, followed by the n unreduced
    # words of the result.
    lhs = t[n:n*2]

    # Assume gte and sub are multiprecision greater-than-or-equal and
    # subtraction functions.
    if gte(lhs, p):
        return sub(lhs, p, w)
    else:
        return lhs
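
One possible way to exercise the pseudocode above, with quick stand-ins for gte and sub that simply round-trip through Python integers, and toy parameters (w = 16, n = 4, p = 10^9 + 7) chosen purely for illustration:

w, n = 16, 4
p_int = 10**9 + 7                                  # toy prime, well below 2**(w*n)

def to_words(x, k, w):
    return [(x >> (w * i)) & ((1 << w) - 1) for i in range(k)]

def from_words(words, w):
    return sum(word << (w * i) for i, word in enumerate(words))

def gte(a, b):
    return from_words(a, w) >= from_words(b, w)

def sub(a, b, w):
    return to_words(from_words(a, w) - from_words(b, w), len(a), w)

p  = to_words(p_int, n, w)
mu = (-pow(p_int, -1, 1 << w)) % (1 << w)          # LSW of -p^{-1} mod R
a, b = 0x123456789 % p_int, 0x987654321 % p_int
c  = to_words(a * b, 2 * n, w)

res = from_words(mont_redc(c, p, mu, n, w), w)
assert res == (a * b * pow(1 << (w * n), -1, p_int)) % p_int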

Now, let us understand exactly how the pseudocode implements the reduction algorithm. Recall the first two steps of its algebraic description:

  1. $q \leftarrow \mu c \bmod R$
  2. $t \leftarrow (c + pq) / R$

Consider that c has 2n words:

$$c \leftarrow [c_0, c_1, c_2, \ldots, c_{2n-1}]$$

Also recall that the integer $c$ is the dot product of its component words and successive powers of $2^w$:

$$c = \sum_{j=0}^{2n-1} c_j \cdot 2^{wj}$$

Our goal is to reduce c to n words, one word at a time.

We use $c^{(1)}$ to represent what $c$ becomes after one loop iteration, $c^{(2)}$ after two iterations, and so on:

$$c^{(1)} \leftarrow [c_1, c_2, \ldots, c_{2n-1}]$$

$$c^{(2)} \leftarrow [c_2, \ldots, c_{2n-1}]$$

Ultimately, the result should have $n$ words:

$$c^{(n)} \leftarrow [c_n, \ldots, c_{2n-1}]$$

We need to show that after each outer loop iteration:

a. the least significant word (LSW) is zeroed out,

b. and each $c^{(i)}$ is congruent to $c^{(i+1)} \cdot 2^w \bmod p$.

Proving that the LSW is zeroed out

We compute $pq$ using the first inner loop:

$$pq \leftarrow (c_0 \mu \bmod 2^w) \cdot p$$

$pq$ is an $(n+1)$-word array, since $p$ has $n$ words and multiplying it by a single word yields $n+1$ words.

Next, recall that when we multiply a multi-digit value by a single word, the LSW of the result is the product of the LSWs of the multiplicand and the multiplier, modulo $2^w$. Also recall that $\mu p \equiv -1 \pmod{2^{wn}}$, so its LSW is $-1 \bmod 2^w$. As such, when we multiply $p$ by $q = c_0 \mu \bmod 2^w$, the LSW of $pq$ is $-c_0 \bmod 2^w$, and we can rewrite $pq$ as:

$$pq \leftarrow [-c_0 \bmod 2^w,\ pq_1, \ldots, pq_n]$$

The second inner loop adds $pq$ to $c$:

$$[c_0, \ldots] + [-c_0 \bmod 2^w,\ pq_1, \ldots, pq_n] \rightarrow [0, \ldots]$$

The LSW is zero because $c_0 - c_0 \equiv 0 \pmod{2^w}$. This corresponds to the $c - c$ cancellation in the algebraic explanation above (albeit for one out of $n$ iterations).
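
A quick numeric sanity check of this claim, using a toy word size and modulus (only oddness of the modulus is required for these identities):

w, p = 16, 10**9 + 7
mask = (1 << w) - 1
mu0 = (-pow(p, -1, 1 << w)) & mask       # -p^{-1} mod 2^w
c0 = 0xbeef
q  = (c0 * mu0) & mask
assert (q * p) & mask == (-c0) & mask    # the LSW of pq is -c_0 mod 2^w
assert (c0 + q * p) & mask == 0          # so adding pq to c zeroes the LSW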

Proving that $c^{(i)}$ is congruent to $c^{(i+1)} \cdot 2^w \bmod p$

Next, we must prove the inductive step, which shows that the lowest $n$ words of the result are zero after $n$ iterations. We can then efficiently divide by $R$ simply by retaining the remaining $n$ words.

Rewrite the high-level algorithm to reduce $c^{(i)}$ one word at a time:

  1. $q^{(i)} \leftarrow (\mu \bmod 2^w) \cdot c^{(i)}_0 \bmod 2^w$
  2. $c^{(i+1)} \leftarrow (c^{(i)} + p \cdot q^{(i)}) / 2^w$

Multiply both sides by $2^w$:

$$c^{(i+1)} \cdot 2^w = c^{(i)} + p \cdot q^{(i)}$$

To see that both sides of the above equation are congruent modulo $p$, note that $p \cdot q^{(i)}$ is an integer multiple of $p$. Rewrite the above:

$$c^{(i+1)} \cdot 2^w \equiv c^{(i)} \pmod{p}$$

This is equivalent to stating that:

$$[0, c_{i+1}, \ldots, c_{2n-1}] \equiv [c_i, c_{i+1}, \ldots, c_{2n-1}] \pmod{p}$$

which is exactly what we want to prove.
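
The invariant can also be checked numerically. The sketch below performs a few word-by-word reduction steps with toy parameters and asserts the congruence proved above:

w, p = 16, 10**9 + 7
mask = (1 << w) - 1
mu0 = (-pow(p, -1, 1 << w)) & mask
c = (123456789 % p) * (987654321 % p)
for _ in range(4):
    q = ((c & mask) * mu0) & mask
    c_next = (c + q * p) >> w              # one word of reduction
    assert (c_next << w) % p == c % p      # c^(i+1) * 2^w = c^(i) mod p
    c = c_next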

By now, the reader should be familiar with the core principles behind the various multiprecision Montgomery multiplication algorithms, such as SOS, CIOS, and FIOS, described in Acar (1996). The differences between these variants lie only in how the multiplication and reduction steps are interleaved. Their apparent complexity vanishes once one grasps how word-by-word reduction corresponds to the high-level algebraic reasoning.

The Logjumps algorithm

Now let us examine Logjumps. It uses the precomputed constants rho and μ:

| Symbol | Description | Notes |
| --- | --- | --- |
| rho | $2^{-w} \bmod p$ | The modular inverse of 2 raised to the power of the word size. It has $n$ words. It should not be confused with $R^{-1}$. Also note that although Yuval's writeup uses $\rho$ (the Greek letter rho), I will use rho instead to avoid visual confusion with the prime modulus $p$. |
| $\mu$ | The precomputed Montgomery constant. | $-p^{-1} \bmod R$, computed using the extended Euclidean algorithm. |

The pseudocode for this method, assuming that $n = 4$ and $w = 64$, is as follows:

# Calculates low * rho (a 4-word constant), returning a 5-word result.
def calc_m(low):
    carry = 0
    res = [0] * 5
    for i in range(4):
        prod = low * rho[i] + carry
        res[i] = prod & ((1 << 64) - 1)
        carry = prod >> 64
    res[4] = carry
    return res

# The Logjumps algorithm.
def mul_logjumps_sos(c):
    res = [0] * 8
    mask = (1 << 64) - 1

    # Step 1: the first Logjumps iteration
    m = calc_m(c[0])
    carry = 0
    for i in range(5):
        s = c[i + 1] + m[i] + carry
        res[i] = s & mask
        carry = s >> 64

    # Add the remaining words of c and propagate the carry into res[5..7]
    s = c[6] + carry
    res[5] = s & mask
    carry = s >> 64

    s = c[7] + carry
    res[6] = s & mask
    carry = s >> 64

    res[7] = carry
    carry = 0

    # Step 2: the second Logjumps iteration
    m = calc_m(res[0])
    carry = 0
    for i in range(5):
        s = res[i + 1] + m[i] + carry
        res[i] = s & mask
        carry = s >> 64

    # Add the remaining word and propagate the carry into res[5] and res[6]
    s = res[6] + carry
    res[5] = s & mask
    carry = s >> 64

    res[6] = carry
    carry = 0

    # Step 3: the third Logjumps iteration
    m = calc_m(res[0])
    carry = 0
    for i in range(5):
        s = res[i + 1] + m[i] + carry
        res[i] = s & mask
        carry = s >> 64

    res[5] = carry
    carry = 0

    # Step 4: use one iteration of Montgomery reduction to reduce res[] by one word
    q = (res[0] * U64_MU0) & mask
    pq = [0] * 5
    for i in range(4):
        prod = q * U64_P[i] + carry
        pq[i] = prod & mask
        carry = prod >> 64
    pq[4] = carry
    carry = 0

    for i in range(5):
        s = res[i] + pq[i] + carry
        res[i] = s & mask
        carry = s >> 64

    return [res[1], res[2], res[3], res[4]]
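
One possible way to exercise the pseudocode above. The globals rho, U64_P and U64_MU0 are assumptions about how the precomputed constants are laid out; here they are derived from an arbitrary odd modulus just below $2^{254}$, since the fixed-width carry handling above is written with moduli of roughly that size in mind:

import random

mask64 = (1 << 64) - 1
p = (1 << 253) | random.getrandbits(252) | 1       # arbitrary odd ~254-bit modulus
U64_P   = [(p >> (64 * i)) & mask64 for i in range(4)]
U64_MU0 = (-pow(p, -1, 1 << 64)) & mask64          # -p^{-1} mod 2^64
rho_int = pow(2, -64, p)                           # 2^{-64} mod p
rho     = [(rho_int >> (64 * i)) & mask64 for i in range(4)]

c_int = random.randrange(p) * random.randrange(p)
c = [(c_int >> (64 * i)) & mask64 for i in range(8)]
res = mul_logjumps_sos(c)
res_int = sum(word << (64 * i) for i, word in enumerate(res))
assert res_int % p == (c_int * pow(1 << 256, -1, p)) % p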

Let $c$ be the $2n$-word multiprecision integer we want to reduce, one $w$-bit word at a time:

$$c \leftarrow [c_0, c_1, \ldots, c_{2n-1}]$$

We can rewrite $c$ as:

$$c = H \cdot 2^w + c_0$$

where $c_0$ is the least significant word, and $H$ represents all the bits above the lowest $w$ bits:

$$H \leftarrow [c_1, \ldots, c_{2n-1}]$$

Next, compute $M$:

$$M \leftarrow c_0 \cdot \mathrm{rho}$$

and add it to the higher words, but in an aligned way: that is, we shift $M$ one word to the left and add the result to $H \cdot 2^w$:

$$(H + M) \cdot 2^w$$

This produces a value whose least significant word is zero; dropping that zero word (a right-shift by one word) is what reduces $c$ by one word.

To show that the above is correct, we need to prove the following equivalence:

$$(H + M) \cdot 2^w + 0 \equiv H \cdot 2^w + c_0 \pmod{p}$$

The proof is simple. First, we substitute the definition of $M$ (recalling that $\mathrm{rho} \equiv 2^{-w} \pmod{p}$), and simplify:

$$(H + c_0 \cdot 2^{-w}) \cdot 2^w \pmod{p}$$

Next, we expand the terms in the parentheses, and observe that we get the right-hand side as desired:

$$H \cdot 2^w + c_0 \pmod{p}$$

We can repeat this until we have $n + 1$ words, and finish with one iteration of classic Montgomery reduction, which zeroes the least significant word, to get down to $n$ words. Discarding that zeroed word yields the reduced result $cR^{-1} \bmod p$.
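
The whole procedure is easier to see with plain Python integers, where each step above is a single line. The sketch below is word-size agnostic and only illustrates the algebra; the modulus is an arbitrary example:

def logjumps_redc(c, p, w, n):
    mask = (1 << w) - 1
    rho  = pow(2, -w, p)                   # 2^{-w} mod p
    mu0  = (-pow(p, -1, 1 << w)) & mask    # -p^{-1} mod 2^w
    for _ in range(n - 1):
        c = (c >> w) + (c & mask) * rho    # c <- H + c_0 * rho
    q = ((c & mask) * mu0) & mask          # one final Montgomery iteration
    return (c + q * p) >> w                # exact division; result = c * R^{-1} mod p
                                           # (the result may still need a conditional subtraction)

w, n = 64, 4
p = 2**255 - 19                            # example odd modulus below 2^(w*n)
a, b = 3**100 % p, 5**90 % p
R = 1 << (w * n)
assert logjumps_redc(a * b, p, w, n) % p == (a * b * pow(R, -1, p)) % p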

Performance

Each outer loop iteration of the Montgomery reduction algorithm performs one multiplication to compute q, and $n$ multiplications to compute q * p[j]. Since there are $n$ outer loop iterations, the total number of multiplications is $n^2 + n$.

The Logjumps algorithm, however, only takes $n^2 + 1$ multiplications. It involves $n - 1$ invocations of calc_m() (each of which does $n$ multiplications), and a final Montgomery reduction step, which does $n + 1$ multiplications. The total number of multiplications is therefore $n(n-1) + n + 1 = n^2 + 1$.

For 256-bit primes (used in common elliptic-curve cryptographic schemes) on 64-bit CPUs, $n$ is 4, so we are looking at 17 multiplications for Logjumps versus 20 for traditional Montgomery reduction. On 32-bit CPUs, where $n$ is 8, that would be 65 versus 72 multiplications.

Yuval's writeup discusses a further optimisation to Logjumps: compute $c_0 \rho_3$, $c_1 \rho_2$, and $c_2 \rho_1$ in parallel (where $\rho_i \equiv 2^{-wi} \bmod p$), and add these three products to c[3..7] to reduce multiple words at a time. This parallelism can be exploited at the CPU pipeline level and, depending on the hardware, may yield additional performance gains.
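
As a sanity check of this idea, the sketch below folds $c_0$, $c_1$ and $c_2$ into the upper words in one go, assuming $\rho_i \equiv 2^{-wi} \bmod p$ (my reading of the description; see Yuval's writeup for the precise formulation):

w = 64
p = 2**255 - 19
rho = [pow(2, -w * i, p) for i in range(4)]         # rho_0 .. rho_3
c = (3**200 % p) * (5**180 % p)
c_words = [(c >> (w * i)) & ((1 << w) - 1) for i in range(8)]
H = c >> (3 * w)                                    # the words c[3..7]
folded = H + c_words[0] * rho[3] + c_words[1] * rho[2] + c_words[2] * rho[1]
assert (folded << (3 * w)) % p == c % p             # folded = c * 2^{-3w} mod p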

Implications

Field arithmetic is widespread in modern cryptography, so Logjumps can speed up a wide range of applications. However, its performance depends heavily on the hardware platform on which it is implemented, so one should be cautious about making concrete predictions of performance gains. For instance, the ability of a CPU to efficiently compute the independent products described above (such as $c_0 \rho_3$, and so on) may be limited by how many dedicated multipliers it has at the hardware level. Other platforms, however, such as GPUs or FPGAs, may be able to take advantage of this parallel technique, albeit with tradeoffs which would take many more articles to explore.

Further research is therefore needed to benchmark this technique in production systems and refine its implementation for different hardware platforms. Its potential for parallelism may be even more significant for very large p values. The most practical next step is simply to implement it in existing libraries so that existing applications can immediately benefit from this new technique. I hope that this research note has provided clarity and intuition to all involved.

Credits

Many thanks to Yuval Domb, Kobi Gurkan, Guillermo Angeris, and Nicolas Mohnblatt for their valuable comments and feedback on this article.