A Deep Dive into Logjumps: a Faster Modular Reduction Algorithm

  • Koh Wei Jie
  • Cryptography,
  • Research
06.10.25
Metamorphosis III (detail) by M.C. Escher. Source: The Berkshire Museum

Logjumps is a recently discovered technique for modular reduction over large prime fields. Unlike Montgomery reduction, which requires $n^2 + n$ multiplications, Logjumps only needs $n^2 + 1$.

Although Montgomery reduction is four decades old, this improvement was only discovered recently, by Yuval Domb of Ingonyama, while optimising client-side proving. To our knowledge, no one had found this method before.

This research note offers a low-level explanation of Logjumps. It aims to help cryptography engineers gain intuition not only about the algebraic principles behind Logjumps and classic Montgomery reduction, but also how these algebraic principles translate to code.

Existing literature typically focuses on either high-level concepts or low-level optimisations, often leaving a gap between theory and implementation. To bridge it, I will first present the algebra behind Montgomery reduction, then explain how these principles apply to multiprecision modular reduction. Finally, I will do the same for Logjumps, and discuss its performance.

Reference code for both algorithms can be found here.

Background

The Montgomery reduction algorithm takes an input $\mathbf{c}$ and returns $\mathbf{c}R^{-1} \mod p$. It is closely related to Montgomery multiplication, which takes inputs $aR, bR \mod p$ and returns $abR \mod p$.

We use Montgomery form to describe field elements $xR \mod p$ for any $x$, where $R$ is the Montgomery radix, a power of 2 that is greater than $p$. For a more detailed explanation, refer to this post.

This technique is essential for optimising cryptographic schemes that require consecutive multiplication-heavy field operations, such as elliptic curve arithmetic or algebraic hash functions. The performance gains from Montgomery multiplication should ideally far outweigh the overhead of converting the inputs to and from Montgomery form.
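As a quick illustration, the round trip into and out of Montgomery form can be sketched with Python big integers (the toy prime and values below are assumptions for illustration only):

```python
# Toy parameters, assumed for illustration only.
p = 65521                   # a 16-bit prime
R = 1 << 16                 # Montgomery radix: a power of 2 greater than p

# Converting into Montgomery form:
x = 12345
xR = (x * R) % p

# Multiplying two Montgomery-form values and then multiplying by R^{-1}
# (which is what Montgomery reduction computes) stays in Montgomery form:
a, b = 111, 222
aR, bR = (a * R) % p, (b * R) % p
abR = (aR * bR * pow(R, -1, p)) % p
assert abR == (a * b * R) % p

# Converting back out of Montgomery form:
assert (xR * pow(R, -1, p)) % p == x
```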

In this note, we use the following symbols:

  • $p$: The field modulus. A prime number.
  • $\mathbf{c}$: The value to reduce. $0 \le \mathbf{c} < p^2$.
  • $w$: The number of bits per word. Must fit within a single CPU-native data type, such as a 32 or 64-bit unsigned integer.
  • $n$: The number of words. Must be large enough that $wn$ is at least the bit-length of $p$.
  • $R$: The Montgomery radix. Coprime to and greater than $p$. Must be a power of 2, usually $2^{wn}$.
  • $\mu$: The precomputed Montgomery constant, $-p^{-1} \mod R$, computed using the extended Euclidean algorithm.

The steps of the reduction algorithm are:

  1. $q \leftarrow \mu \mathbf{c} \mod R$
  2. $t \leftarrow (\mathbf{c} + pq) / R$
    • Note: division by $R$ is a right-shift by $wn$ bits.
  3. If $t \ge p$ then $t \leftarrow t - p$.
  4. Return $t$.
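These steps translate directly into Python big-integer arithmetic (a sketch; the toy prime and values are assumptions for illustration):

```python
def mont_redc_highlevel(c, p, R):
    # mu = -p^{-1} mod R, precomputed via the extended Euclidean algorithm.
    # (Python's three-argument pow computes modular inverses directly.)
    mu = (-pow(p, -1, R)) % R
    q = (mu * c) % R          # step 1
    t = (c + p * q) // R      # step 2: the division is exact
    if t >= p:                # step 3: conditional subtraction
        t -= p
    return t                  # = c * R^{-1} mod p

# Toy check with a small prime:
p = 101
R = 1 << 8                    # power of 2 greater than p
c = 57 * 91                   # a product of two field elements
assert mont_redc_highlevel(c, p, R) == (c * pow(R, -1, p)) % p
```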

Substitute the terms $\mu$ and $q$ into steps 1 and 2:

$t \leftarrow (\mathbf{c} + p(-p^{-1} \mathbf{c} \mod R)) / R$

Since $p \cdot -p^{-1} \equiv -1 \mod R$:

$\mathbf{c} + pq \equiv \mathbf{c} - \mathbf{c} \equiv 0 \mod R$

This means that $\mathbf{c} + pq$ is an exact multiple of $R$, so the division in step 2 is exact, and we can perform it by simply discarding the lowermost $n$ words. Moreover, since $pq$ is a multiple of $p$, we have $t \equiv \mathbf{c}R^{-1} \mod p$. A final conditional subtraction is all that is left to derive $\mathbf{c}R^{-1} \mod p$.

The multiprecision algorithm

To implement this algorithm on a CPU, we use arrays of unsigned integers to represent $w$-bit words. For instance, if $w$ is 32, we work with arrays of 32-bit (or larger) unsigned integers.

  • p[n]: The field modulus as an $n$-length array.
  • c[n*2]: The value to reduce as a $2n$-length array.
  • w: The number of bits per word.
  • n: The number of words.
  • mu: The least significant $w$-bit word of $-p^{-1} \mod R$.

The pseudocode which implements this algorithm looks like the following.

def mont_redc(c, p, mu, n, w):
    mask = (1 << w) - 1
    t = list(c)  # copy c to t

    for i in range(n):
        q = (t[i] * mu) & mask

        # 1st inner loop: multiply p by q, word by word, while propagating
        # carries. Note that `& mask` is equivalent to mod 2^w, and
        # `>> w` is equivalent to division by 2^w.
        pq = [0] * (n + 1)
        carry = 0
        for j in range(n):
            prod  = q * p[j] + carry
            pq[j] = prod & mask
            carry = prod >> w
        pq[n] = carry

        # 2nd inner loop: add q * p to t word by word, while propagating
        # carries.
        carry = 0
        for j in range(n + 1):
            s        = t[i + j] + pq[j] + carry
            t[i + j] = s & mask
            carry    = s >> w

        # Propagate the carry to the rest of t.
        k = i + n + 1
        while carry > 0 and k < 2 * n:
            s     = t[k] + carry
            t[k]  = s & mask
            carry = s >> w
            k += 1

    # At this point, t begins with n zero words, followed by the n unreduced
    # words of the result.
    lhs = t[n:n * 2]

    # gte and sub are multiprecision greater-than-or-equal and subtraction
    # functions.
    if gte(lhs, p):
        return sub(lhs, p, w)
    else:
        return lhs
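The gte and sub helpers assumed above can be sketched as follows (illustrative implementations, not from the original post; both take little-endian word arrays):

```python
def gte(a, b):
    # Multiprecision greater-than-or-equal: compare equal-length
    # little-endian word arrays from the most significant word down.
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            return x > y
    return True

def sub(a, b, w):
    # Multiprecision subtraction a - b with w-bit words, assuming a >= b.
    mask = (1 << w) - 1
    res = []
    borrow = 0
    for x, y in zip(a, b):
        d = x - y - borrow
        borrow = 1 if d < 0 else 0
        res.append(d & mask)
    return res
```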

Now, let us understand exactly how the pseudocode implements the reduction algorithm. Recall the first two steps of its algebraic description:

  1. $q \leftarrow \mu \mathbf{c} \mod R$
  2. $t \leftarrow (\mathbf{c} + pq) / R$

Consider that $\mathbf{c}$ has $2n$ words:

$\mathbf{c} \equiv [c_0, c_1, c_2, \cdots, c_{2n-1}]$

Also recall that the integer $\mathbf{c}$ is the dot product of its component words and powers of $2^w$:

$\sum_{j=0}^{2n-1}c_j(2^{wj})$

Our goal is to reduce $\mathbf{c}$ to $n$ words, one word at a time.

We use $\mathbf{c}^{(1)}$ to represent what $\mathbf{c}$ becomes after one loop iteration, $\mathbf{c}^{(2)}$ after two iterations, and so on:

$\mathbf{c}^{(1)} \equiv [c_1, c_2, \cdots, c_{2n-1}]$

$\mathbf{c}^{(2)} \equiv [c_2, \cdots, c_{2n-1}]$

Ultimately, the result should have $n$ words:

$\mathbf{c}^{(n)} \equiv [c_n, \cdots, c_{2n-1}]$

We need to show that after each outer loop iteration:

a. the least significant word (LSW) is zeroed out

b. and each $\mathbf{c^{(i)}}$ is congruent to $\mathbf{c^{(i + 1)}} \mod p$.

Proving that the LSW is zeroed out

We compute pq using the first inner loop:

$pq \leftarrow ((c_0 \cdot \mu) \mod 2^w) \cdot p$

$pq$ is an $(n+1)$-word array, since $p$ has $n$ words and multiplying it by a single word yields $n+1$ words.

Next, recall that when we multiply a multi-digit value by a single word, the LSW of the result is the product of the LSWs of the multiplicand and the multiplier, modulo $2^w$. Also recall that $\mu p$ equals $-1 \mod 2^{wn}$, so its LSW is $-1 \mod 2^w$. As such, when we multiply $(\mu \mod 2^w) \cdot p$ by $c_0$, we can rewrite $pq$ as:

$pq \equiv [-c_0, pq_{1}, \cdots, pq_{n}]$

The second inner loop adds $pq$ to $\mathbf{c}$:

$[c_0, \cdots] + [-c_0, pq_{1}, \cdots, pq_{n}] \equiv [0, pq_{1}, \cdots, pq_{n}]$

The LSW is zero because $c_0 - c_0 \equiv 0 \mod 2^w$. This corresponds to $\mathbf{c} - \mathbf{c}$ in the algebraic explanation above (albeit for one out of $n$ iterations).
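This zeroing behaviour can be checked numerically with assumed toy parameters ($w = 8$ and a 16-bit prime):

```python
w = 8
p = 65521                                # toy 16-bit prime (assumed)
mask = (1 << w) - 1
mu = (-pow(p, -1, 1 << w)) % (1 << w)    # LSW of -p^{-1} mod R
c = 12345 * 6789                         # value to reduce
q = ((c & mask) * mu) & mask             # q for the first iteration
assert (c + q * p) & mask == 0           # the LSW of c + pq is zero
```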

Proving that $\mathbf{c^{(i)}}$ is congruent to $\mathbf{c^{(i + 1)}} \mod p$

Next, we must prove the inductive step to show that our result has $n$ zeros after $n$ iterations. We can then efficiently divide by $R$ by retaining the remaining $n$ words.

Rewrite the high-level algorithm to reduce $\mathbf{c^{(i)}}$ one word at a time:

  1. $q^{(i)} \leftarrow (\mu \mod 2^w)({c^{(i)}}_0) \mod 2^w$
  2. $\mathbf{c}^{(i+1)} \leftarrow (\mathbf{c^{(i)}} + pq^{(i)}) / 2^w$

Multiply both sides by $2^w$:

$\mathbf{c}^{(i+1)} \cdot 2^w = \mathbf{c^{(i)}} + pq^{(i)}$

To see that both sides of the above equation are congruent modulo $p$, note that $pq^{(i)}$ is a multiple of $p$. Rewrite the above:

$\mathbf{c}^{(i+1)} \cdot 2^w \equiv \mathbf{c^{(i)}} \mod p$

This is equivalent to stating that:

$[0, c_{i + 1}, \cdots, c_{2n-1}] \equiv [c_i, c_{i + 1}, \cdots, c_{2n-1}] \mod p$

which is exactly what we want to prove.
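The inductive step can also be checked numerically, with big integers standing in for the word arrays (toy parameters assumed):

```python
w, p, n = 8, 65521, 2                # toy parameters (assumed)
R_w = 1 << w
mu = (-pow(p, -1, R_w)) % R_w
c0 = 12345 * 6789
c = c0
for _ in range(n):
    q = ((c % R_w) * mu) % R_w
    c = (c + q * p) // R_w           # exact division: the LSW was zeroed
# After n iterations, c is congruent to c0 * R^{-1} mod p, where R = 2^(wn).
assert c % p == (c0 * pow(R_w ** n, -1, p)) % p
```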

By now, the reader should be familiar with the core principles behind various multiprecision Montgomery multiplication algorithms, such as SOS, CIOS, and FIOS, described in Acar (1996). These variants differ only in how the reduction and addition steps are interleaved. Their apparent complexity vanishes once one grasps how word-by-word reduction corresponds to the high-level algebraic reasoning.

The Logjumps algorithm

Now let us examine Logjumps. It uses the precomputed constants $\mathbf{rho}$ and $\mu$:

  • $\mathbf{rho}$: $2^{-w} \mod p$, the inverse of two raised to the power of the word size, stored as $n$ words. It should not be confused with $R^{-1}$. Although Yuval's writeup uses $\rho$ (the Greek letter rho), I use $\mathbf{rho}$ instead to avoid visual confusion with the prime modulus $p$.
  • $\mu$: The precomputed Montgomery constant, $-p^{-1} \mod R$, computed using the extended Euclidean algorithm.

The pseudocode for this method, assuming that $n=4$ and $w=64$, is as follows:

# Calculates low * rho, returning a 5-word result.
# rho, U64_P, and U64_MU0 are precomputed constants (see the table above).
def calc_m(low):
    carry = 0
    res = [0] * 5
    for i in range(4):
        prod = low * rho[i] + carry
        res[i] = prod & ((1 << 64) - 1)
        carry = prod >> 64
    res[4] = carry
    return res

# The Logjumps algorithm.
def mul_logjumps_sos(c):
    res = [0] * 8
    mask = (1 << 64) - 1

    # Step 1: the first Logjumps iteration
    m = calc_m(c[0])
    carry = 0
    for i in range(5):
        s = c[i + 1] + m[i] + carry
        res[i] = s & mask
        carry = s >> 64

    # Add the remaining words of c and propagate carries
    s = c[6] + carry
    res[5] = s & mask
    carry = s >> 64

    s = c[7] + carry
    res[6] = s & mask
    carry = s >> 64

    res[7] = carry

    # Step 2: the second Logjumps iteration
    m = calc_m(res[0])
    carry = 0
    for i in range(5):
        s = res[i + 1] + m[i] + carry
        res[i] = s & mask
        carry = s >> 64

    # Add the remaining word and propagate the carry
    s = res[6] + carry
    res[5] = s & mask
    carry = s >> 64

    res[6] = carry

    # Step 3: the third Logjumps iteration
    m = calc_m(res[0])
    carry = 0
    for i in range(5):
        s = res[i + 1] + m[i] + carry
        res[i] = s & mask
        carry = s >> 64

    res[5] = carry

    # Step 4: use one iteration of Montgomery reduction to reduce res by one word
    q = (res[0] * U64_MU0) & mask
    pq = [0] * 5
    carry = 0
    for i in range(4):
        prod = q * U64_P[i] + carry
        pq[i] = prod & mask
        carry = prod >> 64
    pq[4] = carry

    carry = 0
    for i in range(5):
        s = res[i] + pq[i] + carry
        res[i] = s & mask
        carry = s >> 64

    # res[0] is now zero; the result is res[1..4], which may still require a
    # final conditional subtraction of p if a fully reduced value is needed.
    return [res[1], res[2], res[3], res[4]]
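The globals rho, U64_P, and U64_MU0 assumed by the pseudocode can be precomputed as follows (a sketch; the BN254 scalar field modulus is used purely as an illustrative 254-bit prime):

```python
# Illustrative 254-bit prime: the BN254 scalar field modulus.
P_INT = 21888242871839275222246405745257275088548364400416034343698204186575808495617
W, N = 64, 4
MASK = (1 << W) - 1

def to_words(x):
    # Split a big integer into N little-endian W-bit words.
    return [(x >> (W * i)) & MASK for i in range(N)]

U64_P = to_words(P_INT)
rho = to_words(pow(1 << W, -1, P_INT))            # 2^{-64} mod p, as N words
U64_MU0 = (-pow(P_INT, -1, 1 << W)) % (1 << W)    # LSW of -p^{-1} mod R
```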

Let $\mathbf{c}$ be the $2n$-word multiprecision integer we want to reduce, one $w$-bit word at a time.

$\mathbf{c} \equiv [c_0, c_1, \cdots, c_{2n-1}]$.

We can rewrite $\mathbf{c}$ as such:

$\mathbf{c} \equiv H\cdot 2^w + c_0$

where $c_0$ is the least significant word, and $H$ represents all bits higher than $w$.

$H \equiv [c_1, \cdots, c_{2n-1}]$

Next, compute $M$:

$M \leftarrow c_0 \cdot \mathbf{rho}$

and add it to the higher words, but in an aligned way: that is, we shift $M$ one word to the left and add the result to $H \cdot 2^w$:

$(H + M) \cdot 2^w$

This sets $c_0$ to 0, and shifts right by one word, which reduces $\mathbf{c}$ by one word.

To show that the above is correct, we need to prove the following equivalence:

$(H + M) \cdot 2^w \equiv H \cdot 2^w + c_0 \mod p$

The proof is simple. First, we substitute the definition of $M$, and simplify:

$(H + c_0 \cdot 2^{-w}) \cdot 2^w \mod p$

Next, we expand the terms in the parentheses, and observe that we get the right-hand side as desired:

$H \cdot 2^w + c_0 \mod p$

We can repeat this until we have $n+1$ words, then finish with one iteration of classic Montgomery reduction, which zeroes out the least significant word. Discarding that word leaves the $n$-word result $\mathbf{c}R^{-1} \mod p$ (up to a final conditional subtraction, if a fully reduced value is required).
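The full reduction just described, $n-1$ Logjumps iterations followed by one Montgomery iteration, can be sketched with Python big integers (toy parameters assumed; the final modulo stands in for the conditional subtraction):

```python
def logjumps_redc(c, p, n, w):
    rho = pow(1 << w, -1, p)             # 2^{-w} mod p
    mask = (1 << w) - 1
    # n - 1 Logjumps iterations: c <- H + c_0 * rho, where c = H*2^w + c_0
    for _ in range(n - 1):
        c = (c >> w) + (c & mask) * rho  # one multiprecision mul by rho
    # One final Montgomery iteration to zero out the last remaining word.
    mu = (-pow(p, -1, 1 << w)) % (1 << w)
    q = ((c & mask) * mu) & mask
    c = (c + q * p) >> w                 # exact division by 2^w
    return c % p                         # stands in for the conditional subtraction

# Toy check: reducing c yields c * R^{-1} mod p, where R = 2^(wn).
p, n, w = 65521, 2, 8
c = 12345 * 6789
assert logjumps_redc(c, p, n, w) == (c * pow(1 << (n * w), -1, p)) % p
```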

Performance

Each iteration of the outer loop of the Montgomery reduction algorithm performs one multiplication to compute q, and $n$ multiplications to compute q * p[j]. Since there are $n$ iterations, the total number of multiplications is $n^2 + n$.

The Logjumps algorithm, however, only takes $n^2 + 1$ multiplications. It involves $n-1$ invocations of calc_m() (each of which does $n$ multiplications), and a final Montgomery reduction step, which does $n + 1$ multiplications. The total number of multiplications is therefore $n(n-1) + n + 1 = n^2 + 1$.

For 256-bit primes (used in common elliptic-curve cryptographic schemes) on 64-bit CPUs, where $n$ is 4, we are looking at 17 multiplications for Logjumps versus 20 for traditional Montgomery reduction. On 32-bit CPUs, where $n$ is 8, that would be 65 versus 72 multiplications.
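These counts can be tabulated directly:

```python
def montgomery_muls(n):
    # n outer iterations, each with 1 mul for q and n muls for q * p.
    return n * n + n

def logjumps_muls(n):
    # (n - 1) calc_m invocations of n muls each, plus a final Montgomery
    # iteration costing n + 1 muls.
    return n * (n - 1) + n + 1

# 64-bit CPUs (n = 4) and 32-bit CPUs (n = 8) for a 256-bit prime:
assert (logjumps_muls(4), montgomery_muls(4)) == (17, 20)
assert (logjumps_muls(8), montgomery_muls(8)) == (65, 72)
```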

Yuval's writeup discusses a further optimisation to Logjumps: computing $c_0 \cdot r^{-3}$, $c_1 \cdot r^{-2}$, and $c_2 \cdot r^{-1}$ (where $r = 2^w$) in parallel, and adding these three products to $c[3\cdots7]$ to reduce multiple words at a time. This parallelism can be exploited at the CPU pipeline level and, depending on the hardware, may yield additional performance gains.

Implications

Field arithmetic is widespread in modern cryptography, so Logjumps can speed up a wide range of applications. However, its performance depends heavily on the hardware platform, so one should be cautious about making concrete predictions of performance gains. For instance, the ability of a CPU to efficiently compute the independent products described above (such as $c_0 \cdot r^{-3}$) may be limited by how many dedicated multipliers it has at the hardware level. Other platforms, such as GPUs or FPGAs, may be better able to take advantage of this parallel technique, albeit with tradeoffs that would take many more articles to explore.

Further research is therefore needed to benchmark this technique in production systems and refine its implementation for different hardware platforms. Its potential for parallelism may be even more significant for very large $p$ values. The most practical next step is simply to implement it in existing libraries so that applications can immediately benefit from this new technique. I hope that this research note has provided clarity and intuition.

Credits

Many thanks to Yuval Domb, Kobi Gurkan, Guillermo Angeris, and Nicolas Mohnblatt for their valuable comments and feedback on this article.
