Permutation Proofs Déjà Vu
Modern zero-knowledge proofs are revisiting 20-year-old ideas
If you have been working in zero-knowledge cryptography for a long time, the recent breakthroughs in zkVM design might give you a distinct feeling of déjà vu.
In 2025, new research — ST25: Twist & Shout, BCD25: BiPerm & MulPerm, CTHK25 — shattered the status quo for prover efficiency in permutation and lookup arguments. This is important because permutation and lookup arguments are among the most costly components of modern zkVMs, consuming up to half of the prover's computation time.
The idea they explore is to represent relations between two vectors of elements as Permutation Matrices or Lookup Matrices. Their core contributions address the critical questions: how do you exploit sparseness to commit to such a large matrix, and how do you efficiently prove it is valid?
For veterans in the field of shuffle proofs, this is a "Wait, haven't we tried this before?" moment. It turns out the evolution of permutation arguments used in zkVMs almost perfectly mirrors the evolution of shuffle proofs from twenty years ago. We are cycling through the same three core ideas for ensuring a permutation relation — Switching Network, Grand Product Polynomial Identity, and Permutation Matrix—albeit with a new engine under the hood.
Here is the story of how we are going full circle.
Before we talk about modern zkVMs, let us look back at mix-nets in the 1990s and early 2000s and in particular a core building block they used: shuffle proofs.
In a mix-net, which I encountered in my early career research on e-voting, servers shuffle encrypted messages, so nobody can trace them back to the senders. The mix-net is run by a set of mix-servers that each do a shuffle, one after the other. The most common encryption scheme used in mix-nets is ElGamal encryption, which is homomorphic and easy to rerandomize: a fresh ciphertext with the same plaintext but new randomness looks unrelated to the original.
A server gets a batch of $n$ input ciphertexts, permutes them, and rerandomizes the encryptions to produce an output batch of $n$ ciphertexts. An honest server will pick the permutation at random and rerandomize well to hide the relation between input ciphertexts and output ciphertexts.
We say that the output ciphertexts are a shuffle of the input ciphertexts when such a permutation-and-rerandomization relation holds between them. A vanilla mix-net is not secure: a malicious server can easily swap your ciphertext for one of its own choosing. To prevent tampering, each server must provide a shuffle proof: a zero-knowledge proof that the server has taken a batch of input ciphertexts, permuted their order, and then rerandomized the encryptions.
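To make the rerandomization step concrete, with ElGamal under public key $h=g^x$ a ciphertext and its rerandomization look as follows (the symbols $\mathrm{Rerand}$ and $\pi$ are just illustrative notation, not taken from any particular paper): $$\mathrm{Enc}_{pk}(m;r)=(g^r,\, m\cdot h^r), \qquad \mathrm{Rerand}\big((c_1,c_2);s\big)=(c_1\cdot g^s,\, c_2\cdot h^s),$$ so rerandomizing $\mathrm{Enc}_{pk}(m;r)$ yields $\mathrm{Enc}_{pk}(m;r+s)$, a fresh-looking encryption of the same message. A shuffle with permutation $\pi$ then outputs $b_i=\mathrm{Rerand}(a_{\pi(i)};s_i)$ for independently chosen $s_i$.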
Significant research was devoted to constructing efficient shuffle proofs; here are some highlights:
1. Cut and Choose (Sako-Kilian 1995)
The first somewhat efficient shuffle proofs were proposed by Sako and Kilian. This was a cut-and-choose type of protocol where many shuffles were created, all related to the main shuffle to be proven. The verifier could then challenge the prover to open half of the shuffles, i.e., reveal the permutation and how the rerandomization was done. Due to the many intermediate shuffles, this approach is expensive.
2. Switching Networks (Abe 1999, Abe-Hoshino 2001)
The first efficient shuffle proof was inspired by switching networks. To explain what switching networks are, consider a telephone network. Suppose we have $n$ callers and $n$ receivers. Nobody can talk to two people at once, so the telephone calls can be described as a permutation connecting the callers to their receivers. Switching networks were used to ensure telephone calls between callers and receivers can be set up so that each copper wire is allocated to a single caller-receiver pair.
A $2\times 2$ switch connects two input wires to two output wires, either top-top and bottom-bottom, or top-bottom and bottom-top, i.e., there are two possible ways to configure the switch. The notion can be generalized to $n\times n$ switches, but they have $n!$ possible configurations. Instead, effort went into constructing networks of small switches that could represent any permutation.
An example is a Beneš network, which has around $2\log n$ internal layers of $n/2$ switches, with the output wires of one layer going to the input wires of the next layer. Waksman showed that for any permutation it is possible to efficiently compute a configuration of the roughly $n\log n$ switches so that they connect the $n$ inputs to the $n$ outputs according to this permutation.
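As a small sanity check of the rearrangeability claim, here is a Python sketch of a 4-input Beneš network with three layers of two switches (the wiring layout and function names are my own, chosen for illustration); brute-forcing the six control bits confirms that every one of the $4!=24$ permutations is reachable, which is what Waksman's algorithm computes directly instead of searching.

```python
from itertools import product

def switch(a, b, cross):
    """A 2x2 switch: pass the two wires straight through, or cross them."""
    return (b, a) if cross else (a, b)

def benes4(x, bits):
    """Apply a 4-input Benes network to x, given one control bit per switch.

    Layers: input layer (2 switches), middle layer (the two size-2
    sub-networks, one switch each), output layer (2 switches)."""
    i0, i1, m0, m1, o0, o1 = bits
    # Input layer: switch (x0, x1) and (x2, x3).
    a0, a1 = switch(x[0], x[1], i0)
    a2, a3 = switch(x[2], x[3], i1)
    # Top outputs feed the upper sub-network, bottom outputs the lower one.
    u0, u1 = switch(a0, a2, m0)   # upper sub-network
    l0, l1 = switch(a1, a3, m1)   # lower sub-network
    # Output layer: output switch j takes output j of each sub-network.
    b0, b1 = switch(u0, l0, o0)
    b2, b3 = switch(u1, l1, o1)
    return (b0, b1, b2, b3)

# Every permutation of 4 elements is realizable by some switch configuration.
reachable = {benes4((0, 1, 2, 3), bits) for bits in product((False, True), repeat=6)}
assert len(reachable) == 24
```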
Abe proposed a shuffle proof with a Beneš network at its core. Create $2\log n$ intermediate layers of ciphertexts, each a shuffle of the previous layer, but where each intermediate permutation must be chosen such that it matches a possible switch configuration between those two layers. The advantage of this restriction is that the permutation between ciphertexts in two consecutive layers can be decomposed into $n/2$ switches, i.e., we can match pairs of ciphertexts in one layer with pairs of rerandomized ciphertexts in the next layer. This breaks the shuffle proof into $n\log n$ proofs of correct switching and rerandomization. The overall prover computation and proof size becomes $O(n \log n)$.
3. Permutation Matrices (Furukawa-Sako 2001)
Furukawa and Sako were the first to represent the shuffle relation via a permutation matrix. If we have two vectors $\vec{a},\vec{b}$, they have the same messages in permuted order if $$\vec{b}=M\vec{a},$$ where $M$ is a permutation matrix, for instance, $$ \vec{b}= \begin{pmatrix} b_1=a_2\\ b_2=a_4\\ b_3=a_1\\ b_4=a_3 \end{pmatrix} = \begin{pmatrix} 0&1&0&0\\ 0&0&0&1\\ 1&0&0&0\\ 0&0&1&0 \end{pmatrix} \begin{pmatrix} a_1\\ a_2\\ a_3\\ a_4 \end{pmatrix}. $$
It takes a bit more notation, but similarly we can in a natural way express a shuffle as a permutation matrix applied to a vector of ciphertexts, with an added rerandomization vector (additive notation for ElGamal is most convenient).
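Concretely, in additive notation one way to write the shuffle relation is $$\vec{b}=M\vec{a}+\mathrm{Enc}_{pk}(\vec{0};\vec{r}),$$ where $M$ is the permutation matrix and $\mathrm{Enc}_{pk}(\vec{0};\vec{r})$ is a vector of encryptions of zero with fresh randomness; since ElGamal is homomorphic, adding an encryption of zero rerandomizes a ciphertext without changing its plaintext.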
You can think of Furukawa and Sako's shuffle proof as using parts of the input ciphertexts to form $n$ Pedersen commitments to rows of a permutation matrix. While a permutation matrix has $n^2$ entries, most of them are 0 so the cost becomes $O(1)$ per row and $O(n)$ in total to commit to the full permutation matrix.
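To see why committing to a one-hot row is cheap, here is a toy Pedersen-style vector commitment in Python; this is not Furukawa and Sako's actual construction, and the tiny parameters and helper names are mine, purely for illustration. A commitment $c=h^r\prod_j g_j^{M_{ij}}$ to a unit-vector row collapses to the blinding term times a single generator, because the zero entries contribute nothing.

```python
import random

# Toy group: the order-q subgroup of Z_p^* with p = 2q + 1 (do not use in production).
q = 509
p = 2 * q + 1  # 1019, prime

def gen(seed):
    """Derive a subgroup element deterministically from a seed (illustrative only)."""
    return pow(seed, 2, p)  # squaring lands in the order-q subgroup

n = 8
g = [gen(3 + j) for j in range(n)]  # one generator per matrix column
h = gen(97)                          # blinding generator

def commit_one_hot_row(j, r):
    """Pedersen commitment to the unit vector e_j: c = g_j * h^r.

    The zero entries cost nothing, so the work is O(1) per row regardless of n."""
    return (g[j] * pow(h, r, p)) % p

def commit_dense_row(row, r):
    """Generic Pedersen commitment c = h^r * prod_j g_j^{row_j}, for comparison."""
    c = pow(h, r, p)
    for gj, mij in zip(g, row):
        c = (c * pow(gj, mij, p)) % p
    return c

r = random.randrange(q)
row = [0] * n
row[5] = 1
assert commit_one_hot_row(5, r) == commit_dense_row(row, r)
```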
The core innovation in Furukawa and Sako's work is to find a technique to efficiently prove that the matrix is indeed a permutation matrix, i.e., each row and each column is a unit vector with a single 1 entry and the remaining entries being zero, without disclosing the exact location of the 1s. This resulted in the first linear-time shuffle proof with $O(n)$ size and prover computation.
4. Grand Product Polynomial Identity (Neff 2001)
Then came the shift to grand product arguments. Lipton89: Fingerprinting Sets showed that two multisets $A$ and $B$ have the same elements with the same multiplicity if and only if the formal polynomials with those elements as roots are identical: $$\prod_{i=1}^{n} (X - a_i) = \prod_{i=1}^{n} (X - b_i).$$
Neff leveraged this grand product polynomial identity to give an efficient shuffle proof. A core part of the proof system is that given a commitment to a vector of field elements $\vec{b}$ we can prove the committed elements are the same as a set of public values $a_1,\ldots,a_n$ just in permuted order by demonstrating the committed elements $b_i$ satisfy $$\prod_{i=1}^{n} (x - a_i) = \prod_{i=1}^{n} (x - b_i)$$ for a verifier-chosen challenge $x$.
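As a quick illustration of the underlying check (this is just the polynomial-identity fingerprint, not Neff's full protocol; the field and values are made up for the example), a verifier-chosen challenge $x$ separates permuted vectors from tampered ones:

```python
import random

P = 2**61 - 1  # a prime, used as an illustrative field for fingerprinting

def grand_product(values, x):
    """Evaluate prod_i (x - v_i) modulo P."""
    acc = 1
    for v in values:
        acc = acc * (x - v) % P
    return acc

a = [7, 3, 3, 11, 42]
b_permuted = [42, 3, 11, 7, 3]   # same multiset, different order
b_tampered = [42, 3, 11, 7, 4]   # one value changed

x = random.randrange(P)           # verifier-chosen challenge
assert grand_product(a, x) == grand_product(b_permuted, x)
assert grand_product(a, x) != grand_product(b_tampered, x)  # holds w.h.p. over x
```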
Neff and subsequent shuffle proofs in this vein use this subprotocol to commit to verifier-chosen challenges permuted in the same order as the ciphertexts have been shuffled. They then use batch-argument techniques to connect the committed challenges to the shuffled ciphertexts in a way that shows they match the original ciphertexts (we omit the details). This gave a second type of shuffle proof with linear prover computation and proof size $O(n)$.
I have studied grand product argument-style shuffle proofs extensively: in 2008, together with Ishai, to obtain the first shuffle proof with sublinear size $O(n^{2/3})$, and in collaboration with Bayer in 2012, to improve the proof size to $O(\sqrt n)$. I did not see a way to get similar communication efficiency with permutation matrix style shuffle proofs, so my belief at this stage was that the grand product polynomial identity idea had won.
Fast forward to today. We are building zero-knowledge virtual machines (zkVMs) like S-Two, Nexus, Jolt, and SP1. The goal is to prove the correct execution of a program, i.e., executing a given program on the VM on input $x$ will yield output $z$ (sometimes with additional private inputs or randomness $y$).
The standard way to prove correct execution on a VM is to do it cycle by cycle. You may, for instance, commit to the relevant words appearing in the execution trace and then specify constraints you want the values to obey. Most constraints are local; for instance, for an instruction $\sf{ADD}\ \sf{reg1}\ \sf{reg2}\ \sf{regd}$ there may be constraints ensuring that the corresponding committed words satisfy $\sf{val1}+\sf{val2}=\sf{vald}$.
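As a toy illustration of what "local" means here (the trace layout below is invented, not that of any particular zkVM, and 32-bit words are assumed), each committed row can be checked on its own:

```python
MASK32 = (1 << 32) - 1  # 32-bit words assumed for this sketch

# An invented execution-trace fragment: one row of committed words per cycle.
trace = [
    {"op": "ADD", "val1": 5,      "val2": 7, "vald": 12},
    {"op": "ADD", "val1": MASK32, "val2": 1, "vald": 0},  # wraps around mod 2^32
]

def check_add_row(row):
    """Local constraint for ADD: vald = val1 + val2 modulo the word size.

    The check only reads the current row, which is what makes it cheap to prove."""
    return (row["val1"] + row["val2"]) & MASK32 == row["vald"]

assert all(check_add_row(row) for row in trace if row["op"] == "ADD")
```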
Some constraints are non-local though. A thorny question for instance is how to ensure memory consistency: when the execution loads the word at address 0x1234, how can you ensure it matches what appeared much earlier in the execution trace when the last store to this address happened?
A strawman solution is to commit to the full memory in each cycle; now everything is "local". This is very costly though: if the execution goes through $C$ cycles and the memory has size $M$, then the prover will be committing to at least $CM$ words.
To solve this, BCGTV13: SNARKs for C suggested using two traces: an execution trace (sorted by timestamp) and a memory trace (sorted by address and timestamp). The prover who wants to demonstrate the zkVM has been executed correctly commits to both traces. Now it can use the execution trace to ensure that instructions have been executed correctly. And it can use the memory trace to ensure that loads from an address always match the last stored value. We have made both execution constraints and memory constraints local.
A problem remains, which is to ensure that the execution trace and the memory trace match each other. If there is a memory access in the execution trace, it should also appear in the memory trace, and vice versa, albeit in a different order. This is exactly the kind of problem that can be solved with a permutation argument.
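Here is a minimal Python sketch of the two-trace idea, with an invented access layout and helper names: sort the accesses by address and timestamp, check load-after-store consistency locally in that order, and check that the sorted trace is a permutation of the original one, using the same kind of multiset fingerprint as in the grand product example earlier.

```python
import random

P = 2**61 - 1  # illustrative prime field for fingerprinting

# Each access is (timestamp, address, is_store, value); the layout is invented.
execution_trace = [
    (1, 0x10, 1, 7),
    (2, 0x20, 1, 9),
    (3, 0x10, 0, 7),
    (4, 0x20, 0, 9),
    (5, 0x10, 1, 8),
    (6, 0x10, 0, 8),
]

# The memory trace: the same accesses, sorted by (address, timestamp).
memory_trace = sorted(execution_trace, key=lambda acc: (acc[1], acc[0]))

def check_memory_trace(mem_trace):
    """Local check: within each address, every load returns the last stored value."""
    last = {}
    for _, addr, is_store, val in mem_trace:
        if not is_store and last.get(addr) != val:
            return False
        last[addr] = val
    return True

def fingerprint(trace, x, y):
    """Compress each access to a field element with y, then take the grand product in x."""
    acc = 1
    for t, addr, is_store, val in trace:
        row = (t + y * addr + pow(y, 2, P) * is_store + pow(y, 3, P) * val) % P
        acc = acc * (x - row) % P
    return acc

x, y = random.randrange(P), random.randrange(P)
assert check_memory_trace(memory_trace)                                       # loads are consistent
assert fingerprint(execution_trace, x, y) == fingerprint(memory_trace, x, y)  # same multiset
```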
This is where my feeling of déjà vu arises. The core ideas from shuffle proofs reappear.
1. Switching Networks (SNARKs for C 2013)
SNARKs for C was the first work to propose the zkVM notion, in this case for a custom-designed zk-friendly VM called TinyRAM. To ensure memory consistency, they deployed a permutation argument based on a Beneš network. Since there can be up to $C$ memory accesses in an execution that runs for $C$ cycles, this requires committing to $O(\log C)$ layers of $C$ memory accesses and proving that each pair of consecutive layers is related via appropriate switches.
2. Grand Product Polynomial Identity (vRAM 2018)
The switching network idea incurred a logarithmic overhead. Constant overhead was achieved by vRAM in 2018, which switched strategy and used a grand product polynomial identity to match the memory accesses in the execution trace with the memory trace. Arya, also in 2018 (independently but later), proposed the grand product polynomial identity strategy for ensuring memory consistency in the TinyRAM zkVM. As a co-author of Arya, I was well aware of the connection to shuffle proofs and to the earlier use of the same idea for wiring consistency in circuit satisfiability proofs, while the authors of vRAM did not (explicitly) make the connection to the earlier permutation arguments.
Interestingly, already back in the 1990s, before shuffle proofs, the grand product polynomial identity had been used for, wait for it, memory checking. The problem here is: if you outsource your data to a third party, how can you ensure the data it returns is consistent with the data you gave it, even if you have very little storage yourself? The problem comes in two flavors: online memory checking, where you want to know right away if the outsourced memory is faulty, and offline memory checking, where you eventually detect that there is a fault. Such techniques have found many uses, e.g., in streaming algorithms, and they also relate to zkVMs. We can view the memory checking problem in zkVMs as having a CPU with little memory that gets fed memory access data in each cycle and at the end of the execution needs to check that all memory accesses were consistent, as is done in offline memory checking.
Arya introduced a lookup argument in their zkVM. A lookup argument ensures committed values $a_1,\ldots,a_n$ all appear in a known table of values. A specific use case in Arya was to ensure committed partial words had bits equal to 0 in all odd positions. This differs from the permutation relation, since some table entries may not be hit and some may be hit multiple times. Nonetheless, you can verify the lookup via a grand product polynomial identity $\prod_{i=1}^n(x-a_i)=\prod_{j=1}^t(x-t_j)^{m_j}$, where $m_j$ is the multiplicity with which table entry $t_j$ is looked up. Plookup and subsequent works refined and improved grand product-style lookup arguments, and later an additive variant, LogUp, was invented that can be used both for lookups and permutations.
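Here is the multiplicity identity as a quick Python sanity check (a check of the formula, not a proof system; the field and table are made up): count how often each table entry is hit and compare the two products at a random point.

```python
import random
from collections import Counter

P = 2**61 - 1  # illustrative prime field

def lookup_identity_holds(a, table, x):
    """Check prod_i (x - a_i) == prod_j (x - t_j)^{m_j} mod P,
    where m_j counts how many times table entry t_j is looked up."""
    m = Counter(a)
    lhs = 1
    for ai in a:
        lhs = lhs * (x - ai) % P
    rhs = 1
    for tj in table:
        rhs = rhs * pow(x - tj, m[tj], P) % P
    return lhs == rhs

table = [0, 1, 2, 3]    # the public table
a_good = [2, 2, 0, 3]   # every value appears in the table
a_bad = [2, 5, 0, 3]    # 5 is not a table entry

x = random.randrange(P)
assert lookup_identity_holds(a_good, table, x)
assert not lookup_identity_holds(a_bad, table, x)  # fails w.h.p. over x
```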
3. The Matrix Renaissance (2022-2025)
ZGKMR22: Baloo suggested the matrix representation can be useful in lookup arguments and coined the term lookup matrix. Since then several other works explored the lookup matrix representation, with a permutation matrix being a special case.
In 2025, three new papers pushed this strategy further, achieving better performance than the best grand product polynomial identity arguments. This may be a bit surprising, since the grand product strategy already has complexity $O(n)$, but zero-knowledge proofs are now so mature that the constants matter. The first two papers have slightly superlinear cost, yet the concrete constants make their performance better. The reason for this is that the best grand product polynomial identity arguments either commit to $O(n)$ values, which carries a concrete cost in cryptographic operations, or use a GKR-style proof, which has a significant constant hiding in the big-O notation.
Twist and Shout observes, as Furukawa and Sako did earlier, that for some commitment schemes it is cheap to commit to one-hot vectors: you do not pay for the zeros. It is not entirely free, though, since there is a cost associated with the commitment key size and with evaluation proofs over the committed values. They therefore propose decomposing rows of a lookup matrix into tensor products of rows in slimmer lookup matrices. For instance, you may decompose rows that are unit vectors of length $n$ into two unit-vector rows of length $\sqrt n$ each, giving a lookup representation of the following form:
$$ \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \bullet \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} $$ where $\bullet$ stands for the face-splitting product, i.e., the row-wise tensor product.
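The decomposition above is easy to verify mechanically; the short Python snippet below recomputes the face-splitting product row by row and matches it against the lookup matrix on the left.

```python
def face_splitting_product(A, B):
    """Row-wise tensor (Kronecker) product: row i of the result is kron(A[i], B[i])."""
    return [[x * y for x in a_row for y in b_row] for a_row, b_row in zip(A, B)]

M = [
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
A = [[0, 1], [1, 0], [1, 0], [0, 1], [0, 1]]
B = [[0, 1], [1, 0], [0, 1], [0, 1], [1, 0]]

assert face_splitting_product(A, B) == M
```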
Further decomposition yields even slimmer matrices and even makes it possible to use commitment schemes where zeros are not free. In the end you get a tradeoff: the cost associated with the sizes of the vectors you commit to decreases as you decompose into slimmer matrices, but on the other hand the cost increases linearly in the decomposition degree. The proving computation is superlinear, but since the constants are small you can, with a suitable decomposition degree, get better concrete performance than the grand product polynomial identity strategy.
Surprisingly, Twist and Shout also reawakens the "strawman" approach that SNARKs for C avoided. Recall that a naive zkVM approach is to commit to the full memory every cycle. Twist and Shout instead commits to memory changes, which are of constant size in each cycle. However, by using a sum-check protocol to recompute the relevant memory access on the fly, the prover has virtual access to any memory cell in each cycle. In some sense, Twist and Shout gets the benefit of the full-memory approach without the cost, because the sum-check protocol does not pay for zeros, i.e., for non-changes to memory.
BiPerm and MulPerm also decompose lookup matrices, but by working on small bit segments instead of a bit at a time they manage to reduce the proving cost to $O(n\sqrt{\log n})$. This works directly for lookup arguments, but they also explicitly consider permutation arguments, where the prover additionally has to prove that the decomposition represents a permutation matrix.
CTHK25 achieves linear prover computation, $O(n)$. Instead of matrix decomposition, they use a sparsity-aware polynomial commitment scheme, e.g., Dory, to commit to the full lookup or permutation matrix. To make this work, i.e., to ensure that computing an opening is not too costly, they mix linear algebra tricks into their proof system.
The same core ideas have been used in shuffle proofs and in modern zero-knowledge proofs. While the latter often do not cite the shuffle proof literature and may have rediscovered the ideas independently, both domains have explored the same ways of representing and proving a permutation relation: switching network, grand product polynomial identity, and sparse matrix with unit-vector rows. Other ideas have also been explored for memory consistency in zkVMs, such as multiset hashing, but mostly the two areas of zero-knowledge proofs have followed the same path.
In the early days of shuffle proofs, even getting a linear-size proof was an achievement. The statement size was large anyway, $n$ input and $n$ output ciphertexts, so an $O(n)$ size proof was acceptable. And once the shuffle proofs became sublinear in size, the dominant communication cost in a mix-net would be the ciphertexts.
In zkVMs, on the other hand, succinctness is critical. A program may run for billions of cycles even if the input/output is tiny. We cannot afford size $O(C)$ proofs or a verifier that runs in $O(C)$ time. Fortunately, modern zero-knowledge proofs enable proofs to be succinct and verification to be fast. The main question therefore is the prover computation, and with the recent lookup and permutation arguments we are seeing concrete performance improvements.
I spent a decade optimizing shuffle proofs based on grand product polynomial identities because permutation matrices seemed too heavy, so it has been surprising for me to see that the matrix representation can work so well. Moreover, I was surprised to learn that the careful balancing in MulPerm could bring the prover cost down to $O(n \sqrt{\log n})$ since it is unusual to see $\sqrt{\log n}$ factors in computational complexities.
Are there still things modern zero-knowledge proofs can learn from shuffle proofs? My impression is that the ideas used in shuffle proofs have by now been absorbed or rediscovered. But there may still be a few tricks remaining; Furukawa-Sako (refined in Furukawa 2005), for instance, uses a clever set of constraints to ensure a committed matrix represents a permutation, which I have not seen used elsewhere.