[ 01 / 10 ] The problem: a product metric with a proof-of-work core

The challenge scores one number, and lower is better:

[ the whole game ]
SCORE = round(avg_executed_Toffoli) x peak_qubits

Two factors, multiplied. peak_qubits is the widest the register file ever gets - a space cost. avg_executed_Toffoli is the average number of Toffoli gates executed across 9,024 test shots, not the number emitted: a gate conditionally skipped on the actual data does not count. Because the metric is a product, the two axes trade against each other: cutting one qubit removes avg_executed_Toffoli points from the score, while cutting one Toffoli removes only peak_qubits points. At the current frontier that makes a one-qubit cut in peak worth roughly 1,150 Toffoli.
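A minimal sketch of that exchange rate. The operating point below is hypothetical, chosen only so the ratio lands on 1,150; the function names are mine, not the challenge's.

```python
def score(avg_executed_toffoli: float, peak_qubits: int) -> int:
    """SCORE = round(avg_executed_Toffoli) x peak_qubits; lower is better."""
    return round(avg_executed_toffoli) * peak_qubits


def qubit_worth_in_toffoli(avg_executed_toffoli: float, peak_qubits: int) -> float:
    """How many Toffoli must be cut to match the saving from one fewer qubit."""
    saved_by_one_qubit = score(avg_executed_toffoli, peak_qubits) - score(
        avg_executed_toffoli, peak_qubits - 1
    )
    saved_per_toffoli = peak_qubits  # each Toffoli removed drops the score by peak_qubits
    return saved_by_one_qubit / saved_per_toffoli


# Hypothetical operating point, not the real frontier values.
T, Q = 1_380_000, 1_200
print(qubit_worth_in_toffoli(T, Q))  # 1150.0
```

The rate is simply the ratio of the two factors, so it drifts as either one moves.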

The wrinkle that turns this from gate golf into proof-of-work: the 9,024 test inputs are not fixed. They are derived by hashing the circuit's own op-stream with SHAKE256 - a Fiat-Shamir construction - and a 48-bit nonce in a no-op identity tail reseeds which inputs are tested without changing what the circuit computes.
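To make the reseeding concrete, here is a sketch of a hash-derived test set using Python's standard `hashlib.shake_256`. The serialization, the 32-byte input width, and passing the nonce separately are all assumptions for illustration; in the real challenge the nonce lives inside the op-stream itself.

```python
import hashlib

NUM_SHOTS = 9_024   # from the challenge description
INPUT_BYTES = 32    # assumption: one 256-bit input per shot


def derive_inputs(op_stream: bytes, nonce: int, shots: int = NUM_SHOTS) -> list[int]:
    """Hash the op-stream plus a 48-bit nonce and slice the output into test inputs."""
    if not 0 <= nonce < 2**48:
        raise ValueError("nonce must fit in 48 bits")
    xof = hashlib.shake_256()
    xof.update(op_stream)
    xof.update(nonce.to_bytes(6, "big"))  # hypothetical encoding of the identity tail
    stream = xof.digest(shots * INPUT_BYTES)
    return [
        int.from_bytes(stream[i : i + INPUT_BYTES], "big")
        for i in range(0, len(stream), INPUT_BYTES)
    ]


ops = b"hypothetical serialized op-stream"
a = derive_inputs(ops, nonce=1)
b = derive_inputs(ops, nonce=2)
print(len(a), a == b)  # 9024 False
```

The same property cuts the other way: changing a single byte of `ops` yields an unrelated test set, exactly as changing the nonce does.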

This matters because every score-lowering trick anyone has found works by dropping bits that are provably zero on the reachable support but not on all inputs. That is exactly why the truncation saves gates, and exactly why it fails on a small fraction of inputs. To submit, you must find a nonce whose 9,024 sampled inputs all dodge the failing cases - zero classical mismatches, zero phase garbage, zero ancilla garbage. A clean nonce occurs roughly once per 50 million candidates. Nonce-finding is an unavoidable search, and it shaped everything below.
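A back-of-envelope check on what "a small fraction of inputs" means. This assumes every one of the 9,024 inputs fails independently with the same probability, which is a simplification of the real failure classes.

```python
import math

SHOTS = 9_024
CLEAN_NONCE_RATE = 1 / 50_000_000


def implied_failure_rate(clean_rate: float, shots: int) -> float:
    """Solve (1 - p) ** shots == clean_rate for p, assuming independent inputs."""
    return -math.expm1(math.log(clean_rate) / shots)


p = implied_failure_rate(CLEAN_NONCE_RATE, SHOTS)
print(f"{p:.2%}")  # about 0.2% of inputs
```

Under that assumption, a per-input failure rate of only about one in five hundred is enough to make a clean set of 9,024 a one-in-50-million event.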

[ 02 / 10 ] Where the cost lives

Before touching anything, I measured the live circuit's executed-Toffoli breakdown:

| Subsystem | Share of executed Toffoli |
| --- | --- |
| GCD / Kaliski modular inversion | ~88% |
| Solinas squaring step | ~11% |
| Everything else | ~1% |

That single measurement reframed the whole project: the circuit is inversion-dominated. Any lever not aimed at the modular inversion is rearranging deck chairs. It also pre-filtered the literature for me, as section 05 shows - the headline technique from the best-known paper targets the 12% of the circuit that does not matter.

[ 03 / 10 ] The hunt pipeline, and why correctness paranoia pays

At one clean nonce per ~50 million candidates, a fast pre-filter is mandatory. The pipeline is two stages: a cheap GPU pre-screen replays just the binary-GCD inversion walk for each nonce on-device and rejects nonces that hit a hard input; every survivor is then re-run through the real trusted evaluator for a full clean check. The filter is structurally blind to two failure classes the evaluator catches, so only about 1 in 25 GPU survivors passes the full eval - not a bug, just what happens when a classical convergence filter models a quantum circuit. The confirm step cannot be skipped.
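The control flow of the two stages, as a sketch. The callables are stand-ins: in the real pipeline stage 1 is the GPU inversion-walk replay and stage 2 is the trusted evaluator.

```python
from typing import Callable, Iterable, Optional


def hunt(
    nonces: Iterable[int],
    cheap_filter: Callable[[int], bool],
    full_eval: Callable[[int], bool],
) -> Optional[int]:
    """Return the first nonce that passes both stages, or None."""
    for nonce in nonces:
        if not cheap_filter(nonce):   # stage 1: fast, may pass bad nonces
            continue
        if full_eval(nonce):          # stage 2: trusted, always has the last word
            return nonce
    return None


# Toy stand-ins: the filter passes multiples of 4, the evaluator only multiples of 100,
# so one filter survivor in 25 also passes the full check.
found = hunt(range(1, 1000), lambda n: n % 4 == 0, lambda n: n % 100 == 0)
print(found)  # 100
```

The filter only ever rejects; nothing is accepted without the second stage, which is why a blind spot in the filter costs time rather than correctness.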

The part I would do again unchanged: both the Metal and CUDA ports were validated 64-for-64 bit-exact against the CPU reference before any hunt. That gate caught two real bugs - a 512-bit reduction silently dropping bit 64, and a SHAKE256 squeeze starting one block late. Either one would have produced a filter that runs at full speed and never finds anything.

[ A silently wrong filter does not crash. It just never finds the winner. ]

The same self-check is now wired into the deploy bootstrap: the pod proves CPU equals GPU before it hunts, or it aborts. One banked speedup lives here too: a wider scalar-multiply comb table bought a measured ~35% kernel speedup, validated 720 for 720 scalars three independent ways.
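The shape of that gate, sketched with toy functions. The reference, the buggy port, and the vectors are all illustrative stand-ins, not the real kernels.

```python
from typing import Callable, Sequence


def self_check(
    cpu_ref: Callable[[int], int],
    gpu_impl: Callable[[int], int],
    vectors: Sequence[int],
) -> bool:
    """True only if the port matches the reference on every test vector."""
    mismatches = sum(1 for v in vectors if cpu_ref(v) != gpu_impl(v))
    return mismatches == 0


def reference(x: int) -> int:
    return (x * x) % 2**64  # toy stand-in for the CPU reference


def buggy_port(x: int) -> int:
    return reference(x) & ~(1 << 63)  # toy bug: silently drops the top bit


VECTORS = [2**32 - k for k in range(1, 65)]  # 64 toy vectors

print(self_check(reference, reference, VECTORS))   # True
print(self_check(reference, buggy_port, VECTORS))  # False

# Bootstrap usage: refuse to hunt unless the port agrees with the reference.
if not self_check(reference, reference, VECTORS):
    raise SystemExit("port disagrees with the CPU reference; refusing to hunt")
```

Note that the buggy port runs without error at full speed; only the comparison against the reference exposes it.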

[ 04 / 10 ] The hardware surprise: this kernel is latency-bound

I benchmarked the same kernel across four GPU classes, and the result cost real money before I understood it:

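| GPU | Rate (nonce/s) | Power draw | $/hr | Value (nonce/s per $/hr) |
| --- | --- | --- | --- | --- |
| T4 | ~454 | - | low | - |
| L4 | ~1,297 | - | $0.39 | ~3,326 |
| RTX 3090 | ~1,750 | 65-74% TDP | $0.46 | ~3,800 - best |
| A100 (SXM) | ~2,200 | 40% TDP | $1.49 | ~1,477 - worst |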

The kernel's hot path is a sequential 256-step modular inversion: one long dependency chain that cannot be parallelized within a thread. Throughput tracks clock speed times occupancy, not FLOPs. The A100 sat at 40% power because the latency chain starved its compute units - barely 1.7x an L4 at about 4x the cost, the worst value tested. A consumer RTX 3090 beat it on value by roughly 2.6x.
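The value column is just rate divided by hourly price. Reproducing it from the benchmark figures above:

```python
# (rate in nonce/s, price in $/hr) taken from the benchmark table
BENCH = {
    "L4": (1_297, 0.39),
    "RTX 3090": (1_750, 0.46),
    "A100 (SXM)": (2_200, 1.49),
}


def value(rate: float, price_per_hour: float) -> float:
    """Throughput bought per dollar-per-hour of rental."""
    return rate / price_per_hour


values = {gpu: value(*row) for gpu, row in BENCH.items()}
for gpu, v in sorted(values.items(), key=lambda kv: -kv[1]):
    print(f"{gpu:<11} {v:7.0f}")

print(round(values["RTX 3090"] / values["A100 (SXM)"], 1))  # 2.6
```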

I learned this the expensive way, with a 6x A100 pod that underperformed at full price, before migrating to 3x RTX 3090. The lesson generalizes: for latency-bound, sequential-dependency workloads, datacenter accelerators can be the worst possible buy.

[ 05 / 10 ] Three deep dives, three negatives, one conclusion

Most of the science happened in experiments that came back negative, and the negatives were as valuable as any win because they located the frontier.

The Litinski multiplier. The headline idea from arXiv:2410.00899: a controlled add-subtract costs n-1 Toffoli versus 2n-1 for a controlled adder, a reported ~30% whole-circuit cut. Three problems. The published 30% is for a multiply-dominated circuit; ours is 88% inversion. The primitive was already in the codebase as dead code, abandoned after a prior attempt. And the live adders already use a measurement-vented technique at or below what Litinski would give. Realistic gain here: ~0%. Checking the call-site math before building saved a multi-day implementation chasing nothing.
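The kind of call-site arithmetic involved can be sketched as an Amdahl-style bound. This is my illustration, not the exact check: it assumes 256-bit operands and generously treats the entire non-inversion 12% as swappable controlled adders.

```python
def whole_circuit_gain(share_affected: float, local_saving: float) -> float:
    """Amdahl-style bound: only the affected share of the cost can shrink."""
    return share_affected * local_saving


def add_subtract_saving(n: int) -> float:
    """Fraction saved going from a (2n - 1)-Toffoli adder to an (n - 1)-Toffoli add-subtract."""
    return 1 - (n - 1) / (2 * n - 1)


n = 256  # assumption: 256-bit operands
best_case = whole_circuit_gain(share_affected=0.12, local_saving=add_subtract_saving(n))
print(f"{best_case:.1%}")  # 6.0%
```

Even this generous ceiling is a fraction of the published 30%, and it shrinks toward zero once the adders that are already at or below the Litinski cost are excluded.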

Fused inversion plus multiply. Seed the Bezout reconstruction so one extended-Euclidean pass folds an inversion and a multiply together. I built a spike and proved the fusion algebra value-exact, 64 for 64. But the route already applies it, the peak is set elsewhere - by the squaring step that runs after the inversion registers are freed - and the cross-pair variant is value-wrong outright. Zero new delta, and a peak regression if forced.

The knob sweep. I swept the route's candidate-knob space with cheap oracles: a score proxy that matched the true score to 0.006%, plus a meter for how far each config displaces hunt difficulty. One width-scheduling parameter in the inversion walk turned out to be a clean Pareto axis - every score cut is paid for on the hunt-difficulty axis, and the price per point rises with each step:

| Config | Score | Hunt difficulty (displacement) |
| --- | --- | --- |
| baseline | 1,743,907k | 12.2 |
| step 1 | 1,741,754k | 15.4 |
| step 2 | 1,740,543k | 18.0 |
| step 3 | 1,737,049k | 30.2 |

No strictly dominant candidate exists; the route sits exactly on its own frontier. Three deep dives, three negatives, one conclusion: the easy and medium levers were already harvested by the leading teams, and we were standing on the frontier - established with measurements, not vibes.
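The "no dominant candidate" claim can be checked mechanically against the four sweep rows. A minimal dominance test, with lower being better on both axes:

```python
# (score in thousands, hunt-difficulty displacement) from the sweep table
SWEEP = {
    "baseline": (1_743_907, 12.2),
    "step 1": (1_741_754, 15.4),
    "step 2": (1_740_543, 18.0),
    "step 3": (1_737_049, 30.2),
}


def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """a dominates b if it is no worse on both axes and better on at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


dominated = {
    name
    for name, point in SWEEP.items()
    if any(dominates(other, point) for other in SWEEP.values())
}
print(dominated)  # set()
```

An empty set means every row is on the frontier: each lower score comes with a higher difficulty.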

[ 06 / 10 ] Decoding the live frontier

The leaderboard moved the whole time. Decoding the promoted submissions' public notes revealed what the leaders actually do. One shaved 22 Toffoli by recomputing a cleanup carry from a short high suffix instead of a full-width comparator. Another cut peak width by one qubit by keeping an intermediate product live across a middle subtraction, skipping an uncompute-recompute round trip. The frontier holder swapped a truncated adder back to an exact full-carry variant, recovering about 2,597 executed Toffoli. Every one shipped with a freshly hunted nonce.

Two structural lessons fell out. First, the deepest lever is the metric itself: because the score counts executed rather than emitted Toffoli, conditional-replay cleanups - gates emitted but skipped on the data - lower the average while staying perfectly reversible. That is how the leaders grind Toffoli down at fixed peak. Second, the frontier is fast: it fell about 30 million points in roughly nine hours one day, with teams - often GPT-5 Codex and Claude tandems - landing 22-Toffoli improvements multiple times per day.
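A toy model of the executed-versus-emitted distinction. The gate list and the condition bit `c0` are invented for illustration; the real op-stream format is not shown here.

```python
from statistics import mean

# Each gate is (name, condition); condition is None for unconditional gates,
# otherwise the name of a classical bit that must be 1 for the gate to run.
CIRCUIT = [
    ("toffoli", None),
    ("toffoli", "c0"),   # conditional cleanup, skipped when c0 == 0
    ("toffoli", "c0"),
    ("toffoli", None),
]


def executed_toffoli(circuit, bits: dict[str, int]) -> int:
    """Count Toffoli gates that actually run for one shot's classical bits."""
    return sum(
        1
        for name, cond in circuit
        if name == "toffoli" and (cond is None or bits[cond] == 1)
    )


shots = [{"c0": 0}, {"c0": 0}, {"c0": 0}, {"c0": 1}]
emitted = sum(1 for name, _ in CIRCUIT if name == "toffoli")
average = mean(executed_toffoli(CIRCUIT, s) for s in shots)
print(emitted, average)  # 4 2.5
```

Four gates are emitted, but the average charged to the score is 2.5 because the conditional pair runs on only one shot in four.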

[ 07 / 10 ] Why there is no zero-mining path

The most valuable question I asked: is there a Toffoli cut that is exact on every input, so the existing nonce stays valid and you submit with zero mining? I searched exhaustively. The answer is no, and the reason is structural. Every exact-on-all-inputs lever in the route is already enabled - the leaders gated them all to reach this floor. The Toffoli saving and the input-dependence are the same bit-drop; you cannot have one without the other.

A subtle objection survives: what about a change that really is value-exact on all inputs - an exact-adder swap rather than a bit-dropping truncation? The frontier holder's own move is exactly this shape. If a change adds no failing inputs, should the existing nonce not keep working?

I tested it directly: applying that public, provably value-exact swap to the base circuit and evaluating with the existing baked nonce produced 16 classical and 9 phase mismatches - not clean. And the team that shipped it hunted a fresh nonce despite the exactness. The reason is the crux of the entire challenge:

[ Any op-stream change reseeds the test set. Perturb the circuit, even with a perfectly exact swap, and 9,024 different inputs arrive - on which the base's pre-existing truncations now fail. ]

The exactness of your change is irrelevant: you inherit the non-exactness of everything already in the route. A value-exact lever stacked on a truncation-bearing base still needs a hunt, and the frontier base always carries truncations - truncations are how the score got this low. Outside confirmation: every promoted leaderboard entry carries a hunted nonce. If a no-hunt path existed, the full-time teams would be on it.

[ 08 / 10 ] The Red Queen inequality

The most important systems result is not about the circuit at all. It is about the race dynamics, and it decides whether a verified winner can actually be landed. Two clocks run against each other. Frontier velocity is bursty: calm stretches punctuated by jumps, one of which dropped the frontier 3.13 million points in two hours, for an effective 0.5 to 1.5 million points per hour in active phases. Hunt latency is the wall-clock time to find a clean nonce: about 3 hours on a fast 3-GPU rig, 9 to 12 hours on a single free GPU.

A submission only promotes if its score still beats the frontier at the moment it lands. Since the frontier falls during your hunt, the candidate must satisfy:

[ the landing condition ]
margin > frontier_velocity x hunt_latency

Plugging in: a 3-hour hunt against an active frontier needs a 1.5 to 4.5 million point cushion just to not arrive stale. But margin is bought with truncations, and more truncation lowers clean-nonce density, which raises hunt latency, which raises the required margin - a positive feedback loop. For a lone participant against teams submitting hourly, the inequality is often unsatisfiable. The bottleneck was never GPU cost: more hardware shrinks hunt latency only linearly while the margin's hunt-cost grows super-linearly.
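The landing condition as arithmetic, using the velocity range and 3-hour hunt quoted above. The 3.3 million figure is the margin reported for the verified candidate; the function names are mine.

```python
def required_margin(velocity_per_hour: float, hunt_hours: float) -> float:
    """Smallest cushion that is not stale on arrival: velocity x latency."""
    return velocity_per_hour * hunt_hours


def can_land(margin: float, velocity_per_hour: float, hunt_hours: float) -> bool:
    """The landing condition: margin > frontier_velocity x hunt_latency."""
    return margin > required_margin(velocity_per_hour, hunt_hours)


HUNT_HOURS = 3.0
for velocity in (0.5e6, 1.5e6):  # active-phase range, points per hour
    print(f"{required_margin(velocity, HUNT_HOURS):,.0f}")
# 1,500,000
# 4,500,000

print(can_land(3.3e6, 0.5e6, HUNT_HOURS), can_land(3.3e6, 1.5e6, HUNT_HOURS))  # True False
```

A 3.3 million point margin lands at the slow end of an active phase and arrives stale at the fast end, which is why the timing of the hunt matters as much as its size.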

[ 09 / 10 ] The escape hatch: make attempts free

One lever breaks the loop without out-running the frontier: make hunt latency irrelevant by making attempts free and unlimited. The frontier is bursty, not steadily fast - between jumps are calm windows where velocity collapses to near zero. A paid hunt is one expensive bet on timing. A free, repeatable hunt - Kaggle gives roughly 30 GPU-hours a week, unlimited kernel submissions - is many bets: keep re-hunting disjoint nonce ranges, and only one clean nonce has to surface during a calm stretch. That converts an unwinnable single-shot race into a patience game with positive expected value at zero marginal cost.
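Why many free bets beat one paid bet, as a sketch. The per-attempt landing probability is hypothetical and attempts are assumed independent, which real frontier bursts will not strictly honor.

```python
def p_at_least_one(p_single: float, attempts: int) -> float:
    """Chance that at least one of several independent attempts lands."""
    return 1 - (1 - p_single) ** attempts


P_SINGLE = 0.10  # hypothetical: one hunt in ten finishes inside a calm window
for attempts in (1, 5, 10, 20):
    print(attempts, f"{p_at_least_one(P_SINGLE, attempts):.2f}")
```

With a paid hunt, each extra attempt has a price; at zero marginal cost the only limit on `attempts` is patience.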

Two pieces made the deployment concrete. First, a measurement methodology: a small patch to the trusted evaluator made it possible to price any candidate's exact score from a single build and eval, without hunting a nonce for it first. Second, the result of that pricing: a verified candidate measured 3.3 million points ahead of the frontier on the trusted evaluator, whose failures fall in exactly the class the existing GPU filter screens, so it hunts at the same ~1-in-25 survivor economics as the base. The extra margin turned out to be nearly free on the difficulty axis. The exact knob settings stay unpublished while the race is live.

That candidate is hunting now. The kit is turnkey and self-validating: it runs the 64-for-64 GPU-versus-CPU check on the pod and aborts if they disagree, auto-detects its Kaggle GPU, launches one hunt per device on disjoint nonce ranges, and live-streams every survivor to a channel that an external monitor polls. The whole thesis in one running job: a verified circuit, on free compute, with live survivor confirmation, waiting out a frontier I cannot out-sprint but can out-wait.
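One simple way to hand out disjoint nonce ranges, sketched under the assumption that workers are numbered from zero and the space is split into contiguous slices. The kit's actual partitioning scheme is not published here.

```python
NONCE_BITS = 48  # from the challenge description


def nonce_range(worker: int, num_workers: int, bits: int = NONCE_BITS) -> range:
    """Contiguous slice of the nonce space owned by one worker; slices never overlap."""
    if not 0 <= worker < num_workers:
        raise ValueError("worker index out of range")
    total = 1 << bits
    start = worker * total // num_workers
    stop = (worker + 1) * total // num_workers
    return range(start, stop)


ranges = [nonce_range(w, 3) for w in range(3)]
for r in ranges:
    print(f"{r.start:#014x} .. {r.stop:#014x}")
```

Each slice ends where the next begins, so the slices together cover the full 48-bit space with no nonce tested twice.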

[ 10 / 10 ] Results, and what I take from it

| Result | Status |
| --- | --- |
| GPU hunt pipeline (Metal + CUDA), 64/64 bit-exact validated | Built |
| Hardware value characterization (latency-bound finding) | Measured |
| comb-16 scalar-mult speedup, ~35%, 720/720 validated | Banked |
| Litinski multiplier / fused inversion / knob sweep | Negative x3 - frontier located |
| Leaders' techniques decoded from public notes | Done |
| Price-without-hunting evaluator methodology | Built |
| Verified candidate, +3.3M margin vs frontier | Measured + filter-validated |
| No-hunt path, including value-exact swaps | Proven structurally impossible |
| Red Queen analysis + free-compute deployment | Live on Kaggle |

01

The metric is a product, and the cost is inversion. 88% of executed Toffoli is the modular inversion. Any future structural win comes from a better reversible inversion or peak relief on the squaring step - nothing else moves the number.

02

Latency-bound kernels reward clock, not FLOPs. A consumer RTX 3090 beat an A100 on value by ~2.6x. For sequential-dependency workloads, datacenter accelerators can be the worst buy on the menu.

03

The search is cryptographically unavoidable - even for value-exact changes. Every op-stream change reseeds the test set, and the base's pre-existing truncations then fail on the new inputs. Proven empirically on the leaders' own exact-adder swap.

04

The real obstacle is race dynamics, not compute cost. A submission lands only if margin exceeds frontier velocity times hunt latency, and more margin makes the hunt slower. The one escape is free, repeatable hunting: unlimited zero-cost attempts that wait out the frontier's calm windows.

05

Negative results have teeth. Every one of my largest ideas was either already done or structurally blocked, and establishing that with measurements is what proved we stood on the frontier - and stopped me burning days and dollars on levers with no headroom. Knowing precisely where the wall is can be worth more than another inch of progress.

Whether the candidate promotes depends on the frontier's mood during the run. But the strategy is the honest, correct response to a Red Queen race for a single participant: do not pay to lose the sprint. Hunt free, and win the patience game.