Stress-testing Logos formalisation platform on graduate-level probability and stochastic analysis

Prof. Massimiliano Gubinelli (Wallis Professor of Mathematics at the University of Oxford) ran an early prototype of the Logos verification platform through 18 formalisation experiments in graduate probability and stochastic analysis. The campaign produced 144,000 lines of machine-checked Lean across 2,248 verified items, at an estimated API cost of around $15K. The experiments will be documented in detail in a forthcoming preprint; this note summarises what we believe to be the most relevant findings for the current state of autoformalisation.

Formalisation projects

144K

Lines of Lean 4

2,248

Verified items

Motivation

Recent progress on AI for mathematics has concentrated, to a large extent, on competition problems. We are interested in something different. Logos is, less about chasing benchmark numbers, and more about building a solid library of applied mathematics that can actually be used for a wide range of real-world problem solving. The mathematics that working mathematicians teach and use day-to-day is a very different setting from olympiad problems: more structured proofs, larger libraries of background results, and definitions that need to be reusable across whole subfields. We wanted to stress-test the current state of the Logos platform and assess what it can already deliver.

Prof. Gubinelli, a mathematician working at the boundary of analysis and probability, on topics like rough path theory, stochastic PDEs and Euclidean quantum field theory, designed and ran the campaign on his own courses and research interests, so that the choice of problems would reflect graduate-level material from a single domain expert rather than a curated benchmark suite.

The platform is an early prototype: orchestration, library coverage and critic agents are all under active development. Prof. Gubinelli is a Logos collaborator, but the experiments were designed and analysed independently, and we publish his findings without editorial filtering. The full technical report will appear as a preprint shortly.

Scope of the experiments

Eighteen projects in total, organised in three tiers of increasing ambition. The intent was not to build a balanced benchmark, but to give the platform a fairly representative sample of the kinds of mathematics one encounters across a graduate curriculum in probability and analysis: from the easy exercises one assigns to first-year graduate students, up to full bodies of lecture notes covering modern research-level material.

The first tier was made of teaching exercises: the kind of problems that appear on problem sheets of a graduate course in probability. These include random walks on $\mathbb{Z}^d$, Lévy’s construction of Brownian motion, Kolmogorov’s continuity theorem and a few others in the same spirit. Here the mathematics is well understood, and one can judge fairly quickly whether the formalisation captures the intended argument or not.

The second tier collected benchmark theorems of a deeper nature: von Neumann’s double commutant theorem, Schauder’s and Brouwer’s fixed-point theorems, Runge’s theorem, Banach–Stone, Cauchy’s integral formula for general cycles. These are results where Mathlib provides limited scaffolding. The human input was sometimes a single sentence. In several cases the agent expanded a short prompt into a 20–30 line LaTeX statement on its own, then fed it back into the formalisation pipeline. This was a useful test of the system’s ability to operate from very thin specifications.

The third tier, the heaviest one, consisted of full graduate lecture courses. Four bodies of notes, each turned by the system into a verified blueprint of definitions, lemmas and theorems:

Large deviations (with the theorems of Cramér, Sanov, Dawson–Gärtner, Mogulskii and Schilder), from a graduate course delivered in Paris-Dauphine.
Quantum probability (finite-dimensional QM, Bell inequalities, Fock space, Stone–von Neumann uniqueness, Wick’s theorem), from another graduate course held at Bonn University.
Boolean function analysis (Bonami–Gross–Beckner hypercontractivity, Kahn-Kalai-Linial theorem, Friedgut’s junta theorem and other results of Fourier analysis on the boolean hypercube).
Continuous-time martingales, extracted from the stochastic analysis curriculum at the University of Oxford.

Observations

1. Autonomous decomposition

In comparable large-scale formalisation projects, the blueprint, that is, the directed graph of definitions, lemmas and dependencies that organises a proof, is written by a human or by a human team. By comparison, the Carleson project at Bonn (Becker et al. [1]) used a 144-page hand-redacted blueprint with fourteen contributors over thirteen months, while the De Giorgi–Nash–Moser formalisation of Armstrong and Kempe [2] proceeded through extended human–LLM dialogue, with the human shaping the decomposition step by step. In both cases the structural mathematical work, that is, what the proof should look like, was authored by humans.

The Logos pipeline produced the blueprint itself. Given a LaTeX document, which can be a problem sheet, a 39-page set of lecture notes, or in some cases only a one-sentence prompt, the swarm of Logos agents interacts with the Logos platform to extract the mathematical content, decompose each theorem into a dependency graph of items, query LogosLib for reusable building blocks, and then verify every item bottom-up using parallelly orchestrated Lean servers. Prof. Gubinelli supervised the agent but did not author the decomposition or the formalisation.

The decomposition step is where most of the mathematical content of a formalisation actually sits: choosing the intermediate lemmas, recognising which constructions are reusable, identifying the joints of a proof. The campaign suggests that this step can be carried out autonomously at the scale of a full graduate course, with the result type-checked in Lean.

2. Cost: 5x cheaper than a comparable effort

Earlier this year Meta FAIR (Gloeckle et al. [3]) reported an experiment in which approximately 30,000 parallel Claude agents formalised a 500-page graduate textbook in a week, producing 130,000 lines of Lean at an estimated $100K. The team also reported coherence problems (duplicate definitions, conflicting formalisations across agents) which they attribute to the fully distributed decomposition.

The Logos campaign produced a comparable volume (144,000 lines, 2,248 items) at an estimated $15K. The architecture is different: a single sequential orchestrator per project holds the blueprint coherent, at the cost of within-project throughput. The eighteen orchestrators ran in parallel across the campaign.

The two approaches are complementary, and effective production systems will likely combine elements of both. We report the comparison because it gives two recent data points for graduate-level full autoformalisation efforts across multiple subfields.

3. Library coverage was the dominant cost driver

Across the eighteen projects, the strongest predictor of how much compute a project consumed was not mathematical depth but library coverage. Mathlib already contains a substantial body of measure theory, basic probability, and classical analysis, and the projects that could lean on this scaffolding (for instance Lévy’s construction of Brownian motion, or the Banach–Stone theorem) proceeded relatively smoothly. Projects in areas Mathlib does not yet cover well consumed disproportionately more time: Boolean function analysis, where almost no relevant infrastructure was available; quantum probability, where the creation–annihilation operator algebra and its connection to symmetric Fock space had to be built from scratch, accounted for more than 160 hours of active compute by itself.

This observation motivates LogosLib. LogosLib is a graph database of formal-informal mathematics that sits on top of Mathlib but is optimised for AI access rather than for human reading. AI-generated Lean code is curated with minimal filtering through a proprietary post-processing system that we are heavily investing in, whose job is to keep the library stable, internally consistent, and usable by downstream agents. We are growing it deliberately in the directions of applied mathematics topics almost absent from Mathlib: stochastic analysis, PDEs, numerical analysis, optimisation, statistics, and the various interfaces between them. The contents are also more inclusive than what one typically finds in Mathlib: not only general theorems, but also examples, counterexamples, algorithms, and the kinds of computational objects that show up in applications.

The reason for this design choice is straightforward: the campaign suggests that library coverage compounds. Every reusable definition, lemma or sub-decomposition that lands in LogosLib reduces the cost of every subsequent project that touches the same area, and choosing the right level of abstraction for a library API requires judgement about how the definitions will be used, not only what they mean. This is work for mathematicians, not for LLMs.

A point worth making explicitly: the Logos infrastructure is model-agnostic. The campaign was run with the frontier models that happened to be most capable at the time, but the orchestration layer, the blueprint format, the critic structure, and LogosLib itself do not depend on any specific LLM. Models will keep changing, and they will keep getting better; the question of which model one uses is, in our view, not the central one. What actually determines whether a verification platform is useful at scale is the design of the data: how mathematics is stored, indexed, and exposed to agents, what API the library presents, how the dependency graph is curated. Compute is plentiful and getting cheaper; well-designed mathematical data, in a form AI agents can actually consume, is scarce. This is where we are putting our effort.

4. Specification, not proof, is now the bottleneck

One of the clearest lessons from the campaign is that proof generation is no longer the binding bottleneck of a formalisation project. A frontier LLM, deployed inside a structured pipeline with access to Lean and a sufficiently complete library, can now discharge a large volume of lemmas in a day. Once the goal is properly stated, dependencies are in place, and the right lemmas are within reach, the actual production of the formal proof has become the easy part. The verifier then confirms whether the result is correct, and is right to do so.

What remains genuinely hard is everything that surrounds the proof: choosing the right definitions, decomposing an argument into the right intermediate pieces, recognising when an agent has produced a faithful proof of a subtly different theorem from the one intended, and growing the verified infrastructure on which future projects will rely. These are specification-level tasks. The campaign places them firmly on the human side, and we expect them to remain there for some time.

The mathematician’s role under this division of labour is supervisory rather than constructive at the proof level: specify the target, audit the dependency graph that the system produces, intervene at the points where mathematical judgement is irreducible. The 18 projects suggest that this arrangement can produce real, machine-checked mathematics at the scale of a full graduate curriculum.

Prof. Gubinelli’s assessment, after running the experiments, is that this approach (auto-decomposition and verification over a growing applied-mathematics library, supervised by a domain expert) is a viable working strategy for graduate-level and research mathematics. That was the question we wanted to test.

Selected examples

Two of the formalisations from the campaign, in slightly more detail.

Schilder’s theorem

Theorem. Let $B$ be a standard Brownian motion on $[0,1]$ and for each $n \in \mathbb{N}$, let $\mu_n$ be the law on $C([0,1]; \mathbb{R})$ of the rescaled path $X_n(t) = \frac{1}{\sqrt{n+1}}\, B_t$. Then $\{\mu_n\}$ satisfies a noncompact large deviation principle with speed $n+1$ and rate function

$$I(f) = \tfrac{1}{2} \int_0^1 (f'(s))^2 \, ds$$

when $f$ is absolutely continuous with $f(0)=0$, and $+\infty$ otherwise.

Lean 4

open MeasureTheory in
def SatisfiesLDPNoncompact {X : Type*} [TopologicalSpace X] [T2Space X]
    [MeasurableSpace X] [BorelSpace X]
    (μseq : ℕ → Measure X)
    [∀ n, IsProbabilityMeasure (μseq n)]
    (I : X → EReal) : Prop :=
  (∀ x, 0 ≤ I x) ∧
  LowerSemicontinuous I ∧
  (∀ G : Set X, IsOpen G →
    -(⨅ x ∈ G, I x) ≤ Filter.liminf
      (fun (n : ℕ) => (((1 : ℝ) / ((n : ℝ) + 1) : EReal) * extLog (μseq n G)))
      Filter.atTop) ∧
  (∀ F : Set X, IsClosed F → F.Nonempty →
    Filter.limsup
      (fun (n : ℕ) => (((1 : ℝ) / ((n : ℝ) + 1) : EReal) * extLog (μseq n F)))
      Filter.atTop ≤ -(⨅ x ∈ F, I x))

open MeasureTheory in
theorem schilder
    (μseq : ℕ → Measure C(Set.Icc (0 : ℝ) 1, ℝ))
    [∀ n, IsProbabilityMeasure (μseq n)]
    (hμ : ∀ n, μseq n = rescaledBrownianLaw (n + 1) (Nat.succ_pos n)) :
    SatisfiesLDPNoncompact μseq
      (fun f => ENNReal.toEReal (schilderRateFunction f))

This was one of the hardest single items in the campaign, and the input the agent got was deliberately thin. The source course covered large deviations only in the compact setting; Schilder was given as an exam project, with a brief indication that the noncompact extension should go via exponential tightness. The agent had to decide on its own how to structure the extension, which definitions to introduce, and how to connect the compact upper bounds to the full-space result. It introduced a separate SatisfiesLDPNoncompact predicate, proved a lifting theorem from compact to noncompact LDPs via exponential tightness, and then assembled the proof in four chains of lemmas: rate-function identification (requiring the correct handling of approximation of Sobolev functions), the compact upper bound, the open lower bound, and exponential tightness via Gaussian bounds and the Garcia-Rodemich-Rumsey lemma. The proof mixes nicely probabilistic and analytic arguments and reaches out to various parts of the blueprint but needs substantial new ingredients.

The Stone–von Neumann uniqueness theorem

Stone–von Neumann is the cornerstone of canonical quantisation in quantum mechanics. Two strongly continuous one-parameter unitary groups $(U_t, V_s)$ on a Hilbert space $H$ form a Weyl pair if they satisfy the Weyl commutation relation

$$U_t V_s = e^{-ist} V_s U_t.$$

If the pair is irreducible, then $H$ is unitarily equivalent to $L^2(\mathbb{R})$ with $(U_t, V_s)$ intertwined to translations and modulations on $L^2(\mathbb{R})$. The theorem says that, up to unitary equivalence, the canonical commutation relations admit a unique irreducible representation. This is the fact that characterises finite-dimensional quantum mechanics.

Theorem. Let $(U, V)$ be an irreducible strongly continuous Weyl pair on a nontrivial complex Hilbert space $H$. Then its associated Weyl representation is unitarily equivalent to the Schrödinger representation on $L^2(\mathbb{R})$.

Lean 4

structure WeylPair (H : Type*) [NormedAddCommGroup H] [InnerProductSpace ℂ H]
    [CompleteSpace H] where
  U : ℝ → H →L[ℂ] H
  V : ℝ → H →L[ℂ] H
  /-- Group homomorphism property -/
  U_group : ∀ s t, U (s + t) = (U s).comp (U t)
  V_group : ∀ s t, V (s + t) = (V s).comp (V t)
  U_zero : U 0 = ContinuousLinearMap.id ℂ H
  V_zero : V 0 = ContinuousLinearMap.id ℂ H
  /-- Unitarity: U(t) preserves the inner product -/
  U_unitary : ∀ t x y, @inner ℂ _ _ (U t x) (U t y) = @inner ℂ _ _ x y
  V_unitary : ∀ s x y, @inner ℂ _ _ (V s x) (V s y) = @inner ℂ _ _ x y
  /-- Strong continuity: t ↦ U(t)x is continuous for each x -/
  U_continuous : ∀ x, Continuous (fun t => U t x)
  V_continuous : ∀ x, Continuous (fun s => V s x)
  /-- Weyl commutation relation: U_t V_s = e^{-ist} V_s U_t -/
  weyl_rel : ∀ s t, (U t).comp (V s) =
    (Complex.exp (-Complex.I * ↑s * ↑t)) • (V s).comp (U t)

open MeasureTheory in
universe u_svn_thm in
theorem thm_stone_von_neumann (sm : SchrodingerModel.{u_svn_thm})
    {H : Type u_svn_thm} [NormedAddCommGroup H] [InnerProductSpace ℂ H]
    [CompleteSpace H] [Nontrivial H]
    (wp : WeylPair H) (hirr : WeylPairIrreducible wp) :
    let L2R := Lp (α := ℝ) (p := 2) (E := ℂ) MeasureSpace.volume
    let π₁ : (Sum ℝ ℝ) → H →L[ℂ] H := fun | .inl t => wp.U t | .inr s => wp.V s
    let π₂ : (Sum ℝ ℝ) → L2R →L[ℂ] L2R := fun | .inl t => sm.wp.U t | .inr s => sm.wp.V s
    Nonempty (UnitaryEquivalence π₁ π₂)

The Lean formalisation encoded the Weyl pair as a structure carrying the group laws, unitarity, strong continuity and the Weyl commutation relation, and then proved the uniqueness theorem for any model satisfying that interface. The hard mathematical content is verified in a separate instantiation lemma assembling three sub-proofs:

Construction of translation and modulation on $L^2(\mathbb{R})$, with full verification of group laws, unitarity, strong continuity, and the Weyl relation.
Irreducibility, via Gabor completeness (the Moyal identity and Plancherel).
The intertwiner construction, using Weyl-orbit density and Gaussian vacuum vectors.

The instantiation lemma has 100 transitive dependencies, all closed in the current blueprint. By any reasonable standard this is a substantial body of formalised operator theory, assembled from the lecture notes in a single auto-decomposition pass, verified bottom-up, with critic agents catching most specification errors before they propagated into proofs.

Implications for the platform

The campaign sharpens a few priorities for the next iteration of the platform.

The first one is to keep growing LogosLib in the directions that the experiments highlighted as most under-served. Concretely, the plan is to expand coverage in the applied-mathematics areas where Mathlib remains thin: stochastic analysis, PDEs, numerical analysis, optimisation, statistics. Each of these subfields compounds into many downstream projects, both academic and industrial, and the campaign gives a fairly precise empirical picture of where investment pays off first.

The second one is specification fidelity. Critic agents catch hallucinations, but they do not catch the case where a formalisation type-checks in Lean while capturing a semantically different theorem from the one intended. The same failure mode is reported in concurrent work by Armstrong–Kempe [2] and by Meta FAIR [3]. We regard this as a research problem in its own right, and are working on it.

The third one is human input. The most informative outputs of the campaign were not the individual proofs but the dependency graphs, the auto-decomposition choices, and the record of where the platform stretched and where it broke. These artefacts exist because a single domain expert spent overall a few weeks running the system over his own material. The natural next step is to widen this collaboration.

Our ambition is to see LogosLib and the surrounding infrastructure become a widespread substrate for verified mathematics in a wide range of agentic workflows, both in academia and in applied context where hallucination-free reasoning and algorithm verification is a crucial requirement. Since the data and API design are what matter, the marginal cost of plugging in a new agent or a new model is low, and the marginal benefit (one more piece of curated, machine-checkable mathematics added to the library) compounds for everyone.

To this end we are preparing a broader campaign of formalisation experiments, in collaboration with several domain experts across our academic and industrial network, covering stochastic analysis, PDEs, numerical analysis, optimisation, and statistics. Each such collaboration will extend LogosLib in the direction of real problems, and feeds back into the platform’s development. Details and access will be announced shortly.

References

L. Becker et al., A blueprint for the formalization of Carleson’s theorem on convergence of Fourier series. arXiv:2405.06423, 2024.
S. Armstrong and J. Kempe. Formalization of De Giorgi–Nash–Moser Theory in Lean. arXiv:2604.05984, 2026.
F. Gloeckle, A. Rammal, C. Arnal, R. Munos, V. Cabannes, G. Synnaeve, A. Hayat. Automatic Textbook Formalization. Meta FAIR / CERMICS, ENPC, 2026. github.com/facebookresearch/repoprover