Logos achieves state-of-the-art 99.4% on verified code synthesis benchmark

Logos Research evaluated its agentic workflow on VERINA, the verified-code-generation benchmark released by UC Berkeley and Meta. Given each problem’s formal specification, the workflow was tasked with producing both an algorithm and a formal proof that the algorithm satisfies the specification. It resolved 188 of 189 problems, a success rate of 99.4%.

VERINA comprises 189 manually curated Lean tasks (108 basic and 81 advanced), evaluated along three axes: code generation, specification generation, and proof generation, together with their compositions. In December 2025, Harmonic’s Aristotle reported 96.8% on proof generation alone, in which the reference algorithm is supplied. We addressed the harder task of synthesising an algorithm which meets the specification and proving its correctness.

For each task, the reference algorithm was withheld, the specification was locked so the workflow could not modify it, and the workflow was not granted internet access. Every problem was run end-to-end under the same configuration across all 189 tasks. Across the 188 successful runs, the median completion time was 5 minutes, and 90% finished in under 12 minutes. No problem required human intervention.

Dependency graph and item inspector for VERINA advanced 2. Left: original VERINA items (green rectangles) and the auxiliary definitions and lemmas the workflow introduced (purple ellipses). Right: the top-level correctness theorem, its two-line Lean proof, and the two bridge lemmas it depends on. — Blueprint for VERINA advanced 2: green nodes are the specification and the synthesised algorithm; purple nodes are the proved theorems and lemmas (solid for originals, dashed for the workflow’s auxiliaries).

Internally, the workflow operates on a blueprint: a directed dependency graph of definitions, statements, and proofs that serves as the workflow’s graphical working memory. New auxiliary definitions and lemmas are added as nodes; edges record dependencies. The workflow decomposes a theorem about its own synthesised algorithm into sub-goals tractable for Lean (typically between two and ten sub-lemmas on an advanced problem), formalises top-down, and lets compiler feedback dictate which nodes survive.

Consider VERINA advanced 2, the longest common subsequence of two integer arrays. VERINA defines correctness as the maximum length across every common subsequence. This is natural but exponential to evaluate directly. The workflow produced the standard quadratic dynamic-programming algorithm and constructed its own bridge of sub-lemmas back to the specification. That bridge accounts for most of the proof; the top-level theorem closes in two lines.

theorem LongestCommonSubsequence_spec_satisfied (a : Array Int) (b : Array Int) (h_precond : LongestCommonSubsequence_precond (a) (b)) :
    LongestCommonSubsequence_postcond (a) (b) (LongestCommonSubsequence (a) (b) h_precond) h_precond := by
  rw [dp_at_end_eq_lcsLen a b h_precond]
  exact lcsLen_eq_max_common_subseq a b h_precond

The two named lemmas come from the bridge the workflow built underneath: dp_at_end_eq_lcsLen says the imperative table-filling loop computes the same value as a recursive lcsLen over the inputs’ lists, and lcsLen_eq_max_common_subseq says that recursive value coincides with the maximum length over all common subsequences from the specification.

VERINA is public on GitHub, so reference algorithms and proofs may appear in the training data of the underlying base model. The synthesised solutions diverge from the references on a substantial fraction of the run set. Different data layouts, different decompositions, sometimes a different algorithm class entirely. Advanced 2’s algorithm uses a flat 1D array of natural numbers indexed arithmetically, where the reference uses a 2D nested-array layout. Advanced 23 (isPowerOfTwo) lets Lean derive its termination automatically by recursing directly on Nat:

def isPowerOfTwo (n : Int) (h_precond : isPowerOfTwo_precond (n)) : Bool :=
  let _ := h_precond
  let rec aux : Nat → Bool
    | 0 => false
    | 1 => true
    | k + 2 =>
        if (k + 2) % 2 = 0 then aux ((k + 2) / 2) else false
  if n ≤ 0 then false else aux n.toNat

The reference implementation recurses on Int and carries an extra fuel-counter argument that isn’t algorithmically needed. It is present only because Lean’s totality checker doesn’t recognise the recursion as terminating without it.

The single unresolved problem is VERINA advanced 68. Its precondition is stated using Lean core’s recently-redesigned String.all, and the workflow correctly identified the bridging step that would unlock the rest of the proof:

theorem precond_to_list (input : String) (h : runLengthEncoder_precond input) :
    ∀ ch ∈ input.toList, ¬ ch.isDigit := by
  sorry

The corresponding bridging lemma is not yet in Mathlib for the new API. The workflow stated this lemma cleanly and exhausted six attempts at proving it from first principles; every other auxiliary in the decomposition was proved successfully.

VERINA’s tasks are LeetCode-style algorithmic exercises with formal specifications. We are now turning to some of the world’s most complex algorithm synthesis tasks in mathematical domains where verification is non-negotiable, in collaboration with academic and industry partners. Logs and Lean artefacts are available on request.

Logos achieves state-of-the-art 99.4% on verified code synthesis benchmark

Get in Touch

Thank you!