Lean-Agent Protocol Overview
- Lean-Agent Protocol is a minimalist, modular communication framework that defines structured agentic traces to serialize reasoning, tool calls, and outputs.
- The protocol employs deterministic evaluation and fine-tuning using minimal high-quality data to recover suppressed tool-calling abilities in domain-specific models.
- It generalizes across domains by enabling structured API schema transfer and multi-agent coordination, ensuring auditability, reproducibility, and security.
The Lean-Agent Protocol encompasses a family of minimalist, modular, and highly structured communication, coordination, and fine-tuning procedures that enable agentic LLMs and multi-agent systems to acquire, recover, or reliably coordinate complex tool-using behaviors. Its implementations span formal mathematics (notably in Lean and AI-augmented proving), deterministic multi-LLM coordination, and even cryptographic agent-to-agent communications, but share the foundational goals of capability compositionality, auditability, and efficient transfer with minimal data or infrastructure overhead. The protocol emphasizes agentic traces, deterministic evaluation metrics, and orchestration primitives for both single and ensemble agent workflows.
1. Conceptual Foundation: Agentic Traces and Protocol Objectives
At its core, the Lean-Agent Protocol introduces the concept of an "agentic trace": a multi-turn dialogue that serializes natural-language reasoning ("think" or chain-of-thought), explicit calls to external tools (e.g., LeanSearch for formal math, retrieval APIs, or debugging routines), tool responses, and the final action or proof script. Traces are not mere input/output pairs, but structured sequences, e.g.:
- System prompt (tool schema)
- User: input statement (theorem or bug)
- Assistant: (CoT reasoning) <tool_call>{"name":"tool", "arguments":{...}}</tool_call>
- Tool: <tool_response>{...}</tool_response>
- Assistant: final output (e.g., Lean proof script)
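A trace turn of this shape can be serialized and parsed mechanically. The sketch below follows the `<tool_call>`/`<tool_response>` tags shown above; the helper names and JSON field values are illustrative, not a normative wire format.

```python
import json
import re

def render_turn(cot: str, tool_call: dict, tool_response: dict) -> str:
    """Serialize one assistant turn: CoT text, then tagged tool call and response."""
    return (
        f"{cot}"
        f"<tool_call>{json.dumps(tool_call)}</tool_call>"
        f"<tool_response>{json.dumps(tool_response)}</tool_response>"
    )

# Non-greedy match so multiple tool calls in one turn are split correctly.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_calls(turn: str) -> list[dict]:
    """Recover the structured tool calls from a serialized turn."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(turn)]

turn = render_turn(
    cot="The goal resembles commutativity of addition; search for related lemmas.",
    tool_call={"name": "leansearch", "arguments": {"query": "Nat.add_comm"}},
    tool_response={"results": ["Nat.add_comm", "Nat.add_assoc"]},
)
calls = extract_tool_calls(turn)
print(calls[0]["name"])  # leansearch
```

Keeping the call arguments as embedded JSON means the same trace can be replayed against a tool sandbox or used directly as a fine-tuning target.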
The protocol is expressly designed for cases in which a domain-specialized LLM has lost general tool-use ability due to heavy fine-tuning. The goal is to determine whether this ability has been erased or merely suppressed and, if suppressed, whether it can be rapidly "re-awakened" with a compact set of high-quality, targeted agentic traces (Chung et al., 9 Apr 2026).
2. Minimal Data Recovery: Pipeline, Data, and Saturation
The protocol's most prominent instantiation targets Gödel-Prover-V2, showing that after extensive Lean-domain SFT and RL (1.8M examples), tool-calling accuracy collapsed from 89.4% in the base model to near zero. The recovery experiment involved:
- Harvesting ~18,000 high-quality agentic traces, sampling 100 for minimal recovery.
- Traces collected in a multi-stage pipeline:
- Generate candidate agentic CoT + tool call predicates via a scaffold model (e.g., Qwen3-32B).
- Truncate at proof, regenerate the proof with the specialized agent (Gödel-Prover-V2-SFT).
- Filter: retain only traces with compiling proofs that cite retrieved theorems.
- Stratify sampling by number of tool calls, query diversity, and topical coverage.
- Fine-tune the domain-specialized agent on just these 100 traces, using standard token-level cross-entropy.
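The fine-tuning objective is the usual cross-entropy over the serialized trace tokens; in standard notation (symbols ours, not the paper's):

```latex
\mathcal{L}(\theta) = -\sum_{i=1}^{N} \sum_{t=1}^{|y^{(i)}|}
  \log p_\theta\!\left(y^{(i)}_t \,\middle|\, y^{(i)}_{<t},\, x^{(i)}\right)
```

where $x^{(i)}$ is the prompt and tool context of the $i$-th trace and $y^{(i)}$ its target tokens (reasoning, tool calls, and final output).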
Empirical results demonstrated that as few as 100 traces suffice for >78% function-calling accuracy (vs. 5.35% pre-recovery) on BFCL, with ProofNet pass@32 rising from 21.51% to 25.81%. Recovery saturated quickly; increasing to 1,000 or 18,000 traces yielded diminishing returns, supporting the suppression—not erasure—hypothesis (Chung et al., 9 Apr 2026).
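The stratified-sampling step can be sketched as a bucket-and-draw procedure. The bucketing key below (tool-call count × topic) and the trace record fields are assumptions standing in for the paper's stratification by tool-call count, query diversity, and topical coverage.

```python
import random
from collections import defaultdict

def stratified_sample(traces: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Pick ~k traces, spreading the budget evenly across strata."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    strata = defaultdict(list)
    for t in traces:
        strata[(t["n_tool_calls"], t["topic"])].append(t)
    per_stratum = max(1, k // len(strata))
    sample = []
    for bucket in strata.values():
        rng.shuffle(bucket)
        sample.extend(bucket[:per_stratum])
    rng.shuffle(sample)
    return sample[:k]

# Toy corpus: 60 traces spanning 6 (tool-call count, topic) strata.
traces = [
    {"id": i, "n_tool_calls": 1 + i % 3, "topic": ["algebra", "topology"][i % 2]}
    for i in range(60)
]
picked = stratified_sample(traces, k=12)
print(len(picked))  # 12
```

With an even budget per stratum, a 100-trace sample cannot collapse onto one query shape, which matters when so few examples drive the recovery.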
3. Protocol Generalization: Cross-Domain and Schema Transfer
Despite training solely on Lean-specific tool use (using the LeanSearch API and XML/JSON-style tags), the protocol reactivated general tool-calling capability:
- Models recovered API schemas (JSON-style), even for APIs never explicitly seen.
- Gains transferred from the original fine-tuning domain (Lean) to the Berkeley Function Calling Leaderboard (Python/Java/JS pseudo-APIs), demonstrating the unlocked general structured tool-calling skill rather than overfitting to a single schema.
- The protocol's recovery characteristics (fast saturation, immediate schema generalization, residual tool awareness pre-recovery) stand in sharp contrast to strict-capability-erasure regimes (Chung et al., 9 Apr 2026).
4. Algorithmic Blueprint for New Task Domains
The protocol yields a direct procedural recipe for reawakening tool use in other domains:
- Identify a generalist LLM with robust tool-calling (even if poor domain performance).
- Curate a domain-relevant problem set.
- Generate agentic dialogues with the generalist model (CoT → tool call → raw output).
- Extract (CoT + tool call) prefixes; have the specialized model regenerate the domain-specific portion.
- Filter traces by objective validity (compilation, test, retrieved artifact use).
- Sample a stratified set of 100 traces.
- SFT fine-tune using cross-entropy.
Pseudocode is provided for both trace distillation and fine-tuning cycles, with guidance on stratified sampling and balancing trace coverage (Chung et al., 9 Apr 2026).
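The generate/regenerate/filter core of this recipe can be sketched as a single loop. Every hook below (`scaffold_generate`, `specialist_complete`, `compiles`, `uses_retrieved`) is a hypothetical stand-in for a model or validator call, not a published API; the toy lambdas only exercise the control flow.

```python
def distill_traces(problems, scaffold_generate, specialist_complete,
                   compiles, uses_retrieved):
    """Steps 3-5 of the recipe: generate prefix, regenerate tail, filter."""
    kept = []
    for problem in problems:
        # Generalist scaffold produces the CoT + tool-call prefix.
        prefix = scaffold_generate(problem)
        # Specialized model regenerates the domain-specific portion.
        trace = specialist_complete(problem, prefix)
        # Keep only objectively valid traces that actually use retrieval.
        if compiles(trace) and uses_retrieved(trace):
            kept.append(trace)
    return kept

probs = ["thm_a", "thm_b", "thm_c"]
kept = distill_traces(
    probs,
    scaffold_generate=lambda p: f"<tool_call>search({p})</tool_call>",
    specialist_complete=lambda p, pre: pre + f"; proof({p})",
    compiles=lambda t: "proof" in t,
    uses_retrieved=lambda t: "tool_call" in t,
)
print(len(kept))  # 3
```

The stratified-sampling and SFT steps then operate on `kept` exactly as in the minimal-recovery experiment above.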
5. Protocol Structure in Multi-Agent and Coordination Settings
A structurally analogous protocol governs coordinated multi-LLM workflows, e.g., in SLEAN for bug fixing and ensemble audit, and Anemoi for multi-agent cooperation. There, the protocol typically decomposes into three deterministic phases (Vargas, 11 Oct 2025):
| Phase | Function | Output Type |
|---|---|---|
| Independent analysis | Each agent/provider analyzes the task or audit independently | Audit report per agent |
| Cross-critique | Symmetric review, compare/dispute/merge | Consolidation reports |
| Arbitration | Final accept/reject, attribution, metrics | Definitive fix/result |
All communication is mediated by an orchestration layer (SLEAN: prompt-based, file-driven; Anemoi: thread-aware MCP server), never agent-to-agent. Deterministic .txt template prompts, strict provenance, and versioned artifact output enforce reproducibility and CI/CD integration. Mathematical metrics (acceptance rates, confidence intervals, efficiency gains, change surface reduction) are also defined (Vargas, 11 Oct 2025, Ren et al., 23 Aug 2025).
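The three deterministic phases, with the orchestrator as sole message broker, can be sketched as follows. Agent behavior and the majority-vote arbitration rule here are our stand-ins; the real systems use richer critique and acceptance logic, but the shape (orchestrator-mediated, logged, never agent-to-agent) is the point.

```python
from dataclasses import dataclass, field

@dataclass
class Orchestrator:
    agents: dict                              # name -> callable(task) -> report
    log: list = field(default_factory=list)   # versioned artifact trail

    def run(self, task: str):
        # Phase 1: independent analysis -- one audit report per agent.
        reports = {name: fn(task) for name, fn in self.agents.items()}
        self.log.append(("analysis", dict(reports)))
        # Phase 2: cross-critique -- each agent sees every *other* report,
        # routed through the orchestrator, never peer-to-peer.
        critiques = {
            name: [r for other, r in reports.items() if other != name]
            for name in self.agents
        }
        self.log.append(("critique", critiques))
        # Phase 3: arbitration -- deterministic accept rule (majority here).
        accepted = max(reports.values(),
                       key=lambda r: list(reports.values()).count(r))
        self.log.append(("arbitration", accepted))
        return accepted

orch = Orchestrator(agents={
    "a1": lambda t: f"fix:{t}:v1",
    "a2": lambda t: f"fix:{t}:v1",
    "a3": lambda t: f"fix:{t}:v2",
})
result = orch.run("bug-17")
print(result)  # fix:bug-17:v1
```

Because every phase appends to `log`, the full run replays deterministically, which is what makes the acceptance-rate and efficiency metrics computable after the fact.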
6. Protocol Instantiations Beyond Formal Reasoning
The Lean-Agent Protocol’s operational patterns are adopted in highly diverse domains:
- Compliance Guardrails: The "Type-Checked Compliance" setting maps institutional policy to Lean 4 axioms via the Aristotle model, making every agentic action a formally provable conjecture. The Orchestrator translates runtime action into a Lean conjecture; the WASM-hardened kernel only executes if proof succeeds. This gives cryptographic-level compliance (with sub-millisecond overhead) and produces machine-checked audit certificates (Rashie et al., 1 Apr 2026).
- Agent Communication: The "Lean-Agent Protocol" as a lean ANP subprofile prescribes a three-layered stack: minimal DID identity and Ed25519/X25519 cryptography, runtime meta-protocol negotiation (e.g., JSON-RPC vs Protobuf), and a pared-down application protocol with succinct, inline schemas and example flows. The LAP approach dramatically reduces implementation, overhead, and dependency complexity while retaining strong security for general agentic web contexts (Chang et al., 18 Jul 2025).
- Multi-Agent RL Coordination: In O-RAN slicing, the "standalone explainable protocol" (STEP) enforces efficient resource-sharing and conflict mitigation by forcing agents to communicate only compressed, IB-regularized messages. The protocol achieves a 6.06x reduction in conflict over hardcoded alternatives and provides fully interpretable, SHAP-explainable rules (Rezazadeh et al., 2023).
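The runtime meta-protocol negotiation in the LAP stack can be sketched as a capability intersection with a fixed preference order. The capability sets and the preference order below are illustrative assumptions; the fail-closed behavior (abort before exchanging payloads) reflects the profile's security posture.

```python
# Most- to least-preferred wire encodings (illustrative ordering).
PREFERENCE = ["protobuf", "json-rpc"]

def negotiate(ours: set[str], theirs: set[str]) -> str:
    """Return the first mutually supported encoding, or fail closed."""
    common = ours & theirs
    for enc in PREFERENCE:
        if enc in common:
            return enc
    raise ValueError("no common encoding; abort before exchanging payloads")

# Both sides compute the same answer from the same advertised sets,
# so a single exchange of capability lists suffices.
chosen = negotiate({"json-rpc", "protobuf"}, {"json-rpc"})
print(chosen)  # json-rpc
```

Determinism matters here: since both parties evaluate the same pure function over the exchanged capability sets, no extra confirmation round-trip is needed.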
7. Architectural and Design Principles
The Lean-Agent Protocol emphasizes the following design criteria:
- Minimality: Only the essential cryptographic, negotiation, and schema pieces are mandatory; all else is extendable or stubbable.
- Determinism and Auditability: Every communication, analysis, and artifact is versioned and logged for total reproducibility.
- Compositionality and Generality: The protocol unlocks general tool-use skills with minimal data or domain-specific adaptation; new tools can be plugged into orchestration layers without redesign.
- Performance: Interactions minimize handshakes and message overhead (e.g., single-round ECDH, CBOR over JSON, token-delta communication in MCP threads), and in compliant guardrail settings, deterministic proofs can be verified in microseconds.
These principles make the protocol applicable to a wide range of production and research CI/CD, compliance, auditing, and mathematical discovery pipelines where safety, traceability, and interoperability are paramount (Chung et al., 9 Apr 2026, Vargas, 11 Oct 2025, Rashie et al., 1 Apr 2026, Chang et al., 18 Jul 2025, Liu et al., 20 Jan 2026, Ren et al., 23 Aug 2025).
References:
- (Chung et al., 9 Apr 2026)
- (Vargas, 11 Oct 2025)
- (Rashie et al., 1 Apr 2026)
- (Rezazadeh et al., 2023)
- (Chang et al., 18 Jul 2025)
- (Liu et al., 20 Jan 2026)
- (Ren et al., 23 Aug 2025)