- The paper demonstrates that coupling written theorems with executable Lean formalizations enhances step-by-step proof comprehension.
- It presents a novel LLM-driven pipeline that maps natural language to formal code, enabling traceable, interactive exploration of proofs.
- In a controlled user study (n=16), responses written with explorable theorems were preferred in 94.9% of paired evaluations, and participants reported lower task load and gave more detailed explanations.
Introduction
The paper "Making Written Theorems Explorable by Grounding Them in Formal Representations" (2604.02598) investigates augmenting written mathematical theorems and proofs with interactive elements backed by a formal representation. Its core premise is that static, natural-language explanations from LLMs offer limited support for deep, active mathematical reasoning. Coupling the text with a machine-verifiable formalization (here, executable Lean proofs) makes interactive exploration and step-level understanding possible. The authors instantiate this approach in the explorable theorems system, which links written theorems to Lean formalizations and surfaces the intermediate proof structure and logic as interactive elements. A controlled user study provides evidence that this formally grounded interaction improves proof comprehension and perceived user experience over a standard LLM chatbot baseline.
System Design and Architecture
The explorable theorems system integrates natural-language theorems and written proofs with executable Lean code, allowing the user to concretize, explore, and interrogate formal mathematical content far beyond what is feasible with LLM-generated text. The interface's primary affordances are: input sliders for evaluating claims over concrete values and toggling theorem assumptions, step-wise execution of the proof with worked examples, localized tracing of the logical dependencies, and automatic flagging of failure points when key hypotheses are violated.
Figure 1: The explorable theorems interface for the theorem "For all integers x, if x > 2, then x^2 − 1 is not prime", displaying interactive sliders and explicit proof-step states.
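The value sliders and failure flagging can be pictured with a small Python sketch for the theorem in Figure 1. This is not the system's implementation, which is driven by the Lean formalization; the names `evaluate_claim` and `failure_points` are hypothetical, and primality is checked by plain trial division.

```python
from dataclasses import dataclass


def is_prime(n: int) -> bool:
    """Trial-division primality test; integers below 2 are not prime."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


@dataclass
class Evaluation:
    x: int
    hypothesis_holds: bool      # does x > 2 hold?
    value: int                  # x^2 - 1
    factors: tuple              # (x - 1, x + 1)
    claim_holds: bool           # is x^2 - 1 non-prime?


def evaluate_claim(x: int) -> Evaluation:
    """Evaluate the theorem's hypothesis and conclusion at one concrete x."""
    value = x * x - 1
    return Evaluation(x, x > 2, value, (x - 1, x + 1), not is_prime(value))


def failure_points(xs, enforce_hypothesis: bool = True) -> list:
    """Inputs where the conclusion fails.

    With enforce_hypothesis=False the assumption x > 2 is "toggled off",
    so inputs that violate it are tested as well.
    """
    failures = []
    for x in xs:
        ev = evaluate_claim(x)
        if enforce_hypothesis and not ev.hypothesis_holds:
            continue
        if not ev.claim_holds:
            failures.append(x)
    return failures
```

Over the range -10 to 49, `failure_points` returns an empty list while the hypothesis is enforced. With the hypothesis switched off it returns `[-2, 2]`, because x^2 − 1 = 3 is prime for both inputs, which is the kind of failure point the interface is described as flagging.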
The translation pipeline combines several LLM-powered and programmatic steps to keep prose and formal code tightly aligned. It begins with LLM-based generation of structurally parallel Lean proofs, then proceeds through bidirectional mapping of Lean code blocks to prose steps, dependency recovery via proof state diffs, and template instantiation, which fills written proof templates with machine-extracted values.
Figure 2: The pipeline for grounding interaction affordances in a formal representation, including Lean proof generation, state extraction, step mapping, and instance-based execution.
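One way to picture dependency recovery via proof state diffs is the Python sketch below. It assumes, purely for illustration, that each proof state is a mapping from hypothesis names to statements and that a step depends on whichever earlier step introduced the names its Lean block mentions. The data layout, the function names, and the Lean snippets (including `hypothetical_closing_lemma`) are made up and are not taken from the paper.

```python
import re

# Matches Lean-style identifiers such as x, hx, hfac.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_']*")


def introduced(before: dict, after: dict) -> set:
    """Names present in the later proof state but not in the earlier one."""
    return set(after) - set(before)


def recover_dependencies(initial_state: dict, steps: list) -> list:
    """For each step, return the indices of the steps it depends on.

    `steps` is a list of (lean_block, state_after) pairs. Index -1 stands
    for the initial state, i.e. the theorem's variables and assumptions.
    """
    origin = {name: -1 for name in initial_state}
    previous = dict(initial_state)
    dependencies = []
    for index, (lean_block, state_after) in enumerate(steps):
        used = set(IDENT.findall(lean_block))
        # A step depends on the origin of every known name it mentions.
        dependencies.append({origin[name] for name in used if name in origin})
        # The state diff tells us which names this step introduced.
        for name in introduced(previous, state_after):
            origin[name] = index
        previous = state_after
    return dependencies


# Illustrative data only: unchecked strings, not verified Lean.
initial = {"x": "Int", "hx": "x > 2"}
after_0 = {**initial, "hfac": "x ^ 2 - 1 = (x - 1) * (x + 1)"}
after_1 = {**after_0, "hlow": "1 < x - 1"}
steps = [
    ("have hfac : x ^ 2 - 1 = (x - 1) * (x + 1) := by ring", after_0),
    ("have hlow : 1 < x - 1 := by linarith [hx]", after_1),
    ("exact hypothetical_closing_lemma hfac hlow", after_1),
]
deps = recover_dependencies(initial, steps)
```

Here the first two steps depend only on the initial assumptions, and the closing step depends on steps 0 and 1. A scan based on names only sees hypotheses that a block mentions explicitly, so it is a simplification of whatever the actual pipeline does.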
Critically, every interaction (value testing, proof stepping, dependency visualization) is driven by the underlying semantics in the Lean formalization. Traceability is enforced at the level of named variables and block-to-step mappings, with the system leveraging intermediate Lean proof states for data-binding written skeletons to computed values.
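The data-binding idea can be sketched in Python as a written skeleton with named slots that are filled from computed values. In the described system those values come from intermediate Lean proof states; here `extract_bindings` is a hypothetical stand-in that computes them directly, and the skeleton wording is invented for illustration.

```python
# Written proof skeleton with named slots; wording is illustrative only.
PROOF_SKELETON = [
    "Take x = {x}, which satisfies x > 2.",
    "Then x^2 - 1 = {value}.",
    "Factoring gives {value} = (x - 1)(x + 1) = {left} * {right}.",
    "Both factors exceed 1, so {value} is not prime.",
]


def extract_bindings(x: int) -> dict:
    """Stand-in for values the real system would read from Lean proof states."""
    if x <= 2:
        raise ValueError("the hypothesis x > 2 is violated")
    return {"x": x, "value": x * x - 1, "left": x - 1, "right": x + 1}


def instantiate(skeleton: list, bindings: dict) -> list:
    """Fill every slot of the skeleton with the bound values."""
    return [line.format(**bindings) for line in skeleton]
```

Calling `instantiate(PROOF_SKELETON, extract_bindings(5))` yields a worked instance of the proof whose third line reads "Factoring gives 24 = (x - 1)(x + 1) = 4 * 6." Keeping slot names identical to the variable names in the formal proof is one simple way to keep each displayed value traceable to its source.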
Empirical Evaluation
A controlled user study (n=16) compares explorable theorems against a strong LLM chatbot baseline (Gemini), using fixed mathematical statements, standard proof tasks, and established proof comprehension metrics. The evaluation covers not only correctness but also granularity of explanation, linkage to explicit proof structure, and use of examples. Responses are ranked through both expert comparative judgments scored with TrueSkill and rubric-anchored scoring.

Figure 3: Instructor TrueSkill preference distributions; responses with explorable theorems are preferred by the instructor in 94.9% of paired evaluations.
Participants using explorable theorems produce responses judged superior to the baseline in 94.9% of comparisons for open-ended proof summarization, and they receive significantly higher fine-grained rubric scores for step explicitness and conceptual correctness.

Figure 4: Response quality by condition—explorable theorems usage produces more correct and more granular proof explanations.
Additionally, the system does not increase surface memorization or recall (as reflected in equivalent Parsons puzzle performance between groups), but specifically targets and improves deeper structural and semantic comprehension. Recorded behavior logs and timelines indicate rich engagement patterns, including frequent toggling between abstract and instance-level reading, non-linear proof navigation, and substantial use of example-driven falsification and edge-case probing.

Figure 5: Timeline of individual participant proof-reading sessions, annotated by proof step, view mode, and dependency navigation.
Self-reported task demand is also lower for explorable theorems: users feel less rushed and more successful, and 81% prefer the explorable theorems tool over the baseline.

Figure 6: Self-reported task load by condition; explorable theorems is perceived as less rushed and more successful.
Technical Analysis
The alignment pipeline builds on advances in LLM-based code generation and natural-language-to-Lean mapping, but it targets strict structural alignment rather than holistic Lean code synthesis. For undergraduate-level proofs and algebraic arguments, where Lean and prose proof styles are analogous, the pipeline achieves high mapping fidelity, though it occasionally mismatches blocks or omits dependencies for tactics like rfl and for complex closing steps (omega, contradiction). Current limitations include the inability to guarantee structural isomorphism in Lean code for more exotic or highly informal proofs, sensitivity to the LLM's translation quality, and occasional loss of step-level semantic saliency. The pipeline is nonetheless robust to task-level variation and can handle both true/false verification tasks and existence/constructive proofs with meaningful user bindings.
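For readers unfamiliar with the tactics named above, the following Lean 4 snippets show each one closing a goal in a single step. They assume a recent Lean 4 toolchain in which omega is available without extra imports, and they are illustrative examples rather than code from the paper. One plausible reading of the mapping difficulty, offered here as an assumption and not as the authors' diagnosis, is that such tactics consume hypotheses without naming them in the source text, while the corresponding prose would usually spell the reasoning out.

```lean
-- rfl closes the goal by computation and names no hypothesis.
example : 3 * 3 - 1 = 8 := rfl

-- omega uses hx from the context, although hx never appears in the tactic call.
example (x : Int) (hx : x > 2) : x - 1 > 1 := by omega

-- contradiction finds the clashing pair h1, h2 on its own.
example (x : Int) (h1 : x > 2) (h2 : ¬ x > 2) : False := by contradiction
```

In the second example, a written proof would typically say "since x > 2, we have x − 1 > 1", making the dependency on the hypothesis explicit even though the Lean block does not.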
Theoretical and Practical Implications
Formally grounding exposition surfaces distinct cognitive and pedagogical affordances. By offloading the incidental cognitive load (indexing variable bindings, arithmetic expansion, step tracing) to an automated, provenance-preserving backend, the user can focus on essential logical inference and higher-level reasoning. The interface supports both example-driven and dependency-driven reading strategies, giving flexibility to users with different proof reading expertise profiles.
Unlike prior HCI and mathematical augmentation systems, which either relied on manually authored logical linkages or operated post hoc at the level of static or visual annotation, explorable theorems structurally binds natural language to verified computation. This enables not just more effective passive reading, but also active hypothesis testing, falsification, and localized reasoning about boundary cases, assumptions, and counterexamples.
The techniques presented have clear applications beyond mathematical proof comprehension, potentially extending to program understanding, formal modeling in other domains, or even explorable legal and financial documents where formal structure is available.
Future Directions
The authors suggest several extension directions, including:
- Augmenting conversational interaction by allowing chat interfaces to operate over the formal backbone rather than generating ungrounded text responses;
- Scaling to research-level mathematics, where full formal proofs might not be available but partial formal semantic links can be exploited;
- Generalizing formally-backed explanation and explorability to structured domains such as software verification, circuit analysis, and plan verification.
Ongoing improvements in LLM-based formal proof generation and bidirectional NL-code alignment should ameliorate current pipeline limitations, moving towards broader coverage and more robust semantic mapping.
Conclusion
The paper provides compelling evidence that grounding mathematical exposition in a verified formal representation delivers measurable improvements in active proof comprehension, explanatory quality, and user experience over LLM-only chat-driven explanations (2604.02598). The design demonstrates how step-aligned, formally-executed interfaces can offload incidental cognitive load, surface logical dependencies, and enable both example-based and structure-based reasoning strategies. The approach opens promising avenues for future research in AI-augmented mathematical understanding and interactive, provenance-aware explanation systems across formal domains.