Explorable Theorems: Interactive Math Landscapes

Updated 4 July 2026

Explorable theorems are structured, interactive representations that embed theorem statements, proofs, and dependencies in navigable computational spaces.
They draw on axiom-based vectors, dependency graphs, semantic embeddings, and formal grounding to make theorems comparable, searchable, and executable.
Reported benefits include better theorem retrieval, automated lemma discovery, and executable explanations that support proof comprehension.

Explorable theorems are theorem-centered representations, systems, and interfaces in which theorem statements, proofs, dependencies, variants, and examples become navigable objects rather than static prose. Across current research, this notion spans axiom-indexed theorem maps, dependency graphs, theorem-level semantic retrieval, automated lemma discovery, geometry-discovery interfaces, and formally grounded proof readers that execute proofs on demand (Yoo, 31 Mar 2025, Wolfram, 2021, Alexander et al., 5 Feb 2026, Kambhamettu et al., 3 Apr 2026). The common objective is not merely theorem search or proof checking in isolation, but the construction of an interactive theorem landscape in which structural proximity, logical strength, local proof state, and counterfactual variation are all inspectable.

1. Conceptual models of theorem spaces

A central development in this area is the shift from treating a theorem as a bibliographic artifact toward treating it as a structured object embedded in a computational space. In the Axiom-Based Atlas, each theorem is represented by a proof vector over a fixed basis of axioms, typically binary but optionally weighted, so that theorem comparison becomes a problem in vector geometry rather than topical classification alone (Yoo, 31 Mar 2025). In empirical metamathematics, theorems appear as nodes in directed acyclic dependency graphs, with explicit notions of prerequisite cone, future cone, depth, centrality, and “power” derived from their role in shortening other proofs (Wolfram, 2021). In semantic theorem retrieval, each theorem is reduced to a short natural-language “slogan” and embedded into a shared vector space with queries, making theorem-level search feasible over a corpus of about $9.2$ million statements (Alexander et al., 5 Feb 2026). In formally grounded reading systems, a written theorem and proof are linked to a Lean proof, so the theorem becomes an executable object with inspectable proof states rather than a fixed textual assertion (Kambhamettu et al., 3 Apr 2026).

These perspectives differ in ontology, but they are compatible. A theorem may simultaneously be a proof vector, a node in a dependency DAG, a point in an embedding space, and a machine-checked program. This suggests that “explorability” is best understood not as one interface style, but as a multi-representational regime in which several mathematically meaningful neighborhoods coexist.

Representation family	Primitive object	Main affordance
Proof-vector atlas	Axiom-indexed vector	similarity, clustering, heatmaps
Dependency graph	DAG node with past/future cones	power, depth, learning paths
Semantic theorem embedding	theorem slogan in $\mathbb{R}^d$	theorem-level retrieval
Formal grounding	Lean theorem/proof state	stepwise execution and dependency tracing

2. Structural organization: axioms, dependencies, and logical strength

The most explicit vector-space treatment is the Axiom-Based Atlas, where a theorem $\tau$ is mapped to $v(\tau)\in\{0,1\}^n$ or to a weighted vector in $\mathbb{R}^n$, with coordinates indexed by axioms from systems such as Hilbert geometry, Peano arithmetic, or ZFC (Yoo, 31 Mar 2025). Similarity can then be computed by cosine similarity, Euclidean distance, or the Jaccard index for binary vectors, and visualized through heatmaps, similarity matrices, and clustering dendrograms. The atlas’s prototype examples are intentionally coarse: for instance, “Infinitely many primes” and commutativity of addition share the same Peano vector $[1,1,0,0,1]$. The framework’s contribution is nonetheless methodological: theorem similarity is defined by foundational footprint rather than by subject heading.
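The three similarity measures named above can be sketched in a few lines of plain Python. This is only an illustration: the shared vector $[1,1,0,0,1]$ comes from the atlas example, while `hypothetical_theorem` and its coordinates are invented here to show a non-trivial score and do not come from the source.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def euclidean(u, v):
    """Euclidean distance between two proof vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def jaccard(u, v):
    """Jaccard index of the axiom sets encoded by two binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 1.0

# Shared Peano vector from the atlas example quoted above.
infinitely_many_primes = [1, 1, 0, 0, 1]
addition_commutes = [1, 1, 0, 0, 1]
# Hypothetical third theorem, invented only for illustration.
hypothetical_theorem = [1, 0, 1, 0, 0]

print(jaccard(infinitely_many_primes, addition_commutes))     # 1.0
print(jaccard(infinitely_many_primes, hypothetical_theorem))  # 0.25
print(euclidean(infinitely_many_primes, addition_commutes))   # 0.0
```

Identical footprints score 1.0 under the Jaccard index regardless of subject matter, which is exactly the coarseness the prototype examples exhibit.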

Dependency-graph approaches supply a different geometry. In the Euclid study, the corpus consists of 465 propositions across 13 books together with 10 axioms, yielding a DAG with direct dependencies, transitive reduction, and transitive closure (Wolfram, 2021). On this graph, several quantitative invariants are defined: direct in-degree $d_{\text{in}}(T)$, prerequisite set size $P(T)$, future-cone size $R(T)$, graph-theoretic depth, betweenness centrality, closeness centrality, and axiom-dependence profile. A theorem’s “power” is formalized operationally by treating it as a superaxiom and measuring proof shortening across the corpus. In this setting, an explorable theorem space is one in which landmarks, bottlenecks, and module boundaries are mathematically explicit rather than pedagogically imposed.
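As a rough illustration of the cone and depth invariants, the following sketch computes them on a six-node toy graph. The node names and edges are hypothetical and are not taken from the Euclid corpus; `deps` maps each result to the results it cites directly.

```python
# Toy dependency graph (hypothetical): each result lists what it uses directly.
deps = {
    "A1": [], "A2": [],   # axioms
    "P1": ["A1"],
    "P2": ["A1", "A2"],
    "P3": ["P1", "P2"],
    "P4": ["P3"],
}

def prerequisite_cone(t):
    """Everything t depends on, directly or transitively."""
    seen, stack = set(), list(deps[t])
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(deps[s])
    return seen

def future_cone(t):
    """Everything that depends on t, directly or transitively."""
    return {s for s in deps if t in prerequisite_cone(s)}

def depth(t):
    """Length of the longest dependency chain from t down to an axiom."""
    return 0 if not deps[t] else 1 + max(depth(d) for d in deps[t])

print(sorted(prerequisite_cone("P3")))  # ['A1', 'A2', 'P1', 'P2']
print(sorted(future_cone("A2")))        # ['P2', 'P3', 'P4']
print(depth("P4"))                      # 3
```

The sizes of the two sets correspond to $P(T)$ and $R(T)$. The quadratic `future_cone` is adequate for a toy graph; a corpus-scale version would precompute the reversed graph instead.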

A third structural axis is logical strength. Higher-order Reverse Mathematics shows that many exceptional “zoo” principles, when converted to their uniform forms $UT$, collapse to the higher-order counterpart of arithmetical comprehension ($\mathsf{ACA}_0$) over a Kohlenbach-style base theory (Sanders, 2014). Uniform DNR, UADS, a UTS variant, UCOH, and related principles are all shown to be explicitly equivalent to that counterpart, with term extraction witnessing the equivalence. In this setting, an explorable theorem landscape has at least two layers: a non-uniform reverse-mathematical layer containing the zoo, and a uniform higher-order layer in which a large part of that zoo collapses into a single equivalence class. This suggests that theorem exploration is sensitive to the representational level at which theorems are compared.

3. Automated generation and discovery

A major branch of explorable-theorem research does not begin from a fixed corpus at all, but generates theorem spaces bottom-up. QuickSpec is the canonical example: given a signature of typed functions and datatypes, it interleaves term generation with random testing to form empirical equivalence classes, and emits equations between irreducible terms as conjectures (Johansson et al., 2021). Terms reducible by already discovered laws are pruned, so “interestingness” is tied to irreducibility with respect to the current rewrite system. QuickSpec has been integrated with HipSpec, Hipster, Hopster, and TIP/Vampire workflows, where its conjectures become inductive lemmas for subsequent automated proof.
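The testing-based core of this idea can be sketched as follows. This is a minimal illustration, not QuickSpec's implementation: the signature (integer `add`, variables `x` and `y`, constant `0`), the tuple encoding of terms, and every function name are assumptions made here, and the pruning of equations that follow from already discovered laws is omitted.

```python
import random
from collections import defaultdict

VARIABLES = ["x", "y"]
CONSTANTS = [0]

def terms(depth):
    """All terms over add/x/y/0 up to the given nesting depth, without duplicates."""
    if depth == 0:
        return VARIABLES + CONSTANTS
    smaller = terms(depth - 1)
    bigger = [("add", a, b) for a in smaller for b in smaller]
    return list(dict.fromkeys(smaller + bigger))

def evaluate(term, env):
    """Evaluate a term under an assignment of integers to variables."""
    if isinstance(term, tuple):
        _, left, right = term
        return evaluate(left, env) + evaluate(right, env)
    return env[term] if isinstance(term, str) else term

def size(term):
    return 1 + size(term[1]) + size(term[2]) if isinstance(term, tuple) else 1

def conjecture(depth=1, tests=20, seed=0):
    """Group terms by their values on random inputs; emit representative = member."""
    rng = random.Random(seed)
    envs = [{v: rng.randint(-50, 50) for v in VARIABLES} for _ in range(tests)]
    classes = defaultdict(list)
    for t in terms(depth):
        classes[tuple(evaluate(t, env) for env in envs)].append(t)
    equations = []
    for members in classes.values():
        rep = min(members, key=lambda t: (size(t), repr(t)))  # smallest term
        equations += [(rep, t) for t in members if t != rep]
    return equations

for lhs, rhs in conjecture():
    print(lhs, "=", rhs)
```

On this signature the sketch conjectures the unit laws and commutativity of `add`. It also emits `x = add(0, x)`, which follows from the other two laws; this is the kind of redundancy that pruning against the current rewrite system is meant to remove.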

TheSy replaces random testing by a purely deductive pipeline based on e-graphs, Syntax Guided Enumeration, Symbolic Observational Equivalence, conjecture screening, and an induction prover (Singher et al., 2020). The base theory consists of inductive datatypes, recursive definitions, and universally quantified equations; exploration then generates new universal equations that are valid, useful, and non-redundant. The authors emphasize two departures from the HipSpec/QuickSpec line: elimination of random testing and SMT-based filtering, and support for compositional reasoning with user-defined higher-order functions. In the reported benchmarks, the implementation finds more lemmas than prior art while avoiding redundancy.

Enumerative generation pushes this idea to its limit in the implicational fragment of intuitionistic linear logic. There, theorems are generated via the Curry–Howard correspondence by enumerating closed linear lambda terms in normal form and inferring their principal types, rather than testing provability formula by formula (Tarau et al., 2020). The resulting Prolog program runs in $O(N)$ space for terms of size $N$ and generates $7{,}566{,}084{,}686$ theorems together with proof terms in normal form in a few hours. In this regime, an explorable theorem space is literally enumerable: the theorem landscape can be traversed by size, proof shape, and type structure.
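The enumerate-then-type idea can be illustrated on very small sizes. The sketch below is written in Python rather than Prolog and makes several assumptions of its own: terms are nested tuples, size is the number of lambdas plus applications, and normal forms are obtained by filtering after generation instead of being generated directly. The unifier omits the occurs check, which is acceptable here only because closed linear terms are always simply typable.

```python
import itertools

def linear_terms(n, free=(), next_id=0):
    """Yield terms with n internal nodes that use each name in `free` exactly once."""
    if n == 0:
        if len(free) == 1:
            yield ("var", free[0])
        return
    v = f"v{next_id}"  # abstraction: the body must use v exactly once
    for body in linear_terms(n - 1, free + (v,), next_id + 1):
        yield ("lam", v, body)
    for mask in range(2 ** len(free)):  # application: split free variables and size
        left = tuple(x for i, x in enumerate(free) if mask >> i & 1)
        right = tuple(x for i, x in enumerate(free) if not mask >> i & 1)
        for k in range(n):
            for f in linear_terms(k, left, next_id):
                for a in linear_terms(n - 1 - k, right, next_id):
                    yield ("app", f, a)

def is_normal(t):
    """True if the term contains no beta-redex."""
    if t[0] == "var":
        return True
    if t[0] == "lam":
        return is_normal(t[2])
    return t[1][0] != "lam" and is_normal(t[1]) and is_normal(t[2])

def infer(term):
    """Principal type by unification; type variables are integers."""
    subst, fresh = {}, itertools.count()
    def walk(t):
        while isinstance(t, int) and t in subst:
            t = subst[t]
        return t
    def unify(a, b):
        a, b = walk(a), walk(b)
        if a == b:
            return
        if isinstance(a, int):
            subst[a] = b
        elif isinstance(b, int):
            subst[b] = a
        else:
            unify(a[1], b[1])
            unify(a[2], b[2])
    def go(t, env):
        if t[0] == "var":
            return env[t[1]]
        if t[0] == "lam":
            a = next(fresh)
            return ("->", a, go(t[2], {**env, t[1]: a}))
        f, x, r = go(t[1], env), go(t[2], env), next(fresh)
        unify(f, ("->", x, r))
        return r
    def resolve(t):
        t = walk(t)
        return t if isinstance(t, int) else ("->", resolve(t[1]), resolve(t[2]))
    return resolve(go(term, {}))

def show(ty):
    """Pretty-print a type, naming variables a, b, c in order of appearance."""
    names = {}
    def go(t, left):
        if isinstance(t, int):
            return names.setdefault(t, "abcdefghijklmnopqrstuvwxyz"[len(names)])
        s = f"{go(t[1], True)} -> {go(t[2], False)}"
        return f"({s})" if left else s
    return go(ty, False)

def theorems(n):
    return sorted({show(infer(t)) for t in linear_terms(n) if is_normal(t)})

print(theorems(1))  # ['a -> a']
print(theorems(3))
```

At size 3 there are five closed linear terms, three of which are in normal form, and their principal types are `(a -> b) -> a -> b`, `a -> (a -> b) -> b`, and `((a -> a) -> b) -> b`. Each is a theorem of the implicational fragment, with the generating term as its proof.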

Geometry supplies a more visually immediate version of discovery. GeoGebra’s Discover tool analyzes a construction around a selected point, generates candidate equalities, collinearities, concyclicity relations, parallel classes, perpendicular direction classes, and segment-congruence classes, and confirms them by a numerical pre-check followed by Gröbner-basis verification (Kovács et al., 2022). Each symbolic check is allotted a 5-second time limit; undecided conjectures are dropped, so the system may miss true statements but is designed not to emit false ones. Equivalence-class representations are crucial: rather than listing all pairwise parallelisms or all quadruples of concyclic points, the tool reports lines, circles, directions, and congruence classes. This is theorem exploration as pattern extraction from a live construction rather than as search over a textual library.
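The equivalence-class style of reporting can be illustrated for parallelism alone. The sketch below covers only a check on one concrete instance with exact integer arithmetic; it does not perform the Gröbner-basis verification, so its output is a candidate list rather than a proof. The construction (a parallelogram with integer coordinates) and all names are hypothetical, and points are assumed to be distinct.

```python
from itertools import combinations
from math import gcd

# Hypothetical construction: a parallelogram with integer coordinates.
points = {"A": (0, 0), "B": (4, 0), "C": (5, 3), "D": (1, 3)}

def direction(p, q):
    """Canonical primitive direction of the line through distinct points p and q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    g = gcd(dx, dy)
    dx, dy = dx // g, dy // g
    if dx < 0 or (dx == 0 and dy < 0):  # identify opposite orientations
        dx, dy = -dx, -dy
    return (dx, dy)

def parallel_classes(pts):
    """Group segments by direction and keep only classes with several members."""
    classes = {}
    for (n1, p1), (n2, p2) in combinations(sorted(pts.items()), 2):
        classes.setdefault(direction(p1, p2), []).append(n1 + n2)
    return [c for c in classes.values() if len(c) > 1]

print(parallel_classes(points))  # [['AB', 'CD'], ['AD', 'BC']]
```

Reporting two classes rather than a list of pairwise statements keeps the output readable as the construction grows. A relation that holds in one instance may still be a coincidence of the chosen coordinates, which is why a symbolic confirmation step follows the pre-check in the system described above.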

4. Executable explanations and theorem interaction

The strongest current realization of the phrase “explorable theorem” is a system that grounds a written theorem and proof in Lean and exposes the resulting formal structure back through the prose (Kambhamettu et al., 3 Apr 2026). An LLM translates the theorem and written proof into Lean, with have blocks aligned to written proof steps; another model links Lean tactics and identifiers back to the prose; Lean proof states are then diffed to construct a dependency DAG over facts introduced and consumed by each step. Because the proof is executable, the system can instantiate the theorem on custom examples or counterexamples, run the Lean proof, extract intermediate proof states, and inject step-level worked examples directly beneath the corresponding prose. Readers can therefore inspect step-local proof states, test boundary cases, and trace dependencies that would otherwise remain implicit in the written argument.
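The proof-state diffing step can be sketched independently of Lean. In this illustration each proof step is a triple of a label, the set of fact names in scope after the step, and the facts the step consumes. The data model, the step labels, and the fact names (`ha`, `hk`, `hsum`, and so on) are assumptions made for the example and are not the system's actual representation.

```python
def dependency_edges(initial, steps):
    """Return (producer, consumer) edges between proof steps.

    A step produces the facts that appear in its proof state but not in the
    previous one, and consumes the facts listed in its `used` set.
    """
    producer = {fact: "hypotheses" for fact in initial}
    previous = set(initial)
    edges = set()
    for name, state_after, used in steps:
        for fact in used:
            edges.add((producer[fact], name))
        for fact in state_after - previous:  # diff of successive proof states
            producer[fact] = name
        previous = set(state_after)
    return edges

# Hypothetical trace for "the sum of two even numbers is even".
steps = [
    ("unpack witnesses", {"ha", "hb", "hk", "hm"}, {"ha", "hb"}),
    ("add the equations", {"ha", "hb", "hk", "hm", "hsum"}, {"hk", "hm"}),
    ("factor out 2", {"ha", "hb", "hk", "hm", "hsum", "hfactor"}, {"hsum"}),
]

for edge in sorted(dependency_edges({"ha", "hb"}, steps)):
    print(edge)
```

Each edge records which earlier step supplied a fact that a later step relies on, which is the information a reader needs to trace a dependency that the prose leaves implicit.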

This grounding changes the interaction model. Rather than asking an LLM to improvise many separate explanations, the reader interacts with one persistent mathematical object: a theorem/proof pair with a machine-checked operational semantics. Custom inputs are propagated through the Lean proof via substitution and simplification, so the interface can show precisely which step fails when an assumption is violated. In the reported user study, participants with access to these explorability features gave better, more correct, and more detailed answers to proof-comprehension questions, indicating stronger overall understanding (Kambhamettu et al., 3 Apr 2026).

Augmented-reality authoring systems extend explorable interaction into educational documents. Augmented Math starts from ordinary textbooks or handouts, extracts formulas and figures using Google Cloud OCR, MathPix OCR, CnSTD, and OpenCV, and binds them to AR overlays rendered with 8th Wall, A-Frame, MathJax, Konva, and MathJS (Chulpongsatorn et al., 2023). The system identifies five recurrent augmentation patterns (dynamic values, interactive figures, relationship highlights, concrete examples, and step-by-step hints) and uses them to turn printed equations, graphs, and diagrams into manipulable objects. The reported studies found AR to be the most engaging condition, and participants also rated interactive figures, relationship highlights, step-by-step hints, and concrete examples individually. Although this work is framed as math education rather than theorem navigation, it demonstrates a portable design principle: theorem comprehension benefits when symbolic statements, geometric objects, and incremental transformations are bound together in one interface.

Research on explorable explanations in visualization adds a cautionary design lesson. Six recurring explanatory modes (short text, long text, correction, redraw, highlighting, and annotation) were identified, and an explorable-explanation condition was compared against them in two crowd studies (Lo et al., 2023). Exposure to explanations improved proficiency in identifying deceptive charts, and participants accepted a substantial share of the proposed modifications, but no significant advantage of explorable explanations over the other explanation methods was found. A plausible implication is that theorem interfaces should not assume that interactivity dominates static exposition; the more robust principle is the coordinated use of contrast, annotation, and carefully limited parameter manipulation.

5. Retrieval, agents, and adaptive theorem landscapes

Once the theorem is the unit of indexing, theorem exploration becomes a retrieval problem as much as a proof problem. The largest current corpus treats theorem statements, not papers, as the primary search objects, collecting about $9.2$ million theorem statements extracted from arXiv and seven additional open sources (Alexander et al., 5 Feb 2026). The pipeline parses theorem-like environments, expands basic author-defined macros, and summarizes each statement as a short natural-language slogan. These slogans are embedded with Qwen3-Embedding-8B into $\mathbb{R}^d$, indexed with HNSW and binary quantization, and optionally reranked with a cross-encoder. On a curated benchmark of queries written by professional mathematicians, the best system substantially outperformed Google Search, arXiv search, ChatGPT 5.2, and Gemini 3 Pro on the reported theorem-level and paper-level retrieval metrics (Alexander et al., 5 Feb 2026). This makes theorem browsing feasible at web scale and opens a direct interface between human mathematical intent and theorem-level corpora.
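The coarse-then-fine structure of such a pipeline can be shown at toy scale. The sketch below makes several substitutions that should be read as assumptions: the four-dimensional vectors are invented rather than produced by an embedding model, a brute-force Hamming scan stands in for the HNSW index, and full-precision cosine similarity stands in for the cross-encoder reranker.

```python
from math import sqrt

def binarize(v):
    """Binary quantization: keep only the sign bit of each coordinate."""
    return tuple(1 if x > 0 else 0 for x in v)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical slogans with invented embeddings.
corpus = {
    "Every bounded monotone sequence converges": [0.9, 0.1, -0.3, 0.2],
    "Every finite group of prime order is cyclic": [-0.5, 0.8, 0.4, -0.1],
    "A continuous function on a compact set is bounded": [0.7, 0.3, -0.1, -0.6],
}

def search(query, corpus, shortlist=2, k=1):
    """Shortlist by Hamming distance on sign bits, then rerank by cosine."""
    q_bits = binarize(query)
    coarse = sorted(corpus, key=lambda s: hamming(q_bits, binarize(corpus[s])))
    return sorted(coarse[:shortlist],
                  key=lambda s: cosine(query, corpus[s]), reverse=True)[:k]

print(search([0.8, 0.2, -0.4, 0.1], corpus))
# ['Every bounded monotone sequence converges']
```

The design point is that the cheap binary comparison touches every candidate, while the expensive comparison runs only on the shortlist.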

DeepTheorem moves from retrieval to interactive proving. Its corpus contains $121$K IMO-level informal theorems and proofs, each annotated for correctness, difficulty, and topic category, together with systematically constructed verifiable variants (Zhang et al., 29 May 2025). Reinforcement learning is driven by variant consistency: the model must output \boxed{proved} or \boxed{disproved} for original statements and their entailed or contradictory variants, receiving binary reward for correctness. Outcome and process are evaluated separately, each with its own score.

For Qwen2.5-7B, RL-Zero on DeepTheorem yields average outcome and process scores that improve on SFT baselines (Zhang et al., 29 May 2025). In an explorable-theorem setting, this means that theorem neighborhoods are no longer passive collections: they can be traversed by logical variation, with the model asked to prove, refute, or compare nearby statements.
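The boxed-verdict reward described above is simple enough to sketch directly. The function names and the example completions are hypothetical, and the aggregate `variant_consistent` check (all variants of one theorem judged correctly) is an assumption about how per-statement rewards might be combined, not a detail confirmed by the source.

```python
import re

VERDICT = re.compile(r"\\boxed\{(proved|disproved)\}")

def extract_verdict(completion):
    """Return the last boxed verdict in a completion, or None if there is none."""
    found = VERDICT.findall(completion)
    return found[-1] if found else None

def outcome_reward(completion, label):
    """Binary reward: 1 if the boxed verdict matches the label, else 0."""
    return int(extract_verdict(completion) == label)

def variant_consistent(completions, labels):
    """Assumed aggregate: every variant of one theorem is judged correctly."""
    return all(outcome_reward(c, l) for c, l in zip(completions, labels))

# Hypothetical completions for an original statement and a contradictory variant.
completions = [
    r"The bound follows from AM-GM. \boxed{proved}",
    r"Taking x = 0 gives a counterexample. \boxed{disproved}",
]
labels = ["proved", "disproved"]

print([outcome_reward(c, l) for c, l in zip(completions, labels)])  # [1, 1]
print(variant_consistent(completions, labels))                      # True
```

A completion with no boxed verdict earns zero reward under this sketch, which penalizes answers that never commit to a judgment.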

A theoretical account of self-play theorem proving makes this graph view explicit. There, theorems are nodes of a graph, edges connect semantically similar pairs whose success rates are close for all provers in a given class, and a conjecturer is a reversible random walk or a weighted neighbor sampler on the resulting graph (Chen et al., 1 Jun 2026). Under a strong isoperimetric inequality on the theorem graph, a prover–conjecturer system yields exponential growth in the number of theorems proved with success rate at least a fixed threshold. To counter the empirical pathology that self-play conjecturers tend to generate artificially complex and non-fundamental theorems, the paper introduces a diversity measure and proposes local maximization of this quantity using diffusion similarity between neighboring theorems, approximated via contrastive embeddings (Chen et al., 1 Jun 2026). This reframes exploration as coverage of theorem space rather than mere accumulation of nearby variants.

A complementary philosophical strand emphasizes that proof graphs and search traces may become objects of exploration in their own right. Reflection on proof assistants, LLMs, and hypothetical “AlephZero”-like systems argues that future mathematical experience may include “glitching”: a game-like search for uncanny consequences of definitions, supported by test suites, proof-graph analytics, and interaction with machine-generated proofs (DeDeo, 2023). This suggests that explorable theorems are not only about reading known results more effectively, but also about exposing the latent behavior of formal systems to systematic play.

6. Limits, tensions, and open directions

Current systems expose several unresolved tensions. One concerns representational fidelity. Semantic theorem retrieval deliberately converts formal statements into plain-English slogans because symbolic theorem bodies are difficult to embed directly, but this abstraction can lose crucial logical detail (Alexander et al., 5 Feb 2026). Conversely, formal grounding in Lean preserves logical structure but depends on translating prose into a machine-checked proof and keeping the formal proof aligned to the written argument; this is powerful when it works, but the representation burden is substantial (Kambhamettu et al., 3 Apr 2026). A mature explorable-theorem environment will likely require both layers simultaneously: human-queryable semantic summaries and executable formal cores.

A second tension concerns scalability and noise. Parsing theorem environments from large, heterogeneous corpora remains imperfect, and the public theorem-search dataset is limited to permissively licensed arXiv papers plus open projects (Alexander et al., 5 Feb 2026). In AR document augmentation, equation recognition, graph detection, and geometry detection all remain imperfect, which constrains reliable theorem augmentation on printed materials (Chulpongsatorn et al., 2023). GeoGebra discovery is sound-oriented but incomplete, with per-conjecture timeouts and performance degradation on larger constructions such as a regular 20-gon (Kovács et al., 2022). Explorable theorem systems therefore remain bounded not only by interface design, but by front-end extraction quality and by the tractability of the underlying symbolic engines.

A third issue is conceptual granularity. Reverse-mathematical work shows that theorem distinctions can either proliferate or collapse depending on whether one studies non-uniform principles or their uniform higher-order versions (Sanders, 2014). Self-play theory makes a similar point in graph-theoretic language: local exploration can either enlarge genuine coverage or merely densify one cluster of trivial variants (Chen et al., 1 Jun 2026). This suggests that theorem exploration needs explicit controls over what counts as a meaningful neighborhood—semantic similarity, foundational strength, dependency proximity, proof reuse, or executable variation are not interchangeable.

The broad direction of travel is nonetheless clear. Theorem corpora are becoming theorem-level rather than paper-level; theorem representations are becoming vectorial, graphical, executable, and queryable; and theorem interfaces are beginning to expose proof state, dependency structure, examples, and logical variants as first-class interactive objects. A plausible synthesis is a layered theorem environment in which a reader can search by slogan, inspect a theorem’s axiom profile and dependency cone, execute its formal proof on custom inputs, traverse nearby variants, and situate the result within a larger logical landscape. In that sense, explorable theorems are less a single tool category than an emerging research program for turning mathematical results into navigable computational objects.