
Premise Selection in Theorem Proving

Updated 13 April 2026
  • Premise selection is the process of filtering large pools of candidate statements to extract only the essential premises needed to prove a conjecture.
  • Modern approaches utilize neural, graph-based, and feature-driven methods to rank proof dependencies effectively in extensive formal libraries.
  • Performance metrics such as recall@k and downstream proof success rate measure the efficiency, scalability, and practicality of these selection techniques in theorem proving.

Premise selection is a central challenge in automated theorem proving (ATP) and interactive theorem proving (ITP), consisting of filtering large pools of candidate lemmas, axioms, or previously established results to identify a minimal, relevant subset sufficient to close a given conjecture or subgoal. The efficiency and scalability of both symbolic and neural proof search methods hinge critically on robust premise-selection techniques, particularly in large formal libraries where the search space can comprise hundreds of thousands of candidate statements. Recent advances in machine learning—especially deep learning and graph-based techniques—have substantially improved premise-selection accuracy, supplanted hand-engineered heuristics, and enabled practical hammers in modern ITP environments such as Isabelle and Lean (Mikuła et al., 2023, Zhu et al., 9 Jun 2025, Petrovčič et al., 24 Oct 2025).

1. Formal Problem Definition and Task Objectives

In its most general form, premise selection is defined as follows: given a conjecture or proof state $q$ and a library of candidate premises $P = \{p_1, p_2, \dots\}$, the task is to produce a relevance ranking or an explicit subset $\hat{P}_q \subseteq P$ such that the automated or interactive prover can synthesize a proof of $q$ using exclusively, or at least predominantly, premises in $\hat{P}_q$.

Formally, many methods pose the problem either as a binary classification or as a ranking task over pairs $(q, p)$, where a model $f_\theta(q, p)$ yields relevance scores, probabilities, or marginal utility estimates. In type-theoretic settings (such as Lean), $q$ encodes both a goal $G$ and a context $\Gamma$, and $P$ is dynamically filtered to match the local scope of accessible theorems, definitions, and hypotheses (Zhu et al., 9 Jun 2025). The ranking function is often realized through cosine similarity of learned embeddings or a more refined neural cross-encoder (Mikuła et al., 2023, Tao et al., 21 Jan 2025).

Success is measured via proof-rate (fraction of goals provable from selected premises), recall@k (fraction of true dependencies recovered in top-k), or classification accuracy on human- or ATP-generated ground-truth proof dependencies. High-dimensionality, extreme class imbalance, and the existence of multiple alternative proofs per conjecture motivate advanced sampling and surrogate modeling techniques (Alama et al., 2011, Piotrowski et al., 2018).
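
The ranking formulation above can be sketched as a toy dense-retrieval scorer; the embedding dimension, the planted "relevant" premise, and the cosine scoring function are illustrative assumptions, not any particular system's API:

```python
import numpy as np

def select_premises(goal_emb, premise_embs, k):
    """Rank premises by cosine similarity to the goal; return top-k indices."""
    g = goal_emb / np.linalg.norm(goal_emb)
    P = premise_embs / np.linalg.norm(premise_embs, axis=1, keepdims=True)
    scores = P @ g                      # relevance score f_theta(q, p) per premise
    return np.argsort(-scores)[:k]     # indices of the k highest-scoring premises

rng = np.random.default_rng(0)
goal = rng.normal(size=64)
premises = rng.normal(size=(1000, 64))
premises[42] = goal + 0.01 * rng.normal(size=64)   # plant a near-duplicate "relevant" premise
top = select_premises(goal, premises, k=10)
```

With random embeddings the planted near-duplicate dominates the ranking, illustrating why retrieval quality reduces to how well the encoder places true dependencies near the goal.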

2. Methodological Paradigms

2.1 Symbolic and Feature-based Approaches

Traditional systems used hand-crafted symbolic features (symbol presence, subterm structure, type constants) extracted from the logical syntax trees of statements. Early learning approaches applied k-nearest neighbors (k-NN), naive Bayes and linear or kernelized SVMs to this representation (Gauthier et al., 2015, Alama et al., 2011). These methods operate on sparse, binary or count-valued feature vectors and learn to associate conjectures with relevant proof dependencies via supervised multi-label or binary classification, typically over ground-truth corpora assembled via fine-grained dependency extraction (Alama et al., 2011).
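
A minimal sketch of this paradigm, assuming toy whitespace tokenization and a three-statement corpus: statements become binary symbol sets, and a k-NN vote over Jaccard similarity pools the proof dependencies of the nearest training facts.

```python
from collections import Counter

def symbol_features(statement):
    """Binary bag-of-symbols feature set for a statement."""
    return set(statement.replace("(", " ").replace(")", " ").split())

def knn_premises(conjecture, labeled, k=2):
    """Rank known facts by Jaccard overlap of symbol sets; pool their dependencies."""
    feats = symbol_features(conjecture)
    def sim(other):
        f = symbol_features(other)
        return len(feats & f) / max(len(feats | f), 1)
    nearest = sorted(labeled, key=lambda fact: -sim(fact))[:k]
    votes = Counter()
    for fact in nearest:
        votes.update(labeled[fact])     # each neighbour votes for its own proof dependencies
    return [p for p, _ in votes.most_common()]

# toy ground-truth corpus: statement -> premises used in its proof
corpus = {
    "add_comm ( nat )": ["nat.add_assoc", "nat.zero_add"],
    "mul_comm ( nat )": ["nat.mul_assoc", "nat.one_mul"],
    "rev_rev ( list )": ["list.append_assoc"],
}
ranked = knn_premises("add_assoc ( nat )", corpus, k=2)
```

The two `nat` facts outrank the `list` fact on symbol overlap, so only their dependencies are suggested.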

SInE-style heuristics, which propagate symbol-based generality notions through the signature graph, can be optimized using Bayesian optimization (GP-based surrogate models), enabling efficient parameter tuning across complex, multidimensional heuristic landscapes (Słowik et al., 2019).

2.2 Neural Sequence and Embedding Models

Deep neural approaches supersede manual feature engineering via end-to-end learning over tokenized, string, or AST-based representations of statements:

  • Dual/dense encoders: Models such as Magnushammer (Mikuła et al., 2023) and Sentence-BERT-style encoders (Zhu et al., 9 Jun 2025, Tao et al., 21 Jan 2025) train Transformers to produce joint vector embeddings for both proof states and candidate premises, with cosine similarity yielding fast, scalable retrieval. Contrastive learning (batch InfoNCE loss) is used to align positive state-premise pairs and repel negatives.
  • Cross-encoders: For improved fine-grained signal, candidate premises retrieved by dense encoders are re-ranked with a cross-encoding architecture that jointly embeds (state, premise) pairs into a shared Transformer and predicts a relevance score, typically trained with binary cross-entropy (Mikuła et al., 2023, Tao et al., 21 Jan 2025).
  • Definition-aware and character/word-level models: DeepMath (Alemi et al., 2016) leverages both character-level and definition-aware word-level convolutional encoders, composing definition embeddings for high-level token representations.
  • Functional signature embeddings: Simplified approaches reduce the formula to functional signature counts, further compressed via learned distributed representations, achieving competitive accuracy with only shallow classifiers (Kucik et al., 2018).
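
The in-batch InfoNCE objective used to train dual encoders can be written down directly; the batch below is random toy embeddings rather than real proof states, and the temperature value is an illustrative default:

```python
import numpy as np

def info_nce(state_embs, premise_embs, temperature=0.07):
    """In-batch InfoNCE: row i's positive is premise i; the other rows serve as negatives."""
    s = state_embs / np.linalg.norm(state_embs, axis=1, keepdims=True)
    p = premise_embs / np.linalg.norm(premise_embs, axis=1, keepdims=True)
    logits = (s @ p.T) / temperature                  # pairwise cosine similarities, scaled
    logits -= logits.max(axis=1, keepdims=True)       # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # mean NLL of the diagonal positives

rng = np.random.default_rng(1)
states = rng.normal(size=(8, 32))
aligned_loss = info_nce(states, states + 0.01 * rng.normal(size=(8, 32)))
random_loss = info_nce(states, rng.normal(size=(8, 32)))
```

Well-aligned state–premise pairs drive the loss toward zero, while unaligned embeddings stay near the log-batch-size baseline; this gap is exactly the training signal that pulls positives together and pushes in-batch negatives apart.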

2.3 Graph-based and Structural Models

Graph neural networks (GNNs) and their variants exploit structural information by encoding formulas (premises, conjectures) or entire proof problems as directed multigraphs or dependency graphs:

  • Graph embedding of formulas: FormulaNet (Wang et al., 2017) parses higher-order statements into variable-renaming invariant graphs, with node updates incorporating edge ordering via "treelets", yielding state-of-the-art HolStep classification accuracy.
  • Dependency-graph augmentation: Recent approaches combine dense dual encoders with relational GNNs layered over heterogeneous dependency graphs capturing proof state–premise and premise–premise relations, as in LeanDojo (Petrovčič et al., 24 Oct 2025). GNN propagation refines the initial textual embeddings, smoothing representations across related premises and enabling multi-hop dependency recovery.
  • Graph-to-sequence modeling: Some methods, inspired by image captioning, apply GNNs for graph encoding of the problem, with the pooled embedding fed into a sequential decoder (LSTM) to generate an ordered list of premises, facilitating sequence-level dependencies (Holden et al., 2023).
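
A generic sketch of this refinement step (not any specific system's architecture): a few rounds of mean-aggregation message passing mix each premise's textual embedding with those of its graph neighbours, so connected premises drift toward each other while isolated nodes are left alone.

```python
import numpy as np

def propagate(embs, edges, num_rounds=2, alpha=0.5):
    """Refine node embeddings by mixing each node with the mean of its neighbours."""
    n = embs.shape[0]
    adj = np.zeros((n, n))
    for u, v in edges:                  # undirected dependency edges
        adj[u, v] = adj[v, u] = 1.0
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    h = embs.copy()
    for _ in range(num_rounds):
        h = (1 - alpha) * h + alpha * (adj @ h) / deg   # relational smoothing step
    return h

# three premises: 0 and 1 are linked by a dependency edge, 2 is isolated
embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
refined = propagate(embs, edges=[(0, 1)])
```

After propagation, premises 0 and 1 share features they started without, which is what lets retrieval surface a premise that is only relevant via an intermediate dependency.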

2.4 Online, Lightweight, and Symbolic Models

Custom random forests and k-NN baselines remain attractive for ultra-lightweight, proof assistant-integrated premise selection—suitable for interactive suggestion and rapid feedback, at the cost of limited global context and expressivity (Piotrowski et al., 2023, Gauthier et al., 2015). Random forests are grown online over symbol/bigram/trigram features and achieve sub-second response times for practical use in Lean (Piotrowski et al., 2023).
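
The symbol/bigram/trigram features such a lightweight predictor consumes can be extracted in a few lines; the whitespace tokenization here is a toy assumption, stated only to make the sketch self-contained:

```python
def ngram_features(statement, max_n=3):
    """Symbol unigrams, bigrams, and trigrams over the token stream of a statement."""
    toks = statement.split()
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(toks) - n + 1):
            feats.add(" ".join(toks[i:i + n]))   # contiguous n-gram of tokens
    return feats

feats = ngram_features("Nat.add a b = Nat.add b a")
```

Because the feature space is just string n-grams, an online forest can grow incrementally as new lemmas enter the library, which is what keeps response times interactive.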

Gradient boosting with engineered features and ATP feedback ("ATPboost") closes the loop between learning and symbolic search, iteratively improving premise rankings via new proof discoveries and hard negative mining (Piotrowski et al., 2018).

3. Training Data, Labeling, and Negative Sampling

Effective supervision for premise selection hinges on detailed dependency annotation—minimally sufficient sets of premises actually used in formal or ATP-generated proofs. For classical libraries, fine-grained corpus analysis (splitting micro-articles, greedy minimization of environments) yields training sets pairing conjectures to all and only their true dependencies (Alama et al., 2011).

Positive pairs are straightforward; the challenge lies in constructing informative negative samples:

  • Random negatives risk uninformative "easy" contrasts, as many 'irrelevant' premises are trivial to reject.
  • Hard negative mining selects negatives that score highly under current models or which are close to the worst-ranking positive, thereby challenging the model during training and accelerating convergence (Mikuła et al., 2023, Alemi et al., 2016, Piotrowski et al., 2018).
  • In-batch negatives and masking: Contrastive objectives (InfoNCE) exploit other batch positives as negatives, carefully masking accidental positives to avoid penalizing correct retrieval (especially in multi-proof or partially-labeled settings) (Zhu et al., 9 Jun 2025, Tao et al., 21 Jan 2025).

For language-based settings, domain-specific tokenization and representation (e.g., splitting contexts and goals, use of special markers for hypotheses and conclusions) aligns neural encoders closely with the structure of proof assistant data (Tao et al., 21 Jan 2025, Zhu et al., 9 Jun 2025).

Training datasets in the largest experiments comprise several million state–premise pairs, derived from entire proof assistant libraries (Isabelle, Lean), with up to hundreds of thousands of unique premises (Mikuła et al., 2023, Zhu et al., 9 Jun 2025).

4. Evaluation Protocols and Benchmarks

Evaluation of premise selection entails both intrinsic information-retrieval style metrics and extrinsic proof success rates:

  • Recall@k, Precision@k, nDCG: Quantify retrieval quality against the gold set of proof dependencies, averaged over test conjectures.
  • Proof success rate: Measures whether the downstream prover, supplied only with the top-k premises, can reconstruct a proof within resource bounds. This is the definitive metric for system integration.
  • Mean average precision (MAP), mean reciprocal rank (MRR): Standard in IR tasks, reporting average relevance and ranking performance (Ferreira et al., 2020, Tao et al., 21 Jan 2025).
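
The retrieval metrics above reduce to a few lines; the ranked list and gold dependency set below are toy data chosen for illustration:

```python
def recall_at_k(ranked, gold, k):
    """Fraction of gold dependencies appearing in the top-k of the ranking."""
    return len(set(ranked[:k]) & set(gold)) / len(gold)

def reciprocal_rank(ranked, gold):
    """1 / (1-based rank of the first relevant premise); 0 if none is retrieved."""
    for i, p in enumerate(ranked, start=1):
        if p in gold:
            return 1.0 / i
    return 0.0

ranked = ["mul_one", "add_comm", "zero_add", "add_assoc"]
gold = ["add_comm", "add_assoc"]
r2 = recall_at_k(ranked, gold, k=2)    # only add_comm appears in the top-2
rr = reciprocal_rank(ranked, gold)     # first relevant premise sits at rank 2
```

Averaging `reciprocal_rank` over test conjectures gives MRR; proof success rate, by contrast, requires actually running the prover on the top-k and cannot be computed from the ranking alone.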

Benchmarks include PISA and miniF2F for Isabelle (Mikuła et al., 2023); LeanDojo (Petrovčič et al., 24 Oct 2025), Mathlib, and miniCTX-v2 for Lean (Zhu et al., 9 Jun 2025); MPTP2078 and DeepMath for Mizar (Alama et al., 2011, Holden et al., 2023); and HolStep for higher-order logic (Wang et al., 2017). Category-level breakdowns (e.g., Algebra, Number Theory) enable detailed performance diagnostics (Ferreira et al., 2020).

Transformers and graph-augmented models deliver substantial improvements, for example, raising proof success rate on PISA from 38.3% (Sledgehammer) to 59.5% (Magnushammer), and up to 71.0% when combined with a language-model-based generative prover (Mikuła et al., 2023). On LeanDojo, GNN-augmented retrieval outperformed text-only baselines by over 25% on standard metrics (Petrovčič et al., 24 Oct 2025). In Lean's practical hammer, a domain-specific LM retriever enabled a 21% relative increase in end-to-end proof rate versus MePo (Zhu et al., 9 Jun 2025).

5. Comparative Analysis and Future Directions

Premise selection has evolved from symbol and feature-driven methods—k-NN, naive Bayes, SVMs, and random forest classifiers—towards state-of-the-art neural retrieval augmented with structural learning and rich contrastive objectives. The key breakthroughs are:

  • Scaling neural retrieval and re-ranking: Transformer-based contrastive retrievers, cross-encoders, and hybrid GNN architectures have eliminated reliance on hand-crafted heuristics, made multi-hop and semantic retrieval tractable, and integrated seamlessly with large code and mathematical libraries (Mikuła et al., 2023, Tao et al., 21 Jan 2025, Petrovčič et al., 24 Oct 2025).
  • Cross-system generality: Modern retrievers, by operating end-to-end on raw or pre-processed text representations, transfer across formal languages (Lean, Isabelle, Coq, HOL).
  • Handling rich type- and dependency-theoretic structure: Dependency graphs and heterogeneous relational GNNs encode both proof and signature relations, enabling improved generalization and multi-step reasoning (Petrovčič et al., 24 Oct 2025).
  • Practical hammers: Integration of high-recall neural premise selection with symbolic proof search and reconstruction (e.g., with Duper in Lean) has yielded domain-general hammers for previously underserved proof assistants (Zhu et al., 9 Jun 2025).

Important challenges remain:

  • Context length and scaling: Transformers are limited by maximum sequence lengths, motivating hybrid models combining text and structure.
  • Negative mining and robust generalization: Further research in dynamic negative sampling, data augmentation, and novel loss designs is needed for resilience.
  • Integration with generation and proof search: End-to-end joint training of retrieval and generation modules promises even tighter feedback between premise selection and proof synthesis.
  • Application to mathematical text: Informal mathematical texts pose additional challenges in premise selection, where standard NLP embeddings underperform and structural or graph-based methods remain in early stages (Ferreira et al., 2020).

Premise selection is thus a vibrant intersection of symbolic logic, machine learning, graph theory, and large-scale formalization, with persistent open problems and far-reaching implications for full automation of mathematical reasoning.
