
Generalization of specialized Lean provers beyond mathematics

Determine whether specialized Large Language Model theorem provers for Lean that are trained and evaluated primarily on mathematical datasets generalize effectively to scientific domains beyond mathematics, and quantify the extent of such generalization relative to their performance within mathematics.


Background

The paper discusses limitations of highly specialized Lean provers (e.g., DeepSeek-Prover, Kimina-Prover) that have been trained largely on mathematical corpora and benchmarks. While these systems achieve strong results on math-focused datasets, their behavior outside mathematics remains largely untested.

The authors motivate Ax-Prover in part by this uncertainty, aiming to provide a tool-enabled approach that can adapt across domains, including physics. This open question concerns the external validity of specialized models beyond their primary training domain.

References

First, since they were mainly trained and tested in the domain of mathematics, their ability to generalize beyond this domain remains unclear.

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics (Tredici et al., 14 Oct 2025, arXiv:2510.12787), Section 1 (Introduction)