A Semantic Search Engine for Mathlib4 (2403.13310v2)

Published 20 Mar 2024 in cs.IR, cs.LG, and cs.LO

Abstract: The interactive theorem prover Lean enables the verification of formal mathematical proofs and is backed by an expanding community. Central to this ecosystem is its mathematical library, mathlib4, which lays the groundwork for the formalization of an expanding range of mathematical theories. However, searching for theorems in mathlib4 can be challenging. To successfully search in mathlib4, users often need to be familiar with its naming conventions or documentation strings. Therefore, creating a semantic search engine that can be used easily by individuals with varying familiarity with mathlib4 is very important. In this paper, we present a semantic search engine (https://leansearch.net/) for mathlib4 that accepts informal queries and finds the relevant theorems. We also establish a benchmark for assessing the performance of various search engines for mathlib4.


Summary

  • The paper introduces a novel semantic search engine for mathlib4 that translates formal theorem statements into informal descriptions so they can be retrieved with natural-language queries.
  • It employs a vector embedding model and query augmentation to align informal queries with formal theorem representations, surpassing baseline methods.
  • Experimental benchmarks demonstrate significant improvements in theorem retrieval accuracy, enhancing access to formal mathematical proofs for diverse users.

Semantic Search Engine for Mathlib4: Enhancing Theorem Retrieval with Informal Queries

Introduction to Theorem Retrieval Challenges

Lean, a state-of-the-art interactive theorem prover, supports the verification of formal mathematical proofs and is backed by an extensive mathematical library, mathlib4. A significant challenge within this ecosystem is theorem retrieval: because mathlib4 is written in a formal language, users typically need to know its specific naming conventions or documentation strings to find a theorem. This paper addresses that challenge by introducing a semantic search engine for mathlib4 that accepts informal queries, making it substantially easier for users, especially beginners, to locate relevant theorems.

Related Work

Existing text retrieval methods range from classical lexical approaches such as BM25 to modern neural models that capture semantic similarity between documents and queries. Despite their effectiveness in general text retrieval, these methods fall short in mathematical information retrieval (MIR) because of the structured nature of mathematical formulas. Within MIR, both classical structure-based search methods and recent dense retrievers have shown promise, but neither has been fully explored or adapted for searching formal mathematical libraries such as mathlib4.
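
For orientation, the classical baseline mentioned above is purely lexical: the standard BM25 function (recalled here as general background, not as anything specific to this paper) scores a document $D$ against a query $Q$ as

$$\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\cdot\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},$$

where $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are tuning parameters (commonly $k_1 \approx 1.2$–$2.0$ and $b \approx 0.75$). Because the score depends only on shared terms, it cannot bridge the vocabulary gap between an informal query and a formally stated theorem, which is precisely the gap the embedding-based approach described next targets.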

Methodology

Our method involves three main steps: informalizing the formal statements in mathlib4, constructing a semantic search engine over the resulting informal-formal theorem pairs, and augmenting user queries to improve retrieval relevance. To translate formal theorem statements into informal language comprehensible to diverse users, we employ an LLM. The resulting informal-formal theorem pairs are then embedded into a vector space, enabling semantic search; a minimal sketch of this embed-and-retrieve step is given below. Finally, query augmentation enriches user input with formal expressions and additional context, again leveraging an LLM to sharpen search precision.
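
The following sketch illustrates the embed-and-retrieve step under assumptions of our own: the sentence-transformers library, the all-MiniLM-L6-v2 model, and the two toy corpus entries are stand-ins for exposition, not the components actually used by leansearch.net.

```python
# Minimal embed-and-retrieve sketch over informalized theorem statements.
# Model choice and corpus entries are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

# Each entry pairs a mathlib4 theorem name with its LLM-generated informal description.
corpus = [
    {"name": "Nat.add_comm", "informal": "Addition of natural numbers is commutative."},
    {"name": "Real.sqrt_nonneg", "informal": "The square root of a real number is nonnegative."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

# Embed the informal descriptions once; normalized vectors make dot product = cosine similarity.
doc_vecs = model.encode([d["informal"] for d in corpus], normalize_embeddings=True)

def search(query: str, k: int = 5):
    """Return the k theorems whose informal descriptions are closest to the query."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    top = np.argsort(-scores)[:k]
    return [(corpus[i]["name"], float(scores[i])) for i in top]

print(search("commutativity of addition on the naturals", k=2))
```

In a full-scale system the corpus would cover all of mathlib4, and nearest-neighbor search would typically be served by an approximate index (e.g. an HNSW graph) rather than a brute-force dot product.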

Mathlib4 Semantic Search Benchmark

To evaluate the effectiveness of various search engines for mathlib4, we establish a comprehensive benchmark. This benchmark comprises a curated set of queries, relevance judgments for theorem-query pairs, and a collection of performance metrics tailored to the MIR context. Our benchmark spans a broad range of mathematical topics, offering a robust foundation for the systematic assessment of search engines dedicated to mathlib4.
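
To make the evaluation concrete, the sketch below scores a retrieval run against such a benchmark using two standard measures, recall@k and mean reciprocal rank; the query, results, and judgments are hypothetical, and the paper's exact metric suite may differ.

```python
# Scoring a ranked retrieval run against binary relevance judgments.
from typing import Dict, List, Set

def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the judged-relevant theorems that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mean_reciprocal_rank(results: Dict[str, List[str]], judgments: Dict[str, Set[str]]) -> float:
    """Average over queries of 1/rank of the first relevant hit (0 if no hit)."""
    total = 0.0
    for query, ranked in results.items():
        relevant = judgments.get(query, set())
        rr = 0.0
        for rank, name in enumerate(ranked, start=1):
            if name in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / max(len(results), 1)

# Hypothetical example: one informal query, its ranked output, and its judged theorems.
results = {"sum of two odd numbers is even": ["Odd.add_odd", "Nat.add_comm"]}
judgments = {"sum of two odd numbers is even": {"Odd.add_odd"}}
print(recall_at_k(results["sum of two odd numbers is even"], judgments["sum of two odd numbers is even"], k=1))
print(mean_reciprocal_rank(results, judgments))
```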

Experimental Results

Comparative analyses of multiple embedding models reveal that the proposed search engine substantially outperforms existing methods, including Moogle and other baseline embedding models. The augmentation of queries and the incorporation of informal-formal theorem pairs into the search corpus are pivotal to achieving significant improvements in theorem retrieval. These enhancements are evident across various performance metrics, underscoring the utility of our approach in practical theorem search tasks within mathlib4.
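
The query-augmentation step credited above can be pictured as a single LLM call that rewrites the user's informal query with likely formal vocabulary before it is embedded. The sketch below is a hypothetical illustration: the `complete` callable and the prompt wording are our own placeholders, since the summary does not tie the technique to a specific LLM API.

```python
# Hypothetical query-augmentation sketch; `complete` is any user-supplied
# function that sends a prompt to an LLM and returns its text completion.
from typing import Callable

AUGMENT_PROMPT = (
    "You are helping search the Lean 4 mathematical library mathlib4.\n"
    "Rewrite the user's informal query so that it also mentions likely formal\n"
    "terminology, relevant definitions, and a plausible Lean-style statement.\n\n"
    "Query: {query}\n"
    "Augmented query:"
)

def augment_query(query: str, complete: Callable[[str], str]) -> str:
    """Enrich an informal query with formal context before it is embedded."""
    augmented = complete(AUGMENT_PROMPT.format(query=query))
    # Keep the original wording as well, so nothing the user said is lost.
    return f"{query}\n{augmented}"
```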

Future Directions

The development of the semantic search engine for mathlib4 marks a critical step toward improving access to formal mathematical libraries. Looking ahead, efforts will focus on refining the informalization process of formal statements, expanding the Mathlib4 semantic search benchmark, and exploring fine-tuning of text embedding models for enriched theorem retrieval. By addressing these aspects, we aim to further elevate the efficiency and accuracy of theorem search in mathlib4, thereby supporting a wider range of users in their engagement with formal mathematical proofs.

In conclusion, the semantic search engine presented in this paper not only alleviates the challenges faced by users in navigating mathlib4 but also sets a foundation for future advancements in the retrieval of mathematical theorems from formal libraries. Through continuous development and refinement, this engine has the potential to significantly contribute to the accessibility and usability of formal mathematical resources.