A Semantic Search Engine for Mathlib4 (2403.13310v2)
Abstract: The interactive theorem prover Lean enables the verification of formal mathematical proofs and is backed by a growing community. Central to this ecosystem is its mathematical library, mathlib4, which lays the groundwork for formalizing an expanding range of mathematical theories. However, searching for theorems in mathlib4 can be challenging: to search successfully, users often need to be familiar with its naming conventions or documentation strings. A semantic search engine that is easy to use regardless of a user's familiarity with mathlib4 is therefore needed. In this paper, we present a semantic search engine (https://leansearch.net/) for mathlib4 that accepts informal queries and finds the relevant theorems. We also establish a benchmark for assessing the performance of various search engines for mathlib4.
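To make the naming-convention hurdle concrete, the following Lean sketch pairs two informal descriptions with the mathlib4 names a user would otherwise need to know. The informal phrasings in the comments are illustrative assumptions, not queries taken from the paper's benchmark, and the snippet assumes a project with mathlib4 available.

```lean
import Mathlib

-- Informal description (hypothetical): "adding two natural numbers in either
-- order gives the same result".
-- Finding this by name requires knowing the `add_comm` convention.
example (n m : ℕ) : n + m = m + n := Nat.add_comm n m

-- Informal description (hypothetical): "a product of real numbers is zero
-- exactly when one of the factors is zero".
-- The library name `mul_eq_zero` encodes the shape of the statement, not its meaning.
example (a b : ℝ) : a * b = 0 ↔ a = 0 ∨ b = 0 := mul_eq_zero
```

A semantic search engine aims to map descriptions like those in the comments to the corresponding declarations without the user having to guess the identifier.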