Papers
Topics
Authors
Recent
2000 character limit reached

ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data (2502.05567v2)

Published 8 Feb 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Autoformalization, the automatic translation of mathematical content from natural language into machine-verifiable formal languages, has seen significant progress driven by advances in LLMs. Nonetheless, a primary barrier to further improvements is the limited availability of parallel corpora that map informal mathematical text to its formal counterpart. To address this limitation, we propose ATLAS (Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data), a novel data generation framework designed to produce large-scale, high-quality parallel corpora of theorem statements. Distinct from prior approaches, ATLAS begins with a concept repository, accelerates the improvement of student model through expert iteration combined with knowledge distillation, and introduces two novel augmentation strategies that exploit the structural characteristics of formal languages. With the proposed ATLAS running for 10 iterations, we construct an undergraduate-level dataset comprising 117k theorem statements and develop ATLAS Translator, which demonstrates statistically significant improvements over both the HERALD Translator and the Kimina-Autoformalizer across all benchmarks ($p<0.05$, two-sided t-test), achieving a new state of the art. The datasets, model, and code will be released to the public soon.

Summary

We haven't generated a summary for this paper yet.

Whiteboard

Paper to Video (Beta)

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.