- The paper introduces Minerva, a PaLM-based language model fine-tuned on a specialized scientific dataset to excel in quantitative reasoning tasks.
- The model achieves state-of-the-art performance on benchmarks like MATH, GSM8k, and MMLU-STEM, notably improving accuracy with majority-voting inference.
- The authors emphasize techniques such as continued autoregressive pretraining on math-heavy technical content, few-shot chain-of-thought prompting, and majority-voting inference, paving the way for advanced educational and research applications.
Solving Quantitative Reasoning Problems with Language Models
The paper "Solving Quantitative Reasoning Problems with LLMs" explores the capability of LLMs to handle tasks necessitating quantitative reasoning, a domain where traditional models have faced significant challenges. The researchers introduce Minerva, a large-scale LLM fine-tuned to excel in solving undergraduate-level math, science, and engineering problems. The model is built on top of the PaLM architecture and is further trained on a specialized dataset enriched with scientific and mathematical content.
Key Contributions and Methodology
- Training Dataset: Minerva is trained on a 38.5-billion-token dataset drawn from arXiv papers and web pages containing mathematical content, mixed with general natural language data. Crucially, the processing pipeline preserves mathematical notation such as LaTeX rather than stripping it, so the model learns to read and write formal mathematical expressions (a minimal processing sketch follows this list).
- Model Architecture: Minerva builds on pretrained PaLM models at 8B, 62B, and 540B parameters. Pretraining continues on the technical dataset with the standard autoregressive next-token objective (sketched after this list), preparing the model to solve complex quantitative tasks without external computational tools.
- Evaluation: The model is tested across several benchmarks, including MATH, GSM8k, MMLU-STEM, and the newly introduced OCWCourses dataset of 272 undergraduate-level STEM problems drawn from MIT OpenCourseWare. Minerva achieves state-of-the-art performance on these benchmarks, substantially outperforming prior models.
- Inference Techniques: The researchers sample many candidate solutions per problem using few-shot chain-of-thought prompting at nonzero temperature, then select the most common final answer (majority voting, or maj1@k). This substantially improves accuracy over greedy decoding (see the sketch after this list).
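The paper emphasizes that typical web-corpus cleaning discards LaTeX and MathJax, whereas Minerva's pipeline keeps it. The authors' actual pipeline is not public, so the following is only a minimal sketch of the idea, retaining MathJax `math/tex` script blocks as inline LaTeX while stripping other markup; the function name and regexes are illustrative, not the paper's code:

```python
import re

def extract_text_keeping_math(html: str) -> str:
    """Strip HTML tags but keep MathJax LaTeX as inline $...$ spans."""
    # Replace MathJax script blocks with their LaTeX content wrapped in $...$.
    # A lambda replacement leaves backslashes in the LaTeX untouched.
    html = re.sub(
        r'<script[^>]*type="math/tex[^"]*"[^>]*>(.*?)</script>',
        lambda m: f" ${m.group(1)}$ ",
        html,
        flags=re.DOTALL,
    )
    # Drop all remaining tags, then collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

page = '<p>Euler: <script type="math/tex">e^{i\\pi} + 1 = 0</script></p>'
print(extract_text_keeping_math(page))  # Euler: $e^{i\pi} + 1 = 0$
```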
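The continued pretraining uses the same next-token prediction objective as the base model, only on the technical data mixture. As a reminder of what that objective computes, here is a minimal, self-contained PyTorch sketch of the shifted cross-entropy loss; the toy embedding-plus-head model is a stand-in, since Minerva itself is a decoder-only PaLM transformer:

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

# A toy batch of token ids, e.g. a tokenized LaTeX snippet.
tokens = torch.randint(0, vocab_size, (2, 16))  # (batch, seq_len)

# Autoregressive objective: predict token t+1 from tokens up to t.
hidden = embed(tokens)    # a real model applies causal self-attention here
logits = lm_head(hidden)  # (batch, seq_len, vocab)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 0..T-2
    tokens[:, 1:].reshape(-1),               # targets are the next tokens
)
loss.backward()
```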
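Majority voting (maj1@k) samples k chain-of-thought solutions, extracts each final answer, and returns the most frequent one. A minimal sketch, assuming a hypothetical `sample_solution(problem)` helper that wraps the model and solutions that end with a `Final Answer: ...` line (a marker in the spirit of the paper's prompt format):

```python
import re
from collections import Counter
from typing import Callable

def majority_vote(problem: str,
                  sample_solution: Callable[[str], str],
                  k: int = 64) -> str:
    """maj1@k: sample k solutions, return the most common final answer."""
    answers = []
    for _ in range(k):
        solution = sample_solution(problem)  # one temperature-sampled chain of thought
        match = re.search(r"Final Answer:\s*(.+)", solution)
        if match:
            answers.append(match.group(1).strip())
    if not answers:
        return ""
    return Counter(answers).most_common(1)[0][0]
```

In practice, answer normalization is where most of the effort goes: recognizing that `0.5` and `1/2` are the same answer (e.g., via symbolic equivalence checking with a library like SymPy) matters as much as the voting itself.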
Numerical Results
Minerva achieves impressive results; the largest model (540B parameters), evaluated with majority voting, reaches:
- MATH dataset: 50.3% accuracy
- OCWCourses: 30.8% accuracy
- GSM8k: 78.5% accuracy
- MMLU-STEM: 75.0% accuracy
These results demonstrate a robust ability to solve quantitative reasoning tasks across diverse scientific and mathematical domains.
Implications and Future Directions
The work extends the frontier of LLMs into quantitative reasoning, moving beyond natural language understanding to tasks that require detailed mathematical and logical reasoning. This advancement holds promise for applications in educational technologies, automated tutoring systems, and potentially supporting scientific research with computational mathematics.
The findings also spotlight areas for future exploration, including integrating external tools like calculators to further enhance performance, developing verification systems to ensure solution correctness, and extending the approach to more complex problem domains.
Conclusion
The paper presents a significant step forward in using LLMs for quantitative reasoning tasks. By training on targeted scientific content and employing advanced inference techniques, Minerva sets a new standard for performance on mathematical and scientific problems at the undergraduate level. This research opens new avenues for the application of AI in fields that require nuanced understanding and problem-solving capabilities.