This paper introduces DeepSeekMath 7B, an LLM designed to push the limits of mathematical reasoning. The model is initialized from DeepSeek-Coder-Base-v1.5 7B and further pre-trained on 120B math-related tokens sourced from Common Crawl, together with natural language and code data. A key contribution is the introduction of Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning while reducing the memory usage of RL training.
The paper highlights the following achievements:
- DeepSeekMath 7B achieves 51.7\% accuracy on the MATH benchmark without external toolkits or voting, approaching the performance of Gemini-Ultra and GPT-4.
- Self-consistency over 64 samples from DeepSeekMath 7B reaches 60.9\% on MATH.
- The model demonstrates proficiency in other mathematical reasoning benchmarks such as GSM8K, SAT, C-Math, and Gaokao.
The authors attribute the model's mathematical reasoning capability to two factors:
- The use of publicly available web data through a data selection pipeline.
- The introduction of GRPO.
The paper argues that code training improves a model's ability to solve mathematical problems both with and without tool use. It also argues that, although training on arXiv papers is common practice, it brought no notable improvement on any of the mathematical benchmarks adopted in the paper.
The paper also details the GRPO algorithm, a reinforcement learning (RL) variant of PPO that reduces training resources by forgoing the critic model and instead estimating the baseline from group scores. The authors provide a unified paradigm for understanding methods such as Rejection Sampling Fine-Tuning (RFT), Direct Preference Optimization (DPO), PPO, and GRPO, conceptualizing them as direct or simplified RL techniques.
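To make the group-relative baseline concrete, here is a minimal sketch in PyTorch: it normalizes the rewards of a group of sampled answers to the same question into advantages and plugs them into a simplified PPO-style clipped surrogate, with no learned value model. The tensor shapes, hyperparameters, and the omission of the KL regularization toward a reference policy are simplifications for illustration, not the paper's reference implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each reward against the mean/std
    of the G rewards sampled for the same question, so no critic is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     rewards: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Simplified clipped surrogate over G sampled outputs of one question.
    logp_new / logp_old: summed log-probs of each output under the current
    and old policies (shape [G])."""
    adv = grpo_advantages(rewards)                         # [G]
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()           # maximize the surrogate

# Toy usage: one question, G = 4 sampled answers scored 0/1 by a reward model.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
logp_old = torch.tensor([-12.3, -15.1, -14.8, -11.9])
logp_new = logp_old + 0.05 * torch.randn(4)
loss = grpo_policy_loss(logp_new, logp_old, rewards)
```

Because the baseline comes from statistics of the sampled group rather than from a separate value network, the memory and compute of training a critic are avoided, which is the efficiency gain the paper emphasizes.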
The paper also includes:
- A discussion on why RL boosts the performance of instruction-tuned models.
- Potential directions to achieve more effective RL based on the unified paradigm.
The key contributions of the work are:
- Scalable math pre-training, including the DeepSeekMath Corpus, a high-quality dataset of 120B tokens extracted from Common Crawl.
- The pre-trained base model DeepSeekMath-Base 7B achieves performance comparable to Minerva 540B.
- The introduction of GRPO, an efficient and effective RL algorithm that enhances performance while reducing training resources.
- A unified paradigm for understanding different methods like RFT, DPO, and PPO.
The authors evaluated their models on a range of English and Chinese mathematical reasoning benchmarks, including GSM8K, MATH, SAT, C-Math, and Gaokao. They also assessed the models' ability to generate self-contained text solutions, solve problems using Python, and conduct formal theorem proving using miniF2F with Isabelle.
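For the program-aided (tool-use) setting, a common way to grade a model is to execute the Python program it writes and compare the result against the reference answer. The harness below is a rough sketch of such a check, assuming the generated program stores its result in a variable named `answer`; the variable name, timeout, and string-matching rule are illustrative assumptions, not the authors' evaluation code.

```python
import multiprocessing

def _run(code: str, queue: multiprocessing.Queue) -> None:
    """Execute untrusted generated code in a child process and report `answer`."""
    scope: dict = {}
    try:
        exec(code, scope)
        queue.put(scope.get("answer"))
    except Exception:
        queue.put(None)

def score_program(code: str, reference: str, timeout: float = 5.0) -> bool:
    """True if the executed program produces the reference answer in time."""
    queue: multiprocessing.Queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run, args=(code, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():                      # kill runaway programs
        proc.terminate()
        return False
    result = queue.get() if not queue.empty() else None
    return result is not None and str(result).strip() == str(reference).strip()

if __name__ == "__main__":
    # Toy usage with a GSM8K-style problem whose reference answer is "72".
    generated = "answer = 48 + 48 // 2"
    print(score_program(generated, "72"))    # True
```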
The data collection process for the DeepSeekMath Corpus involved an iterative pipeline, starting with OpenWebMath as a seed corpus. A fastText model was trained to identify more mathematical web pages from Common Crawl. The collected pages were ranked, and only the top-ranking ones were preserved. The process was repeated iteratively, with human annotation used to refine the seed corpus and improve the performance of the fastText model. Benchmark contamination was avoided by filtering out web pages containing questions or answers from English and Chinese mathematical benchmarks.
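As a concrete illustration of the selection step, the sketch below trains a small fastText classifier on seed data in fastText's `__label__` format and uses it to score candidate Common Crawl pages; the file contents, hyperparameters, and score threshold are illustrative assumptions rather than the pipeline's actual configuration.

```python
import fasttext

# Tiny illustrative seed file; in practice the positives would be OpenWebMath-style
# pages and the negatives would be ordinary Common Crawl pages.
with open("seed_train.txt", "w") as f:
    f.write("__label__math Let x be a real number such that x^2 - 5x + 6 = 0.\n")
    f.write("__label__math We prove the lemma by induction on n.\n")
    f.write("__label__other Top ten travel destinations for the summer season.\n")
    f.write("__label__other The recipe calls for two cups of flour and one egg.\n")

model = fasttext.train_supervised(input="seed_train.txt", lr=0.1, epoch=5, wordNgrams=2)

def math_score(page_text: str) -> float:
    """Probability that a page is mathematical, used to rank candidate pages."""
    labels, probs = model.predict(page_text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__math", 0.0)

# Rank candidates and keep only the top-scoring pages (threshold is illustrative).
candidates = ["Solve for y: 3y + 4 = 19.", "Celebrity news and gossip roundup."]
kept = [page for page in candidates if math_score(page) > 0.5]
```

In the paper's iterative version of this step, pages kept in one round are used, with human annotation, to enlarge the seed corpus, and the classifier is retrained before the next round.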
The authors compared the DeepSeekMath Corpus with other math-training corpora, including MathPile, OpenWebMath, and Proof-Pile-2. They found that the DeepSeekMath Corpus was of high quality, covered multilingual mathematical content, and was the largest in size. The evaluation results showed that the model trained on the DeepSeekMath Corpus had a clear performance lead.
The training data for the RL phase consisted of chain-of-thought-format questions related to GSM8K and MATH drawn from the supervised fine-tuning (SFT) data. The reward model was trained on top of DeepSeekMath-Base 7B. GRPO was used to update the policy model, and DeepSeekMath-RL 7B was evaluated on various benchmarks. The results showed that DeepSeekMath-RL 7B achieved significant performance gains, surpassing all open-source models and the majority of closed-source models.
The authors investigated the effect of code training on mathematical reasoning by experimenting with two-stage and one-stage training settings. They found that code training improved program-aided mathematical reasoning under both settings. Code training also improved mathematical reasoning without tool use, but combining code tokens and math tokens for one-stage training compromised performance. They also found that arXiv papers were ineffective in improving mathematical reasoning in their experiments.
The authors use the unified paradigm to analyze different training methods, such as SFT, RFT, DPO, PPO, and GRPO, and conduct experiments to explore its key factors. They divide the data source into online and offline sampling and find that online RFT significantly outperforms its offline counterpart. They also highlight the effectiveness of GRPO's use of both positive and negative gradient coefficients, which reinforce correct responses and penalize incorrect ones. Furthermore, they explore iterative RL and find that it significantly improves performance.
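As a rough sketch of this unified view (the notation paraphrases the paper's formulation and is not an exact reproduction), each method can be written as following a gradient of the form

$$
\nabla_\theta \mathcal{J}_{\mathcal{A}}(\theta) \;=\; \mathbb{E}_{(q,o)\sim\mathcal{D}}\!\left[\frac{1}{|o|}\sum_{t=1}^{|o|} GC_{\mathcal{A}}(q,o,t,\pi_{\mathrm{rf}})\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})\right],
$$

where the data source $\mathcal{D}$ (online or offline sampling), the gradient coefficient $GC_{\mathcal{A}}$, and the reward function $\pi_{\mathrm{rf}}$ are the axes along which SFT, RFT, DPO, PPO, and GRPO differ.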
The authors evaluate the Pass@K and Maj@K accuracy of the Instruct and RL models and find that RL enhances Maj@K's performance but not Pass@K. They suggest that RL enhances the model's overall performance by rendering the output distribution more robust. They also discuss how to achieve more effective RL, focusing on data source, algorithms, and reward function.
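To make the two metrics concrete, the sketch below computes Pass@K (the question counts as solved if any of the K samples is correct) and Maj@K (the majority-voted final answer must be correct) for a single question; the simplistic string-based answer matching is an assumption, not the paper's grading code.

```python
from collections import Counter

def pass_at_k(sampled_answers: list[str], reference: str) -> bool:
    """Pass@K: at least one of the K sampled final answers is correct."""
    return any(a.strip() == reference.strip() for a in sampled_answers)

def maj_at_k(sampled_answers: list[str], reference: str) -> bool:
    """Maj@K (self-consistency): vote over the K final answers and check
    whether the most common answer is the correct one."""
    votes = Counter(a.strip() for a in sampled_answers)
    majority_answer, _ = votes.most_common(1)[0]
    return majority_answer == reference.strip()

# Toy usage: K = 4 sampled final answers for a question whose reference is "42".
samples = ["42", "41", "42", "7"]
print(pass_at_k(samples, "42"), maj_at_k(samples, "42"))   # True True
```

The observation that RL improves Maj@K but not Pass@K is consistent with the sampled answers becoming more concentrated on solutions the base policy could already reach, rather than new problems becoming solvable.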