AugGSM8K: Augmented Math Problem Dataset

Updated 6 April 2026
  • The paper shows that dual augmentation via query evolution and diversified reasoning paths systematically increases LLM mathematical accuracy on GSM8K benchmarks.
  • It employs five transformation strategies to generate approximately 200K augmented query-answer pairs, enhancing fine-tuning efficacy on open-source LLMs.
  • Empirical analysis reveals a log-linear relationship between GSM8K accuracy and the amount of augmented data across model sizes, though limited out-of-domain transfer highlights the need for broader dataset expansion.

AugGSM8K is a synthetic, augmented version of the GSM8K training set for grade-school mathematical word problems, systematically constructed to investigate the impact of query-level and reasoning-path-level data augmentation on LLM performance in mathematical reasoning. Developed in response to the observed gap in chain-of-thought abilities between open-source LLMs (notably the LLaMA family) and proprietary models such as GPT-3.5 and GPT-4, AugGSM8K targets the diversification and complication of problem statements and their solutions to enhance fine-tuning efficacy and enable detailed studies of scaling laws and generalization behavior (Li et al., 2023).

1. Dataset Construction and Objectives

The principal objective of AugGSM8K is to generate a large-scale, structurally varied corpus of grade-school level mathematics questions and answers by augmenting the original GSM8K set of 7,473 questions. The construction pipeline applies two augmentation axes:

  • Query Evolution: Each original question undergoes five human-inspired “Evol-Instruct” transformations, carried out with GPT-3.5-turbo-0613 or GPT-4-0613, that systematically increase the diversity and complexity of the questions.
  • Diverse Reasoning Pathways: For every augmented question, multiple chain-of-thought solutions are independently sampled from LLMs (GPT-4 or GPT-3.5) at high temperature (≥1.0), capturing varied solution strategies per instance.

This dual-augmentation approach is intended both to simulate a richer instructional range and to provide fine-tuning data that narrows the empirical gap between open and closed models on standard math reasoning benchmarks.
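The reasoning-path axis can be illustrated with a minimal sketch. It assumes responses end with GSM8K's `#### <answer>` convention, and `sampler` is a hypothetical wrapper around an LLM call (here replaced by a stand-in that cycles through canned outputs); none of these names come from the paper.

```python
import re
from itertools import cycle
from typing import Callable, List, Optional

# Assumption: responses end with GSM8K's "#### <answer>" convention.
ANSWER_PATTERN = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the numeric string after '####' (commas removed), or None."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).replace(",", "") if match else None


def sample_reasoning_paths(
    query: str,
    sampler: Callable[[str, float], str],  # hypothetical LLM wrapper
    n_samples: int = 4,
    temperature: float = 1.0,
) -> List[str]:
    """Draw n_samples responses; keep distinct ones that state a final answer."""
    kept: List[str] = []
    for _ in range(n_samples):
        response = sampler(query, temperature)
        if extract_final_answer(response) is None:
            continue  # no final answer stated
        if response not in kept:
            kept.append(response)
    return kept


# Stand-in sampler: cycles through canned outputs instead of calling a model.
_canned = cycle([
    "3 boxes of 6 eggs is 3 * 6 = 18 eggs. #### 18",
    "6 + 6 + 6 = 18 eggs in total. #### 18",
    "There are many eggs.",
    "3 boxes of 6 eggs is 3 * 6 = 18 eggs. #### 18",
])


def fake_sampler(query: str, temperature: float) -> str:
    return next(_canned)


paths = sample_reasoning_paths("How many eggs are in 3 boxes of 6?", fake_sampler)
```

With the canned outputs above, two distinct solution paths survive: the answerless response and the verbatim duplicate are dropped.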

2. Query Augmentation Strategies

Five specific transformation strategies were implemented for each question:

  1. Change of Specific Numbers: Altering numerical values to generate novel problem instances.
  2. Introduction of Fractions or Percentages: Replacing original quantities with fractional/percentage expressions.
  3. Combination of Mathematical Concepts: Merging concepts (e.g., arithmetic with geometry) within a single item.
  4. Addition of Conditional Statements: Embedding “if-then” clauses to increase logical depth.
  5. Increase in Logical Complexity: Appending sub-tasks or multi-part queries to the stem.

For each GSM8K item $q_i$, this yields five augmented forms $\{q_i^{(1)},\dots,q_i^{(5)}\}$. Each augmented query is passed through GPT-4 or GPT-3.5 multiple times to obtain distinct annotated solutions, promoting answer diversity and robustness (Li et al., 2023).
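As a rough illustration of how the five strategies could be turned into rewriting requests, the sketch below builds one prompt per strategy. The instruction wording, the `STRATEGIES` keys, and `build_evolution_prompts` are hypothetical and are not the prompts used by the authors.

```python
from typing import Dict, List

# Hypothetical instruction text for each of the five strategies listed above.
STRATEGIES: Dict[str, str] = {
    "change_numbers": "Change the specific numbers in the problem.",
    "fractions_percentages": "Introduce fractions or percentages.",
    "combine_concepts": "Combine the problem with another mathematical concept.",
    "conditional_statements": "Add a conditional (if-then) statement.",
    "increase_complexity": "Increase the logical complexity, e.g. add a sub-task.",
}


def build_evolution_prompts(question: str) -> List[str]:
    """Return one rewriting prompt per strategy for a single original question."""
    prompts = []
    for instruction in STRATEGIES.values():
        prompts.append(
            "Rewrite the following grade-school math problem.\n"
            f"Strategy: {instruction}\n"
            f"Original problem: {question}\n"
            "Rewritten problem:"
        )
    return prompts


prompts = build_evolution_prompts("Tom has 3 apples and buys 4 more. How many now?")
```

Each prompt would then be sent to the rewriting model, giving the five augmented forms of one question.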

3. Construction Pipeline and Dataset Statistics

The augmentation protocol begins with the tabularized GSM8K set $\mathcal{D}=\{(q_i,a_i)\}_{i=1}^{7473}$:

  • Query Generation: For each $q_i$, five augmented queries are produced, comprising three base subsets ($\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3$), each of 37,365 items ($5 \times 7{,}473$), with $\mathcal{D}_1$ generated in five distinct variants.
  • Response Sampling: Each query is annotated with multiple chain-of-thought responses, varying temperature, shot-setting, and model.
  • Filtering: Incoherent outputs (lacking final answers, excessive length, or formatting issues) are removed, standardizing each subset to ≃30,000 valid query-answer pairs.
  • Downsampling: Larger subsets (e.g., 35,000 items) are truncated to 30,000 for dataset balance.
  • Union: AugGSM8K is finalized as

$$\mathrm{AugGSM8K} = \mathcal{D} \;\cup\; \bigcup_{j=1}^{3} \bigcup_{k=1}^{n_j} \mathcal{D}_j^k$$

where $n_1=8$ and $n_2=n_3=1$, resulting in approximately 200,000 augmented pairs plus the GSM8K core.

| Subset | # Original | # Augmented Pairs |
| --- | --- | --- |
| $\mathcal{D}$ | 7,473 | |
| AugGSM8K Total | 7,473 | ≈200,000 |
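The filtering, downsampling, and union steps can be sketched as follows. The validity checks (`marker`, `max_chars`) and the helper names are assumptions chosen for illustration; the source only states that outputs lacking final answers, with excessive length, or with formatting issues are removed and that larger subsets are truncated to 30,000.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (query, chain-of-thought response)


def is_valid(response: str, max_chars: int = 2000, marker: str = "####") -> bool:
    """Assumed filter: a final-answer marker is present and the text is not too long."""
    return marker in response and len(response) <= max_chars


def build_subset(pairs: List[Pair], cap: int = 30_000) -> List[Pair]:
    """Drop invalid responses, then truncate the subset to at most `cap` pairs."""
    kept = [pair for pair in pairs if is_valid(pair[1])]
    return kept[:cap]


def union(core: List[Pair], subsets: List[List[Pair]]) -> List[Pair]:
    """Concatenate the GSM8K core with every augmented subset."""
    merged = list(core)
    for subset in subsets:
        merged.extend(subset)
    return merged


# Toy data standing in for one augmented subset.
raw = [
    ("q1", "2 + 2 = 4. #### 4"),
    ("q2", "I am not sure."),  # no final answer: filtered out
    ("q3", "3 * 3 = 9. #### 9"),
    ("q4", "10 - 1 = 9. #### 9"),
    ("q5", "5 + 5 = 10. #### 10"),
]
subset = build_subset(raw, cap=3)
dataset = union([("q0", "1 + 1 = 2. #### 2")], [subset])
```

Truncation keeps the first `cap` valid pairs; sampling at random instead would be an equally reasonable reading of "downsampling".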

4. Scaling Laws and Performance

Fine-tuning LLaMA-7B, LLaMA-2-7B, and LLaMA-2-13B on incremental portions of AugGSM8K demonstrates a log-linear scaling law:

$$\mathrm{ACC} = a \log N + b$$

where $\mathrm{ACC}$ is GSM8K test accuracy and $N$ the number of queries (in thousands), with the coefficients $a$ and $b$ fitted empirically for each model.

Here, $a$ quantifies accuracy sensitivity to data scale, and $b$ corresponds to the baseline. Under this form, doubling $N$ adds a constant $a \log 2$ to accuracy regardless of the starting size, yielding gains of approximately +7.4% (LLaMA-7B), +6.8% (LLaMA-2-7B), and +5.3% (LLaMA-2-13B). These findings describe a predictable performance improvement regime for LLMs under systematic data augmentation, emphasizing the utility of combinatorially expanded instruction-tuning data (Li et al., 2023).
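To make the log-linear relationship concrete, the sketch below fits accuracy = a·log2(N) + b by ordinary least squares. Using base-2 logarithms is an assumption made so that the slope equals the gain per doubling. The data points are synthetic, generated with a slope of 7.4 to mirror the doubling gain quoted for the 7B model; they are not measurements from the paper.

```python
import math
from typing import List, Tuple


def fit_log_linear(n_values: List[float], acc_values: List[float]) -> Tuple[float, float]:
    """Least-squares fit of acc = a * log2(n) + b; returns (a, b)."""
    xs = [math.log2(n) for n in n_values]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(acc_values) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, acc_values))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(n: float, a: float, b: float) -> float:
    """Accuracy predicted by the fitted law at data size n."""
    return a * math.log2(n) + b


# SYNTHETIC illustration only: each size doubles and accuracy rises by 7.4.
sizes_k = [7.5, 15.0, 30.0, 60.0, 120.0]       # queries, in thousands (hypothetical)
accs = [30.0 + 7.4 * k for k in range(5)]      # made-up accuracies, not paper results

a, b = fit_log_linear(sizes_k, accs)
doubling_gain = predict(60.0, a, b) - predict(30.0, a, b)  # equals a under log2
```

Because the law is linear in log N, the gain from doubling is the same at every data size, which is why a single number per model summarizes the trend.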

5. Generalization and Out-of-Domain Transfer

Models tuned solely on AugGSM8K subsets manifest significant in-domain accuracy but do not generalize robustly to out-of-domain mathematical benchmarks such as MATH. Evaluation on 500 held-out MATH items shows transfer accuracy below 10% for all model sizes, as summarized:

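| Setting | LLaMA-7B | LLaMA-2-7B | LLaMA-2-13B |
| --- | --- | --- | --- |
| In-Context Learning | 2.9% | 2.5% | 3.9% |
| SFT on MATH only | 4.8% | 5.8% | 6.0% |
| Multi-task (AugGSM8K + MATH) | 4.6% | 6.2% | 7.6% |
| Transfer (AugGSM8K→MATH) | 5.6% | 8.4% | 9.4% |
| Full MuggleMath→MATH | 5.6% | 6.0% | 9.0% |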

t-SNE analysis of embedding spaces (LLaMA-2-7B) indicates that GSM8K and AugGSM8K queries cluster in a common subspace, while MATH dataset questions are distributed in a distinct region, providing an explanation for transfer limitations. A plausible implication is that query augmentation confined to a single benchmark lacks the breadth required for out-of-domain reasoning generalization.

6. Impact on Fine-Tuning and Benchmark Performance

LLMs fine-tuned on AugGSM8K under the MuggleMath regime attain substantial accuracy improvements on GSM8K, achieving state-of-the-art among open-source models. Comparing several approaches:

| Model | SFT (orig.) | RFT† | WizardMath‡ | MuggleMath |
| --- | --- | --- | --- | --- |
| LLaMA-7B | 35.9% | 49.1% | n/a | 65.4% |
| LLaMA-2-7B | 41.6% | 51.2% | 54.9% | 68.4% |
| LLaMA-2-13B | 50.0% | 55.3% | 63.9% | 74.0% |
| LLaMA-2-70B | 63.2% | 64.8% | 81.6% | 82.3% |

On LLaMA-7B, MuggleMath gains 29.5 percentage points over baseline SFT (65.4% vs. 35.9%) and 16.3 points over RFT (49.1%); on LLaMA-2-7B it exceeds WizardMath by 13.5 points (68.4% vs. 54.9%). MuggleMath-13B achieves 74.0% accuracy. Results are robust, with standard error ±0.3 percentage points across three training epochs.
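The percentage-point gaps can be recomputed directly from the table values. The sketch below assumes the empty LLaMA-7B cell belongs to the WizardMath column; the `gain` helper is illustrative.

```python
from typing import Dict, Optional

# GSM8K accuracy (%) copied from the table; None marks the cell with no reported value
# (assumed here to be WizardMath on LLaMA-7B).
gsm8k_accuracy: Dict[str, Dict[str, Optional[float]]] = {
    "LLaMA-7B": {"SFT": 35.9, "RFT": 49.1, "WizardMath": None, "MuggleMath": 65.4},
    "LLaMA-2-7B": {"SFT": 41.6, "RFT": 51.2, "WizardMath": 54.9, "MuggleMath": 68.4},
    "LLaMA-2-13B": {"SFT": 50.0, "RFT": 55.3, "WizardMath": 63.9, "MuggleMath": 74.0},
    "LLaMA-2-70B": {"SFT": 63.2, "RFT": 64.8, "WizardMath": 81.6, "MuggleMath": 82.3},
}


def gain(model: str, baseline: str, method: str = "MuggleMath") -> Optional[float]:
    """Percentage-point difference between `method` and `baseline` for one model."""
    base = gsm8k_accuracy[model][baseline]
    top = gsm8k_accuracy[model][method]
    if base is None or top is None:
        return None  # no comparison possible for a missing cell
    return round(top - base, 1)


sft_gain_7b = gain("LLaMA-7B", "SFT")
rft_gain_7b = gain("LLaMA-7B", "RFT")
wizard_gain_2_7b = gain("LLaMA-2-7B", "WizardMath")
```

Rounding to one decimal place matches the precision of the table and avoids floating-point noise in the differences.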

7. Limitations and Research Implications

AugGSM8K establishes that systematic and varied data augmentation focusing on query “evolution” and solution diversity can drive large, attributable gains in LLM mathematical reasoning on the original task family. However, the negligible out-of-domain transfer indicates that augmentation should span broader problem distributions to enhance generalization. A plausible implication is that future dataset construction should combine multiple mathematical benchmarks or modify pre-training objectives to bridge distinct problem domains (Li et al., 2023).


† RFT = Rejection-Sampling Fine-Tuning; ‡ WizardMath = Reinforced Evol-Instruct with PPO.
