Insights into the Efficacy of Data Augmentation for Mathematical Reasoning in LLMs
The paper "Query and response augmentation cannot help out-of-domain math reasoning generalization" offers a comprehensive exploration of data augmentation methodologies tailored to enhance the proficiency of LLMs in mathematical reasoning tasks. Authors Chengpeng Li et al. provide a meticulous analysis through the lens of fine-tuning LLMs, proposing a new dataset, AugGSM8K, formulated by enhancing the diversity and complexity of queries and responses on the existing GSM8K dataset.
Augmentation Strategies and Performance Metrics
The paper explores two complementary augmentation strategies for mathematical reasoning: query evolution and response augmentation. Queries are evolved by altering numerical values, introducing fractions, combining multiple concepts, inserting conditional statements, and increasing problem complexity. These strategies were used to build AugGSM8K by prompting proprietary models (GPT-3.5 and GPT-4) to generate diverse, more complex queries and multiple reasoning paths for each query, as sketched below.
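To make this pipeline concrete, the following is a minimal sketch of how such query evolution and response sampling could be driven through an LLM API. The prompt wording, the helper names (`augment_query`, `augment_responses`), and the use of the OpenAI client are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of AugGSM8K-style query evolution and response augmentation.
# Prompts and helper names are illustrative assumptions, not the paper's code.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The five evolution directions described in the paper.
EVOLUTION_OPS = [
    "change the numerical values in the problem",
    "introduce fractions",
    "combine it with another mathematical concept",
    "add a conditional statement",
    "increase the overall complexity of the problem",
]

def augment_query(seed_problem: str, model: str = "gpt-4") -> str:
    """Ask the LLM to rewrite a GSM8K problem along one random evolution direction."""
    op = random.choice(EVOLUTION_OPS)
    prompt = (
        "Rewrite the following grade-school math problem so that it remains "
        f"solvable but you {op}. Return only the new problem.\n\n{seed_problem}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()

def augment_responses(problem: str, n_paths: int = 4, model: str = "gpt-4") -> list[str]:
    """Sample several reasoning paths for one (possibly augmented) query."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Solve step by step:\n{problem}"}],
        temperature=0.9,
        n=n_paths,
    )
    return [choice.message.content for choice in resp.choices]
```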
The efficacy of these strategies is assessed by fine-tuning open-source LLMs, including LLaMA and LLaMA-2 at several scales (7B, 13B, and 70B parameters). The key performance metric is accuracy on the original GSM8K dataset, and the results show a substantial improvement, with AugMath-7B reaching 82.3% accuracy and notably surpassing prior fine-tuned baselines.
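GSM8K accuracy here reduces to exact-match on the final numeric answer. The sketch below shows one common way such scoring is implemented; the last-number extraction heuristic and the helper names are assumptions rather than code from the paper, though GSM8K references do end with the `#### <answer>` convention.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in a solution string (a common GSM8K scoring
    heuristic, assumed here rather than taken from the paper)."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def gsm8k_accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy on final numeric answers.
    References follow GSM8K's '#### <answer>' convention."""
    correct = 0
    for pred, ref in zip(predictions, references):
        gold = ref.split("####")[-1].strip().replace(",", "")
        guess = extract_final_number(pred)
        correct += int(guess is not None and float(guess) == float(gold))
    return correct / len(predictions)
```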
Findings and Scaling Relationships
A critical finding is the log-linear relationship between the amount of augmented data and model performance: accuracy improves roughly linearly with the logarithm of the augmented-data size, so additional augmentation continues to lift in-domain performance but with diminishing returns as the augmented set grows. Interestingly, the paper argues that this augmented data is comparably effective to human-generated data at boosting in-domain performance.
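In other words, accuracy grows roughly as a + b·log(N) in the number of augmented samples N, so each doubling of the data buys about the same absolute gain. The sketch below fits such a curve with NumPy; the data points are illustrative placeholders, not results from the paper.

```python
import numpy as np

# Illustrative placeholder points (augmented-sample count, GSM8K accuracy);
# these are NOT numbers from the paper, only a demonstration of the fit.
n_samples = np.array([7_500, 15_000, 30_000, 60_000, 120_000])
accuracy = np.array([0.50, 0.55, 0.60, 0.65, 0.70])

# Fit accuracy = a + b * ln(N): linear regression on the log of the data size.
b, a = np.polyfit(np.log(n_samples), accuracy, deg=1)
print(f"accuracy ≈ {a:.3f} + {b:.3f} * ln(N)")

# Each doubling of N adds roughly a constant b * ln(2) to accuracy, which is
# why gains diminish when plotted against the raw data size.
print(f"gain per doubling ≈ {b * np.log(2):.3f}")
```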
Furthermore, the authors identify an important limiting factor: while the augmentation strategies substantially improve in-domain performance on GSM8K, they do little to aid generalization to out-of-domain tasks represented by the MATH dataset. This is attributed to the non-overlapping distributions of problem types and complexities between the two datasets, as revealed through t-SNE visualizations of problem embeddings. The investigation shows that gains from augmentation in one specific domain do not straightforwardly translate into broader mathematical reasoning ability.
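An analysis of this kind can be reproduced in outline by embedding problems from both datasets and projecting them with t-SNE, as sketched below; the choice of sentence encoder and the t-SNE settings are assumptions for illustration, not necessarily those used in the paper.

```python
# Hedged sketch of the kind of t-SNE comparison described above.
# The sentence-transformers encoder and t-SNE settings are assumptions.
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

def plot_problem_distributions(gsm8k_problems: list[str], math_problems: list[str]) -> None:
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(gsm8k_problems + math_problems)

    # Project the high-dimensional embeddings to 2-D for visual comparison.
    points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

    k = len(gsm8k_problems)
    plt.scatter(points[:k, 0], points[:k, 1], s=5, label="GSM8K / AugGSM8K")
    plt.scatter(points[k:, 0], points[k:, 1], s=5, label="MATH")
    plt.legend()
    plt.title("t-SNE of problem embeddings (illustrative)")
    plt.show()
```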
Implications and Speculations
The implications of these findings are twofold. Practically, the research underscores the potential of refined augmentation strategies to enhance model performance within a specific domain. Theoretically, however, it raises fundamental questions about how to develop augmentation approaches that bridge cross-domain discrepancies in mathematical reasoning tasks.
The paper speculates on the need to potentially augment across a broader array of datasets to foster cross-domain generalization or, alternatively, advance the pre-training processes themselves to cultivate inherent reasoning capabilities in LLMs. As the boundary of in-domain augmentation efficacy becomes clear, further exploration is warranted into methodologies that can harmonize the benefits observed with specific benchmarks across the spectrum of mathematical reasoning challenges.
In summary, while the paper presents robust evidence that targeted augmentation improves LLM performance, it also serves as a sobering reminder of the persistent challenges in achieving true generalization across diverse mathematical domains, and it points future research toward more adaptive and comprehensive pre-training and augmentation paradigms.