- The paper demonstrates that pretraining on test tasks inflates performance: newer models show 7-point MMLU and 17-point GSM8K advantages over older ones before adjustment.

- It proposes an adjustment method: fine-tune every model under comparison on the same, sufficient amount of task-specific data to level the playing field, and validates this method in a controlled experiment.

- After this fine-tuning, emergent capabilities become predictable and appear at lower compute scales, challenging prior explanations of emergence in LLM evaluation.
 
      This paper investigates a phenomenon termed "training on the test task," where knowledge about evaluation tasks is incorporated into the pretraining stage of LLMs. This practice, distinct from malpractice like test set contamination, significantly confounds model evaluations and the study of emergent capabilities.
The authors demonstrate that newer LLMs often outperform older ones, even with similar pretraining compute, and argue this is largely due to increased training on the test task. To address this, they propose an adjustment method: fine-tuning all models under comparison on the same, sufficient amount of task-specific data before evaluation.
Key Contributions and Findings:
- Identifying the Confounding Factor:
-   The paper analyzes 53 base LLMs (70M to 70B parameters) on MMLU and GSM8K benchmarks.
 
-   Models trained after November 2023 outperform older models by an average of 7 percentage points on MMLU and 17 on GSM8K, controlling for pretraining compute. This coincides with a trend of newer models (e.g., Qwen 1.5, Olmo 1.7, MAP Neo, StableLM 2, Gemma) incorporating task-relevant data or strategies during pretraining.
 
 
- Proposed Adjustment Method:
-   The core proposal is to fine-tune all models on identical, task-specific datasets before evaluation to level the playing field.
-   For MMLU (multiple-choice QA), the HuggingFace MMLU auxiliary training set (100k examples, 30M tokens) is used.
 
-   For GSM8K (math reasoning), a combination of MetaMathQA and Orca-Math datasets (600k examples, 200M tokens) is used.
 
 
-   After this adjustment, the performance gap between newer and older models vanishes, suggesting the initial disparity was indeed due to differential training on the test task. Older models benefit significantly more from this fine-tuning.
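
A minimal sketch of this adjustment for MMLU, using Hugging Face `transformers` and `datasets`. The dataset id/config, the column names ("question", "choices", "answer"), the model id, and the hyperparameters are illustrative assumptions rather than the paper's exact recipe:

```python
# Sketch: fine-tune a base model on the MMLU auxiliary training set before evaluation.
# Dataset config and column names are assumptions; verify against the actual schema.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "EleutherAI/pythia-1b"          # any base model under comparison
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# MMLU auxiliary training set (~100k multiple-choice examples).
train = load_dataset("cais/mmlu", "auxiliary_train", split="train")

def to_text(ex):
    # Render each example as an MMLU-style prompt followed by its answer letter.
    options = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", ex["choices"]))
    return {"text": f"{ex['question']}\n{options}\nAnswer: {'ABCD'[ex['answer']]}"}

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = (train.map(to_text)
                  .map(tokenize, batched=True,
                       remove_columns=train.column_names + ["text"]))

args = TrainingArguments(output_dir="mmlu-adjusted", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()
```

The same procedure would be repeated for every model under comparison before running the benchmark, so that all models receive identical task-specific exposure.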
 
 
- Validation of the Adjustment Method:
-   In a controlled experiment, older models (trained before Nov 2023) were split into control and treatment groups. The treatment group was fine-tuned on task-relevant data.
 
-   This recreated the performance gap seen between newer and older models.
 
-   Subsequently applying the proposed adjustment (fine-tuning both groups) eliminated this artificially created advantage, validating the method's soundness.
 
 
- Distinction from Data Contamination:
-   The paper examines ARC and HellaSwag benchmarks. Initially, with standard "cloze" evaluations, no significant performance difference between newer and older models is observed.
 
-   However, when these tasks are reformulated into MMLU-style multiple-choice questions, newer models again show superior performance, similar to MMLU.
 
-   This suggests newer models are better at the MMLU-style prompt format, rather than merely memorizing test data. The proposed fine-tuning adjustment (using MMLU auxiliary data) resolves this disparity.
 
-   Evaluating MMLU with cloze prompts also reduces the gap between newer and older models, indicating standard MMLU evaluation conflates knowledge with multiple-choice answering proficiency.
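
To make the format distinction concrete, here is a small sketch (the example question is invented for illustration) of how the same item is presented under cloze-style versus MMLU-style multiple-choice scoring:

```python
question = "Which gas do plants primarily absorb for photosynthesis?"
choices = ["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"]

# Cloze-style: score the log-likelihood of each full answer continuation
# and pick the highest; no answer letters ever appear in the prompt.
cloze_candidates = [f"Question: {question}\nAnswer: {c}" for c in choices]

# MMLU-style: show lettered options and score only the letter tokens,
# so the model must also cope with the multiple-choice format itself.
options = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", choices))
mc_prompt = f"{question}\n{options}\nAnswer:"
mc_candidates = [" A", " B", " C", " D"]
```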
 
 
- Implications for Model Comparisons:
-   Model Families: Comparisons between families like Pythia, Llama 2, and Qwen 1.5 are skewed. Qwen 1.5, which explicitly includes instruction data in pretraining, appears superior initially. After adjustment, all three families show similar scaling trends relative to their pretraining compute. This raises the question of whether "higher quality" pretraining data appears superior mainly because it contains more task-relevant data.
 
-   Progress Measurement: Training on the test task overestimates progress. The Pareto frontier of performance vs. compute shows substantial improvement for newer models; after adjustment, the area of improvement shrinks roughly sixfold (see the frontier sketch below).
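
A rough sketch of one way to compute such a frontier comparison; the paper's exact area calculation may differ, and the helper functions below are illustrative:

```python
import numpy as np

def pareto_frontier(compute, accuracy):
    """Best accuracy achieved at or below each compute budget."""
    order = np.argsort(compute)
    c, a = np.asarray(compute)[order], np.asarray(accuracy)[order]
    return c, np.maximum.accumulate(a)

def improvement_area(c_old, a_old, c_new, a_new):
    """Area between the new and old frontiers over a shared log-compute grid."""
    grid = np.linspace(np.log10(min(c_old.min(), c_new.min())),
                       np.log10(max(c_old.max(), c_new.max())), 200)
    f_old = np.interp(grid, np.log10(c_old), a_old)
    f_new = np.interp(grid, np.log10(c_new), a_new)
    return np.trapz(np.clip(f_new - f_old, 0, None), grid)

# usage: c_o, a_o = pareto_frontier(old_compute, old_acc); likewise for new models,
# then improvement_area(c_o, a_o, c_n, a_n) gives the area between the frontiers.
```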
 
 
- Implications for Emergence:
-   Emergent capabilities (sharp performance increases at larger scales) are observed in MMLU and GSM8K.
 
-   As models are increasingly fine-tuned on task-relevant data, the "point of emergence" (c<sub>e</sub>, the compute threshold for non-random performance) shifts to significantly lower scales. For MMLU, c<sub>e</sub> shifts from ~10<sup>22</sup> FLOPs (Pythia 6.9B scale) to ~6×10<sup>20</sup> FLOPs (Pythia 410M scale) after sufficient task-specific fine-tuning.
 
-   Log-linear scaling (accuracy vs. log-compute) fits become much stronger (R<sup>2</sup> improves, e.g., from 0.63 to 0.95 for MMLU) after this fine-tuning.
 
-   This suggests that emergence, in these cases, can be made predictable and capabilities visible at smaller scales by training on the test task. This effect persists even when using continuous metrics like Brier score, challenging some prior explanations of emergence.
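
For reference, the Brier score is a continuous, probability-based alternative to accuracy; a minimal example for a four-way multiple-choice item:

```python
import numpy as np

def brier_score(probs, answer_idx):
    """Mean squared error between predicted choice probabilities
    and the one-hot correct answer (lower is better)."""
    target = np.zeros(len(probs))
    target[answer_idx] = 1.0
    return float(np.mean((np.asarray(probs) - target) ** 2))

# A confident correct prediction vs. a near-chance one:
print(brier_score([0.85, 0.05, 0.05, 0.05], 0))  # 0.0075
print(brier_score([0.25, 0.25, 0.25, 0.25], 0))  # 0.1875
```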
 
 
Methodology Details:
- Models: 53 base pretrained models (not instruction-tuned), with known training token counts for compute estimation (C ≈ 6·N·D, where N is the parameter count and D the number of training tokens).
 
- Evaluation: LM Evaluation Harness, identical to HuggingFace Open LLM Leaderboard.
 
- Statistical Analysis: A regression model A = α·max(0, log C − c<sub>e</sub>) + θ·new + r + ε is used, where A is accuracy, C is pretraining compute, new indicates whether the model was trained after November 2023, r is random-chance accuracy, and c<sub>e</sub> is the point of emergence. The coefficient θ quantifies the average performance difference between newer and older models at matched compute (a fitting sketch follows this list).
 
- Fine-tuning: Standard hyperparameters, 3 epochs, minimal tuning. Compute for fine-tuning is minimal compared to pretraining.
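
A minimal fitting sketch for the compute estimate and the regression above, using synthetic numbers rather than the paper's measurements; `scipy.optimize.curve_fit` is used here for convenience and may differ from the authors' estimation procedure:

```python
import numpy as np
from scipy.optimize import curve_fit

# Compute estimate: C ~ 6 * N * D (N = parameters, D = training tokens).
# Example: Pythia 6.9B trained on ~300B tokens -> roughly 1.2e22 FLOPs.
def pretraining_flops(n_params, n_tokens):
    return 6.0 * n_params * n_tokens

# Regression: A = alpha * max(0, log10 C - c_e) + theta * new + r,
# with r fixed to random-chance accuracy (0.25 for 4-way multiple choice).
R_CHANCE = 0.25

def model(X, alpha, c_e, theta):
    log_c, new = X
    return alpha * np.maximum(0.0, log_c - c_e) + theta * new + R_CHANCE

# Illustrative inputs (not the paper's data): log10 compute, recency flag, accuracy.
log_c = np.array([20.5, 21.5, 22.5, 23.5, 20.5, 21.5, 22.5, 23.5])
new   = np.array([0,    0,    0,    0,    1,    1,    1,    1   ])
acc   = np.array([0.25, 0.25, 0.32, 0.45, 0.26, 0.30, 0.41, 0.52])

(alpha, c_e, theta), _ = curve_fit(model, (log_c, new), acc, p0=[0.1, 21.0, 0.05])
print(f"alpha={alpha:.3f}, point of emergence c_e ~ 10^{c_e:.1f} FLOPs, theta={theta:.3f}")
```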
 
Limitations:
- The adjustment method requires significant computational resources for fine-tuning, which may not be available to all evaluators.
 
- Sufficient task-relevant training data might be expensive or unavailable for many tasks.
 
- Correcting for training on proprietary task-specific data would be difficult.
 
Conclusion and Recommendation:
The paper argues that training on the test task is a pervasive confounder in LLM evaluation. Instead of trying to detect and disallow such practices, which is often infeasible, the authors advocate for "fighting fire with fire": evaluators should give every model the same, sufficient amount of fine-tuning on task-relevant data prior to evaluation. This harmonizes comparisons, deconfounds scaling laws, and makes capabilities more predictable. This approach also incentivizes the development of models that are easily and effectively fine-tunable.