The Unreasonable Effectiveness of Easy Training Data for Hard Tasks (2401.06751v2)
Abstract: How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as LLMs have continually improved. In this paper, we present the surprising conclusion that current pretrained LLMs often generalize relatively well from easy to hard data, even performing as well as oracle models finetuned on hard data. We demonstrate this kind of easy-to-hard generalization using simple training methods like in-context learning, linear classifier heads, and QLoRA, for seven different measures of datapoint hardness, including six empirically diverse human hardness measures (like grade level) and one model-based measure (loss-based). Furthermore, we show that even if one cares most about model performance on hard data, it can be better to collect easy data rather than hard data for finetuning, since hard data is generally noisier and costlier to collect. Our experiments use open models up to 70b in size and four publicly available question-answering datasets with questions ranging in difficulty from 3rd-grade science questions to college-level STEM questions and general-knowledge trivia. We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied. Our code is available at: https://github.com/allenai/easy-to-hard-generalization
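To make the "linear classifier heads" setup concrete, here is a minimal sketch of the easy-to-hard probing recipe: fit a linear probe on hidden states extracted from easy questions, then score that same probe on hard questions. This is an illustration under assumptions, not the paper's released code (see the repository linked above); the model name, the toy easy/hard examples, and the `last_token_state` helper are all hypothetical placeholders.

```python
# Sketch: train a linear classifier head on EASY data, evaluate on HARD data.
# Assumes `transformers`, `torch`, and `scikit-learn` are installed; any open
# causal LM can stand in for the model name below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; gated models need HF access
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
lm.eval()

@torch.no_grad()
def last_token_state(text: str) -> torch.Tensor:
    """Last-layer hidden state of the final token, used as the probe's input."""
    ids = tok(text, return_tensors="pt").to(lm.device)
    out = lm(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float().cpu()

# Toy (question + candidate answer, is_correct) pairs, split by a human
# hardness measure such as grade level. Real experiments use full QA datasets.
easy = [("Q: What gas do plants absorb? A: carbon dioxide", 1),
        ("Q: What gas do plants absorb? A: nitrogen", 0)]
hard = [("Q: Which quantum number determines orbital shape? A: azimuthal", 1),
        ("Q: Which quantum number determines orbital shape? A: spin", 0)]

X_easy = torch.stack([last_token_state(t) for t, _ in easy]).numpy()
X_hard = torch.stack([last_token_state(t) for t, _ in hard]).numpy()

# The "linear classifier head": logistic regression over frozen LM features,
# trained only on easy data and tested on unseen hard data.
probe = LogisticRegression(max_iter=1000).fit(X_easy, [y for _, y in easy])
print("hard-set accuracy:", probe.score(X_hard, [y for _, y in hard]))
```

Comparing this hard-set score against the same probe trained directly on hard labels is one way to quantify the easy-to-hard gap the paper studies.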
- Nancy E Adams. 2015. Bloom’s taxonomy of cognitive learning objectives. Journal of the Medical Library Association: JMLA, 103(3):152.
- Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
- Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it's lying. arXiv preprint arXiv:2304.13734.
- Jinze Bai et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
- Arpit Bansal et al. 2022. End-to-end algorithm synthesis with recurrent networks: Extrapolation without overthinking. Advances in Neural Information Processing Systems, 35:20232–20242.
- Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
- Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48.
- Benjamin S. Bloom, editor. 1956. Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. McKay, New York.
- Ben Bogin, Shivanshu Gupta, and Jonathan Berant. 2022. Unobserved local structures make compositional generalization hard. arXiv preprint arXiv:2201.05899.
- Samuel R. Bowman et al. 2022. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540.
- Tom Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Collin Burns et al. 2023. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390.
- Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- Karl Cobbe et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 2022. 8-bit optimizers via block-wise quantization. In International Conference on Learning Representations (ICLR).
- Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. In Thirty-seventh Conference on Neural Information Processing Systems.
- Nouha Dziri et al. 2023. Faith and fate: Limits of transformers on compositionality. arXiv preprint arXiv:2305.18654.
- Bradley Efron and Robert J Tibshirani. 1994. An Introduction to the Bootstrap. CRC press.
- Mohamed Elgaar and Hadi Amiri. 2023. HuCurl: Human-induced curriculum discovery. arXiv preprint arXiv:2307.07412.
- Adept AI. 2023. Releasing Persimmon-8B. Blog post.
- Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720.
- Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361.
- Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- Albert Q. Jiang et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088.
- Brenden Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning, pages 2873–2882. PMLR.
- Matthew Lease. 2011. On quality control and machine learning in crowdsourcing. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence.
- B. A. Levinstein and Daniel A. Herrmann. 2023. Still no lie detector for language models: Probing empirical and conceptual roadblocks. arXiv preprint arXiv:2307.00175.
- Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. 2023. Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? arXiv preprint arXiv:2312.03729.
- Adyasha Maharana and Mohit Bansal. 2022. On curriculum learning for commonsense reasoning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 983–992.
- Sourab Mangrulkar et al. 2022. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
- Samuel Marks and Max Tegmark. 2023. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.
- Mistral AI. 2023. Announcing Mistral 7B. Blog post.
- Curtis Northcutt, Anish Athalye, and Jonas Mueller. 2021. Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.
- Adam Paszke et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
- Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. Rissanen data analysis: Examining dataset characteristics via description length. In International Conference on Machine Learning, pages 8500–8513. PMLR.
- David Rein et al. 2023. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.
- Swarnadeep Saha, Peter Hase, and Mohit Bansal. 2023. Can language models teach weaker agents? Teacher explanations improve students via theory of mind. arXiv preprint arXiv:2306.09299.
- Nithya Sambasivan et al. 2021. "Everyone wants to do the model work, not the data work": Data cascades in high-stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15.
- Avi Schwarzschild et al. 2021. Can you learn an algorithm? Generalizing from easy to hard problems with recurrent networks. Advances in Neural Information Processing Systems, 34:6695–6706.
- Charles Spearman. 1987. The proof and measurement of association between two things. The American Journal of Psychology, 100(3/4):441–471.
- Swabha Swayamdipta et al. 2020. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275–9293.
- TII. 2023. Falcon LLM. Blog post.
- Hugo Touvron et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Hugo Touvron et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Jason Wei et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- Benfeng Xu et al. 2020. Curriculum learning for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6095–6104.
- Hattie Zhou et al. 2023. What algorithms can transformers learn? A study in length generalization. arXiv preprint arXiv:2310.16028.
- Xiang Zhou, Yichen Jiang, and Mohit Bansal. 2023. Data factors for better compositional generalization. arXiv preprint arXiv:2311.04420.