The Unreasonable Effectiveness of Easy Training Data for Hard Tasks (2401.06751v2)

Published 12 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as LLMs have continually improved. In this paper, we present the surprising conclusion that current pretrained LLMs often generalize relatively well from easy to hard data, even performing as well as oracle models finetuned on hard data. We demonstrate this kind of easy-to-hard generalization using simple finetuning methods like in-context learning, linear classifier heads, and QLoRA for seven different measures of datapoint hardness, including six empirically diverse human hardness measures (like grade level) and one model-based measure (loss-based). Furthermore, we show that even if one cares most about model performance on hard data, it can be better to collect easy data rather than hard data for finetuning, since hard data is generally noisier and costlier to collect. Our experiments use open models up to 70b in size and four publicly available question-answering datasets with questions ranging in difficulty from 3rd grade science questions to college level STEM questions and general-knowledge trivia. We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied. Our code is available at: https://github.com/allenai/easy-to-hard-generalization


Summary

  • The paper demonstrates that language models finetuned only on easy data generalize surprisingly well to hard test data, often nearly matching oracle models finetuned on hard data.
  • It evaluates seven measures of datapoint hardness, including human measures such as education grade level and Bloom's taxonomy, plus a model-based minimum description length (MDL) measure.
  • The findings suggest that collecting clean, inexpensive easy training data can be a practical way to address much of the scalable oversight problem.

Introduction

Teaching LLMs to excel on difficult tasks is a significant obstacle in AI development, because annotating complex training data is costly and error-prone. To investigate whether these models can learn complex tasks from simpler, more accessible training data, the paper examines easy-to-hard generalization, in which models are trained on "easy" data and then tested on "hard" data, and asks whether and how efficiently LLMs can bridge the gap between these varying levels of task difficulty.

Measuring Difficulty Levels

Understanding what makes data "easy" or "hard" is essential for studying this question. The researchers use several hardness measures, ranging from the education grade level needed to solve a problem, to expert-assigned difficulty ratings, to the cognitive skills required under Bloom's taxonomy. Surface statistics, such as the number of words in a question, the number of characters in an answer, and the number of reasoning steps involved, provide additional quantitative measures of complexity. They also include one model-based measure, minimum description length (MDL), which captures how hard a datapoint is for a model to learn, roughly, how much training signal the model needs before it predicts the datapoint correctly. Using multiple measures matters because no single one fully captures what makes some data harder than others.
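Concretely, the experimental setup can be pictured as partitioning a QA dataset by one of these hardness measures and then finetuning only on the easy pool. The sketch below is illustrative rather than the paper's code: the field names (`question`, `grade_level`) and the threshold value are assumptions about the dataset schema.

```python
from typing import Callable

def split_by_hardness(examples: list[dict],
                      hardness_fn: Callable[[dict], float],
                      threshold: float) -> tuple[list[dict], list[dict]]:
    """Partition examples into an easy pool (hardness <= threshold) and a hard pool."""
    easy = [ex for ex in examples if hardness_fn(ex) <= threshold]
    hard = [ex for ex in examples if hardness_fn(ex) > threshold]
    return easy, hard

# Two illustrative hardness measures from the human-centric family:
grade_level = lambda ex: float(ex["grade_level"])               # e.g. 3 for 3rd-grade science
question_words = lambda ex: float(len(ex["question"].split()))  # surface-form proxy

# Hypothetical usage: train on the easy pool, evaluate on the hard pool.
# easy_train, hard_test = split_by_hardness(dataset, grade_level, threshold=8.0)
```

Any such threshold is a design choice; the paper's point is that several quite different choices of hardness measure lead to the same qualitative conclusion.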

Generalization from Easy to Hard Data

Comparing LLMs finetuned on easy data against those finetuned on hard data, the results are striking. Across open models of up to 70 billion parameters and training methods including in-context learning, linear classifier heads, and QLoRA, the paper consistently finds that models generalize from easy to hard data surprisingly well, in some cases performing almost as well as an "oracle" model given access to hard training data. This suggests that supervision on straightforward data may be enough for models to perform at a high level on complex tasks.
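As one concrete illustration, the linear-classifier-head setting can be approximated by fitting a linear probe on frozen LM hidden states from easy training questions and scoring it on the hard split. The sketch below uses scikit-learn as a stand-in; the `easy_feats`/`hard_feats` arrays are assumed to come from a feature-extraction step (e.g., last-layer hidden states per answer option) that is not shown and may differ from the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_linear_probe(easy_feats: np.ndarray, easy_labels: np.ndarray) -> LogisticRegression:
    """Fit a linear probe on hidden-state features extracted from *easy* training data."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(easy_feats, easy_labels)
    return probe

def easy_to_hard_accuracy(probe: LogisticRegression,
                          hard_feats: np.ndarray,
                          hard_labels: np.ndarray) -> float:
    """Accuracy of the easy-trained probe on the held-out *hard* split."""
    return float(probe.score(hard_feats, hard_labels))

# Hypothetical usage, once features have been extracted:
# probe = train_linear_probe(easy_feats, easy_labels)
# print(easy_to_hard_accuracy(probe, hard_feats, hard_labels))
```

The same train-on-easy, test-on-hard protocol applies unchanged to the in-context learning and QLoRA settings; only the adaptation method changes.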

Practical Implications

On the practical side of data acquisition for training LMs, the paper suggests a shift in where annotation budgets go. Because hard data is generally more expensive to collect and more prone to annotation errors, training on easy data can be the more efficient choice even when the goal is performance on hard test data. The experiments provide evidence that cleanly labeled, easy-to-acquire data can be a better investment than gathering more complex datasets, provided the simpler data is of sufficient quality.

Scalability and Oversight

The scalable oversight problem concerns how to give models meaningful feedback when it is difficult to judge the correctness of their outputs. The paper's findings imply that this challenge may be less daunting than previously thought, since models show robust generalization from easy to hard data. Furthermore, the proportion of the supervision gap recovered by training on easy data remains largely consistent across model sizes, indicating that the problem does not necessarily intensify as models become more capable.
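The paper summarizes this with a "supervision gap recovered" statistic. A plausible formalization, based on our reading rather than a verbatim transcription of the paper's definition, is the fraction of the gap between an unsupervised baseline and a hard-supervised oracle that easy supervision manages to close:

```python
def supervision_gap_recovered(acc_easy_sup: float,
                              acc_hard_sup: float,
                              acc_unsupervised: float) -> float:
    """Fraction of the (unsupervised -> hard-supervised) gap closed by easy supervision.

    A value of 1.0 means training on easy data matches the hard-supervised oracle;
    0.0 means it does no better than the unsupervised baseline.
    """
    gap = acc_hard_sup - acc_unsupervised
    if gap <= 0:
        return 1.0  # the oracle offers no headroom over the baseline
    return (acc_easy_sup - acc_unsupervised) / gap

# e.g. supervision_gap_recovered(0.68, 0.70, 0.50) -> ~0.9
```

Values of this ratio staying high across model scales is what the consistency claim above refers to.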

Conclusion

The paper paints an encouraging picture of how flexibly LLMs learn. By showing that models generalize effectively from easy to hard data on the tasks studied, the work points to training approaches that are both cost-effective and resilient to annotation difficulties. The finding that useful supervision can come from less complex datasets may enable more efficient training methodologies, and it marks a meaningful step toward addressing the scalable oversight problem for model developers and other stakeholders in AI.
