The Unreasonable Effectiveness of Easy Training Data for Hard Tasks (2401.06751v2)
Abstract: How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as LLMs have continually improved. In this paper, we present the surprising conclusion that current pretrained LLMs often generalize relatively well from easy to hard data, even performing as well as oracle models finetuned on hard data. We demonstrate this kind of easy-to-hard generalization using simple training methods like in-context learning, linear classifier heads, and QLoRA, for seven different measures of datapoint hardness, including six empirically diverse human hardness measures (like grade level) and one model-based measure (loss-based). Furthermore, we show that even if one cares most about model performance on hard data, it can be better to collect easy data rather than hard data for finetuning, since hard data is generally noisier and costlier to collect. Our experiments use open models up to 70b in size and four publicly available question-answering datasets with questions ranging in difficulty from 3rd-grade science questions to college-level STEM questions and general-knowledge trivia. We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied. Our code is available at: https://github.com/allenai/easy-to-hard-generalization
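To make the "linear classifier heads" setup concrete, here is a minimal sketch of the easy-to-hard probing recipe: fit a linear probe on hidden states extracted from easy questions, then score that same probe on hard questions. This is an illustration under assumptions, not the paper's released code (see the repository linked above); the model name, the toy easy/hard examples, and the `last_token_state` helper are all hypothetical placeholders.

```python
# Sketch: train a linear classifier head on EASY data, evaluate on HARD data.
# Assumes `transformers`, `torch`, and `scikit-learn` are installed; any open
# causal LM can stand in for the model name below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; gated models need HF access
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
lm.eval()

@torch.no_grad()
def last_token_state(text: str) -> torch.Tensor:
    """Last-layer hidden state of the final token, used as the probe's input."""
    ids = tok(text, return_tensors="pt").to(lm.device)
    out = lm(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float().cpu()

# Toy (question + candidate answer, is_correct) pairs, split by a human
# hardness measure such as grade level. Real experiments use full QA datasets.
easy = [("Q: What gas do plants absorb? A: carbon dioxide", 1),
        ("Q: What gas do plants absorb? A: nitrogen", 0)]
hard = [("Q: Which quantum number determines orbital shape? A: azimuthal", 1),
        ("Q: Which quantum number determines orbital shape? A: spin", 0)]

X_easy = torch.stack([last_token_state(t) for t, _ in easy]).numpy()
X_hard = torch.stack([last_token_state(t) for t, _ in hard]).numpy()

# The "linear classifier head": logistic regression over frozen LM features,
# trained only on easy data and tested on unseen hard data.
probe = LogisticRegression(max_iter=1000).fit(X_easy, [y for _, y in easy])
print("hard-set accuracy:", probe.score(X_hard, [y for _, y in hard]))
```

Comparing this hard-set score against the same probe trained directly on hard labels is one way to quantify the easy-to-hard gap the paper studies.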
- Nancy E Adams. 2015. Bloom’s taxonomy of cognitive learning objectives. Journal of the Medical Library Association: JMLA, 103(3):152.
- Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
- Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it's lying. arXiv preprint arXiv:2304.13734.
- Jinze Bai et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
- Arpit Bansal et al. 2022. End-to-end algorithm synthesis with recurrent networks: Extrapolation without overthinking. Advances in Neural Information Processing Systems, 35:20232–20242.
- Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
- Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48.
- Benjamin S. Bloom, editor. 1956. Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. McKay, New York.
- Ben Bogin, Shivanshu Gupta, and Jonathan Berant. 2022. Unobserved local structures make compositional generalization hard. arXiv preprint arXiv:2201.05899.
- Samuel R. Bowman et al. 2022. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540.
- Tom Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Collin Burns et al. 2023. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390.
- Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- Karl Cobbe et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 2022. 8-bit optimizers via block-wise quantization. In International Conference on Learning Representations (ICLR).
- Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. In Thirty-seventh Conference on Neural Information Processing Systems.
- Nouha Dziri et al. 2023. Faith and fate: Limits of transformers on compositionality. arXiv preprint arXiv:2305.18654.
- Bradley Efron and Robert J Tibshirani. 1994. An Introduction to the Bootstrap. CRC press.
- Mohamed Elgaar and Hadi Amiri. 2023. HuCurl: Human-induced curriculum discovery. arXiv preprint arXiv:2307.07412.
- Adept AI. 2023. Releasing Persimmon-8B. Blog post.
- Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720.
- Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361.
- Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- Albert Q. Jiang et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088.
- Brenden Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning, pages 2873–2882. PMLR.
- Matthew Lease. 2011. On quality control and machine learning in crowdsourcing. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence.
- B. A. Levinstein and Daniel A. Herrmann. 2023. Still no lie detector for language models: Probing empirical and conceptual roadblocks. arXiv preprint arXiv:2307.00175.
- Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. 2023. Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? arXiv preprint arXiv:2312.03729.
- Adyasha Maharana and Mohit Bansal. 2022. On curriculum learning for commonsense reasoning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 983–992.
- Sourab Mangrulkar et al. 2022. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
- Samuel Marks and Max Tegmark. 2023. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.
- Mistral AI. 2023. Announcing Mistral 7B. Blog post.
- Curtis Northcutt, Anish Athalye, and Jonas Mueller. 2021. Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.
- Adam Paszke et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
- Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. Rissanen data analysis: Examining dataset characteristics via description length. In International Conference on Machine Learning, pages 8500–8513. PMLR.
- David Rein et al. 2023. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.
- Swarnadeep Saha, Peter Hase, and Mohit Bansal. 2023. Can language models teach weaker agents? Teacher explanations improve students via theory of mind. arXiv preprint arXiv:2306.09299.
- Nithya Sambasivan et al. 2021. "Everyone wants to do the model work, not the data work": Data cascades in high-stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15.
- Avi Schwarzschild et al. 2021. Can you learn an algorithm? Generalizing from easy to hard problems with recurrent networks. Advances in Neural Information Processing Systems, 34:6695–6706.
- Charles Spearman. 1987. The proof and measurement of association between two things. The American Journal of Psychology, 100(3/4):441–471.
- Swabha Swayamdipta et al. 2020. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275–9293.
- TII. 2023. Falcon LLM. Blog post.
- Hugo Touvron et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Hugo Touvron et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Jason Wei et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- Benfeng Xu et al. 2020. Curriculum learning for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6095–6104.
- Hattie Zhou et al. 2023. What algorithms can transformers learn? A study in length generalization. arXiv preprint arXiv:2310.16028.
- Xiang Zhou, Yichen Jiang, and Mohit Bansal. 2023. Data factors for better compositional generalization. arXiv preprint arXiv:2311.04420.