Task Contamination: Language Models May Not Be Few-Shot Anymore (2312.16337v1)

Published 26 Dec 2023 in cs.CL

Abstract: LLMs offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, task example extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings.

Summary

  • The paper finds that LLMs can exhibit high performance due to prior exposure to test-like data rather than genuine reasoning.
  • The methodology involves training data inspection, example extraction, membership inference, and chronological analysis to detect contamination.
  • The results call for more transparent training practices and stronger evaluation frameworks to ensure accurate assessments of LLM capabilities.

Understanding the Pitfalls of Evaluating LLMs

The Study of Task Contamination

LLMs such as GPT-3 have gained traction for their impressive performance on zero-shot and few-shot tasks, which test a model's ability to respond appropriately to prompts without task-specific fine-tuning, using at most a handful of in-context examples. However, the validity of such evaluations is now under scrutiny. This paper probes their integrity by investigating a phenomenon the authors call "task contamination."

The essence of task contamination is that LLMs may perform well not on merit alone, but because they encountered similar data during their extensive pre-training. If the evaluation datasets contain examples similar to those seen in training, strong results may reflect the model recalling that data rather than genuinely reasoning about the task. What appears to be a remarkable feat of zero-shot or few-shot learning may, in reality, be an artifact of task contamination.

Methodology and Findings

The paper examines GPT-3 series models and several other recently released open-source LLMs, controlling for dataset difficulty. A surprising pattern emerged: models performed markedly better on datasets released before their training data was collected than on datasets released afterward. This discrepancy suggests that the LLMs' training data includes task-specific examples, biasing zero-shot and few-shot evaluations.
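
As a rough illustration of this chronological comparison (not the paper's actual code), the sketch below groups per-dataset results by whether each dataset was released before or after a model's training-data cutoff and compares the average margin over a majority baseline; the dataset names, dates, and scores are hypothetical.

```python
from datetime import date
from statistics import mean

# Hypothetical per-dataset results for one model: (release date, model accuracy,
# majority-class baseline accuracy). Values are illustrative, not from the paper.
results = {
    "dataset_a": (date(2019, 6, 1), 0.81, 0.52),
    "dataset_b": (date(2020, 11, 1), 0.77, 0.55),
    "dataset_c": (date(2022, 5, 1), 0.58, 0.56),
    "dataset_d": (date(2023, 2, 1), 0.54, 0.53),
}

TRAINING_DATA_CUTOFF = date(2021, 9, 1)  # assumed cutoff for the model under study

def margin_over_baseline(model_acc: float, baseline_acc: float) -> float:
    """Performance lift of the model over the simple majority baseline."""
    return model_acc - baseline_acc

before = [margin_over_baseline(acc, base)
          for released, acc, base in results.values() if released < TRAINING_DATA_CUTOFF]
after = [margin_over_baseline(acc, base)
         for released, acc, base in results.values() if released >= TRAINING_DATA_CUTOFF]

# A consistently larger margin on pre-cutoff datasets is the chronological signal
# the paper treats as suggestive evidence of task contamination.
print(f"mean margin, datasets released before cutoff: {mean(before):+.3f}")
print(f"mean margin, datasets released after cutoff:  {mean(after):+.3f}")
```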

The researchers employed four methods to uncover evidence of task contamination:

  1. Training Data Inspection: Checking the training datasets for examples analogous to test tasks.
  2. Task Example Extraction: Attempting to induce the model to regurgitate training examples using prompts.
  3. Membership Inference Attack: For generation tasks specifically, checking whether the model's generated output exactly matches an instance of the original dataset (sketched below).
  4. Chronological Analysis: Comparing model performance on datasets released before versus after the training data collection date, looking for signs of contamination.

The first three methods offer high precision but low recall, while chronological analysis offers high recall but low precision. Together, these approaches uncovered strong evidence of task contamination across a range of models and datasets.
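
A minimal sketch of the exact-match check behind the membership inference idea is shown below; the generation function and dataset examples are placeholders, not the paper's implementation.

```python
def exact_match_rate(generate, examples):
    """Fraction of dataset instances the model reproduces verbatim.

    `generate` is any callable mapping an input prompt to the model's output;
    `examples` is a list of (input_text, reference_output) pairs from the
    evaluation dataset. A non-trivial exact-match rate on a generation task is
    treated as strong, high-precision evidence that the instance was seen
    during training.
    """
    def normalize(text: str) -> str:
        return " ".join(text.strip().lower().split())

    matches = sum(
        normalize(generate(inp)) == normalize(ref) for inp, ref in examples
    )
    return matches / len(examples)


if __name__ == "__main__":
    # Toy stand-ins for a real model and dataset, purely for illustration.
    gold = [("Translate to SQL: list all users", "SELECT * FROM users")]
    fake_model = lambda prompt: "SELECT * FROM users"
    print(exact_match_rate(fake_model, gold))  # 1.0 -> suspicious verbatim reproduction
```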

One stark finding is that on classification tasks with no possibility of task contamination, LLMs rarely show statistically significant improvements over simple majority baselines, in either zero-shot or few-shot settings. The paper argues that this absence of a performance lift suggests task contamination may be distorting our perception of LLM capabilities in other settings.
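
The summary does not spell out which statistical test the paper applies. As one illustrative way to check for a significant lift over a majority baseline, the sketch below runs a one-sided binomial test on per-example correctness; all counts and rates are made up for illustration.

```python
from scipy.stats import binomtest

# Hypothetical evaluation of one model on one classification dataset.
n_examples = 250       # test-set size (illustrative)
n_correct = 142        # examples the model classified correctly (illustrative)
majority_rate = 0.52   # accuracy of always predicting the most frequent class

# Null hypothesis: each example is answered correctly with probability equal to
# the majority-baseline rate, i.e. the model is no better than the baseline.
test = binomtest(n_correct, n_examples, p=majority_rate, alternative="greater")

print(f"model accuracy:    {n_correct / n_examples:.3f}")
print(f"majority baseline: {majority_rate:.3f}")
print(f"p-value:           {test.pvalue:.4f}")
if test.pvalue >= 0.05:
    print("No statistically significant improvement over the majority baseline.")
```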

Implications

The insights from this research have significant implications. They raise concerns about the reliability of LLM evaluations, suggesting that models, particularly closed-source ones, may be showing inflated performance, which undermines the trustworthiness of current zero-shot and few-shot evaluation methods. The paper emphasizes the importance of transparently releasing training data so that contamination can be detected more reliably.

Conclusion

This paper is a call for the AI community to approach the evaluation of LLMs with heightened skepticism and rigor. Task contamination is not a trivial issue: it undercuts the foundation on which the apparent capabilities of these models are judged. Future research needs to examine this problem further and work toward evaluation frameworks that rule out contamination, so that genuine advances in LLM capabilities can be measured.