Task Contamination: Language Models May Not Be Few-Shot Anymore (2312.16337v1)

Published 26 Dec 2023 in cs.CL

Abstract: LLMs offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, task example extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings.

Summary

  • The paper finds that LLMs can exhibit high performance due to prior exposure to test-like data rather than genuine reasoning.
  • The methodology involves training data inspection, example extraction, membership inference, and chronological analysis to detect contamination.
  • The results call for more transparent training practices and stronger evaluation frameworks to ensure accurate assessments of LLM capabilities.

Understanding the Pitfalls of Evaluating LLMs

The Study of Task Contamination

LLMs such as GPT-3 have gained traction for their impressive performance on zero-shot and few-shot tasks, which test a model's ability to respond appropriately to prompts without task-specific fine-tuning, using at most a handful of in-context examples. However, the validity of such evaluations is now under scrutiny. This paper probes their integrity by investigating a phenomenon the authors call "task contamination."

The essence of task contamination is that LLMs may perform well not on merit alone, but because they encountered similar data during their extensive pre-training. If the evaluation datasets contain examples similar to those seen in training, strong results may reflect the model recalling that data rather than genuinely reasoning about the task. What appears to be a remarkable feat of zero-shot or few-shot learning may, in reality, be an artifact of task contamination.

Methodology and Findings

The paper examines GPT-3 series models and several other recently released open-source LLMs, controlling for dataset difficulty. A surprising pattern emerged: models performed markedly better on datasets released before their training data was collected than on datasets released afterward. This discrepancy suggests that the LLMs' training data includes task-specific examples, biasing zero-shot and few-shot evaluations.
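
As a rough illustration of this chronological comparison (not the paper's actual code), the sketch below groups per-dataset results by whether each dataset was released before or after a model's training-data cutoff and compares the average margin over a majority baseline; the dataset names, dates, and scores are hypothetical.

```python
from datetime import date
from statistics import mean

# Hypothetical per-dataset results for one model: (release date, model accuracy,
# majority-class baseline accuracy). Values are illustrative, not from the paper.
results = {
    "dataset_a": (date(2019, 6, 1), 0.81, 0.52),
    "dataset_b": (date(2020, 11, 1), 0.77, 0.55),
    "dataset_c": (date(2022, 5, 1), 0.58, 0.56),
    "dataset_d": (date(2023, 2, 1), 0.54, 0.53),
}

TRAINING_DATA_CUTOFF = date(2021, 9, 1)  # assumed cutoff for the model under study

def margin_over_baseline(model_acc: float, baseline_acc: float) -> float:
    """Performance lift of the model over the simple majority baseline."""
    return model_acc - baseline_acc

before = [margin_over_baseline(acc, base)
          for released, acc, base in results.values() if released < TRAINING_DATA_CUTOFF]
after = [margin_over_baseline(acc, base)
         for released, acc, base in results.values() if released >= TRAINING_DATA_CUTOFF]

# A consistently larger margin on pre-cutoff datasets is the chronological signal
# the paper treats as suggestive evidence of task contamination.
print(f"mean margin, datasets released before cutoff: {mean(before):+.3f}")
print(f"mean margin, datasets released after cutoff:  {mean(after):+.3f}")
```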

The researchers employed four methods to uncover evidence of task contamination:

  1. Training Data Inspection: Checking the training datasets for examples analogous to test tasks.
  2. Task Example Extraction: Attempting to induce the model to regurgitate training examples using prompts.
  3. Membership Inference Attack: For generation tasks specifically, checking whether the model's generated output exactly matches an instance of the original dataset (sketched below).
  4. Chronological Analysis: Comparing model performance on datasets released before versus after the training data collection date, looking for signs of contamination.

The first three methods offer high precision but low recall, while chronological analysis offers high recall but low precision. Together, these approaches uncovered strong evidence of task contamination across a range of models and datasets.
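
A minimal sketch of the exact-match check behind the membership inference idea is shown below; the generation function and dataset examples are placeholders, not the paper's implementation.

```python
def exact_match_rate(generate, examples):
    """Fraction of dataset instances the model reproduces verbatim.

    `generate` is any callable mapping an input prompt to the model's output;
    `examples` is a list of (input_text, reference_output) pairs from the
    evaluation dataset. A non-trivial exact-match rate on a generation task is
    treated as strong, high-precision evidence that the instance was seen
    during training.
    """
    def normalize(text: str) -> str:
        return " ".join(text.strip().lower().split())

    matches = sum(
        normalize(generate(inp)) == normalize(ref) for inp, ref in examples
    )
    return matches / len(examples)


if __name__ == "__main__":
    # Toy stand-ins for a real model and dataset, purely for illustration.
    gold = [("Translate to SQL: list all users", "SELECT * FROM users")]
    fake_model = lambda prompt: "SELECT * FROM users"
    print(exact_match_rate(fake_model, gold))  # 1.0 -> suspicious verbatim reproduction
```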

One stark finding is that on classification tasks with no possibility of task contamination, LLMs rarely show statistically significant improvements over simple majority baselines, in either zero-shot or few-shot settings. The paper argues that this absence of a performance lift suggests task contamination may be distorting our perception of LLM capabilities in other settings.
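
The summary does not spell out which statistical test the paper applies. As one illustrative way to check for a significant lift over a majority baseline, the sketch below runs a one-sided binomial test on per-example correctness; all counts and rates are made up for illustration.

```python
from scipy.stats import binomtest

# Hypothetical evaluation of one model on one classification dataset.
n_examples = 250       # test-set size (illustrative)
n_correct = 142        # examples the model classified correctly (illustrative)
majority_rate = 0.52   # accuracy of always predicting the most frequent class

# Null hypothesis: each example is answered correctly with probability equal to
# the majority-baseline rate, i.e. the model is no better than the baseline.
test = binomtest(n_correct, n_examples, p=majority_rate, alternative="greater")

print(f"model accuracy:    {n_correct / n_examples:.3f}")
print(f"majority baseline: {majority_rate:.3f}")
print(f"p-value:           {test.pvalue:.4f}")
if test.pvalue >= 0.05:
    print("No statistically significant improvement over the majority baseline.")
```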

Implications

The insights from this research have significant implications. They raise concerns about the reliability of LLM evaluations, suggesting that models, particularly closed-source ones, may be showing inflated performance, which undermines the trustworthiness of current zero-shot and few-shot evaluation methods. The paper emphasizes the importance of transparently releasing training data so that contamination can be detected more reliably.

Conclusion

This paper is a call for the AI community to approach the evaluation of LLMs with heightened skepticism and rigor. Task contamination is not a trivial issue: it undercuts the foundation on which the apparent capabilities of these models are judged. Future research needs to examine this problem further and work toward evaluation frameworks that rule out contamination, so that genuine advances in LLM capabilities can be measured.