- The paper introduces the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework to assess if LLMs are overfitted to benchmark datasets instead of showing genuine understanding.
- C-BOD applies parametric prompt transformations to evaluate 26 LLMs, finding an average 2.15% performance drop, with statistically significant degradation in 20 of the 26 models, indicating reliance on superficial cues.
- Observations show larger LLMs and those with higher baseline accuracy are more sensitive to minor prompt changes, suggesting a potential link between model size and the tendency to overfit benchmarks.
An Analysis of "Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon"
The paper "Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon" offers a critical examination of the robustness of LLMs in the face of benchmark evaluations. It introduces the Chameleon Benchmark Overfit Detector (C-BOD), a novel meta-evaluation framework designed to assess whether the performance of LLMs is genuinely reflective of robust language understanding or merely the result of superficial memorization of dataset-specific patterns.
Key Contributions and Methodology
The authors argue that conventional high performance of LLMs on benchmark datasets may not accurately reflect their true language processing capabilities. Instead, such performance might often result from overfitting to dataset-specific cues. To address this, the paper proposes C-BOD, a method that applies parametric transformations to benchmark prompts to detect overfitting. This approach involves:
- Textual Distortions: Systematically rephrasing prompts while maintaining original semantic content to see if performance declines indicate reliance on memorized patterns.
- Detection Across Models: Evaluating 26 leading LLMs under modest prompt perturbations, the authors observe an average performance degradation of 2.15%, with 20 of the 26 models displaying statistically significant drops.
- Insights on Model Sensitivity: Larger LLMs and those with higher baseline accuracy exhibit greater sensitivity to rephrasings. Notably, the Llama family and models with lower baseline accuracy demonstrated minimal performance degradation, suggesting less reliance on superficial cues.
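The detection procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-example correctness records for the original and rephrased benchmark, and uses an exact McNemar test on the discordant pairs as the significance check (a natural choice for paired correctness outcomes; the paper's exact statistical test may differ).

```python
import math

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant counts:
    b = examples correct only on the original prompt,
    c = examples correct only on the rephrased prompt."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided binomial tail probability under the null p = 0.5.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def cbod_delta(orig_correct, rephrased_correct):
    """Given per-example booleans for a model on the original and the
    rephrased benchmark, return (accuracy drop, McNemar p-value)."""
    b = sum(o and not r for o, r in zip(orig_correct, rephrased_correct))
    c = sum(r and not o for o, r in zip(orig_correct, rephrased_correct))
    n = len(orig_correct)
    drop = (sum(orig_correct) - sum(rephrased_correct)) / n
    return drop, mcnemar_exact_p(b, c)

# Toy example: a model correct on 9/10 original prompts but only 6/10
# rephrased ones -- a 30% drop driven entirely by surface changes.
orig = [True] * 9 + [False]
reph = [True] * 6 + [False] * 4
drop, p = cbod_delta(orig, reph)
```

A model that genuinely understands the task should show a drop near zero and a non-significant p-value; a consistent, significant drop across semantically equivalent rephrasings is the overfitting signal C-BOD looks for.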
Implications for LLM Evaluation
The study challenges the NLP community to reconsider how LLMs are evaluated, advocating for a shift from merely tracking high scores on leaderboards to ensuring models exhibit resilience and generalization. The dataset- and model-agnostic nature of the C-BOD framework makes it a universally applicable evaluation tool, promoting its integration into training pipelines to foster genuine language understanding and discourage overfitting to specific evaluation prompts.
The findings indicate that the size of LLMs correlates with their likelihood of overfitting, as evidenced by their sensitivity to textual distortions. This trend suggests that as models scale up in parameters, they may increasingly depend on memorized benchmark patterns. It raises critical questions about training and evaluation paradigms, which may need to correct for this tendency in order to drive meaningful advances in model development.
Future Prospects
The paper's results imply a need for developing evaluation frameworks focusing on surface-invariant testing to ensure the models' performance is not contingent upon familiar prompt structures. Furthermore, the insights gained from this study could catalyze future research on enhancing model robustness through dynamic benchmarks that evolve to outpace models' memorization tactics.
This article and the introduction of C-BOD provide a meaningful contribution toward refining the evaluation of LLMs, encouraging a method of assessment that seeks to uncover the robustness and genuine capabilities of LLMs beyond surface attributes. The proposed framework serves not only as a tool for diagnosing overfitting but also sets a stage for future developments aimed at robust and semantically grounded LLMs.