Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks (2404.16966v2)

Published 25 Apr 2024 in cs.CL

Abstract: Benchmarks have emerged as the central approach for evaluating LLMs. The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.

Evaluating Distributional Assumptions in Benchmark Evaluations of LLMs

Introduction

The accuracy and effectiveness of LLMs are typically evaluated using benchmark datasets. Conventional practice treats benchmark prompts as independent samples drawn from the same underlying distribution. This paper shows that model performance is in fact correlated across prompts within these benchmarks, and that these correlations influence overall model evaluations and rankings. The investigation highlights that distributional assumptions about benchmark composition can fundamentally affect how LLMs are appraised.

Key Contributions and Findings

Several significant observations were made in this paper:

  • Performance Correlation: Model performance is correlated across benchmark test prompts in a non-random way. Such correlation points to latent relationships among prompts that influence model performance in predictable ways across similar prompt types.
  • Impact on Model Rankings: Weighting test prompts differently to account for their distribution produces notable changes in model rankings, with shifts of up to 10% in performance metrics and up to 5 places in the rankings.
  • Distributional Assumptions: Weighting every prompt equally is misleading because it neglects the inherent biases and relationships among prompts. The paper groups prompts by similarity and recomputes model rankings based on these clusters.

Methodological Approach

Correlation Analysis

The paper uses permutation tests to assess whether the correlations observed in model responses across prompts could have arisen by chance. By repeatedly shuffling responses and comparing aggregate correlation statistics against the observed values, the authors confirm the presence of significant non-random performance similarities.
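
As a rough illustration of this kind of permutation test (a sketch, not the paper's exact statistic or implementation), the code below assumes a hypothetical binary correctness matrix `correct` of shape `(n_models, n_prompts)`; the test statistic, function names, and shuffling scheme are illustrative assumptions.

```python
import numpy as np

def mean_abs_prompt_correlation(correct: np.ndarray) -> float:
    """Mean absolute pairwise correlation between prompt columns.

    `correct` is a (n_models, n_prompts) 0/1 matrix: entry (i, j) is 1
    if model i answered prompt j correctly. Constant columns produce
    NaN correlations, which are ignored via nanmean.
    """
    corr = np.corrcoef(correct, rowvar=False)          # (n_prompts, n_prompts)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.nanmean(np.abs(off_diag)))

def permutation_test(correct: np.ndarray, n_perm: int = 1000, seed: int = 0) -> float:
    """P-value for the null hypothesis that prompt-level correlations are random.

    Each permutation shuffles every model's responses independently across
    prompts, preserving each model's overall accuracy while destroying any
    prompt-level structure.
    """
    rng = np.random.default_rng(seed)
    observed = mean_abs_prompt_correlation(correct)
    null_stats = []
    for _ in range(n_perm):
        shuffled = np.apply_along_axis(rng.permutation, 1, correct)
        null_stats.append(mean_abs_prompt_correlation(shuffled))
    return float(np.mean(np.array(null_stats) >= observed))
```

A small observed p-value under this scheme indicates that the prompt-level correlation structure is unlikely to be a sampling artifact.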

Weighted Performance Metrics

To account for prompt distribution, the paper examines cluster-based representative sampling and distance-weighted performance evaluations. Each method affects model rankings differently, confirming that weighting all prompts equally can skew benchmark outcomes.
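
As one hedged sketch of a cluster-based reweighting scheme in this spirit (the clustering method, cluster count, and function names are assumptions rather than the paper's exact procedure), each prompt can be down-weighted in proportion to the size of its semantic cluster so that no group of near-duplicate prompts dominates the benchmark score:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_weighted_accuracy(correct: np.ndarray,
                              prompt_embeddings: np.ndarray,
                              n_clusters: int = 10,
                              seed: int = 0) -> np.ndarray:
    """Per-model accuracy when each prompt cluster contributes equally.

    `correct`: (n_models, n_prompts) 0/1 correctness matrix.
    `prompt_embeddings`: (n_prompts, d) semantic embeddings of the prompts.
    """
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(prompt_embeddings)
    counts = np.bincount(labels, minlength=n_clusters)
    weights = 1.0 / (n_clusters * counts[labels])   # weights sum to 1
    return correct @ weights                        # (n_models,) weighted scores

# Hypothetical usage: compare rankings under equal vs. cluster-based weighting.
# equal_rank    = np.argsort(-correct.mean(axis=1))
# weighted_rank = np.argsort(-cluster_weighted_accuracy(correct, prompt_embeddings))
```

Comparing the two rankings in the commented usage above makes the effect described here concrete: models that do well only on heavily represented prompt clusters tend to drop once those clusters are down-weighted.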

Semantic Analysis

To understand the sources of prompt correlation, the paper compares per-prompt performance vectors with semantic embeddings of the prompts. In several cases the two are correlated, which the authors attribute to semantic similarity or to shared model failure points on particular prompt types.
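
A minimal sketch of one way to quantify this relationship, assuming the same hypothetical correctness matrix and prompt embeddings as above (the agreement measure and rank correlation here are illustrative choices, not necessarily those used in the paper):

```python
import numpy as np
from scipy.stats import spearmanr

def performance_vs_semantic_similarity(correct: np.ndarray,
                                       prompt_embeddings: np.ndarray) -> float:
    """Spearman correlation between two prompt-pair similarity measures:
    agreement of per-model performance outcomes vs. cosine similarity of
    prompt embeddings."""
    n_prompts = correct.shape[1]
    # Cosine similarity between prompt embeddings.
    emb = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=1, keepdims=True)
    emb_sim = emb @ emb.T
    # Agreement between prompt performance vectors: fraction of models
    # producing the same outcome (both correct or both incorrect) on a pair.
    perf = correct.astype(float)
    perf_agree = (perf.T @ perf + (1 - perf).T @ (1 - perf)) / perf.shape[0]
    # Rank correlation over the upper-triangular prompt pairs.
    iu = np.triu_indices(n_prompts, k=1)
    rho, _ = spearmanr(emb_sim[iu], perf_agree[iu])
    return float(rho)
```

A positive correlation under this kind of measure would support the interpretation that semantically similar prompts tend to succeed or fail together.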

Implications and Future Directions

The implications of these findings are critical for both theoretical and practical aspects of AI research. They challenge the conventional methods of evaluating LLMs using benchmarks and suggest the necessity for more nuanced approaches that consider the relationships and distributional biases within prompt sets.

Theoretical Implications

The paper enriches our understanding of the interactions within benchmark datasets and their impact on model evaluation metrics. This prompts a theoretical shift towards considering benchmarks as complex systems with internal dependencies rather than independent prompt samples.

Practical Implications

For AI practitioners, the paper underscores the need for robust benchmarking strategies that account for inherent prompt correlations. It suggests adapting benchmark weighting schemes based on prompt distribution and interrelations to better reflect real-world model performance and utility.

Future Research

Future work should focus on developing methodologies to further dissect the sources of prompt correlation, extending beyond semantic similarity to perhaps syntactic or contextual dimensions. Additionally, there's potential in exploring automated systems that dynamically adjust prompt weights in benchmarks based on observed performance correlations, thus offering a real-time calibration of benchmark difficulty and representativeness.

Conclusion

The research provides compelling evidence that standard evaluation benchmarks may not adequately reflect the true capabilities of LLMs due to their failure to acknowledge prompt interdependencies. This paper calls for a reevaluation of how benchmarks are constructed and utilized, proposing a more granular and dynamic methodology for LLM evaluation.

Authors (4)
  1. Melissa Ailem
  2. Katerina Marazopoulou
  3. Charlotte Siska
  4. James Bono