When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards (2402.01781v2)

Published 1 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLM leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value - we show this is a (potentially costly) mistake. Under existing leaderboards, the relative performance of LLMs is highly sensitive to (often minute) details. We show that for popular multiple-choice question benchmarks (e.g., MMLU), minor perturbations to the benchmark, such as changing the order of choices or the method of answer selection, result in changes in rankings up to 8 positions. We explain this phenomenon by conducting systematic experiments over three broad categories of benchmark perturbations and identifying the sources of this behavior. Our analysis results in several best-practice recommendations, including the advantage of a hybrid scoring method for answer selection. Our study highlights the dangers of relying on simple benchmark evaluations and charts the path for more robust evaluation schemes on the existing benchmarks. The code for this paper is available at https://github.com/National-Center-for-AI-Saudi-Arabia/lm-evaluation-harness.

When Benchmarks are Targets: Analyzing the Sensitivity of LLM Leaderboards

The paper by Norah Alzahrani et al. offers a thorough examination of the sensitivity of LLM leaderboards to minor perturbations in multiple-choice question (MCQ) benchmarks. These leaderboards frequently guide practitioners' model selection, yet they are shown to be unstable under small changes, with significant implications for both theoretical understanding and practical application.

The authors conducted a systematic series of experiments on well-known MCQ benchmarks, notably the Massive Multitask Language Understanding (MMLU) benchmark. Their experiments reveal that even trivial modifications, such as altering the order of choices or using different scoring methods, can shift model rankings by as many as eight positions, demonstrating the fragility and potential unreliability of these benchmarks.
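As a concrete illustration of this kind of perturbation, the sketch below shuffles the answer options of a single MCQ item while tracking where the gold answer ends up. The item format and the `shuffle_choices` helper are hypothetical, shown only to make the idea of a choice-order perturbation tangible; they are not the paper's actual evaluation code.

```python
import random

def shuffle_choices(question, choices, answer_idx, seed=0):
    """Return a copy of an MCQ item with its answer options in a new random order.

    The (question, choices, answer_idx) format is an assumption for this sketch;
    real benchmarks such as MMLU store items differently.
    """
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)  # position the gold answer moved to
    return question, new_choices, new_answer_idx

# The gold answer keeps its text ("Paris") but may get a new position and label.
q, c, a = shuffle_choices(
    "What is the capital of France?",
    ["Berlin", "Paris", "Madrid", "Rome"],
    answer_idx=1,
    seed=42,
)
print(c, "gold label:", "ABCD"[a])
```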

Key findings from the paper are as follows:

  1. Leaderboard Instability: The authors identified substantial variability in model rankings under slight perturbations. For instance, randomizing the order of answer choices led to major rank changes, notably with the Yi-6b model shifting from 3rd to 9th place.
  2. Sources of Bias: The paper explores sources of bias such as token and positional biases. LLMs showed a clear preference for specific choice positions and symbols, and these biases affected model performance unpredictably.
  3. Scoring Method Impact: The choice of scoring method (symbol, cloze, or hybrid scoring; see the sketch after this list) was another significant source of instability. Symbol scoring, despite being the most common, led to high selection bias, while cloze scoring reduced bias but yielded the poorest performance scores. Hybrid scoring provided a more balanced evaluation method.
  4. In-Context Knowledge Sensitivity: In-context manipulations like presenting correct or incorrect answers as part of the context led to models either "cheating" by copying the answers or performing poorly when misleading information was included.
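The difference between these answer-selection styles can be sketched as follows. Here `score(prompt, continuation)` is an assumed helper that returns the model's log-likelihood of a continuation given a prompt, and the prompt templates are illustrative rather than the exact formats used in the paper or in lm-evaluation-harness.

```python
def pick_answer(score, question, choices, method="hybrid"):
    """Illustrative comparison of MCQ answer-selection styles.

    `score(prompt, continuation)` is an assumed log-likelihood helper;
    real harnesses implement and normalize this differently.
    """
    labels = "ABCD"[: len(choices)]
    listing = "\n".join(f"{l}. {c}" for l, c in zip(labels, choices))

    if method == "symbol":
        # Symbol scoring: labelled choices appear in the prompt and the
        # model is scored on the label token alone ("A", "B", ...).
        prompt = f"{question}\n{listing}\nAnswer:"
        scores = [score(prompt, f" {l}") for l in labels]
    elif method == "cloze":
        # Cloze scoring: choices are not listed; each answer text is
        # scored directly as a continuation of the question.
        prompt = f"{question}\nAnswer:"
        scores = [score(prompt, f" {c}") for c in choices]
    else:
        # Hybrid scoring: labelled choices appear in the prompt, but the
        # answer text (not the label) is what gets scored.
        prompt = f"{question}\n{listing}\nAnswer:"
        scores = [score(prompt, f" {c}") for c in choices]

    return max(range(len(choices)), key=lambda i: scores[i])
```

In the paper's terminology, the first branch corresponds to symbol scoring, the second to cloze scoring, and the third to the hybrid approach the authors recommend as a more balanced default.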

The implications of these findings are manifold. First, more robust and consistent benchmark designs are needed to ensure reliable evaluation of LLMs; relying on unstable benchmarks can lead to inefficiencies and misallocation of resources, especially given the high costs of training and deploying LLMs. Second, understanding and mitigating biases in LLMs is critical for the validity of their evaluations, since a preference for specific formats or symbols can skew assessments of a model's true capabilities.

Future research should focus on developing benchmarks that are resistant to minor perturbations. This might include integrating a variety of scoring methods, standardizing choice formats, and ensuring the randomization of answer orders in a consistent manner. Additionally, there is a need for transparency in the training datasets used for LLM development to address concerns about potential overfitting to benchmark formats.
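One way to quantify how resistant a leaderboard is to such perturbations is to compare the rankings before and after a perturbation, for example via per-model rank shifts and a Kendall-tau-style agreement score. The sketch below assumes each leaderboard is simply an ordered list of model names; the names are placeholders, not results from the paper.

```python
from itertools import combinations

def rank_sensitivity(baseline, perturbed):
    """Compare two leaderboards (lists of model names, best first).

    Returns each model's position shift and a Kendall-tau-style
    agreement score in [-1, 1] (1 = identical ordering).
    """
    pos_a = {m: i for i, m in enumerate(baseline)}
    pos_b = {m: i for i, m in enumerate(perturbed)}
    shifts = {m: pos_b[m] - pos_a[m] for m in baseline}

    # Fraction of model pairs ordered the same way by both leaderboards,
    # rescaled to [-1, 1] (assumes no ties).
    pairs = list(combinations(baseline, 2))
    concordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
    )
    tau = 2 * concordant / len(pairs) - 1
    return shifts, tau

shifts, tau = rank_sensitivity(
    ["model-a", "model-b", "model-c", "model-d"],
    ["model-b", "model-a", "model-d", "model-c"],
)
print(shifts, f"tau={tau:.2f}")
```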

In conclusion, this paper underscores the necessity for the AI research community to rethink how LLMs are evaluated and compared. Ensuring the stability and fairness of benchmarks will not only lead to better model selection and resource utilization but also drive the development of more robust AI systems capable of performing reliably in varied and realistic settings. The paper by Alzahrani et al. provides a critical step in this direction, offering valuable insights and practical recommendations for improving the evaluation methodologies in AI research.

Authors (12)
  1. Norah Alzahrani (1 paper)
  2. Hisham Abdullah Alyahya (1 paper)
  3. Yazeed Alnumay (7 papers)
  4. Sultan Alrashed (4 papers)
  5. Shaykhah Alsubaie (1 paper)
  6. Yusef Almushaykeh (1 paper)
  7. Faisal Mirza (1 paper)
  8. Nouf Alotaibi (1 paper)
  9. Nora Altwairesh (1 paper)
  10. Areeb Alowisheq (3 papers)
  11. M Saiful Bari (22 papers)
  12. Haidar Khan (21 papers)