Analysis of "Changing Answer Order Can Decrease MMLU Accuracy"
In their paper, Gupta et al. empirically investigate how robust LLMs are to changes in answer order on the widely used Massive Multitask Language Understanding (MMLU) benchmark. The paper’s findings carry significant implications for evaluating and interpreting LLM performance, suggesting potential revisions to how benchmarks are constructed and used to rank models on leaderboards.
Introduction and Motivation
Benchmarking LLMs typically involves measuring test accuracy across a suite of tasks and aggregating the results to rank models. Despite the widespread use of these benchmarks, the fragility of the underlying accuracy measurements remains a concern. Prior research has identified various robustness issues, such as sensitivity to paraphrases and minor perturbations of the input. This paper extends that line of work by asking whether shuffling the contents of the answer choices affects model accuracy, focusing on the MMLU dataset.
Methodology
The MMLU dataset is a popular benchmark comprising 57 tasks designed to assess an LLM’s world knowledge and problem-solving ability. Each task consists of multiple-choice questions with four answer options. The authors modified the evaluation by shuffling the contents of the answer choices while keeping the labels themselves (A, B, C, D) in the same order. This isolates the effect of interest: any observed change in accuracy stems purely from reordering the answer contents, not from any other alteration to the prompt.
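To make the setup concrete, here is a minimal sketch of that shuffling step in Python, assuming each MMLU item is stored as a dict with question, choices, and answer-index fields (the field names and structure are illustrative, not taken from the paper's code):

```python
import random

def shuffle_choices(item, rng):
    """Permute the contents of one item's answer choices while the labels
    (A, B, C, D) stay attached to their positions; track the new gold index."""
    # item is assumed to look like:
    #   {"question": str, "choices": [str, str, str, str], "answer": int}
    order = list(range(len(item["choices"])))
    rng.shuffle(order)
    return {
        "question": item["question"],
        # Only the contents move; position 0 is still labeled "A", and so on.
        "choices": [item["choices"][i] for i in order],
        # The gold answer is wherever its original index ended up.
        "answer": order.index(item["answer"]),
    }

rng = random.Random(0)  # fixed seed so every model sees the same shuffle
example = {
    "question": "What is 2 + 2?",
    "choices": ["3", "4", "5", "6"],
    "answer": 1,
}
print(shuffle_choices(example, rng))
```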
The core metric employed by the authors, referred to as "Our Metric," measures a model's robustness by how consistently it answers the same questions correctly across different shuffles. Performance is averaged over multiple shuffles, reducing the influence of any single, possibly lucky, answer ordering on the reported accuracy.
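Since the paper's exact aggregation is not reproduced in this summary, the sketch below shows two natural ways to summarize per-question correctness across shuffles; the boolean results matrix and function name are assumptions made for illustration:

```python
import numpy as np

def shuffle_robustness(correct_by_shuffle):
    """Summarize correctness across answer-order shuffles.

    correct_by_shuffle: array of shape (n_shuffles, n_questions) where
    entry [s, q] is True if the model got question q right under shuffle s.
    """
    correct = np.asarray(correct_by_shuffle, dtype=bool)
    # Accuracy averaged over shuffles: damps the effect of any single ordering.
    mean_accuracy = correct.mean()
    # Stricter view: fraction of questions answered correctly under every shuffle.
    always_correct = correct.all(axis=0).mean()
    return mean_accuracy, always_correct

# Toy example: 3 shuffles x 5 questions.
results = [
    [True, True, False, True, False],
    [True, False, False, True, False],
    [True, True, False, True, True],
]
print(shuffle_robustness(results))  # -> (0.6, 0.4)
```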
Experimental Results
The paper evaluated ten state-of-the-art LLMs, including both base and instruction-tuned models. The models range in size from 7 billion to 70 billion parameters and are well represented on public leaderboards. Notable models tested include Llama-3, Yi-34B, and Falcon-40B.
All models exhibited a decrease in accuracy when answer contents were shuffled. For instance, the Llama-3-70B-instruct model showed a drop of 6.2%, while the Falcon-40B-instruct model experienced a more substantial decline of 27.2%. This degradation suggests that models are not robust to variations in answer ordering, a perturbation one might expect to be trivial for advanced LLMs.
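This summary does not specify whether these figures are absolute percentage points or drops relative to each model's original accuracy; the toy calculation below assumes the latter, with made-up accuracies rather than the paper's reported values:

```python
def relative_drop(acc_original, acc_shuffled):
    """Accuracy drop expressed as a percentage of the original accuracy."""
    return 100.0 * (acc_original - acc_shuffled) / acc_original

# Illustrative numbers only: an 80% -> 75% change is a 6.25% relative drop.
print(relative_drop(0.80, 0.75))
```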
An analysis of the performance drop across MMLU subcategories indicated that problem-solving tasks were particularly affected. On high school mathematics questions, for example, the Gemma-7B-instruct model's accuracy decreased by 42.9%. This suggests that LLMs may be exploiting cues in the original ordering of answer choices, a factor that should be irrelevant when assessing genuine understanding and capability.
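A breakdown like this can be produced by grouping per-question results by MMLU subject. The sketch below uses pandas on a hypothetical results table; the column names and values are assumptions for illustration, not the paper's data:

```python
import pandas as pd

# Hypothetical per-question results for one model; in practice this table
# would be built from predictions on the original and shuffled MMLU sets.
df = pd.DataFrame({
    "subject": ["high_school_mathematics", "high_school_mathematics",
                "world_religions", "world_religions"],
    "correct_original": [True, True, True, False],
    "correct_shuffled": [False, True, True, False],
})

by_subject = df.groupby("subject")[["correct_original", "correct_shuffled"]].mean()
# Relative drop per subject, under the same relative-drop assumption as above.
by_subject["drop_pct"] = 100 * (
    by_subject["correct_original"] - by_subject["correct_shuffled"]
) / by_subject["correct_original"]
print(by_subject)
```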
Theoretical and Practical Implications
These findings challenge the common practice of using averaged test accuracy as a reliable indicator of model performance. The paper calls for a reconsideration of how models are ranked and suggests incorporating robustness metrics that account for variations in answer ordering. Such adjustments could provide a more nuanced and accurate picture of an LLM’s capabilities.
Practically, the implications extend to the design of more robust evaluation frameworks that mitigate overfitting to specific benchmark constructs. The authors highlight that future leaderboards should include metrics that reflect a model's stability across multiple test iterations, thereby encouraging the development of truly robust LLMs.
Conclusions and Future Directions
The consistent decline in accuracy with answer shuffling highlights a critical vulnerability in current LLM evaluation practices. To address this, the paper introduces "Our Metric" to measure test-retest stability and advocates for its integration into standard evaluation procedures. Future work could explore further shuffling variations and additional perturbation types to comprehensively assess model robustness. This paper also underscores the importance of continuous refinement in benchmarking methodologies to ensure they evolve in tandem with advances in LLM capabilities.
By shedding light on the brittle nature of model performance under answer order variations, Gupta et al.'s work encourages the AI research community to adopt more rigorous and multi-faceted evaluation strategies, ultimately leading to the development of more reliable and generalizable AI systems.