LLMs May Perform MCQA by Selecting the Least Incorrect Option (2402.01349v3)

Published 2 Feb 2024 in cs.CL and cs.AI

Abstract: In the field of NLP, LLMs have markedly enhanced performance across a variety of tasks. However, the comprehensive evaluation of LLMs remains an inevitable challenge for the community. Recently, the adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction. However, concerns regarding the robustness of this evaluative method persist. Building upon previous discussions on the issue of "variability", we reveal an additional dimension of concern: LLMs may perform MCQA by selecting the least incorrect option rather than distinctly correct. This observation suggests that LLMs might regard multiple options as correct, which could undermine the reliability of MCQA as a metric for evaluating LLMs. To address this challenge, we introduce an enhanced dataset augmentation method for MCQA, termed MCQA+, to provide a more accurate reflection of the model performance, thereby highlighting the necessity for more sophisticated evaluation mechanisms in the assessment of LLM capabilities.

Evaluation of LLMs Through MCQA: A Critical Examination

This paper, produced by researchers at the Harbin Institute of Technology, offers a careful critique of Multiple Choice Question Answering (MCQA) as a benchmark for evaluating LLMs. At its core is a series of experiments designed to expose the inadequacies of relying on MCQA as the sole metric for assessing the true capabilities of LLMs.

Examination of MCQA as a Benchmark

The paper begins by acknowledging the widespread adoption of LLMs such as GPT-3, LLaMA, and ChatGPT, and highlights the challenges of evaluating these models accurately. Traditional metrics such as BLEU and ROUGE, while effective for surface-level text comparison, often fail to capture the nuanced understanding required for tasks like commonsense reasoning; this has motivated MCQA-based benchmarks such as MMLU and BIG-bench.

The researchers note that MCQA tasks typically pair a single question with several answer options. The evaluation assumes that a capable model will consistently choose the correct option regardless of the order in which the options are presented. However, the authors present experimental evidence that when the options are re-ordered, LLMs often become inconsistent in selecting the correct answer, calling into question the reliability of MCQA as a fixed benchmark.
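
As a concrete illustration of such a consistency check, the sketch below permutes the options of a single item and measures how often the model's choice maps back to the gold answer. The ask_model(question, options) -> index interface is a hypothetical stand-in for whatever prompting-and-parsing wrapper an evaluation harness provides; it is not code from the paper.

```python
import itertools

def consistency_under_reordering(ask_model, question, options, correct_idx):
    """Ask the same MCQA item under every permutation of its options and
    report how often the model still lands on the underlying gold answer."""
    perms = list(itertools.permutations(range(len(options))))
    hits = 0
    for perm in perms:
        shuffled = [options[i] for i in perm]
        choice = ask_model(question, shuffled)   # index chosen in the shuffled list
        if perm[choice] == correct_idx:          # map the choice back to the original option
            hits += 1
    return hits / len(perms)                     # 1.0 means fully order-invariant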

Limitations and Variability in MCQA

Through a comprehensive set of experiments on datasets such as MMLU and MedMCQA, the paper underscores how LLM performance varies when the order and number of answer choices are altered. A notable finding is an apparent "overfitting" to the conventional four-option format: accuracy shifts markedly when the option count changes, exposing a potential gap between what MCQA measures and the models' underlying knowledge or reasoning.
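
A minimal sketch of this option-count manipulation, assuming each item stores the gold answer and its distractors separately (the names here are illustrative, not the paper's code):

```python
import random

def with_k_distractors(question, answer, distractors, k, seed=0):
    """Build a variant of an MCQA item with k randomly sampled distractors,
    i.e. k + 1 options in total, to probe sensitivity to the option count."""
    rng = random.Random(seed)
    options = rng.sample(distractors, k) + [answer]
    rng.shuffle(options)
    return {"question": question, "options": options, "label": options.index(answer)}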

The paper argues that LLMs may treat several options as plausible and simply select the most plausible one, rather than identifying an exclusively correct answer. Further probing with variations such as True-or-False reformulations of the same questions reveals that LLMs often falter when the familiar multiple-choice format is modified or the reasoning demands grow more complex.
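
The True-or-False reformulation can be sketched as follows; the exact prompt template used in the paper may differ, so the wording here is only illustrative.

```python
def to_true_false_probes(question, options, correct_idx):
    """Recast one multiple-choice item as independent True-or-False judgments.
    A model that regards only the gold option as correct should answer 'True'
    exactly once across the derived probes."""
    probes = []
    for i, option in enumerate(options):
        prompt = (f"Question: {question}\n"
                  f"Proposed answer: {option}\n"
                  f"Is the proposed answer correct? Reply True or False.")
        probes.append({"prompt": prompt, "label": i == correct_idx})
    return probes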

Introduction of MCQA+ as an Improved Benchmark

To address these challenges, the authors propose an augmented dataset construction termed MCQA+, aimed at delivering a more nuanced evaluation. MCQA+ supplements each original question with re-ordered, expanded, and True-or-False formatted variants to scrutinize LLM capabilities more thoroughly. Empirically, performance on MCQA+ is generally lower than on the original datasets, suggesting that conventional MCQA scores may be inflated by limitations of the test design.
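
One possible way to score such an augmented set, purely as an illustration and not necessarily the paper's exact metric, is to count an original question as solved only if the model answers every variant derived from it correctly:

```python
from collections import defaultdict

def strict_item_accuracy(ask_model, variants):
    """Stricter aggregation over an MCQA+-style set: an original question is
    counted as solved only if all of its derived variants are answered correctly.
    Each variant is assumed to carry a 'source_id' linking it to its original item,
    plus 'question', 'options', and 'label'; ask_model returns the chosen index."""
    solved = defaultdict(lambda: True)
    for v in variants:
        prediction = ask_model(v["question"], v["options"])
        solved[v["source_id"]] &= (prediction == v["label"])
    return sum(solved.values()) / len(solved)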

Implications and Future Directions

The critique, together with the proposed MCQA+, offers essential insight into the more nuanced performance metrics needed to evaluate LLMs meaningfully. The introduction of MCQA+ reflects an effort to refine LLM evaluation methodologies so that they capture true model capabilities rather than rewarding optimization for existing benchmarks.

In terms of practical implications, better evaluation strategies foster the development of more robust and adaptable NLP systems. Refined benchmark metrics give clearer signals for building future LLMs whose understanding and reasoning more closely mirror human cognitive abilities.

Overall, this critical examination of MCQA and the introduction of MCQA+ signify an incremental step towards refining the reliability and robustness of LLM evaluations, paving the way for more insightful and rigorous model assessments in the future. The work emphasizes the necessity for continuous examination and evolution of benchmarks, reflecting the ongoing growth and complexity of artificial intelligence.

Authors (6)
  1. Haochun Wang (17 papers)
  2. Sendong Zhao (31 papers)
  3. Zewen Qiang (7 papers)
  4. Bing Qin (186 papers)
  5. Ting Liu (329 papers)
  6. Nuwa Xi (11 papers)
Citations (7)