LLMs May Perform MCQA by Selecting the Least Incorrect Option (2402.01349v3)

Published 2 Feb 2024 in cs.CL and cs.AI

Abstract: In the field of NLP, LLMs have markedly enhanced performance across a variety of tasks. However, the comprehensive evaluation of LLMs remains an inevitable challenge for the community. Recently, the adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction. However, concerns regarding the robustness of this evaluative method persist. Building upon previous discussions on the issue of "variability", we reveal an additional dimension of concern: LLMs may perform MCQA by selecting the least incorrect option rather than distinctly correct. This observation suggests that LLMs might regard multiple options as correct, which could undermine the reliability of MCQA as a metric for evaluating LLMs. To address this challenge, we introduce an enhanced dataset augmentation method for MCQA, termed MCQA+, to provide a more accurate reflection of the model performance, thereby highlighting the necessity for more sophisticated evaluation mechanisms in the assessment of LLM capabilities.

Evaluation of LLMs Through MCQA: A Critical Examination

This paper, produced by researchers at the Harbin Institute of Technology, offers a careful critique of Multiple Choice Question Answering (MCQA) as a benchmark for evaluating LLMs. At its core is a series of experiments designed to expose the inadequacies of relying on MCQA as the sole metric for assessing the true capabilities of LLMs.

Examination of MCQA as a Benchmark

The paper begins by acknowledging the widespread adoption of LLMs such as GPT-3, LLaMA, and ChatGPT, and highlights the challenges of evaluating these models accurately. Traditional metrics such as BLEU and ROUGE, while effective for surface-level text comparison, often fail to capture the nuanced understanding required for tasks like commonsense reasoning; this has motivated MCQA-based benchmarks such as MMLU and BIG-bench.

The researchers note that MCQA tasks typically pair a single question with several answer options. The evaluation assumes that a capable model will consistently choose the correct option regardless of the order in which the options are presented. However, the authors present experimental evidence that when the options are re-ordered, LLMs often become inconsistent in selecting the correct answer, calling into question the reliability of MCQA as a fixed benchmark.
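
As a concrete illustration of such a consistency check, the sketch below permutes the options of a single item and measures how often the model's choice maps back to the gold answer. The ask_model(question, options) -> index interface is a hypothetical stand-in for whatever prompting-and-parsing wrapper an evaluation harness provides; it is not code from the paper.

```python
import itertools

def consistency_under_reordering(ask_model, question, options, correct_idx):
    """Ask the same MCQA item under every permutation of its options and
    report how often the model still lands on the underlying gold answer."""
    perms = list(itertools.permutations(range(len(options))))
    hits = 0
    for perm in perms:
        shuffled = [options[i] for i in perm]
        choice = ask_model(question, shuffled)   # index chosen in the shuffled list
        if perm[choice] == correct_idx:          # map the choice back to the original option
            hits += 1
    return hits / len(perms)                     # 1.0 means fully order-invariant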

Limitations and Variability in MCQA

Through a comprehensive set of experiments on datasets such as MMLU and MedMCQA, the paper underscores how LLM performance varies when the order and number of answer choices are altered. A notable finding is an apparent "overfitting" to the conventional four-option format: accuracy shifts markedly when the option count changes, exposing a potential gap between what MCQA measures and the models' underlying knowledge or reasoning.
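
A minimal sketch of this option-count manipulation, assuming each item stores the gold answer and its distractors separately (the names here are illustrative, not the paper's code):

```python
import random

def with_k_distractors(question, answer, distractors, k, seed=0):
    """Build a variant of an MCQA item with k randomly sampled distractors,
    i.e. k + 1 options in total, to probe sensitivity to the option count."""
    rng = random.Random(seed)
    options = rng.sample(distractors, k) + [answer]
    rng.shuffle(options)
    return {"question": question, "options": options, "label": options.index(answer)}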

The paper argues that LLMs may treat several options as plausible and simply select the most plausible one, rather than identifying an exclusively correct answer. Further probing with variations such as True-or-False reformulations of the same questions reveals that LLMs often falter when the familiar multiple-choice format is modified or the reasoning demands grow more complex.
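
The True-or-False reformulation can be sketched as follows; the exact prompt template used in the paper may differ, so the wording here is only illustrative.

```python
def to_true_false_probes(question, options, correct_idx):
    """Recast one multiple-choice item as independent True-or-False judgments.
    A model that regards only the gold option as correct should answer 'True'
    exactly once across the derived probes."""
    probes = []
    for i, option in enumerate(options):
        prompt = (f"Question: {question}\n"
                  f"Proposed answer: {option}\n"
                  f"Is the proposed answer correct? Reply True or False.")
        probes.append({"prompt": prompt, "label": i == correct_idx})
    return probes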

Introduction of MCQA+ as an Improved Benchmark

To address these challenges, the authors propose an augmented dataset construction termed MCQA+, aimed at delivering a more nuanced evaluation. MCQA+ supplements each original question with re-ordered, expanded, and True-or-False formatted variants to scrutinize LLM capabilities more thoroughly. Empirically, performance on MCQA+ is generally lower than on the original datasets, suggesting that conventional MCQA scores may be inflated by limitations of the test design.
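
One possible way to score such an augmented set, purely as an illustration and not necessarily the paper's exact metric, is to count an original question as solved only if the model answers every variant derived from it correctly:

```python
from collections import defaultdict

def strict_item_accuracy(ask_model, variants):
    """Stricter aggregation over an MCQA+-style set: an original question is
    counted as solved only if all of its derived variants are answered correctly.
    Each variant is assumed to carry a 'source_id' linking it to its original item,
    plus 'question', 'options', and 'label'; ask_model returns the chosen index."""
    solved = defaultdict(lambda: True)
    for v in variants:
        prediction = ask_model(v["question"], v["options"])
        solved[v["source_id"]] &= (prediction == v["label"])
    return sum(solved.values()) / len(solved)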

Implications and Future Directions

The critique, together with the proposed MCQA+, offers essential insight into the more nuanced performance metrics needed to evaluate LLMs meaningfully. The introduction of MCQA+ reflects an effort to refine LLM evaluation methodologies so that they capture true model capabilities rather than rewarding optimization for existing benchmarks.

In terms of practical implications, better evaluation strategies foster the development of more robust and adaptable NLP systems. Refined benchmark metrics give clearer signals for building future LLMs whose understanding and reasoning more closely mirror human cognitive abilities.

Overall, this critical examination of MCQA and the introduction of MCQA+ signify an incremental step towards refining the reliability and robustness of LLM evaluations, paving the way for more insightful and rigorous model assessments in the future. The work emphasizes the necessity for continuous examination and evolution of benchmarks, reflecting the ongoing growth and complexity of artificial intelligence.

Authors (6)
  1. Haochun Wang (17 papers)
  2. Sendong Zhao (31 papers)
  3. Zewen Qiang (7 papers)
  4. Bing Qin (186 papers)
  5. Ting Liu (329 papers)
  6. Nuwa Xi (11 papers)
Citations (7)