Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? (2402.12483v2)

Published 19 Feb 2024 in cs.CL

Abstract: Multiple-choice question answering (MCQA) is often used to evaluate LLMs. To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices. In three MCQA datasets and four LLMs, this prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. To help explain this behavior, we conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that the choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs use the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from choices, and surprisingly can sometimes even match the original question. Inferring the original question is an impressive reasoning strategy, but it cannot fully explain the high choices-only accuracy of LLMs in MCQA. Thus, while LLMs are not fully incapable of reasoning in MCQA, we still advocate for the use of stronger baselines in MCQA benchmarks, the design of robust MCQA datasets for fair evaluations, and further efforts to explain LLM decision-making.

Insights into LLM Performance on MCQA Without Questions

The paper, "Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?", presents an intriguing exploration into the capabilities of LLMs in multiple-choice question answering (MCQA). This paper critically examines a commonly used evaluation framework for LLMs, investigating whether these models can succeed in MCQA tasks even when deprived of the question prompts.

Key Findings

The researchers conducted experiments on three prominent MCQA datasets (ARC, MMLU, and HellaSwag) with four LLMs (LLaMA-2, Falcon, Phi-2, and Mixtral). Remarkably, performance with "choices-only" prompts, in which only the answer options are provided, surpassed the majority baseline in 11 of the 12 dataset-model combinations, with accuracy gains of up to 0.33. This suggests that LLMs may exploit signals in the choices themselves when making decisions.
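
To make the experimental contrast concrete, here is a minimal sketch (with hypothetical prompt wording and helper names, not the paper's exact templates or code) of how a standard MCQA prompt differs from a choices-only prompt, and how a majority baseline over gold labels is computed.

```python
# Minimal sketch: full vs. choices-only MCQA prompts and a majority baseline.
# Prompt wording and function names are illustrative assumptions, not the
# paper's exact setup.
from collections import Counter

LETTERS = ["A", "B", "C", "D"]

def full_prompt(question: str, choices: list[str]) -> str:
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def choices_only_prompt(choices: list[str]) -> str:
    # The question is withheld; the model sees only the answer options.
    lines = ["Choose the most likely correct option."]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def majority_baseline(gold_labels: list[str]) -> float:
    # Accuracy obtained by always predicting the most frequent gold letter.
    top_count = Counter(gold_labels).most_common(1)[0][1]
    return top_count / len(gold_labels)

if __name__ == "__main__":
    question = "Which gas do plants primarily absorb for photosynthesis?"
    options = ["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"]
    print(full_prompt(question, options))
    print()
    print(choices_only_prompt(options))
    print(f"Majority baseline on a toy label set: {majority_baseline(['B', 'B', 'A', 'C']):.2f}")
```

A choices-only run scores a model's letter predictions on such prompts and compares the resulting accuracy against this majority baseline.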

Three primary hypotheses were explored to explain these results:

  1. Memorization: The paper found no substantial evidence that the high choices-only accuracy stems from memorization of previously seen examples alone. Models given prompts stripped of discriminative information did not achieve notable accuracy, undercutting memorization as the primary explanation.
  2. Choice Dynamics: An examination of how models use individual priors (favoring certain words or patterns) and collective dynamics (the relationships among all options) showed that individual priors alone do not account for the observed accuracy. This suggests that LLMs draw on the group dynamics of the choice set when selecting answers.
  3. Abductive Question Inference (AQI): LLMs showed some ability to infer a plausible question from the choices, sometimes closely resembling the original question, indicating a capacity for abductive reasoning. When models generated and then answered their own inferred questions, accuracy was on par with, and in some cases exceeded, the choices-only results (a rough sketch of this two-step probe follows this list).
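
The AQI probe referenced in item 3 can be sketched as a two-step prompting procedure. The snippet below is an illustration under assumed prompt templates; `generate` stands in for any LLM completion call and is not the paper's actual code.

```python
# Illustrative two-step abductive question inference (AQI) probe.
# `generate` is a placeholder for any LLM completion function; the prompt
# wording here is an assumption, not the paper's exact template.
from typing import Callable, List

LETTERS = ["A", "B", "C", "D"]

def format_options(choices: List[str]) -> str:
    return "\n".join(f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices))

def infer_question(generate: Callable[[str], str], choices: List[str]) -> str:
    # Step 1: ask the model to abduce a question that fits the choices.
    prompt = (
        "These are the answer options to a multiple-choice question.\n"
        f"{format_options(choices)}\n"
        "Write a question that these options could plausibly answer.\n"
        "Question:"
    )
    return generate(prompt).strip()

def answer_inferred_question(generate: Callable[[str], str],
                             inferred_question: str,
                             choices: List[str]) -> str:
    # Step 2: answer the self-inferred question with the original options.
    prompt = f"Question: {inferred_question}\n{format_options(choices)}\nAnswer:"
    completion = generate(prompt).strip()
    # Keep only a leading answer letter as the prediction, if one is present.
    return completion[:1] if completion[:1] in LETTERS else completion
```

Accuracy on answers to self-inferred questions can then be compared with choices-only accuracy on the same items to gauge how much of the latter this strategy explains.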

Implications and Future Directions

The results have significant implications for the design and evaluation of MCQA datasets and LLMs. Current benchmarks may inadvertently measure capabilities they were not designed to assess, such as exploiting dataset artifacts rather than demonstrating comprehension or reasoning. This calls for stronger baselines and more robust dataset-creation protocols that mitigate artifact exploitation.

Moreover, the findings highlight the importance of understanding how LLMs make decisions, especially in partial-input settings. The paper's black-box analysis framework should encourage further investigation into whether more sophisticated reasoning abilities can be elicited or detected in LLMs with other strategies.

Conclusion

Overall, this paper provides a nuanced assessment of how artifacts and reasoning interact in LLMs' performance on MCQA tasks. It emphasizes the need for transparency in LLM evaluations and prompts a reevaluation of methodologies to better align with the intended assessment of model capabilities. Future research should continue to examine these dynamics, ideally leading to models that achieve more consistent and interpretable performance across varied MCQA settings.

Authors
  1. Nishant Balepur
  2. Abhilasha Ravichander
  3. Rachel Rudinger