Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? (2402.12483v2)

Published 19 Feb 2024 in cs.CL

Abstract: Multiple-choice question answering (MCQA) is often used to evaluate LLMs. To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices. In three MCQA datasets and four LLMs, this prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. To help explain this behavior, we conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that the choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs use the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from choices, and surprisingly can sometimes even match the original question. Inferring the original question is an impressive reasoning strategy, but it cannot fully explain the high choices-only accuracy of LLMs in MCQA. Thus, while LLMs are not fully incapable of reasoning in MCQA, we still advocate for the use of stronger baselines in MCQA benchmarks, the design of robust MCQA datasets for fair evaluations, and further efforts to explain LLM decision-making.

Insights into LLM Performance on MCQA Without Questions

The paper, "Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?", presents an intriguing exploration into the capabilities of LLMs in multiple-choice question answering (MCQA). This paper critically examines a commonly used evaluation framework for LLMs, investigating whether these models can succeed in MCQA tasks even when deprived of the question prompts.

Key Findings

The researchers conducted experiments on three prominent MCQA datasets (ARC, MMLU, and HellaSwag) with four LLMs (LLaMA-2, Falcon, Phi-2, and Mixtral). Remarkably, performance with "choices-only" prompts, in which only the answer options are provided, surpassed the majority baseline in 11 of the 12 dataset-model combinations, with accuracy gains of up to 0.33. This suggests that LLMs may exploit signals in the choices themselves when making decisions.
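
To make the experimental contrast concrete, here is a minimal sketch (with hypothetical prompt wording and helper names, not the paper's exact templates or code) of how a standard MCQA prompt differs from a choices-only prompt, and how a majority baseline over gold labels is computed.

```python
# Minimal sketch: full vs. choices-only MCQA prompts and a majority baseline.
# Prompt wording and function names are illustrative assumptions, not the
# paper's exact setup.
from collections import Counter

LETTERS = ["A", "B", "C", "D"]

def full_prompt(question: str, choices: list[str]) -> str:
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def choices_only_prompt(choices: list[str]) -> str:
    # The question is withheld; the model sees only the answer options.
    lines = ["Choose the most likely correct option."]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def majority_baseline(gold_labels: list[str]) -> float:
    # Accuracy obtained by always predicting the most frequent gold letter.
    top_count = Counter(gold_labels).most_common(1)[0][1]
    return top_count / len(gold_labels)

if __name__ == "__main__":
    question = "Which gas do plants primarily absorb for photosynthesis?"
    options = ["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"]
    print(full_prompt(question, options))
    print()
    print(choices_only_prompt(options))
    print(f"Majority baseline on a toy label set: {majority_baseline(['B', 'B', 'A', 'C']):.2f}")
```

A choices-only run scores a model's letter predictions on such prompts and compares the resulting accuracy against this majority baseline.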

Three primary hypotheses were explored to explain these results:

  1. Memorization: The paper found no substantial evidence that the high choices-only accuracy stems from memorization of previously seen examples alone. Models given prompts stripped of discriminative information did not achieve notable accuracy, undercutting memorization as the primary explanation.
  2. Choice Dynamics: An examination of how models use individual priors (favoring certain words or patterns) and collective dynamics (the relationships among all options) showed that individual priors alone do not account for the observed accuracy. This suggests that LLMs draw on the group dynamics of the choice set when selecting answers.
  3. Abductive Question Inference (AQI): LLMs showed some ability to infer a plausible question from the choices, sometimes closely resembling the original question, indicating a capacity for abductive reasoning. When models generated and then answered their own inferred questions, accuracy was on par with, and in some cases exceeded, the choices-only results (a rough sketch of this two-step probe follows this list).
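
The AQI probe referenced in item 3 can be sketched as a two-step prompting procedure. The snippet below is an illustration under assumed prompt templates; `generate` stands in for any LLM completion call and is not the paper's actual code.

```python
# Illustrative two-step abductive question inference (AQI) probe.
# `generate` is a placeholder for any LLM completion function; the prompt
# wording here is an assumption, not the paper's exact template.
from typing import Callable, List

LETTERS = ["A", "B", "C", "D"]

def format_options(choices: List[str]) -> str:
    return "\n".join(f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices))

def infer_question(generate: Callable[[str], str], choices: List[str]) -> str:
    # Step 1: ask the model to abduce a question that fits the choices.
    prompt = (
        "These are the answer options to a multiple-choice question.\n"
        f"{format_options(choices)}\n"
        "Write a question that these options could plausibly answer.\n"
        "Question:"
    )
    return generate(prompt).strip()

def answer_inferred_question(generate: Callable[[str], str],
                             inferred_question: str,
                             choices: List[str]) -> str:
    # Step 2: answer the self-inferred question with the original options.
    prompt = f"Question: {inferred_question}\n{format_options(choices)}\nAnswer:"
    completion = generate(prompt).strip()
    # Keep only a leading answer letter as the prediction, if one is present.
    return completion[:1] if completion[:1] in LETTERS else completion
```

Accuracy on answers to self-inferred questions can then be compared with choices-only accuracy on the same items to gauge how much of the latter this strategy explains.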

Implications and Future Directions

The results have significant implications for the design and evaluation of MCQA datasets and LLMs. Current benchmarks may inadvertently measure capabilities they were not designed to assess, such as exploiting dataset artifacts rather than demonstrating comprehension or reasoning. This calls for stronger baselines and more robust dataset-creation protocols that mitigate artifact exploitation.

Moreover, the findings highlight the importance of understanding how LLMs make decisions, especially in partial-input settings. The paper's black-box analysis framework should encourage further investigation into whether more sophisticated reasoning abilities can be elicited or detected in LLMs with other strategies.

Conclusion

Overall, this paper provides a nuanced assessment of how artifacts and reasoning interact in LLMs' performance on MCQA tasks. It emphasizes the need for transparency in LLM evaluations and prompts a reevaluation of methodologies to better align with the intended assessment of model capabilities. Future research should continue to examine these dynamics, ideally leading to models that achieve more consistent and interpretable performance across varied MCQA settings.

Authors
  1. Nishant Balepur
  2. Abhilasha Ravichander
  3. Rachel Rudinger