Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? (2407.01992v1)

Published 2 Jul 2024 in cs.CL

Abstract: Recent work shows that LLMs can answer multiple-choice questions using only the choices, but does this mean that MCQA leaderboard rankings of LLMs are largely influenced by abilities in choices-only settings? To answer this, we use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA. While previous works build contrast sets via expensive human annotations or model-generated data which can be biased, we employ graph mining to extract contrast sets from existing MCQA datasets. We use our method on UnifiedQA, a group of six commonsense reasoning datasets with high choices-only accuracy, to build an 820-question contrast set. After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choice-only shortcuts when given both the question and choices. Thus, despite the susceptibility of MCQA to high choices-only accuracy, we argue that LLMs are not obtaining high ranks on MCQA leaderboards just due to their ability to exploit choices-only shortcuts.
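The abstract describes the contrast-set construction only at a high level (graph mining over an existing MCQA dataset); the paper itself should be consulted for the actual procedure. As a rough illustration of the general idea, the Python sketch below pairs questions whose choice lists can be exchanged while leaving exactly one correct answer, then swaps distractors within each pair. The `Question` type, the `is_plausible_distractor` test, and the greedy matching are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of graph-based contrast-set mining for MCQA.
# Not the paper's method: pairing criterion and matching are assumptions.
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Question:
    stem: str            # question text
    choices: list[str]   # answer options
    answer: int          # index of the gold choice


def is_plausible_distractor(choice: str, q: Question) -> bool:
    """Placeholder compatibility test (assumption): a real system might use
    embedding similarity or answer-type constraints. Here we only require
    the candidate choice to be new to q."""
    return choice not in q.choices


def compatible(q1: Question, q2: Question) -> bool:
    """Two questions can exchange choice lists if each one's gold answer
    would serve as a fresh, plausible distractor for the other."""
    return (is_plausible_distractor(q1.choices[q1.answer], q2)
            and is_plausible_distractor(q2.choices[q2.answer], q1))


def mine_contrast_pairs(questions: list[Question]) -> list[tuple[int, int]]:
    """Build a compatibility graph over questions, then greedily extract a
    matching so each question joins at most one contrast pair."""
    edges = [(i, j) for i, j in combinations(range(len(questions)), 2)
             if compatible(questions[i], questions[j])]
    used: set[int] = set()
    pairs: list[tuple[int, int]] = []
    for i, j in edges:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs


def make_contrast(q: Question, donor: Question) -> Question:
    """Keep q's stem and gold answer but replace its distractors with the
    donor's choices, breaking any choices-only shortcut in the original."""
    gold = q.choices[q.answer]
    new_choices = [gold] + [c for c in donor.choices if c != gold][:3]
    return Question(q.stem, new_choices, 0)
```

Evaluating an LLM on such swapped questions in a choices-only setting versus a full question-plus-choices setting would then separate shortcut use from genuine knowledge, which is the comparison the paper reports across 12 LLMs.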

Authors (2)
  1. Nishant Balepur (14 papers)
  2. Rachel Rudinger (46 papers)