Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think (2404.08382v2)

Published 12 Apr 2024 in cs.CL and cs.AI

Abstract: Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate a model's response is to rank the candidate answers by the log probability of the first token prediction. An alternative is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities often do not match the text answers of instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that text answers are more robust to question perturbations than first token probabilities when the first token answers mismatch the text answers, and that this robustness gap widens as the mismatch rate grows. Once the mismatch exceeds 50%, the text answer is more robust to option order changes than first token probabilities debiased with state-of-the-art methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.

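To make the contrast concrete, below is a minimal sketch of the two evaluation strategies the abstract compares, written with Hugging Face transformers. The model name, prompt format, and answer parser are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of the two MCQ evaluation strategies compared in the paper.
# Model name, prompt format, and the answer parser are illustrative
# assumptions, not the authors' exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)
labels = ["A", "B", "C", "D"]
inputs = tokenizer(prompt, return_tensors="pt")

# Strategy 1: first-token probability evaluation. Rank the option letters
# by the log probability the model assigns to each as the next token.
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
log_probs = torch.log_softmax(next_token_logits, dim=-1)
# Token lookup is tokenizer-dependent; here we take the first sub-token of " A" etc.
label_ids = [tokenizer.encode(f" {l}", add_special_tokens=False)[0] for l in labels]
first_token_answer = labels[int(torch.argmax(log_probs[label_ids]))]

# Strategy 2: text answer evaluation. Generate a free-form continuation and
# parse the option letter out of the text (a naive parser; real evaluation
# pipelines parse more carefully).
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
continuation = tokenizer.decode(
    out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
text_answer = next((l for l in labels if l in continuation), None)

print(f"first-token answer: {first_token_answer}, text answer: {text_answer}")
# Cases where the two disagree are the "mismatch" the paper studies; there,
# the text answer proves more robust to option-order perturbations.
```

Robustness here refers to the parsed answer staying the same when the options are permuted or the question is rephrased; PriDe-style debiasing instead corrects the first-token distribution for the model's positional prior over option letters.
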
Authors (5)
  1. Xinpeng Wang (34 papers)
  2. Chengzhi Hu (5 papers)
  3. Bolei Ma (18 papers)
  4. Paul Röttger (37 papers)
  5. Barbara Plank (130 papers)
Citations (2)
