- The paper shows that answer matching aligns closely with human judgment (e.g., Scott's π = 0.97 on MATH), whereas MCQ evaluation relies on statistical shortcuts and aligns poorly.
- It introduces a generative evaluation method where free-form responses are paired with a binary matcher to assess answer accuracy.
- Empirical results reveal that answer matching shifts model rankings and reduces benchmark saturation while maintaining cost efficiency.
Answer Matching as a Superior Paradigm for LLM Evaluation
The paper "Answer Matching Outperforms Multiple Choice for LLM Evaluation" (2507.02856) presents a comprehensive critique of multiple choice (MCQ) benchmarks as the dominant paradigm for evaluating LLMs, and introduces answer matching as a more valid, scalable, and cost-effective alternative for generative evaluation. The authors provide both theoretical and empirical analyses, demonstrating that MCQ-based evaluation is fundamentally limited by its discriminative nature, which allows models to exploit statistical shortcuts unrelated to genuine generative ability. In contrast, answer matching—where a model generates a free-form response and a separate model (the matcher) determines if it matches a reference answer—aligns much more closely with human judgment and the actual generative use cases of LLMs.
Discriminative Shortcuts in Multiple Choice Evaluation
The paper formalizes the distinction between generative and discriminative evaluation. In MCQ settings, models are tasked with selecting the correct answer from a set of options, which reduces the evaluation to a discrimination problem. The authors empirically show that models can achieve high accuracy on MCQ benchmarks even when provided only with the answer choices and not the question itself. For instance, a fine-tuned Qwen3-4B model achieves 83% accuracy on TruthfulQA-v2 and 93% on GoldenSwag using only the answer choices. This exposes the prevalence of "choice-only" shortcuts, where statistical artifacts in the answer set allow models to bypass genuine reasoning or knowledge retrieval.
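To make the choice-only probe concrete, the sketch below scores a benchmark while withholding every question. This is a zero-shot prompting variant of the probe rather than the paper's setup (the authors fine-tune a model on choices-only inputs); the `llm` callable, prompt wording, and data layout are assumptions for illustration.

```python
import random
from string import ascii_uppercase as LETTERS
from typing import Callable, List, Sequence, Tuple

def choices_only_prompt(choices: Sequence[str]) -> str:
    """Build a probe prompt that shows only the answer options; the question is withheld."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return ("The question has been withheld. Based only on the options below, "
            "guess which one is the correct answer. Reply with a single letter.\n" + options)

def choice_only_accuracy(llm: Callable[[str], str],
                         mcq_items: List[Tuple[Sequence[str], int]]) -> float:
    """Shortcut probe: accuracy reached without ever seeing the questions.
    Accuracy far above chance means the options alone leak the answer."""
    correct = 0
    for choices, answer_idx in mcq_items:
        reply = llm(choices_only_prompt(choices)).strip().upper()
        if reply and reply[0] in LETTERS[:len(choices)]:
            pred = LETTERS.index(reply[0])
        else:
            pred = random.randrange(len(choices))  # unparseable reply: fall back to a random guess
        correct += int(pred == answer_idx)
    return correct / len(mcq_items)
```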
The authors further demonstrate that attempts to mitigate these shortcuts—such as reducing the number of choices or generating distractors with LLMs—are insufficient. Even with more choices or LLM-generated distractors, shortcut accuracy remains high. This issue is not confined to language tasks; similar effects are observed in multimodal benchmarks like MMMU-Pro.
Answer Matching: Methodology and Empirical Validation
Answer matching is proposed as a generative evaluation protocol: the model is prompted with the question alone, generates a free-form answer, and a separate matcher model determines if the response matches a reference answer. The matcher is provided with the question, the reference answer, and the model's response, and outputs a binary judgment.
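A minimal sketch of this two-step protocol follows, assuming a generic `llm(model_name, prompt)` callable supplied by the reader; the matcher prompt wording and model names are illustrative and are not the templates released with the paper.

```python
from typing import Callable, Iterable, List, Tuple

# Illustrative matcher prompt; the paper releases its own templates -- this is not them.
MATCHER_PROMPT = (
    "You are grading a model's response against a reference answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate response: {response}\n"
    "Does the candidate response express the same answer as the reference? "
    "Reply with exactly one word: yes or no."
)

def match_one(llm: Callable[[str, str], str], question: str, reference: str,
              candidate_model: str, matcher_model: str) -> bool:
    """One round of answer matching: generate a free-form answer, then grade it."""
    # Step 1: generative evaluation -- the candidate sees only the question.
    response = llm(candidate_model, question)
    # Step 2: the matcher sees question, reference, and response, and outputs a binary verdict.
    verdict = llm(matcher_model, MATCHER_PROMPT.format(
        question=question, reference=reference, response=response))
    return verdict.strip().lower().startswith("yes")

def answer_matching_accuracy(llm: Callable[[str, str], str],
                             dataset: Iterable[Tuple[str, str]],
                             candidate_model: str, matcher_model: str) -> float:
    """dataset: iterable of (question, reference_answer) pairs."""
    verdicts: List[bool] = [match_one(llm, q, ref, candidate_model, matcher_model)
                            for q, ref in dataset]
    return sum(verdicts) / len(verdicts)
```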
The authors conduct extensive empirical studies on MATH, MMLU-Pro, and GPQA-Diamond, using both automated and human-annotated ground truth. Key findings include:
- Alignment with Human Judgment: Answer matching using recent LLMs (even small models like Qwen3-4B) achieves near-perfect agreement with human annotators, as measured by Scott's π (e.g., π = 0.97 on MATH). In contrast, MCQ evaluation aligns poorly (π = 0.26), and LLM-as-a-judge without reference answers also underperforms. (A sketch of the Scott's π computation follows this list.)
- Model Rankings Change: The choice of evaluation protocol significantly affects model rankings. Models that perform well on MCQ benchmarks may drop in rank when evaluated via answer matching, and vice versa. This has direct implications for model selection and deployment.
- Benchmark Saturation is Illusory: Benchmarks that appear saturated under MCQ evaluation reveal substantial headroom when repurposed for answer matching. For example, top models' accuracy drops by over 20% on GPQA-Diamond when evaluated generatively.
- Cost and Scalability: Contrary to concerns about the expense of LLM-based evaluation, answer matching is shown to be as cheap or cheaper than MCQ evaluation. This is because free-form responses are typically shorter, and the matcher model's inference cost is marginal compared to generation.
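For reference, Scott's π corrects observed agreement for chance using the pooled label distribution of both raters, π = (p_o − p_e) / (1 − p_e). The sketch below implements the standard formula on toy data; the example verdicts are invented for illustration only.

```python
from collections import Counter
from typing import Hashable, Sequence

def scotts_pi(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Scott's pi: chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and len(rater_a) > 0
    n = len(rater_a)
    # Observed agreement: fraction of items where the two raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: squared pooled marginal proportions, summed over labels.
    pooled = Counter(rater_a) + Counter(rater_b)
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    if p_e == 1.0:  # degenerate case: both raters always give the same single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Toy example: human ground truth vs. a matcher's verdicts (1 = correct, 0 = incorrect).
human   = [1, 1, 0, 1, 0, 1, 1, 0]
matcher = [1, 1, 0, 1, 0, 1, 0, 0]
print(scotts_pi(human, matcher))  # ~0.75 for this toy data
```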
Implementation Considerations
The paper provides practical guidance for implementing answer matching:
- Prompt Engineering: The matcher model should be prompted with the question, reference answer, and candidate response, and instructed to output a binary match judgment. The authors release prompt templates and code for reproducibility.
- Dataset Preparation: Not all MCQ questions are suitable for answer matching, as some rely on the answer choices for specificity. Filtering or rewriting is necessary to ensure questions have unique, unambiguous answers.
- Model Selection: Recent small and mid-sized LLMs are sufficient as matchers, reducing compute requirements and increasing reproducibility. Open-weight models like Qwen3-4B and Llama-4-Scout are recommended.
- Robustness: Rankings are stable across different matcher models, and answer matching is less susceptible to self-preference bias than LLM-as-a-judge protocols.
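One simple way to operationalize this robustness check is to grade the same free-form responses with two different matcher models and verify that per-question verdicts and the induced model rankings agree. The sketch below uses purely illustrative scores and hypothetical model names, not results from the paper.

```python
from typing import Dict, List

def matcher_agreement(verdicts_a: List[bool], verdicts_b: List[bool]) -> float:
    """Per-question agreement rate between two matcher models' binary verdicts."""
    assert len(verdicts_a) == len(verdicts_b) and verdicts_a
    return sum(a == b for a, b in zip(verdicts_a, verdicts_b)) / len(verdicts_a)

def ranking(scores: Dict[str, float]) -> List[str]:
    """Candidate models ordered by their answer-matching accuracy, best first."""
    return sorted(scores, key=scores.get, reverse=True)

# Accuracies for the same candidates under two different matchers should induce
# (near-)identical rankings. Numbers are purely illustrative.
scores_matcher_small = {"model-a": 0.62, "model-b": 0.58, "model-c": 0.41}
scores_matcher_large = {"model-a": 0.64, "model-b": 0.57, "model-c": 0.43}
assert ranking(scores_matcher_small) == ranking(scores_matcher_large)
```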
Limitations and Open Challenges
The authors acknowledge several limitations:
- Applicability: Answer matching is best suited for tasks with a unique correct answer or a well-defined set of paraphrases. It is less effective for tasks with multiple valid answers (e.g., summarization, translation) or where equivalence is hard to define (e.g., proofs, code).
- Gaming the Matcher: The robustness of matcher models to adversarial responses or optimization pressure remains an open question. Stronger matchers or adversarial evaluation may be necessary for high-stakes settings.
- Dataset Conversion: Many existing MCQ datasets require significant filtering or rewriting to be suitable for answer matching, which may reduce coverage or alter subject distributions.
Implications and Future Directions
The findings have significant implications for both the evaluation and development of LLMs:
- Benchmark Design: The community should prioritize benchmarks that support generative evaluation via answer matching, with questions designed for specificity and unique answers. This may be more fruitful than creating harder MCQ distractors.
- Model Development: As answer matching better reflects real-world generative use cases, optimizing models for generative performance (rather than MCQ accuracy) becomes more meaningful.
- Evaluation Ecosystem: The transition to answer matching enables more valid, scalable, and cost-effective evaluation protocols, supporting the rapid pace of LLM development and deployment.
Future work may explore extending answer matching to tasks with multiple valid answers, improving matcher robustness, and integrating rubric-based or execution-based evaluation for complex outputs.
Conclusion
The paper provides a rigorous and practical case for replacing MCQ-based evaluation with answer matching in LLM benchmarking. By aligning evaluation protocols with the generative capabilities of modern models, answer matching offers a more valid, reliable, and scalable approach, with direct consequences for model selection, deployment, and future research directions in AI.