- The paper shows that answer matching aligns closely with human judgment (e.g., Scott's π = 0.97 on MATH), whereas MCQ evaluation relies on statistical shortcuts and aligns poorly.
- It introduces a generative evaluation method where free-form responses are paired with a binary matcher to assess answer accuracy.
- Empirical results reveal that answer matching shifts model rankings and reduces benchmark saturation while maintaining cost efficiency.
Answer Matching as a Superior Paradigm for LLM Evaluation
The paper "Answer Matching Outperforms Multiple Choice for LLM Evaluation" (2507.02856) presents a comprehensive critique of multiple choice (MCQ) benchmarks as the dominant paradigm for evaluating LLMs, and introduces answer matching as a more valid, scalable, and cost-effective alternative for generative evaluation. The authors provide both theoretical and empirical analyses, demonstrating that MCQ-based evaluation is fundamentally limited by its discriminative nature, which allows models to exploit statistical shortcuts unrelated to genuine generative ability. In contrast, answer matching—where a model generates a free-form response and a separate model (the matcher) determines if it matches a reference answer—aligns much more closely with human judgment and the actual generative use cases of LLMs.
Discriminative Shortcuts in Multiple Choice Evaluation
The paper formalizes the distinction between generative and discriminative evaluation. In MCQ settings, models are tasked with selecting the correct answer from a set of options, which reduces the evaluation to a discrimination problem. The authors empirically show that models can achieve high accuracy on MCQ benchmarks even when provided only with the answer choices and not the question itself. For instance, a fine-tuned Qwen3-4B model achieves 83% accuracy on TruthfulQA-v2 and 93% on GoldenSwag using only the answer choices. This exposes the prevalence of "choice-only" shortcuts, where statistical artifacts in the answer set allow models to bypass genuine reasoning or knowledge retrieval.
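To make the choice-only probe concrete, the sketch below scores a benchmark while withholding every question. This is a zero-shot prompting variant of the probe rather than the paper's setup (the authors fine-tune a model on choices-only inputs); the `llm` callable, prompt wording, and data layout are assumptions for illustration.

```python
import random
from string import ascii_uppercase as LETTERS
from typing import Callable, List, Sequence, Tuple

def choices_only_prompt(choices: Sequence[str]) -> str:
    """Build a probe prompt that shows only the answer options; the question is withheld."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return ("The question has been withheld. Based only on the options below, "
            "guess which one is the correct answer. Reply with a single letter.\n" + options)

def choice_only_accuracy(llm: Callable[[str], str],
                         mcq_items: List[Tuple[Sequence[str], int]]) -> float:
    """Shortcut probe: accuracy reached without ever seeing the questions.
    Accuracy far above chance means the options alone leak the answer."""
    correct = 0
    for choices, answer_idx in mcq_items:
        reply = llm(choices_only_prompt(choices)).strip().upper()
        if reply and reply[0] in LETTERS[:len(choices)]:
            pred = LETTERS.index(reply[0])
        else:
            pred = random.randrange(len(choices))  # unparseable reply: fall back to a random guess
        correct += int(pred == answer_idx)
    return correct / len(mcq_items)
```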
The authors further demonstrate that attempts to mitigate these shortcuts—such as reducing the number of choices or generating distractors with LLMs—are insufficient. Even with more choices or LLM-generated distractors, shortcut accuracy remains high. This issue is not confined to language tasks; similar effects are observed in multimodal benchmarks like MMMU-Pro.
Answer Matching: Methodology and Empirical Validation
Answer matching is proposed as a generative evaluation protocol: the model is prompted with the question alone, generates a free-form answer, and a separate matcher model determines if the response matches a reference answer. The matcher is provided with the question, the reference answer, and the model's response, and outputs a binary judgment.
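A minimal sketch of this two-step protocol follows, assuming a generic `llm(model_name, prompt)` callable supplied by the reader; the matcher prompt wording and model names are illustrative and are not the templates released with the paper.

```python
from typing import Callable, Iterable, List, Tuple

# Illustrative matcher prompt; the paper releases its own templates -- this is not them.
MATCHER_PROMPT = (
    "You are grading a model's response against a reference answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate response: {response}\n"
    "Does the candidate response express the same answer as the reference? "
    "Reply with exactly one word: yes or no."
)

def match_one(llm: Callable[[str, str], str], question: str, reference: str,
              candidate_model: str, matcher_model: str) -> bool:
    """One round of answer matching: generate a free-form answer, then grade it."""
    # Step 1: generative evaluation -- the candidate sees only the question.
    response = llm(candidate_model, question)
    # Step 2: the matcher sees question, reference, and response, and outputs a binary verdict.
    verdict = llm(matcher_model, MATCHER_PROMPT.format(
        question=question, reference=reference, response=response))
    return verdict.strip().lower().startswith("yes")

def answer_matching_accuracy(llm: Callable[[str, str], str],
                             dataset: Iterable[Tuple[str, str]],
                             candidate_model: str, matcher_model: str) -> float:
    """dataset: iterable of (question, reference_answer) pairs."""
    verdicts: List[bool] = [match_one(llm, q, ref, candidate_model, matcher_model)
                            for q, ref in dataset]
    return sum(verdicts) / len(verdicts)
```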
The authors conduct extensive empirical studies on MATH, MMLU-Pro, and GPQA-Diamond, using both automated and human-annotated ground truth. Key findings include:
- Alignment with Human Judgment: Answer matching using recent LLMs (even small models like Qwen3-4B) achieves near-perfect agreement with human annotators, as measured by Scott's π (e.g., π = 0.97 on MATH). In contrast, MCQ evaluation aligns poorly (π = 0.26), and LLM-as-a-judge without reference answers also underperforms. (A sketch of the Scott's π computation follows this list.)
- Model Rankings Change: The choice of evaluation protocol significantly affects model rankings. Models that perform well on MCQ benchmarks may drop in rank when evaluated via answer matching, and vice versa. This has direct implications for model selection and deployment.
- Benchmark Saturation is Illusory: Benchmarks that appear saturated under MCQ evaluation reveal substantial headroom when repurposed for answer matching. For example, top models' accuracy drops by over 20% on GPQA-Diamond when evaluated generatively.
- Cost and Scalability: Contrary to concerns about the expense of LLM-based evaluation, answer matching is shown to be as cheap or cheaper than MCQ evaluation. This is because free-form responses are typically shorter, and the matcher model's inference cost is marginal compared to generation.
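For reference, Scott's π corrects observed agreement for chance using the pooled label distribution of both raters, π = (p_o − p_e) / (1 − p_e). The sketch below implements the standard formula on toy data; the example verdicts are invented for illustration only.

```python
from collections import Counter
from typing import Hashable, Sequence

def scotts_pi(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Scott's pi: chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and len(rater_a) > 0
    n = len(rater_a)
    # Observed agreement: fraction of items where the two raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: squared pooled marginal proportions, summed over labels.
    pooled = Counter(rater_a) + Counter(rater_b)
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    if p_e == 1.0:  # degenerate case: both raters always give the same single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Toy example: human ground truth vs. a matcher's verdicts (1 = correct, 0 = incorrect).
human   = [1, 1, 0, 1, 0, 1, 1, 0]
matcher = [1, 1, 0, 1, 0, 1, 0, 0]
print(scotts_pi(human, matcher))  # ~0.75 for this toy data
```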
Implementation Considerations
The paper provides practical guidance for implementing answer matching:
- Prompt Engineering: The matcher model should be prompted with the question, reference answer, and candidate response, and instructed to output a binary match judgment. The authors release prompt templates and code for reproducibility.
- Dataset Preparation: Not all MCQ questions are suitable for answer matching, as some rely on the answer choices for specificity. Filtering or rewriting is necessary to ensure questions have unique, unambiguous answers.
- Model Selection: Recent small and mid-sized LLMs are sufficient as matchers, reducing compute requirements and increasing reproducibility. Open-weight models like Qwen3-4B and Llama-4-Scout are recommended.
- Robustness: Rankings are stable across different matcher models, and answer matching is less susceptible to self-preference bias than LLM-as-a-judge protocols.
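One simple way to operationalize this robustness check is to grade the same free-form responses with two different matcher models and verify that per-question verdicts and the induced model rankings agree. The sketch below uses purely illustrative scores and hypothetical model names, not results from the paper.

```python
from typing import Dict, List

def matcher_agreement(verdicts_a: List[bool], verdicts_b: List[bool]) -> float:
    """Per-question agreement rate between two matcher models' binary verdicts."""
    assert len(verdicts_a) == len(verdicts_b) and verdicts_a
    return sum(a == b for a, b in zip(verdicts_a, verdicts_b)) / len(verdicts_a)

def ranking(scores: Dict[str, float]) -> List[str]:
    """Candidate models ordered by their answer-matching accuracy, best first."""
    return sorted(scores, key=scores.get, reverse=True)

# Accuracies for the same candidates under two different matchers should induce
# (near-)identical rankings. Numbers are purely illustrative.
scores_matcher_small = {"model-a": 0.62, "model-b": 0.58, "model-c": 0.41}
scores_matcher_large = {"model-a": 0.64, "model-b": 0.57, "model-c": 0.43}
assert ranking(scores_matcher_small) == ranking(scores_matcher_large)
```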
Limitations and Open Challenges
The authors acknowledge several limitations:
- Applicability: Answer matching is best suited for tasks with a unique correct answer or a well-defined set of paraphrases. It is less effective for tasks with multiple valid answers (e.g., summarization, translation) or where equivalence is hard to define (e.g., proofs, code).
- Gaming the Matcher: The robustness of matcher models to adversarial responses or optimization pressure remains an open question. Stronger matchers or adversarial evaluation may be necessary for high-stakes settings.
- Dataset Conversion: Many existing MCQ datasets require significant filtering or rewriting to be suitable for answer matching, which may reduce coverage or alter subject distributions.
Implications and Future Directions
The findings have significant implications for both the evaluation and development of LLMs:
- Benchmark Design: The community should prioritize benchmarks that support generative evaluation via answer matching, with questions designed for specificity and unique answers. This may be more fruitful than creating harder MCQ distractors.
- Model Development: As answer matching better reflects real-world generative use cases, optimizing models for generative performance (rather than MCQ accuracy) becomes more meaningful.
- Evaluation Ecosystem: The transition to answer matching enables more valid, scalable, and cost-effective evaluation protocols, supporting the rapid pace of LLM development and deployment.
Future work may explore extending answer matching to tasks with multiple valid answers, improving matcher robustness, and integrating rubric-based or execution-based evaluation for complex outputs.
Conclusion
The paper provides a rigorous and practical case for replacing MCQ-based evaluation with answer matching in LLM benchmarking. By aligning evaluation protocols with the generative capabilities of modern models, answer matching offers a more valid, reliable, and scalable approach, with direct consequences for model selection, deployment, and future research directions in AI.