From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks (2409.04168v2)
Abstract: To reduce the need for human annotations, LLMs have been proposed as judges of the quality of other candidate models. The performance of LLM judges is typically evaluated by measuring the correlation with human judgments on generative tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that easy samples are easy to judge, and difficult samples are difficult to judge. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance, indicating that judges tend to favor higher-quality models even if their answer is incorrect. As a consequence, we test whether we can predict the behavior of LLM judges using simple features such as part-of-speech tags and find that we can correctly predict 70%-75% of judgments. We conclude this study by analyzing practical use cases, showing that LLM judges consistently detect the on-average better model but largely fail if we use them to improve task performance.
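The claim that 70%-75% of judgments can be predicted from surface features such as part-of-speech tags can be sketched as follows. This is an illustrative stand-in, not the paper's pipeline: the tag sequences and verdict labels are invented, and a dependency-free nearest-centroid classifier replaces whatever stronger model (e.g., a random forest) the authors actually train on POS-tag features.

```python
from collections import Counter
import math

# Hypothetical pre-tagged candidate answers paired with a judge's verdict.
# Both the tag sequences and the labels are illustrative, not from the paper.
train = [
    (["NOUN", "VERB", "NUM", "NUM", "PUNCT"], "accept"),
    (["NOUN", "VERB", "NUM", "PUNCT"], "accept"),
    (["PRON", "VERB", "ADV", "ADJ", "PUNCT"], "reject"),
    (["PRON", "AUX", "ADV", "VERB", "PUNCT"], "reject"),
]

TAGS = sorted({t for tags, _ in train for t in tags})

def featurize(tags):
    """Normalized POS-tag frequency vector: the 'simple features'."""
    counts = Counter(tags)
    return [counts[t] / len(tags) for t in TAGS]

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# One centroid per verdict; prediction is nearest centroid in feature space.
centroids = {
    label: centroid([featurize(tags) for tags, y in train if y == label])
    for label in {y for _, y in train}
}

def predict(tags):
    v = featurize(tags)
    return min(centroids, key=lambda label: math.dist(v, centroids[label]))

print(predict(["NOUN", "VERB", "NUM", "PUNCT"]))       # numeric-heavy answer
print(predict(["PRON", "VERB", "ADV", "ADJ", "PUNCT"]))  # hedging-heavy answer
```

The point of the sketch is the feature representation, not the classifier: if tag-frequency vectors alone separate accepted from rejected answers, the judge is responding to surface form rather than mathematical correctness.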