From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks (2409.04168v2)
Abstract: To reduce the need for human annotations, LLMs have been proposed as judges of the quality of other candidate models. The performance of LLM judges is typically evaluated by measuring the correlation with human judgments on generative tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that easy samples are easy to judge, and difficult samples are difficult to judge. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance, indicating that judges tend to favor higher-quality models even if their answer is incorrect. As a consequence, we test whether we can predict the behavior of LLM judges using simple features such as part-of-speech tags and find that we can correctly predict 70%-75% of judgments. We conclude this study by analyzing practical use cases, showing that LLM judges consistently detect the on-average better model but largely fail if we use them to improve task performance.
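The claim that 70%-75% of judgments can be predicted from surface features such as part-of-speech tags can be sketched as follows. This is an illustrative stand-in, not the paper's pipeline: the tag sequences and verdict labels are invented, and a dependency-free nearest-centroid classifier replaces whatever stronger model (e.g., a random forest) the authors actually train on POS-tag features.

```python
from collections import Counter
import math

# Hypothetical pre-tagged candidate answers paired with a judge's verdict.
# Both the tag sequences and the labels are illustrative, not from the paper.
train = [
    (["NOUN", "VERB", "NUM", "NUM", "PUNCT"], "accept"),
    (["NOUN", "VERB", "NUM", "PUNCT"], "accept"),
    (["PRON", "VERB", "ADV", "ADJ", "PUNCT"], "reject"),
    (["PRON", "AUX", "ADV", "VERB", "PUNCT"], "reject"),
]

TAGS = sorted({t for tags, _ in train for t in tags})

def featurize(tags):
    """Normalized POS-tag frequency vector: the 'simple features'."""
    counts = Counter(tags)
    return [counts[t] / len(tags) for t in TAGS]

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# One centroid per verdict; prediction is nearest centroid in feature space.
centroids = {
    label: centroid([featurize(tags) for tags, y in train if y == label])
    for label in {y for _, y in train}
}

def predict(tags):
    v = featurize(tags)
    return min(centroids, key=lambda label: math.dist(v, centroids[label]))

print(predict(["NOUN", "VERB", "NUM", "PUNCT"]))       # numeric-heavy answer
print(predict(["PRON", "VERB", "ADV", "ADJ", "PUNCT"]))  # hedging-heavy answer
```

The point of the sketch is the feature representation, not the classifier: if tag-frequency vectors alone separate accepted from rejected answers, the judge is responding to surface form rather than mathematical correctness.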