
ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation (2405.04818v2)

Published 8 May 2024 in cs.CL

Abstract: Evaluating the quality of free-text explanations is a multifaceted, subjective, and labor-intensive task. LLMs present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to evaluate how LLMs rate explanations. We observed that larger models produced labels that maintained or increased the inter-annotator agreement, suggesting that they fall within the expected variance between human raters. However, their correlation with majority-voted human ratings varied across different quality aspects, indicating that they are not a complete replacement. In turn, using LLMs as a supplement to a smaller group of human raters in some cases improved the correlation with the original majority labels. However, the effect was limited to cases where human raters were scarce, and an additional human rater had a more pronounced effect in all cases. Overall, we recommend against using LLMs as a complete replacement for human raters but encourage using them in configurations that end with targeted human involvement. Data available here: https://github.com/a-brassard/ACORN
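The abstract compares LLM-assigned quality ratings against majority-voted human ratings per quality aspect. The paper's exact metrics are not stated here, so the sketch below is illustrative only: it assumes a simple majority vote over human raters and Spearman correlation, with made-up ratings for a single aspect.

```python
# Minimal sketch of the aspect-wise comparison described in the abstract:
# correlate LLM-assigned quality ratings with majority-voted human ratings.
# The majority-vote rule, Spearman correlation, and all numbers below are
# illustrative assumptions, not the paper's actual protocol or data.
from collections import Counter
from scipy.stats import spearmanr

def majority_vote(ratings):
    """Return the most common rating among a group of human raters."""
    return Counter(ratings).most_common(1)[0][0]

# Hypothetical per-explanation ratings for one quality aspect
human_ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1]]  # one list per explanation
llm_ratings = [4, 2, 5, 2]                                    # one LLM score per explanation

gold = [majority_vote(r) for r in human_ratings]
rho, p_value = spearmanr(llm_ratings, gold)
print(f"Spearman correlation with majority-voted human ratings: {rho:.2f}")
```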

Authors (5)
  1. Ana Brassard (9 papers)
  2. Benjamin Heinzerling (26 papers)
  3. Keito Kudo (7 papers)
  4. Keisuke Sakaguchi (44 papers)
  5. Kentaro Inui (119 papers)
