
ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation (2405.04818v2)

Published 8 May 2024 in cs.CL

Abstract: Evaluating the quality of free-text explanations is a multifaceted, subjective, and labor-intensive task. LLMs present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to evaluate how LLMs rate explanations. We observed that larger models produced labels that maintained or increased the inter-annotator agreement, suggesting that they fall within the expected variance between human raters. However, their correlation with majority-voted human ratings varied across different quality aspects, indicating that they are not a complete replacement. In turn, using LLMs as a supplement to a smaller group of human raters in some cases improved the correlation with the original majority labels. However, the effect was limited to cases where human raters were scarce, and an additional human rater had a more pronounced effect in all cases. Overall, we recommend against using LLMs as a complete replacement for human raters but encourage using them in configurations that end with targeted human involvement. Data available here: https://github.com/a-brassard/ACORN
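The abstract compares LLM-assigned quality ratings against majority-voted human ratings per quality aspect. The paper's exact metrics are not stated here, so the sketch below is illustrative only: it assumes a simple majority vote over human raters and Spearman correlation, with made-up ratings for a single aspect.

```python
# Minimal sketch of the aspect-wise comparison described in the abstract:
# correlate LLM-assigned quality ratings with majority-voted human ratings.
# The majority-vote rule, Spearman correlation, and all numbers below are
# illustrative assumptions, not the paper's actual protocol or data.
from collections import Counter
from scipy.stats import spearmanr

def majority_vote(ratings):
    """Return the most common rating among a group of human raters."""
    return Counter(ratings).most_common(1)[0][0]

# Hypothetical per-explanation ratings for one quality aspect
human_ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1]]  # one list per explanation
llm_ratings = [4, 2, 5, 2]                                    # one LLM score per explanation

gold = [majority_vote(r) for r in human_ratings]
rho, p_value = spearmanr(llm_ratings, gold)
print(f"Spearman correlation with majority-voted human ratings: {rho:.2f}")
```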

Authors (5)
  1. Ana Brassard (9 papers)
  2. Benjamin Heinzerling (26 papers)
  3. Keito Kudo (7 papers)
  4. Keisuke Sakaguchi (44 papers)
  5. Kentaro Inui (119 papers)
