A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI (2412.13942v2)

Published 18 Dec 2024 in cs.CL

Abstract: Disagreement in human labeling is ubiquitous, and can be captured in human judgment distributions (HJDs). Recent research has shown that explanations provide valuable information for understanding human label variation (HLV) and that LLMs can approximate HJD from a few human-provided label-explanation pairs. However, collecting explanations for every label is still time-consuming. This paper examines whether LLMs can replace humans in generating explanations for approximating HJD. Specifically, we use LLMs as annotators to generate model explanations for a few given human labels. We test ways to obtain and combine these label-explanation pairs with the goal of approximating human judgment distributions. We further compare the resulting human and model-generated explanations, and test automatic and human explanation selection. Our experiments show that LLM explanations are promising for NLI: to estimate HJDs, generated explanations yield comparable results to humans' when provided with human labels. Importantly, our results generalize from datasets with human explanations to i) datasets where they are not available and ii) challenging out-of-distribution test sets.

Summary

  • The paper demonstrates that LLM-generated explanations serve as effective proxies for human explanations in modeling human judgment distributions for Natural Language Inference (NLI).
  • Empirical results show a minimal performance gap between LLM-generated and human explanations across various NLI datasets, indicating significant potential for reducing annotation costs.
  • Using model judgment distributions derived from LLM explanations can improve classifier performance and robustness on out-of-domain datasets, highlighting the value of capturing human label variation.

LLM-Generated Explanations as Proxies for Human Explanations in NLI

The paper "A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI" explores the efficacy of using LLMs to generate explanations as a means to model human judgment distributions (HJDs) in the field of Natural Language Inference (NLI). This research is rooted in addressing human label variation (HLV), which remains a significant obstacle in developing robust computational models. By substituting human-generated explanations with LLM-generated ones, the authors aim to reduce the cost and time associated with data annotation while retaining the modeling of HJD.

The paper leverages multiple datasets, including ChaosNLI, VariErr NLI, and MNLI, to assess how well LLM-generated explanations can replicate human judgment distributions. The authors introduce two core strategies for obtaining explanations: "label-free" and "label-guided". In the label-free setting, the LLM generates explanations without reference to human labels, while in the label-guided setting it generates explanations conditioned on a few human-provided labels. Both strategies are evaluated for how well the resulting explanations approximate HJDs, as sketched below.
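To make the two settings concrete, the sketch below shows how prompts for each setting might be constructed. The prompt wording and the `llm` callable are hypothetical stand-ins for illustration, not the paper's actual prompts or models.

```python
# Illustrative sketch of the two explanation-generation settings described above.
# The prompt wording and the `llm` callable are hypothetical stand-ins, not the
# paper's actual prompts or models.
from typing import Callable, List

NLI_LABELS = ["entailment", "neutral", "contradiction"]

def label_free_prompt(premise: str, hypothesis: str) -> str:
    # The LLM picks a label itself and explains it, without seeing human labels.
    return (
        f"Premise: {premise}\nHypothesis: {hypothesis}\n"
        f"Choose one label from {NLI_LABELS} and explain your reasoning."
    )

def label_guided_prompt(premise: str, hypothesis: str, human_label: str) -> str:
    # The LLM is conditioned on a human-provided label and explains why it fits.
    return (
        f"Premise: {premise}\nHypothesis: {hypothesis}\n"
        f"A human annotator chose the label '{human_label}'. "
        f"Explain why this label is plausible."
    )

def collect_explanations(
    llm: Callable[[str], str],   # any text-in/text-out LLM wrapper
    premise: str,
    hypothesis: str,
    human_labels: List[str],     # a few labels from human annotators
) -> List[str]:
    """Generate one explanation per human label (the label-guided setting)."""
    return [llm(label_guided_prompt(premise, hypothesis, lab)) for lab in human_labels]
```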

A significant aspect of this paper is the demonstration that model-generated explanations yield results comparable to those obtained with human-provided explanations. The empirical results indicate that substituting human explanations with LLM-generated ones does not markedly degrade the quality of HJD estimations. Specifically, evaluations using Kullback-Leibler divergence (KL), Jensen-Shannon distance (JSD), total variation distance (TVD), and a global correlation metric (D.Corr) reveal that the performance gap is minimal, highlighting the potential efficiency gains without significant loss in modeling fidelity.
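As a reference point, the instance-level distances can be computed as shown below; the distributions used here are invented purely for illustration, and the paper's exact evaluation pipeline may differ.

```python
# Instance-level distances between an estimated label distribution and a human
# judgment distribution (HJD) over the three NLI labels. Example vectors are
# invented for illustration; the paper's exact evaluation setup may differ.
import numpy as np
from scipy.stats import entropy                    # entropy(p, q) = KL(p || q)
from scipy.spatial.distance import jensenshannon   # Jensen-Shannon distance

hjd = np.array([0.60, 0.30, 0.10])        # human judgment distribution (E, N, C)
estimated = np.array([0.55, 0.35, 0.10])  # distribution estimated from explanations

kl = entropy(hjd, estimated)                   # Kullback-Leibler divergence
jsd = jensenshannon(hjd, estimated, base=2)    # Jensen-Shannon distance
tvd = 0.5 * np.abs(hjd - estimated).sum()      # total variation distance

print(f"KL={kl:.4f}  JSD={jsd:.4f}  TVD={tvd:.4f}")
```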

The research also extended its applicability beyond the datasets equipped with human-provided explanations, underscoring the generalizability of the approach. On out-of-domain (OOD) datasets like ANLI, fine-tuning with model judgment distributions improved classifier performance, demonstrating that capturing HLV information can enhance model robustness and generalization abilities across diverse inference contexts.
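Fine-tuning on judgment distributions amounts to training against soft labels rather than one-hot targets. The sketch below shows one common way to do this with a KL-divergence loss; the classifier, batch contents, and hyperparameters are placeholders rather than the authors' exact setup.

```python
# Minimal sketch of fine-tuning an NLI classifier on soft label distributions
# (model judgment distributions) instead of one-hot labels. The logits, target
# distributions, and loss choice are placeholders, not the paper's exact setup.
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, target_dist: torch.Tensor) -> torch.Tensor:
    """KL(target || model) between target distributions and model predictions."""
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, target_dist, reduction="batchmean")

# Toy example: batch of 2 items, 3 NLI classes (entailment, neutral, contradiction).
logits = torch.randn(2, 3, requires_grad=True)    # classifier outputs
target_dist = torch.tensor([[0.6, 0.3, 0.1],      # distributions derived
                            [0.1, 0.7, 0.2]])     # from LLM explanations

loss = soft_label_loss(logits, target_dist)
loss.backward()   # gradients flow back to the classifier as usual
```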

One of the bolder assertions of the paper is that reasoning variability in LLM-generated explanations can itself serve as a signal for evaluating HLV. Comparison studies in which human explanations were gradually replaced with model-generated ones suggest that explanation relevance and variability can serve as indicators of explanation quality. Interestingly, when human annotations were juxtaposed with LLM-generated ones, greater variability tended to correlate with better performance metrics, suggesting that more diverse explanations can cover the broader HLV spectrum better than singular human explanations.
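As a rough illustration of how such variability might be quantified (this is not the paper's measure), one could take the average pairwise dissimilarity among a set of explanations, for example over TF-IDF representations:

```python
# Crude, illustrative proxy for explanation variability: average pairwise cosine
# dissimilarity of TF-IDF vectors. This is NOT the paper's measure, just a sketch
# of how variability across a set of explanations might be quantified.
from itertools import combinations
from typing import List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def explanation_variability(explanations: List[str]) -> float:
    """Mean pairwise (1 - cosine similarity) over a set of explanations."""
    tfidf = TfidfVectorizer().fit_transform(explanations)
    sims = cosine_similarity(tfidf)
    pairs = list(combinations(range(len(explanations)), 2))
    return sum(1.0 - sims[i, j] for i, j in pairs) / len(pairs)

explanations = [
    "The premise explicitly states the fact asserted in the hypothesis.",
    "The hypothesis adds details that the premise neither confirms nor denies.",
    "The premise and hypothesis describe mutually exclusive situations.",
]
print(f"variability: {explanation_variability(explanations):.3f}")
```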

The implications of these findings could be transformative for the AI field, especially in areas reliant on substantial annotated data. By utilizing LLMs to generate explanation data, researchers could considerably reduce the reliance on human labor for annotation tasks. Moreover, this work opens pathways to applying similar methodologies to other domains of NLP and beyond. Looking ahead, exploring multi-modal explanations combined with cross-LLM collaboration could offer even broader insights into capturing and modeling human-like judgment with AI.

In conclusion, this paper effectively challenges the traditional boundaries of annotation by demonstrating that LLM-generated explanations can be viable substitutes for human-generated ones in modeling HJD for NLI tasks. The solid empirical findings will likely stimulate further research into efficient annotation methods across the AI landscape.
