- The paper demonstrates that zero-shot self-explanations closely align with human rationales in multilingual sentiment analysis and forced labor detection.
- It leverages instruction-tuned LLMs like Llama2, Llama3, Mistral, and Mixtral to generate and evaluate explanations across English, Danish, and Italian.
- Results show that Llama3 achieves the highest agreement with human annotations, underscoring the value of self-explanations for explainable AI.
Analyzing Zero-Shot Self-Explanations in Multilingual Text Classification
The research paper "Comparing Zero-Shot Self-Explanations with Human Rationales in Multilingual Text Classification" investigates the ability of instruction-tuned LLMs to generate zero-shot self-explanations, which are compared against human rationales and traditional post-hoc explainability methods such as Layer-wise Relevance Propagation (LRP). The focus is on two text classification tasks, sentiment analysis and forced labor detection, evaluated in English as well as in Danish and Italian translations.
Methodological Approach
The paper studies self-explanations generated by several LLMs: Llama2, Llama3, Mistral, and Mixtral. These models were prompted to produce rationale-style explanations in a zero-shot setting across multiple languages. The resulting self-explanations were compared with human annotations and with post-hoc explanations computed via LRP, assessing both their plausibility to humans and their faithfulness to the models' decisions.
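As a concrete illustration, the snippet below sketches how such a zero-shot self-explanation prompt might be issued through the Hugging Face transformers text-generation pipeline. The model identifier, prompt wording, and output schema are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of prompting an instruction-tuned LLM for a zero-shot
# self-explanation. Model id, prompt wording, and output schema are assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed (gated) model id
)

def self_explain(text: str) -> str:
    """Ask the model to classify a text and mark the words that drove its decision."""
    prompt = (
        "Classify the sentiment of the following text as positive or negative. "
        "Return a JSON object with two fields: 'label' and 'rationale', where "
        "'rationale' lists the words from the text that support your decision.\n\n"
        f"Text: {text}"
    )
    messages = [{"role": "user", "content": prompt}]
    output = generator(messages, max_new_tokens=200, do_sample=False)
    # The pipeline returns the full chat; the last message is the model's reply.
    return output[0]["generated_text"][-1]["content"]

print(self_explain("The plot was dull, but the acting was wonderful."))
```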
For sentiment classification, annotations were evaluated in multiple languages, extending the analysis beyond English to Italian and Danish and providing insight into the models' multilingual capabilities. Forced labor detection, a more complex task, tested the models on news articles annotated for specific risk indicators.
Experimental Results
The results indicate that self-explanations align more closely with human rationales than post-hoc methods such as LRP, especially in terms of plausibility. Llama3 exhibited the highest agreement with human rationales, suggesting stronger instruction-following and language-understanding capabilities.
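One common way to quantify this kind of agreement is token-level overlap between the model's rationale and the human annotation, for example an F1 score over the selected words. The sketch below illustrates the idea; the paper's exact agreement metric is not reproduced here.

```python
# Sketch of a token-level plausibility score: F1 overlap between the words a
# model marks as rationale and the words humans annotated (generic illustration).
def rationale_f1(model_tokens: set[str], human_tokens: set[str]) -> float:
    if not model_tokens or not human_tokens:
        return 0.0
    overlap = len(model_tokens & human_tokens)
    precision = overlap / len(model_tokens)
    recall = overlap / len(human_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

model_rationale = {"dull", "wonderful"}
human_rationale = {"wonderful", "acting"}
print(rationale_f1(model_rationale, human_rationale))  # 0.5
```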
Task accuracy was generally high for sentiment classification across all languages, suggesting robust zero-shot multilingual capabilities. Forced labor detection, by contrast, showed more variability, reflecting task-specific challenges and the complexity of the domain.
Another notable finding was the models' varied adherence to output-format constraints such as valid JSON in the prompted explanations, with Llama3 demonstrating more reliable compliance.
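Checking this kind of format compliance can be as simple as attempting to parse each output as JSON and verifying that the expected fields are present. The sketch below assumes a 'label'/'rationale' schema purely for illustration.

```python
import json

REQUIRED_KEYS = {"label", "rationale"}  # assumed schema, for illustration only

def follows_format(raw_output: str) -> bool:
    """Return True if the model output is valid JSON containing the expected keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()

outputs = [
    '{"label": "positive", "rationale": ["wonderful"]}',
    "Sure! The label is positive.",  # free-form reply, not valid JSON
]
compliance_rate = sum(follows_format(o) for o in outputs) / len(outputs)
print(compliance_rate)  # 0.5
```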
Faithfulness vs. Plausibility
Although self-explanations were highly plausible to humans, their faithfulness was only comparable to that of post-hoc attributions. Removing the features marked as relevant did not significantly change class probabilities, pointing to a gap between what readers perceive as relevant and what actually drives the models' predictions.
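A typical removal-based faithfulness check of this kind deletes the tokens an explanation marks as relevant and measures how much the predicted class probability drops, as sketched below. The `predict_proba` function is a hypothetical stand-in for the classifier interface, and the paper's exact perturbation protocol may differ.

```python
# Sketch of a removal-based faithfulness check: delete the tokens an explanation
# marks as relevant and measure the drop in the predicted class probability.
from typing import Callable

def faithfulness_drop(
    text: str,
    rationale_tokens: set[str],
    target_class: str,
    predict_proba: Callable[[str], dict[str, float]],  # hypothetical classifier interface
) -> float:
    """Probability drop for the target class after removing the rationale tokens."""
    original = predict_proba(text)[target_class]
    reduced_text = " ".join(
        tok for tok in text.split()
        if tok.strip(".,!?").lower() not in rationale_tokens
    )
    perturbed = predict_proba(reduced_text)[target_class]
    # A value near zero means removing the "relevant" tokens barely affects the prediction.
    return original - perturbed
```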
Contrastive explanations did not inherently yield higher plausibility; their effectiveness varied across datasets and indicators, in line with prior findings on explanatory approaches.
Implications and Future Directions
The implications of this work are multifaceted for the fields of Explainable AI (XAI) and multilingual text classification. It suggests that self-explanations could provide a more direct and human-comprehensible mode of interaction with AI systems, potentially enhancing user trust and understanding, which is particularly important in applications aimed at a broad audience.
The models' ability to generalize across languages without substantial prior exposure motivates further study of the robustness of cross-lingual transfer. Similarly, the ability to accurately explain complex, domain-specific classifications such as forced labor detection could be pivotal for applying AI in sensitive contexts.
Future research could probe deeper into the mechanisms driving the alignment between self-explanations and human rationales. Additionally, exploring diverse, less-documented languages and aligning self-explanations more closely with model faithfulness while preserving human plausibility could further advance AI interpretability.
In conclusion, this paper contributes valuable insight into the evolving capabilities of LLMs to self-generate explanations, illustrating both their potential and current limitations within multilingual and domain-specific contexts. As AI systems are increasingly integrated into decision-making processes, their ability to provide transparent, understandable, and faithful explanations remains of paramount importance.