Analysis of Hypothesis-only Biases in LLM-Generated NLI Data
The paper "Hypothesis-only Biases in LLM-Elicited Natural Language Inference" addresses a crucial issue regarding the biases present in Natural Language Inference (NLI) datasets generated by LLMs. The researchers, Grace Proebsting and Adam Poliak, investigate the presence of annotation artifacts in such datasets and assess their implications on hypothesis-only classification models.
Experimentation and Methodology
The paper focuses on recreating a portion of the Stanford NLI (SNLI) corpus with prominent LLMs, including GPT-4, Llama-2, and Mistral 7B. By reusing the instructions originally given to human crowdworkers, the researchers had each model generate hypotheses for given premises. This setup allows a controlled comparison between human- and LLM-generated data.
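The paper's exact elicitation code and prompt are not reproduced here, but the general recipe is easy to sketch. The snippet below is a minimal illustration using the OpenAI Python client; the model name, the instruction wording, and the one-premise-per-call setup are assumptions for illustration, not the authors' actual prompt or pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative stand-in for the crowdworker instructions, not the paper's prompt.
INSTRUCTIONS = (
    "Given the caption of a photo, write one sentence that is definitely true "
    "about the photo (entailment), one that might be true (neutral), and one "
    "that is definitely false (contradiction)."
)

def elicit_hypotheses(premise: str, model: str = "gpt-4") -> str:
    """Ask the model for entailment/neutral/contradiction hypotheses for one premise."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": f"Caption: {premise}"},
        ],
    )
    return response.choices[0].message.content
```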
Once the datasets were created, hypothesis-only classifiers (Naive Bayes and BERT-based models) were trained to predict the NLI label from the hypothesis alone, without the premise. These classifiers reached 86% to 96% accuracy on the LLM-generated datasets, indicating a substantial presence of annotation artifacts that could bias downstream results.
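As a rough illustration of what a hypothesis-only baseline looks like, the sketch below trains a bag-of-words Naive Bayes classifier on hypotheses alone with scikit-learn. The feature choices and split are illustrative assumptions rather than the paper's exact configuration; the point is only that accuracy far above the roughly 33% chance level for three-way NLI signals artifacts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def hypothesis_only_accuracy(hypotheses, labels):
    """Train a bag-of-words Naive Bayes model on hypotheses only and return
    held-out accuracy; scores well above ~0.33 indicate annotation artifacts."""
    train_x, test_x, train_y, test_y = train_test_split(
        hypotheses, labels, test_size=0.2, random_state=0
    )
    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    model.fit(train_x, train_y)
    return model.score(test_x, test_y)
```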
Key Findings
- Presence of Annotation Artifacts: The paper found that LLM-generated NLI datasets do contain annotation artifacts similar to those found in human-generated datasets. The high accuracy of hypothesis-only classifiers substantiates this claim.
- Common Give-Away Words: Certain words and phrases appear disproportionately under specific labels and therefore act as strong label indicators. For example, "swimming in a pool" appeared in over 10,000 contradiction hypotheses generated by GPT-4. (A sketch of how such give-away words can be surfaced follows this list.)
- Model Bias Similarity: Hypothesis-only models trained on SNLI actually performed better on the GPT-4-generated data than on SNLI itself, suggesting that the two datasets share overlapping biases. The different LLMs also exhibited similar patterns of bias to one another.
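One common way to surface give-away words, following earlier artifact analyses, is to score each token by its smoothed pointwise mutual information with each label. The sketch below is illustrative rather than the authors' exact procedure; the smoothing constant and whitespace tokenization are assumptions.

```python
import math
from collections import Counter

def giveaway_words(hypotheses, labels, smoothing=100.0, top_k=10):
    """Rank tokens by smoothed PMI(token, label) for each label.

    PMI(t, l) = log[ p(t, l) / (p(t) * p(l)) ]; counts are over token
    occurrences, with each token counted at most once per hypothesis, and the
    joint counts are additively smoothed so rare tokens do not dominate.
    """
    joint = Counter()         # (token, label) -> count
    token_totals = Counter()  # token -> count
    label_totals = Counter()  # label -> count of token occurrences under label
    n = 0
    for hyp, label in zip(hypotheses, labels):
        for tok in set(hyp.lower().split()):
            joint[(tok, label)] += 1
            token_totals[tok] += 1
            label_totals[label] += 1
            n += 1

    ranked = {label: [] for label in label_totals}
    for (tok, label), count in joint.items():
        p_joint = (count + smoothing) / (n + smoothing)
        p_tok = token_totals[tok] / n
        p_label = label_totals[label] / n
        ranked[label].append((math.log(p_joint / (p_tok * p_label)), tok))
    return {label: sorted(scores, reverse=True)[:top_k]
            for label, scores in ranked.items()}
```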
Implications and Future Directions
The implications of this research are multifaceted. Practically, it indicates a need for thorough quality control and dataset filtering when using LLMs to generate NLP datasets. Theoretically, the findings suggest that LLMs, while efficient, may inherit systematic biases from their training and generation processes, which can degrade the quality and reliability of NLP applications built on their output.
Looking forward, research could explore methods to mitigate these biases, such as better prompt engineering, more diverse source data, or post-generation filtering techniques (a rough sketch of one such filter follows). Understanding the root causes of these biases and building models that are less susceptible to such artifacts are also promising directions for future research.
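As one example of a post-generation filter (an illustration, not a method proposed in the paper), generated examples that a hypothesis-only classifier labels correctly with high confidence could be dropped, since those are the examples most likely to carry give-away artifacts. The threshold below is an arbitrary placeholder.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def filter_confident_artifacts(train_hyps, train_labels,
                               gen_premises, gen_hyps, gen_labels,
                               threshold=0.9):
    """Drop generated examples that a hypothesis-only model labels correctly
    with probability >= threshold; these are the most artifact-laden ones.
    The 0.9 threshold is an arbitrary illustration."""
    vectorizer = CountVectorizer()
    clf = MultinomialNB()
    clf.fit(vectorizer.fit_transform(train_hyps), train_labels)

    probs = clf.predict_proba(vectorizer.transform(gen_hyps))
    preds = clf.classes_[probs.argmax(axis=1)]
    confidences = probs.max(axis=1)

    kept = []
    for prem, hyp, label, pred, conf in zip(
        gen_premises, gen_hyps, gen_labels, preds, confidences
    ):
        if not (pred == label and conf >= threshold):
            kept.append((prem, hyp, label))
    return kept
```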
Conclusion
The paper provides a meticulous examination of the biases inherent in LLM-generated NLI datasets, offering insights that matter both for the practical use of LLMs in NLP data creation and for the broader understanding of bias propagation in AI systems. The results underscore the need for ongoing vigilance and innovation in dataset curation and model training methodology.