- The paper introduces the GSM-IC dataset to systematically assess LLM performance when problems contain irrelevant distractor sentences.
- It evaluates prompting techniques and demonstrates that arithmetic accuracy drops significantly when extraneous context is present.
- Mitigation strategies such as self-consistency and explicit instructions to ignore irrelevant information noticeably improve accuracy, pointing toward more robust model designs.
Large Language Models Can Be Easily Distracted by Irrelevant Context
The paper by Shi et al. examines how susceptible LLMs are to distractors embedded in problem statements, focusing on arithmetic reasoning tasks. It surfaces a fundamental limitation of these models: their performance degrades when a problem includes information that has no bearing on the correct answer.
Key Contributions
- GSM-IC Dataset: The authors introduce the Grade-School Math with Irrelevant Context (GSM-IC) dataset. Unlike typical benchmarks, each GSM-IC problem is a GSM8K base problem augmented with a distractor sentence that leaves the correct solution unchanged. The dataset thus isolates the effect of irrelevant information on the reasoning capabilities of LLMs (a construction sketch follows this list).
- Evaluation of Prompting Techniques: The paper evaluates state-of-the-art prompting techniques on GSM-IC, including Chain-of-Thought (CoT) prompting, Least-to-Most (LtM) prompting, and program-based prompting that solves problems by generating code. All techniques show a marked decline in accuracy in the presence of irrelevant context, underscoring the distractibility of these models (an illustrative CoT prompt follows this list).
- Mitigation Strategies: Several strategies are evaluated, including self-consistency and instructed prompting. Self-consistency, which samples multiple reasoning paths and takes a majority vote over their final answers, notably improves robustness (a voting sketch follows this list). Instructed prompting, which tells the model to ignore irrelevant information, also yields significant gains.
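As a concrete illustration of the dataset construction, here is a minimal sketch in Python; the problem text and names are hypothetical, not drawn from the released dataset.

```python
# Minimal sketch of GSM-IC-style problem construction (hypothetical example):
# a distractor sentence is inserted into a base problem without changing
# the correct answer.

base_problem = (
    "Liam has 5 apples. He buys 3 bags with 4 apples each. "
    "How many apples does Liam have now?"
)
answer = 5 + 3 * 4  # 17; the correct solution to the base problem

# Distractor: mentions a different person and an "in-range" number,
# but is irrelevant to the question being asked.
distractor = "Liam's sister Mia has 6 oranges."

# GSM-IC variant: distractor inserted before the final question sentence.
context, question = base_problem.rsplit(". ", 1)
gsm_ic_problem = f"{context}. {distractor} {question}"

print(gsm_ic_problem)
# A model is judged robust on this item only if it still answers 17.
```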
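The prompting setups themselves are plain text. Below is an illustrative one-shot CoT prompt applied to a GSM-IC-style problem; the exemplar wording is hypothetical and not copied from the paper's prompts.

```python
# Illustrative one-shot chain-of-thought prompt in the style evaluated in
# the paper (hypothetical exemplar, not the paper's actual prompt text).
cot_prompt = """\
Q: Tom has 3 pencils. He buys 2 packs of 5 pencils each. How many pencils does Tom have?
A: Tom starts with 3 pencils. 2 packs of 5 pencils is 2 * 5 = 10 pencils. 3 + 10 = 13. The answer is 13.

Q: {question}
A:"""

question = (
    "Liam has 5 apples. He buys 3 bags with 4 apples each. "
    "Liam's sister Mia has 6 oranges. How many apples does Liam have now?"
)
print(cot_prompt.format(question=question))
# A distractible model may fold Mia's 6 oranges into its reasoning chain,
# e.g., computing 5 + 12 + 6 instead of 5 + 12.
```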
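Self-consistency can be sketched in a few lines. The sketch below assumes a `sample_model` function standing in for temperature-sampled LLM calls (a placeholder, not a real API):

```python
import re
from collections import Counter

def final_answer(completion: str) -> str | None:
    """Pull the last number out of a chain-of-thought completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

def self_consistency(completions: list[str]) -> str | None:
    """Majority vote over the final answers of sampled completions."""
    answers = [a for a in (final_answer(c) for c in completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# In practice: samples = [sample_model(prompt, temperature=0.7) for _ in range(20)]
samples = [
    "3 bags of 4 apples is 12. 5 + 12 = 17. The answer is 17.",
    "5 + 3 * 4 = 17. The answer is 17.",
    "Mia has 6 oranges, so 5 + 12 + 6 = 23. The answer is 23.",  # distracted sample
]
print(self_consistency(samples))  # -> "17"
```

The vote is taken over final answers rather than full reasoning chains, so distinct but correct derivations reinforce each other while occasional distracted samples are outvoted.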
Experimental Outcomes
- Performance Drop: The paper documents a substantial decline in performance on GSM-IC relative to the original GSM8K problems. For several prompting techniques, fewer than 18% of problems are answered consistently once irrelevant information is introduced.
- Self-Consistency: With 20 samples per problem, the correct answer appears among the sampled outputs for 99.7% of problems, revealing the headroom that answer-selection methods such as majority voting can exploit.
- Role of Exemplars: Including irrelevant information in the few-shot exemplars, with solutions that explicitly disregard it, improves performance, indicating that models can learn to ignore extraneous information through examples as well as explicit instructions (see the prompt sketch after this list).
- Impact of Irrelevant Context Factors: The analysis shows that distractors are more harmful when their role names overlap with those in the problem or when their numbers fall within the range of the problem's numbers; the topic relevance of the distractor likewise plays a critical role in how much performance degrades.
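The two exemplar-level mitigations can be combined in one prompt: an upfront instruction plus a distractor-bearing exemplar whose rationale explicitly dismisses the irrelevant fact. The sketch below paraphrases the paper's instruction, and the exemplar text is hypothetical.

```python
# Sketch of instructed prompting combined with a distractor-bearing exemplar.
# The instruction wording paraphrases the paper's; the exemplar is hypothetical.
instruction = "Feel free to ignore irrelevant information given in the questions."

exemplar = """\
Q: Tom has 3 pencils. He buys 2 packs of 5 pencils each. Tom's dog is 4 years old. How many pencils does Tom have?
A: The dog's age is irrelevant. 2 packs of 5 pencils is 2 * 5 = 10. 3 + 10 = 13. The answer is 13."""

# {question} is left as a placeholder to be filled per test problem.
prompt_template = f"{instruction}\n\n{exemplar}\n\nQ: {{question}}\nA:"
print(prompt_template)
```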
Implications and Future Work
The findings highlight a crucial limitation of LLMs: an inability to selectively discount irrelevant context, which impairs their use in real-world tasks that require nuanced context understanding. They suggest directions for future research, such as refining training techniques to strengthen the contextual filtering abilities of LLMs or devising architectures that are inherently robust to irrelevant inputs.
Given the rapid advancements in AI, understanding and mitigating such fundamental weaknesses is vital. Future work should continue to investigate how models process irrelevant contexts across diverse tasks, potentially guiding the development of more resilient and discerning LLMs capable of more reliable reasoning in varied situational contexts.
Overall, this paper contributes significant findings to the current discourse on LLM limitations, opening avenues for further experimental and theoretical investigation in AI and computational linguistics.