- The paper introduces the GSM-IC dataset to systematically assess LLM performance when problems contain irrelevant distractor sentences.
- It evaluates prompting techniques and demonstrates that arithmetic accuracy drops significantly when extraneous context is present.
- Mitigation strategies such as self-consistency and explicit instructions to ignore irrelevant information noticeably improve accuracy, pointing toward more robust model designs.
Large Language Models Can Be Easily Distracted by Irrelevant Context
The paper by Shi et al. examines how susceptible LLMs are to distractors embedded in problem statements, focusing on arithmetic reasoning tasks. It surfaces a fundamental limitation of these models: their performance degrades when a problem includes information that has no bearing on the correct answer.
Key Contributions
- GSM-IC Dataset: The authors introduce the Grade-School Math with Irrelevant Context (GSM-IC) dataset. Unlike typical benchmarks, each GSM-IC problem is a GSM8K base problem augmented with a distractor sentence that leaves the correct solution unchanged. The dataset thus isolates the effect of irrelevant information on the reasoning capabilities of LLMs (a construction sketch follows this list).
- Evaluation of Prompting Techniques: The paper evaluates state-of-the-art prompting techniques on GSM-IC, including Chain-of-Thought (CoT) prompting, Least-to-Most (LtM) prompting, and program-based prompting that solves problems by generating code. All techniques show a marked decline in accuracy in the presence of irrelevant context, underscoring the distractibility of these models (an illustrative CoT prompt follows this list).
- Mitigation Strategies: Several strategies are evaluated, including self-consistency and instructed prompting. Self-consistency, which samples multiple reasoning paths and takes a majority vote over their final answers, notably improves robustness (a voting sketch follows this list). Instructed prompting, which tells the model to ignore irrelevant information, also yields significant gains.
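As a concrete illustration of the dataset construction, here is a minimal sketch in Python; the problem text and names are hypothetical, not drawn from the released dataset.

```python
# Minimal sketch of GSM-IC-style problem construction (hypothetical example):
# a distractor sentence is inserted into a base problem without changing
# the correct answer.

base_problem = (
    "Liam has 5 apples. He buys 3 bags with 4 apples each. "
    "How many apples does Liam have now?"
)
answer = 5 + 3 * 4  # 17; the correct solution to the base problem

# Distractor: mentions a different person and an "in-range" number,
# but is irrelevant to the question being asked.
distractor = "Liam's sister Mia has 6 oranges."

# GSM-IC variant: distractor inserted before the final question sentence.
context, question = base_problem.rsplit(". ", 1)
gsm_ic_problem = f"{context}. {distractor} {question}"

print(gsm_ic_problem)
# A model is judged robust on this item only if it still answers 17.
```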
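The prompting setups themselves are plain text. Below is an illustrative one-shot CoT prompt applied to a GSM-IC-style problem; the exemplar wording is hypothetical and not copied from the paper's prompts.

```python
# Illustrative one-shot chain-of-thought prompt in the style evaluated in
# the paper (hypothetical exemplar, not the paper's actual prompt text).
cot_prompt = """\
Q: Tom has 3 pencils. He buys 2 packs of 5 pencils each. How many pencils does Tom have?
A: Tom starts with 3 pencils. 2 packs of 5 pencils is 2 * 5 = 10 pencils. 3 + 10 = 13. The answer is 13.

Q: {question}
A:"""

question = (
    "Liam has 5 apples. He buys 3 bags with 4 apples each. "
    "Liam's sister Mia has 6 oranges. How many apples does Liam have now?"
)
print(cot_prompt.format(question=question))
# A distractible model may fold Mia's 6 oranges into its reasoning chain,
# e.g., computing 5 + 12 + 6 instead of 5 + 12.
```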
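Self-consistency can be sketched in a few lines. The sketch below assumes a `sample_model` function standing in for temperature-sampled LLM calls (a placeholder, not a real API):

```python
import re
from collections import Counter

def final_answer(completion: str) -> str | None:
    """Pull the last number out of a chain-of-thought completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

def self_consistency(completions: list[str]) -> str | None:
    """Majority vote over the final answers of sampled completions."""
    answers = [a for a in (final_answer(c) for c in completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# In practice: samples = [sample_model(prompt, temperature=0.7) for _ in range(20)]
samples = [
    "3 bags of 4 apples is 12. 5 + 12 = 17. The answer is 17.",
    "5 + 3 * 4 = 17. The answer is 17.",
    "Mia has 6 oranges, so 5 + 12 + 6 = 23. The answer is 23.",  # distracted sample
]
print(self_consistency(samples))  # -> "17"
```

The vote is taken over final answers rather than full reasoning chains, so distinct but correct derivations reinforce each other while occasional distracted samples are outvoted.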
Experimental Outcomes
- Performance Drop: The paper documents a substantial decline in performance on GSM-IC relative to the original GSM8K problems. For several prompting techniques, fewer than 18% of problems are answered consistently once irrelevant information is introduced.
- Self-Consistency: With 20 samples per problem, the correct answer appears among the sampled outputs for 99.7% of problems, revealing the headroom that answer-selection methods such as majority voting can exploit.
- Role of Exemplars: Including irrelevant information in the few-shot exemplars, with solutions that explicitly disregard it, improves performance, indicating that models can learn to ignore extraneous information through examples as well as explicit instructions (see the prompt sketch after this list).
- Impact of Irrelevant Context Factors: The analysis shows that distractors are more harmful when their role names overlap with those in the problem or when their numbers fall within the range of the problem's numbers; the topic relevance of the distractor likewise plays a critical role in how much performance degrades.
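The two exemplar-level mitigations can be combined in one prompt: an upfront instruction plus a distractor-bearing exemplar whose rationale explicitly dismisses the irrelevant fact. The sketch below paraphrases the paper's instruction, and the exemplar text is hypothetical.

```python
# Sketch of instructed prompting combined with a distractor-bearing exemplar.
# The instruction wording paraphrases the paper's; the exemplar is hypothetical.
instruction = "Feel free to ignore irrelevant information given in the questions."

exemplar = """\
Q: Tom has 3 pencils. He buys 2 packs of 5 pencils each. Tom's dog is 4 years old. How many pencils does Tom have?
A: The dog's age is irrelevant. 2 packs of 5 pencils is 2 * 5 = 10. 3 + 10 = 13. The answer is 13."""

# {question} is left as a placeholder to be filled per test problem.
prompt_template = f"{instruction}\n\n{exemplar}\n\nQ: {{question}}\nA:"
print(prompt_template)
```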
Implications and Future Work
The findings highlight a crucial limitation of LLMs: an inability to selectively discount irrelevant context, which impairs their use in real-world tasks that require nuanced context understanding. They suggest directions for future research, such as refining training techniques to strengthen the contextual filtering abilities of LLMs or devising architectures that are inherently robust to irrelevant inputs.
Given the rapid advancements in AI, understanding and mitigating such fundamental weaknesses is vital. Future work should continue to investigate how models process irrelevant contexts across diverse tasks, potentially guiding the development of more resilient and discerning LLMs capable of more reliable reasoning in varied situational contexts.
Overall, this paper contributes significant findings to the current discourse on LLM limitations, opening avenues for further experimental and theoretical investigation in AI and computational linguistics.