How Easily do Irrelevant Inputs Skew the Responses of Large Language Models? (2404.03302v4)

Published 4 Apr 2024 in cs.CL

Abstract: By leveraging the retrieval of information from external knowledge databases, LLMs exhibit enhanced capabilities for accomplishing many knowledge-intensive tasks. However, due to the inherent flaws of current retrieval systems, irrelevant information may appear among the retrieved top-ranked passages. In this work, we present a comprehensive investigation into the robustness of LLMs to different types of irrelevant information under various conditions. We first introduce a framework for constructing high-quality irrelevant information that ranges from semantically unrelated, to partially related, to related to the question. Our analysis demonstrates that the constructed irrelevant information not only scores highly on similarity metrics and is readily retrieved by existing systems, but also bears semantic connections to the context. Our investigation reveals that current LLMs still struggle to discriminate highly semantically related information and can be easily distracted by this irrelevant yet misleading content. We also find that current solutions for handling irrelevant information have limited effectiveness in improving the robustness of LLMs to such distractions. All the resources are available on GitHub at https://github.com/Di-viner/LLM-Robustness-to-Irrelevant-Information.
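The abstract notes that the constructed irrelevant information scores highly on similarity metrics and is therefore readily surfaced by existing retrievers. As a minimal illustration of that point (not the authors' pipeline), the sketch below ranks candidate distractor passages against a question with an off-the-shelf dense encoder; the model name, example passages, and threshold-free ranking are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch (assumptions, not the paper's method): score candidate
# "irrelevant" passages against a question with a dense encoder, so that
# distractors which are semantically close to the question stand out.
# Requires the sentence-transformers package; the model name is an
# illustrative choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

question = "Who wrote the novel Frankenstein?"
candidates = [
    "Mary Shelley began writing Frankenstein when she was 18.",   # related
    "Percy Shelley was an English Romantic poet.",                # partially related
    "The Great Barrier Reef is the world's largest coral reef.",  # unrelated
]

# Encode the question and candidate passages, then rank by cosine similarity.
q_emb = model.encode(question, convert_to_tensor=True)
c_emb = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_emb)[0]

for passage, score in sorted(zip(candidates, scores.tolist()),
                             key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {passage}")
```

In this setup, partially related and related distractors tend to receive similarity scores close to genuinely useful passages, which is precisely why a retriever can place them among the top-ranked results.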

Citations (13)
