Towards a Robust Retrieval-Based Summarization System (2403.19889v1)
Abstract: This paper investigates the robustness of large language models (LLMs) in retrieval-augmented generation (RAG)-based summarization tasks. While LLMs offer strong summarization capabilities, their performance in complex, real-world scenarios remains under-explored. Our first contribution is LogicSumm, an evaluation framework that uses realistic scenarios to assess LLM robustness during RAG-based summarization. Based on the limitations identified by LogicSumm, we then developed SummRAG, a comprehensive system that creates training dialogues and fine-tunes a model to enhance robustness within LogicSumm's scenarios. SummRAG exemplifies our broader goal of defining structured methods for testing an LLM's capabilities, rather than addressing issues in a one-off fashion. Experimental results confirm the effectiveness of SummRAG, showing improved logical coherence and summarization quality. Data, corresponding model weights, and Python code are available online.
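The abstract describes a retrieve-then-summarize workflow whose robustness depends on how the model behaves when retrieval goes wrong (e.g., when retrieved documents are off-topic). As a minimal sketch of that loop, assuming hypothetical `retrieve`, `is_relevant`, and `summarize` helpers that are stand-ins rather than the paper's released code:

```python
# Minimal sketch of a retrieve-then-verify-then-summarize loop in the spirit
# of the SummRAG pipeline described in the abstract. All components here are
# hypothetical stand-ins; the paper's released code may differ.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    doc_id: str
    text: str


def rag_summarize(
    topic: str,
    retrieve: Callable[[str, int], List[Document]],   # e.g., a dense retriever
    is_relevant: Callable[[str, Document], bool],     # relevance check, e.g., LLM-as-judge
    summarize: Callable[[str, List[Document]], str],  # LLM summarization call
    top_k: int = 5,
) -> str:
    """Retrieve documents for `topic`, filter out irrelevant ones, then summarize."""
    candidates = retrieve(topic, top_k)
    relevant = [d for d in candidates if is_relevant(topic, d)]
    if not relevant:
        # Robust behavior: report the retrieval failure instead of
        # hallucinating a summary from off-topic documents -- the kind of
        # scenario an evaluation framework like LogicSumm is meant to probe.
        return f"No retrieved document is relevant to '{topic}'; cannot summarize."
    return summarize(topic, relevant)
```

The early return on an empty `relevant` list is the design point: a robust RAG summarizer must distinguish "summarize these documents" from "admit the retrieval failed," rather than producing fluent but ungrounded output.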