Passage Segmentation of Documents for Extractive Question Answering
The paper "Passage Segmentation of Documents for Extractive Question Answering" by Zuhong Liu et al. introduces a novel framework for document segmentation aimed at enhancing the performance of retrieval-augmented generation (RAG) in open-domain question answering. The paper argues that while retrieval and synthesis have been the pivotal components of RAG, the chunking process, which divides documents into manageable segments, has received little attention despite its critical role.
The proposed solution is the Logits-Guided Multi-Granular Chunker (LGMGC), designed to address shortcomings in current chunking methodologies. LGMGC comprises two modules: the Logits-Guided Chunker and the Multi-Granular Chunker. The Logits-Guided Chunker uses the output logits of a smaller LLM to place segment boundaries, exploiting the model's understanding of context to produce semantically coherent, self-contained chunks. The Multi-Granular Chunker then subdivides these parent chunks into smaller child chunks of varying granularity, allowing for flexibility in handling different types of queries.
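The two-stage procedure can be sketched roughly as follows. This is an illustrative simplification, not the authors' code: the `boundary_scores` here are fixed numbers standing in for the LLM logit signal LGMGC would derive for candidate split points, and the function names and parameters (`threshold`, `max_len`, `child_size`) are hypothetical.

```python
def split_into_parents(sentences, boundary_scores, threshold=0.5, max_len=8):
    """Group sentences into parent chunks, closing a chunk when the
    (stand-in) end-of-chunk score exceeds `threshold` or the chunk
    reaches `max_len` sentences."""
    parents, current = [], []
    for sent, score in zip(sentences, boundary_scores):
        current.append(sent)
        if score >= threshold or len(current) >= max_len:
            parents.append(current)
            current = []
    if current:
        parents.append(current)
    return parents

def subdivide(parent, child_size):
    """Slice a parent chunk into child chunks of `child_size` sentences
    (the last child may be shorter), giving a finer granularity level."""
    return [parent[i:i + child_size] for i in range(0, len(parent), child_size)]

sentences = ["s1", "s2", "s3", "s4", "s5", "s6"]
# In LGMGC these scores would come from a small LLM's logits;
# fixed values stand in for illustration.
scores = [0.1, 0.2, 0.9, 0.1, 0.3, 0.8]
parents = split_into_parents(sentences, scores)
children = [c for p in parents for c in subdivide(p, child_size=2)]
```

At retrieval time, both the coarse parent chunks and the finer child chunks would be indexed, so a query can match at whichever granularity suits it.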
Experimental evaluation on benchmark datasets, specifically GutenQA for passage retrieval and LongBench for end-to-end QA tasks, demonstrates that LGMGC outperforms existing segmentation approaches. It improves retrieval performance, as evidenced by superior DCG@k and Recall@k scores, and achieves higher F1-scores on downstream tasks. Notably, LGMGC performs consistently across a range of chunk sizes while surpassing established techniques such as Recursive and Semantic Chunking.
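For reference, the two retrieval metrics reported above can be computed as follows; this is a standard-definition sketch with toy data, not the paper's evaluation code.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k retrieved passages:
    sum of rel / log2(rank + 2), with rank starting at 0."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant passages found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Toy ranking: 1 = relevant passage at that rank, 0 = not relevant.
rels = [1, 0, 1, 0]
dcg = dcg_at_k(rels, k=3)            # 1/log2(2) + 0 + 1/log2(4) = 1.5
rec = recall_at_k(["p1", "p2", "p3"], ["p1", "p3", "p7"], k=3)  # 2/3
```

Better chunking helps both metrics in the same way: when a chunk is self-contained, the relevant passage ranks higher (raising DCG@k) and is less likely to be split across fragments that each miss the top-k cutoff (raising Recall@k).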
From a practical standpoint, the methodology's reduced sensitivity to hyperparameters and compatibility with standalone retrievers and synthesizers make it adaptable for real-world applications. The integration of logits in LGMGC also retains computational efficiency by reducing the number of API calls required for recursive evaluations. This efficiency is crucial, especially when considering the potential costs and processing time introduced by large-scale LLMs.
Theoretically, this research points to a shift toward treating chunking as a first-class factor in RAG systems. The unified approach of LGMGC not only emphasizes semantic coherence but also supports multilevel granularity, suggesting that tailoring chunk size to the complexity of the information can optimize retrieval. Future work could explore the implications of these findings for model training and the design of chunking strategies that support more complex reasoning tasks.
In summary, this work underscores the importance of effective document segmentation in improving the efficacy of retrieval-augmented systems and encourages further exploration into adaptive chunking methodologies. As LLMs continue to evolve, integrating advanced chunking frameworks like LGMGC could yield substantial improvements in both synthesis accuracy and retrieval efficiency.