HiCBench: Document Chunking Benchmark
- HiCBench is a benchmark that assesses document chunking quality in retrieval-augmented generation (RAG) systems using multi-level annotations.
- It pairs evidence-dense QA tasks and multi-level chunking annotations with the HiChunk hierarchical chunking framework to improve retrieval precision and response fidelity.
- Experimental evaluations reveal significant gains in chunking accuracy and overall RAG performance, highlighting its effectiveness for long-document processing.
HiCBench is a benchmark designed to evaluate the quality of document chunking within retrieval-augmented generation (RAG) systems, addressing a critical layer in the RAG pipeline that has previously lacked rigorous evaluation resources. By focusing explicitly on the chunking step and leveraging evidence-dense question-answer (QA) tasks and multi-level chunking annotations, HiCBench provides a framework for systematic, fine-grained assessment of how segmentation methods affect both retrieval and generated response fidelity. The benchmark is accompanied by HiChunk, a hierarchical chunking and retrieval system utilizing fine-tuned LLMs and an adaptive retrieval algorithm, resulting in demonstrably improved outcomes for downstream RAG tasks (Lu et al., 15 Sep 2025).
1. Motivation and Conceptual Foundation
Prevailing RAG evaluation benchmarks have largely emphasized either document retrieval or response generation accuracy, with minimal regard for the quality of document chunking itself. This oversight is particularly consequential in evidence-dense tasks, where relevant information is distributed over contiguous text fragments rather than isolated sentences. Existing benchmarks typically construct queries using only a few sentences as evidence, making them insensitive to disruptions in semantic segmentation and severely limiting diagnostic power at the chunking stage. HiCBench directly confronts this gap by establishing a benchmark tailored to the evaluation of chunking strategies, specifically for scenarios where evidence is extensive and semantically coherent, and where retrieval system performance and generation quality are affected by chunk boundaries.
2. Benchmark Construction and Core Components
HiCBench comprises a suite of meticulously annotated benchmark resources and synthetic QA tasks, structured as follows:
- Manually Annotated Multi-Level Chunking Points: Documents selected from the OHRBench corpus are annotated with hierarchical chunking points at various granularities, such as section, subsection, and paragraph boundaries. These manual annotations provide ground truth for fine-grained assessment of how well chunking algorithms preserve inherent semantic structure.
- Evidence-Dense QA Pair Synthesis: QA instances are generated via a multi-step, prompt-driven process. Initially, section-level summaries are created. Then, grounded QA pairs are synthesized using both context from annotated chunks and these summaries. QA tasks are diversified into three core types:
- T₀: Evidence-sparse QAs, with only one or two sentences as supporting material.
- T₁: Single-chunk evidence-dense QAs, where the evidence comprises one semantically coherent chunk (512–4096 words).
- T₂: Multi-chunk evidence-dense QAs, where supporting evidence spans multiple semantic chunks (256–2048 words).
- Evidence Extraction and Validation: For each QA, LLM-based extraction identifies the precise sentences constituting valid evidence. This process includes multiple runs and filtering steps to ensure high evidence density (≥10% of context) and fact consistency (the Fact-Cov metric), yielding QA pairs that are highly sensitive to segmentation errors; a filtering sketch follows the summary table below.
| Component | Description | Role in Benchmark |
|---|---|---|
| Multi-Level Annotations | Manual chunk boundaries at multiple semantic levels | Ground truth for chunking quality |
| Synthesized QA Pairs | Evidence-dense, task-diverse QAs grounded in context | Chunk-sensitivity evaluation |
| Evidence Sources | LLM-extracted, fact-filtered sentence sets | Makes chunking impact measurable |
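To make the density criterion concrete, the following is a minimal sketch of such a filter, assuming word-level counting and an illustrative `qa["evidence"]`/`qa["context"]` record layout (the benchmark's actual schema and tokenization may differ):

```python
# Minimal sketch of the evidence-density filter described above. The
# qa["evidence"] / qa["context"] fields and word-level counting are
# illustrative assumptions, not the benchmark's actual schema.

def word_count(text: str) -> int:
    """Approximate length in words; the paper's exact tokenization may differ."""
    return len(text.split())

def evidence_density(evidence_sentences: list[str], context: str) -> float:
    """Fraction of the source context covered by the extracted evidence."""
    evidence_words = sum(word_count(s) for s in evidence_sentences)
    return evidence_words / max(word_count(context), 1)

def filter_qa_pairs(qa_pairs: list[dict], min_density: float = 0.10) -> list[dict]:
    """Keep QA pairs whose evidence spans at least min_density of the context."""
    return [
        qa for qa in qa_pairs
        if evidence_density(qa["evidence"], qa["context"]) >= min_density
    ]
```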
3. HiChunk Framework and Algorithms
HiChunk is the hierarchical document segmentation framework paired with HiCBench, designed to produce multi-granular chunk boundaries and optimize retrieval granularity for RAG tasks. Notable aspects include:
- Hierarchical Structuring with Fine-Tuned LLMs: Given a document (list of sentences), a fine-tuned LLM identifies global chunking points across multiple structural levels, capturing both coarse and fine-grained semantic units.
- Iterative Inference Strategy: For documents exceeding the LLM's input token limit $L_{\max}$, chunking proceeds in windowed fashion: for each window $W_i$ with $|W_i| \le L_{\max}$, local chunk points $P_i$ are inferred and merged into the global set $P = \bigcup_i P_i$. Careful bookkeeping (such as residual-text handling) mitigates hierarchical drift, where segmentation could otherwise become inconsistent at window boundaries; a code sketch follows this list.
- Window selection: each subsequent window resumes at the last confirmed chunk point of its predecessor, $\mathrm{start}(W_{i+1}) = \max\{p : p \in P_i\}$, so the residual tail of $W_i$ is re-segmented rather than dropped.
- Auto-Merge Retrieval Algorithm: Post-chunking, retrieval of context for the LLM is guided by an adaptive algorithm (sketched in the second code example below):
  - Traverses the query-ranked chunk candidates $c_1, c_2, \dots, c_k$.
  - Applies an upward "merge" to a parent chunk $p$ when three conditions are met:
    - At least two of $p$'s children have entered the retrieval set: $|\{c_j : \mathrm{parent}(c_j) = p\}| \ge 2$.
    - The aggregate size of those children surpasses a dynamic token threshold $\tau$: $\sum_{c_j :\, \mathrm{parent}(c_j) = p} |c_j| \ge \tau$.
    - The remaining token budget is sufficient to fit the parent chunk: $T_{\text{cur}} + |p| \le T_{\text{budget}}$.

  This mechanism ensures semantic completeness and relevance while respecting model input constraints.
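The iterative inference strategy can be sketched as follows. Here `infer_chunk_points` stands in for the fine-tuned LLM call, windows are measured in sentences rather than tokens for simplicity, and the residual-text rule is an assumption about, not a transcription of, the paper's procedure:

```python
# Illustrative sketch of windowed hierarchical chunk-point inference.
from typing import Callable

ChunkPoint = tuple[int, int]  # (sentence index, hierarchy level)

def hierarchical_chunk(sentences: list[str],
                       infer_chunk_points: Callable[[list[str]], list[ChunkPoint]],
                       max_window: int) -> list[ChunkPoint]:
    """Segment a long document window by window, merging local points globally."""
    global_points: set[ChunkPoint] = set()
    start = 0
    while start < len(sentences):
        window = sentences[start:start + max_window]
        local = infer_chunk_points(window)  # offsets relative to the window
        global_points.update((start + off, lvl) for off, lvl in local)
        # Resume from the last confirmed boundary so the residual tail of
        # this window is re-segmented in the next pass instead of dropped.
        last = max((off for off, _ in local), default=len(window) - 1)
        start += max(last, 1)
    return sorted(global_points)
```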
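Likewise, a hedged sketch of the Auto-Merge traversal, with an illustrative `Chunk` tree structure and budget accounting that approximates, but is not guaranteed to match, the paper's exact rule:

```python
# Hedged sketch of Auto-Merge: walk query-ranked chunks and replace two or
# more retrieved siblings with their parent once their aggregate size clears
# a threshold and the parent still fits the token budget.
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality; chunks form a cyclic tree
class Chunk:
    text: str
    tokens: int
    parent: "Chunk | None" = None
    children: list["Chunk"] = field(default_factory=list)

def auto_merge(ranked: list[Chunk], budget: int, threshold: int) -> list[Chunk]:
    retrieved: list[Chunk] = []
    used = 0
    for chunk in ranked:
        if chunk.parent is not None and chunk.parent in retrieved:
            continue  # already covered by a previously merged parent
        if used + chunk.tokens > budget:
            continue  # this candidate alone would overflow the budget
        retrieved.append(chunk)
        used += chunk.tokens
        parent = chunk.parent
        if parent is None:
            continue
        siblings = [c for c in retrieved if c.parent is parent]
        agg = sum(c.tokens for c in siblings)
        # Merge upward when (1) >= 2 children are retrieved, (2) their
        # aggregate size clears the threshold, and (3) the parent fits
        # the budget once its children's tokens are netted out.
        if (len(siblings) >= 2 and agg >= threshold
                and used - agg + parent.tokens <= budget):
            retrieved = [c for c in retrieved if c.parent is not parent]
            retrieved.append(parent)
            used += parent.tokens - agg
    return retrieved
```

Netting the children's tokens out of the budget when their parent replaces them is one natural way to honor the constraint in Section 6; the paper's accounting may differ in detail.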
4. Experimental Evaluation and Results
The efficacy of HiCBench and HiChunk was demonstrated in extensive experiments on Qasper, GutenQA, OHRBench, LongBench RAG, and the HiCBench dataset itself. Key quantitative findings include:
- Chunking Accuracy: On Qasper, F₁ for chunk-point detection at level 1 increased from 0.5481 (LumberChunker, LC) to 0.6742 (HiChunk, HC), evidencing more faithful reproduction of ground-truth structure (a minimal boundary-F₁ sketch follows this list).
- RAG Pipeline Metrics: Replacing rule-based or semantic chunkers (e.g., FC200, LC) with HiChunk and its auto-merge variant (HC200+AM) yielded higher evidence recall (ERec), Fact-Cov, and Rouge scores. For instance, on HiCBench QA tasks, HC200+AM improved ERec from 74.06 to 81.03 in some settings, and enhanced answer quality across response models.
- Retrieval Token Budget Sensitivity: Experiments spanning token budgets of 2k–4k tokens showed monotonic improvements in response quality as the budget grew, with HC200+AM consistently outperforming all baselines at each setting.
- Efficiency: HiChunk offered the best overall quality–speed trade-off. While semantic chunkers have lower time cost, they perform markedly worse on chunking accuracy and RAG answer quality. HiChunk remains practical for real-time or batch processing of lengthy documents.
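The chunk-point F₁ reported above can be computed as ordinary precision and recall over predicted versus gold boundary positions. Below is a minimal sketch, assuming boundaries are exact sentence indices at a fixed hierarchy level; the paper's matching rule (e.g., any tolerance window) may differ:

```python
# Boundary F1 for chunk-point detection: precision/recall over predicted
# vs. gold boundary sentence indices at one hierarchy level.
def chunk_point_f1(predicted: set[int], gold: set[int]) -> float:
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # boundaries found in both sets
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```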
5. Implications for RAG System Development
The introduction of HiCBench and HiChunk has several notable implications:
- Diagnosis of Chunking-Related Bottlenecks: By isolating document segmentation, HiCBench enables researchers to systematically evaluate and refine chunkers, uncovering failures that would otherwise be masked by retrieval or generation metrics.
- Granular Control for Retrieval: Multi-level chunking allows systems to adjust retrieved context size and granularity dynamically per query, yielding greater evidence recall and more contextually appropriate generation.
- Improved End-to-End Performance: Empirical results substantiate that precise chunking not only aids retrieval but materially benefits response factuality and coverage, as measured by Fact-Cov and Rouge on multiple RAG datasets.
- Scalability and Robustness: HiChunk's iterative inference accommodates very long inputs, handling documents that exceed fixed model context windows and supporting robust RAG across diverse text sources, including academic manuscripts and multi-section reports.
6. Technical Innovations and Specialty Formulas
HiChunk's integration of hierarchical chunk-point inference and adaptive merging for retrieval is governed by the budget constraint $T_{\text{cur}} + |p| \le T_{\text{budget}}$, where $T_{\text{cur}}$ is the current aggregate retrieval token count, $|p|$ denotes the token length of a parent chunk $p$, and $T_{\text{budget}}$ is the total token budget. This, combined with explicit multi-level chunk annotation and iterative local-global merging strategies, constitutes the principal algorithmic contribution underpinning the observed improvements.
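For concreteness (with illustrative numbers, not values from the paper): given $T_{\text{budget}} = 4096$ and $T_{\text{cur}} = 3200$, a 700-token parent chunk may be merged in, since $3200 + 700 = 3900 \le 4096$, whereas a 1000-token parent would violate the constraint.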
7. Significance, Limitations, and Outlook
HiCBench establishes a new paradigm for rigorous evaluation of chunking in evidence-dense retrieval-augmented generation, refining the granularity and precision of RAG system analysis. While the benchmark and framework represent a substantial advance, a plausible implication is that future work may extend these resources to more diverse corpora or multi-modality, further generalizing the efficacy of chunking-aware RAG evaluation.
HiCBench and HiChunk together close a critical methodological gap, enabling practitioners and researchers to dissect, optimize, and compare chunking strategies within the broader RAG pipeline, with direct implications for performance in knowledge-intensive and long-document NLP applications (Lu et al., 15 Sep 2025).