HiCBench: Document Chunking Benchmark
- HiCBench is a benchmark that assesses document chunking quality in retrieval-augmented generation (RAG) systems using multi-level annotations.
- It pairs evidence-dense QA tasks and multi-level chunking annotations with the HiChunk hierarchical chunking framework to improve retrieval precision and response fidelity.
- Experimental evaluations reveal significant gains in chunking accuracy and overall RAG performance, highlighting its effectiveness for long-document processing.
HiCBench is a benchmark designed to evaluate the quality of document chunking within retrieval-augmented generation (RAG) systems, addressing a critical layer in the RAG pipeline that has previously lacked rigorous evaluation resources. By focusing explicitly on the chunking step and leveraging evidence-dense question-answer (QA) tasks and multi-level chunking annotations, HiCBench provides a framework for systematic, fine-grained assessment of how segmentation methods affect both retrieval and generated response fidelity. The benchmark is accompanied by HiChunk, a hierarchical chunking and retrieval system utilizing fine-tuned LLMs and an adaptive retrieval algorithm, resulting in demonstrably improved outcomes for downstream RAG tasks (Lu et al., 15 Sep 2025).
1. Motivation and Conceptual Foundation
Prevailing RAG evaluation benchmarks have largely emphasized either document retrieval or response generation accuracy, with minimal regard for the quality of document chunking itself. This oversight is particularly consequential in evidence-dense tasks, where relevant information is distributed over contiguous text fragments rather than isolated sentences. Existing benchmarks typically construct queries using only a few sentences as evidence, making them insensitive to disruptions in semantic segmentation and severely limiting diagnostic power at the chunking stage. HiCBench directly confronts this gap by establishing a benchmark tailored to the evaluation of chunking strategies, specifically for scenarios where evidence is extensive and semantically coherent, and where retrieval system performance and generation quality are affected by chunk boundaries.
2. Benchmark Construction and Core Components
HiCBench comprises a suite of meticulously annotated benchmark resources and synthetic QA tasks, structured as follows:
- Manually Annotated Multi-Level Chunking Points: Documents selected from the OHRBench corpus are annotated with hierarchical chunking points at various granularities, such as section, subsection, and paragraph boundaries. These manual annotations provide ground truth for fine-grained assessment of how well chunking algorithms preserve inherent semantic structure.
- Evidence-Dense QA Pair Synthesis: QA instances are generated via a multi-step, prompt-driven process. Initially, section-level summaries are created. Then, grounded QA pairs are synthesized using both context from annotated chunks and these summaries. QA tasks are diversified into three core types:
- T₀: Evidence-sparse QAs, with only one or two sentences as supporting material.
- T₁: Single-chunk evidence-dense QAs, where the evidence comprises one semantically coherent chunk (512–4096 words).
- T₂: Multi-chunk evidence-dense QAs, where supporting evidence spans multiple semantic chunks (256–2048 words).
- Evidence Extraction and Validation: For each QA, LLM-based extraction identifies the precise sentences constituting valid evidence. This process includes multiple runs and filtering steps to ensure high evidence density (≥10% of context) and fact consistency (the Fact-Cov metric), yielding QA pairs that are highly sensitive to segmentation errors; a filtering sketch follows the summary table below.
| Component | Description | Role in Benchmark |
|---|---|---|
| Multi-Level Annotations | Manual chunk boundaries at multiple semantic levels | Ground truth for chunking quality |
| Synthesized QA Pairs | Evidence-dense, task-diverse QAs grounded in context | Chunk-sensitivity evaluation |
| Evidence Sources | LLM-extracted, fact-filtered sentence sets | Makes chunking impact measurable |
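To make the density criterion concrete, the following is a minimal sketch of such a filter, assuming word-level counting and an illustrative `qa["evidence"]`/`qa["context"]` record layout (the benchmark's actual schema and tokenization may differ):

```python
# Minimal sketch of the evidence-density filter described above. The
# qa["evidence"] / qa["context"] fields and word-level counting are
# illustrative assumptions, not the benchmark's actual schema.

def word_count(text: str) -> int:
    """Approximate length in words; the paper's exact tokenization may differ."""
    return len(text.split())

def evidence_density(evidence_sentences: list[str], context: str) -> float:
    """Fraction of the source context covered by the extracted evidence."""
    evidence_words = sum(word_count(s) for s in evidence_sentences)
    return evidence_words / max(word_count(context), 1)

def filter_qa_pairs(qa_pairs: list[dict], min_density: float = 0.10) -> list[dict]:
    """Keep QA pairs whose evidence spans at least min_density of the context."""
    return [
        qa for qa in qa_pairs
        if evidence_density(qa["evidence"], qa["context"]) >= min_density
    ]
```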
3. HiChunk Framework and Algorithms
HiChunk is the hierarchical document segmentation framework paired with HiCBench, designed to produce multi-granular chunk boundaries and optimize retrieval granularity for RAG tasks. Notable aspects include:
- Hierarchical Structuring with Fine-Tuned LLMs: Given a document (list of sentences), a fine-tuned LLM identifies global chunking points across multiple structural levels, capturing both coarse and fine-grained semantic units.
- Iterative Inference Strategy: For documents exceeding the LLM's input token limit $L_{\max}$, chunking proceeds in windowed fashion: for each window $W_i$ with $|W_i| \le L_{\max}$, local chunk points $P_i$ are inferred and merged into the global set $P = \bigcup_i P_i$. Careful bookkeeping (such as residual-text handling) mitigates hierarchical drift, where segmentation could otherwise become inconsistent at window boundaries; a code sketch follows this list.
- Window selection: each subsequent window resumes at the last confirmed chunk point of its predecessor, $\mathrm{start}(W_{i+1}) = \max\{p : p \in P_i\}$, so the residual tail of $W_i$ is re-segmented rather than dropped.
- Auto-Merge Retrieval Algorithm: Post-chunking, retrieval of context for the LLM is guided by an adaptive algorithm (sketched in the second code example below):
  - Traverses the query-ranked chunk candidates $c_1, c_2, \dots, c_k$.
  - Applies an upward "merge" to a parent chunk $p$ when three conditions are met:
    - At least two of $p$'s children have entered the retrieval set: $|\{c_j : \mathrm{parent}(c_j) = p\}| \ge 2$.
    - The aggregate size of those children surpasses a dynamic token threshold $\tau$: $\sum_{c_j :\, \mathrm{parent}(c_j) = p} |c_j| \ge \tau$.
    - The remaining token budget is sufficient to fit the parent chunk: $T_{\text{cur}} + |p| \le T_{\text{budget}}$.

  This mechanism ensures semantic completeness and relevance while respecting model input constraints.
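The iterative inference strategy can be sketched as follows. Here `infer_chunk_points` stands in for the fine-tuned LLM call, windows are measured in sentences rather than tokens for simplicity, and the residual-text rule is an assumption about, not a transcription of, the paper's procedure:

```python
# Illustrative sketch of windowed hierarchical chunk-point inference.
from typing import Callable

ChunkPoint = tuple[int, int]  # (sentence index, hierarchy level)

def hierarchical_chunk(sentences: list[str],
                       infer_chunk_points: Callable[[list[str]], list[ChunkPoint]],
                       max_window: int) -> list[ChunkPoint]:
    """Segment a long document window by window, merging local points globally."""
    global_points: set[ChunkPoint] = set()
    start = 0
    while start < len(sentences):
        window = sentences[start:start + max_window]
        local = infer_chunk_points(window)  # offsets relative to the window
        global_points.update((start + off, lvl) for off, lvl in local)
        # Resume from the last confirmed boundary so the residual tail of
        # this window is re-segmented in the next pass instead of dropped.
        last = max((off for off, _ in local), default=len(window) - 1)
        start += max(last, 1)
    return sorted(global_points)
```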
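Likewise, a hedged sketch of the Auto-Merge traversal, with an illustrative `Chunk` tree structure and budget accounting that approximates, but is not guaranteed to match, the paper's exact rule:

```python
# Hedged sketch of Auto-Merge: walk query-ranked chunks and replace two or
# more retrieved siblings with their parent once their aggregate size clears
# a threshold and the parent still fits the token budget.
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality; chunks form a cyclic tree
class Chunk:
    text: str
    tokens: int
    parent: "Chunk | None" = None
    children: list["Chunk"] = field(default_factory=list)

def auto_merge(ranked: list[Chunk], budget: int, threshold: int) -> list[Chunk]:
    retrieved: list[Chunk] = []
    used = 0
    for chunk in ranked:
        if chunk.parent is not None and chunk.parent in retrieved:
            continue  # already covered by a previously merged parent
        if used + chunk.tokens > budget:
            continue  # this candidate alone would overflow the budget
        retrieved.append(chunk)
        used += chunk.tokens
        parent = chunk.parent
        if parent is None:
            continue
        siblings = [c for c in retrieved if c.parent is parent]
        agg = sum(c.tokens for c in siblings)
        # Merge upward when (1) >= 2 children are retrieved, (2) their
        # aggregate size clears the threshold, and (3) the parent fits
        # the budget once its children's tokens are netted out.
        if (len(siblings) >= 2 and agg >= threshold
                and used - agg + parent.tokens <= budget):
            retrieved = [c for c in retrieved if c.parent is not parent]
            retrieved.append(parent)
            used += parent.tokens - agg
    return retrieved
```

Netting the children's tokens out of the budget when their parent replaces them is one natural way to honor the constraint in Section 6; the paper's accounting may differ in detail.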
4. Experimental Evaluation and Results
The efficacy of HiCBench and HiChunk was demonstrated in extensive experiments on Qasper, GutenQA, OHRBench, LongBench RAG, and the HiCBench dataset itself. Key quantitative findings include:
- Chunking Accuracy: On Qasper, F₁ for chunk-point detection at level 1 increased from 0.5481 (LumberChunker, LC) to 0.6742 (HiChunk, HC), evidencing more faithful reproduction of ground-truth structure (a minimal boundary-F₁ sketch follows this list).
- RAG Pipeline Metrics: Replacing rule-based or semantic chunkers (e.g., FC200, LC) with HiChunk and its auto-merge variant (HC200+AM) yielded higher evidence recall (ERec), Fact-Cov, and Rouge scores. For instance, on HiCBench QA tasks, HC200+AM improved ERec from 74.06 to 81.03 in some settings, and enhanced answer quality across response models.
- Retrieval Token Budget Sensitivity: Experiments spanning token budgets of 2k–4k tokens showed monotonic improvements in response quality as the budget grew, with HC200+AM consistently outperforming all baselines at each setting.
- Efficiency: HiChunk offered the best overall quality–speed trade-off. While semantic chunkers have lower time cost, they perform markedly worse on chunking accuracy and RAG answer quality. HiChunk remains practical for real-time or batch processing of lengthy documents.
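The chunk-point F₁ reported above can be computed as ordinary precision and recall over predicted versus gold boundary positions. Below is a minimal sketch, assuming boundaries are exact sentence indices at a fixed hierarchy level; the paper's matching rule (e.g., any tolerance window) may differ:

```python
# Boundary F1 for chunk-point detection: precision/recall over predicted
# vs. gold boundary sentence indices at one hierarchy level.
def chunk_point_f1(predicted: set[int], gold: set[int]) -> float:
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # boundaries found in both sets
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```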
5. Implications for RAG System Development
The introduction of HiCBench and HiChunk has several notable implications:
- Diagnosis of Chunking-Related Bottlenecks: By isolating document segmentation, HiCBench enables researchers to systematically evaluate and refine chunkers, uncovering failures that would otherwise be masked by retrieval or generation metrics.
- Granular Control for Retrieval: Multi-level chunking allows systems to adjust retrieved context size and granularity dynamically per query, yielding greater evidence recall and more contextually appropriate generation.
- Improved End-to-End Performance: Empirical results substantiate that precise chunking not only aids retrieval but materially benefits response factuality and coverage, as measured by Fact-Cov and Rouge on multiple RAG datasets.
- Scalability and Robustness: HiChunk's iterative inference accommodates very long inputs, handling documents that exceed fixed model context windows and supporting robust RAG across diverse text sources, including academic manuscripts and multi-section reports.
6. Technical Innovations and Specialty Formulas
HiChunk's integration of hierarchical chunk-point inference and adaptive merging for retrieval is governed by the budget constraint $T_{\text{cur}} + |p| \le T_{\text{budget}}$, where $T_{\text{cur}}$ is the current aggregate retrieval token count, $|p|$ denotes the token length of a parent chunk $p$, and $T_{\text{budget}}$ is the total token budget. This, combined with explicit multi-level chunk annotation and iterative local-global merging strategies, constitutes the principal algorithmic contribution underpinning the observed improvements.
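For concreteness (with illustrative numbers, not values from the paper): given $T_{\text{budget}} = 4096$ and $T_{\text{cur}} = 3200$, a 700-token parent chunk may be merged in, since $3200 + 700 = 3900 \le 4096$, whereas a 1000-token parent would violate the constraint.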
7. Significance, Limitations, and Outlook
HiCBench establishes a new paradigm for rigorous evaluation of chunking in evidence-dense retrieval-augmented generation, refining the granularity and precision of RAG system analysis. While the benchmark and framework represent a substantial advance, a plausible implication is that future work may extend these resources to more diverse corpora or multi-modality, further generalizing the efficacy of chunking-aware RAG evaluation.
HiCBench and HiChunk together close a critical methodological gap, enabling practitioners and researchers to dissect, optimize, and compare chunking strategies within the broader RAG pipeline, with direct implications for performance in knowledge-intensive and long-document NLP applications (Lu et al., 15 Sep 2025).