
HiCBench: Evaluating Document Chunking

Updated 22 September 2025
  • HiCBench is a benchmark for evaluating document chunking in RAG systems, using hierarchical chunk boundaries and evidence-dense QA tasks for precise measurement.
  • It employs the HiChunk framework and Auto-Merge algorithm to enhance segmentation quality, retrieval precision, and answer fidelity.
  • Empirical evaluations demonstrate improved factual consistency and recall across models, underscoring its value for long-document QA tasks.

HiCBench is a benchmark designed for the rigorous evaluation of document chunking within Retrieval-Augmented Generation (RAG) systems. Existing tooling rarely quantifies or diagnoses how document segmentation affects retrieval and downstream answer generation; HiCBench addresses this gap with a suite of manually annotated, hierarchically structured chunk boundaries and evidence-dense QA tasks. This enables the measurement of chunking quality and its influence on the end-to-end RAG pipeline, which is essential for tasks requiring comprehensive evidence recall and faithful answer synthesis.

1. Rationale and Motivation

HiCBench was developed to address the inadequacies of existing RAG evaluation benchmarks, which predominantly measure retrieval efficiency or reasoning quality while neglecting the critical role of document chunking. A central challenge is evidence sparsity: in most currently used datasets, queries are associated with only a handful of relevant sentences, making it difficult to assess whether document segmentation produces semantically coherent and self-contained chunks. HiCBench instead constructs evaluation scenarios, such as enumeration and summarization, in which evidence is distributed densely and non-locally across coherent semantic fragments, reflecting realistic information-seeking behaviors.

2. Benchmark Composition and Task Design

HiCBench is constructed with three core components:

  1. Hierarchically-annotated chunking points: Exhaustive, manual annotation of document boundaries at multiple granularities (sections, subsections, paragraphs) yields ground-truth chunk structures that serve as a gold standard for evaluating automated chunking strategies.
  2. Evidence-dense QA pairs: Synthetic question-answer pairs are engineered such that the supporting evidence is distributed over entire semantic chunks, counteracting the evidence sparsity seen in prior datasets.
  3. Evidence source mappings: Extensive, iterative LLM-powered extraction and filtering workflows are employed to ensure that the evidence supporting each answer is both complete and densely packed within the chunk or cross-chunk context.
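As an illustration, the hierarchically-annotated chunking points above can be modeled as a tree of sentence spans. The schema below is a hypothetical sketch of such an annotation structure, not the benchmark's actual data format; the field names and level numbering are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkNode:
    """One node in a hierarchical chunk tree (hypothetical schema).

    Levels are assumed: 1 = section, 2 = subsection, 3 = paragraph.
    Sentence indices are half-open: [start_sentence, end_sentence).
    """
    level: int
    start_sentence: int
    end_sentence: int
    children: list = field(default_factory=list)

    def add_child(self, child: "ChunkNode") -> None:
        # A child's span must nest inside its parent's span.
        assert self.start_sentence <= child.start_sentence
        assert child.end_sentence <= self.end_sentence
        self.children.append(child)

# A section covering sentences [0, 10) with two paragraph-level children.
section = ChunkNode(level=1, start_sentence=0, end_sentence=10)
section.add_child(ChunkNode(level=3, start_sentence=0, end_sentence=4))
section.add_child(ChunkNode(level=3, start_sentence=4, end_sentence=10))
```

Ground-truth trees of this shape make it straightforward to compare an automated chunker's predicted boundaries against the gold standard at each granularity.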

HiCBench tasks are explicitly stratified by evidence density and chunk coverage:

| Task Type | Description | Chunk Size (words) |
|-----------|-------------|--------------------|
| T₀ (Sparse) | Evidence confined to 1–2 sentences | N/A |
| T₁ (Single) | Evidence from a single complete semantic chunk | 512–4096 |
| T₂ (Multi) | Evidence spans multiple semantic chunks | 256–2048 per chunk |

This typology enables controlled experimentation on how chunking granularity and coherence affect retrieval precision and answer fidelity.

3. Chunking Methodology: HiChunk Framework

The HiChunk framework is designed to automate hierarchical document segmentation in support of HiCBench evaluation. Its methodology comprises:

  • LLM-based chunk identification: Framing chunk boundary detection and hierarchy assignment as a text generation task, with fine-tuned LLMs trained on datasets with explicit multilevel structure (such as Gov-report, Qasper, and Wiki-727).
  • Robustness augmentation: Augmenting training data by shuffling chapters or deleting content to increase resilience to document structure variations.
  • Iterative inference over long documents: For a document tokenized as sentences $S_{1:N}$, chunk boundary points are inferred as $GCP_{1:k} \leftarrow \mathrm{HiChunk}(S_{1:N})$, where $k$ reflects the number of hierarchical levels. This approach accommodates arbitrarily long inputs by processing document segments sequentially.

This provides not only precise alignment with human-annotated structures but also facilitates adaptive control over chunk size and granularity.
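The iterative inference step can be sketched as follows. Here `predict_window` stands in for the fine-tuned LLM, returning (local boundary index, hierarchy level) pairs for a window of sentences; the window size and the resume-from-last-boundary policy are illustrative assumptions, not the paper's exact procedure.

```python
def hichunk_boundaries(sentences, predict_window, window=200):
    """Sketch of iterative hierarchical boundary inference (HiChunk-style).

    `predict_window` is a stand-in for the fine-tuned LLM: given a window
    of sentences it returns (local_index, level) pairs. After each
    non-final window, inference resumes from the last predicted boundary
    so that no chunk is split across a window edge, which lets the loop
    cover arbitrarily long documents.
    """
    boundaries = []  # global (sentence_index, hierarchy_level) pairs
    start = 0
    while start < len(sentences):
        preds = predict_window(sentences[start:start + window])
        is_last_window = start + window >= len(sentences)
        if preds and not is_last_window:
            # Keep all but the final boundary; restart from the final one.
            for local_idx, level in preds[:-1]:
                boundaries.append((start + local_idx, level))
            start += max(preds[-1][0], 1)  # guard against zero progress
        else:
            for local_idx, level in preds:
                boundaries.append((start + local_idx, level))
            start += window
    return boundaries
```

The key design point is that each window's final boundary becomes the next window's origin, so boundary detection remains locally contextualized even for documents far longer than the model's context.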

4. Retrieval Enhancement: Auto-Merge Algorithm

Central to downstream retrieval quality is the Auto-Merge algorithm, which adaptively merges child nodes (chunks) into parent nodes based on token budget constraints and semantic density. The merging threshold is defined as:

$$\theta^*(tk_{cur}, p) = \frac{\mathrm{len}(p)}{3} \times \left(1 + \frac{tk_{cur}}{T}\right)$$

where $tk_{cur}$ is current token usage, $\mathrm{len}(p)$ is parent node length, and $T$ is the overall token budget. This adaptive mechanism increases selectivity as the amount of retrieved information grows, ensuring retention of semantically rich information while avoiding redundancy and fragmentation. Chunks with higher semantic value are prioritized for retention, optimizing the evidence coverage available to response models.
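The threshold and the resulting merge decision can be expressed directly. The decision rule in `auto_merge` (merge when the retrieved children's token mass meets the threshold) is our reading of the adaptive mechanism described above, not code from the paper.

```python
def merge_threshold(tk_cur: int, parent_len: int, budget: int) -> float:
    """Adaptive merging threshold theta*(tk_cur, p) = len(p)/3 * (1 + tk_cur/T).

    Grows with current token usage tk_cur, so merging child chunks into
    their parent becomes more selective as the context budget fills up.
    """
    return (parent_len / 3) * (1 + tk_cur / budget)

def auto_merge(parent_len, retrieved_child_tokens, tk_cur, budget):
    """Sketch: merge a parent's retrieved children into the parent node
    when their combined token mass reaches the adaptive threshold."""
    return sum(retrieved_child_tokens) >= merge_threshold(tk_cur, parent_len, budget)
```

For example, with an empty context (`tk_cur = 0`) a 300-token parent requires only 100 retrieved child tokens to merge, but at half budget the same parent requires 150, illustrating the increasing selectivity.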

5. Empirical Evaluation and Results

HiCBench and HiChunk were evaluated across Qasper, Gov-report, and the HiCBench dataset itself. Chunking quality is measured using F1 scores at varying hierarchical levels, with additional benchmarking in the full RAG pipeline (datasets including LongBench, Qasper, GutenQA, and OHRBench). Principal findings include:

  • Superior chunking accuracy: HiChunk (and its Auto-Merge variant, HC200+AM) achieved higher F1 scores for chunk boundary prediction than fixed-size (FC200), semantic (SC), or heuristic (LumberChunker, LC) baselines.
  • Enhanced RAG performance: In evidence-dense scenarios (tasks T₁ and T₂), the Auto-Merge variant led to the best evidence recall and answer generation quality, as quantified by Fact Coverage and ROUGE metrics.
  • Efficiency considerations: Semantic chunkers (SC) were faster but underperformed in segmentation accuracy; HiChunk achieved a balance between computational cost and chunk quality, making it pragmatic for large-scale document processing.
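For concreteness, boundary-prediction F1 under exact position matching can be computed as below; this is a standard formulation, and HiCBench's precise matching criterion (e.g., tolerance windows or per-level aggregation) may differ.

```python
def boundary_f1(predicted, gold):
    """F1 between predicted and gold chunk boundary positions at one
    hierarchy level, using exact-match true positives."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```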

The table below summarizes chunking task types and their properties as established in HiCBench:

| Task Type | Evidence Scope | Evaluation Utility |
|-----------|----------------|--------------------|
| T₀ | 1–2 sentences | Baseline for sparse-evidence chunking |
| T₁ | Single chunk (512–4096 words) | Tests internal consistency and completeness per semantic fragment |
| T₂ | Multiple chunks | Assesses cross-fragment evidence retrieval and answer composition |

6. Impact and Implications for RAG Systems

HiCBench provides the first systematic evaluation platform for determining how document chunking quality affects the global retrieval-augmented generation process. By introducing evidence-dense, multilayered QA tasks tied to gold-standard chunk boundaries, the benchmark enables fine-grained diagnosis of information loss, recall bottlenecks, and model hallucination sources. When the HiChunk framework and Auto-Merge mechanism are deployed within RAG systems, empirical results on models such as Llama3.1-8B, Qwen3-8B, and Qwen3-32B show measurable improvements in factual consistency and recall. This suggests that adaptive chunking strategies, guided by benchmarks like HiCBench, can directly enhance the reliability and factual grounding of long-document QA and synthesis applications.

A plausible implication is that future RAG pipelines incorporating HiCBench for both evaluation and chunking supervision will be better equipped to optimize the end-to-end retrieval and answer generation process, particularly as model and document scales continue to increase.
