
HiCBench: Document Chunking Benchmark

Updated 16 September 2025
  • HiCBench is a benchmark that assesses document chunking quality in retrieval-augmented generation (RAG) systems using multi-level annotations.
  • It leverages evidence-dense QA tasks and is accompanied by the HiChunk framework, which improves retrieval precision and response fidelity.
  • Experimental evaluations reveal significant gains in chunking accuracy and overall RAG performance, highlighting its effectiveness for long-document processing.

HiCBench is a benchmark designed to evaluate the quality of document chunking within retrieval-augmented generation (RAG) systems, addressing a critical layer in the RAG pipeline that has previously lacked rigorous evaluation resources. By focusing explicitly on the chunking step and leveraging evidence-dense question-answer (QA) tasks and multi-level chunking annotations, HiCBench provides a framework for systematic, fine-grained assessment of how segmentation methods affect both retrieval and generated response fidelity. The benchmark is accompanied by HiChunk, a hierarchical chunking and retrieval system utilizing fine-tuned LLMs and an adaptive retrieval algorithm, resulting in demonstrably improved outcomes for downstream RAG tasks (Lu et al., 15 Sep 2025).

1. Motivation and Conceptual Foundation

Prevailing RAG evaluation benchmarks have largely emphasized either document retrieval or response generation accuracy, with minimal regard for the quality of document chunking itself. This oversight is particularly consequential in evidence-dense tasks, where relevant information is distributed over contiguous text fragments rather than isolated sentences. Existing benchmarks typically construct queries using only a few sentences as evidence, making them insensitive to disruptions in semantic segmentation and severely limiting diagnostic power at the chunking stage. HiCBench directly confronts this gap by establishing a benchmark tailored to the evaluation of chunking strategies, specifically for scenarios where evidence is extensive and semantically coherent, and where retrieval system performance and generation quality are affected by chunk boundaries.

2. Benchmark Construction and Core Components

HiCBench comprises a suite of meticulously annotated benchmark resources and synthetic QA tasks, structured as follows:

  • Manually Annotated Multi-Level Chunking Points: Documents selected from the OHRBench corpus are annotated with hierarchical chunking points at various granularities, such as section, subsection, and paragraph boundaries. These manual annotations allow for fine-grained assessment of how well chunking algorithms preserve inherent semantic structure.
  • Evidence-Dense QA Pair Synthesis: QA instances are generated via a multi-step, prompt-driven process. Initially, section-level summaries are created. Then, grounded QA pairs are synthesized using both context from annotated chunks and these summaries. QA tasks are diversified into three core types:
    • T₀: Evidence-sparse QAs, with only one or two sentences as supporting material.
    • T₁: Single-chunk evidence-dense QAs, where the evidence comprises one semantically coherent chunk (512–4096 words).
    • T₂: Multi-chunk evidence-dense QAs, where supporting evidence spans multiple semantic chunks (256–2048 words).
  • Evidence Extraction and Validation: For each QA, LLM-based extraction identifies the precise sentences constituting valid evidence. This process includes multiple runs and filtering steps to ensure high evidence density (≥10% of the context) and fact consistency (measured by Fact-Cov), yielding QA pairs that are highly sensitive to segmentation errors (see the sketch following the summary table below).
Component | Description | Role in Benchmark
Multi-Level Annotations | Manual chunk boundaries at multiple semantic levels | Ground truth for chunking
Synthesized QA Pairs | Evidence-dense, task-diverse QAs grounded in context | Chunk-sensitivity evaluation
Evidence Sources | LLM-extracted, fact-filtered sentence sets | Ensures chunking impact is measurable
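
The density criterion above can be made concrete with a minimal sketch. This is illustrative rather than the authors' released code; the word-level counting and function names are assumptions:

```python
from typing import List

def evidence_density(evidence_sentences: List[str], context: str) -> float:
    """Fraction of the context, by word count, covered by evidence sentences."""
    evidence_words = sum(len(s.split()) for s in evidence_sentences)
    context_words = len(context.split())
    return evidence_words / max(context_words, 1)

def keep_qa_pair(evidence_sentences: List[str], context: str,
                 min_density: float = 0.10) -> bool:
    """Retain a synthesized QA pair only if its evidence density is >= 10%."""
    return evidence_density(evidence_sentences, context) >= min_density
```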

3. HiChunk Framework and Algorithms

HiChunk is the hierarchical document segmentation framework paired with HiCBench, designed to produce multi-granular chunk boundaries and optimize retrieval granularity for RAG tasks. Notable aspects include:

  • Hierarchical Structuring with Fine-Tuned LLMs: Given a document $S_{1:N}$ (a list of $N$ sentences), a fine-tuned LLM identifies global chunking points $GCP_{1:k}$ across multiple structural levels, capturing both coarse and fine-grained semantic units.
  • Iterative Inference Strategy: For documents surpassing the LLM's input token limit $L$, chunking is performed in a sliding, windowed fashion: for each window $S_{a:b}$ with $|S_{a:b}| \leq L$, local chunk points $LCP_{1:k}$ are inferred and integrated with the global set $GCP_{1:k}$. Careful bookkeeping (such as residual-text handling) mitigates hierarchical drift, where segmentation could otherwise become inconsistent at window boundaries.

Window selection:

$$b = \arg\max_{b} |S_{a:b}| \quad \text{s.t.} \quad |S_{a:b}| \leq L$$
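
A minimal sketch of the iterative strategy, assuming a single chunking level for brevity; `infer_chunk_points` stands in for the fine-tuned LLM and, like `token_len`, is a hypothetical helper rather than the released HiChunk API:

```python
from typing import Callable, List

def hierarchical_chunk(sentences: List[str],
                       token_len: Callable[[str], int],
                       infer_chunk_points: Callable[[List[str]], List[int]],
                       L: int) -> List[int]:
    """Windowed chunk-point inference for documents exceeding the LLM limit L.

    `infer_chunk_points` returns window-relative sentence indices where new
    chunks begin; HiChunk's multi-level output is collapsed to one level here.
    """
    global_points: List[int] = []
    a, n = 0, len(sentences)
    while a < n:
        # Window selection: b = argmax_b |S_{a:b}| subject to |S_{a:b}| <= L.
        b, used = a, 0
        while b < n and used + token_len(sentences[b]) <= L:
            used += token_len(sentences[b])
            b += 1
        b = max(b, a + 1)  # always consume at least one sentence
        local_points = infer_chunk_points(sentences[a:b])
        global_points.extend(a + p for p in local_points)
        # Resume from the last local boundary so trailing residual text is
        # re-chunked with fresh context; fall back to b if we would stall.
        next_a = a + max(local_points) if local_points else b
        a = next_a if next_a > a else b
    return sorted(set(global_points))
```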

  • Auto-Merge Retrieval Algorithm: Post-chunking, retrieval of context for the LLM is guided by an adaptive algorithm that:

    • Traverses query-ranked chunk candidates $C_{[1:M]}$.
    • Applies an upward "merge" to a parent chunk $p$ when three conditions are met:
      • At least two of $p$'s children have entered the retrieval set: $\sum_{n \in \text{node}_{\text{ret}} \wedge n \in p.\text{children}} 1 \geq 2$.
      • The aggregate retrieved token count surpasses a dynamic threshold: $\theta^\ast(\text{tk}_{\text{cur}}, p) = \frac{\text{len}(p)}{3} \times \left(1 + \frac{\text{tk}_{\text{cur}}}{T}\right)$.
      • The remaining token budget is sufficient to fit the parent chunk.

    This mechanism ensures semantic completeness and relevance while respecting model input constraints.
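
The merge logic can be sketched as follows, assuming each chunk carries a token count and parent/child links; the exact quantity compared against $\theta^\ast$ and the handling of cascaded merges are simplifications, not the paper's precise procedure:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Chunk:
    text: str
    tokens: int
    parent: Optional["Chunk"] = None
    children: List["Chunk"] = field(default_factory=list)

def auto_merge(ranked: List[Chunk], T: int) -> List[Chunk]:
    """Greedy sketch of auto-merge retrieval over query-ranked chunks C_[1:M]."""
    retrieved: List[Chunk] = []
    tk_cur = 0
    for c in ranked:
        if tk_cur + c.tokens > T:
            continue  # chunk no longer fits the token budget
        retrieved.append(c)
        tk_cur += c.tokens
        p = c.parent
        if p is None:
            continue
        in_set = [r for r in retrieved if r in p.children]
        theta = (p.tokens / 3) * (1 + tk_cur / T)  # dynamic merge threshold
        fits = tk_cur - sum(s.tokens for s in in_set) + p.tokens <= T
        if len(in_set) >= 2 and tk_cur > theta and fits:
            for s in in_set:  # replace retrieved children with their parent
                retrieved.remove(s)
                tk_cur -= s.tokens
            retrieved.append(p)
            tk_cur += p.tokens
    return retrieved
```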

4. Experimental Evaluation and Results

The efficacy of HiCBench and HiChunk was demonstrated in extensive experiments on Qasper, GutenQA, OHRBench, LongBench RAG, and HiCBench itself. Key quantitative findings include:

  • Chunking Accuracy: On Qasper, F₁ for chunk point detection at level 1 increased from 0.5481 (LumberChunker, LC) to 0.6742 (HiChunk, HC), evidencing more faithful reproduction of ground-truth structure.
  • RAG Pipeline Metrics: Replacing rule-based or semantic chunkers (e.g., FC200, LC) with HiChunk and its auto-merge variant (HC200+AM) yielded higher evidence recall (ERec), Fact-Cov, and Rouge scores; a sketch of an evidence-recall-style metric follows this list. For instance, on HiCBench QA tasks, HC200+AM improved ERec from 74.06 to 81.03 in some settings and enhanced answer quality across response models.
  • Retrieval Token Budget Sensitivity: Experiments spanning token budgets from 2k to 4k tokens showed monotonic improvements in response quality with increasing budget, and HC200+AM consistently outperformed all baselines at each setting.
  • Efficiency: HiChunk offered the best overall quality–speed trade-off. While semantic chunkers have lower time cost, they perform markedly worse on chunking accuracy and RAG answer quality. HiChunk remains practical for real-time or batch processing of lengthy documents.
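
As noted in the list above, evidence recall (ERec) measures how much of the ground-truth evidence the retrieved chunks actually contain. A minimal sketch, assuming sentence-level substring matching (the benchmark's exact matching procedure may differ):

```python
from typing import List

def evidence_recall(evidence_sentences: List[str],
                    retrieved_chunks: List[str]) -> float:
    """Fraction of ground-truth evidence sentences present in retrieved text."""
    retrieved_text = " ".join(retrieved_chunks)
    hits = sum(1 for s in evidence_sentences if s in retrieved_text)
    return hits / max(len(evidence_sentences), 1)
```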

5. Implications for RAG System Development

The introduction of HiCBench and HiChunk has several notable implications:

  • Diagnosis of Chunking-Related Bottlenecks: By isolating the document segmentation step, HiCBench enables researchers to systematically evaluate and refine chunkers, uncovering failures that would otherwise be masked by retrieval or generation metrics.
  • Granular Control for Retrieval: Multi-level chunking allows systems to dynamically adjust retrieved context size and granularity per query, yielding greater evidence recall and more contextually appropriate generation.
  • Improved End-to-End Performance: Empirical results substantiate that precise chunking not only aids retrieval but materially benefits response factuality and coverage, as measured by Fact-Cov and Rouge on multiple RAG datasets.
  • Scalability and Robustness: HiChunk’s iterative inference accommodates very long inputs, handling documents that exceed fixed context windows and supporting robust RAG across diverse text sources, including academic manuscripts and multi-section reports.

6. Technical Innovations and Specialty Formulas

HiChunk’s integration of hierarchical chunk point inference and adaptive merging for retrieval is governed by formulas such as:

$$\theta^\ast(\text{tk}_{\text{cur}}, p) = \frac{\text{len}(p)}{3} \times \left(1 + \frac{\text{tk}_{\text{cur}}}{T}\right)$$

where $\text{tk}_{\text{cur}}$ is the current aggregate retrieval token count, $p$ denotes a parent chunk, and $T$ is the total token budget. This, combined with explicit multi-level chunk annotation and iterative local-global merging strategies, constitutes the principal algorithmic contribution underpinning the observed improvements.
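
For illustration (the numbers are hypothetical, not from the paper): with a parent chunk of $\text{len}(p) = 900$ tokens, a current aggregate count $\text{tk}_{\text{cur}} = 1000$, and a total budget $T = 4000$, the threshold evaluates to

$$\theta^\ast = \frac{900}{3} \times \left(1 + \frac{1000}{4000}\right) = 300 \times 1.25 = 375,$$

so a merge into $p$ is considered only once the aggregate retrieved token count exceeds 375 and the parent still fits the remaining budget.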

7. Significance, Limitations, and Outlook

HiCBench establishes a new paradigm for rigorous evaluation of chunking in evidence-dense retrieval-augmented generation, refining the granularity and precision of RAG system analysis. While the benchmark and framework represent a substantial advance, a plausible implication is that future work may extend these resources to more diverse corpora or multi-modality, further generalizing the efficacy of chunking-aware RAG evaluation.

HiCBench and HiChunk together close a critical methodological gap, enabling practitioners and researchers to dissect, optimize, and compare chunking strategies within the broader RAG pipeline, with direct implications for performance in knowledge-intensive and long-document NLP applications (Lu et al., 15 Sep 2025).

