
Auto-Merge Retrieval Algorithm

Updated 16 September 2025
  • The Auto-Merge Retrieval Algorithm is a method that adaptively merges semantically related document chunks using hierarchical structure to construct coherent retrieval contexts.
  • It systematically replaces fragmented child chunks with their parent node when multiple merging conditions—child count, text length threshold, and token budget—are met.
  • Empirical evaluations show that this algorithm improves evidence recall and fact coverage in retrieval-augmented generation pipelines compared to conventional chunking methods.

The Auto-Merge Retrieval Algorithm is a method developed to optimize the retrieval context in systems that unite hierarchical document segmentation and retrieval-augmented generation (RAG). By adaptively merging semantically related chunks so that retrieval is both maximally relevant and contextually coherent under a fixed token budget, Auto-Merge addresses the tendency of standard top-k chunk retrieval to produce fragmented or incomplete evidence. The principal formulation appears in the HiChunk framework for hierarchical chunking, where the algorithm achieves superior recall and fact coverage compared to fixed or semantic chunking without merge strategies (Lu et al., 15 Sep 2025).

1. Hierarchical Chunking and the Problem of Fragmentation

Hierarchical chunking involves segmenting documents at multiple semantic scales—for instance, paragraphs grouped into sections and further into chapters. When retrieval is subject to a fixed token limit, simple ranking and selection of the top-relevant atomic chunks can yield contexts that exclude higher-level semantic structures (such as parent paragraphs) even when their children are highly relevant. This results in retrieved contexts containing fragmented text, impairing the answer generation model’s ability to synthesize complete information.

The Auto-Merge retrieval algorithm is designed to utilize the document’s hierarchical structure to overcome these limitations, replacing sets of individual child chunks present in the retrieval context with their shared parent—provided that three explicit merging criteria are satisfied.
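To make the parent-child relationships concrete, a hierarchical chunk tree can be represented with a minimal node structure. This is an illustrative sketch only; the `ChunkNode` class, the `add_child` helper, and the whitespace-based token count are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ChunkNode:
    """One node in a hierarchical chunk tree (hypothetical structure)."""
    text: str
    parent: Optional["ChunkNode"] = None
    children: List["ChunkNode"] = field(default_factory=list)

    def add_child(self, child: "ChunkNode") -> "ChunkNode":
        # Wire up both directions of the parent-child link.
        child.parent = self
        self.children.append(child)
        return child

    def __len__(self) -> int:
        # Token length crudely approximated by whitespace word count.
        return len(self.text.split())

# Toy hierarchy: a section node with two paragraph children.
section = ChunkNode("Results overview. Ablation details.")
p1 = section.add_child(ChunkNode("Results overview."))
p2 = section.add_child(ChunkNode("Ablation details."))
```

With this shape, "replacing child chunks by their shared parent" amounts to swapping `p1` and `p2` for `section` in the retrieval set.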

2. Algorithmic Process and Merging Conditions

The algorithm proceeds by scoring all available chunks $C_1, C_2, \ldots, C_M$ for relevance to the query and successively assembling a retrieval-context node set $\text{node}_{\text{ret}}$ up to the token budget $T$:

  • Add the highest-ranked chunk $C_i$ to $\text{node}_{\text{ret}}$; update the current token count $tk_{\text{cur}}$.
  • For each $C_i$, identify its parent node $p$.
  • Check three merging conditions:

    1. Child Count: at least two child nodes of $p$ are present:

    $$\sum_{n \in \text{node}_{\text{ret}}} \mathbb{I}[n \in p.\text{children}] \geq 2$$

    2. Text Length Threshold: the total length of the retrieved child nodes of $p$ meets the adaptive threshold $\theta^*$:

    $$\sum_{n \in (\text{node}_{\text{ret}} \cap\, p.\text{children})} \text{len}(n) \geq \theta^*(tk_{\text{cur}}, p)$$

    $$\theta^*(tk_{\text{cur}}, p) = \frac{\text{len}(p)}{3} \left(1 + \frac{tk_{\text{cur}}}{T}\right)$$

    3. Token Budget: sufficient remaining token budget for $p$:

    $$T - tk_{\text{cur}} \geq \text{len}(p)$$

If all three criteria are met, the child nodes in $\text{node}_{\text{ret}}$ are merged upward into $p$, and $tk_{\text{cur}}$ is updated accordingly. The process continues until either the token budget $T$ is exhausted or all ranked chunks have been processed.
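The three conditions can be expressed as a single predicate. The snippet below is a minimal sketch, assuming dict-based nodes with a `"children"` list and a caller-supplied token-length function `tok`; identity comparison (`is`) is used for membership to avoid comparing node contents.

```python
def can_merge(node_ret, p, tk_cur, T, length):
    """Check the three Auto-Merge conditions for candidate parent p.
    `length` is a caller-supplied token-length function (an assumption here)."""
    in_ret = [n for n in node_ret if any(n is c for c in p["children"])]
    # Condition 1: at least two of p's children are already retrieved.
    if len(in_ret) < 2:
        return False
    # Condition 2: their combined length meets the adaptive threshold theta*.
    theta = (length(p) / 3) * (1 + tk_cur / T)
    if sum(length(n) for n in in_ret) < theta:
        return False
    # Condition 3: enough budget remains to hold the full parent text.
    return T - tk_cur >= length(p)

# Toy nodes: a 120-token section with two retrieved paragraph children.
c1, c2 = {"len": 40, "children": []}, {"len": 50, "children": []}
p = {"len": 120, "children": [c1, c2]}
tok = lambda n: n["len"]
```

For example, with both children retrieved at `tk_cur = 90` and `T = 1000`, the threshold is `(120/3) * 1.09 = 43.6`, which the combined child length of 90 exceeds, so the merge proceeds.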

3. Formal Pseudocode

The fundamental algorithmic steps are as follows (cf. Algorithm 2 in (Lu et al., 15 Sep 2025)):

node_ret = []
tk_cur = 0
for C_i in C_sorted:
    node_ret.append(C_i)
    tk_cur += len(C_i)
    p = parent(C_i)
    while p is not None and (
        sum(n in p.children for n in node_ret) >= 2 and
        sum(len(n) for n in node_ret if n in p.children) >= theta_star(tk_cur, p) and
        (T - tk_cur) >= len(p)
    ):
        # Replace the retrieved child nodes by merging them into parent p.
        # Filter into a new list rather than removing while iterating.
        node_ret = [n for n in node_ret if n not in p.children]
        node_ret.append(p)
        tk_cur = sum(len(n) for n in node_ret)
        p = parent(p)  # allow the merge to cascade one level higher
    if tk_cur >= T:
        break

where the adaptive threshold is

def theta_star(tk_cur, p):
    return (len(p) / 3) * (1 + (tk_cur / T))
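The pseudocode above can be fleshed out into a self-contained runnable sketch. The dict-based node layout, the identity-based membership checks, and the `length` convention are illustrative assumptions, not the paper's exact implementation.

```python
def auto_merge(ranked, T, length):
    """Greedy Auto-Merge context assembly under token budget T (sketch)."""
    node_ret, tk_cur = [], 0
    for c in ranked:
        node_ret.append(c)
        tk_cur += length(c)
        p = c["parent"]
        while p is not None:
            # Children of p already present in the retrieval set.
            in_ret = [n for n in node_ret if any(n is ch for ch in p["children"])]
            theta = (length(p) / 3) * (1 + tk_cur / T)
            ok = (len(in_ret) >= 2
                  and sum(length(n) for n in in_ret) >= theta
                  and T - tk_cur >= length(p))
            if not ok:
                break
            # Merge: swap the retrieved children for their parent.
            node_ret = [n for n in node_ret
                        if not any(n is ch for ch in p["children"])]
            node_ret.append(p)
            tk_cur = sum(length(n) for n in node_ret)
            p = p["parent"]  # try to cascade one level higher
        if tk_cur >= T:
            break
    return node_ret, tk_cur

# Toy tree: one 100-token section with two paragraph children.
c1 = {"len": 40, "children": [], "parent": None}
c2 = {"len": 50, "children": [], "parent": None}
sec = {"len": 100, "children": [c1, c2], "parent": None}
c1["parent"] = sec
c2["parent"] = sec

ctx, used = auto_merge([c1, c2], T=1000, length=lambda n: n["len"])
```

On this toy input, retrieving both paragraphs triggers a merge, so the returned context contains the single section node at its full 100-token length rather than two fragments.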

4. Evidence Recall, Fact Coverage, and Empirical Performance

In experimental evaluations using the HiCBench benchmark, Auto-Merge (denoted HC200+AM when combined with the Hierarchical Chunker) delivers a substantial improvement in metrics such as Evidence Recall (ERec) and Fact Coverage (FC). For example, using Qwen3-32B as the answer model, HC200+AM achieves an ERec of 81.03, exceeding competitive baselines (fixed chunking, semantic chunking, and hierarchical chunking without merge) (Lu et al., 15 Sep 2025). Robustness across retrieval context sizes (2K–4K tokens) is also demonstrated: the algorithm maintains high coverage and recall as illustrated by empirical curves in the paper.

These results suggest that Auto-Merge reliably packs more semantically coherent and complete evidence within a fixed token budget, directly benefiting downstream answer accuracy.

5. Semantic Integrity and Adaptive Context Construction

A core principle of Auto-Merge is the optimization of semantic integrity. By merging retrieved child chunks into their semantic parent when the conditions are met, the algorithm prevents context fragmentation, ensuring that the generation model receives contiguous spans of relevant information, with merging aggressiveness governed adaptively by $\theta^*$ (which increases with token usage).

This mechanism contrasts with order-preserving selection strategies that may omit critical parent-level context, and with naive chunking approaches that treat all text units independently.

6. Generalization and RAG Pipeline Impact

The algorithm’s general structure accommodates any hierarchical chunking regime and any relevance scoring mechanism. Integration into the RAG pipeline occurs at the retrieval stage; the merged context is passed to the generation model for response synthesis. Systematic studies across multiple datasets indicate consistent improvements in metrics such as ROUGE and Fact-Cov when Auto-Merge is employed (Lu et al., 15 Sep 2025).

A plausible implication is that hierarchical chunking combined with Auto-Merge forms a strong foundation for evidence-dense and fact-complete retrieval in complex generation systems, bridging gaps exposed by traditional fixed-size or semantic chunkers.

7. Limitations and Adaptive Threshold Tuning

The adaptive nature of the merging threshold $\theta^*$ introduces a trade-off between aggressive merging and token overhead. As token usage approaches the budget, merging is delayed unless parent nodes offer a sufficiently condensed representation. Empirical ablation in the cited work supports that a threshold between $\text{len}(p)/3$ and $2\,\text{len}(p)/3$ achieves the best balance, though further tuning may be required for varying document structures and answer models.
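This range follows directly from the threshold formula: with an empty context ($tk_{\text{cur}} = 0$) the threshold equals $\text{len}(p)/3$, and it rises to $2\,\text{len}(p)/3$ as usage reaches the budget. A quick numeric check (the 300-token parent and 4K budget are arbitrary illustrative values):

```python
def theta_star(len_p, tk_cur, T):
    # Adaptive threshold from Section 2: len(p)/3 * (1 + tk_cur / T).
    return (len_p / 3) * (1 + tk_cur / T)

empty_ctx = theta_star(300, 0, 4000)     # lower bound: len(p)/3 = 100.0
full_ctx = theta_star(300, 4000, 4000)   # upper bound: 2*len(p)/3 = 200.0
```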

Care must also be taken in hierarchical construction; improper parent-child assignments may interfere with the algorithm’s merging logic, potentially leading to incomplete context.


In summary, the Auto-Merge Retrieval Algorithm implements adaptive hierarchical merging to maximize semantic completeness and efficiency in retrieval-augmented generation systems, validated by improved empirical performance on evidence-dense benchmarks. Its design choices—multi-condition merging, adaptive thresholding, and explicit token budgeting—systematically address deficiencies in conventional chunk selection, enabling better information retrieval and answer synthesis in RAG pipelines (Lu et al., 15 Sep 2025).
