Auto-Merge Retrieval Algorithm
- The Auto-Merge Retrieval Algorithm is a method that adaptively merges semantically related document chunks using hierarchical structure to construct coherent retrieval contexts.
- It systematically replaces fragmented child chunks with their parent node when multiple merging conditions—child count, text length threshold, and token budget—are met.
- Empirical evaluations show that this algorithm improves evidence recall and fact coverage in retrieval-augmented generation pipelines compared to conventional chunking methods.
The Auto-Merge Retrieval Algorithm denotes a class of algorithms developed to optimize the retrieval context in systems that combine hierarchical document segmentation with retrieval-augmented generation (RAG). By adaptively merging semantically related chunks so that the retrieved context is both maximally relevant and contextually coherent under a fixed token budget, Auto-Merge addresses the tendency of standard top-k chunk retrieval to produce fragmented or incomplete evidence. The principal formulation appears in the HiChunk framework for hierarchical chunking, where the algorithm achieves superior recall and fact coverage compared to fixed or semantic chunking without merge strategies (Lu et al., 15 Sep 2025).
1. Hierarchical Chunking and the Problem of Fragmentation
Hierarchical chunking involves segmenting documents at multiple semantic scales—for instance, paragraphs grouped into sections and further into chapters. When retrieval is subject to a fixed token limit, simple ranking and selection of the top-relevant atomic chunks can yield contexts that exclude higher-level semantic structures (such as parent paragraphs) even when their children are highly relevant. This results in retrieved contexts containing fragmented text, impairing the answer generation model’s ability to synthesize complete information.
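To make the setting concrete, here is a minimal sketch of such a chunk hierarchy (the `Chunk` class and its fields are illustrative choices, not a representation prescribed by the paper):

```python
# Minimal sketch of a hierarchical chunk tree: leaves are paragraph-level
# chunks; an internal node's text spans its children. (Illustrative
# structure only; the paper does not prescribe this representation.)
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Chunk:
    text: str
    parent: Optional["Chunk"] = None
    children: List["Chunk"] = field(default_factory=list)

    def add_child(self, child: "Chunk") -> None:
        child.parent = self
        self.children.append(child)

# A section node whose text covers its two paragraph-level children.
section = Chunk(text="Intro paragraph. Method paragraph.")
for para in ("Intro paragraph.", "Method paragraph."):
    section.add_child(Chunk(text=para))

# Retrieving the two leaves separately fragments the section; retrieving
# `section` itself yields one contiguous span at comparable token cost.
```

Under a tight token budget, a top-k retriever may select only one of the two leaves, which is exactly the fragmentation the merge step is designed to repair.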
The Auto-Merge retrieval algorithm is designed to utilize the document’s hierarchical structure to overcome these limitations, replacing sets of individual child chunks present in the retrieval context with their shared parent—provided that three explicit merging criteria are satisfied.
2. Algorithmic Process and Merging Conditions
The algorithm proceeds by scoring all available chunks for relevance to the query and successively assembling a retrieval context node set `node_ret` up to the token budget `T`:
- Add the highest-ranked remaining chunk `C_i` to `node_ret`; update the current token count `tk_cur`.
- For each added `C_i`, identify its parent node `p = parent(C_i)`.
- Check three merging conditions:
  - Child Count: at least two child nodes of `p` are present in `node_ret`.
  - Text Length Threshold: the total length of the retrieved child nodes of `p` meets the adaptive threshold `theta_star(tk_cur, p)`.
  - Token Budget Check: the remaining budget suffices to hold `p`, i.e. `T - tk_cur >= len(p)`.
If all three criteria are met, the child nodes of `p` present in `node_ret` are replaced by `p` itself, and `tk_cur` is updated accordingly. The process continues until either the token budget is exhausted or all ranked chunks have been processed.
3. Formal Pseudocode
The fundamental algorithmic steps are as follows (cf. Algorithm 2 in (Lu et al., 15 Sep 2025)):
```python
node_ret = []
tk_cur = 0
for C_i in C_sorted:
    node_ret.append(C_i)
    tk_cur += len(C_i)
    p = parent(C_i)
    while (sum(n in p.children for n in node_ret) >= 2
           and sum(len(n) for n in node_ret if n in p.children) >= theta_star(tk_cur, p)
           and (T - tk_cur) >= len(p)):
        # Replace the retrieved child nodes by merging into parent node p
        node_ret = [n for n in node_ret if n not in p.children]
        node_ret.append(p)
        tk_cur = sum(len(n) for n in node_ret)
    if tk_cur >= T:
        break
```
```python
def theta_star(tk_cur, p):
    return (len(p) / 3) * (1 + (tk_cur / T))
```
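The merge loop can be exercised end to end on a toy tree. The following is a self-contained sketch under stated assumptions: lengths are measured in characters rather than tokens, the relevance ranking is taken as given, and `Node` and `auto_merge` are illustrative names, not identifiers from the paper:

```python
# Toy run of the merge loop: one parent with three children, two of
# which are retrieved, triggering the upward merge into the parent.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Node:
    text: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

def theta_star(tk_cur: int, p: Node, T: int) -> float:
    # Adaptive threshold: len(p)/3, scaled up as token usage grows.
    return (len(p.text) / 3) * (1 + tk_cur / T)

def auto_merge(ranked: List[Node], T: int) -> List[Node]:
    node_ret, tk_cur = [], 0
    for c in ranked:
        node_ret.append(c)
        tk_cur += len(c.text)
        p = c.parent
        while p is not None and (
            sum(n in p.children for n in node_ret) >= 2
            and sum(len(n.text) for n in node_ret if n in p.children)
                >= theta_star(tk_cur, p, T)
            and T - tk_cur >= len(p.text)
        ):
            node_ret = [n for n in node_ret if n not in p.children]
            node_ret.append(p)
            tk_cur = sum(len(n.text) for n in node_ret)
        if tk_cur >= T:
            break
    return node_ret

# Parent of 120 chars with three 40-char children; budget T = 200.
parent = Node(text="x" * 120)
for _ in range(3):
    child = Node(text="y" * 40, parent=parent)
    parent.children.append(child)

merged = auto_merge(parent.children[:2], T=200)
print(merged == [parent])  # True: both children were replaced by their parent
```

With two retrieved children the length condition (80 ≥ theta_star = 56) and the budget condition (200 − 80 ≥ 120) both hold, so the children are merged into the parent; with only one child retrieved, the child-count condition blocks the merge.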
4. Evidence Recall, Fact Coverage, and Empirical Performance
In experimental evaluations using the HiCBench benchmark, Auto-Merge (denoted HC200+AM when combined with the Hierarchical Chunker) delivers a substantial improvement in metrics such as Evidence Recall (ERec) and Fact Coverage (FC). For example, using Qwen3-32B as the answer model, HC200+AM achieves an ERec of 81.03, exceeding competitive baselines (fixed chunking, semantic chunking, and hierarchical chunking without merge) (Lu et al., 15 Sep 2025). Robustness across retrieval context sizes (2K–4K tokens) is also demonstrated: the algorithm maintains high coverage and recall as illustrated by empirical curves in the paper.
These results suggest that Auto-Merge reliably packs more semantically coherent and complete evidence within a fixed token budget, directly benefiting downstream answer accuracy.
5. Semantic Integrity and Adaptive Context Construction
A core principle of Auto-Merge is the optimization of semantic integrity. By merging retrieved child chunks into their semantic parent when the conditions are met, the algorithm prevents context fragmentation, ensuring that the generation model receives contiguous spans of relevant information, with merging aggressiveness governed by the adaptive threshold `theta_star`, which increases with token usage.
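The adaptivity can be seen numerically in a short sketch (the budget `T` and the parent length below are made-up illustrative values):

```python
# Illustrative values only: a 4K-token budget and a 900-token parent.
T = 4000
parent_len = 900

def theta_star(tk_cur: int) -> float:
    # theta* = (len(p) / 3) * (1 + tk_cur / T)
    return (parent_len / 3) * (1 + tk_cur / T)

# Early in context assembly the bar to merge is low; as usage approaches
# the budget, the threshold doubles, making merges harder to trigger.
print(theta_star(0))   # 300.0 (base threshold, len(p)/3)
print(theta_star(T))   # 600.0 (doubled at full budget)
```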
This mechanism contrasts with order-preserving selection strategies that may omit critical parent-level context, and with naive chunking approaches that treat all text units independently.
6. Generalization and RAG Pipeline Impact
The algorithm’s general structure accommodates any hierarchical chunking regime and arbitrary relevance scoring mechanism. Integration into the RAG pipeline proceeds at the retrieval stage; the merged context is passed to the generation model for response synthesis. Systematic studies across multiple datasets indicate consistent improvement in metrics such as Rouge and Fact-Cov when Auto-Merge is employed (Lu et al., 15 Sep 2025).
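As a sketch of that integration, the following shows where the merge step sits in the pipeline; every name here (`rag_answer`, `score`, `merge_step`, `generate`) is a hypothetical stand-in, not an API from the paper:

```python
# Auto-Merge plugs in at the retrieval stage: rank leaf chunks, run the
# merge step under the token budget, then hand the merged context to the
# answer model. All names below are illustrative stand-ins.
def rag_answer(query, leaves, score, merge_step, generate, budget):
    ranked = sorted(leaves, key=lambda c: score(query, c), reverse=True)
    context = merge_step(ranked, budget)   # e.g. an Auto-Merge implementation
    prompt = query + "\n\n" + "\n\n".join(context)
    return generate(prompt)

# Toy wiring with trivial stand-ins, just to show the data flow.
answer = rag_answer(
    "q",
    ["relevant chunk", "other chunk"],
    score=lambda q, c: len(c),
    merge_step=lambda ranked, budget: ranked[:1],
    generate=lambda prompt: prompt.splitlines()[-1],
    budget=100,
)
print(answer)  # relevant chunk
```

The design keeps the merge step a drop-in replacement for plain top-k selection: only the retrieval stage changes, and the generator is agnostic to how the context was assembled.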
A plausible implication is that hierarchical chunking combined with Auto-Merge forms a strong foundation for evidence-dense and fact-complete retrieval in complex generation systems, bridging gaps exposed by traditional fixed-size or semantic chunkers.
7. Limitations and Adaptive Threshold Tuning
The adaptive nature of the merging threshold introduces a trade-off between aggressive merging and token overhead. As token usage approaches the budget, merging is delayed unless parent nodes offer a sufficiently condensed representation. Empirical ablation in (Lu et al., 15 Sep 2025) supports an intermediate setting of the threshold parameter as achieving the best balance, though further tuning may be required for varying document structures and answer models.
Care must also be taken in hierarchical construction; improper parent-child assignments may interfere with the algorithm’s merging logic, potentially leading to incomplete context.
In summary, the Auto-Merge Retrieval Algorithm implements adaptive hierarchical merging to maximize semantic completeness and efficiency in retrieval-augmented generation systems, validated by improved empirical performance on evidence-dense benchmarks. Its design choices—multi-condition merging, adaptive thresholding, and explicit token budgeting—systematically address deficiencies in conventional chunk selection, enabling better information retrieval and answer synthesis in RAG pipelines (Lu et al., 15 Sep 2025).