Hierarchical Retrieval (HiRetrieval)

Updated 16 November 2025
  • Hierarchical Retrieval (HiRetrieval) is a method that exploits explicit or implicit hierarchical structures in data to enhance retrieval performance.
  • It employs a multi-stage approach, using coarse retrieval to prune the search space and fine retrieval to rank relevant passages.
  • It boosts explainability and robustness by modeling relationships at multiple abstraction levels across diverse modalities.

Hierarchical Retrieval (HiRetrieval) designates a family of information retrieval methods that exploit explicit or implicit hierarchical structure within the corpus, queries, or target relevance signal to improve effectiveness, efficiency, explainability, or robustness. In contrast to conventional “flat” retrieval, where all candidate documents are treated as peers and indexed and scored without regard for semantic, structural, or multi-granular relationships, HiRetrieval imposes or extracts a tree, forest, or layered graph over the index and performs retrieval in a manner that respects and utilizes this hierarchy. Realizations of HiRetrieval span dense passage/document retrieval, generative models, multimodal collections, e-commerce, knowledge graphs, and image/video domains, with methods incorporating architectural, objective, and indexing techniques tailored to hierarchically organized data.

1. Hierarchical Retrieval: Core Principles and Formalization

HiRetrieval is instantiated whenever retrieval targets and/or query intent are best specified over nested, parent–child, or multi-level groups of entities, passages, or concepts. Fundamentally, the task generalizes pointwise match (find the one relevant passage/object) to setwise ancestral or hierarchical match, where the system must identify all relevant nodes in a hierarchy—or balance retrieval across multiple abstraction levels.

A canonical formalization, as articulated in (You et al., 19 Sep 2025), posits:

  • A hidden or explicit hierarchy is modeled as a directed acyclic graph $G = (D, E)$, typically a tree, with each document $x \in D$ linked to its parent(s).
  • For each query $q$, there is an exact match node $x_0 = E(q)$ (or more generally a set), and a set of relevant targets $S(q)$ (e.g., the ancestors of $x_0$, or descendants, or both).
  • The retrieval objective is to return a ranked list maximizing the presence and prioritization of all targets in $S(q)$, possibly across multiple hierarchy distances, as sketched below.
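
The following minimal Python sketch makes this setup concrete under simple assumptions: the hierarchy is a tree stored as a child-to-parent map, the target set $S(q)$ is the exact-match node plus its ancestors, and evaluation is setwise recall over a ranked list. All names are illustrative, not taken from any paper's released code.

```python
# Minimal sketch of the hierarchical-retrieval setup; names are illustrative.

def ancestors(node: str, parent: dict[str, str]) -> list[str]:
    """Return node plus all of its ancestors, walking a child->parent map."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

# Toy hierarchy: a tree given as child -> parent edges.
parent = {"poodle": "dog", "dog": "canine", "canine": "mammal", "mammal": "animal"}

# For a query whose exact-match node x0 is "poodle", the target set S(q)
# is x0 together with its ancestors.
S_q = set(ancestors("poodle", parent))

def recall_at_k(ranked: list[str], targets: set[str], k: int) -> float:
    """Setwise evaluation: fraction of S(q) present in the top-k ranked list."""
    return len(set(ranked[:k]) & targets) / len(targets)

ranked = ["poodle", "dog", "cat", "mammal", "canine", "animal"]
print(recall_at_k(ranked, S_q, k=4))  # 0.6: 3 of the 5 targets in the top 4
```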

This formalization underpins both classical dual-encoder approaches and modern LLM-guided or prototype-based methods. Architectural and training modifications are introduced to accommodate the asymmetry and transitivity intrinsic to hierarchy-based matching, properties that standard inner-product or Euclidean similarity does not capture.

2. Architectural Frameworks and Algorithms

2.1. Two-Stage Hierarchical Retrieval

Most prevailing dense HiRetrieval systems—such as DHR (Liu et al., 2021) and HiREC (Choe et al., 26 May 2025)—employ a cascaded retrieval pipeline:

  • Coarse Retrieval: A top-level retriever (e.g., query/document dual-encoder) prunes the search space by selecting the most relevant parent-level units (full documents, communities).
  • Fine Retrieval: A subordinate retriever (e.g., query/passage encoder or cross-encoder) operates within each selected parent to identify finer relevant sub-units (passages, paragraphs, fields).

In DHR, document-level retrieval precedes passage-level retrieval; the final passage ranking is calibrated by fusing passage and document scores:

$$f_\text{final}(q, p) = f_\psi(q, p) + \lambda \cdot f_\phi(q, d_p)$$

where $d_p$ is the parent document of $p$ and $\lambda$ is a tunable coefficient.
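
As a rough illustration of this cascade, the sketch below prunes to the top-scoring parent documents and then scores only passages whose parent survived, fusing the two scores with $\lambda$. It is a simplification, not DHR's actual implementation: encoders are stubbed with random vectors, and one query embedding serves both stages where DHR uses separate document- and passage-level query encoders.

```python
import numpy as np

def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores, best first."""
    return np.argsort(-scores)[:k]

def two_stage_retrieve(q, doc_embs, psg_embs, psg_to_doc, k_docs=2, lam=0.5):
    # Stage 1 (coarse): score all parent documents and keep the top k_docs.
    doc_scores = doc_embs @ q
    kept = set(top_k(doc_scores, k_docs).tolist())
    # Stage 2 (fine): score only passages whose parent survived pruning,
    # fusing passage and document scores:
    #   f_final(q, p) = f_psi(q, p) + lam * f_phi(q, d_p)
    fused = [(float(psg_embs[p] @ q + lam * doc_scores[d]), p)
             for p, d in enumerate(psg_to_doc) if d in kept]
    return sorted(fused, reverse=True)

# Toy example: 3 documents, 6 passages (2 per document).
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(3, 8))
psg_embs = rng.normal(size=(6, 8))
psg_to_doc = [0, 0, 1, 1, 2, 2]
query = rng.normal(size=8)
print(two_stage_retrieve(query, doc_embs, psg_embs, psg_to_doc))
```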

HiREC incorporates successive document-level dense recall and cross-encoder reranking, followed by passage-level scoring, further complemented by LLM-driven evidence curation.

2.2. Single-Stage and Path-Based Hierarchical Methods

Alternative strategies encode multi-granular context or traversal pathways directly:

  • Hierarchical Category Path Generation: HyPE (Lee et al., 8 Nov 2024) trains a generative model to first output a semantic path (coarse-to-fine categories) and then the document identifier, using external taxonomies for path construction.
  • Prototype and Tree-Based Representations: Cobweb (Gupta et al., 2 Oct 2025) learns a tree where internal nodes represent concept prototypes summarizing clusters at different granularities; queries are matched via tree traversals that yield interpretable multi-level relevance scores.
  • Latent Routing: ReTreever (Gupta et al., 11 Feb 2025) constructs a parameterized binary tree where each node learns a routing split function; assignment vectors at any level act as coarse-to-fine retrieval codes (see the sketch after this list).
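
To give the routing idea some shape, the hedged sketch below uses fixed random hyperplanes as each node's split function, whereas ReTreever learns these splits end-to-end; the left/right decisions down to a chosen depth form a coarse-to-fine code, and candidates sharing a longer code prefix with the query are "closer".

```python
import numpy as np

def routing_code(x: np.ndarray, splits: np.ndarray, depth: int) -> tuple[int, ...]:
    """Route x through a perfect binary tree of hyperplane splits."""
    code, node = [], 0
    for _ in range(depth):
        go_right = int(x @ splits[node] > 0.0)  # this node's split decision
        code.append(go_right)
        node = 2 * node + 1 + go_right          # heap-style child index
    return tuple(code)

rng = np.random.default_rng(0)
dim, depth = 16, 3
splits = rng.normal(size=(2**depth - 1, dim))   # one hyperplane per internal node

docs = rng.normal(size=(5, dim))
query = docs[2] + 0.01 * rng.normal(size=dim)   # near-duplicate of doc 2

# Longer shared code prefixes indicate finer-grained agreement with the query.
q_code = routing_code(query, splits, depth)
for i, d in enumerate(docs):
    c = routing_code(d, splits, depth)
    prefix = next((k for k in range(depth) if c[k] != q_code[k]), depth)
    print(f"doc {i}: code={c}, shared prefix={prefix}")
```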

2.3. LLM-Guided Hierarchical Navigation

LLM-based approaches such as LATTICE (Gupta et al., 15 Oct 2025) utilize an index tree built from semantic summaries at multiple abstraction levels. At query time, an LLM traverses the tree, evaluating child relevance at each node, with latent score calibration to compensate for context-dependent and noisy LLM judgments. Efficient, logarithmic search complexity ($O(\log N)$) is thus achieved with strong multi-hop reasoning performance.
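
A heavily simplified sketch of this traversal pattern follows. The LLM judge is stubbed with a token-overlap scorer, and the path-score blending only hints at LATTICE's actual calibration of noisy, context-dependent judgments; the tree, `llm_score`, and the blending weight `alpha` are all illustrative assumptions.

```python
def llm_score(query: str, summary: str) -> float:
    # Stand-in for an LLM relevance judgment over a node's semantic summary.
    return len(set(query.split()) & set(summary.split()))

def traverse(node, query, path_score=0.0, alpha=0.5):
    """Greedy root-to-leaf descent; a child's value blends its local score
    with the score accumulated along the path so far."""
    if not node.get("children"):               # leaf: a document
        return node["summary"], path_score
    best = max(node["children"],
               key=lambda c: llm_score(query, c["summary"]))
    blended = alpha * path_score + (1 - alpha) * llm_score(query, best["summary"])
    return traverse(best, query, blended, alpha)

tree = {"summary": "root",
        "children": [
            {"summary": "astronomy planets orbit",
             "children": [{"summary": "mars orbit period"},
                          {"summary": "jupiter moons"}]},
            {"summary": "biology cells proteins",
             "children": [{"summary": "protein folding"}]}]}

print(traverse(tree, "orbit of mars"))  # descends astronomy -> mars leaf
```

Because each step discards all but one branch, the number of LLM judgments grows with tree depth rather than collection size, which is where the $O(\log N)$ behavior comes from.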

2.4. Scheduling and Pruning in Multimodal Retrieval

HiMIR (Li et al., 10 Oct 2025) extends these principles to multi-granular, multi-vector modalities (e.g., images) by hierarchically segmenting objects and patches, matching queries at multiple scales, and dynamically pruning computation via cross-hierarchy signal consistency and convergence-based early exit.
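
The sketch below illustrates only the convergence-based early-exit idea in the spirit of HiMIR: scoring proceeds from coarse to fine granularity levels and stops once adding a level barely changes the running score. The aggregation rule and threshold are illustrative assumptions, not the paper's actual scheduling policy.

```python
import numpy as np

def score_with_early_exit(query, levels, tol=0.05):
    """levels: embedding matrices from coarse to fine (e.g. whole image,
    objects, patches). Stop descending once the running score converges."""
    running = None
    for depth, embs in enumerate(levels):
        level_best = float(np.max(embs @ query))       # best match at this level
        new = level_best if running is None else 0.5 * (running + level_best)
        if running is not None and abs(new - running) < tol:
            return new, depth                          # levels agree: early exit
        running = new
    return running, len(levels) - 1

rng = np.random.default_rng(1)
q = rng.normal(size=12)
levels = [rng.normal(size=(1, 12)),    # whole image
          rng.normal(size=(4, 12)),    # objects
          rng.normal(size=(16, 12))]   # patches
print(score_with_early_exit(q, levels))
```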

3. Loss Functions, Training Strategies, and Theoretical Properties

3.1. Contrastive Loss for Hierarchical Objectives

Training must reflect the structure of the hierarchy:

  • DHR (Liu et al., 2021) and CHARM (Freymuth et al., 30 Jan 2025) use standard InfoNCE loss applied at both coarse and fine levels, with hard negatives selected from intra-document or intra-section distractors (a minimal two-level sketch follows this list).
  • HAPPIER (Ramzi et al., 2022) introduces Hierarchical Average Precision (H-AP), which weights ranking inversions by their severity in the concept hierarchy and optimizes an upper-bound surrogate loss combined with a clustering regularizer.
  • In hyperbolic HiRetrieval for images (Wang et al., 26 Nov 2024), a contrastive entailment loss based on asymmetric angle metrics in hyperbolic space is designed to enforce ancestor-descendant relationships.
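
A minimal sketch of the two-level contrastive objective is given below: one InfoNCE term pulls the query toward its gold parent document, another toward the gold passage against hard in-document negatives. Shapes, the weighting `w`, and the temperature are illustrative assumptions rather than the papers' exact recipes.

```python
import torch
import torch.nn.functional as F

def info_nce(q, pos, negs, tau=0.05):
    """q, pos: (d,); negs: (n, d). Softmax contrastive loss, positive at index 0."""
    logits = torch.cat([(q @ pos).view(1), negs @ q]) / tau
    return F.cross_entropy(logits.view(1, -1), torch.zeros(1, dtype=torch.long))

def hierarchical_loss(q, d_pos, d_negs, p_pos, p_negs, w=0.5):
    # Coarse term: query vs. gold parent document, against document negatives.
    # Fine term: query vs. gold passage, against hard in-document negatives.
    return w * info_nce(q, d_pos, d_negs) + (1 - w) * info_nce(q, p_pos, p_negs)

d = 32
q = torch.randn(d, requires_grad=True)
loss = hierarchical_loss(q, torch.randn(d), torch.randn(4, d),
                         torch.randn(d), torch.randn(4, d))
loss.backward()          # differentiable end to end
print(float(loss))
```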

3.2. Hierarchy-Aware Negative Sampling and Hard Example Mining

Effective training for HiRetrieval often mandates careful negative sampling:

  • DHR employs BM25 negatives and “in-section”/“in-doc” negatives to produce hard positives/negatives within hierarchical structure.
  • In dual-encoder-based geometry (You et al., 19 Sep 2025), a key insight is that performance on shallow (close) ancestor retrieval masks potential collapse for long-distance pairs. A pretrain/fine-tune split, where shallow pairs dominate pretraining and distant ancestor pairs dominate fine-tuning (with high softmax temperature), overcomes this pathology and boosts deep recall from 1% to 32% or higher (WordNet, $m = 16$); a schematic of the two-phase schedule follows below. Embedding dimension requirements scale as $O(d \log N)$, where $d$ is the hierarchy depth and $N$ the collection size.
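
The sketch below shows the shape of that two-phase schedule under toy assumptions: a linear encoder, random pair samplers standing in for shallow versus distant ancestor pairs, and illustrative temperatures and step counts. The key point is that phase 2 raises the temperature so long-range positives are not crushed by easy negatives.

```python
import torch
import torch.nn.functional as F

def batch_info_nce(Q, P, tau):
    """In-batch contrastive loss: row i of Q matches row i of P."""
    logits = (Q @ P.T) / tau
    return F.cross_entropy(logits, torch.arange(Q.shape[0]))

encoder = torch.nn.Linear(16, 16)                       # toy dual encoder
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def train_phase(pairs, tau, steps):
    for _ in range(steps):
        Q, P = pairs()                                  # minibatch of (query, target)
        loss = batch_info_nce(encoder(Q), encoder(P), tau)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy samplers standing in for shallow vs. distant ancestor pairs.
shallow = lambda: (torch.randn(8, 16), torch.randn(8, 16))
distant = lambda: (torch.randn(8, 16), torch.randn(8, 16))

train_phase(shallow, tau=0.05, steps=100)  # phase 1: near ancestors, sharp softmax
train_phase(distant, tau=0.5, steps=20)    # phase 2: distant ancestors, high temperature
```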

3.3. Routing, Tree-Structured, and Logarithmic Scaling

Theoretical analyses (You et al., 19 Sep 2025, Gupta et al., 15 Oct 2025) affirm that with sufficient embedding dimensionality, dual-encoder or hierarchical codes can perfectly separate hierarchical labelings. LLM-guided or tree-structured approaches achieve $O(\log N)$ search but require robust cross-branch/path relevance aggregation to neutralize context effects.

4. Empirical Performance and Impact

HiRetrieval yields measurable improvements in both efficiency and accuracy over flat retrieval, especially in settings where:

  • Query intent naturally ladders up the abstraction hierarchy, requiring both specific and generic responses (e.g., synset ancestors in WordNet, section headings, product fields).
  • Retrieval budget—time, memory, or LLM context window—is stringent, and effective early pruning is necessary.
  • Explainability or multi-scale transparency is required.

Representative metrics from key works:

  • DHR (Liu et al., 2021): Top-1 passage retrieval on NQ improves from 40.1% (DPR) to 55.4%. Retrieval is 3–4× faster via document-level pruning.
  • HiRAG (Huang et al., 13 Mar 2025): On HotpotQA, EM rises from ~35% for the best baseline to 37%; on 2Wiki, EM rises from ~20–22% to 46.2%. Global win rates over flat baselines reach 87.6%.
  • CHARM (Freymuth et al., 30 Jan 2025): US-English Recall@10 for top-field modality rises from 33.61% (BiBERT) to 34.78%; explainability is improved by explicit query–field match signals.
  • HiREC (Choe et al., 26 May 2025): On LOFin financial QA, answer accuracy jumps from 29.22% (Dense+rerank) to 42.36% with fewer retrieved passages and less computation.
  • LATTICE (Gupta et al., 15 Oct 2025): Achieves up to 9% higher Recall@100 and 5% higher nDCG@10 over the next-best zero-shot baseline on the multi-hop BRIGHT corpus.
  • HiMIR (Li et al., 10 Oct 2025): Image retrieval nDCG@10 increases by 5 points, with up to 3.5× speedup versus multi-vector retrieval.

5. Interpretability, Explainability, and Transparency

A central appeal of hierarchical methods is their capacity for transparent, interpretable rationales:

  • Category-path or prototype-trace methods (e.g., HyPE (Lee et al., 8 Nov 2024), Cobweb (Gupta et al., 2 Oct 2025)) allow explanation of why a document/object was selected via a sequence of coarse-to-fine decisions.
  • In CHARM and similar models, query–field alignment is explicitly recorded and can be surfaced for explainability.
  • Empirical user studies demonstrate human preference for path-based explanations over atomic keywords or flat BM25 scores (HyPE: +6.1% R@10, with human preference over BM25 and title-only IDs).

6. Limitations, Open Issues, and Future Prospects

Despite strong empirical results, limitations remain:

  • Full hierarchy construction (at index time) can be compute- and token-intensive for very large, frequently updated corpora (LATTICE (Gupta et al., 15 Oct 2025), ArchRAG (Wang et al., 14 Feb 2025)).
  • Some methods depend on explicit or user-defined hierarchies, incurring heavy annotation or preprocessing cost (hyperbolic HiRetrieval (Wang et al., 26 Nov 2024)).
  • For LLM-guided navigators, noisy or context-sensitive local scoring necessitates additional calibration steps (LATTICE's path-relevance smoothing; ablations show up to a 3-point nDCG drop when it is omitted).
  • Tree-derived representations may not match the best dense baselines in the high-dimensional regime, though probabilistic/stochastic training can bridge the gap at lower dimensions (ReTreever (Gupta et al., 11 Feb 2025)).

Active directions, as identified in the literature:

  • Dynamic, updatable summaries and incremental tree rebalancing for rapidly changing corpora;
  • Joint learning of hierarchy and embeddings, with attention to weak or no supervision;
  • Extensions to multimodal and cross-domain collections, integrating image, audio, and text modalities into unified trees;
  • Deeper LLM-in-the-loop control, leveraging RL and adaptive prompt strategies.

7. Domain-Specific and Multimodal Extensions

HiRetrieval generalizes robustly across modalities:

  • Video: Two-level (video/moment) pipelines formalized for hierarchical moment retrieval (Zala et al., 2023).
  • E-commerce: Hierarchical multi-field product representations (CHARM) and structured block-triangular attention.
  • Graphs and Knowledge Bases: Attributed communities and layer-by-layer aggregation in ArchRAG (Wang et al., 14 Feb 2025) and HiRAG (Huang et al., 13 Mar 2025).
  • Images: Hierarchical scheduling for multi-object/part retrieval (HiMIR) and hyperbolic embedding for part–whole semantic alignment (Wang et al., 26 Nov 2024).

The diversity of architectures reflects the breadth of use cases: improved open-domain QA, explainable generative retrieval, robust multi-vector embedding, multimodal search, and high-efficiency, high-precision financial analytics.


In summary, HiRetrieval constitutes a broad paradigm shift from uniform, one-size-fits-all indexing to methods that explicitly model, leverage, and transmit the semantic, structural, or abstraction hierarchy of the corpus. This shift yields measurable advances in retrieval precision, efficiency, explainability, and robustness, independent of modality or downstream reasoning engine, and is increasingly foundational in the design of state-of-the-art information access systems.
