
CAT-ID²: Hierarchical Document Identifier

Updated 10 November 2025
  • CAT-ID² is a framework that integrates hierarchical category signals with learned representations to create unique, semantically meaningful document identifiers.
  • It employs residual quantization and multi-level supervision—including hierarchical class, cluster scale, and dispersion losses—to balance similarity preservation and uniqueness.
  • Empirical results show superior retrieval performance with improved Recall@10 metrics and scalable deployment in generative retrieval and large-scale e-commerce systems.

The Category-Tree Integrated Document Identifier (CAT-ID²) framework refers to a set of algorithms and system designs that generate, index, and retrieve discrete, semantically meaningful document identifiers by integrating explicit or latent category-tree structure. CAT-ID² methods combine hierarchical categorical signals with learned representations to produce identifier sequences that enable fast, accurate, and scalable retrieval—especially in generative retrieval and large-scale e-commerce systems. The framework incorporates both Bayesian and neural quantization paradigms, exploits multi-level supervision, and is extensible to multiple indexing and retrieval settings.

1. Motivation and Conceptual Foundations

CAT-ID² arose from the recognition that hierarchical category information is ubiquitous in document repositories (e.g., e-commerce catalogs, academic literature) and, if properly harnessed, enables both semantic locality and efficient search. Existing identifier (DocID) construction and retrieval systems historically ignored this structure or treated category and semantic modeling disjointly, limiting both representational adequacy and retrieval accuracy. CAT-ID² methods are constructed to enforce two central desiderata:

  • Similarity preservation: Semantically similar documents (or items) should be assigned identifiers that lie close together in code space, e.g., with small edit distances between their code sequences.
  • Uniqueness and dispersion: Each document must have a distinct, unambiguous identifier so that retrieval collisions do not occur, even as similar documents cluster in the code space.

A principal challenge is that these goals are in tension: over-clustering IDs causes collisions, while over-dispersing destroys generalization and routing efficiency. CAT-ID² reconciles these by explicit integration of the category tree during hierarchical code assignment, quantization, and loss function design.

2. Algorithmic Architecture: Quantization and Category Constraints

All contemporary CAT-ID² instantiations are built from a sequence of quantizer layers—typically residual quantization modules—with hierarchical categorical constraints applied to the corresponding levels.

2.1 Encoding and Residual Quantization

Given a corpus of documents $\{d_i\}$, each with an optional hierarchical category path $(c_i^1,\dots,c_i^H)$, an encoder $E(\cdot)$ (e.g., T5, BERT) maps $d_i$ to a dense embedding $\mathbf{z}_i \in \mathbb{R}^D$. Residual quantization (RQ-VAE) then sequentially quantizes this representation through $L$ layers, each with codebook $\mathcal{C}^l = \{\mathbf{e}^l_k\}_{k=1}^K$:

  • At layer $l$, the quantization input is $\mathbf{r}^l$ (with $\mathbf{r}^0 = \mathbf{z}$), scored against every codebook entry via

$$s^l_k = -\|\mathbf{r}^l - \mathbf{e}^l_k\|_2$$

with soft assignments $\mathbf{p}^l = \mathrm{softmax}(\mathbf{s}^l)$. The quantized code is $c^l = \arg\max_k p^l_k$, and the residual input for the next layer is updated as $\mathbf{r}^{l+1} = \mathbf{r}^l - \mathbf{e}^l_{c^l}$.

  • The final DocID is the concatenation $(c^0, c^1, \dots, c^{L-1})$, and the quantized embedding is $\hat{\mathbf{z}} = \sum_{l=0}^{L-1} \mathbf{e}^l_{c^l}$. A minimal code sketch of this loop is given after this list.
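
A minimal, illustrative PyTorch sketch of the residual-quantization loop above, assuming encoder embeddings and per-layer codebooks are already available; names and shapes are placeholders rather than the reference implementation:

```python
import torch

def residual_quantize(z, codebooks):
    """Assign a DocID (c^0, ..., c^{L-1}) to each embedding by residual quantization.

    z:         (B, D) dense document embeddings from the encoder E(.)
    codebooks: list of L tensors, each (K, D), one codebook per quantizer layer
    Returns integer codes (B, L) and the quantized reconstruction z_hat (B, D).
    """
    residual = z
    codes, z_hat = [], torch.zeros_like(z)
    for codebook in codebooks:                    # layer l = 0 .. L-1
        # scores s^l_k = -||r^l - e^l_k||_2 against every codebook entry
        scores = -torch.cdist(residual, codebook)
        c_l = scores.argmax(dim=-1)               # hard assignment c^l
        e_l = codebook[c_l]                       # selected entries e^l_{c^l}
        codes.append(c_l)
        z_hat = z_hat + e_l                       # z_hat = sum_l e^l_{c^l}
        residual = residual - e_l                 # r^{l+1} = r^l - e^l_{c^l}
    return torch.stack(codes, dim=-1), z_hat
```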

2.2 Hierarchical Losses

CAT-ID² introduces three structural loss terms at different quantization layers to incorporate category-tree information and address the trade-off between similarity and uniqueness:

  • Hierarchical Class Constraint Loss (HCCL): At each quantizer layer $l$ up to the category depth $H$, an InfoNCE-style contrastive loss pulls together the codes of documents in the same $(l+1)$-th level category while maximizing inter-category distances, with hard negative mining distinguishing closely related sibling categories.
  • Cluster Scale Constraint Loss (CSCL): Enforces uniform utilization of codebook entries per layer, penalizing collapse via a bidirectional KL divergence between the empirical assignment histogram $\bar{\mathbf{p}}^l$ and the uniform distribution.
  • Dispersion Loss (DisL): A contrastive loss over reconstructed and true embedding pairs keeps each document well separated in embedding space, enhancing identifier uniqueness.

The combined DocID objective is:

$$\mathcal{L}_{\mathrm{ID}} = \underbrace{\mathcal{L}_{\mathrm{rq}}}_{\text{quantizer commitment}} + \alpha\,\mathcal{L}_{\mathrm{Dis}} + \beta \sum_{l=0}^{H-1} \mathcal{L}^l_{\mathrm{HCC}} + \gamma \sum_{l=0}^{L-1} \mathcal{L}^l_{\mathrm{CSC}}$$

where $\alpha, \beta, \gamma$ are hyperparameters set according to empirical performance and training stability.
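
As a hedged illustration of how these terms could be combined, the sketch below implements the bidirectional KL of CSCL and the weighted sum $\mathcal{L}_{\mathrm{ID}}$, assuming the per-layer loss values have already been computed; it is a simplified reading of the objective above, not the released code:

```python
import torch

def cluster_scale_loss(p_soft):
    """Bidirectional KL between the batch-averaged assignment histogram and uniform.

    p_soft: (B, K) soft assignments p^l = softmax(s^l) for one quantizer layer.
    """
    p_bar = p_soft.mean(dim=0)                        # empirical codebook usage
    u = torch.full_like(p_bar, 1.0 / p_bar.numel())   # uniform target
    eps = 1e-8
    kl_pu = torch.sum(p_bar * torch.log((p_bar + eps) / (u + eps)))
    kl_up = torch.sum(u * torch.log((u + eps) / (p_bar + eps)))
    return kl_pu + kl_up

def docid_objective(l_rq, l_dis, l_hcc_per_layer, l_csc_per_layer,
                    alpha=0.1, beta=1e-4, gamma=1.0):
    """Combined DocID objective L_ID; default weights follow Section 6."""
    return (l_rq
            + alpha * l_dis
            + beta * sum(l_hcc_per_layer)    # layers 0 .. H-1
            + gamma * sum(l_csc_per_layer))  # layers 0 .. L-1
```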

After training, a Sinkhorn-based assignment remaps any colliding DocIDs to ensure injectivity.
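
A minimal sketch of one way such a Sinkhorn-based reassignment could look, under the assumption that colliding documents are matched to nearby free codes through a similarity matrix; the exact CAT-ID² remapping procedure may differ:

```python
import torch

def sinkhorn_assign(sim, n_iters=50, tau=0.05):
    """Resolve DocID collisions by a balanced assignment of documents to candidate codes.

    sim: (N, M) similarity between N colliding documents and M >= N free candidate codes.
    Sinkhorn iterations produce a near-doubly-stochastic plan; a greedy pass then reads
    off one distinct code per document, guaranteeing injectivity.
    """
    log_p = sim / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # row normalize
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # column normalize
    plan = log_p.exp()
    assigned, taken = [], set()
    # visit documents in order of assignment confidence, then pick their best free code
    for i in plan.max(dim=1).values.argsort(descending=True).tolist():
        code = next(c for c in plan[i].argsort(descending=True).tolist() if c not in taken)
        taken.add(code)
        assigned.append((i, code))
    return assigned
```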

3. Category-Tree Representation and Integration

Native category trees are treated as authoritative priors. Each document is indexed by its true or observed category path $(c_i^1,\dots,c_i^H)$, and the structure is injected explicitly into the early quantization layers via HCCL. For e-commerce catalogs, category paths deeper than three levels are truncated to depth 3; levels missing from shallower paths are simply omitted from the corresponding constraints. This implementation ensures that codes at each quantizer layer reflect successive levels of categorical granularity, from coarse to fine, so that search, retrieval, and clustering operations exploit both the tree's topology and learned embedding similarities.
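
The category-level supervision injected by HCCL can be illustrated with a simplified, batch-level InfoNCE sketch for one quantizer layer; hard-negative mining over sibling categories is omitted, and applying the loss to the layer's residual inputs $\mathbf{r}^l$ is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def hccl_layer_loss(residuals, category_ids, tau=0.1):
    """Pull together documents sharing the same category at this tree level.

    residuals:    (B, D) quantizer inputs r^l for the batch
    category_ids: (B,) integer category labels at the corresponding tree level
    """
    z = F.normalize(residuals, dim=-1)
    logits = z @ z.t() / tau                           # pairwise similarities
    logits.fill_diagonal_(float('-inf'))               # exclude self-pairs
    same_cat = category_ids.unsqueeze(0) == category_ids.unsqueeze(1)
    same_cat.fill_diagonal_(False)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average log-probability assigned to positives (same-category documents)
    pos_counts = same_cat.sum(dim=1).clamp(min=1)
    loss = -(log_prob * same_cat.float()).sum(dim=1) / pos_counts
    return loss[same_cat.any(dim=1)].mean()            # ignore docs with no positive
```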

Some CAT-ID² implementations, such as those extending differentiable vector quantization frameworks like CAGE, construct the tree in a fully latent manner, learning codebooks and hierarchical relationships directly from document representations and their interactions.

4. Practical Retrieval Workflow: Indexing and Query

In generative retrieval scenarios, CAT-ID²-assigned DocIDs become atomic tokens in an LLM's vocabulary. Training proceeds as follows:

  1. DocID Learning: Assign each document a discrete sequence through quantization, enforcing category-tree constraints.
  2. LLM Fine-tuning: Fine-tune a sequence-to-sequence model (e.g., T5) on (query, DocID) pairs, minimizing next-token cross-entropy.
  3. Inference: At runtime, a query is ingested by the LLM, which generates likely DocID token sequences via beam search. These are mapped back to documents in the index.
  4. Postprocessing: Early-stage cluster pruning and dense retrieval reranking may be applied to meet real-world latency budgets; in production deployments, retrieval latency remains within 100–200 ms per query using A100-class GPUs, with no significant memory overhead from DocID token expansion.

This approach enables direct, end-to-end document retrieval without the need for explicit query rewriting or multi-stage rankers.
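
An illustrative sketch of the inference side of this workflow using Hugging Face Transformers (a stack the source does not prescribe); the DocID token format, checkpoint name, and vocabulary sizes below are placeholders:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Each quantizer code (layer, index) becomes one atomic token, e.g. "<id_0_17>"
# for code 17 at layer 0; four layers of 512 codes are assumed for illustration.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
docid_tokens = [f"<id_{l}_{k}>" for l in range(4) for k in range(512)]
tokenizer.add_tokens(docid_tokens)
model.resize_token_embeddings(len(tokenizer))

def retrieve(query, k=10):
    """Generate the k most likely DocID token sequences for a query via beam search."""
    inputs = tokenizer(query, return_tensors="pt")
    beams = model.generate(**inputs,
                           num_beams=k,
                           num_return_sequences=k,
                           max_new_tokens=4)  # one DocID token per quantizer layer
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in beams]
```

In a production system the decoder would typically also be constrained (e.g., with a prefix trie over valid DocIDs) so that only identifier sequences present in the index can be emitted.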

5. Empirical Results and Quantitative Performance

CAT-ID² outperforms both sparse and dense retrieval baselines across multiple multilingual and domain-specific datasets. On e-commerce data (ESCI-us, ESCI-es, ESCI-jp), CAT-ID² achieves best-in-class Recall@k, offering a 15–20% relative gain over previous generative retrieval systems (e.g., TIGER, NCI). Variants with 512 codewords per quantization layer outperform smaller codebooks on large datasets, with four quantization layers empirically optimal.

| Model | Recall@10 (ESCI-us) | Recall@10 (ESCI-es) | Recall@10 (ESCI-jp) |
|---|---|---|---|
| BM25 | 5.68 | 5.19 | 3.23 |
| DPR | 6.77 | 4.64 | 4.45 |
| TIGER | 4.93 | 9.45 | 7.64 |
| CAT-ID² (256) | 5.86 | 10.14 | 7.89 |
| CAT-ID² (512) | 6.54 | 9.71 | 8.09 |

Online A/B experiments demonstrate that CAT-ID² increases average orders per 1,000 users by +0.33% for ambiguous queries and +0.24% for long-tail queries compared to strong multi-stage production baselines.

Ablation shows that all three loss terms (hierarchical, scale, dispersion) are necessary for maximal gain; omitting any leads to notably worse retrieval performance.

6. Implementation and Computational Considerations

  • Hardware & Efficiency: CAT-ID² is typically trained on eight A100 80 GB GPUs for 300 epochs, with full DocID training taking 24.8 h, roughly 20% slower than TIGER but substantially faster than general-purpose retrieval frameworks (DSI-naive, NCI).
  • Hyperparameters: Default configuration is $\alpha=0.1$, $\beta=10^{-4}$, $\gamma=1.0$; codebook size $K\in\{256, 512\}$, batch size 4096.
  • Hyperparameter Sensitivity: Catastrophic collapse occurs for small dispersion weights; excessive weights damage semantic locality. Category depth and codebook size exert U-shaped influences on code quality and collision rate.
  • Production Integration: The system is compatible with traditional ANN search infrastructures (e.g., Faiss), and hierarchical codes can be leveraged for coarse-to-fine partitioning.
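
For reference, the reported defaults can be collected into a single configuration object; the field names below are illustrative rather than taken from any released code:

```python
from dataclasses import dataclass

@dataclass
class CatIDConfig:
    """Default CAT-ID^2 DocID-training configuration as reported above."""
    num_layers: int = 4         # quantization layers L (empirically optimal, Section 5)
    codebook_size: int = 512    # K in {256, 512}
    category_depth: int = 3     # H, truncation depth of the category tree
    alpha: float = 0.1          # dispersion loss weight
    beta: float = 1e-4          # hierarchical class constraint weight
    gamma: float = 1.0          # cluster scale constraint weight
    batch_size: int = 4096
    epochs: int = 300
```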

7. Extensions and Relation to Prior Work

Earlier CAT-ID²-related frameworks address tree-structured categorical retrieval via index construction (e.g., colored-range reporting, wavelet trees, heavy-path decomposition) (Belazzougui et al., 2020). These approaches deliver worst-case optimal query time but focus on string pattern and category-level retrieval, not learned DocID representation.

Other document categorization models (e.g., HiMeCat (Zhang et al., 2020)) integrate hierarchical label dependencies, text, and metadata into shared embeddings for weakly supervised classification, achieving state-of-the-art F1 in low-label regimes. More recent neural extensions (e.g., CAGE (Liu et al., 2023)) generalize to hierarchical clustering and end-to-end learning for recommendation and retrieval, providing the architectural foundation for CAT-ID² generative retrieval platforms.

A plausible implication is that combining explicit category-tree priors with end-to-end quantization and contrastive learning offers both state-of-the-art accuracy and robust interpretability, especially in cases where domain ontologies are rich or maintained by external curators. This approach also supports efficient search space pruning, resilience to category errors, and robustness to missing or noisy supervisions.

8. Limitations and Future Directions

CAT-ID² performance is sensitive to codebook cardinality, hierarchy depth, and the trade-off weights among loss functions. Very high codebook cardinality can lead to code fragmentation and slow convergence. While the category-tree prior provides robust supervision in e-commerce and taxonomically structured corpora, domains lacking clear hierarchical structure may not benefit as strongly. Efficient Sinkhorn post-processing is necessary to guarantee identifier uniqueness at scale.

Further exploration is warranted in automatic category-tree induction, extension to arbitrarily deep or non-uniform trees, and joint modeling with user- or session-level attributes for recommendation. Emerging directions include integration with multi-modal generative models, application to scientific corpora with latent or evolving ontologies, and adaptation to open-set and continual learning settings.
