CAT-ID²: Hierarchical Document Identifier
- CAT-ID² is a framework that integrates hierarchical category signals with learned representations to create unique, semantically meaningful document identifiers.
- It employs residual quantization and multi-level supervision—including hierarchical class, cluster scale, and dispersion losses—to balance similarity preservation and uniqueness.
- Empirical results show superior retrieval performance with improved Recall@10 metrics and scalable deployment in generative retrieval and large-scale e-commerce systems.
The Category-Tree Integrated Document Identifier (CAT-ID) framework refers to a set of algorithms and system designs that generate, index, and retrieve discrete, semantically meaningful document identifiers by integrating explicit or latent category-tree structure. CAT-ID methods combine hierarchical categorical signals with learned representations to produce identifier sequences that enable fast, accurate, and scalable retrieval—especially in generative retrieval and large-scale e-commerce systems. The framework incorporates both Bayesian and neural quantization paradigms, exploits multi-level supervision, and is extensible to multiple indexing and retrieval settings.
1. Motivation and Conceptual Foundations
CAT-ID arose from the recognition that hierarchical category information is ubiquitous in document repositories (e.g., e-commerce catalogs, academic literature) and, if properly harnessed, enables both semantic locality and efficient search. Existing identifier (DocID) construction and retrieval systems historically ignored this structure or treated category and semantic modeling disjointly, limiting both representational adequacy and retrieval accuracy. CAT-ID methods are constructed to enforce two central desiderata:
- Similarity preservation: Semantically similar documents (or items) should be assigned identifiers that are close in the code space, e.g., with small edit distance between their token sequences.
- Uniqueness and dispersion: Each document must have a distinct, unambiguous identifier so that retrieval collisions do not occur, even as similar documents cluster in the code space.
A principal challenge is that these goals are in tension: over-clustering IDs causes collisions, while over-dispersing destroys generalization and routing efficiency. CAT-ID reconciles these by explicit integration of the category tree during hierarchical code assignment, quantization, and loss function design.
2. Algorithmic Architecture: Quantization and Category Constraints
All contemporary CAT-ID instantiations are built from a sequence of quantizer layers—typically residual quantization modules—with hierarchical categorical constraints applied to the corresponding levels.
2.1 Encoding and Residual Quantization
Given a corpus of documents, each document $d$ with an optional hierarchical category path $c_d = (c_d^{1}, \dots, c_d^{H})$, an encoder (e.g., T5, BERT) maps $d$ to a dense embedding $\mathbf{z}_d \in \mathbb{R}^{n}$. Residual quantization (RQ-VAE) then sequentially quantizes this representation through $L$ layers, each with codebook $\mathcal{C}^{(l)} = \{\mathbf{e}^{(l)}_k\}_{k=1}^{K}$:
- At layer $l$, the quantization input is the residual $\mathbf{r}^{(l)}$ (with $\mathbf{r}^{(1)} = \mathbf{z}_d$), matched to its nearest codebook entry via
  $$k^{(l)} = \arg\min_{k} \left\lVert \mathbf{r}^{(l)} - \mathbf{e}^{(l)}_k \right\rVert_2^2,$$
  with soft assignments $p^{(l)}_k \propto \exp\!\big(-\lVert \mathbf{r}^{(l)} - \mathbf{e}^{(l)}_k \rVert_2^2\big)$ computed for the losses below. The quantized code is $\mathbf{e}^{(l)}_{k^{(l)}}$, and the residual input for the next layer is updated as $\mathbf{r}^{(l+1)} = \mathbf{r}^{(l)} - \mathbf{e}^{(l)}_{k^{(l)}}$.
- The final DocID is the concatenation $(k^{(1)}, k^{(2)}, \dots, k^{(L)})$. The quantized embedding is $\hat{\mathbf{z}}_d = \sum_{l=1}^{L} \mathbf{e}^{(l)}_{k^{(l)}}$.
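A minimal sketch of this greedy residual-quantization step is given below, assuming a NumPy setting with per-layer codebooks already learned; the function and variable names are illustrative, not the reference implementation.

```python
import numpy as np

def assign_docid(z, codebooks):
    """Greedy residual quantization: map a dense embedding z to a DocID.

    z         : (n,) document embedding from the encoder
    codebooks : list of L arrays, each (K, n), one codebook per quantizer layer
    returns   : (docid, z_hat), where docid is a tuple of L code indices and
                z_hat is the reconstructed (quantized) embedding
    """
    residual = z.copy()
    z_hat = np.zeros_like(z)
    docid = []
    for codebook in codebooks:                    # layer l = 1..L
        dists = np.sum((codebook - residual) ** 2, axis=1)
        k = int(np.argmin(dists))                 # nearest codeword index
        docid.append(k)
        z_hat += codebook[k]                      # accumulate quantized code
        residual = residual - codebook[k]         # pass residual to next layer
    return tuple(docid), z_hat

# toy usage: 4 layers, 256 codewords per layer, 64-dim embeddings
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]
z = rng.normal(size=64)
docid, z_hat = assign_docid(z, codebooks)
print(docid)   # e.g. (k1, k2, k3, k4), later serialized as DocID tokens
```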
2.2 Hierarchical Losses
CAT-ID introduces three structural loss terms at different quantization layers to incorporate category-tree information and address the trade-off between similarity and uniqueness:
- Hierarchical Class Constraint Loss (HCCL): At each quantizer layer $l$ up to the category depth $H$, an InfoNCE-style contrastive loss ensures that codes for documents in the same level-$l$ category are close, while inter-category distances are maximized, with hard negative mining distinguishing closely related siblings.
- Cluster Scale Constraint Loss (CSCL): Enforces uniform utilization of codebook entries per layer, penalizing collapse via bidirectional KL divergence between the empirical assignment histogram and the uniform distribution.
- Dispersion Loss (DisL): Contrastive loss over reconstructed and true embedding pairs ensures that each document is well-separated in embedding space, enhancing identifier uniqueness.
The combined DocID objective is
$$\mathcal{L}_{\text{DocID}} = \mathcal{L}_{\text{RQ}} + \alpha\,\mathcal{L}_{\text{HCCL}} + \beta\,\mathcal{L}_{\text{CSCL}} + \gamma\,\mathcal{L}_{\text{DisL}},$$
where $\mathcal{L}_{\text{RQ}}$ is the residual-quantization reconstruction and commitment loss, and $\alpha$, $\beta$, $\gamma$ are hyperparameters set according to empirical performance and computational stability.
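The PyTorch sketch below illustrates plausible forms of the three structural terms and their weighted combination; the exact formulations, temperatures, and default weights in CAT-ID may differ, and all helper names here are illustrative only.

```python
import torch
import torch.nn.functional as F

def cscl(assign_probs):
    """Cluster Scale Constraint Loss: bidirectional KL between the batch's
    empirical codeword-usage distribution and the uniform distribution."""
    usage = assign_probs.mean(dim=0)                      # (K,) usage histogram
    uniform = torch.full_like(usage, 1.0 / usage.numel())
    kl = lambda p, q: (p * (p.clamp_min(1e-9) / q.clamp_min(1e-9)).log()).sum()
    return kl(usage, uniform) + kl(uniform, usage)

def hccl(codes, labels, temperature=0.1):
    """Hierarchical Class Constraint (InfoNCE): documents sharing the layer's
    category label are positives, all other in-batch documents are negatives.
    Assumes each document has at least one in-batch positive."""
    codes = F.normalize(codes, dim=-1)
    sim = codes @ codes.t() / temperature                 # (B, B) similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    mask = ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    log_prob = sim - torch.logsumexp(sim.masked_fill(~mask, -1e9), dim=1, keepdim=True)
    pos = same & mask
    return -(log_prob[pos]).mean()

def dispersion(z_hat, z, temperature=0.1):
    """Dispersion Loss: each reconstructed embedding should match its own
    document embedding and be separated from all others in the batch."""
    z_hat, z = F.normalize(z_hat, dim=-1), F.normalize(z, dim=-1)
    logits = z_hat @ z.t() / temperature
    targets = torch.arange(len(z), device=z.device)
    return F.cross_entropy(logits, targets)

def docid_loss(recon_loss, assign_probs, codes, labels, z_hat, z,
               alpha=1.0, beta=1.0, gamma=1.0):
    # alpha/beta/gamma stand in for the (unspecified here) default weights
    return recon_loss + alpha * hccl(codes, labels) + beta * cscl(assign_probs) \
           + gamma * dispersion(z_hat, z)
```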
After training, a Sinkhorn-based assignment remaps any colliding DocIDs to ensure injectivity.
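As a rough illustration of this collision-removal step, the sketch below runs Sinkhorn normalization on a square cost matrix between colliding documents and candidate free codewords and then decodes an assignment; CAT-ID's actual procedure may differ (e.g., an exact matching could be run on the resulting transport plan to guarantee injectivity).

```python
import numpy as np

def sinkhorn_reassign(cost, n_iters=50, eps=0.05):
    """Sinkhorn normalization of exp(-cost/eps) toward a doubly stochastic plan.
    cost[i, j]: distance from colliding document i to free codeword j (square)."""
    P = np.exp(-cost / eps)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)   # row normalize: one code per document
        P /= P.sum(axis=0, keepdims=True)   # column normalize: one document per code
    return P.argmax(axis=1)                 # greedy decode of the transport plan
```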
3. Category-Tree Representation and Integration
Native category trees are treated as authoritative priors. Each document is indexed by its true or observed category path, and the structure is injected explicitly into the early quantization layers via HCCL. For e-commerce catalogs, category paths deeper than three levels are truncated to depth 3; for shallower paths, the missing levels are simply omitted. This implementation ensures that codes at each quantizer layer reflect successive levels of categorical granularity—from coarse to fine—so that search, retrieval, and clustering operations exploit both the tree's topology and learned embedding similarities.
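For illustration, a small (hypothetical, not from the paper) helper showing how a category path could be turned into per-layer supervision labels for HCCL after depth-3 truncation:

```python
def layer_labels(category_path, depth=3):
    """Per-layer HCCL labels: the category-path prefix visible to each of the
    first `depth` quantizer layers; deeper levels are truncated."""
    path = list(category_path)[:depth]
    return [tuple(path[: l + 1]) for l in range(len(path))]

# ("Electronics", "Audio", "Headphones", "Wireless") ->
# [("Electronics",), ("Electronics", "Audio"), ("Electronics", "Audio", "Headphones")]
print(layer_labels(("Electronics", "Audio", "Headphones", "Wireless")))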
Some CAT-ID implementations, such as those extending differentiable vector quantization frameworks (e.g., CAGE), construct the tree in a fully latent manner, learning codebooks and hierarchical relationships directly from document representations and their interactions.
4. Practical Retrieval Workflow: Indexing and Query
In generative retrieval scenarios, CAT-ID-assigned DocIDs become atomic tokens in an LLM's vocabulary. Training proceeds as follows:
- DocID Learning: Assign each document a discrete sequence through quantization, enforcing category-tree constraints.
- LLM Fine-tuning: Fine-tune a sequence-to-sequence model (e.g., T5) on (query, DocID) pairs, minimizing next-token cross-entropy.
- Inference: At runtime, a query is ingested by the LLM, which generates likely DocID token sequences via beam search. These are mapped back to documents in the index.
- Postprocessing: Early-stage cluster pruning and dense retrieval reranking may be applied to meet real-world latency budgets; in production deployments, retrieval latency remains within 100–200 ms per query using A100-class GPUs, with no significant memory overhead from DocID token expansion.
This approach enables direct, end-to-end document retrieval without the need for explicit query rewriting or multi-stage rankers.
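A structural sketch of the inference stage is shown below, assuming a Hugging Face `transformers` T5 checkpoint fine-tuned on (query, DocID) pairs, with one added vocabulary token per codeword; the token format and checkpoint name are illustrative assumptions.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# illustrative setup: 4 quantizer layers x 256 codewords, each serialized as a
# dedicated vocabulary token such as "<c1_17>" (code 17 at layer 1)
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
docid_tokens = [f"<c{l}_{k}>" for l in range(1, 5) for k in range(256)]
tokenizer.add_tokens(docid_tokens)
model.resize_token_embeddings(len(tokenizer))

def retrieve(query, k=10):
    """Generate the k most likely DocID token sequences for a query via beam search."""
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=4,          # one token per quantizer layer
            num_beams=k,
            num_return_sequences=k,
        )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]

# the decoded token sequences are then mapped back to documents via the DocID index
```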
5. Empirical Results and Quantitative Performance
CAT-ID outperforms both sparse and dense retrieval baselines across multiple multilingual and domain-specific datasets. On e-commerce data (ESCI-us, ESCI-es, ESCI-jp), CAT-ID achieves best-in-class Recall@k, offering 15–20% relative gain over previous generative retrieval systems (e.g., TIGER, NCI). Variants with 512 codewords per quantization layer outperform smaller codebooks on large datasets, with four quantization layers empirically optimal.
| Model | Recall@10 (ESCI-us) | Recall@10 (ESCI-es) | Recall@10 (ESCI-jp) |
|---|---|---|---|
| BM25 | 5.68 | 5.19 | 3.23 |
| DPR | 6.77 | 4.64 | 4.45 |
| TIGER | 4.93 | 9.45 | 7.64 |
| CAT-ID (256) | 5.86 | 10.14 | 7.89 |
| CAT-ID (512) | 6.54 | 9.71 | 8.09 |
Online A/B experiments demonstrate that CAT-ID increases average orders per 1,000 users by +0.33% for ambiguous queries and +0.24% for long-tail queries compared to strong multi-stage production baselines.
Ablation shows that all three loss terms (hierarchical, scale, dispersion) are necessary for maximal gain; omitting any leads to notably worse retrieval performance.
6. Implementation and Computational Considerations
- Hardware & Efficiency: CAT-ID is typically trained on eight A100 80 GB GPUs for 300 epochs, with full DocID training taking 24.8 h—~20% slower than TIGER but substantially faster than general-purpose retrieval frameworks (DSI, NCI).
- Hyperparameters: The default configuration fixes the loss weights $\alpha$, $\beta$, $\gamma$ and the per-layer codebook size $K$; batch size is 4096.
- Hyperparameter Sensitivity: Catastrophic collapse occurs for small dispersion weights; excessive weights damage semantic locality. Category depth and codebook size exert U-shaped influences on code quality and collision rate.
- Production Integration: The system is compatible with traditional ANN search infrastructures (e.g., Faiss), and hierarchical codes can be leveraged for coarse-to-fine partitioning.
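For example, a minimal Faiss IVF setup (with random vectors standing in for the quantized document embeddings) in which the number of coarse cells mirrors the first-level codebook size; this is an illustrative configuration, not the production system:

```python
import numpy as np
import faiss

d, n = 64, 100_000
xb = np.random.rand(n, d).astype("float32")   # stand-in for quantized embeddings

# 256 coarse cells, matching a first-level codebook of size 256, so first-level
# CAT-ID codes can serve as the coarse partition for coarse-to-fine ANN search
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 256)
index.train(xb)
index.add(xb)
index.nprobe = 8                              # coarse cells scanned per query

xq = np.random.rand(5, d).astype("float32")
distances, ids = index.search(xq, 10)         # top-10 neighbors per query
```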
7. Extensions and Relation to Prior Work
Earlier CAT-ID-related frameworks address tree-structured categorical retrieval via index construction (e.g., colored-range reporting, wavelet trees, heavy-path decomposition) (Belazzougui et al., 2020). These approaches deliver worst-case optimal query time but focus on string pattern and category-level retrieval, not learned DocID representation.
Other document categorization models (e.g., HiMeCat (Zhang et al., 2020)) integrate hierarchical label dependencies, text, and metadata into shared embeddings for weakly supervised classification, achieving state-of-the-art F1 in low-label regimes. More recent neural extensions (e.g., CAGE, (Liu et al., 2023)) generalize to hierarchical clustering and end-to-end learning for recommendation and retrieval, providing the architectural foundation for CAT-ID generative retrieval platforms.
A plausible implication is that combining explicit category-tree priors with end-to-end quantization and contrastive learning offers both state-of-the-art accuracy and robust interpretability, especially in cases where domain ontologies are rich or maintained by external curators. This approach also supports efficient search space pruning, resilience to category errors, and robustness to missing or noisy supervisions.
8. Limitations and Future Directions
CAT-ID performance is sensitive to codebook cardinality, hierarchy depth, and the trade-off weights among loss functions. Very high codebook cardinality can lead to code fragmentation and slow convergence. While the category-tree prior provides robust supervision in e-commerce and taxonomically-structured corpora, domains lacking clear hierarchical structures may not benefit as strongly. Efficient Sinkhorn post-processing is necessary to guarantee identifier uniqueness at scale.
Further exploration is warranted in automatic category-tree induction, extension to arbitrarily deep or non-uniform trees, and joint modeling with user- or session-level attributes for recommendation. Emerging directions include integration with multi-modal generative models, application to scientific corpora with latent or evolving ontologies, and adaptation to open-set and continual learning settings.