Unified Item Tokenization

Updated 1 September 2025
  • Unified item tokenization is the process of mapping diverse items into structured token sequences that capture both unique and shared information.
  • It employs semantic encoding, collaborative regularization, and multimodal integration to optimize efficiency and cross-domain generalization.
  • These frameworks enable applications in generative recommendation and retrieval, achieving significant memory savings and improved performance.

Unified item tokenization is the process of mapping items (ranging from products in recommender systems to images in multimodal AI) into discrete, structured, and task-consistent token sequences that can be leveraged across diverse downstream applications, including generative recommendation, retrieval, and multimodal understanding. The overarching goal is to construct representations that are both compact and semantically expressive: item tokens should encode both unique and shared information, remain compatible with autoregressive (or similar) sequence models, and support cross-domain generalization, efficient inference, and interpretability. This article surveys the theoretical foundations, unification strategies, representative methodologies, optimization frameworks, technological developments, and empirical results of unified item tokenization, reflecting the findings of the most pertinent research in the area.

1. Theoretical Foundations and Unification Principles

The formal theory of tokenization, particularly as articulated in (Gastaldi et al., 16 Jul 2024), frames tokenization as a pair of stochastic maps: an encoder $\tau$ that transforms sequences from a base alphabet (e.g., characters, pixels, modalities) into a sequence of tokens, and a decoder $\kappa$ mapping tokens back to the original space. Consistency and exactness are characterized in terms of pushforward and pullback probability distributions:

$$p^* = \kappa(\tau(p^*)),$$

where $p^*$ is a distribution on the original data space. Exactness (bijective preservation over all $p^*$) is the unique condition ensuring that statistical estimators trained on the tokenized space can be straightforwardly decoded into equivalent estimators in the original space. Crucial computational aspects considered in this theory include:

  • Finiteness and Sequentiality: The requirement that all pre-images (potential token sequences for a symbol) remain finite and that tokenization is prefix-preserving, supporting efficient autoregressive processing.
  • Ambiguity and Inconsistency: Noninjective encodings may cause ambiguous mappings, potentially degrading downstream estimator consistency.
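
As a concrete toy illustration of these definitions, the sketch below implements a deterministic encoder $\tau$ and decoder $\kappa$ over a tiny vocabulary and checks the round-trip condition $\kappa(\tau(x)) = x$. The greedy longest-match rule and the vocabulary are illustrative assumptions, not taken from the cited paper.

```python
# Toy illustration (not from the cited paper): a deterministic tokenizer.
# tau maps character strings to token ids via greedy longest-match merging;
# kappa maps token ids back to strings. Exactness requires kappa(tau(x)) == x.

VOCAB = {"ab": 0, "a": 1, "b": 2, "c": 3}          # token string -> id
INV = {i: s for s, i in VOCAB.items()}             # id -> token string

def encode(text: str) -> list[int]:
    """tau: greedy longest-match tokenization (prefix-preserving)."""
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"untokenizable symbol at position {i}")
    return ids

def decode(ids: list[int]) -> str:
    """kappa: concatenate the token strings."""
    return "".join(INV[i] for i in ids)

for x in ["abc", "aab", "bca"]:
    assert decode(encode(x)) == x                  # round-trip consistency
```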

This formalism has influenced advances in item tokenization for generative models by motivating methods that maintain injective mappings (where possible) and alignment between semantic and discrete identifiers across domains.

2. Unified Tokenization Methodologies: Semantic, Collaborative, and Multimodal Approaches

Unified item tokenization frameworks, especially those in generative recommendation, have evolved to incorporate both item-specific (ID) and shared semantic information.

Semantic Tokenization (e.g., via RQ-VAE):

Items are first mapped to dense representations using pretrained encoders (textual or multimodal). A residual quantization (RQ-VAE) process is typically used: a latent vector $z$ is quantized across $L$ codebooks to obtain a tuple of token indices $[c_1, \ldots, c_L]$ per item, enforcing the approximation $z \approx \sum_{l} e^{l}_{c_l}$. This approach compresses semantic content into compact, broadly applicable tokens (Wang et al., 12 May 2024, Liu et al., 13 Mar 2024, Lin et al., 23 Feb 2025). In UIST (Liu et al., 13 Mar 2024), this paradigm enables a 200-fold reduction in memory usage for CTR prediction with negligible loss of predictive quality.
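
The following numpy sketch shows the residual quantization step in isolation, assuming randomly initialized codebooks rather than the trained RQ-VAE codebooks that the cited methods learn end to end.

```python
import numpy as np

def rq_tokenize(z, codebooks):
    """Residual quantization: quantize z across L codebooks and return the
    tuple of code indices [c_1, ..., c_L] plus the reconstruction
    sum_l e^l_{c_l}. `codebooks` is a list of (K, d) arrays."""
    residual = z.copy()
    indices, recon = [], np.zeros_like(z)
    for cb in codebooks:                                  # level l = 1..L
        dists = np.linalg.norm(cb - residual, axis=1)     # distance to each code
        c = int(np.argmin(dists))                         # nearest code index
        indices.append(c)
        recon += cb[c]
        residual = residual - cb[c]                       # pass residual to next level
    return indices, recon

rng = np.random.default_rng(0)
d, L, K = 64, 4, 256                     # e.g., 4 tokens from 256-entry codebooks
codebooks = [rng.normal(size=(K, d)) for _ in range(L)]
z = rng.normal(size=d)                   # dense item embedding from a pretrained encoder
tokens, z_hat = rq_tokenize(z, codebooks)
print(tokens, float(np.linalg.norm(z - z_hat)))
```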

Collaborative Regularization:

Recent frameworks enhance semantic tokenizers by integrating collaborative signals—patterns from user–item interactions and behavioral similarity—into the code assignment, leveraging contrastive or InfoNCE-style losses. LETTER (Wang et al., 12 May 2024) aligns semantic representations with collaborative filtering embeddings, while COSETTE (Lepage et al., 12 Aug 2025) and SimCIT (Zhai et al., 20 Jun 2025) embed co-occurrence patterns using contrastive learning in the codebook space, ensuring that similar items in behavioral space are tokenized closely in code space.
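
A minimal sketch of such an alignment term is given below: an InfoNCE-style loss that treats each item's quantized semantic representation and its collaborative-filtering embedding as a positive pair within a batch. The temperature and batch construction are illustrative assumptions, not the cited frameworks' exact settings.

```python
import numpy as np

def info_nce(sem, cf, temperature=0.07):
    """InfoNCE-style alignment: each item's semantic representation (sem[i])
    should score highest against its own collaborative-filtering embedding
    (cf[i]) among all items in the batch."""
    sem = sem / np.linalg.norm(sem, axis=1, keepdims=True)
    cf = cf / np.linalg.norm(cf, axis=1, keepdims=True)
    logits = sem @ cf.T / temperature            # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives sit on the diagonal

rng = np.random.default_rng(0)
B, d = 32, 64
sem = rng.normal(size=(B, d))   # e.g., RQ-VAE reconstructions of item content
cf = rng.normal(size=(B, d))    # embeddings from a pretrained CF model
print(info_nce(sem, cf))
```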

Multimodal and Knowledge-Integrated Tokenization:

For tasks involving images, text, or multimodal inputs, unified tokenizers such as UniTok (Ma et al., 27 Feb 2025), SemHiTok (Chen et al., 9 Mar 2025), and UTGRec (Zheng et al., 6 Apr 2025) introduce architectures that either employ separate codebooks for semantic and texture (pixel-level) information or adopt hierarchical/conditional codebooks (e.g., semantic-guided subcodebooks), decoupling and recombining different levels of information into a consistent token sequence. Tree-structured codebooks and collaborative co-occurrence alignment are used to create codes that transfer across domains and remain robust to data sparsity (Zheng et al., 6 Apr 2025).
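
The sketch below illustrates the general idea of a semantic-guided (conditional) subcodebook lookup: a coarse semantic code selects which fine-grained codebook quantizes the remaining detail. It is a simplified stand-in, not the exact architecture of UniTok, SemHiTok, or UTGRec.

```python
import numpy as np

def hierarchical_tokenize(feat, semantic_codebook, subcodebooks):
    """Two-level lookup: a coarse semantic code selects the fine-grained
    subcodebook that quantizes the remaining (e.g., texture-level) residual."""
    c_sem = int(np.argmin(np.linalg.norm(semantic_codebook - feat, axis=1)))
    sub = subcodebooks[c_sem]                       # subcodebook conditioned on c_sem
    residual = feat - semantic_codebook[c_sem]
    c_fine = int(np.argmin(np.linalg.norm(sub - residual, axis=1)))
    return c_sem, c_fine

rng = np.random.default_rng(0)
d, K_sem, K_fine = 32, 16, 64
semantic_codebook = rng.normal(size=(K_sem, d))
subcodebooks = [rng.normal(size=(K_fine, d)) for _ in range(K_sem)]
print(hierarchical_tokenize(rng.normal(size=d), semantic_codebook, subcodebooks))
```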

3. End-to-End and Synergistic Optimization Strategies

Several frameworks recognize the suboptimality of decoupling item tokenization from generative/recommendation model training.

Joint and Alternating Optimization (e.g., ETEGRec (Liu et al., 9 Sep 2024)):

Instead of fixing item tokens before recommendation training, architectures like ETEGRec co-train the item tokenizer with a generative recommender (seq2seq Transformer) using alternating optimization. Recommendation-oriented alignment objectives ensure that the sequence-level representations (from user histories) and token-level assignments remain synchronized. This prevents drift between the semantic space used for token assignments and the representation space learned by the recommender.
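
A schematic of the alternating schedule is sketched below with toy stand-in classes (`ToyTokenizer` and `ToyRecommender` are hypothetical); the real systems optimize neural tokenizers and seq2seq Transformers with alignment losses, which this sketch only mimics at the interface level.

```python
import numpy as np

class ToyTokenizer:
    """Stand-in tokenizer: one codebook, nearest-neighbour code assignment."""
    def __init__(self, n_codes, dim, rng):
        self.codebook = rng.normal(size=(n_codes, dim))

    def encode(self, items):                      # (B, d) item features -> (B,) code ids
        dists = ((items[:, None, :] - self.codebook[None]) ** 2).sum(-1)
        return dists.argmin(axis=1)

    def step(self, items, align_to, lr=0.1):      # alignment phase: pull codes toward
        for c, target in zip(self.encode(items), align_to):  # the recommender's states
            self.codebook[c] += lr * (target - self.codebook[c])

class ToyRecommender:
    """Stand-in recommender: one learned vector per code id."""
    def __init__(self, n_codes, dim, rng):
        self.table = rng.normal(size=(n_codes, dim))

    def sequence_states(self, codes):             # (B,) code ids -> (B, d) states
        return self.table[codes]

    def step(self, codes, lr=0.1):                # placeholder for the seq2seq objective
        self.table[codes] *= (1.0 - lr)

rng = np.random.default_rng(0)
tok, rec = ToyTokenizer(8, 16, rng), ToyRecommender(8, 16, rng)
batches = [rng.normal(size=(4, 16)) for _ in range(3)]
for _ in range(2):                                # alternating schedule
    for items in batches:                         # phase A: train the recommender
        rec.step(tok.encode(items))
    for items in batches:                         # phase B: refit the tokenizer, aligned
        tok.step(items, align_to=rec.sequence_states(tok.encode(items)))
```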

Self-Improving and Plug-and-Play Tokenization:

Adaptive tokenization strategies, as in SIIT (Chen et al., 22 Dec 2024), interleave periods of generative model training with rounds of identifier alignment. By allowing an LLM to periodically regenerate and refine the item tokens based on its internal representations, the system ensures that item tokens remain consistent with model understanding, even when the tokens are initialized from external codebooks.

4. Hierarchical and Multi-Identifier Tokenization

Hierarchical Token Design:

Hierarchical or coarse-to-fine tokenization has proved critical for balancing granularity, interpretability, and retrieval accuracy. The first tokens encode broad semantics, while subsequent tokens refine them, aligning naturally with autoregressive decoding (Wang et al., 12 May 2024, Liu et al., 11 Sep 2024, Zheng et al., 6 Apr 2025). Chain-of-thought (CoT) tokenization in GRACE (Ma et al., 19 Jul 2025) interleaves explicit PKG-derived attributes with semantic tokens, supporting interpretable and behavior-aligned recommendation.
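
Purely as a formatting illustration (the token names and attribute scheme below are assumptions, not GRACE's actual vocabulary), a coarse-to-fine identifier with interleaved attribute tokens might be serialized as follows:

```python
def to_hierarchical_tokens(code_indices, attributes=None):
    """Illustrative formatting only: coarse-to-fine codes become level-tagged
    tokens; optional attribute tokens (CoT-style) are interleaved up front."""
    toks = [f"<attr_{k}:{v}>" for k, v in (attributes or {}).items()]
    toks += [f"<c{level}_{idx}>" for level, idx in enumerate(code_indices, start=1)]
    return "".join(toks)

print(to_hierarchical_tokens([12, 87, 3], {"category": "electronics"}))
# -> <attr_category:electronics><c1_12><c2_87><c3_3>
```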

Multi-Identifier Strategies:

To address data sparsity and enhance generalization for long-tail items, multi-identifier tokenization frameworks assign multiple semantically coherent token sequences per item. This is operationalized in MTGRec (Zheng et al., 6 Apr 2025) by using RQ-VAE checkpoints from adjacent training epochs as different, but related, tokenization functions, thereby increasing exposure frequency for rare items and providing a more diverse sequence-level training signal. A curriculum pre-training scheme further adjusts the sampling probability of each identifier set based on data influence estimations.
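
A minimal sketch of the curriculum-style sampling idea follows, with made-up influence scores and a softmax weighting as an assumed mechanism; the cited work estimates influence from training data rather than fixing it by hand.

```python
import numpy as np

def sample_identifier_sets(influence_scores, temperature=1.0, n_draws=10, rng=None):
    """Curriculum-style sampling over identifier sets (e.g., RQ-VAE checkpoints
    from adjacent epochs): sets with higher estimated data influence are
    sampled more often. Scores and temperature here are illustrative."""
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(influence_scores, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), size=n_draws, p=probs)

# Three checkpoints acting as related tokenization functions for each item.
print(sample_identifier_sets([0.2, 0.5, 0.9]))
```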

5. Efficiency, Scalability, and Practical Integration

Memory and Inference Efficiency:

Tokenization approaches using discrete, compact codes (often short code tuples, e.g., 4 tokens drawn from 256-entry codebooks) can reduce storage by two orders of magnitude compared to maintaining full embedding tables (Liu et al., 13 Mar 2024, Liu et al., 11 Sep 2024). This enables real-time inference and efficient caching.
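
A back-of-envelope comparison with illustrative numbers (1M items, 64-dimensional float32 embeddings versus 4 one-byte code indices per item) shows the order of magnitude involved; the exact ratio depends on embedding width and code length.

```python
# Back-of-envelope storage comparison with illustrative numbers.
n_items = 1_000_000
embed_bytes = n_items * 64 * 4          # 64-dim float32 embedding table
token_bytes = n_items * 4 * 1           # 4 codes per item, 256 entries -> 1 byte each
print(embed_bytes / token_bytes)        # -> 64x smaller, before counting the codebooks
```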

Efficient Decoding and Model Design:

To support inference over vast catalogs, methods such as RecGPT (Jiang et al., 6 Jun 2025) implement catalog-aware beam search decoders with trie-based prefix matching, ensuring that only valid item codes are generated during next-item prediction. Decoupled architectures (e.g., MARIUS in (Lepage et al., 12 Aug 2025)) separate sequence modeling from item code decoding, optimizing both efficiency and expressiveness.
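
A simplified sketch of trie-based prefix constraining (not the cited systems' implementation): every valid code prefix maps to the set of admissible next tokens, and the decoder masks everything else before each beam-search step.

```python
def build_trie(item_codes):
    """Map every valid code prefix to the set of allowed next tokens."""
    allowed = {}
    for codes in item_codes:
        for i in range(len(codes)):
            allowed.setdefault(tuple(codes[:i]), set()).add(codes[i])
    return allowed

def constrain(logits, prefix, allowed, neg_inf=float("-inf")):
    """Mask logits so only tokens continuing a valid item code survive."""
    valid = allowed.get(tuple(prefix), set())
    return [l if t in valid else neg_inf for t, l in enumerate(logits)]

catalog = [[1, 5, 2], [1, 5, 7], [3, 0, 4]]        # each item = tuple of code indices
trie = build_trie(catalog)
print(constrain([0.1] * 8, [1, 5], trie))          # only tokens 2 and 7 stay finite
```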

Cross-Domain and Zero-Shot Transfer:

Text-driven and multimodal-aware tokenizers allow immediate representation of unseen items and seamless transfer across domains without retraining (Jiang et al., 6 Jun 2025, Zheng et al., 6 Apr 2025). Through domain-invariant tokenization, systems can generalize to new contexts and mitigate the cold-start problem.

6. Performance, Applications, and Empirical Results

Unified item tokenization frameworks consistently outperform or close the gap with ID-based and neural embedding baselines across various benchmarks:

| Method | Setting (Example) | Key Gains |
| --- | --- | --- |
| STORE (Liu et al., 11 Sep 2024) | News/Yelp recommendation | Significant increase in Recall@K/NDCG@K |
| COSETTE+MARIUS (Lepage et al., 12 Aug 2025) | Large-scale sequential recommendation | Matches or exceeds best ID-based SASRec |
| MTGRec (Zheng et al., 6 Apr 2025) | Amazon genres, varying model size | Enhanced recall and scalability |
| RecGPT (Jiang et al., 6 Jun 2025) | Six cross-domain, zero-shot settings | Outperforms ID-based and embedding models |
| UIST (Liu et al., 13 Mar 2024) | Industrial-scale CTR prediction | 200× memory savings, 98% accuracy retained |
| UTGRec (Zheng et al., 6 Apr 2025) | Multi-domain generative recommendation | Superior transfer and downstream metrics |
| LETTER (Wang et al., 12 May 2024) | LLM-based generative recommendation | All components improve Recall and NDCG |

Applications extend from online retail and industrial recommender systems to image generation and multitask vision–language models. Unified vision-language-action tokenization (OmniJARVIS (Wang et al., 27 Jun 2024)) has enabled end-to-end agent architectures in open-world environments, while hierarchical visual tokenizers (UniTok (Ma et al., 27 Feb 2025), SemHiTok (Chen et al., 9 Mar 2025)) support both high-fidelity image reconstruction and robust multimodal understanding without architectural duplication.

7. Challenges, Open Questions, and Future Directions

While substantial progress has been made, several challenges remain:

  • Loss Conflicts and Bottlenecks: Joint training under reconstruction and contrastive objectives may be limited by codebook capacity, not inherent loss incompatibility (Ma et al., 27 Feb 2025). Increasing codebook dimensionality and token length (via multi-codebook approaches) alleviates these issues.
  • Token Assignment Bias and Diversity: Code utilization tends to be imbalanced, leading to generation bias (Wang et al., 12 May 2024). Diversity regularization and hierarchical assignments are effective mitigations.
  • Statistical and Computational Consistency: Maintaining theoretical soundness under complex mappings, minimizing spurious ambiguity, and efficiently summing over possibly infinite pre-images remain areas for algorithmic improvement (Gastaldi et al., 16 Jul 2024).
  • Scalability with Model/Token Size: Some approaches show increased gains with larger model sizes (Zheng et al., 6 Apr 2025), but balancing computation, memory, and generalization requires further empirical study.
  • Explainability and Interpretability: Explicit CoT and PKG-aligned structures (GRACE (Ma et al., 19 Jul 2025)) enhance interpretability but trade off some dense semantic expressiveness.

Future research targets include richer context- and modality-aware tokenization, adaptive token set expansion, further algorithmic efficiency in decoding (e.g., leveraging trie or catalog indexes), and deeper understanding of the statistical implications of tokenization choices for foundational AI models.


Unified item tokenization has become foundational for modern generative, retrieval, and multimodal AI systems. By merging semantic, collaborative, multimodal, and hierarchical information into structured token sequences, these frameworks enable end-to-end learning, transferability, efficiency, and robustness—advancing the state-of-the-art across recommendation, search, and multimodal generation and understanding.
