
Semantic IDs: Discrete Semantic Representations

Updated 1 October 2025
  • Semantic IDs are discrete content-derived representations that encode semantic relationships, offering a compact and rich alternative to one-hot IDs.
  • They are constructed through embedding extraction, quantization, and hierarchical clustering, addressing sparsity and scalability challenges in large-scale systems.
  • Their application improves recommendation accuracy, reduces cold-start issues, and enhances interpretability while integrating with multimodal and joint search-recommendation tasks.

Semantic Identifiers (SIDs) are discrete, content- or embedding-derived item representations that have emerged as a foundational concept for recommendation, retrieval, and information indexing in large-scale machine learning systems. Unlike traditional random or one-hot IDs, SIDs directly encode semantic and relational structure into the discrete identifiers used for items, users, or documents, enabling enhanced generalization, transfer, efficiency, and interpretability across a range of tasks and modeling paradigms.

1. Foundations and Motivation for Semantic IDs

Semantic IDs are designed to overcome the limitations of classical ID representations such as one-hot or random (hashed) IDs, which are high-dimensional, sparse, and fail to encode underlying semantic or relational similarities between entities. Modern recommendation and retrieval pipelines often involve massive entity catalogs: millions to billions of items, users, or products. Traditional methods lead to sparsity issues, parameter inefficiency, lack of transferability, and data pollution through random hash collisions that degrade performance and stability (Zhao et al., 2017, Zheng et al., 2 Apr 2025).

The core idea behind SIDs is to map each entity (item, document, person, etc.) to a compact, discrete set of tokens or indices that reflect the entity's semantics—typically via content-derived feature encoding, quantization of modality encoder outputs, or hierarchical clustering of feature embeddings. This enables (1) low-dimensional, dense representations; (2) transfer of learned structure across domains, tasks, and temporal regimes; and (3) efficient and scalable integration with large language and foundation models.
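
To make the contrast with classical IDs concrete, the toy sketch below compares the two representations; all identifiers, catalog sizes, and codebook dimensions here are hypothetical, not drawn from the cited systems.

```python
CATALOG_SIZE = 50_000_000           # hypothetical item catalog
SID_LEVELS, CODEBOOK_SIZE = 4, 256  # hypothetical SID configuration

# Classical ID: a single opaque index into a huge, sparse vocabulary;
# nearby indices say nothing about item similarity.
item_id = 31_415_926

# Semantic ID: a short tuple of tokens from small per-level codebooks;
# items sharing a prefix are semantically related by construction.
sid = (17, 203, 5, 88)

# The SID space covers CODEBOOK_SIZE ** SID_LEVELS = 256**4 ≈ 4.3B items,
# yet the model stores only 4 * 256 shared codebook entries.
print(CODEBOOK_SIZE ** SID_LEVELS >= CATALOG_SIZE)  # True
```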

2. Construction Methodologies: Embedding, Quantization, and Hierarchy

SID construction strategies vary by application and desired properties, but generally follow a multi-stage process:

  • Semantic Representation Extraction: Entities are first represented using dense feature embeddings—generated by text/image/graph encoders, multimodal LLMs (MLLMs), collaborative filtering, or hybrid content-collaborative models (Fu et al., 25 Sep 2025, Ju et al., 29 Jul 2025).
  • Discretization/Quantization: These continuous representations are then discretized:
    • Hierarchical content-tokenization: A typical approach encodes items using domain labels and subcategories as a category path, e.g., $\langle \mathrm{Makeup} \rangle\, \langle \mathrm{Lips} \rangle\, \langle \mathrm{Lip\_Liners} \rangle\, \langle 5 \rangle$ (Hua et al., 2023).
    • Residual Quantized VAE (RQ-VAE): High-dimensional features are quantized over multiple levels (layers), each selecting the nearest centroid from a codebook; the resulting index vector forms the SID (Singh et al., 2023, Wang et al., 2 Jun 2025). A minimal sketch follows this list.
    • LM-based Generative Indexers: Sequential or parallel generators (e.g., LMIndexer) directly produce discrete token sequences aligned with semantic structure, via latent codebooks and self-supervised objectives (Jin et al., 2023, Hou et al., 6 Jun 2025).
    • Hybrid and Multi-modal Fusion: Modern systems combine text, vision, and collaborative signals with joint quantization and fusion modules (Fu et al., 25 Sep 2025, Ye et al., 14 Aug 2025).
    • Collision Handling and Postprocessing: Since similar items can map to identical SIDs, collision mitigations include adding tie-breaking suffixes, semantic uniqueness/enhancement losses, or search/greedy assignment algorithms that guarantee uniqueness and control codebook utilization (Wang et al., 2 Jun 2025, Zhang et al., 19 Sep 2025); the sketch below also illustrates a simple suffix-based variant.
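
As a concrete illustration of the residual-quantization and collision-handling steps above, the following minimal sketch assumes trained codebooks are already available; the function names, shapes, and the simple suffix strategy are illustrative assumptions rather than the method of any single cited paper.

```python
import numpy as np

def rq_encode(embedding, codebooks):
    """Residual quantization: at each level, pick the nearest codeword,
    subtract it, and pass the residual on to the next level."""
    residual, tokens = embedding.copy(), []
    for codebook in codebooks:                       # codebook shape: (K, d)
        distances = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(distances))              # nearest centroid
        tokens.append(idx)
        residual = residual - codebook[idx]
    return tuple(tokens)

def assign_unique_sids(embeddings, codebooks):
    """Greedy collision handling: items quantizing to the same code path
    receive an extra non-semantic tie-breaking suffix token."""
    seen, sids = {}, []
    for emb in embeddings:
        base = rq_encode(emb, codebooks)
        suffix = seen.get(base, 0)                   # 0 for the first item
        seen[base] = suffix + 1
        sids.append(base + (suffix,))
    return sids

# Toy usage with random stand-in "trained" codebooks (3 levels, K=256, d=64).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]
items = rng.normal(size=(5, 64))
print(assign_unique_sids(items, codebooks))
```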

This process can be tightly coupled with learning objectives, e.g., joint optimization to directly align SIDs with collaborative signals (as in DAS, which unifies quantization and contrastive alignment in a one-stage framework) (Ye et al., 14 Aug 2025). Codebook layering and prefix-ngrams preserve hierarchical structure and allow progressively finer granularity and grouping (Zheng et al., 2 Apr 2025, Hou et al., 6 Jun 2025).
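
The utility of prefix structure can be seen in a small sketch (the items and SID values are invented for illustration): indexing items by SID prefix recovers progressively coarser semantic groupings with no additional learned parameters.

```python
from collections import defaultdict

# Hypothetical items with already-assigned 3-level SIDs.
item_sids = {
    "song_a": (12, 40, 7),
    "song_b": (12, 40, 9),   # shares a 2-token prefix with song_a: close neighbors
    "song_c": (12, 3, 1),    # shares only the coarse level-1 token
    "song_d": (88, 0, 2),    # unrelated
}

def prefix_index(item_sids, prefix_len):
    """Group items by their first `prefix_len` SID tokens (a prefix n-gram)."""
    groups = defaultdict(list)
    for item, sid in item_sids.items():
        groups[sid[:prefix_len]].append(item)
    return dict(groups)

print(prefix_index(item_sids, 1))  # coarse clusters
print(prefix_index(item_sids, 2))  # finer clusters
```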

3. Impact on Recommendation, Retrieval, and Transfer Learning

SIDs have demonstrated efficacy across diverse tasks including next-item recommendation, multimodal generative retrieval, cross-domain adaptation, person re-identification, and joint search-recommendation scenarios.

  • Generalization and Cold-Start: By encoding content-relatedness, SIDs significantly improve performance on long-tail and unseen items compared to systems based purely on random or arbitrary IDs. In production-scale evaluations (e.g., YouTube, Meta Ads, Taobao), SIDs have enabled up to 16% improvements in recommendation accuracy and substantial reduction in cold-start error, while also reducing parameter count (Singh et al., 2023, Zheng et al., 2 Apr 2025, Wang et al., 2 Jun 2025, Fu et al., 25 Sep 2025).
  • Memory and Computation Efficiency: Models leveraging SIDs require far fewer parameters for item representation, freeing capacity for other computations or for larger hidden states and higher accuracy. For example, song-level SIDs with codebooks replace up to 99% of item-specific embedding parameters at no cost to offline ranking accuracy (Mei et al., 24 Jul 2025); a back-of-the-envelope calculation follows this list.
  • Scalability and Robustness: The prefix-gram and hierarchical codebook encoding strategies allow for stable operation as catalogs grow and change, reduce representation shift, and enhance the focus of attention-based architectures on salient parts of user history (Zheng et al., 2 Apr 2025, Hou et al., 6 Jun 2025).
  • Interpretability and Traceability: Hierarchically supervised quantization as proposed in HiD-VAE enables SIDs to be directly mapped to interpretable semantic paths, increasing trust, auditability, and enabling human-in-the-loop investigation (Fang et al., 6 Aug 2025).
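
The parameter arithmetic behind the efficiency claim is straightforward; in the sketch below, the catalog size, embedding dimension, and codebook configuration are illustrative choices, not figures from the cited systems.

```python
# Dense ID embedding table: one d-dimensional row per item.
n_items, d = 10_000_000, 128
dense_params = n_items * d                  # 1.28e9 parameters

# SID-based: L codebooks of K codewords each, shared across all items.
levels, codebook_size = 4, 2048
sid_params = levels * codebook_size * d     # ~1.05e6 parameters

print(f"dense: {dense_params:,}  sid: {sid_params:,}  "
      f"reduction: {1 - sid_params / dense_params:.2%}")  # ~99.9% fewer
```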

The table below summarizes representative construction and application patterns:

| Construction Technique | Semantic Source | Applications |
| --- | --- | --- |
| Hierarchical category paths | Structured labels | LLM-based generative recsys, foundation models |
| RQ-VAE quantization | Embeddings (text/image/audio) | Video/music/POI recommendation, search |
| LMIndexer, RPG | LLM generative latents | Document retrieval, large-scale recsys |
| Semantic + collaborative fusion | MLLM + CF signals | Industrial ad ranking, cold-start |
| Attribute grouping + prototyping | Attribute labels | Person ReID, attribute search and recognition |

4. Optimization, Alignment, and Scaling Challenges

Despite their success, SIDs present unique challenges and optimization trade-offs:

  • Semantic- versus Collaborative-Signal Alignment: SIDs derived solely from content fail to capture all collaborative filtering (CF) nuances, while conventional ID features underserve semantic similarity. Dual-aligned methods (e.g., DAS) jointly optimize quantization and collaborative alignment (using debiased CF modules and multi-view contrastive losses), yielding superior performance over two-stage or separately aligned methods (Ye et al., 14 Aug 2025).
  • Conflict Resolution and Semantic Preservation: Relying on appending non-semantic tie-breakers for uniqueness (as in earlier clustering-based SIDs) leads to spurious token space inflation and reduced cold-start effectiveness. Exhaustive Candidate Matching and Recursive Residual Search enable unique, purely semantic SIDs, directly strengthening retrieval and cold-start generalization (Zhang et al., 19 Sep 2025).
  • Capacity Bottlenecks and Scaling Laws: SID-based paradigms encounter intrinsic bottlenecks: as capacity (modality encoder size, quantizer complexity, or sequential RS parameters) increases, overall system performance saturates rapidly, well before exhibiting the scaling behavior observed in raw-text LLMs. The limiting factor is the restricted capacity of discrete SIDs to encode the full richness of the pretrained semantic embedding space, rather than limitations in the downstream RS architecture or encoder model (Liu et al., 29 Sep 2025).
  • Trade-off in Tokenization Granularity: Increasing token sequence length (or codebook size) raises semantic capacity but also complicates prediction and inference; excess complexity can degrade learnability (Hou et al., 6 Jun 2025, Ju et al., 29 Jul 2025). The sketch after this list quantifies the capacity side of the trade-off.
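
With codebook size K and sequence length L, the addressable SID space is K^L, so capacity grows exponentially in L while autoregressive decoding cost grows linearly with it; the configurations below are illustrative.

```python
# Addressable SID space K**L vs. number of decoding steps L.
for K, L in [(256, 2), (256, 4), (1024, 3), (4096, 2)]:
    print(f"K={K:>4}, L={L}: capacity={K**L:>14,}, decode steps={L}")
```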

5. Extensions: Multi-modal, Industrial, and Joint Search-Recommendation

Recent advances apply SIDs across multiple modalities, in industrial-scale environments and in joint or unified retrieval/recommendation settings:

  • Multimodal SIDs: Frameworks such as MME-SID, FORGE, and DAS aggregate embeddings from text, image, and collaborative features, quantize them into SIDs, and optimize via joint or contrastive objectives. These approaches mitigate embedding collapse, catastrophic forgetting, and slow convergence, achieving significant real-world performance gains in massive-scale production systems (Wang et al., 2 Sep 2025, Fu et al., 25 Sep 2025, Ye et al., 14 Aug 2025).
  • Unified Search and Recommendation: Semantic IDs constructed using multi-task, bi-encoder representations fine-tuned jointly on search and recommendation data provide an effective trade-off for joint models, balancing performance across both domains and eliminating the need for task-specific vocabularies (Penha et al., 14 Aug 2025).
  • Open-source and Benchmarks: Frameworks such as GRID and FORGE provide modular, extensible pipelines for developing, evaluating, and deploying SID-based generative recommendation, complete with large multi-modality public datasets, acceleration via offline pre-training strategies, and direct metrics (e.g., Gini coefficient, embedding hitrate) for offline SID quality assessment (Fu et al., 25 Sep 2025, Ju et al., 29 Jul 2025). A minimal Gini computation is sketched after this list.
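
As one example of such a metric, the sketch below computes a Gini coefficient over codeword usage counts using the standard formulation; the usage counts themselves are invented for illustration. Values near 0 indicate balanced codebook utilization, while values near 1 indicate collapse onto a few codewords.

```python
import numpy as np

def gini(counts):
    """Gini coefficient of codeword usage; 0 = perfectly uniform utilization."""
    x = np.sort(np.asarray(counts, dtype=float))   # ascending order
    n = x.size
    return (2 * np.sum(np.arange(1, n + 1) * x)) / (n * x.sum()) - (n + 1) / n

balanced = [100] * 256               # every codeword used equally
collapsed = [0] * 255 + [25_600]     # all usage on a single codeword
print(gini(balanced), gini(collapsed))  # ~0.0 vs ~0.996
```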

6. Quantitative Results and Empirical Findings

Empirical results across diverse datasets and platforms consistently demonstrate the advantages and challenges associated with SIDs:

  • Recommendation Accuracy and Diversity: SIDs yield up to 16% improvements in recommendation accuracy (e.g., top-1 POI), 35% gains in recall and 33% in NDCG for hierarchical SID models over SOTA baselines, and significant boosts in diversity (lower repetition, increased distinct recommendations) (Wang et al., 2 Jun 2025, Fang et al., 6 Aug 2025, Mei et al., 24 Jul 2025).
  • Production Impact: Integration into Meta’s Ads Ranking improved top-line metrics by 0.15%, with a 43% reduction in “A/A” prediction variance. FORGE reports a 0.35% increase in transactions on Taobao with optimized SIDs and halved convergence time via offline pretraining (Zheng et al., 2 Apr 2025, Fu et al., 25 Sep 2025).
  • Alignment and Cold-Start: Dual-aligned SIDs achieved 3.48% lift in eCPM in Kuaishou ads ranking, and 8.98% improvement in cold-start segments (Ye et al., 14 Aug 2025). Purely semantic SIDs (without non-semantic tokens) enhance recall@10 and improve cold/generalization in unseen scenarios (Zhang et al., 19 Sep 2025).
  • Scaling Ceiling: For SID-based GR, recall/NDCG performance saturates rapidly when scaling encoder, quantizer, or RS—contrasted by LLM-as-recommender systems, which improve as model scale increases, surpassing SID-based approaches by up to 20% at similar scaling (Liu et al., 29 Sep 2025).

7. Future Directions and Open Questions

Areas for future research and ongoing challenges include:

  • Enhanced Disentanglement & Interpretability: Mechanisms for unique, hierarchical, and human-interpretable SIDs (such as tag-aligned quantization and uniqueness losses) continue to evolve—broadening transparency, diversity, and rationalization in recommendations (Fang et al., 6 Aug 2025).
  • Efficient, Scalable Training and Inference: Graph-constrained decoding, parallel token prediction, and improved collision handling offer avenues to scale SIDs to even larger catalogs and tasks (Hou et al., 6 Jun 2025, Zhang et al., 19 Sep 2025).
  • Hybrid Architectures: Hybrid paradigms combining the efficiency of SIDs and the direct scaling benefits of LLM-as-RS point toward a next-generation synthesis for foundation recommendation/search models (Liu et al., 29 Sep 2025).
  • Adaptive Alignment and Multi-modal Generalization: Dynamic, data-driven alignment between semantic and collaborative signals, and multimodal fusion architectures remain open for further refinement and extension (Ye et al., 14 Aug 2025, Wang et al., 2 Sep 2025).
  • Benchmarking and Methodological Standardization: Continued open-sourcing of pipelines (e.g., FORGE, GRID), broader multimodal datasets, and creation of direct SID quality metrics accelerate fair evaluation, reproducibility, and rapid model development (Fu et al., 25 Sep 2025, Ju et al., 29 Jul 2025).

In summary, Semantic IDs represent a unifying concept at the intersection of content modeling, information retrieval, and collaborative filtering, bridging dense foundation model representations with discrete, scalable, and semantically rich identifiers. Their impact spans theoretical, algorithmic, and production-ready applications—yet significant challenges remain in optimizing their construction, alignment, scaling properties, and real-world deployment.
