Semantic IDs: Content-Based Identifiers
- Semantic IDs are structured, hierarchically arranged discrete identifiers, derived from content embeddings, that capture semantic, behavioral, and structural signals.
- They enable efficient retrieval, recommendation, and multi-modal applications by representing content with interpretable codes that boost scalability and performance.
- Construction methods include embedding quantization, hierarchical clustering, and rule-based label aggregation; ongoing work tackles challenges such as semantic collisions and behavioral alignment to improve generalization.
Semantic IDs—also referred to as semantic identifiers, SIDs, or semantic fingerprints—are compact, discrete representations derived from the content or attributes of items, documents, sessions, or entities. These identifiers encode intrinsic semantic, structural, or behavioral information, enabling systems in information retrieval, recommendation, security, and cross-modal retrieval to leverage meaning-aware, generalizable representations instead of arbitrary or randomly assigned IDs. The design, construction, and application of semantic IDs have become central for scaling modern systems, improving generalization on cold-start or long-tail data, and supporting unified generative frameworks across tasks and modalities.
1. Origins, Definitions, and Context
Semantic IDs emerged from the observation that traditional unique identifiers (such as random item IDs, one-hot encodings, or hash-based indices) fail to capture the inherent similarity or structure across items. In contrast, a semantic ID is a structured, often hierarchical sequence of discrete codes, tokens, or words produced by quantizing dense content-feature embeddings (from text, vision, or multi-modal encoders), clustering based on behavioral attributes, or extracting concept-level signals. The goal is to create an ID space where similar items are close or partially share codes, thereby enabling meaningful collisions and improved knowledge transfer.
Semantic IDs appear in various forms:
- Hierarchical code sequences from residual quantization (e.g., RQ-VAE, hierarchical clustering) (Singh et al., 2023, Jin et al., 2023).
- Semantic “fingerprints” based on session content distributions for user/session identification (Guha, 2016).
- Label-constructed SIDs in attribute-based search and re-identification, e.g., in person re-identification (reID) or pedestrian attribute recognition (PAR) tasks (Eom et al., 2 Dec 2024).
- Structured, interpretable, multi-token outputs generated by LLMs from multi-modal inputs (Li et al., 22 Sep 2025).
They are now foundational to large-scale recommendation engines, industrial search and retrieval, personalization, security audit, and even generative cross-modal frameworks powered by LLMs or MLLMs.
2. Construction Methodologies
The construction of semantic IDs encompasses several algorithmic strategies. Across systems and tasks, the principal pipelines are the following:
Embedding-Based Pipelines
Dense semantic embeddings are produced for each item/entity using modality-specific (e.g., BERT, CLIP, Video-BERT) or collaborative (user–item) encoders. These embeddings are then discretized into a hierarchical sequence of code indices via:
- Residual Quantized Variational Autoencoder (RQ-VAE), where each embedding $z$ passes sequentially through $L$ codebooks (each of size $K$) and is quantized at each level:

$$r_0 = z, \qquad c_l = \arg\min_k \big\| r_{l-1} - e_k^{(l)} \big\|_2^2, \qquad r_l = r_{l-1} - e_{c_l}^{(l)}, \qquad l = 1, \dots, L,$$

where $r_l$ is the residual after subtracting the selected codeword at level $l$, and $e_k^{(l)}$ is a codebook entry (Singh et al., 2023, Zheng et al., 2 Apr 2025, Wang et al., 2 Jun 2025). A minimal code sketch follows this list.
- Hierarchical clustering, product quantization, or learned tokenizers (e.g., RK-Means, SentencePiece adaptation).
- Multi-modal fusion, where semantic vectors are constructed from multiple sources (text, image, collaborative signals) and then quantized (Xu et al., 21 Aug 2025).
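To make the quantization step concrete, here is a minimal sketch assuming fixed, already-trained codebooks; in an actual RQ-VAE the codebooks are learned jointly with an encoder/decoder, and all names here are illustrative:

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Map a dense embedding to a hierarchical semantic ID via residual quantization.

    z:         (d,) item embedding
    codebooks: list of L arrays, each of shape (K, d)
    returns:   the semantic ID as a tuple of L code indices (c_1, ..., c_L)
    """
    codes = []
    residual = z.copy()
    for codebook in codebooks:                     # levels l = 1 .. L
        dists = np.linalg.norm(codebook - residual, axis=1)
        c = int(np.argmin(dists))                  # nearest codeword at this level
        codes.append(c)
        residual = residual - codebook[c]          # carry the remainder downward
    return tuple(codes)

# Toy usage: L = 3 levels, K = 256 codewords per level, d = 64 dimensions.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]
print(residual_quantize(rng.normal(size=64), codebooks))  # e.g., (17, 203, 88)
```

Because each level quantizes only the residual left by the previous one, earlier codes carry coarse semantics and later codes refine them.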
Rule-Based and Label Aggregation
Some systems use human or machine-annotated labels to define SIDs, particularly when ground-truth semantics are well defined, as in person attribute reID (Eom et al., 2 Dec 2024).
Session or Behavior Fingerprints
User/browser sessions are mapped to signature distributions—such as visit frequencies across semantic categories—to form unique behavioral profiles used as fingerprints for semantic identification attacks (Guha, 2016).
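A minimal sketch of this idea, with illustrative category names; real fingerprinting attacks use far richer feature sets:

```python
from collections import Counter

def session_fingerprint(visited_categories, all_categories):
    """Represent a session as its normalized visit distribution over semantic categories."""
    counts = Counter(visited_categories)
    total = sum(counts.values()) or 1
    return [counts[c] / total for c in all_categories]

def l1_distance(p, q):
    """Compare two fingerprints; a small distance suggests the same user."""
    return sum(abs(a - b) for a, b in zip(p, q))

categories = ["sports", "finance", "cooking", "tech"]
fp_a = session_fingerprint(["tech", "tech", "finance"], categories)
fp_b = session_fingerprint(["tech", "finance", "tech", "tech"], categories)
print(l1_distance(fp_a, fp_b))  # near 0: the sessions look behaviorally alike
```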
Generation via LLMs
For cross-modal tasks, LLMs or MLLMs are prompted to generate structured semantic identifiers directly from input pairs (e.g., image-caption), often involving natural language expressions of object/action composites (Li et al., 22 Sep 2025).
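A hypothetical example of the interaction pattern; the prompts and output schemas used in published systems differ:

```python
# Illustrative prompt for eliciting a concept-level semantic ID from an MLLM.
# The three-token <object>-<context>-<action> format is an assumption for
# this sketch, not the format of any specific published system.
PROMPT = (
    "Given the image and its caption, output a semantic identifier as three\n"
    "hyphen-separated concept tokens of the form <object>-<context>-<action>.\n"
    "\n"
    'Caption: "A cat sitting on a window sill."\n'
    "Semantic ID:"
)
# Expected completion (illustrative): cat-window-sitting
```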
3. Hierarchical, Discrete, and Interpretable Structure
Semantic IDs are generally:
- Hierarchical: Each level in the code sequence corresponds to a coarse-to-fine semantic division: higher levels group broad categories, while lower levels act as refinements (e.g., “clothing” → “topwear” → “dress”) (Singh et al., 2023, Fang et al., 6 Aug 2025); a prefix-grouping sketch follows this list.
- Compact and Discrete: Represented by short sequences or multi-level indices, ensuring compact storage, fast lookup, and straightforward model integration (Wang et al., 2 Jun 2025).
- Interpretable (in some designs): Especially with supervision or rationale-guided generation, SIDs can be human-interpretable, traceable to semantic tags, or include generation rationales as auxiliary supervision (Fang et al., 6 Aug 2025, Li et al., 22 Sep 2025).
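The coarse-to-fine property means items that agree on a prefix of their codes fall into the same semantic bucket. A minimal sketch with a toy, hand-assigned catalog:

```python
from collections import defaultdict

def group_by_prefix(semantic_ids, depth):
    """Group items whose semantic IDs share their first `depth` codes."""
    buckets = defaultdict(list)
    for item, sid in semantic_ids.items():
        buckets[sid[:depth]].append(item)
    return dict(buckets)

# Toy catalog of (category, subcategory, fine-grained) code triples.
catalog = {
    "red dress":  (3, 7, 12),
    "blue dress": (3, 7, 44),
    "t-shirt":    (3, 2, 9),
    "frying pan": (8, 1, 5),
}
print(group_by_prefix(catalog, depth=2))
# {(3, 7): ['red dress', 'blue dress'], (3, 2): ['t-shirt'], (8, 1): ['frying pan']}
```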
A summary of SID structure across domains:
| Approach | Construction Basis | Example Structure |
|---|---|---|
| RQ-VAE | Quantized content embeddings | (c₁, c₂, ..., c_L) |
| Attribute aggregation | Grouped label combinations | "young-male-glasses" |
| Session fingerprint | Category-wise visit distribution | [0.1, 0.2, 0.3, ...] |
| LLM-generated | Concept-level tokens | "cat-window-sitting" |
4. Applications in Information Retrieval, Recommendation, Security, and ReID
Semantic IDs have broad applicability:
- Recommender systems: Replace random IDs to enable improved generalization on cold-start/long-tail items, stabilize embedding tables, and support large, dynamic corpora (Singh et al., 2023, Zheng et al., 2 Apr 2025). They are adopted in production at Meta, YouTube, Taobao, and Kuaishou (Zheng et al., 2 Apr 2025, Fu et al., 25 Sep 2025, Ye et al., 14 Aug 2025).
- Generative retrieval and combined search & recommendation: Power unified LLM-driven generative frameworks able to handle both search queries and recommendation, leveraging jointly trained or multi-task encoded SIDs (Penha et al., 14 Aug 2025). Recent work emphasizes the importance of unified semantic ID spaces over purely task-specific ones for strong cross-task generalization.
- Cross-modal and multi-modal retrieval: Used in LLM- or MLLM-based frameworks to index images, videos, or complex multi-modal content by generating compact, concept-level SIDs (Li et al., 22 Sep 2025, Xu et al., 21 Aug 2025).
- Intrusion detection and security: Rule-based semantic “IDs” represent contextualized features or patterns, mapped and matched using formal grammars for robust application-layer detection (Sangeetha et al., 2010). Session fingerprinting for privacy attacks also leverages such representations (Guha, 2016).
- Re-Identification (reID), Attribute Recognition, and Search: Attribute label aggregation defines SIDs that are used to anchor embedding spaces, supporting zero-shot generalization and compositional querying in tasks like person reID and attribute-based search (Eom et al., 2 Dec 2024).
5. Challenges: Capacity, Conflict, Alignment, and Learnability
Key limitations and ongoing research topics in the evolution of semantic IDs include:
- Capacity Bottlenecks: Quantizing high-dimensional, information-rich embeddings into short sequences of codes can result in loss of semantic detail and poor scaling behavior. SID-based generative recommendation models exhibit quick performance saturation as system scale increases, with performance bottlenecked by the expressiveness of SIDs (Liu et al., 29 Sep 2025).
- Conflict and Collisions: Assigning identical SIDs to distinct but similar items (semantic collisions) degrades model accuracy and diversity. Existing solutions include appending non-semantic tokens, though this enlarges the search space and hurts cold-start performance. Model-agnostic, uniqueness-enforcing assignment algorithms such as Exhaustive Candidate Matching (ECM) and Recursive Residual Searching (RRS) address this (Zhang et al., 19 Sep 2025); a simplified sketch follows this list. Uniqueness or margin-based losses can also reduce representational entanglement (Fang et al., 6 Aug 2025).
- Behavioral Alignment: There is often a mismatch between semantic representations and actual user behavioral signals (the semantic-behavioral gap). Solutions such as multimodal mixture-of-quantization (MMQ) and behavior-aware fine-tuning adapt SIDs during downstream task training, ensuring alignment with user interactions (Xu et al., 21 Aug 2025).
- Catastrophic Forgetting and Embedding Collapse: Re-initialization of semantic code embeddings in downstream models erases structural information gained during quantization (as shown by drops in order statistics or intra-modal correlation), and direct projection of collaborative embeddings into high-dimensional LLM spaces can cause embedding collapse. Preserving and initializing from pretrained code embeddings, or employing LoRA for efficient, focused adaptation, are used to mitigate this (Wang et al., 2 Sep 2025).
- Modality Synergy and Specificity: Effectively integrating multi-modal information (text, vision, collaborative) in SID tokenization requires architectures that capture both shared and unique features, e.g., MMQ’s shared-specific tokenizer with orthogonal regularization (Xu et al., 21 Aug 2025).
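To make the collision-handling idea concrete, here is a simplified, hypothetical sketch in the spirit of relaxed centroid selection; it is not the published ECM or RRS algorithm:

```python
import numpy as np

def assign_unique_sid(residual, last_codebook, prefix, taken):
    """Resolve a semantic-ID collision without appending non-semantic tokens.

    Instead of always taking the nearest last-level codeword, walk the
    codewords in order of increasing distance until the full ID is unused,
    so every item gets a unique yet still purely semantic ID.

    residual:      (d,) residual entering the last quantization level
    last_codebook: (K, d) last-level codebook
    prefix:        tuple of codes chosen at the earlier levels
    taken:         set of already-assigned semantic IDs (mutated in place)
    """
    dists = np.linalg.norm(last_codebook - residual, axis=1)
    for c in np.argsort(dists):                # nearest codeword first
        sid = prefix + (int(c),)
        if sid not in taken:
            taken.add(sid)
            return sid
    raise RuntimeError("last-level codebook exhausted for this prefix")
```

Accepting a slightly farther codeword trades a little quantization error for guaranteed uniqueness without enlarging the ID vocabulary.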
6. Empirical Findings and Industrial Impact
Semantic IDs have demonstrated broad empirical gains:
- Improvements in Generalization and Cold-Start: SIDs allow more faithful generalization to new or rare items by ensuring similar items share parts of their code, thereby enabling knowledge transfer to the tail of the corpus (Singh et al., 2023, Zheng et al., 2 Apr 2025).
- Reduced Model Size and Improved Efficiency: By decoupling embedding-table growth from corpus size, SIDs cut parameter counts dramatically, with reported reductions of 75–99% in music recommendation and 3× memory savings in advertising (Mei et al., 24 Jul 2025, Ramasamy et al., 20 Jun 2025); illustrative arithmetic follows this list.
- Memory and Serving Efficiency: Discrete token representations support low-latency serving and adaptation to dynamically changing item catalogs. This has led to extensive industrial deployment at large-scale platforms (Zheng et al., 2 Apr 2025, Fu et al., 25 Sep 2025, Ye et al., 14 Aug 2025).
- Enhancement in Diversity and Interpretability: Hierarchically supervised or rationale-guided SIDs improve recommendation diversity and enable interpretable, controllable outputs in both text and cross-modal retrieval (Fang et al., 6 Aug 2025, Li et al., 22 Sep 2025).
- Unified and Scalable System Design: Recent studies emphasize the importance of framework modularity (e.g., GRID (Ju et al., 29 Jul 2025)), cross-task reusable SIDs for search and recommendation (Penha et al., 14 Aug 2025), and offline pretraining for fast production convergence (Fu et al., 25 Sep 2025).
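Back-of-the-envelope arithmetic (with assumed round numbers, not figures from the cited papers) shows why decoupling the embedding table from the corpus shrinks models:

```python
# Assumed round numbers for illustration only.
n_items, dim = 100_000_000, 128            # catalog size, embedding width
levels, codebook_size = 4, 1024            # SID levels and codewords per level

id_embedding_params = n_items * dim                  # one row per item
sid_embedding_params = levels * codebook_size * dim  # one row per codeword

print(f"{id_embedding_params:,}")   # 12,800,000,000
print(f"{sid_embedding_params:,}")  # 524,288
# Four levels of 1024 codes can address 1024**4 ≈ 1.1e12 distinct IDs,
# far more than the 1e8-item catalog, at roughly 0.004% of the parameters.
```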
7. Current Directions and Open Problems
Recent and emerging directions in semantic ID research include:
- Scaling Limits and Alternatives: It is now understood that SID-based generative recommendation systems have fundamental bottlenecks in scaling, with performance saturating quickly; direct LLM-as-recommender frameworks (LLM-as-RS) have shown stronger scaling properties and better integration of both semantic and collaborative signals as LLMs scale (Liu et al., 29 Sep 2025).
- Ensuring Uniqueness without Non-Semantic Tokens: To avoid random, non-semantic augmentation for collision resolution, new algorithms (e.g., ECM, RRS) relax centroid selection during quantization to guarantee unique, purely semantic IDs for every item (Zhang et al., 19 Sep 2025).
- Hierarchical and Disentangled Learning: Tag-alignment and uniqueness (disentanglement) losses produce interpretable, hierarchical SIDs with reduced collision, boosting both accuracy and controllability (Fang et al., 6 Aug 2025).
- Multimodal and Behavior-Adaptive Tokenization: MMQ and similar frameworks address the challenge of fusing modality-specific and modality-shared features, while behavior-aware fine-tuning strategies optimize SIDs for actual recommendation objectives (Xu et al., 21 Aug 2025).
- Contrastive and Dual Alignment: Dual-aligned SID methods simultaneously optimize quantization and collaborative alignment, integrating multi-view contrastive losses and dual user–item quantizations within one stage (Ye et al., 14 Aug 2025).
- Industrial Datasets and Evaluation Metrics: Benchmarks such as FORGE (Fu et al., 25 Sep 2025) provide large-scale, multi-modal resources and propose lightweight pretraining along with direct metrics (e.g., embedding hitrate, Gini coefficient) to evaluate SID quality without costly end-to-end retraining; a sketch of such a usage-balance metric follows.
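A minimal sketch of a codeword-usage balance metric in this spirit; the exact metric definitions used by FORGE may differ:

```python
import numpy as np

def gini(counts):
    """Gini coefficient of codeword usage: 0 = perfectly uniform, near 1 = collapsed.

    Skewed usage signals wasted codebook capacity, so lower is better
    when codewords are meant to be utilized evenly.
    """
    x = np.sort(np.asarray(counts, dtype=float))   # ascending
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

print(gini([100, 100, 100, 100]))  # 0.0   (balanced codebook usage)
print(gini([397, 1, 1, 1]))        # ~0.74 (usage collapsed onto one codeword)
```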
The further integration of multi-modal content, advancement of tokenization and codebook learning, and development of more robust alignment and uniqueness mechanisms are expected to shape future research and deployment of semantic IDs in large-scale retrieval, recommendation, and generative systems.