Adaptive Embedding Sizes in Deep Learning
- Adaptive embedding sizes are techniques that dynamically adjust vector dimensions per entity based on empirical importance, frequency, and task signals.
- Methods like reinforcement learning, bandit frameworks, and mask-based approaches optimize resource allocation while strictly enforcing memory or parameter budgets.
- Empirical studies show these adaptive approaches improve ranking metrics and reduce memory footprints in streaming, recommendation, and retrieval systems.
Adaptive embedding sizes refer to techniques that dynamically allocate or tune the dimensionality of embedding vectors—often per entity, feature, or instance—instead of fixing them uniformly across a model. This addresses the inefficiency and memory constraints inherent to standard uniform-size embeddings, especially in large-scale recommender systems, retrieval models, and modern deep learning pipelines that rely on categorical or dense vector representations. Adaptive sizing mechanisms are critical for balancing prediction performance, efficient memory use, and the ability to handle nonstationary, streaming, or long-tailed data distributions.
1. Motivation and Problem Formulation
Embedding layers dominate the parameter and memory footprints of large-scale recommender and retrieval systems due to the vast number of categorical entities (users, items, tokens) each with an individual embedding vector. Traditional approaches fix a uniform embedding dimension, incurring two main problems: (i) over-parameterization for low-frequency or “cold” identities, leading to overfitting and wasted capacity; (ii) under-provision for frequent or “hot” identities, restricting expressiveness. With data distributions evolving (e.g., in streaming recommendations), static dimension assignments rapidly lead to inefficiencies and possible violations of strict memory budgets (Qu et al., 2024, Joglekar et al., 2019).
The core principle of adaptive embedding sizes is to allocate representational capacity proportionally to empirical importance, frequency, or other task-dependent signals, subject to global resource constraints. The formal optimization seeks

$$\min_{\Theta,\,\mathbf{d}} \; \mathcal{L}(\Theta, \mathbf{d}) \quad \text{s.t.} \quad \sum_{i} d_i \le B,$$

where $\Theta$ are model parameters, $d_i$ is the per-entity (or per-feature) embedding size, and $B$ is a memory or parameter budget (Qu et al., 2023). The overarching goal is to identify dimension assignments that maximize downstream utility (ranking/recommendation/retrieval accuracy) while strictly enforcing budgetary or latency constraints.
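To make the budgeted formulation concrete, the following minimal sketch assigns per-entity sizes from interaction frequencies and then trims them until the total fits the budget. The log-frequency heuristic, the function name `allocate_dims`, and the bounds `d_min`/`d_max` are assumptions for illustration only; none of the cited methods is claimed to work this way.

```python
import math

def allocate_dims(freqs, budget, d_min=1, d_max=64):
    """Give each entity an embedding size so that the sizes sum to at most `budget`.

    Hypothetical heuristic: sizes are proportional to log(1 + frequency).
    """
    n = len(freqs)
    if budget < d_min * n:
        raise ValueError("budget cannot cover d_min dimensions per entity")
    weights = [math.log1p(f) for f in freqs]
    total = sum(weights) or 1.0  # avoid division by zero when all frequencies are 0
    # Proportional share of the budget, rounded down and clipped to [d_min, d_max].
    dims = [max(d_min, min(d_max, int(budget * w / total))) for w in weights]
    # Raising small shares to d_min can overshoot; shrink the largest sizes until it fits.
    while sum(dims) > budget:
        i = max(range(n), key=lambda j: dims[j])
        dims[i] -= 1
    return dims

freqs = [1000, 100, 10, 1]  # interaction counts, hot to cold
dims = allocate_dims(freqs, budget=64)
print(dims, sum(dims))
```

A learned allocator replaces the fixed heuristic with a policy trained against the downstream loss, but the hard constraint it must satisfy has the same shape.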
2. Core Methodologies for Adaptive Embedding Size Search
2.1 Reinforcement Learning and Bilevel Optimization
Several approaches model embedding size search as a sequential decision process controlled via RL, where the state may encode global statistics (frequency histograms, performance metrics), actions correspond to size allocations (or distributional parameters), and rewards reflect downstream task gains penalized by memory usage. NIS employs an RNN-based RL controller for multi-size specification, jointly optimizing coverage and expressiveness under a hard cardinality constraint; warm-up and baseline subtraction stabilize the search (Joglekar et al., 2019). SCALL constrains the embedding table using a binary mask and samples size allocations from parameterized power-law distributions to precisely fit a resource budget per segment, with an SAC-based policy network guiding the evolution over streaming data (Qu et al., 2024).
2.2 Bandit Formulations and Surrogate Modeling
For streaming contexts, bandit-based frameworks (e.g., DESS) select embedding-size arms dynamically for each user/item using contextual features such as interaction frequency and diversity. Sublinear regret bounds are achieved with discounted, contextual LinUCB policies tracking nonstationary utility across arms and enabling fine-grained size tuning without heavy neural controllers. The contextual variables (e.g., diversity measures) estimate the informativeness of user/item histories, guiding arm selection (He et al., 2023).
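The bandit view can be sketched with a generic discounted LinUCB learner in which each arm is a candidate embedding size. This is an illustrative sketch under stated assumptions, not the DESS implementation: the class name `DiscountedLinUCB`, the hyperparameters `alpha`, `gamma`, and `lam`, and the two-feature context are all hypothetical.

```python
import numpy as np

class DiscountedLinUCB:
    """One linear reward model per arm; each arm is a candidate embedding size."""

    def __init__(self, sizes, ctx_dim, alpha=1.0, gamma=0.99, lam=1.0):
        self.sizes = list(sizes)
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.A = [lam * np.eye(ctx_dim) for _ in self.sizes]  # regularized Gram matrices
        self.b = [np.zeros(ctx_dim) for _ in self.sizes]      # reward-weighted context sums

    def select(self, x):
        """Return the index of the arm with the highest upper confidence bound."""
        x = np.asarray(x, dtype=float)
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Discount old evidence for the pulled arm, then add the new observation."""
        x = np.asarray(x, dtype=float)
        ridge = (1.0 - self.gamma) * self.lam * np.eye(len(x))  # keeps the ridge term at lam
        self.A[arm] = self.gamma * self.A[arm] + np.outer(x, x) + ridge
        self.b[arm] = self.gamma * self.b[arm] + reward * x

bandit = DiscountedLinUCB(sizes=[8, 16, 32], ctx_dim=2)
x = [1.0, 0.7]                     # hypothetical context: [bias, normalized frequency]
arm = bandit.select(x)
bandit.update(arm, x, reward=0.4)  # reward could be a validation-metric gain
print(bandit.sizes[arm])
```

The discount factor `gamma` down-weights older observations, which is what lets the estimates follow a drifting reward distribution. With `gamma=1.0` the learner reduces to standard LinUCB.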
Surrogate modeling is exemplified by BET, which encodes full table-level actions (dimension assignment vectors) using set-based representations, enabling efficient candidate evaluation via MLP predictors trained on past fitness outcomes. This replaces costly per-entity RL with table-level action sampling and selection strategies (greedy, uniform, nearest neighbor in embedding space), maintaining budget adherence and coverage (Qu et al., 2023).
2.3 Masking, Pruning, and One-Shot Supernets
Mask-based methods such as AMTL and PEP attach auxiliary mask generators or threshold matrices to the embedding tables, learning either binary or soft masks per entity or dimension. Pruning is performed via learned thresholds—with or without explicit regularization—reducing the table to a mixed-dimension, sparsified structure without manual allocation (Yan et al., 2021, Liu et al., 2021). Advanced “supernet” approaches (e.g., AdaS&S) construct a universal DLRM encompassing all candidate embedding sizes; adaptive sampling during pretraining ensures proper parameter coverage, followed by RL search with tunable resource penalties to output a stable, high-performance, budget-constrained configuration (Wei et al., 2024).
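Threshold-based pruning can be illustrated with a soft-threshold operator applied to an embedding table. The sketch below assumes a sigmoid-parameterized threshold `s` that is fixed here but would be trained jointly with the embeddings in practice; the function names and toy values are hypothetical, and the exact parameterization used by PEP or AMTL may differ.

```python
import numpy as np

def soft_threshold(V, s):
    """Shrink each entry toward zero by sigmoid(s); smaller entries become exactly 0."""
    g = 1.0 / (1.0 + np.exp(-s))
    return np.sign(V) * np.maximum(np.abs(V) - g, 0.0)

def effective_dims(V_pruned):
    """Number of surviving (non-zero) coordinates per entity."""
    return (V_pruned != 0).sum(axis=1)

V = np.array([[0.90, -0.60, 0.05, 0.02],   # a "hot" entity with large weights
              [0.30, 0.01, -0.02, 0.00]])  # a "cold" entity with small weights
pruned = soft_threshold(V, s=-2.0)         # threshold is sigmoid(-2), about 0.12
print(effective_dims(pruned))
```

Because entries below the threshold are set exactly to zero, the table can be stored sparsely, and each row's non-zero count acts as its effective embedding size.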
2.4 Instance-Specific Quantization and Gumbel-Softmax Reparameterization
In vector quantization settings, codebook size $K$ and embedding dimension $D$ are balanced under a fixed capacity constraint ($C = K \cdot D$). Gumbel-Softmax-based dynamic quantizers select, per instance, among candidate codebooks characterized by different $(K, D)$ tuples satisfying the total constraint, adapting the allocation to each input and reportedly achieving lower reconstruction loss than the best static setting (Chen et al., 2024).
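The selection step can be sketched as follows: enumerate codebook shapes whose size-times-dimension product equals the capacity, then draw a relaxed one-hot choice with Gumbel noise. This is a simplified illustration, not the cited method's code; the function names, the temperature `tau`, and the all-zero logits (which a learned per-instance scorer would supply) are assumptions.

```python
import numpy as np

def candidate_codebooks(capacity, dims):
    """All (num_codes, dim) pairs with num_codes * dim == capacity."""
    return [(capacity // d, d) for d in dims if capacity % d == 0]

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample over the candidates (a probability vector)."""
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())  # subtract the max for numerical stability
    return y / y.sum()

rng = np.random.default_rng(0)
cands = candidate_codebooks(capacity=1024, dims=[8, 16, 32, 64])
logits = np.zeros(len(cands))  # placeholder for learned, per-instance scores
weights = gumbel_softmax(logits, tau=1.0, rng=rng)
num_codes, dim = cands[int(np.argmax(weights))]
print(cands, (num_codes, dim))
```

During training the soft weights keep the choice differentiable. Taking the argmax, as in the last lines, corresponds to a hard selection at inference time.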
2.5 Nested/Matryoshka Representation Learning for Compression
For embedding compression (e.g., for LLM outputs), nested (“Matryoshka”) architectures produce embeddings whose lower-dimensional truncations are optimally aligned for retrieval or ranking under both unsupervised and supervised objectives. Matryoshka-Adaptor learns a single residual mapping such that the leading $d$ dimensions of the output serve directly as the embedding at size $d$, supporting arbitrary compression rates. SMEC improves upon this by sequential hierarchy-wise training (reducing gradient variance), differentiable coordinate pruning (via Gumbel-Softmax), and cross-batch negative mining, achieving superior retrieval performance and flexibility (Yoon et al., 2024, Zhang et al., 14 Oct 2025).
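At inference time, using a nested embedding amounts to slicing off the leading coordinates and renormalizing before computing similarities. The sketch below shows only that truncation step on random stand-in vectors. It does not implement the adaptor or its training objective, and the helper names `truncate` and `top_k` are hypothetical.

```python
import numpy as np

def truncate(emb, d):
    """Keep the leading d coordinates and L2-normalize them."""
    head = np.asarray(emb, dtype=float)[..., :d]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / np.maximum(norm, 1e-12)

def top_k(query, corpus, d, k):
    """Indices of the k corpus rows most cosine-similar to the query at size d."""
    scores = truncate(corpus, d) @ truncate(query, d)
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 64))  # stand-in for full-size embeddings
query = corpus[3]
for d in (64, 32, 16):
    print(d, top_k(query, corpus, d, k=3))
```

With embeddings trained for nesting, the rankings at smaller sizes are meant to stay close to the full-size ranking. With the random vectors used here, only the exact match is guaranteed to stay on top.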
3. Enforcing and Navigating Resource Constraints
A central challenge in adaptive embedding sizing is the strict enforcement of parameter or memory budgets despite continuously changing data. Solutions include:
- Mask matrix formulations that tie total active entries to the memory or average-dimension constraint (e.g., SCALL, BET) (Qu et al., 2024, Qu et al., 2023).
- Sampling strategies over (parameterized) distributions (power-law, truncated normal) whose normalization factors and sampled allocations are globally scaled to fit the exact budget (Qu et al., 2024).
- Direct inclusion of resource penalties in RL objectives; e.g., AdaS&S’s reward explicitly balances AUC against mean/total embedding size, and resource-competition terms encourage allocation divergence across features (Wei et al., 2024).
- Pruning-based approaches monitor the instantaneous active parameter count, terminating or rebalancing when the desired limit is reached (Liu et al., 2021).
Theoretical analyses justify such budgets: in dynamic quantization, expected reconstruction error is decomposed into quantization and representation terms, predicting an optimal allocation point for each data distribution (Chen et al., 2024).
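The mask and rescaling strategies listed above can be combined in a small sketch: sizes follow a power law over the frequency rank, are scaled so that they sum exactly to the budget, and are then turned into a binary mask whose active entries equal that budget. The exponent, the largest-remainder rounding, and the function names are assumptions for illustration, not details taken from SCALL or BET.

```python
def power_law_sizes(n_entities, budget, exponent=0.5):
    """Integer sizes proportional to rank^(-exponent) that sum exactly to `budget`."""
    raw = [(rank + 1) ** -exponent for rank in range(n_entities)]
    scale = budget / sum(raw)
    ideal = [r * scale for r in raw]
    sizes = [int(x) for x in ideal]  # round down first
    leftover = max(budget - sum(sizes), 0)
    # Hand the remaining dimensions to the entities with the largest fractional parts.
    by_remainder = sorted(range(n_entities), key=lambda i: ideal[i] - sizes[i], reverse=True)
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes

def build_mask(sizes, d_max):
    """Binary mask with the first sizes[i] entries of row i active."""
    return [[1 if j < s else 0 for j in range(d_max)] for s in sizes]

sizes = power_law_sizes(n_entities=10, budget=100)  # entities sorted hot to cold
mask = build_mask(sizes, d_max=max(sizes))
print(sizes, sum(map(sum, mask)))
```

Since the sizes depend only on rank, a new entity can be sized as soon as its frequency rank is known. Entities deep in the tail may receive size zero under a tight budget, so a practical system would need a fallback such as a shared embedding.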
4. Adaptation Mechanisms: Streaming, Unseen Entities, and Nonstationarity
Adaptive embedding sizes are notably effective in streaming and dynamic settings:
- Table-level RL (SCALL) or bandit (DESS) policies infer allocation rules from statistical summaries, not explicit IDs, enabling immediate sizing for unseen users/items based on their frequency ranking or context group (Qu et al., 2024, He et al., 2023).
- Streaming protocols call for policies or controllers that need no per-entity retraining and that avoid catastrophic forgetting via buffer or reservoir sampling, as in SCALL and AutoEmb (Qu et al., 2024, Zhao et al., 2020).
- In compression/regression settings, frameworks such as Matryoshka-Adaptor and SMEC handle dimension reduction for unseen queries by learning similarity-preserving mappings at all plausible truncation sizes, agnostic to the underlying base model (Yoon et al., 2024, Zhang et al., 14 Oct 2025).
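The reservoir sampling mentioned in the list above can be sketched with the classic Algorithm R, which keeps a fixed-size uniform sample of everything seen so far in a stream. The class name `ReservoirBuffer` and the toy interaction tuples are hypothetical, and the replay schemes used in the cited systems may differ in detail.

```python
import random

class ReservoirBuffer:
    """Fixed-size uniform sample of a stream (Algorithm R)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)           # fill phase: keep everything
        else:
            j = self.rng.randrange(self.seen)  # uniform in [0, seen)
            if j < self.capacity:              # happens with probability capacity / seen
                self.items[j] = item

buffer = ReservoirBuffer(capacity=5, seed=0)
for t in range(100):
    buffer.add(("user", t))  # stand-in for a streamed interaction
print(buffer.items)
```

Replaying such a buffer alongside each new data segment exposes the model to older interactions without storing the whole stream.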
5. Empirical and Comparative Evaluation
A broad range of adaptive sizing methods (e.g., SCALL, BET, DESS, NIS, AMTL, PEP, AdaS&S, Matryoshka-Adaptor, SMEC) are reported to improve downstream task metrics (Recall@K, NDCG, ROC-AUC) at substantially reduced memory footprints, compared to both uniform and naive rule-based allocations. Notable empirical highlights include:
- SCALL achieving R@20 gains over static and streaming baselines while exactly enforcing budget constraints, and providing robust adaptation to streaming data (Qu et al., 2024).
- BET surpassing other state-of-the-art methods in Recall/NDCG at the same sparsity, with guaranteed budget satisfaction (Qu et al., 2023).
- DESS achieving sublinear regret in nonstationary embedding sizing, outperforming prior contextual-bandit and RL-based candidates and yielding real-time streaming applicability (He et al., 2023).
- PEP and AMTL reducing embedding table size by 60–99% without performance loss or even with improvements, leveraging mask/threshold learning (Liu et al., 2021, Yan et al., 2021).
- Supernet-based AdaS&S yielding +0.1%–0.3% AUC gains while cutting 20–30% of parameters, and delivering much higher search stability than DARTS-style methods (Wei et al., 2024).
- In Matryoshka and SMEC frameworks, a 2–12x reduction in embedding size is achieved with minimal nDCG degradation or even improvements at high compression ratios, across language, modality, and API base model choices (Yoon et al., 2024, Zhang et al., 14 Oct 2025).
| Method | Setting | Budget enforcement | Adaptivity granularity | Key Performance Result |
|---|---|---|---|---|
| SCALL | Streaming RecSys | Hard (mask) | User/item (table-level) | Higher R@20/N@20, strict memory fit (Qu et al., 2024) |
| BET | Batch RecSys | Hard (sampler) | User/item (table-level) | Higher Recall/NDCG, O(T) efficiency (Qu et al., 2023) |
| DESS | Streaming RecSys | Soft (bandit) | User/item (per interaction) | Sublinear regret, best memory/accuracy trade-off (He et al., 2023) |
| PEP/AMTL | Offline RecSys | Hard/soft | Parameter/feature | 60–99% fewer params, improved AUC (Liu et al., 2021, Yan et al., 2021) |
| AdaS&S | Batch RecSys | Hard (RL+penalty) | Feature (per-field) | +0.1–0.3% AUC, 20–30% param saving, high stability (Wei et al., 2024) |
| Matryoshka/SMEC | Embedding Compression | Implicit (loss) | Any truncation | 2–12x reduction, nDCG@10 within 1pt of full-dim (Yoon et al., 2024, Zhang et al., 14 Oct 2025) |
| Adaptive VQ (VQVAE) | VQ/Generative | Hard (C=K·D) | Instance (per-input) | Lowest reconstruction loss vs. best static (Chen et al., 2024) |
6. Limitations, Extensions, and Research Outlook
Despite substantial empirical and theoretical advances, several important axes remain open:
- RL and bandit-based methods incur additional computational or sample efficiency costs; table-level or inductive surrogates partially mitigate this by reducing action space dimensionality (Qu et al., 2023).
- Mask/pruning approaches may face hyperparameter or stability issues, especially under extreme compression or if frequency statistics are unavailable (Yan et al., 2021).
- Supernet training and ensemble sampling (AdaS&S) trade off memory stability for construction cost; shared table variants help scale to large M, T (Wei et al., 2024).
- Most methods focus on memory; latency, fairness, or privacy-oriented constraints present additional complications and opportunities for joint optimization (Qu et al., 2023).
- The performance of instance-specific or group-specific adaptivity is fundamentally tied to accurate, responsive, and explainable context features—current methods leverage frequency, diversity, or recent interaction history, but side information remains underexplored.
- As LLMs, multimodal retrieval, and personalized AI systems expand in scope, efficient, nested embedding compression (Matryoshka, SMEC) is increasingly relevant, e.g., for on-device or API-based deployments (Yoon et al., 2024, Zhang et al., 14 Oct 2025).
A plausible implication is that future progress will increasingly exploit modular, hierarchical, and resource-aware adaptivity, integrating learned allocation strategies not only per entity or instance, but also across modalities, distributions, and time, possibly within federated or privacy-preserving frameworks.
References:
- (Qu et al., 2024) Scalable Dynamic Embedding Size Search for Streaming Recommendation
- (Joglekar et al., 2019) Neural Input Search for Large Scale Recommendation Models
- (Yoon et al., 2024) Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions
- (Qu et al., 2023) Budgeted Embedding Table For Recommender Systems
- (Zhang et al., 14 Oct 2025) SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression
- (Yan et al., 2021) Learning Effective and Efficient Embedding via an Adaptively-Masked Twins-based Layer
- (Chen et al., 2024) Balance of Number of Embedding and their Dimensions in Vector Quantization
- (He et al., 2023) Dynamic Embedding Size Search with Minimum Regret for Streaming Recommender System
- (Liu et al., 2021) Learnable Embedding Sizes for Recommender Systems
- (Zhao et al., 2020) AutoEmb: Automated Embedding Dimensionality Search in Streaming Recommendations
- (Wei et al., 2024) AdaS&S: a One-Shot Supernet Approach for Automatic Embedding Size Search in Deep Recommender System