
Dynamic Model Size Selection (DMSS)

Updated 2 January 2026
  • Dynamic Model Size Selection (DMSS) is a framework that adapts model capacity such as depth or width in response to data, context, and resource constraints.
  • It employs methodologies like measurement-driven planning, gating functions, differentiable selection, and bandit approaches to balance accuracy, cost, and energy efficiency.
  • DMSS demonstrates significant efficiency gains in applications including live video analytics, green AI, language models, and streaming systems while maintaining target performance.

Dynamic Model Size Selection (DMSS) is a principled framework for optimizing compute and resource utilization by dynamically adapting a model's effective capacity (such as depth, width, parameter count, or codebook size) to the data, context, or resource constraints at hand. The overarching objective in DMSS is to attain the best possible predictive or generative utility—subject to explicit constraints on latency, energy, inference cost, or memory—by dynamically selecting from a pool of candidate models or architectural subgraphs, rather than applying a fixed model size uniformly. DMSS is now a core technique in contemporary large-scale vision, language, streaming, and resource-limited systems for balancing accuracy, efficiency, and sustainability.

1. Formal Problem Statements and Domain-Specific Instantiations

DMSS formally addresses selection among a set (or continuous spectrum) of models $\mathcal{M} = \{m_1, \dots, m_K\}$, each characterized by an inference cost $C(m_i)$ and an accuracy $A(m_i, s)$ on task instances or segments $s \in \mathcal{S}$. The canonical DMSS objective is to minimize total or average prediction cost,

$$\text{TotalCost} = \sum_{s} C(m_s, s),$$

where $m_s$ denotes the model selected for segment $s$, while retaining aggregate or segment-wise accuracy above a user-specified threshold $\alpha_{\text{target}}$,

$$\text{TotalAcc} = \frac{\sum_s A(m_s, s)\cdot |s|}{\sum_s |s|} \geq \alpha_{\text{target}}.$$

This scalarization can be cast as a per-instance or per-segment objective function

$$\text{cost\_acc\_obj}(m, s) = \alpha\cdot C(m) + (1-\alpha)\cdot[1-\hat{A}(m, s)],$$

where $\alpha \in [0, 1]$ trades off cost and predicted accuracy $\hat{A}(m, s)$ (Sela et al., 30 Dec 2025).
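A minimal sketch of this scalarized per-segment selection appears below. The candidate model list, the per-model cost table, and the accuracy predictor are hypothetical placeholders for illustration, not artifacts of the cited work.

```python
# Minimal sketch: pick, per segment, the model minimizing the
# scalarized cost/accuracy objective. Models, costs, and the
# accuracy predictor below are illustrative placeholders.

def cost_acc_obj(cost, predicted_acc, alpha):
    """alpha * C(m) + (1 - alpha) * [1 - A_hat(m, s)]."""
    return alpha * cost + (1.0 - alpha) * (1.0 - predicted_acc)

def select_model(segment, models, costs, predict_accuracy, alpha=0.5):
    """Return the model with the lowest trade-off score for one segment."""
    best_model, best_score = None, float("inf")
    for m in models:
        score = cost_acc_obj(costs[m], predict_accuracy(m, segment), alpha)
        if score < best_score:
            best_model, best_score = m, score
    return best_model

# Example usage with made-up numbers:
models = ["tiny", "base", "large"]
costs = {"tiny": 0.1, "base": 0.4, "large": 1.0}   # normalized inference cost
predict_accuracy = lambda m, seg: {"tiny": 0.80, "base": 0.90, "large": 0.95}[m]
print(select_model("segment-0", models, costs, predict_accuracy, alpha=0.3))
```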

Across domains, variations include:

  • Live Video Analytics: DMSS selects from pre-trained vision models for each video segment, leveraging measurement-driven planning and runtime prediction statistics (Sela et al., 30 Dec 2025).
  • Green AI: Selection targets energy-aware inference, using confidence-driven cascading or learned routing to invoke the smallest model meeting the prediction's confidence or empirical risk constraint (Cruciani et al., 24 Sep 2025).
  • Layer Selection in LLMs: For decoder-only transformers, DMSS determines the execution or skipping of layers (per-token or per-sequence) to optimize for inference budget under minimal performance loss (Glavas et al., 2024).
  • Streaming Recommender Systems: Embedding dimension selection per user/item in response to context, cast as a non-stationary bandit, aims at regret minimization under memory and performance constraints (He et al., 2023).
  • Neural Architecture Search: Differentiable surrogates for width/depth are optimized jointly with task losses, yielding resource-aware, efficient structure discovery (Liu et al., 2024).
  • Speaker Identification: Adaptive codebook size selection for each speaker to maximize overall identification accuracy under computational constraints (Faundez-Zanuy, 2022).

2. Algorithmic Approaches for DMSS

DMSS methodologies can be broadly classified, with significant advances in recent years:

A. Measurement-Driven and Data-Driven Planners

Measurement-driven DMSS selects the subset of candidate models to sample on streaming data, trading off the cost of additional measurements against the expected improvement in model selection fidelity. A planner computes

$$G(m \mid O) = J(O) - \sum_s p_m(s \mid O)\, J(O \cup \{(m, s)\}) - C_s(m),$$

where $O$ is the set of observed statistics, $J(O)$ is the objective evaluated on the current observations, and $C_s(m)$ is the measurement cost (Sela et al., 30 Dec 2025). Sampling is conducted only when $G(m \mid O) > 0$, focusing exploration where it is likely to revise the selection.
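The rule can be read as "measure a model only if the expected reduction in the objective exceeds the measurement cost." The sketch below is a hedged illustration of that rule; the `objective`, `outcome_probs`, and `measurement_cost` callables are hypothetical stand-ins for $J(O)$, $p_m(s \mid O)$, and $C_s(m)$, not the RedunCut implementation.

```python
# Illustrative sketch of the measurement-gain rule G(m | O) > 0.
# `objective`, `outcome_probs`, and `measurement_cost` are hypothetical
# stand-ins for the paper's J(O), p_m(s | O), and C_s(m).

def expected_gain(model, observations, objective, outcome_probs, measurement_cost):
    """G(m | O) = J(O) - sum_s p_m(s|O) * J(O union {(m, s)}) - C_s(m)."""
    current = objective(observations)
    expected_after = sum(
        prob * objective(observations | {(model, outcome)})
        for outcome, prob in outcome_probs(model, observations).items()
    )
    return current - expected_after - measurement_cost(model)

def should_measure(model, observations, objective, outcome_probs, measurement_cost):
    """Sample the model on the stream only when the expected gain is positive."""
    return expected_gain(model, observations, objective,
                         outcome_probs, measurement_cost) > 0.0
```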

B. Gating Functions and Cascades

In resource-constrained or budgeted scenarios, a gating function $g(x): \mathcal{X} \to \Delta^{K-1}$ is trained to route inputs to models that optimally balance loss and cost. Training proceeds by alternating minimization of empirical risk (loss + cost) and divergence (e.g., KL) between the gating function and an auxiliary target routing, under global or instance-level constraints (Nan et al., 2017).
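A hedged PyTorch-style sketch of such a gating function and a loss-plus-cost routing objective follows. The two-layer architecture, the cost weighting, and the expected-loss formulation are illustrative assumptions rather than the exact training scheme of Nan et al. (2017).

```python
# Sketch of a gating network trained against a loss + cost objective.
# Architecture, cost vector, and weighting are illustrative assumptions.
import torch
import torch.nn as nn

class Gate(nn.Module):
    """Maps an input x to a distribution over K candidate models."""
    def __init__(self, in_dim, num_models):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_models))

    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)   # point in the simplex Delta^{K-1}

def routing_loss(gate_probs, per_model_losses, model_costs, cost_weight=0.1):
    """Expected task loss plus expected inference cost under the gate's routing."""
    expected_loss = (gate_probs * per_model_losses).sum(dim=-1).mean()
    expected_cost = (gate_probs * model_costs).sum(dim=-1).mean()
    return expected_loss + cost_weight * expected_cost
```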

C. Differentiable and Bandit-Based Selection

For end-to-end differentiable scaling, continuous “pruning ratio” parameters per layer or block determine width and depth. A differentiable TopK operator soft-selects components via parameterized thresholds, updating importance scores by Taylor approximation and normalization (Liu et al., 2024). In streaming discrete selection tasks, contextual bandit (non-stationary LinUCB) approaches with sublinear regret guarantees enable per-user/item dynamic sizing (e.g., embedding dimension), adapting to drifting data and balancing cumulative reward (He et al., 2023).
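The sketch below shows one way a differentiable TopK-style soft mask over per-channel importance scores could be written. The sigmoid relaxation, rank normalization, and temperature are assumptions made for this illustration, not the exact operator of Liu et al. (2024).

```python
# Illustrative soft-TopK mask over per-channel importance scores.
# The sigmoid relaxation, temperature, and rank normalization are
# assumptions for this sketch, not the exact DMS operator.
import torch

def soft_topk_mask(importance, keep_ratio, temperature=50.0):
    """Soft 0/1 mask that keeps roughly a `keep_ratio` fraction of channels.

    importance: 1-D tensor of per-channel importance scores.
    keep_ratio: scalar tensor in (0, 1); the learnable width parameter.
    """
    # Rank-normalize the scores to [0, 1] so the threshold is comparable across layers.
    ranks = importance.argsort().argsort().float() / (importance.numel() - 1)
    threshold = 1.0 - keep_ratio          # keep the top `keep_ratio` fraction
    return torch.sigmoid((ranks - threshold) * temperature)

# Usage: scale a layer's channel outputs by the mask. Gradients reach keep_ratio
# through the threshold; the importance scores themselves are updated separately
# (e.g., by the Taylor-style criterion mentioned above), since ranking is
# non-differentiable.
scores = torch.randn(16)
keep = torch.tensor(0.5, requires_grad=True)
mask = soft_topk_mask(scores, keep)
```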

D. Confidence-Driven and Oracle Routing

For real-time systems, simple confidence thresholds (e.g., prediction margin for classifiers or hidden state divergence in transformers) define early-exit or cascade decisions. Learned or oracle routing leverages lightweight models or dynamic programming to allocate input instances to models of optimal size or depth, maximizing performance under cost constraints (Glavas et al., 2024, Cruciani et al., 24 Sep 2025).
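Below is a minimal confidence-threshold cascade in the spirit of this approach; the two-stage structure, the margin-based confidence measure, and the threshold value are illustrative assumptions.

```python
# Minimal confidence-driven cascade: query the small model first and only
# escalate to the large model when the prediction margin is low.
# The margin threshold and two-model structure are illustrative assumptions.
import numpy as np

def cascade_predict(x, small_model, large_model, margin_threshold=0.2):
    """Return (prediction, model_used) for a single input."""
    probs = small_model(x)                  # class-probability vector
    top2 = np.sort(probs)[-2:]
    margin = top2[1] - top2[0]              # gap between top-1 and top-2 probabilities
    if margin >= margin_threshold:
        return int(np.argmax(probs)), "small"
    return int(np.argmax(large_model(x))), "large"
```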

3. Quantitative Performance and Empirical Results

Recent studies have quantified DMSS gains in a variety of settings. Representative results include:

| Domain / Task | DMSS Approach | Performance (Accuracy or Retention) | Cost / Energy Reduction | Reference |
|---|---|---|---|---|
| Live video analytics | RedunCut (sampling, kNN) | 26–56 mAP, 26–56 MOTA (fixed) | 14–62% compute vs. cascades/gates | (Sela et al., 30 Dec 2025) |
| Green classification | Cascade, routing | 92–94% of baseline accuracy | 21–24% energy | (Cruciani et al., 24 Sep 2025) |
| LLM inference | Layer skip / oracle routing | Full ROUGE-L at 23.3% layer usage | 69.6% of inputs use a 4-layer submodel | (Glavas et al., 2024) |
| Recommender system | Bandit (DESS) | 8% relative Rec@10 gain, +1.7 ppt ACC | 50–70% memory | (He et al., 2023) |
| ImageNet, COCO, Llama-7B | Differentiable scaling (DMS) | +1.3% Top-1, +2.0 mAP, −8% PPL | 20–400× search speedup | (Liu et al., 2024) |
| Speaker ID | Per-speaker codebooks | Matches best fixed rate | 29% fewer bits per codebook | (Faundez-Zanuy, 2022) |

These results establish DMSS as a key method for maintaining domain-specific accuracy while effecting substantial reductions in computation, latency, and memory.

4. Robustness, Generalization, and Limitations

State-of-the-art DMSS systems are evaluated for robustness to:

  • Limited or Drifting Historical Data: Segment-based DMSS planners (e.g., RedunCut) remain within 5–10% of optimal cost in the face of 8× data reduction or cross-domain drift. Use of auto-labeled history (via largest model predictions) does not degrade performance (Sela et al., 30 Dec 2025).
  • Distribution Shift: Non-stationary bandit-based selectors (DESS) maintain sublinear regret in the presence of model drift and user/item dynamics (He et al., 2023).
  • One-Shot and Differentiable Optimization Pitfalls: Gradient-based DMSS relying on proxy parameters or non-differentiable masks underperform fully differentiable schemes based on continuous parameterization and normalization, which achieve significantly better accuracy-cost tradeoffs for fixed search times (Liu et al., 2024).

Limitations include, for some approaches, the need for sufficient historical data to support accurate kNN or heuristic performance models, challenges in defining confidence proxies for generative tasks or hard-to-parameterize architectures, and the need to manage calibration or adaptation cost in cross-domain or rapidly evolving environments.

5. Application Domains and Architectural Variants

DMSS paradigms are now instantiated in multiple domains:

  • Live Video Analytics: DMSS exploits segment-wise redundancy and scene variation to select optimal detectors or segmenters in traffic, surveillance, or mobile video (Sela et al., 30 Dec 2025).
  • Green AI/Edge Deployment: Energy- and carbon-aware DMSS cascades are crucial for edge and distributed inference, especially in serverless or region-specific deployments (Cruciani et al., 24 Sep 2025).
  • LLMs: Token- and sequence-level layer skipping/early exiting have proven effective for inference cost control, with uniform skipping dominant in non-fine-tuned settings (Glavas et al., 2024).
  • Streaming and Large-Scale Recommendation: Bandit-based DMSS for user/item embedding sizes accommodates dynamic populations and temporal drift (He et al., 2023).
  • Neural Architecture Search (NAS): Differentiable model scaling extends DMSS to width/depth search over CNNs/Transformers, achieving state-of-the-art accuracy per FLOP or parameter (Liu et al., 2024).
  • Speaker Identification: Per-speaker DMSS adapts codebook granularity to each speaker's modeling complexity, reducing codebook size without an accuracy trade-off (Faundez-Zanuy, 2022).

6. Methodological Insights and Practical Guidelines

Key methodological recommendations include:

  • Utilize measurement-driven planners only when sampling cost is expected to yield net benefit, exploiting historical data for segment/model filtering to narrow the decision space (Sela et al., 30 Dec 2025).
  • In transformer-based NLG, prefer layer skipping—using uniform token-agnostic skip rates—over early exit unless fine-tuning is feasible (Glavas et al., 2024).
  • Gate or budgeted selection models should be parameterized with regularizers favoring feature reuse and cost minimization, leveraging group sparsity, KL divergence, or boosting as appropriate to the task (Nan et al., 2017).
  • For scalable deployment, keep selection controllers lightweight; amortize or subsample routing costs in real-time cascade/routing systems (Cruciani et al., 24 Sep 2025).
  • Calibrate selection algorithms using development data distinct from final test streams to prevent overfitting and maximize robustness (Faundez-Zanuy, 2022); a calibration sketch follows this list.
  • Beyond model size, the general DMSS framework may be extended to selection over learning rates, regularizers, activation blocks, or data modalities in modular structured systems (He et al., 2023).
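As an example of the calibration guideline above, the sketch below tunes a cascade's confidence threshold on held-out development data so that a target accuracy is met at minimum cost. The threshold grid and the `eval_cascade` helper (returning an accuracy/cost pair) are assumptions for illustration.

```python
# Calibrate a cascade confidence threshold on held-out development data:
# choose the cheapest threshold that still meets the accuracy target.
# The grid and the eval_cascade helper (returning (accuracy, cost)) are
# illustrative assumptions.
import numpy as np

def calibrate_threshold(dev_data, eval_cascade, acc_target=0.92):
    best_threshold, best_cost = None, float("inf")
    for threshold in np.linspace(0.0, 1.0, 21):
        accuracy, cost = eval_cascade(dev_data, threshold)
        if accuracy >= acc_target and cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold   # None if no threshold meets the target
```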

7. Comparative Analysis and Future Directions

DMSS now outperforms traditional static or top-down pruning across accuracy–cost tradeoff frontiers and efficiently explores vast architectural design spaces. Notably, differentiable model scaling eliminates sampling inefficiency and architectural gap limitations of previous NAS; streaming bandit DMSS establishes theoretical regret rates for adaptivity; and measurement-driven planners enable black-box DMSS without retraining (Sela et al., 30 Dec 2025, Liu et al., 2024, He et al., 2023).

Future directions include multi-objective DMSS optimized for accuracy, energy, and memory jointly; hierarchical or multi-granular DMSS over model ensembles and submodules; and integration of sophisticated distribution drift detectors and auto-tuning of controller hyperparameters.

Collectively, DMSS frameworks are now foundational for high-performance, sustainable, and adaptive machine learning systems in large-scale and resource-constrained deployments.
