Online Draft Model Selection
- Online draft model selection is defined as dynamically choosing the optimal candidate model during inference to boost throughput using full-information online learning methods like HedgeSpec.
- Key methodologies involve evaluating metrics such as Token Acceptance Probability and Expected Acceptance Length, enabling rapid adaptation and no-regret algorithmic performance.
- Applications span LLM serving, federated learning, and generative tasks, where online adaptation, knowledge distillation, and mixture-based strategies yield significant speed and quality gains.
Online draft model selection is the problem of dynamically choosing—potentially at each step and in an online environment—the best model or mixture from a pool of candidate “draft” models for a given sequence of predictions, data stream, or generation task. In the contemporary context of large-scale LLMs and generative models, this is especially pertinent for accelerating inference (e.g., via speculative decoding) by leveraging efficient draft models tailored to evolving workloads, data distributions, or domain requirements. Recent research brings rigorous algorithmic, theoretical, and practical advances, spanning full-information online learning, knowledge distillation, heterogeneity-aware selection, multi-draft methods, and sample-efficient evaluation.
1. Full-Information Online Drafter Selection: HedgeSpec Framework
The HedgeSpec framework (Liu et al., 22 Oct 2025) introduces a full-information online learning algorithm for adaptive draft model selection in speculative decoding. In contrast to classical bandit approaches, which update only the arm (drafter) chosen at each round, HedgeSpec leverages the deterministic nature of speculative decoding: after the target model verifies a drafted token sequence, all candidate drafters can be evaluated in counterfactual fashion by “prefilling” the verified target trajectory. This enables computation of a loss (e.g., one minus token acceptance probability or one minus expected acceptance length) for every draft model without extra queries to the target model.
The algorithm applies online learning with full-information feedback, using (e.g.) the standard Hedge update, and achieves no-regret guarantees relative to the best drafter in hindsight. Given the loss vector $\ell_t \in [0,1]^K$ over the $K$ drafters at round $t$, the cumulative regret over $T$ timesteps is

$$R_T = \sum_{t=1}^{T} \ell_t(i_t) - \min_{i \in [K]} \sum_{t=1}^{T} \ell_t(i),$$

where $i_t$ is the drafter selected at time $t$. This approach improves adaptation exponentially (in the size of the drafter pool) compared to bandit algorithms, and is broadly compatible with single-draft, multi-draft, and tree-based speculative decoding variants.
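The full-information update above can be sketched in a few lines. This is a minimal illustration of a standard Hedge weight update over a drafter pool, not HedgeSpec's exact implementation; the loss definition (one minus token acceptance probability) follows the text, while the learning rate `eta` is an illustrative choice:

```python
import math
import random

class HedgeSelector:
    """Minimal full-information Hedge over K candidate drafters."""

    def __init__(self, num_drafters, eta=0.5):
        self.eta = eta                      # learning rate (illustrative)
        self.weights = [1.0] * num_drafters

    def select(self):
        # Sample a drafter in proportion to its current weight.
        total = sum(self.weights)
        return random.choices(range(len(self.weights)),
                              weights=[w / total for w in self.weights])[0]

    def update(self, losses):
        # Full information: after the target verifies one trajectory, a loss
        # (e.g. 1 - token acceptance probability) is available for EVERY
        # drafter, so all weights are updated each round -- unlike a bandit,
        # which would only update the drafter that was actually used.
        self.weights = [w * math.exp(-self.eta * loss)
                        for w, loss in zip(self.weights, losses)]
```

After a few rounds of consistently lower loss, the best drafter's weight dominates and `select` concentrates on it, which is the rapid switch-over behavior described above.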
2. Evaluation and Performance Metrics in Online Drafter Orchestration
Effectively orchestrating multiple draft models in speculative decoding relies on accurate and efficient performance evaluation. HedgeSpec and related works (Liu et al., 22 Oct 2025, Khisti et al., 23 Oct 2024) define two principal metrics:
- Token Acceptance Probability (TAP): the probability that a token proposed by the draft model is accepted by the target,

$$\mathrm{TAP} = \mathbb{E}_{x \sim \mathcal{D}}\left[\alpha(x)\right],$$

which averages acceptance probabilities $\alpha(x)$ over prefix distributions $\mathcal{D}$ relevant to the inference domain.
- Expected Acceptance Length (EAL): the expected number of consecutive tokens accepted in speculative execution, formally

$$\mathrm{EAL} = \sum_{k=1}^{L} \prod_{j=1}^{k} \alpha_j,$$

with $L$ the draft chunk size and $\alpha_j$ the acceptance probability at position $j$.
The full-information nature of HedgeSpec allows estimation of both TAP and EAL for all candidate drafters after a single target verification, enabling much faster switch-over to the best drafter for the observed query distribution.
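The EAL metric above has a direct incremental computation: maintain a running product of per-position acceptance probabilities and accumulate it. A minimal sketch, assuming the per-position probabilities have already been estimated (e.g., by prefilling the verified target trajectory):

```python
def expected_acceptance_length(alpha):
    """Expected number of consecutively accepted tokens for one draft chunk.

    alpha: per-position acceptance probabilities alpha_1..alpha_L.
    Computes EAL = sum_{k=1..L} prod_{j<=k} alpha_j.
    """
    eal, running = 0.0, 1.0
    for a in alpha:
        running *= a   # probability that positions 1..k are ALL accepted
        eal += running
    return eal
```

For example, a drafter with uniform acceptance probability 0.5 over a chunk of two tokens yields an EAL of 0.75, while a perfectly aligned drafter yields an EAL equal to the chunk size.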
Empirical results show significant throughput improvements (higher token rates) and rapid regret reduction compared to previous methods such as EAGLE-3, BanditSpec, and EXP3Spec, especially for domain-expert drafters and long reasoning chains (Liu et al., 22 Oct 2025).
3. Online Knowledge Distillation and Drafter Adaptation
Online speculative decoding (Liu et al., 2023) and domain-adaptive drafter training (Hong et al., 10 Mar 2025) have demonstrated that online model selection is tightly linked with continual adaptation and knowledge distillation. In the online regime, user queries are sampled from potentially nonstationary or highly specific distributions, so static, offline-distilled draft models quickly lose alignment with the target model, reducing acceptance rates and negating speedup.
Online adaptation leverages knowledge distillation, where errors (rejected tokens, or locations where the draft diverges from the target) are logged during inference. The draft model's parameters are updated using the target's logits via a cross-entropy or KL-divergence loss

$$\mathcal{L} = D_{\mathrm{KL}}\left(p \,\|\, q_\theta\right),$$

where $p$ is the target distribution and $q_\theta$ the draft's prediction. This adaptation can be applied continuously in background training, exploiting live traffic. Experimental evidence establishes that online distillation yields substantial latency reduction (Liu et al., 2023), and online domain adaptation narrows the token acceptance gap (typically improving acceptance rate by $0.1$ to $0.65$ absolute).
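The distillation objective above is just the forward KL divergence between the target's token distribution and the draft's prediction at a logged position. A minimal sketch (restricted, for illustration, to a single vocabulary slice and plain Python rather than a deep-learning framework):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(target_probs, draft_logits):
    """Forward KL divergence D_KL(p || q) between the target distribution p
    and the draft's predicted distribution q = softmax(draft_logits), as used
    for online distillation on rejected / divergent positions."""
    q = softmax(draft_logits)
    return sum(p * math.log(p / qi)
               for p, qi in zip(target_probs, q) if p > 0)
```

In a real training loop this scalar would be backpropagated through the draft model's logits; the loss is zero exactly when the draft already matches the target, which is why rejected positions are the informative ones to replay.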
However, when offline, white-box distillation with target logits and historical queries is available, it generally achieves higher acceptance rates and more stable alignment than both online and black-box methods (Hong et al., 10 Mar 2025). Synthetic ("Magpie") data can close much of the remaining gap to in-domain user data.
4. Multi-Draft and Cross-Vocabulary Drafter Coordination
Recent innovations address scenarios where multiple draft models are available per step, possibly each with different architectures, tokenization, or domain specializations. The multi-draft speculative sampling framework (Khisti et al., 23 Oct 2024) frames the online model selection problem as constructing a token-level selection rule (TLSR) that chooses an output token from independently sampled candidate sets, guaranteeing the output distribution matches that of the target.
This optimal token selector is the solution to an optimal transport problem, which reduces to a linear program. The solution decomposes into two steps: an importance-sampling-like intermediate selection followed by a single-draft speculative sampling validation. For the case of identical draft models, the paper establishes a necessary and sufficient condition, in terms of the target distribution $p$ and the draft distribution $q$, for the acceptance probability to equal one.
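The single-draft validation step that the decomposition reduces to is the standard speculative sampling accept/reject rule. A minimal sketch over explicit probability vectors (a simplification of a real implementation, which operates on model logits over the full vocabulary):

```python
import random

def speculative_accept(p, q, x, rng=random):
    """Single-draft speculative sampling validation.

    Accepts the draft token x (sampled from draft distribution q) with
    probability min(1, p[x] / q[x]); on rejection, resamples from the
    residual distribution max(p - q, 0), renormalized. This rule guarantees
    the emitted token is distributed exactly according to the target p.
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejection: sample from the renormalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    return rng.choices(range(len(p)),
                       weights=[r / total for r in residual])[0]
```

When $p = q$ the draft token is always accepted, and when the draft places mass where the target has none, the residual resampling corrects the output distribution; the multi-draft schemes above add an intermediate selection among candidate sets before this validation.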
OmniDraft (Ramakrishnan et al., 3 Jul 2025) addresses the challenge of cross-vocabulary mismatches in practical deployments by introducing an online n-gram cache translating between draft and target vocabularies, hybrid distillation, and adaptive drafting. This enables a "one drafter for all" approach, with a single lightweight draft model pairing on-the-fly with arbitrary target models. Online adaptive drafting further tunes proposal lengths to dynamically maximize acceptance, boosting speedup across real-world LLM targets.
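The core idea of an online n-gram translation cache can be sketched as a dictionary from draft-vocabulary n-grams to target-vocabulary tokens, populated from verified alignments during serving. This is an illustrative data-structure sketch only; OmniDraft's actual cache construction and matching rules are more involved:

```python
class NGramTranslationCache:
    """Illustrative online cache mapping draft-vocabulary n-grams to
    target-vocabulary token ids (a simplification of OmniDraft's scheme)."""

    def __init__(self, max_n=3):
        self.max_n = max_n
        self.cache = {}   # tuple of draft token ids -> target token id

    def record(self, draft_ngram, target_token):
        # Learned online from verified (draft, target) alignments.
        self.cache[tuple(draft_ngram)] = target_token

    def translate(self, draft_tokens):
        # Greedy longest-match translation of a draft token sequence.
        out, i = [], 0
        while i < len(draft_tokens):
            for n in range(min(self.max_n, len(draft_tokens) - i), 0, -1):
                key = tuple(draft_tokens[i:i + n])
                if key in self.cache:
                    out.append(self.cache[key])
                    i += n
                    break
            else:
                out.append(None)   # untranslated: defer to target verification
                i += 1
        return out
```

Cache misses (`None` entries here) correspond to positions where the drafter cannot propose a translated token and the system falls back on the target model, so the cache's hit rate directly governs the achievable speedup.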
5. Bandit and Mixture-Based Model Selection for Generative Modeling
Optimal mixture selection for generative models in online environments involves MAB formulations with mixture arms (Rezaei et al., 23 Dec 2024). Unlike standard best-arm identification, the objective is to optimize a quadratic kernel-based metric (e.g., Kernel Inception Distance, Rényi Kernel Entropy, or FID) over mixtures.
Mixture-UCB algorithms (e.g., Mixture-UCB-CAB and Mixture-UCB-OGD) select at each step which base model to sample from, estimate mixture-dependent kernel scores, and update mixture weights using UCB-inspired confidence terms that shrink on the order of $\sqrt{\log t / N_i(t)}$, where $N_i(t)$ is the count of pulls for arm $i$. Sublinear regret bounds guarantee convergence of the realized mixture's loss to the offline optimum. Empirical results in text and image synthesis confirm that mixtures can significantly outperform the best individual model (in both diversity and combined quality/diversity metrics).
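The arm-selection step can be illustrated with a standard UCB index. This is a deliberate simplification: Mixture-UCB scores mixtures under a quadratic kernel metric and updates mixture weights (via combinatorial bandit or online gradient descent machinery), whereas the sketch below uses plain per-arm means with an optimism bonus:

```python
import math

def ucb_pick(means, counts, t, c=2.0):
    """UCB-style choice of which base generator to sample next.

    means:  empirical score estimate per arm (higher is better here)
    counts: N_i(t), number of times each arm has been pulled
    t:      current round; c is an exploration constant (illustrative)
    """
    best_val, best_idx = None, 0
    for i, (mu, n) in enumerate(zip(means, counts)):
        if n == 0:
            return i   # pull every arm once before trusting estimates
        val = mu + math.sqrt(c * math.log(t) / n)   # optimism bonus
        if best_val is None or val > best_val:
            best_val, best_idx = val, i
    return best_idx
```

The confidence term matches the $\sqrt{\log t / N_i(t)}$ scaling above: rarely pulled arms get a large bonus and keep being explored, which is what drives the sublinear-regret guarantee.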
6. Application Domains and Generalization
Online draft model selection methods have been validated in diverse large-scale inference and generation settings:
- LLM serving and RL policy optimization: Concurrency-aware speculative decoding (Zhang et al., 26 Sep 2025) fine-tunes speculative decoding strategies in response to real-time batch sizes and adapts drafters through online learning, achieving substantial speedup and robust alignment during target training drift.
- Domain-specialized expert drafters: HedgeSpec (Liu et al., 22 Oct 2025) and related frameworks orchestrate ensembles of domain-adapted or expert drafters (e.g., Math, Code, SQL) and dynamically hand off queries as required, achieving both throughput and quality gains in out-of-distribution and multi-domain workloads.
- General model selection: CODA (Kay et al., 31 Jul 2025), CAMS (Liu et al., 2022), and AOE (Dai et al., 2021) extend online model selection to scenarios where ground-truth annotations are expensive, using active learning, consensus priors, and Bayesian surrogate modeling to minimize label/query complexity and efficiently identify the best candidate.
- Edge and federated learning: Budgeted federated online model selection frameworks (Ghari et al., 19 Jan 2024) optimize among large model dictionaries with strict device memory constraints, using sublinear-regret algorithms and client–server partitioning.
7. Challenges, Limitations, and Future Directions
Despite recent advances, challenges remain regarding computational and resource efficiency, robustness to extreme domain drift, extreme class imbalance in active selection (as observed in CODA (Kay et al., 31 Jul 2025)), and adaptation to highly heterogeneous query streams. Open challenges include optimal replay buffer management and still more label-efficient active strategies, improved mixture/truncation heuristics for high-dimensional draft model sets, and formalization of convergence guarantees in contextual or nonstationary environments.
Additional opportunities arise in leveraging cross-domain or cross-lingual draft models, “one-for-all” drafters adaptable to evolving multitask targets (Ramakrishnan et al., 3 Jul 2025), and mixture learning objectives balancing user cost, latency, and model diversity/quality (Rezaei et al., 23 Dec 2024, Li et al., 13 Feb 2024). The field continues to explore rapid orchestration, adaptive distillation, and robust online learning as longitudinal demands for inference speed, personalization, and domain transfer intensify.
In sum, online draft model selection in speculative decoding and beyond has evolved into a field characterized by full-information, no-regret algorithms, continual online adaptation, and principled mixture strategies. Modern approaches enable rapid selection and adaptation among expert drafters or mixtures, leveraging theoretical guarantees and cross-domain flexibility in both high-throughput production and low-label-budget scientific and engineering applications.