Omni-Token Scaling Advances

Updated 4 July 2026

Omni-Token Scaling is a framework that unifies diverse modality tokens (text, image, video, audio) into a common sequence to streamline multimodal processing.
It tackles challenges like token-rate mismatches and computational bottlenecks through specialized pruning, routing, and capacity allocation methods.
Recent advances leverage token-aware optimization and sparse routing to enhance model efficiency and performance while managing modality-specific demands.

Omni-Token Scaling denotes a family of architectural, algorithmic, and systems strategies for scaling multimodal models by treating heterogeneous signals as tokenized objects and then managing the resulting bottlenecks at the token interface. In current work, the term covers at least three coupled problems: unifying text, image, video, audio, latent reasoning, and conditioning signals into a common sequence or conditioning memory; reducing or routing multimodal token loads so that dense non-text streams do not dominate inference; and adapting optimization, training curricula, and infrastructure to token horizons, token frequencies, and modality-dependent compute demand (Yang et al., 9 Feb 2026, Cheng et al., 25 Jan 2026, Xin et al., 19 May 2026, Shah, 1 Jul 2026).

1. Conceptual scope and problem formulation

The literature uses “token scaling” in two distinct but connected senses. In the language-model optimization literature, the scaling variable is the token horizon $D$, i.e., the total number of training tokens processed. In that regime, the central result is that the optimal learning rate is not invariant to training duration: longer training requires a smaller optimal learning rate, and the paper “Scaling Optimal LR Across Token Horizons” formalizes this as $LR^*(D)=B D^{-\beta}$, with fitted constants $B$ and $\beta$, for a fixed architecture and recipe (Bjorck et al., 2024). In the omnimodal literature, by contrast, the scaling variable is usually the multimodal token load produced by dense audio and video streams, especially under temporally aligned audio-video chunking (Jung et al., 12 May 2026, Xin et al., 19 May 2026).
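As a rough illustration of the power law above, the sketch below rescales a learning rate tuned at one token horizon to another horizon. Because $B$ cancels in the ratio, only $\beta$ is needed. The function name `transfer_lr` and every numeric value (including the exponent) are hypothetical placeholders chosen for illustration, not figures from the paper.

```python
def transfer_lr(lr_ref: float, d_ref: float, d_new: float, beta: float) -> float:
    """Rescale a tuned learning rate from horizon d_ref to horizon d_new.

    Uses LR*(D) = B * D**(-beta); the constant B cancels in the ratio
    LR*(d_new) / LR*(d_ref) = (d_new / d_ref) ** (-beta).
    """
    if min(lr_ref, d_ref, d_new) <= 0:
        raise ValueError("learning rate and token horizons must be positive")
    return lr_ref * (d_new / d_ref) ** (-beta)


# Hypothetical values for illustration only (not taken from the paper).
lr_short = 1e-3   # learning rate assumed to be tuned at the short horizon
beta = 0.3        # assumed exponent
lr_long = transfer_lr(lr_short, d_ref=25e9, d_new=800e9, beta=beta)
print(f"short-horizon LR: {lr_short:.2e}, long-horizon LR: {lr_long:.2e}")
```

For any positive exponent the rescaled rate is smaller at the longer horizon, which is the qualitative behavior the text describes.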

This distinction matters because omni-token scaling is not reducible to generic long-context compression. Omnimodal systems exhibit strong modality asymmetry: video often contributes hundreds of visual tokens per second, while audio may contribute only a few tokens per second in some backbones, and text remains comparatively compact (Jung et al., 12 May 2026). As a result, multimodal scaling failures are frequently caused not by parameter count alone, but by the interaction of token count, token heterogeneity, and attention cost. Several papers therefore redefine scaling around token allocation rather than around dense backbone growth: the relevant questions become which tokens should exist, where they should be inserted, how long they should survive across depth, and how much compute each token should receive (Park et al., 14 May 2026, Xin et al., 19 May 2026).

A further conceptual split runs between unification and compression. Some works seek a single autoregressive or shared-attention interface for all modalities, as in AR-Omni’s “single token stream, a single next-token objective, and a single decoder” (Cheng et al., 25 Jan 2026). Others preserve modality-specific encoders or generators but standardize the interface at the conditioning sequence or shared hidden-space level, as in Omni-Video 2’s concatenated conditioning bank and Uni-MoE-2.0-Omni’s language-centric projection of modality-specific tokens into one MoE backbone (Yang et al., 9 Feb 2026, Li et al., 16 Nov 2025). This suggests that omni-token scaling is best understood not as one fixed architecture, but as an umbrella framework for scaling heterogeneous token interfaces.

2. Interface unification and token-form architectural patterns

Recent work exhibits several recurrent architectural patterns for unifying heterogeneous tokens.

Paradigm	Representative work	Token interface
Single autoregressive stream	AR-Omni (Cheng et al., 25 Jan 2026)	Joint vocabulary over text, speech, and image tokens
Unified conditioning memory	Omni-Video 2 (Yang et al., 9 Feb 2026)	Concatenated condition tokens consumed by cross-attention
Language-centric shared backbone	Ming-Flash-Omni (AI et al., 28 Oct 2025), Uni-MoE-2.0-Omni (Li et al., 16 Nov 2025)	Modality-specific encoders projected into one LM hidden space
Interleaved latent/public streams	Mini-Omni-Reasoner (Xie et al., 18 Aug 2025)	One autoregressive stream with silent reasoning and spoken tokens

AR-Omni is the clearest “pure” token-stream formulation. It defines a joint vocabulary

$\mathcal{V} = \mathcal{V}_{text} \cup \mathcal{V}_{speech} \cup \mathcal{V}_{image},$

serializes multimodal inputs and outputs into one linear sequence, and trains a single causal Transformer decoder with one next-token objective (Cheng et al., 25 Jan 2026). The unification is at the sequence-model level rather than at the raw tokenizer level: text uses SentencePiece BPE, speech uses WavTokenizer, and images use a scene-aware VQ tokenizer, but all discrete IDs are merged into one autoregressive stream. This formulation exposes the operational barriers of omni-token scaling directly: token-rate mismatch, sequence budget pressure, and modality-specific decoding entropy. AR-Omni’s solution includes a low-rate single-codebook speech tokenizer at 40 tokens per second for both speech input and output, weighted next-token prediction to counter modality imbalance, and a finite-state decoding mechanism that uses deterministic decoding for tasks such as ASR and TTS while reserving sampling for open-ended generation (Cheng et al., 25 Jan 2026).
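One common way to realize a joint vocabulary of this kind is to give each modality's codebook a disjoint ID range inside one shared index space. The sketch below shows that offset scheme; it is an assumption about how such a merge could be done, not AR-Omni's actual implementation, and the class name `JointVocab` and the codebook sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointVocab:
    """Maps per-modality token IDs into one shared ID space via offsets."""
    sizes: dict  # modality name -> codebook size, in a fixed insertion order

    def offsets(self) -> dict:
        out, start = {}, 0
        for name, size in self.sizes.items():
            out[name] = start
            start += size
        return out

    def total_size(self) -> int:
        return sum(self.sizes.values())

    def encode(self, modality: str, local_ids: list) -> list:
        base, size = self.offsets()[modality], self.sizes[modality]
        if any(not 0 <= i < size for i in local_ids):
            raise ValueError(f"local id out of range for {modality}")
        return [base + i for i in local_ids]

    def decode(self, joint_id: int) -> tuple:
        for name, base in self.offsets().items():
            if base <= joint_id < base + self.sizes[name]:
                return name, joint_id - base
        raise ValueError("joint id outside the joint vocabulary")


# Hypothetical codebook sizes, for illustration only.
vocab = JointVocab({"text": 32000, "speech": 4096, "image": 8192})
stream = vocab.encode("text", [17, 42]) + vocab.encode("speech", [0, 5])
print(stream, [vocab.decode(t) for t in stream])
```

Because the ranges are disjoint, a single embedding table and a single output softmax can cover all three modalities, and the modality of any generated ID is recoverable from its range.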

Omni-Video 2 adopts a different unification strategy. Rather than collapsing all modalities into one causal stream, it preserves a pretrained text-to-video diffusion model and augments its cross-attention memory with a standardized token bank. The MLLM-based understanding branch predicts an explicit target caption

$\hat{p}^{tgt} = \mathrm{MLLM}(x^{src}, p^{edit}),$

and produces final-layer hidden states that become multimodal interaction tokens. These are combined with the target-caption embedding, the original instruction embedding, and source-reference tokens: $C=[C^{mllm};\,C^{tgt};\,C^{edit};\,C^{ref}].$ The resulting sequence is passed through the diffusion model’s standard cross-attention layers, with diffusion latents as queries and the unified condition bank as keys and values (Yang et al., 9 Feb 2026). The architectural thesis is explicit: capability expansion should occur by preserving the pretrained caption-conditioned interface and adding heterogeneous control tokens, not by replacing the conditioning pathway.
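The concatenated bank $C$ and its use as keys and values can be sketched in a few lines of NumPy. This is a minimal single-head illustration that assumes all four token groups have already been projected to one shared width; the sequence lengths, the width `d`, and the random projection matrices are hypothetical and stand in for the pretrained model's weights.

```python
import numpy as np


def cross_attention(latents, cond, w_q, w_k, w_v):
    """Single-head cross-attention: latents are queries, cond gives keys/values."""
    q, k, v = latents @ w_q, cond @ w_k, cond @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 16  # hypothetical shared width

# Hypothetical token groups: MLLM interaction tokens, target caption,
# edit instruction, and source-reference tokens.
c_mllm = rng.normal(size=(8, d))
c_tgt = rng.normal(size=(6, d))
c_edit = rng.normal(size=(4, d))
c_ref = rng.normal(size=(10, d))

# C = [C_mllm; C_tgt; C_edit; C_ref], concatenated along the sequence axis.
cond = np.concatenate([c_mllm, c_tgt, c_edit, c_ref], axis=0)

latents = rng.normal(size=(32, d))  # stand-in for diffusion latents
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
out, weights = cross_attention(latents, cond, w_q, w_k, w_v)
print(cond.shape, out.shape)
```

The attention score matrix has one column per condition token, so its cost grows linearly with the length of the bank for a fixed number of latent queries.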

Language-centric omnimodal backbones pursue a third form of unification. Ming-Flash-Omni feeds projected embeddings from Qwen2.5 visual encoders and Whisper audio encoders, concatenated with tokenized text, into Ling-Flash-2.0, a sparse MoE LLM with distinct routers per modality (AI et al., 28 Oct 2025). Uni-MoE-2.0-Omni similarly relies on modality-specific front ends—SigLIP for vision and Whisper-Large-v3 plus a decoder-as-QFormer for audio—but projects those representations into a shared MoE transformer equipped with Omni-Modality 3D RoPE for temporal and spatial alignment (Li et al., 16 Nov 2025). In both cases, token unification occurs in the shared self-attention space rather than in a universal raw tokenizer.

Mini-Omni-Reasoner extends the notion of token unification into internal cognition. Its “Thinking-in-Speaking” formulation interleaves silent reasoning tokens and spoken response tokens at token granularity, rather than finishing reasoning before speech begins. The interleaved stream alternates $p$ response tokens and $q$ reasoning tokens, with the default schedule set to 2 response tokens versus 8 reasoning tokens (Xie et al., 18 Aug 2025). This introduces a distinction between public and latent token streams within one autoregressive process, a pattern likely to remain important as omnimodal systems add internal reasoning, tool plans, and memory-edit tokens.
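The $p$:$q$ alternation can be illustrated with a small helper that builds the interleaved stream from two token lists. This is a sketch under stated assumptions: the function name `interleave`, the `"speak"`/`"think"` tags, the choice to emit the response chunk first in each cycle, and the handling of leftover tokens are all hypothetical; only the default 2:8 ratio comes from the text.

```python
from itertools import islice


def interleave(response, reasoning, p=2, q=8):
    """Alternate p response tokens with q reasoning tokens until both run out."""
    if p <= 0 or q <= 0:
        raise ValueError("p and q must be positive")
    resp, reas = iter(response), iter(reasoning)
    stream = []
    while True:
        spoken = list(islice(resp, p))   # next chunk of public tokens
        silent = list(islice(reas, q))   # next chunk of latent tokens
        if not spoken and not silent:
            break
        stream.extend(("speak", tok) for tok in spoken)
        stream.extend(("think", tok) for tok in silent)
    return stream


demo = interleave(["a", "b", "c"], list(range(10)))
for tag, tok in demo:
    print(tag, tok)
```

Only the `"speak"` entries would be passed to the speech decoder; the `"think"` entries stay inside the autoregressive context.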

3. Token explosion, pruning, and budget allocation

The most active branch of omni-token scaling research treats multimodal scaling as a token-budget problem. The premise is shared: dense audio-video inputs generate long non-text sequences whose cost is dominated by prefill attention, yet large fractions of those tokens are redundant or become redundant after cross-modal fusion.

ContextGuard formulates this as context-preserving redundancy removal in Omni-LLMs. It preserves all audio tokens, predicts coarse visual semantics from audio via a lightweight audio-to-video predictor, prunes visual tokens whose semantics are recoverable from audio, preserves additional spatially distributed detail tokens, and then merges temporally redundant chunks (Jung et al., 12 May 2026). Its conceptual objective is

$\pi^{*} = \operatorname*{arg\,max}_{\pi} \left\{ \mathcal{I}(V; Z_\pi \mid A) - \lambda \mathbb{E}[C(Z_\pi)] \right\},$

with the practical rule “retain what audio cannot say.” On Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens; on the same model its average normalized performance is reported as 100.0 at that compression rate (Jung et al., 12 May 2026). The paper’s significance lies less in any single pruning heuristic than in its shift from saliency preservation to cross-modal complement preservation.
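The rule "retain what audio cannot say" can be illustrated with a toy selector that scores each visual token by how well it matches an audio-derived prediction and keeps the least recoverable ones. This is a loose sketch, not ContextGuard's algorithm: cosine similarity as the recoverability score, the function name `keep_audio_complementary`, and the tiny hand-built arrays are all assumptions made for illustration.

```python
import numpy as np


def keep_audio_complementary(visual, predicted_from_audio, keep_ratio):
    """Return sorted indices of visual tokens least explained by the audio."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    p = predicted_from_audio / np.linalg.norm(
        predicted_from_audio, axis=1, keepdims=True
    )
    # Recoverability: best cosine match between a visual token and any
    # audio-predicted semantic vector.
    recoverability = (v @ p.T).max(axis=1)
    n_keep = max(1, int(round(keep_ratio * len(visual))))
    keep = np.argsort(recoverability)[:n_keep]  # lowest recoverability first
    return np.sort(keep)


# Toy data: tokens 0 and 1 duplicate what the audio predicts,
# tokens 2 and 3 are orthogonal to it.
visual = np.eye(4)
predicted = np.eye(4)[:2]
kept = keep_audio_complementary(visual, predicted, keep_ratio=0.5)
print(kept)
```

In this toy case the selector drops the two tokens that the audio-derived prediction already covers and keeps the two it cannot explain.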

OmniSIFT argues that compression should be modality-asymmetric rather than symmetric or decoupled. Its first stage, Spatio-Temporal Video Pruning, removes video redundancy using spatial saliency in the first frame and temporal saliency in the second frame of each chunk; its second stage, Vision-Guided Audio Selection, uses the surviving visual anchors to rank audio tokens by lightweight cross-attention (Ding et al., 4 Feb 2026). The selector adds only 4.85M parameters on Qwen2.5-Omni-7B and, at 25% retained context, reduces total FLOPs from 555.74T to 250.83T while often matching or exceeding full-token performance on QA-style benchmarks. At that 25% context budget, it reports 49.9 on WorldSense versus 49.7 for full tokens, and 68.2 on VideoMME average versus 67.6 for full tokens, although captioning on video-SALMONN-2 remains worse than the full-token model (Ding et al., 4 Feb 2026). The broader implication is that token budgets should not be split evenly across modalities; video can often be pruned more aggressively than audio if the audio selector is visually grounded.

OmniDrop moves pruning inside the decoder. It first performs lightweight intra-modality cleanup—retaining 70% of audio tokens and 40% of video tokens before the LLM—then progressively prunes audiovisual tokens across decoder layers using a sigmoid schedule, text-query-guided attention scores, and a Temporal Diversity Score to preserve global temporal context (Park et al., 14 May 2026). Because pruning occurs after some multimodal fusion has already happened, the method can preserve early fusion capacity and compress more aggressively in deeper layers. On Qwen2.5-Omni-7B, OmniDrop reports up to 39.9% prefill reduction, up to 14.7% GPU memory reduction, and a gain of 3.58 points over the best baseline on AVUT at 20% retention (Park et al., 14 May 2026).
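A depth-dependent retention schedule of this kind can be sketched as follows. The 70% audio and 40% video pre-LLM ratios come from the text; the sigmoid's steepness and midpoint, the 20% final retention, the layer count, and the input token counts are hypothetical, and the paper's scoring terms (text-guided attention and the Temporal Diversity Score) are not modeled here.

```python
import math


def sigmoid_retention(layer, num_layers, r_start=1.0, r_end=0.2,
                      steepness=10.0, midpoint=0.5):
    """Fraction of audiovisual tokens kept at a given decoder layer."""
    x = layer / (num_layers - 1)  # normalized depth in [0, 1]
    gate = 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))
    return r_start + (r_end - r_start) * gate


def token_counts(n_tokens, num_layers, **schedule_kwargs):
    """Per-layer token counts under the schedule; counts never increase."""
    counts, current = [], n_tokens
    for layer in range(num_layers):
        ratio = sigmoid_retention(layer, num_layers, **schedule_kwargs)
        current = min(current, math.ceil(n_tokens * ratio))
        counts.append(current)
    return counts


# Pre-LLM cleanup with the ratios quoted in the text (token counts are made up).
n_audio, n_video = 500, 2000
entering = round(0.7 * n_audio) + round(0.4 * n_video)

counts = token_counts(entering, num_layers=28)
print(entering, counts[0], counts[len(counts) // 2], counts[-1])
```

The sigmoid keeps nearly all tokens in early layers, where fusion is still happening, and concentrates the pruning in the middle and late layers.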

OmniSelect retains the front-end, training-free character of earlier methods but rejects fixed modality-specific guidance. It uses a lightweight AudioCLIP model to estimate cross-modal relevance between the query and each modality, classifies the example into Audio-Centric, Video-Centric, or Uniform pruning, and then allocates pruning ratios across temporal groups before selecting tokens within each group (Yang et al., 18 May 2026). On WorldSense, at 30% token retention, it preserves 97.4% of full-token accuracy for Qwen2.5-Omni-3B and 94.3% for Qwen2.5-Omni-7B (Yang et al., 18 May 2026). The appendix’s “any-correct” analysis further indicates that the pruning regimes themselves are often adequate, while the main failure source is imperfect strategy selection by the relevance estimator.
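The strategy-selection step can be sketched as a comparison of two relevance scores followed by a budget split. Everything below is a hypothetical stand-in: the margin threshold, the `tilt` parameter, and both function names are invented for illustration, the relevance scores are assumed to come from some external estimator, and the paper's allocation across temporal groups is omitted.

```python
def choose_strategy(audio_rel, video_rel, margin=0.1):
    """Classify an example from query-to-modality relevance scores."""
    if audio_rel - video_rel > margin:
        return "audio_centric"
    if video_rel - audio_rel > margin:
        return "video_centric"
    return "uniform"


def modality_keep_ratios(strategy, overall_keep, n_audio, n_video, tilt=0.5):
    """Split one overall token budget into per-modality retention ratios."""
    budget = overall_keep * (n_audio + n_video)
    share_audio = n_audio / (n_audio + n_video)  # uniform: proportional split
    if strategy == "audio_centric":
        share_audio = share_audio + tilt * (1.0 - share_audio)
    elif strategy == "video_centric":
        share_audio = share_audio * (1.0 - tilt)
    keep_audio = min(n_audio, budget * share_audio)
    keep_video = min(n_video, budget - keep_audio)
    return keep_audio / n_audio, keep_video / n_video


strategy = choose_strategy(audio_rel=0.62, video_rel=0.31)
ratios = modality_keep_ratios(strategy, overall_keep=0.3, n_audio=200, n_video=800)
print(strategy, ratios)
```

Under this split the total number of retained tokens stays at the overall budget, and only its division between audio and video changes with the selected strategy.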

SEATS pushes stage awareness further by combining pre-LLM redundancy removal, in-LLM progressive pruning, and complete late-layer removal of non-textual tokens once cross-modal fusion is sufficiently complete (Xin et al., 19 May 2026). It defines a layerwise token retention ratio $r_l$, keeps a larger pre-LLM token budget (for visual tokens, $r_{s,v}=\lambda R_v$), and then decays the middle-block budget exponentially before setting late-block non-text retention to zero. It also redistributes the current-layer token budget across temporal windows and modalities according to query relevance. On Qwen2.5-Omni-7B, retaining only 10% of visual and audio tokens yields a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance (Xin et al., 19 May 2026). This is among the clearest empirical demonstrations that omni-token scaling is a depth-dependent routing problem, not merely an input pruning problem.

4. Sparse routing, progressive curricula, and systems-scale execution

Another branch of omni-token scaling addresses the problem by scaling capacity per token rather than only reducing token count. Ming-Flash-Omni replaces the Ming-Omni core with Ling-Flash-2.0, a sparse MoE LLM with 100 billion total parameters of which only 6.1 billion are active per token (AI et al., 28 Oct 2025). The model uses distinct routers per modality, sequence packing for dynamic multimodal batches, and flexible encoder sharding across data, tensor, and pipeline parallelism. The systems claim is concrete: these optimizations yield more than twice the training throughput of the baseline Megatron-LM implementation (AI et al., 28 Oct 2025). In this formulation, omni-token scaling means that a growing number of heterogeneous tokens can traverse a shared backbone without every token paying dense 100B-scale compute.

Uni-MoE-2.0-Omni develops this language-centric MoE strategy in more explicit token terms. Its Dynamic-Capacity MoE uses shared experts, routed experts, and null experts; routed experts are selected by a Top-$p$ policy rather than fixed Top-$k$, and each token activates 2 shared experts plus 0–3 routed experts, producing an active-parameter range of 1.5B to 18B (Li et al., 16 Nov 2025). The paper additionally introduces Omni-Modality 3D RoPE so that text, images, video, and audio can share a positional scheme with temporal, height, and width components. Audio is distilled into 200 query tokens per 30-second clip, while video is sampled and converted into visual token sequences before projection into the shared MoE space (Li et al., 16 Nov 2025). The progressive training recipe (cross-modal pretraining, expert warmup, omni-modal fine-tuning, simulated annealing, then RL and generative training) indicates that scaling heterogeneous token ecosystems requires staged specialization and rebalancing rather than naive joint optimization.
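A cumulative-probability (Top-p style) router with a null expert can be sketched as follows. This is an illustrative guess at the mechanism, not the paper's routing code: the threshold value, the convention that the last logit belongs to a single null expert, and the function name `route_token` are assumptions; only the cap of 3 routed experts comes from the text.

```python
import numpy as np


def route_token(logits, top_p=0.7, max_routed=3):
    """Select routed experts for one token by cumulative probability.

    The last logit is treated as a null expert: selecting it uses up
    probability mass but adds no compute.
    """
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    order = np.argsort(-probs, kind="stable")  # most probable first
    chosen, mass = [], 0.0
    for idx in order:
        chosen.append(int(idx))
        mass += probs[idx]
        if mass >= top_p:
            break
    null_id = len(logits) - 1
    return [i for i in chosen if i != null_id][:max_routed]


confident = route_token([10.0, 0.0, 0.0, 0.0, 0.0])   # one clear expert
skipped = route_token([0.0, 0.0, 0.0, 0.0, 10.0])     # null expert dominates
spread = route_token([0.0, 0.0, 0.0, 0.0, 0.0])       # flat distribution
print(confident, skipped, spread)
```

Unlike fixed Top-k, the number of active routed experts varies per token: confident tokens use one expert, uncertain tokens use up to the cap, and tokens routed to the null expert use none beyond the shared experts.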

Diffusion-based generative systems reveal a different systems bottleneck. Omni-Video 2 scales to a “14B video diffusion model,” clarified as two 14B-parameter diffusion transformers for low-noise and high-noise regimes, while leaving the MLLM frozen and reusing the pretrained caption-conditioned interface (Yang et al., 9 Feb 2026). Because the unified conditioning bank can become long—MLLM interaction tokens, target captions, instructions, and references all concatenated into one sequence—cross-attention becomes computationally significant. The paper reports that Ulysses-style sequence parallelism with degree 8, applied to both self-attention and cross-attention, yields a 4–5x speedup per training step on the 14B setup (Yang et al., 9 Feb 2026). This is a direct systems consequence of omni-token scaling: heterogeneous condition tokens are not merely a semantic device; they create an attention-level bottleneck that must be parallelized.

These MoE and systems papers collectively correct a common misconception. Omni-token scaling is not only about fewer tokens. It also includes token-aware capacity allocation, null routing, conditional activation, and training infrastructure that can cope with dynamic tensor shapes, variable sequence lengths, and modality-specific compute profiles (AI et al., 28 Oct 2025, Li et al., 16 Nov 2025).

5. Token-aware optimization, token horizons, and the geometry of token interfaces

A further dimension of omni-token scaling concerns the optimization geometry of token interfaces and the dependence of training hyperparameters on total token count. “Scaling Optimal LR Across Token Horizons” shows that the optimal learning rate decreases materially as the token horizon increases and follows a power law

$LR^*(D) = B D^{-\beta}.$

For the 50M model, the empirically estimated optimum is lower at 800B tokens than at 25B tokens, and for larger models the paper recommends a practical rule of thumb for transferring the learning rate across token horizons (Bjorck et al., 2024). The main relevance to omni-token scaling is conceptual: tokens are not merely a measure of dataset size or sequence length; they also determine the near-optimal training configuration.

“Token Geometry” reframes the embedding table and LM-head as a structured read/write interface between discrete symbols and continuous computation. It argues that token-interface parameters have a gradient geometry distinct from dense hidden weights and introduces Ember, an optimizer for embedding and LM-head matrices with $O(V + d)$ optimizer state instead of Adam’s $O(Vd)$, where $V$ is the vocabulary size and $d$ the hidden dimension (Shah, 1 Jul 2026). The paper states an empirical second-moment law, and its optimizer maintains rowwise and columnwise second-moment EMAs rather than dense per-parameter states. The systems payoff is substantial: for Qwen2.5-7B, Adam stores roughly 8.72 GB for token-interface optimizer state alone, whereas Ember reduces this to about 1.2 MB (Shah, 1 Jul 2026). The paper further reports that token trajectories are dominated by a simple 1D ray, with PC1 explaining nearly 90% of variance, suggesting that token optimization is lower in intrinsic dimension than the raw $V \times d$ parameter count would imply (Shah, 1 Jul 2026). Although the paper is not multimodal, its claims are directly relevant to any omnimodal system whose discrete interfaces proliferate across modalities.
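The two storage figures quoted above can be checked with back-of-envelope arithmetic. The sketch assumes values that are not stated in the text: a vocabulary of 152,064 entries and a hidden size of 3,584 for Qwen2.5-7B, an untied embedding table and LM head (two matrices), fp32 optimizer state, two Adam moment tensors per parameter, and one rowwise plus one columnwise vector per matrix for the factored optimizer.

```python
def adam_state_bytes(vocab, hidden, n_matrices=2, bytes_per_value=4):
    """Dense Adam state: two moment tensors (m and v) per parameter."""
    return n_matrices * vocab * hidden * 2 * bytes_per_value


def factored_state_bytes(vocab, hidden, n_matrices=2, bytes_per_value=4):
    """Factored state: one rowwise vector (vocab entries) plus one
    columnwise vector (hidden entries) per matrix."""
    return n_matrices * (vocab + hidden) * bytes_per_value


# Assumed configuration (not given in the text).
VOCAB, HIDDEN = 152_064, 3_584

adam_bytes = adam_state_bytes(VOCAB, HIDDEN)
factored_bytes = factored_state_bytes(VOCAB, HIDDEN)
print(f"Adam:     {adam_bytes / 1e9:.2f} GB")
print(f"Factored: {factored_bytes / 1e6:.2f} MB")
print(f"Ratio:    {adam_bytes / factored_bytes:,.0f}x")
```

Under these assumptions the dense state comes to about 8.72 GB and the factored state to about 1.2 MB, which is consistent with the reported figures.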

AR-Omni highlights a complementary optimization problem inside unified autoregression: modality imbalance induced by token-rate mismatch. Speech streams are far longer than text streams, so naive next-token training lets long modalities dominate optimization. AR-Omni’s response-focused weighted next-token prediction, its image-token perceptual alignment loss, and its low-rate 40 tokens-per-second speech representation are all explicit attempts to make shared-token training viable across modalities (Cheng et al., 25 Jan 2026). This suggests that omni-token scaling requires token-aware loss design even when the architecture itself is modality-agnostic.
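A modality-weighted next-token loss can be sketched as a weighted average of per-token cross-entropy. This is a generic illustration of the idea, not AR-Omni's exact response-focused formulation: the weight values, the modality labels, and the function name `weighted_ntp_loss` are hypothetical.

```python
import numpy as np


def weighted_ntp_loss(logits, targets, modality_ids, weights):
    """Cross-entropy averaged with per-token weights looked up by modality."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]  # per-token loss
    w = np.array([weights[m] for m in modality_ids], dtype=float)
    return float((w * nll).sum() / w.sum())


# Toy sequence: one text token with high loss, three speech tokens with low loss.
logits = np.array([
    [0.0, 4.0, 0.0],   # text token, target 0 is unlikely
    [5.0, 0.0, 0.0],   # speech tokens, targets are likely
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
])
targets = np.array([0, 0, 1, 2])
modalities = ["text", "speech", "speech", "speech"]

unweighted = weighted_ntp_loss(logits, targets, modalities,
                               {"text": 1.0, "speech": 1.0})
reweighted = weighted_ntp_loss(logits, targets, modalities,
                               {"text": 1.0, "speech": 0.25})
print(unweighted, reweighted)
```

Down-weighting the long speech span raises the relative contribution of the single text token, which is the imbalance correction the paragraph describes.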

Taken together, these results broaden the meaning of token scaling. The relevant axes are not only token count, but also token horizon, token frequency, token rate, and token-table optimizer state. Omnimodal scaling therefore intersects directly with the geometry of discrete interfaces and with the transfer of hyperparameters across training-token regimes (Bjorck et al., 2024, Shah, 1 Jul 2026).

6. Empirical status, misconceptions, and unresolved questions

The empirical record already shows that omni-token scaling is practically consequential. Omni-Video 2 reaches 73.53 FiVE-Acc on FiVE-Bench versus 62.53 for UniVideo, and reports a VBench total score of 84.69 while preserving text-to-video generation quality alongside editing capability (Yang et al., 9 Feb 2026). ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens on Qwen2.5-Omni 7B (Jung et al., 12 May 2026). OmniSIFT often matches or exceeds full-token performance at 25% context on QA-style benchmarks, and SEATS attains 101.1% of full-token performance at 35% retention on Qwen2.5-Omni-7B while cutting prefill time from 0.937 s to 0.436 s (Ding et al., 4 Feb 2026, Xin et al., 19 May 2026). Uni-MoE-2.0-Omni reports that a model trained on approximately 75B tokens surpasses Qwen2.5-Omni, which it states was trained with 1.2T tokens, on over 50 of 76 benchmarks (Li et al., 16 Nov 2025). These results collectively show that token-interface decisions have become a first-order determinant of multimodal capability.

At the same time, the literature remains methodologically uneven. Omni-Video 2 explicitly does not present controlled ablations over the number of MLLM tokens, the effect of explicit target captions versus raw edit prompts, or alternate fusion mechanisms; its token-scaling narrative is architectural and systems-oriented rather than expressed as a quantified scaling law (Yang et al., 9 Feb 2026). Many compression methods depend on heuristics such as fixed spatial budgets, stage boundaries, thresholds, or offline access to the full sequence, as acknowledged by OmniDrop, OmniSelect, and SEATS (Park et al., 14 May 2026, Yang et al., 18 May 2026, Xin et al., 19 May 2026). OmniSIFT shows that extreme compression can still lag the full-token model on dense captioning even when it surpasses it on QA benchmarks, indicating that token removal affects generative descriptive tasks differently from multiple-choice or short-answer reasoning (Ding et al., 4 Feb 2026). Uni-MoE-2.0-Omni, despite extensive evaluation, leaves many optimization details unspecified, including learning rates, batch sizes, and explicit MoE balancing losses (Li et al., 16 Nov 2025).

Several misconceptions recur in discussions of the topic. The first is that omni-token scaling is merely “multimodal token pruning.” The evidence is broader: the field includes additive conditioning banks, shared-autoregressive token tapes, sparse MoE routing, latent/public token interleaving, optimizer redesign for token tables, and token-horizon scaling laws (Yang et al., 9 Feb 2026, Cheng et al., 25 Jan 2026, Shah, 1 Jul 2026). The second is that a unified multimodal model must use one literal tokenizer. Recent systems often unify at the hidden-space or conditioning-memory level rather than at the raw vocabulary level, precisely because modality-specific tokenizers remain advantageous (AI et al., 28 Oct 2025, Li et al., 16 Nov 2025). The third is that scaling is solved by larger backbones alone. Compression and routing papers show that token-budget allocation can improve both efficiency and accuracy, sometimes exceeding full-token baselines by removing distractors and redundancy (Xin et al., 19 May 2026, Yang et al., 18 May 2026).

The principal open questions follow directly from these limitations. The field lacks clean token-level scaling laws for multimodal condition length, retention schedules, or expert activation under modality heterogeneity. It also lacks a unified account of when multimodal fusion is “complete” enough to drop non-text tokens, when null experts should dominate routed compute, and how latent/private token streams should be scheduled against public output streams. Current results suggest that omni-token scaling is evolving from a descriptive label into a technical discipline concerned with token interfaces as the primary substrate of multimodal scale. What remains incomplete is the transition from strong architectures and heuristics to predictive theory.