Scaling Properties of Continuous Diffusion Spoken Language Models

Published 27 Apr 2026 in cs.CL, cs.AI, and cs.LG | (2604.24416v1)

Abstract: Speech-only spoken language models (SLMs) lag behind text and text-speech models in performance, with recent discrete autoregressive (AR) SLMs indicating significant computational and data demands to match text models. Since discretizing continuous speech for AR creates bottlenecks, we explore whether continuous diffusion (CD) SLMs are more viable. To quantify the SLMs' linguistic quality, we introduce the phoneme Jensen-Shannon divergence (pJSD) metric. Our analysis reveals CD SLMs, mirroring AR behavior, exhibit scaling laws for validation loss and pJSD, and show optimal token-to-parameter ratios decreasing as compute scales. However, for the latter, loss becomes insensitive to choice of data and model sizes, showing potential for fast inference. Scaling CD SLMs to 16B parameters with tens of millions of hours of conversational data enables generation of emotive, prosodic, multi-speaker, multilingual speech, though achieving long-form coherence remains a significant challenge.

Authors (9)

Summary

The paper presents continuous diffusion spoken language models that train solely on speech, outperforming autoregressive models in data efficiency and compute scalability.
It leverages a multimodal diffusion transformer on log-mel filterbanks and establishes predictable power-law scaling relationships using tokens-per-parameter ratios and pJSD metrics.
Ablation studies reveal the impact of training duration, patch size, noise schedules, and diffusion timesteps, while highlighting current limits in achieving long-form linguistic coherence.

Scaling Laws and Data Efficiency of Continuous Diffusion Spoken Language Models

Introduction

The paper "Scaling Properties of Continuous Diffusion Spoken Language Models" (2604.24416) presents a scaling law analysis for continuous diffusion (CD) spoken language models (SLMs) trained exclusively on speech, without textual supervision. The work addresses the computational and representational bottlenecks inherent in discrete autoregressive (AR) SLMs and proposes that continuous diffusion modeling offers advantages in both computational efficiency and linguistic quality, particularly at extreme model and dataset sizes.

Model Architecture and Speech Representation

The CD SLM uses log-mel filterbanks as its data representation, bypassing the compression artifacts of neural codecs and preserving both semantic and acoustic information. The generative framework employs a multimodal diffusion transformer (MM-DiT), adapted to handle variable-length speech context and continuation streams within a bidirectional attention mechanism. Diffusion runs directly on the log-mel filterbanks: training minimizes a min-SNR-weighted denoising loss, and inference uses classifier-free guidance (CFG) for conditional sampling.
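The two ingredients named above, min-SNR loss weighting and classifier-free guidance, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say which prediction target is used, so the weight below takes the v-prediction form min(SNR, gamma) / (SNR + 1), which stays finite when SNR is zero. The clipping value `gamma=5.0`, the function names, and the `(batch, frames, mel_bins)` layout are all assumptions made for illustration.

```python
import numpy as np

def min_snr_weight(snr, gamma=5.0):
    """Min-SNR loss weight, v-prediction form (an assumption, see lead-in).

    Clipping the SNR at gamma keeps low-noise timesteps from dominating
    the loss; dividing by (snr + 1) keeps the weight finite at snr == 0.
    """
    snr = np.asarray(snr, dtype=np.float64)
    return np.minimum(snr, gamma) / (snr + 1.0)

def weighted_denoising_loss(pred, target, snr, gamma=5.0):
    """Mean squared error per example, weighted by the min-SNR weight.

    pred, target: arrays shaped (batch, frames, mel_bins).
    snr: array shaped (batch,), the SNR at each example's sampled timestep.
    """
    reduce_axes = tuple(range(1, pred.ndim))
    per_example = np.mean((pred - target) ** 2, axis=reduce_axes)
    return float(np.mean(min_snr_weight(snr, gamma) * per_example))

def cfg_combine(cond_pred, uncond_pred, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, past) the conditional prediction."""
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)
```

With `guidance_scale=1.0` the combination reduces to the conditional prediction, and with `0.0` to the unconditional one; values above 1 trade sample diversity for stronger adherence to the speech context.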

The model architecture supports scaling up to 16B parameters and handles tens of millions of hours of diverse conversational speech, facilitated by large-scale data curation and preprocessing using WhisperX pipelines.

Scaling Law Analysis

Validation Loss and Token-to-Parameter Ratio

CD SLMs demonstrate validation loss scaling behavior driven by model size, dataset size, and compute budget, aligning with power-law relationships observed in language modeling [45, 46]. The optimal tokens-per-parameter ratio $r^*$ decreases with compute, so the compute-optimal model needs fewer training tokens per parameter as scale increases. This contrasts with AR SLMs where $r^*$ typically increases with higher compute budgets due to discretization limits. The curvature of the isoFLOP curves also flattens as compute grows, so loss stays near-optimal across a broad range of data-model allocations. In practice, a smaller model trained on more data can then reach almost the same loss as the compute-optimal one, which is what makes fast inference and flexible resource allocation possible.
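A common way to read $r^*$ off an isoFLOP curve is to fit a parabola to loss against log model size at fixed compute and take its minimum. The sketch below follows that recipe under assumptions that are not taken from the paper: training compute is approximated as `6 * N * D` FLOPs (a rule of thumb from text language modeling that may not hold for a diffusion transformer), and the function name and return values are hypothetical.

```python
import numpy as np

def isoflop_optimum(params, losses, compute_flops, flops_per_param_token=6.0):
    """Estimate the compute-optimal allocation on one isoFLOP curve.

    params: model sizes N (parameter counts) trained at the same compute.
    losses: validation loss of each model.
    compute_flops: the shared compute budget C.
    flops_per_param_token: assumed constant k in C ~= k * N * D.

    Returns (n_opt, d_opt, r_opt, curvature), where r_opt = d_opt / n_opt is
    the tokens-per-parameter ratio and curvature is the quadratic coefficient
    of the fit (smaller means a flatter curve around the optimum).
    """
    log_n = np.log(np.asarray(params, dtype=np.float64))
    a, b, _ = np.polyfit(log_n, np.asarray(losses, dtype=np.float64), deg=2)
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    n_opt = float(np.exp(-b / (2.0 * a)))  # vertex of the parabola in log N
    d_opt = compute_flops / (flops_per_param_token * n_opt)
    return n_opt, d_opt, d_opt / n_opt, float(a)
```

Repeating the fit at several compute budgets gives one `r_opt` and one `curvature` per budget; a decreasing `r_opt` and a shrinking `curvature` would correspond to the two trends described above.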

Linguistic Quality: Phoneme Jensen-Shannon Divergence

To rigorously quantify "languageness", the paper introduces the phoneme Jensen-Shannon divergence (pJSD), which measures the distributional distance between phoneme n-grams of generated and real speech. The pJSD metric is found to scale predictably with compute and model size, with higher-order n-grams (e.g., 5-grams) yielding a stronger fit to scaling laws and demonstrating tight correlation with training loss. The fused two-stage scaling law approach provides robust fits across the scaling regime, with mean relative errors (MRE) consistently below 5%.
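The core of such a metric is a Jensen-Shannon divergence between two n-gram distributions. The sketch below assumes phoneme sequences have already been obtained from audio (the summary does not say which phoneme recognizer is used), applies no smoothing, and uses log base 2 so the result lies in [0, 1]. The function names are hypothetical, and the paper's exact pJSD definition may differ in these details.

```python
from collections import Counter
import math

def ngram_counts(sequences, n):
    """Count phoneme n-grams over a collection of phoneme sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts

def jensen_shannon_divergence(p_counts, q_counts):
    """JSD (base 2) between two empirical distributions given as counts."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    if p_total == 0 or q_total == 0:
        raise ValueError("both inputs must contain at least one n-gram")
    jsd = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts.get(key, 0) / p_total
        q = q_counts.get(key, 0) / q_total
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd

def phoneme_jsd(generated, reference, n=5):
    """Divergence between n-gram distributions of generated and real speech."""
    return jensen_shannon_divergence(
        ngram_counts(generated, n), ngram_counts(reference, n)
    )
```

Identical distributions score 0 and distributions with no shared n-grams score 1. Higher `n` captures longer phonotactic context but spreads the counts over many more n-gram types, so it needs more generated speech for a stable estimate.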

Perceptual Quality Metrics

Analysis of perceptual quality using DNSMOS, NISQA Mean Opinion Scores (MOS), and Meta Audiobox Aesthetics reveals that most metrics quickly saturate near real-data baselines and do not exhibit scaling law behavior. Exceptions are certain subjective axes (content enjoyment and understanding) under Meta Audiobox Aesthetics, which do show predictable scaling. Extrapolation suggests that some perceptual metrics may not reach real-data quality solely via scaling, implying intrinsic representational constraints in the current approach.

Ablation Studies

A systematic ablation study investigates sensitivity to training duration, temporal patch size, noise schedule, and number of diffusion timesteps. The results indicate that:

Training duration: Primary driver for improved linguistic quality and higher content enjoyment/understanding scores.
Patch size: Increasing patch size degrades all metrics substantially, underscoring the importance of fine temporal resolution.
Noise schedule: Linear schedule with zero terminal SNR is consistently robust for perceptual quality, outperforming cosine and exponential alternatives.
Diffusion timesteps: Finer discretization offers marginal gains but with increased computational complexity.

This multi-factorial sensitivity analysis informs practical training design for optimizing both linguistic and audio fidelity.
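As an illustration of the noise-schedule item, one widely used way to obtain a linear schedule with zero terminal SNR is to shift and rescale the square root of the cumulative signal coefficient so that its final value is exactly zero. The sketch below assumes a linear beta schedule with conventional DDPM endpoints (`1e-4` to `2e-2`); the summary does not state how the paper parameterizes its schedule, so both the endpoints and the rescaling procedure are assumptions.

```python
import numpy as np

def linear_betas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule; the endpoints are conventional defaults."""
    return np.linspace(beta_start, beta_end, num_steps, dtype=np.float64)

def rescale_to_zero_terminal_snr(betas):
    """Rescale a schedule (num_steps >= 2) so the last step has SNR == 0.

    sqrt(alpha_bar) is shifted so its last value is 0, then scaled so its
    first value is unchanged; betas are recovered from the new alpha_bar.
    """
    alphas_bar_sqrt = np.sqrt(np.cumprod(1.0 - betas))
    first = alphas_bar_sqrt[0]
    last = alphas_bar_sqrt[-1]
    alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)
    alphas_bar = alphas_bar_sqrt ** 2
    # Undo the cumulative product to get per-step alphas.
    alphas = np.concatenate([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

def snr_from_betas(betas):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar) at every timestep."""
    alphas_bar = np.cumprod(1.0 - betas)
    return alphas_bar / (1.0 - alphas_bar)
```

Without the rescaling, the final timestep still carries a small amount of signal, so training never sees the pure-noise input that sampling starts from. With zero terminal SNR, a noise-prediction target carries no information at the last step, which is why such schedules are usually paired with a v-prediction or sample-prediction objective.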

Extreme Scaling and Representation Dependence

Scaling the CD SLM to 16B parameters and integrating auxiliary conditioning with a frozen Whisper encoder demonstrates that architectural and representational choices define empirical performance bounds. The enhanced model achieves loss below the irreducible minimum estimated for the base MM-DiT architecture, validating the critical impact of richer, information-dense representations. Despite improved generation quality (prosody, emotion, multilingualism), long-form linguistic coherence remains unattained, highlighting enduring limitations of speech-only SLMs.

Implications and Future Directions

The findings establish that CD SLMs exhibit scaling laws analogous to AR SLMs but offer superior compute efficiency and flexible allocation at high scales. Nevertheless, practical limitations persist: achieving LLM-level linguistic proficiency with speech-only models may be infeasible without either advancing data representations or adopting joint text-speech modeling strategies.

Future directions include:

Representation Learning: Exploring superposition-based, information-dense representations or architectures to achieve sharper scaling.
Joint Modalities: Integrating text-conditioned or text-pretrained models to bridge the gap in linguistic structure and coherence.
Evaluation Metrics: Developing robust evaluation protocols as SLMs achieve higher levels of coherence and explicit language generation.

Conclusion

The paper provides the first systematic scaling law analysis for continuous diffusion SLMs, demonstrating that linguistic proficiency and data efficiency improve predictably with compute and scale, albeit with inherent representational limits in current speech-only paradigms. The work clarifies the structural requirements for high-quality spoken language modeling and suggests that further advances in speech representations and joint modeling architectures are necessary to approach the capabilities of state-of-the-art text-based systems.