
Flow-SLM: Generative Spoken Language Model

Updated 17 February 2026
  • Flow-SLM is a generative spoken language model that jointly synthesizes discrete semantic tokens and continuous acoustic embeddings to capture both linguistic content and fine acoustic detail.
  • It employs a dual-stream architecture with a causal transformer and a conditional flow matching head to enable multi-token prediction and maintain speaker and prosodic consistency.
  • Experimental results show that Flow-SLM matches token-only baselines on linguistic metrics while significantly improving acoustic fidelity and speaker similarity.

Flow-SLM denotes a paradigm for generative spoken language modeling, jointly synthesizing linguistic and acoustic information in a single autoregressive pass. Conventional textless spoken LLMs (SLMs) are restricted to generating discrete semantic tokens and rely on downstream vocoders for acoustic realization, limiting their ability to capture fine acoustic context or enforce prosodic and speaker-consistent detail. Flow-SLM addresses these constraints through an architecture that fuses discrete semantic token generation with direct continuous acoustic frame modeling, advancing the state-of-the-art in speaker and prosody preservation for textless speech generation (Chou et al., 12 Aug 2025).

1. Joint Linguistic and Acoustic Modeling Architecture

At its core, Flow-SLM introduces a dual-stream representation: a sequence of discrete semantic tokens and a sequence of real-valued acoustic frame embeddings, both derived from a pretrained, frozen speech encoder (Mimi). The semantic tokens are obtained from the first level of Mimi's residual vector quantizer (RVQ), representing coarse linguistic content, while the acoustic frames are high-dimensional embeddings (768-dim, 50 Hz).

These representations are consumed by a causal transformer backbone, which ingests previously generated acoustic frames and produces a context vector. This context branches into two distinct predictive heads:

  • Semantic Token Predictor: An MLP predicts the next $k$ semantic tokens using a multi-token cross-entropy loss.
  • Conditional Flow Matching (CFM) Head: This module models the generation of the next acoustic frame via a time-dependent vector field, trained by flow-matching to transport a Gaussian noise seed to the data embedding, conditioned on both past context and predicted future semantic tokens.

Decoding proceeds by alternately sampling semantic token sequences and then generating an acoustic embedding by integrating the flow-generated ODE from prior to target. The complete embedding sequence is then decoded back to waveform via Mimi's decoder and RVQ layers.
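The alternating decode order described above can be sketched as follows. The three callables (`encode_context`, `predict_tokens`, `cfm_sample`) are hypothetical stand-ins for the transformer backbone, the semantic head, and the ODE-integrated CFM head; none of these names come from the paper.

```python
import numpy as np

def decode(encode_context, predict_tokens, cfm_sample, n_frames):
    # Alternate between sampling semantic tokens and generating the
    # next continuous acoustic embedding, one frame at a time.
    frames = []
    for _ in range(n_frames):
        ctx = encode_context(frames)      # causal transformer context
        toks = predict_tokens(ctx)        # next k semantic tokens
        frame = cfm_sample(ctx, toks)     # ODE-integrated acoustic frame
        frames.append(frame)
    # The stacked embedding sequence would then be passed to Mimi's
    # decoder for waveform synthesis (not modeled here).
    return np.stack(frames)
```
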

2. Training Objectives and Conditional Flow Matching

The model is supervised by a composite loss over both output channels, indexed by the length $k$ of the semantic token prediction horizon: $L(\theta) = L_{\mathrm{sem}-k}(\theta) + L_{\mathrm{CFM}-k}(\theta)$

  • $L_{\mathrm{sem}-k}$: Multi-token cross-entropy encouraging accurate prediction of the next $k$ semantic tokens.
  • $L_{\mathrm{CFM}-k}$: Conditional flow-matching loss training the vector field $v_t$ to match the optimal-transport field between a Gaussian prior and the true acoustic embedding.

Specifically, time-dependent noisy interpolants $\phi_t^{OT}(x_0)$ are constructed between noise and the ground-truth embedding, with the CFM head trained under an $L_2$ objective to approximate $u_t(\phi_t^{OT}(x_0) \mid x_1) = x_1 - (1-\sigma_\mathrm{min})x_0$. Sampling in practice is performed with a numerical ODE solver starting from a Gaussian seed.
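A minimal numpy sketch of this optimal-transport interpolant and its regression target, following the formulas above. The value of $\sigma_\mathrm{min}$ is an assumption (it is not stated here), and `v_pred` stands in for the CFM head's output.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed value; not specified in the text

def ot_interpolant(x0, x1, t):
    # phi_t^OT(x0): straight-line path from noise x0 toward data x1,
    # leaving residual noise of scale SIGMA_MIN at t = 1.
    return (1.0 - (1.0 - SIGMA_MIN) * t) * x0 + t * x1

def cfm_target(x0, x1):
    # u_t(phi_t^OT(x0) | x1) = x1 - (1 - sigma_min) x0, constant in t.
    return x1 - (1.0 - SIGMA_MIN) * x0

def cfm_loss(v_pred, x0, x1):
    # L2 regression of the predicted vector field onto the OT target.
    return float(np.mean((v_pred - cfm_target(x0, x1)) ** 2))
```
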

3. Multi-Future-Token Prediction and Mitigation of Trivial Continuity

Flow-SLM resolves a central pathology of continuous-augmented SLMs: when only the immediate next semantic token ($k=1$) is predicted given continuous context, models degenerate to exploiting local acoustic continuity, impairing their ability to maintain high-level linguistic structure in the generated speech. Empirical analysis shows that increasing $k$ (multi-token prediction) forces the model to attend to and preserve longer-range linguistic dependencies. With $k=4$, benchmarks such as sWUGGY and sBLIMP recover nearly to discrete-token-only baselines, while maintaining superior acoustic detail.
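The multi-token objective can be sketched as one cross-entropy per horizon position; the `(k, V)` logit layout and the averaging over positions are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def multi_token_ce(logits, targets):
    # logits: (k, V) scores for the next k semantic tokens;
    # targets: (k,) ground-truth token ids.
    # Mean cross-entropy over the prediction horizon.
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(targets)), targets].mean())
```

With $k=1$ this reduces to ordinary next-token cross-entropy; larger $k$ supervises several future tokens from the same context vector.
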

4. Representation and Conditioning of Acoustic Frames

The acoustic signal is encoded as 768-dimensional vectors at 20 ms hops (50 Hz), without further quantization prior to CFM training. At each generative step, the CFM head is conditioned on:

  • The current interpolated embedding $\phi_t^{OT}(x_0)$
  • The transformer context $c_{<m}$
  • An embedding of the next $k$ future semantic tokens

These signals are concatenated and input to a residual MLP that predicts the flow-matching vector field for ODE integration, coupling the discrete and continuous representations for coherent generative modeling.
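The concatenate-then-residual-MLP step can be sketched as below. The single residual block, the ReLU nonlinearity, and the weight shapes are all illustrative assumptions; the paper specifies only that a residual MLP maps the concatenated conditioning signals to the vector field.

```python
import numpy as np

def cfm_head(phi_t, context, token_emb, t, W_in, W1, W2, W_out):
    # Concatenate the conditioning signals (plus the flow time t),
    # apply one residual MLP block, and read out the vector field,
    # which has the same dimensionality as the acoustic embedding.
    h = W_in @ np.concatenate([phi_t, context, token_emb, [t]])
    h = h + W2 @ np.maximum(W1 @ h, 0.0)   # residual block with ReLU
    return W_out @ h                        # predicted v_t
```
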

5. Experimental Setup and Evaluation Metrics

Models were trained on MLS-En (45k hours of read English) and an extended corpus (+20k h People’s Speech clean), in two configurations:

  • Flow-SLM-270M: 270M params (OpenELM-270M transformer, ∼125M CFM)
  • Flow-SLM-1B: 1B params (OpenELM-1B transformer, ∼150M CFM)

Evaluation metrics span both linguistic plausibility and acoustic fidelity:

  • Linguistic: sWUGGY (lexical contrast), sBLIMP (syntactic acceptability), genPPL (perplexity via Whisper/LLaMA-3.2-1B transcripts)
  • Acoustic: SALMon (consistency, sentiment/background alignment), speaker similarity (WavLM-large-TDNN), Fréchet Speech Distance (e2v-FSD, wlm-FSD)

Inference uses nucleus sampling for semantic tokens, a midpoint ODE solver for the CFM head (64 steps), and waveform decoding via Mimi.
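The midpoint integration step can be sketched as follows; `vector_field` stands in for the trained CFM head evaluated at a fixed conditioning context (a hypothetical interface, not the paper's API).

```python
import numpy as np

def midpoint_sample(vector_field, dim, steps=64, rng=None):
    # Integrate dx/dt = v(x, t) from a Gaussian seed at t = 0 to t = 1
    # with the explicit midpoint method (64 steps, as reported).
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(dim)
    h = 1.0 / steps
    for i in range(steps):
        t = i * h
        k1 = vector_field(x, t)
        # Evaluate the field at the half-step point, then take a full step.
        x = x + h * vector_field(x + 0.5 * h * k1, t + 0.5 * h)
    return x
```
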

6. Empirical Performance and Ablation Analysis

Flow-SLM achieves linguistic likelihood comparable to contemporary token-only SLMs:

  • Flow-SLM-270M shows only a minor drop in sWUGGY/sBLIMP relative to TWIST-350M; Flow-SLM-1B is ∼2 pts lower on sWUGGY but ∼3 pts higher on sBLIMP than TWIST-1.3B.
  • Acoustic metrics surpass baselines: speaker similarity is markedly improved (0.43 vs 0.09 for discrete-only TWIST), and Fréchet metrics (e2v-FSD, wlm-FSD) indicate more natural and expressive synthesis.
  • SALMon shows +8–9 pts acoustic consistency and +6–8 pts sentiment alignment over baselines.
  • GenPPL is higher (worse) compared to textually pretrained models, indicating that while local and segmental fidelity is strong, long-range coherence remains a challenge with purely speech-supervised SLMs.

Ablations confirm that the absence of multi-token prediction ($k=1$) degrades high-level task accuracy due to the model's over-reliance on continuity in the acoustic frames. Increasing $k$ while retaining CFM yields robust linguistic and acoustic performance.

7. Implications, Limitations, and Future Directions

Flow-SLM demonstrates that joint generative modeling of semantic and acoustic information—without dependence on external vocoders—enables robust speaker and prosody preservation in textless SLMs. Its design allows for competitive semantic representation with modest computational resources and training data.

Key limitations include:

  • Lower long-range linguistic coherence relative to models pretrained on text (as indicated by genPPL).
  • Restricted background diversity, an artifact of audiobook-centric training data.
  • ODE-based flow-matching sampling incurs higher computational costs compared to discrete-token inference.

Proposed directions for improvement include scaling up compute and data, integrating text supervision for hybrid speech–text models, mixing CFM continuous frames with quantized acoustic tokens for tunable fidelity/computation tradeoffs, and diversifying training data to encompass conversational and multi-speaker scenarios. These avenues point toward flexible, robust, and realistic spoken language generation beyond current token-centric paradigms (Chou et al., 12 Aug 2025).
