Token-Level Serialized Output Training (t-SOT)
- Token-Level Serialized Output Training (t-SOT) is a sequence modeling paradigm that interleaves tokens from multiple sources using special change tokens for speaker and modality transitions.
- It leverages streaming decoders such as the RNN-Transducer or Transformer Transducer to achieve low-latency inference in multi-talker, multilingual, and joint ASR/ST tasks.
- t-SOT simplifies architecture by unifying token streams and employing auxiliary losses, yielding state-of-the-art performance on benchmarks such as LibriSpeechMix and AliMeeting.
Token-Level Serialized Output Training (t-SOT) is a paradigm for sequence modeling in multi-talker, multilingual, and multi-task speech processing. t-SOT redefines training targets and inference outputs by serializing multiple streams of conceptually distinct content (speaker turns, language pairs, or modality outputs) into a single, fully interleaved token sequence, using special tokens to encode stream or role transitions. This architectural simplification enables streaming, low-latency inference with a single decoder, effectively handling overlapping speech, speaker attribution, and joint tasks such as ASR and ST, without the computational burden or architectural complexity of multiple output branches or explicit permutation solutions (Kanda et al., 2022, Wu et al., 2023, Fan et al., 2024).
1. Serialization Principle and Label Construction
In t-SOT, each unit (token, word, or subword) from multiple sources—such as speaker streams in ASR or source/target outputs in joint ASR-ST—is assigned an emission time (from forced alignment, timestamp inference, or an aligner). The reference label sequence is constructed by merging all tokens across sources in increasing time order. Whenever the current token originates from a different source than the previous one, a stream-identity or change-channel token (e.g., ⟨cc⟩, ⟨sc⟩, ⟨asr⟩, ⟨st⟩) is inserted. The resulting sequence takes the form

Y = (y_1, y_2, ⟨cc⟩, y_3, …, ⟨cc⟩, y_N),

where each ⟨cc⟩ denotes a source or speaker change (Kanda et al., 2022, Liang et al., 2023, Fan et al., 2024). This principle generalizes to joint tasks by inserting tokens marking modality switches (e.g., ⟨asr⟩ and ⟨st⟩ for joint recognition and translation (Papi et al., 2023, Papi et al., 2023)), or role tokens for speaker attribution (Xu et al., 12 Jun 2025).
Token-level interleaving, as opposed to utterance-level SOT or PIT-based streams, ensures strict chronological ordering and stream-switch information at token granularity, which is critical for handling rapid overlaps, cross-lingual switching, and hybrid content.
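The construction above can be sketched in a few lines. This is a minimal illustration: `serialize_tsot` and the ASCII spelling `"<cc>"` are stand-ins, not identifiers from the cited papers.

```python
# Hypothetical sketch of t-SOT label construction: merge per-source token
# streams in emission-time order and insert a change token whenever the
# source differs from that of the previous token.

CC = "<cc>"  # ASCII stand-in for the change-channel token

def serialize_tsot(streams):
    """streams: dict mapping source id -> list of (time, token) pairs."""
    # Flatten to (time, source, token) triples and sort chronologically.
    events = sorted(
        (t, src, tok)
        for src, toks in streams.items()
        for t, tok in toks
    )
    out, prev_src = [], None
    for _, src, tok in events:
        if prev_src is not None and src != prev_src:
            out.append(CC)  # stream-switch marker
        out.append(tok)
        prev_src = src
    return out
```

For two overlapping speakers, `serialize_tsot({"A": [(0.0, "hello"), (0.4, "world")], "B": [(0.2, "hi")]})` yields `["hello", "<cc>", "hi", "<cc>", "world"]`: strict chronological order with a change token at each speaker switch.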
2. Model Architecture and Training Objectives
The dominant t-SOT instantiation is the streaming Transformer Transducer (TT) or RNN-Transducer (RNN-T), with the following components:
- Encoder: Causal transformer or conformer stack operating on log-Mel features (with chunk-wise lookahead for streaming).
- Prediction network: LSTM stack that consumes previously emitted non-blank (and non-⟨cc⟩) tokens.
- Joint network: Fuses encoder and predictor states and outputs logits over the expanded vocabulary with the special change tokens (Kanda et al., 2022, Kanda et al., 2022).
Training proceeds with the standard RNN-T loss computed over the serialized t-SOT target sequence. The vocabulary is augmented with the special change tokens, which are treated as standard prediction targets.
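Vocabulary augmentation amounts to adding the change tokens as ordinary output units, so the serialized sequence encodes to plain target ids for the transducer loss. A minimal sketch, with illustrative function names and token spellings:

```python
# Sketch: extend the output vocabulary with special change tokens, which
# then become ordinary target ids for the RNN-T loss. Names are illustrative.

def build_vocab(base_tokens, special_tokens=("<blank>", "<cc>")):
    """Assign ids to special tokens first, then to the base units."""
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for tok in base_tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(seq, vocab):
    """Map a serialized t-SOT token sequence to target ids."""
    return [vocab[tok] for tok in seq]
```

The resulting id sequence is fed to the transducer loss unchanged; nothing in the loss distinguishes ⟨cc⟩ from a regular subword.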
For multi-task scenarios (e.g., joint ASR/ST), the label construction involves interleaving the ASR and ST tokens using explicit textual aligners (e.g., awesome-align), with separate tokens for output mode transitions. The training loss remains a single sequence-level RNN-T or CTC loss (Papi et al., 2023, Papi et al., 2023).
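A simplified sketch of this interleaving, assuming the alignment step (e.g., awesome-align) has already grouped the outputs into aligned (ASR, ST) word chunks; the function name and ASCII token spellings are illustrative, not the exact recipe of the cited papers:

```python
# Sketch of joint ASR/ST label construction: for each aligned chunk, emit
# the transcript words after an <asr> mode token and the translation words
# after an <st> mode token, producing one interleaved target sequence.

ASR, ST = "<asr>", "<st>"  # ASCII stand-ins for the mode-switch tokens

def interleave_asr_st(pairs):
    """pairs: list of (asr_words, st_words) aligned chunks."""
    out = []
    for asr_words, st_words in pairs:
        out += [ASR, *asr_words, ST, *st_words]
    return out
```

A single sequence-level loss over this interleaved target then trains transcription and translation jointly.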
Recent variants further decouple the language model (LM) component from the predictor using a factorized neural transducer (FNT), allowing text-only domain adaptation by isolating the multi-stream disruption introduced by change tokens (Wu et al., 2023).
3. Decoding and Inference Strategies
Streaming inference with t-SOT produces a single token stream. As each token is emitted, inference logic monitors change-channel tokens, partitioning the token stream into separate “virtual output channels” corresponding to speakers, roles, or modalities:
- Speaker switching: On ⟨cc⟩ emission, toggle the target buffer to a different speaker (Kanda et al., 2022, Kanda et al., 2022).
- Joint ASR-ST: On ⟨asr⟩ or ⟨st⟩ emission, toggle the hypothesis buffer to transcript or translation (Papi et al., 2023).
- Role tagging: On ⟨spk0⟩/⟨spk1⟩ tokens, switch transcript attribution (Xu et al., 12 Jun 2025).
This enables real-time demultiplexing with compute and latency comparable to single-talker ASR, even under heavy overlap or rapid context switching. For more structured architectures (e.g., FNT), the LM maintains multiple hidden states—one per channel—switching context only on change token emission (Wu et al., 2023).
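A toy demultiplexer for the two-channel case illustrates the routing logic; names are illustrative, a real system operates incrementally on each emitted token, and the round-robin rule for more than two channels is a simplification:

```python
# Sketch of streaming demultiplexing: route emitted tokens into "virtual
# output channels", toggling the active channel on each <cc> token
# (two virtual channels, as in two-speaker t-SOT).

CC = "<cc>"  # ASCII stand-in for the change-channel token

def demultiplex(stream, n_channels=2):
    """Split a serialized token stream into per-channel token lists."""
    channels = [[] for _ in range(n_channels)]
    active = 0
    for tok in stream:
        if tok == CC:
            active = (active + 1) % n_channels  # switch virtual channel
        else:
            channels[active].append(tok)
    return channels
```

Applied to `["hello", "<cc>", "hi", "<cc>", "world"]`, this recovers `[["hello", "world"], ["hi"]]`: each speaker's transcript is reassembled from the single interleaved stream.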
4. Extensions: Auxiliary Supervision and Multi-Task Losses
To address frequent t-SOT challenges—including degraded context modeling at rapid stream switches and unreliable speaker-change prediction—recent work introduces auxiliary objectives and architectural blocks:
- Speaker-aware auxiliary losses: Masked t-SOT loss reinforces intra-speaker context learning by feeding the model a per-speaker masked target, where only tokens from one speaker are visible (others are masked), thereby reducing semantic confusion in autoregressive decoding (Fan et al., 2024).
- Speaker similarity attention: Decoder self-attention is regularized using a similarity matrix over learned speaker embeddings, suppressing cross-speaker attention and prioritizing intra-speaker dependencies (Fan et al., 2024).
- Boundary-aware losses: Penalties on decoder cross-attention, guided by oracle or estimated timestamps, constrain the attention to remain block-diagonal at segment boundaries (Liang et al., 2023).
- Speaker change detection heads: Auxiliary prediction heads jointly estimate speaker-change or utterance boundaries, with associated CE or binary cross-entropy losses (Liang et al., 2023).
A generalized multi-task loss aggregates the main RNN-T or attention loss with CTC (sometimes two-stage: token-level monotonicity followed by FIFO re-ordering), auxiliary speaker or boundary losses, and structural penalties (Liang et al., 2023).
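The per-speaker masked target behind the masked t-SOT loss can be illustrated on a serialized sequence. This is a sketch of the masking idea only, not the exact loss construction of the cited paper; token spellings are stand-ins:

```python
# Sketch: build a per-speaker masked target from a two-speaker t-SOT
# sequence. Tokens of the kept speaker stay visible; the other speaker's
# tokens are replaced with a <mask> token, so an auxiliary loss can
# reinforce intra-speaker context during autoregressive decoding.

CC, MASK = "<cc>", "<mask>"  # ASCII stand-ins for the special tokens

def mask_other_speaker(tsot_seq, keep_channel=0):
    out, active = [], 0
    for tok in tsot_seq:
        if tok == CC:
            active = 1 - active  # toggle between the two virtual channels
            out.append(tok)
        else:
            out.append(tok if active == keep_channel else MASK)
    return out
```

Running this once per speaker yields the two masked views used as auxiliary targets.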
5. Application Domains and Evaluation
t-SOT underpins major advances in a variety of tasks:
- Multi-talker streaming ASR: State-of-the-art WER on LibriSpeechMix, LibriCSS, AMI, and AliMeeting, with single-pass models matching or improving on two-stream or PIT-based systems (Kanda et al., 2022, Kanda et al., 2022, Liang et al., 2023).
- Speaker-attributed ASR: Joint ASR and speaker identification/diarization with low latency using token-aligned speaker embeddings or role tokens (Kanda et al., 2022, Xu et al., 12 Jun 2025).
- Joint ASR/ST and multi-task streaming: Token-level interleaving unifies translation and transcription, enabling simultaneous recognition and translation with a single decoder at minimal latency and outperforming separate models in the quality–latency tradeoff (Papi et al., 2023, Papi et al., 2023).
- Domain adaptation: Text-only adaptation becomes feasible via FNT, yielding 8–11% relative WER gains after cross-domain LM fine-tuning (Wu et al., 2023).
Metrics such as concatenated minimum-permutation WER (cpWER), utterance-dependent CER (UD-CER), and attribution error rate (AER) are used to capture not only lexical but also speaker change and segmentation accuracy (Liang et al., 2023, Fan et al., 2024, Xu et al., 12 Jun 2025).
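cpWER can be sketched directly from its definition: concatenate each speaker's words, score every assignment of hypothesis channels to reference speakers, and keep the best permutation. This is a plain reference implementation assuming equal numbers of reference speakers and hypothesis channels, not an official scorer:

```python
# Sketch of concatenated minimum-permutation WER (cpWER).

from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[len(ref)][len(hyp)]

def cpwer(refs, hyps):
    """refs, hyps: lists of per-speaker word lists (same length assumed)."""
    n_ref_words = sum(len(r) for r in refs)
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / n_ref_words
```

Because the minimum runs over speaker permutations, a correct transcript assigned to the "wrong" output channel incurs no penalty, which is what makes cpWER suitable for serialized multi-talker output.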
| Task | Typical Change Token | Key Metric |
|---|---|---|
| Multi-talker ASR | ⟨cc⟩, ⟨sc⟩ | WER, UD-CER |
| Joint ASR/ST | ⟨asr⟩, ⟨st⟩ | WER, BLEU, LAAL |
| Speaker Attribution/Role | ⟨spk0⟩, ⟨spk1⟩ | mtWER, AER |
6. Limitations, Ablations, and Future Directions
Observed limitations include the need for accurate, timestamped supervised data to construct the interleaved label sequences and for explicit utterance or speaker boundary annotation (Xu et al., 12 Jun 2025). Semantic confusion during frequent interleaving can degrade recognition accuracy, motivating auxiliary masking and attention methods (Fan et al., 2024). For higher numbers of concurrent speakers, scalability is constrained by fixed channel token assignment, and change-point modeling becomes more complex (Kanda et al., 2022, Liang et al., 2023).
Proposed directions include:
- End-to-end segmentation and change-detection within the ASR/ST model, moving beyond forced alignment (Xu et al., 12 Jun 2025, Liang et al., 2023).
- Hierarchical and multimodal t-SOT (e.g., combining audio, video, gaze for speaker-role tagging) (Xu et al., 12 Jun 2025).
- Continuous pre-training and large-scale adaptation with text-only corpora or robust two-channel simulated mixtures (Wu et al., 2023, Kanda et al., 2022).
- Prompt-based serialized output as semantic control for LLM-based ASR and multi-modal tasks (Shi et al., 1 Sep 2025).
7. Impact and Summary of Empirical Results
t-SOT models have established new standards in streaming conversational ASR, with empirical results consistently demonstrating lower WER and latency than multi-branch and offline separation-based baselines: e.g., LibriSpeechMix WER of 4.4% (2-speaker, TT-36, 2.56 s), AliMeeting UD-CER reduction of 19.9% with BA-SOT, and cpWER of 3.41% on LibriSpeechMix with SA-SOT after extensive training (Kanda et al., 2022, Liang et al., 2023, Fan et al., 2024). t-SOT unifies single- and multi-talker recognition, allows text-only domain adaptation, and supports joint ASR/ST/role-aware generation with no increase in inference cost.
The serialized output training paradigm, and its token-level variant t-SOT, have demonstrated both theoretical elegance and practical efficacy in the deployment of unified, flexible, and scalable sequence models for complex streaming speech tasks (Kanda et al., 2022, Liang et al., 2023, Wu et al., 2023, Fan et al., 2024, Xu et al., 12 Jun 2025, Shi et al., 1 Sep 2025).