Serialized Output Training (SOT)
- Serialized Output Training (SOT) converts overlapping multi-speaker speech into a single, serialized token stream, simplifying multi-talker ASR.
- It employs strategies such as first-in-first-out (FIFO) ordering and CTC-based dominance scoring to resolve speaker permutation ambiguity and manage label ordering efficiently.
- SOT and its token-level extension (t-SOT) support both streaming and offline applications, achieving state-of-the-art performance in overlapping speech transcription and joint ASR-translation tasks.
Serialized Output Training (SOT) is a framework and training paradigm for end-to-end multi-talker speech recognition that represents the outputs of multiple overlapping speakers as a single serialized token stream rather than using parallel output branches. SOT and its token-level extension (t-SOT) have systematically advanced the state of the art in overlapped speech transcription, streaming speaker-attributed ASR, and joint ASR-plus-translation tasks. SOT’s design centers on reducing architectural complexity, enabling flexible speaker modeling, addressing permutation ambiguity in overlapping speech, and providing robust performance across offline/online and constrained/unconstrained speaker scenarios.
1. Fundamental Principles and Mechanism
SOT reformulates the multi-talker ASR problem in an attention-based encoder–decoder (AED) or transducer framework. Instead of producing separate outputs for each speaker, SOT concatenates the transcriptions of all speakers into a single sequence, delineated by a special speaker-change token (e.g., ⟨sc⟩). The encoder processes the overlapped audio and the decoder sequentially generates the transcriptions, producing a serialized representation of the form: tokens of speaker 1, ⟨sc⟩, tokens of speaker 2, ⟨sc⟩, …, tokens of speaker S, ⟨eos⟩.
SOT models can count speakers simply by tallying ⟨sc⟩ tokens in the output. This paradigm is “open set” in nature: unlike Permutation Invariant Training (PIT) systems with fixed output branches, SOT is not constrained by a maximum number of speakers and can accommodate arbitrary speaker counts (Kanda et al., 2020).
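This open-set behavior can be made concrete with a toy example. The sketch below (plain Python; the token strings and the ⟨sc⟩ spelling are illustrative assumptions, not tied to any cited system) splits a serialized hypothesis back into per-speaker transcripts and derives the speaker count as the number of ⟨sc⟩ tokens plus one.

```python
# Minimal sketch: interpreting a serialized SOT hypothesis.
SC = "<sc>"  # speaker-change token (spelling is an assumption for illustration)

def split_sot_hypothesis(tokens):
    """Split a serialized SOT token stream into one transcript per speaker."""
    speakers, current = [], []
    for tok in tokens:
        if tok == SC:
            speakers.append(current)   # close the current speaker's transcript
            current = []
        else:
            current.append(tok)
    speakers.append(current)           # last speaker has no trailing <sc>
    return speakers

hyp = ["hello", "there", SC, "good", "morning", "everyone"]
per_speaker = split_sot_hypothesis(hyp)   # [['hello', 'there'], ['good', 'morning', 'everyone']]
num_speakers = hyp.count(SC) + 1          # 2 speakers, no fixed output branches required
```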
The token-level SOT (“t-SOT”) variant further serializes at the token (word or subword) level based on emission times, introducing “virtual channel change” tokens (e.g., ⟨cc⟩). This enables strict chronological ordering of all tokens, supporting both streaming inference and direct single-stream decoding (Kanda et al., 2022).
2. Serialization Strategies and Label Ordering
The label serialization order in SOT presents a critical training challenge. Without a fixed ordering, training must consider every possible assignment between speakers and positions in the serialized output, requiring O(S!) loss computations for S speakers. Early SOT approaches addressed this either by:
- Minimum-loss ordering: Evaluating all possible output concatenations and choosing the one with minimum loss, which is computationally infeasible for more than two speakers.
- First-In-First-Out (FIFO): Sorting reference utterances by start time to establish a single, linear serialization order. This reduces the computational complexity to O(S) and was shown to work robustly in practical ASR settings (Kanda et al., 2020); a minimal sketch of this ordering appears directly below.
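The sketch below assumes plain Python and an illustrative utterance layout (start time in seconds plus a token list per utterance); the essential steps are the start-time sort and the ⟨sc⟩/⟨eos⟩ insertion.

```python
# Minimal sketch: FIFO serialization of reference utterances for SOT training.
SC, EOS = "<sc>", "<eos>"  # special-token spellings are assumptions for illustration

def fifo_serialize(utterances):
    """utterances: list of dicts with 'start' (seconds) and 'tokens' (list of str)."""
    ordered = sorted(utterances, key=lambda u: u["start"])  # first-in-first-out by start time
    target = []
    for i, utt in enumerate(ordered):
        if i > 0:
            target.append(SC)          # speaker change between consecutive utterances
        target.extend(utt["tokens"])
    return target + [EOS]

mix = [
    {"start": 1.3, "tokens": ["how", "are", "you"]},
    {"start": 0.2, "tokens": ["good", "morning"]},
]
print(fifo_serialize(mix))
# ['good', 'morning', '<sc>', 'how', 'are', 'you', '<eos>']
```

Only this single ordering is scored during training, which is what replaces the O(S!) permutation search of minimum-loss ordering.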
Recent work proposes a learned, model-based ordering using an auxiliary serialization module (DOM-SOT), which integrates a CTC-based dominance criterion to automatically order speakers according to learned features such as loudness and gender, improving robustness in scenarios where start time bias is ambiguous or fails (Shi et al., 4 Jul 2024).
In token-level SOT, tokens are ordered by their emission times; a channel-change token is inserted whenever the next token comes from a new speaker. For cases with more than two concurrent speakers, distinct ⟨ccₖ⟩ tokens are assigned for “virtual output channels,” managed via an assignment dictionary during decoding (Kanda et al., 2022).
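The following sketch (plain Python; the tuple layout, the ⟨cc⟩ spelling, and the two-channel deserializer are illustrative assumptions) shows token-level serialization by emission time and, for the common two-speaker case, the inverse mapping back to virtual channels.

```python
# Minimal sketch: t-SOT-style token-level serialization and two-channel deserialization.
CC = "<cc>"  # channel-change token (spelling is an assumption for illustration)

def t_sot_serialize(tokens):
    """tokens: list of (emission_time, channel_id, token), one entry per emitted token."""
    stream, prev_channel = [], None
    for _, channel, tok in sorted(tokens, key=lambda x: x[0]):   # strict chronological order
        if prev_channel is not None and channel != prev_channel:
            stream.append(CC)                                    # mark switch of virtual channel
        stream.append(tok)
        prev_channel = channel
    return stream

def t_sot_deserialize(stream):
    """Invert the serialization for the two-channel case: toggle the active channel at <cc>."""
    channels, active = {0: [], 1: []}, 0
    for tok in stream:
        if tok == CC:
            active = 1 - active
        else:
            channels[active].append(tok)
    return channels

tokens = [
    (0.4, 0, "good"), (0.7, 0, "morning"),
    (0.9, 1, "hi"),   (1.1, 0, "everyone"), (1.2, 1, "there"),
]
stream = t_sot_serialize(tokens)
print(stream)                 # ['good', 'morning', '<cc>', 'hi', '<cc>', 'everyone', '<cc>', 'there']
print(t_sot_deserialize(stream))
```

With more than two concurrent speakers, distinct ⟨ccₖ⟩ tokens and an explicit channel-assignment dictionary would replace the simple toggle used here.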
3. Model Architectures
SOT is typically instantiated in several backbone architectures:
- Attention-based Encoder-Decoder (AED): The encoder (often a Conformer stack) encodes the mixed input; a single decoder, using attention, generates the serialized output (a minimal skeleton is sketched at the end of this section).
- Streaming Transducer (e.g., Transformer Transducer): For low-latency, online processing, the encoder combines convolutional and transformer layers; the predictor network handles serialized token emission with a joiner combining encoder and predictor outputs (Kanda et al., 2022).
- Multichannel Extensions: Multichannel SOT variants employ advanced fusion techniques, including multi-frame and cross-channel attention mechanisms for spatial feature integration, or neural beamforming front-ends (Shi et al., 2022).
- Diarization-Conditioned and Target-Speaker Models: Hybrid frameworks inject speaker embeddings (e.g., from diarization masks or d-vectors), concatenate them for joint decoding, and embed speaker/time information into token representations for precise speaker attribution and context modeling (Kocour et al., 4 Oct 2025).
Auxiliary modules such as CTC-based separators, speaker query RNNs, and inventory attention for speaker identification have been incorporated in advanced frameworks (Kanda et al., 2020, Shi et al., 1 Sep 2024, Kanda et al., 2022).
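As referenced above, a minimal skeleton of the AED instantiation is sketched below, assuming PyTorch; the vanilla Transformer encoder (standing in for a Conformer), the layer sizes, and the token ids are illustrative assumptions rather than the configuration of any cited system. The structural point is that a single decoder over a single vocabulary suffices, because ⟨sc⟩ is just another output token.

```python
# Minimal sketch (assumed PyTorch, illustrative hyperparameters) of a single-decoder AED
# that emits a serialized multi-speaker token stream; no per-speaker output branch exists.
import torch
import torch.nn as nn

class SOTEncoderDecoder(nn.Module):
    def __init__(self, n_feats=80, d_model=256, n_heads=4, vocab_size=1000):
        super().__init__()
        self.frontend = nn.Linear(n_feats, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=6)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=3)
        self.embed = nn.Embedding(vocab_size, d_model)   # vocabulary includes <sc> as an ordinary id
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, prev_tokens):
        """feats: (B, T, n_feats) features of the overlapped mixture;
        prev_tokens: (B, U) right-shifted serialized target for teacher forcing."""
        memory = self.encoder(self.frontend(feats))
        tgt = self.embed(prev_tokens)
        causal = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1)
        return self.out(self.decoder(tgt, memory, tgt_mask=causal))  # (B, U, vocab) logits

# Toy forward pass: one 3-second mixture (300 frames) and a short serialized prefix,
# where the id 999 is assumed to stand for <sc>.
model = SOTEncoderDecoder()
logits = model(torch.randn(1, 300, 80), torch.tensor([[5, 17, 999, 42]]))
```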
4. Loss Functions, Training Strategies, and Speaker Disentanglement
SOT training is typically driven by a cross-entropy (CE) loss over the serialized target sequence; a minimal sketch of this basic objective appears after the following list. Recent developments incorporate hybrid and multi-task strategies, notably:
- CTC-Attention Hybrid Loss: Overlapped encoding separation (EncSep) modules leverage both CTC and attention losses, extracting single-speaker embeddings and guiding the attention decoder with explicit speaker-wise information, thereby mitigating representation entanglement in the encoder (Shi et al., 1 Sep 2024).
- Speaker-Aware CTC (SACTC): SACTC introduces a Bayes risk CTC objective with speaker-aware path penalties, guiding the model to temporally disentangle speakers in the encoder and to emit each speaker's tokens within its designated time region (Kang et al., 19 Sep 2024).
- Speaker-Distinguishable CTC (SD-CTC): SD-CTC outputs per-frame token–speaker label pairs, jointly optimizing for token recognition and frame-level speaker assignment, and reduces SOT model error rates by roughly 26% without auxiliary timestamp data (Sakuma et al., 9 Jun 2025).
- Speaker-Aware Training (SA-SOT): This uses masked SOT labels for auxiliary loss, along with a self-attention adjustment via token-wise speaker similarity matrices to reduce confusion in the decoder context (Fan et al., 4 Mar 2024).
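As noted before the list, the basic SOT objective is an ordinary cross-entropy over the serialized target. The sketch below (assumed PyTorch; the token ids and the FIFO-ordered toy references are illustrative assumptions) builds such a target and scores it; the ⟨sc⟩ token receives no special treatment in the loss.

```python
# Minimal sketch: the baseline SOT cross-entropy objective over a serialized target.
import torch
import torch.nn.functional as F

SC, EOS, PAD = 1, 2, 0   # hypothetical special-token ids

def serialized_target(speaker_token_ids):
    """Concatenate per-speaker references with <sc> between speakers and <eos> at the end."""
    target = []
    for i, toks in enumerate(speaker_token_ids):
        if i > 0:
            target.append(SC)
        target.extend(toks)
    target.append(EOS)
    return torch.tensor(target)

def sot_ce_loss(logits, target):
    """logits: (U, vocab) decoder outputs aligned to the serialized target of length U."""
    return F.cross_entropy(logits, target, ignore_index=PAD)

# Toy usage: two speakers, already FIFO-ordered by start time.
tgt = serialized_target([[11, 12, 13], [21, 22]])     # -> [11, 12, 13, SC, 21, 22, EOS]
logits = torch.randn(len(tgt), 100, requires_grad=True)
loss = sot_ce_loss(logits, tgt)
loss.backward()
```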
Domain adaptation capabilities are expanded by a factorized neural transducer variant (“t-SOT FNT”) that separates out the internal language model, enabling text-only adaptation via a dedicated vocabulary predictor that handles ⟨cc⟩ tokens with special switching logic for multi-speaker context (Wu et al., 2023).
5. Applications and Empirical Advances
SOT and t-SOT have enabled practical, state-of-the-art performance for:
- Meeting and Conversational Transcription: SOT models pre-trained on large-scale simulated overlap mixtures and fine-tuned on real data have achieved a WER of 21.2% on AMI-SDM, outperforming prior systems that relied on oracle utterance boundaries (Kanda et al., 2021).
- Speaker Attribution and Counting: SOT-based systems, equipped with tokenized speaker-change or role-attribution markers and jointly optimized for identification, achieve high accuracy (>97% for two-talker counting) and outperform cascaded or PIT-based pipelines (Kanda et al., 2020, Xu et al., 12 Jun 2025).
- Streaming Multi-Talker ASR: Token-level SOT architectures reach WERs on par with single-talker ASR for non-overlapping speech (~4–4.5% WER), while maintaining robust WERs (~6.9–8.4%) in dense overlap conditions, with negligible increase in inference cost (Kanda et al., 2022).
- Speaker-Aware and Diarization-Conditioned Recognition: Diarization-conditioned SOT architectures (SA-DiCoW) that concatenate encoder embeddings per speaker, then decode a joint serialized stream, achieve lower cpWER on dense mixture benchmarks and outperform models that decode each speaker independently (Kocour et al., 4 Oct 2025).
- Joint ASR and Translation (“Joint SOT”): By inserting task-specific tokens and leveraging textual alignment or timestamp-based interleaving, SOT models simultaneously output streaming transcription and translation with improved quality, lower latency, and a single decoder (Papi et al., 2023, Papi et al., 2023).
Advanced approaches for segment-based SOT (segSOT) and multi-stage training (e.g., SOT pretraining + separator + SOP adaptation for LLM-based ASR) further improve readability, latency–accuracy tradeoffs, and robustness in real-world, highly overlapped or multi-speaker settings (Subramanian et al., 17 Jun 2025, Shi et al., 1 Sep 2025).
6. Limitations, Open Questions, and Future Directions
Despite the advances, open challenges remain for SOT/t-SOT frameworks:
- Speaker Misassignment and Representation Entanglement: Experiments and visualizations consistently reveal that vanilla SOT models may not fully disentangle overlapping speakers in encoder representations, leading to speaker assignment failures, particularly without auxiliary supervision (Kang et al., 19 Sep 2024, Sakuma et al., 9 Jun 2025). CTC-based and explicit speaker-aware loss functions have been shown to mitigate but not fully eliminate this issue.
- Label Ordering and Serialization Bias: Reliance on start time (FIFO) or learned dominance can introduce biases: FIFO breaks down when utterances start simultaneously (zero-offset overlap), while PIT-style minimum-loss ordering is computationally heavy and lacks a consistent ordering criterion. DOM-SOT and CTC-based dominance scoring offer improved robustness, but further research is required for open-domain, ambiguous mixtures (Shi et al., 4 Jul 2024).
- Scaling to More Speakers and Real-World Conditions: SOT models generally perform well in two- or three-talker scenarios. Performance degrades as the number of concurrent speakers increases or when environmental noise and recording artifacts are present. Recent work with multichannel systems, stronger encoder separation modules, and advanced fusion indicates further scaling potential (Shi et al., 2022, Shi et al., 1 Sep 2024).
- Integration with LLMs and Downstream Tasks: In LLM-based ASR, plain SOT sequences are insufficient for complex overlapping scenarios (three or more speakers). Structured, speaker-aware prompts (SOP) and staged training (e.g., SOT pretraining, separator/CTC extraction, LoRA adaptation) are necessary to maintain performance, suggesting further research into prompt design and acoustic–linguistic integration (Shi et al., 1 Sep 2025).
- Readability and Turn-Taking in Offline ASR: Segment-based SOT (segSOT) improves turn-taking and transcript quality in offline ASR by enforcing segment boundaries using pause-based parameters, balancing accuracy and human-like readability (Subramanian et al., 17 Jun 2025).
7. Significance and Outlook
Serialized Output Training has become foundational for modern multi-talker ASR. It unifies overlapping and non-overlapping speech recognition in a single model, supports joint tasks such as speaker identification or translation, facilitates streaming and offline decoding, and lends itself to integration with advanced separation modules, multi-head attention, and large-scale pretraining. The continued refinement of SOT—through improved label ordering, hybrid loss functions, explicit speaker disentanglement, and structured prompting—points toward further robustness and versatility in real-world, scenario-agnostic speech recognition systems. Ongoing research is expanding SOT’s applicability to more speakers, diverse acoustic environments, multi-modal tasks, and LLM-based transcription engines, consolidating its role in the architecture of future conversational AI and multi-party transcription solutions.