Serialized Output Training (SOT)
- Serialized Output Training (SOT) is an end-to-end framework that serializes multi-speaker transcriptions using a unified attention-based encoder-decoder model.
- It introduces a special ⟨sc⟩ token to mark speaker changes, enabling native speaker counting and eliminating the need for multiple decoder branches.
- Empirical results show that SOT reduces computational complexity and improves word error rates compared to traditional PIT methods in multi-talker scenarios.
Serialized Output Training (SOT) is an end-to-end framework for multi-speaker overlapped speech recognition that restructures the conventional multi-output paradigm by employing a single attention-based encoder–decoder (AED) model to generate all speakers’ transcriptions in a serialized fashion. Instead of separate decoder branches for each speaker—as in permutation invariant training (PIT)—the SOT framework concatenates all transcriptions into a single target sequence, demarcating speaker changes with a special separator token (⟨sc⟩), and produces this sequence token by token. This approach eliminates the combinatorial constraints of PIT, enables modeling of cross-speaker dependencies, and supports arbitrary numbers of speakers within a unified system.
1. SOT Fundamentals and Model Architecture
Serialized Output Training is built upon the attention-based encoder–decoder architecture. Given an input acoustic sequence $X = (x_1, \dots, x_T)$, the encoder produces hidden representations $H = \mathrm{Encoder}(X)$. For output step $n$:
- The decoder’s context vector and attention weights are computed as $c_n, \alpha_n = \mathrm{Attention}(q_{n-1}, \alpha_{n-1}, H)$.
- The recurrent state update is $q_n = \mathrm{DecoderRNN}(q_{n-1}, c_n, y_{n-1})$.
- The output token is generated as $y_n \sim \mathrm{softmax}(\mathrm{DecoderOut}(q_n, c_n))$.
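The sketch below mirrors these steps in PyTorch for a single decoding step. It is illustrative only: the choice of additive attention, the LSTM cell, the tensor shapes, and the omission of location-aware conditioning on $\alpha_{n-1}$ are assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStep(nn.Module):
    """One autoregressive step of an attention-based decoder (illustrative sketch)."""

    def __init__(self, hid_dim: int, emb_dim: int, vocab_size: int):
        super().__init__()
        # Content-based additive attention; any attention variant could be used here.
        self.attn_q = nn.Linear(hid_dim, hid_dim, bias=False)
        self.attn_h = nn.Linear(hid_dim, hid_dim, bias=False)
        self.attn_v = nn.Linear(hid_dim, 1, bias=False)
        self.rnn = nn.LSTMCell(emb_dim + hid_dim, hid_dim)   # decoder recurrency
        self.out = nn.Linear(2 * hid_dim, vocab_size)         # decoder output layer
        self.embed = nn.Embedding(vocab_size, emb_dim)

    def forward(self, y_prev, state, H):
        """y_prev: (B,) previous token ids; state: (q, cell); H: (B, T, hid_dim) encoder output."""
        q_prev, cell_prev = state
        # c_n, alpha_n = Attention(q_{n-1}, H)   (location-aware term omitted)
        scores = self.attn_v(torch.tanh(
            self.attn_q(q_prev).unsqueeze(1) + self.attn_h(H))).squeeze(-1)  # (B, T)
        alpha = F.softmax(scores, dim=-1)
        c = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)                      # (B, hid_dim)
        # q_n = DecoderRNN(q_{n-1}, c_n, y_{n-1})
        q, cell = self.rnn(torch.cat([self.embed(y_prev), c], dim=-1), (q_prev, cell_prev))
        # P(y_n | y_{<n}, X) = softmax(DecoderOut(q_n, c_n))
        logits = self.out(torch.cat([q, c], dim=-1))
        return logits, alpha, (q, cell)
```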
For SOT, the reference sequence $R$ presents all speakers’ transcriptions serially, separated by ⟨sc⟩ and terminated by ⟨eos⟩:

$$R = \big[\, r^1_1, \dots, r^1_{N_1}, \langle\text{sc}\rangle,\; r^2_1, \dots, r^2_{N_2}, \langle\text{sc}\rangle,\; \dots,\; r^S_1, \dots, r^S_{N_S}, \langle\text{eos}\rangle \,\big],$$

where $S$ is the number of speakers in the sample and $N_s$ is the token length for speaker $s$.
This serialization reduces the output prediction problem to a single sequence generation task without explicit branching for each speaker.
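As a concrete illustration, a target sequence of this form can be assembled as in the following sketch; the token ids SC_ID and EOS_ID, the vocabulary layout, and the helper name serialize_references are assumptions for the example, not part of the paper.

```python
SC_ID, EOS_ID = 1, 2   # assumed ids for <sc> and <eos>; the vocabulary layout is illustrative

def serialize_references(refs: list[list[int]]) -> list[int]:
    """Concatenate per-speaker token sequences into one SOT target:
    r^1 ... <sc> r^2 ... <sc> ... r^S ... <eos>."""
    target: list[int] = []
    for s, tokens in enumerate(refs):
        target.extend(tokens)
        # Insert <sc> between speakers, but not after the last one.
        if s < len(refs) - 1:
            target.append(SC_ID)
    target.append(EOS_ID)
    return target

# Example: two speakers, already in the desired serialization order.
print(serialize_references([[10, 11, 12], [20, 21]]))
# -> [10, 11, 12, 1, 20, 21, 2]
```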
2. SOT versus Permutation Invariant Training (PIT) and Complexity
Traditional PIT-based frameworks require $S$ decoder branches when handling $S$-speaker mixtures, and the learning objective involves a permutation search over all $S!$ possible assignments of references to network outputs, resulting in $O(S!)$ computational complexity. This approach also hard-limits the maximum number of speakers per model instance and does not naturally capture sequential dependencies between speakers’ utterances.
SOT adopts an alternative serialization loss. Minimizing the loss over all $S!$ possible reference permutations, $\mathcal{L} = \min_{\pi} \mathrm{CE}(Y, R_{\pi})$, is computationally infeasible in practice for large $S$. Instead, the SOT paper introduces a “first-in, first-out” (FIFO) trick: speakers are sorted by their utterance start times, and the reference is concatenated in that order. The training loss then becomes $\mathcal{L} = \mathrm{CE}(Y, R_{\mathrm{FIFO}})$, a single cross-entropy against the start-time-ordered reference. This reduces the permutation cost of training to $O(1)$.
This deterministic serialization resolves the permutation problem and significantly lowers the computational burden during both training and decoding.
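Under the same assumptions as the earlier sketch (and reusing its serialize_references helper), the FIFO ordering and the resulting single cross-entropy loss might look as follows; the dictionary keys and the function names fifo_target and sot_loss are illustrative, not the paper’s code.

```python
import torch
import torch.nn.functional as F

def fifo_target(utterances: list[dict]) -> list[int]:
    """Build the SOT reference by sorting speakers by utterance start time
    (the FIFO trick), then serializing with <sc>/<eos> as sketched above."""
    ordered = sorted(utterances, key=lambda u: u["start_time"])
    return serialize_references([u["tokens"] for u in ordered])

def sot_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """logits: (L, V) decoder outputs for one sample; target: (L,) serialized reference.
    With FIFO ordering there is exactly one reference permutation, so the loss is a
    single cross-entropy rather than a minimum over S! permutations as in PIT."""
    return F.cross_entropy(logits, target)
```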
3. Speaker Change Token, Counting, and Output Semantics
SOT introduces a special token, ⟨sc⟩, to mark speaker changes within the serialized output. During inference, the model generates each speaker’s transcription in turn, with ⟨sc⟩ as the delimiter between speakers and ⟨eos⟩ terminating the stream.
Speaker counting is handled natively by the number of ⟨sc⟩ tokens in the output stream. Specifically, the predicted number of speakers equals the count of ⟨sc⟩ tokens plus one. This mechanism obviates the need for dedicated speaker-counting models or heuristics.
In practice, this serialization enables downstream segmentation for diarized transcription and provides a transparent structure for postprocessing and evaluation.
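A small helper along these lines (reusing the assumed SC_ID and EOS_ID ids from the earlier sketch) shows how a decoded stream can be split into per-speaker transcriptions, with the speaker count falling out of the ⟨sc⟩ count; split_speakers is an illustrative name, not from the paper.

```python
def split_speakers(hyp: list[int]) -> list[list[int]]:
    """Split a decoded token stream into per-speaker transcriptions at <sc>,
    dropping the trailing <eos>. Number of speakers = count of <sc> + 1."""
    if hyp and hyp[-1] == EOS_ID:
        hyp = hyp[:-1]
    speakers, current = [], []
    for tok in hyp:
        if tok == SC_ID:
            speakers.append(current)
            current = []
        else:
            current.append(tok)
    speakers.append(current)
    return speakers

hyp = [10, 11, 12, SC_ID, 20, 21, EOS_ID]
print(len(split_speakers(hyp)))   # -> 2 speakers
```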
4. Empirical Results and Evaluation Metrics
Experiments on multi-speaker mixtures simulated from the LibriSpeech corpus demonstrate several key performance trends:
| Model | Condition | WER (%) | Speaker Counting Accuracy |
|---|---|---|---|
| SOT (512-dim) | 2-speaker mixed speech | 16.5–17.0 | ~97.0% (2 spk) |
| PIT (matched size) | 2-speaker mixed speech | much higher | lower |
| SOT (1024-dim) | 1-, 2-, and 3-speaker mixtures | low to mid range | ~99.8% (1 spk), ≥97.0% (2–3 spk) |
(Values are summarized from results reported in the paper; qualitative entries indicate relative performance rather than exact figures.)
SOT models achieve lower WERs than both a single-speaker ASR model applied to overlapping mixtures and the PIT baseline, while using fewer parameters and training more efficiently. The “Separation after Attention” extension (inserting an additional LSTM after the attention module) further improves WER by helping the decoder disambiguate speaker overlaps.
SOT is also “speaker agnostic” in that a single model can handle variable speaker counts, including both single-talker and multi-talker inputs, without switch-based architectures or data routing.
5. Implementation and Practical Considerations
In practice, SOT is implemented by adapting a standard encoder–decoder ASR model. At training time, batches are constructed by mixing utterances from multiple speakers and concatenating their label sequences in start-time order. The serialized target sequence includes the auxiliary ⟨sc⟩ and ⟨eos⟩ tokens described above.
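A rough sketch of such sample construction is given below; it reuses the serialize_references helper from earlier, and the delay range, gain handling, and data layout are assumptions for illustration rather than the paper’s exact mixing recipe.

```python
import numpy as np

def mix_sample(waveforms, token_seqs, sample_rate=16000, max_delay_s=2.0,
               rng=np.random.default_rng()):
    """Overlay single-speaker waveforms with random start offsets and return
    (mixed_audio, serialized_labels), with labels concatenated in start-time (FIFO) order."""
    offsets = [int(rng.uniform(0.0, max_delay_s) * sample_rate) for _ in waveforms]
    length = max(off + len(w) for off, w in zip(offsets, waveforms))
    mixed = np.zeros(length, dtype=np.float32)
    for off, w in zip(offsets, waveforms):
        mixed[off:off + len(w)] += w
    # Serialize labels in the order the speakers start talking.
    order = sorted(range(len(offsets)), key=lambda i: offsets[i])
    target = serialize_references([token_seqs[i] for i in order])
    return mixed, target
```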
The decoding is unbranched: the model autoregressively generates tokens, emitting an ⟨sc⟩ token wherever it predicts a speaker change. No external postprocessing is required to obtain speaker-separated transcriptions beyond splitting the output stream at the separator tokens.
Memory and compute requirements for SOT models are moderate and comparable to their single-talker AED counterparts, especially when using the FIFO ordering, which avoids the factorial-time permutation search during training.
6. Limitations and Extensions
While SOT overcomes the combinatorial issues and branch limitations of PIT, a remaining challenge is determining the ideal serialization order in highly overlapped or ambiguous temporal configurations. The FIFO approach, while efficient, may not match actual discourse order in some conversational overlaps where turn-taking is less distinct.
Implicit in the SOT design is that the attention mechanism and decoder learn to leverage cross-speaker context and to disambiguate overlapping regions based on acoustics and linguistic cues. Extensions, such as “Separation after Attention,” post-attentive LSTM stages, or integration with diarization/identification modules, may further improve real-world performance.
SOT’s simplicity, effectiveness on LibriSpeech mixtures, and native support for speaker counting and variable numbers of talkers mark it as a foundational technique for overlapped speech ASR. Further refinements and adaptations (e.g., for streaming, multilingual, or speaker-attributed scenarios) have continued to be developed in subsequent research.
7. Summary Table: SOT versus PIT Approaches
| Feature | SOT | PIT |
|---|---|---|
| Decoder Structure | Single sequence (1 branch) | Parallel ($S$ branches) |
| Permutation Complexity | $O(1)$ with FIFO ordering | $O(S!)$ |
| Speaker Dependency Modeling | Sequential/contextual (via serialization) | Branch-independent |
| Max Speakers Per Model | Unlimited | Hard limit by number of branches |
| Speaker Counting | Native (via ⟨sc⟩ count) | Not directly supported |
| Empirical WER (LibriSpeech mixtures) | Lower (e.g., 16.5–17.0% on 2-spk) | Higher |
| Multi-Speaker Flexibility | Yes | No (fixed branching) |
The SOT framework, by leveraging streamlined decoder semantics, efficient permutation handling, and implicit speaker transition modeling, stands as the basis for a range of modern overlapped speech recognition architectures (Kanda et al., 2020).