
Serialized Output Training (SOT)

Updated 7 October 2025
  • Serialized Output Training (SOT) is an end-to-end framework that serializes multi-speaker transcriptions using a unified attention-based encoder-decoder model.
  • It introduces a special ⟨sc⟩ token to mark speaker changes, enabling native speaker counting and eliminating the need for multiple decoder branches.
  • Empirical results show that SOT reduces computational complexity and improves word error rates compared to traditional PIT methods in multi-talker scenarios.

Serialized Output Training (SOT) is an end-to-end framework for multi-speaker overlapped speech recognition that restructures the conventional multi-output paradigm by employing a single attention-based encoder–decoder (AED) model to generate all speakers’ transcriptions in a serialized fashion. Instead of separate decoder branches for each speaker—as in permutation invariant training (PIT)—the SOT framework concatenates all transcriptions into a single target sequence, demarcating speaker changes with a special separator token (⟨sc⟩), and produces this sequence token by token. This approach eliminates the combinatorial constraints of PIT, enables modeling of cross-speaker dependencies, and supports arbitrary numbers of speakers within a unified system.

1. SOT Fundamentals and Model Architecture

Serialized Output Training is built upon the attention-based encoder–decoder architecture. Given an input acoustic sequence $X = \{x_1, \dots, x_T\}$, the encoder produces hidden representations $H^{(\mathrm{enc})} = \mathrm{Encoder}(X)$. For output step $n$ (a minimal code sketch follows the list):

  • The decoder’s context vector and attention weights are computed as $(c_n, \alpha_n) = \mathrm{Attention}(q_n, \alpha_{n-1}, H^{(\mathrm{enc})})$.
  • The recurrent state update is $q_n = \mathrm{DecoderRNN}(y_{n-1}, c_{n-1}, q_{n-1})$.
  • The output token is generated as $y_n = \mathrm{DecoderOut}(c_n, q_n)$.
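
A minimal sketch of one such decoder step in PyTorch is shown below. The module names (`AdditiveAttention`, `DecoderStep`) and the choice of additive attention with an LSTM cell are illustrative assumptions for exposition, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Content-based additive attention; alpha_prev is accepted for interface
    parity with the equations above but is not used in this simplified sketch."""
    def __init__(self, enc_dim, dec_dim, att_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, att_dim)
        self.w_dec = nn.Linear(dec_dim, att_dim)
        self.v = nn.Linear(att_dim, 1)

    def forward(self, q_n, alpha_prev, h_enc):
        # h_enc: (T, enc_dim), q_n: (dec_dim,)
        scores = self.v(torch.tanh(self.w_enc(h_enc) + self.w_dec(q_n))).squeeze(-1)
        alpha_n = F.softmax(scores, dim=-1)   # (T,) attention weights
        c_n = alpha_n @ h_enc                 # (enc_dim,) context vector
        return c_n, alpha_n

class DecoderStep(nn.Module):
    def __init__(self, vocab, emb_dim, enc_dim, dec_dim, att_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.attention = AdditiveAttention(enc_dim, dec_dim, att_dim)
        self.rnn = nn.LSTMCell(emb_dim + enc_dim, dec_dim)
        self.out = nn.Linear(enc_dim + dec_dim, vocab)

    def forward(self, y_prev, c_prev, state, alpha_prev, h_enc):
        # q_n = DecoderRNN(y_{n-1}, c_{n-1}, q_{n-1})
        rnn_in = torch.cat([self.embed(y_prev), c_prev], dim=-1)
        q_n, cell = self.rnn(rnn_in.unsqueeze(0), state)
        q_n = q_n.squeeze(0)
        # (c_n, alpha_n) = Attention(q_n, alpha_{n-1}, H^enc)
        c_n, alpha_n = self.attention(q_n, alpha_prev, h_enc)
        # y_n = DecoderOut(c_n, q_n): distribution over the output vocabulary
        logits = self.out(torch.cat([c_n, q_n], dim=-1))
        return logits, c_n, (q_n.unsqueeze(0), cell), alpha_n
```

In a full model this step is unrolled autoregressively: over the serialized target during teacher-forced training, and over previously generated tokens at inference time.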

For SOT, the reference sequence $R$ presents all speakers’ transcriptions serially, separated by $\langle \mathrm{sc} \rangle$ and terminated by $\langle \mathrm{eos} \rangle$:

$$R = \{ r_1^{(1)}, \dots, r_{N_1}^{(1)}, \langle \mathrm{sc} \rangle, r_1^{(2)}, \dots, r_{N_2}^{(2)}, \langle \mathrm{sc} \rangle, \dots, r_1^{(S)}, \dots, r_{N_S}^{(S)}, \langle \mathrm{eos} \rangle \}$$

where $S$ is the number of speakers in the sample and $N_i$ is the token length for speaker $i$.

This serialization reduces the output prediction problem to a single sequence generation task without explicit branching for each speaker.
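
As a concrete illustration, the sketch below builds such a serialized target from per-speaker token lists. The token strings `"<sc>"` and `"<eos>"` and the helper name `serialize_reference` are illustrative choices rather than identifiers from the paper.

```python
from typing import List

SC, EOS = "<sc>", "<eos>"

def serialize_reference(speaker_tokens: List[List[str]]) -> List[str]:
    """Concatenate per-speaker token sequences into a single SOT target,
    inserting <sc> between speakers and <eos> at the end."""
    target: List[str] = []
    for i, tokens in enumerate(speaker_tokens):
        if i > 0:
            target.append(SC)
        target.extend(tokens)
    target.append(EOS)
    return target

# Two overlapping speakers, already ordered by utterance start time (FIFO):
print(serialize_reference([["how", "are", "you"], ["i", "am", "fine"]]))
# ['how', 'are', 'you', '<sc>', 'i', 'am', 'fine', '<eos>']
```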

2. SOT versus Permutation Invariant Training (PIT) and Complexity

Traditional PIT-based frameworks require $S$ decoder branches when handling $S$-speaker mixtures, and the learning objective involves a permutation search over all $S!$ possible assignments of references to network outputs, an assignment problem solvable in $O(S^3)$ time. This approach also hard-limits the maximum number of speakers per model instance and does not naturally capture sequential dependencies between speakers’ utterances.
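
For contrast, a brute-force version of the PIT objective can be sketched as follows; the exhaustive search over `itertools.permutations` is shown only to make the combinatorics concrete (practical implementations use an assignment solver), and the tensor shapes are assumptions.

```python
from itertools import permutations

import torch
import torch.nn.functional as F

def pit_loss(branch_logits: torch.Tensor, references: torch.Tensor) -> torch.Tensor:
    """Naive PIT loss: try every assignment of the S decoder branches to the
    S reference transcriptions and keep the cheapest one.

    branch_logits: (S, N, V) per-branch token logits
    references:    (S, N)    per-speaker reference token ids
    """
    S = branch_logits.shape[0]
    best = None
    for perm in permutations(range(S)):        # S! candidate assignments
        loss = sum(
            F.cross_entropy(branch_logits[b], references[r])
            for b, r in enumerate(perm)
        )
        best = loss if best is None else torch.minimum(best, loss)
    return best
```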

SOT adopts an alternative serialization loss. Minimizing over all possible reference permutations, $L^{\mathrm{SOT-1}} = \min_{\phi \in \Phi(1,\dots,S)} \sum_n \mathrm{CE}(y_n, r_n^{\phi})$, is computationally infeasible in practice. Instead, the SOT paper introduces a “first-in, first-out” (FIFO) trick: speakers are sorted by utterance start time, yielding the order $\Psi(1,\dots,S)$, and the reference is concatenated in that order. The training loss then becomes $L^{\mathrm{SOT-2}} = \sum_n \mathrm{CE}(y_n, r_n^{\Psi(1,\dots,S)})$. This reduces training complexity to $O(S)$.

This deterministic serialization resolves the permutation problem and significantly lowers the computational burden during both training and decoding.
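
A minimal sketch of the FIFO-ordered objective, assuming per-speaker references arrive with their utterance start times and that the decoder was run for exactly as many steps as the serialized target; all names and shapes here are illustrative.

```python
from typing import List, Tuple

import torch
import torch.nn.functional as F

def sot_fifo_loss(
    logits: torch.Tensor,                             # (N, V) decoder outputs y_1..y_N
    refs_with_start: List[Tuple[float, List[int]]],   # (start_time, token ids) per speaker
    sc_id: int,
    eos_id: int,
) -> torch.Tensor:
    """Cross-entropy against the single serialized reference, ordered FIFO by
    utterance start time (the SOT-2 objective); no permutation search needed."""
    ordered = sorted(refs_with_start, key=lambda x: x[0])   # FIFO by start time
    target: List[int] = []
    for i, (_, tokens) in enumerate(ordered):
        if i > 0:
            target.append(sc_id)
        target.extend(tokens)
    target.append(eos_id)
    # Assumes teacher forcing produced exactly len(target) decoder steps.
    return F.cross_entropy(logits, torch.tensor(target))
```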

3. Speaker Change Token, Counting, and Output Semantics

SOT introduces a special token $\langle \mathrm{sc} \rangle$ to mark speaker changes within the serialized output. During inference, the model generates each speaker’s transcription in turn within a single output stream, with $\langle \mathrm{sc} \rangle$ as the delimiter and $\langle \mathrm{eos} \rangle$ terminating the stream.

Speaker counting is handled natively by the number of $\langle \mathrm{sc} \rangle$ tokens in the output stream. Specifically, the predicted number of speakers equals the count of $\langle \mathrm{sc} \rangle$ tokens plus one. This mechanism obviates the need for dedicated speaker counting models or heuristics.
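
In code, the count falls directly out of the decoded token stream (token strings are illustrative):

```python
hypothesis = ["hello", "there", "<sc>", "good", "morning", "<sc>", "thanks", "<eos>"]
num_speakers = hypothesis.count("<sc>") + 1   # -> 3 speakers
```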

In practice, this serialization enables downstream segmentation for diarized transcription and provides a transparent structure for postprocessing and evaluation.

4. Empirical Results and Evaluation Metrics

Extensive experiments on the LibriSpeech corpus demonstrate several key performance trends:

| Model | Condition | WER (%) | Speaker Counting Accuracy |
|---|---|---|---|
| SOT (512-dim) | 2-spk mixed speech | 16.5–17.0 | ~97.0% (2 spk) |
| PIT (matched size) | 2-spk mixed speech | much higher | lower |
| SOT (1024-dim) | 1-spk/2-spk/3-spk mix | low to mid | ~99.8% (1 spk), 97.0%+ |

(Numeric entries are taken from the results reported in the paper; qualitative entries summarize relative trends.)

SOT models achieve lower WERs than both a single-speaker ASR model applied to overlapping mixtures and the PIT baseline, while using fewer parameters and training more efficiently. The “Separation after Attention” extension (inserting an additional LSTM after the attention module) further improves WER by helping to disambiguate overlapping speakers.

SOT is also “speaker agnostic” in that a single model can handle variable speaker counts, including both single- and multi-talker mixtures, without switch-based architectures or data routing.

5. Implementation and Practical Considerations

In practice, SOT is implemented by adapting a standard encoder–decoder ASR model. At training time, batches are constructed by mixing utterances from multiple speakers and concatenating their label sequences according to start time. The serialized target sequence includes auxiliary tokens as described above.
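
A sketch of this data pipeline under stated assumptions (simple additive mixing, a random onset of up to one second at 16 kHz, token-id transcripts, and illustrative `<sc>`/`<eos>` ids; none of these specifics come from the paper):

```python
import random
from typing import List, Tuple

import torch

def make_sot_example(
    utterances: List[Tuple[torch.Tensor, List[int]]],  # (waveform, token ids) per speaker
    sc_id: int,
    eos_id: int,
    max_delay: int = 16000,   # assumed: up to 1 s random onset at 16 kHz
) -> Tuple[torch.Tensor, List[int]]:
    """Mix single-speaker utterances into one overlapped waveform and build the
    serialized label sequence in first-in, first-out (start-time) order."""
    delayed = []
    for wav, tokens in utterances:
        start = random.randint(0, max_delay)
        delayed.append((start, wav, tokens))
    delayed.sort(key=lambda x: x[0])                    # FIFO by start time

    length = max(start + wav.numel() for start, wav, _ in delayed)
    mixture = torch.zeros(length)
    target: List[int] = []
    for i, (start, wav, tokens) in enumerate(delayed):
        mixture[start:start + wav.numel()] += wav       # simple additive mixing
        if i > 0:
            target.append(sc_id)
        target.extend(tokens)
    target.append(eos_id)
    return mixture, target
```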

Decoding is unbranched: the model autoregressively generates tokens, emitting $\langle \mathrm{sc} \rangle$ where it predicts a speaker change. No external postprocessing is required to obtain per-speaker transcriptions, aside from parsing the output stream at the separator tokens.
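
For example, a decoded token stream can be split at the separators as follows (token strings again illustrative):

```python
from typing import List

def split_hypothesis(tokens: List[str], sc: str = "<sc>", eos: str = "<eos>") -> List[str]:
    """Split a serialized SOT hypothesis into one transcript per speaker."""
    speakers: List[List[str]] = [[]]
    for tok in tokens:
        if tok == eos:
            break
        if tok == sc:
            speakers.append([])
        else:
            speakers[-1].append(tok)
    return [" ".join(words) for words in speakers]

print(split_hypothesis(["how", "are", "you", "<sc>", "i", "am", "fine", "<eos>"]))
# ['how are you', 'i am fine']
```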

Memory and compute requirements for SOT models are moderate and comparable to their single-talker AED counterparts, especially when utilizing the FIFO ordering, which avoids the factorial cost of a full permutation search.

6. Limitations and Extensions

While SOT overcomes the combinatorial issues and branch limitations of PIT, a remaining challenge is determining the ideal serialization order in highly overlapped or ambiguous temporal configurations. The FIFO approach, while efficient, may not match actual discourse order in some conversational overlaps where turn-taking is less distinct.

Implicit in the SOT design is that the attention mechanism and decoder learn to leverage cross-speaker context and to disambiguate overlapping regions from acoustic and linguistic cues. Extensions such as “Separation after Attention” (a post-attention LSTM stage) or integration with diarization and speaker-identification modules may further improve real-world performance.

SOT’s simplicity, effectiveness on LibriSpeech, and capacity for speaker counting with variable numbers of talkers mark it as a foundational technique for overlapped speech ASR. Further refinements and adaptations (e.g., for streaming, multilingual, or speaker-attributed scenarios) continue to be developed in subsequent research.

7. Summary Table: SOT versus PIT Approaches

| Feature | SOT | PIT |
|---|---|---|
| Decoder Structure | Single sequence (1 branch) | Parallel ($S$ branches) |
| Permutation Complexity | $O(S)$ (FIFO ordering) | $O(S^3)$ |
| Speaker Dependency Modeling | Sequential/contextual (via serialization) | Branch-independent |
| Max Speakers Per Model | Unlimited | Hard limit by number of branches |
| Speaker Counting | Native (via $\langle \mathrm{sc} \rangle$) | Not directly supported |
| Empirical WER (LibriSpeech) | Lower (e.g., 16.5–17.0% on 2-spk) | Higher |
| Multi-Speaker Flexibility | Yes | No (fixed branching) |

The SOT framework, by leveraging streamlined decoder semantics, efficient permutation handling, and implicit speaker transition modeling, stands as the basis for a range of modern overlapped speech recognition architectures (Kanda et al., 2020).
