
MOSS Transcribe Diarize Model

Updated 8 January 2026
  • MOSS Transcribe Diarize is an end-to-end multimodal large language model that unifies automatic speech recognition and speaker diarization within a single transformer-based architecture.
  • It processes continuous 90-minute audio sessions by chunking raw waveforms and leveraging a convolutional-transformer encoder to maintain a 128k-token dense context.
  • The model achieves superior transcription and speaker attribution accuracy, as demonstrated by lower cpCER and Δcp metrics compared to classical multi-stage pipelines.

MOSS Transcribe Diarize is an end-to-end multimodal LLM (MLLM) for speaker-attributed, time-stamped transcription (SATS), engineered to unify automatic speech recognition (ASR) and speaker diarization within a single architecture capable of processing up to 90-minute audio sessions in a single inference pass. The system leverages a convolutional-transformer audio encoder and a high-capacity transformer LLM to jointly produce accurate transcriptions enriched with per-utterance speaker labels and explicit timestamps, directly from raw waveforms without cascading or external alignment modules. MOSS Transcribe Diarize achieves superior accuracy and scalability compared to both classical pipelined approaches and contemporary commercial solutions, especially on long-form, multi-speaker conversational data (Yu et al., 4 Jan 2026).

1. End-to-End Model Architecture and Workflow

MOSS Transcribe Diarize adopts a monolithic multimodal transformer paradigm: raw 16 kHz mono audio is divided into 1-second overlapping chunks (50% overlap), each processed via a convolutional-transformer audio encoder to generate 1,024-dimensional embeddings. These are projected into the LLM embedding space (1,536-dim) through a two-layer MLP with GeLU activation. Chunk embeddings, time tokens (“hh:mm:ss.ms”), and the transcript token sequence are contextually injected into a 40-block transformer backbone (hidden size: 1,536; feed-forward inner size: 6,144; 24 attention heads). The context capacity is fixed at 128,000 tokens, enabling approximately 90 minutes of audio + transcript with complete speaker and timestamp annotation.
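
The projection step can be pictured with a minimal PyTorch sketch, assuming the two-layer MLP places its GeLU between the linear layers; the class name and layer arrangement are illustrative and not taken from any released code.

import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps 1,024-dim audio-encoder outputs into the 1,536-dim LLM embedding space."""

    def __init__(self, enc_dim: int = 1024, llm_dim: int = 1536):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, chunk_embeddings: torch.Tensor) -> torch.Tensor:
        # chunk_embeddings: (num_chunks, 1024) -> (num_chunks, 1536)
        return self.proj(chunk_embeddings)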

Speaker diarization is performed implicitly within the token generation process: speaker labels ([S01], [S02], …) and timestamp tokens are autoregressively emitted as part of the output token stream. No post-hoc clustering or alignment is needed. The workflow pseudocode is:

def transcribe_and_diarize(audio_waveform):
    # 1-second chunks with 50% overlap over the raw 16 kHz waveform
    chunks = frame(audio_waveform, chunk_seconds=1.0, overlap=0.5)
    encoded = [audio_encoder(c) for c in chunks]       # 1,024-dim chunk embeddings
    projected = [mlp_proj(e) for e in encoded]         # project to 1,536-dim LLM space
    # interleave projected <AudioChunk> embeddings with explicit <Time> tokens
    context = interleave_with_time_tokens(projected, chunks)
    output_tokens = llm.decode(context)                # autoregressive generation
    # extract (t_start, t_end, speaker_id, text) from speaker tags and timestamps
    return parse_segments(output_tokens)

All tasks—word recognition, speaker attribution, and timestamping—are supervised via a single cross-entropy loss over the full token sequence: $\mathcal{L}_{CE} = -\sum_{t=1}^{N} \log p_\theta(y_t \mid y_{<t},\,\text{audio})$, where $y_t$ spans words, speaker tags, and timestamp tokens (Yu et al., 4 Jan 2026).
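
A compact sketch of this objective in PyTorch, assuming padded batches and a shared vocabulary index for words, speaker tags, and timestamp tokens; the padding id and tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def sats_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len) next-token labels.
    # Words, speaker tags ([S01], ...), and timestamp tokens share one vocabulary,
    # so a single cross-entropy over the flattened sequence supervises all three.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )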

2. Input Representation and Long-Context Handling

Audio is preprocessed into log-Mel spectrograms and passed through two CNN layers before transformer encoding. Each 1-second chunk is represented by a single contiguous token group, which conserves context and removes the need for fine-grained, frame-level time alignment. Time annotation is explicit: timestamp tokens precede and follow each audio chunk throughout the 128k-token LLM context window.
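
For concreteness, a chunking helper along the lines of the frame call in the Section 1 pseudocode might look as follows; the treatment of the trailing partial window (dropped here) and the exact hop rounding are assumptions, and log-Mel extraction would then be applied per chunk.

import numpy as np

def frame(waveform: np.ndarray, sample_rate: int = 16_000,
          chunk_seconds: float = 1.0, overlap: float = 0.5) -> np.ndarray:
    # 1-second windows with 50% overlap -> chunk starts advance by a 0.5 s hop
    win = int(sample_rate * chunk_seconds)
    hop = int(win * (1.0 - overlap))
    if len(waveform) < win:
        return np.empty((0, win), dtype=waveform.dtype)
    n = 1 + (len(waveform) - win) // hop
    return np.stack([waveform[i * hop : i * hop + win] for i in range(n)])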

No retrieval mechanisms or context compression are employed; all chunk embeddings, timestamps, and transcript tokens are processed as dense attention sequences: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d}}\right) V$ with context length $T \le 128{,}000$, permitting direct dense attention via CUDA acceleration (Yu et al., 4 Jan 2026). This enables robust speaker memory and cross-chunk disambiguation over durations up to 90 minutes, a scale unattainable with classical windowed or streaming diarization systems.
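
As a rough budget check under the stated chunking: a 90-minute session spans 5,400 s, and 1-second chunks with 50% overlap advance by 0.5 s, giving roughly $5400 / 0.5 = 10{,}800$ chunk embedding groups; the remainder of the 128,000-token window is left for timestamp tokens and the generated transcript.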

3. Training Data, Supervision, and Simulation Strategies

Training leverages both real data (AISHELL-4: ~150 h, 5–7 speakers/session; in-house podcasts: ~200 h, 2–11 speakers; film/TV segments: ~50 h, 1–6 speakers) and simulated mixtures (synthetic dialogues with 2–12 speakers, up to 80% overlap, Gaussian-distributed gaps, noise/reverb at SNR 0–15 dB; ~100 h total synthetic). The curriculum is staged: the first 2 epochs use simulated data exclusively, followed by 8 epochs on a 3:1 real/simulated mixture. AdamW optimization (learning rate $1\times10^{-5}$ to $5\times10^{-5}$, 64 GB GPUs, batch size of 32 sessions) is used throughout (Yu et al., 4 Jan 2026).
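
One illustrative way to realize this staged curriculum is sketched below; the source does not specify how the 3:1 mixture is sampled per epoch, so the random subsampling of simulated sessions is an assumption.

import random

def epoch_sessions(real_sessions: list, sim_sessions: list, epoch: int) -> list:
    # Stage 1 (epochs 0-1): simulated mixtures only.
    if epoch < 2:
        return list(sim_sessions)
    # Stage 2 (epochs 2-9): roughly 3 parts real data to 1 part simulated data.
    k = min(len(sim_sessions), max(1, len(real_sessions) // 3))
    return list(real_sessions) + random.sample(sim_sessions, k)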

Annotation leverages a unified token vocabulary; speaker labels and timestamps are directly embedded within the target transcript as token-level supervision. No frame-level, segment-level, or time-aligned label annotation is required.

4. Comparative Evaluation and Metrics

MOSS Transcribe Diarize was benchmarked against commercial and academic systems (Doubao, ElevenLabs Scribe v1, Gemini 2.5/3 Pro, GPT-4o) across multiple datasets:

Dataset     Model                      CER (%)   cpCER (%)   Δcp (%)
AISHELL-4   Doubao                       18.18       27.86      9.68
            ElevenLabs Scribe v1         19.58       37.95     18.36
            Gemini 2.5 Pro               42.70       53.42     10.72
            MOSS Transcribe Diarize      15.43       20.04      4.61
Podcast     Doubao                        7.93       10.54      2.61
            ElevenLabs                    8.50       11.34      2.85
            Gemini 2.5 Pro                7.38       10.23      2.85
            MOSS                          4.46        6.97      2.50
Movies      Doubao                        9.94       30.88     20.94
            ElevenLabs                   11.49       17.85      6.37
            GPT-4o                       14.37       23.67      9.31
            Gemini 2.5 Pro               15.46       24.15      8.69
            Gemini 3 Pro                  8.62       14.73      6.11
            MOSS                          7.50       13.36      5.86

Primary metrics are the Character Error Rate (CER), the concatenated minimum-permutation CER (cpCER), and the speaker-attribution gap Δcp (cpCER − CER); lower values indicate higher transcription and diarization accuracy (Yu et al., 4 Jan 2026).
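
The cpCER metric can be made concrete with a minimal reference sketch under its common definition: concatenate each speaker's utterances, then score the best mapping between hypothesis and reference speakers. This is not the authors' evaluation code, and the brute-force permutation search below is only practical for small speaker counts (evaluation toolkits typically use the Hungarian algorithm instead).

from itertools import permutations

def char_edits(ref: str, hyp: str) -> int:
    # Character-level Levenshtein distance (substitutions + insertions + deletions).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def cpcer(ref_by_speaker: dict, hyp_by_speaker: dict) -> float:
    # Concatenate each speaker's text, then take the speaker permutation that
    # minimizes total character edits, normalized by reference characters.
    refs = ["".join(texts) for texts in ref_by_speaker.values()]
    hyps = ["".join(texts) for texts in hyp_by_speaker.values()]
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    total_ref_chars = sum(len(r) for r in refs)
    best = min(
        sum(char_edits(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / max(1, total_ref_chars)

For example, a perfect transcript whose speaker labels are consistently swapped still scores cpCER = 0, because some permutation realigns the speakers; residual speaker-attribution errors are exactly what Δcp isolates.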

5. Functional and Deployment Characteristics

Inference is performed in batch mode over the full audio session up to 90 minutes. The entire pipeline—audio chunking, encoding, projection, token context assembly, and autoregressive decoding—occurs with no intermediate cascade, clustering, or alignment steps. The outputs are directly parsed into tuples of $(t_{\text{start}},\,t_{\text{end}},\,\text{speaker\_id},\,\text{text})$ by extracting speaker tags and timestamp tokens from the generated transcript.
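
A small illustrative parser for this final step, assuming an output layout along the lines of "<hh:mm:ss.ms> [S01] segment text <hh:mm:ss.ms>"; the actual tag and timestamp token formats emitted by the model may differ.

import re

# Assumed layout: "<hh:mm:ss.ms> [S01] segment text <hh:mm:ss.ms>" repeated per segment;
# the exact tag and timestamp formats are illustrative, not confirmed by the source.
SEGMENT = re.compile(
    r"<(\d{2}:\d{2}:\d{2}\.\d+)>\s*\[(S\d{2})\]\s*(.*?)\s*<(\d{2}:\d{2}:\d{2}\.\d+)>",
    re.DOTALL,
)

def parse_segments(decoded_text: str) -> list:
    return [
        {"t_start": start, "t_end": end, "speaker_id": speaker, "text": words}
        for start, speaker, words, end in SEGMENT.findall(decoded_text)
    ]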

This contrasts with classical multi-stage systems (e.g., virtual microphone array pipelines (Yoshioka et al., 2019), Turn-to-Diarize (Xia et al., 2021), Transcribe-to-Diarize (Kanda et al., 2021)), which separate ASR from diarization via explicit turn-detection tokens, clustering algorithms, constrained spectral embeddings, and fusion with prior speaker d-vectors. MOSS Transcribe Diarize instead learns the joint mapping as part of its single multimodal transformer decoding process.

6. Limitations and Possible Research Directions

Current limitations include the absence of streaming/online inference (processing occurs in one batch pass over up to 90 minutes of audio), a single tokenizer that constrains multilingual and code-switching coverage beyond Chinese and English, and timestamp precision limited to the chunk level rather than the word or phoneme level. Proposed directions include enabling streaming SATS with sub-second latency, refining timestamp resolution, and extending multilingual coverage and domain adaptation (Yu et al., 4 Jan 2026).

A plausible implication is that the transformer-based MLLM approach can absorb more sophisticated diarization logic—e.g., cross-attention to fine-grained speaker embeddings—without reverting to cascade models or handcrafted constraint propagation, unifying SATS under a single loss and model configuration. This suggests future architectures may generalize meeting-level transcription and diarization with minimal external supervision.

7. Context: Classical and Recent SATS Pipelines

Earlier work in meeting transcription (virtual microphone arrays (Yoshioka et al., 2019), Turn-to-Diarize (Xia et al., 2021), Transcribe-to-Diarize (Kanda et al., 2021)) decomposes the SATS task into multiple stages, combining stream alignment, mask-based MVDR beamforming, BLSTM-based ASR, speaker-turn segmentation, and d-vector clustering or constrained spectral algorithms. These systems typically report WER, SAWER, and DER in the 7.5–22% range depending on microphone count and overlap, but at higher annotation cost and cascading compute. MOSS Transcribe Diarize demonstrates that a true end-to-end SATS formulation is feasible, scalable, and empirically superior—achieving lower cpCER and Δcp across both long-form and short-form conversational benchmarks (Yu et al., 4 Jan 2026).

In summary, MOSS Transcribe Diarize marks a shift toward large-scale, unified, long-context SATS solutions, replacing multi-stage heuristic pipelines with direct multimodal sequence modeling of the transcript, speaker turns, and time annotations.
