MOSS-Audio: Unified Audio-Language Model
- MOSS-Audio is a unified audio-language model that fuses dedicated audio encoders, modality adapters, and LLM decoders to achieve nuanced speech, sound, and music understanding.
- It employs DeepStack cross-layer feature injection and explicit time markers to enhance low-level audio cues and temporal reasoning for tasks like transcription and captioning.
- Benchmarks and ablation studies report strong results on ASR, timestamped transcription, and multi-step audio reasoning relative to comparable audio-language models.
MOSS-Audio is a unified audio-LLM family for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. In related MOSS technical reports, the name also denotes a broader audio stack built around a dedicated audio encoder for understanding, a large-scale discrete tokenizer, and autoregressive generators for text-to-speech, spoken dialogue, and instruction-driven voice design (Yang et al., 1 Jun 2026, Gong et al., 11 Feb 2026, Gong et al., 18 Mar 2026).
1. Scope, model family, and ecosystem position
Within the understanding setting, MOSS-Audio is released in four variants: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. The Instruct models are tuned for direct instruction following and stable task execution, whereas the Thinking models are tuned for multi-step, audio-grounded reasoning; empirically, Thinking variants outperform Instruct variants on reasoning-oriented general audio understanding benchmarks, while Instruct variants are stronger on ASR, timestamped ASR, and speech captioning (Yang et al., 1 Jun 2026).
In related reports, the broader MOSS audio stack includes a discrete tokenizer, autoregressive speech generators, spoken-dialogue synthesis, instruction-driven voice generation, and long-context speaker-attributed transcription. This suggests a stack organized around shared low-rate audio representations and autoregressive decoding, but the term “MOSS-Audio” is used most precisely for the unified audio-language understanding family described in the technical report (Yang et al., 1 Jun 2026, Gong et al., 11 Feb 2026, Gong et al., 18 Mar 2026).
| Component | Reported role | Reference |
|---|---|---|
| MOSS-Audio | Unified audio-language understanding | (Yang et al., 1 Jun 2026) |
| MOSS-Audio-Tokenizer | Discrete audio tokenizer | (Gong et al., 11 Feb 2026) |
| MOSS-TTS / MOSS-TTS-Local-Transformer | Autoregressive speech generation | (Gong et al., 18 Mar 2026) |
| MOSS-VoiceGenerator | Instruction-driven voice design | (Huang et al., 30 Mar 2026) |
| MOSS-TTSD | Spoken dialogue generation | (Zhang et al., 20 Mar 2026) |
2. Core architecture of the understanding model
MOSS-Audio couples a dedicated audio encoder with a modality adapter and a decoder LLM. The encoder consumes 128-channel log-Mel spectrograms, applies three stride-2 Conv2D layers for 8× temporal downsampling, and produces tokens at 12.5 Hz, so each audio token corresponds to 80 ms. The encoder itself is a 32-layer Transformer with hidden dimension 1280 and sliding-window self-attention over at most 100 frames, corresponding to 8 seconds of local context; the encoder module contains approximately 0.6B parameters (Yang et al., 1 Jun 2026).
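The rates quoted above follow from simple arithmetic. The sketch below works through it, assuming a 100 Hz log-Mel frame rate; that value is implied by 8× downsampling to 12.5 Hz but is not stated in the text. All names are illustrative.

```python
# Assumption: a 100 Hz log-Mel frame rate, implied by 8x downsampling to 12.5 Hz.
MEL_FRAME_RATE_HZ = 100.0
DOWNSAMPLE_FACTOR = 2 ** 3   # three stride-2 convolution layers
WINDOW_TOKENS = 100          # sliding-window attention span, in encoder tokens

TOKEN_RATE_HZ = MEL_FRAME_RATE_HZ / DOWNSAMPLE_FACTOR   # 12.5 tokens per second
TOKEN_DURATION_MS = 1000.0 / TOKEN_RATE_HZ              # 80 ms per token
WINDOW_SECONDS = WINDOW_TOKENS / TOKEN_RATE_HZ          # 8 s of local context


def num_audio_tokens(duration_s: float) -> int:
    """Approximate encoder token count for a clip, ignoring padding details."""
    return int(duration_s * TOKEN_RATE_HZ)
```

Under these assumptions a 30-second clip yields about 375 audio tokens before any time markers are added.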
Two architectural mechanisms are central. The first is DeepStack cross-layer feature injection. Instead of exposing the decoder only to the top encoder layer, MOSS-Audio extracts intermediate hidden states from multiple encoder depths, projects them with a GatedMLP merge adapter, and injects them into selected early decoder layers. The motivation is that the top encoder layer over-emphasizes lexical and semantic content, whereas non-speech audio understanding and temporal reasoning also require low- and mid-level cues such as prosody, transients, timbre, and local time-frequency patterns (Yang et al., 1 Jun 2026).
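The injection idea can be illustrated with a toy sketch in plain Python. It assumes a SiLU-gated MLP and additive (residual) injection at the audio-token positions of a decoder layer; the text does not specify either detail, and every name, shape, and weight here is hypothetical.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def gated_mlp(x, w_gate, w_up, w_down):
    """Assumed GatedMLP form: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


def inject(decoder_hidden, audio_positions, projected):
    """Add projected encoder features to the decoder states at audio positions.

    Additive injection is an assumption made for illustration.
    """
    out = [row[:] for row in decoder_hidden]
    for pos, feat in zip(audio_positions, projected):
        out[pos] = [h + f for h, f in zip(out[pos], feat)]
    return out
```

In a full model, one such adapter would be applied per selected encoder depth, and each result would be injected into a different early decoder layer.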
The second mechanism is explicit time markers. Since the encoder runs at 12.5 Hz, the elapsed time at encoder step $n$ is $t_n = n / 12.5$ s $= 0.08\,n$ s. The system appends a numeric time marker after every 25 audio features, so each inserted marker denotes 2 seconds of elapsed time. These markers are interleaved with the adapted audio sequence and provide absolute temporal anchors for timestamped transcription and time-aware question answering (Yang et al., 1 Jun 2026).
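The interleaving can be sketched as follows. The marker format `<2s>` and the use of placeholder strings for audio features are assumptions for illustration; the actual marker tokenization is not described here.

```python
TOKEN_RATE_HZ = 12.5
FEATURES_PER_MARKER = 25  # one marker per 2 seconds of audio


def interleave_time_markers(audio_features):
    """Insert an elapsed-time marker after every 25 audio features."""
    out = []
    for i, feat in enumerate(audio_features, start=1):
        out.append(feat)
        if i % FEATURES_PER_MARKER == 0:
            elapsed_s = i / TOKEN_RATE_HZ
            out.append(f"<{elapsed_s:g}s>")  # hypothetical marker format
    return out
```

For a 60-feature input this yields two markers, at 2 s and 4 s; the trailing 10 features (0.8 s) end before the next marker boundary.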
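Timestamp quality is evaluated with Accumulated Average Shift (AAS):
$$\mathrm{AAS} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{t}_i - t_i \right|$$
Here $N$ is the number of timestamp slots, $\hat{t}_i$ the predicted timestamp, and $t_i$ the reference timestamp (Yang et al., 1 Jun 2026).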
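A minimal sketch of the metric, assuming AAS is the mean absolute difference between predicted and reference timestamps over all slots. The function name is illustrative.

```python
def accumulated_average_shift(predicted, reference):
    """Mean absolute shift between predicted and reference timestamps.

    Both inputs are equal-length sequences in the same unit (for example ms).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one timestamp slot is required")
    total_shift = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total_shift / len(predicted)
```

For example, predictions of 100, 520, and 1000 ms against references of 120, 500, and 1050 ms have shifts of 20, 20, and 50 ms, for an AAS of 30 ms.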
3. Data pipeline, annotation strategy, and training regimen
The pretraining corpus is built through an event-preserving audio annotation pipeline. Raw audio is first segmented at coherent event boundaries using frame-level sound event detection based on AudioSet taxonomy with BEATs in PretrainedSED. The pipeline merges vocal and speech events with gap tolerance, ignores non-speech events longer than 60 seconds when forming boundaries, overlap-merges remaining events, adds short padding, and enforces a maximum segment-length cap. Fine-grained AudioSet tags are then mapped into nine coarse categories: speech, human voice (non-speech), singing, music, natural sounds, source-ambiguous, sounds of things, channel/environment/background, and animal (Yang et al., 1 Jun 2026).
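The boundary-forming steps resemble standard interval merging. The sketch below is a simplified illustration, not the reported pipeline: it ignores the speech versus non-speech distinction, the gap tolerance, padding, and cap values are placeholders, and splitting over-long segments into consecutive chunks is an assumed way of enforcing the cap.

```python
def merge_events(events, gap_tolerance_s=0.5):
    """Merge (start, end) intervals that overlap or lie within the gap tolerance."""
    merged = []
    for start, end in sorted(events):
        if merged and start - merged[-1][1] <= gap_tolerance_s:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def pad_and_cap(segments, pad_s=0.2, max_len_s=30.0, audio_len_s=float("inf")):
    """Pad each segment, then split any segment longer than the cap."""
    out = []
    for start, end in segments:
        start = max(0.0, start - pad_s)
        end = min(audio_len_s, end + pad_s)
        while end - start > max_len_s:
            out.append((start, start + max_len_s))
            start += max_len_s
        out.append((start, end))
    return out
```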
Annotation then proceeds along branch-specific routes. Speech and singing segments receive pseudo-labels from an ensemble of ASR systems, language identification via fastText and MMS-LID, and forced alignment with TorchAudio MMS_FA for word-level and sentence-level timestamps. Speech-caption data are enriched with DiariZen diarization and a speaker-aware captioner that describes gender, age, accent, pitch, volume, speed, emotion, tone, texture, clarity, fluency, personality, and utterance-level summary. General-audio captioning combines frame-level event evidence, global anchors, a dense caption generator, and LLM-judged verification. Music captioning integrates outputs from audio-LLMs with symbolic and structural analysis, including chords, beat and tempo, key, melody descriptors, instrument recognition, and structure labeling (Yang et al., 1 Jun 2026).
The heterogeneous branch outputs are then normalized into a unified schema and merged by Router-R1, which uses prior-driven routing, conservative thresholds, and uncertainty estimates to decide which branches to include and in what order. Intermediate branch-specific captions are retained for subsequent task-oriented supervised fine-tuning (Yang et al., 1 Jun 2026).
Pretraining uses approximately 1.2T tokens with default sampling ratios of 30% ASR-related tasks, 40% audio captioning, and 30% text-only language modeling, together with square-root mixing across datasets. Stage 1 primarily trains the modality adapter and DeepStack injection modules on audio-text data only. Stage 2 performs full end-to-end optimization of encoder, adapters, DeepStack, and LLM using the complete objective mixture. Post-training then proceeds through supervised fine-tuning, a reasoning cold-start phase, and DAPO-based reinforcement learning. The reinforcement learning stage uses a DAPO objective with asymmetric clipping, token-level importance-sampling correction, and dynamic filtering of rollout groups whose rewards have zero standard deviation (Yang et al., 1 Jun 2026).
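Square-root mixing samples each dataset in proportion to the square root of its size, which up-weights small datasets relative to size-proportional sampling. A minimal sketch of that rule, with hypothetical dataset names and sizes; how the mixing interacts with the 30/40/30 task ratios is not specified in the text and is not modeled here.

```python
import math


def sqrt_mixing_weights(dataset_sizes):
    """Return sampling probabilities proportional to sqrt(dataset size)."""
    roots = {name: math.sqrt(size) for name, size in dataset_sizes.items()}
    total = sum(roots.values())
    return {name: root / total for name, root in roots.items()}
```

With sizes of 100 and 400, size-proportional sampling would give 0.2 and 0.8, whereas square-root mixing gives 1/3 and 2/3.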
4. Variants, benchmark profile, and ablation findings
On reasoning-heavy general-audio benchmarks, the strongest reported model is MOSS-Audio-8B-Thinking, with average 71.08 across MMAU, MMAU-Pro, MMAR, and MMSU; MOSS-Audio-4B-Thinking reaches 68.37, while the Instruct variants score 66.32 and 64.04 for 8B and 4B respectively. The pattern is consistent: Thinking variants outperform Instruct variants on reasoning-oriented evaluation, and 8B outperforms 4B (Yang et al., 1 Jun 2026).
On speech captioning, MOSS-Audio-8B-Instruct attains average 3.725 on the reported 2,000-sample benchmark, and MOSS-Audio-4B-Instruct attains 3.711. On ASR, the overall average CER reported for MOSS-Audio-8B-Instruct is 11.30, compared with 11.58 for MOSS-Audio-4B-Instruct. The 8B-Instruct model is also reported as strongest on several timestamp-sensitive or acoustically difficult subsets, including singing, whisper, and code-switching, while remaining challenged by multi-speaker and far/near-field conditions (Yang et al., 1 Jun 2026).
Timestamped ASR is a major empirical strength. On AISHELL-1, MOSS-Audio-8B-Instruct reports AAS 35.77 ms, and MOSS-Audio-4B-Instruct reports 76.96 ms; on LibriSpeech, the corresponding values are 131.61 ms and 358.13 ms. These results are substantially lower than the AAS figures reported for Qwen3-Omni-Instruct and Gemini-3.1-Pro in the same evaluation tables, indicating that explicit time markers plus timestamp-ASR pretraining materially improve temporal grounding (Yang et al., 1 Jun 2026).
Ablations support the encoder and DeepStack design. Under the XARES-LLM framework, the MOSS Audio Encoder achieves the highest overall score on the reported generative Task 2, and in a matched ASR study paired with Qwen3-1.7B it reduces average error across 38 datasets from 17.61% for AuT to 16.31%. The DeepStack ablation on MECAT-Caption raises DATE from 0.4823 to 0.4831 overall, with the reported gains concentrated on non-speech categories such as music, pure or mixed sound, and environment, alongside a slight trade-off on speech-dominated subsets (Yang et al., 1 Jun 2026).
5. Relation to the broader MOSS audio stack
The broader MOSS stack is anchored by MOSS-Audio-Tokenizer, a general-purpose discrete audio tokenizer built on CAT, a homogeneous causal Transformer encoder–quantizer–decoder trained end-to-end from scratch. The main release contains approximately 1.6B parameters, is pretrained on approximately 3,000,000 hours of diverse audio at 24 kHz, and produces discrete tokens at 12.5 Hz. With residual vector quantization depth $N_q$ and codebook size $1024$ per layer, the bitrate is
$$\text{bitrate} = 12.5 \times N_q \times \log_2 1024 = 125\,N_q \ \text{bps},$$
yielding operating points from 750 bps at 6 active layers to 4000 bps at 32 layers (Gong et al., 11 Feb 2026, Gong et al., 18 Mar 2026).
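The two operating points can be checked numerically, assuming the usual residual vector quantization bitrate of frame rate × active layers × bits per code. The function name is illustrative.

```python
import math


def rvq_bitrate_bps(frame_rate_hz, num_layers, codebook_size):
    """Bitrate of an RVQ tokenizer: frames/s x layers x bits per code."""
    return frame_rate_hz * num_layers * math.log2(codebook_size)
```

At 12.5 Hz with 1024-entry codebooks, each active layer contributes 125 bps, so 6 layers give 750 bps and 32 layers give 4000 bps, matching the figures quoted above.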
On top of this tokenizer, the MOSS-TTS technical report releases two complementary generators: MOSS-TTS, based on a delay pattern, and MOSS-TTS-Local-Transformer, based on a global-latent plus local design. Both support zero-shot voice cloning, token-level duration control, phoneme- or pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. The report explicitly states long-context operation up to hour-scale outputs, and the Local-Transformer variant improves speaker preservation and time to first audio (Gong et al., 18 Mar 2026).
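As a rough illustration of what a delay pattern means for multi-layer codec tokens, the sketch below shifts layer k right by k steps, as in common codebook-delay schemes for RVQ language models. Whether MOSS-TTS uses exactly this one-step-per-layer schedule is an assumption, and the padding token and function names are hypothetical.

```python
def apply_delay_pattern(codes, pad_token=-1):
    """Shift layer k right by k steps; codes is a list of K equal-length lists."""
    num_layers = len(codes)
    num_steps = len(codes[0])
    total = num_steps + num_layers - 1
    delayed = []
    for k, layer in enumerate(codes):
        left = [pad_token] * k
        right = [pad_token] * (total - num_steps - k)
        delayed.append(left + list(layer) + right)
    return delayed


def undo_delay_pattern(delayed, num_steps):
    """Recover the aligned codes from a delayed layout."""
    return [layer[k:k + num_steps] for k, layer in enumerate(delayed)]
```

With this layout, a single autoregressive step emits one token per layer, and each deeper layer is conditioned on the shallower layers of the same audio frame, which were emitted at earlier steps.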
Related generative extensions broaden the scope further. MOSS-VoiceGenerator is an instruction-driven voice generation model that creates new timbres directly from natural-language prompts without reference audio and is trained on approximately 25,000 hours of unique audio across Chinese and English (Huang et al., 30 Mar 2026). MOSS-TTSD performs spoken dialogue generation from dialogue scripts with explicit speaker tags, supports up to 60 minutes of single-pass synthesis, and handles up to 5 speakers with zero-shot voice cloning from short reference audio clips (Zhang et al., 20 Mar 2026). MOSS Transcribe Diarize addresses Speaker-Attributed, Time-Stamped Transcription with a 128k context window for up to 90-minute inputs (Yu et al., 4 Jan 2026). MOSS-Speech departs from text-guided cascades by introducing a true speech-to-speech LLM with a 36-layer Transformer split after the 32nd block, together with a 12.5 Hz, 175 bps streaming speech tokenizer (Zhao et al., 1 Oct 2025).
Taken together, these reports define MOSS-Audio not only as an understanding model family but also as a wider architecture program spanning discrete tokenization, autoregressive speech generation, spoken-dialogue synthesis, speaker-attributed transcription, and direct speech-to-speech modeling (Yang et al., 1 Jun 2026, Gong et al., 11 Feb 2026, Gong et al., 18 Mar 2026).
6. Terminological ambiguity and adjacent uses
The term “MOSS-Audio” is not used uniformly across arXiv. In the MOSS technical reports, it refers to the unified audio-language understanding family and, by extension, the associated tokenizer and speech-generation stack (Yang et al., 1 Jun 2026, Gong et al., 11 Feb 2026). In other papers, however, the phrase appears as a convenient label for substantially different tasks.
One distinct usage appears in audio evaluation. Semi-intrusive assessment recasts MOS and SNR estimation as a multimodal text-prediction problem with audio-text inputs, and the paper explicitly frames this as fitting naturally into MOS-based audio evaluation or “MOSS-Audio” (Coldenhoff et al., 2024). Another usage appears in the AudioMOS Challenge 2025 Track 1 system for text-to-music evaluation, where “MOSS-Audio” is used as a practical predictor for Music Impression and Text Alignment (Ritter-Gutierrez et al., 14 Jul 2025). These are evaluation frameworks rather than members of the MOSS model family.
A second ambiguity arises in cross-modal generation. MOVA, expanded as MOSS Video and Audio, uses “MOSS-Audio” to denote its audio tower and its integrations for joint video-audio synthesis, including lip-synced multilingual speech, environment-aware sound effects, and content-aligned music (Team et al., 9 Feb 2026). A third appears in spatial-audio-driven motion generation: MOSPA explicitly notes that “MOSS-Audio,” interpreted as “motion from spatial audio,” aligns conceptually with its task, although MOSPA is the specific framework and dataset introduced there (Xu et al., 16 Jul 2025).
These usages are not contradictory, but they refer to different technical objects: a unified audio-language foundation model, a broader tokenizer-plus-generator stack, evaluation systems for MOS-style prediction, an audio tower inside a video-audio generator, and a shorthand for motion from spatial audio. The most precise encyclopedia usage reserves “MOSS-Audio” for the MOSS family centered on unified audio understanding and its closely related tokenizer and speech-generation components (Yang et al., 1 Jun 2026, Gong et al., 11 Feb 2026).