Dasheng: General Audio Foundation Models
- Dasheng is a general-audio foundation model family that leverages masked autoencoder pretraining to integrate representations of speech, music, and environmental sound.
- Its architecture employs heavy masking of mel-spectrogram time-chunk tokens and scales from 86M to 1.2B parameters to balance cross-domain performance.
- Dasheng serves as a backbone for diverse applications including multilingual audio–text alignment, deepfake detection, and unified generative audio systems.
Dasheng is a family of self-supervised and subsequently adapted general-audio foundation models centered on masked audio encoder learning for broad audio representation across speech, music, and environmental sound. Introduced as a large-scale masked autoencoder for general audio classification, Dasheng was designed to reduce the domain fragmentation that had characterized prior audio representation learning, in which speech-oriented models underperformed on environmental or musical audio, and environment-oriented models underperformed on speech (Dinkel et al., 2024). Subsequent work used Dasheng as an audio backbone for multilingual audio–text alignment, deepfake and spoof detection, large audio-language modeling, encoder benchmarking, and unified audio generation, making the term denote both the original encoder family and a broader ecosystem of Dasheng-based systems (Dinkel et al., 12 Jun 2025).
1. Origins and model family
Dasheng was introduced in “Scaling up masked audio encoder learning for general audio classification” as a self-supervised audio encoder built on the masked autoencoder framework and trained to provide a single representation effective across speech, music, and environmental sound tasks (Dinkel et al., 2024). The original paper defines Dasheng as a simple SSL audio encoder based on masked autoencoding of mel-spectrograms, with three encoder scales: Dasheng-Base at 86M parameters, Dasheng-0.6B at 600M parameters, and Dasheng-1.2B at 1200M parameters (Dinkel et al., 2024).
The model family was explicitly motivated by a “generalization gap” between speech and other sound domains. The original evaluation showed that supervised AudioSet-centric systems remained narrow, while speech-specific systems such as Whisper were strong on speech but weak on environmental and music tasks; Dasheng was positioned as a single backbone intended to avoid this tradeoff (Dinkel et al., 2024). Later papers preserved this categorization. In multilingual audio–text pretraining, Dasheng was characterized as a pretrained general-purpose audio encoder with about 90M parameters and variable-length inputs, selected because it was “the most balanced” across sound, music, and speech among several candidate encoders (Dinkel et al., 12 Jun 2025). In anti-spoofing and environmental forensic settings, Dasheng was likewise contrasted with speech-specific SSL encoders such as WavLM and W2V2-BERT and treated as a “general audio SSL” model (Peng et al., 14 Dec 2025, Peng et al., 9 Dec 2025).
This family identity remained stable even as Dasheng was embedded in larger systems. MiDashengLM used Dasheng-0.6B as its audio encoder, describing it as a Transformer-based encoder with about 630M parameters, architecturally a frame-level Vision Transformer applied to audio spectrograms and pretrained with a Masked Autoencoder objective (Dinkel et al., 6 Aug 2025). GLAP used Dasheng as the underlying audio foundation model beneath a contrastive audio–text objective and multilingual text encoder (Dinkel et al., 12 Jun 2025). Dasheng AudioGen, in turn, used “DashengTokenizer” and a unified semantic-acoustic representation as the latent interface for text-to-audio scene generation (Mei et al., 27 May 2026). This suggests an ecosystem in which the original masked-audio encoder became a reusable substrate for both discriminative and generative audio systems.
2. Core architecture and pretraining
The original Dasheng encoder operates on 16 kHz audio converted into 64-dimensional log-mel spectrograms with 32 ms window size and 10 ms hop size (Dinkel et al., 2024). Rather than 2D time–frequency patches, it uses 1D time-chunk tokens by grouping four consecutive frames into one chunk, corresponding to 40 ms per token and an effective 25 Hz representation rate (Dinkel et al., 2024). Each chunk is flattened and linearly projected to the model embedding dimension, and a learnable absolute positional embedding is added (Dinkel et al., 2024).
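The framing arithmetic above can be checked with a short sketch. It assumes one frame per hop with no padding and drops trailing frames that do not fill a chunk; neither convention is stated in the source, and the helper names are illustrative rather than taken from the Dasheng codebase.

```python
def num_frames(num_samples, sample_rate=16_000, hop_ms=10):
    """Number of mel frames, assuming one frame per hop and no padding (an assumption)."""
    hop = sample_rate * hop_ms // 1000  # 160 samples at 16 kHz
    return num_samples // hop


def chunk_frames(frames, chunk=4):
    """Group consecutive mel frames into flattened time-chunk tokens.

    `frames` is a list of per-frame mel vectors (64 values each in Dasheng).
    Trailing frames that do not fill a whole chunk are dropped (an assumption).
    """
    n_tokens = len(frames) // chunk
    tokens = []
    for i in range(n_tokens):
        group = frames[i * chunk:(i + 1) * chunk]
        # Flatten 4 frames x 64 mel bins into one 256-value token.
        tokens.append([v for frame in group for v in frame])
    return tokens


n = num_frames(10 * 16_000)               # frames in a 10 s clip
frames = [[0.0] * 64 for _ in range(n)]   # placeholder mel values
tokens = chunk_frames(frames)
token_rate_hz = 1000 / (4 * 10)           # 4 frames x 10 ms hop = 40 ms per token
```

Under these assumptions a 10 s clip yields 1000 frames and 250 tokens of 256 values each, which is the input to the linear projection described above.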
A central design choice is heavy masking. Dasheng masks 75% of tokens and removes them before the encoder, while using grouped masking of at least two consecutive chunks to avoid trivial reconstruction arising from the overlap between adjacent STFT windows (32 ms windows at a 10 ms hop) (Dinkel et al., 2024). The encoder is a ViT-style Transformer with standard multi-head self-attention, GELU MLPs, and pre-norm LayerNorm. The three published configurations are summarized below.
| Model | Params | Depth |
|---|---|---|
| Dasheng-Base | 86M | 12 |
| Dasheng-0.6B | 600M | 32 |
| Dasheng-1.2B | 1200M | 40 |
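The grouped masking described above can be illustrated with a minimal sketch. It assumes masks are drawn over aligned blocks of two tokens, which guarantees masked runs of at least two chunks. The paper's exact sampling procedure is not reproduced here, and `grouped_mask` is a hypothetical helper.

```python
import random


def grouped_mask(num_tokens, mask_ratio=0.75, group=2, seed=0):
    """Return a boolean list where True marks a masked token.

    Tokens are masked in aligned blocks of `group` consecutive chunks,
    so every masked run has length at least `group`.
    """
    rng = random.Random(seed)
    num_groups = num_tokens // group
    num_masked_groups = round(num_groups * mask_ratio)
    masked_groups = set(rng.sample(range(num_groups), num_masked_groups))
    mask = [False] * num_tokens
    for g in masked_groups:
        for i in range(g * group, (g + 1) * group):
            mask[i] = True
    return mask


mask = grouped_mask(250)                              # 250 tokens = one 10 s clip
visible = [i for i, m in enumerate(mask) if not m]    # only these enter the encoder
```

Because masked tokens are removed before the encoder, the encoder in this sketch would process only about a quarter of the sequence during pretraining.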
During pretraining, a lightweight Transformer decoder reconstructs the masked mel chunks, and training uses normalized mean squared error computed only on masked tokens (Dinkel et al., 2024). After pretraining, the decoder is discarded and only the encoder is retained for downstream use (Dinkel et al., 2024). The model was trained on 272,356 hours of diverse audio drawn from ACAV100M, AudioSet, VGGSound, and MTG-Jamendo, with labels discarded during SSL pretraining (Dinkel et al., 2024). The scale was material: moving from AudioSet-only to the full training mixture improved average HEAR scores by +8.45 for Dasheng-Base, +8.69 for Dasheng-0.6B, and +6.37 for Dasheng-1.2B (Dinkel et al., 2024).
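A reconstruction loss restricted to masked tokens can be sketched as follows. The sketch assumes that "normalized" means each target token is standardized to zero mean and unit variance before the squared error is taken, as in the normalized-target variant of the original masked autoencoder. Whether Dasheng uses exactly this normalization is not stated in the text above, and the function names are illustrative.

```python
import math


def normalize(token, eps=1e-6):
    """Standardize one target token to zero mean and unit variance (assumed convention)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def masked_mse(pred, target, mask):
    """Mean squared error against normalized targets, averaged over masked tokens only."""
    losses = []
    for p, t, m in zip(pred, target, mask):
        if m:  # visible tokens contribute nothing to the loss
            t_norm = normalize(t)
            losses.append(sum((a - b) ** 2 for a, b in zip(p, t_norm)) / len(t))
    return sum(losses) / len(losses) if losses else 0.0


target = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], [2.0, 4.0, 6.0, 8.0]]
mask = [True, False, True]
perfect = [normalize(t) for t in target]
zeros = [[0.0] * 4 for _ in target]
loss_perfect = masked_mse(perfect, target, mask)   # exact reconstruction of masked tokens
loss_zeros = masked_mse(zeros, target, mask)       # predicting the per-token mean
```

Changing the prediction for the unmasked middle token leaves both losses unchanged, which is the property the text describes.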
Later work retained this architectural framing while emphasizing deployment properties. MiDashengLM highlighted native variable-length support, a maximum native context of 10.08 s per pass, and aggressive downsampling to 5 Hz audio tokens before decoding in an LLM (Dinkel et al., 6 Aug 2025). That paper also contrasted Dasheng with Whisper-Large v3, reporting 630.3M parameters for the Dasheng-based encoder and describing its training lineage as MAE pretraining on ACAV100M followed by alignment to general audio captions (Dinkel et al., 6 Aug 2025). A plausible implication is that Dasheng’s tokenization and temporal resolution made it suitable not only for classification but also for systems in which sequence length and decoder attention cost are major bottlenecks.
3. Representation properties and benchmark behavior
Dasheng’s original evaluation was conducted on the HEAR benchmark, using frozen encoder features from the last layer at 25 Hz with shallow downstream MLP heads (Dinkel et al., 2024). The main result was domain breadth rather than dominance on every task. Dasheng-1.2B achieved domain averages of 83.20 for environment, 75.71 for speech, 84.86 for music, and 81.25 overall, surpassing CED-Base, Whisper-Base, Wav2Vec2, ATST-Frame, and ATST-Clip in overall average (Dinkel et al., 2024). The paper reported that Dasheng-1.2B scored above 80 on 13 of 18 HEAR tasks and obtained notable results on CREMA-D, SpeechCommands, VoxLingua, and several music tasks (Dinkel et al., 2024).
The representation was also probed via k-NN. On nine tasks, Dasheng substantially exceeded AudioMAE, reaching 68.6 on ESC-50, 72.1 on FSDKaggle 2018, 77.7 on UrbanSound8k, 95.9 on SpeechCommands V1, 90.9 on SpeechCommands V2, 39.4 on VoxCeleb1, 61.9 on RAVDESS, and 62.4 on Fluent Speech Commands for Dasheng-1.2B (Dinkel et al., 2024). Linear evaluation on VoxCeleb1 reached 92.5 for Dasheng-1.2B (Dinkel et al., 2024). The original authors concluded that Dasheng features “inherently contain rich speech, music, and environmental information” (Dinkel et al., 2024).
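A k-NN probe of frozen embeddings can be sketched in a few lines. The sketch assumes cosine similarity and majority voting with a small `k`; the protocol used in the paper (distance metric, `k`, pooling of frame features) is not specified in the text above, and the toy two-dimensional embeddings stand in for pooled Dasheng features.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def knn_predict(query, train_embs, train_labels, k=3):
    """Label a query by majority vote among its k most similar training embeddings."""
    ranked = sorted(range(len(train_embs)),
                    key=lambda i: cosine(query, train_embs[i]), reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(test_embs, test_labels, train_embs, train_labels, k=3):
    """Fraction of test items whose k-NN prediction matches the label."""
    hits = sum(knn_predict(q, train_embs, train_labels, k) == y
               for q, y in zip(test_embs, test_labels))
    return hits / len(test_labels)


# Toy embeddings; real use would substitute utterance-level encoder features.
train_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["speech", "speech", "music", "music"]
test_embs = [[0.8, 0.2], [0.2, 0.8]]
test_labels = ["speech", "music"]
accuracy = knn_accuracy(test_embs, test_labels, train_embs, train_labels)
```

Because no parameters are trained, this kind of probe measures how well classes are already separated in the frozen representation.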
Later benchmark work reinforced that characterization from a different angle. In GLAP, an audio-encoder comparison using text-to-audio retrieval mAP10 found that CED-Base and BEATs were strong on sound/music but weak on speech, Whisper and WavLM were strong on speech but weak on sound/music, while Dasheng was “the most versatile choice for general audio encoding” (Dinkel et al., 12 Jun 2025). In that comparison, Dasheng obtained 55.8 on AudioCaps, 60.1 on ACD, 20.3 on music retrieval, 94.8 on LibriSpeech-other, and 99.0 on AIS2, indicating a balanced profile across domains (Dinkel et al., 12 Jun 2025).
A separate encoder benchmarking paper for the ICME 2025 Audio Encoder Challenge treated Dasheng 1.2B as a strong reference model and as one component of the submitted ensemble (Bharadwaj et al., 22 Jan 2026). Task-wise scores showed Dasheng particularly strong on GTZAN Genre at 0.886, FMA Small at 0.647, Fluent Speech Commands at 0.973, Speech Commands V1 at 0.973, VoxLingua33 at 0.860, and VocalSound at 0.925 (Bharadwaj et al., 22 Jan 2026). However, that same report showed that carefully trained BEATs variants could surpass Dasheng on some domain-specialized tasks such as FSD18-Kaggle, UrbanSound8k, and LibriSpeech-MF, and that feature-level ensembling could often improve beyond Dasheng alone (Bharadwaj et al., 22 Jan 2026). Dasheng is therefore not uniformly superior on every audio task; its defining property is strong cross-domain balance.
4. Dasheng as a backbone for multimodal alignment and audio-language modeling
One important development was the use of Dasheng as the audio encoder in contrastive audio–text pretraining. GLAP paired Dasheng as audio encoder with Sonar as multilingual text encoder, followed by projection MLPs into a shared embedding space (Dinkel et al., 12 Jun 2025). Similarity was computed by cosine similarity, and training used a SigLIP-style sigmoid contrastive loss with a learnable temperature and bias, the temperature initialized to $0.07$ (Dinkel et al., 12 Jun 2025). All subsequent GLAP experiments used Dasheng as the single audio encoder because it offered the most versatile performance across sound, music, and speech domains (Dinkel et al., 12 Jun 2025).
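A sigmoid contrastive loss of the SigLIP kind can be sketched as below. Two assumptions apply: the temperature divides the cosine similarity, and the bias of -10 in the example is the initial value from the SigLIP paper, used only for illustration because GLAP's value is not given in this text. All function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(audio_embs, text_embs, temperature, bias):
    """Pairwise sigmoid loss: matched pairs are positives, all other pairs negatives."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(a)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(x * y for x, y in zip(a[i], t[j]))
            logit = sim / temperature + bias        # assumed parameterization
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / n


audio = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = sigmoid_contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]], 0.07, -10.0)
loss_swapped = sigmoid_contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]], 0.07, -10.0)
```

Unlike a softmax contrastive loss, each audio-text pair is scored independently, so the loss does not require normalizing over the whole batch.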
GLAP’s training mixture spanned speech, sound, and music, with YODAS contributing 400k hours and 431M pairs across 145 languages, and sound/music captions translated into seven additional languages beyond English (Dinkel et al., 12 Jun 2025). The resulting system achieved strong retrieval on AudioCaps and Clotho, near-perfect speech retrieval on LibriSpeech-other and AISHELL-2, zero-shot keyword spotting across 50 languages, and competitive zero-shot sound/music classification (Dinkel et al., 12 Jun 2025). Since GLAP built directly on Dasheng, this established Dasheng as not only a classifier backbone but also an audio-side encoder for multilingual audio–text alignment.
MiDashengLM moved further, using Dasheng-0.6B inside a prefix-based large audio-LLM (Dinkel et al., 6 Aug 2025). The pipeline was waveform → mel spectrogram → Dasheng → projection MLP → Qwen2.5-Omni decoder, trained with next-token cross-entropy conditioned on audio features $\mathbf{a}$: $\mathcal{L} = -\sum_{t} \log p(y_t \mid y_{<t}, \mathbf{a})$ (Dinkel et al., 6 Aug 2025). Its alignment stage used ACAVCaps, a 38,662-hour corpus of general audio captions, to fine-tune Dasheng and the decoder jointly on holistic captions rather than ASR transcripts (Dinkel et al., 6 Aug 2025). This was presented as a shift from ASR-centric alignment to “general audio captions,” capturing speech content, speaker traits, sound events, music attributes, and acoustic conditions within one textual target (Dinkel et al., 6 Aug 2025).
The performance and systems consequences were substantial. MiDashengLM reported that a Dasheng-based encoder beat Whisper on 18 of 22 X-ARES tasks, with large relative improvements on sound classification, music tasks, and speaker recognition, though it remained slightly worse on some pure speech tasks such as LibriSpeech-100h ASR and Speech Commands keyword spotting (Dinkel et al., 6 Aug 2025). It also reported approximately 4x faster time-to-first-token and up to 20.2x higher throughput than Qwen2.5-Omni-7B at large batch sizes, attributing these gains to Dasheng’s variable-length support and 5 Hz token rate (Dinkel et al., 6 Aug 2025). This suggests that Dasheng’s representational breadth and temporal compression are jointly valuable in multimodal generative systems.
5. Forensic and anti-spoofing applications
Dasheng has been adopted as a front-end in audio forensics, especially where robustness to out-of-domain generators is more important than speech specialization. In the ESDD 2026 Challenge submission from BUT, Dasheng was one of the general-audio SSL front-ends alongside BEATs and EAT, contrasted with speech-specific SSLs such as WavLM and HuBERT (Peng et al., 9 Dec 2025). The model fed into a lightweight Multi-Head Factorized Attention (MHFA) backend. The hidden states of all of Dasheng's Transformer layers were used through learned layer aggregation (a learned weighted combination across layers), followed by projection to a compression dimension of 128 and attention pooling with 32 heads to form a 256-dimensional utterance embedding (Peng et al., 9 Dec 2025).
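The layer-aggregation and attention-pooling idea can be sketched with toy dimensions. This is a simplified illustration and not the MHFA implementation: it uses one aggregated stream for both attention scores and pooled values, random weights in place of learned ones, and small sizes (compression 4, 2 heads, embedding 6) where the text reports 128, 32, and 256.

```python
import math
import random


def softmax(xs):
    """Softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def matvec(mat, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


def aggregate_layers(hidden, layer_logits):
    """Weighted sum over layers; hidden[l][t] is the frame-t vector of layer l."""
    w = softmax(layer_logits)
    num_frames, dim = len(hidden[0]), len(hidden[0][0])
    return [[sum(w[l] * hidden[l][t][d] for l in range(len(hidden)))
             for d in range(dim)] for t in range(num_frames)]


def attention_pool(frames, proj, head_queries, out_proj):
    """Compress frames, pool over time once per head, map the concatenation to an embedding."""
    compressed = [matvec(proj, f) for f in frames]             # T x C
    pooled = []
    for q in head_queries:                                      # one query vector per head
        attn = softmax([sum(a * b for a, b in zip(q, c)) for c in compressed])
        for d in range(len(q)):
            pooled.append(sum(a * c[d] for a, c in zip(attn, compressed)))
    return matvec(out_proj, pooled)                             # (H * C) -> E


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


L, T, D, C, H, E = 3, 5, 8, 4, 2, 6                  # toy sizes, not the paper's
hidden = [rand_matrix(T, D) for _ in range(L)]       # stand-in for per-layer encoder outputs
frames = aggregate_layers(hidden, [0.0] * L)         # equal logits give a plain layer average
embedding = attention_pool(frames, rand_matrix(C, D), rand_matrix(H, C), rand_matrix(E, H * C))
```

In a trained system the layer logits, projection, head queries, and output map would all be learned jointly with the classifier, letting the backend draw on intermediate layers rather than only the final one.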
In that environmental sound deepfake setting, Dasheng-0.6B achieved 0.27% Dev EER and Dasheng-1.2B achieved 0.33% Dev EER, versus 4.75% for WavLM Base (Peng et al., 9 Dec 2025). The authors generalized this as a “distinct performance gap” between speech-specific SSLs and general audio SSLs for environmental sound deepfake detection (Peng et al., 9 Dec 2025). However, Dasheng was not used in the final submitted fusion systems, which were EAT-based, and no progress-phase or final-evaluation EERs were reported for Dasheng (Peng et al., 9 Dec 2025). This indicates strong suitability but incomplete exploration within that study.
In the WildSpoof SASV system, Dasheng played a more central role as the main general-audio front end for the countermeasure subsystem (Peng et al., 14 Dec 2025). The paper evaluated Mi-Dasheng-base, Mi-Dasheng-0.6B, and Mi-Dasheng-1.2B, again with MHFA and, in some cases, Distribution Uncertainty (DSU) augmentation applied to the Value stream (Peng et al., 14 Dec 2025). On out-of-domain ASVspoof5 Dev, Mi-Dasheng-base obtained 5.164% EER, Mi-Dasheng-0.6B 3.122%, Mi-Dasheng-0.6B + DSU 1.777%, Mi-Dasheng-1.2B 1.625%, and Mi-Dasheng-1.2B + DSU 1.193%, compared with 11.885% for WavLM Base+ and 12.090% for a ResNet18 baseline (Peng et al., 14 Dec 2025). The same paper noted that DSU slightly degraded in-domain results while improving OOD robustness, and that Mi-Dasheng-0.6B slightly outperformed Mi-Dasheng-1.2B in some in-domain settings, suggesting that larger capacity does not necessarily yield better performance (Peng et al., 14 Dec 2025).
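The EER figures quoted in this section come from the operating point where false acceptance and false rejection rates coincide. A minimal threshold-sweep sketch is given below. It assumes higher scores mean "more likely bona fide" and takes the midpoint at the closest crossing; official challenge toolkits may interpolate differently, and the scores are made up for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """EER in [0, 1], with higher scores meaning 'more likely bona fide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < th for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoof trials scored at or above the threshold.
        far = sum(s >= th for s in spoof_scores) / len(spoof_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


separable = equal_error_rate([0.9, 0.8], [0.1, 0.2])
overlapping = equal_error_rate([0.9, 0.7, 0.4, 0.8], [0.1, 0.5, 0.3, 0.2])
```

With the toy scores, the separable case has zero error, while the overlapping case misclassifies one of four trials on each side at the balanced threshold.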
These results clarify an important point. Dasheng’s main advantage in such tasks is not merely higher accuracy on matched-domain speech, but robustness to acoustic variability, unseen vocoders, and recording conditions. The forensic literature cited here consistently associates that robustness with Dasheng’s broad general-audio pretraining and with the use of intermediate-layer information rather than only final-layer semantics (Peng et al., 14 Dec 2025, Peng et al., 9 Dec 2025).
6. Generative extensions and the Dasheng ecosystem
Dasheng has also been extended into generative audio systems. MiDashengLM used Dasheng as an audio encoder for understanding, captioning, QA, prompting-based classification, and multilingual ASR within a decoder-only text generation framework (Dinkel et al., 6 Aug 2025). Dasheng AudioGen went further and used a Dasheng-derived latent interface for end-to-end text-to-audio scene generation (Mei et al., 27 May 2026).
Dasheng AudioGen introduced structured multi-view captions with fields such as <|caption|>, <|speech|>, <|asr|>, <|music|>, <|sfx|>, and <|env|>, paired with a unified semantic-acoustic latent space from DashengTokenizer operating at 25 Hz (Mei et al., 27 May 2026). The generator was a flow-matching DiT trained with a flow-matching objective and conditioned by cross-attention on text embeddings from FLAN-T5-Large (Mei et al., 27 May 2026). The authors argued that this high-dimensional semantic-acoustic latent space provided sufficient capacity to disentangle and fuse concurrent components such as speech, music, and sound effects (Mei et al., 27 May 2026).
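For orientation, one commonly used form of the conditional flow-matching objective, with a linear interpolation path between noise and data, is written below. The notation is assumed for illustration and may differ from the exact formulation in the Dasheng AudioGen paper: $x_0$ is Gaussian noise, $x_1$ a tokenizer latent sequence, $c$ the text conditioning, and $v_\theta$ the velocity field predicted by the DiT.

$$
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; (x_1, c)}
\left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2
$$

Under this form, generation integrates the learned velocity field from $t = 0$ to $t = 1$ starting from a noise sample, with the text embeddings entering through cross-attention at each step.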
Empirically, Dasheng AudioGen reported strong performance on mixed-scene generation. On the SMA category of the MECAT benchmark, it achieved FAD 2.17, FD 17.75, KL 0.63, CLAP 38.3, GLAP 9.52, WER 28.98%, and UTMOSv2 2.46, compared with markedly worse figures from an expert pipeline that independently generated speech, music, and sound effects and then mixed them (Mei et al., 27 May 2026). Human and PAFI evaluations were interpreted as showing that Dasheng AudioGen approached real recordings in mixed categories while remaining competitive with specialized models on single-type tasks (Mei et al., 27 May 2026).
Across these systems, the Dasheng ecosystem now spans at least four major roles: a frozen or fine-tuned general encoder (Dinkel et al., 2024), a contrastive audio backbone for multilingual retrieval and zero-shot tasks (Dinkel et al., 12 Jun 2025), an audio front-end for large audio-LLMs (Dinkel et al., 6 Aug 2025), and a latent substrate for unified scene generation (Mei et al., 27 May 2026). The recurring design principle is that general-audio pretraining should precede specialization rather than be replaced by separate speech, music, and sound encoders.
7. Limitations, interpretation, and future directions
Several limitations recur across the literature. First, Dasheng is balanced rather than universally optimal. The original paper showed that it was strong overall on HEAR, but CED-Base still led on some environment-focused tasks and specialized systems could remain better on pure speech settings (Dinkel et al., 2024). The ICME challenge report likewise showed that domain-tailored BEATs models or simple ensembling could outperform Dasheng on certain tasks (Bharadwaj et al., 22 Jan 2026). MiDashengLM explicitly reported weaker performance than Whisper-based models on several speech-centric benchmarks, including LibriSpeech, LibriCount, VoxLingua33, and Speech Commands (Dinkel et al., 6 Aug 2025).
Second, scaling is not uniformly monotonic. In the original HEAR study, average performance improved from Base to 0.6B to 1.2B (Dinkel et al., 2024). Yet in environmental sound deepfake detection, Dasheng-1.2B slightly underperformed Dasheng-0.6B on the Dev set (Peng et al., 9 Dec 2025), and in WildSpoof the paper stated that Mi-Dasheng-0.6B slightly outperformed Mi-Dasheng-1.2B in some settings (Peng et al., 14 Dec 2025). This suggests that downstream data scale, task mismatch, and adaptation protocol materially affect the benefit of larger backbones.
Third, context length and efficiency remain tradeoffs. The base Dasheng design is organized around 10 s clips during pretraining and 10.08 s native context in later LALM use, requiring chunking for longer audio (Dinkel et al., 2024, Dinkel et al., 6 Aug 2025). MiDashengLM turned this into an efficiency advantage, but the same paper identified long-audio handling as a limitation (Dinkel et al., 6 Aug 2025). Dasheng AudioGen similarly remained constrained to 10-second generation because its training clips were all 10 s long (Mei et al., 27 May 2026).
Fourth, reproducibility varies by downstream system. The original Dasheng paper emphasized public datasets and open code, with a repository at github.com/RicherMans/Dasheng (Dinkel et al., 2024). GLAP and MiDashengLM also released code and checkpoints (Dinkel et al., 12 Jun 2025, Dinkel et al., 6 Aug 2025). By contrast, Dasheng AudioGen relied on a 77k-hour private superset of ACAVCaps, which limits exact reproducibility even though the model design is documented (Mei et al., 27 May 2026).
Future directions are strongly implied by the collected work. GLAP suggests broader multilingual and multi-domain audio–text systems built atop Dasheng (Dinkel et al., 12 Jun 2025). MiDashengLM suggests more caption-level multilingual supervision, better long-audio handling, and integration with video or other modalities (Dinkel et al., 6 Aug 2025). The forensic papers suggest applying DSU and related robustness methods more systematically to Dasheng and studying layer-wise interpretability of forensic cues (Peng et al., 9 Dec 2025, Peng et al., 14 Dec 2025). The ICME report suggests that mixture-of-experts architectures might internalize the gains currently obtained through simple feature-level ensembling with Dasheng (Bharadwaj et al., 22 Jan 2026). Dasheng AudioGen suggests extensions toward longer-context scene generation, finer temporal control, and editing of existing recordings (Mei et al., 27 May 2026).
Taken together, Dasheng can be understood as a general-audio representation program rather than a single static model. Its original contribution was to show that masked autoencoder scaling on heterogeneous audio could yield one backbone competitive across speech, music, and environmental sound (Dinkel et al., 2024). Its later significance lies in how that backbone became a common substrate for retrieval, zero-shot inference, deepfake detection, speaker anti-spoofing, large audio-LLMs, encoder ensembles, and coherent scene generation (Dinkel et al., 12 Jun 2025, Peng et al., 9 Dec 2025, Peng et al., 14 Dec 2025, Bharadwaj et al., 22 Jan 2026, Dinkel et al., 6 Aug 2025, Mei et al., 27 May 2026).