MiDashengLM: Unified Audio Understanding
- MiDashengLM is a unified audio understanding model that shifts from ASR-centric alignment to a caption-first approach capturing diverse sound attributes.
- It employs a Transformer encoder-decoder design, combining a Dasheng audio encoder and a Qwen2.5-Omni-3B text decoder for efficient audio-to-text mapping.
- The model reports strong results on audio captioning and question-answering benchmarks, with up to 4× faster time-to-first-token and up to 20.2× higher throughput than comparable systems.
MiDashengLM is an open, efficient large audio-language model (LALM) for unified understanding of speech, environmental sounds, and music, introduced in "MiDashengLM: Efficient Audio Understanding with General Audio Captions" (Dinkel et al., 6 Aug 2025). A Transformer-based encoder-decoder system that combines a Dasheng audio encoder with a Qwen2.5-Omni-3B text decoder, it departs from ASR-centered alignment and instead uses general audio captions as the principal audio-text interface. The model is trained exclusively on publicly available alignment, pretraining, and supervised fine-tuning datasets, with checkpoints and code released, and is presented as a reproducible alternative to closed-data or proprietary large audio-language models (Dinkel et al., 6 Aug 2025).
1. Conceptual orientation
MiDashengLM is defined by a caption-first view of audio understanding. Rather than treating audio-text alignment primarily as transcription, it uses general audio captions that can jointly encode speech content, sound events, music attributes, and audio-quality or paralinguistic information. In the paper’s framing, this yields a holistic textual representation of complex audio scenes, in contrast to ASR-based alignment, which is described as speech-centric and comparatively narrow in scope (Dinkel et al., 6 Aug 2025).
The motivation is threefold. First, ASR-based pretraining uses only speech-like portions of audio and discards music, background sounds, reverberation, noise, silence, and non-speech structure. Second, transcript alignment is largely monotonic: the paper characterizes the objective as “trivial” in the sense that left-to-right alignment may encourage shallow correspondences rather than broader scene-level understanding. Third, transcripts do not naturally encode speaker attributes (gender, age, emotion) or recording conditions (quality, acoustics, reverberation, environmental context). MiDashengLM is positioned as a response to those limitations by unifying speech transcripts or summaries, sound captions, and music captions within a single textual target (Dinkel et al., 6 Aug 2025).
This design places the model closer to general-purpose audio understanding than to classical speech recognition. A plausible implication is that MiDashengLM should be evaluated not only on ASR but on captioning, QA, paralinguistics, music understanding, and mixed-scene reasoning, which is consistent with the benchmark suite reported in the paper.
2. Model architecture and interface design
MiDashengLM uses a standard Transformer encoder-decoder architecture with prefix-style alignment. Audio features are extracted by Dasheng, projected into the decoder embedding space through an MLP, and decoded autoregressively by Qwen2.5-Omni-3B using next-token prediction (Dinkel et al., 6 Aug 2025).
| Component | Instantiation | Function |
|---|---|---|
| Audio encoder | Dasheng | Encodes diverse auditory information |
| Projection | MLP | Maps audio features into decoder embedding space |
| Text decoder | Qwen2.5-Omni-3B | Generates text conditioned on audio prefix |
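The prefix-style interface described above can be illustrated with a toy sketch in plain Python. It is not the released implementation: the dimensions, the two-layer ReLU MLP, and the random weights are assumptions chosen only to show how projected audio features are placed in front of the text embeddings before decoding.

```python
import random

def matvec(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def mlp_project(frames, w1, w2):
    """Apply a two-layer ReLU MLP independently to every audio frame."""
    projected = []
    for frame in frames:
        hidden = [max(0.0, h) for h in matvec(w1, frame)]
        projected.append(matvec(w2, hidden))
    return projected

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Toy sizes (assumptions), far smaller than any real encoder or decoder.
D_AUDIO, D_HIDDEN, D_MODEL = 8, 16, 12
N_AUDIO_TOKENS, N_TEXT_TOKENS = 5, 3

audio_features = random_matrix(N_AUDIO_TOKENS, D_AUDIO, rng)   # stand-in for encoder output
text_embeddings = random_matrix(N_TEXT_TOKENS, D_MODEL, rng)   # stand-in for embedded text tokens

w1 = random_matrix(D_HIDDEN, D_AUDIO, rng)
w2 = random_matrix(D_MODEL, D_HIDDEN, rng)

# Project audio into the decoder embedding space, then prepend it to the text.
audio_prefix = mlp_project(audio_features, w1, w2)
decoder_input = audio_prefix + text_embeddings

print(len(decoder_input), len(decoder_input[0]))  # sequence length, embedding width
```

The decoder sequence length is the number of audio tokens plus the number of text tokens, which is why the audio-token rate matters for decoding cost.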
Dasheng-0.6B is described as an open-source, frame-level Vision Transformer-style audio encoder pretrained with a Masked Autoencoder objective on ACAV100M. The model uses 16 kHz audio and converts waveforms into 64-dimensional mel-spectrograms. Dasheng extracts 32 ms frame features with a 10 ms stride, then downsamples by a factor of 4 to 40 ms intervals (25 Hz). The full system ultimately operates at an audio-token framerate of 5 Hz, which implies a further temporal reduction before the decoder (Dinkel et al., 6 Aug 2025).
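The frame-rate figures above can be checked with a little arithmetic. The sketch below uses only the numbers quoted in the text; the additional 5× reduction from 25 Hz to 5 Hz is inferred from those endpoints rather than stated, and the helper name `audio_tokens` is introduced here for illustration.

```python
import math

SAMPLE_RATE_HZ = 16_000
FRAME_STRIDE_MS = 10
ENCODER_DOWNSAMPLE = 4
TARGET_TOKEN_RATE_HZ = 5

samples_per_stride = SAMPLE_RATE_HZ * FRAME_STRIDE_MS // 1000   # 160 samples per hop
frame_rate_hz = 1000 / FRAME_STRIDE_MS                          # 100 frames per second
encoder_rate_hz = frame_rate_hz / ENCODER_DOWNSAMPLE            # 25 Hz, one feature per 40 ms
extra_reduction = encoder_rate_hz / TARGET_TOKEN_RATE_HZ        # 5x, inferred from the endpoints

def audio_tokens(duration_s, token_rate_hz=TARGET_TOKEN_RATE_HZ):
    """Number of audio tokens handed to the decoder for a clip of the given length."""
    return math.ceil(duration_s * token_rate_hz)

print(frame_rate_hz, encoder_rate_hz, extra_reduction)
print(audio_tokens(10), audio_tokens(30))
```

At 5 Hz, a 30-second clip occupies 150 decoder positions and a 10-second clip occupies 50.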
The efficiency rationale follows directly from these choices. Dasheng supports variable-length inputs, avoids fixed-length padding, reduces encoder overhead, and shortens the sequence delivered to the decoder. Because decoder-side computation is identified as a major bottleneck, lower token rates and shorter effective audio prefixes translate into lower time-to-first-token and higher throughput. The paper contrasts this with Whisper-style systems that typically pad to 30 seconds even though most training samples are only 1–10 seconds long (Dinkel et al., 6 Aug 2025).
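To make the padding argument concrete, the following sketch compares decoder prefix lengths with and without fixed 30-second padding. It assumes the same 5 Hz token rate in both cases so that only the effect of padding is visible; real Whisper-based systems use their own token rates, and the clip durations are made up for illustration.

```python
import math

def prefix_lengths(durations_s, token_rate_hz=5.0, pad_to_s=30.0):
    """Return (variable-length, fixed-padding) audio token counts per clip."""
    variable = [math.ceil(d * token_rate_hz) for d in durations_s]
    padded = [math.ceil(max(d, pad_to_s) * token_rate_hz) for d in durations_s]
    return variable, padded

# Hypothetical clip durations within the 1-10 second range mentioned in the text.
clips = [1.0, 3.0, 5.0, 10.0]
variable, padded = prefix_lengths(clips)

# Share of padded prefix positions that carry no real audio.
wasted_fraction = 1 - sum(variable) / sum(padded)

print(variable, padded)
print(f"padding share: {wasted_fraction:.1%}")
```

For these example clips, most of the padded prefix would be padding, which is the overhead that variable-length encoding avoids.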
3. Data resources: ACAVCaps, MECAT, and MECAT-QA
A central enabling resource is ACAVCaps, the paper’s general audio caption dataset. It is derived from ACAV100M, chosen because it is public, large-scale, and contains multilingual speech, music, and environmental audio. Since ACAV100M is unlabeled, the authors construct an automatic curation pipeline in which CED-Base predicts AudioSet labels at a 2-second scale, multiple specialized models infer speech analysis, vocal analysis, music analysis, and environmental acoustics, and DeepSeek-R1 generates short captions from the resulting metadata (Dinkel et al., 6 Aug 2025).
The inferred metadata includes speech language, speaker identity or diarization, emotion, gender or age, transcript, pitch, timbre, vocal health, genre, instruments, tempo, mood, singing, acoustic scene, quality, noise, and reverberation. The resulting captions can therefore describe not merely what sound is present, but also how it sounds. The final split comprises ACAVCaps as the training set and MECAT as the evaluation benchmark, with ACAVCaps providing about 38,662 hours of general-caption training data (Dinkel et al., 6 Aug 2025).
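The last step of the curation pipeline, turning per-clip metadata into a captioning request, might look roughly like the sketch below. Everything in it is hypothetical: the field names, the prompt wording, and the helper `build_caption_prompt` are illustrative stand-ins, not the authors' actual schema or the prompt given to DeepSeek-R1.

```python
def format_value(value):
    """Render list-like metadata as a comma-separated string."""
    if isinstance(value, (list, tuple)):
        return ", ".join(str(v) for v in value)
    return str(value)

def build_caption_prompt(metadata):
    """Assemble a text prompt from non-empty metadata fields (hypothetical format)."""
    facts = [
        f"- {key}: {format_value(value)}"
        for key, value in sorted(metadata.items())
        if value not in (None, "", [], ())
    ]
    header = "Write one short caption describing this audio clip, using only the facts below:"
    return header + "\n" + "\n".join(facts)

# Hypothetical metadata for one clip; empty fields are dropped from the prompt.
example_metadata = {
    "speech_language": "English",
    "emotion": "",
    "instruments": ["piano", "violin"],
    "acoustic_scene": "concert hall",
    "reverberation": "strong",
    "transcript": None,
}

prompt = build_caption_prompt(example_metadata)
print(prompt)
```

Dropping empty fields keeps the prompt limited to what the upstream predictors actually produced, so the captioning model is not invited to describe attributes that were never inferred.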
MECAT is extracted from the curated corpus after license filtering and audio-text consistency scoring with GLAP. MECAT-QA is a further conversion of audio clips into question-answer pairs. The benchmark is reported to contain over 100,000 QA pairs, with 5 QA pairs per audio clip, spanning perception, analysis, and reasoning, and including direct perception, sound characteristics, quality assessment, environment reasoning, inference or judgement, and application context (Dinkel et al., 6 Aug 2025).
This dataset design marks the main representational shift in the work. The captions are not limited to sound-event descriptions; they can characterize pure speech, pure sound, pure music, and mixtures of the three. Consolidating these domains into a single textual target is the basis for the paper’s claim that alignment should target general audio understanding rather than transcription alone.
4. Training regime and objective
Training proceeds in three stages: audio-text alignment on ACAVCaps, pretraining on about 1.1M hours of public audio data, and supervised fine-tuning on a curated subset. The alignment stage is trained for 3 epochs; pretraining covers about 1.4 epochs on the full 1.1M hours; and supervised fine-tuning adds 1 epoch on a 352k-hour subset (Dinkel et al., 6 Aug 2025).
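As a rough scale check, the stage sizes and epoch counts quoted above can be multiplied out to estimate how many audio-hours the model processes per stage. The figures come from the text (using the roughly 38,662 hours of ACAVCaps for alignment); the totals are approximate because the source gives rounded values.

```python
# (stage name, dataset hours, epochs) as quoted in the text; values are approximate.
STAGES = [
    ("alignment", 38_662, 3),
    ("pretraining", 1_100_000, 1.4),
    ("supervised fine-tuning", 352_000, 1),
]

hours_seen = {name: hours * epochs for name, hours, epochs in STAGES}
total_hours_seen = sum(hours_seen.values())

for name, seen in hours_seen.items():
    print(f"{name:>24}: {seen:>12,.0f} audio-hours")
print(f"{'total':>24}: {total_hours_seen:>12,.0f} audio-hours")
```

Under these figures pretraining accounts for roughly three quarters of all audio-hours processed, with the alignment stage being comparatively small.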
The data composition is explicitly skewed toward speech: about 90% of the total training data is ASR-oriented speech data, while smaller portions cover sound, music, paralinguistics, and QA. The paper notes that careful sampling is required so that ASR does not dominate the other capabilities. This is an important point for interpreting the model: although the representational thesis is caption-first, the corpus mixture remains speech-heavy (Dinkel et al., 6 Aug 2025).
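The source does not say which sampling scheme is used to keep ASR from dominating. One common option, shown here purely as an assumption, is exponent-smoothed sampling, where each task is drawn with probability proportional to its data share raised to a power below 1. Only the 90% ASR share comes from the text; the split of the remaining 10% is invented for illustration.

```python
def smoothed_sampling_weights(share_by_task, alpha=0.5):
    """Sampling probabilities proportional to share**alpha (alpha=1 keeps raw proportions)."""
    scaled = {task: share ** alpha for task, share in share_by_task.items()}
    total = sum(scaled.values())
    return {task: value / total for task, value in scaled.items()}

# 90% ASR is from the text; the breakdown of the other 10% is hypothetical.
data_share = {"asr": 90.0, "sound": 4.0, "music": 3.0, "paralinguistics": 2.0, "qa": 1.0}

raw = smoothed_sampling_weights(data_share, alpha=1.0)
smoothed = smoothed_sampling_weights(data_share, alpha=0.5)

for task in data_share:
    print(f"{task:>16}: raw {raw[task]:.3f} -> smoothed {smoothed[task]:.3f}")
```

With an exponent of 0.5 the ASR share of sampled batches falls well below its raw 90% while the smaller tasks are upweighted, which is the qualitative effect the text describes.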
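The training objective is standard autoregressive cross-entropy conditioned on audio:
$$\mathcal{L} = -\sum_{t=1}^{T} \log p\left(x_t \mid x_{<t}, \mathbf{a}\right)$$
where $x_t$ is the current token, $x_{<t}$ are previous text tokens, and $\mathbf{a}$ denotes the audio features (Dinkel et al., 6 Aug 2025).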
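The autoregressive cross-entropy objective can be sketched numerically as follows. The sketch assumes the decoder has already produced one vector of vocabulary logits per target position, each conditioned on the audio features and the preceding text tokens; the function names and the toy vocabulary are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of vocabulary logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def autoregressive_nll(step_logits, target_ids):
    """Sum over positions of -log p(target token | previous tokens, audio).

    step_logits[t] is assumed to be the decoder output at position t, already
    conditioned on the audio prefix and on the text tokens before position t.
    """
    return -sum(log_softmax(logits)[target] for logits, target in zip(step_logits, target_ids))

# Toy example: vocabulary of 4 tokens, caption of 3 tokens.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
confident_logits = [[9.0, 0.0, 0.0, 0.0], [0.0, 9.0, 0.0, 0.0], [0.0, 0.0, 9.0, 0.0]]
targets = [0, 1, 2]

uniform_loss = autoregressive_nll(uniform_logits, targets)
confident_loss = autoregressive_nll(confident_logits, targets)
print(uniform_loss, confident_loss)
```

A model that assigns uniform probability over four tokens pays log 4 per position, while a model that concentrates probability on the correct tokens drives the loss toward zero.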
Optimization uses AdamW8bit with linear warmup for 1,000 iterations, followed by cosine decay to 10% of the maximum learning rate. Learning rates are specified separately for pretraining and supervised fine-tuning. LoRA is configured with rank 8, alpha 32, and dropout 0.1, and batch sizes are 10 for pretraining and 8 for SFT. The paper also states that frozen-LLM integration and LoRA-only approaches were explored during development but performed worse for the audio encoder (Dinkel et al., 6 Aug 2025).
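The learning-rate schedule described above (linear warmup for 1,000 iterations, then cosine decay to 10% of the peak) can be written as a small function. This is a generic sketch of that schedule shape: the peak rate of 1.0 and the total of 10,000 steps are placeholders, since this text does not give usable values for either.

```python
import math

def learning_rate(step, max_lr, total_steps, warmup_steps=1000, final_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay down to final_ratio * max_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * (final_ratio + (1.0 - final_ratio) * cosine)

# Placeholder values: max_lr and total_steps are not taken from the paper.
MAX_LR, TOTAL_STEPS = 1.0, 10_000

for step in (0, 499, 999, 1000, 5500, 10_000):
    print(step, round(learning_rate(step, MAX_LR, TOTAL_STEPS), 4))
```

The schedule reaches the peak at the end of warmup, passes through the midpoint between the peak and the floor halfway through the decay, and ends at 10% of the peak.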
5. Empirical characteristics
The empirical profile of MiDashengLM is strongest on general audio understanding. On the X-ARES benchmark, the Dasheng-based encoder is reported to outperform Whisper-Large v3 on 18 of 22 tasks. Whisper remains better on some strictly speech-centric tasks such as ASR, speaker counting, spoken language recognition, and keyword spotting, whereas the Dasheng-based encoder is better on sound, music, paralinguistic, retrieval, and scene-understanding tasks. Notable gains are listed for VoxCeleb1 (+195.6), DESED (+137.6), Clotho (+87.1), ESC-50 (+50.9), and GTZAN (+23.4) (Dinkel et al., 6 Aug 2025).
On captioning benchmarks, MiDashengLM reports 59.71 on MusicCaps, 45.39 on Songdescriber, 62.18 on AudioCaps, 49.20 on ClothoV2, and 66.52 on AutoACD. On MECAT, it achieves Overall FENSE = 57.53, compared with 43.80 for Qwen2.5-Omni and 36.32 for Kimi-Audio-Instruct. On MECAT-QA, it again leads with average 62.08 FENSE, versus 43.74 for Qwen2.5-Omni and 33.22 for Kimi-Audio-Instruct (Dinkel et al., 6 Aug 2025).
The model also performs strongly on several paralinguistic and classification-oriented tasks, including VoxCeleb1 at 92.36, VoxLingua107 at 93.41, VGGSound at 52.11, NSynth at 80.52, and FMA at 63.73. In QA settings, the reported numbers include 71.35 on MuChoMusic, 66.30 average on MMAU, 62.35 FENSE on MusicQA, and 54.31 FENSE on AudioCaps-QA (Dinkel et al., 6 Aug 2025).
ASR remains a supported but secondary capability. The paper reports 3.7 WER on LibriSpeech test-clean and 6.2 WER on test-other; AISHELL2 variants are around 2.9–3.2 CER/WER; and GigaSpeech 2 multilingual test sets are reported at 20.8 for Indonesian, 36.9 for Thai, and 18.1 for Vietnamese. The authors explicitly emphasize that classic English ASR performance is weaker than baseline systems, which they present as consistent with the model’s caption-first rather than ASR-first orientation (Dinkel et al., 6 Aug 2025).
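For readers unfamiliar with the metric, word error rate is the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a minimal implementation of that standard definition; it applies no text normalization, so it is not a substitute for the scoring pipeline used in the paper. Character error rate follows the same recipe over characters instead of words.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))   # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(word_error_rate("the cat sat on the mat", "the cat sat on the mat"))
```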
6. Efficiency, reproducibility, and limitations
Efficiency is a headline claim of the model. MiDashengLM is reported to provide up to 4× faster time-to-first-token, with an example comparison of about 160 ms versus 40 ms, and up to 20.2× higher throughput than comparable models. Throughput is reported in samples per second on an 80GB GPU with 30-second audio and 100-token output, with the comparison model running out of memory (OOM) at larger batch sizes:
| Batch size | MiDashengLM | Comparison model |
|---|---|---|
| 1 | 0.65 | 0.45 |
| 4 | 2.42 | 1.21 |
| 8 | 4.67 | 1.44 |
| 16 | 8.93 | OOM |
| 32 | 14.36 | OOM |
| 64 | 19.54 | OOM |
| 128 | 24.26 | OOM |
| 512 | 29.04 | OOM |
These figures correspond to speedups from 1.4× to 20.2× (Dinkel et al., 6 Aug 2025).
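The reported speedup range can be reproduced from the throughput figures. In the sketch below, the 1.4× value is the same-batch ratio at batch size 1, and the 20.2× value is recovered by dividing MiDashengLM's best throughput by the comparison model's best throughput before it runs out of memory. That second reading is an interpretation that matches the arithmetic, not something the text states explicitly.

```python
# batch size -> (MiDashengLM, comparison model) in samples per second; None marks OOM.
THROUGHPUT = {
    1: (0.65, 0.45),
    4: (2.42, 1.21),
    8: (4.67, 1.44),
    16: (8.93, None),
    32: (14.36, None),
    64: (19.54, None),
    128: (24.26, None),
    512: (29.04, None),
}

# Speedup at batch sizes where both models fit in memory.
same_batch_speedup = {
    batch: model / baseline
    for batch, (model, baseline) in THROUGHPUT.items()
    if baseline is not None
}

# Best achievable throughput for each model across all batch sizes it can run.
best_model = max(model for model, _ in THROUGHPUT.values())
best_baseline = max(baseline for _, baseline in THROUGHPUT.values() if baseline is not None)
peak_speedup = best_model / best_baseline

for batch, speedup in same_batch_speedup.items():
    print(f"batch {batch}: {speedup:.1f}x")
print(f"peak vs. peak: {peak_speedup:.1f}x")
```

On this reading, much of the headline gain comes from being able to run far larger batches on the same GPU, not only from faster processing at a fixed batch size.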
The paper attributes these results to variable-length support in Dasheng, reduced padding, the 5 Hz audio-token rate, and shorter effective decoder sequences. Taken together, these choices indicate that efficiency is not treated as a post hoc optimization but as a design principle spanning encoder selection, temporal downsampling, and multimodal interface construction.
Reproducibility is another explicit theme. MiDashengLM is trained only on publicly available datasets for alignment, pretraining, and SFT, and the work releases checkpoints and code. The stated rationale is that other researchers can recreate the setup, audit benchmark and data construction choices, compare results under transparent conditions, and reduce dependence on inaccessible systems (Dinkel et al., 6 Aug 2025).
Several limitations are also apparent. The model’s ASR performance is weaker than speech-specialized systems on classic benchmarks. The overall training mixture remains heavily speech-dominated despite the caption-centered framing. ACAVCaps is automatically curated and captioned with upstream predictors and a reasoning LLM, so caption quality plausibly depends on the reliability of those components. A common misconception would be to read MiDashengLM primarily as an ASR system with an LLM attached; the paper instead defines it as a broad audio reasoning model whose main contribution is the shift from ASR-centric alignment to general-caption alignment across speech, sound, and music (Dinkel et al., 6 Aug 2025).