MiMo-Audio: MIMO & Multimodal Audio

Updated 4 July 2026

MiMo-Audio is a research paradigm that keeps multi-channel and multimodal audio data in structured form rather than collapsing it into a single stream, with the aim of preserving spatial and structural information.
It supports diverse applications including speech enhancement, localization, diarization, and cross-modal knowledge transfer using architectures from CNNs to Transformers.
The paradigm extends to unified audio–language modeling, enabling few-shot learning and robust inference across audio and text modalities.

MiMo-Audio is a label used across several strands of contemporary audio research to denote multi-input multi-output or multimodal audio formulations. In the literature covered here, it refers to systems that preserve multiple channels rather than collapsing them to a single stream, transfer knowledge from multimodal teachers to audio-only students, couple audio with video for diarization, extraction, or spatial authoring, exploit acoustic MIMO structure for private or direction-preserving playback, and, in one case, name a unified audio–LLM that treats text and audio as a single interleaved sequence (Li et al., 2020, Feng et al., 2024, Ning et al., 2024, Chaman et al., 2018, Cheng et al., 2024, Team et al., 29 Dec 2025).

1. Terminological scope and core formulations

In multichannel speech processing, a MIMO system consumes $C_{\text{in}}$ simultaneous microphone channels and produces $C_{\text{out}}$ enhanced channels. This generalizes both single-channel enhancement and MISO processing, in which many inputs are combined into one output. The specific motivation for retaining multiple outputs is that downstream tasks or end users may require multi-channel signals for spatial audio rendering, multi-point capture, acoustic analytics, or later re-beamforming or localization (Li et al., 2020). In other words, the MIMO formulation is not only an architectural choice; it encodes a commitment to preserving spatial structure.

The same label is extended in audiovisual and multimodal systems. In federated learning, MiMo-Audio is used to denote multimodal-to-unimodal audio distillation, where an audiovisual teacher transfers decision knowledge into an audio-only student deployable on clients lacking video (Feng et al., 2024). In speaker diarization, the MIMO idea appears as a unified sequence-to-sequence framework that accepts audio-only, video-only, or audio-visual inputs and emits multiple target-speaker activity streams simultaneously (Cheng et al., 2024). In audio language modeling, “MiMo” is defined as unified multi-input, multi-output modeling across text and audio, with a decoder-only backbone that takes both text and audio tokens as inputs and can generate either text or audio as outputs (Team et al., 29 Dec 2025).

Usage	Representative formulation	Representative paper
Multichannel enhancement	$C_{\text{in}}=N$, $C_{\text{out}}=N$ waveform mapping with bottleneck compression	(Li et al., 2020)
Multimodal distillation	Audiovisual teacher to unimodal audio student in federated learning	(Feng et al., 2024)
Audio-visual diarization	Multi-input features and multi-output target-speaker VAD streams	(Cheng et al., 2024)
Audio–language modeling	Interleaved text and audio sequence with next-token prediction	(Team et al., 29 Dec 2025)

This breadth of usage suggests that MiMo-Audio functions less as a single standardized architecture than as a recurring design pattern: preserve structured inputs, preserve or recover structured outputs, and avoid reducing audio to a single monolithic representation when multi-channel or cross-modal structure is operationally useful.

2. Multichannel compression and enhancement

A canonical early instantiation is the MIMO speech compression and enhancement system based on a convolutional denoising autoencoder (CDAE). The framework is trained end-to-end in a MIMO configuration, with an encoder on the edge device and a decoder on the server or cloud. The encoder maps noisy multichannel waveforms $Y=[y_1,\ldots,y_N]$ to a low-dimensional bottleneck $Z$, and the decoder reconstructs enhanced multichannel speech $\hat{Y}=[\hat{y}_1,\ldots,\hat{y}_N]$. In the reported configuration, $C_{\text{in}}=N=7$, $C_{\text{out}}=N=7$, and the compression block reduces the channel dimension to $C_{\text{num}}=1$, yielding a $7\times$ reduction before transmission (Li et al., 2020).

The system operates directly on raw time-domain waveforms rather than through STFT, masks, or iSTFT. This is significant because phase is preserved implicitly, and the framework bypasses magnitude-only time–frequency processing. Two CDAE variants were investigated. The fully convolutional network uses convolutional blocks built from Conv1D, BatchNorm, and LeakyReLU; its encoder contains four feature-extractor convolutional blocks and a compression block, while the decoder mirrors this through decompression blocks and a reconstruction block with tanh activation producing seven enhanced output channels. The Sinc FCN replaces the first convolutional block in the encoder with a learnable SincConv layer using band-pass filters of the form

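$$g[n; f_1, f_2] = 2 f_2\,\mathrm{sinc}(2\pi f_2 n) - 2 f_1\,\mathrm{sinc}(2\pi f_1 n),$$

where $f_1$ and $f_2$ are the learnable low and high cutoff frequencies and $\mathrm{sinc}(x)=\sin(x)/x$. The design is motivated by fewer parameters and interpretable frequency-selective kernels (Li et al., 2020).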
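As an illustration of the band-pass parameterization, the sketch below builds one sinc-based kernel as the difference of two low-pass filters, following the standard SincNet form. The function name `sinc_bandpass_kernel`, the normalized cutoffs, the kernel length, and the Hamming window are assumptions chosen for illustration, not values from Li et al. (2020).

```python
import math

def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the removable singularity at 0."""
    return 1.0 if x == 0 else math.sin(x) / x

def sinc_bandpass_kernel(f1, f2, length=101):
    """Band-pass FIR kernel as the difference of two sinc low-pass filters.

    f1, f2 are cutoffs in cycles per sample (0 < f1 < f2 < 0.5).
    A Hamming window (an assumption here) tapers the truncated response.
    """
    if not (0 < f1 < f2 < 0.5):
        raise ValueError("need 0 < f1 < f2 < 0.5")
    if length % 2 == 0:
        raise ValueError("length must be odd so the kernel is symmetric")
    half = length // 2
    kernel = []
    for i in range(length):
        n = i - half
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
        kernel.append(ideal * window)
    return kernel

# Hypothetical cutoffs: pass roughly 0.05 to 0.15 cycles per sample.
kernel = sinc_bandpass_kernel(f1=0.05, f2=0.15)
```

Only the two cutoffs would be learned per filter in such a layer, which is where the parameter saving over a free-form convolution kernel comes from.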

Training uses paired noisy and clean multichannel targets with mean-squared error across channels:

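$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert \hat{y}_i - s_i \right\rVert_2^2,$$

where $s_i$ denotes the clean reference waveform for channel $i$.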
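A minimal sketch of a channel-averaged mean-squared error follows. Averaging over both channels and samples is an assumption, since the exact normalization is not stated in the text; `multichannel_mse` and the toy signals are hypothetical.

```python
def multichannel_mse(enhanced, clean):
    """Mean-squared error averaged over all channels and samples.

    enhanced, clean: lists of channels, each a list of float samples.
    """
    if len(enhanced) != len(clean):
        raise ValueError("channel counts differ")
    total, count = 0.0, 0
    for est, ref in zip(enhanced, clean):
        if len(est) != len(ref):
            raise ValueError("channel lengths differ")
        total += sum((e - r) ** 2 for e, r in zip(est, ref))
        count += len(ref)
    return total / count

# Two toy channels of three samples each.
clean = [[0.0, 1.0, 0.0], [1.0, 0.0, -1.0]]
enhanced = [[0.0, 0.5, 0.0], [1.0, 0.0, -0.5]]
loss = multichannel_mse(enhanced, clean)
```

Because every output channel contributes to the loss, the network is penalized for degrading any single microphone signal, not only an aggregate.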

The corpus is TMHINT Mandarin speech, recorded at 16 kHz with a 7-microphone array. Training uses 250 sentences mixed with 8 noise types at SNRs $C_{\text{out}}$ 3 dB, generating 35,000 noisy–clean pairs; testing uses 70 sentences with 4 unseen noise types and the same SNR grid, yielding 9,800 noisy samples (Li et al., 2020).

Averaged across all SNRs and channels, the noisy input scores PESQ 1.825 and STOI 0.678, MIMO-SCE(S) (the Sinc FCN variant) scores PESQ 2.890 and STOI 0.750, and MIMO-SCE(F) (the FCN variant) scores PESQ 2.927 and STOI 0.801. Relative to the noisy input, the FCN variant therefore gains 1.102 PESQ points and 0.123 STOI points in absolute terms, while also reducing transmission data by a factor of 7 (Li et al., 2020). The reported limitations are that some high-frequency components are degraded and that STOI does not improve over the noisy input at 5 and 10 dB, likely reflecting mild speech distortion introduced by compression. The MIMO bottleneck is thus effective but not perceptually lossless.
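The headline gains and the compression factor can be checked with a few lines of arithmetic. The scores are the averages quoted above; the variable and function names are illustrative.

```python
# Average scores quoted in the text (Li et al., 2020).
scores = {
    "Noisy":       {"PESQ": 1.825, "STOI": 0.678},
    "MIMO-SCE(S)": {"PESQ": 2.890, "STOI": 0.750},
    "MIMO-SCE(F)": {"PESQ": 2.927, "STOI": 0.801},
}

def absolute_gain(system, metric, baseline="Noisy"):
    """Absolute improvement of `system` over `baseline`, rounded to 3 decimals."""
    return round(scores[system][metric] - scores[baseline][metric], 3)

# Seven input channels are squeezed into a one-channel bottleneck.
channels_in, channels_bottleneck = 7, 1
compression_factor = channels_in / channels_bottleneck

pesq_gain = absolute_gain("MIMO-SCE(F)", "PESQ")
stoi_gain = absolute_gain("MIMO-SCE(F)", "STOI")
```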

A later direction-preserving MIMO enhancement formulation addresses a different problem: not compression, but blind online enhancement that preserves inter-channel directional cues. In this setting, a lightweight OnlineSpatialNet estimates a scale-normalized Cholesky factor of the frequency-domain noise covariance, and a direction-preserving MIMO Wiener filter reconstructs enhanced multichannel speech while maintaining the spatial characteristics of both the target and the residual noise (Deppisch, 13 Apr 2026). This later line shifts MiMo-Audio from bandwidth reduction toward downstream spatial fidelity.

3. Localization and diarization as MIMO inference

MIMO formulations have also been used to avoid the limitations of single-output inference in localization and diarization. MIMO-DoAnet addresses direction-of-arrival estimation with an unknown number of sound sources. Rather than mapping multi-channel input to a single overall spatial pseudo-spectrum, it predicts, for up to $C_{\text{out}}$ 4 sound sources, independent SPS codings and per-output activity. Input features combine the magnitude spectrogram of the first channel with interaural phase differences, while per-source spatial covariance matrices are derived from complex-valued ratio filters and passed through parallel SPS estimators (Yin et al., 2022).

The inference rule is deliberately simple. For output $C_{\text{out}}$ 5,

$C_{\text{out}}$ 6

and output $C_{\text{out}}$ 7 is active if $C_{\text{out}}$ 8, with $C_{\text{out}}$ 9 working best in the reported experiments. The DoA estimate is then

$C_{\text{in}}=N$ 0

This reframes source counting from global peak picking in an entangled SPS to per-output presence detection (Yin et al., 2022).
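The per-output decision can be sketched as follows, assuming each output head produces an SPS over a discrete angle grid and that activity is scored by the SPS peak. The peak-based score, the threshold of 0.5, the 10-degree grid, and the function names are assumptions for illustration; the text does not specify them.

```python
def decode_sources(sps_outputs, angles_deg, threshold=0.5):
    """Return one DoA estimate per active output.

    sps_outputs: list of per-output spatial pseudo-spectra (lists of floats).
    angles_deg:  candidate angles, same length as each SPS.
    An output counts as active when its SPS peak exceeds `threshold`.
    """
    estimates = []
    for sps in sps_outputs:
        if len(sps) != len(angles_deg):
            raise ValueError("SPS length must match the angle grid")
        peak_index = max(range(len(sps)), key=sps.__getitem__)
        if sps[peak_index] > threshold:
            estimates.append(angles_deg[peak_index])
    return estimates

angles = list(range(0, 360, 10))

def bump(center, height):
    """Toy triangular SPS peaking at `center` degrees."""
    return [max(0.0, height * (1 - abs(a - center) / 30)) for a in angles]

# Three output heads; the third has a weak peak and is treated as inactive.
outputs = [bump(40, 0.9), bump(200, 0.8), bump(300, 0.2)]
doas = decode_sources(outputs, angles)
source_count = len(doas)
```

The source count falls out as the number of active outputs, so no global peak picking over a shared spectrum is needed.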

Empirically, at SIR $C_{\text{in}}=N$ 1 dB, MIMO-DoAnet improves F1 from 0.7159 to 0.8492 in 3-source scenes and from 0.5861 to 0.7880 in 4-source scenes, corresponding to relative and absolute gains of 18.6% and 13.3%, and 34.4% and 20.2%, respectively. In small included-angle tests with source separation below $C_{\text{in}}=N$ 2, the gains remain large, indicating that the multi-output design alleviates both threshold sensitivity and the minimum angular separation assumption (Yin et al., 2022).

Speaker diarization extends the same logic to multiple target speakers. MIMO-TSVAD defines a unified sequence-to-sequence framework with multi-input features $C_{\text{in}}=N$ 3, $C_{\text{in}}=N$ 4, $C_{\text{in}}=N$ 5, and $C_{\text{in}}=N$ 6, and multi-output framewise voice activities $C_{\text{in}}=N$ 7. It emits three branches:

audio-based output $C_{\text{in}}=N$ 8,
video-based output $C_{\text{in}}=N$ 9,
mixed output $C_{\text{out}}=N$ 0 (Cheng et al., 2024).

The architecture consists of an audio extractor based on ResNet-34 over 80-dim log Mel-filterbank energies, a video extractor based on ResNet18-3D over grayscale lip ROIs at $C_{\text{out}}=N$ 1 and 25 FPS, a Conformer encoder with modality-dependent and shared sublayers, and a speaker-wise multi-task decoder. Robustness to missing modalities is induced by model-level modality masking and data-level modality masking. Speaker Alignment further addresses off-screen speakers by matching acoustically derived and visually derived speaker representations via the Hungarian algorithm (Cheng et al., 2024).
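Speaker Alignment can be illustrated as an assignment problem between acoustic and visual speaker representations. The sketch below uses cosine similarity and an exhaustive search over permutations, which reaches the same optimum as the Hungarian algorithm for small speaker counts. The similarity measure, the toy embeddings, the equal-count restriction, and the function names are assumptions, not details from Cheng et al. (2024).

```python
import math
from itertools import permutations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def align_speakers(audio_embs, visual_embs):
    """Match each audio speaker to a distinct visual speaker.

    Returns a tuple `match` where audio speaker i is paired with visual
    speaker match[i], maximizing total cosine similarity. Brute force is
    O(n!), so this stands in for the Hungarian algorithm only for small n.
    """
    if len(audio_embs) != len(visual_embs):
        raise ValueError("this sketch assumes equal speaker counts")
    n = len(audio_embs)
    sim = [[cosine(a, v) for v in visual_embs] for a in audio_embs]
    return max(permutations(range(n)),
               key=lambda perm: sum(sim[i][perm[i]] for i in range(n)))

# Toy embeddings: the visual list is a shuffled, slightly perturbed copy.
audio = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
visual = [[0.0, 0.9, 0.1], [0.2, 0.1, 1.1], [0.9, 0.0, 0.1]]
match = align_speakers(audio, visual)
```

Handling off-screen speakers would require unequal counts on the two sides, which this simplified version deliberately leaves out.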

The reported DERs are 4.18% on VoxConverse, 10.10% on DIHARD-III, and 8.15% on MISP 2022. On MISP 2022, the stagewise ablation shows audio-only DER 23.35%, video-only DER 15.01%, audio-visual with lip profiles DER 10.06%, and Stage 4 mixed with Speaker Alignment DER 8.15% (Cheng et al., 2024). These results establish MIMO not merely as a multichannel signal-processing device, but as a multi-output structural prior for inference problems where multiple latent sources must remain disentangled.

4. Multimodal transfer, extraction, and authoring

In multimodal federated learning, MiMo-Audio denotes multimodal-to-unimodal audio distillation. ModalityMirror addresses modality heterogeneity when some clients are audio-only and others are multimodal. Phase 1 performs modality-aware aggregation: the video encoder $C_{\text{out}}=N$ 2 is aggregated only from multimodal clients, while the audio encoder $C_{\text{out}}=N$ 3 is aggregated from both audio-only and multimodal clients. Phase 2 performs federated knowledge distillation on multimodal clients only, where the audio student minimizes

$C_{\text{out}}=N$ 4

with $C_{\text{out}}=N$ 5 produced locally by the global audiovisual teacher (Feng et al., 2024).
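Phase 1 can be sketched as FedAvg applied separately to each encoder, over only the clients that actually hold it. The client dictionaries, the flat weight lists, and the sample-count weighting are illustrative assumptions.

```python
def fedavg(weight_lists, sample_counts):
    """Sample-weighted average of equally shaped flat weight vectors."""
    total = sum(sample_counts)
    size = len(weight_lists[0])
    return [sum(w[i] * n for w, n in zip(weight_lists, sample_counts)) / total
            for i in range(size)]

def modality_aware_aggregate(clients):
    """Aggregate the audio encoder from all clients and the video encoder
    only from clients that have one (the multimodal clients)."""
    audio = fedavg([c["audio"] for c in clients], [c["n"] for c in clients])
    mm = [c for c in clients if c.get("video") is not None]
    video = fedavg([c["video"] for c in mm], [c["n"] for c in mm]) if mm else None
    return {"audio": audio, "video": video}

clients = [
    {"n": 100, "audio": [1.0, 2.0], "video": [10.0]},  # multimodal
    {"n": 300, "audio": [3.0, 4.0], "video": None},    # audio-only
    {"n": 100, "audio": [5.0, 6.0], "video": [20.0]},  # multimodal
]
global_model = modality_aware_aggregate(clients)
```

The audio-only client influences the audio encoder in proportion to its data but never touches the video encoder, which is the point of the modality-aware split.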

The audio backbone is SSAST initialized from publicly available pre-trained weights; the visual backbone is ResNet-18 initialized from ImageNet pretraining; fusion is late fusion by concatenation. Experiments on UCF101 and ActivityNet use 200 communication rounds, 1 local epoch per round, 10 randomly sampled clients per round, learning rate $C_{\text{out}}=N$ 6, and FedAvg aggregation (Feng et al., 2024). On UCF101, Harmony yields 36.93 audio top-1 accuracy across video-missing rates $C_{\text{out}}=N$ 7, while ModalityMirror yields 41.77±0.30, 40.12±0.43, 40.43±0.64, 39.51±0.81, and 40.48±0.37. On ActivityNet, Harmony yields 14.35 audio top-5 accuracy, while ModalityMirror yields up to 15.82±0.02 (Feng et al., 2024). The method therefore operationalizes MiMo-Audio as a way to improve deployable audio-only models by transferring multimodal decision boundaries back into the weaker modality.

A related audiovisual strand concerns target-speaker extraction under impaired visual conditions. MeMo augments standard AV-TSE backbones with two adaptive memory banks: a speaker bank storing compact speaker embeddings and a contextual bank storing recent speech embeddings. At streaming step $C_{\text{out}}=N$ 8, the fused representation is

$C_{\text{out}}=N$ 9

When visual input is missing, the visual stream is zeros and the memories carry what the paper calls “attentional momentum” (Li et al., 21 Jul 2025). Reported online results under impaired visuals show SI-SNR improvements of at least 2 dB over corresponding baselines: TDSE 8.13 dB to 10.34 dB, USEV 7.40 dB to 9.47 dB, and BSRNN 7.98 dB to 10.98 dB (Li et al., 21 Jul 2025). This suggests a MiMo-Audio interpretation in which robustness comes from maintaining a cross-modal attention state even when one modality disappears.

MIMOSA applies a multimodal pipeline to computational spatial audio authoring for video. Its backend uses Faster R-CNN object detection, Global-Local Path Networks for depth estimation, source separation using pre-trained universal or supervised models, PANNs for audio tagging, and cross-modal association by category name matching. The frontend exposes the resulting intermediate artifacts—object tracks, source tracks, associations, and spatial positions—for direct manipulation in 2D and 3D, with real-time rendering through WebAudio PannerNode(s) to monaural, stereo, quadraphonic, or 5.1 output (Ning et al., 2024).
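Cross-modal association by category name can be sketched as a lookup between detector labels and audio tags. The label strings, the dictionary layout, and the first-unused-match policy are assumptions; the label vocabularies of the detector and the tagger are not given in the text.

```python
def associate_by_category(object_tracks, source_tracks):
    """Pair each separated source with a detected object of the same category.

    object_tracks: list of dicts with "id" and "label" (from object detection).
    source_tracks: list of dicts with "id" and "tag" (from audio tagging).
    Returns {source_id: object_id or None}; each object is used at most once.
    """
    used = set()
    pairs = {}
    for src in source_tracks:
        pairs[src["id"]] = None
        for obj in object_tracks:
            same = obj["label"].lower() == src["tag"].lower()
            if same and obj["id"] not in used:
                pairs[src["id"]] = obj["id"]
                used.add(obj["id"])
                break
    return pairs

objects = [{"id": "obj0", "label": "Dog"}, {"id": "obj1", "label": "person"}]
sources = [{"id": "src0", "tag": "person"}, {"id": "src1", "tag": "dog"},
           {"id": "src2", "tag": "wind"}]
associations = associate_by_category(objects, sources)
```

Sources without a visual counterpart stay unassigned (`None`), which fits a workflow where the user can inspect and correct associations by hand.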

MIMOSA explicitly rejects end-to-end “black-box” spatialization as the only design pattern. Its evaluation reports external subjective ratings on six videos with five audio conditions: Immersion averages MA 1.95, RA 4.51, OA 2.76, DA 4.47, and UA 6.03; Realism averages RA 6.03, MA 5.68, UA 5.58, OA 3.96, and DA 3.67 (Ning et al., 2024). A lab study with 15 participants reports Usefulness 6.47, Immersion of results 6.20, Expressiveness 6.27, and Ease of use 5.87. In this usage, MiMo-Audio is neither compression nor inference alone; it is an interpretable audiovisual production workflow.

5. Spatial propagation, privacy, and direction preservation

A different lineage uses MIMO acoustics to control what is heard, where, and with what spatial coherence. Multipath-enabled private audio models a reverberant room as a multi-speaker, multi-listener acoustic MIMO channel:

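$$y_m[n] = \sum_{l=1}^{L} (h_{ml} * x_l)[n], \qquad m = 1,\ldots,M,$$

where $x_l$ is the signal emitted by loudspeaker $l$, $h_{ml}$ is the room impulse response from loudspeaker $l$ to listening position $m$, and $*$ denotes convolution.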

Two methods are proposed. Multichannel convolutional synthesis by noise makes each loudspeaker emit filtered random noise so that the signals descramble into meaningful messages only at designated focus spots. Nullspace-projected noise instead synthesizes the intended message and adds artificial noise in the nullspace of the listeners’ channel, so it vanishes at target listeners and jams elsewhere (Chaman et al., 2018).
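The nullspace idea can be sketched at a single frequency, where the channel from $L$ loudspeakers to $M$ target listeners is a complex $M \times L$ matrix. The code removes from a noise vector every component the listeners can observe, using Gram-Schmidt orthogonalization. The narrowband simplification, the random toy channel, and all names are assumptions rather than the procedure of Chaman et al. (2018).

```python
import random

def inner(u, v):
    """Hermitian inner product <u, v> = sum(conj(u_i) * v_i)."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def project_to_nullspace(H, noise, eps=1e-12):
    """Project `noise` onto the nullspace of H (rows = listeners).

    H applied to the result is zero up to rounding, so the projected
    noise cancels at the target listeners.
    """
    basis = []
    for row in H:
        # H @ n == 0 means n is orthogonal to the conjugate of every row.
        v = [h.conjugate() for h in row]
        for q in basis:
            c = inner(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = abs(inner(v, v)) ** 0.5
        if norm > eps:
            basis.append([vi / norm for vi in v])
    out = list(noise)
    for q in basis:
        c = inner(q, out)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out

rng = random.Random(0)

def crandn():
    """Toy complex Gaussian sample."""
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))

L_speakers, M_listeners = 6, 2
H = [[crandn() for _ in range(L_speakers)] for _ in range(M_listeners)]
noise = [crandn() for _ in range(L_speakers)]
jam = project_to_nullspace(H, noise)

# Noise amplitude received at each target listener (should be near zero).
leak = [abs(sum(h * x for h, x in zip(row, jam))) for row in H]
```

At any position whose channel vector is not in the span of the listeners' channels, the same jamming signal does not cancel, which is what degrades intelligibility away from the targets.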

The work emphasizes that echoes are beneficial rather than detrimental: reverberation improves conditioning and spatial diversity. Simulations in a $Y=[y_1,\ldots,y_N]$ 1 room and real-room experiments in a $Y=[y_1,\ldots,y_N]$ 2 office with 6 loudspeakers and 2 focus spots show high STOI at focus locations and substantial degradation 50 cm away; the nullspace approach yields lower STOI outside target spots than MCCS (Chaman et al., 2018). Here, MiMo-Audio denotes acoustic privacy enabled by exploiting room impulse responses as structured MIMO channels.

The direction-preserving enhancement work continues the emphasis on spatial consistency but with microphone arrays rather than loudspeaker delivery. Its signal model is $Y=[y_1,\ldots,y_N]$ 3 in the STFT domain, with noise covariance estimated online through a scale-normalized Cholesky factor:

$Y=[y_1,\ldots,y_N]$ 4

The estimated covariance is combined with a direction-preserving MIMO Wiener filter

$Y=[y_1,\ldots,y_N]$ 5

followed by identity mixing via $Y=[y_1,\ldots,y_N]$ 6, with $Y=[y_1,\ldots,y_N]$ 7, $Y=[y_1,\ldots,y_N]$ 8, and $Y=[y_1,\ldots,y_N]$ 9 in the experiments (Deppisch, 13 Apr 2026).

Using 30,000 simulated scenes at 32 kHz with a 6-microphone circular array, the proposed OnlineSpatialNet attains SI-SDR 9.37 dB, $Z$ 0, NR 11.72 dB, CovSim 0.93, SpeechSim 0.83, and NoiseSim 0.89, versus a mask-based NICE baseline at SI-SDR 8.50 dB, $Z$ 1, NR 12.11 dB, CovSim 0.92, SpeechSim 0.82, and NoiseSim 0.90. The proposed estimator uses 0.82 M parameters and 23.23 GFLOPs/s, compared with 2.54 M parameters and 59.71 GFLOPs/s for NICE (Deppisch, 13 Apr 2026). Downstream delay-and-sum beamforming SI-SDR improves from $Z$ 2 dB unprocessed to 5.61 dB, and binaural ILD error drops from 1.27 dB to 0.28 dB. This establishes a direction-preserving interpretation of MiMo-Audio in which the central object is not a single enhanced channel, but a spatially coherent multichannel field.

6. MiMo-Audio as a unified audio–LLM

The most explicit use of MiMo-Audio as a proper model name is the 7B-scale audio LLM introduced as “MiMo-Audio: Audio LLMs are Few-Shot Learners.” In this work, MiMo-Audio is a unified, generative audio–LLM that treats audio and text as a single interleaved sequence and learns by next-token prediction at scale. Text is tokenized with a vocabulary size of 151,680. Audio is tokenized by a dedicated MiMo-Audio-Tokenizer using residual vector quantization; the tokenizer operates on 24 kHz waveform input and produces 25 Hz latent frames, with 8 RVQ layers exposed to the LLM, giving 200 audio tokens per second (Team et al., 29 Dec 2025).

To reduce the audio token rate seen by the LLM, the architecture groups four consecutive 25 Hz frames into 6.25 Hz audio patches. A 6-layer bidirectional Transformer patch encoder converts these grouped frames to LLM inputs, and a 16-layer causal Transformer patch decoder expands LLM hidden states back to audio tokens with 8 independent output heads and a delay mechanism across RVQ layers (Team et al., 29 Dec 2025). The backbone is MiMo-7B-Base, a decoder-only Transformer with 36 layers, hidden size 4096, 32 heads, FFN size 11008, and context length 8192. The unified autoregressive objective is
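The token-rate arithmetic and the frame grouping can be sketched as follows. The group size of four is inferred from the quoted 25 Hz and 6.25 Hz rates; `group_frames` and the toy token values are hypothetical, and the learned patch encoder is not modeled.

```python
FRAME_RATE_HZ = 25.0   # tokenizer latent frames per second
RVQ_LAYERS = 8         # codebook layers exposed to the LLM
PATCH_RATE_HZ = 6.25   # patch rate seen by the LLM

tokens_per_second = FRAME_RATE_HZ * RVQ_LAYERS    # 200.0
group_size = int(FRAME_RATE_HZ / PATCH_RATE_HZ)   # 4

def group_frames(frames, size):
    """Split a list of frames into consecutive patches of `size` frames.

    Each frame is a list of RVQ token ids; a trailing partial patch is dropped.
    """
    usable = len(frames) - len(frames) % size
    return [frames[i:i + size] for i in range(0, usable, size)]

# One second of toy audio: 25 frames, each with 8 token ids.
frames = [[t * 10 + layer for layer in range(RVQ_LAYERS)] for t in range(25)]
patches = group_frames(frames, group_size)
```

Each patch carries 32 token ids (4 frames of 8 layers), so the LLM sees a sequence one quarter as long as the frame sequence.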

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right),$$

where $x_t$ may be a text token or an audio patch (Team et al., 29 Dec 2025).
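The unified objective amounts to next-element negative log-likelihood over a sequence whose elements may come from either modality. The sketch uses a hypothetical `predict` callable that returns a probability distribution over the next element given the prefix; the uniform toy model and the element names are assumptions for illustration only.

```python
import math

def sequence_nll(sequence, predict):
    """Sum of -log p(x_t | x_<t) over an interleaved sequence.

    predict(prefix) must return a dict mapping candidate next elements
    to probabilities.
    """
    nll = 0.0
    for t, target in enumerate(sequence):
        probs = predict(sequence[:t])
        nll -= math.log(probs[target])
    return nll

VOCAB = ["<text:hello>", "<text:world>", "<audio:p0>", "<audio:p1>"]

def uniform_model(prefix):
    """Toy stand-in for the LLM: ignores the prefix entirely."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

# Text and audio elements share one sequence and one loss.
interleaved = ["<text:hello>", "<audio:p0>", "<audio:p1>", "<text:world>"]
loss = sequence_nll(interleaved, uniform_model)
```

In the described model an audio patch expands into several RVQ tokens through the patch decoder; treating each patch as a single symbol here is a simplification.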

Pretraining proceeds in two stages, starting from MiMo-7B-Base. The understanding stage trains the patch encoder and LLM with loss on text tokens only over 2.6T tokens. The understanding–generation stage trains patch encoder, LLM, and patch decoder over 5T tokens, including speech continuation, speech–text interleaving, ASR, TTS, audio captioning, instruction-following TTS, and text-only pretraining, with text-guided 5:5 interleaving and audio-token loss weights $Z$ 7. The overall pretraining scale exceeds one hundred million hours of in-the-wild audio (Team et al., 29 Dec 2025).

The paper reports that few-shot capabilities emerge after roughly 0.7T tokens of pretraining. MiMo-Audio-7B-Base attains SpeechMMLU 5-shot scores of S2S 69.1, S2T 69.5, T2S 71.5, and T2T 72.5, with a modality gap of 3.4 between T2T and S2S. On MMAU 5-shot, it reaches 66.0 overall, with 67.6 on Speech, 65.2 on Sound, and 65.3 on Music (Team et al., 29 Dec 2025). MiMo-Audio-7B-Instruct, obtained by post-training on 100B tokens, reports MMSU overall 61.70 and 62.88 with “+Think,” MMAU overall 74.90, MMAR 63.60, MMAU-Pro 53.35, Big Bench Audio S2T 72.90 and S2S 60.20, and InstructTTSEval overall 72.59 in English and 70.52 in Chinese (Team et al., 29 Dec 2025).

The model also generalizes to tasks absent from its training data, including voice conversion, style transfer, speech editing, speech denoising, speech-to-speech translation, and long-horizon speech continuation such as talk shows, debates, livestreams, and recitations (Team et al., 29 Dec 2025). At the same time, the paper notes limitations: in-context learning remains limited for complex general audio generation, spoken dialogue can exhibit timbre discontinuities and quality drops, and the “thinking” mechanism can hallucinate on sound and music tasks. These caveats are important because they distinguish benchmark-level generalization from uniformly reliable behavior across all audio regimes.

Across these lines of work, MiMo-Audio designates a family of representations and systems that resist premature collapsing of audio structure. Whether the objective is $7\times$ transmission reduction with multichannel recovery, disentangled per-source localization, unified audio-visual diarization, multimodal knowledge transfer into audio-only models, human-steerable spatial authoring, private message delivery in reverberant rooms, direction-preserving enhancement, or few-shot audio–language modeling, the recurring principle is that preserving multiplicity (of channels, sources, modalities, or output types) can be more valuable than optimizing a single monolithic stream.