Whisper-Flamingo: Multimodal AV Speech Recognition
- Whisper-Flamingo is a multimodal model that integrates audio cues from Whisper with visual inputs via gated cross-attention to enhance speech recognition in noisy environments.
- It uses a unified architecture that fuses the outputs of modality-specific encoders in the decoder (late fusion) together with modality dropout, enabling effective multilingual ASR and translation from a single parameter set without retraining.
- Extensions of the framework support challenging tasks such as whisper-mode (voiceless) speech adaptation and long-context audio reasoning for interactive applications, while adversarial robustness remains an inherited limitation rather than a solved problem.
Whisper-Flamingo is a family of multimodal, audio-visual speech recognition (AVSR) and audio LLMs that integrate the OpenAI Whisper architecture with visual feature injection strategies originating from the Flamingo model. This framework is designed to advance state-of-the-art performance in robust automatic speech recognition, multilingual AVSR, sequence-to-sequence translation, reasoning over long-form audio, whisper speech adaptation, and interactive voice-based tasks. Whisper-Flamingo models have been adapted and extended to cover English and multilingual AVSR (Rouditchenko et al., 14 Jun 2024), large-scale audio understanding and reasoning (Goel et al., 10 Jul 2025), whisper-mode and privacy-sensitive speech (Li et al., 28 Sep 2025), and robust ASR under adversarial and noisy conditions (Olivier et al., 2022).
1. Motivation and Architectural Foundations
Whisper-Flamingo addresses fundamental limitations in both unimodal ASR systems and previous AVSR approaches. Whisper, trained on 680k+ hours of audio, demonstrates strong robustness to out-of-distribution and environmental noise but lacks mechanisms for handling visual cues and multimodal integration—crucial for disambiguation in severe noise or ambiguous speech. AVSR models can leverage lip-based video but are typically restricted by the scarcity of large-scale, labeled audiovisual datasets. By adapting Whisper to accept visual features through a Flamingo-style gated cross-attention mechanism, Whisper-Flamingo capitalizes on both the decoding power of Whisper and the noise resilience provided by visual context.
Key design principles:
- Gated cross-attention: Visual information is injected via specialized cross-attention layers with learnable scaling parameters, initialized to act as identity maps to preserve Whisper’s pretrained decoding (see Section 2 for the mathematical formulation).
- Late- and multi-modal fusion: Each modality is processed via its own encoder and fused only in the decoder, allowing for flexible training strategies such as modality dropout.
- Unified parameter set: Whisper-Flamingo can handle transcription (ASR), sequence-to-sequence translation, and multilingual AVSR without retraining or language-specific specialization.
2. Cross-Modal Integration: Gated Cross Attention
The integration of visual features is achieved by inserting a gated cross-attention mechanism at the beginning of each decoder block in the Whisper architecture. Visual features (typically extracted using AV-HuBERT encoders from lip crops at 25 fps) are fused with decoder representations as follows:

$$\mathbf{x} \leftarrow \mathbf{x} + \tanh(\alpha_{\mathrm{xattn}}) \cdot \mathrm{CrossAttn}(\mathrm{LN}(\mathbf{x}),\, \mathbf{v})$$
$$\mathbf{x} \leftarrow \mathbf{x} + \tanh(\alpha_{\mathrm{ff}}) \cdot \mathrm{FFN}(\mathrm{LN}(\mathbf{x}))$$

where $\mathbf{x}$ is the decoder input, $\mathbf{v}$ represents the projected visual features, $\mathrm{LN}$ is layer normalization, and $\mathrm{FFN}$ is the feed-forward module. The gating parameters $\alpha_{\mathrm{xattn}}$ and $\alpha_{\mathrm{ff}}$ are initialized at zero, ensuring identity initialization. Only the cross-attention and adapter layers are updated during fine-tuning on AVSR datasets, leaving the core Whisper parameters frozen (Rouditchenko et al., 14 Jun 2024). This approach preserves pre-trained audio recognition capabilities and enables effective learning of cross-modal cues even with limited video training data.
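The following PyTorch sketch illustrates one way such a gated cross-attention block could be realized; the class and parameter names (`GatedCrossAttentionBlock`, `alpha_xattn`, `visual_proj`) and the width choices are illustrative assumptions, not the released Whisper-Flamingo implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative sketch of a Flamingo-style gated cross-attention block
    inserted before each Whisper decoder block (hypothetical names/shapes)."""

    def __init__(self, d_model: int, n_heads: int, d_visual: int):
        super().__init__()
        self.visual_proj = nn.Linear(d_visual, d_model)  # project AV-HuBERT lip features
        self.ln_xattn = nn.LayerNorm(d_model)
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_ffn = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Gates start at zero -> tanh(0) = 0, so the block acts as an identity
        # map at initialization and Whisper's pretrained decoding is preserved.
        self.alpha_xattn = nn.Parameter(torch.zeros(1))
        self.alpha_ffn = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        v = self.visual_proj(visual)                       # (B, T_v, d_model)
        attn_out, _ = self.xattn(self.ln_xattn(x), v, v)   # cross-attend text -> video
        x = x + torch.tanh(self.alpha_xattn) * attn_out
        x = x + torch.tanh(self.alpha_ffn) * self.ffn(self.ln_ffn(x))
        return x
```

In practice, only these newly added parameters (plus any adapters) would be marked trainable while the pretrained Whisper weights remain frozen, mirroring the fine-tuning recipe described above.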
3. Multilingual and Noise-Robust Extensions
mWhisper-Flamingo generalizes the audio-visual fusion paradigm to nine languages by introducing a multi-encoder, late-fusion architecture. Key novelties include:
- Decoder Modality Dropout: During training, the model is exposed to both complete audio-visual pairs and unimodal (visual-only) samples, chosen per example with preset dropout probabilities. This procedure enhances the decoder’s ability to adapt to missing or noisy modalities, improving robustness in noisy environments (Rouditchenko et al., 3 Feb 2025); a sketch of this sampling scheme and the decoder-side fusion appears after this list.
- Multilingual Pretraining: The underlying Whisper encoder and decoder are multilingual, while the AV-HuBERT visual encoder is re-trained on a multilingual corpus. Fusion is performed within the decoder: the audio and visual embedding sequences $\mathbf{A}$ and $\mathbf{V}$ are mapped by modality-specific linear projections $W_a$ and $W_v$ before being combined.
- Noise robustness: On the MuAViC benchmark, mWhisper-Flamingo achieves significant non-English WER improvements (e.g., from 48.0% to 43.7% in medium-size models at 0 dB SNR), demonstrating consistent superiority over audio-only baselines in challenging conditions.
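As a concrete illustration of decoder modality dropout and the modality-specific projections, here is a minimal PyTorch sketch; the dropout rate, the concatenation-based combiner, and all names are assumptions for exposition rather than the exact mWhisper-Flamingo recipe.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Sketch of late fusion with decoder modality dropout (hypothetical
    names; the real fusion operator and dropout rates may differ)."""

    def __init__(self, d_audio: int, d_visual: int, d_model: int, p_drop_audio: float = 0.5):
        super().__init__()
        self.proj_audio = nn.Linear(d_audio, d_model)    # modality-specific projection W_a
        self.proj_visual = nn.Linear(d_visual, d_model)  # modality-specific projection W_v
        self.p_drop_audio = p_drop_audio                 # chance of a visual-only sample

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        a = self.proj_audio(audio)    # (B, T_a, d_model)
        v = self.proj_visual(visual)  # (B, T_v, d_model)
        if self.training and torch.rand(()) < self.p_drop_audio:
            # Decoder modality dropout: feed visual-only context so the decoder
            # learns to transcribe when audio is missing or heavily corrupted.
            return v
        # Assumed combiner: concatenate projected streams along the time axis
        # before the decoder cross-attends to them.
        return torch.cat([a, v], dim=1)
```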
4. Audio Reasoning, Long Context, and Dialogue
Audio Flamingo 3 (AF3) extends Whisper-Flamingo to encompass:
- AF-Whisper Encoder: A unified encoder that adapts the Whisper-Large-v3 backbone to process speech, environmental sounds, and music via joint representation learning. Acoustic inputs are transformed into mel-spectrograms, processed into dense audio features, pooled, and projected into the text embedding space to serve as LLM prompts (Goel et al., 10 Jul 2025); a minimal sketch of this pooling-and-projection step follows the list below.
- Flexible reasoning ("+Think" mode): Incorporation of chain-of-thought annotations (AF-Think dataset), enabling the model to perform explicit, intermediate reasoning before answer generation.
- Multi-turn, multi-audio dialogue: Chat-oriented finetuning with AF-Chat data allows the model to process conversational threads involving several audio inputs and reference back to prior context.
- Long audio chains: Via curriculum learning (progressing from 30-second to 10-minute clips), AF3 maintains context and reasoning across extended audio sequences, outperforming closed and open models on long-form audio benchmarks.
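To make the AF-Whisper prompt-construction step concrete, the sketch below shows how pooled encoder features might be projected into an LLM’s embedding space; `AudioPromptAdapter`, the pooling factor, and the dimensions are hypothetical placeholders rather than the AF3 implementation.

```python
import torch
import torch.nn as nn

class AudioPromptAdapter(nn.Module):
    """Illustrative front end: encoder frames are pooled and projected into
    the LLM's text embedding space to act as soft prompt tokens."""

    def __init__(self, d_audio: int = 1280, d_llm: int = 4096, pool: int = 4):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)  # temporal downsampling
        self.proj = nn.Linear(d_audio, d_llm)                    # map into LLM embedding space

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (B, T, d_audio) dense features from the audio encoder
        x = self.pool(enc_feats.transpose(1, 2)).transpose(1, 2)  # (B, T/pool, d_audio)
        return self.proj(x)                                       # (B, T/pool, d_llm) prompt embeddings

# Usage sketch (hypothetical variable names): prepend the projected audio
# tokens to the text prompt embeddings before running the LLM.
# prompt_embeds = torch.cat([audio_adapter(enc_feats), text_embeds], dim=1)
```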
5. Whisper Speech Adaptation and Multimodal Alignment
The adaptation of Whisper-Flamingo to whisper-mode (voiceless speech) recognition—demonstrated on the AISHELL6-Whisper dataset—incorporates the following specialized mechanisms (Li et al., 28 Sep 2025):
- Parallel audio training: Paired whisper and normal speech utterances are processed through the same encoder, with the cross-entropy loss encouraging alignment of the acoustic embedding spaces across modalities.
- Spectral projection for whisper speech: For the whisper branch, an additional projection layer (a 2-layer MLP with ReLU and a residual connection) is applied:

$$\tilde{\mathbf{h}} = \mathbf{h} + W_2\,\mathrm{ReLU}(W_1 \mathbf{h})$$

with the first linear layer $W_1$ initialized by Kaiming normal initialization and the output layer $W_2$ initialized to zero, preserving the identity mapping at initialization and allowing controlled adaptation (a minimal sketch of this module follows the list below).
- Audio-visual fusion: Lip-based visual features are again added via gated cross-attention, further reducing error rates in the challenging whisper regime. The final model achieves a Character Error Rate (CER) of 4.13% for whisper speech, establishing new state-of-the-art performance on both AISHELL6-Whisper and the wTIMIT benchmark.
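A minimal sketch of the residual projection module under the stated initialization scheme might look as follows; the class name and hidden width are assumptions, while the residual form and the Kaiming/zero initialization follow the description above.

```python
import torch
import torch.nn as nn

class WhisperSpeechProjection(nn.Module):
    """Residual 2-layer MLP applied to the whisper-speech branch (sketch)."""

    def __init__(self, d_model: int, d_hidden: int | None = None):
        super().__init__()
        d_hidden = d_hidden or d_model
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        # Kaiming-normal init for the first linear, matching the ReLU nonlinearity.
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity="relu")
        nn.init.zeros_(self.fc1.bias)
        # Zero-init the output layer so the module starts as the identity map
        # (the residual path passes h through unchanged at initialization).
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.fc2(torch.relu(self.fc1(h)))
```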
6. Robustness and Adversarial Vulnerabilities
Despite robust performance in challenging naturalistic and noisy conditions, Whisper-Flamingo and related Whisper-based systems inherit susceptibility to adversarial perturbations as detailed for Whisper (Olivier et al., 2022):
- Projected Gradient Descent (PGD) and targeted Carlini–Wagner (CW) attacks: Small perturbations, norm-bounded in $\ell_2$ or $\ell_\infty$ and corresponding to SNR levels of 35–45 dB, can cause dramatic increases in WER (35–90 percentage points) or force attackers’ chosen transcriptions in 50–90% of trials.
- Language confusion attacks: By adversarially shifting the language detector, system reliability for low-resource languages can degrade (WER exceeding 100% in some cases).
- Defenses: Randomized smoothing via the addition of Gaussian noise provides partial mitigation for untargeted PGD attacks but introduces a 6–15 pp WER penalty on clean inputs. Adversarial training is suggested, but not yet fully explored for the Whisper-Flamingo family.
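As an illustration of the randomized-smoothing defense mentioned above, the sketch below adds Gaussian noise to the waveform before transcription and aggregates over a few noisy draws; `model.transcribe`, `sigma`, and the majority-vote aggregation are simplifying assumptions rather than the evaluated defense.

```python
import torch

def smooth_transcribe(model, waveform: torch.Tensor, sigma: float = 0.01, n_samples: int = 4) -> str:
    """Randomized-smoothing sketch for ASR: transcribe several Gaussian-noised
    copies of the input and keep the most frequent hypothesis. The noise scale
    trades clean-input accuracy for robustness to untargeted perturbations."""
    hypotheses = []
    for _ in range(n_samples):
        noisy = waveform + sigma * torch.randn_like(waveform)  # add Gaussian noise
        hypotheses.append(model.transcribe(noisy))             # placeholder ASR call
    # Simplistic aggregation: majority vote over exact string matches.
    return max(set(hypotheses), key=hypotheses.count)
```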
7. Practical Performance, Applications, and Open Resources
Whisper-Flamingo’s architecture and training methodology yield state-of-the-art results across a wide range of ASR and AVSR benchmarks:
| Model/Variant | Task | Benchmark | Clean (WER/CER) | Noisy (WER/CER) | Notes |
|---|---|---|---|---|---|
| Whisper-Flamingo Lg. | ASR (En) | LRS3 | 1.5% | 5.6% | Babble 0 dB SNR |
| Whisper-Flamingo | AVSR (En-X) | MuAViC | 1.3–1.6% | ~7.2% | 6 translation languages |
| mWhisper-Flamingo | AVSR (multi) | MuAViC | SOTA | +10% rel. | 9 langs, 0 dB SNR |
| AISHELL6-Whisper | AVSR (Zh) | AISHELL6-w | 4.13% CER | – | Whisper-mode speech |
| Audio Flamingo 3 | Reasoning | MMAU, ClothoAQA | SOTA | SOTA | 20+ audio understanding benchmarks |
Whisper-Flamingo’s versatility enables:
- Robust ASR in real-world noisy environments
- Multilingual and cross-lingual speech-to-text and translation from a unified parameterization
- Domain adaptation to specialty scenarios (whisper, clinical, privacy-sensitive speech)
- Large-context audio reasoning and multi-turn dialogue in audio intelligence
- Open-source datasets and codebases (e.g., AISHELL6-Whisper; https://zutm.github.io/AISHELL6-Whisper) supporting reproducibility and further research
Ongoing challenges include improving adversarial robustness, extending support for more languages where parallel AV data are limited, and further optimizing the fusion and alignment of modalities for both efficiency and robustness.
References
- Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation (Rouditchenko et al., 14 Jun 2024)
- mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition (Rouditchenko et al., 3 Feb 2025)
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio LLMs (Goel et al., 10 Jul 2025)
- AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines (Li et al., 28 Sep 2025)
- There is more than one kind of robustness: Fooling Whisper with adversarial examples (Olivier et al., 2022)
- Teach me with a Whisper: Enhancing LLMs for Analyzing Spoken Transcripts using Speech Embeddings (Hasan et al., 2023)