
AISHELL6-Whisper: AVSR Benchmark

Updated 4 October 2025
  • AISHELL6-Whisper is a large-scale audio-visual dataset featuring over 60 hours of paired whisper and normal speech with synchronized lip videos from 167 native speakers.
  • The project introduces a two-stage training framework that uses a projection layer to align whisper embeddings and gated cross-attention to fuse visual cues with audio processing.
  • The baseline reduces the whisper speech character error rate from 18.93% to 4.13%, supporting advances in privacy-sensitive and clinical ASR applications.

AISHELL6-Whisper is a large-scale, open-source Mandarin Chinese audio-visual whisper speech dataset with paired normal speech and synchronized lip videos, complemented by a competitive audio-visual speech recognition baseline built on the Whisper-Flamingo framework. The project targets the development and benchmarking of speech recognition systems tailored to whisper speech, which is characterized by the absence of vocal fold vibration, low energy, and spectral differences relative to normal speech. Such systems are crucial in privacy-sensitive communication, clinical settings for patients under vocal restraint, and noise-sensitive environments. The dataset and open-source code facilitate research addressing the unique acoustic-phonetic and multimodal challenges inherent to whisper speech (Li et al., 28 Sep 2025).

1. Dataset Structure and Composition

AISHELL6-Whisper comprises approximately 30 hours each of whisper and normal speech, yielding more than 60 hours in total. The recordings come from 167 native Mandarin speakers, each reading 10–20 minutes of non-overlapping poetry texts. For 121 speakers, the speech is accompanied by synchronized frontal RGBD facial videos (1280×720 px, 25 fps) with an unobstructed view of the mouth region; audio is captured by a microphone positioned one meter in front of the speaker so that it does not occlude the face.

The dataset is partitioned into training, validation, and test sets in a 4:1:1 ratio, with speakers strictly split to avoid overlap across sets. Each split maintains a balanced mix of whisper and normal speech as well as paired video data, so that models can be evaluated consistently on both the parallel audio-only and audio-visual subsets. Detailed statistics, including total utterances and hours per split, are tabulated in the original work.
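
A minimal sketch of how such a speaker-disjoint split could be reproduced is given below; the record layout, shuffling seed, and rounding policy are illustrative assumptions rather than the dataset's published split script.

```python
# Illustrative speaker-disjoint 4:1:1 split (not the official split script);
# the 'speaker_id' field, shuffling seed, and rounding policy are assumptions.
import random
from collections import defaultdict

def speaker_disjoint_split(utterances, ratios=(4, 1, 1), seed=0):
    """utterances: iterable of dicts, each carrying a 'speaker_id' key."""
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker_id"]].append(utt)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    total = sum(ratios)
    n_train = round(len(speakers) * ratios[0] / total)
    n_val = round(len(speakers) * ratios[1] / total)
    groups = {
        "train": speakers[:n_train],
        "val": speakers[n_train:n_train + n_val],
        "test": speakers[n_train + n_val:],
    }
    # No speaker appears in more than one split, keeping evaluation speaker-independent.
    return {name: [u for s in spk for u in by_speaker[s]]
            for name, spk in groups.items()}
```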

2. Methodological Innovations

The baseline audio-visual speech recognition (AVSR) system leverages a two-stage training framework:

Stage 1: Audio-Only Training with Parallel Strategy

  • The backbone is a pre-trained OpenAI Whisper model. Paired utterances of whisper and normal speech are processed in parallel through a shared encoder and decoder.
  • A parallel training loss is used:

$$\mathcal{L} = \mathcal{L}_w + \mathcal{L}_n$$

where $\mathcal{L}_w$ and $\mathcal{L}_n$ are the cross-entropy losses for the whisper and normal speech transcripts, respectively.

  • To mitigate the spectral gap between normal and whisper speech—specifically, the absence of a discernible fundamental frequency in whisper—an additional projection layer (structured Linear→ReLU→Linear; initialized to approximate identity) refines the whisper embedding:

$$E'_w = E_w + \mathrm{projection\_layer}(E_w)$$

ensuring a smooth adaptation of the acoustic space.
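
As a hedged illustration of this two-branch objective, the following sketch combines the parallel loss with the residual projection; it assumes a Hugging Face Whisper backbone and that the projection acts on the whisper-branch encoder output, both of which are assumptions made for illustration rather than details confirmed by the released code.

```python
# Hedged sketch of the Stage 1 parallel objective, assuming a Hugging Face
# Whisper backbone and that the residual projection is applied to the
# whisper-branch encoder output; names and placement are illustrative.
import torch.nn as nn
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
d_model = model.config.d_model

# Residual projection (Linear -> ReLU -> Linear) applied to the whisper branch only.
projection_layer = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
)

def parallel_loss(feats_whisper, feats_normal, labels_whisper, labels_normal):
    # Whisper branch: refine encoder embeddings, E'_w = E_w + projection(E_w).
    enc_w = model.model.encoder(feats_whisper).last_hidden_state
    enc_w = enc_w + projection_layer(enc_w)
    out_w = model(encoder_outputs=(enc_w,), labels=labels_whisper)

    # Normal-speech branch: same shared encoder and decoder, no projection.
    out_n = model(input_features=feats_normal, labels=labels_normal)

    # L = L_w + L_n (cross-entropy on whisper and normal transcripts).
    return out_w.loss + out_n.loss
```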

Stage 2: Audio-Visual Fine-Tuning

  • For samples with video, facial landmarks are extracted with RetinaFace; the lip region is cropped around the computed mouth center with a crop width given by:

$$x_{\text{center}},\ y_{\text{center}} = \frac{p_2 + p_3}{2}$$

$$\text{width} = \min\left\{3.2 \times d_{\text{MN}},\ 2 \times \max\left(d_{\text{MN}},\ d_{p_1 p_2}\right)\right\}$$

providing consistent input to the AV-HuBERT encoder.

  • Extracted visual features are integrated into the Whisper decoder through gated cross-attention modules at the start of each block, as in the Flamingo paradigm, enabling the model to leverage both audio and visual cues.
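
The crop geometry above can be pictured with the following small sketch; the landmark identities (p1 as the nose tip, p2 and p3 as the mouth corners from RetinaFace's five-point output) and the square crop shape are assumptions made for illustration.

```python
# Geometric sketch of the mouth-centered lip crop. Landmark identities are
# assumptions: p1 = nose tip, p2 / p3 = left / right mouth corners (as in
# RetinaFace's five-point output); d_MN = distance from mouth center to nose.
import numpy as np

def lip_crop_box(p1, p2, p3):
    """p1, p2, p3: (x, y) landmark coordinates as NumPy arrays."""
    center = (p2 + p3) / 2.0                    # mouth center
    d_mn = np.linalg.norm(center - p1)          # mouth-to-nose distance
    d_p1p2 = np.linalg.norm(p2 - p1)
    width = min(3.2 * d_mn, 2.0 * max(d_mn, d_p1p2))
    half = width / 2.0
    x_c, y_c = center
    # Square crop (x_min, y_min, x_max, y_max) around the mouth center.
    return (x_c - half, y_c - half, x_c + half, y_c + half)
```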
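
The gated fusion itself might look like the following Flamingo-style sketch; the single-gate arrangement, head count, and the assumption that AV-HuBERT features are already projected to the decoder width are illustrative rather than the released implementation.

```python
# Illustrative Flamingo-style gated cross-attention block: visual features
# attend into the decoder stream, scaled by a tanh gate initialized at zero
# so the model starts out identical to the audio-only one.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> no visual influence at init

    def forward(self, x, visual_feats):
        # x: (B, T_text, d_model) decoder hidden states entering a block
        # visual_feats: (B, T_video, d_model) lip-video features
        attn_out, _ = self.attn(self.norm(x), visual_feats, visual_feats)
        return x + torch.tanh(self.gate) * attn_out
```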

3. Performance Benchmarks

On the AISHELL6-Whisper test set:

  • The baseline system achieves a Character Error Rate (CER) of 4.13% for whisper speech and 1.11% for normal speech.
  • This marks a substantial improvement versus the direct baseline Whisper model, which yielded a whisper speech CER of 18.93% prior to applying the parallel projection and audio-visual training strategy.
  • External benchmarking on the wTIMIT dataset demonstrates new state-of-the-art results when models pretrained on AISHELL6-Whisper are further fine-tuned, for both the US and Singaporean accent variants. The architecture and hyperparameters are disclosed in the released code base.
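
For reference, CER here is understood as the character-level Levenshtein distance divided by the reference length; a generic sketch of the metric (not the paper's scoring script) is:

```python
# Generic character error rate (CER): Levenshtein distance between the
# hypothesis and reference character sequences divided by the reference length.
def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    dp = list(range(len(hyp) + 1))          # edit distances against an empty reference
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,            # deletion
                      dp[j - 1] + 1,        # insertion
                      prev + (r != h))      # substitution (or match)
            prev, dp[j] = dp[j], cur
    return dp[-1] / max(len(ref), 1)
```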

4. Technical Mechanisms and Loss Functions

  • The model is trained end-to-end with a total loss that is the sum of the cross-entropy losses for whisper and normal speech.
  • The projection layer parameters are initialized using Kaiming normal initialization to support the ReLU operation in the intermediate layer, with the output layer initialized to zero to ensure the initial operation is an identity mapping.
  • For the integration of video, the mouth-centered lip crop is standardized across all video samples, and the resulting embeddings from AV-HuBERT are injected via gated cross-attention at each decoder block, facilitating robust multimodal alignment.
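
A hedged PyTorch sketch of this Linear→ReLU→Linear projection and its initialization:

```python
# Sketch of the projection layer and its initialization: Kaiming-normal for
# the first Linear (which feeds the ReLU), zeros for the second Linear so the
# residual branch is an exact identity at step zero.
import torch.nn as nn

class WhisperProjection(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_model)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(d_model, d_model)
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity="relu")
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.weight)  # zero output ...
        nn.init.zeros_(self.fc2.bias)    # ... so E_w + proj(E_w) == E_w at init

    def forward(self, e_w):
        return e_w + self.fc2(self.act(self.fc1(e_w)))  # residual refinement
```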

5. Applications and Societal Relevance

AISHELL6-Whisper directly addresses several real-world and domain-specific challenges:

  • Privacy-sensitive communication: Whisper speech is essential in contexts where acoustic privacy is needed, such as confidential conversations or settings requiring discretion.
  • Clinical and Medical Use: For patients with impaired phonation (e.g., due to surgery, injury, or disease), robust whisper speech recognition systems restore basic communication.
  • Noise-Sensitive and Multimodal Environments: By incorporating lip movements, the AVSR system maintains high recognition accuracy even when audio quality is degraded, enabling reliable speech recognition in noisy or adverse acoustic conditions.
  • Research Facilitation: The scale and multimodality of AISHELL6-Whisper make it an ideal benchmark for next-generation research in audio-visual speech processing, model adaptation, and multimodal alignment.

6. Implications and Future Directions

The techniques employed in AISHELL6-Whisper establish both methodological and empirical benchmarks. The parallel training strategy and spectrally adaptive projection layer are immediately transferable to future research on whisper and “non-standard” phonation, while the modular design allows direct incorporation of advanced visual encoders or alternative audio backbones.

The dataset's release is expected to catalyze research into privacy-preserving ASR, medical speech technologies, and multimodal learning for under-resourced speech styles. The baseline models and code, openly available, provide a platform for reproducibility and extension.

A plausible implication is that success in this endeavor may motivate the creation of similar multimodal datasets for other languages and speaking conditions, furthering the robustness and inclusivity of modern speech recognition systems.
