Overview of "u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality"
The paper introduces u-HuBERT, a self-supervised pre-training framework that handles both multimodal and unimodal speech data. By combining a unified masked cluster prediction objective with modality dropout during pre-training, the model learns modality-agnostic representations whose downstream performance rivals state-of-the-art modality-specific models. Notably, it supports zero-shot modality transfer: a model fine-tuned with labeled data from one modality can be applied to other modalities at test time, without requiring labeled data for every input configuration.
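To make the objective concrete, here is a minimal sketch of a masked cluster prediction loss in PyTorch: cross-entropy against pre-computed discrete cluster targets (e.g., from k-means over features of an earlier model iteration), scored only at masked frames. The function name and tensor shapes are illustrative assumptions, not taken from the released code.

```python
import torch.nn.functional as F

def masked_cluster_prediction_loss(logits, cluster_targets, mask):
    """Cross-entropy over pre-computed cluster IDs, scored only at masked frames.

    logits:          (batch, frames, num_clusters) predictions from the Transformer
    cluster_targets: (batch, frames) discrete cluster IDs (e.g., k-means labels)
    mask:            (batch, frames) boolean, True where the input was masked
    """
    masked_logits = logits[mask]            # (num_masked, num_clusters)
    masked_targets = cluster_targets[mask]  # (num_masked,)
    return F.cross_entropy(masked_logits, masked_targets)
```

The same prediction target is used whether the masked input is audio-visual, audio-only, or visual-only, which is what makes the objective unified across modalities.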
Key Contributions
- Unified Pre-training Framework: u-HuBERT generalizes the AV-HuBERT model to pre-train using both multimodal (audio-visual) and unimodal (audio-only) data. This approach facilitates learning modality-agnostic representations that can be utilized across various speech processing tasks.
- Modality Dropout: Randomly dropping one modality during pre-training simulates conditions where the audio or visual stream is absent. This is integral to learning generalized representations that support zero-shot transfer between modalities (a minimal sketch follows this list).
- State-of-the-Art Performance: The unified model achieves Word Error Rates (WERs) of 1.2%/1.4%/27.2% for audio-visual/audio/visual input on the LRS3 dataset. This performance is competitive with current best-in-class modality-specific models.
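As a rough illustration of the modality dropout mentioned above, the sketch below zeroes out one modality per utterance before fusing the streams by channel-wise concatenation (as in AV-HuBERT). The function name, shapes, and probabilities are assumptions for illustration; the released implementation differs in its details.

```python
import torch

def fuse_with_modality_dropout(audio_feats, video_feats, p_drop=0.5, p_drop_audio=0.5):
    """Zero out one modality (sampled per utterance) before fusion.

    audio_feats, video_feats: (frames, dim) frame-aligned features with the
    same feature dimension; video_feats is None for audio-only utterances.
    Dropped features are replaced by zeros so the fused input keeps a fixed
    dimensionality regardless of which modalities are present.
    """
    if video_feats is None:
        # Unimodal (audio-only) utterance: substitute zeros for the video stream.
        video_feats = torch.zeros_like(audio_feats)
    elif torch.rand(()).item() < p_drop:
        # Drop one modality at random so the model also sees audio-only
        # and video-only views of multimodal utterances.
        if torch.rand(()).item() < p_drop_audio:
            audio_feats = torch.zeros_like(audio_feats)
        else:
            video_feats = torch.zeros_like(video_feats)
    # Fuse by channel-wise concatenation, as in AV-HuBERT.
    return torch.cat([audio_feats, video_feats], dim=-1)
```

Because the encoder sees audio-visual, audio-only, and visual-only views of the same data during pre-training, its representations become largely interchangeable across modalities.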
Results and Findings
- Flexible Modality Support: A single pre-trained model handles audio-visual, audio-only, and visual-only inputs, which is valuable when data is sparse or difficult to label for some modalities.
- Superior Cluster Quality: The mixed-modal features yield clusters with high phone normalized mutual information (PNMI), indicating that the discrete units align well with underlying phones across modalities (a sketch of the PNMI computation follows this list).
- Evaluations on Speech Processing Tasks: u-HuBERT was evaluated on speech recognition and speech translation. It maintains strong performance even when fine-tuned with labeled data from only one modality and tested on another, showcasing its potential for real-world applications where labeled multimodal data is scarce.
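For reference, PNMI can be computed from frame-level (phone, cluster) pairs as the mutual information between phone labels and cluster assignments, normalized by the phone entropy. The sketch below is a straightforward NumPy implementation of that definition; the function and argument names are hypothetical.

```python
import numpy as np

def phone_normalized_mutual_info(phone_ids, cluster_ids):
    """PNMI = I(phone; cluster) / H(phone), computed from frame-level pairs.

    phone_ids, cluster_ids: 1-D integer arrays of equal length giving the
    reference phone label and the discrete cluster assignment per frame.
    Higher values mean the clusters capture more of the phonetic variation.
    """
    phone_ids = np.asarray(phone_ids)
    cluster_ids = np.asarray(cluster_ids)
    # Empirical joint distribution over (phone, cluster) pairs.
    joint = np.zeros((phone_ids.max() + 1, cluster_ids.max() + 1))
    np.add.at(joint, (phone_ids, cluster_ids), 1)
    joint /= joint.sum()
    p_phone = joint.sum(axis=1, keepdims=True)    # (num_phones, 1)
    p_cluster = joint.sum(axis=0, keepdims=True)  # (1, num_clusters)
    nz = joint > 0
    mutual_info = np.sum(joint[nz] * np.log(joint[nz] / (p_phone @ p_cluster)[nz]))
    phone_entropy = -np.sum(p_phone[p_phone > 0] * np.log(p_phone[p_phone > 0]))
    return mutual_info / phone_entropy
```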
Implications
- Theoretical Impact: Learning modality-invariant features could inform the design of unified models covering other sensory input types beyond speech.
- Practical Deployment: A single model that can be deployed across devices and scenarios (e.g., leveraging visual input in noisy environments) reduces development complexity and broadens where the technology can be applied.
Future Directions
The work provides a promising foundation for further development in multimodal AI. Future work could extend the framework to modalities beyond speech (e.g., bio-signals or other sensor data) and improve adaptability to lesser-studied or emerging modalities. Mitigating catastrophic forgetting during domain adaptation and improving generalization to unseen domains also remain open problems.
This paper pushes the frontier of mixed-modal learning by proposing a cohesive strategy that reduces the need for labeled data in every modality, streamlining the development of robust, generalizable models for broad applications.