Multi-task self-supervised learning for Robust Speech Recognition (2001.09239v2)

Published 25 Jan 2020 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including speaker voice-print and phonemes. This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks. Finally, we refine the set of workers used in self-supervision to encourage better cooperation. Results on TIMIT, DIRHA and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE as well as common acoustic features. Interestingly, PASE+ learns transferable representations suitable for highly mismatched acoustic conditions.

Citations (284)

Summary

  • The paper presents PASE+, a novel multi-task self-supervised approach that learns robust speech representations from unlabelled audio.
  • It integrates convolutional and QRNN layers with an online distortion module to capture both short- and long-term dynamics in challenging acoustic environments.
  • Experimental results demonstrate a 13.5% relative improvement over the best traditional feature set, showcasing its effectiveness in real-world noisy scenarios.

Multi-Task Self-Supervised Learning for Robust Speech Recognition

The paper "Multi-task self-supervised learning for Robust Speech Recognition" introduces PASE+, an enhanced version of a problem-agnostic speech encoder (PASE) aimed at improving speech recognition in challenging noisy and reverberant environments. The authors have devised a multi-task self-supervised learning approach that capitalizes on the strengths of self-supervised learning paradigms to extract meaningful representations from unlabelled audio data, addressing the limitations faced when labeled data is scarce or unavailable.

Overview of PASE+ Design

PASE+ builds on the architecture of its predecessor by incorporating several improvements aimed at handling adverse acoustic conditions effectively. Key design elements of PASE+ include:

  1. Enhanced Encoder Architecture: The revised encoder pairs convolutional layers with a quasi-recurrent neural network (QRNN) to capture both short- and long-term speech dynamics efficiently. This combination improves the handling of temporal dependencies in audio signals, which is especially beneficial in noisy environments (a minimal sketch follows this list).
  2. Online Speech Distortion Module: A new module contaminates the input signals during training with distortions such as reverberation, additive noise, and temporal/frequency masking. This online contamination augments the data and forces the model to learn features invariant to these perturbations, improving robustness under real-world test conditions (see the distortion sketch after this list).
  3. Improved Self-Supervised Workers: PASE+ employs a set of small neural networks, called workers, each solving a different self-supervised objective. The range of tasks has been expanded to cover phonetic, prosodic, and speaker-level aspects of speech, encouraging the encoder to learn a more robust and comprehensive representation (a training-step sketch appears after this list).
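
To make the encoder design concrete, here is a minimal PyTorch sketch of a convolutional front-end followed by a quasi-recurrent layer with fo-pooling. This is not the authors' implementation: layer counts, kernel sizes, and hidden dimensions are illustrative, and the published encoder is considerably deeper.

```python
import torch
import torch.nn as nn

class QRNNLayer(nn.Module):
    """Minimal quasi-recurrent layer: a causal 1-D convolution computes the
    candidate/forget/output gates in parallel, then a cheap elementwise
    recurrence ("fo-pooling") mixes them across time."""
    def __init__(self, in_dim, hidden_dim, kernel_size=2):
        super().__init__()
        self.pad = kernel_size - 1          # left-pad so step t sees only x[<=t]
        self.conv = nn.Conv1d(in_dim, 3 * hidden_dim, kernel_size)
        self.hidden_dim = hidden_dim

    def forward(self, x):                   # x: (batch, time, in_dim)
        g = self.conv(nn.functional.pad(x.transpose(1, 2), (self.pad, 0)))
        z, f, o = g.transpose(1, 2).chunk(3, dim=-1)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        c = x.new_zeros(x.size(0), self.hidden_dim)
        hs = []
        for t in range(x.size(1)):          # sequential part is elementwise only
            c = f[:, t] * c + (1 - f[:, t]) * z[:, t]
            hs.append(o[:, t] * c)
        return torch.stack(hs, dim=1)       # (batch, time, hidden_dim)

class Encoder(nn.Module):
    """Convolutional front-end for short-term structure, QRNN on top for
    longer-range dynamics. Depth and layer sizes here are illustrative."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, hidden_dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        self.qrnn = QRNNLayer(hidden_dim, hidden_dim)

    def forward(self, wav):                 # wav: (batch, samples)
        h = self.conv(wav.unsqueeze(1))     # (batch, hidden_dim, frames)
        return self.qrnn(h.transpose(1, 2)) # (batch, frames, hidden_dim)
```

The appeal of the QRNN here is that the convolution does the heavy lifting in parallel, leaving only an elementwise recurrence in the sequential loop, which keeps it much cheaper than an LSTM of comparable size.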
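The online distortion module can be approximated in a few lines. The sketch below applies only two of the disturbances listed above, additive noise at a random SNR and temporal masking; `noise_bank` is a hypothetical list of noise waveforms, and the real module samples from a wider pool of perturbations.

```python
import torch

def distort(wav, noise_bank, snr_db_range=(0, 10), mask_prob=0.5):
    """Contaminate a 1-D waveform on the fly. `noise_bank` is an assumed
    list of noise tensors at least as long as `wav`; the real module also
    draws reverberation, clipping, and frequency masking at random."""
    out = wav.clone()
    # Additive noise at a random signal-to-noise ratio.
    noise = noise_bank[torch.randint(len(noise_bank), (1,)).item()]
    noise = noise[: out.numel()].to(out.device)
    snr_db = torch.empty(1).uniform_(*snr_db_range).item()
    # Scale the noise so that P_signal / P_noise matches the target SNR.
    gain = (out.pow(2).mean() / (noise.pow(2).mean() * 10 ** (snr_db / 10))).sqrt()
    out = out + gain * noise
    # Temporal masking: silence a random chunk of the waveform.
    if torch.rand(1).item() < mask_prob:
        start = torch.randint(0, out.numel() // 2, (1,)).item()
        width = torch.randint(1, out.numel() // 10, (1,)).item()
        out[start : start + width] = 0.0
    return out
```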
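Finally, a hypothetical training step showing how the workers cooperate: each one is a small head on the shared encoding, and the encoder is updated with the sum of their losses. Plain MSE is used for every task here for brevity, whereas the paper mixes regression and binary-classification objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Worker(nn.Module):
    """One small self-supervised head regressing a target (e.g. MFCCs,
    FBANKs, or a future frame) from the shared encoding."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, h):
        return self.net(h)

def multitask_step(encoder, workers, distorted_wav, targets, opt):
    """One optimization step. `workers` maps task names to Worker heads and
    `targets` holds matching labels precomputed from the *clean* signal,
    while the encoder only ever sees the distorted input."""
    h = encoder(distorted_wav)
    # Every worker contributes a loss; the shared encoder must satisfy all
    # tasks at once, which is what encourages a general representation.
    loss = sum(F.mse_loss(worker(h), targets[name])
               for name, worker in workers.items())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```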

Experimental Results

The paper presents strong empirical results demonstrating that PASE+ significantly outperforms traditional acoustic features and the previous PASE architecture. Key findings include:

  • On TIMIT, DIRHA, and CHiME-5, PASE+ shows a marked improvement over standard hand-crafted speech features such as MFCCs and FBANKs, with a relative improvement of 13.5% over the best traditional feature set in noisy and reverberant conditions.
  • The learned representations also transfer across highly mismatched acoustic environments: thanks to its self-supervised pre-training, PASE+ remains effective on the real-world recordings of the CHiME-5 dataset.

Implications and Future Directions

The development and validation of PASE+ indicate significant potential in applying multi-task self-supervised learning frameworks for robust speech processing. The ability to learn quality representations from unlabelled data using self-supervision presents a compelling pathway for reducing reliance on large annotated datasets, which are often expensive and time-consuming to produce.

Future work could explore integrating semi-supervised learning into the PASE+ framework by adding supervised tasks alongside the self-supervised workers. This blended approach could refine the learned representations and improve performance on specific downstream tasks such as speaker and emotion recognition. Extending PASE+ to sequence-to-sequence neural speech recognition frameworks could likewise provide additional insight into optimizing speech recognition systems.

Overall, PASE+ represents a significant stride in advancing speech recognition technology, particularly in improving adaptability and performance in acoustically challenging environments using self-supervised learning approaches.