- The paper demonstrates that applying time-domain data augmentations in Contrastive Predictive Coding (CPC) significantly improves speech representation learning.
- It introduces the WavAugment library and a refined CPC2 model, achieving an 18-22% relative improvement on key unsupervised benchmarks while using far less training data.
- The enhanced model also improves semi-supervised phoneme classification by 12-15%, reducing reliance on extensive labeled datasets.
Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
This paper investigates how time-domain data augmentation can substantially improve unsupervised speech representation learning with Contrastive Predictive Coding (CPC). It introduces WavAugment, a data augmentation library, and examines its impact on the quality of the speech representations CPC extracts.
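CPC trains by contrasting: a context vector summarizing the past must score the true future embedding higher than negative samples drawn from other positions. The following is a minimal numpy sketch of such an InfoNCE-style loss; the function name and dot-product scoring are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def info_nce_loss(context, positive, negatives):
    """Illustrative InfoNCE-style contrastive loss: the context vector
    should score higher against the true future embedding (positive)
    than against embeddings from other positions (negatives)."""
    pos_score = context @ positive          # scalar similarity score
    neg_scores = negatives @ context        # (num_negatives,) scores
    logits = np.concatenate([[pos_score], neg_scores])
    # Cross-entropy with the positive sample at index 0.
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[0]

rng = np.random.default_rng(0)
c = rng.normal(size=16)
# A positive aligned with the context yields a low loss; a random,
# unrelated "positive" yields a higher one.
loss_aligned = info_nce_loss(c, c, rng.normal(size=(10, 16)))
loss_random = info_nce_loss(c, rng.normal(size=16), rng.normal(size=(10, 16)))
```

The loss falls as the context becomes predictive of the future embedding, which is what drives the encoder to learn useful representations.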
Methodological Advancements
The paper outlines a refined CPC architecture, termed CPC2, which employs a convolutional encoder and a recurrent context network to derive representations from raw audio. A central focus is data augmentation via the WavAugment library, which operates in the time domain and applies transformations such as pitch modification, additive noise, and reverberation. Crucially, applying these augmentations only to the past segments of the audio outperforms applying them to both the past and the future (predicted) segments.
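The past-only strategy can be sketched as follows. This is a hedged toy example, not WavAugment itself: it uses additive noise (one of the transformations named above) in plain numpy, and the function name, SNR parameterization, and past/future split point are illustrative assumptions.

```python
import numpy as np

def augment_past_only(waveform, split, snr_db=10.0, rng=None):
    """Apply a time-domain augmentation (additive noise here) only to
    the 'past' portion of the signal; the 'future' prediction targets
    are left clean, as the paper reports works best."""
    rng = rng if rng is not None else np.random.default_rng()
    past, future = waveform[:split].copy(), waveform[split:]
    # Scale the noise to the requested signal-to-noise ratio.
    signal_power = np.mean(past ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    past += rng.normal(scale=np.sqrt(noise_power), size=past.shape)
    return past, future

rng = np.random.default_rng(1)
# 1 second of a 220 Hz tone at a 16 kHz sampling rate.
wav = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
past, future = augment_past_only(wav, split=8000, snr_db=10.0, rng=rng)
```

Keeping the future clean means the model must predict the underlying speech content, not the injected distortion, which is one intuition for why past-only augmentation helps.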
Numerical Results
The paper reports a notable improvement in CPC performance with data augmentation, achieving an 18-22% relative improvement on unsupervised representation learning benchmarks such as the ZeroSpeech 2017 benchmark. Remarkably, these gains hold while training on roughly 600 times less data than comparable state-of-the-art models, with which the augmented model remains competitive.
Implications and Future Directions
The augmentation techniques not only enhance unsupervised learning of speech representations but also carry over to semi-supervised setups. On limited-supervision phoneme classification tasks, the augmented CPC model shows a 12-15% improvement, highlighting the potential to reduce dependency on labeled datasets in speech recognition applications.
This work has implications beyond immediate performance enhancements. It underscores the promise of time-domain augmentations in bridging the gap between model capacity and data availability. The results invite further exploration into the scalability of these methodologies across languages and dataset sizes, and their adaptability to larger, more varied datasets which may contain noise or require preprocessing.
Looking ahead, the continued evolution of time-domain augmentations could catalyze advancements in other areas of unsupervised learning, fostering developments that align model training more closely with the realities of diverse, real-world datasets. Tables illustrating these augmentations and their specific impacts could provide deeper insights and offer benchmarks for future research in similar domains.
In conclusion, the paper delivers valuable insights and empirical evidence supporting the efficacy of strategic data augmentation methods. These innovations contribute meaningfully to the broader field of representation learning, highlighting a path toward more effective and efficient unsupervised learning paradigms.