- The paper proposes TERA, a self-supervised method that pre-trains Transformer encoders by reconstructing speech altered along time, frequency, and magnitude axes to learn rich speech representations.
- The method stochastically corrupts acoustic frames with these alterations, improving performance on downstream tasks such as phoneme classification, keyword spotting, and speaker recognition.
- Empirical results show that TERA outperforms earlier self-supervised models such as CPC and Mockingjay, particularly under linear-evaluation protocols.
An Analysis of TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
The paper introduces TERA (Transformer Encoder Representations from Alteration), a self-supervised learning approach that improves speech processing systems by pre-training Transformer encoders on unlabeled speech data. Unlike earlier methods that rely on a single auxiliary task, TERA combines several alteration-based objectives during pre-training.
Key Methodological Insights
TERA applies alterations along three orthogonal axes (time, frequency, and magnitude) to guide pre-training. The primary objective is to reconstruct clean acoustic frames from their altered versions, with each alteration applied stochastically according to a probabilistic policy (a minimal sketch follows the list below):
- Time Alteration: Corrupts contiguous blocks of time steps, forcing the model to learn contextual relationships within speech sequences.
- Frequency Alteration: Masks frequency bins, prompting the model to capture speaker identity and other high-level features.
- Magnitude Alteration: Adds Gaussian noise, enhancing the model's robustness and generalization capabilities across diverse datasets.
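The sketch below illustrates the combined alteration policy and the reconstruction objective, assuming a log-Mel spectrogram input of shape (time, freq). Function names, mask widths, and probabilities are illustrative assumptions rather than the paper's exact hyperparameters, and the L1 loss is computed over all frames as a simplification.

```python
import torch

def time_alteration(x: torch.Tensor, mask: torch.Tensor,
                    p: float = 0.15, width: int = 7) -> None:
    """Zero out contiguous blocks of time steps (in place)."""
    T = x.size(0)
    n_blocks = max(1, int(T * p / width))
    for _ in range(n_blocks):
        start = torch.randint(0, max(1, T - width), (1,)).item()
        x[start:start + width, :] = 0.0
        mask[start:start + width, :] = True

def frequency_alteration(x: torch.Tensor, mask: torch.Tensor,
                         max_bins: int = 8) -> None:
    """Zero out one random block of frequency bins (in place)."""
    F = x.size(1)
    width = int(torch.randint(1, max_bins + 1, (1,)))
    start = torch.randint(0, max(1, F - width), (1,)).item()
    x[:, start:start + width] = 0.0
    mask[:, start:start + width] = True

def magnitude_alteration(x: torch.Tensor, sigma: float = 0.2) -> None:
    """Add zero-mean Gaussian noise to every frame (in place)."""
    x.add_(torch.randn_like(x) * sigma)

def alter(spec: torch.Tensor):
    """Apply the combined alteration policy to a clean spectrogram."""
    x = spec.clone()
    mask = torch.zeros_like(x, dtype=torch.bool)
    time_alteration(x, mask)
    frequency_alteration(x, mask)
    magnitude_alteration(x)
    return x, mask

def pretrain_step(encoder, spec: torch.Tensor) -> torch.Tensor:
    """One reconstruction step: predict the clean frames from altered input."""
    altered, _ = alter(spec)
    recon = encoder(altered)  # prediction head output, same shape as spec
    return torch.nn.functional.l1_loss(recon, spec)
```

Applying the three alterations jointly within a single objective, rather than one at a time, is what distinguishes TERA from single-task predecessors such as Mockingjay.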
These alterations are combined into a single pre-training objective that captures richer, more contextualized representations than prior single-task models.
Numerical Performance and Comparisons
TERA's efficacy is validated through several downstream tasks such as phoneme classification, keyword spotting, speaker recognition, and automatic speech recognition (ASR). The results demonstrate that:
- In phoneme classification, TERA models pre-trained with both time and frequency alterations outperform earlier models such as CPC and Mockingjay.
- For keyword spotting, time alteration alone yields significant gains.
- Speaker recognition accuracy is high, especially when frequency alteration is included.
Comparative analyses reveal that TERA consistently outperforms other self-supervised models like vq-wav2vec and wav2vec 2.0, particularly in terms of linear classification accuracy, highlighting its representational capacity.
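To make the linear-evaluation protocol concrete, the sketch below freezes a pre-trained encoder and fits a single linear layer on its frame-level outputs for a classification task such as phoneme labeling. The `encoder`, the data loader, and all hyperparameters are assumptions for illustration, not part of the TERA release.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, loader, n_classes: int, dim: int, epochs: int = 10):
    """Fit one linear layer on frozen encoder features."""
    probe = nn.Linear(dim, n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    encoder.eval()
    for _ in range(epochs):
        for spec, labels in loader:     # spec: (B, T, F), labels: (B, T)
            with torch.no_grad():       # the encoder stays frozen
                feats = encoder(spec)   # (B, T, dim) representations
            logits = probe(feats)
            loss = nn.functional.cross_entropy(
                logits.reshape(-1, n_classes), labels.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

Because only the linear layer is trained, accuracy under this protocol directly reflects how linearly separable the pre-trained representations are, which is the sense in which the comparisons above favor TERA.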
Theoretical and Practical Implications
The integration of multiple alterations for pre-training advances self-supervised learning beyond earlier single-objective approaches. By combining them, TERA achieves:
- More robust feature extraction capabilities, as evidenced by the superior performance across diverse tasks.
- Enhanced generalization, given its consistent results on unseen datasets like TIMIT.
Future Research Directions
Given TERA's demonstrated effectiveness, future research might explore domain adaptation, especially across datasets that differ significantly in acoustic properties. Scaling TERA to larger datasets or to other linguistic contexts is another promising direction.
Conclusion
Through alterations along the time, frequency, and magnitude axes, TERA offers a promising direction for improving speech processing with self-supervised techniques. Its adaptability and strong task performance mark a significant step forward in representation learning, demonstrating how much models can gain from unlabeled data.