- The paper introduces BERTphone, a Transformer model using masked frame reconstruction and CTC loss to incorporate phonetic cues into acoustic representations.
- It achieves an 18% relative reduction in speaker recognition equal error rate and a state-of-the-art Cavg of 6.16 on the LRE07 language recognition benchmark.
- The dual pretraining approach leverages unlabeled data effectively, paving the way for robust speaker and language recognition in low-resource settings.
BERTphone: Phonetically-aware Encoder Representations for Utterance-level Speaker and Language Recognition
The paper introduces BERTphone, a Transformer-based model that produces phonetically-aware contextual representation vectors useful for both speaker and language recognition. The authors attribute this dual capability to multi-task pretraining that incorporates a phonetic constraint.
Methodology
BERTphone departs from text encoders like BERT by operating on continuous speech frames rather than discrete tokens. It is trained with two primary objectives:
- Masked Frame Reconstruction: Inspired by masked language modeling, spans of input frames are masked and the model is trained to reconstruct the entire frame sequence. The reconstruction uses an L1 loss, which drives acoustic representation learning.
- Connectionist Temporal Classification (CTC) Loss: Borrowed from automatic speech recognition (ASR), a sequence-level CTC loss aligns model outputs with phoneme label sequences, driving phonetic representation learning.
Together, the two objectives embed both acoustic and phonetic information into frame-level feature vectors that can be reused for speaker and language recognition.
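To make the dual objective concrete, below is a minimal PyTorch sketch of how masked-span L1 reconstruction and CTC over phoneme labels could be combined. The layer sizes, phone inventory, masking strategy, and interpolation weight `alpha` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BERTphoneSketch(nn.Module):
    """Illustrative encoder trained with L1 frame reconstruction plus CTC
    over phoneme labels; all sizes here are assumptions."""

    def __init__(self, feat_dim=40, d_model=768, n_layers=12, n_heads=12, n_phones=42):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.recon_head = nn.Linear(d_model, feat_dim)      # reconstructs input frames
        self.phone_head = nn.Linear(d_model, n_phones + 1)  # +1 for the CTC blank
        self.ctc = nn.CTCLoss(blank=n_phones, zero_infinity=True)

    def forward(self, frames, masked_frames, phones, frame_lens, phone_lens, alpha=0.5):
        # frames:        (B, T, feat_dim) clean features (reconstruction target)
        # masked_frames: (B, T, feat_dim) same features with random spans zeroed out
        h = self.encoder(self.input_proj(masked_frames))     # (B, T, d_model)

        # L1 reconstruction over the whole sequence
        recon_loss = (self.recon_head(h) - frames).abs().mean()

        # CTC expects (T, B, C) log-probabilities and phoneme label sequences
        log_probs = self.phone_head(h).log_softmax(dim=-1).transpose(0, 1)
        ctc_loss = self.ctc(log_probs, phones, frame_lens, phone_lens)

        # weighted combination of the two objectives; alpha is an assumption
        return alpha * recon_loss + (1.0 - alpha) * ctc_loss
```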
Experimental Results
The authors evaluated BERTphone using two datasets, Fisher (8 kHz audio) and TED-LIUM (16 kHz audio), across two benchmark tasks:
- Speaker Recognition: When integrated into x-vector-style DNNs for speaker recognition, BERTphone achieved an 18% relative reduction in equal error rate (EER) compared to systems using Mel-Frequency Cepstral Coefficients (MFCCs) as input on the Fisher dataset.
- Language Recognition: On the LRE07 3-second closed-set language recognition task, BERTphone attained a state-of-the-art Cavg of 6.16, showing a notable improvement over prior models pre-trained on similar data.
These results suggest that BERTphone outperforms previous phonetic pretraining methods and substantially improves on systems that relied solely on pre-existing phonetic features.
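For clarity, the 18% figure is a relative reduction. The toy calculation below uses hypothetical EER values (not the paper's reported absolute numbers) purely to show how such a figure is computed.

```python
# Hypothetical EER values used only to illustrate what an 18% *relative*
# reduction means; these are not the paper's reported absolute numbers.
eer_mfcc = 5.00       # % EER of the MFCC-input baseline (assumed)
eer_bertphone = 4.10  # % EER with BERTphone features (assumed)

relative_reduction = (eer_mfcc - eer_bertphone) / eer_mfcc
print(f"Relative EER reduction: {relative_reduction:.0%}")  # -> 18%
```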
Architectural Considerations
The architecture uses a deep Transformer encoder similar to BERT, customized for the acoustic domain:
- 12 self-attention layers with 12 heads.
- Mean-normalized MFCCs serve as inputs, with consecutive frames stacked to reduce sequence length.
- Final-layer outputs are frozen during downstream training, so BERTphone serves as a fixed feature extractor, as in previous speaker and language recognition systems.
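The input pipeline and frozen-feature usage described above can be sketched as follows; the stacking factor and the generic `encoder` callable are assumptions standing in for the pretrained BERTphone model.

```python
import torch

def stack_frames(mfcc, stack=3):
    """Concatenate every `stack` consecutive frames to shorten the
    sequence; the stacking factor here is an assumption."""
    B, T, D = mfcc.shape
    T_trim = (T // stack) * stack                      # drop leftover frames
    return mfcc[:, :T_trim, :].reshape(B, T_trim // stack, D * stack)

def extract_features(encoder, mfcc):
    """Frozen feature extraction: `encoder` stands in for the pretrained
    BERTphone model and is not updated during downstream training."""
    encoder.eval()
    with torch.no_grad():                              # keep the encoder frozen
        normalized = mfcc - mfcc.mean(dim=1, keepdim=True)  # per-utterance mean norm
        return encoder(stack_frames(normalized))       # final-layer representations
```

In a downstream system, these frozen representations would replace MFCCs as the input features to an x-vector-style speaker or language recognition model.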
Implications and Future Directions
BERTphone demonstrates an effective integration of phonetic knowledge into deep acoustic models, providing enhanced versatility across speech-related tasks. Through its self-supervised approach, it reduces the need for extensive labeled data, aligning with recent trends favoring pre-trained models.
Practically, the release of BERTphone may facilitate broader deployment of speech technologies in environments where labeled data is scarce. Theoretically, it paves the way for further exploration of unsupervised pretraining within a phonetically constrained space.
Speculatively, future research could explore fine-tuning (unfreezing) BERTphone representations for specific downstream tasks, potentially yielding further gains. Pretraining at larger scale on unlabeled data may also enable broader applicability in real-world scenarios.
In summary, this work adds substantial value to the domain of speech processing by enhancing the capability and robustness of models tackling diverse recognition tasks, and it sets the stage for innovative cross-task pretraining methodologies.