An Overview of the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition Systems
The paper "Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System" investigates how to enhance end-to-end systems for speaker and language recognition, focusing on the encoding/pooling layers and loss functions within these systems. The research aims to provide a unified framework that leverages various encoding layers and loss functions to improve performance on speaker and language recognition tasks.
System Framework and Design
The end-to-end system proposed in this work addresses a "sequence-to-one" task, which, unlike automatic speech recognition (ASR), requires output not on a per-frame basis but at the level of complete utterances. The system is designed to produce utterance-level results from variable-length inputs. Its core components are a frame-level feature extraction front end, an encoding layer that aggregates frame features into a fixed-dimensional representation, and a fully connected layer followed by an output classification layer. The work investigates three types of encoding layers: Temporal Average Pooling (TAP), Self-Attentive Pooling (SAP), and Learnable Dictionary Encoding (LDE), assessing how well each summarizes the sequential input into a single utterance-level representation.
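To make the "sequence-to-one" pipeline concrete, here is a minimal NumPy sketch of the forward pass: frames of arbitrary length T go through a front end, an encoding layer collapses them to a fixed-size vector, and a fully connected plus classification layer produces utterance-level posteriors. All dimensions, weights, and function names are illustrative assumptions, not the paper's actual architecture (which uses a CNN front end):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only (not from the paper).
feat_dim, emb_dim, n_classes = 64, 128, 10

def frame_frontend(utterance):
    """Stand-in for the CNN front end: here simply an identity over frames."""
    return utterance  # shape (T, feat_dim)

def encode(frames):
    """Encoding layer: collapse a variable number of frames T to one fixed vector
    (temporal average pooling shown here)."""
    return frames.mean(axis=0)  # shape (feat_dim,)

W_fc = rng.standard_normal((feat_dim, emb_dim)) * 0.01
W_out = rng.standard_normal((emb_dim, n_classes)) * 0.01

def forward(utterance):
    frames = frame_frontend(utterance)
    embedding = np.maximum(encode(frames) @ W_fc, 0.0)  # FC + ReLU
    logits = embedding @ W_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                          # utterance-level posteriors

# Two utterances of different length map to outputs of the same size.
short = rng.standard_normal((50, feat_dim))
long_ = rng.standard_normal((300, feat_dim))
assert forward(short).shape == forward(long_).shape == (n_classes,)
```

The key property is that the encoding layer is the only place where the time dimension is removed, so everything before it can run on inputs of any duration.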
Encoding Layers Explored
- Temporal Average Pooling (TAP): A baseline approach where a simple average is used to pool the frame-level features over time.
- Self-Attentive Pooling (SAP): This layer employs attention mechanisms to weigh frames within the input sequence differently, allowing the model to focus on more significant parts of the sequence, thus improving utterance representation.
- Learnable Dictionary Encoding (LDE): Inspired by the GMM-i-vector approach, LDE introduces learnable dictionary components to accumulate statistics, which merge the dictionary learning and vector encoding stages into a single layer for direct optimization.
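The three encoding layers above can be sketched in a few lines of NumPy. The parameters below (attention weights, dictionary centers, smoothing factors) are random stand-ins for what would be learned end-to-end, and the LDE aggregation is simplified to a plain mean of weighted residuals rather than the paper's exact normalization:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 200, 64  # frames x feature dim (illustrative sizes)
frames = rng.standard_normal((T, D))

# --- TAP: uniform average over time ---
tap = frames.mean(axis=0)

# --- SAP: learned attention weights over frames ---
W, b = rng.standard_normal((D, D)) * 0.1, np.zeros(D)
mu = rng.standard_normal(D)            # learnable context vector
h = np.tanh(frames @ W + b)            # (T, D) hidden representation
scores = h @ mu                        # (T,) per-frame relevance
w = np.exp(scores - scores.max())
w /= w.sum()                           # softmax attention weights over time
sap = (w[:, None] * frames).sum(axis=0)

# --- LDE: soft-assign frames to C learnable dictionary components ---
C = 8
centers = rng.standard_normal((C, D))  # learnable component centers
s = np.ones(C)                         # learnable smoothing factors
resid = frames[:, None, :] - centers[None, :, :]        # (T, C, D) residuals
logits = -s * (resid ** 2).sum(-1)                      # (T, C)
a = np.exp(logits - logits.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)                       # soft assignments
lde = (a[:, :, None] * resid).mean(axis=0).reshape(-1)  # (C*D,) aggregated residuals

assert tap.shape == sap.shape == (D,)
assert lde.shape == (C * D,)
```

Note how the output dimensionality differs: TAP and SAP return a D-dimensional vector, whereas LDE concatenates one aggregated residual per dictionary component, giving a C*D-dimensional representation.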
Loss Functions Analyzed
To learn discriminative embeddings suited to open-set speaker verification, the paper explores two notable loss functions beyond the common softmax:
- Center Loss: Works alongside softmax to minimize intra-class variations while keeping inter-class features separable.
- Angular Softmax (A-Softmax): Introduces an angular margin in feature space, providing a discriminative angular distance metric on a hyperspherical manifold, aligning with the inherent manifold structure of speaker embeddings.
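A hedged NumPy sketch of the two losses: Center Loss penalizes each embedding's distance to its (learnable) class center, while A-Softmax replaces the target-class logit ||x||cos(θ) with an angular-margin version. The batch sizes, dimensions, and the simplified ψ(θ) = cos(mθ) (valid only for θ ≤ π/m) are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 32, 16, 4                     # batch size, embedding dim, classes
emb = rng.standard_normal((n, d))
labels = rng.integers(0, k, size=n)
centers = rng.standard_normal((k, d))   # learnable per-class centers

def center_loss(emb, labels, centers):
    """L_C = 1/2 * mean ||x_i - c_{y_i}||^2 — pulls embeddings toward class centers."""
    diff = emb - centers[labels]
    return 0.5 * (diff ** 2).sum(axis=1).mean()

lam = 0.01  # relative weight of center loss vs. softmax (hyperparameter)
# total_loss = softmax_cross_entropy + lam * center_loss(emb, labels, centers)

def asoftmax_target_logit(x, w, m=4):
    """A-Softmax changes the target logit from ||x||cos(theta) to ||x||cos(m*theta),
    enlarging the angular margin. Simplified: valid only while theta <= pi/m."""
    cos = x @ w / (np.linalg.norm(x) * np.linalg.norm(w))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    return np.linalg.norm(x) * np.cos(m * theta)

assert center_loss(emb, labels, centers) >= 0.0
```

With m = 1 the margin term reduces to the ordinary logit against a normalized class weight, which is why larger m directly controls how much harder the angular decision boundary becomes.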
Experimental Findings
Performance was evaluated on the VoxCeleb and NIST LRE 07 datasets. The paper reports substantial improvements from the choice of encoding layer and loss function:
- The LDE layer generally performed best, significantly enhancing system robustness in both speaker verification and language identification tasks.
- The SAP layer outperformed the TAP baseline by applying selective attention to frame-level features.
- Introducing Center Loss and A-Softmax Loss produced more discriminative speaker embeddings, particularly benefiting open-set tasks, where these approaches surpassed traditional PLDA scoring.
- The integration of the LDE layer and discriminative loss functions demonstrated clear improvement in overall system metrics, illustrating the effectiveness of the unified end-to-end system over conventional i-vector systems.
Implications and Future Work
This paper's findings have several implications for the development of robust speaker and language recognition systems. By integrating adaptive pooling methods and discriminative losses, systems can achieve improved generalization and discrimination, important for real-world applications where speaker and language variability can be high. Future research directions could explore further refinements in dictionary learning methods and extend these models to additional paralinguistic attributes, improving their applicability to broader audio processing tasks.
In conclusion, exploring alternative pooling methods and loss functions demonstrates the potential to substantially enhance end-to-end speaker and language recognition systems, bridging performance gaps associated with traditional methodologies. The proposed approaches contribute toward more adaptable and effective models for speaker verification and language identification.