Attention-Based Models for Text-Dependent Speaker Verification
The paper "Attention-Based Models for Text-Dependent Speaker Verification" investigates attention mechanisms for speaker verification systems, focusing on text-dependent scenarios in which every utterance contains a fixed phrase such as "OK Google" or "Hey Google". The work sits within the broader trend of attention-based models, which have proven effective across machine learning tasks including speech recognition and machine translation.
Overview
The authors explore the integration of different attention layer topologies into an end-to-end speaker verification architecture, which traditionally relies on Long Short-Term Memory (LSTM) networks. Their core contribution is adapting attention mechanisms to weight the informative frames of a speech sequence, reducing interference from silence and background noise.
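The contrast between the LSTM baseline and attention pooling can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function names, tensor shapes, and random inputs are assumptions for demonstration only:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def last_frame_embedding(h):
    # Baseline d-vector: the LSTM output at the final time step.
    # h: (T, d) matrix of per-frame LSTM outputs.
    return h[-1]

def attention_embedding(h, scores):
    # Attention pooling: a weighted average over all frames, with weights
    # given by a softmax over per-frame scores, so uninformative frames
    # (e.g. silence or noise) can receive low weight.
    alphas = softmax(scores)   # (T,), sums to 1
    return alphas @ h          # (d,)

# Toy example with random "LSTM outputs" and scores.
rng = np.random.default_rng(0)
h = rng.normal(size=(10, 4))   # T=10 frames, d=4 dims (illustrative)
scores = rng.normal(size=10)
embedding = attention_embedding(h, scores)
```

The design point is that the baseline commits to a single frame, while attention lets every frame contribute in proportion to its learned relevance.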
Methodology
The investigation includes a thorough examination of attention layer configurations, outlining several scoring functions: bias-only, linear, and non-linear attention. The authors also present shared-parameter versions of these scoring functions, which apply the same parameters at every time step. In addition, they propose attention variants such as cross-layer and divided-layer attention, designed to further refine feature extraction.
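The shared-parameter scoring functions can be sketched like this. The dimensions and initializations are illustrative assumptions; only the functional forms (bias-only, linear, non-linear) follow the paper's descriptions:

```python
import numpy as np

def softmax(x):
    # Softmax over per-frame scores, producing attention weights.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Shared-parameter scoring: the same parameters are applied at every
# time step t, so the scorer is independent of utterance length.
# h: (T, d) matrix of LSTM outputs.

def bias_only_score(h, b):
    # e_t = b  (ignores frame content; yields uniform attention weights)
    return np.full(h.shape[0], b)

def linear_score(h, w, b):
    # e_t = w^T h_t + b
    return h @ w + b

def nonlinear_score(h, W, b, v):
    # e_t = v^T tanh(W h_t + b)
    return np.tanh(h @ W.T + b) @ v

rng = np.random.default_rng(1)
T, d, m = 8, 4, 3                 # frames, LSTM dim, hidden dim (illustrative)
h = rng.normal(size=(T, d))
scores = nonlinear_score(h, rng.normal(size=(m, d)),
                         rng.normal(size=m), rng.normal(size=m))
alphas = softmax(scores)          # attention weights, one per frame
```

Note that the bias-only variant, once passed through the softmax, reduces to plain uniform averaging over frames, which makes it a useful sanity-check baseline against the learned variants.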
Furthermore, pooling methods on attention weights are tested, including sliding window maxpooling and global top-K maxpooling, aimed at enhancing the model's robustness against temporal variations in input signals.
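The two pooling schemes can be sketched as below. Window handling and the choice not to renormalize after pooling are implementation assumptions here, not details confirmed by the paper:

```python
import numpy as np

def sliding_window_maxpool(alphas, win):
    # Within each window of attention weights, keep only the largest
    # weight and zero out the rest (non-overlapping windows assumed).
    out = np.zeros_like(alphas)
    for start in range(0, len(alphas), win):
        seg = slice(start, start + win)
        out[seg.start + np.argmax(alphas[seg])] = alphas[seg].max()
    return out

def global_topk_maxpool(alphas, k):
    # Keep only the k largest weights over the whole utterance.
    out = np.zeros_like(alphas)
    idx = np.argsort(alphas)[-k:]
    out[idx] = alphas[idx]
    return out

# Toy attention weights over six frames.
alphas = np.array([0.05, 0.20, 0.10, 0.25, 0.15, 0.25])
sw = sliding_window_maxpool(alphas, win=2)   # one survivor per window
tk = global_topk_maxpool(alphas, k=2)        # two survivors overall
```

Both schemes sparsify the attention weights so that a few confident frames dominate the pooled embedding, which is what makes the model less sensitive to where in the utterance those frames occur.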
Experimental Results
Empirical evaluations demonstrate that attention-based models achieve a notable reduction in Equal Error Rate (EER), exemplified by a 14% relative improvement over baseline LSTM models. Specifically, the combination of shared-parameter non-linear attention and sliding window maxpooling resulted in an average EER of 1.48%, indicating improved discriminative capability in recognizing speaker identities in text-dependent setups.
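As a quick arithmetic check, the two reported figures are mutually consistent; the implied baseline EER below is derived from them, not quoted from the paper:

```python
# Relative EER improvement is (baseline - attention) / baseline.
attention_eer = 1.48          # reported average EER (%) of the best attention model
relative_improvement = 0.14   # reported relative improvement over the LSTM baseline

# Solving for the baseline implied by the two reported numbers:
implied_baseline = attention_eer / (1 - relative_improvement)
```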
Implications and Future Work
This research underscores the effectiveness of attention mechanisms in refining feature extraction processes in speaker verification, thereby enhancing system performance. By focusing on text-dependent scenarios, the paper sets the stage for further exploration into text-independent systems and speaker diarization applications, where similar methodologies may yield significant advancements.
Looking forward, the techniques elaborated in this paper could inform the development of more robust speaker verification frameworks, adaptable to diverse vocal inputs and suited to real-time applications such as voice authentication on smart devices.
In conclusion, the paper provides substantial insights into optimizing speaker verification through advanced attention models, contributing valuable strategies to the field of speech processing and AI-driven communication systems.