Adversarial Examples in End-to-End Speaker Verification Systems
The paper "Fooling End-to-End Speaker Verification with Adversarial Examples," authored by Felix Kreuk et al., explores the vulnerability of automatic speaker verification systems when exposed to adversarial inputs. The research highlights the susceptibility of these systems, particularly those leveraging end-to-end deep neural networks, to adversarial perturbations that remain almost imperceptible to human listeners. This finding challenges the robustness of biometric authentication methods widely used in secure applications, such as banking and e-commerce.
Background and Methodology
The core task of speaker verification is to ascertain whether a spoken utterance truly belongs to a claimed speaker. Traditional systems typically combine i-vector representations with probabilistic linear discriminant analysis (PLDA) scoring. End-to-end approaches based on deep neural networks, notably the architecture proposed by Heigold et al., are gaining traction because they learn the speaker embedding and the verification score jointly from the acoustic input, which simplifies the pipeline and improves effectiveness.
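To make the end-to-end setup concrete, the following is a minimal sketch of a verification scorer in PyTorch: an encoder maps acoustic features to a fixed-size speaker embedding, and a learned scale and bias turn the cosine similarity between test and enrollment embeddings into an accept/reject logit. The layer types, sizes, and hyperparameters here are illustrative assumptions, not the exact architecture of Heigold et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps a sequence of acoustic features to a normalized speaker embedding."""
    def __init__(self, feat_dim=40, hidden=256, emb_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):               # x: (batch, frames, feat_dim)
        _, (h, _) = self.lstm(x)        # final hidden state summarizes the utterance
        return F.normalize(self.proj(h[-1]), dim=-1)

class Verifier(nn.Module):
    """Scores a test utterance against an enrollment set with a learned threshold."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        # learned scale and bias turn cosine similarity into an accept/reject logit
        self.scale = nn.Parameter(torch.tensor(10.0))
        self.bias = nn.Parameter(torch.tensor(-5.0))

    def forward(self, test_utt, enroll_utts):
        test_emb = self.encoder(test_utt)                          # (batch, emb_dim)
        enroll_emb = self.encoder(enroll_utts).mean(0, keepdim=True)  # speaker model
        score = F.cosine_similarity(test_emb, enroll_emb)
        return self.scale * score + self.bias   # sigmoid(logit) > 0.5 means "accept"
```

Training such a model end to end on accept/reject trials is what lets the attack below differentiate all the way back to the input features.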
The paper crafts adversarial examples by adding a minimal, carefully chosen perturbation to the original utterance, which deceives the network into making the wrong verification decision. The authors conducted experiments under both white-box and black-box attack models against networks trained on the YOHO and NTIMIT datasets. The attacks perturb the input features using the fast gradient sign method (FGSM), reducing the system's accuracy dramatically.
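FGSM moves the input a small step in the direction of the sign of the loss gradient: x_adv = x + ε · sign(∇_x L(x, y)). Below is a hedged sketch of that step against a verification model that maps acoustic features to a single logit; `model`, `features`, and the binary `label` are placeholders for whatever system and trial is being attacked, and the epsilon value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, features, label, epsilon=0.05):
    """Craft an FGSM adversarial example for a binary verification model.

    features: acoustic feature tensor; label: float tensor of the same shape
    as the model's logit output (1.0 = same speaker, 0.0 = different speaker).
    """
    features = features.clone().detach().requires_grad_(True)
    logit = model(features)                                   # verification logit
    loss = F.binary_cross_entropy_with_logits(logit, label)
    loss.backward()
    # step in the direction that increases the loss, i.e. flips the decision
    adv = features + epsilon * features.grad.sign()
    return adv.detach()
```

Smaller epsilon values keep the perturbation closer to imperceptible while still degrading accuracy, which is the trade-off the paper exploits.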
Results
The empirical results demonstrate a substantial loss in accuracy when adversarial examples are employed. On the YOHO dataset, the system's accuracy dropped from 85.5% to 37.5% when Mel-spectrum features were perturbed, and from 87.5% to 25.75% when MFCC features were used. Comparable results were observed on the NTIMIT dataset, substantiating the effectiveness of adversarial attacks in drastically reducing system reliability.
Additionally, the paper examines black-box attack scenarios in which adversarial examples generated on a different dataset, or with a different feature set, still exploit the models' weaknesses. For example, a model trained on YOHO and attacked with NTIMIT-generated adversarial inputs suffered an accuracy degradation of 22.62%. These experiments confirm that adversarial vulnerabilities transfer across training conditions and feature extraction methodologies, as sketched below.
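The following sketch shows how such a transfer evaluation can be set up: perturbations are crafted against a surrogate model (for example, one trained on a different dataset or feature type) and then replayed against the target model, comparing accuracy before and after. It reuses the `fgsm_perturb` sketch above; all names and the trial format are illustrative assumptions, not the paper's evaluation code.

```python
import torch

def transfer_accuracy(surrogate, target, trials, epsilon=0.05):
    """Measure the target model's accuracy on clean vs. transferred adversarial inputs.

    trials: iterable of (features, label) pairs, with float labels in {0.0, 1.0}.
    """
    correct_clean, correct_adv, total = 0, 0, 0
    for features, label in trials:
        adv = fgsm_perturb(surrogate, features, label, epsilon)   # crafted on the surrogate
        with torch.no_grad():
            pred_clean = (torch.sigmoid(target(features)) > 0.5).float()
            pred_adv = (torch.sigmoid(target(adv)) > 0.5).float()
        correct_clean += (pred_clean == label).sum().item()
        correct_adv += (pred_adv == label).sum().item()
        total += label.numel()
    return correct_clean / total, correct_adv / total   # accuracy before / after transfer
```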
Implications and Future Directions
The findings of this paper have significant implications for the field of automatic speaker verification and, by extension, speech-based biometric systems. The demonstrated vulnerabilities necessitate a reevaluation of system robustness, extending beyond conventional metrics such as accuracy to include resistance against adversarial manipulation.
Future work could harden these systems through adversarial training or through new neural architectures explicitly designed to withstand adversarial perturbations; defensive distillation and other robustness techniques are another promising direction (a minimal adversarial-training sketch follows). In practice, reinforcing these authentication systems against adversarial attacks is essential for maintaining security and trust in biometric-based applications.
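As one possible hardening step, the sketch below shows a single adversarial-training update: each batch is augmented with FGSM perturbations crafted on the current model, and the loss mixes clean and adversarial terms. It reuses the `fgsm_perturb` sketch above; the mixing weight and epsilon are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, features, labels,
                              epsilon=0.05, adv_weight=0.5):
    """One training step that mixes clean and FGSM-perturbed examples."""
    adv = fgsm_perturb(model, features, labels, epsilon)   # crafted on the current model
    optimizer.zero_grad()
    clean_loss = F.binary_cross_entropy_with_logits(model(features), labels)
    adv_loss = F.binary_cross_entropy_with_logits(model(adv), labels)
    loss = (1 - adv_weight) * clean_loss + adv_weight * adv_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```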
This research highlights the urgent need for more secure implementations in end-to-end speaker verification systems and encourages ongoing development in this dynamic field to ensure the robustness and trustworthiness of voice-based authentication mechanisms.