Insights into Adversarial Attacks on Speaker Recognition Systems
This essay examines a paper on the security vulnerabilities of Speaker Recognition Systems (SRSs) under adversarial attacks in a practical black-box setting. The research introduces a novel adversarial attack, FakeBob, designed to probe the robustness of SRSs against adversarial samples that aim to fool the system into misidentifying speakers. Unlike prior explorations largely limited to white-box scenarios, the paper moves into the black-box domain, offering insight into the practicality and challenges of securing SRSs.
Speaker recognition systems, integral to biometric authentication, forensic investigation, and personalization in smart devices, operate by extracting and analyzing acoustic characteristics from spoken utterances. Despite their widespread application, SRSs face inherent security risks, notably adversarial attacks, in which deliberate perturbations of audio inputs can deceive these systems. The research highlights this vulnerability with FakeBob, an adversarial attack that achieves a 99% targeted attack success rate against both open-source and commercial systems by casting adversarial-sample generation as an optimization problem tailored to speaker recognition.
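In broad strokes, attacks of this kind can be written as a constrained optimization over an additive perturbation. The formulation below is a schematic sketch rather than the paper's verbatim objective; S denotes a generic scoring function, θ the system's acceptance threshold, κ a confidence margin, and ε the perturbation budget.

```latex
\min_{\delta}\ \max\{\theta - S(x + \delta),\ -\kappa\}
\quad \text{subject to} \quad \|\delta\|_{\infty} \le \epsilon
```

Here x is the original utterance, and the constraint keeps the perturbation small enough to remain largely imperceptible; the difficulty in the black-box setting is that neither S nor θ is directly observable by the attacker.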
FakeBob stands out for its capacity to generate adversarial samples in a black-box setting, where attackers have no access to the internal structure or configuration of the targeted SRS. It combines threshold estimation, black-box gradient estimation, and iterative optimization to engineer adversarial samples that are effective yet largely imperceptible to human listeners. It also challenges existing defense solutions, demonstrating their ineffectiveness against adversarial samples crafted specifically for SRSs.
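A minimal sketch of such a black-box loop is given below, assuming a score oracle score_fn (the score the SRS returns for the target speaker) and an estimated acceptance threshold; the function names, hyperparameters, and the finite-difference gradient estimator are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def estimate_gradient(score_fn, x, sigma=1e-3, n_samples=50):
    """Approximate the gradient of a black-box score function by
    averaging finite differences along random Gaussian directions."""
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape)
        grad += (score_fn(x + sigma * u) - score_fn(x - sigma * u)) * u
    return grad / (2.0 * sigma * n_samples)

def black_box_attack(score_fn, threshold, x, epsilon=0.002, lr=5e-4, max_iter=1000):
    """Iteratively perturb the waveform x until the target speaker's
    score exceeds the (estimated) threshold, keeping the perturbation
    inside an L-infinity ball of radius epsilon."""
    x_adv = x.copy()
    for _ in range(max_iter):
        if score_fn(x_adv) > threshold:            # target accepted: attack done
            break
        grad = estimate_gradient(score_fn, x_adv)
        x_adv = x_adv + lr * np.sign(grad)         # ascend the target speaker's score
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)  # imperceptibility bound
        x_adv = np.clip(x_adv, -1.0, 1.0)          # keep a valid waveform
    return x_adv
```

Each call to score_fn corresponds to one query to the target SRS, which is why query efficiency and accurate threshold estimation matter so much in practice.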
Numerical Results and Implications
The paper reports strong experimental results: against open-source ivector and GMM-based implementations, FakeBob achieves near-complete success in deceiving the systems across different speaker recognition tasks. The attack also proves effective against commercial platforms such as Talentedsoft and Microsoft Azure, even under challenging over-the-air conditions, where adversarial samples must retain their efficacy after being played and re-recorded in a physical environment.
The paper also shows that current defense mechanisms, originally devised for speech recognition systems, fall short against FakeBob when applied to SRSs. Approaches such as local smoothing, quantization, audio squeezing, and temporal-dependency detection either fail to meaningfully reduce the attack success rate or inadvertently increase the false rejection rate of legitimate samples.
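For concreteness, the two simplest of these input transformations can be sketched as follows; the kernel size and quantization step are illustrative values, not the settings evaluated in the paper.

```python
import numpy as np
from scipy.signal import medfilt

def local_smoothing(waveform, kernel_size=3):
    """Replace each sample with the median of its neighborhood,
    hoping to wash out small adversarial perturbations."""
    return medfilt(waveform, kernel_size=kernel_size)

def quantization(waveform, q=256):
    """Round (scaled) 16-bit sample values to the nearest multiple of q,
    discarding the low-order bits an adversarial perturbation may rely on."""
    quantized = np.round(waveform * 32768.0 / q) * q / 32768.0
    return np.clip(quantized, -1.0, 1.0)
```

As the paper's experiments suggest, such transformations tend to degrade legitimate audio along with the adversarial perturbation, which is why they trade attack mitigation against higher false rejection rates.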
Concluding Remarks
This paper has several important implications. The vulnerabilities exposed by FakeBob call for a reevaluation of current defense strategies for SRSs. Practical defenses will require innovations that can robustly detect and thwart adversarial examples. Potential advances include stronger liveness detection to distinguish genuine human speech from replayed audio, and machine learning models tailored to detect adversarial patterns specific to speaker recognition.
The need for research in this direction is pressing, as adversarial attacks threaten to erode trust in systems central to security and authentication frameworks worldwide. As the technology evolves, so too must its defenses, ensuring that speaker recognition systems can operate securely in both controlled and practical environments. The paper paves the way for future work on adversarial resilience and defense methodology, laying a critical foundation for sustainable security in voice recognition technologies.