Collaborative Watermarking for Adversarial Speech Synthesis (2309.15224v2)
Abstract: Advances in neural speech synthesis have produced technology that not only approaches human naturalness, but can also clone a voice instantly from little data, and is highly accessible through publicly released pre-trained models. The resulting potential flood of generated content raises the need for synthetic speech detection and watermarking. Recently, much of the research effort in synthetic speech detection has centred on the Automatic Speaker Verification and Spoofing Countermeasure Challenge (ASVspoof), which focuses on passive countermeasures. This paper takes a complementary view of generated speech detection: a synthesis system should make an active effort to watermark the generated speech in a way that aids detection by another machine, yet remains transparent to a human listener. We propose a collaborative training scheme for synthetic speech watermarking and show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance over conventional classifier training. Furthermore, we demonstrate how collaborative training can be paired with augmentation strategies for added robustness against noise and time-stretching. Finally, listening tests show that collaborative training has little adverse effect on the perceptual quality of vocoded speech.
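The core idea in the abstract, a vocoder that cooperates with a detector rather than fooling it, can be illustrated with a toy objective. The sketch below is only an assumption about the general shape of such a loss, not the paper's actual formulation: `collaborative_generator_loss`, `lam`, and the scalar inputs are all illustrative names.

```python
import math

def sigmoid(x):
    """Squash a raw detector logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, y):
    """Binary cross-entropy between predicted probability p and label y in {0, 1}."""
    eps = 1e-12  # guard against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def collaborative_generator_loss(recon_loss, detector_score, lam=1.0):
    """Toy collaborative objective (assumed form, not the paper's exact loss).

    The vocoder minimises its reconstruction loss PLUS the detector's
    classification error on the synthetic class. Label 1 means "synthetic":
    the generator WANTS the detector to flag its output, the opposite sign
    convention to an adversarial GAN generator loss.
    """
    return recon_loss + lam * bce(detector_score, 1)

# A more detectable synthetic sample (score 0.9) yields a LOWER total loss
# than a stealthy one (score 0.1), at equal reconstruction quality.
print(collaborative_generator_loss(0.5, 0.9) < collaborative_generator_loss(0.5, 0.1))
```

In practice the detector would score augmented copies of the generated waveform (added noise, time-stretching) rather than clean outputs, which is how the abstract's augmentation strategies would slot into this loop.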
- “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” Advances in Neural Information Processing Systems, vol. 31, 2018.
- “YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone,” in International Conference on Machine Learning. PMLR, 2022, pp. 2709–2720.
- “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, Feb. 2015.
- “ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023.
- “ADD 2022: The first Audio Deep Synthesis Detection Challenge,” in Proc. ICASSP, May 2022, pp. 9216–9220.
- “Speech is Silver, Silence is Golden: What do ASVspoof-trained Models Really Learn?,” in Proc. ASVspoof Challenge workshop, 2021, pp. 55–60.
- “The Effect of Silence and Dual-Band Fusion in Anti-Spoofing System,” in Proc. Interspeech, 2021, pp. 4279–4283.
- “Generalization of spoofing countermeasures: A case study with ASVspoof 2015 and BTAS 2016 corpora,” in Proc. ICASSP, 2017, pp. 2047–2051.
- “Does Audio Deepfake Detection Generalize?,” Proc. Interspeech, pp. 2783–2787, 2022.
- “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014.
- “GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram,” in Proc. Interspeech, 2019, pp. 694–698.
- “Probability density distillation with generative adversarial networks for high-quality parallel waveform generation,” in Proc. Interspeech, 2019, pp. 699–703.
- “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
- F. Petitcolas, “Watermarking schemes evaluation,” IEEE Signal Processing Magazine, vol. 17, no. 5, pp. 58–64, 2000.
- “Secure spread spectrum watermarking for images, audio and video,” in Proc. ICIP, 1996, vol. 3, pp. 243–246.
- “Techniques for data hiding,” IBM Systems Journal, vol. 35, no. 3.4, pp. 313–336, 1996.
- “Echo hiding,” in Proc. Information Hiding: First International Workshop, Cambridge, UK. Springer, 1996, pp. 295–315.
- “Robust speech watermarking by a jointly trained embedder and detector using a DNN,” Digital Signal Processing, vol. 122, pp. 103381, 2022.
- “WavMark: Watermarking for audio generation,” arXiv preprint arXiv:2308.12770, 2023.
- M. Steinebach, Digitale Wasserzeichen fuer Audiodaten, Shaker, 2004.
- “Artificial fingerprinting for generative models: Rooting deepfake attribution in training data,” in Proc. ICCV, 2021, pp. 14448–14457.
- “Responsible disclosure of generative models using scalable fingerprinting,” in Proc. ICLR, 2022.
- X. Wang and J. Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” in Proc. ICASSP, 2023, pp. 1–5.
- “Removing batch normalization boosts adversarial training,” in International Conference on Machine Learning. PMLR, 2022, pp. 23433–23445.
- “A comparison of features for synthetic speech detection,” in Proc. Interspeech, 2015, pp. 2087–2091.
- “A light CNN for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
- X. Wang and J. Yamagishi, “A comparative study on recent neural spoofing countermeasures for synthetic speech detection,” in Proc. Interspeech, 2021, pp. 4259–4263.
- “End-to-end anti-spoofing with RawNet2,” in Proc. ICASSP, 2021, pp. 6369–6373.
- “MUSAN: A Music, Speech, and Noise Corpus,” 2015.
- “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019.
- “WaveGrad: Estimating gradients for waveform generation,” in Proc. ICLR, 2021.
- “DiffWave: A versatile diffusion model for audio synthesis,” in Proc. ICLR, 2021.