Unintended Memorization in Large ASR Models, and How to Mitigate It (2310.11739v1)

Published 18 Oct 2023 in cs.LG, cs.SD, and eess.AS

Abstract: It is well-known that neural networks can unintentionally memorize their training examples, raising privacy concerns. However, auditing memorization in large non-auto-regressive automatic speech recognition (ASR) models has been challenging due to the high compute cost of existing methods such as hardness calibration. In this work, we design a simple auditing method to measure memorization in large ASR models without the extra compute overhead. Concretely, we speed up randomly generated utterances to create a mapping between vocal and text information that is difficult to learn from typical training examples. Hence, accurate predictions only on sped-up training examples serve as clear evidence of memorization, and the corresponding accuracy can be used to quantify it. Using the proposed method, we demonstrate memorization in state-of-the-art ASR models. To mitigate memorization, we apply gradient clipping during training to bound the influence of any individual example on the final model. We empirically show that clipping each example's gradient can mitigate memorization for sped-up training examples with up to 16 repetitions in the training set. Furthermore, we show that in large-scale distributed training, clipping the average gradient on each compute core keeps model quality and compute cost neutral while providing strong privacy protection.
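
The speed-up step of the audit is easy to prototype. Below is a minimal sketch assuming 16 kHz mono audio stored as a NumPy float array; the 2x factor, linear-interpolation resampling, and the synthetic waveform are illustrative assumptions, not the authors' exact pipeline (which speeds up synthesized canary utterances before injecting them into training).

```python
# Sketch of the speed-up canary idea: compress an utterance in time so its
# audio-to-text mapping is unlike anything learnable from normal data.
# Assumptions (not from the paper): 16 kHz mono, 2x speed, linear interpolation.
import numpy as np

def speed_up(waveform: np.ndarray, factor: float = 2.0) -> np.ndarray:
    """Naively resample `waveform` so it plays `factor` times faster.

    Sampling the signal on a coarser time grid shortens it while keeping
    the same sample rate, i.e. the audio is sped up.
    """
    n_out = int(len(waveform) / factor)
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)

# Example: a 1-second synthetic "utterance" becomes a 0.5-second canary.
rng = np.random.default_rng(0)
utterance = rng.standard_normal(16_000).astype(np.float32)
canary = speed_up(utterance, factor=2.0)
assert len(canary) == 8_000
```

The mitigation bounds each example's influence by clipping its gradient before averaging, as in DP-SGD-style training without added noise. The sketch below shows only the clipping math in plain NumPy; shapes and the clip norm are illustrative, and in practice this runs inside the training framework. The distributed variant the abstract reports as quality- and cost-neutral clips each compute core's average gradient instead of each example's, which is the same operation applied to per-core means.

```python
# Sketch of per-example gradient clipping (illustrative, framework-agnostic).
import numpy as np

def clip_and_average(grads: np.ndarray, clip_norm: float) -> np.ndarray:
    """grads: shape (n_rows, n_params), one gradient per row.

    Scales each row down to L2 norm <= clip_norm, then averages the rows,
    so no single row can dominate the update.
    """
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return (grads * scale).mean(axis=0)

# Per-example clipping: rows are individual examples' gradients.
# Per-core clipping: rows are each core's *average* gradient over its
# local batch; the same clip-then-average applies.
```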
