Private federated discovery of out-of-vocabulary words for Gboard

Published 17 Apr 2024 in cs.DS | (2404.11607v2)

Abstract: The vocabulary of LLMs in Gboard, Google's keyboard application, plays a crucial role in improving user experience. One way to improve the vocabulary is to discover frequently typed out-of-vocabulary (OOV) words on user devices. This task requires strong privacy protection due to the sensitive nature of user input data. In this report, we present a private OOV discovery algorithm for Gboard, which builds on recent advances in private federated analytics. The system offers local differential privacy (LDP) guarantees for user-contributed words. With anonymous aggregation, the final released result satisfies central differential privacy guarantees with $\varepsilon = 0.315, \delta = 10^{-10}$ for OOV discovery in en-US (English in the United States).
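
The abstract does not describe the local randomizer itself. As a rough illustration of what an LDP guarantee on a contributed word means, the sketch below uses k-ary randomized response over a small fixed candidate set. This is a textbook mechanism chosen here as an assumption, not the algorithm from the report; the function names, the candidate list, and the value of epsilon are all hypothetical.

```python
import math
import random
from collections import Counter


def krr_randomize(word, candidates, epsilon, rng):
    """Illustrative epsilon-LDP randomizer (k-ary randomized response).

    Assumes `word` is one of `candidates` and len(candidates) >= 2.
    Reports the true word with probability e^eps / (e^eps + k - 1),
    otherwise a uniformly random *other* candidate.
    """
    k = len(candidates)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_true:
        return word
    others = [c for c in candidates if c != word]
    return rng.choice(others)


def krr_estimate(reports, candidates, epsilon):
    """Unbiased count estimates from the randomized reports."""
    k = len(candidates)
    n = len(reports)
    e = math.exp(epsilon)
    p = e / (e + k - 1)  # probability of reporting the true word
    q = 1 / (e + k - 1)  # probability of reporting one specific other word
    counts = Counter(reports)
    # E[count_c] = n_c * p + (n - n_c) * q, solved for n_c.
    return {c: (counts[c] - n * q) / (p - q) for c in candidates}


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = ["rizz", "yeet", "sus", "bruh"]  # hypothetical OOV candidates
    typed = ["rizz"] * 600 + ["yeet"] * 300 + ["sus"] * 100
    reports = [krr_randomize(w, candidates, 2.0, rng) for w in typed]
    print(krr_estimate(reports, candidates, 2.0))
```

Each device only ever sends its randomized report, so the server learns noisy aggregate counts instead of exact words. The central guarantee quoted in the abstract comes from additionally aggregating the reports anonymously, which this sketch does not model.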
