PrIeD-KIE: Towards Privacy Preserved Document Key Information Extraction (2310.03777v1)

Published 5 Oct 2023 in cs.CL

Abstract: In this paper, we introduce strategies for developing private Key Information Extraction (KIE) systems by leveraging large pretrained document foundation models in conjunction with differential privacy (DP), federated learning (FL), and Differentially Private Federated Learning (DP-FL). Through extensive experimentation on six benchmark datasets (FUNSD, CORD, SROIE, WildReceipts, XFUND, and DOCILE), we demonstrate that large document foundation models can be effectively fine-tuned for the KIE task under private settings to achieve adequate performance while maintaining strong privacy guarantees. Moreover, by thoroughly analyzing the impact of various training and model parameters on model performance, we propose simple yet effective guidelines for achieving an optimal privacy-utility trade-off for the KIE task under global DP. Finally, we introduce FeAm-DP, a novel DP-FL algorithm that enables efficient upscaling of global DP from a standalone context to a multi-client federated environment. We conduct a comprehensive evaluation of the algorithm across various client and privacy settings, and demonstrate that it achieves performance and privacy guarantees comparable to standalone DP, even as the number of participating clients increases. Overall, our study offers valuable insights into the development of private KIE systems and highlights the potential of document foundation models for privacy-preserved Document AI applications. To the best of the authors' knowledge, this is the first work that explores privacy-preserved document KIE using document foundation models.
