
ATLAS: Automatically Detecting Discrepancies Between Privacy Policies and Privacy Labels

Published 24 May 2023 in cs.CR, cs.AI, and cs.LG | (2306.09247v1)

Abstract: Privacy policies are long, complex documents that end-users seldom read. Privacy labels aim to ameliorate these issues by providing succinct summaries of salient data practices. In December 2020, Apple began requiring that app developers submit privacy labels describing their apps' data practices. Yet, research suggests that app developers often struggle to do so. In this paper, we automatically identify possible discrepancies between mobile app privacy policies and their privacy labels. Such discrepancies could be indicators of potential privacy compliance issues. We introduce the Automated Privacy Label Analysis System (ATLAS). ATLAS includes three components: a pipeline to systematically retrieve iOS App Store listings and privacy policies; an ensemble-based classifier capable of predicting privacy labels from the text of privacy policies with 91.3% accuracy using state-of-the-art NLP techniques; and a discrepancy analysis mechanism that enables a large-scale privacy analysis of the iOS App Store. Our system has enabled us to analyze 354,725 iOS apps. We find several interesting trends. For example, only 40.3% of apps in the App Store provide easily accessible privacy policies, and only 29.6% of apps provide both accessible privacy policies and privacy labels. Among apps that provide both, 88.0% have at least one possible discrepancy between the text of their privacy policy and their privacy label, which could be indicative of a potential compliance issue. We find that, on average, apps have 5.32 such potential compliance issues. We hope that ATLAS will help app developers, researchers, regulators, and mobile app stores alike. For example, app developers could use our classifier to check for discrepancies between their privacy policies and privacy labels, and regulators could use our system to help review apps at scale for potential compliance issues.
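The percentages in the abstract can be turned into rough app counts with simple arithmetic. The sketch below is only an illustration: the counts are rounded estimates derived from the reported percentages, not figures stated by the authors, and the variable names are made up for this example.

```python
# Figures reported in the abstract.
TOTAL_APPS = 354_725        # iOS apps analyzed
SHARE_WITH_POLICY = 0.403   # apps with an easily accessible privacy policy
SHARE_WITH_BOTH = 0.296     # apps with both a policy and a privacy label
SHARE_DISCREPANT = 0.880    # of apps with both, share with >= 1 possible discrepancy


def approx_count(total: int, share: float) -> int:
    """Estimate a count from a rounded percentage (approximate by construction)."""
    return round(total * share)


apps_with_policy = approx_count(TOTAL_APPS, SHARE_WITH_POLICY)
apps_with_both = approx_count(TOTAL_APPS, SHARE_WITH_BOTH)
apps_flagged = approx_count(apps_with_both, SHARE_DISCREPANT)

# Share of ALL analyzed apps that both provide policy + label and are flagged.
share_flagged_overall = SHARE_WITH_BOTH * SHARE_DISCREPANT

print(f"apps with accessible policy: ~{apps_with_policy:,}")
print(f"apps with policy and label:  ~{apps_with_both:,}")
print(f"apps with >= 1 discrepancy:  ~{apps_flagged:,}")
print(f"flagged share of all apps:   ~{share_flagged_overall:.1%}")
```

Note that the 88.0% figure is conditional on an app providing both a policy and a label, so it corresponds to roughly a quarter of all analyzed apps rather than 88% of the App Store.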


Summary

  • The paper presents ATLAS, a system that uses transformer-based NLP to flag possible discrepancies between iOS apps' privacy policies and their privacy labels.
  • Its ensemble-based classifier predicts privacy labels from privacy policy text with a reported accuracy of 91.3%.
  • The authors suggest that developers could use the classifier to check their own labels, and that regulators could use the system to review apps at scale for potential compliance issues.

Overview of "ATLAS: Automatically Detecting Discrepancies Between Privacy Policies and Privacy Labels"

The paper "ATLAS: Automatically Detecting Discrepancies Between Privacy Policies and Privacy Labels" by Akshath Jain, David Rodriguez, Jose M. del Alamo, and Norman Sadeh addresses the consistency between privacy policies and the privacy labels that summarize them. The work applies natural language processing (NLP) and machine learning (ML) models, particularly transformer architectures, to compare the full text of privacy policies with the privacy labels shown on iOS App Store listings.

Methodology and Key Components

The authors introduce ATLAS, a system designed to identify possible inconsistencies between privacy policies and their corresponding privacy labels. It has three components:

  • Data Collection and Preprocessing: A pipeline systematically retrieves iOS App Store listings and the privacy policies they link to. The policy text is preprocessed so that it can be analyzed automatically.
  • Model Training: An ensemble-based classifier is trained to predict an app's privacy labels from the text of its privacy policy. The abstract reports 91.3% accuracy for this classifier.
  • Discrepancy Analysis: The labels predicted from the policy text are compared against the labels the developer declared. Mismatches are flagged as possible discrepancies, which may indicate potential compliance issues.
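The comparison step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: this summary does not describe how the ensemble combines its models, so simple vote counting is assumed here, and the function names `ensemble_predict` and `find_discrepancies` as well as the example label names are hypothetical.

```python
from collections import Counter


def ensemble_predict(votes: list[set[str]], min_votes: int) -> set[str]:
    """Combine per-model label predictions by vote counting (an assumption).

    A label is kept only if at least `min_votes` models predicted it.
    """
    counts = Counter(label for model_labels in votes for label in model_labels)
    return {label for label, n in counts.items() if n >= min_votes}


def find_discrepancies(predicted: set[str], declared: set[str]) -> dict[str, set[str]]:
    """Compare labels predicted from the policy with developer-declared labels."""
    return {
        # Data types the policy appears to mention but the label omits.
        "in_policy_not_in_label": predicted - declared,
        # Data types the label declares but the policy does not appear to cover.
        "in_label_not_in_policy": declared - predicted,
    }


# Hypothetical outputs of three classifiers for one app's privacy policy.
votes = [
    {"Location", "Identifiers"},
    {"Location"},
    {"Location", "Contact Info", "Identifiers"},
]
predicted = ensemble_predict(votes, min_votes=2)

# Hypothetical labels declared on the app's store listing.
declared = {"Location", "Usage Data"}

report = find_discrepancies(predicted, declared)
print(report)
```

Reporting the two directions separately is a design choice of this sketch: a data type found in the policy but missing from the label suggests under-disclosure in the label, while the reverse suggests the policy may be incomplete.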

Experimental Results

The classifier is evaluated with metrics such as precision, recall, and F1-score, and the abstract reports 91.3% accuracy in predicting privacy labels from policy text. Applied to 354,725 iOS apps, the system finds that only 40.3% provide an easily accessible privacy policy and only 29.6% provide both an accessible policy and a privacy label. Among apps that provide both, 88.0% have at least one possible discrepancy, with an average of 5.32 potential compliance issues per app.
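For readers unfamiliar with these metrics in a multi-label setting, the sketch below computes micro-averaged precision, recall, and F1 over sets of labels. It is an illustration only: the averaging scheme is an assumption (this summary does not say which one the paper uses), the function name `micro_prf` is hypothetical, and the example data is invented.

```python
def micro_prf(
    y_true: list[set[str]], y_pred: list[set[str]]
) -> tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 for multi-label predictions.

    Counts are pooled over all apps before the ratios are taken.
    """
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))  # correctly predicted labels
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))  # predicted but not true
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))  # true but not predicted

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented example: reference labels and predictions for three apps.
y_true = [{"Location", "Identifiers"}, {"Contact Info"}, {"Location"}]
y_pred = [{"Location"}, {"Contact Info", "Identifiers"}, {"Location"}]

precision, recall, f1 = micro_prf(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example there are three correct labels, one spurious prediction, and one missed label, so precision and recall are both 3/4.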

Discussion and Implications

A thorough discussion in the paper elucidates the implications of the findings:

  • For Consumers: Fewer mismatches between policies and labels would make the labels shown to users a more reliable summary of an app's data practices, which supports transparency and trust.
  • For Developers: Developers could use the classifier to check whether their privacy labels are consistent with their privacy policies before submitting them.
  • For Regulators: Regulators could use the system to help review apps at scale for potential compliance issues. A flagged discrepancy is an indicator that merits review, not proof of a violation.

Theoretical Contributions

From a theoretical standpoint, this paper extends the body of knowledge in several areas:

  1. NLP and ML Application in Privacy: It shows that NLP models can be applied to privacy policy compliance by predicting structured privacy labels from free-form policy text.
  2. Automated Compliance Mechanisms: The exploration into automated mechanisms for compliance could spur further studies into regulatory tech (RegTech) applications, enriching the literature with practical, automated solutions for various legal domains.

Future Directions

The paper opens avenues for future research:

  • Model Enhancement: Future studies could focus on enhancing the transformer models with domain-specific tweaks to improve accuracy further.
  • Broader Applicability: Extending the current model to accommodate varied types of policies and terms of service across different platforms could broaden its utility.
  • Integration with Legal Tech: Integrating ATLAS with existing legal tech solutions to aid in comprehensive audits and enhanced regulatory compliance could be another promising direction.

In conclusion, this work by Jain et al. automates the detection of possible inconsistencies between privacy policies and privacy labels and applies it at App Store scale, finding that most apps that provide both show at least one potential discrepancy.
