Polisis: Automated Analysis and Presentation of Privacy Policies Using Deep Learning (1802.02561v2)

Published 7 Feb 2018 in cs.CL, cs.CR, and cs.HC

Abstract: Privacy policies are the primary channel through which companies inform users about their data collection and sharing practices. These policies are often long and difficult to comprehend. Short notices based on information extracted from privacy policies have been shown to be useful but face a significant scalability hurdle, given the number of policies and their evolution over time. Companies, users, researchers, and regulators still lack usable and scalable tools to cope with the breadth and depth of privacy policies. To address these hurdles, we propose an automated framework for privacy policy analysis (Polisis). It enables scalable, dynamic, and multi-dimensional queries on natural language privacy policies. At the core of Polisis is a privacy-centric language model, built with 130K privacy policies, and a novel hierarchy of neural-network classifiers that accounts for both high-level aspects and fine-grained details of privacy practices. We demonstrate Polisis' modularity and utility with two applications supporting structured and free-form querying. The structured querying application is the automated assignment of privacy icons from privacy policies. With Polisis, we can achieve an accuracy of 88.4% on this task. The second application, PriBot, is the first freeform question-answering system for privacy policies. We show that PriBot can produce a correct answer among its top-3 results for 82% of the test questions. Using an MTurk user study with 700 participants, we show that at least one of PriBot's top-3 answers is relevant to users for 89% of the test questions.

Citations (320)


Summary

The paper presents a deep-learning framework that automates the analysis of privacy policies, achieving 88.4% accuracy on the structured-querying task of assigning privacy icons.
It combines a privacy-specific language model, built from 130,000 policies, with a hierarchy of neural-network classifiers that labels policy segments at both a high level and a fine-grained level.
The system supports both structured querying and free-form Q&A: PriBot returns a correct answer among its top-3 results for 82% of test questions, and in an MTurk user study at least one of its top-3 answers was relevant to users for 89% of test questions.

Overview of "Polisis: Automated Analysis and Presentation of Privacy Policies Using Deep Learning"

The paper introduces "Polisis," an automated framework that uses deep learning to analyze privacy policies. It addresses a persistent problem: privacy policies are long and complex, and users, researchers, and regulators lack usable tools to handle them at scale. "Polisis" supports scalable, dynamic, and multi-dimensional queries on natural language privacy policies by combining a privacy-specific language model with a hierarchy of neural-network classifiers.

Core Components

"Polisis" is constructed around three primary layers: Application Layer, Data Layer, and Machine Learning (ML) Layer.

ML Layer: At the core is a privacy-centric language model built from 130,000 privacy policies from websites and apps. The ML Layer also includes a hierarchy of neural-network classifiers that identifies both high-level and fine-grained privacy classes within policy segments. This allows finer-grained classification and more direct querying than simpler heuristic-based methods.
Data Layer: This layer handles preprocessing. It first extracts the policy text from the web, then splits it into segments using semantic similarity techniques. Elements such as lists are handled separately so that each segment stays coherent.
Application Layer: This layer supports both structured and free-form queries, letting users and researchers run complex information-retrieval tasks over privacy policy content.
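The two-level classification described for the ML Layer can be sketched as follows. This is a minimal illustration, not the authors' implementation: the keyword-based scorers stand in for trained neural classifiers, and the label names, the 0.5 threshold, and the helper names (keyword_classifier, classify_hierarchically) are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# A "classifier" here is any function mapping a segment to {label: probability}.
Classifier = Callable[[str], Dict[str, float]]


def keyword_classifier(keywords: Dict[str, List[str]]) -> Classifier:
    """Toy stand-in for a trained neural classifier (assumption):
    a label scores 1.0 if any of its keywords occurs in the segment, else 0.0."""
    def classify(segment: str) -> Dict[str, float]:
        text = segment.lower()
        return {
            label: 1.0 if any(k in text for k in kws) else 0.0
            for label, kws in keywords.items()
        }
    return classify


def classify_hierarchically(
    segment: str,
    top_level: Classifier,
    attribute_level: Dict[str, Classifier],
    threshold: float = 0.5,
) -> Dict[str, List[str]]:
    """Run the high-level classifier first, then only the fine-grained
    classifiers that belong to the categories it selected."""
    result: Dict[str, List[str]] = {}
    for category, prob in top_level(segment).items():
        if prob < threshold:
            continue  # category not present; skip its fine-grained classifier
        sub = attribute_level.get(category)
        scores = sub(segment) if sub is not None else {}
        result[category] = [lbl for lbl, p in scores.items() if p >= threshold]
    return result


# Illustrative labels only; the real label set comes from the paper's taxonomy.
top = keyword_classifier({
    "first-party-collection": ["collect"],
    "third-party-sharing": ["share", "third part"],
})
attrs = {
    "first-party-collection": keyword_classifier({
        "location": ["location"],
        "contact": ["email", "phone"],
    }),
    "third-party-sharing": keyword_classifier({"advertising": ["advertis"]}),
}

result = classify_hierarchically(
    "We collect your location and email address.", top, attrs
)
print(result)
```

The point of the hierarchy is that fine-grained classifiers only run, and only need to be trained, within the context of their parent category, so a structured query can be expressed as a combination of a category and its attribute values.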

Applications and Results

The practicality of "Polisis" is demonstrated through two applications: structured querying with privacy icons and free-form privacy policy Q&A.

Structured Querying: The framework automates the assignment of privacy icons, achieving 88.4% accuracy on this task, indicating high alignment with annotations made by legal experts.
Free-form Question Answering: The QA system, PriBot, returns a correct answer among its top-3 results for 82% of the test questions. In an MTurk user study with 700 participants, at least one of its top-3 answers was relevant to users for 89% of the test questions.
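The top-3 figures above are instances of a top-k accuracy metric: a question counts as answered if any of the first k ranked answers is acceptable. A minimal sketch of that computation follows; the function name and the toy data are assumptions for illustration and are not taken from the paper's evaluation code.

```python
from typing import List, Sequence, Set


def top_k_accuracy(
    ranked_answers: Sequence[List[str]],
    acceptable: Sequence[Set[str]],
    k: int = 3,
) -> float:
    """Fraction of questions for which at least one of the first k
    ranked answers is in that question's set of acceptable answers."""
    if len(ranked_answers) != len(acceptable):
        raise ValueError("one acceptable-answer set is required per question")
    if not ranked_answers:
        raise ValueError("at least one question is required")
    hits = sum(
        1
        for ranked, ok in zip(ranked_answers, acceptable)
        if any(answer in ok for answer in ranked[:k])
    )
    return hits / len(ranked_answers)


# Toy data: four questions, each with a ranked list of segment ids.
ranked = [
    ["s1", "s2", "s3", "s4"],  # hit at rank 1
    ["s5", "s6", "s7", "s8"],  # hit at rank 3
    ["s9", "s1", "s2", "s3"],  # only hit is at rank 4, outside the top 3
    ["s4", "s5", "s6", "s7"],  # hit at rank 2
]
gold = [{"s1"}, {"s7"}, {"s3"}, {"s5"}]

score = top_k_accuracy(ranked, gold, k=3)
print(score)
```

With this toy data, three of the four questions have an acceptable answer in the top 3, so the score is 0.75. The same function covers both reported numbers if the acceptable sets are taken from expert labels in one case and from user relevance judgments in the other.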

Implications and Future Prospects

The framework could change how people interact with privacy policies and how compliance is monitored. It opens the way to real-time, conversational interfaces for privacy information, which matter more as voice-activated and smart devices proliferate. For regulators and compliance researchers, "Polisis" offers a scalable approach to auditing whether privacy commitments align with regulatory expectations.

Theoretical and Practical Considerations

From a theoretical standpoint, "Polisis" is significant because it applies deep learning to the legal and linguistic complexity of natural language privacy policies. Practically, it lets stakeholders query policies at scale and supports compliance work.

Future Directions

Future enhancements could focus on expanding the hierarchy of classifiers to cover emerging privacy considerations and on improving model robustness against adversarial manipulation of policy text. The framework would also need methods that adapt to policy changes and shifting consumer expectations in order to remain effective as privacy regulations and digital ecosystems evolve.

Overall, "Polisis" represents a substantive advancement in privacy policy analysis, enabling more accessible, understandable, and actionable insights into privacy practices that align with both legal frameworks and user comprehension.
