
An In-Depth Investigation of Data Collection in LLM App Ecosystems (2408.13247v2)

Published 23 Aug 2024 in cs.CR, cs.AI, cs.CL, cs.CY, and cs.LG

Abstract: LLM app (tool) ecosystems are rapidly evolving to support sophisticated use cases that often require extensive user data collection. Given that LLM apps are developed by third parties and anecdotal evidence indicating inconsistent enforcement of policies by LLM platforms, sharing user data with these apps presents significant privacy risks. In this paper, we aim to bring transparency in data practices of LLM app ecosystems. We examine OpenAI's GPT app ecosystem as a case study. We propose an LLM-based framework to analyze the natural language specifications of GPT Actions (custom tools) and assess their data collection practices. Our analysis reveals that Actions collect excessive data across 24 categories and 145 data types, with third-party Actions collecting 6.03% more data on average. We find that several Actions violate OpenAI's policies by collecting sensitive information, such as passwords, which is explicitly prohibited by OpenAI. Lastly, we develop an LLM-based privacy policy analysis framework to automatically check the consistency of data collection by Actions with disclosures in their privacy policies. Our measurements indicate that the disclosures for most of the collected data types are omitted, with only 5.8% of Actions clearly disclosing their data collection practices.


Summary

  • The paper presents a comprehensive analysis of GPT apps, examining 119,274 GPTs and 2,596 Actions to expose privacy vulnerabilities.
  • It reveals that 82.9% of Actions are third-party services that collect excessive user data, including sensitive information prohibited by platform policy.
  • The study finds that co-occurring Actions can amplify data exposure by up to 9.5 times, highlighting the need for stronger execution isolation.

Data Exposure from LLM Apps: An In-Depth Investigation of OpenAI's GPTs

The paper "Data Exposure from LLM Apps: An In-depth Investigation of OpenAI's GPTs," authored by Evin Jaff, Yuhao Wu, Ning Zhang, and Umar Iqbal, meticulously explores the data practices within the emerging ecosystem of LLM applications, specifically focusing on OpenAI's GPT platform. The research emphasizes the various privacy risks posed by third-party-developed GPT applications (apps) and evaluates their data collection practices, indirect data exposure, and the transparency of their privacy policies.

Overview of OpenAI’s GPT Ecosystem

The paper identifies that the OpenAI GPT ecosystem, the most mature among third-party LLM application platforms, has grown rapidly, boasting more than 3 million GPTs. These GPTs are equipped with various tools such as web browsers, image generation (DALL·E), code interpreters, and external services (Actions) to provide extensive functionalities. However, the integration of third-party services introduces significant privacy concerns, particularly because these services collect vast amounts of user data, often without adequate checks or balances.

Methodology and Scope

The authors conducted a detailed analysis by crawling 119,274 GPTs and 2,596 unique Actions over a period of four months. Their methodology combined a static analysis of the natural-language source code (specifications) of GPTs and their Actions to characterize data collection practices with an LLM-based framework that checks the privacy policies of Actions for data collection disclosures.
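The exact prompts and models behind the authors' LLM-based framework are not reproduced in this summary. As a rough sketch of the general approach, the snippet below prompts an LLM to extract collected data types from an Action's OpenAPI specification; the prompt wording, model choice, and JSON output format are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch (not the authors' framework): ask an LLM which user data
# types a GPT Action collects, given its OpenAPI specification.
# Assumes the OpenAI Python SDK (>= 1.x) and OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """You are analyzing a GPT Action (a custom tool) for data collection.
Given the OpenAPI specification below, list every user data type the Action
collects through its request parameters and request bodies.
Respond with a JSON array of strings, e.g. ["email address", "location"].

OpenAPI specification:
{spec}
"""

def extract_collected_data_types(openapi_spec: str) -> list[str]:
    """Return the data types the LLM identifies in an Action's spec."""
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(spec=openapi_spec)}],
        temperature=0,
    )
    # Naive parsing for illustration; a real pipeline would validate the output.
    return json.loads(response.choices[0].message.content)
```

In the paper, an analogous extraction step is applied across the 2,596 unique Action specifications, and the results are consolidated into 24 data categories and 145 data types.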

Significant Findings

Data Collection Practices

The analysis revealed that a significant proportion (82.9%) of Actions embedded in GPTs are external third-party services. These Actions collect an expansive range of data types, often including sensitive information like passwords, which is explicitly prohibited by OpenAI's policies. The collected data spans categories such as app activity, personal information, web browsing activities, location data, messages, financial information, files, photos, calendar events, device IDs, and health data.

Key Observations:

  • Expansive Data Collection: Actions collect diverse types of user data, often more than necessary for their stated functionality.
  • Sensitive Information: Some Actions collect sensitive information that is prohibited by OpenAI, raising significant privacy concerns (a screening sketch follows this list).
  • Third-Party Integrations: Many GPTs embed Actions from external third parties, reflecting practices reminiscent of early web and mobile ecosystems, which pose heightened privacy risks.
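To make the prohibited-data finding concrete, the sketch below screens extracted data types against small keyword lists. The lists and labels are illustrative assumptions; they do not reproduce the paper's 24-category taxonomy or OpenAI's policy text.

```python
# Illustrative sketch only: flag extracted data types that fall into
# categories prohibited or treated as sensitive by platform policy.
# The keyword lists are assumptions, not the paper's taxonomy.
PROHIBITED_KEYWORDS = {"password", "passphrase", "security question", "api key"}
SENSITIVE_KEYWORDS = {"health", "medical", "financial", "ssn", "location", "biometric"}

def screen_data_types(data_types: list[str]) -> dict[str, list[str]]:
    """Partition extracted data types into prohibited, sensitive, and other."""
    report = {"prohibited": [], "sensitive": [], "other": []}
    for dtype in data_types:
        lowered = dtype.lower()
        if any(k in lowered for k in PROHIBITED_KEYWORDS):
            report["prohibited"].append(dtype)
        elif any(k in lowered for k in SENSITIVE_KEYWORDS):
            report["sensitive"].append(dtype)
        else:
            report["other"].append(dtype)
    return report

# screen_data_types(["email address", "account password", "health records"])
# -> {'prohibited': ['account password'], 'sensitive': ['health records'],
#     'other': ['email address']}
```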

Indirect Data Exposure

The paper highlights the significant privacy risks posed by the lack of isolation in the execution of Actions within GPTs. The authors modeled the co-occurrence of Actions as a graph to study indirect data exposure, revealing that co-occurrence in GPTs can expose an Action to up to 9.5 times more data than it collects on its own. This lack of execution isolation enables third-party Actions to access and potentially influence each other’s data, exacerbating privacy risks.
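One simplified reading of this co-occurrence model (an interpretation for illustration, not the paper's exact graph formulation) is that, absent isolation, each Action is potentially exposed to the union of data types collected by every Action it co-occurs with.

```python
# Simplified interpretation of indirect exposure via co-occurrence
# (not the paper's exact formulation). Each GPT embeds a set of Actions;
# without execution isolation, an Action may be exposed to the data types
# collected by every Action it co-occurs with.

def indirect_exposure(gpt_to_actions: dict[str, set[str]],
                      action_to_data: dict[str, set[str]]) -> dict[str, float]:
    """Per Action, return the amplification factor:
    |data types reachable through co-occurring Actions| / |its own data types|."""
    exposed = {a: set(d) for a, d in action_to_data.items()}
    for actions in gpt_to_actions.values():
        pooled = set().union(*(action_to_data[a] for a in actions))
        for a in actions:
            exposed[a] |= pooled
    return {a: len(exposed[a]) / max(len(action_to_data[a]), 1)
            for a in action_to_data}

# Toy example: two Actions co-occur in one GPT, so each can see the other's data.
gpts = {"gpt_1": {"weather_action", "email_action"}}
data = {"weather_action": {"location"},
        "email_action": {"email address", "messages"}}
print(indirect_exposure(gpts, data))
# -> {'weather_action': 3.0, 'email_action': 1.5}
```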

Privacy Policy Compliance

The paper also scrutinizes the transparency of data collection practices in the privacy policies of Actions. Despite OpenAI's requirement for Actions to publish privacy policies, there is a striking inconsistency in data disclosure (a consistency-check sketch follows the list below):

  • Omitted Disclosures: Disclosures for most collected data types are omitted in privacy policies, with only 5.8% of Actions clearly disclosing their data collection practices.
  • Inconsistent Disclosures: Although nearly half of the Actions clearly disclosed more than half of the data they collected, significant inconsistencies still exist, including vague, ambiguous, and incorrect disclosures.
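A minimal sketch of such a consistency check, assuming the collected and disclosed data types have already been extracted as sets (for example, with LLM prompts similar to the one sketched above). The clear/vague/omitted labels echo the paper's terminology, but the matching logic here is a deliberate simplification.

```python
# Simplified consistency check between collected and disclosed data types.
# Both sets are assumed to have been extracted beforehand; the vague-term
# list is an illustrative assumption, not the paper's method.
VAGUE_TERMS = {"personal information", "usage data", "information you provide"}

def check_disclosures(collected: set[str], disclosed: set[str]) -> dict[str, str]:
    """Label each collected data type as 'clear', 'vague', or 'omitted'."""
    labels = {}
    for dtype in collected:
        if dtype in disclosed:
            labels[dtype] = "clear"    # explicitly named in the policy
        elif disclosed & VAGUE_TERMS:
            labels[dtype] = "vague"    # only covered by a catch-all phrase
        else:
            labels[dtype] = "omitted"  # not disclosed at all
    return labels

# Example: an Action collects email and location, but the policy only names
# "email address" plus the catch-all "personal information".
check_disclosures({"email address", "location"},
                  {"email address", "personal information"})
# -> {'email address': 'clear', 'location': 'vague'}
```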

Implications and Future Directions

The paper underscores the pressing need for improvements in the privacy and security landscape of LLM-based application ecosystems:

  • Enhanced Reviews: OpenAI should implement stringent review processes for GPTs and Actions to ensure compliance with its privacy requirements, including the prohibition on collecting sensitive information such as passwords.
  • Execution Isolation: To mitigate indirect data exposure risks, platforms like OpenAI should develop secure execution environments for Actions, akin to isolating third-party code in web browsers.
  • Transparent Disclosures: There is a need for tools and frameworks to help GPT developers provide clear and accurate privacy disclosures. LLMs themselves could be utilized to assist in drafting and verifying privacy policies.

The rapid maturation of the LLM app ecosystem, as evidenced by OpenAI's GPT platform, presents unique opportunities and challenges. To build a secure and trustworthy environment, it is imperative to prioritize privacy and security considerations right from the design phase. This research serves as a crucial step in highlighting the current gaps and proposing pathways for more robust policy enforcement and technological safeguards in LLM applications.
