- The paper presents a comprehensive analysis of GPT apps, examining 119,274 GPTs and 2,596 Actions to expose privacy vulnerabilities.
- It reveals that 82.9% of Actions come from external third parties, and that these Actions collect expansive user data, including sensitive information such as passwords that OpenAI's policies prohibit.
- The study uncovers that co-occurring Actions can amplify data exposure by up to 9.5 times, highlighting the need for stronger execution isolation.
Data Exposure from LLM Apps: An In-depth Investigation of OpenAI's GPTs
The paper "Data Exposure from LLM Apps: An In-depth Investigation of OpenAI's GPTs," authored by Evin Jaff, Yuhao Wu, Ning Zhang, and Umar Iqbal, examines data practices in the emerging ecosystem of LLM applications, focusing on OpenAI's GPT platform. The research characterizes the privacy risks posed by third-party-developed GPT applications (apps), evaluating their data collection practices, the indirect data exposure that arises when Actions co-occur, and the transparency of their privacy policies.
Overview of OpenAI’s GPT Ecosystem
The paper identifies the OpenAI GPT ecosystem as the most mature among third-party LLM application platforms, having grown rapidly to more than 3 million GPTs. These GPTs are equipped with various tools, such as web browsing, image generation (DALL-E), a code interpreter, and external services (Actions), to provide extensive functionality. However, the integration of third-party services introduces significant privacy concerns, particularly because these services collect vast amounts of user data, often without adequate checks or balances.
Methodology and Scope
The authors conducted a detailed analysis by crawling 119,274 GPTs and 2,596 unique Actions over a period of four months. Their methodology combined static analysis of the natural-language source code of GPTs and their Actions, used to characterize data collection practices, with a framework for analyzing the privacy policies of Actions for data collection disclosures.
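The paper's analysis pipeline is not reproduced here, but its core extraction step can be sketched: GPT Actions are declared through OpenAPI specifications, so the parameters and request-body fields of each endpoint indicate what user data an Action can receive. The Python sketch below illustrates this under that assumption; the `extract_collected_fields` helper is hypothetical, omits details such as `$ref` resolution, and is not the authors' implementation.

```python
# Minimal sketch: enumerate the data fields an Action's OpenAPI spec can receive.
# Hypothetical helper for illustration, not the paper's actual analysis pipeline.
from typing import Dict, Set


def extract_collected_fields(openapi_spec: Dict) -> Set[str]:
    """Collect parameter and request-body property names from every endpoint."""
    fields: Set[str] = set()
    for path_item in openapi_spec.get("paths", {}).values():
        for operation in path_item.values():
            if not isinstance(operation, dict):
                continue  # skip path-level keys such as "summary" or "parameters"
            # Query, path, and header parameters declared for the operation.
            for param in operation.get("parameters", []):
                fields.add(param.get("name", ""))
            # JSON request-body properties, if any.
            body = operation.get("requestBody", {})
            for media in body.get("content", {}).values():
                fields.update(media.get("schema", {}).get("properties", {}).keys())
    fields.discard("")
    return fields
```

A field set produced this way is the raw material for the data-collection characterization described next.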
Significant Findings
Data Collection Practices
The analysis revealed that a significant proportion (82.9%) of Actions embedded in GPTs are external third-party services. These Actions collect an expansive range of data types, often including sensitive information like passwords, which is explicitly prohibited by OpenAI's policies. The collected data spans categories such as app activity, personal information, web browsing activities, location data, messages, financial information, files, photos, calendar events, device IDs, and health data.
Key Observations:
- Expansive Data Collection: Actions collect diverse types of user data, often more than necessary for their stated functionality.
- Sensitive Information: Some Actions collect sensitive information that is prohibited by OpenAI, raising significant privacy concerns.
- Third-Party Integrations: Many GPTs embed Actions from external third parties, reflecting practices reminiscent of the early web and mobile ecosystems and posing heightened privacy risks.
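To connect fields extracted from Action specifications to the categories listed above (personal information, location, financial information, and so on), one plausible approximation is simple keyword-based bucketing. The paper's actual classification is more involved, so the mapping below is an illustrative assumption rather than the authors' method.

```python
# Illustrative keyword mapping from field names to coarse data categories.
# The category labels echo the paper's taxonomy; the keyword rules are assumptions.
CATEGORY_KEYWORDS = {
    "personal information": ["name", "email", "phone", "address"],
    "location": ["location", "latitude", "longitude", "city"],
    "financial information": ["card", "iban", "payment", "account_number"],
    "health data": ["symptom", "diagnosis", "medication", "heart_rate"],
    "credentials (prohibited)": ["password", "secret", "api_key", "token"],
}


def categorize_fields(fields):
    """Bucket collected field names into coarse data categories."""
    hits = {}
    for field in fields:
        lowered = field.lower()
        for category, keywords in CATEGORY_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                hits.setdefault(category, set()).add(field)
    return hits
```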
Indirect Data Exposure
The paper highlights the significant privacy risks posed by the lack of isolation in the execution of Actions within GPTs. The authors modeled the co-occurrence of Actions as a graph to study indirect data exposure, revealing that co-occurrence within GPTs can expose up to 9.5 times more data to an Action than it collects on its own. This lack of execution isolation enables third-party Actions to access, and potentially influence, each other's data, exacerbating privacy risks.
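The co-occurrence analysis can be pictured with a small sketch: treat each Action as a node, link Actions that appear in the same GPT, and compare the data types an Action collects on its own with the union of data types reachable through its neighbors. The function below is a simplified illustration of that idea rather than the authors' code; `gpt_actions` and `collects` stand for inputs derived from the crawled data.

```python
# Simplified sketch of the co-occurrence exposure analysis (not the paper's code).
from collections import defaultdict
from itertools import combinations


def exposure_amplification(gpt_actions, collects):
    """gpt_actions: {gpt_id: [action, ...]}; collects: {action: {data type, ...}}.

    Returns, per Action, the ratio between the data types reachable when
    co-occurring Actions share an execution context and the data types the
    Action collects on its own.
    """
    # Undirected co-occurrence graph: an edge means "appear in the same GPT".
    neighbors = defaultdict(set)
    for actions in gpt_actions.values():
        for a, b in combinations(sorted(set(actions)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)

    amplification = {}
    for action, own in collects.items():
        reachable = set(own)
        for other in neighbors.get(action, ()):
            reachable |= collects.get(other, set())
        amplification[action] = len(reachable) / max(len(own), 1)
    return amplification


# Example: two Actions bundled in one GPT amplify each other's exposure.
gpts = {"gpt-1": ["weather", "mailer"]}
data = {"weather": {"location"}, "mailer": {"email", "contacts"}}
print(exposure_amplification(gpts, data))  # weather reaches 3 data types instead of 1
```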
Privacy Policy Compliance
The paper also scrutinizes the transparency of data collection practices in the privacy policies of Actions. Despite OpenAI's requirement for Actions to publish privacy policies, there is a striking inconsistency in data disclosure:
- Omitted Disclosures: Disclosures for most collected data types are omitted from privacy policies; only 5.8% of Actions clearly disclose their data collection practices.
- Inconsistent Disclosures: Although nearly half of the Actions clearly disclosed more than half of the data they collected, significant inconsistencies still exist, including vague, ambiguous, and incorrect disclosures.
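The disclosure analysis can likewise be approximated with a toy consistency check that labels each collected data type as clearly disclosed, vaguely disclosed, or omitted, mirroring the categories above. The real framework operates over natural-language privacy policies and is far more nuanced; the sets passed in below are assumed to be pre-extracted.

```python
# Toy consistency check between collected data types and policy disclosures.
# The labels loosely follow the paper's framing; the matching logic is a
# deliberately simple stand-in for the actual analysis framework.
def disclosure_status(collected, disclosed_exact, disclosed_vague):
    """collected: data types an Action collects;
    disclosed_exact: data types its policy names explicitly;
    disclosed_vague: data types covered only by broad terms (e.g. "personal data")."""
    status = {}
    for data_type in collected:
        if data_type in disclosed_exact:
            status[data_type] = "clear"
        elif data_type in disclosed_vague:
            status[data_type] = "vague"
        else:
            status[data_type] = "omitted"
    return status
```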
Implications and Future Directions
The paper underscores the pressing need for improvements in the privacy and security landscape of LLM-based application ecosystems:
- Enhanced Reviews: OpenAI should implement stringent review processes for GPTs and Actions to ensure that data practices match published privacy policies and comply with platform rules prohibiting the collection of sensitive information.
- Execution Isolation: To mitigate indirect data exposure risks, platforms like OpenAI should develop secure execution environments for Actions, akin to isolating third-party code in web browsers (a conceptual sketch follows this list).
- Transparent Disclosures: There is a need for tools and frameworks to help GPT developers provide clear and accurate privacy disclosures. LLMs themselves could be utilized to assist in drafting and verifying privacy policies.
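For the execution-isolation recommendation, one way to picture the mitigation is a mediator that forwards to each Action only the fields its schema declares, so data gathered for one Action never implicitly flows to another. The class below is a conceptual sketch of that idea; it is not an OpenAI API and not a design proposed in the paper.

```python
# Conceptual sketch of per-Action data scoping (not an OpenAI API).
class IsolatedActionGateway:
    """Forwards only the fields an Action's schema declares, so data gathered
    for one Action is never implicitly shared with another."""

    def __init__(self, declared_fields):
        # declared_fields: {action_name: set of field names it may receive}
        self.declared_fields = declared_fields

    def build_request(self, action_name, conversation_data):
        allowed = self.declared_fields.get(action_name, set())
        return {key: value for key, value in conversation_data.items() if key in allowed}
```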
The rapid maturation of the LLM app ecosystem, as evidenced by OpenAI's GPT platform, presents unique opportunities and challenges. To build a secure and trustworthy environment, it is imperative to prioritize privacy and security considerations right from the design phase. This research serves as a crucial step in highlighting the current gaps and proposing pathways for more robust policy enforcement and technological safeguards in LLM applications.