Auditing Prompt Caching in Language Model APIs (2502.07776v1)

Published 11 Feb 2025 in cs.CL, cs.CR, and cs.LG

Abstract: Prompt caching in LLMs results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users' prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users' prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.

Summary

  • The paper demonstrates that global prompt caching in LLM APIs introduces timing variations that can reveal sensitive user data.
  • It employs statistical hypothesis testing to differentiate cache hit and miss response times, confirming shared caching behaviors in providers like OpenAI.
  • The study highlights mitigations such as per-user caching and response obfuscation to protect data privacy and safeguard proprietary model details.

Auditing Prompt Caching in LLM APIs

The paper "Auditing Prompt Caching in LLM APIs" presents a comprehensive examination of the potential privacy issues associated with prompt caching in LLM APIs. The authors address how prompt caching can inadvertently lead to timing variations that might be exploited in side-channel attacks, potentially disclosing sensitive user information.

Key Findings

Prompt caching speeds up LLM inference by reusing computation for previously seen prompts or prompt prefixes, so cached prompts return their first token noticeably faster. This timing difference becomes exploitable when the cache is shared across users: by observing how quickly the API responds to a candidate prompt, an attacker can infer whether another user recently submitted a prompt with the same content or prefix. A minimal timing probe of this kind is sketched below.
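To make the attack surface concrete, the following sketch measures time-to-first-token (TTFT) with a streaming request and compares a candidate prompt against a fresh-prompt baseline. It is a minimal illustration, not code from the paper: the endpoint URL, credentials, model name, and payload fields are placeholders for whichever OpenAI-compatible API is being probed.

```python
import time

import requests

# Placeholder endpoint and credentials -- substitute the provider actually being probed.
API_URL = "https://api.example.com/v1/chat/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
MODEL = "example-model"


def time_to_first_token(prompt: str) -> float:
    """Return seconds from sending a streaming request to receiving the first chunk."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": 1,
    }
    start = time.monotonic()
    with requests.post(API_URL, headers=HEADERS, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for _ in resp.iter_content(chunk_size=None):
            return time.monotonic() - start  # first streamed bytes arrive => TTFT
    return float("inf")  # no data received


# A probe whose TTFT is much lower than the fresh-prompt baseline suggests the candidate
# prefix is already cached, i.e., that some user recently sent a prompt starting this way.
baseline = time_to_first_token("unrelated filler text unlikely to be cached 93021")
probe = time_to_first_token("Candidate victim prompt prefix to test:")
print(f"baseline TTFT = {baseline:.3f}s, probe TTFT = {probe:.3f}s")
```

A single pair of measurements like this is dominated by network jitter; the paper's audits instead collect many samples and apply statistical tests, as described below.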

The authors' statistical audits of real-world APIs detected global cache sharing across users in seven providers, including OpenAI, raising significant privacy concerns. The same timing variations also leak architectural information: the audits provide evidence that OpenAI's embedding model is a decoder-only Transformer, a detail that had not previously been disclosed publicly.

Methodological Approach

The audit is framed as statistical hypothesis testing. The authors measure time-to-first-token (TTFT) for prompts expected to miss the cache and for repeated prompts expected to hit it, then apply a statistical test to decide whether the two distributions of response times differ. Significant p-values indicate caching and, when the cache-filling and probing requests come from different users, shared caches and the associated privacy risks. A minimal sketch of this decision step follows.
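Below is a minimal sketch of the decision step, assuming TTFT samples for cache misses and suspected cache hits have already been collected (for example, with a probe like the one above). The one-sided Mann-Whitney U test and the numbers shown are illustrative choices, not the paper's exact statistic or data.

```python
from scipy.stats import mannwhitneyu

# TTFT samples in seconds (illustrative values, not data from the paper).
# "miss": first-ever requests with fresh prompts; "hit": immediate re-requests of the same prompts.
ttft_miss = [0.412, 0.398, 0.441, 0.405, 0.420, 0.433, 0.417, 0.409]
ttft_hit = [0.221, 0.203, 0.240, 0.215, 0.228, 0.219, 0.231, 0.210]

# One-sided nonparametric test: are cache-hit TTFTs stochastically smaller than cache-miss TTFTs?
stat, p_value = mannwhitneyu(ttft_hit, ttft_miss, alternative="less")
print(f"U = {stat:.1f}, p = {p_value:.2e}")

ALPHA = 0.05  # per-test significance level; a full audit would correct for multiple comparisons
if p_value < ALPHA:
    print("Evidence of prompt caching: repeated prompts return their first token significantly faster.")
else:
    print("No statistically significant timing difference detected.")
```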

Implications and Mitigations

These findings carry both practical and theoretical implications for deploying and using LLM APIs. Practically, globally shared prompt caches pose real privacy risks that providers must address to protect users' data. The authors advocate transparent disclosure of cache-sharing policies so that users can make informed decisions about sending sensitive prompts.

From a theoretical perspective, the recovery of architectural information from caching behavior invites closer scrutiny of how deployment choices can leak proprietary model details. The authors propose several mitigations. The primary one is restricting cache sharing to the per-user level, which prevents an attacker's requests from hitting cache entries created by other users. Additionally, providers can introduce deliberate response delays or other obfuscation to mask timing differences, at the cost of giving up some of the latency benefit that caching provides. Both ideas are sketched below.
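A rough sketch of how a provider might implement these two mitigations is given below. The function names, the SHA-256 cache key, and the 0.5-second padding floor are assumptions for illustration, not mechanisms described in the paper.

```python
import asyncio
import hashlib
import time

# Mitigation 1: scope cache entries to a single user, so one user's cache hits
# can never reveal anything about another user's prompts.
def cache_key(user_id: str, prompt_prefix: str) -> str:
    return hashlib.sha256(f"{user_id}:{prompt_prefix}".encode()).hexdigest()


# Mitigation 2: pad time-to-first-token up to a fixed floor so cache hits and
# misses are indistinguishable from the client's point of view.
TTFT_FLOOR_SECONDS = 0.5  # assumed value; must exceed the typical cache-miss TTFT


async def first_token_with_padding(generate_first_token):
    """`generate_first_token` stands in for the provider's internal inference call."""
    start = time.monotonic()
    token = await generate_first_token()
    elapsed = time.monotonic() - start
    if elapsed < TTFT_FLOOR_SECONDS:
        await asyncio.sleep(TTFT_FLOOR_SECONDS - elapsed)  # hide the cache-hit speedup
    return token
```

Per-user caching preserves the latency benefit for repeated prompts from the same user, while TTFT padding removes the timing signal entirely but forfeits the user-visible speedup.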

Future Directions

The research offers a foundation for future audits of machine learning systems as the LLM landscape rapidly evolves. Future work may develop more fine-grained or stealthier prompt-extraction attacks, or extend the audit framework to new model architectures and caching practices. The work sets a precedent for treating user trust and the security of LLM services as first-class concerns in an increasingly interconnected ecosystem.

In conclusion, the paper underscores the need for vigilance when deploying machine learning systems: efficiency optimizations such as prompt caching must be paired with safeguards against unintended data exposure.
