The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG) (2402.16893v1)

Published 23 Feb 2024 in cs.CR, cs.AI, and cs.CL

Abstract: Retrieval-augmented generation (RAG) is a powerful technique to facilitate LLM with proprietary and private data, where data privacy is a pivotal concern. Whereas extensive research has demonstrated the privacy risks of LLMs, the RAG technique could potentially reshape the inherent behaviors of LLM generation, posing new privacy issues that are currently under-explored. In this work, we conduct extensive empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems on leaking the private retrieval database. Despite the new risk brought by RAG on the retrieval data, we further reveal that RAG can mitigate the leakage of the LLMs' training data. Overall, we provide new insights in this paper for privacy protection of retrieval-augmented LLMs, which benefit both LLMs and RAG systems builders. Our code is available at https://github.com/phycholosogy/RAG-privacy.

PDF HTML Abstract

Summarize Bookmark Chat (Pro)

References (33)

Authors (11)

Shenglai Zeng (19 papers)
Jiankun Zhang (10 papers)
Pengfei He (36 papers)
Yue Xing (47 papers)
Yiding Liu (30 papers)
Han Xu (92 papers)
Jie Ren (329 papers)
Shuaiqiang Wang (68 papers)
Dawei Yin (165 papers)
Yi Chang (150 papers)
Jiliang Tang (204 papers)

Citations (31)

View on Semantic Scholar

Tweets

https://twitter.com/_reachsumit/status/1763059013393285153

https://twitter.com/snowzeng2/status/1763241930404831694

https://twitter.com/dse_msu/status/1763280678324769114

https://twitter.com/snowzeng2/status/1763263646984278044

The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG) (2402.16893v1)

Related Papers

Tweets