
Certifiably Robust RAG against Retrieval Corruption (2405.15556v1)

Published 24 May 2024 in cs.LG, cs.CL, and cs.CR

Abstract: Retrieval-augmented generation (RAG) has been shown vulnerable to retrieval corruption attacks: an attacker can inject malicious passages into retrieval results to induce inaccurate responses. In this paper, we propose RobustRAG as the first defense framework against retrieval corruption attacks. The key insight of RobustRAG is an isolate-then-aggregate strategy: we get LLM responses from each passage in isolation and then securely aggregate these isolated responses. To instantiate RobustRAG, we design keyword-based and decoding-based algorithms for securely aggregating unstructured text responses. Notably, RobustRAG can achieve certifiable robustness: we can formally prove and certify that, for certain queries, RobustRAG can always return accurate responses, even when the attacker has full knowledge of our defense and can arbitrarily inject a small number of malicious passages. We evaluate RobustRAG on open-domain QA and long-form text generation datasets and demonstrate its effectiveness and generalizability across various tasks and datasets.


Summary

  • The paper proposes RobustRAG, a framework that isolates and aggregates responses to ensure certifiable robustness against retrieval corruption attacks.
  • It employs keyword and decoding aggregation techniques to maintain high accuracy, achieving up to 71.0% certifiable accuracy on benchmarks like RealtimeQA.
  • The framework incurs minimal performance trade-offs, keeping clean-accuracy losses below 11% while significantly lowering attack success rates.

Certifiably Robust RAG against Retrieval Corruption

The paper addresses the vulnerability of Retrieval-Augmented Generation (RAG) systems to retrieval corruption attacks, in which attackers inject malicious passages into the retrieval results to induce inaccurate responses. To counter this threat, the authors propose RobustRAG, a defense framework that provides certifiable robustness against such attacks.

Background and Motivation

LLMs like GPT-3.5 and Mistral-7B are limited by parametric knowledge that is incomplete and often outdated. RAG improves response accuracy by incorporating relevant external knowledge retrieved from large knowledge bases, and it powers several popular AI applications such as Microsoft Bing Chat and Perplexity AI. However, RAG's reliance on external sources makes it susceptible to retrieval corruption, where attackers insert malicious passages to manipulate the system's outputs.

RobustRAG Framework

RobustRAG utilizes an isolate-then-aggregate strategy to mitigate the impact of malicious passages:

  1. Isolation: Responses are generated from each retrieved passage in isolation to prevent an attacker-influenced passage from contaminating others.
  2. Aggregation: The isolated responses are then securely aggregated to produce the final output (a minimal sketch of this control flow follows this list).
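
The control flow below is a minimal sketch of the isolate-then-aggregate strategy, not the authors' implementation; `llm_answer` and `secure_aggregate` are hypothetical placeholders for an LLM call and one of the aggregation rules described next.

```python
def robust_rag_answer(query, retrieved_passages, llm_answer, secure_aggregate):
    """Illustrative isolate-then-aggregate pipeline (hypothetical interfaces)."""
    # Isolation: query the LLM once per passage, so a malicious passage can
    # only influence the single response computed from it.
    isolated_responses = [
        llm_answer(query=query, context=[passage])
        for passage in retrieved_passages
    ]
    # Aggregation: combine the isolated responses with a robust rule
    # (e.g., keyword or decoding aggregation) rather than trusting any
    # single response.
    return secure_aggregate(query, isolated_responses)
```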

Two techniques for secure response aggregation are proposed:

  • Keyword Aggregation: Extracts keywords from each isolated response, counts how many responses support each keyword, and keeps only sufficiently frequent keywords, from which the LLM composes the final answer (see the sketch after this list).
  • Decoding Aggregation: Aggregates the next-token probability vectors predicted from each isolated passage at every decoding step, emitting a token only when the aggregated prediction is sufficiently consistent and confident across passages.
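
A hedged sketch of the keyword-style aggregation is shown below; `extract_keywords`, `llm_compose`, and the `min_count` threshold are illustrative assumptions, and the paper's exact extraction and thresholding rules may differ.

```python
from collections import Counter

def keyword_aggregate(isolated_responses, extract_keywords, llm_compose, min_count=2):
    """Illustrative keyword aggregation over isolated responses."""
    counts = Counter()
    for response in isolated_responses:
        # Count each keyword at most once per isolated response, so a single
        # corrupted passage contributes at most +1 to any keyword's count.
        counts.update(set(extract_keywords(response)))
    # Keep only keywords supported by at least `min_count` isolated responses.
    kept = sorted(kw for kw, c in counts.items() if c >= min_count)
    # Ask the LLM to compose the final answer from the surviving keywords.
    return llm_compose(kept)
```

Because each passage can raise any keyword's count by at most one, a few injected passages can only nudge the counts by a bounded amount; this bounded influence is the kind of property the certification analysis in the next section builds on.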

Theoretical Foundations

The authors formalize certifiable robustness: under an attack model that allows up to $k'$ malicious passages among the top $k$ retrieved, RobustRAG's responses are guaranteed to meet a given accuracy threshold $\tau$ for certified queries. The certification builds on the aggregation techniques above, showing that RobustRAG's architecture inherently bounds the influence that a small number of corrupted passages can exert on the aggregated output.
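
Stated as a single condition, the robustness goal can be paraphrased as follows; the notation here (in particular $\mathcal{A}_{k'}$ for the set of admissible corruptions) is chosen for illustration rather than taken verbatim from the paper:

```latex
% Paraphrased certifiable-robustness goal (illustrative notation).
% P_k        : top-k retrieved passages for query q
% A_{k'}(P_k): all corruptions of P_k containing at most k' injected passages
% metric     : task accuracy metric; tau: certified accuracy threshold
\forall\, \mathcal{P}' \in \mathcal{A}_{k'}(\mathcal{P}_k):\quad
\mathrm{metric}\bigl(\mathrm{RobustRAG}(q,\ \mathcal{P}'),\ \text{gold answer}\bigr) \;\ge\; \tau
```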

Experimental Evaluation

The experimental setup encompasses various datasets including RealtimeQA, Natural Questions, and Biography Generation, and evaluates the system across different LLMs such as Mistral-7B-Instruct and Llama2-7B-Chat.

Key Findings:

  • Certifiable Robustness: RobustRAG consistently achieves notable certifiable accuracy across different datasets and models. For example, on the RealtimeQA-MC dataset, certifiable accuracy reaches up to 71.0%.
  • Clean Performance: Despite enhancing robustness, RobustRAG preserves most of its clean performance, with clean-accuracy drops below 11% in most cases, demonstrating minimal performance trade-offs.
  • Empirical Robustness: Against prompt injection and data poisoning attacks, RobustRAG shows substantial resilience, reducing attack success rates to below 10% in almost all cases.

Practical and Theoretical Implications

Practically, RobustRAG assures users and developers of AI applications that their systems can maintain high reliability even in the presence of sophisticated retrieval corruption attacks. Theoretically, the framework extends robustness analysis and certification to complex generative tasks, showcasing a method applicable beyond simple classification tasks.

Future Developments and Considerations

While RobustRAG provides a robust framework against retrieval corruption, several aspects warrant further investigation:

  • Retrieval Step Hardening: Strengthening the retrieval process itself to prevent the inclusion of malicious passages.
  • Multi-hop Queries: Extending RobustRAG to handle complex multi-hop queries effectively.
  • Minimizing Performance Trade-offs: Further reducing the clean performance drop to encourage broader adoption.
  • Integration with Advanced RAG Techniques: Combining RobustRAG with advanced RAG techniques such as self-critique and fine-tuning to further enhance its robustness and accuracy.

Conclusions

RobustRAG marks a significant advancement in robust AI, particularly for retrieval-augmented language systems. The isolate-then-aggregate strategy, combined with rigorous robustness certification, ensures that systems can withstand targeted attacks aimed at corrupting retrieved information. This work not only demonstrates the feasibility of building robust retrieval-augmented generative systems but also lays a foundation for future work on robust, certifiably secure AI applications.
