Injecting Undetectable Backdoors in Obfuscated Neural Networks and Language Models (2406.05660v2)

Published 9 Jun 2024 in cs.LG, cs.CR, and stat.ML

Abstract: As ML models become increasingly complex and integral to high-stakes domains such as finance and healthcare, they also become more susceptible to sophisticated adversarial attacks. We investigate the threat posed by undetectable backdoors, as defined in Goldwasser et al. (FOCS '22), in models developed by insidious external expert firms. When such backdoors exist, they allow the designer of the model to sell information on how to slightly perturb their input to change the outcome of the model. We develop a general strategy to plant backdoors to obfuscated neural networks, that satisfy the security properties of the celebrated notion of indistinguishability obfuscation. Applying obfuscation before releasing neural networks is a strategy that is well motivated to protect sensitive information of the external expert firm. Our method to plant backdoors ensures that even if the weights and architecture of the obfuscated model are accessible, the existence of the backdoor is still undetectable. Finally, we introduce the notion of undetectable backdoors to LLMs and extend our neural network backdoor attacks to such models based on the existence of steganographic functions.

Summary

The paper presents a formal framework that leverages cryptographic techniques—PRGs, indistinguishability obfuscation, and digital signatures—to inject undetectable, non-replicable backdoors.
It details a novel pipeline that transforms networks into Boolean circuits, applies obfuscation, and converts them back, ensuring the backdoor remains hidden even with full model access.
The methodology extends to language models by embedding triggers through steganographic functions, underlining severe security risks in high-stakes AI applications.

Injecting Undetectable Backdoors in Deep Learning and LLMs

The paper explores a critical vulnerability in the domain of ML and AI: the injection of undetectable backdoors into deep learning models and LLMs (LMs). This research is motivated by the increasing complexity and reliance on ML models in high-stakes applications such as finance and healthcare, where such backdoors could lead to significant adversarial threats.

Key Contributions

Definition and Framework: The paper introduces a formal framework for understanding backdoor attacks, specifically addressing scenarios where the adversary possesses white-box access to the ML model. Two core properties of robust backdoor attacks are defined:
- Undetectability: The backdoor remains undetectable even when the model's architecture and parameters are fully accessible.
- Non-replicability: It is computationally infeasible for a third party to replicate the backdoor's effect after observing a few examples.
Backdoor Injection Methodology: The primary technical innovation lies in combining cryptographic techniques to inject these backdoors. The core methods involved include:
- Pseudo-random Generators (PRGs): To ensure that model perturbations appear random.
- Indistinguishability Obfuscation (iO): To make it computationally infeasible to distinguish between a backdoored model and an original model.
- Digital Signatures: To make the backdoor non-replicable.
Pipeline for Backdoor Injection: The proposed pipeline consists of transforming the neural network to a Boolean circuit, applying iO to the circuit, and converting it back. This approach ensures that even with white-box access, it is computationally infeasible to detect the backdoor due to the obfuscation.
Backdoors in LLMs: The paper extends the methodology to LLMs, which operate in discrete domains, a more complex setting for backdoor injection. Utilizing steganographic functions, the researchers developed a mechanism to embed backdoors within LM prompts, facilitating activation through concealed triggers.

Theoretical and Practical Implications

The implications of these findings are broad and impactful for both the theoretical understanding and practical application of ML systems:

Security Risks: The ability to inject undetectable and non-replicable backdoors poses a significant threat to the integrity and trustworthiness of ML models. This vulnerability could lead to unauthorized manipulations or breaches, especially in critical applications where the consequences of such manipulations could be severe.
Cryptographic Intersection: This work highlights the intersection of ML security and cryptographic techniques, showcasing how advancements in cryptography can expose and potentially mitigate vulnerabilities in ML models.
Defense Mechanisms: While the paper primarily focuses on the attack methodology, it underscores the necessity for developing robust defense mechanisms. Potential defenses could involve stringent input validation, noise addition, and anomaly detection, though these defenses must be robust against sophisticated methods that might bypass them.

Speculations on Future Developments

Given the powerful implications of undetectable backdoors, future developments in AI and ML security are likely to focus on:

Enhanced Detection Algorithms: Research may pivot towards developing algorithms capable of identifying subtle, cryptographically obfuscated backdoors. Existing detection mechanisms will need substantial enhancements backed by theoretically rigorous guarantees.
Secure Model Training Protocols: Establishing protocols that ensure the integrity of models post-training will be paramount, especially in collaborative or outsourced ML training scenarios. Such protocols might include verified computation and secure enclaves.
Proactive Security Measures: Advances in proactive measures, such as adversarial training specifically designed to harden models against such backdoor injections, will be essential.
Ethical and Regulatory Frameworks: As the awareness of such vulnerabilities grows, regulatory frameworks will likely evolve to mandate security standards and regular audits for ML systems, particularly in sectors handling sensitive data and operations.

Conclusion

The paper "Injecting Undetectable Backdoors in Deep Learning and LLMs" presents a formidable challenge in the landscape of ML security. The innovative use of cryptographic tools to plant undetectable backdoors reveals a profound vulnerability, urging the community to develop robust defensive mechanisms and rethink the security architectures of intelligent systems. While the research exposes significant risks, it also paves the way for future advancements in securing ML models against insidious adversarial threats.

PDF Markdown

Related Papers

Tweets

https://twitter.com/aminkarbasi/status/1802173516856308123

https://twitter.com/Matt_McKibbin/status/1884279627826487423