Preference Poisoning Attacks on Reward Model Learning (2402.01920v2)
Abstract: Learning reward models from pairwise comparisons is a fundamental component in a number of domains, including autonomous control, conversational agents, and recommender systems, as part of the broad goal of aligning automated decisions with user preferences. These approaches entail collecting preference information from people, with feedback often provided anonymously. Since preferences are subjective, there is no gold standard to compare against; yet the reliance of high-impact systems on preference learning creates a strong incentive for malicious actors to skew data collected in this fashion to their own ends. We investigate the nature and extent of this vulnerability by considering an attacker who can flip a small subset of preference comparisons to either promote or demote a target outcome. We propose two classes of algorithmic approaches for these attacks: a gradient-based framework and several variants of rank-by-distance methods. We then evaluate the efficacy of the best attacks in both classes on datasets from three domains: autonomous control, recommender systems, and textual prompt-response preference learning. We find that the best attacks are often highly successful, in the most extreme case achieving a 100% success rate with only 0.3% of the data poisoned. However, *which* attack is best can vary significantly across domains. In addition, we observe that the simpler and more scalable rank-by-distance approaches are often competitive with, and on occasion significantly outperform, gradient-based methods. Finally, we show that state-of-the-art defenses against other classes of poisoning attacks exhibit limited efficacy in our setting.
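The threat model in the abstract can be illustrated with a minimal sketch: a linear Bradley-Terry reward model fit to pairwise comparisons, attacked by flipping the labels of the comparisons whose items lie closest to the attacker's target outcome. The function names, the linear reward parameterization, and the specific distance heuristic below are illustrative assumptions for exposition, not the paper's exact algorithms.

```python
import numpy as np

def train_reward(pairs, labels, dim, lr=0.1, epochs=200):
    """Fit a linear Bradley-Terry reward r(x) = w.x by SGD on the
    logistic loss over pairwise comparisons.
    pairs[i] = (x_a, x_b); labels[i] = 1 if x_a was preferred to x_b."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for (xa, xb), y in zip(pairs, labels):
            p = 1.0 / (1.0 + np.exp(-(w @ xa - w @ xb)))  # P(a preferred to b)
            w += lr * (y - p) * (xa - xb)                  # gradient ascent step
    return w

def rank_by_distance_flip(pairs, labels, target, budget):
    """Hypothetical rank-by-distance poisoning: flip the labels of the
    `budget` comparisons whose closer item is nearest the target outcome,
    so the poisoned data pulls the learned reward toward the target."""
    dists = [min(np.linalg.norm(xa - target), np.linalg.norm(xb - target))
             for xa, xb in pairs]
    flipped = np.array(labels).copy()
    for i in np.argsort(dists)[:budget]:
        flipped[i] = 1 - flipped[i]
    return flipped
```

On synthetic comparisons where annotators consistently disprefer items near a target point, flipping only the few comparisons nearest the target raises the learned reward of the target relative to training on clean data, mirroring the promotion attacks evaluated in the paper.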