Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity (2405.13967v5)
Abstract: Recent alignment algorithms such as direct preference optimization (DPO) have been developed to improve the safety of LLMs by training these models to match human behaviors exemplified by preference data. However, these methods are computationally intensive and lack controllability and transparency, inhibiting their widespread adoption. Furthermore, these tuning-based methods require large-scale preference data for training and are susceptible to noise in that data. In this paper, we introduce a tuning-free alignment alternative, ProFS (Projection Filter for Subspaces), and demonstrate its effectiveness on the use case of toxicity reduction. Grounded in theory from factor analysis, ProFS is a sample-efficient model editing approach that identifies a toxic subspace in the model parameter space and reduces model toxicity by projecting away the detected subspace. The toxic subspace is identified by extracting preference-data embeddings from the LLM and removing non-toxic information from these embeddings. We show that ProFS is more sample-efficient than DPO and more robust to noisy preference data. Finally, we attempt to connect tuning-based alignment with editing by establishing both theoretical and empirical connections between ProFS and DPO, showing that ProFS can be interpreted as a denoised version of a single DPO step.
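The abstract describes the ProFS procedure only in words: extract paired preference embeddings, strip shared non-toxic information, treat the leading remaining singular directions as a toxic subspace, and project a weight matrix away from that subspace. The minimal PyTorch sketch below illustrates that general idea under stated assumptions; the function names, tensor layouts, choice of which weights to edit, and the ranks `k_shared` and `k_toxic` are illustrative guesses, not the paper's exact recipe.

```python
import torch

def find_toxic_subspace(nontoxic_emb, toxic_emb, k_shared=1, k_toxic=2):
    """Estimate a toxic subspace from paired preference embeddings.

    nontoxic_emb, toxic_emb: (n, d) hidden states of preferred / rejected
    continuations extracted from one layer of the LLM. The ranks k_shared
    and k_toxic are illustrative hyperparameters, not values from the paper.
    """
    # Paired differences emphasize what distinguishes toxic from non-toxic text.
    diff = toxic_emb - nontoxic_emb                      # (n, d)
    # Centering removes a dominant shared (corpus-level) component.
    diff = diff - diff.mean(dim=0, keepdim=True)
    # Factor the difference matrix; after discarding any remaining shared
    # directions, the leading right-singular vectors are candidate toxic directions.
    _, _, Vh = torch.linalg.svd(diff, full_matrices=False)
    toxic_dirs = Vh[k_shared:k_shared + k_toxic]         # (k_toxic, d), orthonormal rows
    return toxic_dirs

def project_away(W, toxic_dirs):
    """Edit a weight matrix W of shape (d, d_ff) by removing its component
    in the toxic subspace (assumes the d-dimensional side lives in the
    same space as the embeddings)."""
    d = W.shape[0]
    P = toxic_dirs.T @ toxic_dirs                        # projector onto the toxic subspace
    return (torch.eye(d) - P) @ W                        # projection-filtered weights

# Toy usage with random tensors standing in for real embeddings and weights.
if __name__ == "__main__":
    n, d, d_ff = 64, 768, 3072
    nontoxic, toxic, W = torch.randn(n, d), torch.randn(n, d), torch.randn(d, d_ff)
    W_edited = project_away(W, find_toxic_subspace(nontoxic, toxic))
    print(W_edited.shape)  # torch.Size([768, 3072])
```

Because the edit is a single projection computed from a modest number of preference pairs, this kind of approach needs no gradient updates, which is what the abstract means by "tuning-free" and "sample-efficient."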