Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs (2404.00486v1)

Published 30 Mar 2024 in cs.CL and cs.AI

Abstract: With the rise of LLMs, ensuring they embody the principles of being helpful, honest, and harmless (3H), known as Human Alignment, becomes crucial. While existing alignment methods like RLHF, DPO, etc., effectively fine-tune LLMs to match the preferences in the preference dataset, they often lead LLMs to be highly receptive to human input and external evidence, even when this information is poisoned. This makes LLMs behave as Adaptive Chameleons when external evidence conflicts with their parametric memory, which exacerbates the risk of LLMs being attacked through externally poisoned data and poses a significant security risk to LLM applications such as retrieval-augmented generation (RAG). To address this challenge, we propose a novel framework, Dialectical Alignment (DA), which (1) uses AI feedback to identify optimal strategies for LLMs to navigate inter-context and context-memory conflicts under different external evidence in the context window (i.e., different ratios of poisoned to factual contexts); (2) constructs an SFT dataset as well as a preference dataset based on the AI feedback and strategies above; and (3) uses these datasets for LLM alignment to defend against poisoned-context attacks while preserving the effectiveness of in-context knowledge editing. Our experiments show that the Dialectical Alignment model improves poisoned-data attack defense by 20 and requires no additional prompt engineering or prior declaration of "you may be attacked" in the LLM's context window.
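
To make the framework's first two steps more concrete, the sketch below illustrates one way such a data-construction loop could look: factual and poisoned passages are mixed at several poison ratios, a feedback model scores candidate response strategies for each mixed context, and the best and worst strategies become chosen/rejected pairs for later DPO-style alignment. This is a minimal illustration under stated assumptions, not the authors' implementation; the strategy set and the functions `build_context`, `respond`, and `ai_feedback_score` are hypothetical placeholders.

```python
import random
from dataclasses import dataclass

# Candidate strategies an LLM could follow when external evidence conflicts
# with its parametric memory (illustrative set, not the paper's exact list).
STRATEGIES = ["trust_context", "trust_memory", "weigh_and_reconcile"]

@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response following the strategy the feedback model preferred
    rejected: str    # response following the dispreferred strategy

def build_context(factual, poisoned, poison_ratio, k=4):
    """Sample k passages with roughly `poison_ratio` of them poisoned."""
    n_poison = round(k * poison_ratio)
    docs = random.sample(poisoned, n_poison) + random.sample(factual, k - n_poison)
    random.shuffle(docs)
    return "\n".join(docs)

def respond(question, context, strategy):
    """Placeholder: a real pipeline would prompt the base LLM here."""
    return f"[{strategy}] answer to '{question}' given context: {context[:40]}..."

def ai_feedback_score(question, context, response):
    """Placeholder reward: a real pipeline would query a feedback LLM here."""
    return random.random()

def make_preference_data(questions, factual, poisoned, ratios=(0.0, 0.25, 0.5, 0.75)):
    """Build chosen/rejected pairs across different poisoned-to-factual ratios."""
    pairs = []
    for q in questions:
        for r in ratios:
            ctx = build_context(factual, poisoned, r)
            prompt = f"Context:\n{ctx}\n\nQuestion: {q}"
            scored = sorted(
                ((ai_feedback_score(q, ctx, respond(q, ctx, s)), s) for s in STRATEGIES),
                reverse=True,
            )
            best, worst = scored[0][1], scored[-1][1]
            pairs.append(PreferencePair(prompt, respond(q, ctx, best), respond(q, ctx, worst)))
    return pairs

if __name__ == "__main__":
    factual = [f"Fact {i}: Paris is the capital of France." for i in range(8)]
    poisoned = [f"Claim {i}: Lyon is the capital of France." for i in range(8)]
    data = make_preference_data(["What is the capital of France?"], factual, poisoned)
    print(len(data), "preference pairs built for DPO-style alignment")
```

The placeholder functions are stubbed so the script runs end-to-end; in an actual pipeline they would call the base LLM and a feedback LLM, and the resulting pairs would feed an SFT stage followed by preference optimization.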

Authors (8)
  1. Shu Yang (178 papers)
  2. Jiayuan Su (10 papers)
  3. Han Jiang (24 papers)
  4. Mengdi Li (19 papers)
  5. Keyuan Cheng (9 papers)
  6. Muhammad Asif Ali (18 papers)
  7. Lijie Hu (50 papers)
  8. Di Wang (407 papers)
Citations (4)