Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment (2402.19085v3)

Published 29 Feb 2024 in cs.AI, cs.CL, cs.SY, and eess.SY

Abstract: Alignment in artificial intelligence pursues consistency between model responses and human preferences and values. In practice, the multifaceted nature of human preferences inadvertently introduces what is known as the "alignment tax": a compromise in which gains in alignment on one objective (e.g., harmlessness) can diminish performance on others (e.g., helpfulness). However, existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. To navigate this challenge, we argue for grounding LLMs in explicit preferences. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet those requirements. Our experimental analysis reveals that the aligned models can produce responses that match a range of preferences across the "3H" (helpfulness, honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, thereby mitigating the alignment tax and achieving improvements in multi-objective alignment.
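
The abstract does not spell out the implementation, but the core idea of conditioning generation on explicit per-objective preference scores can be illustrated with a minimal sketch. The sketch below assumes a hypothetical control-token format (e.g., `<Helpfulness: 2>`) prepended to the user prompt, a 0-2 score scale, and a generic Hugging Face causal LM; the token format, score scale, and model name are illustrative assumptions, not the paper's exact specification, and the CPO training procedure itself is not shown.

```python
# Minimal, illustrative sketch of preference-conditioned generation.
# The control-token format, score scale (0-2), and model choice are
# assumptions for demonstration; they are not taken from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # any causal LM could stand in here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


def build_controlled_prompt(question: str, helpfulness: int, honesty: int, harmlessness: int) -> str:
    """Prepend explicit preference scores for the 3H objectives to the prompt.

    Higher scores ask the model to prioritize that objective more strongly.
    """
    control = f"<Helpfulness: {helpfulness}> <Honesty: {honesty}> <Harmlessness: {harmlessness}>"
    return f"{control}\n{question}"


def generate(question: str, helpfulness: int = 2, honesty: int = 2, harmlessness: int = 2) -> str:
    prompt = build_controlled_prompt(question, helpfulness, honesty, harmlessness)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Strip the prompt tokens so only the newly generated response is returned.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


if __name__ == "__main__":
    # Request maximal harmlessness while tolerating lower helpfulness.
    print(generate("How do I pick a strong password?", helpfulness=1, honesty=2, harmlessness=2))
```

In this sketch, changing the scores in the control prefix is the only lever: a model fine-tuned to respect such prefixes would trade the 3H objectives off against one another at inference time without retraining, which is the kind of controllability the abstract describes.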

Authors (12)
  1. Yiju Guo (4 papers)
  2. Ganqu Cui (39 papers)
  3. Lifan Yuan (22 papers)
  4. Ning Ding (122 papers)
  5. Huimin Chen (15 papers)
  6. Bowen Sun (18 papers)
  7. Ruobing Xie (97 papers)
  8. Jie Zhou (687 papers)
  9. Yankai Lin (125 papers)
  10. Zhiyuan Liu (433 papers)
  11. Maosong Sun (337 papers)
  12. Zexu Sun (15 papers)
Citations (32)
