MAP: Multi-Human-Value Alignment Palette (2410.19198v1)

Published 24 Oct 2024 in cs.AI, cs.CY, cs.ET, cs.HC, and cs.LG

Abstract: Ensuring that generative AI systems align with human values is essential but challenging, especially when considering multiple human values and their potential trade-offs. Since human values can be personalized and dynamically change over time, the desirable levels of value alignment vary across different ethnic groups, industry sectors, and user cohorts. Within existing frameworks, it is hard to define human values and align AI systems accordingly across different directions simultaneously, such as harmlessness, helpfulness, and positiveness. To address this, we develop a novel, first-principle approach called Multi-Human-Value Alignment Palette (MAP), which navigates the alignment across multiple human values in a structured and reliable way. MAP formulates the alignment problem as an optimization task with user-defined constraints, which define human value targets. It can be efficiently solved via a primal-dual approach, which determines whether a user-defined alignment target is achievable and how to achieve it. We conduct a detailed theoretical analysis of MAP by quantifying the trade-offs between values, the sensitivity to constraints, the fundamental connection between multi-value alignment and sequential alignment, and proving that linear weighted rewards are sufficient for multi-value alignment. Extensive experiments demonstrate MAP's ability to align multiple values in a principled manner while delivering strong empirical performance across various tasks.
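
To make the constrained formulation more concrete, the sketch below shows one way a primal-dual (dual-ascent) solver for multi-value alignment could look. It is not the authors' implementation: it assumes the aligned policy is an exponential tilt of a reference policy, approximated over a fixed pool of sampled generations, and the reward scores, target levels, and learning rate are illustrative placeholders rather than values from the paper.

```python
import numpy as np

def map_dual_ascent(rewards, targets, lr=0.05, iters=500):
    """Minimal primal-dual sketch in the spirit of MAP (illustrative only).

    Assumption (not from the paper's code): the aligned policy is an
    exponential tilt of the reference policy, pi_lambda(x) ∝ pi_0(x) *
    exp(sum_i lambda_i * r_i(x)), approximated over a fixed sample pool.

    rewards : (n_samples, n_values) array of value scores r_i(x) for
              candidate generations x drawn from the reference policy.
    targets : (n_values,) user-defined alignment levels c_i, interpreted
              as constraints E_{pi_lambda}[r_i] >= c_i.
    Returns the non-negative dual variables lambda, one weight per value.
    """
    n_samples, n_values = rewards.shape
    lam = np.zeros(n_values)
    for _ in range(iters):
        # Self-normalized tilted weights over the sample pool.
        logits = rewards @ lam
        logits -= logits.max()          # numerical stability
        w = np.exp(logits)
        w /= w.sum()
        # Expected value of each reward under the tilted distribution.
        exp_r = w @ rewards
        # Projected dual ascent: raise lambda_i while constraint i is violated.
        lam = np.maximum(0.0, lam + lr * (targets - exp_r))
    return lam

# Toy usage with three stand-in values (e.g. helpfulness, harmlessness, positivity).
rng = np.random.default_rng(0)
scores = rng.normal(size=(2000, 3))     # placeholder for reward-model scores
lam = map_dual_ascent(scores, targets=np.array([0.5, 0.3, 0.2]))
print("per-value weights:", lam)
```

In this sketch the returned dual variables double as per-value weights of a linear combined reward, loosely mirroring the paper's result that linear weighted rewards suffice for multi-value alignment; if the user-defined targets are unreachable, the corresponding weights keep growing instead of converging, which serves as a practical signal that the target palette should be relaxed.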
