On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models (2403.04204v1)

Published 7 Mar 2024 in cs.AI and cs.CL

Abstract: Big models have achieved revolutionary breakthroughs in the field of AI, but they may also pose potential concerns. To address such concerns, alignment technologies were introduced to make these models conform to human preferences and values. Despite considerable advancements in the past year, various challenges remain in establishing the optimal alignment strategy, such as data cost and scalable oversight, and how to align remains an open question. In this survey paper, we comprehensively investigate value alignment approaches. We first unpack the historical context of alignment, tracing it back to the 1920s (where it comes from), then delve into the mathematical essence of alignment (what it is), shedding light on its inherent challenges. On this foundation, we provide a detailed examination of existing alignment methods, which fall into three categories: Reinforcement Learning, Supervised Fine-Tuning, and In-context Learning, and demonstrate their intrinsic connections, strengths, and limitations, helping readers better understand this research area. In addition, two emerging topics, personal alignment and multimodal alignment, are discussed as novel frontiers in this field. Looking forward, we discuss potential alignment paradigms and how they could handle remaining challenges, prospecting where future alignment will go.
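As background for readers, the reinforcement-learning family of alignment methods the survey covers (e.g., RLHF) is commonly framed as KL-regularized reward maximization; the following standard objective is shown for orientation only and is not quoted from the paper:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right] \;-\; \beta\, \mathrm{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
$$

Here $\pi_\theta$ is the model being aligned, $r_\phi$ is a reward model fit to human preference comparisons, $\pi_{\mathrm{ref}}$ is the supervised fine-tuned reference model, and $\beta$ controls how far the aligned policy may drift from that reference.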

Authors (10)
  1. Xinpeng Wang (34 papers)
  2. Shitong Duan (6 papers)
  3. Xiaoyuan Yi (42 papers)
  4. Jing Yao (56 papers)
  5. Shanlin Zhou (5 papers)
  6. Zhihua Wei (34 papers)
  7. Peng Zhang (642 papers)
  8. Dongkuan Xu (43 papers)
  9. Maosong Sun (337 papers)
  10. Xing Xie (220 papers)
Citations (10)
