Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
72 tokens/sec
GPT-4o
61 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
8 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision (2305.03047v2)

Published 4 May 2023 in cs.LG, cs.AI, cs.CL, and cs.CY
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

Abstract: Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of LLMs with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision and the related issues on quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of principles application) to produce helpful, ethical, and reliable responses to user's queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly without the principle set and the demonstrations anymore; and finally, we offer a refinement step to address the issues of overly-brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base LLM, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning). Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.

Principle-Driven Self-Alignment of LLMs

The paper "Principle-Driven Self-Alignment of LLMs from Scratch with Minimal Human Supervision" presents a novel approach to aligning LLMs with human intentions, termed Self-Align. This method emphasizes principle-driven reasoning alongside the inherent generative capabilities of LLMs, significantly minimizing the reliance on extensive human supervision typically associated with techniques like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

Methodology

The Self-Align approach introduces a structured, four-stage process:

  1. Topic-Guided Red-Teaming Self-Instruct: Building on the Self-Instruct methodology, this stage involves the generation of diverse, synthetic instructions with a core focus on broadening the scope through topic-guided ad-hoc instructions.
  2. Principle-Driven Self-Alignment: Central to this method, a set of sixteen human-defined principles guides the LLM in generating dependable responses. This phase includes in-context learning (ICL) with exemplars to showcase responses adhering to principles such as neutrality and comprehensive coverage.
  3. Principle Engraving: The model is fine-tuned on the outputs generated in the preceding stage, engraving the principles into the model's parameters, thus reducing token usage while enhancing alignment performance.
  4. Verbose Cloning: This final stage aims to address issues with brief responses by facilitating a context distillation that yields more elaborate and detailed outputs.

Results and Evaluation

Dromedary, the AI assistant developed using the Self-Align process on the LLaMA-65b model, demonstrates remarkable performance enhancements across various benchmarks. The paper notes that with fewer than 300 lines of human annotations, Dromedary exceeds the capabilities of state-of-the-art systems such as Text-Davinci-003 and Alpaca in several respects.

The evaluation encompasses both quantitative and qualitative measures:

  • Quantitative: On benchmarks like TruthfulQA, Dromedary surpasses several competitive models, achieving enhanced accuracy in multiple-choice tasks and outstripping others in truthful and informative answer generation.
  • Qualitative: The model exhibits improved alignment with principles in handling harmful queries and generating nuanced responses, albeit with acknowledged failure modes such as indirect responses and some hallucinations.

Implications and Future Directions

This paper's contributions are notable as they address the scalability and efficiency challenges in aligning LLMs with human values. The reduction in dependency on expansive human annotations positions Self-Align as a viable alternative in scenarios where access to pre-trained, aligned systems is limited or impractical. Moreover, as the AI landscape evolves, the principle-driven approach offers a foundation for broader stakeholder engagement in defining alignment principles specific to varied ethical and cultural contexts.

The authors suggest exploring reinforcement learning enhancements (akin to Constitutional AI) and deepening human evaluations to further assess real-world applicability. Future research might also investigate utilizing existing datasets more effectively and refining engagement processes with diverse communities to ensure the alignment of AI models leads to concrete, positive outcomes.

In conclusion, this paper extends the discourse on AI alignment by providing an insightful, efficient methodology to align LLMs from scratch, opening new avenues for responsible AI development.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (52)
  1. Anthropic. Claude’s constitution, 2023a. URL https://www.anthropic.com/index/claudes-constitution.
  2. Anthropic. Core views on ai safety: When, why, what, and how, 2023b. URL https://www.anthropic.com/index/core-views-on-ai-safety.
  3. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
  4. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
  5. Constitutional ai: Harmlessness from ai feedback, 2022b.
  6. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023.
  7. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  8. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://vicuna.lmsys.org.
  9. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  10. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
  11. Databricks. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
  12. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  13. Iason Gabriel. Artificial intelligence, values, and alignment. Minds and machines, 30(3):411–437, 2020.
  14. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459, 2023.
  15. Koala: A dialogue model for academic research. Blog post, April 2023. URL https://bair.berkeley.edu/blog/2023/04/03/koala/.
  16. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
  17. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  18. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.
  19. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
  20. Openassistant conversations – democratizing large language model alignment, 2023.
  21. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  22. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
  23. Visual instruction tuning. 2023.
  24. Microsoft. Introducing the new bing, 2023. URL https://www.bing.com/new#features.
  25. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
  26. OpenAI. OpenAI: Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt.
  27. OpenAI. Gpt-4 technical report, 2023a.
  28. OpenAI. OpenAI: GPT-4, 2023b. URL https://openai.com/research/gpt-4.
  29. OpenAI. How do text-davinci-002 and text-davinci-003 differ? https://help.openai.com/en/articles/6779149-how-do-text-davinci-002-and-text-davinci-003-differ, 2023c.
  30. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
  31. Bbq: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193, 2021.
  32. Align-rudder: Learning from few demonstrations by reward redistribution. arXiv preprint arXiv:2009.14108, 2020.
  33. Language models are unsupervised multitask learners. 2019.
  34. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  35. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301, 2018.
  36. Bloom: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  37. On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. arXiv preprint arXiv:2212.08061, 2022.
  38. Process for adapting language models to society (palms) with values-targeted datasets. Advances in Neural Information Processing Systems, 34:5861–5873, 2021.
  39. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
  40. Salmon: Self-alignment with principle-following reward models. arXiv preprint arXiv:2310.05910, 2023a.
  41. Recitation-augmented language models. In International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=-cqvvvb-NkI.
  42. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  43. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  44. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  45. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  46. Attention is all you need. NeurIPS, 2017.
  47. Poisoning language models during instruction tuning, 2023.
  48. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  49. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022.
  50. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196, 2023.
  51. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
  52. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (8)
  1. Zhiqing Sun (35 papers)
  2. Yikang Shen (62 papers)
  3. Qinhong Zhou (6 papers)
  4. Hongxin Zhang (47 papers)
  5. Zhenfang Chen (36 papers)
  6. David Cox (48 papers)
  7. Yiming Yang (151 papers)
  8. Chuang Gan (195 papers)
Citations (268)
Github Logo Streamline Icon: https://streamlinehq.com