Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision (2305.03047v2)

Published 4 May 2023 in cs.LG, cs.AI, cs.CL, and cs.CY

Abstract: Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of LLMs with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision and the related issues on quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of principles application) to produce helpful, ethical, and reliable responses to user's queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly without the principle set and the demonstrations anymore; and finally, we offer a refinement step to address the issues of overly-brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base LLM, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning). Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.

PDF HTML Abstract

Principle-Driven Self-Alignment of LLMs

The paper "Principle-Driven Self-Alignment of LLMs from Scratch with Minimal Human Supervision" presents a novel approach to aligning LLMs with human intentions, termed Self-Align. This method emphasizes principle-driven reasoning alongside the inherent generative capabilities of LLMs, significantly minimizing the reliance on extensive human supervision typically associated with techniques like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

Methodology

The Self-Align approach introduces a structured, four-stage process:

Topic-Guided Red-Teaming Self-Instruct: Building on the Self-Instruct methodology, this stage involves the generation of diverse, synthetic instructions with a core focus on broadening the scope through topic-guided ad-hoc instructions.
Principle-Driven Self-Alignment: Central to this method, a set of sixteen human-defined principles guides the LLM in generating dependable responses. This phase includes in-context learning (ICL) with exemplars to showcase responses adhering to principles such as neutrality and comprehensive coverage.
Principle Engraving: The model is fine-tuned on the outputs generated in the preceding stage, engraving the principles into the model's parameters, thus reducing token usage while enhancing alignment performance.
Verbose Cloning: This final stage aims to address issues with brief responses by facilitating a context distillation that yields more elaborate and detailed outputs.

Results and Evaluation

Dromedary, the AI assistant developed using the Self-Align process on the LLaMA-65b model, demonstrates remarkable performance enhancements across various benchmarks. The paper notes that with fewer than 300 lines of human annotations, Dromedary exceeds the capabilities of state-of-the-art systems such as Text-Davinci-003 and Alpaca in several respects.

The evaluation encompasses both quantitative and qualitative measures:

Quantitative: On benchmarks like TruthfulQA, Dromedary surpasses several competitive models, achieving enhanced accuracy in multiple-choice tasks and outstripping others in truthful and informative answer generation.
Qualitative: The model exhibits improved alignment with principles in handling harmful queries and generating nuanced responses, albeit with acknowledged failure modes such as indirect responses and some hallucinations.

Implications and Future Directions

This paper's contributions are notable as they address the scalability and efficiency challenges in aligning LLMs with human values. The reduction in dependency on expansive human annotations positions Self-Align as a viable alternative in scenarios where access to pre-trained, aligned systems is limited or impractical. Moreover, as the AI landscape evolves, the principle-driven approach offers a foundation for broader stakeholder engagement in defining alignment principles specific to varied ethical and cultural contexts.

The authors suggest exploring reinforcement learning enhancements (akin to Constitutional AI) and deepening human evaluations to further assess real-world applicability. Future research might also investigate utilizing existing datasets more effectively and refining engagement processes with diverse communities to ensure the alignment of AI models leads to concrete, positive outcomes.

In conclusion, this paper extends the discourse on AI alignment by providing an insightful, efficient methodology to align LLMs from scratch, opening new avenues for responsible AI development.