IterAlign: Iterative Constitutional Alignment of Large Language Models (2403.18341v1)
Abstract: With the rapid development of LLMs, aligning LLMs with human values and societal norms to ensure their reliability and safety has become crucial. Reinforcement learning from human feedback (RLHF) and Constitutional AI (CAI) have been proposed for LLM alignment. However, these methods require either heavy human annotation or explicitly pre-defined constitutions, which are labor-intensive and resource-consuming. To overcome these drawbacks, we study constitution-based LLM alignment and propose a data-driven constitution discovery and self-alignment framework called IterAlign. IterAlign leverages red teaming to unveil the weaknesses of an LLM and automatically discovers new constitutions using a stronger LLM. These constitutions are then used to guide self-correction of the base LLM. Such a constitution discovery pipeline can be run iteratively and automatically to discover new constitutions that specifically target the alignment gaps in the current LLM. Empirical results on several safety benchmark datasets and multiple base LLMs show that IterAlign successfully improves truthfulness, helpfulness, harmlessness, and honesty, improving LLM alignment by up to $13.5\%$ in harmlessness.
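The abstract describes a four-stage loop: red teaming to surface failures, constitution proposal by a stronger LLM, constitution-guided self-correction, and an update to the base model, repeated iteratively. The Python sketch below illustrates one way such a loop could be wired together; the helper names (`generate`, `violates_safety`, `propose_constitutions`, `self_correct`, `finetune`) are hypothetical placeholders for illustration, not the authors' released implementation or API.

```python
# Minimal sketch of an IterAlign-style loop, under the assumption of
# hypothetical model wrappers exposing generate/self_correct/finetune
# and a safety checker violates_safety. Placeholder code, not the paper's.

def iteralign(base_llm, stronger_llm, red_team_prompts, num_iterations=3):
    """Iteratively discover constitutions and self-align the base LLM."""
    constitutions = []
    for _ in range(num_iterations):
        # 1. Red teaming: collect prompts where the current base model
        #    produces unsafe or otherwise undesirable responses.
        failures = []
        for prompt in red_team_prompts:
            response = base_llm.generate(prompt)
            if violates_safety(response):
                failures.append((prompt, response))

        # 2. Constitution discovery: a stronger LLM summarizes the observed
        #    failure modes into new natural-language principles.
        constitutions.extend(stronger_llm.propose_constitutions(failures))

        # 3. Self-correction: the base LLM revises its own failing responses
        #    under the guidance of the discovered constitutions.
        revised = [(prompt, base_llm.self_correct(response, constitutions))
                   for prompt, response in failures]

        # 4. Alignment update: fine-tune the base LLM on the corrected
        #    (prompt, revised response) pairs, then repeat the loop so the
        #    next round of constitutions targets the remaining gaps.
        base_llm = base_llm.finetune(revised)

    return base_llm, constitutions
```

Because each round only proposes constitutions for failures the current model still exhibits, the discovered principles adapt to the model's remaining alignment gaps rather than relying on a fixed, hand-written constitution.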
- Xiusi Chen
- Hongzhi Wen
- Sreyashi Nag
- Chen Luo
- Qingyu Yin
- Ruirui Li
- Zheng Li
- Wei Wang