Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models (2403.11838v2)
Abstract: Large language models (LLMs) exhibit impressive capabilities but also pose risks such as biased content generation and privacy violations. One current alignment technique is principle-driven integration, but it is limited by the imprecision of manually crafted rules and by inadequate risk perception in models that lack safety training. To address these issues, we introduce Guide-Align, a two-stage approach. In the first stage, a safety-trained model identifies potential risks and formulates specific guidelines for diverse inputs, building a comprehensive guideline library together with a model for input-to-guideline retrieval. In the second stage, the retrieval model matches new inputs with relevant guidelines, which then steer LLM response generation toward safe, high-quality outputs aligned with human values. An optional third stage fine-tunes a model on the well-aligned data produced by the second stage. Because guidelines are tailored to individual inputs, the library is both fine-grained and comprehensive, and the lightweight retrieval model transfers the safety expertise of the safety-trained LLM to other models. We evaluate our approach on three benchmarks and observe significant improvements in LLM safety and output quality. Notably, our fine-tuned model, Labrador, with only 13 billion parameters, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capability.
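The abstract describes an inference-time pipeline in which a retrieval model matches each input to guidelines from the library, and the retrieved guidelines then condition the LLM's response. Below is a minimal sketch of that idea, assuming a toy guideline library, a TF-IDF retriever standing in for the paper's trained retrieval model, and an illustrative prompt format; these names and choices are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of the Guide-Align inference flow described in the abstract:
# (1) retrieve guidelines relevant to a new input, (2) prepend them to the prompt
# so the downstream LLM generates a guided response. The library contents,
# retriever (TF-IDF here), and prompt template are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy guideline library; in Guide-Align the library is built by a safety-trained model.
GUIDELINES = [
    "Do not reveal personal data such as addresses or phone numbers.",
    "Refuse requests for instructions that could enable physical harm.",
    "Answer factual questions concisely and state uncertainty when unsure.",
]

vectorizer = TfidfVectorizer().fit(GUIDELINES)
guideline_vecs = vectorizer.transform(GUIDELINES)

def retrieve_guidelines(user_input: str, top_k: int = 2) -> list[str]:
    """Return the top_k guidelines most similar to the input."""
    scores = cosine_similarity(vectorizer.transform([user_input]), guideline_vecs)[0]
    ranked = sorted(range(len(GUIDELINES)), key=lambda i: scores[i], reverse=True)
    return [GUIDELINES[i] for i in ranked[:top_k]]

def build_prompt(user_input: str) -> str:
    """Prepend retrieved guidelines so the LLM's response follows them."""
    rules = "\n".join(f"- {g}" for g in retrieve_guidelines(user_input))
    return f"Follow these guidelines:\n{rules}\n\nUser: {user_input}\nAssistant:"

if __name__ == "__main__":
    print(build_prompt("Can you tell me my neighbor's home address?"))
```

In the paper itself, the retrieval model is trained for input-to-guideline matching and the guided responses can additionally be used to fine-tune a model (Labrador); the sketch only illustrates how retrieved guidelines can condition generation at inference time.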
Authors: Yi Luo, Zhenghao Lin, Yuhao Zhang, Jiashuo Sun, Chen Lin, Chengjin Xu, Xiangdong Su, Yelong Shen, Jian Guo, Yeyun Gong