Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models (2403.11838v2)

Published 18 Mar 2024 in cs.CL and cs.AI

Abstract: LLMs exhibit impressive capabilities but also present risks such as biased content generation and privacy issues. One current alignment technique, principle-driven integration, faces challenges arising from the imprecision of manually crafted rules and from inadequate risk perception in models without safety training. To address these, we introduce Guide-Align, a two-stage approach. First, a safety-trained model identifies potential risks and formulates specific guidelines for various inputs, establishing a comprehensive guideline library and a model for input-guideline retrieval. Second, the retrieval model matches new inputs with relevant guidelines, which steer LLMs during response generation to ensure safe and high-quality outputs aligned with human values. An optional third stage fine-tunes a model on the well-aligned dataset generated by the second stage. Our method customizes guidelines to diverse inputs, improving the granularity and coverage of the guideline library, and it transfers safety expertise from a safety-trained LLM through a lightweight retrieval model. We evaluate our approach on three benchmarks, demonstrating significant improvements in LLM safety and quality. Notably, our fine-tuned 13-billion-parameter model, Labrador, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capabilities.
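To make the pipeline concrete, here is a minimal Python sketch of what the second (inference) stage could look like. Everything in it is an assumption for illustration: the guideline texts, the `embed` encoder, the top-k retrieval, and the prompt format are hypothetical stand-ins, not the paper's actual retrieval model, guideline library, or prompts.

```python
import numpy as np

# Hypothetical guideline library: each entry pairs a guideline string with a
# precomputed embedding. Stage 1 would build this with a safety-trained LLM;
# the embeddings here are random placeholders.
GUIDELINE_LIBRARY = [
    {"text": "Do not reveal personal data; refuse requests for private information.",
     "embedding": np.random.rand(384)},
    {"text": "Avoid stereotypes; describe groups of people neutrally.",
     "embedding": np.random.rand(384)},
]

def embed(text: str) -> np.ndarray:
    """Placeholder for the lightweight retrieval model's encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)

def retrieve_guidelines(user_input: str, top_k: int = 2) -> list[str]:
    """Match a new input to the most relevant guidelines by cosine similarity."""
    q = embed(user_input)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = sorted(GUIDELINE_LIBRARY,
                    key=lambda g: cosine(q, g["embedding"]),
                    reverse=True)
    return [g["text"] for g in scored[:top_k]]

def guided_generate(user_input: str, llm_generate) -> str:
    """Stage 2: prepend retrieved guidelines so they steer the LLM's response."""
    guidelines = retrieve_guidelines(user_input)
    prompt = ("Follow these guidelines when answering:\n"
              + "\n".join(f"- {g}" for g in guidelines)
              + f"\n\nUser: {user_input}\nAssistant:")
    return llm_generate(prompt)

# Example usage: plug in any chat model as `llm_generate`.
print(guided_generate("Tell me my neighbor's home address.",
                      lambda p: f"[LLM response to]\n{p}"))
```

Under this reading, the optional third stage would then fine-tune a smaller model (Labrador, in the paper) on input-response pairs produced by a pipeline like the one sketched above.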

Authors (10)
  1. Yi Luo (153 papers)
  2. Zhenghao Lin (14 papers)
  3. Yuhao Zhang (107 papers)
  4. Jiashuo Sun (11 papers)
  5. Chen Lin (75 papers)
  6. Chengjin Xu (36 papers)
  7. Xiangdong Su (12 papers)
  8. Yelong Shen (83 papers)
  9. Jian Guo (76 papers)
  10. Yeyun Gong (78 papers)
Citations (1)