Papers
Topics
Authors
Recent
Search
2000 character limit reached

SoFA: Shielded On-the-fly Alignment via Priority Rule Following

Published 27 Feb 2024 in cs.CL | (2402.17358v1)

Abstract: The alignment problem in LLMs involves adapting them to the broad spectrum of human values. This requirement challenges existing alignment methods due to diversity of preferences and regulatory standards. This paper introduces a novel alignment paradigm, priority rule following, which defines rules as the primary control mechanism in each dialog, prioritizing them over user instructions. Our preliminary analysis reveals that even the advanced LLMs, such as GPT-4, exhibit shortcomings in understanding and prioritizing the rules. Therefore, we present PriorityDistill, a semi-automated approach for distilling priority following signals from LLM simulations to ensure robust rule integration and adherence. Our experiments show that this method not only effectively minimizes misalignments utilizing only one general rule but also adapts smoothly to various unseen rules, ensuring they are shielded from hijacking and that the model responds appropriately.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (42)
  1. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  3. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
  4. Fine-tuning language models to find agreement among humans with diverse preferences. In Advances in Neural Information Processing Systems, volume 35, pages 38176–38189. Curran Associates, Inc.
  5. Weak-to-strong generation: eliciting strong capabilities with weak supervision.
  6. Everyone deserves a reward: Learning customized human preferences. arXiv preprint arXiv:2309.03126.
  7. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  8. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  9. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
  10. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
  11. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
  12. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998.
  13. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.
  14. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852.
  15. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.
  16. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.
  17. Can llms follow simple rules? arXiv preprint arXiv:2311.04235.
  18. OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt.
  19. OpenAI. 2023. Gpt-4 technical report. arXiv, pages 2303–08774.
  20. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.
  21. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105, Dublin, Ireland. Association for Computational Linguistics.
  22. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  23. Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434, Toronto, Canada. Association for Computational Linguistics.
  24. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.
  25. Identifying the risks of lm agents with an lm-emulated sandbox. arXiv preprint arXiv:2309.15817.
  26. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348.
  27. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025.
  28. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.
  29. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
  30. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047.
  31. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  32. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  33. Tensor trust: Interpretable prompt injection attacks from an online game. arXiv preprint arXiv:2311.01011.
  34. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.
  35. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  36. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864.
  37. Align on the fly: Adapting chatbot behavior to established norms. arXiv preprint arXiv:2312.15907.
  38. Exploring large language models for communication games: An empirical study on werewolf. arXiv preprint arXiv:2309.04658.
  39. Large language models as optimizers. arXiv preprint arXiv:2309.03409.
  40. Failures pave the way: Enhancing large language models through tuning-free rule accumulation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1751–1777, Singapore. Association for Computational Linguistics.
  41. Making retrieval-augmented language models robust to irrelevant context. arXiv preprint arXiv:2310.01558.
  42. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
Citations (7)

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.