Reasons to Reject? Aligning Language Models with Judgments (2312.14591v4)

Published 22 Dec 2023 in cs.CL

Abstract: As humans, we constantly interact with our peers and receive feedback in the form of natural language. This language feedback allows us to maintain appropriate behavior and rectify potential errors. The question naturally arises: can we use language feedback to align LLMs? In contrast to previous research that aligns LLMs with scalar rewards, we present the first systematic exploration of alignment through the lens of language feedback (i.e., judgment). We begin with an in-depth investigation of existing methods that can be adapted for aligning LLMs with judgments, and find that they cannot fully capitalize on the judgments. To make more effective use of judgments, we propose a novel framework, Contrastive Unlikelihood Training (CUT), which enables fine-grained detection and correction of inappropriate content based on judgments. Our results show that, with merely 1317 off-the-shelf judgment examples, CUT (LLaMA2-13b) can beat the 175B DaVinci003 and surpass the best baseline by 50.84 points on AlpacaEval. CUT (LLaMA2-chat-13b) can also align LLMs iteratively using up-to-date, model-specific judgments, improving performance from 81.09 to 91.68 points on AlpacaEval. Further analysis suggests that judgments hold greater potential than rewards for LLM alignment.
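
Since the abstract describes CUT only at a high level, a rough illustration of the underlying idea may help: unlikelihood training pushes down the probability of tokens that a judgment flags as inappropriate, while ordinary likelihood training is kept for the rest. The PyTorch sketch below is an assumption-based rendering of that general objective, not the paper's exact CUT loss; the function name and the token-level `inappropriate_mask` argument are hypothetical stand-ins for the fine-grained detection step the abstract mentions.

```python
# Minimal sketch of a contrastive unlikelihood-style objective (illustrative only).
# Assumption: a boolean token-level mask marks spans flagged by a judgment as
# inappropriate; how that mask is derived is the part this sketch does not cover.
import torch
import torch.nn.functional as F

def contrastive_unlikelihood_loss(logits, targets, inappropriate_mask):
    """
    logits:             (batch, seq_len, vocab) model outputs
    targets:            (batch, seq_len) target token ids
    inappropriate_mask: (batch, seq_len) bool, True where a judgment flags the token
    """
    log_probs = F.log_softmax(logits, dim=-1)                              # (B, T, V)
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # (B, T)

    # Likelihood term: raise the probability of tokens not flagged by the judgment.
    mle_loss = -(token_logp * (~inappropriate_mask)).sum()

    # Unlikelihood term: lower the probability of flagged tokens via log(1 - p),
    # computed from log p with a clamp for numerical stability.
    token_p = token_logp.exp().clamp(max=1.0 - 1e-6)
    ul_loss = -(torch.log1p(-token_p) * inappropriate_mask).sum()

    return (mle_loss + ul_loss) / targets.numel()
```

A quick shape-only check with dummy tensors (no real model involved):

```python
B, T, V = 2, 8, 32000
logits = torch.randn(B, T, V, requires_grad=True)
targets = torch.randint(0, V, (B, T))
mask = torch.zeros(B, T, dtype=torch.bool)
mask[:, -2:] = True  # pretend the judgment flagged the last two tokens
loss = contrastive_unlikelihood_loss(logits, targets, mask)
loss.backward()
```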
