
RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs (2305.08844v2)

Published 15 May 2023 in cs.CL

Abstract: Despite their unprecedented success, even the largest LLMs make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing LLMs with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization and alphabetization and show relative improvements up to 10% in multiple text similarity metrics over other learned, retrieval-augmented or prompting-based critique generators.
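The training signal described in the abstract can be sketched as a reward function: a critique is scored by how well a fixed task model revises its output after reading it. The sketch below is a toy illustration under stated assumptions, not the authors' implementation. `toy_revise` is a hypothetical stand-in for the frozen model (GPT-3 in the paper), `sequence_similarity` stands in for the text similarity metrics the abstract mentions, and the reinforcement learning update of the critique generator is not shown.

```python
from difflib import SequenceMatcher
from typing import Callable


def sequence_similarity(candidate: str, reference: str) -> float:
    """Order-sensitive token similarity in [0, 1]; a stand-in metric (assumption)."""
    return SequenceMatcher(None, candidate.split(), reference.split()).ratio()


def critique_reward(
    task_input: str,
    initial_output: str,
    critique: str,
    reference: str,
    revise: Callable[[str, str, str], str],
) -> float:
    """Score a critique by the quality of the revision it leads to.

    `revise` plays the role of the fixed task model: it is only called,
    never updated. Only the critique generator would be trained on this reward.
    """
    revised = revise(task_input, initial_output, critique)
    return sequence_similarity(revised, reference)


def toy_revise(task_input: str, initial_output: str, critique: str) -> str:
    """Hypothetical frozen model: sorts the words only if the critique asks for it."""
    if "sort" in critique.lower():
        return " ".join(sorted(initial_output.split()))
    return initial_output


# Toy alphabetization example (alphabetization is one of the paper's three tasks).
task = "Alphabetize: cherry apple banana"
draft = "cherry apple banana"
gold = "apple banana cherry"

helpful = critique_reward(task, draft, "The words are not sorted.", gold, toy_revise)
unhelpful = critique_reward(task, draft, "Looks fine.", gold, toy_revise)
print(helpful, unhelpful)
```

In this toy setting the helpful critique earns a higher reward than the unhelpful one, which is the signal a policy-gradient method could use to update the critique generator while the task model stays fixed.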

Authors (7)
  1. Afra Feyza Akyürek (9 papers)
  2. Ekin Akyürek (25 papers)
  3. Aman Madaan (30 papers)
  4. Ashwin Kalyan (26 papers)
  5. Peter Clark (108 papers)
  6. Derry Wijaya (31 papers)
  7. Niket Tandon (40 papers)
Citations (65)