
System-Level Natural Language Feedback (2306.13588v3)

Published 23 Jun 2023 in cs.CL and cs.AI

Abstract: Natural language (NL) feedback offers rich insights into user experience. While existing studies focus on an instance-level approach, where feedback is used to refine specific examples, we introduce a framework for system-level use of NL feedback. We show how to use feedback to formalize system-level design decisions in a human-in-the-loop process -- in order to produce better models. In particular, this is done through: (i) metric design for tasks; and (ii) LLM prompt design for refining model responses. We conduct two case studies of this approach for improving search query and dialog response generation, demonstrating the effectiveness of system-level feedback. We show that the combination of system-level and instance-level feedback brings further gains, and that human-written instance-level feedback results in more grounded refinements than GPT-3.5-written ones, underlining the importance of human feedback for building systems. We release our code and data at https://github.com/yyy-Apple/Sys-NL-Feedback.
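As a rough illustration of the system-level idea the abstract describes, the sketch below aggregates instance-level feedback strings into recurring themes and folds them into a refinement prompt for an LLM. All function names, the keyword-counting heuristic, and the example notes are hypothetical stand-ins, not the paper's actual method (which derives design decisions with a human in the loop):

```python
from collections import Counter
import re

def derive_system_criteria(feedback_notes, top_k=2):
    """Surface recurring themes across instance-level feedback by
    counting shared keywords -- a crude stand-in for the paper's
    human-in-the-loop derivation of system-level criteria."""
    words = Counter()
    for note in feedback_notes:
        words.update(re.findall(r"[a-z]+", note.lower()))
    stop = {"the", "is", "a", "an", "too", "and", "to", "of", "add"}
    return [w for w, _ in words.most_common() if w not in stop][:top_k]

def build_refinement_prompt(response, criteria):
    """Turn system-level criteria into a prompt asking an LLM to
    refine a model response against every criterion."""
    checklist = "\n".join(f"- address: {c}" for c in criteria)
    return (
        "Refine the response below so it satisfies every criterion.\n"
        f"Criteria:\n{checklist}\n"
        f"Response: {response}"
    )

# Hypothetical instance-level feedback collected across many examples.
notes = [
    "the answer is vague and ignores the question",
    "response is vague, add concrete details",
    "too vague; cite a source",
]
criteria = derive_system_criteria(notes)      # e.g. ["vague", ...]
prompt = build_refinement_prompt("Paris is a city.", criteria)
```

The point of the sketch is the two-step shape of the framework: feedback is first distilled into reusable system-level criteria, which then parameterize either an evaluation metric or a refinement prompt, rather than being consumed one example at a time.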

Authors (3)
  1. Weizhe Yuan (25 papers)
  2. Kyunghyun Cho (292 papers)
  3. Jason Weston (130 papers)
Citations (5)
