
Self-Refinement of Language Models from External Proxy Metrics Feedback (2403.00827v1)

Published 27 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: It is often desirable for LLMs to capture multiple objectives when providing a response. In document-grounded response generation, for example, agent responses are expected to be relevant to a user's query while also being grounded in a given document. In this paper, we introduce Proxy Metric-based Self-Refinement (ProMiSe), which enables an LLM to refine its own initial response along key dimensions of quality guided by external metrics feedback, yielding an overall better final response. ProMiSe leverages feedback on response quality through principle-specific proxy metrics and iteratively refines its response one principle at a time. We apply ProMiSe to the open-source LLMs Flan-T5-XXL and Llama-2-13B-Chat and evaluate its performance on the document-grounded question answering datasets MultiDoc2Dial and QuAC, demonstrating that self-refinement improves response quality. We further show that fine-tuning Llama-2-13B-Chat on the synthetic dialogue data generated by ProMiSe yields significant performance improvements over both the zero-shot baseline and a supervised model fine-tuned on human-annotated data.
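The abstract describes an iterative, principle-by-principle refinement loop driven by external proxy metrics. Below is a minimal sketch of that control flow, assuming a hypothetical `llm.generate`/`llm.refine` interface, a `Principle` container, and per-principle score thresholds; the paper's actual prompts, proxy metrics, and acceptance criteria are not reproduced here. In the paper's setting, the principles would correspond to dimensions such as relevance to the query and groundedness in the document, each scored by its own external metric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Principle:
    """A response-quality principle scored by an external proxy metric (assumed structure)."""
    name: str
    proxy_metric: Callable[[str, str, str], float]  # (response, query, document) -> score
    threshold: float


def proxy_metric_self_refine(llm, query: str, document: str,
                             principles: list[Principle],
                             max_rounds: int = 3) -> str:
    """Refine an initial response one principle at a time, keeping a candidate
    only when its proxy-metric score improves (hypothetical llm.generate /
    llm.refine interfaces, not the paper's exact implementation)."""
    response = llm.generate(query=query, document=document)

    for principle in principles:  # e.g. relevance first, then groundedness
        for _ in range(max_rounds):
            score = principle.proxy_metric(response, query, document)
            if score >= principle.threshold:
                break  # this principle is already satisfied
            candidate = llm.refine(
                response=response,
                query=query,
                document=document,
                feedback=f"Improve {principle.name}; current score is {score:.2f}.",
            )
            # Accept the rewrite only if the external metric actually improves.
            if principle.proxy_metric(candidate, query, document) > score:
                response = candidate
            else:
                break
    return response
```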

Authors (8)
  1. Keshav Ramji (4 papers)
  2. Young-Suk Lee (17 papers)
  3. Ramón Fernandez Astudillo (29 papers)
  4. Md Arafat Sultan (25 papers)
  5. Tahira Naseem (27 papers)
  6. Asim Munawar (29 papers)
  7. Radu Florian (54 papers)
  8. Salim Roukos (41 papers)
Citations (2)
