Self-Refinement of Language Models from External Proxy Metrics Feedback (2403.00827v1)
Abstract: It is often desirable for LLMs to capture multiple objectives when providing a response. In document-grounded response generation, for example, agent responses are expected to be relevant to a user's query while also being grounded in a given document. In this paper, we introduce Proxy Metric-based Self-Refinement (ProMiSe), which enables an LLM to refine its own initial response along key dimensions of quality guided by external metric feedback, yielding an overall better final response. ProMiSe leverages feedback on response quality through principle-specific proxy metrics and iteratively refines its response one principle at a time. We apply ProMiSe to the open-source LLMs Flan-T5-XXL and Llama-2-13B-Chat and evaluate its performance on the document-grounded question answering datasets MultiDoc2Dial and QuAC, demonstrating that self-refinement improves response quality. We further show that fine-tuning Llama-2-13B-Chat on the synthetic dialogue data generated by ProMiSe yields significant performance improvements over both the zero-shot baseline and a model supervised fine-tuned on human-annotated data.
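The abstract describes an iterative, one-principle-at-a-time refinement loop driven by principle-specific proxy metrics. The sketch below illustrates that control flow only; the `generate`/`refine` interfaces, the principle names, the iteration budget, and the accept-only-if-the-metric-improves rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a ProMiSe-style refinement loop, per the abstract.
# Assumptions (not from the paper): `llm` exposes generate()/refine(),
# `metrics` maps each principle to a scoring function, and a candidate
# is kept only when its proxy metric score improves.

def promise_refine(llm, query, document, principles, metrics, max_iters=3):
    """Refine a response one principle at a time, guided by proxy metrics."""
    response = llm.generate(query=query, document=document)
    for principle in principles:  # e.g. "relevance", "groundedness"
        score = metrics[principle](response, query, document)
        for _ in range(max_iters):
            feedback = f"Improve the response's {principle}."
            candidate = llm.refine(response, feedback,
                                   query=query, document=document)
            new_score = metrics[principle](candidate, query, document)
            if new_score > score:  # accept only metric-improving revisions
                response, score = candidate, new_score
            else:
                break  # stop once this principle no longer improves
    return response
```

A principle-specific metric here could be as simple as a lexical-overlap score, e.g., ROUGE recall of the response against the grounding document as a stand-in groundedness signal; the paper's actual choice of proxy metrics may differ.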
- Evaluating correctness and faithfulness of instruction-following models for question answering.
- Constitutional AI: Harmlessness from AI feedback.
- QuAC: Question answering in context.
- Scaling instruction-finetuned language models.
- QLoRA: Efficient finetuning of quantized LLMs.
- MultiDoc2Dial: Modeling dialogues grounded in multiple documents. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6162–6176, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, pages 16477–16508.
- CRITIC: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738.
- Large language models cannot self-correct reasoning yet.
- Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
- Self-Refine: Iterative refinement with self-feedback.
- Is self-repair a silver bullet for code generation? In Proceedings of ICLR 2024.
- Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188.
- REFINER: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904.
- Check your facts and try again: Improving large language models with external knowledge and automated feedback.
- Self-critiquing models for assisting human evaluators.
- Training language models with language feedback at scale.
- Reflexion: Language agents with verbal reinforcement learning.
- Llama 2: Open foundation and fine-tuned chat models.
- Shepherd: A critic for language model generation. arXiv preprint arXiv:2308.04592.
- Enable language models to implicitly learn self-improvement from data.
- Generating sequences by learning to self-correct.
- WeCheck: Strong factual consistency checker via weakly supervised learning. arXiv preprint arXiv:2212.10057.
- SelFee: Iterative self-revising LLM empowered by self-feedback generation. Blog post.
- A comprehensive analysis of the effectiveness of large language models as automatic dialogue evaluators. In AAAI-2024.
- BERTScore: Evaluating text generation with BERT. In ICLR 2020.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
Authors: Keshav Ramji, Young-Suk Lee, Ramón Fernandez Astudillo, Md Arafat Sultan, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos