Rescue: Ranking LLM Responses with Partial Ordering to Improve Response Generation (2311.09136v3)
Abstract: Customizing LLMs for a specific task requires distinguishing high-quality responses from lower-quality ones. This ability can be instilled through supervised fine-tuning on extensive human preference data; however, obtaining a large volume of expert annotations is costly for most tasks. In this paper, we explore a novel method for optimizing LLMs with ranking metrics, which trains the model to prioritize the best responses from a pool of candidates created for a particular task. Rather than a traditional full ordering, we advocate a partial ordering, since reaching consensus on the perfect order of candidate responses can be difficult. Our partial ordering is more robust and less sensitive to noise, and it can be obtained with limited human annotations or through heuristic methods. We evaluate our system's improved response generation on benchmark datasets, including textual entailment and multi-document question answering, and conduct ablation studies on crucial factors: how to gather candidate responses for a specific task, how to determine their most suitable order, and how to balance supervised fine-tuning with ranking metrics. Our approach, named Rescue, offers a promising avenue for enhancing the response generation and task accuracy of LLMs.
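The abstract describes a combined objective: standard supervised fine-tuning plus a ranking loss that enforces only a partial order over candidate responses. The PyTorch sketch below is a minimal illustration of that idea under stated assumptions, not the paper's actual implementation: it assumes candidates are grouped into preference tiers (lower tier id = more preferred), scores each candidate by its length-normalized log-likelihood, and applies a margin hinge only across tiers, leaving same-tier candidates unordered. The names `sequence_log_prob` and `partial_order_rank_loss`, and the weighting coefficient `alpha`, are hypothetical.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, labels, pad_id=-100):
    """Length-normalized log-likelihood of each candidate response.

    logits: (num_candidates, seq_len, vocab); labels: (num_candidates, seq_len),
    with pad_id marking positions to ignore.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    mask = labels.ne(pad_id)
    # Clamp padding labels to 0 so gather is valid; the mask zeroes them out.
    token_lp = log_probs.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_lp * mask).sum(-1) / mask.sum(-1).clamp(min=1)

def partial_order_rank_loss(scores, tiers, margin=0.1):
    """Hinge loss over a partial ordering: every candidate in a more-preferred
    tier should outscore every candidate in a less-preferred tier; candidates
    within the same tier are deliberately left unordered."""
    loss = scores.new_zeros(())
    num_pairs = 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if tiers[i] < tiers[j]:  # i sits in a strictly better tier than j
                loss = loss + F.relu(margin - (scores[i] - scores[j]))
                num_pairs += 1
    return loss / max(num_pairs, 1)

# Combined objective (alpha is an illustrative hyperparameter); in practice
# the cross-entropy term would typically be computed on the gold response only:
# sft_loss = F.cross_entropy(gold_logits.transpose(1, 2), gold_labels,
#                            ignore_index=-100)
# total_loss = sft_loss + alpha * partial_order_rank_loss(
#     sequence_log_prob(cand_logits, cand_labels), tiers)
```

Because the hinge compares candidates only across tiers, annotators (or heuristics) need only sort responses into coarse quality buckets rather than agree on a complete ranking, which is the robustness-to-noise argument the abstract makes for partial ordering.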
Authors: Yikun Wang, Rui Zheng, Haoming Li, Qi Zhang, Tao Gui, Fei Liu