Evaluating Reinforcement Learning for Fine-Tuning LLMs
This paper investigates reinforcement learning (RL) as a method for fine-tuning LLMs such as GPT-2 and T5 on downstream tasks. The authors argue that although supervised learning (SL) fine-tuning has been very successful, it suffers from a metric mismatch, since models are trained with token-level likelihood objectives but evaluated with sequence-level task metrics, and from train-test distribution shift, since at test time the model conditions on its own generations rather than ground-truth prefixes. RL is posited to address both issues by directly optimizing reward-based metrics on the model's own outputs.
Key Contributions and Methodology
The paper introduces Reinforcement Learning with Guided Feedback (RLGF), a suite of RL algorithms for refining LLMs with the help of a guide LLM. The guide serves a dual role: it generates additional start states for the learner to explore from, and it acts as an expert whose completions the fine-tuned LLM aims to match or surpass. The emphasis throughout is on improving over generic RL approaches such as Proximal Policy Optimization (PPO).
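To make the first role concrete, the following Python sketch shows how a guide LLM could supply restart states: a guide completion is truncated at a random point, and the resulting prefix becomes a start state from which the learner continues. The `Policy` callables and the `p_restart` knob are hypothetical placeholders for illustration, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): the guide LLM as a generator of
# additional start states. Policies are abstract callables for simplicity.
import random
from typing import Callable, List

Token = int
Policy = Callable[[List[Token], int], List[Token]]  # (context, n_tokens) -> continuation


def guide_restart_state(prompt: List[Token], guide: Policy, horizon: int) -> List[Token]:
    """Roll the guide forward from the prompt, then cut at a random step.

    The prefix (prompt + partial guide continuation) is a state the learner is
    unlikely to reach on its own early in training, so starting episodes from
    it eases exploration.
    """
    guide_continuation = guide(prompt, horizon)
    cut = random.randint(0, len(guide_continuation))
    return prompt + guide_continuation[:cut]


def collect_episode(prompt: List[Token], learner: Policy, guide: Policy,
                    horizon: int, p_restart: float = 0.5) -> List[Token]:
    """With probability p_restart (an assumed knob), start from a guide-visited state."""
    if random.random() < p_restart:
        state = guide_restart_state(prompt, guide, horizon)
    else:
        state = list(prompt)
    remaining = horizon - (len(state) - len(prompt))
    return state + learner(state, max(remaining, 0))
```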
Following a foundational review of RL in the context of LLMs, the paper proposes modifications that leverage a black-box guide LLM to inform policy training. The authors instantiate RLGF with AggreVaTeD, a guided variant of PPO, and a differentiable version of Locally Optimal Learning to Search (LOLS), interweaving imitation learning principles to improve exploration and coverage of the relevant state space.
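What distinguishes these instantiations is which policy controls the roll-in (reaching a state) and which controls the roll-out (completing the sequence). The sketch below illustrates one plausible dispatch; the specific learner/guide mixtures per algorithm are assumptions for illustration, not the paper's exact pseudocode.

```python
# Hedged sketch of roll-in / roll-out policy selection for the guided variants
# discussed above. The mixing probabilities are illustrative assumptions.
import random
from typing import Callable, List, Tuple

Token = int
Policy = Callable[[List[Token], int], List[Token]]


def rollin_rollout(prompt: List[Token], learner: Policy, guide: Policy,
                   horizon: int, algo: str) -> Tuple[List[Token], List[Token]]:
    """Return (roll-in prefix, roll-out continuation) for one trajectory."""
    switch = random.randint(0, horizon)  # random time step at which control switches
    if algo == "guided_ppo":       # guide (or learner) rolls in, learner rolls out
        rollin = guide if random.random() < 0.5 else learner
        rollout = learner
    elif algo == "aggrevated":     # learner rolls in, guide rolls out (cost-to-go estimate)
        rollin, rollout = learner, guide
    elif algo == "lols":           # both phases mix learner and guide
        rollin = guide if random.random() < 0.5 else learner
        rollout = guide if random.random() < 0.5 else learner
    else:
        raise ValueError(f"unknown algorithm: {algo}")
    prefix = prompt + rollin(prompt, switch)
    continuation = rollout(prefix, horizon - switch)
    return prefix, continuation
```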
Experimental Outcomes
The empirical evaluation spans three tasks: the IMDB sentiment task, CommonGen, and TL;DR summarization. Across these tasks the authors report notable performance gains, measured by task rewards such as sentiment score alongside fluency and standard text-generation quality metrics. The RLGF algorithms consistently outperform SL baselines and surpass PPO on these metrics, highlighting the effectiveness of guided RL for text generation.
Broadly, the RLGF framework led to improvements on several dimensions beyond the primary optimization objective, signaling its potential in broader language generation settings. In particular, the guided variants, such as guided PPO, use the guide's feedback to mitigate plain PPO's susceptibility to poor local optima.
Theoretical Underpinnings and Implications
The work grounds its proposals theoretically in iterative policy improvement, drawing on classical results from reinforcement and imitation learning. The AggreVaTeD algorithm, for instance, is motivated by interactive imitation learning and aims to exceed the performance baseline established by the guide LLM.
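As an illustration of this interactive-imitation view, an AggreVaTeD-style update can be written as a differentiable surrogate that shifts probability mass toward actions the guide's cost-to-go estimates favor. The PyTorch sketch below assumes a policy network mapping encoded roll-in states to per-action logits and a `q_guide` tensor of guide value estimates (obtained, for example, by scoring guide roll-outs); both are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal AggreVaTeD-style update sketch (assumptions noted in the text above).
import torch


def aggrevated_step(policy: torch.nn.Module, optimizer: torch.optim.Optimizer,
                    states: torch.Tensor, q_guide: torch.Tensor) -> float:
    """One gradient step: raise the probability of actions the guide values highly.

    policy:  maps a batch of encoded states (B, d) to action logits (B, V)
    states:  encoded roll-in states reached by the learner, shape (B, d)
    q_guide: guide cost-to-go estimates for every candidate action, shape (B, V)
    """
    logits = policy(states)
    probs = torch.softmax(logits, dim=-1)
    # Differentiable surrogate: expected guide value under the learner's policy.
    expected_value = (probs * q_guide).sum(dim=-1).mean()
    loss = -expected_value          # gradient ascent on the expected guide value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because this expected value is maximized at states the learner itself visits, the learner can match the guide where the guide is strong and, in principle, exceed it where the guide is suboptimal.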
By integrating guide policies into the roll-in and roll-out phases, the proposed methods address the exploration difficulties inherent in policy-gradient training. The RLGF framework effectively injects domain knowledge from the pre-trained guide LLM, reducing exploration inefficiency and training the policy on states that are well covered by the guide's distribution.
Future Trajectories
The findings suggest several directions for further work. Future studies could assess how RLGF scales to more complex linguistic tasks and how robust it is to different choices of guide model. Because the guide is treated as a black box, even proprietary, closed-source LLMs could serve as guidance, offering considerable experimental flexibility.
As transformer models continue to dominate the NLP landscape, reinforcement learning aided by guided feedback stands out as a promising route toward efficient, purposeful language generation. Better alignment with human-like language use and broader, context-aware competence could well shape the future of LLM fine-tuning.
In conclusion, the paper strengthens the case for RL in fine-tuning LLMs, leveraging guide policies and laying a foundation for improved, task-specific fine-tuning strategies.