- The paper introduces a novel two-player online RL approach that combines a frozen reflection model with a policy model to fine-tune LLMs.
- It employs negative example generation and single-prompt action enumeration to improve error correction and reduce training time.
- Empirical results, especially on the AutoExplore benchmark, demonstrate that Reflect-RL significantly outperforms traditional fine-tuning methods in complex decision-making tasks.
Enhancing LLMs with Online Reinforcement Learning through Reflect-RL
Introduction
Recent advances in LLMs have shown significant promise across applications such as problem-solving, coding, and document retrieval. With advanced prompting techniques, LLMs have begun to demonstrate impressive capabilities in understanding, reasoning, planning, and even reflection. Despite these capabilities, their success in interactive decision-making environments remains limited, particularly when tasks demand dynamic adaptation beyond static datasets. This paper introduces Reflect-RL, an innovative approach to fine-tuning LLMs with online Reinforcement Learning (RL) in interactive decision-making environments. Reflect-RL combines online RL with a two-player mechanism, comprising a frozen reflection model and a trainable policy model, to facilitate learning in complex environments.
Key Contributions and Techniques
Reflect-RL differentiates itself through a series of novel techniques:
- Reflection Mechanism: Uses a frozen reflection model, distilled from GPT-4, that aids decision-making by generating reflections on the current situation and potential next steps. This mechanism accelerates training and improves test performance (a minimal decision-step sketch follows this list).
- Negative Example Generation: Balances the reflection model's training data with negative examples, improving its ability to diagnose and correct errors and raising the overall task success rate (see the dataset-building sketch below).
- Single-Prompt Action Enumeration: Lists all valid actions in a single prompt so the LLM can select one directly, reducing the number of model calls and the time cost of each decision.
- Curriculum Learning: Applies a task-specific curriculum to address classic RL difficulties such as long planning horizons and sparse rewards, easing and stabilizing training (see the curriculum sketch below).
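To make the two-player loop concrete, here is a minimal Python sketch of a single Reflect-RL decision step. It assumes hypothetical `reflection_model` and `policy_model` objects with `generate`/`score` methods; the names and interfaces are illustrative assumptions, not the paper's actual code.

```python
# A minimal sketch of one Reflect-RL decision step. `reflection_model` and
# `policy_model` are assumed objects with `generate`/`score` methods; the
# interfaces are illustrative, not the paper's implementation.

def decision_step(observation: str, valid_actions: list[str],
                  reflection_model, policy_model) -> str:
    # 1) The frozen reflection model comments on the current state and
    #    possible next steps; its weights are never updated by online RL.
    reflection = reflection_model.generate(
        f"Observation:\n{observation}\n"
        "Reflect on the situation and suggest what to try next."
    )

    # 2) Single-prompt action enumeration: every valid action is listed in
    #    one prompt, so the policy picks an index instead of free-form text.
    enumerated = "\n".join(f"({i}) {a}" for i, a in enumerate(valid_actions))
    prompt = (
        f"Observation:\n{observation}\n"
        f"Reflection:\n{reflection}\n"
        f"Valid actions:\n{enumerated}\n"
        "Answer with the index of the best action."
    )

    # 3) The trainable policy model scores each candidate index; only this
    #    model receives policy-gradient updates during online RL training.
    scores = [policy_model.score(prompt, str(i)) for i in range(len(valid_actions))]
    best = max(range(len(valid_actions)), key=lambda i: scores[i])
    return valid_actions[best]
```

Because the policy only has to emit an index from the enumerated list, generation stays cheap during rollouts, which is where the efficiency of single-prompt enumeration comes from.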
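Negative example generation can be pictured with the following hedged sketch of how the reflection model's distillation data might be assembled. The callables `rollout_expert`, `rollout_flawed`, and `annotate_with_gpt4` are placeholders supplied by the caller, not functions from the paper.

```python
import random

def build_reflection_dataset(envs, rollout_expert, rollout_flawed,
                             annotate_with_gpt4, n_per_env=8, neg_ratio=0.5):
    """Collect (state, reflection, action) triples, mixing in negative
    examples so the reflection model also learns to diagnose mistakes."""
    dataset = []
    for env in envs:
        for _ in range(n_per_env):
            # With probability `neg_ratio`, use a deliberately flawed rollout
            # (a negative example); otherwise use a successful one.
            flawed = random.random() < neg_ratio
            trajectory = rollout_flawed(env) if flawed else rollout_expert(env)
            for state, action in trajectory:
                reflection = annotate_with_gpt4(state)  # GPT-4 distillation target
                dataset.append({"state": state,
                                "reflection": reflection,
                                "action": action})
    return dataset
```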
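For curriculum learning, a simple horizon-based schedule is sketched below. It assumes the environment exposes a difficulty knob such as a maximum episode length; this is an illustrative assumption, not Reflect-RL's actual per-task curricula.

```python
def curriculum_episodes(env, horizons=(2, 5, 10, 20), episodes_per_stage=100):
    """Yield environment resets with progressively longer horizons, so early
    training sees dense, near-goal reward before the full sparse-reward task."""
    for max_steps in horizons:
        for _ in range(episodes_per_stage):
            # `max_steps` is an assumed difficulty knob, not the paper's API.
            yield env.reset(max_steps=max_steps)
```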
Benchmark Development
Reflect-RL introduces AutoExplore, a new benchmark tailored to industrial applications. This benchmark, together with others such as DangerousTaxi and ALFWorld, is used to demonstrate Reflect-RL's efficacy in enhancing LLMs' decision-making capabilities in complex interactive environments.
Empirical Results
Across these benchmarks, Reflect-RL significantly outperforms both traditional supervised fine-tuning (SFT) and untuned pre-trained LMs, demonstrating its ability to fine-tune LLMs for complex RL tasks. The gains are especially pronounced on the AutoExplore benchmark, underscoring the method's practical applicability in real-world scenarios.
Implications and Future Directions
The introduction of Reflect-RL marks a significant step forward in ongoing efforts to enhance LLMs' adaptability and interactive decision-making capabilities. By effectively integrating online RL and leveraging techniques such as reflection and curriculum learning, Reflect-RL sets a new precedent for fine-tuning LLMs. Future research could explore scaling Reflect-RL to larger foundation models, applying it across a broader range of environments, and further enhancing the reflection mechanism to foster generalization and adaptability.
Reflect-RL presents a promising avenue for advancing the field of LLMs, potentially broadening their applicability and efficiency in tackling complex, interactive decision-making tasks. As we continue to explore these possibilities, Reflect-RL serves as a foundational framework for future developments in enhancing LLMs' dynamic learning and adaptation capabilities.