Eureka: Human-Level Reward Design via Coding Large Language Models (2310.12931v2)

Published 19 Oct 2023 in cs.RO, cs.AI, and cs.LG

Abstract: LLMs have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and the safety of the generated rewards without model updating. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed.

Human-level Reward Design via Coding LLMs

The paper "Human-level Reward Design via Coding LLMs" introduces a novel approach to reward design in reinforcement learning (RL) utilizing LLMs like GPT-4. The primary contribution of this work is a framework named Evolution-driven Universal REward Kit for Agent (Eureka), which utilizes the code-writing and in-context improvement capabilities of state-of-the-art LLMs to generate and optimize reward functions. This enables the design of reward functions that surpass the performance of those manually engineered by humans.

Methodology

The Eureka framework proposes a three-pronged approach to achieve human-level reward design:

  1. Environment as Context: This component supplies the environment's source code as context to the LLM, enabling zero-shot generation of executable reward functions. Because the raw environment code is fed to the LLM without pre-defined reward templates or task-specific prompt engineering, Eureka can propose plausible reward functions for a new task directly.
  2. Evolutionary Search: In each iteration, several reward functions are sampled independently from the LLM, which reduces the chance that every candidate fails to execute and exploits the LLM's in-context learning ability. The best-performing candidate is then mutated in subsequent iterations, progressively improving reward quality.
  3. Reward Reflection: Eureka provides the LLM with a detailed textual summary of policy-training outcomes, specifying which components of the reward function helped or hindered learning so that subsequent iterations can make targeted modifications. A simplified sketch of the full loop follows this list.
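
The three components combine into a simple generate-train-reflect loop. The sketch below illustrates that loop in Python; the helper functions (llm_generate_rewards, train_policy, build_reflection) are hypothetical placeholders for a GPT-4 call, an RL training run, and the construction of the reward-reflection text, not the paper's actual API.

```python
import random

# Hypothetical stand-ins for the three external pieces of the loop.
def llm_generate_rewards(prompt: str, n: int) -> list[str]:
    """Placeholder for a GPT-4 call returning n candidate reward functions."""
    return [f"# candidate reward function {i}\n..." for i in range(n)]

def train_policy(env_source: str, reward_code: str) -> dict:
    """Placeholder for an RL training run; returns summary statistics."""
    return {"task_score": random.random(), "component_stats": {}}

def build_reflection(stats: dict) -> str:
    """Placeholder for the textual summary of policy-training outcomes."""
    return f"final task score: {stats['task_score']:.3f}"

def eureka_loop(env_source: str, task: str, iterations: int = 5, samples: int = 16):
    """Generate-train-reflect loop over reward code."""
    # Environment as context: the raw environment source goes into the prompt,
    # so the LLM can reference observation and state variables directly.
    prompt = f"Task: {task}\nEnvironment source code:\n{env_source}\n"
    best_code, best_score = None, float("-inf")

    for it in range(iterations):
        # Evolutionary search: sample several independent reward candidates;
        # at least one usually executes even if others contain errors.
        candidates = llm_generate_rewards(prompt, n=samples)

        scored = []
        for code in candidates:
            try:
                stats = train_policy(env_source, reward_code=code)
            except Exception:
                continue  # drop candidates whose reward code fails to run
            scored.append((stats["task_score"], code, stats))
        if not scored:
            continue  # every sample failed; resample next iteration

        score, code, stats = max(scored, key=lambda x: x[0])
        if score > best_score:
            best_code, best_score = code, score

        # Reward reflection: append the best candidate and a textual summary of
        # its training outcome so the next generation can make targeted edits.
        prompt += (
            f"\nBest reward function from iteration {it}:\n{code}\n"
            f"Training feedback: {build_reflection(stats)}\n"
            "Write an improved reward function.\n"
        )

    return best_code, best_score
```

In the paper the training step is performed with standard RL (PPO in GPU-accelerated Isaac Gym environments) and the generator is GPT-4; the stubs above merely let the control flow run end to end.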

Experimental Setup and Results

Eureka's effectiveness was evaluated on a diverse range of 29 open-source RL environments, covering various robot morphologies from quadrupeds and humanoids to dexterous hands. The key metrics of interest were the average normalized improvement over human-engineered rewards and the success rate in high-dimensional dexterity tasks.

  • Human Expert Comparison: Eureka outperformed human-designed rewards on 83% of the tasks, with an average normalized improvement of 52% (one way such a normalized score can be computed is sketched after this list).
  • Dexterous Manipulation: Using Eureka rewards in a curriculum learning setting, a simulated anthropomorphic Shadow Hand learned to perform pen-spinning tricks, a skill that had not been demonstrated with previous reward-shaping techniques.
  • Human Feedback Integration: Through a gradient-free in-context learning approach to RLHF, Eureka incorporated various forms of human input to improve the quality and safety of the generated rewards without any model updates.
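
The 52% figure is an average over tasks of a score that normalizes each learned reward against the human-engineered one. The exact normalization is defined in the paper; the snippet below shows one common form of such a human-normalized score (an assumption for illustration, not quoted from the paper), using a sparse task-success reward as the floor.

```python
def human_normalized_score(method: float, human: float, sparse: float) -> float:
    """Assumed normalization: 0 matches a sparse-reward baseline, 1 matches the
    human-engineered reward, and values above 1 mean the generated reward
    outperforms the human expert on that task."""
    return (method - sparse) / (abs(human - sparse) + 1e-8)

# Example: method=0.9, human=0.6, sparse=0.1 -> (0.9 - 0.1) / 0.5 = 1.6
print(human_normalized_score(0.9, 0.6, 0.1))  # ~1.6
```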

Implications and Future Directions

This research signifies a substantial advancement in automating and optimizing reward design using LLMs. The implications are twofold:

  1. Theoretical Implications: The use of LLMs for generating reward functions highlights their potential in solving other open-ended, complex decision-making problems. Importantly, the framework's success underscores the instrumental role of iterative refinement and contextual learning capabilities inherent in LLMs.
  2. Practical Implications: For practitioners, Eureka offers a scalable solution to reward design challenges, reducing the dependency on domain-specific knowledge for generating effective reward functions. This can expedite the development of RL applications in various fields such as robotics, gaming, and autonomous systems.

Future Prospects in AI

Future developments may focus on extending Eureka's approach to incorporate more dynamic forms of human feedback and on enhancing the LLM's understanding of complex, multi-agent environments. Additionally, integrating Eureka with real-world robotic systems could further validate its applicability and robustness outside simulated environments. Another prospective area is optimizing the computational efficiency of the framework to handle more extensive and complex tasks in real-time applications.

In summary, this paper presents a significant step towards automating reward design in RL by leveraging coding LLMs' capabilities. Through the Eureka framework, it successfully demonstrates that LLMs can not only generate but also iteratively improve reward functions, achieving and surpassing human-level performance across various RL tasks.

Authors (9)
  1. Yecheng Jason Ma (21 papers)
  2. William Liang (5 papers)
  3. Guanzhi Wang (14 papers)
  4. De-An Huang (45 papers)
  5. Osbert Bastani (97 papers)
  6. Dinesh Jayaraman (65 papers)
  7. Yuke Zhu (134 papers)
  8. Linxi Fan (33 papers)
  9. Anima Anandkumar (236 papers)
Citations (201)