Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
96 tokens/sec
Gemini 2.5 Pro Premium
44 tokens/sec
GPT-5 Medium
18 tokens/sec
GPT-5 High Premium
18 tokens/sec
GPT-4o
105 tokens/sec
DeepSeek R1 via Azure Premium
83 tokens/sec
GPT OSS 120B via Groq Premium
475 tokens/sec
Kimi K2 via Groq Premium
259 tokens/sec
2000 character limit reached

From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning (2411.03817v3)

Published 6 Nov 2024 in cs.AI, cs.CL, cs.HC, and cs.RO

Abstract: The outstanding capabilities of LLMs render them a crucial component in various autonomous agent systems. While traditional methods depend on the inherent knowledge of LLMs without fine-tuning, more recent approaches have shifted toward the reinforcement learning strategy to further enhance agents' ability to solve complex interactive tasks with environments and tools. However, previous approaches are constrained by the sparse reward issue, where existing datasets solely provide a final scalar reward for each multi-step reasoning chain, potentially leading to ineffectiveness and inefficiency in policy learning. In this paper, we introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. Inheriting the spirit of novice-to-expert theory, we first compare the actions of the expert and the agent to automatically generate intermediate rewards for fine-grained optimization. Additionally, we propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment. Further theoretical analysis demonstrates that the action distribution of the agent can converge toward the expert action distribution over multiple training cycles. Experimental results across various datasets indicate that StepAgent outperforms existing baseline methods.

Citations (1)

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.