Actor-Critics Can Achieve Optimal Sample Efficiency (2505.03710v1)

Published 6 May 2025 in stat.ML, cs.AI, and cs.LG

Abstract: Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $\epsilon$-optimal policy with a sample complexity of $O(1/\epsilon^2)$ trajectories with general function approximation when strategic exploration is necessary. We address this open problem by introducing a novel actor-critic algorithm that attains a sample complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + dH^4 \log|\mathcal{F}|/\epsilon^2)$ trajectories, and accompanying $\sqrt{T}$ regret when the Bellman eluder dimension $d$ does not increase with $T$ at more than a $\log T$ rate. Here, $\mathcal{F}$ is the critic function class, $\mathcal{A}$ is the action space, and $H$ is the horizon in the finite horizon MDP setting. Our algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. We extend this to the setting of Hybrid RL, showing that initializing the critic with offline data yields sample efficiency gains compared to purely offline or online RL. Further, utilizing access to offline data, we provide a non-optimistic provably efficient actor-critic algorithm that only additionally requires $N_{\text{off}} \geq c_{\text{off}}^* dH^4/\epsilon^2$ in exchange for omitting optimism, where $c_{\text{off}}^*$ is the single-policy concentrability coefficient and $N_{\text{off}}$ is the number of offline samples. This addresses another open problem in the literature. We further provide numerical experiments to support our theoretical findings.

Authors (3)
  1. Kevin Tan (12 papers)
  2. Wei Fan (160 papers)
  3. Yuting Wei (47 papers)

Summary

Actor-Critics Can Achieve Optimal Sample Efficiency

The paper addresses a significant open problem in reinforcement learning (RL): learning an $\epsilon$-optimal policy with actor-critic algorithms under general function approximation when strategic exploration is necessary. It demonstrates that actor-critic methods can achieve optimal sample efficiency, with a sample complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + dH^4 \log|\mathcal{F}|/\epsilon^2)$ trajectories and $\sqrt{T}$ regret.

The following are key components and innovations presented in the paper:

Actor-Critic Algorithms and Exploration

Actor-critic methods blend policy-based and value-based strategies and have become integral to modern RL. Despite their popularity, achieving optimal sample efficiency in complex environments without assuming comprehensive state-action space coverage has remained challenging: prior analyses typically relied on reachability of the state-action space or substantial data coverage to simplify learning. This paper introduces a new algorithm that combines optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets to overcome these challenges.
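
The sketch below is a minimal, illustrative rendering of how these three ingredients might fit together, written for a tabular problem rather than general function approximation. The count-based optimism bonus, the dataset-doubling rule used as the rare-switching trigger, the step size, and the assumed `env.reset()`/`env.step()` interface are simplifications made for readability, not the paper's exact procedure.

```python
# Illustrative sketch only (not the paper's exact algorithm): an off-policy,
# optimism-adjusted critic fit toward the optimal Q-function, a softmax actor
# updated against that critic, and "rare-switching" resets that only refit the
# critic and reset the actor when the dataset has doubled.
import numpy as np

def fit_optimistic_critic(data, nS, nA, H, bonus_scale=1.0):
    """Tabular fitted Q-iteration toward Q*, with a count-based optimism bonus."""
    Q = np.zeros((H + 1, nS, nA))          # Q[H] = 0 terminal values
    counts = np.zeros((H, nS, nA))
    for h, s, a, r, s_next in data:
        counts[h, s, a] += 1
    for h in reversed(range(H)):
        target_sum = np.zeros((nS, nA))
        for hh, s, a, r, s_next in data:
            if hh == h:
                target_sum[s, a] += r + Q[h + 1, s_next].max()  # Bellman-optimality target
        Q[h] = target_sum / np.maximum(counts[h], 1)
        Q[h] += bonus_scale / np.sqrt(np.maximum(counts[h], 1))  # optimism bonus
        Q[h] = np.clip(Q[h], 0.0, H)                             # truncate to [0, H]
    return Q[:H]

def actor_critic_rare_switch(env, H, T, eta=0.1, switch_factor=2.0):
    """Assumes env.reset() -> state (int) and env.step(a) -> (next_state, reward)."""
    nS, nA = env.num_states, env.num_actions
    logits = np.zeros((H, nS, nA))          # softmax (log-linear) actor per step h
    Q = np.full((H, nS, nA), float(H))      # optimistic critic initialization
    data, last_fit_size = [], 1
    for _ in range(T):
        s = env.reset()
        for h in range(H):
            p = np.exp(logits[h, s] - logits[h, s].max())
            a = int(np.random.choice(nA, p=p / p.sum()))
            s_next, r = env.step(a)
            data.append((h, s, a, r, s_next))
            s = s_next
        if len(data) >= switch_factor * last_fit_size:   # rare switching
            last_fit_size = len(data)
            Q = fit_optimistic_critic(data, nS, nA, H)   # off-policy, targets Q*
            logits = np.zeros((H, nS, nA))               # policy reset
        logits += eta * Q                                # mirror-descent-style actor step
    return logits, Q
```

The point of the sketch is the control flow: the critic is refit from the full off-policy dataset only at rare switching times, while the actor takes cheap incremental updates against the most recent optimistic critic in between.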

Significant Results

The main contribution is a novel actor-critic algorithm that achieves a sample complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + dH^4 \log|\mathcal{F}|/\epsilon^2)$ trajectories, a substantial improvement over existing methods requiring on the order of $1/\epsilon^3$ to $1/\epsilon^4$ samples. Furthermore, the approach attains $\sqrt{T}$ regret when the Bellman eluder dimension $d$ does not increase with $T$ at more than a $\log T$ rate. These results are obtained without common assumptions about coverage or reachability, which broadens the application scope in real-world scenarios characterized by large state and action spaces.
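
As a rough illustration of how the stated bound scales with its parameters, the hypothetical helper below evaluates $dH^5 \log|\mathcal{A}|/\epsilon^2 + dH^4 \log|\mathcal{F}|/\epsilon^2$ for given inputs. Constants and lower-order factors are dropped, so this is only useful for comparing parameter regimes, not for setting a real sample budget.

```python
# Back-of-the-envelope scaling of the trajectory bound; constants omitted.
import math

def trajectory_budget(d, H, num_actions, log_card_F, eps):
    """d H^5 log|A| / eps^2 + d H^4 log|F| / eps^2, up to constants."""
    return (d * H**5 * math.log(num_actions) + d * H**4 * log_card_F) / eps**2

# Example: halving eps quadruples the required number of trajectories.
print(trajectory_budget(d=10, H=20, num_actions=5, log_card_F=50.0, eps=0.1))
print(trajectory_budget(d=10, H=20, num_actions=5, log_card_F=50.0, eps=0.05))
```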

Hybrid RL Setting

The paper further extends the results to a hybrid RL setup, combining offline and online data for improved efficiency. Initializing the critic with offline data yields better sample efficiency than purely offline or purely online approaches. Moreover, with access to offline samples, a non-optimistic provably efficient actor-critic algorithm is proposed, requiring only that the offline dataset be large enough ($N_{\text{off}} \geq c_{\text{off}}^* dH^4/\epsilon^2$, where $c_{\text{off}}^*$ is the single-policy concentrability coefficient) in exchange for omitting optimism.
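
A hedged sketch of how the hybrid setting might be wired up, composing with the earlier sketch: seed the critic's dataset with offline transitions before online collection, and, for the non-optimistic variant, check whether the offline sample count clears the stated threshold. The helper names and the assumption that an estimate of $c_{\text{off}}^*$ is available are illustrative, not part of the paper.

```python
# Illustrative helpers only; names and the availability of an estimate of the
# single-policy concentrability coefficient c_off^* are assumptions.

def enough_offline_data(n_off, c_off_star, d, H, eps):
    """True (up to constants) if N_off >= c_off^* d H^4 / eps^2, i.e., the
    condition under which the non-optimistic variant is stated to apply."""
    return n_off >= c_off_star * d * H**4 / eps**2

def init_hybrid_dataset(offline_transitions):
    """Warm-start the critic's dataset with offline (h, s, a, r, s') transitions
    before any online trajectories are collected."""
    return list(offline_transitions)
```

In the earlier sketch, `init_hybrid_dataset(...)` would replace the empty `data = []` initialization, and when `enough_offline_data(...)` holds, the optimism bonus in the critic fit could be set to zero.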

Implications and Future Directions

The numerical results and theoretical guarantees have notable implications for the RL community and for practical applications. The newly proposed actor-critic algorithm offers competitive learning guarantees that make it more applicable to domains where sample efficiency is paramount. The concept of leveraging hybrid datasets offers a promising direction for future research, potentially reducing data requirements in environments characterized by incomplete or limited data coverage.

This paper not only resolves an open problem concerning sample efficiency but also sets the stage for further work on optimizing RL algorithms in complex environments with general function approximation. Future work may focus on refining these techniques, extending them to other settings such as model-based methods, and exploring their empirical performance in real-world applications.
