Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints (2101.02195v2)

Published 6 Jan 2021 in cs.LG, math.OC, and stat.ML

Abstract: We study reinforcement learning (RL) with linear function approximation under the adaptivity constraint. We consider two popular limited adaptivity models: the batch learning model and the rare policy switch model, and propose two efficient online RL algorithms for episodic linear Markov decision processes, where the transition probability and the reward function can be represented as a linear function of some known feature mapping. In specific, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $\tilde O(\sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions and $B$ is the number of batches. Our result suggests that it suffices to use only $\sqrt{T/dH}$ batches to obtain $\tilde O(\sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dH\log T$ policy switches suffice to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity. We also establish a lower bound for the batch learning model, which suggests that the dependency on $B$ in our regret bound is tight.

Authors (3)
  1. Tianhao Wang (98 papers)
  2. Dongruo Zhou (51 papers)
  3. Quanquan Gu (198 papers)
Citations (163)

Summary

Provably Efficient Reinforcement Learning with Linear Function Approximation under Adaptivity Constraints

The paper addresses the challenges of reinforcement learning (RL) in environments with very large or infinite state and action spaces, where traditional tabular RL algorithms become intractable. The approach taken here is to use linear function approximation to exploit the underlying structure of the Markov decision process (MDP).
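
For concreteness, the linear MDP assumption underlying these results (in the sense of Jin et al., 2019) states that, for a known $d$-dimensional feature mapping, both the transition kernel and the reward are linear in the features. A standard statement is sketched below.

```latex
% Linear MDP (standard form): for a known feature map \phi : S \times A \to \mathbb{R}^d,
% there exist unknown measures \mu_h(\cdot) and vectors \theta_h such that, at every step h,
\mathbb{P}_h(s' \mid s, a) = \langle \phi(s, a), \mu_h(s') \rangle,
\qquad
r_h(s, a) = \langle \phi(s, a), \theta_h \rangle .
```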

The authors investigate two limited-adaptivity models: the batch learning model and the rare policy switch model, and introduce two efficient online RL algorithms for episodic linear MDPs, in which the transition probabilities and reward functions admit linear representations in a known feature mapping. For the batch learning model, the proposed LSVI-UCB-Batch algorithm achieves a regret bound of $\tilde O(\sqrt{d^3H^3T} + dHT/B)$, which implies that only $\sqrt{T/dH}$ batches suffice to attain $\tilde O(\sqrt{d^3H^3T})$ regret. The authors also prove a lower bound for the batch learning model, showing that the dependency on $B$ in this bound is tight.
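
The key structural idea of LSVI-UCB-Batch is that the least-squares value iteration (and hence the greedy policy) is recomputed only at the start of each of the $B$ batches, while data continue to be collected in between. The sketch below illustrates only this update schedule; the batching grid is a simplification and the LSVI-UCB regression itself follows the paper and is not reproduced here.

```python
def batch_update_episodes(num_episodes: int, num_batches: int) -> list[int]:
    """Episodes at which a batch-constrained learner would recompute its
    policy: the first episode of each of `num_batches` (roughly) equal-length
    batches. A sketch of the schedule only; the paper's exact grid may differ."""
    batch_len = max(1, num_episodes // num_batches)
    return [k for k in range(1, num_episodes + 1) if (k - 1) % batch_len == 0]

# Example: 12 episodes split into 3 batches -> the policy is refreshed only
# at episodes 1, 5, and 9; every other episode reuses the current policy.
print(batch_update_episodes(12, 3))  # [1, 5, 9]
```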

For the rare policy switch model, the LSVI-UCB-RareSwitch algorithm attains a regret bound of $\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$, which implies that $dH\log T$ policy switches suffice to secure $\tilde O(\sqrt{d^3H^3T})$ regret, a significant reduction in adaptivity compared to updating the policy every episode.
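
Rather than switching on a fixed schedule, LSVI-UCB-RareSwitch recomputes the policy in a data-driven way: intuitively, a switch happens only once enough new information has been collected, which is commonly measured through the growth of the per-step Gram (covariance) matrices of the features. The following is a minimal sketch of such a determinant-based trigger under that assumption; the exact threshold used in the paper may differ, and `eta` here is a hypothetical parameter.

```python
import numpy as np

def should_switch(gram_now, gram_at_last_switch, eta):
    """Determinant-based rare-switch trigger (sketch): recompute the policy
    only when, at some step h, the Gram matrix's determinant has grown by
    more than a factor `eta` since the last switch. `eta` is a hypothetical
    threshold, not the paper's exact constant."""
    for lam_new, lam_old in zip(gram_now, gram_at_last_switch):
        # slogdet returns (sign, log|det|) and avoids overflow for large matrices
        _, logdet_new = np.linalg.slogdet(lam_new)
        _, logdet_old = np.linalg.slogdet(lam_old)
        if logdet_new > logdet_old + np.log(eta):
            return True
    return False
```

A larger threshold makes switches rarer but lets the value estimates run on staler statistics, which is the trade-off reflected in the $[1+T/(dH)]^{dH/B}$ factor of the regret bound.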

Numerical results underscore the efficiency of both algorithms compared to their fully adaptive counterpart, LSVI-UCB. The findings also suggest substantial gains in deployability for large-scale RL applications where reducing adaptivity is paramount, without compromising performance.

Beyond the specific bounds, the paper develops theoretical insight into RL algorithms operating under adaptivity constraints. It contributes to a foundational understanding of how to design efficient RL algorithms when adaptivity is limited by practical considerations such as computational capacity and the cost of switching policies.

For future work, the paper calls for a deeper inquiry into lower bounds for the rare policy switch model under varying adaptivity budgets. Such results could further sharpen the design of RL algorithms for constrained-adaptivity settings and advance the theory of RL with linear function approximation.

In summary, this paper enriches the domain of RL by reinforcing the applicability of linear function approximation under adaptivity constraints. Its contributions pave the way for developing RL algorithms capable of efficient, scalable deployment in complex environments with restricted adaptivity.