Variational Bayesian Reinforcement Learning with Regret Bounds (1807.09647v4)

Published 25 Jul 2018 in cs.LG, cs.AI, and stat.ML

Abstract: In reinforcement learning the Q-values summarize the expected future rewards that the agent will attain. However, they cannot capture the epistemic uncertainty about those rewards. In this work we derive a new BeLLMan operator with associated fixed point we call the `knowledge values'. These K-values compress both the expected future rewards and the epistemic uncertainty into a single value, so that high uncertainty, high reward, or both, can yield high K-values. The key principle is to endow the agent with a risk-seeking utility function that is carefully tuned to balance exploration and exploitation. When the agent follows a Boltzmann policy over the K-values it yields a Bayes regret bound of $\tilde O(L \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the total number of states, $A$ is the number of actions, and $T$ is the number of elapsed timesteps. We show deep connections of this approach to the soft-max and maximum-entropy strands of research in reinforcement learning.

Citations (33)

View on Semantic Scholar

Summary

We haven't generated a summary for this paper yet.

Summarize Now

YouTube

Show All Videos

Variational Bayesian Reinforcement Learning with Regret Bounds (1807.09647v4)

Summary

Related Papers

YouTube