
The Value of Reward Lookahead in Reinforcement Learning (2403.11637v2)

Published 18 Mar 2024 in cs.LG and stat.ML

Abstract: In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance -- prices are observed before performing transactions; nearby traffic information is partially known; and goals are oftentimes given to agents prior to the interaction. In this work, we aim to quantifiably analyze the value of such future reward information through the lens of competitive analysis. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum between observing the immediate rewards before acting to observing all the rewards before the interaction starts.
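As a rough formalization of the quantity the abstract describes, in assumed notation (the paper's own symbols may differ): the competitive ratio compares the best expected return achievable without lookahead to the best expected return achievable with (partial) future-reward lookahead, in the worst case over reward distributions:

    \rho \;=\; \inf_{\mathcal{R}} \; \frac{\max_{\pi \in \Pi}\; \mathbb{E}\!\left[\sum_{h=1}^{H} r_h(s_h, a_h)\right]}{\max_{\pi' \in \Pi^{\mathrm{la}}}\; \mathbb{E}\!\left[\sum_{h=1}^{H} r_h(s_h, a_h)\right]},

where \Pi denotes standard (reward-oblivious) policies, \Pi^{\mathrm{la}} denotes policies with access to future reward realizations, H is the horizon, and the infimum ranges over reward distributions \mathcal{R}. A ratio near 1 means lookahead adds little value; a small ratio means future reward information is highly valuable.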

