
Anytime-Competitive Reinforcement Learning with Policy Prior

Published 2 Nov 2023 in cs.LG (arXiv:2311.01568v3)

Abstract: This paper studies the problem of Anytime-Competitive Markov Decision Processes (A-CMDPs). Existing works on Constrained Markov Decision Processes (CMDPs) aim to optimize the expected reward while constraining the expected cost over random dynamics, but the cost in a specific episode can still be unsatisfactorily high. In contrast, the goal of an A-CMDP is to optimize the expected reward while guaranteeing a bounded cost, relative to a policy prior, in every round of every episode. We propose a new algorithm, called Anytime-Competitive Reinforcement Learning (ACRL), which provably satisfies the anytime cost constraints. The regret analysis shows that the learned policy asymptotically matches the optimal reward achievable under the anytime competitive constraints. Experiments on carbon-intelligent computing verify both the reward performance and the cost constraint guarantee of ACRL.
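As a rough sketch of what "a bounded cost, relative to a policy prior, in every round of every episode" could mean formally: the learner's cumulative cost is compared round-by-round against the cost the prior policy would have incurred. The exact constraint shape below, including the multiplicative slack λ and additive slack b, is an illustrative assumption, not taken from the abstract:

```latex
% Illustrative anytime competitive constraint (assumed form).
% At every round h of every episode, the learner's cumulative cost must
% stay within a factor (1+\lambda) plus additive slack b per round of the
% cumulative cost of the policy prior \pi^{\dagger} (actions a^{\dagger}).
\sum_{h'=1}^{h} c_{h'}\!\left(x_{h'}, a_{h'}\right)
  \;\le\;
  (1+\lambda) \sum_{h'=1}^{h} c_{h'}\!\left(x_{h'}, a^{\dagger}_{h'}\right)
  + h\, b,
  \qquad \forall\, h \in \{1, \dots, H\}.
```

Under this reading, the constraint must hold deterministically at every prefix length h, not merely in expectation, which is what distinguishes the anytime guarantee from a standard CMDP constraint.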

