Matrix Low-Rank Trust Region Policy Optimization (2405.17625v1)

Published 27 May 2024 in cs.LG and cs.AI

Abstract: Most methods in reinforcement learning use a Policy Gradient (PG) approach to learn a parametric stochastic policy that maps states to actions. The standard approach is to implement such a mapping via a neural network (NN) whose parameters are optimized using stochastic gradient descent. However, PG methods are prone to large policy updates that can render learning inefficient. Trust region algorithms, like Trust Region Policy Optimization (TRPO), constrain the policy update step, ensuring monotonic improvements. This paper introduces low-rank matrix-based models as an efficient alternative for estimating the parameters of TRPO algorithms. By gathering the stochastic policy's parameters into a matrix and applying matrix-completion techniques, we promote and enforce low rank. Our numerical studies demonstrate that low-rank matrix-based policy models effectively reduce both computational and sample complexities compared to NN models, while maintaining comparable aggregated rewards.
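
The abstract describes two ingredients: factoring the stochastic policy's parameters into a low-rank matrix, and constraining each policy update with a TRPO-style trust region. The snippet below is a minimal, hypothetical sketch of that combination, not the authors' implementation: a tabular softmax policy whose logit matrix is parameterized as a rank-k product L Rᵀ, updated by a plain policy-gradient step on the factors that is accepted only while the mean KL divergence to the previous policy stays under a threshold. All names and sizes (n_states, n_actions, rank, delta) are illustrative assumptions, and the gradient-plus-backtracking step stands in for TRPO's natural-gradient update.

```python
# Illustrative sketch only (assumed setup, not the paper's code): a low-rank
# softmax policy with a KL trust-region check on each update.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, rank = 20, 4, 2          # hypothetical problem sizes
L = 0.1 * rng.standard_normal((n_states, rank))
R = 0.1 * rng.standard_normal((n_actions, rank))

def policy(L, R):
    """Softmax policy pi(a|s) induced by the low-rank logit matrix L @ R.T."""
    logits = L @ R.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def mean_kl(p_old, p_new):
    """Mean KL(pi_old || pi_new) across states."""
    return np.mean(np.sum(p_old * (np.log(p_old + 1e-12)
                                   - np.log(p_new + 1e-12)), axis=1))

# Toy advantages A(s, a); in practice these would come from sampled
# trajectories (e.g., generalized advantage estimation).
adv = rng.standard_normal((n_states, n_actions))

delta = 1e-2      # trust-region size on the mean KL
step = 1.0        # initial step size, halved until the KL constraint holds

p_old = policy(L, R)

# Gradient of the surrogate objective sum_{s,a} pi(a|s) A(s,a) w.r.t. the
# logits, then chain-ruled into the low-rank factors L and R.
g_logits = p_old * (adv - np.sum(p_old * adv, axis=1, keepdims=True))
g_L = g_logits @ R
g_R = g_logits.T @ L

# Backtracking line search: shrink the step until the KL constraint is met.
for _ in range(20):
    L_new, R_new = L + step * g_L, R + step * g_R
    if mean_kl(p_old, policy(L_new, R_new)) <= delta:
        L, R = L_new, R_new
        break
    step *= 0.5

print("mean KL after update:", mean_kl(p_old, policy(L, R)))
```

Because the policy is stored as the two factors L and R rather than a full |S| x |A| table (or a neural network), the number of trainable parameters grows as k(|S| + |A|), which is the source of the computational and sample-complexity savings the abstract claims.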

