Matrix Low-Rank Trust Region Policy Optimization (2405.17625v1)
Abstract: Many reinforcement learning methods use a Policy Gradient (PG) approach to learn a parametric stochastic policy that maps states to actions. The standard approach is to implement such a mapping via a neural network (NN) whose parameters are optimized using stochastic gradient descent. However, PG methods are prone to large policy updates that can render learning inefficient. Trust region algorithms, like Trust Region Policy Optimization (TRPO), constrain the policy update step, ensuring monotonic improvement. This paper introduces low-rank matrix-based models as an efficient alternative for estimating the parameters of TRPO algorithms. By gathering the stochastic policy's parameters into a matrix and applying matrix-completion techniques, we promote and enforce a low-rank structure. Our numerical studies demonstrate that low-rank matrix-based policy models effectively reduce both computational and sample complexities compared to NN models, while achieving comparable aggregated rewards.
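To make the parameterization concrete, the following is a minimal sketch (an assumed formulation, not the paper's exact code) of a low-rank softmax policy for a discrete state-action space: the |S| x |A| logit matrix is factorized as the product of two thin factors `L` and `R`, so only (|S| + |A|) * k parameters are learned. All names, shapes, and the vanilla policy-gradient step at the end are illustrative assumptions; the paper's actual update is a TRPO trust-region step, omitted here.

```python
# Minimal sketch (assumed formulation): a softmax policy whose |S| x |A|
# logit matrix Theta is constrained to rank k by writing Theta = L @ R.T.
import numpy as np

n_states, n_actions, k = 100, 4, 5               # illustrative sizes
rng = np.random.default_rng(0)
L = 0.1 * rng.standard_normal((n_states, k))     # left factor  (|S| x k)
R = 0.1 * rng.standard_normal((n_actions, k))    # right factor (|A| x k)

def policy(s):
    """Action probabilities pi(.|s) from the low-rank logit row L[s] @ R.T."""
    logits = L[s] @ R.T
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def log_pi_grads(s, a):
    """Gradients of log pi(a|s) with respect to the factors L and R."""
    p = policy(s)
    g_logits = -p
    g_logits[a] += 1.0                           # d log pi / d logits = 1{a} - p
    gL = np.zeros_like(L)
    gL[s] = g_logits @ R                         # only row s of L is touched
    gR = np.outer(g_logits, L[s])                # every action row of R
    return gL, gR

# Toy usage: a plain policy-gradient step on one sampled transition.
# The method in the paper would instead take a trust-region (TRPO) step.
s, a, advantage, lr = 3, 2, 1.7, 1e-2
gL, gR = log_pi_grads(s, a)
L += lr * advantage * gL
R += lr * advantage * gR
```

The point of the factorization is that the number of learnable parameters grows as (|S| + |A|) * k rather than |S| * |A|, which is the source of the reduced computational and sample complexity claimed in the abstract.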