Reward-Relevance-Filtered Linear Offline Reinforcement Learning (2401.12934v1)
Abstract: This paper studies offline reinforcement learning with linear function approximation in a setting with decision-theoretic, but not estimation, sparsity. The structural restrictions on the data-generating process presume that the transitions factor into a sparse component that affects the reward and may also affect additional exogenous dynamics that do not affect the reward. Although the minimally sufficient adjustment set for estimating full-state transition properties depends on the whole state, the optimal policy, and therefore the state-action value function, depends only on the sparse component: we call this causal/decision-theoretic sparsity. We develop a method that reward-filters the estimation of the state-action value function to the sparse component via a modification of the thresholded lasso in least-squares policy evaluation. We provide theoretical guarantees for the resulting reward-filtered linear fitted-Q iteration, with sample complexity that depends only on the size of the sparse component.
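To make the idea concrete, below is a minimal sketch of the reward-filtered procedure the abstract describes: a thresholded lasso regression of the observed rewards on the features selects the sparse, decision-relevant coordinates, and linear fitted-Q iteration is then run by least squares restricted to that support. This is an illustration under stated assumptions, not the paper's implementation; the function name `reward_filtered_fqi` and the parameters `lam` (lasso penalty) and `tau` (threshold) are hypothetical, and the paper's actual selection step may differ in detail.

```python
import numpy as np
from sklearn.linear_model import Lasso

def reward_filtered_fqi(Phi, R, Phi_next_actions, gamma=0.99,
                        lam=0.1, tau=0.05, n_iters=50):
    """Sketch of reward-filtered linear fitted-Q iteration.

    Phi:              (n, d) features phi(s_i, a_i) for logged transitions
    R:                (n,) observed rewards
    Phi_next_actions: (n, |A|, d) features phi(s'_i, a) for every action a
    """
    # Step 1: thresholded lasso on the reward regression selects the
    # sparse, decision-relevant component of the state.
    lasso = Lasso(alpha=lam).fit(Phi, R)
    support = np.flatnonzero(np.abs(lasso.coef_) > tau)

    # Step 2: fitted-Q iteration by ordinary least squares, restricted
    # to the selected coordinates only.
    w = np.zeros(len(support))
    Phi_S = Phi[:, support]
    Phi_next_S = Phi_next_actions[:, :, support]
    for _ in range(n_iters):
        q_next = Phi_next_S @ w                      # (n, |A|) Q-values
        targets = R + gamma * q_next.max(axis=1)     # Bellman backup
        w, *_ = np.linalg.lstsq(Phi_S, targets, rcond=None)
    return w, support
```

Because each least-squares fit only involves the selected coordinates, the statistical cost of the value-function estimate scales with the size of the support rather than the ambient dimension d, which is the sample-complexity behavior the paper's guarantees formalize.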