Representation-Driven Reinforcement Learning
Abstract: We present a representation-driven framework for reinforcement learning. By representing policies as estimates of their expected values, we leverage techniques from contextual bandits to guide exploration and exploitation. In particular, embedding a policy network into a linear feature space allows us to reframe the exploration-exploitation problem as a representation-exploitation problem, in which good policy representations enable optimal exploration. We demonstrate the effectiveness of this framework by applying it to evolutionary and policy gradient-based approaches, where it significantly improves performance over traditional methods. Our framework provides a new perspective on reinforcement learning, highlighting the importance of policy representation in determining optimal exploration-exploitation strategies.
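To make the idea of treating policies as arms of a linear bandit concrete, the following is a minimal sketch and not the authors' implementation. It assumes each candidate policy already has an embedding vector `z` and that its expected return is approximately linear in `z`. The optimism rule is a standard LinUCB-style bonus, swapped in here for illustration. The names `LinearPolicyBandit`, `reg`, `bonus`, and `run_demo`, the toy weight vector, and the noise level are all hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in m]


class LinearPolicyBandit:
    """Optimistic (LinUCB-style) selection over policy embeddings.

    Assumption: expected return of a policy is roughly <z, w> for an
    unknown weight vector w, where z is the policy's embedding.
    """

    def __init__(self, dim, reg=1.0, bonus=1.0):
        self.dim = dim
        self.bonus = bonus  # scales the exploration term
        # Inverse of the regularised design matrix A = reg * I + sum z z^T.
        self.a_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of return * z

    def weights(self):
        """Ridge-regression estimate of w."""
        return matvec(self.a_inv, self.b)

    def score(self, z):
        """Estimated value plus an uncertainty bonus for embedding z."""
        width = math.sqrt(max(dot(z, matvec(self.a_inv, z)), 0.0))
        return dot(z, self.weights()) + self.bonus * width

    def select(self, candidates):
        """Index of the candidate embedding with the highest optimistic score."""
        return max(range(len(candidates)), key=lambda i: self.score(candidates[i]))

    def update(self, z, observed_return):
        """Rank-one (Sherman-Morrison) update after evaluating a policy."""
        az = matvec(self.a_inv, z)  # a_inv is symmetric, so z^T a_inv = az^T
        denom = 1.0 + dot(z, az)
        for i in range(self.dim):
            for j in range(self.dim):
                self.a_inv[i][j] -= az[i] * az[j] / denom
        for i in range(self.dim):
            self.b[i] += observed_return * z[i]


def run_demo(rounds=200, seed=0):
    """Toy loop: pick a policy embedding, observe a noisy return, update."""
    rng = random.Random(seed)
    true_w = [1.0, -0.5, 0.25]  # hypothetical ground-truth weights
    candidates = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(20)]
    bandit = LinearPolicyBandit(dim=3)
    for _ in range(rounds):
        i = bandit.select(candidates)
        noisy_return = dot(candidates[i], true_w) + rng.gauss(0.0, 0.1)
        bandit.update(candidates[i], noisy_return)
    return bandit, candidates, true_w
```

In this sketch the quality of exploration depends entirely on the embeddings: if returns are far from linear in `z`, the uncertainty bonus is no longer meaningful, which is one way to read the abstract's claim that good policy representations are what enable good exploration.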