Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning (2408.03029v4)
Abstract: Reward shaping is a technique in reinforcement learning that addresses the sparse-reward problem by providing more frequent and informative rewards. We introduce a self-adaptive and highly efficient reward shaping mechanism that incorporates success rates derived from historical experiences as shaped rewards. The success rates are sampled from Beta distributions, which dynamically evolve from uncertain to reliable values as data accumulates. Initially, the shaped rewards exhibit more randomness to encourage exploration, while over time, the increasing certainty enhances exploitation, naturally balancing exploration and exploitation. Our approach employs Kernel Density Estimation (KDE) combined with Random Fourier Features (RFF) to derive the Beta distributions, providing a computationally efficient, non-parametric, and learning-free solution for high-dimensional continuous state spaces. Our method is validated on various tasks with extremely sparse rewards, demonstrating notable improvements in sample efficiency and convergence stability over relevant baselines.
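To make the abstract's mechanism concrete, here is a minimal sketch of how kernel-weighted success/failure pseudo-counts could parameterize a per-state Beta distribution whose samples serve as shaped rewards. This is an illustrative reading of the abstract, not the paper's code: the class name `RFFBetaRewardShaper`, the `update`/`shaped_reward` methods, the +1 uniform prior, and the clipping of RFF inner products are all assumptions introduced for this sketch.

```python
import numpy as np

class RFFBetaRewardShaper:
    """Sketch: success-rate reward shaping via KDE with Random Fourier
    Features (RFF). Kernel-weighted pseudo-counts of nearby successes and
    failures parameterize a Beta distribution per state; a sampled success
    rate is used as the shaped reward. (Illustrative, not the paper's API.)"""

    def __init__(self, state_dim, n_features=256, bandwidth=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # RFF for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 h^2)):
        # z(x) = sqrt(2 / D) * cos(W x + b), with rows of W ~ N(0, I / h^2).
        self.W = rng.normal(0.0, 1.0 / bandwidth, size=(n_features, state_dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.scale = np.sqrt(2.0 / n_features)
        # Running feature sums: a kernel sum over all stored states reduces
        # to a single inner product with these vectors (learning-free, O(D)).
        self.success_sum = np.zeros(n_features)
        self.failure_sum = np.zeros(n_features)
        self.rng = rng

    def _features(self, state):
        return self.scale * np.cos(self.W @ state + self.b)

    def update(self, state, success):
        """Record one historical outcome (e.g. from a finished trajectory)."""
        z = self._features(state)
        if success:
            self.success_sum += z
        else:
            self.failure_sum += z

    def shaped_reward(self, state):
        """Sample a success rate from Beta(alpha, beta) for this state."""
        z = self._features(state)
        # Kernel-weighted pseudo-counts; clipped at zero because the RFF
        # approximation can yield slightly negative inner products.
        alpha = max(z @ self.success_sum, 0.0) + 1.0  # +1.0: uniform prior
        beta = max(z @ self.failure_sum, 0.0) + 1.0
        return self.rng.beta(alpha, beta)
```

Under this reading, the exploration-exploitation balance falls out of the sampling itself, much as in Thompson sampling: with few recorded outcomes near a state, Beta(alpha, beta) is close to uniform and the shaped reward is noisy, encouraging exploration; as pseudo-counts accumulate, the distribution concentrates around the empirical success rate and the shaped reward becomes reliable.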