
Automated Feature Selection for Inverse Reinforcement Learning

Published 22 Mar 2024 in cs.LG and cs.RO | (2403.15079v1)

Abstract: Inverse reinforcement learning (IRL) is an imitation learning approach to learning reward functions from expert demonstrations. Its use avoids the difficult and tedious procedure of manual reward specification while retaining the generalization power of reinforcement learning. In IRL, the reward is usually represented as a linear combination of features. In continuous state spaces, the state variables alone are not sufficiently rich to be used as features, but which features are good is not known in general. To address this issue, we propose a method that employs polynomial basis functions to form a candidate set of features, which are shown to allow the matching of statistical moments of state distributions. Feature selection is then performed for the candidates by leveraging the correlation between trajectory probabilities and feature expectations. We demonstrate the approach's effectiveness by recovering reward functions that capture expert policies across non-linear control tasks of increasing complexity. Code, data, and videos are available at https://sites.google.com/view/feature4irl.


Summary

  • The paper presents a method that uses polynomial basis functions to construct candidate features by matching key statistical moments of state distributions.
  • It employs a correlation-based feature selection process that effectively reduces noise and complexity by ranking candidate features.
  • Experiments in Gymnasium environments demonstrate that the approach replicates expert behavior using fewer features compared to baseline methods.


Introduction

Inverse Reinforcement Learning (IRL) provides a framework for learning reward functions from expert demonstrations, eliminating the need for explicit reward specification while retaining the generalization advantages of reinforcement learning (RL). The reward in IRL is traditionally represented as a linear combination of features, which raises the question of how to identify suitable features, especially in continuous state spaces where the raw state variables alone are inadequate. The paper presents a method that uses polynomial basis functions to construct a set of candidate features whose expectations match the statistical moments of the state distributions, and then selects among them by leveraging the correlation between trajectory probabilities and feature expectations.

Figure 1: A central open challenge in inverse reinforcement learning is the choice of suitable features to represent the reward. We propose a method that constructs a candidate feature set and then selects a subset that best describes expected rewards.
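The linear reward structure the paper builds on can be sketched in a few lines. This is a generic illustration, not the paper's code; the feature functions and weights below are hypothetical placeholders for a 2D state.

```python
import numpy as np

def linear_reward(state, weights, features):
    """Reward as a linear combination of feature functions: r(s) = w . phi(s)."""
    phi = np.array([f(state) for f in features])
    return float(weights @ phi)

# Hypothetical feature set for a 2D state (e.g., angle and angular velocity).
features = [lambda s: s[0], lambda s: s[1], lambda s: s[0] ** 2]
weights = np.array([-1.0, -0.1, -0.5])
r = linear_reward(np.array([0.1, 0.0]), weights, features)
```

The whole feature-selection problem is then deciding which functions belong in `features`.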

Methodology

Polynomial Feature Selection

The method begins by generating candidate features as quadratic polynomial functions of the state. This choice allows matching the statistical moments, specifically the mean and variance, of the state distributions under the expert demonstrations and the learned policy, aligning their Gaussian approximations while keeping the feature dimensionality manageable.
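The candidate construction can be sketched as enumerating all monomials of the state up to degree two (a minimal illustration, assuming no bias term and the ordering below; the paper's exact construction may differ):

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_candidates(state_dim):
    """All monomials of the state up to degree 2: s_i and s_i * s_j (i <= j)."""
    terms = []
    for i in range(state_dim):
        terms.append(lambda s, i=i: s[i])               # first moments (means)
    for i, j in combinations_with_replacement(range(state_dim), 2):
        terms.append(lambda s, i=i, j=j: s[i] * s[j])   # second moments
    return terms

# For a 3D state this yields 3 linear + 6 quadratic = 9 candidates,
# consistent with the dim(Phi) = 9 reported for Pendulum-v1 in Figure 2.
phi = quadratic_candidates(3)
s = np.array([1.0, 2.0, 3.0])
values = [f(s) for f in phi]
```

Matching the expectations of the linear terms aligns the means, and matching the quadratic terms aligns the second moments of the two state distributions.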

Feature Selection Mechanism

Feature selection then uses correlation-based techniques to identify the most relevant candidates, leveraging the relationship between trajectory probabilities and feature expectations. This considerably reduces reward complexity and guards against noise and spurious correlations. The algorithm ranks features using statistical tests and selects the most promising ones, optimizing for compactness and interpretability.
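A rough sketch of this selection step, assuming Pearson correlation as the ranking statistic (the paper's exact test may differ), with all array names hypothetical:

```python
import numpy as np

def select_features(feature_expectations, traj_log_probs, k):
    """Rank candidates by |Pearson correlation| between each feature's
    per-trajectory expectation and the trajectory log-probabilities; keep top-k.

    feature_expectations: (n_trajectories, n_features) array
    traj_log_probs: (n_trajectories,) array
    """
    scores = np.array([
        abs(np.corrcoef(feature_expectations[:, j], traj_log_probs)[0, 1])
        for j in range(feature_expectations.shape[1])
    ])
    return np.argsort(scores)[::-1][:k]  # indices of the k most correlated features

# Hypothetical example: feature 0 tracks the trajectory log-probability exactly,
# feature 1 alternates as noise, so feature 0 is ranked first.
fe = np.array([[0.0, 1.0], [1.0, -1.0], [2.0, 1.0], [3.0, -1.0]])
log_probs = np.array([0.0, 1.0, 2.0, 3.0])
top = select_features(fe, log_probs, k=1)
```

Taking the absolute correlation makes the ranking indifferent to a feature's sign, which the subsequent weight learning can absorb.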

Reward and Policy Retrieval

Maximum entropy IRL is adapted to learn the weights of the selected features: the weights are optimized by gradient descent so that the feature expectations under the learned policy match those of the expert trajectories. At each iteration, an RL algorithm (PPO or SAC) is used to extract a policy that maximizes the expected cumulative reward under the current reward estimate.
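The core weight update in max-ent IRL follows the gradient of the demonstration log-likelihood, which is the gap between expert and policy feature expectations. The sketch below fixes the policy's feature expectations for illustration; in the real loop they would be re-estimated each iteration by re-solving the RL problem, and all numbers here are hypothetical.

```python
import numpy as np

def maxent_weight_step(w, expert_fe, policy_fe, lr=0.05):
    """One gradient-ascent step: move the weights so the learned policy's
    feature expectations approach the expert's."""
    grad = expert_fe - policy_fe   # gradient of the max-ent log-likelihood
    return w + lr * grad

# Hypothetical feature expectations for a 3-feature reward.
w = np.zeros(3)
expert_fe = np.array([0.8, -0.2, 0.5])
policy_fe = np.array([0.1, 0.1, 0.1])
for _ in range(100):
    w = maxent_weight_step(w, expert_fe, policy_fe)
    # In the full algorithm, policy_fe would be re-estimated here by
    # training a policy (e.g. with PPO or SAC) under the reward w . phi.
```

When the two expectation vectors coincide, the gradient vanishes and the weights converge.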

Experiments

The experimental setup encompassed three Gymnasium environments (Pendulum-v1, CartPole-v1, and Acrobot-v1), allowing rigorous evaluation of the proposed method by comparing the retrieved policies against known benchmarks.

Figure 2: Pendulum-v1, dim(Φ) = 9.

Key findings showed that the proposed method matched benchmark performance while employing fewer features than baselines such as hand-picked features, random selection, and the full candidate set. The method effectively replicated expert behavior across environments of varying complexity, confirming the efficacy of polynomial basis features on non-linear control tasks.

Figure 3: Mean cumulative rewards for policies trained using various feature sets, calculated across 10 different initial conditions. A) Pendulum, B) Acrobot, C) CartPole.

Discussion

The research improves the efficiency and accuracy of automated feature selection for IRL and highlights the prospect of incorporating other basis functions, such as radial basis functions (RBFs) and Fourier series. This generality positions the method for broader applicability, particularly in preference-learning scenarios where tasks share the same environment but differ in expert behavior.
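Swapping the candidate set for one of these alternative bases would only change the construction step. A minimal sketch of Gaussian RBF candidates, with all centers and widths hypothetical:

```python
import numpy as np

def rbf_candidates(centers, width):
    """Gaussian radial basis functions as an alternative candidate feature set."""
    return [lambda s, c=c: np.exp(-np.sum((s - c) ** 2) / (2 * width ** 2))
            for c in centers]

centers = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)   # 5 centers on a 1D state
phi = rbf_candidates(centers, width=0.5)
values = [f(np.array([0.0])) for f in phi]           # peaks at the nearest center
```

The rest of the pipeline (correlation-based selection, max-ent weight learning) is agnostic to which basis produced the candidates.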

Figure 4: 2D Wasserstein distance between training and testing data for the Pendulum and Acrobot environments.

Conclusion

This paper introduces an automated method that employs polynomial basis functions to construct reward functions for inverse reinforcement learning, demonstrating effectiveness across diverse environments. By streamlining the selection of pertinent features, the approach improves both efficiency and interpretability, and sets the stage for refinement with alternative basis functions to broaden applicability. Future work is expected to build on these foundations, exploring more sophisticated strategies for automated feature selection in imitation learning.
