Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents (2401.05821v4)
Abstract: Goal misalignment, reward sparsity and difficult credit assignment are only a few of the many issues that make it difficult for deep reinforcement learning (RL) agents to learn optimal policies. Unfortunately, the black-box nature of deep neural networks impedes the inclusion of domain experts for inspecting the model and revising suboptimal policies. To this end, we introduce Successive Concept Bottleneck Agents (SCoBots), which integrate consecutive concept bottleneck (CB) layers. In contrast to current CB models, SCoBots do not just represent concepts as properties of individual objects, but also as relations between objects, which is crucial for many RL tasks. Our experimental results provide evidence of SCoBots' competitive performance, but also of their potential for domain experts to understand and regularize their behavior. Among other things, SCoBots enabled us to identify a previously unknown misalignment problem in the iconic video game, Pong, and resolve it. Overall, SCoBots thus result in more human-aligned RL agents. Our code is available at https://github.com/k4ntz/SCoBots .
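The pipeline the abstract describes — object-level concepts followed by relational concepts, feeding an inspectable action selector — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the object fields, concept names, and the toy Pong rule are all hypothetical, and in SCoBots the objects would come from an object extractor and the selector would be learned.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical object representation; in SCoBots, objects and their
# properties would be produced by an object extractor (fields illustrative).
@dataclass
class GameObject:
    name: str
    x: float
    y: float

def property_concepts(objects: List[GameObject]) -> Dict[str, float]:
    """First bottleneck: concepts as properties of individual objects."""
    return {f"{o.name}.{attr}": getattr(o, attr)
            for o in objects for attr in ("x", "y")}

def relational_concepts(objects: List[GameObject]) -> Dict[str, float]:
    """Successive bottleneck: concepts as relations between object pairs."""
    rel = {}
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            rel[f"dist_y({a.name},{b.name})"] = a.y - b.y
    return rel

def policy(concepts: Dict[str, float]) -> str:
    """Toy interpretable selector over named concepts (a Pong-like rule):
    move the paddle toward the ball's vertical position."""
    dy = concepts["dist_y(player,ball)"]
    return "UP" if dy > 0 else "DOWN" if dy < 0 else "NOOP"

objects = [GameObject("player", 140.0, 60.0), GameObject("ball", 80.0, 45.0)]
concepts = {**property_concepts(objects), **relational_concepts(objects)}
action = policy(concepts)
```

Because every intermediate value is a named concept rather than an opaque activation, a domain expert can inspect which concepts drive an action and prune or revise them — e.g., removing an enemy-position concept the agent should not rely on.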
- Hindsight experience replay. In Advances in Neural Information Processing Systems, 2017.
- Rationalization through concepts. ArXiv, 2021.
- Value alignment or misalignment – what will keep systems accountable? In AAAI Workshop on AI, Ethics, and Society, 2017.
- The option-critic architecture. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2017.
- Debiasing concept bottleneck models with instrumental variables. ArXiv, 2020.
- The arcade learning environment: An evaluation platform for general agents (extended abstract). In International Joint Conference on Artificial Intelligence, 2012.
- Concept-level debugging of part-prototype networks. In International Conference on Learning Representations (ICLR). OpenReview.net, 2023.
- A gradient-based split criterion for highly accurate and transparent model trees. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, 2019.
- A comparative study of faithfulness metrics for model interpretability methods. In Conference of the Association for Computational Linguistics (ACL), pp. 5029–5038. Association for Computational Linguistics, 2022.
- Interactive concept bottleneck models. ArXiv, 2022.
- Quantifying generalization in reinforcement learning. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 2019.
- Playing atari with six neurons (extended abstract). In Bessiere, C. (ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, 2020.
- Towards symbolic reinforcement learning with common sense, 2018.
- Levels of explainable artificial intelligence for human-aligned conversational explanations. Artif. Intell., 2021.
- Explainable reinforcement learning for broad-xai: a conceptual framework and survey. Neural Computing and Applications, 2022.
- Adaptive rational activations to boost deep reinforcement learning. ArXiv, 2021.
- Ocatari: Object-centric atari 2600 reinforcement learning environments. ArXiv, 2023a.
- Interpretable and explainable logical policies via neurally guided symbolic abstraction. ArXiv, 2023b.
- Boosting object representation learning via motion and object continuity. In Koutra, D., Plant, C., Rodriguez, M. G., Baralis, E., and Bonchi, F. (eds.), European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), volume 14172 of Lecture Notes in Computer Science, pp. 610–628. Springer, 2023c.
- ERASER: A benchmark to evaluate rationalized NLP models. In Conference of the Association for Computational Linguistics (ACL), pp. 4443–4458. Association for Computational Linguistics, 2020.
- Goal misgeneralization in deep reinforcement learning. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022.
- Shortcut learning in deep neural networks. Nature Machine Intelligence, 2020.
- Concept-based understanding of emergent multi-agent behavior. In Deep Reinforcement Learning Workshop NeurIPS 2022, 2022.
- Relative behavioral attributes: Filling the gap between symbolic goal specification and reward learning from human preferences. In International Conference on Learning Representations (ICLR). OpenReview.net, 2023.
- A survey of methods for explaining black box models. ACM Computing Surveys, 51(5):93:1–93:42, 2019.
- Deep reinforcement learning that matters. In AAAI Conference on Artificial Intelligence, 2017.
- A benchmark for interpretability methods in deep neural networks. In Conference on Neural Information Processing Systems (NeurIPS 2019), pp. 9734–9745, 2019.
- Ai safety via debate. ArXiv, 2018.
- Visual explanation using attention mechanism in actor-critic-based deep reinforcement learning. In International Joint Conference on Neural Networks (IJCNN), 2021.
- Unsupervised curricula for visual meta-reinforcement learning. ArXiv, 2019.
- Model-based reinforcement learning for atari. ArXiv, 2019.
- Symbols as a lingua franca for bridging human-ai chasm for explainable and advisable ai systems. In AAAI Conference on Artificial Intelligence, 2021.
- Objective robustness in deep reinforcement learning, 2021.
- Concept bottleneck models. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, 2020.
- Explainability in reinforcement learning: perspective and position. ArXiv, 2022.
- Attribute and simile classifiers for face verification. In IEEE International Conference on Computer Vision (ICCV), 2009.
- Learning interpretable concept-based models with human feedback. ArXiv, 2020.
- Learning to detect unseen object classes by between-class attribute transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- Unmasking clever hans predictors and assessing what machines really learn. Nature communications, 10(1):1096, 2019.
- SPACE: unsupervised object-oriented scene representation via spatial attention and decomposition. In International Conference on Learning Representations, 2020.
- Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents (extended abstract). In Lang, J. (ed.), Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden. ijcai.org, 2018.
- Glancenets: Interpretable, leak-proof concept-based models. In Advances in Neural Information Processing (NeurIPS), 2022.
- Neuro-symbolic reasoning shortcuts: Mitigation strategies and their limitations. In International Workshop on Neural-Symbolic Learning and Reasoning, volume 3432 of CEUR Workshop Proceedings, pp. 162–166, 2023.
- Counterfactual credit assignment in model-free reinforcement learning. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
- Explainable reinforcement learning: A survey and comparative review. ACM Computing Surveys, 2023.
- Playing atari with deep reinforcement learning. ArXiv, 2013.
- Human-level control through deep reinforcement learning. Nature, 2015.
- Training value-aligned reinforcement learning agents using a normative prior. ArXiv, 2021.
- Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, 1999.
- The alignment problem from a deep learning perspective. ArXiv, 2022.
- Neat for large-scale reinforcement learning through evolutionary feature learning and policy gradient search. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 2018.
- A survey on explainable reinforcement learning: Concepts, algorithms, challenges. ArXiv, 2022.
- Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268), 2021.
- Synthetic returns for long-term credit assignment. ArXiv, 2021.
- Explainable deep learning: A field guide for the uninitiated. Journal of Artificial Intelligence Research, 73:329–396, 2022.
- You only look once: Unified, real-time object detection. In Conference on Computer Vision and Pattern Recognition, CVPR 2016, 2016.
- Can wikipedia help offline reinforcement learning? ArXiv, 2022.
- Explainability via causal self-talk. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities. Knowledge-Based Systems, 263:110273, 2023.
- Concept bottleneck model with additional unsupervised concepts. IEEE Access, 10:41758–41765, 2022.
- Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nature Machine Intelligence, 2(8):476–486, 2020.
- Proximal policy optimization algorithms. ArXiv, 2017.
- Curl: Contrastive unsupervised representations for reinforcement learning. ArXiv, 2020.
- Right for the right concept: Revising neuro-symbolic concepts by interacting with their explanations. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2021.
- Interactive disentanglement: Learning concepts by interacting with their prototype representations. In Conference on Computer Vision and Pattern Recognition, (CVPR), pp. 10307–10318, 2022.
- Learning to intervene on concept bottlenecks. ArXiv, 2023.
- Leveraging explanations in interactive machine learning: An overview. Frontiers in Artificial Intelligence, 2023.
- Neural reinforcement learning for behaviour synthesis. Robotics Auton. Syst., 1997.
- Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, 2016.
- Explainable deep reinforcement learning: State of the art and challenges. ACM Computing Surveys, 2022.
- Visual rationalizations in deep reinforcement learning for atari games. In BNCAI, 2018.
- Read and reap the rewards: Learning to play atari with the help of instruction manuals. ArXiv, 2023.
- Evolutionary reinforcement learning via cooperative coevolutionary negatively correlated search. Swarm and Evolutionary Computation, 2022.
- Concept learning for interpretable multi-agent reinforcement learning. ArXiv, 2023.
- Efficient decompositional rule extraction for deep neural networks. ArXiv, 2021.
- Vision-based robot navigation through combining unsupervised learning and hierarchical reinforcement learning. Sensors (Basel, Switzerland), 2019.
- Neural networks are decision trees. ArXiv, 2022.