
Programmatically Interpretable Reinforcement Learning (1804.02477v3)

Published 6 Apr 2018 in cs.LG, cs.AI, cs.PL, and stat.ML

Abstract: We present a reinforcement learning framework, called Programmatically Interpretable Reinforcement Learning (PIRL), that is designed to generate interpretable and verifiable agent policies. Unlike the popular Deep Reinforcement Learning (DRL) paradigm, which represents policies by neural networks, PIRL represents policies using a high-level, domain-specific programming language. Such programmatic policies have the benefits of being more easily interpreted than neural networks, and being amenable to verification by symbolic methods. We propose a new method, called Neurally Directed Program Search (NDPS), for solving the challenging nonsmooth optimization problem of finding a programmatic policy with maximal reward. NDPS works by first learning a neural policy network using DRL, and then performing a local search over programmatic policies that seeks to minimize a distance from this neural "oracle". We evaluate NDPS on the task of learning to drive a simulated car in the TORCS car-racing environment. We demonstrate that NDPS is able to discover human-readable policies that pass some significant performance bars. We also show that PIRL policies can have smoother trajectories, and can be more easily transferred to environments not encountered during training, than corresponding policies discovered by DRL.

Authors (5)
  1. Abhinav Verma (12 papers)
  2. Vijayaraghavan Murali (14 papers)
  3. Rishabh Singh (58 papers)
  4. Pushmeet Kohli (116 papers)
  5. Swarat Chaudhuri (61 papers)
Citations (326)

Summary

Overview of Programmatically Interpretable Reinforcement Learning

The paper "Programmatically Interpretable Reinforcement Learning" introduces a framework, Programmatically Interpretable Reinforcement Learning (PIRL), designed to make reinforcement learning (RL) policies interpretable by representing them in a high-level, domain-specific programming language. This contrasts with the often opaque policy representations of Deep Reinforcement Learning (DRL), which typically relies on neural networks. Because PIRL policies are programs in a domain-specific language, they are amenable to verification by symbolic methods, a significant step toward ensuring the reliability and safety of RL systems.

A novel algorithm, Neurally Directed Program Search (NDPS), is proposed to tackle the challenging problem of discovering programmatic policies that maximize reward in a nonsmooth policy space. The method first learns a neural policy network using DRL, then performs a local search over programmatic policies that seeks to match the behavior of this neural network as closely as possible. NDPS thus copes with the vast, combinatorial policy search space by using the expressive neural policy to guide the search for interpretable policies.
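To make the search objective concrete, the following is a minimal sketch of the behavioral-distance criterion that NDPS minimizes: a candidate programmatic policy is scored by how closely its actions match the neural oracle's actions on a set of sampled input states. The names (`behavioral_distance`, `program`, `oracle`) are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def behavioral_distance(program, oracle, states):
    """Score a candidate programmatic policy by how far its actions
    deviate from the neural oracle's actions on sampled states.
    Lower is better; NDPS searches for a program minimizing this."""
    dists = [np.linalg.norm(np.asarray(program(s)) - np.asarray(oracle(s)))
             for s in states]
    return float(np.mean(dists))
```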

Methodology

The core innovation in this work is the use of a domain-specific programming language, defined within the PIRL framework, to constrain and express policies. The language enables the specification of a "policy sketch," which fixes the structure and constraints that candidate policies must satisfy. Such sketches encode inductive biases, streamline the search by pruning undesired policies, and make the learned policies amenable to symbolic verification. This approach promises not only interpretability but also potential improvements in robustness and adaptability.
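As an illustration of what a policy sketch might look like, the snippet below fixes a PID-style control structure over a single track-position reading and leaves only numeric parameters for the search to fill in. This is a hypothetical sketch for intuition; the paper's actual DSL and the `SteeringSketch` name are not taken from its implementation.

```python
from dataclasses import dataclass

@dataclass
class SteeringSketch:
    """Hypothetical policy sketch: the program structure (a PID-style
    expression over recent track-position readings) is fixed, and the
    search only needs to fill in the parameters kp, ki, kd."""
    kp: float
    ki: float
    kd: float

    def act(self, history):
        # history: recent track-position readings, most recent last
        err = history[-1]
        integral = sum(history)
        deriv = history[-1] - history[-2] if len(history) > 1 else 0.0
        return self.kp * err + self.ki * integral + self.kd * deriv
```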

NDPS operates by first training a DRL agent on the task and using its learned policy as a behavioral "oracle." It then searches for a programmatic policy that closely mimics this oracle. By iteratively augmenting the set of inputs on which the oracle is imitated with new histories sampled from the current best program, NDPS refines its policy to better approximate the oracle while adhering to the constraints of the policy language.
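The overall loop can be summarized in the sketch below, reusing the `behavioral_distance` function from above. The helpers `train_drl_oracle`, `rollout_states`, and `local_program_search` are hypothetical stand-ins for the paper's components, shown only to convey the structure of the algorithm under those assumptions.

```python
def ndps(env, sketch_space, n_iters=10):
    """Minimal sketch of Neurally Directed Program Search (NDPS),
    assuming hypothetical helpers for DRL training, program search,
    and environment rollouts."""
    oracle = train_drl_oracle(env)            # step 1: train a neural oracle with DRL
    states = rollout_states(env, oracle)      # inputs on which to imitate the oracle
    best = local_program_search(               # step 2: program closest to the oracle
        sketch_space, lambda p: behavioral_distance(p, oracle, states))
    for _ in range(n_iters):
        # step 3: sample new histories from the current best program so the
        # next search round imitates the oracle on states the program
        # actually visits (a DAgger-style input augmentation)
        states += rollout_states(env, best)
        best = local_program_search(
            sketch_space, lambda p: behavioral_distance(p, oracle, states))
    return best
```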

Evaluation and Results

The evaluation tasks include learning to drive a simulated car in the TORCS racing environment, in addition to several classic control problems. The findings show that the PIRL framework, and NDPS in particular, can discover human-readable policies that, while sometimes trailing DRL policies in raw performance, are significantly more interpretable and transfer better to new environments.

Notably, policies discovered using NDPS produce smoother trajectories and adapt better to previously unseen environments than the corresponding DRL models. This smoothness is attributed to the structural constraints imposed by the policy sketch, which act as a regularizer during learning.

Implications and Future Directions

The PIRL framework represents a significant step toward making RL policies more interpretable and verifiable. By expressing policies in a human-readable form, PIRL makes the decision-making process of RL agents more transparent, which is particularly important for deployment in safety-critical domains.

Future research directions include extending the framework to handle perceptual inputs directly, such as those from visual or auditory sensors, which would broaden the applicability of PIRL to real-world tasks. Incorporating stochasticity into the learned policies could also prove beneficial for applications requiring flexibility and adaptability in dynamic environments.

Overall, this work establishes a foundation for future studies that aim to bridge the gap between the interpretability of RL policies and their performance capabilities, paving the way for more transparent and reliable AI systems.
