Overview
Advances in deep reinforcement learning (DRL) have produced remarkable results, yet DRL policies often remain difficult to interpret and fail to generalize. Prior work has explored programmatic policies, whose explicit structure makes them easier to interpret and to transfer to new situations, but these methods either rely on constrained policy representations or require substantial supervision. To address these limitations, the framework Learning Embeddings for lAtent Program Synthesis (LEAPS) synthesizes programs directly from reward feedback, producing policies that are both interpretable and generalizable.
Learning a Program Embedding Space
LEAPS uses a two-stage learning scheme. In the first stage, it learns a program embedding space in which nearby latent vectors correspond to programs with similar behaviors. A program encoder maps programs into this latent space, and a decoder reconstructs programs from latent vectors. The space is trained without task supervision by reconstructing randomly generated programs and the behaviors they induce. Notably, once learned, the embedding space can be reused across different tasks without retraining.
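To make the first stage concrete, below is a minimal PyTorch sketch of a program encoder/decoder trained with a token-level reconstruction loss. The vocabulary size, network dimensions, and the plain autoencoding objective are illustrative assumptions; the actual LEAPS training also involves behavior-reconstruction terms and randomly generated programs, which are omitted here.

```python
# Sketch: embed programs into a continuous latent space and reconstruct them.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, LATENT_DIM, MAX_LEN = 40, 64, 128, 64, 20

class ProgramAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.encoder = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.to_latent = nn.Linear(HIDDEN_DIM, LATENT_DIM)
        self.from_latent = nn.Linear(LATENT_DIM, HIDDEN_DIM)
        self.decoder = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.to_vocab = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def encode(self, tokens):
        # tokens: (batch, seq_len) program token IDs -> (batch, LATENT_DIM)
        _, h = self.encoder(self.token_embed(tokens))
        return self.to_latent(h[-1])

    def decode_logits(self, latent, tokens):
        # Teacher-forced decoding conditioned on the latent program vector.
        h0 = self.from_latent(latent).unsqueeze(0)
        out, _ = self.decoder(self.token_embed(tokens), h0)
        return self.to_vocab(out)

def reconstruction_loss(model, tokens):
    # Next-token prediction: reconstruct each program from its latent code.
    latent = model.encode(tokens)
    logits = model.decode_logits(latent, tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1)
    )

# Usage on a placeholder batch of program token sequences:
model = ProgramAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randint(0, VOCAB_SIZE, (8, MAX_LEN))
loss = reconstruction_loss(model, batch)
loss.backward()
optimizer.step()
```

In LEAPS the reconstruction targets would be drawn from a domain-specific language for the task environment rather than random token IDs; the key design point this sketch illustrates is that the latent vector, not the program text, becomes the object the second stage optimizes over.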
Program Synthesis
After the embedding space is learned, the second stage searches it for a task-solving program using the Cross-Entropy Method (CEM), a gradient-free optimization algorithm. CEM iteratively updates a population of candidate latent vectors, guided by the reward obtained when each candidate is decoded into a program and executed in the task; the best-scoring latent vector is then decoded into the final program. The search benefits from the smoothness of the embedding space, where interpolating between latent vectors yields programs with gradually varying behaviors.
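Below is a minimal NumPy sketch of this second stage. The function decode_and_evaluate is a placeholder standing in for decoding a latent vector into a program and rolling it out in the environment to obtain a return; the population size, elite fraction, and Gaussian parameterization are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch: Cross-Entropy Method (CEM) search over latent program vectors.
import numpy as np

def decode_and_evaluate(latent_vector):
    # Placeholder reward. In LEAPS this would decode the latent vector into
    # a program via the learned decoder and execute it to measure task reward.
    target = np.linspace(-1.0, 1.0, latent_vector.size)
    return -np.sum((latent_vector - target) ** 2)

def cem_search(latent_dim=64, population=64, elite_frac=0.1,
               iterations=100, init_std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(latent_dim), np.full(latent_dim, init_std)
    n_elite = max(1, int(population * elite_frac))
    for _ in range(iterations):
        # Sample a population of candidate latent programs.
        candidates = mean + std * rng.standard_normal((population, latent_dim))
        rewards = np.array([decode_and_evaluate(z) for z in candidates])
        # Refit the sampling distribution to the highest-reward candidates.
        elites = candidates[np.argsort(rewards)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean  # estimated best latent program; decode it to get the policy

best_latent = cem_search()
print("final reward:", decode_and_evaluate(best_latent))
```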
Experimental Validation
Experiments in the Karel domain demonstrate LEAPS's ability to synthesize programs for tasks that require navigation and interaction with objects, such as stacking and maze navigation. LEAPS reliably produces functional programs, achieves higher task performance than both DRL and program synthesis baselines, and generalizes better across varied domain settings and task configurations.
Concluding Thoughts
LEAPS distinguishes itself from prior programmatic reinforcement learning approaches by combining a highly expressive program representation with minimal supervision. By relying only on reward signals and its two-stage learning scheme, it sidesteps the difficulty of learning program synthesis from scratch. The research demonstrates advantages over DRL and program synthesis baselines not only in task performance but also in the interpretability and editability of the synthesized programs, making the framework a promising direction for settings that demand interpretable and generalizable policies.