- The paper introduces MAPO, which reformulates the expected return as a weighted sum over a memory buffer of high-reward trajectories and the trajectories outside it, reducing gradient variance in deterministic environments with discrete actions.
- It employs memory weight clipping, systematic exploration, and distributed sampling to stabilize training and efficiently discover high-reward trajectories.
- Empirical results show MAPO improves accuracy by 2.6 points (to 46.3%) on WikiTableQuestions and achieves 74.9% on WikiSQL with weak supervision.
Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing
The paper by Liang et al. introduces Memory Augmented Policy Optimization (MAPO), an algorithm designed to improve policy gradient methods in deterministic environments with discrete actions, such as program synthesis and semantic parsing. MAPO targets two well-known weaknesses of standard policy gradient training: high variance in the gradient estimates and inefficient exploration when high-reward trajectories are sparse.
Core Contributions
MAPO reformulates the expected return objective as a weighted sum of two components: an expectation over the high-reward trajectories stored in a memory buffer and an expectation over the trajectories outside it. Because the buffer is small, the first term can be computed exactly by enumeration, so sampling is only needed for the second term; the resulting gradient estimate stays unbiased while its variance drops, addressing the cold-start and sample-inefficiency problems of standard policy gradient approaches.
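A minimal sketch of this decomposition follows; `policy.prob`, `policy.grad_log_prob`, `t.reward`, and `sample_outside_buffer` are hypothetical placeholders rather than the authors' API, and the sketch ignores weight clipping and distributed execution for clarity.

```python
import numpy as np

def mapo_gradient(policy, buffer_trajs, sample_outside_buffer, n_samples=8):
    """Sketch of MAPO's reformulated policy gradient estimate.

    The expected return is split into an exact expectation over the
    high-reward trajectories stored in the memory buffer and a sampled
    expectation over trajectories outside the buffer.
    """
    # Exact term: enumerate every trajectory in the (small) memory buffer.
    probs = np.array([policy.prob(t) for t in buffer_trajs])
    w_buffer = float(probs.sum())  # pi(B): total probability mass of the buffer
    grad_in = sum(
        p * t.reward * policy.grad_log_prob(t)
        for p, t in zip(probs, buffer_trajs)
    ) / max(w_buffer, 1e-12)       # expectation under the buffer-restricted policy

    # Sampled term: Monte Carlo estimate over trajectories outside the buffer
    # (sample_outside_buffer is assumed to reject draws that fall in the buffer).
    outside = [sample_outside_buffer(policy, buffer_trajs) for _ in range(n_samples)]
    grad_out = sum(t.reward * policy.grad_log_prob(t) for t in outside) / n_samples

    # Weighted combination; unbiased when w_buffer is the true buffer mass.
    return w_buffer * grad_in + (1.0 - w_buffer) * grad_out
```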
Key technical innovations include:
- Memory Weight Clipping: The total probability the policy assigns to the memory buffer is clipped to a lower bound, so the memorized high-reward trajectories keep contributing gradient signal even early in training, when their probability under the policy is still tiny. This accelerates and stabilizes training at the cost of a small, bounded bias (see the first sketch after this list).
- Systematic Exploration: Because rewards in these domains are deterministic, re-evaluating a trajectory yields no new information. MAPO tracks already-explored action sequences with a Bloom filter and samples without repetition, so the exploration budget goes toward discovering new high-reward programs (a simplified sketch follows the list).
- Distributed Sampling: An actor-learner architecture spreads the work across many actors, which sample trajectories from inside and outside their memory buffers and send them to a central learner that computes gradient updates, keeping the approach scalable.
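Memory weight clipping can be illustrated with the same hypothetical interface as above; `alpha` is a hyperparameter chosen for illustration, not a value prescribed here.

```python
def clipped_buffer_weight(policy, buffer_trajs, alpha=0.1):
    """Memory weight clipping (sketch).

    Early in training the policy assigns almost no probability to the
    memorized high-reward trajectories, so the buffer term of the gradient
    would vanish.  Clipping the buffer weight pi(B) to a floor `alpha`
    keeps those trajectories contributing, which accelerates and
    stabilizes training at the cost of a small, bounded bias.
    """
    w = sum(policy.prob(t) for t in buffer_trajs)  # pi(B) under the current policy
    return max(w, alpha)                           # never let the weight fall below alpha
```

The clipped value would replace `w_buffer` in the weighted combination shown earlier; once the policy assigns more than `alpha` of its mass to the buffer, the clip stops binding and the estimate becomes unbiased again.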
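Systematic exploration can be approximated by sampling without repetition. The sketch below uses a plain Python set of visited action sequences for clarity; the paper tracks explored sequences with a Bloom filter and masks them out during sampling rather than rejecting them afterwards, so this is a simplification. `policy.sample_trajectory` and `env.evaluate` are hypothetical.

```python
def explore_systematically(policy, env, visited, n_episodes=100):
    """Simplified sketch of systematic exploration in a deterministic domain.

    `visited` is a set of action sequences whose reward has already been
    computed.  Because the environment is deterministic, re-evaluating a
    sequence is wasted work, so repeats are skipped and only newly found
    high-reward trajectories are returned for the memory buffer.
    """
    discovered = []
    for _ in range(n_episodes):
        actions = tuple(policy.sample_trajectory(env))  # one full action sequence
        if actions in visited:
            continue                                    # already evaluated: skip
        visited.add(actions)
        reward = env.evaluate(actions)                  # deterministic reward
        if reward > 0:                                  # keep high-reward programs
            discovered.append((actions, reward))
    return discovered
```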
Empirical Evaluation
MAPO was empirically validated on two benchmarks, WikiTableQuestions and WikiSQL, for weakly supervised program synthesis and semantic parsing. The results are noteworthy:
- On WikiTableQuestions, MAPO achieved an accuracy of 46.3%, a 2.6-point improvement over the previous state of the art, indicating its efficacy on complex semantic parsing tasks.
- On WikiSQL, MAPO attained a 74.9% accuracy with only weak supervision, surpassing several models reliant on full supervision, which underscores its potential for tasks with sparse rewards.
Theoretical Implications and Practical Applications
Theoretically, MAPO provides a structured method to manage exploration and variance reduction within the policy gradient framework, offering insights into improving sample efficiency and robustness in reinforcement learning.
Practically, MAPO's successful application in natural language processing tasks suggests broader applicability in structured prediction and combinatorial optimization domains. This could lead to significant advancements in automated program synthesis, enabling more efficient solutions to complex, real-world problems.
Future Directions
The research underscores several avenues for further exploration:
- Integrating MAPO with other reinforcement learning techniques to enhance scalability and performance in larger and more dynamic environments.
- Refining systematic exploration methods to further reduce computational overhead while maintaining accuracy.
- Exploring the integration of MAPO components with other models to improve generalization across diverse AI applications.
In conclusion, the contributions of MAPO lie in its innovative approach to policy optimization, offering both theoretical insights and practical benefits that have the potential to advance the fields of program synthesis and semantic parsing significantly.