
Mixed On-policy Reinforcement Learning

Updated 30 June 2025
  • Mixed On-policy Reinforcement Learning is a framework that integrates neural gradient-based optimization with symbolic program synthesis for both performance and interpretability.
  • It iterates through synthesis, repair, imitation, and gradient optimization to continuously refine and verify policy behavior.
  • This approach is applied in safety-critical and control domains, enabling explicit constraint enforcement and human-driven policy adjustments.

Mixed On-policy Reinforcement Learning (MORL) refers to a class of reinforcement learning methodologies that combine multiple optimization paradigms—principally, on-policy gradient-based optimization with symbolic or programmatic methods—within a single iterative framework. The foundational concept is to alternate between powerful but opaque neural function approximators and interpretable, modifiable programmatic policy representations, enabling both high performance and human-driven modification or constraint enforcement.

1. Conceptual Foundations and Motivation

Mixed On-policy Reinforcement Learning arises from the limitations inherent in deep RL: while neural network policies achieve high task performance, they are typically difficult to interpret and modify, and do not offer built-in mechanisms for constraint satisfaction or verification. MORL, as introduced in "Towards Mixed Optimization for Reinforcement Learning with Program Synthesis" (2018), directly targets these limitations by interleaving the following representations and optimizations:

  1. Black-box, Reactive Policy: Standard RL policies such as neural networks parameterized for gradient-based learning (e.g., TRPO, PPO).
  2. Symbolic, Programmatic Policy: Human-interpretable programs (e.g., decision trees, domain-specific language [DSL] programs), synthesized to mimic the reactive policy’s behavior.

This juxtaposition enables both sample-efficient learning and explicit, global corrections by leveraging synthesis, repair, and policy distillation in an iterative loop.

2. The MORL Iterative Framework: Workflow and Algorithmic Structure

The core MORL framework operates as a closed-loop of four main steps, iteratively repeated until the synthesized policy meets desired performance or specification criteria:

  1. Synthesis: Extract a symbolic representation $P_t$ of the current reactive policy $\pi_t$, using behavioral imitation techniques (such as decision tree extraction via VIPER or DSL induction via PIRL). This is formally:

$$\forall s \in \mathcal{S},\quad P_t(s) \approx \pi_t(s)$$

with $P_t$ constrained to a symbolic domain $\mathcal{D}$.

  2. Repair: Debug or modify the synthesized program symbolically—either manually (human-in-the-loop correction) or automatically (using program repair tools, e.g., CSPs solved via SAT/SMT). This yields a new program $P'_t$ such that:

$$P_t \longrightarrow P'_t \quad \text{by satisfying constraints or correcting errors}$$

  3. Imitation (Behavioral Cloning): Transfer the modified symbolic policy $P'_t$ back into the differentiable domain by training a neural policy $\pi'_t$ to imitate it:

$$\forall s \in \mathcal{S},\quad \pi'_t(s) \approx P'_t(s)$$

  4. Gradient-based Policy Optimization: Further refine $\pi'_t$ using standard RL algorithms (e.g., TRPO, PPO), yielding an updated policy $\pi_{t+1}$ for the next cycle.

Iteration:

$$\pi_t \longrightarrow_{\text{Synthesis}} P_t \longrightarrow_{\text{Repair}} P'_t \longrightarrow_{\text{Imitation}} \pi'_t \longrightarrow_{\text{Optimization}} \pi_{t+1}$$

This process proceeds until the policy meets prescribed constraints or task success thresholds.
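To make the control flow concrete, the following is a minimal structural sketch of the loop in Python. The helper callables (`synthesize`, `repair`, `imitate`, `optimize`, `meets_spec`) are hypothetical placeholders for the four stages, not an API from the paper.

```python
# Minimal structural sketch of the MORL loop. The helper callables are
# hypothetical placeholders for the four stages, not an API from the paper.
from typing import Any, Callable

def morl_loop(
    pi_0: Any,
    synthesize: Callable[[Any], Any],   # pi_t   -> P_t    (e.g., decision-tree extraction)
    repair: Callable[[Any], Any],       # P_t    -> P'_t   (manual or automated repair)
    imitate: Callable[[Any], Any],      # P'_t   -> pi'_t  (behavioral cloning)
    optimize: Callable[[Any], Any],     # pi'_t  -> pi_t+1 (e.g., TRPO/PPO fine-tuning)
    meets_spec: Callable[[Any], bool],  # stopping criterion (constraints / task success)
    max_iters: int = 10,
) -> Any:
    """Alternate between symbolic and neural policy spaces until the spec is met."""
    pi_t = pi_0
    for _ in range(max_iters):
        P_t = synthesize(pi_t)      # Synthesis: extract a symbolic program
        P_rep = repair(P_t)         # Repair: enforce constraints, fix errors
        pi_clone = imitate(P_rep)   # Imitation: clone the program into a neural policy
        pi_t = optimize(pi_clone)   # Optimization: on-policy RL fine-tuning
        if meets_spec(pi_t):
            break
    return pi_t
```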

3. Symbolic Synthesis and Program Repair in MORL

Symbolic Synthesis utilizes techniques from policy extraction (e.g., VIPER, PIRL), where the symbolic program is derived to closely mimic the neural policy. The representation is typically a high-level DSL, decision tree, or similar structure. Constraints (correctness, safety, user intent) can then be explicitly enforced at the symbolic level—an operation impractical in neural networks.
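As an illustration of this extraction step, the sketch below fits a shallow decision tree to state–action pairs labeled by a stand-in neural policy. It is a simplified distillation, not the full VIPER procedure (which resamples states from rollouts DAgger-style and weights them by criticality); the placeholder policy and CartPole-like feature names are assumptions.

```python
# Simplified sketch of the synthesis step: fit a shallow decision tree to
# (state, action) pairs labeled by a neural policy. Full VIPER additionally
# resamples states from rollouts (DAgger-style) and weights them by criticality.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def neural_policy(state: np.ndarray) -> int:
    """Stand-in for a trained neural policy returning a discrete action."""
    return int(state[2] + 0.5 * state[3] > 0)  # hypothetical placeholder rule

# Sample states (e.g., CartPole-like observations) and label them with the policy.
states = np.random.uniform(-1.0, 1.0, size=(5000, 4))
actions = np.array([neural_policy(s) for s in states])

# Fit an interpretable surrogate program P_t as a shallow decision tree.
tree = DecisionTreeClassifier(max_depth=3).fit(states, actions)
print(export_text(tree, feature_names=["x", "x_dot", "theta", "theta_dot"]))
```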

Program Repair enables direct correction of policy behavior, either by manual modification (e.g., editing a decision branch to enforce a control invariant) or by automated means using repair algorithms (e.g., GenProg, automated program synthesis). This can enforce specifications like safety constraints non-invasively, since logic errors or undesirable behaviors can be addressed through high-level program edits.
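A manual repair can be as simple as wrapping the extracted program with an explicit rule. The sketch below mirrors the CartPole-style edit described in Section 5; the threshold and the action encoding (0 = push left, 1 = push right) are illustrative assumptions.

```python
# Hedged sketch of a manual repair: wrap an extracted program with an explicit
# rule. The threshold and action encoding are illustrative assumptions.
from typing import Callable, Sequence

Program = Callable[[Sequence[float]], int]

def repair(base_program: Program) -> Program:
    def repaired(state: Sequence[float]) -> int:
        x, x_dot, theta, theta_dot = state
        # Inserted rule: if the pole leans noticeably, push the cart toward the lean.
        if abs(theta) > 0.05:
            return 1 if theta > 0 else 0
        return base_program(state)  # otherwise defer to the synthesized program
    return repaired
```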

4. Dual Representation: Interpretability, Debuggability, and Optimization

MORL’s strength lies in alternating between two policy spaces:

  • Opaque but optimizable: Neural policies support powerful, sample-efficient gradient-based learning but lack interpretability.
  • Interpretable but non-differentiable: Symbolic/programmatic policies support debugging, formal verification, and easy human correction.

Switching between them enables both rapid, local improvements (in neural space) and global, structural policy adjustments (in symbolic space), resulting in policies that are not only high-performing but also modifiable, debuggable, and certifiable.

5. Application and Empirical Results

The MORL framework is exemplified on the CartPole-v0 domain. The process unfolds as follows:

  • Initial policy: Begin with a suboptimal (“Worst”) policy learned by RL.
  • Synthesis: Extract a decision tree from the reactive policy.
  • Repair: Manually edit tree logic (e.g., insert a rule to always push the cart in the direction the pole is falling).
  • Imitation: Clone this repaired program back into a neural policy.
  • Optimization: Fine-tune the neural policy via TRPO.
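For the imitation step of this workflow, a minimal behavioral-cloning sketch might look as follows (assuming PyTorch; the sampling range, network size, and training schedule are illustrative choices, not the paper's configuration):

```python
# Minimal behavioral-cloning sketch for the imitation step (assumes PyTorch;
# the sampling range, network size, and training schedule are illustrative).
import numpy as np
import torch
import torch.nn as nn

def behavioral_clone(program, n_samples: int = 5000, epochs: int = 50) -> nn.Module:
    """Train a small neural policy to imitate a programmatic policy on sampled states."""
    states = np.random.uniform(-1.0, 1.0, size=(n_samples, 4)).astype(np.float32)
    actions = np.array([program(s) for s in states], dtype=np.int64)

    net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    X, y = torch.from_numpy(states), torch.from_numpy(actions)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(X), y)  # cross-entropy between cloned logits and program actions
        loss.backward()
        opt.step()
    return net
```

The cloned network can then be fine-tuned with an on-policy algorithm (TRPO in the original experiments), completing one cycle.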

Results indicate that:

  • Direct programmatic repair can drastically increase mean reward (e.g., from 9.3 to 104 and then to 200, the latter being near the environment maximum).
  • Behavioral cloning retains most of the improvement (with minor performance drops due to imperfect imitation).
  • Further on-policy optimization rapidly drives the policy to optimality.
  • The ability to inject domain knowledge or correct errors programmatically makes the policy easily modifiable and safe by construction.
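To round out the cycle, the final on-policy optimization stage could be sketched with plain REINFORCE in place of TRPO (a substitution made here purely for brevity; the gymnasium package is assumed, and `net` is a cloned policy such as the one produced by the behavioral-cloning sketch above).

```python
# Hedged sketch of the final on-policy optimization stage, using plain REINFORCE
# in place of TRPO for brevity (assumes the gymnasium package; `net` is a cloned
# policy network mapping a 4-dimensional observation to 2 action logits).
import gymnasium as gym
import torch

def finetune(net: torch.nn.Module, episodes: int = 200, gamma: float = 0.99) -> torch.nn.Module:
    env = gym.make("CartPole-v0")
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(episodes):
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            logits = net(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(float(reward))
            done = terminated or truncated
        # Discounted returns-to-go for the episode.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        loss = -(torch.stack(log_probs) * torch.as_tensor(returns)).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```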

6. Benefits, Limitations, and Extensions

Benefits:

  • Interpretability: Symbolic programs unlock policy understanding and debugging.
  • Explicit Specification: Safety constraints, logical conditions, and domain knowledge can be injected into the policy.
  • Efficient Learning: Fine-tuning proceeds faster when initialized from repaired, programmatically improved policies.
  • Human-in-the-loop integration: Domain experts can review, edit, and verify symbolic policies before deployment.

Limitations:

  • Scalability: Symbolic synthesis and repair become challenging as state/action spaces grow more complex.
  • Automation: Advances in automated program repair are required for larger-scale applications.
  • Imperfect Imitation: Behavioral cloning can introduce approximation errors, potentially requiring further optimization.

Extensions may include the use of richer DSLs, incremental program synthesis, and integration with advanced repair techniques for larger environments.

7. Impact and Practical Use Cases

Mixed On-policy Reinforcement Learning, as exemplified by the MORL framework (Towards Mixed Optimization for Reinforcement Learning with Program Synthesis, 2018), paves the way for RL deployments that demand both high performance and strong guarantees—such as autonomous systems, industrial control, robotics, and safety-critical domains. Its capacity to balance the expressivity of neural policies with the structure, verifiability, and flexibility of programmatic policies allows practitioners to design, debug, and refine RL controllers with unprecedented assurance and adaptability.

Table: Key Workflow Stages in MORL

| Stage | Description | Artefact Produced |
|---|---|---|
| Synthesis | Extract symbolic program from neural policy | Symbolic policy $P_t$ |
| Repair | Manually/automatically fix symbolic program | Repaired policy $P'_t$ |
| Imitation | Clone repaired program into neural policy | Neural policy $\pi'_t$ |
| Optimization | RL fine-tuning of neural policy | Improved policy $\pi_{t+1}$ |

In summary, Mixed On-policy Reinforcement Learning, via iterative alternation between program synthesis/repair and on-policy optimization, enables the synthesis of policies that are simultaneously high-performing, interpretable, and naturally compatible with specification-driven requirements. This systematic framework extends gracefully to complex control domains, provided advances in symbolic synthesis and repair continue to scale.