Erasable Reinforcement Learning

Updated 2 October 2025
  • Erasable Reinforcement Learning (ERL) refers to a class of frameworks that add mechanisms for erasing or correcting an agent’s memories, enhancing sample efficiency, robustness, and privacy.
  • It combines techniques like hybrid evolutionary algorithms, selective experience replay, and corrective reasoning in LLMs to improve exploration and stability.
  • These methods have achieved measurable gains in benchmarks such as MuJoCo tasks and multi-hop QA, demonstrating practical benefits in adaptive and safe decision-making.

Erasable Reinforcement Learning (ERL) designates a diverse class of frameworks and algorithms that introduce explicit mechanisms for erasing, forgetting, or correcting (parts of) a reinforcement learning agent’s experience, memory, reasoning trajectory, or knowledge. ERL has emerged across several lines of research—including evolutionary hybrid systems, experience replay management, trajectory-level correction, rating-based reward learning, quantum control, simulator optimization, and unlearning for privacy. Common to these approaches is a focus on selective removal or modification to enable improved robustness, sample efficiency, adaptability, or privacy in sequential decision-making, often beyond what conventional RL techniques achieve.

1. Hybrid Evolutionary and Reinforcement Learning Algorithms

Evolutionary Reinforcement Learning originally referred to hybrid approaches that combine Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) to address the limitations of each method in isolation (Khadka et al., 2018). In the ERL paradigm, a population of actor policies is evolved using genetic operators, with performance measured by episode-level cumulative reward (fitness). An embedded RL actor performs gradient-based updates on a shared replay buffer of experiences collected by the entire EA population and periodically injects its improved weights back into the population. This hybridization, sketched in the example after the list below, achieves:

  • Episode-level credit assignment: Fitness evaluates policies over entire episodes, handling sparse/long-delayed rewards without bootstrapping.
  • Exploration: the EA diversifies globally in parameter space, while the RL actor explores locally via action noise.
  • Stability: Population selection avoids deceptive gradients and enhances robustness to hyperparameters.
  • Sample efficiency: RL gradient updates accelerate learning compared to pure EAs.
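
For concreteness, the following is a minimal, self-contained sketch of this hybrid loop in the spirit of Khadka et al. (2018). The environment, mutation operator, and "RL actor" are toy stand-ins (the original framework uses MuJoCo tasks, deep policies, and a DDPG learner); every name and interface here is an illustrative assumption, not the reference implementation.

```python
import random
import numpy as np

def episode_return(params, rng):
    """Toy 'environment': episode-level fitness of a parameter vector."""
    return -float(np.sum((params - 1.0) ** 2)) + 0.01 * rng.standard_normal()

def mutate(params, rng, sigma=0.1):
    """Gaussian parameter-space mutation (the EA's global exploration operator)."""
    return params + sigma * rng.standard_normal(params.shape)

class RLActor:
    """Stand-in gradient learner that trains on the population's shared buffer."""
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def update(self, replay_buffer, lr=0.05):
        # Crude proxy for off-policy gradient updates on shared experience:
        # move toward the best-performing parameters seen so far.
        best_params, _ = max(replay_buffer, key=lambda x: x[1])
        self.params += lr * (best_params - self.params)

def erl_loop(pop_size=10, dim=5, generations=50, sync_period=5, seed=0):
    rng = np.random.default_rng(seed)
    population = [rng.standard_normal(dim) for _ in range(pop_size)]
    rl_actor = RLActor(dim)
    replay_buffer = []

    for gen in range(generations):
        # 1. Episode-level fitness evaluation; experiences feed the shared buffer.
        fitness = [episode_return(p, rng) for p in population]
        replay_buffer.extend((p.copy(), f) for p, f in zip(population, fitness))

        # 2. Selection and mutation: keep elites, refill with mutated copies.
        order = np.argsort(fitness)[::-1]
        elites = [population[i] for i in order[: pop_size // 2]]
        population = elites + [mutate(random.choice(elites).copy(), rng)
                               for _ in range(pop_size - len(elites))]

        # 3. The RL actor performs gradient-style updates on the shared buffer.
        rl_actor.update(replay_buffer)

        # 4. Periodic injection: the RL actor replaces the weakest population member.
        if gen % sync_period == 0:
            population[-1] = rl_actor.params.copy()

    return population, rl_actor
```

The essential structure is the four-step cycle: episode-level evaluation into a shared buffer, selection and mutation, off-policy updates by the RL actor, and periodic injection of the RL actor's weights back into the population.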

Empirical results in MuJoCo benchmarks (Half-Cheetah, Ant, Hopper, Walker2D, etc.) show that this ERL framework outperforms DDPG, PPO, and pure EAs, especially in tasks with deceptive or sparse reward landscapes.

2. Experience Replay and Episodic Memory Management

Erasable RL is tightly connected to experience replay mechanisms that govern the retention and discarding of agent-environment interactions. ReF-ER (Remember and Forget for Experience Replay) introduces two critical rules (Novati et al., 2018):

  • Selective Erasure via Importance Weighting: Only trajectories whose behavior policy is sufficiently similar to the current policy (within a trust-region, defined by an importance weight threshold) contribute to gradient updates; “far-policy” samples are zeroed out.
  • KL-based Policy Regularization: A KL-divergence penalty between stored behaviors and the current policy constrains updates within a trust region, reducing drift and instability.

This form of active memory erasure stabilizes learning and improves generalization. Mechanisms for erasability, such as dynamic filtering or weighting based on current-policy distance, generalize to both lifelong learning and privacy-motivated forgetting requirements.
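
As a concrete illustration, the following sketch implements the two rules for diagonal-Gaussian policies: an importance-weight filter that zeroes out far-policy samples outside a trust region [1/c_max, c_max], and a per-sample KL divergence that can be added to the loss as a penalty. It is a simplified rendering of the ideas in Novati et al. (2018), not their reference implementation; the adaptation of the penalty coefficient and the full update rules are omitted.

```python
import numpy as np

def gaussian_logpdf(a, mean, std):
    """Log-density of a diagonal-Gaussian policy at actions a."""
    return -0.5 * (((a - mean) / std) ** 2 + 2.0 * np.log(std)
                   + np.log(2.0 * np.pi)).sum(axis=-1)

def refer_near_policy_mask(actions, behavior_mean, behavior_std,
                           current_mean, current_std, c_max=4.0):
    """Rule 1: importance weights rho = pi(a|s) / mu(a|s); only samples with
    rho inside [1/c_max, c_max] ('near-policy') contribute to the gradient."""
    log_rho = (gaussian_logpdf(actions, current_mean, current_std)
               - gaussian_logpdf(actions, behavior_mean, behavior_std))
    rho = np.exp(log_rho)
    mask = (rho > 1.0 / c_max) & (rho < c_max)   # far-policy samples are zeroed out
    return rho, mask

def gaussian_kl(behavior_mean, behavior_std, current_mean, current_std):
    """Rule 2: KL(mu || pi) per sample, used as a trust-region penalty."""
    var_ratio = (behavior_std / current_std) ** 2
    kl = 0.5 * (var_ratio
                + ((current_mean - behavior_mean) / current_std) ** 2
                - 1.0 - np.log(var_ratio))
    return kl.sum(axis=-1)
```

In the full algorithm the mask multiplies each sample's policy-gradient term, and the KL penalty is weighted by a coefficient adapted so that a target fraction of the buffer remains near-policy; those details are specified in the paper.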

3. Robust Reasoning in Search-Augmented LLMs

“Erasable Reinforcement Learning” has expanded to cover trajectory-level reasoning in search-augmented LLMs, where the agent’s reasoning process is a sequence of state–action–evidence pairs (Wang et al., 1 Oct 2025). Here, ERL operates as follows (a minimal sketch of the loop follows the list):

  • Fault Detection: At each reasoning round, intermediate rewards for sub-answers and search hits are evaluated; steps whose rewards fall below the error thresholds (α for local sub-answers, β for plan-level consistency) are flagged as faulty.
  • Erasure Operator: Identified faulty steps are erased from the reasoning trajectory, and the agent regenerates solutions from the last trusted state.
  • Fine-grained Correction: Erasure may target decomposition errors (incorrect problem breakdown), retrieval misses (missing evidence), or logic errors, preventing error propagation.
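
The control flow can be summarized with the hedged sketch below. `generate_step` stands in for the search-augmented LLM (decompose, retrieve, answer) together with the method's reward routines; the thresholds α and β, the `Step` fields, and all function names are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    """One reasoning round: sub-question, retrieved evidence, sub-answer."""
    sub_question: str
    evidence: str
    sub_answer: str
    local_reward: float    # e.g. sub-answer score
    plan_reward: float     # e.g. decomposition / retrieval-hit score
    is_final: bool = False

def erasable_reasoning(generate_step: Callable[[List[Step]], Step],
                       alpha: float = 0.5, beta: float = 0.5,
                       max_rounds: int = 8, max_retries: int = 3) -> List[Step]:
    """Erase-and-regenerate loop: a step whose intermediate rewards fall below
    alpha (local) or beta (plan-level) is flagged as faulty, dropped from the
    trajectory, and regenerated from the last trusted state."""
    trajectory: List[Step] = []
    retries = 0
    while len(trajectory) < max_rounds:
        step = generate_step(trajectory)        # condition only on the trusted prefix
        faulty = step.local_reward < alpha or step.plan_reward < beta
        if faulty and retries < max_retries:
            retries += 1                        # erase: the faulty step is discarded
            continue                            # regenerate from the last trusted state
        trajectory.append(step)                 # accepted step extends the trusted prefix
        retries = 0
        if step.is_final:
            break
    return trajectory
```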

Quantitative gains (+8.48% EM, +11.56% F1 for 3B models; +5.38% EM for 7B models) on multi-hop QA tasks indicate substantial improvement over previous methods, with the ERL framework rendering reasoning resilient to cascades of logic faults.

4. Rating-Based RL with AI-Generated Feedback

Recent ERL variants leverage external feedback models as an erasable “teacher.” In ERL-VLM, agents receive absolute trajectory ratings from large vision-language models (VLMs) instead of engineered rewards (Luu et al., 15 Jun 2025):

  • Absolute Ratings: VLM feedback provides multi-class labels (bad to excellent) for trajectory segments, which are integrated via a probabilistic rating model governing the agent’s surrogate reward function.
  • Data Imbalance and Label Noise Handling: Stratified sampling and a mean absolute error loss with inverse-frequency weighting mitigate the dominance of “bad” labels and noise from hallucinated teacher ratings (a sketch of this weighting follows the list).
  • Autonomous Reward Design: This paradigm scales RL training and reward learning while reducing the need for human annotation.
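
The label-imbalance handling can be illustrated with a short, hedged sketch: a mean absolute error between predicted and VLM-assigned ratings, weighted by inverse class frequency so the dominant “bad” class does not swamp the gradient. The exact rating-model likelihood and sampling scheme in ERL-VLM may differ; everything below is an illustrative simplification.

```python
import torch

def weighted_mae_rating_loss(predicted_ratings: torch.Tensor,
                             target_ratings: torch.Tensor,
                             num_classes: int = 4) -> torch.Tensor:
    """MAE between predicted and teacher ratings with inverse-frequency class
    weights. predicted_ratings: float scores per segment; target_ratings:
    integer rating classes in [0, num_classes)."""
    counts = torch.bincount(target_ratings, minlength=num_classes).float()
    inv_freq = counts.sum() / counts.clamp(min=1.0)   # rare classes get larger weights
    weights = inv_freq[target_ratings]                # per-sample weight
    mae = (predicted_ratings - target_ratings.float()).abs()
    return (weights * mae).sum() / weights.sum()
```

Stratified sampling plays a complementary role on the data side, drawing roughly balanced numbers of segments per rating class when forming training batches.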

ERL-VLM demonstrates superior performance and sample efficiency in simulated and real-world tasks, establishing erasable, automatically refactorable reward channels as practically viable.

5. Mechanisms for Reinforcement Unlearning

Privacy requirements motivate erasable RL in the form of “reinforcement unlearning”: selectively erasing the agent’s knowledge, skills, or memories of particular environments upon an owner’s request (Ye et al., 2023). Two principal schemes exist:

  • Decremental RL: Fine-tuning the policy with a composite loss that penalizes performance in the target (unlearning) environments while preserving behavior in all others (a sketch follows the list):

\mathcal{L}_u = \mathbb{E}_{s\sim\mathcal{S}_u}\left[\|Q_{\pi'}(s)\|_{\infty}\right] + \mathbb{E}_{s\not\sim\mathcal{S}_u}\left[\|Q_{\pi'}(s) - Q_{\pi}(s)\|_{\infty}\right]

where \pi' is the fine-tuned (unlearned) policy, \pi the original policy, \mathcal{S}_u the set of states from the unlearning environments, and the infinity norm is taken over actions.

  • Environment Poisoning: Actively modifying the target environment’s transition dynamics to degrade learned knowledge without impacting retained environments.
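
A direct, hedged transcription of the decremental-RL objective above (batch means approximating the expectations, infinity norm taken over the action dimension) might look as follows; `q_new`, `q_old`, and the tensor shapes are assumptions for illustration rather than the paper's code.

```python
import torch

def decremental_unlearning_loss(q_new: torch.nn.Module,
                                q_old: torch.nn.Module,
                                states_forget: torch.Tensor,
                                states_retain: torch.Tensor) -> torch.Tensor:
    """Composite loss: suppress Q-values on states from the unlearning
    environments while matching the frozen original Q-function elsewhere.
    Both Q-networks map a batch of states to per-action value vectors."""
    # Forgetting term: ||Q_{pi'}(s)||_inf on states drawn from the target environments.
    forget_term = q_new(states_forget).abs().amax(dim=-1).mean()

    # Retention term: ||Q_{pi'}(s) - Q_pi(s)||_inf on all other states,
    # with the original Q-function held fixed.
    with torch.no_grad():
        q_ref = q_old(states_retain)
    retain_term = (q_new(states_retain) - q_ref).abs().amax(dim=-1).mean()

    return forget_term + retain_term
```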

Effectiveness is evaluated through “environment inference attacks,” quantifying the difficulty of reconstructing the erased environment from observed agent behavior.

6. Extensions to Physics, Robotics, and Quantum Control

Erasable RL methods extend to protocol optimization in physics and controlled systems:

  • Efficient Erasure Protocols: Genetic algorithms operating on a neural-network parameterization of the control protocol yield efficient, low-heating, high-speed bit erasure in underdamped mechanical memories (Barros et al., 23 Sep 2024).
  • Quantum Control: Mapping RL states/actions to quantum states/unitary controls allows resource-efficient, high-fidelity evolution in quantum systems, with enhanced neural heuristic functions accelerating learning under practical constraints (Liu et al., 2023).
  • Trajectory-Level Refinement: Frameworks like MoRe-ERL learn residual corrections to reference trajectories, identifying and modifying only segments critical for adaptation via smooth B-Spline movement primitives. This segment-level erasability yields better sample efficiency, task performance, and sim-to-real transfer (Huang et al., 2 Aug 2025).
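
To make the segment-level idea concrete, the sketch below adds a smooth B-spline residual to a reference trajectory only inside a flagged segment, leaving the rest untouched. It is a schematic reconstruction under stated assumptions (clamped cubic B-spline, learned control points with zero values at the segment boundaries), not the MoRe-ERL implementation.

```python
import numpy as np
from scipy.interpolate import BSpline

def apply_segment_residual(reference: np.ndarray,
                           control_points: np.ndarray,
                           segment: tuple,
                           degree: int = 3) -> np.ndarray:
    """Add a B-spline residual (parameterized by learned control points of
    shape (n_ctrl, dof)) to reference[start:end] only; the rest of the
    reference trajectory of shape (T, dof) is left unchanged."""
    start, end = segment
    n_ctrl = control_points.shape[0]

    # Clamped knot vector on [0, 1]: the residual starts at the first control
    # point and ends at the last (set both to zero for boundary continuity).
    interior = np.linspace(0.0, 1.0, n_ctrl + 1 - degree)
    knots = np.concatenate([np.zeros(degree), interior, np.ones(degree)])

    spline = BSpline(knots, control_points, degree)
    phase = np.linspace(0.0, 1.0, end - start)     # local phase inside the segment

    corrected = reference.copy()
    corrected[start:end] += spline(phase)          # residual only on the critical segment
    return corrected
```

For example, with a reference of shape (200, 7), segment (40, 80), and control points of shape (6, 7), only those 40 time steps are modified, which is the segment-level erasability described above.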

7. Algorithmic, Theoretical, and Practical Implications

ERL frameworks introduce several substantive implications for research and practice:

  • Memory Management: Dynamic, rule-based memory erasure enhances generalization, stability, and privacy across RL variants.
  • Sample Efficiency and Robustness: Hybridization with evolutionary methods and external teachers enables faster convergence and resilience to reward sparsity, non-Markovian objectives, and environmental drift.
  • Corrective Reasoning: Fine-grained erasure of reasoning steps transforms multi-hop decision processes into robust, error-correctable sequential reasoning, advancing capabilities in LLMs and combinatorial agents.
  • Applied Systems: ERL-inspired techniques now drive advances in autonomous robotics, quantum systems, and physical device optimization, evidenced by sample-efficient sim-to-real transfer and experimentally validated protocols.

A plausible implication is that erasable reinforcement learning is rapidly evolving from hybrid evolutionary frameworks and experience replay management into a universal principle for adaptive, privacy-preserving, and resilient sequential decision-making in machine learning. Future research directions include finer granularity in unlearning, continuous policy reinitialization, multi-agent erasability, and theoretically grounded benchmarks for memory, privacy, and generalization.
