Learned Optimization for Plasticity & Exploration

Updated 26 July 2025
  • Learned optimization is a meta-learning approach that designs adaptive update rules to maintain neural plasticity, ensure effective exploration, and manage non-stationary environments.
  • Techniques include meta-learned optimizers, Hebbian-inspired local updates, stochastic synaptic sampling, and dynamic network architectures, all validated in diverse RL tasks.
  • Empirical studies show that such methods outperform traditional optimizers by reactivating dormant neurons and adapting dynamically to distributional shifts.

Learned optimization for plasticity, exploration, and non-stationarity refers to the meta-learning of update rules, architectures, or local dynamics in neural or reinforcement learning systems so that the resulting optimization processes are robust to distributional drift, maintain adaptability over time, and facilitate exploration to prevent suboptimal or premature convergence. This paradigm encompasses algorithmic and architectural innovations in meta-optimizers, synaptic plasticity, policy search, and continual learning to address plasticity loss, non-stationary environments, and exploration–exploitation dilemmas. Approaches span recurrent and parameter-conditioned optimizers, Hebbian/biological local rules, uncertainty-aware critics, topological growth, dynamic ensemble models, and multi-objective trade-off frameworks.

1. Meta-Learned Optimizers for Plasticity, Exploration, and Non-stationarity

Meta-learned optimizers are update rules parameterized by neural networks—typically recurrent architectures such as GRUs or LSTMs—that take as input features summarizing current and past optimization dynamics (e.g., gradient statistics, parameter values, temporal progress, dormancy measures) and output parameter updates. The OPEN optimizer (Goldie et al., 9 Jul 2024) exemplifies this paradigm, directly targeting three canonical RL difficulties:

  • Non-stationarity: Inputs include "training proportion" and "batch proportion" signals to condition the step size and momentum on the stage of learning, adapting to distributional shifts.
  • Plasticity loss: Neuron- or parameter-level dormancy metrics are provided as features, enabling the optimizer to modulate updates to reactivate neurons that have become inactive.
  • Exploration: OPEN meta-learns a stochasticity scaling output used to inject parameter-space noise adaptively into actor updates (i.e., producing a learned exploration noise term), which addresses the need for directed exploration—especially in extended horizon and hard-exploration tasks.

The update rule for each parameter in OPEN is

$$\hat{u}_i = \alpha_1\, m_i\, \exp\left(\alpha_2\, e_i\right) + \alpha_3\, \delta_i\, \epsilon$$

where $m_i$ (momentum), $e_i$, and $\delta_i$ are network outputs, and $\epsilon$ is unit Gaussian noise. Zero-mean normalization is applied for stability.
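
As a rough illustration, the sketch below applies a per-parameter update of this form; the arrays `m`, `e`, and `delta` stand in for outputs of the meta-learned optimizer network, and the fixed `alphas` coefficients are illustrative assumptions rather than the trained OPEN optimizer.

```python
import numpy as np

def open_style_update(params, m, e, delta, alphas=(1e-3, 1.0, 1e-3), rng=None):
    """Apply a learned per-parameter update of the form
    u_i = a1 * m_i * exp(a2 * e_i) + a3 * delta_i * eps,
    where m, e, delta would normally be produced by the meta-learned optimizer
    (placeholders here) and eps is unit Gaussian exploration noise."""
    rng = np.random.default_rng() if rng is None else rng
    a1, a2, a3 = alphas
    eps = rng.standard_normal(params.shape)         # parameter-space exploration noise
    u = a1 * m * np.exp(a2 * e) + a3 * delta * eps  # learned deterministic + stochastic terms
    u = u - u.mean()                                # zero-mean normalization for stability
    return params + u
```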

Empirically, OPEN demonstrates robust generalization when meta-trained on small sets of environments, outperforming hand-designed optimizers (Adam, RMSProp, Lion, meta-opt baselines) across both in-distribution and out-of-support RL tasks, as well as generalizing across varying agent architectures (Goldie et al., 9 Jul 2024). Ablations confirm that removing dormancy or temporal conditioning features degrades performance.

2. Biological and Hebbian Plasticity Rules

Some approaches derive learned update rules inspired by biological mechanisms of plasticity. Local Hebbian-like updates, parameterized in compact, interpretable forms, are evolved via genetic algorithms to enable networks to adapt weights autonomously to changing reinforcement or environmental conditions (Yaman et al., 2019). In these systems, synaptic changes are locally computed using pre- and post-synaptic activity and a global modulatory reward or punishment signal:

$$\Delta w_{ij} = \eta \cdot a_i \cdot a_j \cdot m$$

with $m \in \{+1, -1, 0\}$ for positive, negative, or neutral reinforcement, respectively. Population-based search over discrete rule sets identifies compact plasticity schemes that robustly support adaptation in non-stationary foraging and predator–prey tasks.
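
A minimal sketch of such a reward-modulated Hebbian step is shown below; the learning rate, the pre/post indexing convention, and how activities and the modulatory signal are obtained are assumptions for illustration only.

```python
import numpy as np

def hebbian_update(W, pre, post, m, eta=0.01):
    """Reward-modulated Hebbian step: dW_ij = eta * post_i * pre_j * m,
    with m in {+1, -1, 0} for positive, negative, or neutral reinforcement."""
    # Local update: outer product of post- and pre-synaptic activity, gated by the global signal m.
    return W + eta * m * np.outer(post, pre)
```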

Performance of such evolved local rules is competitive with offline, global search methods (hill climbing) in environments with abrupt or gradual reward function drift (Yaman et al., 2019). The key advantage is continual online adaptation without explicit global supervision or resets, with rules often revealing interpretable, reward-modulated weight adjustments.

3. Synaptic Sampling and Neural Plasticity in Embodied Systems

Stochastic synaptic plasticity rules governed by stochastic differential equations support robust lifelong learning and exploration in non-stationary embodied control tasks (Kaiser et al., 2020). In the SPORE framework, synaptic parameters $\theta$ are iteratively sampled from a target distribution combining prior constraints and expected discounted reward:

$$d\theta_i = \beta \Bigl[ \partial_{\theta_i} \log p_s(\theta) + \partial_{\theta_i} \log \mathcal{V}(\theta) \Bigr] dt + \sqrt{2\beta T}\, dW_i$$

The temperature $T$ modulates the exploration–consolidation trade-off: high $T$ increases stochasticity for exploration, while annealing (e.g., exponential decay of the learning rate $\beta$) supports consolidation as the agent stabilizes (Kaiser et al., 2020).
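
A discretized Euler–Maruyama step of this synaptic-sampling dynamic might look like the sketch below; the gradient callables and the temperature/learning-rate schedule are placeholders, not the SPORE implementation.

```python
import numpy as np

def spore_step(theta, grad_log_prior, grad_log_value, beta=1e-3, T=1.0, dt=1.0, rng=None):
    """One Euler-Maruyama step of the synaptic sampling SDE:
    d(theta) = beta * [grad log p_s(theta) + grad log V(theta)] dt + sqrt(2 beta T) dW."""
    rng = np.random.default_rng() if rng is None else rng
    drift = beta * (grad_log_prior(theta) + grad_log_value(theta))                # prior + reward-driven drift
    diffusion = np.sqrt(2.0 * beta * T * dt) * rng.standard_normal(theta.shape)   # temperature-scaled exploration noise
    return theta + drift * dt + diffusion
```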

In integration with open-source robotic simulators, SPORE demonstrates online learning in closed-loop visuomotor tasks, with performance sensitive to plasticity and exploration schedule regulation.

4. Dynamic Architectures: Neuroplastic Expansion and Structural Plasticity

Architectural adaptation forms a complementary avenue for optimizing plasticity and robustness. The Neuroplastic Expansion (NE) method (Liu et al., 10 Oct 2024) dynamically grows the network topology in response to observed gradient magnitudes, preferentially adding connections where learning signals are strong:

$$\mathbb{I}^l_{\text{grow}} = \text{TopK}_{i \notin \beta^l}\left(|\nabla_t \mathcal{L}_i|\right)$$

Concurrently, neurons exhibiting persistent inactivity are pruned using normalized activation measures. A consolidation mechanism—sampling earlier replay experiences when the dormant neuron ratio plateaus—prevents catastrophic forgetting and balances the plasticity–stability dilemma (Liu et al., 10 Oct 2024).
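
The growth criterion can be sketched roughly as follows; the mask/gradient representation and the number of connections grown per step are illustrative assumptions rather than NE's exact procedure.

```python
import numpy as np

def grow_connections(mask, grad, k):
    """Neuroplastic-expansion-style growth: activate the k currently inactive
    connections with the largest gradient magnitude (TopK over i not in the active set)."""
    scores = np.abs(grad) * (mask == 0)             # only consider dormant (inactive) connections
    grow_idx = np.argsort(scores, axis=None)[-k:]   # indices of the k strongest learning signals
    new_mask = mask.copy()
    new_mask.flat[grow_idx] = 1                     # add these connections to the topology
    return new_mask
```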

Experiments in MuJoCo and DeepMind Control Suite environments indicate that such dynamic topologies maintain higher active neuron ratios and enhance adaptability, outperforming fixed-topology and reset-based baselines.

5. Multi-Objective and Preference-Conditioned Continual Learning

Plasticity–stability trade-offs are formalized as a multi-objective optimization (MOO) problem in Pareto Continual Learning (ParetoCL) (Lai et al., 30 Mar 2025). The system learns a spectrum of solutions via a hypernetwork conditioned on preference vectors $\alpha$ that balance loss on new ($\mathcal{L}_{\text{new}}$) and replayed ($\mathcal{L}_{\text{replay}}$) data:

$$\min_\theta F(\theta) = \left[\mathcal{L}_{\text{replay}}(f(\theta)),\ \mathcal{L}_{\text{new}}(f(\theta))\right]$$

During inference, the model dynamically samples or chooses the trade-off that yields highest-confidence (minimum entropy) predictions, rapidly adjusting adaptation strategy to changes in distribution (Lai et al., 30 Mar 2025). Empirical results demonstrate improved accuracy and robustness compared to static-weighted experience replay, especially in online continual learning.
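
A rough sketch of the inference-time selection step is given below, assuming a small set of candidate networks already generated for different preference vectors; the minimum-entropy rule follows the description above, but the helper and its interface are hypothetical and do not reproduce ParetoCL's hypernetwork.

```python
import torch
import torch.nn.functional as F

def select_by_entropy(models, x):
    """Given candidate predictors f(theta(alpha)) for different preference vectors alpha,
    return the prediction with minimum entropy (highest confidence) on input x."""
    best_probs, best_entropy = None, float("inf")
    for model in models:
        probs = F.softmax(model(x), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean().item()
        if entropy < best_entropy:                   # keep the most confident trade-off
            best_probs, best_entropy = probs, entropy
    return best_probs
```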

6. Exploration–Adaptive Methods and Non-Stationarity Handling

Explicit exploration modulation is achieved by adaptively tuning policy parameters that govern stochasticity, optimism, and action consistency, using non-stationary multi-armed bandits to select behaviour modulations (Schaul et al., 2019). The learning progress is tracked by proxy signals (e.g., episodic return), and modulations are selected per-episode to maximize empirical learning gains:

$$LP_t(\Delta\theta) = \mathbb{E}_{s_0}\bigl[V^{\pi_{\theta_t+\Delta\theta}}(s_0) - V^{\pi_{\theta_t}}(s_0)\bigr]$$

A factored bandit structure increases adaptation speed. Empirical evaluation shows that this method achieves performance on par with per-task-tuned settings across diverse Atari domains (Schaul et al., 2019).
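
A simple sketch of bandit-based modulation selection is shown below, assuming an epsilon-greedy non-stationary bandit with a constant-step (recency-weighted) value estimate; the actual method uses a factored bandit over several modulation classes, which is not reproduced here.

```python
import numpy as np

class NonStationaryBandit:
    """Epsilon-greedy bandit with recency weighting, used to pick one behaviour
    modulation (e.g., an exploration noise level) per episode."""
    def __init__(self, n_arms, step=0.1, eps=0.1, rng=None):
        self.values = np.zeros(n_arms)
        self.step, self.eps = step, eps
        self.rng = np.random.default_rng() if rng is None else rng

    def select(self):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.values)))  # occasional random modulation
        return int(np.argmax(self.values))                   # otherwise the best-performing one

    def update(self, arm, learning_progress):
        # Constant step size tracks non-stationary learning progress (e.g., return improvement).
        self.values[arm] += self.step * (learning_progress - self.values[arm])
```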

Non-stationarity in RL—whether due to agent–environment interactions, opponent policy shifts (multi-agent settings), or explicit changes in rewards/dynamics—demands optimizers that both detect environmental drift (e.g., via critic uncertainty—as in evidential PPO (Akgül et al., 3 Mar 2025)) and modulate exploration to rapidly adapt to the new regime.

7. Benchmarking, Metrics, and Frameworks for Plasticity Research

Plasticine (Yuan et al., 24 Apr 2025) provides a comprehensive benchmarking framework featuring more than 13 mitigation strategies (resets, normalization, regularization, activation/optimizer changes) and 10+ evaluation metrics for analyzing plasticity and adaptability under increasingly non-stationary scenarios (from standard RL to open-ended lifelong environments). Metrics such as Ratio of Dormant Units, Fraction of Active Units, entropy, and representation quality (stable/effective rank) enable fine-grained assessment of plasticity loss and the efficacy of interventions. Progressive benchmark environments such as Craftax support rigorous scrutiny of plasticity in dynamic, procedurally generated tasks.
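
For instance, a dormant-unit ratio can be computed roughly as in the sketch below; the 0.025 threshold and the layer-mean normalization are common choices from the dormant-neuron literature and are not necessarily Plasticine's exact definition.

```python
import numpy as np

def dormant_unit_ratio(activations, tau=0.025):
    """Fraction of units whose mean absolute activation, normalized by the layer
    average, falls below a threshold tau (i.e., units contributing little signal)."""
    scores = np.abs(activations).mean(axis=0)       # mean |activation| per unit over a batch
    normalized = scores / (scores.mean() + 1e-8)    # normalize by the layer-wide average
    return float((normalized <= tau).mean())
```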

This unified framework enables systematic, reproducible comparison of methods targeting plasticity, exploration, and robustness to non-stationary distributions, supporting progress in lifelong and continual learning agents.


In summary, learned optimization for plasticity, exploration, and non-stationarity comprises a spectrum of methods—meta-learned and biologically-inspired update rules, adaptive architectures, multi-objective formulations, and dynamic exploration control—grounded in mathematical and algorithmic foundations. These approaches have been validated across high-dimensional RL, supervised learning, embodied control, and non-stationary bandit settings, with robust evidence that explicitly addressing adaptability, dormancy, and exploration yields substantial improvements over traditional static optimizers and hand-designed learning rules. Ongoing research leverages flexible feature conditioning, uncertainty quantification, dynamic topologies, and objective augmentation to further advance the field.