
Performative Reinforcement Learning Overview

Updated 18 March 2026
  • Performative Reinforcement Learning is a framework where an agent’s deployed policy actively alters the environment’s dynamics, leading to nonstationary MDPs.
  • It develops methods such as repeated retraining and policy gradients that converge to performatively stable or optimal policies under smoothness and regularization conditions.
  • PRL has practical applications in adaptive systems like recommender systems, financial markets, and multi-robot coordination where deployment feedback is crucial.

Performative Reinforcement Learning (PRL) is the study of reinforcement learning (RL) in settings where the agent’s policy influences, and is influenced by, the environment’s dynamics. Unlike classical RL, which assumes a stationary MDP, PRL explicitly models the feedback loop in which deploying a policy changes the reward and transition functions of the environment as a function of that policy. This setting captures practical situations such as recommender systems with adaptive users, financial markets with learning agents, and multi-robot systems with mutual adaptation. The formalism and algorithmic theory of PRL have evolved rapidly, moving from stability-oriented schemes to algorithms with provable convergence to performatively optimal policies.

1. Formal Framework and Mathematical Foundations

In PRL, the environment is no longer static but depends on the agent’s deployed policy. Formally, deploying policy $\pi$ induces an MDP $M(\pi) = (S, A, P^\pi, r^\pi, \gamma, \rho)$, where the transition kernel $P^\pi$ and reward $r^\pi$ satisfy smoothness and Lipschitz conditions in $\pi$ (Mandal et al., 2022, Mandal et al., 2024). The agent’s objective is typically the expected discounted return under the distribution induced by deploying $\pi$ in $M(\pi)$:

$$J(\pi) = \mathbb{E}_{\tau \sim (P^\pi, \pi)} \left[ \sum_{t=0}^\infty \gamma^t\, r^\pi(s_t, a_t) \right]$$
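To make the performative objective concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption, not a construction from the cited papers: a tiny tabular MDP whose transitions and rewards shift with the deployed policy, with the return of $\pi$ evaluated exactly in its own induced MDP $M(\pi)$.

```python
import numpy as np

# Illustrative toy: a 2-state, 2-action MDP whose transitions and rewards
# shift with the deployed policy. The performative map `induced_mdp` and all
# constants are assumptions for demonstration, not from the cited papers.
S, A, gamma = 2, 2, 0.9
rho = np.array([0.5, 0.5])                    # initial state distribution

def induced_mdp(pi, eps=0.3):
    """Hypothetical performative map: deploying pi perturbs a base MDP."""
    base_P = np.full((S, A, S), 0.5)          # uniform base transitions
    base_r = np.array([[1.0, 0.0], [0.0, 1.0]])
    drift = eps * (pi[:, 0] - 0.5)            # policy-dependent drift, shape (S,)
    P = base_P + drift[None, None, :]
    P /= P.sum(axis=2, keepdims=True)         # renormalize transition rows
    r = base_r - eps * pi                     # rewards also respond to pi
    return P, r

def performative_return(pi):
    """J(pi): discounted return of pi evaluated in its own induced MDP M(pi)."""
    P, r = induced_mdp(pi)
    P_pi = np.einsum('sa,sat->st', pi, P)     # policy-averaged transitions
    r_pi = np.einsum('sa,sa->s', pi, r)       # policy-averaged rewards
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)  # exact policy evaluation
    return rho @ v

pi_uniform = np.full((S, A), 0.5)
print(performative_return(pi_uniform))
```

Searching over $\pi$ with such an evaluator makes the distinction drawn below tangible: the global maximizer of `performative_return` is performatively optimal, while a fixed point of "best-respond to $M(\pi)$" is merely performatively stable.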

A crucial distinction arises between two solution concepts:

  • Performatively Stable (PS) Policy: $\pi_{PS}$ is a fixed point, optimal for the environment it induces: $\pi_{PS} \in \arg\max_{\pi'} J(\pi'; M(\pi_{PS}))$.
  • Performatively Optimal (PO) Policy: $\pi^*$ globally maximizes the performative objective: $\pi^* \in \arg\max_{\pi} J(\pi)$.

In general there can be a strict gap $J(\pi^*) - J(\pi_{PS}) > 0$, so retraining-based approaches can converge to suboptimal stable policies (Chen et al., 6 Oct 2025, Basu et al., 23 Dec 2025).

The occupancy-measure formulation is central in PRL. Let $d \in \mathbb{R}^{|S| \times |A|}$ denote the discounted occupancy measure; the regularized PRL objective is:

$$\max_{d \geq 0} \sum_{s,a} d(s,a)\, r_d(s,a) - \frac{\lambda}{2} \|d\|_2^2 \quad \text{s.t.} \quad \sum_a d(s,a) = \rho(s) + \gamma \sum_{s',a} d(s',a)\, P_d(s',a,s)$$

where $r_d$ and $P_d$ depend on $d$ via the induced policy $\pi^d$. Regularization ($\lambda > 0$) renders the problem strongly concave, facilitating the analysis of fixed points and algorithmic convergence (Mandal et al., 2022, Rank et al., 2024).
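The Bellman flow constraint above can be checked numerically. A small sketch, using an arbitrary random MDP as an illustrative assumption: compute the discounted occupancy of a fixed policy and verify that it satisfies the constraint.

```python
import numpy as np

# Sketch: compute the discounted occupancy d of a fixed policy in a random
# MDP and verify the Bellman flow constraint. The MDP here is an arbitrary
# illustrative example, not drawn from the cited papers.
S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
rho = np.full(S, 1.0 / S)                    # initial state distribution
pi = np.full((S, A), 0.5)                    # uniform policy

# The state occupancy mu solves mu = rho + gamma * P_pi^T mu.
P_pi = np.einsum('sa,sat->st', pi, P)
mu = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
d = mu[:, None] * pi                         # d(s, a) = mu(s) * pi(a | s)

# Flow constraint: sum_a d(s,a) = rho(s) + gamma * sum_{s',a} d(s',a) P(s',a,s)
lhs = d.sum(axis=1)
rhs = rho + gamma * np.einsum('sa,sat->t', d, P)
print(np.max(np.abs(lhs - rhs)))             # numerically zero
print(d.sum())                               # total mass 1 / (1 - gamma)
```

The unnormalized occupancy used here carries total mass $1/(1-\gamma)$, matching the constraint as written above.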

2. Algorithmic Paradigms for PRL

PRL algorithm design has evolved along two main trajectories: repeated retraining (fixed-point search) and direct policy optimization (gradient methods).

2.1 Repeated Retraining: Stability Approaches

The classical approach, analyzed in (Mandal et al., 2022), is Repeated Policy Optimization, or repeated retraining (RR): at each round $t$ the agent deploys $\pi_t$, observes the new environment $M(\pi_t)$, solves the regularized RL problem for $M(\pi_t)$, and sets $\pi_{t+1}$ accordingly. Accelerated variants include Delayed Repeated Retraining (DRR) and Mixed Delayed Repeated Retraining (MDRR) (Rank et al., 2024):

  • RR: Immediate retraining after each deployment.
  • DRR: Retraining every $k$ rounds, allowing the environment to evolve.
  • MDRR: Aggregates samples from multiple prior deployments with geometric weighting, improving sample efficiency in highly inertial environments.

Convergence of these methods to PS policies is established under contractive environment mappings and sufficient regularization. In the presence of gradual shifts (persistent environmental inertia), MDRR demonstrates significant gains in samples-per-deployment and speed of convergence (Rank et al., 2024).
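A minimal sketch of the retraining loop, under illustrative assumptions (the performative map, the entropy-regularized solver, and all constants are stand-ins, not the papers' exact construction):

```python
import numpy as np

S, A, gamma, lam = 2, 2, 0.9, 2.0            # lam: regularization strength

def induced_mdp(pi, eps=0.05):
    """Hypothetical performative map: small policy-dependent perturbation."""
    base_P = np.full((S, A, S), 0.5)
    base_r = np.array([[1.0, 0.0], [0.0, 1.0]])
    drift = eps * (pi[:, 0] - 0.5)
    P = base_P + drift[None, None, :]
    P /= P.sum(axis=2, keepdims=True)
    return P, base_r - eps * pi

def soft_solve(P, r, iters=500):
    """Entropy-regularized optimal policy of a *fixed* MDP via soft value iteration."""
    v = np.zeros(S)
    for _ in range(iters):
        q = r + gamma * P @ v                # q[s, a]
        v = lam * np.log(np.exp(q / lam).sum(axis=1))
    q = r + gamma * P @ v
    pi = np.exp((q - q.max(axis=1, keepdims=True)) / lam)
    return pi / pi.sum(axis=1, keepdims=True)

pi = np.full((S, A), 0.5)
for t in range(200):                         # repeated retraining (RR)
    P, r = induced_mdp(pi)                   # environment responds to deployment
    pi_next = soft_solve(P, r)               # retrain on the induced MDP
    if np.max(np.abs(pi_next - pi)) < 1e-10:
        break                                # performatively stable fixed point
    pi = pi_next
```

With a mild performative effect and strong regularization, the retrain map is contractive, so the loop converges linearly to a PS policy, mirroring the guarantees discussed above.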

2.2 Direct Policy Optimization: Towards Performative Optimality

Recent advances target the PO policy, overcoming the PS–PO gap. Two breakthrough algorithmic primitives are:

  • Performative Policy Gradient (PePG) (Basu et al., 23 Dec 2025): Extends policy gradient by accounting for how the environment dynamics $P_\theta$, $r_\theta$ depend on the policy parameter $\theta$. The gradient of the performative objective $J(\theta)$ is

$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^{P_\theta, \pi_\theta},\; s' \sim P_\theta(\cdot \mid s,a)} \left[ A^\theta_\theta(s,a) \left( \nabla_\theta \log \pi_\theta(a \mid s) + \nabla_\theta \log P_\theta(s' \mid s,a) \right) + \nabla_\theta r_\theta(s,a) \right]$$

PePG provably converges to an $\epsilon$-accurate PO policy under standard assumptions in time $O(|S||A|^2 / ((1-\gamma)^3 \epsilon^2))$, outperforming stability-seeking methods both theoretically and empirically.

  • Zeroth-Order Frank–Wolfe for PRL (0-FW) (Chen et al., 6 Oct 2025): Approximates performative policy optimization via bandit gradient estimates and a Frank–Wolfe step. Under a “regularizer-dominant” condition, 0-FW enjoys polynomial-time convergence to an $\epsilon$-optimal PO policy, circumventing the need for analytic gradients, which is crucial when $P^\pi$, $r^\pi$ are black-box.

Both approaches rely on strong regularization to induce gradient dominance (Polyak–Łojasiewicz) in the nonconvex PRL objective, enabling global convergence from arbitrary initialization.
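The zeroth-order ingredient can be illustrated in a few lines. Below is a generic two-point, sphere-sampling gradient estimator applied to a toy objective; the objective and all constants are assumptions for illustration, and 0-FW would pair such estimates with a Frank–Wolfe step over the occupancy polytope.

```python
import numpy as np

# Generic two-point, sphere-sampling zeroth-order gradient estimator.
# The objective J below is a toy stand-in (an assumption for illustration),
# not the performative return from the cited papers.
rng = np.random.default_rng(1)

def J(theta):
    return -np.sum((theta - 1.0) ** 2)       # smooth toy objective, max at theta = 1

def zo_gradient(J, theta, delta=1e-3, n_samples=200):
    """Estimate grad J(theta) from function values only (bandit feedback)."""
    d = theta.size
    g = np.zeros(d)
    for _ in range(n_samples):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)               # uniform direction on the sphere
        g += (J(theta + delta * u) - J(theta - delta * u)) / (2 * delta) * d * u
    return g / n_samples

theta = np.zeros(3)
g = zo_gradient(J, theta)
print(g)                                     # noisy estimate of the true gradient [2, 2, 2]
```

The estimator is unbiased for smooth objectives; its variance, which decays with `n_samples`, is what drives the polynomial (rather than faster) rates quoted above.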

3. Convergence Theory and Guarantees

The core analysis of PRL methods centers on contraction mappings induced by the environment-policy feedback and on regularization-induced strong concavity.

  • PS Policy Convergence: Repeated retraining (including RR, DRR, MDRR) achieves linear convergence to a performatively stable occupancy $d_*$ in tabular as well as linear MDPs, provided the environment maps are Lipschitz and the regularizer is sufficiently large (Mandal et al., 2022, Rank et al., 2024, Mandal et al., 2024). In environments with slow response (high inertia), aggregating historic data (MDRR) yields improved sample complexity.
  • PO Policy Convergence: With either PePG or 0-FW, and under regularizer dominance, the methods converge globally (to an $\epsilon$-close PO policy) at rates polynomial in $1/\epsilon$, $|S|$, $|A|$. The principal challenge is the mismatch between the fixed-point (stability) property and the global maximum (optimality), with nontrivial bias appearing in the former (Basu et al., 23 Dec 2025, Chen et al., 6 Oct 2025).
  • Robustness to Data Corruption: When empirical gradients are contaminated (e.g., under Huber’s $\epsilon$-contamination model), robust versions of repeated retraining using median-of-means or trimmed means ensure $O(\sqrt{\epsilon})$ last-iterate convergence to a PS policy (Pollatos et al., 8 May 2025).
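The robust-estimation ingredient can be sketched directly. A minimal median-of-means estimator, shown here as an illustrative component rather than the papers' full saddle-point subroutine:

```python
import numpy as np

# Minimal median-of-means estimator: an illustrative component of robust
# retraining, not the papers' full saddle-point subroutine.
def median_of_means(samples, n_blocks=10):
    """Average within blocks, then take the median of the block means.
    Robust as long as corrupted samples fall in fewer than half the blocks."""
    blocks = np.array_split(np.asarray(samples, dtype=float), n_blocks)
    return np.median([b.mean() for b in blocks])

rng = np.random.default_rng(0)
clean = rng.normal(loc=1.0, scale=0.1, size=950)
corrupt = np.full(50, 100.0)                 # 5% adversarial outliers
data = np.concatenate([clean, corrupt])

print(np.mean(data))                         # badly biased by the outliers
print(median_of_means(data))                 # close to the true mean 1.0
```

The plain sample mean is dragged far from the true value by a small contaminated fraction, while the median over block means discards the corrupted blocks, the same principle that yields the $O(\sqrt{\epsilon})$ guarantee above.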

4. Extensions and Generalizations

4.1 Linear and Function-Approximation Settings

Recent extensions analyze PRL under linear MDPs, where $P^\pi$ and $r^\pi$ are parameterized by features $\phi(s,a)$ (Mandal et al., 2024). While strong convexity is lost in this setting, new two-step recurrences yield last-iterate convergence guarantees and sample-complexity bounds scaling polynomially with the feature dimension $d$ rather than $|S|$. These results generalize to Stackelberg settings and partially to multi-follower games.

4.2 Multi-Agent Performative Games

In performative Markov potential games (MPGs), each agent’s local policy affects the global environment. The notion of performatively stable equilibrium (PSE) generalizes PS to the multi-agent domain (Sahitaj et al., 29 Apr 2025):

$$V^{\pi}_{i,\pi}(\rho) \geq \max_{\pi'_i} V^{\pi}_{i,(\pi'_i, \pi_{-i})}(\rho) - \epsilon \quad \forall i$$

Independent policy gradient and natural policy gradient algorithms converge to approximate ($\delta_{r,p}$-close) PSE, provided the performative effect is sufficiently smooth. Log-barrier regularization and occupancy-based retraining facilitate last-iterate finite-time convergence in the agent-independent case.

4.3 Robustness, Corruption, and Sample-Efficiency

PRL algorithms can be robustified against noisy samples and adversarial corruptions (Pollatos et al., 8 May 2025), with robust mean estimation in saddle-point subroutines being essential for finite-sample statistical stability.

When environmental inertia is high, collecting and leveraging all historic trajectory data, as formalized in MDRR, not only reduces sample complexity but also preserves convergence rates, closely matching the statistics-dominated regime of real-world sequential decision systems (Rank et al., 2024).
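As an illustration of the aggregation idea, geometrically decaying weights over past deployments might look as follows; the exact form $\beta^{t-i}$ is an assumption for demonstration, not the scheme of Rank et al., 2024.

```python
import numpy as np

# Illustrative geometric weighting over past deployments; the exact form
# beta**(t - i) is an assumption for demonstration, not the scheme of
# Rank et al., 2024.
def mdrr_weights(t, beta=0.7):
    """Normalized weight of the data from deployment i = 0..t at round t."""
    w = np.array([beta ** (t - i) for i in range(t + 1)])
    return w / w.sum()

w = mdrr_weights(4)
print(w)                                     # increasing: newest data weighted most
```

Older deployments contribute less because the environment has drifted since they were collected, yet they still add statistical power when the drift per round is small.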

5. Empirical Benchmarks and Insights

Benchmarks typically involve gridworlds augmented with performative effects, such as interventional agents whose responses depend stochastically or smoothly on the principal's actions (Mandal et al., 2022, Rank et al., 2024, Basu et al., 23 Dec 2025). Metrics of interest include:

  • Distance to last-epoch averages (for stability)
  • Number of retrainings and samples-per-deployment (for efficiency)
  • Performative return and occupancy difference (for optimality)


6. Significance and Implications

PRL models the feedback loop between policy optimization and environmental response, bridging RL with performative prediction and potential game theory. In settings where deployments affect future data distributions, standard RL underestimates risk and may converge to suboptimal policies. The PRL framework formalizes both stability (fixed-point) and optimality (global maximization), provides methods scaling to feature-rich MDPs, and demonstrates that sample-efficient and robust learning is possible by exploiting prior deployments and regularization.

Ongoing research directions include extending PRL guarantees to non-linear function approximators (e.g., deep RL), developing finer-grained theories for multi-agent performative equilibria, and integrating PRL with online, adversarial, and non-stationary settings (Mandal et al., 2022, Rank et al., 2024, Mandal et al., 2024, Basu et al., 23 Dec 2025, Chen et al., 6 Oct 2025, Pollatos et al., 8 May 2025, Sahitaj et al., 29 Apr 2025).
