Performative Reinforcement Learning Overview
- Performative Reinforcement Learning is a framework where an agent’s deployed policy actively alters the environment’s dynamics, leading to nonstationary MDPs.
- The framework develops methods, such as repeated retraining and policy-gradient schemes, that converge to performatively stable or performatively optimal policies under smoothness and regularization conditions.
- PRL has practical applications in adaptive systems like recommender systems, financial markets, and multi-robot coordination where deployment feedback is crucial.
Performative Reinforcement Learning (PRL) is the study of reinforcement learning (RL) in settings where the agent’s policy influences, and is influenced by, the environment’s dynamics. Unlike classical RL, which assumes a stationary MDP, PRL explicitly models the feedback loop where deploying a policy causes the reward and transition functions of the environment to change as a function of that policy. This setting captures practical situations such as recommender systems with adaptive users, financial markets with learning agents, and multi-robot systems with mutual adaptation. The formalism and algorithmic theory of PRL have evolved rapidly, moving from stability-oriented schemes to algorithms with provable convergence to performatively optimal policies.
1. Formal Framework and Mathematical Foundations
In PRL, the environment is no longer static but depends on the agent’s deployed policy. Formally, deploying a policy $\pi$ induces an MDP $M(\pi) = (S, A, P_\pi, r_\pi, \gamma)$, where the transition kernel $P_\pi$ and reward $r_\pi$ satisfy smoothness and Lipschitz conditions in $\pi$ (Mandal et al., 2022, Mandal et al., 2024). The agent’s objective is typically the expected discounted return of $\pi$ under the environment induced by deploying $\pi$:

$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r_\pi(s_t, a_t) \;\middle|\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P_\pi(\cdot \mid s_t, a_t)\right].$$
A crucial distinction arises between two solution concepts; write $J(\pi'; M(\pi))$ for the return of $\pi'$ evaluated in the environment induced by deploying $\pi$:
- Performatively Stable (PS) Policy: $\pi_{\mathrm{PS}}$ is a fixed point, optimal for the environment it induces: $\pi_{\mathrm{PS}} \in \arg\max_{\pi'} J(\pi'; M(\pi_{\mathrm{PS}}))$.
- Performatively Optimal (PO) Policy: $\pi_{\mathrm{PO}}$ maximizes the return a policy attains in its own induced environment: $\pi_{\mathrm{PO}} \in \arg\max_{\pi} J(\pi; M(\pi))$.
In general there can be a strict gap $J(\pi_{\mathrm{PO}}; M(\pi_{\mathrm{PO}})) > J(\pi_{\mathrm{PS}}; M(\pi_{\mathrm{PS}}))$, so retraining-based approaches can converge to suboptimal stable policies (Chen et al., 6 Oct 2025, Basu et al., 23 Dec 2025).
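The PS–PO gap is visible already in a minimal one-state example (a hypothetical illustration of ours, not taken from the cited papers): an action becomes less rewarding the more often it is deployed, so the retraining fixed point overuses it relative to the performative optimum.

```python
import numpy as np

# Hypothetical one-state performative bandit: deploying a mixed policy
# p = P(action 1) induces rewards r1(p) = 1 - p (the action degrades with use)
# and a constant r0 = 0.5.

def rewards(p):
    return 0.5, 1.0 - p                      # (r0, r1) induced by deploying p

def performative_return(p):
    r0, r1 = rewards(p)
    return p * r1 + (1 - p) * r0             # J(p): return of p in M(p)

# Performatively stable point: best response to its own induced rewards.
# Here r1(p) = r0 exactly at p = 0.5, which is the retraining fixed point.
p_stable = 0.5

# Performatively optimal point: maximize J(p) = 0.5 + 0.5*p - p**2 directly.
grid = np.linspace(0.0, 1.0, 10001)
p_opt = grid[np.argmax([performative_return(p) for p in grid])]

print(f"J(PS) = {performative_return(p_stable):.4f}")                    # 0.5000
print(f"p_PO = {p_opt:.2f}, J(PO) = {performative_return(p_opt):.4f}")   # 0.25, 0.5625
```

The stable policy earns 0.5 while the performative optimum, which deliberately underplays the degrading action, earns 0.5625.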
The occupancy-measure formulation is central in PRL. Let $d^\pi$ denote the discounted state-action occupancy measure of $\pi$ in its induced environment; the regularized PRL objective over occupancies $d$ is

$$\max_{d}\; \sum_{s,a} d(s,a)\, r_d(s,a) \;-\; \lambda \|d\|_2^2, \qquad \text{s.t.}\quad \sum_a d(s,a) = (1-\gamma)\,\mu(s) + \gamma \sum_{s',a'} P_d(s \mid s', a')\, d(s',a'),$$

where the reward $r_d$ and kernel $P_d$ depend on the policy via its occupancy $d$. Regularization ($\lambda > 0$) renders the problem strongly concave, facilitating the analysis of fixed points and algorithmic convergence (Mandal et al., 2022, Rank et al., 2024).
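For a fixed policy in a tabular MDP, the discounted occupancy solves the Bellman flow equations in closed form; a minimal sketch (the toy numbers and helper names are our own):

```python
import numpy as np

# Discounted occupancy measure and the regularized objective for a tabular MDP.
# Shapes: P[s, a, s'] transition kernel, r[s, a] reward, pi[s, a] policy,
# mu[s] initial state distribution.

def occupancy(P, pi, mu, gamma):
    S = P.shape[0]
    P_pi = np.einsum('sap,sa->sp', P, pi)            # state-to-state kernel under pi
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu)
    return d_s[:, None] * pi                         # d(s, a) = d_s(s) * pi(a | s)

def regularized_objective(d, r, lam):
    # Strongly concave in d for lam > 0, as in the occupancy-based PRL objective.
    return np.sum(d * r) - lam * np.sum(d ** 2)

# Two-state, two-action example.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
mu = np.array([1.0, 0.0])

d = occupancy(P, pi, mu, gamma=0.9)
print(f"occupancy mass: {d.sum():.4f}")              # sums to 1 by construction
print(f"regularized objective: {regularized_objective(d, r, lam=0.1):.4f}")
```

In the performative setting, `P` and `r` would themselves be recomputed from `d` between retraining rounds.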
2. Algorithmic Paradigms for PRL
PRL algorithm design has evolved along two main trajectories: repeated retraining (fixed-point search) and direct policy optimization (gradient methods).
2.1 Repeated Retraining: Stability Approaches
The classical approach, analyzed in (Mandal et al., 2022), is Repeated Policy Optimization (RPO), wherein at each round $t$ the agent deploys $\pi_t$, observes the newly induced environment $M(\pi_t)$, solves the regularized RL problem in that environment, and sets $\pi_{t+1}$ to the solution. Accelerated variants include Delayed Repeated Retraining (DRR) and Mixed Delayed Repeated Retraining (MDRR) (Rank et al., 2024):
- RR: Immediate retraining after each deployment.
- DRR: Retraining only every $k$ rounds, allowing the environment to evolve between retrainings.
- MDRR: Aggregates samples from multiple prior deployments with geometric weighting, improving sample efficiency in highly inertial environments.
Convergence of these methods to PS policies is established under contractive environment mappings and sufficient regularization. In the presence of gradual shifts (persistent environmental inertia), MDRR demonstrates significant gains in samples-per-deployment and speed of convergence (Rank et al., 2024).
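The retraining loop can be sketched on the toy performative bandit above; this is a hypothetical illustration of the fixed-point scheme, not the papers' exact construction. Deploying $p$ induces rewards, and each round the agent best-responds, with entropy regularization of strength `lam`, to the environment its last deployment created:

```python
import math

# Repeated retraining (RR) on a toy performative bandit: deploying p = P(action 1)
# induces rewards r0 = 0.5 and r1(p) = 1 - p.

def induced_rewards(p):
    return 0.5, 1.0 - p

def regularized_best_response(r0, r1, lam):
    # argmax_q [q*r1 + (1-q)*r0 + lam*H(q)] has the logistic (softmax) form below
    return 1.0 / (1.0 + math.exp(-(r1 - r0) / lam))

p = 0.9                                      # initial deployment
for t in range(100):
    r0, r1 = induced_rewards(p)              # environment responds to deployment
    p_next = regularized_best_response(r0, r1, lam=0.5)
    if abs(p_next - p) < 1e-10:
        break
    p = p_next

print(f"performatively stable p = {p:.4f} after {t + 1} rounds")   # converges to 0.5
```

With this regularization strength the retraining map is a contraction, so the iterates converge linearly to the stable point $p = 0.5$, which (as in the earlier example) is not the performative optimum.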
2.2 Direct Policy Optimization: Towards Performative Optimality
Recent advances target the PO policy, overcoming the PS–PO gap. Two breakthrough algorithmic primitives are:
- Performative Policy Gradient (PePG) (Basu et al., 23 Dec 2025): Extends the policy gradient by accounting for how the environment dynamics $P_\theta$ and reward $r_\theta$ depend on the policy parameter $\theta$. Schematically (in REINFORCE form), the gradient of the performative objective augments the classical score-function term with terms for the policy-dependent transitions and rewards:

$$\nabla_\theta J(\theta) = \mathbb{E}_\tau\!\left[\left(\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) + \sum_{t} \nabla_\theta \log P_\theta(s_{t+1} \mid s_t, a_t)\right) \sum_{t} \gamma^t r_\theta(s_t, a_t) + \sum_{t} \gamma^t \nabla_\theta r_\theta(s_t, a_t)\right].$$
PePG provably converges to a PO policy under standard assumptions in polynomial time, outperforming stability-seeking methods both theoretically and empirically.
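The effect of the extra performative term can be shown on a one-parameter version of the toy bandit (a hypothetical example of ours, not the PePG algorithm itself): the policy is $\pi_\theta(1) = \sigma(\theta)$, and deploying $\theta$ induces $r_1 = 1 - \sigma(\theta)$, $r_0 = 0.5$.

```python
import math

# Gradient ascent with a performative correction term on a one-parameter bandit.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def performative_gradient(theta):
    p = sigmoid(theta)
    dp = p * (1.0 - p)                 # d pi_theta(1) / d theta
    r0, r1 = 0.5, 1.0 - p
    standard = (r1 - r0) * dp          # classical term: rewards held fixed
    performative = p * (-dp)           # extra term: pi(1) * d r1 / d theta
    return standard + performative

theta = 2.0
for _ in range(5000):
    theta += 0.5 * performative_gradient(theta)   # gradient ascent

p_final = sigmoid(theta)
print(f"ascent with performative term reaches p = {p_final:.3f}")   # 0.250, the PO point
```

Ascending only the `standard` term would instead stall at the stable point $p = 0.5$; the performative term is exactly what steers the iterates to the optimum $p = 0.25$.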
- Zeroth-Order Frank–Wolfe for PRL (0-FW) (Chen et al., 6 Oct 2025): Approximates performative policy optimization via bandit gradient estimates and a Frank–Wolfe step. Under a “regularizer-dominant” condition, 0-FW enjoys polynomial-time convergence to an $\epsilon$-optimal PO policy, circumventing the need for analytic gradients, which is crucial when $P_\pi$ and $r_\pi$ are black-box.
Both approaches rely on strong regularization to induce gradient dominance (Polyak–Łojasiewicz) in the nonconvex PRL objective, enabling global convergence from arbitrary initialization.
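The zeroth-order Frank–Wolfe idea can be sketched on the same toy performative return, queried only as a black box; this is our own standalone construction under those assumptions, not the published 0-FW algorithm:

```python
# Zeroth-order Frank-Wolfe on J(p) = 0.5 + 0.5*p - p**2 over the feasible set [0, 1],
# a stand-in for the setting where P_pi and r_pi admit no analytic gradients.

def J(p):                            # black-box performative objective
    return 0.5 + 0.5 * p - p * p

def zeroth_order_grad(p, delta=1e-3):
    # two-point bandit estimate of dJ/dp using only function evaluations
    lo, hi = max(p - delta, 0.0), min(p + delta, 1.0)
    return (J(hi) - J(lo)) / (hi - lo)

p = 0.9
for t in range(20000):
    g = zeroth_order_grad(p)
    s = 1.0 if g > 0 else 0.0        # linear maximization oracle over [0, 1]
    eta = 2.0 / (t + 2)              # standard Frank-Wolfe step size
    p = (1.0 - eta) * p + eta * s    # convex-combination update stays feasible

print(f"0-FW-style iterate p = {p:.3f}")   # approaches 0.25, the performative optimum
```

The convex-combination update never leaves the feasible set, which is why Frank–Wolfe pairs naturally with occupancy-measure (simplex-constrained) formulations.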
3. Convergence Theory and Guarantees
The core analysis of PRL methods centers on contraction mappings induced by the environment-policy feedback and on regularization-induced strong concavity.
- PS Policy Convergence: Repeated retraining (including RR, DRR, MDRR) achieves linear convergence to a performatively stable policy in tabular as well as linear MDPs, provided the environment maps are Lipschitz and the regularization parameter is sufficiently large (Mandal et al., 2022, Rank et al., 2024, Mandal et al., 2024). In environments with slow response (high inertia), aggregating historical data (MDRR) yields improved sample complexity.
- PO Policy Convergence: With either PePG or 0-FW, and under regularizer dominance, the methods converge globally (to an $\epsilon$-close PO policy) at rates polynomial in $1/\epsilon$. The principal challenge is the mismatch between the fixed-point (stability) property and the global maximum (optimality), with nontrivial bias appearing in the former (Basu et al., 23 Dec 2025, Chen et al., 6 Oct 2025).
- Robustness to Data Corruption: When empirical gradients are contaminated (e.g., under Huber’s $\epsilon$-contamination model), robust versions of repeated retraining using median-of-means or trimmed-mean estimators ensure last-iterate convergence to a PS policy (Pollatos et al., 8 May 2025).
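A median-of-means estimator of the kind used inside such robust subroutines can be sketched in a few lines; this standalone demo (numbers and names are our own) contrasts it with the naive mean under contamination:

```python
import random
import statistics

# Median-of-means for robust mean (gradient) estimation under contamination.
# It needs enough blocks that fewer than half of them can contain an outlier.

def median_of_means(samples, n_blocks):
    k = len(samples) // n_blocks
    block_means = [sum(samples[i * k:(i + 1) * k]) / k for i in range(n_blocks)]
    return statistics.median(block_means)

random.seed(0)
true_grad = 1.0
clean = [random.gauss(true_grad, 0.5) for _ in range(990)]
outliers = [1000.0] * 10                 # 1% adversarial corruption
samples = clean + outliers
random.shuffle(samples)

naive = sum(samples) / len(samples)      # pulled far from 1.0 by the outliers
robust = median_of_means(samples, n_blocks=25)

print(f"naive mean      = {naive:.2f}")
print(f"median-of-means = {robust:.2f}")   # close to the true gradient 1.0
```

With 10 outliers and 25 blocks, at most 10 block means are corrupted, so the median lands on a clean block; the naive mean, by contrast, is dragged an order of magnitude away from the truth.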
4. Extensions and Generalizations
4.1 Linear and Function-Approximation Settings
Recent extensions analyze PRL under linear MDPs, where the transition kernel and reward are parameterized by feature maps (Mandal et al., 2024). While strong convexity is lost in this setting, new two-step recurrences yield last-iterate convergence guarantees and sample-complexity bounds scaling polynomially with the feature dimension rather than the size of the state space. These results generalize to Stackelberg settings and partially to multi-follower games.
4.2 Multi-Agent Performative Games
In performative Markov potential games (MPGs), each agent’s local policy affects the global environment. The notion of a performatively stable equilibrium (PSE) generalizes PS to the multi-agent domain (Sahitaj et al., 29 Apr 2025): a joint policy $\pi^{\ast}$ is a PSE if, for every agent $i$, $\pi_i^{\ast}$ is a best response to $\pi_{-i}^{\ast}$ in the environment induced by $\pi^{\ast}$ itself.
Independent policy gradient and natural policy gradient algorithms converge to an approximate ($\epsilon$-close) PSE, provided the performative effect is sufficiently smooth. Log-barrier regularization and occupancy-based retraining facilitate last-iterate finite-time convergence in the agent-independent case.
4.3 Robustness, Corruption, and Sample-Efficiency
PRL algorithms have been shown to be robustified against noisy samples and adversarial corruptions (Pollatos et al., 8 May 2025), with robust mean estimation in saddle-point subroutines being essential for finite-sample statistical stability.
When environmental inertia is high, collecting and leveraging all historical trajectory data, as formalized in MDRR, not only reduces sample complexity but also preserves convergence rates; this closely matches the data-limited regime of real-world sequential decision systems (Rank et al., 2024).
5. Empirical Benchmarks and Insights
Benchmarks typically involve gridworlds augmented with performative effects, such as interventional agents whose responses depend stochastically or smoothly on the principal's actions (Mandal et al., 2022, Rank et al., 2024, Basu et al., 23 Dec 2025). Metrics of interest include:
- Distance to last-epoch averages (for stability)
- Number of retrainings and samples-per-deployment (for efficiency)
- Performative return and occupancy difference (for optimality)
Salient empirical findings include:
- MDRR achieves substantially faster convergence (4–6 updates) than DRR (10–12) and RR (20–25) in high-inertia environments (Rank et al., 2024).
- PePG and 0-FW close the PS–PO gap, attaining higher performative returns than stability-only methods (Basu et al., 23 Dec 2025, Chen et al., 6 Oct 2025).
- Robust retraining recovers stability under adversarial contamination, whereas naïve averaging fails (Pollatos et al., 8 May 2025).
6. Significance and Implications
PRL models the feedback loop between policy optimization and environmental response, bridging RL with performative prediction and potential game theory. In settings where deployments affect future data distributions, standard RL underestimates risk and may converge to suboptimal policies. The PRL framework formalizes both stability (fixed-point) and optimality (global maximization), provides methods scaling to feature-rich MDPs, and demonstrates that sample-efficient and robust learning is possible by exploiting prior deployments and regularization.
Ongoing research directions include extending PRL guarantees to non-linear function approximators (e.g., deep RL), developing finer-grained theories for multi-agent performative equilibria, and integrating PRL with online, adversarial, and non-stationary settings (Mandal et al., 2022, Rank et al., 2024, Mandal et al., 2024, Basu et al., 23 Dec 2025, Chen et al., 6 Oct 2025, Pollatos et al., 8 May 2025, Sahitaj et al., 29 Apr 2025).