Performative Reinforcement Learning

Updated 13 October 2025
  • Performative Reinforcement Learning is a framework where the deployed policy reshapes the environment by influencing transition probabilities and reward functions.
  • Algorithms such as repeated retraining and zeroth-order Frank–Wolfe address non-stationarity and gradient challenges to approach performatively optimal policies.
  • Real-world applications in recommender, autonomous, and multi-agent systems leverage PRL to achieve robust performance with theoretical convergence guarantees.

Performative reinforcement learning (PRL) is a reinforcement learning framework in which the deployed policy does not merely interact with a fixed environment but actively influences the environment's dynamics, specifically the transition probabilities and reward functions. This feedback alters the subsequent data distribution, reward structure, or system response, so the agent's present choices (policy) shape the future environment it faces. PRL is especially relevant in real-world domains such as recommender systems, autonomous systems, or any setting where decisions influence the behavior or configuration of the underlying environment. The rigorous study of PRL reveals new theoretical and algorithmic challenges, separating it from conventional RL frameworks predicated on stationary environments.

1. Formal Framework and Foundational Concepts

PRL generalizes the Markov Decision Process (MDP) by allowing the environment, characterized by transition kernel $P$ and reward function $r$, to depend on the deployed policy $\pi$. Formally, executing $\pi$ induces a new MDP $M(\pi) = (\mathcal{S}, \mathcal{A}, P_{\pi}, r_{\pi}, \rho)$. Unlike standard RL, where $\pi$ optimizes the value function with respect to static environment parameters, here both $P_{\pi}$ and $r_{\pi}$ vary as a direct function of $\pi$, typically mediated through the state–action occupancy measure $d$ associated with $\pi$ (Mandal et al., 2022).

Two key notions underpin PRL:

  • Performatively Stable (PS) Policy: A policy is PS if it is optimal in the environment that it itself induces; that is, redeploying $\pi$ in $M(\pi)$ yields no further policy improvement. The occupancy measure $d^*$ is PS if it solves the problem:

$$d^* \in \underset{d' \in \mathcal{C}(d^*)}{\arg\max} \left\{ \sum_{s,a} d'(s,a)\, r_{d^*}(s,a) - \frac{\lambda}{2}\|d'\|_2^2 \right\}$$

where $\mathcal{C}(d^*)$ encodes the Bellman flow constraints under $P_{d^*}$ (Mandal et al., 2022).

  • Performatively Optimal (PO) Policy: A policy $\pi_{PO}$ that globally maximizes the original value function, fully accounting for its impact on the environment. There can be a provable positive constant gap between PS and PO policies (Chen et al., 6 Oct 2025).

This conceptual distinction arises because a PS policy can be suboptimal when the goal is to maximize long-term return while considering dynamic environmental feedback.
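
To make this feedback loop concrete, the following minimal Python sketch constructs a small tabular MDP whose transitions and rewards drift with the occupancy measure of the deployed policy. The response map `induced_mdp` and the drift constants `eps_p`, `eps_r` are hypothetical illustrations of the sensitivity parameters used in the analyses, not a model taken from the cited papers.

```python
import numpy as np

S, A, GAMMA = 3, 2, 0.9
P0 = np.full((S, A, S), 1.0 / S)                     # base transition kernel P0(s' | s, a)
R0 = np.random.default_rng(0).random((S, A))         # base reward r0(s, a)

def induced_mdp(d, eps_p=0.05, eps_r=0.1):
    """Hypothetical performative response map d -> (P_d, r_d): the MDP M(pi) observed
    after deploying a policy with occupancy measure d."""
    r = R0 - eps_r * d                               # heavily visited pairs lose reward
    state_occ = d.sum(axis=1)                        # d(s) = sum_a d(s, a)
    P = (1 - eps_p) * P0 + eps_p * state_occ[None, None, :]  # transitions tilt toward occupied states
    return P / P.sum(axis=2, keepdims=True), r       # renormalize each P(. | s, a)

def occupancy_of(pi, P, rho):
    """Discounted state-action occupancy of policy pi in a fixed MDP (P, rho)."""
    P_pi = np.einsum('sap,sa->sp', P, pi)            # state-to-state kernel under pi
    d_state = (1 - GAMMA) * np.linalg.solve(np.eye(S) - GAMMA * P_pi.T, rho)
    return d_state[:, None] * pi                     # d(s, a) = d(s) * pi(a | s)

# Deploying the uniform policy changes the environment it will face next:
rho = np.full(S, 1.0 / S)
pi_uniform = np.full((S, A), 1.0 / A)
P_next, r_next = induced_mdp(occupancy_of(pi_uniform, P0, rho))
```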

2. Challenges in Learning and Optimization

The principal challenges in PRL stem from the non-stationarity and policy-dependence of the environment:

  • Convergence to PS Points: Most existing methods, such as repeated retraining or projected gradient ascent in the occupancy space, converge to PS policies, leaving a fixed performance gap from the PO policy (Mandal et al., 2022, Mandal et al., 7 Nov 2024).
  • Nonconvexity and Gradient Estimation: The performative value function is nonconvex in the policy due to the implicit dependence of $P$ and $r$ on $\pi$. Gradient computation is challenging since changes to $\pi$ alter both the policy's action distribution and the induced environment; thus, even simple occupancy-gradient updates become insufficient for global optimality (Chen et al., 6 Oct 2025).
  • Sample Efficiency and Distribution Shift: Only on-policy samples from $M(\pi)$ are available, precluding straightforward off-policy evaluation and exacerbating sample complexity (Lin et al., 2023).
  • Finite Sample and Corruption Robustness: In the finite-sample regime, environmental feedback and collected trajectories may be adversarially corrupted, necessitating robust algorithms adapted to Huber's $\epsilon$-contamination model (Pollatos et al., 8 May 2025).
  • Function Approximation and Scalability: PRL in large or continuous spaces raises profound issues. In linear MDPs, the lack of strong convexity and the need for generalization beyond tabular settings require specialized primal–dual and saddle point methods (Mandal et al., 7 Nov 2024).

3. Algorithmic Solutions and Theoretical Guarantees

Repeated Retraining and Dual Perspective:

The canonical approach studied in (Mandal et al., 2022) optimizes a strongly regularized objective in the occupancy measure, alternating between policy deployment (to induce $M(\pi)$ and gather new samples) and policy update. The analysis, leveraging the dual (Lagrangian) formulation, shows that:

  • With strongly convex regularization ($\lambda > 0$) and sensitivity constants ($\epsilon_r$, $\epsilon_p$) on $r$ and $P$ (Lipschitz continuity w.r.t. $d$), repeated retraining converges linearly to a PS point:

$$\|d_t - d_S\|_2 \leq \delta \quad \text{for } t \geq (1-\mu)^{-1} \log\bigl(2/\delta(1-\gamma)\bigr)$$

where $\mu = \bigl(12|S|^{3/2}(2\epsilon_r + 5|S|\epsilon_p)\bigr)/\bigl(\lambda(1-\gamma)^4\bigr)$.
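
A minimal sketch of this retraining loop is shown below (using numpy and cvxpy), assuming exact access to the MDP induced at each deployment; the finite-sample analyses instead estimate it from on-policy trajectories. The inner step solves the regularized occupancy objective as a concave quadratic program over the Bellman flow polytope, and `env_response` is a placeholder for any performative response map, such as the toy one sketched in Section 1.

```python
import numpy as np
import cvxpy as cp

def retrain_step(P, r, rho, gamma, lam):
    """Solve max_d <d, r> - (lam/2)||d||^2 over the Bellman flow polytope of (P, rho)."""
    S, A = r.shape
    d = cp.Variable(S * A, nonneg=True)              # occupancy, flattened in (s, a) order
    P_mat = P.reshape(S * A, S)                      # P_mat[(s, a), s'] = P(s' | s, a)
    marginal = np.kron(np.eye(S), np.ones((1, A)))   # maps flattened d to its state marginal
    flow = [marginal @ d == (1 - gamma) * rho + gamma * (P_mat.T @ d)]
    objective = cp.Maximize(r.reshape(-1) @ d - (lam / 2) * cp.sum_squares(d))
    cp.Problem(objective, flow).solve()
    return d.value.reshape(S, A)

def repeated_retraining(env_response, d_init, rho, gamma=0.9, lam=1.0, iters=50):
    """Alternate deployment (observe the induced MDP) and retraining; the fixed point
    of this loop is a performatively stable occupancy measure."""
    d = d_init
    for _ in range(iters):
        P, r = env_response(d)                       # environment induced by the current policy
        d_next = retrain_step(P, r, rho, gamma, lam)
        if np.linalg.norm(d_next - d) < 1e-6:        # converged to a PS point
            return d_next
        d = d_next
    return d
```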

Zeroth-order Frank–Wolfe (0-FW) for PO:

(Chen et al., 6 Oct 2025) overcomes the PS gap by introducing the 0-FW algorithm. This method estimates the performative policy gradient via two-point function evaluations and updates the policy over a compact, convex subspace $\Pi_{\Delta}$:

  • Zeroth-order gradient estimator:

$$\widehat{g}_{\lambda, \delta}(\pi) = \frac{|\mathcal{S}|(|\mathcal{A}|-1)}{2N\delta} \sum_{i=1}^N \left( \widehat{V}_{\lambda, \pi+\delta u_i}^{(\pi+\delta u_i)} - \widehat{V}_{\lambda, \pi-\delta u_i}^{(\pi-\delta u_i)} \right) u_i$$

  • Frank–Wolfe update:

$$\pi_{t+1} = \pi_t + \beta (\tilde{\pi}_t - \pi_t)$$

with $\tilde{\pi}_t$ the maximizer of the linearization over $\Pi_{\Delta}$.

Theoretical analysis proves a gradient dominance property: within $\Pi_{\Delta}$, stationary points are PO, and $\|\cdot\|$-stationarity $< \epsilon$ implies PO policy suboptimality $< O(\epsilon)$. The algorithm achieves polynomial-time convergence.
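
The two ingredients can be sketched as follows; `value_fn` stands in for a rollout-based estimate of the regularized performative value obtained by deploying the perturbed policy, and the shrinkage parameter `floor` plays the role of $\Delta$. The sketch follows the structure of the estimator and update above, but the details (direction sampling, step sizes) are illustrative rather than the paper's exact implementation.

```python
import numpy as np

def zeroth_order_gradient(value_fn, pi, delta, n_samples, rng):
    """Two-point zeroth-order estimate of the performative policy gradient.
    delta must be small enough that pi +/- delta * u stays a valid policy."""
    S, A = pi.shape
    dim = S * (A - 1)                        # intrinsic dimension |S|(|A|-1)
    grad = np.zeros_like(pi)
    for _ in range(n_samples):
        u = rng.standard_normal((S, A))
        u -= u.mean(axis=1, keepdims=True)   # project onto the simplex tangent space
        u /= np.linalg.norm(u)               # uniform direction on the unit sphere
        diff = value_fn(pi + delta * u) - value_fn(pi - delta * u)
        grad += (dim / (2.0 * delta)) * diff * u
    return grad / n_samples

def frank_wolfe_step(pi, grad, beta, floor):
    """Linear maximization over the shrunk simplex {pi(.|s) >= floor}, then a convex step."""
    S, A = pi.shape
    target = np.full((S, A), floor)
    target[np.arange(S), grad.argmax(axis=1)] += 1.0 - A * floor
    return pi + beta * (target - pi)
```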

Linear MDPs and Primal–Dual Algorithms:

(Mandal et al., 7 Nov 2024) extends convergence analysis to linear MDPs, constructing an empirical Lagrangian and proposing a primal–dual algorithm that operates efficiently with function approximation. The key is a new recurrence relation for the suboptimality gap:

$$\|d_{t+1} - d_S\|_2 \leq \beta_1 \|d_t - d_S\|_2 + \beta_2 \|d_{t-1} - d_S\|_2$$

which, for sufficiently large $\lambda$ and bounded sensitivity parameters, guarantees geometric convergence to PS.
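
As a quick numerical illustration (not from the paper), iterating this bound shows why the regime $\beta_1 + \beta_2 < 1$, obtained for large enough $\lambda$, forces geometric decay: the dominant root of $x^2 = \beta_1 x + \beta_2$ then lies strictly inside the unit interval.

```python
def recurrence_bound(beta1, beta2, e0=1.0, e1=1.0, iters=30):
    """Iterate e_{t+1} = beta1 * e_t + beta2 * e_{t-1}, the worst case of the bound."""
    errs = [e0, e1]
    for _ in range(iters):
        errs.append(beta1 * errs[-1] + beta2 * errs[-2])
    return errs

# With beta1 + beta2 < 1 the bound contracts geometrically
# (the dominant root of x^2 = 0.5 x + 0.3 is ~0.85, so errors shrink like 0.85^t).
print(recurrence_bound(0.5, 0.3)[-1])
```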

Finite Sample and Robust Estimation:

Under limited, possibly corrupted samples, robust mean estimators are used for gradient estimation within convex–concave optimization frameworks (Pollatos et al., 8 May 2025). Under Huber's $\epsilon$-contamination, the robust optimistic FTRL algorithm guarantees:

$$\text{Duality gap} \leq O\left( \frac{1}{T} + \sqrt{\epsilon} \right)$$

addressing corruption robustness.
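
For intuition, the snippet below shows one generic robust mean estimator, a coordinate-wise trimmed mean, that could be plugged into such gradient estimates under $\epsilon$-contamination; it is a simple stand-in, not necessarily the estimator used in (Pollatos et al., 8 May 2025).

```python
import numpy as np

def trimmed_mean(samples, eps):
    """Coordinate-wise trimmed mean: drop the eps-fraction of smallest and largest
    values in each coordinate before averaging, limiting the effect of corrupted rows."""
    samples = np.asarray(samples, dtype=float)       # shape (n_samples, dim)
    n = samples.shape[0]
    k = int(np.ceil(eps * n))
    if n <= 2 * k:                                   # too few samples to trim safely
        return samples.mean(axis=0)
    return np.sort(samples, axis=0)[k:n - k].mean(axis=0)
```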

4. Extensions: Gradual Shift, Multi-Agent, and Real-World Feedback

Gradual Environmental Change:

(Rank et al., 15 Feb 2024) introduces scenarios where the environment changes gradually across deployments, reflecting lagged adaptation that classic PRL does not capture. Algorithms such as Mixed Delayed Repeated Retraining (MDRR) aggregate samples across deployments with variable weights. MDRR provably reduces the number of retrainings and sample complexity compared to standard retraining, and experiments show superior convergence under slow environmental shift.
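
A schematic sketch of the sample-aggregation idea follows; the helper `mixed_estimate` and its treatment of weights are illustrative only, and the actual MDRR weighting and delay schedule are those specified in (Rank et al., 15 Feb 2024).

```python
import numpy as np

def mixed_estimate(past_models, weights):
    """Convex combination of (P_hat, r_hat) model estimates from past deployments.
    `past_models` is a list of (P, r) pairs, newest last; `weights` are user-chosen."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    P_mix = sum(wi * P for wi, (P, _) in zip(w, past_models))
    r_mix = sum(wi * r for wi, (_, r) in zip(w, past_models))
    return P_mix, r_mix
```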

Multi-Agent Performative Games:

Multi-agent PRL generalizes the framework to Markov Potential Games (MPGs) with performative effects (Sahitaj et al., 29 Apr 2025). Here, the notion of a performatively stable equilibrium (PSE) is central—agents optimize under environments determined by the current joint policy. Both independent policy gradient ascent (IPGA) and independent natural policy gradient (INPG) are shown to converge (in a best-iterate or last-iterate sense) to approximate PSEs, with extra error terms proportional to the sensitivity parameters. In games with agent-independent transitions, repeated retraining achieves finite-time last-iterate convergence.
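
The following schematic sketch illustrates independent policy gradient ascent in a performative Markov game; the callables `induced_env` and `grad_fns` are hypothetical placeholders for the environment's response to the joint policy and for each agent's gradient estimator, and the update is the generic independent-learning scheme rather than the exact algorithm of (Sahitaj et al., 29 Apr 2025).

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def independent_pga(theta, grad_fns, induced_env, eta=0.1, iters=200):
    """Independent policy gradient ascent in a performative Markov game.
    theta: list of per-agent logit tables of shape (S, A_i).
    grad_fns[i](env, policies): assumed to return agent i's policy-gradient estimate
    with respect to theta[i] in the given environment (hypothetical interface).
    induced_env(policies): environment induced by the current joint policy."""
    for _ in range(iters):
        policies = [softmax(th) for th in theta]       # softmax policy parameterization
        env = induced_env(policies)                    # performative response to the joint policy
        grads = [g(env, policies) for g in grad_fns]   # each agent ignores others' updates
        theta = [th + eta * g for th, g in zip(theta, grads)]
    return [softmax(th) for th in theta]
```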

Preference-Based and Vision-Language Feedback:

Realistic environments often rely on indirect reward feedback, e.g., preferences labeled by vision-language models (VLMs). VARP (Singh et al., 18 Mar 2025) enhances preference-based RL using trajectory sketches and an agent-regularized preference loss, boosting both preference accuracy and policy performance, with impacts directly tied to the evolving agent policy, a performative effect at the level of the feedback model.

Reinforcement Learning in LLMs:

Recent work explores PRL in the context of LLMs. For instance, in-context RL frameworks (Song et al., 21 May 2025) show that LLMs exposed to their own prior outputs and scalar rewards can learn better policies over multiple inference rounds, closely resembling performative sequential decision making. Other frameworks (e.g., Embodied Planner-R1 (Fei et al., 29 Jun 2025), VL-DAC (Bredis et al., 6 Aug 2025)) demonstrate large-scale PRL for embodied agents and vision-language models, achieving robust skill generalization using decoupled policy/value updates and sparse completion-driven rewards.

5. Theoretical Advances and Quantitative Results

Core theoretical advances include:

  • Gradient Dominance and Stationarity: When the policy regularizer dominates environmental shift (formally quantified via sensitivity parameters), the performative value function satisfies a gradient dominance property: all sufficiently stationary points are PO (Chen et al., 6 Oct 2025). Explicit dependence of the gradient smoothness and dominance constant $\mu$ on the regularization $\lambda$, state space, action space, and sensitivity constants is provided:

$$\mu = \frac{D\lambda}{1-\gamma} - \frac{6\gamma|\mathcal{S}|(1+\lambda\log|\mathcal{A}|)}{D(1-\gamma)^3(\varepsilon_p S_p + \varepsilon_r S_r)}$$

  • Explicit Convergence Rates: Convergence to PS or PO policies is polynomial in dimension and inverse accuracy, e.g., $O(\epsilon^{-4} \log(1/\epsilon))$ for 0-FW (Chen et al., 6 Oct 2025).
  • Impact of Environmental Sensitivity: All theoretical rates and fixed-point characterizations depend on the Lipschitz continuity (sensitivity) of $r$ and $P$ with respect to the occupancy $d$ or policy $\pi$. Excessive sensitivity can break uniqueness or convergence guarantees (Mandal et al., 2022, Mandal et al., 7 Nov 2024, Sahitaj et al., 29 Apr 2025).
  • Empirical Results: Experiments demonstrate that 0-FW achieves higher value than repeated retraining (PS-only methods), robust retraining under corruption converges when naive methods diverge (Pollatos et al., 8 May 2025), and MDRR achieves lower variance and faster convergence in gradually shifting environments (Rank et al., 15 Feb 2024). Application frameworks like VARP (Singh et al., 18 Mar 2025) and VL-DAC (Bredis et al., 6 Aug 2025) report 20–50% performance gains on standard benchmarks compared to baselines.

6. Broader Implications and Future Research Directions

The PRL paradigm has foundational implications for both learning theory and practical applications:

  • Closing the PS–PO Gap: The 0-FW approach provides the first polynomial-time algorithm for provably converging to PO, representing a step toward unlocking maximal agent–environment “co-design.”
  • Function Approximation and Deep RL: While linear MDPs are well-understood, extending PRL guarantees and algorithms to nonlinear and deep RL settings remains a core challenge.
  • Robustness and Distribution Shift: Rigorous treatment of robustness (adversarial, distributional, or model misspecification) is increasingly critical, especially as PRL is deployed in adversarial or unpredictable real-world domains.
  • Multi-Agent Equilibria: Joint PRL among multiple interactive agents—where policies alter both their own and others’ environments—necessitates new equilibrium concepts and learning algorithms.
  • Automated and Structured Feedback: Integration of automated, agent-aware reward signals (e.g., from VLMs in vision-language RL) and structure-aware optimization could enhance stability and adaptability in practical PRL deployments.
  • Real-World Generalization: Empirical evidence shows that PRL-trained models (e.g., for embodied agents or VLMs) can generalize from synthetic to real-world domains, supporting efficient real-world deployment with minimal retraining (Bredis et al., 6 Aug 2025, Fei et al., 29 Jun 2025).

Open questions include adapting these theoretical advances to high-dimensional, function-approximated settings; developing sharper finite-sample analyses; integrating richer feedback modalities; and designing algorithms suited for dynamic, multi-agent environments with substantial strategic complexity.

PRL is closely related to performative prediction (Lin et al., 2023) and classification under performative distribution shift (Cyffers et al., 4 Nov 2024). In both contexts, the chosen predictor or policy alters the future distribution of data. Plug-in performative optimization and push-forward models offer complementary approaches for risk minimization in performative settings, with explicit characterizations of the performative risk and gradient estimation procedures. These methods have practical analogs in RL—parametric model fitting, plug-in optimization, or adversarially robust and min–max formulations—that suggest tight conceptual and methodological integration across performative learning domains.


In summary, performative reinforcement learning elevates RL from passive optimization in a fixed world to dynamic co-adaptation with a reactive environment. Contemporary research establishes foundational models, provides algorithms with provable convergence to performatively optimal policies, addresses robustness and history dependence, and brings these ideas to bear on complex, real-world agentic systems.
