Repeated Policy Regret (RP-Regret) Framework
- Repeated Policy Regret (RP-Regret) is a counterfactual regret notion in which the comparator is endogenous: it is evaluated under the history it would itself induce rather than under the learner’s realized history.
- It extends standard regret measures by benchmarking performance against adaptive adversaries in settings such as online bandits, repeated games, and contracting, thereby highlighting conditions for sublinear regret.
- Practical algorithms, including mini-batching, successive elimination in tallying bandits, and occupancy-measure methods, demonstrate how RP-Regret can be effectively minimized in dynamic environments.
Repeated Policy Regret (RP-Regret) is a family of counterfactual regret notions for repeated interaction in which the benchmark is evaluated under the history that the comparator policy would itself induce, rather than under the learner’s realized history. In online bandits, policy regret compares realized cumulative loss to the loss of a competitor sequence under the adversary’s response to that counterfactual sequence (Arora et al., 2012). In repeated games, RP-Regret measures the difference between the realized and the best-in-hindsight accumulated utility or loss when all players can respond to the history of play (Liu et al., 4 Jun 2026). Closely related formulations appear as policy regret in repeated games (Arora et al., 2018), complete policy regret against the full policy class in tallying bandits (Malik et al., 2022), counterfactual policy regret for repeated contracting (Collina et al., 2024), and regret-plus-switching guarantees that coincide with RP-Regret under stationary, gap-separated episodic reinforcement-learning assumptions (Velegkas et al., 2022).
1. Formal object of comparison
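The central formal feature of RP-Regret is that the comparator is endogenous: changing the comparator policy changes the history distribution and therefore the future losses or utilities. In the online bandit formulation, let $a_1,\dots,a_T$ be the learner’s actions from an action set $\mathcal{A}$ and let $f_t$ be the adaptive loss function at round $t$. For a competitor class $\mathcal{C}_T \subseteq \mathcal{A}^T$ of action sequences, policy regret is
$\sum_{t=1}^{T} f_t(a_1,\dots,a_t) - \min_{(y_1,\dots,y_T)\in\mathcal{C}_T} \sum_{t=1}^{T} f_t(y_1,\dots,y_t).$
This differs from standard regret because the comparator term is evaluated under the counterfactual sequence itself rather than under the realized trajectory (Arora et al., 2012).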
In repeated games, the same idea is written directly in terms of strategy sequences. With accumulated expected loss
$J_T(\bpi_{1:T}) \coloneqq \sum_{t=1}^T f^{t-1}(\bpi_{1:t}),$
RP-Regret for player $1$ is
$R_T = J_T(\bpi_{1:T}) - \min_{\hat\bpi^{(1)}_{1:T}\in \cC_T^{(1)}} J_T\big((\hat\bpi^{(1)}_{1:T},\bpi^{(-1)}_{1:T})\big).$
The minimization allows the comparator to be a history-dependent policy, and the opponents respond to the counterfactual distribution of histories induced by that comparator sequence at every round (Liu et al., 4 Jun 2026).
Comparator classes vary by model. In tallying bandits, the strongest benchmark is the complete class $\mathcal{A}^T$ of all length-$T$ action sequences over the action set $\mathcal{A}$. For learner actions $a_t$ and loss functions $f_t$, this gives complete policy regret
$\sum_{t=1}^{T} f_t(a_1,\dots,a_t) - \min_{(y_1,\dots,y_T)\in\mathcal{A}^T} \sum_{t=1}^{T} f_t(y_1,\dots,y_t),$
which is explicitly stronger than constant-action policy regret and other restricted comparator classes (Malik et al., 2022). In repeated contracting, policy regret compares the Principal’s realized utility to the counterfactual world in which the Principal commits to a constant mechanism selecting the same Agent every round, while Agents play a non-responsive equilibrium appropriate to that mechanism (Collina et al., 2024).
2. Departure from external regret
RP-Regret was introduced because standard external regret becomes inadequate when the environment adapts to the learner’s behavior. In the adaptive-adversary bandit setting, standard regret compares the learner’s loss at round $t$ to the “peculiar” counterfactual
$f_t(a_1,\dots,a_{t-1},y),$
which holds the adversary’s response fixed to the realized trajectory while silently replacing only the current action with the comparator action $y$. The policy-regret benchmark instead evaluates the loss sequence that would have resulted had the learner committed to the comparator policy itself (Arora et al., 2012).
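The gap between the two benchmarks can be seen on a toy instance. The sketch below is illustrative only: the adversary `loss` is a hypothetical 1-memory-bounded loss invented for this example (it charges 1 whenever the previous action was 1), and the function names are not taken from any of the cited papers. Because the current action never affects the current loss, swapping only the current action changes nothing, so standard regret is zero while policy regret against the constant action 0 grows linearly.

```python
from typing import Sequence


def loss(prefix: Sequence[int]) -> float:
    """Toy 1-memory-bounded adaptive loss f_t(a_1, ..., a_t).

    Hypothetical adversary: charges 1 whenever the PREVIOUS action was 1,
    regardless of the current action.
    """
    return 1.0 if len(prefix) >= 2 and prefix[-2] == 1 else 0.0


def realized_loss(actions: Sequence[int]) -> float:
    """Cumulative loss of a sequence, evaluated on the history it induces."""
    return sum(loss(actions[:t]) for t in range(1, len(actions) + 1))


def standard_regret(actions: Sequence[int], n_actions: int = 2) -> float:
    """Comparator y replaces only the current action; the past stays realized."""
    T = len(actions)
    best = min(
        sum(loss(list(actions[: t - 1]) + [y]) for t in range(1, T + 1))
        for y in range(n_actions)
    )
    return realized_loss(actions) - best


def policy_regret(actions: Sequence[int], n_actions: int = 2) -> float:
    """Comparator y is played in every round, so the adversary reacts to it."""
    T = len(actions)
    best = min(realized_loss([y] * T) for y in range(n_actions))
    return realized_loss(actions) - best


always_one = [1] * 10
print(standard_regret(always_one))  # 0.0
print(policy_regret(always_one))    # 9.0
```

A learner that always plays 1 looks optimal under the standard benchmark, yet it loses 9 of 10 rounds relative to committing to action 0 from the start.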
This distinction is not merely terminological. For fully adaptive adversaries with unbounded memory, no bandit algorithm can guarantee sublinear policy regret even against the best constant action sequence: Theorem 1 gives a linear lower bound of $\Omega(T)$ (Arora et al., 2012). In general online learning with $m$-memory-bounded utilities, policy regret and external regret can also be incompatible: there exist utility sequences such that any action sequence with sublinear policy regret must incur linear external regret, and any action sequence with sublinear external regret must incur linear policy regret (Arora et al., 2018).
The comparison class also matters. Complete policy regret is the strongest possible version of policy regret because its benchmark is the entire class of deterministic policies, not just fixed actions or piecewise-constant competitors. The tallying-bandit results were motivated precisely by the gap between tractability for weaker comparator classes and tractability for this strongest benchmark (Malik et al., 2022).
3. Conditions under which sublinear RP-Regret is possible
Across the literature, sublinear RP-Regret requires structural restrictions on the adversary, the comparator, the opponents, or the information flow through history. In adaptive bandits, the foundational restriction is bounded memory: an $m$-memory-bounded adversary chooses losses that depend only on the $m+1$ most recent actions. This is the regime in which the mini-batching reduction from standard regret to policy regret becomes possible (Arora et al., 2012).
Tallying bandits impose stricter structure. The adversary is $m$-memory bounded, $g$-restricted, and $h$-tallying: for each action $a$, the loss depends only on the tally of $a$ among the last $m$ rounds,
$f_t(a_1,\dots,a_t) = h_{a_t}\Big(\sum_{s=\max(1,\,t-m+1)}^{t} \mathbb{1}\{a_s = a_t\}\Big).$
This sharply reduces the effective complexity of adaptive histories and enables nontrivial complete policy regret guarantees (Malik et al., 2022).
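The tally structure can be sketched in a few lines. This is a minimal illustration, not the paper’s construction: the window convention (the current action counts toward its own tally) and the loss tables `h` are assumptions made for the example, and the names `tally` and `tallying_loss` are hypothetical.

```python
from typing import Dict, List, Sequence


def tally(history: Sequence[int], action: int, m: int) -> int:
    """Number of times `action` appears among the last m entries of `history`."""
    window = history[-m:] if m > 0 else []
    return sum(1 for a in window if a == action)


def tallying_loss(history: Sequence[int], h: Dict[int, List[float]], m: int) -> float:
    """Loss of the most recent action as a function of its own tally only.

    `history` includes the current action as its last entry (an assumed
    convention), and h[a][y] is a hypothetical table giving the loss of
    action a when its tally equals y.
    """
    current = history[-1]
    return h[current][tally(history, current, m)]


# Hypothetical "fatigue" tables for m = 3: an action gets worse the more it was used.
h = {0: [0.0, 0.2, 0.5, 0.9], 1: [0.0, 0.4, 0.6, 0.8]}
history = [0, 1, 1, 0, 1]
print(tally(history, 1, 3))          # 2
print(tallying_loss(history, h, 3))  # 0.6
```

Two histories with the same tally for the current action receive the same loss, which is what collapses the space of adaptive histories the learner has to reason about.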
The 2026 repeated-games formulation identifies two necessary conditions for sublinear RP-Regret. The first is sublinear variation of the comparator sequence. The second is imperfect recall via exponential decay memory (EDM). Without the first condition, RP-Regret can grow linearly in $T$; if either the comparator or the opponents have perfect recall, RP-Regret can also grow linearly (Liu et al., 4 Jun 2026).
In repeated contracting, the equilibrium restriction is different but serves an analogous role. Non-responsive equilibrium requires strategies to depend only on the history of states of nature and randomness, not on the history of other Agents’ actions or Principal selections. The paper proves existence of a pure non-responsive Nash equilibrium for every Principal selection mechanism, thereby avoiding the complexities of collusion and threats in the policy-regret analysis (Collina et al., 2024).
4. Bandit and online-learning algorithms
The foundational constructive result is the mini-batching reduction. Let $\mathsf{Alg}$ be any bandit algorithm whose standard regret against constant actions over $J$ rounds is bounded by $R(J)$. Group the horizon of $T$ rounds into batches of length $\tau$, play a single action throughout each batch, and update $\mathsf{Alg}$ only with the average batch loss. For an $m$-memory-bounded adaptive adversary and $\tau > m$, the policy regret against any constant comparator $y$ satisfies
$\sum_{t=1}^{T} f_t(a_1,\dots,a_t) - \sum_{t=1}^{T} f_t(y,\dots,y) \le \tau\, R\!\left(\frac{T}{\tau}\right) + \frac{Tm}{\tau} + \tau.$
This conversion yields explicit sublinear rates in several settings. For $k$-armed bandits, mini-batched EXP3 with batch size
$\tau = \Theta\big(m^{2/3}(k\log k)^{-1/3}\,T^{1/3}\big)$
achieves policy regret
$O\big((mk\log k)^{1/3}\,T^{2/3}\big) = \tilde{O}(T^{2/3}).$
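As a rough check on where a $T^{2/3}$ rate comes from, the worked calculation below balances the two leading terms of a mini-batching bound of the form $\tau R(T/\tau) + Tm/\tau + \tau$. It assumes a base regret $R(J) = \sqrt{cJ}$ with $c = \Theta(k\log k)$, the textbook order for EXP3; the constant $c$ is a placeholder rather than a value taken from the paper, and balancing the terms only matches the optimal $\tau$ up to constants.

```latex
\begin{align*}
B(\tau) &= \tau\sqrt{\frac{cT}{\tau}} + \frac{Tm}{\tau} + \tau
         = \sqrt{cT\tau} + \frac{Tm}{\tau} + \tau, \\
\sqrt{cT\tau} = \frac{Tm}{\tau}
  &\iff \tau^{3/2} = m\sqrt{\frac{T}{c}}
  \iff \tau = m^{2/3}\,c^{-1/3}\,T^{1/3}, \\
B(\tau) &= 2\,(cm)^{1/3}\,T^{2/3} + m^{2/3}\,c^{-1/3}\,T^{1/3}
         = O\big((mk\log k)^{1/3}\,T^{2/3}\big).
\end{align*}
```

The requirement $\tau > m$ holds for this choice once $T > cm$, so the calculation applies for horizons that are long relative to the memory length.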
The same framework gives $O(T^{4/5})$ policy regret for bandit convex optimization and $O(T^{3/4})$ policy regret for bandit linear optimization and bandit submodular minimization; it also extends to switching regret, internal regret, swap regret, and policy $\Phi$-regret (Arora et al., 2012).
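The mini-batching reduction described above can be sketched as a thin wrapper around a base bandit learner. The code below is a minimal sketch under stated assumptions, not the authors’ implementation: `Exp3` is a textbook loss-based EXP3 with a standard step size, `run_minibatched` and `switch_penalty` are hypothetical names, and the adversary is a made-up 1-memory-bounded loss that penalizes switching.

```python
import math
import random


class Exp3:
    """Textbook EXP3 for losses in [0, 1], used here as the base algorithm."""

    def __init__(self, k, eta, rng):
        self.k = k
        self.eta = eta
        self.rng = rng
        self.cum = [0.0] * k  # cumulative importance-weighted loss estimates

    def probs(self):
        lo = min(self.cum)  # shift for numerical stability
        w = [math.exp(-self.eta * (c - lo)) for c in self.cum]
        s = sum(w)
        return [x / s for x in w]

    def draw(self):
        p = self.probs()
        arm = self.rng.choices(range(self.k), weights=p)[0]
        return arm, p[arm]

    def update(self, arm, p_arm, loss):
        self.cum[arm] += loss / p_arm


def run_minibatched(base, loss_fn, T, tau):
    """Play one arm per batch of length tau; feed the base learner the batch-average loss."""
    actions, total = [], 0.0
    while len(actions) < T:
        arm, p_arm = base.draw()
        batch_len = min(tau, T - len(actions))  # last batch may be shorter
        batch_loss = 0.0
        for _ in range(batch_len):
            actions.append(arm)
            batch_loss += loss_fn(actions)  # adaptive loss sees the full history
        total += batch_loss
        base.update(arm, p_arm, batch_loss / batch_len)
    return actions, total


def switch_penalty(actions):
    """Hypothetical 1-memory-bounded adversary: switching costs 1, else arm 0 costs 0.2 and arm 1 costs 0.5."""
    if len(actions) >= 2 and actions[-1] != actions[-2]:
        return 1.0
    return 0.2 if actions[-1] == 0 else 0.5


T, tau, k = 1000, 10, 2
J = math.ceil(T / tau)                      # number of batches seen by the base learner
eta = math.sqrt(2 * math.log(k) / (k * J))  # standard EXP3 step size for J rounds
actions, total = run_minibatched(Exp3(k, eta, random.Random(0)), switch_penalty, T, tau)
print(len(actions), total)
```

Because the arm is fixed within each batch, the adversary’s memory can only contaminate the first few rounds of a batch, which is where the $Tm/\tau$ term in the bound comes from.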
Tallying bandits address a harder comparator class. The SE-TB algorithm exploits the structural fact that optimal policies can be approximated by cyclic policies, then applies successive elimination over that structured class while estimating shared tally statistics. For any $(m,g,h)$-tallying bandit with $K$ actions and any $\delta \in (0,1)$, the high-probability complete policy regret bound is
$\tilde{O}\big(mK\sqrt{T}\big)$
with probability at least $1-\delta$. The expected lower bound is
$\tilde{\Omega}\big(\sqrt{mKT}\big),$
and in the deterministic feedback case the paper also proves an almost-sure upper bound and an expected lower bound (Malik et al., 2022).
5. Repeated games and equilibrium consequences
In repeated games with self-interested agents, policy regret and external regret need not be antagonistic. For two-player repeated games with bounded bilinear utilities, if both players use no-external-regret algorithms that are also stable on average, then, for any fixed actions of the two players, the policy regret is bounded in terms of the external-regret bounds and the stability parameter. MWU and Exp3 satisfy this on-average stability condition and have $O(\sqrt{T})$ external regret for a fixed number of actions, so both achieve sublinear policy regret in this regime. The same work introduces policy equilibrium and shows that no-policy-regret dynamics converge to policy equilibria, while coarse correlated equilibria form a strict subset of policy equilibria (Arora et al., 2018).
The 2026 repeated-games framework makes RP-Regret explicit for adaptive opponents and dynamic, history-dependent comparators. The per-round loss is non-convex in strategy space because it depends on products of policies along histories, and the paper therefore develops three algorithmic routes: an oracle-based non-convex method, a convex linearized surrogate called Local RP-Regret (LRP-Regret), and a direct occupancy-measure method for slowly changing opponents. For the local surrogate, projected gradient descent comes with an explicit LRP-Regret bound. Under slowly changing opponents, the occupancy-measure method comes with its own regret bound, subject to an additional condition stated in the paper. When all players minimize RP-Regret or its linearized variant, the resulting periodic strategy profiles yield approximate SPNE or SPCCE with bounded deviation. The experiments on Stag-Hunt over 100,000 iterations, run at several memory lengths with a fixed learning rate, show that minimizing LRP-Regret converges more frequently to the cooperative (Stag, Stag) equilibrium, and that longer memory further improves the rate of convergence to cooperative outcomes (Liu et al., 4 Jun 2026).
6. Counterfactual contracting and episodic reinforcement learning
Repeated contracting provides a direct counterfactual interpretation of RP-Regret. A Principal adaptively selects among $k$ non-myopic Agents over $T$ rounds using only bandit feedback, while Agents maximize their $T$-round utility and may hold mis-specified but full-support beliefs over future states. Policy regret compares realized utility under the Principal’s selection mechanism to the counterfactual utility in a world where the Principal commits to a constant mechanism selecting one fixed Agent every round and Agents play a non-responsive equilibrium for that mechanism.
Monotonicity of the selection mechanism is the key condition converting standard regret to counterfactual policy regret. For any concave contract, if the Principal uses a monotone bandit algorithm with bounded external regret, then policy regret is bounded by the same quantity. MonoBandit instantiated with Multiplicative Weights is monotone and has sublinear external regret, so the Principal’s policy regret is at most that quantity. For limited liability, the paper introduces a tab mechanism and shows that a monotone no swap-regret algorithm suffices: MonoBandit(TreeSwap) is monotone, has sublinear swap regret, and yields policy-regret and liability guarantees of the same order, with a corresponding bound under a fixed linear contract (Collina et al., 2024).
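Episodic reinforcement learning provides a closely related but not explicitly named RP-Regret framework. In an episodic MDP with horizon $H$, where episode $k \in \{1,\dots,K\}$ starts in state $s_1^k$ and uses policy $\pi_k$, regret is
$\mathrm{Regret}(K) = \sum_{k=1}^{K} \big(V_1^{\star}(s_1^k) - V_1^{\pi_k}(s_1^k)\big),$
and switching cost is
$N_{\mathrm{switch}}(K) = \sum_{k=1}^{K-1} \mathbb{1}\{\pi_k \neq \pi_{k+1}\}.$
Under the global suboptimality-gap assumption
$\mathrm{gap}_{\min} = \min_{h,s,a}\{\mathrm{gap}_h(s,a) : \mathrm{gap}_h(s,a) > 0\} > 0, \qquad \mathrm{gap}_h(s,a) = V_h^{\star}(s) - Q_h^{\star}(s,a),$
the paper proves regret that is logarithmic in the number of rounds together with a logarithmic number of policy switches, with complexity governed by eluder dimension and covering numbers. It gives separate bounds for the model-free and the model-based case. A lower bound shows that a stronger instance-dependent regret guarantee is impossible even in tabular episodic MDPs. The paper does not explicitly define RP-Regret, but under its stationary, deterministic-reward, gap-separated assumptions the best comparator sequence under a bounded number of switches is the stationary optimal policy $\pi^{\star}$, so RP-Regret coincides with the standard regret. In that sense, its regret-plus-switching framework is a precise instantiation of RP-Regret in episodic RL with general function approximation (Velegkas et al., 2022).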
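The two quantities tracked in the episodic setting, cumulative regret and switching cost, are simple to compute from a run log. The sketch below assumes the per-episode optimal values and achieved values are already available as lists, which would normally require knowing or estimating the optimal value function; the function names and the example numbers are hypothetical.

```python
from typing import Hashable, Sequence


def switching_cost(policies: Sequence[Hashable]) -> int:
    """Number of consecutive episode pairs in which the deployed policy changes."""
    return sum(1 for prev, cur in zip(policies, policies[1:]) if prev != cur)


def cumulative_regret(opt_values: Sequence[float], policy_values: Sequence[float]) -> float:
    """Sum over episodes of (optimal value - value of the deployed policy) at the start state."""
    return sum(v_star - v for v_star, v in zip(opt_values, policy_values))


# Hypothetical log of five episodes; policies are identified by labels.
policies = ["pi_a", "pi_a", "pi_b", "pi_b", "pi_a"]
opt_values = [1.0, 1.0, 1.0, 1.0, 1.0]
policy_values = [0.5, 0.5, 1.0, 1.0, 0.5]

print(switching_cost(policies))                      # 2
print(cumulative_regret(opt_values, policy_values))  # 1.5
```

A learner that updates its policy only at a logarithmic number of episodes keeps `switching_cost` small while `cumulative_regret` is still driven down by the gap structure.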