Iterated Rock-Paper-Scissors
- Iterated Rock-Paper-Scissors is a repeated game where players choose Rock, Paper, or Scissors to study equilibrium, cyclic dynamics, and adaptive strategies.
- It serves as a benchmark for behavioral economics, statistical learning, and multiagent reinforcement learning by quantifying deviations from Nash equilibrium through conditional-response models.
- Recent advances incorporate adaptive Markov models, cooperation-trap strategies, and cryptographic commitment protocols to enhance performance and protocol fairness in adversarial settings.
Iterated Rock-Paper-Scissors (IRPS) is the repeated version of the classic symmetric, cyclic-dominance game in which agents choose among three actions (Rock, Paper, Scissors) over a sequence of rounds, observing outcomes and possibly updating their strategies dynamically. IRPS functions as both a canonical model in game theory and a benchmark for statistical learning, behavioral economics, multiagent reinforcement learning, and cryptographically fair protocols. Studies of IRPS illuminate core issues in non-cooperative equilibrium, behavioral adaptation, social cycling, learnability, and the mechanisms supporting both competition and emergent cooperation.
1. Mathematical Formulation and Game Theoretic Baseline
A single stage of Rock-Paper-Scissors is defined by the action set $\{R, P, S\}$ for each player, with a cyclic-dominance payoff: Rock beats Scissors, Scissors beats Paper, and Paper beats Rock. The canonical zero-sum payoff matrix is:
$\begin{array}{c|ccc} & R & P & S \\ \hline R & 0 & -1 & 1 \\ P & 1 & 0 & -1 \\ S & -1 & 1 & 0 \end{array}$
for the row player, with the column player receiving the negated payoff (Lanctot et al., 2023).
In standard IRPS, the same stage game is repeated for $T$ rounds, with players possibly basing play at round $t$ on the joint history of actions through round $t-1$. In the widely analyzed nonzero-sum version, the payoff matrix for player X is:
$\begin{array}{c|ccc} & R & P & S \\ \hline R & 1 & 0 & a \\ P & a & 1 & 0 \\ S & 0 & a & 1 \end{array}$
with $a > 0$ parameterizing the value of a win relative to a tie (normalized to payoff 1; a loss pays 0) (Wang et al., 2014, Bi et al., 2014, Zhou, 2019).
The unique mixed-strategy Nash equilibrium (NE) is for each player to independently randomize uniformly over the three actions (probability $1/3$ each), yielding an expected payoff of $(1+a)/3$ per round (Bi et al., 2014, Zhou, 2019). For $a > 2$, this equilibrium is also evolutionarily stable against small mutational deviations (Zhou, 2019).
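These baseline facts are easy to verify numerically. The sketch below (plain NumPy; the value $a = 5$ is an arbitrary illustration) checks that every pure action earns the same payoff $(1+a)/3$ against a uniform opponent, so no unilateral deviation is profitable:

```python
import numpy as np

def rps_payoff_matrix(a: float) -> np.ndarray:
    """Row player's payoff in the nonzero-sum game: 1 for a tie,
    `a` for a win, 0 for a loss; rows/columns ordered (R, P, S)."""
    return np.array([[1.0, 0.0, a],
                     [a,   1.0, 0.0],
                     [0.0, a,   1.0]])

a = 5.0
M = rps_payoff_matrix(a)
uniform = np.full(3, 1.0 / 3.0)

# Against a uniform opponent every pure action earns (1 + a) / 3,
# so the uniform mix is a best response to itself: a Nash equilibrium.
payoffs = M @ uniform
assert np.allclose(payoffs, (1.0 + a) / 3.0)
print(payoffs)  # [2. 2. 2.]
```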
2. Non-Equilibrium Behavioral Dynamics and Conditional Strategies
Laboratory and computational studies consistently observe deviations from NE randomization, especially when humans or learning algorithms play IRPS over long horizons. Empirical findings include:
- Players' action marginals remain close to (1/3, 1/3, 1/3), but sequences of moves exhibit conditional dependencies and inertia (Wang et al., 2014).
- Population-level social states exhibit persistent cyclical drift around the simplex (counterclockwise R→P→S→R), contradicting detailed balance (Wang et al., 2014, Zhou, 2019).
- This cycling is robust to variations in the payoff parameter $a$: a persistent nonzero mean cycling frequency is observed across a wide range of $a$ values (Wang et al., 2014).
These observations are quantitatively explained by a six-parameter win/tie/loss conditional-response (CR) model, generalizing "win-stay, lose-shift": after each win, tie, or loss, players repeat their previous action, shift it counterclockwise (R→P→S), or shift it clockwise with calibrated probabilities (Wang et al., 2014). Analytical solutions for steady-state social rotation and mean payoffs under CR rules match experimental data without parameter fitting.
Optimized CR parameters can yield steady-state payoffs exceeding the NE value (Wang et al., 2014). This indicates that non-equilibrium memory and conditional play can outperform static NE play even with unaltered marginal action frequencies.
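The CR mechanism is easy to illustrate in simulation. The parameter values below are illustrative placeholders, not the calibrated values reported by Wang et al. (2014); the point is only that conditioning on the last outcome still leaves the action marginals near uniform, as observed empirically:

```python
import numpy as np

rng = np.random.default_rng(1)

def outcome(x, y):
    """Win/tie/loss for the player choosing x against y (0=R, 1=P, 2=S)."""
    if x == y:
        return "tie"
    return "win" if (x - y) % 3 == 1 else "loss"

# Illustrative CR parameters: P(repeat), P(shift counterclockwise R->P->S),
# P(shift clockwise), conditioned on the previous round's outcome.
CR = {"win": (0.5, 0.3, 0.2), "tie": (0.4, 0.4, 0.2), "loss": (0.2, 0.5, 0.3)}

def next_action(prev, out):
    shift = rng.choice([0, 1, -1], p=CR[out])
    return (prev + shift) % 3

T = 60_000
x, y = int(rng.integers(3)), int(rng.integers(3))
counts = np.zeros(3)
for _ in range(T):
    ox, oy = outcome(x, y), outcome(y, x)
    x, y = next_action(x, ox), next_action(y, oy)
    counts[x] += 1

print(counts / T)  # close to (1/3, 1/3, 1/3): marginals stay uniform
```

By the rotational symmetry of the cyclic game, the stationary action distribution is uniform even though successive moves are strongly correlated.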
3. Learning Algorithms and Multiagent Benchmarks
IRPS is a standard testbed for multiagent reinforcement learning (MARL) and online/adversarial learning due to its uniquely challenging interplay between exploitation and exploitability.
- Tabular and deep RL (Q-learning, DQN, A2C, IMPALA) agents achieve low returns or become easily exploitable in head-to-head bot tournaments (Lanctot et al., 2023).
- Regret-matching, strongly adaptive online learning, and swap-regret minimization algorithms, especially when conditioned on recent history ("contextual bandits"), achieve significantly higher returns and lower exploitability (Lanctot et al., 2023).
- LLMs (Chinchilla 70B), prompted to predict the opponent's next move from joint-action histories, reach returns on par with the strongest contextual regret methods but below the top hand-crafted bots (Lanctot et al., 2023).
- The leading hand-coded agents—iocainebot and greenberg—outperform all learning-based methods, with "PopulationReturn" scores of 255 and 288, respectively, and exploitabilities below 5 per 1000-round match (Lanctot et al., 2023).
- Hybrid population-based RL with active opponent identification (PopRL) achieves state-of-the-art learning-based returns, illustrating the balance of adaptation and robustness (Lanctot et al., 2023).
A key empirical principle is that the best IRPS agents combine adaptive pattern exploitation with strategic robustness, preserving low exploitability while actively countering suboptimal opponents. Pure self-play RL converges to equilibrium but gains little versus fallible adversaries; greedy exploiters are highly vulnerable to best responses.
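As a concrete member of the regret-minimization family mentioned above, here is a minimal regret-matching self-play sketch on the zero-sum stage game (a textbook construction, not the tournament agents of Lanctot et al., 2023); the time-averaged strategies approach the uniform NE:

```python
import numpy as np

# Zero-sum payoffs for the row player: 0 tie, +1 win, -1 loss (R, P, S).
U = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])

def regret_matching_self_play(T=20_000, seed=0):
    rng = np.random.default_rng(seed)
    regrets = [np.zeros(3), np.zeros(3)]
    avg = [np.zeros(3), np.zeros(3)]
    for _ in range(T):
        # Play proportionally to positive cumulative regret (uniform if none).
        strats = []
        for r in regrets:
            pos = np.maximum(r, 0.0)
            strats.append(pos / pos.sum() if pos.sum() > 0 else np.full(3, 1/3))
        a0 = int(rng.choice(3, p=strats[0]))
        a1 = int(rng.choice(3, p=strats[1]))
        # Accumulate regret: payoff each action would have earned minus the
        # realized payoff (U is antisymmetric, so both players can use U).
        regrets[0] += U[:, a1] - U[a0, a1]
        regrets[1] += U[:, a0] - U[a1, a0]
        avg[0] += strats[0]
        avg[1] += strats[1]
    return avg[0] / T, avg[1] / T

p, q = regret_matching_self_play()
print(p.round(3), q.round(3))  # both near (1/3, 1/3, 1/3)
```

In self-play this drives the empirical average strategies toward equilibrium, but, as noted above, such an equilibrium-seeking agent gains little against fallible opponents unless it also conditions on history.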
4. Algorithmic Prediction and Adaptive AI: Markov and Multi-Model Approaches
Adaptive Markov models are highly effective in exploiting consistent behavioral patterns in human IRPS play (Wang et al., 2020). The approach is as follows:
- Construct a bank of single-AI models, each an $n$-th order Markov chain mapping a memory state (the last $n$ moves) to empirical transition probabilities over the human's next move.
- At each round, every model predicts the most likely human move; the currently selected AI plays the corresponding counter-move.
- An adaptive multi-AI architecture evaluates each model's cumulative score over a sliding window of the most recent rounds (the "focus length") and selects the model with maximal recent performance.
- Empirically, banks of 5–10 Markov models, with focus lengths 5–10, win against over 95% of 52 human subjects over 300-round continuous matches. Multi-10AI offers higher stability and lower variance than multi-5AI (Wang et al., 2020).
Early in play, low-order Markov models exploit short-term human patterns, with higher-order models dominating as humans adapt. The focus length sets the adaptation speed: a smaller focus length reacts faster to opponent shifts (risking overfitting), while a larger one is more stable but less responsive (Wang et al., 2020).
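A minimal sketch of one such order-$n$ predictor and the model-selection step follows; the class names, the Laplace smoothing, and the default-to-Rock fallback are illustrative choices, not the exact implementation of Wang et al. (2020):

```python
import numpy as np
from collections import defaultdict

BEATS = {0: 1, 1: 2, 2: 0}  # BEATS[x] defeats x (0=R, 1=P, 2=S)

class MarkovPredictor:
    """Order-n model: empirical counts of the human's next move
    conditioned on the human's last n moves (Laplace-smoothed)."""
    def __init__(self, n: int):
        self.n = n
        self.counts = defaultdict(lambda: np.ones(3))

    def predict(self, history):
        """Most likely next human move given the last n moves (default R)."""
        if len(history) < self.n:
            return 0
        return int(np.argmax(self.counts[tuple(history[-self.n:])]))

    def update(self, history, move):
        if len(history) >= self.n:
            self.counts[tuple(history[-self.n:])][move] += 1

def multi_ai_move(models, scores, history, focus=10):
    """Select the model with the best score over the last `focus` rounds
    and counter its prediction of the human's next move."""
    recent = [sum(s[-focus:]) for s in scores]
    best = int(np.argmax(recent))
    return BEATS[models[best].predict(history)], best
```

A full agent would, after each round, call `update` on every model and append each model's hypothetical win/tie/loss score to `scores`; `focus` is the adaptation-speed hyperparameter discussed above.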
5. Mechanisms for Cooperation and Efficiency: Cooperation-Trap Strategies
Beyond the NE baseline, IRPS admits protocols enabling full cooperative efficiency for rational agents (Bi et al., 2014, Zhou, 2019). "Cooperation-trap" (CT) strategies are of the form:
- Player X precommits to a default mixed (or pure) profile, selected to induce Y's best reply to be a deterministic or degenerate mixed strategy (e.g., always play Paper).
- Deviations by Y (detected, e.g., via tie outcomes) trigger a finite "punishment phase": X plays the NE randomization for a fixed number of rounds before returning to the default.
- With appropriate calibration of the default mixture and punishment length, Y's rational best response is to cooperate perpetually, stabilizing a fair split with per-round payoffs above the NE value (Bi et al., 2014, Zhou, 2019).
- Memoryless CT protocols can yield payoffs higher than the NE baseline but are generically asymmetric (Y earns more than X) and can underperform NE for certain values of $a$ (Bi et al., 2014).
These CT methods generalize to any cyclic dominance system with unique NE and potential for Pareto improvements, relying only on enforceable patterns of conditional retaliation.
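A state-machine sketch of a CT player follows. The default profile (mostly Rock, a little Scissors, chosen so that a cooperating Paper-player never produces a tie) and the punishment length are illustrative stand-ins for the calibrated values in Bi et al. (2014):

```python
import numpy as np

class CooperationTrap:
    """Sketch of a cooperation-trap (CT) strategy: play a default profile
    that makes the opponent's best reply a fixed action (Paper); on a
    detected deviation (here, any tie), punish with uniform NE play for
    L rounds.  Parameter values are illustrative only."""
    def __init__(self, default=(0.9, 0.0, 0.1), punish_len=5, seed=0):
        self.default = np.array(default)   # P(R), P(P), P(S)
        self.L = punish_len
        self.punish_left = 0
        self.rng = np.random.default_rng(seed)

    def move(self):
        if self.punish_left > 0:
            self.punish_left -= 1
            return int(self.rng.integers(3))        # NE randomization
        return int(self.rng.choice(3, p=self.default))

    def observe(self, my_move, opp_move):
        # A tie is impossible against a cooperating Paper-player, so it
        # signals a deviation and (re)starts the punishment phase.
        if self.punish_left == 0 and my_move == opp_move:
            self.punish_left = self.L
```

Against the default profile, Y's stage-wise best reply is Paper (winning with probability 0.9, losing with 0.1, never tying); any deviation risks triggering the NE punishment, which is what makes perpetual cooperation rational when the default and punishment length are tuned correctly.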
6. Fair Protocols and Secure Simulation of Simultaneous Play
Iterated RPS requires secure commitment to simultaneous moves when players are remotely located or adversarial (0708.4379). Commitment-based protocols provide computational and practical guarantees:
- Each round proceeds via a three-flow envelope exchange (or digital hash-based commitment) securing Alice's move prior to Bob's reveal.
- The commitment scheme must be binding (Alice cannot change her move after the fact) and hiding (Bob cannot deduce Alice's move before the open).
- Extension to IRPS is by sequential repetition; possible improvements include batching commitments, online digital boards, and digital signatures (ECDSA).
- The probability of undetected cheating can be made negligible by increasing the entropy of the commitment (0708.4379).
This protocol is robust to round-trip latency and provides data integrity for both finite and infinite horizon matches.
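A minimal hash-based instantiation of one round of this flow (SHA-256 over a random nonce plus the move; computationally binding and hiding under standard hash-function assumptions, and a simplification of the envelope protocol in 0708.4379):

```python
import hashlib
import secrets

def commit(move: str) -> tuple[bytes, bytes]:
    """Commit to a move; returns (commitment, opening).
    The 32-byte random nonce supplies the entropy that makes
    undetected cheating negligible."""
    nonce = secrets.token_bytes(32)
    opening = nonce + move.encode()
    return hashlib.sha256(opening).digest(), opening

def verify(commitment: bytes, opening: bytes) -> str:
    """Check the opening against the commitment; return the committed move."""
    if hashlib.sha256(opening).digest() != commitment:
        raise ValueError("opening does not match commitment")
    return opening[32:].decode()

# One round: Alice commits first, Bob then moves in the clear, Alice opens.
c, opening = commit("rock")
bob_move = "paper"
alice_move = verify(c, opening)
print(alice_move, bob_move)  # rock paper
```

Extending to IRPS is sequential repetition of this round; batching or signing the commitments, as noted above, amortizes the communication cost.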
7. Dynamical Systems Analogues and Broader Applications
IRPS extends to population and ecological dynamics, and economic models where cyclic dominance, asymmetric responses, and non-equilibrium phenomena play central roles (Zhou, 2019):
- Replicator dynamics formalize population-level learning: the NE is a global attractor for $a > 2$, neutrally stable at $a = 2$, and trajectories cycle outward toward the boundary for $1 < a < 2$.
- Collision models (Lotka–Volterra type) for three species exhibit neutral cycles; finite stochasticity drives eventual extinction, capturing the "survival of the weakest."
- In economic contexts (e.g., price cycling), logit best-response induces nonuniform steady-state mixtures and persistent Edgeworth cycles, with mean payoffs exceeding NE at intermediate intensities.
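The replicator statement can be checked numerically. A forward-Euler sketch (step size, horizon, and initial condition are arbitrary choices) shows the distance to the NE centroid shrinking for $a > 2$ and growing toward the boundary for $a < 2$:

```python
import numpy as np

def replicator_step(x, M, dt=0.01):
    """One Euler step of the replicator ODE x_i' = x_i ((M x)_i - x.M.x)."""
    f = M @ x
    return x + dt * x * (f - x @ f)

def dist_to_centroid(a, steps=20_000):
    M = np.array([[1.0, 0.0, a], [a, 1.0, 0.0], [0.0, a, 1.0]])
    x = np.array([0.40, 0.35, 0.25])  # start near, but off, the NE centroid
    for _ in range(steps):
        x = replicator_step(x, M)
    return float(np.linalg.norm(x - 1.0 / 3.0))

# a = 3 (> 2): the NE attracts; a = 1.5 (< 2): trajectories spiral outward.
print(dist_to_centroid(3.0) < 1e-3 < dist_to_centroid(1.5))  # True
```

The vector field is tangent to the simplex, so the Euler iteration preserves the probability normalization exactly; only positivity requires a small step size.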
These analogues demonstrate that IRPS serves not only as a minimal testbed but also as a universal motif for studying learning, adaptation, and cyclic dominance across domains.
In summary, IRPS integrates equilibrium concepts, learning dynamics, algorithmic adaptation, cryptographic fairness, and cooperative protocol design. Its study reveals the necessity of both equilibrium and dynamical analyses in finite sequential games, and provides a rigorous foundation for the development and evaluation of multiagent systems, behavioral models, and secure protocols in adversarial, non-cooperative environments (Lanctot et al., 2023, Wang et al., 2020, Wang et al., 2014, Bi et al., 2014, 0708.4379, Zhou, 2019).