Reduced-Format Liar's Poker

Updated 9 November 2025
  • The paper demonstrates that reinforcement learning agents using actor-critic methods and state abstraction achieve elite human performance in reduced-format Liar's Poker.
  • It details a formal game structure with sequential bidding, rebid mechanics, and defined state-action spaces that model bluffing and probabilistic reasoning.
  • Experimental results and ablation studies show that architectural scaling and entropy regularization reduce exploitability and enhance overall AI robustness.

Reduced-format Liar’s Poker is a multi-player, imperfect-information bidding game that distills essential elements of bluffing, probabilistic reasoning, and strategic interaction found in the standard Liar’s Poker setting. The game is of particular interest as a testbed for artificial intelligence research due to its continuous multi-player engagement, formal structure, and well-defined metrics for both human and artificial agents. Recent computational results demonstrate that reinforcement learning (RL) agents employing deep neural policies attain elite human performance in reduced-format versions of Liar’s Poker, notably in the 3×3 and 5×5 cases, by leveraging state abstraction, actor-critic methods, and self-play.

1. Formal Game Structure

Reduced-format Liar’s Poker comprises $L$ players, each receiving a private hand of $H$ independent draws from the digit set $\{1,\ldots,D\}$. The hand of player $\ell$ is a count vector $X_\ell = (X_{\ell,1},\ldots,X_{\ell,D}) \sim \text{Multinomial}(H; 1/D, \ldots, 1/D)$. The aggregate public state is $S = \sum_\ell X_\ell$, where $S_r$ is the total number of digit $r$ held among all players.

Each round consists of sequential bidding, with allowed bids of the form $(q, r)$, interpreted as “at least $q$ occurrences of digit $r$ exist among all hands.” Lexicographical order constrains the legal set of subsequent bids: $(q', r') \succ (q, r)$ iff $q' > q$, or $q' = q$ and $r' > r$. On their turn, a player must either challenge the preceding bid or escalate to a higher valid bid.
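As a concrete illustration of this ordering, here is a minimal Python sketch; the function name and tuple encoding are choices made here, not taken from the paper:

```python
def outranks(new_bid, old_bid):
    """Return True iff new_bid ≻ old_bid: a bid (q', r') is legal after (q, r)
    exactly when q' > q, or q' == q and r' > r."""
    # Python's tuple comparison is lexicographic, which matches the bid order.
    return tuple(new_bid) > tuple(old_bid)

# (3 of digit 5) outranks (3 of digit 2); (4 of digit 1) outranks any quantity-3 bid.
assert outranks((3, 5), (3, 2))
assert outranks((4, 1), (3, 6))
assert not outranks((3, 2), (3, 2))
```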

A key mechanic is the rebid: if all opponents challenge a bid $(q^*, r^*)$, the proposer may choose between revealing (count) and, if the rebid privilege has not been exercised, submitting a single new bid $(q^{**}, r^{**}) \succ (q^*, r^*)$. A second round of challenges after a rebid finalizes the hand, either via a count or a loss for the final challenged bidder.

Payoff for a counted hand is determined as follows: if $S_{r_{\text{final}}} \geq q_{\text{final}}$, the final bidder earns $+1$ from each opponent; otherwise, $-1$ per opponent. In a three-player match, this yields a per-hand payoff of $\pm 2$ for the final bidder.
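A short sketch of this payoff rule under the conventions above; the helper and its argument layout are illustrative only:

```python
def counted_hand_payoffs(total_counts, final_bid, final_bidder, num_players):
    """Zero-sum payoffs when a hand ends in a count.

    total_counts: the aggregate public state S, mapping each digit r to S_r
    final_bid:    the last standing bid (q_final, r_final)
    Each opponent pays 1 to (or collects 1 from) the final bidder.
    """
    q_final, r_final = final_bid
    bid_holds = total_counts.get(r_final, 0) >= q_final
    payoffs = [-1 if bid_holds else +1] * num_players
    payoffs[final_bidder] = (num_players - 1) if bid_holds else -(num_players - 1)
    return payoffs

# Three players, aggregate counts S = {digit 2: 4, digit 5: 1}, final bid "at least 3 twos":
print(counted_hand_payoffs({2: 4, 5: 1}, final_bid=(3, 2), final_bidder=0, num_players=3))
# -> [2, -1, -1]: the final bidder collects +2, consistent with the ±2 per-hand payoff.
```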

State, action, and chance are formalized as:

  • State space: $s = (X_1, \ldots, X_L; b; c; \ell; \delta)$, with $b$ the bid history, $c$ the vector of challenging players, $\ell$ the current player index, and $\delta$ the rebid flag.
  • Action space: $A(s) = \{\text{challenge}\} \cup \{(q', r') : (q', r') \succ \text{last}(b)\}$, plus “count” for a fully challenged proposer.
  • Chance enters only at the initial deal: $P(X_1, \ldots, X_L) = \prod_\ell \text{Multinomial}(X_\ell; H, 1/D, \ldots, 1/D)$ (a code sketch of the deal and legal actions follows).
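The chance node and the legal-action set can be sketched as follows; NumPy, the parameter values (a 3×3 game with two players), and the enumeration strategy are assumptions made here for illustration:

```python
import numpy as np

D, H, L = 3, 3, 2                      # digits, hand size, players (assumed 3x3, 2-player)
rng = np.random.default_rng(0)

def deal_hands():
    """Chance node: each hand is an independent Multinomial(H; 1/D, ..., 1/D) count vector."""
    return [rng.multinomial(H, [1.0 / D] * D) for _ in range(L)]

def legal_actions(bid_history, fully_challenged_proposer=False):
    """A(s): every bid strictly above last(b), plus 'challenge' once a bid exists,
    plus 'count' for a fully challenged proposer. Quantities range up to H * L."""
    all_bids = [(q, r) for q in range(1, H * L + 1) for r in range(1, D + 1)]
    if not bid_history:                                       # opening move: any bid
        return all_bids
    actions = ["challenge"]
    actions += [b for b in all_bids if b > bid_history[-1]]   # tuple order matches ≻
    if fully_challenged_proposer:
        actions.append("count")
    return actions

print(deal_hands())              # e.g. two length-3 count vectors summing to H each
print(legal_actions([(5, 2)]))   # challenge, (5, 3), (6, 1), (6, 2), (6, 3)
```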

2. Policy Representation and Network Design

Solly, the AI agent attaining elite-level performance in reduced-format Liar’s Poker, employs a unified multi-layer perceptron (MLP) architecture parameterizing both policy and value functions. Input features include:

  • Private hand as a length-$D$ integer count vector $X_\ell$,
  • One-hot encoding of the full bid history (clipped to game-legal length),
  • Challenge history: one-hot vector tracking challengers,
  • Rebid flag $\delta \in \{0,1\}$ and a binary terminal indicator.

These are concatenated into a fixed-length feature vector $\phi(s)$. The network consists of two fully connected hidden layers of 256 units each with ReLU activations. The actor (policy) head is a linear layer over the $|A(s)|$ possible actions; the critic (value) head is a linear projection to the scalar value estimate $V_\phi(s)$. In experiments scaling to more complex regimes, deeper networks (7 layers of 512 units each) were employed to accelerate and stabilize convergence.
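A sketch of this architecture in PyTorch; the framework, the fixed maximum-action-count encoding, and the masking of illegal bids are assumptions made here for concreteness:

```python
import torch
import torch.nn as nn

class ActorCriticMLP(nn.Module):
    """Shared two-layer MLP trunk (256 ReLU units each) with separate actor and value heads,
    matching the layer sizes described above; deeper 7 x 512 variants follow the same pattern."""

    def __init__(self, feature_dim: int, max_actions: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, max_actions)  # logits over a fixed action index set
        self.value_head = nn.Linear(hidden, 1)             # scalar state-value estimate

    def forward(self, phi_s: torch.Tensor, legal_mask: torch.Tensor):
        # legal_mask: boolean tensor, True for actions currently in A(s)
        h = self.trunk(phi_s)
        logits = self.policy_head(h).masked_fill(~legal_mask, float("-inf"))
        policy = torch.distributions.Categorical(logits=logits)
        value = self.value_head(h).squeeze(-1)
        return policy, value
```

Masking illegal actions before the softmax ensures the policy places zero probability outside $A(s)$.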

3. Learning Process and Training Regime

Solly is optimized by self-play under the regularized Nash dynamics (R-NaD) actor-critic framework:

  • The policy (actor) loss is

$$\mathcal{L}_p(\theta) = -\mathbb{E}_{s,a\sim\pi_\theta}\!\left[A_\phi(s,a)\,\log\pi_\theta(a\mid s) + \beta\, H(\pi_\theta(\cdot\mid s))\right]$$

with entropy regularization ($\beta \approx 0.01$) to encourage play randomization.

  • The value (critic) loss is

$$\mathcal{L}_v(\phi) = \mathbb{E}_{s,a,r,s'}\!\left[\left(r + \gamma V_\phi(s') - V_\phi(s)\right)^2\right]$$

where $\gamma \approx 0.99$ is the discount factor.

Advantages are computed with generalized advantage estimation (GAE), defined recursively:

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t), \qquad A_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \cdots$$

with $\lambda \approx 0.95$.
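A minimal backward-recursion sketch of this estimator with the reported $\gamma$ and $\lambda$; it is the standard GAE formulation rather than code from the paper:

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one truncated trajectory.

    rewards: r_0 ... r_{T-1}
    values:  V(s_0) ... V(s_T), including a bootstrap value for the final state
             (zero if the hand terminated).
    Returns advantages A_0 ... A_{T-1}.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

# Toy trajectory: no intermediate reward, terminal payoff +2 (a won three-player hand).
print(compute_gae(rewards=[0.0, 0.0, 2.0], values=[0.1, 0.2, 0.5, 0.0]))
```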

Training proceeds with the Adam optimizer, an actor learning rate of $\approx 1\times10^{-4}$ and a critic learning rate of $\approx 5\times10^{-4}$ (both decayed over time), and batch updates over $B$ trajectories truncated at $T_\text{max}\approx 15$ moves. The “shared-policy” self-play paradigm is adopted: every agent in the environment runs the same policy instance, avoiding explicit opponent modeling. Rollouts are generated on-policy without a replay buffer, and training accumulates $\sim10^7$ learning updates over up to $\sim3\times10^9$ self-play hands.
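A hedged sketch of one update consistent with the losses above, again in PyTorch; the R-NaD regularization toward a reference policy is omitted, so this shows only the entropy-regularized policy-gradient term and the squared TD-error value term:

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(policy, values, actions, value_targets, advantages, beta=0.01):
    """Compute the policy and value losses for a batch of sampled transitions.

    policy:        Categorical distribution pi_theta(.|s) over actions
    values:        critic predictions V_phi(s)
    actions:       actions sampled from pi_theta
    value_targets: bootstrapped targets r + gamma * V_phi(s')
    advantages:    GAE estimates A_phi(s, a)
    """
    log_probs = policy.log_prob(actions)
    policy_loss = -(advantages.detach() * log_probs + beta * policy.entropy()).mean()
    value_loss = F.mse_loss(values, value_targets.detach())
    return policy_loss, value_loss
```

Separate Adam optimizers with the actor and critic learning rates quoted above would then step their respective parameters.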

4. Strategic Innovations and Exploitability

Solly exhibits distinct strategic differences compared to elite human play:

  • Rebidding frequency: Solly rebids approximately $33\%$ of the time in 3×3, 3-player games versus $8\%$ for humans, deploying the rebid strategically as a bluffing device.
  • Opening tactics: Initial bids are frequently non-forcing, meaning they avoid placing the next player in a deterministic challenge vs. bid-up position, which diversifies the ensuing game tree.
  • Randomization: The entropy regularization term in the policy loss maintains stochasticity over multiple near-optimal bidding actions, making policies less predictable and more robust to exploitation.
  • Empirical exploitability: when a DQN-based “best-response” agent is trained against fixed Solly checkpoints, its observed payoff declines from $\sim1.0$ (high exploitability) at $10^4$ learning steps to $\sim0.25$ after more than $5\times10^6$ steps, indicating convergence toward a locally stable, less exploitable policy (see the evaluation sketch below).
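The exploitability measurement can be summarized as follows; `play_hand` and the checkpoint objects are hypothetical stand-ins, since the evaluation harness is not specified here:

```python
def empirical_exploitability(best_responder, frozen_checkpoint, play_hand, num_hands=100_000):
    """Estimate exploitability as the average per-hand payoff a trained best-response
    agent extracts from a fixed policy checkpoint (lower is better for the checkpoint)."""
    total_payoff = 0.0
    for _ in range(num_hands):
        total_payoff += play_hand(best_responder, frozen_checkpoint)  # best responder's payoff
    return total_payoff / num_hands
```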

5. Comprehensive Evaluation and Performance Metrics

Performance is primarily measured via win rate (the fraction of hands with positive payoff) and equity (average net winnings per 100 hands). The following results characterize Solly’s efficacy:

| Setting | Opponent | Equity per 100 hands |
| --- | --- | --- |
| 3×3, 2-player | Baseline model | +16 ± 3 |
| 3×3, 2-player | GPT-4.1 | +19 ± 3 |
| 3×3, 2-player | OpenAI o3 | +9 ± 3 |
| 3×3, 2-player | Elite humans | −4 ± 10 |
| 3×3, 2-player | DQN best-response | −12 ± 3 |
| 5×5, 2-player | Elite humans | +10 ± 10 |
| 3×3, 3-player | 2 elite humans | +17 ± 15 |

Ablation studies indicate that hand-abstraction (mapping hands to “canonical” count vectors) and architectural scaling (increased MLP depth, reward rescaling) further suppress exploitability and provide a credible route to scaling methods up to full-format (8×10) Liar’s Poker.
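One reading of the hand abstraction named above is the order-invariant count-vector encoding, in which all orderings of the same multiset of draws map to a single canonical representation; the exact abstraction used in the ablation is not detailed here, so the sketch below is illustrative:

```python
from collections import Counter

def canonical_count_vector(cards, num_digits):
    """Map a dealt hand (a sequence of digit draws) to its canonical count-vector form.

    Example for digits {1, 2, 3}: the draws (2, 1, 2) and (1, 2, 2) both become [1, 2, 0],
    collapsing equivalent hands to one state and shrinking the effective state space.
    """
    counts = Counter(cards)
    return [counts.get(digit, 0) for digit in range(1, num_digits + 1)]

print(canonical_count_vector((2, 1, 2), num_digits=3))  # -> [1, 2, 0]
```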

6. Implications and Prospective Directions

Solly establishes a benchmark as the first AI to achieve competitive, elite-level performance in a full multi-player, sequential-bidding game with both bluffing and rebid complexity. Despite the small scale of the input and network (relative to commercial poker bots), the agent requires no explicit search at test time and achieves strong results with under 3 billion self-play hands on commodity hardware.

Key implications include:

  • The shared-policy self-play paradigm is effective in 3+ player imperfect-information games with continuous engagement and non-myopic dynamics.
  • Existing RL infrastructure (off-the-shelf actor-critic, GAE, modest MLP) suffices for attaining robust, low-exploitability performance.
  • Strategic play involving deliberate bluffing via rebids and policy-driven randomization emerges naturally under entropy-regularized RL.

Open research avenues include incorporating Monte Carlo search at test time, curriculum learning or transfer learning from reduced to full-format games, multi-policy asynchronous self-play, and the combination of LLM reasoning components with learned policy distributions. These directions suggest a generalizable methodology for AI development in other complex, multi-agent, imperfect-information environments.
