Reduced-Format Liar's Poker
- The paper demonstrates that reinforcement learning agents using actor-critic methods and state abstraction achieve elite human performance in reduced-format Liar's Poker.
- It details a formal game structure with sequential bidding, rebid mechanics, and defined state-action spaces that model bluffing and probabilistic reasoning.
- Experimental results and ablation studies show that architectural scaling and entropy regularization reduce exploitability and enhance overall AI robustness.
Reduced-format Liar’s Poker is a multi-player, imperfect-information bidding game that distills essential elements of bluffing, probabilistic reasoning, and strategic interaction found in the standard Liar’s Poker setting. The game is of particular interest as a testbed for artificial intelligence research due to its continuous multi-player engagement, formal structure, and well-defined metrics for both human and artificial agents. Recent computational results demonstrate that reinforcement learning (RL) agents employing deep neural policies attain elite human performance in reduced-format versions of Liar’s Poker, notably in the 3×3 and 5×5 cases, by leveraging state abstraction, actor-critic methods, and self-play.
1. Formal Game Structure
Reduced-format Liar’s Poker comprises $L$ players, each receiving a private hand of $H$ independent draws from the digit set $\{1, \dots, D\}$. The hand of player $i$ is a count vector $h_i = (h_i(1), \dots, h_i(D))$ with $\sum_d h_i(d) = H$. The aggregate public state is $c = \sum_{i=1}^{L} h_i$, where $c(d)$ is the total number of digit $d$ held among all players.
Each round consists of sequential bidding, with allowed bids of the form $(q, d)$, interpreted as “at least $q$ occurrences of digit $d$ exist among all hands.” Lexicographical order constrains the legal set of subsequent bids: $(q', d') \succ (q, d)$ iff $q' > q$ or ($q' = q$ and $d' > d$). On their turn, a player must either challenge the preceding bid or escalate to a higher valid bid.
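To make the ordering concrete, here is a minimal Python sketch that enumerates the legal escalations of a current bid; the tuple representation and the names `beats` and `legal_bids` are illustrative assumptions, not the paper's implementation.

```python
# Bids are (q, d): "at least q copies of digit d exist among all hands".
# A bid (q2, d2) beats (q1, d1) iff q2 > q1, or q2 == q1 and d2 > d1.
def beats(new_bid, old_bid):
    (q2, d2), (q1, d1) = new_bid, old_bid
    return q2 > q1 or (q2 == q1 and d2 > d1)

def legal_bids(current_bid, max_quantity, num_digits):
    """All bids strictly above current_bid under the lexicographic order."""
    candidates = [(q, d)
                  for q in range(1, max_quantity + 1)
                  for d in range(1, num_digits + 1)]
    if current_bid is None:          # opening bid: any (q, d) is legal
        return candidates
    return [b for b in candidates if beats(b, current_bid)]
```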
A key mechanic is the rebid: if all opponents challenge a bid $(q, d)$, the proposer may choose between calling for a count (revealing all hands) and, if the rebid privilege has not yet been exercised, submitting a single new bid $(q', d') \succ (q, d)$. A second round of challenges after a rebid finalizes the hand, either via a count or via a loss for the final challenged bidder.
Payoff for a counted hand is determined as follows: if $c(d) \ge q$ for the final bid $(q, d)$, the final bidder collects one unit from each opponent; otherwise the bidder pays one unit to each opponent. In a three-player match, this yields a per-hand payoff of $\pm 2$ units for the final bidder.
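A minimal sketch of count resolution under the unit-stake payoff just described; the function name `resolve_count` and the aggregation of raw digit lists into a public count are illustrative assumptions.

```python
from collections import Counter

def resolve_count(hands, final_bid, num_opponents):
    """hands: list of per-player digit lists; final_bid: (q, d).
    Returns the final bidder's payoff under unit stakes per opponent."""
    q, d = final_bid
    total = Counter()              # public count c = sum of private hands
    for hand in hands:
        total.update(hand)
    bid_holds = total[d] >= q      # "at least q copies of digit d"
    return num_opponents if bid_holds else -num_opponents
```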
State, action, and chance are formalized as follows (a minimal deal-sampling sketch is given after this list):
- State space: $s = (h_i, B, C, p, r)$, with $B$ the bid history, $C$ the vector of challenging players, $p$ the current player index, and $r$ the rebid flag.
- Action space: $\mathcal{A}(s) = \{\text{challenge}\} \cup \{(q', d') : (q', d') \succ (q, d)\}$ for current bid $(q, d)$; if the current bid is fully challenged and the current player is its proposer, “count” is also available.
- Chance enters only at the initial deal: each hand consists of $H$ i.i.d. uniform draws from $\{1, \dots, D\}$, so $P(h_i) = \binom{H}{h_i(1), \dots, h_i(D)}\, D^{-H}$ independently across players.
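As referenced above, a minimal deal-sampling sketch; the uniform i.i.d. deal and the count-vector encoding follow the formalization, while the function name and return layout are assumptions.

```python
import random
from collections import Counter

def deal(num_players, hand_size, num_digits, rng=random):
    """Each player independently draws hand_size i.i.d. uniform digits."""
    hands = [[rng.randint(1, num_digits) for _ in range(hand_size)]
             for _ in range(num_players)]
    # Count-vector representation h_i and aggregate public count c.
    count_vectors = [[Counter(h)[d] for d in range(1, num_digits + 1)] for h in hands]
    public_count = [sum(cv[d] for cv in count_vectors) for d in range(num_digits)]
    return hands, count_vectors, public_count

# Example: a 3x3, 3-player deal (H = 3 digits drawn from {1, 2, 3}).
hands, count_vectors, public_count = deal(num_players=3, hand_size=3, num_digits=3)
```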
2. Policy Representation and Network Design
Solly, the AI agent attaining elite-level performance in reduced-format Liar’s Poker, employs a unified multi-layer perceptron (MLP) architecture parameterizing both policy and value functions. Input features include:
- Private hand as a length-$D$ integer count vector $h_i$,
- One-hot encoding of the full bid history (clipped to game-legal length),
- Challenge history: one-hot vector tracking challengers,
- Rebid flag and a binary terminal indicator.
These are concatenated into a fixed-length feature vector $x$. The network consists of two fully connected hidden layers of 256 units each, activated by ReLU functions. The actor (policy) head is a linear layer over the set of possible actions; the critic (value) head is a linear projection to $\mathbb{R}$. In experiments scaling to more complex regimes, deeper networks (7 layers of 512 units each) were employed to accelerate and stabilize convergence.
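A minimal PyTorch sketch of the described architecture; the use of PyTorch and the names `ActorCriticMLP`, `in_dim`, and `n_actions` are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class ActorCriticMLP(nn.Module):
    def __init__(self, in_dim: int, n_actions: int, hidden: int = 256, depth: int = 2):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):                      # 2x256 baseline; 7x512 in scaled runs
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        self.trunk = nn.Sequential(*layers)
        self.policy_head = nn.Linear(d, n_actions)  # logits over the action set
        self.value_head = nn.Linear(d, 1)           # scalar state value

    def forward(self, x: torch.Tensor):
        z = self.trunk(x)
        return self.policy_head(z), self.value_head(z).squeeze(-1)
```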
3. Learning Process and Training Regime
Solly is optimized by self-play under the regularized Nash dynamics (R-NaD) actor-critic framework:
- The policy (actor) loss is
  $$\mathcal{L}_{\pi}(\theta) = -\,\mathbb{E}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right] - \beta\,\mathbb{E}_t\!\left[\mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big)\right],$$
  with entropy regularization (coefficient $\beta > 0$) to enforce play randomization.
- The value (critic) loss is
  $$\mathcal{L}_{V}(\phi) = \mathbb{E}_t\!\left[\big(r_t + \gamma\,V_\phi(s_{t+1}) - V_\phi(s_t)\big)^2\right],$$
  where $\gamma$ is the discount factor.
Advantage estimation employs generalized advantage estimation (GAE), computed recursively as
$$\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}, \qquad \delta_t = r_t + \gamma\,V_\phi(s_{t+1}) - V_\phi(s_t),$$
with GAE parameter $\lambda \in [0, 1]$.
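For reference, a minimal plain-Python sketch of the GAE recursion above; the function name and the convention that `values` carries one extra bootstrap entry are assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards[t], values[t] for t = 0..T-1; values has a bootstrap entry values[T]."""
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        running = delta + gamma * lam * running                  # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv
```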
Training proceeds with the Adam optimizer, using separate actor and critic learning rates decayed over time, and batch updates over trajectories truncated at a fixed maximum number of moves. The “shared-policy” self-play paradigm is adopted: every agent in the environment runs from the same policy instance, avoiding explicit opponent modeling (sketched below). Rollouts are generated on-policy without a replay buffer, and learning proceeds through a large number of gradient updates over a total of under 3 billion self-play hands.
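A minimal skeleton of shared-policy, on-policy rollout generation; the environment interface (`reset`, `current_player`, `step`) and `policy.sample` are hypothetical placeholders for the game simulator and the single shared actor-critic network.

```python
def self_play_rollout(env, policy, max_moves):
    """Generate one on-policy trajectory in which every seat uses the same policy."""
    trajectory = []
    state = env.reset()                      # deal private hands
    for _ in range(max_moves):
        seat = env.current_player()
        action = policy.sample(state, seat)  # shared policy instance for all seats
        next_state, reward, done = env.step(action)
        trajectory.append((state, seat, action, reward))
        state = next_state
        if done:
            break
    return trajectory                        # consumed immediately; no replay buffer
```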
4. Strategic Innovations and Exploitability
Solly exhibits distinct strategic differences compared to elite human play:
- Rebidding frequency: Solly’s rebid rate in 3×3, 3-player games differs sharply from that of elite humans, with the agent deploying the rebid strategically as a bluffing device.
- Opening tactics: Initial bids are frequently non-forcing, meaning they avoid placing the next player in a deterministic challenge vs. bid-up position, which diversifies the ensuing game tree.
- Randomization: The entropy regularization term in the policy loss maintains stochasticity over multiple near-optimal bidding actions, making policies less predictable and more robust to exploitation.
- Empirical exploitability: training a DQN-based “best-response” agent against fixed Solly checkpoints, the best responder’s observed payoff is high against early checkpoints (high exploitability) and declines substantially against later checkpoints, indicative of convergence toward a locally stable, less exploitable policy.
5. Comprehensive Evaluation and Performance Metrics
Performance is primarily measured via win rate (the fraction of hands with positive payoff) and equity (average net winnings per 100 hands); a small computation sketch of both metrics follows the table. The following results characterize Solly’s efficacy:
| Setting | Opponent | Equity per 100 hands |
|---|---|---|
| 3×3, 2-player | Baseline model | +16 ± 3 |
| 3×3, 2-player | GPT-4.1 | +19 ± 3 |
| 3×3, 2-player | OpenAI o3 | +9 ± 3 |
| 3×3, 2-player | Elite humans | −4 ± 10 |
| 3×3, 2-player | DQN best-response | −12 ± 3 |
| 5×5, 2-player | Elite humans | +10 ± 10 |
| 3×3, 3-player | 2 elite humans | +17 ± 15 |
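As referenced above, a small sketch of the two reported metrics over a list of per-hand payoffs; the function names are illustrative.

```python
def win_rate(payoffs):
    """Fraction of hands with strictly positive payoff."""
    return sum(p > 0 for p in payoffs) / len(payoffs)

def equity_per_100(payoffs):
    """Average net winnings, scaled to 100 hands."""
    return 100.0 * sum(payoffs) / len(payoffs)
```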
Ablation studies indicate that hand-abstraction (mapping hands to “canonical” count vectors) and architectural scaling (increased MLP depth, reward rescaling) further suppress exploitability and provide a credible route to scaling methods up to full-format (8×10) Liar’s Poker.
6. Implications and Prospective Directions
Solly establishes a benchmark as the first AI to achieve competitive, elite performance in a full multi-player, sequential bidding game with both bluffing and rebid complexity. Despite the small scale of the input and network (relative to commercial poker bots), the agent requires no explicit search at test time and achieves strong results with under 3 billion self-play hands on commodity hardware.
Key implications include:
- The shared-policy self-play paradigm is effective in 3+ player imperfect-information games with continuous engagement and non-myopic dynamics.
- Existing RL infrastructure (off-the-shelf actor-critic, GAE, modest MLP) suffices for attaining robust, low-exploitability performance.
- Strategic play incorporating deliberate bluffing via rebids and policy-driven randomization emerges naturally under entropy-regularized RL.
Open research avenues include incorporating Monte Carlo search at test time, curriculum learning or transfer learning from reduced to full-format games, multi-policy asynchronous self-play, and the combination of LLM reasoning components with learned policy distributions. These directions suggest a generalizable methodology for AI development in other complex, multi-agent, imperfect-information environments.