Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning
Abstract: AI researchers have long focused on poker-like games as a testbed for environments characterized by multi-player dynamics, imperfect information, and reasoning under uncertainty. While recent breakthroughs have matched elite human play at no-limit Texas hold'em, the multi-player dynamics are subdued: most hands converge quickly with only two players engaged through multiple rounds of betting. In this paper, we present Solly, the first AI agent to achieve elite human play in reduced-format Liar's Poker, a game characterized by extensive multi-player engagement. We trained Solly using self-play with a model-free, actor-critic, deep reinforcement learning algorithm. Solly played at an elite human level as measured by win rate (won over 50% of hands) and equity (money won) in heads-up and multi-player Liar's Poker. Solly also outperformed LLMs, including those with reasoning abilities, on the same metrics. Solly developed novel bidding strategies, randomized play effectively, and was not easily exploitable by world-class human players.
Explain it Like I'm 14
What this paper is about
This paper introduces Solly, an AI that learns to play a bluffing game called Liar’s Poker at a high (elite human) level. Unlike regular poker, Liar’s Poker keeps many players involved in every round and includes a unique “rebid” option that encourages bluffing. The authors show how Solly learned to beat strong human players and performed better than popular LLMs in this game, even though it was trained with modest computing power.
The big questions the researchers asked
- Can an AI learn to play a hard, multi-player, bluff-heavy game—where players don’t see each other’s hands—at an elite human level?
- Can it do this without huge supercomputers or complex search during play (no heavy “thinking” at move time)?
- How does such an AI compare to expert humans and to LLMs like GPT-4.1 and o3?
- Can the AI discover new, effective strategies—especially involving bluffing and the special “rebid” move?
- How hard is it to “exploit” (find patterns and beat) the AI’s strategy?
What is Liar’s Poker (in simple terms)?
- Each player gets a small set of hidden digits (like 3 digits, each from 1–3 in a “3x3” version).
- Players take turns making bids like “four 2s,” which means “I think there are at least four 2s across everyone’s hidden digits.”
- Other players can raise (make a stronger bid) or challenge (call your bluff).
- If someone challenges, everyone reveals their digits, and you check if the final bid was true.
- Special twist: the bidder who gets challenged may be allowed one “rebid” (raise again) before everyone counts. This creates powerful bluffing and mind games.
The researchers focused on smaller, “reduced-format” games, where the name gives the hand size and the digit range:
- 3x3: each player holds 3 digits, each from 1–3
- 5x5: each player holds 5 digits, each from 1–5
Games were played heads-up (2 players) or with 3 players. These smaller versions keep the strategy rich but make training more practical.
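To make these rules concrete, here is a minimal Python sketch of the core mechanics under simplifying assumptions: uniform random deals, bids ordered by (quantity, digit), and a challenge resolved by counting the claimed digit across all hands. The rebid step and money bookkeeping are omitted, and all names and the bid encoding are illustrative rather than taken from the paper's code.

```python
import random
from itertools import product

# A bid (quantity, digit) asserts: "across ALL players' hidden digits,
# `digit` appears at least `quantity` times."
# Bids are ordered by (quantity, digit); a raise must be strictly higher.

def deal(num_players, hand_size, max_digit, rng):
    """Deal uniform random hands (a simplifying assumption)."""
    return [[rng.randint(1, max_digit) for _ in range(hand_size)]
            for _ in range(num_players)]

def legal_raises(current_bid, num_players, hand_size, max_digit):
    """All bids strictly higher than current_bid (None means any opening bid is legal)."""
    max_qty = num_players * hand_size
    all_bids = list(product(range(1, max_qty + 1), range(1, max_digit + 1)))
    return all_bids if current_bid is None else [b for b in all_bids if b > current_bid]

def bid_holds(bid, hands):
    """True if the claimed digit really appears at least `quantity` times in total."""
    quantity, digit = bid
    return sum(hand.count(digit) for hand in hands) >= quantity

# Tiny 3x3-style demo with 3 players.
rng = random.Random(0)
hands = deal(num_players=3, hand_size=3, max_digit=3, rng=rng)
opening = (2, 3)      # "two 3s"
raise_bid = (4, 2)    # "four 2s" -- higher quantity, so a legal raise over (2, 3)
assert raise_bid in legal_raises(opening, num_players=3, hand_size=3, max_digit=3)
winner = "bidder" if bid_holds(raise_bid, hands) else "challenger"
print(hands, winner)
```

(In the real game, a challenged bidder may instead rebid, i.e., make one more legal raise before the count; that branch is left out here for brevity.)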
How they taught the AI (the approach)
Think of training like learning to play a sport by scrimmaging yourself millions of times and adjusting your tactics each time.
- Self-play: Solly played against copies of itself over and over. This lets it learn from its own mistakes and discover effective strategies.
- Reinforcement learning (RL): The AI tries moves, sees what leads to winning, and updates its behavior to win more often in the future. It uses an “actor-critic” setup:
- Actor: suggests what to do next (bid, raise, or challenge).
- Critic: evaluates how good those choices were, guiding the actor to improve.
- Neural network “brain”: A simple multi-layer perceptron (MLP) takes in the game state (what bids were made, your hand, etc.) and outputs a probability for each allowed action. This helps Solly mix its play (not be predictable) and bluff effectively.
- Regularized Nash dynamics (R-NaD): A learning rule that nudges strategies toward a balanced, hard-to-exploit point (an approximate Nash equilibrium). Its guarantees apply to two-player zero-sum games; the authors extended it to games with more players.
- Practical setup: The team used standard tools (OpenSpiel) and off-the-shelf hardware. No heavy search at move time; Solly acts quickly using its trained policy.
Analogy: Imagine a chess player who doesn’t calculate deep trees on every move during a tournament; instead, they rely on deep intuition learned from millions of practice games. That’s Solly.
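To make the “MLP brain with an actor and a critic” idea concrete, below is a minimal PyTorch sketch of a network with a policy head and a value head, with illegal actions masked out before the softmax. The layer sizes, the state encoding, and the masking scheme are illustrative assumptions; this is not the authors' architecture or their R-NaD training loop.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Small MLP with a policy head (actor) and a value head (critic); sizes are illustrative."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)  # logits over all actions
        self.value_head = nn.Linear(hidden, 1)             # expected-return estimate

    def forward(self, state: torch.Tensor, legal_mask: torch.Tensor):
        h = self.trunk(state)
        logits = self.policy_head(h)
        # Mask out illegal actions (e.g. bids that are not higher than the current bid).
        logits = logits.masked_fill(~legal_mask, float("-inf"))
        return torch.softmax(logits, dim=-1), self.value_head(h).squeeze(-1)

# Hypothetical example: a 60-dim state encoding, 20 possible actions, 5 of them legal.
net = PolicyValueNet(state_dim=60, num_actions=20)
state = torch.randn(1, 60)
legal = torch.zeros(1, 20, dtype=torch.bool)
legal[0, :5] = True
policy, value = net(state, legal)
action = torch.multinomial(policy, 1)  # sampling (not argmax) keeps play mixed and hard to read
```

Sampling from the policy head rather than always taking the highest-probability action is what lets an agent like this randomize its bids and bluffs.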
What they found and why it matters
Here are the key results, followed by why they’re important:
- Against elite humans:
- Heads-up (2 players) 3x3: Solly was roughly on par (about 48% wins in a small sample).
- Heads-up 5x5: Solly won more often (about 55% wins).
- 3-player 3x3: Solly did well (about 54% wins and positive money). With only 100 hands, results are promising but not “statistically certain.”
- Why it matters: Multi-player bluffing games are much harder for AI than 2-player games. Showing strong, human-level play here is a big step.
- Against simple probability-based opponents (a baseline model):
- Solly clearly won. It beat a player that makes decisions purely by “what’s most likely,” with no bluffing or randomness (a minimal sketch of such a baseline follows this list).
- Why it matters: Great play in Liar’s Poker needs more than raw probabilities—it needs deception, adaptation, and unpredictable play.
- Against LLMs (GPT-4.1 and o3):
- Solly beat both LLMs in 1,000-hand matches, winning about 60% vs GPT-4.1 and 55% vs o3.
- LLMs tended to play deterministically and conservatively, rarely (or never) using the powerful rebid bluff.
- Why it matters: Even advanced LLMs struggle with strategic uncertainty and deception. Specialized RL agents can outperform them in games requiring subtle bluffing and mixed strategies.
- New strategies and behavior:
- Solly bluffed more often by using the “rebid” feature far more than humans expected.
- It didn’t always make the “forcing” opening bids that humans considered standard. These unconventional openings surprised elite players and sometimes won hands against stronger human holdings.
- Why it matters: AI can discover creative strategies that challenge human “received wisdom.”
- Harder to exploit over time:
- A separate “best response” agent tried to find weaknesses in Solly. As Solly trained more, the exploiter’s success dropped, meaning Solly became less predictable and harder to beat.
- Why it matters: Strong game AIs aren’t just good; they’re also tough to “game.”
- Compute efficiency and “thinking time”:
- Solly was trained on billions of practice hands but uses almost no extra “thinking” during a match.
- LLMs took longer to respond but still didn’t adapt or bluff well.
- Humans used little “thinking time” each move but could adapt their strategy across hands.
- Why it matters: There’s a trade-off between training time and thinking time. Combining both smart training and smart “on-the-spot” planning could push performance even higher.
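For contrast with Solly's mixed, deceptive play, here is a minimal sketch of the kind of probability-only baseline mentioned above: it estimates how likely the standing bid is to be true given its own hand (treating unseen digits as independent and uniform), challenges when the bid looks unlikely, and otherwise makes the smallest raise it itself believes. The 0.5 thresholds and the decision rule are illustrative assumptions, not the paper's baseline implementation.

```python
from math import comb

def prob_at_least(q, digit, my_hand, num_players, hand_size, max_digit):
    """P(total count of `digit` across all hands >= q), given only my hand.
    Unseen digits are modeled as i.i.d. uniform over 1..max_digit."""
    have = my_hand.count(digit)
    unseen = (num_players - 1) * hand_size
    need = max(q - have, 0)
    p = 1.0 / max_digit
    # Tail of a Binomial(unseen, p) distribution.
    return sum(comb(unseen, k) * p**k * (1 - p)**(unseen - k)
               for k in range(need, unseen + 1))

def baseline_action(current_bid, my_hand, num_players, hand_size, max_digit):
    """Challenge if the standing bid is probably false; otherwise make the smallest
    raise that is itself probably true. Deterministic, never bluffs."""
    def p(bid):
        return prob_at_least(bid[0], bid[1], my_hand, num_players, hand_size, max_digit)

    if current_bid is not None and p(current_bid) < 0.5:
        return "challenge"
    max_qty = num_players * hand_size
    raises = [(q, d) for q in range(1, max_qty + 1) for d in range(1, max_digit + 1)
              if current_bid is None or (q, d) > current_bid]
    believable = [b for b in raises if p(b) >= 0.5]
    return min(believable) if believable else "challenge"

# 3x3 game, 3 players, my hand is [2, 2, 3], the standing bid is "two 1s".
print(baseline_action((2, 1), [2, 2, 3], num_players=3, hand_size=3, max_digit=3))  # -> (2, 2)
```

Because this player's choices are a deterministic function of its hand and the standing bid, an observant opponent can read it quickly, which is exactly the weakness the paper's results highlight.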
Why this research matters (big picture impact)
- A better testbed for multi-player uncertainty: Liar’s Poker keeps everyone involved every round and forces constant bluffing and reading the room. It’s a great playground for building AIs that handle uncertainty, deception, and group dynamics—skills useful beyond games (like auctions, negotiations, and markets).
- Strong results without huge compute: Solly reached elite human-level play in smaller versions of the game using modest resources. This opens the door for more researchers to explore advanced game AI.
- Insights for future AI systems:
- Mix of strategy and psychology: Winning isn’t just about math; it’s about being unpredictable and creating pressure. Solly learned this through self-play.
- LLM limitations: Today’s LLMs can explain their thinking, but in this setting they didn’t adapt or bluff well. Combining RL agents (like Solly) with LLM reasoning, or adding smarter planning at move time, could be powerful.
- Path to scaling up: The authors outline steps—like smarter rewards, deeper networks, and hand abstractions—to extend this approach to the full-size game played on Wall Street (8x10). Their early tests suggest this is feasible.
In short
The paper shows that an RL-trained AI, Solly, can learn to outbid and outbluff strong human players in a tough multi-player game, using clever self-play rather than expensive search at game time. It discovers surprising strategies, resists exploitation, and beats LLMs. This points to a future where AI handles real-world, uncertain, multi-player situations more strategically—and where small, well-designed games help us build those skills efficiently.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of what remains missing, uncertain, or unexplored, framed to be actionable for future research.
- Multi-player equilibrium guarantees: The extension of R-NaD to three or more players lacks theoretical guarantees; it remains unknown whether training converges to any equilibrium in Liar’s Poker or how “safe” the learned strategies are under opponent deviations.
- Full-scale game performance: Solly is only demonstrated on reduced-format (3x3, 5x5) Liar’s Poker; performance, stability, and compute requirements for the full 8x10 game with realistic player counts (≥4) are not established.
- Statistical robustness of human evaluation: Human trials (100–300 hands across conditions) are too small to support strong statistical claims; target sample sizes, variance estimates, and power analyses for heads-up and multi-player settings are not provided.
- Player selection bias: Results rely on a small set of elite players; generalization to broader skill distributions (novice, intermediate, mixed-skill tables) and different stakes is unknown.
- Multi-player scaling beyond three players: Solly’s performance and training stability with 4+ players—where round lengths and state space explode—remain unexplored.
- Positional effects at scale: While preliminary best-response tests found no material positional differences in 3-player 3x3, the impact of player order on exploitability and equity in larger games and longer rounds is not characterized.
- Lack of test-time planning and adaptation: Solly uses no test-time compute (e.g., MCTS, lookahead) and no online opponent modeling; the gains from adding search, dynamic adaptation, memory of past hands, or meta-strategy updates remain unquantified.
- Exploitability measurement limitations: Best-response exploitability is approximated with DQN for 1M steps and a small network; tighter bounds (e.g., stronger exploiters, policy-gradient exploiters, multi-agent exploiters, and asymptotic convergence studies) are needed.
- Role and optimality of rebids: Solly rebids much more than humans; a formal analysis of optimal rebid frequency, bluff value, and equilibrium strategies under the rebid mechanism is missing.
- Reward design and credit assignment: The paper notes diluted reward signals in longer rounds and suggests reward scaling; systematic ablations (shaping, scaling factors, terminal bonuses, or temporal-difference variants) and their impact on policy quality are not reported.
- Architecture and training ablations: The relative contributions of deeper networks, hand abstractions, entropy regularization, learning-rate schedules, and multi-agent multi-policy training are not isolated via controlled experiments.
- Shared-policy vs. agent-specific policies: Solly uses a single shared policy network; whether agent-specific policies or population-based training improve robustness and reduce exploitability remains open.
- Canonical hand abstraction trade-offs: The benefits and potential harms (e.g., loss of nuance, symmetry-induced bias) of training on canonical hands versus explicit digits are not fully analyzed (a small illustration of hand canonicalization follows this list).
- Robustness to collusion or correlated strategies: Preliminary evidence suggests identical-strategy LLMs leak information to each other; Solly’s robustness to coordinated or colluding opponents has not been systematically tested.
- Performance under official Salomon Brothers variants: The agent is evaluated under reduced rules; effects of extensions/bonuses and variant rule sets on strategy and performance are unknown.
- Opening-bid theory validation: Humans claim forcing openings are optimal; the paper does not provide a formal comparative analysis of opening-bid strategies (e.g., forcing vs. probing) under different hands and table configurations.
- Hand-type-specific weaknesses: Humans underperformed with 2-of-a-kind in 3-player; a causal analysis of this pattern and whether targeted training can fix such weaknesses (for both humans and agents) is absent.
- Compute and sample efficiency: Concrete compute budgets, scaling curves (performance vs. steps/parameters), and sample efficiency comparisons to search-based methods (e.g., ReBeL, NFSP) are not provided.
- Evaluation metrics beyond win rate/equity: Metrics such as exploitability bounds, calibration of value head estimates, bluff frequency optimality, and action distribution entropy are not reported.
- LLM evaluation breadth: Only GPT‑4.1 and o3 were tested; broader model families, fine-tuning, tool-use (e.g., external calculators, MCTS), memory across rounds, and prompt engineering for bluffing/randomization need systematic study.
- Adaptation over sessions: Human adaptation was anecdotally observed; controlled longitudinal studies quantifying how humans learn to exploit Solly (and vice versa) over repeated sessions are missing.
- Memory and long-horizon deception: Neither Solly nor LLMs track game histories to enable long-term deception or signaling; methods for history-aware strategies and their benefits remain unexplored.
- Generalization across digit distributions: All experiments assume uniform digit distributions; robustness to non-uniform or adversarial distributions (and the induced strategic shifts) is unknown.
- Risk-sensitive objectives: Strategies are evaluated under linear equity; effects of risk preferences, bankroll constraints, and utility shaping (e.g., CVaR, risk-sensitive RL) on bidding behavior are untested.
- Reproducibility and open resources: Details on released code, trained checkpoints, human game logs, and exact hyperparameters sufficient for full replication are not provided.
- Explainability of learned strategies: There is no systematic analysis of Solly’s bidding/bluffing policy (e.g., feature importances, policy summaries by state/hand, interpretable strategy rules), which hampers scientific understanding and transfer.
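To make the canonical-hand idea flagged above concrete: two hands containing the same digits in a different order are strategically identical, so a sorted tuple serves as a canonical form, and the number of canonical hands is the stars-and-bars count C(h + D - 1, D - 1) for h digits drawn from D values. The sketch below is a hypothetical illustration, not the paper's abstraction code.

```python
from math import comb
from itertools import product

def canonical(hand):
    """Order within a hand is irrelevant, so a sorted tuple is a canonical form."""
    return tuple(sorted(hand))

def count_canonical_hands(hand_size, num_digits):
    """Brute-force count of distinct canonical forms (only feasible for small games)."""
    raw = product(range(1, num_digits + 1), repeat=hand_size)
    return len({canonical(h) for h in raw})

# Check the brute-force count against stars-and-bars on the reduced formats.
for h, d in [(3, 3), (5, 5)]:
    assert count_canonical_hands(h, d) == comb(h + d - 1, d - 1)

print("3x3:", comb(3 + 3 - 1, 3 - 1), "canonical hands vs 27 raw hands")
print("5x5:", comb(5 + 5 - 1, 5 - 1), "canonical hands vs 3125 raw hands")
print("8x10 (full game):", comb(8 + 10 - 1, 10 - 1), "canonical hands vs 10**8 raw hands")
```

The collapse from 10^8 raw hands to 24,310 canonical hands in the full 8x10 game is why hand abstraction matters for scaling.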
Practical Applications
Immediate Applications
Below are actionable applications that can be deployed with modest integration and without major advances in algorithms or compute.
- Solly-powered training and assessment for negotiation and auctions
- Sectors: finance (trading floors, bond/electricity/spectrum auctions), energy markets, ad-tech, procurement (B2B), education (MBA/exec-ed)
- What: A web-based simulator using the released OpenSpiel environment and Solly checkpoints to train teams on bluffing, signaling, and randomization under uncertainty; includes dashboards for win rate, equity, and exploitability against best-response agents
- Tools/products/workflows: “Solly Trainer” with scripted modules on rebids/forcing bids, adaptive sparring difficulty, post-session analytics (hand-quality breakdowns, risky vs safe play), and leaderboards
- Assumptions/dependencies: Organizational adoption; mapping of game dynamics to target auction/negotiation contexts; basic compute for self-play and evaluation; careful use for hiring/assessment (consent, bias, fairness)
- Lightweight benchmark/testbed for multi-agent imperfect-information RL
- Sectors: academia, AI labs, startups
- What: Use 3x3 and 5x5 Liar’s Poker as a reproducible, low-compute benchmark to compare algorithms (R-NaD, MMD, NFSP), reward-scaling, hand abstractions, and inference-time search vs no-search
- Tools/products/workflows: Open-source repo + Docker image; evaluation protocol with best-response exploitability curves; public leaderboard (Kaggle-style) with fixed seeds and compute budgets
- Assumptions/dependencies: Availability of environment code and model checkpoints; community acceptance as a benchmark
- LLM evaluation harness for planning, randomization, and exploitability
- Sectors: LLM product teams, AI safety/evals, red-teaming
- What: Automated tournaments that pit LLMs (with/without chain-of-thought) against Solly and best-response agents to quantify determinism, lack of bluffing, and susceptibility to “strategy leakage”
- Tools/products/workflows: “Strategy Eval Suite” with metrics for equity, bid vs challenge win composition, rebid usage, multi-agent collusion probes (two identical LLMs vs third agent)
- Assumptions/dependencies: API access to LLMs; eval governance for prompting and context windows; standardized TTC budgets
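As a concrete illustration of the metrics such a harness could report, the sketch below computes win rate, average equity, rebid usage, and the entropy of each agent's empirical action distribution from a list of hand records. The record format, field names, and the toy values are hypothetical, invented only for this example.

```python
from collections import Counter
from math import log2

# Hypothetical per-hand log records an evaluation harness might emit.
hands = [
    {"winner": "solly", "equity": {"solly": 2, "llm": -2},
     "actions": {"solly": ["bid", "rebid"], "llm": ["bid", "challenge"]}},
    {"winner": "llm", "equity": {"solly": -1, "llm": 1},
     "actions": {"solly": ["bid"], "llm": ["bid", "challenge"]}},
    {"winner": "solly", "equity": {"solly": 1, "llm": -1},
     "actions": {"solly": ["challenge"], "llm": ["bid"]}},
]

def win_rate(hands, player):
    return sum(h["winner"] == player for h in hands) / len(hands)

def avg_equity(hands, player):
    return sum(h["equity"][player] for h in hands) / len(hands)

def rebid_rate(hands, player):
    return sum("rebid" in h["actions"][player] for h in hands) / len(hands)

def action_entropy(hands, player):
    """Entropy (bits) of the empirical action mix; values near 0 flag deterministic play."""
    counts = Counter(a for h in hands for a in h["actions"][player])
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

for player in ("solly", "llm"):
    print(player, win_rate(hands, player), avg_equity(hands, player),
          rebid_rate(hands, player), round(action_entropy(hands, player), 2))
```

Low action entropy and zero rebid usage are the kinds of signals that flagged the LLMs' deterministic, conservative play.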
- Decision-making courses and workshops on probabilistic reasoning and signaling
- Sectors: education (undergrad/grad statistics, decision science), corporate L&D
- What: Classroom exercises using Solly as a sparring partner to teach conditional binomial/multinomial reasoning, value of information, and randomized strategies under imperfect information
- Tools/products/workflows: Interactive probability calculators, hand-quality visualizers, annotated replays demonstrating human vs AI mistakes
- Assumptions/dependencies: Instructor adoption; mapping of in-game concepts to course outcomes
- Exploitability scanning (“strategy red team”) for deployed agents
- Sectors: gaming, fintech, marketplaces, AI assurance
- What: Train DQN best-response agents against a fixed strategy to score exploitability and recommend policy patches (e.g., increased randomization in specific states)
- Tools/products/workflows: “Exploitability Scanner” CI step for strategy updates; periodic regressions; reports with positional vulnerability heatmaps
- Assumptions/dependencies: Access to agent policies or APIs; stable evaluation environments
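To illustrate the core idea behind such a scanner, the toy sketch below uses tabular Q-learning (standing in for the paper's DQN exploiters) to learn a best response to a frozen, biased mixed strategy in matching pennies. The exploiter's learned value against the fixed policy estimates how exploitable it is; a policy that randomized 50/50 would drive the estimate toward zero. The game, the frozen policy, and the hyperparameters are all stand-ins, not the paper's setup.

```python
import random

def frozen_policy(rng):
    """The fixed strategy being audited: plays 'H' 70% of the time -- an exploitable bias."""
    return "H" if rng.random() < 0.7 else "T"

def payoff(exploiter_action, frozen_action):
    """Matching-pennies style: the exploiter wins (+1) by matching, loses (-1) otherwise."""
    return 1.0 if exploiter_action == frozen_action else -1.0

def train_best_response(episodes=20_000, eps=0.1, lr=0.05, seed=0):
    """Epsilon-greedy Q-learning of a best response to the frozen policy."""
    rng = random.Random(seed)
    q = {"H": 0.0, "T": 0.0}
    for _ in range(episodes):
        a = rng.choice(["H", "T"]) if rng.random() < eps else max(q, key=q.get)
        r = payoff(a, frozen_policy(rng))
        q[a] += lr * (r - q[a])  # one-step update; there is no next state in this toy game
    return q

q = train_best_response()
best = max(q, key=q.get)
# Expected best-response value here is about +0.4 per hand (always answer "H");
# it would be about 0.0 against a uniformly random frozen policy.
print(q, "exploitability estimate:", round(q[best], 2))
```

In a CI setting, a rising best-response value after a strategy update is the regression signal that would trigger a review or a patch toward more randomization.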
- Anti-collusion and “strategy leakage” diagnostics in multi-agent systems
- Sectors: ad-tech, marketplaces, platform integrity, policy
- What: Use controlled multi-agent tournaments to detect unintended coordination when agents share similar deterministic policies (as observed with two identical LLMs)
- Tools/products/workflows: Collusion probes, similarity metrics over action distributions, alerts for correlated play patterns
- Assumptions/dependencies: Sufficient logs and agent identity separation; consent/compliance; interpretation guardrails (false positives)
- Low-compute RL pipeline template for imperfect-information games
- Sectors: startups, academic labs with limited resources
- What: Reference implementation (R-NaD + MLP + reward scaling + hand abstraction) and hyperparameter “recipes” to reproduce strong play on off-the-shelf hardware
- Tools/products/workflows: “Solly Recipe” with OpenSpiel fork, CLI for training/eval/human-in-the-loop play, learning-rate schedules, logging
- Assumptions/dependencies: Access to code; minimal GPU/CPU resources; reproducibility docs
- Educational probability and decision calculators
- Sectors: education, finance training
- What: Interactive tools computing P(S_r ≥ q | hand) with visualizations of hand strength and expected value of bids/challenges under different game sizes
- Tools/products/workflows: Web calculators and notebooks; classroom assignments linking analytics to bidding strategy
- Assumptions/dependencies: Basic web deployment; verified formulas and UX
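Under the uniform, independent digit assumption used throughout the paper's experiments, the quantity such a calculator displays has a simple closed form. Writing c for the number of copies of digit r in your own hand, n = (P - 1)h for the number of digits you cannot see (P players with h digits each), and D for the number of digit values, the count of r among the unseen digits is Binomial(n, 1/D); the notation beyond S_r and q is ours:

```latex
P\left(S_r \ge q \mid \text{hand}\right)
  = \sum_{k=\max(q-c,\,0)}^{n} \binom{n}{k}
    \left(\tfrac{1}{D}\right)^{k} \left(1-\tfrac{1}{D}\right)^{n-k}
```

For example, in the 3x3 game with three players, holding two 2s and facing the bid “four 2s” gives c = 2, n = 6, D = 3, and a probability of roughly 0.65 that the bid is true.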
Long-Term Applications
Below are higher-impact applications that require additional research, scaling to larger games, integration with live systems, or regulatory/ethical frameworks.
- Auction-strategy copilot with adaptive bluffing and signaling
- Sectors: spectrum/electricity auctions, ad exchanges, procurement platforms
- What: Decision support that proposes bid/rebid sequences and randomized play to avoid exploitation; integrates opponent modeling and inference-time search (e.g., MCTS)
- Tools/products/workflows: Real-time “Auction Copilot” with policy uncertainty bands, explanations, and compliance logging
- Assumptions/dependencies: High-fidelity simulators, reliable opponent modeling in non-stationary markets, strict regulatory compliance, human-in-the-loop controls
- Multi-agent market simulation for policy and mechanism design
- Sectors: regulators, exchanges, policy think tanks
- What: Calibrated simulations to evaluate rule changes (rebid-like options, reserve prices), measure welfare, manipulation risk, and emergent collusion
- Tools/products/workflows: “Market Sandbox” with dashboards for equilibrium shifts, exploitability, and fairness under alternative designs
- Assumptions/dependencies: Data for calibration; model validation; cross-disciplinary oversight (economics, policy, AI)
- Adaptive negotiation agents with hybrid LLM + RL strategy cores
- Sectors: B2B procurement/sales, customer support, dispute resolution
- What: Agents that handle language negotiation (LLM) while the underlying strategy is driven by an RL policy that randomizes and bluffs optimally; supports tactic rotation and opponent-specific adaptation
- Tools/products/workflows: “Negotiate Assist” embedded in CRMs and sourcing tools; guardrails for ethics and transparency
- Assumptions/dependencies: Robust online learning, explainability of strategic recommendations, user trust and governance
- Robustness, safety, and compliance toolchain for multi-agent AI
- Sectors: AI governance, compliance, platform integrity
- What: Continuous auditing for exploitability and tacit collusion; detection of strategy determinism and information leakage across agents; intervention policies
- Tools/products/workflows: “Agent Safety Suite” with red-teaming via best-response training, adversarial tournaments, and automated mitigations
- Assumptions/dependencies: Access to agent logs/policies; organizational buy-in; clear definitions of prohibited behaviors
- Full-scale 8×10 Liar’s Poker benchmark for strategic AI
- Sectors: AI research
- What: A standardized, challenging benchmark with multi-player engagement throughout, rebid dynamics, and large state space; used to study TTC vs self-play, search vs no-search, and algorithmic generalization
- Tools/products/workflows: Public leaderboards; compute-budgeted tracks; reference baselines (R-NaD, MMD, NFSP, ReBeL variants)
- Assumptions/dependencies: Compute resources; community stewardship; reproducible protocols
- Real-time TTC-enhanced strategic engines for markets
- Sectors: finance, energy, logistics
- What: Low-latency inference-time planners (e.g., MCTS on top of learned policies) that evaluate multiple contingencies and roll-outs under uncertainty
- Tools/products/workflows: “Strategy Engine” APIs with latency/compute SLAs; confidence and risk controls
- Assumptions/dependencies: Hardware acceleration; strict latency budgets; reliability under distribution shift
- Behavioral analytics and bias reduction programs
- Sectors: HR, training, compliance
- What: Longitudinal measurement of human biases (anchoring, herding) and the impact of training against Solly-like agents on decision quality
- Tools/products/workflows: Pre/post tests; analytics on hand-type performance; targeted coaching modules
- Assumptions/dependencies: Ethical oversight; validated psychometrics; privacy-preserving analytics
- Cyber deception and moving-target defense inspired by bluffing strategies
- Sectors: cybersecurity (defensive operations)
- What: Deception agents that randomize signals and “rebid” defenses to increase attacker uncertainty in multi-attacker, imperfect-information settings
- Tools/products/workflows: “Deception Orchestrator” that tunes randomness/bluffing to maximize attacker resource burn while minimizing defender cost
- Assumptions/dependencies: Domain transfer validation; red-team evaluation; careful safety constraints
- Team sports/robotics strategy under partial observability (concept transfer)
- Sectors: robotics, sports analytics
- What: Shared-policy multi-agent training where roles rotate and agents must randomize and signal under uncertainty
- Tools/products/workflows: Simulation frameworks with role-sharing and best-response diagnostics
- Assumptions/dependencies: Significant domain adaptation; sensor/actuator constraints; safety
- Fairness- and competition-aware agent design in platforms
- Sectors: marketplaces, ride-hailing, food delivery
- What: Design agent bidding/pricing strategies that avoid pathological equilibria or emergent collusion; audit LLM-driven agent determinism
- Tools/products/workflows: Policy stress tests, intervention playbooks, and fairness monitoring
- Assumptions/dependencies: Data-sharing agreements; regulator collaboration; socio-technical evaluation
Cross-cutting assumptions and dependencies
- Domain transfer: Liar’s Poker captures key elements (imperfect information, multi-player engagement, signaling), but real markets introduce richer dynamics, non-stationarity, and regulatory constraints.
- Scaling: Moving from 3x3/5x5 to full 8×10 or domain-scale systems requires reward shaping, hand abstractions, deeper architectures, and more compute.
- Safety and ethics: Bluffing and randomness raise transparency and trust issues; human-in-the-loop oversight and audit trails are essential for real deployments.
- Evaluation: Best-response exploitability is a lower bound; comprehensive robustness requires adversarial tournaments, out-of-distribution tests, and long-horizon adaptation checks.
- Compute and latency: TTC (e.g., MCTS) improves performance but may violate real-time constraints; design requires careful trade-offs.
Glossary
- Actor-critic: A reinforcement learning paradigm combining a policy (actor) and a value function (critic) to learn actions and evaluate them. "We trained Solly using self-play with a model-free, actor-critic, deep reinforcement learning algorithm."
- Best response: A strategy that maximizes a player’s expected return against the fixed strategies of opponents; used to measure exploitability. "A best response policy for a player or AI is one that maximizes that player's return against all other players"
- Canonical hands: Strategically distinct hand representations that group equivalent configurations to reduce complexity. "The number of canonical (strategically distinct) hands a single player can be dealt is the well-known 'stars-and-bars' result"
- Counterfactual Regret Minimization (CFR): An iterative game-theoretic algorithm that minimizes regret for counterfactual decisions, converging to Nash equilibria in two-player zero-sum games. "In two-player zero-sum games, algorithms such as counterfactual regret minimization (CFR; Zinkevich et al., 2007) are known to converge to Nash equilibria."
- Deep Q-Network (DQN): A deep reinforcement learning algorithm that approximates Q-values to select actions; used to train exploiting agents. "Finally, the best response agents are trained using the deep Q-network (DQN) algorithm with a smaller neural network architecture and higher learning rate."
- DeepNash: An RL approach and architecture that achieved superhuman play in Stratego using R-NaD. "Stratego (Perolat et al., 2022) combined R-NaD with a deep neural network architecture to build the DeepNash AI to play Stratego"
- Equity: A poker performance metric representing expected money won or lost, accounting for stake sizes. "We compare both the average win rates and player equity or total dollar winnings."
- Exploitability: The degree to which a fixed strategy can be exploited by an optimal counter-strategy; lower is better. "This method provides an estimate of Solly's exploitability"
- Follow the Regularized Leader (FoReL): A no-regret dynamics framework that updates strategies by regularized optimization of cumulative rewards. "To train agents to play Liar's Poker, we used the regularized Nash dynamics (R-NaD) actor-critic algorithm of Perolat et al. (2021), which implements follow the regularized leader (FoReL) dynamics with an additional regularization term on the game reward"
- Forcing bid: A bid that pressures the next player such that raising typically requires a very strong hand to remain favored if challenged. "A forcing bid is one that, if the next player raises, they will need to have 2 or 3 of a kind to have an expected win if challenged."
- Hand abstraction: The technique of grouping strategically identical hands to simplify state space and improve training efficiency. "we explored the use of reward scaling during training, hand abstractions (i.e. grouping strategically identical hands together, which is common in the poker literature)"
- Heads-up: A two-player game format, often used for evaluation against a single opponent. "Solly played at an elite human level as measured by win rate (won over 50% of hands) and equity (money won) in heads-up and multi-player Liar's Poker."
- Logits: Unnormalized scores output by a neural network representing action preferences before softmax. "outputs a vector of logits specifying a distribution over possible actions"
- Magnetic Mirror Descent (MMD): A successor algorithm to R-NaD for learning game strategies via regularized dynamics. "we suspect that Liar's Poker could be implemented using the successor magnetic mirror descent (MMD) algorithm (Sokota et al., 2023)"
- Monte Carlo CFR (MCCFR): A sampling-based variant of CFR that scales to larger games by estimating counterfactual regrets via Monte Carlo methods. "Methods such as counterfactual regret minimization (CFR; Zinkevich et al., 2007) and Monte Carlo CFR (MCCFR; Lanctot et al., 2009) were the standard algorithms for poker and poker-like games"
- Monte Carlo Tree Search (MCTS): A planning method that uses randomized simulations to evaluate actions at inference time. "It is very likely that Solly would have benefited from a technique like Monte Carlo tree search (MCTS) at test-time"
- Nash equilibrium: A strategy profile where no player can gain by unilaterally deviating; foundational in game theory. "In two-player zero-sum games, algorithms such as counterfactual regret minimization (CFR; Zinkevich et al., 2007) are known to converge to Nash equilibria."
- OpenSpiel: A research framework and library for games, RL, and evaluation used to implement and train agents. "We apply the R-NaD algorithm, available through OpenSpiel (Lanctot et al., 2019), to Liar's Poker."
- Policy head: The output layer of a network that produces action probabilities. "The MLP includes multiple hidden layers with two output layers, a policy head and a value head."
- Rebid: A game mechanism allowing the bidder to strengthen their bid after being challenged, continuing the round. "The player who proposed the challenged bid may choose either to proceed with a count ... or they may choose to rebid, i.e., to propose a stronger bid, and continue playing."
- Regularized Nash Dynamics (R-NaD): An actor-critic training algorithm implementing FoReL with reward regularization to approach equilibrium play. "To train agents to play Liar's Poker, we used the regularized Nash dynamics (R-NaD) actor-critic algorithm of Perolat et al. (2021)"
- Self-play: Training method where agents learn by playing against themselves or copies of themselves. "We trained Solly using self-play with a model-free, actor-critic, deep reinforcement learning algorithm."
- Stars-and-bars: A combinatorial method to count compositions or multisets, here used to count distinct hand types. "The number of canonical (strategically distinct) hands a single player can be dealt is the well-known 'stars-and-bars' result"
- Test-time compute (TTC): Inference-time computation used to plan or reason beyond the trained policy during decision-making. "we report interesting results related to test-time compute (TTC), which was a major advancement in previous poker models and is also at the heart of chain-of-thought reasoning."
- Value head: The output layer of a network that estimates expected return (state value). "The MLP includes multiple hidden layers with two output layers, a policy head and a value head."
- Zero-sum: A game type where one player’s gain equals another’s loss, making total payoff constant. "In two-player zero-sum games, algorithms such as counterfactual regret minimization (CFR; Zinkevich et al., 2007) are known to converge to Nash equilibria."