GTO Wizard Benchmark: HUNL AI Evaluation

Updated 2 July 2026

GTO Wizard Benchmark is a comprehensive evaluation framework for Heads-Up No-Limit Texas Hold’em, designed for reproducible performance measurement in imperfect-information domains.
It uses a RESTful API and a Python client library to run automated simulations, with variance reduction via the AIVAT estimator.
The benchmark establishes a superhuman, near-Nash-equilibrium baseline against which agent strategies in multi-agent, sequential decision-making can be measured and improved.

The GTO Wizard Benchmark is a public, continuously evolving evaluation framework designed for the standardized benchmarking of algorithms in Heads-Up No-Limit Texas Hold’em (HUNL). It addresses both the need for reproducible measurement of agent performance and the methodological challenge of variance in imperfect-information domains, leveraging a RESTful API and a state-of-the-art superhuman baseline. The framework supports robust empirical comparison among artificial agents, including reinforcement learning systems and LLMs, and delivers results suitable for fine-grained analysis by researchers in computational game theory, multi-agent learning, and sequential decision-making under partial observability (Provost et al., 24 Mar 2026).

1. Benchmark Objectives and Domain Scope

The GTO Wizard Benchmark provides a scientifically rigorous, low-cost, and easily automated environment for evaluating agent performance in HUNL poker. It defines the domain as two-player, zero-sum, imperfect-information poker with 50/100 blinds and 200-big-blind stacks (20,000 chips), adhering to the ruleset of the Annual Computer Poker Competition (ACPC). The framework’s immediate goal is to facilitate precise quantitative comparison against a superhuman baseline. Its long-term vision is to evolve beyond static head-to-head tests: planned extensions cover arbitrary stack sizes, alternative cash and tournament formats, other game variants (e.g., Pot-Limit Omaha), multi-player play, and continuous agent improvement supported by a real-time public leaderboard.

2. Public API, Client Library, and Integration Workflow

The evaluation interface is a RESTful API; researchers interact with a centralized server that manages all game logic, legal actions, card dealing, and outcome reporting. An official Python client library (gtowizard_api_client) encapsulates HTTP communication and provides an Agent interface requiring implementation of the method select_action(game_state). The standard workflow is:

Instantiate a GTO Wizard client with a personal API key.
Register a custom agent (which could encapsulate a conventional AI, an LLM wrapper, or any decision process) via client.register_agent(...).
Execute benchmark evaluation by calling client.evaluate(agent, num_hands=5000); the server orchestrates hand reset, move validation, and result collection.
Receive a JSON-formatted report containing the agent's AIVAT score, raw chips, empirical variance, chance and action corrections, and additional fields for analysis.

All results are automatically published to a publicly accessible leaderboard, increasing transparency and supporting reproducibility in algorithmic evaluation. Researchers can modify the number of hands, swap in new agent strategies, and immediately compare against historical submissions.
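
The workflow above can be sketched as follows. The exact signatures of gtowizard_api_client are not given here, so this example does not import it. Instead it uses a local stand-in class, StubClient, that only mimics the calls named in the text (register_agent and evaluate). The game_state field legal_actions, the action dictionary format, and the report keys are all assumptions made for illustration, and the stub deals no cards and computes no scores.

```python
import random


class CallingAgent:
    """Toy agent: checks or calls whenever possible, otherwise folds."""

    def select_action(self, game_state):
        legal = game_state["legal_actions"]  # hypothetical field name
        for preferred in ("check", "call", "fold"):
            if preferred in legal:
                return {"action": preferred}  # hypothetical action format
        raise ValueError("no legal action available")


class StubClient:
    """Local stand-in that mimics the calls named in the text.

    This is NOT the real gtowizard_api_client. It plays no poker and only
    feeds the agent one made-up decision point per "hand".
    """

    def __init__(self, api_key):
        self.api_key = api_key
        self.agents = []

    def register_agent(self, agent):
        self.agents.append(agent)

    def evaluate(self, agent, num_hands=5000, seed=0):
        rng = random.Random(seed)
        menus = [["fold", "call", "raise"], ["check", "raise"]]
        counts = {}
        for _ in range(num_hands):
            state = {"legal_actions": rng.choice(menus)}
            action = agent.select_action(state)["action"]
            if action not in state["legal_actions"]:
                raise ValueError(f"illegal action: {action}")
            counts[action] = counts.get(action, 0) + 1
        # Hypothetical report shape; the real report carries AIVAT fields.
        return {"num_hands": num_hands, "action_counts": counts}


client = StubClient(api_key="YOUR_API_KEY")
agent = CallingAgent()
client.register_agent(agent)
report = client.evaluate(agent, num_hands=100)
print(report)
```

Keeping the decision logic inside select_action means the same agent object could, in principle, be handed to the real client once its actual state and action formats are known.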

3. GTO Wizard AI: Architecture and Nash-Equilibrium Benchmarking

The reference agent, GTO Wizard AI, employs self-play reinforcement learning over hundreds of millions of HUNL hands, coupled with real-time equilibrium solving. Crucially, the agent does not rely on fixed abstractions or precomputed lookup tables. Its policy approximates a Nash Equilibrium, effectively generalizing across diverse stack depths and bet sizes.

As an empirical baseline, GTO Wizard AI played over 150,000 hands against Slumbot, the 2018 ACPC champion and previous strongest publicly accessible HUNL benchmark. The AI achieved a result of $19.4 \pm 4.1$ big blinds per 100 hands (bb/100), substantially surpassing the performance of top human professionals (approximately 5 bb/100). This establishes a quantifiable, superhuman baseline; researchers can express agent progress as “distance to equilibrium” using this benchmark (Provost et al., 24 Mar 2026).
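
Win rates here are quoted in big blinds per 100 hands. A minimal sketch of that conversion follows, assuming per-hand results are recorded in chips and the big blind is 100 chips as in the 50/100 structure. The function name and the sample data are illustrative, and the uncertainty shown is a plain standard error of the mean, which may differ from how the benchmark itself reports the ± figure.

```python
import math
import statistics

BIG_BLIND = 100  # chips, matching the 50/100 blind structure


def bb_per_100(chip_results, big_blind=BIG_BLIND):
    """Return (mean win rate, standard error), both in bb/100."""
    in_bb = [chips / big_blind for chips in chip_results]
    mean = statistics.fmean(in_bb)
    sem = statistics.stdev(in_bb) / math.sqrt(len(in_bb))
    return 100 * mean, 100 * sem


# Made-up per-hand chip results, for illustration only.
results = [150, -100, 50, -50, 300, -200, 0, 100]
rate, err = bb_per_100(results)
print(f"{rate:.1f} ± {err:.1f} bb/100")
```

With only a handful of hands the standard error dwarfs the mean, which is why evaluations of this kind run for thousands of hands.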

4. Variance Reduction via the AIVAT Estimator

Variance presents a major obstacle to precise measurement of agent performance in stochastic, imperfect-information games. GTO Wizard Benchmark addresses this using the AIVAT (Action-Informed Value Assessment Tool) control-variates method. The estimator for the mean agent value is:

$\widehat V_{\rm AIVAT} = \frac{1}{N} \sum_{i=1}^N (X_i - Y_i) + \mathbb{E}[Y]$

where $X_i$ is the raw payoff of hand $i$ and $Y_i$ is a control variate derived from partial-information continuation values with known expectation. The estimator is provably unbiased:

$\mathbb{E}[\widehat V_{\rm AIVAT}] = \mathbb{E}[X]$

and achieves variance reduction such that $\mathrm{Var}[X - Y] \approx \mathrm{Var}[X]/9$ in practice. Because the number of hands needed for a given confidence interval scales with the variance, roughly nine times fewer hands (close to an order of magnitude) are required for the same statistical confidence as naive Monte Carlo estimation. The AIVAT adjustments (chance and action corrections) are reported in detail for each evaluation, ensuring that all score comparisons are both fair and statistically efficient (Provost et al., 24 Mar 2026).
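
The control-variate idea can be illustrated with a toy simulation. This is a generic sketch, not AIVAT's actual construction: each simulated payoff is a fixed true value plus a large "luck" term with known mean zero (playing the role of $Y$) plus a small residual. The variances are chosen by assumption so that the ratio is about 9, and all names and numbers are illustrative.

```python
import random
import statistics


def simulate(num_hands, true_value=-0.2, seed=1):
    """Return (raw payoffs X_i, corrected payoffs X_i - Y_i + E[Y])."""
    rng = random.Random(seed)
    expected_luck = 0.0  # E[Y] is known by construction
    raw, corrected = [], []
    for _ in range(num_hands):
        luck = rng.gauss(0.0, 8 ** 0.5)  # control variate Y, variance 8
        noise = rng.gauss(0.0, 1.0)      # residual randomness, variance 1
        x = true_value + luck + noise    # raw payoff, variance about 9
        raw.append(x)
        corrected.append(x - luck + expected_luck)
    return raw, corrected


raw, corrected = simulate(20000)
ratio = statistics.variance(raw) / statistics.variance(corrected)
print("raw mean:      ", round(statistics.fmean(raw), 3))
print("corrected mean:", round(statistics.fmean(corrected), 3))
print("variance ratio:", round(ratio, 2))
```

Both means estimate the same true value, but the corrected one does so with far less spread. Subtracting a term with known expectation leaves the estimate unbiased while removing the variance that term contributed.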

5. Evaluation of State-of-the-Art LLMs

The benchmark supports empirical studies of LLMs in HUNL poker under zero-shot conditions without external tool usage. The protocol involves:

Each model (e.g., GPT-5.4 Extra High, GPT-5.3 Extra High/High, Claude Opus 4.6/4.5, Gemini 3.1 Pro/3 Pro/2.5 Pro, Grok 4 High, Kimi K2.5, GPT-4o, GPT-4) plays 5,000 hands against GTO Wizard AI at standard blinds and stack depths.
Each prompt encodes the full game state, legal actions, raise ranges, and action history; the LLM outputs a JSON action (with optional bet amount and freeform reasoning).
AIVAT-adjusted scores (bb/100), raw chips, variance, and detailed corrections are collected.
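
The prompt-and-reply step of this protocol might look roughly like the sketch below. The prompt wording, the state dictionary keys (hole_cards, board, history, legal_actions, raise_range), and the JSON keys (action, amount, reasoning) are assumptions for illustration, since the benchmark's actual schema is not reproduced here. The model call itself is omitted, and a hard-coded reply stands in for it.

```python
import json


def build_prompt(state):
    """Render a (hypothetical) game state as plain text for the model."""
    low, high = state["raise_range"]
    return "\n".join([
        "You are playing heads-up no-limit Texas Hold'em.",
        f"Hole cards: {' '.join(state['hole_cards'])}",
        f"Board: {' '.join(state['board']) or '(preflop)'}",
        f"Action history: {', '.join(state['history']) or '(none)'}",
        f"Legal actions: {', '.join(state['legal_actions'])}",
        f"Raise range: {low} to {high} chips",
        'Reply with JSON: {"action": "...", "amount": <int or null>, '
        '"reasoning": "..."}',
    ])


def parse_action(reply, state):
    """Validate the model's JSON reply against the legal actions."""
    data = json.loads(reply)
    action = data.get("action")
    if action not in state["legal_actions"]:
        raise ValueError(f"illegal action: {action!r}")
    if action != "raise":
        return {"action": action}
    amount = data.get("amount")
    low, high = state["raise_range"]
    if not isinstance(amount, int) or not low <= amount <= high:
        raise ValueError(f"raise amount out of range: {amount!r}")
    return {"action": action, "amount": amount}


state = {
    "hole_cards": ["As", "Kd"],
    "board": ["Qh", "7c", "2s"],
    "history": ["SB raise 250", "BB call", "BB bet 300"],
    "legal_actions": ["fold", "call", "raise"],
    "raise_range": (600, 19750),
}
prompt = build_prompt(state)
reply = '{"action": "raise", "amount": 900, "reasoning": "value"}'
action = parse_action(reply, state)
print(prompt)
print(action)
```

Validating the reply before submitting it keeps malformed or illegal model outputs from being confused with genuine strategic mistakes.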

A sample of top-performing LLMs and trivial baselines is summarized below:

Model	AIVAT bb/100 ± std
GPT-5.3 Extra High	–16.0 ± 3.0
GPT-5.4 Extra High	–17.8 ± 3.7
Claude Opus 4.6	–20.4 ± 8.6
Gemini 3.1 Pro	–30.8 ± 4.5
Trivial Baselines	–64 to –380

These results reveal dramatic progress: GPT-4 scored –136 bb/100, while GPT-5.3 Extra High reaches –16 bb/100, an improvement of 120 bb/100 in 2 years and 9 months. Even so, every LLM still loses to GTO Wizard AI, which by comparison won approximately 19 bb/100 against Slumbot (Provost et al., 24 Mar 2026).
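
When reading the table, it helps to check whether two scores are actually distinguishable. The sketch below computes a simple z-score for the difference between two entries. It assumes the ± values are standard errors of independent, roughly normal estimates, which the table does not confirm, so the output is a rough guide rather than a formal test.

```python
import math

# (AIVAT bb/100, reported ± value) copied from the table above.
scores = {
    "GPT-5.3 Extra High": (-16.0, 3.0),
    "GPT-5.4 Extra High": (-17.8, 3.7),
    "Claude Opus 4.6": (-20.4, 8.6),
    "Gemini 3.1 Pro": (-30.8, 4.5),
}


def z_score(a, b):
    """Difference of two estimates in units of its combined standard error."""
    (mean_a, err_a), (mean_b, err_b) = a, b
    return (mean_a - mean_b) / math.sqrt(err_a ** 2 + err_b ** 2)


best = scores["GPT-5.3 Extra High"]
z_vs_gpt54 = z_score(best, scores["GPT-5.4 Extra High"])
z_vs_gemini = z_score(best, scores["Gemini 3.1 Pro"])
print(f"vs GPT-5.4 Extra High: z = {z_vs_gpt54:.2f}")
print(f"vs Gemini 3.1 Pro:     z = {z_vs_gemini:.2f}")
```

Under these assumptions the gap between the top two models is well within noise (z of about 0.4), whereas the gap to Gemini 3.1 Pro is much clearer (z of about 2.7).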

6. Analysis of Current Model Limitations

Qualitative analysis highlights several model weaknesses. Representation errors, such as misreading card suits or board context, occur at a measurable rate (about 2% suit mismatches) and can trigger compounding strategic errors. Most notably, LLMs do not reason explicitly about hidden state: they do not maintain probability distributions ("ranges") over opponent hands, and so produce largely deterministic, unbalanced play that lacks the stochastic mixing characteristic of Nash Equilibrium strategies.

Furthermore, exploitability remains substantial: actions that correlate too tightly with hand strength (e.g., 3-betting exclusively with the strongest hands) leak information. Current LLMs fail to randomize appropriately and often misjudge the balance between long-term equity and immediate expected value in ambiguous scenarios (Provost et al., 24 Mar 2026).
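
The information-leakage point can be made concrete with a toy Bayesian range update. The sketch below uses three coarse hand classes and invented probabilities, purely for illustration. It compares what an observer learns from a 3-bet when the player 3-bets only premium hands with what they learn when the player mixes.

```python
def update_range(prior, likelihood):
    """Bayes rule: posterior(h) is proportional to prior(h) * P(action | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Invented prior over coarse hand classes.
prior = {"premium": 0.1, "medium": 0.4, "weak": 0.5}

# P(3-bet | hand class) under two hypothetical strategies.
unbalanced = {"premium": 1.0, "medium": 0.0, "weak": 0.0}
mixed = {"premium": 0.8, "medium": 0.1, "weak": 0.08}

after_unbalanced = update_range(prior, unbalanced)
after_mixed = update_range(prior, mixed)
print("unbalanced 3-bet range:", after_unbalanced)
print("mixed 3-bet range:     ", after_mixed)
```

After an unbalanced 3-bet the observer knows the hand class with certainty, whereas the mixed strategy leaves half of the posterior weight on non-premium hands. Maintaining and updating a range in this way is the kind of hidden-state reasoning that the analysis finds missing.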

7. Recommendations and Future Directions

Suggested areas for further research include the integration of explicit belief or range representation, possibly through probabilistic embeddings or prompt engineering, to improve hidden-state reasoning. Extension of the benchmark to measure exploitability via opponent-adaptive evaluation is recommended, as is combining LLM reasoning with fast approximate equilibrium solvers in real time to enable hierarchical planning.

Advancements in control variate proxies and the introduction of online variance reduction methods are proposed to further cut evaluation costs. Expanding the API to multi-player formats is a key future direction, targeting research on coalition dynamics and n-person imperfect-information reasoning within a unified experimental environment (Provost et al., 24 Mar 2026).

References (1)

Provost et al. (2026). GTO Wizard Benchmark.
