Test-Time Search in Imperfect Games
- Test-time search under imperfect information is a framework that refines subgame solutions using real-time optimization and belief state tracking.
- It employs multi-valued leaf evaluations and subgame decomposition to mitigate strategy fusion and develop robust equilibrium strategies.
- Practical implementations achieve low-exploitability, high-performance play in domains such as poker and Stratego, supported by both theoretical guarantees and empirical validation.
Test-time search under imperfect information refers to the application of computational resources at decision time in imperfect-information games to improve strategic play, typically by solving or refining a subgame starting from the player's current belief over hidden variables. Classical search methods from perfect-information games are structurally unsound in imperfect-information domains, because the value of a state depends not only on its local properties but also on the beliefs and potential adaptation of the opponent. Modern approaches address this by (i) rigorously defining the structure of subgames and belief states, (ii) employing robust multi-valued or adversarial value assignments at subgame boundaries, and (iii) re-solving for locally consistent equilibrium strategies using real-time optimization, often via Counterfactual Regret Minimization (CFR) or variants. This article surveys the theoretical underpinnings, algorithmic frameworks, practical instantiations, and empirical results of test-time search in imperfect-information settings.
1. Formal Framework and Fundamental Limitations
In perfect-information games, a search tree can be truncated at a fixed depth, a heuristic value estimate assigned to each leaf state, and minimax or MCTS applied to the resulting finite tree. In imperfect-information (II) games, an agent does not observe a precise state. Instead, it acts at an information set: a collection of histories indistinguishable under its public and private observations. A fixed policy at a given information set may yield different outcomes depending on which history is the true one, and, most critically, the value of a subgame rooted at a public belief state (PBS) depends on both players’ continuation strategies. No single value function can summarize such a subgame unless both agents commit to fixed strategies for the remainder of play (Brown et al., 2018, Brown et al., 2020, Schmid, 2021).
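A minimal sketch of this structure, assuming a hypothetical `PrivateState` encoding and a caller-supplied utility function `u` that already folds in a fixed continuation strategy profile, illustrates why a PBS carries ranges for both players and only admits a value once such a profile is fixed:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

PrivateState = str  # hypothetical encoding of one player's hidden information

@dataclass
class PublicBeliefState:
    public_history: Tuple[str, ...]                 # publicly observed actions/outcomes
    ranges: Tuple[Dict[PrivateState, float], ...]   # one belief/range per player

    def expected_value(self, u: Callable[[PrivateState, PrivateState], float]) -> float:
        """Expected utility for player 1 *given* a fixed continuation strategy
        profile, folded into `u`: u(h1, h2) must already reflect how both
        players would play from here on.  Without such a profile the PBS has
        no single well-defined value.  Assumes two players and ranges
        normalized to belief distributions, for illustration only."""
        return sum(p1 * p2 * u(h1, h2)
                   for h1, p1 in self.ranges[0].items()
                   for h2, p2 in self.ranges[1].items())
```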
Naïve approaches such as perfect-information determinization (sampling one completion of the hidden variables and running perfect-information search) fail due to “strategy fusion”: the search plans different actions for histories within the same information set, even though the agent cannot distinguish them at play time (Arjonilla et al., 5 Aug 2024). This leads to over-optimistic evaluations and highly exploitable strategies.
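A minimal sketch of such a determinization loop, with hypothetical caller-supplied helpers `sample_world` and `perfect_info_value`, makes the failure mode concrete: the perfect-information solver behind the value call is free to plan different continuations in worlds the agent cannot tell apart.

```python
def pimc_action(sample_world, perfect_info_value, legal_actions, n_samples=100):
    """sample_world() -> one completion of the hidden information;
    perfect_info_value(world, action) -> perfect-information value of `action`
    in that world.  Both are caller-supplied, hypothetical helpers."""
    totals = {a: 0.0 for a in legal_actions}
    for _ in range(n_samples):
        world = sample_world()                 # one determinization
        for a in legal_actions:
            # The perfect-information solver behind this call may plan
            # different future moves in worlds the agent cannot distinguish
            # at play time -- the source of strategy fusion and over-optimism.
            totals[a] += perfect_info_value(world, a)
    return max(legal_actions, key=lambda a: totals[a])
```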
2. Test-Time Subgame Decomposition and Belief Tracking
Central to sound test-time search is the consistent and correct definition of the subgame to be solved at each decision point. The canonical formulation defines public states S as the sequences of publicly observable actions and outcomes; the player’s belief over private information is encoded as a range r(S) (i.e., reach probabilities for each compatible private information set under the reference or blueprint strategy) (Solinas et al., 2023, Schmid, 2021, Brown et al., 2020). A subgame to be solved at test time starts from such a (S, r(S)), representing all possible private states consistent with the observations.
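As an illustration (not the exact procedure of the cited works), a range can be sketched as the product of the blueprint's probabilities for the player's own past actions along the public history; `blueprint`, `own_decisions`, and `infoset_of` below are hypothetical interfaces named for clarity.

```python
def compute_range(blueprint, private_states, own_decisions, infoset_of):
    """Reach probability of each private state under the blueprint.

    blueprint:      dict mapping an infoset key to {action: probability}
    private_states: private states consistent with the public state S
    own_decisions:  list of (public_prefix, action) pairs for this player's
                    past actions along S
    infoset_of:     infoset_of(private_state, public_prefix) -> infoset key
    (All four arguments are hypothetical interfaces, for illustration only.)
    """
    reach = {}
    for h in private_states:
        p = 1.0
        for public_prefix, action in own_decisions:
            p *= blueprint.get(infoset_of(h, public_prefix), {}).get(action, 0.0)
        reach[h] = p
    # Normalizing `reach` (together with chance probabilities, omitted here)
    # would give the belief distribution over private states at S.
    return reach
```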
All forward simulations, value updates, and belief-range computations must be grounded in this structure. Crucially, even generating or enumerating the set of histories consistent with S can be intractable (FNP-complete) in dense public games, but is polynomial-time in sparse domains (e.g., small poker) (Solinas et al., 2023). For complex domains (e.g., trick-taking card games), scalable Markov Chain Monte Carlo history samplers enable unbiased estimation over the set of possible world states without explicit enumeration.
| Domain Type | History-Filtering Method | Complexity |
|---|---|---|
| Sparse (Poker) | Enumeration | Polynomial time |
| Dense (Bridge) | MCMC (e.g., TTCG-Gibbs Sampler) | Polynomial time per sample |
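For dense domains, the sketch below gives the flavor of such a sampler under simplifying assumptions (it is not the TTCG-Gibbs sampler itself): hidden cards are swapped between hands and the move is rejected whenever the result violates a public constraint, so the chain mixes over the set of consistent worlds.

```python
import random

def mcmc_consistent_worlds(initial_world, is_consistent, n_samples,
                           burn_in=100, rng=random):
    """initial_world: dict hand_id -> list of hidden cards, already consistent
    with every public observation; is_consistent(world) -> bool re-checks the
    public constraints (e.g. revealed voids).  Assumes at least two hands,
    each non-empty.  Symmetric swap proposals with rejection of inconsistent
    states target the uniform distribution over reachable consistent worlds."""
    world = {k: list(v) for k, v in initial_world.items()}
    hands = list(world.keys())
    samples = []
    for step in range(burn_in + n_samples):
        h1, h2 = rng.sample(hands, 2)
        i, j = rng.randrange(len(world[h1])), rng.randrange(len(world[h2]))
        world[h1][i], world[h2][j] = world[h2][j], world[h1][i]      # propose a swap
        if not is_consistent(world):
            world[h1][i], world[h2][j] = world[h2][j], world[h1][i]  # reject: revert
        if step >= burn_in:
            samples.append({k: tuple(v) for k, v in world.items()})
    return samples
```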
3. Depth-Limited Solving and Multi-Valued State Evaluations
A pivotal advance is the multi-valued or adversarial value assignment at the leaf nodes of the depth-limited subgame. At the depth limit (frontier), instead of guessing a single equilibrium value, the subgame is augmented by allowing the opponent to “choose” (explicitly or implicitly) among a representative set of continuation strategies for the rest of the game. Each such strategy leads to a distinct value at the leaf. The search then computes a robust equilibrium against this adversarial set (multi-valued states) (Brown et al., 2018).
Concretely, each P2 (opponent) leaf infoset $I$ at the frontier of the depth-limited subgame $S$ is replaced by a P2 decision node whose $i$-th action commits P2 to one of $N$ continuation strategies, yielding the leaf value
$$v_i(I) \;=\; u_1\!\left(\sigma_1^{\mathrm{bp}},\, \sigma_2^{i} \,\middle|\, I\right), \qquad i = 1, \dots, N,$$
where $\sigma_2^{1}, \dots, \sigma_2^{N}$ are the continuation strategies and $u_1(\sigma_1^{\mathrm{bp}}, \sigma_2^{i} \mid I)$ is the expected utility when P1 plays the blueprint $\sigma_1^{\mathrm{bp}}$ and P2 plays $\sigma_2^{i}$ from $I$ onward. By increasing $N$, the solution can be made arbitrarily close to that of a “full” re-solve against all possible off-tree opponent strategies. This enforces robustness to opponent adaptation past the depth limit. Failure to provide this adversarial closure leads to severe exploitation, as evidenced by the poor performance of naïve single-value lookahead (Brown et al., 2018).
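A minimal sketch of the leaf augmentation, with hypothetical node classes, shows how the precomputed values plug into an ordinary game-tree solve:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TerminalNode:
    value: float                      # utility for player 1

@dataclass
class OpponentChoiceNode:
    children: List[TerminalNode]      # one per candidate continuation strategy

def attach_multivalued_leaf(leaf_values: List[float]) -> OpponentChoiceNode:
    """leaf_values[i] = expected utility for player 1 if, from this frontier
    infoset onward, player 1 follows the blueprint and player 2 follows the
    i-th continuation strategy.  CFR then treats this as an ordinary P2
    decision, so the solve is robust to P2 picking the worst case for P1."""
    return OpponentChoiceNode(children=[TerminalNode(v) for v in leaf_values])
```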
4. Algorithmic Realizations
The continual resolving architecture operationalizes robust test-time search:
- Blueprint Construction: Compute an approximate equilibrium (“blueprint”) strategy offline using abstractions and MCCFR or similar methods.
- Frontier Value Calculation: For each possible leaf at depth limit, precompute or predict via DNN the set of multi-valued payoffs corresponding to each candidate opponent continuation strategy.
- Subgame Solving at Test-Time:
- At each decision, construct the truncated subgame up to the allowed depth.
- At each frontier infoset, insert a virtual opponent decision selecting among N continuations, each leading to a distinct terminal value.
- Solve the resulting finite subgame (CFR+, MCCFR, or accelerated methods) to equilibrium.
- Play according to the equilibrium strategy at the current information set, resolving again as new public actions are observed.
The structure is summarized in the following table:
| Step | Operation |
|---|---|
| Preprocessing | Blueprint computation (MCCFR); multi-valued frontier value storage |
| At decision | Truncated tree construction; multi-valued leaf attachment |
| Subgame solution | CFR+ / sampled CFR iterations |
| Action selection | Play equilibrium root strategy; re-solve as play advances |
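The following sketch ties these steps together as a single test-time loop; `env` and every callable are hypothetical, caller-supplied interfaces rather than any specific system's API.

```python
def continual_resolving_play(env, build_subgame, attach_multivalued_leaves,
                             solve_cfr, sample_action, depth, cfr_iters):
    """One episode of depth-limited continual resolving.  All arguments are
    hypothetical interfaces mirroring the steps in the table above."""
    belief = env.initial_public_belief_state()
    while not env.terminal():
        # 1. Truncated subgame rooted at the current public belief state.
        subgame = build_subgame(belief, depth)
        # 2. Replace each frontier infoset with an opponent choice over the
        #    precomputed multi-valued continuation payoffs.
        attach_multivalued_leaves(subgame)
        # 3. Re-solve the finite subgame (e.g. CFR+) to an approximate equilibrium.
        strategy = solve_cfr(subgame, iterations=cfr_iters)
        # 4. Act at the current infoset; the environment returns the updated
        #    public belief state once new public observations arrive.
        action = sample_action(strategy, env.current_infoset())
        belief = env.step(action)
    return env.payoff()
```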
Recent approaches generalize this with neural policy/value critics (for parametric RL) (Kubicek et al., 2023) or learned environment models/abstractions for large II domains (Kubíček et al., 6 Oct 2025). All build on the same architectural motif: multi-valued or adversarial margin subgame gadgets at the search frontier.
5. Theoretical Guarantees and Consistency
Soundness of test-time search is analyzed via the concepts of ε-soundness and various levels of consistency (Šustr et al., 2020):
- ε-soundness: The online (test-time) algorithm’s average performance cannot be worse than an ε-Nash equilibrium strategy, regardless of the opponent’s adaptations.
- Strong Global Consistency: Achieved when the strategy produced by test-time search coincides with some single offline equilibrium strategy across all information sets and histories. This property yields finite-run soundness bounds.
- Global Consistency: Guarantees only asymptotic ε-soundness as the number of matches increases.
- Local Consistency: Achieved by some sampling-based methods (e.g., Online Outcome Sampling), but is not sufficient for even asymptotic ε-soundness in II games.
Algorithms such as continual resolving, depth-limited solving with CFR gadgets, and sample-an-iteration CFR policies (as in ReBeL) attain the strong global consistency needed for robust guarantees (Brown et al., 2018, Schmid, 2021, Šustr et al., 2020, Brown et al., 2020). Empirical studies confirm unexploitable, human-level or superhuman play.
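As an illustration of the sample-an-iteration device, the sketch below (with a hypothetical `iteration_policies` interface) picks one CFR iteration uniformly at random and plays its policy rather than the final average strategy, keeping play consistent with a single randomized offline solve.

```python
import random

def sample_iteration_policy(iteration_policies, rng=random):
    """iteration_policies: list whose t-th entry is the policy produced at CFR
    iteration t for the current subgame (e.g. dict infoset -> action probs).
    Sampling an iteration uniformly and playing its policy equals the average
    strategy in expectation, which underlies the safety argument."""
    t = rng.randrange(len(iteration_policies))
    return iteration_policies[t]
```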
6. Empirical Results and Practical Considerations
Exemplary test-time search systems achieving master or superhuman performance include:
- Modicum (HUNL poker): Runs on a 4-core CPU with 16 GB of RAM, defeating Baby Tartanian8 and Slumbot with total resource requirements ≲10³ core-hours (Brown et al., 2018).
- Stratego (Ataraxos): Monte-Carlo evaluation over hidden-state completions with policy/value networks, regularized mirror descent update at test time, robust to OOD opponents, yielding +123 Elo over policy-only baseline, all under 1.3s move latency on a single GPU (Sokota et al., 10 Nov 2025).
- Obscuro (FoW Chess): Layered search with KLUSS subgame construction, PCFR+ on sampled belief sets, and dynamic partial-tree growth, operating at superhuman performance on consumer CPUs, and achieving 80% win rate against the world #1 human (Zhang et al., 2 Jun 2025).
- ReBeL: Approximate equilibrium via belief-CFR, with test-time subgame re-solve (CFR iterations), random iteration-sampling for safe play, defeating top poker bots and professionals (Brown et al., 2020).
- EPIMC: Determinization-based methods with deferred perfect information solving (postponed reasoning), theoretically eliminating strategy fusion, doubling win rates in private-information games over standard PIMC (Arjonilla et al., 5 Aug 2024).
Complexity bottlenecks primarily arise in history filtering and public state enumeration. In sparse domains, explicit enumeration suffices; in dense domains, scalable MCMC sampling is essential (Solinas et al., 2023). The choice between exact and sampling-based subgame belief generation is primarily dictated by the combinatorial structure of the public game tree.
Resource requirements for practical deployment have dropped by orders of magnitude: from supercomputer-scale runs (18TB RAM, millions of core-hours) to commodity hardware (Brown et al., 2018, Sokota et al., 10 Nov 2025, Zhang et al., 2 Jun 2025). Integration with neural networks for value/frontier estimation or policy rollouts is now the standard for large domains.
7. Extensions, Limitations, and Future Directions
Current frameworks for sound, robust test-time search are well-understood for two-player zero-sum games with perfect recall, and can be instantiated using both tabular and neural methods (Brown et al., 2018, Schmid, 2021, Kubíček et al., 6 Oct 2025, Brown et al., 2020). The primary limitations include:
- Extension to multiplayer or general-sum games requires new theoretical developments; current robustifying gadgets rely on minimax structure and monotonicity of counterfactual value constraints.
- Imperfect recall and partial information about the opponent present both computational and theoretical challenges: soundness proofs assume full consistent tracking of public beliefs and ranges.
- History-filtering and subgame belief generation can be intractable in massive or highly entangled domains, but scalable approximation schemes using MCMC or model learning show promise (Solinas et al., 2023, Kubíček et al., 6 Oct 2025).
Open research directions include: scalable belief-state abstractions; robust value-function learning under extreme information asymmetry; and analysable, general-purpose test-time search for multi-agent and non-zero-sum environments (Schmid, 2021, Brown et al., 2020).
Test-time search under imperfect information has undergone a transformation from theoretically problematic determinization to rigorous, adversarially robust, and computationally efficient gadget-based continual resolving. At scale, these methods deliver empirically unexploitable and highly performant play across structurally diverse domains, provided the core requirements—subgame decomposition at public states, robust multi-valued leaf evaluation, safe subgame solving, and tractable belief tracking—are satisfied.