
The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Published 16 Mar 2026 in cs.LG and cs.AI | (2603.15563v1)

Abstract: We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

Summary

  • The paper introduces a benchmark that jointly stresses partial observability, adversarial reasoning, and long-horizon planning using two tracks in Pokémon games.
  • The study employs diverse methodologies including RL, LLM-based approaches, and hybrid systems, revealing that specialized RL and search methods outperform LLMs in competitive scenarios.
  • The research underscores the importance of modular harness design and efficiency metrics, offering critical insights for advancing embodied, long-context AI systems.

The PokeAgent Challenge: A Large-Scale Benchmark for Competitive, Partially Observable, and Long-Horizon AI Reasoning

Motivation and Benchmark Design

The PokeAgent Challenge (2603.15563) addresses a central gap in the current AI ecosystem: the absence of standardized benchmarks that jointly stress partial observability, strong game-theoretic reasoning, and long-horizon planning in realistic environments. Existing benchmarks typically isolate these axes—e.g., games emphasize adversarial reasoning but in short or fully observable settings, while open-ended environments test exploration in the absence of strategic adversaries. Pokémon, with its exponentially large, partially observed state space, dynamic competitive metagames, and visual RPG environments, uniquely combines these challenges.

The challenge introduces two complementary tracks:

  1. Competitive Battling Track: Two-player, zero-sum Pokémon battles on Pokémon Showdown, involving partial observability (hidden information in team construction and movesets), stochastic transitions, and adversarial reasoning.
  2. Speedrunning Track: Long-horizon, sequential decision-making in Pokémon Emerald, evaluated via completion rates and wall-clock time across decomposed game milestones, demanding long-context memory, robust planning, and visual perception.

Both tracks deploy standardized infrastructure, harmonized baselines (RL, LLM, hybrid), and large curated datasets. Importantly, the evaluation is designed to decouple agent architecture and model capability from the harness/scaffold, allowing for controlled comparative experiments, which is especially critical for LLM-based agents in embodied tasks.

Environment Complexity and Evaluation Protocols

The battleground for agent evaluation is both theoretically and practically challenging. The combinatorial size of the Pokémon battle state space is extreme: the paper provides a detailed derivation showing, for example, approximately 10^564 possible states for Gen 9 OU (Figure 1 and state-space tables).

Figure 1: Cumulative usage distributions reveal that a small set of Pokémon/moves dominate competitive play, yet the combinatorial team space remains formidable.
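The scale of the combinatorial team space can be illustrated with a back-of-the-envelope count. All counts below (species pool, move pool, items, abilities) are illustrative assumptions, not the paper's exact derivation, which additionally folds in dynamic battle state to reach ~10^564:

```python
# Hypothetical back-of-the-envelope estimate of the combinatorial team space.
# All counts below are illustrative assumptions, not the paper's derivation.
from math import comb, log10

N_POKEMON = 700      # assumed species pool
N_MOVES = 60         # assumed learnable moves per species
N_ITEMS = 50         # assumed held items
N_ABILITIES = 3      # assumed abilities per species
TEAM_SIZE = 6

# Configurations for one Pokemon: 4 moves out of N_MOVES, one item, one ability.
per_mon = comb(N_MOVES, 4) * N_ITEMS * N_ABILITIES
# Teams: choose 6 of N_POKEMON species, configure each independently.
teams = comb(N_POKEMON, TEAM_SIZE) * per_mon ** TEAM_SIZE

print(f"~10^{log10(teams):.0f} distinct teams")
```

Even with these conservative placeholder numbers, a single player's team space alone is astronomically large; the full battle state space (both teams plus hidden and dynamic battle state) is larger still.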

Agents are evaluated on reproducible, experiment-tracked leaderboards. In the Battling Track, a public leaderboard tracks Full-History Bradley-Terry (FH-BT) skill ratings, Glicko-1, and GXE (expected win rate versus random opponents) on a private agent-only Showdown server, supporting multiple competitive formats (Gen 1 OU, Gen 9 OU) as well as time-constrained and 'extended timer' variants for LLMs (Figure 2).

Figure 2: RL baselines approach strong human performance; LLMs lag except when scaffolded with specialized harnesses.
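A Bradley-Terry model of the kind underlying FH-BT can be fit from pairwise win counts with a standard minorize-maximize loop. The sketch below is generic, not the leaderboard's implementation (which works over the full match history):

```python
# Generic Bradley-Terry fit via a minorize-maximize (MM) loop.
# Illustrative sketch only; not the PokeAgent leaderboard's FH-BT code.
import numpy as np

def fit_bradley_terry(wins, n_iter=500):
    """wins[i, j] = number of times agent i beat agent j."""
    n = wins.shape[0]
    p = np.ones(n)                       # skills: P(i beats j) = p_i / (p_i + p_j)
    games = wins + wins.T                # total games played per pair
    total_wins = wins.sum(axis=1)
    for _ in range(n_iter):
        new_p = np.empty(n)
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            new_p[i] = total_wins[i] / denom if denom > 0 else p[i]
        p = new_p / new_p.sum()          # normalize: identifiable only up to scale
    return p
```

Fitting this to a small round-robin (e.g., three agents with one-sided head-to-head records) recovers the expected skill ordering.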

For the Speedrunning Track, agents are evaluated on progression through 15 milestones, wall-clock time, and sample efficiency (number of agent steps), formalized as an episodic MDP. The environment provides only limited, noisy visual state, requiring agents to handle perception and partial observability much like human players do (Figure 3).

Figure 3: Milestone-based breakdown for Pokémon Emerald speedrunning, emphasizing the nonlinear exploration dependencies.
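A minimal scorer for the milestone/wall-clock/step protocol might look like the following; the milestone labels and method names are hypothetical placeholders, not the released harness's API:

```python
# Sketch of a milestone-based speedrun scorer. Names are assumptions for
# illustration; the real evaluation harness's interface may differ.
import time
from dataclasses import dataclass, field

MILESTONES = [f"milestone_{i}" for i in range(1, 16)]   # 15 checkpoints (assumed labels)

@dataclass
class SpeedrunScore:
    reached: list = field(default_factory=list)
    steps: int = 0
    start: float = field(default_factory=time.monotonic)

    def record_step(self, milestone_hit=None):
        """Count one agent step; optionally mark a newly reached milestone."""
        self.steps += 1
        if milestone_hit and milestone_hit not in self.reached:
            self.reached.append(milestone_hit)

    def summary(self):
        return {
            "milestones": len(self.reached),
            "completion": len(self.reached) / len(MILESTONES),
            "agent_steps": self.steps,
            "wall_clock_s": time.monotonic() - self.start,
        }
```

Tracking steps separately from wall-clock time is what makes the sample-efficiency versus latency tradeoff (discussed for the competition results) measurable at all.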

Baseline Architectures and Agent Capabilities

Competitive Battling Track:

Baselines span heuristic bots, RL agents (Metamon), and LLM-based agents with search harnesses (PokeChamp). The RL baselines (e.g., Transformers trained via large-scale offline RL) leverage >22M battle trajectories (4M human, 18M synthetic self-play) and reach skill levels near those of top human players. LLM-based agents employ structured text representations of the game state, domain-knowledge enrichment (usage statistics, opponent prediction), and protocolized depth-limited minimax search.
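The depth-limited minimax such harnesses protocolize can be sketched generically. Here `legal_moves`, `apply`, and `evaluate` are assumed callbacks supplied by a battle simulator; a real Pokémon search must additionally handle simultaneous moves and chance nodes:

```python
# Generic depth-limited minimax sketch of the kind an LLM harness might wrap
# around a battle simulator. Callbacks are assumed, not a real engine's API.
def minimax(state, depth, maximizing, legal_moves, apply, evaluate):
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state), None      # leaf: score the position heuristically
    best_move = None
    best = float("-inf") if maximizing else float("inf")
    for move in moves:
        value, _ = minimax(apply(state, move), depth - 1, not maximizing,
                           legal_moves, apply, evaluate)
        if (maximizing and value > best) or (not maximizing and value < best):
            best, best_move = value, move
    return best, best_move
```

In an LLM harness, the heuristic `evaluate` is where model-generated judgments (e.g., matchup assessments) typically enter the search.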

Advanced search-based systems (e.g., Foul Play) integrate root-parallelized MCTS with custom battle engines, employing innovations like "damage roll grouping" to achieve tractable lookahead in the face of combinatorial complexity (Figure 4).

Figure 4: Foul Play's MCTS architecture with damage roll clustering enables efficient deep search.
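The idea behind damage roll grouping can be illustrated as follows: Pokémon damage is scaled by one of 16 discrete random multipliers (0.85 to 1.00), and rolls that leave the defender in the same coarse HP bucket can share a single chance-node branch. The bucketing scheme below is an assumption for illustration, not Foul Play's actual grouping rule:

```python
# Illustrative "damage roll grouping": collapse the 16 canonical damage
# multipliers into coarse HP-outcome buckets to shrink chance-node branching.
# The bucketing rule here is a hypothetical stand-in for Foul Play's.
from collections import defaultdict

def grouped_damage_outcomes(base_damage, defender_hp, n_buckets=4):
    rolls = [0.85 + 0.01 * i for i in range(16)]        # the 16 canonical multipliers
    groups = defaultdict(float)
    for r in rolls:
        remaining = max(defender_hp - int(base_damage * r), 0)
        # Bucket by fraction of max HP remaining; a KO is its own bucket 0.
        bucket = 0 if remaining == 0 else 1 + min(
            n_buckets - 1, int(n_buckets * remaining / defender_hp))
        groups[bucket] += 1 / 16                         # probability mass per bucket
    return dict(groups)
```

When all 16 rolls land in the same HP bucket (the common case), search expands one chance branch instead of sixteen, a 16x reduction at that node.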

Figure 5: Set prediction pipeline leverages observed battle signals to infer hidden opponent configuration.

Speedrunning Track:

RL agents, VLM/LLM harnesses, and hybrid methods are compared under common instrumentation and evaluation. Baselines include:

  • A multi-agent orchestration system coordinating subagents for context management, pathfinding (A*), memory, battle reasoning, and gym-specific puzzles
  • LLM and VLM backends for perception and tool invocation
  • Harnesses for external tool use, memory compaction, and reflection.
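The pathfinding subagent's role can be sketched with a textbook A* over a walkable-tile grid; this is an illustrative stand-in, not the released orchestration system's API:

```python
# Minimal A* grid pathfinder of the kind a pathfinding subagent would use.
# Illustrative sketch; the open-source harness's interface will differ.
import heapq

def astar(grid, start, goal):
    """grid: set of walkable (x, y) cells; returns a list of cells or None."""
    def h(a, b):                          # Manhattan-distance heuristic
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    frontier = [(h(start, goal), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dx, node[1] + dy)
            if nxt in grid and cost + 1 < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = cost + 1
                heapq.heappush(frontier,
                               (cost + 1 + h(nxt, goal), cost + 1, nxt, path + [nxt]))
    return None                           # goal unreachable
```

Delegating navigation to a deterministic planner like this frees the LLM/VLM layer to decide *where* to go rather than *how* to get there step by step.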

Rigorous evaluation reveals that, absent sophisticated harnesses, even frontier VLMs (including GPT-5, Gemini variants, and Claude) fail to achieve nontrivial progression (Figure 6).

Figure 6: Model families' wall-clock completion times and cost for baseline harnesses; Gemini 3 Flash is fastest, but cost and sample efficiency vary.

Empirical Results and Notable Algorithms

Battling Track

Competition and baseline results demonstrate:

  • RL and search-based agents decisively outperform LLMs in partial-information, adversarial settings. RL agents trained via large-scale offline data and fine-tuning (e.g., PA-Agent) and search agents (Foul Play; root-parallel MCTS) consistently secured top leaderboard positions and tournament wins (Figure 7).

    Figure 7: Organizer and participant ratings, showing clear stratification between specialist and generalist methods.

    Figure 8: Tournament standings reveal the edge of specialist approaches in both formats.

  • Key architectural innovations included iterative offline RL with dynamic data weighting, curriculum-based two-phase training (Team Q), and tailored optimizer/activation-function pairings (4thLesson's Kron optimizer + AID for plasticity; Figure 9).

    Figure 9: The Kron optimizer plus AID mitigates plasticity loss/instabilities in RL training.

Speedrunning Track

  • Pure RL remains sample-inefficient for true long-horizon RPG play, but hybrid methods unlock scalability. The leading method, Scripted Policy Distillation (SPD), synthesizes LLM-generated subgoals and policies as exploration priors, then distills them into neural networks refined with RL, yielding the only near-human run (40:13; Figure 10).

    Figure 10: SPD pipeline—LLM task decomposition, scripting, and policy distillation via expert action supervision.

  • Harness design is paramount: Model performance is bottlenecked by harness sophistication—CLI-coding agent scaffolds are unable to maintain task coherence over thousands of steps, whereas specialized modular harnesses (as in PokeAgent) enable successful runs.
  • Action sample efficiency versus wall-clock tradeoff: The Deepest agent completed with the fewest steps (most sample-efficient), but slower reasoning increased overall wall time (Figure 8).

    Figure 8: Efficiency-frontier; Deepest achieves action-minimal run.
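The SPD pipeline described above can be sketched as a loop over LLM-proposed subgoals. Every function below is a hypothetical placeholder for the corresponding stage (script generation, rollout collection, imitation training), with the final RL refinement omitted:

```python
# Schematic of the Scripted Policy Distillation (SPD) loop. All callbacks are
# hypothetical placeholders; the winning entry's actual pipeline is richer
# (LLM-written scripts, then RL refinement of the distilled policy).
def scripted_policy_distillation(subgoals, write_script, run_script, train_policy):
    """
    subgoals:      ordered task decomposition proposed by an LLM
    write_script:  subgoal -> deterministic script (exploration prior)
    run_script:    script -> list of (observation, expert_action) pairs
    train_policy:  dataset -> fast neural policy (imitation learning)
    """
    dataset = []
    for goal in subgoals:
        script = write_script(goal)
        dataset.extend(run_script(script))   # collect expert supervision
    return train_policy(dataset)             # distill all scripts into one policy
```

The design rationale is latency: scripts and LLM calls are slow at runtime, but a policy distilled from their traces executes at emulator speed.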

Failure Modes and Benchmark Orthogonality

PokeAgent exposes failure modes rarely observed in standard LLM evaluation: "panic behavior," memory corruption cascades, computational paralysis (recursive reasoning loops blocking timely action), and strategic overcommitment. Chain-of-thought (CoT) visualizations synchronize agent reasoning with game state and uncover LLM pathologies (Figure 11).

Figure 11: Qwen model entering a computational paralysis loop during competitive play, evident in reasoning traces.

Critically, empirical BenchPress matrix analysis confirms that Pokémon battling skill is nearly orthogonal to performance on established QA/math/coding benchmarks: no combination of standard LLM benchmarks reliably predicts Battling GXE scores (max Spearman ρ ≈ 0.77, mean |ρ| = 0.45, and a rank-2 SVD structure explains less than 70% of the observed GXE variance).
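This kind of orthogonality check can be reproduced in outline with rank correlations and a truncated SVD. The sketch below operates on an arbitrary model-by-benchmark score matrix and is not the paper's BenchPress code; the simple rank transform also ignores ties:

```python
# Generic sketch of the orthogonality analysis: Spearman rank correlation
# between benchmark scores and a target metric (e.g., GXE), plus variance
# explained by a rank-2 SVD of the score matrix. Not the paper's code.
import numpy as np

def spearman(x, y):
    # Rank-transform (no tie handling), then take Pearson correlation of ranks.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

def rank2_variance_explained(score_matrix):
    # Fraction of (centered) variance captured by the top two singular values.
    centered = score_matrix - score_matrix.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return (s[:2] ** 2).sum() / (s ** 2).sum()
```

Applied to a models-by-benchmarks score matrix with a GXE column, low |ρ| values and a low rank-2 variance share for GXE would mirror the paper's orthogonality finding.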

Practical and Theoretical Implications

Practical:

  • The large, open-source datasets (20M+ battles, 200K+ teams), modular harnesses, and multi-agent orchestration systems directly support future research in competitive, partially observed, and embodied AI.
  • Standardization of evaluation protocols and task decomposition is poised to drive reproducible, comparable progress across LLM, RL, and hybrid agents.
  • Action and wall-clock efficiency measurements are relevant for deploying real-time and time-critical AI agents.

Theoretical:

  • The near-orthogonality of Pokémon performance to canonical NLP and contextual reasoning benchmarks demonstrates the inadequacy of current model benchmarks for measuring strategic sequential reasoning with latent state and adversarial response.
  • The evidence supports the "LLMs as prior, RL as refinement" paradigm for embodied, long-context tasks.
  • The necessity and impact of harnesses—versus raw model capability—on embodied OOD environments offers insight for future agentic evaluation and safety discussions.

Future Directions

Open challenges highlighted include:

  • VLM-SLAM: Robust spatial consistency and action localization demand integration of vision-LM outputs with SLAM-like perceptual frameworks, a major bottleneck for RPG/ALE-style environments.
  • LLM–RL performance gap: Designing hybrid or improved LLM architectures to approach RL sample efficiency in partial-observation, adversarial domains is a priority.
  • Open-source full-game completion: Achieving human-level speed and robustness in long-context, open-source models would democratize research and evaluation.
  • Planning efficiency: Sample efficiency, navigation, and obstacle avoidance are under-explored in current agent designs.

Conclusion

The PokeAgent Challenge establishes a scalable, actively maintained benchmark that exposes fundamental weaknesses and research opportunities for both RL and frontier LLM-based agents on challenging, high-dimensional sequential decision-making tasks. Empirically, specialist RL and search methods continue to dominate in competitive, partially observed domains, while LLMs play an essential role as priors and for task decomposition. The benchmark's living deployment ensures continued relevance and supports rigorous, reproducible advancement in long-context agentic AI research.

(2603.15563)


Explain it Like I'm 14

A simple guide to “The PokeAgent Challenge: Competitive and Long-Context Learning at Scale”

1) What is this paper about?

This paper introduces the PokeAgent Challenge, a big, fair test (a “benchmark”) for AI systems using Pokémon games. It has two parts:

  • Competitive Battling: two-player Pokémon battles (like on Pokémon Showdown).
  • Speedrunning: finishing a Pokémon Emerald storyline segment as fast as possible.

The goal is to measure and improve how AI makes decisions when:

  • it can’t see everything (hidden information),
  • it faces smart opponents,
  • and it must plan far ahead over thousands of steps.

2) What did the researchers want to learn?

In friendly, everyday terms, they wanted to answer:

  • Can AI think strategically when it doesn’t know everything, like in a card game where the opponent’s hand is hidden?
  • Can AI plan well over a long time, like planning a whole road trip with many stops, detours, and changing conditions?
  • Do general-purpose AIs (large language models, or LLMs) actually play well, or do specialist methods (like reinforcement learning, RL) work better?
  • Can we build a standard, fair way to compare different AI systems on the same tasks so results are meaningful?

3) How did they study it?

They built two standardized tracks with tools, datasets, and leaderboards so anyone can test their AI fairly.

  • Competitive Battling (Pokémon Showdown):
    • What it’s like: A turn-based, two-player game where you choose moves without seeing everything about the opponent’s team at first. That’s called “partial observability.” Think of it like chess where some pieces are hidden until revealed.
    • What they provided: Over 20 million recorded battle examples (so AIs can learn), 200,000+ realistic team setups, and strong example AIs (baselines) using different methods:
    • RL (reinforcement learning): AIs learn by trying actions and getting feedback—like practicing a sport and improving with coaching.
    • LLM-based agents: An LLM “reasons” about the game using a support system (a “harness”) that turns the game state into text and lets the model analyze options.
    • How they scored AI: They ran AI-vs-AI matches on a dedicated server and used stable rating systems to rank skill (similar to Elo, but tuned for AI-only play).
  • Speedrunning (Pokémon Emerald, early-game route):
    • What it’s like: The AI only sees the game screen (plus a tiny bit of status info) and must navigate, battle, and manage items to reach key “milestones” as quickly as possible. This needs good vision, memory, and planning over hours.
    • What they provided: A fixed-speed emulator, a list of milestones (like checkpoints), and a timing/step-count system. They also released the first open-source “multi-agent orchestration” harness for long tasks. Think of it as a smart control room that:
    • keeps the long-term plan,
    • calls specialized helper bots for battles, pathfinding, and puzzles,
    • and compresses history so the AI doesn’t forget what happened hours ago.
    • Why a “harness” matters: A harness is like giving the AI a map, a to-do list, and tools (e.g., a pathfinder) so it doesn’t get lost. Without this support, even very advanced models wander or get stuck.

They validated everything by running a NeurIPS 2025 competition with 100+ teams across both tracks, then set up permanent leaderboards at https://pokeagentchallenge.com.

4) What did they find, and why is it important?

Here are the main takeaways, in simple terms:

  • Specialist methods beat generalist models right now.
    • In battles, RL and search-based agents (like advanced planning bots) performed better than LLMs that “think in words.”
    • In speedrunning, the winner combined LLMs and RL: an LLM first wrote simple “scripts” for sub-tasks, then RL distilled these into fast, reliable policies. This hybrid was far faster than pure LLM setups.
  • A harness is essential for long tasks.
    • Raw LLMs (even very powerful ones) made almost no progress in speedrunning without a good harness. They forgot plans, repeated mistakes, or got stuck in menus.
    • With the right harness, the same models improved a lot—but still lagged humans and RL-heavy approaches.
  • Pokémon tests skills other benchmarks miss.
    • The paper shows that being good on typical LLM tests (like trivia or coding problems) does not predict being good at Pokémon battling.
    • That means Pokémon battles measure different abilities—like recovering after mistakes, planning under pressure, and reasoning when parts of the world are hidden.
  • Big, high-quality resources are now public.
    • 20M+ battle examples, 200K+ team sets, strong baseline agents, servers, and evaluation tools are all released. This helps the community make fair, reproducible progress.

Why this matters: Real-world decision-making often involves hidden information, changing opponents, and long-term plans (think cybersecurity, logistics, or robotics). A benchmark that brings all three together can push AI toward skills that transfer beyond games.

5) What’s the impact, and what’s next?

This challenge turns Pokémon into a “living benchmark”—a shared, evolving place to test AI systems fairly and repeatedly. It already drew strong interest and produced new ideas (like script-to-RL distillation). The authors highlight four open problems that point to future progress:

  • Better in-game “sense of place” for vision models.
    • Today’s agents often struggle to know exactly where they are and how far actions move them. This is like trying to navigate a city with a blurry map.
  • Closing the LLM–RL gap in battles.
    • Can we combine LLMs’ knowledge with RL’s precision so generalist models play as strategically as specialist ones?
  • Full-game runs with open-source models.
    • Frontier (proprietary) models have finished full games with heavy support; doing this with open models would make research more accessible.
  • Approaching human speedrunning times.
    • The best AI is still much slower than skilled humans. Getting faster requires better navigation, route planning, and fewer wasted steps—useful skills for any time-critical task.

In short: The PokeAgent Challenge gives AI researchers a clear, fair way to test and improve decision-making under pressure and over long stretches of time. It’s fun (it’s Pokémon!), but it also pushes AI toward abilities that matter in the real world.

Knowledge Gaps

Unresolved Knowledge Gaps, Limitations, and Open Questions

The following concrete gaps and open problems emerge from the paper and can guide targeted follow‑up research:

  • Benchmark scope in RPG speedrunning is limited to early-game (first gym in Emerald); it remains unclear how methods scale to mid/late‑game puzzles (HMs, dungeons, long multi-battle segments), full-game completion, and cross‑title generalization (e.g., Red/Blue/Crystal/Emerald, remakes, ROM hacks).
  • Perception settings are not standardized across difficulty tiers: the speedrunning track provides visual frames plus privileged state (party/HP), while the battling track uses structured symbolic state. A pure vision-only setting (for both tracks) and a strictly “no privileged state” variant are not provided or evaluated, leaving perception and representation learning effects underexplored.
  • Compute and latency fairness are not fully controlled: Extended Timer enables arbitrarily long deliberation; API costs, token budgets, and inference latencies vary widely. There is no budgeted track (e.g., fixed think-time, token count, or FLOP limits) or compute-tier leaderboards to disentangle algorithmic merit from compute.
  • Harness-versus-model attribution remains only partially disentangled despite the S×T×M×F×Φ (state, tools, memory, feedback, fine-tuning) framework; a factorial evaluation with fixed harnesses across multiple models and fixed models across multiple harnesses, under identical compute budgets, is not reported.
  • Data quality for reconstructed private-information trajectories (4M “from each player’s perspective”) is not validated with error bounds; the accuracy of hidden-state inference from spectator logs, bias in reconstruction, and its downstream impact on training/offline RL are unquantified.
  • Synthetic self-play (18M trajectories) and curated team datasets may encode organizer-baseline biases; generalization to adversaries and teams outside these distributions is not rigorously measured with held-out, adversarially generated, or human-curated OOD test sets.
  • Team-building is central to competitive Pokémon but is not benchmarked as a standalone task; there is no standardized evaluation for team construction, opponent team inference, or joint team+policy optimization.
  • Robustness to metagame shifts and ruleset changes is not systematically tested (e.g., time-split evals, “patch/week” generalization); adaptation to evolving opponent distributions and tier rebalancing remains unmeasured.
  • The new Full-History Bradley–Terry (FH‑BT) rating is motivated but not deeply analyzed: sensitivity to opponent-pool composition, match sampling, and non-stationarity (live leaderboard drift) is not characterized; required match counts for stable rankings and confidence calibration are unspecified.
  • Rare-event stochasticity (crits, secondary effects) and variance under different RNG seeds are not benchmarked; no protocols quantify agent stability, risk sensitivity, or robustness to tail events across repeated seeded evaluations.
  • Opponent modeling and belief-state tracking under partial observability are not directly evaluated; there is no auxiliary benchmark for belief calibration (e.g., ability/sets/posterior inference), nor metrics like Brier scores for outcome/win-prob prediction.
  • Failure mode taxonomies (panic behavior, goal oscillation, memory corruption) are qualitative; there is no standardized stress-test suite to elicit and quantify these failure patterns across models and harnesses.
  • Search‑time scaling laws in battling (benefit of additional lookahead or rollouts) are not measured across formats/timers; the trade-off between compute and rating gains is unquantified.
  • Memory and localization in long-context RPG play are identified as bottlenecks (VLM‑SLAM), but no dedicated sub-benchmarks exist to isolate map-building, place recognition, action-distance estimation, or objective verification under perceptual noise.
  • The speedrunning environment lacks standardized ablations that isolate subskills (e.g., pure navigation tasks, dialogue/menu parsing micro-tasks, deterministic vs stochastic battle modules) to diagnose where agents fail.
  • Generalization across game versions and formats (e.g., transfer from Gen 1 OU to Gen 9 OU, or Emerald to other games) is not evaluated; cross-domain transfer benchmarks and protocols are missing.
  • Leaderboard overfitting risks are unaddressed: there are no secret test pools, periodic test-set refreshes, or anti-adversarial measures to deter tailoring agents to organizer baselines and fixed opponent pools.
  • Frontier API reproducibility is fragile (version drift, rate limits, context window differences); standardized “frozen” model snapshots and/or reproducible open-source surrogates are not provided for long-term comparability.
  • Human comparators are limited: beyond ladder ratings and a small set of speedrun references, there’s no controlled human-vs-agent experimental design quantifying human learning/adaptation against agents or human supervision effects (e.g., coaching, preference feedback).
  • Accessibility and resource equity are not enforced; training on tens of millions of trajectories and the advantage of low-latency distillations create barriers for low-compute labs; a low‑resource track or fixed‑data regime is not offered.
  • Privacy/licensing implications of replay-derived datasets are not discussed; consent, TOS compliance, and any redaction policies for sensitive metadata are unclear.
  • Battle evaluation uses symbolic Showdown mechanics; perception-to-action agents in a visual battle UI (to align with RPG perception demands) are not benchmarked, obscuring the cost of perception in competitive settings.
  • Calibration of value estimates and action-confidence in battling is not scored; probabilistic accuracy metrics (e.g., calibration curves for win-rate predictions) are missing, hindering apples-to-apples comparison of RL vs LLM vs search.
  • Training sample efficiency and compute-to-performance curves (in both tracks) are not systematically reported; standardized protocols for measuring improvement per environment step, token, or GPU-hour are absent.
  • Full-game completion with open-source models remains an explicit open challenge, but the benchmark does not yet provide a scaffolding, milestones, or resource estimates for end-to-end open-source attempts.
  • Safety and robustness to adversarial harness inputs (e.g., malformed observations, emulator glitches, tool failures) are not tested; failure-handling and recovery benchmarks are missing.

Practical Applications

Immediate Applications

The paper’s resources and methods can be deployed today to improve how organizations build, test, and compare decision-making agents and long-horizon systems.

  • Agent performance benchmarking and QA (software, AI platforms)
    • Use the live Battling leaderboard, FH-BT skill ratings, and standardized Showdown server to run A/B tests, regression checks, and release gates for agent updates.
    • Apply the real-time Speedrunning evaluation to measure latency–performance tradeoffs for long-horizon agents where the world does not pause while the agent reasons.
    • Potential tools/products: “FH-BT-as-a-Service” for competitive agent ranking; CI plugins that run nightly matches or route segments; dashboards tracking GXE/FH-BT vs inference cost.
    • Dependencies/assumptions: Access to compute for match volumes; integration of agents with the Showdown or Emerald harness APIs.
  • Harness × model evaluation for agent design (software, MLOps, academia)
    • Adopt the S×T×M×F×Φ framework to disentangle model capability from scaffolding in complex agents (state, tools, memory, feedback, finetuning).
    • Potential workflows: “Harness profiler” that audits agent stacks, comparable across teams/models; standardized ablation templates for research reports.
    • Dependencies/assumptions: Ability to expose harness components modularly; logging standards for reproducibility.
  • Offline RL experimentation with large-scale demonstrations (academia, industry R&D)
    • Leverage 4M reconstructed human trajectories and 18M self-play battles to study offline RL, partial observability, and dataset curation (e.g., dynamic data weighting).
    • Potential products: Starter kits with preprocessed trajectories; curriculum-learning splits; reproducible training pipelines and checkpoints.
    • Dependencies/assumptions: GPU time for training; acceptance of game domain as a proxy for real-world sequential decision-making.
  • Multi-agent orchestration for long-horizon tasks (software automation, enterprise RPA, education)
    • Reuse the open-source orchestration system (sub-agents, A* planning, objective verification, context compaction) to build tool-using agents for UI navigation, multi-step workflows, and game-based education labs.
    • Potential products: “Long-horizon agent SDK” with planners, memory, and verifiers; plug-ins for A* over GUI graphs or web DOMs; curriculum routes with milestones.
    • Dependencies/assumptions: Domain-specific adapters (e.g., GUI graph extraction); careful tool permissions and sandboxing.
  • Scripted Policy Distillation (SPD) to accelerate agents (software automation, operations)
    • Apply the winning approach—LLM decomposes tasks into scripts, then distill into fast neural policies—to compile slow prompt-chains into fast runtime controllers for repetitive workflows (e.g., form processing, multi-step ETL, UI macros).
    • Potential products: “LLM→Policy compiler” that emits executable policies plus a verification harness; distillation pipelines integrated with RL fine-tuning.
    • Dependencies/assumptions: Tasks decomposable into stable sub-policies; availability of simulators or safe sandboxes for imitation/RL.
  • Competitive evaluation for market-facing agents (adtech/auctions, games, security)
    • Use FH-BT and tournament play to evaluate bidding or game bots against fixed pools of baselines; schedule round-robin stress tests under partial observability.
    • Potential products: “Competitive agent ladder” for internal agents; anomaly detection for sudden rating drops; tournament organizers for vendor bake-offs.
    • Dependencies/assumptions: Domain simulators with comparable rules; careful mapping of reward structure to business KPIs.
  • Agent failure-mode diagnostics beyond standard LLM benchmarks (safety, reliability engineering)
    • Employ the battling and speedrunning tracks to surface “panic,” goal oscillation, and memory-corruption failure modes that do not appear in static QA tests.
    • Potential workflows: Pre-deployment stress-tests; postmortem templates tied to observed failure classes; guardrail policies (e.g., reset, mode switch) when panic patterns hit.
    • Dependencies/assumptions: Engineering time to wire telemetry; acceptance of game-derived failure taxonomies as predictors of real-world brittleness.
  • Game and esports operations (gaming industry)
    • Use datasets and bots to test balance changes, detect degenerate strategies, and run fairness checks on metagame shifts.
    • Potential products: “Metagame simulator” for balance teams; automated scrimmage systems; anti-exploit regression suites.
    • Dependencies/assumptions: Legal/IP coordination; internal simulator integrations.
  • Teaching and skills development (education)
    • Integrate the benchmark into coursework on RL, game theory, and agent systems with plug-and-play baselines, leaderboards, and reproducible labs.
    • Potential products: Course modules; capstone competitions; grading rubrics aligned to FH-BT/GXE/milestones.
    • Dependencies/assumptions: Instructor familiarity with RL/LLM agents; classroom compute or cloud credits.
  • Procurement and model selection due diligence (policy, enterprise IT)
    • Add PokéAgent tasks to vendor evaluation to test long-horizon, adversarial, partially observed reasoning that is orthogonal to standard LLM benchmarks.
    • Potential workflows: “Orthogonal capability panel” alongside MMLU/code/math; scorecards separating model vs harness; inference-latency budget checks.
    • Dependencies/assumptions: Willingness to include game-based evals in RFPs; governance for handling model logs and costs.

Long-Term Applications

These opportunities require additional research, domain adaptation, or integration to meet safety, regulatory, and performance thresholds.

  • Clinical decision support under uncertainty (healthcare)
    • Map partial observability and adversarial reasoning to differential diagnosis, triage, and multi-visit care planning; use SPD to compile guideline-aware, fast policies for routine pathways.
    • Potential products: “Care-path compilers” that distill clinician-authored protocols into verifiable policies; longitudinal memory modules for patient histories.
    • Dependencies/assumptions: Regulatory approvals, robust clinical datasets, bias/safety audits, human-in-the-loop oversight.
  • Hierarchical autonomy with VLM-SLAM (robotics, logistics, home/warehouse)
    • Translate the VLM-SLAM challenge into embodied localization, action-distance estimation, and objective verification; reuse orchestrator + sub-agents for navigation, manipulation, and recovery behaviors.
    • Potential products: Long-horizon task planners with grounded maps; perception–memory bridges; failure-aware controllers that avoid panic cascades.
    • Dependencies/assumptions: Reliable perception stacks, real-world simulators/digital twins, safety validation, sim2real transfer.
  • Multi-agent planning in dynamic markets (finance, energy)
    • Use RL + search + self-play to train bidding/hedging/scheduling agents under partial information and adversarial counterparts; evaluate strategies with FH-BT-like ranking over fixed agent pools.
    • Potential products: “Market ladder” simulators for strategy testing; policy distillation for low-latency execution; risk-constrained training with adversarial scenarios.
    • Dependencies/assumptions: High-fidelity market sims, risk/compliance guardrails, robust backtesting to avoid overfitting.
  • Cybersecurity red–blue tournaments (security)
    • Adapt competitive evaluation to attacker–defender simulations with hidden information and evolving tactics; track capability growth with FH-BT ratings and orthogonal benchmarks.
    • Potential products: Continuous attack lab with agent ladders; SPD to compile playbooks into fast detection/response policies.
    • Dependencies/assumptions: Secure sandboxes, curated threat models, oversight to prevent unsafe generalization.
  • Autonomous driving and traffic coordination (mobility)
    • Model multi-agent, partially observed interactions (merging, negotiation) with RL + search; deploy hierarchies that separate high-level route planners from fast, distilled low-level policies.
    • Potential products: Scenario ladders for behavior validation; latency-aware controllers for edge compute; failure-mode monitors to prevent panic-like error cascades.
    • Dependencies/assumptions: Certified simulators, regulatory testing, explainability and safety cases.
  • Grid and plant operations optimization (energy/manufacturing)
    • Apply long-horizon planning and milestone frameworks to scheduling, maintenance, and contingency management with real-time latency budgets.
    • Potential products: Orchestration layers coordinating sub-policies for different subsystems; verifiers for objective completion and safety constraints.
    • Dependencies/assumptions: High-quality telemetry, digital twins, operator oversight, fail-safes.
  • Standardized agent evaluation governance (policy, standards bodies)
    • Establish benchmarks that separate harness and model effects, track inference cost, and include adversarial/partially observed tasks to complement existing suites.
    • Potential products: Open standards for long-horizon agent eval; certification programs; public leaderboards for safety-relevant capabilities.
    • Dependencies/assumptions: Multi-stakeholder consensus; sustained funding for “living benchmarks.”
  • Edge deployment via policy distillation (IoT, on-device AI)
    • Use SPD to convert heavy LLM workflows into compact, robust controllers for constrained hardware (e.g., AR assistants, inspection drones).
    • Potential products: Toolchains that auto-generate, verify, and continuously improve on-device policies from scripted behavior.
    • Dependencies/assumptions: Reliable verification/oracles, continual learning without drift, safe update mechanisms.
  • Cross-domain benchmark kits (transport, operations, education)
    • Port the living-benchmark template—datasets, leaderboards, FH-BT ranking, orchestration stack—to other simulators (e.g., rail networks, warehouses) to catalyze reproducible progress.
    • Potential products: “Benchmark-in-a-box” for new domains; reusable rating servers; standardized baselines (LLM/RL/hybrids).
    • Dependencies/assumptions: Domain owners providing simulators/data; IP considerations; community maintenance.
  • Robustness stress-tests for generalist models (enterprise AI safety)
    • Institutionalize PokéAgent-style adversarial, long-context exams as part of pre-deployment checks to detect brittleness not seen in static QA (panic behavior, goal oscillation).
    • Potential products: Safety scorecards with orthogonal benchmarks; failure-mode libraries and mitigations; red-team playbooks.
    • Dependencies/assumptions: Alignment with internal risk frameworks; resources to respond to discovered weaknesses.
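Several of the entries above lean on FH-BT-style pairwise ranking over fixed agent pools. As a concrete illustration, here is a minimal sketch of fitting a Bradley–Terry model to a pairwise win-count matrix via the standard iterative (MM) update; the win counts and iteration budget below are illustrative assumptions, not PokéAgent data:

```python
import numpy as np

def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry skills from a matrix where wins[i][j] is the
    number of games agent i beat agent j (MM / iterative-scaling update)."""
    n = wins.shape[0]
    total = wins + wins.T          # games played between each pair
    skill = np.ones(n)
    for _ in range(iters):
        new = np.zeros(n)
        for i in range(n):
            num = wins[i].sum()    # total wins for agent i
            den = sum(total[i, j] / (skill[i] + skill[j])
                      for j in range(n) if j != i)
            new[i] = num / den if den > 0 else skill[i]
        skill = new / new.sum()    # normalize for identifiability
    return skill

# Illustrative 3-agent round robin (10 games per pair)
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
skill = fit_bradley_terry(wins)
# Under the model, P(i beats j) = skill[i] / (skill[i] + skill[j])
```

The MM update is guaranteed to converge for connected comparison graphs; a full-history variant simply refits on the complete match log as new games arrive.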

Glossary

  • A* pathfinding: A best-first graph search algorithm that finds shortest paths using cost-to-come plus heuristic estimates. "MCP tools (A* pathfinding, button inputs, knowledge retrieval)"
  • Action space: The set of all actions available to an agent at a decision point. "alternative observation spaces, action spaces, and reward functions"
  • BenchPress evaluation matrix: A large curated matrix of models and benchmarks used to analyze and predict LLM performance across tasks. "Analysis against the BenchPress evaluation matrix shows"
  • Bradley–Terry model: A statistical model for pairwise comparisons that estimates latent skill from match outcomes. "based on a Bradley–Terry model"
  • Bootstrapped uncertainty: An uncertainty estimate obtained by resampling data (bootstrapping) and aggregating variability in the estimated metric. "with bootstrapped uncertainty"
  • CLI-agent: An LLM agent architecture that operates via command-line interfaces to plan and execute actions. "common CLI-agent harnesses (Claude Code, Codex CLI, Gemini CLI)"
  • Context compaction: The process of compressing or pruning accumulated reasoning/context to fit within model context limits over long tasks. "requires context compaction to manage the thousands of reasoning steps involved"
  • Dynamic data weighting: Adjusting the relative importance of training samples over time to improve learning. "iterative offline RL with dynamic data weighting"
  • Elo rating: A rating system for estimating relative skill from game outcomes via incremental updates. "an Elo variant incorporating uncertainty"
  • Episodic MDP: A Markov Decision Process with episodes that start and terminate, often with undiscounted returns. "We formalize RPG gameplay as an episodic MDP M = (S, A, T, R, γ)"
  • Extended Timer: An evaluation setting that relaxes per-turn time limits to allow longer test-time reasoning. "an 'Extended Timer' variant provides nearly unlimited deliberation time"
  • FH-BT rating (Full-History Bradley–Terry): A Bradley–Terry skill rating fit to an agent’s full match history, used as the primary leaderboard metric. "We refer to this metric as the Full-History Bradley–Terry (FH-BT) rating"
  • Gen 1 OU: The Generation 1 “OverUsed” competitive format in Pokémon Showdown with specific rules and metagame. "Gen 1 OU and Gen 9 OU"
  • Gen 9 OU: The Generation 9 “OverUsed” competitive format in Pokémon Showdown with contemporary rules and metagame. "Gen 1 OU and Gen 9 OU"
  • Glicko-1: A rating system extending Elo by modeling rating uncertainty that updates with matches. "We report the standard Showdown implementations of Glicko-1"
  • GXE: “Glicko X-Act Estimate,” an expected win probability metric derived from rating. "and GXE (expected win probability against a randomly sampled opponent)"
  • Harness (LLM harness): The surrounding software scaffold that handles perception, tools, memory, and control around an LLM. "harness-based LLM approaches"
  • Imitation learning: Learning policies by mimicking expert demonstrations rather than via environmental rewards. "distilled into neural networks via imitation learning"
  • Imperfect information: Settings where agents lack full knowledge of the game state or opponent’s hidden variables. "stochastic games with imperfect information and simultaneous action selection"
  • Long-horizon planning: Planning that spans many steps with delayed credit assignment and extended context. "Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems"
  • MCP tools: Tooling integrated via a Model Context Protocol interface to extend agent capabilities (e.g., planning, I/O). "MCP tools (A* pathfinding, button inputs, knowledge retrieval)"
  • Metagame: The evolving ecosystem of dominant strategies and counter-strategies shaped by players and rules. "a competitive metagame that evolves continuously"
  • Minimax search: An adversarial search algorithm that optimizes against a worst-case opponent. "combining minimax search with LLMs"
  • Multi-agent orchestration system: Infrastructure that coordinates multiple specialized agents or modules toward a shared objective. "an open-source multi-agent orchestration system that enables modular, reproducible comparisons"
  • Observation space: The set or structure of information available to the agent at decision time. "alternative observation spaces, action spaces, and reward functions"
  • Offline reinforcement learning (offline RL): Learning policies from fixed logged data without further environment interaction. "iterative offline RL with dynamic data weighting"
  • Orthogonality analysis: Studying (lack of) correlation between evaluations to identify independent capability axes. "and orthogonality analysis showing that Pokémon battling captures capabilities not predicted"
  • Out-of-distribution evaluation: Testing a model on tasks or distributions that differ from its training data to assess generalization. "offers a distinctive form of out-of-distribution evaluation"
  • Panic behavior: A failure mode where an agent compounds errors after a setback rather than recovering. "exhibit 'panic behavior' (also observed in the Gemini 2.5 report)"
  • Partial observability: Conditions where the agent cannot directly see the full state and must infer hidden information. "Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems"
  • Rank-2 SVD: A rank-two truncated singular value decomposition that retains the two leading singular components of a data matrix. "the rank-2 SVD that explains 91% of standard benchmark variance"
  • Recurrent PPO: Proximal Policy Optimization augmented with recurrent networks to handle partial observability. "recurrent PPO with milestone-conditioned rewards"
  • Reward function: A mapping that assigns scalar feedback to states/actions to guide learning. "alternative observation spaces, action spaces, and reward functions"
  • Root-parallelized MCTS: Monte Carlo Tree Search where simulations from the root are parallelized to improve search throughput. "root-parallelized MCTS in imperfect-information battling"
  • Scripted Policy Distillation (SPD): Generating scripted sub-policies with an LLM and distilling them into neural policies via imitation/RL. "Scripted Policy Distillation (SPD)"
  • Self-play: Training by having an agent play against itself or its past versions to generate data. "competitive performance often requires the scale of self-play"
  • Showdown ladder: The public ranked matchmaking ladder on Pokémon Showdown used to benchmark against humans. "Official ratings on the Showdown ladder"
  • Simultaneous action selection: Both players choose actions at the same time, increasing strategic uncertainty. "simultaneous action selection"
  • SLAM: Simultaneous Localization and Mapping; building a map while tracking position within it. "analogous to classical SLAM but through language-vision interfaces"
  • Spearman ρ: A nonparametric rank correlation coefficient measuring monotonic association. "max Spearman ρ = 0.77; mean |ρ| = 0.45"
  • Stochastic game: A game with probabilistic state transitions and/or outcomes. "two-player, zero-sum, stochastic games"
  • Trajectories: Sequences of states, actions, and outcomes collected from play for training or evaluation. "a dataset of 20M+ battle trajectories"
  • VLM-SLAM: Applying SLAM-like grounding to vision-language model outputs to stabilize spatial understanding. "VLM-SLAM: Speedrunning agents struggle with basic localization"
  • Vision-language model (VLM): A model that processes and reasons over visual inputs and language jointly. "raw frontier VLMs achieve effectively 0% task completion"
  • Zero-sum game: A game where one player’s gain is exactly the other’s loss. "two-player, zero-sum, stochastic games"
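To make the Elo entry concrete, here is a minimal single-game Elo update; the K-factor and starting ratings are assumed defaults for illustration, not Pokémon Showdown's actual ladder parameters:

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one game.
    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    e_a = elo_expected(r_a, r_b)
    e_b = 1.0 - e_a
    score_b = 1.0 - score_a
    return r_a + k * (score_a - e_a), r_b + k * (score_b - e_b)

# Equal players, A wins: A gains k * 0.5 = 16 points, B loses 16
a, b = elo_update(1500.0, 1500.0, 1.0)  # → (1516.0, 1484.0)
```

Glicko-1 extends this scheme by also tracking a rating deviation per player, which scales how strongly each result moves the rating.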
