CATArena: Iterative Evaluation for LLM Agents
- CATArena is an evaluation platform that uses iterative, tournament-style competitions based on classic board and card games with open-ended scoring.
- It implements an iterative peer-learning framework where agents refine their strategies over multiple rounds using comprehensive match logs and scoring matrices.
- The platform overcomes benchmark limitations by assessing self-improvement, adaptability, and peer-learning through scalable, automated competitions.
CATArena is an evaluation platform for LLM agents that employs iterative tournament competitions using classic board and card games with open-ended scoring. By structuring tournaments to support repeated, competitive peer-learning, CATArena systematically benchmarks not only direct performance but also agents’ capacity for continual self-improvement and adaptation to peers, addressing several intrinsic bottlenecks in traditional LLM assessment methodologies (Fu et al., 30 Oct 2025).
1. Motivation and Conceptual Innovations
Conventional benchmarks for LLM-based agents, such as code generation or GUI automation, increasingly suffer from score saturation (performance plateaus near fixed maxima), narrow scenario dependence, and high human annotation costs. More fundamentally, these static benchmarks inadequately evaluate an agent’s ability to learn—either by improving its own strategies (self-learning) or by adopting effective tactics observed in others (peer-learning). CATArena addresses these deficiencies through three innovations:
- Tournament-style, open-ended evaluation: Utilizing board and card games with no upper bound on achievable scores.
- Iterative peer-learning loop: Enabling agents to revise strategies after each round by analyzing others’ code and historical logs.
- Full automation: New competition rounds generate their own supervision through results (win/loss/draw, normalized scores, match logs), minimizing reliance on human labeling.
2. Iterative Competitive Peer-Learning Framework
CATArena operates over $N$ rounds, each structured in two phases: strategy submission (as executable code) and full tournament execution. The progression is as follows:
- Round 1: Initial Strategy Development
  - Agents receive the game code and a trivial sample AI.
  - Each agent implements a baseline strategy in code with no external hints, e.g., Minimax for Gomoku or basic heuristics for Bridge.
  - This round assesses strategy coding ability.
- Rounds 2 to N: Iterative Improvement
  - All strategies are pitted against one another, populating a scoring matrix $W$, where $W_{(i,r),(j,s)}$ is the normalized score when agent $i$ (round $r$) plays agent $j$ (round $s$).
  - Detailed tournament reports are generated (outcomes, logs, per-match rankings).
  - Agents for the next round receive all prior code and logs and are tasked with analyzing them and revising their own code before resubmitting.
  - Both self-improvement (round-on-round personal advancement) and peer-learning (incorporation of effective opponent behaviors) are empirically observable.
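The scoring matrix described above can be pictured as a table keyed by pairs of (agent, round) submissions. The following is a minimal sketch under stated assumptions, not the platform's implementation: the class name `ScoreMatrix`, its methods, and the choice to average repeated matches are all hypothetical.

```python
from collections import defaultdict

# Hypothetical key type: (agent name, round number).
Submission = tuple[str, int]


class ScoreMatrix:
    """Accumulates normalized scores in [0, 1] between pairs of submissions."""

    def __init__(self) -> None:
        self._total: dict[tuple[Submission, Submission], float] = defaultdict(float)
        self._count: dict[tuple[Submission, Submission], int] = defaultdict(int)

    def record(self, player: Submission, opponent: Submission, score: float) -> None:
        """Store one match result from the player's point of view."""
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be normalized to [0, 1]")
        self._total[(player, opponent)] += score
        self._count[(player, opponent)] += 1

    def mean(self, player: Submission, opponent: Submission) -> float:
        """Average score over all recorded matches (assumed aggregation)."""
        key = (player, opponent)
        if self._count.get(key, 0) == 0:
            raise KeyError(f"no matches recorded for {key}")
        return self._total[key] / self._count[key]


# Usage: agent_a's round-2 code beats, then draws, agent_b's round-1 code.
matrix = ScoreMatrix()
matrix.record(("agent_a", 2), ("agent_b", 1), 1.0)
matrix.record(("agent_a", 2), ("agent_b", 1), 0.5)
print(matrix.mean(("agent_a", 2), ("agent_b", 1)))  # 0.75
```

Keying entries by both agent and round is what lets later analysis separate an agent's progress against its own past from its progress against peers.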
Process Loop Diagram (Textual)
- Agents write code
- Tournament runs, scoring and logs generated
- Agents receive all code and logs
- Agents analyze and revise code
- Repeat for subsequent rounds
3. Game Suite and Open-Endedness
CATArena employs four classic games—each admitting unbounded skill and strategic diversity—structured to preclude solution memorization. Games and scoring conventions are summarized as follows:
| Game | Format & Variants | Scoring Mechanism |
|---|---|---|
| Gomoku | 15×15 board, “forbidden points”/“dual-three” | Win=1, Draw=0.5, Loss=0 |
| Texas Hold’em | Up to 12 players, escalating blinds | Fractional chip share [0,1] |
| Chess | Standard FIDE + Chess960, special move variants | Win=1, Draw=0.5, Loss=0 |
| Bridge | 4 players, 2 partnerships, variant bidding | VPs normalized to [0,1] |
All games are open-ended: although per-match scores are normalized, skill has no fixed ceiling, so continued agent improvement remains measurable against the evolving pool of opponents. Introducing game variants (e.g., Chess960, forbidden moves in Gomoku) prevents pattern memorization and probes extrapolative generalization.
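The scoring conventions in the table can be illustrated with two small helpers. This is a sketch only: the function names are hypothetical, and reading "fractional chip share" as a player's final chips divided by all chips at the table is an assumption. Bridge victory-point normalization is left out because the text does not specify the scale.

```python
def outcome_score(result: str) -> float:
    """Map a two-player result to the Win=1, Draw=0.5, Loss=0 convention."""
    scores = {"win": 1.0, "draw": 0.5, "loss": 0.0}
    if result not in scores:
        raise ValueError(f"unknown result: {result!r}")
    return scores[result]


def chip_shares(final_chips: dict[str, int]) -> dict[str, float]:
    """Assumed normalization: each player's fraction of all chips in play."""
    total = sum(final_chips.values())
    if total <= 0:
        raise ValueError("total chips must be positive")
    return {player: chips / total for player, chips in final_chips.items()}


print(outcome_score("draw"))                 # 0.5
print(chip_shares({"a": 300, "b": 100}))     # {'a': 0.75, 'b': 0.25}
```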
4. Tournament Structure, Metrics, and Core Formulas
Tournament Mechanics
- Symmetric Games (Gomoku, Chess): Full round-robin tournaments for all submitted strategies, with multiple repetitions to counteract stochasticity.
- Asymmetric Games (Texas Hold’em, Bridge): Randomized batches of size $k$; each batch outputs a result vector, whose entries are absorbed into the scoring matrix $W$.
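The two scheduling modes above can be sketched as follows. The helper names are hypothetical, and two details are assumptions rather than facts from the source: the round-robin plays every ordered pairing (so each side moves first once per repetition), and batches are drawn uniformly at random without replacement.

```python
import itertools
import random


def round_robin(submissions: list[str], repetitions: int = 2):
    """Yield every ordered pairing, repeated to average out randomness."""
    for _ in range(repetitions):
        yield from itertools.permutations(submissions, 2)


def random_batches(
    submissions: list[str], batch_size: int, num_batches: int, seed: int = 0
) -> list[list[str]]:
    """Draw tables of `batch_size` distinct submissions for multi-player games."""
    rng = random.Random(seed)  # seeded so a schedule can be reproduced
    return [rng.sample(submissions, batch_size) for _ in range(num_batches)]


pairings = list(round_robin(["a", "b", "c"], repetitions=2))
print(len(pairings))  # 12: 3 * 2 ordered pairs, played twice
tables = random_batches(["a", "b", "c", "d"], batch_size=3, num_batches=5)
print(len(tables))    # 5
```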
Metrics and Notation
Let $M$ be the number of agents, $N$ the number of rounds, and $(i, r)$ index agent $i$'s round-$r$ submission.
- Strategy Coding: average initial (round-1) performance.
- Global Learning: mean improvement over rounds.
- Counter-Adaptation: improvement against the prior round's opponents.
- Self-Improvement: cross-round performance correlation.
- Generalizability: baseline difference on variants.
Rankings are derived from these metrics, not from Elo.
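To make the verbal definitions concrete, the sketch below computes three of the metrics from a score table. Every formula here is an assumed formalization that merely matches the one-line descriptions; it is not taken from the source, and all function names are hypothetical. Self-improvement and generalizability are omitted because the descriptions leave too much open.

```python
from statistics import mean

Sub = tuple[str, int]                    # (agent, round)
Scores = dict[tuple[Sub, Sub], float]    # normalized score of first vs. second


def vs_round(w: Scores, agents: list[str], i: str, r: int, s: int) -> float:
    """Mean score of agent i's round-r code against other agents' round-s code."""
    return mean(w[((i, r), (j, s))] for j in agents if j != i)


def strategy_coding(w: Scores, agents: list[str], i: str) -> float:
    """Assumed: round-1 code scored against the other agents' round-1 code."""
    return vs_round(w, agents, i, 1, 1)


def global_learning(w: Scores, agents: list[str], i: str, rounds: int) -> float:
    """Assumed: mean round-to-round change in score against all rounds' opponents."""
    perf = [
        mean(vs_round(w, agents, i, r, s) for s in range(1, rounds + 1))
        for r in range(1, rounds + 1)
    ]
    return mean(b - a for a, b in zip(perf, perf[1:]))


def counter_adaptation(w: Scores, agents: list[str], i: str, rounds: int) -> float:
    """Assumed: gain of round-r code over round-(r-1) code vs. round-(r-1) opponents."""
    return mean(
        vs_round(w, agents, i, r, r - 1) - vs_round(w, agents, i, r - 1, r - 1)
        for r in range(2, rounds + 1)
    )


# Toy data: agent "a" scores 0.25 more per round of head start over "b".
agents = ["a", "b"]
w = {(("a", r), ("b", s)): 0.5 + 0.25 * (r - s) for r in (1, 2) for s in (1, 2)}
print(strategy_coding(w, agents, "a"))        # 0.5
print(global_learning(w, agents, "a", 2))     # 0.25
print(counter_adaptation(w, agents, "a", 2))  # 0.25
```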
Pseudocode Sketch
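The round structure described in Section 2 can be sketched as a short driver loop. This is an illustrative outline, not the platform's code: `run_catarena_loop`, `write_strategy`, and `run_tournament` are hypothetical names, and re-running the tournament over all strategies submitted so far is an assumption about how cross-round scores are obtained.

```python
from typing import Any, Callable

Submission = tuple[str, int]  # (agent name, round number)


def run_catarena_loop(
    agents: list[str],
    num_rounds: int,
    write_strategy: Callable[[str, int, list[dict[str, Any]]], str],
    run_tournament: Callable[[dict[Submission, str]], Any],
) -> list[dict[str, Any]]:
    """Alternate strategy submission and tournament execution for each round."""
    history: list[dict[str, Any]] = []       # code and reports shared with all agents
    submissions: dict[Submission, str] = {}  # every strategy submitted so far
    for rnd in range(1, num_rounds + 1):
        # Phase 1: each agent writes (round 1) or revises (later rounds) its code.
        # In round 1 the history is empty, so no peer information is available.
        new_code = {(a, rnd): write_strategy(a, rnd, history) for a in agents}
        submissions.update(new_code)
        # Phase 2: run the tournament and collect scores and logs as a report.
        report = run_tournament(submissions)
        history.append({"round": rnd, "code": new_code, "report": report})
    return history


# Usage with stub callables standing in for an LLM agent and a game engine.
demo = run_catarena_loop(
    agents=["a", "b"],
    num_rounds=3,
    write_strategy=lambda a, r, h: f"{a}-v{r}-seen{len(h)}",
    run_tournament=lambda subs: len(subs),
)
print(demo[2]["report"])  # 6 strategies in play after three rounds
```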
5. Empirical Findings and Benchmark Characteristics
Performance Spread
- Minimal Agents (built with a lightweight toolkit and a single LLM): Display wide performance variance; e.g., Claude-4-Sonnet substantially outperforms smaller open-source LLMs.
- Commercial Code Agents (Claude-Code, CodeX CLI, Gemini-CLI, Qwen-Coder): Cluster tightly, with top agents matching the best minimal agents but with reduced variance.
Benchmark Properties
- Reliability & Stability: Independent runs yield a low standard deviation in leaderboard rank for nearly all agents; standard games yield more stable rankings than variants.
- Scalability: An ML track (agents required to implement self-play training loops) and a multilingual code track (Python/JS/Go) confirm that metrics are non-saturating and accommodate further agent improvement.
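Rank stability of the kind described above can be measured by repeating the tournament and tracking how far each agent's leaderboard position moves. The helper below is a sketch with a hypothetical name; using the population standard deviation rather than the sample version is an assumption.

```python
from statistics import pstdev


def rank_std(runs: list[list[str]]) -> dict[str, float]:
    """Per-agent standard deviation of leaderboard rank across independent runs.

    Each run is a leaderboard listing agent names from best to worst.
    """
    ranks: dict[str, list[int]] = {}
    for board in runs:
        for position, agent in enumerate(board, start=1):
            ranks.setdefault(agent, []).append(position)
    return {agent: pstdev(positions) for agent, positions in ranks.items()}


stability = rank_std([["a", "b", "c"], ["a", "c", "b"]])
print(stability)  # {'a': 0.0, 'b': 0.5, 'c': 0.5}
```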
Learning Dynamics
- In simpler environments (Texas Hold’em), many agents achieve positive global-learning and counter-adaptation scores and high self-improvement scores, indicating both effective peer-learning and self-refinement.
- In complex/variant environments (Chess960, Gomoku with forbidden moves), agents typically show low or negative learning metrics, highlighting current LLM agent limitations in strategy discovery absent richer forms of peer-learning.
- Action-consistency analyses on mid-game states confirm that agents increasingly emulate stronger peers’ trajectories over rounds.
6. Significance, Limitations, and Extensions
CATArena eliminates common bottlenecks—score saturation, scenario fixity, and expert annotation cost—by coupling open-ended games with iterative, code-based peer competition. It enables scalable comparison of strategy coding, self-improvement, peer-learning, and generalizability, with all metrics dynamically tracking agent development (Fu et al., 30 Oct 2025).
Separate experimental tracks for ML-based self-play agents and multi-language implementations extend CATArena’s reach beyond pure code-based LLMs. Reliability in ranking, as evidenced by low variance in repeated runs, and the observed unsaturated metric growth, establish CATArena as a stable platform for the longitudinal assessment of agent learning ability and adaptability.
This suggests that future LLM evaluation frameworks may increasingly incorporate iterative, competitive peer-learning processes to robustly assess core general intelligence attributes without human-labeled supervision.