
CATArena: Iterative Evaluation for LLM Agents

Updated 3 July 2026
  • CATArena is an evaluation platform that uses iterative, tournament-style competitions based on classic board and card games with open-ended scoring.
  • It implements an iterative peer-learning framework where agents refine their strategies over multiple rounds using comprehensive match logs and scoring matrices.
  • The platform overcomes benchmark limitations by assessing self-improvement, adaptability, and peer-learning through scalable, automated competitions.

CATArena is an evaluation platform for LLM agents that employs iterative tournament competitions using classic board and card games with open-ended scoring. By structuring tournaments to support repeated, competitive peer-learning, CATArena systematically benchmarks not only direct performance but also agents’ capacity for continual self-improvement and adaptation to peers, addressing several intrinsic bottlenecks in traditional LLM assessment methodologies (Fu et al., 30 Oct 2025).

1. Motivation and Conceptual Innovations

Conventional benchmarks for LLM-based agents, such as those for code generation or GUI automation, increasingly suffer from score saturation (performance plateaus near fixed maxima), narrow scenario dependence, and high human annotation costs. More fundamentally, these static benchmarks inadequately evaluate an agent’s ability to learn, either by improving its own strategies (self-learning) or by adopting effective tactics observed in others (peer-learning). CATArena addresses these deficiencies through three innovations:

  • Tournament-style, open-ended evaluation: Utilizing board and card games with no upper bound on achievable scores.
  • Iterative peer-learning loop: Enabling agents to revise strategies after each round by analyzing others’ code and historical logs.
  • Full automation: New competition rounds generate their own supervision through results (win/loss/draw, normalized scores, match logs), minimizing reliance on human labeling.

2. Iterative Competitive Peer-Learning Framework

CATArena operates over $N$ rounds, each structured in two phases: strategy submission (as executable code) and full tournament execution. The progression is as follows:

  • Round 1: Initial Strategy Development
    • Agents receive game code and a trivial sample AI.
    • Each agent implements a baseline coded strategy with no external hints, e.g., Minimax for Gomoku, basic heuristics for Bridge.
    • This round assesses strategy coding ability.
  • Rounds 2 to N: Iterative Improvement
    • All strategies are pitted against one another, populating a scoring matrix $W\in\mathbb{R}^{(TN)\times(TN)}$, where $W_{i,j}^{n,m} \in [0,1]$ is the normalized score when agent $i$ (round $n$) plays agent $j$ (round $m$).
    • Detailed tournament reports are generated (outcomes, logs, per-match rankings).
    • Agents for the next round receive all prior code and logs and are tasked with analyzing them and revising their own code before resubmitting.
    • Both self-improvement (round-on-round personal advancement) and peer-learning (incorporation of effective opponent behaviors) are empirically observable.

Process Loop Diagram (Textual)

  1. Agents write code
  2. Tournament runs, scoring and logs generated
  3. Agents receive all code and logs
  4. Agents analyze and revise code

Repeat for subsequent rounds.
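
The loop above can be sketched in a few lines of Python. This is a toy illustration, not CATArena's implementation: each "strategy" is reduced to a single skill number, `play_match` is a hypothetical stand-in for a real game, and `revise` is an invented rule that stands in for an LLM agent reading code and logs. Only the overall shape (submit, run the tournament over every submission so far, revise) follows the text. It needs at least two agents.

```python
import itertools
import numpy as np


def play_match(skill_a, skill_b):
    """Toy game (assumption): higher skill wins. Returns A's score in [0, 1]."""
    if skill_a == skill_b:
        return 0.5
    return 1.0 if skill_a > skill_b else 0.0


def run_tournament(strategies):
    """Round-robin over every (agent, round) submission collected so far."""
    keys = sorted(strategies)
    index = {key: pos for pos, key in enumerate(keys)}
    W = np.full((len(keys), len(keys)), np.nan)  # diagonal stays NaN: no self-play
    for a, b in itertools.permutations(keys, 2):
        W[index[a], index[b]] = play_match(strategies[a], strategies[b])
    return W, index


def revise(own_skill, W, index, strategies):
    """Hypothetical revision rule: move halfway toward the top-scoring strategy."""
    best_pos = int(np.argmax(np.nanmean(W, axis=1)))
    best_key = next(key for key, pos in index.items() if pos == best_pos)
    return own_skill + 0.5 * (strategies[best_key] - own_skill)


def run_loop(initial_skills, num_rounds):
    """Steps 1-4 of the process loop, repeated for rounds 2..num_rounds."""
    strategies = {(i, 1): s for i, s in enumerate(initial_skills)}  # step 1
    for n in range(2, num_rounds + 1):
        W, index = run_tournament(strategies)                       # steps 2-3
        for i in range(len(initial_skills)):                        # step 4
            strategies[(i, n)] = revise(strategies[(i, n - 1)], W, index, strategies)
    W, index = run_tournament(strategies)  # final tournament over all submissions
    return W, index, strategies
```

Calling `run_loop([1.0, 3.0], num_rounds=3)` returns a 6×6 matrix, one row and column per (agent, round) submission, which matches the $(TN)\times(TN)$ shape of $W$ for $T=2$, $N=3$.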

3. Game Suite and Open-Endedness

CATArena employs four classic games—each admitting unbounded skill and strategic diversity—structured to preclude solution memorization. Games and scoring conventions are summarized as follows:

| Game | Format & Variants | Scoring Mechanism |
|---|---|---|
| Gomoku | 15×15 board, “forbidden points”/“dual-three” | Win=1, Draw=0.5, Loss=0 |
| Texas Hold’em | Up to 12 players, escalating blinds | Fractional chip share in [0,1] |
| Chess | Standard FIDE + Chess960, special move variants | Win=1, Draw=0.5, Loss=0 |
| Bridge | 4 players, 2 partnerships, variant bidding | VPs normalized to [0,1] |

All games are open-ended: there is no maximum possible score and continued agent improvement is always measurable. Introducing game variants (e.g., Chess960, forbidden moves in Gomoku) prevents pattern memorization and probes extrapolative generalization.
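
As a small illustration of the scoring conventions in the table, the sketch below maps game results to normalized scores. The function names are hypothetical, and "fractional chip share" is assumed here to mean a player's final chips divided by the total chips in play; the source does not spell out the exact formula. Bridge VP normalization is omitted because its scaling is not specified.

```python
def outcome_score(result: str) -> float:
    """Win/draw/loss scoring used for Gomoku and Chess."""
    return {"win": 1.0, "draw": 0.5, "loss": 0.0}[result]


def chip_share(final_chips: list[float]) -> list[float]:
    """Assumed Texas Hold'em scoring: each player's share of all chips, in [0, 1]."""
    total = sum(final_chips)
    if total <= 0:
        raise ValueError("total chips must be positive")
    return [chips / total for chips in final_chips]
```

For example, `chip_share([300, 100, 0, 0])` gives `[0.75, 0.25, 0.0, 0.0]`, so the shares at one table always sum to 1.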

4. Tournament Structure, Metrics, and Core Formulas

Tournament Mechanics

  • Symmetric Games (Gomoku, Chess): Full round-robin tournaments for all $T \cdot N$ submitted strategies, with multiple repetitions to counteract stochasticity.
  • Asymmetric Games (Texas Hold’em, Bridge): Randomized batches of size $B$; each batch outputs a result vector whose entries are absorbed into the scoring matrix $W$.
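
For the batched games, one way to fold per-batch result vectors into a pairwise matrix is sketched below. The absorption rule is an assumption, since the source does not state it: each participant's score in a batch is credited against every other participant in that batch, and repeated encounters are averaged. `absorb_batches` and `play_batch` are hypothetical names.

```python
import random
import numpy as np


def absorb_batches(num_strategies, batch_size, num_batches, play_batch, seed=0):
    """Sample random batches and average their scores into a pairwise matrix."""
    rng = random.Random(seed)
    totals = np.zeros((num_strategies, num_strategies))
    counts = np.zeros((num_strategies, num_strategies))
    for _ in range(num_batches):
        batch = rng.sample(range(num_strategies), batch_size)
        scores = play_batch(batch)  # one normalized score per participant
        for a, score in zip(batch, scores):
            for b in batch:
                if a != b:
                    totals[a, b] += score
                    counts[a, b] += 1
    # Average where a pairing occurred; leave NaN where two strategies never met.
    return np.divide(totals, counts,
                     out=np.full_like(totals, np.nan), where=counts > 0)
```

Averaging rather than summing keeps every entry in [0,1] regardless of how often two strategies happen to share a batch.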

Metrics and Notation

Let $T$ be the number of agents, $N$ the number of rounds, and $(i, n)$ index agent $i$’s round-$n$ submission.

  • Strategy Coding: Average initial performance

[equation missing]

  • Global Learning: Mean improvement over rounds

[equation missing]

  • Counter-Adaptation: Improvement against prior round’s opponents

[equations missing]

  • Self-Improvement: Cross-round performance correlation

[equations missing]

  • Generalizability: Baseline difference on variants

[equation missing]

Rankings are derived from these metrics, not from Elo.
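
The metric descriptions above can be read in more than one way. The sketch below picks one plausible reading for three of them. These definitions are assumptions made for illustration and are not the paper's exact formulas. `W` is stored as a 4-D array with `W[i, n, j, m]` holding the score of agent `i`'s round-`n` code against agent `j`'s round-`m` code, using 0-based rounds.

```python
import numpy as np


def mean_vs_others(W, i, n, m):
    """Mean score of agent i's round-n code against other agents' round-m code."""
    num_agents = W.shape[0]
    return float(np.mean([W[i, n, j, m] for j in range(num_agents) if j != i]))


def strategy_coding(W, i):
    """Assumed reading: first-round code against other first-round code."""
    return mean_vs_others(W, i, 0, 0)


def global_learning(W, i):
    """Assumed reading: mean round-to-round change in overall average score."""
    num_rounds = W.shape[1]
    per_round = [
        np.mean([mean_vs_others(W, i, n, m) for m in range(num_rounds)])
        for n in range(num_rounds)
    ]
    return float(np.mean(np.diff(per_round)))


def counter_adaptation(W, i):
    """Assumed reading: gain of round-n code over round-(n-1) code,
    both measured against the opponents' round-(n-1) code."""
    num_rounds = W.shape[1]
    gains = [
        mean_vs_others(W, i, n, n - 1) - mean_vs_others(W, i, n - 1, n - 1)
        for n in range(1, num_rounds)
    ]
    return float(np.mean(gains))
```

Under these readings, a positive `global_learning` value means later submissions score better on average, and a positive `counter_adaptation` value means the revised code beats the previous round's field more often than its predecessor did.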

Pseudocode Sketch

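for n = 1 .. N:
    each agent i submits strategy code for round n
        (round 1: written from the game code and sample AI only;
         rounds 2 .. N: revised after analyzing all prior code, logs, and reports)
    run the tournament over all strategies submitted so far
    record normalized scores in W and generate the tournament report
compute the metrics from W and rank the agents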

5. Empirical Findings and Benchmark Characteristics

Performance Spread

  • Minimal Agents (built with lightweight toolkit and a single LLM): Display wide performance variance; e.g., Claude-4-Sonnet outperforms smaller open-source LLMs substantially.
  • Commercial Code Agents (Claude-Code, CodeX CLI, Gemini-CLI, Qwen-Coder): Cluster tightly, with top agents matching the best minimal agents but with reduced variance.

Benchmark Properties

  • Reliability & Stability: Independent runs yield a low leaderboard rank standard deviation for nearly all agents; standard games yield more stable rankings than variants.
  • Scalability: ML track (agents required to implement self-play training loops) and a multi-lingual code track (Python/JS/Go) confirm that metrics are non-saturating and accommodate further agent improvement.
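
The rank-stability check can be illustrated as follows. This is a minimal sketch under stated assumptions: each independent run yields one score per agent, rank 1 is the highest score, ties are broken by agent order, and the population standard deviation is used. The function names are hypothetical.

```python
import numpy as np


def ranks_from_scores(scores):
    """Rank 1 = highest score; ties broken by agent order (an assumption)."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks


def rank_std(runs):
    """Per-agent standard deviation of leaderboard rank across independent runs.

    runs: sequence of per-run score lists, shape (num_runs, num_agents).
    """
    all_ranks = np.array([ranks_from_scores(run) for run in runs])
    return all_ranks.std(axis=0)
```

An agent whose rank never changes across runs gets a standard deviation of 0. Larger values flag agents whose leaderboard position depends on the run.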

Learning Dynamics

  • In simpler environments (Texas Hold’em), many agents achieve positive global learning and counter-adaptation scores and high self-improvement scores, indicating both effective peer-learning and self-refinement.
  • In complex/variant environments (Chess960, Gomoku with forbidden moves), agents typically show low or negative learning metrics, highlighting current LLM agent limitations in strategy discovery absent richer forms of peer-learning.
  • Action-consistency analyses on mid-game states confirm that agents increasingly emulate stronger peers’ trajectories over rounds.
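
One simple way to quantify action consistency is the fraction of sampled states on which two strategies choose the same move. The sketch below assumes that definition. The source does not give the exact measure, and `action_consistency` is a hypothetical helper that treats each strategy as a function from state to action.

```python
def action_consistency(policy_a, policy_b, states):
    """Fraction of states where both policies pick the same action."""
    if not states:
        raise ValueError("need at least one state")
    matches = sum(1 for state in states if policy_a(state) == policy_b(state))
    return matches / len(states)
```

Tracking this value between an agent's successive submissions and a stronger peer's strategy would show whether the agent's choices move toward the peer's over rounds.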

6. Significance, Limitations, and Extensions

CATArena eliminates common bottlenecks—score saturation, scenario fixity, and expert annotation cost—by coupling open-ended games with iterative, code-based peer competition. It enables scalable comparison of strategy coding, self-improvement, peer-learning, and generalizability, with all metrics dynamically tracking agent development (Fu et al., 30 Oct 2025).

Separate experimental tracks for ML-based self-play agents and multi-language implementations extend CATArena’s reach beyond pure code-based LLMs. Reliable rankings, evidenced by low variance across repeated runs, together with the observed unsaturated metric growth, establish CATArena as a stable platform for the longitudinal assessment of agent learning ability and adaptability.

This suggests that future LLM evaluation frameworks may increasingly incorporate iterative, competitive peer-learning processes to robustly assess core general intelligence attributes without human-labeled supervision.
