TextArena: Interactive LLM Benchmark
- TextArena is an open-source suite of competitive text games designed to benchmark dynamic, social, and agentic skills in LLMs.
- It employs a TrueSkill-based ranking system for real-time evaluation across diverse environments including negotiation, deception, and strategic planning.
- The platform supports extensible, multi-agent research, fostering community-driven advancements and curriculum-based training for interactive LLM applications.
TextArena is an open-source suite of competitive text-based games intended for training and evaluation of agentic behaviors in LLMs. Designed as a research, benchmarking, and community platform, TextArena addresses critical gaps in the evaluation of LLMs by focusing on dynamic, social, and interactive skills beyond static language understanding or code-generation tasks.
1. Foundation, Scope, and Motivation
TextArena arose in response to diminishing marginal utility of traditional LLM benchmarks such as MMLU or HumanEval, which primarily measure static knowledge or problem-solving ability. The platform is structured to probe richer, interactive aspects of model behavior—especially those relevant to real-world agentic applications—such as negotiation, theory of mind, deception, persuasion, and adaptability.
At launch, TextArena included 57+ environments and has since expanded to over 74. These environments span single-player, two-player, and multi-player games, incorporating logic puzzles (e.g., Sudoku, Tower of Hanoi), board games (Chess, Othello), and bespoke social games centered on negotiation, bluffing, and strategic planning. Each game environment is annotated with the primary "soft skills" it evaluates, providing a granular skill profile for tested agents.
2. Features and Benchmarking Methodology
Addressing Benchmark Limitations
A central goal of TextArena is to remedy the underrepresentation of dynamic and social intelligence in LLM benchmarks. Many of the platform’s games require adversarial or cooperative decision-making over multiple rounds, necessitating theory-of-mind, non-trivial planning, deception detection, and nuanced communication—skills difficult or impossible to assess with classical QA or static reasoning tasks.
TrueSkill Scoring and Online Play
Performance assessment is managed through integration with Microsoft's Bayesian TrueSkill™ ranking system:
- Models (and humans) are each rated by a pair (μ, σ), initialized to the standard TrueSkill defaults μ₀ = 25 and σ₀ = 25/3 ≈ 8.33.
- The system updates scores after every online or offline match, converging to robust skill estimates more quickly than Elo.
- A live leaderboard tracks both global and per-skill rankings, using skill tags assigned to environments.
The online-play system facilitates real-time matches between models, between models and humans, or among human participants. Match results are immediately reflected in TrueSkill ratings and public leaderboards.
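TextArena's rating updates happen server-side, but the update mechanics can be illustrated with the open-source `trueskill` Python package (not TextArena code; shown only as a sketch of how a single match outcome shifts a (μ, σ) pair under the standard defaults):

```python
# Illustrative only: the open-source `trueskill` package, not TextArena's
# internal rating service. Rating() uses the standard defaults mu=25, sigma=25/3.
from trueskill import Rating, rate_1vs1

model_a = Rating()  # mu=25.0, sigma≈8.33 before any matches
model_b = Rating()

# Suppose model_a beats model_b in one game: the winner's mu rises, the
# loser's falls, and both sigmas shrink as uncertainty is reduced.
model_a, model_b = rate_1vs1(model_a, model_b)

# A common conservative leaderboard score is mu - 3*sigma.
print(f"A: mu={model_a.mu:.2f} sigma={model_a.sigma:.2f} "
      f"score={model_a.mu - 3 * model_a.sigma:.2f}")
print(f"B: mu={model_b.mu:.2f} sigma={model_b.sigma:.2f}")
```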
3. Research Orientation and Extensibility
TextArena is engineered for extensibility and collaborative research:
- The environment API is designed to mimic OpenAI Gym/Gymnasium, supporting familiar reinforcement learning workflows and easy adoption in existing codebases.
- Multi-agent and multi-player mechanisms are modeled after those in PettingZoo, supporting flexible agent interaction schemes, including competitive and cooperative settings.
- The core framework uses stackable wrappers and unified environment interfaces, significantly lowering the overhead for researchers wishing to add or modify games (a minimal wrapper sketch follows this list).
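The wrapper pattern itself is easy to hand-roll. The sketch below is not TextArena's wrapper base class (the library ships its own, such as the LLMObservationWrapper used in Section 4); it assumes only the reset/get_observation/step/close interface shown in the Section 4 gameplay loop and post-processes observations with a placeholder transformation:

```python
# Minimal sketch of a stackable wrapper, assuming only the
# reset/get_observation/step/close interface used later in this article.
# Not part of the TextArena API; the class name and transformation are illustrative.
class UppercaseObservationWrapper:
    """Delegates to the wrapped environment and post-processes observations."""

    def __init__(self, env):
        self.env = env

    def reset(self, num_players):
        return self.env.reset(num_players=num_players)

    def get_observation(self):
        player_id, observation = self.env.get_observation()
        # Any observation transformation goes here; upper-casing is a stand-in.
        return player_id, observation.upper()

    def step(self, action):
        return self.env.step(action=action)

    def close(self):
        return self.env.close()

# Wrappers stack, e.g. UppercaseObservationWrapper(ta.wrappers.LLMObservationWrapper(env=env))
```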
To foster community engagement, the platform maintains an active online presence, a Discord channel for collaborative development, and open submission of both models and game environments. As of publication, 283 models (64 official, the rest community-submitted) and many human players have been evaluated in repeated competitive settings.
A planned dataset release will include gameplay trajectories from humans and state-of-the-art models, supporting downstream research in imitation learning, behavioral analysis, and curriculum construction.
4. Applications and Example Use Cases
TextArena supports both human-vs-model and model-vs-model evaluation, as well as fully offline self-play for training:
- Researchers can assess model performance relative to skilled humans (“Humanity” as a tracked baseline) or other LLMs, across one or more tagged skill domains.
- Typical use-cases include adversarial skill evaluation, targeted agent training, synthetic curriculum loops for RL, and iterative agent refinement via competition dynamics.
- Rich data on model behaviors—such as unintentional information leakage (e.g., revealing hidden cards), unsuccessful deception, or strategic innovation—is available for post hoc analysis.
- Example code for initializing agents and running multi-environment matches:
```python
import textarena as ta

# One agent per player slot
agents = {
    0: ta.agents.OpenRouterAgent(model_name="GPT-4o-mini"),
    1: ta.agents.OpenRouterAgent(model_name="anthropic/claude-3.5-haiku"),
}

# Build an environment from the listed game subset and wrap it so
# observations arrive as LLM-ready prompts
env = ta.make(env_id=["TicTacToe-v0", "SpellingBee-v0"])
env = ta.wrappers.LLMObservationWrapper(env=env)

env.reset(num_players=len(agents))
done = False
while not done:
    # Fetch the active player's observation, query its agent, and step
    player_id, observation = env.get_observation()
    action = agents[player_id](observation)
    done, info = env.step(action=action)
rewards = env.close()
```
- Documentation, tutorials, and environment/game details are accessible through the project repositories and official website (github.com/LeonGuertler/TextArena, textarena.ai).
5. Technical Infrastructure and Skill Profiling
The TextArena API supports any agent count, enabling rigorous evaluation of multi-agent interaction as well as single-agent skill assessment. Each game environment is tagged with up to five “soft skills” (from a curated set of ten), and a model’s skill profile is computed as a weighted average of performance across environments associated with each tag.
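As a minimal sketch of that weighted average (the environment names, skill tags, and games-played weighting below are illustrative assumptions, not TextArena's published scheme), a per-skill profile can be aggregated from per-environment ratings like so:

```python
# Hedged sketch of the per-skill aggregation described above. The exact
# weighting TextArena uses is not specified here; this assumes a simple
# games-played weighting over the environments carrying each skill tag.
from collections import defaultdict

# Hypothetical inputs: per-environment rating and game count for one model,
# plus the skill tags attached to each environment.
env_ratings = {"Poker-v0": 27.1, "SpellingBee-v0": 24.3, "Negotiation-v0": 26.0}
env_games   = {"Poker-v0": 120,  "SpellingBee-v0": 80,   "Negotiation-v0": 45}
env_skills  = {
    "Poker-v0": ["Bluffing", "Probabilistic Reasoning"],
    "SpellingBee-v0": ["Pattern Recognition"],
    "Negotiation-v0": ["Persuasion", "Bluffing"],
}

totals, weights = defaultdict(float), defaultdict(float)
for env_id, rating in env_ratings.items():
    for skill in env_skills[env_id]:
        totals[skill] += rating * env_games[env_id]
        weights[skill] += env_games[env_id]

skill_profile = {skill: totals[skill] / weights[skill] for skill in totals}
print(skill_profile)  # e.g. Bluffing ≈ 26.8 under these made-up numbers
```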
The leaderboard aggregates these measurements, yielding global skill ratings and per-skill breakdowns derived directly from TrueSkill-updated match statistics. Game rules, setups, and other technical details are included in the platform documentation.
6. Comparative Significance and Future Directions
TextArena distinguishes itself from prior LLM game benchmarks by:
- Providing a uniquely broad coverage of environment types, including multi-agent, multiplayer, and dynamic social deduction games.
- Enabling large-scale, continuous, and decentralized benchmarking of both human and LLM participants via the online system.
- Offering public access to both evaluation infrastructure and skill breakdown data, supporting reproducible research and benchmarking at scale.
TextArena is positioned as both a future-proof benchmark—mitigating “saturation” risks inherent in fixed QA or code sets—and a generative resource for ongoing curriculum-driven model training and interactive RL. A plausible implication is continued extension to novel games and nuanced skill categories as community participation increases and LLM capabilities evolve.
Table: Sample Features and Protocols in TextArena
Feature | Description |
---|---|
Online Play System | Live model vs model/human, real-time leaderboard, TrueSkill ranking |
API Design | OpenAI Gym/Gymnasium-style, PettingZoo-style multi-agent support |
Skill Profiling | 10-tag soft skill annotation, per-skill and global scoring |
Extensibility | Add/modify games with wrappers, open model/game submission |
Measurement | Bayesian TrueSkill (μ, σ), global and skill-tagged leaderboard |
TextArena thus provides a robust, extensible, and community-driven framework for advancing the study and development of agentic LLMs, emphasizing skills essential for both theoretical research and practical deployment in interactive, multi-agent contexts.