Arena-Hard-200 Benchmark

Updated 10 March 2026

The Arena-Hard-200 benchmark is a curated challenge set designed to evaluate the advanced reasoning and generalization of LLMs by targeting emerging failure modes.
It extends the GSM8K test set, raising problem difficulty through automatic evolution and multi-model adversarial evaluation.
Its design, which includes iterative model-based augmentation and validity checks, makes it a high-fidelity stress test of mathematical reasoning for next-generation LLMs.

Arena-Hard-200 is a curated, highly challenging benchmark designed to assess the advanced reasoning and generalization abilities of LLMs. Constructed via the ArenaBencher automatic evolution framework, it extends the GSM8K mathematical problem-solving test set with items that target emerging LLM failure modes, raise difficulty, and resist data contamination. As a distillation of the hardest items synthesized through multi-model adversarial evaluation, iterative LLM augmentation, and rigorous validity checks, Arena-Hard-200 serves as a high-fidelity stress test for differentiating next-generation LLMs on mathematical reasoning and related domains (Liu et al., 9 Oct 2025).
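The idea of distilling the hardest valid items from a pool scored by several models can be illustrated with a minimal sketch. This is not the ArenaBencher implementation: the `Candidate` record, the `failure_rate` score, and the `select_hardest` helper are hypothetical names introduced here, and ranking by the fraction of models that answer incorrectly is an assumed stand-in for whatever difficulty criterion the framework actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical evolved problem plus per-model evaluation results."""
    item_id: str
    is_valid: bool                  # assumed outcome of a validity check
    model_correct: dict[str, bool]  # model name -> answered correctly?


def failure_rate(candidate: Candidate) -> float:
    """Fraction of evaluated models that answered the item incorrectly."""
    results = list(candidate.model_correct.values())
    if not results:
        return 0.0
    return sum(not ok for ok in results) / len(results)


def select_hardest(candidates: list[Candidate], k: int) -> list[Candidate]:
    """Keep the k valid items that the most models fail (assumed criterion)."""
    valid = [c for c in candidates if c.is_valid]
    # Sort by descending failure rate; break ties by id for determinism.
    ranked = sorted(valid, key=lambda c: (-failure_rate(c), c.item_id))
    return ranked[:k]


# Toy pool with made-up results for two hypothetical models.
pool = [
    Candidate("a", True, {"m1": True, "m2": True}),
    Candidate("b", True, {"m1": False, "m2": True}),
    Candidate("c", False, {"m1": False, "m2": False}),  # fails validity check
    Candidate("d", True, {"m1": False, "m2": False}),
]
hardest_ids = [c.item_id for c in select_hardest(pool, k=2)]
```

In this toy pool, item `c` is discarded despite defeating both models because it fails the validity check, which mirrors the role the text assigns to validity checks: difficulty alone does not qualify an item.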

1. Framework and Method


References

ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation (2025)
