Arena-Hard-200 Benchmark
Updated 10 March 2026
- The Arena-Hard-200 benchmark is a curated challenge set designed to evaluate LLMs' advanced reasoning and generalization by targeting emerging failure modes.
- It extends the GSM8K test set with harder problems produced through automatic evolution and multi-model adversarial evaluation.
- Its rigorous design, including iterative model augmentations and validity checks, provides a high-fidelity stress test for next-generation LLMs in mathematical reasoning.
Arena-Hard-200 is a curated, highly challenging benchmark designed to assess the advanced reasoning and generalization abilities of LLMs. Constructed via the ArenaBencher automatic evolution framework, it extends the GSM8K mathematical problem-solving test set with items that target emerging LLM failure modes, raise problem difficulty, and resist data contamination. As a distillation of the hardest items synthesized through multi-model adversarial evaluation, iterative LLM augmentations, and rigorous validity checks, Arena-Hard-200 serves as a high-fidelity stress test for differentiating next-generation LLMs on mathematical reasoning and related domains (Liu et al., 9 Oct 2025).
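To make the idea of "distilling the hardest items" concrete, the following is a minimal sketch of one plausible selection step: discard candidates that fail a validity check, then rank the rest by how many evaluated models answer them incorrectly. This is an illustration under stated assumptions, not the ArenaBencher procedure. The `Candidate` structure, the `failure_rate` and `select_hardest` helpers, the model names, and the toy data are all hypothetical, and the source does not specify the actual scoring or selection rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate benchmark item (hypothetical structure, for illustration only)."""
    item_id: str
    is_valid: bool                   # result of a validity check, assumed precomputed
    model_correct: dict[str, bool]   # per-model correctness, keyed by model name


def failure_rate(candidate: Candidate) -> float:
    """Fraction of evaluated models that answered the item incorrectly."""
    results = list(candidate.model_correct.values())
    if not results:
        return 0.0
    return sum(not ok for ok in results) / len(results)


def select_hardest(candidates: list[Candidate], k: int) -> list[Candidate]:
    """Keep valid items only, then return the k items with the highest failure rate."""
    valid = [c for c in candidates if c.is_valid]
    # Sort by descending failure rate; break ties by item_id for determinism.
    ranked = sorted(valid, key=lambda c: (-failure_rate(c), c.item_id))
    return ranked[:k]


# Toy pool with two hypothetical models; none of this is real benchmark data.
pool = [
    Candidate("q1", True, {"model_a": True, "model_b": True}),
    Candidate("q2", True, {"model_a": False, "model_b": True}),
    Candidate("q3", False, {"model_a": False, "model_b": False}),  # fails validity
    Candidate("q4", True, {"model_a": False, "model_b": False}),
]

hardest = select_hardest(pool, k=2)
print([c.item_id for c in hardest])  # prints ['q4', 'q2']
```

In this sketch `q3` is excluded even though every model fails it, because an item that does not pass the validity check says nothing reliable about model ability. Aggregating failures over several models, rather than a single one, is what keeps the selected set from overfitting to one model's weaknesses.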