LiveBench: Dynamic LLM Benchmark Suite
- LiveBench is a benchmark suite that uses fresh, real-world tasks to evaluate LLMs and LMMs while avoiding test data contamination.
- It employs automated, objective scoring with rigorous ground-truth metrics across six diverse categories, ensuring unbiased performance measurement.
- The suite reveals that even top-performing models score below roughly 65% accuracy, highlighting the difficulty of achieving genuine generalization.
LiveBench is a benchmark suite designed for contamination-free, objective, and challenging evaluation of LLMs and, in its multimodal extension, large multimodal models (LMMs). It addresses a central obstacle in LLM assessment: traditional static benchmarks are quickly rendered obsolete when test data leaks into training sets, while dynamic, crowd-judged or LLM-judged approaches introduce bias and break down for hard problems. LiveBench mitigates these issues by frequently updating its tasks with recent, diverse real-world data and by automating answer scoring against rigorous ground truth, enabling precise measurement of model capabilities across multiple domains.
1. Design Philosophy and Objectives
LiveBench was constructed to overcome several deficiencies observed in prior LLM benchmarks, emphasizing three core objectives:
- Contamination Resistance: Tasks are sourced from frequently updated and recent information—such as new math competitions, arXiv preprints, news articles, and Kaggle datasets. The continual refresh cycle ensures evaluation examples are unlikely to appear in any model’s training set.
- Automated, Objective Scoring: LiveBench eschews subjective human or LLM-based judging in favor of ground-truth metrics, even for open-ended or difficult tasks. This enables robust, unbiased evaluation of model outputs.
- Diversity and Difficulty: The suite covers six distinct categories—math, coding, reasoning, data analysis, instruction following, and language comprehension—and incorporates contamination-limited or procedurally generated versions of historically “leaked” tasks (e.g., Big-Bench Hard, AMPS, IFEval). Top-performing models typically achieve less than 65% overall accuracy.
LiveBench's design prevents artificially inflated performance due to test/train overlap and ensures that measured improvements reflect genuine progress in model capabilities.
2. Task Structure and Evaluation Procedures
LiveBench organizes its tasks into six principal categories, each with distinct prompting formats, evaluation functions, and methods for generating ground-truth answers.
| Category | Example Tasks | Scoring Method |
|---|---|---|
| Math | AMC12, AIME, SMC, USAMO, IMO, AMPS Hard | Symbolic matching, duplication, custom formats |
| Coding | Code Generation, Code Completion | pass@1 |
| Reasoning | Web of Lies (extended), Zebra Puzzles | Boolean logic, group constraints |
| Data Analysis | Column Type Annotation, Table Reformat, Join Prediction | F1, structural matching |
| Instruction Following | Paraphrase, Summarize, Story Generation (from The Guardian) | Automated constraint checks |
| Language Comprehension | Connections, Typos, Plot Unscrambling | Fuzzy matching, normalized edit distance |
Technical Details
- Column Type Annotation (CTA): Each table column $c$ is annotated by a ground-truth labeling function $f(c) \in \mathcal{L}$, where $\mathcal{L}$ is the set of admissible semantic type labels. Evaluation verifies that the predicted label $\hat{\ell}(c)$ satisfies $\hat{\ell}(c) = f(c)$, with scores aggregated via F1.
- Plot Unscrambling: Scoring uses a normalized edit distance, $\text{score} = 1 - L(\hat{o}, o^{*}) / \max(|\hat{o}|, |o^{*}|)$, where $L(\hat{o}, o^{*})$ is the Levenshtein distance between the predicted and gold sentence orderings (sketched after this list).
- Coding Tasks: Solutions are accepted only when all provided test cases pass (pass@1). Across categories, open-ended reasoning, procedural generation, and complex constraint verification are standard.
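The scoring functions above can be made concrete with a short Python sketch. The function names and the max-length normalization of the plot-unscrambling score are illustrative assumptions, not LiveBench's exact implementation:

```python
# Illustrative sketch of ground-truth scoring in the style described above.
# Function names and the max-length normalization are assumptions.

def levenshtein(pred: list[str], gold: list[str]) -> int:
    """Edit distance between two sequences (here, orderings of plot sentences)."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def plot_unscrambling_score(pred_order: list[str], gold_order: list[str]) -> float:
    """1 minus the normalized Levenshtein distance, clamped to [0, 1]."""
    denom = max(len(pred_order), len(gold_order), 1)
    return max(0.0, 1.0 - levenshtein(pred_order, gold_order) / denom)

def cta_exact_match(predicted_label: str, gold_label: str) -> bool:
    """Column Type Annotation: the predicted semantic type must equal the gold label."""
    return predicted_label.strip().lower() == gold_label.strip().lower()
```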
Task prompts often include fine-grained instruction and format requirements, e.g., LaTeX boxed math answers or CSV/TSV conversions with Pandas.
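As an example of how such format requirements enable fully automated grading, the following sketch handles the boxed-answer case: it extracts the content of a \boxed{...} expression from a model response and compares it to ground truth. The helper names are hypothetical, and the regex covers only non-nested braces as a simplification:

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Return the content of the last \\boxed{...} in a response (non-nested braces only)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def score_math_answer(response: str, ground_truth: str) -> int:
    """1 if the extracted boxed answer matches the ground truth string, else 0."""
    answer = extract_boxed_answer(response)
    return int(answer is not None and answer == ground_truth.strip())

# Example:
print(score_math_answer(r"The answer is \boxed{42}.", "42"))  # -> 1
```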
3. Comparative Analysis with Prior Benchmarks
LiveBench stands apart from predecessors (Big-Bench Hard, AMPS, IFEval) via several mechanisms:
- Dynamic Question Generation: Static benchmarks eventually leak into model training; LiveBench’s monthly updates from real competitions, datasets, and recently published sources counteract this.
- Rigorous Automated Scoring: Human/LLM judges introduce bias and scale poorly for hard or open-ended tasks (error rates up to 46% were observed for LLM judging of LiveBench math/reasoning problems). LiveBench scoring is formally defined per task.
- Difficulty Calibration: Even top models do not surpass roughly 65% accuracy. New versions of older benchmark problems are made algorithmically harder and less susceptible to memorization.
- Task Breadth: Six categories cover structured problem solving, program synthesis, logical and relational reasoning, real-world paraphrasing, and intrinsic language comprehension. Category-wise scatter plots reveal model-specific strengths and weaknesses.
These design choices position LiveBench as a forward-looking, resilient benchmark for tracking real LLM progress over time.
4. Model Evaluation Results
Extensive evaluation is performed on a broad range of LLMs (49 models, 0.5B–405B parameters), with results reported on the official leaderboard.
- Overall Difficulty: No model exceeds approximately 65% accuracy; e.g., claude-3-5-sonnet-20240620 achieves 61.2, while gpt-4o-2024-05-13 attains 55.0.
- Model Family Trends: Proprietary, largest-scale models outperform open-source and smaller-scale models.
- Category Variation: Distinct models excel in different domains (coding vs. reasoning vs. math).
- Cross-Benchmark Correlation: LiveBench scores correlate strongly (0.88–0.91) with other leaderboards (ChatBot Arena, Arena-Hard), but some outliers evidence genuine domain strengths/weaknesses.
- Ablation and Judging Studies: Automated ground-truth scoring avoids the substantial error rates of LLM-judge-based protocols for difficult problems.
These findings illustrate that LiveBench discriminates between models not merely by scale or memorization, but by actual reasoning and generalization ability.
5. Benchmark Extensions: Multimodal LIVEBENCH and Routing Integrations
Multimodal LIVEBENCH (Zhang et al., 17 Jul 2024) extends the LiveBench paradigm to LMMs, targeting low-cost, zero-contamination evaluation. Rather than relying on fixed datasets, current web news and forum content is collected and refined; Q&A pairs are generated by a high-quality quiz model, scored by current judge models (e.g., GPT-4o, Claude-3-Opus), and then verified by human annotators. Scores are assigned on a 1–10 scale, mapped to 0–100 accuracy.
This dynamic strategy enables measurement of zero-shot generalization to fresh, unseen multimodal stimuli, emphasizing practical performance in real-world conditions. Leaderboards and open-source repositories support ongoing community benchmarking and model integration.
Additionally, InferenceDynamics (Shi et al., 22 May 2025) leverages LiveBench as a core component in group-level LLM routing evaluation. RouteMix, a composite dataset, includes LiveBench among other benchmarks; routing selects the model $m$ maximizing a weighted score of the form $R(m, q) = w_1\,K(m, q) + w_2\,C(m, q)$, where $K(m, q)$ and $C(m, q)$ are the knowledge and capability scores of model $m$ on query group $q$ and $w_1$, $w_2$ are tunable weights. Empirically, InferenceDynamics routed queries to the optimal model for each LiveBench subtask, improving average scores over even the best single-model baselines.
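A minimal sketch of such group-level routing, assuming per-group knowledge and capability score tables and equal default weights, is shown below; the model names and scores are made up, and the exact InferenceDynamics scoring function may differ:

```python
# Illustrative sketch of group-level routing via a weighted score.
# Model names, score tables, and weights are hypothetical.

def routing_score(knowledge: float, capability: float,
                  w1: float = 0.5, w2: float = 0.5) -> float:
    """Weighted combination of a model's knowledge and capability scores for a query group."""
    return w1 * knowledge + w2 * capability

def route(query_group: str,
          knowledge_scores: dict[str, dict[str, float]],
          capability_scores: dict[str, dict[str, float]]) -> str:
    """Pick the model with the highest routing score for the given query group."""
    return max(knowledge_scores.keys(),
               key=lambda m: routing_score(knowledge_scores[m][query_group],
                                           capability_scores[m][query_group]))

# Example with made-up scores for two hypothetical models on a LiveBench-style subtask:
knowledge = {"model_a": {"reasoning": 0.71}, "model_b": {"reasoning": 0.64}}
capability = {"model_a": {"reasoning": 0.58}, "model_b": {"reasoning": 0.69}}
print(route("reasoning", knowledge, capability))  # -> "model_b" with equal weights
```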
6. Applications of LiveBench in Protocol Innovations
Protocols such as TICK and STICK (Cook et al., 4 Oct 2024) utilize LiveBench’s structure for interpretable, checklist-driven evaluation and self-improvement:
- TICK: Uses LLMs to generate and apply YES/NO checklists for each instruction. Agreement between LLM and human judgment increased from 46.4% to 52.2%.
- STICK: Applies checklist-driven self-refinement. For example, the Command-R+ model's reasoning score increased from 29.2 to 37.0 (+7.8 points) on LiveBench tasks, demonstrating that checklist-based feedback is more effective than unstructured self-critique.
- Metrics: Checklist Pass Rate (PR) and Decomposed Requirements Following Ratio (DRFR) precisely measure the proportion of checklist requirements met (a minimal computation sketch follows this list).
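A minimal sketch of the checklist pass-rate computation, assuming binary YES/NO item judgments (aggregation details in TICK/STICK may differ), is:

```python
# Minimal sketch of a checklist pass-rate computation, assuming each checklist item
# has been judged YES/NO (True/False); aggregation details may differ from TICK/STICK.

def checklist_pass_rate(item_judgments: list[bool]) -> float:
    """Fraction of checklist items judged YES for a single response."""
    return sum(item_judgments) / len(item_judgments) if item_judgments else 0.0

def mean_pass_rate(per_response_judgments: list[list[bool]]) -> float:
    """Average pass rate over a set of responses (one checklist per response)."""
    rates = [checklist_pass_rate(j) for j in per_response_judgments]
    return sum(rates) / len(rates) if rates else 0.0

# Example: two responses, one satisfying 3/4 checklist items and one satisfying 2/3.
print(mean_pass_rate([[True, True, True, False], [True, False, True]]))  # ~0.708
```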
These protocols demonstrate that LiveBench is useful not only for evaluation but also as a substrate for advancing the reliability and interpretability of model generation.
7. Community Ecosystem and Future Directions
LiveBench is openly available under Apache 2.0, with questions, code, and model answers released via GitHub and HuggingFace (https://github.com/livebench/livebench, https://huggingface.co/livebench). Community involvement is integral—researchers may contribute new models, tasks, and prompt strategies. Leaderboards at https://livebench.ai facilitate ongoing comparison and progress tracking.
Future plans include monthly content updates, further task diversification (e.g., non-English tasks, new data analysis types), and ongoing prompt and metric refinement to prevent residual bias and maintain evaluation fidelity. The range of applications is anticipated to expand as LLM and LMM capabilities evolve.
LiveBench thus serves as a rigorous, state-of-the-art tool for the continuous, multi-domain and multimodal evaluation of advanced AI systems, ensuring persistent validity amid rapid generational changes in model architectures and training regimes.