Beyond Imitation Game Benchmark Hard

Updated 16 April 2026
  • BBH is a set of 23 tasks designed to rigorously test multi-step reasoning across domains like logic, arithmetic, and commonsense by avoiding shortcut solutions.
  • Evolving into BBEH, the tasks increase complexity with many-hop reasoning, extended context, and dynamic rule induction to challenge LLMs further.
  • Evaluation metrics reveal that while models achieve over 90% on BBH, their performance drops significantly on BBEH, with harmonic mean scores as low as 9.8%.

Beyond Imitation Game Benchmark Hard (BBH) refers to a class of evaluation tasks for LLMs designed to rigorously probe general reasoning capabilities across diverse domains. Initially conceived as BIG-Bench Hard (BBH), this suite of 23 challenging tasks was curated to stress-test LLMs with multi-step, compositional, and algorithmically complex questions. BBH emerged as a critical benchmarking tool when state-of-the-art models saturated earlier general reasoning benchmarks, pushing the frontier of measurable progress in LLM reasoning capabilities. However, with recent advances resulting in near-perfect accuracy on BBH, its utility as a differentiator has diminished. To address this, BIG-Bench Extra Hard (BBEH) was introduced, systematically increasing task difficulty and complexity—redefining the state-of-the-art for broad-coverage, automatically scorable, and reproducible reasoning assessment of frontier LLMs (Kazemi et al., 26 Feb 2025, Dunham et al., 2024).

1. Historical Development and Motivation

BBH was introduced in 2022 as a focused, rigorous subset of the broader BIG-Bench project, isolating 23 tasks uniquely resistant to pattern-matching and shortcut solutions. Each task was selected because it demands multi-step reasoning and structured inference, and offers no simple surface cues that conventional pretraining alone can exploit. Domains included formal logic, arithmetic, commonsense, spatial reasoning, temporal judgment, and specialized algorithmic challenges.

As LLMs—driven by architectural improvements and prompt engineering, notably chain-of-thought (CoT) reasoning—surpassed 90% accuracy thresholds on BBH, the benchmark reached saturation. The lack of benchmarking headroom hindered the ability to discriminate among newly developed, highly capable models. This bottleneck motivated the systematic augmentation of task difficulty, culminating in BBEH, where every BBH task is replaced with a substantially harder analogue to probe the same category of reasoning but with increased inferential and contextual demands (Kazemi et al., 26 Feb 2025).

2. Task Design and Skill Coverage

Each of the 23 BBH tasks was originally drawn from domains such as temporal understanding, spatial and geometric reasoning, commonsense and humour, causal judgment, deductive logic, linguistic reasoning, data-structures and algorithms, counting and filtering, and arithmetic (Kazemi et al., 26 Feb 2025, Dunham et al., 2024).

BBEH preserves these skill categories but introduces new tasks for each, specifically designed to increase complexity along axes including:

  • Many-hop reasoning: Expanding inference chains to 6–8 steps, compared to the 2–3 typical in BBH.
  • Long-context processing: Prompts are on average six times longer, increasing compositional and memory demands.
  • Needle-in-a-haystack retrieval: Requiring fine-grained extraction among distractors.
  • On-the-fly rule induction: Forcing models to learn new rules per instance.
  • Strong-prior inversion: Challenging standard model assumptions.
  • Error localization: Identifying mistakes in multi-step reasoning chains.
  • Constraint satisfaction and compositional puzzles: Including multi-facet reasoning and constraint propagation tasks.
  • Ambiguous and out-of-distribution reasoning: Tasks with significant novelty or ambiguity, such as hyperbaton in syntactic order or sarcasm detection.

Task development followed a semi-adversarial loop, iterating prompts until reference models (e.g., Gemini 1.5 Flash and Gemini Thinking Experimental) achieved below 70% accuracy, systematically blocking trivial or shortcut-based solutions (Kazemi et al., 26 Feb 2025).
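
In outline, that loop can be sketched as below; draft_task, harden, and evaluate are hypothetical stand-ins for the (largely human-driven) authoring and evaluation steps, and 0.70 is the accuracy ceiling mentioned above:

```python
ACCURACY_CEILING = 0.70  # a candidate task is kept only once every reference model scores below this

def semi_adversarial_loop(draft_task, harden, evaluate, reference_models, max_rounds=10):
    """Iteratively harden a candidate task until all reference models fall below the ceiling."""
    task = draft_task()
    for _ in range(max_rounds):
        scores = {model: evaluate(model, task) for model in reference_models}
        if all(acc < ACCURACY_CEILING for acc in scores.values()):
            return task  # difficult enough: trivial or shortcut-based solutions are blocked
        task = harden(task, scores)  # e.g. add hops, distractors, or longer context
    return task
```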

3. Evaluation Methodology and Metrics

Both BBH and BBEH share the property of being automatically scorable. Tasks are presented as short, zero- or few-shot questions with a small output space (typically 2–5 options) and oracle-checkable answers. For BBH, evaluation typically uses zero-shot chain-of-thought prompting, eliciting model-generated reasoning traces. Output scoring differs by task type; multiple-choice is evaluated on the first letter parsed from the model’s output, while exact-match tasks compare normalized strings to the ground-truth answer (Dunham et al., 2024).
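
A minimal sketch of this two-mode scoring in Python, under the assumption that multiple-choice answers are keyed on the first standalone option letter and exact-match answers on lowercased, punctuation-stripped strings; the official harness may parse and normalize differently:

```python
import re
import string

def score_multiple_choice(model_output: str, gold_letter: str) -> bool:
    """Credit a multiple-choice answer if the first standalone capital letter matches the gold option."""
    match = re.search(r"\b([A-Z])\b", model_output)
    return match is not None and match.group(1) == gold_letter.upper()

def score_exact_match(model_output: str, gold_answer: str) -> bool:
    """Credit a free-form answer if it matches the gold answer after simple normalization."""
    def normalize(text: str) -> str:
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())  # trim and collapse whitespace
    return normalize(model_output) == normalize(gold_answer)

# Example usage (illustrative inputs, not benchmark items):
print(score_multiple_choice("The answer is (C).", "C"))  # True
print(score_exact_match(" Valid \n", "valid"))           # True
```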

BBEH augments standard average (macro/micro) accuracy metrics with a harmonic-mean accuracy to penalize uneven performance and amplify the impact of weakest-performing tasks. Formally, with $a_i$ denoting accuracy on task $i$ and $N$ the total number of tasks:

  • Average accuracy: $A = \frac{1}{N}\sum_{i=1}^{N} a_i$
  • Harmonic-mean accuracy: $H = \frac{N}{\sum_{i=1}^{N} \frac{1}{a_i}}$

A 1% smoothing is applied to all $a_i$ to avoid division-by-zero artifacts (Kazemi et al., 26 Feb 2025). This approach discourages models from optimizing only for “easy wins” and exposes failure modes across the full reasoning skill spectrum.
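
The difference between the two aggregates can be illustrated with a short sketch; the per-task accuracies below are fabricated for illustration, and the flooring interpretation of the 1% smoothing is an assumption:

```python
def average_accuracy(task_accuracies: list[float]) -> float:
    """Macro-average accuracy A = (1/N) * sum(a_i)."""
    return sum(task_accuracies) / len(task_accuracies)

def harmonic_mean_accuracy(task_accuracies: list[float], floor: float = 0.01) -> float:
    """Harmonic-mean accuracy H = N / sum(1/a_i), with each a_i floored at 1% (one reading of the smoothing)."""
    smoothed = [max(a, floor) for a in task_accuracies]
    return len(smoothed) / sum(1.0 / a for a in smoothed)

# Fabricated per-task accuracies: strong on four tasks, near-zero on two.
accs = [0.90, 0.85, 0.80, 0.75, 0.02, 0.00]
print(f"A = {average_accuracy(accs):.3f}")        # ≈ 0.553: the easy wins dominate
print(f"H = {harmonic_mean_accuracy(accs):.3f}")  # ≈ 0.039: the weakest tasks dominate
```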

4. Dataset Composition and Example Tasks

BBEH consists of 23 tasks, each comprising 200 examples (120 for the pronoun-focused Disambiguation QA task). Tasks span eleven core BBH categories as well as new areas that emerged through adversarial task crafting. Notable examples include:

  • Time Arithmetic: Multi-part calculation with compositional “tree-of-thought” requirements.
  • SpatialLLMEval+: Spatial relations and geometric constructions from SVG descriptions with distractors.
  • New Yorker Cartoon Caption: Humour detection and ranking for “top-10 funniest.”
  • Causal Judgment & Necessary/Sufficient Cause: Multi-label causal assignments, often ambiguous.
  • Boardgame QA & Web of Lies: Defeasible logic with conflict, truth-tellers/liars puzzles extending to cycles and unknown quantities.
  • Buggy Tables & Dyck Language: Faulty table reconstruction, error identification in bracket-stack traces.
  • Hyperbaton: Inducing novel English adjective orderings.
  • Object Counting, Properties, Shuffled Objects: Multi-instance, compositional object tracking over long action chains.
  • Zebra Puzzles, SportQA: High-constraint, multi-question compositional reasoning.

A typical example (Geometric Shapes task): The model receives a multi-line SVG path and a set of possible shapes, and must identify which appear, requiring synthesis of geometric and visual inference.
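
As a toy illustration of the inference involved (not an actual benchmark item), the snippet below counts the distinct vertices of a simple closed SVG path, one of the signals a solver must combine with edge lengths and angles to name the shape:

```python
import re

def vertex_count(svg_path: str) -> int:
    """Count distinct vertices in a simple absolute-coordinate SVG path using only M/L commands."""
    points = re.findall(r"[ML]\s*([\d.]+)[ ,]+([\d.]+)", svg_path)
    return len(set(points))

# A closed path with four distinct vertices: a quadrilateral, not a triangle or a hexagon.
path = "M 10,10 L 60,10 L 60,50 L 10,50 Z"
print(vertex_count(path))  # 4
```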

5. Quantitative Performance and Analysis

The difficulty of BBEH is quantitatively evident. On BBH, state-of-the-art models such as Gemini 2.0 Flash achieve a harmonic mean (H) of 85.2%; on BBEH, the same model’s H falls to 9.8%. The best reasoning-specialized model (o3-mini high) reaches H = 44.8% on BBEH, with micro-average accuracy of 54.2%. Random baselines are at H = 2.4% and micro-average A = 8.4% (Kazemi et al., 26 Feb 2025).

Selected results:

Model               H (BBEH)    Micro A (BBEH)
Random              2.4 %       8.4 %
GPT4o               6.0 %       22.3 %
Gemini 2.0 Flash    9.8 %       23.9 %
o3-mini (high)      44.8 %      54.2 %

By contrast, both general-purpose and reasoning-specialized LLMs routinely exceed 90% on BBH. Task-level breakdowns highlight that reasoning-specialized models outperform general-purpose ones primarily in formal, algorithmic, and long-context tasks (e.g., Multi-step Arithmetic, Boardgame QA, Buggy Tables), with little or negative gains on “soft” reasoning (e.g., Sarcasm, Causal Judgment, Humour).

6. Failure Modes and Open Challenges

Current LLMs display several characteristic failure modes on BBEH:

  • Long-list counting errors: Under- or over-counting, especially in the presence of distractors or incomplete information.
  • Prior dependence: Over-reliance on standard knowledge, such as default adjective order or conventional word sorting, leading to errors on out-of-distribution tasks (e.g., Hyperbaton).
  • Insufficient search depth: Missing first errors in Dyck Language or failing to identify the correct bug in Buggy Tables.
  • Imprecise causal inference: Tendency to over-attribute causality to agents permitted by scenario norms.
  • Chain degeneracy: Truncation or hallucination in ultra-long reasoning chains, as in Shuffled Objects or Web of Lies.

These failure modes suggest that, as task length and compositional demands increase, LLMs still struggle to maintain consistent reasoning traces and accuracy, even with advanced prompting and architectural improvements.

7. Significance, State-of-the-Art, and Implications

BBEH reopens the frontier of general reasoning evaluation, providing a highly challenging, transparent, and automatically scored benchmark for future LLM research. Even the best reasoning-specialized models currently solve less than half of BBEH by harmonic mean, underscoring persistent gaps in robust, broad-spectrum reasoning (Kazemi et al., 26 Feb 2025).

The emergence of new results and architectural approaches, such as the Lychee AI–powered Reactor Mk.1, demonstrates how targeted innovations in attention routing, Mixture-of-Experts efficiency, and CoT-optimized training can yield advances on legacy benchmarks (88% BBH for Reactor Mk.1) (Dunham et al., 2024). However, such gains remain elusive on BBEH, reflecting its role as the new gold standard for discriminative, structure-sensitive evaluation of LLM reasoning.

A plausible implication is that further progress will require both architectural scale and qualitative breakthroughs in reasoning-specific model design, as current methods still fall short of robustly solving the extended, high-complexity tasks exemplified by BBEH. BBEH thus guides ongoing research toward breadth and depth of cognition, beyond mere pattern recognition or context matching, and provides an open platform for benchmarking advances in general intelligence.
