BBH: Challenging Benchmark for LLM Reasoning

Updated 1 January 2026
  • BBH is a curated evaluation benchmark that tests challenging algorithmic, logical, linguistic, and commonsense reasoning tasks in LLMs.
  • It comprises 23–27 tasks with automated, deterministic scoring designed to expose models' weaknesses in few-shot prompting scenarios.
  • BBH underpins research into chain-of-thought prompting, distillation effects, and scaling laws, offering actionable insights on emergent reasoning capabilities.

BIG-Bench Hard (BBH) is a curated evaluation benchmark for LLMs that targets the most challenging reasoning tasks within the broader BIG-Bench suite. It was designed as an automated, diverse, and rigorous test bed for general reasoning—encompassing algorithmic, logical, linguistic, commonsense, and world-knowledge inference—precisely at the frontier where few-shot prompting with contemporary LLMs previously failed to match average human-rater performance. BBH has been central in the evaluation of emergent capabilities, prompting paradigms such as chain-of-thought (CoT) reasoning, and recent advancements in reasoning-oriented LLM distillation and scaling studies.

1. Composition and Task Selection

BBH consists of 23–27 distinct tasks (varying slightly across studies depending on subtask grouping) identified from the BIG-Bench collection, selected through criteria that ensure exceptional difficulty for LLMs under direct few-shot prompting. Tasks were filtered by:

  • No prior LLM achieving average human-rater performance
  • Minimum size (≥ 100–103 test examples)
  • Simple, automatically verifiable formats (multiple choice or exact-match)
  • Absence of specialized knowledge barriers

The resulting task set spans a broad spectrum, including:

  • Multi-step algorithmic reasoning (Boolean expressions, Dyck languages, logical deduction, object/state tracking, temporal sequences, multi-step arithmetic, geometric shape inference)
  • Natural language understanding (coreference disambiguation, hyperbaton, sarcasm identification)
  • Commonsense and world knowledge (date understanding, causality, movie and sports understanding, wordplay, multilingual translation error detection)
  • Pattern induction and symbolic manipulation ("tracking shuffled objects," "web of lies," "penguins in a table")

Each task typically features concise (≈ 700 character) prompts with constrained output spaces (2–5 options), ensuring deterministic scoring and low annotation ambiguity (Suzgun et al., 2022, Schaeffer et al., 2023, Do et al., 7 Nov 2025).
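
Because answer spaces are small and targets are short strings or option labels, per-task scoring reduces to normalized exact match. The following sketch is a minimal illustration, not the official evaluation harness; the items are written in the style of BBH tasks and the normalization rules are assumptions.

```python
# Minimal sketch of deterministic exact-match scoring for BBH-style items.
# Items and normalization are illustrative, not the official harness.

def normalize(ans: str) -> str:
    """Lowercase and strip whitespace plus a trailing period before comparison."""
    return ans.strip().rstrip(".").lower()

def exact_match(prediction: str, target: str) -> bool:
    return normalize(prediction) == normalize(target)

# Illustrative items with constrained output spaces.
items = [
    {"input": "not ( True ) and ( True ) is", "target": "False"},              # boolean-expression style
    {"input": "Which option is sarcastic? (A) ... (B) ...", "target": "(B)"},  # multiple-choice style
]
predictions = ["False", "(A)"]

accuracy = sum(
    exact_match(p, item["target"]) for p, item in zip(predictions, items)
) / len(items)
print(f"task accuracy: {accuracy:.2%}")   # 50.00% in this toy example
```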

2. Motivation, Design Goals, and Evaluation Protocols

The primary objective of BBH is to probe general reasoning skill—both breadth and depth—across domains in a unified evaluation regime. Specific design goals include:

  • Ensuring tasks are difficult but not dependent on obscure knowledge
  • Providing coverage of formal, informal, and naturalistic reasoning
  • Enabling automated, consistent, large-scale evaluation for LLMs of varying capability

Evaluation is conducted using standard accuracy (percentage of exact-match correct answers per task), with BBH aggregate scores computed as micro-averaged accuracy or, in some work, harmonic mean across tasks to penalize uneven performance (Schaeffer et al., 2023, Do et al., 7 Nov 2025, Kazemi et al., 26 Feb 2025).
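
As a concrete illustration of the two aggregation schemes, the sketch below computes a micro-averaged accuracy (pooling all examples) and a harmonic mean over per-task accuracies; the per-task counts are invented for illustration.

```python
# Minimal sketch of BBH aggregate scoring: micro-averaged accuracy pools all
# examples, while the harmonic mean over per-task accuracies penalizes tasks
# on which a model collapses. Counts are illustrative, not real results.
from statistics import harmonic_mean

per_task = {                      # task -> (num_correct, num_examples)
    "boolean_expressions": (180, 250),
    "dyck_languages":      (40, 250),
    "date_understanding":  (200, 250),
}

total_correct = sum(c for c, _ in per_task.values())
total_examples = sum(n for _, n in per_task.values())
micro_avg = total_correct / total_examples

per_task_acc = [c / n for c, n in per_task.values()]
harm_mean = harmonic_mean(per_task_acc)   # << micro_avg when one task is near zero

print(f"micro-average: {micro_avg:.3f}, harmonic mean: {harm_mean:.3f}")
```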

In knowledge distillation settings, BBH is used to assess improvements arising from transferring reasoning, often by observing changes in exact-match score following different forms of supervision (e.g., vanilla distillation vs. distillation with chain-of-thought rationales) (Do et al., 7 Nov 2025).
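
A minimal sketch of how the two supervision formats might be constructed from a teacher's outputs follows; the prompt template, field names, and example rationale are assumptions for illustration and do not reproduce the setup of Do et al. (7 Nov 2025).

```python
# Minimal sketch: building student training targets for vanilla distillation
# (answer only) vs. CoT distillation (teacher rationale + answer).
# Field names and the example rationale are illustrative assumptions.

example = {
    "question": "Today is 3/5/2024. What is the date one week ago in MM/DD/YYYY?",
    "teacher_rationale": "One week before 3/5/2024 is 7 days earlier, i.e. 2/27/2024.",
    "answer": "02/27/2024",
}

def vanilla_target(ex: dict) -> dict:
    """Answer-only supervision: the student is trained to emit just the answer."""
    return {"prompt": f"Q: {ex['question']}\nA:", "target": f" {ex['answer']}"}

def cot_target(ex: dict) -> dict:
    """CoT supervision: the student is trained to emit rationale plus answer."""
    return {
        "prompt": f"Q: {ex['question']}\nA: Let's think step by step.",
        "target": f" {ex['teacher_rationale']} So the answer is {ex['answer']}.",
    }

print(vanilla_target(example))
print(cot_target(example))
```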

3. Quantitative Benchmarks and Scaling Properties

BBH is widely used both for absolute performance benchmarking and as a unit of analysis for scaling and capability prediction:

  • In answer-only few-shot regimes, average accuracy is well below that of human raters (random ≈ 25.7%, average human ≈ 67.7%, best prior LMs ≤ 56.6%) (Suzgun et al., 2022).
  • Chain-of-thought prompting confers substantial gains, enabling Codex (code-davinci-002) in the CoT regime to surpass the average human rater on 17/23 tasks and reach 73.9%, a +17.3 ppt improvement over answer-only prompting (Suzgun et al., 2022).
  • Scaling studies fit BBH performance versus training compute $C$ (FLOP) with a logistic curve in $\log_{10} C$: $P(C) = \frac{1}{1 + \exp[-k(\log_{10} C - C_0)]}$, with best-fit $k \approx 2.0$ and midpoint $C_0 \approx 23.2$. The MAE for an order-of-magnitude extrapolation is about 6 percentage points; per-task predictions are less stable (MAE ≈ 18 ppts), but aggregation smooths emergent effects (Owen, 2024). A curve-fitting sketch in this form appears after the table below.
  • Knowledge distillation studies show CoT distillation yields measurable gains (e.g., Qwen-1.8B: 17.77% → 24.44%, Llama2-7B: 39.44% → 41.50%) (Do et al., 7 Nov 2025).

Table: Illustration of model performance on BBH (selected results)

| Model / Method            | Accuracy (%) | Regime                 |
|---------------------------|--------------|------------------------|
| Random baseline           | ~25.7        | Answer only            |
| Average human rater       | 67.7         | Human rater            |
| PaLM 540B                 | 65.2         | CoT (10/23 > human)    |
| Codex (code-davinci-002)  | 73.9         | CoT (17/23 > human)    |
| Qwen-1.8B                 | 17.77        | Distillation baseline  |
| Qwen-1.8B (CoT KD)        | 24.44        | KD + CoT               |
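
The logistic form above can be fit with standard least squares. The sketch below uses scipy.optimize.curve_fit on synthetic placeholder points (not data from Owen, 2024) chosen only to make the script self-contained.

```python
# Minimal sketch: fitting BBH accuracy vs. training compute with the logistic
# form P(C) = 1 / (1 + exp(-k (log10 C - C0))). Data points are synthetic
# placeholders; real (FLOP, accuracy) pairs would be substituted here.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log10_c, k, c0):
    return 1.0 / (1.0 + np.exp(-k * (log10_c - c0)))

# Synthetic (log10 FLOP, BBH accuracy in [0, 1]) points for illustration only.
log10_c = np.array([21.0, 22.0, 23.0, 24.0, 25.0])
acc     = np.array([0.01, 0.09, 0.40, 0.82, 0.97])

(k_fit, c0_fit), _ = curve_fit(logistic, log10_c, acc, p0=[2.0, 23.0])
print(f"k ≈ {k_fit:.2f}, C0 ≈ {c0_fit:.2f}")

# Extrapolate one order of magnitude beyond the largest observed run.
print(f"predicted accuracy at 10^26 FLOP: {logistic(26.0, k_fit, c0_fit):.2f}")
```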

4. Impact of Chain-of-Thought Prompting and Reasoning Evaluation

BBH was a central benchmark in demonstrating that chain-of-thought (CoT) prompting triggers "emergent abilities" in large LLMs, wherein reasoning performance surges sharply above a critical model scale. Notably, CoT closes much of the human–model gap on multi-step reasoning, pattern manipulation, and structurally complex tasks (Suzgun et al., 2022, Schaeffer et al., 2023). However, subsequent analysis reveals that even logically invalid CoT rationales deliver nearly the same accuracy gains as valid ones (Δ ≈ −2%, p ≈ 0.025), suggesting that surface form and multi-step demonstration structure, rather than strict logical fidelity, drive the observed improvements (Schaeffer et al., 2023). This indicates that BBH gains may arise from learned associations with prompt structure rather than from symbolic inference per se.
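
A minimal sketch of the CoT regime in practice follows: a few-shot CoT prompt plus extraction of the final answer from the model's free-form output for exact-match scoring. The exemplar and the hypothetical model output are illustrative, not taken from the official BBH prompt files.

```python
# Minimal sketch: a CoT few-shot prompt and extraction of the final answer
# from a model's free-form CoT output for exact-match scoring.
import re

exemplar = (
    "Q: not ( True ) and ( True ) is\n"
    "A: Let's think step by step. not ( True ) is False. "
    "False and ( True ) is False. So the answer is False."
)
question = "True and not ( False ) is"
prompt = f"{exemplar}\n\nQ: {question}\nA: Let's think step by step."

# Hypothetical model output for the prompt above.
generation = " not ( False ) is True. True and True is True. So the answer is True."

match = re.search(r"the answer is\s*(.+?)\s*\.?\s*$", generation)
predicted = match.group(1) if match else generation.strip()
print(predicted)   # -> "True"
```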

BBH also serves as a granular monitor for distillation quality, revealing which categories of reasoning (algorithmic, state-tracking, world knowledge) are improved by CoT-augmented transfer and which remain bottlenecks or degrade (Do et al., 7 Nov 2025).

5. Variants, Evolution, and Limitations

Owing to rapid progress, state-of-the-art LLMs (e.g., Gemini 2.0 Flash) have surpassed 90% accuracy on multiple BBH tasks, leading to benchmark saturation. Noted limitations include:

  • High chance baselines due to small output spaces
  • Vulnerability to shortcuts (Python evaluation, pattern matching)
  • Short context windows and limited reasoning hops
  • Gaps in the range and depth of reasoning skills

To address these, BIG-Bench Extra Hard (BBEH) was proposed, replacing each BBH task with an adversarially calibrated, more difficult analog (richer distractors, longer contexts, more reasoning steps, partial information, error-detection, "needle-in-haystack" queries). On BBEH, top general-purpose models score as low as 9.8% (harmonic mean) and 23.9% (micro-average), with the best reasoning-specialized model reaching only 44.8% and 54.2% respectively (Kazemi et al., 26 Feb 2025). The category breakdown demonstrates that tasks requiring long-context state tracking, many-hop compounded reasoning, induction against priors, and error-detection exhibit the largest performance drops relative to BBH.

6. Methodological Role in Evaluation Science

BBH has been formally utilized not only as a target for absolute scoring but as a component in methodological studies:

  • As a "hard subset" baseline for small-bench construction and task selection optimization (Ye et al., 2023).
  • In predictive modeling of LLM capabilities, reflecting recoverability and informativeness tradeoffs: BBH, while canonical, is not the optimal minimal suite—more diverse or value-weighted small-benches can match its informativeness with fewer tasks (Ye et al., 2023).
  • In scaling law studies and policy-relevant capability forecasting, BBH aggregate scores offer a stable, low-variance indicator relative to single-task or un-aggregated suites (Owen, 2024).

7. Broader Implications and Ongoing Developments

The design and evolution of BBH have informed both the evaluation and development of LLMs, pushing for:

  • Multi-step, context-rich, and adversarial task inclusion in future benchmarks to stress test beyond shortcuts and shallow cues
  • Aggregation metrics (e.g., harmonic mean) that detect uneven or brittle model ability distribution
  • Analysis of the distinction between true symbolic reasoning and surface-level competence in LLM prompting successes (Schaeffer et al., 2023)

A plausible implication is that robust reasoning evaluation now demands adversarial hardening and continual evolution, as evidenced by the necessity of BBEH and similar next-generation testbeds. BBH remains a foundational reference point for measuring progress, regression, and transfer in LLM reasoning research.
