
Auto-Bench: Interactive Benchmark for LLM Discovery

Updated 14 April 2026
  • Auto-Bench is an interactive benchmark that assesses large language models’ ability to perform iterative, intervention-driven scientific discovery across natural and social sciences.
  • It employs a causal graph framework based on structural causal models to simulate experiment planning, justification, sequential interventions, and hypothesis refinement.
  • Empirical studies highlight performance gaps in scalability, temporal reasoning, and causal hypothesis refinement, pointing to key opportunities for future model improvements.

Auto-Bench is an automated, interactive benchmark specifically designed to evaluate the capacity of LLMs for scientific discovery, encompassing both natural sciences (e.g., chemistry) and social sciences (e.g., social network analysis). The benchmark departs from traditional, static reasoning tasks by embedding LLMs directly into an environment that simulates the iterative, intervention-driven hypothesis-testing process typical of human scientific inquiry. Auto-Bench employs a causal graph framework—rooted in the formalism of structural causal models (SCMs)—and challenges LLM agents to plan experiments, generate justifications, perform sequential interventions, and infer hidden structures, with evaluation scenarios covering directed acyclic chemical systems and undirected social interaction networks. Empirical studies across state-of-the-art LLMs reveal a sharp performance gap between simple and complex instances, identifying key limitations in scalability, temporal reasoning, and explicit causal hypothesis refinement (Chen et al., 21 Feb 2025).

1. Motivation and Objectives

Auto-Bench is motivated by the gap between LLMs’ success on standard text- or code-based benchmarks and the more demanding requirements of iterative, experiment-driven scientific reasoning. Unlike the tasks in static benchmarks, scientific discovery, a central activity of human research, requires agents to:

  • Formulate causal hypotheses.
  • Select and execute informative interventions (in silico or in vitro “experiments”).
  • Observe system responses and update internal models accordingly.
  • Iterate this loop until a satisfactory explanation is achieved.

The principal goals are to (i) measure an LLM’s ability to recover hidden causal structure, (ii) compel strategic intervention planning, (iii) demand explicit justification (through chain-of-thought), and (iv) expose scalability bottlenecks by quantifying performance as problem complexity increases (graph size, state space, and trajectory length). Importantly, Auto-Bench integrates both natural and social sciences into a common benchmarking protocol, unifying disparate paradigms under a single causal-graph approach (Chen et al., 21 Feb 2025).

2. Causal Graph Modeling Framework

The foundation of Auto-Bench lies in standard SCM formalism:

$$M = (U, V, F, P(U)),$$

where $U$ is the set of exogenous (latent) variables, $V$ the set of endogenous (observed) variables, $F = \{f_i\}$ a set of deterministic functions ($v_i \leftarrow f_i(\mathrm{pa}(v_i), u_i)$), and $P(U)$ a joint distribution over $U$.

Each scenario corresponds to a causal graph $G$ with binary adjacency matrix $H \in \{0,1\}^{N \times N}$, where $H_{i,j} = 1$ if and only if there is a directed (or undirected) edge from node $i$ to node $j$. Interventions are formalized via the $\mathrm{do}$-operator, which sets a chosen variable to a fixed value.

  • Chemistry tasks use a directed acyclic graph (DAG), where an intervention on molecule $i$ fixes its state and causes random resampling of its descendants’ states, leaving ancestors and independent variables unaffected.
  • Social network tasks employ a symmetric adjacency matrix (the graph is undirected); an intervention on node $i$ increments the state of node $i$ and of its immediate neighbors.
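The two intervention semantics described above can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's implementation: the function names, the uniform resampling of descendants, and the default increment `delta=1` are assumptions made for the example.

```python
import random


def descendants(H, i):
    """Return the set of nodes reachable from node i along directed edges of H."""
    seen, stack = set(), [i]
    while stack:
        u = stack.pop()
        for v, edge in enumerate(H[u]):
            if edge and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def intervene_chemistry(H, state, i, value, num_states, rng):
    """Fix node i to `value`; resample each descendant (uniformly, by assumption).

    Ancestors and unrelated nodes keep their previous state.
    """
    new_state = list(state)
    new_state[i] = value
    for d in descendants(H, i):
        new_state[d] = rng.randrange(num_states)
    return new_state


def intervene_social(H, state, i, delta=1):
    """Increase the state of node i and of its immediate neighbors by `delta` (assumed)."""
    new_state = list(state)
    new_state[i] += delta
    for j, edge in enumerate(H[i]):
        if edge and j != i:
            new_state[j] += delta
    return new_state


# Usage: a 3-node undirected path 0 - 1 - 2; intervening on the middle node touches all three.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(intervene_social(path, [0, 0, 0], 1))
```

In the chemistry case a resampled descendant may land on its previous value, so a single intervention can under-report reachability; this is one reason repeated interventions can be informative.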

3. Benchmark Task Definitions

Auto-Bench implements two task types:

Chemistry (Natural Science):

  • Variables: $N$ molecules, each with a discrete state.
  • Ground truth: adjacency matrix $H$, a DAG.
  • Observation: a state matrix whose $t$-th row is the observed state after the $t$-th intervention.
  • Intervention effect: Only descendants of intervention node are affected (change their state).

Social Network (Social Science):

  • Variables: $N$ people, each with a numeric state.
  • Ground truth: symmetric adjacency matrix $H$ (undirected graph).
  • Observation: a state matrix in which each row holds the post-intervention states.
  • Intervention effect: State increment on the chosen node and its neighbors.

Both settings entail discovery of $H$ via sequential interventions, with full visibility of the ongoing state matrix after each round.
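To make the social-network task concrete, the sketch below shows a naive discovery strategy under simplifying assumptions: the increment is deterministic and equal to 1, all states start at 0, and the agent intervenes once on every node. The names `social_step` and `discover_social` are hypothetical; an LLM agent in the benchmark would instead choose interventions and revise its hypothesis from the prompt history.

```python
def social_step(H, state, i, delta=1):
    """Oracle side: increment node i and its neighbors by `delta` (assumed value)."""
    new_state = list(state)
    new_state[i] += delta
    for j, edge in enumerate(H[i]):
        if edge and j != i:
            new_state[j] += delta
    return new_state


def discover_social(H_true, n):
    """Intervene once per node and read edges off the rows of the state matrix."""
    state = [0] * n
    history = [list(state)]                 # state matrix, one row per round
    H_hat = [[0] * n for _ in range(n)]     # current hypothesis
    for i in range(n):
        new_state = social_step(H_true, state, i)
        history.append(new_state)
        for j in range(n):
            # Any other node whose state changed must be a neighbor of i.
            if j != i and new_state[j] != state[j]:
                H_hat[i][j] = H_hat[j][i] = 1
        state = new_state
    return H_hat, history


# Usage: recover a 4-node path graph 0 - 1 - 2 - 3.
H = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
H_hat, history = discover_social(H, 4)
print(H_hat == H, len(history))
```

Under these assumptions the strategy needs exactly one intervention per node, which gives a simple baseline against which an agent's iteration count can be compared.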

4. Interactive Evaluation and Metrics

Each benchmark episode consists of up to a fixed maximum number of cycles:

  1. The LLM receives the task description, its current hypothesis about the graph, and the full intervention/observation history.
  2. The LLM outputs an updated hypothesis and the next intervention choice.
  3. An Oracle, which knows the ground-truth graph $H$, simulates the effect of the chosen intervention and provides a new observation.
  4. If the hypothesis (for chemistry, its reachability matrix) matches the ground truth (or is observationally equivalent to it), the episode ends with success; otherwise, the loop continues.
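The cycle above can be sketched as a small driver loop. This is a minimal sketch under stated assumptions: `agent` and `oracle` are hypothetical callables standing in for the LLM and the simulator, and the chemistry acceptance test is implemented as equality of transitive closures (Warshall's algorithm), which is one plausible reading of "its reachability matrix matches."

```python
def reachability(H):
    """Transitive closure of adjacency matrix H (Warshall's algorithm)."""
    n = len(H)
    R = [row[:] for row in H]
    for k in range(n):
        for i in range(n):
            if R[i][k]:
                for j in range(n):
                    if R[k][j]:
                        R[i][j] = 1
    return R


def hypothesis_accepted(H_hat, H_true, directed):
    """Directed (chemistry): compare reachability. Undirected (social): compare edges."""
    if directed:
        return reachability(H_hat) == reachability(H_true)
    return H_hat == H_true


def run_episode(agent, oracle, H_true, directed, max_cycles):
    """Run one episode; return (success, cycles_used)."""
    history, H_hat = [], None
    for cycle in range(1, max_cycles + 1):
        H_hat, node = agent(H_hat, history)      # steps 1-2: updated hypothesis + intervention
        history.append((node, oracle(node)))     # step 3: oracle returns a new observation
        if hypothesis_accepted(H_hat, H_true, directed):  # step 4: termination check
            return True, cycle
    return False, max_cycles
```

Comparing reachability rather than raw edges reflects that an intervention in the chemistry setting affects all descendants, so two DAGs with the same transitive closure respond identically to interventions.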

Performance is quantified by:

  • Success Rate: Fraction of episodes terminating in correct structure discovery within the cycle limit.
  • Average Iterations: Mean cycles required (conditioned on success).
  • Trajectory Prediction Accuracy: For a given observation sequence, the LLM predicts a binary “change” matrix marking which variables change state at each step, and the prediction is scored against the changes actually observed.
  • Chain-of-thought (CoT) prompting is evaluated for its impact on multi-step accuracy.
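The three metrics can be computed from episode logs roughly as follows. The scoring rule for the change matrix is an assumption here (plain element-wise agreement between predicted and observed changes); the benchmark's exact accuracy formula may differ. All function names are hypothetical.

```python
def success_rate(episodes):
    """episodes: list of (success: bool, cycles: int) pairs."""
    return sum(1 for ok, _ in episodes if ok) / len(episodes)


def average_iterations(episodes):
    """Mean number of cycles over successful episodes only (None if there are none)."""
    cycles = [c for ok, c in episodes if ok]
    return sum(cycles) / len(cycles) if cycles else None


def change_matrix(observations):
    """Binary matrix: entry [t][j] is 1 if variable j differs between rows t and t+1."""
    return [
        [int(a != b) for a, b in zip(prev, cur)]
        for prev, cur in zip(observations, observations[1:])
    ]


def change_accuracy(predicted, observations):
    """Fraction of entries where the predicted change matrix matches the observed one."""
    truth = change_matrix(observations)
    total = sum(len(row) for row in truth)
    hits = sum(p == t for pr, tr in zip(predicted, truth) for p, t in zip(pr, tr))
    return hits / total


# Usage: three episodes, one of which hit the cycle limit without success.
episodes = [(True, 3), (False, 10), (True, 5)]
print(success_rate(episodes), average_iterations(episodes))
```

Conditioning the average on success keeps failed episodes, which always consume the full cycle budget, from dominating the iteration count.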

5. Empirical Results and Model Limitations

Empirical evaluation includes GPT-4o, Claude-3.5-Haiku, Gemini-1.5-Pro, Llama-3.1-70b-Instr, and Qwen2.5-72b, each tested on multiple configurations:

| Task/Config | Success: N=3,5 | Success: N=10 | Trend w/ CoT |
|---|---|---|---|
| Chemistry (DAG) | 100% | ~15% | 91–100% OA-Acc @ M=3,5; falls @ M>10 |
| Social network | 100% | <30% or 0% | Steep drop in OA-Acc w/o CoT |

  • Simple graphs ($N = 3, 5$): Top models achieve 100% success.
  • Large graphs ($N = 10$): Sharp degradation; weaker models frequently fail, and average cycles increase.
  • Long-term trajectory: Only GPT-4o and Qwen2.5 maintain high OA-Acc (about 91–100%) in short trajectories; all models degrade for $M > 10$ due to "temporal attention decay."
  • Adding CoT substantially mitigates attention decay for state-of-the-art models but does not rescue weaker baselines.

Key bottlenecks identified:

  • Scalability: Number of interventions required, and failure rates, scale unfavorably with graph size and state space cardinality.
  • Long-horizon tracking: LLMs cannot consistently model long causal chains; beyond 10–15 steps, attention and accuracy decay.
  • Causal hypothesis fragility: Iterative refinement of causal structure is error-prone in complex graphs; convergence is unreliable in weaker models.
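One way to see why graph size is a bottleneck is to count candidate structures. The sketch below is an illustration rather than part of the benchmark: it counts labeled undirected graphs, $2^{N(N-1)/2}$, and labeled DAGs via Robinson's recurrence. Since chemistry hypotheses are judged on reachability, the number of distinguishable hypotheses is smaller than the raw DAG count, but the growth is still rapid.

```python
from functools import lru_cache
from math import comb


def count_undirected_graphs(n):
    """Number of labeled simple undirected graphs on n nodes."""
    return 2 ** (n * (n - 1) // 2)


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Usage: compare the hypothesis-space sizes at the graph sizes used in the results.
for n in (3, 5, 10):
    print(n, count_undirected_graphs(n), count_dags(n))
```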

6. Design Implications and Future Work

Auto-Bench exposes fundamental shortcomings in current LLMs' scientific reasoning, which point to several design directions:

  • Scalability to large causal spaces requires explicit memory architectures or latent graph structures resilient over extended interaction sequences.
  • Intervention planning may benefit from meta-learned or reinforcement-learning-based selection policies targeting minimal expected regret.
  • Architectural inductive biases might include explicit SCM modules or differentiable do-calculus operators.
  • Hybrid continuous/discrete benchmarks that extend beyond discrete state spaces and deterministic graphs will increase ecological validity, particularly for natural phenomena with stochastic or continuous-valued dynamics.

By systematically surfacing these failure points in a rigorously controlled setting, Auto-Bench establishes a new research frontier for evaluating LLMs as “AI scientists,” offering a testbed for causal reasoning, experiment planning, and iterative scientific discovery (Chen et al., 21 Feb 2025).
