Auto-Bench: Interactive Benchmark for LLM Discovery
- Auto-Bench is an interactive benchmark that assesses large language models’ ability to perform iterative, intervention-driven scientific discovery across natural and social sciences.
- It employs a causal graph framework based on structural causal models to simulate experiment planning, justification, sequential interventions, and hypothesis refinement.
- Empirical studies highlight performance gaps in scalability, temporal reasoning, and causal hypothesis refinement, pointing to key opportunities for future model improvements.
Auto-Bench is an automated, interactive benchmark specifically designed to evaluate the capacity of LLMs for scientific discovery, encompassing both natural sciences (e.g., chemistry) and social sciences (e.g., social network analysis). The benchmark departs from traditional, static reasoning tasks by embedding LLMs directly into an environment that simulates the iterative, intervention-driven hypothesis-testing process typical of human scientific inquiry. Auto-Bench employs a causal graph framework—rooted in the formalism of structural causal models (SCMs)—and challenges LLM agents to plan experiments, generate justifications, perform sequential interventions, and infer hidden structures, with evaluation scenarios covering directed acyclic chemical systems and undirected social interaction networks. Empirical studies across state-of-the-art LLMs reveal a sharp performance gap between simple and complex instances, identifying key limitations in scalability, temporal reasoning, and explicit causal hypothesis refinement (Chen et al., 21 Feb 2025).
1. Motivation and Objectives
Auto-Bench is motivated by the gap between LLMs’ success on standard text- or code-based benchmarks and the more demanding requirements of iterative, experiment-driven scientific reasoning. Unlike the tasks in static benchmarks, scientific discovery, a central activity of human research, requires agents to:
- Formulate causal hypotheses.
- Select and execute informative interventions (in silico or in vitro “experiments”).
- Observe system responses and update internal models accordingly.
- Iterate this loop until a satisfactory explanation is achieved.
The principal goals are to (i) measure an LLM’s ability to recover hidden causal structure, (ii) compel strategic intervention planning, (iii) demand explicit justification (through chain-of-thought), and (iv) expose scalability bottlenecks by quantifying performance as problem complexity increases (graph size, state space, and trajectory length). Importantly, Auto-Bench integrates both natural and social sciences into a common benchmarking protocol, unifying disparate paradigms under a single causal-graph approach (Chen et al., 21 Feb 2025).
2. Causal Graph Modeling Framework
The foundation of Auto-Bench lies in standard SCM formalism:
$$\mathcal{M} = \langle U, V, F, P(U) \rangle$$
where $U$ is the set of exogenous (latent) variables, $V$ the set of endogenous (observed) variables, $F$ a set of deterministic functions ($v_i = f_i(\mathrm{pa}_i, u_i)$), and $P(U)$ a joint distribution over $U$.
Each scenario corresponds to a causal graph $G$ with binary adjacency matrix $A \in \{0,1\}^{N \times N}$, where $A_{ij} = 1$ if and only if there is a directed (or undirected) edge from node $i$ to node $j$. Interventions are formalized via the $do$-operator: $do(X_i = x)$.
- Chemistry tasks use a directed acyclic graph (DAG), where $do(X_i = x)$ fixes molecule $i$ and causes random resampling of its descendants’ states, leaving ancestors and independent variables unaffected.
- Social network tasks employ a symmetric adjacency matrix (the graph is undirected); an intervention on node $i$ increases the state of node $i$ and its immediate neighbors by a fixed increment.
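The two intervention rules above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: the function names, the uniform resampling of descendants, and the default increment `delta=1` are all assumptions made for the example.

```python
import random


def descendants(adj, node):
    """Return every node reachable from `node` along directed edges (adj[i][j] == 1)."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for nxt, edge in enumerate(adj[cur]):
            if edge and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(node)  # no effect on a DAG; guards against cyclic input
    return seen


def do_chemistry(adj, state, node, value, num_states, rng):
    """Chemistry rule: fix `node` to `value`, then resample each descendant.

    Uniform resampling over `num_states` values is an assumption of this sketch.
    Ancestors and unrelated nodes keep their current state.
    """
    new_state = list(state)
    new_state[node] = value
    for d in descendants(adj, node):
        new_state[d] = rng.randrange(num_states)
    return new_state


def do_social(adj, state, node, delta=1):
    """Social rule: raise the state of `node` and its direct neighbors by `delta`.

    The increment size `delta` is an assumed parameter.
    """
    new_state = list(state)
    new_state[node] += delta
    for nbr, edge in enumerate(adj[node]):
        if edge and nbr != node:
            new_state[nbr] += delta
    return new_state
```

For a chain 0 → 1 → 2, calling `do_chemistry` on node 1 leaves node 0 untouched, fixes node 1, and resamples node 2. That asymmetry between ancestors and descendants is what makes the edge direction recoverable from interventions.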
3. Benchmark Task Definitions
Auto-Bench implements two task types:
Chemistry (Natural Science):
- Variables: $N$ molecules, each with a discrete state drawn from a finite set.
- Ground-truth: adjacency matrix $A$, a DAG.
- Observation: a matrix $O$, where row $t$ is the observed state after the $t$-th intervention.
- Intervention effect: Only descendants of intervention node are affected (change their state).
Social Network (Social Science):
- Variables: $N$ people, each with a discrete-valued state.
- Ground-truth: adjacency matrix $A$, symmetric (undirected graph).
- Observation: a matrix $O$, where each row holds the post-intervention states.
- Intervention effect: State increment on the chosen node and its neighbors.
Both settings entail discovery of the hidden adjacency matrix $A$ via sequential interventions, with full visibility of the ongoing state matrix after each round.
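To make the discovery problem concrete, the sketch below recovers an undirected social graph by intervening once on every node and reading off which other states moved. It is a hypothetical scripted baseline, not a strategy described in the source, and it assumes a nonzero increment (`delta=1` here) and states that never saturate.

```python
def make_social_oracle(hidden_adj, delta=1):
    """Build an oracle that hides `hidden_adj`; `delta` is an assumed increment."""
    def oracle(state, node):
        new_state = list(state)
        new_state[node] += delta
        for nbr, edge in enumerate(hidden_adj[node]):
            if edge and nbr != node:
                new_state[nbr] += delta
        return new_state
    return oracle


def recover_social_graph(oracle, n):
    """Intervene on each node once; any other node whose state changed is a neighbor."""
    state = [0] * n
    estimate = [[0] * n for _ in range(n)]
    observations = []  # one row of post-intervention states per round
    for node in range(n):
        new_state = oracle(state, node)
        observations.append(new_state)
        for other in range(n):
            if other != node and new_state[other] != state[other]:
                estimate[node][other] = estimate[other][node] = 1
        state = new_state
    return estimate, observations


hidden = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
estimate, observations = recover_social_graph(make_social_oracle(hidden), 4)
```

Under these assumptions one intervention per node is enough, because each intervention exposes one full row of the adjacency matrix. The chemistry setting is harder: resampled descendants may land on their previous value, and an intervention reveals reachability rather than direct edges.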
4. Interactive Evaluation and Metrics
Each benchmark episode consists of up to a maximum of $T$ cycles:
- The LLM receives the task description, the current hypothesis $\hat{A}$, and the full intervention/observation history.
- The LLM outputs an updated $\hat{A}$ and the next intervention choice.
- An Oracle, which knows the ground truth $A$, simulates the effect of the chosen intervention and provides a new observation.
- If $\hat{A}$ (for chemistry, its reachability matrix $R(\hat{A})$) matches $A$ (or $R(A)$ observationally), the episode ends with success; otherwise, the loop continues.
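The cycle described above can be sketched as a small loop for the chemistry setting. Everything here is illustrative: `run_episode`, the agent signature, and `round_robin_agent` (a scripted stand-in for the LLM) are hypothetical names, the oracle resamples descendants uniformly by assumption, and the success check compares reachability matrices computed with Warshall's transitive-closure algorithm.

```python
import random


def reachability(adj):
    """Transitive closure of a directed graph (Warshall's algorithm)."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = 1
    return reach


def run_episode(true_adj, agent, max_cycles, num_states=2, seed=0):
    """Run one episode; return (success, cycles_used)."""
    rng = random.Random(seed)
    n = len(true_adj)
    true_reach = reachability(true_adj)
    state = [0] * n
    hypothesis = [[0] * n for _ in range(n)]
    history = []  # entries: (intervened node, state before, state after)
    for cycle in range(1, max_cycles + 1):
        # Agent step: updated hypothesis plus the next intervention.
        hypothesis, node, value = agent(hypothesis, history, n)
        # Oracle step: fix the node, resample its descendants (assumed uniform).
        new_state = list(state)
        new_state[node] = value
        for j in range(n):
            if true_reach[node][j]:
                new_state[j] = rng.randrange(num_states)
        history.append((node, state, new_state))
        state = new_state
        # Termination check on reachability, as in the chemistry task.
        if reachability(hypothesis) == true_reach:
            return True, cycle
    return False, max_cycles


def round_robin_agent(hypothesis, history, n):
    """Scripted stand-in for the LLM: mark i -> j whenever j changed after do(i)."""
    new_hyp = [row[:] for row in hypothesis]
    for node, before, after in history:
        for j in range(n):
            if j != node and before[j] != after[j]:
                new_hyp[node][j] = 1
    return new_hyp, len(history) % n, 1
```

A call such as `run_episode([[0, 1, 0], [0, 0, 1], [0, 0, 0]], round_robin_agent, max_cycles=50)` returns a success flag and the number of cycles used. Comparing closures rather than raw adjacency matrices reflects that interventions of this kind reveal which nodes are downstream, not which edges are direct.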
Performance is quantified by:
- Success Rate: Fraction of episodes terminating in correct structure discovery within the maximum number of cycles $T$.
- Average Iterations: Mean cycles required (conditioned on success).
- Trajectory Prediction Accuracy (reported as OA-Acc in the results):
- For a given observation sequence, the LLM predicts a binary “change” matrix, which is scored against the changes that actually occurred.
- Chain-of-thought (CoT) prompting is evaluated for its impact on multi-step accuracy.
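A rough sketch of how these metrics could be computed is given below. The function names are hypothetical, and the entrywise scoring of the change matrix is an assumption: the exact accuracy formula is not reproduced in this text, so this shows only one plausible reading.

```python
def success_rate(episodes):
    """episodes: list of (success, cycles) pairs; fraction that succeeded."""
    return sum(1 for ok, _ in episodes if ok) / len(episodes)


def average_iterations(episodes):
    """Mean cycle count over successful episodes only; None if there are none."""
    cycles = [c for ok, c in episodes if ok]
    return sum(cycles) / len(cycles) if cycles else None


def change_matrix(observations):
    """Entry [t][i] is 1 if variable i differs between rows t and t+1."""
    return [
        [int(a != b) for a, b in zip(prev, cur)]
        for prev, cur in zip(observations, observations[1:])
    ]


def change_accuracy(predicted, actual):
    """Assumed scoring rule: fraction of matrix entries predicted correctly."""
    total = sum(len(row) for row in actual)
    hits = sum(
        int(p == a)
        for pred_row, act_row in zip(predicted, actual)
        for p, a in zip(pred_row, act_row)
    )
    return hits / total


episodes = [(True, 3), (False, 10), (True, 5)]
actual = change_matrix([[0, 0], [1, 0], [1, 2]])
score = change_accuracy([[1, 0], [1, 1]], actual)
```

Conditioning the average on success, as the metric definition does, keeps failed episodes (which always run to the cycle limit) from dominating the iteration count.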
5. Empirical Results and Model Limitations
Empirical evaluation includes GPT-4o, Claude-3.5-Haiku, Gemini-1.5-Pro, Llama-3.1-70b-Instr, and Qwen2.5-72b, each tested on multiple configurations:
| Task / configuration | Success rate (N = 3, 5) | Success rate (N = 10) | OA-Acc trend (with and without CoT) |
|---|---|---|---|
| Chemistry (DAG) | 100% | ~15% | 91–100% OA-Acc @M=3,5; falls @M>10 |
| Social network | 100% | <30% or 0% | Steep drop in OA-Acc w/o CoT |
- Simple graphs ($N = 3, 5$): Top models achieve 100% success.
- Large graphs ($N = 10$): Sharp degradation; weaker models frequently fail, and average cycles increase.
- Long-term trajectory: Only GPT-4o and Qwen2.5 maintain high OA-Acc (91–100%) in short trajectories; all models degrade on longer trajectories ($M > 10$) due to "temporal attention decay."
- Adding CoT substantially mitigates attention decay for state-of-the-art models but does not rescue weaker baselines.
Key bottlenecks identified:
- Scalability: Number of interventions required, and failure rates, scale unfavorably with graph size and state space cardinality.
- Long-horizon tracking: LLMs cannot consistently model long causal chains; beyond 10–15 steps, attention and accuracy decay.
- Causal hypothesis fragility: Iterative refinement of causal structure is error-prone in complex graphs; convergence is unreliable in weaker models.
6. Design Implications and Future Work
Auto-Bench exposes fundamental shortcomings in current LLMs' scientific reasoning:
- Scalability to large causal spaces requires explicit memory architectures or latent graph structures resilient over extended interaction sequences.
- Intervention planning may benefit from meta-learned or reinforcement-learning-based selection policies targeting minimal expected regret.
- Architectural inductive biases might include explicit SCM modules or differentiable do-calculus operators.
- Continuous/discrete hybrid benchmarks—extending beyond discrete state spaces and deterministic graphs—will increase ecological validity, particularly for natural phenomena exhibiting stochastic/probabilistic or continuous-valued dynamics.
By systematically surfacing these failure points in a rigorously controlled setting, Auto-Bench establishes a new research frontier for LLM evaluation as “AI scientists,” offering a testbed for causal reasoning, experiment planning, and iterative scientific discovery (Chen et al., 21 Feb 2025).