
DDR-Bench: LLM Investigatory Intelligence

Updated 3 February 2026
  • DDR-Bench is a comprehensive evaluation platform that measures investigatory intelligence by having LLMs autonomously explore raw data with generic tools such as SQL and Python.
  • Unlike traditional fixed-task benchmarks, it requires models to formulate their own objectives and synthesize insights across extensive real-world datasets such as MIMIC-IV, GLOBEM, and SEC 10-K filings.
  • The platform employs clear, quantitative metrics and unrestricted interaction cycles to assess long-horizon exploration, planning capabilities, and fact-level verification.

DDR-Bench is a large-scale, checklist-based evaluation platform designed to benchmark investigatory intelligence in agentic LLMs through the open-ended Deep Data Research (DDR) task. By supplying an LLM with only a raw structured database and a generic toolset (SQL, Python), with no explicit query, DDR-Bench probes the model's capacity to autonomously formulate objectives, explore, and synthesize insights—fundamentally contrasting with traditional executional intelligence benchmarks centered on fixed questions or tasks. This framework offers objective, verifiable, fact-level assessment, enabling rigorous study of autonomous agency and exploratory strategies in state-of-the-art LLMs (Liu et al., 2 Feb 2026).

1. Design Principles and Benchmark Differentiation

DDR-Bench is anchored in the DDR task formalism, which exposes the following interaction protocol:

  • The LLM receives a database D and toolset T (SQL, Python), with zero initial guidance.
  • Across an unrestricted number of tool-mediated "reason → action → observation" cycles, the agent autonomously selects queries, hypothesizes, tests, reasons about findings, and decides when to terminate (FINISH) or continue exploring.
  • After each cycle, message-wise insights (I_m) are required; upon episode completion, a trajectory-wise insight (I_t) summarizes the agent's findings (see the loop sketch after this list).
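
A minimal sketch of this interaction loop is shown below. The helper callables `llm_step`, `run_sql`, and `run_python`, and the dictionary shape returned by the model, are illustrative assumptions rather than part of the benchmark's actual harness.

```python
def ddr_episode(llm_step, run_sql, run_python):
    """Run reason -> action -> observation cycles until the agent emits FINISH.

    llm_step(history) is assumed to return a dict with:
      - "insight": the message-wise insight I_m for this turn
      - "action":  {"tool": "sql" | "python" | "finish", "arg": str}
    """
    history, message_insights = [], []
    while True:  # no cap on interaction rounds, per the DDR protocol
        step = llm_step(history)                  # reason: the model decides the next move
        message_insights.append(step["insight"])  # message-wise insight I_m each cycle
        action = step["action"]
        if action["tool"] == "finish":
            # On FINISH, the final message carries the trajectory-wise insight I_t.
            return message_insights, step.get("trajectory_insight", step["insight"])
        tool = run_sql if action["tool"] == "sql" else run_python
        observation = tool(action["arg"])         # action + observation via SQL or Python
        history.append({"action": action, "observation": observation})
```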

DDR-Bench isolates "investigatory intelligence," defined as the ability to discover problems and synthesize nontrivial insights from raw data, as distinct from executional intelligence, or the ability to execute a fixed instruction. Unlike prior table QA or text-to-SQL datasets (e.g., Spider, FeTaQA), DDR-Bench:

  • Provides no predefined questions or goals.
  • Does not rely on subjective scoring rubrics or LLM-based holistic grading.
  • Employs a ground-truth checklist of verifiable facts for each scenario.
  • Imposes no limit on interaction rounds, directly targeting long-horizon exploration.

The benchmark thereby enables open-ended, fact-level evaluation of model-driven agency (Liu et al., 2 Feb 2026).

2. Dataset Composition and Task Construction

DDR-Bench comprises carefully curated real-world databases, covering diverse modalities and analytical challenges:

A. MIMIC-IV (Electronic Health Records)

  • Scale: 200M+ rows, 27 tables, 6,372 fields (including clinical notes).
  • Task: Each of 100 patients is paired with ~58 checklist facts requiring extraction via both simple lookups (e.g., “What is the patient’s sex?”) and multi-table temporal analysis (e.g., “Compute CHA₂DS₂-VASc score”).
  • Fact format: Free-form answers (e.g., “Adenocarcinoma involving ovarian tissue.”).

B. GLOBEM (Wearables & Mental Health Surveys)

  • Scale: 55K+ records, 4 CSV modalities, 2,058 fields.
  • Task: For each of 91 users, identify ~5 outcomes relating activity/behavioral trends to survey changes (e.g., “How did the user’s depressive symptoms change?”: {Worsened/Remained the same/Improved}).

C. 10-K (SEC XBRL Financials)

  • Scale: 3M+ XBRL facts, 2,058 fields, 100 companies.
  • Task: Synthesize multi-year financial and risk trends, reconstructing complex relations from both structured tables and unstructured sections (e.g., “What trend is described regarding regulatory scrutiny?”).

For every “task entity,” a scenario-specific checklist operationalizes ground truth, enabling automated, scalable verification of agent outputs.
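
To make this concrete, the sketch below shows one way a checklist item and its automated verification could be represented. The `ChecklistItem` fields, the prompt wording, and the pluggable `checker` callable are assumptions for illustration, not the benchmark's actual schema or checker.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    """One verifiable ground-truth fact for a task entity (hypothetical schema)."""
    entity_id: str  # e.g., a patient, user, or company identifier
    question: str   # the fact being checked, phrased as a question
    answer: str     # free-form or categorical ground-truth answer

def is_supported(item: ChecklistItem, report: str, checker) -> bool:
    """Ask an LLM-based checker whether the agent's report supports the fact.

    `checker` is any callable mapping a prompt string to a "yes"/"no" reply;
    the prompt wording here is an illustrative assumption.
    """
    prompt = (
        "Ground-truth fact:\n"
        f"Q: {item.question}\nA: {item.answer}\n\n"
        f"Agent report:\n{report}\n\n"
        "Does the report state or entail this fact? Answer yes or no."
    )
    return checker(prompt).strip().lower().startswith("yes")

# Toy usage with a stand-in checker (real evaluation would call an LLM here).
item = ChecklistItem("patient_001", "What is the patient's sex?", "Female")
report = "The patient is female and was admitted for adenocarcinoma involving ovarian tissue."
print(is_supported(item, report, checker=lambda prompt: "yes"))
```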

3. Evaluation Metrics and Analytical Measures

The benchmark's metrics are explicitly quantitative and automatically computable:

  • Checklist Accuracy

    • Item-Averaged:

      \text{Acc}_{\rm item} = \frac{1}{M}\sum_{j=1}^M \mathbb{I}[\text{item}_j\text{ supported}]

    • Sample-Averaged:

      \text{Acc}_{\rm samp} = \frac{1}{N}\sum_{i=1}^N \frac{1}{M_i}\sum_{j=1}^{M_i}\mathbb{I}[\text{item}_{ij}\text{ supported}]

  • Information Retrieval Metrics (if explicit): Precision, Recall, F_1.
  • Composite Scores:

\mathrm{DDR}_{\rm score} = \alpha\cdot \mathrm{Precision} + \beta\cdot \mathrm{Recall} - \gamma\cdot \mathrm{Overloads}

where the “Overloads” term penalizes aborted or invalid tool calls.

  • Exploration Metrics

    • Horizon Efficiency:

      E_{\rm horizon} = \frac{1}{N} \sum_{i=1}^N f_i(t_i)

      for accuracy (or another metric) f_i evaluated at turn index t_i.

    • Field Coverage: proportion of database fields accessed.

    • Normalized Exploration Entropy:

      H_{\rm norm} = -\frac{1}{\log_2 n}\sum_{i=1}^n p_i\log_2 p_i

      with p_i the fraction of accesses attributed to field i (a computation sketch for these metrics follows this list).
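
The sketch below computes the checklist accuracies, field coverage, and normalized exploration entropy on toy inputs, plus the composite score with placeholder weights; the function names, input formats, and weight values are assumptions, not the benchmark's reference implementation.

```python
import math

def checklist_accuracies(supported):
    """supported[i][j] is True if item j of sample i was judged supported."""
    flat = [hit for row in supported for hit in row]
    acc_item = sum(flat) / len(flat)                                           # item-averaged
    acc_samp = sum(sum(row) / len(row) for row in supported) / len(supported)  # sample-averaged
    return acc_item, acc_samp

def field_coverage(accessed_fields, total_fields):
    """Proportion of distinct database fields the agent touched."""
    return len(set(accessed_fields)) / total_fields

def normalized_entropy(access_counts):
    """Normalized exploration entropy over per-field access counts (assumes n > 1)."""
    total = sum(access_counts.values())
    n = len(access_counts)
    return -sum((c / total) * math.log2(c / total) for c in access_counts.values() if c) / math.log2(n)

def ddr_score(precision, recall, overloads, alpha=1.0, beta=1.0, gamma=0.1):
    """Composite score; the weights here are placeholders, not benchmark settings."""
    return alpha * precision + beta * recall - gamma * overloads

# Toy example (illustrative numbers only):
supported = [[True, False, True], [True, True, False, False]]
print(checklist_accuracies(supported))                    # (0.571..., 0.583...)
print(field_coverage(["age", "sex", "age"], 10))          # 0.2
print(normalized_entropy({"age": 5, "sex": 3, "dx": 2}))  # ≈ 0.94
print(ddr_score(precision=0.6, recall=0.5, overloads=2))  # 0.9
```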

Taken together, this turn-based analysis affords both micro-level (per-insight) and macro-level (cumulative-strategy) measures of agent performance.

4. Experimental Setup and Agent Scaffolding

The evaluative framework standardizes agent configuration and tool access:

  • Agent Architecture: Minimal ReAct-style loop, orchestrated via a single LLM API. Tools are exposed through a protocol supporting one SQL or Python action per turn.
  • Prompting Discipline: System prompt enforces “Thought: → Action: → Observation:” interleaving, autonomous stopping via FINISH, and prohibits external memory/planning modules.
  • Model Coverage: Proprietary (e.g., Claude 4.5 Sonnet, GPT-5.2, Gemini 3) and open-source LLMs (DeepSeek v3.2, GLM 4.6, Qwen 2.5/3, Llama 3 70B) at scales 4B–230B (including dense and sparse MoE variants).
  • Variants Studied: Agentic modules (reasoning token budget, long/short-term memory, reactive vs. proactive prompting) are ablated to diagnose what architectural traits support effective exploration.

Context windows extend to 128K tokens for long-context variants, addressing the challenge of knowledge retention across extended search trajectories.
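
As an illustration of the prompting discipline described above, the sketch below parses a single model turn into its thought and action. The regular expressions and the `sql(...)` / `python(...)` / `FINISH` surface forms are assumptions about the prompt format, not the benchmark's exact template.

```python
import re

# Hypothetical surface form of one agent turn under the enforced
# "Thought: → Action: → Observation:" discipline (one tool call per turn).
TURN_PATTERN = re.compile(r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.*)", re.DOTALL)

def parse_turn(text):
    """Split a model message into (thought, tool, argument), or signal FINISH."""
    match = TURN_PATTERN.search(text)
    if match is None:
        raise ValueError("Turn does not follow the Thought:/Action: format")
    thought, action = match.group("thought"), match.group("action").strip()
    if action.upper().startswith("FINISH"):
        return thought, "finish", None
    tool_match = re.match(r"(sql|python)\((?P<arg>.*)\)\s*$", action, re.DOTALL | re.IGNORECASE)
    if tool_match is None:
        raise ValueError(f"Unrecognized action: {action!r}")
    return thought, tool_match.group(1).lower(), tool_match.group("arg")

# Example turn in the assumed format:
turn = "Thought: Inspect the admissions table first.\nAction: sql(SELECT * FROM admissions LIMIT 5)"
print(parse_turn(turn))  # ('Inspect the admissions table first.', 'sql', 'SELECT * FROM admissions LIMIT 5')
```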

5. Key Empirical Findings

DDR-Bench yields several quantitative and qualitative results regarding model capability and behavior:

  • Accuracy Trends: State-of-the-art proprietary models (e.g., Claude 4.5 Sonnet) attain ~40% sample-averaged accuracy on MIMIC/GLOBEM and ~78% on 10-K, while most models remain below 35%. Open models approach but rarely cross the 40% threshold.
  • Scaling Dynamics: Performance curves exhibit sigmoidal “interaction-scaling”—later commitment and sharply rising final accuracy in stronger models—implying implicit long-term planning.
  • Exploration Patterns: High-performing models exhibit moderate entropy and broad database coverage; low performers demonstrate either narrow focus or high-entropy scattershot behaviors.
  • Self-Termination: Newer models more reliably learn to terminate (FINISH) as exploration proceeds, while older variants display erratic or premature stopping.
  • Ablations: Raising per-turn reasoning budget reduces steps but not necessarily accuracy, indicating diminishing returns. Naive memory summarization disrupts emergent strategies. Explicit questions boost accuracy 10–20 points, confirming open-ended autonomy as a harder regime.
  • Error Decomposition: Manual review indicates 58% of failures stem from inadequate exploration depth/breadth, 27% from flawed data-to-insight translation, and 15% from trajectory problems (debugging, summarization, prompt compliance).
  • Hallucination and Checker Trustworthiness: Factual hallucination rates remain below 5% and do not track accuracy. LLM-as-checker evaluations are stable, with repeated runs yielding a coefficient of variation <5% and human–LLM agreement of F_1 ≈ 90%.

6. Recommendations and Prospects for Development

Empirical findings elicit the following recommendations for advancing investigatory intelligence:

  • Agentic-First Training: Parameter/context scaling is insufficient; models require targeted pre-training and reinforcement learning on open-ended exploration protocols.
  • Implicit Planning and Stopping: Auxiliary objectives should reward effective exploration coverage and well-controlled termination.
  • Memory Engineering: Robust memory/summarization modules are necessary to preserve salient context without introducing narrow or myopic searches.
  • Task Expansion: Broadening beyond current domains (e.g., to geospatial or log-stream analytics) and into multi-agent collaborative setups is suggested.
  • Metric Development: Innovations in end-to-end cost-adjusted scores and finer-grained information gain tracking will further illuminate agentic progress.
  • Human Collaboration: Incorporating human-in-the-loop checkpoints may yield synergies, especially in risk-critical arenas such as medicine or finance (Liu et al., 2 Feb 2026).

In sum, DDR-Bench empirically demonstrates that the emergence of investigatory intelligence in LLMs is governed less by model scale or scaffold sophistication than by the intrinsic ability to plan, explore, and recognize when an inquiry is complete. The benchmark provides a neutral, extensible foundation for the analytic study of autonomous data science agents, setting a rigorous standard for future research in open-ended LLM evaluation.

References

  • Liu et al., 2 Feb 2026.
