FINDER: Fine-Grained Benchmark for Deep Research Agents
- FINDER is a fine-grained benchmark for DRAs that standardizes multi-source research tasks with 100 curated prompts and detailed evaluation rubrics.
- It employs a workflow-guided synthetic data pipeline with multi-hop queries to simulate real-world artifact generation, and it diagnoses failures via the DEFT taxonomy.
- Benchmark metrics such as RACE and FACT, along with baseline evaluations, highlight retrieval challenges and avenues for enhancing agentic research performance.
Fine-grained DEepResearch bench (FINDER) is an expert-driven benchmark suite established to rigorously evaluate the end-to-end research capabilities of deep research agents (DRAs) operating on complex, multi-source information tasks (Zhang et al., 1 Dec 2025, Choubey et al., 29 Jun 2025, Zhu et al., 15 Oct 2025, Chandrahasan et al., 7 Jul 2025). FINDER combines highly structured task design, granular evaluation rubrics, diagnostic failure taxonomy, and robust annotation protocols to support reproducible, interpretable assessment of DRAs in enterprise, academic, financial, and open-domain settings.
1. Definition and Core Principles
FINDER is defined as a fine-grained benchmark for DRAs, extending DeepResearch Bench with 100 human-curated research tasks (50 English, 50 Chinese), 419 checklist items for granular report evaluation, and a complementary taxonomy (DEFT) for failure analysis (Zhang et al., 1 Dec 2025). Its core objectives are to standardize output structure, enforce analytical depth, and quantify factual grounding via explicit criteria. Unlike "question-answering" paradigms, FINDER evaluates agentic workflows that autonomously retrieve, synthesize, and report across diverse, sparsely linked sources.
Key design principles:
- Explicit Guidance: Each task prompt specifies required sections, format, and citation style.
- Checklist-Driven Scoring: Reports are evaluated against granular, per-task criteria covering structural, analytical, and factual dimensions.
- Diagnostic Focus: Complementary error taxonomy facilitates process-level failure analysis.
- Reproducibility: All components (tasks, rubrics, metrics) are public and formally defined.
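To make these principles concrete, the following minimal sketch encodes a task and its checklist as plain Python dataclasses. The field names and example content are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChecklistItem:
    """One granular criterion; 'dimension' is structural, analytical, or factual."""
    item_id: str
    dimension: str
    description: str

@dataclass
class ResearchTask:
    """A FINDER-style task with explicit output guidance and per-task checklist."""
    task_id: str
    language: str                      # "en" or "zh"
    prompt: str                        # research question posed to the agent
    required_sections: List[str]       # explicit structure the report must follow
    citation_style: str                # e.g. "numbered-inline" (hypothetical label)
    checklist: List[ChecklistItem] = field(default_factory=list)

# Illustrative instantiation (content invented for the example)
task = ResearchTask(
    task_id="finder-en-001",
    language="en",
    prompt="Survey recent approaches to retrieval-augmented report generation.",
    required_sections=["Background", "Methods", "Comparative Analysis", "References"],
    citation_style="numbered-inline",
    checklist=[
        ChecklistItem("c1", "structural", "Report contains all required sections in order."),
        ChecklistItem("c2", "factual", "Every quantitative claim cites a retrievable source."),
    ],
)
```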
2. Synthetic Data Pipeline and Artifact Creation
FINDER incorporates a workflow-guided, query-first synthetic data pipeline (notably for enterprise scenarios under HERB, a FINDER alias) (Choubey et al., 29 Jun 2025). The pipeline comprises four stages:
- Schema Initialization: Organizational entities, products, employees, and customers are instantiated.
- Query Template Instantiation: 41 templates generate multi-hop questions (content, people, artifact, customer types).
- Workflow Execution: Nine manually crafted workflows (three per business phase) simulate authentic process artifacts under realistic noise (α ≈ 0.15).
- Answer & Unanswerable Generation: Each answerable query is mapped to explicit ground-truth evidence; unanswerable cases are created by ensuring critical evidence is missing.
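The last two stages can be sketched as follows, assuming a simple list-of-dicts artifact store; the template text, field names, and the `drop_evidence` switch are hypothetical illustrations of how answerable and unanswerable cases could be derived from the same workflow.

```python
import random

TEMPLATE = "Which {role} authored the {artifact_type} that introduced {feature}?"

def instantiate_query(entities, artifacts, drop_evidence=False):
    """Instantiate one multi-hop query; omit the supporting artifact to create
    an unanswerable variant (mirroring FINDER's evidence-removal protocol)."""
    feature = random.choice(entities["features"])
    query = TEMPLATE.format(role="engineer", artifact_type="design document", feature=feature)

    # Ground truth: the artifact(s) that mention the feature, and their author.
    evidence = [a for a in artifacts if feature in a["text"]]
    if drop_evidence:
        # Remove the critical artifacts so no answer can be grounded in the corpus.
        corpus = [a for a in artifacts if a not in evidence]
        return {"query": query, "answer": None, "evidence": [], "corpus": corpus}

    answer = evidence[0]["author"] if evidence else None
    return {"query": query, "answer": answer, "evidence": evidence, "corpus": artifacts}
```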
Artifact statistics (enterprise context):
- Slack messages: 33,632 (27 tokens avg.)
- Meeting transcripts: 321 (1,200 tokens avg.)
- Documents: 400 (800 tokens avg.)
- GitHub pull requests: 3,562
- URLs: 575
- Customer profiles: 120
In total, 39,190 artifacts are indexed via hybrid embedding (vector/BM25) and structured indices.
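The hybrid index can be approximated with a standard score-fusion scheme. The sketch below combines BM25 and dense-vector rankings via reciprocal rank fusion, a common choice; the benchmark's actual fusion rule and parameters are not specified here, so treat this as an assumption.

```python
from collections import defaultdict

def reciprocal_rank_fusion(bm25_ranking, vector_ranking, k=60):
    """Fuse two ranked lists of artifact IDs (best first) into one hybrid ranking.
    k=60 is the conventional RRF constant, not a published benchmark parameter."""
    scores = defaultdict(float)
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: hybrid_top20 = reciprocal_rank_fusion(bm25_top100, dense_top100)[:20]
```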
3. Evaluation Rubrics, Metrics, and Taxonomies
FINDER implements granular scoring frameworks tailored to domain. In finance, HisRubric employs a hierarchical analytical structure with a four-level competence hierarchy (Recognition, Calculation, Abstraction, Interpretation), mapped to sector-specific sections and graded via 15,808 items (Zhu et al., 15 Oct 2025).
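One way to aggregate a hierarchical rubric like HisRubric is a weighted average over competence levels. The level weights below are hypothetical and only illustrate how graded items could roll up into a section score.

```python
# Hypothetical level weights; HisRubric's actual weighting is not reproduced here.
LEVEL_WEIGHTS = {"recognition": 1.0, "calculation": 1.5, "abstraction": 2.0, "interpretation": 2.5}

def section_score(graded_items):
    """graded_items: list of (level, score_in_[0,1]) pairs for one report section."""
    num = sum(LEVEL_WEIGHTS[level] * s for level, s in graded_items)
    den = sum(LEVEL_WEIGHTS[level] for level, _ in graded_items)
    return num / den if den else 0.0
```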
In research-report synthesis, FINDER's rubric includes:
- Structural Compliance: Section/order, format conventions, citation style adherence.
- Analytical Depth: Conceptual definitions, methodological rigor, cross-domain analysis.
- Factual Grounding: Citation accuracy, evidence verification, data source authority.
Principal scoring metrics (Zhang et al., 1 Dec 2025, Zhu et al., 15 Oct 2025):
- RACE: Reference-based Adaptive Criteria-driven Evaluation, scoring outputs across comprehensiveness, depth, instruction-following, and readability relative to gold-standard references.
- FACT: Factual Abundance and Citation Trustworthiness, quantifying supported statement–citation pairs and citation diversity.
- Checklist Pass Rate: Fraction of task-specific rubric items satisfied.
- Positive Taxonomy Metric: Success score reflecting error distributions in DEFT categories.
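Two of these metrics reduce to simple ratios. The sketch below shows a checklist pass rate and a FACT-style citation accuracy, under the simplifying assumption that each rubric item and each statement–citation pair has already been judged.

```python
def checklist_pass_rate(results):
    """results: list of booleans, one per rubric item for a given report."""
    return sum(results) / len(results) if results else 0.0

def citation_accuracy(pairs):
    """pairs: list of (statement, citation, supported: bool) judgments.
    Approximates FACT's supported-pair ratio; the official metric also
    accounts for citation diversity, which is omitted here."""
    if not pairs:
        return 0.0
    return sum(1 for _, _, supported in pairs if supported) / len(pairs)
```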
In comparative agent evaluation, the platform Deep Research Comparator applies the Bradley–Terry model for pairwise, outcome-based ranking, complemented by process-level annotation rates over intermediate steps and text spans (Chandrahasan et al., 7 Jul 2025).
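A Bradley–Terry ranking of this kind can be fit with the standard minorization-maximization update. The sketch below assumes a matrix of pairwise win counts between agents and is not the platform's actual implementation.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i][j] = number of times agent i beat agent j (an (n, n) NumPy array).
    Returns one strength score per agent (higher = stronger)."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        total = wins.sum(axis=1)                 # total wins of each agent
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j:
                    games = wins[i, j] + wins[j, i]
                    if games:
                        denom[i] += games / (p[i] + p[j])
        p = np.divide(total, denom, out=p.copy(), where=denom > 0)
        p /= p.sum()                             # normalize for identifiability
    return p
```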
4. Multi-Hop Reasoning and Unanswerability Protocols
Multi-hop queries in FINDER are generated by template grammars, typically requiring 2–4 chain-of-thought reasoning steps traversing distinct entity–source pairs (e.g., locating documents, inferring authorship, mapping to organizational IDs) (Choubey et al., 29 Jun 2025). Each reasoning hop must be grounded in explicit artifact evidence for answerability; unanswerable queries are constructed by omitting required artifacts in the workflow.
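A 3-hop query of the kind described (document, then author, then organizational ID) reduces to chained lookups over the artifact store. The helper structure and field names below are hypothetical.

```python
def resolve_three_hop(feature, documents, employees):
    """Hop 1: find the document introducing `feature`.
       Hop 2: infer its author.
       Hop 3: map the author to an organizational ID.
    Returns None at any hop whose evidence is missing (unanswerable)."""
    doc = next((d for d in documents if feature in d["text"]), None)        # hop 1
    if doc is None:
        return None
    author = doc.get("author")                                              # hop 2
    if author is None:
        return None
    employee = next((e for e in employees if e["name"] == author), None)    # hop 3
    return employee["org_id"] if employee else None
```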
Unanswerable handling is formalized: systems must abstain or flag "cannot answer" to score positively; performance is summarized via the Unan.% metric (percentage of correctly identified unanswerables).
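Scoring the abstention behavior is then a simple ratio; a minimal sketch, assuming each prediction is either a string answer or an explicit "cannot answer" flag keyed by query ID:

```python
def unanswerable_rate(predictions, gold):
    """Unan.%: share of gold-unanswerable queries the system correctly flags.
    predictions/gold: dicts keyed by query ID; gold[qid] is None when unanswerable."""
    unans = [qid for qid, ans in gold.items() if ans is None]
    if not unans:
        return 0.0
    correct = sum(1 for qid in unans if predictions.get(qid) in (None, "cannot answer"))
    return 100.0 * correct / len(unans)
```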
5. Baselines, Empirical Results, and Bottleneck Analyses
A comprehensive battery of baseline systems, ranging from zero-shot LLMs to advanced retrieval-augmented generation (RAG) frameworks (hybrid, RAPTOR, GraphRAG, HippoRAG 2, PGRAG, ReAct agents), is evaluated on FINDER (Choubey et al., 29 Jun 2025). Experimental highlights:
- Best-performing standard RAG: Hybrid (Avg = 20.61)
- Best agentic RAG (ReAct + GPT-4o): Avg = 32.96, Unan.% = 23.03
- 0-shot LLM: Avg = 4.55, Unan.% = 88.70
- GraphRAG: Avg = 10.31, Unan.% = 63.41
Retrieval is consistently identified as the main bottleneck: recall@20 rarely exceeds 50–60%, forcing LLMs to reason over partial contexts. Under oracle conditions (perfect evidence), the best model's score rises to Avg = 85.76, yet errors persist, indicating that reasoning remains imperfect even when complete evidence is supplied.
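Recall@20, the figure cited for the retrieval bottleneck, is computed per query against the gold evidence set; a minimal sketch:

```python
def recall_at_k(retrieved_ids, gold_ids, k=20):
    """Fraction of gold evidence artifacts that appear in the top-k retrieval."""
    if not gold_ids:
        return 1.0   # convention: vacuously perfect when no evidence is required
    hits = len(set(retrieved_ids[:k]) & set(gold_ids))
    return hits / len(gold_ids)
```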
In report synthesis (1,000 outputs across 15 agents), the best proprietary system achieves RACE ≈ 50.95, FACT citation accuracy (C.Acc.) ≈ 57.1%, a checklist pass rate of ≈ 63%, and corresponding structured taxonomy scores (Zhang et al., 1 Dec 2025).
6. Deep rEsearch Failure Taxonomy (DEFT): Error Typology
FINDER introduces DEFT, a taxonomy of 14 empirically derived failure modes, constructed via multi-stage human–LLM co-annotation and validated with Krippendorff’s α for inter-annotator reliability (Zhang et al., 1 Dec 2025). DEFT spans:
- Reasoning: FUR (failure to understand requirements), LAD (lack of analytical depth), LAS (limited scope), RPS (rigid planning)
- Retrieval: IIA (insufficient info acquisition), IHD (handling deficiency), IIF (integration failure), IRM (representation misalignment), VMF (verification failure)
- Generation: RCP (redundant content), SOD (structural dysfunction), CSD (spec deviation), DAR (deficient rigor), SCF (fabrication)
Checklist-driven evaluation is complemented by DEFT diagnosis, supporting both output-level and process-level agent assessment.
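Process-level diagnosis can be summarized by tallying DEFT labels per report and converting them into a positive taxonomy score. The aggregation rule below (one minus the normalized count of distinct failure modes) is an illustrative assumption, not the paper's exact formula.

```python
from collections import Counter

DEFT_CODES = {
    "reasoning":  ["FUR", "LAD", "LAS", "RPS"],
    "retrieval":  ["IIA", "IHD", "IIF", "IRM", "VMF"],
    "generation": ["RCP", "SOD", "CSD", "DAR", "SCF"],
}
ALL_CODES = [c for codes in DEFT_CODES.values() for c in codes]

def positive_taxonomy_score(annotated_failures):
    """annotated_failures: list of DEFT codes observed in one report.
    Returns a score in [0, 1]: 1.0 when no failure mode is present,
    decreasing with the number of distinct modes triggered."""
    counts = Counter(c for c in annotated_failures if c in ALL_CODES)
    return 1.0 - len(counts) / len(ALL_CODES)
```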
7. Domain Extension, Annotation, and Future Directions
FINDER’s methodology generalizes to financial (FinDeepResearch), legal, clinical, and open-domain settings by constructing domain-specific hierarchical structures, extracting fine-grained items, weighting complexity, assembling multilingual reference sets, and applying three-pronged evaluation protocols (Zhu et al., 15 Oct 2025). Annotation workflows leverage both human and LLM inputs, with metrics such as Cohen's κ, Fleiss’s κ, and process-level upvote rates central to aggregate reliability (Chandrahasan et al., 7 Jul 2025).
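For two annotators, reliability is typically reported as Cohen's κ; a minimal sketch over parallel label lists:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```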
Recommendations for advancing FINDER include:
- Workflow-aware retrieval planners routing queries by dynamic signal.
- Multi-hop retrievers exploiting temporal, topical, and structural affinities.
- Enhanced abstention strategies and adversarial query generation for unanswerability.
- Joint models integrating retrieval and reasoning, minimizing partial evidence propagation.
A plausible implication is that transparent, checklist-based evaluation supplemented by error taxonomy will accelerate the deployment of DRAs that meet real-world demands for rigor, reliability, and explainability across heterogeneous research contexts.
| FINDER Component | Description | Paper Reference |
|---|---|---|
| Task Design | 100 curated research tasks, 419 checklist items | (Zhang et al., 1 Dec 2025) |
| Synthetic Pipeline | Workflow-guided artifact generation, multi-hop queries | (Choubey et al., 29 Jun 2025) |
| Evaluation Rubrics | RACE, FACT, HisRubric, Checklist Pass Rate | (Zhang et al., 1 Dec 2025, Zhu et al., 15 Oct 2025) |
| Error Taxonomy (DEFT) | 14 failure modes: reasoning, retrieval, generation | (Zhang et al., 1 Dec 2025) |
| Annotation Platform | Outcome/process annotation, BT ranking, upvote rates | (Chandrahasan et al., 7 Jul 2025) |
FINDER establishes the technical and methodological foundation for fine-grained benchmarking of deep research agents, supporting systematic analysis, robust failure diagnosis, and scalable cross-domain evaluation.