
Deep Reasoning Dataset (DeepRD)

Updated 31 October 2025
  • DeepRD is a synthetic benchmark designed to rigorously evaluate reasoning capabilities in language models using controlled complexity metrics.
  • It employs a parameterized generator that varies lookahead and branching factors to create contamination-free tasks in graph connectivity and proof planning.
  • Empirical results reveal abrupt performance drops at well-defined complexity thresholds, motivating enhanced agentic reasoning and specialized tool integration.

The Deep Reasoning Dataset (DeepRD) is a systematically constructed synthetic benchmark for evaluating and diagnosing the reasoning capabilities and failure modes of large language models (LLMs) and large reasoning models (LRMs) under scalable, precisely controlled complexity. DeepRD’s design addresses the inadequacy of prior benchmarks for probing true depth in symbolic and natural language reasoning, especially in domains such as graph connectivity and proof planning, and provides an extensible procedure for generating arbitrarily complex, contamination-free evaluation tasks.

1. Purpose and Rationale

DeepRD was developed in response to observations that transformer-based LLMs and RL-fine-tuned LRMs, while performing strongly on established benchmarks, exhibit abrupt performance collapses at moderate complexity thresholds not captured by prior test sets. Existing datasets such as NLGraph are found to contain tasks with limited complexity—often only requiring shallow inference that does not generalize beyond the distribution seen during model training (Rameshkumar et al., 25 Oct 2025). The core motivation for DeepRD is rigorous complexity scaling and strict isolation from training-distribution contamination, thereby diagnosing the boundaries of compositional, multi-step reasoning achievable by current model architectures.

2. Generative Methodology and Dataset Structure

DeepRD is produced via a parameterized synthetic generator that yields directed acyclic graphs (DAGs) and associated queries. The construction algorithm systematically varies key complexity metrics, enabling granular control over problem hardness:

  • Lookahead (L): The primary driver of difficulty; L is the minimal number of breadth-first search (BFS) layers required to uniquely identify the correct next step in a path or proof. It is computed per instance with the formal lookahead algorithm given in (Rameshkumar et al., 25 Oct 2025).
  • Branching Factor (B): The out-degree of the start node; a larger B increases discriminative complexity and reduces the success rate of random guessing to $1/B$.

The generator avoids isomorphisms and trivial patterns (e.g., star topologies) that can be short-circuited by simple heuristics. DeepRD thus establishes a “complexity continuum” unavailable in prior real-world or crowd-constructed datasets.
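The released generator's exact construction procedure is given in (Rameshkumar et al., 25 Oct 2025) and avoids the trivially prunable patterns noted above. As a minimal sketch of the parameterization only, assuming a deliberately simplified topology (a single solution chain of length L plus B - 1 distractor branches at the start node), the Python below illustrates how L and B can be controlled independently. The function name generate_task, the node labels, and the distractor_len parameter are illustrative assumptions, not the released API, and real DeepRD instances use richer topologies.

```python
import random

def generate_task(lookahead: int, branching: int, distractor_len: int = 3, seed: int = 0) -> dict:
    """Illustrative sketch: build a DAG whose solution chain has `lookahead` edges
    from start to goal, with `branching - 1` dead-end branches leaving the start
    node, so random next-step guessing succeeds with probability 1/branching."""
    rng = random.Random(seed)

    # Solution chain: start -> c1 -> ... -> goal (exactly `lookahead` edges).
    chain = ["start"] + [f"c{i}" for i in range(1, lookahead)] + ["goal"]
    edges = list(zip(chain, chain[1:]))

    # Distractor branches: short dead ends rooted at the start node.
    for b in range(branching - 1):
        branch = [f"d{b}_{j}" for j in range(distractor_len)]
        edges.append(("start", branch[0]))
        edges += list(zip(branch, branch[1:]))

    rng.shuffle(edges)  # avoid listing edges in solution order
    return {"edges": edges, "start": "start", "goal": "goal", "answer": chain}
```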

Task Formats

  1. Symbolic Graph Connectivity: Given a set of directed edges and specified start/goal nodes, models must return a valid connecting path, not just existence claims.
  2. Natural Language Proof Planning: Graph nodes/edges are mapped to facts or inference steps in natural language. The model must select the next logical deduction under implicit chain-of-thought constraints.

The initial release contains 2,220 examples, sampled to span the cross-product of L and B configurations and including chain graphs of depth up to 1,536. The open-source generator allows unlimited expansion and custom complexity scaling.
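For concreteness, the sketch below renders an instance in the symbolic graph-connectivity format, reusing the hypothetical generate_task helper above; the prompt wording is an assumption for illustration, not DeepRD's exact template.

```python
def to_prompt(task: dict) -> str:
    """Render a connectivity instance as text; phrasing is illustrative only."""
    edge_str = ", ".join(f"{u} -> {v}" for u, v in task["edges"])
    return (
        f"Directed edges: {edge_str}.\n"
        f"Give a path of nodes from {task['start']} to {task['goal']}; "
        f"a bare existence claim is not accepted."
    )

example = generate_task(lookahead=4, branching=2)
print(to_prompt(example))   # the serialized task
print(example["answer"])    # reference path, e.g. ['start', 'c1', 'c2', 'c3', 'goal']
```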

3. Complexity Metrics and Formal Definitions

DeepRD enforces explicit complexity quantification:

  • Lookahead (L): Minimal number of BFS layers from the root to the goal, i.e., the step at which the goal or the unique correct next child is deterministically identified (a minimal computational sketch follows this list).
  • Branching Factor (B): Number of immediate children of the root; this sets chance-level accuracy at $1/B$ and scales difficulty.
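Under the simplest reading of these definitions (lookahead as the number of BFS layers from the start node to the goal, branching factor as the start node's out-degree), both metrics can be computed as in the sketch below; the paper's lookahead algorithm is stricter in that it tracks when the unique correct next child becomes identifiable, so treat this as an approximation.

```python
def bfs_lookahead(edges: list[tuple[str, str]], start: str, goal: str) -> int | None:
    """Approximate L: number of BFS layers from start to goal (None if unreachable)."""
    adj: dict[str, list[str]] = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    frontier, seen, depth = [start], {start}, 0
    while frontier:
        if goal in frontier:
            return depth
        nxt = []
        for u in frontier:
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        frontier, depth = nxt, depth + 1
    return None

def branching_factor(edges: list[tuple[str, str]], start: str) -> int:
    """B: out-degree of the start node."""
    return sum(1 for u, _ in edges if u == start)
```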

Comparison to prior benchmarks (e.g., NLGraph) reveals that typical samples have an expected lookahead of only about 1.8, estimated as $\frac{\log N - \gamma}{\log(pN)} + \frac{1}{2}$, where $N$ is the number of nodes, $p$ the edge probability, and $\gamma$ the Euler–Mascheroni constant; most such tasks require only trivial pathfinding. By contrast, DeepRD tasks reach L values of up to 800 and branching factors of 2–16, making the challenge genuinely scalable and the diagnosis of reasoning depth precise.
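For illustration, reading log as the natural logarithm and plugging in assumed values N = 20 and p = 0.3 (chosen for illustration, not necessarily NLGraph's actual graph parameters) yields a value close to the reported 1.8:

```python
import math

def expected_lookahead(N: float, p: float) -> float:
    """Expected lookahead (ln N - gamma) / ln(p N) + 1/2 for a random graph."""
    gamma = 0.5772156649  # Euler-Mascheroni constant
    return (math.log(N) - gamma) / math.log(p * N) + 0.5

print(round(expected_lookahead(20, 0.3), 2))  # ~1.85 with these assumed values
```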

4. Evaluation Protocols and Empirical Findings

Evaluation on DeepRD scores both full-path and next-step accuracy, comparing two model classes:

  • LRMs: Models such as DeepSeek-R1, OpenAI o3/o3-mini, trained via RL with verifiable reasoning rewards.
  • LLMs: Standard transformers (DeepSeek V3, GPT-4o) lacking explicit reasoning optimization.

Each response is scored for complete solution correctness and local deduction accuracy, with error types (missing edges, premature branch pruning, hallucinations) cataloged.
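A minimal sketch of such scoring, assuming a response has already been parsed into a node sequence and reusing the hypothetical instance dictionary from the earlier sketches (the published evaluation harness may apply additional criteria, and the error taxonomy is not reproduced here):

```python
def is_valid_path(task: dict, path: list[str]) -> bool:
    """Full-solution correctness: the path starts at `start`, ends at `goal`,
    and every consecutive pair of nodes is an edge listed in the instance."""
    edge_set = set(task["edges"])
    if not path or path[0] != task["start"] or path[-1] != task["goal"]:
        return False
    return all((u, v) in edge_set for u, v in zip(path, path[1:]))

def next_step_correct(task: dict, predicted_next: str) -> bool:
    """Local deduction accuracy: does the prediction match the unique correct node
    after the start node? Random guessing succeeds with probability 1/B."""
    return predicted_next == task["answer"][1]
```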

Quantitative Results:

  • LRMs and LLMs exhibit high accuracy for low-to-moderate L, but all display sharp performance cliffs at model-specific thresholds. At $B = 16$, accuracy collapses around $L \approx 50$; for $B = 2$, some models hold until $L \approx 200$.
  • When complexity exceeds capability, models devolve to random guessing ($1/B$ success).
  • Natural language proof planning tasks induce earlier collapses than symbolic graph tasks.
  • Chain graphs ($B = 1$) reveal drops at depth $L > 600$, indicating sequential or compositional generalization limits independent of search and branching.

Token budget and refusal analysis shows that failures arise from reasoning limitations, not truncation or early stops.

5. Comparative Analysis with Existing Datasets

Aspect | DeepRD | Previous Benchmarks
Task Types | Graph connectivity; proof planning | Graph/proof tasks requiring only shallow inference
Complexity | Explicitly scalable (L up to 1,536; B up to 16) | Rarely quantified; low L
Contamination | Synthetic and unique; no real-world overlap | Real-world sourced; high contamination risk
Evaluation | Must produce valid path/proof steps | Yes/no or existence checks
Usefulness | Diagnoses reasoning depth and failure limits | Covers routine cases only

Prior graph and proof corpora (OGB, ConceptNet, NaturalProofs) typically fall at $L < 10$; DeepRD generalizes beyond this regime and exposes the full “failure” landscape across long-tailed complexity, as confirmed by empirical distributions (Rameshkumar et al., 25 Oct 2025).

6. Implications, Pathways for Model Development, and Tooling Integration

DeepRD demonstrates that state-of-the-art LRMs are shallow reasoners relative to absolute complexity requirements. Despite aggressive RL and fine-tuning, generalization collapses abruptly at thresholds set by the gap between the training and test distributions. A plausible implication is that increasing dataset complexity and breadth is necessary (but likely not sufficient) for robust compositional and long-range reasoning.

Recent frameworks such as Agentic Reasoning (Wu et al., 7 Feb 2025) deploy specialized agentic toolchains—e.g., web-search, Mind-Map structured memory, code execution—on DeepRD-aligned tasks, raising state-of-the-art automatic and human metrics (ROUGE-1, entity recall, organization, logical coverage). Ablation studies confirm that only deliberate combination of the right tools (not an arbitrary set) delivers substantial synergy on DeepRD-scale benchmarks.

Model | ROUGE-1 | Entity Recall
Direct Gen | 27.32 | 6.11
RAG | 29.14 | 8.84
STORM | 47.93 | 15.43
Agentic Reasoning | 54.10 | 18.77

This suggests DeepRD is driving methodological advances in agentic reasoning pipelines and tooling configurations tailored for multi-hop, knowledge-intensive inference.

7. Open Challenges and Future Research Directions

DeepRD’s extensible, parameterized generator is publicly released (DeepRD GitHub page), enabling ongoing diagnosis and benchmarking as new models, frameworks, and reasoning agents emerge. Future work may involve:

  • Training on high-L, high-B DeepRD instances to probe or extend generalization frontiers.
  • Exploring tooling augmentation (e.g., Mind-Map agents) for structured memory and abstraction, especially on long-chain and multi-modal reasoning.
  • Extending proof planning to richer forms, including theorem proving and symbolic-to-natural language translation.
  • Mapping and scaling real-world datasets’ complexity distributions vis-à-vis DeepRD to better forecast real-world performance collapse risks.

8. Conclusion

DeepRD establishes a rigorous standard for complexity-scaled, contamination-free evaluation of reasoning models. By exposing sharp “reasoning cliffs” and systematic failure modes as complexity increases, it moves the field beyond shallow, routine benchmarks and provides a foundation for targeted progress in deep, multi-step reasoning. Its interoperability with state-of-the-art agentic frameworks confirms its utility for both diagnostic and improvement purposes in theorem proving, symbolic inference, and knowledge-intensive natural language synthesis (Rameshkumar et al., 25 Oct 2025, Wu et al., 7 Feb 2025).
