Deep Reasoning Dataset (DeepRD)
- DeepRD is a synthetic benchmark designed to rigorously evaluate reasoning capabilities in language models using controlled complexity metrics.
- It employs a parameterized generator that varies lookahead and branching factors to create contamination-free tasks in graph connectivity and proof planning.
- Empirical results highlight abrupt performance drops at defined complexity thresholds, motivating enhanced agentic reasoning and specialized tool integration.
The Deep Reasoning Dataset (DeepRD) is a systematically constructed synthetic benchmark for evaluating and diagnosing the reasoning capabilities—and failure modes—of LLMs and large reasoning models (LRMs) under scalable, precisely controlled complexity. DeepRD’s design addresses the inadequacy of prior benchmarks for probing true depth in symbolic and natural language reasoning, especially in domains such as graph connectivity and proof planning, and provides an extensible procedure for generating arbitrarily complex, contamination-free evaluation tasks.
1. Purpose and Rationale
DeepRD was developed in response to observations that transformer-based LLMs and RL-fine-tuned LRMs, while performing strongly on established benchmarks, exhibit abrupt performance collapses at moderate complexity thresholds not captured by prior test sets. Existing datasets such as NLGraph are found to contain tasks with limited complexity—often only requiring shallow inference that does not generalize beyond the distribution seen during model training (Rameshkumar et al., 25 Oct 2025). The core motivation for DeepRD is rigorous complexity scaling and strict isolation from training-distribution contamination, thereby diagnosing the boundaries of compositional, multi-step reasoning achievable by current model architectures.
2. Generative Methodology and Dataset Structure
DeepRD is produced via a parameterized synthetic generator that yields directed acyclic graphs (DAGs) and associated queries. The construction algorithm systematically varies key complexity metrics, enabling granular control over problem hardness:
- Lookahead ($L$): The primary driver of difficulty; $L$ is defined as the minimal number of breadth-first search (BFS) layers required to uniquely identify the correct next step in a path or proof. It is calculated per instance using a formal algorithm given in (Rameshkumar et al., 25 Oct 2025).
- Branching Factor ($B$): Out-degree of the start node; larger $B$ increases discriminative complexity and reduces the random-guessing success rate to $1/B$.
The generator avoids isomorphisms and trivial patterns (e.g., star topologies) that can be short-circuited by simple heuristics. DeepRD thus establishes a “complexity continuum” unavailable in prior real-world or crowd-constructed datasets.
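As a concrete illustration, the following Python sketch shows one way such a parameterized generator could be organized: a single true chain whose length equals the target lookahead, padded with decoy branches that die out one BFS layer before the goal appears. The construction and all names are illustrative assumptions, not the released DeepRD implementation.

```python
import random

def generate_instance(lookahead: int, branching: int, seed: int = 0):
    """Hypothetical sketch: one true chain of `lookahead` steps from start
    to goal, plus `branching - 1` decoy chains that dead-end one BFS layer
    before the goal appears, so the correct first step only becomes
    identifiable after expanding `lookahead` layers from the start."""
    rng = random.Random(seed)
    edges, counter = [], 0

    def fresh():
        nonlocal counter
        counter += 1
        return f"n{counter}"

    for b in range(branching):
        # Decoys are one layer shorter (at least 1) so they stay "alive"
        # until just before the true branch reaches the goal.
        length = lookahead if b == 0 else max(1, lookahead - 1)
        prev = "start"
        for step in range(length):
            node = "goal" if (b == 0 and step == length - 1) else fresh()
            edges.append((prev, node))
            prev = node

    rng.shuffle(edges)  # remove positional cues from the edge list
    return {"edges": edges, "start": "start", "goal": "goal"}

inst = generate_instance(lookahead=4, branching=3)
```

The released generator additionally guards against isomorphic duplicates and heuristic shortcuts, which this sketch omits for brevity.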
Task Formats
- Symbolic Graph Connectivity: Given a set of directed edges and specified start/goal nodes, models must return a valid connecting path, not just existence claims.
- Natural Language Proof Planning: Graph nodes/edges are mapped to facts or inference steps in natural language. The model must select the next logical deduction under implicit chain-of-thought constraints.
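To make the two formats concrete, the sketch below renders an instance (the `inst` dict from the generator sketch above) into both prompt styles. The templates and the `verbalize` mapping are hypothetical stand-ins; the paper's actual phrasing may differ.

```python
def render_symbolic(inst):
    """Symbolic connectivity prompt: the model must output an explicit
    path, not merely assert that one exists."""
    edge_list = ", ".join(f"{u} -> {v}" for u, v in inst["edges"])
    return (f"Directed edges: {edge_list}. "
            f"List a valid path of nodes from {inst['start']} to {inst['goal']}.")

def render_proof(inst, verbalize):
    """Proof-planning prompt: each edge (u, v) becomes an implication
    between verbalized facts, and the model must name the next deduction.
    `verbalize` maps a node id to a natural-language fact (illustrative)."""
    rules = ". ".join(f"If {verbalize(u)}, then {verbalize(v)}"
                      for u, v in inst["edges"])
    return (f"{rules}. You know that {verbalize(inst['start'])}. "
            f"Which fact should be derived next in order to eventually "
            f"establish {verbalize(inst['goal'])}?")

print(render_symbolic(inst))
print(render_proof(inst, lambda n: f"property {n} holds"))
```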
The initial release contains 2,220 examples, sampled to span the cross-product of $L$ and $B$ configurations, including chain graphs of depth up to 1,536. The open-source generator allows unlimited expansion and custom complexity scaling.
3. Complexity Metrics and Formal Definitions
DeepRD enforces explicit complexity quantification:
- Lookahead ($L$): Minimal number of BFS layers from the root to the goal, defined as the step at which the goal or the unique correct next child is deterministically identified.
- Branching Factor ($B$): Number of immediate children of the root; determines chance-level accuracy and difficulty.
Comparison to prior benchmarks (e.g., NLGraph) reveals that typical samples have an expected lookahead of roughly 1.8, with most tasks requiring only trivial pathfinding. By contrast, DeepRD tasks reach lookahead values of up to 800 and branching factors from 2 to 16, making the challenge genuinely scalable and the diagnosis of reasoning depth precise.
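The lookahead definition translates directly into a layered BFS. The function below is a sketch of that definition, not the paper's exact algorithm: it expands one layer at a time, tracking which first step each frontier node descends from, and stops as soon as the goal appears or only one candidate first step remains viable.

```python
from collections import defaultdict

def compute_lookahead(edges, start, goal):
    """Layer-by-layer BFS from `start`. Returns the first layer at which
    the correct next step is determined: either the goal has been found,
    or all first-step frontiers but one are dead. Assumes a finite DAG."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)

    # Layer 1 is the set of immediate children of the start node.
    frontiers = {child: {child} for child in adj[start]}
    layer = 1
    while True:
        goal_found = any(goal in f for f in frontiers.values())
        alive = sum(1 for f in frontiers.values() if f)
        if goal_found or alive <= 1:
            return layer
        frontiers = {c: {w for v in f for w in adj[v]}
                     for c, f in frontiers.items()}
        layer += 1

edges = [("start", "a"), ("a", "goal"),  # true branch
         ("start", "b")]                 # decoy that dies after layer 1
print(compute_lookahead(edges, "start", "goal"))  # -> 2
```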
4. Evaluation Protocols and Empirical Findings
Evaluation on DeepRD measures both full-path and next-step accuracy across two model families:
- LRMs: Models such as DeepSeek-R1 and OpenAI o3/o3-mini, trained via RL with verifiable reasoning rewards.
- LLMs: Standard transformers (DeepSeek V3, GPT-4o) lacking explicit reasoning optimization.
Each response is scored for complete solution correctness and local deduction accuracy, with error types (missing edges, premature branch pruning, hallucinations) cataloged.
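A minimal scoring harness consistent with this protocol might look as follows; `inst` is the instance dict from the earlier sketches, and the error taxonomy is a simplified stand-in for the paper's full catalog.

```python
def score_full_path(inst, path):
    """Full-path correctness: the sequence must start at `start`, end at
    `goal`, and traverse only edges present in the instance."""
    edges = set(inst["edges"])
    return int(bool(path)
               and path[0] == inst["start"]
               and path[-1] == inst["goal"]
               and all((u, v) in edges for u, v in zip(path, path[1:])))

def diagnose(inst, path):
    """Coarse error taxonomy: hallucinated nodes vs. missing edges."""
    nodes = {n for edge in inst["edges"] for n in edge}
    if any(n not in nodes for n in path):
        return "hallucinated node"
    edges = set(inst["edges"])
    if any((u, v) not in edges for u, v in zip(path, path[1:])):
        return "missing edge"
    return "structurally valid"
```

Next-step accuracy reduces to an exact match between the predicted node and the unique correct child of the current node.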
Quantitative Results:
- LRMs and LLMs exhibit high accuracy at low-to-moderate lookahead $L$, but all display sharp performance cliffs at model-specific thresholds; standard LLMs collapse at smaller $L$, while some LRMs hold out to larger $L$ before collapsing.
- When complexity exceeds capability, models devolve to random guessing ($1/B$ success).
- Natural language proof planning tasks induce earlier collapses than symbolic graph tasks.
- Chain graphs ($B = 1$) reveal performance drops at large depths, indicating sequential or compositional generalization limits independent of search or branching.
Token budget and refusal analysis shows that failures arise from reasoning limitations, not truncation or early stops.
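Such an analysis can be approximated by tagging each response before scoring, assuming the common chat-API convention of a `finish_reason` field; the field name and categories below are illustrative.

```python
def classify_outcome(finish_reason, parsed_path, is_correct):
    """Separate budget artifacts from genuine reasoning errors, in the
    spirit of the token-budget and refusal analysis described above."""
    if finish_reason == "length":
        return "truncated"           # ran out of tokens; not a reasoning verdict
    if parsed_path is None:
        return "refusal/no-answer"   # declined or produced no parseable path
    return "correct" if is_correct else "reasoning-error"
```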
5. Comparative Analysis with Existing Datasets
| Aspect | DeepRD | Previous Benchmarks |
|---|---|---|
| Task Types | Graph connectivity; proof planning | Graph and proof tasks; mostly shallow |
| Complexity | Explicitly scalable ($L$ up to 1,536; $B$ up to 16) | Rarely quantified; low |
| Contamination | Synthetic, unique, no real-world overlap | Real-world, high contamination risk |
| Evaluation | Must produce valid path/proof steps | Yes/no or existence checks |
| Usefulness | Diagnoses reasoning depth, limits, and failure modes | Covers routine cases only |
Prior graph and proof resources (OGB, ConceptNet, NaturalProofs) typically fall at small lookahead values; DeepRD generalizes beyond this regime and exposes the full “failure” landscape in long-tailed complexity, as confirmed by empirical distributions (Rameshkumar et al., 25 Oct 2025).
6. Implications, Pathways for Model Development, and Tooling Integration
DeepRD demonstrates that state-of-the-art LRMs are shallow reasoners relative to absolute complexity requirements. Despite aggressive RL and fine-tuning, generalization collapses abruptly at complexity thresholds that lie beyond the training distribution. A plausible implication is that increasing dataset complexity and breadth is necessary (but likely not sufficient) for robust compositional and long-range reasoning.
Recent frameworks such as Agentic Reasoning (Wu et al., 7 Feb 2025) deploy specialized agentic toolchains (e.g., web search, Mind-Map structured memory, code execution) on DeepRD-aligned tasks, improving state-of-the-art results on automatic and human metrics (ROUGE-1, entity recall, organization, logical coverage). Ablation studies confirm that only a deliberate combination of the right tools, not an arbitrary set, delivers substantial synergy on DeepRD-scale benchmarks.
| Model | ROUGE-1 | Entity Recall |
|---|---|---|
| Direct Gen | 27.32 | 6.11 |
| RAG | 29.14 | 8.84 |
| STORM | 47.93 | 15.43 |
| Agentic Reasoning | 54.10 | 18.77 |
This suggests DeepRD is driving methodological advances in agentic reasoning pipelines and tooling configurations tailored for multi-hop, knowledge-intensive inference.
7. Open Challenges and Future Research Directions
DeepRD’s extensible, parameterized generator is publicly released (DeepRD GitHub page), enabling ongoing diagnosis and benchmarking as new models, frameworks, and reasoning agents emerge. Future work may involve:
- Training on high-$L$, high-$B$ DeepRD instances to probe or extend generalization frontiers.
- Exploring tooling augmentation (e.g., Mind-Map agents) for structured memory and abstraction, especially on long-chain and multi-modal reasoning.
- Extending proof planning to richer forms, including theorem proving and symbolic-to-natural language translation.
- Mapping and scaling real-world datasets’ complexity distributions vis-à-vis DeepRD to better forecast real-world performance collapse risks.
8. Conclusion
DeepRD establishes a rigorous standard for complexity-scaled, contamination-free evaluation of reasoning models. By exposing sharp “reasoning cliffs” and systematic failure modes as complexity increases, it moves the field beyond shallow, routine benchmarks and provides a foundation for targeted progress in deep, multi-step reasoning. Its interoperability with state-of-the-art agentic frameworks confirms its utility for both diagnostic and improvement purposes in theorem proving, symbolic inference, and knowledge-intensive natural language synthesis (Rameshkumar et al., 25 Oct 2025, Wu et al., 7 Feb 2025).