
Datarus-R1-14B: Agentic Analytical LLM

Updated 3 July 2026
  • Datarus-R1-14B is a 14B-parameter language model specialized for automated data analysis and multi-step STEM reasoning using a process-centric training approach.
  • It employs a hybrid supervised and reinforcement learning strategy with GRPO and a composite reward system to balance structural compliance with semantic accuracy.
  • Its dual-mode interface, featuring agentic ReAct-style workflows with code execution and reflective chain-of-thought outputs, enhances analytic efficiency and error correction.

Datarus-R1-14B is a 14B-parameter open-weights LLM specialized for automated data analysis and multi-step STEM reasoning. It is fine-tuned from Qwen2.5-14B-Instruct using a hybrid trajectory-centric supervised and reinforcement learning framework. The system implements a dual-mode reasoning interface, supporting both agentic ReAct-style notebook workflows with code execution and more compact reflection-mode outputs for chain-of-thought (CoT) reasoning. Datarus-R1-14B is positioned to address the limitations of instruction-tuned LLMs on real analytical tasks by explicitly training on process-centric, multi-turn analytical trajectories, not static question-answer pairs (Chaliah et al., 18 Aug 2025).

1. Model Architecture and Design Rationale

Datarus-R1-14B is architected as a transformer-based decoder-only model with 14 billion parameters, derived from Qwen2.5-14B-Instruct. The design goal is to bridge the gap between answer-focused SFT models and practical automated analysts capable of managing complex, iterative, code-driven analytical workflows. Unlike most open 14B LLMs, Datarus is trained end-to-end for agentic behavior: not just final-answer accuracy, but the full reasoning-act-observe-reflect-revise cycle typical of human analysts.

A core innovation is the dual reasoning interface:

  • Agentic (ReAct-style) mode: The model emits structured notebook steps, invoking Python code and incrementally processing results in a ReAct tag schema: <step>, <thought>, <action>, <action_input>, <observation>, <stop_analysis>, <answer>.
  • Reflection mode: For applications needing interpretability or compact reasoning traces, Datarus emits text within <think> and <answer> tags, providing readable but concise CoT explanations.
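
As a rough illustration of how a host application might consume agentic-mode output, the sketch below splits a tagged trajectory into steps and a final answer. The tag names come from the schema listed above; the assumptions that every tag is closed XML-style, that <observation> sits inside its <step>, and the helper names `extract` and `parse_agentic_output` are all hypothetical and not taken from the paper.

```python
import re
from typing import Optional

# Tag names come from the schema in the text; nesting and closing tags are assumed.
STEP_FIELDS = ("thought", "action", "action_input", "observation")


def extract(tag: str, text: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_agentic_output(text: str) -> dict:
    """Split a tagged trajectory into per-step dicts plus the final answer."""
    steps = []
    for body in re.findall(r"<step>(.*?)</step>", text, flags=re.DOTALL):
        steps.append({field: extract(field, body) for field in STEP_FIELDS})
    return {
        "steps": steps,
        "finished": "<stop_analysis>" in text,  # marker that the analysis ended
        "answer": extract("answer", text),
    }


# Hand-written sample for illustration only; not real model output.
sample = """
<step>
<thought>Compute the mean of the column.</thought>
<action>python</action>
<action_input>print(sum([1, 2, 3]) / 3)</action_input>
<observation>2.0</observation>
</step>
<stop_analysis></stop_analysis>
<answer>The mean is 2.0.</answer>
"""

parsed = parse_agentic_output(sample)
print(parsed["steps"][0]["action_input"])
print(parsed["answer"])
```

In a real deployment the <action_input> content would be sent to a sandboxed Python executor and the result fed back as the next <observation>; that loop is omitted here.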

    The architecture and interface are tightly coupled to a data pipeline grounded in synthetic notebook trajectories and reinforced through RL with a semantic-structural curriculum signal.

    2. Training Pipeline and Data Regimen

    Datarus-R1-14B is trained in two major phases: large-scale supervised fine-tuning on synthetic analytic trajectories, followed by curriculum RL using Group Relative Policy Optimization (GRPO) with a composite reward.

    • Synthetic trajectory data generation: The core SFT corpus consists of 144,000 notebook-formatted analytic episodes, generated over 20,000 synthetic datasets spanning quantitative domains such as finance, numerical analysis, medicine, probability, linear algebra, and graph theory. Trajectories are generated by an ensemble of LLM-based "analyst loops," rather than traditional SFT pair extraction.

    • Trajectory structure: Each trajectory includes explicit reasoning steps, Python code, execution outputs (including errors/tracebacks), self-corrective edits, and final answers. To ensure broad coverage, episodes are stratified into clean success, error-correction, in-step self-correction, and persistent failure, at approximately 40/35/15/10% proportions.
    • SFT curriculum: The SFT phase focuses on trajectory quality and conciseness, oversampling information-dense and genuine revision episodes while filtering out repetitive or hollow “overthinking” traces. The SFT dataset includes 60% error-free, 20% error-corrected, and 20% curated reasoning examples.
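
To make the trajectory stratification concrete, the following sketch converts the approximate 40/35/15/10% split into episode counts for the 144,000-episode corpus. The stratum keys and the function name are illustrative, and because the proportions are stated as approximate, the resulting counts are approximate too.

```python
# Figures from the text: 144,000 episodes, split roughly 40/35/15/10.
TOTAL_EPISODES = 144_000
STRATA_PERCENT = {
    "clean_success": 40,
    "error_correction": 35,
    "in_step_self_correction": 15,
    "persistent_failure": 10,
}


def episodes_per_stratum(total: int, percents: dict) -> dict:
    """Convert integer percentages into approximate episode counts."""
    if sum(percents.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in percents.items()}


counts = episodes_per_stratum(TOTAL_EPISODES, STRATA_PERCENT)
for name, n in counts.items():
    print(f"{name}: ~{n:,} episodes")
```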

    The RL phase uses GRPO over a dual reward: a tag-based structural signal and a Hierarchical Reward Model (HRM) for semantic trajectory quality.

    3. Reinforcement Learning, Reward Structure, and System Optimizations

    Group Relative Policy Optimization

    Datarus-R1-14B uses Group Relative Policy Optimization (GRPO) to optimize long, tag-structured outputs under process-level and semantic rewards. The paper does not provide an explicit loss formula, but conceptually:

    • Multiple completions per prompt are sampled.
    • Rewards are scored using a weighted sum of structural tag presence (tag-based reward) and semantic quality (HRM).
    • Policy updates are performed relative to a reference model with KL regularization; the groupwise advantage reflects intra-group reward ranking.
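
Since the paper gives no explicit formula, the sketch below shows one common way GRPO-style methods turn a group of sampled completions into advantages: each reward is standardized against the mean and standard deviation of its own group. This normalization is an assumption borrowed from the general GRPO literature, not a confirmed detail of the Datarus training code, and the reward values are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled completions.

    Completions that score above the group mean get a positive advantage,
    those below get a negative one. `eps` guards against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical composite rewards for four completions of the same prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the advantage is computed relative to the group, no separate value network is needed; the KL term against the reference model would be added in the policy loss and is not shown here.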

    Composite Reward with Cosine Curriculum

    The total reward at timestep $t$, $R_{\text{total}}$, is:

    $$R_{\text{total}} = \lambda_{\text{tag}} R_{\text{tag}} + (1 - \lambda_{\text{tag}}) R_{\text{hrm}}$$

    where $R_{\text{tag}}$ incentivizes prompt structural compliance (start-of-output <step>, presence bonuses for analytical markup) and $R_{\text{hrm}}$ is provided by a Qwen2.5-3B-based HRM scoring correctness at the step and full-trajectory level. The curriculum coefficient $\lambda_{\text{tag}}$ decays using a cosine schedule from 1 (structure-focused early) to 0 (semantics-focused late), mitigating RL-induced format collapse (Chaliah et al., 18 Aug 2025).

    • Tag reward: Position-based for <step>, presence bonuses for <thought>, <action>, <action_input>, </step>, <stop_analysis>.
    • HRM: Provides dense feedback on both local step quality and global trajectory coherence, including positive credit for error recovery, and is trained with pairwise preference comparison.
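
The composite reward and its curriculum can be sketched as follows. The text specifies only that the tag coefficient follows a cosine schedule from 1 to 0, so the half-cosine form used here is an assumption, as are the bonus weights (0.5 for a leading <step>, 0.1 per present tag) and all function names. The HRM score is passed in as a plain number because the reward model itself is not released.

```python
import math

# Tags that earn a presence bonus, as listed in the text.
PRESENCE_TAGS = ("<thought>", "<action>", "<action_input>", "</step>", "<stop_analysis>")


def lambda_tag(step: int, total_steps: int) -> float:
    """Assumed half-cosine decay: 1.0 at step 0, 0.0 at the final step."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def tag_reward(output: str) -> float:
    """Structural reward with hypothetical weights (maximum 1.0)."""
    score = 0.5 if output.lstrip().startswith("<step>") else 0.0  # position-based
    score += sum(0.1 for tag in PRESENCE_TAGS if tag in output)   # presence bonuses
    return score


def total_reward(output: str, r_hrm: float, step: int, total_steps: int) -> float:
    """Blend structural and semantic rewards with the curriculum coefficient."""
    lam = lambda_tag(step, total_steps)
    return lam * tag_reward(output) + (1.0 - lam) * r_hrm


# Hand-written example output and a made-up HRM score of 0.3.
well_formed = (
    "<step><thought>t</thought><action>python</action>"
    "<action_input>x</action_input></step><stop_analysis>"
)
for step in (0, 50, 100):
    print(step, round(total_reward(well_formed, r_hrm=0.3, step=step, total_steps=100), 3))
```

Early in training the structural term dominates, so a well-formed but semantically weak output still scores highly; by the end only the HRM score matters.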

    Systems Optimizations

    To scale GRPO to 14B, Datarus implements several systems-level enhancements:

    • KV-cache reuse and batch prompt encoding, amortizing decoding cost across multiple generations.
    • Sequential sample processing to dramatically reduce GPU memory load for group sampling.
    • Reference-model sharding for KL penalty computation, halving needed model copies.
    • Gradient accumulation, BF16 computation, and other industry-standard stability/efficiency patches, all integrated over 8 H200 GPUs (Chaliah et al., 18 Aug 2025).

    4. Evaluation, Benchmarks, and Empirical Results

    Datarus-R1-14B is benchmarked on several reasoning and STEM domains, using greedy decoding by default except for specific multi-seed or sampling-based protocols.

    Main Results

    Model AIME 2024 AIME 2025 GPQA Diamond LCB v6
    Datarus-R1-14B-preview 70.1 66.2 62.1 57.7
    DeepSeek-R1-Distill-14B 58.6 48.6
    Light-R1-14B-DS 74.0 60.2 61.7
    QwQ-32B 76.2 66.2 60.1 56.6
    DeepCoder-14B 63.7 51.2 55.0 54.3
    Magistral-S-2506 (23.6B) 70.7* 62.7* 56.6 55.4

    Datarus achieves strong results on AIME 2024/2025 and LCB, outperforming peer 14B models and matching or exceeding QwQ-32B and larger models on some metrics. It is especially noteworthy for its token efficiency: Datarus responses contain 18–49% fewer tokens per problem than comparators, including 30.5% fewer than QwQ-32B and 49% fewer than DeepCoder-14B (Chaliah et al., 18 Aug 2025).

    Analysis

    The paper emphasizes not only accuracy but concise, adaptive reasoning. For example, Datarus avoids the extreme verbosity inflation seen in other RL-aligned reasoning models as task difficulty increases, maintaining moderate output growth while preserving solution quality.

    5. Distinguishing Features and Behavioral Characteristics

    Datarus-R1-14B exhibits several behavioral properties stemming from its training data and dual reward structure:

    • Process-centric supervision: The model learns to handle error-tracing, self-correction, and iterative debugging within an explicit tag schema, rather than merely outputting one-shot or verbose step-by-step solutions.
    • “AHA-moment” reasoning: On challenging problems, the model typically generates a hypothesis, refines it one or two times (often catching errors via code execution or reflection), then converges to a solution—avoiding circular, degenerate reasoning loops (Chaliah et al., 18 Aug 2025).
    • Dual-mode interface: The agentic ReAct-style and compact reflection mode enable deployment across both interactive tool-calling environments (such as notebooks) and batch reasoning workflows.
    • Domain generalization: While grounded in synthetic analytic data, Datarus spans a wide range of quantitative fields and demonstrates strong transfer to math, coding, and graduate-level analysis benchmarks.

    6. Limitations and Practical Considerations

    Datarus-R1-14B’s most distinctive strength—agentic, Python-integrated analysis—is also the source of several limitations:

    • Tool dependency: Effective operation in agentic mode requires access to Python execution; performance may degrade on pure-text environments or tasks outside the synthetic notebook distribution.
    • Training and adoption complexity: RL with GRPO and process-model-based reward is nontrivial, requiring significant GPU, memory, and orchestration resources; the model is released as "preview," indicating that further stability improvements may be forthcoming.
    • Limited real-world generalization evidence: Most empirical gains are on synthetic or standardized public benchmarks; while multi-domain synthetic data is broad, industrial validation on messy data is not detailed in the paper.
    • Partial reproducibility: Model weights and inference pipelines are released, but the full training dataset, HRM, and detailed RL configs are not, constraining full end-to-end reproduction (Chaliah et al., 18 Aug 2025).

    7. Comparative Context within the R1-Style Model Ecosystem

    Relative to contemporaries such as DeepSeek-R1 (DeepSeek-AI et al., 22 Jan 2025), R1-Code-Interpreter (Chen et al., 27 May 2025), and Light-R1 (Wen et al., 13 Mar 2025), Datarus-R1-14B is distinguished by its explicit trajectory-based SFT, semantic-structural dual reward, and focus on process supervision for analytic workflows. Unlike R1-Code-Interpreter, which emphasizes hybrid tool use via code interleaving mainly for symbolic problem solving, Datarus targets full analytic episodes and self-correction within agentic contexts, and yields higher token efficiency. Unlike Light-R1-14B-DS, which achieves stronger math results via distilled SFT and carefully filtered RL on hard datasets, Datarus places greater emphasis on rich, tagged, multi-domain data science and agentic analysis. All R1-style models at this scale benefit substantially from distilled teacher traces and curation, rather than learning robust multi-step reasoning from RL alone.

    8. Technical Table of Model and Method Characteristics

    Component | Datarus-R1-14B | Light-R1-14B-DS | R1-Code-Interpreter-14B
    Base Model | Qwen2.5-14B-Instruct | DeepSeek-R1-Distill-14B | Qwen2.5-14B-Instruct-1M
    SFT Data | 144k notebook trajectories | 3k hard math SFT | 6.5k multi-turn code/text SFT
    RL Algorithm | GRPO | GRPO | GRPO, PPO
    RL Reward | Tag-based + HRM | Outcome + length mod. | Rule-based correctness only
    Main Interface | ReAct tags + CoT | Long CoT | Python code/text alternation
    Token Efficiency | Yes (18–49% savings) | Not primary focus | Not reported
    Evaluated Benchmarks | LCB, AIME, GPQA | AIME, GPQA | 37-task reasoning suite
    Notable Distinction | Dual interface, analytic process | Curriculum SFT, math SOTA | Selective code reasoning, emergent self-checking
