daVinci-LLM: Transparent LLM Research

Updated 31 March 2026

daVinci-LLM is a suite of large language models built on decoder-only Transformer architectures and developed with an emphasis on transparent methodology and open science.
It employs staged curricula and agent-native mid-training, using controlled ablation studies to optimize reasoning, coding efficiency, and clinical integration.
The framework supports diverse applications from autonomous software engineering to multi-agent orchestration in clinical settings, highlighting systematic data curation and deployable performance.

daVinci-LLM refers collectively to a lineage of LLMs, tools, and agentic systems built upon or inspired by the “DaVinci”-class architectures and naming conventions, as well as to a set of methodologies advancing the science and transparency of pretraining, agentic mid-training, and applied deployment in both scientific and clinical domains. These systems span from foundational decoder-only Transformer models for open research, to agentic mid-training frameworks for autonomous software engineering, to multi-agent orchestration platforms in medical settings. The daVinci-LLM suite and related efforts are notable for their transparent methodology, controlled ablation studies, and public release of data and code, as well as for revealing how large-scale LLMs inherit human conceptual biases through statistical learning.

1. Model Architectures and Core Design

daVinci-LLM encompasses several architectures, primarily anchored in the Qwen2 or Qwen2.5-Base family of large-scale decoder-only Transformers. In its most systematic open-science instantiation (Qin et al., 28 Mar 2026), daVinci-LLM is a 3B-parameter model with 36 layers, hidden size 2,048, MLP intermediate size 11,008 (expansion ratio ≈5.4), and grouped-query attention (GQA) with 16 query heads and 2 shared key-value heads. It uses rotary position embeddings (RoPE, base θ=10,000), RMSNorm (ε=10⁻⁶), and SwiGLU activations. The model uses a sequence length of 4,096 tokens and a large vocabulary of 151,936 tokens; training is performed in bfloat16 precision for efficiency.
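As a sanity check on the figures above, the following minimal sketch estimates the parameter count implied by the stated dimensions. The class and function names are illustrative, and two details are assumptions not stated in the text: that the input embedding and output head share weights, and that attention projections carry no bias terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    # Values taken from the architecture description above.
    n_layers: int = 36
    hidden: int = 2048
    intermediate: int = 11008
    n_heads: int = 16
    n_kv_heads: int = 2
    vocab: int = 151936
    # Assumptions (not stated in the text).
    tie_embeddings: bool = True
    qkv_bias: bool = False


def count_parameters(cfg: DecoderConfig) -> int:
    head_dim = cfg.hidden // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim  # GQA: K and V are narrower than Q
    # Q and O projections are hidden x hidden; K and V are hidden x kv_dim.
    attn = 2 * cfg.hidden * cfg.hidden + 2 * cfg.hidden * kv_dim
    if cfg.qkv_bias:
        attn += cfg.hidden + 2 * kv_dim
    # SwiGLU uses three matrices: gate, up, and down projections.
    mlp = 3 * cfg.hidden * cfg.intermediate
    # Two RMSNorm weight vectors per layer.
    norms = 2 * cfg.hidden
    per_layer = attn + mlp + norms
    embed = cfg.vocab * cfg.hidden
    head = 0 if cfg.tie_embeddings else embed
    # The trailing cfg.hidden term is the final RMSNorm.
    return cfg.n_layers * per_layer + embed + head + cfg.hidden


cfg = DecoderConfig()
total = count_parameters(cfg)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
print(f"MLP expansion ratio: {cfg.intermediate / cfg.hidden:.3f}")
```

Under these assumptions the estimate lands at roughly 3.09B parameters, consistent with the "3B" label, and the expansion ratio is 5.375.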

Agent-native mid-training as exemplified by daVinci-Dev (Zeng et al., 26 Jan 2026) retains the standard Qwen2.5-Base backbone at 32B and 72B parameters, deliberately eschewing architectural modifications (e.g., no tool heads or controller modules) to isolate the effect of training regime and data.

In downstream, real-time deployment (notably in clinical environments), daVinci-LLM is embedded as the core reasoning module within hierarchical multi-agent frameworks that orchestrate voice commands, multimedia data retrieval, and clinical workflow mapping (Park et al., 10 Nov 2025). These systems deploy quantization-aware, instruction-tuned models (e.g., gemma3:27b-it-qat) to operate within strict memory and latency constraints on surgical consoles.

2. Data Curation: Data Darwinism and Agent-Native Corpora

The daVinci-LLM research program advances data curation methodology by formalizing a “Data Darwinism” framework (Qin et al., 28 Mar 2026). This taxonomy, covering L₀–L₉, systematizes the data pipeline:

L₀: Raw web-scale acquisition
L₁: Format normalization
L₂: Rule-based filtering (e.g., language, deduplication)
L₃: Lightweight model filtering (e.g., educational value classification)
L₄: Generative refinement by LLMs
L₅: Cognitive completion—expanding or reconstructing latent reasoning
L₆–L₉: Synthetic, context-rich, or multi-agent data generation, up to full “world synthesis”
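The lower levels of this taxonomy can be illustrated with a toy pipeline. This is a minimal sketch, not the project's actual tooling: the function names, the minimum-length rule, and the scoring callback standing in for a lightweight quality classifier are all hypothetical.

```python
import hashlib
from enum import IntEnum


class Level(IntEnum):
    # Names paraphrase the taxonomy above; L6-L9 are omitted for brevity.
    RAW = 0
    NORMALIZED = 1
    RULE_FILTERED = 2
    MODEL_FILTERED = 3
    GENERATIVE_REFINED = 4
    COGNITIVE_COMPLETION = 5


def normalize(text):
    # L1: collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def rule_filter(docs, min_words=5):
    # L2: drop very short documents and case-insensitive exact duplicates.
    seen = set()
    kept = []
    for doc in docs:
        if len(doc.split()) < min_words:
            continue
        digest = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


def run_pipeline(raw_docs, score_fn, threshold=0.5):
    # score_fn is a hypothetical stand-in for an L3 quality classifier.
    docs = [normalize(d) for d in raw_docs]
    docs = rule_filter(docs)
    docs = [d for d in docs if score_fn(d) >= threshold]
    return docs, Level.MODEL_FILTERED


raw = [
    "The  derivative of x squared is two x .",
    "the derivative of x squared is two x .",
    "click here",
    "Buy cheap watches now at the best price today",
]
kept, level = run_pipeline(raw, lambda d: 1.0 if "derivative" in d.lower() else 0.0)
print(level.name, kept)
```

Levels L₄ and above require a generative model and are not sketched here.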

In practice, daVinci-LLM pretraining utilizes extensive L₂–L₅ processing. For code and agentic domains, agent-native mid-training (Zeng et al., 26 Jan 2026) constructs two complementary trajectory corpora:

Contextually-native: Reconstructed end-to-end PR histories from GitHub, encapsulating full information flow and edit sequences.
Environmentally-native: Real agent rollouts, recording sequences of tool calls and observations from dynamic execution (tests, searches) inside Dockerized repositories.

Contamination and privacy concerns are managed via exclusion rules and n-gram overlap checks against benchmark instances.
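An n-gram overlap check of this kind can be sketched as follows. The whitespace tokenization, the default n-gram length of 8, and the zero-overlap threshold are illustrative assumptions; the source does not specify the parameters actually used.

```python
def ngrams(text, n):
    # Lowercased whitespace tokens; a real pipeline would likely use
    # a proper tokenizer (assumption).
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n=8):
    # Union of all n-grams that appear in any benchmark instance.
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(doc, index, n=8, max_overlap=0):
    # Flag a document if it shares more than max_overlap n-grams
    # with the benchmark index.
    return len(ngrams(doc, n) & index) > max_overlap


benchmark = ["fix the off by one error in the pagination helper function"]
index = build_benchmark_index(benchmark, n=5)
print(is_contaminated("we fix the off by one error in loops", index, n=5))
print(is_contaminated("an unrelated commit message about docs", index, n=5))
```

Documents flagged by such a check would be excluded from the training corpus before mid-training.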

3. Training Paradigms: Staged Curricula and Agentic Mid-training

daVinci-LLM pretraining follows a staged adaptive curriculum (Qin et al., 28 Mar 2026):

Stage 1 (6T tokens): Foundation, with a gradual ramp-up of batch size and initial focus on web, code, and science domains.
Stage 2 (2T tokens): Reasoning intensification, shifting to higher proportions of structured QA and refined scientific data; up to 70% QA in late-stage to sustain reasoning growth as general web-text utility plateaus.
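The two-stage schedule can be expressed as a small lookup keyed on tokens consumed. In this sketch only the stage token budgets and the 70% QA ceiling come from the text; every other mixture weight, the domain names, and the function names are hypothetical placeholders.

```python
import random

TRILLION = 10**12

STAGES = [
    {
        "name": "stage1_foundation",
        "tokens": 6 * TRILLION,
        # Hypothetical weights; the text only names the focus domains.
        "mixture": {"web": 0.55, "code": 0.20, "science": 0.15, "qa": 0.10},
    },
    {
        "name": "stage2_reasoning",
        "tokens": 2 * TRILLION,
        # 0.70 QA is the late-stage ceiling quoted above; the rest is hypothetical.
        "mixture": {"web": 0.10, "code": 0.10, "science": 0.10, "qa": 0.70},
    },
]


def stage_for(tokens_seen):
    # Walk the cumulative token boundaries to find the active stage.
    boundary = 0
    for stage in STAGES:
        boundary += stage["tokens"]
        if tokens_seen < boundary:
            return stage
    raise ValueError("token budget exhausted")


def sample_domain(tokens_seen, rng):
    # Draw the domain of the next training batch from the active mixture.
    mixture = stage_for(tokens_seen)["mixture"]
    domains = list(mixture)
    weights = [mixture[d] for d in domains]
    return rng.choices(domains, weights=weights, k=1)[0]


rng = random.Random(0)
print(stage_for(0)["name"], sample_domain(0, rng))
print(stage_for(7 * TRILLION)["name"], sample_domain(7 * TRILLION, rng))
```

A production schedule would presumably interpolate mixtures within a stage rather than switch abruptly, but the source does not give those details.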

For agentic applications in software engineering (Zeng et al., 26 Jan 2026), the key training innovation is agent-native mid-training, where the LLM is exposed to sequences reflecting authentic action–observation loops as encountered by code agents in complex repo environments. This precedes any SFT or RL.

Losses are standard next-token cross-entropy; during SFT, a loss mask excludes non-model (e.g., user) tokens.
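A masked next-token objective of this kind can be written in a few lines. The sketch below uses plain Python lists for clarity; the function names and the convention that a mask value of 1 marks a model-generated token are assumptions for illustration.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized vector.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def masked_next_token_loss(logits, token_ids, loss_mask):
    # logits[t] is the model's prediction for token_ids[t + 1].
    # loss_mask[i] == 1 means token i is a model token and contributes
    # to the loss; 0 means it is excluded (e.g., a user token).
    total, count = 0.0, 0
    for t in range(len(token_ids) - 1):
        if not loss_mask[t + 1]:
            continue
        target = token_ids[t + 1]
        total -= log_softmax(logits[t])[target]
        count += 1
    return total / count if count else 0.0


# Four positions, vocabulary of size 4, uniform predictions.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
token_ids = [1, 2, 3, 0]
loss_mask = [0, 0, 1, 1]  # first two tokens are the user prompt
print(masked_next_token_loss(logits, token_ids, loss_mask))
```

With uniform logits the loss equals ln(4) regardless of which positions are unmasked, which makes this a convenient check that masking changes only which tokens are averaged.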

4. Empirical Ablations and Quantitative Outcomes

Over 200 controlled ablations (Qin et al., 28 Mar 2026) have established critical axes of pretraining science:

Processing depth: Transitioning from L₂ (rule-based) to L₅ (cognitive completion) in data yields systematic capability gains, outperforming raw volume scaling alone (e.g., code: MBPP +3.4, math: MATH +7.0).
Training mixture: Domain-specific saturation curves inform dynamic mixture rebalancing; pure text after certain thresholds offers marginal returns compared to incorporating QA or structured tasks.
Mixture optimization: Excessive focus (e.g. >30% QA) can collapse code performance; balanced intensification avoids domain performance cliffs.
Evaluation methodology: Metric choice (perplexity vs. generative) leads to benchmark ranking reversals.

Key benchmark results illustrate efficiency: daVinci-3B matches OLMo-3-7B average scores (51.72 vs. 51.65) with fewer than half the parameters; HumanEval code: 61.6; MATH: 62.8 (Qin et al., 28 Mar 2026). In agentic code settings, daVinci-Dev exceeds prior state-of-the-art with 56.1% (32B)/58.5% (72B) Pass@1 on SWE-Bench Verified, surpassing Kimi-Dev by up to 9.9% (Zeng et al., 26 Jan 2026).

In clinical settings (Park et al., 10 Nov 2025), the Surgical Agent Orchestration Platform (SAOP) achieves a 95.8% overall multi-pass success rate across 240 voice commands, with per-agent multi-pass success rates of 100% for Information Retrieval (IR), 94% for Image Viewer (IV), and 93% for Anatomy Rendering (AR). Latency remains below the 5 s surgical comfort threshold.

5. Applied Agentic Systems: Clinical Integration and Orchestration

The daVinci-LLM agentic paradigm extends to hierarchical multi-agent platforms for surgical robotics:

The core workflow comprises a Workflow Orchestrator Agent and three Task-Specific Agents (Information Retrieval, Image Viewer, Anatomy Rendering).
Voice commands (detected via Whisper-Small and Silero-VAD) are transcribed, error-corrected, contextually disambiguated with local/global memory, and mapped into agent actions via deterministic, prompt-based LLM calls (Park et al., 10 Nov 2025).
The pipeline is designed to be robust to speech-to-text (STT) errors and ambiguous natural language; multi-level orchestration metrics formalize success rates at both the command level and the high-level workflow level.

This pipeline permits seamless, hands-free data manipulation on surgical consoles, with no need for hardware modifications.
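The routing step of such an orchestrator can be sketched as below. This is a hypothetical illustration only: a keyword lookup stands in for the prompt-based LLM call, a simple command history stands in for local/global memory, and none of the names or keyword lists come from the cited system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

AGENTS = ("information_retrieval", "image_viewer", "anatomy_rendering")

# Hypothetical keyword sets; the real system uses LLM prompts instead.
KEYWORDS = {
    "information_retrieval": {"labs", "record", "records", "medication"},
    "image_viewer": {"ct", "mri", "scan", "image"},
    "anatomy_rendering": {"3d", "model", "render", "rotate"},
}


def keyword_router(command: str, history: List[Tuple[str, str]]) -> str:
    words = {w.strip(".,!?") for w in command.split()}
    for agent, keywords in KEYWORDS.items():
        if words & keywords:
            return agent
    # Ambiguous command: fall back to the most recently used agent.
    if history:
        return history[-1][1]
    return "information_retrieval"


@dataclass
class Orchestrator:
    route_fn: Callable[[str, List[Tuple[str, str]]], str]
    history: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, transcript: str) -> str:
        # Normalize the transcript, route it, and remember the decision.
        command = " ".join(transcript.lower().split())
        agent = self.route_fn(command, self.history)
        if agent not in AGENTS:
            raise ValueError(f"unknown agent: {agent}")
        self.history.append((command, agent))
        return agent


orch = Orchestrator(route_fn=keyword_router)
for utterance in ["Show the CT scan.", "Rotate the 3D model", "Zoom in"]:
    print(utterance, "->", orch.handle(utterance))
```

The last command contains no routing keyword, so the sketch resolves it from context by reusing the previous agent, a crude analogue of memory-based disambiguation.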

6. Conceptual Biases and Cognitive Implications

The “Davinci the Dualist” study demonstrates that purely data-driven learning in LLMs induces human-like mind–body dualism (Berent et al., 2023). Models such as GPT-3 (davinci) and its successor text-davinci-003 (GPT-3.5) acquire differential attitudes about epistemic (thoughts, beliefs) vs. non-epistemic (motor, emotion) states appearing in the brain or surviving death, as measured by the dualist bias index Δ. The effect intensifies with increased model inductive capacity (Δ≈0.09 in GPT-3, Δ≈0.82 in GPT-3.5).

This indicates that LLMs not only absorb factual knowledge from text but also the cultural and conceptual biases embedded therein. Because human linguistic corpora encode intuitive Dualism, further pretraining on similar data reinforces such biases. Syntactic fragility and prompt sensitivity remain as limitations.

7. Scientific Contributions, Limitations, and Open Science

The daVinci-LLM projects are distinguished by a fully-open methodology (Qin et al., 28 Mar 2026):

Release of all pretraining data, pipelines, model code, training logs, and checkpoints.
Publication of all ablation results, including negative findings, to provide a cumulative empirical substrate for pretraining science.
The proposed research pattern enables the field to progress from anecdotal heuristics toward systematic, reusable, and replicable methods.

Main limitations include resource intensity, prompt-coupling (e.g., to specific models), evaluation drift due to metric selection, and lingering privacy or contamination risks in code datasets. Future work involves scaling to larger, more diverse corpora, refined evaluation protocols, and multi-modality; for agentic systems, planned extensions target real-time feedback integration and multilingual interface support.

References

daVinci-LLM: Towards the Science of Pretraining (Qin et al., 28 Mar 2026)
daVinci-Dev: Agent-native Mid-training for Software Engineering (Zeng et al., 26 Jan 2026)
Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction (Park et al., 10 Nov 2025)
Davinci the Dualist: the mind-body divide in LLMs and in human learners (Berent et al., 2023)
DaVinci at SemEval-2024 Task 9: Few-shot prompting GPT-3.5 for Unconventional Reasoning (Mathur et al., 2024)