MDAgent2: MD Q&A & Code Generation Framework
- MDAgent2 is a domain-specific large language model framework that performs both MD knowledge Q&A and LAMMPS code generation.
- It employs a three-stage post-training pipeline—continued pre-training, supervised fine-tuning, and execution-grounded reinforcement learning—to optimize script reliability.
- The framework integrates a deployable multi-agent runtime that iteratively generates, executes, evaluates, and revises code based on real simulation feedback.
MDAgent2 is a domain-specialized LLM framework for two tightly coupled molecular-dynamics tasks: answering molecular-dynamics knowledge questions and generating LAMMPS simulation scripts from natural-language descriptions. It is presented as the first end-to-end framework capable of performing both knowledge Q&A and code generation within the MD domain, and it combines a domain-specific data-construction pipeline, a three-stage post-training strategy, an execution-grounded reinforcement-learning method named MD-GRPO, a deployable multi-agent runtime, and a dedicated evaluation suite named MD-EvalBench (Shi et al., 5 Jan 2026).
1. Definition, scope, and system position
MDAgent2 is motivated by a specific practical bottleneck in molecular-dynamics research: writing LAMMPS scripts is difficult, highly specialized, and labor-intensive because valid simulations require not only correct syntax, but also physically appropriate setups, interatomic potentials, ensembles, parameters, and complete workflows. The framework is designed to address several deficiencies that the paper attributes to generic LLMs in this domain: scarce high-quality MD-specific data, the lack of a dedicated benchmark for MD knowledge and LAMMPS generation, low code executability, and the cost of deploying very large general-purpose models (Shi et al., 5 Jan 2026).
Within the paper’s own framing, MDAgent2 extends an earlier MDAgent that focused mainly on code generation for obtaining thermodynamic parameters and used supervised fine-tuning plus a multi-agent generate–evaluate–rewrite loop. MDAgent2 broadens that scope in several directions: it covers both MD knowledge Q&A and code generation, introduces a systematic data-construction pipeline, adopts continued pre-training, supervised fine-tuning, and reinforcement learning, adds MD-GRPO as an execution-grounded RL method, implements a new deployable runtime with LAMMPS-aware tools and self-correction, and introduces MD-EvalBench as a benchmark for molecular-dynamics QA and LAMMPS code generation (Shi et al., 5 Jan 2026).
The paper organizes the system into four layers: data construction, model training, runtime deployment, and evaluation. At the data layer, MDAgent2 builds three datasets. At the training layer, it adapts a Qwen3-8B backbone into two specialized models, MD-Instruct-8B and MD-Code-8B. At deployment time, MDAgent2-RUNTIME iteratively generates, checks, executes, evaluates, and revises code. At the evaluation layer, MD-EvalBench measures both domain knowledge and script-generation capability (Shi et al., 5 Jan 2026).
2. Data resources and benchmark design
A major part of MDAgent2 is its domain-specific data pipeline. The framework introduces three datasets that target complementary parts of the molecular-dynamics workflow: domain knowledge, instruction-following QA, and text-to-code generation. The paper also introduces MD-EvalBench, which it characterizes as the first benchmark for MD QA and LAMMPS code generation in this setting (Shi et al., 5 Jan 2026).
| Resource | Role | Size |
|---|---|---|
| MD-Knowledge | Continued pre-training corpus | 17,808 samples; 10,865,191 tokens |
| MD-InstructQA | Domain instruction-answer tuning set | 27,346 samples |
| MD-CodeGen | Text-to-LAMMPS paired dataset | 4,253 samples |
| MD-KnowledgeEval | Theory benchmark | 336 questions |
| LAMMPS-SyntaxEval | Syntax benchmark | 333 in main text; 368 in appendix |
| LAMMPS-CodeGenEval | Code-generation benchmark | 566 samples |
MD-Knowledge is built from “thousands of high-quality molecular-dynamics-related papers, textbooks, technical documents, and public manuals.” Its construction includes useless-text filtering, regex-based artifact removal, approximate deduplication with MinHash and locality-sensitive hashing, semantic deduplication with sentence embeddings and cosine similarity, and final filtering with deepseek-chat as an automatic evaluator of clarity, coherence, and information density. MD-InstructQA is then derived from this corpus through PDF-to-Markdown conversion, structure-sensitive chunking, semantic domain label-tree construction, and tailored prompt-based question and answer generation. MD-CodeGen combines manual expert collection, structured task-template synthesis, runtime-assisted script generation, and expert review to produce natural-language task descriptions paired with physically meaningful LAMMPS scripts (Shi et al., 5 Jan 2026).
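The approximate-deduplication step named above can be illustrated with a small MinHash plus locality-sensitive-hashing sketch. This is a minimal stdlib-only illustration, not the paper's pipeline: the shingle size, signature length, band count, similarity threshold, and all function names are assumptions chosen for readability.

```python
import hashlib
import re

NUM_HASHES = 64  # signature length (illustrative assumption)


def shingles(text, k=5):
    """Return the set of word k-grams of a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle under a seed-specific salt."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash(text):
    """MinHash signature: the minimum hash over all shingles, per seed."""
    sh = shingles(text)
    return [min(_hash(s, seed) for s in sh) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, bands=16, threshold=0.7):
    """Return indices of documents kept after near-duplicate removal."""
    rows = NUM_HASHES // bands
    buckets, signatures, kept = {}, {}, []
    for idx, doc in enumerate(docs):
        sig = minhash(doc)
        # LSH banding: documents sharing any band become comparison candidates.
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        candidates = {j for key in keys for j in buckets.get(key, [])}
        if any(estimated_jaccard(sig, signatures[j]) >= threshold
               for j in candidates):
            continue  # near-duplicate of an already kept document
        kept.append(idx)
        signatures[idx] = sig
        for key in keys:
            buckets.setdefault(key, []).append(idx)
    return kept
```

Banding keeps the comparison cost low because only documents that collide in at least one band are compared directly. The semantic-deduplication pass described above would then run on the surviving documents with sentence embeddings.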
MD-EvalBench has three components. MD-KnowledgeEval covers theoretical MD concepts such as interatomic potentials, integration algorithms, equilibrium conditions, statistical ensembles, and simulation principles. LAMMPS-SyntaxEval focuses on command usage, syntax rules, parameter structures, and module semantics. LAMMPS-CodeGenEval measures text-to-script generation. The paper is internally inconsistent on the size of LAMMPS-SyntaxEval: the main text reports 333 questions, while the appendix statistics table reports 368 (Shi et al., 5 Jan 2026).
3. Post-training pipeline and specialized models
MDAgent2 uses a three-stage post-training pipeline over a Qwen3-8B backbone: continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). This produces two domain-specialized variants rather than two different architectures: MD-Instruct-8B and MD-Code-8B (Shi et al., 5 Jan 2026).
In continued pre-training, the system mixes MD-Knowledge with general-domain data in order to improve the model’s representation of materials terminology, simulation workflows, and structured conventions without abandoning general language competence. In supervised fine-tuning, MD-InstructQA is mixed with general instruction data, and a curated subset of MD-CodeGen is added as a cold-start seed. This stage yields MD-Instruct-8B, which is positioned as the domain-understanding and question-answering model, while also providing the foundation for later code-oriented optimization (Shi et al., 5 Jan 2026).
The RL stage applies MD-GRPO on code-generation tasks, using execution and evaluator feedback to optimize for executable and goal-achieving LAMMPS scripts. MD-Code-8B is therefore the Qwen3-8B backbone after CPT, SFT, and code-focused RL. The paper does not report tokenizer details, learning rates, batch sizes, epoch counts, optimizer settings, or data-mixing ratios for these stages, so the training pipeline is conceptually clear but only partially specified at the systems level (Shi et al., 5 Jan 2026).
The distinction between the two models is functional rather than architectural. MD-Instruct-8B is intended for molecular-dynamics comprehension, knowledge Q&A, and LAMMPS syntax understanding. MD-Code-8B is optimized for mapping natural-language simulation goals into executable and physically meaningful scripts. This specialization through staged training is central to the framework’s design (Shi et al., 5 Jan 2026).
4. MD-GRPO and execution-grounded reinforcement learning
MD-GRPO is the paper’s reinforcement-learning method for code generation. It turns LAMMPS script generation into a closed-loop optimization problem grounded in actual simulator behavior rather than preference labels. In each rollout, the policy model generates code, the code is executed through a scheduler, and an evaluator scores the result; those outcomes are then used to improve the policy under a GRPO-style framework (Shi et al., 5 Jan 2026).
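For orientation, the group-relative normalization used by GRPO in general can be sketched as follows. The paper's exact MD-GRPO objective is not given here, so this shows only the standard GRPO advantage computation over one group of rollouts for the same prompt; the function name and the `eps` stabilizer are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds the scalar rewards of several scripts generated for the
    same task. Scripts that beat the group average get positive advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason a graded, multi-dimensional reward is useful in this setting.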
The paper defines the total reward as a weighted sum of a format reward and a correctness reward, with a default weight assigned to each term. The format reward is binary: it requires a preceding reasoning tag and an `<answer>` tag to appear in order, with `<answer>` containing valid and structurally correct JSON. The correctness reward is built from expert-assigned bonus and penalty indicators over eight evaluation dimensions:
- Syntax Correctness
- Logical Consistency
- Parameter Rationality
- Core Logic Accuracy
- Logical Completeness
- Code Completeness
- Result Validity
- Physical Soundness
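The reward structure described above, a binary format check plus a rubric-based correctness term over these eight dimensions, can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the weight values are placeholders, the reasoning tag name `think` is an assumption, and averaging +1/-1/0 indicators is one simple way to combine bonuses and penalties.

```python
import json
import re

# Placeholder weights (assumption): the paper's default values are not given here.
W_FORMAT = 0.2
W_CORRECT = 0.8

DIMENSIONS = (
    "Syntax Correctness", "Logical Consistency", "Parameter Rationality",
    "Core Logic Accuracy", "Logical Completeness", "Code Completeness",
    "Result Validity", "Physical Soundness",
)


def format_reward(output, reasoning_tag="think"):
    """Return 1.0 if the tags appear in order and <answer> holds a JSON object.

    The reasoning tag name is an assumption; only <answer> is named in the text.
    """
    pattern = rf"<{reasoning_tag}>.*?</{reasoning_tag}>\s*<answer>(.*?)</answer>"
    match = re.search(pattern, output, flags=re.DOTALL)
    if match is None:
        return 0.0
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(payload, dict) else 0.0


def correctness_reward(indicators):
    """Average per-dimension indicators: +1 bonus, -1 penalty, 0 neutral."""
    return sum(indicators.get(d, 0) for d in DIMENSIONS) / len(DIMENSIONS)


def total_reward(output, indicators):
    """Weighted sum of the format and correctness terms."""
    return (W_FORMAT * format_reward(output)
            + W_CORRECT * correctness_reward(indicators))
```

In this sketch a script that fails the format check can still earn correctness credit. Whether the paper gates correctness on format is not stated here.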
Representative deductions include non-existent commands, using FCC instead of a required BCC lattice, mismatched units in `fix npt`, a missing thermostat or `pair_style`, a missing `run`, runtime abnormalities such as “lost atoms,” or physically implausible behavior such as temperature diverging from 300 K to 3000 K. This reward construction makes MD-GRPO execution-grounded and science-specific rather than preference-based (Shi et al., 5 Jan 2026).
A distinctive feature of MD-GRPO is low-reward trajectory recycling, also described as failure-feedback-tracking memory. When generated code fails to execute or receives a low evaluation score, the system records the failure cause and reconstructs the task context accordingly. The modified context is then reused as an additional training sample in the next iteration. The paper presents this as extending single-turn RL into a trajectory-aware process tailored to scientific code generation (Shi et al., 5 Jan 2026).
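Low-reward trajectory recycling can be pictured as a small transformation from failed rollouts to new training prompts. The sketch below is an assumption-laden illustration: the `Rollout` fields, the reward threshold, and the prompt wording are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    task: str                 # natural-language simulation request
    code: str                 # generated LAMMPS script
    reward: float             # total reward from execution and evaluation
    failure_cause: str = ""   # recorded error or evaluator comment, if any


def recycle_low_reward(rollouts, threshold=0.5):
    """Build extra training prompts from rollouts that failed or scored low.

    Each recycled prompt is the original task with the recorded failure cause
    appended, so the next iteration sees what went wrong.
    """
    recycled = []
    for r in rollouts:
        if r.reward < threshold and r.failure_cause:
            recycled.append(
                f"{r.task}\n\n"
                f"A previous attempt failed: {r.failure_cause}\n"
                "Avoid this failure in the new script."
            )
    return recycled
```

High-reward rollouts are left out of the recycled set, so the extra samples concentrate on failure modes the policy has actually exhibited.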
5. MDAgent2-RUNTIME and deployable self-correction
MDAgent2-RUNTIME is the deployable multi-agent system layered on top of the trained models. It consists of three functional nodes: a Code Generator, a Code Runner, and a Result Evaluator. Together they implement an iterative loop of generation, checking, sandboxed execution, scoring, and revision (Shi et al., 5 Jan 2026).
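The iterative loop across the three nodes can be sketched as a plain control loop. Every name below is hypothetical: `generate`, `check`, `execute`, and `evaluate` stand in for the Writer LLM, the syntax and potential tools, the sandboxed runner, and the Result Evaluator, and the threshold and iteration cap are placeholder values.

```python
def run_with_self_correction(task, generate, check, execute, evaluate,
                             threshold=6.0, max_iters=3):
    """Generate, check, execute, score, and revise until the score passes.

    generate(task, feedback) -> code string
    check(code)              -> list of problems found before execution
    execute(code)            -> simulation log (run in a sandbox in practice)
    evaluate(code, log)      -> (score, comments)
    Returns the best (code, score) pair seen; code is None if nothing ran.
    """
    feedback = None
    best_code, best_score = None, float("-inf")
    for _ in range(max_iters):
        code = generate(task, feedback)
        problems = check(code)
        if problems:
            # Skip execution and send tool findings back to the generator.
            feedback = "; ".join(problems)
            continue
        log = execute(code)
        score, comments = evaluate(code, log)
        if score > best_score:
            best_code, best_score = code, score
        if score >= threshold:
            break
        feedback = comments
    return best_code, best_score
```

Passing the nodes in as callables keeps the loop independent of any particular model or container setup, which makes it easy to test with stubs before wiring in real components.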
The Code Generator contains the Writer LLM. It drafts initial LAMMPS code and invokes two classes of tools: syntax tools and potential-file tools. The syntax tools assess correctness either semantically or by dry-run execution. The potential tools inspect whether specified interatomic potential files exist locally, attempt supplementation from local or official sources, and recommend the top-k most similar files when a potential is wrong or misspelled. The appendix lists tools such as `check_syntax_tool`, `check_lammps_potentials_tool`, `get_potential_file_info_tool`, `list_available_potentials_tool`, `find_similar_potentials_tool`, `check_potential_real_exists_tool`, `visualize_tool`, `evaluate_log_quality_by_rule_tool`, and `lammps_run_tool` (Shi et al., 5 Jan 2026).
The Code Runner executes candidate scripts inside a sandboxed Docker container for environment isolation and reproducibility. The Result Evaluator analyzes both the code and the simulation outputs using the same multi-dimensional rubric used in reward design. If the final score is below threshold, feedback is returned to the generator and the loop continues (Shi et al., 5 Jan 2026).
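Returning to the potential-file tools, the recommendation of similar files for a wrong or misspelled potential can be approximated with string similarity. The sketch below uses Python's standard `difflib` and is only an illustration: the function name, the similarity cutoff, and the file names are assumptions, and the paper's actual matching method is not described here.

```python
import difflib


def find_similar_potentials(requested, available, top_k=3, cutoff=0.6):
    """Suggest up to top_k available potential files resembling `requested`.

    Returns the requested name itself when it exists, otherwise the closest
    matches ranked by difflib's similarity ratio (best first).
    """
    if requested in available:
        return [requested]
    return difflib.get_close_matches(requested, available, n=top_k, cutoff=cutoff)


# Hypothetical local potential directory listing.
local_files = ["CuNi.eam.alloy", "SiC.tersoff", "water.tip4p"]
suggestions = find_similar_potentials("CuNi.eam", local_files)
```

With this listing, a request for `CuNi.eam` ranks `CuNi.eam.alloy` first. A production tool would likely also inspect file headers and element lists rather than rely on names alone.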
The runtime case study illustrates how this loop operates. For the task “Using LAMMPS to simulate the melting process of a Cu–Ni nanoparticle,” the first draft used `pair_coeff * * CuNi.eam Cu Ni`; the potential tool detected that `CuNi.eam` was unavailable and recommended `CuNi.eam.alloy`. After regeneration, the code passed syntax and potential checks and executed, but the evaluator initially assigned a final score of 4, below threshold, which triggered further iterations. The final script included `units metal`, periodic boundaries, FCC lattice construction, spherical nanoparticle generation, alloy potential setup, `nvt` heating, `ave/time` outputs, atom dumps, and a loop over temperatures (Shi et al., 5 Jan 2026).
6. Empirical performance
The empirical picture is strongest on two axes: domain Q&A and runtime-assisted executability. On the aggregate QA benchmark across MD-KnowledgeEval and LAMMPS-SyntaxEval, MD-Instruct-8B scores 74.67, outperforming Qwen3-8B at 70.50, Qwen3-14B at 72.91, and Qwen-flash at 73.47, while trailing Qwen3-32B at 77.34 and Qwen3-max at 82.49. On MD-KnowledgeEval specifically, MD-Instruct-8B reaches 76.89 versus 75.15 for Qwen3-8B. On LAMMPS-SyntaxEval, it scores 72.45 versus 65.84 for Qwen3-8B and 67.92 for Qwen3-14B, nearly matching Qwen3-32B at 72.74 (Shi et al., 5 Jan 2026).
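As a quick arithmetic check, the aggregate QA scores quoted above are consistent with an unweighted mean of the two component benchmarks. The aggregation rule is not stated here, so the sketch below only tests that assumption against the reported numbers.

```python
# (MD-KnowledgeEval, LAMMPS-SyntaxEval, reported aggregate) as quoted above.
REPORTED = {
    "MD-Instruct-8B": (76.89, 72.45, 74.67),
    "Qwen3-8B": (75.15, 65.84, 70.50),
}


def unweighted_mean(knowledge, syntax):
    """Simple average of the two benchmark scores."""
    return (knowledge + syntax) / 2


def consistent(name, tol=0.01):
    """True if the reported aggregate matches the mean within rounding."""
    knowledge, syntax, aggregate = REPORTED[name]
    return abs(unweighted_mean(knowledge, syntax) - aggregate) <= tol
```

Both models pass the check: MD-Instruct-8B averages to 74.67 exactly, and Qwen3-8B averages to 70.495, which rounds to the reported 70.50.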
The question-type breakdown shows asymmetry across task forms. MD-Instruct-8B obtains 86.15 on single-choice, 69.67 on multiple-choice, 64.06 on fill-in-the-blank, and 78.99 on open QA overall. It performs especially well on MD-KnowledgeEval open QA with 90.74, but on LAMMPS-SyntaxEval open QA it scores 67.23, below Qwen3-8B’s 79.11. The paper does not provide a detailed explanation for this discrepancy, but the gap suggests that open-ended syntax explanation remains harder than structured syntax recognition (Shi et al., 5 Jan 2026).
For code generation, the key quantitative result concerns executability under runtime self-correction. For MD-Code-8B, enabling MDAgent2-RUNTIME increases Exec-Success@3 from 14.23% to 37.95%, while Code Human Score increases slightly from 9.29 to 9.32. The paper interprets this as evidence that runtime self-correction primarily improves executability and reliability rather than already-high human-perceived script quality (Shi et al., 5 Jan 2026).
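To make the size of the executability gain concrete, the sketch below computes the absolute and relative change from the two reported percentages. It also includes one plausible reading of Exec-Success@k, namely the share of tasks for which at least one of the first k attempts executes. That definition is an assumption, since the metric is not defined in the text above.

```python
def exec_success_at_k(attempts_per_task, k=3):
    """Percent of tasks with at least one successful run in the first k attempts.

    `attempts_per_task` is a list of lists of booleans (True = script executed).
    This definition is assumed for illustration.
    """
    if not attempts_per_task:
        return 0.0
    solved = sum(any(attempts[:k]) for attempts in attempts_per_task)
    return 100.0 * solved / len(attempts_per_task)


# Reported Exec-Success@3 for MD-Code-8B without and with the runtime.
before, after = 14.23, 37.95
absolute_gain = after - before   # in percentage points
relative_gain = after / before   # multiplicative factor
```

The reported figures correspond to a gain of 23.72 percentage points, or roughly a 2.67-fold increase, while the human score moves by only 0.03.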
The paper also states that MDAgent2-RUNTIME consistently outperforms the earlier MDAgent framework across the three backbone models tested, though the provided text does not list the full per-backbone table. The strongest supported conclusion is therefore that domain-specific post-training improves MD Q&A and syntax understanding, and that execution-grounded runtime loops materially improve the probability of obtaining a working LAMMPS script (Shi et al., 5 Jan 2026).
7. Limitations, significance, and relation to adjacent MD agent systems
The paper is strongest as a full-stack demonstration—from data construction to deployment and benchmarking—but it is less complete in formal optimization details and systems reporting. It does not provide the full GRPO objective, the KL term, full training hyperparameters, cost or latency measurements, or comprehensive ablation tables for CPT-only, SFT-only, and CPT+SFT+RL variants. It also acknowledges that current dataset coverage spans thermodynamic property computation, fluid dynamics simulations, and mechanical property simulations, but not the full range of MD tasks. As future work, it proposes integrating multimodal LLMs so that thermodynamic curves, atomic trajectories, and structural-evolution images or GIFs can enter the evaluation loop directly (Shi et al., 5 Jan 2026).
Within the broader landscape of agentic MD systems, MDAgent2 occupies a narrower niche than some later frameworks. MDAgent focuses on end-to-end MD research, integrating problem understanding, literature-guided strategy design, simulation execution, trajectory analysis, mechanistic interpretation, and quality supervision, while using case-based learning through Skill and Memory for cross-task transfer without retraining (Ma et al., 18 Apr 2026). MDForge, by contrast, treats MD pipeline design as open-ended code generation under sparse simulator feedback and reshapes behavior online through verbal reward and multi-agent debate among physics experts (Wang et al., 11 Jun 2026). This suggests that MDAgent2 is most distinctive where domain-adapted language modeling, LAMMPS-specific code generation, and deployable execution-grounded correction are the primary requirements, while adjacent systems push further toward autonomous research planning or open-ended protocol design.
Taken together, MDAgent2 establishes a methodological template for scientific-code agents in molecular simulation: domain-specific corpora, task-specific instruction data, execution-grounded post-training, sandboxed runtime verification, and benchmarked evaluation. Its principal significance lies in showing that a relatively lightweight 8B model, when specialized through MD-aware data, reward design, and runtime tooling, can outperform larger generic models on in-domain QA and substantially improve LAMMPS script executability (Shi et al., 5 Jan 2026).