CodeDistiller: Scientific Code Library
- CodeDistiller is an automated system that ingests, filters, distills, debugs, and validates scientific code from public repositories to build curated libraries.
- The system features a multi-stage pipeline including repository purpose summarization, file classification, automated example generation with debugging, and expert validation for scientific accuracy.
- Empirical evaluations in materials science show enhanced performance of downstream Automated Scientific Discovery (ASD) agents, with significant improvements in the completeness, accuracy, and scientific robustness of experimental setups.
CodeDistiller is an automated system for constructing large-scale, vetted libraries of scientific code examples from public repositories, designed to enhance the capabilities of experiment-driven Automated Scientific Discovery (ASD) agents. By systematically ingesting, filtering, distilling, debugging, and validating real-world domain code, CodeDistiller addresses a key bottleneck in ASD: the limited code-generating capacity of agents relying on either parametric (latent) knowledge or a small set of hand-curated experiment templates (Jansen et al., 30 Nov 2025).
1. System Architecture and Distillation Pipeline
CodeDistiller operates in four principal, tightly coupled stages:
1. Repository Ingestion & Purpose Identification
- Curators specify a set of domain-relevant libraries (e.g., PyMatgen, ASE, LAMMPS for materials science).
- GitHub’s API is queried for permissively licensed repositories importing these libraries.
- Each repository’s README.md is summarized using an LLM, producing a one-sentence "core purpose" statement (e.g., "Perform molecular dynamics simulations of metal alloys").
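A minimal sketch of this ingestion step, assuming the GitHub REST code-search endpoint and repository-level license metadata; the exact query strategy and LLM client are not specified in the source, and `summarize_readme` is a hypothetical helper. Pagination and rate-limit handling are omitted for brevity.

```python
import requests

GITHUB_CODE_SEARCH = "https://api.github.com/search/code"
DOMAIN_LIBRARIES = ["pymatgen", "ase", "lammps"]                    # curator-specified
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}  # accepted SPDX IDs

def find_candidate_repos(token: str) -> set[str]:
    """Find repositories containing Python files that import a domain library."""
    headers = {"Authorization": f"token {token}", "Accept": "application/vnd.github+json"}
    repos = set()
    for lib in DOMAIN_LIBRARIES:
        query = f'"import {lib}" in:file language:python'
        resp = requests.get(GITHUB_CODE_SEARCH, headers=headers,
                            params={"q": query, "per_page": 100})
        resp.raise_for_status()
        for item in resp.json()["items"]:
            repos.add(item["repository"]["full_name"])
    return repos

def is_permissively_licensed(full_name: str, token: str) -> bool:
    """Keep only repositories whose detected license is in the permissive set."""
    resp = requests.get(f"https://api.github.com/repos/{full_name}",
                        headers={"Authorization": f"token {token}"})
    license_info = resp.json().get("license") or {}
    return license_info.get("spdx_id") in PERMISSIVE

# For each retained repo, the README is then condensed to one sentence, e.g.:
# purpose = summarize_readme(readme_text)   # hypothetical LLM call
```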
2. Relevant File Classification
- Every file is classified by an LLM into a hierarchical scheme: coarse types (code, docs, scripts, data) and subtypes (example, API, config).
- Each file is scored 1–5 for relevance; GPU requirements and external data calls are extracted as additional metadata.
- The top-K most relevant files (typically 5–10) are passed to the next phase, optimizing LLM context efficiency.
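The classification step can be illustrated with a small sketch, assuming an LLM client that returns per-file metadata as JSON; the prompt text, field names, and `llm.complete` call are illustrative rather than the published implementation.

```python
import json
from dataclasses import dataclass

CLASSIFY_PROMPT = """Classify the file below.
Return JSON with keys: coarse_type (code|docs|scripts|data),
subtype (example|API|config), relevance (integer 1-5),
needs_gpu (true|false), external_data (true|false).
---
{content}"""

@dataclass
class FileRecord:
    path: str
    coarse_type: str
    subtype: str
    relevance: int        # 1 = irrelevant .. 5 = core usage example
    needs_gpu: bool
    external_data: bool

def classify_file(llm, path: str, content: str) -> FileRecord:
    """Ask the LLM for a structured classification of a single repository file."""
    raw = llm.complete(CLASSIFY_PROMPT.format(content=content[:8000]))  # truncate for context
    return FileRecord(path=path, **json.loads(raw))

def select_top_k(records: list[FileRecord], k: int = 10) -> list[FileRecord]:
    """Pass only the K most relevant files to the generation stage."""
    return sorted(records, key=lambda r: r.relevance, reverse=True)[:k]
```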
3. Example Generation & Automated Debugging
- A specialized LLM prompt merges the repo purpose and file contents, requesting three artifacts:
  - A Python script demonstrating the core function.
  - A Conda or Bash runscript specifying dependencies.
  - A JSON metadata record (resources, rationale, inclusion/exclusion logic).
- Candidate scripts are executed inside a containerized (Ubuntu) environment; logs, figures, and outputs are captured.
- An "LLM-as-judge" agent inspects execution logs and outputs. Failures trigger reflective, error-aware re-prompting of the code generator for up to N=8 iterations. Persistent failures are marked as unsuccessful.
4. Expert Validation and Library Organization
- For a held-out sample, domain experts manually verify:
  - Successful, error-free execution.
  - Accurate demonstration of core repository capabilities.
  - Scientifically faithful output.
- Successfully distilled instances are indexed in a searchable library by core-purpose keywords, required libraries, compute needs, and domain-specific tags (e.g., "molecular dynamics", "phase diagram plotting").
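A sketch of what one indexed library record might look like, based on the metadata fields listed above; the schema and field names are illustrative, not the published format.

```python
from dataclasses import dataclass, field

@dataclass
class LibraryEntry:
    """One distilled, validated code example in the searchable library."""
    example_id: str
    core_purpose: str                  # e.g. "Perform molecular dynamics simulations of metal alloys"
    script_path: str                   # the distilled Python example
    runscript_path: str                # Conda/Bash dependency and execution script
    required_libraries: list[str]      # e.g. ["pymatgen", "ase"]
    compute: dict                      # e.g. {"gpu": False, "ram_gb": 8}
    domain_tags: list[str] = field(default_factory=list)   # e.g. ["molecular dynamics"]
    expert_verified: bool = False      # set True for the held-out, expert-checked sample
```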
Pseudo-code for the Complete Loop:
```python
for repo in github_query(domain_libraries):
    purpose = LLM.summarize(repo.readme)
    scored_files = []
    for f in repo.list_files():
        ftype, subtype, relevance, meta = LLM.classify(f.content)
        scored_files.append((f, relevance, meta))
    top_files = select_top_k(scored_files, k=10)
    example = initialize_example(purpose, top_files)

    success = False
    for iteration in range(1, max_iters + 1):          # max_iters = 8
        script, deps, runscript, meta = LLM.generate(example.context)
        result = container.execute(runscript)
        verdict, feedback = LLM_judge.assess(result, purpose)
        if verdict == "pass":
            library.add(example.id, script, deps, meta)
            success = True
            break
        example.context = example.context + feedback   # reflective, error-aware re-prompt
    if not success:
        mark_failure(repo)
```
2. Automated and Expert Evaluation Metrics
Effectiveness and fidelity are quantified through both automatic and human-in-the-loop evaluations:
| Evaluation Level | Success Metric | Criteria |
|---|---|---|
| LLM-as-judge | Pass/fail per code example | Code executes and output matches core purpose |
| Domain expert | Triple-metric (binary per sample) | Runs without error; demonstrates core repository function; output is scientifically correct |
In addition, runtime, API cost per example, and mean number of debug iterations are tracked.
3. Empirical Performance and Cost
In a study of 250 materials science repositories:
| Model | Auto-success (%) | Expert-verified (%) | Avg. runtime (min) | Avg. cost ($) | Avg. debug iters |
|---|---|---|---|---|---|
| GPT-OSS-120B | 61.6 | 25.9 | 13.8 | 0.09 | 2.4 |
| GPT-5 | 70.4 | 60.5 | 20.3 | 0.70 | 2.2 |
| Claude 4.5 | 75.6 | 74.1 | 19.0 | 1.71 | 1.9 |
The leading configuration (Claude 4.5) yielded a 74.1% domain-expert-verified rate of functionally correct, scientifically faithful examples.
Performance formulas:
- Success % = (repos passing the LLM judge) / (total repos)
- Expert fidelity = (expert-verified correct samples) / (auto-passed samples)
- AvgCost = Σ(iter_i × token_cost) / N_success
- AvgRuntime = Σ(container_time_i) / N_success
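These quantities can be computed from per-repository run records; a minimal sketch, assuming each record carries the judge verdict, expert verdict (where sampled), cost, runtime, and debug-iteration count, with averages taken over successful examples as in the formulas above.

```python
def pipeline_metrics(runs: list[dict]) -> dict:
    """Aggregate the success, fidelity, cost, and runtime metrics defined above."""
    total = len(runs)
    passed = [r for r in runs if r["judge_pass"]]                       # auto-success
    sampled = [r for r in passed if r.get("expert_pass") is not None]   # expert-checked subset
    return {
        "auto_success_pct": 100 * len(passed) / total,
        "expert_fidelity_pct": 100 * sum(r["expert_pass"] for r in sampled) / len(sampled),
        "avg_cost_usd": sum(r["cost_usd"] for r in passed) / len(passed),
        "avg_runtime_min": sum(r["runtime_min"] for r in passed) / len(passed),
        "avg_debug_iters": sum(r["debug_iters"] for r in passed) / len(passed),
    }
```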
4. Impact on Downstream ASD Agents
CodeDistiller-derived libraries substantially enhance ASD agent capabilities. Integration into an experiment-generation agent (CodeScientist) was benchmarked against a baseline library of generic materials-science code:
| Preference | Baseline (%) | CodeDistiller-augmented (%) | Tie (%) |
|---|---|---|---|
| Across completeness, accuracy, and soundness | 22–29 | 48–50 | 25–30 |
A concrete example: in a molecular dynamics application for GeSbTe alloys, the CodeDistiller-augmented agent used a community-derived CHGNet potential setup, yielding physically plausible volume changes (–16% to +75%), whereas the baseline agent’s generic Lennard-Jones setup produced nonphysical results (volume collapse >80%).
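The difference between the two setups can be illustrated with a short ASE sketch; the structure, parameters, and MD settings below are placeholders, and the commented CHGNet calculator import is an assumption about the `chgnet` package rather than the configuration used in the paper.

```python
from ase import units
from ase.build import bulk
from ase.calculators.lj import LennardJones
from ase.md.langevin import Langevin

# Placeholder structure (elemental Ge supercell), not the actual GeSbTe alloy.
atoms = bulk("Ge", "diamond", a=5.66) * (3, 3, 3)

# Baseline-style setup: a generic pair potential with default parameters,
# which encodes no knowledge of the alloy's real chemistry.
atoms.calc = LennardJones()

# CodeDistiller-style setup: a community machine-learned interatomic potential.
# (Assumed API of the chgnet package; verify against its documentation.)
# from chgnet.model.dynamics import CHGNetCalculator
# atoms.calc = CHGNetCalculator()

dyn = Langevin(atoms, timestep=1.0 * units.fs, temperature_K=600, friction=0.02)
dyn.run(1000)  # short illustrative trajectory; volume/energy can then be inspected
```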
5. Library Structure and Retrieval
Distilled examples are stored in a structured library, indexed by:
- Core-purpose keywords (e.g., "DFT workflow", "crystal structure enumeration").
- Required packages, CPU/GPU/RAM needs.
- Domain metadata tags.
Downstream agents (e.g., Code-RAG, CodeScientist) retrieve code snippets via similarity search over keywords and resource requirements, then patch user parameters for new experiments (e.g., substituting materials or simulation conditions).
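A minimal sketch of such retrieval, assuming entries carry the purpose, tag, and compute fields described above; the keyword-overlap scoring is illustrative, not the published retrieval method.

```python
def retrieve(library: list[dict], query_keywords: set[str],
             gpu_available: bool, top_n: int = 3) -> list[dict]:
    """Rank library entries by keyword overlap, filtered by compute constraints."""
    def score(entry: dict) -> int:
        tags = set(entry["domain_tags"]) | set(entry["core_purpose"].lower().split())
        return len(tags & query_keywords)

    feasible = [e for e in library
                if gpu_available or not e["compute"].get("gpu", False)]
    return sorted(feasible, key=score, reverse=True)[:top_n]

# Example: CPU-only search for molecular dynamics examples.
# hits = retrieve(library_index, {"molecular", "dynamics", "alloy"}, gpu_available=False)
```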
6. Limitations and Extensibility
Identified limitations include:
- Expert validation is time-limited; fully exhaustive test suites are not feasible.
- Automatic repository filtering yields ~50% "out-of-domain" code imports; more advanced, graph-based or topic-model filtering is a proposed improvement.
- Current failure modes include complex multi-repo workflows, private or data-protected assets, and interactive GUIs.
- Extensibility is feasible in biology (e.g., genomics), chemistry (e.g., reaction kinetics), climate modeling, and robotics where large open-source codebases exist.
- Human-in-the-loop strategies (spot checks, unit-test augmentation) may push the verified example rate toward 100%.
7. Significance for the Automated Scientific Discovery Ecosystem
By automatically vetting and curating large-scale, domain-specific code corpora from community repositories, CodeDistiller overcomes critical constraints of ASD systems relying on parametric code generation or a few hand-constructed templates. It enables agents to rapidly construct and execute complex, scientifically robust experiments, while reducing the need for manual engineering and custom curation. This both improves experiment coverage and accelerates iteration across data-rich scientific domains (Jansen et al., 30 Nov 2025).