CodeDistiller: Scientific Code Library

Updated 7 December 2025
  • CodeDistiller is an automated system that ingests, filters, distills, debugs, and validates scientific code from public repositories to build curated libraries.
  • The system features a multi-stage pipeline including repository purpose summarization, file classification, automated example generation with debugging, and expert validation for scientific accuracy.
  • Empirical evaluations in materials science show enhanced Automated Scientific Discovery (ASD) agent performance, with significant improvements in the completeness, accuracy, and scientific robustness of experimental setups.

CodeDistiller is an automated system for constructing large-scale, vetted libraries of scientific code examples from public repositories, designed to enhance the capabilities of experiment-driven Automated Scientific Discovery (ASD) agents. By systematically ingesting, filtering, distilling, debugging, and validating real-world domain code, CodeDistiller addresses a key bottleneck in ASD: the limited code-generating capacity of agents relying on either parametric (latent) knowledge or a small set of hand-curated experiment templates (Jansen et al., 30 Nov 2025).

1. System Architecture and Distillation Pipeline

CodeDistiller operates in four principal, tightly coupled stages:

1. Repository Ingestion & Purpose Identification

  • Curators specify a set of domain-relevant libraries (e.g., PyMatgen, ASE, LAMMPS for materials science).
  • GitHub’s API is queried for permissively licensed repositories importing these libraries.
  • Each repository’s README.md is summarized by an LLM, producing a one-sentence "core purpose" statement (e.g., "Perform molecular dynamics simulations of metal alloys"); a minimal ingestion sketch follows this list.
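
The sketch below illustrates this ingestion step in Python with the requests library against GitHub's REST repository-search and readme endpoints. The query string, the permissive-license whitelist, and the summarize_purpose stub are illustrative assumptions, not the paper's exact implementation (the real pipeline targets repositories that import the specified libraries):

import base64
import requests

SEARCH_URL = "https://api.github.com/search/repositories"          # GitHub REST search endpoint
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}  # assumed license whitelist

def find_candidate_repos(library: str, token: str, per_page: int = 30) -> list[dict]:
    """Return permissively licensed repositories associated with a domain library."""
    resp = requests.get(
        SEARCH_URL,
        params={"q": f"{library} language:Python", "sort": "stars", "per_page": per_page},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return [r for r in resp.json()["items"]
            if (r.get("license") or {}).get("key") in PERMISSIVE]

def fetch_readme(full_name: str, token: str) -> str:
    """Download and decode README.md for LLM summarization."""
    resp = requests.get(
        f"https://api.github.com/repos/{full_name}/readme",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return base64.b64decode(resp.json()["content"]).decode("utf-8", errors="replace")

def summarize_purpose(readme_text: str) -> str:
    """Placeholder: prompt an LLM for a one-sentence 'core purpose' statement."""
    raise NotImplementedError("call your LLM of choice here")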

2. Relevant File Classification

  • Every file is classified by an LLM into a hierarchical scheme: coarse types (code, docs, scripts, data) and subtypes (example, API, config).
  • Each file is scored 1–5 for relevance; GPU requirements and external data calls are extracted as additional metadata.
  • The top-K most relevant files (typically 5–10) are passed to the next phase, keeping the LLM context compact (see the classification sketch after this list).
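
A compact sketch of the classification record and top-K selection follows; the field names and exact type vocabularies are assumptions beyond the coarse/subtype scheme, the 1–5 relevance score, and the metadata described above:

from dataclasses import dataclass

@dataclass
class FileClassification:
    path: str
    coarse_type: str      # "code", "docs", "scripts", or "data"
    subtype: str          # e.g. "example", "API", "config"
    relevance: int        # 1 (irrelevant) .. 5 (highly relevant)
    needs_gpu: bool       # extracted metadata: GPU requirement
    external_data: bool   # extracted metadata: external data calls

def select_top_k(classified: list[FileClassification], k: int = 10) -> list[FileClassification]:
    """Keep only the k most relevant files to bound the generator's context."""
    return sorted(classified, key=lambda c: c.relevance, reverse=True)[:k]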

3. Example Generation & Automated Debugging

  • A specialized LLM prompt merges the repo purpose and file contents, requesting three artifacts:

    1. A Python script demonstrating the core function.
    2. A Conda or Bash runscript specifying dependencies.
    3. A JSON metadata record (resources, rationale, inclusion/exclusion logic).
  • Candidate scripts are executed inside a containerized (Ubuntu) environment; logs, figures, and outputs are captured (an execution sketch follows this list).

  • An "LLM-as-judge" agent inspects execution logs and outputs. Failures trigger reflective, error-aware re-prompting of the code generator for up to N=8 iterations. Persistent failures are marked as unsuccessful.

4. Expert Validation and Library Organization

  • For a held-out sample, domain experts manually verify:
    • Successful, error-free execution.
    • Accurate demonstration of core repository capabilities.
    • Scientifically faithful output.
  • Successfully distilled instances are indexed in a searchable library by core-purpose keywords, required libraries, compute needs, and domain-specific tags (e.g., "molecular dynamics", "phase diagram plotting"); an example library record is sketched below.
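
One possible shape for a library record is sketched here; the field names and the JSON-lines index format are assumptions, while the indexed facets (purpose keywords, required libraries, compute needs, domain tags) follow the description above:

import json
from dataclasses import dataclass, field, asdict

@dataclass
class LibraryEntry:
    example_id: str
    core_purpose: str                               # one-sentence repository purpose
    script_path: str                                # distilled Python example
    runscript_path: str                             # Conda/Bash environment and launch script
    required_libraries: list[str] = field(default_factory=list)
    compute: dict = field(default_factory=dict)     # e.g. {"gpu": False, "ram_gb": 8}
    domain_tags: list[str] = field(default_factory=list)   # e.g. ["molecular dynamics"]

def append_to_index(entry: LibraryEntry, index_path: str = "library_index.jsonl") -> None:
    """Append a validated example to a JSON-lines index for later retrieval."""
    with open(index_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(entry)) + "\n")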

Pseudo-code for the Complete Loop:

MAX_ITERS = 8  # reflective debugging budget per example

for repo in github_query(domain_libraries):                  # permissively licensed repos only
    purpose = LLM.summarize(repo.readme_text)                # one-sentence "core purpose"
    scored_files = []
    for f in repo.list_files():
        ftype, subtype, relevance, meta = LLM.classify(f.content)   # coarse type, subtype, 1-5 score, metadata
        scored_files.append((f, relevance, meta))
    top_files = select_top_k(scored_files, k=10)             # bound the generator's context
    example = initialize_example(purpose, top_files)
    succeeded = False
    for attempt in range(MAX_ITERS):
        script, runscript, meta = LLM.generate(example.context)     # Python script, runscript, JSON metadata
        result = container.execute(runscript)                # isolated Ubuntu container
        verdict, feedback = LLM_judge.assess(result, purpose)
        if verdict == "pass":
            library.add(repo.id, script, runscript, meta)
            succeeded = True
            break
        example.context = example.context + feedback         # reflective, error-aware re-prompt
    if not succeeded:
        mark_failure(repo)

2. Automated and Expert Evaluation Metrics

Effectiveness and fidelity are quantified through both automatic and human-in-the-loop evaluations:

| Evaluation level | Success metric | Criteria |
|---|---|---|
| LLM-as-judge | Pass/fail per code example | Code executes and output matches the core purpose |
| Domain expert | Triple metric (binary per sample) | Runs without error; demonstrates the core repository function; output is scientifically correct |

In addition, runtime, API cost per example, and mean number of debug iterations are tracked.
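
These two evaluation levels can be represented as simple records; the field names below are assumptions, while the criteria mirror the table above:

from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    example_id: str
    passed: bool              # code executed and output matched the core purpose
    feedback: str             # error-aware feedback fed to the next debug iteration
    debug_iteration: int      # 1..8
    runtime_min: float        # tracked alongside API cost per example
    cost_usd: float

@dataclass
class ExpertReview:
    example_id: str
    runs_without_error: bool
    demonstrates_core_function: bool
    scientifically_correct: bool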

3. Empirical Performance and Cost

In a study of 250 materials science repositories:

| Model | Auto-success (%) | Expert-verified (%) | Avg. runtime (min) | Avg. cost ($) | Avg. debug iters |
|---|---|---|---|---|---|
| GPT-OSS-120B | 61.6 | 25.9 | 13.8 | 0.09 | 2.4 |
| GPT-5 | 70.4 | 60.5 | 20.3 | 0.70 | 2.2 |
| Claude 4.5 | 75.6 | 74.1 | 19.0 | 1.71 | 1.9 |

The leading configuration (Claude 4.5) yielded a 74% rate of domain-expert-verified, functionally correct, and scientifically faithful examples.

Performance formulas (a small worked example follows the list):

  • Success % = (passing judge) / (total repos)
  • Expert fidelity = (expert-verified correct) / (samples auto-passed)
  • AvgCost = Σ(iter_i × token_cost) / N_success
  • AvgRuntime = Σ(container_time_i) / N_success
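
A tiny worked example of these aggregates, using hypothetical per-repository records (the field names are assumptions):

results = [  # one record per repository run
    {"judge_pass": True,  "expert_ok": True,  "cost_usd": 1.60, "runtime_min": 18.0},
    {"judge_pass": True,  "expert_ok": False, "cost_usd": 1.90, "runtime_min": 21.0},
    {"judge_pass": False, "expert_ok": None,  "cost_usd": 2.40, "runtime_min": 25.0},
]

passed = [r for r in results if r["judge_pass"]]
success_pct = 100 * len(passed) / len(results)                       # auto-success over all repos
expert_fidelity = sum(r["expert_ok"] for r in passed) / len(passed)  # expert-verified among auto-passed samples
avg_cost = sum(r["cost_usd"] for r in passed) / len(passed)          # $ per successful example
avg_runtime = sum(r["runtime_min"] for r in passed) / len(passed)    # minutes per successful example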

4. Impact on Downstream ASD Agents

CodeDistiller-derived libraries substantially enhance ASD agent capabilities. Integration into an experiment-generation agent (CodeScientist) was benchmarked against a baseline library of generic materials-science code:

| Preference criterion | Baseline preferred (%) | CodeDistiller-augmented preferred (%) | Tie (%) |
|---|---|---|---|
| Completeness / accuracy / soundness (range across criteria) | 22–29 | 48–50 | 25–30 |

A concrete example: in a molecular dynamics application for GeSbTe alloys, the CodeDistiller-augmented agent used a community-derived CHGNet potential setup, yielding physically plausible volume changes (–16% to +75%), whereas the baseline agent’s generic Lennard-Jones setup produced nonphysical results (volume collapse >80%).

5. Library Structure and Retrieval

Distilled examples are stored in a structured library, indexed by:

  • Core-purpose keywords (e.g., "DFT workflow", "crystal structure enumeration").
  • Required packages, CPU/GPU/RAM needs.
  • Domain metadata tags.

Downstream agents (e.g., Code-RAG, CodeScientist) retrieve code snippets via similarity search over keywords and resource requirements, then patch user parameters for new experiments (e.g., substituting materials or simulation conditions).
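
A minimal retrieval sketch over entries shaped like the library record in Section 1 is shown below, assuming simple keyword-overlap scoring; the actual retrievers in Code-RAG and CodeScientist may use embeddings or other similarity measures:

def score(entry: dict, query_keywords: set[str], gpu_available: bool) -> float:
    """Rank a library entry by keyword overlap, filtering out infeasible compute needs."""
    if entry["compute"].get("gpu", False) and not gpu_available:
        return 0.0                                   # skip GPU-only examples when no GPU is available
    tags = set(entry["domain_tags"]) | set(entry["core_purpose"].lower().split())
    return len(tags & query_keywords) / max(len(query_keywords), 1)

def retrieve(index: list[dict], query: str, gpu_available: bool = False, top_n: int = 3) -> list[dict]:
    """Return the top-n candidate examples for an experiment description."""
    keywords = set(query.lower().split())
    return sorted(index, key=lambda e: score(e, keywords, gpu_available), reverse=True)[:top_n]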

6. Limitations and Extensibility

Identified limitations include:

  • Expert validation is time-limited, so fully exhaustive test suites are not feasible.
  • Automatic repository filtering still yields roughly 50% out-of-domain code imports; more advanced graph-based or topic-model filtering is a proposed improvement.
  • Current failure modes include complex multi-repo workflows, private or data-protected assets, and interactive GUIs.
  • Extensibility is feasible in biology (e.g., genomics), chemistry (e.g., reaction kinetics), climate modeling, and robotics where large open-source codebases exist.
  • Human-in-the-loop strategies (spot checks, unit-test augmentation) may push the verified example rate toward 100%.

7. Significance for the Automated Scientific Discovery Ecosystem

By automatically vetting and curating large-scale, domain-specific code corpora from community repositories, CodeDistiller overcomes critical constraints of ASD systems relying on parametric code generation or a few hand-constructed templates. It enables agents to rapidly construct and execute complex, scientifically robust experiments, while reducing the need for manual engineering and custom curation. This both improves experiment coverage and accelerates iteration across data-rich scientific domains (Jansen et al., 30 Nov 2025).
