DockSmith: Agentic Docker Build System
- DockSmith is an agentic Docker builder that reframes environment setup as a dynamic task involving tool use, dependency reasoning, and iterative failure recovery.
- It leverages a multi-agent system—including context retrieval, Dockerfile synthesis, and log-driven evaluation—to progressively repair and validate real-world builds.
- The system achieves marked performance gains via loop-detection and cross-task memory, significantly improving build reliability across diverse programming ecosystems.
DockSmith is a specialized agentic Docker builder introduced to address the core bottleneck in execution-grounded software engineering (SWE) pipelines: reliable and scalable environment construction. Unlike conventional systems that treat Docker-based environment setup as a preprocessing step, DockSmith models environment construction as a complex, long-horizon agentic task involving tool use, dependency reasoning, and iterative failure recovery. By embedding environment setup into the agentic learning loop, DockSmith yields significant improvements in build reliability and produces supervision that transfers to broader code synthesis and repair tasks (Zhang et al., 31 Jan 2026).
1. Motivation and Problem Domain
Execution-grounded training and evaluation of SWE agents require that repositories build and run deterministically in isolated environments, typically using Docker. However, Docker-based builds routinely fail on diverse real-world projects due to inconsistent or absent manifests, complex native dependencies, undocumented system-level configuration, and build reproducibility issues. The Multi-Docker-Eval benchmark, explicitly designed to measure environment construction robustness, demonstrates that even state-of-the-art closed- and open-source models achieve success rates below 40%, with each failed build aborting the agentic pipeline and sharply reducing data yield for downstream execution-grounded learning. DockSmith reframes environment setup as a core agentic task, treating repair, log interpretation, dependency installation, and diagnosis as learnable, verifiable behaviors with supervision that generalizes to other SWE challenges.
2. System Architecture and Workflow
DockSmith extends the SWE-Factory multi-agent pipeline architecture, orchestrating four specialized agents in an iterative repair loop:
- Context Retrieval Agent: Extracts dependency manifests, build scripts, entry points, and language/runtime information using file-system navigation, manifest parsing, and CI config analysis.
- Dockerfile Agent: Synthesizes new or patches existing Dockerfiles based on contextual signals and historic build outputs, adapting installation sequences and base images.
- Eval Script Agent: Generates shell scripts to check out the correct commit, apply test suite patches, invoke in-container testing, and produce standardized exit codes.
- Test Analysis Agent: Executes docker build/run, parses failure logs (e.g., missing libraries, compilation errors), and formalizes structured remediation proposals.
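As a concrete illustration of the Eval Script Agent's output contract, the sketch below renders a standardized evaluation script as a Python string. The helper name, script layout, and argument names are assumptions for illustration, not the paper's actual implementation:

```python
def make_eval_script(commit_sha: str, patch_path: str, test_cmd: str) -> str:
    """Render a standardized eval script: check out the target commit,
    apply the test-suite patch, run the tests in-container, and emit
    exit code 0 on pass / 1 on fail. (Hypothetical helper, for illustration.)"""
    return "\n".join([
        "#!/bin/sh",
        "set -e",  # abort immediately if checkout or patching fails
        f"git checkout {commit_sha}",
        f"git apply {patch_path}",
        # Test failure must map to a standardized exit code, not a crash
        f"if {test_cmd}; then exit 0; else exit 1; fi",
    ])

script = make_eval_script("abc123", "/tmp/tests.patch", "pytest -q")
```

The standardized exit codes are what let the Test Analysis Agent treat heterogeneous test harnesses uniformly.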
Two principal extensions differentiate DockSmith from vanilla SWE-Factory setups:
- Loop-Detection Controller: Monitors the k most recent agent/failure signature pairs. On detection of non-progressing cycles (same agents/failures for m+ steps), it diversifies agent strategies (e.g., alternative demonstrations, priority shifts) to break deadlocks.
- Cross-Task Success Memory: Maintains a scalable, global cache of (Dockerfile, eval script) pairs from previously verified successes. For a new repository, similar past solutions (matched on language ecosystem and dependency footprint) are surfaced as demonstrations to seed rapid convergence.
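A minimal sketch of such a memory, assuming matching on exact language plus Jaccard similarity of dependency sets; the paper specifies only "language ecosystem and dependency footprint" as matching signals, so the similarity function and ranking are illustrative choices:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two dependency footprints."""
    return len(a & b) / len(a | b) if a | b else 0.0

class SuccessMemory:
    """Global cache of verified (Dockerfile, eval script) pairs (sketch)."""
    def __init__(self):
        self.entries = []  # (language, deps, dockerfile, eval_script)

    def store(self, language, deps, dockerfile, eval_script):
        self.entries.append((language, frozenset(deps), dockerfile, eval_script))

    def retrieve(self, language, deps, k=3):
        """Return up to k past solutions from the same ecosystem,
        ranked by dependency-footprint overlap, to seed demonstrations."""
        same_lang = [e for e in self.entries if e[0] == language]
        ranked = sorted(same_lang,
                        key=lambda e: jaccard(set(e[1]), set(deps)),
                        reverse=True)
        return [(df, ev) for (_, _, df, ev) in ranked[:k]]
```

A retrieved pair is surfaced to the Dockerfile and Eval Script agents as an in-context demonstration rather than copied verbatim.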
The core workflow is an iterative build→test→diagnose→patch→rebuild loop, dynamically controlled to prevent stalling and to maximize the reuse of agentic experience across heterogeneous repository structures.
3. Data Curation, Training Regimen, and Curriculum
DockSmith is trained on large-scale, execution-grounded Docker-building trajectories obtained through a multi-stage pipeline:
- Data Sourcing: Over 15 000 GitHub repositories (each >500 stars, >200 forks, spanning 10 languages) are used. Filtering restricts attention to merged pull requests involving test or CI configuration, ensuring that all data instances are backed by human-validated, executable ground truth. Short PR descriptions are expanded using a general-purpose LLM to increase specification clarity.
- Agentic Trajectory Generation: For each PR, agents replay a patch within a fresh container, iterating until test pass or a maximal step count. Loop-Detection and Cross-Task Memory mechanisms operate throughout, and only verified-successful rollouts—fully capturing tool calls, file edits, build logs, and diagnoses—are retained.
- Curriculum Shaping: Dockerfiles are scored on complexity as $C = L + R + P$, where $L$ is the number of Dockerfile lines, $R$ is the number of RUN instructions, and $P$ is the count of distinct apt-get/apt install packages. Rollouts are bucketed into Easy/Medium/Hard by $C$ and sampled in a 1:2:2 ratio.
- Joint Training: DockSmith employs a 30B-parameter Qwen3-Coder-A3B backbone, fine-tuned on the curated Docker data. Joint fine-tuning occurs alongside general SWE/coding data (e.g., Nex Agent-SFT) using token budget mixing, preserving generalized SWE knowledge while fusing environment-construction patterns.
Training hyperparameters consist of a global batch size of 32, learning rate , two epochs, and a maximum Docker sequence length of 32K tokens. The fully trained system and trajectories are public at https://huggingface.co/collections/8sj7df9k8m5x8/docksmith.
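The curriculum-shaping step can be sketched as follows. The plain-sum complexity score, the bucket thresholds, and the token parsing are assumptions for illustration; the source names only the three counted quantities and the 1:2:2 sampling ratio:

```python
import random

def complexity(dockerfile: str) -> int:
    """Score a Dockerfile as lines + RUN instructions + distinct apt packages.
    (A plain sum stands in for the paper's scoring function -- an assumption.)"""
    lines = [l.strip() for l in dockerfile.splitlines() if l.strip()]
    runs = sum(1 for l in lines if l.startswith("RUN"))
    pkgs = set()
    for l in lines:
        if "apt-get install" in l or "apt install" in l:
            # tokens after the install keyword, skipping flags like -y
            toks = l.split("install", 1)[1].split()
            pkgs.update(t for t in toks if not t.startswith("-") and t != "&&")
    return len(lines) + runs + len(pkgs)

def bucket(score: int, easy_max: int = 10, medium_max: int = 25) -> str:
    """Map a complexity score to a curriculum bucket (thresholds illustrative)."""
    if score <= easy_max:
        return "Easy"
    if score <= medium_max:
        return "Medium"
    return "Hard"

def sample_curriculum(buckets, n, weights=(1, 2, 2), seed=0):
    """Draw n rollouts with Easy:Medium:Hard sampled in a 1:2:2 ratio."""
    rng = random.Random(seed)
    pools = [buckets["Easy"], buckets["Medium"], buckets["Hard"]]
    return [rng.choice(pools[rng.choices(range(3), weights=weights)[0]])
            for _ in range(n)]
```

Weighting sampling toward Medium/Hard keeps the trained policy from overfitting to trivial single-stage builds.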
4. Evaluation Metrics and Benchmarks
DockSmith is evaluated using primary and secondary metrics designed to quantify environment construction capability:
- Fail-to-Pass: $\mathrm{F2P} = N_{\mathrm{resolved}} / N_{\mathrm{total}}$, where $N_{\mathrm{resolved}}$ is the number of tasks whose initial test failures are resolved by the agent and $N_{\mathrm{total}}$ is the total number of tasks attempted.
- Commit Rate: $\mathrm{CR} = N_{\mathrm{pass}} / N_{\mathrm{submit}}$, where $N_{\mathrm{pass}}$ is the number of passing solutions out of all $N_{\mathrm{submit}}$ model-submitted attempts.
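The two metrics reduce to simple ratios over task counts; a minimal sketch (function names are ours, not the benchmark's API):

```python
def fail_to_pass(n_resolved: int, n_attempted: int) -> float:
    """F2P: fraction of attempted tasks whose initial test failures
    the agent resolves into a passing state."""
    return n_resolved / n_attempted if n_attempted else 0.0

def commit_rate(n_pass: int, n_submitted: int) -> float:
    """CR: fraction of model-submitted attempts that pass verification."""
    return n_pass / n_submitted if n_submitted else 0.0

f2p = fail_to_pass(2, 5)  # 0.4
```

Commit Rate is the more permissive metric, since it conditions only on the agent having submitted an attempt, not on the initial failure profile.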
Benchmarks span Multi-Docker-Eval (39 repos in 9 languages), which assesses real-world heterogeneity, alongside out-of-distribution tests such as SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0.
5. Quantitative and Language-Wise Performance
DockSmith achieves open-source state-of-the-art performance on Multi-Docker-Eval:
- Fail-to-Pass: 39.72%
- Commit Rate: 58.28%
Relative to the Qwen3-Coder-30B-A3B-Instruct base (19.46% F2P, 34.13% CR), these represent +20.26 and +24.15 percentage point improvements, respectively. Average input/output token utilization and per-repo Docker images built are competitive with other strong baselines.
Performance breakdown by language (Fail-to-Pass, %):
| Language | Fail-to-Pass (%) |
|---|---|
| Python | 51.28 |
| JavaScript | 51.67 |
| Java | 19.05 |
| C++ | 15.56 |
| C | 20.00 |
| Go | 63.33 |
| Ruby | 57.50 |
| Rust | 30.00 |
| PHP | 41.11 |
DockSmith displays marked gains for ecosystems with high package-manager standardization (e.g., Python, Go, Ruby).
In out-of-distribution settings, jointly trained models improve over SWE-only baselines:
- SWE-bench Verified: +2.25 pp (49.65% → 51.90%)
- SWE-bench Multilingual: +2.09 pp (31.83% → 33.92%)
- Terminal-Bench 2.0: +3.37 pp (10.67% → 14.04%)
These improvements substantiate the claim that environment-construction supervision transfers across agentic tasks.
6. Analysis of Agentic Benefits and Error Patterns
Detailed error-propagation analysis demonstrates that DockSmith not only reduces absolute terminal-error rates (8.7%→7.1%, −1.6 pp) but also prompts more persistent and layered handling of environment and runtime errors:
- Increased within-layer persistence for environment (+6%) and runtime errors (+7%)
- Improved resolution rates for runtime errors (+3.3 pp) and logic faults (+8.7 pp)
- Higher high-precision intent matches (34.6%→40.4%), and more principled, system-grounded repair rationales (27.1%→33.0%)
Modeling environment setup as an agentic, verifiable activity—rather than a static preprocessing phase—yields actionable skills: dependency inference, tool invocation ordering, log parsing, and multi-step failure recovery, all of which enhance broader code synthesis and debugging.
7. Algorithmic Structure and Control Components
The DockSmith agentic loop can be textually represented as follows:
```
+-------------------+    +------------------+    +-------------------+    +---------------------+
| Context Retrieval | -> | Dockerfile Agent | -> | Eval Script Agent | -> | Test Analysis Agent |
| (tool use/parse)  |    | (gen/patch DF)   |    | (gen test.sh)     |    | (run & summarize)   |
+-------------------+    +------------------+    +-------------------+    +---------------------+
        ^                                                                           |
        +----------------------- Loop-Detection Controller <------------------------+
        |                                                                           |
        +------------- Cross-Task Success Memory (retrieve/store) <-----------------+
```
Pseudocode outlines:
```python
def LoopDetectionController(history, failure_signatures):
    # Scan the last m steps for a repeating agent/failure pattern
    for S in unique_agent_subsets(history[-m:]):
        if count_repeats(S, history[-m:]) >= threshold and not success_detected():
            return diversify_strategy()  # e.g., switch demonstration
    return continue_normal_flow()

def DependencyReasoning(logs):
    # Turn diagnosed missing system libraries into install instructions
    missing_packages = parse_missing_system_libs(logs)
    return [f"RUN apt-get update && apt-get install -y {pkg}"
            for pkg in missing_packages]

def DockerfileAgent(context, prev_DF, diagnostics):
    if diagnostics.indicates_missing_deps():
        new_lines = DependencyReasoning(diagnostics.logs)
        # Splice the new install lines after the first existing apt line
        return insert_after(prev_DF, first_apt_line(prev_DF), new_lines)
    elif not Dockerfile_exists():
        return synthesize_base_DF(context.languages, context.dependencies)
    else:
        return minimal_patch(prev_DF, context)
```
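The `parse_missing_system_libs` helper is left abstract in the pseudocode; one concrete realization, assuming gcc/ld-style log lines (the regexes are illustrative, not exhaustive, and the name-to-package mapping is deliberately naive):

```python
import re

def parse_missing_system_libs(logs: str) -> list:
    """Pull missing shared libraries and headers out of build logs,
    matching two common gcc/ld failure patterns."""
    missing = []
    # Linker errors: "error while loading shared libraries: libfoo.so.1: ..."
    missing += re.findall(r"error while loading shared libraries: (\S+?):", logs)
    # Compiler errors: "fatal error: foo.h: No such file or directory"
    missing += re.findall(r"fatal error: (\S+?): No such file or directory", logs)
    return missing

def to_apt_lines(missing: list) -> list:
    """Naively emit one apt install line per missing artifact; real package
    resolution (e.g., zlib.h -> zlib1g-dev) needs a file-to-package index."""
    return [f"RUN apt-get update && apt-get install -y {name}" for name in missing]

log = ("./app: error while loading shared libraries: libssl.so.3: cannot open\n"
       "gcc: fatal error: zlib.h: No such file or directory")
print(parse_missing_system_libs(log))  # ['libssl.so.3', 'zlib.h']
```

In practice the diagnosed names feed the Dockerfile Agent's patch step, which decides where in the existing Dockerfile the install lines belong.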
By modularizing context acquisition, Dockerfile synthesis, evaluation scripting, log-driven diagnosis, and targeted repair—with robust control and memory mechanisms—DockSmith reliably bootstraps complex repositories into reproducible, testable environments. This agentic framing delivers both immediate gains in environment construction and enduring improvements in tool-oriented code generation (Zhang et al., 31 Jan 2026).