DockSmith: Agentic Docker Build System
- DockSmith is an agentic Docker builder that reframes environment setup as a dynamic task involving tool use, dependency reasoning, and iterative failure recovery.
- It leverages a multi-agent system—including context retrieval, Dockerfile synthesis, and log-driven evaluation—to progressively repair and validate real-world builds.
- The system achieves marked performance gains via loop-detection and cross-task memory, significantly improving build reliability across diverse programming ecosystems.
DockSmith is a specialized agentic Docker builder introduced to address the core bottleneck in execution-grounded software engineering (SWE) pipelines: reliable and scalable environment construction. Unlike conventional systems that treat Docker-based environment setup as a preprocessing step, DockSmith models environment construction as a complex, long-horizon agentic task involving tool use, dependency reasoning, and iterative failure recovery. By embedding environment setup into the agentic learning loop, DockSmith yields significant improvements in build reliability and produces supervision that transfers to broader code synthesis and repair tasks (Zhang et al., 31 Jan 2026).
1. Motivation and Problem Domain
Execution-grounded training and evaluation of SWE agents require that repositories build and run deterministically in isolated environments, typically using Docker. However, Docker-based builds routinely fail on diverse real-world projects due to inconsistent or absent manifests, complex native dependencies, undocumented system-level configuration, and build reproducibility issues. The Multi-Docker-Eval benchmark, explicitly designed to measure environment construction robustness, demonstrates that even state-of-the-art closed- and open-source models achieve success rates below 40%, with each failed build aborting the agentic pipeline and sharply reducing data yield for downstream execution-grounded learning. DockSmith reframes environment setup as a core agentic task, treating repair, log interpretation, dependency installation, and diagnosis as learnable, verifiable behaviors with supervision that generalizes to other SWE challenges.
2. System Architecture and Workflow
DockSmith extends the SWE-Factory multi-agent pipeline architecture, orchestrating four specialized agents in an iterative repair loop:
- Context Retrieval Agent: Extracts dependency manifests, build scripts, entry points, and language/runtime information using file-system navigation, manifest parsing, and CI config analysis.
- Dockerfile Agent: Synthesizes new or patches existing Dockerfiles based on contextual signals and historic build outputs, adapting installation sequences and base images.
- Eval Script Agent: Generates shell scripts to check out the correct commit, apply test suite patches, invoke in-container testing, and produce standardized exit codes.
- Test Analysis Agent: Executes docker build/run, parses failure logs (e.g., missing libraries, compilation errors), and formalizes structured remediation proposals.
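As a concrete illustration of the Eval Script Agent's output contract, the sketch below renders a standardized evaluation script as a Python string. The helper name, script layout, and argument names are assumptions for illustration, not the paper's actual implementation:

```python
def make_eval_script(commit_sha: str, patch_path: str, test_cmd: str) -> str:
    """Render a standardized eval script: check out the target commit,
    apply the test-suite patch, run the tests in-container, and emit
    exit code 0 on pass / 1 on fail. (Hypothetical helper, for illustration.)"""
    return "\n".join([
        "#!/bin/sh",
        "set -e",  # abort immediately if checkout or patching fails
        f"git checkout {commit_sha}",
        f"git apply {patch_path}",
        # Test failure must map to a standardized exit code, not a crash
        f"if {test_cmd}; then exit 0; else exit 1; fi",
    ])

script = make_eval_script("abc123", "/tmp/tests.patch", "pytest -q")
```

The standardized exit codes are what let the Test Analysis Agent treat heterogeneous test harnesses uniformly.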
Two principal extensions differentiate DockSmith from vanilla SWE-Factory setups:
- Loop-Detection Controller: Monitors the k most recent agent/failure signature pairs. On detection of non-progressing cycles (same agents/failures for m+ steps), it diversifies agent strategies (e.g., alternative demonstrations, priority shifts) to break deadlocks.
- Cross-Task Success Memory: Maintains a scalable, global cache of (Dockerfile, eval script) pairs from previously verified successes. For a new repository, similar past solutions (matched on language ecosystem and dependency footprint) are surfaced as demonstrations to seed rapid convergence.
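A minimal sketch of such a memory, assuming matching on exact language plus Jaccard similarity of dependency sets; the paper specifies only "language ecosystem and dependency footprint" as matching signals, so the similarity function and ranking are illustrative choices:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two dependency footprints."""
    return len(a & b) / len(a | b) if a | b else 0.0

class SuccessMemory:
    """Global cache of verified (Dockerfile, eval script) pairs (sketch)."""
    def __init__(self):
        self.entries = []  # (language, deps, dockerfile, eval_script)

    def store(self, language, deps, dockerfile, eval_script):
        self.entries.append((language, frozenset(deps), dockerfile, eval_script))

    def retrieve(self, language, deps, k=3):
        """Return up to k past solutions from the same ecosystem,
        ranked by dependency-footprint overlap, to seed demonstrations."""
        same_lang = [e for e in self.entries if e[0] == language]
        ranked = sorted(same_lang,
                        key=lambda e: jaccard(set(e[1]), set(deps)),
                        reverse=True)
        return [(df, ev) for (_, _, df, ev) in ranked[:k]]
```

A retrieved pair is surfaced to the Dockerfile and Eval Script agents as an in-context demonstration rather than copied verbatim.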
The core workflow is an iterative build→test→diagnose→patch→rebuild loop, dynamically controlled to prevent stalling and to maximize the reuse of agentic experience across heterogeneous repository structures.
3. Data Curation, Training Regimen, and Curriculum
DockSmith is trained on large-scale, execution-grounded Docker-building trajectories obtained through a multi-stage pipeline:
- Data Sourcing: Over 15 000 GitHub repositories (each >500 stars, >200 forks, spanning 10 languages) are used. Filtering restricts attention to merged pull requests involving test or CI configuration, ensuring that all data instances are backed by human-validated, executable ground truth. Short PR descriptions are expanded using a general-purpose LLM to increase specification clarity.
- Agentic Trajectory Generation: For each PR, agents replay a patch within a fresh container, iterating until test pass or a maximal step count. Loop-Detection and Cross-Task Memory mechanisms operate throughout, and only verified-successful rollouts—fully capturing tool calls, file edits, build logs, and diagnoses—are retained.
- Curriculum Shaping: Dockerfiles are scored on complexity as $C = L + R + P$, where $L$ is the number of Dockerfile lines, $R$ is the number of RUN instructions, and $P$ is the count of distinct apt-get/apt install packages. Rollouts are bucketed into Easy/Medium/Hard by $C$ and sampled in a 1:2:2 ratio.
- Joint Training: DockSmith employs a 30B-parameter Qwen3-Coder-A3B backbone, fine-tuned on the curated Docker data. Joint fine-tuning occurs alongside general SWE/coding data (e.g., Nex Agent-SFT) using token budget mixing, preserving generalized SWE knowledge while fusing environment-construction patterns.
Training hyperparameters consist of a global batch size of 32, learning rate , two epochs, and a maximum Docker sequence length of 32K tokens. The fully trained system and trajectories are public at https://huggingface.co/collections/8sj7df9k8m5x8/docksmith.
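The curriculum-shaping step can be sketched as follows. The plain-sum complexity score, the bucket thresholds, and the token parsing are assumptions for illustration; the source names only the three counted quantities and the 1:2:2 sampling ratio:

```python
import random

def complexity(dockerfile: str) -> int:
    """Score a Dockerfile as lines + RUN instructions + distinct apt packages.
    (A plain sum stands in for the paper's scoring function -- an assumption.)"""
    lines = [l.strip() for l in dockerfile.splitlines() if l.strip()]
    runs = sum(1 for l in lines if l.startswith("RUN"))
    pkgs = set()
    for l in lines:
        if "apt-get install" in l or "apt install" in l:
            # tokens after the install keyword, skipping flags like -y
            toks = l.split("install", 1)[1].split()
            pkgs.update(t for t in toks if not t.startswith("-") and t != "&&")
    return len(lines) + runs + len(pkgs)

def bucket(score: int, easy_max: int = 10, medium_max: int = 25) -> str:
    """Map a complexity score to a curriculum bucket (thresholds illustrative)."""
    if score <= easy_max:
        return "Easy"
    if score <= medium_max:
        return "Medium"
    return "Hard"

def sample_curriculum(buckets, n, weights=(1, 2, 2), seed=0):
    """Draw n rollouts with Easy:Medium:Hard sampled in a 1:2:2 ratio."""
    rng = random.Random(seed)
    pools = [buckets["Easy"], buckets["Medium"], buckets["Hard"]]
    return [rng.choice(pools[rng.choices(range(3), weights=weights)[0]])
            for _ in range(n)]
```

Weighting sampling toward Medium/Hard keeps the trained policy from overfitting to trivial single-stage builds.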
4. Evaluation Metrics and Benchmarks
DockSmith is evaluated using primary and secondary metrics designed to quantify environment construction capability:
- Fail-to-Pass: $\mathrm{F2P} = N_{\mathrm{resolved}} / N_{\mathrm{total}}$, where $N_{\mathrm{resolved}}$ is the number of tasks whose initial test failures are resolved by the agent and $N_{\mathrm{total}}$ is the total number of tasks attempted.
- Commit Rate: $\mathrm{CR} = N_{\mathrm{pass}} / N_{\mathrm{submit}}$, where $N_{\mathrm{pass}}$ is the number of passing solutions out of all $N_{\mathrm{submit}}$ model-submitted attempts.
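The two metrics reduce to simple ratios over task counts; a minimal sketch (function names are ours, not the benchmark's API):

```python
def fail_to_pass(n_resolved: int, n_attempted: int) -> float:
    """F2P: fraction of attempted tasks whose initial test failures
    the agent resolves into a passing state."""
    return n_resolved / n_attempted if n_attempted else 0.0

def commit_rate(n_pass: int, n_submitted: int) -> float:
    """CR: fraction of model-submitted attempts that pass verification."""
    return n_pass / n_submitted if n_submitted else 0.0

f2p = fail_to_pass(2, 5)  # 0.4
```

Commit Rate is the more permissive metric, since it conditions only on the agent having submitted an attempt, not on the initial failure profile.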
Benchmarks span Multi-Docker-Eval (39 repos in 9 languages), which assesses real-world heterogeneity, alongside out-of-distribution tests such as SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0.
5. Quantitative and Language-Wise Performance
DockSmith achieves open-source state-of-the-art performance on Multi-Docker-Eval:
- Fail-to-Pass: 39.72%
- Commit Rate: 58.28%
Relative to the Qwen3-Coder-30B-A3B-Instruct base (19.46% F2P, 34.13% CR), these represent +20.26 and +24.15 percentage point improvements, respectively. Average input/output token utilization and per-repo Docker images built are competitive with other strong baselines.
Performance breakdown by language (Fail-to-Pass, %):
| Language | Fail-to-Pass (%) |
|---|---|
| Python | 51.28 |
| JavaScript | 51.67 |
| Java | 19.05 |
| C++ | 15.56 |
| C | 20.00 |
| Go | 63.33 |
| Ruby | 57.50 |
| Rust | 30.00 |
| PHP | 41.11 |
DockSmith displays marked gains for ecosystems with high package-manager standardization (e.g., Python, Go, Ruby).
In out-of-distribution settings, jointly trained models improve over SWE-only baselines:
- SWE-bench Verified: +2.25 pp (49.65% → 51.90%)
- SWE-bench Multilingual: +2.09 pp (31.83% → 33.92%)
- Terminal-Bench 2.0: +3.37 pp (10.67% → 14.04%)
These improvements substantiate the claim that environment-construction supervision transfers across agentic tasks.
6. Analysis of Agentic Benefits and Error Patterns
Detailed error-propagation analysis demonstrates that DockSmith not only reduces absolute terminal-error rates (8.7%→7.1%, −1.6 pp) but also prompts more persistent and layered handling of environment and runtime errors:
- Increased within-layer persistence for environment (+6%) and runtime errors (+7%)
- Improved resolution rates for runtime errors (+3.3 pp) and logic faults (+8.7 pp)
- Higher high-precision intent matches (34.6%→40.4%), and more principled, system-grounded repair rationales (27.1%→33.0%)
Modeling environment setup as an agentic, verifiable activity—rather than a static preprocessing phase—yields actionable skills: dependency inference, tool invocation ordering, log parsing, and multi-step failure recovery, all of which enhance broader code synthesis and debugging.
7. Algorithmic Structure and Control Components
The DockSmith agentic loop can be textually represented as follows:
```
+-------------------+    +------------------+    +-------------------+    +---------------------+
| Context Retrieval | -> | Dockerfile Agent | -> | Eval Script Agent | -> | Test Analysis Agent |
| (tool use/parse)  |    | (gen/patch DF)   |    | (gen test.sh)     |    | (run & summarize)   |
+-------------------+    +------------------+    +-------------------+    +---------------------+
        ^                                                                           |
        +----------------------- Loop-Detection Controller <------------------------+
        |                                                                           |
        +------------- Cross-Task Success Memory (retrieve/store) <-----------------+
```
Pseudocode outlines:
```python
def LoopDetectionController(history, failure_signatures):
    # Scan the last m steps for a repeating agent/failure pattern
    for S in unique_agent_subsets(history[-m:]):
        if count_repeats(S, history[-m:]) >= threshold and not success_detected():
            return diversify_strategy()  # e.g., switch demonstration
    return continue_normal_flow()

def DependencyReasoning(logs):
    # Turn diagnosed missing system libraries into install instructions
    missing_packages = parse_missing_system_libs(logs)
    return [f"RUN apt-get update && apt-get install -y {pkg}"
            for pkg in missing_packages]

def DockerfileAgent(context, prev_DF, diagnostics):
    if diagnostics.indicates_missing_deps():
        new_lines = DependencyReasoning(diagnostics.logs)
        # Splice the new install lines after the first existing apt line
        return insert_after(prev_DF, first_apt_line(prev_DF), new_lines)
    elif not Dockerfile_exists():
        return synthesize_base_DF(context.languages, context.dependencies)
    else:
        return minimal_patch(prev_DF, context)
```
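The `parse_missing_system_libs` helper is left abstract in the pseudocode; one concrete realization, assuming gcc/ld-style log lines (the regexes are illustrative, not exhaustive, and the name-to-package mapping is deliberately naive):

```python
import re

def parse_missing_system_libs(logs: str) -> list:
    """Pull missing shared libraries and headers out of build logs,
    matching two common gcc/ld failure patterns."""
    missing = []
    # Linker errors: "error while loading shared libraries: libfoo.so.1: ..."
    missing += re.findall(r"error while loading shared libraries: (\S+?):", logs)
    # Compiler errors: "fatal error: foo.h: No such file or directory"
    missing += re.findall(r"fatal error: (\S+?): No such file or directory", logs)
    return missing

def to_apt_lines(missing: list) -> list:
    """Naively emit one apt install line per missing artifact; real package
    resolution (e.g., zlib.h -> zlib1g-dev) needs a file-to-package index."""
    return [f"RUN apt-get update && apt-get install -y {name}" for name in missing]

log = ("./app: error while loading shared libraries: libssl.so.3: cannot open\n"
       "gcc: fatal error: zlib.h: No such file or directory")
print(parse_missing_system_libs(log))  # ['libssl.so.3', 'zlib.h']
```

In practice the diagnosed names feed the Dockerfile Agent's patch step, which decides where in the existing Dockerfile the install lines belong.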
By modularizing context acquisition, Dockerfile synthesis, evaluation scripting, log-driven diagnosis, and targeted repair—with robust control and memory mechanisms—DockSmith reliably bootstraps complex repositories into reproducible, testable environments. This agentic framing delivers both immediate gains in environment construction and enduring improvements in tool-oriented code generation (Zhang et al., 31 Jan 2026).