SWE-Agent with Claude 4 Sonnet

Updated 4 December 2025
  • SWE-Agent with Claude 4 Sonnet is an advanced agentic coding system that integrates LLM-driven dynamic prompting with modular repository navigation.
  • Empirical benchmarks show a 61.0% functional correctness rate but only 10.5% secure outputs, highlighting significant security vulnerabilities.
  • The framework employs chain-of-thought strategies and structured API protocols to orchestrate complex, end-to-end software engineering tasks.

SWE-Agent with Claude 4 Sonnet denotes an agentic coding architecture that pairs the SWE-Agent framework with Anthropic’s frontier Claude 4 Sonnet model as its LLM backbone. This setup is prominent in recent studies of "vibe coding," where LLM agents autonomously perform complex end-to-end software engineering tasks under real-world constraints. Empirical evidence from benchmarks such as SusVibes, SecureAgentBench, and SWE-Compass demonstrates SWE-Agent’s strengths in repository-scale navigation and functional code generation, while highlighting persistent software security challenges and the limitations of current prompting-based mitigations (Zhao et al., 2 Dec 2025, Yang et al., 6 May 2024, Chen et al., 26 Sep 2025, Xu et al., 7 Nov 2025).

1. System Design and Workflow

The core of SWE-Agent is a modular, loop-based architecture orchestrating LLM-driven repository interaction. Each episode consists of the following cyclic phases:

  • Repository Observation: The agent retrieves repository state, including file tree and recent diffs.
  • Action Planning: The agent, via the Claude 4 Sonnet LLM, plans discrete actions—opening files, editing code, running tests.
  • Dynamic Prompting: Action proposals are generated by invoking the LLM with a system prompt and per-turn context, including an explicit chain-of-thought (CoT) strategy: “think step by step about file edits, test execution, and patch generation.”
  • Action Execution: The specified action modifies source code, invokes shell commands, or triggers test runs, with results captured for subsequent cycles.
  • Terminal Condition: Iteration continues until a task-defined endpoint—typically, all tests pass or a hard limit of 200 planning steps is reached (Zhao et al., 2 Dec 2025).
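
The cycle above can be sketched as a plain control loop. This is a minimal illustration, not SWE-Agent's actual code: `Episode`, `run_episode`, and the four callables are hypothetical names, and the LLM call is hidden behind `plan`. Only the 200-step cap comes from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_STEPS = 200  # hard step limit quoted in the text


@dataclass
class Episode:
    """Hypothetical record of one agent episode."""
    history: List[Tuple[str, str]] = field(default_factory=list)
    steps: int = 0
    solved: bool = False


def run_episode(
    observe: Callable[[], str],
    plan: Callable[[str, List[Tuple[str, str]]], str],
    execute: Callable[[str], str],
    tests_pass: Callable[[], bool],
    max_steps: int = MAX_STEPS,
) -> Episode:
    """Observe, plan, execute, and stop when tests pass or the budget runs out."""
    episode = Episode()
    while episode.steps < max_steps:
        observation = observe()                      # repository observation
        action = plan(observation, episode.history)  # LLM-backed planning (stubbed)
        result = execute(action)                     # edit, shell command, or test run
        episode.history.append((action, result))
        episode.steps += 1
        if tests_pass():                             # terminal condition
            episode.solved = True
            break
    return episode


# Toy usage with stub callables: the first action "fixes" the repository.
state = {"fixed": False}


def fix_repo(action: str) -> str:
    state["fixed"] = True
    return "applied " + action


done = run_episode(
    observe=lambda: "tests passing" if state["fixed"] else "tests failing",
    plan=lambda observation, history: "edit",
    execute=fix_repo,
    tests_pass=lambda: state["fixed"],
)
```

Passing the four phases in as callables keeps the loop independent of any particular model backend or sandbox.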

The Agent–Computer Interface (ACI) enforces strict command schemas, guardrails (e.g., flake8 for Python syntax validation), and budgeting for API calls, cost, and wall-clock timeouts (Yang et al., 6 May 2024). Integration with Claude 4 Sonnet is performed via the Anthropic API, using model-appropriate prompt syntax and API parameter tuning (temperature=0.0, deterministic decoding, pass@1) (Zhao et al., 2 Dec 2025, Xu et al., 7 Nov 2025).
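
A rough sketch of two of these ACI ideas follows: rejecting edits that break syntax and tracking a call and cost budget. It rests on assumptions: the standard-library `ast.parse` stands in for flake8, and `Budget` and `guarded_edit` are hypothetical names that do not come from the SWE-Agent codebase.

```python
import ast
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Budget:
    """Hypothetical per-episode budget for API calls and spend."""
    max_calls: int
    max_cost_usd: float
    calls: int = 0
    cost_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one API call and its cost."""
        self.calls += 1
        self.cost_usd += cost_usd

    def exhausted(self) -> bool:
        """True once either the call limit or the cost limit is reached."""
        return self.calls >= self.max_calls or self.cost_usd >= self.max_cost_usd


def guarded_edit(old_source: str, new_source: str) -> Tuple[str, Optional[str]]:
    """Return (source, error). An edit that does not parse is rejected
    and the old source is kept, so the agent can retry."""
    try:
        ast.parse(new_source)  # stand-in for a flake8 syntax check
    except SyntaxError as exc:
        return old_source, f"edit rejected: {exc.msg} (line {exc.lineno})"
    return new_source, None


# Toy usage: a broken edit is rolled back, a valid one is accepted.
kept, error = guarded_edit("x = 1\n", "x = (1\n")
accepted, no_error = guarded_edit("x = 1\n", "x = 2\n")
```

Returning the error text instead of raising lets the loop feed the message back to the model as the next observation.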

2. Prompt Design and Subroutine Structure

SWE-Agent utilizes a sophisticated prompting regime, both at the episode level and for specialized subroutines. The agent’s main system prompt enumerates supported interface commands and enforces single-action JSON protocols (e.g., {"discussion":"...", "command":"..."} for classic ACI; explicit function-calling for SWE-Compass).
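
The single-action JSON protocol might be validated along the following lines. Only the two keys `discussion` and `command` come from the text. The function name `parse_action` and the exact validation rules are assumptions made for illustration.

```python
import json

REQUIRED_KEYS = {"discussion", "command"}


def parse_action(raw: str) -> dict:
    """Parse one model reply and insist on exactly one well-formed action."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(payload, dict):
        raise ValueError("expected a single JSON object")
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    command = payload["command"]
    if not isinstance(command, str) or not command.strip():
        raise ValueError("command must be a non-empty string")
    return {"discussion": str(payload["discussion"]), "command": command.strip()}


# Toy usage with a well-formed reply.
action = parse_action('{"discussion": "Inspect the failing test.", "command": "open tests/test_auth.py"}')
```

A malformed reply raises `ValueError`, which an agent loop would typically turn into a re-prompt rather than a crash.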

For SU S VI B E S tasks, three prompt templates structure feature masking, issue description, and mask verification:

  • Prompt I (Feature Masking): Instructs Claude 4 Sonnet to produce a deletion mask encapsulating an implementation area around a provided diff.
  • Prompt II (Issue-Style Task): Generates a developer-facing issue specification for re-implementing the masked code.
  • Prompt III (Mask Verification): Verifies that each implementation line is justified by the issue requirements (Zhao et al., 2 Dec 2025).

Operational prompts in SWE-Compass and similar agentic loops embed CoT directives, tool call schemas (e.g., bash, str_replace_editor, file_viewer), and stateful user contexts (e.g., last test failures), all carried in structured JSON over the chat.completions API (Xu et al., 7 Nov 2025).
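
As an illustration of what such tool call schemas can look like, the sketch below declares the three tools named above in the function-tool layout used by chat-completions style APIs. The tool names come from the text. The descriptions, parameter names, and helper functions are hypothetical and are not taken from SWE-Compass.

```python
import json
from typing import Dict, List


def make_tool(name: str, description: str, properties: Dict[str, dict], required: List[str]) -> dict:
    """Build one function-tool declaration with a JSON Schema parameter block."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(required),
            },
        },
    }


# Parameter names below are illustrative assumptions.
TOOLS = [
    make_tool("bash", "Run a shell command in the sandbox.",
              {"command": {"type": "string"}}, ["command"]),
    make_tool("str_replace_editor", "Replace one exact string in a file.",
              {"path": {"type": "string"}, "old_str": {"type": "string"}, "new_str": {"type": "string"}},
              ["path", "old_str", "new_str"]),
    make_tool("file_viewer", "Show lines from a file.",
              {"path": {"type": "string"}, "start_line": {"type": "integer"}}, ["path"]),
]


def missing_arguments(tool_name: str, arguments_json: str) -> List[str]:
    """List required arguments that a proposed tool call leaves out."""
    for tool in TOOLS:
        spec = tool["function"]
        if spec["name"] == tool_name:
            arguments = json.loads(arguments_json)
            return [key for key in spec["parameters"]["required"] if key not in arguments]
    raise KeyError(f"unknown tool: {tool_name}")
```

Checking required arguments before execution gives the agent a cheap, structured error to react to.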

3. Benchmarking: Functional Correctness and Security

Metric Formulation

Three primary metrics are used for quantitative evaluation:

  • Functional Correctness (FuncPass):

$$\text{Functional Correctness} = \frac{\#\{\text{solutions passing } \mathcal{T}_{func}\}}{\#\{\text{total tasks}\}}$$

A solution must pass the repository’s original unit tests.

  • Security Rate (SecPass):

$$\text{Security Rate} = \frac{\#\{\text{solutions passing } \mathcal{T}_{func} + \mathcal{T}_{secure}\}}{\#\{\text{total tasks}\}}$$

Only solutions passing both functional and security-specific post-fix tests are counted as secure.

  • Insecurity Rate:

$$\text{Insecurity Rate} = 1 - \text{Security Rate}$$

These definitions extend to multi-category outcome values in SecureAgentBench, adding distinctions such as "Correct but Vulnerable" (CV), "Correct and Secure" ($C_S$), and newly introduced vulnerabilities (CS) (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025).
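
The three rates can be computed directly from per-task outcomes. The sketch below uses a hypothetical `TaskResult` record and made-up toy data. It follows the definitions above, counting a solution as secure only when it passes both test suites.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical outcome of one task."""
    passes_func: bool    # passes the functional tests
    passes_secure: bool  # passes the security tests


def func_pass(results: Sequence[TaskResult]) -> float:
    """Share of tasks whose solution passes the functional tests."""
    return sum(r.passes_func for r in results) / len(results)


def sec_pass(results: Sequence[TaskResult]) -> float:
    """Share of tasks whose solution passes functional and security tests."""
    return sum(r.passes_func and r.passes_secure for r in results) / len(results)


def insecurity_rate(results: Sequence[TaskResult]) -> float:
    """Complement of the security rate, as defined in the text."""
    return 1.0 - sec_pass(results)


# Toy data, not benchmark results.
toy = [
    TaskResult(True, True),
    TaskResult(True, False),
    TaskResult(False, False),
    TaskResult(True, False),
]
```

Because the security rate requires both suites to pass, it can never exceed functional correctness.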

Empirical Results

Key results for SWE-Agent using Claude 4 Sonnet on SusVibes:

| Agent Framework | FuncPass (%) | SecPass (%) |
|-----------------|--------------|-------------|
| SWE-Agent       | 61.0         | 10.5        |
| OpenHands       | 49.5         | 12.5        |
| Claude Code     | 44.0         | 6.0         |

Despite the highest functional correctness (61.0%), only 10.5% of SWE-Agent+Claude 4 Sonnet’s outputs are secure, with 50.5% of all tasks yielding correct but insecure code. For comparison, in SecureAgentBench (Claude 3.7 Sonnet), SWE-Agent achieves $R_{func} \approx 34$–38% and $R_{sec} \approx 8$–10%; projections for Claude 4 Sonnet forecast $R_{sec}$ rising to 12–18% (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025). In SWE-Compass, Pass@1 rates are recorded per task category using explicitly stated metric formulas (Xu et al., 7 Nov 2025).

4. Vulnerability Analysis and Security Failure Modes

Systematic analysis highlights recurrent security failure modes across agent-generated code:

  • Timing side channels: Premature returns in authentication functions yielding observable timing differences (e.g., Django’s verify_password).
  • Header injection: Unsanitized CRLF sequences in HTTP Location headers enabling response splitting (e.g., Buildbot).
  • Cross-site scripting (XSS): Unchecked JavaScript scheme in URLs within content-management features (e.g., Wagtail).
  • Session expiry bypass: Failure to enforce session max_age, leading to persistent stale credentials (e.g., aiohttp_session) (Zhao et al., 2 Dec 2025).
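
For orientation, the sketch below shows one generic Python countermeasure for each of the four failure modes. These are textbook mitigations built on the standard library, not the actual patches from Django, Buildbot, Wagtail, or aiohttp_session. All function names and the scheme allow-list are illustrative assumptions.

```python
import hmac
from urllib.parse import urlsplit

# Assumed allow-list; the empty string covers relative URLs.
ALLOWED_SCHEMES = {"http", "https", "mailto", ""}


def constant_time_equal(a: str, b: str) -> bool:
    """Compare secrets without an early exit that leaks timing."""
    return hmac.compare_digest(a.encode("utf-8"), b.encode("utf-8"))


def safe_header_value(value: str) -> str:
    """Refuse CR or LF so a header value cannot split the response."""
    if "\r" in value or "\n" in value:
        raise ValueError("header value contains CR or LF")
    return value


def is_safe_link(url: str) -> bool:
    """Allow only known URL schemes, which rejects javascript: links."""
    return urlsplit(url.strip()).scheme.lower() in ALLOWED_SCHEMES


def session_expired(created_at: float, max_age: float, now: float) -> bool:
    """Treat a session as stale once it has lived for max_age seconds."""
    return now - created_at >= max_age
```

Each check is deliberately narrow. A production sanitizer would need to handle more cases, such as obfuscated schemes or encoded line breaks.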

Empirical studies show that over 82% of functionally correct solutions produced by SWE-Agent+Claude 4 Sonnet remain insecure (i.e., fail $\mathcal{T}_{secure}$). Notably, security vulnerability patterns remain diverse, spanning 77 distinct CWE categories in SusVibes (Zhao et al., 2 Dec 2025).
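
The 82% figure is consistent with the reported rates (FuncPass 61.0%, SecPass 10.5%). Since every secure solution is also functionally correct, the insecure share among correct solutions works out as:

```latex
\frac{\text{FuncPass} - \text{SecPass}}{\text{FuncPass}}
  = \frac{61.0 - 10.5}{61.0}
  = \frac{50.5}{61.0}
  \approx 0.828
```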

5. Prompting-Based Security Mitigations and Limitations

Multiple strategies to prompt SWE-Agent+Claude 4 Sonnet for security robustness have been evaluated, including:

  • Generic reminders: Appending best-practice notices (“Make sure to follow best security practices…”).
  • Self-selection CWE: Having the agent read the full CWE list, select relevant categories, and integrate protective logic.
  • Oracle CWE hints: Explicitly instructing the agent which concrete CWE to avoid (Zhao et al., 2 Dec 2025).

Results indicate that self-selection and oracle hints reduce functional correctness (by up to 8.5 percentage points) without improving the final security rate: some insecure→secure transitions are offset by new secure→incorrect ones. Agents also struggle to identify the relevant CWEs reliably (mean taskwise recall 0.737). These findings corroborate SecureAgentBench’s conclusion that explicit security prompting fails to substantially increase $R_{sec}$, and that deeper architectural, training, or tool-integrated interventions are needed instead (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025).
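
The text does not spell out how taskwise recall is computed. A common reading, assumed here, is the fraction of a task's ground-truth CWEs that the agent selected, averaged over tasks. The function names and toy CWE sets below are illustrative only.

```python
from typing import Iterable, Set, Tuple


def taskwise_recall(selected: Set[str], relevant: Set[str]) -> float:
    """Fraction of the ground-truth CWEs for one task that the agent selected."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    return len(selected & relevant) / len(relevant)


def mean_taskwise_recall(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Average the per-task recall over (selected, relevant) pairs."""
    recalls = [taskwise_recall(selected, relevant) for selected, relevant in pairs]
    return sum(recalls) / len(recalls)


# Toy data, not benchmark results: one full hit and one complete miss.
toy_pairs = [
    ({"CWE-79", "CWE-20"}, {"CWE-79"}),
    ({"CWE-89"}, {"CWE-113", "CWE-208"}),
]
toy_recall = mean_taskwise_recall(toy_pairs)
```

Under this reading, extra selected CWEs do not lower recall, so a high value says nothing about how precise the selection was.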

6. Implementation and Reproduction Guidelines

Practical deployment of SWE-Agent with Claude 4 Sonnet involves:

  • API and Configuration: Anthropic Python SDK with model="claude-sonnet-4-20250514", temperature=0.0, max_tokens=2048, and tool use (function calling) enabled for tool invocation schemas (Xu et al., 7 Nov 2025).
  • Tool Abstraction Layer: Provides access to containerized bash, file editors, repository-wide search utilities, and test runners within an isolated Docker environment.
  • Turn and Time Budgets: Recommendations include max_turns=150, per-tool timeout=600 s, and job-level capping via containerized process limits.
  • State and Context Management: Retain only the last 10–15 tool interactions, summarize earlier history beyond 50 k tokens, and enforce strict guardrails on code edits (e.g., flake8 for syntax in Python, static analysis as in SecureAgentBench for C/C++).
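
These settings could be gathered into a single configuration object, with a simple history-trimming helper alongside it. The sketch is an assumption-laden illustration: `AgentConfig` and `trim_history` are hypothetical names, `max_tokens` is the Messages API name for the output-length cap, and the 50 k-token summarization step is reduced to a placeholder entry instead of a real summary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical bundle of the settings listed above."""
    model: str = "claude-sonnet-4-20250514"
    temperature: float = 0.0
    max_tokens: int = 2048
    max_turns: int = 150
    tool_timeout_s: int = 600
    keep_last_interactions: int = 15  # upper end of the 10-15 range


def trim_history(history: List[Dict[str, str]], config: AgentConfig) -> List[Dict[str, str]]:
    """Keep the most recent tool interactions and replace older ones
    with a single placeholder entry (a real agent would summarize them)."""
    keep = config.keep_last_interactions
    if len(history) <= keep:
        return list(history)
    dropped = len(history) - keep
    placeholder = {"role": "summary", "content": f"[{dropped} earlier tool interactions omitted]"}
    return [placeholder] + history[-keep:]


# Toy usage: 20 interactions shrink to one placeholder plus the last 15.
config = AgentConfig()
toy_history = [{"role": "tool", "content": f"step {i}"} for i in range(20)]
trimmed = trim_history(toy_history, config)
```

Freezing the dataclass keeps a run's configuration immutable, which helps when results must be reproduced later.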

Configuration best practices, including OS pinning, offline cache prep, and encoding normalization, are detailed for robust reproduction (Xu et al., 7 Nov 2025).

7. Implications and Prospects

SWE-Agent with Claude 4 Sonnet achieves competitive functional coverage in agentic coding benchmarks, often outperforming alternative agents in raw Pass@1 rates. Yet more than 80% of its functionally correct outputs remain vulnerable, which exposes critical limitations of current LLM agentic methods under the vibe coding paradigm (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025). Prompt engineering, even with precise task-specific CWE guidance, does not meaningfully mitigate insecurity. This suggests that future progress will require integration of first-class security objectives at the architectural or training level, potentially involving hybrid combinations of LLMs with traditional static/dynamic analysis tools or security-specific fine-tuning.

SWE-Agent’s modular orchestration loop and extensible ACI maintain its relevance as an empirical testbed for evaluating and hardening agentic workflows under evolving model backends, such as Claude 4 Sonnet. These findings motivate ongoing research into agent-aligned software security, the limits of prompt-driven risk mitigation, and the design of secure automated code-generation pipelines for deployment in security-sensitive domains (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025, Xu et al., 7 Nov 2025, Yang et al., 6 May 2024).
