SWE-Agent with Claude 4 Sonnet
- SWE-Agent with Claude 4 Sonnet is an advanced agentic coding system that integrates LLM-driven dynamic prompting with modular repository navigation.
- Empirical benchmarks show a 61.0% functional correctness rate but only 10.5% secure outputs, highlighting significant security vulnerabilities.
- The framework employs chain-of-thought strategies and structured API protocols to orchestrate complex, end-to-end software engineering tasks.
SWE-Agent with Claude 4 Sonnet denotes an advanced agentic coding architecture that pairs the SWE-Agent framework with Anthropic's frontier Claude 4 Sonnet model as its LLM backbone. This paradigm is prominent in recent studies of "vibe coding," where LLM agents autonomously perform complex end-to-end software engineering tasks under real-world constraints. Empirical evidence from leading benchmarks such as SUSVIBES, SecureAgentBench, and SWE-Compass collectively demonstrates SWE-Agent's strengths in repository-scale navigation and functional code generation, while highlighting persistent software security challenges and the limitations of current prompting-based mitigations (Zhao et al., 2 Dec 2025, Yang et al., 6 May 2024, Chen et al., 26 Sep 2025, Xu et al., 7 Nov 2025).
1. System Design and Workflow
The core of SWE-Agent is a modular, loop-based architecture orchestrating LLM-driven repository interaction. Each episode consists of the following cyclic phases:
- Repository Observation: The agent retrieves repository state, including file tree and recent diffs.
- Action Planning: The agent, via the Claude 4 Sonnet LLM, plans discrete actions such as opening files, editing code, or running tests.
- Dynamic Prompting: Action proposals are generated by invoking the LLM with a system prompt and per-turn context, including an explicit chain-of-thought (CoT) strategy: "think step by step about file edits, test execution, and patch generation."
- Action Execution: The specified action modifies source code, invokes shell commands, or triggers test runs, with results captured for subsequent cycles.
- Terminal Condition: Iteration continues until a task-defined endpoint is reached: typically, all tests pass or a hard limit of 200 planning steps is hit (Zhao et al., 2 Dec 2025). A minimal sketch of this loop follows.
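A minimal sketch of the observe-plan-act episode loop, under the assumption of hypothetical `Repository` and `Planner` interfaces standing in for the framework's actual agent-computer interface and LLM-backed planner:

```python
# Hedged sketch of the SWE-Agent episode loop; the Repository and Planner
# protocols are hypothetical stand-ins, not the framework's real interfaces.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class StepResult:
    output: str
    all_tests_pass: bool = False

class Repository(Protocol):
    def observe(self) -> str: ...                          # file tree, recent diffs, last output
    def execute(self, action: str) -> StepResult: ...       # edit file, run command, or run tests

class Planner(Protocol):
    def plan(self, task: str, observation: str, history: list) -> str: ...

MAX_STEPS = 200  # hard cap on planning steps per episode

def run_episode(task: str, planner: Planner, repo: Repository) -> list:
    history: list = []
    for _ in range(MAX_STEPS):
        observation = repo.observe()
        action = planner.plan(task, observation, history)   # CoT-prompted single action
        result = repo.execute(action)
        history.append((action, result))
        if result.all_tests_pass:                            # task-defined terminal condition
            break
    return history
```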
The Agent-Computer Interface (ACI) enforces strict command schemas, guardrails (e.g., flake8 for Python syntax validation), and budgeting for API calls, cost, and wall-clock timeouts (Yang et al., 6 May 2024). Integration with Claude 4 Sonnet is performed via the Anthropic API, using model-appropriate prompt syntax and API parameter tuning (temperature=0.0, deterministic decoding, pass@1) (Zhao et al., 2 Dec 2025, Xu et al., 7 Nov 2025).
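As an illustration of the syntax guardrail, a minimal sketch of a check that rejects an edit when flake8 reports errors; the gating policy and the selected error codes are assumptions, not the framework's exact configuration:

```python
# Hedged sketch: reject an agent's file edit if flake8 flags syntax errors
# or undefined names. The reject-on-nonzero-exit policy is an assumption.
import subprocess

def passes_flake8(path: str) -> bool:
    """Return True if flake8 reports no issues for the edited file."""
    proc = subprocess.run(
        ["flake8", "--select=E9,F821,F823", path],  # syntax errors, undefined names
        capture_output=True,
        text=True,
        timeout=60,
    )
    return proc.returncode == 0

# Usage inside the edit action: only apply the edit when the check passes.
if not passes_flake8("src/module_under_edit.py"):
    print("Edit rejected: flake8 reported errors; ask the model to retry.")
```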
2. Prompt Design and Subroutine Structure
SWE-Agent utilizes a sophisticated prompting regime, both at the episode level and for specialized subroutines. The agent's main system prompt enumerates supported interface commands and enforces single-action JSON protocols (e.g., {"discussion":"...", "command":"..."} for classic ACI; explicit function-calling for SWE-Compass).
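A sketch of the single-action protocol, assuming a minimal command whitelist; the allowed command set shown here is illustrative, not the full ACI interface:

```python
# Hedged sketch: parse and validate one agent turn under the single-action
# JSON protocol {"discussion": "...", "command": "..."}.
import json

ALLOWED_COMMANDS = {"open", "edit", "search_dir", "create", "submit"}  # illustrative subset

def parse_turn(raw_llm_output: str) -> dict:
    turn = json.loads(raw_llm_output)
    command_name = turn["command"].split()[0]
    if command_name not in ALLOWED_COMMANDS:
        raise ValueError(f"Unsupported command: {command_name}")
    return turn

turn = parse_turn('{"discussion": "Inspect the failing module first.", "command": "open src/auth.py"}')
print(turn["command"])  # -> open src/auth.py
```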
For SUSVIBES tasks, three prompt templates structure feature masking, issue description, and mask verification (sketched after this list):
- Prompt I (Feature Masking): Instructs Claude 4 Sonnet to produce a deletion mask encapsulating an implementation area around a provided diff.
- Prompt II (Issue-Style Task): Generates a developer-facing issue specification for re-implementing the masked code.
- Prompt III (Mask Verification): Verifies that each implementation line is justified by the issue requirements (Zhao et al., 2 Dec 2025).
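A sketch of how the three templates could be organized programmatically; the template wording below paraphrases the descriptions above and is not quoted from the benchmark:

```python
# Hedged sketch: paraphrased SUSVIBES prompt templates keyed by pipeline stage.
PROMPTS = {
    "feature_masking": (
        "Given the following diff, produce a deletion mask that encapsulates the "
        "implementation area surrounding it:\n{diff}"
    ),
    "issue_style_task": (
        "Write a developer-facing issue describing the feature that must be "
        "re-implemented for the masked region:\n{masked_region}"
    ),
    "mask_verification": (
        "For each implementation line below, state whether it is justified by the "
        "issue requirements:\n{issue}\n{implementation}"
    ),
}

def render(stage: str, **fields: str) -> str:
    return PROMPTS[stage].format(**fields)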
Operational prompts in SWE-Compass and similar agentic loops embed CoT directives, tool call schemas (e.g., bash, str_replace_editor, file_viewer), and stateful user contexts (e.g., last test failures), all carried in structured JSON over the chat.completions API (Xu et al., 7 Nov 2025).
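An illustrative tool declaration in the chat.completions function-calling format referenced above; the schema structure is standard for that API, but the specific descriptions and parameters are assumptions:

```python
# Hedged sketch: tools declared in chat.completions function-calling format.
BASH_TOOL = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a shell command in the sandboxed repository checkout.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string", "description": "Command to execute."}},
            "required": ["command"],
        },
    },
}

STR_REPLACE_EDITOR_TOOL = {
    "type": "function",
    "function": {
        "name": "str_replace_editor",
        "description": "View a file or replace an exact string within it.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old_str": {"type": "string"},
                "new_str": {"type": "string"},
            },
            "required": ["path"],
        },
    },
}
```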
3. Benchmarking: Functional Correctness and Security
Metric Formulation
Three primary metrics are used for quantitative evaluation:
- Functional Correctness (FuncPass): A solution must pass the repository's original unit tests.
- Security Rate (SecPass): Only solutions passing both the functional and the security-specific post-fix tests are counted as secure.
- Insecurity Rate: The share of functionally correct solutions that nevertheless fail the security-specific tests.
These definitions extend to multi-category outcome values in SecureAgentBench, adding distinctions such as "Correct but Vulnerable" (CV), "Correct and Secure" (CS), and outcomes that introduce new vulnerabilities (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025). The metrics are formalized below.
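A compact formalization of these metrics, offered as a plausible reconstruction rather than the benchmarks' original notation; the insecurity rate is written relative to the functionally correct set, consistent with the roughly 82% figure discussed in Section 4:

```latex
% N: number of tasks; F: set of functionally correct solutions;
% S: set of solutions that also pass the security-specific tests (S \subseteq F).
\mathrm{FuncPass} = \frac{|F|}{N}, \qquad
\mathrm{SecPass}  = \frac{|S|}{N}, \qquad
\mathrm{InsecRate} = \frac{|F \setminus S|}{|F|}
```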
Empirical Results
Key results for SWE-Agent using Claude 4 Sonnet on SUSVIBES:
| Agent Framework | FuncPass (%) | SecPass (%) |
|---|---|---|
| SWE-Agent | 61.0 | 10.5 |
| OpenHands | 49.5 | 12.5 |
| Claude Code | 44.0 | 6.0 |
Despite the highest functional correctness (61.0%), only 10.5% of SWE-Agent+Claude 4 Sonnet's outputs are secure, with 50.5% of all tasks yielding correct but insecure code. For comparison, in SecureAgentBench (with Claude 3.7 Sonnet), SWE-Agent achieves approximately 38% functional correctness and 10% security rate; projections for Claude 4 Sonnet place the security rate in the 12-18% range (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025). In SWE-Compass, Pass@1 rates are reported per task category under an explicitly defined metric (Xu et al., 7 Nov 2025).
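For reference, the standard unbiased pass@k estimator used in code-generation evaluation is sketched below; with temperature 0.0 and a single sample per task, Pass@1 reduces to the fraction of tasks whose solution passes all tests:

```latex
% n: samples generated per task; c: samples passing all tests; k: evaluation budget.
\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
```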
4. Vulnerability Analysis and Security Failure Modes
Systematic analysis highlights recurrent security failure modes across agent-generated code:
- Timing side channels: Premature returns in authentication functions yielding observable timing differences (e.g., Django's verify_password); a minimal illustration follows this list.
- Header injection: Unsanitized CRLF sequences in HTTP Location headers enabling response splitting (e.g., Buildbot).
- Cross-site scripting (XSS): Unchecked javascript: URL scheme in content-management features (e.g., Wagtail).
- Session expiry bypass: Failure to enforce session max_age, leading to persistent stale credentials (e.g., aiohttp_session) (Zhao et al., 2 Dec 2025).
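A minimal, self-contained illustration of the timing side-channel pattern; this is not the Django verify_password implementation, and the hashing scheme is simplified for brevity:

```python
# Illustrative only: schematic password checks, not real framework code.
import hmac
import hashlib

def verify_password_vulnerable(stored_hash: str, candidate: str) -> bool:
    # Early return on a missing hash makes the failure path measurably faster,
    # leaking information through response timing.
    if not stored_hash:
        return False
    return hashlib.sha256(candidate.encode()).hexdigest() == stored_hash

def verify_password_hardened(stored_hash: str, candidate: str) -> bool:
    # Always perform the full hash computation and a constant-time comparison,
    # so timing does not reveal whether (or why) the check failed.
    candidate_hash = hashlib.sha256(candidate.encode()).hexdigest()
    reference = stored_hash or hashlib.sha256(b"dummy").hexdigest()
    return hmac.compare_digest(candidate_hash, reference) and bool(stored_hash)
```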
Empirical studies show that over 82% of functionally correct solutions produced by SWE-Agent+Claude 4 Sonnet remain insecure (i.e., fail SecPass). Notably, vulnerability patterns remain diverse, spanning 77 distinct CWE categories in SUSVIBES (Zhao et al., 2 Dec 2025).
5. Prompting-Based Security Mitigations and Limitations
Multiple strategies for prompting SWE-Agent+Claude 4 Sonnet toward security robustness have been evaluated (illustrated in the sketch after this list), including:
- Generic reminders: Appending best-practice notices ("Make sure to follow best security practices...").
- Self-selection CWE: Having the agent read the full CWE list, select relevant categories, and integrate protective logic.
- Oracle CWE hints: Explicitly instructing the agent which concrete CWE to avoid (Zhao et al., 2 Dec 2025).
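A hedged sketch of how such prompt suffixes might be constructed; the wording is paraphrased from the strategy descriptions above, not quoted from the benchmark:

```python
# Hedged sketch: illustrative prompt-suffix builders for the three mitigation
# strategies; the exact benchmark wording is not reproduced here.
GENERIC_REMINDER = (
    "Make sure to follow best security practices when writing or modifying code."
)

def self_selection_suffix(cwe_catalog: dict) -> str:
    # The agent is shown the CWE list and asked to pick relevant categories itself.
    listing = "\n".join(f"- {cwe_id}: {title}" for cwe_id, title in cwe_catalog.items())
    return (
        "Review the following CWE categories, select those relevant to this task, "
        "and add protective logic for them:\n" + listing
    )

def oracle_hint_suffix(target_cwe: str) -> str:
    # The concrete CWE to avoid is given explicitly (oracle setting).
    return f"Your implementation must not introduce {target_cwe}; add safeguards against it."

# Example: append the oracle hint to the agent's task prompt.
task_prompt = "Re-implement the masked feature described in the issue."
prompt_with_hint = task_prompt + "\n\n" + oracle_hint_suffix("CWE-113 (HTTP response splitting)")
```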
Results indicate that self-selection and oracle hints reduce functional correctness (by up to roughly 8.5 percentage points) without improving ultimate security rates; trade-off effects are observed in which some insecure-to-secure transitions are offset by new secure-to-incorrect ones. Agents also exhibit difficulty in reliable CWE identification (mean taskwise recall 0.737). These findings corroborate SecureAgentBench's conclusion that explicit security prompting fails to substantially increase the rate of correct-and-secure outcomes, instead necessitating deeper architectural, training, or tool-integrated interventions (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025).
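One plausible formalization of the taskwise CWE-identification recall cited above, assuming per-task sets of self-selected and ground-truth CWE categories:

```latex
% P_t: CWE categories the agent selects for task t; G_t: ground-truth categories for t; T: number of tasks.
\mathrm{Recall}_t = \frac{|P_t \cap G_t|}{|G_t|}, \qquad
\overline{\mathrm{Recall}} = \frac{1}{T}\sum_{t=1}^{T}\mathrm{Recall}_t \approx 0.737
```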
6. Implementation and Reproduction Guidelines
Practical deployment of SWE-Agent with Claude 4 Sonnet involves:
- API and Configuration: Anthropic Python SDK with model="claude-sonnet-4-20250514", temperature=0.0, max_tokens_to_sample=2048, and function-calling enabled for tool invocation schemas (Xu et al., 7 Nov 2025); a minimal configuration sketch follows this list.
- Tool Abstraction Layer: Provides access to containerized bash, file editors, repository-wide search utilities, and test runners within an isolated Docker environment.
- Turn and Time Budgets: Recommendations include max_turns=150, a per-tool timeout of 600 s, and job-level capping via containerized process limits.
- State and Context Management: Retain only the last 10-15 tool interactions, summarize earlier history beyond 50k tokens, and enforce strict guardrails on code edits (e.g., flake8 for Python syntax, static analysis as in SecureAgentBench for C/C++).
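A minimal configuration sketch under these settings, using the current Anthropic Messages API (whose sampling-length parameter is max_tokens; older SDKs named it max_tokens_to_sample); the tool definition shown is illustrative rather than the benchmark's exact schema:

```python
# Hedged sketch: one Claude 4 Sonnet planning turn via the Anthropic Messages API.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

BASH_TOOL = {
    "name": "bash",
    "description": "Run a shell command inside the sandboxed repository container.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    temperature=0.0,          # deterministic decoding for pass@1 evaluation
    system="You are a software engineering agent. Think step by step about "
           "file edits, test execution, and patch generation.",
    tools=[BASH_TOOL],
    messages=[{"role": "user", "content": "Run the test suite and report failures."}],
)

# Tool calls appear as tool_use blocks in the response content.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```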
Configuration best practices, including OS pinning, offline cache prep, and encoding normalization, are detailed for robust reproduction (Xu et al., 7 Nov 2025).
7. Implications and Prospects
SWE-Agent with Claude 4 Sonnet achieves competitive functional coverage in agentic coding benchmarks, often outperforming alternative agents in raw Pass@1 rates; yet the persistence of security vulnerabilities in over 80% of its functionally correct outputs exposes critical limitations of current LLM agentic methods under the vibe coding paradigm (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025). Prompt engineering, even with precise task-specific CWE guidance, does not meaningfully mitigate insecurity. This suggests that future progress will require integration of first-class security objectives at the architectural or training level, potentially involving hybrid combinations of LLMs with traditional static/dynamic analysis tools or security-specific fine-tuning.
SWE-Agent's modular orchestration loop and extensible ACI maintain its relevance as an empirical testbed for evaluating and hardening agentic workflows under evolving model backends, such as Claude 4 Sonnet. These findings motivate ongoing research into agent-aligned software security, the limits of prompt-driven risk mitigation, and the design of secure automated code-generation pipelines for deployment in security-sensitive domains (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025, Xu et al., 7 Nov 2025, Yang et al., 6 May 2024).