SWE-Agent with Claude 4 Sonnet
- SWE-Agent with Claude 4 Sonnet is an advanced agentic coding system that integrates LLM-driven dynamic prompting with modular repository navigation.
- Empirical benchmarks show a 61.0% functional correctness rate but only 10.5% secure outputs, highlighting significant security vulnerabilities.
- The framework employs chain-of-thought strategies and structured API protocols to orchestrate complex, end-to-end software engineering tasks.
SWE-Agent with Claude 4 Sonnet denotes an advanced agentic coding architecture that pairs the SWE-Agent framework with Anthropic's frontier Claude 4 Sonnet model as its LLM backbone. This paradigm is prominent in recent studies of "vibe coding," where LLM agents autonomously perform complex end-to-end software engineering tasks under real-world constraints. Empirical evidence from leading benchmarks such as SUSVIBES, SecureAgentBench, and SWE-Compass collectively demonstrates SWE-Agent's strengths in repository-scale navigation and functional code generation, while highlighting persistent software security challenges and the limitations of current prompting-based mitigations (Zhao et al., 2 Dec 2025, Yang et al., 6 May 2024, Chen et al., 26 Sep 2025, Xu et al., 7 Nov 2025).
1. System Design and Workflow
The core of SWE-Agent is a modular, loop-based architecture orchestrating LLM-driven repository interaction. Each episode consists of the following cyclic phases:
- Repository Observation: The agent retrieves repository state, including file tree and recent diffs.
- Action Planning: The agent, via the Claude 4 Sonnet LLM, plans discrete actions such as opening files, editing code, or running tests.
- Dynamic Prompting: Action proposals are generated by invoking the LLM with a system prompt and per-turn context, including an explicit chain-of-thought (CoT) strategy: "think step by step about file edits, test execution, and patch generation."
- Action Execution: The specified action modifies source code, invokes shell commands, or triggers test runs, with results captured for subsequent cycles.
- Terminal Condition: Iteration continues until a task-defined endpoint is reached: typically, all tests pass or a hard limit of 200 planning steps is hit (Zhao et al., 2 Dec 2025). A minimal sketch of this loop follows.
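A minimal sketch of the observe-plan-act episode loop, under the assumption of hypothetical `Repository` and `Planner` interfaces standing in for the framework's actual agent-computer interface and LLM-backed planner:

```python
# Hedged sketch of the SWE-Agent episode loop; the Repository and Planner
# protocols are hypothetical stand-ins, not the framework's real interfaces.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class StepResult:
    output: str
    all_tests_pass: bool = False

class Repository(Protocol):
    def observe(self) -> str: ...                          # file tree, recent diffs, last output
    def execute(self, action: str) -> StepResult: ...       # edit file, run command, or run tests

class Planner(Protocol):
    def plan(self, task: str, observation: str, history: list) -> str: ...

MAX_STEPS = 200  # hard cap on planning steps per episode

def run_episode(task: str, planner: Planner, repo: Repository) -> list:
    history: list = []
    for _ in range(MAX_STEPS):
        observation = repo.observe()
        action = planner.plan(task, observation, history)   # CoT-prompted single action
        result = repo.execute(action)
        history.append((action, result))
        if result.all_tests_pass:                            # task-defined terminal condition
            break
    return history
```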
The Agent-Computer Interface (ACI) enforces strict command schemas, guardrails (e.g., flake8 for Python syntax validation), and budgeting for API calls, cost, and wall-clock timeouts (Yang et al., 6 May 2024). Integration with Claude 4 Sonnet is performed via the Anthropic API, using model-appropriate prompt syntax and API parameter tuning (temperature=0.0, deterministic decoding, pass@1) (Zhao et al., 2 Dec 2025, Xu et al., 7 Nov 2025).
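As an illustration of the syntax guardrail, a minimal sketch of a check that rejects an edit when flake8 reports errors; the gating policy and the selected error codes are assumptions, not the framework's exact configuration:

```python
# Hedged sketch: reject an agent's file edit if flake8 flags syntax errors
# or undefined names. The reject-on-nonzero-exit policy is an assumption.
import subprocess

def passes_flake8(path: str) -> bool:
    """Return True if flake8 reports no issues for the edited file."""
    proc = subprocess.run(
        ["flake8", "--select=E9,F821,F823", path],  # syntax errors, undefined names
        capture_output=True,
        text=True,
        timeout=60,
    )
    return proc.returncode == 0

# Usage inside the edit action: only apply the edit when the check passes.
if not passes_flake8("src/module_under_edit.py"):
    print("Edit rejected: flake8 reported errors; ask the model to retry.")
```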
2. Prompt Design and Subroutine Structure
SWE-Agent utilizes a sophisticated prompting regime, both at the episode level and for specialized subroutines. The agent's main system prompt enumerates supported interface commands and enforces single-action JSON protocols (e.g., {"discussion":"...", "command":"..."} for classic ACI; explicit function-calling for SWE-Compass).
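A sketch of the single-action protocol, assuming a minimal command whitelist; the allowed command set shown here is illustrative, not the full ACI interface:

```python
# Hedged sketch: parse and validate one agent turn under the single-action
# JSON protocol {"discussion": "...", "command": "..."}.
import json

ALLOWED_COMMANDS = {"open", "edit", "search_dir", "create", "submit"}  # illustrative subset

def parse_turn(raw_llm_output: str) -> dict:
    turn = json.loads(raw_llm_output)
    command_name = turn["command"].split()[0]
    if command_name not in ALLOWED_COMMANDS:
        raise ValueError(f"Unsupported command: {command_name}")
    return turn

turn = parse_turn('{"discussion": "Inspect the failing module first.", "command": "open src/auth.py"}')
print(turn["command"])  # -> open src/auth.py
```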
For SUSVIBES tasks, three prompt templates structure feature masking, issue description, and mask verification (sketched after this list):
- Prompt I (Feature Masking): Instructs Claude 4 Sonnet to produce a deletion mask encapsulating an implementation area around a provided diff.
- Prompt II (Issue-Style Task): Generates a developer-facing issue specification for re-implementing the masked code.
- Prompt III (Mask Verification): Verifies that each implementation line is justified by the issue requirements (Zhao et al., 2 Dec 2025).
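A sketch of how the three templates could be organized programmatically; the template wording below paraphrases the descriptions above and is not quoted from the benchmark:

```python
# Hedged sketch: paraphrased SUSVIBES prompt templates keyed by pipeline stage.
PROMPTS = {
    "feature_masking": (
        "Given the following diff, produce a deletion mask that encapsulates the "
        "implementation area surrounding it:\n{diff}"
    ),
    "issue_style_task": (
        "Write a developer-facing issue describing the feature that must be "
        "re-implemented for the masked region:\n{masked_region}"
    ),
    "mask_verification": (
        "For each implementation line below, state whether it is justified by the "
        "issue requirements:\n{issue}\n{implementation}"
    ),
}

def render(stage: str, **fields: str) -> str:
    return PROMPTS[stage].format(**fields)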
Operational prompts in SWE-Compass and similar agentic loops embed CoT directives, tool call schemas (e.g., bash, str_replace_editor, file_viewer), and stateful user contexts (e.g., last test failures), all carried in structured JSON over the chat.completions API (Xu et al., 7 Nov 2025).
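An illustrative tool declaration in the chat.completions function-calling format referenced above; the schema structure is standard for that API, but the specific descriptions and parameters are assumptions:

```python
# Hedged sketch: tools declared in chat.completions function-calling format.
BASH_TOOL = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a shell command in the sandboxed repository checkout.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string", "description": "Command to execute."}},
            "required": ["command"],
        },
    },
}

STR_REPLACE_EDITOR_TOOL = {
    "type": "function",
    "function": {
        "name": "str_replace_editor",
        "description": "View a file or replace an exact string within it.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old_str": {"type": "string"},
                "new_str": {"type": "string"},
            },
            "required": ["path"],
        },
    },
}
```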
3. Benchmarking: Functional Correctness and Security
Metric Formulation
Three primary metrics are used for quantitative evaluation:
- Functional Correctness (FuncPass): A solution must pass the repository's original unit tests.
- Security Rate (SecPass): Only solutions passing both the functional and the security-specific post-fix tests are counted as secure.
- Insecurity Rate: The share of functionally correct solutions that nevertheless fail the security-specific tests.
These definitions extend to multi-category outcome values in SecureAgentBench, adding distinctions such as "Correct but Vulnerable" (CV), "Correct and Secure" (CS), and outcomes that introduce new vulnerabilities (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025). The metrics are formalized below.
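A compact formalization of these metrics, offered as a plausible reconstruction rather than the benchmarks' original notation; the insecurity rate is written relative to the functionally correct set, consistent with the roughly 82% figure discussed in Section 4:

```latex
% N: number of tasks; F: set of functionally correct solutions;
% S: set of solutions that also pass the security-specific tests (S \subseteq F).
\mathrm{FuncPass} = \frac{|F|}{N}, \qquad
\mathrm{SecPass}  = \frac{|S|}{N}, \qquad
\mathrm{InsecRate} = \frac{|F \setminus S|}{|F|}
```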
Empirical Results
Key results for SWE-Agent using Claude 4 Sonnet on SUSVIBES:
| Agent Framework | FuncPass (%) | SecPass (%) |
|---|---|---|
| SWE-Agent | 61.0 | 10.5 |
| OpenHands | 49.5 | 12.5 |
| Claude Code | 44.0 | 6.0 |
Despite the highest functional correctness (61.0%), only 10.5% of SWE-Agent+Claude 4 Sonnet's outputs are secure, with 50.5% of all tasks yielding correct but insecure code. For comparison, in SecureAgentBench (with Claude 3.7 Sonnet), SWE-Agent achieves approximately 38% functional correctness and 10% security rate; projections for Claude 4 Sonnet place the security rate in the 12-18% range (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025). In SWE-Compass, Pass@1 rates are reported per task category under an explicitly defined metric (Xu et al., 7 Nov 2025).
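For reference, the standard unbiased pass@k estimator used in code-generation evaluation is sketched below; with temperature 0.0 and a single sample per task, Pass@1 reduces to the fraction of tasks whose solution passes all tests:

```latex
% n: samples generated per task; c: samples passing all tests; k: evaluation budget.
\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
```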
4. Vulnerability Analysis and Security Failure Modes
Systematic analysis highlights recurrent security failure modes across agent-generated code:
- Timing side channels: Premature returns in authentication functions yielding observable timing differences (e.g., Django's verify_password); a minimal illustration follows this list.
- Header injection: Unsanitized CRLF sequences in HTTP Location headers enabling response splitting (e.g., Buildbot).
- Cross-site scripting (XSS): Unchecked javascript: URL scheme in content-management features (e.g., Wagtail).
- Session expiry bypass: Failure to enforce session max_age, leading to persistent stale credentials (e.g., aiohttp_session) (Zhao et al., 2 Dec 2025).
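A minimal, self-contained illustration of the timing side-channel pattern; this is not the Django verify_password implementation, and the hashing scheme is simplified for brevity:

```python
# Illustrative only: schematic password checks, not real framework code.
import hmac
import hashlib

def verify_password_vulnerable(stored_hash: str, candidate: str) -> bool:
    # Early return on a missing hash makes the failure path measurably faster,
    # leaking information through response timing.
    if not stored_hash:
        return False
    return hashlib.sha256(candidate.encode()).hexdigest() == stored_hash

def verify_password_hardened(stored_hash: str, candidate: str) -> bool:
    # Always perform the full hash computation and a constant-time comparison,
    # so timing does not reveal whether (or why) the check failed.
    candidate_hash = hashlib.sha256(candidate.encode()).hexdigest()
    reference = stored_hash or hashlib.sha256(b"dummy").hexdigest()
    return hmac.compare_digest(candidate_hash, reference) and bool(stored_hash)
```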
Empirical studies show that over 82% of functionally correct solutions produced by SWE-Agent+Claude 4 Sonnet remain insecure (i.e., fail SecPass). Notably, vulnerability patterns remain diverse, spanning 77 distinct CWE categories in SUSVIBES (Zhao et al., 2 Dec 2025).
5. Prompting-Based Security Mitigations and Limitations
Multiple strategies for prompting SWE-Agent+Claude 4 Sonnet toward security robustness have been evaluated (illustrated in the sketch after this list), including:
- Generic reminders: Appending best-practice notices ("Make sure to follow best security practices...").
- Self-selection CWE: Having the agent read the full CWE list, select relevant categories, and integrate protective logic.
- Oracle CWE hints: Explicitly instructing the agent which concrete CWE to avoid (Zhao et al., 2 Dec 2025).
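A hedged sketch of how such prompt suffixes might be constructed; the wording is paraphrased from the strategy descriptions above, not quoted from the benchmark:

```python
# Hedged sketch: illustrative prompt-suffix builders for the three mitigation
# strategies; the exact benchmark wording is not reproduced here.
GENERIC_REMINDER = (
    "Make sure to follow best security practices when writing or modifying code."
)

def self_selection_suffix(cwe_catalog: dict) -> str:
    # The agent is shown the CWE list and asked to pick relevant categories itself.
    listing = "\n".join(f"- {cwe_id}: {title}" for cwe_id, title in cwe_catalog.items())
    return (
        "Review the following CWE categories, select those relevant to this task, "
        "and add protective logic for them:\n" + listing
    )

def oracle_hint_suffix(target_cwe: str) -> str:
    # The concrete CWE to avoid is given explicitly (oracle setting).
    return f"Your implementation must not introduce {target_cwe}; add safeguards against it."

# Example: append the oracle hint to the agent's task prompt.
task_prompt = "Re-implement the masked feature described in the issue."
prompt_with_hint = task_prompt + "\n\n" + oracle_hint_suffix("CWE-113 (HTTP response splitting)")
```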
Results indicate that self-selection and oracle hints reduce functional correctness (by up to roughly 8.5 percentage points) without improving ultimate security rates; trade-off effects are observed in which some insecure-to-secure transitions are offset by new secure-to-incorrect ones. Agents also exhibit difficulty in reliable CWE identification (mean taskwise recall 0.737). These findings corroborate SecureAgentBench's conclusion that explicit security prompting fails to substantially increase the rate of correct-and-secure outcomes, instead necessitating deeper architectural, training, or tool-integrated interventions (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025).
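One plausible formalization of the taskwise CWE-identification recall cited above, assuming per-task sets of self-selected and ground-truth CWE categories:

```latex
% P_t: CWE categories the agent selects for task t; G_t: ground-truth categories for t; T: number of tasks.
\mathrm{Recall}_t = \frac{|P_t \cap G_t|}{|G_t|}, \qquad
\overline{\mathrm{Recall}} = \frac{1}{T}\sum_{t=1}^{T}\mathrm{Recall}_t \approx 0.737
```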
6. Implementation and Reproduction Guidelines
Practical deployment of SWE-Agent with Claude 4 Sonnet involves:
- API and Configuration: Anthropic Python SDK with model="claude-sonnet-4-20250514", temperature=0.0, max_tokens_to_sample=2048, and function-calling enabled for tool invocation schemas (Xu et al., 7 Nov 2025); a minimal configuration sketch follows this list.
- Tool Abstraction Layer: Provides access to containerized bash, file editors, repository-wide search utilities, and test runners within an isolated Docker environment.
- Turn and Time Budgets: Recommendations include max_turns=150, a per-tool timeout of 600 s, and job-level capping via containerized process limits.
- State and Context Management: Retain only the last 10-15 tool interactions, summarize earlier history beyond 50k tokens, and enforce strict guardrails on code edits (e.g., flake8 for Python syntax, static analysis as in SecureAgentBench for C/C++).
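A minimal configuration sketch under these settings, using the current Anthropic Messages API (whose sampling-length parameter is max_tokens; older SDKs named it max_tokens_to_sample); the tool definition shown is illustrative rather than the benchmark's exact schema:

```python
# Hedged sketch: one Claude 4 Sonnet planning turn via the Anthropic Messages API.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

BASH_TOOL = {
    "name": "bash",
    "description": "Run a shell command inside the sandboxed repository container.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    temperature=0.0,          # deterministic decoding for pass@1 evaluation
    system="You are a software engineering agent. Think step by step about "
           "file edits, test execution, and patch generation.",
    tools=[BASH_TOOL],
    messages=[{"role": "user", "content": "Run the test suite and report failures."}],
)

# Tool calls appear as tool_use blocks in the response content.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```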
Configuration best practices, including OS pinning, offline cache prep, and encoding normalization, are detailed for robust reproduction (Xu et al., 7 Nov 2025).
7. Implications and Prospects
SWE-Agent with Claude 4 Sonnet achieves competitive functional coverage in agentic coding benchmarks, often outperforming alternative agents in raw Pass@1 rates; yet the persistence of security vulnerabilities in over 80% of its functionally correct outputs exposes critical limitations of current LLM agentic methods under the vibe coding paradigm (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025). Prompt engineering, even with precise task-specific CWE guidance, does not meaningfully mitigate insecurity. This suggests that future progress will require integration of first-class security objectives at the architectural or training level, potentially involving hybrid combinations of LLMs with traditional static/dynamic analysis tools or security-specific fine-tuning.
SWE-Agent's modular orchestration loop and extensible ACI maintain its relevance as an empirical testbed for evaluating and hardening agentic workflows under evolving model backends, such as Claude 4 Sonnet. These findings motivate ongoing research into agent-aligned software security, the limits of prompt-driven risk mitigation, and the design of secure automated code-generation pipelines for deployment in security-sensitive domains (Zhao et al., 2 Dec 2025, Chen et al., 26 Sep 2025, Xu et al., 7 Nov 2025, Yang et al., 6 May 2024).