SWE-Bench Multimodal: Evaluating AI in Visual Software Domains

Last updated: June 10, 2025


Abstract

Current benchmarks for automated software engineering agents, notably SWE-bench, focus on Python codebases and textual issue descriptions, providing limited coverage of the multimodal and cross-language challenges found in contemporary development domains. SWE-bench Multimodal introduces a rigorous benchmark for assessing autonomous AI systems on bug-fixing tasks in JavaScript software, with explicit inclusion of visual artifacts (e.g., screenshots, UI demos) in both problem descriptions and test cases. This article details the motivation, dataset construction methodology, experimental results, and key implications for the field, with reference to "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" (Yang et al., 4 Oct 2024) and related state-of-the-art analyses.


1. Motivation

SWE-bench Multimodal (SWE-bench M) was developed to address the following critical limitations in existing AI software engineering benchmarks:

  • Programming Language Bias: SWE-bench and similar datasets are Python-centric, omitting leading languages like JavaScript, which dominates visual, user-facing, and web application development.
  • Lack of Multimodality: Conventional benchmarks focus on text-based issue statements. In practice, software issues and their resolutions, especially in UI/visual domains, often require reasoning over complex visual content (e.g., screenshots, design artifacts).
  • Generalization Gaps: Automated agents tuned for Python and text often fail when encountering unfamiliar programming paradigms, languages, or tasks that require visual inspection.

SWE-bench M directly targets these gaps, aiming to catalyze both research on, and practical solutions for, robust, language-agnostic, and visually aware code agents.


2. Dataset Construction

Repository Selection

  • Scope: 17 open-source JavaScript/TypeScript repositories, each with at least 5,000 GitHub stars and 500+ PRs (a shortlisting sketch follows this list).
  • Domains: UI components, diagramming, visualization, syntax highlighting, interactive mapping—none include Python.
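
The repository list itself was hand-curated by the authors; purely as an illustration of the stated thresholds, the sketch below shortlists popular repositories via the public GitHub search API. The candidate_repos helper is hypothetical and not part of the paper's tooling, and the 500+ PR criterion would require a separate per-repository check.

```python
# Hypothetical shortlisting sketch for the stated selection criteria
# (JavaScript/TypeScript, >= 5,000 stars). Not the authors' actual pipeline.
import requests

def candidate_repos(language: str, min_stars: int = 5000, per_page: int = 50) -> list[str]:
    """Return full names of popular repositories in one language, sorted by stars."""
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={
            "q": f"language:{language} stars:>={min_stars}",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return [repo["full_name"] for repo in resp.json()["items"]]

if __name__ == "__main__":
    # The 500+ PR requirement would be verified in a second pass per repository.
    for name in candidate_repos("JavaScript")[:10]:
        print(name)
```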

Instance Collection Pipeline

  1. PR/Issue Mining: Over 135,000 pull requests are scraped for candidate issue/PR pairs.
  2. Visual Asset Filtering: Only instances with attached images, videos, or reproduction GIFs (e.g., .png, .jpg, .gif, .mov) in issue descriptions or test cases are considered, yielding 1,478 initial candidates (a filtering sketch follows this list).
  3. Testbed Construction: For each repository, the authors build and debug full Node.js/browser-based test environments (e.g., Chrome-headless), supporting robust and consistent evaluation of web software.
  4. Stability and Quality Filtering:
    • Run all tests for each candidate instance 10× to detect test flakiness; unstable cases are pruned.
    • Human annotators review the necessity of visuals (found essential in 83.5% of cases), check instance validity, and exclude problems marked “impossible” or made redundant by later codebase changes.
  5. Final Dataset: 619 rigorously validated instances involving at least one visual asset per task.
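
As a rough illustration of the visual-asset filter in step 2, the following sketch scans an issue body for attachment links with the listed extensions. The function names and regular expression are hypothetical and only approximate the kind of check involved, not the authors' implementation.

```python
# Hypothetical sketch of the visual-asset filter (step 2): keep only candidates
# whose problem statement links to an image or video asset.
import re

VISUAL_EXTENSIONS = (".png", ".jpg", ".gif", ".mov")

# Matches attachment URLs such as
# https://user-images.githubusercontent.com/.../screenshot.png
ASSET_PATTERN = re.compile(
    r"https?://\S+?(" + "|".join(re.escape(ext) for ext in VISUAL_EXTENSIONS) + r")\b",
    re.IGNORECASE,
)

def has_visual_asset(issue_body: str) -> bool:
    """Return True if the issue text links to at least one visual asset."""
    return ASSET_PATTERN.search(issue_body) is not None

def filter_candidates(candidates: list[dict]) -> list[dict]:
    """Keep candidate instances whose issue description references a visual asset."""
    return [c for c in candidates if has_visual_asset(c.get("problem_statement", ""))]
```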

Image and Visual Taxonomy: Manual coding classifies attached visuals into UI screenshots, code snippets, error messages, diagrams, geospatial maps, data visualizations, and more. For ~80% of these, the information is not replicable in text alone.


3. Evaluation and Key Results

Baselines and Adaptations

  • SWE-agent: Adapted for multimodality ("SWE-agent M") by combining browser automation and screenshot tools with its language-agnostic agent-computer interface (ACI); a minimal screenshot-tool sketch follows this list.
  • Baseline Systems: Variants of Agentless and RAG (retrieval-augmented generation), along with other Python-centric agent frameworks, manually modified for JavaScript and visual tasks.
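
To make the browser automation and screenshot tooling concrete, here is a minimal sketch of what such an agent tool could look like, assuming Playwright with headless Chromium. The paper does not name a specific library, so this is illustrative rather than a description of SWE-agent M's actual implementation.

```python
# Minimal sketch of a screenshot tool an agent could invoke to inspect rendered output.
# Assumes Playwright (not specified by the paper): `pip install playwright`
# followed by `playwright install chromium`.
from playwright.sync_api import sync_playwright

def take_screenshot(url: str, out_path: str = "page.png",
                    width: int = 1280, height: int = 720) -> str:
    """Render `url` in headless Chromium and save a full-page screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": width, "height": height})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_path, full_page=True)
        browser.close()
    return out_path

# Example: screenshot a locally served demo page to verify a UI fix visually.
# take_screenshot("http://localhost:3000", "after_patch.png")
```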

Main Results

System           Model                % Resolved (test)
SWE-agent M      GPT-4o               12.2
SWE-agent JS     Claude 3.5 Sonnet    12.0
SWE-agent Base   Claude 3.5 Sonnet    12.2
Agentless JS     GPT-4o                3.1
Agentless JS     Claude 3.5 Sonnet     6.2
RAG              GPT-4o                6.0
RAG              Claude 3.5 Sonnet     5.0
  • Comparison to SWE-bench: State-of-the-art SWE-bench Lite results (>43% resolved) are dramatically higher than those on SWE-bench M, underscoring the additional challenge of visual and cross-language reasoning.
  • Effect of Visuals: Removing images from problem statements leads to a significant drop in agent performance (e.g., from 13.0% to 8.7% for SWE-agent JS with Claude 3.5 Sonnet).
  • File Localization: Language-agnostic agents localize target files more accurately (F1 = 0.367 for SWE-agent vs. 0.142 for Agentless JS), indicating the advantage of flexible, model-driven navigation over brittle, Pythonic heuristics (a toy F1 computation follows this list).
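
For reference, the file-localization F1 above compares the set of files an agent edits against the files changed in the gold patch. The toy computation below (with hypothetical file names) shows how such a score is formed.

```python
# Toy illustration of file-localization F1: predicted edited files vs. gold patch files.
def file_localization_f1(predicted: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over edited file paths."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: the agent edited two files, one of which matches the gold patch.
print(file_localization_f1({"src/render.ts", "src/utils/color.ts"}, {"src/render.ts"}))  # ~0.667
```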

Error Analysis

  • Failure Cases: Most errors arise from an inability to interpret image cues and align them with code (especially for layout, color, or rendering bugs), insufficient support for JavaScript/TypeScript language structures, and overly long contexts that overwhelm the model.
  • Limitations of Engineered Pipelines: Python-tuned pipelines (AST parsing, static analysis) did not transfer, and often failed to parse or reason reliably about JavaScript/TypeScript and reactive code.

4. Comparative Analysis: SWE-agent’s Architecture

SWE-agent’s design confers empirically supported advantages across language and modality boundaries:

  • Language-Agnosticism
    • Relies on LM reasoning for code localization and patching, avoiding language-bound logic and AST manipulation.
    • Implication: Supports rapid extension to new languages with only light touch-ups (e.g., error feedback integration).
  • Multimodal Tool Use
    • Provides browser and screenshot tools for agents, permitting visual verification and debugging in web-facing tasks; this is a distinct advantage where a text description alone is insufficient.
  • Superior Adaptability
    • Outperforms rigid expert pipelines not only quantitatively (doubling the solve rate on SWE-bench M) but also qualitatively (handling ambiguous, visually rich, or cross-paradigm problems).
  • Generalization
    • Demonstrates robust (if still limited) transfer to JavaScript (and by extension, other language domains), a major desideratum for future code LMs.

5. Practical Implications and Future Directions

Benchmark Value

  • New Standard: SWE-bench Multimodal establishes a crucial diagnostic for evaluating the real-world utility of AI code agents in visual and multi-language contexts commonly neglected by current benchmarks.
  • Catalyst for Research: Promotes work on:
    • Robust, tool-using, language- and modality-agnostic agents.
    • Vision-capable, long-context, and iterative code agents.
    • Broader, production-relevant testbeds beyond text+Python.

Remaining Challenges

  • Visual Reasoning: No system demonstrates “strong” visual understanding; current LMs, even with image capabilities, make only incremental progress on hard JavaScript/UI bugs.
  • Cross-Language Generalization: The performance gulf between Python and JavaScript/TypeScript domains persists, calling for different modeling approaches and training data.
  • Agent/Environment/Modality Support: Integrating agentic, browser-driven, or environment-aware tools with LMs is beneficial but remains complex.

Directions for Further Development

  • Benchmark Expansion: Inclusion of additional languages (e.g., Java, C++), frameworks, asset types (audio, video), and visual test cases.
  • Multimodal Retrieval and Reasoning: Improved retrieval of relevant code, tests, and visuals, plus models trained for navigation and grounding across text, code, and images.
  • Human-in-the-Loop and Hybrid Evaluation: More granular annotation of task difficulty and the necessity of multimodal cues, with end-to-end usability treated as an operational metric.

6. Conclusion

SWE-bench Multimodal provides a rigorous, challenging, and contemporary framework for evaluating software engineering agents under real-world, multimodal, cross-language conditions. The results reveal substantial generalization and visual reasoning deficits in current systems, with robust, language- and modality-agnostic agent architectures (like SWE-agent) yielding measurable, though still limited, advances. This benchmark charts the path forward for research toward practical, universally adept autonomous coding agents that can function across the truly diverse and visually rich modern software landscape.


References

Yang et al. (2024). "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" arXiv, 4 October 2024.

(Correspondence for dataset access, annotation protocols, and code baselines is available via the authors' project page and the arXiv supplement.)