Ambiguous SWE-bench: Interactive Evaluation

Updated 3 July 2026

Ambiguous SWE-bench is a class of benchmarks that evaluate agent performance in under-specified software tasks through iterative goal discovery and clarification.
They use systematic redaction and simulated user interactions to mimic realistic developer–stakeholder exchanges and incremental requirement disclosure.
Evaluation metrics extend beyond test pass rates by measuring requirement discovery, clarification efficiency, and error mode analysis to assess genuine reasoning.

Ambiguous SWE-bench refers to a class of software engineering benchmarks designed to evaluate agent and system performance under under-specification, incomplete requirements, or scenario-driven ambiguity. These benchmarks diverge from traditional, fully specified SWE-bench tasks by demanding interactive goal discovery, iterative clarification, and adaptive reasoning, mirroring the open-ended, information-poor conditions often encountered in actual software development workflows. Empirical results and protocol designs in recent literature show that ambiguous SWE-bench formulations substantially alter task difficulty, evaluation methodology, error modes, and the interpretability of leaderboard performance for both code-generation and system specification tasks.

1. Formalization and Taxonomy of Ambiguity

Ambiguity within SWE-bench-style benchmarks is operationalized by presenting agents with prompts or scenarios that lack critical specification details, so that agents must seek clarification or infer missing requirements before producing a valid solution. In "ToM-SWE: User Mental Modeling For Software Engineering Agents" (Zhou et al., 24 Oct 2025), each ambiguous instance is modeled as a pair $(u_0, T)$, where $u_0$ is an initial, underspecified user utterance and $T$ is the hidden, fully specified test suite. The set of solutions consistent with $u_0$, denoted $C_0$, is large because constraints are omitted; the agent must iteratively obtain information so that the candidate set $C_t$ after turn $t$ shrinks, reducing $|C_t|$ until it converges on a correct solution. Ambiguity is quantified by the information-gap measure $A(u_0) = H(T) - H(T')$, where $T'$ is the subset of $T$ referenced in the initial prompt and $H(\cdot)$ is an entropy or constraint-count measure.
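As a rough illustration of the information-gap measure and the shrinking candidate set, the sketch below takes $H(\cdot)$ to be a plain constraint count. The toy task, the `AXES` table, and the function names are hypothetical and are not drawn from the cited paper.

```python
from itertools import product

# Hypothetical toy task: three behavioural choices the agent must pin down.
AXES = {
    "order": ["asc", "desc"],
    "stable": [True, False],
    "on_none": ["raise", "skip", "last"],
}

# Hidden full specification T, written as one constraint per axis (assumption).
FULL_SPEC = {"order": "desc", "stable": True, "on_none": "skip"}

# Constraints referenced by the initial utterance u_0, i.e. the subset T'.
INITIAL_SPEC = {"order": "desc"}


def candidates(known):
    """Return every candidate behaviour consistent with the known constraints."""
    names = list(AXES)
    combos = (dict(zip(names, values)) for values in product(*AXES.values()))
    return [c for c in combos if all(c[k] == v for k, v in known.items())]


def information_gap(full, initial):
    """A(u_0) = H(T) - H(T'), with H read as a constraint count."""
    return len(full) - len(initial)


known = dict(INITIAL_SPEC)
sizes = [len(candidates(known))]          # |C_0|
for axis in ("stable", "on_none"):        # one clarification per turn
    known[axis] = FULL_SPEC[axis]
    sizes.append(len(candidates(known)))  # |C_1|, then |C_2|

print(information_gap(FULL_SPEC, INITIAL_SPEC))  # 2
print(sizes)                                     # [6, 3, 1]
```

Under this reading, each clarification removes one unit of the gap and prunes the candidate set until a single behaviour remains.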

This approach contrasts with traditional, single-turn SWE-bench paradigms, which provide all requirements up front and evaluate code generation through purely functional metrics (test pass rate, patch exactness). By encoding task ambiguity, ambiguous SWE-bench enables measurement of goal discovery, clarification efficiency, and robustness to evolving requirements (Raghavendra et al., 29 Jun 2026).

2. Construction Protocols and Simulator Design

Construction of ambiguous SWE-bench instances involves systematic redaction, paraphrasing, or withholding of critical information from task prompts. In major exemplars such as "ToM-SWE" (Zhou et al., 24 Oct 2025) and "SWE-INTERACT" (Raghavendra et al., 29 Jun 2026), initial user messages lack function names, parameters, API surfaces, or environmental context. These benchmarks deploy simulators (user personas) that reveal requirements incrementally: feedback and constraints are surfaced one at a time in response to agent queries or code submissions, emulating realistic developer–stakeholder interactions.

Simulators are persona-conditioned (“expert nitpicker,” “busy senior dev”), replicate brief, layered feedback, and enforce “one correction per turn” disclosure. This protocol is formalized as an interactive loop, with containers isolating agent and user workspaces and structured APIs mediating exchanges. Ground-truth dialogue trajectories are maintained for annotation and evaluation, typically paired with a sequence of clarification steps and an eventual successful patch (Zhou et al., 24 Oct 2025). This multistage approach ensures both the presence and controllability of ambiguity.
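The interactive loop described above can be sketched roughly as follows. This is a minimal illustration, not the harness used by either benchmark: the `SimulatedUser` class, the `run_session` function, and the modelling of a patch as a set of satisfied requirement labels are all hypothetical simplifications.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class SimulatedUser:
    """Hypothetical persona-conditioned simulator holding the hidden requirements."""
    persona: str
    hidden_requirements: List[str]
    revealed: List[str] = field(default_factory=list)

    def review(self, patch_features: Set[str]) -> Optional[str]:
        """Disclose at most one unmet requirement per turn; None means satisfied."""
        for requirement in self.hidden_requirements:
            if requirement not in patch_features:
                if requirement not in self.revealed:
                    self.revealed.append(requirement)
                return requirement
        return None


def run_session(user: SimulatedUser, initial_features: Set[str],
                max_turns: int = 5) -> Tuple[bool, List[Tuple[int, Optional[str]]]]:
    """Alternate user feedback and agent revision within a bounded number of turns."""
    features = set(initial_features)
    trajectory = []
    for turn in range(1, max_turns + 1):
        feedback = user.review(features)
        trajectory.append((turn, feedback))
        if feedback is None:
            return True, trajectory
        features.add(feedback)  # stand-in for the agent revising its patch
    return False, trajectory


user = SimulatedUser(
    persona="busy senior dev",
    hidden_requirements=[
        "handles empty input",
        "keeps key order",
        "raises TypeError on None",
    ],
)
resolved, trajectory = run_session(user, {"handles empty input"})
print(resolved, len(trajectory))  # True 3
```

Here the agent always fixes whatever the user points out, so the session length is driven purely by the one-correction-per-turn rule; a real agent could also forget earlier requirements or fail to implement them.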

3. Evaluation Metrics and Experimental Results

Key metrics in ambiguous SWE-bench evaluation extend beyond standard pass rates:

Issue Resolved Rate: Percentage of tasks in which the final code submission passes all hidden tests within a bounded number of interaction turns (Zhou et al., 24 Oct 2025).
Requirement Discovery Rate: Fraction of rubric items or requirements surfaced or discovered by the agent at each turn, $RDR_t = D_t / N_{\text{goal}}$ (Raghavendra et al., 29 Jun 2026).
Clarification Efficiency: Number of clarification questions asked before the first passing patch (Zhou et al., 24 Oct 2025).
ChurnOverhead/LateChangeShare: Quantify excess edits and late-stage code rework, providing insight into agent planning and adaptability (Raghavendra et al., 29 Jun 2026).
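The first three metrics above might be computed along the following lines. This is a sketch under stated assumptions: $D_t$ is read as the cumulative number of requirements discovered by turn $t$, the event labels `"question"` and `"patch_pass"` belong to a hypothetical log format, and the function names are illustrative. ChurnOverhead and LateChangeShare are left out because no formula for them is given here.

```python
from typing import List, Optional, Set


def issue_resolved_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks whose final submission passed all hidden tests."""
    return 100.0 * sum(outcomes) / len(outcomes)


def requirement_discovery_rate(discovered_per_turn: List[Set[str]],
                               n_goal: int) -> List[float]:
    """RDR_t = D_t / N_goal, with D_t taken as cumulative (an assumption)."""
    seen: Set[str] = set()
    rates = []
    for newly_discovered in discovered_per_turn:
        seen |= newly_discovered
        rates.append(len(seen) / n_goal)
    return rates


def clarification_efficiency(events: List[str]) -> Optional[int]:
    """Count questions asked before the first passing patch; None if none passed."""
    questions = 0
    for event in events:
        if event == "question":
            questions += 1
        elif event == "patch_pass":
            return questions
    return None


print(issue_resolved_rate([True, False, False, True]))                   # 50.0
print(requirement_discovery_rate([{"a"}, {"b", "c"}, set(), {"d"}], 5))  # [0.2, 0.6, 0.6, 0.8]
print(clarification_efficiency(
    ["question", "patch_fail", "question", "question", "patch_pass", "question"]
))                                                                       # 3
```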

Empirical results reveal a steep performance drop when moving from single-turn to ambiguous multi-turn workflows: state-of-the-art LLMs and agents that solve around 50% of traditional tasks complete only 25% of interactive, ambiguous tasks (Raghavendra et al., 29 Jun 2026). Error analyses show that ambiguity-induced failures split evenly between forgotten requirements and technical implementation mistakes, with additional modes including misinterpretation and user requirement omission.

Augmentations such as “theory-of-mind” user modeling substantially boost ambiguous SWE-bench performance (+8–12 percentage points), as these dual-agent systems build domain-specific hierarchies and maintain persistent memory over user preferences (Zhou et al., 24 Oct 2025).

4. Impact on Benchmark Validity and Interpretation

Ambiguous SWE-bench exposes latent issues in the interpretation of conventional SWE-bench results. "The SWE-Bench Illusion" (Liang et al., 14 Jun 2025) details how high accuracy on fully specified tasks may be confounded by memorization effects: models exploit repository bias or instance-specific exposure rather than genuine reasoning. Conversely, ambiguous SWE-bench is resistant to such contamination by design: the open-endedness and requirement discovery process force systems to generalize and interact, directly probing transferable problem-solving skills rather than memorized mappings.

Ambiguity also impacts test adequacy. "SWE-ABS" (Yu et al., 28 Feb 2026) demonstrates that one in five previously “solved” SWE-Bench tasks harbors semantic errors undetected by weak test suites. Ambiguity in specification or deficient test coverage can mask incorrect solutions, inflating leaderboard success rates. This underscores the importance of adversarial strengthening and coverage-guided augmentation to expose and correct for ambiguous or under-tested requirements.
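A toy example of how a weak test suite can mask a semantic error is sketched below. It is purely illustrative and does not reproduce the SWE-ABS procedure; the function `dedupe_patch` and both suites are hypothetical, and the implied requirement (duplicates removed, first-occurrence order preserved) is an assumption made for the example.

```python
def dedupe_patch(items):
    """Candidate patch: removes duplicates but silently sorts the result."""
    return sorted(set(items))


def weak_suite(fn):
    """Original test: the input happens to be sorted, so ordering is never probed."""
    return fn([1, 1, 2]) == [1, 2]


def strengthened_suite(fn):
    """Adds a case that checks first-occurrence order is preserved."""
    return weak_suite(fn) and fn([3, 1, 3, 2]) == [3, 1, 2]


print(weak_suite(dedupe_patch))          # True: counted as "solved"
print(strengthened_suite(dedupe_patch))  # False: the ordering bug is exposed
```

The patch is unchanged between the two runs; only the coverage of the under-specified ordering requirement differs.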

5. Relationship to Specification Ambiguity

Specification-level ambiguity, as articulated in "SpecBench" (Hamblin et al., 28 May 2026), is directly encoded as one of four major deficiency classes (omission, ambiguity, inconsistency, incorrect assumption). Here, ambiguity is strictly defined per IEEE Std. 1028–1997 as admitting more than one reasonable interpretation. Agents must enumerate such ambiguous constructs within incomplete design proposals, with accuracy measured by how many expert-identified defects they surface. Despite access to full project history and consultation data, even top-tier agents identify only about 44.4% of actual deficiencies, with true ambiguity detection proving more difficult than omission or simple inconsistency.
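Scoring by "how many expert-identified defects an agent surfaces" can be read as a recall computation, overall and per deficiency class. The sketch below assumes that reading; the defect identifiers, the data layout, and the function name are hypothetical and not taken from SpecBench.

```python
from collections import Counter

CLASSES = ("omission", "ambiguity", "inconsistency", "incorrect assumption")


def recall_by_class(expert_defects, agent_flagged):
    """Share of expert-identified defects the agent surfaced, per deficiency class.

    expert_defects: dict mapping defect id -> deficiency class.
    agent_flagged:  set of defect ids the agent reported.
    """
    total = Counter(expert_defects.values())
    hits = Counter(cls for did, cls in expert_defects.items() if did in agent_flagged)
    return {cls: hits[cls] / total[cls] for cls in CLASSES if total[cls]}


expert = {
    "D1": "omission",
    "D2": "omission",
    "D3": "ambiguity",
    "D4": "ambiguity",
    "D5": "inconsistency",
    "D6": "incorrect assumption",
}
flagged = {"D1", "D2", "D3", "D5", "D9"}  # D9 matches no expert-identified defect

per_class = recall_by_class(expert, flagged)
overall = len(flagged & set(expert)) / len(expert)
print(per_class)
print(round(overall, 3))  # 0.667
```

Reporting recall per class makes it visible when an agent handles omissions well but misses genuinely ambiguous constructs.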

SpecBench highlights that ambiguous SWE-bench is not limited to code or patch-level uncertainties, but is an essential frontier in specification understanding, requirements engineering, and review simulation.

6. Methodological Best Practices and Future Directions

Best practices arising from ambiguous SWE-bench development include:

Use verifier-blind, persona-driven user simulators tethered to the agent’s actual workspace; static LLM evaluators lacking access to intermediate code states are discouraged (Raghavendra et al., 29 Jun 2026).
Decompose requirements into rubric items to obtain interpretable, turn-by-turn progress signals and to support fair benchmarking.
Standardize prompt scaffolding: fixed interaction style, disclosure rules, and hidden full specifications.
Embed cost-aware and rank-aware selection in benchmark subsets to permit reproducible, low-cost development cycles without sacrificing ranking or difficulty stratification (Zheng et al., 10 Jun 2026).
Extend benchmark diversity: favor tasks with layered, cross-cutting, and evolving requirements (e.g., cross-module changes, concurrency, heterogeneous language codebases).
Support human-in-the-loop correction and ambiguity calibration pipelines, with profiles or ambiguity scores parameterizing allowed clarification rounds (Zhou et al., 24 Oct 2025).
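One simple way to check whether a cheaper benchmark subset is rank-aware is to compare the agent ranking it induces with the ranking on the full task set. The sketch below shows only that check, with made-up agents and results; it is not the selection procedure of any cited benchmark, and the function names are hypothetical.

```python
from typing import Dict, List


def ranking(results: Dict[str, List[int]], task_ids: List[int]) -> List[str]:
    """Rank agents by pass rate on the given tasks (ties broken by name)."""
    scores = {
        agent: sum(passed[i] for i in task_ids) / len(task_ids)
        for agent, passed in results.items()
    }
    return sorted(scores, key=lambda agent: (-scores[agent], agent))


def preserves_ranking(results: Dict[str, List[int]], subset: List[int]) -> bool:
    """True if the subset orders agents exactly as the full task set does."""
    n_tasks = len(next(iter(results.values())))
    return ranking(results, subset) == ranking(results, list(range(n_tasks)))


# Hypothetical per-task outcomes (1 = resolved, 0 = not resolved).
results = {
    "agent_a": [1, 1, 1, 0, 1, 1],
    "agent_b": [1, 0, 1, 0, 1, 0],
    "agent_c": [0, 1, 0, 1, 0, 0],
}

print(preserves_ranking(results, [1, 3]))        # False: agent_c jumps to first
print(preserves_ranking(results, [0, 1, 2, 5]))  # True: same order at 4 of 6 tasks
```

A cost-aware selection would then prefer the smallest or cheapest subset for which this check still holds.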

The research community recommends moving toward richer sandboxing, latent-state tracking, and sandwiched evaluations (pre-, mid-, and post-handoff) to assess agents' robustness to ambiguity in both technical and collaborative software engineering settings.

7. Ambiguity Outside Classical SWE-bench: Broader Benchmarking Paradigms

Ambiguity-centric design principles have propagated across the broader SWE-bench family and adjacent benchmarks:

Domain-governed testbeds such as SWE-Bench 5G (Chen et al., 29 Apr 2026) introduce ambiguity through multi-source domain knowledge (e.g., 3GPP specification clauses), requiring agents to resolve latent requirements not expressed in standard bug descriptions.
Multi-language and multi-harness benchmarks such as Claw-SWE-Bench (Zheng et al., 10 Jun 2026) reinforce the need for adapters and standardized contracts to compare agentic solutions under variable, potentially ambiguous workspace and runtime constraints.
Full-stack, agency-level frameworks such as SWE-WebDevBench (Saxena et al., 6 May 2026) encode ambiguity as a property of business requirements, non-functional intent, and deployment-scale readiness, providing multidimensional, role-aware evaluation cubes.

By treating ambiguity as an axis of benchmark difficulty and realism, these paradigms ensure that system advances are measured not just by patch, function, or architectural novelty, but by genuine progress in collaborative, interactive software development.

Key Sources

(Zhou et al., 24 Oct 2025): ToM-SWE: User Mental Modeling For Software Engineering Agents
(Raghavendra et al., 29 Jun 2026): SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions
(Yu et al., 28 Feb 2026): SWE-ABS: Adversarial Benchmark Strengthening Exposes Inflated Success Rates on Test-based Benchmark
(Hamblin et al., 28 May 2026): SpecBench: Evaluating Specification-Level Reasoning for Software Engineering LLM Agents
(Chen et al., 29 Apr 2026): SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks
(Liang et al., 14 Jun 2025): The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason
(Zheng et al., 10 Jun 2026): Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
(Saxena et al., 6 May 2026): SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies