- The paper demonstrates that no model exceeded a 39% Task Completion Rate, revealing significant blind execution in code generation.
- It employs persona-driven instruction perturbation and a unified multi-path action space to evaluate clarification, implementation, and verification.
- Empirical findings indicate excessive code over-generation with high hallucination rates and ineffective GUI feedback utilization, highlighting urgent needs for dynamic planning.
InteractWeb-Bench: Evaluating Multimodal Agents for Interactive, Intent-Aligned Website Generation
InteractWeb-Bench addresses a critical deficiency in multimodal LLM (MLLM)-based website generation: blind execution. This failure mode arises when agents, given only the ambiguous, low-quality instructions typical of non-expert users, synthesize code without clarifying intent or verifying intermediate outputs from the user's perspective. Existing benchmarks assume well-structured, information-rich, and logically consistent instructions, ignoring the high-variance, underspecified, or contradictory inputs common in practical software engineering. As a result, current systems are brittle under realistic user conditions: they fail to recover the user's true intent and do not adapt their interaction as the development workflow progresses.
Benchmark Design: Realism Through Persona-Driven Instruction Perturbation
InteractWeb-Bench introduces a systematic evaluation framework grounded in requirements engineering defect taxonomies and the pragmatics of natural user-agent communication. Central to the benchmark are four simulated user personas, Minimalist (P-MIN), Rambling (P-RAM), Intuitive (P-INT), and Conflicting (P-CON), which induce different forms of instruction ambiguity, noise, and contradiction. The minimalist persona omits secondary constraints, testing intent elicitation; the rambling persona adds extensive irrelevant conversational context, testing robust information extraction; the intuitive persona uses sensory metaphors instead of engineering terminology, probing multimodal semantic alignment; and the conflicting persona deliberately injects logical contradictions, testing the agent's ability to detect and resolve invalid requirements. This persona-driven mutation yields 404 dynamic test cases spanning a curated distribution of task complexity, so that robustness is evaluated on realistic, noisy user instructions.
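The four personas can be pictured as transformations applied to a well-formed seed task. The following is a toy, rule-based sketch only: the `SeedTask` fields, the `perturb` function, and the filler sentences are hypothetical names introduced here for illustration, and the benchmark's actual perturbation pipeline is not described at this level of detail in the summary above.

```python
from dataclasses import dataclass
from enum import Enum


class Persona(Enum):
    P_MIN = "minimalist"
    P_RAM = "rambling"
    P_INT = "intuitive"
    P_CON = "conflicting"


@dataclass(frozen=True)
class SeedTask:
    """Hypothetical well-formed task used as the starting point for perturbation."""
    core: str                    # the primary request
    secondary: tuple[str, ...]   # secondary constraints stated in engineering terms
    metaphor: str                # sensory paraphrase of the secondary constraints
    contradiction: str           # a requirement that conflicts with an existing one


def perturb(task: SeedTask, persona: Persona) -> str:
    """Toy stand-in for persona-driven mutation (an assumption, not the paper's method)."""
    full = " ".join((task.core, *task.secondary))
    if persona is Persona.P_MIN:
        # Omit secondary constraints so the agent has to elicit them.
        return task.core
    if persona is Persona.P_RAM:
        # Bury the full request in irrelevant conversational context.
        return f"Sorry, long day, my cousin just visited. {full} Anyway, the weather is awful."
    if persona is Persona.P_INT:
        # Swap engineering terminology for a sensory description.
        return f"{task.core} {task.metaphor}"
    # P_CON: append a requirement that contradicts an earlier one.
    return f"{full} {task.contradiction}"
```

Under these assumptions, the same seed task produces one instruction per persona, which is one plausible way to read "persona-driven mutation".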
Interactive Environment: Unified Multi-Path Action Space
Unlike prior benchmarks, InteractWeb-Bench places the agent in an interactive execution environment with a discrete, non-linear action space: Clarify (proactively questioning the user), Implement (code synthesis and execution), Verify (GUI-based structural and visual inspection), and Submit (final task delivery). This environment operationalizes an iterative workflow that demands dynamic trade-offs between requirement clarification, multimodal feedback integration, and code implementation. Agents must choose actions autonomously based on the evolving context, moving between eliciting additional specifications (without simply asking for the entire ground truth), generating code, and using structured visual feedback (screenshots, browser console errors, and explicit system reasoning traces) to inform repair and refinement. Limits on trajectory length and exploration guard against infinite debugging loops and context window exhaustion.
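The interaction loop described above can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: `run_episode`, the `policy` and `env` callables, and the default `max_steps` value are assumptions made here, and only a single step cap is modeled rather than the separate exploration boundary.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    CLARIFY = "clarify"      # ask the simulated user a question
    IMPLEMENT = "implement"  # generate and execute code
    VERIFY = "verify"        # inspect the rendered GUI
    SUBMIT = "submit"        # deliver the final website


# One trajectory step: the action taken and the observation it produced.
Step = tuple[Action, str]


def run_episode(
    policy: Callable[[list[Step]], Action],
    env: Callable[[Action], str],
    max_steps: int = 20,  # hypothetical trajectory bound
) -> tuple[list[Step], bool]:
    """Run one bounded episode; returns the trajectory and whether Submit was reached."""
    history: list[Step] = []
    for _ in range(max_steps):
        # The agent may pick any action at any step (non-linear action space).
        action = policy(history)
        observation = env(action)
        history.append((action, observation))
        if action is Action.SUBMIT:
            return history, True
    # Step cap reached: stop instead of looping on implement/verify forever.
    return history, False
```

A scripted policy such as `lambda h: script[len(h)]` is enough to exercise the loop; a real agent would condition on the screenshots and console errors carried in the observations.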
Evaluation Protocols and Metrics
Task completion is evaluated with a fine-grained, constraint-slot-based metric framework. Each web generation task is decomposed into atomic constraints (oracle slots), which are assessed by an independent visual evaluator (WebVoyager, augmented with Set-of-Mark (SoM) prompting) operating on GUI renderings. The principal metric, Task Completion Rate (TCR), is the proportion of oracle slots fulfilled and so measures how completely the agent implemented the specified requirements. A hallucination rate tracks unrequested or redundant UI elements, exposing functional and stylistic over-generation. Visual Bug Rate (VBR) and aesthetic quality (rated on a Likert scale by both automated and expert human judges) provide additional axes of assessment, covering rendering failures and design coherence, respectively.
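As a rough illustration of the slot-based scoring, the sketch below computes TCR as the fraction of fulfilled oracle slots, which follows the definition above. The hallucination rate formula is an assumption: the summary only says the metric tracks unrequested or redundant UI elements, so the version here (share of rendered elements that no slot asked for) is one plausible reading, and both function names are hypothetical.

```python
def task_completion_rate(slot_fulfilled: dict[str, bool]) -> float:
    """TCR: proportion of oracle slots judged fulfilled by the evaluator."""
    if not slot_fulfilled:
        raise ValueError("a task needs at least one oracle slot")
    return sum(slot_fulfilled.values()) / len(slot_fulfilled)


def hallucination_rate(rendered: set[str], requested: set[str]) -> float:
    """Assumed definition: share of rendered UI elements that were never requested."""
    if not rendered:
        return 0.0
    return len(rendered - requested) / len(rendered)


# Example: two of four slots fulfilled, one of four rendered elements unrequested.
slots = {"navbar": True, "hero_image": False, "contact_form": True, "dark_theme": False}
print(task_completion_rate(slots))  # 0.5
print(hallucination_rate({"navbar", "hero", "chat_widget", "footer"},
                         {"navbar", "hero", "footer"}))  # 0.25
```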
Empirical Findings
Across diverse MLLM agents (Qwen3.6-Plus, Kimi-K2.5, Qwen3.5-397B-A17B, Gemma-4-31B-it, GPT-4.1 family, Gemini-3.1-Flash-Lite), experiments reveal persistent limitations:
- Blind execution persists as a dominant failure mode: No agent exceeded a 39% TCR, and the top models (Qwen3.6-Plus, Kimi-K2.5) exhibited high rates of both premature code submission and excessive code hallucination (over 60%). Clarification Hit Rate (CHR) remained below 40% across all models, indicating a poor ability to proactively elicit missing information.
- Models compensate for underspecification via over-generation: Agents systematically produced large codebases (often >1000 LoC) with a commensurate increase in hallucinated UI elements and superfluous features rather than seeking clarifying information.
- GUI-based feedback utilization is ineffective: Despite the inclusion of rich, structured visual and error feedback, agents failed to meaningfully update their internal task models or engage in global replanning. Debugging degenerated into local, superficial patching cycles (implement–verify loops) rather than substantive reassessment of requirement misalignment.
- User persona sensitivity: Agents exhibited significantly degraded performance for the P-MIN (minimalist) persona, reflecting a marked inability to reconstruct intent from sparse information. Performance was less impacted by noise (P-RAM), indicating relative robustness to low-signal but verbose instructions compared to truly incomplete requirements.
- Exploration–commitment trade-off: Model behavior varied between indecisive, exploratory patterns (long trajectories, increased clarification, low submission rates) and overconfident early commitment (fewer clarification steps, high submission rates, increased hallucination). No agent attained a robust balance conducive to optimal performance.
- Aesthetic and visual quality plateau: Models reliably delivered structurally valid websites, but subtle layout errors and visual inconsistencies persisted. A slight correlation between model scale and aesthetic quality was observed; however, creative alignment remained limited across models.
Implications and Future Directions
InteractWeb-Bench exposes the inadequacy of current MLLM-based agents for interactive, user-aligned website generation in environments featuring realistic linguistic ambiguity and visual complexity. Empirically, agents exhibit only superficial intent understanding, limited clarification strategies, and a strong tendency to default to passively following initial instructions, even when clearly insufficient or inconsistent. This points to foundational gaps in multimodal intent modeling, dynamic interactive planning, and cross-modal feedback assimilation.
To advance toward genuinely user-aligned, co-creative agents, research must pivot toward:
- Hierarchical intent modeling integrating both language and perception, bridging the semantic gap between ambiguous human input and technical requirement representations.
- Meta-cognitive models capable of recognizing uncertainty, proactively questioning, and negotiating requirements with both language and visual feedback mechanisms.
- Dynamic, memory-augmented agent architectures that facilitate effective repair and replanning across protracted, multimodal user interactions.
- Robust hallucination suppression strategies to ensure that generated UIs remain minimal, relevant, and aligned with user intent, even under sparse specifications.
The InteractWeb-Bench framework provides the benchmark and interactive environment needed to drive progress on these fronts.
Conclusion
InteractWeb-Bench provides the first multimodal, persona-driven interactive benchmark explicitly designed to evaluate whether website generation agents can escape blind execution under low-code, non-expert user conditions. Experimental results show that state-of-the-art agents, despite impressive language, coding, and multimodal competencies, fail to robustly elicit intent, clarify underspecified requirements, or use feedback beyond local patching strategies. This benchmark lays the foundation for systematic research into interactive, intent-aligned software generation agents and delineates concrete challenges for future MLLM and agent architecture development (2604.27419).