MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Published 16 Apr 2026 in cs.CV, cs.AI, and cs.CL | (2604.15309v1)

Abstract: The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.


Authors (15)

Summary

The paper introduces a hierarchical agentic framework that decomposes webpage generation into global layout planning and local multimodal asset synthesis.
It employs multi-level reflection—local, context, and global—to refine asset fidelity and layout coherence, achieving a mean score of 0.75 on MM-WebGEN-Bench.
Ablation studies validate that both hierarchical planning and reflection significantly boost multimodal integration and overall aesthetic quality.

MM-WebAgent: Hierarchical Multimodal Agentic Framework for Webpage Generation

Motivation and Background

Automated webpage generation with LLMs has historically focused on the synthesis of HTML/CSS code from natural-language prompts, optimizing for text and structural correctness. However, functional websites are inherently multimodal, requiring native integration of images, videos, and charts with coherent global layouts and styling. Existing solutions—retrieval-based augmentation and rudimentary asset insertion—fail to coordinate element semantics, layout geometry, and visual style, resulting in incoherent designs and suboptimal user experiences. MM-WebAgent addresses these challenges by operationalizing multimodal webpage generation as a hierarchical agentic process: it structures design as iterative planning and refinement at multiple abstraction levels, systematically coordinating global and local decisions.

Framework Design

MM-WebAgent implements a four-stage workflow: hierarchical task planning, hierarchical multimodal generation, multi-level evaluation, and iterative hierarchical reflection. The planning stage decomposes the design prompt into a global layout plan and distinct local element plans for multimodal assets, explicitly defining section hierarchy, spatial constraints, and style attributes. Local element plans encode contextual role and modality-specific attributes, enabling precise invocation of AIGC tools for asset generation.
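To make the split between the global layout plan and the local element plans concrete, here is a minimal sketch of how the two plan types could be represented. The class names, field names, and the `problems` check are hypothetical illustrations; the summary does not specify the paper's actual plan schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Modalities named in the text; everything else here is a hypothetical schema.
MODALITIES = {"image", "video", "chart"}


@dataclass
class LocalElementPlan:
    element_id: str
    modality: str                     # one of MODALITIES
    section: str                      # section of the global layout that hosts it
    role: str                         # contextual role, e.g. "hero banner"
    attributes: Dict[str, str] = field(default_factory=dict)  # modality-specific


@dataclass
class GlobalLayoutPlan:
    sections: List[str]               # section hierarchy, flattened top to bottom
    style: Dict[str, str]             # page-wide style attributes
    elements: List[LocalElementPlan] = field(default_factory=list)

    def elements_in(self, section: str) -> List[LocalElementPlan]:
        """Return the local plans embedded in one section."""
        return [e for e in self.elements if e.section == section]

    def problems(self) -> List[str]:
        """List plans that point at an unknown section or modality."""
        found = []
        for e in self.elements:
            if e.section not in self.sections:
                found.append(f"{e.element_id}: unknown section {e.section!r}")
            if e.modality not in MODALITIES:
                found.append(f"{e.element_id}: unknown modality {e.modality!r}")
        return found


plan = GlobalLayoutPlan(
    sections=["hero", "features", "stats"],
    style={"palette": "warm", "font": "serif"},
    elements=[
        LocalElementPlan("img-1", "image", "hero", "hero banner", {"aspect": "16:9"}),
        LocalElementPlan("chart-1", "chart", "stats", "growth summary", {"kind": "bar"}),
    ],
)
```

Keeping the page-wide style on the global plan and only modality-specific attributes on each local plan is one way to let every asset generator see the same style context, which is the coordination problem the framework targets.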

Figure 1: MM-WebAgent framework overview, detailing task planning, hierarchical generation, multi-level evaluation, and iterative reflection.

MM-WebAgent executes local generation in parallel, integrating each asset into the global layout under explicit guidance. The iterative hierarchical reflection phase further refines outcomes in three layers: (i) local reflection enhances individual asset fidelity, (ii) context reflection resolves embedding and layout issues, and (iii) global reflection enforces cross-section coherence and visual balance.
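The control flow described above (parallel local generation followed by local, context, and global reflection) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `generate_asset` is a stand-in for an AIGC tool call, the critics are toy functions, and the stopping rule (stop when a full round changes nothing, or after `max_rounds`) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Page = Dict[str, object]
Critic = Callable[[Page], Tuple[Page, bool]]  # returns (page, changed?)


def generate_asset(plan: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for an AIGC tool call; a real system would call a model here."""
    return {"id": plan["id"], "content": f"{plan['modality']} for {plan['role']}"}


def generate_all(plans: List[Dict[str, str]], max_workers: int = 4) -> List[Dict[str, str]]:
    """Generate every local asset concurrently; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_asset, plans))


def reflect(page: Page, critics: Dict[str, Critic], max_rounds: int = 3) -> Tuple[Page, int]:
    """Run local, context, then global critics until a round changes nothing."""
    for round_no in range(1, max_rounds + 1):
        changed = False
        for level in ("local", "context", "global"):
            page, did_change = critics[level](page)
            changed = changed or did_change
        if not changed:
            return page, round_no
    return page, max_rounds


def make_critic(level: str) -> Critic:
    """Toy critic: resolves one pending issue tagged with its level per call."""
    def critic(page: Page) -> Tuple[Page, bool]:
        issues = list(page["issues"])
        if level in issues:
            issues.remove(level)
            return {**page, "issues": issues}, True
        return page, False
    return critic


plans = [
    {"id": "img-1", "modality": "image", "role": "hero banner"},
    {"id": "chart-1", "modality": "chart", "role": "growth summary"},
]
page = {"assets": generate_all(plans), "issues": ["local", "global", "local"]}
critics = {lvl: make_critic(lvl) for lvl in ("local", "context", "global")}
final_page, rounds = reflect(page, critics)
```

Running the critics in a fixed local, context, global order means asset-level fixes land before embedding and page-level checks see the page, which matches the ordering in the text; whether the actual system interleaves them differently is not stated here.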

Figure 2: Hierarchical reflection process, exemplifying MM-WebAgent's iterative asset and layout refinement across local, context, and global levels.

Benchmarking: MM-WebGEN-Bench

To address the lack of rigorous benchmarks for multimodal web generation, MM-WebGEN-Bench is introduced. It spans 120 high-quality, diverse webpages, systematically controlled for layout complexity, visual style, semantic intent, and multimodal composition. The benchmark integrates robust filtering pipelines—automatic format validation plus manual curation—and comprehensively covers 11 scene categories and 11 visual styles, with a range of asset types across modalities (images, videos, charts).

Figure 3: MM-WebGEN-Bench construction pipeline and diversity statistics, illustrating controlled data generation and manual quality filtering.

Evaluation in MM-WebGEN-Bench is multi-level: global metrics include layout correctness, style coherence, and aesthetics; local metrics assess asset quality and integration for each embedded multimodal component. Penalty-based and graded scoring protocols yield fine-grained and quantitative assessments for both holistic and element-wise webpage quality.
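As a rough illustration of how penalty-based and graded scores might be combined into a single page score, consider the sketch below. The specifics are assumptions made for the example: the penalty values, the 0 to 5 grading scale, the axis names, and the unweighted mean are all hypothetical, since the summary does not give the benchmark's actual rubric or weights.

```python
from statistics import mean
from typing import Dict, List, Optional


def penalty_score(penalties: List[float]) -> float:
    """Start from a perfect 1.0 and subtract one penalty per detected defect."""
    return max(0.0, 1.0 - sum(penalties))


def graded_score(grades: List[int], max_grade: int = 5) -> Optional[float]:
    """Average rubric grades and rescale to [0, 1]; None if no such element exists."""
    if not grades:
        return None
    return mean(grades) / max_grade


def page_score(axes: Dict[str, Optional[float]]) -> float:
    """Unweighted mean over the axes that apply to this page (assumed aggregation)."""
    present = [s for s in axes.values() if s is not None]
    return mean(present)


axes = {
    "layout": penalty_score([0.1, 0.1]),   # two hypothetical layout defects
    "style": penalty_score([]),            # no style defects found
    "image": graded_score([4, 5]),         # two images, graded on a 0-5 scale
    "video": graded_score([]),             # no video on this page
    "chart": graded_score([3]),
}
overall = page_score(axes)
```

Returning `None` for a modality that does not appear on the page keeps absent elements from dragging the mean down; `page_score` raises `statistics.StatisticsError` if no axis applies at all.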

Strong Numerical Results and Ablations

MM-WebAgent outperforms the baselines on every evaluation axis. On MM-WebGEN-Bench, it achieves a mean score of 0.75, ahead of both code-generation (HTML/CSS) and code-centric agent baselines, with particularly strong gains on multimodal element integration. Notably, code-only pipelines augmented with AIGC tool access yield only marginal improvements, whereas the hierarchical agentic planning and multi-level reflection of MM-WebAgent unlock substantial quantitative gains. This contrast suggests that the gains come from the coordination framework rather than from tool access alone.

Results highlight pronounced improvements in image, video, and chart metrics, validating the effectiveness of context-aware planning and joint global-local optimization.

Ablation studies confirm:

Hierarchical planning is critical for multimodal content coordination, yielding significant boosts on local metrics.
Hierarchical reflection yields complementary improvements; local reflection optimizes asset-level fidelity, whereas global reflection strengthens layout and visual coherence.
Most refinement gains occur within the first few reflection rounds, indicating efficient convergence.
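The convergence observation in the last point suggests a simple budget rule: stop reflecting once a round no longer improves the score by a meaningful margin. The sketch below shows one way to read that off a score trace. The function name, the `min_gain` threshold, and the example trace are hypothetical and are not taken from the paper.

```python
from typing import List


def rounds_needed(scores: List[float], min_gain: float = 0.01) -> int:
    """Return how many reflection rounds are worth keeping.

    scores[0] is the score before reflection and scores[k] the score after
    round k. Stop at the first round whose gain falls below min_gain.
    """
    for k in range(1, len(scores)):
        if scores[k] - scores[k - 1] < min_gain:
            return k - 1
    return len(scores) - 1


# Hypothetical trace, not from the paper: large early gains, then a plateau.
trace = [0.60, 0.70, 0.74, 0.745, 0.746]
kept = rounds_needed(trace)
```

With a trace shaped like this one, the rule keeps the early rounds that carry most of the improvement and drops the plateau, which is the practical consequence of fast convergence.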

Qualitative Assessment and User Study

Qualitative comparisons on MM-WebGEN-Bench show MM-WebAgent consistently producing webpages with coherent spatial organization, consistent style, and well-integrated multimodal content, unlike baselines which frequently miss alignment and visual consistency.

Figure 4: Visual comparison of MM-WebAgent versus baseline methods, highlighting improvements in layout and multimodal integration.

A user study with 50 experienced annotators yields a winning rate of 78.99% for MM-WebAgent, corroborating the automatic evaluations with human preferences on coherence, attractiveness of multimodal assets, and overall aesthetic appeal.
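For readers unfamiliar with the metric, a winning rate is typically the share of pairwise judgments in which a system is preferred. The sketch below assumes that definition and uses made-up votes; the summary does not state how ties or per-question aggregation were handled in the actual study.

```python
from collections import Counter
from typing import Iterable


def winning_rate(votes: Iterable[str], system: str) -> float:
    """Share of pairwise judgments in which `system` was preferred."""
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes recorded")
    return counts[system] / total


# Hypothetical votes, not the study's data.
votes = ["ours", "ours", "baseline", "ours", "baseline", "ours", "ours", "ours"]
rate = winning_rate(votes, "ours")
```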

Figure 5: Representative survey evaluation questions on coherence, aesthetics, and chart readability.

Computational Considerations

While MM-WebAgent involves higher computational costs and longer average runtimes than code-only agents (average execution time: 155.8s), these reflect the intrinsic complexity of multimodal generation. Content generation is parallelized across modalities, and the cost overhead is attributable to native asset synthesis and iterative reflection—not redundant computation. Continued advances in multimodal model efficiency and open-source alternatives will likely reduce these overheads.

Practical and Theoretical Implications

Practically, MM-WebAgent establishes a principled agentic paradigm for multimodal webpage generation, decoupling global layout and local asset planning while systematically coordinating their integration via modular reflection. Theoretically, this architecture extends agent abstraction from code-centric orchestration to design abstraction, paving the way for more complex multimodal composition and structured cross-modal reasoning in generative AI.

The explicit hierarchical planning and multi-level self-reflection mechanisms suggest promising avenues for sequential decision optimization, possibly via reinforcement learning or meta-learning, to further enhance agent performance in dynamic tool selection and long-term interaction optimization. Future developments may broaden applicability to interactive UI/UX generation, procedural asset creation, and real-time adaptation.

Conclusion

MM-WebAgent advances multimodal webpage generation by operationalizing hierarchical plan-and-refine workflows, native integration of diverse assets, and iterative joint optimization at all abstraction levels. Rigorous benchmarking with MM-WebGEN-Bench demonstrates consistent superiority over conventional code-generation pipelines and code-centric agents, both in quantitative metrics and human preferences. Its agentic design principles offer robust foundations for the evolution of multimodal generative systems and downstream applications in design automation and agentic creativity.
