WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning

Published 26 Sep 2025 in cs.CL and cs.AI | (arXiv:2509.22644v1)

Abstract: Agent systems powered by LLMs have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive and multi-level visual feedback to iteratively generate and refine the website codebase. Detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites are generated by a visual LLM (VLM), together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce *Step-GRPO with Screenshot and GUI-agent Feedback* to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model's website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our Step-GRPO training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and raises the appearance score from 3.4 to 3.7.

Summary

  • The paper introduces WebGen-Agent, which integrates multi-level feedback and step-level RL to enhance website generation from natural language instructions.
  • It employs a novel Step-GRPO paradigm that leverages dense rewards from screenshot assessments and GUI-agent testing to improve both accuracy and visual appeal.
  • Empirical results demonstrate significant performance gains for both proprietary and open-source models, with robust error recovery and optimal state selection mechanisms.

WebGen-Agent: Multi-Level Feedback and Step-Level RL for Interactive Website Generation

Introduction

WebGen-Agent introduces a specialized code agent architecture for end-to-end website generation from natural language instructions, targeting both functional correctness and visual quality. The system addresses the limitations of prior code agents, which predominantly rely on code execution feedback and thus fail to capture visual and interactive deficiencies in generated web applications. WebGen-Agent integrates multi-level feedback—specifically, screenshot-based visual assessment and GUI-agent-based functional testing—into an iterative code generation and refinement loop. Furthermore, the paper proposes a novel Step-GRPO (Group Relative Policy Optimization) training paradigm that leverages dense, step-level rewards derived from these feedback signals, enabling effective reinforcement learning for open-source LLMs in this domain.

WebGen-Agent Workflow

The WebGen-Agent workflow is an iterative, multi-stage process, where each step consists of code generation, code execution, and feedback gathering. The agent receives a natural language website specification and maintains a trajectory of edits, execution outputs, and feedback. The core workflow is as follows:

  1. Code Generation: The agent, powered by a coding LLM, generates code edits to the current codebase.
  2. Code Execution: The codebase is executed in a sandboxed environment. If errors occur, the agent receives the error output and attempts to fix the issue in the next step. After five consecutive errors, a backtracking mechanism restores the codebase to the last known good state.
  3. Screenshot Feedback: Upon successful execution, a screenshot of the landing page is captured. A visual LLM (VLM) provides a structured description, an appearance score, and improvement suggestions.
  4. GUI-Agent Testing: If the appearance is deemed satisfactory, a GUI-agent is triggered to test the website’s interactive functionalities, guided by a synthesized instruction that covers all requirements. The GUI-agent returns a functional score and suggestions.
  5. Best-Step Selection: At the end of the trajectory, the codebase state with the highest combined functional and appearance scores is selected as the final output.

    Figure 1: Iterative website generation with screenshot- and GUI-agent-based feedback. A backtracking and best-step-selection mechanism is applied on the basis of the screenshot and GUI-agent testing scores.

This design ensures that both visual and functional requirements are enforced throughout the generation process, with explicit mechanisms to recover from regressions and select optimal intermediate states.
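The iterative loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper callables (`generate_edit`, `run_codebase`, `score_screenshot`, `score_gui_agent`) are hypothetical stand-ins for the coding LLM, the sandbox, and the feedback VLM, and the appearance threshold gating GUI-agent testing is an assumed parameter.

```python
# Hedged sketch of the WebGen-Agent loop; helper interfaces are assumptions.
MAX_CONSECUTIVE_ERRORS = 5  # backtracking threshold described in the paper

def webgen_agent_loop(spec, generate_edit, run_codebase,
                      score_screenshot, score_gui_agent,
                      max_steps=10, appearance_threshold=3.0):
    codebase = {}                          # path -> source of current website
    last_good = dict(codebase)             # last state that executed cleanly
    best_score, best_state = float("-inf"), dict(codebase)
    consecutive_errors = 0

    for _ in range(max_steps):
        codebase = generate_edit(spec, codebase)     # 1. code generation
        ok, _output = run_codebase(codebase)         # 2. sandboxed execution
        if not ok:
            consecutive_errors += 1
            if consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
                codebase = dict(last_good)           # backtrack to good state
                consecutive_errors = 0
            continue
        consecutive_errors = 0
        last_good = dict(codebase)

        appearance = score_screenshot(codebase)      # 3. screenshot feedback
        functional = 0.0
        if appearance >= appearance_threshold:       # 4. GUI-agent testing
            functional = score_gui_agent(codebase)

        combined = appearance + functional
        if combined > best_score:                    # 5. best-step selection
            best_score, best_state = combined, dict(codebase)

    return best_state
```

Injecting the step functions keeps the sketch testable and mirrors the paper's decoupling of the coding LLM from the feedback VLM.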

Step-GRPO: Step-Level Reinforcement Learning with Multi-Level Feedback

To address the performance gap between proprietary and open-source LLMs in website code generation, the paper introduces Step-GRPO, a reinforcement learning framework that utilizes dense, step-level rewards derived from the WebGen-Agent workflow. The key innovations are:

  • Reward Signal: At each step, the sum of the screenshot appearance score and the GUI-agent functional score is used as the immediate reward.
  • Advantage Computation: Unlike naive outcome-based RL, Step-GRPO standardizes the immediate reward at each step, providing fine-grained supervision.
  • Training Pipeline: Models are first warm-started with supervised fine-tuning (SFT) on high-quality agent trajectories, then further optimized with Step-GRPO using a relatively small but high-quality dataset.

    Figure 2: Step-GRPO with Screenshot and GUI-agent Feedback. Multiple WebGen-Agent trajectories are produced, and the reward for each step is computed by summing the screenshot score and the GUI-agent score.

This approach enables open-source models (e.g., Qwen2.5-Coder-7B-Instruct, Qwen3-8B) to achieve substantial improvements in both accuracy and visual quality, as measured on the WebGen-Bench benchmark.
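The step-level advantage computation described above can be sketched roughly as follows. Group-wise standardization per step index is an assumption here; the paper's exact normalization may differ, and equal-length trajectories are assumed for simplicity.

```python
import statistics

def step_grpo_advantages(step_rewards, eps=1e-6):
    """step_rewards[i][t] = screenshot score + GUI-agent score at step t of
    trajectory i in the sampled group. Standardize rewards across the group
    at each step index to obtain dense, per-step advantages (a sketch, not
    the paper's exact formulation)."""
    n_steps = len(step_rewards[0])
    advantages = [[0.0] * n_steps for _ in step_rewards]
    for t in range(n_steps):
        group = [traj[t] for traj in step_rewards]       # rewards at step t
        mean = statistics.mean(group)
        std = statistics.pstdev(group)
        for i, traj in enumerate(step_rewards):
            advantages[i][t] = (traj[t] - mean) / (std + eps)
    return advantages
```

Standardizing within the group at each step, rather than over whole-trajectory outcomes, is what yields the fine-grained process supervision signal the paper emphasizes.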

Empirical Results

WebGen-Agent is evaluated on WebGen-Bench, a comprehensive benchmark for interactive website generation. The system is tested with both proprietary and open-source LLMs as the coding engine, and with Qwen2.5-VL-32B-Instruct as the feedback VLM. Key findings include:

  • Performance Gains: WebGen-Agent with proprietary models (e.g., Claude-3.5-Sonnet) achieves 51.9% accuracy and a 3.9 appearance score, compared to 26.4% and 3.0 for Bolt.diy, the previous SOTA.
  • Open-Source Model Improvements: Step-GRPO training increases Qwen2.5-Coder-7B-Instruct accuracy from 12.4% (raw) to 45.4% (Step-GRPO), and appearance score from 1.6 to 3.7.
  • Ablation Studies: Each component—screenshot feedback, GUI-agent testing, backtracking, and best-step selection—contributes to performance. The largest accuracy gain is attributed to GUI-agent testing, while screenshot feedback most improves appearance.
  • Feedback VLM Efficiency: Using a small, open-source VLM for feedback is nearly as effective as using a proprietary VLM, with minimal impact on accuracy or appearance scores.

    Figure 3: Comparison of the average file count and average line count among the original, SFT, and Step-GRPO models for Qwen2.5-Coder-7B-Instruct and Qwen3-8B.

    Figure 4: Accuracy (%) and Appearance Score as a function of the maximum number of iterations.

Qualitative Analysis

Qualitative examples demonstrate that supervised fine-tuning and Step-GRPO training progressively reduce malformed outputs and improve adherence to both appearance and functional requirements. The agent is shown to iteratively refine websites, addressing both visual and interactive deficiencies based on structured feedback.

Figure 5: Screenshots of websites created by Qwen2.5-Coder-7B-Instruct, WebGenAgent-LM-7B-SFT, and WebGenAgent-LM-7B-Step-GRPO.

Figure 6: Screenshots of websites created by Qwen3-8B, WebGenAgent-LM-8B-SFT, and WebGenAgent-LM-8B-Step-GRPO.

Implementation Considerations

  • Agent Architecture: The decoupling of the coding LLM and feedback VLM is critical for cost efficiency and scalability. The feedback VLM can be a smaller, open-source model, while the coding LLM should be as strong as possible within resource constraints.
  • Backtracking and Best-Step Selection: These mechanisms are essential for robustness, preventing error accumulation and ensuring that the final output is the best observed during the trajectory.
  • Training Data: High-quality, step-annotated trajectories are required for effective SFT and Step-GRPO. The paper demonstrates that a relatively small number of such trajectories suffice due to the dense supervision.
  • Resource Requirements: Step-GRPO training for 7B/8B models requires significant GPU resources (e.g., >24 hours on 16 A800 GPUs), and scaling to larger models is currently limited by hardware availability.
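The decoupled architecture described in the first bullet might be captured in a configuration object like the following. All field names and defaults here are hypothetical illustrations, not the paper's actual code; only the backtracking threshold of five consecutive errors is stated in the text.

```python
from dataclasses import dataclass

@dataclass
class WebGenAgentConfig:
    """Hypothetical configuration illustrating the coding-LLM / feedback-VLM
    decoupling; field names and defaults are assumptions for this sketch."""
    coding_model: str = "claude-3-5-sonnet"        # strongest affordable coder
    feedback_vlm: str = "Qwen2.5-VL-32B-Instruct"  # smaller open-source scorer
    max_steps: int = 10                 # trajectory length budget (assumed)
    max_consecutive_errors: int = 5     # backtracking threshold from the paper
    appearance_threshold: float = 3.0   # gate before GUI-agent testing (assumed)
```

Keeping the two model choices as independent fields makes the cost/quality trade-off explicit: the feedback VLM can be downgraded with little loss, while the coding model dominates final quality.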

Limitations and Future Directions

  • Scalability: Step-GRPO training is currently demonstrated only on 7B/8B models due to computational constraints. Scaling to 30B–72B models is a natural next step.
  • Evaluation Scope: The current evaluation does not consider website response speed or network conditions.
  • Generalization: While the system is tailored for website generation, the multi-level feedback and step-level RL paradigm could be extended to other domains requiring both functional and visual/interactive quality.

Implications and Future Developments

WebGen-Agent establishes a new paradigm for code agents in domains where both visual and interactive quality are critical. The integration of multi-level feedback and dense, step-level RL supervision enables open-source models to close the gap with proprietary LLMs. This approach is likely to generalize to other agentic tasks involving multimodal outputs and complex user requirements. Future work should focus on scaling the approach to larger models, extending the feedback modalities, and exploring applications beyond website generation.

Conclusion

WebGen-Agent demonstrates that integrating screenshot and GUI-agent feedback into an iterative code agent workflow, combined with step-level RL via Step-GRPO, yields substantial improvements in both the functional and visual quality of generated websites. The approach is effective across both proprietary and open-source LLMs, with strong empirical results and robust ablation analyses. The methodology provides a blueprint for future agentic systems in domains requiring multimodal quality assurance and dense process supervision.
