WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch (2505.03733v1)

Published 6 May 2025 in cs.CL

Abstract: LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model.

Summary

Evaluation of LLMs in Web Application Generation: Insights from WebGen-Bench

The paper presents WebGen-Bench, a benchmark devised to evaluate the capability of LLM-based agents to autonomously generate interactive and functional websites from scratch. As researchers continue to explore the scope and efficacy of LLMs beyond traditional text-generation tasks, such a benchmark is crucial for assessing their potential in complex software development tasks such as web application generation.

WebGen-Bench stands out for its comprehensive approach to evaluating LLMs' performance in generating multi-file website codebases. It synthesizes diverse website-generation instructions, ensures broad coverage of web application categories, and evaluates models on their ability to meet both functional and aesthetic requirements. The benchmark introduces a rigorous testing protocol of 647 test cases, each crafted to verify the fulfillment of specific user requirements and design preferences encoded in an instruction.
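To make that format concrete, here is a minimal sketch of how a single test case could be represented. The field names and example values are illustrative assumptions, not the benchmark's published schema; the paper only specifies that each case pairs an operation with an expected result.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    # Which website-generation instruction this case targets
    # (illustrative field; the benchmark's actual schema may differ).
    instruction_id: str
    # The operation the testing agent performs on the generated site.
    operation: str
    # The response the agent should observe if the site works correctly.
    expected_result: str

# A hypothetical case for an e-commerce instruction:
case = TestCase(
    instruction_id="shop-001",
    operation="Click the 'Add to Cart' button on the first product",
    expected_result="The cart counter in the header increases to 1",
)
```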

The authors have implemented systematic data curation and evaluation pipelines, leveraging both human expertise and advanced AI models like GPT-4o, to ensure accuracy and relevance in task representation. This ensures that the instructions cover nearly all significant types of web applications, supporting a comprehensive evaluation framework.
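A rough sketch of that curation loop is below. `draft_test_cases`, `human_review`, and `curate` are hypothetical names standing in for the GPT-4o drafting step and the manual filter/adjust pass; the paper does not publish its pipeline code.

```python
from typing import List

# Reuses the TestCase sketch above; the stubs raise because the real
# steps are a prompted model call and a manual annotation pass.

def draft_test_cases(instruction: str) -> List["TestCase"]:
    """Stand-in for the GPT-4o step that drafts candidate test cases
    targeting each functionality described in an instruction."""
    raise NotImplementedError("prompted model call, not specified here")

def human_review(candidates: List["TestCase"]) -> List["TestCase"]:
    """Stand-in for the manual pass that filters, adjusts, and organizes
    drafted cases to ensure accuracy."""
    raise NotImplementedError("manual annotation step, not code")

def curate(instructions: List[str]) -> List["TestCase"]:
    curated: List["TestCase"] = []
    for instruction in instructions:
        curated.extend(human_review(draft_test_cases(instruction)))
    return curated  # the paper's pipeline yields 647 cases in total
```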

The evaluation of three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, with a range of proprietary and open-source LLM engines yields sobering results. The best combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, underscoring the stringency of WebGen-Bench and the ample room for improvement in LLM-based web code generation. The authors also construct WebGen-Instruct, a training set of 6,667 curated website-generation instructions; fine-tuning Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of it raises accuracy to 38.2%, surpassing the best proprietary model.
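For clarity on what the headline metric means: accuracy is the fraction of test cases whose observed response matches the expected result. The sketch below assumes a simple binary pass/fail judgment per case, which may be coarser than the paper's actual grading.

```python
def accuracy(passed: list[bool]) -> float:
    """Fraction of test cases judged as passed (assumes binary grading)."""
    return sum(passed) / len(passed) if passed else 0.0

# Illustration only: 180 of 647 cases passing yields the reported ~27.8%.
print(f"{accuracy([True] * 180 + [False] * 467):.1%}")  # 27.8%
```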

WebGen-Bench measures not only functionality but also design quality, within an evaluation framework built on the WebVoyager web-navigation agent. This dual focus draws attention to the importance of aesthetic and responsive design in website generation, challenging LLM-based agents to go beyond bare functional correctness.
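For illustration only, a design grade of this kind is often an average over rubric dimensions; the dimension names and 1-5 scale below are assumptions for the sketch, not the paper's published rubric.

```python
from statistics import mean

def design_score(dimension_scores: dict[str, float]) -> float:
    """Average per-dimension grades (assumed to be on a 1-5 scale)."""
    return mean(dimension_scores.values())

# Hypothetical grades for one generated site:
print(round(design_score({"layout": 4, "color_scheme": 3, "polish": 4}), 2))  # 3.67
```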

From a theoretical and practical standpoint, WebGen-Bench opens avenues for diversifying the application scope of LLMs. It encourages future research to focus on enhancing these models' autonomous planning, organizational abilities, and their competence in balancing complex requirements—traits essential for real-world software engineering applications. As the benchmark sets a high bar, it invites exploration into effective post-training and reinforcement learning techniques that could better align LLMs' computational efficiency with task complexity.

WebGen-Bench's implications extend beyond LLM capabilities, suggesting pivotal roles for AI in democratizing software development, particularly for users with minimal programming knowledge. As such frameworks advance, stakeholders must also consider ethical development practices to prevent misuse and ensure equitable access. The benchmark thus serves not only as an evaluative tool but also as a catalyst in steering AI development toward adaptive, user-centric technological solutions in the field of web application development. This paper signifies a critical step towards realizing the burgeoning potential of AI in generating sophisticated, interactive digital experiences.
