
WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics

Published 5 Jan 2026 in cs.SE and cs.AI | (2601.02430v1)

Abstract: Web applications (web apps) have become a key arena for LLMs to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics that do not rely on ground-truth implementations or test cases, and interpretable evaluation results. To address these challenges, we introduce WebCoderBench, the first real-world-collected, generalizable, and interpretable benchmark for web app generation. WebCoderBench comprises 1,572 real user requirements, covering diverse modalities and expression styles that reflect realistic user intentions. WebCoderBench provides 24 fine-grained evaluation metrics across 9 perspectives, combining rule-based and LLM-as-a-judge paradigms for fully automated, objective, and general evaluation. Moreover, WebCoderBench adopts human-preference-aligned weights over metrics to yield interpretable overall scores. Experiments across 12 representative LLMs and 2 LLM-based agents show that no single model dominates across all evaluation metrics, offering an opportunity for LLM developers to optimize their models in a targeted manner.

Summary

  • The paper introduces a benchmark that leverages real-world, multi-turn user requirements to assess LLM-generated web apps.
  • It employs a robust framework with 24 metrics across general and alignment quality, incorporating user-driven weights via the Borda Count.
  • Evaluation results from 12 LLMs reveal significant performance variations, offering actionable insights for targeted model enhancements.

WebCoderBench: Benchmarking Web Application Generation

Web applications serve as a pivotal domain for demonstrating the capabilities of LLMs in code generation. The paper "WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics" (2601.02430) introduces WebCoderBench, a benchmark designed to address the challenges of authenticity, generality, and interpretability in evaluating LLMs' ability to generate web applications. This benchmark offers a substantial dataset derived from real-world user requirements, coupled with a robust evaluation framework comprising 24 metrics across various quality perspectives.

Dataset Construction

WebCoderBench emphasizes authenticity by sourcing 1,572 user requirements from an industrial partner's online service. This dataset spans multiple modalities, from textual descriptions to images and URLs, capturing the diverse expression styles of real users. The data collection process (Figure 1) involved merging multi-turn requirements into coherent single-turn inputs, ensuring comprehensibility while anonymizing sensitive content. The curated dataset was further enriched with ground-truth checklists covering functionality, visual design, and content, each validated by human experts (Figure 2).

Figure 1: The dataset construction process of WebCoderBench.
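To make the checklist idea concrete, here is a minimal sketch of how such a ground-truth checklist could be represented; the class and field names and the example items are illustrative assumptions, not artifacts from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    # A single verifiable statement about the requested web app.
    description: str
    satisfied: bool | None = None  # filled in later during evaluation

@dataclass
class GroundTruthChecklist:
    # Items are grouped by the three aspects named in the paper:
    # functionality, visual design, and content.
    functionality: list[ChecklistItem] = field(default_factory=list)
    visual_design: list[ChecklistItem] = field(default_factory=list)
    content: list[ChecklistItem] = field(default_factory=list)

# Illustrative example for a hypothetical "personal portfolio" requirement.
checklist = GroundTruthChecklist(
    functionality=[ChecklistItem("Clicking a project card opens a detail page")],
    visual_design=[ChecklistItem("The landing page uses a dark color theme")],
    content=[ChecklistItem("The contact section lists an email address")],
)
```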

Evaluation Metrics

WebCoderBench's evaluation framework is structured around two major quality aspects: general quality and alignment quality, further divided into nine perspectives encompassing 24 metrics (Table 1). This comprehensive structure ensures an objective and quantitative assessment of LLM-generated web applications without reliance on ground-truth implementations.

The benchmark employs rule-based methods and the LLM-as-a-judge paradigm to ensure automated evaluation. General quality covers code, visual, content, performance, accessibility, and maintainability aspects, while alignment quality focuses on meeting user-specified functional, visual, and content requirements.

Figure 2: An example user requirement with its corresponding ground-truth checklists.
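As an illustration of how the two paradigms might fit together in an evaluation harness, the following sketch pairs a trivially rule-checkable metric with a judge-scored alignment metric. The function names, the rule, and the judge prompt are assumptions for illustration, not the paper's actual implementation.

```python
from pathlib import Path

def call_judge_llm(prompt: str) -> list[float]:
    """Placeholder for the LLM-as-a-judge call; a real harness would send the
    prompt (plus a rendered screenshot of the app) to a judge model."""
    raise NotImplementedError

def rule_based_metrics(app_dir: str) -> dict[str, float]:
    """Metrics checkable by fixed rules, e.g. whether an entry page exists.
    (Illustrative check only, not the paper's rule set.)"""
    html_files = list(Path(app_dir).glob("**/*.html"))
    return {
        "has_entry_page": 1.0 if any(f.name == "index.html" for f in html_files) else 0.0,
    }

def judge_metrics(checklist: list[str]) -> dict[str, float]:
    """Metrics needing judgment (e.g. requirement alignment), scored by
    prompting a judge model against the ground-truth checklist."""
    prompt = ("Score each checklist item for the generated web app from 0 to 1:\n"
              + "\n".join(f"- {item}" for item in checklist))
    scores = call_judge_llm(prompt)
    return {"functional_alignment": sum(scores) / max(len(scores), 1)}

def evaluate(app_dir: str, checklist: list[str]) -> dict[str, float]:
    # Merge both paradigms into one flat dictionary of per-metric scores.
    return {**rule_based_metrics(app_dir), **judge_metrics(checklist)}
```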

Weight Assignment and User Preferences

To reflect human preferences accurately, WebCoderBench integrates preference-aligned weights into its evaluation. This process involved a company-wide survey, leveraging the Borda Count method to calculate weights for each metric based on user importance rankings (Figure 3). This ensures the benchmark's overall score aligns with real-world user preferences, providing actionable insights into user priorities.

Figure 3: The weight assignment and evaluation workflow of WebCoderBench.
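A brief sketch of how Borda Count rankings could be converted into metric weights and combined into a weighted overall score; the respondents, metric names, and scores below are invented for illustration and the scoring convention (n - 1 - rank points) is one common Borda variant, not necessarily the paper's exact formulation.

```python
def borda_weights(rankings: list[list[str]]) -> dict[str, float]:
    """Turn per-respondent rankings (most to least important) into normalized
    weights via the Borda Count: a metric ranked r-th out of n gets n-1-r points,
    summed over respondents and then normalized to sum to 1."""
    points: dict[str, float] = {}
    for ranking in rankings:
        n = len(ranking)
        for rank, metric in enumerate(ranking):
            points[metric] = points.get(metric, 0.0) + (n - 1 - rank)
    total = sum(points.values())
    return {metric: p / total for metric, p in points.items()}

def overall_score(metric_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-metric scores, each assumed normalized to [0, 1]."""
    return sum(weights[m] * metric_scores.get(m, 0.0) for m in weights)

# Illustrative survey: three respondents ranking three metrics.
rankings = [
    ["functional_alignment", "visual_quality", "code_quality"],
    ["functional_alignment", "code_quality", "visual_quality"],
    ["visual_quality", "functional_alignment", "code_quality"],
]
weights = borda_weights(rankings)
print(overall_score(
    {"functional_alignment": 0.8, "visual_quality": 0.6, "code_quality": 0.9},
    weights,
))
```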

Evaluation Results and Analysis

Experiments conducted with 12 LLMs and 2 LLM-based agents showcased WebCoderBench's efficacy in providing interpretable evaluations (Figure 4). The results indicate significant performance variance across models with no single model leading in all metrics, highlighting opportunities for targeted improvements. The analysis revealed a narrowing gap between open-source and closed-source LLMs and emphasized the rapid pace of model evolution, with newer models consistently outperforming older ones.

Figure 4: The weight proportion of each perspective and each evaluation metric.

Notably, the evaluation shows that individual models excel in particular areas but none performs well across all metrics, reflecting a specialization trend among current LLMs. User preferences play a crucial role in shaping the overall scores, offering developers clear directions for model enhancement.

Conclusion

WebCoderBench provides a robust, interpretable benchmarking framework for LLM-generated web applications, with authenticity and user-centric evaluation at its core. By actively addressing user needs and evolving with the industry, this benchmark sets a new standard for assessing LLM capabilities in web application generation, offering a powerful tool for developers aiming to optimize and refine their models.

Figure 5: The detailed raw scores of 24 evaluation metrics for each LLM and LLM-based agent, with the x-axis indices denoting the IDs of evaluation metrics (corresponding to Table 1).
