Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping (2410.16232v1)

Published 21 Oct 2024 in cs.CL and cs.AI

Abstract: Sketches are a natural and accessible medium for UI designers to conceptualize early-stage ideas. However, existing research on UI/UX automation often requires high-fidelity inputs like Figma designs or detailed screenshots, limiting accessibility and impeding efficient design iteration. To bridge this gap, we introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision-Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes. Beyond end-to-end benchmarking, Sketch2Code supports interactive agent evaluation that mimics real-world design workflows, where a VLM-based agent iteratively refines its generations by communicating with a simulated user, either passively receiving feedback instructions or proactively asking clarification questions. We comprehensively analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs; even the most capable models struggle to accurately interpret sketches and formulate effective questions that lead to steady improvement. Nevertheless, a user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception, highlighting the need to develop more effective paradigms for multi-turn conversational agents.

Summary

  • The paper introduces Sketch2Code, a novel benchmark assessing VLMs' conversion of low-fidelity sketches into coherent webpage prototypes using multi-turn interactions.
  • It details a curated dataset of 731 sketches from 484 real webpages, evaluating both single-turn and multi-turn scenarios with layout and visual similarity metrics.
  • Results reveal that commercial models outperform open-source ones, underscoring the need for enhanced proactive interaction in AI-driven design automation.

An Overview of "Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping"

The paper "Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping" addresses the challenge of converting low-fidelity sketches into webpage prototypes using state-of-the-art Vision-Language Models (VLMs). It presents Sketch2Code, a novel benchmark framework designed to evaluate VLMs' capabilities in automating this conversion process through interactive prototyping.

Approach and Framework

The authors introduce Sketch2Code to bridge the accessibility gap between initial design concepts and high-fidelity UI implementations. The framework evaluates VLMs' ability to understand and transform rudimentary sketches into coherent webpage layouts, reflecting real-world design processes. Notably, Sketch2Code supports multi-turn interaction, allowing an agent to iteratively refine its output through exchanges with a simulated user under two scenarios: passive feedback following and proactive question asking.
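
The two interaction scenarios can be pictured as simple loops over a fixed turn budget. The following is a minimal Python sketch; the agent and simulated-user interfaces (generate, revise, ask_question, give_feedback, answer) are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the two interaction scenarios described above.
# All method names are illustrative assumptions, not the paper's code.

def feedback_following(vlm_agent, simulated_user, sketch, max_turns=5):
    """Agent revises its HTML after passively receiving feedback each turn."""
    html = vlm_agent.generate(sketch)                  # initial single-turn attempt
    for _ in range(max_turns):
        feedback = simulated_user.give_feedback(sketch, html)
        html = vlm_agent.revise(sketch, html, feedback)
    return html

def question_asking(vlm_agent, simulated_user, sketch, max_turns=5):
    """Agent proactively asks a clarification question before each revision."""
    html = vlm_agent.generate(sketch)
    for _ in range(max_turns):
        question = vlm_agent.ask_question(sketch, html)
        answer = simulated_user.answer(sketch, question)
        html = vlm_agent.revise(sketch, html, answer)
    return html
```

In both loops the agent's latest HTML is carried forward, so each turn either refines the page in response to feedback or incorporates the answer to a clarification question.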

Dataset and Benchmark Design

A crucial element of Sketch2Code is a curated dataset of 731 sketches derived from 484 real-world webpages. These sketches serve as inputs from which the VLMs generate HTML code. The benchmark assesses models on single-turn direct generation as well as multi-turn interactions, mimicking a realistic design workflow. Evaluation metrics include layout and visual similarity scores, which measure how closely the generated designs match the reference implementations.
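
For intuition, scoring a generated page against its reference can be thought of as comparing two rendered screenshots. The sketch below uses a crude pixel-level cosine similarity as a stand-in; the benchmark's actual layout and visual similarity metrics are more involved than this.

```python
# Crude, illustrative stand-in for screenshot-based scoring: compare two
# rendered page screenshots by cosine similarity of downsampled grayscale
# pixels. This is NOT the benchmark's actual metric, only the general shape
# of a reference-vs-generation comparison.
import numpy as np
from PIL import Image

def screenshot_similarity(generated_png: str, reference_png: str,
                          size: tuple[int, int] = (64, 64)) -> float:
    """Return a similarity score in [0, 1] between two page screenshots."""
    def to_vector(path: str) -> np.ndarray:
        img = Image.open(path).convert("L").resize(size)   # grayscale, fixed size
        return np.asarray(img, dtype=np.float64).ravel()

    a, b = to_vector(generated_png), to_vector(reference_png)
    # cosine similarity of the flattened pixel vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```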

Results and Analysis

The results reveal substantial challenges for VLMs in accurately interpreting sketch inputs, with commercial models significantly outperforming open-source counterparts. While feedback following yielded improved layout similarity, the proactive question-asking scenario exposed limitations in current VLMs' ability to formulate effective questions and incorporate user intent. Despite the difficulty of achieving consistent improvement, UI/UX professionals' preference for the question-asking mode points to a demand for more proactive agent behavior.

Theoretical and Practical Implications

The findings underscore key issues in current human-AI interaction models, particularly the need for VLMs to enhance their interactive capabilities. They point to further research on aligning AI-driven design tools with human expectations in design collaboration, promoting more efficient and iterative design processes. The Sketch2Code framework provides a structured basis for testing and improving multi-modal AI systems in practical applications.

Future Directions

Future research could focus on improving VLMs' handling of multi-turn interactions, whether through refined training methods or a better understanding of design elements in sketches. Developing larger open-source models capable of processing longer contexts could also make real-world deployment more feasible. Moreover, expanding input modalities to include direct manipulation elements could further bridge the gap between AI tools and user needs.

Conclusion

In summary, Sketch2Code offers a sophisticated and practical benchmark for advancing VLM research in design automation. While current models display limitations in interactive design tasks, the framework paves the way for more seamless integration of AI in creative workflows, emphasizing the importance of continual refinement in collaborative AI systems within the UI/UX design domain.
