Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Published 14 Mar 2024 in cs.HC, cs.AI, and cs.CV (arXiv:2403.09029v1)

Abstract: Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into corresponding HTML has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML codes and their corresponding screenshots. We fine-tune a foundational VLM on our dataset and show proficiency in converting webpage screenshots to functional HTML code. To accelerate the research in this area, we open-source WebSight.

References (18)
  1. Tony Beltramelli. Pix2code: Generating code from a graphical user interface screenshot. arXiv preprint arXiv:1705.07962, 2017.
  2. OCR-IDL: OCR annotations for industry document library dataset, 2022.
  3. Patch n’ Pack: NaViT, a vision transformer for any aspect ratio and resolution, 2023.
  4. DataComp: In search of the next generation of multimodal datasets, 2023.
  5. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence, 2024.
  6. CogAgent: A visual language model for GUI agents, 2023.
  7. Mistral 7B, 2023.
  8. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=SKN2hflBIZ.
  9. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 18893–18912. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/lee23g.html.
  10. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
  11. DoRA: Weight-decomposed low-rank adaptation, 2024b.
  12. Reverse engineering mobile application user interfaces with REMAUI (T). 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 248–259, 2015. URL https://api.semanticscholar.org/CorpusID:7499368.
  13. OpenAI et al. GPT-4 technical report, 2023.
  14. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022.
  15. Design2Code: How far are we from automating front-end engineering?, 2024.
  16. Gemini Team et al. Gemini: A family of highly capable multimodal models, 2023.
  17. Sigmoid loss for language image pre-training, 2023.
  18. Multimodal C4: An open, billion-scale corpus of images interleaved with text. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=tOd8rSjcWz.

Summary

  • The paper presents the WebSight dataset, which offers 2M synthetic HTML-screenshot pairs to overcome data noise in traditional web HTML sources.
  • It outlines a novel methodology using LLMs like Mistral-7B-Instruct and Deepseek-Coder-33b to generate clean HTML code from diverse web designs.
  • Fine-tuning with DoRA yields Sightseer, a VLM with significantly improved OCR, spatial understanding, and robustness in converting web layouts into functional code.

Unlocking the Conversion of Web Screenshots into HTML Code with the WebSight Dataset

The paper "Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset" introduces an innovative approach to translating web screenshots into functional HTML code. This paper offers significant insights into the construction and application of a dataset aimed specifically at this task, addressing a gap in the use of vision-language models (VLMs) for web development.

Overview of the WebSight Dataset

The core contribution of the paper is the introduction of WebSight, a synthetic dataset comprising 2 million pairs of HTML codes and their corresponding screenshots. This dataset is designed to advance the capability of VLMs in the field of web development, especially for converting static images of web interfaces into usable code.

The authors identified the lack of a substantial dataset as a primary obstacle preventing VLMs from effectively addressing this translation task. To create WebSight, the authors bypassed the complexity of using existing web HTML files—often burdened with noisy content, scripts, or external dependencies—by synthesizing HTML codes. This synthesis was achieved using LLMs proficient in code generation, thereby ensuring clean and structured data suitable for model training.

Methodology and Data Construction

The methodology involved two critical steps: generating a diverse array of website concepts, then converting each concept into HTML code with a code-specialized LLM. The process began with Mistral-7B-Instruct generating unique website designs, followed by Deepseek-Coder-33b-instruct producing the final HTML code, styled with Tailwind CSS—a utility-first framework facilitating concise and direct styling within HTML documents.
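The two-stage synthesis can be sketched as follows. This is an illustrative reconstruction, not the authors' released pipeline: the prompt wording and the `generate` callback are assumptions standing in for calls to Mistral-7B-Instruct (stage 1) and Deepseek-Coder-33b-instruct (stage 2).

```python
# Sketch of a two-stage concept-then-code generation pipeline.
# Prompts are illustrative; `generate(prompt) -> str` is a stand-in
# for whichever LLM serves each stage.

CONCEPT_PROMPT = (
    "Invent a diverse website concept. Give a one-line theme and the "
    "sections the page should contain."
)

CODE_PROMPT_TEMPLATE = (
    "Write a single self-contained HTML file styled with Tailwind CSS "
    "(utility classes only, no external stylesheet) for this concept:\n{concept}"
)

def build_code_prompt(concept: str) -> str:
    """Fill the stage-2 code-generation prompt with a stage-1 concept."""
    return CODE_PROMPT_TEMPLATE.format(concept=concept)

def synthesize_pair(generate) -> tuple[str, str]:
    """Run both stages with a caller-supplied `generate(prompt) -> str`."""
    concept = generate(CONCEPT_PROMPT)           # stage 1: website idea
    html = generate(build_code_prompt(concept))  # stage 2: Tailwind HTML
    return concept, html
```

Keeping the two stages separate lets a smaller instruction-tuned model handle ideation while a code-specialized model handles the HTML, which is the division of labor the paper describes.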

Key to the dataset's diversity and utility was the inclusion of screenshots captured at various resolutions using Playwright, ensuring a range of image sizes and formats that would enhance model robustness. This meticulous construction resulted in a dataset that could significantly accelerate VLMs' proficiency in HTML code generation from web screenshots.

Model Fine-Tuning and Performance

The authors fine-tuned a foundational VLM—Sightseer—on the WebSight dataset. Sightseer is built upon Mistral-7B and SigLIP-SO400M, employing the Patch n’ Pack strategy to preserve the original aspect ratios of input images. Fine-tuning used the parameter-efficient DoRA method to stabilize training while improving the model’s OCR capabilities, spatial understanding, and object recognition.
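The DoRA decomposition can be illustrated with a minimal NumPy sketch: the weight matrix is re-expressed as a learnable per-column magnitude times a column-normalized direction, with the low-rank (LoRA-style) update applied only to the direction. This is a didactic reconstruction, not the paper's training code.

```python
import numpy as np

def dora_weight(w0: np.ndarray, B: np.ndarray, A: np.ndarray,
                m: np.ndarray) -> np.ndarray:
    """Compose a DoRA-updated weight matrix.

    w0 -- frozen pretrained weight, shape (d, k)
    B, A -- low-rank update factors, shapes (d, r) and (r, k)
    m -- learnable per-column magnitude vector, shape (k,)
    """
    direction = w0 + B @ A                      # low-rank update to the direction
    col_norms = np.linalg.norm(direction, axis=0, keepdims=True)
    return direction / col_norms * m            # each column rescaled to magnitude m
```

Because the direction is renormalized per column, the magnitude of every output column is controlled entirely by `m`, decoupling "how much" from "which way"—the property DoRA exploits to make low-rank fine-tuning behave more like full fine-tuning.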

Qualitative evaluations revealed that Sightseer could produce functional HTML code that closely replicates the structure and content of the input web screenshots. This performance extended to untrained scenarios where the model successfully converted handwritten sketches into HTML, demonstrating significant versatility.
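In practice, a model's raw generation may wrap the document in conversational text; a small post-processing helper (hypothetical, not part of the paper's pipeline) can isolate the HTML before rendering or evaluation:

```python
import re

def extract_html(generation: str) -> str:
    """Return the first complete <html>...</html> document found in a raw
    model generation, or the stripped text if none is present.
    (A hypothetical convenience helper, not from the paper.)"""
    match = re.search(r"<html.*?</html>", generation, flags=re.S | re.I)
    return match.group(0) if match else generation.strip()
```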

Implications and Future Directions

The introduction of WebSight and the fine-tuned Sightseer holds promising implications for web development and UI design. These tools can expedite the design-to-code pipeline, enabling faster iteration and prototyping. For practitioners, this translates into enhanced efficiency and reduced reliance on manual coding, fostering the development of no-code solutions.

However, several challenges and failure cases were identified, particularly with complex layouts or content-heavy pages. These issues suggest that further fine-tuning, possibly incorporating more diverse real-world data or alternative CSS frameworks, could enhance the model's accuracy and generalizability.

Conclusion

The work presented in this paper represents a significant step towards automating the conversion of web screenshots into HTML code. By introducing the WebSight dataset and demonstrating the effectiveness of fine-tuning VLMs on this data, the authors provide a foundational resource for future research and development in this area. The open-source release of WebSight is poised to catalyze further innovations, driving advancements in AI-powered web development tools and no-code solutions.
