Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset (2403.09029v1)

Published 14 Mar 2024 in cs.HC, cs.AI, and cs.CV

Abstract: Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into the corresponding HTML has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML code and corresponding screenshots. We fine-tune a foundational VLM on our dataset and show proficiency in converting webpage screenshots to functional HTML code. To accelerate the research in this area, we open-source WebSight.

Unlocking the Conversion of Web Screenshots into HTML Code with the WebSight Dataset

The paper "Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset" introduces an innovative approach to translating web screenshots into functional HTML code. This paper offers significant insights into the construction and application of a dataset aimed specifically at this task, addressing a gap in the use of vision-LLMs (VLMs) for web development.

Overview of the WebSight Dataset

The core contribution of the paper is WebSight, a synthetic dataset comprising 2 million pairs of HTML code and corresponding screenshots. The dataset is designed to advance the capability of VLMs in web development, especially for converting static images of web interfaces into usable code.

The authors identify the lack of a substantial dataset as the primary obstacle preventing VLMs from addressing this translation task effectively. To create WebSight, they bypassed the complexity of existing web HTML files, which are often burdened with noisy content, scripts, and external dependencies, and instead synthesized the HTML with LLMs proficient in code generation, ensuring clean, structured data suitable for model training.
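Because WebSight is open-sourced, the pairs can be inspected directly. Below is a minimal sketch assuming the dataset is published on the Hugging Face Hub under an identifier like HuggingFaceM4/WebSight and exposes image/text fields; the identifier and field names are assumptions, not details confirmed by this summary.

```python
# Minimal sketch: streaming a few WebSight pairs for inspection.
# The dataset identifier and field names below are assumptions.
from datasets import load_dataset

# Stream to avoid downloading all 2 million pairs up front.
ds = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)

for example in ds.take(3):
    screenshot = example["image"]  # rendered page as a PIL image (assumed field)
    html_code = example["text"]    # the HTML that produced it (assumed field)
    print(html_code[:200])
```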

Methodology and Data Construction

The methodology involved two critical steps: generating a diverse array of website concepts and converting these into HTML code with an LLM, specifically DeepSeek-Coder-33B-Instruct. The process began with Mistral-7B-Instruct generating unique website designs, after which the code LLM produced the final HTML, styled with Tailwind CSS, a utility-first framework that keeps styling concise and inline within HTML documents. A sketch of this two-step pipeline follows.
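The pipeline can be approximated with off-the-shelf instruction-tuned checkpoints. The prompts, generation settings, and exact model revisions below are illustrative assumptions, not the authors' released pipeline:

```python
# Illustrative sketch of the two-step generation pipeline:
# (1) an instruction-tuned LLM proposes a website concept,
# (2) a code LLM turns that concept into Tailwind-styled HTML.
# Prompts and sampling settings are assumptions, not the paper's.
from transformers import pipeline

concept_gen = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
code_gen = pipeline("text-generation", model="deepseek-ai/deepseek-coder-33b-instruct")

concept = concept_gen(
    "Propose a short, unique concept for a website: theme, sections, audience.",
    max_new_tokens=128,
)[0]["generated_text"]

html = code_gen(
    "Write one self-contained HTML page using Tailwind CSS utility classes "
    "(loaded from its CDN, no external files) for this concept:\n" + concept,
    max_new_tokens=1024,
)[0]["generated_text"]
print(html)
```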

Key to the dataset's diversity and utility was the inclusion of screenshots captured at various resolutions using Playwright, ensuring a range of image sizes and formats that would enhance model robustness. This meticulous construction resulted in a dataset that could significantly accelerate VLMs' proficiency in HTML code generation from web screenshots.
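The rendering step can be reproduced with Playwright's synchronous API; the specific viewport sizes here are examples rather than the paper's reported settings:

```python
# Sketch: render a synthesized HTML file to screenshots at several
# viewport sizes to diversify the resulting images.
from playwright.sync_api import sync_playwright

def render(html_path: str, out_path: str, width: int, height: int) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.goto(f"file://{html_path}")  # load the local HTML file
        page.screenshot(path=out_path, full_page=True)
        browser.close()

# Example viewports (illustrative, not the paper's exact resolutions).
for w, h in [(1280, 720), (1920, 1080), (768, 1024)]:
    render("/tmp/site.html", f"/tmp/site_{w}x{h}.png", w, h)
```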

Model Fine-Tuning and Performance

The authors fine-tuned a foundational VLM, Sightseer, on the WebSight dataset. Sightseer is built upon Mistral-7B and SigLIP-SO400M and employs the Patch n’ Pack strategy to preserve the original aspect ratios of input images. Fine-tuning used the parameter-efficient DoRA method to stabilize training, while drawing on the base model’s OCR, spatial understanding, and object recognition capabilities.
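DoRA is exposed in the peft library as a flag on its LoRA configuration, which gives a convenient way to sketch this setup. The base checkpoint, rank, and target modules below are stand-in assumptions, not Sightseer's actual configuration:

```python
# Sketch: parameter-efficient fine-tuning with DoRA via the peft library.
# The base model, rank, alpha, and target modules are illustrative
# assumptions, not the paper's reported settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

base = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")  # stand-in base VLM

config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,  # weight-decomposed low-rank adaptation (DoRA)
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
```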

Qualitative evaluations showed that Sightseer produces functional HTML code that closely replicates the structure and content of input web screenshots. This capability extended beyond the training distribution: the model also converted handwritten sketches into HTML, demonstrating notable versatility.
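Inference follows the standard image-to-text pattern: encode the screenshot, prompt for HTML, and decode. Since this summary does not show Sightseer's interface, the sketch below uses transformers' generic vision-to-sequence classes with a hypothetical checkpoint name:

```python
# Sketch: screenshot-to-HTML inference with a fine-tuned VLM.
# The checkpoint name and prompt format are placeholders, not Sightseer's API.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "your-org/websight-finetuned-vlm"  # hypothetical checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("screenshot.png")
inputs = processor(
    images=image,
    text="Convert this webpage screenshot to HTML.",
    return_tensors="pt",
)
output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```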

Implications and Future Directions

The introduction of WebSight and the fine-tuned Sightseer holds promising implications for web development and UI design. These tools can expedite the design-to-code pipeline, enabling faster iteration and prototyping. For practitioners, this translates into enhanced efficiency and reduced reliance on manual coding, fostering the development of no-code solutions.

However, several challenges and failure cases were identified, particularly with complex layouts or content-heavy pages. These issues suggest that further fine-tuning, possibly incorporating more diverse real-world data or alternative CSS frameworks, could enhance the model's accuracy and generalizability.

Conclusion

The work presented in this paper represents a significant step towards automating the conversion of web screenshots into HTML code. By introducing the WebSight dataset and demonstrating the effectiveness of fine-tuning VLMs on this data, the authors provide a foundational resource for future research and development in this area. The open-source release of WebSight is poised to catalyze further innovations, driving advancements in AI-powered web development tools and no-code solutions.

References (18)
  1. Tony Beltramelli. pix2code: Generating code from a graphical user interface screenshot. arXiv preprint arXiv:1705.07962, 2017.
  2. OCR-IDL: OCR annotations for Industry Document Library dataset, 2022.
  3. Patch n’ Pack: NaViT, a vision transformer for any aspect ratio and resolution, 2023.
  4. DataComp: In search of the next generation of multimodal datasets, 2023.
  5. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence, 2024.
  6. CogAgent: A visual language model for GUI agents, 2023.
  7. Mistral 7B, 2023.
  8. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=SKN2hflBIZ.
  9. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 18893–18912. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/lee23g.html.
  10. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
  11. DoRA: Weight-decomposed low-rank adaptation, 2024.
  12. Reverse engineering mobile application user interfaces with REMAUI (T). 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 248–259, 2015. URL https://api.semanticscholar.org/CorpusID:7499368.
  13. OpenAI et al. GPT-4 technical report, 2023.
  14. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022.
  15. Design2Code: How far are we from automating front-end engineering?, 2024.
  16. Gemini Team et al. Gemini: A family of highly capable multimodal models, 2023.
  17. Sigmoid loss for language image pre-training, 2023.
  18. Multimodal C4: An open, billion-scale corpus of images interleaved with text. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=tOd8rSjcWz.
Authors (3)
  1. Hugo Laurençon
  2. Léo Tronchon
  3. Victor Sanh
Citations (21)

HackerNews

  1. Screenshot to HTML Code Dataset (1 point, 0 comments)