Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
Abstract: Using vision-language models (VLMs) in web development presents a promising strategy for increasing efficiency and unblocking no-code solutions: given a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite advancements in VLMs across many tasks, the specific challenge of converting a screenshot into corresponding HTML code has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset of 2 million pairs of HTML code and corresponding screenshots. We fine-tune a foundational VLM on our dataset and show its proficiency in converting webpage screenshots into functional HTML code. To accelerate research in this area, we open-source WebSight.
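A dataset like WebSight pairs generated HTML with a rendering of that HTML. As a minimal sketch of the generation half of such a pipeline, the snippet below produces simple, seeded synthetic pages; the templates, element pools, and `generate_page` helper are illustrative assumptions, not the paper's actual pipeline. Each page would then be rendered to a screenshot with a headless browser (e.g. Playwright or Selenium) to form the (screenshot, code) pair.

```python
import random

# Illustrative pools of page content; a real pipeline would draw from a
# much richer space of layouts, styles, and components.
HEADINGS = ["Welcome", "About Us", "Services", "Contact"]
COLORS = ["#f0f0f0", "#ffffff", "#e8f4fd"]

def generate_page(seed: int) -> str:
    """Generate one deterministic synthetic HTML page from a seed."""
    rng = random.Random(seed)
    heading = rng.choice(HEADINGS)
    color = rng.choice(COLORS)
    paragraphs = "\n".join(
        f"    <p>Placeholder paragraph {i}.</p>"
        for i in range(rng.randint(1, 3))
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html>\n<body style="background:{color}">\n'
        f"  <h1>{heading}</h1>\n"
        f"{paragraphs}\n"
        "</body>\n</html>"
    )

page = generate_page(42)
```

Seeding each page makes the dataset reproducible: the same seed always yields the same HTML, so screenshots can be regenerated on demand.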