VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs (2404.06369v1)
Abstract: Automatically generating UI code from webpage design visions can significantly reduce developers' workload and enable beginner developers or designers to generate Web pages directly from design diagrams. Prior research has achieved UI code generation from rudimentary design visions or sketches by designing deep neural networks. Inspired by the groundbreaking advancements of Multimodal Large Language Models (MLLMs), automatically generating UI code from high-fidelity design images is now emerging as a viable possibility. Nevertheless, our investigation reveals that existing MLLMs are hampered by the scarcity of authentic, high-quality, and large-scale datasets, leading to unsatisfactory performance in automated UI code generation. To bridge this gap, we present VISION2UI, a novel dataset extracted from real-world scenarios and augmented with comprehensive layout information, tailored specifically for finetuning MLLMs on UI code generation. The dataset is derived through a series of operations: collecting, cleaning, and filtering the open-source Common Crawl corpus. To uphold quality, a neural scorer trained on labeled samples refines the data, retaining higher-quality instances. This process yields a dataset of 2,000 parallel samples of design visions and UI code (with more to be released soon). The dataset is available at https://huggingface.co/datasets/xcodemind/vision2ui.
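Since the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the standard `datasets` API. Below is a minimal sketch of loading VISION2UI and applying a score-based quality filter in the spirit of the neural scorer described above; the field names and the `score_fn` placeholder are assumptions for illustration, as the abstract does not specify the schema or the scorer's architecture and threshold.

```python
# Minimal sketch: load VISION2UI and keep only high-scoring samples.
# Assumptions (not confirmed by the abstract): field names such as
# "image" and "html", and score_fn as a stand-in for the paper's
# neural quality scorer.
from datasets import load_dataset

ds = load_dataset("xcodemind/vision2ui", split="train")

def score_fn(sample) -> float:
    # Placeholder for the trained neural scorer; a real implementation
    # would score the (design vision, UI code) pair for quality.
    return 1.0

# Retain higher-quality instances, mirroring the paper's filtering step.
high_quality = ds.filter(lambda s: score_fn(s) >= 0.5)
print(f"kept {len(high_quality)} of {len(ds)} samples")
```

In the paper's pipeline the scorer is trained on labeled samples; the fixed threshold used here (0.5) is purely illustrative.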
Authors: Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Shaoling Dong, Xing Zhou, Wenbin Jiang