GUing: A Mobile GUI Search Engine using a Vision-Language Model (2405.00145v3)
Abstract: Graphical User Interfaces (GUIs) are central to app development projects. App developers may use the GUIs of other apps as a means of requirements refinement and rapid prototyping, or as a source of inspiration for designing and improving their own apps. Recent research has thus suggested retrieving relevant GUI designs that match a given text query from screenshot datasets acquired through crowdsourced or automated exploration of GUIs. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not steered by app developers and lack app features that require particular input data. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called GUIClip, which we trained specifically for the problem of designing app GUIs. For this, we first collected app introduction images from Google Play, which display the most representative screenshots and are often captioned (i.e., labelled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This resulted in a large dataset, which we share with this paper, comprising 303k app screenshots, of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind for GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of GUIClip for other GUI tasks, including GUI classification and sketch-to-GUI retrieval, with encouraging results.
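The abstract reports Recall@10 and HIT@10 without defining them. The following is a minimal sketch of how such metrics are commonly computed for a ranked retrieval result, not the paper's evaluation code. It assumes Recall@k is the fraction of a query's relevant screenshots found in the top k results, and HIT@k is 1 if at least one relevant screenshot appears in the top k; the paper's exact definitions may differ. The function names, screenshot identifiers, and toy data are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results (assumed definition)."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top-k results, else 0.0 (assumed definition)."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def mean_metric(
    metric: Callable[[Sequence[str], Set[str], int], float],
    runs: List[Tuple[Sequence[str], Set[str]]],
    k: int,
) -> float:
    """Average a per-query metric over all queries."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Hypothetical toy data: each entry is (ranked result ids, set of relevant ids) for one query.
runs = [
    (["gui_3", "gui_7", "gui_1"], {"gui_7", "gui_9"}),
    (["gui_2", "gui_5", "gui_8"], {"gui_4"}),
]

mean_recall = mean_metric(recall_at_k, runs, k=2)
mean_hit = mean_metric(hit_at_k, runs, k=2)
```

Under these assumed definitions, HIT@k is always at least as large as Recall@k for a given query, which is consistent with the reported HIT@10 exceeding the reported Recall@10.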
- Jialiang Wei
- Anne-Lise Courbis
- Thomas Lambolais
- Binbin Xu
- Pierre Louis Bernard
- Gérard Dray
- Walid Maalej