
GUing: A Mobile GUI Search Engine using a Vision-Language Model (2405.00145v3)

Published 30 Apr 2024 in cs.SE and cs.CV

Abstract: Graphical User Interfaces (GUIs) are central to app development projects. App developers may use the GUIs of other apps as a means of requirements refinement and rapid prototyping, or as a source of inspiration for designing and improving their own apps. Recent research has thus suggested retrieving relevant GUI designs that match a given text query from screenshot datasets acquired through crowdsourced or automated exploration of GUIs. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not steered by app developers and lack app features that require particular input data. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called GUIClip, which we trained specifically for the problem of designing app GUIs. For this, we first collected app introduction images from Google Play, which display the most representative screenshots and are often captioned (i.e., labelled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This resulted in a large dataset, which we share with this paper, comprising 303k app screenshots, of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind for GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of GUIClip for other GUI tasks, including GUI classification and sketch-to-GUI retrieval, with encouraging results.
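
To make the reported metrics concrete, the following is a minimal sketch of how text-to-GUI retrieval with a shared embedding space is commonly scored. It is not the authors' implementation: the function names (`rank_by_cosine`, `recall_at_k`, `hit_at_k`), the toy two-dimensional embeddings, and the screenshot identifiers are all hypothetical, and the metric definitions used here (Recall@k as the fraction of relevant items found in the top k, HIT@k as whether at least one relevant item appears in the top k) are the usual ones and are assumed rather than taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(query: Sequence[float],
                   gallery: Dict[str, Sequence[float]]) -> List[str]:
    """Return gallery ids sorted by descending similarity to the query.

    In a CLIP-style setup the query vector would come from a text encoder
    and the gallery vectors from an image encoder; here both are toy inputs.
    """
    return sorted(gallery, key=lambda gid: cosine(query, gallery[gid]),
                  reverse=True)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant items that appear among the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


if __name__ == "__main__":
    # Hypothetical embeddings for three screenshots and one text query.
    gallery = {"login": [1.0, 0.0], "settings": [0.0, 1.0], "signup": [1.0, 1.0]}
    ranking = rank_by_cosine([1.0, 0.0], gallery)
    print(ranking)
    print(recall_at_k(ranking, {"login", "signup"}, k=1))
    print(hit_at_k(ranking, {"login", "signup"}, k=1))
```

Averaging `recall_at_k` and `hit_at_k` over a set of queries gives dataset-level scores comparable in form to the Recall@10 and HIT@10 figures quoted in the abstract. For a gallery of several hundred thousand screenshots, the exhaustive sort would typically be replaced by an approximate nearest-neighbour index.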

Authors (7)
  1. Jialiang Wei (7 papers)
  2. Anne-Lise Courbis (6 papers)
  3. Thomas Lambolais (6 papers)
  4. Binbin Xu (37 papers)
  5. Pierre Louis Bernard (5 papers)
  6. Gérard Dray (9 papers)
  7. Walid Maalej (41 papers)