
PromptRPA: Generating Robotic Process Automation on Smartphones from Textual Prompts (2404.02475v1)

Published 3 Apr 2024 in cs.HC

Abstract: Robotic Process Automation (RPA) offers a valuable solution for efficiently automating tasks on the graphical user interface (GUI), by emulating human interactions, without modifying existing code. However, its broader adoption is constrained by the need for expertise in both scripting languages and workflow design. To address this challenge, we present PromptRPA, a system designed to comprehend various task-related textual prompts (e.g., goals, procedures), thereby generating and performing corresponding RPA tasks. PromptRPA incorporates a suite of intelligent agents that mimic human cognitive functions, specializing in interpreting user intent, managing external information for RPA generation, and executing operations on smartphones. The agents can learn from user feedback and continuously improve their performance based on the accumulated knowledge. Experimental results indicated a performance jump from a 22.28% success rate in the baseline to 95.21% with PromptRPA, requiring an average of 1.66 user interventions for each new task. PromptRPA presents promising applications in fields such as tutorial creation, smart assistance, and customer service.

Authors (7)
  1. Tian Huang
  2. Chun Yu
  3. Weinan Shi
  4. Zijian Peng
  5. David Yang
  6. Weiqi Sun
  7. Yuanchun Shi