PromptRPA: Generating Robotic Process Automation on Smartphones from Textual Prompts (2404.02475v1)
Abstract: Robotic Process Automation (RPA) offers a valuable solution for efficiently automating tasks on the graphical user interface (GUI), by emulating human interactions, without modifying existing code. However, its broader adoption is constrained by the need for expertise in both scripting languages and workflow design. To address this challenge, we present PromptRPA, a system designed to comprehend various task-related textual prompts (e.g., goals, procedures), thereby generating and performing corresponding RPA tasks. PromptRPA incorporates a suite of intelligent agents that mimic human cognitive functions, specializing in interpreting user intent, managing external information for RPA generation, and executing operations on smartphones. The agents can learn from user feedback and continuously improve their performance based on the accumulated knowledge. Experimental results indicated a performance jump from a 22.28% success rate in the baseline to 95.21% with PromptRPA, requiring an average of 1.66 user interventions for each new task. PromptRPA presents promising applications in fields such as tutorial creation, smart assistance, and customer service.
Authors: Tian Huang, Chun Yu, Weinan Shi, Zijian Peng, David Yang, Weiqi Sun, Yuanchun Shi