Model-Enhanced LLM-Driven VUI Testing of VPA Apps (2407.02791v1)
Abstract: The flourishing ecosystem centered around voice personal assistants (VPAs), such as Amazon Alexa, has led to a boom in VPA apps. The largest app market, the Amazon skills store, hosts over 200,000 apps. Despite their popularity, the open nature of app release and the easy accessibility of apps raise significant concerns about security, privacy, and quality. Consequently, various testing approaches have been proposed to systematically examine VPA app behaviors. To tackle the inherent lack of a visible user interface in VPA apps, two strategies are employed during testing: chatbot-style testing and model-based testing. The former often lacks effective guidance for expanding its search space, while the latter falls short in interpreting the semantics of conversations to construct precise and comprehensive behavior models of apps. In this work, we introduce Elevate, a model-enhanced LLM-driven VUI testing framework. Elevate leverages LLMs' strong natural language processing capability to compensate for the semantic information lost during model-based VUI testing. It operates by prompting LLMs to extract states from VPA apps' outputs and to generate context-related inputs. During automatic interaction with the app, it incrementally constructs a behavior model, which helps the LLM generate inputs that are highly likely to discover new states. Elevate bridges the LLM and the behavior model with novel techniques such as encoding the behavior model into prompts and selecting LLM-generated inputs by context relevance. Benchmarked on 4,000 real-world Alexa skills against the state-of-the-art tester Vitas, Elevate achieves 15% higher state space coverage on all types of apps and is significantly more efficient.
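The core loop the abstract describes, a behavior model built incrementally during interaction, encoded into LLM prompts, with LLM-generated inputs filtered by context relevance, can be sketched as follows. This is a minimal illustration, not Elevate's actual implementation: the names (`BehaviorModel`, `encode_model_as_prompt`, `select_input`) are hypothetical, the LLM call is replaced by canned candidates, and the relevance score is a naive word-overlap stand-in for the paper's context-relevance selection.

```python
# Hypothetical sketch of model-enhanced LLM-driven VUI testing.
# All names are illustrative; no real skill or LLM is contacted.

from collections import defaultdict

class BehaviorModel:
    """Incrementally built state graph of a VPA app's dialogue behavior."""
    def __init__(self):
        # state -> {user input: next state}
        self.transitions = defaultdict(dict)

    def add_transition(self, state, user_input, next_state):
        self.transitions[state][user_input] = next_state

    def states(self):
        seen = set(self.transitions)
        for edges in self.transitions.values():
            seen.update(edges.values())
        return seen

def encode_model_as_prompt(model, current_state, app_output):
    """Encode the behavior model into an LLM prompt (one of the bridging
    techniques the abstract mentions), asking the LLM for inputs that are
    likely to reach states not yet in the model."""
    edges = [f"{s} --[{i}]--> {t}"
             for s, transitions in model.transitions.items()
             for i, t in transitions.items()]
    return ("Known behavior model:\n" + "\n".join(edges) +
            f"\nCurrent state: {current_state}\n"
            f"App said: {app_output!r}\n"
            "Suggest user inputs likely to reach unexplored states.")

def select_input(candidates, app_output):
    """Pick the LLM-generated candidate most relevant to the app's last
    output, scored here by word overlap for illustration."""
    context = set(app_output.lower().split())
    return max(candidates,
               key=lambda c: len(context & set(c.lower().split())))

# Usage with canned data in place of a real skill and a real LLM:
model = BehaviorModel()
model.add_transition("launch", "play trivia", "question")
prompt = encode_model_as_prompt(
    model, "question", "What is the capital of France?")
best = select_input(["stop", "the capital is Paris", "help"],
                    "What is the capital of France?")
print(best)  # → the capital is Paris
```

In the full framework the prompt would be sent to an LLM whose suggestions replace the canned candidate list, and each observed response would be fed back into the model via `add_transition`, so later prompts steer generation toward unexplored states.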