GenAIPABench: A Benchmark for Generative AI-based Privacy Assistants (2309.05138v3)
Abstract: Privacy policies of websites are often lengthy and intricate. Privacy assistants help simplify these policies and make them more accessible and user-friendly. The emergence of generative AI (genAI) offers new opportunities to build privacy assistants that can answer users' questions about privacy policies. However, genAI's reliability is a concern due to its potential for producing inaccurate information. This study introduces GenAIPABench, a benchmark for evaluating Generative AI-based Privacy Assistants (GenAIPAs). GenAIPABench includes: 1) a set of questions about privacy policies and data protection regulations, with annotated answers for various organizations and regulations; 2) metrics to assess the accuracy, relevance, and consistency of responses; and 3) a tool for generating prompts that introduce privacy documents and pose varied privacy questions to test system robustness. We evaluated three leading genAI systems (ChatGPT-4, Bard, and Bing AI) using GenAIPABench to gauge their effectiveness as GenAIPAs. Our results show significant promise for genAI in the privacy domain while also highlighting challenges in handling complex queries, ensuring consistency, and verifying source accuracy.
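The benchmark's three components (annotated question set, response metrics, and a prompt generator) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: all names (`BenchmarkItem`, `build_prompt`, `token_f1`, `consistency`) are hypothetical, and token-overlap F1 is a crude stand-in for the evaluation metrics described in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    """One annotated question/answer pair about a privacy policy or regulation."""
    question: str
    reference_answer: str
    paraphrases: list = field(default_factory=list)  # varied phrasings for robustness tests

def build_prompt(policy_text: str, question: str) -> str:
    """Sketch of the prompt generator: introduce the privacy document, then ask."""
    return (
        "You are a privacy assistant. Answer using only the policy below.\n\n"
        f"--- POLICY ---\n{policy_text}\n--- END POLICY ---\n\n"
        f"Question: {question}"
    )

def token_f1(prediction: str, reference: str) -> float:
    """Lexical-overlap F1 as a crude accuracy proxy (hypothetical metric)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = len(set(pred) & set(ref))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def consistency(answers: list) -> float:
    """Average pairwise similarity of answers to paraphrased versions of a question."""
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(token_f1(a, b) for a, b in pairs) / len(pairs)
```

A harness along these lines would feed `build_prompt` output to each genAI system, score the reply against the annotated answer with an accuracy metric, and repeat over `paraphrases` to measure consistency across rephrasings.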