Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems (2410.23472v2)

Published 30 Oct 2024 in cs.CY, cs.AI, and cs.LG

Abstract: There is an urgent need to identify both short- and long-term risks from newly emerging types of AI, as well as available risk management measures. In response, and to support global efforts in regulating AI and writing safety standards, we compile an extensive catalog of risk sources and risk management measures for general-purpose AI (GPAI) systems, complete with descriptions and supporting examples where relevant. This work involves identifying technical, operational, and societal risks across model development, training, and deployment stages, as well as surveying established and experimental methods for managing these risks. To the best of our knowledge, this paper is the first of its kind to provide extensive documentation of both GPAI risk sources and risk management measures that are descriptive, self-contained, and neutral with respect to any existing regulatory framework. This work intends to help AI providers, standards experts, researchers, policymakers, and regulators identify and mitigate systemic risks from GPAI systems. For this reason, the catalog is released under a public domain license for ease of direct use by stakeholders in AI governance and standards.

Authors (6)
  1. Rokas Gipiškis (4 papers)
  2. Ayrton San Joaquin (5 papers)
  3. Ze Shen Chin (2 papers)
  4. Adrian Regenfuß (1 paper)
  5. Ariel Gil (1 paper)
  6. Koen Holtman (6 papers)