Safety cases for frontier AI (2410.21572v1)
Abstract: As frontier AI systems become more capable, it becomes more important that developers can explain why their systems are sufficiently safe. One way to do so is via safety cases: reports that make a structured argument, supported by evidence, that a system is safe enough in a given operational context. Safety cases are already common in other safety-critical industries such as aviation and nuclear power. In this paper, we explain why they may also be a useful tool in frontier AI governance, both in industry self-regulation and government regulation. We then discuss the practicalities of safety cases, outlining how to produce a frontier AI safety case and discussing what still needs to happen before safety cases can substantially inform decisions.
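To make the idea of a "structured argument, supported by evidence" concrete, below is a minimal illustrative sketch (not taken from the paper) of how such an argument might be represented programmatically, loosely following the claim/strategy/evidence decomposition used in Goal Structuring Notation (GSN). All node names, claims, and evidence items are hypothetical placeholders.

```python
# Minimal sketch of a safety-case argument as a claim/strategy/evidence tree.
# Illustrative only: the structure loosely mirrors GSN-style decomposition;
# the specific claims and evidence below are invented examples.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One element of a safety-case tree: a claim, strategy, or evidence item."""
    kind: str                      # "claim", "strategy", or "evidence"
    text: str                      # natural-language statement of the element
    children: List["Node"] = field(default_factory=list)

    def render(self, indent: int = 0) -> str:
        """Return an indented outline of the argument structure."""
        lines = [" " * indent + f"[{self.kind}] {self.text}"]
        for child in self.children:
            lines.append(child.render(indent + 2))
        return "\n".join(lines)


# Hypothetical top-level claim, decomposed by threat model and backed by evidence.
case = Node(
    "claim",
    "The deployed system does not meaningfully uplift cyberoffense capabilities.",
    children=[
        Node(
            "strategy",
            "Argue over each identified cyber-misuse threat model.",
            children=[
                Node(
                    "claim",
                    "The model cannot autonomously exploit known vulnerabilities.",
                    children=[Node("evidence", "Capability evaluation results on exploit tasks.")],
                ),
                Node(
                    "claim",
                    "Safeguards block misuse attempts by unsophisticated actors.",
                    children=[Node("evidence", "Red-teaming report on jailbreak robustness.")],
                ),
            ],
        )
    ],
)

if __name__ == "__main__":
    print(case.render())
```

Printing the tree yields an indented outline of the argument, which is roughly the shape a reviewer of a safety case would evaluate: does each claim follow from its sub-claims, and is each leaf claim adequately supported by its evidence?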
Authors: Marie Davidsen Buhl, Gaurav Sett, Leonie Koessler, Jonas Schuett, Markus Anderljung