Black-Box Access is Insufficient for Rigorous AI Audits (2401.14446v3)
Abstract: External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning. Meanwhile, outside-the-box access to training and deployment information (e.g., methodology, code, documentation, data, deployment details, findings from internal evaluations) allows auditors to scrutinize the development process and design more targeted evaluations. In this paper, we examine the limitations of black-box audits and the advantages of white- and outside-the-box audits. We also discuss technical, physical, and legal safeguards for performing these audits with minimal security risks. Given that different forms of access can lead to very different levels of evaluation, we conclude that (1) transparency regarding the access and methods used by auditors is necessary to properly interpret audit results, and (2) white- and outside-the-box access allow for substantially more scrutiny than black-box access alone.
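To make the access distinction in the abstract concrete, the sketch below contrasts black-box querying (submit an input, observe the output) with white-box inspection (read activations and gradients) on a small open-weights language model. This is an illustrative sketch only, not code from the paper: the use of the Hugging Face `transformers` library, the `gpt2` placeholder model, and the particular layer whose gradient is inspected are all assumptions chosen for brevity.

```python
# Minimal sketch (not from the paper) contrasting black-box and white-box access.
# Assumes a small open-weights causal LM via Hugging Face `transformers`;
# the model name and the inspected layer are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder open model; a real audit target would differ
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt")

# --- Black-box access: query the system and observe its outputs only. ---
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=5)
print("black-box output:", tok.decode(generated[0]))

# --- White-box access: inspect inner workings (activations, gradients). ---
outputs = model(**inputs, output_hidden_states=True)
hidden = outputs.hidden_states[-1]   # final-layer activations, usable for probing
score = outputs.logits[0, -1].max()  # scalar objective to differentiate
score.backward()                     # gradients w.r.t. parameters now available
grad_norm = model.transformer.h[0].mlp.c_fc.weight.grad.norm()
print("activation shape:", tuple(hidden.shape), "| example grad norm:", float(grad_norm))
```

With only the black-box half, an auditor sees sampled text for the queries it thinks to ask; with the white-box half, the same query also yields activations for interpretability probes and gradients that can drive optimization-based attacks or fine-tuning, which is the gap in scrutiny the paper argues for closing.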
Authors: Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger