
Adversarial Training for High-Stakes Reliability (2205.01663v5)

Published 3 May 2022 in cs.LG, cs.AI, and cs.CL

Abstract: In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance. In this work, we used a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques -- including a tool that assists human adversaries -- to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance. We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models.

Authors (12)
  1. Daniel M. Ziegler (8 papers)
  2. Seraphina Nix (4 papers)
  3. Lawrence Chan (16 papers)
  4. Tim Bauman (1 paper)
  5. Peter Schmidt-Nielsen (1 paper)
  6. Tao Lin (167 papers)
  7. Adam Scherlis (10 papers)
  8. Noa Nabeshima (4 papers)
  9. Ben Weinstein-Raun (2 papers)
  10. Daniel de Haas (1 paper)
  11. Buck Shlegeris (21 papers)
  12. Nate Thomas (2 papers)
Citations (55)

Summary

Adversarial Training for High-Stakes Reliability

The paper "Adversarial training for high-stakes reliability" addresses the development of adversarial training techniques aimed at enhancing reliability in AI systems deployed in critical settings where failures can result in catastrophic consequences. The authors present their findings using a tractable language generation task—avoiding injurious story completions—as a testbed to improve worst-case performance through adversarial training.

Objective and Methodology

The primary objective is to minimize the risk of deployment-time failures in high-stakes environments. The paper proposes adversarial training as a means of improving reliability: an adversary generates examples that expose worst-case behavior, and the system is trained on them. To illustrate the approach, the authors use a language generation task in which a classifier filters story completions proposed by a generator so that injurious content is excluded.
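As a rough illustration of this filtering setup, the sketch below pairs a generator with a classifier that rejects completions whose predicted injury probability exceeds a conservative threshold. The helper names, threshold value, and candidate count are hypothetical placeholders, not the paper's actual components.

```python
# Minimal sketch of the filtering setup (hypothetical components).
# A generator proposes story completions; a classifier scores each one for
# injurious content; completions above a conservative threshold are discarded.

from typing import Callable, List

def filter_completions(
    prompt: str,
    generate: Callable[[str, int], List[str]],   # returns n candidate completions
    injury_score: Callable[[str, str], float],   # returns estimated P(injurious)
    threshold: float = 0.01,                     # conservative rejection cutoff
    n_candidates: int = 16,
) -> List[str]:
    """Return only the candidate completions the classifier deems safe."""
    candidates = generate(prompt, n_candidates)
    return [c for c in candidates if injury_score(prompt, c) < threshold]
```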

The paper details a methodology that combines several sources of adversarial examples:

  1. Manual Adversarial Examples: Human contractors crafted injurious examples that evaded the initial classifiers.
  2. Automatic Paraphrasing: GPT-3 was used to paraphrase the manual adversarial examples, expanding the adversarial training dataset.
  3. Tool-Assisted Human Attacks: A purpose-built tool assists human adversaries by suggesting edits that make their examples more likely to fool the classifier (a sketch of how these sources might feed a retraining loop follows this list).
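A hedged sketch of how these three sources of adversarial data might feed one round of retraining is shown below; the helper names (`find_failure`, `paraphraser`, `retrain`) are illustrative stand-ins, not the paper's actual tooling.

```python
# Illustrative adversarial-training round (all helpers are hypothetical).

def adversarial_training_round(classifier, train_set, contractors, paraphraser):
    new_examples = []

    # 1. Manual attacks: contractors look for injurious completions that the
    #    current classifier fails to flag.
    manual = [c.find_failure(classifier) for c in contractors]
    manual = [x for x in manual if x is not None]
    new_examples.extend(manual)

    # 2. Automatic paraphrasing: expand each manual failure into variants
    #    (the paper used GPT-3 for this step).
    for example in manual:
        new_examples.extend(paraphraser(example))

    # 3. Tool-assisted attacks: the attack tool suggests edits that push an
    #    example toward fooling the classifier; contractors vet the results.
    assisted = [c.find_failure(classifier, tool_assisted=True) for c in contractors]
    new_examples.extend(x for x in assisted if x is not None)

    # Label the new failures as injurious and retrain on the augmented dataset.
    train_set = train_set + [(x, "injurious") for x in new_examples]
    return classifier.retrain(train_set), train_set
```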

Results

The paper reports significant improvements in resilience to the adversarial attacks trained against: for the most robust classifier, the time contractors needed to find a successful adversarial example roughly doubled, from 13 to 26 minutes with the attack tool and from 20 to 44 minutes without it. Adversarial training achieved this without degrading performance on in-distribution data. In particular, the classifiers kept false negative rates low when filtering injurious completions on a held-out test set, demonstrating strong reliability under a rigorous evaluation protocol.
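The two evaluation axes reported above (in-distribution false negative rate and time-to-failure under attack) can be framed roughly as below; the data structures and field names are assumptions for illustration, not the paper's evaluation code.

```python
# Sketches of the two reliability metrics discussed above (illustrative only).

def false_negative_rate(classifier, labeled_test_set, threshold=0.01):
    """Fraction of injurious test completions that the filter lets through."""
    injurious = [(p, c) for (p, c, label) in labeled_test_set if label == "injurious"]
    missed = sum(1 for (p, c) in injurious if classifier.score(p, c) < threshold)
    return missed / max(len(injurious), 1)

def mean_time_to_failure(attack_sessions):
    """Average minutes contractors needed to find a successful adversarial example."""
    times = [s.minutes_to_failure for s in attack_sessions if s.found_failure]
    return sum(times) / max(len(times), 1)
```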

Implications and Future Directions

The findings have substantial implications for both the theory and practice of AI safety. The paper argues for the importance of adversarial training in high-stakes AI applications and for its integration into broader AI safety strategies. Additionally, very conservative classifier thresholds were shown to be usable without notably compromising the quality of the filtered outputs, suggesting that threshold tuning could be an integral part of building high-stakes AI systems.
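As a rough illustration of such threshold tuning, the sketch below sweeps rejection cutoffs and keeps the most conservative one whose filtered outputs remain close to baseline quality; the `quality` scorer and the candidate thresholds are assumptions, not values from the paper.

```python
# Minimal sketch of conservative threshold selection (assumed helpers and values).

def choose_threshold(scored_completions, quality, max_quality_drop=0.01):
    """scored_completions: list of (completion, injury_score) pairs."""
    baseline = sum(quality(c) for c, _ in scored_completions) / len(scored_completions)
    for threshold in (1e-4, 1e-3, 1e-2, 1e-1):   # most to least conservative
        kept = [c for c, s in scored_completions if s < threshold]
        if not kept:
            continue
        filtered_quality = sum(quality(c) for c in kept) / len(kept)
        if baseline - filtered_quality <= max_quality_drop:
            return threshold
    return 0.5  # fall back to a permissive cutoff if no conservative one works
```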

Future work is anticipated to broaden the range of adversarial attacks considered, improve the tools available to human adversaries, and scale to larger models to further strengthen reliability. As models grow increasingly powerful, methodologies for measuring adversarial robustness and worst-case behavior will need refinement to ensure safe deployment. This exploratory work sets the stage for subsequent innovations in adversarial training methodologies and reliability assessment for AI systems.
