Scope of Risks Effectively Detectable via AI Red-Teaming

Determine which categories of undesirable behaviors, model limitations, and misuse risks in generative AI systems can or should be effectively detected and mitigated through AI red-teaming exercises.

Background

The paper notes that policy definitions of AI red-teaming, including the one in the U.S. presidential executive order, are vague about what kinds of harms or vulnerabilities such exercises can realistically surface and mitigate. The authors highlight that this basic question of scope remains unresolved and must be settled to ground future standards.

Clarifying the classes of risks that AI red-teaming can reliably expose is essential for scoping evaluations, allocating resources, and setting expectations for regulators and stakeholders.

References

For example, the definition offered by the presidential executive order leaves the following key questions unanswered: What types of undesirable behaviors, limitations, and risks can or should be effectively caught and mitigated through red-teaming exercises?

Source: Feffer et al., "Red-Teaming for Generative AI: Silver Bullet or Security Theater?" (arXiv:2401.15897, 29 Jan 2024), Section 1, Introduction.
