Guidelines for public release of red teaming findings and exploits

Determine whether and how researchers conducting evaluation and red teaming of deployed generative AI systems should publicly release their findings, methods, and discovered exploits, specifying protocols, timing, and scope of disclosure that avoid both overly broad sharing, which could expose harmful details, and overly limited sharing, which impedes reproducibility and community learning.

Background

In discussing the chilling effects that current terms of service and unclear norms have on independent evaluation, the authors note that researchers face ambiguity about responsible disclosure practices for sensitive safety findings. Without explicit guidance, researchers risk either oversharing potentially harmful details or under-sharing information in ways that impede reproducibility and community learning.

The paper proposes legal and technical safe harbors and standardized vulnerability disclosure policies precisely to resolve such ambiguities, but it acknowledges that the question of whether and how to release methods and exploits remains open.
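
As a purely illustrative sketch, not a proposal from the paper, the kind of standardized disclosure policy at issue could be expressed as a tiered coordinated-disclosure schedule that maps a finding's severity and the vendor's response to a release scope and an embargo window. All severity tiers, scope labels, field names, and day counts below are hypothetical placeholders; the 90-day figure merely echoes conventional coordinated vulnerability disclosure practice rather than anything prescribed by the authors.

    # Illustrative sketch only: a tiered coordinated-disclosure policy expressed as data,
    # with a helper that maps a finding's severity and the vendor's response status to a
    # release scope and embargo length. All tiers and day counts are hypothetical.
    from dataclasses import dataclass
    from datetime import date, timedelta
    from enum import Enum


    class Severity(Enum):
        LOW = "low"            # e.g., minor policy bypass with limited harm
        MODERATE = "moderate"  # e.g., reproducible jailbreak of a safety filter
        CRITICAL = "critical"  # e.g., exploit enabling large-scale misuse


    class ReleaseScope(Enum):
        FULL_METHODS_AND_EXPLOIT = "full methods and exploit artifacts"
        METHODS_ONLY = "methods and high-level findings, no working exploit"
        SUMMARY_ONLY = "summary shared with vendor and coordinators only"


    @dataclass
    class DisclosurePlan:
        scope: ReleaseScope
        embargo_days: int
        public_release_date: date


    # Hypothetical defaults loosely modeled on conventional coordinated
    # vulnerability disclosure windows (e.g., 90-day embargoes).
    EMBARGO_DAYS = {
        Severity.LOW: 30,
        Severity.MODERATE: 90,
        Severity.CRITICAL: 180,
    }


    def plan_disclosure(severity: Severity, vendor_acknowledged: bool,
                        reported_on: date) -> DisclosurePlan:
        """Return a sketch of release scope and timing for a red teaming finding."""
        embargo = EMBARGO_DAYS[severity]
        if severity is Severity.CRITICAL:
            # Withhold working exploits; share reproduction details with the vendor
            # and, if unacknowledged, with a neutral coordinator instead of the public.
            scope = (ReleaseScope.METHODS_ONLY if vendor_acknowledged
                     else ReleaseScope.SUMMARY_ONLY)
        elif severity is Severity.MODERATE:
            scope = ReleaseScope.METHODS_ONLY
        else:
            scope = ReleaseScope.FULL_METHODS_AND_EXPLOIT
        return DisclosurePlan(scope, embargo, reported_on + timedelta(days=embargo))


    if __name__ == "__main__":
        plan = plan_disclosure(Severity.MODERATE, vendor_acknowledged=True,
                               reported_on=date(2024, 3, 7))
        print(plan.scope.value, plan.public_release_date)

The sketch only serves to make the open questions concrete: which tiers should exist, who sets embargo lengths, and when, if ever, working exploits should be published are exactly the unresolved choices the paper highlights.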

References

It is unclear whether and how researchers should publicly release their findings, methods or the exploits themselves.

A Safe Harbor for AI Evaluation and Red Teaming (Longpre et al., arXiv:2403.04893, 7 Mar 2024), Section 3 (Challenges to Independent AI Evaluation), Table “Themes and observations,” row “Chilling Effect on Vulnerability Disclosure.”