Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
The paper "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?" by Richard Ren et al., with authors from the Center for AI Safety and several universities, critically evaluates whether existing AI safety benchmarks genuinely measure safety progress or merely reflect general AI capabilities. The question is increasingly pertinent as AI systems are deployed in more complex, high-stakes environments that demand robust safety measures.
Summary of Findings
The authors undertake a comprehensive meta-analysis of widely-used AI safety benchmarks, determining their correlation with general model capabilities. A primary contribution of this paper is the introduction of the term "safetywashing," analogous to "greenwashing," where AI developments are presented as safety improvements despite not making the systems intrinsically safer.
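The core of this analysis is straightforward to sketch. The snippet below is a minimal illustration of the idea rather than the authors' implementation: given a matrix of model scores on general-capability benchmarks, the first principal component serves as a capabilities score, and a safety benchmark's "capabilities correlation" is its correlation with that score. The function names, the toy data, and the use of Pearson correlation are assumptions of this sketch.

```python
# Illustrative sketch of a "capabilities correlation" analysis.
# Names, data, and the correlation statistic are assumptions, not the paper's code.
import numpy as np

def capabilities_scores(score_matrix: np.ndarray) -> np.ndarray:
    """Per-model capabilities score: the first principal component of a
    (num_models x num_benchmarks) matrix of capability-benchmark scores."""
    # Standardize each benchmark column so no single benchmark's scale dominates.
    standardized = (score_matrix - score_matrix.mean(axis=0)) / score_matrix.std(axis=0)
    # PCA via SVD: the first right-singular vector holds each benchmark's loading
    # on the dominant component; projecting models onto it yields their scores.
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    first_component = vt[0]
    # Resolve PCA's sign ambiguity so larger scores correspond to stronger
    # benchmark performance (assumes every benchmark is scored higher-is-better).
    if first_component.sum() < 0:
        first_component = -first_component
    return standardized @ first_component

def capabilities_correlation(capability_scores: np.ndarray, safety_scores: np.ndarray) -> float:
    """Correlation between a safety benchmark's scores and the capabilities score."""
    return float(np.corrcoef(capability_scores, safety_scores)[0, 1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    capability_matrix = rng.uniform(0.2, 0.9, size=(20, 8))  # 20 models x 8 capability benchmarks
    safety_benchmark = rng.uniform(0.0, 1.0, size=20)         # the same 20 models on one safety benchmark
    caps = capabilities_scores(capability_matrix)
    print(f"capabilities correlation: {capabilities_correlation(caps, safety_benchmark):+.3f}")
```

In this framing, a benchmark whose scores closely track the first principal component is mostly remeasuring capabilities; a low or negative correlation suggests it captures a property that does not come for free with scale.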
- Capabilities Score: The paper operationalizes a "capabilities score" by extracting the first principal component from a matrix of model performances across capability benchmarks. This component captures the majority of the variance in model performance and is interpreted as a measure of general capabilities.
- Empirical Results: The paper evaluates numerous benchmarks across different safety areas and finds that many are highly correlated with general capabilities. For example, benchmarks such as MT-Bench and the LMSYS Chatbot Arena show high correlations with the capabilities score (roughly 62.1% to 78.7%), indicating that they do not measure safety properties distinct from capabilities.
- Machine Ethics and Bias: Benchmarks within these areas show varied results. Whereas the ETHICS benchmark correlates highly (82.2%) with capabilities, implying it primarily measures capabilities rather than ethical behavior, the MACHIAVELLI benchmark shows a negative correlation (-49.9%), suggesting that ethical behavior does not necessarily improve with general capabilities. Bias benchmarks such as BBQ and CrowS-Pairs likewise show mixed results, with some exhibiting low correlations and thus measuring properties distinct from capabilities.
- Truthfulness and Hallucinations: The analysis finds that benchmarks like TruthfulQA correlate highly with capabilities (81.2% for the MC1 task), reducing their validity as distinct safety measures. However, other misconception and hallucination evaluations, such as sycophancy tests and HaluEval, show varying levels of correlation, calling for a more nuanced approach to evaluating AI truthfulness.
- Adversarial Robustness: Traditional adversarial benchmarks like ANLI and ImageNet-A correlate highly with capabilities (81.5% and 97.9%, respectively). In contrast, newer evaluations such as resistance to human-written jailbreaks and gradient-based attacks show low or negative correlations, indicating that they may capture aspects of robustness that do not automatically improve as general capabilities scale.
- Calibration and Security: Calibration metrics show mixed results. The Brier Score is highly correlated with capabilities (up to 98.5% for vision models), whereas RMS Calibration Error shows a low correlation (around 20.1% for LLMs); because the Brier Score rewards accuracy as well as calibration, it tracks capabilities, while RMS Calibration Error isolates miscalibration and is therefore the better measure of a distinct safety property (a short sketch of both metrics follows this list). Security benchmarks focused on weaponization capabilities display negative correlations, underscoring the danger that more capable models can be exploited for harmful purposes.
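To make the contrast between these two calibration metrics concrete, here is a short sketch using their standard textbook definitions; the binning scheme, function names, and simulated data are assumptions of this sketch, not the paper's exact evaluation setup.

```python
# Illustrative implementations of the Brier score and RMS calibration error.
import numpy as np

def brier_score(confidences: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return float(np.mean((confidences - outcomes) ** 2))

def rms_calibration_error(confidences: np.ndarray, outcomes: np.ndarray, num_bins: int = 10) -> float:
    """Root-mean-square gap between average confidence and empirical accuracy,
    computed over equal-width confidence bins and weighted by bin size."""
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    total = len(confidences)
    squared_gap = 0.0
    for i in range(num_bins):
        lo, hi = bins[i], bins[i + 1]
        if i == num_bins - 1:
            in_bin = (confidences >= lo) & (confidences <= hi)  # include 1.0 in the last bin
        else:
            in_bin = (confidences >= lo) & (confidences < hi)
        if not in_bin.any():
            continue
        gap = confidences[in_bin].mean() - outcomes[in_bin].mean()
        squared_gap += (in_bin.sum() / total) * gap ** 2
    return float(np.sqrt(squared_gap))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    conf = rng.uniform(0.5, 1.0, size=1000)                  # simulated model confidences
    correct = (rng.uniform(size=1000) < conf).astype(float)  # simulated (well-calibrated) outcomes
    print(f"Brier score:           {brier_score(conf, correct):.3f}")
    print(f"RMS calibration error: {rms_calibration_error(conf, correct):.3f}")
```

The number of bins and the binning scheme vary across implementations; ten equal-width bins is used here purely as a common default.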
Implications and Future Directions
The research indicates that many purported AI safety benchmarks do not distinctly measure safety and are therefore susceptible to safetywashing. The findings emphasize the need to create and adopt benchmarks that are empirically separable from general capabilities.
- Benchmark Design: Model developers and researchers should prioritize designing benchmarks that measure safety attributes independently of general capabilities. Reporting the correlation with capabilities should become a standard practice to ensure transparency.
- Resource Allocation: Research funding and resources should be strategically allocated to areas where safety improvements are not merely a byproduct of increased capabilities but require distinct technical advances.
- Standardization and Norms: It is critical to establish norms under which benchmarks that are highly correlated with capabilities are not used to claim safety progress. Such norms would mitigate safetywashing and drive genuine advances in AI safety.
Conclusion
This insightful analysis by Ren et al. brings to light a significant issue in current AI research and development practices—safetywashing. By empirically demonstrating the extent to which safety benchmarks correlate with general model capabilities, this work lays the groundwork for more rigorous and meaningful safety evaluations. As AI systems evolve, the need for robust, independent safety metrics will only grow more pressing, demanding concerted efforts from the research community to ensure that AI safety advancements are both genuine and impactful.