Data-Centric AI Governance: Addressing the Limitations of Model-Focused Policies (2409.17216v1)

Published 25 Sep 2024 in cs.CY and cs.AI

Abstract: Current regulations on powerful AI capabilities are narrowly focused on "foundation" or "frontier" models. However, these terms are vague and inconsistently defined, leading to an unstable foundation for governance efforts. Critically, policy debates often fail to consider the data used with these models, despite the clear link between data and model performance. Even (relatively) "small" models that fall outside the typical definitions of foundation and frontier models can achieve equivalent outcomes when exposed to sufficiently specific datasets. In this work, we illustrate the importance of considering dataset size and content as essential factors in assessing the risks posed by models both today and in the future. More broadly, we emphasize the risk posed by over-regulating reactively and provide a path towards careful, quantitative evaluation of capabilities that can lead to a simplified regulatory environment.

Summary

The paper argues that overemphasis on models obscures the essential impact of data on AI risks and performance.
By analyzing regulatory inconsistencies and benchmark examples, the study reveals how efficient smaller models challenge traditional FLOP-based governance.
The findings imply that integrating data-centric criteria into policy can create more transparent and robust AI governance frameworks.

The paper "Data-Centric AI Governance: Addressing the Limitations of Model-Focused Policies," authored by Ritwik Gupta et al., presents a critical analysis of the current AI governance landscape, addressing the limitations of model-focused regulatory approaches. The authors argue that existing governance frameworks, which primarily focus on large, computationally intensive models, are insufficient for effectively managing the risks posed by AI technologies. Instead, they propose a shift towards a data-centric approach to AI governance, emphasizing the crucial role of data in determining model capabilities and risks.

Shortcomings of Model-Focused AI Governance

The authors identify three primary shortcomings in the existing model-focused governance frameworks:

Inconsistent Definitions:
- There is no consistent, universally accepted definition for terms like "frontier," "foundation," "dual-use," and "general purpose" models. This inconsistency creates regulatory confusion and loopholes.
Efficiency of Smaller Models:
- Advances in ML have produced more efficient models that need fewer parameters and FLOPs to match the capabilities of larger models, so they can fall below existing regulatory thresholds.
Overlooking Data:
- Regulatory efforts focus on model size and computation while neglecting the significant role of data in determining model performance and capability. Smaller models exposed to specialized datasets can perform as well as or better than larger models, which indicates that governance frameworks need to incorporate data considerations.

Definitional Challenges and the Role of Data

In analyzing various influential AI governance documents, the paper highlights inconsistent definitions and arbitrary thresholds, particularly those based on model size and FLOPs. For example, current thresholds based on training compute (e.g., $10^{26}$ FLOPs) fail to capture the capabilities of models that achieve high performance with significantly fewer computational resources. The authors illustrate this with examples showing that smaller, task-focused models can outperform larger models on specific benchmarks, such as image segmentation on the RefCOCO dataset.

The paper further emphasizes that AI capabilities are not strictly correlated with model size or computational expense. Efficient training methods and optimizations can decouple model performance from computational cost, which undermines the efficacy of FLOP-based regulatory thresholds.
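To make the threshold argument concrete, the following is a rough sketch of how a FLOP-based rule classifies models. It assumes the widely used rule of thumb that training compute for a dense model is about 6 × parameters × training tokens; that approximation, the function names, and the two model configurations are illustrative assumptions and do not come from the paper. Only the $10^{26}$ threshold is taken from the text above.

```python
# Rough sketch: compare estimated training compute against a FLOP threshold.
# Assumption (not from the paper): training FLOPs ~= 6 * parameters * tokens,
# a common rule of thumb for dense transformer training.

FLOP_THRESHOLD = 1e26  # the threshold cited in the text


def estimate_training_flops(num_parameters: float, num_tokens: float) -> float:
    """Approximate total training FLOPs for a dense model."""
    return 6.0 * num_parameters * num_tokens


def exceeds_threshold(num_parameters: float, num_tokens: float,
                      threshold: float = FLOP_THRESHOLD) -> bool:
    """Return True if the estimated training compute meets or exceeds the threshold."""
    return estimate_training_flops(num_parameters, num_tokens) >= threshold


# Hypothetical configurations, chosen only for illustration.
HYPOTHETICAL_MODELS = {
    "small-specialized": (7e9, 2e12),   # 7B parameters, 2T tokens
    "large-general": (1e12, 2e13),      # 1T parameters, 20T tokens
}

if __name__ == "__main__":
    for name, (params, tokens) in HYPOTHETICAL_MODELS.items():
        flops = estimate_training_flops(params, tokens)
        covered = exceeds_threshold(params, tokens)
        print(f"{name}: ~{flops:.1e} FLOPs, covered by threshold: {covered}")
```

Under these assumptions the smaller configuration sits more than three orders of magnitude below the threshold. A compute-only rule therefore says nothing about it, regardless of what data it was trained on or how it performs on a specific task, which is the gap the authors point to.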

Data-Centric Approach to AI Governance

The authors argue for the inclusion of data considerations in AI governance:

Role of Data in Model Performance:
- The quality and specificity of datasets play a critical role in model capabilities. General models trained on extensive datasets like ImageNet or Common Crawl can achieve significant capabilities, while targeted fine-tuning on curated data can enhance performance further.
Retrieval and Derivation:
- The paper distinguishes between two essential features of modern ML: retrieval (the model's capacity to recall specific data points) and derivation (the model's ability to synthesize new information from existing data). Each feature raises distinct challenges and risks for AI governance.
Implications for Policy and Regulation:
- Existing data-centric legal frameworks addressing personal data, child sexual abuse material, and classified information can be expanded to AI governance. This would simplify the regulatory landscape by leveraging established data governance policies.

Future Directions

The authors advocate for the development of a rigorous evaluation framework to assess AI capabilities, incorporating both model size and data quality. Such frameworks would provide a comprehensive understanding of the potential risks and benefits of AI technologies. Additionally, the paper calls for standardizing dataset documentation and provenance tracking to ensure transparency and accountability in data usage.
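As one possible reading of what standardized dataset documentation and provenance tracking could look like in practice, here is a minimal sketch of a provenance record tied to a content hash. The record fields (`name`, `source`, `license_id`, `sensitive_categories`) and all function names are hypothetical choices for illustration; the paper does not prescribe a schema.

```python
# Minimal sketch of a dataset provenance record (hypothetical schema).
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    source: str                 # where the data was obtained
    license_id: str             # e.g. an SPDX-style identifier
    content_sha256: str         # fingerprint of the exact bytes documented
    sensitive_categories: tuple = ()  # e.g. ("personal_data",)


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the dataset contents."""
    return hashlib.sha256(data).hexdigest()


def make_record(name: str, source: str, license_id: str, data: bytes,
                sensitive_categories: tuple = ()) -> DatasetRecord:
    """Build a record whose hash is computed from the data itself."""
    return DatasetRecord(name, source, license_id, fingerprint(data),
                         tuple(sensitive_categories))


def verify(record: DatasetRecord, data: bytes) -> bool:
    """Check that the data still matches the documented fingerprint."""
    return record.content_sha256 == fingerprint(data)


def to_json(record: DatasetRecord) -> str:
    """Serialize the record deterministically for audit logs."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    payload = b"example dataset contents"
    rec = make_record("demo-set", "https://example.org/demo", "CC-BY-4.0",
                      payload, ("personal_data",))
    print(to_json(rec))
    print("unchanged:", verify(rec, payload))
    print("tampered:", verify(rec, payload + b"!"))
```

Binding the documentation to a hash means an auditor can check that the dataset used for training is the one that was documented. The `sensitive_categories` field is where existing data-centric legal categories could be recorded.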

Conclusion

The paper concludes by stressing the need for a paradigm shift in AI governance from model-centric to data-centric approaches. This shift acknowledges the intertwined role of data in model capabilities and risks, paving the way for more robust and effective regulatory frameworks. The authors suggest that such a pivot is essential for aligning AI governance with the rapid advancements and evolving nature of AI technologies. Future research within the Frontier Data Initiative will continue to explore and develop data-centric AI governance solutions.
