The Singapore Consensus on Global AI Safety Research Priorities (2506.20702v2)

Published 25 Jun 2025 in cs.AI and cs.CY

Abstract: Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. The "2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety" aimed to support research in this space by bringing together AI scientists across geographies to identify and synthesise research priorities in AI safety. This resulting report builds on the International AI Safety Report chaired by Yoshua Bengio and backed by 33 governments. By adopting a defence-in-depth model, this report organises AI safety research domains into three types: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control).

Summary

  • The paper establishes a multi-stakeholder consensus that identifies key research domains: risk assessment, trustworthy system development, and post-deployment control.
  • It outlines the need for standardized benchmarks, secure audit infrastructure, and formal verification to tackle current gaps in AI risk evaluation.
  • The findings call for global coordination and interdisciplinary research to integrate technical, organizational, and societal measures for safe AI deployment.

The Singapore Consensus on Global AI Safety Research Priorities

The Singapore Consensus presents a comprehensive synthesis of technical research priorities for AI safety, with a focus on general-purpose AI (GPAI) systems. Developed through a multi-stakeholder process involving leading researchers, industry, and government representatives, the document aims to facilitate global coordination and accelerate impactful research to ensure AI systems are trustworthy, reliable, and secure. The Consensus is structured around a defense-in-depth model, organizing research into three interdependent domains: Risk Assessment, Development of Trustworthy Systems, and Post-Deployment Control.

1. Risk Assessment

The Consensus identifies risk assessment as foundational for safe AI development and deployment. It emphasizes the need for rigorous, standardized methods to measure and forecast the impact of AI systems, both current and prospective. Key research priorities include:

  • Audit Techniques and Benchmarks: The development of robust, standardized benchmarks and audit protocols is highlighted as central, yet current benchmarks often fail to capture real-world complexities and are susceptible to gaming. The report calls for dynamic, automated evaluations, technical "red lines," and secure, maintainable evaluation resources (a minimal sketch of such an automated evaluation loop follows this list).
  • Downstream Impact Assessment and Forecasting: The Consensus stresses the importance of field tests, uplift studies, and structured analytical techniques (e.g., scenario analysis, probabilistic risk assessment) to anticipate societal impacts such as labor market shifts, misinformation, and privacy risks; a toy probabilistic risk calculation also appears after this list. The "evidence dilemma" (balancing early mitigation against the risk of overreaction) is noted as a persistent challenge.
  • Secure Evaluation Infrastructure: There is a strong call for double-blind, secure infrastructure enabling third-party audits without compromising proprietary information. The engineering and policy challenges of such infrastructure are recognized as open problems.
  • Metrology for AI Risk Assessment: The report identifies a lack of standardization, repeatability, and precision in current risk measurement approaches. It advocates for the development of quantitative, AI-specific risk assessment methodologies to reduce uncertainty and enable more precise risk management.
  • Dangerous Capability and Propensity Assessment: The Consensus underscores the nascent state of methods for eliciting and assessing dangerous capabilities (e.g., dual-use knowledge, autonomy) and their propensities for harm. It notes that current tests are insufficient to rule out harmful behaviors, and calls for research into more reliable elicitation and inference techniques.
  • Loss-of-Control Risk Assessment: The document highlights the lack of expert consensus on the likelihood of loss-of-control scenarios with advanced AI, noting both the diversity of expert opinion and the need for improved methodologies to assess and quantify such risks.
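
To make the idea of dynamic, automated evaluations concrete, here is a minimal, illustrative sketch (not drawn from the report): test items are regenerated from a seed on every run, so a model cannot simply memorise a fixed benchmark. The model_fn callable and the arithmetic probe are placeholders for a real model interface and a real capability or safety test.

```python
import random

def generate_item(rng):
    # Freshly generate a test item each run so the benchmark cannot be
    # memorised or gamed; a trivial arithmetic probe stands in for a real
    # capability or safety test.
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return f"What is {a} + {b}?", str(a + b)

def run_dynamic_eval(model_fn, n_items=200, seed=None):
    # model_fn: any callable mapping a prompt string to a response string.
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        prompt, expected = generate_item(rng)
        if expected in model_fn(prompt):
            correct += 1
    return correct / n_items

if __name__ == "__main__":
    # Placeholder "model" that parses the numbers and answers correctly.
    def toy_model(prompt):
        nums = [int(t.strip("?")) for t in prompt.split() if t.strip("?").isdigit()]
        return str(sum(nums))

    print(f"accuracy: {run_dynamic_eval(toy_model, seed=0):.2f}")
```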
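
Similarly, probabilistic risk assessment can be illustrated with a toy Monte Carlo calculation. The scenarios, probabilities, and severity distributions below are invented placeholders, not figures from the Consensus; the point is only how uncertain scenario likelihoods and harms combine into expected and tail risk.

```python
import random

# Illustrative scenarios: (probability the scenario occurs in a given year,
# distribution of harm severity if it does). All numbers are placeholders.
SCENARIOS = {
    "large-scale misinformation": (0.30, lambda rng: rng.betavariate(2, 5)),
    "critical-infrastructure misuse": (0.05, lambda rng: rng.betavariate(5, 2)),
    "labour-market shock": (0.20, lambda rng: rng.betavariate(3, 3)),
}

def simulate_annual_risk(n_samples=100_000, seed=0):
    """Monte Carlo estimate of aggregate harm (arbitrary 0-1 units per scenario)."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_samples):
        total = 0.0
        for p_occurs, severity in SCENARIOS.values():
            if rng.random() < p_occurs:
                total += severity(rng)
        totals.append(total)
    expected = sum(totals) / n_samples
    tail = sorted(totals)[int(0.95 * n_samples)]  # 95th-percentile outcome
    return expected, tail

if __name__ == "__main__":
    mean_harm, p95_harm = simulate_annual_risk()
    print(f"expected harm: {mean_harm:.3f}, 95th percentile: {p95_harm:.3f}")
```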

2. Developing Trustworthy, Secure, and Reliable Systems

The Consensus adopts a classic safety engineering framework, emphasizing specification, design, and verification:

  • Specification & Validation: The challenge of faithfully translating human intent into system specifications is foregrounded, with attention to reward hacking, specification loopholes, and the complexity of aligning with both single and multiple stakeholders. The report calls for scalable methods to discover specification flaws and for principled approaches to balancing competing preferences and legal/ethical alignment.
  • Design and Implementation: Research priorities include:
    • Data Curation and Pretraining: Addressing the challenges of harmful content in large-scale datasets and understanding its impact on model behavior.
    • Robustness: Developing adversarial training and tamper-resistance techniques, especially for open-weight models, where current defenses are notably limited (see the adversarial-training sketch after this list).
    • Truthfulness and Honesty: Reducing confabulation and dishonesty through improved training, interpretability, and claim substantiation.
    • Targeted Model Editing: Advancing methods for precise, efficient post-hoc model modifications.
    • Avoiding Hazardous Capabilities: Techniques for limiting autonomy, generality, or intelligence to reduce risk, including minimally-agentic, de-generalized, and intelligence-bounded systems.
    • Guaranteed Safety by Design: Pursuing verifiable program synthesis, formal world models, and compositional verification to provide strong safety guarantees.
  • Verification: The Consensus highlights the need for robust testing (including adversarial and multi-agent contexts), quantitative and formal verification, interpretability (both mechanistic interpretability and explainability methods), and the use of "model organisms" to test safety interventions in controlled settings.
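
As one concrete instance of the robustness work described above, the sketch below shows a single adversarial-training step using the fast gradient sign method (FGSM). It assumes PyTorch and a toy classifier; FGSM is a standard illustrative recipe, not the specific defence the report prescribes, and effective tamper-resistance for open-weight models remains an open problem.

```python
import torch
import torch.nn as nn

def fgsm_adversarial_step(model, x, y, optimizer, epsilon=0.03):
    """One adversarial-training step: perturb inputs with FGSM, then train
    on the perturbed batch. A minimal sketch, not a hardened defence."""
    loss_fn = nn.CrossEntropyLoss()

    # 1. Build adversarial examples with the fast gradient sign method.
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()

    # 2. Update the model on the perturbed inputs only (a real recipe would
    #    usually mix clean and adversarial batches).
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a toy classifier and random data:
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 20), torch.randint(0, 3, (32,))
print(fgsm_adversarial_step(model, x, y, opt))
```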

3. Control: Monitoring and Intervention

Post-deployment control is treated as a critical, ongoing process, with research priorities spanning:

  • System and Ecosystem Monitoring: The development of hardware-enabled monitoring, user and system state monitoring, modular system design for easier oversight, and comprehensive logging infrastructure. The report also emphasizes the importance of data and model provenance, agent authentication, and compute/hardware tracking for ecosystem-level risk management (a provenance-logging sketch follows this list).
  • Intervention Mechanisms: The Consensus discusses the technical and organizational challenges of implementing off-switches, override protocols, and incident response mechanisms, especially for highly autonomous or distributed systems.
  • AGI and ASI Control Problem: The document identifies scalable oversight, corrigibility, agent foundations, and containment as key research frontiers for controlling highly capable, potentially adversarial AI systems. It notes the theoretical and practical difficulties of ensuring corrigibility and safe self-modification.
  • Societal Resilience: Recognizing that AI-related disruptions may be diffuse and systemic, the Consensus calls for research into strengthening societal infrastructure, developing incident response protocols, and fostering institutional adaptation to AI-driven change.
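
To illustrate the kind of provenance and logging infrastructure discussed above, the following is a minimal sketch of an append-only, signed provenance log keyed by artifact hashes. The file names, signing key, and record schema are all hypothetical; a production system would need managed keys, tamper-evident storage, and agreed reporting formats.

```python
import hashlib
import hmac
import json
import time
from pathlib import Path

LOG_PATH = Path("provenance_log.jsonl")         # append-only log (illustrative path)
SIGNING_KEY = b"replace-with-a-managed-secret"  # placeholder key, not for real use

def file_digest(path):
    """SHA-256 digest of a weights or dataset file, used as its provenance ID."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_artifact(path, kind, parents=()):
    """Append a signed provenance record linking an artifact to its inputs."""
    entry = {
        "timestamp": time.time(),
        "kind": kind,                      # e.g. "dataset", "checkpoint"
        "artifact_sha256": file_digest(path),
        "parent_sha256s": list(parents),   # digests of the artifacts it derives from
    }
    body = json.dumps(entry, sort_keys=True).encode()
    entry["signature"] = hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry["artifact_sha256"]
```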

Key Claims

While the Consensus is primarily an overview and prioritization document rather than an empirical paper, it makes several strong claims:

  • Current risk assessment and robustness techniques are insufficient to reliably detect or prevent harmful capabilities and behaviors in frontier AI systems.
  • Existing formal verification and assurance methods are not yet able to provide strong guarantees for the behaviors of large, general-purpose AI systems.
  • Open-weight models remain highly vulnerable to tampering and distillation attacks, with current defenses being easily circumvented.
  • There is no expert consensus on the likelihood of catastrophic loss-of-control scenarios, but the potential severity warrants significant research investment.

Implications and Future Directions

The Singapore Consensus has several important implications for both research and practice:

  • Global Coordination: By identifying areas of mutual interest (e.g., risk thresholds, audit protocols, incident reporting), the Consensus provides a foundation for international cooperation, even among competitive actors.
  • Standardization and Infrastructure: The call for standardized benchmarks, secure evaluation infrastructure, and robust logging/reporting systems points to a future where technical and organizational standards are as critical as algorithmic advances.
  • Socio-Technical Integration: The document recognizes that technical solutions alone are insufficient; effective risk management requires integration with governance, legal, and societal frameworks.
  • Research Gaps: The Consensus highlights persistent gaps in specification, robustness, interpretability, and control, especially for highly agentic and general systems. It calls for interdisciplinary, empirical, and theoretical work to address these challenges.
  • Practical Deployment: The emphasis on field tests, real-world monitoring, and incident response reflects a shift toward operationalizing safety research in deployed systems, not just in laboratory settings.

Speculation on Future Developments

The research agenda outlined in the Singapore Consensus is likely to shape the next decade of AI safety work. Anticipated developments include:

  • Widespread adoption of secure, standardized audit and evaluation protocols as prerequisites for deployment of advanced AI systems.
  • Advances in formal verification and mechanistic interpretability that enable stronger assurances for system behavior, especially in high-stakes domains.
  • Emergence of international incident reporting and risk management frameworks, analogous to those in aviation and cybersecurity.
  • Increased focus on societal resilience and adaptation, as AI systems become more deeply integrated into critical infrastructure and economic systems.
  • Ongoing tension between openness and security, particularly regarding open-weight models and the sharing of safety-relevant information.

The Singapore Consensus provides a rigorous, actionable roadmap for technical AI safety research, emphasizing the need for defense-in-depth, global coordination, and the integration of technical, organizational, and societal approaches to risk management. Its influence is likely to be significant in shaping both research priorities and practical standards for the safe development and deployment of advanced AI systems.
