STAR-1: Safer Alignment of Reasoning LLMs with 1K Data (2504.01903v1)

Published 2 Apr 2025 in cs.CL and cs.AI

Abstract: This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is https://ucsc-vlaa.github.io/STAR-1.

Summary

Overview of STAR-1 Safety Dataset for Large Reasoning Models

The paper "STAR-1: Safer Alignment of Reasoning LLMs with 1K Data" presents a dataset named STAR-1 that aims to address safety alignment challenges in large reasoning models (LRMs), such as DeepSeek-R1. These models are distinguished from traditional LLMs by their enhanced chain-of-thought reasoning capabilities. However, this strength introduces unique vulnerabilities, notably their susceptibility to generating unsafe outputs under specific prompts.

Design Principles and Dataset Construction

STAR-1 is built around three core principles: diversity, deliberative reasoning, and rigorous filtering. The dataset integrates samples from existing open-source safety datasets and categorizes them under eight safety concerns, including harassment, sexual content, violence, and misinformation, providing broad coverage of content-safety dimensions. For data generation, deliberative reasoning is employed: models are required to reference explicit safety policies during their reasoning process, which helps ensure that responses are policy-compliant (a sketch of this step is given below).
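
The following is a minimal, illustrative sketch of what such a policy-grounded generation step could look like. The category list, the prompt template, and the helpers build_deliberation_prompt, call_llm, and generate_reasoning_sample are assumptions made for this sketch, not the authors' released code; call_llm is a placeholder for whatever LRM client is actually used.

```python
from dataclasses import dataclass

# Illustrative category names; the paper groups prompts under eight safety
# concerns (e.g., harassment, sexual content, violence, misinformation).
SAFETY_CATEGORIES = [
    "harassment",
    "sexual_content",
    "violence",
    "misinformation",
    # ...remaining categories drawn from the source datasets
]

@dataclass
class SafetySample:
    prompt: str    # potentially unsafe user request taken from a source dataset
    category: str  # one of SAFETY_CATEGORIES
    policy: str    # curated safety-policy text for that category

def build_deliberation_prompt(sample: SafetySample) -> str:
    """Assemble a prompt that asks the model to reason over the relevant
    safety policy before answering (deliberative reasoning)."""
    return (
        "Before answering, think step by step about whether the request "
        "complies with the safety policy below, cite the relevant clauses, "
        "and then give a final, policy-compliant answer.\n\n"
        f"[Safety policy: {sample.category}]\n{sample.policy}\n\n"
        f"[User request]\n{sample.prompt}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a large reasoning model (e.g., DeepSeek-R1).
    Swap in your own client here."""
    raise NotImplementedError

def generate_reasoning_sample(sample: SafetySample) -> dict:
    """Produce one policy-grounded (prompt, reasoning + answer) training pair."""
    completion = call_llm(build_deliberation_prompt(sample))
    return {
        "prompt": sample.prompt,
        "category": sample.category,
        "response": completion,  # contains the policy-referencing reasoning trace
    }
```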

To maintain quality, STAR-1 employs a GPT-4o-driven scoring system that rates each generated sample on safety compliance, policy relevancy, and reasoning accuracy. Only the highest-scoring samples are retained, yielding a curated set of 1,000 entries balanced across categories and source datasets. This curation is pivotal to achieving effective alignment without degrading reasoning ability, a trade-off commonly observed in safety fine-tuning.
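
As a rough illustration, the scoring-and-selection stage might be organized as below. The 1-10 scale, the acceptance threshold, and the per-category balancing heuristic are assumptions for this sketch; score_sample stands in for the GPT-4o judge described in the paper.

```python
from collections import defaultdict

def score_sample(record: dict) -> dict:
    """Placeholder for the GPT-4o judge: the paper rates each candidate on
    safety compliance, policy relevancy, and reasoning accuracy. Assume it
    returns {"safety": int, "policy_relevance": int, "reasoning": int}."""
    raise NotImplementedError

def select_star1(records: list[dict], target_size: int = 1000) -> list[dict]:
    """Keep only candidates that score highly on all three criteria, then
    draw roughly evenly across categories up to the target size. The 1-10
    scale, the threshold of 9, and the balancing heuristic are assumptions."""
    kept = []
    for rec in records:
        scores = score_sample(rec)
        if min(scores["safety"], scores["policy_relevance"], scores["reasoning"]) >= 9:
            kept.append(rec)

    # Group the surviving candidates by safety category so the final 1k set
    # stays balanced across the eight concerns.
    by_category = defaultdict(list)
    for rec in kept:
        by_category[rec["category"]].append(rec)

    per_category = max(target_size // max(len(by_category), 1), 1)
    selected = []
    for recs in by_category.values():
        selected.extend(recs[:per_category])
    return selected[:target_size]
```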

Experimental Results

Extensive experiments demonstrate that fine-tuning LRMs with STAR-1 yields an average 40% improvement in safety performance across four safety benchmarks, with only an average 1.1% decline in reasoning ability across five reasoning tasks. These results indicate that STAR-1 strengthens safety alignment while largely preserving the general reasoning power of LRMs. Notably, larger models show diminishing returns in safety gains, likely because of their more robust pre-training, yet they still benefit consistently from STAR-1 fine-tuning.

Implications and Future Directions

The implications of this research are substantial for both the theoretical understanding and the practical application of AI safety. The strong results of STAR-1 suggest that small, high-quality datasets can suffice to significantly improve model safety. This finding may reduce dependence on large-scale safety datasets, enabling faster and more resource-efficient alignment training.

Future work may further tailor the deliberative reasoning process to jointly optimize reasoning fidelity and safety compliance. Addressing overrefusal, where models excessively refuse benign queries, could also improve user interactions without compromising safety standards. The paper's empirical findings and methodological choices may motivate further work on AI alignment and support the safer deployment of reasoning-based AI systems.

Conclusion

The STAR-1 dataset offers an effective, novel approach to the safety challenges inherent in LRMs. By embedding safety directly into the model's reasoning process and ensuring data quality through rigorous filtering, this work expands the methodological repertoire for AI safety. Small, high-quality datasets of this kind represent a compelling direction for training and deploying safe, reliable AI systems.
