- The paper proposes a unified framework, UpSafe∘C, that upcycles safety-critical layers via a two-stage training process and introduces a controllable safety temperature parameter.
- The methodology leverages intrinsic safety sensitivity to transform a small set of layers into a sparse Mixture-of-Experts design, achieving a 100% safety rate on JBB and StrongReject and up to 10.8% improvement over baselines on WildJailbreak.
- The approach enables dynamic, inference-time safety control without retraining, preserving general capabilities while allowing explicit adjustments to the safety-utility trade-off.
UpSafe∘C: Upcycling for Controllable Safety in LLMs
Motivation and Problem Statement
LLMs exhibit strong generalization across diverse tasks but remain susceptible to safety risks, including harmful content generation and jailbreak attacks. Existing safety interventions—external guardrails, inference-time guidance, and post-training alignment—each present trade-offs in terms of safety, utility, and controllability. External guardrails are decoupled from model internals and incur deployment overhead; inference guidance increases latency and is less effective for models with weak instruction-following; post-training alignment (SFT, RLHF) often suffers from safety-utility trade-offs and lacks dynamic inference-time control.
UpSafe∘C introduces a unified framework for LLM safety enhancement, integrating training and inference mechanisms to achieve robust, controllable safety while preserving utility. The approach leverages intrinsic safety-critical layers, upcycles them into a sparse Mixture-of-Experts (MoE) structure, and introduces a safety temperature for dynamic inference-time control.
Figure 1: The UpSafe∘C framework: safety-critical layer identification, upcycling with safety experts via two-stage SFT, and safety temperature for inference-time control.
Safety-Critical Layer Identification
The framework begins by probing the pretrained LLM to identify layers most sensitive to safety-relevant signals. A Safety Sensitivity Score (SS-Score) is computed for each layer using linear probes trained to distinguish harmful from benign input representations. Layers with the lowest validation loss (highest discriminative power) are designated as safety-critical.
Figure 2: (a) t-SNE visualization shows safety-critical layers yield more separable representations for harmful/benign inputs. (b) SS-Score scan highlights top-3 safety-critical layers in Llama3.1-8B-Instruct.
This targeted approach reduces parameter overhead and avoids degradation of general capabilities, as only a small subset of layers is modified.
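To make the probing step concrete, here is a minimal sketch of SS-Score computation, assuming a Hugging Face `transformers` causal LM and small sets of labeled harmful/benign prompts; the probe design (logistic regression on mean-pooled hidden states) and the scoring details are illustrative assumptions, not the paper's verbatim procedure.

```python
# Minimal SS-Score probing sketch (assumes an HF model/tokenizer and labeled prompts).
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

@torch.no_grad()
def layer_features(model, tokenizer, prompts, device="cuda"):
    """Mean-pooled hidden states per layer, shape [n_layers, n_prompts, d]."""
    feats = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        out = model(**inputs, output_hidden_states=True)
        # hidden_states has n_layers + 1 entries; drop the embedding layer.
        feats.append(torch.stack([h.mean(dim=1).squeeze(0) for h in out.hidden_states[1:]]))
    return torch.stack(feats, dim=1).float().cpu().numpy()

def safety_critical_layers(model, tokenizer, harmful, benign, top_k=3):
    """Train one linear probe per layer; lowest validation loss = most safety-sensitive."""
    X = layer_features(model, tokenizer, harmful + benign)
    y = np.array([1] * len(harmful) + [0] * len(benign))
    losses = []
    for layer_x in X:
        x_tr, x_va, y_tr, y_va = train_test_split(
            layer_x, y, test_size=0.2, stratify=y, random_state=0
        )
        probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        losses.append(log_loss(y_va, probe.predict_proba(x_va)))
    return sorted(np.argsort(losses)[:top_k].tolist())
```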
Safety-Aware Upcycling via Sparse MoE
Selected safety-critical layers are upcycled into a sparse MoE structure. Each upcycled layer comprises a router, the original MLP (general expert), and multiple duplicated MLPs (safety experts). The router acts as a soft guardrail, dynamically activating experts based on input characteristics.
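A rough sketch of what such an upcycled layer could look like is shown below, assuming a single Transformer block's MLP is wrapped; module and parameter names (`SafetyMoELayer`, `num_safety_experts`) are illustrative, and the dense soft mixture stands in for the paper's sparse routing for brevity.

```python
# Illustrative upcycled layer: a router over the original MLP (general expert)
# and duplicated safety experts. Names and the soft mixture are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SafetyMoELayer(nn.Module):
    def __init__(self, original_mlp: nn.Module, hidden_size: int, num_safety_experts: int = 1):
        super().__init__()
        self.general_expert = original_mlp  # retains pretrained general behavior
        self.safety_experts = nn.ModuleList(
            [copy.deepcopy(original_mlp) for _ in range(num_safety_experts)]
        )
        # Router scores [general, safety_1, ..., safety_k] for each token.
        self.router = nn.Linear(hidden_size, 1 + num_safety_experts)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.router(hidden_states), dim=-1)      # [b, s, 1 + k]
        expert_outs = [self.general_expert(hidden_states)] + [
            e(hidden_states) for e in self.safety_experts
        ]
        stacked = torch.stack(expert_outs, dim=-1)                   # [b, s, d, 1 + k]
        return (stacked * weights.unsqueeze(-2)).sum(dim=-1)         # weighted expert mix
```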
Training proceeds in two stages:
- Stage 1: Safety experts and routers are trained on harmful data, with the general expert frozen. The router is constrained to activate only safety experts, specializing them for risk mitigation.
- Stage 2: Experts are frozen; routers are trained on mixed harmful/benign data to learn soft guardrail behavior—activating safety experts for harmful prompts and the general expert for benign prompts.
This staged optimization ensures specialization without catastrophic forgetting and enables precise discrimination between harmful and benign inputs.
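The freeze/unfreeze schedule implied by the two stages can be sketched as follows, reusing the `SafetyMoELayer` sketch above; data loading, the loss, and the stage-1 routing constraint are omitted and assumed to be handled elsewhere.

```python
# Two-stage schedule: which parameters receive gradients in each stage.
def set_training_stage(layer: SafetyMoELayer, stage: int) -> None:
    # The original (general) expert stays frozen throughout.
    for p in layer.general_expert.parameters():
        p.requires_grad_(False)
    # Stage 1: safety experts + router train on harmful data (router restricted
    # to safety experts). Stage 2: experts frozen, router trains on mixed data.
    for p in layer.safety_experts.parameters():
        p.requires_grad_(stage == 1)
    for p in layer.router.parameters():
        p.requires_grad_(True)
```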
Safety Temperature: Dynamic Inference-Time Control
UpSafe∘C introduces a safety temperature parameter τ∈[0,1] at inference, which biases the router's logits to favor safety or general experts. The bias and temperature scaling are mathematically designed to provide monotonic, differentiable control over the safety-utility trade-off. As τ increases, routing shifts toward safety experts; as τ decreases, general experts dominate.
Figure 3: Safety–utility trade-off curves under varying safety temperature τ, with the Pareto frontier highlighted.
This mechanism allows fine-grained, on-the-fly adjustment of model behavior without retraining, supporting dynamic deployment scenarios.
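The summary does not give the exact bias or scaling formula, so the following is only a hedged sketch of one monotonic choice: τ linearly interpolates a logit bias between the general expert and the safety experts before the router softmax (the `bias_scale` constant and the linear form are assumptions).

```python
# Hedged sketch: tau in [0, 1] biases router logits toward safety experts.
import torch
import torch.nn.functional as F

def temperature_routing(router_logits: torch.Tensor, tau: float, bias_scale: float = 5.0) -> torch.Tensor:
    """router_logits[..., 0] = general expert; [..., 1:] = safety experts."""
    bias = torch.zeros_like(router_logits)
    bias[..., 0] = (1.0 - tau) * bias_scale   # tau -> 0: general expert dominates
    bias[..., 1:] = tau * bias_scale          # tau -> 1: safety experts dominate
    return F.softmax(router_logits + bias, dim=-1)

# Sweeping tau at inference traces the safety-utility trade-off without retraining.
toy_logits = torch.tensor([2.0, 0.5, 0.3])    # [general, safety_1, safety_2]
for tau in (0.0, 0.5, 1.0):
    print(tau, temperature_routing(toy_logits, tau).tolist())
```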
Empirical Evaluation
Experiments span multiple open-source LLMs (Qwen2.5-7B/14B-Instruct, Llama3.1-8B-Instruct) and reasoning models (DeepSeek-R1-Distill variants), with baselines including vanilla, SFT-only, and MoE (single-stage) models. Training uses the STAR-1 safety dataset; evaluation covers red-team, jailbreak, over-refusal, and general ability benchmarks.
Key results:
- UpSafe∘C achieves 100% safety rate on JBB and StrongReject, and outperforms SFT-only and MoE baselines by up to 10.8% on challenging WildJailbreak scenarios.
- General capabilities (HumanEval, MMLU, Math-500) are preserved or improved, with average performance gains of 4.5% over SFT-only and 0.5% over MoE.
- Safety temperature τ enables explicit traversal of the safety–utility Pareto frontier, outperforming static baselines and SafeKey across all operating points.
Ablation and Analysis
Ablation studies support the framework's design. In particular, router analysis shows selective activation: safety experts are engaged for harmful prompts and the general expert for benign ones, confirming the soft-guardrail hypothesis.
Theoretical and Practical Implications
UpSafe∘C demonstrates that safety-relevant signals are concentrated in a small subset of LLM layers, enabling parameter-efficient, targeted safety enhancement. The modular MoE design supports specialization and robustness against adversarial attacks. The safety temperature mechanism provides a practical interface for dynamic safety control, suitable for real-world deployment where requirements may shift over time.
The approach generalizes across architectures and scales, and does not compromise general utility, as confirmed by stable performance on knowledge-intensive, coding, and reasoning tasks under varying τ.
Future Directions
Potential extensions include:
- Automated selection of safety-critical layers via unsupervised or meta-learning approaches.
- Integration with other modular architectures (e.g., task-specific experts, domain adaptation).
- Exploration of more granular safety temperature schedules, adaptive to user profiles or context.
- Application to multimodal models and broader risk domains.
Conclusion
UpSafe∘C provides a principled, modular framework for controllable LLM safety, combining targeted upcycling of safety-critical layers, staged expert specialization, and dynamic inference-time control via safety temperature. The method achieves robust safety improvements, maintains general capabilities, and enables explicit, fine-grained trade-offs, representing a significant advance toward dynamic, inference-aware LLM safety.