Proxy or black-box metrics for refusal-cliff monitoring and data selection
Develop proxy metrics or black-box analogues of the refusal-score probing framework that the Cliff-as-a-Judge data selection method uses to monitor refusal intentions and compute misalignment scores. Such metrics would allow safety alignment assessment and training example selection for proprietary large reasoning models, which do not expose their internal hidden-state representations.
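As a rough illustration of what a black-box analogue might look like, the sketch below replaces the hidden-state refusal probe with an empirical refusal rate estimated from sampled outputs. It is not the method from the cited paper. It rests on several assumptions: that a proprietary API can be made to produce final answers from truncated reasoning prefixes, that a simple keyword heuristic (the hypothetical REFUSAL_MARKERS list) is an acceptable stand-in for a refusal classifier, and that the misalignment proxy can be taken as the gap between the peak refusal rate during reasoning and the refusal rate at the final position. All function names here are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical keyword list; a real setup would likely use a trained refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def is_refusal(text: str) -> bool:
    """Crude heuristic: treat a completion as a refusal if it contains a marker phrase."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(completions: Sequence[str]) -> float:
    """Fraction of sampled completions flagged as refusals (black-box stand-in for a probe score)."""
    if not completions:
        raise ValueError("need at least one completion")
    return sum(is_refusal(c) for c in completions) / len(completions)


def cliff_proxy(rates: Sequence[float]) -> float:
    """Assumed proxy: peak refusal rate over earlier reasoning positions minus the final rate.

    rates[i] is the refusal rate estimated after the i-th reasoning prefix;
    the last entry corresponds to the full reasoning trace. Clipped at zero.
    """
    if len(rates) < 2:
        raise ValueError("need at least two positions to measure a drop")
    return max(0.0, max(rates[:-1]) - rates[-1])


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k example ids with the largest proxy score (ties keep insertion order)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In use, one would sample several completions per truncated reasoning prefix, call refusal_rate on each group, pass the resulting sequence to cliff_proxy, and feed the per-example scores to select_top_k. Whether such sampled rates track the internal refusal scores closely enough to be useful is exactly the open question stated above.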
References
Second, our data‑recipe method depends on having access to the model’s internal representations and refusal scores, which is feasible for open models but may be impractical for proprietary systems. Investigation of proxy metrics or black‑box analogues remains future work.
— Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
(2510.06036 - Yin et al., 7 Oct 2025) in Section 7 (Limitations)