Proxy or black-box metrics for refusal-cliff monitoring and data selection

Develop proxy metrics or black-box analogues of the refusal-score probing framework used to monitor refusal intentions and compute misalignment scores in the Cliff-as-a-Judge data selection method, enabling safety alignment assessment and training example selection for proprietary large reasoning models that do not expose internal hidden-state representations.

Background

The paper introduces Cliff-as-a-Judge, a probing-driven data selection approach that relies on internal hidden-state representations to compute refusal scores and misalignment metrics for aligning large reasoning models. While effective for open models where activations are accessible, this dependence limits applicability to proprietary systems that do not provide access to internal representations.

To extend these techniques to closed-source models, the authors highlight the need for proxy metrics or black-box analogues that can approximate refusal-intention monitoring and data selection without direct access to activations. Establishing such proxies would provide practical tools for alignment in settings where internals are unavailable.

References

Second, our data‑recipe method depends on having access to the model’s internal representations and refusal scores, which is feasible for open models but may be impractical for proprietary systems. Investigation of proxy metrics or black‑box analogues remains future work.

— Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? (2510.06036 - Yin et al., 7 Oct 2025) in Section 7 (Limitations)

Proxy or black-box metrics for refusal-cliff monitoring and data selection

Background

References

Related Problems