Generalisation of contrastive-pair–trained probes vs naturalistic off‑policy probes

Determine whether activation probes trained on minimal contrastive pairs (for example, binary contrasts such as "yes" versus "no") generalise in the same way as probes trained on naturalistic off-policy scenarios. In particular, ascertain whether the added realism of naturalistic off-policy training data is necessary for robust off-policy generalisation in large language model monitoring tasks.

Background

The paper evaluates how different response-generation strategies (on-policy natural, on-policy incentivised, on-policy prompted, and off-policy) affect probe performance across eight behaviours. For off-policy experiments, the authors used naturalistic scenarios to generate activations rather than minimal contrastive pairs.

They note that while their naturalistic approach shows particular generalisation patterns, it remains unknown whether probes trained on simpler, minimal contrastive data would generalise similarly, or whether naturalistic realism is required for robust off-policy generalisation. Resolving this would guide data collection strategies for reliable probe training when on-policy data is scarce.
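The comparison at stake can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes probes are linear classifiers over model activations, and it uses synthetic stand-in vectors (a shared "behaviour" direction plus Gaussian noise) in place of real LLM activations. The "contrastive pair" distribution is modelled as cleanly separated, the "naturalistic" distribution as weaker and noisier; both probes are then scored on the naturalistic distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # hypothetical activation dimension

# A single "behaviour" direction shared by both data distributions.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

def sample_activations(n, scale, noise):
    """Synthetic activations: classes separated along `direction`."""
    labels = rng.integers(0, 2, size=n)
    signs = 2 * labels - 1  # map {0, 1} -> {-1, +1}
    X = signs[:, None] * scale * direction + noise * rng.normal(size=(n, d))
    return X, labels

# Contrastive pairs: strongly separated, low noise (e.g. "yes" vs "no").
X_pair, y_pair = sample_activations(200, scale=3.0, noise=0.5)
# Naturalistic off-policy scenarios: same direction, weaker and noisier.
X_nat, y_nat = sample_activations(200, scale=1.0, noise=1.0)

probe_pair = LogisticRegression().fit(X_pair, y_pair)
probe_nat = LogisticRegression().fit(X_nat, y_nat)

# Score both probes on the naturalistic distribution to compare how well
# minimal vs realistic training data transfers to it.
acc_pair = probe_pair.score(X_nat, y_nat)
acc_nat = probe_nat.score(X_nat, y_nat)
print(f"contrastive-pair probe on naturalistic data: {acc_pair:.2f}")
print(f"naturalistic probe on naturalistic data:     {acc_nat:.2f}")
```

In this toy setup both probes recover the shared direction, so transfer succeeds; the open question is whether real LLM activations behave this way, or whether contrastive pairs pick up a direction that fails to transfer.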

References

A possible limitation of our work is that the Off-Policy experiments focused on naturalistic scenarios rather than minimal contrastive pairs (e.g., "yes" vs "no" responses to simple questions). It remains unclear whether probes trained on contrastive pairs exhibit similar generalisation properties, or whether the additional realism in our training data is necessary for robust Off-Policy generalisation.

That's not natural: The Impact of Off-Policy Training Data on Probe Performance (2511.17408, Kirch et al., 21 Nov 2025), Section: Limitations and Future Work