Generalisation of contrastive-pair–trained probes vs naturalistic off‑policy probes
Determine whether activation probes trained on minimal contrastive pairs (for example, "yes" versus "no" responses to simple questions) exhibit the same generalisation properties as activation probes trained on naturalistic off‑policy scenarios. Ascertain whether the added realism of naturalistic off‑policy training data is necessary for robust off‑policy generalisation in large language model monitoring tasks.
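The shape of such a comparison can be sketched in a few lines. The following is only an illustrative harness under strong assumptions, not the method of the work quoted below: the "activations" are synthetic Gaussian vectors, the contrastive-pair condition is modelled simply as low-noise training data, the naturalistic condition as noisier training data, and the probe is a difference-of-means direction. All function names and parameter values are hypothetical, and the printed numbers say nothing about real models.

```python
import random


def make_activations(n, dim, concept_scale, noise_scale, rng):
    """Synthetic stand-in for activations: the label shifts coordinate 0,
    and every coordinate receives Gaussian noise (an assumption, not real data)."""
    data = []
    for i in range(n):
        label = i % 2  # balanced classes
        vec = [rng.gauss(0.0, noise_scale) for _ in range(dim)]
        vec[0] += concept_scale if label == 1 else -concept_scale
        data.append((vec, label))
    return data


def fit_mean_difference_probe(data):
    """Probe direction = mean of positive examples minus mean of negative examples."""
    dim = len(data[0][0])
    pos = [v for v, y in data if y == 1]
    neg = [v for v, y in data if y == 0]
    mean_pos = [sum(v[i] for v in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(v[i] for v in neg) / len(neg) for i in range(dim)]
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def auroc(direction, data):
    """Probability that a random positive example scores above a random
    negative example under the probe (ties count as one half)."""
    scores = [(sum(w * x for w, x in zip(direction, v)), y) for v, y in data]
    pos = [s for s, y in scores if y == 1]
    neg = [s for s, y in scores if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 16

# Hypothetical training conditions: low noise for contrastive pairs,
# higher noise for naturalistic scenarios.
contrastive_train = make_activations(200, DIM, 1.0, 0.2, rng)
naturalistic_train = make_activations(200, DIM, 1.0, 1.0, rng)

# Hypothetical held-out distribution, noisier than either training set.
shifted_test = make_activations(400, DIM, 1.0, 1.5, rng)

contrastive_auroc = auroc(fit_mean_difference_probe(contrastive_train), shifted_test)
naturalistic_auroc = auroc(fit_mean_difference_probe(naturalistic_train), shifted_test)

print(f"contrastive-pair probe AUROC: {contrastive_auroc:.3f}")
print(f"naturalistic probe AUROC:     {naturalistic_auroc:.3f}")
```

In a real study, the synthetic generators would be replaced by activations extracted from the monitored model, and both probes would be scored on the same held-out evaluation set so that any gap can be attributed to the training data alone.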
References
A possible limitation of our work is that the Off-Policy experiments focused on naturalistic scenarios rather than minimal contrastive pairs (e.g., "yes" vs "no" responses to simple questions). It remains unclear whether probes trained on contrastive pairs exhibit similar generalisation properties, or whether the additional realism in our training data is necessary for robust Off-Policy generalisation.