Identifying when alignment data must be online and on-policy
Determine the conditions or task regimes under which alignment data must be collected online and on-policy to achieve the desired performance or stability, even when humanline variants of alignment objectives such as DPO, KTO, and GRPO are applied to offline data.
References
What then makes offline data ‘good-quality’, and can these characteristics be formalized? Conversely, are there settings under which alignment data must necessarily be online and on-policy? We leave these as directions for future work.
— Humanline: Online Alignment as Perceptual Loss
(2509.24207 - Liu et al., 29 Sep 2025) in Section 6: Limitations and Future Work