Identifying when alignment data must be online and on-policy

Determine the conditions or task regimes under which alignment data must be collected online and on-policy to achieve the desired performance or stability, even when humanline variants of alignment objectives such as DPO, KTO, and GRPO are applied to offline data.

Background

While humanline variants can close the performance gap between offline off-policy and online on-policy alignment in the authors’ experiments, the paper cautions that this is not guaranteed in every scenario. Because the result depends on data quality and training dynamics, there may be regimes where offline data, even when used with humanline variants of the objective, cannot substitute for online on-policy sampling.

The authors explicitly pose the question of whether there are settings that inherently require online on-policy data, framing a need to understand the boundaries of applicability for humanline alignment and to codify when online collection is indispensable.

References

What then makes offline data ‘good-quality’, and can these characteristics be formalized? Conversely, are there settings under which alignment data must necessarily be online and on-policy? We leave these as directions for future work.

— Humanline: Online Alignment as Perceptual Loss (2509.24207 - Liu et al., 29 Sep 2025) in Section 6: Limitations and Future Work
