Formalizing the characteristics of good-quality offline alignment data

Determine and formally characterize the properties that make offline off-policy alignment data suitable for achieving parity with online on-policy alignment when using humanline variants of Direct Preference Optimization (DPO), Kahneman–Tversky Optimization (KTO), and Group Relative Policy Optimization (GRPO). Specify measurable criteria or diagnostics that predict when offline datasets will enable humanline-trained policies to match the performance of policies trained with online on-policy data.

Background

The paper introduces humanline variants of common alignment objectives that incorporate prospect-theoretic probability weighting, enabling offline off-policy training to match the performance of online on-policy methods in several empirical settings. However, this equivalence is observed as an empirical regularity rather than a guarantee, and depends strongly on data quality. In experiments, different offline data sources led to varying outcomes, indicating that some offline datasets suffice while others do not.
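
For context, the prospect-theoretic probability weighting referenced above is typically the Tversky–Kahneman weighting function w(p) = p^γ / (p^γ + (1 − p)^γ)^(1/γ), which overweights small probabilities and underweights large ones. The sketch below is illustrative only: the function name, the choice of γ = 0.61, and where such a weight would be applied inside the DPO/KTO/GRPO objectives are assumptions for exposition, not the paper's exact formulation.

```python
import torch

def tk_probability_weight(p: torch.Tensor, gamma: float = 0.61) -> torch.Tensor:
    """Tversky-Kahneman (1992) probability weighting function.

    w(p) = p^gamma / (p^gamma + (1 - p)^gamma)^(1 / gamma)

    With gamma < 1, small probabilities are overweighted and large ones
    underweighted, mimicking human perception of probability. gamma = 0.61 is
    the Tversky & Kahneman (1992) estimate for gains; the humanline objectives
    may parameterize the weighting differently.
    """
    pg = p.pow(gamma)
    qg = (1.0 - p).pow(gamma)
    return pg / (pg + qg).pow(1.0 / gamma)

# Example: perceptual distortion of model-implied probabilities.
probs = torch.tensor([0.01, 0.10, 0.50, 0.90, 0.99])
print(tk_probability_weight(probs))
# Small probabilities are inflated and large ones deflated relative to identity.
```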

The authors explicitly raise the question of what constitutes ‘good-quality’ offline data and whether its defining characteristics can be rigorously formalized, highlighting the need for principled criteria to assess offline datasets’ suitability for humanline alignment and to predict success prior to training.

References

What then makes offline data ‘good-quality’, and can these characteristics be formalized? Conversely, are there settings under which alignment data must necessarily be online and on-policy? We leave these as directions for future work.

Humanline: Online Alignment as Perceptual Loss (arXiv:2509.24207, Liu et al., 29 Sep 2025), Section 6: Limitations and Future Work.