Formalizing the characteristics of good-quality offline alignment data
Determine and formally characterize the properties that make offline, off-policy alignment data sufficient to achieve parity with online, on-policy alignment when using humanline variants of Direct Preference Optimization (DPO), Kahneman–Tversky Optimization (KTO), and Group Relative Policy Optimization (GRPO). Specify measurable criteria or diagnostics that predict when an offline dataset will enable humanline-trained policies to match the performance of policies trained on online, on-policy data.
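As one concrete example of what such a diagnostic could look like, the sketch below computes the effective sample size (ESS) of sequence-level importance weights between the policy being trained and a reference model standing in for the unknown behavior policy that generated the offline data. This is an illustrative proposal under stated assumptions, not a criterion established here or in prior work: the helpers `sequence_logprob` and `ess_fraction` are hypothetical names, and a Hugging Face-style causal LM and tokenizer are assumed.

```python
# Hypothetical diagnostic for how far offline data is from on-policy:
# the effective sample size (ESS) of sequence-level importance weights
# w_i = pi_theta(y_i | x_i) / pi_ref(y_i | x_i), where pi_ref stands in
# for the (unknown) behavior policy behind the offline data.
import torch


def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Log-probability of `response` given `prompt` under a causal LM.

    Assumes tokenizing `prompt + response` splits cleanly at the boundary.
    """
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab)
    # Token t's log-prob is predicted from position t - 1.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, n_prompt - 1:].sum().item()  # response tokens only


def ess_fraction(policy, reference, tokenizer, pairs) -> float:
    """Normalized ESS in (0, 1]: values near 1 mean effectively on-policy data."""
    log_w = torch.tensor(
        [
            sequence_logprob(policy, tokenizer, x, y)
            - sequence_logprob(reference, tokenizer, x, y)
            for x, y in pairs
        ]
    )
    w = torch.exp(log_w - log_w.max())  # ESS is scale-invariant; stabilize first
    return ((w.sum() ** 2 / (w ** 2).sum()) / len(pairs)).item()
```

An ESS fraction near 1 would suggest the offline data is nearly on-policy for the current policy, while values near 0 would flag severe distribution shift; whether any such threshold actually predicts parity under humanline training is precisely the open question posed above.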
References
What, then, makes offline data ‘good-quality’, and can these characteristics be formalized? Conversely, are there settings in which alignment data must necessarily be online and on-policy? We leave these as directions for future work.