Extending evaluation beyond HH-RLHF to broader preference datasets
Investigate the applicability and performance of the DEFT framework on preference datasets that are larger and more complex than HH-RLHF, whose coverage is limited to the harmlessness and helpfulness domains. This would assess whether DEFT's data filtering and distribution-guided fine-tuning generalize to broader preference taxonomies.
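A practical prerequisite for such an evaluation is mapping heterogeneous preference datasets onto a common format. The sketch below is illustrative only: the adapter names, field names, and schemas are assumptions for demonstration, not the actual layout of HH-RLHF or any other dataset, and it is not part of the DEFT implementation. It shows one way to normalize differently structured preference data into (prompt, chosen, rejected) triples of the kind preference-based fine-tuning pipelines typically consume.

```python
# Hypothetical sketch: normalize preference datasets with different schemas
# into common (prompt, chosen, rejected) triples. All field names are
# illustrative assumptions, not real dataset schemas.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def from_pairwise_style(row: Dict) -> PreferencePair:
    # Datasets that already store explicit chosen/rejected responses.
    return PreferencePair(row["prompt"], row["chosen"], row["rejected"])

def from_score_style(row: Dict) -> PreferencePair:
    # Score-annotated datasets: derive chosen/rejected from numeric ratings.
    a, b = row["responses"]
    if row["scores"][0] >= row["scores"][1]:
        chosen, rejected = a, b
    else:
        chosen, rejected = b, a
    return PreferencePair(row["question"], chosen, rejected)

ADAPTERS: Dict[str, Callable[[Dict], PreferencePair]] = {
    "pairwise": from_pairwise_style,
    "scored": from_score_style,
}

def normalize(rows: List[Dict], schema: str) -> List[PreferencePair]:
    # Convert all rows of a dataset using the adapter for its schema.
    return [ADAPTERS[schema](r) for r in rows]

# Toy rows standing in for two differently structured datasets.
pairwise_rows = [{"prompt": "How do I stay safe online?",
                  "chosen": "Use strong, unique passwords.",
                  "rejected": "Share your passwords freely."}]
scored_rows = [{"question": "Summarize the report.",
                "responses": ["A concise summary.", "An off-topic reply."],
                "scores": [0.9, 0.2]}]

pairs = normalize(pairwise_rows, "pairwise") + normalize(scored_rows, "scored")
print(len(pairs), pairs[1].chosen)
```

With a unified representation like this, the same filtering and fine-tuning procedure can be run unchanged across datasets whose annotations differ in form.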
References
Additionally, the HH-RLHF dataset only reflects a portion of preferences, namely Harmless and Helpful, while other more extensive and complex preference datasets remain to be explored.
— DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment
(2604.01787 - Zhu et al., 2 Apr 2026) in Section: Limitations