User preference for models fine-tuned on the moral graph

Determine whether users prefer interacting with a language model fine-tuned on the moral graph alignment target over their current interactions with existing systems.

Background

The authors outline several ways to train models on the moral graph, including generating datasets for RLHF-like pipelines and training reward models on wisdom upgrades. They also discuss the need to infer the relevant moral context at each point in a dialogue and to rate candidate completions by their adherence to the retrieved values cards.
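As a rough illustration of that data-generation step, the sketch below builds one chosen/rejected pair for an RLHF-style dataset by inferring a moral context, retrieving the values card the moral graph favors for that context, and ranking completions by adherence to it. The function and class names (ValuesCard, infer_moral_context, retrieve_values_card, rate_adherence, preference_pair) are hypothetical placeholders, not the authors' implementation; in practice the stubbed steps would be backed by LLM calls.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ValuesCard:
    """One node of the moral graph: a value expressed as attention policies."""
    title: str
    attention_policies: List[str]

def infer_moral_context(dialogue: List[str]) -> str:
    """Hypothetical: summarize the morally relevant situation at this point
    in the dialogue (in practice, an LLM call)."""
    return "placeholder moral context"

def retrieve_values_card(context: str, moral_graph: Dict[str, ValuesCard]) -> ValuesCard:
    """Hypothetical: look up the value the moral graph deems wisest for this
    context, i.e. a card with no outgoing wisdom-upgrade edge."""
    return moral_graph[context]

def rate_adherence(completion: str, card: ValuesCard) -> float:
    """Hypothetical: score how well the completion attends to what the
    values card says to attend to (in practice, an LLM judge)."""
    return 0.0

def preference_pair(dialogue: List[str],
                    completions: List[str],
                    moral_graph: Dict[str, ValuesCard]) -> Tuple[str, str]:
    """Rank candidate completions by adherence to the retrieved values card
    and return a (chosen, rejected) pair for an RLHF-style dataset."""
    context = infer_moral_context(dialogue)
    card = retrieve_values_card(context, moral_graph)
    ranked = sorted(completions, key=lambda c: rate_adherence(c, card), reverse=True)
    return ranked[0], ranked[-1]
```

Pairs produced this way could feed a standard preference-optimization or reward-model pipeline; the sketch leaves the scoring and retrieval logic abstract because the paper does not specify them.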

Despite these proposals, the authors explicitly state that they do not yet know whether users will prefer models fine-tuned on the moral graph, and they note an ongoing effort to fine-tune a model on a new, larger moral graph to answer this question.

References

Finally, we don’t yet know if users will prefer interacting with a model fine-tuned on the moral graph. We are in the process of fine-tuning a model on a new, larger moral graph, and will be able to answer this question soon.

What are human values, and how do we align AI to them? (2404.10636 - Klingefjord et al., 27 Mar 2024) in Subsection “Limitations” (Fine-Tuning), Section Discussion