Develop a training algorithm for optimizing the moral graph alignment target

Develop a concrete algorithm for training large language models to optimize the moral graph alignment target produced by Moral Graph Elicitation (MGE): convert the moral graph into an objective function suitable for post-training, and demonstrate the resulting method's effectiveness relative to existing alignment approaches.
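
As one illustration of what such an algorithm could look like, here is a minimal sketch, assuming the moral graph's edges (where participants endorse one value as wiser than another for a given context) are converted into weighted preference pairs and optimized with a DPO-style post-training loss. The data class fields, function names, thresholds, and example values below are illustrative assumptions, not the paper's method.

```python
import math
from dataclasses import dataclass


@dataclass
class WisdomEdge:
    """Hypothetical moral-graph edge: participants judged `to_value` wiser than
    `from_value` when responding in `context`. Field names are assumptions."""
    context: str
    from_value: str
    to_value: str
    endorsement: float  # fraction of participants endorsing the transition


def edge_weight(edge: WisdomEdge, min_endorsement: float = 0.5) -> float:
    """Down-weight weakly endorsed edges; edges at or below the threshold get weight 0."""
    return max(0.0, edge.endorsement - min_endorsement) / (1.0 - min_endorsement)


def dpo_style_pair_loss(logp_wiser: float, logp_less_wise: float,
                        ref_logp_wiser: float, ref_logp_less_wise: float,
                        weight: float, beta: float = 0.1) -> float:
    """Weighted DPO-style loss for one preference pair: a response written under the
    wiser value vs. one written under the less-wise value, for the same context prompt.
    Log-probabilities are scalars here for clarity; a real pipeline would compute them
    with the policy model and a frozen reference model."""
    margin = beta * ((logp_wiser - ref_logp_wiser) - (logp_less_wise - ref_logp_less_wise))
    return -weight * math.log(1.0 / (1.0 + math.exp(-margin)))


# Illustrative example with made-up value names and log-probabilities.
edge = WisdomEdge(context="user is considering dropping out of school",
                  from_value="unconditional deference", to_value="informed autonomy",
                  endorsement=0.85)
loss = dpo_style_pair_loss(logp_wiser=-12.3, logp_less_wise=-10.9,
                           ref_logp_wiser=-12.5, ref_logp_less_wise=-10.8,
                           weight=edge_weight(edge))
print(f"weighted pair loss: {loss:.4f}")
```

In a full pipeline, one would presumably generate paired responses under each value's attention policies, score them with the policy and reference models, and sum the weighted pair losses over all edges of the moral graph; how best to do this, and whether it outperforms existing alignment methods, is exactly the open question.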

Background

The paper decomposes "aligning to human values" into three stages: eliciting values, reconciling those values into an alignment target, and training the model to optimize that target. The work focuses on the first two stages and introduces Moral Graph Elicitation (MGE) and the moral graph as a new alignment target.

While the paper sketches several options for using the moral graph in training, the authors explicitly leave the final stage (creating and validating a training algorithm) open for future work.

References

Finally, we need an algorithm for training a model to optimize this target; we leave this final stage for future work.

What are human values, and how do we align AI to them? (2404.10636 - Klingefjord et al., 27 Mar 2024) in Section 1 (Introduction)