
Assess viability of proposed moral-graph training approaches

Evaluate the viability of the proposed training approaches that use moral contexts to retrieve applicable values and either (i) fine-tune on datasets rated for adherence to values cards or (ii) train a reward model on wisdom-upgrade orderings derived from the moral graph, assessing both empirical performance and alignment outcomes.


Background

The authors outline two concrete training approaches: (i) building SFT/RLHF-style datasets rated for adherence to the values cards applicable in each moral context, and (ii) training a reward model on wisdom-upgrade orderings derived from the moral graph.

They do not yet have evidence of the viability of these approaches and explicitly flag the need for further evaluation.
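To make the second approach concrete, the following is a minimal sketch of the kind of pairwise objective a reward model could be trained with on wisdom-upgrade orderings. It assumes a Bradley-Terry-style formulation and hypothetical scores; the data shapes, scoring function, and loss choice are illustrative assumptions, not the paper's implementation.

```python
import math

def pairwise_loss(score_wiser: float, score_less_wise: float) -> float:
    """Negative log-likelihood that the 'wiser' value outranks the other
    under a Bradley-Terry model of the wisdom-upgrade ordering."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_wiser - score_less_wise))))

# Each edge in the moral graph asserts that, in some moral context, one value
# is a wisdom upgrade of another, so the reward model should score it higher.
# The numeric scores below are hypothetical model outputs.
upgrade_edges = [
    (2.1, 0.4),  # (score of upgraded value, score of superseded value)
    (1.3, 1.0),
]

total_loss = sum(pairwise_loss(w, l) for w, l in upgrade_edges)
```

Minimizing this loss over the graph's upgrade edges would push the reward model to rank wiser values above the values they supersede, which is one natural reading of "reward modeling based on wisdom-upgrade orderings".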

References

More work is required to evaluate the viability of these approaches.

What are human values, and how do we align AI to them? (2404.10636 - Klingefjord et al., 27 Mar 2024) in Section 6.2 (How to train a model on a moral graph)