
Joint optimization of all consistency metrics during fine-tuning

Investigate whether simultaneously optimizing prompt-to-line consistency, line-to-line consistency, and Q&A consistency as training objectives during multi-turn fine-tuning of the User Simulator LLM yields more robust persona consistency than optimizing the single prompt-to-line metric alone.


Background

The paper introduces three complementary automatic metrics—prompt-to-line consistency, line-to-line consistency, and Q&A consistency—to evaluate persona fidelity and coherence in multi-turn LLM dialogues. For fine-tuning, the authors prioritized prompt-to-line consistency as the primary reward signal in multi-turn PPO due to its strong alignment with human judgments and computational efficiency.

While this choice produced significant gains, the authors note that real conversational consistency is multi-faceted and hypothesize that jointly optimizing across all three metrics may improve robustness. They explicitly defer empirical validation and methodological development of joint training to future work.
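To make the open question concrete, below is a minimal sketch of one way the three metrics could be folded into a single scalar reward for multi-turn PPO. Everything here is an assumption: the weighted-sum formulation, the function names, and the placeholder scorers are illustrative stand-ins for the paper's judge-based metrics, not the authors' method (which optimizes prompt-to-line consistency alone).

```python
from typing import Callable, Sequence

# A metric takes (persona_prompt, dialogue_lines, qa_pairs) and returns
# a score in [0, 1]. These are placeholders for the paper's automatic
# consistency metrics, whose implementations are not specified here.
Metric = Callable[[str, Sequence[str], Sequence[tuple[str, str]]], float]


def joint_consistency_reward(
    persona_prompt: str,
    dialogue_lines: Sequence[str],
    qa_pairs: Sequence[tuple[str, str]],
    metrics: dict[str, Metric],
    weights: dict[str, float],
) -> float:
    """Weighted combination of consistency metrics as a scalar PPO reward.

    With weights = {"prompt_to_line": 1.0} this collapses to the paper's
    single-metric setup; adding "line_to_line" and "qa" entries gives the
    joint objective posed by the open question. Weights are hypothetical
    and would need to be tuned empirically.
    """
    total_weight = sum(weights.values())
    reward = sum(
        weights[name] * metrics[name](persona_prompt, dialogue_lines, qa_pairs)
        for name in weights
    )
    return reward / total_weight  # normalize so the reward stays in [0, 1]


# Usage with dummy scorers (each pretends every metric scores 0.8):
if __name__ == "__main__":
    dummy = lambda prompt, lines, qa: 0.8
    metrics = {"prompt_to_line": dummy, "line_to_line": dummy, "qa": dummy}
    weights = {"prompt_to_line": 1.0, "line_to_line": 0.5, "qa": 0.5}
    r = joint_consistency_reward(
        "a stoic chef persona",
        ["I rarely smile."],
        [("Do you smile often?", "No.")],
        metrics,
        weights,
    )
    print(f"joint reward: {r:.3f}")  # -> joint reward: 0.800
```

A fixed weighted sum is only the simplest instantiation; validating whether any such combination actually improves robustness over the single-metric reward is precisely what the authors leave to future work.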

References

"However, jointly training with all consistency metrics may yield more robust behavior, which we leave for future work."

Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning (Abdulhai et al., arXiv:2511.00222, 31 Oct 2025), Section 6: Limitations