Effectiveness of Fine‑Tuning to Control Escalatory Tendencies

Investigate how effectively fine‑tuning can steer large language models to be escalatory or non‑escalatory within the multi‑agent wargame simulation framework, including measuring the extent, reliability, and limits of such controllability across different models and training regimes.

Background

The paper finds notable differences in escalation behavior across models, including RLHF‑tuned variants and a base model without safety fine‑tuning, suggesting that training and alignment significantly affect decision propensities.

The authors explicitly identify as an unresolved question the extent to which fine‑tuning can make a model more or less escalatory. Resolving this would inform the feasibility of using fine‑tuning as a risk‑management tool for high‑stakes applications and guide responsible deployment practices.

References

There are still a series of unresolved questions that could use some further understanding. The first is an exploration of how well, a model can be fine-tuned to be escalatory or non-escalotry.

— Escalation Risks from Language Models in Military and Diplomatic Decision-Making (2401.03408 - Rivera et al., 7 Jan 2024) in Section: Future Work

Effectiveness of Fine‑Tuning to Control Escalatory Tendencies

Background

References

Related Problems