Effectiveness of Fine‑Tuning to Control Escalatory Tendencies
Investigate how effectively fine‑tuning can steer large language models to be escalatory or non‑escalatory within the multi‑agent wargame simulation framework, including measuring the extent, reliability, and limits of such controllability across different models and training regimes.
References
There are still a series of unresolved questions that could use some further understanding. The first is an exploration of how well a model can be fine-tuned to be escalatory or non-escalatory.
Source: Escalation Risks from Language Models in Military and Diplomatic Decision-Making (2401.03408, Rivera et al., 7 Jan 2024), Section: Future Work