Stability of value-based preferences under systematic perturbations
Ascertain how value-based preferences in large language models behave under systematic perturbations, including whether such preferences remain stable across population-level variations induced by model dropout.
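One way to make "stability across a dropout-induced population" concrete is sketched below. It is a toy illustration, not the method of the cited paper. The linear scorer, the function names `preferred_option` and `preference_stability`, and the agreement-rate metric are all assumptions introduced here. Each random dropout mask stands in for one member of the perturbed population, and stability is read as the fraction of members whose top-ranked option matches the unperturbed model's choice.

```python
import random


def preferred_option(weights, options, drop_rate, rng):
    """Index of the highest-scoring option after randomly zeroing weights.

    Hypothetical toy scorer: each option is a feature vector and its score
    is a masked dot product with the weights. Ties go to the lowest index.
    """
    # Each weight is dropped (zeroed) independently with probability drop_rate.
    mask = [0.0 if rng.random() < drop_rate else 1.0 for _ in weights]
    scores = [
        sum(w * m * x for w, m, x in zip(weights, mask, feats))
        for feats in options
    ]
    return max(range(len(options)), key=scores.__getitem__)


def preference_stability(weights, options, drop_rate, population=1000, seed=0):
    """Fraction of dropout-perturbed models that keep the baseline preference."""
    rng = random.Random(seed)
    # Baseline: the unperturbed model (drop_rate 0 leaves every weight intact).
    baseline = preferred_option(weights, options, 0.0, rng)
    agree = sum(
        preferred_option(weights, options, drop_rate, rng) == baseline
        for _ in range(population)
    )
    return agree / population


# Illustrative values only: two options scored by a three-weight toy model.
toy_weights = [2.0, 1.0, 0.5]
toy_options = [[1, 0, 0], [0, 1, 1]]
for rate in (0.0, 0.2, 0.5):
    print(rate, preference_stability(toy_weights, toy_options, rate))
```

An agreement rate near 1.0 would indicate a preference that survives the perturbation, and a rate that falls quickly as `drop_rate` grows would indicate a fragile one. For an actual language model, the toy scorer would have to be replaced by the model's own preference elicitation procedure.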
References
Further, if a model has value-based preferences (VBPs), it is unclear how these preferences will fare under systematic perturbation.
From "Do Large Language Models Learn Human-Like Strategic Preferences?" (2404.08710, Roberts et al., 11 Apr 2024), Section 3, "Do LLMs Prefer Strategies Based on Value?", opening paragraph.