Stability of value-based preferences under systematic perturbations

Ascertain how value-based preferences in large language models behave under systematic perturbations, including whether such preferences remain stable across population-level variations induced by model dropout.

Background

For human-adjacent applications such as human-robot interaction, reliable and stable preferences are essential. The authors note uncertainty about whether any learned value-based preferences would persist under systematic perturbations (e.g., Monte Carlo dropout) applied to create model populations, which they use to assess brittleness.
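
To make the idea of a dropout-induced model population concrete, the following is a minimal sketch, not the authors' experimental setup. It assumes a toy linear scorer in place of a language model. The names `score`, `sample_mask`, and `preference_stability`, along with the weights and feature vectors, are hypothetical. Each population member is the base scorer with a random dropout mask applied, and stability is measured as the fraction of members whose preferred option matches the unperturbed model's.

```python
import random


def score(weights, features, mask, keep_prob):
    # Inverted dropout: zero out masked weights and rescale the survivors.
    return sum(w * x * m / keep_prob for w, x, m in zip(weights, features, mask))


def sample_mask(n, keep_prob, rng):
    # Each unit is kept independently with probability keep_prob.
    return [1 if rng.random() < keep_prob else 0 for _ in range(n)]


def preference_stability(weights, option_a, option_b,
                         keep_prob=0.9, population=200, seed=0):
    """Fraction of dropout-perturbed members that agree with the base preference."""
    rng = random.Random(seed)
    full_mask = [1] * len(weights)
    base_prefers_a = (score(weights, option_a, full_mask, 1.0)
                      > score(weights, option_b, full_mask, 1.0))
    agree = 0
    for _ in range(population):
        # One mask per member, shared across both options.
        mask = sample_mask(len(weights), keep_prob, rng)
        member_prefers_a = (score(weights, option_a, mask, keep_prob)
                            > score(weights, option_b, mask, keep_prob))
        agree += member_prefers_a == base_prefers_a
    return agree / population


if __name__ == "__main__":
    # Hypothetical weights and option features, for illustration only.
    weights = [0.5, -0.2, 0.8, 0.1]
    option_a = [1.0, 0.0, 1.0, 0.0]
    option_b = [0.0, 1.0, 0.0, 1.0]
    for p in (1.0, 0.9, 0.5):
        print(p, preference_stability(weights, option_a, option_b, keep_prob=p))
```

In this sketch, a stability value near 1 across a range of `keep_prob` settings would indicate a preference that survives perturbation, while a sharp drop would indicate brittleness. With a real language model, the analogous step would be to keep dropout active at inference time and compare preferences across sampled forward passes.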

References

Further, if a model has value-based preferences (VBPs), it is unclear how these preferences will fare under systematic perturbation.

— Do Large Language Models Learn Human-Like Strategic Preferences? (2404.08710 - Roberts et al., 2024) in Section 3, Do LLMs Prefer Strategies Based on Value?, opening paragraph
