Potential non-human strategic preferences in Solar and Mistral

Investigate whether Solar 10.7B and Mistral 7B exhibit distinctly non-human strategic preferences in contexts beyond those studied, and characterize the conditions under which such divergences arise.

Background

While the experiments show that Solar and Mistral exhibit human-like strategic preferences in specific scenarios, the authors caution that this does not establish that the models are human-like in general. They consider it likely that there are contexts in which these models deviate from human preferences, and they note the difficulty of proving otherwise.

References

It is probable, though not established, that in some circumstances these models may have distinctly non-human strategic preferences.

— Do Large Language Models Learn Human-Like Strategic Preferences? (2404.08710 - Roberts et al., 2024) in Section 7.1, Limitations
