Drivers of the base-model advantage in multi-round strategic play

Determine which mechanisms in multi-round strategic interactions cause base language models to outperform aligned language models at predicting human decisions. Specifically, adjudicate among opponent modeling, history integration, and trajectory novelty as candidate explanations.

Background

The paper finds that base models predict human decisions substantially better than aligned models in multi-round strategic games, with the advantage increasing as interaction history accumulates. This pattern is attributed to descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation that alignment may suppress.

The precise mechanism behind this base-model advantage remains unresolved: it could be superior opponent modeling, better integration of interaction history, or sensitivity to novel trajectory features. Identifying it is necessary to explain why post-training alignment induces a normative bias in these settings.

References

Several open questions follow naturally. Which aspects of multi-round play drive the base advantage: opponent modeling, history integration, or trajectory novelty?

— Alignment Makes Language Models Normative, Not Descriptive (2603.17218 - Shapira et al., 17 Mar 2026) in Discussion and Conclusion
