Generalization of collaboration-oriented post-training across model families

Determine whether the collaboration-oriented post-training methods that increase genuine-followup rates for Qwen3.5-2B (supervised fine-tuning on multi-turn data and PPO-based reinforcement learning with conversational rewards) also generalize to other model families without degrading task accuracy.

Background

The authors apply collaboration-oriented post-training (SFT and PPO-based RL) to Qwen3.5-2B and observe increased genuine-followup rates, with differing effects on task accuracy. This provides preliminary evidence that interaction awareness can be trained.

They explicitly note that this study is preliminary and that it remains unknown whether similar gains would hold for other model families, indicating a need for broader validation.

References

Our post-training study was preliminary and generalization to other model families remain unexplored.

— Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models (2604.02315 - Shekkizhar et al., 2 Apr 2026) in Discussion and Conclusion — Limitations
