When and why does pass@k policy optimization degrade pass@1 performance?
Determine when and why optimizing the pass@$k$ objective via policy optimization degrades single-shot pass@1 performance of large language models on verifiable tasks: identify the conditions under which the trade-off appears and explain the mechanisms that cause it.
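For concreteness, pass@$k$ is commonly measured with the standard unbiased combinatorial estimator: given $n$ sampled attempts of which $c$ pass the verifier, pass@$k$ $= 1 - \binom{n-c}{k}/\binom{n}{k}$. A minimal Python sketch (the function name `pass_at_k` is ours, for illustration):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of which
    are verified correct: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures, so every size-k subset passes
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts, 2 verified correct.
print(pass_at_k(10, 2, 1))  # 0.2    (pass@1)
print(pass_at_k(10, 2, 5))  # ~0.778 (pass@5)
```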
Open question. Despite growing adoption of pass@$k$ objectives, it is still not well understood why optimizing pass@$k$ can hurt pass@1, or when this trade-off should be expected to appear. Without a principled explanation, it is difficult to design reliable inference-aware fine-tuning methods that deliver multi-attempt gains while preserving strong single-shot performance. This leads to our research question: "When and why can pass@$k$ policy optimization degrade pass@1 performance?"
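One intuition for why the trade-off can arise, sketched below as a toy model (the numbers and the independence assumption are ours, purely illustrative, not results from the literature): a peaked policy that always emits its single best answer gains nothing from extra attempts, while a more diverse policy with lower per-attempt accuracy can overtake it at larger $k$. Pushing up pass@$k$ can therefore pull probability mass away from the single best answer and lower pass@1.

```python
def peaked_pass_at_k(p_correct: float, k: int) -> float:
    """Peaked policy: it always emits the same answer, so all k attempts
    are perfectly correlated and extra attempts add nothing."""
    return p_correct

def diverse_pass_at_k(p: float, k: int) -> float:
    """Diverse policy: attempts succeed independently with probability p,
    so pass@k = 1 - (1 - p)^k grows with k."""
    return 1.0 - (1.0 - p) ** k

# Illustrative numbers (assumptions, not measured values): the peaked
# policy is right on 60% of problems; the diverse policy's individual
# attempts are right only 40% of the time.
for k in (1, 2, 4, 8):
    print(k, peaked_pass_at_k(0.60, k), round(diverse_pass_at_k(0.40, k), 3))
# k=1: peaked 0.6  > diverse 0.4   (pass@1 favors the peaked policy)
# k=2: diverse 0.64 > peaked 0.6   (pass@k already favors diversity)
# k=8: diverse ~0.983, far above 0.6
```

Under these toy assumptions, any optimizer that maximizes pass@$k$ for $k \geq 2$ prefers the diverse policy even though its pass@1 is strictly worse, which is exactly the degradation the open question asks to characterize.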