Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
This presentation examines a critical trade-off in large language model training: optimizing for pass@k—the probability that at least one of k attempts succeeds—can inadvertently harm pass@1, the single-try success rate. Through rigorous theoretical analysis and empirical validation on mathematical reasoning tasks, the work reveals that this degradation stems from prompt interference, where training updates that help difficult prompts actively harm easier ones. The formalization introduces gradient conflict mechanisms and demonstrates how pass@k optimization implicitly upweights negatively interfering prompts, creating a fundamental tension with pass@1 performance that has significant implications for deployment in cost-sensitive or reliability-critical settings.
Script
Training a language model to succeed when given multiple attempts can backfire: the very optimization that boosts pass at k—getting at least one correct answer in k tries—can simultaneously degrade pass at 1, the probability of success on the first shot. This isn't just a statistical curiosity; it's a fundamental conflict with real consequences for deployment.
Pass at k has become a standard metric for verifiable tasks like code generation and mathematical reasoning, where you can check correctness and retry. But here's the problem: in many real settings, you need reliability on the first try—retries are expensive, verifiers are imperfect, or fallback mechanisms don't exist. The researchers found that training specifically for pass at k can make that first attempt worse, not better.
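For concreteness, pass at k is usually estimated from n sampled attempts per prompt with the standard unbiased estimator from the code-generation literature: draw k of the n attempts without replacement and ask whether at least one is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: with c correct attempts out of n,
    the probability that a random subset of k attempts contains at
    least one correct answer is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Every size-k subset must include a correct attempt.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# One correct answer out of two attempts: pass@1 = 0.5
print(pass_at_k(2, 1, 1))  # 0.5
```

Note that for a fixed per-sample success rate, pass at k rises quickly with k, which is exactly why it can look healthy while pass at 1 degrades.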
So what's actually causing this conflict?
The key insight is formalizing prompt interference through gradient geometry. When two prompts have negatively interfering gradients—meaning their pass at 1 improvement directions point in opposite ways—you face a choice. Pass at k optimization resolves this choice by exponentially upweighting lower-success prompts. A prompt with 10 percent success gets far more influence than one with 90 percent success. When those hard prompts are negatively interfering, the aggregate update rotates away from improving pass at 1.
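The upweighting follows directly from differentiating the pass at k objective: since pass@k = 1 - (1 - p)^k for per-prompt success rate p, each prompt's pass at 1 gradient is scaled by k(1 - p)^(k-1). A short sketch with the 10-percent-versus-90-percent example from above:

```python
def passk_weight(p: float, k: int) -> float:
    # d/dtheta [1 - (1 - p)^k] = k * (1 - p)^(k-1) * dp/dtheta,
    # so pass@k scales each prompt's pass@1 gradient by k(1-p)^(k-1).
    return k * (1.0 - p) ** (k - 1)

w_hard = passk_weight(0.1, 5)  # hard prompt: 10% success
w_easy = passk_weight(0.9, 5)  # easy prompt: 90% success
print(w_hard / w_easy)         # ratio of (0.9/0.1)^4 = 6561
```

At k = 5, the hard prompt's gradient carries several thousand times the weight of the easy prompt's, so a handful of negatively interfering hard prompts can dominate the aggregate update.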
This heatmap visualizes the interference structure across prompt pairs in a minimal example. The blue regions represent negative cosine similarity—prompts whose pass at 1 gradients point in opposing directions. Notice how substantial the interference is: these aren't isolated edge cases. Prompts with overlapping features but different optimal responses create systematic gradient conflicts, and those conflicts are exactly what pass at k optimization amplifies.
The mathematics reveal exactly when conflict emerges. The pass at k gradient conflicts with pass at 1 if and only if the weighted average agreement score is negative—meaning negatively interfering prompts dominate under pass at k's implicit importance weighting. Absent interference, everything cooperates. But introduce hard, negatively interfering prompts, and pass at k's exponential upweighting flips the sign of the entire aggregate gradient relative to pass at 1.
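The sign condition can be checked numerically. Because the pass at k gradient is the weighted sum of per-prompt pass at 1 gradients, its inner product with the aggregate pass at 1 direction equals the weighted sum of agreement scores, so the two tests below always agree. A toy sketch with two hypothetical prompts and illustrative 2-D gradients (the numbers are invented for demonstration, not taken from the paper):

```python
# Prompt 2 is hard (low pass@1) and negatively interferes with prompt 1.
grads = [(1.0, 0.0), (-0.6, 0.1)]  # illustrative pass@1 gradients
p = [0.9, 0.1]                      # pass@1 success rates
k = 5

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

g1 = [sum(g[d] for g in grads) for d in range(2)]          # pass@1 gradient
w = [k * (1.0 - pi) ** (k - 1) for pi in p]                # pass@k weights
gk = [sum(wi * g[d] for wi, g in zip(w, grads)) for d in range(2)]

# Weighted average agreement score: negative iff pass@k opposes pass@1.
agreement = sum(wi * dot(g, g1) for wi, g in zip(w, grads))
print(agreement < 0, dot(gk, g1) < 0)  # True True: the update flips sign
```

With uniform weights the same data would still favor the pass at 1 direction; it is the pass at k weighting on the hard, interfering prompt that flips the sign.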
Does this actually happen in real language models?
Absolutely. This result from mathematical reasoning with DeepSeek-Llama is striking. Even though hard prompts—those with low pass at 1—are a minority of the dataset, pass at 5 weighting shifts almost all importance onto them. The distribution is dramatically skewed. And critically, these heavily weighted hard prompts have negative agreement scores: they're precisely the ones whose gradients oppose the average pass at 1 direction. This is the mechanism in action on a real model with thousands of prompts.
The implications are operationally significant. If your deployment requires high first-attempt reliability—because retries cost too much, your verifier isn't perfect, or you have no fallback—then naive pass at k optimization is a trap. You'll see impressive pass at k gains on your evaluation set while quietly degrading the metric that actually matters in production. The solution isn't to abandon inference-aware training, but to explicitly account for prompt interference through careful monitoring, reweighting, or gradient surgery techniques.
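As one illustration of the gradient-surgery family mentioned above, a PCGrad-style projection removes the component of a conflicting gradient along the reference direction before aggregating. This is a generic multi-task technique, sketched here in plain Python, not a description of this paper's specific remedy:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out_conflict(g, ref):
    """PCGrad-style surgery: if g conflicts with ref (negative inner
    product), subtract g's component along ref; otherwise keep g."""
    d = dot(g, ref)
    if d >= 0:
        return list(g)
    scale = d / dot(ref, ref)
    return [gi - scale * ri for gi, ri in zip(g, ref)]

g_hard = [-1.0, 0.5]  # heavily weighted hard-prompt gradient (illustrative)
g_easy = [1.0, 0.0]   # pass@1 / easy-prompt direction (illustrative)
g_fixed = project_out_conflict(g_hard, g_easy)
print(g_fixed)        # [0.0, 0.5]: conflicting component removed
```

After projection the hard prompt can still improve along its non-conflicting direction without directly pushing against pass at 1, which is the spirit of the mitigation, at the cost of slower progress on the hard prompts themselves.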
Pass at k optimization doesn't just create a statistical trade-off—it encodes a choice about which prompts to prioritize, and that choice can directly oppose your deployment requirements. To learn more about this work and create your own research presentations, visit EmergentMind.com.