Mirage or Method? How Model-Task Alignment Induces Divergent RL Conclusions (2508.21188v2)

Published 28 Aug 2025 in cs.LG and cs.CL

Abstract: Recent advances in applying reinforcement learning (RL) to LLMs have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold - and, critically, when they fail - remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong model-task alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.

Summary

  • The paper identifies model–task alignment (via pass@k accuracy) as the main determinant of reinforcement learning performance in LLM reasoning tasks.
  • It demonstrates that high alignment can yield robust RL outcomes even with spurious, random, or negative-only reward signals, a benefit not observed in low-alignment settings.
  • Empirical evaluations with Qwen2.5-7B and Llama-3.1-8B-Instruct reveal that one-shot and sample selection methods are effective only when the model’s pretrained capabilities align well with task requirements.

Model–Task Alignment as the Determinant of RL Outcomes in LLM Reasoning

Introduction and Motivation

The application of reinforcement learning (RL) to LLMs has produced a series of empirical phenomena that diverge from classical RL expectations. Notably, recent studies have reported that LLMs can achieve strong performance with minimal or even spurious reward signals, that one-shot RL can rival full-dataset training, and that negative-only reward signals can suffice for effective learning. However, these claims have been largely based on narrow experimental settings, often involving specific model-task pairs such as Qwen models on mathematical reasoning. The paper "Mirage or Method? How Model-Task Alignment Induces Divergent RL Conclusions" (2508.21188) systematically investigates the conditions under which these counterintuitive RL behaviors arise, positing that the degree of model–task alignment, quantified by pass@k accuracy, serves as the critical differentiator (Figure 1).

Figure 1: Model-task alignment, which is measured by pass@k accuracy on the evaluated task, drives distinct outcomes from the same series of RL approaches.

Model–Task Alignment: Definition and Measurement

The central hypothesis advanced is that the effectiveness of RL techniques in LLM reasoning is fundamentally contingent on the alignment between a model’s pretrained capabilities and the requirements of the target task. This alignment is operationalized via the pass@k metric, which measures the probability that at least one correct solution appears among k independent model samples for a given problem. High pass@k indicates strong inherent model proficiency on the task, while low pass@k signals misalignment (Figure 2).

Figure 2: Pass@k for different tasks. Different LLMs have markedly different abilities on different tasks, which affects how RL techniques perform across model-task combinations.
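
For concreteness, pass@k is usually computed with the standard unbiased estimator over n sampled completions per problem. The sketch below illustrates that computation; the paper does not spell out its exact estimator, so treat this as an assumption rather than the authors' code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: number of completions sampled for a problem
    c: number of those completions that are correct
    k: attempt budget (k <= n)
    Returns the estimated probability that at least one of k samples is correct.
    """
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k slots without a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 correct completions out of 64 samples, evaluated at k = 8
print(pass_at_k(n=64, c=4, k=8))  # ~0.42
```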

Empirical evaluation of Qwen2.5-7B and Llama-3.1-8B-Instruct on mathematical and logical reasoning benchmarks reveals substantial variance in pass@k: Qwen2.5-7B exhibits strong alignment on math tasks, and both models show strong alignment on certain KOR-Bench subtasks (Operation, Counterfactual) but weak alignment on the more complex logical reasoning subtasks (Figures 3 and 4).

Figure 3: Pass@k for math tasks. Qwen demonstrates strong capabilities across all three mathematical evaluation datasets.

Figure 4: Pass@k for KOR-Bench. Both models demonstrate strong inherent reasoning capabilities in Operation and Counterfactual subtasks, but exhibit limited inherent logical reasoning abilities in Cipher, Puzzle and Logic.

Disentangling Contamination from Alignment

A competing hypothesis attributes the observed RL phenomena to data contamination (i.e., test set leakage during pretraining). The authors conduct prompt truncation and completion experiments to assess contamination, finding that strong RL effects persist even in settings with no evidence of contamination, provided model–task alignment is high. Conversely, in low-alignment settings, neither contamination nor RL idiosyncrasies are observed. This decouples contamination from the core mechanism and reinforces alignment as the primary explanatory variable.
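
As an illustration of this style of probe, the sketch below truncates a benchmark item and checks whether the model reproduces the withheld remainder. The `model_generate` interface and the exact-continuation criterion are assumptions made for illustration, not the authors' protocol.

```python
def completion_contamination_probe(model_generate, problem_text: str,
                                   truncate_frac: float = 0.5) -> bool:
    """Rough contamination probe via prompt truncation and completion.

    model_generate: callable(prompt, max_new_tokens) -> str  (assumed interface)
    Returns True if the model's continuation closely matches the withheld text,
    which would suggest the item was memorized during pretraining.
    """
    cut = int(len(problem_text) * truncate_frac)
    prefix, held_out = problem_text[:cut], problem_text[cut:]
    continuation = model_generate(prefix, max_new_tokens=len(held_out.split()) + 16)
    # Exact-prefix match is a crude criterion; fuzzy matching could also be used.
    return continuation.strip().startswith(held_out.strip()[:80])
```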

Reward Signal Robustness and RL Effectiveness

The paper systematically evaluates RL performance under various reward signal regimes: ground-truth, random, incorrect, and self-rewarded (e.g., majority voting, entropy minimization). The results demonstrate:

  • Ground-truth rewards consistently yield the highest performance across all settings.
  • Robustness to noisy or spurious rewards is observed only in high-alignment settings. For example, Qwen2.5-7B on math tasks maintains strong performance even with random or incorrect rewards, while Llama-3.1-8B-Instruct on the same tasks does not benefit from such signals.
  • Self-rewarded methods underperform compared to external reward-based RL, especially in low-alignment domains.

These findings indicate that the apparent fault tolerance of RL in LLMs is not a universal property, but rather a consequence of latent model proficiency on the task.
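
To make the compared regimes concrete, the sketch below expresses them as simple reward functions. The verifier interface `is_correct` and the 50% random-reward rate are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import Counter

def ground_truth_reward(answer, reference, is_correct):
    # Accurate external reward from a verifier.
    return 1.0 if is_correct(answer, reference) else 0.0

def random_reward(answer, reference, is_correct):
    # Spurious reward uncorrelated with correctness.
    return float(random.random() < 0.5)

def incorrect_reward(answer, reference, is_correct):
    # Deliberately flipped signal: rewards wrong answers.
    return 1.0 - ground_truth_reward(answer, reference, is_correct)

def majority_vote_reward(answer, group_answers):
    # Self-reward: agreement with the group's most common answer counts as "correct".
    majority, _ = Counter(group_answers).most_common(1)[0]
    return 1.0 if answer == majority else 0.0
```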

One-Shot RL and Sample Selection

The claim that one-shot RL can match full-dataset training is scrutinized by comparing performance when training on a single example (randomly chosen or selected by reward variance) versus the full dataset. The results show:

  • One-shot RL is effective only when model–task alignment is strong. In these cases, both random and selected examples yield substantial improvements, approaching full-dataset performance.
  • In low-alignment settings, one-shot RL fails to drive meaningful learning, regardless of sample selection strategy.
  • Sophisticated sample selection algorithms do not consistently outperform random selection in high-alignment regimes (Figure 5).

    Figure 5: Changes in the two models' accuracy during training. If the initial rollout accuracy is non-zero, both models rapidly fit the employed samples ($l_{simple}$, $l_{mid}$) and exhibit generalization within the same subtask; however, no generalization to puzzles of other types is observed.
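
A minimal sketch of reward-variance-based example selection for one-shot RL is shown below. It assumes per-example rollout rewards are precomputed, and the maximum-variance criterion is an illustration of the general idea rather than the exact algorithm used in prior one-shot RL work.

```python
import statistics

def select_one_shot_example(rollout_rewards: dict) -> str:
    """Pick the training example whose rollout rewards have the highest variance.

    rollout_rewards maps example_id -> list of rewards (e.g., 0/1 correctness)
    from several rollouts of the pretrained policy. High variance means the model
    sometimes solves the problem and sometimes fails, which is where a
    policy-gradient update has the most signal.
    """
    return max(rollout_rewards, key=lambda ex: statistics.pvariance(rollout_rewards[ex]))

# Hypothetical rollout results for three candidate problems:
rewards = {"p1": [0, 0, 0, 0], "p2": [1, 0, 1, 0], "p3": [1, 1, 1, 1]}
print(select_one_shot_example(rewards))  # "p2": mixed outcomes -> highest variance
```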

Negative-Only and Positive-Only RL Signals

The paper further examines the effectiveness of negative-only (NSR) and positive-only (PSR) RL signals. The key findings are:

  • In high-alignment settings, both NSR and PSR can recover most of the performance gains of full-signal RL.
  • In low-alignment settings, PSR outperforms NSR, and NSR often fails to improve over baseline.
  • Negative-only RL maintains higher entropy and exploration, but this does not translate to improved accuracy in challenging domains (Figure 6).

    Figure 6: Entropy dynamics of Qwen2.5-7B during training. NSR maintains the exploration space of reinforcement learning, but a larger exploration space is not always favorable, as seen on the logical tasks.
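
One common way to realize positive-only (PSR) and negative-only (NSR) training is to mask group-normalized advantages by reward sign before the policy-gradient update. The GRPO-style sketch below is a schematic illustration under that assumption, not the paper's implementation.

```python
import numpy as np

def masked_advantages(rewards: np.ndarray, mode: str = "full") -> np.ndarray:
    """Group-normalized advantages with optional sign masking.

    rewards: 1-D array of per-rollout rewards for one prompt (e.g., 0/1 correctness).
    mode: "full" keeps all rollouts, "psr" keeps only positive-reward rollouts,
          "nsr" keeps only zero-reward (incorrect) rollouts.
    """
    adv = rewards - rewards.mean()             # group-relative baseline
    std = rewards.std()
    if std > 0:
        adv = adv / std
    if mode == "psr":
        adv = np.where(rewards > 0, adv, 0.0)   # train only on successful rollouts
    elif mode == "nsr":
        adv = np.where(rewards == 0, adv, 0.0)  # train only on failed rollouts
    return adv

# Example: 3 correct and 5 incorrect rollouts for one prompt
r = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
print(masked_advantages(r, mode="nsr"))
```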

Implications and Theoretical Significance

The results collectively indicate that many of the celebrated RL phenomena in LLMs—robustness to reward noise, one-shot learning, and negative-only signal sufficiency—are not general properties of RL, but rather artifacts of strong model–task alignment. In these cases, RL serves primarily as a mechanism for capability elicitation rather than genuine skill acquisition. For unfamiliar or misaligned tasks, standard RL with accurate rewards and sufficient data remains necessary.

This has several implications:

  • Evaluation of RL methods in LLMs must control for model–task alignment to avoid overgeneralizing from high-alignment cases.
  • Resource allocation strategies should consider whether to invest in pretraining/mid-training for domain-specific capabilities or in RL with high-quality rewards and data.
  • Future research should focus on developing RL techniques that are effective in low-alignment regimes, where true generalization and reasoning skill acquisition are required.

Conclusion

This work establishes model–task alignment, as measured by pass@k, as the principal determinant of when counterintuitive RL phenomena manifest in LLM reasoning. The findings challenge the universality of recent RL claims and provide a rigorous framework for interpreting RL outcomes in LLMs. Theoretical and practical advances in RL for LLMs will require explicit consideration of alignment, with future work needed to develop methods that can drive learning in genuinely novel domains.
