
TTRL: Test-Time Reinforcement Learning (2504.16084v3)

Published 22 Apr 2025 in cs.CL and cs.LG

Abstract: This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in LLMs. The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the maj@n metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model maj@n, and approach the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL

Summary

  • The paper introduces Test-Time Reinforcement Learning (TTRL) to train LLMs using unlabeled test data with majority voting for reward signals.
  • It applies RL algorithms like GRPO and PPO to boost performance, achieving up to a 159.3% improvement on mathematical reasoning benchmarks.
  • The approach enhances model generalization on out-of-distribution tasks while underscoring the method’s sensitivity to hyperparameter settings and base model capacities.

This paper introduces Test-Time Reinforcement Learning (TTRL), a novel framework designed to train LLMs using Reinforcement Learning (RL) on data without explicit ground-truth labels during the test phase. The core challenge addressed is reward estimation when ground truth is unavailable at inference time. TTRL leverages the insight that common Test-Time Scaling (TTS) techniques, specifically majority voting, can provide sufficiently effective reward signals to drive RL training.

TTRL allows LLMs to undergo self-evolution by utilizing the priors present in the pre-trained models. This approach is particularly relevant for handling novel or distribution-shifted data encountered during deployment, where obtaining large amounts of labeled data for traditional RL fine-tuning is impractical.

Methodology

The TTRL process works as follows:

  1. Given an input prompt (state) $x$, the LLM generates a set of $N$ candidate outputs $\{y_1, y_2, \ldots, y_N\}$ by sampling from its policy $\pi_\theta(y \mid x)$.
  2. An answer extraction step is applied to each output to obtain predicted answers $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N\}$.
  3. A consensus output $y^*$ is determined from these predicted answers. The primary method used is majority voting, where $y^*$ is the most frequently occurring predicted answer among the $N$ samples.
  4. A rule-based reward $r(y_i, y^*)$ is computed for each sampled output $y_i$ by comparing its extracted answer $\hat{y}_i$ to the consensus answer $y^*$. The paper defines a simple reward function: $R(\hat{y}_i, y^*) = 1$ if $\hat{y}_i = y^*$, and $0$ otherwise.
  5. The RL objective is to maximize the expected reward $\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[r(y, y^*)]$. Model parameters $\theta$ are updated using gradient ascent to increase the probability of generating outputs that match the majority-voted consensus.
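
A quick way to see what this objective asks of the optimizer is a score-function (REINFORCE-style) estimate of its gradient over the $N$ samples, $\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[r(y, y^*)] \approx \frac{1}{N} \sum_{i=1}^{N} r(y_i, y^*) \, \nabla_\theta \log \pi_\theta(y_i \mid x)$. This is only an illustrative simplification; the paper optimizes the objective with GRPO (and PPO), which add advantage estimation and clipping on top of this basic update direction.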

The majority-voting reward function can be implemented as in the following Python snippet:

from collections import Counter

def majority_voting_reward_fn(outputs, answer_extractor):
    """
    Assigns a reward of 1 to each output whose extracted answer matches the majority answer, otherwise 0.
    """
    # Extract answers from each output
    answers = [answer_extractor(output) for output in outputs]

    # Find the majority answer
    counts = Counter(answers)
    if not counts: # Handle case with no outputs
        return [0] * len(outputs)
    majority_answer, _ = counts.most_common(1)[0]

    # Assign rewards: 1 if matches majority, else 0
    rewards = [1 if ans == majority_answer else 0 for ans in answers]
    return rewards

This approach converts a consensus derived from multiple model generations into a supervisory signal for RL, enabling training on unlabeled data.
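
To show how the consensus reward plugs into an RL update, the following is a minimal, self-contained sketch of one TTRL-style step. It is an illustration under stated assumptions, not the paper's implementation: ToyPolicy, ttrl_step, and the canned answer strings are invented stand-ins, the policy is a small categorical distribution rather than an LLM, and the update is a plain REINFORCE surrogate with GRPO-style group normalization. It reuses majority_voting_reward_fn from the snippet above.

import torch

# Toy answer space standing in for free-form LLM generations.
CANDIDATE_ANSWERS = ["42", "41", "40", "39"]

class ToyPolicy(torch.nn.Module):
    """Stand-in for an LLM policy: a categorical distribution over canned answers."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_actions))

    def sample(self, n_samples: int):
        dist = torch.distributions.Categorical(logits=self.logits)
        actions = dist.sample((n_samples,))
        return actions, dist.log_prob(actions)

def ttrl_step(policy, optimizer, n_samples=64):
    # 1) Sample N candidate outputs from the current policy.
    actions, log_probs = policy.sample(n_samples)
    outputs = [CANDIDATE_ANSWERS[a] for a in actions.tolist()]

    # 2) Majority-vote a pseudo-label and reward each sample against it.
    rewards = majority_voting_reward_fn(outputs, answer_extractor=lambda s: s)
    rewards = torch.tensor(rewards, dtype=torch.float32)

    # 3) GRPO-style group-normalized advantages over the sampled group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 4) REINFORCE-style surrogate loss; gradient ascent on expected reward.
    loss = -(advantages * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()

policy = ToyPolicy(len(CANDIDATE_ANSWERS))
optimizer = torch.optim.AdamW(policy.parameters(), lr=5e-7)
for step in range(3):
    print(f"step {step}: mean pseudo-reward = {ttrl_step(policy, optimizer):.2f}")

In the actual setup, the sampled outputs are full reasoning traces from the LLM, answer_extractor parses the final answer from each trace, and the loss and rollout handling come from the GRPO/PPO implementation rather than this bare surrogate.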

Implementation Details

The authors implemented TTRL using the GRPO algorithm and also demonstrated compatibility with PPO. Experiments were conducted on mathematical reasoning benchmarks: AIME 2024, AMC, and MATH-500. Backbone models included Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and LLaMA-3.1-8B-Instruct.

Key implementation hyperparameters and strategies (consolidated into a configuration sketch after this list) included:

  • Using a constant learning rate of $5 \times 10^{-7}$ with the AdamW optimizer.
  • Sampling $N = 64$ responses ($32$ for MATH-500) with temperature $1.0$ for the majority voting step (label estimation).
  • Downsampling to $16$ responses per prompt for actual RL training rollouts (vote-then-sample strategy to reduce computation).
  • Setting the maximum generation length to $3072$ tokens.
  • Setting the KL coefficient to $0$.
  • Training for a fixed number of episodes (40 for MATH, 50 for AMC, 60 for AIME) based on dataset size and complexity.
  • Using optimizations like bf16, adam_offload, gradient_checkpointing, packing_samples, and flash_attn.
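
For reference, these settings can be gathered into a single configuration sketch. The key names below are this summary's own shorthand, not necessarily the option names used in the official TTRL scripts:

ttrl_config = {
    "learning_rate": 5e-7,           # constant, AdamW
    "num_votes": 64,                 # samples for majority voting; 32 for MATH-500
    "rollouts_per_prompt": 16,       # downsampled for RL training (vote-then-sample)
    "temperature": 1.0,
    "max_generation_length": 3072,   # tokens
    "kl_coefficient": 0.0,
    "episodes": {"MATH-500": 40, "AMC": 50, "AIME-2024": 60},
    "efficiency_options": ["bf16", "adam_offload", "gradient_checkpointing",
                           "packing_samples", "flash_attn"],
}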

Experimental Results and Practical Implications

The experiments show that TTRL consistently improves performance across tested tasks and models, despite using only unlabeled test data.

  • Performance Gains: On AIME 2024, TTRL boosted the pass@1 performance of Qwen2.5-Math-7B by a substantial 159.3% (from 16.7 to 43.3). Across AIME, AMC, and MATH-500, Qwen2.5-Math-7B saw an average improvement of 84.1%. This demonstrates the efficacy of self-evolution through TTRL.
  • Scaling: Performance gains were more significant with the larger 7B model compared to the 1.5B model, suggesting TTRL benefits from greater model capacity to produce more accurate majority voting rewards.
  • Generalization: Applying TTRL on one benchmark (e.g., AIME 2024) led to performance improvements on the other benchmarks (AMC, MATH-500) even though training was not done on them (out-of-distribution evaluation). This indicates that TTRL fosters generalizable reasoning abilities rather than just overfitting to the test set structure.
  • RL Algorithm Compatibility: TTRL was shown to be compatible with different RL algorithms like PPO, achieving similar performance trajectories to GRPO.
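
As a sanity check on how these percentages are computed, the headline AIME 2024 number is a relative gain over the backbone's starting score: $(43.3 - 16.7)/16.7 \approx 1.593$, i.e., roughly a 159.3% improvement.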

Discussions: Why TTRL Works and When it Might Fail

A surprising finding is that TTRL can surpass the performance achieved by simply using majority voting on the initial model outputs (Maj@N). This suggests the model learns to generate better solutions than its initial capability to form a consensus, essentially "lifting itself by its own bootstraps." Furthermore, TTRL's performance on MATH-500 was shown to closely approach that of performing RL directly on the test data with access to ground-truth labels, highlighting its efficiency in an unsupervised setting.

The effectiveness of TTRL, even with potentially inaccurate pseudo-labels from majority voting, is attributed to:

  1. RL's Robustness to Reward Noise: RL algorithms can tolerate a degree of reward inaccuracy and primarily use rewards as directional signals for exploration, making them less reliant on perfect labels compared to supervised fine-tuning.
  2. Denser Reward Signals: Rewards (match/no-match) are assigned per generated output based on the estimated label, providing a denser signal than a single ground-truth label for the entire problem. Even if the estimated label is incorrect, many incorrect generations might still receive a correct "negative" reward (0) because they don't match the incorrect estimated label.
  3. Potentially More Accurate Rewards from Weaker Models: Counterintuitively, when a model is weak, its diverse and often incorrect outputs can lead to a higher reward accuracy (proportion of rewards matching ground-truth rewards) even if the label accuracy (estimated label matching ground truth) is low. This is because most incorrect answers correctly receive a reward of 0 relative to the (potentially incorrect but consistent) estimated label. The paper shows high initial reward accuracy (92% on AIME 2024) despite low label accuracy (20-50%).
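
To make the third point concrete, here is a toy calculation with invented numbers (not the paper's data), in which the majority-voted pseudo-label is wrong yet most samples still receive the same reward they would get under the true label:

from collections import Counter

# 16 sampled answers from a weak model; the true answer "7" is almost never produced.
samples = ["3"] * 5 + ["5"] * 3 + ["9"] * 3 + ["11"] * 2 + ["2"] * 2 + ["7"]
true_answer = "7"

# Majority-voted pseudo-label is "3", which is wrong (label accuracy = 0 here).
pseudo_label, _ = Counter(samples).most_common(1)[0]

# Reward each sample against the pseudo-label and against the ground truth.
pseudo_rewards = [int(s == pseudo_label) for s in samples]
true_rewards = [int(s == true_answer) for s in samples]

# Reward accuracy: fraction of samples whose pseudo-reward equals the true reward.
# Wrong answers that differ from the (wrong) pseudo-label still correctly receive 0.
reward_accuracy = sum(p == t for p, t in zip(pseudo_rewards, true_rewards)) / len(samples)
print(pseudo_label, reward_accuracy)  # 3 0.625 -> 10 of 16 rewards are still correct

The more the incorrect answers disagree with one another, the higher this reward accuracy climbs, which is consistent with the roughly 92% initial reward accuracy the paper reports on AIME 2024 despite 20-50% label accuracy.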

TTRL is not without limitations, largely inheriting challenges from standard RL:

  • Lack of Prior Knowledge: TTRL relies heavily on the base model's prior knowledge. If the test data is significantly more complex or out-of-distribution than what the base model is capable of, TTRL may fail to achieve meaningful gains. Experiments on MATH-500 difficulty levels showed diminishing returns on harder problems.
  • Sensitivity to Hyperparameters: Like all RL methods, TTRL is sensitive to hyperparameters. Suboptimal settings (e.g., temperature, batch size, number of episodes) can lead to training instability or failure. High temperature is needed for sufficient exploration, and more episodes are needed for smaller, more difficult datasets.

Terminology

The paper clarifies the relationship between Test-Time Scaling (TTS), Test-Time Training (TTT), and Test-Time Inference (TTI). TTS is the overall concept of increasing compute at test time. This can be split into TTT (adapting model parameters on test data, like TTRL) and TTI (using increased compute for inference-time techniques with fixed parameters, like Majority Voting or Best-of-N). TTRL is a form of TTT.

Limitations and Future Work

Current limitations include the need for more in-depth analysis of the impact of prior knowledge and hyperparameter configurations. Future research directions proposed include:

  • Developing theoretical convergence guarantees for TTRL.
  • Extending TTRL to online learning settings with continuously streaming data (Test-Time Adaptation).
  • Scaling TTRL to much larger datasets and models for self-supervised RL training.
  • Applying TTRL to more complex domains like agentic tasks and multi-step scientific discovery where ground truth is often unavailable.

Overall, TTRL presents a practical method for improving LLMs on unlabeled data at test time, showcasing the potential of self-supervised RL driven by consensus-based rewards.
