- The paper introduces Self-Harmony, a method that fuses problem-solving and self-reframing to generate consistent pseudo-labels.
- It employs the harmonic mean of answer frequencies from both original and reframed queries to reduce bias and enhance reasoning accuracy.
- Experimental results show state-of-the-art performance in 28 of 30 test configurations on key benchmarks, with reasoning-accuracy gains of up to roughly 31%.
Self-Harmony: Harmonizing Self-Supervision and Self-Play in TTRL
The paper "Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning" explores a novel framework aimed at enhancing test-time reinforcement learning (TTRL) by overcoming limitations found in traditional pseudo-label selection methods like majority voting. Self-Harmony proposes a new paradigm where a single model serves dual roles of solving and reframing a problem to generate pseudo-labels based on harmonic means. This approach is designed to address and correct the model's own reasoning biases without relying on external supervision.
Self-Harmony Framework
Core Mechanisms
Self-Harmony leverages a single LLM to perform two tasks: solving the original problem and producing a paraphrased variant via reframing. In contrast to conventional majority voting, it selects pseudo-labels via the harmonic mean of answer frequencies from the original and reframed problems. This prioritizes solutions that remain consistent across the two formulations, effectively suppressing spurious answers that dominate one view but not the other.
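To make the selection step concrete, the snippet below sketches harmonic-mean pseudo-label selection over two sets of sampled answers. It is a minimal illustration rather than the paper's implementation: the function name mirrors the conceptual flow shown later, but using raw answer counts as frequency estimates and breaking ties via max() are assumptions.

from collections import Counter

def harmonic_mean_selection(y_rollouts, y_prime_rollouts):
    # Answer frequencies under the original and the reframed phrasing.
    freq = Counter(y_rollouts)
    freq_prime = Counter(y_prime_rollouts)

    def score(answer):
        p = freq[answer] / len(y_rollouts)
        q = freq_prime[answer] / len(y_prime_rollouts)
        # The harmonic mean stays near zero if either view rarely produces
        # the answer, so answers that dominate only one view are suppressed.
        return 0.0 if p + q == 0 else 2 * p * q / (p + q)

    candidates = set(y_rollouts) | set(y_prime_rollouts)
    return max(candidates, key=score)

For instance, an answer returned by 9 of 10 original rollouts but none of the reframed ones wins a pooled majority vote (9 vs. 8) yet scores 0 here, while an answer returned 4 times in each view scores 0.4.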
Theoretical Foundations
The theoretical underpinning of Self-Harmony is rooted in the view-invariance principle. For semantically equivalent queries, correct answers should maintain consistent probability across different phrasing styles—this concept guides the pseudo-label selection using the harmonic mean. The harmonic mean operates as a principled regularizer, rewarding solutions robust to both semantic and syntactic variations, aligning with the Infomax principle to enforce representation consistency across distinct views.
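Written out (with notation assumed here for illustration), the selection score for a candidate answer a is the harmonic mean of its empirical frequencies f(a) and f'(a) under the original and reframed phrasings:

    s(a) = 2 * f(a) * f'(a) / (f(a) + f'(a))

Because s(a) collapses toward zero whenever either frequency is small, only answers supported under both phrasings can be selected, which is precisely the view-invariance behavior described above.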
Implementation Details
The practical application of Self-Harmony involves a structured self-play routine. A pre-trained model generates answers for both the original and reframed problems. Subsequently, the harmonic mean is used to identify the most stable answer as the pseudo-label, which is then reinforced through policy optimization. This process is designed to be efficient, reducing computational costs by merging reframing and answering into a single generative step.
Algorithm (Conceptual Flow)
for t in range(T):                                            # test-time training iterations
    for x in batch:                                           # each unlabeled test question
        y_rollouts = generate_rollouts(model, x)              # sample answers to the original question
        x_prime = reframe_question(model, x)                  # model paraphrases its own question
        y_prime_rollouts = generate_rollouts(model, x_prime)  # sample answers to the reframed question
        pseudo_label = harmonic_mean_selection(y_rollouts, y_prime_rollouts)
        model.update(x, pseudo_label)                         # reinforce rollouts that match the pseudo-label
This pseudo-code encapsulates the core iterative process of Self-Harmony, emphasizing the dual-question consistency through the reframing mechanism.
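The policy-update step is left abstract in the pseudo-code. One common way to ground it, shown below as an assumption-laden sketch rather than the paper's exact objective, is to give a binary reward to rollouts whose final answer matches the selected pseudo-label and feed those rewards to a standard policy-gradient update.

def pseudo_label_rewards(rollout_answers, pseudo_label):
    # Reward 1.0 for rollouts that agree with the harmonically selected
    # pseudo-label, 0.0 otherwise; these rewards drive the policy update
    # in place of ground-truth labels.
    return [1.0 if answer == pseudo_label else 0.0 for answer in rollout_answers]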
Experimental Results
The Self-Harmony framework demonstrates superior performance on several reasoning benchmarks, achieving state-of-the-art results in 28 of 30 test configurations. Notably, it markedly enhances the reasoning accuracy of models like Qwen3-4B and Llama-3.1-8B on datasets such as MATH500 and GSM8K, with accuracy gains of up to approximately 31% in some settings.
Robustness and Generalizability
The experiments reveal that Self-Harmony not only enhances accuracy but also provides robust training stability across diverse model architectures and problem domains. Its ability to generalize to different datasets and scale efficiently with model size makes it a versatile tool in TTRL applications.
Implications and Future Work
Self-Harmony offers a scalable way to counter the inherent biases of TTRL systems, which must adapt at test time without labeled data. Its self-contained mechanism aligns well with the goal of building adaptive AI systems that require minimal supervision and remain robust as problem settings evolve. Future work could integrate the framework with other self-supervised learning paradigms to further extend its efficacy and applicability across broader AI research domains.
Conclusion
Self-Harmony represents a substantive contribution to the field of test-time reinforcement learning, addressing critical limitations inherent in existing pseudo-labeling methods. Through a novel application of the harmonic mean for pseudo-label selection and its innovative use of self-play, Self-Harmony advances the capabilities of LLMs in reasoning tasks, setting a new benchmark for state-of-the-art performance in TTRL frameworks.