- The paper introduces Self-Harmony, a method that fuses problem-solving and self-reframing to generate consistent pseudo-labels.
- It employs the harmonic mean of answer frequencies from both original and reframed queries to reduce bias and enhance reasoning accuracy.
- Experimental results show state-of-the-art performance in 28 of 30 test configurations on key benchmarks, with reasoning-accuracy gains of up to roughly 31%.
Self-Harmony: Harmonizing Self-Supervision and Self-Play in TTRL
The paper "Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning" explores a novel framework aimed at enhancing test-time reinforcement learning (TTRL) by overcoming limitations found in traditional pseudo-label selection methods like majority voting. Self-Harmony proposes a new paradigm where a single model serves dual roles of solving and reframing a problem to generate pseudo-labels based on harmonic means. This approach is designed to address and correct the model's own reasoning biases without relying on external supervision.
Self-Harmony Framework
Core Mechanisms
Self-Harmony leverages a single LLM to perform two tasks: solving the original problem and producing a paraphrased variant via reframing. In contrast to conventional majority voting, it selects pseudo-labels via the harmonic mean of answer frequencies from the original and reframed problems. This prioritizes solutions that remain consistent across the two formulations, effectively suppressing spurious answers that dominate one view but not the other.
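To make the selection step concrete, the snippet below sketches harmonic-mean pseudo-label selection over two sets of sampled answers. It is a minimal illustration rather than the paper's implementation: the function name mirrors the conceptual flow shown later, but using raw answer counts as frequency estimates and breaking ties via max() are assumptions.

from collections import Counter

def harmonic_mean_selection(y_rollouts, y_prime_rollouts):
    # Answer frequencies under the original and the reframed phrasing.
    freq = Counter(y_rollouts)
    freq_prime = Counter(y_prime_rollouts)

    def score(answer):
        p = freq[answer] / len(y_rollouts)
        q = freq_prime[answer] / len(y_prime_rollouts)
        # The harmonic mean stays near zero if either view rarely produces
        # the answer, so answers that dominate only one view are suppressed.
        return 0.0 if p + q == 0 else 2 * p * q / (p + q)

    candidates = set(y_rollouts) | set(y_prime_rollouts)
    return max(candidates, key=score)

For instance, an answer returned by 9 of 10 original rollouts but none of the reframed ones wins a pooled majority vote (9 vs. 8) yet scores 0 here, while an answer returned 4 times in each view scores 0.4.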
Theoretical Foundations
The theoretical underpinning of Self-Harmony is rooted in the view-invariance principle. For semantically equivalent queries, correct answers should maintain consistent probability across different phrasing styles—this concept guides the pseudo-label selection using the harmonic mean. The harmonic mean operates as a principled regularizer, rewarding solutions robust to both semantic and syntactic variations, aligning with the Infomax principle to enforce representation consistency across distinct views.
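Written out (with notation assumed here for illustration), the selection score for a candidate answer a is the harmonic mean of its empirical frequencies f(a) and f'(a) under the original and reframed phrasings:

    s(a) = 2 * f(a) * f'(a) / (f(a) + f'(a))

Because s(a) collapses toward zero whenever either frequency is small, only answers supported under both phrasings can be selected, which is precisely the view-invariance behavior described above.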
Implementation Details
The practical application of Self-Harmony involves a structured self-play routine. A pre-trained model generates answers for both the original and reframed problems. Subsequently, the harmonic mean is used to identify the most stable answer as the pseudo-label, which is then reinforced through policy optimization. This process is designed to be efficient, reducing computational costs by merging reframing and answering into a single generative step.
Algorithm (Conceptual Flow)
for t in range(T):                                            # test-time training iterations
    for x in batch:                                           # each unlabeled test question
        y_rollouts = generate_rollouts(model, x)              # sample answers to the original question
        x_prime = reframe_question(model, x)                  # model paraphrases its own question
        y_prime_rollouts = generate_rollouts(model, x_prime)  # sample answers to the reframed question
        pseudo_label = harmonic_mean_selection(y_rollouts, y_prime_rollouts)
        model.update(x, pseudo_label)                         # reinforce rollouts that match the pseudo-label
This pseudo-code encapsulates the core iterative process of Self-Harmony, emphasizing the dual-question consistency through the reframing mechanism.
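The policy-update step is left abstract in the pseudo-code. One common way to ground it, shown below as an assumption-laden sketch rather than the paper's exact objective, is to give a binary reward to rollouts whose final answer matches the selected pseudo-label and feed those rewards to a standard policy-gradient update.

def pseudo_label_rewards(rollout_answers, pseudo_label):
    # Reward 1.0 for rollouts that agree with the harmonically selected
    # pseudo-label, 0.0 otherwise; these rewards drive the policy update
    # in place of ground-truth labels.
    return [1.0 if answer == pseudo_label else 0.0 for answer in rollout_answers]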
Experimental Results
The Self-Harmony framework demonstrates superior performance on several reasoning benchmarks, achieving state-of-the-art results in 28 of 30 test configurations. Notably, it markedly enhances the reasoning accuracy of models like Qwen3-4B and Llama-3.1-8B on datasets such as MATH500 and GSM8K, with accuracy gains of up to approximately 31% in some settings.
Robustness and Generalizability
The experiments reveal that Self-Harmony not only enhances accuracy but also provides robust training stability across diverse model architectures and problem domains. Its ability to generalize to different datasets and scale efficiently with model size makes it a versatile tool in TTRL applications.
Implications and Future Work
Self-Harmony offers a scalable way to counter the inherent biases of TTRL systems, which must adapt at test time without labeled data. Its self-contained mechanism aligns well with the goal of building adaptive AI systems that require minimal supervision and remain robust as problem settings evolve. Future work could integrate the framework with other self-supervised learning paradigms to further extend its efficacy and applicability across broader AI research domains.
Conclusion
Self-Harmony represents a substantive contribution to the field of test-time reinforcement learning, addressing critical limitations inherent in existing pseudo-labeling methods. Through a novel application of the harmonic mean for pseudo-label selection and its innovative use of self-play, Self-Harmony advances the capabilities of LLMs in reasoning tasks, setting a new benchmark for state-of-the-art performance in TTRL frameworks.