- The paper introduces the LADDER framework that decomposes complex reasoning tasks into manageable sub-problems for autonomous LLM self-improvement.
- It integrates reinforcement learning with test-time updates to boost performance on mathematical integration from 1% to over 80% pass rates.
- The framework’s dynamic variant generation and rigorous numerical verification provide an unsupervised, scalable route to higher-precision analytical reasoning.
Overview
The LADDER framework introduces a self-improvement paradigm for LLMs by leveraging recursive problem decomposition and automated reinforcement learning. The core idea is to decompose complex reasoning tasks into a tree of progressively simpler sub-problems, enabling autonomous performance enhancements without curated datasets or human supervision. This approach exploits a self-guided learning mechanism by recursively generating, solving, and verifying variants of complex problems, leading to substantial accuracy improvements in domains such as mathematical integration.
Methodology
Recursive Problem Decomposition
LADDER employs a systematic strategy for variant generation in which the original problem is decomposed into a hierarchical tree of progressively simpler instances. Two aspects of the decomposition are central: how variants are generated and how candidate solutions are verified.
The framework generates batches of variant problems using a diverse set of transformation operators. Temperature cycling (between 0.8 and 1.4) and persona-based prompts (e.g., "think like Euler") are applied to increase diversity and ensure appropriate difficulty scaling, while recursion depth is capped (typically at three levels) so that transformed instances remain relevant to the original problem. The authors report that approximately 8% of generated variants are unsolvable, owing to the sensitivity of the underlying problem structure.
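As a rough illustration, the loop below sketches how such recursive variant generation might be organized; the `llm.complete` interface, the persona list, and the branching factor are assumptions made for the example rather than details taken from the paper.

```python
import itertools
import random

PERSONAS = ["think like Euler", "think like a careful numerical analyst"]  # illustrative
TEMPERATURES = [0.8, 1.0, 1.2, 1.4]  # cycled to diversify generations
MAX_DEPTH = 3                        # cap recursion so variants stay on-topic


def build_variant_tree(problem, llm, depth=0, branching=4):
    """Recursively ask the model for simpler variants of `problem`.

    Returns a nested dict {"problem": ..., "children": [...]}; `llm.complete`
    stands in for whichever text-generation API is actually used.
    """
    node = {"problem": problem, "children": []}
    if depth >= MAX_DEPTH:
        return node

    temps = itertools.cycle(TEMPERATURES)
    for _ in range(branching):
        persona = random.choice(PERSONAS)
        prompt = (
            f"{persona}. Rewrite the following integral as a strictly simpler "
            f"but structurally related problem:\n{problem}"
        )
        variant = llm.complete(prompt, temperature=next(temps))
        # A fraction of variants (roughly 8% per the paper) come out malformed
        # or unsolvable; a cheap sanity check filters the obvious failures.
        if variant and variant.strip() and variant != problem:
            node["children"].append(
                build_variant_tree(variant, llm, depth + 1, branching)
            )
    return node
```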
Verification is performed using a robust numerical integration protocol that relies on multi-point evaluations. Correctness is determined by comparing the numerical solution with reference estimates, with a tolerance threshold set to a relative difference of 10⁻². The verification mechanism incorporates singularity detection and timeout management (2 seconds per attempt) to mitigate the effects of degenerate or edge cases.
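A minimal sketch of this style of check, under simplifying assumptions: the candidate antiderivative and the original integrand are plain Python callables, agreement is tested by comparing a finite-difference derivative of the candidate against the integrand at randomly sampled points, and non-finite evaluations are skipped as a crude singularity screen. The function names and sampling scheme are illustrative, not the paper's exact protocol.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def numerically_equivalent(candidate_F, integrand_f, domain=(-10.0, 10.0),
                           n_points=20, rel_tol=1e-2, h=1e-5):
    """Check that d/dx candidate_F(x) matches integrand_f(x) at sampled points."""
    lo, hi = domain
    informative = 0
    for _ in range(n_points):
        x = random.uniform(lo, hi)
        try:
            f_val = integrand_f(x)
            # Central-difference derivative of the proposed antiderivative.
            dF = (candidate_F(x + h) - candidate_F(x - h)) / (2 * h)
        except (ValueError, ZeroDivisionError, OverflowError):
            continue  # near a singularity or outside the function's domain
        if not (math.isfinite(f_val) and math.isfinite(dF)):
            continue
        if abs(dF - f_val) / max(abs(f_val), 1e-8) > rel_tol:
            return False
        informative += 1
    return informative > 0  # require at least one usable comparison point


def verify_with_timeout(candidate_F, integrand_f, seconds=2.0):
    """Enforce the per-attempt time budget; a hung evaluation counts as a failure."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(numerically_equivalent, candidate_F, integrand_f)
    try:
        return future.result(timeout=seconds)
    except TimeoutError:
        return False
    finally:
        # Don't block on a runaway evaluation; the worker thread is simply abandoned.
        pool.shutdown(wait=False)
```

For example, `verify_with_timeout(lambda x: -math.cos(x), math.sin)` returns True, while an unrelated candidate antiderivative fails the sampled-point comparison.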
Reinforcement Learning Integration
Central to LADDER is the integration of reinforcement learning (RL) via Group Relative Policy Optimization (GRPO). The RL component is designed in two distinct phases:
- Training with GRPO:
During training, the model is exposed to the generated variant trees. A rule-based reward model quantifies performance based on two signals: solution accuracy (derived from the numerical verification process) and answer formatting (consistency with the expected delimiters, e.g., <ANSWER></ANSWER>). A KL divergence coefficient of 0.001 regulates policy updates, constraining how far each update can move the policy from the reference model. A sketch of this reward design appears after this list.
- Test-Time Reinforcement Learning (TTRL):
TTRL extends the training process to inference time: reinforcement learning is performed on the fly by generating and sampling variants of each test problem. This dynamic adjustment provides localized optimization tailored to the specific test instance. TTRL has been shown to raise performance substantially by scaling compute at test time, with the parameter updates made for one problem not carried over to subsequent problems.
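To make the reward design concrete, the sketch below combines the two signals into a rule-based reward and shows the group-relative advantage normalization at the heart of GRPO. The specific weights (0.1 for formatting, 1.0 for a verified answer) and the helper names are assumptions for illustration; the KL penalty with coefficient 0.001 would enter the policy loss rather than this reward.

```python
import re

ANSWER_RE = re.compile(r"<ANSWER>(.*?)</ANSWER>", re.DOTALL)


def rule_based_reward(completion: str, verify_fn) -> float:
    """Score one sampled completion using the two signals described above.

    `verify_fn` maps the extracted answer string to True/False, e.g. a wrapper
    around the numerical check sketched earlier. Weights are illustrative.
    """
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0                   # no parsable answer at all
    reward = 0.1                     # formatting: answer inside the expected tags
    if verify_fn(match.group(1).strip()):
        reward += 1.0                # accuracy: passes numerical verification
    return reward


def grpo_advantages(rewards):
    """Group-relative advantages: normalize rewards within one group of samples.

    GRPO scores each completion relative to the others drawn for the same
    problem, which removes the need for a separate learned value function.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 0 else 1.0    # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Under TTRL, the same reward and advantage computation would drive updates on variants generated for the test problem itself, with those updates discarded once the problem is answered.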
Experimental Results
Mathematical Integration Tasks
LADDER was evaluated primarily on mathematical integration problems, demonstrating impressive empirical improvements:
The framework improved the model's accuracy on undergraduate-level integration problems from a baseline of 1% (pass@1) to 82%. Traditional reinforcement learning approaches without variant generation failed to achieve comparable improvements, underscoring the importance of the difficulty gradient introduced by recursive problem decomposition.
With standard LADDER, the model achieved a 73% pass rate on the MIT Integration Bee qualifying examination. When combined with TTRL, performance increased to a state-of-the-art 90%, outstripping larger models such as OpenAI o1. This performance enhancement is especially significant given that these improvements were realized without increasing the scale of the underlying model architectures.
Performance Considerations
TTRL scales computation at test time in a way that is highly parallelizable, in contrast to methods that scale compute through longer sequential token generation. This lets the model adapt its inference strategy to each problem while limiting the risk of overfitting or degraded generalization.
The reliance on a numerical verification framework ensures that the approach is applicable even in domains where symbolic solvers are unavailable. The trade-off is sensitivity to the chosen tolerance and the approximation error inherent in numerical integration.
Discussion and Implications
LADDER's recursive problem decomposition represents a strategic shift in model self-improvement. By autonomously generating a hierarchy of problem variants, the framework circumvents the need for external supervision and hand-curated data. The integration of GRPO and TTRL enables both continual improvement during training and dynamic adjustment at test time, leading to significant performance gains.
The quantitative results demonstrate the potential of introducing self-directed learning mechanisms into LLMs, particularly in domains requiring high-precision reasoning such as mathematical integration. The framework's ability to boost performance from single-digit percentages to near state-of-the-art levels without increasing model size emphasizes the efficiency of leveraging strategic compute scaling and task-specific reinforcement learning.
Future work could explore extending LADDER to other formal reasoning domains, incorporating more complex variant generation strategies, and refining the numerical verification protocols. The recursive problem decomposition strategy, augmented by test-time reinforcement learning, may serve as a blueprint for future advancements in autonomous model improvement and self-directed learning paradigms.
Limitations and Future Directions
While LADDER demonstrates strong numerical improvements, several challenges remain:
- Variant Quality:
The generation of unsolvable or degenerate variants, though limited to a small fraction of cases, requires further refinement. Enhanced filtering and adaptive difficulty scaling mechanisms could be explored to mitigate this issue.
- Test-Time Compute Overhead:
TTRL introduces additional compute at inference time, which, while highly parallelizable, may lead to increased latency in real-time applications. Optimizing the balance between variant generation and reinforcement learning updates is critical for deployment in latency-sensitive environments.
- Generality of Verification Methods:
The current numerical verification is specifically tailored to integration tasks. Extending this to other reasoning or formal tasks would necessitate domain-specific verification tools capable of handling idiosyncratic computational challenges.
In summary, LADDER and its test-time reinforcement learning component represent a robust, self-improving framework for LLMs, capable of substantial performance gains on challenging tasks. The methodological innovations and strong empirical results underline its relevance for both theoretical research and practical deployment in high-stakes problem-solving environments.