Deep Self-Evolving Reasoning (DSER)
- DSER is a probabilistic, iterative framework in which models improve their answers through self-generated verification and refinement, with transitions between correct and incorrect solutions modeled as a two-state Markov process.
- It employs parallel long-horizon rollouts to mitigate weak verification in compact open-weight models, significantly boosting problem-solving accuracy.
- Empirical results show that DSER enables smaller models to tackle challenging benchmarks and even outperform larger teacher models through majority-vote aggregation.
Deep Self-Evolving Reasoning (DSER) refers to a probabilistic and algorithmic paradigm in which a reasoning system, most often an LLM or deep network, iteratively improves its solutions by self-verification and refinement. Unlike static reasoning or fixed chain-of-thought approaches, DSER operates over long horizons, treating each reasoning/refinement step as a stochastic process, typically formalized as a Markov chain in the solution space. This paradigm enables convergence towards correct answers even for highly complex tasks, provided the probability of iterative improvement exceeds that of degradation. DSER has been shown to substantially enhance the capabilities of compact, open-weight models, extending their problem-solving range well beyond their single-turn accuracy and providing both a practical tool and a diagnostic lens for limitations in current self-verification and refinement processes (Liu et al., 20 Oct 2025).
1. Probabilistic Foundation and Markov Chain Formalism
DSER conceptualizes the reasoning process as a discrete-time Markov chain, where each step is a potentially stochastic transition between solution states: "Correct" (C) and "Incorrect" (I). The fundamental assumption is that, after each reasoning attempt or refinement, the new state depends only on the present (not the full history), consistent with Markovian dynamics. Specifically, for a reasoning problem posed as $x$, let $y_0$ denote the initial candidate solution, typically produced by an LLM:

$$y_0 \sim \pi_\theta(\cdot \mid x).$$

At each iteration $t$, this candidate is subject to verification (e.g., via an LLM-generated critique $v_t$) and then refinement, yielding $y_{t+1}$:

$$y_{t+1} \sim \pi_\theta(\cdot \mid x, y_t, v_t).$$

Transitions between "Correct" and "Incorrect" form a two-state Markov process with transition probabilities $p = \Pr(\mathrm{I} \to \mathrm{C})$ and $q = \Pr(\mathrm{C} \to \mathrm{I})$, yielding the transition matrix

$$P = \begin{pmatrix} 1 - q & q \\ p & 1 - p \end{pmatrix},$$

with rows and columns ordered (C, I). The stationary probability of being in the correct state is

$$\pi_C = \frac{p}{p + q}.$$

If $p > q$, repeated long-horizon reasoning trajectories will converge to predominantly correct solutions.
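To make this dynamic concrete, the following is a minimal simulation sketch, assuming nothing beyond the two transition probabilities $p$ and $q$ defined above; the empirical fraction of time spent in the Correct state approaches $p/(p+q)$:

```python
import random

def simulate_dser_chain(p, q, steps=10_000, start_correct=False, seed=0):
    """Simulate the idealized two-state (Correct/Incorrect) DSER chain.

    p: probability that one verify-refine step repairs an incorrect solution (I -> C).
    q: probability that the step corrupts a correct solution (C -> I).
    Returns the empirical fraction of steps spent in the Correct state.
    """
    rng = random.Random(seed)
    correct = start_correct
    time_correct = 0
    for _ in range(steps):
        if correct:
            correct = rng.random() >= q   # stays correct with probability 1 - q
        else:
            correct = rng.random() < p    # gets repaired with probability p
        time_correct += correct
    return time_correct / steps

p, q = 0.12, 0.05                    # weakly self-corrective regime with p > q
print(simulate_dser_chain(p, q))     # empirical occupancy of the Correct state
print(p / (p + q))                   # stationary value pi_C = p / (p + q), about 0.706
```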
2. Iterative Self-Verification and Refinement
A critical distinction in DSER is the explicit modeling of the verification–refinement cycle. Solutions are not accepted at face value; instead, after each generation, an autonomous or model-internal verification phase is triggered (e.g., via a specialized prompt or self-critique module), and the observed deficiencies or errors inform the subsequent refinement step. Even if each individual step is only mildly better than random (i.e., $p$ exceeds $q$ only marginally), the process, through repeated iteration or parallelism, "stochastically wanders towards correctness." There is no requirement that each verification or refinement step be perfect or reliable; only that, in aggregate, improvement occurs with non-zero probability.
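One trajectory of this cycle can be sketched as below; `generate`, `verify`, and `refine` are hypothetical placeholders for LLM calls (draft, self-critique, revision), not functions from the paper or any specific library:

```python
def dser_rollout(problem, generate, verify, refine, max_iters=64):
    """One long-horizon DSER trajectory: draft a solution, then repeatedly
    self-verify and refine it. The verifier is noisy, so its approval is a
    stopping heuristic, not ground truth."""
    solution = generate(problem)                             # y_0
    for _ in range(max_iters):
        looks_correct, critique = verify(problem, solution)  # self-generated verification v_t
        if looks_correct:
            break                                            # accept, possibly wrongly
        solution = refine(problem, solution, critique)       # y_{t+1} conditioned on x, y_t, v_t
    return solution
```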
This self-evolving dynamic is critical in real-world scenarios where the verification and correction modules in open-weight, smaller-scale LLMs are known to be much weaker than in proprietary or massive models. DSER leverages long-horizon and parallel rollouts to amplify marginally positive tendencies, effectively overcoming high error rates and limited single-step verification capacities.
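A back-of-the-envelope calculation illustrates this amplification: if each rollout independently ends with a correct answer with probability $\pi_C$ only slightly above one half (the independence assumption here is for illustration only), the probability that a majority of N rollouts is correct climbs steadily with N:

```python
from math import comb

def majority_correct_prob(pi_c, n):
    """Probability that a strict majority of n independent rollouts is correct,
    assuming each rollout is correct with probability pi_c (odd n avoids ties)."""
    return sum(comb(n, k) * pi_c**k * (1 - pi_c)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 9, 27, 81):
    print(n, round(majority_correct_prob(0.55, n), 3))
# roughly 0.55, 0.62, 0.70, 0.82: a marginal per-rollout edge compounds with scale
```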
3. Empirical Results and Test-Time Scaling
The DSER paradigm has been empirically validated on highly challenging mathematical benchmarks such as AIME 2024–2025 (Liu et al., 20 Oct 2025). When applied to the DeepSeek-R1-0528-Qwen3-8B model—a compact, open-weight 8B-parameter LLM—DSER enabled the solution of 5/9 previously unsolvable Olympiad-level problems. By executing many parallel long-horizon DSER rollouts (as many as 80 per hard instance), the framework achieved majority-voting correctness in aggregate, boosting Pass@1 by 6.5–9.0% over the baseline. Notably, when output aggregation via majority vote was employed, the DSER-augmented model exceeded the single-turn accuracy of its original 600B-parameter teacher model.
Evaluation utilized both pointwise and aggregated metrics:
- Avg@K (Average Accuracy): Mean Pass@1 accuracy over K independent runs.
- Cons@K (Consistency Accuracy): Majority-vote correctness over K parallel DSER trajectories.
Successful convergence was observed even when individual verification steps were notably noisy, provided the net improvement probability remained positive.
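As a concrete reading of the two metrics, here is a minimal sketch; the answer format and exact matching rule are illustrative assumptions, not the paper's evaluation harness:

```python
from collections import Counter

def avg_at_k(final_answers, reference):
    """Avg@K: mean Pass@1 over K runs, i.e. the fraction of runs whose final
    answer matches the reference."""
    return sum(ans == reference for ans in final_answers) / len(final_answers)

def cons_at_k(final_answers, reference):
    """Cons@K: whether the majority-vote (most frequent) answer across K
    parallel DSER trajectories matches the reference."""
    majority_answer, _ = Counter(final_answers).most_common(1)[0]
    return float(majority_answer == reference)

runs = ["912", "912", "407", "912", "912"]   # hypothetical final answers from K = 5 rollouts
print(avg_at_k(runs, "912"))                 # 0.8
print(cons_at_k(runs, "912"))                # 1.0: the vote is correct even though one run failed
```

This gap between the two metrics is also how majority-vote aggregation can exceed per-run accuracy, as in the teacher-surpassing result above.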
4. Diagnostic and Theoretical Implications
DSER not only extends the practical reasoning capabilities of LLMs but also serves as a diagnostic lens for their intrinsic limitations. Because the stationary distribution and convergence speed are tightly linked to the quality of internal verification and correction, DSER exposes situations where weak verification (a high $q$ or low $p$) or unstable refinement impedes performance even with many rollouts; a sketch of estimating these quantities from labeled trajectories follows the list below. This provides a pathway to both:
- Benchmark and compare models in terms of their intrinsic self-evolving tendencies.
- Motivate research on architectures or training schemes that improve the reliability and stability of internal self-critique modules.
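Under the two-state abstraction, this diagnostic view can be made operational by estimating $\hat{p}$ and $\hat{q}$ from trajectories whose intermediate solutions have been labeled offline (e.g., against ground-truth answers); the labeling step is an illustrative assumption, not part of the DSER procedure itself:

```python
def estimate_transition_probs(trajectories):
    """Estimate p = P(I -> C) and q = P(C -> I) from trajectories of per-step
    correctness labels (each trajectory is a list of booleans, True = correct)."""
    i_to_c = i_total = c_to_i = c_total = 0
    for traj in trajectories:
        for prev, curr in zip(traj, traj[1:]):
            if prev:
                c_total += 1
                c_to_i += not curr
            else:
                i_total += 1
                i_to_c += curr
    p_hat = i_to_c / i_total if i_total else float("nan")
    q_hat = c_to_i / c_total if c_total else float("nan")
    return p_hat, q_hat

trajs = [[False, False, True, True, True],
         [False, True, True, False, True]]   # toy labeled trajectories
p_hat, q_hat = estimate_transition_probs(trajs)
print(p_hat, q_hat)                  # 0.75 0.25 on this toy data
print(p_hat / (p_hat + q_hat))       # implied stationary correctness pi_C
print(1 - (p_hat + q_hat))           # second eigenvalue; closer to 1 means slower convergence
```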
A key implication is that test-time computation and iterative refinement, both central to DSER, can partially compensate for limited model scale, allowing compact models to match or surpass larger models in final solution quality when combined with robust self-evolving loops.
5. Limitations and Research Directions
DSER, by construction, assumes a net-positive improvement probability: if $p \le q$, the process cannot reliably converge to correct answers, irrespective of rollout count. In practice, this requires at least a "weakly self-corrective" base model. The effectiveness and efficiency of DSER are also bottlenecked by the current limits of the verification and correction capabilities of open-weight models; progress depends on future advances in these components.
Open questions identified include:
- How best to incentivize and train next-generation models for strong, stable self-verification and correction cycles.
- How to integrate DSER-inspired iterative reasoning into end-to-end reinforcement learning to reinforce high-quality, self-corrected trajectories.
- How many iterations or parallel rollouts are required for convergence in various task domains, and how this scales with model quality (a rough estimate under the idealized two-state model follows this list).
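For the last question, the idealized two-state model above at least suggests a rough scaling; this is a back-of-the-envelope extrapolation of that formalism, not an empirical result. The deviation from the stationary distribution after $t$ verify-refine steps decays geometrically,

$$\bigl|\Pr(\text{correct at step } t) - \pi_C\bigr| \le |1 - p - q|^{\,t},$$

so reaching a tolerance $\varepsilon$ requires on the order of $t \approx \ln(1/\varepsilon)/(p + q)$ iterations when $p + q$ is small; weaker self-correction therefore demands proportionally longer horizons.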
DSER thus serves both as a toolkit for immediate test-time performance gains and as a foundational concept for developing future models imbued with intrinsic, deep self-evolving reasoning capabilities.