Self-Revision Training (SRT)

Updated 17 April 2026

Self-Revision Training (SRT) is a framework that enables models to self-critique and iteratively refine outputs through a closed feedback loop.
It employs multi-stage pipelines that combine supervised fine-tuning, reinforcement learning, and self-distillation, with reported state-of-the-art results on tasks such as instruction following and code reasoning.
SRT minimizes errors and improves model alignment by transforming one-shot generation into iterative improvement with explicit critique, revision, and decision protocols.

Self-Revision Training (SRT) is a framework for augmenting large language models (LLMs) and multimodal models with the explicit ability to critique, refine, and revise their own outputs. SRT transforms the traditionally one-shot generation paradigm into an iterative improvement process, leveraging self-evaluation, structured feedback, and explicit revision to enhance alignment, quality, and robustness. SRT methods are characterized by multi-phase learning pipelines that can employ supervised fine-tuning, reinforcement learning, self-distillation, or a mixture of these, and may incorporate specialized procedures for error localization, preference optimization, or step-level correction. SRT has been instantiated in diverse contexts, including natural language instruction following, math/code reasoning, multimodal grounding, and agentic reinforcement learning, with reported state-of-the-art results on benchmark tasks in these settings.

1. Foundational Principles and Objectives

SRT formalizes model improvement as a process of self-critique followed by revision, creating a closed feedback loop—critique → refinement → update—that operates without reliance on human-in-the-loop corrections. The core premise is to teach models to generate and act upon feedback about their own outputs, enabling “self-alignment” and continuous self-improvement. The general SRT objective is to maximize task-specific reward (e.g., correctness, adherence, informativeness) not just in initial output generation, but also in post-hoc self-revision. SRT differs from conventional supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) in that it creates, evaluates, and learns from revisions internally or via other models, often optimizing over chains of critique and refinement (Hu et al., 2024, He et al., 13 Apr 2026, Kumar et al., 2024, Yuan et al., 20 Jan 2025).

Key SRT objectives include:

Minimizing error rates through targeted self-correction (Kumar et al., 2024)
Improving alignment with instruction constraints and quality signals (Park, 8 Jul 2025)
Densifying sparse supervision into per-token corrective signals (He et al., 13 Apr 2026)
Enhancing reasoning quality by recognizing and mitigating internal flaws (redundancy, disorder, over/under-thinking) (Yao et al., 20 Nov 2025)
Enabling robust, on-the-fly error recovery in interactive agents (Yuan et al., 20 Jan 2025)

2. Iterative Self-Feedback and Self-Revision Loops

Most SRT frameworks employ an explicit multi-turn, multi-stage architecture. At each iteration, a model executes a critique–revise–decide loop over its own outputs, using structured natural-language or token-level feedback.

Self-Feedback Protocol: The model or a learned feedback head critiques initial responses, typically producing weakness analyses, improvement suggestions, and/or quality scores (Hu et al., 2024, Lee et al., 2023).
Revision Protocol: Conditioned on feedback, the model generates a revised output. This can include explicit edit instructions, re-generation, or localized fixes (Park, 8 Jul 2025, Lee et al., 2023).
Decision Protocol: If the revision is determined to be superior (measured by scores, preference models, or critics), it replaces the previous best response; otherwise, iteration halts (Lee et al., 2023, Hu et al., 2024).
Self-Distillation/Optimization: Optionally, revised outputs and their trajectories are distilled into the base model using, for example, direct preference optimization (DPO), KL minimization, or reinforcement learning updates (He et al., 13 Apr 2026, Hu et al., 2024, Yuan et al., 20 Jan 2025).

For example, in the Volcano model, the pipeline implements up to three iterations of feedback–revision–decision, using a single multimodal model for all stages, and supervised multi-head learning for (1) answer, (2) feedback, and (3) revision outputs (Lee et al., 2023). In SRT for LLMs, feedback is either provided by stronger external critics (e.g., GPT-4 Turbo) or generated by the model itself (“self-feedback”), and the resulting system is trained to jointly produce the initial answer, feedback, and refined output (Hu et al., 2024).
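
The critique–revise–decide loop described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any cited system: the names `Feedback`, `self_revise`, `generate`, `critique`, and `revise` are hypothetical, the three callables stand in for model calls, and the stopping rule (halt as soon as a revision fails to beat the current best score) is one assumed reading of the decision protocol.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    """Hypothetical container for a critique: free text plus a scalar quality score."""
    text: str
    score: float


def self_revise(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Feedback],
    revise: Callable[[str, str, Feedback], str],
    max_iters: int = 3,
) -> Tuple[str, List[Tuple[str, Feedback]]]:
    """Run a critique -> revise -> decide loop and return the best response and the full trace."""
    best = generate(prompt)
    best_fb = critique(prompt, best)
    trace = [(best, best_fb)]

    for _ in range(max_iters):
        # Revision step: condition on the current best response and its feedback.
        candidate = revise(prompt, best, best_fb)
        cand_fb = critique(prompt, candidate)
        trace.append((candidate, cand_fb))

        # Decision step: keep the revision only if it scores strictly higher;
        # otherwise stop iterating and return the previous best.
        if cand_fb.score <= best_fb.score:
            break
        best, best_fb = candidate, cand_fb

    return best, trace
```

The returned trace keeps rejected revisions as well as accepted ones, so it could also serve as raw material for the optional distillation step (for instance, as chosen/rejected pairs).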

3. Formal Algorithms and Training Paradigms

SRT has been instantiated through several algorithmic families:

A. Supervised Self-Revision and Distillation

Stage 1: Generate response(s), receive model or critic feedback (weakness, suggestions, score), and produce a refinement. Supervise the model to output the complete chain.
Stage 2: Use the self-revised model to generate more data, filtering on improvements, and optimize preferences via DPO. This enables scaling by making the model its own critic and reviser (Hu et al., 2024).
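
As a rough sketch of Stage 2, the snippet below filters revision records down to those where the revision improved on the initial response, turns them into preference pairs, and evaluates the standard per-example DPO loss on summed sequence log-probabilities. The record layout, the `min_gain` threshold, and the function names are assumptions made for illustration; the source does not specify the filtering rule in this detail.

```python
import math
from typing import List, NamedTuple


class RevisionRecord(NamedTuple):
    """Hypothetical record of one self-revision: both responses and their scores."""
    prompt: str
    initial: str
    revised: str
    initial_score: float
    revised_score: float


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


def build_pairs(records: List[RevisionRecord], min_gain: float = 0.0) -> List[PreferencePair]:
    """Keep only records where the revision improved the score by more than min_gain."""
    return [
        PreferencePair(r.prompt, chosen=r.revised, rejected=r.initial)
        for r in records
        if r.revised_score - r.initial_score > min_gain
    ]


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (policy log-ratio margin over the reference))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals softplus(-m); computed in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

In practice the log-probabilities would come from the policy being trained and a frozen reference copy, and the loss would be averaged over a batch; the scalar version here only shows the shape of the computation.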

B. Reinforcement Learning and Reward Shaping

Self-revision can be implemented as a multi-turn Markov decision process, where the agent is rewarded both for initial correctness and for making effective corrections. SCoRe, for instance, uses a two-turn RL objective with reward bonuses for genuine error correction and explicit turn-wise KL penalties against a reference policy to avoid degenerate strategies (Kumar et al., 2024).
For model parameters $\theta$ and a binary per-turn correctness reward $\widehat r(y, y^*)$, the base RL objective is:

$J(\theta) = \mathbb{E}\left[ \widehat r(y_1, y^*) + \widehat r(y_2, y^*) \right]$

with reward shaping to amplify progress between turns and KL regularization for stability.
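
To make the shaping concrete, the following sketch computes a per-sample two-turn return with a progress bonus and turn-wise KL penalties. The specific form of the bonus, $\alpha \cdot (\widehat r(y_2, y^*) - \widehat r(y_1, y^*))$, and the coefficients `alpha`, `beta1`, and `beta2` are assumptions chosen for illustration; the text above only states that progress between turns is amplified and that KL regularization is applied per turn.

```python
def shaped_two_turn_return(
    r1: float,
    r2: float,
    kl1: float = 0.0,
    kl2: float = 0.0,
    alpha: float = 1.0,
    beta1: float = 0.01,
    beta2: float = 0.01,
) -> float:
    """Illustrative shaped return for one two-turn self-correction episode.

    r1, r2   : correctness rewards of the first and second attempts (0 or 1).
    kl1, kl2 : per-turn KL estimates against a reference policy.
    alpha    : assumed weight on the progress bonus (r2 - r1).
    beta1/2  : assumed per-turn KL penalty weights.
    """
    # The bonus is positive for wrong -> correct, negative for correct -> wrong,
    # and zero when the second attempt does not change correctness.
    progress_bonus = alpha * (r2 - r1)
    return r1 + r2 + progress_bonus - beta1 * kl1 - beta2 * kl2
```

With `alpha > 0`, an episode that breaks a correct first answer scores lower than one that leaves a wrong answer unchanged only if `alpha > 1`; the weight therefore controls how strongly degrading edits are discouraged relative to not correcting at all.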

C. Self-Distillation and Teacher Synchronization

SD-Zero trains a single model to act as both initial generator and reviser. After collecting revision traces (Generator → reward → Reviser → improved output), it freezes a "teacher" reviser, then on-policy distills the reviser's token distributions into the generator via KL minimization, synchronizing teacher and student periodically to enable