Self-Revision Training (SRT)

Updated 17 April 2026
  • Self-Revision Training (SRT) is a framework that enables models to self-critique and iteratively refine outputs through a closed feedback loop.
  • It employs multi-stage pipelines including supervised fine-tuning, reinforcement learning, and self-distillation, achieving state-of-the-art results in tasks like instruction following and code reasoning.
  • SRT minimizes errors and improves model alignment by transforming one-shot generation into iterative improvement with explicit critique, revision, and decision protocols.

Self-Revision Training (SRT) is a framework for augmenting LLMs and multimodal models with the explicit ability to critique, refine, and revise their own outputs. SRT transforms the traditionally one-shot generation paradigm into an iterative improvement process, leveraging self-evaluation, structured feedback, and explicit revision to enhance alignment, quality, and robustness. SRT methods are characterized by multi-phase learning pipelines that can employ supervised fine-tuning, reinforcement learning, self-distillation, or a mixture of these, and may incorporate specialized procedures for error localization, preference optimization, or step-level correction. SRT has been instantiated in diverse contexts, including natural language instruction following, math/code reasoning, multimodal grounding, and agentic reinforcement learning, with state-of-the-art results reported on benchmark tasks in these settings.

1. Foundational Principles and Objectives

SRT formalizes model improvement as a process of self-critique followed by revision, creating a closed feedback loop (critique → refinement → update) that operates without reliance on human-in-the-loop corrections. The core premise is to teach models to generate and act upon feedback about their own outputs, enabling “self-alignment” and continuous self-improvement. The general SRT objective is to maximize task-specific reward (e.g., correctness, adherence, informativeness) not just in initial output generation, but also in post-hoc self-revision. SRT differs from conventional supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) in that it creates, evaluates, and learns from revisions internally or via other models, often optimizing over chains of critique and refinement (Hu et al., 2024, He et al., 13 Apr 2026, Kumar et al., 2024, Yuan et al., 20 Jan 2025).

Key SRT objectives include:

2. Iterative Self-Feedback and Self-Revision Loops

Most SRT frameworks employ an explicit multi-turn, multi-stage architecture. At each iteration, a model executes a critique–revise–decide loop over its own outputs, using structured natural-language or token-level feedback.

For example, in the Volcano model, the pipeline implements up to three iterations of feedback–revision–decision, using a single multimodal model for all stages, and supervised multi-head learning for (1) answer, (2) feedback, and (3) revision outputs (Lee et al., 2023). In SRT for LLMs, feedback is either provided by stronger external critics (e.g., GPT-4 Turbo) or generated by the model itself (“self-feedback”), and the resulting system is trained to jointly produce the initial answer, feedback, and refined output (Hu et al., 2024).
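The critique–revise–decide loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the Volcano or any published implementation: the callables `generate`, `critique`, `revise`, and `prefers_revision` are hypothetical stand-ins for model calls, and the cap of three iterations simply mirrors the figure quoted for Volcano.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    answer: str
    iterations: int  # number of critique-revise rounds actually run


def self_revise(
    prompt: str,
    generate: Callable[[str], str],                        # hypothetical: initial answer
    critique: Callable[[str, str], str],                   # hypothetical: feedback on an answer
    revise: Callable[[str, str, str], str],                # hypothetical: answer rewritten using feedback
    prefers_revision: Callable[[str, str, str], bool],     # hypothetical: decision step
    max_iters: int = 3,
) -> LoopResult:
    answer = generate(prompt)
    for i in range(max_iters):
        feedback = critique(prompt, answer)
        candidate = revise(prompt, answer, feedback)
        # Decision step: keep the current answer and stop if the revision is not preferred.
        if not prefers_revision(prompt, answer, candidate):
            return LoopResult(answer, i + 1)
        answer = candidate
    return LoopResult(answer, max_iters)


# Toy stand-ins so the sketch runs without a model.
result = self_revise(
    "What is 2+2?",
    generate=lambda p: "2+2=5",
    critique=lambda p, a: "arithmetic error" if "5" in a else "ok",
    revise=lambda p, a, f: a.replace("5", "4") if f != "ok" else a,
    prefers_revision=lambda p, old, new: new != old,
)
print(result)  # LoopResult(answer='2+2=4', iterations=2)
```

In a single-model setup such as the one described for Volcano, all four callables would be backed by the same model with different prompts; with an external critic, only `critique` would point elsewhere.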

3. Formal Algorithms and Training Paradigms

SRT has been instantiated through several algorithmic families:

A. Supervised Self-Revision and Distillation

  • Stage 1: Generate response(s), receive model or critic feedback (weakness, suggestions, score), and produce a refinement. Supervise the model to output the complete chain.
  • Stage 2: Use the self-revised model to generate more data, filtering on improvements, and optimize preferences via DPO. This enables scaling by making the model its own critic and reviser (Hu et al., 2024).
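As a rough sketch of Stage 2, the snippet below filters revision traces on score improvement, turns the survivors into preference pairs (refined output preferred over the initial one), and evaluates the standard DPO loss for a single pair. The `Trace` and `PreferencePair` types, the `min_gain` threshold, and the `beta` value are assumptions made for illustration; the source does not specify the filtering rule or hyperparameters.

```python
import math
from typing import List, NamedTuple


class Trace(NamedTuple):
    prompt: str
    initial: str
    refined: str
    initial_score: float   # assumed: critic score for the initial response
    refined_score: float   # assumed: critic score for the refined response


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


def build_pairs(traces: List[Trace], min_gain: float = 1.0) -> List[PreferencePair]:
    """Keep only traces where refinement improved the score by at least min_gain."""
    return [
        PreferencePair(t.prompt, chosen=t.refined, rejected=t.initial)
        for t in traces
        if t.refined_score - t.initial_score >= min_gain
    ]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


traces = [
    Trace("p1", "draft A", "better A", initial_score=5.0, refined_score=7.0),
    Trace("p2", "draft B", "same B", initial_score=6.0, refined_score=6.0),
]
pairs = build_pairs(traces)  # only the p1 trace survives the filter
```

Sequence log-probabilities would in practice come from the policy and a frozen reference model; here they are plain floats so the loss can be inspected in isolation.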

B. Reinforcement Learning and Reward Shaping

  • Self-revision can be implemented as a multi-turn Markov decision process, where the agent is rewarded both for initial correctness and for making effective corrections. SCoRe, for instance, uses a two-turn RL objective with reward bonuses for genuine error correction and explicit turn-wise KL penalties against a reference policy to avoid degenerate strategies (Kumar et al., 2024).
  • The RL objective, for model parameters $\theta$ and binary per-turn correctness $\widehat{r}(y, y^*)$, is:

$$J(\theta) = \mathbb{E}\left[ \widehat{r}(y_1, y^*) + \widehat{r}(y_2, y^*) \right]$$

with reward shaping to amplify progress between turns and KL regularization for stability.
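To make the shaped objective concrete, here is a small sketch of a per-episode return for the two-turn setting. The form of the progress bonus (`alpha` times the change in correctness between turns), the exact-match correctness check, and the single KL coefficient `beta` are assumptions chosen for illustration; they are not taken from the source and may differ from what SCoRe actually uses.

```python
def correctness(y: str, y_star: str) -> float:
    """Binary per-turn correctness; exact string match is an assumed stand-in."""
    return 1.0 if y.strip() == y_star.strip() else 0.0


def shaped_two_turn_return(
    y1: str,
    y2: str,
    y_star: str,
    alpha: float = 2.0,   # assumed weight of the progress bonus
    kl1: float = 0.0,     # assumed per-turn KL estimates against a reference policy
    kl2: float = 0.0,
    beta: float = 0.0,    # assumed KL penalty coefficient
) -> float:
    r1 = correctness(y1, y_star)
    r2 = correctness(y2, y_star)
    # Positive when turn 2 fixes an error, negative when it breaks a correct answer.
    progress_bonus = alpha * (r2 - r1)
    return r1 + r2 + progress_bonus - beta * (kl1 + kl2)


fixed = shaped_two_turn_return("5", "4", "4")    # wrong -> right: 0 + 1 + 2 = 3.0
kept = shaped_two_turn_return("4", "4", "4")     # right -> right: 1 + 1 + 0 = 2.0
broken = shaped_two_turn_return("4", "5", "4")   # right -> wrong: 1 + 0 - 2 = -1.0
```

With the bonus in place, fixing an error scores above merely repeating a correct answer, and breaking a correct answer is penalized, which is the kind of incentive the reward shaping is meant to create.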

C. Self-Distillation and Teacher Synchronization

  • SD-Zero trains a single model to act as both initial generator and reviser. After collecting revision traces (Generator → reward → Reviser → improved output), it freezes a "teacher" reviser, then on-policy distills the reviser's token distributions into the generator via KL minimization, synchronizing teacher and student periodically to enable
