
Self-Critical Sequence Training (SCST)

Updated 20 January 2026
  • Self-Critical Sequence Training (SCST) is a reinforcement learning framework that directly optimizes sequence-level metrics by using the model's greedy output as a baseline.
  • It reduces gradient variance and aligns training with evaluation metrics like CIDEr-D, BLEU, and ROUGE through novel baseline strategies such as sample-average and Bayesian approaches.
  • Extensions like entropy-augmented, off-policy, and group-wise SCST broaden its applicability across tasks including image captioning, summarization, ASR, and radiology report generation.

Self-Critical Sequence Training (SCST) is a reinforcement learning–based algorithmic framework for sequence generation tasks. Originally introduced for image captioning, SCST directly optimizes non-differentiable sequence-level metrics by employing the model’s own test-time inference output as the baseline for policy-gradient estimation. SCST harmonizes training and inference, reduces variance without requiring a learned critic, and is extensible to diverse modalities including captioning, summarization, automatic speech recognition, and radiology report generation.

1. Mathematical Formulation and Algorithmic Workflow

In SCST, the goal is to maximize the expected value of a reward function $r(y_{1:T})$ defined over complete output sequences. For a model with parameters $\theta$ and policy $p_\theta(y_{1:T} \mid x)$, the objective is

$$L_R(\theta) = -\mathbb{E}_{y_{1:T} \sim p_\theta}\left[r(y_{1:T})\right]$$

To address the high variance of vanilla REINFORCE, SCST introduces a baseline $b$: specifically, the reward of the output $\hat{y}$ produced by the model's test-time inference procedure (typically greedy decoding). The resulting gradient estimate is

$$\nabla_\theta L_R(\theta) \approx -\left(r(y^s) - r(\hat{y})\right) \nabla_\theta \log p_\theta(y^s)$$

The standard workflow involves:

  1. Cross-entropy pretraining on ground-truth sequences.
  2. At each SCST update:
     a. Sample one or more outputs $y^s$ from $p_\theta$.
     b. Compute the baseline $\hat{y}$ by greedy decoding.
     c. Calculate rewards and the advantage $(r(y^s) - r(\hat{y}))$.
     d. Accumulate and apply gradient updates (Rennie et al., 2016, Hu et al., 2023).

This approach directly aligns the training criterion with the evaluation metric (e.g., CIDEr-D, BLEU, ROUGE) used in practical tasks.
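The update loop above can be sketched with a toy categorical policy and a stand-in unigram-overlap reward. Both the fixed per-step logits and the reward function here are illustrative simplifications, not components of any cited system:

```python
import math
import random

VOCAB = ["a", "b", "c", "<eos>"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def decode(logits_per_step, greedy):
    """Greedy or multinomial decoding; returns (tokens, total log-prob)."""
    tokens, log_prob = [], 0.0
    for logits in logits_per_step:
        probs = softmax(logits)
        if greedy:
            i = max(range(len(probs)), key=probs.__getitem__)
        else:
            i = random.choices(range(len(probs)), weights=probs)[0]
        tokens.append(VOCAB[i])
        log_prob += math.log(probs[i])
        if VOCAB[i] == "<eos>":
            break
    return tokens, log_prob

def unigram_reward(tokens, reference):
    """Toy stand-in for a sequence-level metric such as CIDEr-D."""
    return len(set(tokens) & set(reference)) / len(set(reference))

# One SCST update for a toy 3-step "model" with fixed per-step logits.
random.seed(0)
logits_per_step = [[0.5, 0.2, 0.1, 0.0]] * 3
reference = ["a", "b", "<eos>"]

y_sample, logp = decode(logits_per_step, greedy=False)  # step 2a: sample
y_greedy, _ = decode(logits_per_step, greedy=True)      # step 2b: baseline
advantage = unigram_reward(y_sample, reference) - unigram_reward(y_greedy, reference)  # 2c
loss = -advantage * logp  # gradient of this surrogate is the SCST estimator (2d)
```

In a real system, `loss` would be backpropagated through the policy network; here it simply illustrates how the self-critical advantage weights the sampled sequence's log-probability.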

2. Baseline Selection and Variance Reduction Strategies

The key innovation of SCST is substituting learned critics with the model’s own test-time output as the baseline. This guarantees unbiased gradient estimation while significantly reducing variance. Several variants have extended the baseline mechanism:

  • Sample Average Reward Baseline: Uses the mean reward over $K$ independently sampled outputs in place of the single greedy baseline. This empirically lowers gradient variance and improves final metric scores, and training throughput rises marginally because no extra greedy decoding pass is needed (Luo, 2020).
  • Bayesian Baseline via MC Dropout: Uses the average reward obtained from multiple stochastic forward passes with dropout, approximating posterior model uncertainty. This Bayesian approach further stabilizes updates and supports uncertainty quantification (Bujimalla et al., 2020).
  • Group-wise Baseline and Trust Region Penalties: Recent GRPO methods compare multiple candidate samples within each mini-batch and use intra-group normalized advantages, further reducing variance and providing trust-region constraints via explicit KL divergence penalties (Liang, 3 Mar 2025).
| Variant | Baseline Used | Reported Gains in CIDEr |
| --- | --- | --- |
| SCST (original) | Greedy decode | +12.3 to +15.8 |
| Sample-average | Mean sampled rewards | +0.6 to +3.0 |
| Bayesian | MC-dropout mean | +2.1 to +2.8 |
| GRPO | Group-wise, KL | +2.4 |
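The sample-average and group-wise baselines can be written in a few lines. The leave-one-out form below (each sample baselined by the mean of the other $K{-}1$ rewards) and the within-group standardization are illustrative sketches of the two strategies, not code from the cited papers:

```python
import math

def sample_average_advantages(rewards):
    """Leave-one-out sample-average baseline: each of the K sampled
    sequences is baselined by the mean reward of the other K-1 samples."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style group-wise baseline: rewards are standardized within
    the group of candidates sampled for the same input."""
    k = len(rewards)
    mean = sum(rewards) / k
    var = sum((r - mean) ** 2 for r in rewards) / k
    return [(r - mean) / math.sqrt(var + eps) for r in rewards]
```

Both variants make every sampled sequence its own critic: advantages sum to roughly zero within a group, which is what keeps the gradient estimate low-variance without a learned value function.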

3. Reward Function Selection and Impact

SCST is agnostic to the particular reward function $r(y)$, provided it is a scalar sequence-level metric. Notable choices include CIDEr-D for image captioning, BLEU and METEOR for text generation, ROUGE for summarization, word error rate for speech recognition, and RadGraph-F1 for radiology report generation.

Reward selection tightly links training progress to task performance and exposes SCST to effects such as reward overfitting, mode collapse, and metric-specific artifacts. For instance, omitting the <Eos> token from CIDEr-D computation enables caption models to exploit high-frequency n-gram fragments for inflated scores at the cost of semantic completeness (Hu et al., 2023).
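One concrete guard against the <Eos> pitfall described above is to normalize every sequence before scoring. The helper below is an illustrative sketch (the function name and token spelling are assumptions, not part of any cited library):

```python
def with_eos(tokens, eos="<Eos>"):
    """Ensure the end token is present before a sequence is scored;
    omitting it lets models earn inflated CIDEr-D from truncated fragments."""
    return tokens if tokens and tokens[-1] == eos else tokens + [eos]
```

Applying this to both candidates and references before computing tf–idf statistics and rewards keeps the metric from crediting incomplete captions.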

4. Empirical Performance and Extensions

SCST consistently yields substantial improvements across a variety of sequence generation benchmarks:

  • MSCOCO image captioning: single-model CIDEr-D improves from XE-trained baselines (e.g., 99.0) to 113.7 for attention models, with ensembles reaching 114.7 (Rennie et al., 2016).
  • Graph/text generation: +1.0 to +1.2 BLEU and METEOR points; small or negative effects observed when the reward metric is too rigid (Dognin et al., 2021).
  • Summarization: ROUGE-1 and ROUGE-L gains, increased coherence/diversity, especially with topic-aware biased generation (Wang et al., 2018).
  • Automatic speech recognition: 8.7% and 7.8% relative word error rate reductions on the RATS dataset (Chen et al., 2022).
  • Radiology reports: entropy-regularized SCST (EAST) avoids stock phrase overfitting and improves RadGraph-F1 (Nicolson et al., 2024).
  • Off-policy extensions and n-step variants enable application to Transformer-based paragraph generation and finer credit assignment (Yan et al., 2020, Gao et al., 2019).

SCST generalizes across modalities and architectures, including CNN-LSTM, attention-based models, Transformers, RNN-based autoencoders, and large pretrained encoder-decoders.

5. Implementation Specifics and Best Practices

Implementation of SCST requires:

  • Two-stage training: cross-entropy pretraining followed by RL-based fine-tuning.
  • Sampling and reward calculation: ancestral sampling, greedy decoding, or beam-search; strict inclusion of special tokens such as <Eos> during both TF–IDF and reward steps.
  • Optimizer settings: Adam/AdamW with learning rates ∼5×10⁻⁶ for RL stage; batch sizes typically 8–32; RL epochs ranging from 1 (EAST) to 20+ (captioning).
  • Reward and baseline configuration reporting: the SacreEOS library generates standardized signatures capturing all critical SCST parameters and handling <Eos> transparency (Hu et al., 2023).
| Step | Value/Settings |
| --- | --- |
| Optimizer | Adam/AdamW, LR 5×10⁻⁶ to 10⁻⁵ |
| Sampling strategy | Multinomial, top-k, greedy, beam |
| Baseline | Greedy decode, sample mean, MC dropout |
| Special tokens | <Eos> must be included in tf–idf and reward |
| Reporting | SacreEOS for signature clarity |
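The settings above can be collected into a single configuration object. This is an illustrative sketch: the class and field names are hypothetical, and the defaults simply sit inside the ranges reported in this section:

```python
from dataclasses import dataclass

@dataclass
class ScstConfig:
    """Illustrative RL-stage hyperparameters; exact values are task-dependent."""
    optimizer: str = "AdamW"
    learning_rate: float = 5e-6       # ~5e-6 to 1e-5 for the RL stage
    batch_size: int = 16              # typically 8-32
    rl_epochs: int = 5                # 1 (EAST) to 20+ (captioning)
    baseline: str = "greedy"          # "greedy" | "sample_mean" | "mc_dropout"
    include_eos_in_reward: bool = True
```

Recording such a config alongside a SacreEOS-style signature makes runs directly comparable across papers.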

6. Known Issues, Pitfalls, and Recommendations

Several methodological pitfalls undermine SCST’s reliability:

  • Omission of <Eos> token: Yields up to +4.3 CIDEr-D from trivial incomplete fragments, undermining fair comparison (Hu et al., 2023).
  • High-variance gradient estimation: If the greedy baseline is anomalously poor, gradient estimates become unstable.
  • Limited diversity: Single-sample SCST may miss solution space regions.
  • Opaque configuration reporting: without standardized signatures (e.g., SacreEOS), reported SCST results are difficult to compare and reproduce (Hu et al., 2023).

Best practices dictate:

  • Always include <Eos> in tf–idf and reward computation.
  • Report the full SCST signature, detailing all parameters.
  • Provide qualitative output examples with correct endings.
  • Prefer group-wise or averaged baselines for robustness when computationally feasible.
  • Integrate entropy regularization for semantic diversity in domain-adapted tasks (e.g., radiology).
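The entropy-regularization recommendation can be realized by augmenting the sequence reward with a mean per-step policy-entropy bonus, in the spirit of entropy-augmented SCST. The function below is a sketch; the weight `beta` is an illustrative value, not one taken from the cited work:

```python
import math

def entropy_augmented_reward(base_reward, step_probs, beta=0.01):
    """Add a mean per-step policy-entropy bonus to the sequence reward;
    higher entropy discourages collapse onto stock phrases."""
    ent = sum(-sum(p * math.log(p) for p in probs if p > 0)
              for probs in step_probs)
    return base_reward + beta * ent / len(step_probs)
```

A deterministic policy earns no bonus, while a more diverse one does, nudging the optimizer away from degenerate high-reward templates.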

7. Extensions and Generalizations

SCST continues to be actively extended through entropy-augmented, off-policy, n-step, and group-wise (GRPO-style) variants, as surveyed in the preceding sections.

SCST remains an influential and evolving algorithm for sequence generation, with compelling empirical results and an established methodological foundation for transparent, metric-driven optimization in both vision and language contexts.
