Self-Critical Sequence Training (SCST)
- Self-Critical Sequence Training (SCST) is a reinforcement learning framework that directly optimizes sequence-level metrics by using the model's greedy output as a baseline.
- It reduces gradient variance and aligns training with evaluation metrics like CIDEr-D, BLEU, and ROUGE through novel baseline strategies such as sample-average and Bayesian approaches.
- Extensions like entropy-augmented, off-policy, and group-wise SCST broaden its applicability across tasks including image captioning, summarization, ASR, and radiology report generation.
Self-Critical Sequence Training (SCST) is a reinforcement learning–based algorithmic framework for sequence generation tasks. Originally introduced for image captioning, SCST directly optimizes non-differentiable sequence-level metrics by employing the model’s own test-time inference output as the baseline for policy-gradient estimation. SCST harmonizes training and inference, reduces variance without requiring a learned critic, and is extensible to diverse modalities including captioning, summarization, automatic speech recognition, and radiology report generation.
1. Mathematical Formulation and Algorithmic Workflow
In SCST, the goal is to maximize the expected value of a reward function defined over complete output sequences. For a model with parameters $\theta$ defining the policy $p_\theta$, the objective is

$$L(\theta) = \mathbb{E}_{w^s \sim p_\theta}\big[r(w^s)\big].$$

To address the high variance of vanilla REINFORCE, SCST introduces a baseline $b = r(\hat{w})$, the reward of the output $\hat{w}$ produced by the model’s test-time inference procedure (typically greedy decoding), yielding the gradient estimate

$$\nabla_\theta L(\theta) \approx \big(r(w^s) - r(\hat{w})\big)\,\nabla_\theta \log p_\theta(w^s).$$

The standard workflow involves:
- Cross-entropy pretraining on ground-truth sequences.
- At each SCST update: a. Sample one or more outputs $w^s$ from $p_\theta$. b. Compute the baseline $r(\hat{w})$ by greedy decoding. c. Calculate rewards and the advantage $r(w^s) - r(\hat{w})$. d. Accumulate and apply gradient updates (Rennie et al., 2016, Hu et al., 2023).
This approach directly aligns the training criterion with the evaluation metric (e.g., CIDEr-D, BLEU, ROUGE) used in practical tasks.
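The gradient estimate above can be expressed as a surrogate loss. The following minimal Python sketch (function names are illustrative, not from any published SCST implementation) computes the advantage-weighted loss whose gradient matches the SCST estimator:

```python
def scst_advantages(sampled_rewards, greedy_reward):
    """Advantage of each sampled sequence over the self-critical
    (greedy-decode) baseline: r(w^s) - r(w_hat)."""
    return [r - greedy_reward for r in sampled_rewards]

def scst_surrogate_loss(seq_log_probs, sampled_rewards, greedy_reward):
    """Surrogate loss whose gradient equals the SCST policy-gradient
    estimate: minimizing -(r(w^s) - r(w_hat)) * log p(w^s) raises the
    probability of samples that beat the greedy baseline and lowers
    the probability of samples that fall short of it."""
    advantages = scst_advantages(sampled_rewards, greedy_reward)
    n = len(seq_log_probs)
    return -sum(a * lp for a, lp in zip(advantages, seq_log_probs)) / n
```

In an autodiff framework, `seq_log_probs` would be the summed per-token log-probabilities of each sampled sequence; rewards and the baseline are treated as constants, so only the log-probability terms carry gradient.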
2. Baseline Selection and Variance Reduction Strategies
The key innovation of SCST is replacing a learned critic with the model’s own test-time output as the baseline. Subtracting a baseline preserves the unbiasedness of the gradient estimate while significantly reducing its variance. Several variants have extended the baseline mechanism:
- Sample Average Reward Baseline: Uses the mean reward over multiple independently sampled outputs in place of the single greedy baseline. Empirically yields lower gradient variance and better final metric scores, and training throughput improves marginally because no extra greedy pass is required (Luo, 2020).
- Bayesian Baseline via MC Dropout: Uses the average reward obtained from multiple stochastic forward passes with dropout, approximating posterior model uncertainty. This Bayesian approach further stabilizes updates and supports uncertainty quantification (Bujimalla et al., 2020).
- Group-wise Baseline and Trust Region Penalties: Recent GRPO methods compare multiple candidate samples within each mini-batch and use intra-group normalized advantages, further reducing variance and providing trust-region constraints via explicit KL divergence penalties (Liang, 3 Mar 2025).
| Variant | Baseline Used | Reported Gains in CIDEr |
|---|---|---|
| SCST (original) | Greedy decode | +12.3 to +15.8 |
| Sample-average | Mean sampled rewards | +0.6 to +3.0 |
| Bayesian | MC-dropout mean | +2.1 to +2.8 |
| GRPO | Group-wise, KL | +2.4 |
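The baseline variants in the table can be sketched as pure-Python advantage computations (function names and the `eps` constant are illustrative assumptions):

```python
def greedy_baseline_advantages(rewards, greedy_reward):
    """Original SCST: subtract the single greedy-decode reward."""
    return [r - greedy_reward for r in rewards]

def sample_average_advantages(rewards):
    """Sample-average baseline (Luo, 2020): subtract the mean reward of
    the sampled candidates, removing the need for a greedy pass."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

def groupwise_advantages(rewards, eps=1e-8):
    """GRPO-style group-wise baseline: normalize each candidate's reward
    by the group mean and standard deviation."""
    mean_r = sum(rewards) / len(rewards)
    std = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean_r) / (std + eps) for r in rewards]
```

Note that the sample-average and group-wise variants need at least two samples per input, since a single sample would always receive zero advantage.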
3. Reward Function Selection and Impact
SCST is agnostic to the particular reward function $r$, provided it is a scalar sequence-level metric. Notable choices include:
- CIDEr-D: Standard for image captioning. Measures consensus with ground-truth using TF–IDF weighted n-gram vectors.
- BLEU, ROUGE-L, METEOR: Popular in machine translation and summarization.
- Task-specific metrics: WER for ASR (Chen et al., 2022), RadGraph-F1 for radiology reports (Nicolson et al., 2024), CLIP-derived PAC-S++ for vision-language evaluation (Sarto et al., 2024).
Reward selection tightly links training progress to task performance and exposes SCST to effects such as reward overfitting, mode collapse, and metric-specific artifacts. For instance, omitting the <Eos> token from CIDEr-D computation enables caption models to exploit high-frequency n-gram fragments for inflated scores at the cost of semantic completeness (Hu et al., 2023).
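The <Eos> effect can be illustrated with a toy n-gram precision reward (a drastic simplification of CIDEr-D; names are hypothetical): appending <Eos> before extracting n-grams penalizes truncated fragments that would otherwise score perfectly.

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n=2, eos=None):
    """Toy n-gram-precision reward. Passing eos appends the end token to
    both sequences before n-gram extraction, so a truncated candidate
    loses the (last_word, <Eos>) n-gram and is penalized."""
    cand = candidate + ([eos] if eos else [])
    ref = reference + ([eos] if eos else [])
    grams_c, grams_r = ngrams(cand, n), ngrams(ref, n)
    if not grams_c:
        return 0.0
    return sum(1 for g in grams_c if g in grams_r) / len(grams_c)
```

Here the truncated caption `["a", "cat"]` against reference `["a", "cat", "sits"]` scores a perfect 1.0 when <Eos> is omitted but only 0.5 when it is included, mirroring the inflated-score artifact reported by Hu et al. (2023).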
4. Empirical Performance and Extensions
SCST consistently yields substantial improvements across a variety of sequence generation benchmarks:
- MSCOCO image captioning: single-model CIDEr-D improves from XE-trained baselines (e.g., 99.0) to as high as 113.7 for attention models, with ensemble scores reaching 114.7 (Rennie et al., 2016).
- Graph/text generation: +1.0 to +1.2 BLEU and METEOR points; small or negative effects observed when the reward metric is too rigid (Dognin et al., 2021).
- Summarization: ROUGE-1 and ROUGE-L gains, increased coherence/diversity, especially with topic-aware biased generation (Wang et al., 2018).
- Automatic speech recognition: 8.7% and 7.8% relative Word Error Rate reduction (RATS dataset) (Chen et al., 2022).
- Radiology reports: entropy-regularized SCST (EAST) avoids stock phrase overfitting and improves RadGraph-F1 (Nicolson et al., 2024).
- Off-policy extensions and n-step variants enable application to Transformer-based paragraph generation and finer credit assignment (Yan et al., 2020, Gao et al., 2019).
SCST generalizes across modalities and architectures, including CNN-LSTM, attention-based models, Transformers, RNN-based autoencoders, and large pretrained encoder-decoders.
5. Implementation Specifics and Best Practices
Implementation of SCST requires:
- Two-stage training: cross-entropy pretraining followed by RL-based fine-tuning.
- Sampling and reward calculation: ancestral sampling, greedy decoding, or beam-search; strict inclusion of special tokens such as <Eos> during both TF–IDF and reward steps.
- Optimizer settings: Adam/AdamW with learning rates ∼5×10⁻⁶ for RL stage; batch sizes typically 8–32; RL epochs ranging from 1 (EAST) to 20+ (captioning).
- Reward and baseline configuration reporting: the SacreEOS library generates standardized signatures capturing all critical SCST parameters and handling <Eos> transparency (Hu et al., 2023).
| Step | Value/Settings |
|---|---|
| Optimizer | Adam/AdamW, LR 5×10⁻⁶ to 10⁻⁵ |
| Sampling strategy | Multinomial, top-k, greedy, beam |
| Baseline | Greedy decode, sample mean, MC dropout |
| Special tokens | <Eos> must be included in TF–IDF and reward |
| Reporting | SacreEOS for signature clarity |
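The per-update workflow from Section 1 can be orchestrated as follows (a hedged sketch; `sample_fn`, `greedy_fn`, `reward_fn`, and `update_fn` are placeholder callables standing in for the model, metric, and optimizer, not part of any real API):

```python
def scst_update(sample_fn, greedy_fn, reward_fn, update_fn, n_samples=5):
    """One SCST fine-tuning step: score the greedy decode once as the
    baseline, draw n_samples candidates, and hand the advantage-weighted
    batch to the optimizer via update_fn."""
    baseline = reward_fn(greedy_fn())
    batch = [(w, reward_fn(w) - baseline)
             for w in (sample_fn() for _ in range(n_samples))]
    update_fn(batch)
    return baseline
```

In a real system this step runs after cross-entropy pretraining, with `update_fn` performing the Adam/AdamW step at the small RL-stage learning rates listed above.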
6. Known Issues, Pitfalls, and Recommendations
Several methodological pitfalls undermine SCST’s reliability:
- Omission of <Eos> token: Yields up to +4.3 CIDEr-D from trivial incomplete fragments, undermining fair comparison (Hu et al., 2023).
- High-variance gradient estimation: If the greedy baseline is anomalously poor, gradient estimates become unstable.
- Limited diversity: Single-sample SCST may miss solution space regions.
- Opaque configuration reporting: without a standardized signature, critical settings (baseline type, sampling strategy, <Eos> handling) go unreported; SacreEOS signatures are recommended in all publications/releases for integrity and reproducibility.
Best practices dictate:
- Always include <Eos> in TF–IDF and reward computation.
- Report the full SCST signature, detailing all parameters.
- Provide qualitative output examples with correct endings.
- Prefer group-wise or averaged baselines for robustness when computationally feasible.
- Integrate entropy regularization for semantic diversity in domain-adapted tasks (e.g., radiology).
7. Extensions and Generalizations
SCST is actively extended:
- Entropy-Augmented SCST (EAST): Maintains higher token-distribution entropy to increase semantic diversity and reduce phrase collapse (Nicolson et al., 2024).
- Off-policy SCST: Employs a cheap behaviour policy, importance/re-weighting, and KL-control for large, resource-intensive models (Yan et al., 2020).
- Group Relative Policy Optimization (GRPO): Uses group-wise advantage normalization, PPO-style clipping, and explicit KL trust-region stabilization for improved variance reduction and diversity (Liang, 3 Mar 2025).
- Bayesian SCST: Monte Carlo dropout for predictive uncertainty estimation and baseline stabilization (Bujimalla et al., 2020).
- SCST with learnable reward metrics: PAC-S++–based SCST combines CLIP with contrastive learning for improved metric alignment and hallucination penalization (Sarto et al., 2024).
- N-step SCST: Employs per-token or blockwise bootstrapped advantages, enabling finer credit assignment and further stabilizing policy updates in deterministic sequence settings (Gao et al., 2019).
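As one concrete instance, the entropy-augmentation idea behind EAST can be sketched as adding a mean per-step entropy bonus to the task reward (the `beta` weight and function names are illustrative assumptions, not the published formulation):

```python
import math

def token_entropy(dist, eps=1e-12):
    """Shannon entropy of one per-step token distribution."""
    return -sum(p * math.log(p + eps) for p in dist if p > 0)

def entropy_augmented_reward(task_reward, step_dists, beta=0.1):
    """Task-metric reward plus a mean per-step entropy bonus, which
    discourages the policy from collapsing onto low-entropy stock
    phrases during RL fine-tuning."""
    mean_h = sum(token_entropy(d) for d in step_dists) / len(step_dists)
    return task_reward + beta * mean_h
```

A policy that hedges between plausible continuations (higher per-step entropy) receives a slightly larger reward than one that deterministically emits the same phrase, which is the qualitative effect EAST relies on.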
SCST remains an influential and evolving algorithm for sequence generation, with compelling empirical results and an established methodological foundation for transparent, metric-driven optimization in both vision and language contexts.