
Self-Critical Sequence Training

Updated 3 December 2025
  • Self-critical sequence training is a reinforcement learning method that optimizes non-differentiable sequence-level metrics by using the model’s own inference as an adaptive baseline.
  • It addresses exposure bias and metric mismatch inherent in teacher-forcing by aligning gradient updates with evaluation metrics such as CIDEr, BLEU, and WER.
  • Variants like Bayesian SCST and entropy-augmented SCST enhance reward estimation, promote output diversity, and improve training stability in various domains.

Self-critical sequence training (SCST) is a policy-gradient reinforcement learning method for sequence prediction that directly optimizes non-differentiable sequence-level metrics by leveraging the model’s own inference output to provide a low-variance, dynamic baseline. Originating in image captioning, SCST has been widely adopted across natural language generation (NLG), speech recognition, and structured prediction tasks, with numerous theoretical variants and empirical enhancements explored in subsequent research.

1. Motivation for Sequence-Level Reinforcement Learning

Standard sequence-to-sequence models are typically trained to maximize the likelihood of ground-truth sequences via cross-entropy loss (teacher forcing), which introduces two central mismatches: exposure bias and metric mismatch. Exposure bias arises because the model is conditioned on correct previous tokens during training but not at inference. Metric mismatch is due to the cross-entropy objective, which does not reflect the typically non-differentiable evaluation metrics (e.g., CIDEr, BLEU, WER, ROUGE, task-specific F1) used in downstream tasks. Reinforcement learning addresses these mismatches by optimizing the expected reward defined by the true metric, using policy-gradient methods such as REINFORCE to navigate the non-differentiable loss landscape (Rennie et al., 2016, Chen et al., 2022).

2. The Self-Critical Sequence Training (SCST) Algorithm

SCST is a variance-reduced policy-gradient algorithm that uses the model’s own inference output as its baseline. Denote the model parameters by $\theta$, the input by $I$, the model policy by $\pi(\cdot \mid I; \theta)$, a sampled sequence by $w^s$, and the greedy-decoded (inference-time) baseline sequence by $w^b$. The primary SCST loss is

L_\text{SCST}(\theta) = -\left(r(w^s) - r(w^b)\right) \cdot \log \pi(w^s \mid I;\theta),

where $r(\cdot)$ is a scalar reward function (e.g., CIDEr-D, RadGraph F1, WER) (Rennie et al., 2016, Nicolson et al., 7 Aug 2024). This gradient update encourages sampled sequences that achieve higher reward than the greedy baseline and penalizes those that are worse, effectively aligning the gradient direction with metric improvement. The baseline is adaptive, updating as model performance improves, and test-time consistency is naturally enforced.
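Under these definitions, the loss for a single example reduces to a scalar-weighted negative log-likelihood. A minimal NumPy sketch (the log-probability and reward values below are hypothetical illustrations, not from any cited experiment):

```python
import numpy as np

def scst_loss(sample_logprobs, sample_reward, baseline_reward):
    """Self-critical loss for one sampled sequence.

    sample_logprobs : per-token log pi(w_t^s | I; theta) of the sample w^s
    sample_reward   : r(w^s), metric score of the sampled sequence
    baseline_reward : r(w^b), metric score of the greedy-decoded baseline
    """
    advantage = sample_reward - baseline_reward  # r(w^s) - r(w^b)
    # The advantage is a constant w.r.t. theta, so in a real framework the
    # gradient flows only through the summed log-probabilities.
    return -advantage * np.sum(sample_logprobs)

# With a positive advantage, minimizing this loss pushes the sampled
# sequence's log-probability upward.
loss = scst_loss(np.array([-0.5, -1.2, -0.3]), sample_reward=0.9, baseline_reward=0.6)
```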

3. Algorithmic and Theoretical Extensions

Several significant extensions to the canonical SCST framework have been developed:

  • Bayesian SCST (B-SCST): Models $\theta$ as a variational distribution (e.g., via MC dropout), averaging rewards over multiple parameter samples and using the average as the baseline (Bujimalla et al., 2020). This provides uncertainty quantification correlated with sequence quality and further reduces policy-gradient variance.
  • Leave-One-Out Baseline: For $K$ sampled sequences, each sequence’s baseline is the average reward of the other $K-1$ samples, providing a nearly optimal unbiased baseline and improved gradient stability compared to a greedy decode (Luo, 2020).
  • $n$-Step SCST: Interpolates between per-token and sequence-level REINFORCE by grouping token blocks and using nonparametric rollout estimates for the baseline, balancing variance and per-token credit assignment (Gao et al., 2019).
  • Entropy-Augmented SCST (EAST): Adds an entropy regularization term to the SCST objective,

L_\text{EAST}(\theta) = L_\text{SCST}(\theta) - \lambda \cdot H(\pi),

where $H(\pi)$ is the next-token entropy and $\lambda$ is a fixed coefficient; the entropy term enters the minimized loss with a negative sign, so the update rewards higher entropy. This explicitly maintains higher distributional entropy during RL fine-tuning, increasing lexical diversity and preventing mode collapse (Nicolson et al., 7 Aug 2024).
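The leave-one-out baseline has a particularly compact form. A sketch, assuming $K$ independently sampled sequences scored by the task metric (the reward values are illustrative):

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """Advantage of each of K sampled sequences against the mean reward
    of the other K-1 samples (the leave-one-out baseline)."""
    rewards = np.asarray(rewards, dtype=float)
    k = len(rewards)
    baselines = (rewards.sum() - rewards) / (k - 1)  # mean of the others
    return rewards - baselines

# The advantages are centered: they sum to zero up to floating point,
# which keeps the gradient estimate low-variance while remaining unbiased.
adv = leave_one_out_advantages([0.2, 0.5, 0.8])
```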

The table below summarizes select SCST variants:

| Variant | Baseline | Notable Properties |
| --- | --- | --- |
| SCST | Greedy decode | Low variance, test-time consistency |
| B-SCST | Bayesian average reward | Uncertainty quantification, robustness |
| Leave-One-Out | Mean of other $K-1$ samples | Lower variance, unbiased |
| $n$-Step SCST | Per-block rollout estimate | Finer credit assignment, bias-variance tradeoff |
| EAST | SCST + entropy term | Diversity, avoids overfitting |

4. Implementation and Training Protocol

The canonical SCST training pipeline is two-stage (Rennie et al., 2016, Nicolson et al., 7 Aug 2024):

  1. Cross-Entropy Pretraining: Standard next-token prediction using teacher forcing and Adam(W) optimizer until convergence or early stopping by development metric.
  2. RL Fine-Tuning: Decoder parameters are further optimized using SCST (or a variant). For each input, the model generates a sample sequence (e.g., via top-$k$ sampling) and a baseline (greedy decode or Bayesian sample average). Rewards are evaluated with the sequence metric of interest. The SCST (or variant) loss is then computed and backpropagated.
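The RL fine-tuning step above can be sketched end to end on a toy, unconditional "decoder" whose per-step logits are free parameters. Everything here is illustrative: real systems condition each step on the input and past tokens, and use framework autograd rather than a hand-derived gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_reward(seq, ref):
    """Stand-in sequence metric: fraction of positions matching a reference
    (a placeholder for CIDEr-D, RadGraph F1, or a WER-based reward)."""
    return float(np.mean(seq == ref))

def scst_step(logits, ref, lr=0.1):
    """One SCST update on a (T, V) array of per-step logits."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    probs /= probs.sum(axis=1, keepdims=True)
    greedy = probs.argmax(axis=1)                                # baseline w^b
    sample = np.array([rng.choice(len(p), p=p) for p in probs])  # sample w^s
    advantage = toy_reward(sample, ref) - toy_reward(greedy, ref)
    # Gradient of -advantage * log pi(sample) w.r.t. logits per step:
    # advantage * (probs - one_hot(sampled token))
    grad = probs.copy()
    grad[np.arange(len(sample)), sample] -= 1.0
    return logits - lr * advantage * grad

logits = rng.normal(size=(4, 5))  # T=4 decoding steps, toy vocabulary of 5
logits = scst_step(logits, ref=np.array([0, 1, 2, 3]))
```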

For EAST in radiology report generation, the entropy regularization parameter was set to $\lambda = 0.05$, and token distribution entropy was computed at each decoding step. Encoder parameters were frozen during RL fine-tuning to focus adaptation on language generation (Nicolson et al., 7 Aug 2024).
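The per-step entropy used by the regularizer can be computed directly from the decoder's logits. A minimal sketch (the logits vector is an illustrative placeholder):

```python
import numpy as np

def next_token_entropy(logits):
    """Shannon entropy H(pi) of the next-token distribution given raw logits."""
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

lam = 0.05  # the lambda reported for the RRG24 setup
h = next_token_entropy(np.array([2.0, 1.0, 0.5]))
entropy_term = lam * h  # combined with the SCST loss at each decoding step
```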

5. Empirical Results and Analysis

SCST and its extensions consistently outperform cross-entropy-only models and standard REINFORCE baselines. On MSCOCO, SCST achieves up to a 10-point CIDEr-D gain over MIXER (REINFORCE with a learned baseline), and $n$-step and Bayesian variants provide further increments of 1–3 points depending on architecture (Rennie et al., 2016, Gao et al., 2019, Luo, 2020, Bujimalla et al., 2020). In automatic speech recognition (ASR), SCST achieves relative WER improvements of 8.7% (clean) and 7.8% (noisy) over cross-entropy baselines, using WER-driven rewards and N-best baseline averaging (Chen et al., 2022).

In radiology (RRG24), entropy-augmented SCST (EAST) yields material improvements in factual completeness (RadGraph F1) and natural language metrics. For instance, on the public test set, findings/impression RadGraph-F1 improved from 27.66/25.04 (SCST) to 29.46/27.01 (EAST) (Nicolson et al., 7 Aug 2024). Qualitatively, entropy regularization prevents phrase collapse and is essential for the diversity demanded by clinical NLG tasks.

6. Technical Considerations and Best Practices

A critical detail in SCST evaluation is the treatment of end-of-sequence (<Eos>) tokens in metric computation, especially with CIDEr-D. Omitting <Eos> enables pathological sequence truncations, artificially boosting scores by up to 4.1 CIDEr-D points via trivial fragments. The SacreEOS library standardizes signature reporting around <Eos> inclusion/exclusion and enforces transparent, reproducible experimental protocols (Hu et al., 2023). Practitioners are advised to disclose <Eos> handling, reward metrics, baseline type, and code versions for fair comparison.

Key stabilization techniques include:

  • Interpolating a small cross-entropy loss during RL fine-tuning to avoid divergence (Chen et al., 2022).
  • Frequent validation and early stopping on development metrics.
  • Careful management of tokenization and sequence boundaries to prevent metric artifacts (Hu et al., 2023).
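The first stabilization technique amounts to interpolating the two objectives with a small fixed weight. A hypothetical sketch (the mixing weight gamma and the loss values are illustrative, not taken from the cited work):

```python
def mixed_loss(rl_loss, ce_loss, gamma=0.1):
    """Convex combination of the SCST loss and a small cross-entropy
    anchor term; gamma is an illustrative mixing weight."""
    return (1.0 - gamma) * rl_loss + gamma * ce_loss

total = mixed_loss(rl_loss=0.6, ce_loss=2.3)
```

Keeping even a small cross-entropy weight anchors the policy to the pretrained distribution, which is what prevents the RL objective from drifting into degenerate outputs.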

7. Applications and Future Directions

SCST is applicable to any sequence-to-sequence model with non-differentiable, sequence-level evaluation metrics, including image captioning, machine translation, radiology report generation, and ASR. Success in RRG24 and COCO demonstrates domain transferability and improvement over standard baselines (Rennie et al., 2016, Nicolson et al., 7 Aug 2024, Chen et al., 2022). Current research explores variance–bias tradeoffs in baseline choice, uncertainty-calibrated reward estimation, increased entropy for language diversity, and further theoretical connections to actor-critic and nonparametric rollouts (Bujimalla et al., 2020, Gao et al., 2019). Transparent reporting, robust metric handling, and regularization for diversity are emerging best practices for reproducibility and cross-domain generalization.
