Scheduled Sampling Regime

Updated 26 December 2025
  • Scheduled Sampling Regime is a curriculum-based training strategy that gradually shifts from using ground-truth labels to model predictions to mitigate exposure bias in sequence-to-sequence learning.
  • It employs a two-stage training approach where initial reliance on gold labels transitions to model-generated predictions, enhancing both prediction accuracy and explanation alignment.
  • Empirical results show that dynamic scheduling and adaptive loss weighting improve performance metrics such as macro-F1 scores and rationale fidelity in complex multi-class tasks.

A scheduled sampling regime denotes a curriculum-based training strategy for sequence-to-sequence and multi-task LLMs, in which the conditioning input at training time is gradually shifted from gold (ground-truth) values toward the model’s own predictions. This regime mitigates exposure bias—the discrepancy between model behavior observed during teacher-forced training versus that at inference, where ground-truth is unavailable. In rationale-driven multi-task learning, scheduled sampling is employed to ensure generated rationales remain aligned with the model's own predictions, producing explanations faithful to predicted outcomes as encountered at test time (Hasan et al., 23 Dec 2025).

1. Formal Definition and Motivation

A scheduled sampling regime, as formalized in recent work (notably “Reason2Decide: Rationale-Driven Multi-Task Learning” (Hasan et al., 23 Dec 2025)), introduces a stochastic schedule $\pi_t$ for mixing ground-truth labels $y^*$ and model-predicted labels $\hat{y}$ at each training step $t$. For each instance, the label $\tilde{y}$ used to condition downstream modules (e.g., rationale generators) is sampled as:

$$\tilde{y} = \begin{cases} y^* & \text{with probability } 1-\pi_t, \\ \hat{y} & \text{with probability } \pi_t. \end{cases}$$

The function $\pi_t$ is typically a piecewise-linear or otherwise gradually increasing schedule, ramping from 0 to a cap (e.g., 0.9) over the majority of training steps. The rationale for this regime is to alleviate the train–test distribution shift, particularly in multi-task or self-rationalizing models, where rationales must justify the model’s own predictions—correct or otherwise (Hasan et al., 23 Dec 2025).
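
For concreteness, a minimal sketch of this per-instance sampling step (hypothetical function and argument names, not code from the cited work) might look like:

```python
import random

def sample_conditioning_label(gold_label, predicted_label, pi_t):
    """Return the label used to condition the rationale generator at step t.

    With probability pi_t the conditioning label is the model's own
    prediction; otherwise it is the ground-truth (gold) label.
    """
    return predicted_label if random.random() < pi_t else gold_label
```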

2. Scheduled Sampling in Rationale-Driven Multi-Task Learning

In multi-task setups that jointly optimize for prediction (label) and rationalization (explanation) objectives, naive training with teacher forcing conditions rationales exclusively on gold labels. This creates a marked exposure bias: at inference, the rationale generator receives model predictions as input, potentially causing generation drift or inconsistencies whenever predictions are incorrect. Scheduled sampling strategically transitions rationalization training inputs from $y^*$ to $\hat{y}$, enabling the model to learn to self-explain even in off-nominal (error) regimes.

The typical instantiation of scheduled sampling in such settings involves two learning stages:

  • Stage 1: Train solely on gold rationales with cross-entropy loss.
  • Stage 2: Jointly train prediction and rationalization, using scheduled sampling so that, as training progresses, an increasing fraction of explanations are conditioned on the model's own predictions (Hasan et al., 23 Dec 2025).

The total loss at each step $t$ fuses prediction and explanation losses via a dynamic weighting $\alpha_t$:

$$\mathcal{L}_{\mathrm{total}}(\theta) = \alpha_t \, \mathcal{L}_{\mathrm{pred}}(\theta) + (1-\alpha_t) \, \mathcal{L}_{\mathrm{expl}}(\theta).$$

Here, $\mathcal{L}_{\mathrm{expl}}$ conditions on $\tilde{y}$ as sampled per the scheduled regime.
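
Assuming standard cross-entropy losses for both the prediction and explanation heads, a PyTorch-style sketch of this weighted objective (all names hypothetical, not the authors' implementation) could be:

```python
import torch.nn.functional as F

def total_loss(pred_logits, gold_label, expl_logits, rationale_ids, alpha_t):
    """Fuse prediction and explanation losses with a dynamic weight alpha_t.

    pred_logits:   (num_classes,) label logits from the prediction head
    gold_label:    0-dim tensor holding the gold class index
    expl_logits:   (seq_len, vocab_size) logits from the rationale generator,
                   assumed to be conditioned on the sampled label (gold or
                   predicted, per the scheduled sampling regime)
    rationale_ids: (seq_len,) reference rationale token ids
    alpha_t:       prediction-loss weight at training step t
    """
    loss_pred = F.cross_entropy(pred_logits.unsqueeze(0), gold_label.unsqueeze(0))
    loss_expl = F.cross_entropy(expl_logits, rationale_ids)
    return alpha_t * loss_pred + (1.0 - alpha_t) * loss_expl
```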

3. Impact on Exposure Bias and Explanation Fidelity

Exposure bias—the divergence between train and test conditioning—often manifests as a degradation in explanation quality, especially when predictions are erroneous. A scheduled sampling regime directly counteracts this, as shown via both empirical and ablation results in (Hasan et al., 23 Dec 2025). Removing scheduled sampling results in pronounced drops in label F1 and LLM-judged rationale fidelity, particularly for complex or imbalanced multi-class settings like triage disposition prediction. The gradual mixing induced by $\pi_t$ enables explanation modules to generalize to the prediction-conditioned setting that is actually encountered at inference.

4. Implementation Schedules and Practical Considerations

Schedules for $\pi_t$ are typically chosen to balance stability and realism:

$$\pi_t = \begin{cases} 0, & 0 \le t < w \\ \min\!\left(0.9,\; \tfrac{t-w}{m}\right), & w \le t < w+m \\ 0.9, & t \ge w+m \end{cases}$$

where $w$ is a warm-up window (e.g., 5% of total steps) and $m$ is the transition window (e.g., 60%). The schedule is capped at $\pi_t \le 0.9$ to retain some exposure to gold labels for optimization stability (Hasan et al., 23 Dec 2025). Losses are weighted adaptively (e.g., increasing the prediction loss weight $\alpha_t$ over a warm-up period), and curriculum order is strictly enforced to avoid destabilizing the rationale generator by prematurely exposing it to noisy prediction inputs.
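
A minimal sketch of this piecewise-linear schedule (hypothetical names; e.g., warmup_steps ≈ 5% and ramp_steps ≈ 60% of total training steps, following the reported settings) could be:

```python
def pi_schedule(t, warmup_steps, ramp_steps, cap=0.9):
    """Piecewise-linear scheduled sampling probability pi_t.

    pi_t stays at 0 during the warm-up window, then ramps linearly over
    ramp_steps, and is capped (here at 0.9) so the model keeps seeing some
    gold labels for optimization stability.
    """
    if t < warmup_steps:
        return 0.0
    return min(cap, (t - warmup_steps) / ramp_steps)
```

Because the linear ramp is clipped by the cap, $\pi_t$ reaches 0.9 at $t = w + 0.9m$ and remains there for the rest of training.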

5. Empirical Results and Ablation Studies

Evaluation on clinical and medical language understanding tasks as reported in (Hasan et al., 23 Dec 2025) demonstrates that models using a scheduled sampling regime (Reason2Decide) outperform strong baselines and ablated variants in predictive macro-F1, as well as in BERTScore and BLEU for rationale fidelity:

| Model | Clinical F₁ | BERTScore | BLEU |
|---|---|---|---|
| Full (scheduled sampling) | 59.92 | 92.09 | 22.74 |
| – no scheduled sampling | 57.28 | 92.09 | 22.59 |

Ablations reveal that removing scheduled sampling most acutely degrades prediction accuracy, while rationale fidelity (as measured by BERTScore and BLEU) is only slightly affected for binary tasks but much more so for complex classification tasks. Warm-up steps further stabilize performance.

Scheduled sampling also enables effective use of LLM-generated rationales in the pre-training stage, reducing reliance on scarce human annotations while preserving or enhancing explainability and performance.

6. Relation to Prior and Analogous Approaches

While scheduled sampling originated in general sequence-to-sequence learning (notably Bengio et al., 2015), its application to rationale-driven, multi-task regimes is critical for co-training predictors and self-explainers. Analogous exposure-mitigation techniques appear in token-level rationale extraction frameworks (e.g., via confidence-weighted losses (Bhat et al., 2021)) and in dual-teacher or self-training approaches, which use pseudo-labeling with confidence re-weighting but may not implement dynamic switching as formalized above (Veerubhotla et al., 2023, Bhat et al., 2021).

A crucial implication is that scheduled sampling can be viewed as a higher-level instance of curriculum learning, now specialized for tasks that require consistent, aligned, and faithful explanations alongside prediction.

7. Limitations and Future Directions

While scheduled sampling mitigates exposure bias, it does not entirely eliminate the train–test gap, especially if model predictions during transition are consistently erroneous or if rationale generators are overly sensitive to prediction noise. Piecewise-linear or other schedules still require careful calibration for each task, and in some ablations, removing the “rationale foundation” stage (Stage-1) also led to rationale quality degradation (Hasan et al., 23 Dec 2025). Plausibly, further integration with confidence-adaptive or reinforcement-based curriculum design may yield additional robustness.

Emerging research is generalizing scheduled sampling to settings with richer explanation structures (e.g., multi-hop subgraph rationales (Si et al., 2022)), or integrating contrastive explanation replay for continual learning (Xiong et al., 2023), suggesting broad relevance for explainable, trustworthy AI development.
