Sequential Finetuning (SeqL)

Updated 26 February 2026

Sequential finetuning (SeqL) is a training approach where models are updated in ordered stages to separate learning objectives and reduce forgetting.
This method decouples parameter updates, using techniques like Bayesian updates, reinforcement learning, and modular head training to optimize distinct tasks sequentially.
SeqL has demonstrated robust performance in applications such as robotic control, continual learning, and uncertainty-aware online adaptation.

Sequential finetuning (SeqL) is a family of training protocols where a neural network or other model is updated in multiple distinct stages, each corresponding to a pre-specified order of tasks, classes, objectives, or data regimes. Unlike standard joint finetuning—which optimizes all objectives or adapts to all domains simultaneously—sequential finetuning decouples learning into ordered phases. This approach has gained prominence across diverse research areas including robotic control, continual learning with pre-trained models, sequence generation under reward and prior constraints, and Bayesian online adaptation of transformers. Key motivations include mitigating catastrophic forgetting, facilitating graceful plasticity-stability tradeoffs, and enabling robust adaptation under uncertainty or resource constraints.

1. Sequential Finetuning: Definitions and Core Variants

SeqL encompasses protocols in which a pre-trained model is updated by (a) alternating between disjoint training slices (classes, tasks), (b) splitting objectives or heads and optimizing them in distinct phases, or (c) applying Bayesian/posterior updates online as new data arrive.

A canonical example is the two-stage optimization strategy detailed in "SeqVLA: Sequential Task Execution for Long-Horizon Manipulation with Completion-Aware Vision-Language-Action Model" (Yang et al., 17 Sep 2025). Here, action and detection heads are finetuned not jointly but in strict temporal sequence:

Phase 1: Optimize only the action-generation head and the shared backbone for low-level control.
Phase 2: Freeze backbone and action head; train only the binary completion-detection head.

A different form appears in continual learning for speech emotion recognition (Jain et al., 2024) and vision (Zhang et al., 2024), where sequential finetuning traverses classes or tasks in a pre-defined or randomly permuted order (class-by-class or task-by-task), updating the model on each partition before proceeding.

Another instantiation is observed in RL sequence generation with KL-control (Jaques et al., 2016), in which a maximum-likelihood (MLE) model defines a frozen prior; a second phase uses RL (with a KL penalty) to sequentially shape the model's output distribution under domain-specific rewards.

In online or streaming adaptation settings—exemplified by the "Kalman Bayesian Transformer" (Jing et al., 12 Sep 2025)—sequential finetuning comprises recursive Bayesian moment updates as each mini-batch or single datapoint is received, thereby incrementally balancing new and prior information.

2. Methodological Frameworks and Algorithms

Sequential finetuning protocols are highly dependent on context but share core algorithmic motifs. The following table summarizes representative frameworks:

Context/Domain	SeqL Protocol (Algorithmic Summary)	Main Objectives
Dual-head robotic control (Yang et al., 17 Sep 2025)	2-stage (action-only, then detection-only) training	Decouple skills and event detection, reduce error propagation
Continual class learning (Jain et al., 2024, Zhang et al., 2024)	Class-by-class or task-by-task optimization	Alleviate forgetting, sharpen per-class boundaries
RL sequence tuning (Jaques et al., 2016)	MLE pretrain → RL/KL-control fine-tune	Maximize reward, preserve data-likelihood
Bayesian transformer adaptation (Jing et al., 12 Sep 2025)	Online Kalman filter updates per sample/batch	Quantify/model uncertainty, rapid adaptation

A key feature is the strict separation of parameter updates for models or submodules across the distinct finetuning stages. For example, in (Yang et al., 17 Sep 2025), no backward pass ever combines $\mathcal{L}_{\text{action}}$ and $\mathcal{L}_{\text{completion}}$ ; in (Jain et al., 2024), all weights are updated only on samples from a single class at each stage.

The "Slow Learner with Classifier Alignment" (SLCA++) framework (Zhang et al., 2024) further refines standard SeqL by selectively reducing the learning rate for the backbone while allowing the classifier head to adapt rapidly. Parameter-efficient variants additionally restrict updates to low-rank subspaces.

SeqL in the Bayesian regime (Jing et al., 12 Sep 2025) eschews SGD entirely, propagating and updating Gaussian means/covariances via closed-form Kalman filter recursions at each sample arrival, crucial for rapid, uncertainty-aware online finetuning.

3. Loss Functions and Optimization Schemes

Losses in SeqL are generally modular, reflecting the decoupling of objectives:

In (Yang et al., 17 Sep 2025), the action-generation head is optimized via flow-matching:

$L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\|_2^2.$

The detection head uses binary cross-entropy:

$L_{\text{completion}} = -\left[y \log(p) + (1-y)\log(1-p)\right], \quad p = \sigma(WF+b).$

These are never summed in SeqL; each phase minimizes only its respective loss.

In class-sequential finetuning for SER (Jain et al., 2024), each stage minimizes the standard cross-entropy loss plus $L_2$ regularization for that class subset.
In RL-based sequence modeling (Jaques et al., 2016), the sequential objective integrates the expected sum of task rewards and a KL divergence to the prior:

$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^T \frac{r_T(s_t, a_t)}{c} + \log p_{\text{prior}}(a_t|s_t) - \log \pi_\theta(a_t|s_t)\right].$

Three KL-control off-policy algorithms (Q-learning with log-prior, generalized $\Psi$ -learning, and G-learning) operationalize this in practice.

In the Kalman Bayesian Transformer (Jing et al., 12 Sep 2025), the update rules propagate mean/covariance statistics for each weight layer, incorporating the new likelihood via the Kalman gain $K_{w^i}$ :

$K_{w^i} = \Sigma_{w^i,u^i}\,\Sigma_{u^i,u^i}^{-1}.$

This balancing mechanism reflects uncertainty from both prior weights and new data.

4. Empirical Results, Benchmarks, and Ablations

Robust empirical validation of SeqL protocols has been conducted across domains:

Completion-aware robotic manipulation (Yang et al., 17 Sep 2025): SeqVLA-S (sequential) and SeqVLA-J (joint) with full backbone finetuning attain $\sim$ 95–100% subtask success. However, joint finetuning yields sharper completion signals (entropy: $\mathcal{L}_{\text{completion}}$ 0 bits vs $\mathcal{L}_{\text{completion}}$ 1; KS separation: $\mathcal{L}_{\text{completion}}$ 2 vs $\mathcal{L}_{\text{completion}}$ 3).
SER continual learning (Jain et al., 2024): SeQuiFi outperforms vanilla, EWC, and replay baselines on all metrics; for CREMA-D $\mathcal{L}_{\text{completion}}$ 4RAVDESS, SeQuiFi achieves $\mathcal{L}_{\text{completion}}$ 5, F1= $\mathcal{L}_{\text{completion}}$ 6 compared to $\mathcal{L}_{\text{completion}}$ 7 for vanilla FT.
SLCA++ for Class-Incremental Vision (Zhang et al., 2024): On Split CIFAR-100, SLCA++ (full) attains $\mathcal{L}_{\text{completion}}$ 8 last-task accuracy; standard Seq FT yields $\mathcal{L}_{\text{completion}}$ 9, SLCA++ (hybrid PEFT) achieves $L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\|_2^2.$ 0 with only $L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\|_2^2.$ 1M parameters.
Bayesian transformer adaptation (Jing et al., 12 Sep 2025): SeqL with Kalman updates is $L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\|_2^2.$ 2– $L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\|_2^2.$ 3 faster and more stable than retraining on buffered samples. Uncertainty estimates track true data noise levels.
RL melody/molecule generation (Jaques et al., 2016): KL-control SeqL retains high data-likelihood while optimizing for domain metrics; e.g., valid SMILES rate raised from $L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\|_2^2.$ 4 (p $L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\|_2^2.$ 50.001).

Ablation studies consistently indicate that backbone finetuning, modular objective separation, and uncertainty-aware updates are all necessary for optimal stability and retention. Freezing the backbone or naive joint updates result in diminished adaptation or increased drift/overfitting.

5. Advantages, Trade-offs, and Limitations

SeqL offers the following benefits:

Catastrophic forgetting mitigation: By isolating updates, prior knowledge and features corresponding to old classes/tasks are preserved longer, especially in class-wise (Jain et al., 2024) or task-wise (Zhang et al., 2024) SeqL.
Improved confidence calibration: As shown in detection-head finetuning for SeqVLA (Yang et al., 17 Sep 2025), joint and sequential head training influence entropy and separation of event boundaries.
Resource efficiency: Bayesian/Kalman-style SeqL (Jing et al., 12 Sep 2025) operates with minimal data buffer (memory size one) and sub-millisecond updates, essential for streaming or latency-critical adaptation.
Parameter efficiency: Hybrid PEFT variants of SeqL (Zhang et al., 2024) achieve competitive accuracy with $L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\|_2^2.$ 61% of the full parameter count.

There are intrinsic trade-offs:

Sharpness vs. modularity: Sequential separation of objectives (e.g., in dual-head models) can result in weaker detection signals compared to joint optimization (Yang et al., 17 Sep 2025).
Lack of explicit replay/regularization: Purely sequential class/task exposure can still induce boundary drift or overfitting if regularization is not appropriately tuned (Zhang et al., 2024).
Order sensitivity: Randomization over multiple class/task orders is recommended to minimize bias (Jain et al., 2024).

6. Extensions, Best Practices, and General Recommendations

SeqL is broadly applicable wherever classes, domains, or objectives are naturally partitionable or data arrives in non-stationary stream form. Best practices include:

Careful learning-rate scheduling: "Slow Learner" principles—tiny rates for backbone, flexible rates for classifier head—are critical in vision models (Zhang et al., 2024).
Per-stage regularization: Use modest $L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\|_2^2.$ 7, light $L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\|_2^2.$ 8, dropout to prevent overfitting in per-class or per-phase updates (Jain et al., 2024).
Post-hoc calibration: Classifier alignment steps further harmonize disjoint output heads in class-incremental regimes (Zhang et al., 2024).
Uncertainty quantification: Bayesian SeqL (Jing et al., 12 Sep 2025) provides calibrated adaptivity, robust under covariate shift and limited data.
KL/policy stabilization: In sequence modeling, interpolating between RL objectives and data priors via a KL constraint ensures output diversity and information retention (Jaques et al., 2016).

A plausible implication is that SeqL frameworks, particularly when combined with parameter-efficient representations and calibration procedures, can set new strong baselines for continual adaptation scenarios in both supervised and control settings.

7. Comparative Table: Key Sequential Finetuning Protocols

Paper/Domain	Protocol	Memory	Retention Mechanism	SOTA Results
SeqVLA (Yang et al., 17 Sep 2025)	2-phase (action/detection)	Episodic	Backbone+head separation	95–100% subtask
SeQuiFi (Jain et al., 2024)	Sequential class FT	None	Per-class focus, L2 reg	70%+ macro-F1
SLCA++ (Zhang et al., 2024)	Seq FT+SL+CA	None	Slow backbone, classifier align	91% CIFAR-100
Kalman Bayesian Transformer (Jing et al., 12 Sep 2025)	Recursive Bayesian update	None	Uncertainty-adaptive, moment propagation	Fast+robust online
Sequence Tutor (Jaques et al., 2016)	MLE→KL-control RL SeqL	Replay	KL to data prior, Q/ $L_{\text{action}} = \mathbb{E}_{\tau\sim U[0,1], a_t^{(0)}, a_t^{(1)}} \bigl\\|\,\pi_\theta(a_t^{(\tau)}, \tau, z_t) - (a_t^{(1)} - a_t^{(0)})\,\bigr\\|_2^2.$ 9/G-learning	Structural retention

The accumulated literature demonstrates that sequential finetuning is a versatile, theoretically grounded, and empirically robust strategy for continual, modular, and uncertainty-aware model adaptation across domains.