Joint-Dataset SFT: Multi-task Fine-Tuning
- Joint-Dataset SFT is the supervised adaptation of large models using heterogeneous data from multiple tasks, domains, or modalities to enhance broad generalization.
- It employs structured techniques like CPI-FT to isolate core parameters and fuse per-task updates, effectively mitigating gradient interference and catastrophic forgetting.
- Empirical outcomes demonstrate measurable gains, including improved normalized scores and robustness in vision, speech, and multilingual applications.
Joint-Dataset Supervised Fine-Tuning (SFT) denotes supervised adaptation of large models—especially LLMs and vision-LLMs—on heterogeneous data pooled from multiple tasks, domains, or modalities. This paradigm is motivated by the need to efficiently leverage diverse supervision signals for broad generalization, while mitigating gradient interference, catastrophic forgetting, and overfitting. Recent work has established advanced frameworks—such as Core Parameter Isolation Fine-Tuning (CPI-FT)—for structured multi-task SFT, and demonstrated joint SFT as a backbone for continual learning, vision, code, speech, and multilingual capabilities (Wang et al., 29 Aug 2025, Ding et al., 11 Jun 2025, Ye et al., 1 Jun 2025, Chen et al., 2024, Peng et al., 2024, Jiang et al., 2024).
1. Definitions and General Principles
In joint-dataset SFT, a pretrained model with parameters θ₀ is fine-tuned on a mixture of labeled datasets D_1, …, D_T corresponding to T downstream tasks. Fine-tuning is performed by minimizing the aggregate cross-entropy loss:

\mathcal{L}(\theta) = \sum_{i=1}^{T} \lambda_i \, \mathbb{E}_{(x,y) \sim D_i}\big[-\log p_\theta(y \mid x)\big],

where the λ_i are per-dataset sampling or loss weights.
Typical data types include in-domain, out-of-domain, multi-modal (e.g., text and speech), or multilingual samples. Sampling strategies and loss-weighting across datasets are essential to control performance trade-offs and address data imbalance.
Recent formulations extend basic joint SFT by mixing synthetic data, domain-adaptation corpora, or interleaving modalities within single-stage or staged protocols (Ding et al., 11 Jun 2025, Peng et al., 2024).
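The sampling and loss-weighting strategies described above are often implemented as temperature-scaled proportional sampling over the pooled datasets. A minimal sketch (the function names and the exponent `alpha` are illustrative assumptions, not from any cited work):

```python
import random

def mixture_probs(sizes, alpha=0.5):
    # Temperature-scaled proportional sampling: p_i proportional to n_i**alpha.
    # alpha = 1 recovers size-proportional sampling; alpha < 1 up-samples
    # underrepresented datasets, which helps against data imbalance.
    scaled = [n ** alpha for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_dataset(sizes, alpha=0.5, rng=None):
    # Pick one dataset index per training step according to the mixture.
    rng = rng or random.Random(0)
    probs = mixture_probs(sizes, alpha)
    return rng.choices(range(len(sizes)), weights=probs, k=1)[0]
```

With `alpha=0.5`, a dataset holding 10% of the pooled examples receives roughly a quarter of the sampling mass, illustrating how small sources are up-weighted without explicit per-task loss scaling.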
2. Core Parameter Isolation and Structured Joint SFT
The Core Parameter Isolation Fine-Tuning (CPI-FT) framework exemplifies state-of-the-art structured joint SFT (Wang et al., 29 Aug 2025). CPI-FT decomposes the tuning process as follows:
- Task probing: Independently fine-tune on each dataset D_i for a small, fixed number of epochs to obtain per-task parameters θ_i.
- Core region extraction: For each task, compute element-wise update magnitudes Δ_i = |θ_i − θ₀|; define the core region C_i as the top-ρ% of indices by magnitude, or those exceeding a fixed threshold.
- Task grouping via Jaccard similarity: Tasks i and j are grouped together when their core regions overlap strongly, i.e., when J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j| exceeds a similarity threshold.
- Parameter fusion: After multi-stage SFT, fuse per-task core parameters by transplanting core indices and interpolating non-core regions via Spherical Linear Interpolation (SLERP).
- Pipelined joint SFT: Execute staged fine-tuning over task groups, dynamically freezing cores accumulated so far, and backpropagate only through unfrozen parameters on a mixed calibration set.
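The core-extraction, grouping, and fusion steps above can be sketched in NumPy. This is a simplified illustration over a single flattened parameter vector; the helper names and default values of `rho` and `t` are assumptions, not the exact CPI-FT implementation:

```python
import numpy as np

def core_region(theta0, theta_task, rho=0.05):
    # Indices carrying the top-rho fraction of update magnitudes |theta_task - theta0|.
    delta = np.abs(theta_task - theta0)
    k = max(1, int(rho * delta.size))
    return set(np.argsort(delta)[-k:].tolist())

def jaccard(a, b):
    # Core-region overlap used for task grouping.
    return len(a & b) / len(a | b)

def slerp(w0, w1, t=0.5, eps=1e-8):
    # Spherical linear interpolation between flattened weight vectors.
    n0 = w0 / (np.linalg.norm(w0) + eps)
    n1 = w1 / (np.linalg.norm(w1) + eps)
    omega = np.arccos(np.clip(np.dot(n0, n1), -1.0, 1.0))
    if omega < eps:
        return (1 - t) * w0 + t * w1
    return (np.sin((1 - t) * omega) * w0 + np.sin(t * omega) * w1) / np.sin(omega)

def fuse(theta0, theta_task, core, t=0.5):
    # Transplant core indices verbatim; SLERP-blend the non-core remainder.
    fused = slerp(theta0, theta_task, t)
    idx = np.array(sorted(core))
    fused[idx] = theta_task[idx]
    return fused
```

Transplanting core indices exactly (rather than interpolating them) is what preserves each task's most important updates during fusion.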
CPI-FT was shown to alleviate the "seesaw phenomenon" (i.e., instability across tasks), and yielded an avg. +0.46 normalized score gain over best baselines in LLaMA-2-7B multi-task settings. Catastrophic forgetting between conflicting tasks is substantially reduced (<6 points loss vs. 16–24 for vanilla SFT), and performance under data imbalance is robust (Wang et al., 29 Aug 2025).
3. Data Construction and Joint-Dataset Mixing Strategies
Appropriate construction of joint datasets is fundamental. For code LLMs, Chen et al. (2024) designed a pipeline mixing "atomic" (human-crafted function) and "synthetic" (auto-chained, multi-step) code examples, observing that:
- Pure atomic or synthetic SFT yielded limited generalization (max pass@1 ≈33%).
- A small number of synthetic chains (as few as 50) plus up-sampled atomic data sufficed to reach 55% pass@1, far exceeding either subset alone.
VoiceTextBlender (Peng et al., 2024) mixes modalities—text-only, ASR, AST, speech QA, and mixed-modal SFT—in a single-stage protocol by sampling batches stochastically from each data type according to fixed ratios. This empirical weighting obviates the need for explicit per-task loss scaling. Similarly, in (Ding et al., 11 Jun 2025), domain adaptation is realized by mixing 83% general-purpose, synthetic SFT data with 17% domain-specific (medical) SFT data, with a scalar α denoting the domain-data mix weight in the composite loss:

\mathcal{L} = (1 - \alpha)\,\mathcal{L}_{\text{general}} + \alpha\,\mathcal{L}_{\text{domain}}.
This approach maintains general model capability while enhancing task-specific performance.
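The composite objective behind this mixing strategy is a one-liner; here α is the domain-data weight, with α ≈ 0.17 mirroring the 83%/17% general-to-domain split described above (the function name is illustrative):

```python
def composite_loss(loss_general, loss_domain, alpha=0.17):
    # Weighted combination of the general-purpose and domain-specific
    # SFT losses: (1 - alpha) * L_general + alpha * L_domain.
    return (1.0 - alpha) * loss_general + alpha * loss_domain
```

In practice the same effect can be achieved implicitly by drawing each mini-batch from the domain corpus with probability α, which is why fixed sampling ratios and explicit loss weighting are largely interchangeable.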
4. Algorithms, Architectures, and Key Implementation Details
Joint-dataset SFT architectures span multiple modalities and domains.
- CPI-FT (Wang et al., 29 Aug 2025): Operates on full model parameters; uses per-task core parameter transplantation, SLERP for non-core fusion, and pipelined freezing by binary masks.
- Vision SFT (ViSFT) (Jiang et al., 2024): Attaches per-task heads to a frozen vision backbone (e.g., EVA-CLIP ViT-E, ViT-G), then trains low-rank LoRA adapters jointly on tasks (e.g., detection, segmentation, captioning)—sampling tasks per iteration using fixed probabilities.
- Multilingual SFT and CC-Tuning: In CC-Tuning (Ye et al., 1 Jun 2025), joint SFT is baseline; the architecture implements cross-lingual fusion in latent space at the feed-forward layer, with a Decision Maker selecting activations for transfer.
- Speech–Text SFT: VoiceTextBlender (Peng et al., 2024) alternates over text, ASR/AST, QA, and mixed-modal speech-text via a LoRA-adapted LLM, using a modality adapter between speech and textual representations.
Key SFT hyperparameters include learning rate and batch size per modality/task (e.g., learning rate 1e-5 and batch size 64 in CPI-FT; a fixed mixture ratio in (Ding et al., 11 Jun 2025); AdamW optimizer throughout). Practical guidelines from multiple works converge on: (1) proportional sampling, (2) up-sampling of rare task data, (3) lightweight adapter usage for memory efficiency, and (4) freezing the backbone in most vision and speech joint SFT protocols (Jiang et al., 2024, Peng et al., 2024).
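The frozen-backbone-plus-adapter pattern used in the vision and speech settings above can be sketched as a minimal LoRA-style linear layer in NumPy. This follows the standard LoRA formulation (base output plus a scaled low-rank update B·A with B zero-initialized); the class name and default `r`/`alpha` values are illustrative:

```python
import numpy as np

class LoRALinear:
    # Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A.
    def __init__(self, W, r=4, alpha=8, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                  # frozen backbone weight
        self.A = rng.normal(0, 0.01, (r, d_in))     # trainable down-projection
        self.B = np.zeros((d_out, r))               # trainable, zero-initialized
        self.scale = alpha / r

    def forward(self, x):
        # Base path is untouched; only A and B would receive gradients.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the adapted model is exactly the pretrained model at initialization, so joint SFT perturbs the backbone's behavior only as far as the low-rank update learns to.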
5. Empirical Outcomes, Benchmarks, and Ablation Analyses
Joint-dataset SFT consistently outperforms single-task, sequential, and naïve multi-task baselines on key benchmarks:
- CPI-FT (Wang et al., 29 Aug 2025): Average normalized score (LLaMA-2-7B) improves from 6.58 (full SFT) to 7.21; catastrophic forgetting reduced to <6 points.
- Domain-preserving SFT (Ding et al., 11 Jun 2025): On public LLM benchmarks (MMLU-PRO, GPQA, IFEval, Math5), joint SFT with 83% synthetic preserves or improves average accuracy (39.21 vs. 39.00) compared to the base model, while public dataset mixtures induce forgetting.
- Vision (Jiang et al., 2024): ViSFT boosts out-of-domain classification, detection, and retrieval metrics vs. the pretrained backbone (e.g., zero-shot accuracy +0.3–0.7 points, OCR +2.5–3.2%).
- Speech–Text (Peng et al., 2024): Single-stage joint SFT with LoRA matches text-only LLM performance while achieving state-of-the-art speech recognition and translation (lowest WER across evaluated languages, highest BLEU in 5/6 AST directions).
- Ablations: Joint-dataset SFT requires both atomic and synthetic data in code; removing either impairs generalization (Chen et al., 2024). In speech–text, ablating text data causes text skills to collapse, showing that only joint SFT over a modality mixture preserves the capabilities of both modalities (Peng et al., 2024). Multi-response and multi-model candidate filtering further enhance generalization in domain adaptation (Ding et al., 11 Jun 2025).
6. Modal Extensions: Multilingual, Multimodal, and Continual Learning
Joint-dataset SFT extends naturally to multilingual (Ye et al., 1 Jun 2025), multimodal (Peng et al., 2024), and continual learning scenarios (Ding et al., 11 Jun 2025):
- Multilingual SFT: Basic paradigm is simple aggregation of instruction–response pairs across languages, minimizing joint negative log-likelihood. CC-Tuning introduces explicit latent-level fusion between English and non-English activations, with added modules but no auxiliary loss, yielding substantial performance gains over data-augmented SFT, and outperforming baselines that use 2× more training data (Ye et al., 1 Jun 2025).
- Multimodal SFT: Speech–text models such as VoiceTextBlender sample from text, speech, and hybrid inputs per mini-batch, with a single SFT loss, avoiding catastrophic forgetting in all modalities (Peng et al., 2024).
- Continual/Domain Adaptation: Synthetic dataset construction (sampling instruction distribution of the base LLM, multi-model response filtering) provides a general rehearsal buffer, enabling preservation of base capabilities during domain-specific adaptation (Ding et al., 11 Jun 2025). This method is recommended over relying solely on partial public SFT datasets.
Empirical findings highlight that proportional joint-dataset mixing, modular loss weighting, and strategic parameter freezing are critical for transferring and preserving capabilities in challenging multi-domain, multi-language, or multi-modal settings.
7. Practical Guidelines and Best Practices
Consensus best practices for joint-dataset SFT, as synthesized from the referenced works, include:
- Construct joint datasets to include both "atomic" (core/simple) and "composite/synthetic" (complex/long-chain) instances when possible (Chen et al., 2024).
- Tune mixture weights (e.g., the general-vs-domain data ratio in domain adaptation) using held-out evaluation data (Ding et al., 11 Jun 2025).
- Freeze pretrained backbone weights in vision and speech settings; restrict adaptation to LoRA/adapter modules for efficiency (Jiang et al., 2024, Peng et al., 2024).
- Use multi-model/multi-response filtering for synthetic data construction to maximize generalization and mitigate forgetting (Ding et al., 11 Jun 2025).
- Sample tasks/datasets proportionally to their contribution to overall performance; up-sample underrepresented sources as needed to balance (Jiang et al., 2024, Peng et al., 2024).
- Use staged or pipelined freezing when performing multi-stage adaptation, particularly when catastrophic forgetting is a concern (Wang et al., 29 Aug 2025).
- For multilingual and multimodal applications, integrate explicit latent-level or architectural fusions for improved transfer (Ye et al., 1 Jun 2025, Peng et al., 2024).
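The staged/pipelined freezing guideline above reduces, in the simplest case, to masking gradient updates on parameters whose core regions have already been consolidated. A minimal sketch over a flattened parameter vector (the function name and learning rate are illustrative assumptions):

```python
import numpy as np

def masked_update(theta, grad, frozen_mask, lr=1e-5):
    # Apply a gradient step only where frozen_mask is False; accumulated
    # core parameters (frozen_mask True) are left untouched, which is the
    # mechanism that prevents later task groups from overwriting earlier ones.
    return theta - lr * grad * (~frozen_mask)
```

As each task group finishes its stage, its core indices are added to `frozen_mask`, so the trainable region shrinks monotonically over the pipeline.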
These procedures, applied within the joint-dataset SFT paradigm, systematically address gradient conflict, preserve both general and specialized skills, and extend across modalities and domains at scale.