
Blockwise SFT: Targeted Fine-Tuning

Updated 23 March 2026
  • Blockwise SFT is a method that restricts parameter updates to specific blocks, such as attention heads or transformer layers, for targeted adaptation.
  • It enhances data and compute efficiency while reducing overfitting by focusing updates on task-relevant components of large pre-trained models.
  • Empirical results across language, vision, and diffusion models demonstrate faster convergence and improved stability compared to full-model fine-tuning.

Blockwise Supervised Fine-Tuning (SFT) refers to a family of methods in which parameter adaptation in large pre-trained neural networks is restricted to a well-defined subset, or “block,” of model components. By constraining which network segments are updated, with blocks delineated as attention heads, transformer layers, neuron rows, or response-token regions, blockwise SFT frameworks increase data efficiency, improve stability, reduce memory and compute requirements, and often generalize better from limited labeled data than full-model fine-tuning. Techniques across diverse modalities and architectures, including transformers for language and vision, sparse models, and diffusion LMs, have converged on the utility of blockwise SFT for targeted adaptation, pruning-aware optimization, and training-inference alignment.

1. Blockwise SFT in Parameter Subspace Selection

Blockwise SFT formalizes a restriction of parameter updates to a targeted subnetwork or block, with block granularity adapted to the network class:

  • Structural segmentation: In standard deep networks, blocks may correspond to contiguous groups of layers, identified by architectural separators (e.g., activation or pooling layers) or by sliding windows of fixed length (Barakat et al., 2023).
  • Head-level granularity: Transformer models naturally segment into attention heads; only a salient subset, often task-specific as determined by causality analysis, is updated during adaptation (Nur'aini et al., 13 Jan 2026).
  • Neuron-row blocks: For linear layers, each output neuron row constitutes a block. Structural pruning metrics rank these rows for downstream task importance, after which adaptation is restricted to the most relevant (Li et al., 17 Feb 2025).
  • Diffusion LMs: Blockwise tuning is interpreted temporally or sequentially over response token blocks, aligning training optimization to blockwise inference steps (Sun et al., 27 Aug 2025).

This block-oriented selection enables fine-grained trade-offs between plasticity and generalization, adjusting the degree of adaptation per task and data regime.
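The core mechanic shared by these variants, restricting a gradient step to a selected block of parameters, can be sketched in a few lines. This is a minimal numpy illustration (the function name `blockwise_update` and the flat parameter vector are illustrative, not from any of the cited papers):

```python
import numpy as np

def blockwise_update(params, grads, block_mask, lr=0.1):
    """Apply an SGD step only to parameters inside the selected block.
    `block_mask` is a boolean array marking the block's entries;
    everything outside the block is left untouched."""
    return params - lr * grads * block_mask

rng = np.random.default_rng(0)
params = rng.normal(size=6)
grads = np.ones(6)
mask = np.array([False, False, True, True, False, False])  # the "block"

new_params = blockwise_update(params, grads, mask)
changed = new_params != params
# Only the two masked entries move: changed == mask
```

In practice the mask is realized structurally (frozen modules, optimizer parameter groups) rather than elementwise, but the effect on the update is the same.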

2. Core Methodologies and Algorithmic Frameworks

Distinct algorithmic implementations of blockwise SFT have emerged, each with strong theoretical and empirical underpinnings:

  • Circuit-Targeted SFT (CT-SFT) discovers a label-balanced, causally relevant circuit of attention heads using task-directional relevance scoring derived from Contextual Decomposition Transformer (CD-T) analysis. Only the identified heads and LayerNorm parameters are updated, implemented via head-level gradient masking (Nur'aini et al., 13 Jan 2026).
  • EBFT (Effective Block-wise Fine-Tuning) sequentially optimizes a per-block reconstruction error—mean squared error between the outputs of each pruned “student” block and its unpruned “teacher” counterpart—cycling through model blocks one at a time with backpropagation localized to only the active block (Guo et al., 2024).
  • SPruFT (Structured-Pruning-based Sparse Fine-Tuning) selects top neuron rows using Taylor or magnitude criteria, then adapts only those rows via a sparse update matrix (ΔW), summing original and adapted activations to retain implementation efficiency and minimize memory use (Li et al., 17 Feb 2025).
  • Layer/Block-wise Optimization in CV models first partitions the network into candidate blocks, then searches (on a small held-out subset) for the most predictive block or group of layers to unfreeze, with all others kept static (Barakat et al., 2023).
  • Blockwise SFT for Diffusion LMs partitions responses into fixed-size token blocks, at each training step sampling a single block for gradient computation, masking all future tokens (suffix leakage prevention) and keeping prefix context clean (Sun et al., 27 Aug 2025).
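The per-block reconstruction objective used by EBFT can be illustrated with a small numpy sketch (the function name, toy matrices, and hand-applied sparsity mask are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def block_reconstruction_loss(teacher_out, student_out):
    """EBFT-style per-block objective: mean squared error between a
    pruned block's output and its unpruned teacher's output, minimized
    one block at a time with gradients confined to that block."""
    return np.mean((teacher_out - student_out) ** 2)

x = np.array([[1.0, 2.0], [3.0, 4.0]])          # inputs to one block
w_teacher = np.array([[1.0, 0.5], [0.2, 1.0]])  # dense (teacher) weights
w_pruned = np.array([[1.0, 0.0], [0.0, 1.0]])   # after a 50% sparsity mask

loss = block_reconstruction_loss(x @ w_teacher, x @ w_pruned)
# Fine-tuning the surviving weights of this block drives the loss toward 0.
```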

Commonly, these algorithms involve an initial discovery or selection phase followed by block-restricted SGD, using optimizer masks or parameter groupings at the framework level.
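For head-granular methods such as CT-SFT, the optimizer mask amounts to zeroing the gradient rows belonging to unselected attention heads. A hedged numpy sketch, assuming a fused projection whose rows are grouped per head (the layout and function name are illustrative):

```python
import numpy as np

def mask_head_gradients(grad, n_heads, keep_heads):
    """Zero the gradient for all attention heads except `keep_heads`.
    `grad` has shape (n_heads * head_dim, d_model); rows are assumed
    to be grouped head-by-head, as in a fused projection matrix."""
    head_dim = grad.shape[0] // n_heads
    masked = np.zeros_like(grad)
    for h in keep_heads:
        masked[h * head_dim:(h + 1) * head_dim] = grad[h * head_dim:(h + 1) * head_dim]
    return masked

grad = np.ones((8, 4))          # 4 heads, head_dim 2, d_model 4
masked = mask_head_gradients(grad, n_heads=4, keep_heads=[1, 3])
# Rows 2-3 and 6-7 keep their gradient; all other rows are zeroed.
```

In a real framework this masking would typically be applied via gradient hooks before the optimizer step.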

3. Theoretical and Empirical Advantages

Empirical and analytical investigations across blockwise SFT variants highlight several advantages:

  • Stability and regularization: Blockwise adaptation imposes a hard prior against overfitting, especially when labeled data is limited, by only permitting updates in regions of the parameter space causally implicated in the target task (Nur'aini et al., 13 Jan 2026).
  • Data efficiency: By focusing plasticity, blockwise SFT often outperforms full-model fine-tuning with 1–2 orders of magnitude fewer trainable parameters (Nur'aini et al., 13 Jan 2026, Li et al., 17 Feb 2025).
  • Resistance to catastrophic forgetting: When transferring between tasks (e.g., cross-lingual adaptation), restricting updates to task-relevant blocks preserves source task competence significantly better than full fine-tuning, as measured by post-transfer accuracy on proxy tasks (Nur'aini et al., 13 Jan 2026).
  • Reduced compute and memory: SPruFT and EBFT report 1.5–2× lower memory utilization than LoRA, and much lower than full fine-tuning, without incurring the kernel overhead of unstructured sparse updates (Li et al., 17 Feb 2025, Guo et al., 2024).
  • Feasibility on modest hardware: All weights and gradients for a single block can reside in memory at once, enabling fine-tuning of 6–7B-parameter models on a 16 GB GPU in ~30 minutes when a blockwise methodology is used (Guo et al., 2024).
  • Training-inference alignment: In diffusion LMs, blockwise SFT eliminates bias from prefix noise and suffix leakage, ensuring the training objective matches the likelihood factorization imposed by blockwise decoding (Sun et al., 27 Aug 2025).

A typical outcome is that, under constrained resource budgets, blockwise SFT formulations both converge faster and yield higher or equivalent accuracy versus strong baselines.
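The training-inference alignment point for diffusion LMs reduces to constructing a loss mask that covers only the sampled response block, with the prefix kept clean and the suffix excluded. A minimal sketch, assuming fixed-size blocks over a flat token index (function name and shapes are illustrative):

```python
import numpy as np

def blockwise_loss_mask(response_len, block_size, block_idx):
    """Loss mask for one training step of blockwise diffusion-LM SFT:
    ones over the sampled block, zeros over the (clean) prefix and the
    suffix, so no gradient leaks from future tokens."""
    mask = np.zeros(response_len)
    start = block_idx * block_size
    mask[start:start + block_size] = 1.0
    return mask

mask = blockwise_loss_mask(response_len=12, block_size=4, block_idx=1)
# Tokens 4-7 carry loss; prefix (0-3) and suffix (8-11) contribute nothing.
```

Sampling `block_idx` uniformly per step mirrors the blockwise decoding order at inference.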

4. Representative Experimental Results

Blockwise SFT variants achieve substantial empirical gains, as summarized below:

| Method | # Params Updated | Main Task/Dataset | Performance | Memory/Speedup | Reference |
|---|---|---|---|---|---|
| CT-SFT (depth 2) | 0.66% (18 heads + LN) | NusaX-Senti, XNLI | +0.119 (Acehnese) over FT | Catastrophic forgetting reduced | (Nur'aini et al., 13 Jan 2026) |
| EBFT (Wikitext2, 7B) | ~one block at a time | 70% sparsity | PPL 16.88 vs. DSnoT 75.14 | 30 min / 16 GB GPU | (Guo et al., 2024) |
| SPruFT (Llama2-7B) | 128 rows/block | Alpaca, 9 tasks | 60.86% (LoRA 60.69%) | 17.62 GB (LoRA 23.46 GB) | (Li et al., 17 Feb 2025) |
| Blockwise SFT (Diff-LM) | LoRA (rank 256) | GSM8K, MATH | +5pp GSM8K vs. SFT | No arch. change, stable | (Sun et al., 27 Aug 2025) |
| Blockwise (separator) | ~block/group | Tf_flowers, 5 models | 0.8518 (lowest variance) | Robust to overfitting | (Barakat et al., 2023) |

Key results indicate that blockwise SFT matches or exceeds the accuracy of full or LoRA-based adaptation, with smaller memory and substantially lower risk of performance drop post-transfer or under limited supervision.

5. Design Choices and Practical Implementation

Critical design and implementation details vary by context:

  • Block definition and selection: Can be architecture-driven (e.g., transformer heads, convolutional layer groups) or data-driven via importance scoring (Taylor saliency, quantile-mean, magnitude).
  • Optimizer masking: Per-parameter masks (e.g., setting requires_grad on individual parameters in PyTorch, or registering gradient hooks) restrict updates to block-selected parameters (Nur'aini et al., 13 Jan 2026).
  • Data partitioning: Empirical works typically use a small held-out or “discovery” pool (e.g., 50–400 examples) for initial importance or relevance computation (Nur'aini et al., 13 Jan 2026, Guo et al., 2024); subsequent updates are performed on disjoint tuning/validation splits.
  • Hyperparameters: Learning rates (e.g., 5×10⁻⁵ for CT-SFT), batch sizes (typically small, ≤16), and early-stopping (loss stagnation, relative change <1e−3) are tuned per block.
  • Sparsity trade-offs: Increasing the selection ratio p yields larger blocks or circuits, trading off faithfulness to the source mechanism for adaptation capacity (Nur'aini et al., 13 Jan 2026).
  • No requirement for custom sparse kernels: SPruFT and other blockwise SFT approaches use standard dense operations over small matrices, sidestepping the engineering issues characteristic of elementwise sparse updates (Li et al., 17 Feb 2025).

Best practices recommend block-level tuning for data-scarce tasks, starting selection from higher network layers, and monitoring adaptation on small validation splits for overfitting.
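Data-driven block selection via an importance score can be sketched concretely. This is an illustrative first-order Taylor-style ranking of neuron rows in the spirit of SPruFT (the function name, toy weights, and exact score formula are assumptions, not the paper's code):

```python
import numpy as np

def top_rows_by_taylor(weight, grad, k):
    """Rank each output-neuron row of a linear layer by a first-order
    Taylor-style score, sum_j |w_ij * g_ij|, and return the k
    highest-scoring row indices: the block to be adapted."""
    scores = np.abs(weight * grad).sum(axis=1)
    return np.argsort(scores)[::-1][:k]

weight = np.array([[1.0, 0.0], [0.5, 1.0], [2.0, 1.0]])
grad = np.ones_like(weight)   # gradients from a small discovery pool
rows = top_rows_by_taylor(weight, grad, k=2)
# Row scores: 1.0, 1.5, 3.0 → rows 2 and 1 are selected for adaptation.
```

In the full method, gradients would come from a small held-out discovery pool, matching the data-partitioning practice described above.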

6. Applications, Extensions, and Generalization

Blockwise SFT methodologies have been successfully adapted across several domains:

  • Transformer LMs: Cross-lingual transfer, low-resource adaptation (Nur'aini et al., 13 Jan 2026).
  • Sparse and pruned models: Structured/unstructured sparsity, pruning-aware retraining (Guo et al., 2024, Li et al., 17 Feb 2025).
  • Vision backbones: CV models (VGG, ResNet, MobileNet) show blockwise SFT outperforms both head-only and full model adaptation in image-based tasks (Barakat et al., 2023).
  • Diffusion LMs: Exact alignment between blockwise decoding and SFT leads to state-of-the-art math and reasoning performance (Sun et al., 27 Aug 2025).

The frameworks generalize to multi-task SFT by mixing downstream-task examples into the calibration set (Guo et al., 2024), and are robust to adaptation in both low-data and high-sparsity regimes (Li et al., 17 Feb 2025).

7. Limitations and Open Directions

Known limitations include:

  • Block discovery cost: For many block selection methods, especially those requiring gradient or structural importance estimates, there is a pre-adaptation compute overhead.
  • Block size/selection granularity: Selection of too large a block reduces memory/computation savings, while too small may underfit. Alignment of train and inference block granularity is essential in certain architectures (Sun et al., 27 Aug 2025).
  • Remaining overfitting risk: In minimal-data regimes, care must be taken with block selection to prevent adaptation to spurious features.
  • Implementation overhead: Blockwise SFT demands per-block optimizer and parameter management, though in practice this is standardized in major frameworks.

Future work may extend blockwise SFT to gradient-free solvers, adaptive block sizing, or human-feedback-driven block adaptation, particularly in diffusion and semi-autoregressive generative models (Sun et al., 27 Aug 2025). A plausible implication is that as models and tasks grow still larger and more heterogeneous, blockwise SFT will remain core to scalable, robust, and data-efficient adaptation strategies.
