Selective Non-Adaptive Fine-Tuning (PaCA)
- PaCA is a parameter-efficient fine-tuning technique that directly updates a fixed random subset of pretrained weights, avoiding auxiliary adapter modules.
- It minimizes latency and memory overhead because all updates are applied in place within the existing weight kernels, so no sequential adapter computation slows the forward or backward pass.
- Empirical benchmarks show PaCA achieves competitive accuracy with faster training times and supports longer sequences compared to LoRA.
Selective Non-Adaptive Fine-Tuning, specifically in the context of Partial Connection Adaptation (PaCA), represents an advancement in parameter-efficient fine-tuning (PEFT) by enabling direct sparse modification of pretrained neural weights rather than the introduction of auxiliary adapter modules. Unlike classical PEFT methods such as Low-Rank Adaptation (LoRA), which rely on external low-rank matrices appended to each layer, PaCA achieves fine-tuning efficiency by randomly selecting and updating a small, fixed subset of the pretrained model’s parameters—"partial connections"—throughout training. This design minimizes latency overhead, reduces activation memory requirements, and improves throughput, making it suitable for adaptation of very large models under memory and time constraints (Woo et al., 28 Feb 2025).
1. Algorithmic Foundations and Workflow
PaCA departs from adapter-based PEFT by operating directly on a random subset of entries in each weight matrix W_l ∈ ℝ^{d_out×d_in}:
- Initialization: PaCA starts from a frozen, pretrained model (e.g., LLaMA). A rank parameter r (or equivalently a selection probability p = r/d_in) is chosen, controlling the sparsity of the adapted subspace.
- Partial Connection Selection: For each layer l, a binary mask M_l ∈ {0,1}^{d_out×d_in} is sampled, selecting exactly r (or r in expectation) active columns per matrix. This mask is fixed for the entirety of fine-tuning.
- Forward Pass: The forward computation remains identical to that of the frozen pretrained model: X_out = W_l X_in.
- Backward Pass and Update: Input gradients are computed as ∇X_in = W_l^T ∇X_out, and parameter gradients are evaluated only for the entries indicated by M_l: ∇P_l = (∇X_out X_in^T) ⊙ M_l. Only the selected parameters are updated; all others remain frozen.
- Kernel Parallelism: Since PaCA does not introduce any adapter layers, no additional sequential kernels are required. All partial connection updates proceed in parallel across layers.
This approach eliminates the common GPU latency penalty associated with stacking adapter modules (as in LoRA), as all computation remains in the existing weight kernels (Woo et al., 28 Feb 2025).
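The workflow above can be sketched as a minimal, pure-Python layer. This is an illustrative toy (the class name `PaCALinear` and all details are assumptions, not the authors' code), but it reproduces the key property: a fixed random set of columns is trainable, everything else stays frozen.

```python
import random

def matvec(W, x):
    """y = W x for a matrix stored as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

class PaCALinear:
    """Toy PaCA-style linear layer: only r fixed, randomly chosen
    columns of the pretrained weight matrix are ever updated."""

    def __init__(self, W, r, seed=0):
        self.W = [row[:] for row in W]   # pretrained weights (copied)
        d_in = len(W[0])
        # partial-connection mask: r column indices, fixed for all of training
        self.cols = sorted(random.Random(seed).sample(range(d_in), r))

    def forward(self, x):
        self.x = x                        # cache input for backward
        return matvec(self.W, x)          # identical to the frozen model

    def backward(self, grad_out, lr=0.1):
        d_out, d_in = len(self.W), len(self.W[0])
        # input gradient uses the full (mostly frozen) weight matrix
        grad_in = [sum(self.W[i][j] * grad_out[i] for i in range(d_out))
                   for j in range(d_in)]
        # parameter gradient outer(grad_out, x), applied only to masked columns
        for i in range(d_out):
            for j in self.cols:
                self.W[i][j] -= lr * grad_out[i] * self.x[j]
        return grad_in
```

Columns outside `self.cols` never change, matching the frozen entries in PaCA's formulation; the forward pass is byte-for-byte the frozen model's computation.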
2. Mathematical Formulation
The trainable parameter subspace in PaCA is a fixed, randomly chosen submatrix:
- Let each entry of M_l ∈ {0,1}^{d_out×d_in} be Bernoulli(p), or sample exactly r out of d_in columns per layer.
- The set of updated weights is P_l = M_l ⊙ W_l, where ⊙ denotes the Hadamard (element-wise) product.
- Forward computation is unaffected: X_out = W_l X_in.
- Backward parameter updates are masked: ∇P_l = (∇X_out X_in^T) ⊙ M_l.
Pseudocode directly from (Woo et al., 28 Feb 2025) outlines this computation:
```
sample mask M_l ∈ {0,1}^{d_out×d_in} with exactly r ones per matrix
let P_l = M_l ⊙ W_l                       # partial connection block
for each training step:
    compute X_out[l] = W_l X_in[l]        # forward
    compute ∇X_in[l] = W_l^T ∇X_out[l]    # backward input gradient
    for each layer l:
        G = ∇X_out[l] · X_in[l]^T
        ∇P_l = G ⊙ M_l
        P_l ← P_l − η ∇P_l
        W_l[M_l==1] ← P_l[M_l==1]
```
The remaining frozen entries (those with M_l = 0) are never changed during training (Woo et al., 28 Feb 2025).
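The masked parameter-gradient rule ∇P_l = (∇X_out X_in^T) ⊙ M_l can be checked numerically with a toy 2×2 example (values chosen purely for illustration):

```python
# ∇P_l = (∇X_out X_in^T) ⊙ M_l : only masked entries receive gradient
grad_out = [2.0, -1.0]          # ∇X_out, length d_out = 2
x_in     = [0.5, 3.0]           # X_in,   length d_in  = 2
M        = [[0, 1],             # mask: only column 1 is trainable
            [0, 1]]

# outer product ∇X_out X_in^T, then Hadamard product with the mask
G = [[g * x for x in x_in] for g in grad_out]
grad_P = [[G[i][j] * M[i][j] for j in range(2)] for i in range(2)]
# column 0 receives zero gradient; only column 1 would be updated
```

The frozen column's entries of grad_P are exactly zero, so a plain SGD step on P_l leaves them untouched.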
3. Comparative Computational and Memory Analysis
PaCA achieves computational and memory benefits by eliminating adapter overhead:
| Method | Trainable Params | Forward/Backward Kernels | Activation Memory (per layer) |
|---|---|---|---|
| Full-FT | All entries of W_l | Standard | Full input activations |
| LoRA | 2 low-rank matrices (r(d_out + d_in)) | Sequential (W + adapters) | Full input activations + adapter activations |
| PaCA | r columns (r · d_out) | Parallel (no adapters) | Partial activations for selected columns only |
Empirical benchmarks (LLaMA3-8B, rank=8, seq=512, batch=2) show that PaCA is 19% faster per iteration than LoRA (–18% forward, –20% backward) and uses 16% less total memory. The smaller memory footprint allows PaCA to support 23% longer training sequences before out-of-memory (OOM) errors (~9.8K tokens for PaCA vs 8K for LoRA at the same rank). PaCA also improves throughput by 16% over LoRA on both NVIDIA A100 and Intel Gaudi2 HPU (Woo et al., 28 Feb 2025).
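The trainable-parameter gap in the table above can be reproduced with a back-of-the-envelope count, assuming per weight matrix that LoRA trains r(d_out + d_in) adapter entries while PaCA trains r columns of d_out entries each (the dimension below is hypothetical):

```python
def lora_trainable(d_out, d_in, r):
    # LoRA: two low-rank factors, B (d_out × r) and A (r × d_in)
    return r * (d_out + d_in)

def paca_trainable(d_out, d_in, r):
    # PaCA: r whole columns of the existing weight matrix
    return r * d_out

d, r = 4096, 8   # hypothetical square projection matrix, rank 8
# for square matrices, PaCA trains exactly half as many parameters as LoRA
assert paca_trainable(d, d, r) * 2 == lora_trainable(d, d, r)
```

This factor-of-two ratio for square projections is consistent with the reported 11M (PaCA) vs. 21M (LoRA) trainable parameters at rank 8.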
4. Experimental Results and Performance Metrics
Comprehensive empirical evaluations are reported in (Woo et al., 28 Feb 2025):
- MMLU (5-shot) fine-tuning, LLaMA3-8B, single A100:
| Method | Rank | Params | Mem | Time | MMLU-Avg |
|--------|------|--------|-----|------|----------|
| LoRA   | 8    | 21M    | 27G | 4.4h | 65.0%    |
| PaCA   | 8    | 11M    | 23G | 3.5h | 65.2%    |
| PaCA   | 16   | 22M    | 23G | 3.5h | 65.4%    |
- Instruction Tuning (Oasst1 MT-Bench, LLaMA3-8B):
| Method | Rank | Mem | Time | MT-Bench Avg |
|--------|------|-----|------|--------------|
| LoRA   | 64   | 56G | 26m  | 5.12         |
| PaCA   | 64   | 47G | 21m  | 5.23         |
| PaCA   | 128  | 51G | 21m  | 5.26         |
- Quantization (QLoRA vs. QPaCA):
- LLaMA3-8B (Oasst1): QLoRA 18G / 42m / MT-Bench 5.00 vs. QPaCA 16G / 37m / 5.02.
- LLaMA3.1-70B: QLoRA 80G / 5.1h / 6.09 vs. QPaCA 69G / 4.7h / 6.08.
This suggests that PaCA matches or slightly exceeds the accuracy of LoRA while consistently reducing both training time and resource usage (Woo et al., 28 Feb 2025).
5. Implementation Practices and Hyperparameter Choices
PaCA is implemented in PyTorch on top of the HuggingFace PEFT library and typically uses 16-bit mixed precision. Targeted weight matrices include all Transformer attention and MLP projections (Q, K, V, O, Up, Down, Gate). Typical ranks are 8–16 for MMLU and 64–128 for instruction tuning and QPaCA. Batch sizes of 8–16 are used, with gradient accumulation for large models; the learning-rate schedule is cosine for MMLU and linear for instruction tuning. AdamW is the optimizer, with a short warmup period and a single epoch for most tasks (Woo et al., 28 Feb 2025).
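The recipe above can be collected into a plain configuration sketch. The keys and module names are illustrative (mirroring common Transformer projection names), not an actual PEFT API:

```python
# Hypothetical config dict summarizing the hyperparameters reported in the text
paca_config = {
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "up_proj", "down_proj", "gate_proj"],  # attention + MLP
    "rank": 8,               # 8-16 for MMLU; 64-128 for instruction tuning
    "precision": "bf16",     # 16-bit mixed precision
    "optimizer": "adamw",    # with a short warmup period
    "lr_schedule": "cosine", # cosine for MMLU, linear for instruction tuning
    "epochs": 1,             # single epoch for most tasks
    "batch_size": 8,         # 8-16, gradient accumulation for large models
}
```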
6. Comparison with Related PEFT Approaches
Traditional adapter-based PEFT methods, notably LoRA, insert low-rank matrices per layer, which necessitates separate forward/backward kernel executions for base and adapter parameters during training. This sequential operation constrains the reduction of wall-clock time, despite nominal computational reductions (LoRA achieves only 0.6% wall-clock time savings compared to Full-FT, even at 33% lower theoretical FLOPs). In contrast, PaCA performs weight adaptation in-place, avoiding additional kernel launches or sequential operations, which leads to improved actual training throughput and memory efficiency (Woo et al., 28 Feb 2025).
A plausible implication is that the selective non-adaptive mechanism of PaCA enables more scalable and resource-efficient fine-tuning for large-scale language and vision models compared to kernel-inefficient adapter paradigms.
7. Limitations and Scope
PaCA’s principal constraint is that the set of updated weights is fixed at initialization and random, rather than adaptively selected in response to task difficulty or model uncertainty. This non-adaptive sparsity could potentially underfit tasks that require complex weight space adaptation. Nevertheless, benchmarks indicate that, for a range of fine-tuning scenarios—including MMLU, instruction tuning, and quantized large-scale models—PaCA maintains or marginally improves accuracy relative to LoRA and similar adapter-based fine-tuning methods, while providing quantifiable gains in speed, memory, and maximum sequence length (Woo et al., 28 Feb 2025).
The approach offers direct applicability to any pretrained architecture with matrix-based layers and is validated across multiple hardware backends (NVIDIA A100, Intel Gaudi2).
For further technical details and open-source code, see (Woo et al., 28 Feb 2025).