Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

Published 10 Apr 2026 in cs.CV | (2604.09088v1)

Abstract: Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing both the number of trainable parameters and the memory consumed during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address this issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances performance by mutually distilling the frozen backbones and learnable side networks in fine-tuning, and discards the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2% while keeping parameter and memory consumption comparable, but also remarkably improves accuracy compared to SOTA approaches. The source code is available at https://github.com/Zhang-VKk/MDPD.


Authors (8)

Summary

The paper introduces Masked Dual Path Distillation (MDPD), combining dual-path feature and logits distillation for efficient transfer learning.
It employs a lightweight, trainable side network that is discarded after training, reducing inference latency by at least 25.2% while maintaining accuracy.
Empirical results on visual, language, and multimodal tasks demonstrate state-of-the-art performance and superior memory efficiency compared to PETL and traditional METL approaches.

Memory-Efficient Transfer Learning via Masked Dual Path Distillation

Motivation and Background

Transfer learning with large pre-trained models underpins state-of-the-art results across vision, language, and multimodal tasks. Fully fine-tuning these large backbones, however, is computationally prohibitive, incurring excessive memory cost and substantial risk of overfitting in data-scarce regimes. Parameter-efficient transfer learning (PETL) schemes, such as Adapter tuning, LoRA, or prompt-based approaches, offer trainable parameter reduction by updating only subsets or injecting small modules, but still incur significant memory overhead because their gradients traverse the entire backbone during backpropagation.

Memory-efficient transfer learning (METL) addresses these shortcomings by introducing lightweight side networks parallel to the backbone. During adaptation, only the side network is updated while the backbone is kept frozen, dramatically reducing the memory required for gradient storage. However, the critical limitation of existing METL approaches is their reliance on the side network during inference, introducing additional latency and memory cost, thus undermining the primary goal of total efficiency.

Masked Dual Path Distillation: Methodology

"Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation" (2604.09088) introduces Masked Dual Path Distillation (MDPD), an approach that combines memory-efficient fine-tuning with zero inference overhead from the side network. It relies on a dual-path knowledge distillation framework and a hierarchical feature-based distillation strategy to transfer task knowledge into the backbone.

The MDPD training pipeline comprises two intertwined networks: the frozen backbone and a lightweight, trainable side network. Both networks, initialized from the pre-trained model, interact through two main distillation mechanisms:

Feature-Based Distillation (Backbone→Side Network): Intermediate features from the backbone are used as targets for corresponding layers in the side network. Dimensionality mismatches between the backbone and side network outputs are reconciled via low-rank bottleneck modules to maintain parameter efficiency.
Logits-Based Distillation (Side Network→Backbone): The side network, optimized for the downstream task, produces logits that supervise the backbone's task-specific output layer. The backbone in this phase is partially unfrozen at normalization layers and the output head, allowing for minimal but sufficient adaptation.
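
The two distillation paths above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the choice of mean squared error for feature matching, the temperature-scaled KL divergence for logits, and the temperature value are all assumptions, and the low-rank projection that reconciles feature dimensions is omitted (the features are assumed to already have matching shapes).

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def feature_distill_loss(teacher_feats, student_feats):
    """Mean squared error between same-shaped token features.
    Backbone -> side path: backbone features act as the target (assumed loss form).
    Each argument is a list of tokens; each token is a list of floats."""
    sq, n = 0.0, 0
    for t_tok, s_tok in zip(teacher_feats, student_feats):
        for t, s in zip(t_tok, s_tok):
            sq += (t - s) ** 2
            n += 1
    return sq / n

def logit_distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.
    Side -> backbone path: side-network logits act as the target (assumed loss form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy values, for illustration only.
backbone_feats = [[1.0, 2.0], [0.5, -1.0]]
side_feats = [[0.8, 2.1], [0.5, -0.7]]
print(feature_distill_loss(backbone_feats, side_feats))
print(logit_distill_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8]))
```

In this sketch the roles of teacher and student swap between the two functions, which mirrors the mutual distillation described above; how the two terms are weighted or scheduled is not specified here.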

A distinguishing attribute of MDPD is the mutual, alternating teacher-student dynamic between the backbone and the side network, which enhances both generalization and downstream performance. Most importantly, the side network is discarded after training; only the backbone, now capable of downstream inference on its own, is deployed. As a result, inference-time memory and computational costs match those of PETL approaches, without PETL's training-time memory bottleneck.
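
As a rough sketch of the two ideas above (a small trainable subset of the backbone, and a side network that disappears at deployment), the following plain-Python helpers filter parameters by name. The parameter names, the `norm` keyword, and the `head.` and `side.` prefixes are hypothetical naming conventions chosen for illustration; the actual repository may organize its parameters differently.

```python
def select_trainable(param_names, norm_keyword="norm", head_prefix="head."):
    """Return the backbone parameter names left trainable:
    normalization layers and the output head (hypothetical naming)."""
    return [n for n in param_names if norm_keyword in n or n.startswith(head_prefix)]

def strip_side_network(params, side_prefix="side."):
    """Drop every side-network entry so that only backbone weights remain for inference."""
    return {n: v for n, v in params.items() if not n.startswith(side_prefix)}

# Hypothetical backbone parameter names.
backbone_names = [
    "blocks.0.attn.qkv.weight",
    "blocks.0.norm1.weight",
    "blocks.0.norm1.bias",
    "head.weight",
]
print(select_trainable(backbone_names))

# Hypothetical checkpoint containing both backbone and side-network weights.
checkpoint = {"blocks.0.norm1.weight": 1.0, "head.weight": 2.0, "side.blocks.0.weight": 3.0}
print(sorted(strip_side_network(checkpoint)))
```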

Figure 1: Overview of MDPD. During training, backbone and side network mutually distill features and logits; inference uses the backbone only, maximizing efficiency.

Hierarchical Feature Distillation

A critical observation is that feature distributions in the backbone and the side network diverge substantially in deep layers, especially in their attention distributions. For shallow layers, where attention patterns are well aligned, direct feature imitation is effective. For deep layers, the authors instead employ masked feature generation: selected tokens of the side network's features are replaced by learned mask tokens, and a simple convolutional generator attempts to reconstruct the backbone's deep features from these masked representations.

This strategy balances the conflicting demands of semantic alignment and information diversity, ensuring stability in shallower layers while maximizing adaptation flexibility and representational richness in deeper layers.
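
The masked feature generation step for deep layers can be illustrated with a small, dependency-free sketch. It makes several assumptions that are not taken from the paper: tokens are masked uniformly at random, the mask token is a fixed vector rather than a learned one, and the generator is a fixed 1-D convolution along the token axis rather than a trained convolutional module. The names `mask_tokens` and `conv1d_generator` are hypothetical.

```python
import random

def mask_tokens(tokens, mask_token, mask_ratio=0.5, seed=0):
    """Replace a random subset of tokens with the mask token.
    Returns the masked sequence and the sorted list of masked indices."""
    rng = random.Random(seed)
    n_mask = int(mask_ratio * len(tokens))  # floor of ratio * sequence length
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    chosen = set(masked_idx)
    masked = [list(mask_token) if i in chosen else list(tok)
              for i, tok in enumerate(tokens)]
    return masked, masked_idx

def conv1d_generator(tokens, kernel=(0.25, 0.5, 0.25)):
    """Stand-in generator: a fixed 1-D convolution over the token axis
    with zero padding. In practice the kernel weights would be learned."""
    half = len(kernel) // 2
    dim = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        row = [0.0] * dim
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(tokens):
                for d in range(dim):
                    row[d] += w * tokens[j][d]
        out.append(row)
    return out

# Hypothetical side-network features: 4 tokens of dimension 2.
side_feats = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [1.0, -1.0]]
masked, idx = mask_tokens(side_feats, mask_token=[0.0, 0.0], mask_ratio=0.5)
generated = conv1d_generator(masked)
print(idx, generated)
```

The generated features would then be compared against the backbone's deep features with a reconstruction loss, so the side network is pushed to encode enough context to recover masked positions instead of copying the backbone token by token.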

Empirical Results

MDPD is extensively validated on visual, language, and multimodal tasks, including ITR (Flickr30K, MSCOCO), VTR (MSR-VTT, MSVD), VQA (VQAv2, GQA), visual grounding (RefCOCO, RefCOCO+, RefCOCOg), as well as the GLUE and VTAB-1K benchmarks for language-only and vision-only evaluation.

Key claims supported by the empirical results include:

Inference acceleration: Discarding the side network post-finetuning yields at least a 25.2% reduction in inference latency compared to previous METL approaches that retain the side network.
Superior tradeoff: MDPD matches or outperforms both PETL and METL methods in memory, parameter footprint, and accuracy. For example, on cross-modal retrieval, MDPD improves R@1 over UniPT by up to 1.4% and over LST by 3.4%, together with notable Rsum gains.
GLUE/VTAB-1K performance: MDPD achieves state-of-the-art mean accuracy (e.g., 78.3% mean on VTAB-1K, outperforming prior METL/PETL baselines), with commensurate or higher efficiency metrics.
Ablation results demonstrate that dual-path (feature + logits) distillation yields better performance than using either path alone. Further, performance is sensitive to the mask ratio used in deep-layer distillation, with the best results at $\lambda = 0.5$.
Figure 2: Impact of mask ratio $\lambda$ on distillation efficacy for ITR tasks; the best tradeoff is observed at $\lambda = 0.5$.

Theoretical and Practical Implications

MDPD establishes that side network architectures for METL can be relegated entirely to the training stage, negating their inference burden. This property brings METL paradigms in line with parameter-efficient approaches for real-world, resource-constrained deployment. The paper provides detailed analysis of memory usage during backpropagation, formalizing why prior PETL methods cannot surpass certain theoretical memory bounds, while side-network based METL approaches do, and how MDPD maximizes these gains.
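
The memory argument can be illustrated with a back-of-the-envelope count of the activations that must be kept for the backward pass. This is a deliberately simplified sketch, not the paper's formal bound: it counts one output tensor per layer, ignores attention maps, weights, and optimizer state, and uses hypothetical layer counts and widths.

```python
def activations_kept_for_backward(num_layers, tokens, backbone_width, side_width, scheme):
    """Rough count of activation values that must stay in memory for backpropagation.

    scheme="petl": gradients traverse every backbone layer, so each layer's output is kept.
    scheme="metl": the backbone runs without gradient tracking; only side-network outputs are kept.
    """
    if scheme == "petl":
        return num_layers * tokens * backbone_width
    if scheme == "metl":
        return num_layers * tokens * side_width
    raise ValueError(f"unknown scheme: {scheme!r}")

# Hypothetical sizes: 12 layers, 197 tokens, backbone width 768, side width 96.
petl = activations_kept_for_backward(12, 197, 768, 96, "petl")
metl = activations_kept_for_backward(12, 197, 768, 96, "metl")
print(petl, metl, petl / metl)
```

Under these assumed sizes the stored activations shrink in proportion to the width ratio between backbone and side network, which conveys why side-network training can fall below the memory floor of methods whose gradients pass through the full backbone.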

The dual-path, hierarchical distillation concept provides a general template for future transfer learning research, with implications beyond memory efficiency: more fine-grained teacher-student strategies may improve knowledge transfer in self-supervised representation learning or continual learning contexts.

Practically, MDPD's architecture-agnostic design and validated performance on both unimodal and multimodal settings signal robust applicability to large-scale, real-world AI systems that must balance adaptation capacity against stringent memory and time constraints.

Conclusion

Masked Dual Path Distillation offers an empirically supported approach to memory-efficient transfer learning that removes the inference-time overhead inherent in prior METL architectures. The combination of mutual distillation, feature masking, and side network fading enables fast, accurate, and resource-efficient deployment of adapted models. This framework points toward increasingly efficient fine-tuning paradigms and suggests further research directions in distillation topology, masking policies, and teacher-student dynamics for both unimodal and multimodal foundation model adaptation.
