DELTA: DEep Learning Transfer using Feature Map with Attention for Convolutional Networks (1901.09229v4)

Published 26 Jan 2019 in cs.LG and stat.ML

Abstract: Transfer learning through fine-tuning a pre-trained neural network with an extremely large dataset, such as ImageNet, can significantly accelerate training while the accuracy is frequently bottlenecked by the limited dataset size of the new target task. To solve the problem, some regularization methods, constraining the outer layer weights of the target network using the starting point as references (SPAR), have been studied. In this paper, we propose a novel regularized transfer learning framework DELTA, namely DEep Learning Transfer using Feature Map with Attention. Instead of constraining the weights of the neural network, DELTA aims to preserve the outer layer outputs of the target network. Specifically, in addition to minimizing the empirical loss, DELTA intends to align the outer layer outputs of two networks, through constraining a subset of feature maps that are precisely selected by attention that has been learned in a supervised manner. We evaluate DELTA against state-of-the-art algorithms, including L2 and L2-SP. The experimental results show that our proposed method outperforms these baselines with higher accuracy for new tasks.

Authors (7)
  1. Xingjian Li (49 papers)
  2. Haoyi Xiong (98 papers)
  3. Hanchao Wang (23 papers)
  4. Yuxuan Rao (1 paper)
  5. Liping Liu (26 papers)
  6. Zeyu Chen (48 papers)
  7. Jun Huan (31 papers)
Citations (164)

Summary

DEep Learning Transfer using Feature Map with Attention for Convolutional Networks: A Critical Overview

The paper introduces a novel framework called DEep Learning Transfer using Feature Map with Attention (DELTA), aimed at enhancing the accuracy of deep convolutional neural networks (CNNs) in transfer learning scenarios, particularly when the target domain dataset is small. The authors address two main shortcomings of traditional weight-regularization methods for CNNs: valuable knowledge from the pre-trained model can be lost when regularization is too weak, and the model can be over-constrained and underperform when regularization is too strict.

Theoretical Foundations and Methodological Advances

DELTA aims to preserve the semantic value of the pre-trained network's feature maps instead of merely anchoring network weights. The framework uses supervised attention to select which feature maps should be aligned between the pre-trained and target networks. Specifically, DELTA computes the distance between feature maps generated by the source and target networks, reinforcing those with higher discriminative power through supervised attention weights. This amounts to a behavioral regularization strategy: it constrains the outputs (behavior) of the network's outer layers rather than the network's internal weights.
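To make this concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of an attention-weighted feature-map alignment term. The function name delta_feature_regularizer is ours, and the per-channel attention weights are assumed to be supplied as precomputed tensors rather than learned inside the function, as they are in the paper.

```python
import torch

def delta_feature_regularizer(source_maps, target_maps, attention_weights):
    """Attention-weighted feature-map alignment in the spirit of DELTA (a sketch).

    source_maps / target_maps: lists of tensors of shape (N, C, H, W), one per
        regularized outer layer; source maps come from the frozen pre-trained
        network, target maps from the network being fine-tuned.
    attention_weights: list of tensors of shape (C,) holding per-channel weights
        (learned in a supervised manner in the paper; assumed precomputed here).
    """
    reg = torch.zeros((), device=target_maps[0].device)
    for fm_src, fm_tgt, w in zip(source_maps, target_maps, attention_weights):
        # Squared L2 distance per channel, summed over spatial locations -> (N, C)
        diff = (fm_tgt - fm_src.detach()).pow(2).flatten(start_dim=2).sum(dim=2)
        # Re-weight each channel by its attention weight, then average over the batch
        reg = reg + (w.unsqueeze(0) * diff).sum(dim=1).mean()
    return reg
```

During fine-tuning such a term would simply be added to the empirical loss with a trade-off coefficient, e.g. loss = ce_loss + alpha * delta_feature_regularizer(...).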

The paper also delineates how DELTA integrates with existing techniques such as L2 and L2-SP through the starting-point-as-reference (SPAR) strategy: a parameter-based proximal term is balanced against the behavioral term so that the remaining (inner-layer) parameters stay consistent with the pre-trained starting point.
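In symbols, a hedged reconstruction of the combined objective (the notation below is ours, not lifted verbatim from the paper) might read:

$$
\min_{w}\;\sum_{i=1}^{n} L\big(z(x_i, w),\, y_i\big)
\;+\;\alpha \sum_{i=1}^{n}\sum_{j} W_j(x_i, y_i)\,
\big\lVert \mathrm{FM}_j(w, x_i) - \mathrm{FM}_j(w^{*}, x_i) \big\rVert_2^2
\;+\;\beta\,\big\lVert w - w^{*} \big\rVert_2^2 ,
$$

where $w^{*}$ denotes the pre-trained weights, $\mathrm{FM}_j$ the $j$-th regularized feature map, $W_j$ its supervised attention weight, and the last term the SPAR-style proximal penalty (in the L2-SP variant this proximal term covers only the transferred layers, with a plain $L^2$ penalty on the new task-specific layers).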

Experimental Evidence and Implications

Empirical evaluations indicate that DELTA surpasses traditional methods such as L2 and L2-SP on various datasets, including Caltech 256, Stanford Dogs 120, MIT Indoors 67, and others, achieving notable improvements in classification accuracy. The gains are particularly pronounced in fine-grained image categorization, as demonstrated on CUB-200-2011 and Food-101, where DELTA yields top-1 accuracy improvements over other state-of-the-art transfer learning strategies.

The results, highlighted through detailed comparisons, demonstrate DELTA's robustness and its capacity to reuse and re-weight "unactivated" channels (channels that contributed little during pre-training), enhancing model generalization without catastrophic forgetting.

Broader Impact and Future Directions

The success of DELTA motivates further exploration of behavior-based regularization methods in deep transfer learning across different architectures. There remains potential for incorporating more advanced attention mechanisms and for integrating DELTA into larger, more diverse networks. It also points toward algorithms for dynamic transfer scenarios in which the attention weights adapt continually to emerging patterns and datasets.

DELTA's ability to improve transfer learning outcomes holds practical significance in domains with scarce labeled data, presenting implications for applications in commercial and research-focused AI projects. Future research could explore cross-domain adaptation and apply DELTA in new fields like natural language processing and other AI areas reliant on transfer learning frameworks.

Conclusion

DELTA represents a meaningful shift in how transfer learning for CNNs is regularized, constraining feature-map behavior through supervised attention rather than constraining weights directly. Its gains over conventional weight-centric techniques and its consistent improvements in classification accuracy mark a noteworthy progression in transfer learning, with practical implications for AI applications that rely on deep learning.