- The paper proposes the CARE framework, which revitalizes CNN attentions by leveraging transformers for self-supervised learning.
- It employs a dual-stream design, running CNN and transformer streams concurrently to enhance feature representation.
- Experiments show that CARE-trained CNNs surpass state-of-the-art SSL methods on benchmarks such as ImageNet, reaching higher top-1 accuracy in fewer training epochs.
Overview of "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning"
The paper "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning" introduces a novel framework named CNN Attention REvitalization (CARE) aimed at enhancing the attention mechanisms within Convolutional Neural Network (CNN) encoders leveraged in self-supervised learning (SSL). CNNs have demonstrated competitive performance in image recognition tasks; however, their potential is yet fully harnessed, particularly regarding network attention. CARE proposes an innovative approach utilizing transformers to guide and enhance attentional focus within CNNs, thereby boosting their performance in SSL tasks.
Methodology and Framework
CARE synthesizes the strengths of CNNs and transformers through a dual-stream framework: a CNN stream (C-stream) and a transformer stream (T-stream). The C-stream follows standard SSL configurations, consisting of CNN encoders, projectors, and predictors, while the T-stream pairs transformers with analogously structured projectors and predictors.
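The following minimal PyTorch sketch illustrates one way this dual-stream layout could look. It assumes a ResNet-50 backbone, two-layer MLP heads, and a small transformer encoder operating on the CNN feature map; all module names, sizes, and depths here are illustrative placeholders, not the paper's exact architecture or hyper-parameters.

```python
# Minimal dual-stream sketch (illustrative only; sizes and depths are assumptions,
# not the paper's configuration).
import torch
import torch.nn as nn
import torchvision


def mlp(in_dim, hidden_dim=4096, out_dim=256):
    """Projector/predictor head used by both streams (hypothetical sizes)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )


class CStream(nn.Module):
    """CNN stream: backbone encoder -> projector -> predictor."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        # Drop the average pool and classifier so the spatial feature map stays available.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.projector = mlp(2048)
        self.predictor = mlp(256, out_dim=256)

    def forward(self, x):
        feat_map = self.encoder(x)              # (B, 2048, H, W) spatial CNN features
        pooled = feat_map.mean(dim=(2, 3))      # global average pooling
        return self.predictor(self.projector(pooled)), feat_map


class TStream(nn.Module):
    """Transformer stream: attends over the CNN feature map, then its own heads."""

    def __init__(self, feat_dim=2048, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.projector = mlp(feat_dim)
        self.predictor = mlp(256, out_dim=256)

    def forward(self, feat_map):
        tokens = feat_map.flatten(2).transpose(1, 2)      # (B, H*W, C) spatial tokens
        attended = self.transformer(tokens).mean(dim=1)   # attention-modulated, pooled
        return self.predictor(self.projector(attended))


# Toy forward pass: the T-stream consumes the C-stream's feature map.
x = torch.randn(2, 3, 224, 224)
c_stream, t_stream = CStream(), TStream()
c_out, feat_map = c_stream(x)
t_out = t_stream(feat_map)
print(c_out.shape, t_out.shape)   # torch.Size([2, 256]) torch.Size([2, 256])
```

The pooling choices and transformer depth are design placeholders; the point of the sketch is simply that the T-stream operates on, and is differentiable through, the CNN encoder's features, so gradients from the transformer heads can shape the encoder's attention.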
A key element of the CARE framework is that the SSL task is performed concurrently in both streams. The T-stream operates on the feature maps produced by the CNN encoders, modulating them through transformer attention, and its outputs are used to supervise the outputs of the C-stream. This attention-centric supervision encourages the CNN encoders to mimic the attention richness typically observed in transformers, so that under the CARE framework they learn to attend more effectively to the relevant features within images.
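To make this supervision concrete, here is a hedged sketch of how the T-stream output might supervise the C-stream alongside each stream's own cross-view SSL loss, assuming a BYOL-like setup with momentum targets for each stream (an assumption the summary above does not spell out). The negative-cosine form, the stop-gradient placement, and the weighting factor `lambda_t` are illustrative choices; the paper's exact objective and coefficients may differ.

```python
# Hedged sketch of the training objective (loss form and weighting are assumptions).
import torch
import torch.nn.functional as F


def cosine_loss(pred, target):
    """Negative cosine similarity with a detached (stop-gradient) target, BYOL-style."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target.detach(), dim=-1)
    return 2 - 2 * (pred * target).sum(dim=-1).mean()


def care_style_loss(c_pred, c_target, t_pred, t_target, lambda_t=1.0):
    """Combine each stream's own SSL loss with T-stream supervision of the C-stream.

    c_pred / c_target : online prediction and momentum target of the CNN stream
    t_pred / t_target : online prediction and momentum target of the transformer stream
    lambda_t          : illustrative weight on the attention-supervision term
    """
    loss_c = cosine_loss(c_pred, c_target)    # C-stream cross-view SSL term
    loss_t = cosine_loss(t_pred, t_target)    # T-stream cross-view SSL term
    loss_align = cosine_loss(c_pred, t_pred)  # T-stream output supervises the C-stream
    return loss_c + loss_t + lambda_t * loss_align


# Toy usage with random embeddings standing in for the network outputs above.
outs = [torch.randn(8, 256) for _ in range(4)]
print(care_style_loss(*outs).item())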
Experimental Results
Experiments demonstrate notable performance gains across several visual recognition benchmarks, including image classification, object detection, and semantic segmentation. CNNs trained under the CARE framework consistently surpass counterparts trained with other state-of-the-art SSL methods. This is particularly evident in linear classification on ImageNet, where CARE-trained CNNs achieve higher top-1 accuracy in fewer training epochs than techniques such as BYOL and MoCo.
Implications and Future Directions
The CARE framework's capacity to enrich CNN attention through transformer supervision holds profound implications for both practical applications and theoretical exploration. As CNNs continue to serve as foundational architectures in numerous computer vision tasks, augmenting their performance through enhanced attention provides avenues for refining downstream applications like real-time detection, feature extraction, and autonomous navigation.
From a theoretical standpoint, the CARE framework suggests a promising direction for hybrid architectures, leveraging both CNN and transformer advantages. Future research might explore optimizing computational efficiency, given the additional overhead introduced by maintaining dual streams during SSL training. Moreover, expanding the CARE approach to encompass broader applications beyond vision, possibly aligning with modalities like audio and text, might uncover more versatile self-supervised systems.
In conclusion, the CARE framework represents a significant step toward equipping CNN encoders with more sophisticated attentional capabilities by training them alongside a parallel transformer stream, improving learning and recognition performance across a range of visual tasks.