UniFormer: Unifying Convolution and Self-attention for Visual Recognition (2201.09450v3)

Published 24 Jan 2022 in cs.CV

Abstract: It is a challenging task to learn discriminative representation from images and videos, due to large local redundancy and complex global dependency in these visual data. Convolution neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, the limited receptive field makes it hard to capture global dependency. Alternatively, ViTs can effectively capture long-range dependency via self-attention, while blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which can seamlessly integrate the merits of convolution and self-attention in a concise transformer format. Different from the typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity respectively in shallow and deep layers, allowing to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our UniFormer blocks into a new powerful backbone, and adopt it for various vision tasks from image to video domain, from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it can simply achieve state-of-the-art performance in a broad range of downstream tasks, e.g., it obtains 82.9/84.8 top-1 accuracy on Kinetics-400/600, 60.9/71.2 top-1 accuracy on Sth-Sth V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. We further build an efficient UniFormer with 2-4x higher throughput. Code is available at https://github.com/Sense-X/UniFormer.

Analyzing "UniFormer: Unifying Convolution and Self-attention for Visual Recognition"

The paper introduces UniFormer, an architectural framework that integrates the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for visual recognition. It addresses two pivotal challenges in visual data representation: local redundancy and global dependency.

Core Contributions

UniFormer is designed to seamlessly integrate convolution and self-attention into a cohesive transformer framework. By doing so, it aims to leverage the benefits of both architectures:

  • Local Redundancy Reduction: The convolution-style component aggregates features within a small neighborhood, suppressing the heavy local redundancy of high-resolution features in shallow layers at low computational cost.
  • Global Dependency Modeling: The self-attention component captures long-range dependencies in deeper layers, which is crucial for modeling complex global interactions in visual data (a condensed formulation of both operators follows this list).
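
Both properties come from a single operator family: the paper frames each block's token mixing as a Multi-Head Relation Aggregator (MHRA) whose behavior is determined entirely by its token-affinity matrix. A condensed rendering of that formulation, with notation lightly simplified from the paper, is:

$$
\mathrm{R}_n(\mathbf{X}) = \mathrm{A}_n\,\mathrm{V}_n(\mathbf{X}), \qquad
\mathrm{MHRA}(\mathbf{X}) = \mathrm{Concat}\big(\mathrm{R}_1(\mathbf{X});\,\dots;\,\mathrm{R}_N(\mathbf{X})\big)\,\mathbf{U},
$$

$$
\mathrm{A}_n^{\mathrm{local}}(\mathbf{X}_i, \mathbf{X}_j) = a_n^{\,i-j},\quad j \in \Omega_i,
\qquad
\mathrm{A}_n^{\mathrm{global}}(\mathbf{X}_i, \mathbf{X}_j) =
\frac{\exp\!\big(\mathrm{Q}_n(\mathbf{X}_i)^{\top}\mathrm{K}_n(\mathbf{X}_j)\big)}
     {\sum_{j'} \exp\!\big(\mathrm{Q}_n(\mathbf{X}_i)^{\top}\mathrm{K}_n(\mathbf{X}_{j'})\big)}.
$$

Here $\mathrm{V}_n$ is the value projection of head $n$ and $\mathbf{U}$ fuses the $N$ heads. The local affinity is a learnable parameter that depends only on relative position within a small neighborhood $\Omega_i$, so it behaves like a depthwise convolution kernel, while the global affinity is content-dependent softmax attention over all tokens.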

Design and Architecture

UniFormer’s architecture is built from blocks whose token mixing, the Multi-Head Relation Aggregator (MHRA), comes in a local and a global form depending on layer depth:

  • Local Multi-Head Relation Aggregator (MHRA): Used in the shallower stages, this component restricts token affinity to a small neighborhood with a learnable, position-dependent kernel, mimicking convolution and suppressing local redundancy at low cost.
  • Global Multi-Head Relation Aggregator (MHRA): Deployed in the deeper stages, it computes content-dependent token similarity akin to self-attention in ViTs, letting the model capture long-range dependencies.
  • Dynamic Position Embedding (DPE): Implemented as a depthwise convolution over the token map, DPE injects positional information while accommodating variable input resolutions (a PyTorch-style sketch of the full block follows this list).
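
Taken together, these components form one block: DPE, then MHRA, then an MLP, each wrapped in a residual connection. The PyTorch-style sketch below is a minimal reading of that block design for the 2D image case, not the reference implementation; kernel sizes, head counts, and the uniform normalization are simplifying assumptions (the official code in the linked repository also covers the 3D video case).

```python
import torch
import torch.nn as nn


class LocalMHRA(nn.Module):
    """Local relation aggregator: learnable affinity over a small neighborhood,
    realized as value projection -> depthwise local mixing -> head fusion."""
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.value = nn.Conv2d(dim, dim, 1)                                # V_n(X)
        self.local_mix = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)   # local affinity A_n
        self.proj = nn.Conv2d(dim, dim, 1)                                 # head fusion U

    def forward(self, x):                     # x: (B, C, H, W)
        return self.proj(self.local_mix(self.value(x)))


class GlobalMHRA(nn.Module):
    """Global relation aggregator: multi-head self-attention over all tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)             # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)


class UniFormerBlock(nn.Module):
    """DPE -> MHRA -> MLP, each with a residual connection."""
    def __init__(self, dim, use_global=False, mlp_ratio=4):
        super().__init__()
        self.dpe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # DPE as a depthwise conv (3x3 assumed)
        # Normalization simplified to GroupNorm; the paper distinguishes BN/LN per block type.
        self.norm1 = nn.GroupNorm(1, dim)
        self.mhra = GlobalMHRA(dim) if use_global else LocalMHRA(dim)
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.GELU(),
                                 nn.Conv2d(hidden, dim, 1))

    def forward(self, x):
        x = x + self.dpe(x)
        x = x + self.mhra(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    block = UniFormerBlock(dim=64, use_global=False)
    print(block(torch.randn(2, 64, 56, 56)).shape)        # torch.Size([2, 64, 56, 56])
```

Flipping use_global switches the block between convolution-like local aggregation and full self-attention without changing anything else in its structure.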

UniFormer stacks these blocks in a four-stage hierarchy, applying local MHRA in the shallow, high-resolution stages and global MHRA in the deeper, downsampled stages (see the rough cost sketch below). The same backbone is then adapted to a range of vision tasks, from image and video classification to dense prediction such as object detection and semantic segmentation.
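
To make that split concrete, the short script below pushes a 224x224 input through an illustrative four-stage schedule (the block depths and 4x/2x downsampling factors are placeholders, not the paper's exact configurations) and compares, per stage, the number of affinity pairs a local 5x5 aggregator evaluates against full global attention.

```python
# Illustrative four-stage layout: local MHRA where the token grid is large,
# global MHRA once it has been downsampled. Depths below are placeholders.
stages = [
    {"name": "stage1", "downsample": 4, "blocks": 3, "mhra": "local"},
    {"name": "stage2", "downsample": 2, "blocks": 4, "mhra": "local"},
    {"name": "stage3", "downsample": 2, "blocks": 8, "mhra": "global"},
    {"name": "stage4", "downsample": 2, "blocks": 3, "mhra": "global"},
]

side, neighborhood = 224, 5 * 5          # input resolution and a 5x5 local window
for s in stages:
    side //= s["downsample"]
    tokens = side * side
    local_pairs = tokens * neighborhood   # each token compares against its 5x5 window
    global_pairs = tokens * tokens        # each token compares against every token
    print(f'{s["name"]}: {side}x{side} grid, {s["blocks"]} {s["mhra"]} blocks, '
          f'affinity pairs: local={local_pairs:,} global={global_pairs:,}')
```

The quadratic cost of global affinity dominates only in the first two stages, which is exactly where UniFormer keeps aggregation local; by the time global MHRA is applied, the token grid is small enough for full attention to be cheap.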

Empirical Results

The UniFormer exhibits strong performance across several benchmark datasets and tasks:

  • Image Classification: Achieves 86.3% top-1 accuracy on ImageNet-1K without additional training data, competitive with state-of-the-art models.
  • Video Classification: With only ImageNet-1K pre-training, the model reaches 82.9% and 60.9% top-1 accuracy on Kinetics-400 and Something-Something V1 respectively, demonstrating strong temporal modeling.
  • Dense Prediction Tasks: On COCO object detection and ADE20K semantic segmentation, UniFormer achieves 53.8 box AP (46.4 mask AP) and 50.8 mIoU, showcasing versatility across computer vision applications.

Practical and Theoretical Implications

The introduction of UniFormer suggests several implications for future research and practice:

  • Hybrid Architecture Design: The effective combination of convolution and self-attention could inform other domains where both local and global contexts are vital.
  • Efficiency and Performance Trade-offs: By addressing both redundancy and dependency, UniFormer could inspire new approaches to designing efficient architectures for resource-constrained environments.
  • Extensibility: The flexible stacking strategy proposed offers a blueprint for future models to dynamically adjust between convolutional and transformer blocks based on task requirements.

Future Directions

Future iterations might explore token-reduction or pruning strategies to further improve throughput without sacrificing accuracy. The approach could also be extended to domains beyond visual data where structured long-range dependencies exist.

In conclusion, the UniFormer paper makes significant strides in unifying critical elements of CNNs and ViTs, fostering improved accuracy and efficiency in visual recognition tasks. Its demonstrated performance across various challenging datasets underscores its potential as a foundational model for diverse computer vision applications.

Authors (8)
  1. Yali Wang (78 papers)
  2. Junhao Zhang (24 papers)
  3. Peng Gao (401 papers)
  4. Guanglu Song (45 papers)
  5. Yu Liu (784 papers)
  6. Hongsheng Li (340 papers)
  7. Yu Qiao (563 papers)
  8. KunChang Li (43 papers)
Citations (297)