
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions (2211.05778v4)

Published 10 Nov 2022 in cs.CV

Abstract: Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.

Citations (521)

Summary

  • The paper introduces DCNv3, an improved deformable convolution operator, to effectively scale CNN architectures and overcome limitations of conventional designs.
  • It achieves state-of-the-art results, including 89.6% top-1 ImageNet accuracy and record-setting COCO (65.4 mAP) and ADE20K (62.9 mIoU) benchmarks.
  • The architecture blends convolutional efficiency with transformer-like adaptive spatial aggregation, inspiring future large-scale vision models.

Analysis of the InternImage Paper

In the paper "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions," the authors introduce a novel approach to CNN-based foundation models. The focus is on exploiting deformable convolutions to overcome the limitations that conventional CNN architectures encounter when scaled to large parameter counts.

Core Operator: Deformable Convolutions

The primary innovation of the InternImage model is the use of deformable convolution as its core operator. This choice addresses critical weaknesses of traditional CNNs by introducing long-range dependencies and adaptive spatial aggregation similar to what is observed in vision transformers (ViTs). The authors start from DCNv2, a modulated deformable convolution variant, and refine it into DCNv3.

The modifications that yield DCNv3, sketched in code after this list, include:

  • Shared Weights: Reduces parameter count and memory footprint by sharing projection weights across sampling points, in the spirit of separable convolutions.
  • Multi-Group Design: As in multi-head attention, multiple groups aggregate spatial information from different representation subspaces, each with its own offsets and modulation scalars.
  • Softmax Normalization: Stabilizes training at scale by normalizing each group's modulation scalars to sum to one over the sampling points, rather than bounding them independently with a sigmoid as in DCNv2.
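
To make these three changes concrete, here is a minimal PyTorch sketch of a DCNv3-style operator built on torchvision's deform_conv2d (a DCNv2-style modulated deformable convolution). The class name, group count, and depthwise/pointwise decomposition are illustrative assumptions; the released InternImage code uses custom CUDA kernels rather than this composition.

```python
# Hypothetical DCNv3-style operator sketch (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

class DCNv3Sketch(nn.Module):
    def __init__(self, channels, kernel_size=3, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.k, self.g = kernel_size, groups
        pts = kernel_size * kernel_size
        # Multi-group design: each group predicts its own offsets and
        # modulation scalars, akin to multi-head attention subspaces.
        self.offset_proj = nn.Conv2d(channels, 2 * groups * pts, 3, padding=1)
        self.mask_proj = nn.Conv2d(channels, groups * pts, 3, padding=1)
        # Zero-init offsets so sampling starts on the regular grid.
        nn.init.zeros_(self.offset_proj.weight)
        nn.init.zeros_(self.offset_proj.bias)
        # Shared weights: one depthwise kernel per channel plus a 1x1
        # pointwise projection, instead of a dense kernel per location.
        self.dw_weight = nn.Parameter(
            torch.randn(channels, 1, kernel_size, kernel_size) * 0.02)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        offset = self.offset_proj(x)
        # Softmax normalization: the k*k modulation scalars of each
        # group sum to one at every spatial location.
        mask = self.mask_proj(x).view(n, self.g, self.k * self.k, h, w)
        mask = F.softmax(mask, dim=2).view(n, -1, h, w)
        out = deform_conv2d(x, offset, self.dw_weight, mask=mask,
                            padding=self.k // 2)
        return self.pw(out)

y = DCNv3Sketch(64)(torch.randn(1, 64, 56, 56))  # -> (1, 64, 56, 56)
```

Note that torchvision's kernel implements DCNv2-style sampling; the softmax over the sampling-point dimension is what distinguishes the DCNv3-style normalization from DCNv2's independent sigmoid modulation.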

Model Architecture and Scaling

InternImage builds its basic block around DCNv3 as the core operator, combining it with components common in transformer designs, such as Layer Normalization, GELU activation, and feed-forward layers (a minimal block sketch follows below). The architecture is governed by stacking and scaling rules that maximize use of the deformable convolution operator, and a crucial aspect of the design is that it scales efficiently to over a billion parameters, an achievement seldom realized in CNN-based models.
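
As a rough illustration of how these components compose, the following sketch arranges the operator above in a transformer-style pre-norm residual block with a GELU feed-forward layer. The exact layout (norm placement, MLP ratio) is an assumption for illustration, not the paper's verbatim design.

```python
# Hypothetical InternImage-style basic block (layout assumed for illustration).
import torch
import torch.nn as nn

class BasicBlockSketch(nn.Module):
    def __init__(self, channels, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.dcn = DCNv3Sketch(channels)  # operator from the sketch above
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * mlp_ratio),
            nn.GELU(),
            nn.Linear(channels * mlp_ratio, channels),
        )

    @staticmethod
    def _ln(norm, x):
        # LayerNorm acts on the channel dim, so go channels-last and back.
        return norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    def forward(self, x):                        # x: (N, C, H, W)
        # Pre-norm residual around the deformable operator.
        x = x + self.dcn(self._ln(self.norm1, x))
        # Channels-last for LayerNorm + MLP, then back to NCHW.
        h = self.norm2(x.permute(0, 2, 3, 1))
        return x + self.mlp(h).permute(0, 3, 1, 2)

out = BasicBlockSketch(64)(torch.randn(1, 64, 56, 56))  # shape preserved
```

Stacking such blocks over four downsampling stages, with depths and widths grown according to the paper's scaling rules, is what takes the family from InternImage-T up to the billion-parameter InternImage-H.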

Performance Evaluation

InternImage demonstrates significant performance improvements across multiple key benchmarks:

  • ImageNet: The InternImage-H model achieves a state-of-the-art 89.6% top-1 accuracy on ImageNet, highlighting its capability to effectively leverage vast parameters and training data, comparable to leading ViTs.
  • COCO Object Detection: InternImage sets a new record of 65.4 mAP on COCO test-dev, paired with the DINO detection framework.
  • ADE20K Semantic Segmentation: InternImage-H reaches 62.9 mIoU, further demonstrating its strength in dense prediction tasks.

The model's robustness to translation, rotation, and scale variations is also noteworthy, underscoring its practical utility in real-world scenarios.

Implications and Future Outlook

InternImage not only challenges the perception of CNNs as being inferior to ViTs in handling large-scale models but also reinvigorates interest in convolution-based architectures for foundational models. The significant performance results on several diverse tasks suggest that deformable convolutions can indeed bridge the gap with attention-based models, and even excel in contexts where efficient learning from vast data is needed.

Future work could explore reducing latency and enabling efficient deployment on resource-constrained devices. The proposed architecture may also influence new designs that blend traditional convolutional mechanisms with emerging deep learning architectures.

Conclusion

The paper offers a compelling argument for the ongoing relevance of CNNs in large-scale model research. By leveraging deformable convolutions effectively, InternImage establishes itself as a competitive alternative to the widely popular transformer models, encouraging a balanced exploration of both approaches in the future of foundation model development.