Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration (2211.12735v2)

Published 23 Nov 2022 in cs.CV and cs.AI

Abstract: We propose integrally pre-trained transformer pyramid network (iTPN), towards jointly optimizing the network backbone and the neck, so that transfer gap between representation models and downstream tasks is minimal. iTPN is born with two elaborated designs: 1) The first pre-trained feature pyramid upon vision transformer (ViT). 2) Multi-stage supervision to the feature pyramid using masked feature modeling (MFM). iTPN is updated to Fast-iTPN, reducing computational memory overhead and accelerating inference through two flexible designs. 1) Token migration: dropping redundant tokens of the backbone while replenishing them in the feature pyramid without attention operations. 2) Token gathering: reducing computation cost caused by global attention by introducing few gathering tokens. The base/large-level Fast-iTPN achieve 88.75%/89.5% top-1 accuracy on ImageNet-1K. With 1x training schedule using DINO, the base/large-level Fast-iTPN achieves 58.4%/58.8% box AP on COCO object detection, and a 57.5%/58.7% mIoU on ADE20K semantic segmentation using MaskDINO. Fast-iTPN can accelerate the inference procedure by up to 70%, with negligible performance loss, demonstrating the potential to be a powerful backbone for downstream vision tasks. The code is available at: github.com/sunsmarterjie/iTPN.

Authors (7)
  1. Yunjie Tian (17 papers)
  2. Lingxi Xie (137 papers)
  3. Jihao Qiu (4 papers)
  4. Jianbin Jiao (51 papers)
  5. Yaowei Wang (149 papers)
  6. Qi Tian (314 papers)
  7. Qixiang Ye (110 papers)
Citations (4)

Summary

Integrally Pre-Trained Transformer Pyramid Networks

The paper "Integrally Pre-Trained Transformer Pyramid Network with Token Migration" introduces a novel architecture, the Integrally Pre-Trained Transformer Pyramid Network (iTPN), designed to reduce the transfer gap between the pre-training and fine-tuning phases in vision transformers. This research presents two primary innovations: the establishment of a pre-trained feature pyramid on vision transformers (ViT), and the implementation of multi-stage supervision to the feature pyramid through masked feature modeling (MFM).

Key Contributions and Methodology

The iTPN architecture integrates the network backbone and the feature pyramid (neck) into a single pre-training framework. This integration jointly optimizes both components, reducing the gap between pre-trained representations and downstream tasks such as object detection and semantic segmentation. The authors implement the first pre-trained feature pyramid for ViTs and apply multi-stage supervision with masked feature modeling (MFM), in which a reconstruction loss is imposed at each stage of the feature pyramid during pre-training.
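
To make the structure concrete, here is a minimal PyTorch sketch, not the authors' implementation: the class name `ViTWithPyramidNeck`, the tiny dimensions, and the random stand-in teacher features are all illustrative assumptions. Only the overall shape mirrors the description above: a plain ViT backbone, a lightweight multi-scale neck, and a masked-feature reconstruction loss summed over every pyramid stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViTWithPyramidNeck(nn.Module):
    """Plain ViT encoder plus a lightweight neck producing three feature scales."""

    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3):
        super().__init__()
        self.grid = img_size // patch                        # 14x14 token grid
        self.patch_embed = nn.Conv2d(3, dim, patch, patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.up = nn.ConvTranspose2d(dim, dim, 2, 2)         # finer pyramid level
        self.down = nn.Conv2d(dim, dim, 2, 2)                # coarser pyramid level

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, C)
        tokens = self.blocks(tokens)
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        return [self.up(feat), feat, self.down(feat)]              # multi-scale maps


def multi_stage_mfm_loss(pyramid_feats, teacher_feats, mask):
    """Masked-feature reconstruction loss summed over every pyramid stage."""
    total = 0.0
    for pred, target in zip(pyramid_feats, teacher_feats):
        # Resize the masked-patch map to this stage's resolution; only masked
        # positions contribute to the reconstruction loss.
        m = F.interpolate(mask[:, None].float(), size=pred.shape[-2:], mode="nearest")
        total = total + ((pred - target) ** 2 * m).sum() / m.sum().clamp(min=1.0)
    return total


if __name__ == "__main__":
    model = ViTWithPyramidNeck()
    feats = model(torch.randn(2, 3, 224, 224))
    teacher = [torch.randn_like(f) for f in feats]           # stand-in teacher features
    mask = torch.rand(2, 14, 14) > 0.5                       # masked-patch map
    print([f.shape for f in feats], multi_stage_mfm_loss(feats, teacher, mask).item())
```

In the actual framework the reconstruction targets would come from the chosen MFM teacher rather than random tensors, and the backbone would be a full-scale ViT; the sketch only shows where the per-stage supervision attaches.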

The iTPN has been extended to Fast-iTPN by incorporating two flexible designs that reduce memory overhead and accelerate inference (a sketch of both ideas follows the list):

  1. Token Migration: This method involves discarding redundant tokens in the backbone while replenishing them in the feature pyramid without incurring additional attention operations, thereby reducing memory overhead and accelerating computation.
  2. Token Gathering: Introduces a small number of gathering tokens that aggregate global information across all tokens, reducing the computational cost of global attention.
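
The sketch below illustrates both ideas in isolation, under stated assumptions: scoring tokens by feature norm is only an illustrative proxy for redundancy, and the helper names (`migrate_tokens`, `replenish_tokens`, `TokenGathering`) are hypothetical rather than taken from the released code.

```python
import torch
import torch.nn as nn


def migrate_tokens(tokens, keep_ratio=0.5):
    """Pick the indices of tokens to keep in the backbone (the rest are dropped)."""
    score = tokens.norm(dim=-1)                        # (B, N) proxy importance score
    k = int(tokens.shape[1] * keep_ratio)
    return score.topk(k, dim=1).indices                # (B, k) kept-token indices


def replenish_tokens(all_tokens, processed_kept, keep_idx):
    """Write the processed kept tokens back into the full token map for the neck."""
    out = all_tokens.clone()                           # dropped tokens pass through unchanged
    out.scatter_(1, keep_idx[..., None].expand_as(processed_kept), processed_kept)
    return out


class TokenGathering(nn.Module):
    """A few learnable gathering tokens summarize global context via cross-attention."""

    def __init__(self, dim=192, num_gather=4, heads=3):
        super().__init__()
        self.gather = nn.Parameter(torch.zeros(1, num_gather, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        g = self.gather.expand(tokens.shape[0], -1, -1)
        g, _ = self.attn(g, tokens, tokens)            # gathering tokens attend globally
        return torch.cat([g, tokens], dim=1)           # carry global context alongside tokens


if __name__ == "__main__":
    x = torch.randn(2, 196, 192)                       # 14x14 patch tokens
    keep_idx = migrate_tokens(x)
    kept = x.gather(1, keep_idx[..., None].expand(-1, -1, x.shape[-1]))
    kept = kept * 2.0                                  # stand-in for heavy backbone blocks
    full = replenish_tokens(x, kept, keep_idx)
    out = TokenGathering()(full)
    print(full.shape, out.shape)
```

The point of the sketch is the data flow: only the kept tokens pass through the expensive backbone computation, the dropped tokens are written back before the pyramid features are formed, and a handful of gathering tokens carry global context so that full global attention over all tokens can be avoided.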

Performance and Results

The experimental results demonstrate that Fast-iTPN models achieve substantial improvements on several standard vision tasks. The base- and large-level Fast-iTPN models reach top-1 accuracies of 88.75% and 89.5% on ImageNet-1K, respectively. With a 1x training schedule using DINO, they achieve 58.4% and 58.8% box AP on COCO object detection, and 57.5% and 58.7% mIoU on ADE20K semantic segmentation using MaskDINO.

Importantly, Fast-iTPN accelerates inference by up to 70% with negligible performance loss. This underscores its potential as a powerful backbone for downstream visual recognition tasks, offering a favorable trade-off between computational efficiency and accuracy.

Implications and Future Directions

The research highlights the effectiveness of integrally pre-training both the network backbone and the neck, laying the groundwork for future exploration in harmonizing pre-training and downstream optimization processes. The demonstrated improvements in computational efficiency and performance suggest promising directions for developing more versatile and scalable vision transformer architectures.

Future work may focus on further enhancing the scalability and generalizability of the iTPN framework across more diverse datasets and tasks, potentially exploring its applications in real-time systems where computational resources are constrained. Additionally, exploring alternative token manipulation strategies and further refining token importance assessments might offer additional gains in performance and efficiency. In sum, iTPN represents a significant stride toward achieving more cohesive and effective vision transformer architectures, with implications extending across various domains in artificial intelligence research.