- The paper presents the Feature Pyramid Transformer (FPT), which transforms a feature pyramid by modeling interactions across both space and scale.
- It introduces three complementary transformers (Self, Grounding, and Rendering) to capture a comprehensive set of feature relationships.
- Experiments show improvements of up to 8.5% in object detection AP and 6.0% in instance segmentation mask AP across several benchmarks.
The paper introduces the Feature Pyramid Transformer (FPT), a method for improving visual recognition by modeling feature interactions across both space and scale. This focus is well motivated, given the established importance of feature interactions in modern vision tasks such as object detection, instance segmentation, and semantic segmentation.
Traditional convolutional neural networks (CNNs) enlarge spatial context through pooling and dilated convolutions, but they do not model explicit interactions between spatial locations across scales. Prior methods either overlook these interactions or address them only partially, which often hurts recognition of objects whose scale varies within an image. FPT addresses this shortfall by enabling feature interaction across both spatial locations and pyramid scales.
Main Contributions
The primary contribution of this work is FPT, a module that transforms any feature pyramid into another pyramid of the same dimensions, enriched with context gathered from both spatial and cross-scale interactions. FPT combines three specialized transformers (a minimal illustrative sketch follows the list below):
- Self-Transformer (ST): captures relationships between co-occurring objects within a single feature level, normalizing attention with a Mixture of Softmaxes (MoS), which the authors argue is more expressive than a single conventional softmax.
- Grounding Transformer (GT): a top-down interaction that grounds the high-level "concept" features into the finer-grained low-level feature maps, defining interaction similarity by Euclidean distance rather than dot product.
- Rendering Transformer (RT): a bottom-up interaction that renders high-level abstract concepts with detailed visual attributes from the low-level maps, implemented through localized channel-wise attention.
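The sketch below is a simplified, illustrative rendition of these three attention flavours, not the authors' code: single-head PyTorch modules over flattened feature maps, with module names, the two-component MoS, and all other hyper-parameters chosen here as assumptions for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfTransformer(nn.Module):
    """Within-level attention normalized with a Mixture of Softmaxes (MoS)."""
    def __init__(self, dim, n_mix=2):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.mix = nn.Linear(dim, n_mix)           # per-query mixture weights pi_m
        self.n_mix = n_mix

    def forward(self, x):                          # x: (B, N, C) with N = H * W
        q, k, v = self.q(x), self.k(x), self.v(x)
        qs = q.chunk(self.n_mix, dim=-1)           # split channels into n_mix parts
        ks = k.chunk(self.n_mix, dim=-1)
        pi = F.softmax(self.mix(x), dim=-1)        # (B, N, n_mix)
        attn = 0
        for m, (qm, km) in enumerate(zip(qs, ks)):
            scores = qm @ km.transpose(1, 2) / qm.shape[-1] ** 0.5
            attn = attn + pi[..., m:m + 1] * F.softmax(scores, dim=-1)  # sum_m pi_m * softmax_m
        return attn @ v                            # (B, N, C), same size as input

class GroundingTransformer(nn.Module):
    """Top-down attention: fine-level queries attend to coarse-level keys,
    with negative squared Euclidean distance as the similarity."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, low, high):                  # low: (B, N_l, C), high: (B, N_h, C)
        q, k, v = self.q(low), self.k(high), self.v(high)
        sim = -torch.cdist(q, k) ** 2              # (B, N_l, N_h) similarity
        return F.softmax(sim, dim=-1) @ v          # output at the fine level's resolution

class RenderingTransformer(nn.Module):
    """Bottom-up rendering: channel weights derived from low-level statistics
    re-weight the high-level map (a simplified stand-in for the paper's RT)."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, high, low):                  # high: (B, C, H, W), low: (B, C, h, w)
        w = torch.sigmoid(self.fc(low.mean(dim=(2, 3))))   # (B, C) channel weights
        return high * w[:, :, None, None]          # re-weighted high-level map
```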
All three interactions are designed to be computationally efficient, and the transformed feature maps keep the same dimensions as their inputs, so they can be fed directly into various task-specific head networks; a short end-to-end sketch follows.
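The following rough sketch, reusing the illustrative modules above, shows how a single pyramid level could combine the three interactions while preserving its shape. Averaging the interaction outputs is an assumption made here for brevity; the paper instead concatenates them along channels before re-arranging.

```python
import torch

def fpt_level(i, pyramid, st, gt, rt):
    """Transform level i of `pyramid` (ordered fine -> coarse) into a map of
    identical shape, mixing self, grounding, and rendering interactions."""
    B, C, H, W = pyramid[i].shape
    flat = lambda x: x.flatten(2).transpose(1, 2)     # (B, C, h, w) -> (B, h*w, C)
    outs = [st(flat(pyramid[i]))]                     # self interaction at level i
    for j in range(i + 1, len(pyramid)):              # coarser levels: grounding (top-down)
        outs.append(gt(flat(pyramid[i]), flat(pyramid[j])))
    out = torch.stack(outs).mean(0).transpose(1, 2).reshape(B, C, H, W)
    for j in range(i):                                # finer levels: rendering (bottom-up)
        out = out + rt(out, pyramid[j])
    return out                                        # same (B, C, H, W) as the input level

# Toy usage: a three-level pyramid keeps its per-level shapes after the transform,
# so any existing detection/segmentation head consumes it unchanged.
st, gt, rt = SelfTransformer(256), GroundingTransformer(256), RenderingTransformer(256)
pyramid = [torch.randn(1, 256, s, s) for s in (32, 16, 8)]
transformed = [fpt_level(i, pyramid, st, gt, rt) for i in range(len(pyramid))]
assert all(t.shape == p.shape for t, p in zip(transformed, pyramid))
```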
Experimental Validation
Extensive experiments demonstrate the efficacy of FPT on multiple datasets, including MS-COCO, Cityscapes, ADE20K, and PASCAL VOC 2012. The results show consistent gains over baseline models and state-of-the-art methods: FPT achieves improvements of up to 8.5% in object detection AP and 6.0% in instance segmentation mask AP over previous top-performing methods, indicating robust applicability across visual recognition tasks.
Implications and Future Work
The introduction of FPT has both practical and theoretical implications. Practically, it offers a drop-in module that researchers can apply in different contexts to improve accuracy without drastic architectural changes. Theoretically, it deepens understanding of how contextual interactions across space and scale contribute to recognition, and may motivate further work on cross-scale interaction modeling.
Future research could explore deeper integration of such transformers with other advanced models, for example in 3D recognition or video analysis, where temporal scales also require comprehensive interaction modeling. Examining FPT's suitability for real-time systems and its behavior under different resolutions and configurations could unlock further applications.
In conclusion, the FPT marks a significant stride in visual recognition, enhancing feature interaction methods and potentially paving the way for subsequent innovations in computer vision research.