Analysis of "End-to-End Object Detection with Adaptive Clustering Transformer"
This paper introduces an innovative approach to object detection that enhances the efficiency of the Detection Transformer (DETR) framework through the integration of an Adaptive Clustering Transformer (ACT). The proposed method focuses on reducing the computational demands typically associated with the self-attention mechanism in the transformer encoder by employing adaptive clustering techniques.
The paper begins by addressing a key limitation of the existing DETR framework: the self-attention in its transformer encoder scales quadratically with the number of image feature tokens, which makes high-resolution inputs particularly expensive. DETR simplifies the object detection pipeline by eliminating hand-crafted components such as anchor generation and non-maximum suppression, but at substantial computational cost. The authors propose ACT to reduce this cost without requiring retraining of the model.
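To make the bottleneck concrete, here is a minimal NumPy sketch of vanilla self-attention (not DETR's actual implementation): the N×N score matrix is what grows quadratically with the number of encoder tokens N, so a high-resolution feature map quickly dominates compute and memory.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Vanilla single-head self-attention over N tokens of dimension d."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # shape (N, N): the O(N^2) bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 1900, 64  # ~2k tokens is typical for a high-resolution DETR feature map
x = rng.standard_normal((n, d))
w = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # the intermediate score matrix held n*n = 3.61M entries
```

Doubling the input resolution roughly quadruples N, so the score matrix grows sixteen-fold; this is the scaling ACT targets.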
Key Contributions
- Adaptive Clustering Transformer (ACT): ACT replaces the original self-attention module in the DETR encoder. It adaptively clusters the query features using Locality Sensitive Hashing (LSH) and approximates the expensive query-key interactions with prototype-key interactions. This reduces the complexity from O(N²) to O(NK), where N is the number of queries and K is the number of prototypes (K ≪ N).
- Reduction in Computational Cost: Applying ACT reduces the FLOPs from 73.4 GFLOPs to 58.2 GFLOPs, at the cost of only 0.7% AP. This efficiency is achieved without additional training or modifications to the fundamental structure of the original transformer, which simplifies deployment.
- Seamless Integration with Multi-Task Knowledge Distillation (MTKD): MTKD further closes the accuracy gap introduced by the attention approximation. With MTKD, the AP loss shrinks to 0.2%, showing that ACT can retain near-original performance at reduced cost.
- Experimental Results: The proposed method demonstrates competitive performance across various object sizes, achieving comparable or superior results compared to Faster R-CNN with significantly lower computational overhead. The analysis of queries and prototype clusters showcases ACT's ability to maintain semantic integrity and selectivity in clustering.
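The clustering-plus-prototype idea above can be sketched in a few lines of NumPy. This is a simplified stand-in, not the paper's method: it uses a single fixed-granularity round of random-hyperplane LSH (ACT's hashing is adaptive and multi-round), mean queries as prototypes, and the function names are invented for illustration.

```python
import numpy as np

def lsh_cluster(q, n_hashes=8, seed=0):
    """Assign each query a bucket id via random-hyperplane LSH signatures.
    Nearby queries tend to share sign patterns, hence buckets."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((q.shape[1], n_hashes))
    return (q @ planes > 0) @ (1 << np.arange(n_hashes))  # int bucket id per query

def prototype_attention(q, k, v):
    """Approximate attention: all queries in a cluster share one prototype,
    so the score matrix is (K, N) instead of (N, N), i.e. O(NK) not O(N^2)."""
    codes = lsh_cluster(q)
    buckets, inverse = np.unique(codes, return_inverse=True)
    # Prototype = mean of the queries in each cluster.
    protos = np.stack([q[inverse == i].mean(axis=0) for i in range(len(buckets))])
    scores = protos @ k.T / np.sqrt(k.shape[-1])           # (K, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    proto_out = weights @ v                                 # (K, d)
    return proto_out[inverse]                               # broadcast back to all N queries

rng = np.random.default_rng(1)
n, d = 512, 64
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = prototype_attention(q, k, v)
print(out.shape)  # (512, 64), computed with K << 512 prototype-key scores
```

Because the number of occupied LSH buckets adapts to how the queries are distributed, densely clustered (semantically similar) queries are merged aggressively while distinctive ones keep their own prototypes, which matches the selectivity the paper reports in its cluster analysis.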
Implications and Future Directions
The advancements presented in this paper hold considerable promise for the future of end-to-end object detection frameworks. By reducing the computational burden without sacrificing performance, ACT enhances the feasibility of deploying transformer-based detection systems in resource-constrained environments. The adaptive clustering strategy can be further explored to improve other transformer architectures, potentially benefiting a broad range of computer vision applications.
While ACT effectively addresses inference costs, future research could investigate its application in accelerating the training process. Additionally, the integration of ACT with multi-scale feature networks like the Feature Pyramid Network (FPN) could unlock new capabilities in cross-scale information fusion. Such developments could spur more efficient and versatile detection systems, further bridging the gap between state-of-the-art performance and practical deployment challenges.
In conclusion, the paper contributes significantly to the field of computer vision by proposing a method that optimizes the trade-off between computational efficiency and detection accuracy in transformer-based object detection. The results and methodologies herein provide a solid foundation for future research aimed at enhancing the scalability and applicability of deep learning-based object detection systems.