Rethinking Transformer-based Set Prediction for Object Detection (2011.10881v2)

Published 21 Nov 2020 in cs.CV and cs.LG

Abstract: DETR is a recently proposed Transformer-based method which views object detection as a set prediction problem and achieves state-of-the-art performance but demands extra-long training time to converge. In this paper, we investigate the causes of the optimization difficulty in the training of DETR. Our examinations reveal several factors contributing to the slow convergence of DETR, primarily the issues with the Hungarian loss and the Transformer cross-attention mechanism. To overcome these issues we propose two solutions, namely, TSP-FCOS (Transformer-based Set Prediction with FCOS) and TSP-RCNN (Transformer-based Set Prediction with RCNN). Experimental results show that the proposed methods not only converge much faster than the original DETR, but also significantly outperform DETR and other baselines in terms of detection accuracy.

Authors (4)
  1. Zhiqing Sun (35 papers)
  2. Shengcao Cao (13 papers)
  3. Yiming Yang (151 papers)
  4. Kris Kitani (96 papers)
Citations (298)

Summary

  • The paper attributes DETR's slow convergence to instability in the Hungarian loss and to the dynamics of Transformer cross-attention.
  • It introduces TSP-FCOS and TSP-RCNN, leveraging FCOS and RCNN techniques to accelerate training and improve detection precision.
  • Experiments on COCO 2017 show these models achieve faster convergence and higher average precision, especially for small objects.

Analysis of "Rethinking Transformer-based Set Prediction for Object Detection"

The paper "Rethinking Transformer-based Set Prediction for Object Detection" provides an in-depth analysis of the DEtection TRansformer (DETR) method, a pioneering approach using Transformer models for object detection by framing the task as a set prediction problem. While DETR has demonstrated state-of-the-art performance, it suffers from prolonged training times, necessitating up to 500 epochs for convergence. The authors aim to diagnose the causes of these optimization challenges and propose two methodologies—TSP-FCOS and TSP-RCNN—to mitigate them and enhance both convergence speed and detection accuracy.

Key Findings and Proposed Solutions

The authors identify two primary factors impeding the optimization of DETR: the Hungarian loss and the Transformer's cross-attention mechanism. The Hungarian loss assigns ground-truth objects to predictions through set-level bipartite matching; early in training, when predictions are still noisy, this matching is unstable and the targets assigned to each prediction can change from iteration to iteration.
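
To make the matching step concrete, the following is a minimal sketch of DETR-style bipartite matching built on SciPy's linear_sum_assignment. The cost terms and their weights are illustrative assumptions (DETR's full matching cost also includes a GIoU term); this is not the implementation from either paper.

```python
# Minimal sketch of Hungarian (bipartite) matching for set prediction.
# Cost terms and weights are illustrative assumptions, not the paper's code.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    cls_weight=1.0, l1_weight=5.0):
    """Match N predictions to M ground-truth objects (N >= M).

    pred_logits: (N, C) class logits      pred_boxes: (N, 4) in cxcywh
    gt_labels:   (M,)  class indices      gt_boxes:   (M, 4) in cxcywh
    Returns (pred_idx, gt_idx) minimizing the total matching cost.
    """
    probs = pred_logits.softmax(-1)                   # (N, C)
    cls_cost = -probs[:, gt_labels]                   # (N, M): high prob -> low cost
    l1_cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M) box distance
    cost = cls_weight * cls_cost + l1_weight * l1_cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx
```

Because early predictions are noisy, the assignment that minimizes this cost can flip between training iterations, so each prediction's classification and regression targets keep changing; this is the instability the authors point to.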

To address this, the paper introduces two new models:

  1. TSP-FCOS (Transformer-based Set Prediction with FCOS): Building on the one-stage detector FCOS, this model uses a novel Feature of Interest (FoI) selection mechanism to choose which multi-level features are fed to the Transformer encoder. It also replaces unconstrained bipartite matching with a restricted, cost-based assignment scheme, which accelerates convergence (a sketch of this idea follows the list).
  2. TSP-RCNN (Transformer-based Set Prediction with RCNN): Borrowing the two-stage design of Faster R-CNN, this model uses a Region Proposal Network (RPN) to generate proposals whose features are then refined by the Transformer, which enhances detection precision, especially for small objects.
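
As a concrete, hypothetical illustration of the restricted, cost-based scheme, the sketch below masks out disallowed prediction-to-ground-truth pairs before solving the assignment, assuming a rule in the spirit of TSP-FCOS where a feature point may only be matched to a ground truth whose box contains it (TSP-RCNN applies an analogous restriction based on proposal overlap). The helper names and the exact rule are assumptions for illustration, not the paper's code.

```python
# Hypothetical sketch of restricted, cost-based matching: forbid pairs that
# violate a spatial rule, then solve the assignment on the masked cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 1e6  # cost that effectively forbids a pairing

def restricted_match(cost, allowed):
    """cost: (N, M) matching costs; allowed: (N, M) boolean mask of legal pairs."""
    masked = np.where(allowed, cost, BIG)
    pred_idx, gt_idx = linear_sum_assignment(masked)
    keep = allowed[pred_idx, gt_idx]     # drop matches forced onto forbidden pairs
    return pred_idx[keep], gt_idx[keep]

def points_inside_boxes(points, boxes):
    """points: (N, 2) feature-point centers; boxes: (M, 4) as x1, y1, x2, y2."""
    x, y = points[:, None, 0], points[:, None, 1]
    return ((x >= boxes[None, :, 0]) & (x <= boxes[None, :, 2]) &
            (y >= boxes[None, :, 1]) & (y <= boxes[None, :, 3]))
```

Restricting the feasible pairs keeps each ground truth tied to spatially plausible candidates, which is consistent with the faster, more stable convergence the paper reports.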

Experimental results on the COCO 2017 benchmark confirm that both TSP-FCOS and TSP-RCNN converge considerably faster than the original DETR. They also deliver higher average precision, particularly under shorter training schedules where DETR has not yet converged.

Challenges and Comparisons

The analysis underscores the role of the cross-attention module in Transformer-based detectors: its attention maps grow increasingly sparse during training and are slow to stabilize, which the authors identify as a significant factor in DETR's slow convergence. In an ablation with an encoder-only Transformer (no cross-attention), they observe a substantial improvement on small objects and a modest improvement on medium-sized objects, at the cost of reduced performance on large objects.
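
For intuition, here is a minimal PyTorch sketch of an encoder-only set-prediction head of the kind this ablation describes: selected features attend to each other through self-attention only, with no decoder and no object queries. The layer sizes, feed-forward width, and class count are assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an encoder-only set-prediction head (no cross-attention).
# Hyperparameters here are assumptions, not the paper's configuration.
import torch
from torch import nn

class EncoderOnlyHead(nn.Module):
    def __init__(self, d_model=256, num_classes=80, num_layers=6, nhead=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                  # boxes in cxcywh

    def forward(self, features):
        """features: (batch, num_candidates, d_model) selected backbone features."""
        x = self.encoder(features)        # self-attention only, no object queries
        return self.class_head(x), self.box_head(x).sigmoid()
```

Each input feature directly produces one (possibly empty) detection, so the head relies on local evidence plus self-attention, which lines up with the reported gains on small and medium objects and the drop on large ones.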

Compared with other contemporary solutions such as Deformable DETR and UP-DETR, the TSP models not only require fewer training epochs but also surpass DETR on key detection metrics. When enhanced with iterative refinement, TSP-RCNN approaches the accuracy of more computationally intensive models, making it a strong alternative when both training efficiency and detection accuracy matter.

Future Directions and Implications

The research provides an efficient blueprint for deploying Transformers in object detection while retaining the benefits of faster convergence and improved precision. Future work could explore sparse attention mechanisms, which may yield further gains by refining how relationships among multi-level features are modeled without increasing computational overhead.

By demonstrating the practical and theoretical value of rethinking Transformer-based set prediction for object detection, the paper suggests that similar insights could refine model architectures for other prediction tasks in deep learning, improving their effectiveness and broadening their applicability.

Overall, this study sharpens our understanding of the training dynamics of Transformer-based detectors and points to practical paths for deploying them efficiently and accurately across varied computational budgets.