Context-Aware Aerial Object Detection: Leveraging Inter-Object and Background Relationships

Published 5 Apr 2024 in cs.CV and cs.LG | (2404.04140v2)

Abstract: In most modern object detection pipelines, the detection proposals are processed independently given the feature map. Therefore, they overlook the underlying relationships between objects and the surrounding background, which could have provided additional context for accurate detection. Because aerial imagery is almost orthographic, the spatial relations in image space closely align with those in the physical world, and inter-object and object-background relationships become particularly significant. To address this oversight, we propose a framework that leverages the strengths of Transformer-based models and Contrastive Language-Image Pre-training (CLIP) features to capture such relationships. Specifically, Building on two-stage detectors, we treat Region of Interest (RoI) proposals as tokens, accompanied by CLIP Tokens obtained from multi-level image segments. These tokens are then passed through a Transformer encoder, where specific spatial and geometric relations are incorporated into the attention weights, which are adaptively modulated and regularized. Additionally, we introduce self-supervised constraints on CLIP Tokens to ensure consistency. Extensive experiments on three benchmark datasets demonstrate that our approach achieves consistent improvements, setting new state-of-the-art results with increases of 1.37 mAP${50}$ on DOTA-v1.0, 5.30 mAP${50}$ on DOTA-v1.5, 2.30 mAP${50}$ on DOTA-v2.0 and 3.23 mAP${50}$ on DIOR-R.