HiFT: Hierarchical Feature Transformer for Aerial Tracking (2108.00202v3)

Published 31 Jul 2021 in cs.CV and cs.RO

Abstract: Most existing Siamese-based tracking methods execute the classification and regression of the target object based on the similarity maps. However, they either employ a single map from the last convolutional layer which degrades the localization accuracy in complex scenarios or separately use multiple maps for decision making, introducing intractable computations for aerial mobile platforms. Thus, in this work, we propose an efficient and effective hierarchical feature transformer (HiFT) for aerial tracking. Hierarchical similarity maps generated by multi-level convolutional layers are fed into the feature transformer to achieve the interactive fusion of spatial (shallow layers) and semantics cues (deep layers). Consequently, not only the global contextual information can be raised, facilitating the target search, but also our end-to-end architecture with the transformer can efficiently learn the interdependencies among multi-level features, thereby discovering a tracking-tailored feature space with strong discriminability. Comprehensive evaluations on four aerial benchmarks have proven the effectiveness of HiFT. Real-world tests on the aerial platform have strongly validated its practicability with a real-time speed. Our code is available at https://github.com/vision4robotics/HiFT.

Summary

The paper introduces a novel Hierarchical Feature Transformer that fuses multi-level CNN features with attention for robust aerial tracking.
It outperforms state-of-the-art methods on four aerial tracking benchmarks, excelling in accuracy and real-time efficiency.
The approach effectively addresses challenges like fast motion, low resolution, and occlusion in UAV-based tracking tasks.

Overview of HiFT: Hierarchical Feature Transformer for Aerial Tracking

The paper "HiFT: Hierarchical Feature Transformer for Aerial Tracking" by Ziang Cao et al. introduces the Hierarchical Feature Transformer (HiFT), a method for improving visual object tracking. It is tailored to aerial tracking and targets challenges common in UAV-based scenarios: fast motion, low resolution, and frequent occlusion.

Methodology and Approach

At the core of this work is the HiFT architecture, which combines hierarchical similarity maps with the attention mechanism of transformers. Conventional Siamese-based trackers either rely on a single similarity map from the last convolutional layer, which degrades localization accuracy in complex scenarios, or use multiple maps separately for decision making, which is too computationally expensive for aerial mobile platforms. HiFT instead generates hierarchical similarity maps from multi-level convolutional layers, so that spatial cues from shallow layers and semantic cues from deep layers are both available.
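To make the idea of hierarchical similarity maps concrete, the following is a minimal sketch in plain Python. It assumes depth-wise cross-correlation between template and search features, a common choice in Siamese trackers; the summary does not state which correlation operator HiFT uses, and the function names `depthwise_xcorr` and `hierarchical_similarity` are hypothetical.

```python
def depthwise_xcorr(search, template):
    """Slide each template channel over the matching search channel.

    Both inputs are nested lists shaped [channels][height][width].
    Assumption: depth-wise, "valid" cross-correlation (no padding).
    """
    maps = []
    for s_ch, t_ch in zip(search, template):
        th, tw = len(t_ch), len(t_ch[0])
        out_h = len(s_ch) - th + 1
        out_w = len(s_ch[0]) - tw + 1
        ch_map = [
            [
                sum(s_ch[y + i][x + j] * t_ch[i][j]
                    for i in range(th) for j in range(tw))
                for x in range(out_w)
            ]
            for y in range(out_h)
        ]
        maps.append(ch_map)
    return maps


def hierarchical_similarity(search_levels, template_levels):
    """Compute one similarity map per backbone level (shallow to deep)."""
    return [depthwise_xcorr(s, t)
            for s, t in zip(search_levels, template_levels)]


# Toy single-channel example: a 3x3 search feature and a 2x2 template.
search = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
template = [[[1, 1], [1, 1]]]
similarity = depthwise_xcorr(search, template)  # [[[12, 16], [24, 28]]]
```

In a real tracker the per-level features would come from different convolutional layers of a shared backbone; here the same toy tensors can be passed for every level just to exercise `hierarchical_similarity`.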

These maps are processed using a hierarchical feature transformer, designed to fuse the spatial details from shallow layers with the semantic information from deep layers. This approach enables the model to capture global contextual information, facilitating effective target localization even in complex environments.
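The fusion step can be illustrated with generic scaled dot-product cross-attention, where tokens from one level query tokens from another. This is only a sketch of the attention mechanism in general, not the HiFT encoder or decoder; which level supplies queries versus keys and values is an assumption here, and `cross_attention` is a hypothetical helper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: list of vectors, e.g. flattened deep-layer map positions
    keys, values: lists of vectors, e.g. flattened shallow-layer positions
    Each output vector is a weighted average of `values`, so every query
    position can draw on information from all positions of the other map.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused


# With identical keys the weights are uniform, so the output is the
# plain mean of the value vectors.
out = cross_attention([[1.0, 0.0]],
                      [[0.0, 0.0], [0.0, 0.0]],
                      [[2.0, 4.0], [4.0, 8.0]])
```

Because every output position attends over all positions of the other map, attention of this kind is one way to obtain the global context the text describes.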

Notably, a feature modulation layer within the transformer structure explores the interdependencies among the multi-level features, which helps the tracker handle low-resolution objects. Consequently, HiFT learns a tracking-tailored feature space with enhanced discriminability and robustness.

Numerical Results and Implications

HiFT is evaluated on four aerial tracking benchmarks, where it outperforms state-of-the-art methods in precision and success rate. It surpasses trackers built on deeper backbones such as ResNet while still running at real-time speed.
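For readers unfamiliar with these metrics, the sketch below shows how precision and success rate are commonly computed in tracking benchmarks. The 20-pixel center-error threshold and the 21 evenly spaced overlap thresholds are widely used conventions and are assumptions here; the summary does not state the exact protocol, and the function names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    """Euclidean distance between box centers, in pixels."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))


def precision(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= threshold for e in errors) / len(errors)


def success_auc(preds, gts, steps=21):
    """Mean success rate over overlap thresholds evenly spaced in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > t for o in overlaps) / len(overlaps)
             for t in thresholds]
    return sum(rates) / len(rates)


# Two frames: one perfect prediction, one that is 30 pixels off.
gts = [(0, 0, 10, 10), (30, 0, 10, 10)]
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
p = precision(preds, gts)  # 0.5
```

Precision rewards getting the target center right, while the success score also penalizes wrong box size, which is why benchmarks usually report both.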

In attribute-based evaluations covering challenges such as low resolution, scale variation, and fast motion, HiFT consistently performs better with notable improvements over the second-best methods. This highlights the model's capability in adapting to diverse aerial tracking conditions, crucial for real-world UAV applications.

Theoretical and Practical Impact

The main theoretical contribution of HiFT is the integration of a transformer with hierarchical feature maps for object tracking. This integration opens new directions in feature fusion and attention mechanisms, and may inspire further work on lightweight model architectures for resource-constrained platforms.

Practically, HiFT advances aerial tracking technology by offering a real-time, efficient solution that suits embedded systems, such as those used in UAVs. Its high performance during real-world tests underscores its applicability in demanding environments, which is vital for tasks such as aerial cinematography and visual localization.

Future Research Directions

Future work could optimize the transformer architecture for an even smaller computational footprint while maintaining accuracy. Transformer-based methods that address additional challenges, such as viewpoint changes and out-of-view events, could further broaden HiFT's applicability. Integrating HiFT with multimodal inputs or improving its resilience to adversarial conditions are also promising avenues of research.

In summary, "HiFT: Hierarchical Feature Transformer for Aerial Tracking" offers a significant contribution to the field of computer vision and UAV tracking, providing a robust and computationally efficient approach that sets a strong foundation for future innovations in aerial tracking technologies.
