Overview of Bi-directional Adapter for Multi-modal Tracking
The paper "Bi-directional Adapter for Multi-modal Tracking" addresses the limitations of single-modal object tracking, which relies on a single imaging sensor (typically RGB) and therefore degrades in complex environments. The authors propose a multi-modal tracking approach that integrates a universal bi-directional adapter into a transformer architecture, adapting dynamically to changing conditions by cross-prompting between modalities, specifically RGB and thermal infrared (TIR). The work centers on parameter-efficient tuning: it reuses pre-trained models and improves tracking performance substantially while adding only a small number of new parameters.
The proposed model, BAT (Bi-directional Adapter for Multi-modal Tracking), uses a dual-stream design in which RGB and TIR data are processed by a shared transformer encoder whose pre-trained parameters remain frozen. Only a lightweight feature adapter is added and trained, which sharply reduces the training burden compared with exhaustive full fine-tuning. The dual-stream design lets BAT draw on complementary features from each modality dynamically, improving tracking accuracy across diverse scenarios.
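The dual-stream idea above can be illustrated with a minimal pure-Python sketch: two modality streams pass through a shared "frozen" encoder step, while a small bottleneck adapter projects each stream's features down and back up and adds the result to the other stream, so each modality prompts the other. The dimensions, the identity stand-in for the encoder, and all function names here are illustrative assumptions, not the paper's actual implementation.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def frozen_encoder(feat):
    """Stand-in for a frozen pre-trained transformer layer (identity here)."""
    return feat

def adapter(feat, down, up):
    """Light bottleneck adapter: linear down-projection, then up-projection."""
    return matvec(up, matvec(down, feat))

def bat_layer(rgb, tir, down, up):
    """One layer: shared frozen encoder plus bi-directional prompt exchange."""
    rgb_enc, tir_enc = frozen_encoder(rgb), frozen_encoder(tir)
    # Each modality receives an adapter-transformed prompt from the other,
    # so neither stream is fixed as dominant or auxiliary.
    rgb_out = add(rgb_enc, adapter(tir, down, up))
    tir_out = add(tir_enc, adapter(rgb, down, up))
    return rgb_out, tir_out

# Toy shapes: feature dim 4, bottleneck dim 2 (both hypothetical).
down = [[0.5, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, 0.5]]
up   = [[1.0, 0.0],
        [0.0, 1.0],
        [1.0, 0.0],
        [0.0, 1.0]]

rgb_out, tir_out = bat_layer([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0], down, up)
print(rgb_out)  # RGB features prompted by TIR
print(tir_out)  # TIR features prompted by RGB
```

Only the `down` and `up` projections would be trained in this scheme; the encoder itself stays untouched, which is what keeps the added parameter count small.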
Key Contributions and Methodology
- Universal Bi-directional Adapter: Central to the paper is a universal bi-directional adapter that fuses prompts between modalities. It moves beyond fixed dominant-auxiliary relationships by determining dynamically, at runtime, which modality provides the more reliable information for tracking at any given moment.
- Parameter Efficiency: The architecture builds on a frozen pre-trained transformer backbone extended with the bi-directional adapter, achieving superior performance with only 0.32M additional trainable parameters. This is notable given that competing methods often tune far more parameters.
- Experimental Validation: Empirical results underscore BAT's efficacy, achieving notable improvements on both the RGBT234 and LasHeR datasets. Specifically, BAT outperformed existing state-of-the-art techniques with MPR and MSR scores of 86.8% and 64.1%, respectively, on RGBT234, and precision (PR) and success rate (SR) scores of 70.2% and 56.3% on LasHeR. The paper details how BAT effectively maintains favorable performance across various scenarios, including low illumination and high occlusion, situations where single-modality trackers typically struggle.
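The ~0.32M trainable-parameter figure is plausible from a back-of-envelope count: with the backbone frozen, only the adapters' projection weights are trained. The numbers below (embedding dimension, bottleneck width, layer count) are illustrative assumptions, not values taken from the paper.

```python
EMBED_DIM = 768        # hypothetical backbone feature dimension
BOTTLENECK_DIM = 8     # hypothetical adapter bottleneck width
NUM_LAYERS = 12        # hypothetical number of adapted encoder layers

def adapter_params(embed_dim, bottleneck_dim):
    """Weights of one linear down-projection plus one up-projection."""
    down = embed_dim * bottleneck_dim
    up = bottleneck_dim * embed_dim
    return down + up

# Two adapters per layer, one for each prompting direction (RGB->TIR, TIR->RGB).
total = 2 * NUM_LAYERS * adapter_params(EMBED_DIM, BOTTLENECK_DIM)
print(f"trainable adapter parameters: {total:,}")  # ~0.29M under these assumptions
```

Even with generous assumptions, the trainable count stays in the low hundreds of thousands, orders of magnitude below full fine-tuning of a transformer backbone with tens of millions of parameters.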
Implications and Future Directions
The implications of this work extend to both the theoretical and practical realms of AI and computer vision. In theory, the research challenges the common paradigm of fixed modality dominance by offering a dynamically adaptable framework. Practically, BAT's parameter efficiency and robust performance mean it could be viable for real-world applications where computational resources are limited but multi-sensor environments are present, such as autonomous vehicles or surveillance systems.
Looking forward, the paper suggests broader applicability: the architecture could be adapted to modalities beyond RGB and TIR. A natural next step is integrating this framework with vision-language models or extending it to environments with richer, more complex sensory inputs.
In summary, "Bi-directional Adapter for Multi-modal Tracking" provides an innovative framework for addressing the limitations of single-modal object tracking by employing a parameter-efficient, dynamically adaptable approach. This paper lays a foundation for future advancements in multi-modal object tracking and sets the stage for further exploration into flexible, scalable sensor fusion techniques in varying environmental conditions.