Divert More Attention to Vision-Language Tracking (2207.01076v1)

Published 3 Jul 2022 in cs.CV

Abstract: Relying on Transformers for complex visual feature learning, object tracking has witnessed a new standard for state-of-the-art (SOTA) performance. However, this advancement is accompanied by larger training data and longer training periods, making tracking increasingly expensive. In this paper, we demonstrate that reliance on Transformers is not necessary and that pure ConvNets remain competitive, and even better, more economical and more accessible, in achieving SOTA tracking. Our solution is to unleash the power of multimodal vision-language (VL) tracking, simply using ConvNets. The essence lies in learning novel unified-adaptive VL representations with our modality mixer (ModaMixer) and asymmetrical ConvNet search. We show that our unified-adaptive VL representation, learned purely with ConvNets, is a simple yet strong alternative to Transformer visual features, improving a CNN-based Siamese tracker by 14.5% in SUC on the challenging LaSOT benchmark (50.7% → 65.2%) and even outperforming several Transformer-based SOTA trackers. Beyond empirical results, we theoretically analyze our approach to evidence its effectiveness. By revealing the potential of VL representation, we expect the community to divert more attention to VL tracking and hope to open more possibilities for future tracking beyond Transformers. Code and models will be released at https://github.com/JudasDie/SOTS.

Citations (39)

Summary

  • The paper presents a ConvNet-based VL tracking framework using ModaMixer to integrate language with visual features, achieving SOTA results.
  • It introduces an asymmetrical search strategy to design distinct template and search architectures, thereby reducing computational costs compared to Transformers.
  • Empirical and theoretical analyses reveal a 14.5% SUC improvement on LaSOT and robust performance across benchmarks, validating its efficiency.

Overview of "Divert More Attention to Vision-Language Tracking"

This paper introduces a novel approach to object tracking by leveraging multimodal vision-language (VL) representations, demonstrating the effectiveness of using convolutional neural networks (ConvNets) over the commonly employed Transformers. The authors propose a method that maintains state-of-the-art (SOTA) performance while reducing training resource requirements.

Key Contributions

The paper dispels the notion that complex Transformer-based architectures are essential for SOTA object tracking. It introduces a framework utilizing a unified-adaptive VL representation exclusively through ConvNets, which is both cost-effective and competitive.

  1. ModaMixer for Multimodal Integration: The paper presents ModaMixer, a module that reweights visual feature channels using language representations. This integration occurs at multiple semantic depths, enhancing both robustness and discriminative capability (a minimal sketch of the idea follows this list).
  2. Asymmetrical Network Search: Introducing an asymmetrical searching strategy (ASS), the authors design distinct architectures for the template and search branches, tailoring how the mixed-modality features are processed. This NAS-based approach adapts each branch's network without the extensive computational costs associated with traditional search methods (see the second sketch after this list).
  3. Empirical and Theoretical Validation: The ConvNet-based VL tracker shows significant improvements in tracking capability, surpassing many Transformer-based models. Additionally, a theoretical analysis supports the efficacy of the multimodal representation and asymmetrical design.
  4. Robust Performance Across Benchmarks: The proposed VL tracker exhibits robust performance across multiple benchmarking datasets, evidencing its broad applicability and potential for real-world deployment.
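The paper's exact ModaMixer formulation is given in the original work; what follows is only a minimal PyTorch sketch of the general idea summarized in item 1: a sentence embedding is projected to per-channel weights that gate a convolutional feature map. The module name, the 768-dimensional language embedding, and the 256 visual channels are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ModaMixerSketch(nn.Module):
    """Illustrative channel-reweighting mixer: language embedding -> channel gates."""

    def __init__(self, lang_dim: int = 768, vis_channels: int = 256):
        super().__init__()
        # Project the sentence embedding to one gate per visual channel.
        self.gate = nn.Sequential(
            nn.Linear(lang_dim, vis_channels),
            nn.Sigmoid(),  # gates in (0, 1)
        )

    def forward(self, vis_feat: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) visual feature map; lang_emb: (B, lang_dim).
        weights = self.gate(lang_emb)                  # (B, C)
        weights = weights.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return vis_feat * weights                      # channel-wise reweighting


# Toy usage: mix a pooled sentence embedding into a search-region feature map.
mixer = ModaMixerSketch()
vis = torch.randn(2, 256, 31, 31)   # e.g. search-branch features
lang = torch.randn(2, 768)          # e.g. pooled language embedding
mixed = mixer(vis, lang)
print(mixed.shape)                  # torch.Size([2, 256, 31, 31])
```

Applying such a mixer at several backbone stages, with a stage-specific projection each time, is one plausible reading of the "various semantic depths" described above.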
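The asymmetrical search of item 2 can likewise be pictured as a differentiable-NAS setup in which the template and search branches keep independent architecture parameters, so the two branches may settle on different operators. The candidate operations, cell depth, and how the paper actually couples the branches are assumptions here; this sketch only illustrates "asymmetric" in the sense of separately learned architecture weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_candidates(channels: int) -> nn.ModuleList:
    # Hypothetical candidate operations for one searchable layer.
    return nn.ModuleList([
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.Conv2d(channels, channels, kernel_size=5, padding=2),
        nn.Identity(),
    ])


class MixedOp(nn.Module):
    """Softmax-weighted sum over candidate ops, weighted by architecture params."""

    def __init__(self, channels: int):
        super().__init__()
        self.ops = make_candidates(channels)
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))


# Asymmetry: the template and search branches own separate architecture
# parameters, so after the search they can be discretized into different ops.
channels = 64
template_layer = MixedOp(channels)
search_layer = MixedOp(channels)

z = torch.randn(1, channels, 15, 15)   # template features
x = torch.randn(1, channels, 31, 31)   # search-region features
print(template_layer(z).shape, search_layer(x).shape)
```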

Numerical Results

A remarkable outcome of this paper is the 14.5% absolute improvement in SUC on the LaSOT dataset achieved by a purely CNN-based Siamese tracker once integrated with the ModaMixer and ASS, outperforming several Transformer-based competitors. This result, alongside strong performance on other datasets such as LaSOT_Ext, TNL2K, and OTB99-LANG, substantiates the claim regarding the potential of ConvNets in VL tracking.
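To make explicit how the headline number is reported, based on the abstract's 50.7% → 65.2% figures, the gain is an absolute one in success score:

```latex
\Delta_{\mathrm{SUC}} = 65.2\% - 50.7\% = 14.5 \text{ points (absolute)},
\qquad
\frac{14.5}{50.7} \approx 28.6\% \text{ relative improvement}.
```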

Implications and Future Prospects

The implications of this research extend to both practical applications and theoretical explorations within AI. By setting a precedent for effective multimodal learning without reliance on Transformers, this work encourages the community to explore more efficient architectures that balance performance with computational feasibility. As AI evolves, the insights from this paper could inform the development of models requiring less data and computational power while maintaining high accuracy.

Future developments may focus on enhancing VL representation learning, exploring broader datasets, and integrating more sophisticated LLMs to further refine the balance between vision and linguistic inputs.

In summary, the paper demonstrates that ConvNets, coupled with strategic network design and multimodal interaction, can redefine SOTA standards in vision-language tracking. This work serves as a catalyst for more explorations into multimodal architectures that prioritize efficiency alongside effectiveness.
