- The paper introduces AutoMatch, which automatically searches for optimal feature fusion operators via Binary Channel Manipulation to enhance Siamese tracking.
- It proposes six alternative matching operators that replace the traditional cross-correlation, each suited to different tracking challenges.
- Experiments demonstrate a gain of over 4 points on OTB100 with improved efficiency, using less than half the training data and time of the baseline tracker.
Essay on "Learn to Match: Automatic Matching Network Design for Visual Tracking"
The paper "Learn to Match: Automatic Matching Network Design for Visual Tracking" presents a novel framework, AutoMatch, for optimizing matching network design in Siamese visual tracking. The authors scrutinize the prevailing reliance on expert-derived heuristic designs, challenging this approach by introducing a systematic search method for optimal matching networks.
Key Contributions and Methodology
Visual tracking has been significantly advanced by Siamese networks, in which cross-correlation has long been the default operator for measuring similarity between the template and the search region. Noting its limitations, the authors propose six alternative matching operators: Concatenation, Pointwise-Addition, Pairwise-Relation, FiLM, Simple-Transformer, and Transductive-Guidance. Unlike cross-correlation, which computes a similarity score, these operators fuse template and search features. Because each operator copes best with different tracking conditions, the authors hypothesize that combining them can yield a more robust tracker; a sketch of this feature fusion view follows below.
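To make the contrast between similarity computation and feature fusion concrete, here is a minimal PyTorch sketch of two of the listed operators, Pointwise-Addition and FiLM. This is an illustrative reimplementation, not the authors' code; the tensor shapes, module names, and the 1x1 convolutions are assumptions.

```python
import torch
import torch.nn as nn

class PointwiseAddition(nn.Module):
    """Fuse features by broadcasting the pooled template over the search map."""
    def forward(self, z, x):
        # z: (B, C, 1, 1) pooled template feature; x: (B, C, H, W) search feature
        return x + z

class FiLM(nn.Module):
    """Feature-wise linear modulation: the template predicts a per-channel
    scale (gamma) and shift (beta) applied to the search feature."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Conv2d(channels, channels, kernel_size=1)
        self.beta = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, z, x):
        # gamma(z) and beta(z) are (B, C, 1, 1) and broadcast over x.
        return self.gamma(z) * x + self.beta(z)

# Illustrative shapes: template pooled to 1x1, search map 31x31.
z = torch.randn(2, 256, 1, 1)
x = torch.randn(2, 256, 31, 31)
fused = FiLM(256)(z, x)  # -> (2, 256, 31, 31): a feature map, not a score map
```

Note that where plain cross-correlation collapses the comparison into a similarity response, each of these operators retains a full multi-channel feature map, leaving richer cues for the downstream localization head.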
At the core of the AutoMatch framework is Binary Channel Manipulation (BCM), a search algorithm designed to automatically select and combine these operators. BCM scores each operator's contribution to tracking performance through per-channel decisions, using a differentiable search based on the Gumbel-Softmax relaxation. This narrows the space to an optimal combination of operators without exhaustive manual experimentation; a minimal sketch of the mechanism follows below.
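The following sketch shows how a Gumbel-Softmax relaxation can make binary per-channel keep/drop decisions trainable by gradient descent, in the spirit of BCM as described above. The gate parameterization, class name, and contribution proxy are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryChannelGate(nn.Module):
    """Learn a keep/drop decision for each output channel of one operator.

    Each channel carries two logits (keep, drop). Gumbel-Softmax with
    hard=True emits discrete 0/1 gates in the forward pass while gradients
    flow through the soft relaxation (straight-through estimator).
    """
    def __init__(self, channels, tau=1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(channels, 2))
        self.tau = tau

    def forward(self, feat):
        # feat: (B, C, H, W), the output of one matching operator
        gates = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
        keep = gates[:, 0].view(1, -1, 1, 1)  # hard 0/1 gate per channel
        return feat * keep

    def contribution(self):
        # Expected fraction of kept channels: a proxy for how much this
        # operator contributes to the fused representation.
        return F.softmax(self.logits, dim=-1)[:, 0].mean()
```

In a search of this kind, an operator whose expected kept-channel fraction stays low can be pruned, and the surviving combination retrained; this general pattern is what replaces hand-tuned operator selection.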
Experimental Results
The effectiveness of the proposed approach is evidenced by substantial improvements on benchmarks such as OTB100, LaSOT, and TrackingNet. AutoMatch gains 4.2 points on OTB100 over the baseline tracker Ocean while remaining computationally efficient, achieving these results with less than half of Ocean's training data and training time.
These results suggest that hand-crafted matching operators are not indispensable: a learned combination of feature fusion operators can considerably enhance performance across diverse scenarios. The approach also outperforms some recent leading trackers, including DiMP and KYS, indicating both theoretical and practical advances in the field.
Practical and Theoretical Implications
Practically, the research suggests a shift from manual operator design towards automated, adaptable frameworks for visual tracking, potentially streamlining development workflows and enhancing tracking resilience in varied contexts. This automation holds potential for application beyond visual tracking, possibly influencing broader domains reliant on feature similarity computations.
Theoretically, the paper highlights the potential for feature fusion methodologies to surpass traditional cross-correlation in generating robust, adaptable models. Future research may extend AutoMatch's principles to other visual recognition tasks, examining the transferability and efficacy of learned matching networks in diverse AI applications.
Conclusion
The paper makes a compelling argument for automated design processes in visual tracking, emphasizing flexibility and efficiency. By showcasing the performance gains of AutoMatch, the authors contribute a significant methodological advancement to the domain, opening avenues for further exploration in automated model design and optimization.