End-to-end Flow Correlation Tracking with Spatial-temporal Attention

Published 3 Nov 2017 in cs.CV | (1711.01124v4)

Abstract: Discriminative correlation filters (DCF) with deep convolutional features have achieved favorable performance in recent tracking benchmarks. However, most existing DCF trackers only consider appearance features of the current frame and hardly benefit from motion and inter-frame information. The lack of temporal information degrades tracking performance under challenges such as partial occlusion and deformation. In this work, we focus on making use of the rich flow information in consecutive frames to improve the feature representation and the tracking accuracy. First, individual components, including optical flow estimation, feature extraction, aggregation, and correlation filter tracking, are formulated as special layers in the network. To the best of our knowledge, this is the first work to jointly train the flow and tracking tasks in a deep learning framework. The historical feature maps at predefined intervals are then warped and aggregated with the current ones under the guidance of flow. For adaptive aggregation, we propose a novel spatial-temporal attention mechanism. Extensive experiments are performed on four challenging tracking datasets: OTB2013, OTB2015, VOT2015 and VOT2016, and the proposed method achieves superior results on these benchmarks.

Citations (258)

Summary

  • The paper presents FlowTrack, a novel end-to-end architecture that fuses optical flow-based motion details with spatial-temporal attention to boost tracking accuracy.
  • It embeds optical flow estimation within a differentiable framework, achieving higher AUC scores on OTB benchmarks than existing methods.
  • Its adaptive spatial-temporal attention mechanism aggregates multi-frame features, improving resilience against occlusions and deformations in dynamic scenes.

End-to-end Flow Correlation Tracking with Spatial-temporal Attention

The paper "End-to-end Flow Correlation Tracking with Spatial-temporal Attention" introduces a novel architecture for visual object tracking by leveraging both appearance and motion information. It addresses the limitations observed in traditional Discriminative Correlation Filters (DCF) which primarily rely on appearance features of the current frame and lack temporal context. This deficiency often results in degraded performance under partial occlusions, deformations, and similar challenges.

The proposed framework, FlowTrack, is designed to exploit the rich motion information in consecutive frames. This is accomplished by formulating optical flow estimation, feature extraction, aggregation, and correlation filter tracking as layers of a unified deep network. The integration enables end-to-end training, so inter-frame motion estimation is optimized jointly with the tracking objective, in contrast to prior trackers that apply off-the-shelf optical flow without any task-specific adaptation.
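Conceptually, the forward pass chains flow estimation, feature extraction, warping, aggregation, and the correlation filter so that a single tracking loss can back-propagate through all of them, including the flow network. The PyTorch-style outline below is only a sketch of that composition; every module name in it is a placeholder, not the authors' code:

```python
import torch
import torch.nn.functional as F

def flowtrack_step(history, current, flow_net, feature_net, warp, attention, dcf_layer, label):
    """Hypothetical outline of one end-to-end training step.

    Every callable (flow_net, feature_net, warp, attention, dcf_layer) is a placeholder.
    history : list of earlier frames sampled at predefined intervals
    current : current frame tensor
    label   : Gaussian-shaped response map centred on the target
    """
    feats = [feature_net(current)]
    for past in history:
        flow = flow_net(past, current)               # estimate flow from the past frame to the current one
        feats.append(warp(feature_net(past), flow))  # align historical features to the current frame
    fused = attention(torch.stack(feats, dim=1))     # adaptive spatial-temporal weighting over frames
    response = dcf_layer(fused)                      # differentiable correlation-filter layer
    loss = F.mse_loss(response, label)
    loss.backward()                                  # gradients reach flow_net as well: flow and tracking are trained jointly
    return loss
```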

Key elements of this methodology include:

  • Optical Flow Integration: Feature maps from frames at predefined intervals are warped to the current frame using the estimated flow, enriching the feature pool with motion information (a warping sketch follows this list).
  • Spatial-temporal Attention Mechanism: A novel adaptive mechanism aggregates and weights feature maps from multiple frames, exploiting both spatial and temporal cues for a more complete target representation (see the aggregation sketch below).
  • End-to-end Training: Formulating the DCF as a differentiable layer allows back-propagation through the whole pipeline, so the entire network is optimized jointly.
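The operations behind the first two bullets can be sketched as follows. This is an illustrative simplification rather than the authors' code: warp applies bilinear flow-guided warping via grid_sample, and aggregate approximates the adaptive weighting with a per-pixel cosine-similarity softmax over frames; the paper's full spatial-temporal attention additionally uses learned embeddings and a temporal re-weighting stage.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Bilinearly warp features (N, C, H, W) using a pixel-unit flow field (N, 2, H, W)."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    # Displace the sampling grid by the flow and normalise to [-1, 1] for grid_sample
    gx = 2.0 * (xs + flow[:, 0]) / max(w - 1, 1) - 1.0
    gy = 2.0 * (ys + flow[:, 1]) / max(h - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

def aggregate(feats):
    """Fuse warped feature maps (N, T, C, H, W); index 0 along T is the current frame.

    Simplified spatial-temporal weighting: per-pixel cosine similarity to the
    current frame, softmax-normalised over the T frames.
    """
    ref = F.normalize(feats[:, 0], dim=1)                             # current-frame embedding
    sims = (F.normalize(feats, dim=2) * ref.unsqueeze(1)).sum(dim=2)  # (N, T, H, W) similarity maps
    weights = torch.softmax(sims, dim=1).unsqueeze(2)                 # per-pixel weights over frames
    return (weights * feats).sum(dim=1)                               # (N, C, H, W) aggregated features
```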

Empirical evaluations on OTB2013, OTB2015, VOT2015, and VOT2016 demonstrate the approach's effectiveness. FlowTrack achieves AUC scores of 0.689 on OTB2013 and 0.655 on OTB2015, outperforming state-of-the-art trackers such as CCOT and SINT+. In the VOT challenges, FlowTrack ranks among the leading trackers in expected average overlap (EAO) while running at 12 FPS, considerably faster than the most accurate competitors.

The implications of these findings are significant for both practical deployment and theoretical comprehension in visual object tracking. Practically, the seamless blend of appearance and motion information enables a more resilient and precise tracking system, valuable in applications demanding high accuracy and robustness under dynamic conditions. Theoretically, the exploration of end-to-end differentiable structures integrating complex spatial-temporal cues presents a new paradigm for enhancing deep learning models in video analysis.

Future extensions of this work could explore fine-tuning the spatial-temporal attention mechanism to further optimize frame selection under varied video motion contexts. Additionally, expanding the framework to support real-time mobile device environments might benefit a large segment of computer vision applications requiring lightweight yet proficient tracking systems.

