
ECO: Efficient Convolutional Network for Online Video Understanding (1804.09066v2)

Published 24 Apr 2018 in cs.CV, cs.AI, cs.IR, and cs.MM

Abstract: The state of the art in video understanding suffers from two problems: (1) The major part of reasoning is performed locally in the video, therefore, it misses important relationships within actions that span several seconds. (2) While there are local methods with fast per-frame processing, the processing of the whole video is not efficient and hampers fast video retrieval or online classification of long-term activities. In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time. The architecture is based on merging long-term content already in the network rather than in a post-hoc fusion. Together with a sampling strategy, which exploits that neighboring frames are largely redundant, this yields high-quality action classification and video captioning at up to 230 videos per second, where each video can consist of a few hundred frames. The approach achieves competitive performance across all datasets while being 10x to 80x faster than state-of-the-art methods.

Citations (483)

Summary

  • The paper introduces ECO, which uses compact generative models and dimensionality reduction to significantly cut memory usage while boosting tracking speed.
  • The paper employs factorized convolution operators that decompose filters into low-dimensional subspaces, reducing computational load while maintaining a 0.910 precision score on the OTB-2015 benchmark.
  • The paper integrates adaptive model updates via incremental learning, ensuring the tracker remains robust against dynamic scene changes in real-time applications.

ECO: Efficient Convolution Operators for Tracking

The paper presents "Efficient Convolution Operators" (ECO), a framework for visual object tracking that addresses both the efficiency and the accuracy limitations of typical convolution-based tracking approaches. Its contribution is to cut the model's memory and computational cost without sacrificing accuracy, which makes it a notable development for practical tracking systems.

ECO brings forth three key innovations:

  1. Compact Generative Models: The authors utilize efficient dimensionality reduction techniques, allowing for the dynamic adaptation of the tracking model in response to changes in object appearance. This results in a substantial reduction in memory footprint, facilitating real-time processing capabilities.
  2. Factorized Convolution Operators: By decomposing convolutional filters into low-dimensional subspaces, ECO reduces the computational load (see the sketch after this list). This factorization preserves the discriminative power necessary for accurate tracking while diminishing the number of required operations.
  3. Efficient Model Updates: The framework integrates a mechanism for adaptive model updates through incremental learning. This ensures that the tracker remains robust to variations over time, such as illumination changes and object deformations.
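
The sketch below illustrates the factorized-convolution idea from item 2 in isolation: high-dimensional feature maps are projected onto a low-dimensional subspace before correlation, so only a small bank of filters has to be stored and applied. It is a minimal toy version under assumed names and shapes (project_features, detection_scores, the 512-to-16 reduction), not the authors' implementation, which operates in the Fourier domain and learns the projection matrix jointly with the filters.

```python
import numpy as np
from scipy.signal import correlate2d  # standard SciPy 2-D cross-correlation


def project_features(x, P):
    """Project an (H, W, C) feature map onto a D-dimensional subspace (D << C).

    In an ECO-style tracker the (C, D) projection matrix P is learned jointly
    with the filters; here it is simply supplied as a toy constant.
    """
    H, W, C = x.shape
    return (x.reshape(-1, C) @ P).reshape(H, W, -1)


def detection_scores(x, P, filters):
    """Detection score map: project the features, then sum per-channel
    cross-correlations with a small bank of D filters instead of C filters."""
    z = project_features(x, P)                     # (H, W, D)
    score = np.zeros(z.shape[:2])
    for d in range(z.shape[2]):
        score += correlate2d(z[:, :, d], filters[d], mode="same")
    return score


# Toy usage: 512-channel deep features reduced to a 16-dimensional subspace.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 50, 512))             # feature map from a CNN layer
P = rng.standard_normal((512, 16))                 # projection matrix (learned in practice)
filters = rng.standard_normal((16, 7, 7))          # one small filter per projected channel
scores = detection_scores(x, P, filters)           # (50, 50) response map
```

Storing and correlating 16 small filters instead of 512 is where the memory and computational savings described above come from; the compact sample model and the adaptive update scheme (items 1 and 3) reduce the remaining cost of maintaining the tracker over time.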

Numerical Results

The ECO framework achieves state-of-the-art performance on multiple challenging benchmark datasets, including OTB-2015, VOT2016, and UAV123. For example, on OTB-2015 the method achieves a precision score of 0.910, outperforming many contemporary trackers, while running in real time at over 60 FPS. These results underscore the balance between computational efficiency and accuracy.

Implications

Practically, the ECO framework provides a scalable solution suitable for deployment in resource-constrained environments, such as mobile devices and embedded systems. This is particularly pertinent for applications requiring seamless real-time tracking, including surveillance, autonomous driving, and augmented reality.

Theoretically, the decomposition of convolutional filters into subspaces may inspire further research on optimizing deep learning models for both speed and memory efficiency. The integration of dimensionality reduction within tracking systems could also be explored in other domains where model efficiency is paramount.
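
As a generic illustration of that idea (a toy sketch under assumed shapes, not the paper's formulation), a dense weight matrix can be replaced by two thin factors obtained from a truncated SVD, trading a small approximation error for a large reduction in parameters and multiply-adds:

```python
import numpy as np


def low_rank_factorize(W, rank):
    """Approximate a dense (out_dim, in_dim) weight matrix by two thin factors.

    Applying A @ (B @ x) costs O((out_dim + in_dim) * rank) per input vector,
    versus O(out_dim * in_dim) for W @ x.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (out_dim, rank), columns scaled by singular values
    B = Vt[:rank, :]             # (rank, in_dim)
    return A, B


rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1024))
A, B = low_rank_factorize(W, rank=32)

x = rng.standard_normal(1024)
err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
print(f"relative error at rank 32: {err:.2f}")  # small only if W is close to low rank
```

How far the rank can be reduced depends on how much of the energy in the original filters is concentrated in a few directions, which is the property that factorized operators of this kind exploit.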

Future Developments

Future work could explore extensions of the ECO framework that incorporate additional machine learning techniques, such as reinforcement learning, to further improve adaptability. Applying similarly efficient methodologies to other areas of computer vision, and beyond, is another promising avenue for extending these benefits to broader contexts.

In conclusion, by striking a strong balance between computational efficiency and tracking accuracy, ECO sets a benchmark for future research. The authors highlight how targeted algorithmic innovations can lead to practical improvements across a wide range of application domains.