
AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition (2112.14238v2)

Published 28 Dec 2021 in cs.CV, cs.AI, and cs.LG

Abstract: Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing the spatial redundancy. As a representative work, the adaptive focus method (AdaFocus) has achieved a favorable trade-off between accuracy and inference speed by dynamically identifying and attending to the informative regions in each video frame. However, AdaFocus requires a complicated three-stage training pipeline (involving reinforcement learning), leading to slow convergence and making it unfriendly to practitioners. This work reformulates the training of AdaFocus as a simple one-stage algorithm by introducing a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization. We further present an improved training scheme to address the issues introduced by the one-stage formulation, including the lack of supervision, input diversity and training stability. Moreover, a conditional-exit technique is proposed to perform temporal adaptive computation on top of AdaFocus without additional training. Extensive experiments on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, and Jester) demonstrate that our model significantly outperforms the original AdaFocus and other competitive baselines, while being considerably simpler and more efficient to train. Code is available at https://github.com/LeapLabTHU/AdaFocusV2.

Authors (9)
  1. Yulin Wang (45 papers)
  2. Yang Yue (42 papers)
  3. Yuanze Lin (10 papers)
  4. Haojun Jiang (13 papers)
  5. Zihang Lai (15 papers)
  6. Victor Kulikov (6 papers)
  7. Nikita Orlov (10 papers)
  8. Humphrey Shi (97 papers)
  9. Gao Huang (178 papers)
Citations (47)

Summary

AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

The paper "AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition" presents an innovative advancement in the computational efficiency of video recognition. It builds upon the prior work AdaFocus, which enhanced efficiency by dynamically attending to informative video frame regions. However, AdaFocus had the drawback of a complex three-stage training pipeline involving reinforcement learning, which hampered its convergence speed and accessibility for practitioners. AdaFocus V2 addresses this by reformulating the training into a one-stage, end-to-end learnable algorithm. This is achieved by introducing a differentiable interpolation-based patch selection operation, thus streamlining the training process while maintaining or improving performance.

Key Contributions and Methodology

Like its predecessor, AdaFocus V2 concentrates computational resources on spatially informative regions within video frames. The paper simplifies the training procedure by replacing the non-differentiable, reinforcement-learning-based patch decisions with differentiable operations. The method involves:

  1. Differentiable Patch Selection: The authors introduce a differentiable, interpolation-based mechanism for selecting patches within video frames. Gradients can therefore propagate through the patch-selection step, enabling efficient end-to-end optimization (see the first sketch after this list).
  2. Optimization Challenges and Solutions (the three fixes below are illustrated together in the second sketch after this list):
    • Lack of Supervision: Addressed by introducing auxiliary supervision, which applies direct frame-wise recognition losses to guide the learning of global and local feature encoders.
    • Input Diversity: The diversity augmentation technique incorporates randomized patch cropping during training to enhance the generalization ability of the network.
    • Training Stability: Implementing a stop-gradient strategy helps to prevent interference between learning tasks and promotes training stability.
  3. Conditional-Exit Technique: To reduce temporal redundancy, the paper proposes an adaptive early-exit mechanism that skips less informative frames based on prediction confidence, improving efficiency without any additional training (see the third sketch after this list).
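
To make the patch-selection step concrete, here is a minimal PyTorch-style sketch (not the authors' released code; see the linked repository for that) of an interpolation-based crop. Because bilinear sampling is differentiable with respect to the sampling grid, gradients flow from the recognition loss back into the predicted patch centers. The names `differentiable_crop` and `centers`, and the toy policy head in the usage lines, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def differentiable_crop(frames, centers, patch_size):
    """Crop a patch around each predicted center via bilinear interpolation.

    frames:  (B, C, H, W) input video frames (assumed square, H == W)
    centers: (B, 2) patch centers in [-1, 1], e.g. from a policy head
    Gradients reach `centers` through F.grid_sample, which is what makes
    the patch-selection step end-to-end trainable.
    """
    B, C, H, W = frames.shape
    scale = patch_size / H  # relative size of the crop window
    theta = torch.zeros(B, 2, 3, device=frames.device, dtype=frames.dtype)
    theta[:, 0, 0] = scale
    theta[:, 1, 1] = scale
    theta[:, :, 2] = centers  # (cx, cy) offsets of the crop window
    grid = F.affine_grid(theta, (B, C, patch_size, patch_size), align_corners=False)
    return F.grid_sample(frames, grid, align_corners=False)

# Usage: a toy policy head predicts centers from pooled global features.
feats = torch.randn(4, 128, requires_grad=True)       # stand-in global features
centers = torch.tanh(torch.nn.Linear(128, 2)(feats))  # keep centers in [-1, 1]
frames = torch.randn(4, 3, 224, 224)
patches = differentiable_crop(frames, centers, patch_size=96)
patches.sum().backward()                              # gradients reach `feats`
```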
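
The three fixes in item 2 can be illustrated together in one training step. The sketch below is a hedged approximation, not the paper's exact procedure: it assumes both encoders return pooled (B, D) features, reuses `differentiable_crop` from the previous sketch, and invents the helper names (`training_step`, `aux_head_g`, `aux_head_l`) and the 0.5 augmentation probability for illustration.

```python
import torch
import torch.nn.functional as F

def training_step(global_net, local_net, policy_head, fused_head,
                  aux_head_g, aux_head_l, frames, labels,
                  patch_size=96, p_random=0.5):
    """One illustrative step combining auxiliary supervision,
    input diversity, and the stop-gradient trick."""
    # Cheap global pass on a downsampled view of the frame.
    g = global_net(F.interpolate(frames, size=96, mode="bilinear",
                                 align_corners=False))

    # Training stability: the policy head sees detached features, so
    # patch-selection gradients cannot interfere with the global encoder.
    centers = torch.tanh(policy_head(g.detach()))

    # Input diversity: occasionally substitute a random crop location
    # (no policy gradient on those steps, but better generalization).
    if torch.rand(()) < p_random:
        centers = torch.empty_like(centers).uniform_(-1, 1)

    l = local_net(differentiable_crop(frames, centers, patch_size))

    # Auxiliary supervision: direct frame-wise losses on both encoders,
    # in addition to the loss on the fused prediction.
    return (F.cross_entropy(fused_head(torch.cat([g, l], dim=-1)), labels)
            + F.cross_entropy(aux_head_g(g), labels)
            + F.cross_entropy(aux_head_l(l), labels))
```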
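
Finally, the conditional-exit idea in item 3 reduces, at inference time, to a confidence check on the running prediction. The sketch below assumes per-frame logits are computed sequentially; the paper's exact exit criterion may differ, but the principle is the same: stop as soon as the averaged prediction is confident enough.

```python
import torch

@torch.no_grad()
def conditional_exit(frame_logits, threshold=0.9):
    """Early exit over per-frame logits of shape (T, num_classes).

    Averages softmax predictions frame by frame and stops as soon as the
    running confidence clears `threshold`, skipping the remaining frames.
    Applied at inference only, so it needs no additional training.
    """
    running = torch.zeros_like(frame_logits[0])
    for t, logits in enumerate(frame_logits):
        running += torch.softmax(logits, dim=-1)
        conf, pred = (running / (t + 1)).max(dim=-1)
        if conf >= threshold:
            return pred.item(), t + 1  # predicted label, frames used
    return pred.item(), len(frame_logits)  # fall back to the full sequence
```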

Results and Implications

Experiments conducted across six benchmark datasets (ActivityNet, FCVID, Mini-Kinetics, Something-Something V1 & V2, and Jester) demonstrate that AdaFocus V2 outperforms the original AdaFocus and other competitive baselines, achieving higher accuracy while cutting training time roughly in half (speedups of 2.2 to 2.4x over its predecessor). The improvements hold consistently across datasets, backbone architectures, patch sizes, and model configurations.

Theoretical and Practical Implications

The theoretical contribution of AdaFocus V2 lies in its differentiable approach to spatial dynamic networks. By utilizing interpolation-based patch selection, it enables gradient flow through discrete decision tasks, which were traditionally tackled with reinforcement learning. Practically, this innovation reduces the computational overhead and complexity associated with high-performance video recognition models, which is crucial for real-world applications like video surveillance, automated content analysis, and video recommendation systems.

Future Directions

The advancements presented by AdaFocus V2 open pathways for further research in optimizing dynamic neural networks. Future work could involve extending the framework to more varied and larger datasets, including diverse video domains beyond human action recognition. Moreover, integrating AdaFocus V2 with temporal-dynamic strategies or transformer-based models could yield new possibilities for efficiency and accuracy improvements in neural network architectures.

In closing, AdaFocus V2 exemplifies a significant leap towards more accessible and efficient training of spatial dynamic networks in video recognition tasks. Its novel contributions serve as a foundation for both theoretical exploration and practical implementation in dynamic neural computing.
