Overview of "A Multigrid Method for Efficiently Training Video Models"
The paper "A Multigrid Method for Efficiently Training Video Models" by Chao-Yuan Wu et al. introduces an approach designed to enhance the efficiency of training deep video models. The primary motivation behind this research is the observation that video models require substantially more computational resources than their image-based counterparts, often hindering progress due to prolonged training times.
Key Contributions
The authors propose a method inspired by multigrid techniques from numerical optimization. Rather than using the single fixed mini-batch shape traditionally employed in video model training, the method varies the mini-batch shape, training on different spatial-temporal resolutions and batch sizes according to a defined schedule. This provides several advantages:
- Training Acceleration: By dynamically altering the resolution and batch size, the method significantly reduces wall-clock training time without compromising accuracy.
- Robust Grid Scheduling: The strategy applies across models (I3D, Non-local, and SlowFast) and datasets (Kinetics, Something-Something, and Charades), demonstrating the robustness and adaptability of the approach.
- Empirical Results: Notably, the method achieves a 4.5x training speedup for the ResNet-50 SlowFast network on Kinetics-400, along with a small accuracy improvement (+0.8%).
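The compute-balancing idea behind the speedup can be sketched simply: when the sampling grid is coarsened, the mini-batch size is scaled up so that each iteration processes roughly the same number of input values. A minimal illustration in Python (the base shape and exact scaling rule here are illustrative assumptions, not the paper's precise configuration):

```python
def scaled_batch_size(base_batch, base_shape, grid_shape):
    """Scale the mini-batch size inversely with the sample volume.

    base_shape and grid_shape are (T, H, W) tuples; coarser grids
    (fewer frames, lower resolution) permit larger mini-batches at
    roughly constant memory and compute per iteration.
    """
    base_volume = base_shape[0] * base_shape[1] * base_shape[2]
    grid_volume = grid_shape[0] * grid_shape[1] * grid_shape[2]
    return max(1, base_batch * base_volume // grid_volume)

# Fine grid: the baseline shape keeps the baseline batch size.
print(scaled_batch_size(8, (16, 224, 224), (16, 224, 224)))  # 8
# Coarse grid: half the frames and half of each spatial dimension
# shrinks the sample volume 8x, allowing an 8x larger mini-batch.
print(scaled_batch_size(8, (16, 224, 224), (8, 112, 112)))   # 64
```

Because larger mini-batches at coarser grids complete an epoch in fewer, cheaper iterations, most of the training schedule runs well below the cost of the fixed-shape baseline.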
Methodology
The paper describes the multigrid approach in detail, emphasizing the importance of scheduling different grid shapes throughout the training process. The authors divide the training schedule into long and short cycles:
- Long Cycles: Move the model from coarse to fine grids over the course of training, exploiting larger mini-batches with reduced spatial and temporal dimensions early on and shifting toward the full-resolution shape later.
- Short Cycles: Introduce rapid changes in the spatial resolution within each long-cycle stage, complementing the long cycles by exposing the model to a diversity of input scales and improving generalization.
This scheduling requires minimal changes to existing data loaders, making the implementation straightforward and easy to integrate with current training pipelines.
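The interaction of the two cycle types can be sketched as a schedule generator that emits a mini-batch shape per iteration. This is a hedged sketch of the idea only: the stage factors, cycle lengths, and base shape below are illustrative assumptions, not the paper's published schedule.

```python
def multigrid_schedule(num_iters, base_batch=8, base_shape=(16, 224, 224)):
    """Yield (batch_size, (T, H, W)) for each training iteration.

    A long cycle steps from coarse to fine grids across training
    stages; within each stage, a short cycle varies the spatial
    resolution from one iteration to the next. All factors here are
    illustrative, not the paper's exact values.
    """
    # Long-cycle stages: (temporal factor, spatial factor), coarse -> fine.
    long_stages = [(4, 2), (2, 2), (2, 1), (1, 1)]
    # Short-cycle spatial factors layered on top of the long-cycle shape.
    short_factors = [2, 1.41, 1]
    iters_per_stage = num_iters // len(long_stages)
    base_t, base_h, base_w = base_shape
    for i in range(num_iters):
        stage = min(i // iters_per_stage, len(long_stages) - 1)
        t_factor, s_factor = long_stages[stage]
        short = short_factors[i % len(short_factors)]
        t = base_t // t_factor
        h = int(base_h / (s_factor * short))
        w = int(base_w / (s_factor * short))
        # Batch size grows as the sample volume shrinks, keeping
        # per-iteration compute roughly constant.
        batch = max(1, base_batch * (base_t * base_h * base_w) // (t * h * w))
        yield batch, (t, h, w)
```

With these example factors, early iterations train on shapes like (4, 56, 56) with large mini-batches, while the final stage settles on the full (16, 224, 224) shape at the baseline batch size. In practice the data loader only needs to resample clips to the requested shape per iteration, which is why the authors describe the integration cost as minimal.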
Implications and Future Work
The paper has significant implications for the video understanding domain. By presenting a method that reduces the computational burden of training video models, it makes research more accessible. This has the potential to broaden participation in video-based AI research by lowering resource barriers, thus fostering innovation.
Practically, the multigrid method can be leveraged for scalable training on large datasets, enabling quicker iterations and explorations. This aligns with current trends towards more extensive and diverse video datasets, offering a scalable solution that complements existing hardware advances.
Theoretically, the results confirm that dynamic adaptation of training schedules, aligned with multigrid concepts in numerical analysis, can be effectively applied within deep learning contexts. Future research could explore further optimizations and extend these ideas to other domains, potentially accelerating training across various AI disciplines.
In conclusion, this paper offers a detailed and empirically validated method for alleviating one of the major bottlenecks in video model training, namely its computational cost, while also providing a robust framework for future experimentation and development in video understanding research.