X3D: Expanding Architectures for Efficient Video Recognition (2004.04730v1)

Published 9 Apr 2020 in cs.CV

Abstract: This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that good accuracy to complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves state-of-the-art performance while requiring 4.8x and 5.5x fewer multiply-adds and parameters for similar accuracy as previous work. Our most surprising finding is that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters. We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks. Code will be available at: https://github.com/facebookresearch/SlowFast

The paper, "X3D: Expanding Architectures for Efficient Video Recognition," introduces a methodology for designing efficient video recognition models by progressively expanding a small 2D image classification network along multiple axes: space, time, width, and depth. The goal is a better trade-off between computational complexity and accuracy, which matters in video recognition because models typically demand substantial computational resources.

Overview of X3D Architecture

The basis of the X3D architecture is a tiny 2D image classification network inspired by the MobileNet family of models, known for their efficient, channel-wise separable convolutions. The X3D framework incrementally expands this base network into a full-fledged spatiotemporal architecture. The core idea is to alternate expansions along different network axes, evaluating the impact on model accuracy after each step to maintain an optimal computation/accuracy trade-off.
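To make the efficiency of channel-wise separable convolutions concrete, the following sketch compares weight counts for a dense 3D convolution and a channel-wise 3D convolution followed by a 1x1x1 pointwise convolution. The layer sizes (64 channels, 3x3x3 kernel) and the function names are assumptions chosen for illustration and are not taken from the paper.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weights in a dense 3D convolution (bias ignored)."""
    return c_in * c_out * kt * kh * kw


def separable_conv3d_params(c_in, c_out, kt, kh, kw):
    """Weights in a channel-wise 3D conv followed by a 1x1x1 pointwise conv."""
    channelwise = c_in * kt * kh * kw  # one kt x kh x kw filter per input channel
    pointwise = c_in * c_out           # 1x1x1 convolution that mixes channels
    return channelwise + pointwise


# Illustrative layer sizes (assumed, not from the paper).
dense = conv3d_params(64, 64, 3, 3, 3)
separable = separable_conv3d_params(64, 64, 3, 3, 3)
print(f"dense: {dense}, separable: {separable}, ratio: {dense / separable:.1f}x")
```

Under these assumed sizes the separable form needs far fewer weights, which is why it suits a base network meant to be light in width and parameters.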

Expansion Strategy

The expansion process starts with a base model (X2D) and includes multiple candidate axes:

  • Temporal duration: Increasing the duration of the sampled frames.
  • Frame rate: Adjusting the sampling frequency of frames.
  • Spatial resolution: Increasing the spatial dimensions of the input frames.
  • Network width: Expanding the number of channels in each layer.
  • Bottleneck width: Expanding the width of intermediate layers within residual blocks.
  • Network depth: Adding more layers to the network.

The expansion follows a forward stepwise method. At each step, every candidate axis is expanded in turn, one at a time, so that each candidate roughly doubles the computational complexity. The candidate with the best accuracy/computation trade-off is retained, and the next step starts from it. If the final step overshoots the target complexity, a backward contraction shrinks the model to meet the budget.
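The forward expansion and backward contraction described above can be sketched as a greedy search. This is a minimal illustration, not the paper's implementation: the cost exponents, the `toy_accuracy` function, and all names are hypothetical stand-ins. In practice, scoring a candidate means training and validating a model, and the contraction rule used here (shrinking only the last expanded axis) is an assumption.

```python
import math

# Assumed toy cost model: relative cost = product of factor ** exponent.
# The exponents are illustrative guesses, not values from the paper.
COST_EXPONENT = {
    "duration": 1, "frame_rate": 1, "resolution": 2,
    "width": 2, "bottleneck": 1, "depth": 1,
}


def relative_cost(config):
    return math.prod(config[a] ** e for a, e in COST_EXPONENT.items())


def expand(config, axis, cost_multiplier=2.0):
    """Copy config with one axis grown so that cost scales by cost_multiplier."""
    grown = dict(config)
    grown[axis] *= cost_multiplier ** (1.0 / COST_EXPONENT[axis])
    return grown


def forward_expansion(evaluate, target_cost):
    """Greedily expand one axis per step until the target cost is reached."""
    config = {axis: 1.0 for axis in COST_EXPONENT}
    history = []
    while relative_cost(config) < target_cost:
        # All candidates cost about the same, so the best trade-off
        # is simply the candidate with the highest score.
        candidates = {axis: expand(config, axis) for axis in COST_EXPONENT}
        best_axis = max(candidates, key=lambda a: evaluate(candidates[a]))
        config = candidates[best_axis]
        history.append(best_axis)
    return config, history


def backward_contraction(config, last_axis, target_cost):
    """Shrink the last expanded axis so the cost lands on the target."""
    overshoot = relative_cost(config) / target_cost
    contracted = dict(config)
    contracted[last_axis] /= overshoot ** (1.0 / COST_EXPONENT[last_axis])
    return contracted


def toy_accuracy(config):
    """Hypothetical score with diminishing returns per axis (not real data)."""
    weights = {"duration": 3.0, "frame_rate": 2.0, "resolution": 4.0,
               "width": 2.5, "bottleneck": 1.5, "depth": 2.0}
    return sum(w * (1.0 - 1.0 / config[a]) for a, w in weights.items())


expanded, history = forward_expansion(toy_accuracy, target_cost=10.0)
final = backward_contraction(expanded, history[-1], target_cost=10.0)
print(history, round(relative_cost(final), 3))
```

Because every candidate in a step has about the same cost, comparing scores alone is enough to pick the best trade-off. The diminishing-returns score is what makes the search switch between axes rather than expanding a single one repeatedly.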

Experimental Results

The X3D models were evaluated on several benchmarks, demonstrating robust performance across a range of computational budgets. Key findings include:

  • X3D architectures achieve competitive accuracy with significantly fewer multiply-add operations (FLOPs) and parameters compared to state-of-the-art models.
  • For instance, X3D-XL achieved comparable accuracy to SlowFast R101 + Non-Local (NL) networks with 4.8x fewer FLOPs and 5.5x fewer parameters.
  • X3D-S, a smaller model configuration, provided a 4.7x reduction in FLOPs and 9.1x fewer parameters while maintaining performance comparable to previous lower-complexity models.

Inference Efficiency

The paper also explores how the number of clips used at inference affects accuracy and computational cost. Sparse temporal sampling (e.g., 3-clip testing) was found to maintain accuracy with significantly reduced computational demand. On Kinetics-400, X3D-S achieved 71.4% top-1 accuracy at 5.9 GFLOPs using 3-clip testing, illustrating the efficiency gains from the X3D expansion strategy.
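As a rough illustration of multi-clip testing, the sketch below averages per-class scores over the clips sampled from one video and computes the total cost as per-clip cost times the number of clips. The scores are made up, the function names are hypothetical, and the per-clip cost is derived from the 5.9 GFLOPs figure above under the assumption of a single spatial crop per clip.

```python
def average_clip_scores(clip_scores):
    """Average per-class scores over all clips sampled from one video."""
    num_clips = len(clip_scores)
    num_classes = len(clip_scores[0])
    return [sum(clip[c] for clip in clip_scores) / num_clips
            for c in range(num_classes)]


def total_inference_cost(gflops_per_clip, num_clips, crops_per_clip=1):
    """Total cost grows linearly with the number of views evaluated."""
    return gflops_per_clip * num_clips * crops_per_clip


# Made-up scores for 3 clips and 4 classes (not real model outputs).
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.3, 0.3, 0.1],
]
video_scores = average_clip_scores(scores)
predicted = max(range(len(video_scores)), key=video_scores.__getitem__)

# Assumption: the reported 5.9 GFLOPs covers 3 clips with one crop each.
per_clip = 5.9 / 3
print(predicted, round(per_clip, 2), round(total_inference_cost(per_clip, 3), 1))
```

Since cost scales linearly with the number of clips, reducing the clip count is a direct lever on inference cost, provided accuracy holds up.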

Implications and Future Directions

The primary contribution of the X3D framework is its methodical approach to balancing model accuracy and computational efficiency, making it highly relevant for real-time applications where computational resources may be limited. The progressive expansion methodology offers a systematic way to tailor models to specific computational budgets without significant loss in accuracy.

Future work may further explore this incremental expansion approach, potentially integrating neural architecture search (NAS) techniques to refine the expansion steps and uncover even more efficient architectures. Additionally, extending X3D beyond the video classification and detection benchmarks reported in the paper, for example to segmentation, could provide further insight into its versatility in different contexts.

In summary, the X3D framework presented in this paper highlights the importance of considering multiple axes of expansion for constructing efficient video recognition models. Through its progressive and adaptable methodology, X3D succeeds in offering high-performance video recognition with unprecedented computational efficiency, paving the way for more scalable and application-friendly AI solutions.

Authors (1)
Citations (928)