
SmallBigNet: Integrating Core and Contextual Views for Video Classification (2006.14582v1)

Published 25 Jun 2020 in cs.CV

Abstract: Temporal convolution has been widely used for video classification. However, it is performed on spatio-temporal contexts in a limited view, which often weakens its capacity of learning video representation. To alleviate this problem, we propose a concise and novel SmallBig network, with the cooperation of small and big views. For the current time step, the small view branch is used to learn the core semantics, while the big view branch is used to capture the contextual semantics. Unlike traditional temporal convolution, the big view branch can provide the small view branch with the most activated video features from a broader 3D receptive field. Via aggregating such big-view contexts, the small view branch can learn more robust and discriminative spatio-temporal representations for video classification. Furthermore, we propose to share convolution in the small and big view branch, which improves model compactness as well as alleviates overfitting. As a result, our SmallBigNet achieves a comparable model size like 2D CNNs, while boosting accuracy like 3D CNNs. We conduct extensive experiments on the large-scale video benchmarks, e.g., Kinetics400, Something-Something V1 and V2. Our SmallBig network outperforms a number of recent state-of-the-art approaches, in terms of accuracy and/or efficiency. The codes and models will be available on https://github.com/xhl-video/SmallBigNet.

Citations (88)

Summary

  • The paper introduces SmallBigNet, a novel video classification model employing a dual-view framework to learn core and contextual semantics separately.
  • SmallBigNet aggregates features from an expansive view to support a core view, creating robust spatio-temporal representations while sharing parameters for compactness.
  • Extensive experiments on large-scale benchmarks show that SmallBigNet outperforms recent state-of-the-art methods in terms of accuracy and/or efficiency.

SmallBigNet: Integrating Core and Contextual Views for Video Classification

This paper introduces the SmallBigNet, an innovative approach for video classification that leverages differentiated spatio-temporal receptive fields to enhance the robustness and discriminative power of video representations. Traditional methods using temporal convolution for video classification often work within constrained spatio-temporal contexts, leading to limitations in capturing relevant features. The SmallBigNet addresses this by employing a dual-view approach: the small view for core semantics and the big view for contextual semantics.
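
The abstract describes the big view as supplying "the most activated video features from a broader 3D receptive field" to the small view. A natural reading of that sentence is a 3D max pooling over a spatio-temporal neighbourhood whose output is fused back into the local features. The PyTorch sketch below illustrates only this reading; the kernel size and the additive fusion are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class BigViewContext(nn.Module):
    """Sketch of big-view context aggregation: take the most activated
    features from a broader 3D (time x height x width) neighbourhood via
    max pooling and fuse them back into the local, small-view features.
    Kernel size and additive fusion are illustrative assumptions."""

    def __init__(self, kernel_size=(3, 3, 3)):
        super().__init__()
        padding = tuple(k // 2 for k in kernel_size)  # same-size output
        self.pool = nn.MaxPool3d(kernel_size, stride=1, padding=padding)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        big_view = self.pool(x)   # strongest responses in the broader context
        return x + big_view       # small view enriched with big-view context


if __name__ == "__main__":
    clip_features = torch.randn(2, 64, 8, 56, 56)  # toy video feature map
    print(BigViewContext()(clip_features).shape)   # torch.Size([2, 64, 8, 56, 56])
```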

Key Contributions

  1. Dual-view framework: The SmallBigNet comprises two branches—a small view and a big view. The small view is tasked with learning the core semantics of video content, while the big view captures broader contextual semantics. This separation enables more effective learning by addressing the issue of irrelevant context within constrained views, a problem inherent in standard temporal convolution approaches.
  2. Enhanced feature aggregation: Unlike conventional methods, the big view branch provides the most activated features from an expanded 3D receptive field, which are then aggregated to support the small view branch. This yields a more discriminative and stable spatio-temporal representation and improves video classification accuracy.
  3. Parameter sharing: The small and big views share convolution parameters, which keeps the model compact and helps reduce overfitting. This allows SmallBigNet to maintain a model size comparable to 2D CNNs while achieving accuracy gains closer to 3D CNNs (a minimal sketch of this shared-convolution design follows the list).
  4. Experimental validation: Extensive experiments were conducted on large-scale video benchmarks such as Kinetics400 and Something-Something V1 and V2. SmallBigNet outperformed recent state-of-the-art models in terms of accuracy and/or efficiency.
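
Contributions 2 and 3 together suggest a residual-style unit in which a single convolution is applied to both the local input (small view) and its max-pooled counterpart (big view), so the added context comes at no extra parameter cost. The sketch below is a hedged illustration of that shared-convolution pattern; the 1x1x1 convolution, 3x3x3 pooling, batch normalization, and residual addition are assumed layer choices, not the authors' released architecture (available at the linked GitHub repository).

```python
import torch
import torch.nn as nn


class SmallBigUnit(nn.Module):
    """Illustrative SmallBig-style unit: the small view applies a convolution
    to the local features, the big view applies the *same* convolution
    (shared weights) to 3D max-pooled features, and the two are summed.
    Layer choices are assumptions, not the released implementation."""

    def __init__(self, channels, pool_size=(3, 3, 3)):
        super().__init__()
        padding = tuple(k // 2 for k in pool_size)
        self.pool = nn.MaxPool3d(pool_size, stride=1, padding=padding)
        # One convolution reused by both views: the big-view branch adds
        # context without adding parameters (the compactness argument).
        self.shared_conv = nn.Conv3d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        small = self.shared_conv(x)            # core semantics at the current position
        big = self.shared_conv(self.pool(x))   # contextual semantics, same weights
        out = self.bn(small + big)             # aggregate big-view context into the small view
        return self.relu(out + x)              # residual connection


if __name__ == "__main__":
    unit = SmallBigUnit(channels=64)
    clip_features = torch.randn(2, 64, 8, 56, 56)
    print(unit(clip_features).shape)           # torch.Size([2, 64, 8, 56, 56])
```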

Results and Implications

The SmallBigNet outperformed several existing approaches across benchmarks, providing empirical evidence of its efficacy. These results highlight the potential of dual-view frameworks to enhance video representation learning by selectively aggregating contextual information based on activated features rather than relying on homogeneous temporal convolution strategies.

Speculation on Future Developments

There is a clear trajectory for further exploration of dual-view frameworks in video analysis, particularly for tasks that require a nuanced understanding of complex video dynamics and context. Future research could examine how parameters are balanced or shared between views, or investigate adaptive strategies in which the size of the big-view receptive field varies with the complexity of the video content. Moreover, integrating attention mechanisms that dynamically weight the contribution of the contextual view could further improve both interpretability and accuracy.

In summary, this paper advances the field of video classification by demonstrating the effectiveness of view differentiation and parameter sharing in achieving robust video representations, laying the groundwork for subsequent innovations in spatio-temporal learning frameworks.