MoViNets: Efficient Mobile Video Recognition
- MoViNets are mobile-optimized neural architectures designed for efficient video recognition in resource-constrained environments.
- They leverage neural architecture search to optimize spatiotemporal convolutions and incorporate Gaussian Random Walk regularization for temporal smoothness.
- MoViNets enable constant-memory online inference using causal convolutions and stream buffers, setting state-of-the-art accuracy-efficiency benchmarks on datasets like Kinetics-600.
Mobile Video Networks (MoViNets) are a family of neural architectures optimized for efficient video recognition, delivering accurate classification with reduced computation and memory requirements. Designed for mobile and resource-constrained environments, MoViNets combine neural architecture search (NAS), memory-efficient streaming techniques, and regularization strategies that exploit temporal coherence in video streams. The architectures provide leading accuracy-efficiency tradeoffs for both offline and online video inference, surpassing traditional 3D CNNs under limited FLOPs and memory budgets (Kondratyuk et al., 2021; Goldman et al., 25 Nov 2025).
1. Architecture Search Space and Variants
MoViNets extend MobileNetV3’s inverted-bottleneck supernet to 3D, adapting block-level primitives to video input. The fundamental design incorporates:
- A stem 1×3×3 convolution with spatial and temporal stride.
- Blocks 2–6 containing Lᵢ inverted-bottleneck layers. Each layer consists of:
- 1×1×1 expansion convolution with expansion factor αᵢ.
- Depthwise 3D convolutions with kernel shapes kᵢᵗ×(kᵢˢ)², supporting diverse spatiotemporal receptive fields.
- 1×1×1 projection convolution.
- Optional 3D squeeze-and-excitation (SE) and residual connections.
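To make the cost structure of such a layer concrete, the following is a minimal sketch that counts multiply-accumulate operations (MACs) for the three convolutions in one inverted-bottleneck layer. It assumes stride 1, "same" padding, and no squeeze-and-excitation or bias terms; the function names and the example shapes are illustrative and not taken from the paper.

```python
def inverted_bottleneck_macs(frames, height, width, c_in, c_out,
                             expansion, k_t, k_s):
    """Count MACs for expansion, depthwise 3D, and projection convolutions.

    Assumes stride 1 and 'same' padding, so every convolution is evaluated
    at frames * height * width output positions. SE and biases are ignored.
    """
    positions = frames * height * width
    c_exp = int(round(c_in * expansion))          # expanded channel count
    expand = positions * c_in * c_exp             # 1x1x1 expansion
    depthwise = positions * c_exp * k_t * k_s * k_s  # one filter per channel
    project = positions * c_exp * c_out           # 1x1x1 projection
    return {"expand": expand, "depthwise": depthwise,
            "project": project, "total": expand + depthwise + project}


def dense_conv3d_macs(frames, height, width, c_in, c_out, k_t, k_s):
    """MACs for a standard (non-separable) 3D convolution, for comparison."""
    return frames * height * width * c_in * c_out * k_t * k_s * k_s


# Hypothetical layer: 2 frames of 4x4 features, 8 -> 8 channels, 3x3x3 kernel.
block = inverted_bottleneck_macs(2, 4, 4, c_in=8, c_out=8,
                                 expansion=3, k_t=3, k_s=3)
dense = dense_conv3d_macs(2, 4, 4, c_in=8, c_out=8, k_t=3, k_s=3)
print(block["total"], dense)
```

The depthwise term grows with the kernel volume but only linearly in the channel count, which is why varying the kernel shape per layer is a comparatively cheap search dimension.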
Depth, width, input resolution, and frame rate are scaled jointly through a single compound scaling parameter φ with per-dimension coefficients. The family comprises variants A0 through A6, in order of increasing capacity and resource demand. For example, MoViNet-A0 reaches 71.5% Top-1 accuracy on Kinetics-600 at 2.7 GFLOPs, while MoViNet-A6 reaches 83.5% at 386 GFLOPs (Kondratyuk et al., 2021).
Streaming versions ("-Stream") leverage causal convolution and allow constant-memory, online inference with negligible degradation in accuracy. The architectures are constructed by one-shot NAS using TuNAS, maximizing a multi-objective reward function targeting accuracy under strict FLOP and memory budgets.
| Model Variant | Frames | Resolution | GFLOPs | Top-1 (%) |
|---|---|---|---|---|
| A0 | 50 | 172² | 2.71 | 71.5 |
| A1 | 50 | 172² | 6.02 | 76.0 |
| A2 | 50 | 224² | 10.3 | 77.5 |
| A3 | 120 | 256² | 56.9 | 80.8 |
| A5 | 120 | 320² | 281 | 82.7 |
| A6 | 120 | 320² | 386 | 83.5 |
2. Neural Architecture Search Methodology
MoViNets use a supernetwork approach in which all architectural choices are embedded in a single model. The NAS policy, defined by a set of learnable parameters, samples sub-architectures, and training alternates between weight updates (minimizing cross-entropy) and policy updates via REINFORCE that maximize the expected reward. The reward balances accuracy, normalized FLOPs, and memory, penalizing architectures that exceed the set budgets. This makes it possible to discover architectures on the efficiency-accuracy frontier for different hardware and deployment scenarios (Kondratyuk et al., 2021).
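The policy-update half of this loop can be illustrated with a toy example. The sketch below applies REINFORCE to a single categorical choice among three hypothetical candidates, each described by a made-up (accuracy, GFLOPs) pair. The reward shape (accuracy minus a penalty applied only above the budget), the moving-average baseline, and all numbers are assumptions for illustration; the supernet weight updates and the memory term are omitted, and this is not the reward used by TuNAS or the paper.

```python
import math
import random


def reward(accuracy, flops, flops_budget, penalty_weight=1.0):
    """Assumed reward: accuracy minus a penalty for exceeding the budget."""
    overshoot = max(0.0, flops / flops_budget - 1.0)
    return accuracy - penalty_weight * overshoot


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def search(candidates, flops_budget, steps=2000, lr=0.5, seed=0):
    """REINFORCE over one categorical decision with a moving-average baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(candidates)), weights=probs)[0]
        accuracy, flops = candidates[choice]
        r = reward(accuracy, flops, flops_budget)
        advantage = r - baseline
        baseline = 0.9 * baseline + 0.1 * r
        # d log p(choice) / d logit_i = 1[i == choice] - p_i
        for i in range(len(logits)):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


# Hypothetical candidates: (accuracy, GFLOPs). The third exceeds the budget.
candidates = [(0.60, 2.0), (0.75, 3.0), (0.80, 6.0)]
policy = search(candidates, flops_budget=3.0)
best = max(range(len(candidates)), key=lambda i: policy[i])
print(best, [round(p, 3) for p in policy])
```

In this toy setting the most accurate candidate is rejected because its budget penalty outweighs its accuracy advantage, so the policy concentrates on the best candidate that fits the budget.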
3. Stream Buffer and Causal Inference
Traditional 3D CNNs use memory that grows linearly with the number of input frames, O(T), which rules out efficient streaming or online inference on long videos. MoViNets introduce the Stream Buffer, which retains only a fixed number of the most recent temporal frames of intermediate features, giving constant, O(1), peak memory use with respect to video length. At each step, the features of the new subclip are concatenated to the buffer and outdated frames are discarded. Causal convolutions pad only on the left (past) side so that no output depends on future frames, while causal global-average pooling and causal squeeze-and-excitation (with positional encoding) preserve the ability to predict online.
Training splits each video into subclips, accumulates the loss over them, and stops backward gradients at subclip boundaries, so that the cost of a training step is tied to the subclip length rather than to the full video and earlier frames do not need to be recomputed (Kondratyuk et al., 2021).
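The buffering idea can be checked on a one-dimensional toy problem. The sketch below treats each frame as a single scalar feature and shows that a temporal convolution evaluated subclip by subclip, with a buffer of the last k - 1 frames, reproduces a full-clip causal convolution with left-only zero padding. The class names, the scalar-feature simplification, and the choice of a k - 1 frame buffer are assumptions made for illustration. A running mean is included as a simple stand-in for causal global-average pooling.

```python
def causal_conv_full(frames, kernel):
    """Causal temporal convolution over the whole clip (left-only zero padding).

    kernel[-1] multiplies the current frame, kernel[0] the oldest frame.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(frames))]


class CausalConvStream:
    """The same convolution, evaluated one subclip at a time."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # A zero-initialized buffer plays the role of the left padding.
        self.buffer = [0.0] * (len(self.kernel) - 1)

    def step(self, subclip):
        k = len(self.kernel)
        window = self.buffer + list(subclip)
        outputs = [sum(self.kernel[j] * window[t + j] for j in range(k))
                   for t in range(len(subclip))]
        # Keep only the last k - 1 frames; older frames are discarded.
        self.buffer = window[len(window) - (k - 1):]
        return outputs


class CausalAveragePool:
    """Running mean over all frames seen so far (constant state)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def step(self, subclip):
        outputs = []
        for value in subclip:
            self.total += value
            self.count += 1
            outputs.append(self.total / self.count)
        return outputs


frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [0.5, -1.0, 2.0]

full = causal_conv_full(frames, kernel)
stream = CausalConvStream(kernel)
streamed = (stream.step(frames[:3]) + stream.step(frames[3:5])
            + stream.step(frames[5:]))
print(full == streamed, len(stream.buffer))
```

The state carried between steps has a fixed size regardless of how many frames have been processed, which is the property that makes peak memory independent of video length.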
4. Temporal Regularization via Gaussian Random Walk (GRW)
MoViNet models benefit from temporal smoothness imposed by Gaussian Random Walk (GRW) regularization (Goldman et al., 25 Nov 2025). The GRW technique works with per-frame embeddings, their velocities (differences between consecutive embeddings), and their accelerations (differences between consecutive velocities) across subclips. The regularization term models accelerations as i.i.d. Gaussian draws, penalizing abrupt changes in the representation.
For a given subclip, the GRW-based objective combines:
- Frame-order contrastive loss, comparing likelihood of true acceleration sequence versus randomly permuted frames.
- An unconditional speed prior on velocity magnitudes.
The smoothness loss is added to the supervised cross-entropy loss with a balancing weight, inducing a strong temporal inductive bias toward low-acceleration solutions. Empirical ablations show that the method is robust to variation in its hyperparameters, and that applying the regularizer at the final embedding layer, before a 2-layer Transformer head, is most effective.
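As a rough illustration of the quantities involved, the sketch below computes velocities and accelerations as first and second differences of a sequence of embeddings, scores a sequence with a Gaussian negative log-likelihood over its accelerations (up to an additive constant), and contrasts the true frame order against permuted orders with a softmax cross-entropy. The function names, the fixed `sigma`, and the exact form of the contrastive term are assumptions for illustration; the speed prior is omitted, and this is not claimed to be the loss used by Goldman et al.

```python
import math


def differences(sequence):
    """Element-wise differences between consecutive vectors."""
    return [[b - a for a, b in zip(x, y)]
            for x, y in zip(sequence, sequence[1:])]


def acceleration_nll(embeddings, sigma=1.0):
    """Gaussian negative log-likelihood of accelerations, constants dropped."""
    velocities = differences(embeddings)
    accelerations = differences(velocities)
    squared = sum(c * c for acc in accelerations for c in acc)
    return squared / (2.0 * sigma ** 2)


def order_contrastive_loss(embeddings, permutations, sigma=1.0):
    """Cross-entropy of picking the true order among permuted alternatives."""
    true_score = -acceleration_nll(embeddings, sigma)
    scores = [true_score]
    for perm in permutations:
        shuffled = [embeddings[i] for i in perm]
        scores.append(-acceleration_nll(shuffled, sigma))
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - true_score


# A straight-line trajectory has zero acceleration in its true order.
smooth = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
perm = [2, 0, 3, 1]
shuffled = [smooth[i] for i in perm]

print(acceleration_nll(smooth), acceleration_nll(shuffled))
print(order_contrastive_loss(smooth, [perm]))
```

Under these assumptions, a smoothly evolving embedding sequence is scored as far more likely in its true order than in a shuffled one, which is the signal a frame-order contrastive term relies on.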
5. Quantitative Performance and State-of-the-Art Comparisons
Applying GRW smoothing yields substantial accuracy improvements for MoViNet variants at fixed compute or memory budgets. On Kinetics-600, reported Top-1 accuracy with GRW versus the baseline (gain in percentage points) is:
- MoViNet-A0-S: 78.4% vs. 72.3% (+6.1%) @2.7 GFLOPs
- MoViNet-A1-S: 81.9% vs. 76.7% (+5.2%) @6.0 GFLOPs
- MoViNet-A2-S: 83.3% vs. 78.6% (+4.7%) @11.3 GFLOPs
- MoViNet-A3: 85.6% vs. 81.8% (+3.8%) @56.4 GFLOPs
At comparable memory footprints:
- MobileNetV3-S: 67.3% vs. 61.3% (+6.0%) @29 MB
- MoViNet-A0-S: 78.4% vs. 72.0% (+6.4%) @53 MB
- MoViNet-A1-S: 81.9% vs. 76.4% (+5.5%) @67 MB
- MoViNet-A2-S: 83.3% vs. 78.4% (+4.9%) @78 MB
MoViNet-A3 achieves 85.6% Top-1 using only 56.4 GFLOPs; matching that accuracy with the transformer-based MViTv2-B-32×3 requires 1,030 GFLOPs, roughly 18× more. Across datasets including Kinetics-400/700, Moments-in-Time, Charades, and Something-Something V2, MoViNets set new accuracy-efficiency frontiers (Goldman et al., 25 Nov 2025; Kondratyuk et al., 2021).
6. Deployment and Practical Considerations
The constant-memory streaming property of MoViNets supports on-device, real-time video inference. Deployment involves choosing a stream buffer size for each layer (typically tied to that layer's temporal kernel size) and a short inference clip length to keep latency low. Hard-Swish activations and ReZero residual scalars improve convergence and quantization characteristics. On CPU targets, MoViNet-A0-Stream runs at 3.7 ms/frame with higher Top-1 accuracy than previous mobile baselines, suggesting suitability for edge applications.
An ensemble scheme recovers the minor accuracy drop from causal inference: models run in parallel on alternating frames and their logits are averaged for the final prediction, improving Top-1 by 0.3–1.0% without increasing total FLOPs. Together, these mechanisms enable deployment in scenarios with stringent resource limits and temporal streaming constraints (Kondratyuk et al., 2021).
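The alternating-frame ensemble can be sketched in a few lines. The example below assumes two models, one fed the even-indexed frames and one the odd-indexed frames, whose logits are averaged. `toy_model` is a hypothetical stand-in that maps a list of scalar frames to per-class logits; real models would consume feature tensors.

```python
def ensemble_predict(frames, model_even, model_odd):
    """Average the logits of two models that see alternating frames."""
    logits_even = model_even(frames[0::2])  # frames 0, 2, 4, ...
    logits_odd = model_odd(frames[1::2])    # frames 1, 3, 5, ...
    return [(a + b) / 2.0 for a, b in zip(logits_even, logits_odd)]


def toy_model(class_weights):
    """Hypothetical model: one logit per class from the mean frame value."""
    def predict(frames):
        mean = sum(frames) / len(frames)
        return [w * mean for w in class_weights]
    return predict


frames = [1.0, 2.0, 3.0, 4.0]
logits = ensemble_predict(frames,
                          toy_model([1.0, 0.0]),
                          toy_model([0.0, 1.0]))
predicted_class = max(range(len(logits)), key=lambda i: logits[i])
print(logits, predicted_class)
```

Because each frame is processed by exactly one of the two models, the total number of frames processed matches a single model running at the full frame rate, which is consistent with the claim that the ensemble does not raise FLOPs when both models have the same per-frame cost.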
MoViNets embody a comprehensive approach to mobile video recognition, combining NAS-driven efficiency, causal memory models for streaming inference, and robust temporal regularization via GRW. The resulting architectures set benchmark standards in video action recognition for resource-constrained environments (Kondratyuk et al., 2021, Goldman et al., 25 Nov 2025).