
MoViNets: Efficient Mobile Video Recognition

Updated 11 April 2026
  • Mobile Video Networks (MoViNets) are a family of computation- and memory-efficient 3D CNNs designed for online video recognition in resource-constrained environments.
  • They leverage neural architecture search and a novel constant-memory stream buffer to process videos with low latency and reduced memory requirements.
  • Empirical benchmarks, including Kinetics 600, demonstrate that MoViNets achieve superior accuracy per FLOP with significantly lower memory usage compared to traditional 3D CNN architectures.

Mobile Video Networks (MoViNets) are a family of computation- and memory-efficient 3D convolutional neural networks (CNNs) for video recognition, explicitly optimized for online, streaming inference under constrained resources. MoViNets leverage neural architecture search (NAS), a novel constant-memory stream buffer mechanism, and lightweight temporal ensembling to enable state-of-the-art accuracy at significantly lower computational and memory costs than prior 3D CNN architectures, particularly in the context of mobile and edge devices (Kondratyuk et al., 2021).

1. Neural Architecture Search and Block Design

MoViNets consist of seven architectures, denoted A₀ through A₆, generated within a mobile-oriented NAS search space. Each MoViNet is constructed entirely from “inverted bottleneck” depthwise-separable residual modules, extending MobileNetV3’s architecture with a temporal dimension appropriate for video data.

Each MoViNet block comprises sequential operations:

  • 1×1×1 pointwise expansion: Expands channel width.
  • Depthwise 3D convolution: With kernel sizes $k_t \times k_h \times k_w \in \{1\times3\times3,\ 1\times5\times5,\ 1\times7\times7,\ 3\times3\times3,\ 5\times3\times3,\ 5\times1\times1,\ 7\times1\times1\}$.
  • 1×1×1 projection: Reduces channel width.
  • Optional spatiotemporal Squeeze-and-Excitation (SE) block.
  • Hard-Swish activation and residual skip connections, when feasible.
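
To make the cost structure of such a block concrete, the following sketch counts the convolution weights in the expand, depthwise, and project stages. It is an illustration under stated assumptions rather than the reference implementation: the function names and the example channel sizes (24 in, 72 expanded, 24 out) are hypothetical, and biases, normalization layers, and the optional SE block are ignored.

```python
def block_weight_count(c_in, c_exp, c_out, kernel):
    """Convolution weights in expand -> depthwise -> project.

    Assumption: biases, normalization, and SE parameters are not counted.
    """
    kt, kh, kw = kernel
    expand = c_in * c_exp               # 1x1x1 pointwise expansion
    depthwise = c_exp * kt * kh * kw    # one kt x kh x kw filter per channel
    project = c_exp * c_out             # 1x1x1 projection
    return expand + depthwise + project


def dense_weight_count(c_in, c_out, kernel):
    """Weights of an ordinary (non-separable) 3D convolution, for comparison."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw


# Hypothetical sizes chosen only for illustration.
separable = block_weight_count(24, 72, 24, (3, 3, 3))
dense = dense_weight_count(72, 72, (3, 3, 3))
print(separable, dense)
```

With these example sizes, the whole separable block holds far fewer weights than a single dense 3×3×3 convolution at the expanded width, which is the motivation for the depthwise factorization.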

The NAS search space is parameterized by:

  • Number of residual blocks: $n = 5$ (plus stem and head).
  • Block depth: $L_i \in \{1, \ldots, 10\}$.
  • Base channels: $c_i^{\text{base}} \in \{16, 24, 48, 96, 96, 192\}$.
  • Expansion ratio: $r_i \in \{1.5, 2.0, 2.5, 3.0, 3.5, 4.0\}$, with $c_i^{\text{exp}} = r_i \cdot c_i^{\text{base}}$.
  • Frame stride: $\tau \in \{5, 8, 12, \ldots\}$.
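
The search space listed above can be written down as a small configuration and sampled at random. The sketch below is only meant to show how the parameters fit together: the names (`LayerChoice`, `sample_architecture`), the uniform sampling, and the decision to pick one base channel count per block and one kernel and expansion ratio per layer are all assumptions made for illustration. It does not reproduce the search procedure used in the paper, and frame stride is left out.

```python
import random
from dataclasses import dataclass

# Option sets as listed in the text.
KERNELS = [(1, 3, 3), (1, 5, 5), (1, 7, 7), (3, 3, 3), (5, 3, 3), (5, 1, 1), (7, 1, 1)]
BASE_CHANNELS = [16, 24, 48, 96, 96, 192]
EXPANSION_RATIOS = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
NUM_BLOCKS = 5
MAX_DEPTH = 10


@dataclass(frozen=True)
class LayerChoice:
    kernel: tuple
    base_channels: int
    expansion_ratio: float

    @property
    def expanded_channels(self) -> int:
        # c_exp = r * c_base; exact for the option sets above.
        return int(self.expansion_ratio * self.base_channels)


def sample_architecture(rng):
    """Draw one candidate uniformly at random (an illustrative assumption)."""
    blocks = []
    for _ in range(NUM_BLOCKS):
        depth = rng.randint(1, MAX_DEPTH)      # L_i, inclusive bounds
        base = rng.choice(BASE_CHANNELS)       # c_i^base, shared within the block
        layers = [
            LayerChoice(rng.choice(KERNELS), base, rng.choice(EXPANSION_RATIOS))
            for _ in range(depth)
        ]
        blocks.append(layers)
    return blocks


arch = sample_architecture(random.Random(0))
for i, layers in enumerate(arch):
    print(f"block {i}: depth={len(layers)}, first layer={layers[0]}")
```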

Compound scaling, following principles from EfficientNet, increases model capacity via uniform scaling rules across depth ($d(\phi)$), width ($w(\phi)$), resolution ($r(\phi)$), and frame rate, all tied to a single compound coefficient $\phi$ under a joint constraint, generating the A₀–A₆ family. Each unit increment in $\phi$ raises the average FLOPs of the resulting model. This scaling produces models ranging from the lightweight A₀ (3M parameters, 2.7 GFLOPs/video) to the large-mobile A₆ (31M parameters, 386 GFLOPs/video) (Kondratyuk et al., 2021).

2. Stream Buffer for Constant-Memory Streaming

Conventional 3D CNNs for video recognition exhibit memory complexity $O(T)$ due to the need for simultaneous access to all $T$ video frames. Standard “multi-clip” baselines partition long videos into subclips of length $T^{\text{clip}}$, but require recomputation and lose long-range context.

MoViNets employ a “Stream Buffer” mechanism that maintains only the last $b$ frames of layer activations between subclips, where $b$ is the maximal temporal kernel size minus one (for example, $b = 2$ for $k_t = 3$). All temporal operations are causal: the output at time $t$ depends only on input frames up to $t$, allowing for true online inference.

Formally, for the $i$-th subclip $x_i$ and buffer $B_i$, the per-layer recurrence is: $F_i = f(B_i \oplus x_i)$, $B_{i+1} = (B_i \oplus x_i)_{[-b:]}$, where “$\oplus$” denotes temporal concatenation and $(\cdot)_{[-b:]}$ retains only the final $b$ frames.

Peak memory thus becomes

$O(b + T^{\text{clip}})$

For streaming inference with $T^{\text{clip}} = 1$, MoViNet-Stream achieves per-frame operation with constant memory and sub-10 ms/frame latency on mobile CPUs (Kondratyuk et al., 2021).
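
The buffer mechanism can be illustrated with a toy causal temporal convolution. In the sketch below each "frame" is a single number standing in for a full feature map, and the helper names `causal_conv` and `stream` are hypothetical. Under those assumptions, it shows that carrying over only the last few inputs between subclips reproduces the output of processing the whole sequence at once, whatever the subclip length.

```python
def causal_conv(frames, weights, buffer=None):
    """Causal temporal convolution over a list of scalar frames.

    `buffer` holds the last b = len(weights) - 1 inputs of the previous
    subclip; at the start of a stream it is zero (past-side padding).
    Returns (outputs, new_buffer).
    """
    b = len(weights) - 1
    if buffer is None:
        buffer = [0.0] * b
    padded = list(buffer) + list(frames)       # buffer concatenated with subclip
    outputs = [
        sum(w * padded[t + j] for j, w in enumerate(weights))
        for t in range(len(frames))            # output t sees inputs t-b .. t only
    ]
    new_buffer = padded[len(padded) - b:]      # keep only the final b frames
    return outputs, new_buffer


def stream(video, weights, clip_len):
    """Process a video subclip by subclip, carrying only the buffer across."""
    buffer, outputs = None, []
    for start in range(0, len(video), clip_len):
        out, buffer = causal_conv(video[start:start + clip_len], weights, buffer)
        outputs.extend(out)
    return outputs


video = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
weights = [1.0, 2.0, 3.0]                      # temporal kernel size 3, so b = 2

full, _ = causal_conv(video, weights)          # whole video held in memory
print(full == stream(video, weights, clip_len=1))   # frame-by-frame streaming
print(full == stream(video, weights, clip_len=3))   # three-frame subclips
```

At any moment the streaming version holds the buffer plus the current subclip, so its working set does not grow with the length of the video.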

3. Causal Operations for Online Recognition

The stream buffer requires causality throughout the network. Causal convolutions are implemented with past-side padding, ensuring that outputs at time $t$ depend exclusively on earlier or concurrent inputs. Additional causal mechanisms include:

  • Cumulative Global Average Pooling (CGAP):

$\text{CGAP}(x, T') = \frac{1}{T'} \sum_{t=1}^{T'} x_t$

maintained incrementally with a 1-frame buffer.

  • Causal SE gating: SE modules apply recalibration weights using CGAP outputs and fixed sine-cosine position encodings—mitigating instability in running statistics.

These operations collectively enable MoViNet models to process arbitrary-length video streams for both training and inference, supporting latency-sensitive online applications.
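
Cumulative global average pooling amounts to a running mean over the frames seen so far, which needs only a running sum and a frame count as state. The sketch below is a minimal illustration under that reading: the class name `CumulativeAverage` is hypothetical, and each frame feature is a single number standing in for a spatially pooled channel vector.

```python
class CumulativeAverage:
    """Running mean over all frames seen so far (a causal pooling)."""

    def __init__(self):
        self.total = 0.0   # running sum, the only state carried between frames
        self.count = 0     # number of frames seen so far

    def update(self, frame_feature):
        """Consume one frame feature and return the mean of frames 1..count."""
        self.total += frame_feature
        self.count += 1
        return self.total / self.count


pool = CumulativeAverage()
features = [2.0, 4.0, 6.0, 8.0]
running = [pool.update(f) for f in features]
print(running)
```

Each output depends only on the current and earlier frames, so the pooled value is available at every time step without revisiting past frames.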

4. Lightweight Temporal Ensembling

Causality in streaming operation induces a roughly 1% top-1 accuracy drop relative to non-causal, offline-trained baselines. MoViNets recover this lost accuracy via a lightweight two-model temporal ensembling technique that does not increase computational cost.

Given video frames $x_1, \ldots, x_T$, two models each process a strided subsequence of the frames, one starting at the first frame and the other offset by one frame. The resulting logits are averaged:

  • Model 1: logits $z_1$ from the strided sequence,
  • Model 2: logits $z_2$ from the sequence with the same stride, offset one frame,
  • Ensemble logits: $z = \tfrac{1}{2}(z_1 + z_2)$,
  • Final prediction: softmax over ensemble logits.

This ensemble matches the FLOPs of the original single-model configuration while restoring and slightly exceeding non-causal accuracy (Kondratyuk et al., 2021).
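
The ensembling step can be sketched in a few lines. The code below assumes a stride of 2, so that each model sees every other frame and the two together process as many frames as a single full-rate model would. That stride, the function names, and the toy stand-in models are all assumptions made for illustration, since a real model would be a trained network returning class logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_frames(frames):
    """Assumed stride-2 split: even-indexed frames and the one-frame offset."""
    return frames[0::2], frames[1::2]


def ensemble_predict(model_a, model_b, frames):
    """Average the logits of two models fed offset sequences, then softmax."""
    frames_a, frames_b = split_frames(frames)
    logits_a = model_a(frames_a)
    logits_b = model_b(frames_b)
    averaged = [(a + b) / 2 for a, b in zip(logits_a, logits_b)]
    return softmax(averaged)


# Toy stand-ins for trained models (hypothetical): two-class logits.
def toy_model_a(frames):
    return [sum(frames) / len(frames), 0.0]


def toy_model_b(frames):
    return [0.0, sum(frames) / len(frames)]


probs = ensemble_predict(toy_model_a, toy_model_b, [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
print(probs)
```

Averaging happens on the logits, before the softmax, matching the order of operations in the list above.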

5. Empirical Results and Benchmark Comparisons

MoViNets achieve leading results on major video action recognition benchmarks using only RGB input and models trained from scratch.

Kinetics 600

| Model | Top-1 | GFLOPs/video | Memory (MB) | Params |
|---|---|---|---|---|
| X3D-XL (single-clip) | 80.3% | 145 | 490 | 11.0 M |
| MoViNet-A5 | 82.7% | 281 | 2040 | 15.7 M |
| MoViNet-A5-Stream | 82.0% | 282 | 171 | 15.7 M |
| MoViNet-A5-Stream + Ens ×2 | 82.9% | 282 | 171 | 31.4 M |

  • At 282 GFLOPs, “MoViNet-A5-Stream” achieves comparable or greater accuracy than X3D-XL while using ≈65% less peak memory.
  • Ensembling recovers and surpasses non-causal baseline accuracy without additional computational budget.
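
The memory comparison can be checked directly from the peak-memory figures reported for Kinetics 600 (490 MB for X3D-XL, 2040 MB for MoViNet-A5, 171 MB for MoViNet-A5-Stream). The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is smaller than `baseline`."""
    return 1.0 - value / baseline


# Peak memory in MB, as reported for Kinetics 600.
x3d_xl, a5, a5_stream = 490, 2040, 171

print(f"vs X3D-XL:           {relative_reduction(x3d_xl, a5_stream):.1%}")
print(f"vs non-streaming A5: {relative_reduction(a5, a5_stream):.1%}")
```

The first figure corresponds to the ≈65% reduction quoted above. The second shows that most of the saving comes from streaming itself, since the same A5 architecture without the stream buffer needs 2040 MB.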

Performance Across Scales (Kinetics 600)

| Model | GFLOPs | Top-1 | FLOPs vs. X3D-XL | Accuracy vs. X3D-XL |
|---|---|---|---|---|
| A₂ | 10.3 | 77.5% | –94% | +1.6% |
| A₄ | 105 | 81.2% | –86% | +0.5% |
| A₆ | 386 | 83.5% | –73% | +2.4% |

Additional Benchmarks

  • Moments in Time (339 classes, 3s video): MoViNet-A₅ achieves 39.9% top-1 (vs. AssembleNet-101 [RGB+Flow] at 34.3%) with 60% fewer FLOPs. MoViNet-A₀, at 2.7 GFLOPs, attains 27.5% top-1 (+4% over TinyVideoNet).
  • Charades (multi-label, mAP): A₆ yields 63.2% mAP (vs. AssembleNet++ [RGB+Flow+Seg] at 59.8%) using 75% fewer FLOPs.

Empirical analyses confirm that, across the spectrum A₀–A₆, each MoViNet variant offers superior accuracy per FLOP with much lower memory usage compared to standard 2D (TSM) and 3D (X3D, SlowFast) baselines. Memory usage remains flat (≈100 MB) as video length grows, unlike the linear growth observed in conventional 3D CNNs (Kondratyuk et al., 2021).

6. Availability and Practical Significance

MoViNets are publicly available with code and pretrained checkpoints at https://github.com/tensorflow/models/tree/master/official/vision. The designs cover a range from extremely lightweight (few million parameters, suitable for on-device inference) to large models that push the Pareto frontier on accuracy versus efficiency for video understanding tasks.

MoViNets demonstrate:

  • Up to +6% top-1 on Kinetics 600 compared to MobileNetV3+TSM at matched FLOPs.
  • –80% FLOPs and –65% memory versus X3D-XL at matched accuracy.
  • Robust transfer across benchmarks without using additional modalities (e.g., flow, segmentation).

This suggests that the MoViNet approach—integrating NAS-designed building blocks, a constant-memory streaming mechanism, and efficient ensembling—represents a substantial advance for resource-constrained video recognition tasks (Kondratyuk et al., 2021).
