MoViNets: Efficient Mobile Video Recognition

Updated 11 April 2026

Mobile Video Networks (MoViNets) are a family of computation- and memory-efficient 3D CNNs designed for online video recognition in resource-constrained environments.
They leverage neural architecture search and a novel constant-memory stream buffer to process videos with low latency and reduced memory requirements.
Empirical benchmarks, including Kinetics 600, demonstrate that MoViNets achieve superior accuracy per FLOP with significantly lower memory usage compared to traditional 3D CNN architectures.

Mobile Video Networks (MoViNets) are a family of computation- and memory-efficient 3D convolutional neural networks (CNNs) for video recognition, explicitly optimized for online, streaming inference under constrained resources. MoViNets leverage neural architecture search (NAS), a novel constant-memory stream buffer mechanism, and lightweight temporal ensembling to enable state-of-the-art accuracy at significantly lower computational and memory costs than prior 3D CNN architectures, particularly in the context of mobile and edge devices (Kondratyuk et al., 2021).

1. Neural Architecture Search and Block Design

MoViNets consist of seven architectures, denoted A₀ through A₆, generated within a mobile-oriented NAS search space. Each MoViNet is constructed entirely from “inverted bottleneck” depthwise-separable residual modules, extending MobileNetV3’s architecture with a temporal dimension appropriate for video data.

Each MoViNet block comprises sequential operations:

1×1×1 pointwise expansion: Expands channel width.
Depthwise 3D convolution: With kernel sizes $k_t \times k_h \times k_w \in \{1{\times}3{\times}3,\ 1{\times}5{\times}5,\ 1{\times}7{\times}7,\ 3{\times}3{\times}3,\ 5{\times}3{\times}3,\ 5{\times}1{\times}1,\ 7{\times}1{\times}1\}$.
1×1×1 projection: Reduces channel width.
Optional spatiotemporal Squeeze-and-Excitation (SE) block.
Hard-Swish activation and residual skip connections, when feasible.
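To make the cost of this block structure concrete, the following minimal sketch counts convolution weights for one expand, depthwise, project sequence and compares it with a dense 3D convolution of the same kernel. The function names and the example widths are illustrative assumptions, not taken from the MoViNet code; biases, normalization, and the optional SE branch are ignored.

```python
from math import prod


def inverted_bottleneck_params(c_in, c_out, expansion, kernel):
    """Count conv weights in one expand -> depthwise -> project block.

    Hypothetical helper: ignores biases, normalization, and the SE branch.
    """
    c_exp = int(round(expansion * c_in))   # width after the 1x1x1 expansion
    expand = c_in * c_exp                  # 1x1x1 pointwise expansion
    depthwise = prod(kernel) * c_exp       # one k_t*k_h*k_w filter per channel
    project = c_exp * c_out                # 1x1x1 projection
    return expand + depthwise + project


def full_conv3d_params(c_in, c_out, kernel):
    """Weights of a dense 3D convolution with the same kernel, for comparison."""
    return prod(kernel) * c_in * c_out


block = inverted_bottleneck_params(c_in=24, c_out=24, expansion=2.5, kernel=(3, 3, 3))
dense = full_conv3d_params(c_in=24, c_out=24, kernel=(3, 3, 3))
print(block, dense)  # prints "4500 15552"
```

In this toy setting the depthwise-separable block needs fewer than a third of the weights of the dense convolution, which is the kind of saving the block design relies on.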

The NAS search space is parameterized by:

Number of residual blocks: $n = 5$ (plus stem and head).
Block depth: $L_i \in \{1,\ldots,10\}$.
Base channels: $c_i^{\text{base}} \in \{16,24,48,96,96,192\}$.
Expansion ratio: $r_i \in \{1.5, 2.0, 2.5, 3.0, 3.5, 4.0\}$, with $c_i^{\text{exp}} = r_i \cdot c_i^{\text{base}}$.
Frame stride: $\tau \in \{5,8,12,\ldots\}$.
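The parameterization above can be sketched as a small data structure from which one candidate configuration is drawn. This only illustrates the shape of the search space: uniform random sampling stands in for the actual search procedure, `BlockChoice` and `sample_candidate` are hypothetical names, the base width is treated as a free choice from the listed set, and the frame-stride list contains only the values spelled out in the text.

```python
import random
from dataclasses import dataclass

# Choice sets copied from the text above.
KERNELS = [(1, 3, 3), (1, 5, 5), (1, 7, 7), (3, 3, 3), (5, 3, 3), (5, 1, 1), (7, 1, 1)]
BASE_CHANNELS = [16, 24, 48, 96, 96, 192]
EXPANSION_RATIOS = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
FRAME_STRIDES = [5, 8, 12]  # only the values spelled out; the listed set continues
NUM_BLOCKS = 5
MAX_DEPTH = 10


@dataclass
class BlockChoice:
    depth: int          # L_i
    base_channels: int  # c_i^base
    expansion: float    # r_i
    kernel: tuple       # (k_t, k_h, k_w)

    @property
    def expanded_channels(self) -> float:
        # c_i^exp = r_i * c_i^base
        return self.expansion * self.base_channels


def sample_candidate(rng: random.Random) -> dict:
    """Draw one candidate uniformly at random (illustration only, not the NAS)."""
    blocks = [
        BlockChoice(
            depth=rng.randint(1, MAX_DEPTH),
            base_channels=rng.choice(BASE_CHANNELS),
            expansion=rng.choice(EXPANSION_RATIOS),
            kernel=rng.choice(KERNELS),
        )
        for _ in range(NUM_BLOCKS)
    ]
    return {"frame_stride": rng.choice(FRAME_STRIDES), "blocks": blocks}


candidate = sample_candidate(random.Random(0))
for block in candidate["blocks"]:
    print(block, block.expanded_channels)
```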

Compound scaling, following principles from EfficientNet, increases model capacity by jointly scaling depth ($d(\phi)$), width ($w(\phi)$), input resolution ($r(\phi)$), and frame rate as functions of a single coefficient $\phi$, subject to a constraint on their combined growth. This generates the A₀–A₆ family. Each unit increment in $\phi$ increases average FLOPs by a roughly constant factor. The resulting models range from the lightweight A₀ (3M parameters, 2.7 GFLOPs/video) to the large-mobile A₆ (31M parameters, 386 GFLOPs/video) (Kondratyuk et al., 2021).
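As a quick sanity check on the scale of the family, the average per-step FLOPs growth can be estimated from the two endpoint figures quoted above alone. This is a back-of-the-envelope geometric mean, assuming six equal multiplicative steps from A₀ to A₆; the intermediate models need not follow it exactly.

```python
flops_a0 = 2.7    # GFLOPs/video for A0, as quoted above
flops_a6 = 386.0  # GFLOPs/video for A6, as quoted above
steps = 6         # A0 -> A6 spans six unit increments

# Geometric mean of the per-step multiplier.
growth_per_step = (flops_a6 / flops_a0) ** (1 / steps)
print(f"average FLOPs growth per increment: {growth_per_step:.2f}x")
```

Under these assumptions the endpoints imply an average growth of roughly 2.3× per increment.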

2. Stream Buffer for Constant-Memory Streaming

Conventional 3D CNNs for video recognition exhibit memory complexity $O(T)$ due to the need for simultaneous access to $T$ video frames. Standard “multi-clip” baselines partition long videos into subclips of length $T^{\text{clip}}$, but require recomputation and lose long-range context.

MoViNets employ a “Stream Buffer” mechanism that maintains only the last $b$ frames of layer activations between subclips, where $b$ is the maximal temporal kernel size minus one (for example, $b = 2$ for $k_t = 3$). All temporal operations are causal: the output at time $t$ depends only on input frames up to $t$, which allows true online inference.

Formally, for the $i$-th subclip $x_i$ and buffer $B_i$, the per-layer recurrence is:

$F_i = f(B_i \oplus x_i), \qquad B_{i+1} = (B_i \oplus x_i)_{[-b:]}$

where “$\oplus$” denotes temporal concatenation and the subscript $[-b:]$ retains only the final $b$ frames.

Peak memory thus becomes

$O(b + T^{\text{clip}})$

that is, the buffer plus one subclip, independent of the total video length $T$.

For streaming inference with $T^{\text{clip}} = 1$, MoViNet-Stream achieves per-frame operation with constant memory and sub-10 ms/frame latency on mobile CPUs (Kondratyuk et al., 2021).
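The buffer recurrence can be illustrated with a toy one-dimensional example in which each "frame" is a single number and the layer is a causal temporal convolution. This is a minimal sketch under those simplifying assumptions, with hypothetical function names; it is not the MoViNet implementation. It checks that carrying the last few frames between subclips reproduces the full-sequence output while never holding more than the buffer plus one subclip.

```python
def causal_conv(frames, kernel, history=None):
    """Causal temporal convolution over a list of scalar frames.

    `history` holds the b = len(kernel) - 1 frames that precede `frames`;
    zeros are used when there is no history (past-side padding).
    Returns the outputs and the concatenated (history + frames) list.
    """
    b = len(kernel) - 1
    joined = (history if history is not None else [0.0] * b) + frames
    out = [sum(w * joined[t + j] for j, w in enumerate(kernel))
           for t in range(len(frames))]
    return out, joined


def stream(frames, kernel, clip_len):
    """Process `frames` subclip by subclip, carrying only a b-frame buffer."""
    b = len(kernel) - 1
    buffer = [0.0] * b
    outputs, peak_frames = [], 0
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        out, joined = causal_conv(clip, kernel, history=buffer)
        peak_frames = max(peak_frames, len(joined))  # frames held at once
        buffer = joined[len(joined) - b:]            # keep the final b frames
        outputs.extend(out)
    return outputs, peak_frames


frames = [float(i * i % 7) for i in range(20)]
kernel = [0.25, 0.5, 0.25]                           # temporal kernel size 3, so b = 2

offline, _ = causal_conv(frames, kernel)             # whole video at once
per_frame, peak_1 = stream(frames, kernel, clip_len=1)
per_clip, peak_4 = stream(frames, kernel, clip_len=4)
print(per_frame == offline, per_clip == offline, peak_1, peak_4)
```

With a kernel of size 3 the streamed runs hold at most 3 and 6 frames respectively, matching the buffer size plus the subclip length, whereas the offline call touches all 20 frames plus padding.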

3. Causal Operations for Online Recognition

The stream buffer requires causality throughout the network. Causal convolutions are implemented with past-side padding, ensuring that outputs at time $t$ depend exclusively on earlier or concurrent inputs. Additional causal mechanisms include:

Cumulative Global Average Pooling (CGAP):

$\text{CGAP}(x, T') = \frac{1}{T'} \sum_{t=1}^{T'} x_t$

computed incrementally by keeping the cumulative sum in a 1-frame buffer.

Causal SE gating: SE modules compute their recalibration weights from CGAP outputs combined with fixed sine-cosine position encodings, which mitigates instability in the running statistics.

These operations collectively enable MoViNet models to process arbitrary-length video streams for both training and inference, supporting latency-sensitive online applications.
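A cumulative average of this kind can be maintained with constant state, as the following minimal sketch shows. It assumes each frame has already been reduced to one scalar (for instance a spatially pooled channel activation); the class and function names are hypothetical.

```python
class StreamingCGAP:
    """Running mean over all frames seen so far, using O(1) state."""

    def __init__(self):
        self.total = 0.0  # cumulative sum of frame values
        self.count = 0    # number of frames seen

    def update(self, frame_value):
        self.total += frame_value
        self.count += 1
        return self.total / self.count


def cgap_offline(values):
    """Reference: recompute the mean of every prefix from scratch."""
    return [sum(values[: t + 1]) / (t + 1) for t in range(len(values))]


values = [2.0, 4.0, 6.0, 8.0]
pool = StreamingCGAP()
streamed = [pool.update(v) for v in values]
print(streamed)  # prints "[2.0, 3.0, 4.0, 5.0]"
```

The streamed values match the prefix means computed offline, and each output depends only on frames up to the current one, which is what causality requires.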

4. Lightweight Temporal Ensembling

Causality in streaming operation induces a roughly 1% top-1 accuracy drop relative to non-causal, offline-trained baselines. MoViNets recover this lost accuracy via a lightweight two-model temporal ensembling technique that does not increase computational cost.

Given video frames $x = (x_1, \ldots, x_T)$ sampled at the single-model frame stride $\tau$, two models each run at the doubled stride $2\tau$: one processes the original sequence and the other a sequence offset by one frame. The resulting logits are averaged:

Model 1: $z^{(1)} = f_1(x_1, x_3, x_5, \ldots)$ (stride $2\tau$),
Model 2: $z^{(2)} = f_2(x_2, x_4, x_6, \ldots)$ (stride $2\tau$, offset one frame),
Ensemble logits: $z = \tfrac{1}{2}\left(z^{(1)} + z^{(2)}\right)$,
Final prediction: softmax over ensemble logits.

This ensemble matches the FLOPs of the original single-model configuration while restoring and slightly exceeding non-causal accuracy (Kondratyuk et al., 2021).
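The ensembling procedure can be sketched as follows, under the assumption that each of the two models sees every second frame, one starting at the first frame and the other offset by one. The two toy models are hypothetical stand-ins that return two-class logits; only the frame splitting, logit averaging, and softmax reflect the procedure described above.

```python
import math


def softmax(logits):
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(frames, model_a, model_b):
    """Average the logits of two half-frame-rate models, then apply softmax."""
    even = frames[0::2]                    # model A: every second frame
    odd = frames[1::2]                     # model B: same stride, offset by one frame
    logits_a = model_a(even)
    logits_b = model_b(odd)
    mean_logits = [(a + b) / 2 for a, b in zip(logits_a, logits_b)]
    return softmax(mean_logits), len(even) + len(odd)


def toy_model_a(clip):
    # Hypothetical stand-in for a trained network.
    return [sum(clip) / len(clip), 0.0]


def toy_model_b(clip):
    # Hypothetical stand-in for a second, separately trained network.
    return [0.0, sum(clip) / len(clip)]


frames = [float(i) for i in range(8)]
probs, frames_processed = ensemble_predict(frames, toy_model_a, toy_model_b)
print(probs, frames_processed)
```

Together the two models process the same number of frames as a single model running on the full sequence, which is why the ensemble does not raise the FLOPs count even though it doubles the number of parameters.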

5. Empirical Results and Benchmark Comparisons

MoViNets achieve leading results on major video action recognition benchmarks using only RGB input and models trained from scratch.

Kinetics 600

Model	Top-1	FLOPs/video	Memory (MB)	Params
X3D-XL (single-clip)	80.3%	145 GF	490	11.0 M
MoViNet-A5	82.7%	281 GF	2040	15.7 M
MoViNet-A5-Stream	82.0%	282 GF	171	15.7 M
MoViNet-A5-Stream + Ens ×2	82.9%	282 GF	171	31.4 M

At 282 GFLOPs, “MoViNet-A5-Stream” reaches 82.0% top-1 against 80.3% for single-clip X3D-XL, while using ≈65% less peak memory (171 MB vs. 490 MB).
Ensembling recovers and surpasses the non-causal baseline accuracy (82.9% vs. 82.7%) at the same 282 GFLOPs, although the parameter count doubles to 31.4 M.
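The percentages in these comparisons follow directly from the table entries, as this short arithmetic check shows. The variable names are illustrative; all numbers are copied from the Kinetics 600 table above.

```python
x3d_xl_memory_mb = 490      # X3D-XL (single-clip)
a5_memory_mb = 2040         # MoViNet-A5 (non-streaming)
a5_stream_memory_mb = 171   # MoViNet-A5-Stream
a5_params_m = 15.7          # MoViNet-A5 parameters, in millions

# Relative peak-memory savings of the streaming model.
reduction_vs_x3d = 1 - a5_stream_memory_mb / x3d_xl_memory_mb
reduction_vs_a5 = 1 - a5_stream_memory_mb / a5_memory_mb

# Two ensemble members double the parameter count.
ensemble_params_m = 2 * a5_params_m

print(f"memory vs X3D-XL: -{reduction_vs_x3d:.1%}")
print(f"memory vs non-streaming A5: -{reduction_vs_a5:.1%}")
print(f"ensemble parameters: {ensemble_params_m:.1f} M")
```

The streaming model's saving is even larger against its own non-streaming counterpart (about 92%) than against X3D-XL (about 65%).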

Performance Across Scales (Kinetics 600)

Model	FLOPs (GF/video)	Top-1	vs X3D-XL FLOPs	vs X3D-XL Acc
A₂	10.3	77.5%	–94%	+1.6%
A₄	105	81.2%	–86%	+0.5%
A₆	386	83.5%	–73%	+2.4%

Additional Benchmarks

Moments in Time (339 classes, 3s video): MoViNet-A₅ achieves 39.9% top-1 (vs. AssembleNet-101 [RGB+Flow] at 34.3%) with 60% fewer FLOPs. MoViNet-A₀, at 2.7 GFLOPs, attains 27.5% top-1 (+4% over TinyVideoNet).
Charades (multi-label, mAP): A₆ yields 63.2% mAP (vs. AssembleNet++ [RGB+Flow+Seg] at 59.8%) using 75% fewer FLOPs.

Empirical analyses confirm that, across the spectrum A₀–A₆, each MoViNet variant offers superior accuracy per FLOP with much lower memory usage compared to standard 2D (TSM) and 3D (X3D, SlowFast) baselines. Memory usage remains flat (≈100 MB) as the video length increases, unlike the linear growth observed in conventional 3D CNNs (Kondratyuk et al., 2021).

6. Availability and Practical Significance

MoViNets are publicly available with code and pretrained checkpoints at https://github.com/tensorflow/models/tree/master/official/vision. The designs cover a range from extremely lightweight (few million parameters, suitable for on-device inference) to large models that push the Pareto frontier on accuracy versus efficiency for video understanding tasks.

MoViNets demonstrate:

Up to +6% top-1 on Kinetics 600 compared to MobileNetV3+TSM at matched FLOPs.
–80% FLOPs and –65% memory versus X3D-XL at matched accuracy.
Robust transfer across benchmarks without using additional modalities (e.g., flow, segmentation).

This suggests that the MoViNet approach—integrating NAS-designed building blocks, a constant-memory streaming mechanism, and efficient ensembling—represents a substantial advance for resource-constrained video recognition tasks (Kondratyuk et al., 2021).

References

Kondratyuk et al. (2021). MoViNets: Mobile Video Networks for Efficient Video Recognition.
