
TransNet V2: Real-Time Shot Detection

Updated 12 February 2026
  • The paper presents TransNet V2, a deep 3D-CNN architecture with dual supervised heads that achieves state-of-the-art shot transition detection.
  • Methodology integrates dilated, factorized 3D convolutions with a frame-similarity branch to capture extensive temporal context efficiently.
  • The model delivers real-time inference performance and reproducible results with open-source implementation for scalable video analytics.

TransNet V2 is a deep neural architecture designed for efficient and accurate shot transition detection within video streams. It employs a stack of dilated, factorized 3D convolutional networks augmented with temporal frame-similarity features, culminating in dual supervised output heads. The model attains state-of-the-art performance on established benchmarks while maintaining real-time inference capability. TransNet V2's implementation, pre-trained weights, and evaluation scripts are publicly available to promote reproducibility and integration into large-scale video analysis pipelines (Souček et al., 2020).

1. Architecture and Layer Composition

TransNet V2's core is a six-cell "DDCNN V2" stack, where each cell maps tensors of shape $(T, H, W, C) \rightarrow (T, H, W, F)$, with $T$ the temporal dimension, $H \times W$ the spatial size, $C$ the input channels, and $F$ the cell's output channels (increasing from 32 to 128). Each cell uses four parallel $3 \times 3 \times 3$ 3D convolutions with temporal dilation rates $d \in \{1, 2, 4, 8\}$; kernel factorization splits each into a $3 \times 3$ spatial convolution followed by a separate dilated 1D temporal convolution. Convolutions are followed by batch normalization and ReLU activations; the four branches are then summed and passed through an additional ReLU, and in cells 2, 4, and 6 a skip connection from the input tensor is added (post-sum, pre-ReLU).

Following cells 2, 4, and 6, spatial average pooling halves $H$ and $W$ while keeping $T$ fixed. This configuration enables the network's receptive field to span $\pm 48$ frames by the final cell, providing substantial temporal context for transition detection.
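
The $\pm 48$ figure follows directly from the dilation schedule; a back-of-envelope check (assuming, as described above, that the widest branch per cell dominates the temporal reach and that stacked receptive fields add):

```python
# Temporal receptive field of the DDCNN V2 stack.
# Each cell's widest branch is a kernel of size 3 with temporal dilation 8,
# reaching dilation * (kernel_size - 1) / 2 = 8 frames to each side;
# six stacked cells therefore reach 6 * 8 = 48 frames to each side.
KERNEL_SIZE = 3
MAX_DILATION = 8
NUM_CELLS = 6

reach_per_cell = MAX_DILATION * (KERNEL_SIZE - 1) // 2  # 8 frames each side
total_reach = NUM_CELLS * reach_per_cell                # 48 frames each side
print(f"receptive field: +/-{total_reach} frames")      # +/-48 frames
```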

A dedicated frame-similarity branch operates at each pooling stage: global spatial average pooling produces per-frame vectors $v_t \in \mathbb{R}^F$, which a dense layer projects to $s_t \in \mathbb{R}^{64}$; concatenation with a 512-bin RGB histogram $h_t$ yields a 576-dimensional descriptor per frame. For each frame $t$, cosine similarity with frames $t-50$ to $t-1$ and $t+1$ to $t+50$ is computed, forming a similarity vector $\sigma_t \in \mathbb{R}^{100}$ that passes through a shallow MLP and is fused with the 3D-CNN features.
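
The similarity computation can be sketched in numpy as follows (border handling is an assumption here — out-of-range neighbours are zero-padded, which the paper does not spell out):

```python
import numpy as np

def similarity_vectors(descriptors, context=50):
    """Cosine similarity of each frame's descriptor with its `context`
    preceding and `context` following frames (a sketch of the
    frame-similarity branch; zero-padded borders are an assumption)."""
    T, D = descriptors.shape
    norm = descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True) + 1e-8)
    padded = np.zeros((T + 2 * context, D), dtype=norm.dtype)
    padded[context:context + T] = norm
    sims = np.empty((T, 2 * context), dtype=norm.dtype)
    for t in range(T):
        window = padded[t:t + 2 * context + 1]      # frames t-context .. t+context
        s = window @ norm[t]                        # cosine similarities
        sims[t] = np.concatenate([s[:context], s[context + 1:]])  # drop self-sim
    return sims

# 100 frames of 576-dim descriptors -> one 100-dim similarity vector per frame
desc = np.random.default_rng(0).normal(size=(100, 576)).astype(np.float32)
sigma = similarity_vectors(desc)
print(sigma.shape)  # (100, 100)
```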

At the output, two classification heads operate:

  • Single-frame head: a $1 \times 1 \times 1$ convolution followed by a sigmoid yields $p^\text{single}_t$, the probability that frame $t$ is the center frame of a transition.
  • All-frame head: a $1 \times 1 \times 1$ convolution followed by a sigmoid outputs $p^\text{all}_t$, the probability that frame $t$ lies within a transition. Only $p^\text{single}$ is used at inference.

The 3D convolution operation is defined as:

$$O[t, x, y, c'] = \sum_{\delta_t, \delta_x, \delta_y, c} K[\delta_t, \delta_x, \delta_y, c, c'] \cdot I[t + s_t \delta_t,\; x + s_s \delta_x,\; y + s_s \delta_y,\; c]$$

where $s_t$ and $s_s$ are the temporal and spatial strides (both set to 1), and the temporal offset $\delta_t$ is dilated according to the four branch dilation rates.
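
A direct, unoptimised evaluation of this formula (a reference sketch with strides fixed to 1, not the production kernel):

```python
import numpy as np

def dilated_conv3d(I, K, d_t=1):
    """3D convolution per the formula above, with temporal dilation d_t,
    unit strides, and 'valid' output region only.
    I: (T, H, W, C) input, K: (kt, kx, ky, C, C') kernel."""
    T, H, W, C = I.shape
    kt, kx, ky, _, Cp = K.shape
    To = T - d_t * (kt - 1)
    Ho, Wo = H - (kx - 1), W - (ky - 1)
    O = np.zeros((To, Ho, Wo, Cp), dtype=I.dtype)
    for dt in range(kt):
        for dx in range(kx):
            for dy in range(ky):
                patch = I[dt * d_t:dt * d_t + To, dx:dx + Ho, dy:dy + Wo]
                O += patch @ K[dt, dx, dy]  # sums over input channels c
    return O

I = np.random.default_rng(1).normal(size=(10, 8, 8, 3)).astype(np.float32)
K = np.random.default_rng(2).normal(size=(3, 3, 3, 3, 4)).astype(np.float32)
print(dilated_conv3d(I, K, d_t=2).shape)  # (6, 6, 6, 4)
```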

2. Input/Output and Loss Design

Inputs: Each input is a non-overlapping clip of 100 consecutive RGB frames at $48 \times 27$ resolution with no frame skipping, normalized to $[0, 1]$ per sequence. Augmentations (each applied with its own probability) include random horizontal/vertical flips, color jitter, PIL Equalize/Posterize/Color operations, and, for synthetic transitions, color transfer applied to one segment.

Outputs: The model emits two per-frame probability sequences $p^\text{single}_t$ and $p^\text{all}_t$ for $t = 0, \ldots, 99$. At test time the outer 25 frames on each side of the window are discarded; frames where $p^\text{single}_t > \theta$ (default $\theta = 0.5$) constitute transitions, with shots demarcated between consecutive transitions.
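
The thresholding-and-grouping step can be sketched as follows (the released implementation ships an equivalent helper; this version merges consecutive over-threshold frames into one transition):

```python
import numpy as np

def predictions_to_scenes(p_single, threshold=0.5):
    """Turn per-frame transition probabilities into (start, end) shot spans,
    inclusive on both ends (a sketch of the post-processing described above)."""
    transition = p_single > threshold
    scenes, start, t = [], 0, 0
    while t < len(transition):
        if transition[t]:
            scenes.append((start, t - 1))           # close shot before transition
            while t < len(transition) and transition[t]:
                t += 1                              # skip the transition frames
            start = t                               # next shot begins here
        else:
            t += 1
    scenes.append((start, len(transition) - 1))     # final shot
    return [(s, e) for s, e in scenes if e >= s]

p = np.zeros(20); p[7] = 0.9; p[8] = 0.8  # one 2-frame transition
print(predictions_to_scenes(p))  # [(0, 6), (9, 19)]
```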

Loss: Given ground-truth labels $y^\text{single}_t, y^\text{all}_t \in \{0, 1\}$:

$$L = L_\text{single} + 0.1 \cdot L_\text{all} + 10^{-4} \cdot \|W\|^2$$

where

$$L_\text{single} = \left( w_\text{pos} \cdot y^\text{single}_t + (1 - y^\text{single}_t) \right) \cdot CE(y^\text{single}_t, p^\text{single}_t), \quad w_\text{pos} = 5.0$$

$$L_\text{all} = CE(y^\text{all}_t, p^\text{all}_t)$$

$$CE(y, p) = -\left[ y \log p + (1 - y) \log(1 - p) \right]$$

and $\|W\|^2$ is the squared $L_2$ norm over all trainable weights.
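
The loss terms above can be combined in a few lines of numpy (a sketch; averaging over frames is an assumption about the reduction convention):

```python
import numpy as np

def transnet_v2_loss(p_single, p_all, y_single, y_all, weights_sq_norm=0.0,
                     w_pos=5.0, lambda_all=0.1, weight_decay=1e-4, eps=1e-7):
    """Combined TransNet V2 loss per the equations above, averaged over
    frames (a numpy sketch, not the training-framework implementation)."""
    def ce(y, p):
        p = np.clip(p, eps, 1 - eps)  # numerical safety for log
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))
    w = w_pos * y_single + (1 - y_single)       # up-weight positive frames
    l_single = np.mean(w * ce(y_single, p_single))
    l_all = np.mean(ce(y_all, p_all))
    return l_single + lambda_all * l_all + weight_decay * weights_sq_norm

# all-positive frames predicted at 0.5: loss = 5*ln(2) + 0.1*ln(2) ~= 3.535
y1 = np.ones(4)
p_half = np.full(4, 0.5)
print(round(transnet_v2_loss(p_half, p_half, y1, y1), 3))  # 3.535
```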

3. Training Regimen and Data Handling

Data composition: Training data mixes synthetic transitions, generated on the fly from TRECVID IACC.3 shots (approximately 300,000 shots), with real transitions sampled from ClipShots (roughly 128k hard cuts and 38k gradual transitions). Each batch consists of 35% synthetic hard cuts, 50% synthetic dissolves of random duration (2–30 frames), and 15% real ClipShots transitions.

  • Real transitions involve random sampling of 100-frame windows containing annotated transitions.
  • Synthetic transitions involve pairing 300-frame clips from the same shot, cropping to 100 frames, and inserting a synthetically generated transition.
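The dissolve synthesis step above can be sketched as a linear cross-fade between two clips (the linear alpha schedule and label convention here are assumptions):

```python
import numpy as np

def make_dissolve(clip_a, clip_b, length):
    """Synthesise a gradual transition by cross-fading the tail of clip_a
    into the head of clip_b over `length` frames. Returns the spliced
    frames and a per-frame "in transition" label for the all-frame head."""
    alpha = np.linspace(0, 1, length)[:, None, None, None]
    blended = (1 - alpha) * clip_a[-length:] + alpha * clip_b[:length]
    frames = np.concatenate([clip_a[:-length], blended, clip_b[length:]])
    y_all = np.zeros(len(frames))
    y_all[len(clip_a) - length:len(clip_a)] = 1  # mark the blended span
    return frames, y_all

# two 60-frame clips at the 48x27 input resolution, 10-frame dissolve
a = np.zeros((60, 27, 48, 3))
b = np.ones((60, 27, 48, 3))
frames, y_all = make_dissolve(a, b, length=10)
print(frames.shape, int(y_all.sum()))  # (110, 27, 48, 3) 10
```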

Optimization: SGD with momentum 0.9, learning rate 0.01, batch size 16, fixed 750 batches per epoch (approx. 600,000 transitions per run), for 50 epochs. Training and validation on a single Tesla V100 (16GB) require about 17 hours total. Weight decay regularization at 1×1041\times10^{-4} is applied. Data augmentation is performed per-sequence; no explicit curriculum is used beyond the real/synthetic mix.

4. Benchmark Results and Precision-Recall Characteristics

TransNet V2 demonstrates state-of-the-art performance on three standardized benchmarks (all following a uniform evaluation protocol). F1 scores are as follows:

| Model                    | ClipShots | BBC   | RAI   |
|--------------------------|-----------|-------|-------|
| TransNet V1 [2019]       | 73.5      | 92.9  | 94.3  |
| Hassanien et al. [2017]  | 75.9*     | 92.6* | 93.9* |
| Tang et al. (ResNet-18)  | 76.1*     | 89.3* | 92.8* |
| TransNet V2 (ours)       | 77.9      | 96.2  | 93.9  |

(* re-evaluated from public code, thresholds tuned.)

On ClipShots and BBC datasets, TransNet V2 achieves the highest F1, and is competitive on RAI. Speed benchmarks indicate inference throughput of 200–250 windows/s (100-frame clips) on NVIDIA V100, equating to ~20,000 fps effective rate; CPU-only inference (single-core Intel Xeon) achieves ~200 fps.
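
The F1 scores above are the usual harmonic mean of detection precision and recall; for reference (illustrative counts, not the benchmark's actual tallies):

```python
def f1_score(num_correct, num_detected, num_ground_truth):
    """F1 as used for shot-boundary benchmarks: harmonic mean of precision
    (correct / detected) and recall (correct / ground truth)."""
    precision = num_correct / num_detected
    recall = num_correct / num_ground_truth
    return 2 * precision * recall / (precision + recall)

# e.g. 90 correct detections out of 100 detected, 110 ground-truth transitions
print(round(f1_score(90, 100, 110), 3))  # 0.857
```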

5. Computational Resource Characteristics

TransNet V2 encompasses approximately 4.2 million parameters, with $\sim 3.5 \cdot 10^9$ FLOPs per 100-frame clip. The weight footprint is 32 MB; GPU memory requirement for activations (batch size 1) is 200 MB. On high-end GPUs, 20 real-time HD streams can be analyzed in parallel, while CPU-only deployment is sufficient for real-time operation on a single stream.

6. Implementation, Usage, and Configuration

The implementation (including training, evaluation, and inference code) and pre-trained weights are publicly accessible at https://github.com/soCzech/TransNetV2. The model is instantiable and usable in Python as follows:

from transnetv2 import TransNetV2

# load the model with its pre-trained weights
model = TransNetV2(weights_dir="/path/to/weights_dir")

# decode the video and obtain per-frame probabilities from both heads
video_frames, p_single, p_all = model.predict_video("/path/to/video.mp4")

# threshold the single-frame head and split into (start, end) shot spans
scenes = model.predictions_to_scenes(predictions=p_single, threshold=0.5)

# render both probability curves over the frames and save the figure
img = model.visualize_predictions(frames=video_frames,
                                  predictions=(p_single, p_all))
img.save("results.png")

Configurable parameters include the $p^\text{single}$ threshold (default 0.5), window size (default 100), device (GPU/CPU), and histogram bins (default 512).

7. Visualization and Interpretation

The accompanying code tools provide visualization of $p^\text{single}$ (green curve) and $p^\text{all}$ (blue curve) over time, with ground-truth transitions denoted in red. The bottom panel of Figure 1 overlays detected shot boundaries, demonstrating robustness in challenging cases (such as rapid scene changes or fade-in/fade-out transitions).

In summary, TransNet V2 integrates a deep dilated 3D-CNN backbone with a light-weight, histogram-augmented frame-similarity branch and dual-head sequence supervision for shot boundary detection with both accuracy and real-time speed. The open-source release supports reproducible evaluation and downstream deployment in high-throughput video analytics (Souček et al., 2020).
