LiteVPNet: Efficient QP Prediction Network

Updated 12 April 2026

LiteVPNet is a lightweight neural network that uses low-complexity feature extraction and semantic embeddings to predict QP for specified VMAF targets.
It employs a dual-module design with a Transformer-style ClipNet and a residual feed-forward DNN that integrates bitstream, metadata, and video complexity features for efficient regression.
LiteVPNet outperforms prior methods with up to 87.3% coverage within 2 VMAF points and a 65× speed-up over brute-force approaches, making it ideal for time-critical streaming and production.

LiteVPNet is a lightweight neural network designed for accurate Quantisation Parameter (QP) prediction to achieve specified perceptual video quality (measured via VMAF) in quality-critical streaming and virtual production workflows. Targeting the requirements of on-set virtual production and high-value content transport, LiteVPNet leverages low-complexity feature extraction, semantic video embeddings, and efficient inference to enable direct, single-pass QP selection, outperforming previous methods in precision and speed (Vibhoothi et al., 14 Oct 2025).

1. System Architecture and Model Design

LiteVPNet comprises two jointly-trained network modules: ClipNet and a main feed-forward Deep Neural Network (DNN) for QP prediction. The architecture is organized as follows.

ClipNet is a lightweight Transformer-style self-attention network which processes an input of 4096-dimensional semantic features (CLIP embeddings from 8 contiguous video frames, each 512-dim), projecting the result to a 16-dimensional embedding vector. This operation enables semantic context aggregation from temporally neighboring frames with minimal computational overhead.
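The aggregation step can be illustrated with a minimal NumPy sketch. Only the input size (8 × 512 = 4096) and the 16-dimensional output come from the description above; the single attention head, the mean pooling over frames, the random weights, and the names `clipnet_sketch` and `init_params` are assumptions made for illustration, not the published ClipNet.

```python
import numpy as np

def init_params(seed=0):
    """Random placeholder weights; a trained model would load learned values."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(512)
    return {
        "Wq": rng.normal(0.0, scale, (512, 512)),
        "Wk": rng.normal(0.0, scale, (512, 512)),
        "Wv": rng.normal(0.0, scale, (512, 512)),
        "Wo": rng.normal(0.0, scale, (512, 16)),
        "bo": np.zeros(16),
    }

def clipnet_sketch(clip_features, params):
    """Map a 4096-dim CLIP feature vector to a 16-dim embedding."""
    tokens = np.asarray(clip_features, dtype=np.float64).reshape(8, 512)
    q = tokens @ params["Wq"]                        # queries, (8, 512)
    k = tokens @ params["Wk"]                        # keys,    (8, 512)
    v = tokens @ params["Wv"]                        # values,  (8, 512)
    scores = q @ k.T / np.sqrt(q.shape[-1])          # scaled dot-product, (8, 8)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)   # row-wise softmax
    context = attn @ v                               # each frame attends to all 8
    pooled = context.mean(axis=0)                    # assumed mean pooling, (512,)
    return pooled @ params["Wo"] + params["bo"]      # final projection, (16,)
```

Treating each frame's 512-dimensional embedding as one token keeps the attention matrix at 8 × 8, which is one way such a module could stay cheap.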

Main QP-Prediction DNN is a residual feed-forward network. It ingests the 16-dimensional ClipNet output concatenated with 738 additional features (bitstream statistics, video complexity metrics, and video-level metadata) to form a 754-dimensional input. The DNN is arranged in fully connected (FC) blocks interleaved with Batch Normalisation (BN), GELU activations, Dropout, and Residual connections:

FC 754→256 → BN → GELU → Dropout → Residual
FC 256→128 → BN → GELU → Dropout → Residual
FC 128→64 → BN → GELU → Dropout → Residual
FC 64→8 → Sigmoid → Normalised QP output

Each FC block computes: $z^{(l)} = W^{(l)} x^{(l-1)} + b^{(l)},\quad \hat z^{(l)} = \mathrm{BatchNorm}(z^{(l)}) = \gamma \frac{z^{(l)} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \quad x^{(l)} = \mathrm{GELU}(\hat z^{(l)}) = \hat z^{(l)} \cdot \Phi(\hat z^{(l)})$, where $\Phi$ is the standard normal cumulative distribution function, with skip connections that add $x^{(l-1)}$ to $x^{(l)}$ at each residual stage.
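A forward pass through this stack at inference time can be sketched in NumPy as follows. The layer widths and the FC → BN → GELU ordering follow the text; everything else is an assumption: the weights are random placeholders, BatchNorm uses stored running statistics, Dropout is the identity, and because the input and output widths of each block differ, the skip path here uses a hypothetical linear projection `P` that the source does not specify.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return z * 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def batchnorm_eval(z, bn, eps=1e-5):
    # Inference-mode BatchNorm using stored running mean and variance.
    return bn["gamma"] * (z - bn["mean"]) / np.sqrt(bn["var"] + eps) + bn["beta"]

def residual_block(x, blk):
    z = blk["W"] @ x + blk["b"]
    h = gelu(batchnorm_eval(z, blk["bn"]))
    # Dropout is the identity at inference time.
    return h + blk["P"] @ x  # assumed projection shortcut (not from the source)

def predict_normalised_qp(x, blocks, head):
    for blk in blocks:
        x = residual_block(x, blk)
    logits = head["W"] @ x + head["b"]
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid: 8 values in [0, 1]

def init_model(seed=0):
    """Random placeholder parameters for the 754-256-128-64-8 stack."""
    rng = np.random.default_rng(seed)
    dims = [754, 256, 128, 64]
    blocks = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        blocks.append({
            "W": rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_out, d_in)),
            "b": np.zeros(d_out),
            "P": rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_out, d_in)),
            "bn": {"gamma": np.ones(d_out), "beta": np.zeros(d_out),
                   "mean": np.zeros(d_out), "var": np.ones(d_out)},
        })
    head = {"W": rng.normal(0.0, 1.0 / 8.0, (8, 64)), "b": np.zeros(8)}
    return blocks, head
```

Each of the eight sigmoid outputs would still need to be mapped back to the encoder's QP range; that mapping is not given in the text.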

The outputs are eight normalised QP predictions, one for each target VMAF level in $\{99, 97, 95, 91, 88, 85, 83, 80\}$. Input features pass first through ClipNet and then through the main DNN; the two modules are trained jointly for representation learning and QP regression.

2. Input Feature Extraction and Normalization

LiteVPNet's predictive accuracy is driven by four groups of features, all extracted from down-sampled (480×270) versions of source video:

Frame-Level Bitstream Statistics ( $\hat F$ ; $\sim$ 600 dims):

For the first eight frames at QP=160 (NVENC AV1, preset-7), histograms and summary statistics are computed for block partitioning, transform characteristics, skip/block-copy/palette flags, reference types, in-loop filter usage, and motion vector bit allocation. For property $p$ and frame $i$ :

$\mu_{p,i} = \frac{1}{N} \sum_{k=1}^N p_{i,k},\quad \sigma_{p,i}^2 = \frac{1}{N} \sum_{k=1}^N (p_{i,k} - \mu_{p,i})^2$

plus quartiles $\{p_{25}, p_{50}, p_{75}\}$ .
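As a small illustration of these per-frame descriptors, the sketch below computes the mean, the population variance (the $1/N$ form above), and the three quartiles for one property of one frame, using only the Python standard library. The function name `frame_property_stats` and the quartile interpolation convention (`method="inclusive"`) are assumptions; the source does not say which convention is used.

```python
import statistics

def frame_property_stats(values):
    """Summary statistics for one bitstream property within one frame.

    `values` holds the N per-block samples p_{i,k} of the property.
    """
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)  # population variance, 1/N
    # Assumed convention: linear interpolation between order statistics.
    p25, p50, p75 = statistics.quantiles(values, n=4, method="inclusive")
    return {"mean": mean, "var": var, "p25": p25, "p50": p50, "p75": p75}
```

For example, calling it on the samples `[1, 2, 3, 4, 5]` gives a mean of 3, a variance of 2, and quartiles 2, 3, and 4.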

Video-Level Metadata (5 dims):

Encodes duration, bit-depth, average QP index, frame dimensions, and frame rate.

Video Complexity Analysis (VCA; $\sim$ 100 dims):
- Spatial Complexity (SC): DCT-energy mean per block,
- Temporal Complexity (TC): mean per-pixel absolute frame difference,
- VCA features include statistic descriptors (mean, std, min, max, percentiles) for I-frames and non-I-frames.

Semantic Embeddings (CLIP, “Clippie”; 4096 dims):

CLIP model (CPU, pure NumPy) returns 512-dim vectors per frame; concatenation over 8 frames yields 4096 dims, projected to 16 by ClipNet.
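The temporal complexity descriptor from the VCA group can be illustrated with a short NumPy sketch. It follows the wording in the text (mean per-pixel absolute frame difference) and then summarises the per-frame values; the function names are hypothetical, the input is assumed to be a stack of luma frames, and the actual VCA tool may compute its features differently.

```python
import numpy as np

def temporal_complexity(frames):
    """Mean absolute difference between consecutive frames.

    `frames` has shape (T, H, W); the result has T - 1 entries.
    """
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.abs(frames[1:] - frames[:-1])
    return diffs.mean(axis=(1, 2))

def summarise(values):
    """Simple descriptors of a per-frame feature series."""
    values = np.asarray(values, dtype=np.float64)
    return {
        "mean": float(values.mean()),
        "std": float(values.std()),
        "min": float(values.min()),
        "max": float(values.max()),
    }
```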

Normalization procedures:

Bitstream statistics and metadata: min–max scaled to [0,1].
VCA features: per-metric min–max scaling.
CLIP embeddings: standardized to zero mean and unit variance.
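These two normalisation schemes can be sketched as below. The statistics are fitted on training data and then reused at inference; the clipping of min–max output to [0, 1] and the guards against zero range or zero variance are assumptions added for robustness, and the function names are illustrative.

```python
import numpy as np

def fit_minmax(train):
    """Per-column minimum and maximum over the training set."""
    train = np.asarray(train, dtype=np.float64)
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(x, lo, hi):
    span = np.where(hi > lo, hi - lo, 1.0)           # avoid division by zero
    scaled = (np.asarray(x, dtype=np.float64) - lo) / span
    return np.clip(scaled, 0.0, 1.0)                 # assumed: clamp unseen extremes

def fit_standard(train):
    """Per-column mean and standard deviation over the training set."""
    train = np.asarray(train, dtype=np.float64)
    return train.mean(axis=0), train.std(axis=0)

def apply_standard(x, mean, std):
    safe_std = np.where(std > 0, std, 1.0)           # constant columns stay at zero
    return (np.asarray(x, dtype=np.float64) - mean) / safe_std
```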

3. Training Methodology and Loss Formulation

The training corpus comprises 2,944 1080p single-shot clips (∼300 frames, ≤7 s), sampled from 12 publicly available datasets (YouTube-UGC, Netflix Open, AOM-CTC, Xiph, SJTU, Inter4K, among others) with an 80/20 train/test split guaranteeing no content overlap.

Ground-truth generation:

Each clip is exhaustively encoded at 24 QP levels using NVENC AV1. VMAF is computed for each encode, then piecewise cubic Hermite interpolation (PCHIP) is used to invert the QP–VMAF curve and produce the ground-truth QP satisfying each VMAF target.
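The inversion step can be sketched with SciPy's `PchipInterpolator`, treating VMAF as the independent variable and QP as the dependent one. The function name `qp_for_targets` is hypothetical, and the sketch assumes the measured VMAF values are strictly monotone in QP; a real curve that saturates near 100 would need its duplicate VMAF values removed first.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

VMAF_TARGETS = [99, 97, 95, 91, 88, 85, 83, 80]

def qp_for_targets(qps, vmafs, targets=VMAF_TARGETS):
    """Invert a measured QP-VMAF curve at the requested VMAF targets.

    Returns NaN for targets outside the measured VMAF range.
    """
    qps = np.asarray(qps, dtype=np.float64)
    vmafs = np.asarray(vmafs, dtype=np.float64)
    order = np.argsort(vmafs)  # PCHIP requires strictly increasing x values
    inverse = PchipInterpolator(vmafs[order], qps[order], extrapolate=False)
    return inverse(np.asarray(targets, dtype=np.float64))
```

PCHIP preserves monotonicity between samples, so a lower VMAF target never maps to a lower QP than a higher target on the same curve.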

Loss Function:

A “TolerantWeightedMSELoss” is applied: a weighted mean-squared error with a tolerance band, so that predictions whose error falls within the tolerance (expressed in VMAF points) incur no penalty.
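One plausible reading of such a loss is sketched below. This is not the published formulation: the hard threshold (zero inside the band, weighted squared error outside), the per-element weights, and the tolerance value are all assumptions supplied by the caller, and errors are assumed to be measured in the same units as the tolerance.

```python
import numpy as np

def tolerant_weighted_mse(pred, target, weights, tolerance):
    """Weighted MSE that ignores errors no larger than `tolerance`."""
    err = np.abs(np.asarray(pred, dtype=np.float64)
                 - np.asarray(target, dtype=np.float64))
    # Errors inside the tolerance band contribute nothing (assumed behaviour).
    penalised = np.where(err > tolerance, err ** 2, 0.0)
    return float(np.mean(np.asarray(weights, dtype=np.float64) * penalised))
```

With errors of 0.5 and 4.0, weights of 1 and 2, and a tolerance of 1, only the second term is penalised, giving (0 + 2 · 16) / 2 = 16.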

Optimizer:

Adam with weight decay and a batch size of 32. The learning rate is reduced on plateau (factor 0.4, patience 8 epochs), and early stopping is triggered after 20 epochs with no improvement.
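The plateau and early-stopping logic can be sketched in plain Python. The factor (0.4), patience (8), and stopping window (20) are the values quoted above; the class name `PlateauController`, the initial learning rate passed by the caller, and the exact counting rule (cut once more than `patience` non-improving epochs have accumulated) are assumptions.

```python
class PlateauController:
    """Tracks validation loss, cuts the learning rate, and signals early stop."""

    def __init__(self, lr, factor=0.4, patience=8, stop_after=20):
        self.lr = lr                  # caller-supplied initial learning rate
        self.factor = factor
        self.patience = patience
        self.stop_after = stop_after
        self.best = float("inf")
        self.since_best = 0           # epochs since the best validation loss
        self.since_cut = 0            # non-improving epochs since the last cut

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.since_best = 0
            self.since_cut = 0
        else:
            self.since_best += 1
            self.since_cut += 1
            if self.since_cut > self.patience:
                self.lr *= self.factor
                self.since_cut = 0
        return self.since_best >= self.stop_after
```

In a training loop, `step` would be called once per epoch with the validation loss, and the returned flag checked before starting the next epoch.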

Training convergence:

Training converged after 103 epochs on an NVIDIA 40-series GPU. Model size: ClipNet ≈798k parameters; main DNN ≈242k; total ≈1 million.

4. Evaluation Protocols and Comparative Performance

Metrics used for evaluation include:

Absolute VMAF error:

$\Delta_{\mathrm{VMAF}} = |\mathrm{VMAF}_{\mathrm{achieved}} - \mathrm{VMAF}_{\mathrm{target}}|$

Reporting mean, median, standard deviation, and coverage for $\Delta_{\mathrm{VMAF}} \le 2$ and $\Delta_{\mathrm{VMAF}} \le 4$.

QP Mean Absolute Error (MAE):

Mean absolute deviation of predicted QPs from ground-truth for eight VMAF targets.
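These summary metrics can be computed with a few lines of standard-library Python. The sketch below handles the absolute VMAF error and the two coverage thresholds; the function name `vmaf_error_report` is hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which form is reported.

```python
import statistics

def vmaf_error_report(achieved, targets):
    """Summarise absolute VMAF errors over a set of (achieved, target) pairs."""
    errors = [abs(a - t) for a, t in zip(achieved, targets)]
    n = len(errors)
    return {
        "mean": statistics.fmean(errors),
        "median": statistics.median(errors),
        "std": statistics.pstdev(errors),                   # assumed: population form
        "coverage_2": sum(e <= 2 for e in errors) / n,      # share within 2 points
        "coverage_4": sum(e <= 4 for e in errors) / n,      # share within 4 points
    }
```

The same function applies to QP MAE if predicted and ground-truth QPs are passed in place of the VMAF values, with the coverage fields ignored.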

Test set results:

Metric	LiteVPNet
Mean QP MAE	4.5
Median QP MAE	2.5
QP Std Dev	6.5
Mean VMAF MAE	1.0
Median VMAF MAE	0.5
VMAF Std Dev	1.5
Coverage (≤2)	87.3%
Coverage (≤4)	96.5%

Boxplots of QP and VMAF MAE across all eight targets and CDF plots of the absolute VMAF error demonstrate ≥80% coverage at ≤2 VMAF points for all targets and >93% at ≤4.

Ablation studies:

Excluding CLIP embeddings decreases coverage (≤2) from 87.3% to 75.6% and increases VMAF MAE from 1.0 to 1.5. Removing VCA features also leads to significant error increases.

Comparison with prior art:

Model	QP MAE	VMAF MAE	Coverage (≤2)
Mico-DNN	33.7	5.9	27.8%
JTPS	13.0	2.1	61.1%
LiteVPNet	4.5	1.0	87.3%

LiteVPNet outperforms both Mico-DNN and JTPS by a substantial margin across all measures.

5. Computational Efficiency and Resource Usage

Model size: ∼1 M total parameters across ClipNet and the main DNN.
Per-shot inference latency: ∼0.28 s for neural network inference; full end-to-end processing (including feature extraction and downsampling) requires ∼3.0 s per 1080p shot.
Baselines:
- JTPS: 5.6 s/shot (1.9× slower than LiteVPNet)
- Mico-DNN: 5.3 s/shot (1.7× slower)
Brute-force encoding: 8 QP encodes + VMAF evaluation per shot requires ~197 s. LiteVPNet achieves up to 65× speed-up relative to brute-force QP search.
Hardware: NVENC AV1 on NVIDIA 40-series GPU for initial feature extraction; Clippie inference (CLIP embedding extraction) runs on CPU in approximately 0.1 s (pure NumPy implementation).
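The relative speed figures follow from simple ratios of the per-shot timings. The sketch below recomputes them from the rounded values quoted above, so the results are approximate and can differ slightly from the quoted multipliers; the helper name `speedup` is illustrative.

```python
def speedup(baseline_s, litevpnet_s=3.0):
    """Ratio of a baseline's per-shot time to LiteVPNet's end-to-end time."""
    return baseline_s / litevpnet_s

# Per-shot timings in seconds, as quoted in the text (rounded).
timings = {"JTPS": 5.6, "Mico-DNN": 5.3, "brute force": 197.0}
ratios = {name: round(speedup(t), 2) for name, t in timings.items()}
```

The brute-force ratio, 197 / 3.0, is a little under 66, in line with the quoted 65× speed-up.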

6. Integration Guidelines and Application Considerations

Recommended workflow:

Downsample video input to 480×270 using Lanczos-5 filter.
Extract bitstream features with NVENC AV1 at preset-7, QP=160; avoid split-frame encoding.
Ensure full feature set inclusion (bitstream, metadata, VCA, CLIP) for optimal predictor accuracy.
Use model hyperparameters: batch size=32, Adam optimizer with weight decay, learning rate scheduler patience=8, and the tolerant loss.
For lowest latency, deploy both feature extraction and inference on modern NVIDIA hardware; since CLIP embedding runs efficiently on CPU, GPU is not strictly required for this step.
Target quality: choose eight VMAF targets in the 99–80 range, or interpolate new targets using PCHIP as required.
Integrate LiteVPNet as a drop-in QP predictor at the start of virtual production or streaming pipelines; direct QP selection for a desired VMAF avoids iterative two-pass or multi-QP encodes.
Eliminating redundant encodes yields substantial energy and time savings (up to 65× over brute-force), supporting sustainable and resource-efficient quality-critical workflows.
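Interpolating a QP for a VMAF target that is not among the eight trained levels can be sketched with SciPy's `PchipInterpolator`. The function name `qp_for_new_target` is hypothetical, and the sketch assumes the eight predicted QPs have already been mapped from the normalised output back to encoder QP units.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

VMAF_TARGETS = np.array([99, 97, 95, 91, 88, 85, 83, 80], dtype=np.float64)

def qp_for_new_target(predicted_qps, new_target):
    """Interpolate a QP for a VMAF target between the eight trained levels.

    Returns NaN if `new_target` lies outside the 80-99 range.
    """
    qps = np.asarray(predicted_qps, dtype=np.float64)
    order = np.argsort(VMAF_TARGETS)  # PCHIP requires increasing x values
    curve = PchipInterpolator(VMAF_TARGETS[order], qps[order], extrapolate=False)
    return float(curve(new_target))
```

For a target of 90, the result falls between the QPs predicted for the neighbouring targets 91 and 88, since PCHIP does not overshoot monotone data.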

A plausible implication is that LiteVPNet's low-latency, accurate QP prediction using a lightweight architecture and low-complexity feature set enables its use in resource-constrained or time-critical environments without compromising perceptual quality guarantees. The method is particularly well suited for real-time high-value content streaming, on-set virtual production, and remote post-production workflows demanding tight control over output video quality (Vibhoothi et al., 14 Oct 2025).

References (1)

LiteVPNet: A Lightweight Network for Video Encoding Control in Quality-Critical Applications (2025)
