- The paper introduces a practical real-time neural video compression system by addressing operational complexity, which is dominated by memory I/O and function calls, rather than just reducing MACs.
- Key innovations include using a single-scale latent representation via patch embedding, employing implicit temporal modeling for faster encoding, and implementing a module-bank for adaptive rate control.
- The proposed system achieves real-time encoding and decoding speeds (over 100 fps for 1080p video) on consumer GPUs while delivering competitive compression performance against state-of-the-art codecs.
The paper presents a comprehensive study of the bottlenecks in neural video compression (NVC) systems and introduces a practical real-time NVC codec that achieves an excellent rate-distortion-complexity trade-off. The work begins with an in-depth analysis distinguishing computational from operational complexity. While traditional approaches have focused on reducing the number of multiply-accumulate operations (MACs), the authors demonstrate that latency is predominantly governed by non-computational factors such as memory I/O and excessive function calls. They quantify operational complexity via two critical factors: the latent representation size and the number of modules in the network architecture.
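The computational-vs-operational distinction can be made concrete with a back-of-the-envelope model (this is an illustrative sketch, not the paper's analysis; the layer sizes and fp16 assumption are chosen for the example):

```python
def conv_stats(h, w, c_in, c_out, k=3, bytes_per=2):
    """Rough MAC count and memory traffic (bytes) for one k x k conv layer."""
    macs = h * w * c_in * c_out * k * k
    # I/O: read input activations and weights, write output (fp16 = 2 bytes)
    io = bytes_per * (h * w * c_in + k * k * c_in * c_out + h * w * c_out)
    return macs, io

full = conv_stats(1080, 1920, 128, 128)
half = conv_stats(1080, 1920, 64, 64)
# Halving the channel width cuts MACs ~4x but memory traffic only ~2x,
# so on a memory-bound GPU the observed speedup is closer to linear --
# consistent with the paper's observation that latency tracks I/O.
```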
Key contributions include:
- Operational Complexity Reduction:
- The paper reveals that reducing the number of channels produces an almost linear speedup, even though MACs decrease quadratically with channel width. This mismatch indicates that latency is bound by memory traffic rather than arithmetic, and it motivates a design shift toward lowering operational overhead rather than merely minimizing arithmetic operations.
- The paper introduces the concept of learning latent representations at a single low resolution (specifically at 1/8 of the original image scale) using patch embedding. This approach replaces the traditional progressive downsampling strategy, leading to a significant reduction in latent tensor size and associated memory I/O costs.
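The single-scale idea can be sketched as a patch embedding that maps a frame directly to a 1/8-resolution latent grid. The sketch below uses non-overlapping 8×8 patch flattening (space-to-depth) followed by a linear projection; the latent channel count and the random stand-in weights are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

def patch_embed(frame, patch=8, c_latent=64, rng=np.random.default_rng(0)):
    """Map an HxWxC frame straight to a single (H/8)x(W/8) latent grid.

    Non-overlapping patches are flattened (space-to-depth) and linearly
    projected in one step, instead of progressive stride-2 downsampling
    that materializes several intermediate multi-scale feature maps.
    """
    h, w, c = frame.shape
    assert h % patch == 0 and w % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p*p*C)
    x = frame.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(h // patch, w // patch, -1)
    proj = rng.standard_normal((patch * patch * c, c_latent))  # stand-in weights
    return x @ proj

latent = patch_embed(np.zeros((1080, 1920, 3)))
# latent.shape == (135, 240, 64): one low-resolution tensor, so the memory
# I/O cost of the higher-resolution intermediate features is avoided.
```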
- Implicit Temporal Modeling:
- Instead of using explicit motion estimation and compensation—which, despite low per-pixel computational complexity, demand many sequential module calls and thus contribute substantially to operational overhead—the paper proposes an implicit temporal modeling strategy. A simple feature extractor is used to generate temporal context by concatenating features from previous frames directly with the current frame’s latent representation.
- Experimental ablation studies indicate that the implicit approach incurs only a modest BD-Rate change (roughly a 0.4% improvement on sequences with small motion and a 3.2% loss on sequences with large motion) while enabling 3.4× faster encoding, making it a practical alternative for real-time applications.
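The implicit fusion step amounts to a channel-wise concatenation, sketched below (feature dimensions are assumptions for illustration):

```python
import numpy as np

def implicit_temporal_context(prev_feat, cur_latent):
    """Fuse temporal context by channel-wise concatenation.

    No motion estimation/compensation subnetworks are invoked, so the
    number of sequential module calls -- the operational cost -- stays low.
    """
    assert prev_feat.shape[:2] == cur_latent.shape[:2]
    return np.concatenate([prev_feat, cur_latent], axis=-1)

prev = np.zeros((135, 240, 48))  # output of a light feature extractor (assumed dims)
cur = np.zeros((135, 240, 64))   # current frame's latent (assumed dims)
fused = implicit_temporal_context(prev, cur)
```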
- Module-Bank-Based Rate Control:
- To address the variability in target bitrates, the authors incorporate a module bank that allows the model to adjust to a range of quantization parameters (qp). This approach involves learning a spectrum of hyperprior modules and separate vector banks that modulate the latent features in a fine-grained manner. The method yields an average bitrate savings of around 3.4% compared to single-module approaches and supports hierarchical quality control across frames.
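A minimal sketch of the vector-bank idea follows: one modulation vector per qp level, applied channel-wise to the latent so a single model serves a range of bitrates. The bank size, channel count, and initialization here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

class ModuleBank:
    """Per-qp modulation vectors (a sketch of the vector-bank idea)."""

    def __init__(self, num_qp=64, channels=64, rng=np.random.default_rng(0)):
        # One learned channel-wise scale per quantization parameter;
        # random values stand in for trained parameters.
        self.scales = 1.0 + 0.01 * rng.standard_normal((num_qp, channels))

    def modulate(self, latent, qp):
        # Fine-grained, per-channel scaling of the latent features
        return latent * self.scales[qp]

bank = ModuleBank()
latent = np.ones((135, 240, 64))
low_rate = bank.modulate(latent, qp=10)
high_rate = bank.modulate(latent, qp=50)  # different qp, different modulation
```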
- Model Integerization for Cross-Device Consistency:
- A deterministic 16-bit integerization strategy is introduced to eliminate nondeterministic floating-point computations that could lead to decoding inconsistencies across different hardware platforms. By mapping floating-point features and weights to int16 (using fixed multipliers) and carrying out operations in integer arithmetic, the model guarantees reproducible results on consumer devices. The paper details both the theoretical framework and an algorithmic workflow that ensures these operations do not saturate the available range, thereby preserving model fidelity.
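The fixed-multiplier mapping can be sketched as follows (a simplified illustration of the general scheme; the multiplier value and layer shapes are assumptions, and the paper's actual workflow additionally manages per-layer ranges to avoid saturation):

```python
import numpy as np

def to_int16(x, multiplier=1 << 8):
    """Map floats to int16 with a fixed multiplier, saturating to the range."""
    q = np.clip(np.rint(x * multiplier), -32768, 32767)
    return q.astype(np.int16)

def int_linear(x_q, w_q, multiplier=1 << 8):
    """Integer matmul with a rescale back onto the int16 fixed-point grid.

    Accumulating in int64 avoids overflow; dividing out one multiplier
    keeps the result on the same scale. All-integer arithmetic makes the
    result bit-exact across devices, unlike floating point.
    """
    acc = x_q.astype(np.int64) @ w_q.astype(np.int64)
    return np.clip(acc // multiplier, -32768, 32767).astype(np.int16)

x = to_int16(np.array([[0.5, -0.25]]))
w = to_int16(np.array([[1.0], [2.0]]))
y = int_linear(x, w)  # deterministic on any platform
# dequantized: y / 256 = 0.5 * 1.0 + (-0.25) * 2.0 = 0.0
```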
- Experimental Validation and Performance:
- Extensive experiments on datasets such as Vimeo-90k, HEVC Classes B–E, UVG, and MCL-JCV demonstrate that the proposed codec (DCVC-RT) achieves real-time performance on consumer GPUs. For 1080p video, the system attains an average encoding speed of 125.2 fps and decoding at 112.8 fps on an NVIDIA A100 GPU, while also delivering a 21.0% bitrate reduction compared to VTM/H.266 under a challenging single intra-frame setting.
- The rate-distortion curves further show that, despite its lightweight design, the proposed model maintains competitive compression performance, especially in the low-bitrate range (below 0.02 bpp), with only minor degradation in the high-quality range, where the differences are less perceptually significant.
- Scalability and Flexibility:
- Although the core design is optimized for speed, the framework is readily extendable. The authors also outline a larger variant (DCVC-RT Large) that, by increasing channel widths and the number of depth-wise convolution blocks, significantly improves BD-Rate performance (e.g., achieving an average BD-Rate of –30.8% on YUV420) while still maintaining real-time performance.
- Moreover, a parallel coding strategy is employed in which entropy coding is decoupled from network inference, yielding an additional speedup of 12% for encoding and 9% for decoding.
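The decoupling idea can be sketched as a simple software pipeline: while the network runs inference on frame t+1, a worker thread entropy-codes the latent of frame t. This is an illustrative scheme with stand-in functions, not the paper's implementation:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def network_inference(frame):
    time.sleep(0.01)               # stand-in for GPU network inference
    return f"latent-{frame}"

def entropy_encode(latent):
    time.sleep(0.01)               # stand-in for CPU arithmetic coding
    return f"bitstream-{latent}"

def encode_pipelined(frames):
    """Overlap entropy coding of frame t with inference of frame t+1."""
    out, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as coder:
        for f in frames:
            latent = network_inference(f)     # main thread: inference
            if pending is not None:
                out.append(pending.result())  # collect previous bitstream
            pending = coder.submit(entropy_encode, latent)  # code in background
        out.append(pending.result())
    return out

streams = encode_pipelined(range(4))
```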
In summary, the paper offers a detailed and systematic approach to overcoming the practical obstacles in NVCs. It shifts the paradigm from focusing solely on reducing computational operations to a more holistic consideration of operational costs. By integrating implicit temporal modeling, single-scale latent representation, adaptive rate control via a module bank, and precise model integerization, the work provides a real-time NVC solution that is both efficient and practical for deployment on consumer hardware, all while maintaining state-of-the-art compression performance.