- The paper introduces a practical real-time neural video compression system by addressing operational complexity, which is dominated by memory I/O and function calls, rather than just reducing MACs.
- Key innovations include using a single-scale latent representation via patch embedding, employing implicit temporal modeling for faster encoding, and implementing a module-bank for adaptive rate control.
- The proposed system achieves real-time encoding and decoding speeds (over 100 fps for 1080p video) on consumer GPUs while delivering competitive compression performance against state-of-the-art codecs.
The paper presents a comprehensive study of the bottlenecks in neural video compression (NVC) systems and introduces a practical real-time NVC codec that achieves an excellent rate-distortion-complexity trade-off. The work begins with an in-depth analysis distinguishing computational from operational complexity. While traditional approaches have focused on reducing the number of multiply-accumulate operations (MACs), the authors demonstrate that latency is predominantly governed by non-computational factors such as memory I/O and excessive function calls. They quantify operational complexity via two critical factors: the latent representation size and the number of modules in the network architecture.
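The computational-vs-operational distinction can be made concrete with a back-of-the-envelope model (this is an illustrative sketch, not the paper's analysis; the layer sizes and fp16 assumption are chosen for the example):

```python
def conv_stats(h, w, c_in, c_out, k=3, bytes_per=2):
    """Rough MAC count and memory traffic (bytes) for one k x k conv layer."""
    macs = h * w * c_in * c_out * k * k
    # I/O: read input activations and weights, write output (fp16 = 2 bytes)
    io = bytes_per * (h * w * c_in + k * k * c_in * c_out + h * w * c_out)
    return macs, io

full = conv_stats(1080, 1920, 128, 128)
half = conv_stats(1080, 1920, 64, 64)
# Halving the channel width cuts MACs ~4x but memory traffic only ~2x,
# so on a memory-bound GPU the observed speedup is closer to linear --
# consistent with the paper's observation that latency tracks I/O.
```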
Key contributions include:
- Operational Complexity Reduction:
- The paper reveals that reducing the number of channels produces an almost linear speedup, even though MACs decrease quadratically with channel width. This mismatch indicates that latency is bound by memory traffic rather than arithmetic, and it motivates a design shift toward lowering operational overhead rather than merely minimizing arithmetic operations.
- The paper introduces the concept of learning latent representations at a single low resolution (specifically at 1/8 of the original image scale) using patch embedding. This approach replaces the traditional progressive downsampling strategy, leading to a significant reduction in latent tensor size and associated memory I/O costs.
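The single-scale idea can be sketched as a patch embedding that maps a frame directly to a 1/8-resolution latent grid. The sketch below uses non-overlapping 8×8 patch flattening (space-to-depth) followed by a linear projection; the latent channel count and the random stand-in weights are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

def patch_embed(frame, patch=8, c_latent=64, rng=np.random.default_rng(0)):
    """Map an HxWxC frame straight to a single (H/8)x(W/8) latent grid.

    Non-overlapping patches are flattened (space-to-depth) and linearly
    projected in one step, instead of progressive stride-2 downsampling
    that materializes several intermediate multi-scale feature maps.
    """
    h, w, c = frame.shape
    assert h % patch == 0 and w % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p*p*C)
    x = frame.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(h // patch, w // patch, -1)
    proj = rng.standard_normal((patch * patch * c, c_latent))  # stand-in weights
    return x @ proj

latent = patch_embed(np.zeros((1080, 1920, 3)))
# latent.shape == (135, 240, 64): one low-resolution tensor, so the memory
# I/O cost of the higher-resolution intermediate features is avoided.
```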
- Implicit Temporal Modeling:
- Instead of using explicit motion estimation and compensation—which, despite low per-pixel computational complexity, demand many sequential module calls and thus contribute substantially to operational overhead—the paper proposes an implicit temporal modeling strategy. A simple feature extractor is used to generate temporal context by concatenating features from previous frames directly with the current frame’s latent representation.
- Experimental ablation studies indicate that the implicit approach incurs only a modest BD-Rate change (roughly a 0.4% improvement on sequences with small motion and a 3.2% loss on sequences with large motion) while enabling 3.4× faster encoding, making it a practical alternative for real-time applications.
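The implicit fusion step amounts to a channel-wise concatenation, sketched below (feature dimensions are assumptions for illustration):

```python
import numpy as np

def implicit_temporal_context(prev_feat, cur_latent):
    """Fuse temporal context by channel-wise concatenation.

    No motion estimation/compensation subnetworks are invoked, so the
    number of sequential module calls -- the operational cost -- stays low.
    """
    assert prev_feat.shape[:2] == cur_latent.shape[:2]
    return np.concatenate([prev_feat, cur_latent], axis=-1)

prev = np.zeros((135, 240, 48))  # output of a light feature extractor (assumed dims)
cur = np.zeros((135, 240, 64))   # current frame's latent (assumed dims)
fused = implicit_temporal_context(prev, cur)
```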
- Module-Bank-Based Rate Control:
- To address the variability in target bitrates, the authors incorporate a module bank that allows the model to adjust to a range of quantization parameters (qp). This approach involves learning a spectrum of hyperprior modules and separate vector banks that modulate the latent features in a fine-grained manner. The method yields an average bitrate savings of around 3.4% compared to single-module approaches and supports hierarchical quality control across frames.
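A minimal sketch of the vector-bank idea follows: one modulation vector per qp level, applied channel-wise to the latent so a single model serves a range of bitrates. The bank size, channel count, and initialization here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

class ModuleBank:
    """Per-qp modulation vectors (a sketch of the vector-bank idea)."""

    def __init__(self, num_qp=64, channels=64, rng=np.random.default_rng(0)):
        # One learned channel-wise scale per quantization parameter;
        # random values stand in for trained parameters.
        self.scales = 1.0 + 0.01 * rng.standard_normal((num_qp, channels))

    def modulate(self, latent, qp):
        # Fine-grained, per-channel scaling of the latent features
        return latent * self.scales[qp]

bank = ModuleBank()
latent = np.ones((135, 240, 64))
low_rate = bank.modulate(latent, qp=10)
high_rate = bank.modulate(latent, qp=50)  # different qp, different modulation
```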
- Model Integerization for Cross-Device Consistency:
- A deterministic 16-bit integerization strategy is introduced to eliminate nondeterministic floating-point computations that could lead to decoding inconsistencies across different hardware platforms. By mapping floating-point features and weights to int16 (using fixed multipliers) and carrying out operations in integer arithmetic, the model guarantees reproducible results on consumer devices. The paper details both the theoretical framework and an algorithmic workflow that ensures these operations do not saturate the available range, thereby preserving model fidelity.
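The fixed-multiplier mapping can be sketched as follows (a simplified illustration of the general scheme; the multiplier value and layer shapes are assumptions, and the paper's actual workflow additionally manages per-layer ranges to avoid saturation):

```python
import numpy as np

def to_int16(x, multiplier=1 << 8):
    """Map floats to int16 with a fixed multiplier, saturating to the range."""
    q = np.clip(np.rint(x * multiplier), -32768, 32767)
    return q.astype(np.int16)

def int_linear(x_q, w_q, multiplier=1 << 8):
    """Integer matmul with a rescale back onto the int16 fixed-point grid.

    Accumulating in int64 avoids overflow; dividing out one multiplier
    keeps the result on the same scale. All-integer arithmetic makes the
    result bit-exact across devices, unlike floating point.
    """
    acc = x_q.astype(np.int64) @ w_q.astype(np.int64)
    return np.clip(acc // multiplier, -32768, 32767).astype(np.int16)

x = to_int16(np.array([[0.5, -0.25]]))
w = to_int16(np.array([[1.0], [2.0]]))
y = int_linear(x, w)  # deterministic on any platform
# dequantized: y / 256 = 0.5 * 1.0 + (-0.25) * 2.0 = 0.0
```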
- Experimental Validation and Performance:
- Extensive experiments on datasets such as Vimeo-90k, HEVC Classes B–E, UVG, and MCL-JCV demonstrate that the proposed codec (DCVC-RT) achieves real-time performance on consumer GPUs. For 1080p video, the system attains an average encoding speed of 125.2 fps and decoding at 112.8 fps on an NVIDIA A100 GPU, while also delivering a 21.0% bitrate reduction compared to VTM/H.266 under a challenging single intra-frame setting.
- The rate-distortion curves further show that, despite its lightweight design, the proposed model maintains competitive compression performance, especially in the low-bitrate range (below 0.02 bpp), with only minor degradation in the high-quality range, where the differences are less perceptually significant.
- Scalability and Flexibility:
- Although the core design is optimized for speed, the framework is readily extendable. The authors also outline a larger variant (DCVC-RT Large) that, by increasing channel widths and the number of depth-wise convolution blocks, significantly improves BD-Rate performance (e.g., achieving an average BD-Rate of –30.8% on YUV420) while still maintaining real-time performance.
- Moreover, a parallel coding strategy is employed in which entropy coding is decoupled from network inference, yielding an additional speedup of 12% for encoding and 9% for decoding.
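The decoupling idea can be sketched as a simple software pipeline: while the network runs inference on frame t+1, a worker thread entropy-codes the latent of frame t. This is an illustrative scheme with stand-in functions, not the paper's implementation:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def network_inference(frame):
    time.sleep(0.01)               # stand-in for GPU network inference
    return f"latent-{frame}"

def entropy_encode(latent):
    time.sleep(0.01)               # stand-in for CPU arithmetic coding
    return f"bitstream-{latent}"

def encode_pipelined(frames):
    """Overlap entropy coding of frame t with inference of frame t+1."""
    out, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as coder:
        for f in frames:
            latent = network_inference(f)     # main thread: inference
            if pending is not None:
                out.append(pending.result())  # collect previous bitstream
            pending = coder.submit(entropy_encode, latent)  # code in background
        out.append(pending.result())
    return out

streams = encode_pipelined(range(4))
```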
In summary, the paper offers a detailed and systematic approach to overcoming the practical obstacles in NVCs. It shifts the paradigm from focusing solely on reducing computational operations to a more holistic consideration of operational costs. By integrating implicit temporal modeling, single-scale latent representation, adaptive rate control via a module bank, and precise model integerization, the work provides a real-time NVC solution that is both efficient and practical for deployment on consumer hardware, all while maintaining state-of-the-art compression performance.