Efficient Neural Video Representation with Temporally Coherent Modulation (2505.00335v1)

Published 1 May 2025 in cs.CV and cs.AI

Abstract: Implicit neural representations (INRs) have found successful applications across diverse domains. To employ INRs in real-world settings, it is important to speed up training. In the field of INRs for video, the state-of-the-art approach employs grid-type parametric encoding and achieves faster encoding than its predecessors. However, the grid usage, which does not consider the video's dynamic nature, leads to redundant use of trainable parameters. As a result, it has significantly lower parameter efficiency and a higher bitrate than NeRV-style methods that do not use a parametric encoding. To address this problem, we propose Neural Video representation with Temporally coherent Modulation (NVTM), a novel framework that captures the dynamic characteristics of video. By decomposing the spatio-temporal 3D video data into a set of 2D grids with flow information, NVTM learns video representations rapidly and uses parameters efficiently. Our framework processes temporally corresponding pixels at once, resulting in the fastest encoding speed for a reasonable video quality, with a speed-up of over 3x compared to NeRV-style methods. It also achieves average improvements of 1.54 dB / 0.019 in PSNR/LPIPS on UVG (Dynamic), even with 10% fewer parameters, and 1.84 dB / 0.013 in PSNR/LPIPS on MCL-JCV (Dynamic), compared to previous grid-type works. Extending this to compression tasks, we demonstrate performance comparable to video compression standards (H.264, HEVC) and recent INR approaches to video compression. Additionally, extensive experiments demonstrate the superior performance of our algorithm across diverse tasks, including super-resolution, frame interpolation, and video inpainting. Project page: https://sujiikim.github.io/NVTM/.

Summary

Efficient Neural Video Representation with Temporally Coherent Modulation: An Overview

This paper presents a novel framework called Neural Video Representation with Temporally coherent Modulation (NVTM), which addresses the challenge of efficiently representing videos using Implicit Neural Representations (INR). The framework aims to enhance both the speed and parameter efficiency of training neural networks for video applications, a critical aspect for practical use.

Background and Motivation

The field of implicit neural representations has seen significant advancements, offering powerful tools for the continuous representation of signals across domains such as images, sound, and 3D objects. Applying these methods to video, however, is particularly challenging due to the high dimensionality and temporal dynamics of video data. NeRV-style methods are parameter-efficient but slow to encode, while state-of-the-art grid-based approaches encode quickly but ignore the dynamic nature of video, wasting trainable parameters. NVTM addresses this gap by introducing temporally coherent modulation, which significantly speeds up training while reducing the bitrate and parameter redundancy of previous methods.

Methodology

NVTM introduces several key innovations to improve the efficiency of video representation:

  1. Temporal Coherence: Unlike prior grid-type methods that do not account for temporal dynamics, NVTM decomposes the spatio-temporal 3D video data into a set of 2D grids with flow information. This decomposition lets the framework process temporally corresponding pixels simultaneously, achieving rapid encoding.
  2. Modulation Framework: NVTM applies a consistent modulation to corresponding pixels across time, allowing the model to capture video dynamics effectively. This is implemented by learning a modulation latent that enhances representational power without requiring excess parameters.
  3. Alignment Flow Network: To support temporal coherence, NVTM employs an alignment flow network that maps 3D video coordinates to 2D grid coordinates, efficiently handling temporal redundancy. A minimal sketch of how these pieces fit together follows this list.
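
To make the pipeline concrete, here is a minimal PyTorch sketch of the idea as described above: an alignment flow network maps each (x, y, t) coordinate to a shared 2D grid, a modulation latent is bilinearly sampled there, and a small modulated MLP decodes RGB. This is an illustrative sketch under assumptions, not the authors' code: the class names, layer sizes, single-grid (one GOP) simplification, and FiLM-style scale-and-shift modulation are all hypothetical choices.

```python
# Hypothetical sketch of the NVTM idea (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentFlow(nn.Module):
    """Maps 3D video coords (x, y, t) in [-1, 1] to 2D grid coords (u, v)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # flow offset in grid space
        )

    def forward(self, xyt):
        # Temporally corresponding pixels should land on the same (u, v),
        # so predict an offset relative to the spatial position.
        uv = xyt[..., :2] + self.net(xyt)
        return uv.clamp(-1.0, 1.0)

class NVTMSketch(nn.Module):
    def __init__(self, grid_res=128, latent_dim=32, hidden=128):
        super().__init__()
        # One learnable 2D feature grid; the paper uses a set of grids
        # (e.g. per GOP) -- a single grid here for brevity.
        self.grid = nn.Parameter(torch.randn(1, latent_dim, grid_res, grid_res) * 0.01)
        self.flow = AlignmentFlow()
        # Modulation latent -> per-layer scale and shift (FiLM-style assumption).
        self.to_mod = nn.Linear(latent_dim, 2 * hidden)
        self.fc1 = nn.Linear(3, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, 3)  # RGB

    def forward(self, xyt):                       # xyt: (N, 3) in [-1, 1]
        uv = self.flow(xyt)                       # (N, 2) shared 2D coords
        # Bilinearly sample the modulation latent at the aligned coords.
        sample_pts = uv.view(1, -1, 1, 2)         # grid_sample expects (N, H, W, 2)
        z = F.grid_sample(self.grid, sample_pts, align_corners=True)
        z = z.view(z.shape[1], -1).t()            # (N, latent_dim)
        scale, shift = self.to_mod(z).chunk(2, dim=-1)
        h = torch.relu(self.fc1(xyt))
        h = torch.relu(scale * self.fc2(h) + shift)  # temporally coherent modulation
        return torch.sigmoid(self.out(h))         # RGB in [0, 1]

# Usage: query a batch of (x, y, t) coordinates.
coords = torch.rand(1024, 3) * 2 - 1
rgb = NVTMSketch()(coords)                        # (1024, 3)
```

Because temporally corresponding pixels map to the same (u, v) location, they share one grid entry and one modulation latent; this is what keeps the grid small and lets corresponding pixels be processed at once.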

Results

The paper reports significant improvements over competing methods in both speed and quality. NVTM achieves over three times the encoding speed of NeRV-style approaches and shows substantial gains on PSNR/LPIPS benchmarks with fewer parameters: improvements of 1.54 dB / 0.019 in PSNR/LPIPS on UVG (Dynamic), even while using 10% fewer parameters, and 1.84 dB / 0.013 on MCL-JCV (Dynamic), relative to previous grid-type methods. On compression tasks, it performs comparably to the H.264 and HEVC standards and to recent INR-based approaches.
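
For context on the reported decibel gains, PSNR is the standard log-scale fidelity metric computed from mean squared error; the short snippet below (standard definition, not code from the paper) shows that a +1.54 dB improvement corresponds to roughly a 30% reduction in MSE.

```python
# PSNR = 10 * log10(MAX^2 / MSE); for signals in [0, 1], MAX = 1.
import math

def psnr(mse: float, max_val: float = 1.0) -> float:
    return 10.0 * math.log10(max_val ** 2 / mse)

print(psnr(1e-3))                   # 30.00 dB
print(psnr(1e-3 * 10 ** (-0.154)))  # ~31.54 dB, i.e. ~30% less MSE
```

LPIPS, the other reported metric, is a learned perceptual distance where lower values indicate closer perceptual similarity.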

Implications and Future Directions

The implications of this research are twofold. Practically, NVTM allows for efficient video processing in real-world applications, potentially impacting areas such as video streaming and editing. Theoretically, this work advances the understanding of how implicit representations can be effectively utilized for high-dimensional and dynamic data like videos.

Looking forward, there are several promising directions for future research, including exploring variable GOP (group of pictures) sizes to further optimize temporal segmentation, extending the framework to additional video formats and compression settings, and integrating it with other neural processing tasks to build more comprehensive multimedia solutions.

In conclusion, NVTM provides a robust solution for neural video representation, marked by its innovative use of temporally coherent modulation and efficient handling of video dynamics. This framework is a critical step toward marrying the theoretical elegance of INRs with the practical efficiency necessary for handling complex video data.
