- The paper presents a major advance by leveraging multiple reference frames to boost prediction accuracy and reduce temporal redundancy.
- It employs dual deep auto-encoders to compress motion vectors and residuals, optimizing the rate-distortion tradeoff effectively.
- Experimental results show that M-LVC outperforms H.265 in low-latency mode, achieving higher PSNR/MS-SSIM at comparable bitrates across the tested datasets.
An Analysis of "M-LVC: Multiple Frames Prediction for Learned Video Compression"
The paper "M-LVC: Multiple Frames Prediction for Learned Video Compression" introduces an end-to-end learned video compression framework optimized for low-latency scenarios. This research addresses a limitation of previous learned methods, which predominantly relied on a single reference frame when predicting each video frame. The presented method instead leverages multiple reference frames, improving both prediction accuracy and compression efficiency for each frame.
Methodological Innovations
The proposed M-LVC approach introduces several critical innovations through its handling of motion vectors (MVs) and residuals:
- Multiple Reference Frames: By incorporating multiple past frames as references, M-LVC significantly reduces the temporal redundancy in video sequences. Multiple hypotheses for predicting the current frame can be generated, and their combination forms a more accurate prediction ensemble.
- MV and Residual Compression: M-LVC employs two deep auto-encoders, one for compressing MVs and one for compressing residuals. This compression is crucial because it removes spatial redundancy within each signal, reducing the storage and transmission size of the encoded data.
- MV Prediction and Refinement: The multi-scale aligned MV prediction network (MAMVP-Net) aligns information from previous frames to predict the current MV, so only the prediction difference needs to be coded, minimizing coding cost. Furthermore, an MV refinement network compensates for compression errors, ensuring higher accuracy in motion predictions.
- Residual Refinement: Supplementary networks refine the residuals post-compression, further reducing prediction errors and enhancing video quality through effective use of multiple reference frames.
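The multi-reference idea in the first bullet can be illustrated numerically: each past reference frame is motion-compensated toward the current frame, producing one prediction hypothesis per reference, and the hypotheses are then fused. The sketch below is a hypothetical stand-in that fuses hypotheses with scalar softmax weights in NumPy; the paper itself learns this fusion with convolutional networks, so treat the function name and weighting scheme as illustrative assumptions.

```python
import numpy as np

def combine_hypotheses(warped_refs, logits):
    """Fuse several motion-compensated reference frames (prediction
    hypotheses) into a single prediction of the current frame.

    warped_refs: list of equally shaped arrays, one per reference frame.
    logits: one scalar score per hypothesis; softmax turns them into
    normalized fusion weights (a simplified proxy for learned fusion).
    """
    logits = np.asarray(logits, dtype=float)
    w = np.exp(logits - logits.max())   # stable softmax
    w /= w.sum()
    # Weighted sum over the hypothesis axis yields the fused prediction.
    return sum(wi * ref for wi, ref in zip(w, warped_refs))
```

With equal logits the fusion reduces to a plain average of the hypotheses, which makes the behavior easy to sanity-check.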
The entire system is integrated and trained using a unified rate-distortion loss function with a step-by-step training strategy, which systematically optimizes each component for a balanced rate-distortion tradeoff.
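The unified objective trades the total bitrate of the coded MV and residual streams against reconstruction distortion. A minimal sketch of such a per-frame rate-distortion loss is below; the exact rate estimation and lambda values are not taken from the paper, so the function signature and numbers are illustrative assumptions.

```python
def rate_distortion_loss(bits_mv, bits_res, mse, lam, num_pixels):
    """Single-frame rate-distortion objective: bitrate in bits-per-pixel
    plus lambda-weighted distortion (here MSE). Larger lam favors
    quality over bitrate; both coded streams contribute to the rate."""
    rate_bpp = (bits_mv + bits_res) / num_pixels
    return rate_bpp + lam * mse
```

In the step-by-step strategy described above, individual modules (e.g., the MV auto-encoder) are first trained on their own sub-losses before the whole pipeline is fine-tuned jointly on this combined objective.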
Experimental Validation
The experiments conducted validate M-LVC's superior performance against contemporary learned methods such as DVC and conventional video codecs like H.265. The paper reports that M-LVC consistently surpasses H.265 in terms of PSNR and MS-SSIM metrics across various test datasets, including UVG and HEVC Class B and D. Particularly noteworthy is the bitrate reduction achieved through the innovative use of multiple frames for prediction, which provides a substantial efficiency boost in low-latency video scenarios.
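The PSNR figures reported in these comparisons are derived directly from mean squared error between the original and reconstructed frames. A minimal reference implementation:

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio in dB for 8-bit frames.
    Higher is better; identical frames give infinity."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

MS-SSIM, the other metric cited, is perceptually motivated and considerably more involved (multi-scale structural similarity); library implementations are typically used in practice.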
Implications and Future Directions
This paper represents a significant step forward in learned video compression, particularly for real-time applications that require low-latency processing. The ability to outperform H.265, a major industry-standard codec, illustrates the practical impact of integrating deep learning into video encoding. Conceptually, M-LVC also highlights the advantage of exploiting temporal context more broadly, which could extend to scenarios such as immersive media transmission and gaming.
Future work could involve further computational optimizations to enable real-time encoding on edge devices, a field where the encoding complexity of multiple frame references might pose computational challenges. Additionally, extending the scheme with more advanced entropy models or integrating with cutting-edge neural network architectures could yield further enhancements in compression ratios without sacrificing computational efficiency.
In summary, M-LVC stands as a robust model in learned video compression, demonstrating that increased temporal context through multiple frame predictions can significantly enhance both compression efficiency and the quality of reconstructed video, marking it as a salient contribution in the field of video processing.