
Deep Learning-based Image and Video Inpainting: A Survey (2401.03395v1)

Published 7 Jan 2024 in cs.CV

Abstract: Image and video inpainting is a classic problem in computer vision and computer graphics, aiming to fill in plausible and realistic content in the missing areas of images and videos. With the advance of deep learning, significant progress has recently been made on this problem. The goal of this paper is to comprehensively review deep learning-based methods for image and video inpainting. Specifically, we sort existing methods into categories based on their high-level inpainting pipeline, present the deep learning architectures involved, including CNNs, VAEs, GANs, diffusion models, etc., and summarize techniques for module design. We review the training objectives and the common benchmark datasets. We present evaluation metrics for low-level pixel similarity and high-level perceptual similarity, conduct a performance evaluation, and discuss the strengths and weaknesses of representative inpainting methods. We also discuss related real-world applications. Finally, we discuss open challenges and suggest potential future research directions.


Summary

  • The paper presents a comprehensive survey of state-of-the-art deep learning methods for image and video inpainting, covering both deterministic and stochastic approaches.
  • The paper highlights the use of advanced architectures like CNNs, VAEs, GANs, diffusion models, and transformers, along with text-guided techniques to enhance content realism and temporal coherence.
  • The paper discusses key challenges including visual artifacts, model training specificity, and ethical concerns, urging further research to address these issues.

Introduction

Deep learning has redefined the boundaries of image and video inpainting, providing innovative solutions and techniques to reconstruct missing or occluded regions with realistic content. By leveraging various neural network architectures such as CNNs, VAEs, GANs, diffusion models, and transformers, inpainting processes have become more sophisticated, allowing the generation of semantically plausible content that blends seamlessly with the surrounding areas. The range of possible applications is vast, from cultural relic restoration to digital forensics and film production.

Advances in Image Inpainting

The development of inpainting techniques can be divided into deterministic approaches, which yield a single outcome, and stochastic methods, which produce multiple plausible results. Deterministic strategies often operate within single-shot, two-stage, or progressive frameworks, in which one or more generators complete the missing regions directly or gradually. Stochastic inpainting, by contrast, introduces randomness and diversity into the results, drawing on VAE-based, GAN-based, and diffusion model-based methodologies. The sketch below illustrates the two-stage, coarse-to-fine case.
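To make the two-stage pipeline concrete, here is a minimal PyTorch sketch in which a coarse generator produces a rough fill that a second generator then refines. Module names, layer sizes, and the compositing scheme are illustrative assumptions for exposition, not the survey authors' implementation.

```python
import torch
import torch.nn as nn

class CoarseToFineInpainter(nn.Module):
    """Illustrative two-stage pipeline: a coarse generator produces a rough
    fill, then a refinement generator sharpens it. Layer sizes are arbitrary."""
    def __init__(self, channels=32):
        super().__init__()
        # Stage 1: coarse fill from the masked image + binary mask (4 input channels).
        self.coarse = nn.Sequential(
            nn.Conv2d(4, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )
        # Stage 2: refinement conditioned on the coarse composite and the mask.
        self.refine = nn.Sequential(
            nn.Conv2d(4, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, image, mask):
        # mask: 1 inside holes, 0 elsewhere; zero out the holes before encoding.
        masked = image * (1.0 - mask)
        coarse = self.coarse(torch.cat([masked, mask], dim=1))
        # Composite: keep known pixels, use the coarse prediction inside holes.
        coarse_comp = masked + coarse * mask
        refined = self.refine(torch.cat([coarse_comp, mask], dim=1))
        return masked + refined * mask

x = torch.randn(1, 3, 64, 64)
m = torch.zeros(1, 1, 64, 64)
m[..., 16:48, 16:48] = 1.0          # a square hole in the center
out = CoarseToFineInpainter()(x, m)  # (1, 3, 64, 64)
```

In practice each stage would be a deep encoder-decoder trained with reconstruction and adversarial losses; the composite step between stages is what lets the refinement network focus on sharpening the hole region rather than re-predicting known pixels.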

An emerging trend is the incorporation of text prompts into the inpainting process. This text-guided approach presents both challenges and opportunities, such as effectively fusing text and visual features and precisely executing user-specified edits. Furthermore, architectural innovations such as mask-aware designs and attention mechanisms contribute significantly to inpainting fidelity and realism; one such design is sketched below.
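The summary does not pin down a specific mask-aware module, but gated convolution is one common such design in the inpainting literature. The following sketch (activations and channel counts are illustrative assumptions) pairs a feature branch with a learned soft gate, so the layer infers where valid content lies instead of relying on a hard binary mask.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Illustrative mask-aware ('gated') convolution: one branch computes
    features, a parallel branch computes a soft per-pixel gate that
    suppresses activations coming from hole regions."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        # Gate in [0, 1] scales the features; it is learned, not hand-set.
        return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))

# The masked image plus its mask can be fed as a 4-channel input.
layer = GatedConv2d(4, 32)
x = torch.randn(1, 4, 64, 64)
y = layer(x)  # (1, 32, 64, 64)
```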

Progress in Video Inpainting

With the additional temporal dimension, video inpainting requires not only spatial consistency but also temporal coherence. Approaches vary: 3D CNN-based methods address both dimensions concurrently, while shift-based methods optimize for computational efficiency. Flow-guided methods exploit optical flow to ensure temporal stability (see the sketch below), while attention-based methods capture contextual information over an enlarged spatio-temporal window.
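As a concrete illustration of the flow-guided idea, the sketch below shows the backward-warping step that propagates pixels from a neighboring frame along optical flow; pixels known in that frame can then fill holes in the current one. It assumes the flow field is already estimated by some off-the-shelf method; the function name and tensor conventions are assumptions, not from the paper.

```python
import torch
import torch.nn.functional as F

def warp_frame(prev_frame, flow):
    """Illustrative flow-guided propagation step: backward-warp prev_frame
    into the current frame using optical flow given in pixel units.
    prev_frame: (B, C, H, W); flow: (B, 2, H, W) as (dx, dy)."""
    b, _, h, w = prev_frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0).expand(b, -1, -1, -1)
    # Shift by the flow, then normalize to [-1, 1] as grid_sample expects.
    coords = grid + flow.permute(0, 2, 3, 1)
    coords[..., 0] = 2.0 * coords[..., 0] / (w - 1) - 1.0
    coords[..., 1] = 2.0 * coords[..., 1] / (h - 1) - 1.0
    return F.grid_sample(prev_frame, coords, align_corners=True)

prev = torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)   # zero flow: warping is the identity
warped = warp_frame(prev, flow)
```

Full flow-guided pipelines additionally complete the flow field inside the holes and fall back to a generative model for pixels that no frame can supply.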

Challenges and Opportunities

A gap remains between what current technology can deliver and the complexities inherent in real-world application scenarios. Issues such as visual artifacts, the specificity of model training, and the challenge of large-scale inpainting persist as areas ripe for further research. Moreover, scaling up training to massive datasets such as LAION is both a challenge and an opportunity for future development.

Ethical Considerations

As deep learning-based inpainting technologies advance, ethical considerations must be at the forefront of the conversation. Potential misuse for malicious purposes, copyright infringement, historical inaccuracy, and bias are serious concerns that must be addressed to prevent abuse and harm.

Conclusion

The field of image and video inpainting is rapidly evolving, bolstered by deep learning innovations. The blend of advanced neural architectures and large-scale datasets opens new frontiers for automatic visual data restoration, editing, and creation. As we undertake this technological journey, it is essential that ethical standards guide the use and advancement of these powerful techniques to ensure they serve the best interests of society.
