
Deep Learning-based Image and Video Inpainting: A Survey (2401.03395v1)

Published 7 Jan 2024 in cs.CV

Abstract: Image and video inpainting is a classic problem in computer vision and computer graphics, aiming to fill in plausible and realistic content in the missing areas of images and videos. With the advance of deep learning, significant progress has recently been made on this problem. The goal of this paper is to comprehensively review deep learning-based methods for image and video inpainting. Specifically, we sort existing methods into categories based on their high-level inpainting pipeline, present the deep learning architectures involved, including CNNs, VAEs, GANs, diffusion models, etc., and summarize techniques for module design. We review the training objectives and the common benchmark datasets. We present evaluation metrics for low-level pixel similarity and high-level perceptual similarity, conduct a performance evaluation, and discuss the strengths and weaknesses of representative inpainting methods. We also discuss related real-world applications. Finally, we discuss open challenges and suggest potential future research directions.


Summary

  • The paper presents a comprehensive survey of state-of-the-art deep learning methods for image and video inpainting, covering both deterministic and stochastic approaches.
  • The paper highlights the use of advanced architectures like CNNs, VAEs, GANs, diffusion models, and transformers, along with text-guided techniques to enhance content realism and temporal coherence.
  • The paper discusses key challenges including visual artifacts, model training specificity, and ethical concerns, urging further research to address these issues.

Introduction

Deep learning has redefined the boundaries of image and video inpainting, providing innovative solutions and techniques to reconstruct missing or occluded regions with realistic content. By leveraging various neural network architectures such as CNNs, VAEs, GANs, diffusion models, and transformers, inpainting processes have become more sophisticated, allowing the generation of semantically plausible content that blends seamlessly with the surrounding areas. The range of possible applications is vast, from cultural relic restoration to digital forensics and film production.

Advances in Image Inpainting

The development of inpainting techniques can be divided into deterministic approaches, which yield a single outcome, and stochastic methods, which produce multiple plausible results. Deterministic strategies often operate within single-shot, two-stage, or progressive frameworks, in which one or more generators complete the missing regions directly or gradually. Stochastic inpainting, by contrast, introduces randomness and diversity into the results, drawing on VAE-based, GAN-based, and diffusion model-based methodologies. The sketch below illustrates the two-stage, coarse-to-fine case.
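To make the two-stage pipeline concrete, here is a minimal PyTorch sketch in which a coarse generator produces a rough fill that a second generator then refines. Module names, layer sizes, and the compositing scheme are illustrative assumptions for exposition, not the survey authors' implementation.

```python
import torch
import torch.nn as nn

class CoarseToFineInpainter(nn.Module):
    """Illustrative two-stage pipeline: a coarse generator produces a rough
    fill, then a refinement generator sharpens it. Layer sizes are arbitrary."""
    def __init__(self, channels=32):
        super().__init__()
        # Stage 1: coarse fill from the masked image + binary mask (4 input channels).
        self.coarse = nn.Sequential(
            nn.Conv2d(4, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )
        # Stage 2: refinement conditioned on the coarse composite and the mask.
        self.refine = nn.Sequential(
            nn.Conv2d(4, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, image, mask):
        # mask: 1 inside holes, 0 elsewhere; zero out the holes before encoding.
        masked = image * (1.0 - mask)
        coarse = self.coarse(torch.cat([masked, mask], dim=1))
        # Composite: keep known pixels, use the coarse prediction inside holes.
        coarse_comp = masked + coarse * mask
        refined = self.refine(torch.cat([coarse_comp, mask], dim=1))
        return masked + refined * mask

x = torch.randn(1, 3, 64, 64)
m = torch.zeros(1, 1, 64, 64)
m[..., 16:48, 16:48] = 1.0          # a square hole in the center
out = CoarseToFineInpainter()(x, m)  # (1, 3, 64, 64)
```

In practice each stage would be a deep encoder-decoder trained with reconstruction and adversarial losses; the composite step between stages is what lets the refinement network focus on sharpening the hole region rather than re-predicting known pixels.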

An emerging trend is the incorporation of text prompts into the inpainting process. This text-guided approach presents both challenges and opportunities, such as effectively fusing text and visual features and precisely executing user-specified edits. Furthermore, architectural innovations such as mask-aware designs and attention mechanisms contribute significantly to inpainting fidelity and realism; one such design is sketched below.
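The summary does not pin down a specific mask-aware module, but gated convolution is one common such design in the inpainting literature. The following sketch (activations and channel counts are illustrative assumptions) pairs a feature branch with a learned soft gate, so the layer infers where valid content lies instead of relying on a hard binary mask.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Illustrative mask-aware ('gated') convolution: one branch computes
    features, a parallel branch computes a soft per-pixel gate that
    suppresses activations coming from hole regions."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        # Gate in [0, 1] scales the features; it is learned, not hand-set.
        return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))

# The masked image plus its mask can be fed as a 4-channel input.
layer = GatedConv2d(4, 32)
x = torch.randn(1, 4, 64, 64)
y = layer(x)  # (1, 32, 64, 64)
```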

Progress in Video Inpainting

With the additional temporal dimension, video inpainting requires not only spatial consistency but also temporal coherence. Approaches vary: 3D CNN-based methods address both dimensions concurrently, while shift-based methods optimize for computational efficiency. Flow-guided methods exploit optical flow to ensure temporal stability (see the sketch below), while attention-based methods capture contextual information over an enlarged spatio-temporal window.
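As a concrete illustration of the flow-guided idea, the sketch below shows the backward-warping step that propagates pixels from a neighboring frame along optical flow; pixels known in that frame can then fill holes in the current one. It assumes the flow field is already estimated by some off-the-shelf method; the function name and tensor conventions are assumptions, not from the paper.

```python
import torch
import torch.nn.functional as F

def warp_frame(prev_frame, flow):
    """Illustrative flow-guided propagation step: backward-warp prev_frame
    into the current frame using optical flow given in pixel units.
    prev_frame: (B, C, H, W); flow: (B, 2, H, W) as (dx, dy)."""
    b, _, h, w = prev_frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0).expand(b, -1, -1, -1)
    # Shift by the flow, then normalize to [-1, 1] as grid_sample expects.
    coords = grid + flow.permute(0, 2, 3, 1)
    coords[..., 0] = 2.0 * coords[..., 0] / (w - 1) - 1.0
    coords[..., 1] = 2.0 * coords[..., 1] / (h - 1) - 1.0
    return F.grid_sample(prev_frame, coords, align_corners=True)

prev = torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)   # zero flow: warping is the identity
warped = warp_frame(prev, flow)
```

Full flow-guided pipelines additionally complete the flow field inside the holes and fall back to a generative model for pixels that no frame can supply.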

Challenges and Opportunities

A gap remains between what current technology can deliver and the complexities inherent in real-world application scenarios. Issues such as visual artifacts, the specificity of model training, and the challenge of large-scale inpainting persist as areas ripe for further research. Moreover, scaling up training to massive datasets such as LAION is both a challenge and an opportunity for future development.

Ethical Considerations

As deep learning-based inpainting technologies advance, ethical considerations must be at the forefront of the conversation. Potential misuse for malicious purposes, copyright infringement, historical inaccuracy, and bias are serious concerns that must be addressed to prevent abuse and harm.

Conclusion

The field of image and video inpainting is rapidly evolving, bolstered by deep learning innovations. The blend of advanced neural architectures and large-scale datasets opens new frontiers for automatic visual data restoration, editing, and creation. As we undertake this technological journey, it is essential that ethical standards guide the use and advancement of these powerful techniques to ensure they serve the best interests of society.
