Exploiting Optical Flow Guidance for Transformer-Based Video Inpainting (2301.10048v2)
Abstract: Transformers have been widely used for video processing owing to the multi-head self-attention (MHSA) mechanism. However, MHSA encounters an intrinsic difficulty in video inpainting, since the features associated with the corrupted regions are degraded and lead to inaccurate self-attention. This problem, termed query degradation, may be mitigated by first completing the optical flows and then using them to guide the self-attention, as verified in our previous work, the flow-guided transformer (FGT). We further exploit the flow guidance and propose FGT++ toward more effective and efficient video inpainting. First, we design a lightweight flow completion network using local aggregation and an edge loss. Second, to address query degradation, we propose a flow guidance feature integration module, which uses the motion discrepancy to enhance the features, together with a flow-guided feature propagation module that warps the features according to the flows. Third, we decouple the transformer along the temporal and spatial dimensions, where flows are used to select tokens through a temporally deformable MHSA mechanism, and global tokens are combined with the inner-window local tokens through a dual-perspective MHSA mechanism. Experiments show that FGT++ outperforms existing video inpainting networks both qualitatively and quantitatively.
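To make the flow-guidance idea concrete, the following is a minimal PyTorch sketch of backward-warping a neighboring frame's features with a completed optical flow field, the generic operation behind flow-guided feature propagation as summarized above. This is an illustrative sketch under stated assumptions, not the authors' implementation; the function name `warp_features`, the tensor shapes, and the usage example are assumptions.

```python
# Minimal sketch (not the FGT++ code): align a neighboring frame's features to the
# target frame by backward warping with an optical flow field.
import torch
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feat: (N, C, H, W) features of a neighboring frame.
    flow: (N, 2, H, W) flow from the target frame to that neighbor, in pixels
          (channel 0 = horizontal displacement, channel 1 = vertical).
    Returns features resampled to the target frame via bilinear interpolation."""
    n, _, h, w = feat.shape
    # Base sampling grid of pixel coordinates (x, y).
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat.dtype, device=feat.device),
        torch.arange(w, dtype=feat.dtype, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(n, -1, -1, -1)  # (N, 2, H, W)
    # Shift the grid by the flow, then normalize coordinates to [-1, 1] for grid_sample.
    coords = grid + flow
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(feat, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Hypothetical usage: align features of frame t+1 to frame t using the flow t -> t+1.
feat_next = torch.randn(1, 64, 60, 108)
flow_t_to_next = torch.zeros(1, 2, 60, 108)
aligned = warp_features(feat_next, flow_t_to_next)
```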
- M. Bertalmio, A. L. Bertozzi, and G. Sapiro, “Navier-stokes, fluid dynamics, and image and video inpainting,” in CVPR, 2001, pp. 355–362.
- M. Granados, J. Tompkin, K. Kim, O. Grau, J. Kautz, and C. Theobalt, “How not to be seen – Object removal from videos of crowded scenes,” Comput. Graph. Forum, vol. 31, pp. 219–228, 2012.
- D. Kim, S. Woo, J.-Y. Lee, and I. S. Kweon, “Deep video inpainting,” in CVPR, 2019, pp. 5792–5801.
- Y. Matsushita, E. Ofek, W. Ge, X. Tang, and H.-Y. Shum, “Full-frame video stabilization with motion inpainting,” IEEE Trans. PAMI, vol. 28, no. 7, pp. 1150–1163, 2006.
- D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros, “Context encoders: Feature learning by inpainting,” in CVPR, 2016, pp. 2536–2544.
- S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Globally and locally consistent image completion,” ACM Trans. Graph., vol. 36, no. 4, pp. 107:1–14, 2017.
- G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro, “Image inpainting for irregular holes using partial convolutions,” in ECCV, 2018, pp. 85–100.
- J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting with contextual attention,” in CVPR, 2018, pp. 5505–5514.
- ——, “Free-form image inpainting with gated convolution,” in ICCV, 2019, pp. 4471–4480.
- K. Nazeri, E. Ng, T. Joseph, F. Qureshi, and M. Ebrahimi, “EdgeConnect: Structure guided image inpainting using edge prediction,” in ICCV Workshops, 2019, pp. 3265–3274.
- S. Xu, D. Liu, and Z. Xiong, “E2I: Generative inpainting from edge to image,” IEEE Trans. CSVT, vol. 31, no. 4, pp. 1308–1322, 2021.
- J. Peng, D. Liu, S. Xu, and H. Li, “Generating diverse structure for image inpainting with hierarchical VQ-VAE,” in CVPR, 2021, pp. 10775–10784.
- L. Liao, J. Xiao, Z. Wang, C.-W. Lin, and S. Satoh, “Image inpainting guided by coherence priors of semantics and textures,” in CVPR, 2021, pp. 6539–6548.
- K. Zhang, J. Fu, and D. Liu, “Flow-guided transformer for video inpainting,” in ECCV. Springer, 2022, pp. 74–90.
- Z. Li, C.-Z. Lu, J. Qin, C.-L. Guo, and M.-M. Cheng, “Towards an end-to-end framework for flow-guided video inpainting,” in CVPR, 2022, pp. 17541–17550.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, vol. 30, 2017, pp. 6000–6010.
- Y. Zeng, J. Fu, and H. Chao, “Learning joint spatial-temporal transformations for video inpainting,” in ECCV, 2020, pp. 528–543.
- R. Liu, H. Deng, Y. Huang, X. Shi, L. Lu, W. Sun, X. Wang, J. Dai, and H. Li, “FuseFormer: Fusing fine-grained information in transformers for video inpainting,” in ICCV, 2021, pp. 14020–14029.
- ——, “Decoupled spatial-temporal transformer for video inpainting,” arXiv:2104.06637, 2021.
- M. Song, Y. Zhang, and T. O. Aydın, “TempFormer: Temporally consistent transformer for video denoising,” in ECCV. Springer, 2022, pp. 481–496.
- J. Liang, J. Cao, Y. Fan, K. Zhang, R. Ranjan, Y. Li, R. Timofte, and L. Van Gool, “VRT: A video restoration transformer,” arXiv:2201.12288, 2022.
- J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “SwinIR: Image restoration using swin transformer,” in ICCV Workshops, 2021, pp. 1833–1844.
- C. Liu, H. Yang, J. Fu, and X. Qian, “Learning trajectory-aware transformer for video super-resolution,” in CVPR, 2022, pp. 5677–5686.
- R. Xu, X. Li, B. Zhou, and C. C. Loy, “Deep flow-guided video inpainting,” in CVPR, 2019, pp. 3723–3732.
- C. Gao, A. Saraf, J.-B. Huang, and J. Kopf, “Flow-edge guided video completion,” in ECCV, 2020, pp. 713–729.
- Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” in ICCV, 2017, pp. 5533–5541.
- O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241.
- X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More deformable, better results,” in CVPR, 2019, pp. 9308–9316.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021, pp. 10012–10022.
- J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao, L. Yuan, and J. Gao, “Focal attention for long-range interactions in vision transformers,” in NeurIPS, 2021, pp. 30008–30022.
- X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen, “Twins: Revisiting the design of spatial attention in vision transformers,” in NeurIPS, 2021, pp. 9355–9366.
- M. Ebdelli, O. Le Meur, and C. Guillemot, “Video inpainting with short-term windows: Application to object removal and error concealment,” IEEE Trans. IP, vol. 24, no. 10, pp. 3034–3047, 2015.
- M. Granados, K. I. Kim, J. Tompkin, J. Kautz, and C. Theobalt, “Background inpainting for videos with dynamic objects and a free-moving camera,” in ECCV, 2012.
- A. Newson, A. Almansa, M. Fradet, Y. Gousseau, and P. Pérez, “Video inpainting of complex scenes,” SIAM Journal on Imaging Sciences, vol. 7, no. 4, pp. 1993–2019, 2014.
- J.-B. Huang, S. B. Kang, N. Ahuja, and J. Kopf, “Temporally coherent completion of dynamic video,” ACM Trans. Graph., vol. 35, no. 6, pp. 196:1–11, 2016.
- K. Zhang, J. Fu, and D. Liu, “Inertia-guided flow completion and style fusion for video inpainting,” in CVPR, 2022, pp. 5982–5991.
- C. Wang, H. Huang, X. Han, and J. Wang, “Video inpainting by jointly learning temporal structure and spatial details,” in AAAI, vol. 33, 2019, pp. 5232–5239.
- Y.-L. Chang, Z. Y. Liu, K.-Y. Lee, and W. Hsu, “Free-form video inpainting with 3D gated convolution and temporal PatchGAN,” in ICCV, 2019, pp. 9066–9075.
- ——, “Learnable gated temporal shift module for deep video inpainting,” in BMVC, 2019.
- X. Zou, L. Yang, D. Liu, and Y. J. Lee, “Progressive temporal feature alignment network for video inpainting,” in CVPR, 2021, pp. 16443–16452.
- L. Ke, Y.-W. Tai, and C.-K. Tang, “Occlusion-aware video object inpainting,” in ICCV, 2021, pp. 14448–14458.
- A. Li, S. Zhao, X. Ma, M. Gong, J. Qi, R. Zhang, D. Tao, and R. Kotagiri, “Short-term and long-term context aggregation network for video inpainting,” in ECCV, 2020, pp. 728–743.
- S. Lee, S. W. Oh, D. Won, and S. J. Kim, “Copy-and-paste networks for deep video inpainting,” in ICCV, 2019, pp. 4413–4421.
- S. W. Oh, S. Lee, J.-Y. Lee, and S. J. Kim, “Onion-peel networks for deep video completion,” in ICCV, 2019, pp. 4403–4412.
- H. Zhang, L. Mai, N. Xu, Z. Wang, J. Collomosse, and H. Jin, “An internal learning approach to video inpainting,” in ICCV, 2019, pp. 2720–2729.
- H. Ouyang, T. Wang, and Q. Chen, “Internal video inpainting by implicit long-range propagation,” in ICCV, 2021, pp. 14559–14568.
- M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH ’00, 2000, pp. 417–424.
- C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, “PatchMatch: A randomized correspondence algorithm for structural image editing,” ACM Trans. Graph., vol. 28, no. 3, pp. 24:1–12, 2009.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy
- S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, and A. Veit, “Understanding robustness of transformers for image classification,” in ICCV, 2021, pp. 10231–10241.
- H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer, “Multiscale vision transformers,” in ICCV, 2021, pp. 6824–6835.
- H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “CvT: Introducing convolutions to vision transformers,” in ICCV, 2021, pp. 22–31.
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV. Springer, 2020, pp. 213–229.
- I. Misra, R. Girdhar, and A. Joulin, “An end-to-end transformer model for 3D object detection,” in ICCV, 2021, pp. 2906–2917.
- X. Wang, S. Zhang, Z. Qing, Y. Shao, Z. Zuo, C. Gao, and N. Sang, “OadTR: Online action detection with transformers,” in ICCV, 2021, pp. 7565–7575.
- H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “MaX-DeepLab: End-to-end panoptic segmentation with mask transformers,” in CVPR, 2021, pp. 5463–5474.
- T. Kalluri, D. Pathak, M. Chandraker, and D. Tran, “FLAVR: Flow-agnostic video representations for fast frame interpolation,” in WACV, 2023, pp. 2071–2082.
- K. Hara, H. Kataoka, and Y. Satoh, “Learning spatio-temporal features with 3D residual networks for action recognition,” in CVPR Workshops, 2017, pp. 3154–3160.
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in CVPR, 2015, pp. 4489–4497.
- X. Ying, L. Wang, Y. Wang, W. Sheng, W. An, and Y. Guo, “Deformable 3D convolution for video super-resolution,” IEEE Signal Processing Letters, vol. 27, pp. 1500–1504, 2020.
- X. Gu, H. Chang, B. Ma, H. Zhang, and X. Chen, “Appearance-preserving 3D convolution for video-based person re-identification,” in ECCV. Springer, 2020, pp. 228–243.
- H. Yang, C. Yuan, B. Li, Y. Du, J. Xing, W. Hu, and S. J. Maybank, “Asymmetric 3D convolutional neural networks for action recognition,” Pattern Recognition, vol. 85, pp. 1–12, 2019.
- J. Canny, “A computational approach to edge detection,” IEEE Trans. PAMI, vol. PAMI-8, no. 6, pp. 679–698, 1986.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman, E. Yumer, and M.-H. Yang, “Learning blind video temporal consistency,” in ECCV, 2018, pp. 170–185.
- X. Chu, Z. Tian, B. Zhang, X. Wang, and C. Shen, “Conditional positional encodings for vision transformers,” in ICLR, 2023. [Online]. Available: https://openreview.net/forum?id=3KWnuT-R1bh
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
- K. C. Chan, S. Zhou, X. Xu, and C. C. Loy, “BasicVSR++: Improving video super-resolution with enhanced propagation and alignment,” in CVPR, 2022, pp. 5962–5971.
- J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv:1607.06450, 2016.
- R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky, “Resolution-robust large mask inpainting with Fourier convolutions,” in WACV, 2022, pp. 2149–2159.
- Z. Qiu, H. Yang, J. Fu, and D. Fu, “Learning spatiotemporal frequency-transformer for compressed video super-resolution,” in ECCV, 2022, pp. 257–273.
- J. Huang, Y. Liu, F. Zhao, K. Yan, J. Zhang, Y. Huang, M. Zhou, and Z. Xiong, “Deep Fourier-based exposure correction network with spatial-frequency interaction,” in ECCV. Springer, 2022, pp. 163–180.
- N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang, “YouTube-VOS: A large-scale video object segmentation benchmark,” arXiv:1809.03327, 2018.
- S. Caelles, A. Montes, K.-K. Maninis, Y. Chen, L. V. Gool, F. Perazzi, and J. Pont-Tuset, “The 2018 DAVIS challenge on video object segmentation,” arXiv:1803.00557, 2018.
- Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. IP, vol. 13, no. 4, pp. 600–612, 2004.
- R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
- Z. Teed and J. Deng, “RAFT: Recurrent all-pairs field transforms for optical flow,” in ECCV. Springer, 2020, pp. 402–419.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015. [Online]. Available: https://arxiv.org/abs/1412.6980