Perceptual MAE for Image Manipulation Localization: A High-level Vision Learner Focusing on Low-level Features (2310.06525v1)
Abstract: Multimedia forensics faces unprecedented challenges from the rapid advancement of multimedia generation technology, making Image Manipulation Localization (IML) crucial in the pursuit of truth. The key to IML lies in revealing the artifacts or inconsistencies between tampered and authentic areas, which are evident in pixel-level features. Consequently, existing studies treat IML as a low-level vision task, predicting tampering masks by crafting pixel-level features such as RGB noise, edge signals, or high-frequency features. In practice, however, tampering commonly occurs at the object level, and different classes of objects have varying likelihoods of becoming targets of tampering. Object semantics are therefore also vital for identifying tampered areas, in addition to pixel-level features, which requires IML models to perform semantic understanding of the entire image. In this paper, we reformulate the IML task as a high-level vision task that greatly benefits from low-level features. Based on this interpretation, we propose a method that enhances the Masked Autoencoder (MAE) with high-resolution inputs and a perceptual loss supervision module, termed Perceptual MAE (PMAE). While MAE has demonstrated an impressive understanding of object semantics, PMAE additionally captures low-level features through our proposed enhancements. As evidenced by extensive experiments, this paradigm effectively unites the low-level and high-level features of the IML task and outperforms state-of-the-art tampering localization methods on all five publicly available datasets.
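The perceptual loss supervision mentioned in the abstract is, in the standard formulation (Johnson et al.), a mean-squared error computed in the feature space of a pretrained network such as VGG, rather than in pixel space. The sketch below is illustrative only: a fixed hand-written convolution stands in for a pretrained feature extractor, which is an assumption for self-containment, not the paper's actual module.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D convolution; a stand-in for one pretrained VGG layer."""
    h, w = img.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def perceptual_loss(pred, target, kernel):
    """MSE between feature maps of prediction and target.

    Penalizing feature-space (rather than pixel-space) differences is
    what lets this kind of supervision carry low-level texture and
    edge information back into the reconstruction.
    """
    f_pred = conv2d(pred, kernel)
    f_target = conv2d(target, kernel)
    return np.mean((f_pred - f_target) ** 2)
```

In practice the feature extractor is a frozen pretrained network with multiple tap points, and the perceptual term is weighted against a pixel-level reconstruction loss; the sketch keeps only the core idea of comparing features instead of raw pixels.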
Authors: Xiaochen Ma, Jizhe Zhou, Xiong Xu, Zhuohang Jiang, Chi-Man Pun