Leveraging Self-Supervised Learning for Scene Classification in Child Sexual Abuse Imagery (2403.01183v2)
Abstract: Crime in the 21st century spans a virtual and a real world, and the former has become a global menace to people's well-being and security in the latter. The challenges it presents must be met with unified global cooperation, and we must rely more than ever on automated yet trustworthy tools to combat the ever-growing scale of online offenses. Over 10 million child sexual abuse reports are submitted to the US National Center for Missing & Exploited Children every year, and over 80% originate from online sources, so investigation centers cannot manually process and correctly investigate all the imagery. Reliable automated tools that can securely and efficiently handle this data are therefore paramount. In this context, the scene classification task looks for contextual cues in the environment, making it possible to group and classify child sexual abuse data without training on sensitive material. The scarcity of, and restrictions on, child sexual abuse images lead us to self-supervised learning, a machine-learning methodology that leverages unlabeled data to produce powerful representations that transfer more easily to downstream tasks. This work shows that self-supervised deep learning models pre-trained on scene-centric data can reach 71.6% balanced accuracy on our indoor scene classification task, on average 2.2 percentage points better than a fully supervised version. We cooperate with Brazilian Federal Police experts to evaluate our indoor classification model on actual child sexual abuse material. The results demonstrate a notable discrepancy between the features observed in widely used scene datasets and those depicted in sensitive material.
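The abstract describes a transfer recipe — pre-train with self-supervision on scene-centric data, then adapt the representation to indoor scene classification and report balanced accuracy — without fixing the exact implementation. A minimal sketch of that recipe is a linear probe on a frozen backbone; the backbone weights, class count, and optimizer settings below are illustrative assumptions, not the paper's configuration:

```python
# Sketch of the transfer recipe from the abstract: freeze a pretrained
# backbone, train a linear classifier ("linear probe") on indoor scene
# labels, and report balanced accuracy. Hypothetical setup throughout.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.metrics import balanced_accuracy_score

NUM_SCENE_CLASSES = 10  # assumed number of indoor scene categories

# In practice the weights would come from an SSL pretraining run
# (e.g., a MoCo/SwAV checkpoint); the torchvision supervised weights
# here are only a stand-in to make the sketch self-contained.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()      # expose the 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad = False      # frozen backbone: probe only
backbone.eval()

probe = nn.Linear(2048, NUM_SCENE_CLASSES)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One linear-probe update on a batch of scene images."""
    with torch.no_grad():
        feats = backbone(images)  # representation stays fixed
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def evaluate(loader):
    """Balanced accuracy (mean of per-class recalls), the metric the
    abstract reports (71.6%), which is robust to class imbalance."""
    y_true, y_pred = [], []
    for images, labels in loader:
        logits = probe(backbone(images))
        y_true.extend(labels.tolist())
        y_pred.extend(logits.argmax(dim=1).tolist())
    return balanced_accuracy_score(y_true, y_pred)
```

Balanced accuracy is the natural headline metric here because indoor-scene categories are rarely balanced, and plain accuracy would reward a classifier that over-predicts the majority class.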
Authors: Pedro H. V. Valois, João Macedo, Leo S. F. Ribeiro, Jefersson A. dos Santos, Sandra Avila