
The Potential of Vision-Language Models for Content Moderation of Children's Videos (2312.03936v1)

Published 6 Dec 2023 in cs.CV, cs.CY, cs.LG, and cs.SI

Abstract: Natural language supervision has been shown to be effective for zero-shot learning in many computer vision tasks, such as object detection and activity recognition. However, generating informative prompts can be challenging for more subtle tasks, such as video content moderation, because there are many reasons a video might be inappropriate beyond violence and obscenity. For example, scammers may create junk content that resembles popular educational videos but contains no meaningful information. This paper evaluates the performance of several CLIP variations for content moderation of children's cartoons in both the supervised and zero-shot settings. We show that our proposed model (Vanilla CLIP with Projection Layer) outperforms previous work on the Malicious or Benign (MOB) benchmark for video content moderation. This paper presents an in-depth analysis of how context-specific language prompts affect content moderation performance. Our results indicate that it is important to include more context in content moderation prompts, particularly for cartoon videos, as they are not well represented in CLIP's training data.
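
To make the setup concrete, below is a minimal Python sketch (not the authors' released code) of zero-shot CLIP-based moderation with context-specific prompts, plus an illustrative trainable head over frozen CLIP features standing in for the "Vanilla CLIP with Projection Layer" variant. The checkpoint, prompt wording, frame aggregation, and head dimensions are all assumptions for illustration.

```python
# Hedged sketch of zero-shot content moderation with CLIP: compare sampled
# video frames against context-specific language prompts, as the abstract
# describes. Model name, prompts, and frame aggregation are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Context-rich prompts: the paper argues that adding cartoon-specific context
# matters because cartoons are under-represented in CLIP's training data.
prompts = [
    "a frame from a safe, educational cartoon video for children",
    "a frame from an inappropriate or disturbing cartoon video for children",
]

def score_frames(frames):
    """Return mean per-prompt probabilities over a list of PIL video frames."""
    inputs = processor(text=prompts, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (num_frames, num_prompts)
    return logits.softmax(dim=-1).mean(dim=0)      # (num_prompts,)

# Supervised variant (illustrative): a small trainable head over frozen CLIP
# image features, a stand-in for the paper's projection-layer model; the
# input dimension (512 for ViT-B/32) and single linear layer are assumptions.
class ProjectionHead(torch.nn.Module):
    def __init__(self, in_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, num_classes)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.proj(image_features)
```

Usage would follow the usual pattern: sample frames from a video, call `score_frames(frames)`, and flag the video when the "inappropriate" probability exceeds a threshold; for the supervised variant, extract features with `model.get_image_features(...)` and train only the head on MOB labels.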

Authors (3)
  1. Syed Hammad Ahmed (5 papers)
  2. Shengnan Hu (8 papers)
  3. Gita Sukthankar (33 papers)
Citations (1)

