Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs (2407.02411v2)
Abstract: The advent of video-based LLMs has significantly enhanced video understanding. However, it has also raised safety concerns regarding data protection, as videos can now be annotated more easily, even without authorization. This paper introduces Video Watermarking, a novel technique to protect videos from unauthorized annotation by such video-based LLMs, particularly annotation of the video content and description in response to specific queries. By imperceptibly embedding watermarks into key video frames with multi-modal flow-based losses, our method preserves the viewing experience while preventing misuse by video-based LLMs. Extensive experiments show that Video Watermarking significantly reduces the comprehensibility of videos to various video-based LLMs, demonstrating both stealth and robustness. In essence, our method provides a solution for securing video content, ensuring its integrity and confidentiality in the face of evolving video-based LLM technologies.