Text-Visual Prompting for Efficient 2D Temporal Video Grounding (2303.04995v3)

Published 9 Mar 2023 in cs.CV and cs.AI

Abstract: In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming and demands intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (which we call 'prompts') into both the visual inputs and the textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train the vision and language encoders of a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity, sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., a 9.79% improvement on Charades-STA and a 30.77% improvement on ActivityNet Captions) and achieves 5x inference acceleration over TVG using 3D visual features. Code is available at Open.Intel.
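
The abstract compresses two technical ideas: (i) learnable perturbation "prompts" injected into the sparsely sampled 2D frame inputs and the textual features, and (ii) a Temporal-Distance IoU (TDIoU) loss over predicted (start, end) moment boundaries. Below is a minimal PyTorch sketch of both, assuming a border-pad placement for the visual prompt and a 1D Distance-IoU-style formula for TDIoU; `FramePadPrompt`, the pad width, and the exact penalty term are illustrative assumptions, not the authors' released implementation (see Open.Intel for the official code).

```python
# Illustrative sketch only -- not the authors' released implementation.
import torch
import torch.nn as nn


class FramePadPrompt(nn.Module):
    """Learnable border perturbation added to each sampled 2D frame.

    The border placement and pad width are assumptions; the paper only
    states that optimized perturbation patterns are applied to the
    visual inputs of the 2D TVG model.
    """

    def __init__(self, frame_size: int = 224, pad: int = 16):
        super().__init__()
        # One trainable pattern shared across all sampled frames.
        self.pattern = nn.Parameter(torch.zeros(3, frame_size, frame_size))
        mask = torch.zeros(1, frame_size, frame_size)
        mask[:, :pad, :] = 1.0
        mask[:, -pad:, :] = 1.0
        mask[:, :, :pad] = 1.0
        mask[:, :, -pad:] = 1.0
        self.register_buffer("mask", mask)  # confine the prompt to a border

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W) -- sparsely sampled 2D frames
        return frames + self.mask * self.pattern


def tdiou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """1D Distance-IoU-style loss over normalized (start, end) boundaries.

    A plausible reading of TDIoU: temporal IoU plus a center-distance
    penalty normalized by the smallest enclosing span, i.e. the 1D
    analogue of the Distance-IoU loss used for bounding-box regression.
    pred, target: (batch, 2) tensors with start < end in [0, 1].
    """
    eps = 1e-6
    inter = (torch.min(pred[:, 1], target[:, 1])
             - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (target[:, 1] - target[:, 0]) - inter
    hull = (torch.max(pred[:, 1], target[:, 1])
            - torch.min(pred[:, 0], target[:, 0]))  # smallest enclosing span
    iou = inter / union.clamp(min=eps)
    center_penalty = ((pred.mean(dim=1) - target.mean(dim=1)) ** 2
                      / hull.clamp(min=eps) ** 2)
    return (1.0 - iou + center_penalty).mean()
```

During training, the prompt parameters would be optimized jointly with the 2D vision and language encoders against this loss, e.g. `loss = tdiou_loss(model(prompt(frames), text), target)`; per the abstract, this co-training is what lets sparse 2D features approach the accuracy of dense 3D features at roughly 5x lower inference cost.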

Authors (5)
  1. Yimeng Zhang (33 papers)
  2. Xin Chen (456 papers)
  3. Jinghan Jia (30 papers)
  4. Sijia Liu (204 papers)
  5. Ke Ding (30 papers)
Citations (21)