GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval (2310.05195v2)

Published 8 Oct 2023 in cs.CV, cs.AI, cs.IR, and cs.MM

Abstract: Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos in a database that contain pertinent moments. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and requires a large storage overhead. To address this efficiency problem, this paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer that models clip representations implicitly. During frame interactions, Gaussian-Mixture-Model constraints focus each frame on its adjacent frames instead of the whole video, so the generated representations contain multi-scale clip information and achieve implicit clip modeling. In addition, existing PRVR methods ignore the semantic differences between text queries relevant to the same video, leading to a sparse embedding space. We propose a query diverse loss to distinguish these queries, making the embedding space denser and more semantically informative. Extensive experiments on three large-scale video datasets (TVR, ActivityNet Captions, and Charades-STA) demonstrate the superiority and efficiency of GMMFormer. Code is available at https://github.com/huangmozhi9527/GMMFormer.
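
As a rough, self-contained sketch of the two ideas in the abstract (not the authors' implementation; see the linked repository for that), the snippet below assumes PyTorch and uses illustrative names, shapes, and hyperparameters: a Gaussian-biased self-attention in which each frame attends mainly to its temporal neighbors, and a margin-based query diverse loss that pushes apart embeddings of different text queries describing the same video. The multi-scale clip information the abstract mentions plausibly comes from combining Gaussian windows of several widths (several values of sigma), which this sketch omits.

```python
# Illustrative sketch only -- names, shapes, and hyperparameters are assumptions,
# not the authors' code (see https://github.com/huangmozhi9527/GMMFormer).
import torch
import torch.nn.functional as F


def gaussian_constrained_attention(frames: torch.Tensor, sigma: float = 2.0) -> torch.Tensor:
    """Self-attention over video frames, biased by a Gaussian prior so that
    each frame attends mostly to its temporal neighbors.

    frames: (batch, num_frames, dim)
    """
    b, t, d = frames.shape
    scores = frames @ frames.transpose(1, 2) / d ** 0.5      # (b, t, t) dot-product scores
    pos = torch.arange(t, device=frames.device, dtype=frames.dtype)
    dist2 = (pos[:, None] - pos[None, :]) ** 2               # squared frame distance, (t, t)
    gauss_bias = -dist2 / (2 * sigma ** 2)                   # log of a Gaussian window
    attn = torch.softmax(scores + gauss_bias, dim=-1)        # nearby frames dominate
    return attn @ frames                                     # (b, t, d) clip-aware features


def query_diverse_loss(query_emb: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Push apart embeddings of different text queries that describe the
    *same* video, so the joint embedding space does not collapse.

    query_emb: (num_queries_for_one_video, dim), L2-normalized.
    """
    sim = query_emb @ query_emb.T                            # pairwise cosine similarities
    n = sim.size(0)
    mask = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    off_diag = sim[mask]                                     # drop self-similarities
    return F.relu(off_diag - margin).mean()                  # penalize overly similar pairs


if __name__ == "__main__":
    feats = gaussian_constrained_attention(torch.randn(2, 16, 64))
    q = F.normalize(torch.randn(4, 64), dim=-1)
    print(feats.shape, query_diverse_loss(q).item())
```

Because the Gaussian bias is added before the softmax rather than enforced by scanning over explicit clip proposals, clip structure is modeled implicitly, which is the source of the storage savings the abstract claims over scanning-based methods.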

Authors (5)
  1. Yuting Wang (112 papers)
  2. Jinpeng Wang (48 papers)
  3. Bin Chen (546 papers)
  4. Ziyun Zeng (16 papers)
  5. Shu-Tao Xia (171 papers)
Citations (3)