GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval (2310.05195v2)
Abstract: Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos in a database that contain pertinent moments. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and incurs a large storage overhead. To address this efficiency problem, this paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer that models clip representations implicitly. During frame interactions, we incorporate Gaussian-Mixture-Model constraints so that each frame attends to its adjacent frames instead of the whole video. The generated representations then contain multi-scale clip information, achieving implicit clip modeling. In addition, existing PRVR methods ignore semantic differences between text queries relevant to the same video, leading to a sparse embedding space. We propose a query diverse loss to distinguish these text queries, making the embedding space more compact and semantically richer. Extensive experiments on three large-scale video datasets (i.e., TVR, ActivityNet Captions, and Charades-STA) demonstrate the superiority and efficiency of GMMFormer. Code is available at \url{https://github.com/huangmozhi9527/GMMFormer}.
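The core idea of Gaussian-constrained frame interaction can be illustrated with a minimal sketch: bias each frame's attention with a Gaussian prior over frame distance, then average several window scales to approximate the mixture. This is not the authors' implementation; the function names and the choice of sigma values are illustrative assumptions.

```python
import numpy as np

def gaussian_attention(scores, sigma):
    """Attention weights biased toward temporally adjacent frames.

    scores: (T, T) raw attention logits between T video frames.
    sigma:  width of the Gaussian prior (illustrative, not from the paper).
    """
    T = scores.shape[0]
    idx = np.arange(T)
    # Gaussian prior centered on each query frame: frames far from
    # position i receive exponentially smaller weight.
    dist2 = (idx[:, None] - idx[None, :]) ** 2
    gauss = np.exp(-dist2 / (2.0 * sigma ** 2))
    # Softmax over scores, multiplied by the Gaussian constraint,
    # then renormalized so each row sums to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True)) * gauss
    return weights / weights.sum(axis=-1, keepdims=True)

def multi_scale_attention(scores, sigmas=(1.0, 4.0, 16.0)):
    """Mixture over several Gaussian scales, mimicking multi-scale
    clip information (sigmas chosen here only for illustration)."""
    return np.mean([gaussian_attention(scores, s) for s in sigmas], axis=0)
```

With uniform logits, the resulting attention of a frame concentrates on its neighbors rather than spreading over the whole video, which is the intuition behind implicit clip modeling.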
Authors: Yuting Wang, Jinpeng Wang, Bin Chen, Ziyun Zeng, Shu-Tao Xia