Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement (2402.13576v2)

Published 21 Feb 2024 in cs.CV and cs.IR

Abstract: Video Corpus Moment Retrieval (VCMR) is a video retrieval task that aims to retrieve a relevant moment from a large corpus of untrimmed videos using a text query. The relevance between the video and the query is partial, mainly in two aspects: (1) Scope: an untrimmed video contains many frames, but not all are relevant to the query; strong relevance is typically observed only within the relevant moment. (2) Modality: the relevance of the query varies across modalities; action descriptions align more with visual elements, while character conversations relate more to textual information. Existing methods often treat all video contents equally, leading to sub-optimal moment retrieval. We argue that effectively capturing the partial relevance between the query and the video is essential for the VCMR task. To this end, we propose a Partial Relevance Enhanced Model (PREM) to improve VCMR. VCMR involves two sub-tasks: video retrieval and moment localization. To align with their distinct objectives, we implement specialized partial relevance enhancement strategies. For video retrieval, we introduce a multi-modal collaborative video retriever that generates different query representations for the two modalities via modality-specific pooling, ensuring a more effective match. For moment localization, we propose a focus-then-fuse moment localizer that uses modality-specific gates to capture essential content. We also introduce relevant-content-enhanced training methods for both the retriever and the localizer to strengthen the model's ability to capture relevant content. Experimental results on the TVR and DiDeMo datasets show that the proposed model outperforms the baselines, achieving a new state of the art for VCMR. The code is available at https://github.com/hdy007007/PREM.
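The modality-specific pooling the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general idea under simple assumptions: the query's token embeddings are attention-pooled with two separately learned weight vectors, yielding one query representation matched against the video's visual features and another matched against its subtitle features. The names `modality_pool`, `w_vis`, and `w_sub` are illustrative placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_pool(query_tokens, w):
    """Attention-pool query token embeddings (L, d) into a single
    d-dim vector, using a modality-specific learned vector w (d,)."""
    scores = softmax(query_tokens @ w)   # (L,) attention over tokens
    return scores @ query_tokens         # (d,) weighted sum

rng = np.random.default_rng(0)
L, d = 5, 8                              # query length, embedding dim
q = rng.normal(size=(L, d))              # query token embeddings
w_vis = rng.normal(size=d)               # learned vector for visual modality
w_sub = rng.normal(size=d)               # learned vector for subtitle modality

q_vis = modality_pool(q, w_vis)          # query repr. for visual matching
q_sub = modality_pool(q, w_sub)          # query repr. for subtitle matching

# Each representation is matched against the corresponding video modality;
# the retrieval score combines both modality-wise similarities.
video_vis = rng.normal(size=d)
video_sub = rng.normal(size=d)
score = float(q_vis @ video_vis + q_sub @ video_sub)
```

In a trained retriever, `w_vis` and `w_sub` would be optimized end to end, so the two pooled query representations emphasize the query words most informative for each modality.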

Authors (4)
  1. Danyang Hou
  2. Liang Pang
  3. Huawei Shen
  4. Xueqi Cheng
Citations (1)