Event-aware Video Corpus Moment Retrieval (2402.13566v1)

Published 21 Feb 2024 in cs.CV and cs.IR

Abstract: Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos using a natural language query. Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames and ranking videos by maximum frame similarity. However, this approach overlooks the semantic structure embedded in the information between frames, namely the event, a crucial element for human comprehension of videos. Motivated by this, we propose EventFormer, a model that explicitly uses events within videos as the fundamental units for video retrieval. The model extracts event representations through event reasoning and hierarchical event encoding: the event reasoning module groups consecutive, visually similar frame representations into events, while hierarchical event encoding captures information at both the frame and event levels. We also introduce anchor multi-head self-attention to encourage the Transformer to capture the relevance of adjacent content in the video. EventFormer is trained with two-branch contrastive learning and dual optimization for the two sub-tasks of VCMR. Extensive experiments on the TVR, ANetCaps, and DiDeMo benchmarks show the effectiveness and efficiency of EventFormer in VCMR, achieving new state-of-the-art results. The effectiveness of EventFormer is also validated on the partially relevant video retrieval task.
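The abstract does not give implementation details, but its two core ideas can be sketched concretely. Below is a minimal, illustrative PyTorch sketch: grouping consecutive, visually similar frame representations into events (here via a cosine-similarity threshold against a running event centroid, with mean-pooling; these are assumed choices, not the paper's exact formulation), plus a local-window attention mask as one plausible reading of "anchor multi-head self-attention" biasing the Transformer toward adjacent content. The function names and the threshold/window parameters are hypothetical.

```python
# Illustrative sketch only: the grouping rule (cosine similarity against a
# running centroid) and mean-pooling are assumptions, not the paper's method.
import torch
import torch.nn.functional as F

def group_frames_into_events(frames: torch.Tensor, threshold: float = 0.8) -> torch.Tensor:
    """Group consecutive, visually similar frames into events.

    frames: (T, D) frame representations.
    Returns (E, D) event representations (mean of each group's frames).
    """
    events, current = [], [frames[0]]
    for t in range(1, frames.size(0)):
        centroid = torch.stack(current).mean(dim=0)   # running event centroid
        sim = F.cosine_similarity(frames[t], centroid, dim=0)
        if sim >= threshold:
            current.append(frames[t])                 # same event: extend it
        else:
            events.append(torch.stack(current).mean(dim=0))  # close the event
            current = [frames[t]]
    events.append(torch.stack(current).mean(dim=0))
    return torch.stack(events)

def anchor_attention_mask(seq_len: int, window: int = 2) -> torch.Tensor:
    """One plausible reading of 'anchor' self-attention: a boolean (T, T)
    mask restricting each position to a local window of adjacent content."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

frames = F.normalize(torch.randn(12, 256), dim=-1)    # 12 frames, 256-d features
events = group_frames_into_events(frames)
print(events.shape, anchor_attention_mask(12).shape)  # (E, 256), (12, 12)
```

In this reading, event representations replace individual frames as retrieval units, so a query is scored against far fewer, semantically coherent segments, which is consistent with the efficiency claim in the abstract.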

Authors (4)
  1. Danyang Hou
  2. Liang Pang
  3. Huawei Shen
  4. Xueqi Cheng