MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors (2309.16701v4)

Published 15 Aug 2023 in cs.CV, cs.AI, and cs.CL

Abstract: With the explosion of multimedia content, video moment retrieval (VMR), which aims to detect a video moment matching a given text query within a video, has been studied intensively as a critical problem. However, the existing VMR framework evaluates retrieval performance under the assumption that the ground-truth video is given, so it cannot reveal whether models make overconfident predictions on falsely given (irrelevant) videos. In this paper, we propose the MVMR (Massive Videos Moment Retrieval for Faithfulness Evaluation) task, which aims to retrieve video moments from a massive video set containing multiple distractors, in order to evaluate the faithfulness of VMR models. For this task, we suggest an automated massive video pool construction framework that categorizes negative (distractor) and positive (false-negative) video sets using textual and visual semantic distance verification methods. We extend existing VMR datasets with these methods and construct three practical MVMR datasets. To solve the task, we further propose a strong informative-sample-weighted learning method, CroCs, which employs two contrastive learning mechanisms: (1) weakly-supervised potential negative learning and (2) cross-directional hard-negative learning. Experimental results on the MVMR datasets reveal that existing VMR models are easily distracted by misinformation (distractors), whereas our model remains significantly more robust, demonstrating that CroCs is essential for distinguishing positive moments from distractors. Our code and datasets are publicly available: https://github.com/yny0506/Massive-Videos-Moment-Retrieval.
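
The pool-construction idea can be sketched with the textual semantic-distance check alone. The snippet below is a minimal, hypothetical illustration, not the authors' exact pipeline: the SimCSE checkpoint name, the similarity thresholds, and the categorize_pool helper are illustrative assumptions, and the paper's additional visual semantic-distance verification is omitted here.

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed SimCSE-style sentence encoder; any sentence-embedding model works.
MODEL_NAME = "princeton-nlp/sup-simcse-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    # Encode captions/queries and take the [CLS] vector, L2-normalized
    # so that dot products below are cosine similarities.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return torch.nn.functional.normalize(out.last_hidden_state[:, 0], dim=-1)

def categorize_pool(query, candidate_captions, neg_thresh=0.35, pos_thresh=0.80):
    # Candidates semantically far from the query are kept as distractors
    # (negatives); candidates very close to it are flagged as potential
    # false negatives, i.e., videos that may actually contain the queried
    # moment and should not be penalized as negatives. Thresholds are
    # illustrative, not values from the paper.
    sims = (embed(candidate_captions) @ embed([query]).T).squeeze(-1)
    distractors = [c for c, s in zip(candidate_captions, sims) if s < neg_thresh]
    false_negatives = [c for c, s in zip(candidate_captions, sims) if s >= pos_thresh]
    return distractors, false_negatives

Thresholding the similarity from both ends keeps clearly unrelated videos as distractors while filtering out near-duplicates that a naive negative-sampling scheme would falsely treat as negatives.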

Authors (5)
  1. Nakyeong Yang (9 papers)
  2. Minsung Kim (34 papers)
  3. Seunghyun Yoon (64 papers)
  4. Joongbo Shin (14 papers)
  5. Kyomin Jung (76 papers)