
Improving Interpretable Embeddings for Ad-hoc Video Search with Generative Captions and Multi-word Concept Bank (2404.06173v1)

Published 9 Apr 2024 in cs.CV

Abstract: Aligning a user query with video clips in a cross-modal latent space and aligning it with semantic concepts are the two mainstream approaches for ad-hoc video search (AVS). However, the effectiveness of existing approaches is bottlenecked by the small size of available video-text datasets and the low quality of concept banks, which leads to failures on unseen queries and the out-of-vocabulary problem. This paper addresses these two problems by constructing a new dataset and developing a multi-word concept bank. Specifically, capitalizing on a generative model, we construct a new dataset consisting of seven million generated text-video pairs for pre-training. To tackle the out-of-vocabulary problem, we develop a multi-word concept bank based on syntax analysis to enhance the capability of a state-of-the-art interpretable AVS method in modeling relationships between query words. We also study the impact of current advanced features on the method. Experimental results show that integrating the proposed elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves the xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by margins ranging from 2% to 77%, averaging about 20%.
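To make the "multi-word concept bank based on syntax analysis" idea concrete, here is a minimal, hypothetical sketch of how multi-word concepts could be extracted from a query. It is not the authors' implementation: it assumes tokens are already POS-tagged and uses a toy chunker that groups contiguous adjective/noun runs into phrases; a real system would rely on a full syntactic parser.

```python
# Illustrative sketch (not the paper's actual method): build multi-word
# concepts by grouping contiguous ADJ*/NOUN+ runs in a POS-tagged query.

def extract_multiword_concepts(tagged_tokens):
    """Return multi-word phrases formed by adjacent adjective/noun tokens."""
    concepts, current = [], []
    for word, tag in tagged_tokens:
        if tag in ("ADJ", "NOUN"):
            current.append(word)
        else:
            if len(current) >= 2:          # keep only multi-word phrases
                concepts.append(" ".join(current))
            current = []
    if len(current) >= 2:                  # flush a trailing phrase
        concepts.append(" ".join(current))
    return concepts

# Hypothetical pre-tagged AVS query: "a man riding a red motor bike"
query = [("a", "DET"), ("man", "NOUN"), ("riding", "VERB"),
         ("a", "DET"), ("red", "ADJ"), ("motor", "NOUN"), ("bike", "NOUN")]
print(extract_multiword_concepts(query))   # → ['red motor bike']
```

Phrases like "red motor bike" can then index videos as single concepts, so queries whose individual words are in-vocabulary but whose combinations are not can still be matched.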

Authors (3)
  1. Jiaxin Wu (25 papers)
  2. Chong-Wah Ngo (55 papers)
  3. Wing-Kwong Chan (11 papers)
Citations (2)

