
GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features (2403.01437v2)

Published 3 Mar 2024 in cs.CV and cs.AI

Abstract: Moment retrieval (MR) and highlight detection (HD) aim to identify relevant moments and highlights in a video from a corresponding natural language query. LLMs have demonstrated proficiency in various computer vision tasks. However, existing methods for MR&HD have not yet been integrated with LLMs. In this letter, we propose a novel two-stage model that takes the output of LLMs as the input to a second-stage transformer encoder-decoder. First, MiniGPT-4 is employed to generate detailed descriptions of the video frames and to rewrite the query statement, both of which are fed into the encoder as new features. Then, semantic similarity is computed between the generated descriptions and the rewritten queries. Finally, continuous high-similarity video frames are converted into span anchors, which serve as prior position information for the decoder. Experiments demonstrate that our approach achieves state-of-the-art results and that, using only span anchors and similarity scores as outputs, its positioning accuracy outperforms that of traditional methods such as Moment-DETR.
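Below is a minimal illustrative sketch (Python, not the paper's implementation) of the span-anchor step described in the abstract: cosine similarity is computed between frame-description embeddings and the rewritten-query embedding, and runs of consecutive high-similarity frames are grouped into (start, end) span anchors. The variable names, the random toy vectors, and the threshold value are assumptions for illustration only.

import numpy as np

def cosine_sim(matrix, vector):
    # Cosine similarity between each row of `matrix` and `vector`.
    m = matrix / (np.linalg.norm(matrix, axis=1, keepdims=True) + 1e-8)
    v = vector / (np.linalg.norm(vector) + 1e-8)
    return m @ v

def frames_to_span_anchors(scores, threshold):
    # Group runs of consecutive frames whose similarity exceeds `threshold`
    # into (start, end) frame-index pairs (inclusive), used as span anchors.
    anchors, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            anchors.append((start, i - 1))
            start = None
    if start is not None:
        anchors.append((start, len(scores) - 1))
    return anchors

# Toy example: 8 frame-description embeddings and one rewritten-query embedding.
rng = np.random.default_rng(0)
descriptions_emb = rng.normal(size=(8, 512))  # stand-in for embedded frame descriptions
query_emb = rng.normal(size=512)              # stand-in for the embedded rewritten query
scores = cosine_sim(descriptions_emb, query_emb)
print(frames_to_span_anchors(scores, threshold=0.0))

With real inputs, descriptions_emb would hold sentence embeddings of the MiniGPT-4 frame descriptions and query_emb the embedding of the rewritten query; the resulting anchors would then be passed to the decoder as prior position information.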

References (33)
  1. S. Ghosh, A. Agarwal, Z. Parekh, and A. G. Hauptmann, “ExCL: Extractive Clip Localization Using Natural Language Descriptions,” in NAACL, 2019.
  2. S. Zhang, H. Peng, J. Fu, and J. Luo, “Learning 2d temporal adjacent networks for moment localization with natural language,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12870–12877.
  3. M. Xu, H. Wang, B. Ni, R. Zhu, Z. Sun, and C. Wang, “Cross-category video highlight detection via set-based learning,” in CVPR, 2021, pp. 7970–7979.
  4. K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in ECCV. Springer, 2016, pp. 766–782.
  5. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, and F. Azhar, “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  6. “Introducing ChatGPT.” [Online]. Available: https://openai.com/blog/chatgpt
  7. D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models,” arXiv preprint arXiv:2304.10592, 2023.
  8. T. Gong, C. Lyu, S. Zhang, Y. Wang, M. Zheng, Q. Zhao, K. Liu, W. Zhang, P. Luo, and K. Chen, “Multimodal-gpt: A vision and language model for dialogue with humans,” arXiv preprint arXiv:2305.04790, 2023.
  9. K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023.
  10. M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models,” arXiv preprint arXiv:2306.05424, 2023.
  11. H. Zhang, A. Sun, W. Jing, and J. T. Zhou, “Temporal Sentence Grounding in Videos: A Survey and Future Directions,” arXiv preprint arXiv:2201.08071, 2022.
  12. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017, pp. 5998–6008.
  13. J. Lei, T. L. Berg, and M. Bansal, “Detecting Moments and Highlights in Videos via Natural Language Queries,” NeurIPS, vol. 34, pp. 11846–11858, 2021.
  14. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, and J. Clark, “Learning transferable visual models from natural language supervision,” in ICML. PMLR, 2021, pp. 8748–8763.
  15. Z. Teed and J. Deng, “RAFT: Recurrent all-pairs field transforms for optical flow,” in ECCV. Springer, 2020, pp. 402–419.
  16. H. Xu, J. Zhang, J. Cai, H. Rezatofighi, and D. Tao, “Gmflow: Learning optical flow via global matching,” in CVPR, 2022, pp. 8121–8130.
  17. Y. Xu, M. Li, C. Peng, Y. Li, and S. Du, “Dual attention feature fusion network for monocular depth estimation,” in CAAI International Conference on Artificial Intelligence, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:245639385
  18. Y. Xu, C. Peng, M. Li, Y. Li, and S. Du, “Pyramid feature attention network for monocular depth prediction,” in 2021 IEEE International Conference on Multimedia and Expo (ICME). Shenzhen, China: IEEE, Jul 2021, pp. 1–6. [Online]. Available: https://ieeexplore.ieee.org/document/9428446/
  19. L. Zhen, P. Hu, X. Wang, and D. Peng, “Deep supervised cross-modal retrieval,” in CVPR, 2019, pp. 10386–10395. [Online]. Available: https://api.semanticscholar.org/CorpusID:198906771
  20. S. Chun, S. J. Oh, R. S. de Rezende, Y. Kalantidis, and D. Larlus, “Probabilistic embeddings for cross-modal retrieval,” in CVPR, June 2021, pp. 8415–8424.
  21. S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR,” in ICLR, 2022.
  22. L. Zhen, P. Hu, X. Peng, R. S. M. Goh, and J. T. Zhou, “Deep multimodal transfer learning for cross-modal retrieval,” IEEE Trans Neural Netw Learn Syst, vol. 33, no. 2, pp. 798–810, Feb 2022.
  23. M. Cheng, Y. Sun, L. Wang, X. Zhu, K. Yao, J. Chen, G. Song, J. Han, J. Liu, E. Ding, and J. Wang, “Vista: Vision and scene text aggregation for cross-modal retrieval,” in CVPR, June 2022, pp. 5184–5193.
  24. Y. Xu, Y. Sun, Z. Xie, B. Zhai, Y. Jia, and S. Du, “Query-guided refinement and dynamic spans network for video highlight detection and temporal grounding in online information systems,” Int. J. Semant. Web Inf. Syst., vol. 19, no. 1, pp. 1–20, Jun 2023.
  25. G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural networks for LVCSR using rectified linear units and dropout,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8609–8613.
  26. H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in CVPR, 2019, pp. 658–666.
  27. Y. Xu, Y. Sun, Y. Li, Y. Shi, X. Zhu, and S. Du, “Mh-detr: Video moment and highlight detection with cross-modal transformer,” arXiv preprint arXiv:2305.00355, 2023.
  28. J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  29. I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  30. K. Q. Lin, P. Zhang, J. Chen, S. Pramanick, D. Gao, A. J. Wang, R. Yan, and M. Z. Shou, “Univtg: Towards unified video-language temporal grounding,” arXiv preprint, 2023.
  31. V. Escorcia, M. Soldan, J. Sivic, B. Ghanem, and B. Russell, “Temporal localization of moments in video collections with natural language,” arXiv preprint arXiv:1907.12763, 2019.
  32. J. Lei, L. Yu, T. L. Berg, and M. Bansal, “Tvr: A large-scale dataset for video-subtitle moment retrieval,” in ECCV. Springer, 2020, pp. 447–463.
  33. Y. Liu, S. Li, Y. Wu, C.-W. Chen, Y. Shan, and X. Qie, “UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection,” in CVPR, 2022, pp. 3042–3051.
Authors (5)
  1. Yunzhuo Sun (5 papers)
  2. Yifang Xu (18 papers)
  3. Zien Xie (3 papers)
  4. Yukun Shu (1 paper)
  5. Sidan Du (10 papers)
Citations (6)
