GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features (2403.01437v2)
Abstract: Moment retrieval (MR) and highlight detection (HD) aim to identify relevant moments and highlights in a video given a corresponding natural language query. Large language models (LLMs) have demonstrated proficiency in various computer vision tasks. However, existing methods for MR&HD have not yet been integrated with LLMs. In this letter, we propose a novel two-stage model that takes the output of LLMs as the input to the second-stage transformer encoder-decoder. First, MiniGPT-4 is employed to generate detailed descriptions of video frames and to rewrite the query statement, both of which are fed into the encoder as new features. Then, semantic similarity is computed between the generated descriptions and the rewritten queries. Finally, continuous high-similarity video frames are converted into span anchors, which serve as prior positional information for the decoder. Experiments demonstrate that our approach achieves state-of-the-art results, and that, using only span anchors and similarity scores as outputs, its positioning accuracy outperforms traditional methods such as Moment-DETR.
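The second and third steps of the pipeline described in the abstract can be illustrated with a short sketch: compute per-frame semantic similarity between the generated frame descriptions and the rewritten query, then convert contiguous runs of high-similarity frames into span anchors. This is a minimal reconstruction, not the authors' implementation; the embedding source, the similarity threshold, and the helper names (`cosine_similarity`, `spans_from_similarity`) are assumptions made purely for illustration.

```python
import numpy as np

def cosine_similarity(desc_embs: np.ndarray, query_emb: np.ndarray) -> np.ndarray:
    """Per-frame cosine similarity between frame-description embeddings
    (shape [T, D]) and a single rewritten-query embedding (shape [D])."""
    desc_norm = desc_embs / np.linalg.norm(desc_embs, axis=1, keepdims=True)
    query_norm = query_emb / np.linalg.norm(query_emb)
    return desc_norm @ query_norm  # shape [T]

def spans_from_similarity(sim: np.ndarray, threshold: float = 0.5):
    """Group consecutive frames whose similarity exceeds the (assumed)
    threshold into (start, end) span anchors; end index is exclusive."""
    spans, start = [], None
    for t, s in enumerate(sim):
        if s >= threshold and start is None:
            start = t                      # a high-similarity run begins
        elif s < threshold and start is not None:
            spans.append((start, t))       # the run ends before frame t
            start = None
    if start is not None:                  # run extends to the final frame
        spans.append((start, len(sim)))
    return spans

# Toy usage: random vectors stand in for real description/query embeddings.
rng = np.random.default_rng(0)
desc_embs = rng.normal(size=(8, 16))
query_emb = rng.normal(size=16)
sim = cosine_similarity(desc_embs, query_emb)
print(spans_from_similarity(sim, threshold=0.0))
```

In the full model, such (start, end) anchors would be supplied to the transformer decoder as prior positional information, as the abstract describes.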
- S. Ghosh, A. Agarwal, Z. Parekh, and A. G. Hauptmann, “ExCL: Extractive Clip Localization Using Natural Language Descriptions,” in NAACL, 2019.
- S. Zhang, H. Peng, J. Fu, and J. Luo, “Learning 2d temporal adjacent networks for moment localization with natural language,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12870–12877.
- M. Xu, H. Wang, B. Ni, R. Zhu, Z. Sun, and C. Wang, “Cross-category video highlight detection via set-based learning,” in CVPR, 2021, pp. 7970–7979.
- K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in ECCV. Springer, 2016, pp. 766–782.
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, and F. Azhar, “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
- “Introducing ChatGPT.” [Online]. Available: https://openai.com/blog/chatgpt
- D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models,” arXiv preprint arXiv:2304.10592, 2023.
- T. Gong, C. Lyu, S. Zhang, Y. Wang, M. Zheng, Q. Zhao, K. Liu, W. Zhang, P. Luo, and K. Chen, “Multimodal-gpt: A vision and language model for dialogue with humans,” arXiv preprint arXiv:2305.04790, 2023.
- K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023.
- M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models,” arXiv preprint arXiv:2306.05424, 2023. [Online]. Available: http://arxiv.org/abs/2306.05424
- H. Zhang, A. Sun, W. Jing, and J. T. Zhou, “Temporal Sentence Grounding in Videos: A Survey and Future Directions,” arXiv preprint arXiv:2201.08071, 2022. [Online]. Available: http://arxiv.org/abs/2201.08071
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017, pp. 5998–6008.
- J. Lei, T. L. Berg, and M. Bansal, “Detecting Moments and Highlights in Videos via Natural Language Queries,” NeurIPS, vol. 34, pp. 11846–11858, 2021.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, and J. Clark, “Learning transferable visual models from natural language supervision,” in ICML. PMLR, 2021, pp. 8748–8763.
- Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” in ECCV. Springer, 2020, pp. 402–419.
- H. Xu, J. Zhang, J. Cai, H. Rezatofighi, and D. Tao, “Gmflow: Learning optical flow via global matching,” in CVPR, 2022, pp. 8121–8130.
- Y. Xu, M. Li, C. Peng, Y. Li, and S. Du, “Dual attention feature fusion network for monocular depth estimation,” in CAAI International Conference on Artificial Intelligence, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:245639385
- Y. Xu, C. Peng, M. Li, Y. Li, and S. Du, “Pyramid feature attention network for monocular depth prediction,” in 2021 IEEE International Conference on Multimedia and Expo (ICME). Shenzhen, China: IEEE, Jul 2021, pp. 1–6. [Online]. Available: https://ieeexplore.ieee.org/document/9428446/
- L. Zhen, P. Hu, X. Wang, and D. Peng, “Deep supervised cross-modal retrieval,” in CVPR, 2019, pp. 10386–10395. [Online]. Available: https://api.semanticscholar.org/CorpusID:198906771
- S. Chun, S. J. Oh, R. S. de Rezende, Y. Kalantidis, and D. Larlus, “Probabilistic embeddings for cross-modal retrieval,” in CVPR, June 2021, pp. 8415–8424.
- S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR,” in ICLR, 2022.
- L. Zhen, P. Hu, X. Peng, R. S. M. Goh, and J. T. Zhou, “Deep multimodal transfer learning for cross-modal retrieval,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 2, pp. 798–810, Feb. 2022.
- M. Cheng, Y. Sun, L. Wang, X. Zhu, K. Yao, J. Chen, G. Song, J. Han, J. Liu, E. Ding, and J. Wang, “Vista: Vision and scene text aggregation for cross-modal retrieval,” in CVPR, June 2022, pp. 5184–5193.
- Y. Xu, Y. Sun, Z. Xie, B. Zhai, Y. Jia, and S. Du, “Query-guided refinement and dynamic spans network for video highlight detection and temporal grounding in online information systems,” Int. J. Semant. Web Inf. Syst., vol. 19, no. 1, pp. 1–20, Jun 2023.
- G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural networks for LVCSR using rectified linear units and dropout,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 8609–8613.
- H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in CVPR, 2019, pp. 658–666.
- Y. Xu, Y. Sun, Y. Li, Y. Shi, X. Zhu, and S. Du, “Mh-detr: Video moment and highlight detection with cross-modal transformer,” arXiv preprint arXiv:2305.00355, 2023.
- J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
- I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
- K. Q. Lin, P. Zhang, J. Chen, S. Pramanick, D. Gao, A. J. Wang, R. Yan, and M. Z. Shou, “Univtg: Towards unified video-language temporal grounding,” arXiv preprint, 2023.
- V. Escorcia, M. Soldan, J. Sivic, B. Ghanem, and B. Russell, “Temporal localization of moments in video collections with natural language,” arXiv preprint arXiv:1907.12763, 2019.
- J. Lei, L. Yu, T. L. Berg, and M. Bansal, “Tvr: A large-scale dataset for video-subtitle moment retrieval,” in ECCV. Springer, 2020, pp. 447–463.
- Y. Liu, S. Li, Y. Wu, C.-W. Chen, Y. Shan, and X. Qie, “UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection,” in CVPR, 2022, pp. 3042–3051.
- Yunzhuo Sun
- Yifang Xu
- Zien Xie
- Yukun Shu
- Sidan Du