Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

GL-RG: Global-Local Representation Granularity for Video Captioning (2205.10706v2)

Published 22 May 2022 in cs.CV

Abstract: Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model global-local representation across video frames for caption generation, leaving plenty of room for improvement. In this work, we approach the video captioning task from a new perspective and propose a GL-RG framework for video captioning, namely a \textbf{G}lobal-\textbf{L}ocal \textbf{R}epresentation \textbf{G}ranularity. Our GL-RG demonstrates three advantages over the prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce rich semantic vocabulary to obtain a descriptive granularity of video contents across frames; 3) we develop an incremental training strategy which organizes model learning in an incremental fashion to incur an optimal captioning behavior. Experimental results on the challenging MSR-VTT and MSVD datasets show that our DL-RG outperforms recent state-of-the-art methods by a significant margin. Code is available at \url{https://github.com/ylqi/GL-RG}.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (7)
  1. Liqi Yan (9 papers)
  2. Qifan Wang (129 papers)
  3. Yiming Cui (80 papers)
  4. Fuli Feng (143 papers)
  5. Xiaojun Quan (52 papers)
  6. Xiangyu Zhang (328 papers)
  7. Dongfang Liu (44 papers)
Citations (53)