
OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition (2312.00096v2)

Published 30 Nov 2023 in cs.CV

Abstract: Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy between web-scale descriptive narratives and concise action category names, leading to a less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors, thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors to different video instances, we propose the Optimal Descriptor Solver, which formulates video recognition as solving the optimal matching flow between frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
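The matching step described in the abstract can be illustrated with a small sketch. It assumes the "optimal matching flow" is an entropy-regularized optimal transport problem solved with Sinkhorn iterations over a cosine-distance cost with uniform marginals. The abstract does not specify any of these choices, and the function names (`sinkhorn_plan`, `matching_score`, `classify`), the `eps` value, and the toy 2-D embeddings are all hypothetical. This is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan with uniform marginals (assumed)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def matching_score(frames, descriptors, eps=0.1):
    """Similarity between a video and one class, weighted by the transport plan."""
    sim = [[cosine(f, d) for d in descriptors] for f in frames]
    cost = [[1.0 - s for s in row] for row in sim]
    plan = sinkhorn_plan(cost, eps)
    return sum(plan[i][j] * sim[i][j]
               for i in range(len(frames)) for j in range(len(descriptors)))

def classify(frames, class_descriptors):
    """Pick the class whose descriptor set best matches the frame features."""
    scores = {name: matching_score(frames, descs)
              for name, descs in class_descriptors.items()}
    return max(scores, key=scores.get), scores

# Toy 2-D embeddings standing in for frame and descriptor features.
frames = [[1.0, 0.0], [0.0, 1.0]]
class_descriptors = {
    "a": [[1.0, 0.1], [0.1, 1.0]],
    "b": [[-1.0, 0.0], [0.0, -1.0]],
}
label, scores = classify(frames, class_descriptors)
```

In this toy setup each frame is matched mostly to its nearest descriptor, so a class scores well only when its descriptors jointly cover the frames rather than when a single descriptor happens to match.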

Authors (6)
  1. Tongjia Chen
  2. Hongshan Yu
  3. Zhengeng Yang
  4. Zechuan Li
  5. Wei Sun
  6. Chen Chen
Citations (4)