Training-Free Action Recognition and Goal Inference with Dynamic Frame Selection (2401.12471v2)

Published 23 Jan 2024 in cs.CV

Abstract: We introduce VidTFS, a Training-free, open-vocabulary video goal and action inference framework that combines a frozen vision foundation model (VFM) and an LLM with a novel dynamic Frame Selection module. Our experiments demonstrate that the proposed frame selection module significantly improves the framework's performance. We validate VidTFS on four widely used video datasets (CrossTask, COIN, UCF101, and ActivityNet), covering goal inference and action recognition tasks under open-vocabulary settings without any training or fine-tuning. The results show that VidTFS outperforms pretrained and instruction-tuned multimodal LLMs that directly stack an LLM and a VFM for downstream video inference tasks. This adaptability suggests that VidTFS has the potential to generalize to new training-free video inference tasks.

Authors (4)
  1. Ee Yeo Keat (2 papers)
  2. Zhang Hao (2 papers)
  3. Alexander Matyasko (6 papers)
  4. Basura Fernando (60 papers)
