Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 95 tok/s
Gemini 2.5 Pro 51 tok/s Pro
GPT-5 Medium 21 tok/s Pro
GPT-5 High 17 tok/s Pro
GPT-4o 103 tok/s Pro
Kimi K2 236 tok/s Pro
GPT OSS 120B 465 tok/s Pro
Claude Sonnet 4 38 tok/s Pro
2000 character limit reached

TDS-CLIP: Temporal Difference Side Network for Efficient VideoAction Recognition (2408.10688v2)

Published 20 Aug 2024 in cs.CV

Abstract: Recently, large-scale pre-trained vision-LLMs (e.g., CLIP), have garnered significant attention thanks to their powerful representative capabilities. This inspires researchers in transferring the knowledge from these large pre-trained models to other task-specific models, e.g., Video Action Recognition (VAR) models, via particularly leveraging side networks to enhance the efficiency of parameter-efficient fine-tuning (PEFT). However, current transferring approaches in VAR tend to directly transfer the frozen knowledge from large pre-trained models to action recognition networks with minimal cost, instead of exploiting the temporal modeling capabilities of the action recognition models themselves. Therefore, in this paper, we propose a novel memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transferring and temporal modeling, avoiding backpropagation in frozen parameter models. Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which can effectively capture local temporal differences in motion features to strengthen the model's global temporal modeling capabilities. Furthermore, we designed a Side Motion Enhancement Adapter (SME-Adapter) to guide the proposed side network in efficiently learning the rich motion information in videos, thereby improving the side network's ability to capture and learn motion information. Extensive experiments are conducted on three benchmark datasets, including Something-Something V1&V2, and Kinetics-400. Experimental results show that our method achieves competitive performance in video action recognition tasks.

Summary

We haven't generated a summary for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Lightbulb On Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

X Twitter Logo Streamline Icon: https://streamlinehq.com

Don't miss out on important new AI/ML research

See which papers are being discussed right now on X, Reddit, and more:

“Emergent Mind helps me see which AI papers have caught fire online.”

Philip

Philip

Creator, AI Explained on YouTube