Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
102 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Temporal and cross-modal attention for audio-visual zero-shot learning (2207.09966v1)

Published 20 Jul 2022 in cs.CV

Abstract: Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to be able to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in video data can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-attention Framework (\modelName) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features that are obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time instead of self-attention within the modalities boosts the performance significantly. We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the \ucf, \vgg, and \activity benchmarks for (generalised) zero-shot learning. Code for reproducing all results is available at \url{https://github.com/ExplainableML/TCAF-GZSL}.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Otniel-Bogdan Mercea (8 papers)
  2. Thomas Hummel (16 papers)
  3. A. Sophia Koepke (22 papers)
  4. Zeynep Akata (144 papers)
Citations (23)

Summary

We haven't generated a summary for this paper yet.

X Twitter Logo Streamline Icon: https://streamlinehq.com