
VicTR: Video-conditioned Text Representations for Activity Recognition (2304.02560v2)

Published 5 Apr 2023 in cs.CV

Abstract: Vision-Language Models (VLMs) have excelled in the image-domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video-domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), often keeping text embeddings unchanged or even being discarded. In this paper, we argue the contrary, that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more-flexible contrastive latent space. Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g. object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.

Authors (4)
  1. Kumara Kahatapitiya (20 papers)
  2. Anurag Arnab (56 papers)
  3. Arsha Nagrani (62 papers)
  4. Michael S. Ryoo (75 papers)
Citations (13)

Summary

Video-conditioned Text Representations for Activity Recognition

The paper presents VicTR, a novel approach to enhancing the capabilities of vision-language models (VLMs) for activity recognition by focusing specifically on video-conditioned text embeddings. In the context of video understanding, integrating temporal information with textual semantics is a promising avenue for improving the accuracy of activity recognition models.

Contributions and Methodology

VicTR is proposed to overcome the limitations of traditional image-VLMs when applied to the video domain. The approach incorporates three key components: token-boosting, cross-modal attention, and affinity (re-)weighting. These mechanisms are designed to adapt textual embeddings to video content dynamically, thus enhancing the interplay between video and text data.

  1. Token-boosting creates video-specific text tokens by weighting the original text tokens with frame-text affinities derived from the existing embeddings. This step not only increases model capacity but also emphasizes the semantic categories most relevant at each point in time, facilitating a more nuanced understanding of video context.
  2. Cross-modal and Temporal Attention are employed to allow enhanced interaction between video and text tokens. Cross-modal attention facilitates communication between visual and textual data, while temporal attention captures the evolution of these interactions over the video timeline.
  3. Affinity (re-)weighting recalibrates the alignment of text tokens with the updated video embeddings, refining the initially noisy affinities by integrating temporal context and per-layer feedback within the transformer model. A combined sketch of all three mechanisms is given after this list.
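
The following PyTorch sketch illustrates how these three mechanisms could fit together in a single block. It is a minimal, illustrative interpretation of the description above, not the authors' implementation: the module name, tensor shapes, the use of `nn.MultiheadAttention`, and the mean-pooling choices are all assumptions.

```python
# Illustrative sketch (not the authors' code) of the three mechanisms described
# above: frame-text affinity weighting ("token-boosting"), cross-modal attention,
# and affinity (re-)weighting. Module name and tensor shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoConditionedTextBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-modal attention: text tokens attend to frame tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal attention over frames, applied independently per class token.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens):
        """
        video_tokens: (B, T, D)  one embedding per frame
        text_tokens:  (C, D)     one embedding per class / auxiliary prompt
        """
        B, T, D = video_tokens.shape
        C = text_tokens.shape[0]

        # 1) Token-boosting: frame-text affinities weight per-frame copies of
        #    each text token, yielding video-specific text tokens (B, T, C, D).
        v = F.normalize(video_tokens, dim=-1)
        t = F.normalize(text_tokens, dim=-1)
        affinity = torch.einsum('btd,cd->btc', v, t).softmax(dim=-1)      # (B, T, C)
        boosted = affinity.unsqueeze(-1) * text_tokens.view(1, 1, C, D)   # (B, T, C, D)

        # 2) Cross-modal attention: boosted text tokens query the frame tokens,
        #    then temporal attention lets each class token evolve across frames.
        q = boosted.reshape(B, T * C, D)
        x, _ = self.cross_attn(self.norm_t(q), self.norm_v(video_tokens),
                               self.norm_v(video_tokens))
        x = (q + x).reshape(B, T, C, D)
        x = x.permute(0, 2, 1, 3).reshape(B * C, T, D)   # attend over time per class
        y, _ = self.temporal_attn(x, x, x)
        x = (x + y).reshape(B, C, T, D).mean(dim=2)      # (B, C, D) conditioned text

        # 3) Affinity (re-)weighting: recompute alignment of the updated text
        #    tokens against pooled video features to refine the initial affinities.
        video_feat = F.normalize(video_tokens.mean(dim=1), dim=-1)        # (B, D)
        logits = torch.einsum('bd,bcd->bc', video_feat, F.normalize(x, dim=-1))
        return x, logits
```

The design point this sketch tries to reflect is the paper's central argument: the classification logits come from text tokens that have been conditioned on the video, rather than from static class embeddings.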

Additionally, VicTR makes use of freely-available auxiliary semantic information in the form of visually-grounded text prompts (e.g., object or scene labels), broadening the set of text representations aligned in the latent space. This provides an extra layer of semantically grounded supervision, potentially increasing the model's robustness.
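
A hedged usage sketch, continuing from the block above, shows how activity-class prompts and auxiliary prompts might be combined into one text-token set. The random tensors stand in for a frozen CLIP-style text encoder and a video backbone; the prompt templates, dimension, and variable names are illustrative assumptions, not the paper's exact setup.

```python
import torch

# Build the text-token set from activity-class prompts plus freely available
# auxiliary prompts (objects, scenes). Embeddings are stubbed with random
# tensors; in practice a frozen text encoder and video backbone supply them.
activity_prompts = [f"a video of a person {a}" for a in ("brushing hair", "riding a bike")]
auxiliary_prompts = [f"a photo of a {o}" for o in ("hairbrush", "bicycle", "street")]

D = 512                                              # embedding dimension (assumed)
text_tokens = torch.randn(len(activity_prompts) + len(auxiliary_prompts), D)
video_tokens = torch.randn(2, 8, D)                  # batch of 2 clips, 8 frames each

block = VideoConditionedTextBlock(dim=D)             # from the sketch above
conditioned_text, logits = block(video_tokens, text_tokens)

# Only the logits for the activity prompts are used for classification; the
# auxiliary prompts act as extra, visually-grounded alignment targets.
activity_logits = logits[:, :len(activity_prompts)]
```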

Results and Implications

The effectiveness of VicTR is evaluated across a variety of benchmarks, including both short-form and long-form activity recognition datasets. The results demonstrate significant improvements in few-shot and zero-shot learning tasks compared to existing models that adapt image-VLMs for video. For instance, in few-shot settings on datasets such as HMDB-51 and UCF-101, VicTR surpasses prior works by a clear margin, validating its ability to generalize with limited training data.

Beyond activity recognition, VicTR's framework can be extended to other reasoning tasks, underscoring the versatility and potential for broader applicability within AI models. By conditioning text representations on video content, VicTR captures a more comprehensive depiction of visual contexts as they unfold over time, addressing a critical gap in current VLM applications for video data.

Future Directions

The paper suggests several paths for future development. Firstly, enhancing the auxiliary semantic vocabulary with more detailed or personalized descriptions may further bolster the model's temporal reasoning capabilities. Secondly, this approach could be integrated with larger datasets or different modalities beyond video and text, potentially scaling to more complex tasks such as video-based question answering or real-time video analytics.

In summary, the VicTR framework demonstrates a valuable advancement in aligning textual and visual domains for video applications. The paper's emphasis on fine-tuning text embeddings in relation to temporal video information provides significant insights for researchers aiming to develop more robust and flexible VLMs, particularly in the domain of activity recognition.