SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video (2406.09462v1)

Published 13 Jun 2024 in cs.CV and cs.AI

Abstract: Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture, and their memory footprint during pretraining can be substantial. Therefore, we pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and incorporate the egocentric-friendly objective EgoNCE instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, with no additional data augmentation techniques other than standard image augmentations, while remaining pretrainable on memory-limited devices.

Authors (3)
  1. Hector A. Valdez (3 papers)
  2. Kyle Min (22 papers)
  3. Subarna Tripathi (38 papers)
