vid-TLDR: Training Free Token merging for Light-weight Video Transformer (2403.13347v2)
Abstract: Video Transformers have become the prevalent solution for various video downstream tasks thanks to their superior expressive power and flexibility. However, these video Transformers suffer from heavy computational costs induced by the massive number of tokens across the entire video frames, which has been a major barrier to training the models. Moreover, patches irrelevant to the main content, e.g., backgrounds, degrade the generalization performance of models. To tackle these issues, we propose training-free token merging for lightweight video Transformers (vid-TLDR), which aims to enhance the efficiency of video Transformers by merging background tokens without additional training. For vid-TLDR, we introduce a novel approach to capture the salient regions in videos using only the attention map. Further, we introduce a saliency-aware token merging strategy that drops the background tokens and sharpens the object scores. Our experiments show that vid-TLDR significantly mitigates the computational complexity of video Transformers while achieving competitive performance compared to the base model without vid-TLDR. Code is available at https://github.com/mlvlab/vid-TLDR.
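To make the mechanism described in the abstract concrete — token saliency read directly from the attention map, followed by merging of low-saliency background tokens — here is a minimal PyTorch sketch. It is an illustration under our own simplifying assumptions (mean received attention as the saliency score, cosine-similarity assignment for merging), not the authors' implementation; their exact scoring and merging, including the score-sharpening step, are in the linked repository. The names `saliency_from_attention` and `merge_background_tokens` are hypothetical.

```python
import torch
import torch.nn.functional as F


def saliency_from_attention(attn: torch.Tensor) -> torch.Tensor:
    """Score each token by how much attention it receives.

    attn: [B, heads, N, N] softmax self-attention weights from a block.
    Returns [B, N] scores; low-scoring tokens are treated as background.
    """
    # Column j, averaged over heads and query positions, measures how
    # strongly token j is attended by the rest of the clip.
    return attn.mean(dim=1).mean(dim=1)


def merge_background_tokens(x: torch.Tensor, scores: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r lowest-saliency tokens into their most similar kept token.

    x:      [B, N, C] token features after the attention block.
    scores: [B, N] saliency scores from `saliency_from_attention`.
    r:      number of tokens to remove at this layer.
    """
    B, N, C = x.shape
    order = scores.argsort(dim=-1, descending=True)
    keep_idx, drop_idx = order[:, : N - r], order[:, N - r:]   # foreground / background

    gather = lambda idx: x.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))
    kept, dropped = gather(keep_idx), gather(drop_idx)

    # Assign every background token to its nearest kept token by cosine
    # similarity and average it in, instead of discarding it outright.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).transpose(1, 2)
    assign = sim.argmax(dim=-1)                                # [B, r]

    merged = kept.clone()
    counts = torch.ones(B, N - r, 1, dtype=x.dtype, device=x.device)
    merged.scatter_add_(1, assign.unsqueeze(-1).expand(-1, -1, C), dropped)
    counts.scatter_add_(1, assign.unsqueeze(-1), torch.ones(B, r, 1, dtype=x.dtype, device=x.device))
    return merged / counts
```

In practice such a block would sit after each self-attention layer of the video Transformer, with `r` chosen per layer to trade accuracy against the remaining token count.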
Authors: Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim