Streaming Dense Video Captioning (2404.01297v1)
Abstract: An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should handle long input videos, predict rich, detailed textual descriptions, and produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: first, a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos because the memory is of a fixed size; second, a streaming decoding algorithm that enables the model to make predictions before the entire video has been processed. Our model achieves this streaming ability and significantly improves the state of the art on three dense video captioning benchmarks: ActivityNet, YouCook2, and ViTT. Our code is released at https://github.com/google-research/scenic.
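The abstract only names the two components, so the following is a minimal sketch rather than the released Scenic implementation. It assumes the fixed-size memory is maintained by a weighted K-means over the concatenation of the old memory and the incoming frame tokens, and that captions are emitted at fixed decoding points as the stream advances. `K`, `NUM_ITERS`, `decode_every`, `encode_frame`, and `decode_caption` are illustrative placeholders, not names from the paper's code.

```python
# A minimal sketch (assumptions flagged below) of the two ideas in the
# abstract: a fixed-size memory maintained by clustering incoming tokens,
# and intermediate caption predictions at fixed "decoding points".
import jax
import jax.numpy as jnp

K = 64          # fixed memory size (number of cluster centers) -- assumption
NUM_ITERS = 2   # K-means refinement steps per update -- assumption


def update_memory(memory, weights, new_tokens):
    """Compress [memory; new_tokens] back to K centers with weighted K-means.

    memory:     (K, D) current memory tokens (cluster centers)
    weights:    (K,)   how many original tokens each center represents
    new_tokens: (N, D) tokens from the newly observed frame(s), weight 1 each
    """
    points = jnp.concatenate([memory, new_tokens], axis=0)            # (K+N, D)
    point_w = jnp.concatenate([weights, jnp.ones(new_tokens.shape[0])])
    centers = memory  # initialize the clustering from the old memory
    for _ in range(NUM_ITERS):
        # Assign every (weighted) point to its nearest center.
        d2 = jnp.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        assign = jax.nn.one_hot(jnp.argmin(d2, axis=-1), K)           # (K+N, K)
        # Recompute centers as weighted means of their assigned points.
        # (A real implementation would also re-seed empty clusters.)
        w = assign * point_w[:, None]
        mass = w.sum(axis=0)                                          # (K,)
        centers = (w.T @ points) / jnp.maximum(mass, 1e-6)[:, None]
    return centers, mass


def stream_video(frames, encode_frame, decode_caption, decode_every=16):
    """Run the encoder frame by frame and caption at fixed decoding points.

    `encode_frame` maps one frame to (N, D) tokens and `decode_caption` maps
    the (K, D) memory to text; both are stand-ins for the paper's visual
    encoder and language decoder.
    """
    first = encode_frame(frames[0])
    # Bootstrap the memory by tiling the first frame's tokens to K entries.
    memory = jnp.resize(first, (K, first.shape[-1]))
    weights = jnp.ones((K,))
    captions = []
    for t, frame in enumerate(frames[1:], start=1):
        memory, weights = update_memory(memory, weights, encode_frame(frame))
        if (t + 1) % decode_every == 0:   # a decoding point: predict early
            captions.append(decode_caption(memory))
    return captions
```

Because the memory always holds exactly K centers, with per-center weights recording how many original tokens each one absorbed, the per-frame cost is independent of video length; that is what lets the model run over arbitrarily long inputs and produce captions before the video ends.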