Streaming Dense Video Captioning (2404.01297v1)
Abstract: An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should handle long input videos, predict rich, detailed textual descriptions, and produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: First, we propose a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos because the memory is of a fixed size. Second, we develop a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability and significantly improves the state-of-the-art on three dense video captioning benchmarks: ActivityNet, YouCook2, and ViTT. Our code is released at https://github.com/google-research/scenic.
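To make the first component concrete, below is a minimal sketch of a clustering-based fixed-size memory in JAX (the framework of the released Scenic code): incoming frame tokens are absorbed into K cluster centers via a few weighted K-means iterations, so the memory footprint stays constant regardless of video length. This is an illustration under stated assumptions, not the authors' implementation; the function name `update_memory`, the warm-start choice, and the exact update rule are all assumptions.

```python
# Sketch of a fixed-size clustering memory (illustrative assumption,
# not the authors' released implementation).
import jax
import jax.numpy as jnp

def update_memory(memory, counts, new_tokens, n_iters=2):
    """Absorb new frame tokens into K cluster centers.

    memory:     (K, D) current cluster centers (the fixed-size memory).
    counts:     (K,)   token mass already merged into each center.
    new_tokens: (N, D) tokens from the newly observed frames.
    Returns updated (K, D) centers and (K,) counts.
    """
    # Pool old centers (weighted by their counts) with the new tokens.
    points = jnp.concatenate([memory, new_tokens], axis=0)        # (K+N, D)
    weights = jnp.concatenate(
        [counts, jnp.ones(new_tokens.shape[0])], axis=0)          # (K+N,)
    centers = memory  # warm-start the clustering from the old memory
    for _ in range(n_iters):
        # Assign every pooled point to its nearest center.
        dists = jnp.linalg.norm(points[:, None] - centers[None], axis=-1)
        one_hot = jax.nn.one_hot(jnp.argmin(dists, axis=-1),
                                 centers.shape[0])                # (K+N, K)
        w = one_hot * weights[:, None]
        # Recompute each center as the weighted mean of its points.
        centers = (w.T @ points) / jnp.maximum(w.sum(0)[:, None], 1e-6)
    return centers, w.sum(0)
```

In a streaming setup one would seed `memory` from the first K tokens of the video, call `update_memory` once per incoming clip, and let the decoder attend to the K centers at intermediate timestamps, which is what permits predictions before the full video has been seen.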