Emergent Mind

Streaming Dense Video Captioning

Published Apr 1, 2024 in cs.CV


An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: First, we propose a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos as the memory is of a fixed size. Second, we develop a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability, and significantly improves the state-of-the-art on three dense video captioning benchmarks: ActivityNet, YouCook2 and ViTT. Our code is released at https://github.com/google-research/scenic.
Figure: the framework encodes video frames, compresses them into a fixed-size memory, and decodes captions with timestamps.


  • Introduces a new streaming approach to dense video captioning that can localize and describe events in untrimmed videos without needing the entire video content.

  • Features a novel memory module based on clustering incoming tokens and a streaming decoding algorithm to efficiently manage and process video streams.

  • Demonstrates superior performance on three dense video captioning benchmarks, outperforming state-of-the-art models.

  • Opens new research avenues in real-world applications like live video analysis and automated surveillance, challenging traditional video processing methods.

Introduction to Streaming Dense Video Captioning

Dense video captioning requires jointly localizing and describing events in untrimmed videos, making it a challenging but important task for video understanding. Unlike conventional models, which need access to the entire video before generating localized captions, this paper introduces a streaming approach to dense video captioning. The proposed model has two components: a memory module that clusters incoming tokens, designed to handle videos of arbitrary length, and a streaming decoding algorithm that makes predictions before the complete video has been processed. This approach sets a new state of the art on three dense video captioning benchmarks: ActivityNet, YouCook2, and ViTT.

Novel Contributions

  • Memory Module:

      • A memory mechanism built on clustering the tokens arriving from the video stream.

      • The module compresses video features into a memory of constant size regardless of input length, so the model scales to arbitrarily long video sequences.

  • Streaming Decoding Algorithm:

      • Predictions are made incrementally as the video is being processed, rather than once at the end.

      • "Decoding points" trigger the decoder on the current memory state to generate and update event captions, substantially reducing the prediction latency of existing approaches.

  • Empirical Validation:

      • The streaming model is evaluated on multiple dense video captioning benchmarks.

      • It improves notably over state-of-the-art models, showing that it can handle long videos while still generating detailed textual descriptions.
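To make the memory module concrete, here is a minimal sketch of a fixed-size clustering memory. This is an illustrative simplification, not the paper's exact algorithm: the paper clusters incoming tokens (K-means style) into a constant-size memory, and the sketch below approximates that with online running-mean centroids. The class and method names (`ClusterMemory`, `update`, `read`) are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ClusterMemory:
    """Fixed-size memory: K centroids summarizing all tokens seen so far."""

    def __init__(self, num_slots):
        self.num_slots = num_slots  # memory size K, constant w.r.t. video length
        self.centroids = []         # one feature vector per occupied slot
        self.counts = []            # number of tokens absorbed per slot

    def update(self, tokens):
        for t in tokens:
            if len(self.centroids) < self.num_slots:
                # Fill empty slots first.
                self.centroids.append(list(t))
                self.counts.append(1)
                continue
            # Assign the token to the nearest centroid ...
            i = min(range(self.num_slots),
                    key=lambda j: sq_dist(t, self.centroids[j]))
            # ... and fold it into that slot's running mean.
            self.counts[i] += 1
            n = self.counts[i]
            self.centroids[i] = [c + (x - c) / n
                                 for c, x in zip(self.centroids[i], t)]

    def read(self):
        # The decoder only ever sees at most K vectors, however long the video.
        return self.centroids
```

The key property is that `read()` returns at most `num_slots` vectors no matter how many frames have been consumed, which is what lets the downstream decoder's cost stay constant for arbitrarily long inputs.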

Technical Insights

The paper details the streaming model's architecture: a clustering-based memory module summarizes the input video stream, and a streaming decoding algorithm generates outputs as the stream is consumed. This design both addresses the cost of processing long videos and allows localized captions to be predicted in a streaming manner. Comprehensive experiments demonstrate consistent performance gains across the benchmarks.
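The decoding-point idea can be sketched as a small driver loop. This is a hedged illustration under assumed interfaces, not the paper's implementation: `encode_frame`, `memory`, and `decode` stand in for the real visual encoder, clustering memory, and language decoder, and the fixed `stride` between decoding points is a simplification.

```python
def stream_captions(frames, memory, encode_frame, decode, stride):
    """Yield (frame_index, captions) at each decoding point.

    Frames are encoded and folded into the fixed-size memory as they
    arrive; every `stride` frames (and at the end of the stream) the
    decoder is run on the current memory state, so captions appear
    before the whole video has been processed.
    """
    for i, frame in enumerate(frames, start=1):
        memory.update([encode_frame(frame)])
        if i % stride == 0 or i == len(frames):
            # Decode from the memory summary only; predictions made at
            # earlier decoding points can be refined at later ones.
            yield i, decode(memory.read())
```

Because the function is a generator, a caller can consume captions as they are produced, which is the essence of the latency reduction: no output waits for the final frame except the last decoding point itself.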

Future Directions and Theoretical Implications

The introduction of streaming capabilities in dense video captioning opens new research avenues, particularly in real-world applications such as live video analysis and automated surveillance systems, where immediate response is crucial. Theoretically, this work challenges the traditional approach to video processing tasks, advocating for more dynamic, real-time methods. Future explorations might extend this streaming framework to other video-related tasks or investigate the incorporation of additional modalities (e.g., audio cues) to further enrich the model's understanding and description of video content.

Concluding Remarks

This paper presents a streaming model for dense video captioning that efficiently manages long input videos and delivers predictions before the video ends. With solid empirical results supporting its efficacy, the work paves the way for more capable real-time video processing and understanding systems, with promising implications for both academic research and practical applications.


