
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval (2404.07610v1)

Published 11 Apr 2024 in cs.CV

Abstract: Dense video captioning, which aims to automatically localize and caption all events in an untrimmed video, has received significant research attention. Several studies formulate dense video captioning as a multi-task problem of event localization and event captioning in order to exploit inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this limitation by proposing a novel framework inspired by human cognitive information processing. Our model incorporates prior knowledge through an external memory, and a memory retrieval method based on cross-modal video-to-text matching is proposed. To effectively incorporate the retrieved text features, a versatile encoder and a decoder with visual and textual cross-attention modules are designed. Comparative experiments on the ActivityNet Captions and YouCook2 datasets demonstrate the effectiveness of the proposed method. The results show promising performance without extensive pretraining on a large video dataset.
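
As a rough illustration of the retrieval-plus-cross-attention idea the abstract describes, the sketch below shows (1) top-k retrieval of text features from an external memory via video-to-text similarity in a shared embedding space, and (2) a decoder layer with separate visual and textual cross-attention sub-layers. This is a minimal PyTorch sketch under assumed shapes; the names `retrieve_text_features` and `DualCrossAttentionLayer` are hypothetical, and the actual architecture in the paper may differ.

```python
import torch
import torch.nn.functional as F

def retrieve_text_features(video_emb, memory_keys, memory_values, k=5):
    """Retrieve top-k text features from an external memory by cosine
    similarity between a pooled video embedding and memory keys.

    video_emb:     (d,)     pooled video embedding
    memory_keys:   (N, d)   text embeddings in a shared video-text space
    memory_values: (N, d_t) text features passed on to the captioning decoder
    """
    sims = F.cosine_similarity(video_emb.unsqueeze(0), memory_keys, dim=-1)  # (N,)
    topk = sims.topk(k)
    return memory_values[topk.indices], topk.values  # retrieved features, scores

class DualCrossAttentionLayer(torch.nn.Module):
    """Decoder layer that attends to visual and retrieved text features
    in separate cross-attention sub-layers (a plausible reading of the
    'visual and textual cross-attention modules' in the abstract)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d_model, 4 * d_model),
            torch.nn.ReLU(),
            torch.nn.Linear(4 * d_model, d_model),
        )
        self.norms = torch.nn.ModuleList(torch.nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, tgt, visual_feats, text_feats):
        # tgt: (B, T, d) caption tokens; visual_feats: (B, Lv, d); text_feats: (B, Lt, d)
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt)[0])
        x = self.norms[1](x + self.visual_attn(x, visual_feats, visual_feats)[0])
        x = self.norms[2](x + self.text_attn(x, text_feats, text_feats)[0])
        return self.norms[3](x + self.ffn(x))
```

Keeping the two cross-attention sub-layers separate lets the decoder weigh visual evidence and retrieved textual prior knowledge independently per token; how the paper actually fuses the modalities is not specified in the abstract.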

Authors (5)
  1. Minkuk Kim (3 papers)
  2. Hyeon Bae Kim (5 papers)
  3. Jinyoung Moon (13 papers)
  4. Jinwoo Choi (26 papers)
  5. Seong Tae Kim (42 papers)
Citations (7)