
SnapCap: Efficient Snapshot Compressive Video Captioning (2401.04903v1)

Published 10 Jan 2024 in cs.CV

Abstract: Video Captioning (VC) is a challenging multi-modal task, since it requires describing a scene in language after understanding diverse and complex videos. For machines, the traditional VC pipeline follows "imaging-compression-decoding-and-then-captioning", where compression is pivotal for storage and transmission. However, such a pipeline has inevitable shortcomings: information redundancy, which lowers efficiency, and information loss during the sampling process for captioning. To address these problems, we propose a novel VC pipeline that generates captions directly from the compressed measurement, which can be captured by a snapshot compressive sensing camera; we dub our model SnapCap. More specifically, by simulating the sensing signal we can generate abundant measurement-video-annotation data pairs for training. In addition, to better extract language-related visual representations from the compressed measurement, we propose to distill knowledge from videos via a pre-trained CLIP model, whose rich language-vision associations guide the learning of SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on two widely used VC datasets. Both the qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines. In particular, compared to "caption-after-reconstruction" methods, SnapCap runs at least 3$\times$ faster and achieves better caption results.
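To make the acquisition step concrete: in snapshot compressive imaging, a block of video frames is modulated by per-frame binary masks and summed into a single 2-D measurement, which is the input SnapCap captions from. The following numpy sketch simulates that forward model; the frame count, resolution, and Bernoulli mask statistics are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def simulate_snapshot_measurement(frames: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """SCI forward model: y = sum_t (mask_t * x_t).

    frames, masks: (B, H, W) arrays, one mask per frame.
    Returns a single (H, W) compressed measurement.
    """
    assert frames.shape == masks.shape, "one binary mask per frame"
    return (frames * masks).sum(axis=0)

rng = np.random.default_rng(0)
B, H, W = 8, 64, 64  # 8 frames compressed into one snapshot (illustrative)
frames = rng.random((B, H, W)).astype(np.float32)
# Binary modulation masks, e.g. from a coded aperture / DMD pattern
masks = (rng.random((B, H, W)) > 0.5).astype(np.float32)

measurement = simulate_snapshot_measurement(frames, masks)
print(measurement.shape)  # (64, 64)
```

Because the masks are known, the same simulation run over an annotated video corpus yields the measurement-video-annotation triples the abstract mentions, without any physical capture.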

Authors (8)
  1. Jianqiao Sun
  2. Yudi Su
  3. Hao Zhang
  4. Ziheng Cheng
  5. Zequn Zeng
  6. Zhengjue Wang
  7. Bo Chen
  8. Xin Yuan
Citations (1)