SnapCap: Efficient Snapshot Compressive Video Captioning (2401.04903v1)
Abstract: Video Captioning (VC) is a challenging multi-modal task, since it requires describing varied and complex scenes in natural language. For machines, the conventional VC pipeline follows "imaging-compression-decoding-and-then-captioning", where compression is pivotal for storage and transmission. Such a pipeline, however, has inherent shortcomings: information redundancy that lowers efficiency, and information loss during frame sampling for captioning. To address these problems, we propose a novel VC pipeline, dubbed SnapCap, that generates captions directly from the compressed measurement captured by a snapshot compressive sensing camera. Specifically, by simulating the sensing process we obtain abundant measurement-video-annotation data pairs for training. Moreover, to better extract language-related visual representations from the compressed measurement, we distill knowledge from the source videos via a pre-trained CLIP, whose rich language-vision associations guide the learning of SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on two widely used VC datasets. Both qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines. In particular, compared with "caption-after-reconstruction" methods, SnapCap runs at least 3$\times$ faster and achieves better captioning results.
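The abstract hinges on two technical ingredients: the snapshot compressive imaging (SCI) forward model, which lets one simulate a compressed measurement from a short clip and thereby build measurement-video-annotation triplets, and a CLIP-guided distillation objective that transfers language-related visual knowledge from the frames to the measurement encoder. The sketch below illustrates both under stated assumptions; the function names, the cosine-based distillation loss, and the placeholder encoder outputs are illustrative choices, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the two ideas described above:
# (1) simulating a snapshot compressive measurement from B frames via the
#     standard SCI forward model Y = sum_b C_b * X_b, and
# (2) a CLIP-guided distillation loss pulling measurement features toward the
#     CLIP features of the underlying frames. All names here are assumptions.

import torch
import torch.nn.functional as F

def simulate_measurement(frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """frames: (B, H, W) grayscale frames; masks: (B, H, W) binary coding masks.
    Returns one (H, W) snapshot measurement following Y = sum_b C_b * X_b."""
    return (masks * frames).sum(dim=0)

def clip_distillation_loss(student_feats: torch.Tensor,
                           clip_frame_feats: torch.Tensor) -> torch.Tensor:
    """student_feats: (D,) features the captioner extracts from the measurement.
    clip_frame_feats: (B, D) CLIP image features of the original frames (teacher).
    A simple choice: match the student to the mean CLIP frame embedding with a
    cosine-distance loss (the paper may use a different formulation)."""
    teacher = F.normalize(clip_frame_feats.mean(dim=0), dim=-1)
    student = F.normalize(student_feats, dim=-1)
    return 1.0 - (student * teacher).sum()

# Toy usage with random data: B = 8 frames compressed into a single measurement.
B, H, W, D = 8, 256, 256, 512
frames = torch.rand(B, H, W)
masks = (torch.rand(B, H, W) > 0.5).float()
measurement = simulate_measurement(frames, masks)      # (H, W)
student_feats = torch.randn(D)                         # placeholder encoder output
clip_frame_feats = torch.randn(B, D)                   # placeholder CLIP features
loss = clip_distillation_loss(student_feats, clip_frame_feats)
```

In practice the measurement encoder would be trained on the simulated triplets, with the distillation term added to the usual captioning loss; the toy tensors above only show the expected shapes.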
Authors: Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue Wang, Bo Chen, Xin Yuan