
Knowledge Distillation for Efficient Audio-Visual Video Captioning (2306.09947v1)

Published 16 Jun 2023 in eess.AS

Abstract: Automatically describing audio-visual content with text, namely video captioning, has received significant attention due to its potential applications across diverse fields. Deep neural networks are the dominant methods, offering state-of-the-art performance. However, these methods often cannot be deployed on low-power devices such as smartphones because of their large parameter counts. In this paper, we propose to exploit a simple pooling front-end and down-sampling algorithms, combined with knowledge distillation, to extract audio and visual attributes from a reduced number of audio-visual frames. With the help of knowledge distillation from the teacher model, our proposed method greatly reduces the redundant information in audio-visual streams without losing the critical context needed for caption generation. Extensive experimental evaluations on the MSR-VTT dataset demonstrate that our proposed approach reduces inference time by about 80% with only a small sacrifice (less than 0.02%) in captioning accuracy.
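The training recipe the abstract describes — a student that sees temporally down-sampled audio-visual features and learns from both the ground-truth captions and a full-frame teacher — can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering, not the paper's actual implementation: the loss weight `alpha`, the temperature, the down-sampling factor, and the function names are illustrative assumptions, and the tensor shapes stand in for whatever encoder and decoder the paper uses.

```python
# Minimal sketch of soft-label knowledge distillation for caption logits,
# paired with uniform temporal down-sampling of the input frame features.
# Shapes, hyperparameters, and function names are illustrative assumptions.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5, pad_id=0):
    """Blend hard-label cross-entropy with soft-label KL distillation.

    student_logits: (batch, seq_len, vocab) from the compact student
    teacher_logits: (batch, seq_len, vocab) from the full teacher
    targets:        (batch, seq_len) ground-truth caption token ids
    """
    vocab = student_logits.size(-1)
    # Hard loss: standard cross-entropy against the reference caption.
    hard = F.cross_entropy(student_logits.reshape(-1, vocab),
                           targets.reshape(-1), ignore_index=pad_id)
    # Soft loss: KL divergence between temperature-scaled teacher and
    # student distributions; the T^2 factor keeps the gradient scale stable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab),
        F.softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab),
        reduction="batchmean") * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft


def downsample_frames(frames, keep_every=4):
    """Uniform temporal down-sampling: keep every k-th frame feature.

    frames: (batch, time, feat); the factor 4 is an arbitrary example,
    not the setting reported in the paper.
    """
    return frames[:, ::keep_every, :]
```

In a setup like this, the student processes far fewer frames than the teacher at inference time, which is where the reported speed-up would come from, while the soft teacher targets compensate for the context discarded by down-sampling.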

