
MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning (2402.17680v1)

Published 27 Feb 2024 in cs.CV

Abstract: Existing work on catastrophic forgetting, caused by the invisibility of old categories in sequential input, has made progress on relatively simple classification tasks. Video captioning, by contrast, is a more complex multimodal task that has not yet been explored in the incremental-learning setting. After identifying this stability-plasticity problem in analyzing video with sequential input, we propose a method to Mitigate Catastrophic Forgetting in class-incremental learning for multimodal Video Captioning (MCF-VC). To effectively maintain good performance on old tasks at the macro level, we design Fine-grained Sensitivity Selection (FgSS), based on a mask of linear-layer parameters and Fisher sensitivity, to pick useful knowledge from old tasks. To better constrain the knowledge of old and new tasks at the feature level, we create Two-stage Knowledge Distillation (TsKD), which learns the new task well while weighing the old task. Specifically, we design two distillation losses that constrain the cross-modal semantic information of the semantic attention feature map and the textual information of the final outputs, respectively, so that the inter-model and intra-model stylized knowledge of the old classes is retained while learning the new classes. To illustrate our model's resistance to forgetting, we design a metric, CIDER_t, to measure the stage forgetting rate. Experiments on the public MSR-VTT dataset show that the proposed method significantly resists forgetting of previous tasks without replaying old samples, while performing well on the new task.
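The abstract's two core mechanisms can be illustrated with a toy sketch: a diagonal Fisher-information approximation used to select the most sensitive parameters (the idea behind FgSS, which in the paper operates on masks of linear-layer parameters), and a temperature-softened distillation loss of the kind TsKD applies to feature maps and final outputs. The function names and the top-k selection rule here are illustrative assumptions, not the paper's actual implementation.

```python
import math

def fisher_sensitivity(per_sample_grads):
    # Diagonal Fisher approximation: mean squared gradient per parameter,
    # estimated over a batch of per-sample gradient vectors.
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def select_mask(fisher, keep_ratio=0.5):
    # FgSS-style selection (simplified): keep the fraction of parameters
    # with the highest Fisher sensitivity, i.e. those most useful to old tasks.
    k = max(1, int(len(fisher) * keep_ratio))
    thresh = sorted(fisher)[-k]
    return [f >= thresh for f in fisher]

def softmax(logits, T=1.0):
    # Numerically stable softmax with temperature T.
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    # Temperature-softened KL(teacher || student): the generic form of a
    # distillation loss that keeps the student close to the old-task teacher.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T * T) * sum(pi * (math.log(pi) - math.log(qi))
                         for pi, qi in zip(p, q))
```

For example, with two parameters whose per-sample gradients are `[1, 2]` and `[3, 4]`, the Fisher estimate is `[5.0, 10.0]`, so a 50% keep ratio masks in only the second parameter; a student matching its teacher exactly incurs zero distillation loss.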

Authors (6)
  1. Huiyu Xiong
  2. Lanxiao Wang
  3. Heqian Qiu
  4. Taijin Zhao
  5. Benliu Qiu
  6. Hongliang Li