Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting (2404.12782v1)
Abstract: Automatic live video commenting is attracting increasing attention due to its significance in narration generation, topic explanation, and related tasks. However, current methods do not account for the diverse sentiments of the generated comments. Sentiment is critical in interactive commenting, yet it has received little research attention so far. Thus, in this paper, we propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network, which consists of a sentiment-oriented diversity encoder module and a batch attention module, to achieve diverse video commenting with multiple sentiments and multiple semantics. Specifically, our sentiment-oriented diversity encoder elegantly combines a VAE with a random mask mechanism to achieve semantic diversity under sentiment guidance, and its output is then fused with cross-modal features to generate live video comments. Furthermore, we also propose a batch attention module to alleviate the problem of missing sentiment samples caused by data imbalance, which is common in live videos because video popularity varies. Extensive experiments on the Livebot and VideoIC datasets demonstrate that the proposed So-TVAE outperforms state-of-the-art methods in terms of both the quality and the diversity of generated comments. Related code is available at https://github.com/fufy1024/So-TVAE.
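The core idea of the sentiment-oriented diversity encoder can be sketched as a reparameterized VAE sample that is randomly masked for semantic diversity and then fused with a sentiment embedding. The sketch below is illustrative only: the function name, the additive fusion, and the element-wise mask are assumptions for exposition, not the paper's actual implementation.

```python
import math
import random

def sentiment_guided_sample(mu, logvar, sentiment_emb, mask_prob=0.3, rng=None):
    """Toy sketch of sentiment-guided diverse sampling (illustrative, not the
    paper's API): reparameterized VAE sample -> random mask -> sentiment fusion."""
    rng = rng or random.Random(0)
    z = []
    for m, lv, s in zip(mu, logvar, sentiment_emb):
        eps = rng.gauss(0.0, 1.0)                 # eps ~ N(0, 1)
        sample = m + math.exp(0.5 * lv) * eps     # reparameterization trick
        if rng.random() < mask_prob:              # random mask drives diversity
            sample = 0.0
        z.append(sample + s)                      # fuse with sentiment guidance
    return z
```

Repeated calls with fresh noise yield semantically different latent codes for the same video context, while the added sentiment embedding keeps every sample aligned with the requested sentiment.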