Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model (2207.07934v2)

Published 16 Jul 2022 in cs.CL, cs.HC, and cs.MM

Abstract: Text response generation for multimodal task-oriented dialog systems, which aims to generate the proper text response given the multimodal context, is an essential yet challenging task. Although existing efforts have achieved compelling success, they still suffer from two pivotal limitations: 1) they overlook the benefit of generative pre-training, and 2) they ignore the knowledge related to the textual context. To address these limitations, we propose a novel dual knowledge-enhanced generative pretrained language model for multimodal task-oriented dialog systems (DKMD), consisting of three key components: dual knowledge selection, dual knowledge-enhanced context learning, and knowledge-enhanced response generation. Specifically, the dual knowledge selection component selects the related knowledge according to both the textual and visual modalities of the given context. The dual knowledge-enhanced context learning component then aims to seamlessly integrate the selected knowledge into the multimodal context learning from both global and local perspectives, where the cross-modal semantic relation is also explored. Finally, the knowledge-enhanced response generation component comprises a revised BART decoder, in which an additional dot-product knowledge-decoder attention sub-layer is introduced to explicitly use the knowledge when generating the text response. Extensive experiments on a public dataset verify the superiority of the proposed DKMD over state-of-the-art competitors.
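The dot-product knowledge-decoder attention mentioned in the abstract can be illustrated with a generic sketch. This is not the authors' implementation: the abstract gives no layer details, so the function name `knowledge_attention`, the scaling by the square root of the hidden size, the residual connection, and the tensor shapes are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_attention(decoder_states, knowledge):
    """Hypothetical dot-product attention from decoder states to knowledge.

    decoder_states: (T, d) array, one hidden vector per decoded token.
    knowledge:      (K, d) array, one embedding per selected knowledge entry.
    Returns the knowledge-enhanced states (T, d) and the weights (T, K).
    """
    d = decoder_states.shape[-1]
    # Dot-product scores between each decoder state and each knowledge entry
    # (scaling by sqrt(d) is an assumption borrowed from standard Transformers).
    scores = decoder_states @ knowledge.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Weighted sum of knowledge embeddings for every decoder position.
    context = weights @ knowledge
    # Residual connection (assumed) keeps the original decoder signal.
    return decoder_states + context, weights

rng = np.random.default_rng(0)
dec = rng.normal(size=(4, 8))   # 4 decoded tokens, hidden size 8
kb = rng.normal(size=(3, 8))    # 3 selected knowledge entries
enhanced, attn = knowledge_attention(dec, kb)
```

Each row of `attn` is a distribution over the knowledge entries, so a decoder position can draw more heavily on whichever entry best matches its current state.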

Authors (6)
  1. Xiaolin Chen (27 papers)
  2. Xuemeng Song (30 papers)
  3. Liqiang Jing (21 papers)
  4. Shuo Li (179 papers)
  5. Linmei Hu (14 papers)
  6. Liqiang Nie (191 papers)
Citations (15)