PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts (2305.14839v2)

Published 24 May 2023 in cs.CL and cs.CV

Abstract: Perceiving multi-modal information and carrying out dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue, yet because multi-modal dialogue data are scarce, research on multi-modal dialogue pre-training remains limited. A further challenge arises from the encompassing nature of multi-modal dialogue, which involves diverse modalities and tasks; moreover, new forms of tasks may emerge at unpredictable points in the future. Designed multi-modal dialogue models must therefore be flexible enough to adapt to such scenarios. This paper proposes PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It combines several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue data together with extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method in which previously trained experts assist newly added ones, facilitating the expansion of the model's capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialogue benchmarks.
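To make the compositional-experts and progressive-training ideas concrete, here is a minimal, illustrative sketch of a Transformer block whose feed-forward sub-layer is split into modality/task experts, with a method for growing a new expert while freezing the old ones. This is not the authors' released implementation; the class name `CompositionalExpertLayer`, the expert names, and the `add_expert`/`active` arguments are hypothetical choices for illustration only.

```python
import torch
import torch.nn as nn

class CompositionalExpertLayer(nn.Module):
    """Illustrative sketch (not the PaCE code): shared self-attention followed by
    a pool of expert feed-forward networks; a task decides which experts are active."""

    def __init__(self, d_model=768, d_ff=3072, expert_names=("text", "image", "grounding")):
        super().__init__()
        self.d_model, self.d_ff = d_model, d_ff
        self.attn = nn.MultiheadAttention(d_model, num_heads=12, batch_first=True)
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for name in expert_names
        })
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def add_expert(self, name, freeze_existing=True):
        """Progressive expansion: register a new expert; optionally freeze the
        previously trained experts so they only assist the new one."""
        if freeze_existing:
            for p in self.experts.parameters():
                p.requires_grad = False
        self.experts[name] = nn.Sequential(
            nn.Linear(self.d_model, self.d_ff), nn.GELU(), nn.Linear(self.d_ff, self.d_model))

    def forward(self, x, active=("text",)):
        # Shared self-attention, then average the outputs of the active experts.
        h = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        ff = torch.stack([self.experts[name](h) for name in active]).mean(dim=0)
        return self.norm2(h + ff)

layer = CompositionalExpertLayer()
tokens = torch.randn(2, 16, 768)                       # (batch, seq_len, d_model)
out = layer(tokens, active=("text", "image"))          # early pre-training stage
layer.add_expert("generation")                         # later stage: grow a new expert
out = layer(tokens, active=("text", "generation"))     # old experts assist, new one trains
```

In this sketch, "progressive" training simply means freezing earlier experts when a new one is added, so that established capabilities remain stable while the new expert specializes on the added task.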

Authors (6)
  1. Yunshui Li (18 papers)
  2. Binyuan Hui (57 papers)
  3. Min Yang (239 papers)
  4. Fei Huang (408 papers)
  5. Yongbin Li (128 papers)
  6. Zhichao Yin (8 papers)
Citations (16)