PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts (2305.14839v2)
Abstract: Perceiving multi-modal information and conducting dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue; however, due to the limited availability of multi-modal dialogue data, research on multi-modal dialogue pre-training remains scarce. A further challenge stems from the encompassing nature of multi-modal dialogue, which involves diverse modalities and tasks; moreover, new forms of tasks may arise at unpredictable points in the future. Hence, multi-modal dialogue models must be flexible enough to adapt to such scenarios. This paper proposes PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It combines several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue data together with extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method in which previously trained experts assist newly added ones, facilitating the expansion of the model's capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialogue benchmarks.
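The abstract describes two mechanisms: a layer that composes several fundamental experts, and a progressive schedule in which experts trained earlier assist newly added ones. The sketch below is a minimal PyTorch illustration of how such a compositional expert layer might be structured, with a method that freezes existing experts before registering a new one. It is an assumption for illustration only, not the authors' implementation; every class and method name here (`ExpertFFN`, `CompositionalExpertLayer`, `add_expert`) is hypothetical.

```python
# Minimal sketch of a compositional-expert transformer layer with progressive
# expansion. Illustrative only; not the PaCE implementation.
import torch
import torch.nn as nn


class ExpertFFN(nn.Module):
    """One modality/task expert: a standard transformer feed-forward block."""

    def __init__(self, hidden: int, inner: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, inner),
            nn.GELU(),
            nn.Linear(inner, hidden),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class CompositionalExpertLayer(nn.Module):
    """Shared self-attention followed by a learned mixture over several experts."""

    def __init__(self, hidden: int, inner: int, expert_names: list):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.experts = nn.ModuleDict({n: ExpertFFN(hidden, inner) for n in expert_names})
        # Per-token gate deciding how much each expert contributes.
        self.gate = nn.Linear(hidden, len(expert_names))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        weights = torch.softmax(self.gate(h), dim=-1)                             # (B, T, E)
        expert_out = torch.stack([e(h) for e in self.experts.values()], dim=-1)   # (B, T, H, E)
        mixed = (expert_out * weights.unsqueeze(2)).sum(dim=-1)                   # (B, T, H)
        return self.norm2(h + mixed)

    def add_expert(self, name: str, inner: int, freeze_old: bool = True):
        """Progressive step: freeze existing experts, then register a new one."""
        if freeze_old:
            for p in self.experts.parameters():
                p.requires_grad_(False)
        hidden = self.gate.in_features
        self.experts[name] = ExpertFFN(hidden, inner)
        # Re-initialise the gate to cover the enlarged expert set.
        self.gate = nn.Linear(hidden, len(self.experts))


# Usage: start with two experts, later grow a third; old experts are frozen.
layer = CompositionalExpertLayer(hidden=768, inner=3072,
                                 expert_names=["caption", "context"])
layer.add_expert("generation", inner=3072)
out = layer(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```

The design choice sketched here (soft gating over a small, named set of experts plus freezing on expansion) is one plausible way to realize "old experts assisting new experts"; the paper itself should be consulted for the actual expert composition and training schedule.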
Authors: Yunshui Li, Binyuan Hui, Min Yang, Fei Huang, Yongbin Li, Zhichao Yin