mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections (2205.12005v2)

Published 24 May 2022 in cs.CL and cs.CV

Abstract: Large-scale pretrained foundation models have been an emerging paradigm for building AI systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from the problems of low computational efficiency and information asymmetry brought by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which creates inter-layer shortcuts that skip a certain number of layers for time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to multiple video-language tasks.

An Analysis of mPLUG: Cross-modal Vision-Language Learning with Skip-connections

The paper "mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections" elucidates a significant advancement in the domain of vision-language pre-training (VLP) models. The paper introduces mPLUG, a novel architecture designed to enhance both cross-modal understanding and generation tasks.

mPLUG Architecture

The mPLUG model addresses two persistent challenges in VLP: computational inefficiency and information asymmetry between the visual and textual modalities. Earlier VLP approaches typically rely on pre-trained object detectors or on long sequences of image patches, both of which are computationally expensive. They also struggle with the mismatch between fine-grained visual features and the more abstract textual descriptions that accompany them.

To mitigate these challenges, mPLUG employs a cross-modal skip-connected architecture. The model consists of two unimodal encoders, one for images and one for text, joined by a transformer-based cross-modal skip-connected fusion network. Within this network, inter-layer shortcuts let the long visual sequence skip a number of time-consuming full self-attention layers, while the text representations continue to attend to the visual features. By fusing visual and textual representations at different levels of abstraction, the design improves efficiency and directly addresses the information asymmetry problem.
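
A minimal PyTorch-style sketch of how such a skip-connected fusion block could be organized is shown below. It illustrates the idea rather than reproducing the authors' implementation: the class names, the number of asymmetric layers per connected layer, the hidden size, and the head count are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class AsymmetricCoAttnLayer(nn.Module):
    """Text tokens run self-attention plus cross-attention to the visual
    tokens; the long visual sequence skips its own self-attention here."""

    def __init__(self, dim, heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, vis):
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        text = text + self.cross_attn(self.norm2(text), vis, vis)[0]
        return text + self.ffn(self.norm3(text))


class ConnectedAttnLayer(nn.Module):
    """Full self-attention over the concatenated [visual; text] sequence."""

    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, vis):
        x = torch.cat([vis, text], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.ffn(self.norm2(x))
        n_vis = vis.size(1)
        return x[:, n_vis:], x[:, :n_vis]  # updated text, updated visual


class SkipConnectedFusionBlock(nn.Module):
    """Several cheap asymmetric layers followed by one connected layer, so
    the visual sequence pays for full self-attention only once per block."""

    def __init__(self, dim=768, heads=12, s=3):
        super().__init__()
        self.asym_layers = nn.ModuleList([AsymmetricCoAttnLayer(dim, heads) for _ in range(s)])
        self.connected = ConnectedAttnLayer(dim, heads)

    def forward(self, text, vis):
        for layer in self.asym_layers:
            text = layer(text, vis)
        return self.connected(text, vis)
```

Stacking such blocks on top of the image and text encoders gives the overall fusion network; the "skip" comes from the visual features being carried past the asymmetric layers unchanged and only entering full self-attention in the connected layer.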

Training Protocol

mPLUG is pre-trained on a substantial dataset of 14 million image-text pairs using multiple objectives: Image-Text Contrastive Learning, Image-Text Matching, Masked Language Modeling, and Prefix Language Modeling. This multi-objective pre-training equips the model for robust transfer, including zero-shot transfer, across tasks such as image-text retrieval, captioning, and visual question answering.
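
One way these objectives could be combined into a single training loss is sketched below. The model methods used (encode_image, encode_text, itm_loss, mlm_loss, prefix_lm_loss), the equal weighting of the terms, and the temperature value are assumptions made for illustration; the paper does not prescribe this exact interface.

```python
import torch
import torch.nn.functional as F


def pretraining_loss(model, images, texts, temperature=0.07):
    """Combined loss over the four objectives named above.

    The model methods used here are hypothetical stand-ins, and the equal
    weighting of the four terms is an assumption made for the example.
    """
    # 1) Image-Text Contrastive (ITC): align unimodal embeddings with a
    #    symmetric InfoNCE loss over in-batch negatives.
    img_emb = F.normalize(model.encode_image(images), dim=-1)   # (B, D)
    txt_emb = F.normalize(model.encode_text(texts), dim=-1)     # (B, D)
    logits = img_emb @ txt_emb.t() / temperature                # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # 2) Image-Text Matching (ITM): binary matched/unmatched classification
    #    on fused pairs, typically with hard negatives mined from `logits`.
    itm = model.itm_loss(images, texts, sim=logits.detach())

    # 3) Masked Language Modeling (MLM): recover masked text tokens
    #    conditioned on the image through the fusion network.
    mlm = model.mlm_loss(images, texts)

    # 4) Prefix Language Modeling (PrefixLM): autoregressively generate the
    #    caption from the image and a text prefix.
    prefix = model.prefix_lm_loss(images, texts)

    return itc + itm + mlm + prefix
```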

Experimental Results

The empirical evaluation highlights mPLUG's superior performance across several VLP benchmarks:

  1. Image-Text Retrieval: mPLUG achieves state-of-the-art retrieval accuracy on the Flickr30K and MSCOCO datasets. Its Recall@1 scores indicate that the model captures fine-grained cross-modal associations (a sketch of how Recall@K is computed follows this list).
  2. Image Captioning: The model performs strongly on image captioning, with high CIDEr scores on both the COCO Caption and NoCaps benchmarks, surpassing previous results.
  3. Visual Question Answering (VQA): mPLUG achieves substantial gains on VQA, surpassing models pre-trained on far larger datasets, such as SimVLM and Florence.
  4. Visual Grounding and Reasoning: The architecture also performs strongly on visual grounding benchmarks such as RefCOCO and on visual reasoning datasets such as NLVR2 and SNLI-VE.
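
For reference, the Recall@K metric used in the retrieval results can be computed from an image-text similarity matrix as in the sketch below. It assumes, for simplicity, a single ground-truth caption per image; Flickr30K and MSCOCO provide five captions per image, so the actual protocol counts a hit if any ground-truth caption appears in the top K.

```python
import torch


def recall_at_k(sim, k=1):
    """Image-to-text Recall@K from a similarity matrix.

    sim[i, j] is the score between image i and text j. For simplicity,
    text i is assumed to be the single ground-truth match for image i.
    """
    topk = sim.topk(k, dim=1).indices                        # (N, k) retrieved text ids
    targets = torch.arange(sim.size(0), device=sim.device)   # ground-truth ids
    hits = (topk == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()


# Example usage with unit-normalized embeddings:
# r1 = recall_at_k(image_emb @ text_emb.t(), k=1)
```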

Implications and Future Prospects

The introduction of mPLUG marks a pivotal step towards more efficient and effective multi-modal learning systems. This architecture not only enhances model efficiency by reducing computational load but also ensures information-rich cross-modal encoding, benefiting numerous downstream applications in AI. The robustness of mPLUG in zero-shot settings also paves the way for more generalized AI systems capable of transferring knowledge across domains without extensive re-training.

Moving forward, further research could explore scaling mPLUG's architecture to accommodate additional modalities and investigate its application in real-time multi-modal systems. The exploration of skip-connections in other multi-modal contexts may yield insights beneficial to the broader field of AI.

References (69)
  1. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
  2. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer.
  3. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.
  4. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer.
  5. Ernie-vil: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 3208–3216.
  6. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34.
  7. Simvlm: Simple visual language model pretraining with weak supervision. CoRR, abs/2108.10904.
  8. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086.
  9. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588.
  10. Vilt: Vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334.
  11. An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387.
  12. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  13. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  14. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
  15. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  16. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer.
  17. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
  18. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918.
  19. E2e-vlp: End-to-end vision-language pre-training enhanced by visual learning. arXiv preprint arXiv:2106.01804.
  20. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358.
  21. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  22. Highway networks. arXiv preprint arXiv:1505.00387.
  23. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708.
  24. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence.
  25. Rethinking skip connection with layer normalization in transformers and resnets. arXiv preprint arXiv:2105.07205.
  26. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383.
  27. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
  28. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738.
  29. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  30. Palm: Pre-training an autoencoding&autoregressive language model for context-conditioned generation. arXiv preprint arXiv:2004.07159.
  31. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
  32. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
  33. Scaling up vision-language pre-training for image captioning. CoRR, abs/2111.12233.
  34. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086.
  35. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052.
  36. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
  37. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73.
  38. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
  39. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568.
  40. Im2text: Describing images using 1 million captioned photographs. In Advances in neural information processing systems, pages 1143–1151.
  41. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  42. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703.
  43. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.
  44. Unifying vision-and-language tasks via text generation. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 1931–1942. PMLR.
  45. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432.
  46. Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409.
  47. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325.
  48. nocaps: novel object captioning at scale. CoRR, abs/1812.08658.
  49. Self-critical sequence training for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1179–1195.
  50. Large-scale adversarial training for vision-and-language representation learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  51. MDETR - modulated detection for end-to-end multi-modal understanding. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 1760–1770. IEEE.
  52. Crossing the format boundary of text and boxes: Towards unified vision-language modeling. CoRR, abs/2111.12085.
  53. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649.
  54. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491.
  55. Visual entailment: A novel task for fine-grained image understanding. CoRR, abs/1901.06706.
  56. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783.
  57. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9879–9889.
  58. Videoclip: Contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6787–6800.
  59. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34.
  60. Align and prompt: Video-and-language pre-training with entity prompts. arXiv preprint arXiv:2112.09583.
  61. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681.
  62. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640.
  63. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738.
  64. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34.
  65. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776–780. IEEE.
  66. Just ask: Learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1686–1697.
  67. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913.
  68. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137.
  69. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20.
Authors (15)
  1. Chenliang Li (92 papers)
  2. Haiyang Xu (67 papers)
  3. Junfeng Tian (19 papers)
  4. Wei Wang (1793 papers)
  5. Ming Yan (190 papers)
  6. Bin Bi (24 papers)
  7. Jiabo Ye (17 papers)
  8. Hehong Chen (10 papers)
  9. Guohai Xu (21 papers)
  10. Zheng Cao (48 papers)
  11. Ji Zhang (176 papers)
  12. Songfang Huang (51 papers)
  13. Fei Huang (408 papers)
  14. Jingren Zhou (198 papers)
  15. Luo Si (73 papers)
Citations (181)