A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation (2306.07198v1)
Abstract: Large language models (LLMs) such as BERT and the GPT series started a paradigm shift that calls for building general-purpose models via pre-training on large datasets, followed by fine-tuning on task-specific datasets. There is now a plethora of large pre-trained models for Natural Language Processing and Computer Vision. Recently, we have seen rapid developments in the joint Vision-Language space as well, where pre-trained models such as CLIP (Radford et al., 2021) have demonstrated improvements in downstream tasks like image captioning and visual question answering. However, there is surprisingly little work exploring these models for the task of multimodal machine translation, where the goal is to leverage the image/video modality in text-to-text translation. To fill this gap, this paper surveys the landscape of language-and-vision pre-training from the lens of multimodal machine translation. We summarize the common architectures, pre-training objectives, and datasets from the literature and conjecture what further is needed to make progress on multimodal machine translation.
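To make the setting in the abstract concrete, here is a minimal sketch (not a method from the survey) of how a pre-trained vision-language model such as CLIP could supply visual context to a text-to-text translation encoder. It assumes the Hugging Face transformers, PyTorch, and Pillow libraries and the public openai/clip-vit-base-patch32 checkpoint; the projection layer, the 1024-dimensional encoder width, and the example.jpg path are illustrative assumptions.

```python
# Minimal sketch: extract CLIP image features and map them into an MT encoder's
# embedding space as an extra "visual token". Hypothetical glue code, not the
# paper's method; assumes torch, transformers, and Pillow are installed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path for the source-side image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    img_feat = clip.get_image_features(**inputs)   # shape: (1, 512)

# Assumed 1024-dim hidden size for the translation encoder; the projected
# feature could be prepended to the source token embeddings.
proj = torch.nn.Linear(img_feat.size(-1), 1024)
visual_token = proj(img_feat).unsqueeze(1)         # shape: (1, 1, 1024)
print(visual_token.shape)
```

How such a visual token is fused with the text encoder (prepending, cross-attention, gating, etc.) is exactly the design space the surveyed multimodal machine translation literature explores.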
- The AMARA corpus: Building parallel language resources for the educational domain. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 1856–1862, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Flamingo: a visual language model for few-shot learning.
- VLMo: Unified vision-language pre-training with mixture-of-modality-experts.
- Cross-lingual visual pre-training for multimodal machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1317–1324, Online. Association for Computational Linguistics.
- VLP: A survey on vision-language pre-training.
- PaLI: A jointly-scaled multilingual language-image model.
- Karan Desai and Justin Johnson. 2020. VirTex: Learning visual representations from textual annotations. arXiv preprint arXiv:2006.06666.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- An image is worth 16x16 words: Transformers for image recognition at scale.
- A survey of vision-language pre-trained models.
- Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, pages 215–233, Copenhagen, Denmark. Association for Computational Linguistics.
- Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin, Germany. Association for Computational Linguistics.
- Masked autoencoders are scalable vision learners.
- Training compute-optimal large language models.
- ClinicalBERT: Modeling clinical notes and predicting hospital readmission.
- Perceiver: General perception with iterative attention.
- Scaling up visual and vision-language representation learning with noisy text supervision.
- Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.
- Visual Genome: Connecting language and vision using crowdsourced dense image annotations.
- Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.
- VISA: An ambiguous subtitles dataset for visual scene-aware machine translation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6735–6743, Marseille, France. European Language Resources Association.
- Microsoft COCO: Common objects in context.
- Swin Transformer: Hierarchical vision transformer using shifted windows.
- Proceedings of the 6th Workshop on Asian Translation. Association for Computational Linguistics, Hong Kong, China.
- Overview of the 9th workshop on Asian translation. In Proceedings of the 9th Workshop on Asian Translation, pages 1–36, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
- Overview of the 8th workshop on Asian translation. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), pages 1–45, Online. Association for Computational Linguistics.
- Overview of the 7th workshop on Asian translation. In Proceedings of the 7th Workshop on Asian Translation, pages 1–44, Suzhou, China. Association for Computational Linguistics.
- Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
- Language models are unsupervised multitask learners.
- Exploring the limits of transfer learning with a unified text-to-text transformer.
- High-resolution image synthesis with latent diffusion models.
- How2: a large-scale dataset for multimodal language understanding. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL). NeurIPS.
- Learning visual representations with caption annotations.
- LAION-5B: An open large-scale dataset for training next generation image-text models.
- Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.
- A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 543–553, Berlin, Germany. Association for Computational Linguistics.
- Multimodal machine translation through visuals and speech. Machine Translation, 34(2-3):97–147.
- VideoBERT: A joint model for video and language representation learning.
- YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73.
- GIT: A generative image-to-text transformer for vision and language.
- Image as a foreign language: BEiT pretraining for all vision and vision-language tasks.
- Large-scale multi-modal pre-trained models: A comprehensive survey.
- Vatex: A large-scale, high-quality multilingual dataset for video-and-language research.
- MSR-VTT: A large video description dataset for bridging video and language. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- CoCa: Contrastive captioners are image-text foundation models.