VLIS: Unimodal Language Models Guide Multimodal Language Generation (2310.09767v2)
Abstract: Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training. VLIS extracts the pointwise mutual information of each image and text from a vision-language model and uses the value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.
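The reweighting step the abstract describes can be sketched in a few lines. Below is a minimal, hypothetical PyTorch sketch of one VLIS decoding step, assuming the models share a next-token vocabulary and that the image-marginal distribution p(token | text) is approximated by running the vision-language model on a null (e.g., blank) image; the names `vlis_logits` and `alpha` are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def vlis_logits(text_logits: torch.Tensor,
                vlm_image_logits: torch.Tensor,
                vlm_null_logits: torch.Tensor,
                alpha: float = 1.0) -> torch.Tensor:
    """One VLIS decoding step over a shared next-token vocabulary.

    text_logits:      logits from the text-only language model
    vlm_image_logits: logits from the vision-language model given the image
    vlm_null_logits:  logits from the vision-language model given a null
                      image, approximating the image-marginalized distribution
    """
    # Pointwise mutual information between the image and each candidate token:
    # PMI(c; image) = log p_vlm(c | image, text) - log p_vlm(c | text)
    pmi = F.log_softmax(vlm_image_logits, dim=-1) - F.log_softmax(vlm_null_logits, dim=-1)

    # exp(PMI) acts as an importance sampling weight on the text-only
    # likelihood; in log space this is an additive adjustment scaled by alpha.
    return F.log_softmax(text_logits, dim=-1) + alpha * pmi

# Greedy decoding would then pick the highest adjusted score, e.g.:
# next_token = vlis_logits(t_logits, v_img_logits, v_null_logits).argmax(dim=-1)
```

Because the adjustment is purely a decoding-time rescoring, neither model needs gradient updates, which is what lets VLIS combine the two models without further training.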
- Flamingo: A visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.
- VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV).
- OpenFlamingo.
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
- Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images. arXiv preprint arXiv:2303.07274.
- Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
- Enabling multimodal generation on CLIP via vision-language knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2383–2395.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning.
- Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 1173–1178.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339.
- Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL), pages 889–898.
- Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6904–6913.
- Women also snowboard: Overcoming bias in captioning models. In Proceedings of the European conference on computer vision (ECCV), pages 771–787.
- CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7514–7528.
- Quantifying societal bias amplification in image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13450–13459.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
- The curious case of neural text degeneration. In International Conference on Learning Representations (ICLR).
- Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL), pages 1638–1649.
- Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045.
- Taichi Iki and Akiko Aizawa. 2021. Effect of visual extensions on natural language understanding in vision-and-language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2189–2196.
- OPT-IML: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017.
- A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2763–2775.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
- UnifiedQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907.
- Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
- A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 317–325.
- Concadia: Towards image-based text generation with a purpose. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4667–4684.
- Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
- VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
- Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097.
- Recurrent topic-transition GAN for visual paragraph generation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3362–3371.
- Visual instruction tuning.
- ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems (NeurIPS), 32.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).
- NeuroLogic A*esque decoding: Constrained text generation with lookahead heuristics. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 780–799.
- NeuroLogic decoding: (Un)supervised neural text generation with predicate logic constraints. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4288–4299.
- OK-VQA: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Locally typical sampling. Transactions of the Association for Computational Linguistics (TACL), 11:102–121.
- Training for diversity in image paragraph captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 757–761.
- Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372.
- A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 839–849.
- Thanh-Son Nguyen and Basura Fernando. 2022. Effective multimodal encoding for image paragraph captioning. IEEE Transactions on Image Processing, 31:6381–6395.
- BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
- Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473.
- MAUVE: Measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems (NeurIPS), 34:4816–4828.
- Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
- Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. In International Conference on Learning Representations (ICLR).
- Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045.
- A-OKVQA: A benchmark for visual question answering using world knowledge. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VIII, pages 146–162. Springer.
- Yixuan Su and Nigel Collier. 2023. Contrastive search is what you need for neural text generation. Transactions on Machine Learning Research.
- Language models can see: Plugging visual controls in text generation. arXiv preprint arXiv:2205.02655.
- A contrastive framework for neural text generation. In Advances in Neural Information Processing Systems.
- Pre-training is (almost) all you need: An application to commonsense reasoning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3878–3887.
- Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111.
- ZeroCap: Zero-shot image-to-text generation for visual-semantic arithmetic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17918–17928.
- Surya T Tokdar and Robert E Kass. 2010. Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):54–60.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems (NeurIPS), 34:200–212.
- ZeroGen: Zero-shot multimodal controllable text generation with multiple oracles. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 494–506. Springer.
- CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575.
- GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
- Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
- Google Landmarks Dataset v2: A large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2575–2584.
- Just ask: Learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1686–1697.
- Hierarchical scene graph encoder-decoder for image paragraph captioning. In Proceedings of the 28th ACM International Conference on Multimedia (MM), pages 4181–4189.
- Fusing pre-trained language models with multimodal prompts through reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10845–10856.
- Multimodal knowledge alignment with reinforcement learning. arXiv preprint arXiv:2205.12630.
- Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598.
- What matters in training a GPT4-style language model with multimodal inputs? arXiv preprint arXiv:2307.02469.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
Authors: Jiwan Chung, Youngjae Yu