VLIS: Unimodal Language Models Guide Multimodal Language Generation (2310.09767v2)

Published 15 Oct 2023 in cs.CL and cs.AI

Abstract: Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models, without further training. It extracts the pointwise mutual information of each image and text from a visual-language model and uses the value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.
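The abstract describes a token-level fusion rule: the text-only model's next-token likelihood is reweighted by the pointwise mutual information (PMI) between the token and the image, as measured by the vision-language model. The sketch below illustrates that idea under stated assumptions; it is not the authors' implementation, and the names (text_lm_logprobs, vlm_cond_logprobs, vlm_marg_logprobs, alpha) are illustrative. In particular, how the image-marginal likelihood is estimated (e.g., with a blank image or a small image set) is an assumption here.

```python
# Minimal sketch of VLIS-style PMI reweighting at one decoding step (assumptions noted above).
import torch

def vlis_next_token_scores(text_lm_logprobs: torch.Tensor,
                           vlm_cond_logprobs: torch.Tensor,
                           vlm_marg_logprobs: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """Fuse a text-only LM with a vision-language model (VLM) via PMI weights.

    text_lm_logprobs:  log p_text(x_t | x_<t), shape (vocab,)
    vlm_cond_logprobs: log p_vl(x_t | image, x_<t), shape (vocab,)
    vlm_marg_logprobs: log p_vl(x_t | x_<t), an image-marginal estimate
                       (e.g., from a blank image; this is an assumption).
    """
    # PMI(x_t; image | x_<t) = log p_vl(x_t | image, x_<t) - log p_vl(x_t | x_<t)
    pmi = vlm_cond_logprobs - vlm_marg_logprobs
    # Importance-sampling-style reweighting of the text-only likelihood:
    # score(x_t) = log p_text(x_t | x_<t) + alpha * PMI(x_t; image | x_<t)
    scores = text_lm_logprobs + alpha * pmi
    return torch.log_softmax(scores, dim=-1)

# Usage (hypothetical tensors): next_token = vlis_next_token_scores(lp_text, lp_vl_img, lp_vl_null).argmax()
```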

Authors (2)
  1. Jiwan Chung (22 papers)
  2. Youngjae Yu (72 papers)
Citations (1)