Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction (2310.03291v3)

Published 5 Oct 2023 in cs.CV

Abstract: In this paper, we introduce EVLGen, a streamlined framework designed for the pre-training of visually conditioned language generation models with high computational demands, utilizing frozen pre-trained LLMs. The conventional approach in vision-language pre-training (VLP) typically involves a two-stage optimization process: an initial resource-intensive phase dedicated to general-purpose vision-language representation learning, focused on extracting and consolidating relevant visual features, followed by a subsequent phase that emphasizes end-to-end alignment between the visual and linguistic modalities. Our novel one-stage, single-loss framework bypasses the computationally demanding first training stage by gradually merging similar visual tokens during training, while avoiding the model collapse caused by single-stage training of BLIP-2-type models. The gradual merging process effectively condenses visual information while preserving semantic richness, resulting in rapid convergence without compromising performance. Our experimental findings demonstrate that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance. Furthermore, we illustrate that our models significantly narrow the performance gap to current vision-language models while using only 1/10 of the data. Finally, we showcase how our image-text models can seamlessly adapt to video-conditioned language generation tasks through novel soft attentive temporal token contextualizing modules. Code is available at https://github.com/yiren-jian/EVLGen.
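To make the token-merging idea concrete, below is a minimal, hypothetical sketch of condensing visual tokens by merging similar ones. It assumes PyTorch, a greedy pairing of the most similar consecutive tokens, and simple averaging; the function name, shapes, and merging schedule are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch only: merge similar visual tokens by cosine similarity,
# in the spirit of the paper's gradual merging idea. Not the authors' code;
# the pairing strategy and averaging here are simplifying assumptions.
import torch
import torch.nn.functional as F


def merge_similar_tokens(x: torch.Tensor, num_merges: int) -> torch.Tensor:
    """Greedily merge the `num_merges` most similar consecutive token pairs.

    x: (batch, num_tokens, dim) visual tokens from a frozen image encoder.
    Returns (batch, num_tokens - num_merges, dim).
    """
    b, n, _ = x.shape
    out = []
    for i in range(b):
        tokens = [x[i, j] for j in range(n)]
        for _ in range(num_merges):
            # Cosine similarity between consecutive tokens.
            sims = [
                F.cosine_similarity(tokens[j], tokens[j + 1], dim=0)
                for j in range(len(tokens) - 1)
            ]
            j = int(torch.stack(sims).argmax())
            # Average the most similar pair into a single token.
            merged = (tokens[j] + tokens[j + 1]) / 2
            tokens = tokens[:j] + [merged] + tokens[j + 2:]
        out.append(torch.stack(tokens))
    return torch.stack(out)


if __name__ == "__main__":
    # 257 ViT tokens (CLS + 16x16 patches) condensed by 128 merges.
    vis_tokens = torch.randn(2, 257, 768)
    condensed = merge_similar_tokens(vis_tokens, num_merges=128)
    print(condensed.shape)  # torch.Size([2, 129, 768])
```

In the paper's setting, such merging is applied gradually during training so the frozen LLM conditions on an increasingly compact visual prefix; the specific matching strategy and schedule above are placeholders for illustration.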

Authors (6)
  1. Yiren Jian (11 papers)
  2. Tingkai Liu (9 papers)
  3. Yunzhe Tao (20 papers)
  4. Chunhui Zhang (46 papers)
  5. Soroush Vosoughi (90 papers)
  6. Hongxia Yang (130 papers)
Citations (5)