SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (2404.14396v1)

Published 22 Apr 2024 in cs.CV

Abstract: The rapid evolution of multimodal foundation models has demonstrated significant progress in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between their capability and real-world applicability, primarily due to the models' limited capacity to effectively respond to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap by integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Besides competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications. The models, codes, and datasets will be released at https://github.com/AILab-CVC/SEED-X.

Enhancing Multimodal Foundation Models for Real-world Applicability: Introducing SEED-X

Introduction to SEED-X

In the rapidly evolving domain of multimodal foundation models, the transition from laboratory settings to real-world use presents notable challenges, primarily because existing models interact poorly with diverse visual and instructional data. To address these challenges, this paper introduces SEED-X, an enhanced successor to the previously developed SEED-LLaMA. SEED-X adds the ability to comprehend images of arbitrary sizes and aspect ratios and enables multi-granularity image generation, ranging from high-level, instruction-driven image creation to precise image manipulation.

Key Features and Methodology

SEED-X represents a comprehensive approach to multimodal understanding and generation, designed to operate effectively in diverse real-world applications. The model architecture includes significant enhancements over its predecessors:

  • Visual Tokenization and De-tokenization: Uses a pre-trained Vision Transformer (ViT) as the visual tokenizer, coupled with a visual de-tokenizer that generates detailed images from ViT features. This pairing supports image reconstruction that stays faithful to the original semantics as well as precise image manipulation (a conceptual sketch of this conditioning follows the list).
  • Dynamic Resolution Image Encoding: Processes images of arbitrary resolution by dividing them into a grid of sub-images plus a global view, preserving fine-grained detail and supporting varied aspect ratios without resizing to a fixed, pre-defined shape (see the grid-splitting sketch below).
  • Multimodal Pre-training and Instruction Tuning: Pre-trains on a large-scale multimodal corpus, then applies instruction tuning so the model follows specific instructions in real-world applications, improving both comprehension and generation across varied domains.
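
To make the first bullet concrete, here is a minimal sketch, assuming a Q-Former-style resampler, of how variable-length ViT features can be compressed into a fixed set of condition embeddings for an image decoder. The module names, feature width, and token count are illustrative assumptions, not the released SEED-X implementation, whose de-tokenizer builds on a diffusion decoder.

```python
# Conceptual sketch only: compress ViT features into a fixed set of condition
# embeddings with learnable queries and cross-attention. Sizes and names are
# assumptions for illustration, not the released SEED-X components.
import torch
import torch.nn as nn

DIM = 1024      # assumed ViT feature width
N_QUERY = 64    # assumed number of condition embeddings the LLM regresses

class FeatureResampler(nn.Module):
    """Map variable-length ViT features to N_QUERY embeddings that a
    diffusion-style image decoder could consume as its condition."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(N_QUERY, DIM) * 0.02)
        self.attn = nn.MultiheadAttention(DIM, num_heads=8, batch_first=True)

    def forward(self, vit_feats: torch.Tensor) -> torch.Tensor:
        # vit_feats: (batch, num_patches, DIM)
        q = self.queries.unsqueeze(0).expand(vit_feats.size(0), -1, -1)
        cond, _ = self.attn(q, vit_feats, vit_feats)  # cross-attend to ViT feats
        return cond                                   # (batch, N_QUERY, DIM)

# Comprehension: ViT features are fed into the LLM alongside text tokens.
# Generation/editing: the LLM regresses N_QUERY feature vectors, and the
# de-tokenizer reconstructs or edits the image conditioned on them.
vit_feats = torch.randn(2, 256, DIM)        # stand-in for ViT output on 2 images
print(FeatureResampler()(vit_feats).shape)  # torch.Size([2, 64, 1024])
```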

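The dynamic-resolution encoding from the second bullet can be pictured with the sketch below: an image of arbitrary aspect ratio is laid out as a grid of ViT-sized crops plus a global thumbnail. The crop size, grid cap, and layout rule here are assumptions for illustration, not the exact scheme used by SEED-X.

```python
# Hypothetical grid-based dynamic-resolution encoding; crop size and grid cap
# are illustrative assumptions, not SEED-X's exact configuration.
from PIL import Image

VIT_SIZE = 448   # assumed fixed input resolution of the visual tokenizer
MAX_GRIDS = 6    # assumed cap on the number of sub-image crops

def split_into_grids(image: Image.Image) -> list[Image.Image]:
    """Return ViT-sized crops covering the image, plus a global thumbnail,
    so fine detail survives without forcing a single square resize."""
    w, h = image.size
    # Pick a grid layout whose aspect ratio roughly matches the image.
    cols = max(1, round(w / VIT_SIZE))
    rows = max(1, round(h / VIT_SIZE))
    while cols * rows > MAX_GRIDS:  # keep the visual-token budget bounded
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = image.resize((cols * VIT_SIZE, rows * VIT_SIZE))
    crops = [
        resized.crop((c * VIT_SIZE, r * VIT_SIZE,
                      (c + 1) * VIT_SIZE, (r + 1) * VIT_SIZE))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = image.resize((VIT_SIZE, VIT_SIZE))  # global view of the scene
    return crops + [thumbnail]

# Example: a 1920x1080 image starts as a 4x2 layout, is capped to 3x2 crops,
# and gains one thumbnail, for 7 ViT inputs in total.
print(len(split_into_grids(Image.new("RGB", (1920, 1080)))))  # 7
```
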
Evaluation and Performance

Extensive evaluations demonstrate SEED-X's strong performance on several benchmarks designed for multimodal LLMs. It achieves competitive results in multimodal comprehension and state-of-the-art performance in image generation among existing multimodal LLMs. In particular, SEED-X excels at handling multi-image contexts and at generating high-quality, instruction-aligned images.

Implications and Future Prospects

The development of SEED-X marks a significant step toward bridging the gap between academic multimodal model research and practical real-world applications. By enabling nuanced understanding and generation of multimodal data, SEED-X could serve various domains, from creative design to personal assistance and beyond.

Future research could explore further enhancements in the robustness of image tokenization processes and expand the model's adaptability to dynamically varied multimodal scenarios, potentially leading to more generalized AI systems capable of seamless interaction in complex real-world environments.

Conclusion

SEED-X advances the field of multimodal foundation models by substantially improving their real-world applicability. With its unified architecture and strong performance across multiple benchmarks, SEED-X points toward what next-generation multimodal models can deliver in applications across industries.

Authors (9)
  1. Yuying Ge
  2. Sijie Zhao
  3. Jinguo Zhu
  4. Yixiao Ge
  5. Kun Yi
  6. Lin Song
  7. Chen Li
  8. Xiaohan Ding
  9. Ying Shan