DeepSeek-VL: Towards Real-World Vision-Language Understanding (2403.05525v2)

Published 8 Mar 2024 in cs.AI

Abstract: We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.

DeepSeek-VL: A New Horizon in Vision-Language Models

Introduction

The integration of vision and language understanding has long been a challenging yet critical goal in artificial intelligence research. Vision-Language Models (VLMs) are at the forefront of bridging this gap, enabling machines to comprehend and respond to combined visual and textual inputs. DeepSeek-VL is an open-source VLM designed and optimized for real-world applications. Building on the strengths of LLMs, it adopts a pretraining methodology that retains linguistic ability while incorporating multimodal data. This overview covers the distinct strategies employed in DeepSeek-VL's creation: data construction, model architecture, training strategy, and a comprehensive evaluation across a range of benchmarks.

Model Architecture

DeepSeek-VL incorporates a hybrid vision encoder that efficiently handles high-resolution images, a crucial aspect of capturing detailed visual information. The hybrid design pairs an encoder that extracts coarse, global semantics with a high-resolution encoder that preserves fine-grained detail, allowing the model to process 1024 x 1024 images within a fixed token budget and to balance detail capture against computational cost. This choice directly targets demanding real-world inputs such as fine-grained object recognition and detailed OCR.
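
As a rough illustration of this pattern, the sketch below pairs a low-resolution branch with a high-resolution branch and fuses both into a fixed-length visual prefix for the LLM. It is a minimal, self-contained approximation: the stub encoders, dimensions, and fusion via channel concatenation plus an MLP adapter are assumptions for illustration, not the released DeepSeek-VL implementation.

```python
import torch
import torch.nn as nn


class TokenizingStub(nn.Module):
    """Toy stand-in for a ViT branch: patchify, project, and pool to a
    fixed token count. A real semantic or detail encoder would go here;
    this stub only keeps the sketch runnable."""

    def __init__(self, patch=16, dim=1024, num_tokens=576):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)

    def forward(self, x):                         # x: (B, 3, H, W)
        feats = self.proj(x).flatten(2)           # (B, dim, H/patch * W/patch)
        return self.pool(feats).transpose(1, 2)   # (B, num_tokens, dim)


class HybridVisionEncoder(nn.Module):
    """Low-resolution semantic branch + high-resolution detail branch,
    fused into a fixed budget of visual tokens and projected to the LLM width.
    Token count and widths are illustrative assumptions."""

    def __init__(self, num_tokens=576, dim=1024, llm_dim=2048):
        super().__init__()
        self.semantic = TokenizingStub(dim=dim, num_tokens=num_tokens)  # fed low-res input
        self.detail = TokenizingStub(dim=dim, num_tokens=num_tokens)    # fed 1024x1024 input
        self.adapter = nn.Sequential(                                   # MLP into LLM space
            nn.Linear(2 * dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, img_lowres, img_highres):
        sem = self.semantic(img_lowres)                  # (B, T, dim) global semantics
        det = self.detail(img_highres)                   # (B, T, dim) fine detail / OCR cues
        return self.adapter(torch.cat([sem, det], -1))   # (B, T, llm_dim) visual prefix


# The LLM receives a constant-length visual prefix regardless of image resolution.
enc = HybridVisionEncoder()
tokens = enc(torch.randn(1, 3, 384, 384), torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 576, 2048])
```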

Data Construction

DeepSeek-VL's robustness owes much to its extensive pretraining data, meticulously curated to cover a wide spectrum of real-world scenarios: web screenshots, PDFs, OCR data, charts, and knowledge-based content. The model additionally benefits from an instruction-tuning dataset built around a taxonomy of real user scenarios, which improves its relevance and effectiveness in practical applications.
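
To make the idea of a curated mixture concrete, here is a purely illustrative sketch of how such a data recipe might be declared. The source names mirror the categories above, but the weights and field names are hypothetical and are not taken from the paper.

```python
# Hypothetical pretraining mixture over heterogeneous sources; the weights
# below are placeholders, not DeepSeek-VL's released data recipe.
PRETRAIN_MIXTURE = [
    {"source": "interleaved_web",      "modality": "image+text", "weight": 0.25},
    {"source": "web_screenshots_ocr",  "modality": "image+text", "weight": 0.15},
    {"source": "pdf_and_document_ocr", "modality": "image+text", "weight": 0.10},
    {"source": "charts_and_tables",    "modality": "image+text", "weight": 0.10},
    {"source": "text_only_corpus",     "modality": "text",       "weight": 0.40},
]


def normalized_weights(mixture):
    """Return per-source sampling probabilities that sum to 1."""
    total = sum(entry["weight"] for entry in mixture)
    return {entry["source"]: entry["weight"] / total for entry in mixture}


print(normalized_weights(PRETRAIN_MIXTURE))
```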

Training Strategy

A key element of DeepSeek-VL's development is a training strategy designed to preserve the model's language capability while integrating the vision modality. Training begins with a strong emphasis on text and gradually increases the proportion of multimodal data, so that both capabilities develop in balance. This schedule mitigates the degradation of linguistic performance commonly observed when multimodal data is introduced.
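
A minimal sketch of such a schedule is below, assuming a linear warm-up of the multimodal fraction; the concrete ratios, schedule shape, and sampler used in the paper may differ.

```python
import random


def multimodal_fraction(step, warmup_steps=2000, final_fraction=0.3):
    """Fraction of multimodal samples at a given training step.
    Starts text-heavy and ramps linearly to `final_fraction`; the
    specific numbers are illustrative, not the paper's values."""
    progress = min(step / warmup_steps, 1.0)
    return progress * final_fraction


def sample_batch(step, text_pool, vl_pool, batch_size=8, seed=None):
    """Mix language-only and vision-language examples according to the
    current schedule, so language ability is maintained while the
    vision modality is phased in."""
    rng = random.Random(seed)
    p_vl = multimodal_fraction(step)
    return [rng.choice(vl_pool) if rng.random() < p_vl else rng.choice(text_pool)
            for _ in range(batch_size)]


# Early in training almost every sample is text; later roughly 30% are multimodal.
text_pool = [{"kind": "text"}]
vl_pool = [{"kind": "image+text"}]
for step in (0, 500, 2000):
    batch = sample_batch(step, text_pool, vl_pool, seed=0)
    print(step, sum(x["kind"] != "text" for x in batch), "multimodal of", len(batch))
```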

Evaluation and Implications

DeepSeek-VL has been evaluated across a broad spectrum of vision-language benchmarks, achieving state-of-the-art or competitive performance at comparable model sizes while maintaining robust results on language-centric benchmarks. These results cover language understanding, visual comprehension, and multimodal interaction, and position DeepSeek-VL as a strong foundation model for a wide range of applications, pushing the boundaries of what is achievable with open-source VLMs.

Limitations and Future Directions

Despite these results, DeepSeek-VL has limitations: the released models are relatively small (1.3B and 7B parameters), and the architecture does not yet incorporate Mixture of Experts (MoE) technology. Future work will focus on scaling up DeepSeek-VL and improving its efficiency, potentially setting new benchmarks in the VLM landscape.

Conclusion

DeepSeek-VL represents a significant stride towards realizing the full potential of vision-language models. By effectively combining deep language understanding with robust visual processing capabilities, DeepSeek-VL sets a new standard for open-source models in real-world applications. Its development strategy, focused on comprehensive pretraining, careful data curation, and a balanced training approach, provides valuable insights for future advancements in VLMs.

Authors (15)
  1. Haoyu Lu
  2. Wen Liu
  3. Bo Zhang
  4. Bingxuan Wang
  5. Kai Dong
  6. Bo Liu
  7. Jingxiang Sun
  8. Tongzheng Ren
  9. Zhuoshu Li
  10. Yaofeng Sun
  11. Chengqi Deng
  12. Hanwei Xu
  13. Zhenda Xie
  14. Chong Ruan
  15. Hao Yang
Citations (149)