DeepSeek-VL: Efficient Multimodal VLM
- DeepSeek-VL is a family of open-source vision-language models that enable real-world multimodal understanding across tasks like OCR, chart analysis, and logical reasoning.
- It employs a hybrid vision encoder combining low- and high-resolution processing, with visual features mapped into the language model's token stream via a vision-language adaptor.
- The series uses a three-stage training pipeline, and DeepSeek-VL2 adds architectural enhancements such as dynamic tiling, Mixture-of-Experts, and Multi-head Latent Attention to achieve state-of-the-art performance with a reduced active-parameter footprint.
DeepSeek-VL is an open-source family of vision-language models (VLMs) designed for real-world multimodal understanding with a particular emphasis on robust performance, data diversity, and computational efficiency. Engineered by the DeepSeek-AI group, DeepSeek-VL and its successors address a broad range of visual and language-driven tasks, from document OCR and chart analysis to logical reasoning and visual grounding, leveraging a hybrid vision encoder and carefully curated training protocols. The series demonstrates state-of-the-art or competitive results across diverse evaluation benchmarks while maintaining minimal parameter footprints relative to peer systems (Lu et al., 2024, Wu et al., 2024).
1. Data Foundation and Use-Case Taxonomy
DeepSeek-VL’s training corpus extends across two primary stages: pretraining on massive, heterogeneous vision-language data and supervised fine-tuning (SFT) based on a granular, real-world use-case taxonomy. The pretraining data comprises interleaved image–text datasets (MMC4, English Wikipedia, WikiHow), chart and table corpora (Chart2Text, UniChart, UReader), web-UI screenshots, OCR sources, and a large text-only subset from DeepSeek-LLM (70% of tokens) (Lu et al., 2024).
The SFT stage leverages a comprehensive taxonomy of practical user scenarios, including:
- Recognition: global/local descriptions, OCR/transcriptions.
- Conversion: image-to-text, image-to-code.
- Analysis: chart/table interpretation, professional diagrams, specialized and encyclopedic images.
- Reasoning: logical, commonsense, multi-graph, math, safety.
- Evaluation Tasks: realism, aesthetic critique.
Benchmarks and official instruction sets (ShareGPT4V, LAION-GPTV, IconQA, Table/Chart tasks, Screen-to-Code) complement in-house SFT data to ensure fine-grained coverage of all taxonomy leaf nodes. Each SFT example is stored as an (image, prompt, response) triple, with prompts sampled to achieve maximal domain diversity (Lu et al., 2024).
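The following minimal sketch shows how one such (image, prompt, response) triple might be represented; the field names, path, and taxonomy labels are illustrative assumptions, not the released data format.

```python
# Hypothetical layout of a single SFT record; keys and values are illustrative only.
sft_example = {
    "image": "charts/quarterly_revenue.png",   # image path, or None for text-only samples
    "prompt": "Summarize the trend shown in this chart.",
    "response": "Revenue grows steadily each quarter, with the largest jump in Q4.",
    "taxonomy": ["Analysis", "chart/table interpretation"],  # leaf node of the use-case taxonomy
}
```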
2. Model Architecture and Vision-Language Fusion
The DeepSeek-VL VLM architecture is based on a dense, decoder-only transformer following the DeepSeek-LLM backbone, with a hybrid vision encoder designed for efficient, high-resolution processing (Lu et al., 2024). The vision encoder consists of two components:
- Low-resolution encoder: SigLIP-L, operating on 384×384 inputs to extract global, semantic features.
- High-resolution encoder: SAM-B (ViTDet backbone), operating on 1024×1024 inputs to capture fine details (e.g., text, UI elements).
Feature maps from SigLIP-L and SAM-B are pooled and concatenated into a fixed-dimensional token sequence (576 tokens), mapped to the LLM embedding space via a two-layer MLP “vision-language adaptor.” The transformer receives a unified token stream, with visual tokens prepended to the text tokens.
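A minimal PyTorch-style sketch of this fusion step is shown below; the class name, feature dimensions, and the assumption that both encoders are already pooled to 576 aligned tokens are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class HybridVisionAdaptor(nn.Module):
    """Sketch of the two-layer MLP vision-language adaptor: pooled SigLIP-L
    (global) and SAM-B (high-resolution) features are concatenated per token
    and projected into the LLM embedding space. Dimensions are illustrative."""
    def __init__(self, d_siglip=1024, d_sam=256, d_llm=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_siglip + d_sam, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, siglip_feats, sam_feats, text_embeds):
        # siglip_feats: (B, 576, d_siglip); sam_feats: (B, 576, d_sam)
        vision_tokens = self.mlp(torch.cat([siglip_feats, sam_feats], dim=-1))
        # Visual tokens are prepended to the text token embeddings.
        return torch.cat([vision_tokens, text_embeds], dim=1)
```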
In DeepSeek-VL2, introduced in (Wu et al., 2024), this architecture evolves to incorporate dynamic tiling: arbitrary high-resolution images are divided into tiles plus a global thumbnail. This process minimizes padding for arbitrary aspect ratios and maximizes spatial and semantic coverage. Each tile’s features are compressed into 196 tokens using a pixel shuffle, and separator tokens delineate both spatial structure and global-local context.
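A rough sketch of how such a tile grid might be chosen is shown below; the scoring heuristic (minimizing the padded fraction of the tiled canvas) and the nine-tile budget are stated as illustrative assumptions rather than the released implementation.

```python
def select_tile_grid(width, height, tile=384, max_tiles=9):
    """Illustrative dynamic-tiling grid selection: pick an m x n layout of
    384x384 tiles (m * n <= max_tiles) whose aspect ratio wastes the least
    canvas area once the image is resized to fit; a global thumbnail tile
    is added separately."""
    best, best_waste = (1, 1), 1.0
    for m in range(1, max_tiles + 1):              # tiles along the width
        for n in range(1, max_tiles // m + 1):     # tiles along the height
            canvas_w, canvas_h = m * tile, n * tile
            scale = min(canvas_w / width, canvas_h / height)   # fit image inside canvas
            waste = 1.0 - (width * scale) * (height * scale) / (canvas_w * canvas_h)
            if waste < best_waste:
                best, best_waste = (m, n), waste
    return best  # (m, n) grid that best matches the image's aspect ratio
```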
3. Multistage Training and Optimization Strategies
DeepSeek-VL utilizes a three-stage training pipeline:
- Vision–Language Adaptor Alignment: Only the adaptor is trained initially, aligning the frozen vision encoder and language model using ∼1.25 M ShareGPT4V captions and document-OCR pairs. The adaptor's capacity saturates quickly, with no gains observed from extended training (Lu et al., 2024).
- Joint Vision–Language Pretraining: Both the LLM and the adaptor are updated using a mixture of ∼70% text-only and ∼30% multimodal data. To prevent language degradation, curriculum strategies are employed: “modality mixing” (separating pure-language and multimodal batches) and a “modality warm-up” schedule in which the proportion of multimodal samples is linearly increased.
- Supervised Fine-Tuning (SFT): The final joint optimization includes the vision encoder, adaptor, and LLM. The loss is cross-entropy on next-token prediction for conditional text generation, with the training objective $\mathcal{L}(\theta) = -\sum_{t} \log P_\theta(y_t \mid y_{<t}, \mathbf{v})$, where $\mathbf{v}$ denotes the visual tokens and $y_t$ the target text tokens, and with language-only and vision-language samples weighted against each other in the training mixture (Lu et al., 2024).
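The sketch below spells out this objective in PyTorch-style code: prompt and visual positions are masked out of the loss, and a per-sample weight distinguishes vision-language from language-only examples. The masking convention (label = -100) and the default weights are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, is_multimodal, vl_weight=1.0, text_weight=1.0):
    """Next-token cross-entropy over response tokens only, with an
    illustrative per-sample weight for vision-language vs. language-only
    examples (the paper's exact mixture weighting is not reproduced here)."""
    # logits: (B, T, vocab); labels: (B, T) with -100 on prompt/visual positions
    B, T, V = logits.shape
    shifted_logits = logits[:, :-1].reshape(-1, V)
    shifted_labels = labels[:, 1:].reshape(-1)
    per_token = F.cross_entropy(
        shifted_logits, shifted_labels, ignore_index=-100, reduction="none"
    ).view(B, T - 1)
    n_targets = (labels[:, 1:] != -100).sum(dim=1).clamp(min=1)
    per_sample = per_token.sum(dim=1) / n_targets            # mean loss per example
    weights = is_multimodal.float() * vl_weight + (~is_multimodal).float() * text_weight
    return (weights * per_sample).mean()
```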
4. DeepSeek-VL2: Architectural Advancements
DeepSeek-VL2 introduces principal enhancements over its predecessor by integrating a Mixture-of-Experts (MoE) LLM and Multi-head Latent Attention (MLA), both designed to minimize active parameter count during inference while maintaining or increasing accuracy (Wu et al., 2024).
- Dynamic Tiling for Vision: Images are partitioned into an m×n grid of 384×384 tiles (with m·n ≤ 9) plus a global thumbnail, and each tile is passed through SigLIP-SO400M-384. Token compression and hierarchical spatial separators optimize context efficiency.
- Multi-head Latent Attention (MLA): Rather than caching all key/value pairs (K, V) in attention, token representations are projected into a lower-rank latent space and only these compact latent vectors are cached. At each new token, the query attends to keys and values reconstructed from the latent cache, drastically reducing memory and compute for long contexts (see the sketch after this list).
- Sparse MoE Feed-Forward Layers: Each transformer layer replaces the dense FFN with an MoE block. Gating selects the top-K routed experts from the total pool (64 or 72 routable plus 2 shared), with expert selection via softmax (Tiny/Small) or sigmoid with expert bias correction (Base).
- Parameter Efficiency: Activated parameter count at inference is significantly reduced: 1.0 B (Tiny), 2.8 B (Small), 4.5 B (Base), compared to model sizes of 3 B/16 B/27 B, respectively.
- Variant Architecture: Tiny (12 layers/10 heads), Small (27/16), and Base (30/32) scale both the embedding dimension and the expert pool. Inference on a single GPU is feasible: Tiny fits in 10 GB, Small in 40 GB, Base in 80 GB.
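A simplified, single-head sketch of the MLA idea follows; it omits the decoupled rotary-embedding path and multi-head splitting of the actual design, and the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Single-head sketch of Multi-head Latent Attention: only a low-rank
    latent vector per token is cached, and keys/values are re-expanded from
    it at attention time (RoPE handling and head splitting omitted)."""
    def __init__(self, d_model=1024, d_latent=128):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # re-expand keys
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # re-expand values
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.latent_cache = []                                   # one latent per past token

    def step(self, h_t):
        # h_t: (batch, d_model) hidden state of the newest token
        self.latent_cache.append(self.kv_down(h_t))              # cache only the latent
        c = torch.stack(self.latent_cache, dim=1)                # (batch, t, d_latent)
        q = self.q_proj(h_t).unsqueeze(1)                        # (batch, 1, d_model)
        k, v = self.k_up(c), self.v_up(c)                        # (batch, t, d_model)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return self.out((attn @ v).squeeze(1))                   # (batch, d_model)
```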
5. Performance Benchmarking and Qualitative Evaluation
DeepSeek-VL and DeepSeek-VL2 are extensively evaluated on diverse VLM benchmarks, demonstrating leading or near-leading results at competitive size (Lu et al., 2024, Wu et al., 2024).
| Model Variant | MMBench | DocVQA | ChartQA | OCRBench | RefCOCO+ (base) |
|---|---|---|---|---|---|
| DeepSeek-VL2-Tiny (1.0B) | 73.3 | 88.9 | 81.0 | 809 | — |
| DeepSeek-VL2-Small (2.8B) | 82.3 | 92.3 | 84.5 | — | ~95 |
| DeepSeek-VL2 (4.5B) | 83.1 | 93.3 | 86.0 | — | 94.9 |
| InternVL2-1B | 65.4 | 81.7 | 72.9 | 754 | — |
| Qwen2-VL-2B | 74.9 | 90.1 | — | — | — |
| Aria-MoE-3.9B (4.3B act.) | 81.7 | 92.6 | 86.4 | — | — |
Qualitative examples from the paper show precise fine-grained visual recognition (UI elements, tables, chart lines), code explanation, document OCR, and commonsense visual reasoning (e.g., safety assessment from images) (Lu et al., 2024, Wu et al., 2024). Language-only abilities are preserved or mildly attenuated during multimodal training, with select benchmarks (HellaSwag, MMLU) showing improvements and GSM8K reflecting minimal trade-off.
6. Efficiency, Ablation Findings, and Implementation Considerations
DeepSeek-VL and VL2 prioritize computational efficiency alongside accuracy. The dynamic tiling and compression in DeepSeek-VL2’s visual pipeline allow for high-resolution recognition and spatial localization without exceeding transformer context limits. MLA confers a 2–3× speedup for inference at context lengths up to 4000 tokens, and sparse MoE feed-forward layers achieve up to 70–80% parameter activation savings relative to dense baselines.
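To make the memory argument concrete, the back-of-the-envelope sketch below compares a standard per-layer KV cache with an MLA-style latent cache; every dimension here is an illustrative placeholder, not DeepSeek-VL2's actual configuration.

```python
def standard_kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per=2):
    # Keys and values for every layer and every past token (fp16/bf16 = 2 bytes each).
    return 2 * n_layers * n_tokens * n_kv_heads * d_head * bytes_per

def mla_cache_bytes(n_tokens, n_layers, d_latent, bytes_per=2):
    # MLA caches one low-rank latent per token per layer instead of full keys/values.
    return n_layers * n_tokens * d_latent * bytes_per

# Illustrative dimensions only:
std = standard_kv_cache_bytes(n_tokens=4000, n_layers=30, n_kv_heads=20, d_head=128)
mla = mla_cache_bytes(n_tokens=4000, n_layers=30, d_latent=512)
print(f"standard KV cache: {std / 2**20:.0f} MiB, MLA latent cache: {mla / 2**20:.0f} MiB")
```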
Training ablations confirm that separating batches by modality improves throughput by approximately 20%, and that OBELICS interleaving at 30% provides optimal VL convergence for smaller variants. In VL2, shared expert bias correction for sigmoid gating (noaux_tc) enhances load balancing in the Base model (Wu et al., 2024). Disabling dynamic tiling for multi-image contexts prevents excessive context growth.
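The "modality mixing" strategy referenced in this ablation can be pictured with the following batching sketch; the record layout and the random interleaving of pure batches are assumptions for illustration, not the training framework's actual sampler.

```python
import random

def modality_separated_batches(samples, batch_size):
    """Group samples so every batch is purely text-only or purely multimodal,
    avoiding mixed-modality branching within a batch (the strategy reported
    to improve training throughput)."""
    text_only = [s for s in samples if s.get("image") is None]
    multimodal = [s for s in samples if s.get("image") is not None]
    batches = [pool[i:i + batch_size]
               for pool in (text_only, multimodal)
               for i in range(0, len(pool), batch_size)]
    random.shuffle(batches)  # interleave the pure batches at random
    return batches
```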
7. Summary and Prospective Directions
DeepSeek-VL establishes a data- and taxonomy-driven, efficient architectural foundation for open-source vision-language modeling. DeepSeek-VL2, introducing dynamic high-resolution tiling and an MoE-MLA transformer backbone, achieves state-of-the-art or highly competitive results across OCR, chart understanding, VQA, math, and visual grounding—while reducing inference compute and memory requirements. The release of pretrained models and code supports broad adoption and further innovation (Lu et al., 2024, Wu et al., 2024). A plausible implication is the continued trend towards modular, sparse, and context-efficient architectures in large-scale multimodal modeling.