DeepSeek: Open-Source Foundation Models
- DeepSeek is a family of open-source, large-scale foundation models that spans language, vision-language, and multimodal architectures with cost-efficient MoE scaling.
- The architecture introduces innovations like Multi-Head Latent Attention and Multi-Token Prediction to reduce inference costs while enhancing performance on key benchmarks.
- Reinforcement learning pipelines and RLHF techniques drive advanced reasoning, making DeepSeek effective in theorem proving, biomedical NLP, and vision-language applications.
DeepSeek is a family of large-scale, open-source foundation models that encompasses language, vision-language, and multimodal architectures. Developed by DeepSeek-AI (China), DeepSeek models are characterized by a strong emphasis on cost-efficient mixture-of-experts (MoE) scaling, architectural innovations in attention and memory management, reinforcement-learning-driven reasoning, and open-access engineering. The suite covers applications from general-purpose language modeling and code generation to advanced vision-language understanding, formal theorem proving, and real-world biomedical and high-performance computing tasks.
1. Model Architecture and Technical Innovations
DeepSeek models introduce several principal architectural advancements:
- Mixture-of-Experts (MoE): Most DeepSeek models, notably DeepSeek-V3 (671B total, 37B activated) and successors, employ a sparse MoE transformer backbone. Each token is routed to a small, learned subset of experts (feedforward subnetworks) per layer, with expert selection guided by token-to-expert affinity and lightweight gating. This yields up to a 10× reduction in inference cost relative to dense models of comparable capacity, enabling efficient scaling on commodity and export-restricted hardware (DeepSeek-AI et al., 27 Dec 2024, DeepSeek-AI et al., 7 May 2024). A combined routing and load-balancing sketch appears after the table below.
- Multi-Head Latent Attention (MLA): DeepSeek replaces standard multi-head attention with MLA, which compresses key and value projections through a low-rank latent bottleneck. MLA drastically reduces the per-token key-value (KV) cache required during decoding, e.g., by over 90% in DeepSeek-V2, while sometimes improving benchmark accuracy (e.g., MMLU, GSM8K) due to enhanced positional encoding and regularization (DeepSeek-AI et al., 7 May 2024, DeepSeek-AI et al., 27 Dec 2024). A KV-compression sketch also follows the table below.
- Multi-Token Prediction (MTP): Training optimizes not only next-token prediction but also prediction of multiple future tokens using shallow auxiliary modules. MTP improves data efficiency and lowers validation perplexity, affording greater performance at a given compute budget (DeepSeek-AI et al., 27 Dec 2024); a simplified version of the objective is sketched after the table below.
- Auxiliary-Loss-Free Load Balancing: To avoid the degenerate effects of a conventional MoE auxiliary loss, DeepSeek-V3 uses a bias-based balancing strategy (dynamically adjusted per-expert selection biases), complemented by a sequence-wise balance loss with a negligibly small coefficient. This approach achieves near-uniform expert utilization across mini-batches (DeepSeek-AI et al., 27 Dec 2024).
- Group Relative Policy Optimization (GRPO): For RL-based post-training (notably in DeepSeek-R1), DeepSeek introduces GRPO, a PPO-family algorithm that forgoes a learned value network in favor of normalized, group-wise advantage computation over candidate completions. This allows stable, interpretable RL optimization with drastically reduced memory requirements, supporting full-scale RL at MoE scales (Wang et al., 14 Mar 2025).
| Model / Component | Core Innovation | Scale |
|---|---|---|
| DeepSeek-V3 | MLA, MoE, MTP, bias-based load balancing | 671B total (37B activated) |
| DeepSeek-R1 | RL pipeline (GRPO), chain-of-thought | 671B (37B activated), post-trained from V3-Base |
| DeepSeek-VL2 | Dynamic tiling, MLA, MoE vision-language backbone | 4.5B / 2.8B / 1.0B activated |
| DeepSeek-Prover-V2 | RL for theorem proving, subgoal decomposition | 671B and 7B |
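To make the routing and balancing mechanics concrete, the following is a minimal sketch in the spirit of the bias-corrected top-K routing described above; the names, shapes, sigmoid gating choice, and sign-based bias update are illustrative assumptions rather than the released implementation.

```python
import torch

def route_tokens(hidden, expert_centroids, expert_bias, k=8):
    """Illustrative top-K MoE routing with bias-corrected expert selection.

    hidden:           [num_tokens, d_model]   token representations
    expert_centroids: [num_experts, d_model]  per-expert affinity vectors
    expert_bias:      [num_experts]           load-balancing biases (selection only)
    """
    # Token-to-expert affinity scores.
    affinity = torch.sigmoid(hidden @ expert_centroids.T)         # [tokens, experts]
    # Biases shift which experts get *selected*, not the mixing weights.
    _, top_idx = torch.topk(affinity + expert_bias, k, dim=-1)    # [tokens, k]
    gate = torch.gather(affinity, -1, top_idx)
    gate = gate / gate.sum(dim=-1, keepdim=True)                  # normalized mixing weights
    return top_idx, gate

def update_bias(expert_bias, expert_load, step=1e-3):
    """Auxiliary-loss-free balancing: after each batch, nudge the selection bias of
    overloaded experts down and of underloaded experts up."""
    imbalance = expert_load.float() - expert_load.float().mean()
    return expert_bias - step * torch.sign(imbalance)
```

Each selected expert then applies its feedforward subnetwork to the routed tokens, and the outputs are combined with the `gate` weights; because balancing acts only through the selection biases, no gradient-interfering auxiliary loss is required.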
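The KV-cache reduction from MLA comes from caching one small latent vector per token and re-expanding it to per-head keys and values only when attention is computed. A schematic sketch follows; the dimensions, module names, and the omission of the decoupled rotary-position path are simplifying assumptions.

```python
import torch.nn as nn

class LatentKV(nn.Module):
    """Schematic MLA-style KV compression (decoupled RoPE path omitted)."""
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to values

    def compress(self, h):
        # Only this [seq, d_latent] tensor enters the decoding cache, instead of
        # [seq, 2 * n_heads * d_head] full keys and values (512 vs. 8192 floats here).
        return self.down(h)

    def expand(self, latent):
        # Recompute per-head keys/values on the fly during attention.
        return self.up_k(latent), self.up_v(latent)
```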
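The multi-token objective can be pictured as the usual next-token cross-entropy plus down-weighted losses for tokens further ahead, predicted through shallow auxiliary heads. The sketch below is a simplification (the paper's MTP modules are small transformer blocks chained after the trunk); the head structure, offsets, and weighting are assumptions.

```python
import torch.nn.functional as F

def mtp_loss(hidden, lm_head, aux_heads, targets, aux_weight=0.3):
    """Simplified multi-token prediction objective.

    hidden:    [batch, seq, d_model] final trunk hidden states
    lm_head:   shared projection to vocabulary logits
    aux_heads: shallow modules; aux_heads[i] helps predict the token i+2 steps ahead
    targets:   [batch, seq] token ids
    """
    # Standard next-token loss (predict position t+1 from position t).
    logits = lm_head(hidden[:, :-1])
    loss = F.cross_entropy(logits.flatten(0, 1), targets[:, 1:].flatten())
    # Auxiliary losses for tokens further in the future.
    for d, head in enumerate(aux_heads, start=2):
        aux_logits = lm_head(head(hidden[:, :-d]))
        loss = loss + aux_weight * F.cross_entropy(
            aux_logits.flatten(0, 1), targets[:, d:].flatten())
    return loss
```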
2. Training Regimes and Reinforcement Learning Paradigms
The DeepSeek-R1 family exemplifies multi-stage reinforcement-learning post-training pipelines:
- Pure RL (DeepSeek-R1-Zero): RL via GRPO starting from a reasoning-only system prompt (a minimal GRPO sketch follows this list). The model spontaneously develops multi-step chain-of-thought (CoT) reasoning, emergent self-reflection ("aha moments"), and exploratory decomposition purely from reward signals (final answer correctness, CoT formatting) (DeepSeek-AI et al., 22 Jan 2025, Marjanović et al., 2 Apr 2025).
- Alternating SFT and RL (DeepSeek-R1): Post-R1-Zero, a sequence of SFT on high-quality and rejection-sampled CoT examples (≈800K), reasoning-intensive RL (accuracy and language consistency reward), and all-scenarios RL (including helpfulness/harmlessness alignment) stabilizes output and improves accuracy, readability, and safety (DeepSeek-AI et al., 22 Jan 2025, Marjanović et al., 2 Apr 2025).
- Open Distillation Pipeline: R1 checkpoints are further used to distill smaller, dense models (e.g., Qwen2.5-32B, Llama3.1-70B) that retain a significant share of R1’s reasoning ability at far lower computational cost (DeepSeek-AI et al., 22 Jan 2025, Zhan et al., 1 Mar 2025).
- RL in Theorem Proving (DeepSeek-Prover-V2): Recursively decomposes Lean 4 theorems into subgoals using the V3 backbone, synthesizes both informal CoT and formal subproofs for curriculum SFT, then applies GRPO-based RL to maximize formally checked proof success (Ren et al., 30 Apr 2025); an illustrative subgoal decomposition appears in the Lean snippet after this list.
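The core of GRPO is replacing a learned value baseline with statistics of a group of completions sampled for the same prompt. A minimal sketch is given below; the reward definition, the use of per-completion (rather than per-token) log-probabilities, and the clipping constant are simplifying assumptions, and the KL penalty against a reference policy is omitted.

```python
import torch

def group_advantages(rewards):
    """Group-relative advantages: normalize each completion's scalar reward by the
    mean and standard deviation of its sampling group (no value network needed)."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate over a group of completions, with one
    group-normalized advantage per completion."""
    ratio = torch.exp(logp_new - logp_old)                        # [group_size]
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

For example, `group_advantages([1.0, 0.0, 0.0, 1.0])` assigns positive advantages to the two completions whose final answers were verified correct and negative advantages to the rest, which is the kind of signal used to reinforce chain-of-thought behavior in R1-Zero-style training.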
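The subgoal decomposition used for prover training can be pictured with a toy Lean 4 example, assuming Mathlib is available; this illustrates the `have`-based pattern of splitting a goal into named subgoals and is not output from DeepSeek-Prover-V2 itself.

```lean
import Mathlib

-- A composite goal is split into named subgoals (`have` steps); each subgoal can be
-- sketched informally, then closed by a smaller prover model or a known lemma.
theorem sum_sq_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h₁ : 0 ≤ a ^ 2 := sq_nonneg a      -- subgoal 1
  have h₂ : 0 ≤ b ^ 2 := sq_nonneg b      -- subgoal 2
  exact add_nonneg h₁ h₂                  -- combine the subproofs
```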
3. Vision-Language and Multimodal Extensions
DeepSeek’s multimodal platforms incorporate principles from their language backbones:
- DeepSeek-VL2: An MoE vision-language model employing dynamic tiling for high-resolution, variable-aspect image encoding (a tiling sketch follows this list). The language backbone is a modular MoE LLM with MLA, per-layer top-K expert routing (K = 6), and expert bias correction. The vision encoder is based on SigLIP-SO400M, with vision tokens adaptively arranged according to the tiling, supporting competitive visual grounding, OCR, table/chart/document understanding, and reasoning across model variants (4.5B, 2.8B, 1.0B activated) (Wu et al., 13 Dec 2024).
- Optical Context Compression (DeepSeek-OCR): Demonstrates “optical” mapping of long text contexts into images, compressing them with a DeepEncoder that combines windowed attention, convolutional downsampling, and global attention, and recovering the text sequence with a compact 3B MoE decoder. The pipeline achieves ≈97% OCR precision at ≤10× compression (600–700 text tokens → 64 vision tokens at 96.5% precision) and still decodes meaningfully at up to 20× compression (Wei et al., 21 Oct 2025).
- Hallucination Vulnerabilities: Embedding-manipulation attacks show that DeepSeek Janus models (1B, 7B) can be induced to hallucinate arbitrary target objects in images with high visual fidelity (SSIM > 0.88), even under physically imperceptible perturbations (Islam et al., 11 Feb 2025). Closed-form factual queries exacerbate vulnerability, and a LLaMA-3.1-based multi-prompt framework provides robust detection.
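A rough sketch of the dynamic-tiling idea for variable-aspect, high-resolution inputs is shown below; the tile budget and aspect-matching rule are illustrative assumptions, not the exact candidate grids used by DeepSeek-VL2.

```python
import math

def choose_tile_grid(width, height, max_tiles=9):
    """Pick a (cols, rows) grid whose aspect ratio best matches the input image
    without exceeding the tile budget; the image is then resized to fill the grid,
    split into local tiles, and paired with a global thumbnail view."""
    aspect = width / height
    best, best_err = (1, 1), float("inf")
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles + 1):
            if rows * cols > max_tiles:
                continue
            # Compare aspect ratios in log space so wide and tall errors are symmetric.
            err = abs(math.log(aspect) - math.log(cols / rows))
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

# A wide 1280x640 page maps to (2, 1), two tiles side by side;
# a tall 640x1920 page maps to (1, 3), three tiles stacked vertically.
print(choose_tile_grid(1280, 640), choose_tile_grid(640, 1920))
```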
4. Benchmark Performance and Practical Applications
DeepSeek models achieve top-tier performance on a broad spectrum of tasks, with particularly strong results in structured reasoning:
- Reasoning, Math, and Code: DeepSeek-R1 and DeepSeek-V3 consistently match or outperform leading open-source and even closed models on GSM8K, AIME, MATH500, HumanEval, and Codeforces, often approaching or exceeding 90% accuracy on MMLU suites, and 65–80% pass@1 on mathematics and coding leaderboards (DeepSeek-AI et al., 27 Dec 2024, DeepSeek-AI et al., 22 Jan 2025, Wang et al., 14 Mar 2025).
- Biomedical and Healthcare: DeepSeek variants, notably R1-distilled models, perform competitively in biomedical NLP for NER and classification (F1 > 0.95 on key datasets) and lead clinical reasoning benchmarks such as USMLE, ophthalmology MCQs, and longitudinal dental cases, with reported faithfulness and expert approval exceeding state-of-the-art proprietary LLMs (Ye et al., 2 Jun 2025, Zhang et al., 2 Sep 2025, Zhan et al., 1 Mar 2025, Xu et al., 25 Feb 2025).
- Vision-Language Understanding: DeepSeek-VL2 matches or exceeds state-of-the-art open-source performance on DocVQA (93.3%), ChartQA (86.0%), TextVQA (84.2%), and visual grounding (RefCOCOg 92.8%), with fewer activated parameters than many dense and MoE comparators (Wu et al., 13 Dec 2024). DeepSeek-OCR provides production-scale pipeline throughput (200K+ pages/day/A100) for LLM pretraining data generation (Wei et al., 21 Oct 2025).
- High-Performance Computing (HPC): DeepSeek-generated code for key HPC kernels is typically syntactically and algorithmically correct, but it lags GPT-4 in cache blocking, parallel efficiency, and invocation of optimized library routines (a toy blocking example follows this list). A plausible implication is that further domain-specific RL or corpus enrichment could close the gap on specialized, throughput-critical code tasks (Nader et al., 15 Mar 2025).
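For context, the "blocking" mentioned above refers to cache tiling of kernel loops so the working set fits in fast memory; a toy NumPy illustration of the pattern is below (it is not taken from the cited benchmark, and real HPC kernels would implement this in C/Fortran with OpenMP or vendor libraries).

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Cache-blocked matrix multiply: accumulate over block x block tiles so each
    tile stays resident in cache, the classic optimization referenced above."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, block):
        for k0 in range(0, k, block):
            for j0 in range(0, m, block):
                C[i0:i0 + block, j0:j0 + block] += (
                    A[i0:i0 + block, k0:k0 + block] @ B[k0:k0 + block, j0:j0 + block]
                )
    return C
```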
5. Safety, Alignment, and Model Robustness
Empirical safety audits identify substantial vulnerabilities:
- Safety Testing: Benchmarks such as CNSafe reveal that DeepSeek-R1 and DeepSeek-V3 have text attack success rates (ASR) of 22–31% in non-adversarial settings (highest categories: discrimination, rights infringement), rising to near-universal ASR (80–100%) in adversarial red-teaming (CNSafe_RT) (Ying et al., 19 Mar 2025). English-language output is consistently more vulnerable (+21.7% ASR) than Chinese, reflecting alignment gaps.
- Chain-of-Thought (CoT) Risks: R1’s explicit CoT outputs, while improving interpretability and supporting scientific “thoughtology,” expose an enlarged attack surface. Jailbreaking rates increase dramatically (from 30% to 72.5% ASR with transferability to competing LLMs). Recommendations include adversarial RL during alignment, output sanitization, and improved multilingual safety training (Marjanović et al., 2 Apr 2025, Ying et al., 19 Mar 2025).
- Multimodal and T2I Models: Text-to-image (Janus-Pro-7B) and VL models likewise exhibit high ASR in categories such as sexual content and illegal activity (>50% risk), and apparent refusals from current vision-multimodal models sometimes reflect limited semantic understanding rather than genuine content filtering.
6. Reasoning Taxonomy, Cognitive Analysis, and Limitations
DeepSeek-R1 models reasoning as a discrete chain with four archetypal stages: problem definition, initial decomposition (“blooming”), iterative reconstruction/rumination, and final answer commitment. An emergent “sweet-spot” phenomenon yields maximal accuracy at an intermediate chain-of-thought length, with performance degrading when reasoning runs too long (a rumination tendency), pointing to a non-trivial inference-time optimum (Marjanović et al., 2 Apr 2025).
Cognitive and cultural analyses demonstrate:
- Moral reasoning by DeepSeek-R1 is less focused on universal ethical principles (Defining Issues Test: R1=35/29 vs. GPT-4=55.7/49.4).
- Reasoning chains are longer in English, while Chinese outputs reflect greater in-group collectivism, demonstrating non-trivial language-culture interactions.
- For tasks such as garden-path sentence processing and physical simulation-by-ASCII, chain length correlates inversely with human comprehension, and visual “world models” remain fragile. Final outputs often diverge from step-wise reasoning, pointing to gaps in chain-to-answer faithfulness and spatial reasoning.
7. Future Directions, Open Questions, and Ecosystem Impact
Open research challenges include:
- Understanding and ablation of MLA’s positional encoding impact.
- Theoretical comparison of MoE bias-based vs. auxiliary-loss load balancing.
- Adaptive or lightweight multi-token prediction to minimize training overhead.
- Robustification of safety—particularly CoT reasoning—via red-team adversarial training, improved reward modeling, and explicit cross-lingual alignment.
- Procedural and neural enhancements for formal theorem proving (e.g., AlphaProof-style self-play, neural MCTS, and reward shaping for combinatorics and abstract domains) (Ren et al., 30 Apr 2025).
- Societal ramifications of cheap, open reasoning LLMs, including risks of widespread misuse, information-security harms, and cross-cultural value drift (Mercer et al., 4 Feb 2025).
A plausible implication is that DeepSeek’s open-source, efficient mixture-of-experts paradigm—especially in combination with RL-driven reasoning alignment—establishes a robust blueprint for scalable, accessible, and high-performance foundation models. However, safety and chain-of-thought controllability represent persistent and urgent challenges as global deployment accelerates.