DeepSeek LLM: Advanced Open-Source Models
- DeepSeek LLM is an open-source family of large language models leveraging innovative transformer architectures for structured problem-solving.
- It employs a unique compute budgeting metric using non-embedding FLOPs per token and adaptive scaling laws for efficient training.
- DeepSeek variants achieve robust performance across domains including code generation, mathematical reasoning, biomedical NLP, and CAD design.
DeepSeek LLM designates a family of open-source LLMs developed for advanced reasoning, structured problem-solving, and efficient scaling, with an emphasis on technical innovation in model architecture, training methodology, and practical deployment. The DeepSeek initiative covers a range of models (including DeepSeek LLM, DeepSeek-V3, DeepSeek-R1, and Chat variants) that target code generation, mathematical reasoning, biomedical NLP, public opinion simulation, and specialized verticals such as healthcare and engineering design. Notable for their transparency, cost-efficiency, and long-term perspective, DeepSeek models challenge incumbent proprietary systems with competitive performance on standard and domain-specific benchmarks, while also introducing significant techniques in efficient training and reasoning interpretability.
1. Model Architecture and Scaling Principles
The core DeepSeek architecture refines the transformer backbone, drawing from the LLaMA series but introducing several structural innovations. Chief among these is the replacement of traditional parameter-count-based scaling with a metric termed “non-embedding FLOPs per token” ($M$), which captures the per-token compute cost, accounting for attention overhead while excluding vocabulary (embedding and output) computation. This supports a more accurate compute-budgeting formalism, $C = M \cdot D$, where $C$ is the compute budget, $M$ is the non-embedding FLOPs per token, and $D$ is the number of dataset tokens.
Empirical scaling law analysis led to power-law fits for the optimal batch size and learning rate as functions of the compute budget, of the form $\eta_{\mathrm{opt}} = c_{\eta}\, C^{-\alpha}$ and $B_{\mathrm{opt}} = c_{B}\, C^{\beta}$ with fitted positive exponents, so the optimal learning rate decreases and the optimal batch size increases as compute grows.
These laws enabled effective training hyperparameter selection for models at the 7B and 67B parameter scale, supporting principled extrapolation from small-scale runs to far larger compute regimes.
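Putting the compute-budget formalism and the hyperparameter fits together, the following is a minimal Python sketch of how such scaling curves can drive hyperparameter selection. The per-token FLOPs estimate and the power-law coefficients below are illustrative placeholders, not DeepSeek's published fits.

```python
# Minimal sketch of compute-budget-driven hyperparameter selection.
# All coefficients are illustrative placeholders; substitute the published fits.

def non_embedding_flops_per_token(n_layers: int, d_model: int, seq_len: int) -> float:
    """Rough per-token compute M: dense matmul cost plus attention overhead,
    excluding vocabulary (embedding/output) computation. An approximation,
    not DeepSeek's exact published definition."""
    return 72 * n_layers * d_model**2 + 12 * n_layers * d_model * seq_len

def optimal_hyperparams(compute_budget: float,
                        lr_coeff: float = 0.3, lr_exp: float = -0.125,
                        bs_coeff: float = 0.3, bs_exp: float = 0.33):
    """Power-law fits eta_opt = c_eta * C**(-alpha), B_opt = c_B * C**beta.
    The coefficients and exponents here are placeholders for the fitted values."""
    learning_rate = lr_coeff * compute_budget**lr_exp
    batch_size_tokens = bs_coeff * compute_budget**bs_exp
    return learning_rate, batch_size_tokens

# Example: given M and a token budget D, the compute budget is C = M * D.
M = non_embedding_flops_per_token(n_layers=30, d_model=4096, seq_len=4096)
D = 2e12  # 2 trillion training tokens
C = M * D
lr, bs = optimal_hyperparams(C)
print(f"C={C:.3e}, lr={lr:.2e}, batch_size~{bs:.0f} tokens")
```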
Further architectural advances include:
- Pre-Norm with RMSNorm and SwiGLU activations for improved optimization.
- Rotary Embeddings for positional encoding.
- Grouped-Query Attention (GQA) in larger models for latency and scalability improvements (see the sketch after this list).
- Multi-Head Latent Attention (MLA) and Mixture of Experts (MoE) in later variants, enabling compressed KV caching, memory-efficient context extension, and sparse expert routing.
- Decoupled Rotary Position Embedding (RoPE), which carries positional information in dedicated query/key components so that rotary encoding remains compatible with compressed keys and values, reducing inference compute and KV cache size.
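As referenced in the GQA bullet above, the following NumPy sketch (with hypothetical head counts and dimensions, not DeepSeek's actual configuration) shows how several query heads share one key/value head, shrinking the KV cache by the grouping factor.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Minimal grouped-query attention sketch (illustrative only).

    q: (n_q_heads, seq, d_head); k, v: (n_kv_heads, seq, d_head).
    Each group of n_q_heads // n_kv_heads query heads attends against the same
    key/value head, so the KV cache stores n_kv_heads heads instead of n_q_heads.
    """
    n_q_heads, seq, d_head = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                                   # shared KV head for this query head
        scores = q[h] @ k[kv].T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[h] = probs @ v[kv]
    return out

# Example: 32 query heads sharing 8 KV heads (hypothetical sizes).
q = np.random.randn(32, 16, 64)
k = np.random.randn(8, 16, 64)
v = np.random.randn(8, 16, 64)
print(grouped_query_attention(q, k, v, n_kv_heads=8).shape)  # (32, 16, 64)
```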
2. Dataset Construction and Tokenization
DeepSeek models are pretrained on a bilingual corpus (Chinese and English) built from 2 trillion tokens, with continual expansion. The data pipeline consists of three principal phases:
- Aggressive cross-dump deduplication across 91 web-crawl dumps, removing up to 4x more duplicates than single-dump processing (see the sketch after this list).
- Rigorous linguistic and semantic filtering to ensure high textual quality.
- Remixing to rebalance domains and mitigate over- or under-representation.
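A minimal sketch of the cross-dump deduplication idea referenced above, using exact content hashing; real pipelines typically add near-duplicate detection (e.g., MinHash) on top, which is omitted here.

```python
import hashlib

def normalize(text: str) -> str:
    """Light normalization before hashing (an illustrative choice, not DeepSeek's exact recipe)."""
    return " ".join(text.lower().split())

def dedupe_across_dumps(dumps):
    """Exact deduplication across crawl dumps by content hash.

    `dumps` is an iterable of iterables of documents (one inner iterable per dump).
    Hashing across all dumps at once removes duplicates that single-dump
    processing would miss.
    """
    seen = set()
    for dump in dumps:
        for doc in dump:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield doc

dumps = [["the cat sat", "hello world"], ["Hello   World", "a new document"]]
print(list(dedupe_across_dumps(dumps)))  # the normalized duplicate appears only once
```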
The tokenizer is a byte-level Byte Pair Encoding (BPE), akin to GPT-2, with specific pre-tokenization strategies to manage non-Latin scripts and control undesired merges. The vocabulary is sized at 102,400, including special tokens for control, line separators, and language markers.
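For illustration, the sketch below trains a byte-level BPE tokenizer with the Hugging Face `tokenizers` library; the library choice, the tiny corpus, and the special-token names are assumptions made for the example, not DeepSeek's actual implementation.

```python
# Requires the Hugging Face `tokenizers` package (pip install tokenizers).
from tokenizers import ByteLevelBPETokenizer

# Tiny bilingual toy corpus standing in for the real pretraining data.
corpus = [
    "DeepSeek LLM is pretrained on a bilingual Chinese-English corpus.",
    "深度求索的模型在中英双语语料上进行预训练。",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=102_400,      # DeepSeek's reported vocabulary size
    min_frequency=1,
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>"],  # placeholder names
)

encoding = tokenizer.encode("DeepSeek 的分词器是字节级 BPE。")
print(encoding.tokens[:10])
```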
Dataset quality directly informs the scaling allocation: as the scaling law analyses show, higher-quality data shifts the compute-optimal allocation toward model size relative to data quantity at a fixed budget.
3. Training Protocols and Iterative Alignment
DeepSeek models employ a multistage pipeline that integrates pretraining, supervised fine-tuning (SFT), and preference-based alignment:
- Pretraining follows the optimized scaling-hyperparameter curves; in the subsequent SFT stage, the 7B model is fine-tuned for 4 epochs, while the 67B model uses only 2 epochs to avoid overfitting.
- Instruction Fine-Tuning leverages 1.5 million instructional prompts (1.2M “helpful,” 300K “safety” examples).
- Direct Preference Optimization (DPO): a single epoch of preference alignment with a batch size of 512 encourages outputs aligned with human preferences for helpfulness and harmlessness (a minimal sketch follows this list).
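As referenced above, here is a minimal PyTorch-style sketch of the standard DPO objective on precomputed sequence log-probabilities; the beta value and the random inputs are illustrative, not DeepSeek's reported settings.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Direct Preference Optimization loss on precomputed sequence log-probs.

    Each argument is a tensor of shape (batch,) holding summed log-probabilities
    of the chosen/rejected responses under the policy or frozen reference model.
    beta is an illustrative temperature.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy example with a batch of 512, matching the batch size mentioned above.
b = 512
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
print(float(loss))
```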
Distinctively, the DeepSeek-R1 variant adopts reinforcement learning (RL) with Group Relative Policy Optimization (GRPO), in which the advantage of each sampled output is its group-normalized reward, $A_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}$, computed over a group of $G$ outputs sampled for the same prompt, so no separate value network is required. Cold-start SFT data (“long CoT” exemplars) followed by RL and additional SFT rounds enables robust, readable multi-step reasoning.
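The group-relative advantage computation at the heart of GRPO can be sketched in a few lines; the full objective additionally uses clipped probability ratios and a KL penalty, which are omitted here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO-style training.

    rewards: (num_groups, group_size) scalar rewards for G sampled outputs per prompt.
    Each output's advantage is its reward normalized by the mean and standard
    deviation of its own group, avoiding the need for a learned critic.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled outputs each (hypothetical reward values).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.5]])
print(grpo_advantages(rewards))
```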
The combination of SFT + DPO produces the DeepSeek Chat models, exhibiting reduced repetition and improved conversational zero-shot robustness.
4. Evaluation: Reasoning, Benchmarks, and Specialization
Comprehensive evaluations on DeepSeek LLMs cover code generation (HumanEval, MBPP), mathematical reasoning (GSM8K, MATH), general knowledge (MMLU, BBH, C-Eval, CMMLU), and open-ended dialogue (AlignBench, MT-Bench). DeepSeek-67B outperforms LLaMA-2 70B across code, math, and reasoning. On open-ended evaluations, DeepSeek 67B Chat approaches or surpasses GPT-3.5.
In domain-specific applications:
- Biomedical NLP: DeepSeek models match or exceed state-of-the-art on NER and text classification; tradeoffs in precision and recall persist for event/relation extraction.
- Medical Diagnostics and Clinical Reasoning: DeepSeek-R1 demonstrates high bilingual MCQ accuracy (0.862 Chinese, 0.808 English) and excels on complex, reasoning-intensive clinical cases, although deployment in real-world workflows remains challenging.
- High-Performance Computing: DeepSeek generates functional HPC code across multiple programming languages but lags behind GPT-4 in execution efficiency and scaling.
- Movie Review Generation: Outputs are more balanced and lifelike than those of GPT-4o and Gemini-2.0, with closer alignment to human review sentiment and style.
Public opinion simulation and cross-cultural evaluation reveal nuances in demographic and ideological modeling; DeepSeek-V3 performs comparatively well in simulating US opinions on abortion (accuracy ≈ 0.53) but struggles with Chinese perspectives on capitalism and with underrepresented groups.
5. Advanced Reasoning Dynamics and Safety Implications
DeepSeek-R1 explicitly exposes its internal reasoning chains, delimited by <think> ... </think> tags, reflecting its chain-of-thought training. The reasoning process comprises problem definition, “blooming,” iterative re-examination (rumination), and decision/answer production. Studies show:
- There is an optimal chain-of-thought length; excessive inference (“overthinking”) degrades performance.
- The model’s multi-step reasoning is beneficial up to a “sweet spot,” beyond which rumination reduces accuracy and computational efficiency.
- Handling of long contexts is generally strong, but model performance deteriorates with extended or ambiguous input, occasionally leading to incoherent or language-mixed output.
- Reasoning structure mirrors certain human cognitive processes (e.g., reanalysis of “garden-path” sentences), yet the tendency toward repetitive or overly lengthy justification is a divergence from efficient meta-cognition.
A significant concern is increased safety vulnerability: DeepSeek-R1’s detailed stepwise reasoning can facilitate harmful content or “jailbreak” tactics, posing challenges for alignment and robustness in safety-sensitive deployment.
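Because the reasoning trace is exposed in-band, downstream systems often need to separate it from the final answer, for example to apply safety filtering before display. A minimal sketch, assuming the <think> delimiters described above:

```python
import re

def split_reasoning(response: str):
    """Separate the exposed chain-of-thought from the final answer.

    Assumes <think> ... </think> delimiters; a deployment that must not surface
    raw reasoning can inspect or drop the extracted trace before returning the answer.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return reasoning, answer

sample = "<think>The user asks for 2+2. Add the numbers.</think>The answer is 4."
print(split_reasoning(sample))  # ('The user asks for 2+2. Add the numbers.', 'The answer is 4.')
```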
6. Cultural, Social, and Value Alignment
Investigations into DeepSeek’s value alignment reveal prominent cultural specificity. When assessed with the Schwartz values framework, DeepSeek consistently rates “self-enhancement values” (e.g., power, achievement) lower than Western-developed models, aligning with collectivist cultural norms. Empirical bias audits indicate:
- DeepSeek-R1 exhibits higher rates of Chinese-state propaganda and anti-US sentiment, notably in Chinese (Simplified and Traditional) but not English, with an “invisible loudspeaker” effect where bias is subtly introduced, even in apolitical content domains.
- Public opinion modeling analyses and Bayesian ordinal regression of cultural values confirm that LLMs inherit biases from their training regimes, challenging the notion of a “universal” ethical framework in AI.
7. Industrial, Practical, and Security Applications
DeepSeek’s hybrid architecture (especially in its MoE variants) is designed for cost-efficient inference and adaptability, enabling deployment in CPU/GPU and trusted execution environments (Intel TDX). For small model scales (e.g., DeepSeek-R1-1.5B), TDX-based confidential computing can even outperform standard CPU-only deployment; however, lack of GPU enclave support remains a limiting factor as model size scales.
In engineering, Seek-CAD leverages DeepSeek-R1 for training-free, retrieval-augmented parametric 3D CAD generation, iteratively refining outputs with visual and chain-of-thought feedback. The SSR (Sketch, Sketch-based feature, Refinement) paradigm and a novel 40,000-sample dataset underpin this application, demonstrating effective and efficient local CAD pipeline generation with high-fidelity geometric outcomes.
8. Open Questions and Future Directions
The DeepSeek research program is explicitly geared toward longtermism—a commitment to continuous improvement in scaling, dataset quality, and alignment. Forthcoming directions include:
- Dataset enrichment for Chinese and underrepresented language coverage.
- Enhancement of mathematical and code reasoning, including new MoE architectures for resource-efficient scaling.
- Further integration of RL (beyond DPO), multimodal and process supervision (including multimodal vision-language pretraining and CAD refinement), and advanced bias mitigation.
- Addressing the challenge of adversarial robustness in the face of detailed chain-of-thought exposure.
- Broadening governance, validation, and collaborative frameworks to ensure transparent, safe, and equitable deployment of open LLMs.
A major open question concerns the reconciliation of model transparency, internal reasoning complexity, and real-world safety/alignment. As the DeepSeek initiative shares pretrained weights, training protocols, scaling analyses, and benchmarks, it continues to serve as a reference point for reproducible, cost-accessible, and interpretable open-source LLM development across technical and applied scientific domains.