DeepSeek-671B Models
- DeepSeek-671B is a family of 671B-parameter foundation models with dense (DeepSeek-R1) and MoE (DeepSeek-V3) variants designed for advanced language tasks.
- Innovations such as the chain-of-thought head, Multi-Head Latent Attention, and efficient quantization (e.g., DQ3_K_M) deliver high performance with reduced memory demands.
- Comprehensive evaluations in safety, multilingual benchmarks, HPC code generation, and formal theorem proving highlight the model’s practical impact and deployment challenges.
DeepSeek-671B refers to a family of 671-billion-parameter foundation models originating from the DeepSeek project. With both dense (DeepSeek-R1) and Mixture-of-Experts (DeepSeek-V3) variants, DeepSeek-671B targets state-of-the-art language understanding, multilingual reasoning, code generation, and formal theorem proving at open-source scale. The models leverage advanced architectural innovations in memory- and compute-efficient design, and have been the focus of substantial empirical work on safety, quantized deployment, chain-of-thought bootstrapping, mathematical formalization, and application-level benchmarking.
1. Model Families, Architectures, and Innovations
Two principal architectural instantiations comprise the DeepSeek-671B family:
- DeepSeek-R1-671B: A dense transformer LM with 671B parameters and an exposed chain-of-thought (CoT) head. Pretraining is followed by reinforcement learning from human feedback (RLHF).
- DeepSeek-V3-671B: A Mixture-of-Experts (MoE) transformer designated DeepSeekMoE, also with 671B total parameters, 256 routed experts per layer, and sparse routing that activates 37B parameters per token. It is enhanced with Multi-Head Latent Attention (MLA) to compress key-value caches.
DeepSeek-R1-671B Features
- Dense transformer, pretraining + RLHF.
- CoT head exposes stepwise reasoning in output, increasing interpretability and adversarial attack surface (Ying et al., 19 Mar 2025).
DeepSeek-V3-671B and MLA/MoE Extensions
- MLA compresses KV-caches by projecting keys and values into a small shared latent space, from which per-head keys and values are reconstructed at attention time (see the sketch after this list).
- DeepSeekMoE routing employs a learned expert-selection and load-balancing mechanism, activating the Top-K experts per token without token dropping (a routing sketch follows Table 1).
- 61 transformer layers, hidden dimension 7168, 128 attention heads, MoE with 256 experts/layer, Top-8 active (DeepSeek-AI et al., 27 Dec 2024).
- Multi-Token Prediction (MTP) extends the next-token LM objective to look-ahead prediction, with auxiliary modules contributing a multi-step generation loss.
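The cache-size effect of MLA can be illustrated with a minimal NumPy sketch (dimensions and projection names are illustrative assumptions, not the released DeepSeek-V3 configuration): only a small per-token latent is cached, and per-head keys/values are reconstructed when attention is computed.

```python
import numpy as np

# Minimal sketch of Multi-Head Latent Attention (MLA) KV compression.
# Dimensions are illustrative, not the released DeepSeek-V3 configuration.
d_model, n_heads, d_head, d_latent = 1024, 8, 128, 64
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.02            # joint KV down-projection
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # per-head key up-projection
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # per-head value up-projection

def cache_step(h):
    """Store only the compressed latent for this token (what the KV-cache holds)."""
    return h @ W_down                                  # shape: (d_latent,)

def expand(latents):
    """Reconstruct per-head keys/values from cached latents at attention time."""
    K = (latents @ W_up_k).reshape(-1, n_heads, d_head)
    V = (latents @ W_up_v).reshape(-1, n_heads, d_head)
    return K, V

tokens = rng.standard_normal((16, d_model))            # 16 cached positions
latents = np.stack([cache_step(h) for h in tokens])    # cache: 16 x d_latent
K, V = expand(latents)
print(latents.size, "cached values vs.", K.size + V.size, "reconstructed")
# The cache holds d_latent floats per token instead of 2 * n_heads * d_head.
```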
Table 1: Key Architecture Parameters
| Variant | Param Count | Routing | Unique Features |
|---|---|---|---|
| DeepSeek-R1 | 671B | Dense, CoT | Chain-of-thought head, RLHF alignment |
| DeepSeek-V3 | 671B | MoE (37B act.) | MLA, 256 experts, multi-token prediction |
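The routing column of Table 1 corresponds to a Top-K gating step per token. The simplified sketch below (softmax-normalized gates over 256 experts with Top-8 selection; the production router's affinity scoring and auxiliary-loss-free load balancing are not reproduced) shows the dispatch logic without token dropping.

```python
import numpy as np

# Simplified Top-K MoE routing sketch (illustrative sizes; not the released router).
n_experts, top_k, d_model = 256, 8, 1024
rng = np.random.default_rng(0)
W_gate = rng.standard_normal((d_model, n_experts)) * 0.02

def route(x):
    """Return the Top-K expert indices and normalized gate weights for one token.

    No tokens are dropped: every token is always dispatched to exactly top_k experts."""
    scores = x @ W_gate                            # affinity of this token to each expert
    top = np.argsort(scores)[-top_k:]              # indices of the top_k experts
    gates = np.exp(scores[top] - scores[top].max())
    gates /= gates.sum()                           # normalize weights over selected experts
    return top, gates

token = rng.standard_normal(d_model)
experts, gates = route(token)
print(experts, gates.round(3))                     # 8 expert ids and their mixture weights
```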
2. Training Pipeline, Data, and Compute
Pretraining
- DeepSeek-V3-671B was pretrained on 14.8T tokens comprising web crawl, code, math, and multilingual data with heavy English/Chinese skew.
- Uses a byte-level BPE tokenizer with a 128,000-token vocabulary.
- Fill-in-the-Middle (FIM) training with Prefix-Suffix-Middle (PSM) formatting is applied at a rate of 0.1 (a data-formatting sketch follows this list).
- Models are trained in FP8 mixed precision on a cluster of 2048 NVIDIA H800 GPUs, exploiting data parallelism (DP), pipeline parallelism (PP), and expert parallelism (EP) across nodes.
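A minimal sketch of the FIM/PSM document formatting mentioned above, under the assumption of generic sentinel strings (the released tokenizer's sentinel tokens and sampling details may differ):

```python
import random

# Sketch of Prefix-Suffix-Middle (PSM) formatting for Fill-in-the-Middle training.
# Sentinel strings are placeholders; the released tokenizer's sentinels may differ.
FIM_RATE = 0.1  # fraction of documents reformatted for FIM

def maybe_fim(doc: str, rng: random.Random) -> str:
    if rng.random() >= FIM_RATE:
        return doc                                   # most documents: plain next-token data
    i, j = sorted(rng.sample(range(len(doc)), 2))    # split document into prefix/middle/suffix
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # PSM ordering: the model sees prefix and suffix, then learns to generate the middle.
    return f"<|fim_begin|>{prefix}<|fim_hole|>{suffix}<|fim_end|>{middle}"

rng = random.Random(0)
docs = ["def add(a, b):\n    return a + b\n"] * 20
formatted = [maybe_fim(d, rng) for d in docs]
print(sum("<|fim_begin|>" in d for d in formatted), "of", len(docs), "documents reformatted")
```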
Training Efficiency
- Total compute: 2.664M H800 GPU-hours for pretraining 14.8T tokens, plus context-extension and SFT/RL steps.
- MLA and DeepSeekMoE, combined with FP8 and tight scheduling (DualPipe), deliver significant memory and compute savings.
- Loss curves demonstrate <0.25% deviation in relative loss between FP8 and BF16 baselines.
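The <0.25% figure can be read as the maximum relative deviation between the two loss curves logged at matching steps; a trivial check of that metric on synthetic data, purely for illustration:

```python
import numpy as np

def max_relative_loss_deviation(loss_fp8, loss_bf16):
    """Maximum relative deviation between two training-loss curves logged at the same steps."""
    loss_fp8, loss_bf16 = np.asarray(loss_fp8), np.asarray(loss_bf16)
    return float(np.max(np.abs(loss_fp8 - loss_bf16) / loss_bf16))

# Synthetic illustration: a 0.2% perturbation yields a 0.002 (0.2%) deviation.
steps = np.linspace(1, 100, 50)
bf16 = 10.0 / np.sqrt(steps)
fp8 = bf16 * 1.002
print(f"{max_relative_loss_deviation(fp8, bf16):.4f}")   # ~0.0020
```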
3. Quantization and Deployability
Deploying the native FP8 variants of DeepSeek-671B requires over 770 GB of device memory (including KV-cache), exceeding the capacity of typical high-end 8-GPU servers. Post-training quantization (PTQ) remedies this.
Quantization Schemes (Zhao et al., 5 May 2025)
- Q4_K_M (4-bit uniform): Reduces memory footprint to 568 GB (<1% accuracy drop).
- DQ3_K_M (Dynamic 3-bit method): Assigns 3/4/6 bits per block based on block/importance profiling. Achieves 469 GB total, 59 GB/GPU, and <0.5% accuracy drop, uniquely enabling deployment on 64 GB/NPU hardware such as Huawei 910B.
- Q3_K_M (3-bit uniform): Incurs a 1.7% average drop; not recommended for V3 reasoning tasks (≈8–9% accuracy loss on reasoning benchmarks).
- UD-Q2_K_XL (2-bit): Viable for specific deployment targets but unstable, with high variance on QA tasks.
Table 2: Memory and Accuracy Comparison
| Variant | Total Memory | Drop vs. FP8 |
|---|---|---|
| FP8 | 770 GB | 0% |
| Q4_K_M | 568 GB | -0.68% |
| DQ3_K_M | 469 GB | -0.34% |
| Q3_K_M | 487 GB | -1.72% |
DQ3_K_M’s importance profiling assigns 6 bits to the ffn_down_exps tensors in the first two layers, 4 bits to every fourth layer thereafter, and 3 bits otherwise (see the sketch below).
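One way to express that rule in code, assuming 0-based layer indexing and interpreting "every fourth layer thereafter" literally (both are assumptions on top of the description above, not the released quantization config):

```python
# Sketch of the DQ3_K_M bit-width rule described above for ffn_down_exps tensors.
# Layer indexing (0-based) is an assumption; other tensors keep their own schemes.
def ffn_down_exps_bits(layer_idx: int) -> int:
    if layer_idx < 2:
        return 6            # first two layers: most sensitive, highest precision
    if (layer_idx - 2) % 4 == 0:
        return 4            # every fourth layer thereafter
    return 3                # all remaining layers

assignment = {i: ffn_down_exps_bits(i) for i in range(61)}   # DeepSeek-V3 has 61 layers
print(sum(b == 6 for b in assignment.values()), "layers at 6 bits,",
      sum(b == 4 for b in assignment.values()), "at 4 bits,",
      sum(b == 3 for b in assignment.values()), "at 3 bits")
```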
4. Safety Evaluation and Mitigation
DeepSeek-671B models exhibit advanced reasoning but substantial safety vulnerabilities prior to mitigation.
- Evaluation Framework: CNSafe (Chinese-English, 3100 clean + 1000 adversarial prompts); LLM-as-Judge (GPT-4o, Qwen2.5-72B-Instruct). ASR (Attack Success Rate) is the central metric (a computation sketch follows Table 3).
- Baseline Findings:
- DeepSeek-R1: ASR = 100% on Cisco’s 50 harmful English prompts (Zhang et al., 18 Mar 2025).
- On CNSafe clean prompts, DeepSeek-R1 returns higher ASR than DeepSeek-V3, especially on English inputs (21.7 percentage points higher on average); CoT exposure makes R1 significantly more vulnerable.
- On red-teaming, DeepSeek-V3 yields 95–100% ASR; R1 is at 80–95% depending on language and risk category (Ying et al., 19 Mar 2025).
Table 3: CNSafe ASR (%) for Clean Prompts
| Category | DeepSeek-V3 (ZH) | DeepSeek-R1 (EN) |
|---|---|---|
| Core Socialist Values Violation | 4.5 | 59.5 |
| Discriminatory Content | 14.1 | 54.3 |
| Commercial Misconduct | 12.4 | 69.0 |
| Rights Infringement | 6.1 | 66.1 |
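As referenced above, ASR is simply the fraction of prompts for which the judge labels the model's response a successful attack. A minimal harness, with the model and judge calls stubbed out as placeholders (the real pipeline uses GPT-4o or Qwen2.5-72B-Instruct as the judge):

```python
from typing import Callable

def attack_success_rate(prompts, generate: Callable[[str], str],
                        judge_is_harmful: Callable[[str, str], bool]) -> float:
    """ASR = fraction of prompts whose responses the judge labels harmful/compliant.

    `generate` queries the model under test; `judge_is_harmful` wraps an LLM-as-judge
    call (e.g. GPT-4o or Qwen2.5-72B-Instruct); both are placeholders here."""
    hits = sum(judge_is_harmful(p, generate(p)) for p in prompts)
    return hits / len(prompts)

# Toy illustration with stubbed model and judge.
prompts = ["benign question", "adversarial jailbreak template"]
stub_generate = lambda p: "I cannot help with that." if "benign" in p else "Here is how..."
stub_judge = lambda p, r: r.startswith("Here is how")
print(f"ASR = {attack_success_rate(prompts, stub_generate, stub_judge):.0%}")   # 50%
```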
- Mitigation Pipeline (Zhang et al., 18 Mar 2025):
- Supervised fine-tuning (SFT) on ≈50,000 safety-critical instructions, augmented with CoT reasoning and adversarial jailbreak templates.
- Loss function: a weighted sum of the standard cross-entropy loss and a token-level penalty on unsafe predictions (a schematic sketch follows this list).
- Result: Harm rate (HR) reduced to <2%, RR-1 (refusal) raised to ~67%, no degradation in reasoning benchmarks.
- Distillation: Improving efficiency via distillation degrades safety by a measurable margin.
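A schematic of the mitigation loss referenced in the list above, assuming a binary per-token unsafe label and an auxiliary unsafe-prediction head; the weighting coefficient and the exact penalty form are assumptions, since the source specifies only a weighted sum:

```python
import torch
import torch.nn.functional as F

def safety_sft_loss(logits, targets, unsafe_token_mask, unsafe_logit, lam=0.5):
    """Schematic weighted loss: LM cross-entropy plus a token-level penalty that
    pushes down the probability of emitting tokens flagged as unsafe.
    The penalty form and `lam` are assumptions, not the paper's exact recipe.

    logits:            (batch, seq, vocab) LM logits
    targets:           (batch, seq) next-token ids
    unsafe_token_mask: (batch, seq) 1.0 where the reference token is flagged unsafe
    unsafe_logit:      (batch, seq) per-token unsafe-prediction score from an aux head
    """
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    # Token-level unsafe-prediction penalty: binary cross-entropy of the aux head
    # against the unsafe labels, averaged over tokens.
    penalty = F.binary_cross_entropy_with_logits(unsafe_logit, unsafe_token_mask)
    return ce + lam * penalty

# Shape check with random tensors.
B, T, V = 2, 8, 32
loss = safety_sft_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)),
                       torch.randint(0, 2, (B, T)).float(), torch.randn(B, T))
print(loss.item())
```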
5. Mathematical, Code, and Theorem-Proving Applications
Chain-of-Thought Generation and SFT (Yu et al., 16 Apr 2025)
- DeepSeek-R1 (671B) acts as a chain-of-thought “teacher” for SFT of smaller derivatives:
- Questions are first graded for difficulty against the target (student) model, then CoT traces are generated by DeepSeek-R1 (see the pipeline sketch after this list).
- Empirically, using 2k high-quality CoT trace pairs for math or code suffices for a 32B student model to surpass DeepSeek-Distill-32B in the relevant task.
- Reward modeling (PRM-Grader) outperforms answer-only grading in constructing adaptive curricula.
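A sketch of that difficulty-graded teacher-student pipeline (function names are placeholders for model calls, and the PRM-Grader is approximated here by a simple pass-rate threshold):

```python
from typing import Callable, List, Tuple

def build_cot_sft_set(questions: List[str],
                      student_solves: Callable[[str], bool],
                      teacher_cot: Callable[[str], str],
                      n_samples: int = 4,
                      max_pairs: int = 2000) -> List[Tuple[str, str]]:
    """Select questions the student finds hard, then collect teacher CoT traces.

    `student_solves` samples the student once and checks the final answer;
    `teacher_cot` queries DeepSeek-R1 for a full reasoning trace. Both are
    placeholders for model calls. Roughly 2k high-quality pairs sufficed in the paper."""
    sft_pairs = []
    for q in questions:
        pass_rate = sum(student_solves(q) for _ in range(n_samples)) / n_samples
        if pass_rate < 0.5:                        # difficulty grade: student struggles
            sft_pairs.append((q, teacher_cot(q)))
        if len(sft_pairs) >= max_pairs:
            break
    return sft_pairs

# Toy usage with stub model calls.
pairs = build_cot_sft_set(["easy q", "hard q"],
                          student_solves=lambda q: q.startswith("easy"),
                          teacher_cot=lambda q: f"step-by-step reasoning for: {q}")
print(pairs)   # only the question the student fails on gets a teacher trace
```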
Formal Mathematical Reasoning (Ren et al., 30 Apr 2025)
- DeepSeek-Prover-V2-671B, derived from DeepSeek-V3, applies RL (GRPO) and a cold-start data pipeline using recursive subgoal decomposition and proof synthesis.
- Achieves state-of-the-art Lean 4 MiniF2F test pass rate (88.9% CoT, 78.3% non-CoT) and solves 6/15 AIME 2024–2025 problems formally (close to DeepSeek-V3’s 8/15 informal solution rate).
- Hierarchical subgoal alignment and RL with a consistency reward enforce coverage of the logical steps (a toy Lean 4 illustration follows this list).
- Remaining gap to SOTA on open mathematics (e.g. PutnamBench, IMO-level) is substantial but narrowing.
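For orientation, a toy Lean 4 theorem (not drawn from MiniF2F or the paper) in the style the prover targets; the intermediate `have` steps play the role of the decomposed subgoals that are synthesized and then chained:

```lean
-- Toy Lean 4 theorem, illustrative only: the `have` steps mimic the
-- subgoal-decomposition style described above.
theorem toy_sum_comm (a b c : Nat) : (a + b) + c = c + (b + a) := by
  have h1 : a + b = b + a := Nat.add_comm a b
  have h2 : (b + a) + c = c + (b + a) := Nat.add_comm (b + a) c
  rw [h1]
  exact h2
```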
6. High-Performance Computing (HPC) Code Generation
DeepSeek-671B produces functionally correct code for dense kernels (Conjugate Gradient, parallel heat equation, matmul, DGEMM, STREAM Triad) in C++, Fortran, Julia, and Python (Nader et al., 15 Mar 2025). However, relative to GPT-4:
- HPC code from DeepSeek-671B lacks loop tiling, vectorization, cache blocking, and optimized API usage (see the tiling sketch after this list).
- GPT-4’s completions typically achieve 3×–50× higher performance (especially in memory-bound and DGEMM kernels).
- Concrete recommendations: prompt for explicit pragmas/blocking, fine-tune on HPC codebases, post-process code to insert missing compiler hints.
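To make the tiling point concrete, a schematic cache-blocked matrix multiply in NumPy notation (in practice the blocking would be written in C++/Fortran with vendor BLAS or compiler pragmas, as the recommendations suggest):

```python
import numpy as np

def matmul_blocked(A: np.ndarray, B: np.ndarray, block: int = 64) -> np.ndarray:
    """Schematic loop-tiled (cache-blocked) matrix multiply.

    Each (i0, k0, j0) iteration touches only block-by-block tiles of A, B, and C,
    which is the cache-blocking structure the generated HPC kernels were missing."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, block):
        for k0 in range(0, k, block):
            for j0 in range(0, m, block):
                C[i0:i0+block, j0:j0+block] += (
                    A[i0:i0+block, k0:k0+block] @ B[k0:k0+block, j0:j0+block]
                )
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(matmul_blocked(A, B), A @ B)
```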
7. General-Capability and Multilingual Benchmarks
DeepSeek-V3-671B delivers competitive performance:
- Base model (activated parameters 37B): MMLU (5-shot EM) 87.1%, BBH 87.5%, GSM8K 89.3%, MATH 61.6%, HumanEval 65.2% (DeepSeek-AI et al., 27 Dec 2024).
- Chat model: MMLU (0-shot) 88.5%, DROP 91.6%, AIME 2024 39.2%, MATH-500 90.2%.
- Resource efficiency: the full run (pretraining, context extension, and post-training) completed in 2.788M H800 GPU-hours without major loss spikes.
The model matches or exceeds the largest open-weight alternatives (Qwen2.5-72B, LLaMA-3.1-405B) on benchmark tasks and approaches closed models in some domains.
8. Discussion, Limitations, and Future Directions
- Safety remains unsolved: Both architectural and alignment advances (e.g., MoE scaling, post-SFT reward modeling) are insufficient alone. Jailbreaking and CoT attacks remain highly effective and expose the interpretability–vulnerability trade-off.
- Cross-lingual brittleness: English prompts result in significantly higher unsafe output rates than Chinese—a persistent challenge for bilingual alignment.
- Deployment: Quantization (Q4_K_M, DQ3_K_M) is mature for large-scale single-machine deployments with negligible performance loss.
- Mathematical reasoning: Subgoal decomposition, PRM-based data generation, and cross-modal transfer (informal ↔ formal) constitute promising paths forward.
- HPC and code: Usability hinges on combining model-level finetuning with aggressive compiler-aware prompt engineering.
Recommended practices for practitioners include: integrating adversarial safety SFT; maintaining strict human-in-the-loop oversight for sensitive deployments; rigorous cross-lingual safety audits; leveraging quantized variants for on-prem deployment; and fine-tuning or prompt-engineering for domain-specific code quality. Comprehensive, continuously updated safety and capability assessments (e.g., CNSafe, CHiSafetyBench) should be integral to the operational lifecycle.