
DeepSeek-671B Models

Updated 12 November 2025
  • DeepSeek-671B is a family of 671B-parameter foundation models with dense (DeepSeek-R1) and MoE (DeepSeek-V3) variants designed for advanced language tasks.
  • Innovations such as the chain-of-thought head, Multi-Head Latent Attention, and efficient quantization (e.g., DQ3_K_M) deliver high performance with reduced memory demands.
  • Comprehensive evaluations in safety, multilingual benchmarks, HPC code generation, and formal theorem proving highlight the model’s practical impact and deployment challenges.

DeepSeek-671B refers to a family of 671-billion-parameter foundation models originating from the DeepSeek project. With both dense (DeepSeek-R1) and Mixture-of-Experts (DeepSeek-V3) variants, DeepSeek-671B targets state-of-the-art language understanding, multilingual reasoning, code generation, and formal theorem proving at open-source scale. The models leverage advanced architectural innovations in memory- and compute-efficient design, and have been the focus of substantial empirical work on safety, quantized deployment, chain-of-thought bootstrapping, mathematical formalization, and application-level benchmarking.

1. Model Families, Architectures, and Innovations

Two principal architectural instantiations comprise the DeepSeek-671B family:

  • DeepSeek-R1-671B: A dense transformer LM with 671B parameters and an exposed chain-of-thought (CoT) head. Pretraining is followed by reinforcement learning from human feedback (RLHF).
  • DeepSeek-V3-671B: A Mixture-of-Experts (MoE) transformer designated DeepSeekMoE, also with 671B total parameters, 256 routed experts per MoE layer, and sparse routing activating 37B parameters per token (Top-8 experts). It is enhanced with Multi-Head Latent Attention (MLA) to compress key-value caches.

DeepSeek-R1-671B Features

  • Dense transformer, pretraining + RLHF.
  • CoT head exposes stepwise reasoning in output, increasing interpretability and adversarial attack surface (Ying et al., 19 Mar 2025).

DeepSeek-V3-671B and MLA/MoE Extensions

  • MLA compresses KV-caches by projecting keys/values into a small latent space before decompression per head.
  • DeepSeekMoE routing employs a learned expert-selection and load-balancing mechanism, activating Top-K experts per token without token drop.
  • 61 transformer layers, hidden dimension 7168, 128 attention heads, MoE with 256 experts/layer, Top-8 active (DeepSeek-AI et al., 2024).
  • Multi-Token Prediction (MTP) extends the next-token LM objective with auxiliary modules that predict several future tokens, adding a multi-step generation loss.
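The MLA compression step above can be sketched as follows. Sizes and projection matrices here are toy stand-ins for illustration, not the model's actual dimensions (V3 uses hidden dimension 7168 and 128 heads):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent, n_heads, d_head = 64, 8, 4, 16  # toy sizes

# Toy projection matrices (learned in the real model).
W_down = rng.normal(size=(d_model, d_latent))          # compress hidden state -> latent
W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) # decompress latent -> per-head keys
W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) # decompress latent -> per-head values

def cache_token(h):
    """Only the small latent vector is stored in the KV-cache."""
    return h @ W_down                                   # shape (d_latent,)

def decompress(c):
    """Reconstruct per-head keys/values from the cached latent."""
    k = (c @ W_up_k).reshape(n_heads, d_head)
    v = (c @ W_up_v).reshape(n_heads, d_head)
    return k, v

h = rng.normal(size=(d_model,))
c = cache_token(h)
k, v = decompress(c)

# Per-token cache shrinks from 2 * n_heads * d_head floats to d_latent floats.
full_cache, latent_cache = 2 * n_heads * d_head, d_latent
```

The memory win comes from caching only the latent vector per token and re-expanding keys/values on demand.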

Table 1: Key Architecture Parameters

| Variant | Param Count | Routing | Unique Features |
|---|---|---|---|
| DeepSeek-R1 | 671B | Dense, CoT | Chain-of-thought head, RLHF alignment |
| DeepSeek-V3 | 671B | MoE (37B act.) | MLA, 256 experts/layer, multi-token prediction |
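The Top-K routing summarized in Table 1 can be sketched as a gate over per-expert scores. This is a minimal illustration with toy sizes and a plain softmax gate over the selected experts, not DeepSeekMoE's actual load-balanced router:

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, top_k, d = 8, 2, 16     # toy sizes; V3 uses 256 experts with Top-8

W_gate = rng.normal(size=(d, n_experts))   # router weights (learned in practice)

def route(h):
    """Pick the Top-K experts by gate score; renormalize their weights."""
    scores = h @ W_gate
    top = np.argsort(scores)[-top_k:]            # indices of the Top-K experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                           # softmax over selected experts only
    return top, gate

h = rng.normal(size=(d,))
experts, weights = route(h)
```

Every token is dispatched to exactly Top-K experts (no token drop), and the gate weights combine the selected experts' outputs.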

2. Training Pipeline, Data, and Compute

Pretraining

  • DeepSeek-V3-671B was pretrained on 14.8T tokens comprising web crawl, code, math, and multilingual data with heavy English/Chinese skew.
  • Uses a byte-level BPE tokenizer with a 128K-token vocabulary.
  • Fill-in-the-Middle (FIM) training with Prefix-Suffix-Middle (PSM) framing is applied at a 10% rate.
  • Models are trained in FP8 mixed precision on a 2048 × NVIDIA H800 cluster, exploiting data parallelism (DP), pipeline parallelism (PP), and expert parallelism (EP) across nodes.

Training Efficiency

  • Total compute: 2.664M H800 GPU-hours for pretraining 14.8T tokens, plus context-extension and SFT/RL steps.
  • MLA and DeepSeekMoE, combined with FP8 and tight scheduling (DualPipe), deliver significant memory and compute savings.
  • Loss curves demonstrate <0.25% deviation in relative loss between FP8 and BF16 baselines.
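As a quick sanity check on the figures above, the stated pretraining budget implies roughly 54 days of wall-clock time at full utilization of the 2048-GPU cluster:

```python
gpu_hours = 2.664e6      # pretraining H800 GPU-hours (from the text)
n_gpus = 2048            # cluster size (from the text)

wall_clock_hours = gpu_hours / n_gpus
wall_clock_days = wall_clock_hours / 24
print(f"{wall_clock_hours:.0f} h = {wall_clock_days:.1f} days")
```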

3. Quantization and Deployability

Deploying the native FP8 variants of DeepSeek-671B requires over 770 GB RAM (including KV-cache), exceeding the memory of typical high-end 8-GPU setups. Post-training quantization (PTQ) remedies this.

  • Q4_K_M (4-bit uniform): Reduces memory footprint to 568 GB (<1% accuracy drop).
  • DQ3_K_M (dynamic 3-bit method): Assigns 3, 4, or 6 bits per block based on layer-importance profiling. Achieves 469 GB total (59 GB/GPU) with <0.5% accuracy drop, uniquely enabling deployment on 64 GB-per-NPU hardware such as Huawei 910B.
  • Q3_K_M (3-bit uniform): Shows a 1.7% average drop; not recommended for V3 reasoning tasks (≈8–9% accuracy loss on reasoning).
  • UD-Q2_K_XL (2-bit): Viable for specific targets but unstable, with high variance on QA tasks.
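The uniform K-quant variants above share one idea: weights are grouped into blocks, each stored as low-bit integers plus a floating-point scale. A minimal symmetric block quantizer (an illustrative sketch, not the actual GGUF K-quant layout) shows the trade-off:

```python
def quantize_block(ws, bits):
    """Symmetric per-block quantization: one fp scale + k-bit integers."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for 4-bit
    scale = max(abs(w) for w in ws) / qmax or 1.0  # guard against all-zero blocks
    q = [round(w / scale) for w in ws]             # integers in [-qmax, qmax]
    return scale, q

def dequantize_block(scale, q):
    """Reconstruct approximate weights from scale + integers."""
    return [scale * qi for qi in q]

ws = [0.12, -0.5, 0.33, 0.07]
scale, q = quantize_block(ws, bits=4)
recon = dequantize_block(scale, q)
err = max(abs(a - b) for a, b in zip(ws, recon))
```

Fewer bits per integer shrink the memory footprint linearly but widen the rounding error, which is why 3-bit uniform quantization degrades reasoning accuracy noticeably more than 4-bit.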

Table 2: Memory and Accuracy Comparison

| Variant | Total Memory | Drop vs. FP8 |
|---|---|---|
| FP8 | 770 GB | 0% |
| Q4_K_M | 568 GB | −0.68% |
| DQ3_K_M | 469 GB | −0.34% |
| Q3_K_M | 487 GB | −1.72% |

DQ3_K_M’s importance profiling assigns 6 bits to the first two ffn_down_exps layers, 4 bits to every fourth subsequent layer, and 3 bits otherwise.
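Under one plausible reading of that assignment rule (the exact offset of the "every 4th" stride is an assumption), the per-layer bit width can be expressed as:

```python
def dq3_bits(layer_idx):
    """Bit width for an ffn_down_exps layer under the stated DQ3_K_M rule."""
    if layer_idx < 2:
        return 6                    # first two layers kept at 6-bit
    if (layer_idx - 2) % 4 == 0:
        return 4                    # every 4th subsequent layer at 4-bit
    return 3                        # everything else at 3-bit

bits = [dq3_bits(i) for i in range(12)]
```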

4. Safety Evaluation and Mitigation

DeepSeek-671B models exhibit advanced reasoning but substantial safety vulnerabilities prior to mitigation.

  • Evaluation Framework: CNSafe (Chinese-English, 3100 clean + 1000 adversarial prompts); LLM-as-Judge (GPT-4o, Qwen2.5-72B-Instruct). ASR (Attack Success Rate) is the central metric.
  • Baseline Findings:
    • DeepSeek-R1: ASR = 100% on Cisco’s 50 harmful English prompts (Zhang et al., 18 Mar 2025).
    • On CNSafe clean prompts, DeepSeek-R1 returns higher ASR than DeepSeek-V3, especially on English inputs (21.7 percentage points higher on average); CoT exposure makes R1 significantly more vulnerable.
    • On red-teaming, DeepSeek-V3 yields 95–100% ASR; R1 is at 80–95% depending on language and risk category (Ying et al., 19 Mar 2025).
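ASR itself is straightforward to compute once a judge model has labeled each completion; the verdicts below are hypothetical placeholders for LLM-as-Judge outputs:

```python
def attack_success_rate(judgments):
    """ASR (%) = fraction of prompts judged to elicit unsafe output."""
    return 100.0 * sum(judgments) / len(judgments)

# Hypothetical judge verdicts (True = unsafe completion) over 8 prompts.
verdicts = [True, False, True, True, False, True, False, True]
asr = attack_success_rate(verdicts)
```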

Table 3: CNSafe ASR (%) for Clean Prompts

| Category | DeepSeek-V3 (ZH) | DeepSeek-R1 (EN) |
|---|---|---|
| Core Socialist Values Violation | 4.5 | 59.5 |
| Discriminatory Content | 14.1 | 54.3 |
| Commercial Misconduct | 12.4 | 69.0 |
| Rights Infringement | 6.1 | 66.1 |

5. Mathematical, Code, and Theorem-Proving Applications

  • DeepSeek-R1 (671B) acts as a chain-of-thought “teacher” for SFT of smaller derivatives:
    • Questions are graded for difficulty by the target model, then CoT traces generated via DeepSeek-R1.
    • Empirically, using 2k high-quality CoT trace pairs for math or code suffices for a 32B student model to surpass DeepSeek-Distill-32B in the relevant task.
    • Reward modeling (PRM-Grader) outperforms answer-only grading in constructing adaptive curricula.
  • DeepSeek-Prover-V2-671B, derived from DeepSeek-V3, applies RL (GRPO) and a cold-start data pipeline using recursive subgoal decomposition and proof synthesis.
  • Achieves state-of-the-art Lean 4 MiniF2F test pass rate (88.9% CoT, 78.3% non-CoT) and solves 6/15 AIME 2024–2025 problems formally (close to DeepSeek-V3’s 8/15 informal solution rate).
  • Hierarchical subgoal alignment and RL with a consistency reward enforce logical step coverage.
  • Remaining gap to SOTA on open mathematics (e.g. PutnamBench, IMO-level) is substantial but narrowing.
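The difficulty-graded bootstrapping loop described above (grade questions against the student, collect teacher CoT traces for the hard ones) can be sketched as follows; the student, teacher, and selection rule here are hypothetical stand-ins:

```python
def build_curriculum(questions, student_solves, teacher_cot, k=2):
    """Keep questions the student fails; pair them with teacher CoT traces."""
    hard = [q for q in questions if not student_solves(q)]
    return [(q, teacher_cot(q)) for q in hard[:k]]

# Hypothetical stand-ins for the student model and the R1 teacher.
student = lambda q: len(q) < 10                    # "solves" only short questions
teacher = lambda q: f"step-by-step trace for: {q}" # placeholder CoT trace

qs = ["2+2", "integrate x^2 dx", "prove AM-GM"]
pairs = build_curriculum(qs, student, teacher)
```

The resulting (question, trace) pairs form the SFT set; the text above reports that ~2k such high-quality pairs suffice for a 32B student.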

6. High-Performance Computing (HPC) Code Generation

DeepSeek-671B produces functionally correct code for dense kernels (Conjugate Gradient, parallel heat equation, matmul, DGEMM, STREAM Triad) in C++, Fortran, Julia, and Python (Nader et al., 15 Mar 2025). However, relative to GPT-4:

  • HPC code from DeepSeek-671B lacks loop tiling, vectorization, cache-blocking, and optimized API usage.
  • GPT-4’s completions typically achieve 3×–50× higher performance (especially in memory-bound and DGEMM kernels).
  • Concrete recommendations: prompt for explicit pragmas/blocking, fine-tune on HPC codebases, post-process code to insert missing compiler hints.
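The loop-tiling transform that the generated HPC code omits looks like this; Python is used only to illustrate the blocked loop structure (the cache-locality benefit materializes in compiled languages such as C or Fortran):

```python
def matmul_tiled(A, B, n, T=2):
    """Blocked (tiled) matmul: iterate over T x T tiles so each tile of A and B
    stays hot in cache while it is reused across the inner loops."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for k0 in range(0, n, T):
                for i in range(i0, min(i0 + T, n)):
                    for k in range(k0, min(k0 + T, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + T, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 4
A = [[float(i == j) for j in range(n)] for i in range(n)]   # identity matrix
B = [[float(i * n + j) for j in range(n)] for i in range(n)]
C = matmul_tiled(A, B, n)
```

Prompting explicitly for this kind of blocking (or post-processing generated loops into it) is one of the concrete mitigations suggested above.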

7. General-Capability and Multilingual Benchmarks

DeepSeek-V3-671B delivers competitive performance:

  • Base model (activated parameters 37B): MMLU (5-shot EM) 87.1%, BBH 87.5%, GSM8K 89.3%, MATH 61.6%, HumanEval 65.2% (DeepSeek-AI et al., 2024).
  • Chat model: MMLU (0-shot) 88.5%, DROP 91.6%, AIME 2024 39.2%, MATH-500 90.2%.
  • Resource efficiency: pretraining and SFT delivered in 2.788M GPU-hours without major loss spikes.

The model matches or exceeds the largest open-weight alternatives (Qwen2.5-72B, LLaMA-3.1-405B) on benchmark tasks and approaches closed models in some domains.

8. Discussion, Limitations, and Future Directions

  • Safety remains unsolved: Both architectural and alignment advances (e.g., MoE scaling, post-SFT reward modeling) are insufficient alone. Jailbreaking and CoT attacks remain highly effective and expose the interpretability–vulnerability trade-off.
  • Cross-lingual brittleness: English prompts result in significantly higher unsafe output rates than Chinese—a persistent challenge for bilingual alignment.
  • Deployment: Quantization (Q4_K_M, DQ3_K_M) is mature for large-scale single-machine deployments with negligible performance loss.
  • Mathematical reasoning: Subgoal decomposition, PRM-based data generation, and cross-modal transfer (informal ↔ formal) constitute promising paths forward.
  • HPC and code: Usability hinges on combining model-level finetuning with aggressive compiler-aware prompt engineering.

Recommended practices for practitioners include: integrating adversarial safety SFT; maintaining strict human-in-the-loop oversight for sensitive deployments; rigorous cross-lingual safety audits; leveraging quantized variants for on-prem deployment; and fine-tuning or prompt-engineering for domain-specific code quality. Comprehensive, continuously updated safety and capability assessments (e.g., CNSafe, CHiSafetyBench) should be integral to the operational lifecycle.
