DeepSeek-Coder-1.3B Transformer
- DeepSeek-Coder-1.3B Transformer is a 1.3B-parameter, decoder-only model optimized for code synthesis using next-token prediction and fill-in-the-middle losses.
- It employs FlashAttention v2 and linear RoPE interpolation for efficient long-context processing up to 16K tokens, enhancing performance on diverse code corpora.
- Empirical evaluations indicate competitive results in code completion and discrimination, with significant improvements in compilation, security, and semantic correctness.
DeepSeek-Coder-1.3B Transformer is a 1.3-billion-parameter, decoder-only Transformer LLM engineered for code synthesis, completion, and discrimination. As the smallest member of the DeepSeek-Coder suite, it emphasizes architectural simplicity and performance-oriented training on programmatically diverse, large-scale code corpora. Released under an open license permitting unrestricted research and commercial use, DeepSeek-Coder-1.3B serves both as a standalone generative model and as the backbone for code correctness discriminators and retrieval-augmented self-repair systems (Guo et al., 2024, Liang et al., 2024, Sriram et al., 1 Jan 2026).
1. Architectural Foundations
DeepSeek-Coder-1.3B is a decoder-only Transformer conforming to the GPT-family paradigm, with the following detailed configuration (Guo et al., 2024, Liang et al., 2024):
| Parameterization | Value | Note |
|---|---|---|
| Layers (L) | 24 | |
| Hidden dim ($d_{\text{model}}$) | 2,048 | |
| Attention heads ($h$) | 16 | Head dim = 128 |
| FFN dim ($d_{\text{ffn}}$) | 5,504 | SwiGLU activation |
| Context length | up to 16,384 tokens | RoPE interpolation |
| Tokenizer | BPE, vocab=32,000 | No security-special tokens |
| Attention | FlashAttention v2 | No grouped-query in 1.3B |
Each Transformer block integrates multi-head self-attention (augmented with Rotary Position Embeddings—RoPE) and SwiGLU-activated two-layer feed-forward networks. Positional encoding uses linear RoPE interpolation, enabling stable behavior for long contexts. Engineering enhancements include FlashAttention v2 for memory- and compute-efficient attention, supporting practical inference for up to 16K tokens (Guo et al., 2024).
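As a sanity check on the configuration above, the per-matrix parameter counts can be tallied in a few lines. This is a back-of-the-envelope estimate only: it assumes an untied output head and ignores biases and normalization parameters.

```python
# Rough parameter count from the table above (illustrative estimate).
L, d, d_ffn, vocab = 24, 2048, 5504, 32000

embed = vocab * d        # token embedding matrix
attn = 4 * d * d         # Q, K, V, and output projections per layer
ffn = 3 * d * d_ffn      # SwiGLU uses three weight matrices per layer
lm_head = vocab * d      # output projection (assumed untied)

total = embed + L * (attn + ffn) + lm_head
print(f"{total / 1e9:.2f}B parameters")  # within a few percent of the nominal 1.3B
```

The FFN term dominates each layer under SwiGLU, since the gated activation requires three projections rather than the two of a plain MLP.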
2. Pre-training Objectives and Training Procedure
The model is trained from scratch using a dual-objective formulation:
- Next-Token Prediction (NTP): the standard autoregressive loss
$$\mathcal{L}_{\text{NTP}} = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$$
- Fill-in-the-Middle (FIM): Prefix-Suffix-Middle (PSM) masking and permutation; the input is split into three contiguous spans $(x_{\text{pre}}, x_{\text{mid}}, x_{\text{suf}})$ and infilling is formulated as autoregressive prediction over the reordered sequence
$$\langle \mathrm{pre} \rangle \, x_{\text{pre}} \, \langle \mathrm{mid} \rangle \, x_{\text{suf}} \, \langle \mathrm{end} \rangle \, x_{\text{mid}}$$
The FIM loss applies cross-entropy only to the masked (middle) tokens.
Ablation showed that a PSM mix of 50% (i.e., equal application of NTP and FIM losses) optimizes the trade-off between code-completion and infilling—maximizing HumanEval-FIM performance without degrading next-token synthesis (Guo et al., 2024).
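The PSM reordering and the 50% mixing rate can be sketched as follows. The sentinel token names are illustrative placeholders, not the tokenizer's actual special tokens.

```python
import random

# Illustrative sentinel strings; the real model uses dedicated special tokens.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def psm_transform(doc: str, rng: random.Random) -> str:
    """Split a document into (prefix, middle, suffix) and reorder it so the
    model learns to generate the middle conditioned on both sides."""
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

def apply_fim(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    # The 50% PSM mix from the ablation: half the documents are reformatted
    # for infilling, the rest remain plain next-token examples.
    return psm_transform(doc, rng) if rng.random() < fim_rate else doc
```

Because the middle span appears last in the reordered sequence, the ordinary next-token cross-entropy over it directly trains the infilling capability.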
Training utilized a ∼2-trillion-token corpus (∼603 million files, ∼798 GB) drawn from public GitHub (pre-February 2023). Rigorous repo-level filtering, context-sensitive concatenation, n-gram decontamination, and heuristic quality controls yielded a dataset composed of 87% source code, 10% code-related English, and 3% Chinese descriptions. Training used the AdamW optimizer with a multi-step learning-rate decay schedule, batch size 1,024, and NVIDIA A100/H800 hardware (Guo et al., 2024).
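A minimal sketch of the kind of n-gram decontamination mentioned above; the value of `n` and the exact matching policy here are assumptions for illustration, not the paper's reported settings.

```python
# Illustrative n-gram overlap filter against evaluation benchmarks.
def ngrams(tokens, n=10):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample: str, benchmark_texts, n=10) -> bool:
    """Drop a training sample if it shares any n-gram with a benchmark text."""
    sample_grams = ngrams(sample.split(), n)
    return any(sample_grams & ngrams(text.split(), n) for text in benchmark_texts)
```

In practice the benchmark n-gram sets would be precomputed once and hashed, rather than rebuilt per sample as in this sketch.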
3. Empirical Performance and Evaluation
DeepSeek-Coder-1.3B was extensively benchmarked against both open- and closed-source models. Key results include (Guo et al., 2024):
| Task/Dataset | DeepSeek-1.3B | CodeGeeX2-6B | StarCoder-16B | CodeLlama-7B |
|---|---|---|---|---|
| HumanEval (Python, pass@1) | 34.8% | 36.0% | 31.7% | — |
| MBPP (few-shot) | 46.2% | 36.2% | 42.8% | — |
| DS-1000 (data-science, pass@1) | 16.2% | — | — | 22.1% |
| FIM (mean/all) | 70.4% | — | — | — |
| PAL Math Reasoning (avg) | 31.9% | — | — | 46.1% |
The model achieves competitive results on code-completion, code infilling, and data-science tasks, with fill-in-the-middle (FIM) accuracy of 70.4% (Python 57.4, Java 82.2, JS 71.7), demonstrating strong language generalization given its parameter budget.
4. Retrieval-Augmented Generation and Self-Repair
In security-critical synthesis, DeepSeek-Coder-1.3B integrates within a retrieval-augmented, tool-feedback-driven repair loop (Sriram et al., 1 Jan 2026). The system leverages:
- Semantic retrieval: an all-MiniLM-L6-v2 (384-dim) embedding model fetches prior successful (user-task, secure-code) pairs; the top-k examples ranked by cosine similarity are prepended to the prompt.
- Augmented decoding: generation is conditioned on both the user prompt $x$ and the retrieved exemplars $e_1, \dots, e_k$, i.e., $y \sim p_\theta(\cdot \mid e_1, \dots, e_k, x)$.
- Iterative self-repair: diagnostics from GCC (compilation), CodeQL (static security analysis), and KLEE (symbolic execution) are aggregated into a consolidated feedback message and appended to the prompt, for up to three repair iterations.
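The retrieval-and-repair loop can be sketched as follows. Here `generate` and `run_tools` stand in for the DeepSeek-Coder model and the GCC/CodeQL/KLEE toolchain; they are placeholders, not real APIs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=2):
    """store: list of (embedding, exemplar_text) pairs; rank by similarity."""
    ranked = sorted(store, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def repair_loop(task, query_vec, store, generate, run_tools, max_iters=3):
    # Prepend retrieved exemplars, then iterate tool feedback into the prompt.
    prompt = "\n".join(top_k(query_vec, store)) + "\n" + task
    code = generate(prompt)
    for _ in range(max_iters):
        diagnostics = run_tools(code)   # aggregated GCC + CodeQL + KLEE output
        if not diagnostics:
            break                       # all checks clean: stop early
        code = generate(prompt + "\n" + diagnostics)
    return code
```

The early exit matters operationally: most programs converge before the three-iteration cap, so average latency stays well below the worst case.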
Empirical application to 1,522 C/C++ programs shows:
| Metric | Baseline Error Rate | After Repair Loop | Absolute Reduction |
|---|---|---|---|
| Compilation | 39.79% | 20.43% | 19.36 pp |
| Security | 36.35% | 1.45% | 34.90 pp |
| Semantic (KLEE) | 60.09% | 5.72% | 54.37 pp |
Security error rates are reduced by approximately 96% (from 36.35% to 1.45%). Semantic correctness shows an absolute improvement of 54.37 percentage points (Sriram et al., 1 Jan 2026).
5. DeepSeek-Coder-1.3B as a Discriminative Backbone
The Condor framework repurposes DeepSeek-Coder-1.3B as a code correctness discriminator enhanced by two principal strategies (Liang et al., 2024):
- Embedding-level contrastive fine-tuning: for a pair of code samples with label $y = 0$ (both correct) or $y = 1$ (one incorrect), a margin-based contrastive loss is applied:
$$\mathcal{L}_{\text{con}} = (1 - y)\, d(z_1, z_2)^2 + y \, \max\bigl(0,\; m - d(z_1, z_2)\bigr)^2$$
where $m$ is the margin and $d(z_1, z_2)$ is the Euclidean distance between the two code embeddings.
- Data-level augmentation: from user-submitted correction histories, intermediate fixes are generated and injected as training examples, providing the model with a granular trajectory of code repair.
- Evaluation: On the CodeNanoFix dataset (minimal-diff bug fixes), the vanilla DeepSeek-1.3B discriminator yields F1 = 67.06%; Condor-1.3B, with contrastive fine-tuning and data augmentation, achieves F1 = 73.38%. When used to select among multiple candidate completions from larger models, Condor-1.3B boosts the Pass@1 of Meta-Llama-3.1-Instruct (70B) from 52.64% to 62.63%, a roughly 10-percentage-point gain. On APPS, DeepSeek-Coder-Instruct (6.7B) + Condor-1.3B reaches 14.68% Pass@1 (vs. 9.40% without); on MBPP, Condor-1.3B lifts Meta-Llama-3.1 from 71.40% to 75.20% (Liang et al., 2024).
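The margin-based contrastive objective used for embedding-level fine-tuning can be sketched in a few lines; the margin value here is an illustrative hyperparameter, not the paper's setting.

```python
import math

def contrastive_loss(z1, z2, y, margin=1.0):
    """y = 0: both samples correct, pull embeddings together;
    y = 1: one sample incorrect, push them apart beyond the margin."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))
    return (1 - y) * d ** 2 + y * max(0.0, margin - d) ** 2
```

Note the hinge on the dissimilar term: once an incorrect/correct pair is separated by more than the margin, it contributes zero gradient, so training effort concentrates on hard, near-boundary pairs.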
6. Licensing and Availability
DeepSeek-Coder-1.3B and its larger-family models are distributed under a fully permissive open-source license, with no restrictions on academic or commercial usage, fine-tuning, or redistribution. This licensing structure enables broad adoption and facilitates integration into both research workflows and production deployment (Guo et al., 2024).
7. Limitations and Future Directions
Current limitations include sensitivity to the underlying code corpus biases and a reliance on large-scale, high-quality test-annotated datasets for discriminative fine-tuning (e.g., intermediate code states for Condor). While the retrieval-augmented self-repair loop achieves strong empirical reductions in security and compilation errors, its iteration count and scaling in ultra-high-throughput settings remain non-trivial constraints (Sriram et al., 1 Jan 2026). Potential avenues include synthetic generation of intermediate repair trajectories, hybrid discrimination leveraging lightweight execution, and upstream integration of contrastive, code-detail-aware objectives into the pre-training pipeline to minimize the downstream need for discrimination (Liang et al., 2024).