Stable Code Instruct
- Stable Code Instruct is a code language model featuring a 32-layer Transformer with 2.8B parameters, optimized for following human instructions in software engineering tasks.
- It is trained via autoregressive pretraining on 1.3 trillion tokens, followed by instruction tuning on over 500K code examples to enhance multi-turn dialogue and code completion accuracy.
- Deployed with quantized variants for accelerated inference on edge devices, it delivers state-of-the-art performance across diverse programming tasks and benchmarks.
Stable Code Instruct is an open-source instruction-tuned code LLM based on the Stable Code architecture, designed to follow human instructions in chat interfaces for software engineering tasks. It integrates a 32-layer Transformer backbone with 2.8 billion parameters, supports a context length of up to 16,384 tokens, and is optimized for code completion, reasoning, multi-turn dialogue, and SQL generation. The model distinguishes itself by extensive pretraining on a large code corpus, followed by targeted alignment through supervised fine-tuning and Direct Preference Optimization (DPO). It is released with quantized variants for accelerated inference and low-power edge deployment, and demonstrates state-of-the-art performance in its parameter class across several code-centric benchmarks (Pinnaparaju et al., 2024).
1. Architecture and Technical Specifications
Stable Code Instruct inherits its architecture directly from Stable Code 3B. The model comprises:
- 32 Transformer layers with a hidden dimension of 2560.
- 32 attention heads per layer; rotary position embeddings are applied to the first 25% of head dimensions, with the rotary base chosen to support the extended 16,384-token context.
- LayerNorm with learned gain and bias; the feed-forward (FFN) dimension follows the Stable Code 3B configuration.
- Biases removed from all feed-forward and attention output projections, except for the QKV projections.
- Tokenizer: GPT-NeoX BPE with a vocabulary size of 50,257, plus special code tokens (StarCoder markers, FIM tokens).
| Parameter | Value |
|---|---|
| Parameters | 2.8B |
| Hidden size | 2560 |
| Layers | 32 |
| Attention heads | 32 |
| Max sequence length (tokens) | 16,384 |
Note: code completion accuracy is reported using the pass@1 metric.
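As an illustration of these specifications, the configuration can be inspected programmatically. The sketch below uses Hugging Face Transformers; the model identifier `stabilityai/stable-code-instruct-3b` is assumed and should be verified against the actual checkpoint.

```python
# Sketch: inspect the architecture hyperparameters of the checkpoint.
# The model id "stabilityai/stable-code-instruct-3b" is assumed; substitute
# a local path if the checkpoint is stored elsewhere.
from transformers import AutoConfig, AutoTokenizer

model_id = "stabilityai/stable-code-instruct-3b"  # assumed identifier
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Expected values per the table above: 32 layers, hidden size 2560,
# 32 attention heads, and a maximum context of 16,384 tokens.
print("layers:", config.num_hidden_layers)
print("hidden size:", config.hidden_size)
print("attention heads:", config.num_attention_heads)
print("max positions:", getattr(config, "max_position_embeddings", None))
print("tokenizer size:", len(tokenizer))
```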
2. Training Data, Instruction Tuning, and Alignment
The training pipeline involves two principal stages:
- Autoregressive pretraining on a code/natural-language mix totaling approximately 1.3 trillion tokens. Pretraining is conducted in two context-length stages: 4,096 tokens and then 16,384 tokens.
- Instruction tuning (Stable Code Instruct): Supervised Fine-Tuning (SFT) with deduplicated instruction examples from collections including OpenHermes 2.5, CodeAlpaca 20K, and CodeFeedback. Data sequences are packed up to 4,096 tokens and augmented with fill-in-the-middle (FIM) markers, as illustrated in the sketch after this list.
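The sketch below assembles a single FIM-augmented training string in the StarCoder-style prefix/suffix/middle layout. The exact packing implementation is not published, so this is a simplified assumption for illustration only.

```python
# Sketch: build one FIM-augmented training string (simplified assumption;
# the report does not publish the exact packing implementation).
import random

def make_fim_example(code: str, rng: random.Random) -> str:
    # Split the document at two random points into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # StarCoder-style PSM layout: the model learns to emit the middle
    # segment after the <fim_middle> sentinel.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

example = make_fim_example("def add(a, b):\n    return a + b\n", random.Random(0))
print(example)
```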
SFT uses the standard cross-entropy (next-token) objective, $\mathcal{L}_{\text{SFT}} = -\sum_{t} \log p_\theta(y_t \mid y_{<t}, x)$, where $x$ is the instruction prompt and $y$ the target response.
After SFT, Direct Preference Optimization (DPO) is applied using code-related response pairs and “harmless” preference pairs. The DPO loss is defined as $\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\big)\big]$, where $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen SFT reference policy, and $\beta$ is the margin scale parameter.
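A minimal PyTorch sketch of this objective is given below. It assumes per-sequence log-probabilities have already been computed under the policy and the frozen reference model, and the `beta` value is a placeholder, not a reported training hyperparameter.

```python
# Sketch: DPO loss over a batch of preference pairs (assumes precomputed
# sequence log-probabilities; beta is a placeholder margin scale value).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio of policy vs. reference for preferred and rejected responses.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen_ratio - rejected_ratio)), averaged over the batch.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with fake log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```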
Training is conducted on 256 NVIDIA A100 40GB GPUs, leveraging ZeRO Stage 1 optimization and mixed BF16/FP32 precision (Pinnaparaju et al., 2024).
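The exact distributed-training configuration is not reproduced in the report excerpted here; the following is a minimal, hypothetical DeepSpeed-style configuration consistent with the described setup (ZeRO Stage 1, BF16 mixed precision), with placeholder batch-size and clipping values.

```python
# Hypothetical DeepSpeed configuration sketch matching the described setup
# (ZeRO Stage 1, BF16 mixed precision); numeric values are placeholders.
deepspeed_config = {
    "zero_optimization": {"stage": 1},    # ZeRO Stage 1: shard optimizer states
    "bf16": {"enabled": True},            # BF16 compute with FP32 master weights
    "gradient_clipping": 1.0,             # placeholder value
    "train_micro_batch_size_per_gpu": 4,  # placeholder value
    "gradient_accumulation_steps": 1,     # placeholder value
}
```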
3. Evaluation Benchmarks and Performance Metrics
Stable Code Instruct is evaluated on multiple programming tasks. For polyglot code completion (Multi-PL), the pass@1 metric is reported across six languages (Python, C++, JavaScript, Java, PHP, Rust):
| Model | Size | Avg pass@1 | Python | C++ | JS | Java | PHP | Rust |
|---|---|---|---|---|---|---|---|---|
| Stable Code Instruct | 3B | 47.2 | 58.6 | 48.1 | 49.2 | 44.4 | 45.6 | 37.2 |
| DeepSeek Coder Ins | 1.3B | 44.3 | 52.5 | 45.0 | 52.3 | 40.9 | 46.4 | 28.6 |
| DeepSeek Coder Ins | 6.7B | 61.1 | 64.6 | 63.2 | 67.7 | 59.1 | 62.7 | 48.9 |
| CodeLlama Instruct | 7B | 30.6 | 32.7 | 30.8 | 33.6 | 31.5 | 29.6 | 25.5 |
MT-Bench Coding: on the judged coding category, Stable Code Instruct (3B) scores 5.8, surpassing CodeLlama Instruct (7B) at 3.6.
SQL-Eval: Stable Code Instruct achieves 47.2% average accuracy across SQL patterns; for comparison, SQLCoder 7B posts 70.6%.
No confidence intervals are reported, but Stable Code Instruct consistently outperforms other 3B instruction-tuned baselines (Pinnaparaju et al., 2024).
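For reference, the pass@1 numbers above follow the standard pass@k methodology used for code benchmarks. The sketch below shows the commonly used unbiased estimator; with a single greedy sample per problem, pass@1 reduces to the fraction of problems whose completion passes the unit tests.

```python
# Sketch: pass@k estimation for code-completion benchmarks.
# n = samples generated per problem, c = samples that pass the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one greedy sample per problem (n = 1, k = 1), pass@1 is simply the
# solved fraction.
results = [1, 0, 1, 1, 0]  # toy per-problem pass/fail flags
print(sum(results) / len(results))   # 0.6
print(pass_at_k(n=20, c=7, k=1))     # estimator form when n samples are drawn
```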
4. Deployment: Quantization, Throughput, and Edge Efficiency
Stable Code Instruct is provided in GGUF and MLX quantized formats:
- GGUF: FP16, Q5_K_M, Q6_K
- MLX: INT4
Performance on Apple M2 Pro Max:
| Framework | Precision | Tokens/sec | Power (W) |
|---|---|---|---|
| MLX | FP16 | 23 | 18 |
| MLX | INT4 | 52 | 17 |
| GGUF | FP16 | 28 | 14 |
| GGUF | Q5_K_M | 53 | 23 |
| GGUF | Q6_K | 54 | 23 |
Quantization approximately doubles inference throughput, at the cost of up to several points of pass@1 degradation, so validation on the target tasks is recommended (Pinnaparaju et al., 2024).
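As a usage sketch, a quantized GGUF variant can be run with the llama-cpp-python bindings; the file name below is a placeholder, and the generation parameters are assumptions rather than values from the report.

```python
# Sketch: running a quantized GGUF variant with llama-cpp-python.
# The file path is a placeholder; Q5_K_M / Q6_K are the variants listed above.
from llama_cpp import Llama

llm = Llama(
    model_path="stable-code-instruct-3b.Q5_K_M.gguf",  # placeholder path
    n_ctx=4096,  # context window for this session (model supports up to 16,384)
)

prompt = "Write a Python function that reverses a string."
out = llm(prompt, max_tokens=128, temperature=0.2)
print(out["choices"][0]["text"])
```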
5. Prompt Engineering and Usage Guidelines
Recommended interaction structure uses explicit “System” and “User” roles for best results. Examples:
- "System: You are an expert Python assistant."
- "User: Implement a function that…"
For completion tasks, constraining output length with `max_tokens` and using a temperature of 0–0.2 yields near-deterministic output. In multi-turn reasoning, explicit restatement of user goals is advised. Inclusion of FIM markers (`<fim_prefix>…<fim_suffix>…<fim_middle>`, with the model generating the missing middle segment after the final marker) enables bidirectional context utilization.
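A minimal generation sketch following these guidelines is shown below. It assumes the Hugging Face model id `stabilityai/stable-code-instruct-3b` and a bundled chat template; both should be verified against the actual checkpoint.

```python
# Sketch: system/user chat prompting with low-temperature decoding.
# The model id and the presence of a chat template are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stable-code-instruct-3b"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are an expert Python assistant."},
    {"role": "user", "content": "Implement a function that checks whether a string is a palindrome."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

# Low temperature (0-0.2) for near-deterministic completions, bounded length.
output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```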
Potential failure modes include complex SQL queries (30–50% failure rate), drift in multi-turn scenarios, and hallucination even in small code snippets. Although DPO reduces unsafe completions, external safety filters are advised for end-user deployments (Pinnaparaju et al., 2024).
6. Context: Robust Prompting and Code-Style Instruction Strategies
Stable Code Instruct’s alignment process is technically distinct from “Robust Code Instructions” (Zhang et al., 2024). The latter converts natural-language task descriptions into structural code-style instructions, reducing interpretive ambiguity and increasing robustness to adversarial inputs. For closed-API LLMs, code-style prompting combined with adversarial in-context demonstration mixtures yields up to +5.7% accuracy and a 5.98-point reduction in attack success rate (ASR) on gpt-3.5-turbo, without any parameter updates.
Best practices in robust prompting:
- Maintain consistent class/method templates across demonstrations and test prompts.
- Mix positive/negative classes and adversarial samples in demonstration shots.
- Use non-executable but structurally clear pseudocode.
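For illustration, the sketch below converts a natural-language classification instruction into a code-style prompt of the kind described above; the class and method names are illustrative placeholders, not the exact templates from Zhang et al. (2024).

```python
# Sketch: a code-style (non-executable) instruction prompt for a sentiment task.
# Class/method names are illustrative placeholders, not the original templates.
CODE_STYLE_PROMPT = '''
class SentimentClassifier:
    def classify(self, sentence: str) -> str:
        """Return "positive" or "negative" for the given sentence."""

# Demonstrations (keep the same template across shots and the test input):
classifier = SentimentClassifier()
classifier.classify("The film was a delight from start to finish.")   # -> "positive"
classifier.classify("The plot was predictable and the acting flat.")  # -> "negative"

# Test input:
classifier.classify("{test_sentence}")  # ->
'''

print(CODE_STYLE_PROMPT.format(test_sentence="An uneven but ultimately rewarding watch."))
```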
Manual prompt conversion can be labor-intensive; automated conversion tools and richer abstractions remain ongoing research areas (Zhang et al., 2024).
7. Limitations and Future Directions
Stable Code Instruct demonstrates strong performance at the 3B scale across multilingual code completion and instruction following, but does not match larger competitors on highly complex tasks (e.g., SQLCoder 7B on SQL-Eval). Quantization tradeoffs require task-specific validation. Further research into automated, structural prompt generation and deeper theoretical understanding of code-style alignment benefits remains open (Pinnaparaju et al., 2024, Zhang et al., 2024).