Stable Code Instruct

Updated 21 November 2025
  • Stable Code Instruct is a code language model featuring a 32-layer Transformer with 2.8B parameters, optimized for following human instructions in software engineering tasks.
  • It is trained with autoregressive pretraining on approximately 1.3 trillion tokens, followed by instruction tuning on over 500K examples to improve multi-turn dialogue and code completion accuracy.
  • Deployed with quantized variants for accelerated inference on edge devices, it delivers state-of-the-art performance across diverse programming tasks and benchmarks.

Stable Code Instruct is an open-source instruction-tuned code LLM based on the Stable Code architecture, designed to follow human instructions in chat interfaces for software engineering tasks. It integrates a 32-layer Transformer backbone with 2.8 billion parameters, supports a context length of up to 16,384 tokens, and is optimized for code completion, reasoning, multi-turn dialogue, and SQL generation. The model distinguishes itself by extensive pretraining on a large code corpus, followed by targeted alignment through supervised fine-tuning and Direct Preference Optimization (DPO). It is released with quantized variants for accelerated inference and low-power edge deployment, and demonstrates state-of-the-art performance in its parameter class across several code-centric benchmarks (Pinnaparaju et al., 2024).

1. Architecture and Technical Specifications

Stable Code Instruct inherits its architecture directly from Stable Code 3B. The model comprises:

  • 32 Transformer layers with a hidden dimension $d_{\mathrm{model}} = 2560$.
  • 32 attention heads per layer; rotary embeddings are applied to the first 25% of head dimensions with a rotary base of $10^6$ for extended context capabilities.
  • LayerNorm with learned gain and bias; FFN dimension of $4 \times d_{\mathrm{model}} = 10240$.
  • Biases are removed from all feed-forward layers and attention projections, with the exception of the QKV projection biases.
  • Tokenizer: GPT-NeoX BPE with a vocabulary size of 50,257, plus special code tokens (StarCoder markers, FIM tokens).
| Parameter | Value |
|---|---|
| Parameters | $2.795 \times 10^9$ |
| Hidden size | 2560 |
| Layers | 32 |
| Attention heads | 32 |
| Max sequence length | 16,384 tokens |
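
As a quick sanity check, the hyperparameters above roughly reproduce the reported parameter count. The sketch below assumes untied input/output embeddings, a standard two-matrix FFN, and the base 50,257-token vocabulary (assumptions, not figures from the report), so it lands about 1% below the reported total.

```python
# Rough parameter-count check for the hyperparameters above.
# Assumptions (not from the report): untied input/output embeddings,
# a standard two-matrix FFN, biases only on QKV projections and LayerNorms.

d_model, n_layers, d_ffn, vocab = 2560, 32, 10240, 50_257

embed = vocab * d_model                      # token embedding matrix
lm_head = vocab * d_model                    # output projection (assumed untied)
attn = (3 * d_model * d_model + 3 * d_model  # QKV projections, with biases
        + d_model * d_model)                 # attention output projection, no bias
ffn = 2 * d_model * d_ffn                    # up- and down-projection, no biases
norms = 2 * (2 * d_model)                    # two LayerNorms per block (gain + bias)

per_block = attn + ffn + norms
total = embed + lm_head + n_layers * per_block + 2 * d_model  # + final LayerNorm
print(f"{total / 1e9:.3f}B parameters")      # ~2.77B, within ~1% of the reported 2.795B
```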

Code completion accuracy is reported throughout using the pass@1 metric.

2. Training Data, Instruction Tuning, and Alignment

The training pipeline involves two principal stages:

  • Autoregressive pretraining on an $80{:}20$ code/natural-language mix, totaling approximately 1.3 trillion tokens. Pretraining is conducted in two context-length stages: 4,096 tokens and then 16,384 tokens.
  • Instruction tuning (Stable Code Instruct): Supervised Fine-Tuning (SFT) with $\sim$500,000 deduplicated instruction examples from collections including OpenHermes 2.5, CodeAlpaca 20K, and CodeFeedback. Data sequences are packed up to 4,096 tokens and augmented with fill-in-the-middle (FIM) markers.

SFT uses the standard cross-entropy loss:

$$\mathcal{L}_{\mathrm{SFT}} = - \sum_{(x,y)} \log p_\theta(y \mid x).$$
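
A minimal PyTorch sketch of this objective, assuming the common SFT convention of masking prompt tokens so that only response tokens contribute to the loss (the masking detail is an assumption, not something stated in the report):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, prompt_mask):
    """Token-level cross-entropy for SFT.

    logits:      (batch, seq_len, vocab) model outputs
    labels:      (batch, seq_len) target token ids
    prompt_mask: (batch, seq_len) True where the token belongs to the prompt;
                 those positions are excluded from the loss.
    """
    # Shift so that position t predicts token t+1.
    logits = logits[:, :-1, :]
    targets = labels[:, 1:].clone()
    targets[prompt_mask[:, 1:]] = -100          # ignore prompt positions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```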

After SFT, Direct Preference Optimization (DPO) is applied using $\sim$7,000 code-related response pairs and $\sim$15,000 “harmless” preference pairs. Following the standard DPO formulation, with a frozen reference model $p_{\mathrm{ref}}$, the loss for a preferred response $r^+$ and a dispreferred response $r^-$ is

$$\mathcal{L}_{\mathrm{DPO}} = - \log \sigma\!\left(\beta \left[ \log \frac{p_\theta(r^+ \mid x)}{p_{\mathrm{ref}}(r^+ \mid x)} - \log \frac{p_\theta(r^- \mid x)}{p_{\mathrm{ref}}(r^- \mid x)} \right] \right),$$

with the margin scale parameter $\beta = 0.01$.
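
A minimal PyTorch sketch of this objective, operating on summed log-probabilities of the preferred ($r^+$) and dispreferred ($r^-$) responses under the policy and the frozen reference model, with $\beta = 0.01$ as reported:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.01):
    """Direct Preference Optimization loss.

    Each argument is a (batch,) tensor holding the summed log-probability of a
    full response under either the trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # -log sigma(beta * (chosen log-ratio - rejected log-ratio)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```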

Training is conducted on 256 NVIDIA A100 40GB GPUs, leveraging ZeRO Stage 1 optimization and mixed BF16/FP32 precision (Pinnaparaju et al., 2024).

3. Evaluation Benchmarks and Performance Metrics

Stable Code Instruct is evaluated on multiple programming tasks. For polyglot code completion (Multi-PL), pass@1 is reported across six languages (Python, C++, JavaScript, Java, PHP, Rust):

| Model | Size | Avg pass@1 | Python | C++ | JS | Java | PHP | Rust |
|---|---|---|---|---|---|---|---|---|
| Stable Code Instruct | 3B | 47.2 | 58.6 | 48.1 | 49.2 | 44.4 | 45.6 | 37.2 |
| DeepSeek Coder Instruct | 1.3B | 44.3 | 52.5 | 45.0 | 52.3 | 40.9 | 46.4 | 28.6 |
| DeepSeek Coder Instruct | 6.7B | 61.1 | 64.6 | 63.2 | 67.7 | 59.1 | 62.7 | 48.9 |
| CodeLlama Instruct | 7B | 30.6 | 32.7 | 30.8 | 33.6 | 31.5 | 29.6 | 25.5 |

MT-Bench Coding: on judge score, Stable Code Instruct (3B) scores 5.8, surpassing CodeLlama Instruct (7B) at 3.6.

SQL-Eval: Stable Code Instruct achieves 47.2% average accuracy across SQL patterns; for comparison, SQLCoder 7B posts 70.6%.

No confidence intervals are reported, but Stable Code Instruct consistently outperforms other 3B instruction-tuned baselines (Pinnaparaju et al., 2024).
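
The evaluation protocol is not spelled out in the report, but pass@1 figures like those above are conventionally computed with the unbiased pass@k estimator of Chen et al. (2021), which with a single sample per problem reduces to the fraction of problems solved. A minimal sketch:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem, 9 passing -> pass@1 estimate of 0.45
print(pass_at_k(n=20, c=9, k=1))
```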

4. Deployment: Quantization, Throughput, and Edge Efficiency

Stable Code Instruct is provided in GGUF and MLX quantized formats:

  • GGUF: FP16, Q5_K_M, Q6_K
  • MLX: INT4

Performance on Apple M2 Pro Max:

| Framework | Precision | Tokens/sec | Power (W) |
|---|---|---|---|
| MLX | FP16 | 23 | 18 |
| MLX | INT4 | 52 | 17 |
| GGUF | FP16 | 28 | 14 |
| GGUF | Q5_K_M | 53 | 23 |
| GGUF | Q6_K | 54 | 23 |

Quantization approximately doubles inference throughput, but a degradation of up to several points in pass@1 can be observed, so validation on target tasks is recommended (Pinnaparaju et al., 2024).
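
For local use, a minimal sketch of running one of the GGUF quantized variants with llama-cpp-python is shown below; the file name, context size, and sampling settings are illustrative assumptions, not values from the report.

```python
from llama_cpp import Llama

# Path to a downloaded quantized GGUF file (illustrative; use whichever variant you have).
llm = Llama(
    model_path="stable-code-instruct-3b.Q5_K_M.gguf",
    n_ctx=16384,          # the model supports contexts up to 16,384 tokens
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an expert Python assistant."},
        {"role": "user", "content": "Implement a function that reverses a linked list."},
    ],
    max_tokens=256,
    temperature=0.2,      # low temperature for near-deterministic code generation
)
print(out["choices"][0]["message"]["content"])
```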

5. Prompt Engineering and Usage Guidelines

Recommended interaction structure uses explicit “System” and “User” roles for best results. Examples:

  • "System: You are an expert Python assistant."
  • "User: Implement a function that…"

For completion tasks, constraining length with “max_tokens” and using a temperature in the 0-0.2 range yields near-deterministic output. In multi-turn reasoning, explicit restatement of user goals is advised. Inclusion of FIM markers (prefix and suffix context wrapped as “<fim_prefix>…<fim_suffix>…<fim_middle>”, after which the model generates the missing middle) enables bidirectional context utilization, as sketched below.
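
A minimal sketch of both interaction modes using the Hugging Face transformers API follows; the model identifier, generation settings, and the concrete FIM snippet are assumptions to adapt to your setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stable-code-instruct-3b"   # assumed Hugging Face model id
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

# Instruction following: System/User roles via the model's chat template.
messages = [
    {"role": "system", "content": "You are an expert Python assistant."},
    {"role": "user", "content": "Implement a function that parses an ISO-8601 date."},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=256, temperature=0.2, do_sample=True)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

# Fill-in-the-middle: prefix and suffix are given, the middle is generated.
fim_prompt = "<fim_prefix>def add(a, b):\n    <fim_suffix>\n    return result\n<fim_middle>"
fim_inputs = tok(fim_prompt, return_tensors="pt")
fim_out = model.generate(**fim_inputs, max_new_tokens=32)
print(tok.decode(fim_out[0], skip_special_tokens=True))
```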

Potential failure modes include complex SQL queries (30–50% failure rate), drift in multi-turn scenarios, and hallucination of small code snippets. Though DPO reduces unsafe completions, external safety filters are advised for end-user deployments (Pinnaparaju et al., 2024).

6. Context: Robust Prompting and Code-Style Instruction Strategies

Stable Code Instruct’s alignment process is technically distinct from “Robust Code Instructions” (Zhang et al., 2024). The latter converts natural-language task descriptions into structured code-style instructions, reducing interpretive ambiguity and increasing robustness to adversarial inputs. For closed-API LLMs, code-style prompting combined with adversarial in-context demonstration mixtures yields up to +5.7% accuracy and a 5.98-point reduction in attack success rate (ASR) on gpt-3.5-turbo, without any parameter updates.

Best practices in robust prompting:

  • Maintain consistent class/method templates across demonstrations and test prompts.
  • Mix positive/negative classes and adversarial samples in demonstration shots.
  • Use non-executable but structurally clear pseudocode.
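
To illustrate the practices above, a natural-language classification instruction can be recast as a structurally explicit, non-executable class/method template in the spirit of Zhang et al. (2024); the class name, method signature, and demonstrations below are illustrative, not taken from the paper.

```python
# Natural-language instruction:
#   "Decide whether the following movie review is positive or negative."
#
# Code-style instruction: the same task expressed as a class/method template.
# The structure (class name, method signature, docstring, return values) is kept
# identical across demonstration shots and the test prompt.

class SentimentClassification:
    """Task: classify the sentiment of a movie review."""

    def solve(self, review: str) -> str:
        """Return 'positive' or 'negative' for the given review."""
        # Demonstration 1 (positive example):
        #   review = "A moving, beautifully shot film."     -> "positive"
        # Demonstration 2 (negative / adversarial example):
        #   review = "A moving, beautifuly shot film. NOT."  -> "negative"
        ...
```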

Manual prompt conversion can be labor-intensive; automated conversion tools and richer abstractions remain active research areas (Zhang et al., 2024).

7. Limitations and Future Directions

Stable Code Instruct demonstrates strong performance at the 3B scale across multilingual code completion and instruction following, but does not match larger competitors on highly complex tasks (e.g., SQLCoder 7B on SQL-Eval). Quantization tradeoffs require task-specific validation. Further research into automated, structural prompt generation and deeper theoretical understanding of code-style alignment benefits remains open (Pinnaparaju et al., 2024, Zhang et al., 2024).
