Stable Code Instruct
- Stable Code Instruct is a code language model featuring a 32-layer Transformer with 2.8B parameters, optimized for following human instructions in software engineering tasks.
- It is trained via autoregressive pretraining on 1.3 trillion tokens, followed by instruction tuning on over 500K code examples to enhance multi-turn dialogue and code completion accuracy.
- Deployed with quantized variants for accelerated inference on edge devices, it delivers state-of-the-art performance across diverse programming tasks and benchmarks.
Stable Code Instruct is an open-source instruction-tuned code LLM based on the Stable Code architecture, designed to follow human instructions in chat interfaces for software engineering tasks. It integrates a 32-layer Transformer backbone with 2.8 billion parameters, supports a context length of up to 16,384 tokens, and is optimized for code completion, reasoning, multi-turn dialogue, and SQL generation. The model distinguishes itself by extensive pretraining on a large code corpus, followed by targeted alignment through supervised fine-tuning and Direct Preference Optimization (DPO). It is released with quantized variants for accelerated inference and low-power edge deployment, and demonstrates state-of-the-art performance in its parameter class across several code-centric benchmarks (Pinnaparaju et al., 2024).
1. Architecture and Technical Specifications
Stable Code Instruct inherits its architecture directly from Stable Code 3B. The model comprises:
- 32 Transformer layers with a hidden dimension of 2560.
- 32 attention heads per layer; rotary position embeddings are applied to the first 25% of head dimensions, with the rotary base chosen to support the extended 16,384-token context.
- LayerNorm with learned gain and bias; the feed-forward (FFN) dimension follows the Stable Code 3B configuration.
- Biases removed from all feed-forward and attention output projections, except for the QKV projections.
- Tokenizer: GPT-NeoX BPE with a vocabulary size of 50,257, plus special code tokens (StarCoder markers, FIM tokens).
| Parameter | Value |
|---|---|
| Parameters | 2.8B |
| Hidden size | 2560 |
| Layers | 32 |
| Attention heads | 32 |
| Max sequence length (tokens) | 16,384 |
Note: code completion accuracy is reported using the pass@1 metric.
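As an illustration of these specifications, the configuration can be inspected programmatically. The sketch below uses Hugging Face Transformers; the model identifier `stabilityai/stable-code-instruct-3b` is assumed and should be verified against the actual checkpoint.

```python
# Sketch: inspect the architecture hyperparameters of the checkpoint.
# The model id "stabilityai/stable-code-instruct-3b" is assumed; substitute
# a local path if the checkpoint is stored elsewhere.
from transformers import AutoConfig, AutoTokenizer

model_id = "stabilityai/stable-code-instruct-3b"  # assumed identifier
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Expected values per the table above: 32 layers, hidden size 2560,
# 32 attention heads, and a maximum context of 16,384 tokens.
print("layers:", config.num_hidden_layers)
print("hidden size:", config.hidden_size)
print("attention heads:", config.num_attention_heads)
print("max positions:", getattr(config, "max_position_embeddings", None))
print("tokenizer size:", len(tokenizer))
```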
2. Training Data, Instruction Tuning, and Alignment
The training pipeline involves two principal stages:
- Autoregressive pretraining on a code/natural-language mix totaling approximately 1.3 trillion tokens. Pretraining is conducted in two context-length stages: 4,096 tokens and then 16,384 tokens.
- Instruction tuning (Stable Code Instruct): Supervised Fine-Tuning (SFT) with deduplicated instruction examples from collections including OpenHermes 2.5, CodeAlpaca 20K, and CodeFeedback. Data sequences are packed up to 4,096 tokens and augmented with fill-in-the-middle (FIM) markers, as illustrated in the sketch after this list.
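The sketch below assembles a single FIM-augmented training string in the StarCoder-style prefix/suffix/middle layout. The exact packing implementation is not published, so this is a simplified assumption for illustration only.

```python
# Sketch: build one FIM-augmented training string (simplified assumption;
# the report does not publish the exact packing implementation).
import random

def make_fim_example(code: str, rng: random.Random) -> str:
    # Split the document at two random points into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # StarCoder-style PSM layout: the model learns to emit the middle
    # segment after the <fim_middle> sentinel.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

example = make_fim_example("def add(a, b):\n    return a + b\n", random.Random(0))
print(example)
```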
SFT uses the standard cross-entropy (next-token) objective, $\mathcal{L}_{\text{SFT}} = -\sum_{t} \log p_\theta(y_t \mid y_{<t}, x)$, where $x$ is the instruction prompt and $y$ the target response.
After SFT, Direct Preference Optimization (DPO) is applied using code-related response pairs and “harmless” preference pairs. The DPO loss is defined as $\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\big)\big]$, where $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen SFT reference policy, and $\beta$ is the margin scale parameter.
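A minimal PyTorch sketch of this objective is given below. It assumes per-sequence log-probabilities have already been computed under the policy and the frozen reference model, and the `beta` value is a placeholder, not a reported training hyperparameter.

```python
# Sketch: DPO loss over a batch of preference pairs (assumes precomputed
# sequence log-probabilities; beta is a placeholder margin scale value).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio of policy vs. reference for preferred and rejected responses.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen_ratio - rejected_ratio)), averaged over the batch.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with fake log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```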
Training is conducted on 256 NVIDIA A100 40GB GPUs, leveraging ZeRO Stage 1 optimization and mixed BF16/FP32 precision (Pinnaparaju et al., 2024).
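The exact distributed-training configuration is not reproduced in the report excerpted here; the following is a minimal, hypothetical DeepSpeed-style configuration consistent with the described setup (ZeRO Stage 1, BF16 mixed precision), with placeholder batch-size and clipping values.

```python
# Hypothetical DeepSpeed configuration sketch matching the described setup
# (ZeRO Stage 1, BF16 mixed precision); numeric values are placeholders.
deepspeed_config = {
    "zero_optimization": {"stage": 1},    # ZeRO Stage 1: shard optimizer states
    "bf16": {"enabled": True},            # BF16 compute with FP32 master weights
    "gradient_clipping": 1.0,             # placeholder value
    "train_micro_batch_size_per_gpu": 4,  # placeholder value
    "gradient_accumulation_steps": 1,     # placeholder value
}
```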
3. Evaluation Benchmarks and Performance Metrics
Stable Code Instruct is evaluated on multiple programming tasks. For polyglot code completion (Multi-PL), the pass@1 metric is reported across six languages (Python, C++, JavaScript, Java, PHP, Rust):
| Model | Size | Avg pass@1 | Python | C++ | JS | Java | PHP | Rust |
|---|---|---|---|---|---|---|---|---|
| Stable Code Instruct | 3B | 47.2 | 58.6 | 48.1 | 49.2 | 44.4 | 45.6 | 37.2 |
| DeepSeek Coder Ins | 1.3B | 44.3 | 52.5 | 45.0 | 52.3 | 40.9 | 46.4 | 28.6 |
| DeepSeek Coder Ins | 6.7B | 61.1 | 64.6 | 63.2 | 67.7 | 59.1 | 62.7 | 48.9 |
| CodeLlama Instruct | 7B | 30.6 | 32.7 | 30.8 | 33.6 | 31.5 | 29.6 | 25.5 |
MT-Bench Coding: on the judged coding category, Stable Code Instruct (3B) scores 5.8, surpassing CodeLlama Instruct (7B) at 3.6.
SQL-Eval: Stable Code Instruct achieves 47.2% average accuracy across SQL patterns; for comparison, SQLCoder 7B posts 70.6%.
No confidence intervals are reported, but Stable Code Instruct consistently outperforms other 3B instruction-tuned baselines (Pinnaparaju et al., 2024).
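For reference, the pass@1 numbers above follow the standard pass@k methodology used for code benchmarks. The sketch below shows the commonly used unbiased estimator; with a single greedy sample per problem, pass@1 reduces to the fraction of problems whose completion passes the unit tests.

```python
# Sketch: pass@k estimation for code-completion benchmarks.
# n = samples generated per problem, c = samples that pass the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one greedy sample per problem (n = 1, k = 1), pass@1 is simply the
# solved fraction.
results = [1, 0, 1, 1, 0]  # toy per-problem pass/fail flags
print(sum(results) / len(results))   # 0.6
print(pass_at_k(n=20, c=7, k=1))     # estimator form when n samples are drawn
```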
4. Deployment: Quantization, Throughput, and Edge Efficiency
Stable Code Instruct is provided in GGUF and MLX quantized formats:
- GGUF: FP16, Q5_K_M, Q6_K
- MLX: INT4
Performance on Apple M2 Pro Max:
| Framework | Precision | Tokens/sec | Power (W) |
|---|---|---|---|
| MLX | FP16 | 23 | 18 |
| MLX | INT4 | 52 | 17 |
| GGUF | FP16 | 28 | 14 |
| GGUF | Q5_K_M | 53 | 23 |
| GGUF | Q6_K | 54 | 23 |
Quantization approximately doubles inference throughput, at the cost of up to several points of pass@1 degradation, so validation on the target tasks is recommended (Pinnaparaju et al., 2024).
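As a usage sketch, a quantized GGUF variant can be run with the llama-cpp-python bindings; the file name below is a placeholder, and the generation parameters are assumptions rather than values from the report.

```python
# Sketch: running a quantized GGUF variant with llama-cpp-python.
# The file path is a placeholder; Q5_K_M / Q6_K are the variants listed above.
from llama_cpp import Llama

llm = Llama(
    model_path="stable-code-instruct-3b.Q5_K_M.gguf",  # placeholder path
    n_ctx=4096,  # context window for this session (model supports up to 16,384)
)

prompt = "Write a Python function that reverses a string."
out = llm(prompt, max_tokens=128, temperature=0.2)
print(out["choices"][0]["text"])
```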
5. Prompt Engineering and Usage Guidelines
Recommended interaction structure uses explicit “System” and “User” roles for best results. Examples:
- "System: You are an expert Python assistant."
- "User: Implement a function that…"
For completion tasks, constraining output length with `max_tokens` and using a temperature of 0–0.2 yields near-deterministic output. In multi-turn reasoning, explicit restatement of user goals is advised. Inclusion of FIM markers (`<fim_prefix>…<fim_suffix>…<fim_middle>`, with the model generating the missing middle segment after the final marker) enables bidirectional context utilization.
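A minimal generation sketch following these guidelines is shown below. It assumes the Hugging Face model id `stabilityai/stable-code-instruct-3b` and a bundled chat template; both should be verified against the actual checkpoint.

```python
# Sketch: system/user chat prompting with low-temperature decoding.
# The model id and the presence of a chat template are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stable-code-instruct-3b"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are an expert Python assistant."},
    {"role": "user", "content": "Implement a function that checks whether a string is a palindrome."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

# Low temperature (0-0.2) for near-deterministic completions, bounded length.
output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```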
Potential failure modes include complex SQL queries (30–50% failure rate), drift in multi-turn scenarios, and hallucination even in small code snippets. Although DPO reduces unsafe completions, external safety filters are advised for end-user deployments (Pinnaparaju et al., 2024).
6. Context: Robust Prompting and Code-Style Instruction Strategies
Stable Code Instruct’s alignment process is technically distinct from “Robust Code Instructions” (Zhang et al., 2024). The latter converts natural-language task descriptions into structural code-style instructions, reducing interpretive ambiguity and increasing robustness to adversarial inputs. For closed-API LLMs, code-style prompting combined with adversarial in-context demonstration mixtures yields up to +5.7% accuracy and a 5.98-point reduction in attack success rate (ASR) on gpt-3.5-turbo, without any parameter updates.
Best practices in robust prompting:
- Maintain consistent class/method templates across demonstrations and test prompts.
- Mix positive/negative classes and adversarial samples in demonstration shots.
- Use non-executable but structurally clear pseudocode.
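For illustration, the sketch below converts a natural-language classification instruction into a code-style prompt of the kind described above; the class and method names are illustrative placeholders, not the exact templates from Zhang et al. (2024).

```python
# Sketch: a code-style (non-executable) instruction prompt for a sentiment task.
# Class/method names are illustrative placeholders, not the original templates.
CODE_STYLE_PROMPT = '''
class SentimentClassifier:
    def classify(self, sentence: str) -> str:
        """Return "positive" or "negative" for the given sentence."""

# Demonstrations (keep the same template across shots and the test input):
classifier = SentimentClassifier()
classifier.classify("The film was a delight from start to finish.")   # -> "positive"
classifier.classify("The plot was predictable and the acting flat.")  # -> "negative"

# Test input:
classifier.classify("{test_sentence}")  # ->
'''

print(CODE_STYLE_PROMPT.format(test_sentence="An uneven but ultimately rewarding watch."))
```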
Manual prompt conversion can be labor-intensive; automated conversion tools and richer abstractions remain ongoing research areas (Zhang et al., 2024).
7. Limitations and Future Directions
Stable Code Instruct demonstrates strong performance at the 3B scale across multilingual code completion and instruction following, but does not match larger competitors on highly complex tasks (e.g., SQLCoder 7B on SQL-Eval). Quantization tradeoffs require task-specific validation. Further research into automated, structural prompt generation and deeper theoretical understanding of code-style alignment benefits remains open (Pinnaparaju et al., 2024, Zhang et al., 2024).