IBM Granite-3.3-8b-Instruct Model
- IBM Granite-3.3-8b-Instruct is an 8-billion-parameter decoder-only Transformer model tuned for code generation, editing, explanation, and multimodal inference.
- It employs advanced techniques such as rotary positional embeddings for extended contexts up to 128K tokens, grouped-query attention, and efficient mixed-precision inference.
- Pretraining on trillions of tokens followed by extensive instruction tuning, together with integration of text and speech components, ensures robust performance in enterprise AI and research applications.
IBM Granite-3.3-8b-Instruct is an 8-billion-parameter, decoder-only Transformer-based LLM developed by IBM Research. Released open-source under the Apache 2.0 license, it serves as the instruction-tuned flagship in the Granite series for code intelligence and forms the backbone of multi-modal and high-efficiency inference systems. The model is optimized for code-oriented tasks, including code generation, editing, explanation, and repository-level understanding, and it remains effective across both short and extremely long contexts up to 128K tokens. Granite-3.3-8b-Instruct also underpins speech-aware variants with advanced automatic speech recognition (ASR) and speech translation capability, and has been vertically integrated into specialized inference accelerator platforms for low-latency, high-throughput enterprise deployment (Mishra et al., 7 May 2024, Stallone et al., 18 Jul 2024, Saon et al., 13 May 2025, Debole et al., 20 Nov 2025).
1. Model Architecture and Variants
Granite-3.3-8b-Instruct employs a standard decoder-only Transformer stack with ≈8 billion learnable parameters. The core architecture implements the following (a configuration sketch follows the list):
- 32–36 Transformer blocks (exact count depends on release variant and paper; e.g., 36 for code models (Mishra et al., 7 May 2024), 32 for speech-aware (Saon et al., 13 May 2025))
- Model (hidden) dimension of 4,096, with a feed-forward inner dimension of 14,336 or 16,384 depending on variant
- 32 attention heads with grouped-query attention (GQA), utilizing 8 key-value heads per group
- Rotary positional embeddings (RoPE) with support for both short (4K) and long (128K) context lengths
- SwiGLU or GELU activations and RMSNorm or LayerNorm, as per variant
- Byte-pair encoding (BPE) tokenizer, vocabulary size 49,152
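A minimal configuration sketch of this decoder stack, assuming transformers-style field names; the defaults reflect the ranges listed above (the 4,096 hidden size is implied by the 32-head layout), and exact values vary by release.

```python
from dataclasses import dataclass

# Illustrative configuration of the Granite-3.3-8b decoder stack; field names follow
# transformers-style conventions and defaults mirror the ranges listed above.
@dataclass
class GraniteConfigSketch:
    num_hidden_layers: int = 36          # 32-36 depending on release variant
    hidden_size: int = 4096              # model dimension
    intermediate_size: int = 14336       # 16,384 in some variants
    num_attention_heads: int = 32
    num_key_value_heads: int = 8         # grouped-query attention (GQA)
    rope_theta: float = 10_000.0         # raised for long-context variants
    max_position_embeddings: int = 4096  # 131,072 for the 128K variant
    vocab_size: int = 49152              # BPE tokenizer
    hidden_act: str = "swiglu"           # GELU in some variants
    norm_type: str = "rmsnorm"           # LayerNorm in some variants
```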
Instruction tuning does not alter the architecture; "instruct" variants are pure finetunes of the respective base models. For the 128K-token variant, no extra layers or adapters are introduced; raising the RoPE base frequency together with long-context pretraining alone enables the extended context (Stallone et al., 18 Jul 2024).
The acoustic extension in Granite-speech-3.3-8B-Instruct appends a Conformer-based encoder (10 blocks, block attention) for speech inputs, a windowed query Transformer ("Q-former"), and LoRA adapters injected into the attention projections for efficient speech-text alignment. Text-only operation always reverts to the original Granite-3.3-8b-Instruct weights, preserving instruction-tuned capabilities and model safety (Saon et al., 13 May 2025).
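A high-level sketch of how these components compose at inference time; the function and argument names below are illustrative placeholders, not the released API.

```python
# Illustrative composition of the speech-aware variant described above; component
# and argument names are placeholders, not the released API.
def speech_instruct_sketch(audio_features, text_prompt, conformer, q_former, granite_llm):
    acoustic_states = conformer(audio_features)    # 10-block Conformer encoder (block attention)
    speech_embeddings = q_former(acoustic_states)  # windowed query Transformer -> LLM embedding space
    # LoRA adapters on the attention projections handle speech-text alignment;
    # text-only requests skip this path and use the unmodified Granite weights.
    return granite_llm.generate(prefix_embeddings=speech_embeddings, prompt=text_prompt)
```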
2. Training Data, Objectives, and Instruction Tuning
Pretraining of Granite-3.3-8b-Instruct covers ≈4 trillion code tokens from 116 programming languages, followed by ≈500 billion tokens from an 80/20 code-to-natural-language mixture to boost reasoning abilities. Data curation applies StarCoder-inspired quality and deduplication filters and HAP/PII redaction, and excludes malware (Mishra et al., 7 May 2024).
The pretraining loss combines causal language modeling (CLM) and fill-in-the-middle (FIM) objectives, switching to pure CLM for instruction tuning. Training uses AdamW with carefully tuned hyperparameters, large effective batch sizes, and hardware-efficient components such as FlashAttention 2, sequence and pipeline parallelism, activation checkpointing, and BF16 precision.
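One common way to realize the FIM objective is a document-level transform that rearranges each sample into prefix/suffix/middle order before ordinary next-token training; the sketch below uses illustrative sentinel tokens and an assumed FIM rate, not the published configuration.

```python
import random

# Sketch of a fill-in-the-middle (FIM) document transform; the sentinel tokens and
# the 0.5 FIM rate are illustrative assumptions.
def fim_transform(doc: str, fim_rate: float = 0.5) -> str:
    if len(doc) < 2 or random.random() > fim_rate:
        return doc  # keep as a plain causal-LM (CLM) sample
    i, j = sorted(random.sample(range(len(doc)), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # Prefix-Suffix-Middle ordering: the model learns to infill the middle span.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```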
Instruction-tuning is performed on a blend of:
- Human Git commit-message ↔ code diffs (CommitPackFT, 92 languages)
- Math instruction datasets (MathInstruct, MetaMathQA)
- Code assistant dialogues (Glaive-Code-Assistant-v3, Self-OSS-Instruct-SC2)
- Synthetic and real code-based instruction data (NL2SQL, API-BLEND)
- General natural language instructions (HelpSteer, OpenPlatypus)
For long-context (128K) instruction-tuning, repository-level "file packing" and length up-sampling create multi-turn, long-range Q&A covering class/function comprehension and synthesis tasks, balancing short- and long-context samples (≈50/50 mix). Learning rates and batch sizes are set to maximize context exposure per update (Stallone et al., 18 Jul 2024).
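A minimal sketch of that balanced sampling, assuming two pre-built example pools; the 0.5 probability reflects the stated ≈50/50 mix and everything else is illustrative.

```python
import random

# Balanced short-/long-context instruction sampling; p_long = 0.5 matches the
# stated ~50/50 mix, the pool structure is an assumption of this sketch.
def sample_instruction_example(short_pool: list[dict], long_pool: list[dict], p_long: float = 0.5) -> dict:
    pool = long_pool if random.random() < p_long else short_pool
    return random.choice(pool)
```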
3. Context Scaling: 4K to 128K Tokens
To support up to 128K-token contexts, Granite-3.3-8b-Instruct undergoes a multi-stage continual pretraining schedule (sketched in code after the list below):
- RoPE base frequencies are progressively increased over five stages as the target context grows from 8K to 128K tokens, each stage running 500 training steps at batch size 32.
- Repository-level file packing concatenates all files in a repo into single, long training documents.
- Sequences shorter than 4K are downsampled; lengthy documents are upsampled to encourage long-context learning.
- Multi-turn synthetic instructions drive effective optimization for retrieval, understanding, and completion over extreme context windows.
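The schedule can be sketched as a loop that raises the RoPE base while lengthening training sequences; the theta values below are placeholders rather than the published numbers, while the 500 steps and batch size 32 per stage follow the description above.

```python
# Sketch of the five-stage context-extension schedule; rope_theta values are
# illustrative placeholders, steps-per-stage and batch size follow the text above.
STAGES = [
    # (target sequence length, RoPE base theta -- placeholder values)
    (8_192, 1.0e5),
    (16_384, 5.0e5),
    (32_768, 2.0e6),
    (65_536, 5.0e6),
    (131_072, 1.0e7),
]

def continual_pretrain_long_context(model, train_step_fn):
    for seq_len, rope_theta in STAGES:
        model.config.rope_theta = rope_theta  # raise the RoPE base frequency
        for _ in range(500):                  # 500 training steps per stage
            train_step_fn(model, seq_len=seq_len, batch_size=32)
```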
Performance degrades sharply for models lacking frequency adjustment or repository packing. Incorporating synthetic long-context Q&A during instruction tuning is pivotal for retrieval and comprehension tasks above 4K tokens (Stallone et al., 18 Jul 2024).
4. Evaluation and Benchmarking
Granite-3.3-8b-Instruct exhibits competitive and, in some settings, state-of-the-art performance among open-source 7B/8B LLMs, especially on code-centric evaluations. Key metrics include pass@1 (functional code generation), exact match (EM) and edit similarity (ES), and fuzzy/execution-based code correctness.
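pass@1 is typically computed either from greedy decoding or with the standard unbiased pass@k estimator over n sampled completions per problem (c of which pass the unit tests); the estimator is sketched below for reference.

```python
from math import comb

# Unbiased pass@k estimator: with n samples per problem and c passing,
# pass@k = 1 - C(n - c, k) / C(n, k).
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 completions sampled, 9 pass the tests -> pass@1 = 0.45
print(pass_at_k(20, 9, 1))
```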
Short-context (≤4K) benchmarks:
| Benchmark | Metric | Instruct 8B Result (%) |
|---|---|---|
| HumanEval (Synth) | pass@1 | 49.6 |
| MBPP (Python) | pass@1 | 49.6 |
| RepoBench (Python/Java) | EM / ES | 31.8 / 69.5 (Py), 38.4 / 76.4 (Java) |
| SantaCoder-FIM (Infilling) | Exact Match | 76.6 |
| HumanEvalPack (Explain/Fix) | pass@1 | 40.9 / 40.4 |
| CanItEdit (Python Edit) | pass@1 | 39.7 (descriptive) |
| GSM8K+Py (Math Reasoning) | acc. | 63.1 |
Long-context (up to 128K):
- Long code completion (balanced) EM: up to 57.4% at 32K tokens; 44.5% on RepoBench-P at 32K
- RepoQA @ 16K context: 68.0% (vs 10.0% for 4K)
- Key-retrieval (needle-in-haystack): Near-perfect up to 128K tokens
- Minimal (<1%) performance loss on short-context tasks for 128K variant (Stallone et al., 18 Jul 2024)
Speech mode evaluation (ASR, AST):
Granite-speech-3.3-8B achieves WER of 1.5% (LibriSpeech clean), 7.0% (CommonVoice), with BLEU on speech translation within 5 points of state-of-the-art specialist models, outperforming closed-source systems on some English ASR corpora (Saon et al., 13 May 2025).
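For context, word error rate is word-level edit distance (substitutions, deletions, insertions) normalized by reference length; a standard computation is sketched below.

```python
# Standard word error rate (WER): word-level edit distance divided by the number
# of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```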
5. Inference, Quantization, and Efficiency
IBM Granite-3.3-8b-Instruct is designed for both standard GPU and domain-specific accelerator deployments:
GPU:
- FlashAttention 2, padding-free transformers, and activation checkpointing enable 4096-token context at practical inference speeds on 80GB A100/H100 GPUs (Mishra et al., 7 May 2024).
- Supports mixed-precision inference (BF16) and 4-bit quantization flows (ORCA, vLLM) for further resource reduction (a minimal loading sketch follows this list).
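A minimal BF16 loading sketch with Hugging Face transformers follows; the checkpoint identifier is assumed to follow IBM's public Hub naming and should be verified before use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub identifier; confirm the exact name on Hugging Face before use.
model_id = "ibm-granite/granite-3.3-8b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```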
NorthPole Accelerator (vertical integration):
- Deployed across 288 NorthPole PCIe cards in 18 standard 2U servers within a 42U rack, delivering 115 peta-ops/s (4-bit) and 3.7 PB/s on-chip bandwidth at ≈30kW (air cooled, 0.67 m², 730 kg).
- All weights and KV caches fit on-chip, eliminating DRAM bottlenecks; inference runs in A8-C8-W4 (activations and caches: 8-bit; weights: 4-bit).
- SiLQ quantization-aware fine-tuning recovers full bfloat16 accuracy (≤0.4-point accuracy loss across 19 code and language benchmarks).
- 3 simultaneous 8B model instances per rack at 2K context, each serving 28 users with an inter-token latency of 2.8 ms/token and a time-to-first-token of 64.8 ms (a throughput estimate from these figures follows the list).
- Scales linearly across model sizes: up to 18× 3B models per rack, 1× 70B per rack, or 2 racks for a 120B model (Debole et al., 20 Nov 2025).
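A back-of-envelope throughput estimate from the figures above, assuming the 28 user streams per instance decode concurrently (an assumption of this sketch, not a published aggregate):

```python
# Back-of-envelope rack throughput from the figures above; assumes fully
# overlapped decoding across the 28 user streams per instance.
instances_per_rack = 3
users_per_instance = 28
inter_token_latency_s = 2.8e-3

tokens_per_s_per_instance = users_per_instance / inter_token_latency_s  # = 10,000 tok/s
tokens_per_s_per_rack = instances_per_rack * tokens_per_s_per_instance  # = 30,000 tok/s
print(f"{tokens_per_s_per_rack:,.0f} tokens/s per rack (estimate)")
```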
This architecture is expressly designed for agentic enterprise AI and fine-grained, multi-user inference in constrained data center envelopes.
6. Licensing, Release, and Best Practices
Granite-3.3-8b-Instruct, together with its code, long-context, and speech-aware derivatives, is distributed under Apache 2.0, enabling research and commercial use with a permissive patent grant and open access to weights, code, and data pointers. Training data is curated for licensing compliance, PII/HAP redaction, and model trustworthiness per IBM AI ethics policies (Mishra et al., 7 May 2024, Stallone et al., 18 Jul 2024, Saon et al., 13 May 2025).
Prompting best practices recommend using explicit <|system|>, <|user|>, and <|assistant|> delimiters (or commit-boundary tokens), providing few-shot exemplars, and including type signatures when eliciting code behavior, following established instruction-tuning methodologies.
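An illustrative prompt layout following those delimiters; exact special tokens can differ between Granite releases, so the tokenizer's built-in chat template is preferable in practice.

```python
# Illustrative prompt using the delimiters described above; the exact special
# tokens may differ by release, so prefer tokenizer.apply_chat_template in practice.
prompt = (
    "<|system|>\nYou are a helpful coding assistant.\n"
    "<|user|>\nComplete this function:\n"
    "def is_palindrome(s: str) -> bool:\n"
    '    """Return True if s reads the same forwards and backwards."""\n'
    "<|assistant|>\n"
)
```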
A plausible implication is that this model can serve as a drop-in replacement for both generalist and specialist 8B code assistants in research and enterprise environments requiring high code coverage, open licensing, and scalable deployment with minimal adaptation effort.
7. Multi-modality and Extensions
Granite-3.3-8b-Instruct is the backbone of the IBM Granite-speech-3.3-8B, a speech-aware LLM integrating a Conformer encoder, Q-former adapter, and LoRA injection. It operates natively in text or speech modes, achieving advanced ASR and AST without sacrificing the code LLM’s instruction-following or functional safety in text-only operations. All subcomponents are modular and open source, with model checkpoints available on HuggingFace for immediate use in academic and commercial projects (Saon et al., 13 May 2025).
This suggests the Granite-3.3-8b-Instruct model family covers a broad spectrum of tasks from software engineering assistance to cross-modal intelligence, supporting both standard and exotic deployment scenarios while maintaining open licensing and best-in-class inference efficiency.