Nemotron-4 15B: Open-Source Multilingual LLM
- Nemotron-4 15B is a multilingual, decoder-only Transformer model with 15.6B parameters and innovations like rotary position embeddings and squared ReLU activation.
- It is trained on an 8-trillion-token corpus with extensive preprocessing, uses a SentencePiece BPE tokenizer, and covers a diverse set of natural and programming languages.
- The model achieves competitive performance on multilingual, reasoning, and coding benchmarks, and serves as the teacher for compact variants derived via structured pruning and knowledge distillation, while remaining practical to fine-tune on common GPUs.
Nemotron-4 15B is a high-capacity, open-weight, multilingual LLM comprising approximately 15.6 billion parameters, designed to serve as both a general-purpose foundation model and a teacher for deriving compact model variants via structured compression. Developed by NVIDIA and released in 2024, Nemotron-4 15B is distinguished by its competitive or state-of-the-art performance across English, code, and multilingual benchmarks, efficient architecture, and its explicit support for model family compression via pruning and knowledge distillation (Parmar et al., 2024, Muralidharan et al., 2024).
1. Model Architecture
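Nemotron-4 15B is a decoder-only Transformer employing causal masking and grouped-query attention (GQA). Its key architectural hyperparameters are: 32 layers ($L = 32$), model dimension $d_{\text{model}} = 6144$, 48 attention heads ($h = 48$), per-head dimension $d_{\text{head}} = 128$, and MLP intermediate dimension $d_{\text{ff}} = 24576$. The vocabulary size is $|V| = 256{,}000$. Total parameters, dominated by the attention and feedforward layers, reach approximately $15.6$ billion. With $h_{\text{kv}} = 8$ key-value heads under GQA, a two-matrix MLP, and untied embeddings, the count is approximately:

$$N \approx 2\,|V|\,d_{\text{model}} + L\left(2\,d_{\text{model}}^{2} + 2\,d_{\text{model}}\,h_{\text{kv}}\,d_{\text{head}} + 2\,d_{\text{model}}\,d_{\text{ff}}\right) \approx 15.6 \times 10^{9}$$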
Innovations include rotary position embeddings (RoPE) for improved extrapolation, squared ReLU activation in MLP blocks, untied input/output embeddings, and explicit removal of bias and dropout terms. The architecture is designed for high single-GPU throughput, enabling inference and finetuning on commonly available accelerators such as A100 and H100 (Parmar et al., 2024).
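As a rough cross-check of the 15.6B figure, the sketch below counts weight-matrix parameters for a bias-free decoder with GQA, a two-matrix squared-ReLU MLP, and untied input/output embeddings. The hidden size (6144), MLP width (24576), and number of key-value heads (8) are assumptions taken from the published model configuration rather than from the text above, and normalization parameters are ignored, so the result is an approximation.

```python
def count_params(n_layers=32, d_model=6144, n_heads=48, n_kv_heads=8,
                 d_ff=24576, vocab_size=256_000):
    """Approximate weight count for a bias-free GQA decoder with untied embeddings.

    d_model, d_ff and n_kv_heads are assumed values (see lead-in).
    """
    d_head = d_model // n_heads
    # Query and output projections are d_model x d_model; key and value
    # projections are shared across query groups, so they are narrower.
    attn = 2 * d_model * d_model + 2 * d_model * n_kv_heads * d_head
    # Squared-ReLU MLP without gating: one up-projection, one down-projection.
    mlp = 2 * d_model * d_ff
    # Untied embeddings: separate input and output matrices.
    embedding = 2 * vocab_size * d_model
    non_embedding = n_layers * (attn + mlp)
    return {
        "d_head": d_head,
        "embedding": embedding,
        "non_embedding": non_embedding,
        "total": embedding + non_embedding,
    }


counts = count_params()
print(f"total ~ {counts['total'] / 1e9:.2f}B parameters")
```

Under these assumptions the embeddings account for roughly a fifth of the total, which is why the large 256,000-token vocabulary is a noticeable part of the parameter budget.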
2. Data, Preprocessing, and Tokenization
Nemotron-4 15B was trained on a curated 8-trillion-token corpus, with a blend of approximately 70% English, 15% source code, and 15% multilingual natural language text. The multilingual component encompasses 53 natural languages (ranging from high- to low-resource) and the code data covers 43 programming languages. Data preprocessing includes document-level deduplication, language-model-based quality filtering, and heuristic filtering. Tokenization uses a SentencePiece BPE model with 256,000 tokens; non-English content is upsampled during tokenizer training to increase subword coverage for diverse scripts and languages. The tokenizer preserves whitespace, splits numbers into digits, and supports byte-level backoff, enhancing coverage for unknown or rare sequences (Parmar et al., 2024).
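Two of the steps above, document-level deduplication and digit splitting, can be illustrated with a minimal sketch. The helper names, the whitespace and case normalization, and the use of SHA-256 are illustrative assumptions, not the pipeline actually used for Nemotron-4; the real tokenizer is a trained SentencePiece model, and the regular expression here only mimics its digit-splitting and whitespace-preserving behaviour.

```python
import hashlib
import re


def dedup_documents(docs):
    """Keep the first occurrence of each document, comparing normalized text by hash."""
    seen = set()
    unique = []
    for doc in docs:
        # Hypothetical normalization: lowercase and collapse whitespace runs.
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def split_digits(text):
    """Pre-split text so that every digit is its own piece and whitespace is kept."""
    # Three alternatives partition the input: a single digit, a run of
    # non-digit non-space characters, or a run of whitespace.
    return re.findall(r"\d|[^\d\s]+|\s+", text)


docs = ["Hello  world", "hello world", "Version 42 released"]
print(dedup_documents(docs))
print(split_digits("Version 42 released"))
```

Because the three regex alternatives cover every character exactly once, joining the pieces reproduces the input, which is the property a whitespace-preserving tokenizer needs.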
3. Training Regimen and Optimization
Model pretraining employed 384 DGX H100 nodes (3072 H100 GPUs) using combined tensor and data parallelism (up to 8-way tensor and 288-way data parallel). Optimizer state was sharded across data-parallel replicas, following a Megatron-style setup. Training took approximately 13 wall-clock days; for reference, the peak bfloat16 throughput of a single H100 GPU is 989 TFLOP/s. The loss function is next-token cross-entropy:

$$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)$$
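For concreteness, the next-token objective can be sketched in plain Python. The function averages the negative log-likelihood of each target token under a softmax over that position's logits. The function names and toy inputs are illustrative only; an actual training loop would compute this with a fused kernel on the accelerator.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def next_token_cross_entropy(logits_per_position, targets):
    """Mean of -log p(x_t | x_<t); logits_per_position[t] scores the token at position t."""
    if len(logits_per_position) != len(targets):
        raise ValueError("need exactly one target token id per position")
    nll = [-log_softmax(logits)[target]
           for logits, target in zip(logits_per_position, targets)]
    return sum(nll) / len(nll)


# Toy example: vocabulary of 4 tokens, sequence of 2 positions.
toy_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]]
toy_targets = [0, 3]
print(next_token_cross_entropy(toy_logits, toy_targets))
```

A useful sanity check is that uniform logits over a vocabulary of size V give a loss of log V, which is the value an untrained model should start near.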
The Adam optimizer is used, and the learning rate is scheduled with linear warmup followed by cosine decay. After the initial 8T-token pretraining phase, a continued-training phase with reweighted data sources and a steeper learning rate decay is employed to emphasize high-quality and aligned data (Parmar et al., 2024).
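A linear-warmup, cosine-decay schedule of the kind described can be written in a few lines. This is a generic sketch: the peak and minimum learning rates, warmup length, and step count below are hypothetical placeholders, not the values used to train Nemotron-4 15B.

```python
import math


def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate with linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to show the shape of the curve.
for step in (0, 99, 100, 550, 1000):
    print(step, lr_at_step(step, max_lr=3e-4, min_lr=3e-5,
                           warmup_steps=100, total_steps=1000))
```

A steeper decay for a continued-training phase, as mentioned above, would correspond to a shorter `total_steps` or a lower `min_lr` in this parameterization.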
4. Structured Pruning and Knowledge Distillation
Nemotron-4 15B is the progenitor of the Minitron pruned-variant family, derived via structured pruning and knowledge distillation (Muralidharan et al., 2024). The approach prunes along four axes: depth (layers), attention heads, MLP neurons, and embedding channels. Elements along each axis are ranked by forward-pass activation scores or block importance scores:
- Head/Neuron/Embedding importance: activation norm sums across a 1024-sample calibration set
- Layer (depth) importance: either “remove-and-measure-PPL” or Block Importance (BI)
Residual information from pruned attention heads is redistributed into surviving projections.
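The two kinds of scores listed above can be sketched as follows. Two assumptions are made here and are not confirmed by the text: neuron importance aggregates absolute activations by summing over calibration samples, and Block Importance is read as one minus the mean cosine similarity between a layer's input and output hidden states. All function names are hypothetical.

```python
import math


def neuron_importance(activations):
    """Sum of absolute activations per neuron over a calibration set.

    activations: list of samples, each a list with one value per neuron.
    """
    n_neurons = len(activations[0])
    return [sum(abs(sample[j]) for sample in activations) for j in range(n_neurons)]


def block_importance(layer_inputs, layer_outputs):
    """1 - mean cosine similarity between each token's input and output state.

    A layer that barely changes its input scores near 0 and is a pruning candidate.
    Assumes no hidden state is the zero vector.
    """
    sims = []
    for x, y in zip(layer_inputs, layer_outputs):
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        sims.append(dot / norm)
    return 1.0 - sum(sims) / len(sims)


def keep_top_k(scores, k):
    """Indices of the k highest-scoring elements, returned in original order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


calibration = [[1.0, -2.0, 0.0], [3.0, 0.0, 0.0]]
scores = neuron_importance(calibration)
print(scores, keep_top_k(scores, 2))
```

In practice the surviving indices would be used to slice the corresponding rows and columns of the weight matrices; that step is omitted here.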
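Following pruning, lightweight retraining via teacher-student knowledge distillation is performed, employing less than 3% of the original pretraining data. The primary distillation objective operates on the softmax of the logits, minimizing the KL divergence between the teacher and student token distributions at temperature $\tau$:

$$\mathcal{L}_{\text{logits}} = \frac{1}{T}\sum_{t=1}^{T} \mathrm{KL}\left(p^{\text{teacher}}_{\tau}(\cdot \mid x_{<t}) \,\|\, p^{\text{student}}_{\tau}(\cdot \mid x_{<t})\right)$$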
Intermediate-state distillation losses may be added if depth reduction is extreme, but are omitted for moderate pruning. During architecture search, each candidate receives 1.8B tokens of lightweight retraining (400 gradient steps) before candidates are compared. This procedure yields 8B and 4B parameter models using up to 40x fewer training tokens per model than training from scratch, with 1.8x compute savings across the full model family (Muralidharan et al., 2024).
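The logit-based distillation objective described above can be sketched as a per-position KL divergence between temperature-scaled teacher and student distributions. The choice of forward KL (teacher relative to student), the default temperature of 1, and the function names are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def logit_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean over sequence positions of KL(teacher || student)."""
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


teacher = [[2.0, 1.0, 0.0], [0.5, 0.5, 3.0]]
student = [[1.5, 1.0, 0.5], [0.0, 1.0, 2.0]]
print(logit_distillation_loss(teacher, student))
```

The loss is zero only when the student reproduces the teacher's distribution at every position, which gives the student a richer training signal than a one-hot next-token label.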
5. Evaluation and Benchmark Performance
Nemotron-4 15B matches or exceeds state-of-the-art performance among open models of comparable size, especially in multilingual and reasoning tasks. It is evaluated on commonsense reasoning (SIQA, ARC-easy/challenge, PIQA, Winogrande, HellaSwag), aggregate benchmarks (MMLU, BBH), mathematics (GSM8K), coding (HumanEval, MBPP, MultiPL-E), and multilingual tasks (XCOPA, TyDiQA-GoldP, MGSM, FLORES-101). Key benchmark results:
| Benchmark | Nemotron-4 15B | Strongest comparison model |
|---|---|---|
| Winogrande (5-shot) | 83.6% | 80.7% (Llama-2 13B) |
| ARC-Challenge | 58.8% | 54.5% (Llama-2 34B) |
| MMLU (5-shot) | 66.6% | 66.3% (QWEN 14B) |
| HellaSwag (10-shot) | 84.6% | 83.3% (Llama-2 34B) |
| HumanEval (0-shot) | 31.6% | 32.3% (Gemma 7B) |
| MBPP (0-shot) | 38% | 44.4% (Gemma 7B) |
| XCOPA (0/4-shot) | 59.5%/68.9% | 55.6%/61.4% (XGLM) |
| TyDiQA (1-shot) | 50.5% | 45.7% (PaLM 62B-cont) |
| MGSM (8-shot) | 41.3% | 32.0% (PaLM 62B-cont) |
The model’s multilingual proficiency is notable, setting the leading score on FLORES spBLEU (23.2 vs. 16.1) and outperforming larger or specialized models on diverse language families (Parmar et al., 2024).
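To make the margins in the table above easier to read, the following sketch computes the difference in percentage points between Nemotron-4 15B and the comparison model for each single-score row. The numbers are copied from the table; XCOPA is left out because it reports two settings in one cell. The variable and function names are illustrative.

```python
# (benchmark, Nemotron-4 15B score, comparison score), copied from the table.
RESULTS = [
    ("Winogrande", 83.6, 80.7),
    ("ARC-Challenge", 58.8, 54.5),
    ("MMLU", 66.6, 66.3),
    ("HellaSwag", 84.6, 83.3),
    ("HumanEval", 31.6, 32.3),
    ("MBPP", 38.0, 44.4),
    ("TyDiQA", 50.5, 45.7),
    ("MGSM", 41.3, 32.0),
]


def margins(results):
    """Percentage-point margin of Nemotron-4 15B over the comparison model."""
    return {name: round(ours - theirs, 1) for name, ours, theirs in results}


deltas = margins(RESULTS)
trailing = [name for name, delta in deltas.items() if delta < 0]
for name, delta in deltas.items():
    print(f"{name:14s} {delta:+.1f}")
print("trails on:", trailing)
```

On these rows the model trails only on the two Python coding benchmarks, while the largest lead is on multilingual math (MGSM).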
6. Model Release, Integration, and Usability
Nemotron-4 15B and its pruned Minitron variants are released with open weights at https://huggingface.co/nvidia. Compression and retraining scripts, along with illustrative examples, are maintained at https://github.com/NVlabs/Minitron. Integration is streamlined via Hugging Face Transformers, with minimal code to load and run the model:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model identifier as given in this article.
model_id = "nvidia/nemotron-4-15b"

# Load the tokenizer and half-precision weights; device_map="auto" places
# layers on the available devices (requires the accelerate package).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = "Explain the difference between pruning and distillation."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Nucleus sampling with top_p=0.9, generating up to 100 new tokens.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
7. Significance, Limitations, and Ongoing Directions
Nemotron-4 15B serves as an archetype for data-scale–optimal LLMs at the 15B parameter scale, demonstrating that competitive or superior performance on multilingual, reasoning, and code tasks can be achieved via extensive, diverse data and efficient Transformer implementations. It is the basis for highly data- and compute-efficient model family creation via structured pruning and distillation, with released variants matching or surpassing community models (Llama-2/3 8B, Mistral 7B, Gemma 7B) while using up to 40x fewer training tokens than training from scratch.
Current limitations include somewhat lower performance on mathematical reasoning compared to recent fine-tuned competitors (e.g., QWEN) and slightly lower code pass@1 scores versus specialized models (e.g., Gemma on Python tasks). No toxicity or harm analysis is reported. Future work encompasses sparse MoE architectures, reinforcement learning from human feedback (RLHF), extended context windows, and further domain adaptation (Parmar et al., 2024, Muralidharan et al., 2024).