ProLong-8B: Extended Context Language Model
- ProLong-8B is a long-context language model that extends Llama-3-8B-Instruct, achieving state-of-the-art results among 8B-parameter models at context lengths up to 128K tokens and supporting inputs of up to 512K tokens.
- Its training regimen combines 40 billion tokens of continued pre-training with a retuned RoPE frequency base and cross-document attention masking, improving both long-context reasoning and recall.
- Inference at the longest context lengths relies on sequence parallelism across high-memory GPUs.
ProLong-8B is a long-context LLM developed through continued pre-training and supervised fine-tuning (SFT), with an explicit focus on robust long-context reasoning, context-window extrapolation, and practical scalability. It builds on Llama-3-8B-Instruct, with the data mixture, training sequence length, position encoding, and evaluation regimen each tuned to maximize performance on both standard and long-context benchmarks. ProLong-8B achieves state-of-the-art results among 8B-parameter models for context lengths up to 128K and can process up to 512K tokens, one of the longest context windows available in publicly released LMs (Gao et al., 2024).
1. Model Architecture and Position Encoding
ProLong-8B inherits the transformer configuration of Llama-3-8B-Instruct: roughly 8 billion parameters across 32 transformer layers, a hidden dimension of 4096, 32 attention heads with grouped-query attention (8 key-value heads), and SwiGLU feed-forward activations. Attention is fully dense, with no sparsity or local windowing.
For robust long-context support, ProLong-8B uses cross-document attention masking, which blocks attention across document boundaries when multiple documents are packed into a single training sequence. This prevents unrelated documents from influencing one another and improves both short- and long-context performance.
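The masking rule can be illustrated with a small sketch. It builds a dense boolean mask for a packed sequence, which is only practical for toy sizes; the function name and the list-of-lists representation are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def cross_document_causal_mask(doc_lengths: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    A token attends causally (j <= i) and only to tokens of its own document.
    """
    # Label every position in the packed sequence with its document index.
    doc_ids: List[int] = []
    for doc_index, length in enumerate(doc_lengths):
        doc_ids.extend([doc_index] * length)

    total = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(total)]
        for i in range(total)
    ]


if __name__ == "__main__":
    # Two documents of lengths 3 and 2 packed into one 5-token sequence.
    for row in cross_document_causal_mask([3, 2]):
        print("".join("x" if allowed else "." for allowed in row))
```

The result is block-diagonal and lower-triangular within each block: the first token of the second document sees only itself, not the preceding document. At training scale a dense mask of this kind would not be materialized; variable-length attention kernels that take document boundaries directly are the usual alternative.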
A central component is position extrapolation using Rotary Position Embeddings (RoPE) with a frequency base tuned for the target context length. Through continued training, the original base $b = 5 \times 10^{5}$ is replaced by $8 \times 10^{6}$ for 64K and $1.28 \times 10^{8}$ for 512K. This tuning follows the dynamic NTK heuristic $b' = b \cdot s^{d/(d-2)}$, where $s$ is the ratio of the target context length to the original one and $d$ is the attention head dimension. An empirical sweep over $b'$ yielded these values as optimal for the respective maximum context lengths.
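As a rough illustration of how the RoPE base controls positional resolution, the sketch below evaluates the commonly cited NTK-aware rule $b' = b \cdot s^{d/(d-2)}$ and the wavelength of the slowest-rotating dimension pair. The starting base of 500,000, the head dimension of 128, and the original 8K context length are taken from the public Llama-3-8B configuration and are assumptions here; the function names are illustrative.

```python
import math
from typing import List


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """NTK-aware estimate: base * scale ** (head_dim / (head_dim - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


def rope_inverse_frequencies(base: float, head_dim: int) -> List[float]:
    """Per-pair rotation frequencies theta_i = base ** (-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(base: float, head_dim: int) -> float:
    """Wavelength, in tokens, of the slowest-rotating dimension pair."""
    slowest = rope_inverse_frequencies(base, head_dim)[-1]
    return 2.0 * math.pi / slowest


if __name__ == "__main__":
    original_base = 500_000.0  # assumed Llama-3 RoPE base
    head_dim = 128             # assumed: hidden size 4096 / 32 heads
    for target in (65_536, 524_288):
        scale = target / 8_192  # assumed original context length of 8K
        estimate = ntk_scaled_base(original_base, scale, head_dim)
        print(f"{target:>7} tokens: NTK estimate {estimate:.3e}, "
              f"longest wavelength {longest_wavelength(estimate, head_dim):.3e}")
```

A larger base lengthens every wavelength, so distant positions remain distinguishable. The heuristic gives only a starting estimate; the bases actually used were selected by an empirical sweep.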
2. Continued Pre-training Regimen
Continued pre-training was conducted over 40 billion tokens, structured in a two-stage curriculum:
- Stage 1: 20B tokens at sequence length 64K
- Stage 2: 20B tokens at sequence length 512K
A carefully balanced data mixture is used, summarized as follows (by token count):
| Data Source | Proportion | Notable Details |
|---|---|---|
| GitHub Repos | 30% | All files per repo concatenated as long documents |
| Books | 30% | |
| Textbooks | 3% | LibreTexts, all at 512K in Stage 2 |
| "ShortMix" | 37% | FineWeb-Edu (27%), FineWeb (27%), Tulu-v2 (11%), StackExchange (11%), Wikipedia (8%), OpenWebMath (8%), arXiv papers (8%), packed up to 64K |
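To make the table easier to read, the following sketch expands ShortMix into its components as shares of the full training mixture. It assumes the percentages inside the ShortMix row are fractions of ShortMix itself (they sum to 100), not of the whole corpus; the dictionary and function names are illustrative.

```python
# Top-level mixture and ShortMix composition, as listed in the table (percent).
MIXTURE = {"GitHub Repos": 30, "Books": 30, "Textbooks": 3, "ShortMix": 37}
SHORTMIX = {
    "FineWeb-Edu": 27, "FineWeb": 27, "Tulu-v2": 11, "StackExchange": 11,
    "Wikipedia": 8, "OpenWebMath": 8, "arXiv papers": 8,
}


def overall_shares(mixture, shortmix):
    """Expand ShortMix into its components as percentages of all tokens."""
    shares = {k: float(v) for k, v in mixture.items() if k != "ShortMix"}
    for name, pct in shortmix.items():
        shares[name] = mixture["ShortMix"] * pct / 100.0
    return shares


if __name__ == "__main__":
    # Both levels of the mixture should account for every token.
    assert sum(MIXTURE.values()) == 100
    assert sum(SHORTMIX.values()) == 100
    ranked = sorted(overall_shares(MIXTURE, SHORTMIX).items(), key=lambda kv: -kv[1])
    for name, pct in ranked:
        print(f"{name:<14} {pct:5.2f}%")
```

Under this reading, FineWeb-Edu and FineWeb each contribute just under 10% of all training tokens, while code repositories and books together make up 60%.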
A sequence-length curriculum is also applied in Stage 2: code repository samples are split equally between 64K and 512K, books are 17% at 512K and 83% at 64K, and textbooks are exclusively at 512K. The model is trained with next-token cross-entropy (MLE) loss using the AdamW optimizer (weight_decay=0.1, $\beta_1 = 0.9$, $\beta_2 = 0.95$), a peak learning rate of $1 \times 10^{-5}$ (10% warmup, cosine decay to $1 \times 10^{-6}$ in each stage), and global batch sizes of 4M tokens in Stage 1 and 8M in Stage 2. Cross-document attention masks prevent spurious long-range mixing, and variable-length attention with minibatch reordering improves GPU throughput (~12% gain).
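The warmup-plus-cosine schedule described above can be sketched as follows. The function name, the linear shape of the warmup, and the numeric values in the example are illustrative assumptions; in particular, 4M is read as 4 × 10^6 tokens when deriving the step count.

```python
import math


def learning_rate(step: int, total_steps: int, peak: float, floor: float,
                  warmup_fraction: float) -> float:
    """Linear warmup to `peak`, then cosine decay to `floor`."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up linearly so the last warmup step reaches the peak.
        return peak * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Stage 1 under the stated assumptions: 20B tokens at 4M tokens per step.
    total_steps = 20_000_000_000 // 4_000_000
    for step in (0, 499, 500, 2_750, total_steps):
        lr = learning_rate(step, total_steps, peak=1e-5, floor=1e-6,
                           warmup_fraction=0.10)
        print(f"step {step:>5}: lr = {lr:.3e}")
```

With these assumptions Stage 1 spans 5,000 optimizer steps and Stage 2, at twice the batch size, spans 2,500; the schedule is restarted for each stage.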
3. Supervised Fine-Tuning (SFT) and Instruction Tuning
The SFT phase uses 1 billion tokens of UltraChat, a synthetic multi-turn dialogue dataset generated with ChatGPT (average length ~1.2K tokens, maximum 4.1K tokens). Augmenting it with synthetically generated long-instruction QA, RAG, and summarization examples (up to 50% of the SFT mix) was ablated and gave no performance gain; UltraChat alone performed best.
For SFT, AdamW is again used (weight_decay=0.1, $\beta_1 = 0.9$, $\beta_2 = 0.95$), with a peak learning rate of $2 \times 10^{-5}$ (5% warmup, cosine decay to $2 \times 10^{-6}$) and a batch size of 4M tokens. Individual SFT examples are effectively bounded at about 4K tokens by the length of the UltraChat dialogues.
4. Context Length, Extrapolation, and Inference
ProLong-8B is trained at sequence lengths of 64K and 512K tokens and evaluated at 32K, 64K, 128K, and 512K, with a supported maximum window of 512K tokens at inference, enabled via sequence parallelism. Empirical findings indicate that training at lengths greater than the intended target (e.g., 512K for a 64K evaluation window) consistently improves long-context task performance, at the price of higher computational cost.
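A back-of-the-envelope estimate shows why 512K-token inference needs large or sharded GPU memory. The sketch below computes the key-value cache size for one sequence, assuming the public Llama-3-8B configuration (32 layers, 8 key-value heads, head dimension 128), 16-bit cache entries, and 512K read as 524,288 tokens. These are assumptions, not figures reported for ProLong-8B.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 32, num_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


if __name__ == "__main__":
    for label, tokens in (("64K", 64 * 1024), ("128K", 128 * 1024),
                          ("512K", 512 * 1024)):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{label:>4}: {gib:.0f} GiB of KV cache")
```

Under these assumptions the cache costs 128 KiB per token, or 8, 16, and 64 GiB at 64K, 128K, and 512K tokens. Added to roughly 16 GB of 16-bit weights and the activation memory, a 512K-token sequence leaves little headroom on a single 80 GB accelerator, which is consistent with splitting the sequence across devices.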
5. Long-Context Evaluation and Benchmarks
The primary long-context evaluation uses the HELMET benchmark suite (Yen et al., 2024), run after SFT. It covers six task categories:
- Recall (JSON key–value retrieval)
- RAG (QA over retrieved Wikipedia passages: NaturalQuestions, HotPotQA, PopQA)
- Re-ranking (MSMARCO, nDCG@10)
- In-context learning (five classification tasks: TREC coarse, TREC fine, NLU, Banking77, Clinc-150)
- QA (NarrativeQA books, GPT-4o scoring)
- Summarization (Multi-LexSum legal documents, GPT-4o precision/recall)
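For readers unfamiliar with the re-ranking metric listed above, here is a minimal sketch of nDCG@10. It uses linear gain with a log2 rank discount and computes the ideal ranking from the same list of judged candidates; whether HELMET uses this exact variant is not stated here, so treat the gain function as an assumption.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the model's ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Relevance labels of the passages in the order the model ranked them.
    print(ndcg_at_k([3, 2, 1, 0]))  # already ideally ordered
    print(ndcg_at_k([0, 1, 2, 3]))  # reversed ordering scores lower
```

The score is 1.0 only when passages appear in non-increasing order of relevance, and misplacing a relevant passage near the top costs more than misplacing it further down.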
Pre-SFT short-context evaluations include HellaSwag, MMLU, ARC-Challenge, WinoGrande, and GSM8K.
Key results (averaged over 32K/64K/128K, after SFT):
| Model | Long-Context Avg | Recall (%) | RAG (%) | ICL (%) | Re-rank (nDCG@10) | QA (GPT-4o) | Summarization (GPT-4o) |
|---|---|---|---|---|---|---|---|
| ProLong-8B (512K window) | 60.2 | 99.4 | 66.0 | 81.1 | 33.2 | 40.8 | 40.5 |
| Llama-3.1-8B-Instruct | 59.0 | — | — | — | — | — | — |
| MegaBeam-Mistral-7B | 56.5 | — | — | — | — | — | — |
| Llama-3.1-70B | 63.7 | — | — | — | — | — | — |
| GPT-4o | 70.1 | — | — | — | — | — | — |
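As a consistency check on the table, the long-context average for ProLong-8B can be reproduced as an unweighted mean of its six category scores. That the average is unweighted is an assumption that the numbers happen to support; the names below are illustrative.

```python
# Per-category scores for ProLong-8B, copied from the table above.
PROLONG_SCORES = {
    "Recall": 99.4, "RAG": 66.0, "ICL": 81.1,
    "Re-rank": 33.2, "QA": 40.8, "Summarization": 40.5,
}


def macro_average(scores):
    """Unweighted mean over task categories."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    print(f"{macro_average(PROLONG_SCORES):.1f}")  # prints 60.2
```

Because each category counts equally, the near-saturated recall score and the much lower re-ranking score pull the average in opposite directions by similar amounts.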
In a 512K stress test, ProLong-8B scores 31.7 / 43.7 / 46.7 / 49.7 on NarrativeQA at 32K / 64K / 128K / 512K, and 40.4 / 39.8 / 41.5 / 42.1 on Multi-LexSum summarization at the same lengths.
6. Principal Design Insights
- Combining code repositories and books constitutes the strongest source of long-context information, but inclusion of approximately 40% high-quality short-context data ("ShortMix") is critical to preserve and recover base-model capabilities.
- Training at or beyond the target evaluation sequence length (e.g., 512K even for 64K tasks) yields consistent performance gains for long-context reasoning.
- SFT on short-form datasets (UltraChat) suffices for effective long-context downstream task generalization; synthetic long prompts did not confer additional benefit.
- Adjusting the RoPE base is essential for position extrapolation; retaining the original base degrades performance, even with extended continued training.
- Perplexity and "needle-in-a-haystack" tasks are inadequate proxies for long-context ability; realistic, diverse evaluation after SFT is a far more reliable measure of model improvement.
7. Practical Considerations and Deployment
- Compute requirements: Continued training consumed approximately 2,200 H100-GPU-hours (64K, 20B tokens) and 12,200 H100-GPU-hours (512K, 20B tokens); SFT required approximately 500 H100-GPU-hours (1B tokens UltraChat).
- Inference: Very long-context inference (up to 512K tokens) demands high-memory A100/H100 GPUs or sequence-parallel inference across several devices (e.g., DeepSpeed).
- Practitioner recommendations: Evaluate models post-SFT on diverse tasks, balance long (code+books) and short (ShortMix) data (approximately 60:40), use RoPE bases tuned for target sequence lengths, adopt cross-document attention masking, apply sequence parallelism and minibatch reordering, and restrict SFT to high-quality short-form dialog where possible.
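The compute figures in the list above can be turned into a total budget and a rough per-GPU throughput. The sketch only restates the reported numbers; the names are illustrative, and the throughput is an average that ignores any overhead not counted in the reported GPU-hours.

```python
RUNS = {
    # name: (tokens processed, H100 GPU-hours), as reported in the list above
    "continued pre-training, 64K": (20e9, 2_200),
    "continued pre-training, 512K": (20e9, 12_200),
    "SFT, UltraChat": (1e9, 500),
}


def tokens_per_gpu_second(tokens: float, gpu_hours: float) -> float:
    """Average training throughput per GPU, in tokens per second."""
    return tokens / (gpu_hours * 3600.0)


if __name__ == "__main__":
    total_hours = sum(hours for _, hours in RUNS.values())
    print(f"total: {total_hours:,} H100 GPU-hours")
    for name, (tokens, hours) in RUNS.items():
        rate = tokens_per_gpu_second(tokens, hours)
        print(f"{name:<30} {rate:8.0f} tokens/s per GPU")
```

The three phases sum to 14,900 H100 GPU-hours, and the 512K stage costs about 5.5 times as much as the 64K stage for the same number of tokens, reflecting the cost of full attention over longer sequences.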
The ProLong-8B recipe and model weights are publicly available, facilitating direct evaluation and further experimentation (Gao et al., 2024).