ProLong-8B: Extended Context Language Model

Updated 2 March 2026

ProLong-8B is a long-context language model that extends Llama-3-8B-Instruct, achieving state-of-the-art results among 8B-parameter models at context lengths up to 128K tokens and supporting inputs of up to 512K tokens.
Its training regimen combines 40 billion tokens of continued pre-training with RoPE base tuning and cross-document attention masking, improving both long-context reasoning and recall.
Inference at the longest context lengths relies on sequence parallelism across high-memory GPUs.

ProLong-8B is a long-context LLM developed through continued pre-training and supervised fine-tuning (SFT), with an explicit focus on robust long-context reasoning, context-window extrapolation, and practical scalability. It is built on the Llama-3-8B-Instruct architecture, with the data mixture, training sequence length, position encoding, and evaluation regimen each tuned to maximize performance on both standard and long-context benchmarks. ProLong-8B achieves state-of-the-art results among 8B-parameter models for context lengths up to 128K and can process up to 512K tokens, one of the longest context windows among publicly released LMs (Gao et al., 2024).

1. Model Architecture and Position Encoding

ProLong-8B inherits the transformer configuration of Llama-3-8B-Instruct: 8 billion parameters distributed across 32 transformer layers, each with 32 attention heads (grouped-query attention with 8 key-value heads) and a hidden dimension of 4096, using SwiGLU activations. The architecture employs full dense attention, with no sparsity or local windowing.

For robust long-context support, ProLong-8B uses cross-document attention masking: when multiple documents are concatenated into one training sequence, tokens are blocked from attending across document boundaries. This prevents unwanted information leakage between unrelated documents and improves both short- and long-context performance.
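The masking rule can be illustrated with a small dense mask. This is a minimal sketch rather than the ProLong implementation: the function name `cross_document_causal_mask` is hypothetical, and a production training stack would more likely pass document boundaries to a variable-length attention kernel than materialize a full mask.

```python
def cross_document_causal_mask(doc_lengths):
    """Build a [T, T] boolean mask; True means query i may attend to key j.

    doc_lengths lists the token count of each document packed into the sequence.
    """
    # Document id of every token position, e.g. [0, 0, 0, 1, 1] for lengths [3, 2].
    doc_ids = [doc for doc, n in enumerate(doc_lengths) for _ in range(n)]
    total = len(doc_ids)
    # Allowed only if the key is not in the future (causal) and is in the same document.
    return [
        [(j <= i) and (doc_ids[i] == doc_ids[j]) for j in range(total)]
        for i in range(total)
    ]


mask = cross_document_causal_mask([3, 2])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

For two packed documents of lengths 3 and 2, the result is block-diagonal and lower-triangular within each block, so the first token of the second document sees only itself.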

A central component is position extrapolation using Rotary Position Embeddings (RoPE) with a frequency base tuned for the target context length. During continued training, the original base $b_0 = 5 \times 10^5$ is replaced by $b = 8 \times 10^6$ for 64K and $b = 1.28 \times 10^8$ for 512K. The starting point is the dynamic NTK heuristic $b = b_0 \cdot t^{d/(d-2)}$, where $t = L_{\max}/L_0$ is the ratio of the target to the original context length and $d$ is the attention head dimension; an empirical sweep over $b$ then identified the values above as optimal for the respective maximum context lengths.
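To make the heuristic concrete, the sketch below evaluates $b = b_0 \cdot t^{d/(d-2)}$ for both target lengths. Two inputs are assumptions not stated above: an original context length $L_0$ of 8192 tokens and a head dimension $d$ of 128 (the usual Llama-3-8B values). The function name `ntk_rope_base` is hypothetical.

```python
def ntk_rope_base(b0, original_len, target_len, head_dim):
    """Dynamic NTK heuristic: b = b0 * t^(d / (d - 2)), with t = target_len / original_len."""
    t = target_len / original_len
    return b0 * t ** (head_dim / (head_dim - 2))


B0 = 5e5          # original RoPE base, from the text
L0 = 8192         # assumption: original Llama-3 context length
HEAD_DIM = 128    # assumption: 4096 hidden size / 32 attention heads

# (target length in tokens, base adopted in the text)
for target_len, adopted in [(64 * 1024, 8e6), (512 * 1024, 1.28e8)]:
    heuristic = ntk_rope_base(B0, L0, target_len, HEAD_DIM)
    print(f"{target_len:>7} tokens: heuristic b = {heuristic:.3g}, adopted b = {adopted:.3g}")
```

Under these assumptions the heuristic gives roughly $4.1 \times 10^6$ for 64K and $3.4 \times 10^7$ for 512K, so the adopted bases are larger than the raw heuristic values, which is consistent with the final numbers coming from an empirical sweep.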

2. Continued Pre-training Regimen

Continued pre-training was conducted over 40 billion tokens, structured in a two-stage curriculum:

Stage 1: 20B tokens at sequence length 64K
Stage 2: 20B tokens at sequence length 512K

A carefully balanced data mixture is used, summarized as follows (by token count):

Data Source	Proportion	Notable Details
GitHub Repos	30%	All files per repo concatenated as long documents
Books	30%
Textbooks	3%	LibreTexts, all at 512K in Stage 2
"ShortMix"	37%	Shares within ShortMix: FineWeb-Edu (27%), FineWeb (27%), Tulu-v2 (11%), StackExchange (11%), Wikipedia (8%), OpenWebMath (8%), arXiv papers (8%); packed up to 64K
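As a quick consistency check, the mixture can be written down as plain data. The sketch assumes the ShortMix percentages are shares within ShortMix (they sum to 100) and converts them to shares of the overall token budget; the names `MIXTURE`, `SHORTMIX`, and `overall_shares` are illustrative only.

```python
# Top-level mixture, in percent of all continued pre-training tokens.
MIXTURE = {"GitHub repos": 30, "Books": 30, "Textbooks": 3, "ShortMix": 37}

# Composition of ShortMix, assumed to be percent of ShortMix tokens.
SHORTMIX = {
    "FineWeb-Edu": 27, "FineWeb": 27, "Tulu-v2": 11, "StackExchange": 11,
    "Wikipedia": 8, "OpenWebMath": 8, "arXiv papers": 8,
}


def overall_shares(mixture, shortmix):
    """Convert within-ShortMix percentages to percentages of the full mixture."""
    short_total = mixture["ShortMix"]
    return {name: short_total * pct / 100 for name, pct in shortmix.items()}


print("top-level total:", sum(MIXTURE.values()))
print("ShortMix total:", sum(SHORTMIX.values()))
for name, share in overall_shares(MIXTURE, SHORTMIX).items():
    print(f"{name:<14} {share:5.2f}% of all tokens")
```

Read this way, FineWeb-Edu and FineWeb each contribute about 10% of all tokens, while the long-document sources (code, books, textbooks) account for 63%.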

A sequence-length curriculum is also applied within Stage 2: code repository samples are split equally between 64K and 512K, books are 17% at 512K and 83% at 64K, and textbooks are exclusively at 512K. The model is trained with next-token cross-entropy (MLE) loss using the AdamW optimizer (weight_decay=0.1, $\beta_1=0.9$, $\beta_2=0.95$), a peak learning rate of $1\times10^{-5}$ (10% warmup, cosine decay to $1\times10^{-6}$ in each stage), and global batch sizes of 4M tokens in Stage 1 and 8M tokens in Stage 2. Cross-document attention masks prevent spurious long-range mixing, and variable-length attention with minibatch reordering improves GPU throughput (~12% gain).
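The schedule described above can be sketched as follows. Several details are assumptions: the warmup is taken to be linear, "4M" and "8M" are read as 4e6 and 8e6 tokens, and the step counts are derived from the 20B-token stage budgets rather than reported. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(step, total_steps, peak=1e-5, floor=1e-6, warmup_frac=0.10):
    """Learning rate at a given step: linear warmup (assumed), then cosine decay to floor."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


STAGE_TOKENS = 20e9  # tokens per stage, from the text

for name, batch_tokens in [("Stage 1 (64K)", 4e6), ("Stage 2 (512K)", 8e6)]:
    steps = int(STAGE_TOKENS // batch_tokens)  # derived, not reported
    print(
        f"{name}: {steps} steps, "
        f"lr after warmup = {lr_at(steps // 10, steps):.2e}, "
        f"lr at end = {lr_at(steps, steps):.2e}"
    )
```

With these readings, Stage 1 runs for about 5,000 optimizer steps and Stage 2 for about 2,500, and each stage restarts the warmup and decay cycle.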

3. Supervised Fine-Tuning (SFT) and Instruction Tuning

The SFT phase uses 1 billion tokens of UltraChat, a corpus of machine-generated multi-turn chat transcripts (average length ~1.2K tokens, maximum 4.1K tokens). Augmenting it with synthetic long-instruction QA, RAG, and summarization examples (up to 50% of the data) was ablated and yielded no performance gain; pure UltraChat was optimal.

For SFT, AdamW is again used (weight_decay=0.1, $\beta_1=0.9$, $\beta_2=0.95$), with a peak learning rate of $2\times10^{-5}$ (5% warmup, cosine decay to $2\times10^{-6}$) and a batch size of 4M tokens. No explicit length cap is applied during SFT; sequence lengths are bounded implicitly by the UltraChat dialog lengths, at roughly 4K tokens.

4. Context Length, Extrapolation, and Inference

ProLong-8B is trained and evaluated at context lengths of 32K, 64K, 128K, and 512K tokens, with a supported maximum sequence window of 512K at inference, enabled via sequence parallelism. Empirical findings indicate that training at lengths greater than the intended target (e.g., 512K for a 64K evaluation window) consistently improves long-context task performance, albeit with higher computational cost.

5. Long-Context Evaluation and Benchmarks

The primary long-context evaluation uses the HELMET benchmark suite (Yen et al., 2024) and is run after SFT. It covers six task categories:

Recall (JSON key–value retrieval)
RAG (QA over retrieved Wikipedia passages: NaturalQuestions, HotPotQA, PopQA)
Re-ranking (MSMARCO, nDCG@10)
In-context learning (five classification tasks: TREC-coarse, TREC-fine, NLU, Banking77, Clinc-150)
QA (NarrativeQA books, GPT-4o scoring)
Summarization (Multi-LexSum legal documents, GPT-4o precision/recall)

Pre-SFT short-context evaluations include HellaSwag, MMLU, ARC-Challenge, WinoGrande, and GSM8K.

Key results (averaged over 32K/64K/128K, after SFT):

Model	Long-Context Avg	Recall (%)	RAG (%)	ICL (%)	Re-rank (nDCG@10)	QA (GPT-4o)	Summarization (GPT-4o)
ProLong-8B (512K window)	60.2	99.4	66.0	81.1	33.2	40.8	40.5
Llama-3.1-8B-Instruct	59.0	—	—	—	—	—	—
MegaBeam-Mistral-7B	56.5	—	—	—	—	—	—
Llama-3.1-70B	63.7	—	—	—	—	—	—
GPT-4o	70.1	—	—	—	—	—	—
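The ProLong-8B row can be checked arithmetically. The snippet assumes the Long-Context Avg column is the unweighted mean of the six category scores, which the reported value is consistent with.

```python
# Per-category scores for ProLong-8B (512K window), copied from the table above.
scores = {
    "Recall": 99.4,
    "RAG": 66.0,
    "ICL": 81.1,
    "Re-rank": 33.2,
    "QA": 40.8,
    "Summarization": 40.5,
}

# Unweighted mean over the six HELMET categories (assumed aggregation).
long_context_avg = sum(scores.values()) / len(scores)
print(round(long_context_avg, 1))  # 60.2
```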

In a stress test extending to 512K tokens, ProLong-8B achieves NarrativeQA scores of 31.7 / 43.7 / 46.7 / 49.7 at 32K / 64K / 128K / 512K, and Multi-LexSum summarization scores of 40.4 / 39.8 / 41.5 / 42.1 at the same lengths.

6. Principal Design Insights

Code repositories and books together are the strongest source of long-context information, but including roughly 40% high-quality short-context data ("ShortMix", 37% in the final mixture) is critical to preserve and recover base-model capabilities.
Training at or beyond the target evaluation sequence length (e.g., 512K even for 64K tasks) yields consistent performance gains for long-context reasoning.
SFT on short-form datasets (UltraChat) suffices for effective long-context downstream task generalization; synthetic long prompts did not confer additional benefit.
Adjusting the RoPE base is essential for position extrapolation; retaining the original base degrades performance, even with extended continued training.
Perplexity and "needle-in-a-haystack" tasks are inadequate proxies; a realistic, diversified post-SFT evaluation provides the only reliable measure of long-context model improvement.

7. Practical Considerations and Deployment

Compute requirements: Continued training consumed approximately 2,200 H100-GPU-hours (64K, 20B tokens) and 12,200 H100-GPU-hours (512K, 20B tokens); SFT required approximately 500 H100-GPU-hours (1B tokens UltraChat).
Inference: Very long-context inference (up to 512K tokens) demands high-memory A100/H100 GPUs, typically combined with sequence-parallel inference (e.g., DeepSpeed).
Practitioner recommendations: Evaluate models post-SFT on diverse tasks, balance long (code+books) and short (ShortMix) data (approximately 60:40), use RoPE bases tuned for target sequence lengths, adopt cross-document attention masking, apply sequence parallelism and minibatch reordering, and restrict SFT to high-quality short-form dialog where possible.
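The memory pressure behind the inference requirement can be estimated from the key-value cache alone. This is a back-of-the-envelope sketch under assumptions not stated above: 32 layers, 8 key-value heads of dimension 128 (the usual Llama-3-8B grouped-query layout), 2 bytes per value (bf16), and "K" read as 1024 tokens. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence of num_tokens tokens."""
    # Factor 2 covers keys and values; every layer stores its own cache.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


GIB = 1024 ** 3

for k_tokens in (64, 128, 512):
    num_tokens = k_tokens * 1024
    print(f"{k_tokens:>3}K tokens: {kv_cache_bytes(num_tokens) / GIB:.0f} GiB of KV cache")
```

Under these assumptions a single 512K-token sequence needs about 64 GiB for the cache before counting model weights and activations, which is why the cache is usually sharded across devices at that length.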

The ProLong-8B recipe and model weights are publicly available, facilitating direct evaluation and further experimentation (Gao et al., 2024).

References

Gao et al. (2024). How to Train Long-Context Language Models (Effectively).
