Instella-Long: LLMs & Optical Transients

Updated 20 November 2025
  • Instella-Long names two distinct artifacts: a 128K long-context Transformer model with scaled RoPE positional encoding, and a persistent extragalactic optical transient lasting more than 800 days.
  • In its language model application, it utilizes document-masked full attention and variable-length FlashAttention to efficiently process ultra-long sequences up to 256K tokens.
  • In astrophysics, the Instella-Long transient shows a plateau light curve, slow decline rates, and extreme longevity, distinguishing it from ordinary supernovae.

Instella-Long refers to two distinct, high-salience artifacts in contemporary research: 1) a fully open long-context LLM in the Instella family, and 2) a rare, extremely long-duration extragalactic optical transient (SDF-05M05) classified by its discoverers as a new subclass of astrophysical transient. Both usages share exceptional persistence, whether in temporal optical emission or in sequence context length, and both are leading exemplars in their respective domains (Urata et al., 2012; Liu et al., 13 Nov 2025).

1. Long-Context LLM: Architecture and Modifications

Instella-Long is a 3 billion parameter decoder-only Transformer model derived from Instella-3B, specifically engineered for long-context understanding with a maximum context window of 128K tokens. The architecture retains the 36-layer, 2,560-dimensional hidden state backbone with 32 attention heads per block and 6,912-dimensional SwiGLU-activated feedforward networks. RMSNorm and QK-Norm are applied, along with Rotary Position Embeddings (RoPE), which are crucial for context extension.
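
For orientation, the stated dimensions can be collected into a small configuration object. The sketch below is illustrative only: the field names are hypothetical and do not correspond to the released Instella code, and the per-head dimension (2,560 / 32 = 80) is derived from the figures above rather than quoted.

```python
# Illustrative configuration mirroring the stated dimensions; names are hypothetical.
from dataclasses import dataclass

@dataclass
class InstellaLongConfig:
    n_layers: int = 36              # decoder blocks
    d_model: int = 2560             # hidden-state width
    n_heads: int = 32               # attention heads per block (head_dim = 2560 // 32 = 80)
    d_ffn: int = 6912               # SwiGLU feed-forward width
    max_context: int = 131072       # 128K-token context window
    rope_base: float = 3_691_950.0  # stage-2 RoPE base frequency (see Section 2)
```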

The sole architectural modification distinguishing Instella-Long from its progenitor is the adjustment of the RoPE base frequency, scaled up so that positional encoding remains well defined far beyond the default 4K/8K limit; on the training side, this is combined with document-masked full attention and variable-length FlashAttention over sequences reaching 256K tokens. RoPE scaling modifies token embedding rotations as follows:

$$\theta_k = \frac{1}{B^{2k/d}},$$

where $B$ is the RoPE base frequency, applied at $B \approx 514{,}640$ for a 64K window (stage 1) and $B \approx 3{,}691{,}950$ for up to 256K tokens (stage 2). This enables robust position differentiation across ultra-long sequences (Liu et al., 13 Nov 2025).
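
A minimal sketch of how these scaled frequencies can be computed, assuming the standard RoPE formulation and a per-head dimension of 80 (2,560 hidden units across 32 heads); this is an illustration, not the released implementation:

```python
# Scaled rotary frequencies theta_k = 1 / B**(2k/d) for the two training stages.
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    # Even indices 0, 2, ..., head_dim - 2 give exponents 2k/d in [0, 1).
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return base ** (-exponents)  # one frequency theta_k per rotated coordinate pair

theta_stage1 = rope_inv_freq(head_dim=80, base=514_640.0)    # 64K-token window
theta_stage2 = rope_inv_freq(head_dim=80, base=3_691_950.0)  # up to 256K tokens
```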

2. Context Extension and Training Protocols

Instella-Long achieves long-context processing without altering the core attention mechanism. Extended capacity derives from two principled modifications:

  • RoPE Scaling: Increasing the RoPE base frequency slows the positional angular increments, mitigating aliasing over long token distances and preserving embedding uniqueness out to 128K positions.
  • Document Masking with Full FlashAttention: By restricting attention computation to document boundaries within each packed training sequence, the model avoids cross-document leakage and intractable attention-matrix growth. Variable-length FlashAttention 2 handles this ragged layout efficiently, achieving full attention at practical compute cost; a minimal sketch of the call follows this list.
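
The sketch below assumes the flash-attn 2 variable-length interface (flash_attn_varlen_func); the helper and tensor shapes are illustrative rather than the Instella training code.

```python
# Document-masked full attention over a packed sequence: each document attends
# only within its own span, expressed via cumulative sequence lengths (cu_seqlens).
import torch
from flash_attn import flash_attn_varlen_func  # flash-attn 2 varlen kernel (assumed available)

def packed_document_attention(q, k, v, doc_lens):
    """q, k, v: (total_tokens, n_heads, head_dim), fp16/bf16 on GPU, packed across documents.
    doc_lens: per-document token counts summing to total_tokens."""
    cu_seqlens = torch.zeros(len(doc_lens) + 1, dtype=torch.int32, device=q.device)
    cu_seqlens[1:] = torch.cumsum(
        torch.tensor(doc_lens, dtype=torch.int32, device=q.device), dim=0
    )
    max_len = max(doc_lens)
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_len, max_seqlen_k=max_len,
        causal=True,  # standard decoder-only masking inside each document
    )
```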

Training proceeds in two major continued-pretraining stages (each over 20B tokens), with a data mix emphasizing long-context corpora such as code repositories, books, and textbooks (≈60%), as well as conventional sources (≈40%). Supervised finetuning for instruction-following leverages both short (Ultrachat 200K, OpenMathInstruct-2, MMLU auxiliary) and long (DCLM, ArXiv, synthetic QA from a 14B-parameter Qwen2.5 teacher) contexts, with document masking enforced throughout. Preference alignment is implemented with Direct Preference Optimization (DPO) on 0.76B tokens from the OLMo 2 Preference Mix, using short contexts but conferring improved robustness even for long-context tasks (Liu et al., 13 Nov 2025).
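
Preference alignment uses the standard DPO objective over chosen/rejected response pairs. A minimal sketch of that loss follows, computed from summed per-response log-probabilities; the β value shown is a common default, not necessarily the one used for Instella-Long.

```python
# Direct Preference Optimization loss over preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Inputs are per-example summed log-probabilities of whole responses under the
    trainable policy and the frozen reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```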

3. Evaluation, Benchmarks, and Comparative Analysis

Instella-Long is evaluated chiefly on the HELMET benchmark, which spans multi-value needle-in-a-haystack retrieval (NIAH-MV), retrieval-augmented QA (Natural Questions, TriviaQA, HotpotQA), and long-document QA (InfiniteBench, NarrativeQA) at context lengths up to 128K tokens. Key results:

| Model | Avg. HELMET Score (%) | NIAH-MV (%) | InfiniteBench MC (%) | InfiniteBench QA (%) |
|---|---|---|---|---|
| Instella-Long | 52.7 | 84.0 | 54.0 | 30.7 |
| Llama-3.2-3B-Instruct | 59.2 | 92.0 | 58.8 | 37.2 |
| Phi-3.5-Mini-Instruct | 51.7 | 77.2 | 55.6 | 31.3 |
| Qwen-2.5-3B-Instruct | 41.9 | 64.5 | 36.9 | 14.7 |
| MiniCPM-2B-128K | 28.1 | 38.2 | 16.2 | 4.3 |

On 8K–32K token contexts, Instella-Long achieves 68.7% average HELMET performance, surpassing Qwen-2.5-3B-Instruct (65.9%). These results make Instella-Long the highest-scoring fully open long-context model at 128K context, outperforming all other transparent (open weights, data, and code) 3B-parameter LLMs tested on these tasks (Liu et al., 13 Nov 2025).

4. Efficiency and Implementation

Instella-Long leverages hardware-efficient training on 128 AMD Instinct MI300X GPUs (16 nodes) with PyTorch, torch.compile, FlashAttention 2, and bfloat16 mixed precision. The training regime employed Fully Sharded Data Parallel (FSDP) and intra-node sequence parallelism via DeepSpeed Ulysses. With full attention and document masking, throughput reaches 4–8M tokens per step at 256K context length. Document masking prevents quadratic attention cost from spilling across unrelated document spans, keeping compute and memory tractable even as sequence lengths saturate hardware limits (Liu et al., 13 Nov 2025).
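
A minimal sketch of the distributed wrapping described above, assuming PyTorch FSDP with bfloat16 mixed precision plus torch.compile; process-group initialization and the DeepSpeed Ulysses sequence-parallel layer are omitted.

```python
# Shard parameters with FSDP, run compute and gradient reduction in bfloat16,
# and compile the wrapped module. Requires torch.distributed to be initialized.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def wrap_for_long_context_training(model: nn.Module) -> nn.Module:
    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    sharded = FSDP(model, mixed_precision=bf16, use_orig_params=True)
    return torch.compile(sharded)
```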

5. Performance Trade-offs and Limitations

Adapting for longer context induces a mild short-context performance decrease; on MMLU, for example, Instella-Long's score drops from 58.9 to 57.4. Toxicity robustness (ToxiGen 57→42) and truthfulness (TruthfulQA) are maintained or improved, possibly due to exposure to more diverse and complex long-context instructions during finetuning. No summarization or code-completion benchmarks are reported. Performance beyond 128K tokens is not validated; the adoption of memory-augmented mechanisms or chunk-prioritization approaches (e.g., Re-mem, LSH-attention) is suggested for further scaling (Liu et al., 13 Nov 2025).

6. Astrophysical Usage: Instella-Long as an Optical Transient Class

In astronomy, “Instella-Long” is used by the discoverers of SDF-05M05 to designate a new subclass of extraordinarily persistent, luminous optical transients. SDF-05M05 was detected at $z \approx 0.65$ with a peak absolute magnitude $M_{\rm peak} \approx -20$, a duration $>800$ days in the observer frame (≈485 rest-frame days), and total radiated energy $E_{\rm rad} \approx 10^{51}$ erg. Its multi-band light curve exhibits a plateau brighter than $M < -19$ lasting ∼300 rest-frame days, far exceeding the duration of both typical and ultra-luminous supernovae. Slow decline rates of $0.28$ to $0.76~\text{mag}/100\,\text{d}$ and a single-blackbody SED at $T \approx 6500$ K further distinguish this class from Type IIn or pair-instability supernovae. The authors interpret the phenomenon as a peculiar supernova powered by extended circumstellar medium interaction in a low-mass host, positing SDF-05M05 as the first recognized example of this class (Urata et al., 2012).
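
The rest-frame duration quoted above follows directly from cosmological time dilation at the stated redshift; as a short worked check:

```latex
\Delta t_{\rm rest} \;=\; \frac{\Delta t_{\rm obs}}{1+z}
  \;\approx\; \frac{800~\mathrm{d}}{1 + 0.65} \;\approx\; 485~\mathrm{d}.
```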

7. Significance and Future Prospects

Instella-Long models establish that principled RoPE scaling and document-masked full attention can robustly support 128K-token context capability in open-weight 3B-parameter Transformers, advancing the transparency and reproducibility of long-context LLMs (Liu et al., 13 Nov 2025). The astrophysical subclass expands the known diversity of luminous transient events and constrains progenitor and environment models for long-lived extragalactic explosions (Urata et al., 2012). In both domains, Instella-Long combines exceptional persistence with openness, providing a baseline and a stimulus for further investigation.

References

  • Urata et al., 2012.
  • Liu et al., 13 Nov 2025.