xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity (2510.02228v1)
Abstract: Scaling laws play a central role in the success of LLMs, enabling the prediction of model performance relative to compute budgets prior to training. While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime. We conduct a comparative investigation on the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment. First, we study the scaling behavior for xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric fit approaches on a wide range of model sizes (80M-7B) and number of training tokens (2B-2T). Second, we examine the dependence of optimal model sizes on context length, a pivotal aspect that was largely ignored in previous work. Finally, we analyze inference-time scaling characteristics. Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers. Importantly, xLSTM's advantage widens as training and inference contexts grow.
Explain it Like I'm 14
What this paper is about (in simple terms)
This paper compares two kinds of AI models for language:
- Transformers (the common kind used in most chatbots today), and
- xLSTM (a newer kind that works more like a memory that gets updated step by step).
The authors study “scaling laws,” which are rules that tell you how a model’s quality changes when you give it more computing power or data, or make it bigger. Their main message: xLSTM can reach similar or better quality than Transformers while using compute more efficiently, especially when the input text (the “context”) is long.
The big questions the paper asks
To make the comparison fair and useful, the authors focus on three easy-to-understand questions:
- Training: If we spend the same amount of computing power to train, which type gets better results?
- Context length: What happens when the input is long, like a very long document? Does one type handle long text more efficiently?
- Inference (using the model): Which type is faster when you ask it a question, both for the first word it answers and for each word after that?
How they studied it (in everyday language)
Think of training a model like practicing for a sport:
- “Parameters” are the model’s internal settings (like a huge set of knobs you can tune).
- “Tokens” are chunks of text (like words or parts of words) it practices on.
- “Compute budget” (FLOPs) is the total amount of “workout effort” the computer spends.
The team ran lots of training experiments:
- Model sizes ranged from small to medium-large (about 80 million to 7 billion parameters).
- Data sizes ranged from 2 billion to 2 trillion tokens.
- They used two ways to organize experiments:
- Token/Param: Pick a model size and vary how much data it sees.
- IsoFLOP: Keep total training effort the same, but trade off between making the model bigger vs. giving it more data.
They then fit simple mathematical curves (power laws) to predict how loss (a measure of mistakes; lower is better) depends on model size and data; a minimal sketch of such a fit appears after the list below. They also measured real-world speed during use (inference):
- Time to First Token (TTFT): how long until the first answer word appears.
- Step time: how long for each next word.
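To make the curve-fitting step concrete, here is a minimal sketch of fitting a Chinchilla-style parametric loss surface, L(N, D) = E + A/N^alpha + B/D^beta, to (model size, token count, loss) measurements. The functional form, coefficients, and data points are illustrative assumptions, not the paper's actual fits.

```python
# Minimal sketch: fit a Chinchilla-style loss surface L(N, D) = E + A/N^a + B/D^b.
# All numbers below are made up for illustration; they are not the paper's fits.
import numpy as np
from scipy.optimize import curve_fit

def loss_surface(X, E, A, alpha, B, beta):
    N, D = X  # N: parameter count, D: training tokens
    return E + A / N**alpha + B / D**beta

# Hypothetical grid of training runs (80M-7B params, 2B-2T tokens).
N = np.repeat([80e6, 400e6, 1.4e9, 7e9], 4)
D = np.tile([2e9, 20e9, 200e9, 2e12], 4)
rng = np.random.default_rng(0)
true_coeffs = (1.7, 400.0, 0.34, 410.0, 0.28)   # assumed, only to generate demo data
L = loss_surface((N, D), *true_coeffs) + rng.normal(0, 0.01, N.size)

popt, _ = curve_fit(loss_surface, (N, D), L,
                    p0=[2.0, 300.0, 0.3, 300.0, 0.3],
                    bounds=(0, np.inf), maxfev=50_000)
E, A, alpha, B, beta = popt
print(f"fitted exponents: alpha={alpha:.3f}, beta={beta:.3f}, irreducible loss E={E:.2f}")

# The fitted surface can then predict loss for a planned (N, D) before training.
print(loss_surface((np.array([3e9]), np.array([600e9])), *popt))
```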
Key idea about the two model types (a toy cost comparison follows these two points):
- Transformer “attention” compares many parts of the input with each other. If the input is longer, the cost grows roughly like “length squared” (quadratic). Imagine every student in a class comparing notes with every other student—this gets slow as the class grows.
- xLSTM updates a running memory as it reads. The cost grows roughly in a straight line with input length (linear). Imagine just taking notes as you read, one line at a time.
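As a rough illustration of that difference, the toy calculation below compares the sequence-mixing cost of attention (which grows with T squared) to a recurrent update (which grows with T). The width and per-token constants are arbitrary assumptions, and both architectures also pay roughly T-linear MLP/projection costs that dominate at short contexts.

```python
# Toy cost comparison (illustrative constants only, not the paper's FLOP accounting).
d = 4096  # assumed model width

def attention_mixing_cost(T):
    # Pairwise token interactions: QK^T scores plus the attention-weighted sum,
    # roughly 4 * T^2 * d multiply-adds per layer.
    return 4 * T**2 * d

def recurrent_mixing_cost(T, per_token_work=64 * d):
    # A recurrent cell does a (roughly) fixed amount of work per token,
    # so the total grows linearly in T; the constant here is arbitrary.
    return T * per_token_work

for T in (2_048, 16_384, 131_072):
    ratio = attention_mixing_cost(T) / recurrent_mixing_cost(T)
    print(f"context {T:>7} tokens: attention / recurrent mixing cost ~ {ratio:,.0f}x")
```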
What they found (and why it matters)
Here are the main results, with a short reason why each is important:
- xLSTM is Pareto-dominant in training cost vs. quality.
- Meaning: For the same training compute, xLSTM gets a better loss (makes fewer mistakes); for the same loss, it needs less compute.
- Why it matters: You can get more bang for your buck.
- In “over-training” (smaller models trained on lots of data), xLSTM keeps improving steadily.
- They show xLSTM follows the same kind of reliable power-law improvements seen in Transformers when you feed it more data.
- Why it matters: You can train small, fast xLSTMs on big datasets and still keep getting better results.
- Compute-optimal xLSTMs are larger than compute-optimal Transformers.
- Under the best way to spend a fixed training budget, the best xLSTM tends to have more parameters than the best Transformer.
- Why it matters: If you’re planning a training run, xLSTM lets you allocate your compute toward model size differently and still come out ahead on quality.
- As context length grows, xLSTM’s advantage grows.
- For long inputs, Transformers spend a lot more time on attention. xLSTM handles long inputs more efficiently.
- Why it matters: Long documents, long instructions, or long reasoning chains become practical and faster with xLSTM.
- Inference is faster with xLSTM, especially for long inputs.
- TTFT: For long contexts (like 16,000 tokens), xLSTM cuts the time to first token by about 30–50% compared to same-sized Transformers.
- Step time: xLSTM’s per-token speed stays steady as inputs get longer; Transformers get slower.
- Why it matters: Lower latency and higher throughput mean faster, cheaper deployments—useful for chatbots, on-device AI, or any app where speed matters.
- A simple runtime model (based on how much math and memory the model needs) accurately explains the measured speeds; a toy version of such a model is sketched after this list.
- Why it matters: The speed gains aren’t just lucky—they’re expected from the underlying math of the two architectures.
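A toy version of such a runtime model, assuming placeholder hardware numbers (sustained FLOP rate, memory bandwidth, fixed overhead) rather than the paper's calibrated coefficients, might look like this:

```python
# Roofline-style step-time estimate: a step is limited either by compute
# throughput or by memory bandwidth, plus a fixed overhead per step.
# All constants are placeholders, not the paper's calibrated values.
PEAK_FLOPS = 300e12   # assumed sustained FLOP/s of the accelerator
PEAK_BYTES = 2.0e12   # assumed memory bandwidth in bytes/s
OVERHEAD_S = 5e-5     # assumed fixed per-step overhead in seconds

def predicted_step_time(flops, mem_bytes):
    return max(flops / PEAK_FLOPS, mem_bytes / PEAK_BYTES) + OVERHEAD_S

# Example: a single decode step of a 7B model in bf16 that must at least
# re-read all weights once (memory-bound) and do ~2 FLOPs per parameter.
n_params = 7e9
t = predicted_step_time(flops=2 * n_params, mem_bytes=2 * n_params)
print(f"predicted decode step time: {t * 1e3:.2f} ms")
```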
Why this is important overall
- Better cost-performance: xLSTM can get better results for the same training cost and run faster during use.
- Long-context strength: As tasks need longer memory (long documents, multi-step reasoning), xLSTM’s benefits grow.
- Practical deployment: Faster response and lower compute can reduce costs and make AI more accessible on devices with limited memory or power.
Key terms explained simply
- Parameters: The model’s internal “settings” that it learns, like millions or billions of tiny dials.
- Tokens: Pieces of text (like words or word parts).
- Loss: A number that measures how wrong the model’s predictions are; lower is better.
- FLOPs: A way to count how much “math work” the computer does; more FLOPs means more compute effort.
- Context length: How much text the model reads at once (like how many tokens it can remember).
- Time to First Token (TTFT): How long until the model produces its first word of output.
- Step time: How long it takes to produce each additional word.
Takeaway
If you care about training efficiently and running models fast—especially on long inputs—xLSTM looks like a very promising alternative to Transformers. It keeps improving with more data, is more compute-efficient, and scales better as inputs get longer. This could help future LLMs become cheaper, faster, and more useful in real-world applications.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper. Each point indicates a concrete direction for future research.
External validity and evaluation scope
- Lack of downstream task evaluation: No assessment beyond cross-entropy (e.g., reasoning, coding, math, multilingual, safety, calibration), leaving it unclear whether xLSTM’s training/inference advantages translate into end-task gains.
- No long-context benchmark results: Claims about advantages at long context are not validated on established suites (e.g., LongBench, RULER, L-Eval, SCROLLS), nor on retrieval-heavy or tool-use tasks.
- Single data distribution: Results are based on DCLM-Baseline; robustness across diverse corpora (code, math, multilingual, speech-text, domain-specific datasets) is unknown.
- Tokenizer sensitivity: Only GPT-NeoX tokenizer is used; the effect of vocabulary size and tokenization choices on scaling (and relative architecture advantages) is unmeasured.
Fairness and breadth of architectural baselines
- Transformer baseline choice: The comparison uses dense multi-head attention Llama-2 style models; it omits stronger or more efficient Transformer variants (e.g., MQA/GQA, FlashAttention-2/3, fused/fp8 kernels, paged KV cache, ring/sequence parallel attention), which could alter the Pareto frontier.
- Missing comparisons to other linear-time or hybrid models: No head-to-head scaling comparisons with Mamba/SSMs, linear attention variants, Hyena/FastConv/FlashFFTConv, or hybrid models combining attention with xLSTM—limiting conclusions about xLSTM’s relative position among linear-time architectures.
- No evaluation of Mixture-of-Experts (MoE) baselines: Given MoE’s inference-efficiency benefits, it remains unclear whether xLSTM’s wins persist when compared against compute- and inference-optimized routed models.
Scaling ranges and extrapolation risk
- Limited scale: Experiments span 80M–7B parameters and 2B–2T tokens; the validity of fitted laws for ≥70B models and multi-trillion-token training remains untested.
- Extreme token-to-parameter ratios: The over-training analysis covers a specific set of ratios; behavior at very high ratios (e.g., ≥5k–10k tokens per parameter) or in very low compute regimes is not established.
- Ultra-long contexts: Context-dependent results are limited to 2k–16k; claims may not hold at 32k–1M token contexts where memory/precision and kernel limits emerge.
Context-length analysis confounds
- Non-comparable losses across context lengths: The authors note that the data distribution differs across context lengths, so compute-optimal conclusions across context lengths may be confounded. A controlled setup with identical documents and packing across context lengths is needed.
- Positional/temporal encoding effects: The interaction of positional methods (e.g., RoPE variants for Transformers, temporal embedding for xLSTM) with context scaling is not disentangled; potential gains/losses attributable to encoding are unknown.
Compute accounting and runtime modeling assumptions
- Roofline model simplifications: The runtime model assumes dominant FLOPs or memory phases with a constant overhead; sensitivity to kernel fusion, scheduling, and overlapping compute/memory (and their hardware-specific variability) is not analyzed.
- FLOP/MemOps estimates vs. real kernels: The theoretical accounting may diverge from highly optimized vendor kernels; calibration of coefficients across GPUs/TPUs/consumer cards is limited, and generalization across hardware is unverified.
- Absence of cost/energy modeling: No $ cost or energy/carbon accounting is reported, leaving economic/ecological implications of the scaling claims unquantified.
Inference benchmarking limitations
- Narrow inference setup: Primary figures use batch size 1 and single-GPU; large-batch prefill, heterogeneous/streaming batches, beam search, and cache-management strategies are not studied.
- No multi-GPU/cluster inference: Pipeline/tensor parallelism, cross-node latency, and KV/state sharding effects on TTFT and throughput are unexplored.
- Memory footprint under varied serving patterns: The comparative memory costs of Transformer KV caches vs. xLSTM state (prefill/generation, batch size, long T) are not quantified; practical serving capacity limits and context-length ceilings remain unclear. A back-of-the-envelope comparison is sketched below for intuition.
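For intuition on the kind of comparison that is missing, here is a rough sketch; the layer count, head layout, and the assumption that the recurrent state is a head_dim x head_dim matrix per head (an mLSTM-style layout) are illustrative guesses, not measurements from the paper.

```python
# Rough memory-footprint comparison for a single request (bf16 = 2 bytes/element).
# Config values are illustrative assumptions, not the paper's models.
n_layers, n_heads, head_dim, bytes_per_el = 32, 32, 128, 2

def kv_cache_bytes(T):
    # Transformer: keys and values for every layer, head, and token in the context.
    return 2 * n_layers * n_heads * head_dim * T * bytes_per_el

def recurrent_state_bytes():
    # Recurrent model: a fixed head_dim x head_dim state per head and layer,
    # independent of how many tokens have been processed.
    return n_layers * n_heads * head_dim * head_dim * bytes_per_el

for T in (16_384, 131_072, 1_048_576):
    kv_gib = kv_cache_bytes(T) / 2**30
    st_gib = recurrent_state_bytes() / 2**30
    print(f"T={T:>9}: KV cache ~ {kv_gib:7.1f} GiB vs recurrent state ~ {st_gib:.2f} GiB")
```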
Training efficiency and systems considerations
- Wall-clock and utilization: Results are expressed in FLOPs and loss, not in actual training throughput, GPU utilization, or end-to-end wall-clock time—critical for practitioner decisions.
- Parallelism regimes: Impacts of data/tensor/sequence/pipeline parallelism on xLSTM vs. Transformer training efficiency (e.g., overlap, communication volume, activation checkpointing) are not investigated.
- Stability and precision: Numerical stability of xLSTM at scale and long contexts (gradient behavior, norm control) and sensitivity to mixed precision (fp8/bf16) are not reported.
Hyperparameter, recipe, and optimization sensitivity
- Recipe robustness: Sensitivity to optimizer choice, LR schedules, weight decay, normalization/regularization, dropout/label smoothing, gradient clipping, and activation functions is not systematically evaluated for both architectures.
- Architectural ablations: Depth/width trade-offs, head/dimension scaling (for Transformers), and cell/MLP ratios (for xLSTM) are not ablated—limiting prescriptive scaling rules.
- Data curriculum and packing: Effects of curriculum learning, length/batch packing policies, and deduplication/stringent filtering on scaling exponents and over-training behavior are untested.
Theoretical questions
- Universality claim: The observed “universal” relation between compute-optimal performance and model size across architectures is empirical; a theoretical explanation and boundaries of validity are missing.
- Inference-aware optimality for xLSTM: An explicit derivation of xLSTM-specific inference-aware scaling laws (analogous to Beyond Chinchilla for Transformers) is not provided; the optimal choice of model size, training tokens, and context length under serving-demand distributions remains an open problem.
Transfer, fine-tuning, and alignment
- Post-pretraining adaptation: Effects of instruction tuning, RLHF/DPO, domain adaptation, and continual learning on the relative advantages of xLSTM vs. Transformer are unknown.
- Robustness and safety: No analysis of robustness to distribution shift, adversarial prompts, calibration, or safety behaviors; interactions with scaling are not addressed.
Reproducibility and implementation details
- Kernel maturity and portability: xLSTM’s performance depends on specialized kernels (e.g., TFLA); portability and performance consistency across frameworks/hardware stacks are not assured.
- Seed variance and uncertainty: Statistical variability (across random seeds) and confidence intervals for fitted exponents/coefficients are not reported, limiting confidence in extrapolations.
These gaps identify concrete next steps: broaden architectural and task baselines; control for context-length confounds; extend to larger scales and longer contexts; measure wall-clock, energy, and memory; develop inference-aware theory for xLSTM; and validate on production-like multi-GPU serving and diverse downstream tasks.
Practical Applications
Immediate Applications
Below are concrete use cases that can be deployed now using the paper’s findings, with sector tags and notes on feasibility. They leverage the demonstrated advantages of xLSTM—linear-time complexity with context length, lower time-to-first-token (TTFT), lower per-token step time, and predictable scaling laws—at model sizes up to 7B parameters and context windows up to 16k tokens.
- Cost-efficient LLM serving with long prompts (software, enterprise, cloud)
- What: Replace or augment Transformer-based services (chat, copilots, summarization) with xLSTM models to cut TTFT and step time—especially for long prompts (8k–16k) and long histories/session contexts.
- Benefits: Lower latency, higher throughput per GPU, reduced serving costs, improved user experience for long-context interactions.
- Tools/products: xLSTM inference servers; throughput-optimized serving stacks; HF Transformers-compatible xLSTM models; inference scaling dashboards using the paper’s roofline-informed runtime model.
- Assumptions/dependencies: Quality parity on target tasks must be validated; performance gains depend on kernel maturity and hardware; batch-size effects beyond the paper’s single-GPU setup require benchmarking.
- Long-context RAG with less chunking and vector-store overhead (software, legal, finance, research)
- What: Use xLSTM’s linear scaling to accept larger retrieved contexts per query, reducing aggressive chunking and re-ranking cycles in RAG pipelines.
- Benefits: Better recall/precision trade-offs, fewer retrieval calls, simpler prompt engineering, lower end-to-end latency.
- Tools/products: RAG connectors tuned for xLSTM; prompt-assembly layers that target 8k–16k contexts; observability that measures retrieval vs. reasoning time.
- Assumptions/dependencies: Data governance and document access; long-context quality must be validated on domain-specific corpora.
- Repo-scale coding assistants for medium/large codebases (software engineering, DevOps)
- What: Use xLSTM to process long files and repository-wide context (history, multi-file references) with lower TTFT and steady generation step times.
- Benefits: Improved cross-file reasoning, faster code exploration, fewer context truncation errors.
- Tools/products: IDE extensions with xLSTM backends; repo-indexers that pass larger context windows; CI assistants that review long diffs.
- Assumptions/dependencies: Code-domain instruction tuning; license-compliant training data; integration with developer tools.
- Contact center and sales ops with full-call histories (customer support, CRM)
- What: Feed long transcripts and prior interactions (multi-session) into xLSTM for summarization, QA, and suggestion generation without quadratic slowdowns.
- Benefits: Lower TTFT for “first reply,” faster iterative suggestions, better personalization across long histories.
- Tools/products: CRM plugins using xLSTM; agent-assist dashboards that keep days/weeks of context in prompt.
- Assumptions/dependencies: Data privacy controls; domain fine-tuning for dialog; stable latency under production batching.
- E-discovery and regulatory document review (legal, compliance, finance)
- What: Process long filings/contracts or multi-document dossiers in a single pass to extract risks, obligations, and inconsistencies.
- Benefits: Fewer passes, reduced chunk stitching, improved consistency; time/cost savings at scale.
- Tools/products: Long-context contract analyzers; risk-summary pipelines; obligation extractors.
- Assumptions/dependencies: Legal-domain validation; auditability and traceability; governance over sensitive data.
- EHR summarization and longitudinal patient narratives (healthcare)
- What: Summarize multi-visit records and write longitudinal clinical notes with long-context xLSTM to maintain continuity and reduce truncation.
- Benefits: Better continuity-of-care summaries; lower latency in on-prem settings.
- Tools/products: On-prem clinical summarization services; physician-facing assistants with long patient timelines.
- Assumptions/dependencies: HIPAA/GDPR compliance; clinical fine-tuning; institutional IT integration and validation.
- On-device or near-edge assistants with long-session memory (mobile, embedded, IoT)
- What: Run 1–7B xLSTM models on edge GPUs/NPUs where memory and latency are constrained, enabling longer “short-term memory” with lower TTFT.
- Benefits: Privacy, offline capability, reduced cloud dependency.
- Tools/products: Mobile/edge runtimes with quantized xLSTM; session memory managers optimized for recurrent state.
- Assumptions/dependencies: Efficient kernels on target hardware; quantization/distillation pipelines; battery/thermal constraints.
- Time-series and log analysis over long windows (security, AIOps, energy)
- What: Apply xLSTM to long-horizon logs/metrics for anomaly detection, incident summaries, and trend explanations.
- Benefits: Model longer windows without quadratic cost; faster incident triage.
- Tools/products: SIEM/SOAR integrations; observability copilot that ingests long traces; grid/SCADA log analysis.
- Assumptions/dependencies: Domain-specific evaluation; secure data handling; integration with existing telemetry stacks.
- Training-budget planners that include inference costs (MLOps, academia)
- What: Use the paper’s IsoFLOP and parametric loss fits to compute-optimize xLSTM training with inference cost constraints; over-train smaller models to cut serving costs.
- Benefits: Better ROI on pretraining; predictable loss vs. compute; more accurate context-length-aware planning than the 6·N·D FLOP proxy (a toy calculation follows this item).
- Tools/products: “xLSTM Scaling Planner” calculators; AutoML sweeps over token-per-parameter ratios with inference-aware objectives.
- Assumptions/dependencies: Target data/domain match; reliable loss-transfer from validation to downstream tasks; access to large token corpora.
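As a toy illustration of why a 6·N·D-only proxy under-counts Transformer training compute at long contexts, the sketch below adds a commonly used attention correction term (roughly 12 · layers · d_model · T extra FLOPs per trained token); the constants and the correction are standard rules of thumb, not the paper's exact accounting, and a linear-time model largely avoids this term.

```python
# Toy context-length-aware FLOP estimate vs the plain 6*N*D proxy.
# Rule-of-thumb constants, not the paper's exact FLOP accounting.
def flops_6nd(n_params, n_tokens):
    return 6 * n_params * n_tokens

def flops_with_attention(n_params, n_tokens, ctx_len, n_layers, d_model):
    # Attention score and value products cost roughly 12 * layers * d_model * T
    # extra FLOPs per trained token (forward + backward passes combined).
    return 6 * n_params * n_tokens + 12 * n_layers * d_model * ctx_len * n_tokens

n_params, n_tokens = 7e9, 2e12       # hypothetical 7B model trained on 2T tokens
n_layers, d_model = 32, 4096         # assumed architecture shape
for ctx_len in (2_048, 16_384, 131_072):
    extra = (flops_with_attention(n_params, n_tokens, ctx_len, n_layers, d_model)
             / flops_6nd(n_params, n_tokens)) - 1
    print(f"context {ctx_len:>7}: attention adds ~ {extra * 100:5.1f}% over 6*N*D")
```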
- Sustainability reporting and procurement guidance (policy, enterprise IT)
- What: Prefer linear-time LLMs in procurement and architecture reviews for long-context workloads to lower energy consumption and carbon intensity.
- Benefits: Reduced energy (kWh) per request; aligns with sustainability KPIs.
- Tools/products: Internal guidelines for model selection; cost/CO2 calculators using the paper’s FLOPs/memops runtime model.
- Assumptions/dependencies: Comparable task performance; standardized energy metering in inference pipelines.
- Methodological baselines for academic scaling-law studies (academia)
- What: Adopt xLSTM as a linear-complexity baseline in scaling-law experiments; replicate IsoFLOP sweeps and parametric fits including context-length dependence.
- Benefits: More complete scaling maps; avoids misestimation from 6ND-only proxies.
- Tools/products: Open-source code/data from the paper; reproducible notebooks for loss-surface fitting and runtime roofline modeling.
- Assumptions/dependencies: Reproduction fidelity; compute access; dataset comparability.
Long-Term Applications
These opportunities require further research, scaling, kernel maturity, or validation on broader downstream tasks, but are natural extensions of the paper’s results—especially as contexts and model sizes grow.
- Million-token context assistants and retrieval-light workflows (software, research, education)
- What: Build assistants that read entire books/courses, large codebases, or multi-quarter project histories in one pass, leveraging linear-time growth.
- Potential products: “Whole-repo copilots,” “full-course tutors,” long-context literature-review engines.
- Assumptions/dependencies: Efficient memory/state management; kernel/hardware improvements; strong long-context grounding/recall behavior beyond 16k; training data and pretraining budgets at scale.
- Long-horizon reasoning with test-time compute scaling (software, research, robotics)
- What: Combine xLSTM’s faster per-step generation with reasoning tokens (chain-of-thought, tree search, sampling) to get more “thinking” per time budget.
- Potential products: Planning agents, complex multi-step tool-use systems, robotics task planners with extended horizons.
- Assumptions/dependencies: Reasoning supervision, safety controls, and evaluation; alignment techniques for deep deliberation.
- Privacy-first, on-device professional assistants (healthcare, legal, finance, government)
- What: Run specialized xLSTM models locally (hospital, law firm, bank, agency) for long-context analysis of sensitive data without cloud exposure.
- Potential products: On-prem copilots for EHRs, legal matter management, regulatory filings.
- Assumptions/dependencies: Hardware accelerators at the edge; robust domain instruction tuning; certification and compliance audits.
- Hybrid and MoE architectures with linear-time cores (software, research)
- What: Mix xLSTM cores with sparse attention or experts to capture rare patterns while keeping inference linear w.r.t. context length.
- Potential products: Frontier models that preserve long-context speed while improving accuracy on global dependencies.
- Assumptions/dependencies: Routing/training stability; inference system design; compiler/kernel support.
- Policy frameworks for energy-aware AI deployments (policy, cloud, sustainability)
- What: Introduce standards that factor asymptotic complexity (linear vs. quadratic) into public procurement and energy labeling for LLM services.
- Potential products: “Green AI” certifications; incentives for linear-time long-context deployments.
- Assumptions/dependencies: Accepted measurement protocols (per-request energy/carbon); neutrality across vendors.
- Scientific and biomedical sequence modeling over very long inputs (science, biotech)
- What: Use linear-complexity sequence models for genomics, proteomics, and climate/hydrology time series where context length is a bottleneck.
- Potential products: Genome-scale annotators, long-time climate analyzers, wearable-stream models.
- Assumptions/dependencies: Domain datasets and labels; multimodal extensions; rigorous validation.
- Enterprise-wide consolidation of LLM serving fleets (enterprise IT, cloud)
- What: Migrate long-context workloads to linear-time architectures to reduce GPU count and improve capacity planning.
- Potential products: Fleet optimizers that simulate TTFT/throughput using roofline models; dynamic routing to xLSTM for long-context requests.
- Assumptions/dependencies: Mixed-traffic routing strategies; SLO management; gradual migration playbooks.
- Education at scale with offline/low-bandwidth support (education, public sector)
- What: Deploy long-context teaching assistants on school devices that can maintain multi-lesson memory without cloud dependence.
- Potential products: Local classroom copilots; curricula ingestion tools; longitudinal student portfolio analysis.
- Assumptions/dependencies: Device procurement; content licensing; fairness and safety evaluation.
- Security and operations “memory” over months of telemetry (security, AIOps)
- What: Maintain rolling, very long windows of logs/alerts for contextual threat hunting and root-cause analysis without prohibitive latency.
- Potential products: “Memory copilot” for SOC/NOC; timeline reconstruction tools.
- Assumptions/dependencies: Secure data lakes; continuous fine-tuning; evaluation on adversarial settings.
- Training orchestration co-designed with inference economics (MLOps, finance)
- What: Portfolio-style optimization of training runs (overtraining small models vs. compute-optimal larger models) under serving cost and SLA constraints.
- Potential products: Budget planners that couple scaling-law fits with demand forecasts and price-performance curves.
- Assumptions/dependencies: Accurate demand modeling; dynamic pricing of compute; evolving model families and kernels.
Notes on Cross-Cutting Assumptions and Dependencies
- Task transferability: The paper evaluates cross-entropy loss on DCLM-Baseline and does not audit a wide range of downstream tasks; validate quality per domain.
- Inference benchmarking: Results are from controlled, single-GPU experiments; multi-GPU, batching, and production kernels may shift realized gains.
- Kernel maturity and hardware support: Performance depends on optimized recurrent kernels (e.g., xLSTM/mLSTM implementations) and vendor compiler support.
- Data quality and licensing: Overtraining strategies rely on large, high-quality, legally usable corpora.
- Model ecosystem: Fine-tuning recipes, safety/alignment, quantization, and tool-use integrations need to be robust for enterprise/regulated deployments.
- Context distribution shift: Losses are not directly comparable across different context lengths; real workloads may differ from pretraining distributions.
In summary, the paper’s evidence that xLSTM is Pareto-dominant in loss vs. compute and scales linearly in inference with context length enables immediate wins in long-context, latency-sensitive deployments, and sets the stage for long-term advances in million-token contexts, on-device privacy-preserving assistants, and energy-aware AI policy and infrastructure.
Glossary
- attention mechanism: Component in Transformers that computes pairwise token interactions, often the computational bottleneck. "the attention mechanism inflicts computational costs during training and inference that are quadratic in terms of context length."
- Chinchilla-optimal: Informal reference to the compute-optimal scaling prescription from Chinchilla work; used as a baseline for data/parameter ratios. "much higher than ``Chinchilla-optimal'', which incurs higher inference speeds due to smaller models."
- compute-optimal training: Strategy choosing model and data sizes that minimize loss for a fixed training compute budget. "establish the notion of compute-optimal training, which refers to the optimal choice of N and D for a given compute budget C"
- context length: Number of tokens processed as input at once (sequence length). "offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime."
- cross-entropy loss: Standard language-modeling objective measuring prediction quality per token. "xLSTM models are Pareto-dominant in terms of cross-entropy loss over Transformer models"
- IsoFLOP approach: Scaling-law fitting method that compares models along curves of equal training compute. "For the IsoFLOP approach a set of compute budgets is defined and for each budget the values of N and D are varied such that the compute constraint is fulfilled."
- KV cache: Stored key/value representations from attention used to speed up autoregressive decoding. "i.e.~the KV cache for Transformer models or the mLSTM cell states for xLSTM."
- linear time-complexity: Computation that scales linearly with sequence length, advantageous for long contexts. "self-attention with quadratic time-complexity in Transformer versus recurrent mLSTM dynamics with linear time-complexity in xLSTM."
- logits: Pre-softmax scores output by the model used to sample or select the next token. "compute the logits for the first token to be generated"
- MemOps: Count of memory read/write operations (bytes moved), used to model runtime alongside FLOPs (detailed in the paper's appendix on MemOps accounting).
- Mixture-of-Experts: Architecture that routes inputs to a subset of specialized sub-networks to increase parameter count without proportional compute. "such as the widely considered Mixture-of-Experts method \citep{shazeer:17outrageously}."
- mLSTM: An LSTM cell variant with a matrix memory, used as the recurrent building block in xLSTM. "it alternates mLSTM layers with position-wise feedforward MLP layers."
- over-training regime: Training with more tokens per parameter than compute-optimal, often to reduce inference costs for smaller models. "so-called over-training regime, i.e. on more tokens than would be optimal in terms of pre-training compute."
- parametric fit approach: Method fitting a functional form (typically power laws) to loss as a function of parameters and data. "the parametric fit approach and the IsoFLOP approach"
- Pareto-dominant: Strictly better on at least one metric without being worse on others compared to alternatives. "xLSTM models are Pareto-dominant in terms of cross-entropy loss over Transformer models"
- Pareto frontier: Set of solutions that are non-dominated across trade-off metrics (e.g., loss vs. compute). "we visualize the Pareto frontier by connecting the data points for xLSTM and Transformer."
- power law: Relationship where a quantity varies as a power of another (e.g., loss vs. compute). "these scaling laws take the form of power laws"
- prefill: Inference stage that processes the prompt to produce the first token and caches states. "In the prefill stage the LLMs process the prompt, compute the logits for the first token to be generated, and store the intermediate internal representations of the prompt"
- quadratic time-complexity: Computation scaling with the square of sequence length, typical of full self-attention. "and their performance advantage grows with context length due to Transformers' quadratic time complexity."
- roofline model: Performance model relating FLOPs, memory bandwidth, and operational intensity to predict runtime bounds. "based on the roofline model."
- scaling laws: Empirical rules describing how performance changes with model size, data, or compute. "Scaling laws play a central role in the success of LLMs"
- self-attention: Mechanism where tokens attend to each other to mix sequence information in Transformers. "dense multi-head self-attention Transformer architectures"
- step time: Time per generated token during the autoregressive decoding stage. "the key performance metric is the step time, i.e. how long it takes to obtain the next token given the current (potentially batched) sequence."
- time to first token (TTFT): Latency from submitting a prompt to receiving the first generated token. "the key performance metric is the time to first token (TTFT)."
- token-to-parameter ratio: Number of training tokens per model parameter, used to characterize training regimes. "token-to-parameter ratios of "