Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Abstract: Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
Explain it Like I'm 14
Gated DeltaNet-2, explained simply
1) What is this paper about?
This paper is about making AI models remember long stories or documents better and faster. Usual Transformers remember by comparing every word to every other word, which gets very slow as the text gets longer. Gated DeltaNet-2 is a new way to remember that keeps a small “memory” the same size no matter how long the text is, while still finding the right facts when needed.
2) What questions does it try to answer?
The paper focuses on a few simple questions:
- How can a model with a small, fixed memory remember long texts without mixing things up?
- Can we control forgetting and writing into memory more precisely, so old facts aren’t accidentally ruined?
- Can we do this without slowing training and inference?
- Does this actually help on real tasks like language modeling and finding specific information in long contexts?
3) How does it work? (with simple analogies)
Think of the model’s memory as a small whiteboard whose size never changes. Each new word in the text is like a student who:
- Reads what’s currently written at a specific “address” (the key),
- Decides what to erase,
- Decides what new stuff to write (the value),
- And slightly fades the old writing over time (decay), so very old notes slowly disappear unless refreshed.
Some basic ideas and the upgrades in this paper:
- Queries, keys, and values:
- Key = the “address label” for a memory slot.
- Value = the “note” you store at that address.
- Query = the “question” you ask the memory to get an answer.
- Linear attention with a fixed state: Instead of keeping a huge list of past notes, the model keeps one small board (a matrix). Each token updates this board with a fixed amount of work and memory, so total time grows linearly with the text length while memory stays constant.
- The delta rule (targeted editing): Before writing new content, the model first reads what’s already stored for the current key and subtracts it. This is like erasing the specific old note at that address, then writing the new note. It prevents piling new notes on top of old ones and getting a mess.
- What earlier models did:
- Mamba-2: Adds a “fade” knob (decay) to make older content slowly disappear.
- DeltaNet/Gated DeltaNet: Use the delta rule with a single scalar gate that sets how strongly to erase and write at each step; Gated DeltaNet also adds adaptive forgetting.
- KDA (Kimi Delta Attention): Makes the fade (decay) smarter by tuning it separately for each “color channel” of the memory. Think of the memory note as having multiple colored layers; KDA can fade each color differently. But it still uses a single shared knob to control both erase and write strength.
- What Gated DeltaNet-2 changes:
- Two separate knobs instead of one:
- Erase gate b_t (key side): a set of per-channel dimmer switches that decide which parts of the old note to erase. Different channels (think “colors” or “features”) can be erased more or less.
- Write gate w_t (value side): another set of per-channel dimmer switches that decide which parts of the new note to write.
- Channel-wise decay stays: each channel can fade at its own rate, like in KDA.
- Why this helps: Erasing and writing are different actions. Sometimes you want to erase a lot but only write a little, or erase one set of features and write a different set. A single shared knob can’t do both well; two knobs give finer control.
- Efficient training and inference:
- Chunking: The model processes the sequence in small blocks (chunks) so it still trains in parallel on GPUs.
- A math trick (WY form) and a “gate-aware” backward pass keep this efficient even with the new per-channel gates.
- Result: Nearly the same speed as earlier linear-attention models, and far faster scaling than a standard Transformer as sequences get longer.
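The read, erase, write, and fade steps described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the summary does not reproduce the paper's Eq. 10, so the update form below is one plausible reading that matches the abstract's reduction claims (tied scalar gates recover a scalar-gated delta step with channel-wise decay). The function names `gated_delta2_step` and `read`, and the placement of each gate, are hypothetical.

```python
import numpy as np

def gated_delta2_step(S, k, v, decay, erase, write):
    """One recurrent step on a (d_k, d_v) memory matrix S.

    ASSUMED update form (illustrative, not copied from the paper):
        S <- Diag(decay) S                  channel-wise fade
        S <- S - (erase * k) (S^T k)^T      gated erase of the current read
        S <- S + k (write * v)^T            gated write of the new value
    decay and erase have shape (d_k,); write has shape (d_v,).
    """
    S = decay[:, None] * S
    old_read = S.T @ k                       # what the memory returns for key k
    S = S - np.outer(erase * k, old_read)    # erase, scaled per key channel
    S = S + np.outer(k, write * v)           # write, scaled per value channel
    return S

def read(S, q):
    """Query the memory: returns a d_v vector."""
    return S.T @ q

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S0 = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                       # unit-norm key
v = rng.normal(size=d_v)

# Full erase, full write, no decay: the old note at k is replaced by v.
S1 = gated_delta2_step(S0, k, v, np.ones(d_k), np.ones(d_k), np.ones(d_v))
overwrite_ok = np.allclose(read(S1, k), v)

# Tied scalar gates: same result as a scalar-gated delta step with channel-wise decay.
beta = 0.3
decay = rng.uniform(0.5, 1.0, size=d_k)
tied = gated_delta2_step(S0, k, v, decay, np.full(d_k, beta), np.full(d_v, beta))
Sd = decay[:, None] * S0
scalar_gated = Sd - beta * np.outer(k, Sd.T @ k) + beta * np.outer(k, v)
reduces_ok = np.allclose(tied, scalar_gated)

print(overwrite_ok, reduces_ok)
```

With separate `erase` and `write` vectors, a step can clear most of an old note while committing only part of the new one, which a single shared scalar cannot express.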
4) What did they find, and why does it matter?
Main takeaways:
- Better long-context memory: On tough “needle-in-a-haystack” tests (find the exact item buried in very long text), Gated DeltaNet-2 does best overall, especially when there are many distracting keys and only one correct one. This is where precise erase vs. write control really matters.
- Strong general performance: With a 1.3B-parameter model trained on 100B tokens, it gets the best overall results among similar models (like Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants) across:
- Language modeling (predicting the next word),
- Commonsense reasoning,
- Retrieval from long contexts (both synthetic tests and real datasets).
- Real-world retrieval: It achieves the best average on practical tasks like extracting facts from web pages or PDFs and answering questions when lots of distracting text is present.
- Ablations (what matters most?): Both gates help, but the erase gate contributes most of the gain. This makes sense: if you can carefully remove the right old content, you avoid interference and keep the memory clean.
- Efficiency: Training speed remains high and scales well with longer sequences, with only a small overhead compared to previous linear-attention models.
Why it matters:
- Models that can remember long documents accurately without huge memory cost are crucial for tasks like long-form question answering, code understanding, medical or legal document analysis, and multi-step reasoning across many pages.
5) What’s the bigger impact?
- Smarter memory editing: Decoupling erasing from writing is a simple but powerful idea. It reduces “interference,” where many facts crowded into a small memory blur together.
- Practical for long inputs: You can handle thousands of tokens with steady memory use, making it useful for servers and possibly edge devices.
- Plays well with hybrids: You can combine this fixed-memory recurrence for long-range information with a small sliding-window attention for precise local details. This keeps things fast and accurate.
- Future directions: The same principle—more precise control over what to remove and what to add—could inspire even better memory systems, leading to models that are both efficient and reliable over very long contexts.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable future work:
- Scaling behavior beyond 1.3B parameters and 100B tokens: do the gains from decoupled erase/write persist or widen at larger scales (e.g., 7B–70B) and longer training runs (≥1T tokens), and how do depth/width, number of heads, dk/dv affect outcomes?
- Ultra-long-context regime: the model is trained at 4K and evaluated up to 8K (with a 16K throughput plot); efficacy, stability, and interference control for 32K–1M contexts remain untested, as do training strategies for such lengths.
- Decoding latency and end-to-end inference metrics: while a forward-only recurrent kernel is provided, the paper does not report decode tokens/s, latency under batching, cache/memory footprint, or beam search performance relative to KDA, Mamba-3, and Transformers.
- Theoretical capacity and interference analysis: no formal characterization of memory capacity, retrieval error, or interference as a function of dk, dv, chunk size, and gate statistics; conditions for stable overwrite and non-interference under channel-wise erase/write are not derived.
- Stability guarantees: absent analysis of convergence and spectral stability when combining channel-wise decay with asymmetric erase factors (including the [0,2] erase scaling); criteria to prevent oscillation or drift in long streams are not established.
- Gate parameterization design space: only sigmoid gating is explored; alternatives (e.g., softmax/entmax over channels, temperature scheduling, sparsity/entropy regularizers, hard gating, low-rank/shared gate factors) and their effects on capacity, stability, and efficiency are open.
- Gate dynamics and interpretability: no analysis of learned gate patterns across layers/heads/tokens (e.g., which channels are erased vs. written for which structures), nor causal probes to validate the claimed selective erase/write behavior.
- Sensitivity to chunk size and algorithmic hyperparameters: chunk size C is fixed (64); effects of varying C, adaptive/dynamic chunking, and trade-offs between accuracy, numerical error in the triangular solve, and throughput are not studied.
- Precision and quantization robustness: the approach relies on fp32 accumulators and fp32 triangular solves; resilience to bf16/fp8/int8 (training and inference), mixed-precision triangular solves, and quantization-aware training is not evaluated.
- Numerical stability at extreme lengths: the decay accumulation and triangular inverse are recognized as precision-sensitive but not stress-tested on very deep sequences, very small/large gates, or low-precision hardware configurations.
- Hybrid architecture design choices: SWA window is fixed at 2K; no ablations on window size, adaptive/learned windows, or removal of SWA to quantify division-of-labor between recurrent memory and local attention across tasks.
- Generalization across domains and modalities: evaluations focus on English LM and retrieval; performance on multilingual corpora, code, math, long-form reasoning, and multimodal settings is unknown.
- Fine-tuning and instruction-following: the impact of decoupled gates under supervised fine-tuning, instruction tuning, RLHF, and tool/retrieval-augmented generation has not been assessed.
- Robustness and adversarial stress-tests: beyond RULER and selected retrieval tasks, robustness to noisy inputs, distractor overload, adversarial prompts, and domain shift (e.g., document structure perturbations) is not systematically measured.
- Fairness of baseline tuning: Mamba-3 MIMO rank is fixed at R=4 and other baselines may be under-tuned; sensitivity of conclusions to stronger baseline hyperparameter sweeps (rank, state size, discretization choices) is unverified.
- Grouped value head design: the paper repeats q, k, g, b across value-head groups; the trade-off between grouped vs. independent gates per value group, and the impact on capacity/efficiency, is not ablated.
- External-memory and retrieval augmentation: interoperability with explicit retrieval (kNN, memory tables), key-value caches, or tool-use pipelines to further reduce interference is unexplored.
- Composition with other recurrent advances: compatibility and synergies with MIMO fast-weights, complex rotations (as in Mamba-3), or multi-input/multi-output delta updates have not been tested.
- Regularization of gates: potential gate saturation (collapse to 0/1), temporal smoothness, or sparsity constraints and their effect on stability and generalization are not examined.
- Negative-eigenvalue variant: expanding b_t to [0,2] showed no clear gain at 1.3B; when, why, and at what scale this variant helps (or harms) stability and retrieval remains unclear.
- Memory utilization diagnostics: no direct measurements of collision rates, effective capacity (e.g., #distinct associations retained), or per-layer memory usage over long sequences in naturalistic data.
- Data scaling and curriculum: training uses FineWeb-Edu with 4K sequences; curricula for progressively longer contexts, mixture-of-length training, and their interaction with gate learning are open.
- Hardware portability and kernel generality: performance and numerical behavior on non-Hopper GPUs, non-NVIDIA hardware, and different compiler stacks are not reported; autotuning coverage and failure modes are not mapped.
- Safety, privacy, and calibration: the effects of decoupled memory edits on hallucination rates, calibration, and memorization/privacy risks are not addressed.
Practical Applications
Immediate Applications
The following applications can be built or piloted now by leveraging the published code, kernels, and training recipe for Gated DeltaNet-2, which preserves linear-time training/inference with constant memory and improves long-context retrieval and recall.
Long-context enterprise document QA and summarization (contracts, policies, manuals)
- Sector: Legal, Enterprise software
- What it does: Answers questions and produces summaries over very long documents or multi-document packs without quadratic attention cost, reducing interference among many references via decoupled erase/write gates.
- Tools/products/workflows: Hybrid Gated DeltaNet-2 + Sliding-Window Attention (SWA) block as a drop-in token mixer in an LLM; document ingestion pipeline with chunking and optional vector store; GPU-serving with fixed-size recurrent state instead of growing KV cache.
- Assumptions/dependencies: Performance reported at 1.3B on FineWeb-Edu; domain adaptation/fine-tuning recommended for legal corpora; best throughput on NVIDIA GPUs with Triton kernels; long-context formatting still benefits from SWA for local comparisons.
Retrieval-augmented generation (RAG) with improved context packing and recall
- Sector: Software, Search, Customer support
- What it does: Packs more retrieved passages per request with constant memory inference; decoupled erase/write reduces interference in distractor-heavy RAG, improving answer recall and grounding.
- Tools/products/workflows: Swap the Transformer attention mixer for Gated DeltaNet-2 in a RAG stack; preserve SWA for local reasoning; maintain smaller or no KV cache; integrate with existing retrievers and rerankers.
- Assumptions/dependencies: RAG quality also depends on retriever and prompt design; verify task-specific gains since benchmarks in the paper focus on recall-heavy settings.
PDF/HTML key–value extraction at scale
- Sector: Document AI, Compliance, Finance
- What it does: Improves structured field extraction from long PDFs and web pages (matches gains on FDA and SWDE) with stable recall over long, noisy contexts.
- Tools/products/workflows: Batch ETL pipeline feeding Gated DeltaNet-2-based models for field extraction; constant memory enables higher document lengths per GPU.
- Assumptions/dependencies: Domain fine-tuning improves robustness; layout-aware preprocessing still helpful.
Multi-turn customer support and CRM assistants with long session memory
- Sector: CX, Sales/CRM
- What it does: Maintains and edits long dialogue histories efficiently, reducing memory interference across many user issues within the same session.
- Tools/products/workflows: Replace attention module in existing assistants; maintain a small recurrent state across turns; log-aware SWA for local slot filling.
- Assumptions/dependencies: Guardrails and conversation safety layers still required; cross-session identity linking is a separate system concern.
Code assistants over large repositories and long files
- Sector: Developer tools
- What it does: Handles long code contexts (multi-file diffs, large notebooks) with constant memory; decoupled erase/write helps disambiguate many symbol associations.
- Tools/products/workflows: Use hybrid Gated DeltaNet-2 blocks in code LLMs; plug into IDEs; optionally combine with file-level retrieval.
- Assumptions/dependencies: Requires code-pretrained checkpoints; static analysis and LSP signals remain complementary.
Streaming meeting/call transcription and summarization
- Sector: Productivity, Enterprise IT
- What it does: Processes hours-long transcripts in a streaming fashion with fixed memory; selectively forgets stale context while writing salient updates.
- Tools/products/workflows: Real-time ASR followed by Gated DeltaNet-2 summarizer; chunkwise processing with minimal buffering; SWA for short-range discourse links.
- Assumptions/dependencies: ASR quality limits ceiling; tune decay/erase behavior for latency/recall trade-offs.
Log and telemetry analytics (security, reliability)
- Sector: Observability, Security
- What it does: Long-horizon pattern detection and anomaly explanation in streams of events without growing memory footprint.
- Tools/products/workflows: Online inference with the provided recurrent decoding kernel; alerting workflows consuming model outputs.
- Assumptions/dependencies: Domain adaptation and labeling strategy needed; privacy/governance controls for log data.
Cost/performance optimization for LLM serving
- Sector: Cloud/Inference platforms
- What it does: Reduces or eliminates the per-request KV cache growth by using a fixed-size state, enabling more concurrent users or longer contexts per GPU.
- Tools/products/workflows: Retool serving stack to store a small fp32 recurrent state; throughput tuning with Triton kernels; use hybrid blocks to match Transformer quality on local tasks.
- Assumptions/dependencies: Gains depend on request length distribution; some tasks still favor global attention if exact token–token interactions dominate.
Edge inference for long-context NLP on NVIDIA Jetson-class devices
- Sector: Embedded/Edge AI
- What it does: Brings long-context summarization or QA to constrained devices using constant-memory recurrence.
- Tools/products/workflows: Quantized Gated DeltaNet-2 variants; on-device chunkwise processing; local data privacy.
- Assumptions/dependencies: Kernel support and memory bandwidth on target hardware; potential accuracy loss from aggressive quantization.
Academic baseline for long-context memory interference research
- Sector: Academia/Research
- What it does: Provides a competitive open baseline with controllable gates for studying interference, forgetting, and fast-weight dynamics.
- Tools/products/workflows: Use the released code and ablations (channel vs scalar gates, erase range) to design experiments and coursework.
- Assumptions/dependencies: Reproducing results requires a matched training recipe and hardware; the paper's evaluations include RULER, LAMBADA, and real-world retrieval benchmarks.
Framework integration and kernels for practitioners
- Sector: Software tooling
- What it does: Incorporates the gate-aware WY chunkwise algorithm and fused Triton kernels into PyTorch/JAX ecosystems for wider adoption.
- Tools/products/workflows: Package as an attention drop-in; expose knobs for chunk size, SWA window, precision flags.
- Assumptions/dependencies: Maintenance of custom kernels; compatibility with BF16/FP32 and future GPUs; test coverage for variable-length batches.
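The serving-cost argument above (a fixed-size state instead of a growing KV cache) comes down to simple arithmetic. The sketch below compares the two footprints for a hypothetical configuration; the layer count, head count, head dimension, and data types are assumptions chosen for illustration, not the paper's 1.3B setup.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Softmax-attention cache: one key and one value vector per token,
    per layer, per KV head. Grows linearly with seq_len."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

def recurrent_state_bytes(n_layers, n_heads, d_k, d_v, bytes_per_elem=4):
    """Linear-attention state: one (d_k x d_v) matrix per layer and head,
    stored here in fp32. Independent of seq_len."""
    return n_layers * n_heads * d_k * d_v * bytes_per_elem

# Hypothetical configuration (assumption, for illustration only).
N_LAYERS, N_HEADS, HEAD_DIM = 24, 16, 128

state = recurrent_state_bytes(N_LAYERS, N_HEADS, HEAD_DIM, HEAD_DIM)
for seq_len in (4096, 32768, 262144):
    kv = kv_cache_bytes(seq_len, N_LAYERS, N_HEADS, HEAD_DIM)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:9.1f} MiB, "
          f"recurrent state {state / 2**20:5.1f} MiB")
```

A hybrid model that keeps sliding-window attention still stores a cache for those layers, but it is bounded by the window length rather than the full sequence length.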
Long-Term Applications
These applications are promising but may require larger-scale training, multimodal integration, regulatory work, or hardware/compiler co-design.
Lifelong, privacy-preserving personal assistants with controllable forgetting
- Sector: Consumer AI, Privacy
- What it could do: Maintain years-long histories with explicit, per-channel erase controls aligned to user preferences (e.g., “forget financial details”).
- Tools/products/workflows: Policy-aware gating APIs; on-device or federated deployment with constant memory.
- Assumptions/dependencies: Stronger safety/alignment; UI and policy layers to surface and verify forgetting; larger models and personalization.
Longitudinal clinical summarization and decision support
- Sector: Healthcare
- What it could do: Reason over multiyear EHR timelines, selectively retaining clinically salient signals while decaying noise.
- Tools/products/workflows: Fine-tuned medical LLMs with Gated DeltaNet-2 mixers; integration with EHR systems; audit logs of memory edits.
- Assumptions/dependencies: Regulatory compliance (HIPAA/GDPR); medical pretraining; rigorous validation; human-in-the-loop oversight.
Robotics and autonomy with long-horizon memory
- Sector: Robotics, Automotive
- What it could do: Stream sensor/action histories while editing state to prevent memory interference; support task decomposition over long horizons.
- Tools/products/workflows: Multimodal recurrent blocks (vision/audio/text) using decoupled gates; control stacks integrating fast-weight memory.
- Assumptions/dependencies: Multimodal extensions and real-time guarantees; sim2real transfer; safety certifications.
Financial analytics and compliance monitoring at scale
- Sector: Finance, RegTech
- What it could do: Scan continuous streams (filings, chats, trades) with selective erase/write to track entities and obligations across long contexts.
- Tools/products/workflows: Domain-adapted checkpoints; compliance dashboards; event-driven pipelines.
- Assumptions/dependencies: High-stakes accuracy; explainability of memory edits; robust handling of adversarial inputs.
Ultra-long context LLMs (100K–1M tokens) with constant memory
- Sector: Foundation models
- What it could do: Train and serve models handling book-length inputs and multi-episode histories without quadratic cost.
- Tools/products/workflows: Scaled Gated DeltaNet-2 layers; curriculum for long-context training; memory diagnostics for interference.
- Assumptions/dependencies: Larger models and datasets; stability at extreme lengths; improved chunkwise solvers and precision controls.
Hybrid architectures that combine decoupled delta-rule memory with SSM rotations (e.g., Mamba-3 MIMO)
- Sector: AI architecture R&D
- What it could do: Merge channel-wise erase/write with data-dependent rotations for richer dynamics and better decoding latency.
- Tools/products/workflows: New blocks that integrate WY updates with SSM inputs; kernel fusion strategies.
- Assumptions/dependencies: Nontrivial kernel and backward-pass complexity; careful state-size and latency trade-offs.
Continual learning with explicit associative memory editing
- Sector: ML research, Edge AI
- What it could do: Task-adaptive fast weights with selective erasure of stale associations, mitigating catastrophic forgetting.
- Tools/products/workflows: Training curricula that drive gate policies; evaluation on lifelong learning suites.
- Assumptions/dependencies: Stable optimization with gate dynamics; monitoring tools for interference and drift.
Energy- and cost-aware AI policy and procurement
- Sector: Public policy, Sustainability
- What it could do: Favor linear-time, constant-memory models for long-context workloads to lower energy/use-phase emissions and hardware cost.
- Tools/products/workflows: Benchmarks and reporting standards for energy per token vs. context length; procurement guidelines.
- Assumptions/dependencies: Transparent energy metrics; comparable quality benchmarks across architectures.
Multimodal long-context understanding (video/audio+text)
- Sector: Media, Safety
- What it could do: Handle hours-long video transcripts and audio streams with selective memory editing (e.g., tracking characters/threads).
- Tools/products/workflows: Tokenization and gating per modality; fusion layers; temporal SWA for local correlations.
- Assumptions/dependencies: Robust multimodal training; efficient tokenization; licensing for media datasets.
Hardware and compiler co-design for gate-aware WY kernels
- Sector: Semiconductors, Systems
- What it could do: Accelerate triangular solves and gate-aware dot products (A = (I + T)^{-1}) in tensor cores/ASICs; standardized ops in cuDNN/XLA.
- Tools/products/workflows: Primitive support for lower-triangular solves with mixed precision; autotuning for fused forward/backward kernels.
- Assumptions/dependencies: Vendor buy-in; sustained demand for recurrent linear mixers; correctness and numerical stability guarantees.
Safety and privacy-by-design via controllable erase semantics
- Sector: Trust & Safety
- What it could do: Implement audited “forgetting” at the fast-weight memory level to reduce accidental leakage across prompts or users.
- Tools/products/workflows: Telemetry of gate activations; policy constraints on erase ranges; red-teaming frameworks targeting interference.
- Assumptions/dependencies: Careful separation of per-user states; formalization of memory-edit guarantees; interaction with higher-level caches.
Notes on cross-cutting dependencies and assumptions
- Training and hardware: Reported results use 1.3B parameter models trained on 100B tokens and evaluated on NVIDIA GPUs with Triton-based fused kernels and chunk size C=64; reproducing efficiency assumes similar hardware and kernel availability.
- Hybrid design: For many tasks, pairing the recurrent mixer with SWA is important to capture exact local interactions; recurrent-only variants may underperform on purely local reasoning.
- Stability/precision: L2-normalized queries/keys, fp32 state/accumulators, and careful triangular solve precision are part of the recipe; deviations can affect long-context stability.
- Domain adaptation: Task/domain fine-tuning is recommended for regulated or specialized use (healthcare, legal, finance).
- Benchmarks vs. production: The strongest gains are on long-context retrieval and interference-heavy settings; validate on your production distribution before wholesale migration.
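To make the stability/precision note above concrete, the sketch below shows why a long running sum of small log-decays is usually kept in a wide accumulator. It emulates bfloat16 rounding in pure Python and compares a bfloat16 running sum with a float64 reference. The per-token log-decay of -0.001, the sequence length of 4096, and the naive accumulation loop are hypothetical choices for illustration; this is not the paper's kernel.

```python
import struct

def to_bf16(x):
    """Round a Python float to the nearest bfloat16 value (ties to even).

    Works on the float32 bit pattern and keeps its top 16 bits:
    1 sign bit, 8 exponent bits, 7 mantissa bits.
    NaN and infinity handling is omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def accumulate(values, round_fn):
    """Running sum where every partial result is rounded by round_fn."""
    total = 0.0
    for x in values:
        total = round_fn(total + x)
    return total

log_decays = [-0.001] * 4096                      # hypothetical per-token log-decay
exact_total = accumulate(log_decays, lambda x: x) # float64 reference
bf16_total = accumulate(log_decays, to_bf16)      # every partial sum rounded to bfloat16

print("float64 accumulator: ", exact_total)
print("bfloat16 accumulator:", bf16_total)
```

The bfloat16 accumulator stops changing once a single step is smaller than half the spacing between neighbouring representable values, so the cumulative decay it reports is far too weak.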
Glossary
- Asymmetric delta recurrence: A recurrence update where the erase and write directions are asymmetric due to gating and decay normalization. "Eq. 10 becomes a pure asymmetric delta recurrence,"
- Autoregressive decoding: Generating tokens one by one, conditioning on previously generated tokens, often with a recurrent kernel for inference. "A forward-only recurrent kernel is provided for autoregressive decoding at short sequence lengths."
- bfloat16: A 16-bit floating-point format with 8-bit exponent and 7-bit mantissa, used to speed training with acceptable precision. "In bfloat16, the error follows the bfloat16 mantissa."
- Causal mask: A masking matrix that enforces causality by preventing attention to future tokens. "where M is the causal mask."
- Causal score matrix: The masked attention score matrix ensuring each position only attends to past positions. "Define the causal score matrix"
- Channel-wise decay: Forgetting coefficients applied per channel (dimension) rather than as a single scalar, enabling finer control of memory retention. "channel-wise decay absorbed into asymmetric erase factors"
- Complex-valued state transitions: State updates that use complex numbers (e.g., rotations) in state-space models to increase expressivity. "complex-valued state transitions"
- Data-dependent decay: A decay factor that is computed from the input data to control forgetting dynamically. "Mamba-2 uses data-dependent decay to regulate the memory horizon [8]."
- Data-dependent rotations: Input-driven rotations applied to the state (often in complex SSMs) to enhance modeling capacity. "data-dependent rotations"
- Decay-normalized state: A state reparameterization that absorbs cumulative decay into the state for efficient computation. "Define the decay-normalized state S, by S, = Diag(r)S,."
- Delta rule: An update that subtracts the current read before writing the new value, performing a residual correction in memory. "DeltaNet replaces additive writes with the delta rule, enabling targeted overwrite"
- Exponential-trapezoidal discretization: A numerical integration scheme for discretizing continuous-time state-space models that blends exponential and trapezoidal rules. "exponential-trapezoidal discretization, complex-valued state transitions, and a multi-input, multi-output formulation"
- Fast-weight memory: A transient, rapidly updated associative memory implemented via fast-weight updates during sequence processing. "an online update of a fast-weight memory state"
- Gate-aware backward pass: A backpropagation method that explicitly accounts for per-channel gates inside matrix products to compute correct gradients. "a gate-aware backward pass that preserves efficient parallel training."
- Gated Delta Rule-2: The decoupled delta-rule update with separate channel-wise erase and write gates operating on key and value axes, respectively. "We refer to Eq. 10 as Gated Delta Rule-2."
- Hebbian-style accumulation: A learning rule that accumulates associations additively, inspired by Hebbian plasticity, often contrasted with delta-rule edits. "improving associative memory over Hebbian-style accumulation"
- Log-decay: The logarithm of the decay factors, used for numerical stability when accumulating decays across long sequences. "The log-decay follows the Gated DeltaNet parameterization,"
- Multi-input, multi-output (MIMO): A formulation where multiple inputs drive multiple outputs per step, increasing expressivity of the recurrence. "and a multi-input, multi-output formulation for stronger and more efficient recurrence [13]."
- Needle-In-A-Haystack (NIAH): Benchmarks that test long-context retrieval by hiding a “needle” among many distractors. "Single Needle-In-A-Haystack (S-NIAH) and Multi-Key Needle-In-A-Haystack (MK-NIAH) tasks from RULER."
- Negative-eigenvalue variant: A modification allowing negative eigenvalues in the state transition, affecting stability and spectrum. "We also support the negative-eigenvalue variant of [20]"
- Projector (in linear algebra): A matrix that idempotently projects onto a subspace; here, rank-one k kᵀ when the key is unit-normalized. "the matrix k kᵀ is a projector,"
- RMSNorm: Root Mean Square Layer Normalization, a normalization technique without mean subtraction. "the output is passed through an RMSNorm and SiLU gate"
- RULER: A suite for evaluating long-context retrieval and interference control in LLMs. "On the RULER needle-in-a-haystack tasks in Table 3,"
- Sliding-Window Attention (SWA): Attention restricted to a fixed local window to keep computation and memory linear in sequence length. "Sliding-Window Attention (SWA)"
- State-space model (SSM): A model class that represents sequences via latent states with linear dynamics and learned inputs/outputs. "the complex SSM view"
- Triangular solve: Solving a (lower/upper) triangular linear system, used here for forward substitution in chunkwise computations. "The triangular solve for A = (I + T)^{-1} is the most precision-sensitive part of the chunk computation."
- Triton kernels: GPU kernels written in the Triton language to fuse and accelerate custom tensor operations. "fused Triton kernels"
- UT transform: A specific linear-algebra transform used to accelerate computations in the chunkwise algorithm. "We use the UT transform [22]"
- Vector-Jacobian product: The operation used in reverse-mode autodiff to propagate gradients efficiently. "The inverse itself has the standard triangular vector-Jacobian product"
- WY form: A compact factorization (I − UYᵀ) used to represent products of rank-one updates efficiently. "the recurrence admits a compact WY form"
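The WY form and triangular solve entries above can be illustrated on the plain delta rule, which is simpler than the paper's setting. The sketch below is an assumption-laden simplification: it uses a scalar gate `beta` per token and no decay, so it is not the gate-aware algorithm from the paper. It shows that the token-by-token recurrence over one chunk equals a single linear system with a unit lower-triangular matrix I + T. The function names are hypothetical.

```python
import numpy as np

def delta_rule_sequential(S0, K, V, beta):
    """Token-by-token plain delta rule on a (d_k, d_v) state."""
    S = S0.copy()
    for k, v, b in zip(K, V, beta):
        S = S + b * np.outer(k, v - S.T @ k)   # subtract the old read, add the new value
    return S

def delta_rule_chunk(S0, K, V, beta):
    """Same final state for one chunk, computed with one triangular system.

    Writing S_t = S0 + sum_{i<=t} k_i u_i^T and substituting into the
    recurrence gives
        (I + T) U = Diag(beta) (V - K S0),
        T = Diag(beta) * strictly_lower(K K^T).
    """
    C = K.shape[0]
    T = beta[:, None] * np.tril(K @ K.T, k=-1)   # strictly lower triangular
    rhs = beta[:, None] * (V - K @ S0)
    # np.linalg.solve is a general solver; a dedicated triangular solver
    # (forward substitution) would exploit the structure of I + T.
    U = np.linalg.solve(np.eye(C) + T, rhs)
    return S0 + K.T @ U

rng = np.random.default_rng(0)
C, d_k, d_v = 8, 4, 3                            # tiny chunk for illustration
K = rng.normal(size=(C, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)    # L2-normalized keys
V = rng.normal(size=(C, d_v))
beta = rng.uniform(0.0, 1.0, size=C)
S0 = rng.normal(size=(d_k, d_v))

S_seq = delta_rule_sequential(S0, K, V, beta)
S_chunk = delta_rule_chunk(S0, K, V, beta)
print(np.allclose(S_seq, S_chunk))
```

Because I + T has ones on its diagonal, the system is always solvable, and the work inside a chunk becomes matrix products plus one solve instead of a serial loop over tokens.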