
DeltaProduct: Expressive & Efficient Linear RNN

Updated 27 February 2026
  • DeltaProduct is a parametric family of linear recurrent neural networks that efficiently combines expressivity and tunable rank control using generalized Householder transformations.
  • It constructs state transitions via products of rank-1 Householder-style matrices, enabling robust encoding of complex group operations and long-context extrapolation.
  • Empirical results demonstrate that DeltaProduct outperforms standard DeltaNet and low-rank models on group word, Chomsky hierarchy, and language modeling tasks.

DeltaProduct is a parametric family of linear recurrent neural network (RNN) architectures designed to balance state-transition expressivity against computational efficiency in sequence modeling. Generalizing the DeltaNet architecture, DeltaProduct uses products of generalized Householder transformations to obtain state transitions with tunable rank while preserving stable, efficient inference. This mechanism gives linear RNNs greater capacity for state-tracking, associative recall, and long-context extrapolation, and it outperforms diagonal and lower-rank recurrent models on tasks that require encoding complex group transformations and permutations (Siems et al., 14 Feb 2025).

1. Architectural Foundations

DeltaProduct builds upon linear RNN recurrences. For hidden state $h_t \in \mathbb{R}^n$ and input $x_t \in \mathbb{R}^d$ at time $t$, a generic linear RNN layer computes

$$h_t = W_t h_{t-1} + U_t x_t$$

where $W_t$ and $U_t$ are the time-dependent state and input matrices, respectively. DeltaNet recurrences can be reformulated as a single online gradient descent (OGD) step per token on a quadratic associative recall loss, with update

$$h_t = (I - \beta_t k_t k_t^\top) h_{t-1} + \beta_t k_t (k_t^\top v_t)$$

where $k_t$ is a unit-norm "key", $v_t$ is a "value", and $\beta_t$ is the step size. This formulation realizes a diagonal plus rank-1 state transition via a Householder-type operation.
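To make the OGD reading concrete, the following is a minimal NumPy sketch of one update in the vector form written above. The function name `delta_step` and the random test vectors are illustrative assumptions, not part of the original work. It treats the update as a gradient step on the loss $\tfrac{1}{2}(k_t^\top h - k_t^\top v_t)^2$ and checks that with $\beta_t = 1$ the component of the state along $k_t$ is replaced by that of $v_t$.

```python
import numpy as np

def delta_step(h, k, v, beta):
    """One DeltaNet-style update: h <- (I - beta k k^T) h + beta k (k^T v)."""
    k = k / np.linalg.norm(k)  # enforce the unit-norm key assumption
    # Same as a gradient step on 0.5 * (k^T h - k^T v)^2 with step size beta.
    return h - beta * k * (k @ h - k @ v)

rng = np.random.default_rng(0)
n = 4
h_prev, k, v = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
k_unit = k / np.linalg.norm(k)

h_new = delta_step(h_prev, k, v, beta=1.0)

# With beta = 1 the component of the state along k matches that of v.
print(np.isclose(k_unit @ h_new, k_unit @ v))
```

Smaller step sizes move the component along $k_t$ only part of the way, which is why $\beta_t$ acts as a per-token write strength.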

DeltaProduct generalizes this process by performing $n_h$ OGD steps per token, each with independent $(k_{t,j}, v_{t,j}, \beta_{t,j})$ for $j = 1, \dots, n_h$. The update is

$$h_{t,j} = (I - \beta_{t,j} k_{t,j} k_{t,j}^\top) h_{t,j-1} + \beta_{t,j} k_{t,j} (k_{t,j}^\top v_{t,j}), \quad h_{t,0} = h_{t-1}$$

yielding the closed form

$$h_t = \Bigl(\prod_{j=1}^{n_h} (I - \beta_{t,j} k_{t,j} k_{t,j}^\top)\Bigr) h_{t-1} + \sum_{j=1}^{n_h} \Bigl(\prod_{k=j+1}^{n_h} (I - \beta_{t,k} k_{t,k} k_{t,k}^\top)\Bigr) \beta_{t,j} k_{t,j} (k_{t,j}^\top v_{t,j})$$

The state matrix $W_t$ thus becomes a product of $n_h$ Householder-style matrices, each the identity minus a rank-1 term, offering a diagonal plus rank-$n_h$ structure: $W_t - I$ has rank at most $n_h$.
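The equivalence between the $n_h$ sequential micro-steps and the closed form can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names (`householder`, `deltaproduct_step`, `deltaproduct_closed_form`) are hypothetical, and the products are read with later factors applied last, so the factor with the highest index sits leftmost.

```python
import numpy as np

def householder(k, beta):
    """Generalized Householder factor I - beta k k^T (k is assumed unit-norm)."""
    return np.eye(k.shape[0]) - beta * np.outer(k, k)

def deltaproduct_step(h_prev, keys, values, betas):
    """Apply the n_h micro-steps in order j = 1, ..., n_h."""
    h = h_prev
    for k, v, b in zip(keys, values, betas):
        h = householder(k, b) @ h + b * k * (k @ v)
    return h

def deltaproduct_closed_form(h_prev, keys, values, betas):
    """Closed form: W_t h_{t-1} plus each input term pushed through later factors."""
    n_h, n = keys.shape
    W = np.eye(n)
    for k, b in zip(keys, betas):
        W = householder(k, b) @ W  # later factors multiply from the left
    total = W @ h_prev
    for j in range(n_h):
        tail = np.eye(n)
        for k, b in zip(keys[j + 1:], betas[j + 1:]):
            tail = householder(k, b) @ tail
        total = total + tail @ (betas[j] * keys[j] * (keys[j] @ values[j]))
    return total

rng = np.random.default_rng(0)
n, n_h = 5, 3
keys = rng.normal(size=(n_h, n))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
values = rng.normal(size=(n_h, n))
betas = rng.uniform(0.0, 2.0, size=n_h)
h_prev = rng.normal(size=n)

h_seq = deltaproduct_step(h_prev, keys, values, betas)
h_closed = deltaproduct_closed_form(h_prev, keys, values, betas)
print(np.allclose(h_seq, h_closed))
```

The sequential loop costs $n_h$ rank-1 updates, whereas the closed form materializes $n \times n$ matrices and is shown here only to verify the algebra.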

2. Generalized Householder Framework

Classical Householder transformations reflect across hyperplanes orthogonal to a given vector. DeltaProduct employs a more general family of factors, each the identity minus a rank-1 term: $G_{t,j} = I - \beta_{t,j} k_{t,j} k_{t,j}^\top$ with $\|k_{t,j}\| = 1$ and $\beta_{t,j} \in [0, 2]$. Each factor has eigenvalue $1 - \beta_{t,j}$ along $k_{t,j}$ and eigenvalue $1$ on the orthogonal complement, so values $\beta_{t,j} > 1$ introduce a negative eigenvalue while the spectral norm remains bounded, $\|G_{t,j}\| \leq 1$, which ensures stability.

The product $W_t = \prod_{j=1}^{n_h} G_{t,j}$ differs from the identity by a matrix of rank at most $n_h$, and can be interpreted as a sequence of generalized reflections or projections, enabling flexible channel and token mixing. Equivalently, $W_t$ splits into a diagonal part $D_t$ (here $I$) and a correction of rank at most $n_h$ whose column space lies in the span of the keys $k_{t,1}, \dots, k_{t,n_h}$.
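The spectral properties of these factors are easy to confirm numerically for step sizes in $[0, 2]$. The following sketch uses arbitrary random unit-norm keys and two hand-picked step sizes (both are illustrative choices). It inspects the eigenvalues of a single factor, then checks that the product has spectral norm at most 1 and that it differs from the identity by a matrix of rank at most $n_h$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_h = 6, 2
keys = rng.normal(size=(n_h, n))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
betas = np.array([0.5, 2.0])  # both inside [0, 2]

factors = [np.eye(n) - b * np.outer(k, k) for k, b in zip(keys, betas)]

# Each symmetric factor has eigenvalue 1 - beta along k and 1 elsewhere.
eig_first = np.linalg.eigvalsh(factors[0])
print(np.round(eig_first, 6))

W = np.eye(n)
for G in factors:
    W = G @ W

spectral_norm = np.linalg.norm(W, 2)  # largest singular value
correction_rank = np.linalg.matrix_rank(W - np.eye(n))
print(spectral_norm <= 1 + 1e-9, correction_rank <= n_h)
```

The second factor uses $\beta = 2$, which makes it an exact reflection with eigenvalue $-1$ along its key.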

3. Theoretical Expressivity and State-Tracking

The core theoretical advance of DeltaProduct is its capacity to encode complex group actions within a single RNN layer. A key result is that a product of $n_h$ generalized Householder transformations can, in finite precision, solve any group-word problem whose group acts by permutations on at most $n_h + 1$ symbols. Thus, DeltaProduct$_{n_h}$ can track any permutation group with maximal degree $n_h + 1$:

  • $n_h = 1$ (DeltaNet): solves dihedral, parity, or $\mathbb{Z}_m$-word problems but fails on $S_3$ (symmetric group on 3 elements) or higher.
  • $n_h = 2$ suffices for all dihedral and $S_3$, $S_4$, and $A_5$ problems in one layer.
  • $n_h = 4$ extends tracking to $S_5$.

This expressivity requires distinct $k_{t,j}$ directions; repeating the same direction collapses the product rank. The bound $\|W_t\| \leq 1$ ensures robust propagation across long sequences, enhancing length extrapolation. Increasing $n_h$ systematically enlarges the class of encodable permutation groups (Siems et al., 14 Feb 2025).
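The link between Householder products and permutations can be illustrated with a standard linear-algebra fact: a reflection with $\beta = 2$ and key $(e_i - e_j)/\sqrt{2}$ swaps coordinates $i$ and $j$, and every permutation of $n_h + 1$ symbols is a product of at most $n_h$ such swaps. The sketch below is an illustration of that fact, not the construction used in the paper. The helper `swap_reflection` is a hypothetical name, and the state is assumed to be a one-hot vector.

```python
import numpy as np

def swap_reflection(n, i, j):
    """Householder reflection (beta = 2) that exchanges coordinates i and j."""
    k = np.zeros(n)
    k[i], k[j] = 1.0, -1.0
    k /= np.sqrt(2.0)  # unit-norm key
    return np.eye(n) - 2.0 * np.outer(k, k)

n = 3
swap_01 = swap_reflection(n, 0, 1)
swap_12 = swap_reflection(n, 1, 2)

# Two distinct key directions compose into a 3-cycle, an element of S_3
# that no single reflection can represent.
cycle = swap_12 @ swap_01
one_hot = np.eye(n)[0]  # state "token sits in position 0"
print(np.round(cycle @ one_hot, 6))

# Repeating the same direction collapses the product: two identical
# reflections cancel to the identity.
collapsed = swap_01 @ swap_01
print(np.allclose(collapsed, np.eye(n)))
```

A single factor cannot produce the 3-cycle because the cycle matrix differs from the identity by a rank-2 matrix, which matches the claim that $n_h = 1$ fails on $S_3$.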

4. Computational Efficiency and Parameterization

DeltaProduct incurs per-step compute cost linear in $n_h$, as each update requires $n_h$ inner products. Parameterization scales correspondingly: a typical projection of $x_t$ produces $(k_{t,1}, \dots, k_{t,n_h})$, $(v_{t,1}, \dots, v_{t,n_h})$, and $(\beta_{t,1}, \dots, \beta_{t,n_h})$ via matrices of size $d \times n$ for each key/value and vectors of size $d$ for each step size, giving a total layer parameter count of $\mathcal{O}(n_h \cdot d \cdot n)$.
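The parameter count described above can be made concrete with a small sketch. Everything here is an assumption for illustration: the names `init_params` and `project`, the initialization scale, and the `2 * sigmoid` squashing used to keep each step size inside $(0, 2)$ are not taken from the source. The sketch only mirrors the stated shapes, one $d \times n$ matrix per key and value and one length-$d$ vector per step size.

```python
import numpy as np

def init_params(d, n, n_h, rng):
    """Hypothetical per-layer parameters: one key/value/beta projection per micro-step."""
    return {
        "W_k": rng.normal(scale=d ** -0.5, size=(n_h, d, n)),
        "W_v": rng.normal(scale=d ** -0.5, size=(n_h, d, n)),
        "w_beta": rng.normal(scale=d ** -0.5, size=(n_h, d)),
    }

def project(params, x):
    """Map one input x (shape [d]) to n_h keys, values, and step sizes."""
    keys = np.einsum("d,hdn->hn", x, params["W_k"])
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
    values = np.einsum("d,hdn->hn", x, params["W_v"])
    # Assumed squashing: 2 * sigmoid keeps each beta inside (0, 2).
    betas = 2.0 / (1.0 + np.exp(-(params["w_beta"] @ x)))
    return keys, values, betas

d, n, n_h = 8, 16, 3
params = init_params(d, n, n_h, np.random.default_rng(0))
keys, values, betas = project(params, np.ones(d))

# The count grows linearly in n_h: n_h * (2 * d * n + d) parameters.
n_params = sum(p.size for p in params.values())
print(n_params, n_h * (2 * d * n + d))
```

Doubling $n_h$ in this sketch doubles both the projection parameters and the number of rank-1 updates per token.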

Comparative expressivity and efficiency:

| Model Class | State Matrix Structure | Expressivity Scope |
| --- | --- | --- |
| Diagonal RNNs | Diagonal | No cross-channel mixing; solves only regular languages. |
| DeltaNet ($n_h = 1$) | Diagonal + rank-1 | Group words with $\leq 2$ points moved ($\mathbb{Z}_m$, parity; not $S_3$ or higher). |
| DeltaProduct ($n_h > 1$) | Diagonal + rank-$n_h$ | Larger groups (e.g., $S_4$, $S_5$ depending on $n_h$); higher context-free and group complexity. |

Runtime per token scales linearly with $n_h$; the choice of $n_h$ therefore trades speed against capacity.

5. Empirical Performance

Experiments systematically validate DeltaProduct against DeltaNet and standard baselines across group-word, Chomsky-hierarchy, and language modeling tasks (Siems et al., 14 Feb 2025).

Group Word Problems

  • On $S_3$, $n_h = 1$ fails to extrapolate, but $n_h = 2$ achieves approximately 100% accuracy, matching the theoretical degree.
  • $S_4$ and $A_5$ are tractable with $n_h = 2$ (one layer), whereas DeltaNet with five layers still fails.
  • $S_5$ requires $n_h = 4$ for successful extrapolation to length 512, confirming the theoretical linkage between $n_h$ and permutation degree.
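For readers unfamiliar with the task, a group word problem presents a sequence of group elements and asks for the running product after each token. The sketch below generates such data for a symmetric group. It is a hypothetical generator for illustration only: the function names, uniform sampling, and tuple encoding of permutations are assumptions, and the data pipeline used in the cited experiments may differ.

```python
import itertools
import random

def compose(p, q):
    """Return the permutation 'apply q first, then p' (tuples map index -> image)."""
    return tuple(p[q[i]] for i in range(len(q)))

def make_word_problem(degree, length, seed=0):
    """Sample a sequence of permutations and the running product after each token."""
    rng = random.Random(seed)
    elements = list(itertools.permutations(range(degree)))  # all of S_degree
    tokens = [rng.choice(elements) for _ in range(length)]
    targets, state = [], tuple(range(degree))  # start at the identity
    for g in tokens:
        state = compose(g, state)
        targets.append(state)
    return tokens, targets

tokens, targets = make_word_problem(degree=3, length=8)
print(tokens[:3])
print(targets[:3])
```

Length extrapolation is then measured by training on short sequences and evaluating the per-token accuracy of the predicted running product on longer ones.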

Chomsky-Hierarchy Tasks

Validation on Parity and Modular Arithmetic (with/without brackets) reveals the following:

| Model | Parity | Mod. Arith. (no brackets) | Mod. Arith. (brackets) | Avg |
| --- | --- | --- | --- | --- |
| sLSTM | 1.000 | 0.787 | 0.173 | 0.653 |
| DeltaNet$_{[-1,1]}$ | 0.982 | 0.915 | 0.253 | 0.717 |
| DeltaProduct$_2$ | 0.896 | 0.887 | 0.266 | 0.683 |
| DeltaProduct$_3$ | 0.932 | 0.736 | 0.394 | 0.687 |
| DeltaProduct$_4$ | 0.982 | 0.893 | 0.460 | 0.778 |

DeltaProduct$_4$ outperforms DeltaNet by approximately 8.5% and sLSTM by approximately 19% in relative terms on the average score across these tasks (0.778 versus 0.717 and 0.653, respectively).
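The averages and relative gains follow directly from the table. The short script below recomputes them from the reported per-task scores. The variable names are illustrative, and the gains are computed from the rounded Avg column, so small rounding differences from the unrounded means are expected.

```python
# Per-task scores and reported averages, copied from the table above.
rows = {
    "sLSTM": (1.000, 0.787, 0.173, 0.653),
    "DeltaNet[-1,1]": (0.982, 0.915, 0.253, 0.717),
    "DeltaProduct_2": (0.896, 0.887, 0.266, 0.683),
    "DeltaProduct_3": (0.932, 0.736, 0.394, 0.687),
    "DeltaProduct_4": (0.982, 0.893, 0.460, 0.778),
}

# Check that each reported average matches the mean of its three task scores.
avg_ok = {
    name: abs(sum(tasks) / len(tasks) - avg) < 5e-4
    for name, (*tasks, avg) in rows.items()
}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

best = rows["DeltaProduct_4"][-1]
gain_vs_deltanet = relative_gain(best, rows["DeltaNet[-1,1]"][-1])
gain_vs_slstm = relative_gain(best, rows["sLSTM"][-1])
print(all(avg_ok.values()), round(gain_vs_deltanet, 1), round(gain_vs_slstm, 1))
```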

Language Modeling

Evaluated on FineWeb (35B tokens), DeltaProduct demonstrates superior perplexity and robustness to context length:

| Model | Params (M) | Wiki PPL | LAMBADA PPL |
| --- | --- | --- | --- |
| DeltaNet$_{[-1,1]}$ | 340 | 26.92 | 43.07 |
| DeltaProduct$_2$ | 392 | 26.43 | 30.66 |
| DeltaProduct$_3$ | 443 | 25.94 | 29.91 |

In addition, DeltaProduct maintains lower perplexity when extrapolating up to 16k context tokens, where DeltaNet's perplexity degrades steeply. Training stability is preserved or improved as $n_h$ increases.

DeltaProduct parameterizes a spectrum from fast but limited (diagonal RNNs) to highly expressive (dense RNNs) recurrence via $n_h$. With $n_h = 2$, DeltaProduct suffices for single-layer solutions to dihedral and $S_4$ word problems and significantly improves extrapolation and LM performance relative to DeltaNet at only $2\times$ the per-step cost.

Recommended settings include:

  • Moderate-complexity state-tracking (parity, small group problems, context-free languages): $n_h \in \{2, 3\}$.
  • Long-context language modeling demanding minimal perplexity degradation: $n_h \approx 3$.

DeltaProduct is well-suited for scenarios requiring both finite-precision tracking of nontrivial permutations or group operations in a single layer and robust extrapolation to contexts longer than those seen in training (Siems et al., 14 Feb 2025).
