DeltaProduct: Expressive & Efficient Linear RNN

Updated 27 February 2026

DeltaProduct is a parametric family of linear recurrent neural networks that combines expressive state transitions with tunable rank control through generalized Householder transformations.
It constructs state transitions as products of Householder-style matrices, each a rank-1 update of the identity, enabling robust encoding of complex group operations and long-context extrapolation.
Empirical results show that DeltaProduct, with a suitable number of Householder factors, outperforms standard DeltaNet and low-rank models on group word, Chomsky hierarchy, and language modeling tasks.

DeltaProduct is a parametric family of linear recurrent neural network (RNN) architectures designed to balance the trade-off between state-transition expressivity and computational efficiency in sequence modeling. By generalizing the DeltaNet architecture, DeltaProduct uses products of generalized Householder transformations to enable state transitions with tunable rank while preserving stable, efficient inference. This mechanism gives linear RNNs greater capacity for state-tracking, associative recall, and long-context extrapolation, and it outperforms diagonal and low-rank recurrent models on tasks that require encoding complex group transformations and permutations (Siems et al., 14 Feb 2025).

1. Architectural Foundations

DeltaProduct builds upon linear RNN recurrences. For hidden state $h_t \in \mathbb{R}^n$ and input $x_t \in \mathbb{R}^d$ at time $t$, a generic linear RNN layer computes

$h_t = W_t h_{t-1} + U_t x_t$

where $W_t$ and $U_t$ are the time-dependent state and input matrices, respectively. DeltaNet recurrences can be reformulated as a single online gradient descent (OGD) step per token on a quadratic associative recall loss, with update

$h_t = (I - \beta_t k_t k_t^\top) h_{t-1} + \beta_t k_t (k_t^\top v_t)$

where $k_t$ is a unit-norm "key", $v_t$ is a "value", and $\beta_t$ is the step size. This formulation realizes a diagonal plus rank-1 state transition (here with the identity as the diagonal part) via a Householder-type operation.
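
The equivalence between the update above and a gradient step can be checked numerically. The following is a minimal sketch, assuming the vector-state form written here and assuming the quadratic loss $\tfrac{1}{2}(k_t^\top h - k_t^\top v_t)^2$, which is the loss whose gradient reproduces the stated update; the function names are hypothetical and not taken from any library.

```python
import numpy as np

def deltanet_step(h_prev, k, v, beta):
    """One DeltaNet update in the vector-state form used in the text."""
    k = k / np.linalg.norm(k)  # keys are assumed unit-norm
    return h_prev - beta * k * (k @ h_prev) + beta * k * (k @ v)

def ogd_step(h_prev, k, v, beta):
    """One gradient step on the assumed loss 0.5 * (k^T h - k^T v)^2."""
    k = k / np.linalg.norm(k)
    grad = k * (k @ h_prev - k @ v)  # gradient of the loss w.r.t. h
    return h_prev - beta * grad

rng = np.random.default_rng(0)
n = 8
h0, k, v = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)

h_delta = deltanet_step(h0, k, v, beta=0.7)
h_ogd = ogd_step(h0, k, v, beta=0.7)
deltanet_matches_ogd = bool(np.allclose(h_delta, h_ogd))

# With beta = 1 the new state reproduces the target along the key direction.
k_unit = k / np.linalg.norm(k)
h_full = deltanet_step(h0, k, v, beta=1.0)
recall_is_exact = bool(np.isclose(k_unit @ h_full, k_unit @ v))
```

With $\beta_t = 1$ the component of the state along $k_t$ is overwritten entirely, which is the associative-recall reading of the update.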

DeltaProduct generalizes this process by performing $n_h$ OGD steps per token, each with independent $(k_{t,j}, v_{t,j}, \beta_{t,j})$ for $j = 1, \dots, n_h$. The update is

$h_{t,j} = (I - \beta_{t,j} k_{t,j} k_{t,j}^\top) h_{t,j-1} + \beta_{t,j} k_{t,j} (k_{t,j}^\top v_{t,j}), \quad h_{t,0} = h_{t-1}$

yielding the closed-form

$h_t = \Bigl(\prod_{j=1}^{n_h} (I - \beta_{t,j} k_{t,j} k_{t,j}^\top)\Bigr) h_{t-1} + \sum_{j=1}^{n_h} \Bigl(\prod_{k=j+1}^{n_h} (I - \beta_{t,k} k_{t,k} k_{t,k}^\top)\Bigr) \beta_{t,j} k_{t,j} (k_{t,j}^\top v_{t,j})$

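The state matrix $W_t$ thus becomes a product of $n_h$ Householder-style matrices, each a rank-1 update of the identity, offering a diagonal plus rank-$n_h$ structure. Because the steps are applied in sequence, the factor for $j = 1$ acts on $h_{t-1}$ first and the factor for $j = n_h$ acts last.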
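
As a sanity check on the closed form, the sketch below compares the step-by-step recursion with the expanded product-and-sum expression. It assumes the vector-state form used in this section and assumes the products are ordered so that later steps multiply from the left; all function names are hypothetical.

```python
import numpy as np

def deltaproduct_sequential(h_prev, K, V, betas):
    """Apply n_h rank-1 updates in order; K and V have shape (n_h, n)."""
    h = h_prev.copy()
    for k, v, b in zip(K, V, betas):
        h = h - b * k * (k @ h) + b * k * (k @ v)
    return h

def deltaproduct_closed_form(h_prev, K, V, betas):
    """Evaluate the expanded product-and-sum expression directly."""
    n_h, n = K.shape
    G = [np.eye(n) - b * np.outer(k, k) for k, b in zip(K, betas)]
    inputs = [b * k * (k @ v) for k, v, b in zip(K, V, betas)]

    def ordered_product(factors):
        # Later factors multiply from the left: G_last @ ... @ G_first.
        P = np.eye(n)
        for F in factors:
            P = F @ P
        return P

    h = ordered_product(G) @ h_prev
    for j in range(n_h):
        h = h + ordered_product(G[j + 1:]) @ inputs[j]
    return h

rng = np.random.default_rng(0)
n, n_h = 8, 3
K = rng.normal(size=(n_h, n))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # unit-norm keys
V = rng.normal(size=(n_h, n))
betas = rng.uniform(0.0, 2.0, size=n_h)
h_prev = rng.normal(size=n)

h_seq = deltaproduct_sequential(h_prev, K, V, betas)
h_closed = deltaproduct_closed_form(h_prev, K, V, betas)
closed_form_matches = bool(np.allclose(h_seq, h_closed))
```

The sequential version never forms an $n \times n$ matrix, which is why the per-token cost grows only linearly with $n_h$ in this form.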

2. Generalized Householder Framework

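Classical Householder transformations reflect across hyperplanes orthogonal to a given vector. DeltaProduct employs a more general family of rank-1 updates of the identity: $G_{t,j} = I - \beta_{t,j} k_{t,j} k_{t,j}^\top$ with $\|k_{t,j}\| = 1$. Each factor has eigenvalue $1 - \beta_{t,j}$ along $k_{t,j}$ and eigenvalue $1$ on the orthogonal complement, so $\beta_{t,j} \in [0, 1]$ keeps the eigenvalues in $[0, 1]$, while extending the range to $[0, 2]$ admits negative eigenvalues in $[-1, 1]$ (with $\beta_{t,j} = 2$ recovering a classical reflection). In both cases the spectral norm satisfies $\|G_{t,j}\| \leq 1$, which ensures stability.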

The product $W_t = \prod_{j=1}^{n_h} G_{t,j}$ differs from the identity by a matrix of rank at most $n_h$, and can be interpreted as a sequence of generalized reflections or projections, enabling flexible channel and token mixing. Equivalently, $W_t = I + R_t$ with $\operatorname{rank}(R_t) \leq n_h$, which is the sense in which the transition is diagonal plus rank-$n_h$.
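
The spectral properties of these factors are easy to verify numerically. The sketch below is illustrative only: the dimensions, the random keys, and the particular step sizes are arbitrary assumptions. It checks that each factor with unit-norm key has smallest eigenvalue $1 - \beta$, that the product has spectral norm at most 1, and that the product deviates from the identity by a matrix of rank at most $n_h$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_h = 6, 3
K = rng.normal(size=(n_h, n))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # unit-norm keys
betas = np.array([0.5, 1.0, 2.0])              # illustrative step sizes

W = np.eye(n)
min_eigs = []
for k, b in zip(K, betas):
    G = np.eye(n) - b * np.outer(k, k)
    # G is symmetric, so eigvalsh applies; eigenvalues come back ascending.
    min_eigs.append(np.linalg.eigvalsh(G)[0])
    W = G @ W  # later factors multiply from the left
min_eigs = np.array(min_eigs)

spectral_norm = np.linalg.norm(W, 2)
rank_of_update = np.linalg.matrix_rank(W - np.eye(n))
```

Here `min_eigs` comes out as approximately `1 - betas`, so a step size of 2 yields eigenvalue $-1$ (a reflection) and a step size of 1 yields eigenvalue $0$ (a projection).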

3. Theoretical Expressivity and State-Tracking

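The core theoretical advance of DeltaProduct is its capacity to encode complex group actions within a single RNN layer. A key result is that a product of $n_h$ generalized Householder transformations can, in finite precision, solve any group-word problem whose group acts by permutations on at most $n_h+1$ symbols: every permutation of $m$ symbols is a product of at most $m-1$ transpositions, and each transposition is a Householder reflection. This degree bound is sufficient rather than necessary, since some larger groups admit representations that need fewer reflections; for example, $S_4$ and $A_5$ are isomorphic to the rotation groups of the cube and the icosahedron, and every rotation of $\mathbb{R}^3$ is a product of two reflections. The resulting picture for DeltaProduct$_{n_h}$ is: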

$n_h = 1$ (DeltaNet): solves dihedral, parity, or $\mathbb{Z}_m$-word problems but fails on $S_3$ (symmetric group on 3 elements) or higher.
$n_h = 2$ suffices for all dihedral and $S_3$, $S_4$, and $A_5$ problems in one layer.
$n_h = 4$ extends tracking to $S_5$.

This expressivity requires distinct $k_{t,j}$ directions; repeating the same direction collapses the product rank. The use of $\|W_t\| \leq 1$ ensures robust propagation across long sequences, enhancing length extrapolation. Increasing $n_h$ systematically enlarges the class of encodable permutation groups (Siems et al., 14 Feb 2025).
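
To make the link between Householder factors and permutations concrete, the sketch below builds transpositions as reflections with $\beta = 2$ and composes two of them into a 3-cycle on three symbols. It illustrates the construction rather than proving the theorem; the helper name `transposition_householder` is hypothetical.

```python
import numpy as np

def transposition_householder(i, j, n):
    """Householder reflection (beta = 2) that swaps coordinates i and j."""
    k = np.zeros(n)
    k[i], k[j] = 1.0, -1.0
    k /= np.sqrt(2.0)  # unit-norm key
    return np.eye(n) - 2.0 * np.outer(k, k)

n = 3
swap01 = transposition_householder(0, 1, n)
swap12 = transposition_householder(1, 2, n)

# Two distinct key directions (n_h = 2) compose into a 3-cycle.
cycle = swap12 @ swap01

is_swap = bool(np.allclose(swap01, [[0, 1, 0], [1, 0, 0], [0, 0, 1]]))
cycle_has_order_three = bool(np.allclose(cycle @ cycle @ cycle, np.eye(n)))
cycle_is_symmetric = bool(np.allclose(cycle, cycle.T))
cycle_det = float(np.linalg.det(cycle))
```

Every single factor $I - \beta k k^\top$ is a symmetric matrix, whereas the 3-cycle is not symmetric, so no single factor can equal it. Using the same key twice would give `swap01 @ swap01`, which is the identity, matching the remark that repeated directions collapse the product.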

4. Computational Efficiency and Parameterization

DeltaProduct incurs per-step compute cost linear in $n_h$, as each token update applies $n_h$ rank-1 steps, each built from inner products with a key. Parameterization scales correspondingly: a typical projection of $x_t$ produces $(k_{t,1},...,k_{t,n_h})$, $(v_{t,1},...,v_{t,n_h})$, $(\beta_{t,1},..., \beta_{t,n_h})$ via matrices of size $d \times n$ for each key/value and $d$ for step-size, giving a total layer parameter count of $\mathcal{O}(n_h \cdot d \cdot n)$.

Comparative expressivity and efficiency:

Model Class	State Matrix Structure	Expressivity Scope
Diagonal RNNs	Diagonal	No cross-channel mixing; solves only regular languages.
DeltaNet ( $n_h=1$ )	Diagonal + rank-1	Group words with $\leq$ 2 points moved ( $\mathbb{Z}_m$ , parity, not $S_3$ or higher).
DeltaProduct ( $n_h > 1$ )	Diagonal + rank- $n_h$	Larger groups (e.g., $S_4$ , $S_5$ depending on $n_h$ ); higher context-free and group complexity.

Runtime per token scales linearly in $n_h$, so the choice of $n_h$ trades speed against capacity.
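
The scaling of the projection parameters can be tallied with a few lines of arithmetic. This is a rough sketch under the assumptions stated in this section (one $d \times n$ matrix per key, one per value, and a length-$d$ vector per step size, for each of the $n_h$ steps); it ignores biases, output projections, and everything else in a full model, and the values of `d` and `n` are hypothetical.

```python
def deltaproduct_projection_params(d, n, n_h):
    """Count key, value, and step-size projection parameters only.

    Assumes one d x n matrix per key, one d x n matrix per value, and one
    length-d vector per step size, repeated for each of the n_h steps.
    """
    per_step = 2 * d * n + d
    return n_h * per_step

# Hypothetical sizes, chosen only to show the linear growth in n_h.
counts = {
    n_h: deltaproduct_projection_params(d=1024, n=128, n_h=n_h)
    for n_h in (1, 2, 3, 4)
}
```

Under these assumptions the count for $n_h = 2$ is exactly twice the count for $n_h = 1$, mirroring the linear growth in per-token runtime.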

5. Empirical Performance

Experiments systematically validate DeltaProduct against DeltaNet and standard baselines across group-word, Chomsky-hierarchy, and language modeling tasks (Siems et al., 14 Feb 2025).

Group Word Problems

On $S_3$, $n_h=1$ fails to extrapolate, but $n_h=2$ achieves $\sim$100% accuracy, matching the theoretical degree.
$S_4$ and $A_5$ are tractable with $n_h=2$ (one layer), whereas DeltaNet with five layers still fails.
$S_5$ requires $n_h=4$ for successful extrapolation to length-512, confirming the theoretical linkage between $n_h$ and permutation degree.

Chomsky-Hierarchy Tasks

Validation on Parity and Modular Arithmetic (with/without brackets) reveals the following:

Model	Parity	Mod Arith (no [])	Mod Arith ([])	Avg
sLSTM	1.000	0.787	0.173	0.653
DeltaNet $_{\text{[-1,1]}}$	0.982	0.915	0.253	0.717
DeltaProduct $_2$	0.896	0.887	0.266	0.683
DeltaProduct $_3$	0.932	0.736	0.394	0.687
DeltaProduct $_4$	0.982	0.893	0.460	0.778

On the average across these tasks, DeltaProduct$_4$ (0.778) outperforms DeltaNet (0.717) by approximately 8.5% and sLSTM (0.653) by about 19%, both measured as relative improvements.
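
The averages and the relative gains can be recomputed directly from the table. The sketch below uses only the numbers reported above; the dictionary keys are plain-text labels chosen for this example.

```python
# Per-task scores copied from the table: Parity, Mod Arith (no []), Mod Arith ([]).
scores = {
    "sLSTM": (1.000, 0.787, 0.173),
    "DeltaNet[-1,1]": (0.982, 0.915, 0.253),
    "DeltaProduct_2": (0.896, 0.887, 0.266),
    "DeltaProduct_3": (0.932, 0.736, 0.394),
    "DeltaProduct_4": (0.982, 0.893, 0.460),
}

avg = {name: sum(vals) / len(vals) for name, vals in scores.items()}

def relative_gain(model, baseline):
    """Relative improvement of model over baseline on the average score."""
    return (avg[model] - avg[baseline]) / avg[baseline]

gain_vs_deltanet = relative_gain("DeltaProduct_4", "DeltaNet[-1,1]")
gain_vs_slstm = relative_gain("DeltaProduct_4", "sLSTM")
best_model = max(avg, key=avg.get)
```

The gain over DeltaNet lands between 8.5% and 8.6% depending on whether the averages are rounded to three decimals first. Note also that DeltaProduct$_2$ and DeltaProduct$_3$ average below DeltaNet here, so the advantage on this benchmark appears only at $n_h = 4$.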

Language Modeling

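Evaluated on FineWeb (35B tokens), DeltaProduct attains lower perplexity than DeltaNet and is more robust to context length; the DeltaProduct models listed also have more parameters, so the comparison is not parameter-matched: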

Model	Params (M)	WikiPPL	LAMBADA PPL
DeltaNet $_{\text{[-1,1]}}$	340	26.92	43.07
DeltaProduct $_2$	392	26.43	30.66
DeltaProduct $_3$	443	25.94	29.91

In addition, DeltaProduct maintains lower perplexity for extrapolation up to 16k context tokens, where DeltaNet's perplexity degrades steeply. Training stability is preserved or improved as $n_h$ increases.

6. Practical Implications and Recommended Usage

DeltaProduct parameterizes a spectrum from fast but limited (diagonal RNNs) to highly expressive (dense RNNs) recurrence via $n_h$. With $n_h=2$, DeltaProduct suffices for single-layer solutions to dihedral and $S_4$ word problems and significantly improves extrapolation and LM performance relative to DeltaNet at only $2\times$ the per-step cost.

Recommended settings include:

Moderate-complexity state-tracking (parity, small group problems, context-free languages): $n_h \in \{2, 3\}$.
Long-context language modeling demanding minimal perplexity degradation: $n_h \approx 3$.

DeltaProduct is well-suited for scenarios requiring both finite-precision tracking of nontrivial permutations or group operations in a single layer and robust extrapolation to contexts longer than those seen in training (Siems et al., 14 Feb 2025).

References

Siems et al. (2025). DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products.