xLSTM-PINN: Spectral Enhancement for PDE Solvers
- The paper introduces xLSTM-PINN, a spectral remodeling extension to PINNs that leverages memory-gated, multiscale xLSTM blocks to elevate high-frequency learning.
- It employs a staged frequency curriculum and adaptive residual reweighting to address spectral bias and improve convergence and extrapolation in solving PDEs.
- Empirical benchmarks on various PDE problems show significant accuracy gains, enhanced generalization, and superior performance compared to conventional PINNs.
xLSTM-PINN is a spectral remodeling extension of physics-informed neural networks (PINNs) designed to mitigate spectral bias, residual-data imbalance, and poor extrapolation in neural PDE solvers. By introducing memory-gated, multiscale feature extraction via xLSTM blocks, coupled with a staged frequency curriculum and adaptive residual reweighting, xLSTM-PINN systematically elevates the neural tangent kernel (NTK) spectrum for high-frequency learning. The method achieves both theoretically justified and empirically significant improvement in accuracy, convergence, and extrapolation on benchmark PDEs, without modifications to the standard physics loss or automatic differentiation routines (Tao et al., 16 Nov 2025).
1. Architecture: xLSTM Blocks and Gated Memory
xLSTM-PINN replaces the generic multilayer perceptron core of conventional PINNs with a stack of xLSTM blocks. Each block is composed of an internal multiscale, memory-gated recursion (“micro-time” steps) and a light, nonlinear feed-forward mixer.
During each internal micro-step within a block, the state evolves through five quantities:
- a hidden state
- a memory cell
- a duty-cycle scalar
- a logarithmic-scale gate accumulator
- the evolving block representation
Each micro-step comprises three stages (simplified from Eqs. 3–6 of the paper; see the sketch after this list):
- compute the gates and the candidate state
- apply log-space stabilization and normalized gating
- update the memory state and the block output
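As a concrete reading of these stages, the following is a minimal NumPy sketch of one memory-gated micro-step in the style of the sLSTM update from the xLSTM literature. The weight names and shapes, the tanh/sigmoid choices, and the exponential-gating form are assumptions; the paper's exact Eqs. 3–6 may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def xlstm_microstep(x, h, c, n, m, W, R, b):
    """One memory-gated micro-step (sLSTM-style sketch, not the paper's exact form).
    x       -- block input at this micro-step
    h, c    -- hidden state and memory cell
    n, m    -- duty-cycle scalar and log-scale gate accumulator
    W, R, b -- input weights, recurrent weights, biases (stacked for 4 pre-activations)
    """
    # Gate and candidate pre-activations from the input and previous hidden state.
    z_pre, i_pre, f_pre, o_pre = np.split(W @ x + R @ h + b, 4)
    z = np.tanh(z_pre)            # candidate state
    o = sigmoid(o_pre)            # output gate
    # Log-space stabilization of the exponential input/forget gates.
    m_new = np.maximum(f_pre + m, i_pre)
    i_gate = np.exp(i_pre - m_new)
    f_gate = np.exp(f_pre + m - m_new)
    # Memory cell and duty-cycle updates, then the normalized hidden output.
    c_new = f_gate * c + i_gate * z
    n_new = f_gate * n + i_gate
    h_new = o * (c_new / n_new)
    return h_new, c_new, n_new, m_new
```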
Each block aggregates the features produced at each micro-step (the different "scales") through a learnable, LSTM-style gating function. After all micro-steps, this weighted aggregation is merged into the layer's output.
A shallow gated feed-forward mixer then post-processes the aggregated features, with a sigmoid gate modulating a nonlinear activation branch.
Parameter sharing across micro-steps lets effective depth grow with the number of micro-steps while keeping the parameter count at the level of baseline MLP-based PINNs, but with richer representational capacity (Tao et al., 16 Nov 2025).
2. Spectral-Bias Mitigation via Frequency Curriculum and Residual Reweighting
xLSTM-PINN directly addresses the spectral bias inherent to standard PINN training. This is accomplished with two orthogonal scheduling mechanisms:
2.1 Frequency Curriculum:
During early training, the residual loss is softly low-pass filtered so that low-frequency residual components dominate the gradient. The frequency cutoff then grows smoothly to its final value over a fixed curriculum of steps, ensuring the network resolves large-scale structure before high-frequency detail.
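One plausible realization of this schedule, assuming a linear ramp of the cutoff and a sigmoidal low-pass mask over residual frequency content (both functional forms are assumptions, not the paper's exact filter):

```python
import numpy as np

def cutoff(step, curriculum_steps, k_min=2.0, k_max=64.0):
    """Frequency cutoff that grows from k_min to its final value k_max over the
    curriculum window (linear ramp and endpoint values chosen for illustration)."""
    frac = min(step / curriculum_steps, 1.0)
    return k_min + frac * (k_max - k_min)

def lowpass_weight(k, k_c, sharpness=4.0):
    """Soft low-pass weight for a residual mode with wavenumber k: close to 1
    below the cutoff k_c, decaying smoothly above it (sigmoidal mask)."""
    return 1.0 / (1.0 + np.exp(sharpness * (k - k_c)))
```

Early in training only low-wavenumber residual content contributes appreciably to the loss; as the cutoff grows, high-frequency detail is progressively exposed.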
2.2 Adaptive Residual Reweighting:
Each collocation point's residual is exponentially reweighted according to its current error magnitude, with the reweighting coefficient held in a narrow range (upper end around 12). This adaptively prioritizes harder, typically higher-frequency regions during gradient descent.
Combined with the xLSTM block's effect on the empirical NTK, whose high-frequency eigenvalues it amplifies, these procedures jointly lift the NTK tail and suppress spectral bias (Tao et al., 16 Nov 2025).
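A minimal sketch of the reweighting step, assuming an exponential weight on the squared pointwise residual normalized to unit mean; the exact functional form and the coefficient `beta` are assumptions in the range the paper suggests:

```python
import numpy as np

def adaptive_residual_weights(residuals, beta=10.0):
    """Exponentially up-weight collocation points with large current residuals.
    Weights are normalized to unit mean so the overall loss scale is preserved."""
    r2 = residuals.astype(float) ** 2
    r2 = r2 / (r2.max() + 1e-30)          # rescale for numerical stability
    w = np.exp(beta * r2)
    return w / w.mean()

def reweighted_residual_loss(residuals, beta=10.0):
    """Residual loss with adaptive per-point weights (the weights are treated as
    constants with respect to the network parameters during backpropagation)."""
    w = adaptive_residual_weights(residuals, beta)
    return np.mean(w * residuals ** 2)
```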
3. Optimization Protocols and Hyperparameters
The combined objective is
$$\mathcal{L} \;=\; \lambda_{r}\,\mathcal{L}_{\text{res}} + \lambda_{D}\,\mathcal{L}_{\text{Dir}} + \lambda_{N}\,\mathcal{L}_{\text{Neu}} + \lambda_{0}\,\mathcal{L}_{\text{IC}} + \mathcal{L}_{\text{reg}},$$
where the weights $\lambda$ balance residual, Dirichlet, Neumann, and initial-condition losses, and $\mathcal{L}_{\text{reg}}$ incorporates $L_2$ or Jacobian regularization as needed.
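A schematic PyTorch assembly of this objective for a toy 1D Poisson problem; the residual operator, the λ values, and the omission of the Neumann and initial-condition terms are illustrative simplifications, not the paper's implementation:

```python
import torch

def pde_residual(model, x):
    """Toy stand-in residual: 1D Poisson u''(x) + sin(x) = 0, computed with
    autograd (the paper's PDEs and differential operators differ)."""
    x = x.clone().requires_grad_(True)
    u = model(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return d2u + torch.sin(x)

def total_loss(model, x_res, x_dir, u_dir, lam=(1.0, 1.0, 1e-6)):
    """Composite objective: residual + Dirichlet + L2 regularization
    (Neumann and initial-condition terms would be added analogously)."""
    lam_r, lam_d, lam_reg = lam
    loss_res = (pde_residual(model, x_res) ** 2).mean()
    loss_dir = ((model(x_dir) - u_dir) ** 2).mean()
    loss_reg = sum(p.pow(2).sum() for p in model.parameters())
    return lam_r * loss_res + lam_d * loss_dir + lam_reg * loss_reg
```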
Empirically validated choices include:
- Block width, depth, and micro-step count chosen for a total budget of roughly 30,000 parameters, for parity with the baseline PINN
- Adam optimizer with a cosine learning-rate decay schedule
- Frequency cutoff grown smoothly over the curriculum window (Section 2.1)
- Residual reweighting coefficient held in a narrow range (up to roughly 12)
- LayerNorm applied within each block
- Training stabilization: xLSTM gates frozen (held at 0.5) during an initial warm-up period, gradient clipping at norm 1.0, and early stopping by validation residual
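The reported protocol (Adam, cosine learning-rate decay, gradient clipping at norm 1.0, an initial gate-freezing warm-up) might be wired up as follows; the learning rate, step counts, warm-up length, and the `build_xlstm_pinn` / `sample_batch` helpers are placeholders rather than values from the paper, and `total_loss` refers to the sketch above:

```python
import torch

model = build_xlstm_pinn()                        # placeholder model constructor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20_000)

for step in range(20_000):
    # Warm-up: keep gate parameters frozen for an initial phase (length assumed).
    frozen = step < 1_000
    for name, p in model.named_parameters():
        if "gate" in name:
            p.requires_grad_(not frozen)

    optimizer.zero_grad()
    x_res, x_dir, u_dir = sample_batch()          # placeholder collocation/BC sampler
    loss = total_loss(model, x_res, x_dir, u_dir)  # composite objective from above
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```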
4. Quantitative Benchmarks and Frequency Analysis
xLSTM-PINN and the baseline PINN were evaluated under identical sample and parameter budgets (3,000 interior/boundary samples, 30k parameters) on four PDE problems, with MSE, RMSE, MAE, and MaxAE reported for each:
- 1D Advection–Reaction
- 2D Laplace (mixed BCs)
- Steady heat in a disk (Robin BC)
- Anisotropic Poisson–Beam (fourth order)
Frequency-domain diagnostics substantiate the claimed suppression of spectral bias:
- Endpoint error of a plane-wave fit is lower at high wavenumbers, and the error plateau is lowered
- Spectral gain reaches roughly 1.5–3.0× in the high-wavenumber band
- Time to reach a fixed error threshold is shortened by 30–50%
- Resolvable bandwidth is increased by 25%
In field space, xLSTM-PINN produces sharply localized error, cleaner boundary transitions, and markedly less high-wavenumber contamination (roughly 10% of error energy in the high-frequency band versus 40% for the baseline) (Tao et al., 16 Nov 2025).
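These diagnostics can be reproduced in spirit with a simple FFT-based gain computation; the definition below (per-wavenumber ratio of error energies on a 1D grid) is an assumed reading of "spectral gain", not necessarily the paper's exact metric:

```python
import numpy as np

def spectral_gain(err_baseline, err_xlstm, dx=1.0):
    """Per-wavenumber ratio of baseline to xLSTM-PINN error energy on a uniform
    1D grid. Values above 1 at high k indicate reduced high-frequency error."""
    k = np.fft.rfftfreq(err_baseline.size, d=dx) * 2.0 * np.pi
    E_base  = np.abs(np.fft.rfft(err_baseline)) ** 2
    E_xlstm = np.abs(np.fft.rfft(err_xlstm)) ** 2
    return k, E_base / (E_xlstm + 1e-30)
```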
5. Extrapolation and Generalization
Extrapolation assessments demonstrate superior robustness:
- On 1D advection, training on a bounded interval and predicting beyond it keeps xLSTM-PINN's error near the 1% level well past the training horizon, whereas the baseline PINN's error grows exponentially soon after leaving it.
- For the 2D Laplace problem with 10% of the boundary data removed (an "O-shaped" deficit), xLSTM-PINN reconstructs the missing region with small error, while the baseline PINN exhibits substantial error.
Memory-gated micro-step recursions approximate an ODE in feature space, imparting greater robustness to off-manifold or out-of-distribution inputs. The cross-scale memory at each layer further enables data-deficient scales to be reconstructed from related features, smoothing the NTK spectrum and reducing overfitting to the observed spectral envelope (Tao et al., 16 Nov 2025).
6. Implications and Extensions
The xLSTM block is modular and can be integrated into any PINN extension, including Fourier-feature PINNs, conservative or stochastic variants, and multi-fidelity setups, without requiring changes to physics loss functions or optimizers. For time-dependent PDEs, the internal micro-step refinement can be extended to both spatial and temporal resolutions. In inverse or multi-fidelity modeling contexts, memory gating can serve a cross-scale autoencoding role, mediating between low- and high-fidelity surrogates.
Architectural spectral engineering—lifting the NTK tail at the representation level—is shown to be as effective as direct loss reweighting strategies for bias mitigation. A plausible implication is that further advances in PDE generalization could arise from hybrid approaches that combine representation- and loss-level spectral control (Tao et al., 16 Nov 2025).