HiPPO: Polynomial Projection for Sequence Modeling

Updated 26 January 2026
  • HiPPO is a framework that projects sequential data onto polynomial bases, offering an optimal L² compression of historical inputs.
  • The approach uses continuous ODEs and discrete recurrences to ensure fast updates, timescale robustness, and bounded gradient behavior.
  • HiPPO underpins state-of-the-art models in recurrent neural networks, state-space systems, and Gaussian processes for efficient long-range sequence modeling.

High-Order Polynomial Projection Operators (HiPPO) define a general mathematical and algorithmic framework for online compression and representation of sequential data via optimal projection onto polynomial bases. By maintaining a low-dimensional set of polynomial coefficients that summarize all history up to the current time under a chosen measure, HiPPO yields provably optimal memory updates, efficient online recurrences, and tractable discretizations for both continuous and discrete data streams. The framework underlies a range of recent models for recurrent neural networks, state-space models, and Gaussian processes, enabling efficient long-range sequence modeling, timescale-robust memory, and structure-preserving approximations for matrix manifolds.

1. Mathematical Definition of the HiPPO Operator

At each time $t$, the HiPPO operator projects the history of an input function $f_{[0,t]}$ onto the polynomial subspace of degree $<N$ in a Hilbert space whose inner product is induced by a time-dependent measure $\mu^{(t)}$:

$$c_n(t) = \int_0^t f(s)\,P_n^{(t)}(s)\,d\mu^{(t)}(s), \qquad n=0,\ldots,N-1$$

where $\{P_n^{(t)}\}$ is an orthonormal polynomial basis with respect to $\mu^{(t)}$. The coefficient vector $c(t) \in \mathbb{R}^N$ serves as an optimal $L^2$ compression of the history, reconstructing $f_{\leq t}$ (in the sense of minimal projection error) via:

$$f_{\leq t}(\cdot) \approx \sum_{n=0}^{N-1} c_n(t)\,P_n^{(t)}(\cdot)$$

This operator generalizes a broad family of online memory mechanisms, where the choice of measure and basis recovers classical memory units such as Legendre Memory Units (LMU) and gating architectures (GRU variants), as well as new scalable mechanisms (Gu et al., 2020, Lee et al., 2024, Chen et al., 12 Feb 2025).
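As a concrete illustration, the projection and reconstruction above can be sketched numerically with scaled Legendre polynomials under a uniform measure on $[0,t]$. This is a minimal sketch using midpoint quadrature; the function names are illustrative, not from any cited codebase:

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def hippo_project(f, t, N, num=2000):
    """Coefficients c_n = (1/t) * integral of f(s) P_n(s) over [0, t],
    where P_n is the orthonormal scaled Legendre basis (midpoint rule)."""
    s = (np.arange(num) + 0.5) * t / num   # midpoint quadrature nodes
    x = 2 * s / t - 1                      # map [0, t] -> [-1, 1]
    c = np.empty(N)
    for n in range(N):
        Pn = np.sqrt(2 * n + 1) * Legendre.basis(n)(x)  # orthonormal basis
        c[n] = np.mean(f(s) * Pn)          # midpoint approximation of (1/t)*integral
    return c

def hippo_reconstruct(c, t, s):
    """Evaluate the truncated expansion sum_n c_n P_n at points s."""
    x = 2 * s / t - 1
    return sum(cn * np.sqrt(2 * n + 1) * Legendre.basis(n)(x)
               for n, cn in enumerate(c))

# Compress an oscillatory history into N = 16 coefficients
f = lambda s: np.sin(3 * s)
t = 5.0
c = hippo_project(f, t, N=16)
s = np.linspace(0, t, 200)
err = np.max(np.abs(f(s) - hippo_reconstruct(c, t, s)))
```

For this smooth input, 16 coefficients already reconstruct the full history on $[0,5]$ to small uniform error, illustrating the $L^2$-optimal compression.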

2. Continuous and Discrete Memory Updates via ODEs

Differentiating the HiPPO compression in $t$ yields a linear time-varying ordinary differential equation governing the evolution of the coefficients:

$$\frac{d}{dt} c(t) = A(t)\,c(t) + B(t)\,f(t)$$

where $A(t)\in\mathbb{R}^{N\times N}$ and $B(t)\in\mathbb{R}^N$ are determined by the measure and the chosen polynomial family. For discrete time series, this ODE admits forward-Euler and bilinear discretizations:

$$c_{k+1} = c_k + \Delta t\,\bigl(A(t_k)\,c_k + B(t_k)\,f_k\bigr)$$

$$c_{k+1} = M(\Delta t)\,c_k + N(\Delta t)\,f_k$$

These update rules are effective for irregularly-sampled data and missing values (Gu et al., 2020).
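A minimal sketch of the two update rules, assuming the standard bilinear (Tustin) form $M(\Delta t) = (I - \tfrac{\Delta t}{2}A)^{-1}(I + \tfrac{\Delta t}{2}A)$ and $N(\Delta t) = (I - \tfrac{\Delta t}{2}A)^{-1}\,\Delta t\,B$; function names are illustrative:

```python
import numpy as np

def euler_step(c, f_k, A, B, dt):
    """One forward-Euler step of c' = A c + B f."""
    return c + dt * (A @ c + B * f_k)

def bilinear_step(c, f_k, A, B, dt):
    """One bilinear step: (I - dt/2 A) c_{k+1} = (I + dt/2 A) c_k + dt B f_k."""
    I = np.eye(A.shape[0])
    rhs = (I + dt / 2 * A) @ c + dt * B * f_k
    return np.linalg.solve(I - dt / 2 * A, rhs)

# Sanity check on the scalar ODE c' = -c + f with constant input f = 1:
# the exact solution from c(0) = 0 is c(t) = 1 - exp(-t).
A = np.array([[-1.0]])
B = np.array([1.0])
c_e = c_b = np.zeros(1)
dt = 1e-3
for _ in range(1000):                    # integrate to t = 1
    c_e = euler_step(c_e, 1.0, A, B, dt)
    c_b = bilinear_step(c_b, 1.0, A, B, dt)
```

Because $\Delta t$ appears explicitly, the same functions handle irregular sampling by passing a per-step $\Delta t_k$; the bilinear form is preferred when $A$ is stiff.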

In the “scaled Legendre” (HiPPO-LegS) variant, a uniform measure on $[0,t]$ with basis $P_n^{(t)}(\tau) = \sqrt{\tfrac{2n+1}{t}}\,\mathrm{Leg}_n\!\left(\tfrac{2\tau}{t}-1\right)$ yields:

$$\frac{d}{dt}c(t) = -\frac{1}{t}A\,c(t) + \frac{1}{t}B\,f(t)$$

with $A_{n,k} = \sqrt{(2n+1)(2k+1)}$ if $n>k$, $A_{n,n} = n+1$, and $A_{n,k} = 0$ otherwise; $B_n = \sqrt{2n+1}$ (Gu et al., 2020, Lee et al., 2024, Chen et al., 12 Feb 2025).
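The LegS matrices are straightforward to construct from the stated entries. The sketch below also takes forward-Euler steps at integer times $t_k = k \ge 1$ (starting at $k=1$ to avoid the singularity at $t=0$); names are illustrative:

```python
import numpy as np

def legs_matrices(N):
    """HiPPO-LegS transition matrices:
    A[n,k] = sqrt((2n+1)(2k+1)) for n > k, n+1 on the diagonal, else 0;
    B[n] = sqrt(2n+1)."""
    n = np.arange(N)
    r = np.sqrt(2 * n + 1)
    A = np.tril(np.outer(r, r), k=-1) + np.diag(n + 1)
    return A, r

def legs_euler_step(c, f_k, k, A, B):
    """Forward-Euler step of c' = -(1/t) A c + (1/t) B f at t = k >= 1."""
    return c + (-A @ c + B * f_k) / k

A, B = legs_matrices(4)
c = np.zeros(4)
for k in range(1, 101):          # feed a constant input f = 1
    c = legs_euler_step(c, 1.0, k, A, B)
# c settles at (1, 0, 0, 0): a constant history is captured exactly
# by the degree-0 basis function alone.
```

The dense matrix-vector product here costs $O(N^2)$; the $O(N)$ step complexity cited above comes from exploiting the triangular-plus-rank-one structure of $A$ rather than materializing it densely.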

3. Timescale Robustness, Fast Updates, and Bounded Gradients

The HiPPO-LegS mechanism realizes significant theoretical properties:

  • Timescale Robustness: Coefficients behave equivariantly under time-dilation, eliminating explicit timescale hyperparameters.
  • Fast Updates: The triangular-plus-rank-one structure of $A$ admits $O(N)$ per-step complexity.
  • Bounded Gradients: Jacobians of future memory with respect to past inputs scale inversely with time, providing robust gradient flow.
  • Optimal Approximation: For $L$-Lipschitz input, the projection error is $O(tL/\sqrt{N})$; higher smoothness yields faster decay in $N$ (Gu et al., 2020).
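Timescale equivariance can be checked numerically: projecting a dilated input $f(\alpha s)$ over $[0, t/\alpha]$ reproduces the LegS coefficients of $f$ over $[0, t]$, so no timescale hyperparameter is needed. A minimal sketch (illustrative names, midpoint quadrature):

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def legs_coeffs(f, t, N, num=4000):
    """LegS projection: coefficients of f on [0, t] in the orthonormal
    scaled Legendre basis under the uniform measure (midpoint rule)."""
    s = (np.arange(num) + 0.5) * t / num
    x = 2 * s / t - 1
    return np.array([np.mean(np.sqrt(2 * n + 1) * Legendre.basis(n)(x) * f(s))
                     for n in range(N)])

f = lambda s: np.sin(s) + 0.3 * s
alpha = 2.0
c_f = legs_coeffs(f, t=4.0, N=8)                        # memory of f at time 4
c_g = legs_coeffs(lambda s: f(alpha * s), t=2.0, N=8)   # memory of dilated input at time 2
# c_f and c_g agree (up to quadrature error): dilating the input
# only relabels time, leaving the compressed memory unchanged.
```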

4. Applications in Sequence Learning and Time-Series Modeling

HiPPO has been integrated into diverse architectures:

  • Recurrent Neural Networks: HiPPO units enable state-of-the-art recurrent sequence classification (e.g., 98.3% accuracy on permuted MNIST) and robust long-memory tasks (Gu et al., 2020).
  • State Space Models: Deep SSMs with HiPPO-LegS ODE memory units approximate trajectories of input functions and succeed empirically at long-sequence modeling. The well-posedness and convergence of numerical discretizations are established: the HiPPO-LegS ODE is well-posed despite its singularity at $t=0$, and its discretizations converge for Riemann-integrable inputs, although arbitrary initial conditions are not permitted (Park et al., 2024).
  • Kolmogorov-Arnold Networks (KAN): The HiPPO-KAN architecture provides constant-parameter memory encoding, empirically outperforming vanilla KANs and RNNs at long window sizes. HiPPO-KAN maintains a fixed parameter count (e.g., 4,384 parameters for window sizes up to 1,200) with MSE of $3.26\times 10^{-7}$ (Lee et al., 2024).

Empirical studies corroborate scalability, parameter efficiency, and resolution of lagging artifacts in time series forecasting when MSE loss is computed on HiPPO coefficient vectors (Lee et al., 2024).

5. Structure-Preserving High-Order Retractions on Matrix Manifolds

HiPPO-inspired projection and polar decomposition yield high-order, structure-preserving approximations to Riemannian exponential maps on matrix manifolds, notably:

  • Unitary Group ($U(m)$): A degree-$n$ polynomial (scaled reverse Bessel) combined with polar decomposition approximates $e^{t\Omega}$ with error $O(t^{2n+1})$ (Gawlik et al., 2017).
  • Grassmannian ($\mathrm{Gr}(p,m)$) and Stiefel Manifold ($\mathrm{St}(p,m)$): Analogous projected polynomial constructions yield similarly high-order retractions (super-order in special cases), preserving exact manifold constraints and equivariance.
  • Averages on $U(m)$: Arithmetic and geometric means are superclose ($O(t^3)$ difference) when inputs are $O(t)$-apart.

These retractions are computationally tractable ($O(mp^2)$ via polar/QR algorithms) and maintain exact orthonormality or unitarity (Gawlik et al., 2017).
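A simplified sketch of the polar-projection idea: approximate $e^{t\Omega}$ by a polynomial, then project onto the unitary group via polar decomposition. Note this stand-in uses a plain truncated Taylor polynomial rather than the scaled reverse Bessel polynomials of the cited construction, so the $O(t^{2n+1})$ order claim does not apply here; exact unitarity is still enforced by the polar factor:

```python
import numpy as np
from scipy.linalg import polar, expm

def taylor_polar_retraction(Omega, t, degree=2):
    """Approximate exp(t*Omega) for skew-Hermitian Omega by a truncated
    Taylor polynomial, then project onto U(m) via polar decomposition."""
    m = Omega.shape[0]
    P = np.eye(m, dtype=complex)
    term = np.eye(m, dtype=complex)
    for k in range(1, degree + 1):
        term = term @ (t * Omega) / k      # accumulate t^k Omega^k / k!
        P = P + term
    U, _ = polar(P)                        # unitary factor: nearest unitary to P
    return U

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
Omega = (X - X.conj().T) / 2               # skew-Hermitian generator
U = taylor_polar_retraction(Omega, t=0.1)
unitarity_defect = np.linalg.norm(U.conj().T @ U - np.eye(4))
approx_error = np.linalg.norm(U - expm(0.1 * Omega))
```

The polar projection guarantees $U^{\ast}U = I$ to machine precision regardless of the polynomial's quality, which is the structure-preservation property emphasized above.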

6. Integration with Online Gaussian Processes and Interdomain Inducing Variables

HiPPO polynomial projections provide a foundation for online long-memory Gaussian process models:

  • Inducing Variables as HiPPO Coefficients: Interdomain SVGPs use time-varying polynomial basis projections $u_n^{(t)} := \int f(x)\,\phi_n^{(t)}(x)\,dx$, which optimally summarize all history up to $t$ (Chen et al., 12 Feb 2025).
  • Online Recurrences for Kernel Matrices: Cross-kernel and prior kernel matrices admit ODE-based updates in time, with recurrences derived from the HiPPO ODE (Chen et al., 12 Feb 2025).
  • OHSVGP: The Online HiPPO Sparse Variational GP implements these ideas, demonstrating superior predictive accuracy, memory retention, and computational efficiency compared to existing online GP methods, with each step costing only $O(M^2)$ for fixed memory size $M$ (Chen et al., 12 Feb 2025).

A plausible implication is that the HiPPO framework generalizes efficient long-memory mechanisms to Bayesian nonparametric models without parameter growth or catastrophic forgetting.

7. Table: HiPPO-Core Operators and Empirical Scaling

| Model/Algorithm | Memory Update Mechanism | Parameter Scaling |
|---|---|---|
| HiPPO-LegS RNN | ODE: $c' = A(t)\,c + B(t)\,u$ | $O(N)$ per step |
| HiPPO-KAN | SSM + KAN on $x \in \mathbb{R}^N$ | Constant in window size |
| Online HiPPO SVGP | Inducing variables via HiPPO projection | $O(M^2)$ per step |

HiPPO-based mechanisms consistently realize fixed or sublinear parameter scaling in window or sequence length, with robust long-term memory and efficient updates (Gu et al., 2020, Lee et al., 2024, Chen et al., 12 Feb 2025).


The HiPPO framework, rooted in algebraic polynomial projection and structure-preserving linear ODEs, offers a unified family of memory mechanisms for representation and learning in sequential, manifold, and probabilistic contexts. Its core advantages—optimality, scalability, structural guarantees, and empirical effectiveness—make it a cornerstone for future developments in sequence modeling and online functional approximation.
