Rough Transformers for Time Series

Updated 25 November 2025
  • Rough Transformers are neural network models that utilize truncated path signatures to convert irregular time series into continuous, invariant representations.
  • They employ a multi-view signature transform with multi-head attention to capture both local and global temporal patterns while reducing computational complexity.
  • Empirical benchmarks in scientific and medical applications demonstrate that Rough Transformers outperform traditional models in speed, memory usage, and accuracy.

Rough Transformers are a class of neural network architectures designed for modeling continuous-time, irregularly sampled time series while efficiently capturing long-range dependencies. By leveraging truncated path signatures as time-reparametrization-invariant features and applying multi-head attention over a low-dimensional representation, Rough Transformers (abbreviated "RFormer") achieve robust, scalable performance, with computational efficiency rivaling or surpassing both vanilla Transformers and Neural ODE-based models in scientific and medical applications (Moreno-Pino et al., 15 Mar 2024; Moreno-Pino et al., 31 May 2024).

1. Motivation and Theoretical Foundation

Real-world time series—particularly in domains such as medicine and finance—are often characterized by irregular sampling intervals, missing or non-uniformly spaced data points, and latent dependencies extending over thousands of time steps. Classical recurrent models and their continuous-time extensions (RNNs, LSTMs, ODE-RNNs, and Neural CDEs) manage irregular sampling by evolving hidden states, in the continuous-time variants via ODE/CDE solvers, but their computational and memory costs scale unfavorably with sequence length $L$ and solver mesh size, since they must carry latent states across very long sequences or solve ODEs/CDEs repeatedly. Standard Transformer architectures—originally designed for discrete, evenly spaced sequences—can capture global dependencies via attention but require fixed-length, uniformly sampled data and incur $\mathcal{O}(L^2)$ memory and compute scaling, which becomes prohibitive for long sequences. Furthermore, their positional encodings degrade or fail under time-warping or missing data (Moreno-Pino et al., 15 Mar 2024).

Rough Transformers address these limitations by lifting the input time series into a continuous-time path, extracting rich local and global signature features via iterated integrals, and operating Transformer attention only over a fixed number $M \ll L$ of "views." This construction achieves invariance to irregular sampling and sequence length, enables both local and global temporal context modeling, and drastically reduces computational costs without sacrificing predictive power.

2. Continuous-Time Signature Representation

Given a time series $X = ((t_0, x_0), \ldots, (t_L, x_L))$, where $0 = t_0 < t_1 < \dots < t_L$, Rough Transformers first form the piecewise-linear interpolation:

$$\tilde{X}(t) = x_k + \frac{t - t_k}{t_{k+1} - t_k}\,(x_{k+1} - x_k), \qquad t \in [t_k, t_{k+1}].$$
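
For concreteness, the following NumPy sketch evaluates $\tilde{X}(t)$ on a query grid from irregularly spaced samples; the sample times, values, and grid are illustrative placeholders rather than data from the papers.

```python
import numpy as np

def piecewise_linear(ts, xs, t_query):
    """Evaluate the piecewise-linear interpolant X~(t) of the samples (ts, xs).

    ts:      (L+1,) strictly increasing sample times
    xs:      (L+1, d) observed values
    t_query: (Q,) times at which to evaluate the interpolant
    """
    # np.interp handles one channel at a time, so interpolate each of the d channels.
    return np.stack(
        [np.interp(t_query, ts, xs[:, j]) for j in range(xs.shape[1])],
        axis=-1,
    )

# Irregularly sampled 2-d series (illustrative values only).
ts = np.array([0.0, 0.3, 0.35, 1.1, 2.0])
xs = np.random.default_rng(0).normal(size=(5, 2))
x_tilde = piecewise_linear(ts, xs, np.linspace(0.0, 2.0, 50))   # shape (50, 2)
```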

Next, for any smooth path $f : [s, t] \rightarrow \mathbb{R}^d$, the path signature $S(f)_{s,t}$ is the sequence of iterated integrals:

$$S(f)_{s,t} = \left( 1,\ \int_s^t f'(u)\, du,\ \iint_{s < u_1 < u_2 < t} f'(u_1) \otimes f'(u_2)\, du_1\, du_2,\ \dots \right)$$

This infinite sequence is truncated to order $n$, yielding $S(f)^{\leq n}_{s,t}$, which summarizes fine (local) and coarse (global) time-series structure invariant under smooth reparametrizations of time. For each linear segment $[t_k, t_{k+1}]$ with increment $\Delta x_k = x_{k+1} - x_k$:

$$S(\widehat{X}_k)_{t_k, t_{k+1}} = \left( 1,\ \Delta x_k,\ \tfrac{1}{2}\,\Delta x_k \otimes \Delta x_k,\ \ldots,\ \tfrac{1}{n!}\,\Delta x_k^{\otimes n} \right)$$
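
The segment formula above, combined with Chen's identity for concatenating path pieces, gives a direct way to compute signatures of piecewise-linear paths. The sketch below is a minimal NumPy implementation truncated at depth $n = 2$; it is illustrative only, and a practical pipeline might instead rely on a dedicated signature library.

```python
import numpy as np

def segment_signature(dx):
    """Depth-2 signature of one linear segment with increment dx: (1, dx, dx (x) dx / 2)."""
    return 1.0, dx, 0.5 * np.outer(dx, dx)

def chen_product(sig_a, sig_b):
    """Combine signatures of two consecutive path pieces via Chen's identity (depth 2)."""
    _, a1, a2 = sig_a
    _, b1, b2 = sig_b
    return 1.0, a1 + b1, a2 + b2 + np.outer(a1, b1)

def truncated_signature(xs):
    """Depth-2 signature of the piecewise-linear path through the rows of xs (shape (K, d))."""
    d = xs.shape[1]
    sig = (1.0, np.zeros(d), np.zeros((d, d)))           # signature of the trivial path
    for k in range(len(xs) - 1):
        sig = chen_product(sig, segment_signature(xs[k + 1] - xs[k]))
    return sig

xs = np.random.default_rng(1).normal(size=(6, 3))         # five segments in R^3
_, level1, level2 = truncated_signature(xs)
assert np.allclose(level1, xs[-1] - xs[0])                # level 1 is the total increment
```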

3. Multi-View Signature Transform and Attention

To efficiently summarize both local and global time-series information, the multi-view signature transform computes, at $M$ fixed "view" times $0 = \tau_0 < \tau_1 < \dots < \tau_{M-1} \le T$:

  • Global signature: $S(\tilde{X})_{0, \tau_m}^{\leq n}$ (integrating over the full path up to $\tau_m$)
  • Local signature: $S(\tilde{X})_{\tau_{m-1}, \tau_m}^{\leq n}$ (capturing increment structure over just $[\tau_{m-1}, \tau_m]$)

These are concatenated as:

$$M(X)_m = \big[\, S(\tilde{X})_{0, \tau_m}^{\leq n} \;\big\|\; S(\tilde{X})_{\tau_{m-1}, \tau_m}^{\leq n} \,\big] \in \mathbb{R}^{\bar{d}}$$

yielding $M(X) \in \mathbb{R}^{M \times \bar{d}}$, where $\bar{d}$ grows polynomially in the input dimension $d$ for a fixed signature depth $n$ (each depth-$n$ truncated signature contributes $\sum_{k=0}^{n} d^k$ components per view). A sketch of this multi-view construction is given below.
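
The following is a minimal sketch of the multi-view transform, again truncated at depth $n = 2$ and with signatures flattened into vectors; the number of views and their placement at evenly spaced indices are assumptions made for illustration.

```python
import numpy as np

def depth2_signature(xs):
    """Flattened depth-2 signature (constant term dropped) of a piecewise-linear path."""
    d = xs.shape[1]
    lvl1, lvl2 = np.zeros(d), np.zeros((d, d))
    for k in range(len(xs) - 1):
        dx = xs[k + 1] - xs[k]
        lvl2 += np.outer(lvl1, dx) + 0.5 * np.outer(dx, dx)   # Chen's identity at depth 2
        lvl1 += dx
    return np.concatenate([lvl1, lvl2.ravel()])               # length d + d^2

def multi_view_signature(xs, num_views):
    """Stack global and local depth-2 signatures at num_views evenly spaced view indices."""
    view_idx = np.linspace(1, len(xs) - 1, num_views).astype(int)
    rows, prev = [], 0
    for idx in view_idx:
        global_sig = depth2_signature(xs[: idx + 1])          # signature over [0, tau_m]
        local_sig = depth2_signature(xs[prev: idx + 1])       # signature over [tau_{m-1}, tau_m]
        rows.append(np.concatenate([global_sig, local_sig]))
        prev = idx
    return np.stack(rows)                                     # shape (M, d_bar)

path = np.random.default_rng(2).normal(size=(2000, 3)).cumsum(axis=0)  # long path in R^3
MX = multi_view_signature(path, num_views=32)
print(MX.shape)   # (32, 24): d_bar = 2 * (d + d^2) with d = 3
```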

This matrix forms the input to standard multi-head scaled dot-product attention. For each attention head $h$, define projections $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{\bar{d} \times d'}$:

  • $Q_h = M(X)\, W_h^Q$
  • $K_h = M(X)\, W_h^K$
  • $V_h = M(X)\, W_h^V$

Attention is computed as:

$$A_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d'}}\right) V_h \in \mathbb{R}^{M \times d'}$$

Stacking all heads and following with feed-forward, normalization, and residual blocks yields one RFormer block. Critically, all attention operations occur on the $M$-view representations, so the dominant cost is $\mathcal{O}(M^2)$, regardless of the original sequence length $L$. No ODE or CDE solver is involved, but the signature feature $S(f)_{0,t}$ can be viewed as encoding the solution map of a canonical linear ODE driven by $f$ (Moreno-Pino et al., 15 Mar 2024).
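
As a minimal illustration of the attention step, the NumPy sketch below applies multi-head scaled dot-product attention to a view matrix of shape $(M, \bar{d})$; the head count, projection dimension $d'$, and random projection matrices are placeholders standing in for learned parameters in a real implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(MX, num_heads, d_prime, rng):
    """One multi-head scaled dot-product attention pass over the (M, d_bar) view matrix."""
    d_bar = MX.shape[1]
    heads = []
    for _ in range(num_heads):
        # Random stand-ins for the learned projections W_h^Q, W_h^K, W_h^V.
        Wq, Wk, Wv = (rng.normal(size=(d_bar, d_prime)) / np.sqrt(d_bar) for _ in range(3))
        Q, K, V = MX @ Wq, MX @ Wk, MX @ Wv
        heads.append(softmax(Q @ K.T / np.sqrt(d_prime)) @ V)   # (M, d_prime) per head
    return np.concatenate(heads, axis=-1)                       # (M, num_heads * d_prime)

rng = np.random.default_rng(3)
MX = rng.normal(size=(32, 24))     # e.g. the multi-view matrix from the previous sketch
out = multi_head_attention(MX, num_heads=4, d_prime=16, rng=rng)
print(out.shape)                   # (32, 64): the cost scales with M^2, independent of L
```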

4. Properties: Invariance, Robustness, and Decoupled Complexity

Path signatures are invariant under smooth time reparametrization: $M(X)$ depends on the geometric path traversed by $\tilde{X}$, and not on specific sampling times or frequencies. This endows Rough Transformers with several key properties:

  • Robustness to missing or irregularly sampled data: Iterated integrals commute with time-warping, rendering the model insensitive to sampling irregularities or dropout.
  • Modeling of both local and global dependencies: Local signatures act analogously to convolutional filters on small windows; global signatures capture higher-order interactions and long-term dependencies, as guaranteed by the approximation properties of signatures in rough path theory (Theorem A.1).
  • Fixed and decoupled computational complexity: By selecting $M$ independently of $L$, both computation and memory scale as $\mathcal{O}(M^2)$ for attention, and as $\mathcal{O}(M \cdot \mathrm{poly}(d, n))$ for multi-view signature extraction; this is a dramatic improvement over both the $\mathcal{O}(L^2)$ cost of vanilla Transformers and the sequence-length-dependent cost of Neural ODE/CDE variants (see the back-of-the-envelope comparison below).
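
For intuition about the decoupling, here is a back-of-the-envelope comparison of the quadratic attention term for raw time steps versus signature views; the particular $L$, $M$, and embedding size are illustrative assumptions, not figures reported in the papers.

```python
# Back-of-the-envelope attention cost: roughly (number of positions)^2 * embedding size.
L, M, d_embed = 4000, 32, 64            # illustrative values only

vanilla_cost = L**2 * d_embed           # attention over every raw time step
rformer_cost = M**2 * d_embed           # attention over the M signature views

print(f"vanilla transformer ~ {vanilla_cost:,} multiply-adds per attention map")
print(f"rough transformer   ~ {rformer_cost:,} multiply-adds per attention map")
print(f"reduction           ~ {vanilla_cost / rformer_cost:,.0f}x, growing as L grows")
```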

5. Empirical Complexity and Benchmark Performance

Empirical evaluations on synthetic and medical time-series tasks support the theoretical advantages of Rough Transformers (Moreno-Pino et al., 15 Mar 2024).

  • Complexity: For key parameters $L$ (input length), $M$ (views), $d$ (dimension), $d'$ (embedding size), and $n$ (signature depth):
    • Signature extraction: $\mathcal{O}(M \cdot \mathrm{poly}(d, n))$
    • Attention and memory: $\mathcal{O}(M^2)$ versus $\mathcal{O}(L^2)$ for vanilla Transformers.
  • Running time (synthetic sinusoid classification, $L = 2000$): RFormer completes epochs in 0.55 s versus 0.77 s for the Transformer (1.4× faster), 9.83 s for Neural CDE, and 5.39 s for ODE-RNN.
  • Running time (real-world heart rate dataset, $L = 4000$): RFormer requires 0.45 s/epoch versus 11.71 s for the Transformer (26× faster) and 50.7 s for ODE-RNN.
  • Memory usage: Remains $\mathcal{O}(M^2)$, compared to $\mathcal{O}(L^2)$ for vanilla Transformers.
Model         Test RMSE (full / point drop)    Speed (s/epoch, sinusoid / heart rate)
Transformer   8.24 / 21.01                     0.77 / 11.71
ODE-RNN       ~13.06                           5.39 / 50.7
Neural CDE    9.82                             9.83 / -
Neural RDE    2.97                             -
RFormer       3.04 ± 0.03 / 3.31 ± 0.05        0.55 / 0.45

Rough Transformers matched or outperformed state-of-the-art Neural ODE/CDE models in both accuracy (e.g., a $>170\%$ RMSE improvement over the vanilla Transformer on the heart-rate task) and training efficiency, with accuracy remaining stable under aggressive down-sampling or point dropout.

6. Implications and Extensibility

Rough Transformers demonstrate that multi-scale, continuous-time feature extraction via truncated path signatures, when paired with attention on a small number of robust views, enables high-fidelity, scalable time-series modeling. They avoid the quadratic bottleneck of Transformer attention, circumvent the need for ODE solvers, and naturally extend to data with missing or irregular sampling without special positional encoding. A plausible implication is the broad applicability of this architecture beyond medical time-series to any domain where variable-length, non-uniformly sampled sequences with long-range dependencies are encountered, such as financial tick data, industrial sensor streams, or natural language with variable pacing.

Further exploration of hybrid architectures, alternate signature extraction methods, and extensions to higher input dimensions or adaptive view selection may yield increased expressivity or efficiency, though this remains an open area for research.
