Rough Transformers for Time Series

Updated 25 November 2025
  • Rough Transformers are neural network models that utilize truncated path signatures to convert irregular time series into continuous, invariant representations.
  • They employ a multi-view signature transform with multi-head attention to capture both local and global temporal patterns while reducing computational complexity.
  • Empirical benchmarks in scientific and medical applications demonstrate that Rough Transformers outperform traditional models in speed, memory usage, and accuracy.

Rough Transformers are a class of neural network architectures designed for modeling continuous-time, irregularly sampled time series while efficiently capturing long-range dependencies. By leveraging truncated path signatures as time-reparametrization-invariant features and applying multi-head attention over a low-dimensional representation, Rough Transformers (abbreviated "RFormer") achieve robust, scalable performance, with computational efficiency rivaling or surpassing both vanilla Transformers and Neural ODE-based models in scientific and medical applications (Moreno-Pino et al., 15 Mar 2024; Moreno-Pino et al., 31 May 2024).

1. Motivation and Theoretical Foundation

Real-world time series—particularly in domains such as medicine and finance—are often characterized by irregular sampling intervals, missing or non-uniformly spaced data points, and latent dependencies extending over thousands of time steps. Classical recurrent models and their continuous-time extensions (RNNs, LSTMs, ODE-RNNs, and Neural CDEs) manage irregular sampling by evolving hidden states, in the continuous-time variants via ODE/CDE solvers, but their computational and memory costs scale unfavorably with sequence length $L$ and solver mesh size, since they must carry latent states across very long sequences or solve ODEs/CDEs repeatedly. Standard Transformer architectures—originally designed for discrete, evenly spaced sequences—can capture global dependencies via attention but require fixed-length, uniformly sampled data and incur $\mathcal{O}(L^2)$ memory and compute scaling, which becomes prohibitive for long sequences. Furthermore, their positional encodings degrade or fail under time-warping or missing data (Moreno-Pino et al., 15 Mar 2024).

Rough Transformers address these limitations by lifting the input time series into a continuous-time path, extracting rich local and global signature features via iterated integrals, and operating Transformer attention only over a fixed number $M \ll L$ of "views." This construction achieves invariance to irregular sampling and sequence length, enables both local and global temporal context modeling, and drastically reduces computational costs without sacrificing predictive power.

2. Continuous-Time Signature Representation

Given a time series $X = ((t_0, x_0), \ldots, (t_L, x_L))$, where $0 = t_0 < t_1 < \dots < t_L$, Rough Transformers first form the piecewise-linear interpolation:

$$\tilde{X}(t) = x_k + \frac{t - t_k}{t_{k+1} - t_k}\,(x_{k+1} - x_k), \qquad t \in [t_k, t_{k+1}].$$
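
For concreteness, the following NumPy sketch evaluates $\tilde{X}(t)$ on a query grid from irregularly spaced samples; the sample times, values, and grid are illustrative placeholders rather than data from the papers.

```python
import numpy as np

def piecewise_linear(ts, xs, t_query):
    """Evaluate the piecewise-linear interpolant X~(t) of the samples (ts, xs).

    ts:      (L+1,) strictly increasing sample times
    xs:      (L+1, d) observed values
    t_query: (Q,) times at which to evaluate the interpolant
    """
    # np.interp handles one channel at a time, so interpolate each of the d channels.
    return np.stack(
        [np.interp(t_query, ts, xs[:, j]) for j in range(xs.shape[1])],
        axis=-1,
    )

# Irregularly sampled 2-d series (illustrative values only).
ts = np.array([0.0, 0.3, 0.35, 1.1, 2.0])
xs = np.random.default_rng(0).normal(size=(5, 2))
x_tilde = piecewise_linear(ts, xs, np.linspace(0.0, 2.0, 50))   # shape (50, 2)
```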

Next, for any smooth path $f : [s, t] \rightarrow \mathbb{R}^d$, the path signature $S(f)_{s,t}$ is the sequence of iterated integrals:

$$S(f)_{s,t} = \left( 1,\ \int_s^t f'(u)\, du,\ \iint_{s < u_1 < u_2 < t} f'(u_1) \otimes f'(u_2)\, du_1\, du_2,\ \dots \right)$$

This infinite sequence is truncated to order $n$, yielding $S(f)^{\leq n}_{s,t}$, which summarizes fine (local) and coarse (global) time-series structure invariant under smooth reparametrizations of time. For each linear segment $[t_k, t_{k+1}]$ with increment $\Delta x_k = x_{k+1} - x_k$:

$$S(\widehat{X}_k)_{t_k, t_{k+1}} = \left( 1,\ \Delta x_k,\ \tfrac{1}{2}\,\Delta x_k \otimes \Delta x_k,\ \ldots,\ \tfrac{1}{n!}\,\Delta x_k^{\otimes n} \right)$$
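
The segment formula above, combined with Chen's identity for concatenating path pieces, gives a direct way to compute signatures of piecewise-linear paths. The sketch below is a minimal NumPy implementation truncated at depth $n = 2$; it is illustrative only, and a practical pipeline might instead rely on a dedicated signature library.

```python
import numpy as np

def segment_signature(dx):
    """Depth-2 signature of one linear segment with increment dx: (1, dx, dx (x) dx / 2)."""
    return 1.0, dx, 0.5 * np.outer(dx, dx)

def chen_product(sig_a, sig_b):
    """Combine signatures of two consecutive path pieces via Chen's identity (depth 2)."""
    _, a1, a2 = sig_a
    _, b1, b2 = sig_b
    return 1.0, a1 + b1, a2 + b2 + np.outer(a1, b1)

def truncated_signature(xs):
    """Depth-2 signature of the piecewise-linear path through the rows of xs (shape (K, d))."""
    d = xs.shape[1]
    sig = (1.0, np.zeros(d), np.zeros((d, d)))           # signature of the trivial path
    for k in range(len(xs) - 1):
        sig = chen_product(sig, segment_signature(xs[k + 1] - xs[k]))
    return sig

xs = np.random.default_rng(1).normal(size=(6, 3))         # five segments in R^3
_, level1, level2 = truncated_signature(xs)
assert np.allclose(level1, xs[-1] - xs[0])                # level 1 is the total increment
```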

3. Multi-View Signature Transform and Attention

To efficiently summarize both local and global time-series information, the multi-view signature transform computes, at $M$ fixed "view" times $0 = \tau_0 < \tau_1 < \dots < \tau_{M-1} \le T$:

  • Global signature: $S(\tilde{X})_{0, \tau_m}^{\leq n}$ (integrating over the full path up to $\tau_m$)
  • Local signature: $S(\tilde{X})_{\tau_{m-1}, \tau_m}^{\leq n}$ (capturing increment structure over just $[\tau_{m-1}, \tau_m]$)

These are concatenated as:

$$M(X)_m = \big[\, S(\tilde{X})_{0, \tau_m}^{\leq n} \;\big\|\; S(\tilde{X})_{\tau_{m-1}, \tau_m}^{\leq n} \,\big] \in \mathbb{R}^{\bar{d}}$$

yielding $M(X) \in \mathbb{R}^{M \times \bar{d}}$, where $\bar{d}$ grows polynomially in the input dimension $d$ for a fixed signature depth $n$ (each depth-$n$ truncated signature contributes $\sum_{k=0}^{n} d^k$ components per view). A sketch of this multi-view construction is given below.
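
The following is a minimal sketch of the multi-view transform, again truncated at depth $n = 2$ and with signatures flattened into vectors; the number of views and their placement at evenly spaced indices are assumptions made for illustration.

```python
import numpy as np

def depth2_signature(xs):
    """Flattened depth-2 signature (constant term dropped) of a piecewise-linear path."""
    d = xs.shape[1]
    lvl1, lvl2 = np.zeros(d), np.zeros((d, d))
    for k in range(len(xs) - 1):
        dx = xs[k + 1] - xs[k]
        lvl2 += np.outer(lvl1, dx) + 0.5 * np.outer(dx, dx)   # Chen's identity at depth 2
        lvl1 += dx
    return np.concatenate([lvl1, lvl2.ravel()])               # length d + d^2

def multi_view_signature(xs, num_views):
    """Stack global and local depth-2 signatures at num_views evenly spaced view indices."""
    view_idx = np.linspace(1, len(xs) - 1, num_views).astype(int)
    rows, prev = [], 0
    for idx in view_idx:
        global_sig = depth2_signature(xs[: idx + 1])          # signature over [0, tau_m]
        local_sig = depth2_signature(xs[prev: idx + 1])       # signature over [tau_{m-1}, tau_m]
        rows.append(np.concatenate([global_sig, local_sig]))
        prev = idx
    return np.stack(rows)                                     # shape (M, d_bar)

path = np.random.default_rng(2).normal(size=(2000, 3)).cumsum(axis=0)  # long path in R^3
MX = multi_view_signature(path, num_views=32)
print(MX.shape)   # (32, 24): d_bar = 2 * (d + d^2) with d = 3
```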

This matrix forms the input to standard multi-head scaled dot-product attention. For each attention head $h$, define projections $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{\bar{d} \times d'}$:

  • $Q_h = M(X)\, W_h^Q$
  • $K_h = M(X)\, W_h^K$
  • $V_h = M(X)\, W_h^V$

Attention is computed as:

$$A_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d'}}\right) V_h \in \mathbb{R}^{M \times d'}$$

Stacking all heads and following with feed-forward, normalization, and residual blocks yields one RFormer block. Critically, all attention operations occur on the $M$-view representations, so the dominant cost is $\mathcal{O}(M^2)$, regardless of the original sequence length $L$. No ODE or CDE solver is involved, but the signature feature $S(f)_{0,t}$ can be viewed as encoding the solution map of a canonical linear ODE driven by $f$ (Moreno-Pino et al., 15 Mar 2024).
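
As a minimal illustration of the attention step, the NumPy sketch below applies multi-head scaled dot-product attention to a view matrix of shape $(M, \bar{d})$; the head count, projection dimension $d'$, and random projection matrices are placeholders standing in for learned parameters in a real implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(MX, num_heads, d_prime, rng):
    """One multi-head scaled dot-product attention pass over the (M, d_bar) view matrix."""
    d_bar = MX.shape[1]
    heads = []
    for _ in range(num_heads):
        # Random stand-ins for the learned projections W_h^Q, W_h^K, W_h^V.
        Wq, Wk, Wv = (rng.normal(size=(d_bar, d_prime)) / np.sqrt(d_bar) for _ in range(3))
        Q, K, V = MX @ Wq, MX @ Wk, MX @ Wv
        heads.append(softmax(Q @ K.T / np.sqrt(d_prime)) @ V)   # (M, d_prime) per head
    return np.concatenate(heads, axis=-1)                       # (M, num_heads * d_prime)

rng = np.random.default_rng(3)
MX = rng.normal(size=(32, 24))     # e.g. the multi-view matrix from the previous sketch
out = multi_head_attention(MX, num_heads=4, d_prime=16, rng=rng)
print(out.shape)                   # (32, 64): the cost scales with M^2, independent of L
```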

4. Properties: Invariance, Robustness, and Decoupled Complexity

Path signatures are invariant under smooth time reparametrization: $M(X)$ depends on the geometric path traversed by $\tilde{X}$, and not on specific sampling times or frequencies. This endows Rough Transformers with several key properties:

  • Robustness to missing or irregularly sampled data: Iterated integrals commute with time-warping, rendering the model insensitive to sampling irregularities or dropout.
  • Modeling of both local and global dependencies: Local signatures act analogously to convolutional filters on small windows; global signatures capture higher-order interactions and long-term dependencies, as guaranteed by the approximation properties of signatures in rough path theory (Theorem A.1).
  • Fixed and decoupled computational complexity: By selecting $M$ independently of $L$, both computation and memory scale as $\mathcal{O}(M^2)$ for attention, and as $\mathcal{O}(M \cdot \mathrm{poly}(d, n))$ for multi-view signature extraction; this is a dramatic improvement over both the $\mathcal{O}(L^2)$ cost of vanilla Transformers and the sequence-length-dependent cost of Neural ODE/CDE variants (see the back-of-the-envelope comparison below).
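
For intuition about the decoupling, here is a back-of-the-envelope comparison of the quadratic attention term for raw time steps versus signature views; the particular $L$, $M$, and embedding size are illustrative assumptions, not figures reported in the papers.

```python
# Back-of-the-envelope attention cost: roughly (number of positions)^2 * embedding size.
L, M, d_embed = 4000, 32, 64            # illustrative values only

vanilla_cost = L**2 * d_embed           # attention over every raw time step
rformer_cost = M**2 * d_embed           # attention over the M signature views

print(f"vanilla transformer ~ {vanilla_cost:,} multiply-adds per attention map")
print(f"rough transformer   ~ {rformer_cost:,} multiply-adds per attention map")
print(f"reduction           ~ {vanilla_cost / rformer_cost:,.0f}x, growing as L grows")
```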

5. Empirical Complexity and Benchmark Performance

Empirical evaluations on synthetic and medical time-series tasks support the theoretical advantages of Rough Transformers (Moreno-Pino et al., 15 Mar 2024).

  • Complexity: For key parameters $L$ (input length), $M$ (views), $d$ (dimension), $d'$ (embedding size), and $n$ (signature depth):
    • Signature extraction: $\mathcal{O}(M \cdot \mathrm{poly}(d, n))$
    • Attention and memory: $\mathcal{O}(M^2)$ versus $\mathcal{O}(L^2)$ for vanilla Transformers.
  • Running time (synthetic sinusoid classification, $L = 2000$): RFormer completes epochs in 0.55 s versus 0.77 s for the Transformer (1.4× faster), 9.83 s for Neural CDE, and 5.39 s for ODE-RNN.
  • Running time (real-world heart rate dataset, $L = 4000$): RFormer requires 0.45 s/epoch versus 11.71 s for the Transformer (26× faster) and 50.7 s for ODE-RNN.
  • Memory usage: Remains $\mathcal{O}(M^2)$, compared to $\mathcal{O}(L^2)$ for vanilla Transformers.
Model         Test RMSE (full / point drop)    Speed (s/epoch, sinusoid / heart rate)
Transformer   8.24 / 21.01                     0.77 / 11.71
ODE-RNN       ~13.06                           5.39 / 50.7
Neural CDE    9.82                             9.83 / -
Neural RDE    2.97                             -
RFormer       3.04 ± 0.03 / 3.31 ± 0.05        0.55 / 0.45

Rough Transformers matched or outperformed state-of-the-art Neural ODE/CDE models in both accuracy (e.g., a $>170\%$ RMSE improvement over the vanilla Transformer on the heart-rate task) and training efficiency, with accuracy remaining stable under aggressive down-sampling or point dropout.

6. Implications and Extensibility

Rough Transformers demonstrate that multi-scale, continuous-time feature extraction via truncated path signatures, when paired with attention on a small number of robust views, enables high-fidelity, scalable time-series modeling. They avoid the quadratic bottleneck of Transformer attention, circumvent the need for ODE solvers, and naturally extend to data with missing or irregular sampling without special positional encoding. A plausible implication is the broad applicability of this architecture beyond medical time-series to any domain where variable-length, non-uniformly sampled sequences with long-range dependencies are encountered, such as financial tick data, industrial sensor streams, or natural language with variable pacing.

Further exploration of hybrid architectures, alternate signature extraction methods, and extensions to higher input dimensions or adaptive view selection may yield increased expressivity or efficiency, though this remains an open area for research.
