
LogSparse Transformer: Efficient Forecasting

Updated 10 February 2026
  • LogSparse Transformer is a variant that replaces dense attention with an exponential sparsity pattern to reduce complexity from O(L^2) to O(L log L) per layer.
  • It integrates causal 1D convolutions for query-key formation, injecting local temporal context critical for accurate time series forecasting.
  • Stacking LogSparse layers ensures multi-hop connectivity, enabling efficient long-range dependency propagation and competitive forecasting performance.

The LogSparse Transformer is a Transformer variant designed to address the dual challenges of computational inefficiency and lack of locality in canonical self-attention, with particular emphasis on time series forecasting scenarios featuring long input sequences and strong temporal dependencies. By imposing a fixed logarithmic sparsity pattern on attention connections and introducing convolutional context in the attention mechanism, the LogSparse Transformer achieves sub-quadratic complexity while retaining the capacity to model both local and global temporal dependencies effectively (Li et al., 2019, Bentsen et al., 2022).

1. LogSparse Attention: Definition and Formulation

LogSparse attention replaces the dense $O(L^2)$ dot-product attention mechanism with a sparse pattern defined by exponentially spaced and local key selections for each query position. For a layer with input length $L$, the key set $S^k(i)$ for query position $i$ is given by

$$S^k(i) = \left\{\, i - 2^0,\ i - 2^1,\ i - 2^2,\ \dots,\ i - 2^{\lfloor \log_2 i \rfloor},\ i \,\right\} \cap \{1, \dots, i\},$$

where $|S^k(i)| = O(\log L)$. Attention weights are then computed for query $Q_i$ and each key $K_j$ with $j \in S^k(i)$:

$$\alpha_{i,j} = \frac{\exp\left( Q_i K_j^\top / \sqrt{d} \right)}{\sum_{k \in S^k(i)} \exp\left( Q_i K_k^\top / \sqrt{d} \right)}, \qquad O_i = \sum_{j \in S^k(i)} \alpha_{i,j}\, V_j.$$

The LogSparse pattern may incorporate optional additions, such as a fixed local attention window and periodic “restart” links, to further augment memory pathways (Bentsen et al., 2022).
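The key-set construction and sparse softmax above can be sketched directly. The following snippet is an illustrative sketch (not an optimized kernel; `logsparse_keys` and `sparse_attention_row` are hypothetical helper names) that computes $S^k(i)$ and the output $O_i$ for a single query:

```python
import numpy as np

def logsparse_keys(i):
    """Key indices attended to by query position i (1-indexed):
    {i - 2^0, i - 2^1, ..., i - 2^floor(log2 i)} plus i itself, clipped to [1, i]."""
    keys, step = {i}, 1
    while i - step >= 1:
        keys.add(i - step)
        step *= 2
    return sorted(keys)

def sparse_attention_row(Q, K, V, i):
    """Output O_i for query position i under LogSparse attention.
    Q, K, V: (L, d) arrays; positions are 1-indexed to match the formula."""
    d = Q.shape[1]
    idx = [j - 1 for j in logsparse_keys(i)]      # convert to 0-indexed rows
    scores = K[idx] @ Q[i - 1] / np.sqrt(d)       # one score per sparse key
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                           # softmax over S^k(i) only
    return alpha @ V[idx]

rng = np.random.default_rng(0)
L, d = 32, 8
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
O_16 = sparse_attention_row(Q, K, V, 16)
print(logsparse_keys(16))  # [8, 12, 14, 15, 16]
```

Each query touches only $O(\log i)$ keys, which is the source of the per-layer savings discussed next.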

2. Computational Complexity and Efficiency

Canonical Transformers require $O(L^2)$ space and time per layer, which is prohibitive for long sequences. In the LogSparse Transformer, each of the $L$ queries attends to only $O(\log L)$ keys:

  • Per layer: $O(L \log L)$ time and memory
  • Stacked layers: with $K \approx \log_2 L + 1$ layers needed for global connectivity, the overall complexity becomes $O(L (\log L)^2)$

This scaling enables practical modeling of far longer sequences or finer temporal granularities than previously feasible (Li et al., 2019).
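To make the scaling concrete, here is a rough back-of-the-envelope count of attention-score evaluations, comparing a single dense layer against a full LogSparse stack (the layer count and the comparison itself are illustrative assumptions, not a benchmark):

```python
import math

# Rough count of attention-score evaluations for sequence length L:
# a dense layer computes L*L scores, while a LogSparse stack of
# ~log2(L)+1 layers computes ~(log2(L)+1) scores per query per layer.
for L in (256, 1024, 4096):
    dense_per_layer = L * L
    n_layers = int(math.log2(L)) + 1
    logsparse_stack = n_layers * L * (int(math.log2(L)) + 1)
    print(f"L={L}: dense layer {dense_per_layer:,} "
          f"vs LogSparse stack ({n_layers} layers) {logsparse_stack:,}")
```

Even charging the sparse model for its whole stack, the gap widens quickly with $L$, which is what makes longer histories or finer granularities affordable.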

3. Convolutional Query-Key Construction and Locality

The LogSparse Transformer replaces pointwise linear projections for queries and keys with causal 1D convolutions of kernel size $k$:

$$Q_t = \sum_{\tau=0}^{k-1} W^Q_\tau X_{t-\tau}, \qquad K_t = \sum_{\tau=0}^{k-1} W^K_\tau X_{t-\tau},$$

where $W^Q_\tau, W^K_\tau$ are learnable $d \times (d+1)$ filters and $X \in \mathbb{R}^{L \times (d+1)}$ is the model input. This convolutional mechanism injects local context into the similarity calculation, rendering the attention operation sensitive to the recent temporal neighborhood, a critical property for time series with significant local structure or anomalies. Empirically, larger kernel sizes $k$ accelerate the reduction in training loss and improve forecasting accuracy (Li et al., 2019, Bentsen et al., 2022).
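A minimal NumPy sketch of the causal convolutional projection (function name and shapes are illustrative; a production implementation would use an optimized 1D convolution with left zero-padding):

```python
import numpy as np

def causal_conv_qk(X, W):
    """Causal 1D convolution for query/key formation:
    out[t] = sum_{tau=0}^{k-1} W[tau] @ X[t - tau], with zero left-padding
    so each position only sees its own history.
    X: (L, d_in), W: (k, d_out, d_in); returns (L, d_out)."""
    k, d_out, d_in = W.shape
    L = X.shape[0]
    Xp = np.vstack([np.zeros((k - 1, d_in)), X])  # left-pad for causality
    return np.stack([
        sum(W[tau] @ Xp[k - 1 + t - tau] for tau in range(k))
        for t in range(L)
    ])

rng = np.random.default_rng(0)
L, d_in, d_out, k = 8, 5, 4, 3
X = rng.standard_normal((L, d_in))
WQ = rng.standard_normal((k, d_out, d_in))
Q = causal_conv_qk(X, WQ)
# With k = 1 the operation reduces to the standard pointwise linear projection
assert np.allclose(causal_conv_qk(X, WQ[:1]), X @ WQ[0].T)
```

The $k = 1$ check makes the relationship explicit: the convolutional projection strictly generalizes the canonical Transformer's pointwise query/key maps.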

4. Propagation of Long-Range Dependencies

Despite the sparsity in each individual attention matrix, stacking approximately $\log_2 L$ LogSparse layers ensures that, for any pair of positions $j \leq i \leq L$, there exists at least one directed path connecting $j$ to $i$. This guarantees that information from the distant past can propagate to any future time step. The number of distinct multi-hop information pathways increases super-factorially in $\log_2(i - j)$, promoting rich mixing of long-term dependencies. Thus, the LogSparse attention pattern enables both efficient and expressive temporal modeling (Li et al., 2019).
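The connectivity claim can be checked empirically. The sketch below (hypothetical helper `layers_to_connect`, brute-force reachability, unoptimized) counts how many stacked LogSparse layers are needed before position 1 can reach position $L$:

```python
def logsparse_adj(L):
    """Per-layer attention graph: query i sees {i - 2^p} for valid p, plus itself."""
    adj = {}
    for i in range(1, L + 1):
        keys, step = {i}, 1
        while i - step >= 1:
            keys.add(i - step)
            step *= 2
        adj[i] = keys
    return adj

def layers_to_connect(L):
    """Smallest number of stacked layers after which information from
    position 1 can reach position L (brute-force set propagation)."""
    adj = logsparse_adj(L)
    reach, n = {i: set(adj[i]) for i in adj}, 1
    while 1 not in reach[L]:
        reach = {i: set().union(*(reach[j] for j in adj[i])) for i in adj}
        n += 1
    return n

# One hop subtracts a single power of two, so bridging 1 -> L takes one layer
# per set bit of L - 1: e.g. L = 256 (L - 1 = 0b11111111) needs 8 layers.
print(layers_to_connect(256))  # 8
```

This matches the $\approx \log_2 L$ depth requirement stated above: the worst-case path length equals the number of set bits in $i - j$, which is at most $\lceil \log_2 L \rceil$.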

5. Empirical Performance in Time Series Forecasting

Comprehensive benchmarking demonstrates that the LogSparse Transformer offers strong forecasting accuracy on both synthetic and real-world datasets. In the original study, the method achieved performance superior or comparable to LSTM-based DeepAR and traditional statistical baselines (ARIMA, ETS, TRMF, DeepState), especially as the history window increased:

  • On traffic and electricity datasets, LogSparse with large convolutional kernels outperformed DeepAR by substantial margins in median and 90%-quantile losses.
  • Under a fixed memory budget, the LogSparse Transformer matched or exceeded dense attention on fine-grained datasets, notably achieving $0.138/0.092$ on traffic–fine as compared to $0.149/0.102$ for dense attention (Li et al., 2019).

The table below summarizes representative results from (Bentsen et al., 2022) on multi-step wind speed forecasting:

Model              1-step MSE   6-step MSE   24-step MSE
GNN–MLP            0.3938       0.9108       2.3951
GNN–LSTM           0.4040       0.9430       2.3707
GNN–Transformer    0.4022       0.8923       2.3302
GNN–LogSparse      0.3937       0.8492       2.3290
GNN–Informer       0.3948       0.8552       2.2321

GNN–LogSparse consistently outperformed GNN–MLP, GNN–LSTM, and GNN–Transformer at all horizons, and performed on par with or slightly better than GNN–Informer for short- and medium-term predictions (Bentsen et al., 2022).

6. Integration in Spatio-Temporal Graph Architectures

LogSparse Transformers have also been deployed as temporal update functions within spatio-temporal graph neural networks (GNNs), notably in multi-step spatio-temporal wind speed forecasting. In this formulation:

  • Nodes correspond to spatial stations, edges encode geographic proximity, and node/edge features are embedded using learnable and sinusoidal representations.
  • Within each GNN layer, edge updates use a vanilla Transformer-encoder; node updates employ the LogSparse Transformer encoder, operating on concatenated feature sequences.
  • Convolutional query-key formation and the fixed logarithmic sparsity pattern enable memory- and compute-efficient integration over long input sequences.

A representative forward-pass pseudocode for this setup is documented in (Bentsen et al., 2022).
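Since the exact forward pass is documented in the paper, the following is only a schematic sketch of the bulleted structure above, with stub encoders, mean aggregation over incident edges, and all shapes taken as assumptions rather than the paper's design:

```python
import numpy as np

def encoder_stub(seq):
    """Placeholder for a sequence encoder mapping (T, d) -> (T, d); in the
    paper this would be a Transformer encoder (edge updates) or a
    LogSparse Transformer encoder (node updates)."""
    return seq

def st_gnn_layer(node_feats, edge_feats, edges,
                 edge_encoder=encoder_stub, node_encoder=encoder_stub):
    """Schematic spatio-temporal GNN layer: edges are updated from the
    concatenated endpoint sequences, then each node is updated from its own
    sequence plus mean-aggregated incident-edge features. Concatenation and
    mean aggregation here are illustrative assumptions.
    node_feats: {node: (T, d)}, edge_feats: {(u, v): (T, d)}."""
    new_edge = {
        (u, v): edge_encoder(np.concatenate(
            [node_feats[u], node_feats[v], edge_feats[(u, v)]], axis=-1))
        for (u, v) in edges
    }
    new_node = {}
    for n, feats in node_feats.items():
        incident = [new_edge[e] for e in edges if n in e]  # assumes >= 1 edge per node
        agg = np.mean(incident, axis=0)
        new_node[n] = node_encoder(np.concatenate([feats, agg], axis=-1))
    return new_node, new_edge

T, d = 4, 3
nodes = {0: np.ones((T, d)), 1: np.zeros((T, d))}
edges_f = {(0, 1): np.ones((T, d))}
new_nodes, new_edges = st_gnn_layer(nodes, edges_f, [(0, 1)])
assert new_edges[(0, 1)].shape == (T, 3 * d)
assert new_nodes[0].shape == (T, 4 * d)
```

The point of the structure is that only the node-update encoder needs to handle long temporal sequences, which is where the LogSparse attention pattern pays off.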

7. Advantages, Limitations, and Comparative Perspective

Advantages:

  • Reduces attention complexity from $O(L^2)$ to $O(L(\log L)^2)$, significantly mitigating the memory bottleneck for long sequences.
  • Convolutional self-attention introduces crucial local structure, improving robustness to local anomalies.
  • Fixed, simple sparse patterns avoid extra overhead needed in alternatives such as ProbSparse.

Limitations:

  • The fixed exponential skip pattern may not optimally capture mid-range dependencies if their lags do not coincide with selected indices.
  • Underperforms trend/seasonality-decomposition-based architectures (e.g., Autoformer, FFTransformer) on very long horizon tasks.
  • Performance may depend on tuning of convolution kernel, local window size, and other sparsity hyperparameters; optimal parameters may be domain-specific.

Insight from recent benchmarking suggests that, while LogSparse attention yields efficient and robust performance—especially in resource-constrained contexts—specialized time-series architectures leveraging signal decomposition or auto-correlation can offer further gains for certain forecasting scenarios (Bentsen et al., 2022).


References:

(Li et al., 2019) "Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting"
(Bentsen et al., 2022) "Spatio-Temporal Wind Speed Forecasting using Graph Networks and Novel Transformer Architectures"
