Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (2012.07436v3)

Published 14 Dec 2020 in cs.LG, cs.AI, and cs.IR

Abstract: Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a $ProbSparse$ self-attention mechanism, which achieves $O(L \log L)$ in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.

Informer: An Efficient Transformer for Long Sequence Time-Series Forecasting

The paper "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting" by Haoyi Zhou et al. offers a rigorous exploration and solution to the challenges inherent in long sequence time-series forecasting (LSTF). This work advances the field by addressing the computational and memory inefficiencies of conventional Transformer models, while enhancing their capacity to predict long-term dependencies in time-series data.

Key Contributions

The authors present Informer, a novel transformer-based model tailored for LSTF, which ameliorates the limitations of the standard Transformer architecture through three key innovations:

  1. ProbSparse Self-Attention Mechanism: This mechanism reduces the time complexity and memory usage of self-attention from $O(L^2)$ to $O(L \log L)$. It achieves this by scoring queries with a sparsity measurement and retaining only the dominant attention computations, thereby maintaining comparable performance in dependency alignment with substantially reduced computational overhead (a minimal code sketch follows this list).
  2. Self-Attention Distilling: By progressively focusing on dominant attention scores through a distilling operation, this technique halves the temporal length of the input passed to each successive layer. It addresses the memory bottleneck for extremely long input sequences, reducing total memory usage to $O((2-\epsilon)L \log L)$ (see the second sketch below).
  3. Generative Style Decoder: Unlike traditional encoder-decoder architectures that rely on step-by-step dynamic decoding, the generative style decoder of Informer predicts long sequences in a single forward pass. This significantly improves inference speed, avoids cumulative error propagation, and maintains consistent prediction capacity (the decoder-input construction appears in the second sketch below).
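
To make the ProbSparse idea concrete, here is a minimal, single-head sketch of the query-sparsity measurement and top-u selection, written in PyTorch. It is an illustrative reconstruction rather than the authors' implementation: the function name probsparse_attention, the sampling constant factor, and the omission of masking and multi-head logic are all simplifying assumptions.

```python
import torch

def probsparse_attention(Q, K, V, factor=5):
    # Illustrative sketch, not the official Informer code.
    # Q: (B, L_Q, D), K and V: (B, L_K, D); single head, no causal mask.
    B, L_Q, D = Q.shape
    _, L_K, _ = K.shape
    scale = D ** -0.5

    # Estimate each query's sparsity score on a random sample of keys.
    u_keys = min(L_K, max(1, int(factor * torch.log(torch.tensor(float(L_K))).item())))
    idx = torch.randint(0, L_K, (u_keys,))
    sample_scores = torch.einsum("bqd,bkd->bqk", Q, K[:, idx, :]) * scale

    # Sparsity measurement M(q, K): max score minus mean score.
    M = sample_scores.max(dim=-1).values - sample_scores.mean(dim=-1)

    # Only the top-u "active" queries attend over all keys (u ~ c * ln L_Q).
    u = min(L_Q, max(1, int(factor * torch.log(torch.tensor(float(L_Q))).item())))
    top_idx = M.topk(u, dim=-1).indices                                  # (B, u)

    # Lazy queries fall back to the mean of V.
    out = V.mean(dim=1, keepdim=True).expand(B, L_Q, D).clone()

    # Full attention is computed for the selected queries only.
    Q_act = torch.gather(Q, 1, top_idx.unsqueeze(-1).expand(-1, -1, D))  # (B, u, D)
    attn = torch.softmax(torch.einsum("bud,bkd->buk", Q_act, K) * scale, dim=-1)
    out.scatter_(1, top_idx.unsqueeze(-1).expand(-1, -1, D),
                 torch.einsum("buk,bkd->bud", attn, V))
    return out
```

On random inputs such as Q = K = V = torch.randn(2, 96, 64), only about factor * ln(96) ≈ 22 of the 96 queries receive exact attention; the rest reuse the mean of V, which is what keeps the cost near $O(L \log L)$.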

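The other two components can be sketched in the same spirit. The snippet below shows (i) a distilling layer, a convolution plus max-pooling placed between attention blocks that roughly halves the temporal length, and (ii) the construction of the generative decoder's input, a known "start token" segment concatenated with zero placeholders for the forecast horizon. The names DistillingLayer and build_decoder_input, and the label_len/pred_len arguments, are illustrative assumptions rather than the paper's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillingLayer(nn.Module):
    """Conv1d + ELU + max-pooling between attention blocks; the pooling
    stride of 2 roughly halves the temporal length of the encoder input."""
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                       # x: (batch, length, d_model)
        x = x.transpose(1, 2)                   # -> (batch, d_model, length)
        x = self.pool(F.elu(self.norm(self.conv(x))))
        return x.transpose(1, 2)                # temporal length ~ halved

def build_decoder_input(x_known, label_len, pred_len):
    """Concatenate a 'start token' slice of the known series with zero
    placeholders so the decoder emits all pred_len steps in one forward pass."""
    start_token = x_known[:, -label_len:, :]
    placeholder = torch.zeros(x_known.size(0), pred_len, x_known.size(-1),
                              device=x_known.device, dtype=x_known.dtype)
    return torch.cat([start_token, placeholder], dim=1)
```

Feeding a (batch, 96, d_model) tensor through stacked DistillingLayer instances yields lengths of roughly 48, 24, and so on, which is where the $O((2-\epsilon)L \log L)$ memory bound in item 2 comes from; build_decoder_input mirrors the one-shot decoding described in item 3.
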
Experimental Validation

The efficacy of Informer is demonstrated through extensive experiments on four large-scale, real-world datasets: ETTh1, ETTh2, ETTm1, and Weather, spanning diverse domains such as electricity consumption and weather prediction. The results consistently show that Informer outperforms existing methods, including ARIMA, LSTMa, and state-of-the-art Transformer variants like LogSparse Transformer and Reformer, especially as the prediction horizon increases.

Numerical Results

Informer achieves notable improvements in Mean Squared Error (MSE) and Mean Absolute Error (MAE) across varying forecasting lengths. For instance:

  • On the ETTh1 dataset, Informer achieves an MSE of 0.269 at a prediction length of 720, compared to 2.112 for Reformer and 1.549 for ARIMA, highlighting its enhanced long-term prediction capacity.
  • For the Weather dataset, Informer's MSE is 0.359 at a prediction length of 720, significantly better than the 2.087 of Reformer and the 1.062 of ARIMA.

These results underscore Informer's robustness and efficiency in handling long sequence forecasting tasks.

Implications and Future Directions

The innovations introduced in Informer have significant implications both practically and theoretically:

  1. Practical Implications: Informer's efficiency in handling long sequences with reduced computational and memory requirements can be transformative for industries relying on extensive time-series data. This includes applications in energy management, financial forecasting, and sensor network monitoring, where accurate and timely long-term predictions are critical.
  2. Theoretical Implications: Informer's approach demonstrates that the predictive capacity of transformers can be vastly improved with tailored modifications to the self-attention mechanism and decoding process. This opens avenues for further research into sparsity-aware attention mechanisms and efficient model architectures for other machine learning tasks involving extensive temporal dependencies.
  3. Future Developments: The success of Informer in LSTF suggests potential extensions to other domains requiring long-range sequence modeling, such as natural language processing and bioinformatics. Further investigation into hybrid models that integrate Informer's principles with other neural network architectures could yield even more powerful predictive models.

In summary, the Informer model presented by Zhou et al. stands as a significant advancement in time-series forecasting, addressing key limitations of traditional transformers through innovative architectural modifications. Its promising results set a new benchmark for LSTF and pave the way for future research and applications in this domain.

Authors (7)
  1. Haoyi Zhou (20 papers)
  2. Shanghang Zhang (172 papers)
  3. Jieqi Peng (1 paper)
  4. Shuai Zhang (319 papers)
  5. Jianxin Li (128 papers)
  6. Hui Xiong (244 papers)
  7. Wancai Zhang (3 papers)
Citations (3,070)