Transformers (specifically self-attention) have powered significant recent progress in NLP. They have enabled models like BERT, GPT-2, and XLNet to form powerful language models that can be used to generate text, translate text, answer questions, classify documents, summarize text, and much more. Given their recent success in NLP, one would expect widespread adoption for problems like time series forecasting and classification. After all, both involve processing sequential data. However, to this point research on adapting them to time series problems has remained limited. Moreover, while some results are promising, others are more mixed. In this article, I will review the current literature on applying transformers, as well as attention more broadly, to time series problems, discuss the current barriers and limitations, and brainstorm possible solutions to (hopefully) enable these models to achieve the same level of success as in NLP. This article assumes that you have a basic understanding of soft-attention, self-attention, and transformer architecture. If you don’t, please read one of the linked articles. You can also watch my video from the PyData Orono presentation night.
Attention for time series data: Review
The need to accurately forecast and classify time series data spans just about every industry and long predates machine learning. For instance, in hospitals you may want to triage patients with the highest mortality risk early on and forecast patient length of stay; in retail you may want to predict demand and forecast sales; utility companies want to forecast power usage, etc.
Despite the successes of deep learning in computer vision, many time series models are still shallow. In industry in particular, many data scientists still use simple autoregressive models instead of deep learning. In some cases, they may even use models like XGBoost fed with manually engineered time-interval features. The usual reasons for choosing these methods are interpretability, limited data, ease of use, and training cost. While there is no single solution that addresses all of these issues, deep models with attention make a compelling case. In many cases, they offer overall performance improvements (over vanilla LSTMs/RNNs) with the benefit of interpretability in the form of attention heat maps. Additionally, in many cases, they are faster than using an RNN/LSTM (particularly with some of the techniques we will discuss).
Several papers have studied using basic and modified attention mechanisms for time series data. LSTNet is one of the first papers that proposes combining a recurrent network with an attention mechanism for multivariate time series forecasting. Temporal Pattern Attention for Multivariate Time Series Forecasting by Shun-Yao Shih et al. focuses on an attention mechanism designed specifically for multivariate data. The mechanism aims to resolve issues such as noisy variables in the multivariate time series and to provide a better way of combining information than a simple average. Specifically,
The attention weights on rows select those variables that are helpful for forecasting. Since the context vector vt is now the weighted sum of the row vectors containing the information across multiple time steps, it captures temporal information.
Simply speaking, this aims to select the useful information across the various feature time series for predicting the target time series. First, they apply a 2D convolution to the row vectors of the RNN’s hidden states. This is followed by a scoring function. Finally, they use a sigmoid activation instead of softmax, since they expect multiple variables to be relevant for prediction. The rest follows fairly standard attention practice.
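    # Original code by gantheory
    # Can be found at https://github.com/gantheory/TPA-LSTM/blob/master/lib/attention_wrapper.py
    import tensorflow as tf  # TensorFlow 1.x API (tf.variable_scope, tf.layers)

    # Assumption: the listing calls dense() without showing where it comes from;
    # aliasing tf.layers.dense matches how the rest of the snippet builds layers.
    dense = tf.layers.dense


    class TemporalPatternAttentionMechanism():
        def __call__(self, query, attn_states, attn_size, attn_length,
                     attn_vec_size):
            """
            query: [batch_size, attn_size * 2] (c and h)
            attn_states: [batch_size, attn_length, attn_size] (h)
            new_attns: [batch_size, attn_size]
            new_attn_states: [batch_size, attn_length - 1, attn_size]
            """
            with tf.variable_scope("attention"):
                filter_num = 32
                filter_size = 1

                # w: [batch_size, 1, filter_num]
                w = tf.reshape(
                    dense(query, filter_num, use_bias=False),
                    [-1, 1, filter_num])

                # Convolve over the full time axis of the stored hidden states.
                reshape_attn_vecs = tf.reshape(
                    attn_states, [-1, attn_length, attn_size, 1])
                conv_vecs = tf.layers.conv2d(
                    inputs=reshape_attn_vecs,
                    filters=filter_num,
                    kernel_size=[attn_length, filter_size],
                    padding="valid",
                    activation=None,
                )
                feature_dim = attn_size - filter_size + 1

                # conv_vecs: [batch_size, feature_dim, filter_num]
                conv_vecs = tf.reshape(conv_vecs, [-1, feature_dim, filter_num])

                # s: [batch_size, feature_dim]
                # (the reduction axis is missing in this listing; axis 2 is
                # inferred from the shape comments)
                s = tf.reduce_sum(tf.multiply(conv_vecs, w), [2])

                # a: [batch_size, feature_dim]
                # Sigmoid rather than softmax, so several rows can be selected.
                a = tf.sigmoid(s)

                # d: [batch_size, filter_num]
                # (axis 1 is likewise inferred from the shape comments)
                d = tf.reduce_sum(
                    tf.multiply(tf.reshape(a, [-1, feature_dim, 1]), conv_vecs),
                    [1])
                new_conv_vec = tf.reshape(d, [-1, filter_num])
                new_attns = tf.layers.dense(
                    tf.concat([query, new_conv_vec], axis=1), attn_size)

                # The listing is cut off here ("new_attn_states = tf.slice(attn_stat").
                # Per the docstring, new_attn_states has shape
                # [batch_size, attn_length - 1, attn_size]; see the linked
                # repository for the remaining lines.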
Code for the temporal pattern attention mechanism. Notice that the authors choose to use 32 filters.
In terms of results, the model outperforms other methods, including a standard autoregressive model and LSTNet, on forecasting solar energy, electricity demand, traffic, and exchange rates (measured by relative absolute error).
Even though this paper doesn’t use self-attention, I think it is a really interesting and well-thought-out use of attention. A lot of time series research seems to focus on univariate data. Moreover, the papers that do study multivariate time series often simply expand the dimensions of the attention mechanism rather than apply it horizontally across the feature time series. It might make sense to see whether a modified self-attention mechanism could select the relevant source time series for predicting the target. The full code for this paper is publicly accessible on GitHub.
What is really going on with self-attention?
Let’s first briefly review a couple of specifics of self-attention before we delve into the time series portion. For a more detailed examination please see this article on mathematics of attention or the Illustrated Transformer. For self-attention, recall that we generally have query, key, and value vectors that are formed via simple matrix multiplication of the embedding by a weight matrix. What a lot of explanatory articles don’t mention is that the query, key, and value can come from different sources depending on the task, and vary based on whether it is an encoder or a decoder layer. For instance, if the task is machine translation, the query, key, and value vectors in the encoder all come from the source language, and in the decoder’s self-attention they all come from the target language. In the decoder’s encoder-decoder attention, however, the queries come from the target side while the keys and values come from the encoder output. In the unsupervised language modeling case they are all generally formed from the same input sequence. Later on we will see that many self-attention time series models modify how these values are formed.
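To make the point about sources concrete, here is a minimal NumPy sketch of scaled dot-product attention in which the queries and the keys/values are built from separately supplied sequences. The function name `attention`, the random weight matrices, and the toy shapes are all illustrative assumptions, not taken from any of the papers discussed here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_src, kv_src, W_q, W_k, W_v):
    """Scaled dot-product attention.

    query_src: (n_q, d_model) sequence that produces the queries
    kv_src:    (n_kv, d_model) sequence that produces keys and values
    """
    Q = query_src @ W_q                      # (n_q, d_k)
    K = kv_src @ W_k                         # (n_kv, d_k)
    V = kv_src @ W_v                         # (n_kv, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n_q, n_kv)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
source = rng.normal(size=(6, d_model))  # hypothetical source-side embeddings
target = rng.normal(size=(3, d_model))  # hypothetical target-side embeddings

# Same sequence on both sides: self-attention.
self_out, self_w = attention(source, source, W_q, W_k, W_v)
# Different sequences: queries from one, keys and values from the other.
cross_out, cross_w = attention(target, source, W_q, W_k, W_v)
```

The only difference between the two calls is which sequence is passed as `query_src` and which as `kv_src`; the mechanism itself is unchanged.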
Secondly, self-attention generally requires positional encodings, as it has no knowledge of sequence order. It usually incorporates this positional information by adding it to the word or time step embedding rather than concatenating it. This is somewhat odd, as you would assume that adding positional encodings directly to the word embedding would hurt it. However, according to this Reddit response, because word embeddings are high-dimensional, the positional encodings and word embeddings end up approximately orthogonal (i.e. they already occupy different subspaces). Moreover, the poster argues that sine and cosine help to give nearby words similar positional embeddings.
But in the end this still leaves a lingering question: wouldn’t straightforward concatenation work better in this respect? This is something that I don’t have a direct answer for at the moment. There are, however, some good recent papers on creating better positional embeddings. Transformer-XL (the basis for XLNet) has its own relative positional embeddings. Also, the NeurIPS 2019 paper Self-attention with Functional Time Representation Learning examines creating more effective positional representations through a functional feature map.
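As a rough illustration of the addition step and of the claim that nearby positions receive similar encodings, the sketch below builds the standard sinusoidal encodings in NumPy and compares the cosine similarity of a close pair and a distant pair of positions. The helper names and the random stand-in embeddings are assumptions made for this example.

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """Sine/cosine positional encodings (d_model is assumed to be even)."""
    positions = np.arange(n_positions)[:, None]   # (n_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pe = sinusoidal_encoding(n_positions=100, d_model=64)

near = cosine(pe[10], pe[11])  # adjacent positions
far = cosine(pe[10], pe[60])   # positions 50 steps apart

# Addition (not concatenation): the model input keeps dimension d_model.
embeddings = np.random.default_rng(0).normal(size=(100, 64))  # stand-in embeddings
inputs = embeddings + pe
```

In this setup `near` comes out larger than `far`, which matches the intuition that adjacent steps get similar encodings.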
A number of recent studies have analyzed what actually happens in models like BERT. Although geared entirely towards NLP, these studies can help us understand how to effectively use these architectures for time series data, as well as anticipate possible problems.
In What Does BERT Look At? An Analysis of BERT’s Attention, the authors analyze the attention of BERT and investigate linguistic relations. This paper is a great illustration of how self-attention (or any type of attention really) naturally lends itself to interpretability, as we can use the attention weights to visualize the parts of the input the model focuses on.
Also interesting is the fact that the authors find the following:
We find that most heads put little attention on the current token. However, there are heads that specialize to attending heavily on the next or previous token, especially in earlier layers of the network.
Obviously, in time series data, attention heads “attending to the next token” is problematic. Hence, when dealing with time series we will have to apply some sort of mask. Secondly, it is hard to tell whether this is solely a product of the language data BERT was trained on or whether it is likely to occur with multi-headed attention more broadly. For forming language representations, focusing on the closest word makes a lot of sense. However, this is much more variable with time series data. In certain time series, causality can come from steps much further back (for instance, for some rivers it can take 24+ hours for heavy rainfall to raise the river).
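A minimal sketch of such a mask, assuming the usual approach of setting the scores for future positions to negative infinity before the softmax (the function name is hypothetical):

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over attention scores with all future positions masked out.

    scores: (n_steps, n_steps), where scores[i, j] is how much step i
    wants to attend to step j.
    """
    n = scores.shape[-1]
    # True strictly above the diagonal, i.e. wherever j > i (the future).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)  # exp(-inf) evaluates to exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(5, 5))
weights = causal_attention_weights(scores)
```

Each row of `weights` still sums to one, but no step places any weight on a later step.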
In a related study on pruning attention heads, the authors found that removing several heads had a limited effect on performance. Generally, performance only fell significantly when more than 20% of attention heads were pruned. This is particularly relevant for time series data, as we are often dealing with long dependencies. Notably, ablating only a single attention head seems to have almost no impact on the score and in some cases results in better performance.
Another paper explores the geometrical structures found within the BERT model. The authors conclude that BERT seems to have geometric representations of parse trees internally. They also discover that there are semantically meaningful subspaces within the larger embedding space. Although this probe is obviously linguistically focused, the main question it raises is this: if BERT learns these linguistically meaningful patterns, would a similar model learn temporally relevant patterns? For instance, if we trained a transformer on time series at large scale, what would we discover in the embedding space? Would we see similar patient trajectories clustered together? If we trained on many different streams for flood forecasting, would it group dam-fed streams with similar release cycles together? Large-scale training of a transformer on thousands of different time series could prove insightful and enhance our understanding of the data as well. The authors include two cool GitHub pages with interactive visualizations that you can use to explore further.
Another fascinating research work that came out of ICLR 2019 was Pay Less Attention with Lightweight and Dynamic Convolutions. This work both investigates why self-attention works and proposes dynamic convolutions as an alternative. The main advantage of dynamic convolutions is that they are computationally simpler and more parallelizable than self-attention. The authors find that these dynamic convolutions perform roughly on par with self-attention. The authors also employ weight sharing, which further reduces the number of parameters required overall. Interestingly, despite the potential speed improvements, I haven’t seen any time series forecasting research adopt this methodology (at least not yet).
Self-attention for time series
There have been only a few research papers that use self-attention on time series data, with varying degrees of success. If you know of any additional ones please let me know. Additionally, huseinzol05 on GitHub has implemented a vanilla version of Attention Is All You Need for stock forecasting.
Attend and Diagnose leverages self-attention on medical time series data. This time series data is multivariate and contains information like a patient’s heart rate, SO2, blood pressure, etc.
Their architecture starts with a 1-D convolution across the clinical factors, which they use to obtain preliminary embeddings. Recall that a 1-D convolution slides a kernel of a specific length along the input and applies it at each position. It is important to note that here the 1-D convolution is not used to shorten the sequence along the time axis, as is typical. If the initial time series contains 100 steps, it will still contain 100 steps afterwards. Rather, the convolution is applied to create a multi-dimensional representation of each time step. For more information on 1-D convolutions for time series data refer to this great article. After the 1-D convolution step the authors then use positional encodings:
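One simple way to picture an embedding that keeps every time step is a 1-D convolution with kernel width 1, which reduces to applying the same linear projection at each step. The sketch below assumes that kernel width purely for illustration; the kernel size used in the paper is not something I can confirm, and all names and sizes here are hypothetical.

```python
import numpy as np

def pointwise_conv_embedding(series, weight, bias):
    """Kernel-width-1 convolution: one shared projection per time step.

    series: (n_steps, n_features) raw multivariate measurements
    weight: (n_features, d_model) shared across all time steps
    bias:   (d_model,)
    """
    return series @ weight + bias  # (n_steps, d_model)

rng = np.random.default_rng(0)
n_steps, n_features, d_model = 100, 5, 16
series = rng.normal(size=(n_steps, n_features))
weight = rng.normal(size=(n_features, d_model))
bias = np.zeros(d_model)

embedded = pointwise_conv_embedding(series, weight, bias)

# Changing one time step only changes that step's embedding.
perturbed = series.copy()
perturbed[3] += 1.0
embedded_perturbed = pointwise_conv_embedding(perturbed, weight, bias)
changed_rows = np.where(
    np.any(~np.isclose(embedded, embedded_perturbed), axis=1))[0]
```

The output has one `d_model`-dimensional vector for each of the 100 input steps, so the sequence length is preserved.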
The encoding is performed by mapping time step t to the same randomized lookup table during both training and prediction.
This is different than standard self-attention which uses cosine and sine functions to capture the position of words. The positional encodings are joined (likely added although…the authors do not indicate exactly how) to each respective output from the 1D Conv layer.
Next comes the self-attention operation. This is mostly the same as the standard multi-headed attention operation, but it has a few subtle differences. First, as mentioned above, since this is time series data the self-attention mechanism cannot incorporate the entire sequence. It can only incorporate time steps up to the one being considered. To accomplish this, the authors appear to use a masking mechanism that also masks time steps too far in the past. Unfortunately, the authors are not specific about the actual formula for this. If I had to guess, I would assume it is roughly analogous to the masking operation shown by the authors in overcoming the transformer bottleneck.
After the multi-headed attention, the transformed embeddings still need additional processing before they are useful. Typically in standard self-attention, we have an addition and layer normalization component. The layer normalization normalizes the sum of the self-attention output and the original embedding (see here for more information on this). The authors, however, instead choose to use dense interpolation. This means that the embeddings output by the multi-headed attention module are combined in a manner that is useful for capturing syntactic and structural information.
After the dense interpolation algorithm, there is a linear layer followed by a softmax, sigmoid, or ReLU layer (depending on the task). The model itself is multi-task, so it aims to predict the diagnosis code, the risk of decompensation, the length of stay, and the mortality rate.
Altogether, I thought this paper was a good demonstration of using self-attention on multivariate time series data. The results were state of the art at the time it was released, although they have now been surpassed by TimeNet. However, this is primarily due to the effectiveness of transfer-learning-based pretraining rather than the architecture. If I had to guess, similar pre-training with SAnD would result in better performance.
My main criticism of this paper is from a reproducibility standpoint, as no code is provided and various hyperparameters, such as the kernel size, are either not included or only vaguely hinted at. Other concepts, such as the masking mechanism, are not discussed clearly enough. I’m currently working on trying to reimplement it in PyTorch and will post the code here when I’m more sure about its reliability.
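Since the paper does not spell out the formula, the following is only a guess at what such a mask could look like: a boolean matrix that allows each step to see itself and a limited number of earlier steps. The function name and the `max_lookback` parameter are hypothetical.

```python
import numpy as np

def lookback_mask(n_steps, max_lookback):
    """Boolean mask where entry [i, j] is True if step i may attend to step j.

    A step may attend to itself and to at most `max_lookback` earlier steps;
    future steps and steps further back are blocked.
    """
    i = np.arange(n_steps)[:, None]  # attending step (rows)
    j = np.arange(n_steps)[None, :]  # attended step (columns)
    return (j <= i) & (i - j <= max_lookback)

mask = lookback_mask(n_steps=6, max_lookback=2)
```

In practice the scores at positions where the mask is `False` would be set to negative infinity before the softmax, so those positions receive zero attention weight.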
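For intuition, dense interpolation can be read as compressing a variable-length sequence of embeddings into a fixed number of weighted sums, where each sum emphasizes a different region of the sequence. The sketch below follows my understanding of the algorithm (a squared, distance-based weight between each time step and each of `n_factors` slots); treat the exact weighting formula as an assumption to check against the paper. All names are hypothetical.

```python
import numpy as np

def dense_interpolation(hidden, n_factors):
    """Compress (n_steps, d_model) embeddings into a fixed-size vector.

    Assumed weighting: w[t, m] = (1 - |s_t - m| / M) ** 2 with
    s_t = M * t / T, for t = 1..T and m = 1..M.
    """
    n_steps, d_model = hidden.shape
    t = np.arange(1, n_steps + 1)[:, None]    # (T, 1) time step index
    m = np.arange(1, n_factors + 1)[None, :]  # (1, M) slot index
    s = n_factors * t / n_steps               # relative position of each step
    w = (1.0 - np.abs(s - m) / n_factors) ** 2  # (T, M) weights
    u = w.T @ hidden                          # (M, d_model) weighted sums
    return u.reshape(-1)                      # concatenated: (M * d_model,)

rng = np.random.default_rng(0)
short = dense_interpolation(rng.normal(size=(24, 8)), n_factors=4)
long = dense_interpolation(rng.normal(size=(100, 8)), n_factors=4)
```

Both `short` and `long` have the same length, so sequences of different durations can feed the same downstream linear layer.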
Another recent paper that is fairly interesting is CDSA: Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation by Jiawei Ma et al. This article focuses on imputing (estimating) missing time series values. Effective data imputation is important for many real-world applications, as sensors often have periods where they malfunction, causing missing data. This creates problems when trying to forecast or classify data, as the missing or null values will impact the forecast. The authors set up their model to use a cross-attention mechanism that works by using data in different dimensions, such as time, location, and measurement.
The authors evaluate their results on several traffic-forecasting and air-quality datasets. They evaluate with respect to both forecasting and imputation. For testing imputation, they discard a certain percentage of the values, attempt to impute them using the model, and compare the imputed values with the actual values. Their model outperforms other RNN and statistical imputation methods at all missing-data rates. In terms of forecasting, the model also achieves the best performance.
The next paper is a recent article from NeurIPS 2019. It focuses on several of the problems with applying the transformer to time series data. The authors basically argue that classical self-attention does not fully leverage the contextual data. They argue that this particularly causes problems with dynamic time series data that varies with seasonality (for instance forecasting sales around the holidays vs. the rest of the year, or forecasting extreme weather patterns). To remedy this they introduce a new method of generating the query and key vectors.
We propose convolutional self attention [mechanism] by employing causal convolutions to produce queries and keys in the self attention layer. Query-key matching aware of local context, e.g. shapes, can help the model achieve lower training error and further improve its forecasting accuracy.
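To illustrate the idea in the quote, here is a small NumPy sketch of a causal 1-D convolution used to produce queries and keys, so that each query and key summarizes a short window of past steps rather than a single point. The kernel width of 3, the random kernels, and the function name are assumptions made for this example rather than the authors' implementation.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution over time.

    x:      (n_steps, d_in)
    kernel: (k, d_in, d_out)
    Output step t depends only on input steps t - k + 1 .. t.
    """
    k = kernel.shape[0]
    # Left-pad with zeros so no output position can see the future.
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    out = np.zeros((x.shape[0], kernel.shape[2]))
    for t in range(x.shape[0]):
        window = padded[t:t + k]  # covers original steps t - k + 1 .. t
        out[t] = np.einsum("ki,kio->o", window, kernel)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))            # 10 time steps, 3 features
q_kernel = rng.normal(size=(3, 3, 4))   # kernel width 3 (illustrative)
k_kernel = rng.normal(size=(3, 3, 4))

queries = causal_conv1d(x, q_kernel)    # (10, 4), aware of local shape
keys = causal_conv1d(x, k_kernel)       # (10, 4)
scores = queries @ keys.T / np.sqrt(queries.shape[-1])
```

With a kernel width of 1 this collapses back to the usual pointwise projection, which is why the kernel size matters in the ablation discussed later.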
Part two of the article focuses on solutions related to the memory use of the transformer model. Self-attention is very memory intensive, particularly with respect to very long sequences (specifically it is O(L²)). The authors propose a new attention mechanism that is O(L(log L)²). With this self-attention mechanism, cells can only attend to previous cells with an exponential step size. So, for instance, with steps back of one, two, and four, cell five would attend to cells four, three, and one. They also introduce two variations of this log attention: local attention and restart attention. See the diagram in their paper for more information.
The authors evaluate their approach on several different datasets, including electricity consumption (recorded in 15-minute intervals), traffic in San Francisco (20-minute intervals), hourly solar production data (from 137 different power plants), and wind data (daily estimates of 28 counties’ wind potential as a percentage of overall power production). Their choice of ρ-quantile loss as an evaluation metric is a bit strange, as normally I’d expect MAE, MAPE, RMSE, or something similar for a time series forecasting problem.
I’m still trying to grasp what exactly this metric represents; at least from the results it appears that a lower score is better. Using this metric, their convolutional self-attention transformer outperforms DeepAR, DeepState, ARIMA, and other models. They also conduct an ablation study where they look at the effect of kernel size when computing a seven-day forecast. They found that a kernel size of 5 or 6 generally produced the best result.
I think this is a good research article that addresses some of the shortcomings of the transformer as applied to time series data. I particularly think that the use of a convolutional kernel (of size greater than one) is really useful in time series problems where you want to capture surrounding context for the key and query vectors. Unfortunately, there is currently no code implementation available for this paper.
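The exponential step pattern can be sketched as a small helper that lists which earlier cells a given cell may attend to. This assumes steps back of 1, 2, 4, and so on, with 1-indexed cells; the exact index set and indexing convention in the paper may differ, and the function name is hypothetical.

```python
def log_sparse_indices(step):
    """Cells visible to `step` (1-indexed): itself plus cells 1, 2, 4, ... back."""
    indices = [step]
    offset = 1
    while step - offset >= 1:
        indices.append(step - offset)
        offset *= 2  # exponential step size
    return sorted(indices)

visible_from_9 = log_sparse_indices(9)
visible_count_1024 = len(log_sparse_indices(1024))
```

The number of visible cells grows only logarithmically with the position, which is where the memory savings over full attention come from.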
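For reference, my understanding is that the ρ-quantile loss (often called ρ-risk) is a normalized pinball loss: errors are weighted by ρ when the prediction is too low and by 1 - ρ when it is too high, then summed and divided by the total magnitude of the actual values. The sketch below follows that definition; treat it as an assumption to verify against the paper, and note that the function name is hypothetical.

```python
import numpy as np

def rho_risk(actual, predicted_quantile, rho):
    """Normalized quantile (pinball) loss for a predicted rho-quantile."""
    actual = np.asarray(actual, dtype=float)
    predicted_quantile = np.asarray(predicted_quantile, dtype=float)
    diff = actual - predicted_quantile
    # Under-prediction (diff > 0) costs rho per unit;
    # over-prediction (diff <= 0) costs (1 - rho) per unit.
    pinball = np.where(diff > 0, rho * diff, (rho - 1.0) * diff)
    return 2.0 * pinball.sum() / np.abs(actual).sum()

actual = np.array([10.0, 20.0, 30.0])
median_forecast = np.array([12.0, 18.0, 30.0])

# With rho = 0.5 this reduces to total absolute error over total actual value.
r50 = rho_risk(actual, median_forecast, rho=0.5)

# With rho = 0.9, predicting too low is penalized more than predicting too high.
r90_low = rho_risk(actual, actual - 1.0, rho=0.9)
r90_high = rho_risk(actual, actual + 1.0, rho=0.9)
```

Under this definition a lower score is better, and at ρ = 0.5 the metric behaves like a scale-normalized MAE, which makes it somewhat easier to relate to the more familiar error measures.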
Conclusion and future directions
In conclusion, self-attention and related architectures have led to improvements in several time series forecasting use cases. Altogether, however, they have not seen widespread adoption. This likely comes down to several factors, such as the memory bottleneck, difficulty encoding positional information, a focus on pointwise values, and a lack of research on handling multivariate sequences. Additionally, outside of NLP many researchers are probably not familiar with self-attention and its potential. While simple models such as ARIMA make sense for many industry problems, I believe that transformers have a lot to offer as well.
Hopefully, the approaches summarized in this article shine some light on effectively applying transformers to time series problems. In a subsequent article, I plan on giving a practical step-by-step example of forecasting and classifying time-series data with a transformer in PyTorch. Any feedback and/or criticisms are welcome in the comments. Please let me know if I got something incorrect (which is quite possible given the complexity of the topic) and I will update the article.
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.