Time Series Forecasting with Deep Learning and Attention Mechanism

This is an overview of the architecture and the implementation details of the most important Deep Learning algorithms for Time Series Forecasting.

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.

Motivation

Time Series Forecasting has always been an important area of research because data in many domains are stored as time series. For example, time series data are common in medicine, weather forecasting, biology, supply chain management and stock price forecasting.

Given the growing availability of data and computing power in recent years, Deep Learning has become a fundamental part of the new generation of Time Series Forecasting models, obtaining excellent results.

In classical Machine Learning models – such as autoregressive models (AR) or exponential smoothing – feature engineering is performed manually, and some parameters are often tuned using domain knowledge. Deep Learning models, by contrast, learn features and dynamics directly from the data. As a result, they speed up data preparation and are able to learn more complex data patterns.

As different time series problems are studied in many different fields, a large number of new architectures have been developed in recent years. This has also been simplified by the growing availability of open source frameworks, which make the development of new custom network components easier and faster. In this article we summarize the most common Deep Learning approaches to Time Series Forecasting.

Applications

Let’s see some important applications of Time Series Forecasting.

Stock price forecasting – Many advanced Time Series Forecasting models are used to predict stock prices, because the historical sequences are noisy and highly uncertain, and may depend on several factors that are not always closely related to the stock market.

Weather Prediction – Time Series Forecasting models are widely used to improve the accuracy of weather forecasts.

Traffic forecasting – Travel planning applications use Time Series Forecasting models to predict road traffic, so that they can choose more accurately the fastest route to the selected destination.

Churn – Many companies use Time Series Forecasting models to predict in which month they can expect a higher turnover, so that they can implement the best employee retention plans.

Text generation – When we type a text on a smartphone, the keyboard looks at the last written words and at the typed letters, and suggests the next letters or even the whole word.

Time Series Components

First of all, it is important to explain the main components of a time series.

Long term trend

Long term trend is the overall general direction of the data, obtained by ignoring any short term effects such as seasonal variations or noise.

Seasonality

Seasonality refers to periodic fluctuations that repeat over the whole span of the time series.

Stationarity

Stationarity is an important characteristic of time series. A time series is said to be stationary if its mean, variance and autocovariance do not change significantly over time. There are many transformations that can extract the stationary part of a non-stationary process.

Noise

Every set of data has noise, which refers to random fluctuations or variations due to uncontrolled factors.

Autocorrelation

Autocorrelation is the correlation between the time series and a lagged version of itself, and is used to identify seasonality and trend in time series data.
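As a minimal sketch of this definition, the sample autocorrelation at a given lag can be computed in plain Python. The function name and the toy series are illustrative assumptions, not part of the original article.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean.
    variance = sum((x - mean) ** 2 for x in series)
    # Numerator: co-movement between the series and its lagged copy.
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


# Toy series that repeats every 4 steps.
seasonal = [1.0, 2.0, 3.0, 2.0] * 6
r4 = autocorrelation(seasonal, 4)  # lag equal to the period
r2 = autocorrelation(seasonal, 2)  # lag equal to half the period
```

For this toy series the autocorrelation is strongly positive at the seasonal lag and strongly negative at half the period, which is the kind of signature used to spot seasonality.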

Time Series Forecasting with traditional Machine Learning

Before speaking about Deep Learning methods for Time Series Forecasting, it is useful to recall that the most classical Machine Learning models used to solve this problem are ARIMA models and exponential smoothing.

ARIMA stands for AutoRegressive Integrated Moving Average: it combines Autoregressive (AR) and Moving Average (MA) approaches, applied to a differenced series, to build a composite model of the time series. This model is very simple, but can give good results. It includes parameters to account for seasonality, long term trend, autoregressive and moving average terms, in order to handle the autocorrelation embedded in the data.

In Exponential smoothing, forecasts are made on the basis of weighted averages of past observations, as in ARIMA models, but in this case the weights decrease exponentially, so that less importance is given to observations as we move further from the present.
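The decreasing weights can be sketched with simple exponential smoothing, the most basic variant (no trend or seasonal terms). The smoothing factor `alpha` and the toy demand series are assumptions chosen for illustration.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation (0 < alpha <= 1)."""
    level = series[0]  # initialise with the first observation
    levels = [level]
    for value in series[1:]:
        # alpha weights the newest value, (1 - alpha) the previous level,
        # so an observation k steps back has weight alpha * (1 - alpha) ** k.
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


demand = [10.0, 12.0, 11.0, 13.0]
levels = exponential_smoothing(demand, alpha=0.5)
forecast = levels[-1]  # one-step-ahead forecast
```

A larger `alpha` makes the forecast react faster to recent observations, while a smaller one averages over a longer history.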

Disadvantages of Traditional Machine Learning

It is well known that these traditional Machine Learning models have many limitations:

missing values can seriously affect the performance of the models;
they are not able to recognize complex patterns in the data;
they usually work well only for forecasts a few steps ahead, not for long term forecasts.

Deep Learning for Time Series Forecasting

The use of Deep Learning for Time Series Forecasting overcomes the disadvantages of traditional Machine Learning with many different approaches. In this article, five different Deep Learning architectures for Time Series Forecasting are presented:

Recurrent Neural Networks (RNNs), which are the most classical and widely used architecture for Time Series Forecasting problems;
Long Short-Term Memory (LSTM), which is an evolution of RNNs developed in order to overcome the vanishing gradient problem;
Gated Recurrent Unit (GRU), which is another evolution of RNNs, similar to LSTM;
Encoder-Decoder Model, which is a model for RNNs introduced in order to address problems where input sequences differ in length from output sequences;
Attention Mechanism, which is an evolution of the Encoder-Decoder Model, developed in order to avoid forgetting the earlier parts of the sequence.

Recurrent Neural Networks

Recurrent Neural Networks are networks of neuron-like nodes organized into successive layers, with an architecture similar to the one of standard Neural Networks. In fact, like in standard Neural Networks, neurons are divided in input layer, hidden layers and output layer. Each connection between neurons has a corresponding trainable weight.

The difference is that in this case every neuron is assigned to a fixed time step. The neurons in the hidden layer are also connected in a time dependent direction: each of them is fully connected only with the hidden layer neurons assigned to the same time step, and has a one-way connection to every neuron assigned to the next time step. The input and output neurons are connected only to the hidden layers with the same assigned time step.

Since the output of the hidden layer of one time step is part of the input of the next time step, the activation of the neurons is computed in time order: at any given time step, only the neurons assigned to that time step compute their activation.

Architecture

Weights – In the RNNs, the input vector at time t is connected to the hidden layer neurons of time t by a weight matrix U, the hidden layer neurons are connected to the neurons of time t-1 and t+1 by a weight matrix W, and the hidden layer neurons are connected to the output vector of time t by a weight matrix V; all the weight matrices are shared across time steps.

Input – The vector x(t) is the input of the network at time step t.

Hidden state – The vector h(t) is the hidden state at time t, and is a sort of memory of the network; it is calculated based on the current input and the previous time step’s hidden state:
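A common form of this update, assuming a tanh activation (the text does not name the activation) and omitting bias terms, is:

```latex
h^{(t)} = \tanh\left( U x^{(t)} + W h^{(t-1)} \right)
```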

Output – The vector y^(t) is the output of the network at time t:
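A typical choice, assuming an output activation g that is not specified in the text (for example softmax for classification, or the identity for real-valued forecasts), is:

```latex
\hat{y}^{(t)} = g\left( V h^{(t)} \right)
```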

Learning algorithm

The goal of the learning process is to find the weight matrices U, V and W for which the prediction y^(t), computed from the input x(t), is as close as possible to the real value y(t).

To achieve this, we define an objective function called the loss function and denoted J, which quantifies the distance between the real and the predicted values on the overall training set. It is given by
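One common way to write it, assuming the per-step costs are summed over the T_i time steps of each training sequence i and averaged over the training set (the symbol T_i is an assumption, not named in the text):

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{T_i} L\left( \hat{y}^{(i,t)}, y^{(i,t)} \right)
```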

where

the cost function L evaluates the distances between the real and predicted values on a single time step;
m is the size of the training set;
θ the vector of model parameters.

The loss function J is minimized using two major steps: the forward propagation and the backward propagation through time. These steps are iterated many times, and the number of iterations is called the number of epochs.

Forward propagation – With fixed parameters U, W and V, the data are propagated through the network and, at each time step t, we compute y^(t) using the previously defined formulas. At the end, the loss function is calculated.
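The forward pass can be sketched in plain Python. This is only an illustration under stated assumptions: a tanh hidden activation, a linear output, no bias terms, a squared-error cost, and toy weights; the function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, U, W, V, h0):
    """Propagate a sequence through a simple RNN with fixed U, W and V."""
    h = h0
    outputs, states = [], []
    for x in inputs:
        # Hidden state from the current input and the previous hidden state
        # (tanh activation is an assumption).
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(p) for p in pre]
        states.append(h)
        # Linear read-out of the hidden state (also an assumption).
        outputs.append(matvec(V, h))
    return outputs, states


def mean_squared_loss(outputs, targets):
    """Squared distance between predictions and real values, averaged over time steps."""
    total = sum(
        (p - y) ** 2
        for y_hat, y_true in zip(outputs, targets)
        for p, y in zip(y_hat, y_true)
    )
    return total / len(outputs)


# Toy example: 1 input feature, 2 hidden units, 1 output.
U = [[0.5], [-0.3]]
W = [[0.1, 0.2], [0.0, 0.4]]
V = [[1.0, -1.0]]
inputs = [[1.0], [0.5], [-1.0]]
targets = [[0.5], [0.2], [-0.4]]

outputs, states = rnn_forward(inputs, U, W, V, h0=[0.0, 0.0])
loss = mean_squared_loss(outputs, targets)
```

The same U, W and V are reused at every time step, which is what makes the hidden state act as a memory carried along the sequence.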

Back propagation through time – The gradients of the cost function are calculated with respect to the different parameters, then a descent algorithm is applied in order to update them. The gradients at each output depend both on the elements of the same time step and on the state of the memory at the previous time step.

Advantages of Recurrent Neural Network

In general RNNs solve many problems of traditional Machine Learning models for Time Series Forecasting.

RNNs’ performance is not significantly affected by missing values.
RNNs can find complex patterns in the input time series.
RNNs give good results when forecasting more than a few steps ahead.
RNNs can model sequences of data so that each sample can be assumed to be dependent on previous ones.

Disadvantages of Recurrent Neural Network

When trained on long time series, RNNs typically suffer from the vanishing gradient or exploding gradient problem, which means that the updates to the parameters in the hidden layers either become negligible or lead to numeric instability and chaotic behavior. This happens because the gradient of the cost function includes powers of W, which affects the memorizing capacity of the network.
The basic recurrent networks described above suffer from a weak memory, and are unable to take into account many elements of the past when predicting the future.
The training of a Recurrent Neural Network is hard to parallelize, and is also computationally expensive.
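The first point above can be illustrated in the scalar case. Assuming a linear recurrence with a single recurrent weight w (a deliberate simplification, not the general matrix case), a gradient sent back over k time steps is scaled by w to the power k:

```python
def gradient_factor(w, steps):
    """Magnitude of w ** steps: the factor applied to a gradient that is
    propagated back `steps` time steps through a scalar linear recurrence."""
    return abs(w) ** steps


vanishing = gradient_factor(0.9, 100)  # shrinks towards zero
exploding = gradient_factor(1.1, 100)  # grows without bound
```

A weight slightly below 1 makes the contribution of distant time steps negligible, while a weight slightly above 1 makes it blow up.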

Given these disadvantages, various extensions of the RNNs have been designed to extend the internal memory: bi-directional neural networks, LSTM, GRU, Attention Mechanisms. Memory enlargement can be crucial in certain fields such as finance, where it’s fundamental to memorize as much history as possible in order to predict the next steps.

Long Short-Term Memory (LSTM)

Long Short-Term Memory Networks (LSTM) have been developed to overcome the vanishing gradient problem in the standard RNN by improving the gradient flow within the network. This is achieved using an LSTM unit in place of the hidden layer. As shown in the Figure below, an LSTM unit is composed of:

a cell state, that brings information along the entire sequence and represents the memory of the network;
a forget gate, that decides what is relevant to keep from previous time steps;
an input gate, that decides what information is relevant to add from the current time step;
an output gate, that decides the value of the output at current time step.

Similarly to the RNNs, the input vector at time t is connected to the LSTM cell of time t by a weight matrix U, the LSTM cell is connected to the LSTM cell of time t-1 and t+1 by a weight matrix W, and the LSTM cell is connected to the output vector of time t by a weight matrix V. The matrices W and U are divided in submatrices (Wf, Wi, Wg, Wo; Uf, Ui, Ug, Uo) that are connected to different elements of the LSTM unit, as shown in the Figure below. All the weight matrices are shared across time.

The cell state transfers the relevant information during processing, so that information from the previous time steps also arrives at each time step, reducing the effects of short-term memory. During training over all the time steps, the gates learn which information is important to keep or to forget, and add it to the cell state or remove it from it.

In this way LSTM can recover information stored in memory many time steps earlier, which addresses the vanishing gradient problem. LSTMs are useful for classifying, processing, and predicting time series with time lags of unknown duration.

Forget Gate

The first gate is the forget gate. This gate decides which information should be deleted or saved. The information from the previous hidden state and the information from the current input are passed through the sigmoid function. An output close to 0 means that the information can be forgotten, while an output close to 1 means that the information must be saved.
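Using the submatrices Uf and Wf introduced above, and omitting bias terms (an assumption), the forget gate can be written as:

```latex
f^{(t)} = \sigma\left( U_f x^{(t)} + W_f h^{(t-1)} \right)
```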

Input Gate

The second gate is the input gate. This is used to update the cell state. Initially the previous hidden state and the current input are given as inputs to a sigmoid function (the closer the output is to 1, the more important the information). It also passes the hidden state and current input to a tanh function to squeeze values between -1 and 1, in order to improve the tuning of the network. Then the output of the tanh and of the sigmoid are multiplied element by element (in the formula below the symbol * indicates the multiplication element by element of two matrices). The sigmoid output decides what information is important to keep from the tanh output.
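With the submatrices Ui, Wi, Ug and Wg, and again omitting bias terms (an assumption), the two branches and their product can be written as follows; the names i(t) and g(t) are chosen here to match the subscripts of the weight matrices:

```latex
i^{(t)} = \sigma\left( U_i x^{(t)} + W_i h^{(t-1)} \right), \qquad
g^{(t)} = \tanh\left( U_g x^{(t)} + W_g h^{(t-1)} \right), \qquad
i^{(t)} * g^{(t)}
```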

Cell State

After the activation of the input gate, the cell state can be calculated. First, the cell state of the previous time step gets element-wise multiplied by the output of the forget gate. This gives the possibility to ignore values in the cell state when they are multiplied by values close to 0. Then the output of the input gate is element-wise added to the cell state. The output is the new cell state.
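Writing f(t) for the forget gate output, i(t) * g(t) for the input gate output and c(t) for the cell state (symbol names assumed here), this update reads:

```latex
c^{(t)} = f^{(t)} * c^{(t-1)} + i^{(t)} * g^{(t)}
```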

Output Gate

The third and final gate is the output gate, which decides the value of the next hidden state, which contains information about previous inputs. First, the previous hidden state and current input are combined in a weighted sum and passed to a sigmoid function. Then the new cell state is passed to the tanh function. At the end, the tanh output and the sigmoid output are multiplied to decide what information the hidden state should contain. The output is the new hidden state. The new cell state and the new hidden state are then carried over to the next time step.
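Putting the four steps together, one LSTM time step can be sketched in plain Python. This is a minimal illustration under stated assumptions: bias terms are omitted, the output gate is taken as o(t) = sigmoid(Uo x(t) + Wo h(t-1)) with h(t) = o(t) * tanh(c(t)), and the helper names and the parameter dictionary are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(U, x, W, h):
    """Return U x + W h for matrices stored as lists of rows."""
    return [
        sum(u * xi for u, xi in zip(u_row, x))
        + sum(w * hi for w, hi in zip(w_row, h))
        for u_row, w_row in zip(U, W)
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; `p` maps names such as "Uf" to weight matrices."""
    f = [sigmoid(a) for a in weighted_sum(p["Uf"], x, p["Wf"], h_prev)]    # forget gate
    i = [sigmoid(a) for a in weighted_sum(p["Ui"], x, p["Wi"], h_prev)]    # input gate
    g = [math.tanh(a) for a in weighted_sum(p["Ug"], x, p["Wg"], h_prev)]  # candidate values
    o = [sigmoid(a) for a in weighted_sum(p["Uo"], x, p["Wo"], h_prev)]    # output gate
    # New cell state: keep part of the old state, add the gated candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed view of the new cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# With all-zero weights every sigmoid gate outputs 0.5 and the candidate is 0,
# so the cell state is simply halved.
params = {name: zeros(2, 1) for name in ("Uf", "Ui", "Ug", "Uo")}
params.update({name: zeros(2, 2) for name in ("Wf", "Wi", "Wg", "Wo")})
h, c = lstm_step([1.0], [0.0, 0.0], [2.0, -4.0], params)
```

The zero-weight example is only a sanity check: it makes every gate output exactly 0.5, so the effect of the forget gate on the cell state is easy to verify by hand.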

Gated Recurrent Unit (GRU)

The GRU is a new generation of Recurrent Neural Networks and is very similar to an LSTM. To solve the vanishing gradient problem of a standard RNN, GRU uses an update gate and a reset gate. These two gates decide what information should be passed to the output. They can be trained to keep information from many time steps before the current time step, without washing it out through time, or to remove information which is irrelevant for the prediction. If carefully trained, GRU can perform extremely well even in complex scenarios.

As shown in the Figure below, a GRU unit is composed of:

a reset gate, which decides how much of the information from the previous time steps can be forgotten;
an update gate, which decides how much of the information from the previous time steps must be saved;
a memory, which carries information along the entire sequence and represents the memory of the network.

Reset Gate

The first gate is the reset gate. It determines how to combine the new input with the previous memory, deciding how much of the information from previous time steps can be forgotten. First the weighted sum between the input x(t) and the memory h(t-1), which holds the information for the previous t-1 steps, is performed. Then a sigmoid activation function is applied to squash the result between 0 and 1.
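The text does not name the GRU weight matrices; assuming input weights U_r and recurrent weights W_r (hypothetical names, by analogy with the LSTM section) and no bias terms, the reset gate is:

```latex
r^{(t)} = \sigma\left( U_r x^{(t)} + W_r h^{(t-1)} \right)
```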

Update Gate

The second gate is the update gate. It helps the model to determine how much of the information from previous time steps needs to be passed along to the future. That is really powerful because the model can decide to copy all the information from the past and eliminate the risk of the vanishing gradient problem. The formula to calculate it is analogous to the one for the reset gate, but the difference comes in the weights and the gate’s usage (it will be clear in the calculation of the memory).
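Assuming a separate pair of weight matrices U_z and W_z (hypothetical names) and no bias terms, the update gate is:

```latex
z^{(t)} = \sigma\left( U_z x^{(t)} + W_z h^{(t-1)} \right)
```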

Current memory

The memory content uses the reset gate to store the relevant information from the past. To obtain it, first the multiplication element by element between the output of the reset gate r(t) and the final memory at the previous time step h(t-1) is computed, then the weighted sum between the result and the input x(t) is performed. Finally, the nonlinear activation function tanh is applied.
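Assuming weight matrices U_h and W_h for this step (hypothetical names) and no bias terms, the current memory content is:

```latex
\tilde{h}^{(t)} = \tanh\left( U_h x^{(t)} + W_h \left( r^{(t)} * h^{(t-1)} \right) \right)
```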

Final Memory

As the last step, the network needs to calculate h(t), that is the vector which holds information for the current unit and passes it to the next time step. It determines what to collect from the current memory content h~(t) and what from the previous steps h(t-1). It is computed by applying the element by element multiplication between the update gate z(t) and h(t-1), and between (1-z(t)) and h~(t), and finally summing the two results.
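The four GRU steps can be sketched together in plain Python, with h(t) = z(t) * h(t-1) + (1 - z(t)) * h~(t) as the final memory. This is a minimal illustration: the weight names (Ur, Wr, Uz, Wz, Uh, Wh), the helper functions and the omission of bias terms are assumptions, not taken from the article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(U, x, W, h):
    """Return U x + W h for matrices stored as lists of rows."""
    return [
        sum(u * xi for u, xi in zip(u_row, x))
        + sum(w * hi for w, hi in zip(w_row, h))
        for u_row, w_row in zip(U, W)
    ]


def gru_step(x, h_prev, p):
    """One GRU time step; `p` maps names such as "Ur" to weight matrices."""
    r = [sigmoid(a) for a in weighted_sum(p["Ur"], x, p["Wr"], h_prev)]  # reset gate
    z = [sigmoid(a) for a in weighted_sum(p["Uz"], x, p["Wz"], h_prev)]  # update gate
    # Current memory: the reset gate filters the previous memory first.
    gated = [rk * hk for rk, hk in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in weighted_sum(p["Uh"], x, p["Wh"], gated)]
    # Final memory: blend of previous memory and current memory content.
    return [zk * hk + (1.0 - zk) * tk for zk, hk, tk in zip(z, h_prev, h_tilde)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# With all-zero weights both gates output 0.5 and the current memory is 0,
# so the new memory is half of the previous one.
params = {name: zeros(2, 1) for name in ("Ur", "Uz", "Uh")}
params.update({name: zeros(2, 2) for name in ("Wr", "Wz", "Wh")})
h = gru_step([1.0], [2.0, -4.0], params)
```

When the update gate is close to 1 the previous memory is copied almost unchanged, which is how the GRU can carry information across many time steps.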

Implementation of RNN, LSTM, GRU

RNN, LSTM and GRU can be implemented using the Keras API, which is designed to be easy to use and customize. The following 3 RNN layers are present in Keras:

keras.layers.SimpleRNN
keras.layers.LSTM
keras.layers.GRU

They allow you to quickly create recurrent models without having to make difficult configuration choices. Moreover, it’s possible to define a custom RNN cell layer with the desired behavior, which makes it possible to quickly test different prototypes in a flexible way with minimal code. On the TensorFlow website it’s possible to find instructions and many examples of the use of these layers.

Encoder-Decoder Model

In RNN, LSTM and GRU, each input corresponds to an output for the same time step. However, in many real cases we want to predict an output sequence given an input sequence of different length, without a correspondence between each input and each output. This situation is called sequence-to-sequence mapping, and lies behind numerous commonly used applications such as language translation, voice-enabled devices and online chatbots.

The Encoder-Decoder model for Recurrent Neural Networks was introduced in order to address sequence-to-sequence mapping problems. An Encoder-Decoder takes a sequence as input and generates the most probable next sequence as output. As the name suggests, the model is comprised of two sub-models:

the encoder, that is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector;
the decoder, that is responsible for stepping through the output time steps while reading from the context vector.

Encoder

The encoder is a stack of several recurrent units, that can be simple RNNs, LSTM cells or GRU cells. Each unit accepts a single element of the input sequence, collects information from that element and propagates it forward.

The hidden state vector h(t) is computed using the function of the chosen recurrent unit. The function is applied with the appropriate weights to the previous hidden state h(t-1) and the input vector x(t):
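Writing f for the function of the chosen recurrent unit, and reusing the names U and W from the RNN section for the input and recurrent weights (an assumption), this can be written as:

```latex
h^{(t)} = f\left( W h^{(t-1)} + U x^{(t)} \right)
```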

The final hidden state vector h(t) contains all the encoded information from the previous hidden representations and previous inputs.

Context Vector

The context vector is the final hidden state produced from the encoder part of the model, and represents the initial hidden state for the decoder. It encapsulates the information for all input elements in order to help the decoder make accurate forecasts.

Decoder

The decoder consists of a stack of several recurrent units. Each recurrent unit accepts a hidden state s(t-1) from the previous unit and produces an output y^(t) as well as its own hidden state s(t).

The hidden state s(t) is computed according to the function of the chosen recurrent unit:
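Writing f for that function and W_s for the recurrent weights of the decoder (a hypothetical name), a minimal form is:

```latex
s^{(t)} = f\left( W_s s^{(t-1)} \right)
```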

The output y^(t) is computed by applying the softmax function to the hidden state at the current time step s(t), multiplied by the respective weight, in order to create a probability vector:
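Naming the output weight matrix V, by analogy with the RNN section (an assumption), this reads:

```latex
\hat{y}^{(t)} = \mathrm{softmax}\left( V s^{(t)} \right)
```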

Advantages and Disadvantages

The power of this model lies in the fact that it can map sequences of different lengths to each other, since the inputs and outputs are not aligned one-to-one and their lengths can differ. This opens a whole new range of problems which can now be solved using such an architecture.

This technique works well for short sequences, but as the length of the sequence increases it becomes very difficult to summarize the whole sequence into a single vector, and the model often forgets the earlier parts of the input sequence when processing the last parts. This is the reason why many experiments show that the performance of this model decreases as the size of the sequence increases.

Attention Mechanism

The Attention mechanism is one of the main frontiers in Deep Learning and is an evolution of the Encoder-Decoder Model, developed in order to improve the performance on long input sequences.

The main idea is to allow the decoder to selectively access encoder information during decoding. This is achieved by building a different context vector for every time step of the decoder, calculated as a function of the previous hidden state of the decoder and of all the hidden states of the encoder, to which trainable weights are assigned.

In this way, the Attention mechanism assigns different importance to the different elements of the input sequence, and gives more attention to the more relevant inputs. This explains the name of the model.

Encoder

The encoder works in much the same way as in the Encoder-Decoder model. At each time step, the representation of the current input element is computed as a function of the hidden state of the previous time step and of the current input. The final hidden state contains all the encoded information from the previous hidden representations and the previous inputs.

Context vector

The main difference between the Attention mechanism and the Encoder-Decoder model is that a different context vector c(t) is computed for every time step t of the decoder.

In order to calculate the context vector c(t) for time step t we proceed as follows. First of all, for every combination of time step j of the encoder and time step t of the decoder, the so-called alignment scores e(j,t) are computed with the following weighted sum:
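A common (additive) form of this score, assuming a tanh nonlinearity and writing h(j) for the encoder hidden states and s(t-1) for the previous decoder hidden state, is:

```latex
e(j,t) = V_a^{\top} \tanh\left( U_a s^{(t-1)} + W_a h^{(j)} \right)
```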

In this equation, Wₐ, Uₐ and Vₐ are trainable weights. The weights Wₐ are associated with the hidden states of the encoder, the weights Uₐ are associated with the hidden states of the decoder, and the weights Vₐ define the function that calculates the alignment score.

For every time step t, the scores e(j,t) are normalized using the softmax function over the encoder time steps j, obtaining the attention weights α(j,t):
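Written out, with the sum running over all encoder time steps k:

```latex
\alpha(j,t) = \frac{\exp\left( e(j,t) \right)}{\sum_{k} \exp\left( e(k,t) \right)}
```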

The attention weight α(j,t) captures the importance of the input of time step j for decoding the output of time step t. The context vector c(t) is calculated as the weighted sum of all the hidden values of the encoder according to the attention weights:
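Writing h(j) for the encoder hidden state at time step j, this weighted sum is:

```latex
c^{(t)} = \sum_{j} \alpha(j,t) \, h^{(j)}
```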

This context vector makes it possible to give more attention to the more relevant inputs in the input sequence.
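The three steps (alignment scores, softmax normalization, weighted sum) can be sketched in plain Python for one decoder time step. This is a minimal illustration assuming an additive score with a tanh nonlinearity; the function names and the toy weights are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(encoder_states, s_prev, Wa, Ua, va):
    """Return the context vector and attention weights for one decoder step."""
    proj_s = matvec(Ua, s_prev)  # decoder part, shared by all encoder steps
    scores = []
    for h in encoder_states:
        proj_h = matvec(Wa, h)   # encoder part for time step j
        hidden = [math.tanh(a + b) for a, b in zip(proj_h, proj_s)]
        scores.append(sum(v * t for v, t in zip(va, hidden)))  # e(j,t)
    alphas = softmax(scores)     # attention weights over encoder steps
    dim = len(encoder_states[0])
    context = [
        sum(a * h[k] for a, h in zip(alphas, encoder_states)) for k in range(dim)
    ]
    return context, alphas


# Toy example: 3 encoder time steps with 2-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
Wa = [[1.0, 0.0], [0.0, 1.0]]
Ua = [[0.5, 0.0], [0.0, 0.5]]
va = [1.0, 1.0]

context, alphas = attention_context(encoder_states, s_prev, Wa, Ua, va)
```

The context vector is always a convex combination of the encoder hidden states, so inspecting `alphas` shows which input time steps the decoder is relying on.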

Decoder

Now the context vector c(t) is passed to the decoder, which computes the probability distribution of the next possible output. This decoding operation is repeated for every time step of the output sequence.

Then the current hidden state s(t) is computed according to the recurrent unit function, taking as input the context vector c(t), the hidden state s(t-1) and output y^(t-1) of the previous time step:
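Writing f for the recurrent unit function, this dependence can be summarized as:

```latex
s^{(t)} = f\left( s^{(t-1)}, \hat{y}^{(t-1)}, c^{(t)} \right)
```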

Thus, using this mechanism, the model is able to find the correlation between different parts of the input sequence and the corresponding parts of the output sequence.

For each time step, the output of the decoder is calculated by applying the softmax function to the weighted hidden state:
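Naming the output weight matrix V, by analogy with the RNN section (an assumption), this reads:

```latex
\hat{y}^{(t)} = \mathrm{softmax}\left( V s^{(t)} \right)
```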

Advantages

As already mentioned, the Attention mechanism gives good results even with long input sequences.
Thanks to the attention weights, the Attention mechanism also has the advantage of being more interpretable than other Deep Learning models, which are generally considered black boxes since they cannot explain their outputs.
Moreover, the Attention mechanism gives outstanding results in NLP models, since it makes it possible to remember all the words in the input and to recognize the most relevant words when formulating a response.

Implementation

The Attention mechanism can be developed using TensorFlow and Keras and easily integrated with other Keras layers. On GitHub many implementations can be found, for example:

At these links, there are also many examples on sentiment classification, text generation, document classification and machine translation.

Conclusions

Recurrent Neural Networks are the most popular Deep Learning technique for Time Series Forecasting, since they make it possible to obtain reliable predictions on time series in many different problems. The main problem with RNNs is that they suffer from the vanishing gradient problem when applied to long sequences.

LSTM and GRU were created in order to mitigate the vanishing gradient problem of RNNs with the use of gates, which regulate the flow of information through the sequence chain. The use of LSTM and GRU gives remarkable results in applications like speech recognition, speech synthesis, natural language understanding, etc.

The Encoder-Decoder model for Recurrent Neural Networks is the most common technique for sequence-to-sequence mapping problems where input sequences differ in length from output sequences.

The Attention mechanism is an evolution of the Encoder-Decoder model, created to address the drop in performance of the Encoder-Decoder model on long sequences by using a different context vector for every time step. It gives remarkable results in many areas, for example NLP, sentiment classification and document classification.

About Marco Del Pra
