Tag Archive for: LSTM

Air Quality Forecasting Python Project

You will find the full python code and all visuals for this article here in this gitlab repository. The repository contains a series of analysis, transforms and forecasting models frequently used when dealing with time series. The aim of this repository is to showcase how to model time series from the scratch, for this we are using a real usecase dataset

This project forecast the Carbon Dioxide (Co2) emission levels yearly. Most of the organizations have to follow government norms with respect to Co2 emissions and they have to pay charges accordingly, so this project will forecast the Co2 levels so that organizations can follow the norms and pay in advance based on the forecasted values. In any data science project the main component is data, for this project the data was provided by the company, from here time series concept comes into the picture. The dataset for this project contains 215 entries and two components which are Year and Co2 emissions which is univariate time series as there is only one dependent variable Co2 which depends on time. from year 1800 to year 2014 Co2 levels were present in the dataset.

The dataset used: The dataset contains yearly Co2 emmisions levels. data from 1800 to 2014 sampled every 1 year. The dataset is non stationary so we have to use differenced time series for forecasting.

After getting data the next step is to analyze the time series data. This process is done by using Python. The data was present in excel file so first we need to read that excel file. This task is done by using Pandas which is python libraries to creates Pandas Data Frame. After that preprocessing like changing data types of time from object to DateTime performed for the coding purpose. Time series contain 4 main components Level, Trend, Seasonality and Noise. To study this component, we need to decompose our time series so that we can batter understand our time series and we can choose the forecasting model accordingly because each component behave different on the model. also by decomposing we can identify that the time series is multiplicative or additive.

CO2 emissions – plotted via python pandas / matplotlib

Decomposing time series using python statesmodels libraries we get to know trend, seasonality and residual component separately. the components multiply together to make the time series multiplicative and in additive time series components added together. Taking the deep dive to understand the trend component, moving average of 10 steps were applied which shows nonlinear upward trend, fit the linear regression model to check the trend which shows upward trend. talking about seasonality there were combination of multiple patterns over time period which is common in real world time series data. capturing the white noise is difficult in this type of data. the time series contains values from 1800 where the Co2 values are less then 1 because of no human activities so levels were decreasing. By the time numbers of industries and human activities are rapidly increasing which causes Co2 levels rapidly increasing. In time series the highest Co2 emission level was 18.7 in 1979. It was challenging to decide whether to consider this values which are less then 0.5 as white noise or not because 30% of the Co2 values were less then 1, in real world looking at current scenario the chances of Co2 emission level being 0 is near to impossible still there are chances that Co2 levels can be 0.0005. So considering each data point as a valuable information we refused to remove that entries.

Next step is to create Lag plot so we can see the correlation between the current year Co2 level and previous year Co2 level. the plot was linear which shows high correlation so we can say that the current Co2 levels and previous levels have strong relationship. the randomness of the data were measured by plotting autocorrelation graph. the autocorrelation graph shows smooth curves which indicates the time series is nonstationary thus next step is to make time series stationary. in nonstationary time series, summary statistics like mean and variance change over time.

To make time series stationary we have to remove trend and seasonality from it. Before that we use dickey fuller test to make sure our time series is nonstationary. the test was done by using python, and the test gives pvalue as output. here the null hypothesis is that the data is nonstationary while alternate hypothesis is that the data is stationary, in this case the significance values is 0.05 and the pvalues which is given by dickey fuller test is greater than 0.05 hence we failed to reject null hypothesis so we can say the time series is nonstationery. Differencing is one of the techniques to make time series stationary. On this time series, first order differencing technique applied to make the time series stationary. In first order differencing we have to subtract previous value from current value for all the data points. also different transformations like log, sqrt and reciprocal were applied in the context of making the time series stationary. Smoothing techniques like simple moving average, exponential weighted moving average, simple exponential smoothing and double exponential smoothing techniques can be applied to remove the variation between time stamps and to see the smooth curves.

Smoothing techniques also used to observe trend in time series as well as to predict the future values. But performance of other models was good compared to smoothing techniques. First 200 entries taken to train the model and remaining last for testing the performance of the model. performance of different models measured by Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) as we are predicting future Co2 emissions so basically it is regression problem. RMSE is calculated by root of the average of squared difference between actual values and predicted values by the model on testing data. Here RMSE values were calculated using python sklearn library. For model building two approaches are there, one is datadriven and another one is model based. models from both the approaches were applied to find the best fitted model. ARIMA model gives the best results for this kind of dataset as the model were trained on differenced time series. The ARIMA model predicts a given time series based on its own past values. It can be used for any nonseasonal series of numbers that exhibits patterns and is not a series of random events. ARIMA takes 3 parameters which are AR, MA and the order of difference. Hyper parameter tuning technique gives best parameters for the model by trying different sets of parameters. Although The autocorrelation and partial autocorrelation plots can be use to decide AR and MA parameter because partial autocorrelation function shows the partial correlation of a stationary time series with its own lagged values so using PACF we can decide the value of AR and from ACF we can decide the value of MA parameter as ACF shows how data points in a time series are related.

Yearly difference of CO2 emissions – ARIMA Prediction

Apart from ARIMA, few other model were trained which are AR, ARMA, Simple Linear Regression, Quadratic method, Holts winter exponential smoothing, Ridge and Lasso Regression, LGBM and XGboost methods, Recurrent neural network (RNN) Long Short Term Memory (LSTM) and Fbprophet. I would like to mention my experience with LSTM here because it is another model which gives good result as ARIMA. the reason for not choosing LSTM as final model is its complexity. As ARIMA is giving appropriate results and it is simple to understand and requires less dependencies. while using lstm, lot of data preprocessing and other dependencies required, the dataset was small thus we used to train the model on CPU, otherwise gpu is required to train the LSTM model. we face one more challenge in deployment part. the challenge is to get the data into original form because the model was trained on differenced time series, so it will predict the future values in differenced format. After lot of research on the internet and by deeply understanding mathematical concepts finally we got the solution for it. solution for this issue is we have to add previous value from the original data from into first order differencing and then we have to add the last value of this time series into predicted values. To create the user interface streamlit was used, it is commonly used python library. the pickle file of the ARIMA model were used to predict the future values based on user input. The limit for forecasting is the year 2050. The project was uploaded on google cloud platform. so the flow is, first the starting year from which user want to forecast was taken and the end year till which year user want to forecast was taken and then according to the range of this inputs the prediction takes place. so by taking the inputs the pickle file will produce the future Co2 emissions in differenced format, then the values will be converted to original format and then the original values will be displayed on the user interface as well as the interactive line graph were displayed on the interface.

You will find the full python code and all visuals for this article here in this gitlab repository.

Instructions on Transformer for people outside NLP field, but with examples of NLP

I found it quite difficult to explain mathematical details of long short-term memory (LSTM) in my previous article series. But when I was studying LSTM, a new promising algorithm was already attracting attentions. The algorithm is named Transformer. Its algorithm was a first announced in a paper named “Attention Is All You Need,” and it outperformed conventional translation algorithms with lower computational costs.

In this article series, I am going to provide explanations on minimum prerequisites for understanding deep learning in NLP (natural language process) tasks, but NLP is not the main focus of this article series, and actually I do not study in NLP field. I think Transformer is going to be a new major model of deep learning as well as CNN or RNN, and the model is now being applied in various fields.

Even though Transformer is going to be a very general deep learning model, I still believe it would be an effective way to understand Transformer with some NLP because language is a good topic we have in common. Unlike my previous article series, in which I tried to explain theoretical side of RNN as precisely as possible, in this article I am going to focus on practical stuff with my toy implementations of NLP tasks, largely based on Tensorflow official tutorial. But still I will do my best to make it as straightforward as possible to understand the architecture of Transformer with various original figures.

This series is going to be composed of the articles below.

If you are in the field and can read the codes in the official tutorial with no questions, this article series is not for you, but if you want to see how a Transformer works but do not want to go too much into details of NLP, this article would be for you.

Simple RNN

LSTM back propagation: following the flows of variables

First of all, the summary of this article is: please just download my Power Point slides which I made and be patient, following the equations.

I am not supposed to use so many mathematics when I write articles on Data Science Blog. However using little mathematics when I talk about LSTM backprop is like writing German, never caring about “der,” “die,” “das,” or speaking little English in English classes (which most high school English teachers in Japan do) or writing Japanese without using any Chinese characters (which looks like a terrible handwriting by a drug addict). In short, that is ridiculous. And all the precise equations of LSTM backprop, written on a Blog is not a comfortable thing to see. So basically the whole of this article is an advertisement on my PowerPoint slides, sponsored by DATANOMIQ, and I can just give you some tips to get ready for the most tiresome part of understanding LSTM here.

*This article is the fifth article of “A gentle introduction to the tiresome part of understanding RNN.”

 *In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

1. Chain rules

This article is virtually an article on chain rules of differentiation. Even if you have clear understandings on chain rules, I recommend you to take a look at this section. If you have written down all the equations of back propagation of DCL, you would have seen what chain rules are. Even simple chain rules for backprop of normal DCL can be difficult to some people, but when it comes to backprop of LSTM, it is a pure torture.  I think using graphical models would help you understand what chain rules are like. Graphical models are basically used to describe the relations of variables and functions in probabilistic models, so to be exact I am going to use “something like graphical models” in this article. Not that this is a common way to explain chain rules.

First, let’s think about the simplest type of chain rule. Assume that you have a function f=f(x)=f(x(y)), and relations of the functions are displayed as the graphical model at the left side of the figure below. Variables are a type of function, so you should think that every node in graphical models denotes a function. Arrows in purple in the right side of the chart show how information propagate in differentiation.

Next, if you a function f , which has two variances  x_1 and x_2. And both of the variances also share two variances  y_1 and y_2. When you take partial differentiation of f with respect to y_1 or y_2, the formula is a little tricky. Let’s think about how to calculate \frac{\partial f}{\partial y_1}. The variance y_1 propagates to f via x_1 and x_2. In this case the partial differentiation has two terms as below.

In chain rules, you have to think about all the routes where a variance can propagate through. If you generalize chain rules, that is like below, and you need to understand chain rules in this way to understanding any types of back propagation.

The figure above shows that if you calculate partial differentiation of f with respect to y_i, the partial differentiation has n terms in total because y_i propagates to f via n variances. In order to understand backprop of LSTM, you constantly have to care about the flow of variances, which I showed as arrows in purple above.

2. Chain rules in LSTM

I would like you to remember the figure below, which I used in the second article to show how errors propagate backward during backprop of simple RNNs. After forward propagation, first of all, you need to calculate \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}, gradients of the error function with respect to parameters, at every time step. But you have to be careful that even though these gradients depend on time steps, the parameters \boldsymbol{\theta} do not depend on time steps.

*As I mentioned in the second article I personally think \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} should be rather denoted as (\frac{\partial J}{\partial \boldsymbol{\theta}})^{(t)} because parameters themselves do not depend on time. The textbook by MIT press also partly use the former notation. And you are likely to encounter this type of notation, so I think it is not bad to get ready for both.

The errors at time step (t) propagate backward to all the \boldsymbol{h} ^{(s)}, (s \leq t). Conversely, in order to calculate \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} errors flowing from J^{(s)},  (s \geq t). In the chart you need arrows of errors in purple for the gradient in a purple frame, orange arrows for gradients in orange frame, red arrows for gradients in red frame. And you need to sum up \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} to calculate \frac{\partial J}{\partial \boldsymbol{\theta}} = \sum_{t}{\frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}}, and you need this gradient \frac{\partial J}{\partial \boldsymbol{\theta}} to renew parameters, one time.

At an RNN block level, the flows of errors and how to renew parameters are the same in LSTM backprop, but the flow of errors inside each block is much more complicated in LSTM backprop. And in this article and my PowerPoint slides, I use a special notation to denote errors: \delta \star  ^{(t)}= \frac{\partial J^{(t)}}{\partial \star}

* Again, please be careful of what \delta \star  ^{(t)} means. Neurons depend on time steps, but parameters do not depend on time steps. So if \star are neurons,  \delta \star  ^{(t)}= \frac{\partial J}{ \partial \star ^{(t)}}, but when \star are parameters, \delta \star  ^{(t)}= \frac{\partial J^{(t)}}{ \partial \star} should be rather denoted like \delta \star  ^{(t)}= (\frac{\partial J}{ \partial \star ^{(t)}}). In the Space Odyssey paper\boldsymbol{\star} are not used as parameters, but in my PowerPoint slides and some other materials, \boldsymbol{\star} are used also as parameteres.

As I wrote in the last article, you calculate \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} as below. Unlike the last article, I also added the terms of peephole connections in the equations below, and I also added the variances \bar{\boldsymbol{f}^{(t)}}, \bar{\boldsymbol{i}^{(t)}}, \bar{\boldsymbol{z}^{(t)}}, \bar{\boldsymbol{o}^{(t)}} for convenience.

  • \boldsymbol{\bar{f}}^{(t)}=\boldsymbol{W}_{for} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{for}\odot \boldsymbol{c}^{(t-1)} + \boldsymbol{b}_{for}
  • \boldsymbol{\bar{i}}^{(t)}=\boldsymbol{W}_{in} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{in}\odot \boldsymbol{c}^{(t-1)} + \boldsymbol{b}_{in}
  • \boldsymbol{\bar{z}}^{(t)}=\boldsymbol{W}_z \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_z \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_z
  • \boldsymbol{\bar{o}}^{(t)}=\boldsymbol{W}_{out} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{out}\odot \boldsymbol{c}^{(t)} + \boldsymbol{b}_{out}
  • \boldsymbol{f}^{(t)}=\sigma( \boldsymbol{\bar{f}}^{(t)})
  • \boldsymbol{i}^{(t)}=\sigma(\boldsymbol{\bar{i}}^{(t)})
  • \boldsymbol{z}^{(t)}=tanh(\boldsymbol{\bar{z}}^{(t)})
  • \boldsymbol{o}^{(t)}=\sigma(\boldsymbol{\bar{o}}^{(t)})

With  Hadamar product operator, the renewed cell and the output are calculated as below.

  • \boldsymbol{c}^{(t)} = \boldsymbol{z}^{(t)}\odot \boldsymbol{i}^{(t)} + \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}
  • \boldsymbol{y}^{(t)} = \boldsymbol{o}^{(t)} \odot tanh(\boldsymbol{c}^{(t)})

In this article I would rather give instructions on how to read my PowerPoint slide. Just as general backprop, you need to calculate gradients of error functions with respect to parameters, such as \delta \boldsymbol{W}_{\star}, \delta \boldsymbol{R}_{\star}, \delta \boldsymbol{p}_{\star}, \delta \boldsymbol{b}_{\star}, where \star is either of \{z, in, for, out \}. And just as backprop of simple RNNs, in order to calculate gradients with respect to parameters, you need to calculate errors of neurons, that is gradients of error functions with respect to neurons, such as \delta \boldsymbol{f}^{(t)}, \delta \boldsymbol{i}^{(t)}, \delta \boldsymbol{z}^{(t)}, \delta \boldsymbol{o}^{(t)}.

*Again and again, keep it in mind that neurons depend on time steps, but parameters do not depend on time steps.

When you calculate gradients with respect to neurons, you can first calculate \delta \boldsymbol{y}^{(t)}, but the equation for this error is the most difficult, so I recommend you to put it aside for now. After calculating \delta \boldsymbol{y}^{(t)}, you can next calculate \delta \bar{\boldsymbol{o}}^{(t)}= \frac{\partial J^{(t)}}{ \partial \bar{\boldsymbol{o}}^{(t)}}. If you see the LSTM block below as a graphical model which I introduced, the information of \bar{\boldsymbol{o}}^{(t)} flow like the purple arrows. That means, \bar{\boldsymbol{o}}^{(t)} affects J only via \boldsymbol{y}^{(t)}, and this structure is equal to the first graphical model which I have introduced above. And if you calculate \bar{\boldsymbol{o}}^{(t)} element-wise, you get the equation \delta \bar{o}_{k}^{(t)}=\frac{\partial J}{\partial \bar{o}_{k}^{(t)}}= \frac{\partial J}{\partial y_{k}^{(t)}} \frac{\partial y_{k}^{(t)}}{\partial \bar{o}_{k}^{(t)}}.

*The k is an index of an element of vectors. If you can calculate element-wise gradients, it is easy to understand that as differentiation of vectors and matrices.

Next you can calculate \delta \boldsymbol{c}^{(t)}, and chain rules are very important in this process. The flow of \delta \boldsymbol{c}^{(t)} to J can be roughly divided into two streams: the one flows to J as \bodlsymbol{y}^{(t)}, and the one flows to J as \bodlsymbol{c}^{(t+1)}. And the stream from \bodlsymbol{c}^{(t)} to \bodlsymbol{y}^{(t)} also have two branches: the one via \bar{\boldsymbol{o}}^{(t)} and the one which directly converges as  \bodlsymbol{y}^{(t)}. Just as well, the stream from \bodlsymbol{c}^{(t)} to \bodlsymbol{c}^{(t+1)} also have three branches: the ones via \bar{\boldsymbol{f}}^{(t)}, \bar{\boldsymbol{i}}^{(t)}, and the one which directly converges as \bodlsymbol{c}^{(t+1)}.

If you see see these flows as graphical a graphical model, that would be like the figure below.

According to this graphical model, you can calculate \delta \boldsymbol{c} ^{(t)} element-wise as below.

* TO BE VERY HONEST I still do not fully understand why we can apply chain rules like above to calculate \delta \boldsymbol{c}^{(t)}. When you apply the formula of chain rules, which I showed in the first section, to this case, you have to be careful of where to apply partial differential operators \frac{\partial}{ \partial c_{k}^{(t)}}. In the case above, in the part \frac{\partial y_{k}^{(t)}}{\partial c_{k}^{(t)}} the partial differential operator only affects tanh(c_{k}^{(t)}) of o_{k}^{(t)} \cdot tanh(c_{k}^{(t)}), and in the part \frac{\partial c_{k}^{(t+1)}}{\partial c_{k}^{(t)}}, the partial differential operator \frac{\partial}{\partial c_{k}^{(t)}} only affects the part c_{k}^{(t)} of the term c^{t}_{k} \cdot f_{k}^{(t+1)}. In the \frac{\partial \bar{o}_{k}^{(t)}}{\partial c_{k}^{(t)}} part, only (p_{out})_{k} \cdot c_{k}^{(t)},  in the \frac{\partial \bar{i}_{k}^{(t+1)}}{\partial c_{k}^{(t)}} part, only (p_{in})_{k} \cdot c_{k}^{(t)}, and in the \frac{\partial \bar{f}_{k}^{(t+1)}}{\partial c_{k}^{(t)}} part, only (p_{in})_{k} \cdot c_{k}^{(t)}. But some other parts, which are not affected by \frac{\partial}{ \partial c_{k}^{(t)}} are also functions of c_{k}^{(t)}. For example o_{k}^{(t)} of o_{k}^{(t)} \cdot tanh(c_{k}^{(t)}) is also a function of c_{k}^{(t)}. And I am still not sure about the logic behind where to affect those partial differential operators.

*But at least, these are the only decent equations for LSTM backprop which I could find, and a frequently cited paper on LSTM uses implementation based on these equations. Computer science is more of practical skills, rather than rigid mathematical logic. It  If you have any comments or advice on this point, please let me know.

Calculating \delta \bar{\boldsymbol{f}}^{(t)}, \delta \bar{\boldsymbol{i}}^{(t)}, \delta \bar{\boldsymbol{z}}^{(t)} are also relatively straigtforward as calculating \delta \bar{\boldsymbol{o}}^{(t)}. They all use the first type of chain rule in the first section. Thereby you can get these gradients: \delta \bar{f}_{k}^{(t)}=\frac{\partial J}{ \partial \bar{f}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{f}_{k}^{(t)}}, \delta \bar{i}_{k}^{(t)}=\frac{\partial J}{\partial \bar{i}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{i}_{k}^{(t)}}, and \delta \bar{z}_{k}^{(t)}=\frac{\partial J}{\partial \bar{z}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{i}_{k}^{(t)}}.

All the gradients which we have calculated use the error \delta \boldsymbol{y}^{(t)}, but when it comes to calculating \delta \boldsymbol{y}^{(t)}….. I can only say “Please be patient. I did my best in my PowerPoint slides to explain that.” It is not a kind of process which I want to explain on Word Press. In conclusion you get an error like this: \delta \boldsymbol{y}^{(t)}=\frac{\partial J^{(t)}}{\partial \boldsymbol{y}^{(t)}} + \boldsymbol{R}_{for}^{T} \delta \bar{\boldsymbol{f}}^{(t+1)} + \boldsymbol{R}_{in}^{T}\delta \bar{\boldsymbol{i}}^{(t+1)} + \boldsymbol{R}_{out}^{T}\delta \bar{\boldsymbol{o}}^{(t+1)} + \boldsymbol{R}_{z}^{T}\delta \bar{\boldsymbol{z}}^{(t+1)}, and the flows of \boldsymbol{y}^{(t)} are as blow.

Combining the gradients we have got so far, we can calculate gradients with respect to parameters. For concrete results, please check the Space Odyssey paper or my PowerPoint slide.

3. How LSTMs tackle exploding/vanishing gradients problems

*If you are allergic to mathematics, you should not read this section or download my PowerPoint slide.

*Part of this section is more or less subjective, so if you really want to know how LSTM mitigate the problems, I highly recommend you to also refer to other materials. But at least I did my best for this article.

LSTMs do not completely solve, vanishing gradient problems. They mitigate vanishing/exploding gradient problems. I am going to roughly explain why they can tackle those problems. I think you find many explanations on that topic, but many of them seems to have some mathematical mistakes (even the slide used in a lecture in Stanford University) and I could not partly agree with some statements. I also could not find any papers or materials which show the whole picture of how LSTMs can tackle those problems. So in this article I am only going to give instructions on the most mainstream way to explain this topic.

First let’s see how gradients actually “vanish” or “explode” in simple RNNs. As I in the second article of this series, simple RNNs propagate forward as the equations below.

  • \boldsymbol{a}^{(t)} = \boldsymbol{b} + \boldsymbol{W} \cdot \boldsymbol{h}^{(t-1)} + \boldsymbol{U} \cdot \boldsymbol{x}^{(t)}
  • \boldsymbol{h}^{(t)}= g(\boldsymbol{a}^{(t)})
  • \boldsymbol{o}^{(t)} = \boldsymbol{c} + \boldsymbol{V} \cdot \boldsymbol{h}^{(t)}
  • \hat{\boldsymbol{y}} ^{(t)} = f(\boldsymbol{o}^{(t)})

And every time step, you get an error function J^{(t)}. Let’s consider the gradient of J^{(t)} with respect to \boldsymbol{h}^{(k)}, that is the error flowing from J^{(t)} to \boldsymbol{h}^{(k)}. This error is the most used to calculate gradients of the parameters.

If you calculate this error more concretely, \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}} \cdots \frac{\partial \boldsymbol{h}^{(k+2)}}{\partial \boldsymbol{h}^{(k+1)}} \frac{\partial \boldsymbol{h}^{(k+1)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \prod_{k< s \leq t} \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}}, where \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}} = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{b} + \boldsymbol{W}\cdot \boldsymbol{h}^{(s-1)} + \boldsymbol{U}\cdot \boldsymbol{x}^{(s)})) = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})).

* If you see the figure as a type of graphical model, you should be able to understand the why chain rules can be applied as the equation above.

*According to this paper \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}}  = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})), but it seems that many study materials and web sites are mistaken in this point.

Hence \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \prod_{k< s \leq t} \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})) = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} (\boldsymbol{W} ^T )^{(t - k)} \prod_{k< s \leq t} diag(g'(\boldsymbol{a}^{(s)})). If you take norms of the members you get an equality \left\lVert \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} \right\rVert \leq \left\lVert \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \right\rVert \left\lVert \boldsymbol{W} ^T \right\rVert ^{(t - k)} \prod_{k< s \leq t} \left\lVert diag(g'(\boldsymbol{a}^{(s)}))\right\rVert. I will not go into detail anymore, but it is known that according to this inequality, multiplication of weight vectors exponentially converge to 0 or to infinite number.

We have seen that the error \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} is the main factor causing vanishing/exploding gradient problems. In case of LSTM, \frac{\partial J^{(t)}}{\partial \boldsymbol{c}^{(k)}} is an equivalent. For simplicity, let’s calculate only \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}, which is equivalent to \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}} of simple RNN backprop.

* Just as I noted above, you have to be careful of which part the partial differential operator \frac{\partial}{\partial \boldsymbol{c}^{(t-1)}} affects in the chain rule above. That is, you need to calculate \frac{\partial}{\partial \boldsymbol{c}^{(t-1)}} (\boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}), and the partial differential operator only affects \boldsymbol{c}^{(t-1)}. I think this is not a correct mathematical notation, but please forgive me for doing this for convenience.

If you continue calculating the equation above more concretely, you get the equation below.

I cannot mathematically explain why, but it is known that this characteristic of gradients of LSTM backprop mitigate the vanishing/exploding gradient problem. We have seen that if you take a norm of \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}}, that is equal or smaller than repeated multiplication of the norm of the same weight matrix, and that soon leads to vanishing/exploding gradient problem. But according to the equation above, even if you take a norm of repeatedly multiplied \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}, its norm cannot be evaluated with a simple value like repeated multiplication of the norm of the same weight matrix. The outputs of each gate are different from time steps to time steps, and that adjust the value of \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}.

*I personally guess the item diag(\boldsymbol{f}^{(t)}) is every effective. The unaffected value of can directly diag(\boldsymbol{f}^{(t)}) adjust the value of \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}. And as a matter of fact, it is known that performances of LSTM drop the most when you gite rid of forget gates.

When it comes to tackling exploding gradient problems, there is a much easier technique called gradient clipping. This algorithm is very simple: you just have to adjust the size of gradient so that the absolute value of gradient is under a threshold at every time step. Imagine that you decide in which direction to move by calculating gradients, but when the footstep is going to be too big, you just adjust the size of footstep to the threshold size you have set. In a pseudo code, write a gradient clipping part only with two line code as below.

*\boldsymbol{g} is a gradient at the time step threshold is the maximum size of the “step.”

The figure below, cited from a deep learning text from MIT press textbook, is a good and straightforward explanation on gradient clipping.It is known that a strongly nonlinear function, such as error functions of RNN, can have very steep or plain areas. If you artificially visualize the idea in 3-dimensional space, as the surface of a loss function J with two variants w, b, that means the loss function J has plain areas and very steep cliffs like in the figure.Without gradient clipping, at the left side, you can see that the black dot all of a sudden climb the cliff and could jump to an unexpected area. But with gradient clipping, you avoid such “big jumps” on error functions.

Source: Source: Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning, (2016), MIT Press, 409p

 

I am glad that I have finally finished this article series. I am not sure how many of the readers would have read through all of the articles in this series, including my PowerPoint slides. I bet that is not so many. I spent a great deal of my time for making these contents, but sadly even when I was studying LSTM, it was becoming old-fashioned, at least in natural language processing (NLP) field: a very promising algorithm named Transformer has been replacing the position of LSTM. Deep learning is a very fast changing field. I also would like to make illustrative introductions on attention mechanism in NLP, from seq2seq model to Transformer. And I think LSTM would still remain as one of the algorithms in sequence data processing, such as hidden Hidden Markov model, or particle filter.

Simple RNN

Understanding LSTM forward propagation in two ways

*This article is only for the sake of understanding the equations in the second page of the paper named “LSTM: A Search Space Odyssey”. If you have no trouble understanding the equations of LSTM forward propagation, I recommend you to skip this article and go the the next article.

1. Preface

I  heard that in Western culture, smart people write textbooks so that other normal people can understand difficult stuff, and that is why textbooks in Western countries tend to be bulky, but also they are not so difficult as they look. On the other hand in Asian culture, smart people write puzzling texts on esoteric topics, and normal people have to struggle to understand what noble people wanted to say. Publishers also require the authors to keep the texts as short as possible, so even though the textbooks are thin, usually students have to repeat reading the textbooks several times because usually they are too abstract.

Both styles have cons and pros, and usually I prefer Japanese textbooks because they are concise, and sometimes it is annoying to read Western style long texts with concrete straightforward examples to reach one conclusion. But a problem is that when it comes to explaining LSTM, almost all the text books are like Asian style ones. Every study material seems to skip the proper steps necessary for “normal people” to understand its algorithms. But after actually making concrete slides on mathematics on LSTM, I understood why: if you write down all the equations on LSTM forward/back propagation, that is going to be massive, and actually I had to make 100-page PowerPoint animated slides to make it understandable to people like me.

I already had a feeling that “Does it help to understand only LSTM with this precision? I should do more practical codings.” For example François Chollet, the developer of Keras, in his book, said as below.

 

For me that sounds like “We have already implemented RNNs for you, so just shut up and use Tensorflow/Keras.” Indeed, I have never cared about the architecture of my Mac Book Air, but I just use it every day, so I think he is to the point. To make matters worse, for me, a promising algorithm called Transformer seems to be replacing the position of LSTM in natural language processing. But in this article series and in my PowerPoint slides, I tried to explain as much as possible, contrary to his advice.

But I think, or rather hope,  it is still meaningful to understand this 23-year-old algorithm, which is as old as me. I think LSTM did build a generation of algorithms for sequence data, and actually Sepp Hochreiter, the inventor of LSTM, has received Neural Network Pioneer Award 2021 for his work.

I hope those who study sequence data processing in the future would come to this article series, and study basics of RNN just as I also study classical machine learning algorithms.

 *In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

2. Why LSTM?

First of all, let’s take a brief look at what I said about the structures of RNNs,  in the first and the second article. A simple RNN is basically densely connected network with a few layers. But the RNN gets an input every time step, and it gives out an output at the time step. Part of information in the middle layer are succeeded to the next time step, and in the next time step, the RNN also gets an input and gives out an output. Therefore, virtually a simple RNN behaves almost the same way as densely connected layers with many layers during forward/back propagation if you focus on its recurrent connections.

That is why simple RNNs suffer from vanishing/exploding gradient problems, where the information exponentially vanishes or explodes when its gradients are multiplied many times through many layers during back propagation. To be exact, I think you need to consider this problem precisely like you can see in this paper. But for now, please at least keep it in mind that when you calculate a gradient of an error function with respect to parameters of simple neural networks, you have to multiply parameters many times like below, and this type of calculation usually leads to vanishing/exploding gradient problem.

LSTM was invented as a way to tackle such problems as I mentioned in the last article.

3. How to display LSTM

I would like you to just go to image search on Google, Bing, or Yahoo!, and type in “LSTM.” I think you will find many figures, but basically LSTM charts are roughly classified into two types: in this article I call them “Space Odyssey type” and “electronic circuit type”, and in conclusion, I highly recommend you to understand LSTM as the “electronic circuit type.”

*I just randomly came up with the terms “Space Odyssey type” and “electronic circuit type” because the former one is used in the paper I mentioned, and the latter one looks like an electronic circuit to me. You do not have to take how I call them seriously.

However, not that all the well-made explanations on LSTM use the “electronic circuit type,” and I am sure you sometimes have to understand LSTM as the “space odyssey type.” And the paper “LSTM: A Search Space Odyssey,” which I learned a lot about LSTM from,  also adopts the “Space Odyssey type.”

LSTM architectur visualization

The main reason why I recommend the “electronic circuit type” is that its behaviors look closer to that of simple RNNs, which you would have seen if you read my former articles.

*Behaviors of both of them look different, but of course they are doing the same things.

If you have some understanding on DCL, I think it was not so hard to understand how simple RNNs work because simple RNNs  are mainly composed of linear connections of neurons and weights, whose structures are the same almost everywhere. And basically they had only straightforward linear connections as you can see below.

But from now on, I would like you to give up the ideas that LSTM is composed of connections of neurons like the head image of this article series. If you do that, I think that would be chaotic and I do not want to make a figure of it on Power Point. In short, sooner or later you have to understand equations of LSTM.

4. Forward propagation of LSTM in “electronic circuit type”

*For further understanding of mathematics of LSTM forward/back propagation, I recommend you to download my slides.

The behaviors of an LSTM block is quite similar to that of a simple RNN block: an RNN block gets an input every time step and gets information from the RNN block of the last time step, via recurrent connections. And the block succeeds information to the next block.

Let’s look at the simplified architecture of  an LSTM block. First of all, you should keep it in mind that LSTM have two streams of information: the one going through all the gates, and the one going through cell connections, the “highway” of LSTM block. For simplicity, we will see the architecture of an LSTM block without peephole connections, the lines in blue. The flow of information through cell connections is relatively uninterrupted. This helps LSTMs to retain information for a long time.

In a LSTM block, the input and the output of the former time step separately go through sections named “gates”: input gate, forget gate, output gate, and block input. The outputs of the forget gate, the input gate, and the block input join the highway of cell connections to renew the value of the cell.

*The small two dots on the cell connections are the “on-ramp” of cell conection highway.

*You would see the terms “input gate,” “forget gate,” “output gate” almost everywhere, but how to call the “block gate” depends on textbooks.

Let’s look at the structure of an LSTM block a bit more concretely. An LSTM block at the time step (t) gets \boldsymbol{y}^{(t-1)}, the output at the last time step,  and \boldsymbol{c}^{(t-1)}, the information of the cell at the time step (t-1), via recurrent connections. The block at time step (t) gets the input \boldsymbol{x}^{(t)}, and it separately goes through each gate, together with \boldsymbol{y}^{(t-1)}. After some calculations and activation, each gate gives out an output. The outputs of the forget gate, the input gate, the block input, and the output gate are respectively \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)}. The outputs of the gates are mixed with \boldsymbol{c}^{(t-1)} and the LSTM block gives out an output \boldsymbol{y}^{(t)}, and gives \boldsymbol{y}^{(t)} and \boldsymbol{c}^{(t)} to the next LSTM block via recurrent connections.

You calculate \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} as below.

  • \boldsymbol{f}^{(t)}= \sigma(\boldsymbol{W}_{for} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \boldsymbol{y}^{(t-1)} +  \boldsymbol{b}_{for})
  • \boldsymbol{i}^{(t)}=\sigma(\boldsymbol{W}_{in} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{in})
  • \boldsymbol{z}^{(t)}=tanh(\boldsymbol{W}_z \boldsymbol{x}^{(t)} + \boldsymbol{R}_z \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_z)
  • \boldsymbol{o}^{(t)}=\sigma(\boldsymbol{W}_{out} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{out})

*You have to keep it in mind that the equations above do not include peephole connections, which I am going to show with blue lines in the end.

The equations above are quite straightforward if you understand forward propagation of simple neural networks. You add linear products of \boldsymbol{y}^{(t)} and \boldsymbol{c}^{(t)} with different weights in each gate. What makes LSTMs different from simple RNNs is how to mix the outputs of the gates with the cell connections. In order to explain that, I need to introduce a mathematical operator called Hadamard product, which you denote as \odot. This is a very simple operator. This operator produces an elementwise product of two vectors or matrices with identical shape.

With this Hadamar product operator, the renewed cell and the output are calculated as below.

  • \boldsymbol{c}^{(t)} = \boldsymbol{z}^{(t)}\odot \boldsymbol{i}^{(t)} + \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}
  • \boldsymbol{y}^{(t)} = \boldsymbol{o}^{(t)} \odot tanh(\boldsymbol{c}^{(t)})

The values of \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} are compressed into the range of [0, 1] or [-1, 1] with activation functions. You can see that the input gate and the block input give new information to the cell. The part \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)} means that the output of the forget gate “forgets” the cell of the last time step by multiplying the values from 0 to 1 elementwise. And the cell \boldsymbol{c}^{(t)} is activated with tanh() and the output of the output gate “suppress” the activated value of \boldsymbol{c}^{(t)}. In other words, the output gatedecides how much information to give out as an output of the LSTM block. The output of every gate depends on the input \boldsymbol{x}^{(t)}, and the recurrent connection \boldsymbol{y}^{(t-1)}. That means an LSTM block learns to forget the cell of the last time step, to renew the cell, and to suppress the output. To describe in an extreme manner, if all the outputs of every gate are always (1, 1, …1)^T, LSTMs forget nothing, retain information of inputs at every time step, and gives out everything. And  if all the outputs of every gate are always (0, 0, …0)^T, LSTMs forget everything, receive no inputs, and give out nothing.

This model has one problem: the outputs of each gate do not directly depend on the information in the cell. To solve this problem, some LSTM models introduce some flows of information from the cell to each gate, which are shown as lines in blue in the figure below.

LSTM inner architecture

LSTM models, for example the one with or without peephole connection, depend on the library you use, and the model I have showed is one of standard LSTM structure. However no matter how complicated structure of an LSTM block looks, you usually cover it with a black box as below and show its behavior in a very simplified way.

5. Space Odyssey type

I personally think there is no advantages of understanding how LSTMs work with this Space Odyssey type chart, but in several cases you would have to use this type of chart. So I will briefly explain how to look at that type of chart, based on understandings of LSTMs you have gained through this article.

In Space Odyssey type of LSTM chart, at the center is a cell. Electronic circuit type of chart, which shows the flow of information of the cell as an uninterrupted “highway” in an LSTM block. On the other hand, in a Spacey Odyssey type of chart, the information of the cell rotate at the center. And each gate gets the information of the cell through peephole connections,  \boldsymbol{x}^{(t)}, the input at the time step (t) , sand \boldsymbol{y}^{(t-1)}, the output at the last time step (t-1), which came through recurrent connections. In Space Odyssey type of chart, you can more clearly see that the information of the cell go to each gate through the peephole connections in blue. Each gate calculates its output.

Just as the charts you have seen, the dotted line denote the information from the past. First, the information of the cell at the time step (t-1) goes to the forget gate and get mixed with the output of the forget cell In this process the cell is partly “forgotten.” Next, the input gate and the block input are mixed to generate part of new value of the the cell at time step  (t). And the partly “forgotten” \boldsymbol{c}^{(t-1)} goes back to the center of the block and it is mixed with the output of the input gate and the block input. That is how \boldsymbol{c}^{(t)} is renewed. And the value of new cell flow to the top of the chart, being mixed with the output of the output gate. Or you can also say the information of new cell is “suppressed” with the output gate.

I have finished the first four articles of this article series, and finally I am gong to write about back propagation of LSTM in the next article. I have to say what I have written so far is all for the next article, and my long long Power Point slides.

 

[References]

[1] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber, “LSTM: A Search Space Odyssey,” (2017)

[2] Francois Chollet, Deep Learning with Python,(2018), Manning , pp. 202-204

[3] “Sepp Hochreiter receives IEEE CIS Neural Networks Pioneer Award 2021”, Institute of advanced research in artificial intelligence, (2020)
URL: https://www.iarai.ac.at/news/sepp-hochreiter-receives-ieee-cis-neural-networks-pioneer-award-2021/?fbclid=IwAR27cwT5MfCw4Tqzs3MX_W9eahYDcIFuoGymATDR1A-gbtVmDpb8ExfQ87A

[4] Oketani Takayuki, “Machine Learning Professional Series: Deep Learning,” (2015), pp. 120-125
岡谷貴之 著, 「機械学習プロフェッショナルシリーズ 深層学習」, (2015), pp. 120-125

[5] Harada Tatsuya, “Machine Learning Professional Series: Image Recognition,” (2017), pp. 252-257
原田達也 著, 「機械学習プロフェッショナルシリーズ 画像認識」, (2017), pp. 252-257

[6] “Understandable LSTM ~ With the Current Trends,” Qiita, (2015)
「わかるLSTM ~ 最近の動向と共に」, Qiita, (2015)
URL: https://qiita.com/t_Signull/items/21b82be280b46f467d1b

Simple RNN

A brief history of neural nets: everything you should know before learning LSTM

This is not a college course or something on deep learning with strict deadlines for assignments, so let’s take a detour from practical stuff and take a brief look at the history of neural networks.

The history of neural networks is also a big topic, which could be so long that I had to prepare another article series. And usually I am supposed to begin such articles with something like “The term ‘AI’ was first used by John McCarthy in Dartmouth conference 1956…” but you can find many of such texts written by people with much more experiences in this field. Therefore I am going to write this article from my point of view, as an intern writing articles on RNN, as a movie buff, and as one of many Japanese men who spent a great deal of childhood with video games.

We are now in the third AI boom, and some researchers say this boom began in 2006. A professor in my university said there we are now in a kind of bubble economy in machine learning/data science industry, but people used to say “Stop daydreaming” to AI researchers. The second AI winter is partly due to vanishing/exploding gradient problem of deep learning. And LSTM was invented as one way to tackle such problems, in 1997.

1, First AI boom

In the first AI boom, I think people were literally “daydreaming.” Even though the applications of machine learning algorithms were limited to simple tasks like playing chess, checker, or searching route of 2d mazes, and sometimes this time is called GOFAI (Good Old Fashioned AI).

Even today when someone use the term “AI” merely for tasks with neural networks, that amuses me because for me deep learning is just statistically and automatically training neural networks, which are capable of universal approximation, into some classifiers/regressors. Actually the algorithms behind that is quite impressive, but the structure of human brains is much more complicated. The hype of “AI” already started in this first AI boom. Let me take an example of machine translation in this video. In fact the research of machine translation already started in the early 1950s, and of  specific interest in the time was translation between English and Russian due to Cold War. In the first article of this series, I said one of the most famous applications of RNN is machine translation, such as Google Translation, DeepL. They are a type of machine translation called neural machine translation because they use neural networks, especially RNNs. Neural machine translation was an astonishing breakthrough around 2014 in machine translation field. The former major type of machine translation was statistical machine translation, based on statistical language models. And the machine translator in the first AI boom was rule base machine translators, which are more primitive than statistical ones.

The most remarkable invention in this time was of course perceptron by Frank Rosenblatt. Some people say that this is the first neural network. Even though you can implement perceptron with a-few-line codes in Python, obviously they did not have Jupyter Notebook in those days. The perceptron was implemented as a huge instrument named Mark 1 Perceptron, and it was composed of randomly connected wires. I do not precisely know how it works, but it was a huge effort to implement even the most primitive type of neural networks. They needed to use a big lighting fixture to get a 20*20 pixel image using 20*20 array of cadmium sulphide photocells. The research by Rosenblatt, however, was criticized by Marvin Minsky in his book because perceptrons could only be used for linearly separable data. To make matters worse the criticism prevailed as that more general, multi-layer perceptrons were also not useful for linearly inseparable data (as I mentioned in the first article, multi-layer perceptrons, namely normal neural networks,  can be universal approximators, which have potentials to classify/regress various types of complex data). In case you do not know what “linearly separable” means, imagine that there are data plotted on a piece of paper. If an elementary school kid can draw a border line between two clusters of the data with a ruler and a pencil on the paper, the 2d data is “linearly separable”….

With big disappointments to the research on “electronic brains,” the budget of AI research was reduced and AI research entered its first winter.

I think  the frame problem(1969),  by John McCarthy and Patrick J. Hayes, is also an iconic theory in the end of the first AI boom. This theory is known as a story of creating a robot trying to pull out its battery on a wheeled wagon in a room. The first prototype of the robot, named R1, naively tried to pull out the wagon form the room, and the bomb exploded. The problems was obvious: R1 was not programmed to consider the risks by taking each action, so the researchers made the next prototype named R1D1, which was programmed to consider the potential risks of taking each action. When R1D1 tried to pull out the wagon, it realized the risk of pulling the bomb together with the battery. But soon it started considering all the potential risks, such as the risk of the ceiling falling down, the distance between the wagon and all the walls, and so on, when the bomb exploded. The next problem was also obvious: R1D1 was not programmed to distinguish if the factors are relevant of irrelevant to the main purpose, and the next prototype R2D1 was programmed to do distinguish them. This time, R2D1 started thinking about “whether the factor is  irrelevant to the main purpose,” on every factor measured, and again the bomb exploded. How can we get a perfect AI, R2D2?

The situation of mentioned above is a bit extreme, but it is said AI could also get stuck when it try to take some super simple actions like finding a number in a phone book and make a phone call. It is difficult for an artificial intelligence to decide what is relevant and what is irrelevant, but humans will not get stuck with such simple stuff, and sometimes the frame problem is counted as the most difficult and essential problem of developing AI. But personally I think the original frame problem was unreasonable in that McCarthy, in his attempts to model the real world, was inflexible in his handling of the various equations involved, treating them all with equal weight regardless of the particular circumstances of a situation. Some people say that McCarthy, who was an advocate for AI, also wanted to see the field come to an end, due to its failure to meet the high expectations it once aroused.

Not only the frame problem, but also many other AI-related technological/philosophical problems have been proposed, such as Chinese room (1980), the symbol grounding problem (1990), and they are thought to be as hardships in inventing artificial intelligence, but I omit those topics in this article.

*The name R2D2 did not come from the famous story of frame problem. The story was Daniel Dennett first proposed the story of R2D2 in his paper published in 1984. Star Wars was first released in 1977. It is said that the name R2D2 came from “Reel 2, Dialogue 2,” which George Lucas said while film shooting. And the design of C3PO came from Maria in Metropolis(1927). It is said that the most famous AI duo in movie history was inspired by Tahei and Matashichi in The Hidden Fortress(1958), directed by Kurosawa Akira.

Interestingly, in the end of the first AI boom, 2001: A Space Odyssey, directed by Stanley Kubrick, was released in 1968. Unlike conventional fantasylike AI characters, for example Maria in Metropolis(1927), HAL 9000 was portrayed as a very realistic AI, and the movie already pointed out the risk of AI being insane when it gets some commands from several users. HAL 9000 still has been a very iconic character in AI field. For example when you say some quotes from 2001: A Space Odyssey to Siri you get some parody responses. I also thin you should keep it in mind that in order to make an AI like HAL 9000 come true, for now RNNs would be indispensable in many ways: you would need RNNs for better voice recognition, better conversational system, and for reading lips.

*Just as you cannot understand Monty Python references in Python official tutorials without watching Monty Python and the Holy Grail, you cannot understand many parodies in AI contexts without watching 2001: A Space Odyssey. Even thought the movie had some interview videos with some researchers and some narrations, Stanley Kubrick cut off all the footage and made the movie very difficult to understand. Most people did not or do not understand that it is a movie about aliens who gave homework of coming to Jupiter to human beings.

2, Second AI boom/winter

I am not going to write about the second AI boom in detail, but at least you should keep it in mind that convolutional neural network(CNN) is a keyword in this time. Neocognitron, an artificial model of how sight nerves perceive thing, was invented by Kunihiko Fukushima in 1980, and the model is said to be the origin on CNN. And Neocognitron got inspired by the Hubel and Wiesel’s research on sight nerves. In 1989, a group in AT & T Bell Laboratory led by Yann LeCun invented the first practical CNN to read handwritten digit.

Another turning point in this second AI boom was that back propagation algorithm was discovered, and the CNN by LeCun was also trained with back propagation. LeCun made a deep neural networks with some layers in 1998 for more practical uses.

But his research did not gain so much attention like today, because AI research entered its second winter at the beginning of the 1990s, and that was partly due to vanishing/exploding gradient problem of deep learning. People knew that neural networks had potentials of universal approximation, but when they tried to train naively stacked neural nets the gradients, which you need to train neural networks, exponentially increased/decreased. Even though the CNN made by LeCun was the first successful case of “deep” neural nets which did not suffer from the vanishing/exploding gradient problem, deep learning research also stagnated in this time.

The ultimate goal of this article series is to understand LSTM at a more abstract/mathematical level because it is one of the practical RNNs, but the idea of LSTM (Long Short Term Memory) itself was already proposed in 1997 as an RNN algorithm to tackle vanishing gradient problem. (Exploding gradient problem is solved with a technique named gradient clipping, and this is easier than techniques for preventing vanishing gradient problems. I am also going to explain it in the next article.) After that some other techniques like introducing forget gate, peephole connections, were discovered, but basically it took some 20 years till LSTM got attentions like today. The reasons for that is lack of hardware and data sets, and that was also major reasons for the second AI winter.

In the 1990s, the mid of second AI winter, the Internet started prevailing for commercial uses. I think one of the iconic events in this time was the source codes WWW(World Wide Web) were announced in 1993. Some of you might still remember that you little by little became able to transmit more data online in this time. That means people came to get more and more access to various datasets in those days, which is indispensable for machine learning tasks.

After all, we could not get HAL 9000 by the end of 2001, but instead we got Xbox console.

3, Video game industry and GPU

Even though research on neural networks stagnated in the 1990s the same period witnessed an advance in the computation of massive parallel linear transformations, due to their need in fields such as image processing.

Computer graphics move or rotate in 3d spaces, and that is also linear transformations. When you think about a car moving in a city, it is convenient to place the car, buildings, and other objects on a fixed 3d space. But when you need to make computer graphics of scenes of the city from a view point inside the car, you put a moving origin point in the car and see the city. The spatial information of the city is calculated as vectors from the moving origin point. Of course this is also linear transformations. Of course I am not talking about a dot or simple figures moving in the 3d spaces. Computer graphics are composed of numerous plane panels, and each of them have at least three vertexes, and they move on 3d spaces. Depending on viewpoints, you need project the 3d graphics in 3d spaces on 2d spaces to display the graphics on devices. You need to calculate which part of the panel is projected to which pixel on the display, and that is called rasterization. Plus, in order to get photophotorealistic image, you need to think about how lights from light sources reflect on the panel and projected on the display. And you also have to put some textures on groups of panels. You might also need to change color spaces, which is also linear transformations.

My point is, in short, you really need to do numerous linear transformations in parallel in image processing.

When it comes to the use of CGI in movies,  two pioneer movies were released during this time: Jurassic Park in 1993, and Toy Story in 1995. It is famous that Pixar used to be one of the departments in ILM(Industrial Light and Magic), founded by George Lucas, and Steve Jobs bought the department. Even though the members in Pixar had not even made a long feature film in their lives, after trial and errors, they made the first CGI animated feature movie. On the other hand, in order to acquire funds for the production of Schindler’s List(1993), Steven Spielberg took on Jurassic Park(1993), consequently changing the history of CGI through this “side job.”

*I think you have realized that George Lucas is mentioned almost everywhere in this article. His influences on technologies are not only limited to image processing, but also sound measuring system, nonlinear editing system. Photoshop was also originally developed under his company. I need another article series for this topic, but maybe not in Data Science Blog.

Considering that the first wire-frame computer graphics made and displayed by computers appeared in the scene of displaying the wire frame structure of Death Star in a war room, in Star Wars: A New Hope, the development of CGI was already astonishing at this time. But I think deep learning owe its development more to video game industry.

*I said that the Death Star scene is the first use of graphics made and DISPLAYED by computers, because I have to say one of the first graphics in movie MADE by computer dates back to the legendary title sequence of Vertigo(1958).

When it comes to 3D video games the processing unit has to constantly deal with real time commands from controllers. It is famous that GPU was originally specifically designed for plotting computer graphics. Video game market is the biggest in entertainment industry in general, and it is said that the quality of computer graphics have the strongest correlation with video games sales, therefore enhancing this quality is a priority for the video game console manufacturers.

One good example to see how much video games developed is comparing original Final Fantasy 7 and the remake one. The original one was released in 1997, the same year as when LSTM was invented. And recently  the remake version of Final Fantasy 7 was finally released this year. The original one was also made with very big budget, and it was divided into three CD-ROMs. The original one was also very revolutionary given that the former ones of Final Fantasy franchise were all 2d video retro style video games. But still the computer graphics looks like polygons, and in almost all scenes the camera angle was fixed in the original one. On the other hand the remake one is very photorealistic and you can move the angle of the camera as you want while you play the video game.

There were also fierce battles by graphic processor manufacturers in computer video game market in the 1990s, but personally I think the release of Xbox console was a turning point in the development of GPU. To be concrete, Microsoft adopted a type of NV20 GPU for Xbox consoles, and that left some room of programmability for developers. The chief architect of NV20, which was released under the brand of GeForce3, said making major changes in the company’s graphic chips was very risky. But that decision opened up possibilities of uses of GPU beyond computer graphics.

I think that the idea of a programmable GPU provided other scientific fields with more visible benefits after CUDA was launched. And GPU gained its position not only in deep learning, but also many other fields including making super computers.

*When it comes to deep learning, even GPUs have strong rivals. TPU(Tensor Processing Unit) made by Google, is specialized for deep learning tasks, and have astonishing processing speed. And FPGA(Field Programmable Gate Array), which was originally invented customizable electronic circuit, proved to be efficient for reducing electricity consumption of deep learning tasks.

*I am not so sure about this GPU part. Processing unit, including GPU is another big topic, that is beyond my capacity to be honest.  I would appreciate it if you could share your view and some references to confirm your opinion, on the comment section or via email.

*If you are interested you should see this video of game fans’ reactions to the announcement of Final Fantasy 7. This is the industry which grew behind the development of deep learning, and many fields where you need parallel computations owe themselves to the nerds who spent a lot of money for video games, including me.

*But ironically the engineers who invented the GPU said they did not play video games simply because they were busy. If you try to study the technologies behind video games, you would not have much time playing them. That is the reality.

We have seen that the in this second AI winter, Internet and GPU laid foundation of the next AI boom. But still the last piece of the puzzle is missing: let’s look at the breakthrough which solved the vanishing /exploding gradient problem of deep learning in the next section.

4, Pretraining of deep belief networks: “The Dawn of Deep Learning”

Some researchers say the invention of pretraining of deep belief network by Geoffrey Hinton was a breakthrough which put an end to the last AI winter. Deep belief networks are different type of networks from the neural networks we have discussed, but their architectures are similar to those of the neural networks. And it was also unknown how to train deep belief nets when they have several layers. Hinton discovered that training the networks layer by layer in advance can tackle vanishing gradient problems. And later it was discovered that you can do pretraining neural networks layer by layer with autoencoders.

*Deep belief network is beyond the scope of this article series. I have to talk about generative models, Boltzmann machine, and some other topics.

The pretraining techniques of neural networks is not mainstream anymore. But I think it is very meaningful to know that major deep learning techniques such as using ReLU activation functions, optimization with Adam, dropout, batch normalization, came up as more effective algorithms for deep learning after the advent of the pretraining techniques, and now we are in the third AI boom.

In the next next article we are finally going to work on LSTM. Specifically, I am going to offer a clearer guide to a well-made paper on LSTM, named “LSTM: A Search Space Odyssey.”

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Simple RNN

Simple RNN: the first foothold for understanding LSTM

*In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

In the last article, I mentioned “When it comes to the structure of RNN, many study materials try to avoid showing that RNNs are also connections of neurons, as well as DCL or CNN.” Even if you manage to understand DCL and CNN, you can be suddenly left behind once you try to understand RNN because it looks like a different field. In the second section of this article, I am going to provide a some helps for more abstract understandings of DCL/CNN , which you need when you read most other study materials.

My explanation on this simple RNN is based on a chapter in a textbook published by Massachusetts Institute of Technology, which is also recommended in some deep learning courses of Stanford University.

First of all, you should keep it in mind that simple RNN are not useful in many cases, mainly because of vanishing/exploding gradient problem, which I am going to explain in the next article. LSTM is one major type of RNN used for tackling those problems. But without clear understanding forward/back propagation of RNN, I think many people would get stuck when they try to understand how LSTM works, especially during its back propagation stage. If you have tried climbing the mountain of understanding LSTM, but found yourself having to retreat back to the foot, I suggest that you read through this article on simple RNNs. It should help you to gain a solid foothold, and you would be ready for trying to climb the mountain again.

*This article is the second article of “A gentle introduction to the tiresome part of understanding RNN.”

1, A brief review on back propagation of DCL.

Simple RNNs are straightforward applications of DCL, but if you do not even have any ideas on DCL forward/back propagation, you will not be able to understand this article. If you more or less understand how back propagation of DCL works, you can skip this first section.

Deep learning is a part of machine learning. And most importantly, whether it is classical machine learning or deep learning, adjusting parameters is what machine learning is all about. Parameters mean elements of functions except for variants. For example when you get a very simple function f(x)=a + bx + cx^2 + dx^3, then x is a variant, and a, b, c, d are parameters. In case of classical machine learning algorithms, the number of those parameters are very limited because they were originally designed manually. Such functions for classical machine learning is useful for features found by humans, after trial and errors(feature engineering is a field of finding such effective features, manually). You adjust those parameters based on how different the outputs(estimated outcome of classification/regression) are from supervising vectors(the data prepared to show ideal answers).

In the last article I said neural networks are just mappings, whose inputs are vectors, matrices, or sequence data. In case of DCLs, inputs are vectors. Then what’s the number of parameters ? The answer depends on the the number of neurons and layers. In the example of DCL at the right side, the number of the connections of the neurons is the number of parameters(Would you like to try to count them? At least I would say “No.”). Unlike classical machine learning you no longer need to do feature engineering, but instead you need to design networks effective for each task and adjust a lot of parameters.

*I think the hype of AI comes from the fact that neural networks find features automatically. But the reality is difficulty of feature engineering was just replaced by difficulty of designing proper neural networks.

It is easy to imagine that you need an efficient way to adjust those parameters, and the method is called back propagation (or just backprop). As long as it is about DCL backprop, you can find a lot of well-made study materials on that, so I am not going to cover that topic precisely in this article series. Simply putting, during back propagation, in order to adjust parameters of a layer you need errors in the next layer. And in order calculate the errors of the next layer, you need errors in the next next layer.

*You should not think too much about what the “errors” exactly mean. Such “errors” are defined in this context, and you will see why you need them if you actually write down all the mathematical equations behind backprops of DCL.

The red arrows in the figure shows how errors of all the neurons in a layer propagate backward to a neuron in last layer. The figure shows only some sets of such errors propagating backward, but in practice you have to think about all the combinations of such red arrows in the whole back propagation(this link would give you some ideas on how DCLs work).

These points are minimum prerequisites for continuing reading this  RNN this article. But if you are planning to understand RNN forward/back propagation at  an abstract/mathematical level that you can read academic papers,  I highly recommend you to actually write down all the equations of DCL backprop. And if possible you should try to implement backprop of three-layer DCL.

2, Forward propagation of simple RNN

In fact the simple RNN which we are going to look at in this article has only three layers. From now on imagine that inputs of RNN come from the bottom and outputs go up. But RNNs have to keep information of earlier times steps during upcoming several time steps because as I mentioned in the last article RNNs are used for sequence data, the order of whose elements is important. In order to do that, information of the neurons in the middle layer of RNN propagate forward to the middle layer itself. Therefore in one time step of forward propagation of RNN, the input at the time step propagates forward as normal DCL, and the RNN gives out an output at the time step. And information of one neuron in the middle layer propagate forward to the other neurons like yellow arrows in the figure. And the information in the next neuron propagate forward to the other neurons, and this process is repeated. This is called recurrent connections of RNN.

*To be exact we are just looking at a type of recurrent connections. For example Elman RNNs have simpler recurrent connections. And recurrent connections of LSTM are more complicated.

Whether it is a simple one or not, basically RNN repeats this process of getting an input at every time step, giving out an output, and making recurrent connections to the RNN itself. But you need to keep the values of activated neurons at every time step, so virtually you need to consider the same RNNs duplicated for several time steps like the figure below. This is the idea of unfolding RNN. Depending on contexts, the whole unfolded DCLs with recurrent connections is also called an RNN.

In many situations, RNNs are simplified as below. If you have read through this article until this point, I bet you gained some better understanding of RNNs, so you should little by little get used to this more abstract, blackboxed  way of showing RNN.

You have seen that you can unfold an RNN, per time step. From now on I am going to show the simple RNN in a simpler way,  based on the MIT textbook which I recomment. The figure below shows how RNN propagate forward during two time steps (t-1), (t).

The input \boldsymbol{x}^{(t-1)}at time step(t-1) propagate forward as a normal DCL, and gives out the output \hat{\boldsymbol{y}} ^{(t)} (The notation on the \boldsymbol{y} ^{(t)} is called “hat,” and it means that the value is an estimated value. Whatever machine learning tasks you work on, the outputs of the functions are just estimations of ideal outcomes. You need to adjust parameters for better estimations. You should always be careful whether it is an actual value or an estimated value in the context of machine learning or statistics). But the most important parts are the middle layers.

*To be exact I should have drawn the middle layers as connections of two layers of neurons like the figure at the right side. But I made my figure closer to the chart in the MIT textbook, and also most other study materials show the combinations of the two neurons before/after activation as one neuron.

\boldsymbol{a}^{(t)} is just linear summations of \boldsymbol{x}^{(t)} (If you do not know what “linear summations” mean, please scroll this page a bit), and \boldsymbol{h}^{(t)} is a combination of activated values of \boldsymbol{a}^{(t)} and linear summations of \boldsymbol{h}^{(t-1)} from the last time step, with recurrent connections. The values of \boldsymbol{h}^{(t)} propagate forward in two ways. One is normal DCL forward propagation to \hat{\boldsymbol{y}} ^{(t)} and \boldsymbol{o}^{(t)}, and the other is recurrent connections to \boldsymbol{h}^{(t+1)} .

These are equations for each step of forward propagation.

  • \boldsymbol{a}^{(t)} = \boldsymbol{b} + \boldsymbol{W} \cdot \boldsymbol{h}^{(t-1)} + \boldsymbol{U} \cdot \boldsymbol{x}^{(t)}
  • \boldsymbol{h}^{(t)}= g(\boldsymbol{a}^{(t)})
  • \boldsymbol{o}^{(t)} = \boldsymbol{c} + \boldsymbol{V} \cdot \boldsymbol{h}^{(t)}
  • \hat{\boldsymbol{y}} ^{(t)} = f(\boldsymbol{o}^{(t)})

*Please forgive me for adding some mathematical equations on this article even though I pledged not to in the first article. You can skip the them, but for some people it is on the contrary more confusing if there are no equations. In case you are allergic to mathematics, I prescribed some treatments below.

*Linear summation is a type of weighted summation of some elements. Concretely, when you have a vector \boldsymbol{x}=(x_0, x_1, x_2), and weights \boldsymbol{w}=(w_0,w_1, w_2), then \boldsymbol{w}^T \cdot \boldsymbol{x} = w_0 \cdot x_0 + w_1 \cdot x_1 +w_2 \cdot x_2 is a linear summation of \boldsymbol{x}, and its weights are \boldsymbol{w}.

*When you see a product of a matrix and a vector, for example a product of \boldsymbol{W} and \boldsymbol{v}, you should clearly make an image of connections between two layers of a neural network. You can also say each element of \boldsymbol{u}} is a linear summations all the elements of \boldsymbol{v}} , and \boldsymbol{W} gives the weights for the summations.

A very important point is that you share the same parameters, in this case \boldsymbol{\theta \in \{\boldsymbol{U}, \boldsymbol{W}, \boldsymbol{b}, \boldsymbol{V}, \boldsymbol{c} \}}, at every time step. 

And you are likely to see this RNN in this blackboxed form.

3, The steps of back propagation of simple RNN

In the last article, I said “I have to say backprop of RNN, especially LSTM (a useful and mainstream type or RNN), is a monster of chain rules.” I did my best to make my PowerPoint on LSTM backprop straightforward. But looking at it again, the LSTM backprop part still looks like an electronic circuit, and it requires some patience from you to understand it. If you want to understand LSTM at a more mathematical level, understanding the flow of simple RNN backprop is indispensable, so I would like you to be patient while understanding this step (and you have to be even more patient while understanding LSTM backprop).

This might be a matter of my literacy, but explanations on RNN backprop are very frustrating for me in the points below.

  • Most explanations just show how to calculate gradients at each time step.
  • Most study materials are visually very poor.
  • Most explanations just emphasize that “errors are back propagating through time,” using tons of arrows, but they lack concrete instructions on how actually you renew parameters with those errors.

If you can relate to the feelings I mentioned above, the instructions from now on could somewhat help you. And I am going to share some study materials on simple RNNs in an external link so that you can gain a clear and mathematical understanding on how simple RNNs work.

Backprop of RNN , as long as you are thinking about simple RNNs, is not so different from that of DCLs. But you have to be careful about the meaning of errors in the context of RNN backprop. Back propagation through time (BPTT) is one of the major methods for RNN backprop, and I am sure most textbooks explain BPTT. But most study materials just emphasize that you need errors from all the time steps, and I think that is very misleading and confusing.

You need all the gradients to adjust parameters, but you do not necessarily need all the errors to calculate those gradients. Gradients in the context of machine learning mean partial derivatives of error functions (in this case J) with respect to certain parameters, and mathematically a gradient of J with respect to \boldsymbol{\theta \in \{\boldsymbol{U}, \boldsymbol{W}, \boldsymbol{b}^{(t)}, \boldsymbol{V}, \boldsymbol{c} \}}is denoted as ( \frac{\partial J}{\partial \boldsymbol{\theta}}  ). And another confusing point in many textbooks, including the MIT one, is that they give an impression that parameters depend on time steps. For example some study materials use notations like \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}, and I think this gives an impression that this is a gradient with respect to the parameters at time step (t). In my opinion this gradient rather should be written as ( \frac{\partial J}{\partial \boldsymbol{\theta}} )^{(t)} . But many study materials denote gradients of those errors in the former way, so from now on let me use the notations which you can see in the figures in this article.

In order to calculate the gradient \frac{\partial J}{\partial \boldsymbol{x}^{(t)}} you need errors from time steps s (s \geq t) \quad (as you can see in the figure, in order to calculate a gradient in a colored frame, you need all the errors in the same color).

*Another confusing point is that the \frac{\partial J}{\partial \boldsymbol{\ast ^{(t)}}}, \boldsymbol{\ast} \in \{\boldsymbol{a}^{(t)}, \boldsymbol{h}^{(t)}, \boldsymbol{o}^{(t)}, \dots \} are correct notations, because \boldsymbol{\ast} are values of neurons after forward propagation. They depend on time steps, and these are very values which I have been calling “errors.” That is why parameters do not depend on time steps, whereas errors depend on time steps.

As I mentioned before, you share the same parameters at every time step. Again, please do not assume that parameters are different from time step to time step. It is gradients/errors (you need errors to calculate gradients) which depend on time step. And after calculating errors at every time step, you can finally adjust parameters one time, and that’s why this is called “back propagation through time.” (It is easy to imagine that this method can be very inefficient. If the input is the whole text on a Wikipedia link, you need to input all the sentences in the Wikipedia text to renew parameters one time. To solve this problem there is a backprop method named “truncated BPTT,” with which you renew parameters based on a part of a text. )

And after calculating those gradients \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} you can take a summation of them: \frac{\partial J}{\partial \boldsymbol{\theta}}=\sum_{t=0}^{t=\tau}{\frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}}. With this gradient \frac{\partial J}{\partial \boldsymbol{\theta}} , you can finally renew the value of \boldsymbol{\theta} one time.

At the beginning of this article I mentioned that simple RNNs are no longer for practical uses, and that comes from exploding/vanishing problem of RNN. This problem was one of the reasons for the AI winter which lasted for some 20 years. In the next article I am going to write about LSTM, a fancier type of RNN, in the context of a history of neural network history.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.