On the difficulty of language: prerequisites for NLP with deep learning

This is the first article of my article series “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

1 Preface

This section is virtually just my essay on language. You can skip this if you want to get down on more technical topic.

As I do not study in natural language processing (NLP) field, I would not be able to provide that deep insight into this fast changing deep leaning field throughout my article series. However at least I do understand language is a difficult and profound field, not only in engineering but also in many other study fields. Some people might be feeling that technologies are eliminating languages, or one’s motivations to understand other cultures. First of all, I would like you to keep it in mind that I am not a geek who is trying to turn this multilingual world into a homogeneous one and rebuild Tower of Babel, with deep learning. I would say I am more keen on social or anthropological sides of language.

I think you would think more about languages if you have mastered at least one foreign language. As my mother tongue is Japanese, which is totally different from many other Western languages in terms of characters and ambiguity, I understand translating is not what learning a language is all about. Each language has unique characteristics, and I believe they more or less influence one’s personalities. For example, many Western languages make the verb, I mean the conclusion, of sentences clear in the beginning part of the sentences. That is also true of Chinese, I heard. However in Japanese, the conclusion comes at the end, so that is likely to give an impression that Japanese people are being obscure or indecisive. Also, Japanese sentences usually omit their subjects. In German as well, the conclusion of a sentences tend to come at the end, but I am almost 100% sure that no Japanese people would feel German people make things unclear. I think that comes from the structures of German language, which tends to make the number, verb, relations of words crystal clear.

Source: https://twitter.com/nakamurakihiro

Let’s take an example to see how obscure Japanese is. A Japanese sentence 「頭が赤い魚を食べる猫」can be interpreted in five ways, depending on where you put emphases on.

Common sense tells you that the sentence is likely to mean the first two cases, but I am sure they can mean those five possibilities. There might be similarly obscure sentences in other languages, but I bet few languages can be as obscure as Japanese. Also as you can see from the last two sentences, you can omit subjects in Japanese. This rule is nothing exceptional. Japanese people usually don’t use subjects in normal conversations. And when you read classical Japanese, which Japanese high school students have to do just like Western students learn some of classical Latin, the writings omit subjects much more frequently.

*However interestingly we have rich vocabulary of subjects. The subject “I” can be translated to 「私」、「僕」、「俺」、「自分」、「うち」etc, depending on your personality, who you are talking to, and the time when it is written in.

I believe one can see the world only in the framework of their language, and it seems one’s personality changes depending on the language they use. I am not sure whether the language originally determines how they think, or how they think forms the language. But at least I would like you to keep it in mind that if you translate a conversation, for example a random conversation at a bar in Berlin, into Japanese, that would linguistically sound Japanese, but not anthropologically. Imagine that such kind of random conversation in Berlin or something is like playing a catch, I mean throwing a ball named “your opinion.” On the other hand,  normal conversations of Japanese people are in stead more of, I would say,  “resonance” of several tuning forks. They do their bests to show that they are listening to each other, by excessively nodding or just repeating “Really?”, but usually it seems hardly any constructive dialogues have been made.

*I sometimes feel you do not even need deep learning to simulate most of such Japanese conversations. Several-line Python codes would be enough.

My point is, this article series is mainly going to cover only a few techniques of NLP in deep learning field: sequence to sequence model (seq2seq model) , and especially Transformer. They are, at least for now, just mathematical models and mappings of a small part of this profound field of language (as far as I can cover in this article series). But still, examples of language would definitely help you understand Transformer model in the long run.

2 Tokens and word embedding

*Throughout my article series, “words” just means the normal words you use in daily life. “Tokens” means more general unit of NLP tasks. For example the word “Transformer” might be denoted as a single token “Transformer,” or maybe as a combination of two tokens “Trans” and “former.”

One challenging part of handling language data is its encodings. If you started learning programming in a language other than English, you would have encountered some troubles of using keyboards with different arrangements or with characters. Some comments on your codes in your native languages are sometimes not readable on some software. You can easily get away with that by using only English, but when it comes to NLP you have to deal with this difficulty seriously. How to encode characters in each language should be a first obstacle of NLP. In this article we are going to rely on a library named BPEmb, which provides word embedding in various languages, and you do not have to care so much about encodings in languages all over the world with this library.

In the first section, you might have noticed that Japanese sentence is not separated with spaces like Western languages. This is also true of Chinese language, and that means we need additional tasks of separating those sentences at least into proper chunks of words. This is not only a matter of engineering, but also of some linguistic fields. Also I think many people are not so conscious of how sentences in their native languages are grammatically separated.

The next point is, unlike other scientific data, such as temperature, velocity, voltage, or air pressure, language itself is not measured as numerical data. Thus in order to process language, including English, you first have to map language to certain numerical data, and after some processes you need to conversely map the output numerical data into language data. This section is going to be mainly about one-hot encoding and word embedding, the ways to convert word/token into numerical data. You might already have heard about this

You might have learnt about word embedding to some extent, but I hope you could get richer insight into this topic through this article.

2.1 One-hot encoding

One-hot encoding would be the most straightforward way to encode words/tokens. Assume that you have a dictionary whose size is |\mathcal{V}|, and it includes words from “a”, “ablation”, “actually” to “zombie”, “?”, “!”

In a mathematical manner, in order to choose a word out of those |\mathcal{V}| words, all you need is a |\mathcal{V}| dimensional vector, one of whose elements is 1, and the others are 0. When you want to choose the No. i word, which is “indeed” in the example below, its corresponding one-hot vector is \boldsymbol{v} = (0, \dots, 1, \dots, 0 ), where only the No. i element is 1. One-hot encoding is also easy to understand, and that’s all. It is easy to imagine that people have already come up with more complicated and better way to encoder words. And one major way to do that is word embedding.

2.2 Word embedding

Source: Francois Chollet, Deep Learning with Python,(2018), Manning

Actually word embedding is related to one-hot encoding, and if you understand how to train a simple neural network, for example densely connected layers, you would understand word embedding easily. The key idea of word embedding is denoting each token with a D dimensional vector, whose dimension is fewer than the vocabulary size |\mathcal{V}|. The elements of the resulting word embedding vector are real values, I mean not only 0 or 1. Obviously you can encode much richer variety of tokens with such vectors. The figure at the left side is from “Deep Learning with Python” by François Chollet, and I think this is an almost perfect and simple explanation of the comparison of one-hot encoding and word embedding. But the problem is how to get such convenient vectors. The answer is very simple: you have only to train a network whose inputs are one-hot vector of the vocabulary.

The figure below is a simplified model of word embedding of a certain word. When the word is input into a neural network, only the corresponding element of the one-hot vector is 1, and that virtually means the very first input layer is composed of one neuron whose value is 1. And the only one neuron propagates to the next D dimensional embedding layer. These weights are the very values which most other study materials call “an embedding vector.”

When you input each word into a certain network, for example RNN or Transformer, you map the input one-hot vector into the embedding layer/vector. The examples in the figure are how inputs are made when the input sentences are “You’ve got the touch” and “You’ve got the power.”   Assume that you have a dictionary of one-hot encoding, whose vocabulary is {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}, and the dimension of word embeding is 6. In this case |\mathcal{V}| = 9, D=6. When the inputs are “You’ve got the touch” or “You’ve got the power” , you put the one-hot vector corresponding to “You’ve”, “got”, “the”, “touch” or “You’ve”, “got”, “the”, “power” sequentially every time step t.

In order to get word embedding of certain vocabulary, you just need to train the network. We know that the words “actually” and “indeed” are used in similar ways in writings. Thus when we propagate those words into the embedding layer, we can expect that those embedding layers are similar. This is how we can mathematically get effective word embedding of certain vocabulary.

More interestingly, if word embedding is properly trained, you can mathematically “calculate” words. For example, \boldsymbol{v}_{king} - \boldsymbol{v}_{man} + \boldsymbol{v}_{woman} \approx \boldsymbol{v}_{queen}, \boldsymbol{v}_{Japan} - \boldsymbol{v}_{Tokyo} + \boldsymbol{v}_{Vietnam} \approx \boldsymbol{v}_{Hanoi}.

*I have tried to demonstrate this type of calculation on several word embedding, but none of them seem to work well. At least you should keep it in mind that word embedding learns complicated linear relations between words.

I should explain word embedding techniques such as word2vec in detail, but the main focus of this article is not NLP, so the points I have mentioned are enough to understand Transformer model with NLP examples in the upcoming articles.

 

3 Language model

Language models is one of the most straightforward, but crucial ideas in NLP. This is also a big topic, so this article is going to cover only basic points. Language model is a mathematical model of the probabilities of which words to come next, given a context. For example if you have a sentence “In the lecture, he opened a _.”, a language model predicts what comes at the part “_.” It is obvious that this is contextual. If you are talking about general university students, “_” would be “textbook,” but if you are talking about Japanese universities, especially in liberal art department, “_” would be more likely to be “smartphone. I think most of you use this language model everyday. When you type in something on your computer or smartphone, you would constantly see text predictions, or they might even correct your spelling or grammatical errors. This is language modelling. You can make language models in several ways, such as n-gram and neural language models, but in this article I can explain only general formulations for such models.

*I am not sure which algorithm is used in which services. That must be too fast changing and competitive for me to catch up.

As I mentioned in the first article series on RNN, a sentence is usually processed as sequence data in NLP. One single sentence is denoted as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}), a list of vectors. The vectors are usually embedding vectors, and the (t) is the index of the order of tokens. For example the sentence “You’ve go the power.” can be expressed as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}), where \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)} denote “You’ve”, “got”, “the”, “power”, “.” respectively. In this case \tau = 4.

In practice a sentence \boldsymbol{X} usually includes two tokens BOS and EOS at the beginning and the end of the sentence. They mean “Beginning Of Sentence” and “End Of Sentence” respectively. Thus in many cases \boldsymbol{X} = (\boldsymbol{BOS} , \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS} ). \boldsymbol{BOS} and \boldsymbol{EOS} are also both vectors, at least in the Tensorflow tutorial.

P(\boldsymbol{X} = (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) is the probability of incidence of the sentence. But it is easy to imagine that it would be very hard to directly calculate how likely the sentence \boldsymbol{X} appears out of all possible sentences. I would rather say it is impossible. Thus instead in NLP we calculate the probability P(\boldsymbol{X}) as a product of the probability of incidence or a certain word, given all the words so far. When you’ve got the words (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1}) so far, the probability of the incidence of \boldsymbol{x}^{(t)}, given the context is  P(\boldsymbol{x}^{(t)}|\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1)}). P(\boldsymbol{BOS}) is a probability of the the sentence \boldsymbol{X} being (\boldsymbol{BOS}), and the probability of \boldsymbol{X} being (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) can be decomposed this way: P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS})P(\boldsymbol{BOS}).

Just as well P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) = P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P( \boldsymbol{BOS}, \boldsymbol{x}^{(1)})= P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P( \boldsymbol{BOS}).

Hence, the general probability of incidence of a sentence \boldsymbol{X} is P(\boldsymbol{X})=P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(\tau -1)}, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) = P(\boldsymbol{EOS}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}) P(\boldsymbol{x}^{(\tau)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau - 1)}) \cdots P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P(\boldsymbol{BOS}).

Let \boldsymbol{x}^{(0)} be \boldsymbol{BOS} and \boldsymbol{x}^{(\tau + 1)} be \boldsymbol{EOS}. Plus, let P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]}) be P(\boldsymbol{x}^{(t+1)}|\boldsymbol{x}^{(0)}, \dots, \boldsymbol{x}^{(t)}), then P(\boldsymbol{X}) = P(\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Language models calculate which words to come sequentially in this way.

Here’s a question: how would you evaluate a language model?

I would say the answer is, when the language model generates words, the more confident the language model is, the better the language model is. Given a context, when the distribution of the next word is concentrated on a certain word, we can say the language model is confident about which word to come next, given the context.

*For some people, it would be more understandable to call this “entropy.”

Let’s take the vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”} as an example. Assume that P(\boldsymbol{X}) = P(\boldsymbol{BOS}, \boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}, \boldsymbol{EOS}) = P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}, \boldsymbol{EOS})= P(\boldsymbol{x}^{(0)})\prod_{t=0}^{4}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Given a context (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}), the probability of incidence of \boldsymbol{x}^{(2)} is P(\boldsymbol{x}^{2}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}). In the figure below, the distribution at the left side is less confident because probabilities do not spread widely, on the other hand the one at the right side is more confident that next word is “got” because the distribution concentrates on “got”.

*You have to keep it in mind that the sum of all possible probability P(\boldsymbol{x}^{(2)} | \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) is 1, that is, P(\boldsymbol{the}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + P(\boldsymbol{You've}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + \cdots + P(\boldsymbol{Boogie}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = 1.

While the language model generating the sentence “BOS You’ve got the touch EOS”, it is better if the language model keeps being confident. If it is confident, P(\boldsymbol{X})= P(\boldsymbol{BOS}) P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}}P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}) P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} gets higher. Thus (-1) \{ log_{b}{P(\boldsymbol{BOS})} + log_{b}{P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}}) + log_{b}{P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)})} + log_{b}{P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)})} + log_{b}{P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} \} gets lower, where usually b=2 or b=e.

This is how to measure how confident language models are, and the indicator of the confidence is called perplexity. Assume that you have a data set for evaluation \mathcal{D} = (\boldsymbol{X}_1, \dots, \boldsymbol{X}_n, \dots, \boldsymbol{X}_{|\mathcal{D}|}), which is composed of |\mathcal{D}| sentences in total. Each sentence \boldsymbol{X}_n = (\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau ^{(n)}}{P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})} has \tau^{(n)} tokens in total excluding \boldsymbol{BOS}, \boldsymbol{EOS}. And let |\mathcal{V}| be the size of the vocabulary of the language model. Then the perplexity of the language model is b^z, where z = \frac{-1}{|\mathcal{V}|}\sum_{n=1}^{|\mathcal{D}|}{\sum_{t=0}^{\tau ^{(n)}}{log_{b}P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})}. The b is usually 2 or e.

For example, assume that \mathcal{V} is vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}. Also assume that the evaluation data set for perplexity of a language model is \mathcal{D} = (\boldsymbol{X}_1, \boldsymbol{X}_2), where \boldsymbol{X_1} =(\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}) \boldsymbol{X_2} = (\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the }, \boldsymbol{power}). In this case |\mathcal{V}|=9, |\mathcal{D}|=2. I have already showed you how to calculate the perplexity of the sentence “You’ve got the touch.” above. You just need to do a similar thing on another sentence “You’ve got the power”, and then you can get the perplexity of the language model.

*If the network is not properly trained, it would also be confident of generating wrong outputs. However, such network still would give high perplexity because it is “confident” at any rate. I’m sorry I don’t know how to tackle the problem. Please let me put this aside, and let’s get down on Transformer model soon.

Appendix

Let’s see how word embedding is implemented with a very simple example in the official Tensorflow tutorial. It is a simple binary classification task on IMDb Dataset. The dataset is composed to comments on movies by movie critics, and you have only to classify if the commentary is positive or negative about the movie. For example when you get you get an input “To be honest, Michael Bay is a terrible as an action film maker. You cannot understand what is going on during combat scenes, and his movies rely too much on advertisements. I got a headache when Mark Walberg used a Chinese cridit card in Texas. However he is very competent when it comes to humorous scenes. He is very talented as a comedy director, and I have to admit I laughed a lot.“, the neural netowork has to judge whether the statement is positive or negative.

This networks just takes an average of input embedding vectors and regress it into a one dimensional value from 0 to 1. The shape of embedding layer is (8185, 16). Weights of neural netowrks are usually implemented as matrices, and you can see that each row of the matrix corresponds to emmbedding vector of each token.

*It is easy to imagine that this technique is problematic. This network virtually taking a mean of input embedding vectors. That could mean if the input sentence includes relatively many tokens with negative meanings, it is inclined to be classified as negative. But for example, if the sentence is “This masterpiece is a dark comedy by Charlie Chaplin which depicted stupidity of the evil tyrant gaining power in the time. It thoroughly mocked Germany in the time as an absurd group of fanatics, but such propaganda could have never been made until ‘Casablanca.'” , this can be classified as negative, because only the part “masterpiece” is positive as a token, and there are much more words with negative meanings themselves.

The official Tensorflow tutorial provides visualization of word embedding with Embedding Projector, but I would like you to take more control over the data by yourself. Please just copy and paste the codes below, installing necessary libraries. You would get a map of vocabulary used in the text classification task. It seems you cannot find clear tendency of the clusters of the tokens. You can try other dimension reduction methods to get maps of the vocabulary by for example using Scikit Learn.

[References]

[1] “Word embeddings” Tensorflow Core
https://www.tensorflow.org/tutorials/text/word_embeddings

[2]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 43-64, 72-85, 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 43-64, 72-85, 191-193

[3]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)
https://www.youtube.com/watch?v=XXtpJxZBa2c

[4] Francois Chollet, Deep Learning with Python,(2018), Manning , pp. 178-185

[5]”2.2. Manifold learning,” scikit-learn
https://scikit-learn.org/stable/modules/manifold.html

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Instructions on Transformer for people outside NLP field, but with examples of NLP

I found it quite difficult to explain mathematical details of long short-term memory (LSTM) in my previous article series. But when I was studying LSTM, a new promising algorithm was already attracting attentions. The algorithm is named Transformer. Its algorithm was a first announced in a paper named “Attention Is All You Need,” and it outperformed conventional translation algorithms with lower computational costs.

In this article series, I am going to provide explanations on minimum prerequisites for understanding deep learning in NLP (natural language process) tasks, but NLP is not the main focus of this article series, and actually I do not study in NLP field. I think Transformer is going to be a new major model of deep learning as well as CNN or RNN, and the model is now being applied in various fields.

Even though Transformer is going to be a very general deep learning model, I still believe it would be an effective way to understand Transformer with some NLP because language is a good topic we have in common. Unlike my previous article series, in which I tried to explain theoretical side of RNN as precisely as possible, in this article I am going to focus on practical stuff with my toy implementations of NLP tasks, largely based on Tensorflow official tutorial. But still I will do my best to make it as straightforward as possible to understand the architecture of Transformer with various original figures.

This series is going to be composed of the articles below.

If you are in the field and can read the codes in the official tutorial with no questions, this article series is not for you, but if you want to see how a Transformer works but do not want to go too much into details of NLP, this article would be for you.

Bias and Variance in Machine Learning

Machine learning continues to be an ever more vital component of our lives and ecosystem, whether we’re applying the techniques to answer research or business problems or in some cases even predicting the future. Machine learning models need to give accurate predictions in order to create real value for a given industry or domain.

While training a model is one of the key steps in the Data Science Project Life Cycle, how the model generalizes on unseen data is an equally important aspect that should be considered in every Data Science Project Life Cycle. We need to know whether it works and, consequently, if we can trust its predictions. Could the model be merely memorizing the data it is fed with, and therefore unable to make good predictions on future samples, or samples that it hasn’t seen before?

Let’s know the importance of evaluation with a simple example, There are two student’s Ramesh and Suresh preparing for the CAT exam to get into top IIMs (Indian Institute of Management). They both are quite good friends and stayed in the room during preparation and put an equal amount of hard work while solving numerical problems.

They both prepared for almost the same number of hours for the entire year and appeared in the final CAT exam. Surprisingly, Ramesh cleared, but Suresh did not. When asked, we got to know that there was one difference in their strategy of preparation between them, Ramesh had joined a Test Series course where he used to test his knowledge and understanding by giving mock exams and then further evaluating on which portions he is lagging and making necessary adjustments to he is preparation cycle in order to do well in those areas. But Suresh was confident, and he just kept training himself without testing on the preparation he had done.

Like the above situation we can train a Machine Learning Algorithm extensively with many parameters and new techniques, but if you are skipping its evaluation step, you cannot trust your model to perform well on the unseen data. In this article, we explain the importance of Bias, Variance and the trade-off between them in order to know how well a machine learning model generalizes to new, previously unseen data.

Training of Supervised Machine Learning

Bias

Bias is the difference between the Predicted Value and the Expected Value or how far are the predicted values from the actual values. During the training process the model makes certain assumptions on the training data provided. After Training, when it is introduced to the testing/validation data or unseen data, these assumptions may not always be correct.

If we use a large number of nearest neighbors in the K-Nearest Neighbors Algorithm, the model can totally decide that some parameters are not important at all for the modelling.  For example, it can just consider that only two predictor variables are enough to classify the data point though we have more than 10 variables.

This type of model will make very strong assumptions about the other parameters not affecting the outcome at all. You can take it as a model predicting or understanding only the simple relationship when the data points clearly indicate a more complex relationship.

When the model has high bias error, it results in a very simplistic model that does not consider the complexity of the data very well leading to Underfitting.

Variance

Variance occurs when the model performs well on the trained dataset but does not do well on an unseen data set, it is when the model considers the fluctuations or i.e. the noise as in the data as well. The model will still consider the variance as something to learn from because it learns too much from the noise inside the trained data set that it fails to perform as expected on the unseen data.

Based on the above example from Bias, if the model learns that all the ten predictor variables are important to classify a given data point then it tends to have high variance. You can take it as the model is trying to understand every minute detail making it more complex and failing to perform well on the unseen data.

When a model has High Bias error, it underfits the data and makes very simplistic assumptions on it. When a model has High Variance error, it overfits the data and learns too much from it. When a model has balanced Bias and Variance errors, it performs well on the unseen data.

Bias-Variance Trade-off

Based on the definitions of bias and variance, there is clear trade-off between bias and variance when it comes to the performance of the model. A model will have a high error if it has very high bias and low variance and have a high error if it has high variance and low bias.

A model that strikes a balance between the bias and variance can minimize the error better than those that live on extreme ends.

We can find whether the model has High Bias using the below steps:

  1. We tend to get high training errors.
  2. The validation error or test error will be similar to the training error.

We can find whether the model has High Bias using the below steps:

  1. We tend to get low training error
  2. The validation error or test error will be very high.

We can fix the High Bias using below steps:

  1. We need to gather more input features or can even try to create few using the feature engineering techniques.
  2. We can even add few polynomial features in order to increase the complexity.
  3. If we are using any regularization terms in our model, we can try to minimize it.

We can fix the High Variance using below steps:

  1. We can gather more training data so that the model can learn more on the patterns rather than the noise.
  2. We can even try to reduce the input features or do feature selection.
  3.  If we are using any regularization terms in our model we can try to maximize it.

Conclusion

In this article, we got to know the importance of the evaluation step in the Data Science Project Life Cycle, definitions of Bias and Variance, the trade-off between them and the steps we can take to fix the Underfitting and Overfitting of a Machine Learning Model.

Simple RNN

LSTM back propagation: following the flows of variables

First of all, the summary of this article is: please just download my Power Point slides which I made and be patient, following the equations.

I am not supposed to use so many mathematics when I write articles on Data Science Blog. However using little mathematics when I talk about LSTM backprop is like writing German, never caring about “der,” “die,” “das,” or speaking little English in English classes (which most high school English teachers in Japan do) or writing Japanese without using any Chinese characters (which looks like a terrible handwriting by a drug addict). In short, that is ridiculous. And all the precise equations of LSTM backprop, written on a blog is not a comfortable thing to see. So basically the whole of this article is an advertisement on my PowerPoint slides, sponsored by DATANOMIQ, and I can just give you some tips to get ready for the most tiresome part of understanding LSTM here.

*This article is the fifth article of “A gentle introduction to the tiresome part of understanding RNN.”

 *In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

1. Chain rules

This article is virtually an article on chain rules of differentiation. Even if you have clear understandings on chain rules, I recommend you to take a look at this section. If you have written down all the equations of back propagation of DCL, you would have seen what chain rules are. Even simple chain rules for backprop of normal DCL can be difficult to some people, but when it comes to backprop of LSTM, it is a pure torture.  I think using graphical models would help you understand what chain rules are like. Graphical models are basically used to describe the relations of variables and functions in probabilistic models, so to be exact I am going to use “something like graphical models” in this article. Not that this is a common way to explain chain rules.

First, let’s think about the simplest type of chain rule. Assume that you have a function f=f(x)=f(x(y)), and relations of the functions are displayed as the graphical model at the left side of the figure below. Variables are a type of function, so you should think that every node in graphical models denotes a function. Arrows in purple in the right side of the chart show how information propagate in differentiation.

Next, if you have a function f , which has two variances  x_1 and x_2. And both of the variances also share two variances  y_1 and y_2. When you take partial differentiation of f with respect to y_1 or y_2, the formula is a little tricky. Let’s think about how to calculate \frac{\partial f}{\partial y_1}. The variance y_1 propagates to f via x_1 and x_2. In this case the partial differentiation has two terms as below.

In chain rules, you have to think about all the routes where a variance can propagate through. If you generalize chain rules as the graphical model below, the partial differentiation of f with respect to y_i is calculated as below. And you need to understand chain rules in this way to understanding any types of back propagation.

The figure above shows that if you calculate partial differentiation of f with respect to y_i, the partial differentiation has n terms in total because y_i propagates to f via n variances. In order to understand backprop of LSTM, you constantly have to care about the flows of variances, which I display as purple arrows.

2. Chain rules in LSTM

I would like you to remember the figure below, which I used in the second article to show how errors propagate backward during backprop of simple RNNs. After forward propagation, first of all, you need to calculate \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}, gradients of the error function with respect to parameters, at each time step. But you have to be careful that even though these gradients depend on time steps, the parameters \boldsymbol{\theta} do not depend on time steps.

*As I mentioned in the second article I personally think \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} should be rather denoted as (\frac{\partial J}{\partial \boldsymbol{\theta}})^{(t)} because parameters themselves do not depend on time. However even the textbook by MIT press partly use the former notation. And I think you are likely to encounter this type of notation, so I think it is not bad to get ready for both.

The errors at time step (t) propagate backward to all the \boldsymbol{h} ^{(s)} (s \leq t). Conversely, in order to calculate \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} errors flowing from J^{(s)}  (s \geq t). In the chart you need arrows of errors in purple for the gradient in a purple frame, orange arrows for gradients in orange frame, red arrows for gradients in red frame. And you need to sum up \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} to calculate \frac{\partial J}{\partial \boldsymbol{\theta}} = \sum_{t}{\frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}}, and you need this gradient \frac{\partial J}{\partial \boldsymbol{\theta}} to renew parameters, one time.

At an RNN block level, the flows of errors and how to renew parameters are the same in LSTM backprop, but the flow of errors inside each block is much more complicated in LSTM backprop. But in order to denote errors of LSTM backprop, instead of \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}, I use a special notation \delta \star ^{(t)} = \frac{\partial J}{\partial \star}.

* Again, please be careful of what \delta \star  ^{(t)} means. Neurons depend on time steps, but parameters do not depend on time steps. So if \star are neurons,  \delta \star  ^{(t)}= \frac{\partial J}{ \partial \star ^{(t)}}, but when \star are parameters, \delta \star  ^{(t)} should be rather denoted like \delta \star  ^{(t)}= (\frac{\partial J}{ \partial \star })^{(t)}. In the Space Odyssey paper\boldsymbol{\star} are not used as parameters, but in my PowerPoint slides and some other materials, \boldsymbol{\star} are used also as parameteres.

As I wrote in the last article, you calculate \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} as below. Unlike the last article, I also added the terms of peephole connections in the equations below, and I also introduced the variances \bar{\boldsymbol{f}}^{(t)}, \bar{\boldsymbol{i}}^{(t)}, \bar{\boldsymbol{z}}^{(t)}, \bar{\boldsymbol{o}}^{(t)} for convenience.

  • \boldsymbol{\bar{f}}^{(t)}=\boldsymbol{W}_{for} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{for}\odot \boldsymbol{c}^{(t-1)} + \boldsymbol{b}_{for}
  • \boldsymbol{\bar{i}}^{(t)}=\boldsymbol{W}_{in} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{in}\odot \boldsymbol{c}^{(t-1)} + \boldsymbol{b}_{in}
  • \boldsymbol{\bar{z}}^{(t)}=\boldsymbol{W}_z \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_z \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_z
  • \boldsymbol{\bar{o}}^{(t)}=\boldsymbol{W}_{out} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{out}\odot \boldsymbol{c}^{(t)} + \boldsymbol{b}_{out}
  • \boldsymbol{f}^{(t)}=\sigma( \boldsymbol{\bar{f}}^{(t)})
  • \boldsymbol{i}^{(t)}=\sigma(\boldsymbol{\bar{i}}^{(t)})
  • \boldsymbol{z}^{(t)}=tanh(\boldsymbol{\bar{z}}^{(t)})
  • \boldsymbol{o}^{(t)}=\sigma(\boldsymbol{\bar{o}}^{(t)})

With  Hadamar product operator, the renewed cell and the output are calculated as below.

  • \boldsymbol{c}^{(t)} = \boldsymbol{z}^{(t)}\odot \boldsymbol{i}^{(t)} + \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}
  • \boldsymbol{y}^{(t)} = \boldsymbol{o}^{(t)} \odot tanh(\boldsymbol{c}^{(t)})

In this article I would rather give instructions on how to read my PowerPoint slide. Just as general backprop, you need to calculate gradients of error functions with respect to parameters, such as \delta \boldsymbol{W}_{\star}, \delta \boldsymbol{R}_{\star}, \delta \boldsymbol{p}_{\star}, \delta \boldsymbol{b}_{\star}, where \star is either of \{z, in, for, out \}. And just as backprop of simple RNNs, in order to calculate gradients with respect to parameters, you need to calculate errors of neurons, that is gradients of error functions with respect to neurons, such as \delta \boldsymbol{f}^{(t)}, \delta \boldsymbol{i}^{(t)}, \delta \boldsymbol{z}^{(t)}, \delta \boldsymbol{o}^{(t)}.

*Again and again, keep it in mind that neurons depend on time steps, but parameters do not depend on time steps.

When you calculate gradients with respect to neurons, you can first calculate \delta \boldsymbol{y}^{(t)}, but the equation for this error is the most difficult, so I recommend you to put it aside for now. After calculating \delta \boldsymbol{y}^{(t)}, you can next calculate \delta \bar{\boldsymbol{o}}^{(t)}= \frac{\partial J^{(t)}}{ \partial \bar{\boldsymbol{o}}^{(t)}}. If you see the LSTM block below as a graphical model which I introduced, the information of \bar{\boldsymbol{o}}^{(t)} flow like the purple arrows. That means, \bar{\boldsymbol{o}}^{(t)} affects J only via \boldsymbol{y}^{(t)}, and this structure is equal to the first graphical model which I have introduced above. And if you calculate \bar{\boldsymbol{o}}^{(t)} element-wise, you get the equation \delta \bar{o}_{k}^{(t)}=\frac{\partial J}{\partial \bar{o}_{k}^{(t)}}= \frac{\partial J}{\partial y_{k}^{(t)}} \frac{\partial y_{k}^{(t)}}{\partial \bar{o}_{k}^{(t)}}.

*The k is an index of an element of vectors. If you can calculate element-wise gradients, it is easy to understand that as differentiation of vectors and matrices.

Next you can calculate \delta \boldsymbol{c}^{(t)}, and chain rules are very important in this process. The flow of \delta \boldsymbol{c}^{(t)} to J can be roughly divided into two streams: the one which flows to J as \bodlsymbol{y}^{(t)}, and the one which flows to J as \bodlsymbol{c}^{(t+1)}. And the stream from \bodlsymbol{c}^{(t)} to \bodlsymbol{y}^{(t)} also have two branches: the one via \bar{\boldsymbol{o}}^{(t)} and the one which directly converges as  \bodlsymbol{y}^{(t)}. Just as well, the stream from \bodlsymbol{c}^{(t)} to \bodlsymbol{c}^{(t+1)} also have three branches: the ones via \bar{\boldsymbol{f}}^{(t)}, \bar{\boldsymbol{i}}^{(t)}, and the one which directly converges as \bodlsymbol{c}^{(t+1)}.

If you see see these flows as graphical a graphical model, that would be like the figure below.

According to this graphical model, you can calculate \delta \boldsymbol{c} ^{(t)} element-wise as below.

* TO BE VERY HONEST I still do not fully understand why we can apply chain rules like above to calculate \delta \boldsymbol{c}^{(t)}. When you apply the formula of chain rules, which I showed in the first section, to this case, you have to be careful of where to apply partial differential operators \frac{\partial}{ \partial c_{k}^{(t)}}. In the case above, in the part \frac{\partial y_{k}^{(t)}}{\partial c_{k}^{(t)}} the partial differential operator only affects tanh(c_{k}^{(t)}) of o_{k}^{(t)} \cdot tanh(c_{k}^{(t)}). And in the part \frac{\partial c_{k}^{(t+1)}}{\partial c_{k}^{(t)}}, the partial differential operator \frac{\partial}{\partial c_{k}^{(t)}} only affects the part c_{k}^{(t)} of the term c^{t}_{k} \cdot f_{k}^{(t+1)}. In the \frac{\partial \bar{o}_{k}^{(t)}}{\partial c_{k}^{(t)}} part, only (p_{out})_{k} \cdot c_{k}^{(t)},  in the \frac{\partial \bar{i}_{k}^{(t+1)}}{\partial c_{k}^{(t)}} part, only (p_{in})_{k} \cdot c_{k}^{(t)}, and in the \frac{\partial \bar{f}_{k}^{(t+1)}}{\partial c_{k}^{(t)}} part, only (p_{in})_{k} \cdot c_{k}^{(t)}. But some other parts, which are not affected by \frac{\partial}{ \partial c_{k}^{(t)}} are also functions of c_{k}^{(t)}. For example o_{k}^{(t)} of o_{k}^{(t)} \cdot tanh(c_{k}^{(t)}) is also a function of c_{k}^{(t)}. And I am still not sure about the logic behind where to affect those partial differential operators.

*But at least, these are the only decent equations for LSTM backprop which I could find, and a frequently cited paper on LSTM uses implementation based on these equations. Computer science is more of practical skills, rather than rigid mathematical logic. Also I think I have spent great deal of my time thinking about this part, and it is now time for me to move to next step. If you have any comments or advice on this point, please let me know.

Calculating \delta \bar{\boldsymbol{f}}^{(t)}, \delta \bar{\boldsymbol{i}}^{(t)}, \delta \bar{\boldsymbol{z}}^{(t)} are also relatively straigtforward as calculating \delta \bar{\boldsymbol{o}}^{(t)}. They all use the first type of chain rule in the first section. Thereby you can get these gradients: \delta \bar{f}_{k}^{(t)}=\frac{\partial J}{ \partial \bar{f}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{f}_{k}^{(t)}}, \delta \bar{i}_{k}^{(t)}=\frac{\partial J}{\partial \bar{i}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{i}_{k}^{(t)}}, and \delta \bar{z}_{k}^{(t)}=\frac{\partial J}{\partial \bar{z}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{i}_{k}^{(t)}}.

All the gradients which we have calculated use the error \delta \boldsymbol{y}^{(t)}, but when it comes to calculating \delta \boldsymbol{y}^{(t)}….. I can only say “Please be patient. I did my best in my PowerPoint slides to explain that.” It is not a kind of process which I want to explain on Word Press. In conclusion you get an error like this: \delta \boldsymbol{y}^{(t)}=\frac{\partial J^{(t)}}{\partial \boldsymbol{y}^{(t)}} + \boldsymbol{R}_{for}^{T} \delta \bar{\boldsymbol{f}}^{(t+1)} + \boldsymbol{R}_{in}^{T}\delta \bar{\boldsymbol{i}}^{(t+1)} + \boldsymbol{R}_{out}^{T}\delta \bar{\boldsymbol{o}}^{(t+1)} + \boldsymbol{R}_{z}^{T}\delta \bar{\boldsymbol{z}}^{(t+1)}, and the flows of \boldsymbol{y}^{(t)} are as blow.

Combining the gradients we have got so far, we can calculate gradients with respect to parameters. For concrete results, please check the Space Odyssey paper or my PowerPoint slide.

3. How LSTMs tackle exploding/vanishing gradients problems

*If you are allergic to mathematics, you should not read this section or even download my PowerPoint slide.

*Part of this section is more or less subjective, so if you really want to know how LSTM mitigate the problems, I highly recommend you to also refer to other materials. But at least I did my best for this article.

LSTMs do not completely solve, vanishing gradient problems. They mitigate vanishing/exploding gradient problems. I am going to roughly explain why they can tackle those problems. I think you find many explanations on that topic, but many of them seems to have some mathematical mistakes (even the slide used in a lecture in Stanford University) and I could not partly agree with some statements. I also could not find any papers or materials which show the whole picture of how LSTMs can tackle those problems. So in this article I am only going to give instructions on the major way to explain this topic.

First let’s see how gradients actually “vanish” or “explode” in simple RNNs. As I in the second article of this series, simple RNNs propagate forward as the equations below.

  • \boldsymbol{a}^{(t)} = \boldsymbol{b} + \boldsymbol{W} \cdot \boldsymbol{h}^{(t-1)} + \boldsymbol{U} \cdot \boldsymbol{x}^{(t)}
  • \boldsymbol{h}^{(t)}= g(\boldsymbol{a}^{(t)})
  • \boldsymbol{o}^{(t)} = \boldsymbol{c} + \boldsymbol{V} \cdot \boldsymbol{h}^{(t)}
  • \hat{\boldsymbol{y}} ^{(t)} = f(\boldsymbol{o}^{(t)})

And every time step, you get an error function J^{(t)}. Let’s consider the gradient of J^{(t)} with respect to \boldsymbol{h}^{(k)}, that is the error flowing from J^{(t)} to \boldsymbol{h}^{(k)}. This error is the most used to calculate gradients of the parameters in the equations above.

If you calculate this error more concretely, \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}} \cdots \frac{\partial \boldsymbol{h}^{(k+2)}}{\partial \boldsymbol{h}^{(k+1)}} \frac{\partial \boldsymbol{h}^{(k+1)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \prod_{k< s \leq t} \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}}, where \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}} = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{b} + \boldsymbol{W}\cdot \boldsymbol{h}^{(s-1)} + \boldsymbol{U}\cdot \boldsymbol{x}^{(s)})) = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})).

* If you see the figure as a type of graphical model, you should be able to understand the why chain rules can be applied as the equation above.

*According to this paper \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}}  = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})), but it seems that many study materials and web sites are mistaken in this point.

Hence \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \prod_{k< s \leq t} \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})) = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} (\boldsymbol{W} ^T )^{(t - k)} \prod_{k< s \leq t} diag(g'(\boldsymbol{a}^{(s)})). If you take norms of \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} you get an equality \left\lVert \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} \right\rVert \leq \left\lVert \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \right\rVert \left\lVert \boldsymbol{W} ^T \right\rVert ^{(t - k)} \prod_{k< s \leq t} \left\lVert diag(g'(\boldsymbol{a}^{(s)}))\right\rVert. I will not go into detail anymore, but it is known that according to this inequality, multiplication of weight vectors exponentially converge to 0 or to infinite number.

We have seen that the error \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} is the main factor causing vanishing/exploding gradient problems of simple RNNs. In case of LSTM, \frac{\partial J^{(t)}}{\partial \boldsymbol{c}^{(k)}} is an equivalent. For simplicity, let’s calculate only \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}, which is equivalent to \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}} of simple RNN backprop.

* Just as I noted above, you have to be careful of which part the partial differential operator \frac{\partial}{\partial \boldsymbol{c}^{(t-1)}} affects in the chain rule above. That is, you need to calculate \frac{\partial}{\partial \boldsymbol{c}^{(t-1)}} (\boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}), and the partial differential operator only affects \boldsymbol{c}^{(t-1)}. I think this is not a correct mathematical notation, but please forgive me for doing this for convenience.

If you continue calculating the equation above more concretely, you get the equation below.

I cannot mathematically explain why, but it is known that this characteristic of gradients of LSTM backprop mitigate the vanishing/exploding gradient problem. We have seen that if you take a norm of \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}}, that is equal or smaller than repeated multiplication of the norm of the same weight matrix, and that soon leads to vanishing/exploding gradient problem. But according to the equation above, even if you take a norm of repeatedly multiplied \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}, its norm cannot be evaluated with a simple value like repeated multiplication of the norm of the same weight matrix. The outputs of each gate are different from time steps to time steps, and that adjust the value of \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}.

*I personally guess the term diag(\boldsymbol{f}^{(t)}) is very effective. The unaffected value of the elements of \boldsymbol{f}^{(t)} can directly adjust the value of \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}. And as a matter of fact, it is known that performances of LSTM drop the most when you get rid of forget gates.

When it comes to tackling exploding gradient problems, there is a much easier technique called gradient clipping. This algorithm is very simple: you just have to adjust the size of gradient so that the absolute value of gradient is under a threshold at every time step. Imagine that you decide in which direction to move by calculating gradients, but when the footstep is going to be too big, you just adjust the size of footstep to the threshold size you have set. In a pseudo code, you can write a gradient clipping part only with some two line codes as below.

*\boldsymbol{g} is a gradient at the time step threshold is the maximum size of the “step.”

The figure below, cited from a deep learning text from MIT press textbook, is a good and straightforward explanation on gradient clipping.It is known that a strongly nonlinear function, such as error functions of RNN, can have very steep or plain areas. If you artificially visualize the idea in 3-dimensional space, as the surface of a loss function J with two variants w, b, that means the loss function J has plain areas and very steep cliffs like in the figure.Without gradient clipping, at the left side, you can see that the black dot all of a sudden climb the cliff and could jump to an unexpected area. But with gradient clipping, you avoid such “big jumps” on error functions.

Source: Source: Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning, (2016), MIT Press, 409p

 

I am glad that I have finally finished this article series. I am not sure how many of the readers would have read through all of the articles in this series, including my PowerPoint slides. I bet that is not so many. I spent a great deal of my time for making these contents, but sadly even when I was studying LSTM, it was becoming old-fashioned, at least in natural language processing (NLP) field: a very promising algorithm named Transformer has been replacing the position of LSTM. Deep learning is a very fast changing field. I also would like to make illustrative introductions on attention mechanism in NLP, from seq2seq model to Transformer. And I think LSTM would still remain as one of the algorithms in sequence data processing, such as hidden Hidden Markov model, or particle filter.

 

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Simple RNN

Understanding LSTM forward propagation in two ways

*This article is only for the sake of understanding the equations in the second page of the paper named “LSTM: A Search Space Odyssey”. If you have no trouble understanding the equations of LSTM forward propagation, I recommend you to skip this article and go the the next article.

*This article is the fourth article of “A gentle introduction to the tiresome part of understanding RNN.”

1. Preface

I  heard that in Western culture, smart people write textbooks so that other normal people can understand difficult stuff, and that is why textbooks in Western countries tend to be bulky, but also they are not so difficult as they look. On the other hand in Asian culture, smart people write puzzling texts on esoteric topics, and normal people have to struggle to understand what noble people wanted to say. Publishers also require the authors to keep the texts as short as possible, so even though the textbooks are thin, usually students have to repeat reading the textbooks several times because usually they are too abstract.

Both styles have cons and pros, and usually I prefer Japanese textbooks because they are concise, and sometimes it is annoying to read Western style long texts with concrete straightforward examples to reach one conclusion. But a problem is that when it comes to explaining LSTM, almost all the text books are like Asian style ones. Every study material seems to skip the proper steps necessary for “normal people” to understand its algorithms. But after actually making concrete slides on mathematics on LSTM, I understood why: if you write down all the equations on LSTM forward/back propagation, that is going to be massive, and actually I had to make 100-page PowerPoint animated slides to make it understandable to people like me.

I already had a feeling that “Does it help to understand only LSTM with this precision? I should do more practical codings.” For example François Chollet, the developer of Keras, in his book, said as below.

 

For me that sounds like “We have already implemented RNNs for you, so just shut up and use Tensorflow/Keras.” Indeed, I have never cared about the architecture of my Mac Book Air, but I just use it every day, so I think he is to the point. To make matters worse, for me, a promising algorithm called Transformer seems to be replacing the position of LSTM in natural language processing. But in this article series and in my PowerPoint slides, I tried to explain as much as possible, contrary to his advice.

But I think, or rather hope,  it is still meaningful to understand this 23-year-old algorithm, which is as old as me. I think LSTM did build a generation of algorithms for sequence data, and actually Sepp Hochreiter, the inventor of LSTM, has received Neural Network Pioneer Award 2021 for his work.

I hope those who study sequence data processing in the future would come to this article series, and study basics of RNN just as I also study classical machine learning algorithms.

 *In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

2. Why LSTM?

First of all, let’s take a brief look at what I said about the structures of RNNs,  in the first and the second article. A simple RNN is basically densely connected network with a few layers. But the RNN gets an input every time step, and it gives out an output at the time step. Part of information in the middle layer are succeeded to the next time step, and in the next time step, the RNN also gets an input and gives out an output. Therefore, virtually a simple RNN behaves almost the same way as densely connected layers with many layers during forward/back propagation if you focus on its recurrent connections.

That is why simple RNNs suffer from vanishing/exploding gradient problems, where the information exponentially vanishes or explodes when its gradients are multiplied many times through many layers during back propagation. To be exact, I think you need to consider this problem precisely like you can see in this paper. But for now, please at least keep it in mind that when you calculate a gradient of an error function with respect to parameters of simple neural networks, you have to multiply parameters many times like below, and this type of calculation usually leads to vanishing/exploding gradient problem.

LSTM was invented as a way to tackle such problems as I mentioned in the last article.

3. How to display LSTM

I would like you to just go to image search on Google, Bing, or Yahoo!, and type in “LSTM.” I think you will find many figures, but basically LSTM charts are roughly classified into two types: in this article I call them “Space Odyssey type” and “electronic circuit type”, and in conclusion, I highly recommend you to understand LSTM as the “electronic circuit type.”

*I just randomly came up with the terms “Space Odyssey type” and “electronic circuit type” because the former one is used in the paper I mentioned, and the latter one looks like an electronic circuit to me. You do not have to take how I call them seriously.

However, not that all the well-made explanations on LSTM use the “electronic circuit type,” and I am sure you sometimes have to understand LSTM as the “space odyssey type.” And the paper “LSTM: A Search Space Odyssey,” which I learned a lot about LSTM from,  also adopts the “Space Odyssey type.”

LSTM architectur visualization

The main reason why I recommend the “electronic circuit type” is that its behaviors look closer to that of simple RNNs, which you would have seen if you read my former articles.

*Behaviors of both of them look different, but of course they are doing the same things.

If you have some understanding on DCL, I think it was not so hard to understand how simple RNNs work because simple RNNs  are mainly composed of linear connections of neurons and weights, whose structures are the same almost everywhere. And basically they had only straightforward linear connections as you can see below.

But from now on, I would like you to give up the ideas that LSTM is composed of connections of neurons like the head image of this article series. If you do that, I think that would be chaotic and I do not want to make a figure of it on Power Point. In short, sooner or later you have to understand equations of LSTM.

4. Forward propagation of LSTM in “electronic circuit type”

*For further understanding of mathematics of LSTM forward/back propagation, I recommend you to download my slides.

The behaviors of an LSTM block is quite similar to that of a simple RNN block: an RNN block gets an input every time step and gets information from the RNN block of the last time step, via recurrent connections. And the block succeeds information to the next block.

Let’s look at the simplified architecture of  an LSTM block. First of all, you should keep it in mind that LSTM have two streams of information: the one going through all the gates, and the one going through cell connections, the “highway” of LSTM block. For simplicity, we will see the architecture of an LSTM block without peephole connections, the lines in blue. The flow of information through cell connections is relatively uninterrupted. This helps LSTMs to retain information for a long time.

In a LSTM block, the input and the output of the former time step separately go through sections named “gates”: input gate, forget gate, output gate, and block input. The outputs of the forget gate, the input gate, and the block input join the highway of cell connections to renew the value of the cell.

*The small two dots on the cell connections are the “on-ramp” of cell conection highway.

*You would see the terms “input gate,” “forget gate,” “output gate” almost everywhere, but how to call the “block gate” depends on textbooks.

Let’s look at the structure of an LSTM block a bit more concretely. An LSTM block at the time step (t) gets \boldsymbol{y}^{(t-1)}, the output at the last time step,  and \boldsymbol{c}^{(t-1)}, the information of the cell at the time step (t-1), via recurrent connections. The block at time step (t) gets the input \boldsymbol{x}^{(t)}, and it separately goes through each gate, together with \boldsymbol{y}^{(t-1)}. After some calculations and activation, each gate gives out an output. The outputs of the forget gate, the input gate, the block input, and the output gate are respectively \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)}. The outputs of the gates are mixed with \boldsymbol{c}^{(t-1)} and the LSTM block gives out an output \boldsymbol{y}^{(t)}, and gives \boldsymbol{y}^{(t)} and \boldsymbol{c}^{(t)} to the next LSTM block via recurrent connections.

You calculate \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} as below.

  • \boldsymbol{f}^{(t)}= \sigma(\boldsymbol{W}_{for} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \boldsymbol{y}^{(t-1)} +  \boldsymbol{b}_{for})
  • \boldsymbol{i}^{(t)}=\sigma(\boldsymbol{W}_{in} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{in})
  • \boldsymbol{z}^{(t)}=tanh(\boldsymbol{W}_z \boldsymbol{x}^{(t)} + \boldsymbol{R}_z \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_z)
  • \boldsymbol{o}^{(t)}=\sigma(\boldsymbol{W}_{out} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{out})

*You have to keep it in mind that the equations above do not include peephole connections, which I am going to show with blue lines in the end.

The equations above are quite straightforward if you understand forward propagation of simple neural networks. You add linear products of \boldsymbol{y}^{(t)} and \boldsymbol{c}^{(t)} with different weights in each gate. What makes LSTMs different from simple RNNs is how to mix the outputs of the gates with the cell connections. In order to explain that, I need to introduce a mathematical operator called Hadamard product, which you denote as \odot. This is a very simple operator. This operator produces an elementwise product of two vectors or matrices with identical shape.

With this Hadamar product operator, the renewed cell and the output are calculated as below.

  • \boldsymbol{c}^{(t)} = \boldsymbol{z}^{(t)}\odot \boldsymbol{i}^{(t)} + \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}
  • \boldsymbol{y}^{(t)} = \boldsymbol{o}^{(t)} \odot tanh(\boldsymbol{c}^{(t)})

The values of \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} are compressed into the range of [0, 1] or [-1, 1] with activation functions. You can see that the input gate and the block input give new information to the cell. The part \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)} means that the output of the forget gate “forgets” the cell of the last time step by multiplying the values from 0 to 1 elementwise. And the cell \boldsymbol{c}^{(t)} is activated with tanh() and the output of the output gate “suppress” the activated value of \boldsymbol{c}^{(t)}. In other words, the output gatedecides how much information to give out as an output of the LSTM block. The output of every gate depends on the input \boldsymbol{x}^{(t)}, and the recurrent connection \boldsymbol{y}^{(t-1)}. That means an LSTM block learns to forget the cell of the last time step, to renew the cell, and to suppress the output. To describe in an extreme manner, if all the outputs of every gate are always (1, 1, …1)^T, LSTMs forget nothing, retain information of inputs at every time step, and gives out everything. And  if all the outputs of every gate are always (0, 0, …0)^T, LSTMs forget everything, receive no inputs, and give out nothing.

This model has one problem: the outputs of each gate do not directly depend on the information in the cell. To solve this problem, some LSTM models introduce some flows of information from the cell to each gate, which are shown as lines in blue in the figure below.

LSTM inner architecture

LSTM models, for example the one with or without peephole connection, depend on the library you use, and the model I have showed is one of standard LSTM structure. However no matter how complicated structure of an LSTM block looks, you usually cover it with a black box as below and show its behavior in a very simplified way.

5. Space Odyssey type

I personally think there is no advantages of understanding how LSTMs work with this Space Odyssey type chart, but in several cases you would have to use this type of chart. So I will briefly explain how to look at that type of chart, based on understandings of LSTMs you have gained through this article.

In Space Odyssey type of LSTM chart, at the center is a cell. Electronic circuit type of chart, which shows the flow of information of the cell as an uninterrupted “highway” in an LSTM block. On the other hand, in a Spacey Odyssey type of chart, the information of the cell rotate at the center. And each gate gets the information of the cell through peephole connections,  \boldsymbol{x}^{(t)}, the input at the time step (t) , sand \boldsymbol{y}^{(t-1)}, the output at the last time step (t-1), which came through recurrent connections. In Space Odyssey type of chart, you can more clearly see that the information of the cell go to each gate through the peephole connections in blue. Each gate calculates its output.

Just as the charts you have seen, the dotted line denote the information from the past. First, the information of the cell at the time step (t-1) goes to the forget gate and get mixed with the output of the forget cell In this process the cell is partly “forgotten.” Next, the input gate and the block input are mixed to generate part of new value of the the cell at time step  (t). And the partly “forgotten” \boldsymbol{c}^{(t-1)} goes back to the center of the block and it is mixed with the output of the input gate and the block input. That is how \boldsymbol{c}^{(t)} is renewed. And the value of new cell flow to the top of the chart, being mixed with the output of the output gate. Or you can also say the information of new cell is “suppressed” with the output gate.

I have finished the first four articles of this article series, and finally I am gong to write about back propagation of LSTM in the next article. I have to say what I have written so far is all for the next article, and my long long Power Point slides.

 

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

[References]

[1] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber, “LSTM: A Search Space Odyssey,” (2017)

[2] Francois Chollet, Deep Learning with Python,(2018), Manning , pp. 202-204

[3] “Sepp Hochreiter receives IEEE CIS Neural Networks Pioneer Award 2021”, Institute of advanced research in artificial intelligence, (2020)
URL: https://www.iarai.ac.at/news/sepp-hochreiter-receives-ieee-cis-neural-networks-pioneer-award-2021/?fbclid=IwAR27cwT5MfCw4Tqzs3MX_W9eahYDcIFuoGymATDR1A-gbtVmDpb8ExfQ87A

[4] Oketani Takayuki, “Machine Learning Professional Series: Deep Learning,” (2015), pp. 120-125
岡谷貴之 著, 「機械学習プロフェッショナルシリーズ 深層学習」, (2015), pp. 120-125

[5] Harada Tatsuya, “Machine Learning Professional Series: Image Recognition,” (2017), pp. 252-257
原田達也 著, 「機械学習プロフェッショナルシリーズ 画像認識」, (2017), pp. 252-257

[6] “Understandable LSTM ~ With the Current Trends,” Qiita, (2015)
「わかるLSTM ~ 最近の動向と共に」, Qiita, (2015)
URL: https://qiita.com/t_Signull/items/21b82be280b46f467d1b

Simple RNN

A brief history of neural nets: everything you should know before learning LSTM

This series is not a college course or something on deep learning with strict deadlines for assignments, so let’s take a detour from practical stuff and take a brief look at the history of neural networks.

The history of neural networks is also a big topic, which could be so long that I had to prepare another article series. And usually I am supposed to begin such articles with something like “The term ‘AI’ was first used by John McCarthy in Dartmouth conference 1956…” but you can find many of such texts written by people with much more experiences in this field. Therefore I am going to write this article from my point of view, as an intern writing articles on RNN, as a movie buff, and as one of many Japanese men who spent a great deal of childhood with video games.

We are now in the third AI boom, and some researchers say this boom began in 2006. A professor in my university said there we are now in a kind of bubble economy in machine learning/data science industry, but people used to say “Stop daydreaming” to AI researchers. The second AI winter is partly due to vanishing/exploding gradient problem of deep learning. And LSTM was invented as one way to tackle such problems, in 1997.

1, First AI boom

In the first AI boom, I think people were literally “daydreaming.” Even though the applications of machine learning algorithms were limited to simple tasks like playing chess, checker, or searching route of 2d mazes, and sometimes this time is called GOFAI (Good Old Fashioned AI).

Source: https://www.youtube.com/watch?v=K-HfpsHPmvw&feature=youtu.be

Even today when someone use the term “AI” merely for tasks with neural networks, that amuses me because for me deep learning is just statistically and automatically training neural networks, which are capable of universal approximation, into some classifiers/regressors. Actually the algorithms behind that is quite impressive, but the structure of human brains is much more complicated. The hype of “AI” already started in this first AI boom. Let me take an example of machine translation in this video. In fact the research of machine translation already started in the early 1950s, and of  specific interest in the time was translation between English and Russian due to Cold War. In the first article of this series, I said one of the most famous applications of RNN is machine translation, such as Google Translation, DeepL. They are a type of machine translation called neural machine translation because they use neural networks, especially RNNs. Neural machine translation was an astonishing breakthrough around 2014 in machine translation field. The former major type of machine translation was statistical machine translation, based on statistical language models. And the machine translator in the first AI boom was rule base machine translators, which are more primitive than statistical ones.

Source: https://news.cornell.edu/stories/2019/09/professors-perceptron-paved-way-ai-60-years-too-soon

The most remarkable invention in this time was of course perceptron by Frank Rosenblatt. Some people say that this is the first neural network. Even though you can implement perceptron with a-few-line codes in Python, obviously they did not have Jupyter Notebook in those days. The perceptron was implemented as a huge instrument named Mark 1 Perceptron, and it was composed of randomly connected wires. I do not precisely know how it works, but it was a huge effort to implement even the most primitive type of neural networks. They needed to use a big lighting fixture to get a 20*20 pixel image using 20*20 array of cadmium sulphide photocells. The research by Rosenblatt, however, was criticized by Marvin Minsky in his book because perceptrons could only be used for linearly separable data. To make matters worse the criticism prevailed as that more general, multi-layer perceptrons were also not useful for linearly inseparable data (as I mentioned in the first article, multi-layer perceptrons, namely normal neural networks,  can be universal approximators, which have potentials to classify/regress various types of complex data). In case you do not know what “linearly separable” means, imagine that there are data plotted on a piece of paper. If an elementary school kid can draw a border line between two clusters of the data with a ruler and a pencil on the paper, the 2d data is “linearly separable”….

With big disappointments to the research on “electronic brains,” the budget of AI research was reduced and AI research entered its first winter.

Source: https://www.nzz.ch/digital/ehre-fuer-die-deep-learning-mafia-ld.1472761?reduced=true and https://anatomiesofintelligence.github.io/posts/2019-06-21-organization-mark-i-perceptron

I think  the frame problem (1969),  by John McCarthy and Patrick J. Hayes, is also an iconic theory in the end of the first AI boom. This theory is known as a story of creating a robot trying to pull out its battery on a wheeled wagon in a room. But there is also a time bomb on the wagon. The first prototype of the robot, named R1, naively tried to pull out the wagon form the room, and the bomb exploded. The problems was obvious: R1 was not programmed to consider the risks by taking each action, so the researchers made the next prototype named R1D1, which was programmed to consider the potential risks of taking each action. When R1D1 tried to pull out the wagon, it realized the risk of pulling the bomb together with the battery. But soon it started considering all the potential risks, such as the risk of the ceiling falling down, the distance between the wagon and all the walls, and so on, when the bomb exploded. The next problem was also obvious: R1D1 was not programmed to distinguish if the factors are relevant of irrelevant to the main purpose, and the next prototype R2D1 was programmed to do distinguish them. This time, R2D1 started thinking about “whether the factor is  irrelevant to the main purpose,” on every factor measured, and again the bomb exploded. How can we get a perfect AI, R2D2?

The situation of mentioned above is a bit extreme, but it is said AI could also get stuck when it try to take some super simple actions like finding a number in a phone book and make a phone call. It is difficult for an artificial intelligence to decide what is relevant and what is irrelevant, but humans will not get stuck with such simple stuff, and sometimes the frame problem is counted as the most difficult and essential problem of developing AI. But personally I think the original frame problem was unreasonable in that McCarthy, in his attempts to model the real world, was inflexible in his handling of the various equations involved, treating them all with equal weight regardless of the particular circumstances of a situation. Some people say that McCarthy, who was an advocate for AI, also wanted to see the field come to an end, due to its failure to meet the high expectations it once aroused.

Not only the frame problem, but also many other AI-related technological/philosophical problems have been proposed, such as Chinese room (1980), the symbol grounding problem (1990), and they are thought to be as hardships in inventing artificial intelligence, but I omit those topics in this article.

*The name R2D2 did not come from the famous story of frame problem. The story was Daniel Dennett first proposed the story of R2D2 in his paper published in 1984. Star Wars was first released in 1977. It is said that the name R2D2 came from “Reel 2, Dialogue 2,” which George Lucas said while film shooting. And the design of C3PO came from Maria in Metropolis(1927). It is said that the most famous AI duo in movie history was inspired by Tahei and Matashichi in The Hidden Fortress (1958), directed by Kurosawa Akira.

Source: https://criterioncollection.tumblr.com/post/135392444906/the-original-r2-d2-and-c-3po-the-hidden-fortress

Interestingly, in the end of the first AI boom, 2001: A Space Odyssey, directed by Stanley Kubrick, was released in 1968. Unlike conventional fantasylike AI characters, for example Maria in Metropolis (1927), HAL 9000 was portrayed as a very realistic AI, and the movie already pointed out the risk of AI being insane when it gets some commands from several users. HAL 9000 still has been a very iconic character in AI field. For example when you say some quotes from 2001: A Space Odyssey to Siri you get some parody responses. I also thin you should keep it in mind that in order to make an AI like HAL 9000 come true, for now RNNs would be indispensable in many ways: you would need RNNs for better voice recognition, better conversational system, and for reading lips.

Source: https://imgflip.com/memetemplate/34339860/Open-the-pod-bay-doors-Hal

*Just as you cannot understand Monty Python references in Python official tutorials without watching Monty Python and the Holy Grail, you cannot understand many parodies in AI contexts without watching 2001: A Space Odyssey. Even though the movie had some interview videos with some researchers and some narrations, Stanley Kubrick cut off all the footage and made the movie very difficult to understand. Most people did not or do not understand that it is a movie about aliens who gave homework of coming to Jupiter to human beings.

2, Second AI boom/winter

Source: Fukushima Kunihiko, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” (1980)

I am not going to write about the second AI boom in detail, but at least you should keep it in mind that convolutional neural network (CNN) is a keyword in this time. Neocognitron, an artificial model of how sight nerves perceive thing, was invented by Kunihiko Fukushima in 1980, and the model is said to be the origin on CNN. And Neocognitron got inspired by the Hubel and Wiesel’s research on sight nerves. In 1989, a group in AT & T Bell Laboratory led by Yann LeCun invented the first practical CNN to read handwritten digit.

Y. LeCun, “Backpropagation Applied to Handwritten Zip Code Recognition,” (1989)

Another turning point in this second AI boom was that back propagation algorithm was discovered, and the CNN by LeCun was also trained with back propagation. LeCun made a deep neural networks with some layers in 1998 for more practical uses.

But his research did not gain so much attention like today, because AI research entered its second winter at the beginning of the 1990s, and that was partly due to vanishing/exploding gradient problem of deep learning. People knew that neural networks had potentials of universal approximation, but when they tried to train naively stacked neural nets, the gradients, which you need for training neural networks, exponentially increased/decreased. Even though the CNN made by LeCun was the first successful case of “deep” neural nets which did not suffer from the vanishing/exploding gradient problem so much, deep learning research also stagnated in this time.

The ultimate goal of this article series is to understand LSTM at a more abstract/mathematical level because it is one of the practical RNNs, but the idea of LSTM (Long Short Term Memory) itself was already proposed in 1997 as an RNN algorithm to tackle vanishing gradient problem. (Exploding gradient problem is solved with a technique named gradient clipping, and this is easier than techniques for preventing vanishing gradient problems. I am also going to explain it in the next article.) After that some other techniques like introducing forget gate, peephole connections, were discovered, but basically it took some 20 years till LSTM got attentions like today. The reasons for that is lack of hardware and data sets, and that was also major reasons for the second AI winter.

Source: Sepp HochreiterJürgen, Schmidhuber, “Long Short-term Memory,” (1997)

In the 1990s, the mid of second AI winter, the Internet started prevailing for commercial uses. I think one of the iconic events in this time was the source codes WWW (World Wide Web) were announced in 1993. Some of you might still remember that you little by little became able to transmit more data online in this time. That means people came to get more and more access to various datasets in those days, which is indispensable for machine learning tasks.

After all, we could not get HAL 9000 by the end of 2001, but instead we got Xbox console.

3, Video game industry and GPU

Even though research on neural networks stagnated in the 1990s the same period witnessed an advance in the computation of massive parallel linear transformations, due to their need in fields such as image processing.

Computer graphics move or rotate in 3d spaces, and that is also linear transformations. When you think about a car moving in a city, it is convenient to place the car, buildings, and other objects on a fixed 3d space. But when you need to make computer graphics of scenes of the city from a view point inside the car, you put a moving origin point in the car and see the city. The spatial information of the city is calculated as vectors from the moving origin point. Of course this is also linear transformations. Of course I am not talking about a dot or simple figures moving in the 3d spaces. Computer graphics are composed of numerous plane panels, and each of them have at least three vertexes, and they move on 3d spaces. Depending on viewpoints, you need project the 3d graphics in 3d spaces on 2d spaces to display the graphics on devices. You need to calculate which part of the panel is projected to which pixel on the display, and that is called rasterization. Plus, in order to get photophotorealistic image, you need to think about how lights from light sources reflect on the panel and projected on the display. And you also have to put some textures on groups of panels. You might also need to change color spaces, which is also linear transformations.

My point is, in short, you really need to do numerous linear transformations in parallel in image processing.

When it comes to the use of CGI in movies,  two pioneer movies were released during this time: Jurassic Park in 1993, and Toy Story in 1995. It is famous that Pixar used to be one of the departments in ILM (Industrial Light and Magic), founded by George Lucas, and Steve Jobs bought the department. Even though the members in Pixar had not even made a long feature film in their lives, after trial and errors, they made the first CGI animated feature movie. On the other hand, in order to acquire funds for the production of Schindler’s List (1993), Steven Spielberg took on Jurassic Park (1993), consequently changing the history of CGI through this “side job.”

Source: http://renderstory.com/jurassic-park-23-years-later/

*I think you have realized that George Lucas is mentioned almost everywhere in this article. His influences on technologies are not only limited to image processing, but also sound measuring system, nonlinear editing system. Photoshop was also originally developed under his company. I need another article series for this topic, but maybe not in Data Science Blog.

Source: https://editorial.rottentomatoes.com/article/5-technical-breakthroughs-in-star-wars-that-changed-movies-forever/

Considering that the first wire-frame computer graphics made and displayed by computers appeared in the scene of displaying the wire frame structure of Death Star in a war room, in Star Wars: A New Hope, the development of CGI was already astonishing at this time. But I think deep learning owe its development more to video game industry.

*I said that the Death Star scene is the first use of graphics made and DISPLAYED by computers, because I have to say one of the first graphics in movie MADE by computer dates back to the legendary title sequence of Vertigo(1958).

When it comes to 3D video games the processing unit has to constantly deal with real time commands from controllers. It is famous that GPU was originally specifically designed for plotting computer graphics. Video game market is the biggest in entertainment industry in general, and it is said that the quality of computer graphics have the strongest correlation with video games sales, therefore enhancing this quality is a priority for the video game console manufacturers.

One good example to see how much video games developed is comparing original Final Fantasy 7 and the remake one. The original one was released in 1997, the same year as when LSTM was invented. And recently  the remake version of Final Fantasy 7 was finally released this year. The original one was also made with very big budget, and it was divided into three CD-ROMs. The original one was also very revolutionary given that the former ones of Final Fantasy franchise were all 2d video retro style video games. But still the computer graphics looks like polygons, and in almost all scenes the camera angle was fixed in the original one. On the other hand the remake one is very photorealistic and you can move the angle of the camera as you want while you play the video game.

There were also fierce battles by graphic processor manufacturers in computer video game market in the 1990s, but personally I think the release of Xbox console was a turning point in the development of GPU. To be concrete, Microsoft adopted a type of NV20 GPU for Xbox consoles, and that left some room of programmability for developers. The chief architect of NV20, which was released under the brand of GeForce3, said making major changes in the company’s graphic chips was very risky. But that decision opened up possibilities of uses of GPU beyond computer graphics.

Source: https://de.wikipedia.org/wiki/Nvidia-GeForce-3-Serie

I think that the idea of a programmable GPU provided other scientific fields with more visible benefits after CUDA was launched. And GPU gained its position not only in deep learning, but also many other fields including making super computers.

*When it comes to deep learning, even GPUs have strong rivals. TPU(Tensor Processing Unit) made by Google, is specialized for deep learning tasks, and have astonishing processing speed. And FPGA(Field Programmable Gate Array), which was originally invented customizable electronic circuit, proved to be efficient for reducing electricity consumption of deep learning tasks.

*I am not so sure about this GPU part. Processing unit, including GPU is another big topic, that is beyond my capacity to be honest.  I would appreciate it if you could share your view and some references to confirm your opinion, on the comment section or via email.

*If you are interested you should see this video of game fans’ reactions to the announcement of Final Fantasy 7. This is the industry which grew behind the development of deep learning, and many fields where you need parallel computations owe themselves to the nerds who spent a lot of money for video games, including me.

*But ironically the engineers who invented the GPU said they did not play video games simply because they were busy. If you try to study the technologies behind video games, you would not have much time playing them. That is the reality.

We have seen that the in this second AI winter, Internet and GPU laid foundation of the next AI boom. But still the last piece of the puzzle is missing: let’s look at the breakthrough which solved the vanishing /exploding gradient problem of deep learning in the next section.

4, Pretraining of deep belief networks: “The Dawn of Deep Learning”

Some researchers say the invention of pretraining of deep belief network by Geoffrey Hinton was a breakthrough which put an end to the last AI winter. Deep belief networks are different type of networks from the neural networks we have discussed, but their architectures are similar to those of the neural networks. And it was also unknown how to train deep belief nets when they have several layers. Hinton discovered that training the networks layer by layer in advance can tackle vanishing gradient problems. And later it was discovered that you can do pretraining neural networks layer by layer with autoencoders.

*Deep belief network is beyond the scope of this article series. I have to talk about generative models, Boltzmann machine, and some other topics.

The pretraining techniques of neural networks is not mainstream anymore. But I think it is very meaningful to know that major deep learning techniques such as using ReLU activation functions, optimization with Adam, dropout, batch normalization, came up as more effective algorithms for deep learning after the advent of the pretraining techniques, and now we are in the third AI boom.

In the next next article we are finally going to work on LSTM. Specifically, I am going to offer a clearer guide to a well-made paper on LSTM, named “LSTM: A Search Space Odyssey.”

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

[References]

[1] Taniguchi Tadahiro, “An Illustrated Guide to Artificial Intelligence”, (2010), Kodansha pp. 3-11
谷口忠大 著, 「イラストで学ぶ人工知能概論」, (2010), 講談社, pp. 3-11

[2] Francois Chollet, Deep Learning with Python,(2018), Manning , pp. 14-24

[3] Oketani Takayuki, “Machine Learning Professional Series: Deep Learning,” (2015), pp. 1-5, 151-156
岡谷貴之 著, 「機械学習プロフェッショナルシリーズ 深層学習」, (2015), pp. 1-5, 151-156

[4] Abigail See, Matthew Lamm, “Natural Language Processingwith Deep LearningCS224N/Ling284 Lecture 8:Machine Translation,Sequence-to-sequence and Attention,” (2020),
URL: http://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture08-nmt.pdf

[5]C. M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 192-196

[6] Daniel C. Dennett, “Cognitive Wheels: the Frame Problem of AI,” (1984), pp. 1-2

[7] Machiyama Tomohiro, “Understanding Cinemas of 1967-1979,” (2014), Yosensya, pp. 14-30
町山智浩 著, 「<映画の見方>が分かる本」,(2014), 洋泉社, pp. 14-30

[8] Harada Tatsuya, “Machine Learning Professional Series: Image Recognition,” (2017), pp. 156-157
原田達也 著, 「機械学習プロフェッショナルシリーズ 画像認識」, (2017), pp. 156-157

[9] Suyama Atsushi, “Machine Learning Professional Series: Bayesian Deep Learning,” (2019)岡谷貴之 須山敦志 著, 「機械学習プロフェッショナルシリーズ ベイズ深層学習」, (2019)

[10] “Understandable LSTM ~ With the Current Trends,” Qiita, (2015)
「わかるLSTM ~ 最近の動向と共に」, Qiita, (2015)
URL: https://qiita.com/t_Signull/items/21b82be280b46f467d1b

[11] Hisa Ando, “WEB+DB PRESS plus series: Technologies Supporting Processors – The World Endlessly Pursuing Speed,” (2017), Gijutsu-hyoron-sya, pp 313-317
Hisa Ando, 「WEB+DB PRESS plusシリーズ プロセッサを支える技術― 果てしなくスピードを追求する世界」, (2017), 技術評論社, pp. 313-317

[12] “Takahashi Yoshiki and Utamaru discuss George Lucas,” miyearnZZ Labo, (2016)
“高橋ヨシキと宇多丸 ジョージ・ルーカスを語る,” miyearnZZ Labo, (2016)
URL: https://miyearnzzlabo.com/archives/38865

[13] Katherine Bourzac, “Chip Hall of Fame: Nvidia NV20 The first configurable graphics processor opened the door to a machine-learning revolution,” IEEE SPECTRUM, (2018)
URL: https://spectrum.ieee.org/tech-history/silicon-revolution/chip-hall-of-fame-nvidia-nv20

Data Analytics and Mining for Dummies

Data Analytics and Mining is often perceived as an extremely tricky task cut out for Data Analysts and Data Scientists having a thorough knowledge encompassing several different domains such as mathematics, statistics, computer algorithms and programming. However, there are several tools available today that make it possible for novice programmers or people with no absolutely no algorithmic or programming expertise to carry out Data Analytics and Mining. One such tool which is very powerful and provides a graphical user interface and an assembly of nodes for ETL: Extraction, Transformation, Loading, for modeling, data analysis and visualization without, or with only slight programming is the KNIME Analytics Platform.

KNIME, or the Konstanz Information Miner, was developed by the University of Konstanz and is now popular with a large international community of developers. Initially KNIME was originally made for commercial use but now it is available as an open source software and has been used extensively in pharmaceutical research since 2006 and also a powerful data mining tool for the financial data sector. It is also frequently used in the Business Intelligence (BI) sector.

KNIME as a Data Mining Tool

KNIME is also one of the most well-organized tools which enables various methods of machine learning and data mining to be integrated. It is very effective when we are pre-processing data i.e. extracting, transforming, and loading data.

KNIME has a number of good features like quick deployment and scaling efficiency. It employs an assembly of nodes to pre-process data for analytics and visualization. It is also used for discovering patterns among large volumes of data and transforming data into more polished/actionable information.

Some Features of KNIME:

  • Free and open source
  • Graphical and logically designed
  • Very rich in analytics capabilities
  • No limitations on data size, memory usage, or functionalities
  • Compatible with Windows ,OS and Linux
  • Written in Java and edited with Eclipse.

A node is the smallest design unit in KNIME and each node serves a dedicated task. KNIME contains graphical, drag-drop nodes that require no coding. Nodes are connected with one’s output being another’s input, as a workflow. Therefore end-to-end pipelines can be built requiring no coding effort. This makes KNIME stand out, makes it user-friendly and make it accessible for dummies not from a computer science background.

KNIME workflow designed for graduate admission prediction

KNIME workflow designed for graduate admission prediction

KNIME has nodes to carry out Univariate Statistics, Multivariate Statistics, Data Mining, Time Series Analysis, Image Processing, Web Analytics, Text Mining, Network Analysis and Social Media Analysis. The KNIME node repository has a node for every functionality you can possibly think of and need while building a data mining model. One can execute different algorithms such as clustering and classification on a dataset and visualize the results inside the framework itself. It is a framework capable of giving insights on data and the phenomenon that the data represent.

Some commonly used KNIME node groups include:

  • Input-Output or I/O:  Nodes in this group retrieve data from or to write data to external files or data bases.
  • Data Manipulation: Used for data pre-processing tasks. Contains nodes to filter, group, pivot, bin, normalize, aggregate, join, sample, partition, etc.
  • Views: This set of nodes permit users to inspect data and analysis results using multiple views. This gives a means for truly interactive exploration of a data set.
  • Data Mining: In this group, there are nodes that implement certain algorithms (like K-means clustering, Decision Trees, etc.)

Comparison with other tools 

The first version of the KNIME Analytics Platform was released in 2006 whereas Weka and R studio were released in 1997 and 1993 respectively. KNIME is a proper data mining tool whereas Weka and R studio are Machine Learning tools which can also do data mining. KNIME integrates with Weka to add machine learning algorithms to the system. The R project adds statistical functionalities as well. Furthermore, KNIME’s range of functions is impressive, with more than 1,000 modules and ready-made application packages. The modules can be further expanded by additional commercial features.

Simple RNN

Simple RNN: the first foothold for understanding LSTM

*In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

In the last article, I mentioned “When it comes to the structure of RNN, many study materials try to avoid showing that RNNs are also connections of neurons, as well as DCL or CNN.” Even if you manage to understand DCL and CNN, you can be suddenly left behind once you try to understand RNN because it looks like a different field. In the second section of this article, I am going to provide a some helps for more abstract understandings of DCL/CNN , which you need when you read most other study materials.

My explanation on this simple RNN is based on a chapter in a textbook published by Massachusetts Institute of Technology, which is also recommended in some deep learning courses of Stanford University.

First of all, you should keep it in mind that simple RNN are not useful in many cases, mainly because of vanishing/exploding gradient problem, which I am going to explain in the next article. LSTM is one major type of RNN used for tackling those problems. But without clear understanding forward/back propagation of RNN, I think many people would get stuck when they try to understand how LSTM works, especially during its back propagation stage. If you have tried climbing the mountain of understanding LSTM, but found yourself having to retreat back to the foot, I suggest that you read through this article on simple RNNs. It should help you to gain a solid foothold, and you would be ready for trying to climb the mountain again.

*This article is the second article of “A gentle introduction to the tiresome part of understanding RNN.”

1, A brief review on back propagation of DCL.

Simple RNNs are straightforward applications of DCL, but if you do not even have any ideas on DCL forward/back propagation, you will not be able to understand this article. If you more or less understand how back propagation of DCL works, you can skip this first section.

Deep learning is a part of machine learning. And most importantly, whether it is classical machine learning or deep learning, adjusting parameters is what machine learning is all about. Parameters mean elements of functions except for variants. For example when you get a very simple function f(x)=a + bx + cx^2 + dx^3, then x is a variant, and a, b, c, d are parameters. In case of classical machine learning algorithms, the number of those parameters are very limited because they were originally designed manually. Such functions for classical machine learning is useful for features found by humans, after trial and errors(feature engineering is a field of finding such effective features, manually). You adjust those parameters based on how different the outputs(estimated outcome of classification/regression) are from supervising vectors(the data prepared to show ideal answers).

In the last article I said neural networks are just mappings, whose inputs are vectors, matrices, or sequence data. In case of DCLs, inputs are vectors. Then what’s the number of parameters ? The answer depends on the the number of neurons and layers. In the example of DCL at the right side, the number of the connections of the neurons is the number of parameters(Would you like to try to count them? At least I would say “No.”). Unlike classical machine learning you no longer need to do feature engineering, but instead you need to design networks effective for each task and adjust a lot of parameters.

*I think the hype of AI comes from the fact that neural networks find features automatically. But the reality is difficulty of feature engineering was just replaced by difficulty of designing proper neural networks.

It is easy to imagine that you need an efficient way to adjust those parameters, and the method is called back propagation (or just backprop). As long as it is about DCL backprop, you can find a lot of well-made study materials on that, so I am not going to cover that topic precisely in this article series. Simply putting, during back propagation, in order to adjust parameters of a layer you need errors in the next layer. And in order calculate the errors of the next layer, you need errors in the next next layer.

*You should not think too much about what the “errors” exactly mean. Such “errors” are defined in this context, and you will see why you need them if you actually write down all the mathematical equations behind backprops of DCL.

The red arrows in the figure shows how errors of all the neurons in a layer propagate backward to a neuron in last layer. The figure shows only some sets of such errors propagating backward, but in practice you have to think about all the combinations of such red arrows in the whole back propagation(this link would give you some ideas on how DCLs work).

These points are minimum prerequisites for continuing reading this  RNN this article. But if you are planning to understand RNN forward/back propagation at  an abstract/mathematical level that you can read academic papers,  I highly recommend you to actually write down all the equations of DCL backprop. And if possible you should try to implement backprop of three-layer DCL.

2, Forward propagation of simple RNN

*For better understandings of the second and third section, I recommend you to download an animated PowerPoint slide which I prepared. It should help you understand simple RNNs.

In fact the simple RNN which we are going to look at in this article has only three layers. From now on imagine that inputs of RNN come from the bottom and outputs go up. But RNNs have to keep information of earlier times steps during upcoming several time steps because as I mentioned in the last article RNNs are used for sequence data, the order of whose elements is important. In order to do that, information of the neurons in the middle layer of RNN propagate forward to the middle layer itself. Therefore in one time step of forward propagation of RNN, the input at the time step propagates forward as normal DCL, and the RNN gives out an output at the time step. And information of one neuron in the middle layer propagate forward to the other neurons like yellow arrows in the figure. And the information in the next neuron propagate forward to the other neurons, and this process is repeated. This is called recurrent connections of RNN.

*To be exact we are just looking at a type of recurrent connections. For example Elman RNNs have simpler recurrent connections. And recurrent connections of LSTM are more complicated.

Whether it is a simple one or not, basically RNN repeats this process of getting an input at every time step, giving out an output, and making recurrent connections to the RNN itself. But you need to keep the values of activated neurons at every time step, so virtually you need to consider the same RNNs duplicated for several time steps like the figure below. This is the idea of unfolding RNN. Depending on contexts, the whole unfolded DCLs with recurrent connections is also called an RNN.

In many situations, RNNs are simplified as below. If you have read through this article until this point, I bet you gained some better understanding of RNNs, so you should little by little get used to this more abstract, blackboxed  way of showing RNN.

You have seen that you can unfold an RNN, per time step. From now on I am going to show the simple RNN in a simpler way,  based on the MIT textbook which I recomment. The figure below shows how RNN propagate forward during two time steps (t-1), (t).

The input \boldsymbol{x}^{(t-1)}at time step(t-1) propagate forward as a normal DCL, and gives out the output \hat{\boldsymbol{y}} ^{(t)} (The notation on the \boldsymbol{y} ^{(t)} is called “hat,” and it means that the value is an estimated value. Whatever machine learning tasks you work on, the outputs of the functions are just estimations of ideal outcomes. You need to adjust parameters for better estimations. You should always be careful whether it is an actual value or an estimated value in the context of machine learning or statistics). But the most important parts are the middle layers.

*To be exact I should have drawn the middle layers as connections of two layers of neurons like the figure at the right side. But I made my figure closer to the chart in the MIT textbook, and also most other study materials show the combinations of the two neurons before/after activation as one neuron.

\boldsymbol{a}^{(t)} is just linear summations of \boldsymbol{x}^{(t)} (If you do not know what “linear summations” mean, please scroll this page a bit), and \boldsymbol{h}^{(t)} is a combination of activated values of \boldsymbol{a}^{(t)} and linear summations of \boldsymbol{h}^{(t-1)} from the last time step, with recurrent connections. The values of \boldsymbol{h}^{(t)} propagate forward in two ways. One is normal DCL forward propagation to \hat{\boldsymbol{y}} ^{(t)} and \boldsymbol{o}^{(t)}, and the other is recurrent connections to \boldsymbol{h}^{(t+1)} .

These are equations for each step of forward propagation.

  • \boldsymbol{a}^{(t)} = \boldsymbol{b} + \boldsymbol{W} \cdot \boldsymbol{h}^{(t-1)} + \boldsymbol{U} \cdot \boldsymbol{x}^{(t)}
  • \boldsymbol{h}^{(t)}= g(\boldsymbol{a}^{(t)})
  • \boldsymbol{o}^{(t)} = \boldsymbol{c} + \boldsymbol{V} \cdot \boldsymbol{h}^{(t)}
  • \hat{\boldsymbol{y}} ^{(t)} = f(\boldsymbol{o}^{(t)})

*Please forgive me for adding some mathematical equations on this article even though I pledged not to in the first article. You can skip the them, but for some people it is on the contrary more confusing if there are no equations. In case you are allergic to mathematics, I prescribed some treatments below.

*Linear summation is a type of weighted summation of some elements. Concretely, when you have a vector \boldsymbol{x}=(x_0, x_1, x_2), and weights \boldsymbol{w}=(w_0,w_1, w_2), then \boldsymbol{w}^T \cdot \boldsymbol{x} = w_0 \cdot x_0 + w_1 \cdot x_1 +w_2 \cdot x_2 is a linear summation of \boldsymbol{x}, and its weights are \boldsymbol{w}.

*When you see a product of a matrix and a vector, for example a product of \boldsymbol{W} and \boldsymbol{v}, you should clearly make an image of connections between two layers of a neural network. You can also say each element of \boldsymbol{u}} is a linear summations all the elements of \boldsymbol{v}} , and \boldsymbol{W} gives the weights for the summations.

A very important point is that you share the same parameters, in this case \boldsymbol{\theta \in \{\boldsymbol{U}, \boldsymbol{W}, \boldsymbol{b}, \boldsymbol{V}, \boldsymbol{c} \}}, at every time step. 

And you are likely to see this RNN in this blackboxed form.

3, The steps of back propagation of simple RNN

In the last article, I said “I have to say backprop of RNN, especially LSTM (a useful and mainstream type or RNN), is a monster of chain rules.” I did my best to make my PowerPoint on LSTM backprop straightforward. But looking at it again, the LSTM backprop part still looks like an electronic circuit, and it requires some patience from you to understand it. If you want to understand LSTM at a more mathematical level, understanding the flow of simple RNN backprop is indispensable, so I would like you to be patient while understanding this step (and you have to be even more patient while understanding LSTM backprop).

This might be a matter of my literacy, but explanations on RNN backprop are very frustrating for me in the points below.

  • Most explanations just show how to calculate gradients at each time step.
  • Most study materials are visually very poor.
  • Most explanations just emphasize that “errors are back propagating through time,” using tons of arrows, but they lack concrete instructions on how actually you renew parameters with those errors.

If you can relate to the feelings I mentioned above, the instructions from now on could somewhat help you. And with the animated PowerPoint slide I prepared, you would have clear understandings on this topic at a more mathematical level.

Backprop of RNN , as long as you are thinking about simple RNNs, is not so different from that of DCLs. But you have to be careful about the meaning of errors in the context of RNN backprop. Back propagation through time (BPTT) is one of the major methods for RNN backprop, and I am sure most textbooks explain BPTT. But most study materials just emphasize that you need errors from all the time steps, and I think that is very misleading and confusing.

You need all the gradients to adjust parameters, but you do not necessarily need all the errors to calculate those gradients. Gradients in the context of machine learning mean partial derivatives of error functions (in this case J) with respect to certain parameters, and mathematically a gradient of J with respect to \boldsymbol{\theta \in \{\boldsymbol{U}, \boldsymbol{W}, \boldsymbol{b}^{(t)}, \boldsymbol{V}, \boldsymbol{c} \}}is denoted as ( \frac{\partial J}{\partial \boldsymbol{\theta}}  ). And another confusing point in many textbooks, including the MIT one, is that they give an impression that parameters depend on time steps. For example some study materials use notations like \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}, and I think this gives an impression that this is a gradient with respect to the parameters at time step (t). In my opinion this gradient rather should be written as ( \frac{\partial J}{\partial \boldsymbol{\theta}} )^{(t)} . But many study materials denote gradients of those errors in the former way, so from now on let me use the notations which you can see in the figures in this article.

In order to calculate the gradient \frac{\partial J}{\partial \boldsymbol{x}^{(t)}} you need errors from time steps s (s \geq t) \quad (as you can see in the figure, in order to calculate a gradient in a colored frame, you need all the errors in the same color).

*To be exact, in the figure above I am supposed prepare much more arrows in \tau + 1 different colors  to show the whole process of RNN backprop, but that is not realistic. In the figure I displayed only the flows of errors necessary for calculating each gradient at time step 0, t, \tau.

*Another confusing point is that the \frac{\partial J}{\partial \boldsymbol{\ast ^{(t)}}}, \boldsymbol{\ast} \in \{\boldsymbol{a}^{(t)}, \boldsymbol{h}^{(t)}, \boldsymbol{o}^{(t)}, \dots \} are correct notations, because \boldsymbol{\ast} are values of neurons after forward propagation. They depend on time steps, and these are very values which I have been calling “errors.” That is why parameters do not depend on time steps, whereas errors depend on time steps.

As I mentioned before, you share the same parameters at every time step. Again, please do not assume that parameters are different from time step to time step. It is gradients/errors (you need errors to calculate gradients) which depend on time step. And after calculating errors at every time step, you can finally adjust parameters one time, and that’s why this is called “back propagation through time.” (It is easy to imagine that this method can be very inefficient. If the input is the whole text on a Wikipedia link, you need to input all the sentences in the Wikipedia text to renew parameters one time. To solve this problem there is a backprop method named “truncated BPTT,” with which you renew parameters based on a part of a text. )

And after calculating those gradients \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} you can take a summation of them: \frac{\partial J}{\partial \boldsymbol{\theta}}=\sum_{t=0}^{t=\tau}{\frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}}. With this gradient \frac{\partial J}{\partial \boldsymbol{\theta}} , you can finally renew the value of \boldsymbol{\theta} one time.

At the beginning of this article I mentioned that simple RNNs are no longer for practical uses, and that comes from exploding/vanishing problem of RNN. This problem was one of the reasons for the AI winter which lasted for some 20 years. In the next article I am going to write about LSTM, a fancier type of RNN, in the context of a history of neural network history.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Simple RNN

Prerequisites for understanding RNN at a more mathematical level

Writing the A gentle introduction to the tiresome part of understanding RNN Article Series on recurrent neural network (RNN) is nothing like a creative or ingenious idea. It is quite an ordinary topic. But still I am going to write my own new article on this ordinary topic because I have been frustrated by lack of sufficient explanations on RNN for slow learners like me.

I think many of readers of articles on this website at least know that RNN is a type of neural network used for AI tasks, such as time series prediction, machine translation, and voice recognition. But if you do not understand how RNNs work, especially during its back propagation, this blog series is for you.

After reading this articles series, I think you will be able to understand RNN in more mathematical and abstract ways. But in case some of the readers are allergic or intolerant to mathematics, I tried to use as little mathematics as possible.

Ideal prerequisite knowledge:

  • Some understanding on densely connected layers (or fully connected layers, multilayer perception) and how their forward/back propagation work.
  •  Some understanding on structure of Convolutional Neural Network.

*In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

1, Difficulty of Understanding RNN

I bet a part of difficulty of understanding RNN comes from the variety of its structures. If you search “recurrent neural network” on Google Image or something, you will see what I mean. But that cannot be helped because RNN enables a variety of tasks.

Another major difficulty of understanding RNN is understanding its back propagation algorithm. I think some of you found it hard to understand chain rules in calculating back propagation of densely connected layers, where you have to make the most of linear algebra. And I have to say backprop of RNN, especially LSTM, is a monster of chain rules. I am planing to upload not only a blog post on RNN backprop, but also a presentation slides with animations to make it more understandable, in some external links.

In order to avoid such confusions, I am going to introduce a very simplified type of RNN, which I call a “simple RNN.” The RNN displayed as the head image of this article is a simple RNN.

2, How Neurons are Connected

How to connect neurons and how to activate them is what neural networks are all about. Structures of those neurons are easy to grasp as long as that is about DCL or CNN. But when it comes to the structure of RNN, many study materials try to avoid showing that RNNs are also connections of neurons, as well as DCL or CNN(*If you are not sure how neurons are connected in CNN, this link should be helpful. Draw a random digit in the square at the corner.). In fact the structure of RNN is also the same, and as long as it is a simple RNN, and it is not hard to visualize its structure.

Even though RNN is also connections of neurons, usually most RNN charts are simplified, using blackboxes. In case of simple RNN, most study material would display it as the chart below.

But that also cannot be helped because fancier RNN have more complicated connections of neurons, and there are no longer advantages of displaying RNN as connections of neurons, and you would need to understand RNN in more abstract way, I mean, as you see in most of textbooks.

I am going to explain details of simple RNN in the next article of this series.

3, Neural Networks as Mappings

If you still think that neural networks are something like magical spider webs or models of brain tissues, forget that. They are just ordinary mappings.

If you have been allergic to mathematics in your life, you might have never heard of the word “mapping.” If so, at least please keep it in mind that the equation y=f(x), which most people would have seen in compulsory education, is a part of mapping. If you get a value x, you get a value y corresponding to the x.

But in case of deep learning, x is a vector or a tensor, and it is denoted in bold like \boldsymbol{x} . If you have never studied linear algebra , imagine that a vector is a column of Excel data (only one column), a matrix is a sheet of Excel data (with some rows and columns), and a tensor is some sheets of Excel data (each sheet does not necessarily contain only one column.)

CNNs are mainly used for image processing, so their inputs are usually image data. Image data are in many cases (3, hight, width) tensors because usually an image has red, blue, green channels, and the image in each channel can be expressed as a height*width matrix (the “height” and the “width” are number of pixels, so they are discrete numbers).

The convolutional part of CNN (which I call “feature extraction part”) maps the tensors to a vector, and the last part is usually DCL, which works as classifier/regressor. At the end of the feature extraction part, you get a vector. I call it a “semantic vector” because the vector has information of “meaning” of the input image. In this link you can see maps of pictures plotted depending on the semantic vector. You can see that even if the pictures are not necessarily close pixelwise, they are close in terms of the “meanings” of the images.

In the example of a dog/cat classifier introduced by François Chollet, the developer of Keras, the CNN maps (3, 150, 150) tensors to 2-dimensional vectors, (1, 0) or (0, 1) for (dog, cat).

Wrapping up the points above, at least you should keep two points in mind: first, DCL is a classifier or a regressor, and CNN is a feature extractor used for image processing. And another important thing is, feature extraction parts of CNNs map images to vectors which are more related to the “meaning” of the image.

Importantly, I would like you to understand RNN this way. An RNN is also just a mapping.

*I recommend you to at least take a look at the beautiful pictures in this link. These pictures give you some insight into how CNN perceive images.

4, Problems of DCL and CNN, and needs for RNN

Taking an example of RNN task should be helpful for this topic. Probably machine translation is the most famous application of RNN, and it is also a good example of showing why DCL and CNN are not proper for some tasks. Its algorithms is out of the scope of this article series, but it would give you a good insight of some features of RNN. I prepared three sentences in German, English, and Japanese, which have the same meaning. Assume that each sentence is divided into some parts as shown below and that each vector corresponds to each part. In machine translation we want to convert a set of the vectors into another set of vectors.

Then let’s see why DCL and CNN are not proper for such task.

  • The input size is fixed: In case of the dog/cat classifier I have mentioned, even though the sizes of the input images varies, they were first molded into (3, 150, 150) tensors. But in machine translation, usually the length of the input is supposed to be flexible.
  • The order of inputs does not mater: In case of the dog/cat classifier the last section, even if the input is “cat,” “cat,” “dog” or “dog,” “cat,” “cat” there’s no difference. And in case of DCL, the network is symmetric, so even if you shuffle inputs, as long as you shuffle all of the input data in the same way, the DCL give out the same outcome . And if you have learned at least one foreign language, it is easy to imagine that the orders of vectors in sequence data matter in machine translation.

*It is said English language has phrase structure grammar, on the other hand Japanese language has dependency grammar. In English, the orders of words are important, but in Japanese as long as the particles and conjugations are correct, the orders of words are very flexible. In my impression, German grammar is between them. As long as you put the verb at the second position and the cases of the words are correct, the orders are also relatively flexible.

5, Sequence Data

We can say DCL and CNN are not useful when you want to process sequence data. Sequence data are a type of data which are lists of vectors. And importantly, the orders of the vectors matter. The number of vectors in sequence data is usually called time steps. A simple example of sequence data is meteorological data measured at a spot every ten minutes, for instance temperature, air pressure, wind velocity, humidity. In this case the data is recorded as 4-dimensional vector every ten minutes.

But this “time step” does not necessarily mean “time.” In case of natural language processing (including machine translation), which you I mentioned in the last section, the numberings of each vector denoting each part of sentences are “time steps.”

And RNNs are mappings from a sequence data to another sequence data.

In case of the machine translation above, the each sentence in German, English, and German is expressed as sequence data \boldsymbol{G}=(\boldsymbol{g}_1,\dots ,\boldsymbol{g}_{12}), \boldsymbol{E}=(\boldsymbol{e}_1,\dots ,\boldsymbol{e}_{11}), \boldsymbol{J}=(\boldsymbol{j}_1,\dots ,\boldsymbol{j}_{14}), and machine translation is nothing but mappings between these sequence data.

 

*At least I found a paper on the RNN’s capability of universal approximation on many-to-one RNN task. But I have not found any papers on universal approximation of many-to-many RNN tasks. Please let me know if you find any clue on whether such approximation is possible. I am desperate to know that. 

6, Types of RNN Tasks

RNN tasks can be classified into some types depending on the lengths of input/output sequences (the “length” means the times steps of input/output sequence data).

If you want to predict the temperature in 24 hours, based on several time series data points in the last 96 hours, the task is many-to-one. If you sample data every ten minutes, the input size is 96*6=574 (the input data is a list of 574 vectors), and the output size is 1 (which is a value of temperature). Another example of many-to-one task is sentiment classification. If you want to judge whether a post on SNS is positive or negative, the input size is very flexible (the length of the post varies.) But the output size is one, which is (1, 0) or (0, 1), which denotes (positive, negative).

*The charts in this section are simplified model of RNN used for each task. Please keep it in mind that they are not 100% correct, but I tried to make them as exact as possible compared to those in other study materials.

Music/text generation can be one-to-many tasks. If you give the first sound/word you can generate a phrase.

Next, let’s look at many-to-many tasks. Machine translation and voice recognition are likely to be major examples of many-to-many tasks, but here name entity recognition seems to be a proper choice. Name entity recognition is task of finding proper noun in a sentence . For example if you got two sentences “He said, ‘Teddy bears on sale!’ ” and ‘He said, “Teddy Roosevelt was a great president!” ‘ judging whether the “Teddy” is a proper noun or a normal noun is name entity recognition.

Machine translation and voice recognition, which are more popular, are also many-to-many tasks, but they use more sophisticated models. In case of machine translation, the inputs are sentences in the original language, and the outputs are sentences in another language. When it comes to voice recognition, the input is data of air pressure at several time steps, and the output is the recognized word or sentence. Again, these are out of the scope of this article but I would like to introduce the models briefly.

Machine translation uses a type of RNN named sequence-to-sequence model (which is often called seq2seq model). This model is also very important for other natural language processes tasks in general, such as text summarization. A seq2seq model is divided into the encoder part and the decoder part. The encoder gives out a hidden state vector and it used as the input of the decoder part. And decoder part generates texts, using the output of the last time step as the input of next time step.

Voice recognition is also a famous application of RNN, but it also needs a special type of RNN.

*To be honest, I don’t know what is the state-of-the-art voice recognition algorithm. The example in this article is a combination of RNN and a collapsing function made using Connectionist Temporal Classification (CTC). In this model, the output of RNN is much longer than the recorded words or sentences, so a collapsing function reduces the output into next output with normal length.

You might have noticed that RNNs in the charts above are connected in both directions. Depending on the RNN tasks you need such bidirectional RNNs.  I think it is also easy to imagine that such networks are necessary. Again, machine translation is a good example.

And interestingly, image captioning, which enables a computer to describe a picture, is one-to-many-task. As the output is a sentence, it is easy to imagine that the output is “many.” If it is a one-to-many task, the input is supposed to be a vector.

Where does the input come from? I mentioned that the last some layers in of CNN are closely connected to how CNNs extract meanings of pictures. Surprisingly such vectors, which I call a “semantic vectors” is the inputs of image captioning task (after some transformations, depending on the network models).

I think this articles includes major things you need to know as prerequisites when you want to understand RNN at more mathematical level. In the next article, I would like to explain the structure of a simple RNN, and how it forward propagate.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Simple RNN

A gentle introduction to the tiresome part of understanding RNN

Just as a normal conversation in a random pub or bar in Berlin, people often ask me “Which language do you use?” I always answer “LaTeX and PowerPoint.”

I have been doing an internship at DATANOMIQ and trying to make straightforward but precise study materials on deep learning. I myself started learning machine learning in April of 2019, and I have been self-studying during this one-year-vacation of mine in Berlin.

Many study materials give good explanations on densely connected layers or convolutional neural networks (CNNs). But when it comes to back propagation of CNN and recurrent neural networks (RNNs), I think there’s much room for improvement to make the topic understandable to learners.

Many study materials avoid the points I want to understand, and that was as frustrating to me as listening to answers to questions in the Japanese Diet, or listening to speeches from the current Japanese minister of the environment. With the slightest common sense, you would always get the feeling “How?” after reading an RNN chapter in any book.

This blog series focuses on the introductory level of recurrent neural networks. By “introductory”, I mean prerequisites for a better and more mathematical understanding of RNN algorithms.

I am going to keep these posts as visual as possible, avoiding equations, but I am also going to attach some links to check more precise mathematical explanations.

This blog series is composed of five contents.:

  1. Prerequisites for understanding RNN at a more mathematical level
  2. Simple RNN: the first foothold for understanding LSTM
  3. A brief history of neural nets: everything you should know before learning LSTM
  4. Understanding LSTM forward propagation in two ways
  5. LSTM back propagation: following the flows of variables