On the difficulty of language: prerequisites for NLP with deep learning

This is the first article of my article series “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

1 Preface

This section is virtually just my essay on language. You can skip this if you want to get down on more technical topic.

As I do not study in natural language processing (NLP) field, I would not be able to provide that deep insight into this fast changing deep leaning field throughout my article series. However at least I do understand language is a difficult and profound field, not only in engineering but also in many other study fields. Some people might be feeling that technologies are eliminating languages, or one’s motivations to understand other cultures. First of all, I would like you to keep it in mind that I am not a geek who is trying to turn this multilingual world into a homogeneous one and rebuild Tower of Babel, with deep learning. I would say I am more keen on social or anthropological sides of language.

I think you would think more about languages if you have mastered at least one foreign language. As my mother tongue is Japanese, which is totally different from many other Western languages in terms of characters and ambiguity, I understand translating is not what learning a language is all about. Each language has unique characteristics, and I believe they more or less influence one’s personalities. For example, many Western languages make the verb, I mean the conclusion, of sentences clear in the beginning part of the sentences. That is also true of Chinese, I heard. However in Japanese, the conclusion comes at the end, so that is likely to give an impression that Japanese people are being obscure or indecisive. Also, Japanese sentences usually omit their subjects. In German as well, the conclusion of a sentences tend to come at the end, but I am almost 100% sure that no Japanese people would feel German people make things unclear. I think that comes from the structures of German language, which tends to make the number, verb, relations of words crystal clear.


Let’s take an example to see how obscure Japanese is. A Japanese sentence 「頭が赤い魚を食べる猫」can be interpreted in five ways, depending on where you put emphases on.

Common sense tells you that the sentence is likely to mean the first two cases, but I am sure they can mean those five possibilities. There might be similarly obscure sentences in other languages, but I bet few languages can be as obscure as Japanese. Also as you can see from the last two sentences, you can omit subjects in Japanese. This rule is nothing exceptional. Japanese people usually don’t use subjects in normal conversations. And when you read classical Japanese, which Japanese high school students have to do just like Western students learn some of classical Latin, the writings omit subjects much more frequently.

*However interestingly we have rich vocabulary of subjects. The subject “I” can be translated to 「私」、「僕」、「俺」、「自分」、「うち」etc, depending on your personality, who you are talking to, and the time when it is written in.

I believe one can see the world only in the framework of their language, and it seems one’s personality changes depending on the language they use. I am not sure whether the language originally determines how they think, or how they think forms the language. But at least I would like you to keep it in mind that if you translate a conversation, for example a random conversation at a bar in Berlin, into Japanese, that would linguistically sound Japanese, but not anthropologically. Imagine that such kind of random conversation in Berlin or something is like playing a catch, I mean throwing a ball named “your opinion.” On the other hand,  normal conversations of Japanese people are in stead more of, I would say,  “resonance” of several tuning forks. They do their bests to show that they are listening to each other, by excessively nodding or just repeating “Really?”, but usually it seems hardly any constructive dialogues have been made.

*I sometimes feel you do not even need deep learning to simulate most of such Japanese conversations. Several-line Python codes would be enough.

My point is, this article series is mainly going to cover only a few techniques of NLP in deep learning field: sequence to sequence model (seq2seq model) , and especially Transformer. They are, at least for now, just mathematical models and mappings of a small part of this profound field of language (as far as I can cover in this article series). But still, examples of language would definitely help you understand Transformer model in the long run.

2 Tokens and word embedding

*Throughout my article series, “words” just means the normal words you use in daily life. “Tokens” means more general unit of NLP tasks. For example the word “Transformer” might be denoted as a single token “Transformer,” or maybe as a combination of two tokens “Trans” and “former.”

One challenging part of handling language data is its encodings. If you started learning programming in a language other than English, you would have encountered some troubles of using keyboards with different arrangements or with characters. Some comments on your codes in your native languages are sometimes not readable on some software. You can easily get away with that by using only English, but when it comes to NLP you have to deal with this difficulty seriously. How to encode characters in each language should be a first obstacle of NLP. In this article we are going to rely on a library named BPEmb, which provides word embedding in various languages, and you do not have to care so much about encodings in languages all over the world with this library.

In the first section, you might have noticed that Japanese sentence is not separated with spaces like Western languages. This is also true of Chinese language, and that means we need additional tasks of separating those sentences at least into proper chunks of words. This is not only a matter of engineering, but also of some linguistic fields. Also I think many people are not so conscious of how sentences in their native languages are grammatically separated.

The next point is, unlike other scientific data, such as temperature, velocity, voltage, or air pressure, language itself is not measured as numerical data. Thus in order to process language, including English, you first have to map language to certain numerical data, and after some processes you need to conversely map the output numerical data into language data. This section is going to be mainly about one-hot encoding and word embedding, the ways to convert word/token into numerical data. You might already have heard about this

You might have learnt about word embedding to some extent, but I hope you could get richer insight into this topic through this article.

2.1 One-hot encoding

One-hot encoding would be the most straightforward way to encode words/tokens. Assume that you have a dictionary whose size is |\mathcal{V}|, and it includes words from “a”, “ablation”, “actually” to “zombie”, “?”, “!”

In a mathematical manner, in order to choose a word out of those |\mathcal{V}| words, all you need is a |\mathcal{V}| dimensional vector, one of whose elements is 1, and the others are 0. When you want to choose the No. i word, which is “indeed” in the example below, its corresponding one-hot vector is \boldsymbol{v} = (0, \dots, 1, \dots, 0 ), where only the No. i element is 1. One-hot encoding is also easy to understand, and that’s all. It is easy to imagine that people have already come up with more complicated and better way to encoder words. And one major way to do that is word embedding.

2.2 Word embedding

Source: Francois Chollet, Deep Learning with Python,(2018), Manning

Actually word embedding is related to one-hot encoding, and if you understand how to train a simple neural network, for example densely connected layers, you would understand word embedding easily. The key idea of word embedding is denoting each token with a D dimensional vector, whose dimension is fewer than the vocabulary size |\mathcal{V}|. The elements of the resulting word embedding vector are real values, I mean not only 0 or 1. Obviously you can encode much richer variety of tokens with such vectors. The figure at the left side is from “Deep Learning with Python” by François Chollet, and I think this is an almost perfect and simple explanation of the comparison of one-hot encoding and word embedding. But the problem is how to get such convenient vectors. The answer is very simple: you have only to train a network whose inputs are one-hot vector of the vocabulary.

The figure below is a simplified model of word embedding of a certain word. When the word is input into a neural network, only the corresponding element of the one-hot vector is 1, and that virtually means the very first input layer is composed of one neuron whose value is 1. And the only one neuron propagates to the next D dimensional embedding layer. These weights are the very values which most other study materials call “an embedding vector.”

When you input each word into a certain network, for example RNN or Transformer, you map the input one-hot vector into the embedding layer/vector. The examples in the figure are how inputs are made when the input sentences are “You’ve got the touch” and “You’ve got the power.”   Assume that you have a dictionary of one-hot encoding, whose vocabulary is {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}, and the dimension of word embeding is 6. In this case |\mathcal{V}| = 9, D=6. When the inputs are “You’ve got the touch” or “You’ve got the power” , you put the one-hot vector corresponding to “You’ve”, “got”, “the”, “touch” or “You’ve”, “got”, “the”, “power” sequentially every time step t.

In order to get word embedding of certain vocabulary, you just need to train the network. We know that the words “actually” and “indeed” are used in similar ways in writings. Thus when we propagate those words into the embedding layer, we can expect that those embedding layers are similar. This is how we can mathematically get effective word embedding of certain vocabulary.

More interestingly, if word embedding is properly trained, you can mathematically “calculate” words. For example, \boldsymbol{v}_{king} - \boldsymbol{v}_{man} + \boldsymbol{v}_{woman} \approx \boldsymbol{v}_{queen}, \boldsymbol{v}_{Japan} - \boldsymbol{v}_{Tokyo} + \boldsymbol{v}_{Vietnam} \approx \boldsymbol{v}_{Hanoi}.

*I have tried to demonstrate this type of calculation on several word embedding, but none of them seem to work well. At least you should keep it in mind that word embedding learns complicated linear relations between words.

I should explain word embedding techniques such as word2vec in detail, but the main focus of this article is not NLP, so the points I have mentioned are enough to understand Transformer model with NLP examples in the upcoming articles.


3 Language model

Language models is one of the most straightforward, but crucial ideas in NLP. This is also a big topic, so this article is going to cover only basic points. Language model is a mathematical model of the probabilities of which words to come next, given a context. For example if you have a sentence “In the lecture, he opened a _.”, a language model predicts what comes at the part “_.” It is obvious that this is contextual. If you are talking about general university students, “_” would be “textbook,” but if you are talking about Japanese universities, especially in liberal art department, “_” would be more likely to be “smartphone. I think most of you use this language model everyday. When you type in something on your computer or smartphone, you would constantly see text predictions, or they might even correct your spelling or grammatical errors. This is language modelling. You can make language models in several ways, such as n-gram and neural language models, but in this article I can explain only general formulations for such models.

*I am not sure which algorithm is used in which services. That must be too fast changing and competitive for me to catch up.

As I mentioned in the first article series on RNN, a sentence is usually processed as sequence data in NLP. One single sentence is denoted as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}), a list of vectors. The vectors are usually embedding vectors, and the (t) is the index of the order of tokens. For example the sentence “You’ve go the power.” can be expressed as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}), where \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)} denote “You’ve”, “got”, “the”, “power”, “.” respectively. In this case \tau = 4.

In practice a sentence \boldsymbol{X} usually includes two tokens BOS and EOS at the beginning and the end of the sentence. They mean “Beginning Of Sentence” and “End Of Sentence” respectively. Thus in many cases \boldsymbol{X} = (\boldsymbol{BOS} , \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS} ). \boldsymbol{BOS} and \boldsymbol{EOS} are also both vectors, at least in the Tensorflow tutorial.

P(\boldsymbol{X} = (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) is the probability of incidence of the sentence. But it is easy to imagine that it would be very hard to directly calculate how likely the sentence \boldsymbol{X} appears out of all possible sentences. I would rather say it is impossible. Thus instead in NLP we calculate the probability P(\boldsymbol{X}) as a product of the probability of incidence or a certain word, given all the words so far. When you’ve got the words (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1}) so far, the probability of the incidence of \boldsymbol{x}^{(t)}, given the context is  P(\boldsymbol{x}^{(t)}|\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1)}). P(\boldsymbol{BOS}) is a probability of the the sentence \boldsymbol{X} being (\boldsymbol{BOS}), and the probability of \boldsymbol{X} being (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) can be decomposed this way: P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS})P(\boldsymbol{BOS}).

Just as well P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) = P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P( \boldsymbol{BOS}, \boldsymbol{x}^{(1)})= P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P( \boldsymbol{BOS}).

Hence, the general probability of incidence of a sentence \boldsymbol{X} is P(\boldsymbol{X})=P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(\tau -1)}, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) = P(\boldsymbol{EOS}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}) P(\boldsymbol{x}^{(\tau)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau - 1)}) \cdots P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P(\boldsymbol{BOS}).

Let \boldsymbol{x}^{(0)} be \boldsymbol{BOS} and \boldsymbol{x}^{(\tau + 1)} be \boldsymbol{EOS}. Plus, let P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]}) be P(\boldsymbol{x}^{(t+1)}|\boldsymbol{x}^{(0)}, \dots, \boldsymbol{x}^{(t)}), then P(\boldsymbol{X}) = P(\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Language models calculate which words to come sequentially in this way.

Here’s a question: how would you evaluate a language model?

I would say the answer is, when the language model generates words, the more confident the language model is, the better the language model is. Given a context, when the distribution of the next word is concentrated on a certain word, we can say the language model is confident about which word to come next, given the context.

*For some people, it would be more understandable to call this “entropy.”

Let’s take the vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”} as an example. Assume that P(\boldsymbol{X}) = P(\boldsymbol{BOS}, \boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}, \boldsymbol{EOS}) = P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}, \boldsymbol{EOS})= P(\boldsymbol{x}^{(0)})\prod_{t=0}^{4}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Given a context (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}), the probability of incidence of \boldsymbol{x}^{(2)} is P(\boldsymbol{x}^{2}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}). In the figure below, the distribution at the left side is less confident because probabilities do not spread widely, on the other hand the one at the right side is more confident that next word is “got” because the distribution concentrates on “got”.

*You have to keep it in mind that the sum of all possible probability P(\boldsymbol{x}^{(2)} | \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) is 1, that is, P(\boldsymbol{the}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + P(\boldsymbol{You've}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + \cdots + P(\boldsymbol{Boogie}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = 1.

While the language model generating the sentence “BOS You’ve got the touch EOS”, it is better if the language model keeps being confident. If it is confident, P(\boldsymbol{X})= P(\boldsymbol{BOS}) P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}}P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}) P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} gets higher. Thus (-1) \{ log_{b}{P(\boldsymbol{BOS})} + log_{b}{P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}}) + log_{b}{P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)})} + log_{b}{P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)})} + log_{b}{P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} \} gets lower, where usually b=2 or b=e.

This is how to measure how confident language models are, and the indicator of the confidence is called perplexity. Assume that you have a data set for evaluation \mathcal{D} = (\boldsymbol{X}_1, \dots, \boldsymbol{X}_n, \dots, \boldsymbol{X}_{|\mathcal{D}|}), which is composed of |\mathcal{D}| sentences in total. Each sentence \boldsymbol{X}_n = (\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau ^{(n)}}{P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})} has \tau^{(n)} tokens in total excluding \boldsymbol{BOS}, \boldsymbol{EOS}. And let |\mathcal{V}| be the size of the vocabulary of the language model. Then the perplexity of the language model is b^z, where z = \frac{-1}{|\mathcal{V}|}\sum_{n=1}^{|\mathcal{D}|}{\sum_{t=0}^{\tau ^{(n)}}{log_{b}P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})}. The b is usually 2 or e.

For example, assume that \mathcal{V} is vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}. Also assume that the evaluation data set for perplexity of a language model is \mathcal{D} = (\boldsymbol{X}_1, \boldsymbol{X}_2), where \boldsymbol{X_1} =(\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}) \boldsymbol{X_2} = (\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the }, \boldsymbol{power}). In this case |\mathcal{V}|=9, |\mathcal{D}|=2. I have already showed you how to calculate the perplexity of the sentence “You’ve got the touch.” above. You just need to do a similar thing on another sentence “You’ve got the power”, and then you can get the perplexity of the language model.

*If the network is not properly trained, it would also be confident of generating wrong outputs. However, such network still would give high perplexity because it is “confident” at any rate. I’m sorry I don’t know how to tackle the problem. Please let me put this aside, and let’s get down on Transformer model soon.


Let’s see how word embedding is implemented with a very simple example in the official Tensorflow tutorial. It is a simple binary classification task on IMDb Dataset. The dataset is composed to comments on movies by movie critics, and you have only to classify if the commentary is positive or negative about the movie. For example when you get you get an input “To be honest, Michael Bay is a terrible as an action film maker. You cannot understand what is going on during combat scenes, and his movies rely too much on advertisements. I got a headache when Mark Walberg used a Chinese cridit card in Texas. However he is very competent when it comes to humorous scenes. He is very talented as a comedy director, and I have to admit I laughed a lot.“, the neural netowork has to judge whether the statement is positive or negative.

This networks just takes an average of input embedding vectors and regress it into a one dimensional value from 0 to 1. The shape of embedding layer is (8185, 16). Weights of neural netowrks are usually implemented as matrices, and you can see that each row of the matrix corresponds to emmbedding vector of each token.

*It is easy to imagine that this technique is problematic. This network virtually taking a mean of input embedding vectors. That could mean if the input sentence includes relatively many tokens with negative meanings, it is inclined to be classified as negative. But for example, if the sentence is “This masterpiece is a dark comedy by Charlie Chaplin which depicted stupidity of the evil tyrant gaining power in the time. It thoroughly mocked Germany in the time as an absurd group of fanatics, but such propaganda could have never been made until ‘Casablanca.'” , this can be classified as negative, because only the part “masterpiece” is positive as a token, and there are much more words with negative meanings themselves.

The official Tensorflow tutorial provides visualization of word embedding with Embedding Projector, but I would like you to take more control over the data by yourself. Please just copy and paste the codes below, installing necessary libraries. You would get a map of vocabulary used in the text classification task. It seems you cannot find clear tendency of the clusters of the tokens. You can try other dimension reduction methods to get maps of the vocabulary by for example using Scikit Learn.


[1] “Word embeddings” Tensorflow Core

[2]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 43-64, 72-85, 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 43-64, 72-85, 191-193

[3]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)

[4] Francois Chollet, Deep Learning with Python,(2018), Manning , pp. 178-185

[5]”2.2. Manifold learning,” scikit-learn

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

The algorithm known as PCA and my taxonomy of linear dimension reductions

In one of my previous articles, I explained the importance of reducing dimensions. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are the simplest types of dimension reduction algorithms. In upcoming articles of mine, you are going to see what these algorithms do. In conclusion, diagonalization, which I mentioned in the last article, is what these algorithms are all about, but still in this article I can mainly cover only PCA.

This article is largely based on the explanations in Pattern Recognition and Machine Learning by C. M. Bishop (which is often called “PRML”), and when you search “PCA” on the Internet, you will find more or less similar explanations. However I hope I can go some steps ahead throughout this article series. I mean, I am planning to also cover more generalized versions of PCA, meanings of diagonalization, the idea of subspace. I believe this article series is also effective for refreshing your insight into linear algebra.

*This is the third article of my article series “Illustrative introductions on dimension reduction.”

1. My taxonomy on linear dimension reduction

*If you soon want to know  what the algorithm called “PCA” is, you should skip this section for now to avoid confusion.

Out of the two algorithms I mentioned, PCA is especially important and you would see the same or similar ideas in various fields such as signal processing, psychology, and structural mechanics. However in most cases, the word “PCA” refers to one certain algorithm of linear dimension reduction. Most articles or study materials only mention the “PCA,” and this article is also going to cover only the algorithm which most poeple call “PCA.” However I found that PCA is only one branch of linear dimension reduction algorithms.

*From now on all the terms “PCA” in this article means the algorithm known as PCA unless I clearly mention the generalized KL transform.

*This chart might be confusing to you. According to PRML, PCA and KL transform is identical. PCA has two formulations, maximum variance formulation and minimum error formulation, and they can give the same result. However according to a Japanese textbook, which is very precise about this topic, KL transform has two formulations, and what we call PCA is based on maximum variance formulation. I am still not sure about correct terminology, but in this article I am going to call the most general algorithm “generalized KL transform,” I mean the root of the chart above.

*Most materials just explain the most major PCA, but if you consider this generalized KL transform, I can introduce an intriguing classification algorithm called subspace method. This algorithm was invented in Japan, and this is not so popular in machine learning textbooks in general, but learning this method would give you better insight into the idea of multidimensional space in machine learning. In the future, I am planning to cover this topic in this article series.

2. PCA

When someones mention “PCA,” I am sure for the most part that means the algorithm I am going to explain in the rest of this article. The most intuitive and straightforward way to explain PCA is that, PCA (Principal Component Analysis) of two or three dimensional data is fitting an oval to two dimensional data or fitting an ellipsoid to three dimensional data. You can actually try to plot some random dots on a piece of paper, and draw an oval which fits the dots the best. Assume that you have these 2 or 3 dimensional data below, and please try to put an oval or an ellipsoid to the data.

I think this is nothing difficult, but I have a question: what was the logic behind your choice?

Some might have roughly drawn its outline. Formulas of  “the surface” of general ellipsoids can be explained in several ways, but in this article you only have to consider ellipsoids whose center is the origin point of the coordinate system. In PCA you virtually shift data so that the mean of the data comes to the origin point of the coordinate system. When A is a certain type of D\times D matrix, the formula of a D-dimensional ellipsoid whose center is identical to the origin point is as follows: (\boldsymbol{x}, A\boldsymbol{x}) = 1, where \boldsymbol{x}\in \mathbb{R}. As is always the case with formulas in data science, you can visualize such ellipsoids if you are talking about 1, 2, or 3 dimensional data like in the figure below, but in general D-dimensional space, it is theoretical/imaginary stuff on blackboards.

*In order to explain the conditions which the matrix A has to hold, I need another article, so for now please just assume that the A is a kind of magical matrix.

You might have seen equations of 2 or 3 dimensional ellipsoids in the following way: \frac{x^2}{a^2} + \frac{y^2}{b^2} = 1, where a\neq 0, b\neq 0 or \frac{x^2}{a^2} + \frac{y^2}{b^2} + \frac{z^2}{c^2}= 1, where a\neq 0, b\neq 0, c \neq 0. These are special cases of the equation (\boldsymbol{x}, A\boldsymbol{x}) = 1, where A=diag(a_1^2, \dots, a_D^2). In this case the axes of ellipsoids the same as those of the coordinate system. Thus in this simple case, A=diag(a^2, b^2) or A=diag(a^2,c^2,c^2).

I am going explain these equations in detail in the upcoming articles. But thre is one problem: how would you fit an ellipsoid when a data distribution does not look like an ellipsoid?

In fact we have to focus more on another feature of ellipsoids: all the axes of an ellipsoid are orthogonal. In conclusion the axes of the ellipsoids are more important in PCA, so I do want you to forget about the surface of ellipsoids for the time being. You might get confused if you also think about the surface of ellipsoid now. I am planning to cover this topic in the next article. I hope this article, combined with the last one and the next one, would help you have better insight into the ideas which frequently appear in data science or machine learning context.

3. Fitting orthogonal axes on data

*If you have no trouble reading the chapter 12.1 of PRML, you do not need to this section or maybe even this article, but I hope at least some charts or codes of mine would enhance your understanding on this topic.

*I must admit I wrote only the essence of PCA formulations. If that seems too abstract to you, you should just breifly read through this section and go to the next section with a more concrete example. If you are confused, there should be other good explanations on PCA on the internet, and you should also check them. But at least the visualization of PCA in the next section would be helpful.

As I implied above, all the axes of ellipsoids are orthogonal, and selecting the orthogonal axes which match data is what PCA is all about. And when you choose those orthogonal axes, it is ideal if the data look like an ellipsoid. Simply putting we want the data to “swell” along the axes.

Then let’s see how to let them “swell,” more mathematically. Assume that you have 2 dimensional data plotted on a coordinate system (\boldsymbol{e}_1, \boldsymbol{e}_2) as below (The samples are plotted in purple). Intuitively, the data “swell” the most along the vector \boldsymbol{u}_1. Also  it is clear that \boldsymbol{u}_2 is the only vector orthogonal to \boldsymbol{u}_1. We can expect that the new coordinate system (\boldsymbol{u}_1, \boldsymbol{u}_2) expresses the data in a better way, and you you can get new coordinate points of the samples by projecting them on new axes as done with yellow lines below.

Next, let’s think about a case in 3 dimensional data. When you have 3 dimensional data in a coordinate system (\boldsymbol{e}_1, \boldsymbol{e}_2,\boldsymbol{e}_3) as below,  the data “swell” the most also along \boldsymbol{u}_1. And the data swells the second most along \boldsymbol{u}_2. The two axes, or vectors span the plain in purple. If you project all the samples on the plain, you will get 2 dimensional data at the right side. It is important that we did not consider the third axis. That implies you might be able to display the data well with only 2 dimensional sapce, which is spanned by the two axes \boldsymbol{v}_1, \boldsymbol{v}_2.


Thus the problem is how to calculate such axis \boldsymbol{u}_1. We want the variance of data projected on \boldsymbol{u}_1 to be the biggest. The coordinate of \boldsymbol{x}_n on the axis \boldsymbol{u}_1. The coordinate of a data point \boldsymbol{x}_n on the axis \boldsymbol{u}_1 is calculated by projecting \boldsymbol{x}_n on \boldsymbol{u}_1. In data science context, such projection is synonym to taking an inner  product of \boldsymbol{x}_n and \boldsymbol{u}_1, that is calculating \boldsymbol{u}_1^T \boldsymbol{x}_n.

*Each element of \boldsymbol{x}_n is the coordinate of the data point \boldsymbol{x}_n in the original coordinate system. And the projected data on \boldsymbol{u}_1 whose coordinates are 1-dimensional correspond to only one element of transformed data.

To calculate the variance of projected data on \boldsymbol{u}_1, we just have to calculate the mean of variances of 1-dimensional data projected on \boldsymbol{u}_1. Assume that \bar{\boldsymbol{x}} is the mean of data in the original coordinate, then the deviation of \boldsymbol{x}_1 on the axis \boldsymbol{u}_1 is calculated as \boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}}, as shown in the figure. Hence the variance, I mean the mean of the deviation on is \frac{1}{N} \sum^{N}_{n}{\boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}}}, where N is the total number of data points. After some deformations, you get the next equation \frac{1}{N} \sum^{N}_{n}{\boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}}} = \boldsymbol{u}_1^T S \boldsymbol{u}_1, where S = \frac{1}{N}\sum_{n=1}^{N}{(\boldsymbol{x}_n - \bar{\boldsymbol{x}})(\boldsymbol{x}_n - \bar{\boldsymbol{x}})^T}. S is known as a covariance matrix.

We are now interested in maximizing the variance of projected data on  \boldsymbol{u}_1^T S \boldsymbol{u}_1, and for mathematical derivation we need some college level calculus, so if that is too much for you, you can skip reading this part till the next section.

We now want to calculate \boldsymbol{u}_1 with which \boldsymbol{u}_1^T S \boldsymbol{u}_1 is its maximum value. General \boldsymbol{u}_i including \boldsymbol{u}_1 are just coordinate axes after PCA, so we are just interested in their directions. Thus we can set one constraint \boldsymbol{u}_1^T  \boldsymbol{u}_1 = 1. Introducing a Lagrange multiplier, we have only to optimize next problem: \boldsymbol{u}_1 ^ {*} = \mathop{\rm arg~max}\limits_{\boldsymbol{u}_1} \{ \boldsymbol{u}_1^T S \boldsymbol{u}_1 + \lambda_1 (1 - \boldsymbol{u}_1^T \boldsymbol{u}_1) \}. In conclusion \boldsymbol{u}_1 ^ {*} satisfies S\boldsymbol{u}_1 ^ {*}  = \lamba_1 \boldsymbol{u}_1 ^ {*}. If you have read my last article on eigenvectors, you wold soon realize that this is an equation for calculating eigenvectors, and that means \boldsymbol{u}_1 ^ {*} is one of eigenvectors of the covariance matrix S. Given the equation of eigenvector the next equation holds \boldsymbol{u}_1 ^ {*}^T S \boldsymbol{u}_1 ^ {*} = \lambda_1. We have seen that \boldsymbol{u}_1 ^T S \boldsymbol{u}_1 ^ is a the variance of data when projected on a vector \boldsymbol{u}_1, thus the eigenvalue \lambda_1 is the biggest variance possible when the data are projected on a vector.

Just in the same way you can calculate the next biggest eigenvalue \lambda_2, and it it the second biggest variance possible, and in this case the date are projected on \boldsymbol{u}_2, which is orthogonal to \boldsymbol{u}_1. As well you can calculate orthogonal 3rd 4th …. Dth eigenvectors.

*To be exact I have to explain the cases where we can get such D orthogonal eigenvectors, but that is going to be long. I hope I can to that in the next article.

4. Practical three dimensional example of PCA

We have seen that PCA is sequentially choosing orthogonal axes along which data points swell the most. Also we have seen that it is equal to calculating eigenvalues of the covariance matrix of the data from the largest to smallest one. From now on let’s work on a practical example of data. Assume that we have 30 students’ scores of Japanese, math, and English tests as below.

* I think the subject “Japanese” is equivalent to “English” or “language art” in English speaking countries, and maybe “Deutsch” in Germany. This example and the explanation are largely based on a Japanese textbook named 「これなら分かる応用数学教室 最小二乗法からウェーブレットまで」. This is a famous textbook with cool and precise explanations on mathematics for engineering. Partly sharing this is one of purposes of this article.

At the right side of the figure below is plots of the scores with all the combinations of coordinate axes. In total 9 inverse graphs are symmetrically arranged in the figure, and it is easy to see that English & Japanese or English and math have relatively high correlation. The more two axes have linear correlations, the bigger the covariance between them is.

In the last article, I visualized the eigenvectors of a 3\times 3 matrix A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}, and in fact the matrix is just a constant multiplication of this covariance matrix. I think now you understand that PCA is calculating the orthogonal eigenvectors of covariance matrix of data, that is diagonalizing covariance matrix with orthonormal eigenvectors. Hence we can guess that covariance matrix enables a type of linear transformation of rotation and expansion and contraction of vectors. And data points swell along eigenvectors of such matrix.

Then why PCA is useful? In order to see that at first, for simplicity assume that x, y, z denote Japanese, Math, English scores respectively. The mean of the data is \left( \begin{array}{c} \bar{x} \\ \bar{y} \\ \bar{z} \end{array} \right) = \left( \begin{array}{c} 58.1 \\ 61.8 \\ 67.3 \end{array} \right), and the covariance matrix of data in the original coordinate system is V_{xyz} = \begin{pmatrix} 60.45 & 33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}. The eigenvalues of  V_{xyz} are \lambda_1=148.34, \lambda_2 = 30.62, and \lambda_3 = 3.60, and their corresponding unit eigenvectors are \boldsymbol{u}_1 =  \left( \begin{array}{c} 0.540 \\ 0.602 \\ 0.589 \end{array} \right) , \boldsymbol{u}_2 =  \left( \begin{array}{c} 0.736 \\ -0.677 \\ 0.0174 \end{array} \right) , \boldsymbol{u}_3 =  \left( \begin{array}{c} -0.408 \\ -0.4.23 \\ 0.809 \end{array} \right) respectively.  U = (\boldsymbol{u}_1 \quad \boldsymbol{u}_2 \quad \boldsymbol{u}_3 )  is an orthonormal matrix, where \boldsymbol{u}_i^T\boldsymbol{u}_j = \begin{cases} 1 & (i=j) \\ 0 & (otherwise) \end{cases}. As I explained in the last article, you can diagonalize V_{xyz} with U: U^T V_{xyz}U = diag(\lambda_1, \dots, \lambda_D).

In order to see how PCA is useful, assume that \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right)  = U^T \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right).

Let’s take a brief look at what a linear transformation by U^T means. Each element of \boldsymbol{x} denotes coordinate of the data point \boldsymbol{x}  in the original coordinate system (In this case the original coordinate system is composed of \boldsymbol{e}_1, \boldsymbol{e}_2, and \boldsymbol{e}_3). U = (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) enables a rotation of a rigid body, which means the shape or arrangement of data will not change after the rotation, and U^T enables a reverse rotation of the rigid body.

*Roughly putting, if you hold a bold object such as a metal ball and rotate your arm, that is a rotation of a rigid body, and your shoulder is the origin point. On the other hand, if you hold something soft like a marshmallow, it would be squashed in your hand, and that is not a not a rotation of a rigid body.

You can rotate \boldsymbol{x} with U like U^T\boldsymbol{x} = \left( \begin{array}{c} -\boldsymbol{u}_1^{T}- \\ -\boldsymbol{u}_2^{T}- \\ -\boldsymbol{u}_3^{T}- \end{array} \right)\boldsymbol{x}=\left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right), and \boldsymbol{u}_i^{T}\boldsymbol{x} is the coordinate of \boldsymbol{x} projected on the axis \boldsymbol{u}_i.

Let’s see this more visually. Assume that the data point \boldsymbol{x}  is a purple dot and its position is expressed in the original coordinate system spanned by black arrows . By multiplying \boldsymbol{x} with U^T, the purple point \boldsymbol{x} is projected on the red axes respectively, and the product \left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right) denotes the coordinate point of the purple point in the red coordinate system. \boldsymbol{x} is rotated this way, but for now I think it is better to think that the data are projected on new coordinate axes rather than the data themselves are rotating.

Now that we have seen what rotation by U means, you should have clearer image on what \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right)  = U^T \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) means. \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right) denotes the coordinates of data projected on new axes \boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3, which are unit eigenvectors of V_{xyz}. In the coordinate system spanned by the eigenvectors, the data distribute like below.

By multiplying U from both sides of the equation above, we get \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) =U \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right), which means you can express deviations of the original data as linear combinations of the three factors \xi, \eta, and \zeta. We expect that those three factors contain keys for understanding the original data more efficiently. If you concretely write down all the equations for the factors: \xi = 0.540 (x - \bar{x}) + 0.602 (y - \bar{y}) + 0.588 (z - \bar{z}), \eta = 0.736(x - \bar{x}) - 0.677 (y - \bar{y}) + 0.0174 (z - \bar{z}), and \zeta = - 0.408 (x - \bar{x}) - 0.423 (y - \bar{y}) + 0.809(z - \bar{z}). If you examine the coefficients of the deviations (x - \bar{x}), (y - \bar{y}), and (z - \bar{z}), we can observe that \eta almost equally reflects the deviation of the scores of all the subjects, thus we can say \eta is a factor indicating one’s general academic level. When it comes to \eta Japanese and Math scores are important, so we can guess that this factor indicates whether the student is at more of “scientific side” or “liberal art side.” In the same way \zeta relatively makes much of one’s English score,  so it should show one’s “internationality.” However the covariance of the data \xi, \eta, \zeta is V_{\xi \eta \zeta} = \begin{pmatrix} 148.34 & 0 & 0 \\ 0 & 30.62 & 0 \\ 0 & 0 & 3.60 \end{pmatrix}. You can see \zeta does not vary from students to students, which means it is relatively not important to describe the tendency of data. Therefore for dimension reduction you can cut off the factor \zeta.

*Assume that you can apply PCA on D-dimensional data and that you get \boldsymbol{x}', where \boldsymbol{x}' = U^T\boldsymbol{x} - \bar{\boldsymbol{x}}. The variance of data projected on new D-dimensional coordinate system is V'=\frac{1}{N}\sum{(\boldsymbol{x}')^T\boldsymbol{x}'} =\frac{1}{N}\sum{(U^T\boldsymbol{x})^T(U^T\boldsymbol{x})} =\frac{1}{N}\sum{U^T\boldsymbol{x}\boldsymbol{x}^TU} =U^T(\frac{1}{N}\sum{\boldsymbol{x}\boldsymbol{x}^T})U =U^TVU =diag(\lambda_1, \dots, \lambda_D). This means that in the new coordinate system after PCA, covariances between any pair of variants are all zero.

*As I mentioned U is a rotation of a rigid body, and U^T is the reverse rotation, hence U^TU = UU^T = I.

Hence you can approximate the original 3 dimensional data on the coordinate system (\boldsymbol{e}_1, \boldsymbol{e}_2, \boldsymbol{e}_3) from the reduced two dimensional coordinate system (\boldsymbol{u}_1, \boldsymbol{u}_2) with the following equation: \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) \approx U_{reduced} \left( \begin{array}{c} \xi \\ \eta  \end{array} \right)  = (\boldsymbol{u}_1 \quad \boldsymbol{u}_2) \left( \begin{array}{c} \xi \\ \eta  \end{array} \right). Then it mathematically clearer that we can express the data with two factors: “how smart the student is” and “whether he is at scientific side or liberal art side.”

We can observe that eigenvalue \lambda_i is a statistic which indicates how much the corresponding \boldsymbol{u}_i can express the data, \frac{\lambda_i}{\sum_{j=1}^{D}{\lambda_j}} is called the contribution ratio of eigenvector \boldsymbol{u}_i. In the example above, the contribution ratios of \boldsymbol{u}_1, \boldsymbol{u}_2, and \boldsymbol{u}_3 are respectively \frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3}=0.813, \frac{\lambda_2}{\lambda_1 + \lambda_2 + \lambda_3}=0.168, \frac{\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3}=0.0197. You can decide how many degrees of dimensions you reduce based on this information.

Appendix: Playing with my toy PCA on MNIST dataset

Applying “so called” PCA on MNIST dataset is a super typical topic that many other tutorial on PCA also introduce, but I still recommend you to actually implement, or at least trace PCA implementation with MNIST dataset without using libraries like scikit-learn. While reading this article I recommend you to actually run the first and the second code below. I think you can just copy and paste them on your tool to run Python, installing necessary libraries. I wrote them on Jupyter Notebook.

In my implementation, in the simple configuration part you can set the USE_ALL_NUMBERS as True or False boolean. If you set it as True, you apply PCA on all the data of numbers from 0 to 9. If you set it as True, you can specify which digit to apply PCA on. In this article, I show the results results of PCA on the data of digit ‘3.’ The first three images of ‘3’ are as below.

You have to keep it in mind that the data are all shown as 28 by 28 pixel grayscale images, but in the process of PCA, they are all processed as 28 * 28 = 784 dimensional vectors. After applying PCA on the 784 dimensional vectors of images of ‘3,’ the first 25 eigenvectors are as below. You can see that at the beginning the eigenvectors partly retain the shapes of ‘3,’ but they are distorted as the eigenvalues get smaller. We can guess that the latter eigenvalues are not that helpful in reconstructing the shape of ‘3.’

Just as we saw in the last section, you you can cut off axes of eigenvectors with small eigenvalues and reduce the dimension of MNIST data. The figure below shows how contribution ratio of MNIST data grows. You can see that around 200 dimension degree, the contribution ratio reaches around 0.95. Then we can guess that even if we reduce the dimension of MNIST from 784 to 200 we can retain the most of the structure of original data.

Some results of reconstruction of data from 200 dimensional space are as below. You can set how many images to display by adjusting NUMBER_OF_RESULTS in the code. And if you set LATENT_DIMENSION as 784, you can completely reconstruct the data.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

*I attatched the codes I used to make the figures in this article. You can just copy, paste, and run, sometimes installing necessary libraries.



Rethinking linear algebra: visualizing linear transformations and eigenvectors

In terms of calculation processes of Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), which are the dimension reduction techniques I am going to explain in the following articles, diagonalization is what they are all about. Throughout this article, I would like you to have richer insight into diagonalization in order to prepare for understanding those basic dimension reduction techniques.

When our professor started a lecture on the last chapter of our textbook on linear algebra, he said “It is no exaggeration to say that everything we have studied is for this ‘diagonalization.'” Until then we had to write tons of numerical matrices and vectors all over our notebooks, calculating those products, adding their rows or columns to other rows or columns, sometimes transposing the matrices, calculating their determinants.

It was like the scene in “The Karate Kid,” where the protagonist finally understood the profound meaning behind the prolonged and boring “wax on, wax off” training given by Miyagi (or “jacket on, jacket off” training given by Jackie Chan). We had finally understood why we had been doing those seemingly endless calculations.


But usually you can do those calculations easily with functions in the Numpy library. Unlike Japanese college freshmen, I bet you are too busy to reopen textbooks on linear algebra to refresh your mathematics. Thus I am going to provide less mathematical and more intuitive explanation of diagonalization in this article.

*This is the second article of the article series ” Illustrative introductions on dimension reduction .”

1, The mainstream ways of explaining diagonalization.

*The statements below are very rough for mathematical topics, but I am going to give priority to offering more visual understanding on linear algebra in this article. For further understanding, please refer to textbooks on linear algebra. If you would like to have minimum understandings on linear algebra needed for machine learning, I recommend the Appendix C of Pattern Recognition and Machine Learning by C. M. Bishop.

In most textbooks on linear algebra, the explanations on dioagonalization is like this (if you are not sure what diagonalization is or if you are allergic to mathematics, you do not have to read this seriously):

Let V (dimV = D)be a vector space and let  T_A : V \rightarrow V be a mapping of V into itself,  defined as T_A(v) = A \cdot \boldsymbol{v}, where A is a D\times D matrix and \boldsymbol{v} is D dimensional vector. An element \boldsymbol{v} \in V is called an eigen vector if there exists a number \lambda such that A \cdot \boldsymbol{v}= \lambda \cdot \boldsymbol{v} and \boldsymbol{v} \neq \boldsymbol{0}. In this case \lambda is uniquely determined and is called an eigen value of A belonging to the eigen vector \boldsymbol{v}.

Any matrix A has D eigen values \lambda_{i}, belonging to \boldsymbol{v}_{i} (i=1, 2, …., D). If \boldsymbol{v}_{i} is basis of the vector space V, then A is diagonalizable.

When A is diagonalizable, with D \times D matrices P = (\boldsymbol{v}_{1}, \dots, \boldsymbol{v}_{D}) , whose column vectors are eigen vectors \boldsymbol{v}_{i} (i=1, 2, …., D), the following equation holds: P^{-1}AP = \Lambda, where \Lambda = diag(\lambda_{1}, \dots, \lambda_{D})= \begin{pmatrix} \lambda_{1} & 0& \ldots &0\\ 0 & \lambda_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_{D} \end{pmatrix}.

And when A is diagonalizable, you can diagonalize A as below.

Most textbooks keep explaining these type of stuff, but I have to say they lack efforts to make it understandable to readers with low mathematical literacy like me. Especially if you have to apply the idea to data science field, I believe you need more visual understanding of diagonalization. Therefore instead of just explaining the definitions and theorems, I would like to take a different approach. But in order to understand them in more intuitive ways, we first have to rethink waht linear transformation T_A means in more visible ways.

2, Linear transformations

Even though I did my best to make this article understandable to people with little prerequisite knowledge, you at least have to understand linear transformation of numerical vectors and with matrices. Linear transformation is nothing difficult, and in this article I am going to use only 2 or 3 dimensional numerical vectors or square matrices. You can calculate linear transformation of \boldsymbol{v} by A as equations in the figure. In other words, \boldsymbol{u} is a vector transformed by A.

*I am not going to use the term “linear transformation” in a precise way in the context of linear algebra. In this article or in the context of data science or machine learning, “linear transformation” for the most part means products of matrices or vectors. 

*Forward/back propagation of deep learning is mainly composed of this linear transformation. You keep linearly transforming input vectors, frequently transforming them with activation functions, which are for the most part not linear transformation.

As you can see in the equations above, linear transformation with A transforms a vector to another vector. Assume that you have an original vector \boldsymbol{v} in grey and that the vector \boldsymbol{u} in pink is the transformed \boldsymbol{v} by A is. If you subtract \boldsymbol{v} from \boldsymbol{u}, you can get a displacement vector, which I displayed in purple. A displacement vector means the transition from a vector to another vector.

Let’s calculate the displacement vector with more vectors \boldsymbol{v}. Assume that A =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, and I prepared several grid vectors \boldsymbol{v} in grey as you can see in the figure below. If you transform those grey grid points with A, they are mapped into the vectors \boldsymbol{u} in pink. With those vectors in grey or pink, you can calculate the their displacement vectors \boldsymbol{u} - \boldsymbol{v}, which are in purple.

The displacement vectors in the figure above have some tendencies. In order to see that more clearly, let’s calculate displacement vectors with several matrices A and more grid points. Assume that you have three 2 \times 2 square matrices A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, A_2 =\begin{pmatrix} 3 & 1 \\ -1 & 1 \end{pmatrix}, A_3 =\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}, and I plotted displace vectors made by the matrices respectively in the figure below.

I think you noticed some characteristics of the displacement vectors made by those linear transformations: the vectors are swirling and many of them seem to be oriented in certain directions. To be exact, some displacement vectors extend in the same directions as some of original vectors in grey. That means  linear transformation by A did not change the direction of the original vector \boldsymbol{v}, and the unchanged vectors are called eigen vectors. Real eigen vectors of each A are displayed as arrows in yellow in the figure above. But when it comes to A_3, the matrix does not have any real eigan values.

In linear algebra, depending on the type matrices A, you have to consider various cases such as whether the matrices have real or imaginary eigen values, whether the matrices are diagonalizable, whether the eigen vectors are orthogonal, or whether they are unit vectors. But those topics are out of the scope of this article series, so please refer to textbooks on linear algebra if you are interested.

Luckily, however, in terms of PCA or LDA, you only have to consider a type of matrices named positive semidefinite matrices, which A_1 is classified to, and I am going to explain positive semidefinite matrices in the fourth section.

3, Eigen vectors as coordinate system

Source: Ian Stewart, “Professor Stewart’s Cabinet of Mathematical Curiosities,” (2008), Basic Books

Let me take Fibonacci numbers as an example to briefly see why diagonalization is useful. Fibonacci is sequence is quite simple and it is often explained using an example of pairs of rabbits increasing generation by generation. Let a_n (n=0, 1, 2, …) be the number of pairs of grown up rabbits in the n^{th} generation. One pair of grown up rabbits produce one pair of young rabbit The concrete values of a_n are a_0 = 0, a_1 = 1, a_2=1, a_3=2, a_4=3, a_5=5, a_6=8, a_7=13, \dots. Assume that A =\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} and that \begin{pmatrix} a_1 \\ a_0  \end{pmatrix} =\begin{pmatrix} 1 \\ 0  \end{pmatrix}, then you can calculate the number of the pairs of grown up rabbits in the next generation with the following recurrence relation. \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}=\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}.Let \boldsymbol{a}_n be \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}, then the recurrence relation can be written as \boldsymbol{a}_{n+1} = A \boldsymbol{a}_n, and the transition of \boldsymbol{a}_n are like purple arrows in the figure below. It seems that the changes of the purple arrows are irregular if you look at the plots in normal coordinate.

Assume that \lambda _1, \lambda_2 (\lambda _1< \lambda_2) are eigen values of A, and \boldsymbol{v}_1, \boldsymbol{v}_2 are eigen vectors belonging to them respectively. Also let \alpha, \beta scalars such that \begin{pmatrix} a_{1} \\ a_{0}  \end{pmatrix} = \begin{pmatrix} 1 \\ 0  \end{pmatrix} = \alpha \boldsymbol{v}_1 + \beta \boldsymbol{v}_2. According to the definition of eigen values and eigen vectors belonging to them, the following two equations hold: A\boldsymbol{v}_1 = \lambda_1 \boldsymbol{v}_1, A\boldsymbol{v}_2 = \lambda_2 \boldsymbol{v}_2. If you calculate \boldsymbol{a}_1 is, using eigen vectors of A, \boldsymbol{a}_1  = A\boldsymbol{a}_0 = A (\alpha \boldsymbol{v}_1 + \beta \boldsymbol{v}_2) = \alpha\lambda _1 \boldsymbol{v}_1 + \beta \lambda_2 \boldsymbol{v}_2. In the same way, \boldsymbol{a}_2 = A\boldsymbol{a}_1 = A (\alpha\lambda _1 \boldsymbol{v}_1 + \beta \lambda_2 \boldsymbol{v}_2) = \alpha\lambda _{1}^{2} \boldsymbol{v}_1 + \beta \lambda_{2}^{2} \boldsymbol{v}_2, and \boldsymbol{a}_3 = A\boldsymbol{a}_2 = A (\alpha\lambda _{1}^{2} \boldsymbol{v}_1 + \beta \lambda_{2}^{2} \boldsymbol{v}_2) = \alpha\lambda _{1}^{3} \boldsymbol{v}_1 + \beta \lambda_{2}^{3} \boldsymbol{v}_2. These equations show that in coordinate system made by eigen vectors of A, linear transformation by A is easily done by just multiplying eigen values with each eigen vector. Compared to the graph of Fibonacci numbers above, in the figure below you can see that in coordinate system made by eigen vectors the plots changes more systematically generation by generation.


In coordinate system made by eigen vectors of square matrices, the linear transformations by the matrices can be much more straightforward, and this is one powerful strength of eigen vectors.

*I do not major in mathematics, so I am not 100% sure, but vectors in linear algebra have more abstract meanings. Various things in mathematics can be vectors, even though in machine learning or data science we  mainly use numerical vectors with more concrete elements. We can also say that matrices are a kind of maps. That is just like, at least in my impression, even though a real town is composed of various components such as houses, smooth or bumpy roads, you can simplify its structure with simple orthogonal lines, like the map of Manhattan. But if you know what the town actually looks like, you do not have to follow the zigzag path on the map.

4, Eigen vectors of positive semidefinite matrices

In the second section of this article I told you that, even though you have to consider various elements when you discuss general diagonalization, in terms of PCA and LDA we mainly use only a type of matrices named positive semidefinite matrices. Let A be a D \times D square matrix. If \boldsymbol{x}^T A \boldsymbol{x} \geq 0 for all values of the vector \boldsymbol{x}, the A is said to be a positive semidefinite matrix. And also it is known that A being a semidefinite matrix is equivalent to \lambda _{i} \geq 0 for all the eigen values \lambda_i (i=1, \dots , D).

*I think most people first learn a type of matrices called positive definite matrices. Let A be aD \times D square matrix. If \boldsymbol{x}^T A \boldsymbol{x} > 0 for all values of the vector \boldsymbol{x}, the A is said to be a positive definite matrix. You have to keep it in mind that even if all the elements of A are positive, A is not necessarly positive definite/semidefinite.

Just as we did in the second section of this article, let’s visualize displacement vectors made by linear transformation with a 3 \times 3 square positive semidefinite matrix A.

*In fact A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, whose linear transformation I visualized the second section, is also positive semidefinite.

Let’s visualize linear transformations by a positive definite matrix A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}. I visualized the displacement vectors made by the A just as the same way as in the second section of this article. The result is as below, and you can see that, as well as the displacement vectors made by A_1, the three dimensional displacement vectors below are swirling and extending in three directions, in the directions of the three orthogonal eigen vectors \boldsymbol{v}_1, \boldsymbol{v}_2, and \boldsymbol{v}_3.

*It might seem like a weird choice of a matrix, but you are going to see why I chose it in the next article.

You might have already noticed A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix} and A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix} are both symmetric matrices and that their elements are all real values, and that their diagonal elements are all positive values. Super importantly, when all the elements of a D \times D symmetric matrix A are real values and its eigen values are \lambda_{i} (i=1, \dots , D), there exist orthonormal matrices U such that U^{-1}AU = \Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}).

*The title of this section might be misleading, but please keep it in mind that positive definite/semidefinite matrices are not necessarily real symmetric matrices. And real symmetric vectors are not necessarily positive definite/semidefinite matrices.

5, Orthonormal matrices and rotation of vectors

In this section I am gong to explain orthonormal matrices, as known as rotation matrices. If a D\times D matrix U is an orthonormal matrix, column vectors of U are orthonormal, which means U = (\boldsymbol{u}_1 \dots \boldsymbol{u}_D), where \begin{cases} \boldsymbol{u}_{i}^{T}\boldsymbol{u}_{j} = 1 \quad (i = j) \\ \boldsymbol{u}_{i}^{T}\boldsymbol{u}_{j} = 0 \quad (i\neq j) \end{cases}. In other words column vectors \boldsymbol{u}_{i} form an orthonormal coordinate system.

Orthonormal matrices U have several important properties, and one of the most important properties is U^{-1} = U^{T}. Combining this fact with what I have told you so far, you we can reach one conclusion: you can orthogonalize a real symmetric matrix A as U^{T}AU = \Lambda. This is known as spectral decomposition or singular value decomposition.

Another important property of U is that U^{T} is also orthonormal. In other words, assume U is orthonormal and that U = (\boldsymbol{u}_1 \dots \boldsymbol{u}_D) = \begin{pmatrix} -\boldsymbol{v_1}^{T}- \\ \vdots \\ -\boldsymbol{v_D}^{T}- \end{pmatrix}, (\boldsymbol{v}_1 \dots \boldsymbol{v}_D) also forms a orthonormal coordinate system.

…It seems things are getting too mathematical and abstract (for me), thus for now I am going to wrap up what I have explained in this article .

We have seen

  • Numerical matrices linearly transform vectors.
  • Certain linear transformations do not change the direction of vectors in certain directions, which are called eigen vectors.
  • Making use of eigen vectors, you can form new coordinate system which can describe the linear transformations in a more straightforward way.
  • You can diagonalize a real symmetric matrix A with an orthonormal matrix U.

Of our current interest is what kind of linear transformation the real symmetric positive definite matrix enables. I am going to explain why the purple vectors in the figure above is swirling in the upcoming articles. Before that, however, we are going to  see one application of what we have seen in this article, on dimension reduction. To be concrete the next article is going to be about principal component analysis (PCA), which is very important in many fields.

*In short, the orthonormal matrix U, which I mentioned above enables rotation of matrix, and the diagonal matrix diag(\lambda_1, \dots, \lambda_D) expands or contracts vectors along each axis. I am going to explain that more precisely in the upcoming articles.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

*I attatched the codes I used to make the figures in this article. You can just copy, paste, and run, sometimes installing necessary libraries.


Simple RNN

Simple RNN: the first foothold for understanding LSTM

*In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

In the last article, I mentioned “When it comes to the structure of RNN, many study materials try to avoid showing that RNNs are also connections of neurons, as well as DCL or CNN.” Even if you manage to understand DCL and CNN, you can be suddenly left behind once you try to understand RNN because it looks like a different field. In the second section of this article, I am going to provide a some helps for more abstract understandings of DCL/CNN , which you need when you read most other study materials.

My explanation on this simple RNN is based on a chapter in a textbook published by Massachusetts Institute of Technology, which is also recommended in some deep learning courses of Stanford University.

First of all, you should keep it in mind that simple RNN are not useful in many cases, mainly because of vanishing/exploding gradient problem, which I am going to explain in the next article. LSTM is one major type of RNN used for tackling those problems. But without clear understanding forward/back propagation of RNN, I think many people would get stuck when they try to understand how LSTM works, especially during its back propagation stage. If you have tried climbing the mountain of understanding LSTM, but found yourself having to retreat back to the foot, I suggest that you read through this article on simple RNNs. It should help you to gain a solid foothold, and you would be ready for trying to climb the mountain again.

*This article is the second article of “A gentle introduction to the tiresome part of understanding RNN.”

1, A brief review on back propagation of DCL.

Simple RNNs are straightforward applications of DCL, but if you do not even have any ideas on DCL forward/back propagation, you will not be able to understand this article. If you more or less understand how back propagation of DCL works, you can skip this first section.

Deep learning is a part of machine learning. And most importantly, whether it is classical machine learning or deep learning, adjusting parameters is what machine learning is all about. Parameters mean elements of functions except for variants. For example when you get a very simple function f(x)=a + bx + cx^2 + dx^3, then x is a variant, and a, b, c, d are parameters. In case of classical machine learning algorithms, the number of those parameters are very limited because they were originally designed manually. Such functions for classical machine learning is useful for features found by humans, after trial and errors(feature engineering is a field of finding such effective features, manually). You adjust those parameters based on how different the outputs(estimated outcome of classification/regression) are from supervising vectors(the data prepared to show ideal answers).

In the last article I said neural networks are just mappings, whose inputs are vectors, matrices, or sequence data. In case of DCLs, inputs are vectors. Then what’s the number of parameters ? The answer depends on the the number of neurons and layers. In the example of DCL at the right side, the number of the connections of the neurons is the number of parameters(Would you like to try to count them? At least I would say “No.”). Unlike classical machine learning you no longer need to do feature engineering, but instead you need to design networks effective for each task and adjust a lot of parameters.

*I think the hype of AI comes from the fact that neural networks find features automatically. But the reality is difficulty of feature engineering was just replaced by difficulty of designing proper neural networks.

It is easy to imagine that you need an efficient way to adjust those parameters, and the method is called back propagation (or just backprop). As long as it is about DCL backprop, you can find a lot of well-made study materials on that, so I am not going to cover that topic precisely in this article series. Simply putting, during back propagation, in order to adjust parameters of a layer you need errors in the next layer. And in order calculate the errors of the next layer, you need errors in the next next layer.

*You should not think too much about what the “errors” exactly mean. Such “errors” are defined in this context, and you will see why you need them if you actually write down all the mathematical equations behind backprops of DCL.

The red arrows in the figure shows how errors of all the neurons in a layer propagate backward to a neuron in last layer. The figure shows only some sets of such errors propagating backward, but in practice you have to think about all the combinations of such red arrows in the whole back propagation(this link would give you some ideas on how DCLs work).

These points are minimum prerequisites for continuing reading this  RNN this article. But if you are planning to understand RNN forward/back propagation at  an abstract/mathematical level that you can read academic papers,  I highly recommend you to actually write down all the equations of DCL backprop. And if possible you should try to implement backprop of three-layer DCL.

2, Forward propagation of simple RNN

*For better understandings of the second and third section, I recommend you to download an animated PowerPoint slide which I prepared. It should help you understand simple RNNs.

In fact the simple RNN which we are going to look at in this article has only three layers. From now on imagine that inputs of RNN come from the bottom and outputs go up. But RNNs have to keep information of earlier times steps during upcoming several time steps because as I mentioned in the last article RNNs are used for sequence data, the order of whose elements is important. In order to do that, information of the neurons in the middle layer of RNN propagate forward to the middle layer itself. Therefore in one time step of forward propagation of RNN, the input at the time step propagates forward as normal DCL, and the RNN gives out an output at the time step. And information of one neuron in the middle layer propagate forward to the other neurons like yellow arrows in the figure. And the information in the next neuron propagate forward to the other neurons, and this process is repeated. This is called recurrent connections of RNN.

*To be exact we are just looking at a type of recurrent connections. For example Elman RNNs have simpler recurrent connections. And recurrent connections of LSTM are more complicated.

Whether it is a simple one or not, basically RNN repeats this process of getting an input at every time step, giving out an output, and making recurrent connections to the RNN itself. But you need to keep the values of activated neurons at every time step, so virtually you need to consider the same RNNs duplicated for several time steps like the figure below. This is the idea of unfolding RNN. Depending on contexts, the whole unfolded DCLs with recurrent connections is also called an RNN.

In many situations, RNNs are simplified as below. If you have read through this article until this point, I bet you gained some better understanding of RNNs, so you should little by little get used to this more abstract, blackboxed  way of showing RNN.

You have seen that you can unfold an RNN, per time step. From now on I am going to show the simple RNN in a simpler way,  based on the MIT textbook which I recomment. The figure below shows how RNN propagate forward during two time steps (t-1), (t).

The input \boldsymbol{x}^{(t-1)}at time step(t-1) propagate forward as a normal DCL, and gives out the output \hat{\boldsymbol{y}} ^{(t)} (The notation on the \boldsymbol{y} ^{(t)} is called “hat,” and it means that the value is an estimated value. Whatever machine learning tasks you work on, the outputs of the functions are just estimations of ideal outcomes. You need to adjust parameters for better estimations. You should always be careful whether it is an actual value or an estimated value in the context of machine learning or statistics). But the most important parts are the middle layers.

*To be exact I should have drawn the middle layers as connections of two layers of neurons like the figure at the right side. But I made my figure closer to the chart in the MIT textbook, and also most other study materials show the combinations of the two neurons before/after activation as one neuron.

\boldsymbol{a}^{(t)} is just linear summations of \boldsymbol{x}^{(t)} (If you do not know what “linear summations” mean, please scroll this page a bit), and \boldsymbol{h}^{(t)} is a combination of activated values of \boldsymbol{a}^{(t)} and linear summations of \boldsymbol{h}^{(t-1)} from the last time step, with recurrent connections. The values of \boldsymbol{h}^{(t)} propagate forward in two ways. One is normal DCL forward propagation to \hat{\boldsymbol{y}} ^{(t)} and \boldsymbol{o}^{(t)}, and the other is recurrent connections to \boldsymbol{h}^{(t+1)} .

These are equations for each step of forward propagation.

  • \boldsymbol{a}^{(t)} = \boldsymbol{b} + \boldsymbol{W} \cdot \boldsymbol{h}^{(t-1)} + \boldsymbol{U} \cdot \boldsymbol{x}^{(t)}
  • \boldsymbol{h}^{(t)}= g(\boldsymbol{a}^{(t)})
  • \boldsymbol{o}^{(t)} = \boldsymbol{c} + \boldsymbol{V} \cdot \boldsymbol{h}^{(t)}
  • \hat{\boldsymbol{y}} ^{(t)} = f(\boldsymbol{o}^{(t)})

*Please forgive me for adding some mathematical equations on this article even though I pledged not to in the first article. You can skip the them, but for some people it is on the contrary more confusing if there are no equations. In case you are allergic to mathematics, I prescribed some treatments below.

*Linear summation is a type of weighted summation of some elements. Concretely, when you have a vector \boldsymbol{x}=(x_0, x_1, x_2), and weights \boldsymbol{w}=(w_0,w_1, w_2), then \boldsymbol{w}^T \cdot \boldsymbol{x} = w_0 \cdot x_0 + w_1 \cdot x_1 +w_2 \cdot x_2 is a linear summation of \boldsymbol{x}, and its weights are \boldsymbol{w}.

*When you see a product of a matrix and a vector, for example a product of \boldsymbol{W} and \boldsymbol{v}, you should clearly make an image of connections between two layers of a neural network. You can also say each element of \boldsymbol{u}} is a linear summations all the elements of \boldsymbol{v}} , and \boldsymbol{W} gives the weights for the summations.

A very important point is that you share the same parameters, in this case \boldsymbol{\theta \in \{\boldsymbol{U}, \boldsymbol{W}, \boldsymbol{b}, \boldsymbol{V}, \boldsymbol{c} \}}, at every time step. 

And you are likely to see this RNN in this blackboxed form.

3, The steps of back propagation of simple RNN

In the last article, I said “I have to say backprop of RNN, especially LSTM (a useful and mainstream type or RNN), is a monster of chain rules.” I did my best to make my PowerPoint on LSTM backprop straightforward. But looking at it again, the LSTM backprop part still looks like an electronic circuit, and it requires some patience from you to understand it. If you want to understand LSTM at a more mathematical level, understanding the flow of simple RNN backprop is indispensable, so I would like you to be patient while understanding this step (and you have to be even more patient while understanding LSTM backprop).

This might be a matter of my literacy, but explanations on RNN backprop are very frustrating for me in the points below.

  • Most explanations just show how to calculate gradients at each time step.
  • Most study materials are visually very poor.
  • Most explanations just emphasize that “errors are back propagating through time,” using tons of arrows, but they lack concrete instructions on how actually you renew parameters with those errors.

If you can relate to the feelings I mentioned above, the instructions from now on could somewhat help you. And with the animated PowerPoint slide I prepared, you would have clear understandings on this topic at a more mathematical level.

Backprop of RNN , as long as you are thinking about simple RNNs, is not so different from that of DCLs. But you have to be careful about the meaning of errors in the context of RNN backprop. Back propagation through time (BPTT) is one of the major methods for RNN backprop, and I am sure most textbooks explain BPTT. But most study materials just emphasize that you need errors from all the time steps, and I think that is very misleading and confusing.

You need all the gradients to adjust parameters, but you do not necessarily need all the errors to calculate those gradients. Gradients in the context of machine learning mean partial derivatives of error functions (in this case J) with respect to certain parameters, and mathematically a gradient of J with respect to \boldsymbol{\theta \in \{\boldsymbol{U}, \boldsymbol{W}, \boldsymbol{b}^{(t)}, \boldsymbol{V}, \boldsymbol{c} \}}is denoted as ( \frac{\partial J}{\partial \boldsymbol{\theta}}  ). And another confusing point in many textbooks, including the MIT one, is that they give an impression that parameters depend on time steps. For example some study materials use notations like \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}, and I think this gives an impression that this is a gradient with respect to the parameters at time step (t). In my opinion this gradient rather should be written as ( \frac{\partial J}{\partial \boldsymbol{\theta}} )^{(t)} . But many study materials denote gradients of those errors in the former way, so from now on let me use the notations which you can see in the figures in this article.

In order to calculate the gradient \frac{\partial J}{\partial \boldsymbol{x}^{(t)}} you need errors from time steps s (s \geq t) \quad (as you can see in the figure, in order to calculate a gradient in a colored frame, you need all the errors in the same color).

*To be exact, in the figure above I am supposed prepare much more arrows in \tau + 1 different colors  to show the whole process of RNN backprop, but that is not realistic. In the figure I displayed only the flows of errors necessary for calculating each gradient at time step 0, t, \tau.

*Another confusing point is that the \frac{\partial J}{\partial \boldsymbol{\ast ^{(t)}}}, \boldsymbol{\ast} \in \{\boldsymbol{a}^{(t)}, \boldsymbol{h}^{(t)}, \boldsymbol{o}^{(t)}, \dots \} are correct notations, because \boldsymbol{\ast} are values of neurons after forward propagation. They depend on time steps, and these are very values which I have been calling “errors.” That is why parameters do not depend on time steps, whereas errors depend on time steps.

As I mentioned before, you share the same parameters at every time step. Again, please do not assume that parameters are different from time step to time step. It is gradients/errors (you need errors to calculate gradients) which depend on time step. And after calculating errors at every time step, you can finally adjust parameters one time, and that’s why this is called “back propagation through time.” (It is easy to imagine that this method can be very inefficient. If the input is the whole text on a Wikipedia link, you need to input all the sentences in the Wikipedia text to renew parameters one time. To solve this problem there is a backprop method named “truncated BPTT,” with which you renew parameters based on a part of a text. )

And after calculating those gradients \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} you can take a summation of them: \frac{\partial J}{\partial \boldsymbol{\theta}}=\sum_{t=0}^{t=\tau}{\frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}}. With this gradient \frac{\partial J}{\partial \boldsymbol{\theta}} , you can finally renew the value of \boldsymbol{\theta} one time.

At the beginning of this article I mentioned that simple RNNs are no longer for practical uses, and that comes from exploding/vanishing problem of RNN. This problem was one of the reasons for the AI winter which lasted for some 20 years. In the next article I am going to write about LSTM, a fancier type of RNN, in the context of a history of neural network history.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Funktionsweise künstlicher neuronaler Netze

Künstliche neuronale Netze sind ein Spezialbereich des maschinellen Lernens, der sogar einen eigenen Trendbegriff hat: Deep Learning.
Doch wie funktioniert ein künstliches neuronales Netz überhaupt? Und wie wird es in Python realisiert? Dies ist Artikel 2 von 6 der Artikelserie –Einstieg in Deep Learning.

Gleich vorweg, wir beschränken uns hier auf die künstlichen neuronalen Netze des überwachten maschinellen Lernens. Dafür ist es wichtig, dass das Prinzip des Trainings und Testens von überwachten Verfahren verstanden ist. Künstliche neuronale Netze können aber auch zur unüberwachten Dimensionsreduktion und zum Clustering eingesetzt werden. Das bekannteste Verfahren ist das AE-Net (Auto Encoder Network), das hier aus der Betrachtung herausgenommen wird.

Beginnen wir mit einfach künstlichen neuronalen Netzen, die alle auf dem Perzeptron als Kernidee beruhen. Das Vorbild für künstliche neuronale Netze sind natürliche neuronale Netze, wie Sie im menschlichen Gehirn zu finden sind.


Das Perzeptron (engl. Perceptron) ist ein „Klassiker“ unter den künstlichen neuronalen Netzen. Wenn von einem neuronalen Netz gesprochen wird, ist meistens ein Perzeptron oder eine Variation davon gemeint. Perzeptrons sind mehrschichtige Netze ohne Rückkopplung, mit festen Eingabe- und Ausgabeschichten. Es gibt keine absolut einheitliche Definition eines Perzeptrons, in der Regel ist es jedoch ein reines FeedForward-Netz mit einer Input-Schicht (auch Abtast-Schicht oder Retina genannt) mit statisch oder dynamisch gewichteten Verbindungen zur Ausgabe-Schicht, die (als Single-Layer-Perceptron) aus einem einzigen Neuron besteht. Das eine Neuron setzt sich aus zwei mathematischen Funktionen zusammen: Einer Berechnung der Nettoeingabe und einer Aktivierungsfunktion, die darüber entscheidet, ob die berechnete Nettoeingabe im Brutto nun “feuert” oder nicht. Es ist in seiner Ausgabe folglich binär: Man kann es sich auch als kleines Lämpchen vorstellen, so dass abhängig von den Eingabewerten und den Gewichtungen eine Nettoeingabe (Summe) bildet und eine Sprungfunktion darüber entscheidet, ob am Ende das Lämpchen leuchtet oder nicht. Dieses Konzept der Ausgabeerzeugung wird Forward-Propagation genannt.


Auch wenn “Netz” für ein einzelnes Perzeptron mit seinem einen Neuron etwas übertrieben wirken mag, ist es doch die Grundlage für viele größere und mehrschichtige Netze.

Betrachten wir nun die Mathematik der Forward-Propagation.

Wir haben eine Menge an Eingabewerten x_0, x_1 \dots x_n. Wobei für x_0 als Bias-Input stets gilt: x_0 = 1,0. Der Bias-Input ist nur ein Platzhalter für das wichtige Bias-Gewicht.

    \[ x = \begin{bmatrix} x_0\\ x_1\\ x_2\\ x_3\\ \vdots\\ x_n \end{bmatrix} \]

Für jede Eingabevariable wird eine Gewichtsvariable benötigt: w_0, w_1 \dots w_n

    \[ w = \begin{bmatrix} w_0\\ w_1\\ w_2\\ w_3\\ \vdots\\ w_n \end{bmatrix} \]

Jedes Produkt aus Eingabewert und Gewichtung soll in Summe die Nettoeingabe z bilden. Hier zeigt sich z als lineare mathematische Funktion, die zwei-dimensional leicht als z = w_0 + w_1 \cdot x_1 mit w_0 als Y-Achsenschnitt wenn x_1 = 0.

    \[ z = w_0 \cdot x_0 + w_1 \cdot x_1 + \dots + w_n \cdot x_n \]

Die lineare Funktion wird nur durch die Sprungfunktion als sogenannte Aktivierungsfunktion zu einer binären Klasseneinteilung (siehe hierzu: Machine Learning – Regression vs Klassifikation), denn wenn z einen festzulegenden Schwellwert \theta überschreitet, liefert die Sprungfunktion \phi mit der Eingabe z einen anderen Wert als wenn dieser Schwellwert nicht überschritten wird.

(1)   \begin{equation*} \phi(z) = \begin{cases} 1 & \text{wenn } z \le \theta \\ -1 & \text{wenn } z < \theta \\ \end{cases} \end{equation*}

Die Definition dieser Aktivierungsfunktion ist der Kern der Klassifikation und viele erweiterte künstliche neuronale Netze unterscheiden sich im Wesentlichen vom Perzeptron dadurch, dass die Aktivierungsfunktion komplexer ist, als eine reine Sprungfunktion, beispielsweise als Sigmoid-Funktion (basierend auf der logistischen Funktion) oder die Tangens hyperbolicus (tanh) -Funktion. Mehr darüber dann im nächsten Artikel dieser Artikelserie, bleiben wir also bei der einfachen Sprungfunktion.

Künstliche neuronale Netze sind im Grunde nichts anderes als viel-dimensionale, mathematische Funktionen, die durch Schaltung als Neuronen nebeneinander (Neuronen einer Schicht) und hintereinander (mehrere Schichten) eine enorme Komplexität erfassen können. Die Gewichtungen sind dabei die Stellschraube, die die Form der mathematischen Funktion gestaltet, aus Geraden und Kurven, um eine Punktwolke zu beschreiben (Regression) oder um Klassengrenzen zu identifizieren (Klassifikation).

Eine andere Sichtweise auf künstliche neuronale ist die des Filters: Ein künstliches neuronales Netz nimmt alle Eingabe-Variablen entgegen (z. B. alle Pixel eines Bildes) und über ein Training werden die Gewichtungen (die Form des Filters) so gestaltet, dass der Filter immer zu richtigen Klasse (im Kontext der Bildklassifikation: die Objektklasse) führt.

Kommen wir nochmal kurz zurück zu der Berechnung der Nettoeingabe z. Da diese Schreibweise…

    \[ z = w_0 \cdot x_0 + w_1 \cdot x_1 + \dots + w_n \cdot x_n \]

… recht anstrengend ist, schreiben Fortgeschrittene der linearen Algebra lieber z = w^T \cdot x.

    \[ z = w^T \cdot x \]

Das hochgestellte T steht dabei für transponieren. Transponieren bedeutet, dass Spalten zu Zeilen werden – oder umgekehrt.

Beispielsweise befüllen wir zwei Vektoren x und w mit beispielhaften Inhalten:


    \[ x = \begin{bmatrix} 5\\ 12\\ 30\\ 2 \end{bmatrix} \]


    \[ w = \begin{bmatrix} 1\\ 2\\ 5\\ 12 \end{bmatrix} \]

Kann nun die Nettoeingabe z berechnet werden, denn der Gewichtungsvektor wird vom Spaltenvektor zum Zeilenvektor. So kann – mathematisch korrekt dargestellt – jedes Element des einen Vektors mit dem zugehörigen Element des anderen Vektors multipliziert werden, die dabei entstehenden Ergebniswerte werden summiert.

    \[ z = w^T \cdot x = \big[1\text{ }2\text{ }5\text{ }12\big] \cdot \begin{bmatrix} 5\\ 12\\ 30\\ 2 \end{bmatrix} = 1 \cdot 5 + 2 \cdot 12 + 5 \cdot 30 + 12 \cdot 2 = 203 \]

Zurück zur eigentlichen Aufgabe des künstlichen neuronalen Netzes: Klassifikation! (Regression, Clustering und Dimensionsreduktion blenden wir ja in diesem Artikel als Aufgabe aus 🙂

Das Perzeptron soll zwei Klassen trennen. Dafür sollen alle Eingaben richtig gewichtet werden, so dass die entstehende Nettoeingabe z die Sprungfunktion dann aktiviert, wenn der Datensatz nicht für die eine, sondern für die andere Klasse ausweist.

Da wir es mit einer linearen Funktion z zutun haben, ist die Konvergenz (= Passgenauigkeit des Models mit der Realität) eines Single-Layer-Perzeptrons nur für lineare Trennbarkeit möglich!

Training des Perzeptron-Netzes

Die Aufgabe ist nun, die richtigen Gewichte zu finden – und nicht nur irgendwelche richtigen, sondern genau die optimalen. Die Frage, die sich für jedes künstliche neuronale Netz stellt, ist die nach den richtigen Gewichtungen. Das Training eines Perzeptron ist vergleichsweise einfach, gerade weil es binär ist. Denn binär bedeutet auch, dass wenn eine falsche Antwort gegeben wurde, muss das jeweils andere mögliche Ergebnis korrekt sein.

Das Training eines Perzeptrons funktioniert wie folgt:

  1. Setze alle Gewichtungen auf den Wert 0,00
  2. Mit jedem Datensatz des Trainings
    1. Berechne den Ausgabewert \^{y}
    2. Vergleiche den Ausgabewert \^{y} mit dem tatsächlichen Ergebnis y
    3. Aktualisiere die Gewichtungen entgegen des Fehlers: w_i = w_i + \Delta w_i

Wobei die Gewichtsanpassung \Delta w_i entgegen des Fehlers (bzw. hin zur jeweils anderen möglichen Antwort) geschieht:

\Delta w_i = (\^{y}_j - y_j ) \cdot x_i

Anmerkung für die Experten: Die Schrittweite \eta blenden wir hier einfach mal aus. Bitte einfach von \eta = 1.0 ausgehen.

\Delta w_i ist die Differenz aus der Prädiktion und dem tatsächlichen Ergebnis (Klasse). Alle Gewichtungen werden mit jedem Fehler gleichzeitig aktualisiert. Sind alle Gewichtungen aktualisiert, kommt der nächste Durchlauf (erneuter Vergleich zwischen \^{y} und y), nicht zu vergessen ist dabei natürlich die Abhängigkeit von den Eingabewerten x:

\Delta w_0 = (\^{y}_j - y_j ) \cdot x_0

\Delta w_2 = (\^{y}_j - y_j ) \cdot x_1

\Delta w_2 = (\^{y}_j - y_j) \cdot x_2

\Delta w_n = (\^{y}_j - y_j) \cdot x_n

Training eines Perzeptrons

Das Training im überwachten Lernen basiert immer auf der Idee, den Ausgabe-Fehler (die Differenz zwischen Prädiktion und tatsächlich korrektem Ergebnis) zu betrachten und die Klassifikationslogik an den richtigen Stellschrauben (bei neuronalen Netzen sind das die Gewichtungen) entgegen des Fehlers anzupassen.

Richtige Klassifikations-Situationen können True-Positives und True-Negatives darstellen, die zu keiner Gewichtsanpassung führen sollen:

True-Positive -> Klassifikation: 1 | korrekte Klasse: 1

\Delta w_i = (\^{y}_j - y_j) \cdot x_i = (1 - 1) \cdot x_i = 0

True-Negative-> Klassifikation: -1 | korrekte Klasse: -1

\Delta w_i = (\^{y}_j - y_j) \cdot x_i = (-1 - -1) \cdot x_i = 0

Falsche Klassifikationen erzeugen einen Fehler, der zu einer Gewichtsanpassung entgegen des Fehlers führen soll:

False-Positive -> Klassifikation: 1 | korrekte Klasse: -1

\Delta w_i = (\^{y}_j - y_j) \cdot x_i = (1 - -1) \cdot x_i = 2 \cdot x_i

False-Negative -> Klassifikation: -1 | korrekte Klasse: 1

\Delta w_i = (\^{y}_j - y_j) \cdot x_i = (-1 - 1) \cdot x_i = -2 \cdot x_i

Imaginäres Trainingsbeispiel eines Single-Layer-Perzeptrons (SLP)

Nehmen wir an, dass x_1 = 0,5 ist und das SLP irrtümlicherweise die Klasse \^{y_1} = -1 ausgewiesen hat, obwohl die korrekte Klasse y_1 = +1 wäre. (Und die Schrittweite lassen wir bei \eta = 1,0)

Dann passiert folgendes:

\Delta w_1 = (\^{y}_1 - y_1) \cdot x_1 = (-1 - 1) \cdot 0,5 = -2,0 \cdot 0,5 = -1,0

Die Gewichtung w_1 verringert sich entsprechend w_1 = w_1 + \Delta w_1 = w_1 - 1,0 und somit wird die Wahrscheinlichkeit größer, dass wenn bei der nächsten Iteration (j=1) wieder die Klasse +1 korrekt sei,  den Schwellwert \phi(z) zu unterschreiten und auf eben diese korrekte Klasse zu stoßen.

Die Aktualisierung der Gewichtung \Delta w_i ist proportional zu x_i. So würde beispielsweise ein neues x_1=2,0 (bei Iteration j=2) zu einer irrtümlichen Klassifikation \^(y_2) = -1 (y_2 = +1) führen, würde die Entscheidungsgrenze zur korrekten Prädiktion der Klasse beim nächsten Durchlauf (j = 3) an w_1 noch weiter in die gleiche Richtung verschoben werden:

\Delta w_1 = (\^{y}_2 - y_2) \cdot x_1 = (-1 - 1) \cdot 2,0 = -2,0 \cdot 2,0 = -4,0

Mehr zum Training von künstlichen neuronalen Netzen ist im nächsten Artikel dieser Artikelserie zu erfahren.

Single-Layer-Perzeptrons (SLP) – Beispiel mit der boolischen Trennung

Verlassen wir nun das Training des Perzeptrons und gehen einfach mal davon aus, dass die idealen Gewichte schon gefunden wurden und schauen uns nun an, was ein Perzeptron alles (nicht) kann. Denn nicht vergessen, es soll eigentlich Klassen unterscheiden bzw. die dafür nötigen Entscheidungsgrenzen finden.

Boolische Operatoren unterscheiden Fälle nach boolischen Werten. Sie sind ein beliebtes “Hello World” für die Einarbeitung in die lineare Entscheidungslogik eines Perzeptrons. Es gibt drei grundlegende boolische Vergleichsoperatoren: AND, OR und XOR

  x1     x2   AND OR XOR
0 0 0 0 0
0 1 0 1 1
1 0 0 1 1
1 1 1 1 0

Ein Perzeptron zur Lösung dieser Aufgabe bräuchte also zwei Dimensionen (+ Bias): x_1 und x_2
Und es müsste Gewichtungen haben, die dafür sorgen, dass die Vorhersage entsprechend der Logik AND, OR oder XOR mit \^{y} = \phi(z) = \phi (w_0 \cdot 1 + w_1 \cdot x_1 + w_2 \cdot x_2) funktioniert.

Dabei ist es wichtig, dass wir auch phi \phi als Sprungfunktion definieren. Sie könnte beispielsweise so aussehen, dass sie auf den Wert \phi(z) = 1 springt, wenn z > 0 ist, ansonsten aber \phi(z) = 0 bleibt.

Das Netz und die Gewichtungen (w-Setup) könnten für die AND- und die OR-Logik so aussehen:

Die Gewichtungen funktionieren beim SLP problemlos, denn wir haben es mit linear trennbaren Problemen zutun:

Kleiner Test gefällig? So nehmen wir uns erstmal die AND-Logik vor:

  • Wenn x1 = 0 und x2 = 0 ist, gilt: z = -1,5 \cdot 1 + 1 \cdot 0 + 1 \cdot 0 = - 1,5,
    wie erhalten als Prädiktion \phi(z) = \phi(-1,5) = 0
  • Wenn x1 = 1 und x2 = 0 ist, gilt: z = -1,5 \cdot 1 + 1 \cdot 1 + 1 \cdot 0 = - 0,5,
    wie erhalten als Prädiktion \phi(z) = \phi(-0,5) = 0
  • Wenn x1 = 1 und x2 = 1 ist, gilt: z = -1,5 \cdot 1 + 1 \cdot 1 + 1 \cdot 1 = + 0,5,
    wie erhalten als Prädiktion \phi(z) = \phi(0,5) = 1

Scheint zu funktionieren!

Und dann die OR-Logik mit

  • Wenn x1 = 0 und x2 = 0 ist, gilt: z = -0,5 \cdot 1 + 1 \cdot 0 + 1 \cdot 0 = - 0,5,
    wie erhalten als Prädiktion \phi(z) = \phi(-0,5) = 0
  • Wenn x1 = 1 und x2 = 0 ist, gilt: z = -0,5 \cdot 1 + 1 \cdot 1 + 1 \cdot 0 = + 0,5,
    wie erhalten als Prädiktion \phi(z) = \phi(0,5) = 1
  • Wenn x1 = 1 und x2 = 1 ist, gilt: z = -0,5 \cdot 1 + 1 \cdot 1 + 1 \cdot 1 = + 1,5,
    wie erhalten als Prädiktion \phi(z) = \phi(1,5) = 1

Super! Jedoch stellt sich nun die Frage, wie das XOR-Problem zu lösen ist, denn das bedingt sowohl die Grenzen von AND als auch jene des OR-Operators.

Multi-Layer-Perzeptron (MLP) bzw. (Deep) Feed Forward (FF) Net

Denn ein XOR kann mathematisch auch so korrekt beschrieben werden: x_1 \text{ xor } x_2 = (x_1 \text{ and } \neg x_2) \text{ or } (\neg x_1 \text{ and } x_2)

Testen wir es aus!

  • Wenn x1 = 0 und x2 = 0 ist, gilt:
    z_1 = w_{10} \cdot 1 + w_{11} \cdot x1 + w_{12} \cdot  x2 = -0.5 \cdot 1 + 1,0 \cdot 0 - 1,0 \cdot 0 = -0,5 und somit \phi(z_1) = \phi(-0,5) = 0
    z_2 = w_{20} \cdot 1 + w_{21} \cdot x1 + w_{22} \cdot  x2 = -0.5 \cdot 1 - 1,0 \cdot 0 + 1,0 \cdot 0 = -0,5 und somit \phi(z_2) = \phi(-0,5) = 0
    z_3 = w_{30} \cdot 1 + w_{31} \cdot \phi(z_1) + w_{32} \cdot \phi(z_2) = -0,5 \cdot 1 + 1,0 \cdot 0 + 1,0 \cdot 0 = -0,5 und somit \phi(z_3) = \phi(-0,5) = 0
  • Wenn x1 = 1 und x2 = 0 ist, gilt:
    z_1 = w_{10} \cdot 1 + w_{11} \cdot x1 + w_{12} \cdot  x2 = -0.5 \cdot 1 + 1,0 \cdot 1 - 1,0 \cdot 0 = 0,5 und somit \phi(z_1) = \phi(0,5) = 1
    z_2 = w_{20} \cdot 1 + w_{21} \cdot x1 + w_{22} \cdot  x2 = -0.5 \cdot 1 - 1,0 \cdot 1 + 1,0 \cdot 0 = -1,5 und somit \phi(z_2) = \phi(-1,5) = 0
    z_3 = w_{30} \cdot 1 + w_{31} \cdot \phi(z_1) + w_{32} \cdot \phi(z_2) = -0,5 \cdot 1 + 1,0 \cdot 1 + 1,0 \cdot 0 = 0,5 und somit \phi(z_3) = \phi(0,5) = 1
  • Wenn x1 = 0 und x2 = 1 ist, gilt:
    z_1 = w_{10} \cdot 1 + w_{11} \cdot x1 + w_{12} \cdot  x2 = -0.5 \cdot 1 + 1,0 \cdot 0 - 1,0 \cdot 1 = -1,5 und somit \phi(z_1) = \phi(-1,5) = 0
    z_2 = w_{20} \cdot 1 + w_{21} \cdot x1 + w_{22} \cdot  x2 = -0.5 \cdot 1 - 1,0 \cdot 0 + 1,0 \cdot 1 = 0,5 und somit \phi(z_2) = \phi(0,5) = 1
    z_3 = w_{30} \cdot 1 + w_{31} \cdot \phi(z_1) + w_{32} \cdot \phi(z_2) = -0,5 \cdot 1 + 1,0 \cdot 0 + 1,0 \cdot 1 = 0,5 und somit \phi(z_3) = \phi(0,5) = 1
  • Wenn x1 = 1 und x2 = 1 ist, gilt:
    z_1 = w_{10} \cdot 1 + w_{11} \cdot x1 + w_{12} \cdot  x2 = -0.5 \cdot 1 + 1,0 \cdot 1 - 1,0 \cdot 1 = -1,5 und somit \phi(z_1) = \phi(-0,5) = 0
    z_2 = w_{20} \cdot 1 + w_{21} \cdot x1 + w_{22} \cdot  x2 = -0.5 \cdot 1 - 1,0 \cdot 1 + 1,0 \cdot 1 = 0,5 und somit \phi(z_2) = \phi(-0,5) = 0
    z_3 = w_{30} \cdot 1 + w_{31} \cdot \phi(z_1) + w_{32} \cdot \phi(z_2) = -0,5 \cdot 1 + 1,0 \cdot 0 + 1,0 \cdot 0 = -0,5 und somit \phi(z_3) = \phi(-0,5) = 0

Es funktioniert!

Mehrfachklassifikation mit dem Perzeptron

Ein Perzeptron-Netz klassifiziert binär, die Ausgabe beschränkt sich auf 1 oder -1 bzw. 0 oder 1.

Jedoch wird in der Praxis oftmals eine One-vs-All (OvA) bzw. One-vs-Rest (OvR) Klassifikation implementiert. In diesem Fall steht die 1 für die Erkennung einer konkreten Klasse, während alle anderen übrigen Klassen als negativ betrachtet werden.

Um jede Klasse erkennen zu können, werden n Klassifizierer (= n Perzeptron-Netze) benötigt. Jedes Perzeptron-Netz ist auf die Erkennung einer bestimmten Klasse trainiert.

Adaline – Oder: die Limitation des Perzeptrons

Das Perzeptron wird nur über eine Sprungfunktion aktiviert. Das schränkt die Feinabstimmung des Trainings enorm ein. Besser sind Aktivierungen über stetige Funktionen, die dann nämlich differenzierbar (ableitbar) sind. Das ergibt eine konvexe Fehlerfunktion mit einem eindeutigen Minimum. Der Adaline-Algorithmus (ADAptive Linear NEuron) erweitert die Idee des Perzeptrons um genau diese Idee. Der wesentliche Fortschritt der Adaline-Regel gegenüber der des Perzeptrons ist demnach, dass die Aktualisierung der Gewichtungen nicht wie beim Perzeptron auf einer einfachen Sprungfunktion, sondern auf einer linearen, stetigen Aktivierungsfunktion beruht.


Wie ein künstliches neuronales Netz mit der Kategorie Adaline trainiert werden kann, wird im nächsten Artikel dieser Artikelserie erläutert.

Weiterführende Netz-Konzepte (CNN und RNN)

Wer bereits mit Frameworks wie TensorFlow in das Deep Learning eingestiegen ist, hat möglicherweise schon erweiterte Konzepte der künstlichen neuronalen Netze kennen gelernt. Die CNNs (Convolutional Neuronal Network) sind im Moment die Wahl für die Verarbeitung von hochdimensionalen Aufgaben, beispielsweise die Bilderkennung (Computer Vision) und Texterkennung (NLP). Das CNN erweitert die Möglichkeiten mit neuronalen Netzen deutlich, indem ein Netz zur Dimensionsreduktion vorgeschaltet wird, im Kern steckt jedoch weiterhin die Idee der MLPs. Beim Einsatz in der Bilderkennung funktionieren CNNs vereinfacht gesprochen so, dass der vorgeschaltete Netzbereich die Millionen Bildpixel sektorweise ausliest (Convolution, Faltung durch Auslesen über Sektoren, die sich gegenseitig überlappen), verdichtet (Pooling, beispielsweise über nicht-lineare Funktionen wie max()) und dann – nach diesem Prozedere – ähnlich eim MLP klassifiziert.


Eine andere erweiterte Form sind RNNs (Recurrent Neuronal Network), die ebenfalls auf der Idee des MLPs basieren, dieses Konzept jedoch dank Rückverbindungen (Neuronen senden an vorherige Schichten) und Selbstverbindungen (Neuronen senden an sich selbst) wiederum auf den Kopf stellen.


Dennoch ist es für das tiefere Verständnis von CNNs und RNNs essenziell, dass vorher das Konzept des MLPs verstanden ist. Es ist die einfachste Form der auch heute noch am meisten eingesetzten und sehr mächtigen Netz-Topologien.

Im Jahr 2016 hatte Fjodor van Veen von hatte – dankenswerterweise – mal eine Zusammenstellung von Netz-Topologien erstellt, auf die ich heute noch immer mal wieder einen Blick werfe:

Künstliche neuronale Netze – Topologie-Übersicht von Fjodor van Veen


Die folgenden Bücher nutze ich für mein Selbststudium von Machine Learning und Deep Learning und sind teilweise Gedankenvorlagen auch für diesen Artikel gewesen:


Machine Learning mit Python und Scikit-Learn und TensorFlow: Das umfassende Praxis-Handbuch für Data Science, Predictive Analytics und Deep Learning (mitp Professional) Deep Learning mit Python und Keras: Das Praxis-Handbuch vom Entwickler der Keras-Bibliothek(mitp Professional)


ID3-Algorithmus: Ein Rechenbeispiel

Dieser Artikel ist Teil 3 von 4 der Artikelserie Maschinelles Lernen mit Entscheidungsbaumverfahren und nun wollen wir einen Entscheidungsbaum aus Daten herleiten, jedoch ohne Programmierung, sondern direkt auf Papier (bzw. HTML :-).

Folgender Datensatz sei gegeben:

Zeile Kundenart Zahlungsgeschwindigkeit Kauffrequenz Herkunft Zahlungsmittel: Rechnung?
 1  Neukunde  niedrig  niedrig  Inland  false
 2  Neukunde  niedrig  niedrig  Ausland  false
 3  Stammkunde  niedrig  niedrig  Inland  true
 4  Normalkunde  mittel  niedrig  Inland  true
 5  Normalkunde  hoch  hoch  Inland  true
 6  Normalkunde  hoch  hoch  Ausland  false
 7  Stammkunde  hoch  hoch  Ausland  true
 8  Neukunde  mittel  niedrig  Inland  false
 9  Neukunde  hoch  hoch  Inland  true
 10  Normalkunde  mittel  hoch  Inland  true
 11  Neukunde  mittel  hoch  Ausland  true
 12  Stammkunde  mittel  niedrig  Ausland  true
 13  Stammkunde  niedrig  hoch  Inland  true
 14  Normalkunde  mittel  niedrig  Ausland  false

Gleich vorweg ein Disclaimer: Der Datensatz ist natürlich überaus klein, ja gerade zu winzig. Dafür würden wir in der Praxis niemals einen Machine Learning Algorithmus einsetzen. Dennoch bleiben wir besser übersichtlich und nachvollziehbar mit diesen 14 Zeilen. Das Lernziel dieser Übung ist es, ein Gefühl für die Erstellung von Entscheidungsbäumen zu erhalten.
Zu beachten ist ferner, dass dieser Datensatz bereits aggregiert ist, denn eigentlich nummerisch abbildbare Daten wurden in Klassen zusammengefasst.

Das Ziel:

Der Datensatz spielt wieder, welchem Kunden (ID) bisher die Zahlung per Rechnung erlaubt und nicht widerrufen wurde. Das Ziel soll sein, eine Vorhersage darüber zu machen zu können, wann ein Kunde per Rechnung zahlen darf und wann nicht (dann per Vorkasse).

Der Algorithmus:

Wir verwenden den ID3-Algorithmus in seiner Reinform. Der ID3-Algorithmus ist der gängigste Algorithmus zum Aufbau datengetriebener Entscheidungsbäume und es gibt mehrere Abwandlungen. Die Vorgehensweise des Algorithmus wird in dem Teil 2 der Artikelserie Entscheidungsbaum-Algorithmus ID3 erläutert.

1. Schritt: Auswählen des Attributes mit dem höchsten Informationsgewinn

Der Informationsgewinn eines Attributes (A) im Sinne des ID3-Algorithmus ist die Differenz aus der Entropie (E(S)) (siehe Teil 1 der Artikelserie Entropie, ein Maß für die Unreinheit in Daten) des gesamten Datensatzes (S) und der Summe aus den gewichteten Entropien des Attributes für jeden einzelnen Wert (Value i), der im Attribut vorkommt:
IG(S, A) = H(S) - \sum_{i=1}^n \frac{\bigl|S_i\bigl|}{\bigl|S\bigl|} \cdot H(S_i)

1.1 Gesamt-Entropie des Datensatzes berechnen

Erstmal schauen wir uns die Entropie des gesamten Datensatzes an. Die Entropie bezieht sich dabei auf das gewünschte Klassifikationsergebnis, also ist die Zahlung via Rechnung erlaubt oder nicht? Diese Frage wird entweder mit true oder false beantwortet.

H(S) = - \frac{9}{14} \cdot \log_2(\frac{9}{14}) - \frac{5}{14} \cdot \log_2(\frac{5}{14})  = 0.94

1.2 Berechnung der Informationsgewinne aller Attribute

Berechnen wir nun also die Informationsgewinne über alle Spalten.

Attribut Subset Count(true) Count(false)
Kundenart “Neukunde” 2 3
“Stammkunde” 4 0
“Normalkunde” 3 2

Wir zerlegen den gesamten Datensatz gedanklich in drei Kategorien der Kundenart und berechnen die Entropie bezogen auf das Klassifikationsziel:

H(S_{Neukunde}) = - \frac{2}{5} \cdot \log_2(\frac{2}{5}) - \frac{3}{5} \cdot \log_2(\frac{3}{5})  = 0.97

H(S_{Stammkunde}) = - \frac{4}{4} \cdot \log_2(\frac{4}{4}) - \frac{0}{4} \cdot \log_2(\frac{0}{4})  = 0.00

H(S_{Normalkunde}) = - \frac{3}{5} \cdot \log_2(\frac{3}{5}) - \frac{2}{5} \cdot \log_2(\frac{2}{5})  = 0.97

Zur Erinnerung, der Informationsgewinn (Information Gain) wird wie folgt berechnet:

    \[ IG(S, A_{Kundenart}) =  - \sum_{i=1}^n \frac{\bigl|S_i\bigl|}{\bigl|S\bigl|} \cdot H(S_i) \]

Angewendet auf das Attribut “Kundenart”…

    \[ IG(S, A_{Kundenart}) =  H(S) - \frac{\bigl|S_{Neukunde}\bigl|}{\bigl|S\bigl|} \cdot H(S_{Neukunde}) - \frac{\bigl|S_{Stammkunde}\bigl|}{\bigl|S\bigl|} \cdot H(S_{Stammkunde}) - \frac{\bigl|S_{Normalkunde}\bigl|}{\bigl|S\bigl|} \cdot H(S_{Normalkunde}) \]

… erhalten wir der Formal nach folgenden Informationsgewinn:

    \[ IG(S, A_{Kundenart}) =  0.94 - \frac{5}{14} \cdot 0.97 - \frac{4}{14} \cdot 0.00 - \frac{5}{14} \cdot 0.97 = 0.247 \]

Nun für die weiteren Spalten:

Attribut Subset Count(true) Count(false)
Zahlungsgeschwindigkeit “niedrig” 2 2
“mittel” 4 2
“schnell” 3 1

Entropien für die “Zahlungsgeschwindigkeit”:

H(S_{niedrig}) = - \frac{2}{4} \cdot \log_2(\frac{2}{4}) - \frac{2}{4} \cdot \log_2(\frac{2}{4})  = 1.00

H(S_{mittel}) = - \frac{4}{6} \cdot \log_2(\frac{4}{6}) - \frac{2}{6} \cdot \log_2(\frac{2}{6})  = 0.92

H(S_{schnell}) = - \frac{3}{4} \cdot \log_2(\frac{3}{4}) - \frac{1}{4} \cdot \log_2(\frac{1}{4})  = 0.81

So berechnen wir wieder den Informationsgewinn:

    \[ IG(S, A_{Zahlungsgeschwindigkeit}) =  H(S) - \frac{\bigl|S_{niedrig}\bigl|}{\bigl|S\bigl|} \cdot H(S_{niedrig}) - \frac{\bigl|S_{mittel}\bigl|}{\bigl|S\bigl|} \cdot H(S_{mittel}) - \frac{\bigl|S_{schnell}\bigl|}{\bigl|S\bigl|} \cdot H(S_{schnell}) \]

Einsatzen und ausrechnen:

    \[ IG(S, A_{Zahlungsgeschwindigkeit}) =  0.94 - \frac{4}{14} \cdot 1.00 - \frac{6}{14} \cdot 0.92 - \frac{4}{14} \cdot 0.81 = 0.029 \]

Und nun für die Spalte “Kauffrequenz”:

Attribut Subset Count(true) Count(false)
Kauffrequenz “niedrig” 3 4
“hoch” 6 1


H(S_{niedrig}) = - \frac{3}{7} \cdot \log_2(\frac{3}{7}) - \frac{4}{7} \cdot \log_2(\frac{4}{7})  = 0.99

H(S_{hoch}) = - \frac{6}{7} \cdot \log_2(\frac{6}{7}) - \frac{1}{7} \cdot \log_2(\frac{1}{7})  = 0.59


    \[ IG(S, A_{Kauffrequenz}) =  H(S) - \frac{\bigl|S_{niedrig}\bigl|}{\bigl|S\bigl|} \cdot H(S_{niedrig}) - \frac{\bigl|S_{hoch}\bigl|}{\bigl|S\bigl|} \cdot H(S_{hoch}) \]

Einsetzen und Ausrechnen:

    \[ IG(S, A_{Kauffrequenz}) =  0.94 - \frac{7}{14} \cdot 1.00 - \frac{7}{14} \cdot 0.59 = 0.150 \]

Und last but not least die Spalte “Herkunft”:

Attribut Subset Count(true) Count(false)
Herkunft “Inland” 6 2
“Ausland” 3 3


H(S_{Inland}) = - \frac{6}{8} \cdot \log_2(\frac{6}{8}) - \frac{2}{8} \cdot \log_2(\frac{2}{8})  = 0.81

H(S_{Ausland}) = - \frac{3}{6} \cdot \log_2(\frac{3}{6}) - \frac{3}{6} \cdot \log_2(\frac{3}{6})  = 1.00


    \[ IG(S, A_{Herkunft}) =  H(S) - \frac{\bigl|S_{Inland}\bigl|}{\bigl|S\bigl|} \cdot H(S_{Inland}) - \frac{\bigl|S_{Ausland}\bigl|}{\bigl|S\bigl|} \cdot H(S_{Ausland}) \]

Einsetzen und Ausrechnen:

    \[ IG(S, A_{Herkunft}) =  0.94 - \frac{8}{14} \cdot 0.81 - \frac{6}{14} \cdot 1.00 = 0.05 \]

2. Schritt: Anlegen des Wurzel-Knotens

Der Informationsgewinn ist für das Attribut “Kundenart” am größten, daher entscheiden wir uns im Sinne des ID3-Algorithmus für dieses Attribut als Wurzel-Knoten.

3. Schritt: Rekursive Wiederholung (!!!)

Nun stellt sich natürlich die Frage: Wie geht es weiter?

Der Algorithmus kann eigentlich nur eines: Einen Wurzelknoten finden. Diesen Vorgang müssen wir nun nur noch rekursiv wiederholen, und das tun wir wie folgt.

Der Datensatz wurde bereits aufgeteilt in die drei Kundenarten. Für jede Kundenart ergibt sich jeweils ein Subset mit den verbleibenden Attributen. Für alle drei Subsets erstellen wir dann wieder einen Wurzelknoten, so dass ein neuer Ast entsteht.

3.1 Erster Rekursionsschritt

Machen wir also weiter und bestimmen wir das nächste Attribut nach der Kundenart, für die Fälle Kundenart = “Neukunde”:

Zeile Kundenart Zahlungsgeschwindigkeit Kauffrequenz Herkunft Zahlungsmittel: Rechnung?
 1  Neukunde  niedrig  niedrig  Inland  false
 2  Neukunde  niedrig  niedrig  Ausland  false
 8  Neukunde  mittel  niedrig  Inland  false
 9  Neukunde  hoch  hoch  Inland  true
 11  Neukunde  mittel  hoch  Ausland  true

Die Entropie des Gesamtdatensatzes (ja, es ist für diesen Schritt betrachtet der gesamte Datensatz!) ist wie folgt:

H(S_{Neukunde}) = - \frac{2}{5} \cdot \log_2(\frac{2}{5}) - \frac{3}{5} \cdot \log_2(\frac{3}{5})  = 0.97

Die Entropie ist weit weg von einer bestimmten Wahrscheinlichkeit (nahe der Gleichverteilung). Daher müssen wir hier nochmal ansetzen und losrechnen:

Entropien für “Zahlungsgeschwindigkeit” bei Neukunden:

H(S_{niedrig}) = 0.00

H(S_{mittel}) = 1.00

H(S_{hoch}) = 0.00

Informationsgewinn des Attributes “Zahlungsgeschwindigkeit” bei Neukunden:

    \[ IG(S_{Neukunde},A_{Zahlungsgeschwindigkeit}) = 0.97 - \frac{3}{5} \cdot 0.00 - \frac{2}{5} \cdot 1.00 -  \frac{1}{5} \cdot 0.00 = 0.57 \]

Betrachtung der Spalte “Kauffrequenz” bei Neukunden:

Entropien für “Kauffrequenz” bei Neukunden:

H(S_{niedrig}) = 0.00

H(S_{hoch}) = 0.00

Informationsgewinn des Attributes “Kauffrequenz” bei Neukunden:

    \[ IG(S_{Neukunde},A_{Kauffrequenz}) = 0.97 - \frac{3}{5} \cdot 0.00 - \frac{2}{5} \cdot 0.00 = 0.97 \]

Betrachtung der Spalte “Herkunft” bei Neukunden:

Entropien für “Herkunft” bei Neukunden:

H(S_{Inland}) = 0.92

H(S_{hoch}) = 1.00

Informationsgewinn des Attributes “Herkunft” bei Neukunden:

    \[ IG(S_{Neukunde},A_{Herkunft}) = 0.97 - \frac{3}{5} \cdot 0.92 - \frac{2}{5} \cdot 1.00 = 0.018 \]

Wir entscheiden uns also für das Attribut “Kauffrequenz” als Ast nach der Entscheidung “Neukunde”, denn dieses Attribut bring uns den größten Informationsgewinn und trennt uns die Unterscheidung für oder gegen das Zahlungsmittel “Rechnung” eindeutig auf.

3.1 Zweiter Rekursionsschritt

Was passiert mit der Kundenart “Stammkunde”?

Zeile Kundenart Zahlungsgeschwindigkeit Kauffrequenz Herkunft Zahlungsmittel: Rechnung?
 3  Stammkunde  niedrig  niedrig  Inland  true
 7  Stammkunde  hoch  hoch  Ausland  true
 12  Stammkunde  mittel  niedrig  Ausland  true
 13  Stammkunde  niedrig  hoch  Inland  true

Die Antwort ist einfach: Nichts!
Wer ein Stammkunde ist, dem wurde stets die Zahlung per Rechnung erlaubt.

H(S_{Stammkunde}) = 0.0

3.1 Dritter Rekursionsschritt

Fehlt nun nur noch die Frage nach der Unterscheidung von Normalkunden.

Zeile Kundenart Zahlungsgeschwindigkeit Kauffrequenz Herkunft Zahlungsmittel: Rechnung?
 4  Normalkunde  mittel  niedrig  Inland  true
 5  Normalkunde  hoch  hoch  Inland  true
 6  Normalkunde  hoch  hoch  Ausland  false
 14  Normalkunde  mittel  niedrig  Ausland  false

Zwar ist die Entropie des Subsets der Normalkunden…

H(S_{Normalkunde}) = 1.0

… denkbar schlecht, da maximal. Aber wir können genauso vorgehen, wie wir es bei dem Subset der Neukunden getan haben. Ich nehme es nun aber vorweg: Wenn wir uns den Datensatz näher ansehen, erkennen wir, dass wir diese Gesamtentropie von 1.0 für das Subset “Normalkunde” nicht mit den Attributen “Kauffrequenz” oder “Zahlungsgeschwindigkeit” reduzieren können, da dieses auch für sich betrachtet in Entropien der Größe 1.0 erhalten werden. Das Attribut “Herkunft” hingegen teilt den Datensatz sauber in true und false auf:

Somit ist der Informationsgewinn für das Attribut “Herkunft” am größten und wir haben unseren Baum komplett und – glücklicherweise – eindeutig bestimmen können!

Ergebnis: Der Entscheidungsbaum

Somit haben wir den Entscheidungsbaum über den ID3-Algorithmus erstellt, der eine Auskunft darüber macht, ob einem Kunden die Zahlung über Rechnung (statt Vorkasse) erlaubt wird:

true = Rechnung als Zahlungsmittel erlaubt
false = Rechnung als Zahlungsmittel nicht erlaubt