The algorithm known as PCA and my taxonomy of linear dimension reductions

In one of my previous articles, I explained the importance of reducing dimensions. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are the simplest types of dimension reduction algorithms. In upcoming articles of mine, you are going to see what these algorithms do. In conclusion, diagonalization, which I mentioned in the last article, is what these algorithms are all about, but in this article I am going to cover mainly only PCA.

This article is largely based on the explanations in Pattern Recognition and Machine Learning by C. M. Bishop (which is often called “PRML”), and when you search “PCA” on the Internet, you will find more or less similar explanations. However I hope I can go some steps ahead throughout this article series. I mean, I am planning to also cover more generalized versions of PCA, meanings of diagonalization, the idea of subspace. I believe this article series is also effective for refreshing your insight into linear algebra.

*This is the third article of my article series “Illustrative introductions on dimension reduction.”

1. My taxonomy on linear dimension reduction

*If you soon want to know  what the algorithm called “PCA” is, you should skip this section for now to avoid confusion.

Out of the two algorithms I mentioned, PCA is especially important and you would see the same or similar ideas in various fields such as signal processing, psychology, and structural mechanics. However in most cases, the word “PCA” refers to one certain algorithm of linear dimension reduction. Most articles or study materials only mention the “PCA,” and this article is also going to cover only the algorithm. However I found that PCA is only one branch of linear dimension reduction algorithms.

*This chart might be confusing to you. According to PRML, PCA and KL transform is identical. PCA has two formulations, maximum variance formulation and minimum error formulation, and they can give the same result. However according to a Japanese textbook, which is very precise about this topic, KL transform has two formulations, and what we call PCA is based on maximum variance formulation. I am still not sure about correct terminology, but in this article I am going to call the most general algorithm “generalized KL transform,” I mean the root of the chart above.

*Most materials just explain the most major PCA, but if you consider this generalized KL transform, I can introduce an intriguing classification algorithm called subspace method. This algorithm was invented in Japan, and this is not so popular in machine learning textbooks in general, but learning this method would give you better insight into the idea of multidimensional space in machine learning. In the future, I am planning to cover this topic in this article series.

2. PCA

When someones mention “PCA,” I am sure for the most part that means the algorithm I am going to explain in the rest of this article. The most intuitive and straightforward way to explain PCA is that, PCA (Principal Component Analysis) of two or three dimensional data is fitting an oval to two dimensional data or fitting an ellipsoid to three dimensional data. You can actually try to plot some random dots on a piece of paper, and draw an oval which fits the dots the best. Assume that you have these 2 or 3 dimensional data below, and please try to put an oval or an ellipsoid to on data.

I think this is nothing difficult, but I have a question: what was the logic behind your choice?

Some might have roughly drawn its outline. Formulas of  “the surface” of general ellipsoids can be explained in several ways, but in this article you only have to consider ellipsoids whose center is the origin point of the coordinate system. In PCA you virtually shift data so that the mean comes to the origin point. When A is a certain type of D\times D matrix, the formula of a D-dimensional ellipsoid whose center is identical to the origin point is (\boldsymbol{x}, A\boldsymbol{x}) = 1, where \boldsymbol{x}\in \mathbb{R}. As is always the case with formulas in data science, you can visualize such ellipsoids if you are talking about 1, 2, or 3 dimensional data like in the figure below, but in general D-dimensional space, it is theoretical/imaginary stuff on blackboards.

*In order to explain the conditions of the matrix A, I need another article, so for now please just assume that the A is a kind of magical matrix.

You might have seen equations of 2 or 3 dimensional ellipsoids in the following way: \frac{x^2}{a^2} + \frac{y^2}{b^2} = 1, where a\neq 0, b\neq 0 or \frac{x^2}{a^2} + \frac{y^2}{b^2} + \frac{z^2}{c^2}= 1, where a\neq 0, b\neq 0, c \neq 0. These are special cases of the equation (\boldsymbol{x}, A\boldsymbol{x}) = 1, where A=diag(a_1^2, \dots, a_D^2). In this case the axes of ellipsoids the same as those of the coordinate system. Thus in the simple case which I have just mentioned , A=diag(a^2, b^2) or A=diag(a^2,c^2,c^2).

I am going explain these equations in detail in the upcoming articles, but how would you fit an ellipsoid when a data distribution does not look like an ellipsoid?

In fact we have to focus more on another feature of ellipsoids: all the axes of an ellipoid are orthogonal. In conclusion the axes of the ellipsoids are the points in PCA, so I do want you to forget about the surface of ellipsoids for the time being. You might be getting confused if you also think about the surface of ellipsoid, but I am planning to cover this topic in the next article. I hope this article, combined with the last one and the next one, would help you have better insight into the ideas which frequently appear in data science or machine learning context.

3. Fitting orthogonal axes on data

*If you have no trouble reading the chapter 12.1 of PRML, you do not need to this section or maybe even this article, but I hope at least some charts or codes of mine would enhance your understanding on this topic.

*I must admit I wrote only the essence of PCA formulations. If this seems too abstract for you, you should just breifly read through this section  go to the next section with a more concrete example. If you are confused there should be other good explanations on PCA on the internet, and you should also check them. But at least the visualization of PCA in the next section would be helpful.

As I implied above, all the axes of ellipsoids are orthogonal, and selecting the orthogonal axes which match data is what PCA is all about. And when you choose those orthogonal axes, it is ideal if the data look like ellipsoid. Simply putting we want the data to “swell” along the axes.

Then let’s see how to let them “swell,” more mathematically. Assume that you have 2 dimensional data plotted on a coordinate system (\boldsymbol{e}_1, \boldsymbol{e}_2) as below (The samples are plotted in purple). Intuitively, the data “swell” the most along the vector \boldsymbol{u}_1. Also  it is clear that \boldsymbol{u}_2 is the only vector orthogonal to \boldsymbol{u}_1. We can expect that the new coordinate system (\boldsymbol{u}_1, \boldsymbol{u}_2) expresses the data in a better way, and you you can get new coordinate points of the samples by projecting them on new axes as done with yellow lines below.

Next, let’s think about a case in 3 dimensional data. When you have 3 dimensional data in a coordinate system (\boldsymbol{e}_1, \boldsymbol{e}_2,\boldsymbol{e}_2) as below,  the data “swell” the most also along \boldsymbol{u}_1. And the data swells the second most along \boldsymbol{u}_2. The two axes, or vectors span the plain in purple. If you project all the samples on the plain, you will get 2 dimensional data at the right side. It is important that we did not consider the third axis. You might be able to extract important tendencies of data with fewer dimensions.


Thus the problem is how to calculate such axis \boldsymbol{u}_1. We want the variance of data projected on \boldsymbol{u}_1 to be the biggest. The coordinate of \boldsymbol{x}_n on the axis \boldsymbol{u}_1. The coordinate of a data point \boldsymbol{x}_n on the axis \boldsymbol{u}_1 is calculated by projecting \boldsymbol{x}_n on \boldsymbol{u}_1. In data science context, such projection is synonym to taking an inner  product of \boldsymbol{x}_n and \boldsymbol{u}_1, that is calculating \boldsymbol{u}_1^T \boldsymbol{x}_n.

*Each element of \boldsymbol{x}_n is the coordinate of the data point \boldsymbol{x}_n in the original coordinate system. And the projected data on \boldsymbol{u}_1 whose coordinates are 1-dimensional correspond to only one element of transformed data.

To calculate the variance of projected data on \boldsymbol{u}_1, we just have to calculate the mean of variances of 1-dimensional data projected on \boldsymbol{u}_1. Assume that \bar{\boldsymbol{x}} is the mean of data in the original coordinate, then the deviation of \boldsymbol{x}_1 on the axis \boldsymbol{u}_1 is calculated as \boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}}, as shown in the figure. Hence the variance, I mean the mean of the deviation on is \frac{1}{N} \sum^{N}_{n}{\boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}}}, where N is the total number of data points. After some deformations, you get the next equation \frac{1}{N} \sum^{N}_{n}{\boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}}} = \boldsymbol{u}_1^T S \boldsymbol{u}_1, where S = \frac{1}{N}\sum_{n=1}^{N}{(\boldsymbol{x}_n - \bar{\boldsymbol{x}})(\boldsymbol{x}_n - \bar{\boldsymbol{x}})^T}. S is known as a covariance matrix.

We are now interested in maximizing the variance of projected data on  \boldsymbol{u}_1^T S \boldsymbol{u}_1, and for mathematical derivation we need some college level calculus, so if that is too much for you, you can skip reading this part till the next section.

We now want to calculate \boldsymbol{u}_1 with which \boldsymbol{u}_1^T S \boldsymbol{u}_1 is its maximum value. General \boldsymbol{u}_i including \boldsymbol{u}_1 are just coordinate axes after PCA, so we are just interested in their directions. Thus we can set one constraint \boldsymbol{u}_1^T  \boldsymbol{u}_1 = 1. Introducing a Lagrange multiplier, we have only to optimize next problem: \boldsymbol{u}_1 ^ {*} = \mathop{\rm arg~max}\limits_{\boldsymbol{u}_1} \{ \boldsymbol{u}_1^T S \boldsymbol{u}_1 + \lambda_1 (1 - \boldsymbol{u}_1^T \boldsymbol{u}_1) \}. In conclusion \boldsymbol{u}_1 ^ {*} satisfies S\boldsymbol{u}_1 ^ {*}  = \lamba_1 \boldsymbol{u}_1 ^ {*}. If you have read my last article on eigenvectors, you wold soon realize that this is an equation for calculating eigenvectors, and that means \boldsymbol{u}_1 ^ {*} is one of eigenvectors of the covariance matrix S. Given the equation of eigenvector the next equation holds \boldsymbol{u}_1 ^ {*}^T S \boldsymbol{u}_1 ^ {*} = \lambda_1. We have seen that \boldsymbol{u}_1 ^T S \boldsymbol{u}_1 ^ is a the variance of data when projected on a vector \boldsymbol{u}_1, thus the eigenvalue \lambda_1 is the biggest variance possible when the data are projected on a vector.

Just in the same way you can calculate the next biggest eigenvalue \lambda_2, and it it the second biggest variance possible, and in this case the date are projected on \boldsymbol{u}_2, which is orthogonal to \boldsymbol{u}_1. As well you can calculate orthogonal 3rd 4th …. Dth eigenvectors.

*To be exact I have to explain the cases where we can get such D orthogonal eigenvectors, but that is going to be long. I hope I can to that in the next article.

4. Practical three dimensional example of PCA

We have seen that PCA is sequentially choosing orthogonal axes along which data points swell the most. Also we have seen that it is equal to calculating eigenvalues of the covariance matrix of the data from the largest to smallest one. From now on let’s work on a practical example of data. Assume that we have 30 students’ scores of Japanese, math, and English tests as below.

* I think the subject “Japanese” is equivalent to “English” or “language art” in English speaking countries, and maybe “Deutsch” in Germany. This example and the explanation are largely based on a Japanese textbook named 「これなら分かる応用数学教室 最小二乗法からウェーブレットまで」. This is a famous textbook with cool and precise explanations on mathematics for engineering. Partly sharing this is one of purposes of this article.

At the right side of the figure below is plots of the scores with all the combinations of coordinate axes. In total 9 inverse graphs are symmetrically arranged in the figure, and it is easy to see that English & Japanese or English and math have relatively high correlation. The more two axes have linear correlations, the bigger the covariance between them is.

In the last article, I visualized the eigenvectors of a 3\times 3 matrix A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}, and in fact the matrix is just a constant multiplication of this covariance matrix. I think now you understand that PCA is calculating the orthogonal eigenvectors of covariance matrix of data, that is diagonalizing covariance matrix with orthonormal eigenvectors. Hence we can guess that covariance matrix enables a type of linear transformation of rotation and expansion and contraction of vectors. And data points swell along eigenvectors of such matrix.

Then why PCA is useful? In order to see that at first, for simplicity assume that x, y, z denote Japanese, Math, English scores respectively. The mean of the data is \left( \begin{array}{c} \bar{x} \\ \bar{y} \\ \bar{z} \end{array} \right) = \left( \begin{array}{c} 58.1 \\ 61.8 \\ 67.3 \end{array} \right), and the covariance matrix of data in the original coordinate system is V_{xyz} = \begin{pmatrix} 60.45 & 33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}. The eigenvalues of  V_{xyz} are \lambda_1=148.34, \lambda_2 = 30.62, and \lambda_3 = 3.60, and their corresponding unit eigenvectors are \boldsymbol{u}_1 =  \left( \begin{array}{c} 0.540 \\ 0.602 \\ 0.589 \end{array} \right) , \boldsymbol{u}_2 =  \left( \begin{array}{c} 0.736 \\ -0.677 \\ 0.0174 \end{array} \right) , \boldsymbol{u}_3 =  \left( \begin{array}{c} -0.408 \\ -0.4.23 \\ 0.809 \end{array} \right) respectively.  U = (\boldsymbol{u}_1 \quad \boldsymbol{u}_2 \quad \boldsymbol{u}_3 )  is an orthonormal matrix, where \boldsymbol{u}_i^T\boldsymbol{u}_j = \begin{cases} 1 & (i=j) \\ 0 & (otherwise) \end{cases}. As I explained in the last article, you can diagonalize V_{xyz} with U: U^T V_{xyz}U = diag(\lambda_1, \dots, \lambda_D).

In order to see how PCA is useful, assume that \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right)  = U^T \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right).

Let’s take a brief look at what a linear transformation by U^T means. Each element of \boldsymbol{x} denotes coordinate of the data point \boldsymbol{x}  in the original coordinate system (In this case the original coordinate system is composed of \boldsymbol{e}_1, \boldsymbol{e}_2, and \boldsymbol{e}_3). U = (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) enables a rotation of a rigid body, which means the shape or arrangement of data will not change after the rotation, and U^T enables a reverse rotation of the rigid body.

*Roughly putting, if you hold a bold object such as a metal ball and rotate your arm, that is a rotation of a rigid body, and your shoulder is the origin point. On the other hand, if you hold something soft like a marshmallow, it would be squashed in your hand, and that is not a not a rotation of a rigid body.

You can rotate \boldsymbol{x} with U like U^T\boldsymbol{x} = \left( \begin{array}{c} -\boldsymbol{u}_1^{T}- \\ -\boldsymbol{u}_2^{T}- \\ -\boldsymbol{u}_3^{T}- \end{array} \right)\boldsymbol{x}=\left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right), and \boldsymbol{u}_i^{T}\boldsymbol{x} is the coordinate of \boldsymbol{x} projected on the axis \boldsymbol{u}_i.

Let’s see this more visually. Assume that the data point \boldsymbol{x}  is a purple dot and its position is expressed in the original coordinate system spanned by black arrows . By multiplying \boldsymbol{x} with U^T, the purple point \boldsymbol{x} is projected on the red axes respectively, and the product \left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right) denotes the coordinate point of the purple point in the red coordinate system. \boldsymbol{x} is rotated this way, but for now I think it is better to think that the data are projected on new coordinate axes rather than the data themselves are rotating.

Now that we have seen what rotation by U means, you should have clearer image on what \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right)  = U^T \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) means. \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right) denotes the coordinates of data projected on new axes \boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3, which are unit eigenvectors of V_{xyz}. In the coordinate system spanned by the eigenvectors, the data distribute like below.

By multiplying U from both sides of the equation above, we get \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) =U \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right), which means you can express deviations of the original data as linear combinations of the three factors \xi, \eta, and \zeta. We expect that those three factors contain keys for understanding the original data more efficiently. If you concretely write down all the equations for the factors: \xi = 0.540 (x - \bar{x}) + 0.602 (y - \bar{y}) + 0.588 (z - \bar{z}), \eta = 0.736(x - \bar{x}) - 0.677 (y - \bar{y}) + 0.0174 (z - \bar{z}), and \zeta = - 0.408 (x - \bar{x}) - 0.423 (y - \bar{y}) + 0.809(z - \bar{z}). If you examine the coefficients of the deviations (x - \bar{x}), (y - \bar{y}), and (z - \bar{z}), we can observe that \eta almost equally reflects the deviation of the scores of all the subjects, thus we can say \eta is a factor indicating one’s general academic level. When it comes to \eta Japanese and Math scores are important, so we can guess that this factor indicates whether the student is at more of “scientific side” or “liberal art side.” In the same way \zeta relatively makes much of one’s English score,  so it should show one’s “internationality.” However the covariance of the data \xi, \eta, \zeta is V_{\xi \eta \zeta} = \begin{pmatrix} 148.34 & 0 & 0 \\ 0 & 30.62 & 0 \\ 0 & 0 & 3.60 \end{pmatrix}. You can see \zeta does not vary from students to students, which means it is relatively not important to describe the tendency of data. Therefore for dimension reduction you can cut off the factor \zeta.

*Assume that you can apply PCA on D-dimensional data and that you get \boldsymbol{x}', where \boldsymbol{x}' = U^T\boldsymbol{x} - \bar{\boldsymbol{x}}. The variance of data projected on new D-dimensional coordinate system is V'=\frac{1}{N}\sum{(\boldsymbol{x}')^T\boldsymbol{x}'} =\frac{1}{N}\sum{(U^T\boldsymbol{x})^T(U^T\boldsymbol{x})} =\frac{1}{N}\sum{U^T\boldsymbol{x}\boldsymbol{x}^TU} =U^T(\frac{1}{N}\sum{\boldsymbol{x}\boldsymbol{x}^T})U =U^TVU =diag(\lambda_1, \dots, \lambda_D). This means that in the new coordinate system after PCA, covariances between any pair of variants are all zero.

*As I mentioned U is a rotation of a rigid body, and U^T is the reverse rotation, hence U^TU = UU^T = I.

Hence you can approximate the original 3 dimensional data on the coordinate system (\boldsymbol{e}_1, \boldsymbol{e}_2, \boldsymbol{e}_3) from the reduced two dimensional coordinate system (\boldsymbol{u}_1, \boldsymbol{u}_2) with the following equation: \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) \approx U_{reduced} \left( \begin{array}{c} \xi \\ \eta  \end{array} \right)  = (\boldsymbol{u}_1 \quad \boldsymbol{u}_2) \left( \begin{array}{c} \xi \\ \eta  \end{array} \right). Then it mathematically clearer that we can express the data with two factors: “how smart the student is” and “whether he is at scientific side or liberal art side.”

We can observe that eigenvalue \lambda_i is a statistic which indicates how much the corresponding \boldsymbol{u}_i can express the data, \frac{\lambda_i}{\sum_{j=1}^{D}{\lambda_j}} is called the contribution ratio of eigenvector \boldsymbol{u}_i. In the example above, the contribution ratios of \boldsymbol{u}_1, \boldsymbol{u}_2, and \boldsymbol{u}_3 are respectively \frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3}=0.813, \frac{\lambda_2}{\lambda_1 + \lambda_2 + \lambda_3}=0.168, \frac{\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3}=0.0197. You can decide how many degrees of dimensions you reduce based on this information.

Appendix: Playing with my toy PCA on MNIST dataset

Applying “so called” PCA on MNIST dataset is a super typical topic that many other tutorial on PCA also introduce, but I still recommend you to actually implement, or at least trace PCA implementation with MNIST dataset without using libraries like scikit-learn. While reading this article I recommend you to actually run the first and the second code below. I think you can just copy and paste them on your tool to run Python, installing necessary libraries. I wrote them on Jupyter Notebook.

In my implementation, in the simple configuration part you can set the USE_ALL_NUMBERS as True or False boolean. If you set it as True, you apply PCA on all the data of numbers from 0 to 9. If you set it as True, you can specify which digit to apply PCA on. In this article, I show the results results of PCA on the data of digit ‘3.’ The first three images of ‘3’ are as below.

You have to keep it in mind that the data are all shown as 28 by 28 pixel grayscale images, but in the process of PCA, they are all processed as 28 * 28 = 784 dimensional vectors. After applying PCA on the 784 dimensional vectors of images of ‘3,’ the first 25 eigenvectors are as below. You can see that at the beginning the eigenvectors partly retain the shapes of ‘3,’ but they are distorted as the eigenvalues get smaller. We can guess that the latter eigenvalues are not that helpful in reconstructing the shape of ‘3.’

Just as we saw in the last section, you you can cut off axes of eigenvectors with small eigenvalues and reduce the dimension of MNIST data. The figure below shows how contribution ratio of MNIST data grows. You can see that around 200 dimension degree, the contribution ratio reaches around 0.95. Then we can guess that even if we reduce the dimension of MNIST from 784 to 200 we can retain the most of the structure of original data.

Some results of reconstruction of data from 200 dimensional space are as below. You can set how many images to display by adjusting NUMBER_OF_RESULTS in the code. And if you set LATENT_DIMENSION as 784, you can completely reconstruct the data.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

*I attatched the codes I used to make the figures in this article. You can just copy, paste, and run, sometimes installing necessary libraries.



Bias and Variance in Machine Learning

Machine learning continues to be an ever more vital component of our lives and ecosystem, whether we’re applying the techniques to answer research or business problems or in some cases even predicting the future. Machine learning models need to give accurate predictions in order to create real value for a given industry or domain.

While training a model is one of the key steps in the Data Science Project Life Cycle, how the model generalizes on unseen data is an equally important aspect that should be considered in every Data Science Project Life Cycle. We need to know whether it works and, consequently, if we can trust its predictions. Could the model be merely memorizing the data it is fed with, and therefore unable to make good predictions on future samples, or samples that it hasn’t seen before?

Let’s know the importance of evaluation with a simple example, There are two student’s Ramesh and Suresh preparing for the CAT exam to get into top IIMs (Indian Institute of Management). They both are quite good friends and stayed in the room during preparation and put an equal amount of hard work while solving numerical problems.

They both prepared for almost the same number of hours for the entire year and appeared in the final CAT exam. Surprisingly, Ramesh cleared, but Suresh did not. When asked, we got to know that there was one difference in their strategy of preparation between them, Ramesh had joined a Test Series course where he used to test his knowledge and understanding by giving mock exams and then further evaluating on which portions he is lagging and making necessary adjustments to he is preparation cycle in order to do well in those areas. But Suresh was confident, and he just kept training himself without testing on the preparation he had done.

Like the above situation we can train a Machine Learning Algorithm extensively with many parameters and new techniques, but if you are skipping its evaluation step, you cannot trust your model to perform well on the unseen data. In this article, we explain the importance of Bias, Variance and the trade-off between them in order to know how well a machine learning model generalizes to new, previously unseen data.

Training of Supervised Machine Learning


Bias is the difference between the Predicted Value and the Expected Value or how far are the predicted values from the actual values. During the training process the model makes certain assumptions on the training data provided. After Training, when it is introduced to the testing/validation data or unseen data, these assumptions may not always be correct.

If we use a large number of nearest neighbors in the K-Nearest Neighbors Algorithm, the model can totally decide that some parameters are not important at all for the modelling.  For example, it can just consider that only two predictor variables are enough to classify the data point though we have more than 10 variables.

This type of model will make very strong assumptions about the other parameters not affecting the outcome at all. You can take it as a model predicting or understanding only the simple relationship when the data points clearly indicate a more complex relationship.

When the model has high bias error, it results in a very simplistic model that does not consider the complexity of the data very well leading to Underfitting.


Variance occurs when the model performs well on the trained dataset but does not do well on an unseen data set, it is when the model considers the fluctuations or i.e. the noise as in the data as well. The model will still consider the variance as something to learn from because it learns too much from the noise inside the trained data set that it fails to perform as expected on the unseen data.

Based on the above example from Bias, if the model learns that all the ten predictor variables are important to classify a given data point then it tends to have high variance. You can take it as the model is trying to understand every minute detail making it more complex and failing to perform well on the unseen data.

When a model has High Bias error, it underfits the data and makes very simplistic assumptions on it. When a model has High Variance error, it overfits the data and learns too much from it. When a model has balanced Bias and Variance errors, it performs well on the unseen data.

Bias-Variance Trade-off

Based on the definitions of bias and variance, there is clear trade-off between bias and variance when it comes to the performance of the model. A model will have a high error if it has very high bias and low variance and have a high error if it has high variance and low bias.

A model that strikes a balance between the bias and variance can minimize the error better than those that live on extreme ends.

We can find whether the model has High Bias using the below steps:

  1. We tend to get high training errors.
  2. The validation error or test error will be similar to the training error.

We can find whether the model has High Bias using the below steps:

  1. We tend to get low training error
  2. The validation error or test error will be very high.

We can fix the High Bias using below steps:

  1. We need to gather more input features or can even try to create few using the feature engineering techniques.
  2. We can even add few polynomial features in order to increase the complexity.
  3. If we are using any regularization terms in our model, we can try to minimize it.

We can fix the High Variance using below steps:

  1. We can gather more training data so that the model can learn more on the patterns rather than the noise.
  2. We can even try to reduce the input features or do feature selection.
  3.  If we are using any regularization terms in our model we can try to maximize it.


In this article, we got to know the importance of the evaluation step in the Data Science Project Life Cycle, definitions of Bias and Variance, the trade-off between them and the steps we can take to fix the Underfitting and Overfitting of a Machine Learning Model.

Rethinking linear algebra: visualizing linear transformations and eigenvectors

In terms of calculation processes of Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), which are the dimension reduction techniques I am going to explain in the following articles, diagonalization is what they are all about. Throughout this article, I would like you to have richer insight into diagonalization in order to prepare for understanding those basic dimension reduction techniques.

When our professor started a lecture on the last chapter of our textbook on linear algebra, he said “It is no exaggeration to say that everything we have studied is for this ‘diagonalization.'” Until then we had to write tons of numerical matrices and vectors all over our notebooks, calculating those products, adding their rows or columns to other rows or columns, sometimes transposing the matrices, calculating their determinants.

It was like the scene in “The Karate Kid,” where the protagonist finally understood the profound meaning behind the prolonged and boring “wax on, wax off” training given by Miyagi (or “jacket on, jacket off” training given by Jackie Chan). We had finally understood why we had been doing those seemingly endless calculations.


But usually you can do those calculations easily with functions in the Numpy library. Unlike Japanese college freshmen, I bet you are too busy to reopen textbooks on linear algebra to refresh your mathematics. Thus I am going to provide less mathematical and more intuitive explanation of diagonalization in this article.

*This is the second article of the article series ” Illustrative introductions on dimension reduction .”

1, The mainstream ways of explaining diagonalization.

*The statements below are very rough for mathematical topics, but I am going to give priority to offering more visual understanding on linear algebra in this article. For further understanding, please refer to textbooks on linear algebra. If you would like to have minimum understandings on linear algebra needed for machine learning, I recommend the Appendix C of Pattern Recognition and Machine Learning by C. M. Bishop.

In most textbooks on linear algebra, the explanations on dioagonalization is like this (if you are not sure what diagonalization is or if you are allergic to mathematics, you do not have to read this seriously):

Let V (dimV = D)be a vector space and let  T_A : V \rightarrow V be a mapping of V into itself,  defined as T_A(v) = A \cdot \boldsymbol{v}, where A is a D\times D matrix and \boldsymbol{v} is D dimensional vector. An element \boldsymbol{v} \in V is called an eigen vector if there exists a number \lambda such that A \cdot \boldsymbol{v}= \lambda \cdot \boldsymbol{v} and \boldsymbol{v} \neq \boldsymbol{0}. In this case \lambda is uniquely determined and is called an eigen value of A belonging to the eigen vector \boldsymbol{v}.

Any matrix A has D eigen values \lambda_{i}, belonging to \boldsymbol{v}_{i} (i=1, 2, …., D). If \boldsymbol{v}_{i} is basis of the vector space V, then A is diagonalizable.

When A is diagonalizable, with D \times D matrices P = (\boldsymbol{v}_{1}, \dots, \boldsymbol{v}_{D}) , whose column vectors are eigen vectors \boldsymbol{v}_{i} (i=1, 2, …., D), the following equation holds: P^{-1}AP = \Lambda, where \Lambda = diag(\lambda_{1}, \dots, \lambda_{D})= \begin{pmatrix} \lambda_{1} & 0& \ldots &0\\ 0 & \lambda_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_{D} \end{pmatrix}.

And when A is diagonalizable, you can diagonalize A as below.

Most textbooks keep explaining these type of stuff, but I have to say they lack efforts to make it understandable to readers with low mathematical literacy like me. Especially if you have to apply the idea to data science field, I believe you need more visual understanding of diagonalization. Therefore instead of just explaining the definitions and theorems, I would like to take a different approach. But in order to understand them in more intuitive ways, we first have to rethink waht linear transformation T_A means in more visible ways.

2, Linear transformations

Even though I did my best to make this article understandable to as little prerequisite knowledge, you at least have to understand linear transformation of numerical vectors and with matrices. Linear transformation is nothing difficult, and in this article I am going to use only 2 or 3 dimensional numerical vectors or square matrices. You can calculate linear transformation of \boldsymbol{v} by A as equations in the figure. In other words, \boldsymbol{u} is a vector transformed by A.

*I am not going to use the term “linear transformation” in a precise way in the context of linear algebra. In this article or in the context of data science or machine learning, “linear transformation” for the most part means products of matrices or vectors. 

*Forward/back propagation of deep learning is mainly composed of this linear transformation. You keep linearly transforming input vectors, frequently transforming them with activation functions, which are for the most part not linear transformation.

As you can see in the equations above, linear transformation with A transforms a vector to another vector. Assume that you have an original vector \boldsymbol{v} in grey and that the vector \boldsymbol{u} in pink is the transformed \boldsymbol{v} by A is. If you subtract \boldsymbol{v} from \boldsymbol{u}, you can get a displacement vector, which I displayed in purple. A displacement vector means the transition from a vector to another vector.

Let’s calculate the displacement vector with more vectors \boldsymbol{v}. Assume that A =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, and I prepared several grid vectors \boldsymbol{v} in grey as you can see in the figure below. If you transform those grey grid points with A, they are mapped into the vectors \boldsymbol{u} in pink. With those vectors in grey or pink, you can calculate the their displacement vectors \boldsymbol{u} = \boldsymbol{v} in purple.

I think you noticed that the displacement vectors in the figure above have some tendencies. In order to see that more clearly, let’s calculate displacement vectors with several matrices A and more grid points. Assume that you have three 2 \times 2 square matrices A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, A_2 =\begin{pmatrix} 3 & 1 \\ -1 & 1 \end{pmatrix}, A_3 =\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}, and I plotted displace vectors made by the matrices respectively in the figure below.

I think you noticed some characteristics of the displacement vectors made by those linear transformations: the vectors are swirling and many of them seem to be oriented in certain directions. To be exact, some displacement vectors have extend in the same directions as some of original vectors in grey. That means  linear transformation by A did not change the direction of the original vector \boldsymbol{v}, and the unchanged vectors are called eigen vectors. Real eigen vectors of each A are displayed as arrows in yellow in the figure above. But when it comes to A_3, the matrix does not have any real eigan values.

In linear algebra, depending on the type matrices A, you have consider various cases such as whether the matrices have real or imaginary eigen values, whether the matrices are diagonalizable, whether the eigen vectors are orthogonal, or whether they are unit vectors. But those topics are out of the scope of this article series, so please refer to textbooks on linear algebra if you are interested.

Luckily, however, in terms of PCA or LDA, you only have to consider a type of matrices named positive semidefinite matrices, which A_1 is classified to, and I am going to explain positive semidefinite matrices in the fourth section.

3, Eigen vectors as coordinate system

Source: Ian Stewart, “Professor Stewart’s Cabinet of Mathematical Curiosities,” (2008), Basic Books

Let me take Fibonacci numbers as an example to briefly see why diagonalization is useful. Fibonacci is sequence is quite simple and it is often explained using an example of pairs of rabbits increasing generation by generation. Let a_n (n=0, 1, 2, …) be the number of pairs of grown up rabbits in the n^{th} generation. One pair of grown up rabbits produce one pair of young rabbit The concrete values of a_n are a_0 = 0, a_1 = 1, a_2=1, a_3=2, a_4=3, a_5=5, a_6=8, a_7=13, \dots. Assume that A =\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} and that \begin{pmatrix} a_1 \\ a_0  \end{pmatrix} =\begin{pmatrix} 1 \\ 0  \end{pmatrix}, then you can calculate the number of the pairs of grown up rabbits in the next generation with the following recurrence relation. \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}=\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}.Let \boldsymbol{a}_n be \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}, then the recurrence relation can be written as \boldsymbol{a}_{n+1} = A \boldsymbol{a}_n, and the transition of \boldsymbol{a}_n are like purple arrows in the figure below. It seems that the changes of the purple arrows are irregular if you look at the plots in normal coordinate.

Assume that \lambda _1, \lambda_2 (\lambda _1< \lambda_2) are eigen values of A, and \boldsymbol{v}_1, \boldsymbol{v}_2 are eigen vectors belonging to them respectively. Also let \alpha, \beta scalars such that \begin{pmatrix} a_{1} \\ a_{0}  \end{pmatrix} = \begin{pmatrix} 1 \\ 0  \end{pmatrix} = \alpha \boldsymbol{v}_1 + \beta \boldsymbol{v}_2. According to the definition of eigen values and eigen vectors belonging to them, the following two equations hold: A\boldsymbol{v}_1 = \lambda_1 \boldsymbol{v}_1, A\boldsymbol{v}_2 = \lambda_2 \boldsymbol{v}_2. If you calculate \boldsymbol{a}_1 is, using eigen vectors of A, \boldsymbol{a}_1  = A\boldsymbol{a}_0 = A (\alpha \boldsymbol{v}_1 + \beta \boldsymbol{v}_2) = \alpha\lambda _1 \boldsymbol{v}_1 + \beta \lambda_2 \boldsymbol{v}_2. In the same way, \boldsymbol{a}_2 = A\boldsymbol{a}_1 = A (\alpha\lambda _1 \boldsymbol{v}_1 + \beta \lambda_2 \boldsymbol{v}_2) = \alpha\lambda _{1}^{2} \boldsymbol{v}_1 + \beta \lambda_{2}^{2} \boldsymbol{v}_2, and \boldsymbol{a}_3 = A\boldsymbol{a}_2 = A (\alpha\lambda _{1}^{2} \boldsymbol{v}_1 + \beta \lambda_{2}^{2} \boldsymbol{v}_2) = \alpha\lambda _{1}^{3} \boldsymbol{v}_1 + \beta \lambda_{2}^{3} \boldsymbol{v}_2. These equations show that in coordinate system made by eigen vectors of A, linear transformation by A is easily done by just multiplying eigen values with each eigen vector. Compared to the graph of Fibonacci numbers above, in the figure below you can see that in coordinate system made by eigen vectors the plots changes more systematically generation by generation.


In coordinate system made by eigen vectors of square matrices, the linear transformations by the matrices can be much more straightforward, and this is one powerful strength of eigen vectors.

*I do not major in mathematics, so I am not 100% sure, but vectors in linear algebra have more abstract meanings and various things in mathematics can be vectors, even though in machine learning or data science we  mainly use numerical vectors with more concrete elements. We can also say that matrices are a kind of maps. That is just like, at leas in my impression, even though a real town is composed of various components such as houses, smooth or bumpy roads, you can simplify its structure with simple orthogonal lines, like the map of Manhattan. But if you know what the town actually looks like, you do not have to follow the zigzag path on the map.

4, Eigen vectors of positive semidefinite matrices

In the second section of this article I told you that, even though you have to consider various elements when you discuss general diagonalization, in terms of PCA and LDA we mainly use only a type of matrices named positive semidefinite matrices. Let A be a D \times D square matrix. If \boldsymbol{x}^T A \boldsymbol{x} \geq 0 for all values of the vector \boldsymbol{x}, the A is said to be a positive semidefinite matrix. And also it is known that A being a semidefinite matrix is equivalent to \lambda _{i} \geq 0 for all the eigen values \lambda_i (i=1, \dots , D).

*I think most people first learn a type of matrices called positive definite matrices. Let A be aD \times D square matrix. If \boldsymbol{x}^T A \boldsymbol{x} > 0 for all values of the vector \boldsymbol{x}, the A is said to be a positive definite matrix. You have to keep it in mind that even if all the elements of A are positive, A is not necessarly positive definite/semidefinite.

Just as we did in the second section of this article, let’s visualize displacement vectors made by linear transformation with a 3 \times 3 square positive semidefinite matrix A.

*In fact A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, whose linear transformation I visualized the second section, is also positive semidefinite.

Let’s visualize linear transformations by a positive definite matrix A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}. I visualized the displacement vectors made by the A just as the same way as in the second section of this article. The result is as below, and you can see that, as well as the displacement vectors made by A_1, the three dimensional displacement vectors below are swirling and extending in three directions, in the directions of the three orthogonal eigen vectors \boldsymbol{v}_1, \boldsymbol{v}_2, and \boldsymbol{v}_3.

*It might seem like a weird choice of a matrix, but you are going to see why in the next article.

You might have already noticed A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix} and A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix} are both symmetric matrices and that their elements are all real values, and that their diagonal elements are all positive values. Super importantly, when all the elements of a D \times D symmetric matrix A are real values and its eigen values are \lambda_{i} (i=1, \dots , D), there exist orthonormal matrices U such that U^{-1}AU = \Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}).

*The title of this section might be misleading, but please keep it in mind that positive definite/semidefinite matrices are not necessarily real symmetric matrices. And real symmetric vectors are not necessarily positive definite/semidefinite matrices.

5, Orthonormal matrices and rotation of vectors

In this section I am gong to explain orthonormal matrices, as known as rotation matrices. If a D\times D matrix U is an orthonormal matrix, column vectors of U are orthonormal, which means U = (\boldsymbol{u}_1 \dots \boldsymbol{u}_D), where \begin{cases} \boldsymbol{u}_{i}^{T}\boldsymbol{u}_{j} = 1 \quad (i = j) \\ \boldsymbol{u}_{i}^{T}\boldsymbol{u}_{j} = 0 \quad (i\neq j) \end{cases}. In other words column vectors \boldsymbol{u}_{i} form an orthonormal coordinate system.

Orthonormal matrices U have several important matrices, and one of them is U^{-1} = U^{T}. Combining this fact with what I have told you so far, you we can reach one conclusion that you can orthogonalize a real symmetric matrix A as U^{T}AU = \Lambda. This is known as spectral decomposition or singular value decomposition.

Another important property of U is that U^{T} is also orthonormal. In other words, assume U is orthonormal and that U = (\boldsymbol{u}_1 \dots \boldsymbol{u}_D) = \begin{pmatrix} -\boldsymbol{v_1}^{T}- \\ \vdots \\ -\boldsymbol{v_D}^{T}- \end{pmatrix}, (\boldsymbol{v}_1 \dots \boldsymbol{v}_D) also forms a orthonormal coordinate system.

…It seems things are getting too mathematical and abstract (for me), thus for now I am going to wrap up what I have explained in this article .

We have seen

  • Numerical matrices linearly transform vectors.
  • Certain linear transformations do not change the direction of vectors in certain directions, which are called eigen vectors.
  • Making use of eigen vectors, you can form new coordinate system which can describe the linear transformations in a more straightforward way.
  • You can diagonalize a real symmetric matrix A with an orthonormal matrix U.

Of our current interest is what kind of linear transformation the real symmetric positive definite matrix enables. I am going to explain why the purple vectors in the figure above is swirling like that in the upcoming articles. Before that, however, we are going to  see one application of what we have seen in this article, on dimension reduction. To be concrete the next article is going to be about principal component analysis (PCA), which is very important in many fields.

*In short, the orthonormal matrix U I mentioned above enables rotation of matrix, and the diagonal matrix diag(\lambda_1, \dots, \lambda_D) expands or contracts vectors along each axis. I am going to explain that more precisely in the upcoming articles.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

*I attatched the codes I used to make the figures in this article. You can just copy, paste, and run, sometimes installing necessary libraries.


K Nearest Neighbour For Supervised Learning

K-Nearest Neighbour (KNN) Algorithms is an easy-to-implement & advanced level supervised machine learning algorithm used for both – classification as well as regression problems. However, you can see a wide of its applications in classification problems across various industries.

If you’ve been shopping a lot in e-commerce sites like Amazon, Flipkart, Myntra, or love watching web series over Netflix and Amazon Prime, one common thing you’ve always noticed, and that is recommendations.

Are you wondering how they recommend you following your choice? They use KNN Supervised Learning to find out what you may need the next when you’re buying and recommend you with a few more products.

Imagine you’re looking for an iPhone to purchase. When you scroll down a little, you see some iPhone cases, tempered glasses – saying, “People who purchased an iPhone have also purchased these items. The same applies to Netflix and Amazon Prime. When you finished a show or a series, they give you recommendations of the same genre. And do it all using KNN supervised learning and classify the items for the best user experience.

Advantages Of KNN

  • Quickest Calculation Time
  • Simple Algorithms
  • High Accuracy
  • Versatile – best use for Regression and Classification.
  • Doesn’t make any assumptions about data.

Where KNN Are Mostly Used

  • Simple Recommendation Models
  • Image Recognition Technology
  • Decision-Making Models
  • Calculating Credit Rating

Choosing The Right Value For K

 To choose the right value of K, you have to run KNN algorithms several times with different values of K and select the value of K, which reduces the number of errors you’ve come across and come out as the most stable value for K.

Your Step-By-Step Guide For Choosing The Value Of K

  • As you decrease the value of K to 1 (K = 1), you’ll reach a query point, where you get to see many elements from class A (-) and class B (+) where (-) is the only nearest neighbor. Reasonably, you would think about the query point to be most likely the red one. As K =1, which has a blue color, KNN incorrectly predicts the wrong color blue.
  • As you increase the value of K to 2 (K=2), you get to see two elements, (-) and (+) are the only nearest neighbor. As you have two values, which are of Class A and Class B, KNN incorrectly predicts the wrong values (Blue and Red).
  • As you increase the value of K to 3 (K=3), you get to see three elements (-) and (+), (+) are the only nearest neighbor. And this time, you got three values, one from blue and two from red. As your assumption is red, KNN correctly predicts the right value (Blue and Red, Red). Your answer is more stable this time compared to previous ones.


KNN works by finding the nearest distance between a query and all the elements in the database. By choosing the value for K, we get the closest to the query. And then, KNN algorithms look for the most frequent labels in classification and averages of labels in regression.

Spiky cubes, Pac-Man walking, empty M&M’s chocolate: curse of dimensionality

This is the first article of the article series Illustrative introductions on dimension reduction.

“Curse of dimensionality” means the difficulties of machine learning which arise when the dimension of data is higher. In short if the data have too many features like “weight,” “height,” “width,” “strength,” “temperature”…., that can undermine the performances of machine learning. The fact might be contrary to your image which you get from the terms “big” data or “deep” learning. You might assume that the more hints you have, the better the performances of machine learning are. There are some reasons for curse of dimensionality, and in this article I am going to introduce two major reasons below.

  1. High dimensional data usually have rich expressiveness, but usually training data are too poor for that.
  2. The behaviors of data points in high dimensional space are totally different from our common sense.

Through these topics, you will see that you always have to think about which features to use considering the number of data points.

*From now on I am going to talk about only Euclidean distance. If you are not sure what Euclidean distance means, please just keep it in mind that it is the type of distance most people wold have learnt in normal compulsory education.

*This is the first article of the article series ” Illustrative introductions on dimension reduction .”

1. Number of samples and degree of dimension

The most straightforward demerit of adding many features, or increasing dimensions of data, is the growth of computational costs. More importantly, however, you always have to think about the degree of dimensions in relation of the number of data points you have. Let me take a simple example in a book “Pattern Recognition and Machine Learning” by C. M. Bishop (PRML). This is an example of measurements of a pipeline. The figure below shows a comparison plot of 3 classes (red, green and blue), with parameter x_7 plotted against parameter x_6 out of 12 parameters.

* The meaning of data is not important in this article. If you are interested please refer to the appendix in PRML.

Assume that we are interested in classifying the cross in black into one of the three classes. One of the most naive ideas of this classification is dividing the graph into grids and labeling each grid depending on the number of samples in the classes (which are colored at the right side of the figure). And you can classify the test sample, the cross in black, into the class of the grid where the test sample is in. Thereby the cross is classified to the class in red.

Source: C.M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 34-35

As I mentioned in the figure above, we used only two features out of 12 features in total. When the total number of data points is fixed and you add remaining ten axes/features one after another, what would happen? Let’s see what “adding axes/features” means. If you are talking about 1, 2, or 3 dimensional grids, you can visualize them. And as you can see from the figure below, if you make each 10^1, 10^2, 100^3 grids respectively in 1, 2, 3 dimensional spaces, the number of the small regions in the grids are respectively 10, 100, 1000. Even though you cannot visualize it anymore, you can make grids for more than 3 dimensional data. If you continue increasing the degree of dimension, the number of grids increases exponentially, and that can soon surpass the number of training data points. That means there would be a lot of empty spaces in such high dimensional grids. And the classifying method above: coloring each grid and classifying unknown samples depending on the colors of the grids, does not work out anymore because there would be a lot of empty grids.

* If you are still puzzled by the idea of “more than 3 dimensional grids,” you should not think too much about that now. It is enough if you can get some understandings on high dimensional data after reading the whole article of this.

Source: Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning, (2016), MIT Press, p. 153

I said the method above is the most naive way, but other classical classification methods , for example k-nearest neighbors algorithm, are more or less base on a similar idea. Many of classical machine learning algorithms are based on the idea of smoothness prior, or local constancy prior. In short in classical ways, you  do not expect data to change so much in a small region, so you can expect unknown samples to be similar to data in vicinity. But that soon turns out to be problematic when the dimension of data is bigger because training data would be sparse because the area of multidimensional space grows exponentially as I mentioned above. And sometimes you would not be able to find training data around test data. Plus, in high dimensional data, you cannot treat distance in the same as you do in lower dimensional space. The ideas of “close,” “nearby,” or “vicinity” get more obscure in high dimensional data. That point is related to the next topic: the intuition have cultivated in normal life is not applicable to higher dimensional data.

2. Bizarre characteristics of high dimensional data

We form our sense of recognition in 3-dimensional ways in our normal life. Even though we can visualize only 1, 2, or 3 dimensional data, we can actually generalize the ideas in 1, 2, or 3 dimensional ideas to higher dimensions. For example 4 dimensional cubes, 100 dimensional spheres, or orthogonality in 255 dimensional space. Again, you cannot exactly visualize those ideas, and for many people, such high dimensional phenomenon are just imaginary matters on blackboards. Those high dimensional ideas are designed to retain some conditions just as well as 1, 2, or 3 dimensional space. Let’s take an example of spheres in several dimensional spaces. General spheres in any D-dimensional space can be defined as a set of any \boldsymbol{x}, such that |\boldsymbol{x} - \boldsymbol{c}| = r, where \boldsymbol{c} is the center point and r is length of radius. When \boldsymbol{x} is 2-dimensional, the spheres are called “circles.” When \boldsymbol{x} is 3-dimensional, the spheres are called “spheres” in our normal life, unless it is used in a conversation in a college cafeteria, by some students in mathematics department. And when \boldsymbol{x} is D-dimensional, they are called D-ball, and again, this is just a imaginary phenomenon on blackboard.

* Vectors and points are almost the same because all the vectors are denoted as “arrows” from the an origin point to sample data points.  The only difference is that when you use vectors, you have to consider their directions.

* “D-ball” is usually called “n-ball,” and in such context it is a sphere in a n-dimensional space. But please let me use the term “D-ball” in this article.

Not only spheres, but only many other ideas have been generalized to D-dimensional space, and many of them are indispensable also for data science. But there is one severe problem: the behaviors of data in high dimensional field is quite different from those in two or three dimensional space. To be concrete, in high dimensional field, cubes are spiky, you have to move like Pac-Man, and M & M’s Chocolate looks empty inside but tastes normal.

2.1: spiky cubes
Let’s take a look at an elementary-school-level example of geometry first. Assume that you have several unit squares or unit cubes like below. In each of them a circle or sphere with diameter 1 is inscribed. The length of a diagonal line in each square is \sqrt{2}, and that in each cube is \sqrt{3}.

If you stack the squares or cubes as below, what are the length of diameters of the blue circle or sphere, circumscribing all the 4 orange circles or the 8 orange spheres?

The answers are, the diameter of the blue circle is \sqrt{2} - 1, and the diameter of the blue sphere is \sqrt{3} - 1.

Next let’s think about the same situation in higher dimensional space. Assume that there are some unit D-dimensional hypercubes stacked, in each of which a D-ball with diameter 1 is inscribed, touching all the surfaces inside. Then what is the length of the diameter of  a D-ball circumscribing all the unit D-ball in the hypercubes ? Given the results above, it ca be predicted that its diameter is \sqrt{D}  -1. If that is true, there is one strange point: \sqrt{D} - 1 can soon surpass 2: that means in the chart above the blue sphere will stick out of the stacked cubes. That sounds like a paradox, but with one hypothesis, the phenomenon makes sense: cubes become more spiky as the degree of dimension grows. This hypothesis is a natural deduction because diagonal lines of hyper cubes get longer, and the the center of each surface of hypercubes still touches the unit D-ball with diameter 1, inscribing inscribing inside each unit hypercube.

If you stack 4 hypercubes, the blue sphere circumscribing them will not stick out of the stacked hypercubes anymore like the figure below.

*Of course you cannot visualize what is going on in D-dimensional space, so the figure below is just a pseudo simulation of D-dimensional space in our 3-dimensional sense. I guess you have to stack more than four hyper cubes in higher dimensional data, but you cannot easily imagine what will go on in such space anymore.


*You can confirm the fact that hypercube gets more spiky as the degree of dimension growth, by comparing the volume of the hypercube and the volume of the D-ball inscribed inside the hypercube. Thereby you can prove that the volume of hypercube concentrates on the corners of the hypercube. Plus, as I mentioned the longest diagonal distance of hypercube gets longer as dimension degree increases. That is why hypercube is said to be spiky. For mathematical proof, please check the Exercise 1.19 of PRML.

2.2: Pac-Man walking

Next intriguing phenomenon in high dimensional field is that most of pairs of vectors in high dimensional space are orthogonal. In other words, if you select two random vectors in high dimensional space, the angle between them are mostly close to 90^\circ. Let’s see the general meaning of angle between two vectors in any dimensional spaces. Assume that the angle between two vectors \boldsymbol{u}, and \boldsymbol{v} is \theta, then cos\theta is calculated as cos\theta = \frac{<\boldsymbol{u}, \boldsymbol{v}>}{|\boldsymbol{u}||\boldsymbol{v}|}. In 1, 2, or 3 dimensional space, you can actually see the angle, but again you can define higher dimensional angle, which you cannot visualize anymore. And angles are sometimes used as similarity of two vectors.

* <\boldsymbol{u}, \boldsymbol{v}> is the inner product of \boldsymbol{u}, and \boldsymbol{v}.

Assume that you generate a pair of two points inside a D-dimensional unit sphere and make two vectors \boldsymbol{u}, and \boldsymbol{v} by connecting the origin point and those two points respectively. When D is 2, I mean spheres are circles in this case, any \theta are equally generated as in the chart below. The fact might be the same as your intuition.   How about in 3-dimensional space? In fact the distribution of \theta is not uniform. \theta = 90^\circ is the most likely to be generated. As I explain in the figure below, if you compare the area of cross section of a hemisphere and the area of a cone whose vertex is the center point of the sphere, you can see why.

I generated 10000 random pairs of points in side a D-dimensional unit sphere, and calculated the angle between them. In other words I just randomly generated two D-dimensional vectors \boldsymbol{u} and \boldsymbol{v}, whose elements are randomly generated values between -1 and 1, and calculated the angle between them, repeating this process 10000 times. The chart below are the histograms of angle between pairs of generated vectors in respectively 2, 3, 50, and 100 dimensional space.

As I explained above, in 2-dimensional space, the distribution of \theta is almost uniform. However the distribution concentrates a little around 90^\circ in 3-dimensional space. You can see that the bigger the degree of dimension is, the more the angles of generated vectors concentrate around 90^\circ. That means most pairs of vectors in high dimensional space are close to orthogonal. Movements are also sequence of vectors, so when most pairs of movement vectors are orthogonal, that means you can only move like Pac-Man in such space.


* Of course I am talking about arcade Mac-Man game. Not Pac-Man in Super Smash Bros.  Retro RPG video games might have more similar playability, but in high dimensional space it is also difficult to turn back. At any rate, I think you have understood it is even difficult to move smoothly in high dimensional space, just like the first notorious Resident Evil on the first PS console also had terrible playability .

2.3: empty M & M’s chocolate

Let’s think about the proportion of the volume of the outermost \epsilon surface of general spheres with radius r. First, in 2 two dimensional space, spheres are circles. The area of the brown part of the circle below is \pi r^2. In order calculate the are of \epsilon \cdot r thick surface of the circle, you have only to subtract the area of \pi \{ (1 - \epsilon)\cdot r\} ^2. When \epsilon = 0.01, the area of outer most surface is \pi r^2 - \pi (0.99\cdot r)^2, and its proportion to the area of the whole circle is \frac{\pi r^2 - \pi (0.99\cdot r)^2}{\pi r^2} = 0.0199.

In case of 3-dimensional space, the value of a sphere with radius r is \frac{4}{3} \pi r^2, so the proportion of the \epsilon surface is calculated in the same way: \frac{\frac{4}{3} \pi r^3 -\frac{4}{3} \pi (0.99\cdot r)^2}{\frac{4}{3}\pi r^2} = 0.0297. Compared to the case in 2 dimensional space, the proportion is a little bigger.

How about in D-dimensional space? We have seen that even in  D-dimensional space the surface of a sphere, I mean D-ball, can be defined as a set of any points whose distance from the center point is all r. And it is known that the volume of D-ball is defined as below.

\Gamma () is called gamma function, but in this article it is not so important. The most important point now is, if you discuss any D-ball, their volume only depends on their radius r. That meas the proportion of outer \epsilon surface of D-ball is calculated as \frac{\pi r^2 - \pi \{ (1 - \epsilon)\cdot r\} ^2}{\pi r^2}. When \epsilon is 0.01, the proportion of the 1% surface of D-ball changes like in the chart below.

* And of course when D is 2,  \frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} = \pi, and when D is 3 ,  \frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} = \frac{4}{3} \pi

You can see that when D is over 400, around 90% of volume is concentrated in the very thin 1% surface of D-ball. That is why, in high dimensional space, M & M’s chocolate look empty but tastes normal: all the chocolate are concentrated beneath the sugar coating.

More interestingly, even if you choose any points as a central point of a sphere with radius r, the other points are squashed to the surface of the sphere, even if all the data points are uniformly distributed. This situation is problematic for classical machine learning algorithms, which are often based on the Euclidean distances between pairs of two sample data points: if you go from the central point to another sample point, the possibility of finding the point within (1 - \epsilon)\cdot r radius of the center is almost zero. But if you reach the outermost \epsilon part of the surface of the sphere, most data points are there. However, for one of the data points in the surface, any other data points are distant in the same way.

Inside M & M’s chocolate is a mysterious world.


You have seen that using high dimensional data can be problematic in many ways. Data science and machine learning are largely based on one idea: you can find a lower dimensional meaningful and easier structure in data. In the next articles I am going to introduce some famous dimension reduction algorithms. And hopefully I would like to give some deeper insights in to these algorithms, in straightforward ways.

* I could not explain the relationships of variance and bias of data. This is also a very important factor when you think about dimensionality of data. I hope I can write about this topic someday. You can also look it up if you are interested.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Illustrative introductions on dimension reduction

“What is your image on dimensions?”

….That might be a cheesy question to ask to reader of Data Science Blog, but most people, with no scientific background, would answer “One dimension is a line, and two dimension is a plain, and we live in three-dimensional world.” After that if you ask “How about the fourth dimension?” many people would answer “Time?”

You can find books or writings about dimensions in various field. And you can use the word “dimension” in normal conversations, in many contexts.

*In Japanese, if you say “He likes two dimension.” that means he prefers anime characters to real women, as is often the case with Japanese computer science students.

The meanings of “dimensions” depend on the context, but in data science dimension is usually the number of rows of your Excel data.

When you study data science or machine learning, usually you should start with understanding the algorithms with 2 or 3 dimensional data, and you can apply those ideas to any D dimensional data. But of course you cannot visualize D dimensional data anymore, and you always have to be careful of what happens if you expand degree of dimension.

Conversely it is also important to reduce dimension to understand abstract high dimensional stuff in 2 or 3 dimensional space, which are close to our everyday sense. That means dimension reduction is one powerful way of data visualization.

In this blog series I am going to explain meanings of dimension itself in machine learning context and algorithms for dimension reductions, such as PCA, LDA, and t-SNE, with 2 or 3 dimensional visible data. Along with that, I am going to delve into the meaning of calculations so that you can understand them in more like everyday-life sense.

This article series is going to be roughly divided into the contents below.

  1. Curse of Dimensionality
  2. Rethinking linear algebra: visualizing linear transformations and eigen vector
  3. PCA (to be published soon)
  4. KL expansion and subspace method (to be published soon)
  5. Autoencoder as dimension reduction (to be published soon)
  6. t-SNE (to be published soon)

I hope you could see that reducing dimension is one of the fundamental approaches in data science or machine learning.

Data Science in Engineering Process - Product Lifecycle Management

How to develop digital products and solutions for industrial environments?

The Data Science and Engineering Process in PLM.

Huge opportunities for digital products are accompanied by huge risks

Digitalization is about to profoundly change the way we live and work. The increasing availability of data combined with growing storage capacities and computing power make it possible to create data-based products, services, and customer specific solutions to create insight with value for the business. Successful implementation requires systematic procedures for managing and analyzing data, but today such procedures are not covered in the PLM processes.

From our experience in industrial settings, organizations start processing the data that happens to be available. This data often does not fully cover the situation of interest, typically has poor quality, and in turn the results of data analysis are misleading. In industrial environments, the reliability and accuracy of results are crucial. Therefore, an enormous responsibility comes with the development of digital products and solutions. Unless there are systematic procedures in place to guide data management and data analysis in the development lifecycle, many promising digital products will not meet expectations.

Various methodologies exist but no comprehensive framework

Over the last decades, various methodologies focusing on specific aspects of how to deal with data were promoted across industries and academia. Examples are Six Sigma, CRISP-DM, JDM standard, DMM model, and KDD process. These methodologies aim at introducing principles for systematic data management and data analysis. Each methodology makes an important contribution to the overall picture of how to deal with data, but none provides a comprehensive framework covering all the necessary tasks and activities for the development of digital products. We should take these approaches as valuable input and integrate their strengths into a comprehensive Data Science and Engineering framework.

In fact, we believe it is time to establish an independent discipline to address the specific challenges of developing digital products, services and customer specific solutions. We need the same kind of professionalism in dealing with data that has been achieved in the established branches of engineering.

Data Science and Engineering as new discipline

Whereas the implementation of software algorithms is adequately guided by software engineering practices, there is currently no established engineering discipline covering the important tasks that focus on the data and how to develop causal models that capture the real world. We believe the development of industrial grade digital products and services requires an additional process area comprising best practices for data management and data analysis. This process area addresses the specific roles, skills, tasks, methods, tools, and management that are needed to succeed.

Figure: Data Science and Engineering as new engineering discipline

More than in other engineering disciplines, the outputs of Data Science and Engineering are created in repetitions of tasks in iterative cycles. The tasks are therefore organized into workflows with distinct objectives that clearly overlap along the phases of the PLM process.

Feasibility of Objectives
  Understand the business situation, confirm the feasibility of the product idea, clarify the data infrastructure needs, and create transparency on opportunities and risks related to the product idea from the data perspective.
Domain Understanding
  Establish an understanding of the causal context of the application domain, identify the influencing factors with impact on the outcomes in the operational scenarios where the digital product or service is going to be used.
Data Management
  Develop the data management strategy, define policies on data lifecycle management, design the specific solution architecture, and validate the technical solution after implementation.
Data Collection
  Define, implement and execute operational procedures for selecting, pre-processing, and transforming data as basis for further analysis. Ensure data quality by performing measurement system analysis and data integrity checks.
  Select suitable modeling techniques and create a calibrated prediction model, which includes fitting the parameters or training the model and verifying the accuracy and precision of the prediction model.
Insight Provision
  Incorporate the prediction model into a digital product or solution, provide suitable visualizations to address the information needs, evaluate the accuracy of the prediction results, and establish feedback loops.

Real business value will be generated only if the prediction model at the core of the digital product reliably and accurately reflects the real world, and the results allow to derive not only correct but also helpful conclusions. Now is the time to embrace the unique chances by establishing professionalism in data science and engineering.


Peter Louis                               

Peter Louis is working at Siemens Advanta Consulting as Senior Key Expert. He has 25 years’ experience in Project Management, Quality Management, Software Engineering, Statistical Process Control, and various process frameworks (Lean, Agile, CMMI). He is an expert on SPC, KPI systems, data analytics, prediction modelling, and Six Sigma Black Belt.

Ralf Russ    

Ralf Russ works as a Principal Key Expert at Siemens Advanta Consulting. He has more than two decades experience rolling out frameworks for development of industrial-grade high quality products, services, and solutions. He is Six Sigma Master Black Belt and passionate about process transparency, optimization, anomaly detection, and prediction modelling using statistics and data analytics.4

Simple RNN

LSTM back propagation: following the flows of variables

First of all, the summary of this article is: please just download my Power Point slides which I made and be patient, following the equations.

I am not supposed to use so many mathematics when I write articles on Data Science Blog. However using little mathematics when I talk about LSTM backprop is like writing German, never caring about “der,” “die,” “das,” or speaking little English in English classes (which most high school English teachers in Japan do) or writing Japanese without using any Chinese characters (which looks like a terrible handwriting by a drug addict). In short, that is ridiculous. And all the precise equations of LSTM backprop, written on a blog is not a comfortable thing to see. So basically the whole of this article is an advertisement on my PowerPoint slides, sponsored by DATANOMIQ, and I can just give you some tips to get ready for the most tiresome part of understanding LSTM here.

*This article is the fifth article of “A gentle introduction to the tiresome part of understanding RNN.”

 *In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

1. Chain rules

This article is virtually an article on chain rules of differentiation. Even if you have clear understandings on chain rules, I recommend you to take a look at this section. If you have written down all the equations of back propagation of DCL, you would have seen what chain rules are. Even simple chain rules for backprop of normal DCL can be difficult to some people, but when it comes to backprop of LSTM, it is a pure torture.  I think using graphical models would help you understand what chain rules are like. Graphical models are basically used to describe the relations of variables and functions in probabilistic models, so to be exact I am going to use “something like graphical models” in this article. Not that this is a common way to explain chain rules.

First, let’s think about the simplest type of chain rule. Assume that you have a function f=f(x)=f(x(y)), and relations of the functions are displayed as the graphical model at the left side of the figure below. Variables are a type of function, so you should think that every node in graphical models denotes a function. Arrows in purple in the right side of the chart show how information propagate in differentiation.

Next, if you have a function f , which has two variances  x_1 and x_2. And both of the variances also share two variances  y_1 and y_2. When you take partial differentiation of f with respect to y_1 or y_2, the formula is a little tricky. Let’s think about how to calculate \frac{\partial f}{\partial y_1}. The variance y_1 propagates to f via x_1 and x_2. In this case the partial differentiation has two terms as below.

In chain rules, you have to think about all the routes where a variance can propagate through. If you generalize chain rules as the graphical model below, the partial differentiation of f with respect to y_i is calculated as below. And you need to understand chain rules in this way to understanding any types of back propagation.

The figure above shows that if you calculate partial differentiation of f with respect to y_i, the partial differentiation has n terms in total because y_i propagates to f via n variances. In order to understand backprop of LSTM, you constantly have to care about the flows of variances, which I display as purple arrows.

2. Chain rules in LSTM

I would like you to remember the figure below, which I used in the second article to show how errors propagate backward during backprop of simple RNNs. After forward propagation, first of all, you need to calculate \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}, gradients of the error function with respect to parameters, at each time step. But you have to be careful that even though these gradients depend on time steps, the parameters \boldsymbol{\theta} do not depend on time steps.

*As I mentioned in the second article I personally think \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} should be rather denoted as (\frac{\partial J}{\partial \boldsymbol{\theta}})^{(t)} because parameters themselves do not depend on time. However even the textbook by MIT press partly use the former notation. And I think you are likely to encounter this type of notation, so I think it is not bad to get ready for both.

The errors at time step (t) propagate backward to all the \boldsymbol{h} ^{(s)} (s \leq t). Conversely, in order to calculate \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} errors flowing from J^{(s)}  (s \geq t). In the chart you need arrows of errors in purple for the gradient in a purple frame, orange arrows for gradients in orange frame, red arrows for gradients in red frame. And you need to sum up \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} to calculate \frac{\partial J}{\partial \boldsymbol{\theta}} = \sum_{t}{\frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}}, and you need this gradient \frac{\partial J}{\partial \boldsymbol{\theta}} to renew parameters, one time.

At an RNN block level, the flows of errors and how to renew parameters are the same in LSTM backprop, but the flow of errors inside each block is much more complicated in LSTM backprop. But in order to denote errors of LSTM backprop, instead of \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}, I use a special notation \delta \star ^{(t)} = \frac{\partial J}{\partial \star}.

* Again, please be careful of what \delta \star  ^{(t)} means. Neurons depend on time steps, but parameters do not depend on time steps. So if \star are neurons,  \delta \star  ^{(t)}= \frac{\partial J}{ \partial \star ^{(t)}}, but when \star are parameters, \delta \star  ^{(t)} should be rather denoted like \delta \star  ^{(t)}= (\frac{\partial J}{ \partial \star })^{(t)}. In the Space Odyssey paper\boldsymbol{\star} are not used as parameters, but in my PowerPoint slides and some other materials, \boldsymbol{\star} are used also as parameteres.

As I wrote in the last article, you calculate \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} as below. Unlike the last article, I also added the terms of peephole connections in the equations below, and I also introduced the variances \bar{\boldsymbol{f}}^{(t)}, \bar{\boldsymbol{i}}^{(t)}, \bar{\boldsymbol{z}}^{(t)}, \bar{\boldsymbol{o}}^{(t)} for convenience.

  • \boldsymbol{\bar{f}}^{(t)}=\boldsymbol{W}_{for} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{for}\odot \boldsymbol{c}^{(t-1)} + \boldsymbol{b}_{for}
  • \boldsymbol{\bar{i}}^{(t)}=\boldsymbol{W}_{in} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{in}\odot \boldsymbol{c}^{(t-1)} + \boldsymbol{b}_{in}
  • \boldsymbol{\bar{z}}^{(t)}=\boldsymbol{W}_z \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_z \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_z
  • \boldsymbol{\bar{o}}^{(t)}=\boldsymbol{W}_{out} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{out}\odot \boldsymbol{c}^{(t)} + \boldsymbol{b}_{out}
  • \boldsymbol{f}^{(t)}=\sigma( \boldsymbol{\bar{f}}^{(t)})
  • \boldsymbol{i}^{(t)}=\sigma(\boldsymbol{\bar{i}}^{(t)})
  • \boldsymbol{z}^{(t)}=tanh(\boldsymbol{\bar{z}}^{(t)})
  • \boldsymbol{o}^{(t)}=\sigma(\boldsymbol{\bar{o}}^{(t)})

With  Hadamar product operator, the renewed cell and the output are calculated as below.

  • \boldsymbol{c}^{(t)} = \boldsymbol{z}^{(t)}\odot \boldsymbol{i}^{(t)} + \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}
  • \boldsymbol{y}^{(t)} = \boldsymbol{o}^{(t)} \odot tanh(\boldsymbol{c}^{(t)})

In this article I would rather give instructions on how to read my PowerPoint slide. Just as general backprop, you need to calculate gradients of error functions with respect to parameters, such as \delta \boldsymbol{W}_{\star}, \delta \boldsymbol{R}_{\star}, \delta \boldsymbol{p}_{\star}, \delta \boldsymbol{b}_{\star}, where \star is either of \{z, in, for, out \}. And just as backprop of simple RNNs, in order to calculate gradients with respect to parameters, you need to calculate errors of neurons, that is gradients of error functions with respect to neurons, such as \delta \boldsymbol{f}^{(t)}, \delta \boldsymbol{i}^{(t)}, \delta \boldsymbol{z}^{(t)}, \delta \boldsymbol{o}^{(t)}.

*Again and again, keep it in mind that neurons depend on time steps, but parameters do not depend on time steps.

When you calculate gradients with respect to neurons, you can first calculate \delta \boldsymbol{y}^{(t)}, but the equation for this error is the most difficult, so I recommend you to put it aside for now. After calculating \delta \boldsymbol{y}^{(t)}, you can next calculate \delta \bar{\boldsymbol{o}}^{(t)}= \frac{\partial J^{(t)}}{ \partial \bar{\boldsymbol{o}}^{(t)}}. If you see the LSTM block below as a graphical model which I introduced, the information of \bar{\boldsymbol{o}}^{(t)} flow like the purple arrows. That means, \bar{\boldsymbol{o}}^{(t)} affects J only via \boldsymbol{y}^{(t)}, and this structure is equal to the first graphical model which I have introduced above. And if you calculate \bar{\boldsymbol{o}}^{(t)} element-wise, you get the equation \delta \bar{o}_{k}^{(t)}=\frac{\partial J}{\partial \bar{o}_{k}^{(t)}}= \frac{\partial J}{\partial y_{k}^{(t)}} \frac{\partial y_{k}^{(t)}}{\partial \bar{o}_{k}^{(t)}}.

*The k is an index of an element of vectors. If you can calculate element-wise gradients, it is easy to understand that as differentiation of vectors and matrices.

Next you can calculate \delta \boldsymbol{c}^{(t)}, and chain rules are very important in this process. The flow of \delta \boldsymbol{c}^{(t)} to J can be roughly divided into two streams: the one which flows to J as \bodlsymbol{y}^{(t)}, and the one which flows to J as \bodlsymbol{c}^{(t+1)}. And the stream from \bodlsymbol{c}^{(t)} to \bodlsymbol{y}^{(t)} also have two branches: the one via \bar{\boldsymbol{o}}^{(t)} and the one which directly converges as  \bodlsymbol{y}^{(t)}. Just as well, the stream from \bodlsymbol{c}^{(t)} to \bodlsymbol{c}^{(t+1)} also have three branches: the ones via \bar{\boldsymbol{f}}^{(t)}, \bar{\boldsymbol{i}}^{(t)}, and the one which directly converges as \bodlsymbol{c}^{(t+1)}.

If you see see these flows as graphical a graphical model, that would be like the figure below.

According to this graphical model, you can calculate \delta \boldsymbol{c} ^{(t)} element-wise as below.

* TO BE VERY HONEST I still do not fully understand why we can apply chain rules like above to calculate \delta \boldsymbol{c}^{(t)}. When you apply the formula of chain rules, which I showed in the first section, to this case, you have to be careful of where to apply partial differential operators \frac{\partial}{ \partial c_{k}^{(t)}}. In the case above, in the part \frac{\partial y_{k}^{(t)}}{\partial c_{k}^{(t)}} the partial differential operator only affects tanh(c_{k}^{(t)}) of o_{k}^{(t)} \cdot tanh(c_{k}^{(t)}). And in the part \frac{\partial c_{k}^{(t+1)}}{\partial c_{k}^{(t)}}, the partial differential operator \frac{\partial}{\partial c_{k}^{(t)}} only affects the part c_{k}^{(t)} of the term c^{t}_{k} \cdot f_{k}^{(t+1)}. In the \frac{\partial \bar{o}_{k}^{(t)}}{\partial c_{k}^{(t)}} part, only (p_{out})_{k} \cdot c_{k}^{(t)},  in the \frac{\partial \bar{i}_{k}^{(t+1)}}{\partial c_{k}^{(t)}} part, only (p_{in})_{k} \cdot c_{k}^{(t)}, and in the \frac{\partial \bar{f}_{k}^{(t+1)}}{\partial c_{k}^{(t)}} part, only (p_{in})_{k} \cdot c_{k}^{(t)}. But some other parts, which are not affected by \frac{\partial}{ \partial c_{k}^{(t)}} are also functions of c_{k}^{(t)}. For example o_{k}^{(t)} of o_{k}^{(t)} \cdot tanh(c_{k}^{(t)}) is also a function of c_{k}^{(t)}. And I am still not sure about the logic behind where to affect those partial differential operators.

*But at least, these are the only decent equations for LSTM backprop which I could find, and a frequently cited paper on LSTM uses implementation based on these equations. Computer science is more of practical skills, rather than rigid mathematical logic. Also I think I have spent great deal of my time thinking about this part, and it is now time for me to move to next step. If you have any comments or advice on this point, please let me know.

Calculating \delta \bar{\boldsymbol{f}}^{(t)}, \delta \bar{\boldsymbol{i}}^{(t)}, \delta \bar{\boldsymbol{z}}^{(t)} are also relatively straigtforward as calculating \delta \bar{\boldsymbol{o}}^{(t)}. They all use the first type of chain rule in the first section. Thereby you can get these gradients: \delta \bar{f}_{k}^{(t)}=\frac{\partial J}{ \partial \bar{f}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{f}_{k}^{(t)}}, \delta \bar{i}_{k}^{(t)}=\frac{\partial J}{\partial \bar{i}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{i}_{k}^{(t)}}, and \delta \bar{z}_{k}^{(t)}=\frac{\partial J}{\partial \bar{z}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{i}_{k}^{(t)}}.

All the gradients which we have calculated use the error \delta \boldsymbol{y}^{(t)}, but when it comes to calculating \delta \boldsymbol{y}^{(t)}….. I can only say “Please be patient. I did my best in my PowerPoint slides to explain that.” It is not a kind of process which I want to explain on Word Press. In conclusion you get an error like this: \delta \boldsymbol{y}^{(t)}=\frac{\partial J^{(t)}}{\partial \boldsymbol{y}^{(t)}} + \boldsymbol{R}_{for}^{T} \delta \bar{\boldsymbol{f}}^{(t+1)} + \boldsymbol{R}_{in}^{T}\delta \bar{\boldsymbol{i}}^{(t+1)} + \boldsymbol{R}_{out}^{T}\delta \bar{\boldsymbol{o}}^{(t+1)} + \boldsymbol{R}_{z}^{T}\delta \bar{\boldsymbol{z}}^{(t+1)}, and the flows of \boldsymbol{y}^{(t)} are as blow.

Combining the gradients we have got so far, we can calculate gradients with respect to parameters. For concrete results, please check the Space Odyssey paper or my PowerPoint slide.

3. How LSTMs tackle exploding/vanishing gradients problems

*If you are allergic to mathematics, you should not read this section or even download my PowerPoint slide.

*Part of this section is more or less subjective, so if you really want to know how LSTM mitigate the problems, I highly recommend you to also refer to other materials. But at least I did my best for this article.

LSTMs do not completely solve, vanishing gradient problems. They mitigate vanishing/exploding gradient problems. I am going to roughly explain why they can tackle those problems. I think you find many explanations on that topic, but many of them seems to have some mathematical mistakes (even the slide used in a lecture in Stanford University) and I could not partly agree with some statements. I also could not find any papers or materials which show the whole picture of how LSTMs can tackle those problems. So in this article I am only going to give instructions on the major way to explain this topic.

First let’s see how gradients actually “vanish” or “explode” in simple RNNs. As I in the second article of this series, simple RNNs propagate forward as the equations below.

  • \boldsymbol{a}^{(t)} = \boldsymbol{b} + \boldsymbol{W} \cdot \boldsymbol{h}^{(t-1)} + \boldsymbol{U} \cdot \boldsymbol{x}^{(t)}
  • \boldsymbol{h}^{(t)}= g(\boldsymbol{a}^{(t)})
  • \boldsymbol{o}^{(t)} = \boldsymbol{c} + \boldsymbol{V} \cdot \boldsymbol{h}^{(t)}
  • \hat{\boldsymbol{y}} ^{(t)} = f(\boldsymbol{o}^{(t)})

And every time step, you get an error function J^{(t)}. Let’s consider the gradient of J^{(t)} with respect to \boldsymbol{h}^{(k)}, that is the error flowing from J^{(t)} to \boldsymbol{h}^{(k)}. This error is the most used to calculate gradients of the parameters in the equations above.

If you calculate this error more concretely, \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}} \cdots \frac{\partial \boldsymbol{h}^{(k+2)}}{\partial \boldsymbol{h}^{(k+1)}} \frac{\partial \boldsymbol{h}^{(k+1)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \prod_{k< s \leq t} \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}}, where \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}} = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{b} + \boldsymbol{W}\cdot \boldsymbol{h}^{(s-1)} + \boldsymbol{U}\cdot \boldsymbol{x}^{(s)})) = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})).

* If you see the figure as a type of graphical model, you should be able to understand the why chain rules can be applied as the equation above.

*According to this paper \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}}  = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})), but it seems that many study materials and web sites are mistaken in this point.

Hence \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \prod_{k< s \leq t} \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})) = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} (\boldsymbol{W} ^T )^{(t - k)} \prod_{k< s \leq t} diag(g'(\boldsymbol{a}^{(s)})). If you take norms of \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} you get an equality \left\lVert \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} \right\rVert \leq \left\lVert \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \right\rVert \left\lVert \boldsymbol{W} ^T \right\rVert ^{(t - k)} \prod_{k< s \leq t} \left\lVert diag(g'(\boldsymbol{a}^{(s)}))\right\rVert. I will not go into detail anymore, but it is known that according to this inequality, multiplication of weight vectors exponentially converge to 0 or to infinite number.

We have seen that the error \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} is the main factor causing vanishing/exploding gradient problems of simple RNNs. In case of LSTM, \frac{\partial J^{(t)}}{\partial \boldsymbol{c}^{(k)}} is an equivalent. For simplicity, let’s calculate only \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}, which is equivalent to \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}} of simple RNN backprop.

* Just as I noted above, you have to be careful of which part the partial differential operator \frac{\partial}{\partial \boldsymbol{c}^{(t-1)}} affects in the chain rule above. That is, you need to calculate \frac{\partial}{\partial \boldsymbol{c}^{(t-1)}} (\boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}), and the partial differential operator only affects \boldsymbol{c}^{(t-1)}. I think this is not a correct mathematical notation, but please forgive me for doing this for convenience.

If you continue calculating the equation above more concretely, you get the equation below.

I cannot mathematically explain why, but it is known that this characteristic of gradients of LSTM backprop mitigate the vanishing/exploding gradient problem. We have seen that if you take a norm of \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}}, that is equal or smaller than repeated multiplication of the norm of the same weight matrix, and that soon leads to vanishing/exploding gradient problem. But according to the equation above, even if you take a norm of repeatedly multiplied \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}, its norm cannot be evaluated with a simple value like repeated multiplication of the norm of the same weight matrix. The outputs of each gate are different from time steps to time steps, and that adjust the value of \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}.

*I personally guess the term diag(\boldsymbol{f}^{(t)}) is very effective. The unaffected value of the elements of \boldsymbol{f}^{(t)} can directly adjust the value of \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}. And as a matter of fact, it is known that performances of LSTM drop the most when you get rid of forget gates.

When it comes to tackling exploding gradient problems, there is a much easier technique called gradient clipping. This algorithm is very simple: you just have to adjust the size of gradient so that the absolute value of gradient is under a threshold at every time step. Imagine that you decide in which direction to move by calculating gradients, but when the footstep is going to be too big, you just adjust the size of footstep to the threshold size you have set. In a pseudo code, you can write a gradient clipping part only with some two line codes as below.

*\boldsymbol{g} is a gradient at the time step threshold is the maximum size of the “step.”

The figure below, cited from a deep learning text from MIT press textbook, is a good and straightforward explanation on gradient clipping.It is known that a strongly nonlinear function, such as error functions of RNN, can have very steep or plain areas. If you artificially visualize the idea in 3-dimensional space, as the surface of a loss function J with two variants w, b, that means the loss function J has plain areas and very steep cliffs like in the figure.Without gradient clipping, at the left side, you can see that the black dot all of a sudden climb the cliff and could jump to an unexpected area. But with gradient clipping, you avoid such “big jumps” on error functions.

Source: Source: Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning, (2016), MIT Press, 409p


I am glad that I have finally finished this article series. I am not sure how many of the readers would have read through all of the articles in this series, including my PowerPoint slides. I bet that is not so many. I spent a great deal of my time for making these contents, but sadly even when I was studying LSTM, it was becoming old-fashioned, at least in natural language processing (NLP) field: a very promising algorithm named Transformer has been replacing the position of LSTM. Deep learning is a very fast changing field. I also would like to make illustrative introductions on attention mechanism in NLP, from seq2seq model to Transformer. And I think LSTM would still remain as one of the algorithms in sequence data processing, such as hidden Hidden Markov model, or particle filter.


* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Hypothesis Test for real problems

Hypothesis tests are significant for evaluating answers to questions concerning samples of data.

A statistical hypothesis is a belief made about a population parameter. This belief may or might not be right. In other words, hypothesis testing is a proper technique utilized by scientist to support or reject statistical hypotheses. The foremost ideal approach to decide if a statistical hypothesis is correct is examine the whole population.

Since that’s frequently impractical, we normally take a random sample from the population and inspect the equivalent. Within the event sample data set isn’t steady with the statistical hypothesis, the hypothesis is refused.

Types of hypothesis:

There are two sort of hypothesis and both the Null Hypothesis (Ho) and Alternative Hypothesis (Ha) must be totally mutually exclusive events.

• Null hypothesis is usually the hypothesis that the event wont’t happen.

• Alternative hypothesis is a hypothesis that the event will happen.

Why we need Hypothesis Testing?

Suppose a specific cosmetic producing company needs to launch a new Shampoo in the market. For this situation they will follow Hypothesis Testing all together decide the success of new product in the market.

Where likelihood of product being ineffective in market is undertaken as Null Hypothesis and likelihood of product being profitable is undertaken as Alternative Hypothesis. By following the process of Hypothesis testing they will foresee the accomplishment.

How to Calculate Hypothesis Testing?

  • State the two theories with the goal that just one can be correct, to such an extent that the two occasions are totally unrelated.
  • Now figure a study plan, that will lay out how the data will be assessed.
  • Now complete the plan and genuinely investigate the sample dataset.
  • Finally examine the outcome and either accept or reject the null hypothesis.

Another example

Assume, Person have gone after a typing job and he has expressed in the resume that his composing speed is 70 words per minute. The recruiter might need to test his case. On the off chance that he sees his case as adequate, he will enlist him in any case reject him. Thus, he types an example letter and found that his speed is 63 words a minute. Presently, he can settle on whether to employ him or not.  In the event that he meets all other qualification measures. This procedure delineates Hypothesis Testing in layman’s terms.

In statistical terms Hypothesis his typing speed is 70 words per minute is a hypothesis to be tested so-called null hypothesis. Clearly, the alternating hypothesis his composing speed isn’t 70 words per minute.

So, normal composing speed is population parameter and sample composing speed is sample statistics.

The conditions of accepting or rejecting his case is to be chosen by the selection representative. For instance, he may conclude that an error of 6 words is alright to him so he would acknowledge his claim between 64 to 76 words per minute. All things considered, sample speed 63 words per minute will close to reject his case. Furthermore, the choice will be he was producing a fake claim.

In any case, if the selection representative stretches out his acceptance region to positive/negative 7 words that is 63 to 77 words, he would be tolerating his case.

In this way, to finish up, Hypothesis Testing is a procedure to test claims about the population dependent on sample. It is a fascinating reasonable subject with a quite statistical jargon. You have to dive more to get familiar with the details.

Significance Level and Rejection Region for Hypothesis

Type I error probability is normally indicated by α and generally set to 0.05.  The value of α is recognized as the significance level.

The rejection region is the set of sample data that prompts the rejection of the null hypothesis.  The significance level, α, decides the size of the rejection region.  Sample results in the rejection region are labelled statistically significant at level of α .

The impact of differing α is that If α is small, for example, 0.01, the likelihood of a type I error is little, and a ton of sample evidence for the alternative hypothesis is needed before the null hypothesis can be dismissed. Though, when α is bigger, for example, 0.10, the rejection region is bigger, and it is simpler to dismiss the null hypothesis.

Significance from p-values

A subsequent methodology is to evade the utilization of a significance level and rather just report how significant the sample evidence is. This methodology is as of now more widespread.  It is accomplished by method of a p value. P value is gauge of power of the evidence against null hypothesis. It is the likelihood of getting the observed value of test statistic, or value with significantly more prominent proof against null hypothesis (Ho), if the null hypothesis of an investigation question is true. The less significant the p value, the more proof there is supportive of the alternative hypothesis. Sample evidence is measurably noteworthy at the α level just if the p value is less than α. They have an association for two tail tests. When utilizing a confidence interval to playout a two-tailed hypothesis test, reject the null hypothesis if and just if the hypothesized value doesn’t lie inside a confidence interval for the parameter.

Hypothesis Tests and Confidence Intervals

Hypothesis tests and confidence intervals are cut out of the same cloth. An event whose 95% confidence interval reject the hypothesis is an event for which p<0.05 under the relating hypothesis test, and the other way around. A p value is letting you know the greatest confidence interval that despite everything prohibits the hypothesis. As such, if p<0.03 against the null hypothesis, that implies that a 97% confidence interval does exclude the null hypothesis.

Hypothesis Tests for a Population Mean

We do a t test on the ground that the population mean is unknown. The general purpose is to contrast sample mean with some hypothetical population mean, to assess whether the watched the truth is such a great amount of unique in relation to the hypothesis that we can say with assurance that the hypothetical population mean isn’t, indeed, the real population mean.

Hypothesis Tests for a Population Proportion

At the point when you have two unique populations Z test facilitates you to choose if the proportion of certain features is the equivalent or not in the two populations. For instance, if the male proportion is equivalent between two nations.

Hypothesis Test for Equal Population Variances

F Test depends on F distribution and is utilized to think about the variance of the two impartial samples. This is additionally utilized with regards to investigation of variance for making a decision about the significance of more than two sample.

T test and F test are totally two unique things. T test is utilized to evaluate the population parameter, for example, population mean, and is likewise utilized for hypothesis testing for population mean. However, it must be utilized when we don’t know about population standard deviation. On the off chance that we know the population standard deviation, we will utilize Z test. We can likewise utilize T statistic to approximate population mean. T statistic is likewise utilised for discovering the distinction in two population mean with the assistance of sample means.

Z statistic or T statistic is utilized to assess population parameters such as population mean and population proportion. It is likewise used for testing hypothesis for population mean and population proportion. In contrast to Z statistic or T statistic, where we manage mean and proportion, Chi Square or F test is utilized for seeing if there is any variance inside the samples. F test is the proportion of fluctuation of two samples.


Hypothesis encourages us to make coherent determinations, the connection among variables, and gives the course to additionally investigate. Hypothesis for the most part results from speculation concerning studied behaviour, natural phenomenon, or proven theory. An honest hypothesis ought to be clear, detailed, and reliable with the data. In the wake of building up the hypothesis, the following stage is validating or testing the hypothesis. Testing of hypothesis includes the process that empowers to concur or differ with the expressed hypothesis.

Simple RNN

Understanding LSTM forward propagation in two ways

*This article is only for the sake of understanding the equations in the second page of the paper named “LSTM: A Search Space Odyssey”. If you have no trouble understanding the equations of LSTM forward propagation, I recommend you to skip this article and go the the next article.

*This article is the fourth article of “A gentle introduction to the tiresome part of understanding RNN.”

1. Preface

I  heard that in Western culture, smart people write textbooks so that other normal people can understand difficult stuff, and that is why textbooks in Western countries tend to be bulky, but also they are not so difficult as they look. On the other hand in Asian culture, smart people write puzzling texts on esoteric topics, and normal people have to struggle to understand what noble people wanted to say. Publishers also require the authors to keep the texts as short as possible, so even though the textbooks are thin, usually students have to repeat reading the textbooks several times because usually they are too abstract.

Both styles have cons and pros, and usually I prefer Japanese textbooks because they are concise, and sometimes it is annoying to read Western style long texts with concrete straightforward examples to reach one conclusion. But a problem is that when it comes to explaining LSTM, almost all the text books are like Asian style ones. Every study material seems to skip the proper steps necessary for “normal people” to understand its algorithms. But after actually making concrete slides on mathematics on LSTM, I understood why: if you write down all the equations on LSTM forward/back propagation, that is going to be massive, and actually I had to make 100-page PowerPoint animated slides to make it understandable to people like me.

I already had a feeling that “Does it help to understand only LSTM with this precision? I should do more practical codings.” For example François Chollet, the developer of Keras, in his book, said as below.


For me that sounds like “We have already implemented RNNs for you, so just shut up and use Tensorflow/Keras.” Indeed, I have never cared about the architecture of my Mac Book Air, but I just use it every day, so I think he is to the point. To make matters worse, for me, a promising algorithm called Transformer seems to be replacing the position of LSTM in natural language processing. But in this article series and in my PowerPoint slides, I tried to explain as much as possible, contrary to his advice.

But I think, or rather hope,  it is still meaningful to understand this 23-year-old algorithm, which is as old as me. I think LSTM did build a generation of algorithms for sequence data, and actually Sepp Hochreiter, the inventor of LSTM, has received Neural Network Pioneer Award 2021 for his work.

I hope those who study sequence data processing in the future would come to this article series, and study basics of RNN just as I also study classical machine learning algorithms.

 *In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

2. Why LSTM?

First of all, let’s take a brief look at what I said about the structures of RNNs,  in the first and the second article. A simple RNN is basically densely connected network with a few layers. But the RNN gets an input every time step, and it gives out an output at the time step. Part of information in the middle layer are succeeded to the next time step, and in the next time step, the RNN also gets an input and gives out an output. Therefore, virtually a simple RNN behaves almost the same way as densely connected layers with many layers during forward/back propagation if you focus on its recurrent connections.

That is why simple RNNs suffer from vanishing/exploding gradient problems, where the information exponentially vanishes or explodes when its gradients are multiplied many times through many layers during back propagation. To be exact, I think you need to consider this problem precisely like you can see in this paper. But for now, please at least keep it in mind that when you calculate a gradient of an error function with respect to parameters of simple neural networks, you have to multiply parameters many times like below, and this type of calculation usually leads to vanishing/exploding gradient problem.

LSTM was invented as a way to tackle such problems as I mentioned in the last article.

3. How to display LSTM

I would like you to just go to image search on Google, Bing, or Yahoo!, and type in “LSTM.” I think you will find many figures, but basically LSTM charts are roughly classified into two types: in this article I call them “Space Odyssey type” and “electronic circuit type”, and in conclusion, I highly recommend you to understand LSTM as the “electronic circuit type.”

*I just randomly came up with the terms “Space Odyssey type” and “electronic circuit type” because the former one is used in the paper I mentioned, and the latter one looks like an electronic circuit to me. You do not have to take how I call them seriously.

However, not that all the well-made explanations on LSTM use the “electronic circuit type,” and I am sure you sometimes have to understand LSTM as the “space odyssey type.” And the paper “LSTM: A Search Space Odyssey,” which I learned a lot about LSTM from,  also adopts the “Space Odyssey type.”

LSTM architectur visualization

The main reason why I recommend the “electronic circuit type” is that its behaviors look closer to that of simple RNNs, which you would have seen if you read my former articles.

*Behaviors of both of them look different, but of course they are doing the same things.

If you have some understanding on DCL, I think it was not so hard to understand how simple RNNs work because simple RNNs  are mainly composed of linear connections of neurons and weights, whose structures are the same almost everywhere. And basically they had only straightforward linear connections as you can see below.

But from now on, I would like you to give up the ideas that LSTM is composed of connections of neurons like the head image of this article series. If you do that, I think that would be chaotic and I do not want to make a figure of it on Power Point. In short, sooner or later you have to understand equations of LSTM.

4. Forward propagation of LSTM in “electronic circuit type”

*For further understanding of mathematics of LSTM forward/back propagation, I recommend you to download my slides.

The behaviors of an LSTM block is quite similar to that of a simple RNN block: an RNN block gets an input every time step and gets information from the RNN block of the last time step, via recurrent connections. And the block succeeds information to the next block.

Let’s look at the simplified architecture of  an LSTM block. First of all, you should keep it in mind that LSTM have two streams of information: the one going through all the gates, and the one going through cell connections, the “highway” of LSTM block. For simplicity, we will see the architecture of an LSTM block without peephole connections, the lines in blue. The flow of information through cell connections is relatively uninterrupted. This helps LSTMs to retain information for a long time.

In a LSTM block, the input and the output of the former time step separately go through sections named “gates”: input gate, forget gate, output gate, and block input. The outputs of the forget gate, the input gate, and the block input join the highway of cell connections to renew the value of the cell.

*The small two dots on the cell connections are the “on-ramp” of cell conection highway.

*You would see the terms “input gate,” “forget gate,” “output gate” almost everywhere, but how to call the “block gate” depends on textbooks.

Let’s look at the structure of an LSTM block a bit more concretely. An LSTM block at the time step (t) gets \boldsymbol{y}^{(t-1)}, the output at the last time step,  and \boldsymbol{c}^{(t-1)}, the information of the cell at the time step (t-1), via recurrent connections. The block at time step (t) gets the input \boldsymbol{x}^{(t)}, and it separately goes through each gate, together with \boldsymbol{y}^{(t-1)}. After some calculations and activation, each gate gives out an output. The outputs of the forget gate, the input gate, the block input, and the output gate are respectively \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)}. The outputs of the gates are mixed with \boldsymbol{c}^{(t-1)} and the LSTM block gives out an output \boldsymbol{y}^{(t)}, and gives \boldsymbol{y}^{(t)} and \boldsymbol{c}^{(t)} to the next LSTM block via recurrent connections.

You calculate \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} as below.

  • \boldsymbol{f}^{(t)}= \sigma(\boldsymbol{W}_{for} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \boldsymbol{y}^{(t-1)} +  \boldsymbol{b}_{for})
  • \boldsymbol{i}^{(t)}=\sigma(\boldsymbol{W}_{in} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{in})
  • \boldsymbol{z}^{(t)}=tanh(\boldsymbol{W}_z \boldsymbol{x}^{(t)} + \boldsymbol{R}_z \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_z)
  • \boldsymbol{o}^{(t)}=\sigma(\boldsymbol{W}_{out} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{out})

*You have to keep it in mind that the equations above do not include peephole connections, which I am going to show with blue lines in the end.

The equations above are quite straightforward if you understand forward propagation of simple neural networks. You add linear products of \boldsymbol{y}^{(t)} and \boldsymbol{c}^{(t)} with different weights in each gate. What makes LSTMs different from simple RNNs is how to mix the outputs of the gates with the cell connections. In order to explain that, I need to introduce a mathematical operator called Hadamard product, which you denote as \odot. This is a very simple operator. This operator produces an elementwise product of two vectors or matrices with identical shape.

With this Hadamar product operator, the renewed cell and the output are calculated as below.

  • \boldsymbol{c}^{(t)} = \boldsymbol{z}^{(t)}\odot \boldsymbol{i}^{(t)} + \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}
  • \boldsymbol{y}^{(t)} = \boldsymbol{o}^{(t)} \odot tanh(\boldsymbol{c}^{(t)})

The values of \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} are compressed into the range of [0, 1] or [-1, 1] with activation functions. You can see that the input gate and the block input give new information to the cell. The part \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)} means that the output of the forget gate “forgets” the cell of the last time step by multiplying the values from 0 to 1 elementwise. And the cell \boldsymbol{c}^{(t)} is activated with tanh() and the output of the output gate “suppress” the activated value of \boldsymbol{c}^{(t)}. In other words, the output gatedecides how much information to give out as an output of the LSTM block. The output of every gate depends on the input \boldsymbol{x}^{(t)}, and the recurrent connection \boldsymbol{y}^{(t-1)}. That means an LSTM block learns to forget the cell of the last time step, to renew the cell, and to suppress the output. To describe in an extreme manner, if all the outputs of every gate are always (1, 1, …1)^T, LSTMs forget nothing, retain information of inputs at every time step, and gives out everything. And  if all the outputs of every gate are always (0, 0, …0)^T, LSTMs forget everything, receive no inputs, and give out nothing.

This model has one problem: the outputs of each gate do not directly depend on the information in the cell. To solve this problem, some LSTM models introduce some flows of information from the cell to each gate, which are shown as lines in blue in the figure below.

LSTM inner architecture

LSTM models, for example the one with or without peephole connection, depend on the library you use, and the model I have showed is one of standard LSTM structure. However no matter how complicated structure of an LSTM block looks, you usually cover it with a black box as below and show its behavior in a very simplified way.

5. Space Odyssey type

I personally think there is no advantages of understanding how LSTMs work with this Space Odyssey type chart, but in several cases you would have to use this type of chart. So I will briefly explain how to look at that type of chart, based on understandings of LSTMs you have gained through this article.

In Space Odyssey type of LSTM chart, at the center is a cell. Electronic circuit type of chart, which shows the flow of information of the cell as an uninterrupted “highway” in an LSTM block. On the other hand, in a Spacey Odyssey type of chart, the information of the cell rotate at the center. And each gate gets the information of the cell through peephole connections,  \boldsymbol{x}^{(t)}, the input at the time step (t) , sand \boldsymbol{y}^{(t-1)}, the output at the last time step (t-1), which came through recurrent connections. In Space Odyssey type of chart, you can more clearly see that the information of the cell go to each gate through the peephole connections in blue. Each gate calculates its output.

Just as the charts you have seen, the dotted line denote the information from the past. First, the information of the cell at the time step (t-1) goes to the forget gate and get mixed with the output of the forget cell In this process the cell is partly “forgotten.” Next, the input gate and the block input are mixed to generate part of new value of the the cell at time step  (t). And the partly “forgotten” \boldsymbol{c}^{(t-1)} goes back to the center of the block and it is mixed with the output of the input gate and the block input. That is how \boldsymbol{c}^{(t)} is renewed. And the value of new cell flow to the top of the chart, being mixed with the output of the output gate. Or you can also say the information of new cell is “suppressed” with the output gate.

I have finished the first four articles of this article series, and finally I am gong to write about back propagation of LSTM in the next article. I have to say what I have written so far is all for the next article, and my long long Power Point slides.


* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.


[1] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber, “LSTM: A Search Space Odyssey,” (2017)

[2] Francois Chollet, Deep Learning with Python,(2018), Manning , pp. 202-204

[3] “Sepp Hochreiter receives IEEE CIS Neural Networks Pioneer Award 2021”, Institute of advanced research in artificial intelligence, (2020)

[4] Oketani Takayuki, “Machine Learning Professional Series: Deep Learning,” (2015), pp. 120-125
岡谷貴之 著, 「機械学習プロフェッショナルシリーズ 深層学習」, (2015), pp. 120-125

[5] Harada Tatsuya, “Machine Learning Professional Series: Image Recognition,” (2017), pp. 252-257
原田達也 著, 「機械学習プロフェッショナルシリーズ 画像認識」, (2017), pp. 252-257

[6] “Understandable LSTM ~ With the Current Trends,” Qiita, (2015)
「わかるLSTM ~ 最近の動向と共に」, Qiita, (2015)