Machine Learning Archives

Instructions on Transformer for people outside NLP field, but with examples of NLP

December 30, 2020/in Artificial Intelligence, Data Mining, Data Science, Deep Learning, Machine Learning, Main Category, Mathematics, Natural Language Processing, Predictive Analytics/by Yasuto Tamura

I found it quite difficult to explain the mathematical details of long short-term memory (LSTM) in my previous article series. But while I was studying LSTM, a promising new algorithm was already attracting attention. The algorithm is named Transformer. It was first announced in a paper named “Attention Is All You Need,” and it outperformed conventional translation algorithms with lower computational costs.

In this article series, I am going to explain the minimum prerequisites for understanding deep learning in NLP (natural language processing) tasks, but NLP is not the main focus of this article series, and actually I do not study in the NLP field. I think Transformer is going to be a new major deep learning model alongside CNN and RNN, and the model is now being applied in various fields.

Even though Transformer is going to be a very general deep learning model, I still believe that some NLP is an effective way to understand Transformer, because language is a topic we all have in common. Unlike my previous article series, in which I tried to explain the theoretical side of RNN as precisely as possible, in this series I am going to focus on practical matters with my toy implementations of NLP tasks, largely based on the official Tensorflow tutorial. But I will still do my best to make the architecture of Transformer as straightforward to understand as possible, with various original figures.

This series is going to be composed of the articles below.

On the difficulty of language: prerequisites for NLP with Transformer
Seq2seq model and attention mechanism: a backbone of NLP with deep learning
Multi-head attention: the key component of Transformer
Positional encoding, residual connections, padding masks: covering the rest of Transformer components
How to make a toy English-German translator with multi-head attention heat maps: the overall architecture of Transformer
Transformer in image processing (Coming soon)

If you are in the field and can read the code in the official tutorial with no questions, this article series is not for you, but if you want to see how a Transformer works without going too deep into the details of NLP, this series is for you.

The algorithm known as PCA and my taxonomy of linear dimension reductions

November 23, 2020/in Artificial Intelligence, Big Data, Data Mining, Data Science, Machine Learning, Main Category, Mathematics, Statistics/by Yasuto Tamura

In one of my previous articles, I explained the importance of reducing dimensions. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are the simplest types of dimension reduction algorithms. In upcoming articles of mine, you are going to see what these algorithms do. To state the conclusion first, diagonalization, which I mentioned in the last article, is what these algorithms are all about, but in this article I can mainly cover only PCA.

This article is largely based on the explanations in Pattern Recognition and Machine Learning by C. M. Bishop (which is often called “PRML”), and when you search “PCA” on the Internet, you will find more or less similar explanations. However, I hope to go a few steps further in this article series: I am also planning to cover more generalized versions of PCA, the meaning of diagonalization, and the idea of subspaces. I believe this article series is also effective for refreshing your insight into linear algebra.

*This is the third article of my article series “Illustrative introductions on dimension reduction.”

1. My taxonomy on linear dimension reduction

*If you want to know right away what the algorithm called “PCA” is, you should skip this section for now to avoid confusion.

Out of the two algorithms I mentioned, PCA is especially important, and you will see the same or similar ideas in various fields such as signal processing, psychology, and structural mechanics. However, in most cases the word “PCA” refers to one certain algorithm of linear dimension reduction. Most articles or study materials only mention that “PCA,” and this article is also going to cover only the algorithm which most people call “PCA.” However, I found that PCA is only one branch of linear dimension reduction algorithms.

*From now on, every “PCA” in this article means the algorithm known as PCA unless I clearly mention the generalized KL transform.

*This chart might be confusing to you. According to PRML, PCA and the KL transform are identical. PCA has two formulations, the maximum variance formulation and the minimum error formulation, and they give the same result. However, according to a Japanese textbook which is very precise about this topic, the KL transform has two formulations, and what we call PCA is based on the maximum variance formulation. I am still not sure about the correct terminology, but in this article I am going to call the most general algorithm, I mean the root of the chart above, the “generalized KL transform.”

*Most materials just explain the most common PCA, but if you consider this generalized KL transform, I can introduce an intriguing classification algorithm called the subspace method. This algorithm was invented in Japan and is not so popular in machine learning textbooks in general, but learning this method would give you better insight into the idea of multidimensional space in machine learning. In the future, I am planning to cover this topic in this article series.

2. PCA

When someone mentions “PCA,” I am sure that for the most part it means the algorithm I am going to explain in the rest of this article. The most intuitive and straightforward way to explain PCA (Principal Component Analysis) is that it fits an oval to two dimensional data or an ellipsoid to three dimensional data. You can actually try plotting some random dots on a piece of paper and drawing the oval which fits the dots best. Assume that you have the 2 or 3 dimensional data below, and please try to fit an oval or an ellipsoid to the data.

I think this is nothing difficult, but I have a question: what was the logic behind your choice?

Some might have roughly drawn its outline. Formulas of “the surface” of general ellipsoids can be explained in several ways, but in this article you only have to consider ellipsoids whose center is the origin point of the coordinate system. In PCA you virtually shift data so that the mean of the data comes to the origin point of the coordinate system. When $A$ is a certain type of $D\times D$ matrix, the formula of a D-dimensional ellipsoid whose center is identical to the origin point is as follows: $(\boldsymbol{x}, A\boldsymbol{x}) = 1$ , where $\boldsymbol{x}\in \mathbb{R}^D$ . As is always the case with formulas in data science, you can visualize such ellipsoids if you are talking about 1, 2, or 3 dimensional data like in the figure below, but in general D-dimensional space, it is theoretical/imaginary stuff on blackboards.

*In order to explain the conditions which the matrix $A$ has to hold, I need another article, so for now please just assume that the $A$ is a kind of magical matrix.

You might have seen equations of 2 or 3 dimensional ellipsoids in the following way: $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$ , where $a\neq 0, b\neq 0$ or $\frac{x^2}{a^2} + \frac{y^2}{b^2} + \frac{z^2}{c^2}= 1$ , where $a\neq 0, b\neq 0, c \neq 0$ . These are special cases of the equation $(\boldsymbol{x}, A\boldsymbol{x}) = 1$ , where $A=diag(\frac{1}{a_1^2}, \dots, \frac{1}{a_D^2})$ . In this case the axes of the ellipsoids are the same as those of the coordinate system. Thus in this simple case, $A=diag(\frac{1}{a^2}, \frac{1}{b^2})$ or $A=diag(\frac{1}{a^2}, \frac{1}{b^2}, \frac{1}{c^2})$ .
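As a quick numerical check of how the familiar ellipse equation fits the form $(\boldsymbol{x}, A\boldsymbol{x}) = 1$ , here is a minimal sketch. The semi-axis lengths `a = 3` and `b = 2` are hypothetical values chosen only for illustration, and the diagonal entries of `A` are $\frac{1}{a^2}$ and $\frac{1}{b^2}$ .

```python
import numpy as np

a, b = 3.0, 2.0                        # hypothetical semi-axis lengths
A = np.diag([1 / a**2, 1 / b**2])      # matrix of the quadratic form

# Points on the ellipse x^2/a^2 + y^2/b^2 = 1, parametrized by an angle
theta = np.linspace(0.0, 2.0 * np.pi, 100)
points = np.stack([a * np.cos(theta), b * np.sin(theta)], axis=1)

# (x, Ax) for every point; each value should be 1 up to rounding
quadratic_form = np.einsum('ni,ij,nj->n', points, A, points)
print(np.allclose(quadratic_form, 1.0))
```

If you replace `A` with a symmetric matrix which is not diagonal, the same expression describes an ellipse whose axes are tilted against the coordinate axes, which is the case PCA cares about.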

I am going to explain these equations in detail in the upcoming articles. But there is one problem: how would you fit an ellipsoid when a data distribution does not look like an ellipsoid?

In fact we have to focus more on another feature of ellipsoids: all the axes of an ellipsoid are orthogonal. In short, the axes of the ellipsoids are what matter more in PCA, so I do want you to forget about the surface of ellipsoids for the time being. You might get confused if you also think about the surface of the ellipsoid now. I am planning to cover this topic in the next article. I hope this article, combined with the last one and the next one, will help you gain better insight into ideas which frequently appear in data science or machine learning contexts.

3. Fitting orthogonal axes on data

*If you have no trouble reading chapter 12.1 of PRML, you do not need to read this section or maybe even this article, but I hope at least some charts or code of mine will enhance your understanding of this topic.

*I must admit I wrote only the essence of PCA formulations. If that seems too abstract to you, you should just briefly read through this section and go to the next section with a more concrete example. If you are confused, there should be other good explanations of PCA on the internet, and you should also check them. But at least the visualization of PCA in the next section should be helpful.

As I implied above, all the axes of ellipsoids are orthogonal, and selecting the orthogonal axes which match the data is what PCA is all about. And when you choose those orthogonal axes, it is ideal if the data look like an ellipsoid. Simply put, we want the data to “swell” along the axes.

Then let’s see how to let them “swell,” more mathematically. Assume that you have 2 dimensional data plotted on a coordinate system $(\boldsymbol{e}_1, \boldsymbol{e}_2)$ as below (The samples are plotted in purple). Intuitively, the data “swell” the most along the vector $\boldsymbol{u}_1$ . Also it is clear that $\boldsymbol{u}_2$ is the only direction orthogonal to $\boldsymbol{u}_1$ . We can expect that the new coordinate system $(\boldsymbol{u}_1, \boldsymbol{u}_2)$ expresses the data in a better way, and you can get new coordinate points of the samples by projecting them on the new axes as done with yellow lines below.

Next, let’s think about a case with 3 dimensional data. When you have 3 dimensional data in a coordinate system $(\boldsymbol{e}_1, \boldsymbol{e}_2,\boldsymbol{e}_3)$ as below, the data also “swell” the most along $\boldsymbol{u}_1$ . And the data swell the second most along $\boldsymbol{u}_2$ . The two axes, or vectors, span the plane in purple. If you project all the samples on the plane, you will get the 2 dimensional data at the right side. It is important that we did not consider the third axis. That implies you might be able to display the data well with only a 2 dimensional space, which is spanned by the two axes $\boldsymbol{u}_1, \boldsymbol{u}_2$ .

Thus the problem is how to calculate such an axis $\boldsymbol{u}_1$ . We want the variance of the data projected on $\boldsymbol{u}_1$ to be the biggest. The coordinate of a data point $\boldsymbol{x}_n$ on the axis $\boldsymbol{u}_1$ is calculated by projecting $\boldsymbol{x}_n$ on $\boldsymbol{u}_1$ . In a data science context, such projection is synonymous with taking an inner product of $\boldsymbol{x}_n$ and $\boldsymbol{u}_1$ , that is calculating $\boldsymbol{u}_1^T \boldsymbol{x}_n$ .
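To make the projection concrete, here is a small sketch with a hypothetical unit vector and a hypothetical data point in two dimensions (neither comes from the figures of this article). It computes the coordinate $\boldsymbol{u}_1^T \boldsymbol{x}_n$ and checks that what is left over after the projection is orthogonal to the axis.

```python
import numpy as np

u1 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # hypothetical unit vector (the new axis)
x_n = np.array([3.0, 1.0])                 # hypothetical data point

coordinate = u1 @ x_n                 # u_1^T x_n, the coordinate on the axis u_1
projected_point = coordinate * u1     # foot of the perpendicular, in original coordinates
residual = x_n - projected_point      # the part of x_n the axis cannot express

print(coordinate, projected_point, residual @ u1)
```

Note that the inner product gives the coordinate directly only because `u1` has length 1, which is why the constraint on the length of the axis matters later.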

*Each element of $\boldsymbol{x}_n$ is the coordinate of the data point $\boldsymbol{x}_n$ in the original coordinate system. And the projected data on $\boldsymbol{u}_1$ whose coordinates are 1-dimensional correspond to only one element of transformed data.

To calculate the variance of the data projected on $\boldsymbol{u}_1$ , we just have to calculate the mean of the squared deviations of the 1-dimensional data projected on $\boldsymbol{u}_1$ . Assume that $\bar{\boldsymbol{x}}$ is the mean of the data in the original coordinate system, then the deviation of $\boldsymbol{x}_n$ on the axis $\boldsymbol{u}_1$ is calculated as $\boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}}$ , as shown in the figure. Hence the variance, I mean the mean of the squared deviations, is $\frac{1}{N} \sum_{n=1}^{N}{(\boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}})^2}$ , where $N$ is the total number of data points. After some transformations, you get the next equation: $\frac{1}{N} \sum_{n=1}^{N}{(\boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}})^2} = \boldsymbol{u}_1^T S \boldsymbol{u}_1$ , where $S = \frac{1}{N}\sum_{n=1}^{N}{(\boldsymbol{x}_n - \bar{\boldsymbol{x}})(\boldsymbol{x}_n - \bar{\boldsymbol{x}})^T}$ . $S$ is known as a covariance matrix.
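The identity between the variance of the projected data and $\boldsymbol{u}_1^T S \boldsymbol{u}_1$ can be checked numerically. The sketch below uses randomly generated 3 dimensional data and an arbitrary unit vector, both of which are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing      # hypothetical data, one sample per row

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)    # covariance matrix with the 1/N convention

u1 = np.array([1.0, 2.0, 2.0]) / 3.0        # an arbitrary unit vector

# Left-hand side: mean of squared deviations of the projected 1-dimensional data
variance_direct = np.mean((X @ u1 - u1 @ x_bar) ** 2)
# Right-hand side: the quadratic form u_1^T S u_1
variance_quadratic = u1 @ S @ u1

print(variance_direct, variance_quadratic)
```

The two printed numbers should agree up to rounding error, whatever unit vector you choose.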

We are now interested in maximizing the variance of the projected data, $\boldsymbol{u}_1^T S \boldsymbol{u}_1$ , and for the mathematical derivation we need some college level calculus, so if that is too much for you, you can skip this part and go to the next section.

We now want to calculate the $\boldsymbol{u}_1$ with which $\boldsymbol{u}_1^T S \boldsymbol{u}_1$ takes its maximum value. General $\boldsymbol{u}_i$ , including $\boldsymbol{u}_1$ , are just coordinate axes after PCA, so we are only interested in their directions. Thus we can set one constraint $\boldsymbol{u}_1^T \boldsymbol{u}_1 = 1$ . Introducing a Lagrange multiplier $\lambda_1$ , we only have to solve the next optimization problem: $\boldsymbol{u}_1 ^ {*} = \mathop{\rm arg~max}\limits_{\boldsymbol{u}_1} \{ \boldsymbol{u}_1^T S \boldsymbol{u}_1 + \lambda_1 (1 - \boldsymbol{u}_1^T \boldsymbol{u}_1) \}$ . In conclusion $\boldsymbol{u}_1 ^ {*}$ satisfies $S\boldsymbol{u}_1 ^ {*} = \lambda_1 \boldsymbol{u}_1 ^ {*}$ . If you have read my last article on eigenvectors, you will soon realize that this is an equation for calculating eigenvectors, and that means $\boldsymbol{u}_1 ^ {*}$ is one of the eigenvectors of the covariance matrix $S$ . Given the eigenvector equation, the next equation holds: $(\boldsymbol{u}_1 ^ {*})^T S \boldsymbol{u}_1 ^ {*} = \lambda_1$ . We have seen that $\boldsymbol{u}_1^T S \boldsymbol{u}_1$ is the variance of the data when projected on a vector $\boldsymbol{u}_1$ , so the variance along an eigenvector equals its eigenvalue. Thus the largest eigenvalue $\lambda_1$ is the biggest variance possible when the data are projected on a vector.
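For readers who want the intermediate steps of this derivation, they can be written out as below. This is the standard Lagrange multiplier calculation, and it uses the fact that the covariance matrix is symmetric ( $S^T = S$ ).

```latex
\begin{aligned}
L(\boldsymbol{u}_1, \lambda_1) &= \boldsymbol{u}_1^T S \boldsymbol{u}_1 + \lambda_1 (1 - \boldsymbol{u}_1^T \boldsymbol{u}_1) \\
\frac{\partial L}{\partial \boldsymbol{u}_1} &= 2 S \boldsymbol{u}_1 - 2 \lambda_1 \boldsymbol{u}_1 = \boldsymbol{0} \\
\Rightarrow \quad S \boldsymbol{u}_1 &= \lambda_1 \boldsymbol{u}_1 \\
\Rightarrow \quad \boldsymbol{u}_1^T S \boldsymbol{u}_1 &= \lambda_1 \boldsymbol{u}_1^T \boldsymbol{u}_1 = \lambda_1
\end{aligned}
```

The last line multiplies the eigenvector equation by $\boldsymbol{u}_1^T$ from the left and applies the constraint $\boldsymbol{u}_1^T \boldsymbol{u}_1 = 1$ .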

In just the same way you can calculate the next biggest eigenvalue $\lambda_2$ , and it is the second biggest variance possible. In this case the data are projected on $\boldsymbol{u}_2$ , which is orthogonal to $\boldsymbol{u}_1$ . Likewise you can calculate the orthogonal 3rd, 4th, …, Dth eigenvectors.
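The claims of this section can be tried out on synthetic data. The sketch below is not the derivation itself, just a sanity check under assumed random data: it sorts the eigenvectors of a covariance matrix by eigenvalue, confirms that they are orthonormal, and compares the largest eigenvalue with the variances along many random unit directions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.5, 0.5])   # hypothetical data
X_centered = X - X.mean(axis=0)
S = X_centered.T @ X_centered / len(X)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigen_values, eigen_vectors = np.linalg.eigh(S)
order = np.argsort(eigen_values)[::-1]          # largest eigenvalue first
eigen_values = eigen_values[order]
eigen_vectors = eigen_vectors[:, order]         # column i is u_{i+1}

# Variances of the data along 1000 random unit directions
random_dirs = rng.normal(size=(1000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_variances = np.einsum('ni,ij,nj->n', random_dirs, S, random_dirs)

print(eigen_values[0], random_variances.max())
```

No random direction should give a variance larger than the first eigenvalue, although some come close when they happen to point almost along $\boldsymbol{u}_1$ .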

*To be exact I have to explain the cases where we can get such D orthogonal eigenvectors, but that is going to be long. I hope I can do that in the next article.

4. Practical three dimensional example of PCA

We have seen that PCA is sequentially choosing orthogonal axes along which data points swell the most. Also we have seen that this is equivalent to calculating the eigenvectors of the covariance matrix of the data, in order from the largest eigenvalue to the smallest one. From now on let’s work on a practical example of data. Assume that we have 30 students’ scores on Japanese, math, and English tests as below.

* I think the subject “Japanese” is equivalent to “English” or “language art” in English speaking countries, and maybe “Deutsch” in Germany. This example and the explanation are largely based on a Japanese textbook named 「これなら分かる応用数学教室　最小二乗法からウェーブレットまで」. This is a famous textbook with cool and precise explanations on mathematics for engineering. Partly sharing this is one of purposes of this article.

At the right side of the figure below are plots of the scores with all the combinations of coordinate axes. In total 9 graphs are symmetrically arranged in the figure, and it is easy to see that English and Japanese, or English and math, have relatively high correlation. The stronger the linear correlation between two axes is, the bigger the covariance between them is.

In the last article, I visualized the eigenvectors of a $3\times 3$ matrix $A = \frac{1}{50} \begin{pmatrix} 60.45 & 33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}$ , and in fact the matrix is just a constant multiple of this covariance matrix. I think now you understand that PCA is calculating the orthogonal eigenvectors of the covariance matrix of data, that is, diagonalizing the covariance matrix with orthonormal eigenvectors. Hence we can guess that a covariance matrix enables a type of linear transformation composed of rotation and of expansion and contraction of vectors. And data points swell along the eigenvectors of such a matrix.

Then why is PCA useful? In order to see that, first assume for simplicity that $x, y, z$ denote Japanese, Math, English scores respectively. The mean of the data is $\left( \begin{array}{c} \bar{x} \\ \bar{y} \\ \bar{z} \end{array} \right) = \left( \begin{array}{c} 58.1 \\ 61.8 \\ 67.3 \end{array} \right)$ , and the covariance matrix of data in the original coordinate system is $V_{xyz} = \begin{pmatrix} 60.45 & 33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}$ . The eigenvalues of $V_{xyz}$ are $\lambda_1=148.34, \lambda_2 = 30.62$ , and $\lambda_3 = 3.60$ , and their corresponding unit eigenvectors are $\boldsymbol{u}_1 = \left( \begin{array}{c} 0.540 \\ 0.602 \\ 0.588 \end{array} \right) , \boldsymbol{u}_2 = \left( \begin{array}{c} 0.736 \\ -0.677 \\ 0.0174 \end{array} \right) , \boldsymbol{u}_3 = \left( \begin{array}{c} -0.408 \\ -0.423 \\ 0.809 \end{array} \right)$ respectively. $U = (\boldsymbol{u}_1 \quad \boldsymbol{u}_2 \quad \boldsymbol{u}_3 )$ is an orthonormal matrix, where $\boldsymbol{u}_i^T\boldsymbol{u}_j = \begin{cases} 1 & (i=j) \\ 0 & (otherwise) \end{cases}$ . As I explained in the last article, you can diagonalize $V_{xyz}$ with $U$ : $U^T V_{xyz}U = diag(\lambda_1, \lambda_2, \lambda_3)$ .
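If you want to reproduce these numbers yourself, the following sketch diagonalizes the covariance matrix $V_{xyz}$ quoted above with NumPy. It is only a check of the quoted values. Keep in mind that the sign of each eigenvector is arbitrary, so a library may return some $\boldsymbol{u}_i$ multiplied by $-1$ .

```python
import numpy as np

V_xyz = np.array([[60.45, 33.63, 46.29],
                  [33.63, 68.49, 50.93],
                  [46.29, 50.93, 53.61]])

# eigh handles symmetric matrices; it returns eigenvalues in ascending order
eigen_values, U = np.linalg.eigh(V_xyz)
order = np.argsort(eigen_values)[::-1]          # reorder: largest eigenvalue first
eigen_values, U = eigen_values[order], U[:, order]

print(np.round(eigen_values, 2))        # should be close to lambda_1, lambda_2, lambda_3
print(np.round(U, 3))                   # columns are u_1, u_2, u_3 (up to sign)
print(np.round(U.T @ V_xyz @ U, 2))     # should be (almost) diagonal
```

The off-diagonal elements of the last matrix vanish up to rounding error, which is exactly what diagonalization with $U$ means.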

In order to see how PCA is useful, assume that $\left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right) = U^T \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right)$ .

Let’s take a brief look at what a linear transformation by $U^T$ means. Each element of $\boldsymbol{x}$ denotes a coordinate of the data point $\boldsymbol{x}$ in the original coordinate system (In this case the original coordinate system is composed of $\boldsymbol{e}_1, \boldsymbol{e}_2$ , and $\boldsymbol{e}_3$ ). $U = (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3)$ enables a rotation of a rigid body, which means the shape or arrangement of data will not change after the rotation, and $U^T$ enables the reverse rotation of the rigid body.

*Roughly speaking, if you hold a hard object such as a metal ball and rotate your arm, that is a rotation of a rigid body, and your shoulder is the origin point. On the other hand, if you hold something soft like a marshmallow, it would be squashed in your hand, and that is not a rotation of a rigid body.

You can rotate $\boldsymbol{x}$ with $U^T$ as follows: $U^T\boldsymbol{x} = \left( \begin{array}{c} -\boldsymbol{u}_1^{T}- \\ -\boldsymbol{u}_2^{T}- \\ -\boldsymbol{u}_3^{T}- \end{array} \right)\boldsymbol{x}=\left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right)$ , and $\boldsymbol{u}_i^{T}\boldsymbol{x}$ is the coordinate of $\boldsymbol{x}$ projected on the axis $\boldsymbol{u}_i$ .
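The following sketch illustrates this with the eigenvectors of the covariance matrix $V_{xyz}$ given earlier and one hypothetical point `x` (the point is made up for this example). It checks that each element of $U^T\boldsymbol{x}$ is an inner product with one axis, that the length of the vector does not change, and that $U$ brings the point back.

```python
import numpy as np

V_xyz = np.array([[60.45, 33.63, 46.29],
                  [33.63, 68.49, 50.93],
                  [46.29, 50.93, 53.61]])
_, U = np.linalg.eigh(V_xyz)        # columns of U are orthonormal eigenvectors

x = np.array([3.0, -1.0, 2.0])      # hypothetical point in the original coordinates

x_new = U.T @ x                     # coordinates of x on the axes u_i
x_back = U @ x_new                  # the reverse transformation

print(x_new)
print(np.linalg.norm(x), np.linalg.norm(x_new))   # lengths before and after
print(x_back)                                      # should reproduce x
```

Because lengths (and also angles) are preserved, the arrangement of the data points relative to each other stays the same. Only the axes you describe them with change.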

Let’s see this more visually. Assume that the data point $\boldsymbol{x}$ is a purple dot and its position is expressed in the original coordinate system spanned by black arrows . By multiplying $\boldsymbol{x}$ with $U^T$ , the purple point $\boldsymbol{x}$ is projected on the red axes respectively, and the product $\left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right)$ denotes the coordinate point of the purple point in the red coordinate system. $\boldsymbol{x}$ is rotated this way, but for now I think it is better to think that the data are projected on new coordinate axes rather than the data themselves are rotating.

Now that we have seen what rotation by $U$ means, you should have a clearer image of what $\left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right) = U^T \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right)$ means. $\left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right)$ denotes the coordinates of data projected on the new axes $\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3$ , which are unit eigenvectors of $V_{xyz}$ . In the coordinate system spanned by the eigenvectors, the data are distributed as below.

By multiplying both sides of the equation above by $U$ from the left, we get $\left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) =U \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right)$ , which means you can express deviations of the original data as linear combinations of the three factors $\xi, \eta$ , and $\zeta$ . We expect that those three factors contain keys for understanding the original data more efficiently. If you concretely write down all the equations for the factors, you get $\xi = 0.540 (x - \bar{x}) + 0.602 (y - \bar{y}) + 0.588 (z - \bar{z})$ , $\eta = 0.736(x - \bar{x}) - 0.677 (y - \bar{y}) + 0.0174 (z - \bar{z})$ , and $\zeta = - 0.408 (x - \bar{x}) - 0.423 (y - \bar{y}) + 0.809(z - \bar{z})$ . If you examine the coefficients of the deviations $(x - \bar{x}), (y - \bar{y})$ , and $(z - \bar{z})$ , you can observe that $\xi$ almost equally reflects the deviation of the scores of all the subjects, thus we can say $\xi$ is a factor indicating one’s general academic level. When it comes to $\eta$ , Japanese and Math scores are important and have opposite signs, so we can guess that this factor indicates whether the student is more on the “scientific side” or the “liberal art side.” In the same way $\zeta$ puts relatively much weight on one’s English score, so it should show one’s “internationality.” However the covariance of the data $\xi, \eta, \zeta$ is $V_{\xi \eta \zeta} = \begin{pmatrix} 148.34 & 0 & 0 \\ 0 & 30.62 & 0 \\ 0 & 0 & 3.60 \end{pmatrix}$ . You can see that $\zeta$ varies little from student to student, which means it is relatively unimportant for describing the tendency of the data. Therefore for dimension reduction you can cut off the factor $\zeta$ .

*Assume that you apply PCA to D-dimensional data and that you get $\boldsymbol{x}'$ , where $\boldsymbol{x}' = U^T(\boldsymbol{x} - \bar{\boldsymbol{x}})$ . The covariance matrix of the data in the new D-dimensional coordinate system is $V'=\frac{1}{N}\sum{\boldsymbol{x}'(\boldsymbol{x}')^T}$ $=\frac{1}{N}\sum{U^T(\boldsymbol{x} - \bar{\boldsymbol{x}})(U^T(\boldsymbol{x} - \bar{\boldsymbol{x}}))^T}$ $=\frac{1}{N}\sum{U^T(\boldsymbol{x} - \bar{\boldsymbol{x}})(\boldsymbol{x} - \bar{\boldsymbol{x}})^TU}$ $=U^T(\frac{1}{N}\sum{(\boldsymbol{x} - \bar{\boldsymbol{x}})(\boldsymbol{x} - \bar{\boldsymbol{x}})^T})U$ $=U^TVU$ $=diag(\lambda_1, \dots, \lambda_D)$ . This means that in the new coordinate system after PCA, the covariances between any pair of different variables are all zero.

*As I mentioned, $U$ is a rotation of a rigid body, and $U^T$ is the reverse rotation, hence $U^TU = UU^T = I$ .
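The two notes above can be confirmed numerically. The sketch below uses hypothetical 4 dimensional random data (an assumption just for this check), transforms the centered data with the eigenvectors of its covariance matrix, and looks at the covariance matrix in the new coordinate system.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0],
                   [0.5, 0.5, 0.5, 0.0],
                   [0.2, 0.1, 0.3, 0.4]])
# Hypothetical correlated data with a nonzero mean, one sample per row
X = rng.normal(size=(1000, 4)) @ mixing + np.array([5.0, -3.0, 1.0, 0.0])

x_bar = X.mean(axis=0)
V = (X - x_bar).T @ (X - x_bar) / len(X)     # covariance in the original coordinates
eigen_values, U = np.linalg.eigh(V)

X_new = (X - x_bar) @ U                      # each row is U^T (x - x_bar)
V_new = X_new.T @ X_new / len(X)             # covariance in the new coordinates

print(np.round(V_new, 6))
print(np.round(U.T @ U, 6))
```

The first matrix printed should be diagonal with the eigenvalues on its diagonal, and the second should be the identity matrix.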

Hence you can approximate the original 3 dimensional data on the coordinate system $(\boldsymbol{e}_1, \boldsymbol{e}_2, \boldsymbol{e}_3)$ from the reduced two dimensional coordinate system $(\boldsymbol{u}_1, \boldsymbol{u}_2)$ with the following equation: $\left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) \approx U_{reduced} \left( \begin{array}{c} \xi \\ \eta \end{array} \right) = (\boldsymbol{u}_1 \quad \boldsymbol{u}_2) \left( \begin{array}{c} \xi \\ \eta \end{array} \right)$ . Then it is mathematically clearer that we can express the data with two factors: “how smart the student is” and “whether the student is on the scientific side or the liberal art side.”
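Here is a minimal sketch of this approximation, using the 30 students’ scores listed in the code attached at the end of this article. The variable names are mine, and the check on the error is an addition for illustration: with the $\frac{1}{N}$ convention for the covariance matrix, the mean squared error of the approximation equals the eigenvalue of the axis you cut off.

```python
import numpy as np

jp_score = np.array([49, 58, 64, 65, 54, 58, 49, 67, 54, 66, 72, 66, 54, 64, 39,
                     56, 54, 56, 48, 57, 57, 47, 50, 60, 72, 54, 59, 61, 64, 70])
math_score = np.array([51, 58, 56, 70, 45, 70, 45, 69, 66, 73, 71, 72, 57, 53, 58,
                       57, 71, 63, 53, 62, 62, 59, 57, 65, 74, 66, 72, 50, 69, 60])
en_score = np.array([59, 63, 68, 77, 55, 71, 57, 79, 66, 81, 81, 77, 62, 67, 56,
                     62, 70, 67, 61, 70, 68, 59, 61, 71, 77, 66, 70, 59, 68, 71])

data_matrix = np.c_[jp_score, math_score, en_score]       # shape (30, 3)
mean_vector = data_matrix.mean(axis=0)
deviations = data_matrix - mean_vector
covariance_matrix = deviations.T @ deviations / len(data_matrix)

eigen_values, U = np.linalg.eigh(covariance_matrix)
order = np.argsort(eigen_values)[::-1]                    # largest eigenvalue first
eigen_values, U = eigen_values[order], U[:, order]

U_reduced = U[:, :2]                                      # keep only u_1 and u_2
factors = deviations @ U_reduced                          # (xi, eta) of every student
approximation = factors @ U_reduced.T + mean_vector       # back to the three scores

mean_squared_error = np.mean(np.sum((data_matrix - approximation) ** 2, axis=1))
print(mean_squared_error, eigen_values[2])
```

The two printed values should coincide, which shows that the information you lose by dropping $\zeta$ is measured by $\lambda_3$ .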

We can observe that the eigenvalue $\lambda_i$ is a statistic which indicates how much the corresponding $\boldsymbol{u}_i$ can express the data. $\frac{\lambda_i}{\sum_{j=1}^{D}{\lambda_j}}$ is called the contribution ratio of the eigenvector $\boldsymbol{u}_i$ . In the example above, the contribution ratios of $\boldsymbol{u}_1, \boldsymbol{u}_2,$ and $\boldsymbol{u}_3$ are respectively $\frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3}=0.813$ , $\frac{\lambda_2}{\lambda_1 + \lambda_2 + \lambda_3}=0.168$ , $\frac{\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3}=0.0197$ . You can decide how many dimensions to keep based on this information.
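The arithmetic for the contribution ratios takes only a few lines. The sketch below uses the three eigenvalues of the example and also computes the cumulative ratio, which is what you usually look at when deciding how many axes to keep.

```python
import numpy as np

eigen_values = np.array([148.34, 30.62, 3.60])    # lambda_1, lambda_2, lambda_3

contribution_ratio = eigen_values / eigen_values.sum()
cumulative_ratio = np.cumsum(contribution_ratio)  # ratio kept by the first k axes

print(np.round(contribution_ratio, 4))
print(np.round(cumulative_ratio, 4))
```

The first two axes together keep about 98 percent of the total variance, which is why cutting off the third one loses so little.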

Appendix: Playing with my toy PCA on MNIST dataset

Applying the “so called” PCA to the MNIST dataset is a super typical topic that many other tutorials on PCA also introduce, but I still recommend that you actually implement, or at least trace, a PCA implementation on the MNIST dataset without using libraries like scikit-learn. While reading this article I recommend that you actually run the first and the second code below. I think you can just copy and paste them into your Python environment and run them after installing the necessary libraries. I wrote them on Jupyter Notebook.

In my implementation, in the simple configuration part you can set the boolean USE_ALL_NUMBERS to True or False. If you set it to True, you apply PCA to all the data of digits from 0 to 9. If you set it to False, you can specify which digit to apply PCA to with WHICH_NUMBER. In this article, I show the results of PCA on the data of the digit ‘3.’ The first three images of ‘3’ are as below.

You have to keep in mind that the data are all shown as 28 by 28 pixel grayscale images, but in the process of PCA, they are all processed as 28 * 28 = 784 dimensional vectors. After applying PCA to the 784 dimensional vectors of images of ‘3,’ the first 25 eigenvectors are as below. You can see that at the beginning the eigenvectors partly retain the shapes of ‘3,’ but they are distorted as the eigenvalues get smaller. We can guess that the eigenvectors with smaller eigenvalues are not that helpful in reconstructing the shape of ‘3.’

Just as we saw in the last section, you can cut off the axes of eigenvectors with small eigenvalues and reduce the dimension of MNIST data. The figure below shows how the cumulative contribution ratio of MNIST data grows. You can see that at around 200 dimensions, the contribution ratio reaches around 0.95. Then we can guess that even if we reduce the dimension of MNIST from 784 to 200, we can retain most of the structure of the original data.
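Reading the number of dimensions off the curve can also be automated. The helper below, `dimensions_needed`, is a hypothetical function written for this illustration (it is not part of my code at the end of the article), and to keep the example small it is tried on the three eigenvalues of the test score example instead of MNIST.

```python
import numpy as np

def dimensions_needed(eigen_values, threshold):
    """Smallest number of leading eigenvectors whose cumulative
    contribution ratio reaches `threshold` (hypothetical helper)."""
    sorted_values = np.sort(np.asarray(eigen_values, dtype=float))[::-1]
    cumulative = np.cumsum(sorted_values) / sorted_values.sum()
    # searchsorted gives the first index where the cumulative ratio >= threshold
    index = int(np.searchsorted(cumulative, threshold))
    return min(index + 1, len(sorted_values))

example_eigen_values = [148.34, 30.62, 3.60]
print(dimensions_needed(example_eigen_values, 0.95))
```

For the three eigenvalues of the score example a threshold of 0.95 needs two axes. If you pass the eigenvalues computed in the MNIST code, the same call would tell you where the curve crosses 0.95.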

Some results of reconstruction of data from 200 dimensional space are as below. You can set how many images to display by adjusting NUMBER_OF_RESULTS in the code. And if you set LATENT_DIMENSION as 784, you can completely reconstruct the data.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

*I attached the code I used to make the figures in this article. You can just copy, paste, and run it, installing the necessary libraries where needed.

import numpy as np
import keras  # There should be some other simpler ways to download MNIST dataset,
              # but at least on my laptop, this was the easiest.
import matplotlib.pyplot as plt

# Configuration part
USE_ALL_NUMBERS = False
WHICH_NUMBER = 3
W_RAW_DATA = 3
H_RAW_DATA = 1  # 1<= W_RAW_DATA*H_RAW_DATA <= 28*28
H_EIGENVECTORS = 5
W_EIGENVECTORS = 5  # 1<= H_EIGENVECTORS*W_EIGENVECTORS <= 28*28

# Preparing data: each 28x28 image is flattened into a 784 dimensional vector
mnist = keras.datasets.mnist
(data, labels), (_, _) = mnist.load_data()
data = data.reshape([-1, 28*28])

if USE_ALL_NUMBERS:
    data_to_use = data
    labels_to_use = labels
else:
    data_to_use = data[labels == WHICH_NUMBER]
    labels_to_use = labels[labels == WHICH_NUMBER]

# Applying PCA to the data to use
ave = np.mean(data_to_use, axis=0)
centered = data_to_use - ave
covariance_matrix = np.dot(centered.T, centered) / len(data_to_use)
# The covariance matrix is symmetric, so eigh is used: it returns real
# eigenvalues and orthonormal eigenvectors (eig can return complex values).
eigen_values, eigen_vectors = np.linalg.eigh(covariance_matrix)
sorted_index = eigen_values.argsort()[::-1]  # largest eigenvalue first
eigen_values = eigen_values[sorted_index]
eigen_vectors = eigen_vectors[:, sorted_index]

# Plotting the first raw images
plt.figure(1, figsize=(5*W_RAW_DATA, 5*H_RAW_DATA))
plt.gray()
for i in range(H_RAW_DATA*W_RAW_DATA):
    plt.subplot(H_RAW_DATA, W_RAW_DATA, i + 1)
    img = data_to_use[i, :].reshape(28, 28)
    plt.pcolor(img)
    plt.text(1, 2, "No. %d raw data" % (i + 1),
             color='orange', fontsize=20)
    plt.xlim(0, 27)
    plt.ylim(27, 0)
    plt.grid(True, color='white')
plt.savefig("MNIST_raw_data_3.png")
plt.show()

# Plotting the leading eigenvectors as 28x28 images
plt.figure(2, figsize=(W_EIGENVECTORS*5, H_EIGENVECTORS*5))
plt.gray()
for i in range(H_EIGENVECTORS*W_EIGENVECTORS):
    plt.subplot(H_EIGENVECTORS, W_EIGENVECTORS, i + 1)
    img = eigen_vectors.T[i, :].reshape(28, 28)
    plt.pcolor(img)
    plt.text(1, 2, "No. %d eigenvector" % (i + 1),
             color='orange', fontsize=20)
    plt.xlim(0, 27)
    plt.ylim(27, 0)
    plt.grid(True, color='white')
plt.savefig("MNIST_PCA_eigenvectors.png")
plt.show()

import numpy as np
import keras
import matplotlib.pyplot as plt

# Configuration part
USE_ALL_NUMBERS = True
WHICH_NUMBER = 3  # Only used when USE_ALL_NUMBERS is False
NUMBER_OF_RESULTS = 10
LATENT_DIMENSION = 95  # From 1 to 28*28=784

# Preparing data
mnist = keras.datasets.mnist
(data, labels), (_, _) = mnist.load_data()
data = data.reshape([-1, 28*28])

if USE_ALL_NUMBERS:
    data_to_use = data
    labels_to_use = labels
else:
    data_to_use = data[labels == WHICH_NUMBER]
    labels_to_use = labels[labels == WHICH_NUMBER]

# Applying PCA to the data to use
ave = np.mean(data_to_use, axis=0)
centered = data_to_use - ave
covariance_matrix = np.dot(centered.T, centered) / len(data_to_use)
# eigh is used because the covariance matrix is symmetric
eigen_values, eigen_vectors = np.linalg.eigh(covariance_matrix)
sorted_index = eigen_values.argsort()[::-1]  # largest eigenvalue first
eigen_values = eigen_values[sorted_index]
eigen_vectors = eigen_vectors[:, sorted_index]

# Cumulative contribution ratio: the k-th entry is the ratio kept by the first k eigenvectors
contribution_ratio = np.cumsum(eigen_values) / eigen_values.sum()

plt.figure(1, figsize=(10, 6))
plt.title("Contribution ratio of MNIST data", fontsize=20)
plt.plot(np.arange(1, len(eigen_values) + 1), contribution_ratio)
plt.xlabel('Dimension degree')
plt.ylabel('Contribution ratio')
plt.show()

# Projecting the centered data on the eigenvectors, keeping the first
# LATENT_DIMENSION coordinates, and mapping them back to pixel space
U_reduced = eigen_vectors[:, :LATENT_DIMENSION]
data_transformed = np.dot(centered, eigen_vectors)
data_transformed_and_reduced = data_transformed[:, :LATENT_DIMENSION]
x_recomposed = np.dot(data_transformed_and_reduced, U_reduced.T) + ave

# Left column: reconstructed images, right column: original images
plt.figure(2, figsize=(2*5, NUMBER_OF_RESULTS*5))
plt.subplots_adjust(hspace=0.4)
plt.gray()
for i in range(NUMBER_OF_RESULTS*2):
    plt.subplot(NUMBER_OF_RESULTS, 2, i + 1)
    if i % 2 == 0:
        img = x_recomposed[i//2, :].reshape(28, 28)
        plt.title("Reconstructed")
    else:
        img = data_to_use[i//2, :].reshape(28, 28)
        plt.title("Original")
    plt.pcolor(img)
    plt.text(3, 5, "%d" % (labels_to_use[i//2]),
             color='orange', fontsize=28)
    plt.xlim(0, 27)
    plt.ylim(27, 0)
    plt.grid(True, color='white')
plt.show()


import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.proj3d import proj_transform
from mpl_toolkits.mplot3d.axes3d import Axes3D
from matplotlib.text import Annotation
from matplotlib.patches import FancyArrowPatch
import matplotlib.patches as mpatches

# A text annotation anchored at a 3D point; it is projected to 2D when drawn.
class Annotation3D(Annotation):
    def __init__(self, text, xyz, *args, **kwargs):
        super().__init__(text, xy=(0,0), *args, **kwargs)
        self._xyz = xyz

    def draw(self, renderer):
        x2, y2, z2 = proj_transform(*self._xyz, renderer.M)
        self.xy=(x2,y2)
        super().draw(renderer)

def _annotate3D(ax,text, xyz, *args, **kwargs):
    '''Add annotation `text` to an `Axes3D` instance.'''
    annotation= Annotation3D(text, xyz, *args, **kwargs)
    ax.add_artist(annotation)

setattr(Axes3D,'annotate3D',_annotate3D)

# An arrow from (x, y, z) to (x + dx, y + dy, z + dz), projected to 2D when drawn.
class Arrow3D(FancyArrowPatch):
    def __init__(self, x, y, z, dx, dy, dz, *args, **kwargs):
        super().__init__((0,0), (0,0), *args, **kwargs)
        self._xyz = (x,y,z)
        self._dxdydz = (dx,dy,dz)

    def draw(self, renderer):
        x1,y1,z1 = self._xyz
        dx,dy,dz = self._dxdydz
        x2,y2,z2 = (x1+dx,y1+dy,z1+dz)
        xs, ys, zs = proj_transform((x1,x2),(y1,y2),(z1,z2), renderer.M)
        self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
        super().draw(renderer)

def _arrow3D(ax, x, y, z, dx, dy, dz, *args, **kwargs):
    '''Add a 3D arrow to an `Axes3D` instance.'''
    arrow = Arrow3D(x, y, z, dx, dy, dz, *args, **kwargs)
    ax.add_artist(arrow)

setattr(Axes3D,'arrow3D',_arrow3D)

jp_score = np.array([49, 58, 64, 65, 54, 58, 49, 67, 54, 66, 72, 66, 54, 64, 39,
                     56, 54, 56, 48, 57, 57, 47, 50, 60, 72, 54, 59, 61, 64, 70])
math_score = np.array([51, 58, 56, 70, 45, 70, 45, 69, 66, 73, 71, 72, 57, 53, 58,
                       57, 71, 63, 53, 62, 62, 59, 57, 65, 74, 66, 72, 50, 69, 60])
en_score = np.array([59, 63, 68, 77, 55, 71, 57, 79, 66, 81, 81, 77, 62, 67, 56,
                     62, 70, 67, 61, 70, 68, 59, 61, 71, 77, 66, 70, 59, 68, 71])

# Center the data and build the covariance matrix.
mean_vector = np.array([jp_score.mean(), math_score.mean(), en_score.mean()])
data_matrix = np.c_[jp_score, math_score, en_score]
data_mean_reduced = data_matrix - mean_vector
covariance_matrix = np.dot(data_mean_reduced.T, data_mean_reduced) / len(jp_score)

# Eigen decomposition, sorted by descending eigen value.
eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)
sorted_idx = eigen_values.argsort()[::-1]
eigen_values = eigen_values[sorted_idx]
eigen_vectors = eigen_vectors[:,sorted_idx]
eigen_vectors = eigen_vectors.astype(float)

subject_labels = ['Japanese score deviation', 'Math score deviation', 'English score deviation']

# Grid vectors (U), their images under the covariance matrix (V), and scaled displacements (W).
const_range = 2
X = np.arange(-const_range, const_range + 1, 1)
Y = np.arange(-const_range, const_range + 1, 1)
Z = np.arange(-const_range, const_range + 1, 1)
U_x, U_y, U_z = np.meshgrid(X, Y, Z)

V_x = np.zeros((len(U_x), len(U_y), len(U_z)))
V_y = np.zeros((len(U_x), len(U_y), len(U_z)))
V_z = np.zeros((len(U_x), len(U_y), len(U_z)))

temp_vec = np.zeros((1, 3))

W_x = np.zeros((len(U_x), len(U_y), len(U_z)))
W_y = np.zeros((len(U_x), len(U_y), len(U_z)))
W_z = np.zeros((len(U_x), len(U_y), len(U_z)))

fig = plt.figure(figsize=(15, 15))
grid_range = 15

for idx in range(2):
    if idx ==0:
        # Left panel: grid vectors and their displacement vectors.
        ax = fig.add_subplot(1, 2, idx + 1, projection='3d')
        for i in range(len(U_x)):
            for j in range(len(U_x)):
                for k in range(len(U_x)):
                    temp_vec[0][0] = U_x[i][j][k]
                    temp_vec[0][1] = U_y[i][j][k]
                    temp_vec[0][2] = U_z[i][j][k]
                    temp_vec[0] = np.dot(covariance_matrix, temp_vec[0])
                    V_x[i][j][k] = temp_vec[0][0]
                    V_y[i][j][k] = temp_vec[0][1]
                    V_z[i][j][k] = temp_vec[0][2]
                    W_x[i][j][k] = (V_x[i][j][k] - U_x[i][j][k]) / (2*grid_range)
                    W_y[i][j][k] = (V_y[i][j][k] - U_y[i][j][k]) / (2*grid_range)
                    W_z[i][j][k] = (V_z[i][j][k] - U_z[i][j][k]) / (2*grid_range)
                    ax.arrow3D(0, 0, 0,
                               U_x[i][j][k], U_y[i][j][k], U_z[i][j][k],
                               mutation_scale=10, arrowstyle="-|>", fc='dimgrey', ec='dimgrey')
                    #ax.arrow3D(0, 0, 0,
                    #           V_x[i][j][k], V_y[i][j][k], V_z[i][j][k],
                    #           mutation_scale=10, arrowstyle="-|>", fc='red', ec='red')
                    ax.arrow3D(U_x[i][j][k], U_y[i][j][k], U_z[i][j][k],
                               W_x[i][j][k], W_y[i][j][k], W_z[i][j][k],
                               mutation_scale=10, arrowstyle="-|>", fc='darkviolet', ec='darkviolet')

    if idx==1:
        # Right panel: centered data points and the three eigen vectors.
        ax = fig.add_subplot(1, 2, idx + 1, projection='3d')
        ax.scatter(data_mean_reduced[:, 0], data_mean_reduced[:, 1], data_mean_reduced[:, 2], marker='o', s=80)
        ax.arrow3D(0, 0, 0, eigen_vectors.T[0][0]*10, eigen_vectors.T[0][1]*10, eigen_vectors.T[0][2]*10,
                   mutation_scale=10, arrowstyle="-|>", fc='orange', ec='orange', lw = 3)
        ax.arrow3D(0, 0, 0, eigen_vectors.T[1][0]*10, eigen_vectors.T[1][1]*10, eigen_vectors.T[1][2]*10,
                   mutation_scale=10, arrowstyle="-|>", fc='orange', ec='orange', lw = 3)
        ax.arrow3D(0, 0, 0, eigen_vectors.T[2][0]*10, eigen_vectors.T[2][1]*10, eigen_vectors.T[2][2]*10,
                   mutation_scale=10, arrowstyle="-|>", fc='orange', ec='orange', lw = 3)
        ax.text(eigen_vectors.T[0][0]*8 , eigen_vectors.T[0][1]*8, eigen_vectors.T[0][2]*8+1, r'$u_1$', fontsize=20)
        ax.text(eigen_vectors.T[1][0]*8 , eigen_vectors.T[1][1]*8, eigen_vectors.T[1][2]*8, r'$u_2$', fontsize=20)
        ax.text(eigen_vectors.T[2][0]*8 , eigen_vectors.T[2][1]*8, eigen_vectors.T[2][2]*8, r'$u_3$', fontsize=20)

    ax.set_xlim(-grid_range, grid_range)
    ax.set_ylim(-grid_range, grid_range)
    ax.set_zlim(-grid_range, grid_range)
    #ax.set_xlabel(r'$x_1$', fontsize=25)
    #ax.set_ylabel(r'$x_2$', fontsize=25)
    #ax.set_zlabel(r'$x_3$', fontsize=25)
    ax.set_xlabel(subject_labels[0], fontsize=10)
    ax.set_ylabel(subject_labels[1], fontsize=10)
    ax.set_zlabel(subject_labels[2], fontsize=10)

#plt.savefig("visualizing_covariance_matrix.png")
plt.show()

Bias and Variance in Machine Learning

November 2, 2020/in Artificial Intelligence, Data Science, Deep Learning, Machine Learning, Main Category/by Pavan Vadapalli

Machine learning continues to be an ever more vital component of our lives and ecosystem, whether we’re applying the techniques to answer research or business problems or in some cases even predicting the future. Machine learning models need to give accurate predictions in order to create real value for a given industry or domain.

While training a model is one of the key steps in the Data Science Project Life Cycle, how the model generalizes on unseen data is an equally important aspect that should be considered in every Data Science Project Life Cycle. We need to know whether it works and, consequently, if we can trust its predictions. Could the model be merely memorizing the data it is fed with, and therefore unable to make good predictions on future samples, or samples that it hasn’t seen before?

Let’s understand the importance of evaluation with a simple example. Two students, Ramesh and Suresh, are preparing for the CAT exam to get into the top IIMs (Indian Institutes of Management). They are good friends who shared a room during their preparation and put in an equal amount of hard work solving numerical problems.

They both prepared for almost the same number of hours over the entire year and appeared for the final CAT exam. Surprisingly, Ramesh cleared it, but Suresh did not. When asked, we learned that there was one difference in their preparation strategies: Ramesh had joined a Test Series course in which he tested his knowledge and understanding by taking mock exams, then evaluated the portions in which he was lagging and adjusted his preparation cycle to do well in those areas. Suresh, however, was confident and just kept training himself without testing the preparation he had done.

Like the above situation, we can train a Machine Learning Algorithm extensively with many parameters and new techniques, but if we skip its evaluation step, we cannot trust the model to perform well on unseen data. In this article, we explain the importance of Bias, Variance and the trade-off between them in order to know how well a machine learning model generalizes to new, previously unseen data.

Bias

Bias is the difference between the Predicted Value and the actual value, that is, how far the predicted values are from the actual values. During the training process the model makes certain assumptions about the training data provided. After training, when it is introduced to the testing/validation data or unseen data, these assumptions may not always be correct.

If we use a large number of nearest neighbors in the K-Nearest Neighbors Algorithm, the model can totally decide that some parameters are not important at all for the modelling. For example, it can just consider that only two predictor variables are enough to classify the data point though we have more than 10 variables.

This type of model will make very strong assumptions about the other parameters not affecting the outcome at all. You can take it as a model predicting or understanding only the simple relationship when the data points clearly indicate a more complex relationship.

When the model has high bias error, it results in a very simplistic model that does not consider the complexity of the data very well leading to Underfitting.

Variance

Variance occurs when the model performs well on the training dataset but does not do well on an unseen dataset. This happens when the model also fits the fluctuations, i.e. the noise, in the data. The model treats this noise as something to learn from, and because it learns too much from the noise in the training dataset, it fails to perform as expected on unseen data.

Based on the above example from Bias, if the model learns that all the ten predictor variables are important to classify a given data point then it tends to have high variance. You can take it as the model is trying to understand every minute detail making it more complex and failing to perform well on the unseen data.

When a model has High Bias error, it underfits the data and makes very simplistic assumptions on it. When a model has High Variance error, it overfits the data and learns too much from it. When a model has balanced Bias and Variance errors, it performs well on the unseen data.

Bias-Variance Trade-off

Based on the definitions of bias and variance, there is a clear trade-off between bias and variance when it comes to the performance of the model. A model will have a high error if it has very high bias and low variance, and it will also have a high error if it has high variance and low bias.

A model that strikes a balance between the bias and variance can minimize the error better than those that live on extreme ends.

We can tell that a model has High Bias from the following signs:

We tend to get high training errors.
The validation error or test error will be similar to the training error.

We can tell that a model has High Variance from the following signs:

We tend to get low training errors.
The validation error or test error will be very high.
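These signs can be observed on a toy problem. The following is a minimal sketch, assuming NumPy, a synthetic sine-shaped dataset with Gaussian noise, and polynomial regression as the model; the data, the helper name `train_and_validation_mse`, and the chosen degrees are all illustrative assumptions and not part of the original article.

```python
import numpy as np

def train_and_validation_mse(degree, x_train, y_train, x_val, y_val):
    """Fit a polynomial of the given degree; return (training MSE, validation MSE)."""
    coefficients = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefficients, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coefficients, x_val) - y_val) ** 2)
    return train_mse, val_mse

# Synthetic data (an assumption for this sketch): a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
x_val = np.linspace(0.05, 0.95, 50)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, x_train.size)
y_val = np.sin(2 * np.pi * x_val) + rng.normal(0.0, 0.2, x_val.size)

# Degree 1 is too simple for a sine curve (high bias);
# degree 9 passes through all 10 training points (high variance).
errors = {}
for degree in (1, 3, 9):
    errors[degree] = train_and_validation_mse(degree, x_train, y_train, x_val, y_val)
    print(f"degree={degree}: train MSE={errors[degree][0]:.4f}, "
          f"validation MSE={errors[degree][1]:.4f}")
```

In this setup the straight line should show a high training error, while the degree 9 polynomial should drive the training error to nearly zero and still miss the validation points.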

We can fix High Bias using the steps below:

We need to gather more input features, or we can create a few using feature engineering techniques.
We can also add a few polynomial features in order to increase the complexity.
If we are using any regularization terms in our model, we can try to reduce them.

We can fix High Variance using the steps below:

We can gather more training data so that the model learns more from the patterns rather than the noise.
We can also try to reduce the input features or do feature selection.
If we are using any regularization terms in our model, we can try to increase them.
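The effect of the regularization term can be sketched with ridge regression on polynomial features. This is a minimal illustration, assuming NumPy, the closed-form ridge solution, and a synthetic dataset; the helper `fit_ridge`, the data, and the three strengths are hypothetical choices, and for simplicity the constant feature is penalized too.

```python
import numpy as np

def fit_ridge(X, y, strength):
    """Closed-form ridge regression: solve (X^T X + strength * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + strength * np.eye(n_features), X.T @ y)

# Synthetic data (an assumption for this sketch): a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

# Polynomial features 1, x, ..., x^9: flexible enough to overfit 10 points.
X = np.vander(x, 10, increasing=True)

results = []
for strength in (1e-6, 1e-3, 1.0):
    w = fit_ridge(X, y, strength)
    train_mse = np.mean((X @ w - y) ** 2)
    weight_norm = np.linalg.norm(w)
    results.append((strength, train_mse, weight_norm))
    print(f"strength={strength:g}: train MSE={train_mse:.5f}, weight norm={weight_norm:.3f}")
```

A stronger penalty shrinks the weights and raises the training error, which trades variance for bias; a weaker penalty does the opposite.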

Conclusion

In this article, we got to know the importance of the evaluation step in the Data Science Project Life Cycle, definitions of Bias and Variance, the trade-off between them and the steps we can take to fix the Underfitting and Overfitting of a Machine Learning Model.

Rethinking linear algebra: visualizing linear transformations and eigenvectors

October 27, 2020/in Data Mining, Data Science, Machine Learning, Main Category, Mathematics/by Yasuto Tamura

In terms of calculation processes of Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), which are the dimension reduction techniques I am going to explain in the following articles, diagonalization is what they are all about. Throughout this article, I would like you to have richer insight into diagonalization in order to prepare for understanding those basic dimension reduction techniques.

When our professor started a lecture on the last chapter of our textbook on linear algebra, he said “It is no exaggeration to say that everything we have studied is for this ‘diagonalization.'” Until then we had to write tons of numerical matrices and vectors all over our notebooks, calculating those products, adding their rows or columns to other rows or columns, sometimes transposing the matrices, calculating their determinants.

It was like the scene in “The Karate Kid,” where the protagonist finally understood the profound meaning behind the prolonged and boring “wax on, wax off” training given by Miyagi (or “jacket on, jacket off” training given by Jackie Chan). We had finally understood why we had been doing those seemingly endless calculations.

Source: http://thinkbedoleadership.com/secret-success-wax-wax-off/

But usually you can do those calculations easily with functions in the Numpy library. Unlike Japanese college freshmen, I bet you are too busy to reopen textbooks on linear algebra to refresh your mathematics. Thus I am going to provide less mathematical and more intuitive explanation of diagonalization in this article.

*This is the second article of the article series “Illustrative introductions on dimension reduction.”

1, The mainstream ways of explaining diagonalization.

*The statements below are very rough for mathematical topics, but I am going to give priority to offering more visual understanding on linear algebra in this article. For further understanding, please refer to textbooks on linear algebra. If you would like to have minimum understandings on linear algebra needed for machine learning, I recommend the Appendix C of Pattern Recognition and Machine Learning by C. M. Bishop.

In most textbooks on linear algebra, the explanation of diagonalization goes like this (if you are not sure what diagonalization is or if you are allergic to mathematics, you do not have to read this seriously):

Let $V (dimV = D)$ be a vector space and let $T_A : V \rightarrow V$ be a mapping of $V$ into itself, defined as $T_A(\boldsymbol{v}) = A \cdot \boldsymbol{v}$, where $A$ is a $D\times D$ matrix and $\boldsymbol{v}$ is a $D$ dimensional vector. An element $\boldsymbol{v} \in V$ is called an eigen vector if there exists a number $\lambda$ such that $A \cdot \boldsymbol{v}= \lambda \cdot \boldsymbol{v}$ and $\boldsymbol{v} \neq \boldsymbol{0}$. In this case $\lambda$ is uniquely determined and is called an eigen value of $A$ belonging to the eigen vector $\boldsymbol{v}$.

Any $D \times D$ matrix $A$ has $D$ eigen values $\lambda_{i}$ (counted with multiplicity, and possibly complex), belonging to eigen vectors $\boldsymbol{v}_{i} (i=1, 2, \dots, D)$. If the eigen vectors $\boldsymbol{v}_{i}$ form a basis of the vector space $V$, then $A$ is diagonalizable.

When $A$ is diagonalizable, with the $D \times D$ matrix $P = (\boldsymbol{v}_{1}, \dots, \boldsymbol{v}_{D})$, whose column vectors are the eigen vectors $\boldsymbol{v}_{i} (i=1, 2, \dots, D)$, the following equation holds: $P^{-1}AP = \Lambda$, where $\Lambda = diag(\lambda_{1}, \dots, \lambda_{D})= \begin{pmatrix} \lambda_{1} & 0& \ldots &0\\ 0 & \lambda_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_{D} \end{pmatrix}$.
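The equation $P^{-1}AP = \Lambda$ can be checked numerically. The following is a minimal sketch, assuming NumPy and a small $2 \times 2$ matrix chosen only for illustration; the variable names are likewise illustrative.

```python
import numpy as np

# Illustrative matrix (an assumption for this sketch).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Columns of P are eigen vectors of A; eigen_values[i] belongs to P[:, i].
eigen_values, P = np.linalg.eig(A)

# P^{-1} A P should be the diagonal matrix of eigen values.
Lambda = np.linalg.inv(P) @ A @ P

print(np.round(Lambda, 6))
print(eigen_values)
```

The off-diagonal elements of `Lambda` should vanish up to floating point error, and its diagonal should list the eigen values in the same order as the columns of `P`.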

And when $A$ is diagonalizable, you can diagonalize $A$ as below.

Most textbooks keep explaining this type of material, but I have to say they lack efforts to make it understandable to readers with low mathematical literacy like me. Especially if you have to apply the idea to the data science field, I believe you need a more visual understanding of diagonalization. Therefore instead of just explaining the definitions and theorems, I would like to take a different approach. But in order to understand them in more intuitive ways, we first have to rethink what linear transformation $T_A$ means in more visible ways.

2, Linear transformations

Even though I did my best to make this article understandable to people with little prerequisite knowledge, you at least have to understand linear transformation of numerical vectors by matrices. Linear transformation is nothing difficult, and in this article I am going to use only 2 or 3 dimensional numerical vectors or square matrices. You can calculate linear transformation of $\boldsymbol{v}$ by $A$ as equations in the figure. In other words, $\boldsymbol{u}$ is a vector transformed by $A$.

*I am not going to use the term “linear transformation” in a precise way in the context of linear algebra. In this article or in the context of data science or machine learning, “linear transformation” for the most part means products of matrices or vectors.

*Forward/back propagation of deep learning is mainly composed of this linear transformation. You keep linearly transforming input vectors, frequently transforming them with activation functions, which are for the most part not linear transformation.

As you can see in the equations above, linear transformation with $A$ transforms a vector to another vector. Assume that you have an original vector $\boldsymbol{v}$ in grey and that the vector $\boldsymbol{u}$ in pink is $\boldsymbol{v}$ transformed by $A$. If you subtract $\boldsymbol{v}$ from $\boldsymbol{u}$, you can get a displacement vector, which I displayed in purple. A displacement vector means the transition from a vector to another vector.

Let’s calculate the displacement vector with more vectors $\boldsymbol{v}$. Assume that $A =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}$, and I prepared several grid vectors $\boldsymbol{v}$ in grey as you can see in the figure below. If you transform those grey grid points with $A$, they are mapped into the vectors $\boldsymbol{u}$ in pink. With those vectors in grey or pink, you can calculate their displacement vectors $\boldsymbol{u} - \boldsymbol{v}$, which are in purple.
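The same calculation can be written compactly in NumPy. This is a minimal sketch, assuming a small integer grid chosen for illustration; note that the displacement $\boldsymbol{u} - \boldsymbol{v}$ equals $(A - I)\boldsymbol{v}$.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Grid vectors v, one per row (a small illustrative grid).
grid = np.array([[x, y] for x in range(-2, 3) for y in range(-2, 3)], dtype=float)

# Row-wise transformation: each row of `transformed` is u = A v.
transformed = grid @ A.T

# Displacement vectors u - v, again one per row.
displacement = transformed - grid

# A single example: v = (1, 0) is mapped to u = (3, 1), so u - v = (2, 1).
v = np.array([1.0, 0.0])
u = A @ v
print(u, u - v)
```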

The displacement vectors in the figure above have some tendencies. In order to see that more clearly, let’s calculate displacement vectors with several matrices $A$ and more grid points. Assume that you have three $2 \times 2$ square matrices $A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, A_2 =\begin{pmatrix} 3 & 1 \\ -1 & 1 \end{pmatrix}, A_3 =\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$, and I plotted displacement vectors made by the matrices respectively in the figure below.

I think you noticed some characteristics of the displacement vectors made by those linear transformations: the vectors are swirling and many of them seem to be oriented in certain directions. To be exact, some displacement vectors extend in the same directions as some of the original vectors in grey. That means linear transformation by $A$ did not change the direction of the original vector $\boldsymbol{v}$, and the unchanged vectors are called eigen vectors. Real eigen vectors of each $A$ are displayed as arrows in yellow in the figure above. But when it comes to $A_3$, the matrix does not have any real eigen values.
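Whether a transformation keeps the direction of a vector can be tested with the 2D cross product, which is zero exactly when two vectors are parallel. The following is a minimal sketch, assuming NumPy; the helper `cross_2d` and the small grid are illustrative choices.

```python
import numpy as np

def cross_2d(v, u):
    """z-component of the cross product; zero exactly when v and u are parallel."""
    return v[0] * u[1] - v[1] * u[0]

A_1 = np.array([[3.0, 1.0], [1.0, 2.0]])
A_3 = np.array([[1.0, -1.0], [1.0, 1.0]])

# Eigen vectors of A_1 keep their direction: v and A_1 v are parallel.
eigen_values_1, eigen_vectors_1 = np.linalg.eig(A_1)
crosses_1 = [cross_2d(eigen_vectors_1[:, i], A_1 @ eigen_vectors_1[:, i]) for i in range(2)]
print(eigen_values_1, crosses_1)

# A_3 has a complex-conjugate pair of eigen values (1 + 1j and 1 - 1j),
# so no real vector keeps its direction.
eigen_values_3, _ = np.linalg.eig(A_3)
print(eigen_values_3)

# For A_3 the cross product of v and A_3 v equals |v|^2, which is never zero for v != 0.
grid = [np.array([x, y], dtype=float)
        for x in range(-2, 3) for y in range(-2, 3) if (x, y) != (0, 0)]
crosses_3 = [cross_2d(v, A_3 @ v) for v in grid]
print(min(crosses_3))
```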

In linear algebra, depending on the type of matrices $A$, you have to consider various cases such as whether the matrices have real or imaginary eigen values, whether the matrices are diagonalizable, whether the eigen vectors are orthogonal, or whether they are unit vectors. But those topics are out of the scope of this article series, so please refer to textbooks on linear algebra if you are interested.

Luckily, however, in terms of PCA or LDA, you only have to consider a type of matrices named positive semidefinite matrices, which $A_1$ is classified to, and I am going to explain positive semidefinite matrices in the fourth section.

3, Eigen vectors as coordinate system

Source: Ian Stewart, “Professor Stewart’s Cabinet of Mathematical Curiosities,” (2008), Basic Books

Let me take Fibonacci numbers as an example to briefly see why diagonalization is useful. The Fibonacci sequence is quite simple and it is often explained using an example of pairs of rabbits increasing generation by generation. Let $a_n (n=0, 1, 2, \dots)$ be the number of pairs of grown up rabbits in the $n^{th}$ generation. One pair of grown up rabbits produces one pair of young rabbits. The concrete values of $a_n$ are $a_0 = 0$, $a_1 = 1$, $a_2=1$, $a_3=2$, $a_4=3$, $a_5=5$, $a_6=8$, $a_7=13, \dots$. Assume that $A =\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$ and that $\begin{pmatrix} a_1 \\ a_0 \end{pmatrix} =\begin{pmatrix} 1 \\ 0 \end{pmatrix}$, then you can calculate the number of the pairs of grown up rabbits in the next generation with the following recurrence relation: $\begin{pmatrix} a_{n+2} \\ a_{n+1} \end{pmatrix}=\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} a_{n+1} \\ a_{n} \end{pmatrix}$. Let $\boldsymbol{a}_n$ be $\begin{pmatrix} a_{n+1} \\ a_{n} \end{pmatrix}$, then the recurrence relation can be written as $\boldsymbol{a}_{n+1} = A \boldsymbol{a}_n$, and the transitions of $\boldsymbol{a}_n$ are like the purple arrows in the figure below. It seems that the changes of the purple arrows are irregular if you look at the plots in the normal coordinate system.

Assume that $\lambda _1, \lambda_2 (\lambda _1< \lambda_2)$ are eigen values of $A$, and $\boldsymbol{v}_1, \boldsymbol{v}_2$ are eigen vectors belonging to them respectively. Also let $\alpha, \beta$ be scalars such that $\boldsymbol{a}_0 = \begin{pmatrix} a_{1} \\ a_{0} \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \alpha \boldsymbol{v}_1 + \beta \boldsymbol{v}_2$. According to the definition of eigen values and eigen vectors belonging to them, the following two equations hold: $A\boldsymbol{v}_1 = \lambda_1 \boldsymbol{v}_1, A\boldsymbol{v}_2 = \lambda_2 \boldsymbol{v}_2$. If you calculate $\boldsymbol{a}_1$ using the eigen vectors of $A$, you get $\boldsymbol{a}_1 = A\boldsymbol{a}_0 = A (\alpha \boldsymbol{v}_1 + \beta \boldsymbol{v}_2) = \alpha\lambda _1 \boldsymbol{v}_1 + \beta \lambda_2 \boldsymbol{v}_2$. In the same way, $\boldsymbol{a}_2 = A\boldsymbol{a}_1 = A (\alpha\lambda _1 \boldsymbol{v}_1 + \beta \lambda_2 \boldsymbol{v}_2) = \alpha\lambda _{1}^{2} \boldsymbol{v}_1 + \beta \lambda_{2}^{2} \boldsymbol{v}_2$, and $\boldsymbol{a}_3 = A\boldsymbol{a}_2 = A (\alpha\lambda _{1}^{2} \boldsymbol{v}_1 + \beta \lambda_{2}^{2} \boldsymbol{v}_2) = \alpha\lambda _{1}^{3} \boldsymbol{v}_1 + \beta \lambda_{2}^{3} \boldsymbol{v}_2$. These equations show that in the coordinate system made by eigen vectors of $A$, linear transformation by $A$ is easily done by just multiplying each eigen vector by its eigen value. Compared to the graph of Fibonacci numbers above, in the figure below you can see that in the coordinate system made by eigen vectors the plots change more systematically generation by generation.
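The general pattern is $\boldsymbol{a}_n = \alpha\lambda_{1}^{n} \boldsymbol{v}_1 + \beta \lambda_{2}^{n} \boldsymbol{v}_2$, and it can be compared against plain repeated multiplication by $A$. The following is a minimal sketch, assuming NumPy; the function names are illustrative, and the result does not depend on the order in which `np.linalg.eig` returns the eigen values.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 0.0]])
a_0 = np.array([1.0, 0.0])   # the vector (a_1, a_0)

# Columns of eigen_vectors are v_1 and v_2.
eigen_values, eigen_vectors = np.linalg.eig(A)

# Solve a_0 = alpha * v_1 + beta * v_2 for (alpha, beta).
coefficients = np.linalg.solve(eigen_vectors, a_0)

def a_n_via_eigen(n):
    """a_n = alpha * lambda_1^n * v_1 + beta * lambda_2^n * v_2."""
    return eigen_vectors @ (coefficients * eigen_values ** n)

def a_n_via_iteration(n):
    """a_n obtained by applying A to a_0 n times."""
    a = a_0.copy()
    for _ in range(n):
        a = A @ a
    return a

for n in range(8):
    print(n, np.round(a_n_via_eigen(n)), a_n_via_iteration(n))
```

In the eigen vector coordinate system each step only multiplies the two coefficients by $\lambda_1$ and $\lambda_2$, so no matrix product is needed after the initial decomposition.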

In coordinate system made by eigen vectors of square matrices, the linear transformations by the matrices can be much more straightforward, and this is one powerful strength of eigen vectors.

*I do not major in mathematics, so I am not 100% sure, but vectors in linear algebra have more abstract meanings. Various things in mathematics can be vectors, even though in machine learning or data science we mainly use numerical vectors with more concrete elements. We can also say that matrices are a kind of maps. That is just like, at least in my impression, even though a real town is composed of various components such as houses, smooth or bumpy roads, you can simplify its structure with simple orthogonal lines, like the map of Manhattan. But if you know what the town actually looks like, you do not have to follow the zigzag path on the map.

4, Eigen vectors of positive semidefinite matrices

In the second section of this article I told you that, even though you have to consider various elements when you discuss general diagonalization, in terms of PCA and LDA we mainly use only a type of matrices named positive semidefinite matrices. Let $A$ be a $D \times D$ square matrix. If $\boldsymbol{x}^T A \boldsymbol{x} \geq 0$ for all values of the vector $\boldsymbol{x}$, then $A$ is said to be a positive semidefinite matrix. It is also known that, for a symmetric matrix $A$, being positive semidefinite is equivalent to $\lambda _{i} \geq 0$ for all the eigen values $\lambda_i (i=1, \dots , D)$.

*I think most people first learn a type of matrices called positive definite matrices. Let $A$ be a $D \times D$ square matrix. If $\boldsymbol{x}^T A \boldsymbol{x} > 0$ for all nonzero vectors $\boldsymbol{x}$, then $A$ is said to be a positive definite matrix. You have to keep it in mind that even if all the elements of $A$ are positive, $A$ is not necessarily positive definite/semidefinite.
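Both the definition and the eigen value criterion can be tried on small matrices. The following is a minimal sketch, assuming NumPy and real symmetric inputs; the helpers `quadratic_form` and `is_positive_semidefinite` and the matrix `B` are hypothetical, introduced only for illustration. `B` has only positive elements but is not positive semidefinite.

```python
import numpy as np

def quadratic_form(A, x):
    """Return x^T A x."""
    return float(x @ A @ x)

def is_positive_semidefinite(A, tol=1e-10):
    """Check a real symmetric matrix via its eigen values (all >= 0 up to tol)."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

A_1 = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.array([[1.0, 2.0], [2.0, 1.0]])   # all elements positive, eigen values 3 and -1

x = np.array([1.0, -1.0])
print(is_positive_semidefinite(A_1), quadratic_form(A_1, x))
print(is_positive_semidefinite(B), quadratic_form(B, x))
```

For $\boldsymbol{x} = (1, -1)^T$ the quadratic form of `B` is negative, which is enough to rule out positive semidefiniteness.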

Just as we did in the second section of this article, let’s visualize displacement vectors made by linear transformation with a $3 \times 3$ square positive semidefinite matrix $A$ .

*In fact $A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}$, whose linear transformation I visualized in the second section, is also positive semidefinite.

Let’s visualize linear transformations by a positive definite matrix $A = \frac{1}{50} \begin{pmatrix} 60.45 & 33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}$. I visualized the displacement vectors made by $A$ in the same way as in the second section of this article. The result is as below, and you can see that, like the displacement vectors made by $A_1$, the three dimensional displacement vectors below are swirling and extending in three directions, in the directions of the three orthogonal eigen vectors $\boldsymbol{v}_1, \boldsymbol{v}_2$, and $\boldsymbol{v}_3$.

*It might seem like a weird choice of a matrix, but you are going to see why I chose it in the next article.

You might have already noticed $A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}$ and $A = \frac{1}{50} \begin{pmatrix} 60.45 & 33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}$ are both symmetric matrices and that their elements are all real values, and that their diagonal elements are all positive values. Super importantly, when all the elements of a $D \times D$ symmetric matrix $A$ are real values and its eigen values are $\lambda_{i} (i=1, \dots , D)$, there exists an orthonormal matrix $U$ such that $U^{-1}AU = \Lambda$, where $\Lambda = diag(\lambda_{1}, \dots , \lambda_{D})$.

*The title of this section might be misleading, but please keep it in mind that positive definite/semidefinite matrices are not necessarily real symmetric matrices. And real symmetric matrices are not necessarily positive definite/semidefinite matrices.

5, Orthonormal matrices and rotation of vectors

In this section I am going to explain orthonormal matrices, also known as rotation matrices (strictly speaking, orthonormal matrices also include reflections). If a $D\times D$ matrix $U$ is an orthonormal matrix, column vectors of $U$ are orthonormal, which means $U = (\boldsymbol{u}_1 \dots \boldsymbol{u}_D)$, where $\begin{cases} \boldsymbol{u}_{i}^{T}\boldsymbol{u}_{j} = 1 \quad (i = j) \\ \boldsymbol{u}_{i}^{T}\boldsymbol{u}_{j} = 0 \quad (i\neq j) \end{cases}$. In other words column vectors $\boldsymbol{u}_{i}$ form an orthonormal coordinate system.

Orthonormal matrices $U$ have several important properties, and one of the most important properties is $U^{-1} = U^{T}$. Combining this fact with what I have told you so far, we can reach one conclusion: you can diagonalize a real symmetric matrix $A$ as $U^{T}AU = \Lambda$. This is known as spectral decomposition (eigen decomposition), and when $A$ is positive semidefinite it coincides with singular value decomposition.
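The relations $U^{T}U = I$ and $U^{T}AU = \Lambda$ can be confirmed numerically for the $3 \times 3$ matrix from the previous section. The following is a minimal sketch, assuming NumPy and its `np.linalg.eigh` routine for symmetric matrices; the variable names are illustrative.

```python
import numpy as np

A = np.array([[60.45, 33.63, 46.29],
              [33.63, 68.49, 50.93],
              [46.29, 50.93, 53.61]]) / 50

# eigh is meant for symmetric (Hermitian) matrices: it returns real eigen values
# in ascending order and orthonormal eigen vectors as the columns of U.
eigen_values, U = np.linalg.eigh(A)

identity_check = U.T @ U                      # should be the identity matrix
Lambda = U.T @ A @ U                          # should be diag(eigen_values)
A_rebuilt = U @ np.diag(eigen_values) @ U.T   # should give A back

print(np.round(identity_check, 6))
print(np.round(Lambda, 6))
```

Because $U^{-1} = U^{T}$, no matrix inversion is needed here, in contrast to the general case $P^{-1}AP = \Lambda$.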

Another important property of $U$ is that $U^{T}$ is also orthonormal. In other words, assume $U$ is orthonormal and that $U = (\boldsymbol{u}_1 \dots \boldsymbol{u}_D) = \begin{pmatrix} -\boldsymbol{v_1}^{T}- \\ \vdots \\ -\boldsymbol{v_D}^{T}- \end{pmatrix}$, then $(\boldsymbol{v}_1 \dots \boldsymbol{v}_D)$ also forms an orthonormal coordinate system.

…It seems things are getting too mathematical and abstract (for me), thus for now I am going to wrap up what I have explained in this article.

We have seen

Numerical matrices linearly transform vectors.
Certain linear transformations do not change the direction of vectors in certain directions, which are called eigen vectors.
Making use of eigen vectors, you can form a new coordinate system which can describe the linear transformations in a more straightforward way.
You can diagonalize a real symmetric matrix $A$ with an orthonormal matrix $U$ .

Of our current interest is what kind of linear transformation the real symmetric positive definite matrix enables. I am going to explain why the purple vectors in the figure above are swirling in the upcoming articles. Before that, however, we are going to see one application of what we have seen in this article, on dimension reduction. To be concrete, the next article is going to be about principal component analysis (PCA), which is very important in many fields.

*In short, the orthonormal matrix $U$, which I mentioned above, rotates vectors, and the diagonal matrix $diag(\lambda_1, \dots, \lambda_D)$ expands or contracts vectors along each axis. I am going to explain that more precisely in the upcoming articles.

*I attached the code I used to make the figures in this article. You can just copy, paste, and run it, installing the necessary libraries where needed.

import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as mpatches

T_A = np.array([[1, 1],
                [1, 0]])

# Iterate the Fibonacci recurrence a_{n+1} = A a_n starting from (a_1, a_0) = (1, 0).
total_step = 5
x = np.zeros((total_step, 2))
x[0] = np.array([1, 0])

for i in range(total_step - 1):
    x[i + 1] = np.dot(T_A, x[i])

# Eigen decomposition, sorted by descending eigen value.
eigen_values, eigen_vectors = np.linalg.eig(T_A)
idx = eigen_values.argsort()[::-1]
eigen_values = eigen_values[idx]
eigen_vectors = eigen_vectors[:,idx]

# Flip eigen vectors so that their first component is non-negative.
for i in range(len(eigen_vectors)):
    if(eigen_vectors.T[i][0] < 0):
        eigen_vectors.T[i] = - eigen_vectors.T[i]

# Coefficients of each a_n in the coordinate system of the eigen vectors.
v_initial = x[0]
v_coefficients = np.zeros((total_step , 2))
v_coefficients[0] = np.dot(v_initial , np.linalg.inv(eigen_vectors.T))

for i in range(total_step-1):
    v_coefficients[i + 1] = v_coefficients[i] * eigen_values

v_1_list = np.zeros((total_step + 1, 2))
v_2_list = np.zeros((total_step + 1, 2))

for i in range(total_step):
    v_1_list[i+1] = v_coefficients.T[0][i]*eigen_vectors.T[0]
    v_2_list[i+1] = v_coefficients.T[1][i]*eigen_vectors.T[1]

plt.figure(figsize=(20, 15))
fontsize = 20
small_shift = 0.2

plt.plot(x[:, 0], x[:, 1], marker='o', linestyle='none', markersize=10, color='black')
plt.arrow(0, 0, eigen_vectors.T[0][0], eigen_vectors.T[0][1], width=0.05, head_width=0.2, color='orange')
plt.arrow(0, 0, eigen_vectors.T[1][0], eigen_vectors.T[1][1], width=0.05, head_width=0.2, color='orange')
plt.text(eigen_vectors.T[0][0], eigen_vectors.T[0][1]+small_shift, r'$v_{1}$', va='center',ha='right', fontsize=fontsize + 10)
plt.text(eigen_vectors.T[1][0] - small_shift, eigen_vectors.T[1][1],r'$v_{2}$', va='center',ha='right', fontsize=fontsize + 10)

for i in range(total_step):
    plt.arrow(0, 0, v_1_list[i+1][0], v_1_list[i+1][1], head_width=0.05, color='darkviolet', length_includes_head=True)
    plt.arrow(0, 0, v_2_list[i+1][0], v_2_list[i+1][1], head_width=0.05, color='darkviolet', length_includes_head=True)
    plt.text(v_1_list[i+1][0] + 2*small_shift , v_1_list[i+1][1]-2*small_shift,r'$\alpha \cdot \lambda_{0} ^{1} \cdot v_{2} $'.format(1,i+1, 1),va='center',ha='right', fontsize=fontsize)
    plt.text(v_2_list[i+1][0]-0.1, v_2_list[i+1][1],r'$\beta \cdot \lambda_{0} ^{1} \cdot v_{2} $'.format(2, i+1, 2),va='center',ha='right', fontsize=fontsize)
    plt.arrow(v_1_list[i+1][0], v_1_list[i+1][1], v_2_list[i+1][0], v_2_list[i+1][1], head_width=0, color='black', linestyle='--', length_includes_head=True)
    plt.arrow(v_2_list[i+1][0], v_2_list[i+1][1], v_1_list[i+1][0], v_1_list[i+1][1], head_width=0, color='black', linestyle='--', length_includes_head=True)

orange_patch = mpatches.Patch(color='orange', label='Eigen vectors')
purple_patch = mpatches.Patch(color='darkviolet', label='Scalar multiples of the eigen vectors')
plt.legend(handles=[orange_patch, purple_patch], fontsize=25, loc='lower right')

for i in range(total_step):
    plt.text(x[i][0]+0.1, x[i][1]-0.05, r'n={0}'.format(i), fontsize=20)

plt.grid(True)
plt.ylabel("$a_{n}: n^{th} generation$", fontsize=20)
plt.xlabel("$a_{n+1}: n+1 ^{th} generation$", fontsize=20)
plt.title("Fibonacci sequence and its eigen space", fontsize=30)
#plt.savefig("Fibonacci_eigen_space.png")
plt.show()

import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as mpatches

const_range = 10
X = np.arange(-const_range, const_range + 1, 1)
Y = np.arange(-const_range, const_range + 1, 1)
U_x, U_y = np.meshgrid(X, Y)

# Three 2x2 matrices whose linear transformations are compared side by side
T_A_0 = np.array([[3, 1],
                  [1, 2]])
T_A_1 = np.array([[3, 1],
                  [-1, 1]])
T_A_2 = np.array([[1, -1],
                  [1, 1]])
T_A_list = np.array((T_A_0, T_A_1, T_A_2))

const_range = 5

plt.figure(figsize=(30, 10))
plt.subplots_adjust(wspace=0.1)

labels = ["Grids", "Displacement vectors made by A", "Real eigen vectors of A"]
title_list = [r"$A_1$ has two different real eigen vectors.", r"$A_2$ has two identical real unit eigen vectors.", r"$A_3$ has only imaginary eigen vectors."]

for idx in range(len(T_A_list)):
    # Eigen decomposition, sorted by eigenvalue in descending order
    eigen_values, eigen_vectors = np.linalg.eig(T_A_list[idx])
    sorted_idx = eigen_values.argsort()[::-1]
    eigen_values = eigen_values[sorted_idx]
    eigen_vectors = eigen_vectors[:,sorted_idx]
    eigen_vectors = eigen_vectors.astype(float)

    # Flip each eigenvector so that its first element is non-negative
    for i in range(len(eigen_vectors)):
        if(eigen_vectors.T[i][0] < 0):
            eigen_vectors.T[i] = - eigen_vectors.T[i]

    X = np.arange(-const_range, const_range + 1, 1)
    Y = np.arange(-const_range, const_range + 1, 1)
    U_x, U_y = np.meshgrid(X, Y)

    V_x = np.zeros((len(U_x), len(U_y)))
    V_y = np.zeros((len(U_x), len(U_y)))
    temp_vec = np.zeros((1, 2))
    W_x = np.zeros((len(U_x), len(U_y)))
    W_y = np.zeros((len(U_x), len(U_y)))

    plt.subplot(1, 3, idx + 1)

    for i in range(len(U_x)):
        for j in range(len(U_y)):
            # U: grid point, V: its image under A, W = V - U: displacement
            temp_vec[0][0] = U_x[i][j]
            temp_vec[0][1] = U_y[i][j]
            temp_vec[0] = np.dot(T_A_list[idx], temp_vec[0])
            V_x[i][j] = temp_vec[0][0]
            V_y[i][j] = temp_vec[0][1]
            W_x[i][j] = V_x[i][j] - U_x[i][j]
            W_y[i][j] = V_y[i][j] - U_y[i][j]

            #plt.arrow(0, 0, V_x[i][j], V_y[i][j], head_width=0.1, color='red')
            plt.arrow(0, 0, U_x[i][j], U_y[i][j], head_width=0.3, color='dimgrey', label=labels[0])
            plt.arrow(U_x[i][j], U_y[i][j], W_x[i][j], W_y[i][j], head_width=0.3, color='darkviolet', label=labels[1])

    range_const = 20
    plt.xlim([-range_const, range_const])
    plt.ylim([-range_const, range_const])
    plt.title(title_list[idx], fontsize=25)

    # The third matrix has no real eigenvectors, so none are drawn
    if idx==2:
        continue

    plt.arrow(0, 0, eigen_vectors.T[0][0]*10, eigen_vectors.T[0][1]*10, head_width=1, color='orange', label=labels[2])
    plt.arrow(0, 0, eigen_vectors.T[1][0]*10, eigen_vectors.T[1][1]*10, head_width=1, color='orange', label=labels[2])

grey_patch = mpatches.Patch(color='grey', label='Grids')
purple_patch = mpatches.Patch(color='darkviolet', label='Displacement vectors made by A')
yellow_patch = mpatches.Patch(color='gold', label='Real eigen vectors of A')
plt.legend(handles=[grey_patch, purple_patch, yellow_patch], fontsize=25, loc='lower right', bbox_to_anchor=(-0.1, -.35))

#plt.savefig("linear_transformation.png")
plt.show()

# I borrowed code from https://gist.github.com/WetHat/1d6cd0f7309535311a539b42cccca89c to make 3D arrows with matplotlib
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.proj3d import proj_transform
from mpl_toolkits.mplot3d.axes3d import Axes3D
from matplotlib.text import Annotation
from matplotlib.patches import FancyArrowPatch
import matplotlib.patches as mpatches

# Note: `renderer.M` follows the borrowed gist; newer matplotlib versions
# may require a different way of obtaining the projection matrix.

class Annotation3D(Annotation):
    def __init__(self, text, xyz, *args, **kwargs):
        super().__init__(text, xy=(0,0), *args, **kwargs)
        self._xyz = xyz

    def draw(self, renderer):
        x2, y2, z2 = proj_transform(*self._xyz, renderer.M)
        self.xy=(x2,y2)
        super().draw(renderer)

def _annotate3D(ax,text, xyz, *args, **kwargs):
    '''Add annotation `text` to an `Axes3D` instance.'''
    annotation= Annotation3D(text, xyz, *args, **kwargs)
    ax.add_artist(annotation)

setattr(Axes3D,'annotate3D',_annotate3D)

class Arrow3D(FancyArrowPatch):
    def __init__(self, x, y, z, dx, dy, dz, *args, **kwargs):
        super().__init__((0,0), (0,0), *args, **kwargs)
        self._xyz = (x,y,z)
        self._dxdydz = (dx,dy,dz)

    def draw(self, renderer):
        x1,y1,z1 = self._xyz
        dx,dy,dz = self._dxdydz
        x2,y2,z2 = (x1+dx,y1+dy,z1+dz)
        xs, ys, zs = proj_transform((x1,x2),(y1,y2),(z1,z2), renderer.M)
        self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
        super().draw(renderer)

def _arrow3D(ax, x, y, z, dx, dy, dz, *args, **kwargs):
    '''Add a 3D arrow to an `Axes3D` instance.'''
    arrow = Arrow3D(x, y, z, dx, dy, dz, *args, **kwargs)
    ax.add_artist(arrow)

setattr(Axes3D,'arrow3D',_arrow3D)

# A symmetric 3x3 matrix, scaled down so that the plot stays readable
T_A = np.array([[60.45, 33.63, 46.29],
                [33.63, 68.49, 50.93],
                [46.29, 50.93, 53.61]])
T_A = T_A/50

const_range = 2
X = np.arange(-const_range, const_range + 1, 1)
Y = np.arange(-const_range, const_range + 1, 1)
Z = np.arange(-const_range, const_range + 1, 1)
U_x, U_y, U_z = np.meshgrid(X, Y, Z)

V_x = np.zeros((len(U_x), len(U_y), len(U_z)))
V_y = np.zeros((len(U_x), len(U_y), len(U_z)))
V_z = np.zeros((len(U_x), len(U_y), len(U_z)))
temp_vec = np.zeros((1, 3))
W_x = np.zeros((len(U_x), len(U_y), len(U_z)))
W_y = np.zeros((len(U_x), len(U_y), len(U_z)))
W_z = np.zeros((len(U_x), len(U_y), len(U_z)))

fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111, projection='3d')

grid_range = const_range + 5
ax.set_xlim(-grid_range, grid_range)
ax.set_ylim(-grid_range, grid_range)
ax.set_zlim(-grid_range, grid_range)

# Eigen decomposition, sorted by eigenvalue in descending order
eigen_values, eigen_vectors = np.linalg.eig(T_A)
sorted_idx = eigen_values.argsort()[::-1]
eigen_values = eigen_values[sorted_idx]
eigen_vectors = eigen_vectors[:,sorted_idx]
eigen_vectors = eigen_vectors.astype(float)

# Flip each eigenvector so that its first element is non-negative
for i in range(len(eigen_vectors)):
    if(eigen_vectors.T[i][0] < 0):
        eigen_vectors.T[i] = - eigen_vectors.T[i]

for i in range(len(U_x)):
    for j in range(len(U_x)):
        for k in range(len(U_x)):
            # U: grid point, V: its image under A, W = V - U: displacement
            temp_vec[0][0] = U_x[i][j][k]
            temp_vec[0][1] = U_y[i][j][k]
            temp_vec[0][2] = U_z[i][j][k]
            temp_vec[0] = np.dot(T_A, temp_vec[0])
            V_x[i][j][k] = temp_vec[0][0]
            V_y[i][j][k] = temp_vec[0][1]
            V_z[i][j][k] = temp_vec[0][2]
            W_x[i][j][k] = V_x[i][j][k] - U_x[i][j][k]
            W_y[i][j][k] = V_y[i][j][k] - U_y[i][j][k]
            W_z[i][j][k] = V_z[i][j][k] - U_z[i][j][k]

            ax.arrow3D(0, 0, 0,
                       U_x[i][j][k], U_y[i][j][k], U_z[i][j][k],
                       mutation_scale=10, arrowstyle="-|>", fc='dimgrey', ec='dimgrey')
            #ax.arrow3D(0, 0, 0,
            #           V_x[i][j][k], V_y[i][j][k], V_z[i][j][k],
            #           mutation_scale=10, arrowstyle="-|>", fc='red', ec='red')
            ax.arrow3D(U_x[i][j][k], U_y[i][j][k], U_z[i][j][k],
                       W_x[i][j][k], W_y[i][j][k], W_z[i][j][k],
                       mutation_scale=10, arrowstyle="-|>", fc='darkviolet', ec='darkviolet')

ax.arrow3D(0, 0, 0, eigen_vectors.T[0][0]*10, eigen_vectors.T[0][1]*10, eigen_vectors.T[0][2]*10,
           mutation_scale=10, arrowstyle="-|>", fc='orange', ec='orange')
ax.arrow3D(0, 0, 0, eigen_vectors.T[1][0]*10, eigen_vectors.T[1][1]*10, eigen_vectors.T[1][2]*10,
           mutation_scale=10, arrowstyle="-|>", fc='orange', ec='orange')
ax.arrow3D(0, 0, 0, eigen_vectors.T[2][0]*10, eigen_vectors.T[2][1]*10, eigen_vectors.T[2][2]*10,
           mutation_scale=10, arrowstyle="-|>", fc='orange', ec='orange')

ax.text(eigen_vectors.T[0][0]*8 , eigen_vectors.T[0][1]*8, eigen_vectors.T[0][2]*8+1, r'$v_1$', fontsize=20)
ax.text(eigen_vectors.T[1][0]*8 , eigen_vectors.T[1][1]*8, eigen_vectors.T[1][2]*8, r'$v_2$', fontsize=20)
ax.text(eigen_vectors.T[2][0]*8 , eigen_vectors.T[2][1]*8, eigen_vectors.T[2][2]*8, r'$v_3$', fontsize=20)

grey_patch = mpatches.Patch(color='grey', label='Grids')
orange_patch = mpatches.Patch(color='orange', label='Orthogonal eigen vectors of A')
purple_patch = mpatches.Patch(color='darkviolet', label='Displacement vectors made by A')
plt.legend(handles=[grey_patch, orange_patch, purple_patch], fontsize=20, loc='lower right')

ax.set_xlabel(r'$x_1$', fontsize=25)
ax.set_ylabel(r'$x_2$', fontsize=25)
ax.set_zlabel(r'$x_3$', fontsize=25)

#plt.savefig("symmetric_positive_definite_visualizaiton.png")
plt.show()

K Nearest Neighbour For Supervised Learning

October 10, 2020/in Artificial Intelligence, Data Science, Machine Learning, Main Category, Mathematics/by Ram Tavva

The K-Nearest Neighbour (KNN) algorithm is an easy-to-implement supervised machine learning algorithm used for both classification and regression problems. In practice, however, you will find a wide range of its applications in classification problems across various industries.

If you’ve been shopping a lot on e-commerce sites like Amazon, Flipkart, or Myntra, or love watching web series on Netflix and Amazon Prime, there is one common thing you’ve always noticed: recommendations.

Are you wondering how they recommend items that follow your choices? They use KNN supervised learning to find out what you may need next when you’re buying, and recommend a few more products to you.

Imagine you’re looking for an iPhone to purchase. When you scroll down a little, you see some iPhone cases and tempered glasses, with a note saying, “People who purchased an iPhone have also purchased these items.” The same applies to Netflix and Amazon Prime. When you finish a show or a series, they give you recommendations of the same genre. They do it all using KNN supervised learning, classifying the items for the best user experience.

Advantages Of KNN

Quickest Calculation Time
Simple Algorithms
High Accuracy
Versatile – best use for Regression and Classification.
Doesn’t make any assumptions about data.

Where KNN Is Mostly Used

Simple Recommendation Models
Image Recognition Technology
Decision-Making Models
Calculating Credit Rating

Choosing The Right Value For K

To choose the right value of K, run the KNN algorithm several times with different values of K, and select the value that produces the fewest errors while remaining the most stable.

Your Step-By-Step Guide For Choosing The Value Of K

With K = 1, only the single nearest neighbor of the query point is considered. Suppose the query point sits among elements of class A (-, blue) and class B (+, red), and you would reasonably expect it to be red, but its single nearest neighbor happens to be a blue (-) element. KNN then incorrectly predicts blue.
With K = 2, the two nearest neighbors are one (-) and one (+). With one vote for class A and one for class B (blue and red), the vote is tied, so KNN still cannot reliably predict the right color.
With K = 3, the three nearest neighbors are one (-) and two (+), that is, one blue and two red. The majority vote is red, which matches your expectation, so KNN predicts correctly. The answer is also more stable this time compared to the previous ones.
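The procedure of trying several values of K and keeping the one with the fewest errors can be sketched as below. This is only a minimal illustration, not a reference implementation: the function name `loo_error`, the toy data, and the use of leave-one-out evaluation (each point is predicted from all the others) are assumptions made for the example.

```python
import math
from collections import Counter

def loo_error(points, labels, k):
    """Leave-one-out error rate of a k-NN majority vote (illustrative helper)."""
    errors = 0
    for i, query in enumerate(points):
        # Every point except the query itself, sorted by distance to the query.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(points[j], query))
        # Majority vote among the k nearest neighbors.
        votes = Counter(labels[j] for j in others[:k])
        if votes.most_common(1)[0][0] != labels[i]:
            errors += 1
    return errors / len(points)

# Hypothetical toy data: two well-separated clusters of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["blue"] * 4 + ["red"] * 4

errors_by_k = {k: loo_error(points, labels, k) for k in (1, 3, 7)}
best_k = min(errors_by_k, key=errors_by_k.get)
print(errors_by_k, "best K:", best_k)
```

In this toy data, K = 7 is too large: every query has only three neighbors of its own class, so the other class always wins the vote. This is why K is usually compared over a range of values rather than simply made as large as possible.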

Conclusion

KNN works by computing the distances between a query and all the elements in the database. By choosing the value of K, we select the K elements closest to the query. KNN then returns the most frequent label among them in classification, or the average of their labels in regression.
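The summary above can be written down as a short sketch in plain Python. The function name `knn_predict`, its `task` parameter, and the toy data are hypothetical and chosen only for illustration; Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k, task="classification"):
    """Predict the label of `query` from its k nearest training points."""
    # Euclidean distance from the query to every stored element.
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k closest elements.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    neighbor_labels = [train_labels[i] for i in nearest]
    if task == "classification":
        # Most frequent label among the neighbors.
        return Counter(neighbor_labels).most_common(1)[0][0]
    # Regression: average of the neighbors' labels.
    return sum(neighbor_labels) / k

# Hypothetical toy data: the closest element is blue, the next two are red.
points = [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (0.0, 0.0)]
colors = ["blue", "red", "red", "red"]
values = [1.0, 2.0, 3.0, 10.0]
query = (1.5, 1.5)

print(knn_predict(points, colors, query, k=1))                      # single nearest neighbor
print(knn_predict(points, colors, query, k=3))                      # majority of three
print(knn_predict(points, values, query, k=3, task="regression"))   # mean of three values
```

With this data the single nearest neighbor is blue while the majority of the three nearest neighbors is red, which mirrors the K = 1 versus K = 3 situation described in the guide above.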

Spiky cubes, Pac-Man walking, empty M&M’s chocolate: curse of dimensionality

October 8, 2020/in Artificial Intelligence, Data Mining, Data Science, Machine Learning, Main Category/by Yasuto Tamura

This is the first article of the article series Illustrative introductions on dimension reduction.

“Curse of dimensionality” means the difficulties of machine learning which arise when the dimension of data is high. In short, if the data have too many features like “weight,” “height,” “width,” “strength,” “temperature,” and so on, that can undermine the performance of machine learning. This fact might be contrary to the image you get from the terms “big” data or “deep” learning. You might assume that the more hints you have, the better machine learning performs. There are several reasons for the curse of dimensionality, and in this article I am going to introduce the two major reasons below.

High dimensional data usually have rich expressiveness, but usually training data are too poor for that.
The behaviors of data points in high dimensional space are totally different from our common sense.

Through these topics, you will see that you always have to think about which features to use considering the number of data points.

*From now on I am going to talk only about Euclidean distance. If you are not sure what Euclidean distance means, please just keep in mind that it is the type of distance most people would have learnt in normal compulsory education.

1. Number of samples and degree of dimension

The most straightforward demerit of adding many features, or increasing dimensions of data, is the growth of computational costs. More importantly, however, you always have to think about the degree of dimensions in relation to the number of data points you have. Let me take a simple example from the book “Pattern Recognition and Machine Learning” by C. M. Bishop (PRML). This is an example of measurements of a pipeline. The figure below shows a comparison plot of 3 classes (red, green and blue), with parameter $x_7$ plotted against parameter $x_6$ out of 12 parameters.

* The meaning of data is not important in this article. If you are interested please refer to the appendix in PRML.

Assume that we are interested in classifying the cross in black into one of the three classes. One of the most naive ideas for this classification is dividing the graph into grids and labeling each grid cell depending on the number of samples of each class in it (the cells are colored at the right side of the figure). You can then classify the test sample, the cross in black, into the class of the grid cell that contains it. Thereby the cross is classified into the red class.

Source: C.M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 34-35

As I mentioned in the figure above, we used only two features out of 12 features in total. When the total number of data points is fixed and you add the remaining ten axes/features one after another, what would happen? Let’s see what “adding axes/features” means. If you are talking about 1, 2, or 3 dimensional grids, you can visualize them. And as you can see from the figure below, if you divide each axis into 10 intervals, you get $10^1, 10^2, 10^3$ small regions respectively in 1, 2, 3 dimensional spaces, that is, 10, 100, and 1000 regions. Even though you cannot visualize it anymore, you can make grids for data with more than 3 dimensions. If you continue increasing the degree of dimension, the number of small regions increases exponentially, and it can soon surpass the number of training data points. That means there would be a lot of empty regions in such high dimensional grids. And the classifying method above, coloring each region and classifying unknown samples depending on the colors of the regions, does not work anymore because there would be a lot of empty regions.

* If you are still puzzled by the idea of “more than 3 dimensional grids,” you should not think too much about that now. It is enough if you can get some understandings on high dimensional data after reading the whole article of this.

Source: Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, (2016), MIT Press, p. 153
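To get a feeling for how quickly the grid empties out, the following sketch scatters a fixed number of random points in a unit hypercube and counts the fraction of grid cells that contain at least one point. The function name `occupied_fraction`, the sample size of 1000, and the uniform random data are assumptions made for illustration; they are not the pipeline data from PRML.

```python
import numpy as np

def occupied_fraction(n_samples, dim, bins_per_axis=10, seed=0):
    """Fraction of grid cells in [0, 1)^dim that contain at least one sample."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_samples, dim))                   # uniform random points
    cells = np.floor(data * bins_per_axis).astype(int)    # cell index along each axis
    n_occupied = len(np.unique(cells, axis=0))            # number of distinct cells hit
    return n_occupied / bins_per_axis ** dim

for dim in (1, 2, 3, 6, 12):
    frac = occupied_fraction(n_samples=1000, dim=dim)
    print(f"D={dim:2d}: {10 ** dim:>16,d} cells, {frac:.6%} occupied")
```

With 1000 points, at most 1000 cells can ever be occupied, so once the number of cells exceeds the number of samples, almost all cells are necessarily empty.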

I said the method above is the most naive way, but other classical classification methods, for example the k-nearest neighbors algorithm, are more or less based on a similar idea. Many classical machine learning algorithms are based on the idea of a smoothness prior, or local constancy prior. In short, in classical approaches you do not expect data to change much within a small region, so you can expect unknown samples to be similar to the data in their vicinity. But that soon turns out to be problematic when the dimension of data is bigger, because training data become sparse as the volume of multidimensional space grows exponentially, as I mentioned above. Sometimes you would not be able to find any training data around the test data. Plus, in high dimensional data, you cannot treat distance in the same way as you do in lower dimensional space. The ideas of “close,” “nearby,” or “vicinity” get more obscure in high dimensional data. That point is related to the next topic: the intuition we have cultivated in normal life is not applicable to higher dimensional data.

2. Bizarre characteristics of high dimensional data

We form our sense of recognition in 3-dimensional ways in our normal life. Even though we can visualize only 1, 2, or 3 dimensional data, we can actually generalize the ideas in 1, 2, or 3 dimensional space to higher dimensions. For example, 4 dimensional cubes, 100 dimensional spheres, or orthogonality in 255 dimensional space. Again, you cannot exactly visualize those ideas, and for many people, such high dimensional phenomena are just imaginary matters on blackboards. Those high dimensional ideas are designed to retain some conditions just as in 1, 2, or 3 dimensional space. Let’s take an example of spheres in several dimensional spaces. General spheres in any D-dimensional space can be defined as a set of any $\boldsymbol{x}$ such that $|\boldsymbol{x} - \boldsymbol{c}| = r$, where $\boldsymbol{c}$ is the center point and $r$ is the length of the radius. When $\boldsymbol{x}$ is 2-dimensional, the spheres are called “circles.” When $\boldsymbol{x}$ is 3-dimensional, the spheres are called “spheres” in our normal life, unless the word is used in a conversation in a college cafeteria by some students in a mathematics department. And when $\boldsymbol{x}$ is D-dimensional, they are called D-balls, and again, this is just an imaginary phenomenon on a blackboard.

* Vectors and points are almost the same because all the vectors are denoted as “arrows” from the origin point to sample data points. The only difference is that when you use vectors, you have to consider their directions.

* “D-ball” is usually called “n-ball,” and in such context it is a sphere in a n-dimensional space. But please let me use the term “D-ball” in this article.

Not only spheres, but also many other ideas have been generalized to D-dimensional space, and many of them are indispensable for data science as well. But there is one severe problem: the behavior of data in high dimensional space is quite different from that in two or three dimensional space. To be concrete, in high dimensional space, cubes are spiky, you have to move like Pac-Man, and M & M’s Chocolate looks empty inside but tastes normal.

2.1: spiky cubes
Let’s take a look at an elementary-school-level example of geometry first. Assume that you have several unit squares or unit cubes like below. In each of them a circle or sphere with diameter 1 is inscribed. The length of a diagonal line in each square is $\sqrt{2}$ , and that in each cube is $\sqrt{3}$ .

If you stack the squares or cubes as below, what is the diameter of the blue circle or sphere at the center, which touches all the 4 orange circles or the 8 orange spheres?

The answers are, the diameter of the blue circle is $\sqrt{2} - 1$ , and the diameter of the blue sphere is $\sqrt{3} - 1$ .

Next let’s think about the same situation in higher dimensional space. Assume that there are some unit D-dimensional hypercubes stacked, in each of which a D-ball with diameter 1 is inscribed, touching all the surfaces inside. Then what is the diameter of the D-ball at the center touching all the unit D-balls in the hypercubes? Given the results above, it can be predicted that its diameter is $\sqrt{D} -1$. If that is true, there is one strange point: $\sqrt{D} - 1$ soon surpasses 2 (this happens once $D$ exceeds 9), which means that in the chart above the blue sphere would stick out of the stacked cubes, whose sides have length 2. That sounds like a paradox, but with one hypothesis, the phenomenon makes sense: cubes become more spiky as the degree of dimension grows. This hypothesis is a natural deduction because the diagonal lines of hypercubes get longer, while the center of each surface of a hypercube still touches the unit D-ball with diameter 1 inscribed inside that hypercube.

If you stack 4 hypercubes, the blue sphere circumscribing them will not stick out of the stacked hypercubes anymore like the figure below.

*Of course you cannot visualize what is going on in D-dimensional space, so the figure below is just a pseudo simulation of D-dimensional space in our 3-dimensional sense. I guess you have to stack more than four hyper cubes in higher dimensional data, but you cannot easily imagine what will go on in such space anymore.

*You can confirm the fact that a hypercube gets more spiky as the degree of dimension grows by comparing the volume of the hypercube and the volume of the D-ball inscribed inside the hypercube. Thereby you can prove that the volume of a hypercube concentrates in its corners. Plus, as I mentioned, the longest diagonal of a hypercube gets longer as the degree of dimension increases. That is why hypercubes are said to be spiky. For the mathematical proof, please check Exercise 1.19 of PRML.
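The two quantities discussed in this section can be computed directly. The sketch below is only illustrative: the function names are hypothetical, and the D-ball volume is taken from the standard formula $\frac{\pi^{D/2}}{\Gamma(\frac{D}{2}+1)} r^D$ (see the Wikipedia reference on the volume of an n-ball). It prints the diameter $\sqrt{D} - 1$ of the central ball, and the share of a unit hypercube's volume occupied by its inscribed D-ball with diameter 1.

```python
import math

def inner_ball_diameter(dim):
    """Diameter of the central ball touching all the unit-diameter balls."""
    return math.sqrt(dim) - 1

def inscribed_ball_volume_fraction(dim):
    """Volume of the diameter-1 D-ball divided by the unit hypercube volume (= 1)."""
    radius = 0.5
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim

for dim in (2, 3, 9, 10, 20):
    print(f"D={dim:2d}: inner diameter = {inner_ball_diameter(dim):.3f}, "
          f"ball / cube volume = {inscribed_ball_volume_fraction(dim):.2e}")
```

The central ball reaches diameter 2, the side length of the stacked cubes, exactly at D = 9, while the inscribed ball's share of the cube volume shrinks toward zero, meaning that almost all of the cube's volume sits in its corners.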

2.2: Pac-Man walking

The next intriguing phenomenon in high dimensional space is that most pairs of vectors are nearly orthogonal. In other words, if you select two random vectors in high dimensional space, the angle between them is mostly close to $90^\circ$. Let’s see the general meaning of the angle between two vectors in a space of any dimension. Assume that the angle between two vectors $\boldsymbol{u}$ and $\boldsymbol{v}$ is $\theta$; then $\cos\theta$ is calculated as $\cos\theta = \frac{\langle \boldsymbol{u}, \boldsymbol{v} \rangle}{|\boldsymbol{u}||\boldsymbol{v}|}$. In 1, 2, or 3 dimensional space, you can actually see the angle, but again you can define angles in higher dimensions, which you cannot visualize anymore. Angles are also sometimes used as a measure of similarity of two vectors.

* $\langle \boldsymbol{u}, \boldsymbol{v} \rangle$ is the inner product of $\boldsymbol{u}$ and $\boldsymbol{v}$.

Assume that you generate a pair of points inside a D-dimensional unit sphere and make two vectors $\boldsymbol{u}$ and $\boldsymbol{v}$ by connecting the origin point and those two points respectively. When D is 2, I mean when spheres are circles, all values of $\theta$ are equally likely to be generated, as in the chart below. This fact might match your intuition. How about in 3-dimensional space? In fact the distribution of $\theta$ is not uniform. $\theta = 90^\circ$ is the most likely to be generated. As I explain in the figure below, if you compare the area of the cross section of a hemisphere and the area of a cone whose vertex is the center point of the sphere, you can see why.

I generated 10000 random pairs of points inside a D-dimensional unit sphere, and calculated the angle between them. In other words I just randomly generated two D-dimensional vectors $\boldsymbol{u}$ and $\boldsymbol{v}$, whose elements are randomly generated values between -1 and 1, and calculated the angle between them, repeating this process 10000 times. The charts below are the histograms of the angle between pairs of generated vectors in 2, 3, 50, and 100 dimensional space respectively.

As I explained above, in 2-dimensional space, the distribution of $\theta$ is almost uniform. However the distribution concentrates a little around $90^\circ$ in 3-dimensional space. You can see that the bigger the degree of dimension is, the more the angles of generated vectors concentrate around $90^\circ$. That means most pairs of vectors in high dimensional space are close to orthogonal. Movements are also sequences of vectors, so when most pairs of movement vectors are orthogonal, that means you can only move like Pac-Man in such space.

Source: https://edition.cnn.com/style/article/pac-man-40-anniversary-history/index.html

* Of course I am talking about the arcade Pac-Man game, not Pac-Man in Super Smash Bros. Retro RPG video games might have more similar playability, but in high dimensional space it is also difficult to turn back. At any rate, I think you have understood that it is difficult even to move smoothly in high dimensional space, just as the first notorious Resident Evil on the first PS console had terrible playability.
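The angle experiment in this section can be reproduced with a few lines of NumPy. Note one deliberate difference, which is an assumption of this sketch: instead of drawing each element between -1 and 1, the directions are drawn from a standard normal distribution, which gives uniformly distributed directions. The function name `random_angles` and the seed are also chosen only for illustration.

```python
import numpy as np

def random_angles(dim, n_pairs=10000, seed=0):
    """Angles in degrees between n_pairs random pairs of dim-dimensional vectors."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_pairs, dim))
    v = rng.standard_normal((n_pairs, dim))
    # cos(theta) = <u, v> / (|u| |v|), computed row by row.
    cos_theta = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

for dim in (2, 3, 50, 100):
    angles = random_angles(dim)
    print(f"D={dim:3d}: mean = {angles.mean():6.2f} deg, std = {angles.std():5.2f} deg")
```

The mean stays near 90 degrees in every dimension, but the standard deviation shrinks as D grows, which is the concentration around orthogonality described above. Passing the returned arrays to a histogram function would give charts comparable to the ones in the article.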

2.3: empty M & M’s chocolate

Let’s think about the proportion of the volume of the outermost $\epsilon$ surface of general spheres with radius $r$. First, in two-dimensional space, spheres are circles. The area of the brown part of the circle below is $\pi r^2$. In order to calculate the area of the $\epsilon \cdot r$ thick surface of the circle, you only have to subtract the area of the inner circle, $\pi \{ (1 - \epsilon)\cdot r\} ^2$. When $\epsilon = 0.01$, the area of the outermost surface is $\pi r^2 - \pi (0.99\cdot r)^2$, and its proportion to the area of the whole circle is $\frac{\pi r^2 - \pi (0.99\cdot r)^2}{\pi r^2} = 0.0199$.

In the case of 3-dimensional space, the volume of a sphere with radius $r$ is $\frac{4}{3} \pi r^3$, so the proportion of the $\epsilon$ surface is calculated in the same way: $\frac{\frac{4}{3} \pi r^3 -\frac{4}{3} \pi (0.99\cdot r)^3}{\frac{4}{3}\pi r^3} = 0.0297$. Compared to the case in 2 dimensional space, the proportion is a little bigger.

How about in D-dimensional space? We have seen that even in D-dimensional space the surface of a sphere, I mean a D-ball, can be defined as the set of all points whose distance from the center point is $r$. And it is known that the volume of a D-ball is given as below.

$V_D(r) = \frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} r^D$

$\Gamma ()$ is called the gamma function, but it is not so important in this article. The most important point now is that, for any D-ball, the volume depends on the radius only through the factor $r^D$. That means the proportion of the outer $\epsilon$ surface of a D-ball is calculated as $\frac{r^D - \{ (1 - \epsilon)\cdot r\} ^D}{r^D} = 1 - (1 - \epsilon)^D$. When $\epsilon$ is 0.01, the proportion of the 1% surface of a D-ball changes as in the chart below.

* And of course when $D$ is 2, $\frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} = \pi$ , and when $D$ is 3 , $\frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} = \frac{4}{3} \pi$

You can see that when D is over 230, more than 90% of the volume is concentrated in the very thin 1% surface of the D-ball, and when D is over 400 the proportion exceeds 98%. That is why, in high dimensional space, M & M’s chocolate looks empty but tastes normal: all the chocolate is concentrated beneath the sugar coating.
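Because the constant in front of $r^D$ cancels out, the proportion of the outer shell reduces to $1 - (1 - \epsilon)^D$ and is easy to tabulate. The helper name `shell_fraction` below is hypothetical, and the dimensions in the loop are picked only as examples.

```python
def shell_fraction(dim, eps=0.01):
    """Share of a D-ball's volume lying in its outermost shell of relative thickness eps."""
    # V(r) is proportional to r**dim, so the inner ball holds (1 - eps)**dim of the volume.
    return 1.0 - (1.0 - eps) ** dim

for dim in (2, 3, 10, 100, 400, 1000):
    print(f"D={dim:4d}: {shell_fraction(dim):.4f}")
```

For D = 2 and D = 3 this reproduces the values 0.0199 and 0.0297 calculated by hand above, and the proportion approaches 1 as D grows.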

More interestingly, whichever point you choose as the central point of a sphere with radius $r$, the other points are squashed to the surface of the sphere, even if all the data points are uniformly distributed. This situation is problematic for classical machine learning algorithms, which are often based on the Euclidean distances between pairs of sample data points: if you go from the central point toward another sample point, the probability of finding the point within radius $(1 - \epsilon)\cdot r$ of the center is almost zero. But if you reach the outermost $\epsilon$ part of the sphere, most data points are there. However, for any one of the data points in the surface, all the other data points are distant in the same way.

Inside M & M’s chocolate is a mysterious world.

Source: https://hipwallpaper.com/mms-wallpapers/

You have seen that using high dimensional data can be problematic in many ways. Data science and machine learning are largely based on one idea: you can find a lower dimensional, meaningful, and easier structure in data. In the next articles I am going to introduce some famous dimension reduction algorithms. And hopefully I can give some deeper insights into these algorithms in straightforward ways.

* I could not explain the relationships of variance and bias of data. This is also a very important factor when you think about dimensionality of data. I hope I can write about this topic someday. You can also look it up if you are interested.

[References]

[1] C. M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 33-37

[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, “Deep Learning,” (2016), MIT Press, p. 153

[3] Shiga Kouji, “30 Lesson to Topology,” (1988)

[4] “Volume of an n-ball,” Wikipedia
https://en.wikipedia.org/wiki/Volume_of_an_n-ball

Illustrative introductions on dimension reduction

October 2, 2020/in Artificial Intelligence, Data Mining, Data Science, Machine Learning, Main Category, Mathematics/by Yasuto Tamura

“What is your image on dimensions?”

That might be a cheesy question to ask readers of Data Science Blog, but most people with no scientific background would answer, “One dimension is a line, two dimensions is a plane, and we live in a three-dimensional world.” After that, if you ask “How about the fourth dimension?” many people would answer “Time?”

You can find books or writings about dimensions in various fields. And you can use the word “dimension” in normal conversations, in many contexts.

*In Japanese, if you say “He likes two dimension.” that means he prefers anime characters to real women, as is often the case with Japanese computer science students.

The meanings of “dimensions” depend on the context, but in data science dimension is usually the number of columns (features) of your Excel data, when each row holds one sample.

When you study data science or machine learning, you should usually start by understanding the algorithms with 2 or 3 dimensional data, and then you can apply those ideas to any D dimensional data. But of course you cannot visualize D dimensional data anymore, and you always have to be careful about what happens when you increase the degree of dimension.

Conversely it is also important to reduce dimension to understand abstract high dimensional stuff in 2 or 3 dimensional space, which are close to our everyday sense. That means dimension reduction is one powerful way of data visualization.

In this blog series I am going to explain meanings of dimension itself in machine learning context and algorithms for dimension reductions, such as PCA, LDA, and t-SNE, with 2 or 3 dimensional visible data. Along with that, I am going to delve into the meaning of calculations so that you can understand them in more like everyday-life sense.

This article series is going to be roughly divided into the contents below.

Curse of Dimensionality
Rethinking linear algebra: visualizing linear transformations and eigen vector
The algorithm known as PCA and my taxonomy of linear dimension reductions
Rethinking linear algebra part two: ellipsoids in data science
Autoencoder as dimension reduction (to be published soon)
t-SNE (to be published soon)

I hope you could see that reducing dimension is one of the fundamental approaches in data science or machine learning.

Understanding LSTM forward propagation in two ways

August 21, 2020/in Artificial Intelligence, Data Mining, Data Science, Data Science Hack, Deep Learning, Machine Learning, Main Category, Predictive Analytics/by Yasuto Tamura

*This article is only for the sake of understanding the equations on the second page of the paper named “LSTM: A Search Space Odyssey”. If you have no trouble understanding the equations of LSTM forward propagation, I recommend you skip this article and go to the next article.

*This article is the fourth article of “A gentle introduction to the tiresome part of understanding RNN.”

1. Preface

I heard that in Western culture, smart people write textbooks so that other normal people can understand difficult stuff, and that is why textbooks in Western countries tend to be bulky, but also they are not so difficult as they look. On the other hand in Asian culture, smart people write puzzling texts on esoteric topics, and normal people have to struggle to understand what noble people wanted to say. Publishers also require the authors to keep the texts as short as possible, so even though the textbooks are thin, usually students have to repeat reading the textbooks several times because usually they are too abstract.

Both styles have pros and cons, and usually I prefer Japanese textbooks because they are concise, and sometimes it is annoying to read Western style long texts with concrete straightforward examples to reach one conclusion. But a problem is that when it comes to explaining LSTM, almost all the textbooks are like Asian style ones. Every study material seems to skip the proper steps necessary for “normal people” to understand its algorithms. But after actually making concrete slides on the mathematics of LSTM, I understood why: if you write down all the equations of LSTM forward/back propagation, that is going to be massive, and actually I had to make 100-page PowerPoint animated slides to make it understandable to people like me.

I already had a feeling that “Does it help to understand only LSTM with this precision? I should do more practical codings.” For example François Chollet, the developer of Keras, in his book, said as below.

For me that sounds like “We have already implemented RNNs for you, so just shut up and use Tensorflow/Keras.” Indeed, I have never cared about the architecture of my Mac Book Air, but I just use it every day, so I think he is to the point. To make matters worse, for me, a promising algorithm called Transformer seems to be replacing the position of LSTM in natural language processing. But in this article series and in my PowerPoint slides, I tried to explain as much as possible, contrary to his advice.

But I think, or rather hope, it is still meaningful to understand this 23-year-old algorithm, which is as old as me. I think LSTM did build a generation of algorithms for sequence data, and actually Sepp Hochreiter, the inventor of LSTM, has received Neural Network Pioneer Award 2021 for his work.

I hope those who study sequence data processing in the future would come to this article series, and study basics of RNN just as I also study classical machine learning algorithms.

*In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

2. Why LSTM?

First of all, let’s take a brief look at what I said about the structures of RNNs, in the first and the second article. A simple RNN is basically a densely connected network with a few layers. But the RNN gets an input every time step, and it gives out an output at the time step. Part of the information in the middle layer is passed on to the next time step, and in the next time step, the RNN also gets an input and gives out an output. Therefore, a simple RNN virtually behaves almost the same way as densely connected layers with many layers during forward/back propagation if you focus on its recurrent connections.

That is why simple RNNs suffer from vanishing/exploding gradient problems, where the information exponentially vanishes or explodes when its gradients are multiplied many times through many layers during back propagation. To be exact, I think you need to consider this problem precisely like you can see in this paper. But for now, please at least keep it in mind that when you calculate a gradient of an error function with respect to parameters of simple neural networks, you have to multiply parameters many times like below, and this type of calculation usually leads to vanishing/exploding gradient problem.
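As a rough numerical picture of the repeated multiplication described above, the following toy sketch multiplies one scalar many times. The scalar `factor` is a hypothetical stand-in for the contribution of one layer (or one time step) to the gradient; real networks multiply matrices and derivatives of activation functions, so this is only an illustration, not the actual gradient calculation.

```python
def repeated_product(factor, num_steps):
    """Multiply the same scalar factor num_steps times."""
    grad = 1.0
    for _ in range(num_steps):
        grad *= factor
    return grad

# A factor slightly below 1 shrinks toward zero (vanishing),
# and a factor slightly above 1 blows up (exploding).
shrinking = repeated_product(0.5, 50)
growing = repeated_product(1.5, 50)
print(shrinking, growing)
```

Even with only 50 multiplications the two results differ by many orders of magnitude, which is the intuition behind the vanishing/exploding gradient problem.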

LSTM was invented as a way to tackle such problems as I mentioned in the last article.

3. How to display LSTM

I would like you to just go to image search on Google, Bing, or Yahoo!, and type in “LSTM.” I think you will find many figures, but basically LSTM charts are roughly classified into two types: in this article I call them “Space Odyssey type” and “electronic circuit type”, and in conclusion, I highly recommend you to understand LSTM as the “electronic circuit type.”

*I just randomly came up with the terms “Space Odyssey type” and “electronic circuit type” because the former one is used in the paper I mentioned, and the latter one looks like an electronic circuit to me. You do not have to take how I call them seriously.

However, not all the well-made explanations on LSTM use the “electronic circuit type,” and I am sure you sometimes have to understand LSTM as the “Space Odyssey type.” And the paper “LSTM: A Search Space Odyssey,” which I learned a lot about LSTM from, also adopts the “Space Odyssey type.”

The main reason why I recommend the “electronic circuit type” is that its behavior looks closer to that of simple RNNs, which you would have seen if you read my former articles.

*Behaviors of both of them look different, but of course they are doing the same things.

If you have some understanding on DCL, I think it was not so hard to understand how simple RNNs work because simple RNNs are mainly composed of linear connections of neurons and weights, whose structures are the same almost everywhere. And basically they had only straightforward linear connections as you can see below.

But from now on, I would like you to give up the ideas that LSTM is composed of connections of neurons like the head image of this article series. If you do that, I think that would be chaotic and I do not want to make a figure of it on Power Point. In short, sooner or later you have to understand equations of LSTM.

4. Forward propagation of LSTM in “electronic circuit type”

*For further understanding of mathematics of LSTM forward/back propagation, I recommend you to download my slides.

The behavior of an LSTM block is quite similar to that of a simple RNN block: an RNN block gets an input every time step and gets information from the RNN block of the last time step, via recurrent connections. And the block passes information on to the next block.

Let’s look at the simplified architecture of an LSTM block. First of all, you should keep it in mind that LSTMs have two streams of information: the one going through all the gates, and the one going through cell connections, the “highway” of the LSTM block. For simplicity, we will see the architecture of an LSTM block without peephole connections, the lines in blue. The flow of information through cell connections is relatively uninterrupted. This helps LSTMs to retain information for a long time.

In an LSTM block, the input and the output of the former time step separately go through sections named “gates”: input gate, forget gate, output gate, and block input. The outputs of the forget gate, the input gate, and the block input join the highway of cell connections to renew the value of the cell.

*The small two dots on the cell connections are the “on-ramp” of the cell connection highway.

*You would see the terms “input gate,” “forget gate,” “output gate” almost everywhere, but how to call the “block input” depends on textbooks.

Let’s look at the structure of an LSTM block a bit more concretely. An LSTM block at the time step $(t)$ gets $\boldsymbol{y}^{(t-1)}$ , the output at the last time step, and $\boldsymbol{c}^{(t-1)}$ , the information of the cell at the time step $(t-1)$ , via recurrent connections. The block at time step $(t)$ gets the input $\boldsymbol{x}^{(t)}$ , and it separately goes through each gate, together with $\boldsymbol{y}^{(t-1)}$ . After some calculations and activation, each gate gives out an output. The outputs of the forget gate, the input gate, the block input, and the output gate are respectively $\boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)}$ . The outputs of the gates are mixed with $\boldsymbol{c}^{(t-1)}$ and the LSTM block gives out an output $\boldsymbol{y}^{(t)}$ , and gives $\boldsymbol{y}^{(t)}$ and $\boldsymbol{c}^{(t)}$ to the next LSTM block via recurrent connections.

You calculate $\boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)}$ as below.

$\boldsymbol{f}^{(t)}= \sigma(\boldsymbol{W}_{for} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{for})$
$\boldsymbol{i}^{(t)}=\sigma(\boldsymbol{W}_{in} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{in})$
$\boldsymbol{z}^{(t)}=tanh(\boldsymbol{W}_z \boldsymbol{x}^{(t)} + \boldsymbol{R}_z \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_z)$
$\boldsymbol{o}^{(t)}=\sigma(\boldsymbol{W}_{out} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{out})$

*You have to keep it in mind that the equations above do not include peephole connections, which I am going to show with blue lines in the end.

The equations above are quite straightforward if you understand forward propagation of simple neural networks. In each gate, you add linear transformations of $\boldsymbol{x}^{(t)}$ and $\boldsymbol{y}^{(t-1)}$ with different weights, plus a bias, and apply an activation function. What makes LSTMs different from simple RNNs is how to mix the outputs of the gates with the cell connections. In order to explain that, I need to introduce a mathematical operator called the Hadamard product, which you denote as $\odot$ . This is a very simple operator. This operator produces an elementwise product of two vectors or matrices with identical shape.
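To make the elementwise product concrete, here is a tiny sketch with NumPy, where the `*` operator on two arrays of identical shape is the Hadamard product. The vectors `a` and `b` are made-up values for illustration only.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 0.5, 1.0])

# For NumPy arrays of identical shape, `*` multiplies element by element.
hadamard = a * b
print(hadamard)
```

If you read `b` as the output of a gate, each element of `b` decides how much of the corresponding element of `a` passes through.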

With this Hadamard product operator, the renewed cell and the output are calculated as below.

$\boldsymbol{c}^{(t)} = \boldsymbol{z}^{(t)}\odot \boldsymbol{i}^{(t)} + \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}$
$\boldsymbol{y}^{(t)} = \boldsymbol{o}^{(t)} \odot tanh(\boldsymbol{c}^{(t)})$
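The six equations above can be sketched as one function in NumPy. This is only a minimal sketch without peephole connections, not how any particular library implements LSTM. The dictionary layout of `params`, the helper `init_params`, and the toy dimensions are my own assumptions for illustration; the keys `for`, `in`, `z`, `out` follow the subscripts of the equations.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, y_prev, c_prev, params):
    """One forward step of an LSTM block without peephole connections."""
    W, R, b = params["W"], params["R"], params["b"]
    f = sigmoid(W["for"] @ x + R["for"] @ y_prev + b["for"])  # forget gate
    i = sigmoid(W["in"] @ x + R["in"] @ y_prev + b["in"])     # input gate
    z = np.tanh(W["z"] @ x + R["z"] @ y_prev + b["z"])        # block input
    o = sigmoid(W["out"] @ x + R["out"] @ y_prev + b["out"])  # output gate
    c = z * i + c_prev * f   # renewed cell (elementwise products)
    y = o * np.tanh(c)       # output of the block
    return y, c

def init_params(input_dim, hidden_dim, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    names = ["for", "in", "z", "out"]
    return {
        "W": {n: rng.normal(scale=0.1, size=(hidden_dim, input_dim)) for n in names},
        "R": {n: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for n in names},
        "b": {n: np.zeros(hidden_dim) for n in names},
    }

params = init_params(input_dim=3, hidden_dim=4)
y = np.zeros(4)
c = np.zeros(4)
for x in np.ones((5, 3)):   # a toy sequence of 5 time steps
    y, c = lstm_step(x, y, c, params)
print(y.shape, c.shape)
```

Note that `y` and `c` are handed from one call to the next, which corresponds to the recurrent connections and the cell connections in the figure.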

The values of $\boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{o}^{(t)}$ are compressed into the range of $(0, 1)$ with the sigmoid function $\sigma$ , and the values of $\boldsymbol{z}^{(t)}$ into the range of $(-1, 1)$ with $tanh$ . You can see that the input gate and the block input give new information to the cell. The part $\boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}$ means that the output of the forget gate “forgets” the cell of the last time step by multiplying the values from 0 to 1 elementwise. And the cell $\boldsymbol{c}^{(t)}$ is activated with $tanh()$ and the output of the output gate “suppresses” the activated value of $\boldsymbol{c}^{(t)}$ . In other words, the output gate decides how much information to give out as an output of the LSTM block. The output of every gate depends on the input $\boldsymbol{x}^{(t)}$ , and the recurrent connection $\boldsymbol{y}^{(t-1)}$ . That means an LSTM block learns to forget the cell of the last time step, to renew the cell, and to suppress the output. To describe in an extreme manner, if all the outputs of every gate are always $(1, 1, …1)^T$ , LSTMs forget nothing, retain information of inputs at every time step, and give out everything. And if all the outputs of every gate are always $(0, 0, …0)^T$ , LSTMs forget everything, receive no inputs, and give out nothing.

This model has one problem: the outputs of each gate do not directly depend on the information in the cell. To solve this problem, some LSTM models introduce some flows of information from the cell to each gate, which are shown as lines in blue in the figure below.
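For reference, in the formulation of the paper “LSTM: A Search Space Odyssey” [1], the peephole connections add one extra term to the forget gate, the input gate, and the output gate, while the block input stays the same. Here $\boldsymbol{p}_{for}, \boldsymbol{p}_{in}, \boldsymbol{p}_{out}$ are peephole weight vectors; the subscripts are adapted to the notation of this article, so please take the naming as an assumption rather than the notation of the paper.

$\boldsymbol{f}^{(t)}= \sigma(\boldsymbol{W}_{for} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{for} \odot \boldsymbol{c}^{(t-1)} + \boldsymbol{b}_{for})$
$\boldsymbol{i}^{(t)}=\sigma(\boldsymbol{W}_{in} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{in} \odot \boldsymbol{c}^{(t-1)} + \boldsymbol{b}_{in})$
$\boldsymbol{o}^{(t)}=\sigma(\boldsymbol{W}_{out} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{out} \odot \boldsymbol{c}^{(t)} + \boldsymbol{b}_{out})$

Note that the forget gate and the input gate look at the cell of the last time step $\boldsymbol{c}^{(t-1)}$ , whereas the output gate looks at the renewed cell $\boldsymbol{c}^{(t)}$ , because the output gate is calculated after the cell is renewed.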

Which LSTM model you get, for example with or without peephole connections, depends on the library you use, and the model I have shown is one of the standard LSTM structures. However, no matter how complicated the structure of an LSTM block looks, you usually cover it with a black box as below and show its behavior in a very simplified way.

5. Space Odyssey type

I personally think there are no advantages in understanding how LSTMs work with this Space Odyssey type chart, but in several cases you would have to use this type of chart. So I will briefly explain how to look at that type of chart, based on the understanding of LSTMs you have gained through this article.

In the Space Odyssey type of LSTM chart, the cell is at the center. The electronic circuit type of chart shows the flow of information of the cell as an uninterrupted “highway” in an LSTM block. On the other hand, in a Space Odyssey type of chart, the information of the cell rotates at the center. And each gate gets the information of the cell through peephole connections, $\boldsymbol{x}^{(t)}$ , the input at the time step $(t)$ , and $\boldsymbol{y}^{(t-1)}$ , the output at the last time step $(t-1)$ , which came through recurrent connections. In the Space Odyssey type of chart, you can more clearly see that the information of the cell goes to each gate through the peephole connections in blue. Each gate calculates its output.

Just as in the charts you have seen, the dotted lines denote the information from the past. First, the information of the cell at the time step $(t-1)$ goes to the forget gate and gets mixed with the output of the forget gate. In this process the cell is partly “forgotten.” Next, the outputs of the input gate and the block input are mixed to generate part of the new value of the cell at time step $(t)$ . And the partly “forgotten” $\boldsymbol{c}^{(t-1)}$ goes back to the center of the block and it is mixed with the output of the input gate and the block input. That is how $\boldsymbol{c}^{(t)}$ is renewed. And the value of the new cell flows to the top of the chart, being mixed with the output of the output gate. Or you can also say the information of the new cell is “suppressed” with the output gate.

I have finished the first four articles of this article series, and finally I am going to write about back propagation of LSTM in the next article. I have to say what I have written so far is all for the next article, and my long long Power Point slides.

[References]

[1] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber, “LSTM: A Search Space Odyssey,” (2017)

[2] Francois Chollet, “Deep Learning with Python,” (2018), Manning, pp. 202-204

[3] “Sepp Hochreiter receives IEEE CIS Neural Networks Pioneer Award 2021”, Institute of advanced research in artificial intelligence, (2020)
URL: https://www.iarai.ac.at/news/sepp-hochreiter-receives-ieee-cis-neural-networks-pioneer-award-2021/?fbclid=IwAR27cwT5MfCw4Tqzs3MX_W9eahYDcIFuoGymATDR1A-gbtVmDpb8ExfQ87A

[4] Oketani Takayuki, “Machine Learning Professional Series: Deep Learning,” (2015), pp. 120-125
岡谷貴之著, 「機械学習プロフェッショナルシリーズ深層学習」, (2015), pp. 120-125

[5] Harada Tatsuya, “Machine Learning Professional Series: Image Recognition,” (2017), pp. 252-257
原田達也著, 「機械学習プロフェッショナルシリーズ画像認識」, (2017), pp. 252-257

[6] “Understandable LSTM ~ With the Current Trends,” Qiita, (2015)
「わかるLSTM ～最近の動向と共に」, Qiita, (2015)
URL: https://qiita.com/t_Signull/items/21b82be280b46f467d1b

A brief history of neural nets: everything you should know before learning LSTM

July 16, 2020/in Artificial Intelligence, Data Mining, Data Science, Data Science Hack, Deep Learning, Machine Learning, Main Category, Predictive Analytics, TensorFlow/by Yasuto Tamura

This series is not a college course or something on deep learning with strict deadlines for assignments, so let’s take a detour from practical stuff and take a brief look at the history of neural networks.

The history of neural networks is also a big topic, which could be so long that I had to prepare another article series. And usually I am supposed to begin such articles with something like “The term ‘AI’ was first used by John McCarthy in Dartmouth conference 1956…” but you can find many of such texts written by people with much more experiences in this field. Therefore I am going to write this article from my point of view, as an intern writing articles on RNN, as a movie buff, and as one of many Japanese men who spent a great deal of childhood with video games.

We are now in the third AI boom, and some researchers say this boom began in 2006. A professor in my university said that we are now in a kind of bubble economy in the machine learning/data science industry, but people used to say “Stop daydreaming” to AI researchers. The second AI winter was partly due to the vanishing/exploding gradient problem of deep learning. And LSTM was invented as one way to tackle such problems, in 1997.

1, First AI boom

In the first AI boom, I think people were literally “daydreaming,” even though the applications of machine learning algorithms were limited to simple tasks like playing chess or checkers, or searching for routes in 2d mazes. The AI of this time is sometimes called GOFAI (Good Old Fashioned AI).

Source: https://www.youtube.com/watch?v=K-HfpsHPmvw&feature=youtu.be

Even today when someone uses the term “AI” merely for tasks with neural networks, that amuses me because for me deep learning is just statistically and automatically training neural networks, which are capable of universal approximation, into some classifiers/regressors. Actually the algorithms behind that are quite impressive, but the structure of human brains is much more complicated. The hype of “AI” already started in this first AI boom. Let me take an example of machine translation in this video. In fact the research of machine translation already started in the early 1950s, and of specific interest at the time was translation between English and Russian due to the Cold War. In the first article of this series, I said one of the most famous applications of RNN is machine translation, such as Google Translate and DeepL. They are a type of machine translation called neural machine translation because they use neural networks, especially RNNs. Neural machine translation was an astonishing breakthrough around 2014 in the machine translation field. The former major type of machine translation was statistical machine translation, based on statistical language models. And the machine translators in the first AI boom were rule-based machine translators, which are more primitive than statistical ones.

Source: https://news.cornell.edu/stories/2019/09/professors-perceptron-paved-way-ai-60-years-too-soon

The most remarkable invention in this time was of course the perceptron by Frank Rosenblatt. Some people say that this is the first neural network. Even though you can implement a perceptron with a few lines of code in Python, obviously they did not have Jupyter Notebook in those days. The perceptron was implemented as a huge instrument named Mark 1 Perceptron, and it was composed of randomly connected wires. I do not precisely know how it works, but it was a huge effort to implement even the most primitive type of neural networks. They needed to use a big lighting fixture to get a 20*20 pixel image using a 20*20 array of cadmium sulphide photocells. The research by Rosenblatt, however, was criticized by Marvin Minsky in his book because perceptrons could only be used for linearly separable data. To make matters worse, the criticism spread as a belief that more general, multi-layer perceptrons were also not useful for linearly inseparable data (as I mentioned in the first article, multi-layer perceptrons, namely normal neural networks, can be universal approximators, which have the potential to classify/regress various types of complex data). In case you do not know what “linearly separable” means, imagine that there are data plotted on a piece of paper. If an elementary school kid can draw a border line between two clusters of the data with a ruler and a pencil on the paper, the 2d data is “linearly separable”….
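To give a feeling of how short a perceptron is in Python, and what “linearly separable” means in practice, here is a minimal sketch of the classic perceptron learning rule on two toy datasets. The function names, the learning rate, and the number of epochs are my own choices for illustration, and of course this has nothing to do with how the Mark 1 Perceptron hardware worked.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron learning rule for labels in {-1, +1}."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * activation <= 0:  # misclassified, or exactly on the border
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]   # AND: one straight line separates the classes
xor_labels = [-1, 1, 1, -1]    # XOR: no straight line can separate them

w, b = train_perceptron(points, and_labels)
and_predictions = [predict(w, b, p) for p in points]

w_xor, b_xor = train_perceptron(points, xor_labels)
xor_predictions = [predict(w_xor, b_xor, p) for p in points]
print(and_predictions, xor_predictions)
```

The perceptron reproduces the AND labels, but no matter how long you train, it cannot reproduce all four XOR labels, because a single border line cannot split the XOR data.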

With big disappointments to the research on “electronic brains,” the budget of AI research was reduced and AI research entered its first winter.

Source: https://www.nzz.ch/digital/ehre-fuer-die-deep-learning-mafia-ld.1472761?reduced=true and https://anatomiesofintelligence.github.io/posts/2019-06-21-organization-mark-i-perceptron

I think the frame problem (1969), by John McCarthy and Patrick J. Hayes, is also an iconic theory at the end of the first AI boom. This theory is known through a story of creating a robot trying to pull out its battery on a wheeled wagon in a room. But there is also a time bomb on the wagon. The first prototype of the robot, named R1, naively tried to pull the wagon out from the room, and the bomb exploded. The problem was obvious: R1 was not programmed to consider the risks of taking each action, so the researchers made the next prototype named R1D1, which was programmed to consider the potential risks of taking each action. When R1D1 tried to pull out the wagon, it realized the risk of pulling the bomb together with the battery. But soon it started considering all the potential risks, such as the risk of the ceiling falling down, the distance between the wagon and all the walls, and so on, when the bomb exploded. The next problem was also obvious: R1D1 was not programmed to distinguish whether the factors are relevant or irrelevant to the main purpose, and the next prototype R2D1 was programmed to distinguish them. This time, R2D1 started thinking about “whether the factor is irrelevant to the main purpose,” on every factor measured, and again the bomb exploded. How can we get a perfect AI, R2D2?

The situation mentioned above is a bit extreme, but it is said that AI could also get stuck when it tries to take some super simple actions like finding a number in a phone book and making a phone call. It is difficult for an artificial intelligence to decide what is relevant and what is irrelevant, but humans will not get stuck with such simple stuff, and sometimes the frame problem is counted as the most difficult and essential problem of developing AI. But personally I think the original frame problem was unreasonable in that McCarthy, in his attempts to model the real world, was inflexible in his handling of the various equations involved, treating them all with equal weight regardless of the particular circumstances of a situation. Some people say that McCarthy, who was an advocate for AI, also wanted to see the field come to an end, due to its failure to meet the high expectations it once aroused.

Not only the frame problem, but also many other AI-related technological/philosophical problems have been proposed, such as the Chinese room (1980) and the symbol grounding problem (1990), and they are thought to be obstacles to inventing artificial intelligence, but I omit those topics in this article.

*The name R2D2 did not come from the famous story of the frame problem. Daniel Dennett first proposed the story with the robots in his paper published in 1984, while Star Wars was first released in 1977. It is said that the name R2D2 came from “Reel 2, Dialogue 2,” which George Lucas said during film shooting. And the design of C3PO came from Maria in Metropolis (1927). It is said that the most famous AI duo in movie history was inspired by Tahei and Matashichi in The Hidden Fortress (1958), directed by Kurosawa Akira.

Source: https://criterioncollection.tumblr.com/post/135392444906/the-original-r2-d2-and-c-3po-the-hidden-fortress

Interestingly, at the end of the first AI boom, 2001: A Space Odyssey, directed by Stanley Kubrick, was released in 1968. Unlike conventional fantasylike AI characters, for example Maria in Metropolis (1927), HAL 9000 was portrayed as a very realistic AI, and the movie already pointed out the risk of AI going insane when it gets some commands from several users. HAL 9000 has remained a very iconic character in the AI field. For example when you say some quotes from 2001: A Space Odyssey to Siri you get some parody responses. I also think you should keep it in mind that in order to make an AI like HAL 9000 come true, for now RNNs would be indispensable in many ways: you would need RNNs for better voice recognition, better conversational systems, and for reading lips.

Source: https://imgflip.com/memetemplate/34339860/Open-the-pod-bay-doors-Hal

*Just as you cannot understand Monty Python references in Python official tutorials without watching Monty Python and the Holy Grail, you cannot understand many parodies in AI contexts without watching 2001: A Space Odyssey. Even though the movie had some interview videos with some researchers and some narrations, Stanley Kubrick cut off all the footage and made the movie very difficult to understand. Most people did not or do not understand that it is a movie about aliens who gave homework of coming to Jupiter to human beings.

2, Second AI boom/winter

Source: Fukushima Kunihiko, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” (1980)

I am not going to write about the second AI boom in detail, but at least you should keep it in mind that convolutional neural network (CNN) is a keyword in this time. Neocognitron, an artificial model of how sight nerves perceive things, was invented by Kunihiko Fukushima in 1980, and the model is said to be the origin of CNN. And Neocognitron was inspired by Hubel and Wiesel’s research on sight nerves. In 1989, a group at AT&T Bell Laboratories led by Yann LeCun invented the first practical CNN to read handwritten digits.

Y. LeCun, “Backpropagation Applied to Handwritten Zip Code Recognition,” (1989)

Another turning point in this second AI boom was that the back propagation algorithm was discovered, and the CNN by LeCun was also trained with back propagation. LeCun made a deep neural network with several layers in 1998 for more practical uses.

But his research did not gain so much attention like today, because AI research entered its second winter at the beginning of the 1990s, and that was partly due to vanishing/exploding gradient problem of deep learning. People knew that neural networks had potentials of universal approximation, but when they tried to train naively stacked neural nets, the gradients, which you need for training neural networks, exponentially increased/decreased. Even though the CNN made by LeCun was the first successful case of “deep” neural nets which did not suffer from the vanishing/exploding gradient problem so much, deep learning research also stagnated in this time.

The ultimate goal of this article series is to understand LSTM at a more abstract/mathematical level because it is one of the practical RNNs, but the idea of LSTM (Long Short Term Memory) itself was already proposed in 1997 as an RNN algorithm to tackle the vanishing gradient problem. (The exploding gradient problem is solved with a technique named gradient clipping, and this is easier than techniques for preventing vanishing gradient problems. I am also going to explain it in the next article.) After that some other techniques, like introducing the forget gate and peephole connections, were discovered, but basically it took some 20 years till LSTM got attention like today. The reasons for that are the lack of hardware and data sets, and those were also major reasons for the second AI winter.

Source: Sepp Hochreiter, Jürgen Schmidhuber, “Long Short-term Memory,” (1997)
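Just to show how simple gradient clipping is, here is a minimal sketch of one common variant, clipping by the L2 norm: if the norm of a gradient vector exceeds a threshold, the vector is rescaled so that its norm equals the threshold, and its direction is kept. The function name `clip_by_norm` and the threshold value are my own assumptions for illustration; deep learning libraries provide their own versions.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploded = np.array([300.0, -400.0])   # L2 norm is 500
clipped = clip_by_norm(exploded, max_norm=5.0)
print(clipped)
```

A gradient whose norm is already below the threshold is returned unchanged, so the technique only intervenes when the gradient explodes.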

In the 1990s, the middle of the second AI winter, the Internet started prevailing for commercial uses. I think one of the iconic events in this time was that the source code of the WWW (World Wide Web) was released in 1993. Some of you might still remember that you little by little became able to transmit more data online in this time. That means people came to get more and more access to various datasets in those days, which is indispensable for machine learning tasks.

After all, we could not get HAL 9000 by the end of 2001, but instead we got Xbox console.

3, Video game industry and GPU

Even though research on neural networks stagnated in the 1990s, the same period witnessed an advance in the computation of massive parallel linear transformations, due to their need in fields such as image processing.

Computer graphics move or rotate in 3d spaces, and that is also linear transformations. When you think about a car moving in a city, it is convenient to place the car, buildings, and other objects on a fixed 3d space. But when you need to make computer graphics of scenes of the city from a view point inside the car, you put a moving origin point in the car and see the city. The spatial information of the city is calculated as vectors from the moving origin point. Of course this is also linear transformations. And I am not talking about a dot or simple figures moving in the 3d spaces. Computer graphics are composed of numerous plane panels, each of them has at least three vertices, and they move in 3d spaces. Depending on viewpoints, you need to project the 3d graphics in 3d spaces onto 2d spaces to display the graphics on devices. You need to calculate which part of the panel is projected to which pixel on the display, and that is called rasterization. Plus, in order to get photorealistic images, you need to think about how lights from light sources reflect on the panels and are projected on the display. And you also have to put some textures on groups of panels. You might also need to change color spaces, which is also linear transformations.

My point is, in short, you really need to do numerous linear transformations in parallel in image processing.
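As a very small example of such linear transformations, the sketch below rotates the three vertices of one triangular panel about the z axis and then projects them onto a 2d plane, each with a single matrix multiplication. The panel coordinates are made up, and I use an orthographic projection that simply drops the z coordinate as a simplifying assumption; real graphics pipelines use perspective projection and many more steps.

```python
import numpy as np

def rotation_about_z(theta):
    """3x3 matrix rotating points about the z axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

# One triangular panel: each row is a vertex (x, y, z).
triangle = np.array([[1.0, 0.0, 2.0],
                     [0.0, 1.0, 2.0],
                     [0.0, 0.0, 3.0]])

# Rotate all the vertices by 90 degrees at once.
rotated = triangle @ rotation_about_z(np.pi / 2).T

# Orthographic projection onto the x-y plane: drop the z coordinate.
projection = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
on_screen = rotated @ projection.T
print(on_screen)
```

Every vertex goes through the same matrices independently of the others, and that is why this kind of calculation is so suitable for parallel processing.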

When it comes to the use of CGI in movies, two pioneer movies were released during this time: Jurassic Park in 1993, and Toy Story in 1995. It is famous that Pixar used to be one of the departments in ILM (Industrial Light and Magic), founded by George Lucas, and Steve Jobs bought the department. Even though the members of Pixar had not even made a long feature film in their lives, after trial and error, they made the first CGI animated feature movie. On the other hand, in order to acquire funds for the production of Schindler’s List (1993), Steven Spielberg took on Jurassic Park (1993), consequently changing the history of CGI through this “side job.”

Source: http://renderstory.com/jurassic-park-23-years-later/

*I think you have realized that George Lucas is mentioned almost everywhere in this article. His influences on technologies are not only limited to image processing, but also sound measuring system, nonlinear editing system. Photoshop was also originally developed under his company. I need another article series for this topic, but maybe not in Data Science Blog.

Source: https://editorial.rottentomatoes.com/article/5-technical-breakthroughs-in-star-wars-that-changed-movies-forever/

Considering that the first wire-frame computer graphics made and displayed by computers appeared in the scene of displaying the wire frame structure of Death Star in a war room, in Star Wars: A New Hope, the development of CGI was already astonishing at this time. But I think deep learning owes its development more to the video game industry.

*I said that the Death Star scene is the first use of graphics made and DISPLAYED by computers, because I have to say one of the first graphics in movie MADE by computer dates back to the legendary title sequence of Vertigo(1958).

When it comes to 3D video games, the processing unit has to constantly deal with real time commands from controllers. It is famous that the GPU was originally specifically designed for plotting computer graphics. The video game market is the biggest in the entertainment industry in general, and it is said that the quality of computer graphics has the strongest correlation with video game sales; therefore enhancing this quality is a priority for the video game console manufacturers.

One good example to see how much video games developed is comparing the original Final Fantasy 7 and the remake. The original one was released in 1997, the same year as when LSTM was invented. And the remake version of Final Fantasy 7 was finally released this year. The original one was also made with a very big budget, and it was divided into three CD-ROMs. The original one was also very revolutionary given that the former titles of the Final Fantasy franchise were all retro style 2d video games. But still the computer graphics look like polygons, and in almost all scenes the camera angle was fixed in the original one. On the other hand the remake is very photorealistic and you can move the angle of the camera as you want while you play the video game.

There were also fierce battles by graphic processor manufacturers in computer video game market in the 1990s, but personally I think the release of Xbox console was a turning point in the development of GPU. To be concrete, Microsoft adopted a type of NV20 GPU for Xbox consoles, and that left some room of programmability for developers. The chief architect of NV20, which was released under the brand of GeForce3, said making major changes in the company’s graphic chips was very risky. But that decision opened up possibilities of uses of GPU beyond computer graphics.

Source: https://de.wikipedia.org/wiki/Nvidia-GeForce-3-Serie

I think that the idea of a programmable GPU provided other scientific fields with more visible benefits after CUDA was launched. And the GPU gained its position not only in deep learning, but also in many other fields, including building supercomputers.

*When it comes to deep learning, even GPUs have strong rivals. The TPU (Tensor Processing Unit), made by Google, is specialized for deep learning tasks and has astonishing processing speed. And the FPGA (Field Programmable Gate Array), which was originally invented as a customizable electronic circuit, has proved to be efficient at reducing the electricity consumption of deep learning tasks.

*I am not so sure about this GPU part. Processing units, including GPUs, are another big topic that is beyond my capacity, to be honest. I would appreciate it if you could share your views, with some references to support them, in the comment section or via email.

*If you are interested, you should see this video of game fans’ reactions to the announcement of Final Fantasy 7. This is the industry which grew in the background of the development of deep learning, and many fields that need parallel computation owe a debt to the nerds who spent a lot of money on video games, including me.

*But ironically, the engineers who invented the GPU said they did not play video games, simply because they were busy. If you try to study the technologies behind video games, you will not have much time to play them. That is the reality.

We have seen that in this second AI winter, the Internet and the GPU laid the foundation of the next AI boom. But the last piece of the puzzle is still missing: in the next section, let’s look at the breakthrough which solved the vanishing/exploding gradient problem of deep learning.
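As a reminder of what the vanishing gradient problem looks like numerically, here is a minimal sketch. It assumes a toy "network" that is just a chain of scalar sigmoid layers sharing one weight; the function name `input_gradient` and the default values are my own illustrative choices, not something taken from a paper or library. Because the derivative of a sigmoid is at most 0.25, the chain rule multiplies many small factors together and the gradient shrinks quickly as the chain gets deeper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(n_layers, weight=1.0, x=0.5):
    """Return d(output)/d(input) for a chain of scalar sigmoid layers.

    Toy assumption: every layer computes sigmoid(weight * previous_activation).
    """
    activation = x
    gradient = 1.0
    for _ in range(n_layers):
        activation = sigmoid(weight * activation)
        # Chain rule: each layer multiplies the gradient by w * s * (1 - s),
        # and s * (1 - s) is never larger than 0.25.
        gradient *= weight * activation * (1.0 - activation)
    return gradient


if __name__ == "__main__":
    for depth in (1, 5, 10, 20):
        print(f"depth {depth:2d}: gradient w.r.t. input = {input_gradient(depth):.3e}")
```

With a weight of 1, every extra layer shrinks the gradient by at least a factor of four, so the earliest layers of a deep stack receive almost no training signal.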

4, Pretraining of deep belief networks: “The Dawn of Deep Learning”

Some researchers say that Geoffrey Hinton’s invention of pretraining for deep belief networks was the breakthrough which put an end to the last AI winter. Deep belief networks are a different type of network from the neural networks we have discussed, but their architectures are similar. It was also unknown how to train deep belief nets when they have several layers. Hinton discovered that training the networks layer by layer in advance can tackle the vanishing gradient problem. Later it was discovered that you can also pretrain neural networks layer by layer with autoencoders.
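The layer-by-layer idea can be sketched roughly as follows. This is a minimal NumPy sketch of greedy pretraining with autoencoders, not Hinton's original procedure for deep belief networks, which trains restricted Boltzmann machines instead. The function names `train_autoencoder` and `pretrain_stack`, the layer sizes, the learning rate, and the random toy data are all assumptions made for illustration. Each layer is trained on its own to reconstruct its input, and its hidden activations then become the input of the next layer.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(data, n_hidden, epochs=200, lr=0.5, seed=0):
    """Train one sigmoid-encoder / linear-decoder autoencoder with plain
    gradient descent. Returns the encoder (W, b) and the loss history."""
    rng = np.random.default_rng(seed)
    n_in = data.shape[1]
    W_enc = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, size=(n_hidden, n_in))
    b_dec = np.zeros(n_in)
    losses = []
    for _ in range(epochs):
        hidden = sigmoid(data @ W_enc + b_enc)   # encode
        recon = hidden @ W_dec + b_dec           # decode (linear)
        error = recon - data
        losses.append(float(np.mean(error ** 2)))
        # Backpropagate the mean squared reconstruction error.
        grad_recon = 2.0 * error / error.size
        grad_hidden = grad_recon @ W_dec.T
        grad_pre = grad_hidden * hidden * (1.0 - hidden)
        W_dec -= lr * (hidden.T @ grad_recon)
        b_dec -= lr * grad_recon.sum(axis=0)
        W_enc -= lr * (data.T @ grad_pre)
        b_enc -= lr * grad_pre.sum(axis=0)
    return W_enc, b_enc, losses


def pretrain_stack(data, layer_sizes):
    """Greedy layer-wise pretraining: train one autoencoder per layer,
    feeding each layer the hidden activations of the previous one."""
    encoders, histories = [], []
    layer_input = data
    for n_hidden in layer_sizes:
        W, b, losses = train_autoencoder(layer_input, n_hidden)
        encoders.append((W, b))
        histories.append(losses)
        layer_input = sigmoid(layer_input @ W + b)
    return encoders, histories


# Toy data: 64 random samples with 8 features (purely illustrative).
toy_data = np.random.default_rng(42).random((64, 8))
encoders, histories = pretrain_stack(toy_data, layer_sizes=[6, 3])
for i, losses in enumerate(histories):
    print(f"layer {i}: reconstruction loss {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In practice the pretrained encoder weights would then be used to initialize a deep network, which is fine-tuned on the actual task with ordinary backpropagation.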

*Deep belief networks are beyond the scope of this article series. To explain them I would have to talk about generative models, Boltzmann machines, and some other topics.

Pretraining of neural networks is not mainstream anymore. But I think it is very meaningful to know that major deep learning techniques, such as ReLU activation functions, optimization with Adam, dropout, and batch normalization, came up as more effective alternatives after the advent of the pretraining techniques, and now we are in the third AI boom.

In the article after next, we are finally going to work on LSTM. Specifically, I am going to offer a clearer guide to a well-made paper on LSTM, named “LSTM: A Search Space Odyssey.”

[References]

[1] Taniguchi Tadahiro, “An Illustrated Guide to Artificial Intelligence”, (2010), Kodansha pp. 3-11
谷口忠大著, 「イラストで学ぶ人工知能概論」, (2010), 講談社, pp. 3-11

[2] Francois Chollet, “Deep Learning with Python,” (2018), Manning, pp. 14-24

[3] Okatani Takayuki, “Machine Learning Professional Series: Deep Learning,” (2015), pp. 1-5, 151-156
岡谷貴之著, 「機械学習プロフェッショナルシリーズ深層学習」, (2015), pp. 1-5, 151-156

[4] Abigail See, Matthew Lamm, “Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 8: Machine Translation, Sequence-to-sequence and Attention,” (2020),
URL: http://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture08-nmt.pdf

[5] C. M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 192-196

[6] Daniel C. Dennett, “Cognitive Wheels: the Frame Problem of AI,” (1984), pp. 1-2

[7] Machiyama Tomohiro, “Understanding Cinemas of 1967-1979,” (2014), Yosensya, pp. 14-30
町山智浩著, 「＜映画の見方＞が分かる本」,(2014), 洋泉社, pp. 14-30

[8] Harada Tatsuya, “Machine Learning Professional Series: Image Recognition,” (2017), pp. 156-157
原田達也著, 「機械学習プロフェッショナルシリーズ画像認識」, (2017), pp. 156-157

[9] Suyama Atsushi, “Machine Learning Professional Series: Bayesian Deep Learning,” (2019)
須山敦志著, 「機械学習プロフェッショナルシリーズベイズ深層学習」, (2019)

[10] “Understandable LSTM ~ With the Current Trends,” Qiita, (2015)
「わかるLSTM ～最近の動向と共に」, Qiita, (2015)
URL: https://qiita.com/t_Signull/items/21b82be280b46f467d1b

[11] Hisa Ando, “WEB+DB PRESS plus series: Technologies Supporting Processors – The World Endlessly Pursuing Speed,” (2017), Gijutsu-hyoron-sya, pp. 313-317
Hisa Ando, 「WEB+DB PRESS plusシリーズプロセッサを支える技術― 果てしなくスピードを追求する世界」, (2017), 技術評論社, pp. 313-317

[12] “Takahashi Yoshiki and Utamaru discuss George Lucas,” miyearnZZ Labo, (2016)
“高橋ヨシキと宇多丸　ジョージ・ルーカスを語る,” miyearnZZ Labo, (2016)
URL: https://miyearnzzlabo.com/archives/38865

[13] Katherine Bourzac, “Chip Hall of Fame: Nvidia NV20 The first configurable graphics processor opened the door to a machine-learning revolution,” IEEE SPECTRUM, (2018)
URL: https://spectrum.ieee.org/tech-history/silicon-revolution/chip-hall-of-fame-nvidia-nv20

Data Science for the Smart Home at the Family-Run Company Miele

June 30, 2020/in Artificial Intelligence, Data Science, Insights, Interviews, Machine Learning, Main Category, Use Cases/by Benjamin Aunkofer

Dr. Florian Nielsen is Principal for AI and Data Science at Miele in the Smart Home division and is responsible for developing data-driven digital products and product extensions. A computer scientist by training, he earned his doctorate at Ulm University on the topic of multimodal cognitive technical systems.

Data Science Blog: Dr. Nielsen, many companies and users already talk about the smart home today, but what they have is more of a remote home. How do you turn it into a genuinely smart home?

That matches my own perception. Merely controlling connected products via digital devices does not make a connected product a “smart” one. However, this remote function is a necessary piece of the puzzle in the evolution from a non-connected product, through an intelligent, connected product, to an ecosystem of complementary smart products and services. Connected products, even if they can only be controlled remotely, generate data and allow us to drive the personalization, optimization, or even automation of product functions based on that data. For me, a product becomes “smart” when, for example, it adapts better to the user’s needs or makes everyday work easier through assistance functions.

Data Science Blog: Smart home, in turn, is a broad term that covers far more than appliances for kitchens and bathrooms. How far will you be able to advance into the smart home?

For me, smart home is almost a burnt-out term. Users associate it mainly with controlling heating and roller shutters. In principle, it is about a vision in which smart, connected products are embedded in a context-based ecosystem in order to offer each user added value in everyday life, not only at home, through intelligent products and services. For us, this does not begin only when a cooking process is started on Miele appliances, for example, but potentially covers the complete “user journey” around nutrition (e.g. inspiration, shopping, stock-keeping) and cooking. Of course, we think especially about how products and services could complement our existing product portfolio or make it more accessible to the user, but we do not limit ourselves to that. An additional aspect that is essential for us at Miele, however, is the customer’s privacy. When evaluating potential use cases, our customers’ privacy always plays an important role.

Data Science Blog: Most data science departments deal mainly with processes, e.g. quality monitoring or process optimization in production. You, however, use data science as a component of products. What needs to be considered there?

Customer needs. We believe in user-centered product development, so for us everything starts with identifying needs and potential solutions for them. We usually start with “design thinking” to identify the topics that offer real added value for the customer. If data science then turns out to be part of the derived solution, we become more heavily involved. A major challenge is that we often cannot start from a greenfield, at least not when it comes to an additional product feature that has to cope with existing appliance hardware, the connectivity architecture, and the resulting data basis. Our new product generations are capable of “remote updates,” but even that sometimes helps us only to a limited extent. Anticipating appliance requirements is therefore essential. Things naturally look somewhat better when implementing cloud-based use cases.

Data Science Blog: It is often said that data scientists are almost impossible to find. Is recruiting actually still an issue for you?

Data scientists, here not interpreted as the mythical “unicorn” or “full-stack” type, are of course important, and also not easy to get in a region like Gütersloh. But engineers, whether data, ML, cloud, or software in general, are the far more essential building block for us. Implementing ideas simply takes a lot of engineering. It is by now well known that data science makes up a very important, but also smaller, part of a data-driven product. Apart from that, I have the impression that more and more “data science” degree programs are being set up, which on the one hand make our search for staff easier and on the other hand allow us to hire specialists who do not (have to) hold a PhD as in the past.

Data Science Blog: You have already successfully integrated several analyses into your products. Which challenges had to be overcome in doing so? And which still lie ahead of you today?

Like many data science departments, we are still a relatively young unit. For most of our smart products and services we are currently in MVP development, so there are several challenges we are experiencing first-hand. As mentioned above, this starts with taking account of existing appliance constraints, continues with sometimes heterogeneous, inconsistent data bases, and extends to establishing data science infrastructure and deployment processes. In my view, many companies also face the challenge of ensuring the further development and operation of AI and data science products. Compared with the “fire-and-forget” mindset that followed the start of series production in earlier times, a rethink is needed. Data-driven products and services “live” and must accordingly be treated and cared for differently, with more effort but also with the chance of becoming “ever better.” That is why we will increasingly find buzzwords such as “MLOps” in the usual consulting literature when it comes to sustainably generating added value from AI and data science for companies. And rightly so.

Data Science Blog: Data-driven thinking is now demanded both of employees in the business departments and of management. For a traditional company like Miele, that is surely a challenge. How could you promote this way of thinking in the company?

Data-driven thinking can only be established if access to data, and to analyses built on it, is available in the first place. That is why data democratization is the most important first step. From my perspective, it is about initially demonstrating the potential and then using data to reduce uncertainty. Our experience is that many business departments have a genuine interest in a data-driven analysis of their hypotheses and are grateful for data-driven support. Miele was and is a very innovative company that wants to become “ever better.” That is why we are currently receiving great support from the very top and are in a very positive mood. We think that a step in the right direction has already been taken and that, with a growing number of multipliers, “data-driven thinking” can establish itself throughout the company.
