My elaborate study notes on reinforcement learning

I will not tell you why, but all of a sudden I was in need of writing an article series on Reinforcement Learning. Though I am also a beginner in reinforcement learning field. Everything I knew was what I learned from one online lecture conducted in a lazy tone in my college. However in the process of learning reinforcement learning, I found a line which could connect the two dots, one is reinforcement learning and the other is my studying field. That is why I made up my mind to make an article series on reinforcement learning seriously.

To be a bit more concrete, I imagine that technologies in our world could be enhanced by a combination of reinforcement learning and virtual reality. That means companies like Toyota or VW might come to invest on visual effect or video game companies more seriously in the future. And I have been actually struggling with how to train deep learning with cgi, which might bridge the virtual world and the real world.

As I am also a beginner in reinforcement learning, this article series would a kind of study note for me. But as I have been doing in my former articles, I prefer exhaustive but intuitive explanations on AI algorithms, thus I will do my best to make my series as instructive and effective as existing tutorial on reinforcement learning.

This article is going to be composed of the following contents.

In this article I would like to share what I have learned about RL, and I hope you could get some hints of learning this fascinating field. In case you have any comments or advice on my “study note,” leaving a comment or contacting me via email would be appreciated.

Rethinking linear algebra part two: ellipsoids in data science

*This is the fourth article of my article series “Illustrative introductions on dimension reduction.”

1 Our expedition of eigenvectors still continues

This article is still going to be about eigenvectors and PCA, and this article still will not cover LDA (linear discriminant analysis). Hereby I would like you to have more organic links of the data science ideas with eigenvectors.

In the second article, we have covered the following points:

  • You can visualize linear transformations with matrices by calculating displacement vectors, and they usually look like vectors swirling.
  • Diagonalization is finding a direction in which the displacement vectors do not swirl, and that is equal to finding new axis/basis where you can describe its linear transformations more straightforwardly. But we have to consider diagonalizability of the matrices.
  • In linear dimension reduction such as PCA or LDA, we mainly use types of matrices called positive definite or positive semidefinite matrices.

In the last article we have seen the following points:

  • PCA is an algorithm of calculating orthogonal axes along which data “swell” the most.
  • PCA is equivalent to calculating a new orthonormal basis for the data where the covariance between components is zero.
  • You can reduced the dimension of the data in the new coordinate system by ignoring the axes corresponding to small eigenvalues.
  • Covariance matrices enable linear transformation of rotation and expansion and contraction of vectors.

I emphasized that the axes are more important than the surface of the high dimensional ellipsoids, but in this article let’s focus more on the surface of ellipsoids, or I would rather say general quadratic curves. After also seeing how to draw ellipsoids on data, you would see the following points about PCA or eigenvectors.

  • Covariance matrices are real symmetric matrices, and also they are positive semidefinite. That means you can always diagonalize covariance matrices, and their eigenvalues are all equal or greater than 0.
  • PCA is equivalent to finding axes of quadratic curves in which gradients are biggest. The values of quadratic curves increases the most in those directions, and that means the directions describe great deal of information of data distribution.
  • Intuitively dimension reduction by PCA is equal to fitting a high dimensional ellipsoid on data and cutting off the axes corresponding to small eigenvalues.

Even if you already understand PCA to some extent, I hope this article provides you with deeper insight into PCA, and at least after reading this article, I think you would be more or less able to visually control eigenvectors and ellipsoids with the Numpy and Maplotlib libraries.

*Let me first introduce some mathematical facts and how I denote them throughout this article in advance. If you are allergic to mathematics, take it easy or please go back to my former articles.

  • Any quadratic curves can be denoted as \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0, where \boldsymbol{x}\in \mathbb{R}^D , A \in \mathbb{R}^{D\times D} \boldsymbol{b}\in \mathbb{R}^D s\in \mathbb{R}.
  • When I want to clarify dimensions of variables of quadratic curves, I denote parameters as A_D, b_D.
  • If a matrix A is a real symmetric matrix, there exist a rotation matrix U such that U^T A U = \Lambda, where \Lambda = diag(\lambda_1, \dots, \lambda_D) and U = (\boldsymbol{u}_1, \dots , \boldsymbol{u}_D). \boldsymbol{u}_1, \dots , \boldsymbol{u}_D are eigenvectors corresponding to \lambda_1, \dots, \lambda_D respectively.
  • PCA corresponds to a case of diagonalizing A where A is a covariance matrix of certain data. When I want to clarify that A is a covariance matrix, I denote it as A=\Sigma.
  • Importantly covariance matrices \Sigma are positive semidefinite and real symmetric, which means you can always diagonalize \Sigma and any of their engenvalues cannot be lower than 0.

*In the last article, I denoted the covariance of data as S, based on Pattern Recognition and Machine Learning by C. M. Bishop.

*Sooner or later you are going to see that I am explaining basically the same ideas from different points of view, using the topic of PCA. However I believe they are all important when you learn linear algebra for data science of machine learning. Even you have not learnt linear algebra or if you have to teach linear algebra, I recommend you to first take a review on the idea of diagonalization, like the second article. And you should be conscious that, in the context of machine learning or data science, only a very limited type of matrices are important, which I have been explaining throughout this article.

2 Rotation or projection?

In this section I am going to talk about basic stuff found in most textbooks on linear algebra. In the last article, I mentioned that if A is a real symmetric matrix, you can diagonalize A with a rotation matrix U = (\boldsymbol{u}_1 \: \cdots \: \boldsymbol{u}_D), such that U^{-1}AU = U^{T}AU =\Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}). I also explained that PCA is a case where A=\Sigma, that is, A is the covariance matrix of certain data. \Sigma is known to be positive semidefinite and real symmetric. Thus you can always diagonalize \Sigma and any of their engenvalues cannot be lower than 0.

I think we first need to clarify the difference of rotation and projection. In order to visualize the ideas, let’s consider a case of D=3. Assume that you have got an orthonormal rotation matrix U = (\boldsymbol{u}_1 \: \boldsymbol{u}_2 \: \boldsymbol{u}_3) which diagonalizes A. In the last article I said diagonalization is equivalent to finding new orthogonal axes formed by eigenvectors, and in the case of this section you got new orthonoramal basis (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) which are in red in the figure below. Projecting a point \boldsymbol{x} = (x, y, z) on the new orthonormal basis is simple: you just have to multiply \boldsymbol{x} with U^T. Let U^T \boldsymbol{x} be (x', y', z')^T, and then \left( \begin{array}{c} x' \\ y' \\ z' \end{array} \right) = U^T\boldsymbol{x} = \left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right). You can see x', y', z' are \boldsymbol{x} projected on \boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3 respectively, and the left side of the figure below shows the idea. When you replace the orginal orthonormal basis (\boldsymbol{e}_1, \boldsymbol{e}_2, \boldsymbol{e}_3) with (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) as in the right side of the figure below, you can comprehend the projection as a rotation from (x, y, z) to (x', y', z') by a rotation matrix U^T.

Next, let’s see what rotation is. In case of rotation, you should imagine that you rotate the point \boldsymbol{x} in the same coordinate system, rather than projecting to other coordinate system. You can rotate \boldsymbol{x} by multiplying it with U. This rotation looks like the figure below.

In the initial position, the edges of the cube are aligned with the three orthogonal black axes (\boldsymbol{e}_1,  \boldsymbol{e}_2 , \boldsymbol{e}_3), with one corner of the cube located at the origin point of those axes. The purple dot denotes the corner of the cube directly opposite the origin corner. The cube is rotated in three dimensions, with the origin corner staying fixed in place. After the rotation with a pivot at the origin, the edges of the cube are now aligned with a new set of orthogonal axes (\boldsymbol{u}_1,  \boldsymbol{u}_2 , \boldsymbol{u}_3), shown in red. You might understand that more clearly with an equation: U\boldsymbol{x} = (\boldsymbol{u}_1 \: \boldsymbol{u}_2 \: \boldsymbol{u}_3) \left( \begin{array}{c} x \\ y \\ z \end{array} \right) = x\boldsymbol{u}_1 + y\boldsymbol{u}_2 + z\boldsymbol{u}_3. In short this rotation means you keep relative position of \boldsymbol{x}, I mean its coordinates (x, y, z), in the new orthonormal basis. In this article, let me call this a “cube rotation.”

The discussion above can be generalized to spaces with dimensions higher than 3. When U \in \mathbb{R}^{D \times D} is an orthonormal matrix and a vector \boldsymbol{x} \in \mathbb{R}^D, you can project \boldsymbol{x} to \boldsymbol{x}' = U^T \boldsymbol{x}or rotate it to \boldsymbol{x}'' = U \boldsymbol{x}, where \boldsymbol{x}' = (x_{1}', \dots, x_{D}')^T and \boldsymbol{x}'' = (x_{1}'', \dots, x_{D}'')^T. In other words \boldsymbol{x} = U \boldsymbol{x}', which means you can rotate back \boldsymbol{x}' to the original point \boldsymbol{x} with the rotation matrix U.

I think you at least saw that rotation and projection are basically the same, and that is only a matter of how you look at the coordinate systems. But I would say the idea of projection is more important through out this article.

Let’s consider a function f(\boldsymbol{x}; A) = \boldsymbol{x}^T A \boldsymbol{x} = (\boldsymbol{x}, A \boldsymbol{x}), where A\in \mathbb{R}^{D\times D} is a real symmetric matrix. The distribution of f(\boldsymbol{x}; A) is quadratic curves whose center point covers the origin, and it is known that you can express this distribution in a much simpler way using eigenvectors. When you project this function on eigenvectors of A, that is when you substitute U \boldsymbol{x}' for \boldsymbol{x}, you get f = (\boldsymbol{x}, A \boldsymbol{x}) =(U \boldsymbol{x}', AU \boldsymbol{x}') = (\boldsymbol{x}')^T U^TAU \boldsymbol{x}' = (\boldsymbol{x}')^T \Lambda \boldsymbol{x}' = \lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2. You can always diagonalize real symmetric matrices, so the formula implies that the shapes of quadratic curves largely depend on eigenvectors. We are going to see this in detail in the next section.

*(\boldsymbol{x}, \boldsymbol{y}) denotes an inner product of \boldsymbol{x} and \boldsymbol{y}.

*We are going to see details of the shapes of quadratic “curves” or “functions” in the next section.

To be exact, you cannot naively multiply U or U^T for rotation. Let’s take a part of data I showed in the last article as an example. In the figure below, I projected data on the basis (\boldsymbol{u}_1,  \boldsymbol{u}_2 , \boldsymbol{u}_3).

You might have noticed that you cannot do a “cube rotation” in this case. If you make the coordinate system (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) with your left hand, like you might have done in science classes in school to learn Fleming’s rule, you would soon realize that the coordinate systems in the figure above do not match. You need to flip the direction of one axis to match them.

Mathematically, you have to consider the determinant of the rotation matrix U. You can do a “cube rotation” when det(U)=1, and in the case above det(U) was -1, and you needed to flip one axis to make the determinant 1. In the example in the figure below, you can match the basis. This also can be generalized to higher dimensions, but that is also beyond the scope of this article series. If you are really interested, you should prepare some coffee and snacks and textbooks on linear algebra, and some weekends.

When you want to make general ellipsoids in a 3d space on Matplotlib, you can take advantage of rotation matrices. You first make a simple ellipsoid symmetric about xyz axis using polar coordinates, and you can rotate the whole ellipsoid with rotation matrices. I made some simple modules for drawing ellipsoid. If you put in a rotation matrix which diagonalize the covariance matrix of data and a list of three radiuses \sqrt{\lambda_1}, \sqrt{\lambda_2}, \sqrt{\lambda_3}, you can rotate the original ellipsoid so that it fits the data well.

3 Types of quadratic curves.

*This article might look like a mathematical writing, but I would say this is more about computer science. Please tolerate some inaccuracy in terms of mathematics. I gave priority to visualizing necessary mathematical ideas in my article series. If you are not sure about details, please let me know.

In linear dimension reduction, or at least in this article series you mainly have to consider ellipsoids. However ellipsoids are just one type of quadratic curves. In the last article, I mentioned that when the center of a D dimensional ellipsoid is the origin point of a normal coordinate system, the formula of the surface of the ellipsoid is as follows: (\boldsymbol{x}, A\boldsymbol{x})=1, where A satisfies certain conditions. To be concrete, when (\boldsymbol{x}, A\boldsymbol{x})=1 is the surface of a ellipsoid, A has to be diagonalizable and positive definite.

*Real symmetric matrices are diagonalizable, and positive definite matrices have only positive eigenvalues. Covariance matrices \Sigma, whose displacement vectors I visualized in the last two articles, are known to be symmetric real matrices and positive semi-defintie. However, the surface of an ellipsoid which fit the data is \boldsymbol{x}^T \Sigma ^{-1} \boldsymbol{x} = const., not \boldsymbol{x}^T \Sigma \boldsymbol{x} = const..

*You have to keep it in mind that \boldsymbol{x} are all deviations.

*You do not have to think too much about what the “semi” of the term “positive semi-definite” means fow now.

As you could imagine, this is just one simple case of richer variety of graphs. Let’s consider a 3-dimensional space. Any quadratic curves in this space can be denoted as ax^2 + by^2 + cz^2 + dxy + eyz + fxz + px + qy + rz + s = 0, where at least one of a, b, c, d, e, f, p, q, r, s is not 0.  Let \boldsymbol{x} be (x, y, z)^T, then the quadratic curves can be simply denoted with a 3\times 3 matrix A and a 3-dimensional vector \boldsymbol{b} as follows: \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0, where A = \left( \begin{array}{ccc} a & \frac{d}{2} & \frac{f}{2} \\ \frac{d}{2} & b & \frac{e}{2} \\ \frac{f}{2} & \frac{e}{2} & c \end{array} \right), \boldsymbol{b} = \left( \begin{array}{c} \frac{p}{2} \\ \frac{q}{2} \\ \frac{r}{2} \end{array} \right). General quadratic curves are roughly classified into the 9 types below.

You can shift these quadratic curves so that their center points come to the origin, without rotation, and the resulting curves are as follows. The curves can be all denoted as \boldsymbol{x}^T A\boldsymbol{x}.

As you can see, A is a real symmetric matrix. As I have mentioned repeatedly, when all the elements of a D \times D symmetric matrix A are real values and its eigen values are \lambda_{i} (i=1, \dots , D), there exist orthogonal/orthonormal matrices U such that U^{-1}AU = \Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}). Hence, you can diagonalize the A = \left( \begin{array}{ccc} a & \frac{d}{2} & \frac{f}{2} \\ \frac{d}{2} & b & \frac{e}{2} \\ \frac{f}{2} & \frac{e}{2} & c \end{array} \right) with an orthogonal matrix U. Let U be an orthogonal matrix such that U^T A U = \left( \begin{array}{ccc} \alpha  & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{array} \right) =\left( \begin{array}{ccc} \lambda_1  & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{array} \right). After you apply rotation by U to the curves (a)” ~ (i)”, those curves are symmetrically placed about the xyz axes, and their center points still cross the origin. The resulting curves look like below. Or rather I should say you projected (a)’ ~ (i)’ on their eigenvectors.

In this article mainly (a)” , (g)”, (h)”, and (i)” are important. General equations for the curves is as follows

  • (a)”: \frac{x^2}{l^2} + \frac{y^2}{m^2} + \frac{z^2}{n^2} = 1
  • (g)”: z = \frac{x^2}{l^2} + \frac{y^2}{m^2}
  • (h)”: z = \frac{x^2}{l^2} - \frac{y^2}{m^2}
  • (i)”: z = \frac{x^2}{l^2}

, where l, m, n \in \mathbb{R}^+.

Even if this section has been puzzling to you, you just have to keep one point in your mind: we have been discussing general quadratic curves, but in PCA, you only need to consider a case where A is a covariance matrix, that is A=\Sigma. PCA corresponds to the case where you shift and rotate the curve (a) into (a)”. Subtracting the mean of data from each point of data corresponds to shifting quadratic curve (a) to (a)’. Calculating eigenvectors of A corresponds to calculating a rotation matrix U such that the curve (a)’ comes to (a)” after applying the rotation, or projecting curves on eigenvectors of \Sigma. Importantly we are only discussing the covariance of certain data, not the distribution of the data itself.

*Just in case you are interested in a little more mathematical sides: it is known that if you rotate all the points \boldsymbol{x} on the curve \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0 with the rotation matrix P, those points \boldsymbol{x} are mapped into a new quadratic curve \alpha x^2 + \beta y^2 + \gamma z^2 + \lambda x + \mu y + \nu z + \rho = 0. That means the rotation of the original quadratic curve with P (or rather rotating axes) enables getting rid of the terms xy, yz, zx. Also it is known that when \alpha ' \neq 0, with proper translations and rotations, the quadratic curve \alpha x^2 + \beta y^2 + \gamma z^2 + \lambda x + \mu y + \nu z + \rho = 0 can be mapped into one of the types of quadratic curves in the figure below, depending on coefficients of the original quadratic curve. And the discussion so far can be generalized to higher dimensional spaces, but that is beyond the scope of this article series. Please consult decent textbooks on linear algebra around you for further details.

4 Eigenvectors are gradients and sometimes variances.

In the second section I explained that you can express quadratic functions f(\boldsymbol{x}; A) = \boldsymbol{x}^T A \boldsymbol{x} in a very simple way by projecting \boldsymbol{x} on eigenvectors of A.

You can comprehend what I have explained in another way: eigenvectors, to be exact eigenvectors of real symmetric matrices A, are gradients. And in case of PCA, I mean when A=\Sigma eigenvalues are also variances. Before explaining what that means, let me explain a little of the totally common facts on mathematics. If you have variables \boldsymbol{x}\in \mathbb{R}^D, I think you can comprehend functions f(\boldysmbol{x}) in two ways. One is a normal “functions” f(\boldsymbol{x}), and the others are “curves” f(\boldsymbol{x}) = const.. “Functions” get an input \boldsymbol{x} and gives out an output f(\boldsymbol{x}), just as well as normal functions you would imagine. “Curves” are rather sets of \boldsymbol{x} \in \mathbb{R}^D such that f(\boldsymbol{x}) = const..

*Please assume that the terms “functions” and “curves” are my original words. I use them just in case I fail to use functions and curves properly.

The quadratic curves in the figure above are all “curves” in my term, which can be denoted as f(\boldsymbol{x}; A_3, \boldsymbol{b}_3)=const or f(\boldsymbol{x}; A_3)=const. However if you replace z of (g)”, (h)”, and (i)” with f, you can interpret the “curves” as “functions” which are denoted as f(\boldsymbol{x}; A_2). This might sounds too obvious to you, and my point is you can visualize how values of “functions” change only when the inputs are 2 dimensional.

When a symmetric 2\times 2 real matrices A_2 have two eigenvalues \lambda_1, \lambda_2, the distribution of quadratic curves can be roughly classified to the following three types.

  • (g): Both \lambda_1 and \lambda_2 are positive or negative.
  • (h): Either of \lambda_1 or \lambda_2 is positive and the other is negative.
  • (i): Either of \lambda_1 or \lambda_2 is 0 and the other is not.

The equations of (g)” , (h)”, and (i)” correspond to each type of f=(\boldsymbol{x}; A_2), and thier curves look like the three graphs below.

And in fact, when start from the origin and go in the direction of an eigenvector \boldsymbol{u}_i, \lambda_i is the gradient of the direction. You can see that more clearly when you restrict the distribution of f=(\boldsymbol{x}; A_2) to a unit circle. Like in the figure below, in case \lambda_1 = 7, \lambda_2 = 3, which is classified to (g), the distribution looks like the left side, and if you restrict the distribution in the unit circle, the distribution looks like a bowl like the middle and the right side. When you move in the direction of \boldsymbol{u}_1, you can climb the bowl as as high as \lambda_1, in \boldsymbol{u}_2 as high as \lambda_2.

Also in case of (h), the same facts hold. But in this case, you can also descend the curve.

*You might have seen the curve above in the context of optimization with stochastic gradient descent. The origin of the curve above is a notorious saddle point, where gradients are all 0 in any directions but not a local maximum or minimum. Points can be stuck in this point during optimization.

Especially in case of PCA, A is a covariance matrix, thus A=\Sigma. Eigenvalues of \Sigma are all equal to or greater than 0. And it is known that in this case \lambda_i is the variance of data projected on its corresponding eigenvector \boldsymbol{u}_i (i=0, \dots , D). Hence, if you project f(\boldsymbol{x}; \Sigma), quadratic curves formed by a covariance matrix \Sigma, on eigenvectors of \Sigma, you get f(\boldsymbol{x}; \Sigma) = ({x'}_1 \: \dots \: {x'}_D) (\lambda_1 {x'}_1 \: \dots \: \lambda_D {x'}_D)^t =\lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2.  This shows that you can re-weight ({x'}_1 \: \dots \: {x'}_D), the coordinates of data projected projected on eigenvectors of A, with \lambda_1, \dots, \lambda_D, which are variances ({x'}_1 \: \dots \: {x'}_D). As I mentioned in an example of data of exam scores in the last article, the bigger a variance \lambda_i is, the more the feature described by \boldsymbol{u}_i vary from sample to sample. In other words, you can ignore eigenvectors corresponding to small eigenvalues.

That is a great hint why principal components corresponding to large eigenvectors contain much information of the data distribution. And you can also interpret PCA as a “climbing” a bowl of f(\boldsymbol{x}; A_D), as I have visualized in the case of (g) type curve in the figure above.

*But as I have repeatedly mentioned, ellipsoid which fit data well isf(\boldsymbol{x}; \Sigma ^{-1}) =(\boldsymbol{x}')^T diag(\frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D})\boldsymbol{x}' = \frac{({x'}_{1})^2}{\lambda_1} + \cdots + \frac{({x'}_{D})^2}{\lambda_D} = const..

*You have to be careful that even if you slice a type (h) curve f(\boldsymbol{x}; A_D) with a place z=const. the resulting cross section does not fit the original data well because the equation of the cross section is \lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2 = const. The figure below is an example of slicing the same f(\boldsymbol{x}; A_2) as the one above with z=1, and the resulting cross section.

As we have seen, \lambda_i, the eigenvalues of the covariance matrix of data are variances or data when projected on it eigenvectors. At the same time, when you fit an ellipsoid on the data, \sqrt{\lambda_i} is the radius of the ellipsoid corresponding to \boldsymbol{u}_i. Thus ignoring data projected on eigenvectors corresponding to small eigenvalues is equivalent to cutting of the axes of the ellipsoid with small radiusses.

I have explained PCA in three different ways over three articles.

  • The second article: I focused on what kind of linear transformations convariance matrices \Sigma enable, by visualizing displacement vectors. And those vectors look like swirling and extending into directions of eigenvectors of \Sigma.
  • The third article: We directly found directions where certain data distribution “swell” the most, to find that data swell the most in directions of eigenvectors.
  • In this article, we have seen PCA corresponds to only one case of quadratic functions, where the matrix A is a covariance matrix. When you go in the directions of eigenvectors corresponding to big eigenvalues, the quadratic function increases the most. Also that means data samples have bigger variances when projected on the eigenvectors. Thus you can cut off eigenvectors corresponding to small eigenvectors because they retain little information about data, and that is equivalent to fitting an ellipsoid on data and cutting off axes with small radiuses.

*Let A be a covariance matrix, and you can diagonalize it with an orthogonal matrix U as follow: U^{T}AU = \Lambda, where \Lambda = diag(\lambda_1, \dots, \lambda_D). Thus A = U \Lambda U^{T}. U is a rotation, and multiplying a \boldsymbol{x} with \Lambda means you multiply each eigenvalue to each element of \boldsymbol{x}. At the end U^T enables the reverse rotation.

If you get data like the left side of the figure below, most explanation on PCA would just fit an oval on this data distribution. However after reading this articles series so far, you would have learned to see PCA from different viewpoints like at the right side of the figure below.

 

5 Ellipsoids in Gaussian distributions.

I have explained that if the covariance of a data distribution is \boldsymbol{\Sigma}, the ellipsoid which fits the distribution the best is \bigl((\boldsymbol{x} - \boldsymbol{\mu}), \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu})\bigr) = 1. You might have seen the part \bigl((\boldsymbol{x} - \boldsymbol{\mu}), \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu})\bigr) = (\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) somewhere else. It is the exponent of general Gaussian distributions: \mathcal{N}(\boldsymbol{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\{ -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) \}.  It is known that the eigenvalues of \Sigma ^{-1} are \frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D}, and eigenvectors corresponding to each eigenvalue are also \boldsymbol{u}_1, \dots, \boldsymbol{u}_D respectively. Hence just as well as what we have seen, if you project (\boldsymbol{x} - \boldsymbol{\mu}) on each eigenvector of \Sigma ^{-1}, we can convert the exponent of the Gaussian distribution.

Let -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) be \boldsymbol{y} and U ^{-1} \boldsymbol{y}= U^{T} \boldsymbol{y} be \boldsymbol{y}', where U=(\boldsymbol{u}_1 \: \dots \: \boldsymbol{u}_D). Just as we have seen, (\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) =\boldsymbol{y}^T\Sigma^{-1} \boldsymbol{y} =(U\boldsymbol{y}')^T \Sigma^{-1} U\boldsymbol{y}' =((\boldsymbol{y}')^T U^T \Sigma^{-1} U\boldsymbol{y}' = (\boldsymbol{y}')^T diag(\frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D}) \boldsymbol{y}' = \frac{({y'}_{1})^2}{\lambda_1} + \cdots + \frac{({y'}_{D})^2}{\lambda_D}. Hence \mathcal{N}(\boldsymbol{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\{ -\frac{1}{2}(\boldsymbol{y}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{y}) \} =  \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\{ -\frac{1}{2}(\frac{({y'}_{1})^2}{\lambda_1} + \cdots + \frac{({y'}_{D})^2}{\lambda_D} ) \} =\frac{1}{(2\pi)^{1/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\biggl( -\frac{1}{2} \frac{({y'}_{1})^2}{\lambda_1} \biggl) \cdots \frac{1}{(2\pi)^{1/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\biggl( -\frac{1}{2}\frac{({y'}_{D})^2}{\lambda_D} \biggl).

*To be mathematically exact about changing variants of normal distributions, you have to consider for example Jacobian matrices.

This results above demonstrate that, by projecting data on the eigenvectors of its covariance matrix, you can factorize the original multi-dimensional Gaussian distribution into a product of Gaussian distributions which are irrelevant to each other. However, at the same time, that is the potential limit of approximating data with PCA. This idea is going to be more important when you think about more probabilistic ways to handle PCA, which is more robust to lack of data.

I have explained PCA over 3 articles from various viewpoints. If you have been patient enough to read my article series, I think you have gained some deeper insight into not only PCA, but also linear algebra, and that should be helpful when you learn or teach data science. I hope my codes also help you. In fact these are not the only topics about PCA. There are a lot of important PCA-like algorithms.

In fact our expedition of ellipsoids, or PCA still continues, just as Star Wars series still continues. Especially if I have to explain an algorithm named probabilistic PCA, I need to explain the “Bayesian world” of machine learning. Most machine learning algorithms covered by major introductory textbooks tend to be too deterministic and dependent on the size of data. Many of those algorithms have another “parallel world,” where you can handle inaccuracy in better ways. I hope I can also write about them, and I might prepare another trilogy for such PCA. But I will not disappoint you, like “The Phantom Menace.”

Appendix: making a model of a bunch of grape with ellipsoid berries.

If you can control quadratic curves, reshaping and rotating them, you can make a model of a grape of olive bunch on Matplotlib. I made a program of making a model of a bunch of berries on Matplotlib using the module to draw ellipsoids which I introduced earlier. You can check the codes in this page.

*I have no idea how many people on this earth are in need of making such models.

I made some modules so that you can see the grape bunch from several angles. This might look very simple to you, but the locations of berries are organized carefully so that it looks like they are placed around a stem and that the berries are not too close to each other.

 

The programming code I created for this article is completly available here.

[Refereces]

[1]C. M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 78-83, 559-577

[2]「理工系新課程 線形代数 基礎から応用まで」, 培風館、(2017)

[3]「これなら分かる 最適化数学 基礎原理から計算手法まで」, 金谷健一著、共立出版, (2019), pp. 17-49

[4]「これなら分かる 応用数学教室 最小二乗法からウェーブレットまで」, 金谷健一著、共立出版, (2019), pp.165-208

[5] 「サボテンパイソン 」
https://sabopy.com/

 

How to make a toy English-German translator with multi-head attention heat maps: the overall architecture of Transformer

If you have been patient enough to read the former articles of this article series Instructions on Transformer for people outside NLP field, but with examples of NLP, you should have already learned a great deal of Transformer model, and I hope you gained a solid foundation of learning theoretical sides on this algorithm.

This article is going to focus more on practical implementation of a transformer model. We use codes in the Tensorflow official tutorial. They are maintained well by Google, and I think it is the best practice to use widely known codes.

The figure below shows what I have explained in the articles so far. Depending on your level of understanding, you can go back to my former articles. If you are familiar with NLP with deep learning, you can start with the third article.

1 The datasets

I think this article series appears to be on NLP, and I do believe that learning Transformer through NLP examples is very effective. But I cannot delve into effective techniques of processing corpus in each language. Thus we are going to use a library named BPEmb. This library enables you to encode any sentences in various languages into lists of integers. And conversely you can decode lists of integers to the language. Thanks to this library, we do not have to do simplification of alphabets, such as getting rid of Umlaut.

*Actually, I am studying in computer vision field, so my codes would look elementary to those in NLP fields.

The official Tensorflow tutorial makes a Portuguese-English translator, but in article we are going to make an English-German translator. Basically, only the codes below are my original. As I said, this is not an article on NLP, so all you have to know is that at every iteration you get a batch of (64, 41) sized tensor as the source sentences, and a batch of (64, 42) tensor as corresponding target sentences. 41, 42 are respectively the maximum lengths of the input or target sentences, and when input sentences are shorter than them, the rest positions are zero padded, as you can see in the codes below.

*If you just replace datasets and modules for encoding, you can make translators of other pairs of languages.

We are going to train a seq2seq-like Transformer model of converting those list of integers, thus a mapping from a vector to another vector. But each word, or integer is encoded as an embedding vector, so virtually the Transformer model is going to learn a mapping from sequence data to another sequence data. Let’s formulate this into a bit more mathematics-like way: when we get a pair of sequence data \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau _x)}) and \boldsymbol{Y} = (\boldsymbol{y}^{(1)}, \dots, \boldsymbol{y}^{(\tau _y)}), where \boldsymbol{x}^{(t)} \in \mathbb{R}^{|\mathcal{V}_{\mathcal{X}}|}, \boldsymbol{x}^{(t)} \in \mathbb{R}^{|\mathcal{V}_{\mathcal{Y}}|}, respectively from English and German corpus, then we learn a mapping f: \boldsymbol{X} \to \boldsymbol{Y}.

*In this implementation the vocabulary sizes are both 10002. Thus |\mathcal{V}_{\mathcal{X}}|=|\mathcal{V}_{\mathcal{Y}}|=10002

2 The whole architecture

This article series has covered most of components of Transformer model, but you might not understand how seq2seq-like models can be constructed with them. It is very effective to understand how transformer is constructed by actually reading or writing codes, and in this article we are finally going to construct the whole architecture of a Transforme translator, following the Tensorflow official tutorial. At the end of this article, you would be able to make a toy English-German translator.

The implementation is mainly composed of 4 classes, EncoderLayer(), Encoder(), DecoderLayer(), and Decoder() class. The inclusion relations of the classes are displayed in the figure below.

To be more exact in a seq2seq-like model with Transformer, the encoder and the decoder are connected like in the figure below. The encoder part keeps converting input sentences in the original language through N layers. The decoder part also keeps converting the inputs in the target languages, also through N layers, but it receives the output of the final layer of the Encoder at every layer.

You can see how the Encoder() class and the Decoder() class are combined in Transformer in the codes below. If you have used Tensorflow or Pytorch to some extent, the codes below should not be that hard to read.

3 The encoder

*From now on “sentences” do not mean only the input tokens in natural language, but also the reweighted and concatenated “values,” which I repeatedly explained in explained in the former articles. By the end of this section, you will see that Transformer repeatedly converts sentences layer by layer, remaining the shape of the original sentence.

I have explained multi-head attention mechanism in the third article, precisely, and I explained positional encoding and masked multi-head attention in the last article. Thus if you have read them and have ever written some codes in Tensorflow or Pytorch, I think the codes of Transformer in the official Tensorflow tutorial is not so hard to read. What is more, you do not use CNNs or RNNs in this implementation. Basically all you need is linear transformations. First of all let’s see how the EncoderLayer() and the Encoder() classes are implemented in the codes below.

You might be confused what “Feed Forward” means in  this article or the original paper on Transformer. The original paper says this layer is calculated as FFN(x) = max(0, xW_1 + b_1)W_2 +b_2. In short you stack two fully connected layers and activate it with a ReLU function. Let’s see how point_wise_feed_forward_network() function works in the implementation with some simple codes. As you can see from the number of parameters in each layer of the position wise feed forward neural network, the network does not depend on the length of the sentences.

From the number of parameters of the position-wise feed forward neural networks, you can see that you share the same parameters over all the positions of the sentences. That means in the figure above, you use the same densely connected layers at all the positions, in single layer. But you also have to keep it in mind that parameters for position-wise feed-forward networks change from layer to layer. That is also true of “Layer” parts in Transformer model, including the output part of the decoder: there are no learnable parameters which cover over different positions of tokens. These facts lead to one very important feature of Transformer: the number of parameters does not depend on the length of input or target sentences. You can offset the influences of the length of sentences with multi-head attention mechanisms. Also in the decoder part, you can keep the shape of sentences, or reweighted values, layer by layer, which is expected to enhance calculation efficiency of Transformer models.

4, The decoder

The structures of DecoderLayer() and the Decoder() classes are quite similar to those of EncoderLayer() and the Encoder() classes, so if you understand the last section, you would not find it hard to understand the codes below. What you have to care additionally in this section is inter-language multi-head attention mechanism. In the third article I was repeatedly explaining multi-head self attention mechanism, taking the input sentence “Anthony Hopkins admired Michael Bay as a great director.” as an example. However, as I explained in the second article, usually in attention mechanism, you compare sentences with the same meaning in two languages. Thus the decoder part of Transformer model has not only self-attention multi-head attention mechanism of the target sentence, but also an inter-language multi-head attention mechanism. That means, In case of translating from English to German, you compare the sentence “Anthony Hopkins hat Michael Bay als einen großartigen Regisseur bewundert.” with the sentence itself in masked multi-head attention mechanism (, just as I repeatedly explained in the third article). On the other hand, you compare “Anthony Hopkins hat Michael Bay als einen großartigen Regisseur bewundert.” with “Anthony Hopkins admired Michael Bay as a great director.” in the inter-language multi-head attention mechanism (, just as you can see in the figure above).

*The “inter-language multi-head attention mechanism” is my original way to call it.

I briefly mentioned how you calculate the inter-language multi-head attention mechanism in the end of the third article, with some simple codes, but let’s see that again, with more straightforward figures. If you understand my explanation on multi-head attention mechanism in the third article, the inter-language multi-head attention mechanism is nothing difficult to understand. In the multi-head attention mechanism in encoder layers, “queries”, “keys”, and “values” come from the same sentence in English, but in case of inter-language one, only “keys” and “values” come from the original sentence, and “queries” come from the target sentence. You compare “queries” in German with the “keys” in the original sentence in English, and you re-weight the sentence in English. You use the re-weighted English sentence in the decoder part, and you do not need look-ahead mask in this inter-language multi-head attention mechanism.

Just as well as multi-head self-attention, you can calculate inter-language multi-head attention mechanism as follows: softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}). In the example above, the resulting multi-head attention map is a 10 \times 9 matrix like in the figure below.

Once you keep the points above in you mind, the implementation of the decoder part should not be that hard.

5 Masking tokens in practice

I explained masked-multi-head attention mechanism in the last article, and the ideas itself is not so difficult. However in practice this is implemented in a little tricky way. You might have realized that the size of input matrices is fixed so that it fits the longest sentence. That means, when the maximum length of the input sentences is 41, even if the sentences in a batch have less than 41 tokens, you sample (64, 41) sized tensor as a batch every time (The 64 is a batch size). Let “Anthony Hopkins admired Michael Bay as a great director.”, which has 9 tokens in total, be an input. We have been considering calculating (9, 9) sized attention maps or (10, 9) sized attention maps, but in practice you use (41, 41) or (42, 41) sized ones. When it comes to calculating self attentions in the encoder part, you zero pad self attention maps with encoder padding masks, like in the figure below. The black dots denote the zero valued elements.

As you can see in the codes below, encode padding masks are quite simple. You just multiply the padding masks with -1e9 and add them to attention maps and apply a softmax function. Thereby you can zero-pad the columns in the positions/columns where you added -1e9 to.

I explained look ahead mask in the last article, and in practice you combine normal padding masks and look ahead masks like in the figure below. You can see that you can compare each token with only its previous tokens. For example you can compare “als” only with “Anthony”, “Hopkins”, “hat”, “Michael”, “Bay”, “als”, not with “einen”, “großartigen”, “Regisseur” or “bewundert.”

Decoder padding masks are almost the same as encoder one. You have to keep it in mind that you zero pad positions which surpassed the length of the source input sentence.

6 Decoding process

In the last section we have seen that we can zero-pad columns, but still the rows are redundant. However I guess that is not a big problem because you decode the final output in the direction of the rows of attention maps. Once you decode <end> token, you stop decoding. The redundant rows would not affect the decoding anymore.

This decoding process is similar to that of seq2seq models with RNNs, and that is why you need to hide future tokens in the self-multi-head attention mechanism in the decoder. You share the same densely connected layers followed by a softmax function, at all the time steps of decoding. Transformer has to learn how to decode only based on the words which have appeared so far.

According to the original paper, “We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.” After these explanations, I think you understand the part more clearly.

The codes blow is for the decoding part. You can see that you first start decoding an output sentence with a sentence composed of only <start>, and you decide which word to decoded, step by step.

*It easy to imagine that this decoding procedure is not the best. In reality you have to consider some possibilities of decoding, and you can do that with beam search decoding.

After training this English-German translator for 30 epochs you can translate relatively simple English sentences into German. I displayed some results below, with heat maps of multi-head attention. Each colored attention maps corresponds to each head of multi-head attention. The examples below are all from the fourth (last) layer, but you can visualize maps in any layers. When it comes to look ahead attention, naturally only the lower triangular part of the maps is activated.

This article series has not covered some important topics machine translation, for example how to calculate translation errors. Actually there are many other fascinating topics related to machine translation. For example beam search decoding, which consider some decoding possibilities, or other topics like how to handle proper nouns such as “Anthony” or “Hopkins.” But this article series is not on NLP. I hope you could effectively learn the architecture of Transformer model with examples of languages so far. And also I have not explained some details of training the network, but I will not cover that because I think that depends on tasks. The next article is going to be the last one of this series, and I hope you can see how Transformer is applied in computer vision fields, in a more “linguistic” manner.

But anyway we have finally made it. In this article series we have seen that one of the earliest computers was invented to break Enigma. And today we can quickly make a more or less accurate translator on our desk. With Transformer models, you can even translate deadly funny jokes into German.

*You can train a translator with this code.

*After training a translator, you can translate English sentences into German with this code.

[References]

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need” (2017)

[2] “Transformer model for language understanding,” Tensorflow Core
https://www.tensorflow.org/overview

[3] Jay Alammar, “The Illustrated Transformer,”
http://jalammar.github.io/illustrated-transformer/

[4] “Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention,” stanfordonline, (2019)
https://www.youtube.com/watch?v=5vcj8kSwBCY

[5]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 191-193

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Positional encoding, residual connections, padding masks: covering the rest of Transformer components

This is the fourth article of my article series named “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

1 Wrapping points up so far

This article series has already covered a great deal of the Transformer mechanism. Whether you have read my former articles or not, I bet you are more or less lost in the course of learning Transformer model. The left side of the figure below is from the original paper on Transformer model, and my previous articles explained the parts in each colored frame. In the first article, I  mainly explained how language is encoded in deep learning task and how that is evaluated.

This is more of a matter of inputs and the outputs of deep learning networks, which are in blue dotted frames in the figure. They are not so dependent on types of deep learning NLP tasks. In the second article, I explained seq2seq models, which are encoder-decoder models used in machine translation. Seq2seq models can can be simplified like the figure in the orange frame. In the article I mainly explained seq2seq models with RNNs, but the purpose of this article series is ultimately replace them with Transformer models. In the last article, I finally wrote about some actual components of Transformer models: multi-head attention mechanism. I think this mechanism is the core of Transformed models, and I did my best to explain it with a whole single article, with a lot of visualizations. However, there are still many elements I have not explained.

First, you need to do positional encoding to the word embedding so that Transformer models can learn the relations of the positions of input tokens. At least I was too stupid to understand what this is only with the original paper on Transformer. I am going to explain this algorithm in illustrative ways, which I needed to self-teach it. The second point is residual connections.

The last article has already explained multi-head attention, as precisely as I could do, but I still have to say I covered only two multi-head attention parts in a layer of Transformer model, which are in pink frames. During training, you have to mask some tokens at the decoder part so that some of tokens are invisible, and masked multi-head attention enables that.

You might be tired of the words “queries,” “keys,” and “values,” if you read the last article. But in fact that was not enough. When you think about applying Transformer in other tasks, such as object detection or image generation, you need to reconsider what the structure of data and how “queries,” “keys,” and “values,” correspond to each elements of the data, and probably one of my upcoming articles would cover this topic.

2 Why Transformer?

One powerful strength of Transformer model is its parallelization. As you saw in the last article, Trasformer models enable calculating relations of tokens to all other tokens, on different standards, independently in each head. And each head requires very simple linear transformations. In case of RNN encoders, if an input has \tau tokens, basically you have to wait for \tau time steps to finish encoding the input sentence. Also, at the time step (\tau) the RNN cell retains the information at the time step (1) only via recurrent connections. In this way you cannot attend to tokens in the earlier time steps, and this is obviously far from how we compare tokens in a sentence. You can bring information backward by bidirectional connection s in RNN models, but that all the more deteriorate parallelization of the model. And possessing information via recurrent connections, like a telephone game, potentially has risks of vanishing gradient problems. Gated RNN, such as LSTM or GRU mitigate the problems by a lot of nonlinear functions, but that adds to computational costs. If you understand multi-head attention mechanism, I think you can see that Transformer solves those problems.

I guess this is closer to when you speak a foreign language which you are fluent in. You wan to say something in a foreign language, and you put the original sentence in your mother tongue in the “encoder” in your brain. And you decode it, word by word, in the foreign language. You do not have to wait for the word at the end in your language, or rather you have to consider the relations of of a chunk of words to another chunk of words, in forward and backward ways. This is crucial especially when Japanese people speak English. You have to make the conclusion clear in English usually with the second word, but the conclusion is usually at the end of the sentence in Japanese.

3 Positional encoding

I explained disadvantages of RNN in the last section, but RNN has been a standard algorithm of neural machine translation. As I mentioned in the fourth section of the first article of my series on RNN, other neural nets like fully connected layers or convolutional neural networks cannot handle sequence data well. I would say RNN could be one of the only algorithms to handle sequence data, including natural language data, in more of classical methods of time series data processing.

*As I explained in this article, the original idea of RNN was first proposed in 1997, and I would say the way it factorizes time series data is very classical, and you would see similar procedures in many other algorithms. I think Transformer is a successful breakthrough which gave up the idea of processing sequence data time step by time step.

You might have noticed that multi-head attention mechanism does not explicitly uses the the information of the orders or position of input data, as it basically calculates only the products of matrices. In the case where the input is “Anthony Hopkins admired Michael Bay as a great director.”, multi head attention mechanism does not uses the information that “Hopkins” is the second token, or the information that the token two time steps later is “Michael.” Transformer tackles this problem with an almost magical algorithm named positional encoding.

In order to learn positional encoding, you should first think about what kind of encoding is ideal. According to this blog post, ideal encoding of positions of tokens have the following features.

  • Positional encoding of one token deterministically represents the position of the token.
  • The actual values of positional encoding should not be too big compared to the values of elements of embedding vectors.
  • Positional encodings of different tokens should successfully express their relative positions.

The most straightforward way to give the information of position is implementing the index of times steps (t), but if you naively give the term (t) to the data, the term could get too big compared to the values of data ,for example when the sequence data is 100 time steps long. The next straightforward idea is compressing the idea of time steps to for example the range [0, 1]. With this approach, however, the resolution of encodings can vary depending on the length of the input sequence data. Thus these naive approaches do not meet the requirements above, and I guess even conventional RNN-based models were not so successful in these points.

*I guess that is why attention mechanism of RNN seq2seq models, which I explained in the second article, was successful. You can constantly calculate the relative positions of decoder tokens compared to the encoder tokens.

Positional encoding, to me almost magically, meets the points I have mentioned. However the explanation of positional encoding in the original paper of Transformer is unkindly brief. It says you can encode positions of tokens with the following vector PE_{(pos, 2i)} = sin(pos / 10000^{2i/d_model}), PE_{(pos, 2i+1)} = cos(pos / 10000^{2i/d_model}), where i = 0, 1, \dots, d_{model}/2 - 1. d_{model} is the dimension of word embedding. The heat map below is the most typical type of visualization of positional encoding you would see everywhere, and in this case d_{model}=256, and pos is discrete number which varies from 0 to 49, thus the heat map blow is equal to a 50\times 256 matrix, whose elements are from -1 to 1. Each row of the graph corresponds to one token, and you can see that lower dimensional part is constantly changing like waves. Also it is quite easy to encode an input with this positional encoding: assume that you have a matrix of an input sentence composed of 50 tokens, each of which is a 256 dimensional vector, then all you have to do is just adding the heat map below to the matrix.

Concretely writing down, the encoding of the 256-dim token at pos  is (PE_{(pos, 0)}, PE_{(pos, 1)}, \dots ,  PE_{(pos, 254)}, PE_{(pos, 255)})^T = \bigl( sin(pos / 10000^{0/256}), cos(pos / 10000^{0/256}) \bigr),  \dots , \bigl( sin(pos / 10000^{254/256}), cos(pos / 10000^{254/256}) \bigr)^T.

You should see this encoding more as d_{model} / 2 pairs of circles rather than d_{model} dimensional vectors. When you fix the i, the index of the depth of each encoding, you can extract a 2 dimensional vector \boldsymbol{PE}_i = \bigl( sin(pos / 10000^{2i/d_model}), cos(pos / 10000^{2i/d_model}) \bigr). If you constantly change the value pos, the vector \boldsymbol{PE}_i rotates clockwise on the unit circle in the figure below.

Also, the deeper the dimension of the embedding is, I mean the bigger the index i is, the smaller the frequency of rotation is. I think the video below is a more intuitive way to see how each token is encoded with positional encoding. You can see that the bigger pos is, that is the more tokens an input has, the deeper part positional encoding starts to rotate on the circles.

 

Very importantly, the original paper of Transformer says, “We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.” For each circle at any depth, I mean for any i, the following simple equation holds:

\left( \begin{array}{c} sin(\frac{pos+k}{10000^{2i/d_{model}}}) \\ cos(\frac{pos+k}{10000^{2i/d_{model}}}) \end{array} \right) =
\left( \begin{array}{ccc} cos(\frac{k}{10000^{2i/d_{model}}}) & sin(\frac{k}{10000^{2i/d_{model}}}) \\ -sin(\frac{k}{10000^{2i/d_{model}}}) & cos(\frac{k}{10000^{2i/d_{model}}}) \\ \end{array} \right) \cdot \left( \begin{array}{c} sin(\frac{pos}{10000^{2i/d_{model}}}) \\ cos(\frac{pos}{10000^{2i/d_{model}}}) \end{array} \right)

The matrix is a simple rotation matrix, so if i is fixed the rotation only depends on k, how many positions to move forward or backward. Then we get a very important fact: as the pos changes (pos is a discrete number), each point rotates in proportion to the offset of “pos,” with different frequencies depending on the depth of the circles. The deeper the circle is, the smaller the frequency is. That means, this type of positional encoding encourages Transformer models to learn definite and relative positions of tokens with rotations of those circles, and the values of each element of the rotation matrices are from -1 to 1, so they do not get bigger no matter how many tokens inputs have.

For example when an input is “Anthony Hopkins admired Michael Bay as a great director.”, a shift from the token “Hopkins” to “Bay” is a rotation matrix  \left( \begin{array}{ccc} cos(\frac{k}{10000^{2i/d_{model}}}) & sin(\frac{k}{10000^{2i/d_{model}}}) \\ -sin(\frac{k}{10000^{2i/d_{model}}}) & cos(\frac{k}{10000^{2i/d_{model}}}) \\ \end{array} \right), where k=3. Also the shift from “Bay” to “great” has the same rotation.

*Positional encoding reminded me of Enigma, a notorious cipher machine used by Nazi Germany. It maps alphabets to different alphabets with different rotating gear connected by cables. With constantly changing gears and keys, it changed countless patterns of alphabetical mappings, every day, which is impossible for humans to solve. One of the first form of computers was invented to break Enigma.

*As far as I could understand from “Imitation Game (2014).”

*But I would say Enigma only relied on discrete deterministic algebraic mapping of alphabets. The rotations of positional encoding is not that tricky as Enigma, but it can encode both definite and deterministic positions of much more variety of tokens. Or rather I would say AI algorithms developed enough to learn such encodings with subtle numerical changes, and I am sure development of NLP increased the possibility of breaking the Turing test in the future.

5 Residual connections

If you naively stack neural networks with simple implementation, that would suffer from vanishing gradient problems during training. Back propagation is basically multiplying many gradients, so

One way to mitigate vanishing gradient problems is quite easy: you have only to make a bypass of propagation. You would find a lot of good explanations on residual connections, so I am not going to explain how this is effective for vanishing gradient problems in this article.

In Transformer models you add positional encodings to the input only in the first layer, but I assume that the encodings remain through the layers by these bypass routes, and that might be one of reasons why Transformer models can retain information of positions of tokens.

6 Masked multi-head attention

Even though Transformer, unlike RNN, can attend to the whole input sentence at once, the decoding process of Transformer-based translator is close to RNN-based one, and you are going to see that more clearly in the codes in the next article. As I explained in the second article, RNN decoders decode each token only based on the tokens the have generated so far. Transformer decoder also predicts the output sequences autoregressively one token at a time step, just as RNN decoders. I think it easy to understand this process because RNN decoder generates tokens just as you connect RNN cells one after another, like connecting rings to a chain. In this way it is easy to make sure that generating of one token in only affected by the former tokens. On the other hand, during training Transformer decoders, you input the whole sentence at once. That means Transformer decoders can see the whole sentence during training. That is as if a student preparing for a French translation test could look at the whole answer French sentences. It is easy to imagine that you cannot prepare for the French test effectively if you study this way. Transformer decoders also have to learn to decode only based on the tokens they have generated so far.

In order to properly train a Transformer-based translator to learn such decoding, you have to hide the upcoming tokens in target sentences during training. During calculating multi-head attentions in each Transformer layer, if you keep ignoring the weights from up coming tokens like in the figure below, it is likely that Transformer models learn to decode only based on the tokens generated so far. This is called masked multi-head attention.

*I am going to take an input “Anthonly Hopkins admire Michael Bay as a great director.” as an example of calculating masked multi-head attention mechanism, but this is supposed to be in the target laguage. So when you train an translator from English to German, in practice you have to calculate masked multi-head atetntion of “Anthony Hopkins hat Michael Bay als einen großartigen Regisseur bewundert.”

As you can see from the whole architecture of Transformer, you only need to consider masked multi-head attentions of of self-attentions of the input sentences at the decoder side. In order to concretely calculate masked multi-head attentions, you need a technique named look ahead masking. This is also quite simple. Just as well as the last article, let’s take an example of calculating self attentions of an input “Anthony Hopkins admired Michael Bay as a great director.” Also in this case you just calculate multi-head attention as usual, but when you get the histograms below, you apply look ahead masking to each histogram and delete the weights from the future tokens. In the figure below the black dots denote zero, and the sum of each row of the resulting attention map is also one. In other words, you get a lower triangular matrix, the sum of whose each row is 1.

Also just as I explained in the last article, you reweight vlaues with the triangular attention map. The figure below is calculating a transposed masked multi-head attention because I think it is a more straightforward way to see how vectors are reweighted in multi-head attention mechanism.

When you closely look at how each column of the transposed multi-head attention is reweighted, you can clearly see that the token is reweighted only based on the tokens generated so far.

*If you are still not sure why you need such masking in multi-head attention of target sentences, you should proceed to the next article for now. Once you check the decoding processes of Transformer-based translators, you would see why you need masked multi-head attention mechanism on the target sentence during training.

If you have read my articles, at least this one and the last one, I think you have gained more or less clear insights into how each component of Transfomer model works. You might have realized that each components require simple calculations. Combined with the fact that multi-head attention mechanism is highly parallelizable, Transformer is easier to train, compared to RNN.

In this article, we are going to see how masking of multi-head attention is implemented and how the whole Transformer structure is constructed. By the end of the next article, you would be able to create a toy English-German translator with more or less clear understanding on its architecture.

Appendix

You can visualize positional encoding the way I explained with simple Python codes below. Please just copy and paste them, importing necessary libraries. You can visualize positional encoding as both heat maps and points rotating on rings, and in this case the dimension of word embedding is 256, and the maximum length of sentences is 50.

*In fact some implementations use different type of positional encoding, as you can see in the codes below. In this case, embedding vectors are roughly divided into two parts, and each part is encoded with different sine waves. I have been using a metaphor of rotating rings or gears in this article to explain positional encoding, but to be honest that is not necessarily true of all the types of Transformer implementation. Some papers compare different types of pairs of positional encoding. The most important point is, Transformer models is navigated to learn positions of tokens with certain types of mathematical patterns.

[References]

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need” (2017)

[2] “Transformer model for language understanding,” Tensorflow Core
https://www.tensorflow.org/overview

[3] Jay Alammar, “The Illustrated Transformer,”
http://jalammar.github.io/illustrated-transformer/

[4] “Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention,” stanfordonline, (2019)
https://www.youtube.com/watch?v=5vcj8kSwBCY

[5]Harada Tatsuya, “Machine Learning Professional Series: Image Recognition,” (2017), pp. 191-193
原田達也 著, 「機械学習プロフェッショナルシリーズ 画像認識」, (2017), pp. 191-193

[6] Amirhossein Kazemnejad, “Transformer Architecture: The Positional Encoding
Let’s use sinusoidal functions to inject the order of words in our model”, Amirhossein Kazemnejad’s Blog, (2019)
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko, “End-to-End Object Detection with Transformers,” (2020)

[8]中西 啓、「【第5回】機械式暗号機の傑作~エニグマ登場~」、HH News & Reports, (2011)
https://www.hummingheads.co.jp/reports/series/ser01/110714.html

[9]中西 啓、「【第6回】エニグマ解読~第2次世界大戦とコンピュータの誕生~」、HH News & Reports, (2011)

[10]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 191-193

[11]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)
https://www.youtube.com/watch?v=XXtpJxZBa2c

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

 

Multi-head attention mechanism: “queries”, “keys”, and “values,” over and over again

*A comment added on 04/05/2022: Thanks to a comment by Mr. Maier, I found a major mistake in my visualization. To be concrete, there is a mistake in expressing how to get each colored divided group of tokens by applying linear transformations. That corresponds to the section 3.2.2 in the paper “Attention Is All You Need.” There would be no big differences in the main point of this article, the relations of keys, queries, and values, but please bear that in your mind if you need Transformer at a practical work. Besides checking the implementation by Tensorflow, I will soon prepare a modified version of visualization. For further details, please see comments at the bottom of this article.

This is the third article of my article series named “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

In the last article, I explained how attention mechanism works in simple seq2seq models with RNNs, and it basically calculates correspondences of the hidden state at every time step, with all the outputs of the encoder. However I would say the attention mechanisms of RNN seq2seq models use only one standard for comparing them. Using only one standard is not enough for understanding languages, especially when you learn a foreign language. You would sometimes find it difficult to explain how to translate a word in your language to another language. Even if a pair of languages are very similar to each other, translating them cannot be simple switching of vocabulary. Usually a single token in one language is related to several tokens in the other language, and vice versa. How they correspond to each other depends on several criteria, for example “what”, “who”, “when”, “where”, “why”, and “how”. It is easy to imagine that you should compare tokens with several criteria.

Transformer model was first introduced in the original paper named “Attention Is All You Need,” and from the title you can easily see that attention mechanism plays important roles in this model. When you learn about Transformer model, you will see the figure below, which is used in the original paper on Transformer.  This is the simplified overall structure of one layer of Transformer model, and you stack this layer N times. In one layer of Transformer, there are three multi-head attention, which are displayed as boxes in orange. These are the very parts which compare the tokens on several standards. I made the head article of this article series inspired by this multi-head attention mechanism.

The figure below is also from the original paper on Transfromer. If you can understand how multi-head attention mechanism works with the explanations in the paper, and if you have no troubles understanding the codes in the official Tensorflow tutorial, I have to say this article is not for you. However I bet that is not true of majority of people, and at least I need one article to clearly explain how multi-head attention works. Please keep it in mind that this article covers only the architectures of the two figures below. However multi-head attention mechanisms are crucial components of Transformer model, and throughout this article, you would not only see how they work but also get a little control over it at an implementation level.

1 Multi-head attention mechanism

When you learn Transformer model, I recommend you first to pay attention to multi-head attention. And when you learn multi-head attentions, before seeing what scaled dot-product attention is, you should understand the whole structure of multi-head attention, which is at the right side of the figure above. In order to calculate attentions with a “query”, as I said in the last article, “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” Sooner or later, you will notice I would be just repeating these phrases over and over again throughout this article, in several ways.

*Even if you are not sure what “reweighting” means in this context, please keep reading. I think you would little by little see what it means especially in the next section.

The overall process of calculating multi-head attention, displayed in the figure above, is as follows (Please just keep reading. Please do not think too much.): first you split the V: “values”, K: “keys”, and Q: “queries”, and second you transform those divided “values”, “keys”, and “queries” with densely connected layers (“Linear” in the figure). Next you calculate attention weights and reweight the “values” and take the summation of the reiweighted “values”, and you concatenate the resulting summations. At the end you pass the concatenated “values” through another densely connected layers. The mechanism of scaled dot-product attention is just a matter of how to concretely calculate those attentions and reweight the “values”.

*In the last article I briefly mentioned that “keys” and “queries” can be in the same language. They can even be the same sentence in the same language, and in this case the resulting attentions are called self-attentions, which we are mainly going to see. I think most people calculate “self-attentions” unconsciously when they speak. You constantly care about what “she”, “it” , “the”, or “that” refers to in you own sentence, and we can say self-attention is how these everyday processes is implemented.

Let’s see the whole process of calculating multi-head attention at a little abstract level. From now on, we consider an example of calculating multi-head self-attentions, where the input is a sentence “Anthony Hopkins admired Michael Bay as a great director.” In this example, the number of tokens is 9, and each token is encoded as a 512-dimensional embedding vector. And the number of heads is 8. In this case, as you can see in the figure below, the input sentence “Anthony Hopkins admired Michael Bay as a great director.” is implemented as a 9\times 512 matrix. You first split each token into 512/8=64 dimensional, 8 vectors in total, as I colored in the figure below. In other words, the input matrix is divided into 8 colored chunks, which are all 9\times 64 matrices, but each colored matrix expresses the same sentence. And you calculate self-attentions of the input sentence independently in the 8 heads, and you reweight the “values” according to the attentions/weights. After this, you stack the sum of the reweighted “values”  in each colored head, and you concatenate the stacked tokens of each colored head. The size of each colored chunk does not change even after reweighting the tokens. According to Ashish Vaswani, who invented Transformer model, each head compare “queries” and “keys” on each standard. If the a Transformer model has 4 layers with 8-head multi-head attention , at least its encoder has 4\times 8 = 32 heads, so the encoder learn the relations of tokens of the input on 32 different standards.

I think you now have rough insight into how you calculate multi-head attentions. In the next section I am going to explain the process of reweighting the tokens, that is, I am finally going to explain what those colorful lines in the head image of this article series are.

*Each head is randomly initialized, so they learn to compare tokens with different criteria. The standards might be straightforward like “what” or “who”, or maybe much more complicated. In attention mechanisms in deep learning, you do not need feature engineering for setting such standards.

2 Calculating attentions and reweighting “values”

If you have read the last article or if you understand attention mechanism to some extent, you should already know that attention mechanism calculates attentions, or relevance between “queries” and “keys.” In the last article, I showed the idea of weights as a histogram, and in that case the “query” was the hidden state of the decoder at every time step, whereas the “keys” were the outputs of the encoder. In this section, I am going to explain attention mechanism in a more abstract way, and we consider comparing more general “tokens”, rather than concrete outputs of certain networks. In this section each [ \cdots ] denotes a token, which is usually an embedding vector in practice.

Please remember this mantra of attention mechanism: “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” The figure below shows an overview of a case where “Michael” is a query. In this case you compare the query with the “keys”, that is, the input sentence “Anthony Hopkins admired Michael Bay as a great director.” and you get the histogram of attentions/weights. Importantly the sum of the weights 1. With the attentions you have just calculated, you can reweight the “values,” which also denote the same input sentence. After that you can finally take a summation of the reweighted values. And you use this summation.

*I have been repeating the phrase “reweighting ‘values’  with attentions,”  but you in practice calculate the sum of those reweighted “values.”

Assume that compared to the “query”  token “Michael”, the weights of the “key” tokens “Anthony”, “Hopkins”, “admired”, “Michael”, “Bay”, “as”, “a”, “great”, and “director.” are respectively 0.06, 0.09, 0.05, 0.25, 0.18, 0.06, 0.09, 0.06, 0.15. In this case the sum of the reweighted token is 0.06″Anthony” + 0.09″Hopkins” + 0.05″admired” + 0.25″Michael” + 0.18″Bay” + 0.06″as” + 0.09″a” + 0.06″great” 0.15″director.”, and this sum is the what wee actually use.

*Of course the tokens are embedding vectors in practice. You calculate the reweighted vector in actual implementation.

You repeat this process for all the “queries.”  As you can see in the figure below, you get summations of 9 pairs of reweighted “values” because you use every token of the input sentence “Anthony Hopkins admired Michael Bay as a great director.” as a “query.” You stack the sum of reweighted “values” like the matrix in purple in the figure below, and this is the output of a one head multi-head attention.

3 Scaled-dot product

This section is a only a matter of linear algebra. Maybe this is not even so sophisticated as linear algebra. You just have to do lots of Excel-like operations. A tutorial on Transformer by Jay Alammar is also a very nice study material to understand this topic with simpler examples. I tried my best so that you can clearly understand multi-head attention at a more mathematical level, and all you need to know in order to read this section is how to calculate products of matrices or vectors, which you would see in the first some pages of textbooks on linear algebra.

We have seen that in order to calculate multi-head attentions, we prepare 8 pairs of “queries”, “keys” , and “values”, which I showed in 8 different colors in the figure in the first section. We calculate attentions and reweight “values” independently in 8 different heads, and in each head the reweighted “values” are calculated with this very simple formula of scaled dot-product: Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) =softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})\boldsymbol{V}. Let’s take an example of calculating a scaled dot-product in the blue head.

At the left side of the figure below is a figure from the original paper on Transformer, which explains one-head of multi-head attention. If you have read through this article so far, the figure at the right side would be more straightforward to understand. You divide the input sentence into 8 chunks of matrices, and you independently put those chunks into eight head. In one head, you convert the input matrix by three different fully connected layers, which is “Linear” in the figure below, and prepare three matrices Q, K, V, which are “queries”, “keys”, and “values” respectively.

*Whichever color attention heads are in, the processes are all the same.

*You divide \frac{\boldsymbol{Q}} {\boldsymbol{K}^T} by \sqrt{d}_k in the formula. According to the original paper, it is known that re-scaling \frac{\boldsymbol{Q} }{\boldsymbol{K}^T} by \sqrt{d}_k is found to be effective. I am not going to discuss why in this article.

As you can see in the figure below, calculating Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) is virtually just multiplying three matrices with the same size (Only K is transposed though). The resulting 9\times 64 matrix is the output of the head.

softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}) is calculated like in the figure below. The softmax function regularize each row of the re-scaled product \frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}, and the resulting 9\times 9 matrix is a kind a heat map of self-attentions.

The process of comparing one “query” with “keys” is done with simple multiplication of a vector and a matrix, as you can see in the figure below. You can get a histogram of attentions for each query, and the resulting 9 dimensional vector is a list of attentions/weights, which is a list of blue circles in the figure below. That means, in Transformer model, you can compare a “query” and a “key” only by calculating an inner product. After re-scaling the vectors by dividing them with \sqrt{d_k} and regularizing them with a softmax function, you stack those vectors, and the stacked vectors is the heat map of attentions.

You can reweight “values” with the heat map of self-attentions, with simple multiplication. It would be more straightforward if you consider a transposed scaled dot-product \boldsymbol{V}^T \cdot softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})^T. This also should be easy to understand if you know basics of linear algebra.

One column of the resulting matrix (\boldsymbol{V}^T \cdot softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})^T) can be calculated with a simple multiplication of a matrix and a vector, as you can see in the figure below. This corresponds to the process or “taking a summation of reweighted ‘values’,” which I have been repeating. And I would like you to remember that you got those weights (blue) circles by comparing a “query” with “keys.”

Again and again, let’s repeat the mantra of attention mechanism together: “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” If you have been patient enough to follow my explanations, I bet you have got a clear view on how multi-head attention mechanism works.

We have been seeing the case of the blue head, but you can do exactly the same procedures in every head, at the same time, and this is what enables parallelization of multi-head attention mechanism. You concatenate the outputs of all the heads, and you put the concatenated matrix through a fully connected layers.

If you are reading this article from the beginning, I think this section is also showing the same idea which I have repeated, and I bet more or less you no have clearer views on how multi-head attention mechanism works. In the next section we are going to see how this is implemented.

4 Tensorflow implementation of multi-head attention

Let’s see how multi-head attention is implemented in the Tensorflow official tutorial. If you have read through this article so far, this should not be so difficult. I also added codes for displaying heat maps of self attentions. With the codes in this Github page, you can display self-attention heat maps for any input sentences in English.

The multi-head attention mechanism is implemented as below. If you understand Python codes and Tensorflow to some extent, I think this part is relatively easy.  The multi-head attention part is implemented as a class because you need to train weights of some fully connected layers. Whereas, scaled dot-product is just a function.

*I am going to explain the create_padding_mask() and create_look_ahead_mask() functions in upcoming articles. You do not need them this time.

Let’s see a case of using multi-head attention mechanism on a (1, 9, 512) sized input tensor, just as we have been considering in throughout this article. The first axis of (1, 9, 512) corresponds to the batch size, so this tensor is virtually a (9, 512) sized tensor, and this means the input is composed of 9 512-dimensional vectors. In the results below, you can see how the shape of input tensor changes after each procedure of calculating multi-head attention. Also you can see that the output of the multi-head attention is the same as the input, and you get a 9\times 9 matrix of attention heat maps of each attention head.

I guess the most complicated part of this implementation above is the split_head() function, especially if you do not understand tensor arithmetic. This part corresponds to splitting the input tensor to 8 different colored matrices as in one of the figures above. If you cannot understand what is going on in the function, I recommend you to prepare a sample tensor as below.

This is just a simple (1, 9, 512) sized tensor with sequential integer elements. The first row (1, 2, …., 512) corresponds to the first input token, and (4097, 4098, … , 4608) to the last one. You should try converting this sample tensor to see how multi-head attention is implemented. For example you can try the operations below.

These operations correspond to splitting the input into 8 heads, whose sizes are all (9, 64). And the second axis of the resulting (1, 8, 9, 64) tensor corresponds to the index of the heads. Thus sample_sentence[0][0] corresponds to the first head, the blue 9\times 64 matrix. Some Tensorflow functions enable linear calculations in each attention head, independently as in the codes below.

Very importantly, we have been only considering the cases of calculating self attentions, where all “queries”, “keys”, and “values” come from the same sentence in the same language. However, as I showed in the last article, usually “queries” are in a different language from “keys” and “values” in translation tasks, and “keys” and “values” are in the same language. And as you can imagine, usualy “queries” have different number of tokens from “keys” or “values.” You also need to understand this case, which is not calculating self-attentions. If you have followed this article so far, this case is not that hard to you. Let’s briefly see an example where the input sentence in the source language is composed 9 tokens, on the other hand the output is composed 12 tokens.

As I mentioned, one of the outputs of each multi-head attention class is 9\times 9 matrix of attention heat maps, which I displayed as a matrix composed of blue circles in the last section. The the implementation in the Tensorflow official tutorial, I have added codes to display actual heat maps of any input sentences in English.

*If you want to try displaying them by yourself, download or just copy and paste codes in this Github page. Please maker “datasets” directory in the same directory as the code. Please download “spa-eng.zip” from this page, and unzip it. After that please put “spa.txt” on the “datasets” directory. Also, please download the “checkpoints_en_es” folder from this link, and place the folder in the same directory as the file in the Github page. In the upcoming articles, you would need similar processes to run my codes.

After running codes in the Github page, you can display heat maps of self attentions. Let’s input the sentence “Anthony Hopkins admired Michael Bay as a great director.” You would get a heat maps like this.

In fact, my toy implementation cannot handle proper nouns such as “Anthony” or “Michael.” Then let’s consider a simple input sentence “He admired her as a great director.” In each layer, you respectively get 8 self-attention heat maps.

I think we can see some tendencies in those heat maps. The heat maps in the early layers, which are close to the input, are blurry. And the distributions of the heat maps come to concentrate more or less diagonally. At the end, presumably they learn to pay attention to the start and the end of sentences.

You have finally finished reading this article. Congratulations.

You should be proud of having been patient, and you passed the most tiresome part of learning Transformer model. You must be ready for making a toy English-German translator in the upcoming articles. Also I am sure you have understood that Michael Bay is a great director, no matter what people say.

[References]

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need” (2017)

[2] “Transformer model for language understanding,” Tensorflow Core
https://www.tensorflow.org/overview

[3] “Neural machine translation with attention,” Tensorflow Core
https://www.tensorflow.org/tutorials/text/nmt_with_attention

[4] Jay Alammar, “The Illustrated Transformer,”
http://jalammar.github.io/illustrated-transformer/

[5] “Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention,” stanfordonline, (2019)
https://www.youtube.com/watch?v=5vcj8kSwBCY

[6]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 191-193

[7]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)
https://www.youtube.com/watch?v=XXtpJxZBa2c

[8]Rosemary Rossi, “Anthony Hopkins Compares ‘Genius’ Michael Bay to Spielberg, Scorsese,” yahoo! entertainment, (2017)
https://www.yahoo.com/entertainment/anthony-hopkins-transformers-director-michael-bay-guy-genius-010058439.html

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Support Vector Machines for Text Recognition

Hand Written Alphabet recognition Using Support Vector Machine

We have used image classification as an task in many cases, more often this has been done using an module like openCV in python or using pre-trained models like in case of MNIST data sets. The idea of using Support Vector Machines for carrying out the same task is to give a simpler approach for a complicated process. There are some pro’s and con’s in every algorithm. Support vector machine for data with very high dimension may prove counter productive. But in case of image data we are actually using a array. If its a mono chrome then its just a 2 dimensional array, if grey scale or color image stack then we may have a 3 dimensional array processing to be considered. You can get more clarity on the array part if you go through this article on Machine learning using only numpy array. While there are certainly advantages of using OCR packages like Tesseract or OpenCV or GPTs, I am putting forth this approach of using a simple SVM model for hand written text classification. As a student while doing linear regression, I learn’t a principle “Occam’s Razor”, Basically means, keep things simple if they can explain what you want to. In short, the law of parsimony, simplify and not complicate. Applying the same principle on Hand written Alphabet recognition is an attempt to simplify using a classic algorithm, the Support Vector Machine. We break the  problem of hand written alphabet recognition into a simple process rather avoiding usage of heavy packages. This is an attempt to create the data and then build a model using Support Vector Machines for Classification.

Data Preparation

Manually edit the data instead of downloading it from the web. This will help you understand your data from the beginning. Manually write some letters on white paper and get the photo from your mobile phone. Then store it on your hard drive. As we are doing a trial we don’t want to waste a lot of time in data creation at this stage, so it’s a good idea to create two or three different characters for your dry run. You may need to change the code as you add more instances of classes, but this is where the learning phase begins. We are now at the training level.

Data Structure

You can create the data yourself by taking standard pictures of hand written text in a 200 x 200 pixel dimension. Alternatively you can use a pen tab to manually write these alphabets and save them as files. If you know and photo editing tools you can use them as well. For ease of use, I have already created a sample data and saved it in the structure as below.

Image Source : From Author

You can download the data which I have used, right click on this download data link and open in new tab or window. Then unzip the folders and you should be able to see the same structure and data as above in your downloads folder. I would suggest, you should create your own data and repeat the  process. This would help you understand the complete flow.

Install the Dependency Packages for RStudio

We will be using the jpeg package in R for Image handling and the SVM implementation from the kernlab package.  Also we need to make sure that the image data has dimension’s of 200 x 200 pixels, with a horizontal and vertical resolution of 120dpi. You can vary the dimension’s like move it to 300 x 300 or reduce it to 100 x 100. The higher the dimension, you will need more compute power. Experiment around the color channels and resolution later once you have implemented it in the current form.

 

Load the training data set

Feature Transformation

Since we don’t intend to use the typical CNN, we are going to use the white, grey and black pixel values for new feature creation. We will use the summation of all the pixel values of a image  and save it as a new feature called as “sum”, the count of all pixels adding up to zero as “zero”, the count of all pixels adding up to “ones” and the sum of all pixels between zero’s and one’s as “in_between”. The “label” feature names are extracted from the names of the folder

Support Vector Machine model

Evaluate the Model on the Testing Data Set

I would recommend you to learn concepts of SVM which couldn’t be explained completely in this article by going through my free Data Science and Machine Learning video courses. We have created the classifier using the Kerlab package in R, but I would advise you to study the mathematics involved in Support vector machines to get a clear understanding.

Data Mining Process flow – Easy Understanding

1 Overview

Development of computer processing power, network and automated software completely change and give new concept of each business. And data mining play the vital part to solve, finding the hidden patterns and relationship from large dataset with business by using sophisticated data analysis tools like methodology, method, process flow etc.

On this paper, proposed a process flow followed CRISP-DM methodology and has six steps where data understanding does not considered.

Phase of new process flow given below:-

Phase 1: Involved with collection, outliner treatment, imputation, transformation, scaling, and partition dataset in to two sub-frames (Training and Testing). Here as an example for outliner treatment, imputation, transformation, scaling consider accordingly Z score, mean, One hot encoding and Min Max Scaler.

Phase 2: On this Phase training and testing data balance with same balancing algorithm but separately. As an example here SMOTE (synthetic minority oversampling technique) is considered.

Phase 3: This phase involved with reduction, selection, aggregation, extraction. But here for an example considering same feature reduction algorithm (LDA -Linear Discriminant analysis) on training and testing data set separately.

Phase 4: On this Phase Training data set again partition into two more set (Training and Validation).

Phase 5: This Phase considering several base algorithms as a base model like CNN, RNN, Random forest, MLP, Regression, Ensemble method. This phase also involve to find out best hyper parameter and sub-algorithm for each base algorithm. As an example on this paper consider two class classification problems and also consider Random forest (Included CART – Classification and Regression Tree and GINI index impurity) and MLP classifier (Included (Relu, Sigmoid, binary cross entropy, Adam – Adaptive Moment Estimation) as base algorithms.

Phase 6: First, Prediction with validation data then evaluates with Test dataset which is fully unknown for these (Random forest, MLP classifier) two base algorithms. Then calculate the confusion matrix, ROC, AUC to find the best base algorithm.

New method from phase 1 to phase 4 followed CRISP-DM methodology steps such as data collection, data preparation then phase 5 followed modelling and phase 6 followed evaluation and implementation steps.

Structure of proposed process flow for two class problem combined with algorithm and sub-algorithm display on figure – 1.

These articles mainly focus to describe all algorithms which are going to implementation for better understanding.

 

 

Data Mining Process Flow

Figure 1 – Data Mining Process Flow

2 Phase 1: Outlier treatment, Transform, Scaling, Imputation

This phase involved with outlier treatment, imputation, scaling, and transform data.

2.1 Outliner treatment: – Z score

Outlier is a data point which lies far from all other data point in a data set. Outlier need to treat because it may bias the entire result. Outlier treatment with Z score is a common technique.  Z score is a standard score in statistics.  Z score provides information about data value is smaller or grater then mean that means how many standard deviations away from the mean value. Z score equation display below:

Z = \frac{(x - \mu)}{\sigma}

Here x = data point
σ = Standard deviation
μ = mean value

Equation- 1 Z-Score

In a normal distribution Z score represent 68% data lies on +/- 1, 95% data point lies on +/- 2, 99.7% data point lies on +/- 3 standard deviation.

2.2 Imputation data: – mean

Imputation is a way to handle missing data by replacing substituted value. There are many imputation technique represent like mean, median, mode, k-nearest neighbours. Mean imputation is the technique to replacing missing information with mean value. On the mean imputation first calculate the particular features mean value and then replace the missing value with mean value. The next equation displays the mean calculation:

\mu = \frac{(\sum x)}{n}

Here x = value of each point
n = number of values
μ = mean value

Equation- 2 Mean

2.3 Transform: – One hot encoding

Encoding is a pre-processing technique which represents data in such a way that computer can understand.  For understanding of machine learning algorithm categorical columns convert to numerical columns, this process called categorical encoding. There are multiple way to handle categorical variable but most widely used techniques are label encoding and one host encoding. On label encoding give a numeric (integer number) for each category. Suppose there are 3 categories of foods like apples, orange, banana. When label encoding is used then 3 categories will get a numerical value like apples = 1, banana = 2 and orange = 3. But there is very high probability that machine learning model can capture the relationship in between categories such as apple < banana < orange or calculate average across categories like 1 +3 = 4 / 2 = 2 that means model can understand average of apple and orange together is banana which is not acceptable because model correlation calculation is wrong. For solving this problem one hot encoding appear. The following table displays the label encoding is transformed into one hot encoding.

Label Encoding and One-Hot-Encoding

Table- 1 Encoding example

On hot encoding categorical value split into columns and each column contains 0 or 1 according to columns placement.

2.4 Scaling data: – Min Max Scaler

Feature scaling method is standardized or normalization the independent variable that means it is used to scale the data in a particular range like -1 to +1 or depending on algorithm. Generally normalization used where data distribution does not follow Gaussian distribution and standardization used where data distribution follow Gaussian distribution. On standardization techniques transform data values are cantered around the mean and unit is standard deviation. Formula for standardization given below:

Standardization X = \frac{(X - \mu)}{\sigma}

Equation-3 Equations for Standardization

X represent the feature value, µ represent mean of the feature value and σ represent standard deviation of the feature value. Standardized data value does not restrict to a particular range.

Normalization techniques shifted and rescaled data value range between 0 and 1. Normalization techniques also called Min-Max scaling. Formula for normalization given below:

Normalization X = \frac{(X - X_{min})}{X_{max} - X_{min}}

Equation – 4 Equations for Normalization

Above X, Xmin, Xmax are accordingly feature values, feature minimum value and feature maximum value. On above formula when X is minimums then numerator will be 0 (  is 0) or if X is maximums then the numerator is equal to the denominator (  is 1). But when X data value between minimum and maximum then  is between 0 and 1. If ranges value of data does not normalized then bigger range can influence the result.

3 Phase 2: – Balance Data

3.1 SMOTE

SMOTE (synthetic minority oversampling technique) is an oversampling technique where synthetic observations are created based on existing minority observations. This technique operates in feature space instead of data space. Under SMOTE each minority class observation calculates k nearest neighbours and randomly chose the neighbours depending on over-sampling requirements. Suppose there are 4 data point on minority class and 10 data point on majority class. For this imbalance data set, balance by increasing minority class with synthetic data point.   SMOTE creating synthetic data point but it is necessary to consider k nearest neighbours first. If k = 3 then SMOTE consider 3 nearest neighbours. Figure-2 display SMOTE with k = 3 and x = x1, x2, x3, x4 data point denote minority class. And all circles represent majority class.

SMOTE Example

Figure- 2 SMOTE example

 

4 Phase 3: – Feature Reduction

4.1 LDA

LDA stands for Linear Discriminant analysis supervised technique are commonly used for classification problem.  On this feature reduction account continuous independent variable and output categorical variable. It is multivariate analysis technique. LDA analyse by comparing mean of the variables.  Main goal of LDA is differentiate classes in low dimension space. LDA is similar to PCA (Principal component analysis) but in addition LDA maximize the separation between multiple classes. LDA is a dimensionality reduction technique where creating synthetic feature from linear combination of original data set then discard less important feature. LDA calculate class variance, it maximize between class variance and minimize within class variance. Table-2 display the process steps of LDA.

LDA Process

Table- 2 LDA process

5 Phase 5: – Base Model

Here we consider two base model ensemble random forest and MLP classifier.

5.1 Random Forest

Random forest is an ensemble (Bagging) method where group of weak learner (decision tree) come together to form a strong leaner. Random forest is a supervised algorithm which is used for regression and classification problem. Random forests create several decisions tree for predictions and provide solution by voting (classification) or mean (regression) value. Working process of Random forest given below (Table -3).

Random Forest

Table-3 Random Forest process

When training a Random forest root node contains a sample of bootstrap dataset and the feature is as same as original dataset. Suppose the dataset is D and contain d record and m number of columns. From the dataset D random forest first randomly select sample of rows (d) with replacement and sample of features (n) and give it to the decision tree. Suppose Random forest created several decision trees like T1, T2, T3, T4 . . . Tn. Then randomly selected dataset D = d + n is given to the decision tree T1, T2, T3, T4 . . . Tn where D < D, m > n and d > d.  After taking the dataset decision tree give the prediction for binary classification 1 or 0 then aggregating the decision and select the majority voted result. Figure-3 describes the structure of random forest process.

Random Forest Process

Figure- 3 Random Forest process

On Random forest base learner Decision Tree grows complete depth where bias (properly train on training dataset) is low and variance is high (when implementing test data give big error) called overfitting. On Random forest using multiple decision trees where each Decision tree is high variance but when combining all decision trees with the respect of majority vote then high variance converted into low variance because using row and feature sampling with replacement and taking the majority vote where decision is not depend on one decision tree.

CART (Classification and Regression Tree) is binary segmentation technique. CART is a Gini’s impurity index based classical algorithm to split a dataset and build a decision tree. By splitting a selected dataset CART created two child nodes repeatedly and builds a tree until the data no longer be split. There are three steps CART algorithm follow:

  1. Find best split for each features. For each feature in binary split make two groups of the ordered classes. That means possibility of split for k classes is k-1. Find which split is maximized and contain best splits (one for each feature) result.
  2. Find the best split for nodes. From step 1 find the best one split (from all features) which maximized the splitting criterion.
  3. Split the best node from step 2 and repeat from step 1 until fulfil the stopping criterion.

 

For splitting criteria CART use GINI index impurity algorithm to calculate the purity of split in a decision tree. Gini impurity randomly classified the labels with the same distribution in the dataset. A Gini impurity of 0 (lowest) is the best possible impurity and it is achieve when everything is in a same class. Gini index varies from 0 to 1. 0 indicate the purity of class where only one class exits or all element under a specific class. 1 indicates that elements are randomly distributed across various classes. And 0.5 indicate equal elements distributed over classes. Gini index (GI) described by mathematically that sum of squared of probabilities of each class (pi) deducted from one (Equation-5).

Gini Impurities

Equation – 5 Gini impurities

Here (Equation-5) pi represent the probability (probability of p+ or yes and probability of p- or no) of distinct class with classified element. Suppose randomly selected feature (a1) which has 8 yes and 4 no. After the split right had side (b1 on equation-6) has 4 yes and 4 no and left had side (b2 on equation – 7) has 4 yes and 0 no. here b2 is a pure split (leaf node) because only one class yes is present. By using the GI (Gini index) formula for b1 and b2:-

Equation- 6 & 7 – Gini Impurity b1 & Gini Impurity b2

Here for b1 value 0.5 indicates that equal element (yes and no) distribute over classes which is not pure split. And b2 value 0 indicates pure split. On GINI impurity indicates that when probability (yes or no) increases GINI value also increases. Here 0 indicate pure split and .5 indicate equal split that means worst situation. After calculating the GINI index for b1 and b2 now calculate the reduction of impurity for data point a1. Here total yes 8 (b1 and b2 on Equation – 8) and total no 4 (b1) so total data is 12 on a1. Below display the weighted GINI index for feature a1:

Total data point on b1 with Gini index (m) = 8/12 * 0.5 = 0.3333

Total data point on b2 with Gini index (n) = 4/12 * 0 = 0

Weighted Gini index for feature a1 = m + n = 0.3333

Equation- 8 Gini Impurity b1 & b2

After computing the weighted Gini value for every feature on a dataset taking the highest value feature as first node and split accordingly in a decision tree. Gini is less costly to compute.

5.2 Multilayer Perceptron Classifier (MLP Classifier)

Multilayer perceptron classifier is a feedforward neural network utilizes supervised learning technique (backpropagation) for training. MLP Classifier combines with multiple perceptron (hidden) layers. For feedforward taking input send combining with weight bias and then activation function from one hidden layer output goes to other hidden and this process continuing until reached the output. Then output calculates the error with error algorithm. These errors send back with backpropagation for weight adjustment by decreasing the total error and process is repeated, this process is call epoch. Number of epoch is determined with the hyper-parameter and reduction rate of total error.

5.2.1 Back-Propagation

Backpropagation is supervised learning algorithm that is used to train neural network. A neural network consists of input layer, hidden layer and output layer and each layer consists of neuron. So a neural network is a circuit of neurons. Backpropagation is a method to train multilayer neural network the updating of the weights of neural network and is done in such a way so that the error observed can be reduced here, error is only observed in the output layer and that error is back propagated to the previous layers and previous layer is proportionally updated weight. Backpropagation maintain chain rule to update weight. Mainly three steps on backpropagation are (Table-4):

Step Process
Step 1 Forward Pass
Step 2 Backward Pass
Step 3 Sum of all values and calculate updated weight value with Chain – rules.

Table-4 Back-Propagation process

5.2.2 Forward pass/ Forward propagation

Forward propagation is the process where input layer send the input value with randomly selected weight and bias to connected neuron and inside neuron selected activation function combine them and forward to other connected neuron layer after layer then give an output with the help of output layer. Below (Figure-4) display the forward propagation.

Foreward Pass

Figure-4 Forward passes

Input layer take the input of X (X1, X2) combine with randomly selected weight for each connection and with fixed bias (different hidden layer has different bias) send it to first hidden layer where first multiply the input with corresponding weight and added all input with single bias then selected activation function (may different form other layer) combine all input and give output according to function and this process is going on until reach in output layer. Output layer give the output like Y (Y1, Y2) (here output is binary classification as an example) according to selected activation function.

5.2.3 Backward Pass

After calculating error (difference between Forward pass output and actual output) backward pass try to minimize the error with optimisation function by sending backward with proportionally distribution and maintain a chain rule. Backward pass distribution the error in such a way where weighted value is taking under consideration. Below (Figure-5) diagram display the Backward pass process.

Backward Pass

Figure-5 Backward passes

Backpropagation push back the error which is calculated with error function or loss function for update proportional distribution with the help of optimisation algorithm. Division of Optimisation algorithm given below on Figure – 6

Optimisation Algorithms

Figure -6 Division of Optimisation algorithms

Gradient decent calculate gradient and update value by increases or decreases opposite direction of gradients unit and try to find the minimal value. Gradient decent update just one time for whole dataset but stochastic gradient decent update on each training sample and it is faster than normal gradient decent. Gradient decent can be improve by tuning parameter like learning rate (0 to 1 mostly use 0.5). Adagrad use time step based parameter to compute learning rate for every parameter. Adam is Adaptive Moment Estimation. It calculates different parameter with different learning rate. It is faster and performance rate is higher than other optimization algorithm. On the other way Adam algorithm is squares the calculated exponential weighted moving average of gradient.

5.2.4 Chain – rules

Backpropagation maintain chain-rules to update weighted value. On chain-rules backpropagation find the derivative of error respect to any weight. Suppose E is output error. w is weight for input a and bias b and ac neuron output respect of activation function and summation of bias with weighted input (w*a) input to neuron is net. So partial derivative for error respect to weight is ∂E / ∂w display the process on figure-7.

Figure- 7 Partial derivative for error respect to weight

On the chain rules for backward pass to find (error respect to weight) ∂E / ∂w = ∂E / ∂ac * ∂ac / ∂net * ∂net / ∂w. here find to error respect to weight are error respect to output of activation function multiply by activation function output respect to input in a neuron multiply by input in a neuron respect to weight.

5.2.5 Activation function

Activation function is a function which takes the decision about neuron to activate or deactivate. If the activate function activate the neuron then it will give an output on the basis of input. Input in a activation function is sum of input multiply with corresponding weight and adding the layered bias.  The main function of a activate function is non-linearity output of a neuron.

Activation Function

Figure-8 Activation function

Figure – 8 display a neuron in a hidden layer. Here several input (1, 2, 3) with corresponding weight (w1, w2, w3) putting in a neuron input layer where layer bias add with summation of multiplication with input and weight. Equation-9 display the output of an activate function.

Output from activate function y = Activate function (Ʃ (weight * input) + bias)

y = f (Ʃ (w*x) +b)

Equation- 9 Activate function

There are many activation functions like linear function for regression problem, sigmoid function for binary classification problem where result either 0 or 1, Tanh function which is based on sigmoid function but mathematically shifted version and values line -1 to 1. RELU function is Rectified linear unit. RELU is less expensive to compute.

5.2.6 Sigmoid

Sigmoid is a squashing activate function where output range between 0 and 1. Sigmoidal name comes from Greek letter sigma which looks like letter S when graphed. Sigmoid function is a logistic type function, it mainly use in output layer in neural network. Sigmoid is non-linear, fixed output range (between 0 and 1), monotonic (never decrees or never increases) and continuously differentiated function. Sigmoid function is good at classification and output from sigmoid is nonlinear. But Sigmoid has a vanishing gradient problem because output variable is very less to change in input variable. Figure- 9 displays the output of a Sigmoid and derivative of Sigmoid. Here x is any number (positive or negative). On sigmoid function 1 is divided by exponential negative input with adding 1.

Sigmoid

Figure – 9 Sigmoid Functions

4.5.2.7 RELU

RELU stands for Rectified Linear Units it is simple, less expensive in computation and rectifies the gradient vanishing problem. RELU is nonlinear activation function. It gives output either positive (infinity) or 0. RELU has a dying problem because if neurons stop for responding to variation because of gradient is 0 or nothing has to change. Figure- 10 displays the output of an RELU and derivative of RELU. Here x is any positive input and if x is grater then 0 give the output as x or give output 0. RELU function gives the output maximum value of input, here max (0, x).

Relu Activation Function

Figure – 10 RELU Function

4.5.2.8 Cost / loss function (Binary Cross-Entropy)

Cost or loss function compare the predictive value (model outcome) with actual value and give a quantitative value which give the indication about how much good or bad the prediction is.

Cost Function

Figure- 11 Cost function work process

Figure-11 x1 and x2 are input in a activate function f(x) and output y1_out which is sum of weighted input added with bias going through activate function. After model output activate function compare the output with actual output and give a quantitative value which indicate how good or bad the prediction is.

There are many type of loss function but choosing of optimal loss function depends on the problem going to be solved such as regression or classification. For binary classification problem binary cross entropy is used to calculate cost. Equation-10 displays the binary cross entropy where y is actual binary value and yp predictive outcome range 0 and 1. And i is scalar vale range between 1 to model output size (N).

Binary Crossentropy

Equation-10 displays the binary cross entropy

6 Phase 6: – Evaluation

6.1 Confusion matrix

In a classification confusion matrix describe the performance of actual value against predictive value. Confusion Matrix does the performance measurement. So confusion matrix classifies and display predicted and actual value (Visa, S., Ramsay 2011).

Confusion Matrix

Table- 5 Confusion Matrix

Confusion Matrix (Table-5) combines with True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). True Positive is prediction positive and true. True Negative is prediction negative and that is true. False positive is prediction positive and it’s false. False negative is prediction negative and that is false. False positive is known as Type1 error and false negative is known as Type 2 error. Confusion matrix can able to calculate several list of rates which are given below on Table- 6.

Here    N = Total number of observation, TP = True Positive, TN = True Negative

FP = False Positive, FN = False Negative, Total Actual No (AN) = TN + FP,

Total Predictive Yes (PY) = FP + TP. Total Actual Yes (AY) = FN + TP

Rate

 

Description Mathematical Description
Accuracy Classifier, overall how often correctly identified  (TP+TN) / N
Misclassification Rate Classifier, overall how often wrongly identified (FP + FN) / N
True Positive Rate

(Sensitivity / Recall)

Classifier, how often predict correctly yes when it is actually yes.  TP / AY
False Positive Rate Classifier, how often predict wrongly yes when it is actually no.  FP / AN
True Negative Rate

(Specificity)

Classifier, how often predict correctly no when it is actually no.  TN / AN
Precision Classifier how often predict yes when it is correct.  TP / PY
Prevalence Yes conditions how often occur in a sample. AY / N

Table – 6 Confusion matrixes Calculation

From confusion matrix F1 score can be calculated because F1 score related to precision and recall. Higher F1 score is better. If precision or recall any one goes down F1 score also go down.

F1 = \frac{2 * Precision * Recall}{Precision + Recall}

4.6.2 ROC (Receiver Operating Characteristic) curve

In statistics ROC is represent in a graph with plotting a curve which describe a binary classifiers performance as its differentiation threshold is varied. ROC (Equation-11) curve created true positive rate (TPR) against false positive rate (FPR). True positive rate also called as Sensitivity and False positive rate also known as Probability of false alarm. False positive rate also called as a probability of false alarm and it is calculated as 1 – Specificity.

True Positive Rate = \frac{True Positive}{True Positive + False Negative} = Recall or Sensivity

False Positive Rate = \frac{True Negative}{True Negative + False Positive} = 1 - Specificity

Equation- 11 ROC

So ROC (Receiver Operation Characteristic) curve allows visual representation between sensitivity and specificity associated with different values of the test result (Grzybowski, M. and Younger, J.G., 1997)

On ROC curve each point has different Threshold level. Below (Figure – 12) display the ROC curve. Higher the area curve covers is better that means high sensitivity and high specificity represent more accuracy. ROC curve also represent that if classifier predict more often true than it has more true positive and also more false positive. If classifier predict true less often then fewer false positive and also fewer true positive.

ROC Curve

ROC Curve

Figure – 12 ROC curve description

4.6.3 AUC (Area under Curve)

Area under curve (AUC) is the area surrounded by the ROC curve and AUC also represent the degree of separability that means how good the model to distinguished between classes. Higher the AUC value represents better the model performance to separate classes. AUC = 1 for perfect classifier, AUC = 0 represent worst classifier, and AUC = 0.5 means has no class separation capacity. Suppose AUC value is 0.6 that means 60% chance that model can classify positive and negative class.

Figure- 13 to Figure – 16 displays an example of AUC where green distribution curve for positive class and blue distribution curve for negative class. Here threshold or cut-off value is 0.5 and range between ‘0’ to ‘1’. True negative = TN, True Positive = TP, False Negative = FN, False Positive = FP, True positive rate = TPR (range 0 to 1), False positive rate = FPR (range 0 to 1).

On Figure – 13 left distribution curve where two class curves does not overlap that means both class are perfectly distinguished. So this is ideal position and AUC value is 1.  On the left side ROC also display that TPR for positive class is 100% occupied.

ROC distributions (perfectly distinguished

ROC distributions (perfectly distinguished

Figure – 14 two class overlap each other and raise false positive (Type 1), false negative (Type 2) errors. Here error could be minimize or maximize according to threshold. Suppose here AUC = 0.6, that means chance of a model to distinguish two classes is 60%. On ROC curve also display the curve occupied for positive class is 60%.

ROC distributions (class partly overlap distinguished)

ROC distributions (class partly overlap distinguished)

Figure- 15 displayed that positive and negative overlap each other. Here AUC value is 0.5 or near to 0.5. On this position classifier model does not able distinguish positive and negative classes. On left side ROC curve become straight that means TPR and FPR are equal.

ROC distributions (class fully overlap distinguished)

ROC distributions (class fully overlap distinguished)

Figure- 16 positive and negative class swap position and on this position AUC = 0. That means classified model predict positive as a negative and negative as a positive. On the left ROC curve display that curve on FPR side fully fitted.

ROC distributions (class swap position distinguished)

ROC distributions (class swap position distinguished)

7 Summaries

This paper describes a data mining process flow and related model and its algorithm with textual representation. One hot encoding create dummy variable for class features and min-max scaling scale the data in a single format. Balancing by SMOTE data where Euclidian distance calculates the distance in-between nearest neighbour to produce synthetic data under minority class. LDA reduce the distance inside class and maximise distance in-between class and for two class problem give a single dimension features which is less costly to calculate accuracy by base algorithm (random forest and MLP classifier).  Confusion matrix gives the accuracy, precision, sensitivity, specificity which is help to take a decision about base algorithm. AUC and ROC curve also represent true positive rate against false positive rate which indicate base algorithm performance.

Base algorithm Random forest using CART with GINI impurity for feature selection to spread the tree. Here CART is selected because of less costly to run. Random forest algorithm is using bootstrap dataset to grow trees, and aggregation using majority vote to select accuracy.

MLP classifier is a neural network algorithm using backpropagation chain-rule to reducing error. Here inside layers using RLU activation function. Output layers using Sigmoid activation function and binary cross entropy loss function calculate the loss which is back propagate with Adam optimizer to optimize weight and reduce loss.

References:

  1. Visa, S., Ramsay, B., Ralescu, A.L. and Van Der Knaap, E., 2011. Confusion Matrix-based Feature Selection. MAICS, 710, pp.120-127.
  2. Grzybowski, M. and Younger, J.G., 1997. Statistical methodology: III. Receiver operating characteristic (ROC) curves. Academic Emergency Medicine, 4(8), pp.818-826.

On the difficulty of language: prerequisites for NLP with deep learning

This is the first article of my article series “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

1 Preface

This section is virtually just my essay on language. You can skip this if you want to get down on more technical topic.

As I do not study in natural language processing (NLP) field, I would not be able to provide that deep insight into this fast changing deep leaning field throughout my article series. However at least I do understand language is a difficult and profound field, not only in engineering but also in many other study fields. Some people might be feeling that technologies are eliminating languages, or one’s motivations to understand other cultures. First of all, I would like you to keep it in mind that I am not a geek who is trying to turn this multilingual world into a homogeneous one and rebuild Tower of Babel, with deep learning. I would say I am more keen on social or anthropological sides of language.

I think you would think more about languages if you have mastered at least one foreign language. As my mother tongue is Japanese, which is totally different from many other Western languages in terms of characters and ambiguity, I understand translating is not what learning a language is all about. Each language has unique characteristics, and I believe they more or less influence one’s personalities. For example, many Western languages make the verb, I mean the conclusion, of sentences clear in the beginning part of the sentences. That is also true of Chinese, I heard. However in Japanese, the conclusion comes at the end, so that is likely to give an impression that Japanese people are being obscure or indecisive. Also, Japanese sentences usually omit their subjects. In German as well, the conclusion of a sentences tend to come at the end, but I am almost 100% sure that no Japanese people would feel German people make things unclear. I think that comes from the structures of German language, which tends to make the number, verb, relations of words crystal clear.

Source: https://twitter.com/nakamurakihiro

Let’s take an example to see how obscure Japanese is. A Japanese sentence 「頭が赤い魚を食べる猫」can be interpreted in five ways, depending on where you put emphases on.

Common sense tells you that the sentence is likely to mean the first two cases, but I am sure they can mean those five possibilities. There might be similarly obscure sentences in other languages, but I bet few languages can be as obscure as Japanese. Also as you can see from the last two sentences, you can omit subjects in Japanese. This rule is nothing exceptional. Japanese people usually don’t use subjects in normal conversations. And when you read classical Japanese, which Japanese high school students have to do just like Western students learn some of classical Latin, the writings omit subjects much more frequently.

*However interestingly we have rich vocabulary of subjects. The subject “I” can be translated to 「私」、「僕」、「俺」、「自分」、「うち」etc, depending on your personality, who you are talking to, and the time when it is written in.

I believe one can see the world only in the framework of their language, and it seems one’s personality changes depending on the language they use. I am not sure whether the language originally determines how they think, or how they think forms the language. But at least I would like you to keep it in mind that if you translate a conversation, for example a random conversation at a bar in Berlin, into Japanese, that would linguistically sound Japanese, but not anthropologically. Imagine that such kind of random conversation in Berlin or something is like playing a catch, I mean throwing a ball named “your opinion.” On the other hand,  normal conversations of Japanese people are in stead more of, I would say,  “resonance” of several tuning forks. They do their bests to show that they are listening to each other, by excessively nodding or just repeating “Really?”, but usually it seems hardly any constructive dialogues have been made.

*I sometimes feel you do not even need deep learning to simulate most of such Japanese conversations. Several-line Python codes would be enough.

My point is, this article series is mainly going to cover only a few techniques of NLP in deep learning field: sequence to sequence model (seq2seq model) , and especially Transformer. They are, at least for now, just mathematical models and mappings of a small part of this profound field of language (as far as I can cover in this article series). But still, examples of language would definitely help you understand Transformer model in the long run.

2 Tokens and word embedding

*Throughout my article series, “words” just means the normal words you use in daily life. “Tokens” means more general unit of NLP tasks. For example the word “Transformer” might be denoted as a single token “Transformer,” or maybe as a combination of two tokens “Trans” and “former.”

One challenging part of handling language data is its encodings. If you started learning programming in a language other than English, you would have encountered some troubles of using keyboards with different arrangements or with characters. Some comments on your codes in your native languages are sometimes not readable on some software. You can easily get away with that by using only English, but when it comes to NLP you have to deal with this difficulty seriously. How to encode characters in each language should be a first obstacle of NLP. In this article we are going to rely on a library named BPEmb, which provides word embedding in various languages, and you do not have to care so much about encodings in languages all over the world with this library.

In the first section, you might have noticed that Japanese sentence is not separated with spaces like Western languages. This is also true of Chinese language, and that means we need additional tasks of separating those sentences at least into proper chunks of words. This is not only a matter of engineering, but also of some linguistic fields. Also I think many people are not so conscious of how sentences in their native languages are grammatically separated.

The next point is, unlike other scientific data, such as temperature, velocity, voltage, or air pressure, language itself is not measured as numerical data. Thus in order to process language, including English, you first have to map language to certain numerical data, and after some processes you need to conversely map the output numerical data into language data. This section is going to be mainly about one-hot encoding and word embedding, the ways to convert word/token into numerical data. You might already have heard about this

You might have learnt about word embedding to some extent, but I hope you could get richer insight into this topic through this article.

2.1 One-hot encoding

One-hot encoding would be the most straightforward way to encode words/tokens. Assume that you have a dictionary whose size is |\mathcal{V}|, and it includes words from “a”, “ablation”, “actually” to “zombie”, “?”, “!”

In a mathematical manner, in order to choose a word out of those |\mathcal{V}| words, all you need is a |\mathcal{V}| dimensional vector, one of whose elements is 1, and the others are 0. When you want to choose the No. i word, which is “indeed” in the example below, its corresponding one-hot vector is \boldsymbol{v} = (0, \dots, 1, \dots, 0 ), where only the No. i element is 1. One-hot encoding is also easy to understand, and that’s all. It is easy to imagine that people have already come up with more complicated and better way to encoder words. And one major way to do that is word embedding.

2.2 Word embedding

Source: Francois Chollet, Deep Learning with Python,(2018), Manning

Actually word embedding is related to one-hot encoding, and if you understand how to train a simple neural network, for example densely connected layers, you would understand word embedding easily. The key idea of word embedding is denoting each token with a D dimensional vector, whose dimension is fewer than the vocabulary size |\mathcal{V}|. The elements of the resulting word embedding vector are real values, I mean not only 0 or 1. Obviously you can encode much richer variety of tokens with such vectors. The figure at the left side is from “Deep Learning with Python” by François Chollet, and I think this is an almost perfect and simple explanation of the comparison of one-hot encoding and word embedding. But the problem is how to get such convenient vectors. The answer is very simple: you have only to train a network whose inputs are one-hot vector of the vocabulary.

The figure below is a simplified model of word embedding of a certain word. When the word is input into a neural network, only the corresponding element of the one-hot vector is 1, and that virtually means the very first input layer is composed of one neuron whose value is 1. And the only one neuron propagates to the next D dimensional embedding layer. These weights are the very values which most other study materials call “an embedding vector.”

When you input each word into a certain network, for example RNN or Transformer, you map the input one-hot vector into the embedding layer/vector. The examples in the figure are how inputs are made when the input sentences are “You’ve got the touch” and “You’ve got the power.”   Assume that you have a dictionary of one-hot encoding, whose vocabulary is {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}, and the dimension of word embeding is 6. In this case |\mathcal{V}| = 9, D=6. When the inputs are “You’ve got the touch” or “You’ve got the power” , you put the one-hot vector corresponding to “You’ve”, “got”, “the”, “touch” or “You’ve”, “got”, “the”, “power” sequentially every time step t.

In order to get word embedding of certain vocabulary, you just need to train the network. We know that the words “actually” and “indeed” are used in similar ways in writings. Thus when we propagate those words into the embedding layer, we can expect that those embedding layers are similar. This is how we can mathematically get effective word embedding of certain vocabulary.

More interestingly, if word embedding is properly trained, you can mathematically “calculate” words. For example, \boldsymbol{v}_{king} - \boldsymbol{v}_{man} + \boldsymbol{v}_{woman} \approx \boldsymbol{v}_{queen}, \boldsymbol{v}_{Japan} - \boldsymbol{v}_{Tokyo} + \boldsymbol{v}_{Vietnam} \approx \boldsymbol{v}_{Hanoi}.

*I have tried to demonstrate this type of calculation on several word embedding, but none of them seem to work well. At least you should keep it in mind that word embedding learns complicated linear relations between words.

I should explain word embedding techniques such as word2vec in detail, but the main focus of this article is not NLP, so the points I have mentioned are enough to understand Transformer model with NLP examples in the upcoming articles.

 

3 Language model

Language models is one of the most straightforward, but crucial ideas in NLP. This is also a big topic, so this article is going to cover only basic points. Language model is a mathematical model of the probabilities of which words to come next, given a context. For example if you have a sentence “In the lecture, he opened a _.”, a language model predicts what comes at the part “_.” It is obvious that this is contextual. If you are talking about general university students, “_” would be “textbook,” but if you are talking about Japanese universities, especially in liberal art department, “_” would be more likely to be “smartphone. I think most of you use this language model everyday. When you type in something on your computer or smartphone, you would constantly see text predictions, or they might even correct your spelling or grammatical errors. This is language modelling. You can make language models in several ways, such as n-gram and neural language models, but in this article I can explain only general formulations for such models.

*I am not sure which algorithm is used in which services. That must be too fast changing and competitive for me to catch up.

As I mentioned in the first article series on RNN, a sentence is usually processed as sequence data in NLP. One single sentence is denoted as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}), a list of vectors. The vectors are usually embedding vectors, and the (t) is the index of the order of tokens. For example the sentence “You’ve go the power.” can be expressed as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}), where \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)} denote “You’ve”, “got”, “the”, “power”, “.” respectively. In this case \tau = 4.

In practice a sentence \boldsymbol{X} usually includes two tokens BOS and EOS at the beginning and the end of the sentence. They mean “Beginning Of Sentence” and “End Of Sentence” respectively. Thus in many cases \boldsymbol{X} = (\boldsymbol{BOS} , \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS} ). \boldsymbol{BOS} and \boldsymbol{EOS} are also both vectors, at least in the Tensorflow tutorial.

P(\boldsymbol{X} = (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) is the probability of incidence of the sentence. But it is easy to imagine that it would be very hard to directly calculate how likely the sentence \boldsymbol{X} appears out of all possible sentences. I would rather say it is impossible. Thus instead in NLP we calculate the probability P(\boldsymbol{X}) as a product of the probability of incidence or a certain word, given all the words so far. When you’ve got the words (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1}) so far, the probability of the incidence of \boldsymbol{x}^{(t)}, given the context is  P(\boldsymbol{x}^{(t)}|\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1)}). P(\boldsymbol{BOS}) is a probability of the the sentence \boldsymbol{X} being (\boldsymbol{BOS}), and the probability of \boldsymbol{X} being (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) can be decomposed this way: P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS})P(\boldsymbol{BOS}).

Just as well P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) = P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P( \boldsymbol{BOS}, \boldsymbol{x}^{(1)})= P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P( \boldsymbol{BOS}).

Hence, the general probability of incidence of a sentence \boldsymbol{X} is P(\boldsymbol{X})=P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(\tau -1)}, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) = P(\boldsymbol{EOS}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}) P(\boldsymbol{x}^{(\tau)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau - 1)}) \cdots P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P(\boldsymbol{BOS}).

Let \boldsymbol{x}^{(0)} be \boldsymbol{BOS} and \boldsymbol{x}^{(\tau + 1)} be \boldsymbol{EOS}. Plus, let P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]}) be P(\boldsymbol{x}^{(t+1)}|\boldsymbol{x}^{(0)}, \dots, \boldsymbol{x}^{(t)}), then P(\boldsymbol{X}) = P(\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Language models calculate which words to come sequentially in this way.

Here’s a question: how would you evaluate a language model?

I would say the answer is, when the language model generates words, the more confident the language model is, the better the language model is. Given a context, when the distribution of the next word is concentrated on a certain word, we can say the language model is confident about which word to come next, given the context.

*For some people, it would be more understandable to call this “entropy.”

Let’s take the vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”} as an example. Assume that P(\boldsymbol{X}) = P(\boldsymbol{BOS}, \boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}, \boldsymbol{EOS}) = P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}, \boldsymbol{EOS})= P(\boldsymbol{x}^{(0)})\prod_{t=0}^{4}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Given a context (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}), the probability of incidence of \boldsymbol{x}^{(2)} is P(\boldsymbol{x}^{2}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}). In the figure below, the distribution at the left side is less confident because probabilities do not spread widely, on the other hand the one at the right side is more confident that next word is “got” because the distribution concentrates on “got”.

*You have to keep it in mind that the sum of all possible probability P(\boldsymbol{x}^{(2)} | \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) is 1, that is, P(\boldsymbol{the}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + P(\boldsymbol{You've}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + \cdots + P(\boldsymbol{Boogie}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = 1.

While the language model generating the sentence “BOS You’ve got the touch EOS”, it is better if the language model keeps being confident. If it is confident, P(\boldsymbol{X})= P(\boldsymbol{BOS}) P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}}P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}) P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} gets higher. Thus (-1) \{ log_{b}{P(\boldsymbol{BOS})} + log_{b}{P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}}) + log_{b}{P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)})} + log_{b}{P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)})} + log_{b}{P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} \} gets lower, where usually b=2 or b=e.

This is how to measure how confident language models are, and the indicator of the confidence is called perplexity. Assume that you have a data set for evaluation \mathcal{D} = (\boldsymbol{X}_1, \dots, \boldsymbol{X}_n, \dots, \boldsymbol{X}_{|\mathcal{D}|}), which is composed of |\mathcal{D}| sentences in total. Each sentence \boldsymbol{X}_n = (\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau ^{(n)}}{P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})} has \tau^{(n)} tokens in total excluding \boldsymbol{BOS}, \boldsymbol{EOS}. And let |\mathcal{V}| be the size of the vocabulary of the language model. Then the perplexity of the language model is b^z, where z = \frac{-1}{|\mathcal{V}|}\sum_{n=1}^{|\mathcal{D}|}{\sum_{t=0}^{\tau ^{(n)}}{log_{b}P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})}. The b is usually 2 or e.

For example, assume that \mathcal{V} is vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}. Also assume that the evaluation data set for perplexity of a language model is \mathcal{D} = (\boldsymbol{X}_1, \boldsymbol{X}_2), where \boldsymbol{X_1} =(\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}) \boldsymbol{X_2} = (\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the }, \boldsymbol{power}). In this case |\mathcal{V}|=9, |\mathcal{D}|=2. I have already showed you how to calculate the perplexity of the sentence “You’ve got the touch.” above. You just need to do a similar thing on another sentence “You’ve got the power”, and then you can get the perplexity of the language model.

*If the network is not properly trained, it would also be confident of generating wrong outputs. However, such network still would give high perplexity because it is “confident” at any rate. I’m sorry I don’t know how to tackle the problem. Please let me put this aside, and let’s get down on Transformer model soon.

Appendix

Let’s see how word embedding is implemented with a very simple example in the official Tensorflow tutorial. It is a simple binary classification task on IMDb Dataset. The dataset is composed to comments on movies by movie critics, and you have only to classify if the commentary is positive or negative about the movie. For example when you get you get an input “To be honest, Michael Bay is a terrible as an action film maker. You cannot understand what is going on during combat scenes, and his movies rely too much on advertisements. I got a headache when Mark Walberg used a Chinese cridit card in Texas. However he is very competent when it comes to humorous scenes. He is very talented as a comedy director, and I have to admit I laughed a lot.“, the neural netowork has to judge whether the statement is positive or negative.

This networks just takes an average of input embedding vectors and regress it into a one dimensional value from 0 to 1. The shape of embedding layer is (8185, 16). Weights of neural netowrks are usually implemented as matrices, and you can see that each row of the matrix corresponds to emmbedding vector of each token.

*It is easy to imagine that this technique is problematic. This network virtually taking a mean of input embedding vectors. That could mean if the input sentence includes relatively many tokens with negative meanings, it is inclined to be classified as negative. But for example, if the sentence is “This masterpiece is a dark comedy by Charlie Chaplin which depicted stupidity of the evil tyrant gaining power in the time. It thoroughly mocked Germany in the time as an absurd group of fanatics, but such propaganda could have never been made until ‘Casablanca.'” , this can be classified as negative, because only the part “masterpiece” is positive as a token, and there are much more words with negative meanings themselves.

The official Tensorflow tutorial provides visualization of word embedding with Embedding Projector, but I would like you to take more control over the data by yourself. Please just copy and paste the codes below, installing necessary libraries. You would get a map of vocabulary used in the text classification task. It seems you cannot find clear tendency of the clusters of the tokens. You can try other dimension reduction methods to get maps of the vocabulary by for example using Scikit Learn.

[References]

[1] “Word embeddings” Tensorflow Core
https://www.tensorflow.org/tutorials/text/word_embeddings

[2]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 43-64, 72-85, 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 43-64, 72-85, 191-193

[3]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)
https://www.youtube.com/watch?v=XXtpJxZBa2c

[4] Francois Chollet, Deep Learning with Python,(2018), Manning , pp. 178-185

[5]”2.2. Manifold learning,” scikit-learn
https://scikit-learn.org/stable/modules/manifold.html

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

CRISP-DM methodology in technical view

On this paper discuss about CRISP-DM (Cross Industry Standard Process for data mining) methodology and its steps including selecting technique to successful the data mining process. Before going to CRISP-DM it is better to understand what data mining is? So, here first I introduce the data mining and then discuss about CRISP-DM and its steps for any beginner (data scientist) need to know.

1 Data Mining

Data mining is an exploratory analysis where has no idea about interesting outcome (Kantardzic, 2003). So data mining is a process to explore by analysis a large set of data to discover meaningful information which help the business to take a proper decision. For better business decision data mining is a way to select feature, correlation, and interesting patterns from large dataset (Fu, 1997; SPSS White Paper, 1999).

Data mining is a step by step process to discover knowledge from data. Pre-processing data is vital part for a data mining. In pre-process remove noisy data, combining multiple sources of data, retrieve relevant feature and transforming data for analysis. After pre-process mining algorithm applied to extract data pattern so data mining is a step by step process and applied algorithm to find meaning full data pattern. Actually data mining is not only conventional analysis it is more than that (Read, 1999).

Data mining and statistics closely related. Main goal of data mining and statistic is find the structure of data because data mining is a part of statistics (Hand, 1999). However, data mining use tools, techniques, database, machine learning which not part of statistics but data mining use statistics algorithm to find a pattern or discover hidden decision.

Data mining objective could be prediction or description. On prediction data mining considering several features of dataset to predict unidentified future, on the other hand description involve identifying pattern of data to interpreted (Kantardzic, 2003).

From figure 1.1 shows data mining is the only one part of getting unknown information from data but it is the central process of whole process. Before data mining there are several processes need to be done like collecting data from several sources than integrated data and keep in data storage. Stored unprocessed data evaluated and selected with pre-processed activity to give a standard format than data mining algorithm to analysis for hidden pattern.

Data Mining Process

2 CRISP-DM Methodologies

Cross Industry Standard Process for data mining (CRISP-DM) is most popular and widely uses data mining methodology. CRISP-DM breaks down the data mining project life cycle into six phases and each phase consists of many second-level generic tasks. Generic task cover all possible data mining application. CRISP-DM extends KDD (Knowledge Discovery and Data Mining) into six steps which are sequence of data mining application (Martínez-Plumed 2019).

Data science and data mining project extract meaningful information from data. Data science is an art where a lot of time need to spend for understanding the business value and data before applying any algorithm then evaluate and deployed a project. CRISP-DM help any data science and data mining project from start to end by giving step by step process.

Present world every day billions of data are generating. So organisations are struggling with overwhelmed data to process and find a business goal. Comprehensive data mining methodology, CRISP-DM help business to achieve desirable goal by analysing data.

CRISP-DM (Cross Industry Standard Process for Data Mining) is well documented, freely available, data mining methodology. CRISP-DM is developed by more than 200 data mining users and many mining tool and service providers funded by European Union. CRISP-DM encourages organization for best practice and provides a structure of data mining to get better, faster result.

CRISP-DM is a step by step methodology. Figure-2.1 show the phases of CRISP-DM and process of data mining. Here one side arrow indicates the dependency between phases and double side arrow represents repeatable process. Six phases of CRISP-DM are Business understanding, Data understanding, Modelling, Evaluation and Deployment.

CRISP-DM

2.1 Business Understanding

Business Understanding or domain understanding is the first step of CRISP-DM methodology. On this stage identify the area of business which is going to transform into meaningful information by analysing, processing and implementing several algorithms. Business understanding identifies the available resource (human and hardware), problems and set a goal. Identification of business objective should be agreed with project sponsors and other unit of business which will be affected. This step also focuses about details business success criteria, requirements, constraints, risk, project plan and timeline.

2.2 Data Understanding

Data understanding is the second and closely related with the business understanding phase. This phase mainly focus on data collection and proceeds to get familiar with the data and also detect interesting subset from data. Data understanding has four subsets these are:-

2.2.1 Initial data collection

On this subset considering the data collection sources which is mainly divided into two categories like outsource data or internal source data.  If data is from outsource then it may costly, time consuming and may be low quality but if data is collected form internal source it is an easy and less costly, but it may be contain irrelevant data. If internal source data does not fulfil the interest of analysis than it is necessary to move outsource data. Data collection also give an assumption that the data is quantitative (continuous, count) or qualitative (categorical).  It also gives information about balance or imbalanced dataset.  On data collection should avoid random error, systematic error, exclusion errors, and errors of choosing.

2.2.2 Data Description

Data description performs initial analysis about data. On this stage it is going to determine about the source of data like RDBMS, SQL, NoSQL, Big data etc. then analysis and describe the data about size (large data set give more accurate result but time consuming), number of records, tables, database, variables, and data types (numeric, categorical or Boolean). On this phase examine the accessibility and availability of attributes.

2.2.3 Exploratory data analysis (EDA)

On exploratory data analysis describe the inferential statistics, descriptive statistics and graphical representation of data. Inferential statistics summarize the entire population from the sample data to perform sampling and hypothesis testing. On Parametric hypothesis testing  (Null or alternate – ANOVA, t-test, chi square test) perform for known distribution (based on population) like mean, variance, standard deviation, proportion and Non-parametric hypothesis testing perform when distribution is unknown or sample size is small. On sample dataset, random sampling implement when dataset is balance but for imbalance dataset should be follow random resampling (under  and over sampling), k fold cross validation, SMOTE (synthetic minority oversampling technique), cluster base sampling, ensemble techniques (bagging and boosting – Add boost, Gradient Tree Boosting, XG Boost) to form a balance dataset.

On descriptive statistics analysis describe about the mean, median, mode for measures of central tendency on first moment business decision. On second moment business decision describe the measure of dispersion about the variance, standard deviation and range of data.  On third and fourth moment business decision describe accordingly skewness (Positive skewness – heavier tail to the right, negative skewness – heavier tail to the left, Zero skewness – symmetric distribution) and Kurtosis (Leptokurtosis – heavy tail, platykurtosis – light tail, mesokurtic – normal distribution).

Graphical representation is divided into univariate, bivariate and multivariate analysis. Under univariate whisker plot, histogram identify the outliers and shape of distribution of data and Q-Q plot (Quantile – Quantile) plot describe the normality of data that means data is normally distribution or not.  On whisker plot if data present above of Q3 + 1.5 (IQR) and below of Q1 – 1.5 (IQR) is outlier. For Bivariate correlations identify with scatter plot which describe positive, negative or no correlation and also identify the data linearity or non-linearity. Scatter plot also describe the clusters and outliers of data.  For multivariate has no graphical analysis but used to use regression analysis, ANOVA, Hypothesis analysis.

2.2.4 Data Quality analysis

This phase identified and describes the potential errors like outliers, missing data, level of granularity, validation, reliability, bad metadata and inconsistency.  On this phase AAA (attribute agreement analysis) analysed discrete data for data error. Continuous data analysed with Gage repeatability and reproducibility (Gage R & R) which follow SOP (standard operating procedures). Here Gage R & R define the aggregation of variation in the measurement data because of the measurement system.

2.3 Data Preparation

Data Preparation is the time consuming stage for every data science project. Overall on every data science project 60% to 70% time spend on data preparation stage. Data preparation stapes are described below.

2.3.1 Data integration

Data integration involved to integrate or merged multiple dataset. Integration integrates data from different dataset where same attribute or same columns presents but when there is different attribute then merging the both dataset.

2.3.2 Data Wrangling

On this subset data are going to clean, curate and prepare for next level. Here analysis the outlier and treatment done with 3 R technique (Rectify, Remove, Retain) and for special cases if there are lots of outliner then need to treat outlier separately (upper outliner in an one dataset and lower outliner in another dataset) and alpha (significant value) trim technique use to separate the outliner from the original dataset. If dataset has a missing data then need to use imputation technique like mean, median, mode, regression, KNN etc.

If dataset is not normal or has a collinearity problem or autocorrelation then need to implement transformation techniques like log, exponential, sort, Reciprocal, Box-cox etc. On this subset use the data normalization (data –means/standard deviation) or standardization (min- max scaler) technique to make unitless and scale free data. This step also help if data required converting into categorical then need to use discretization or binning or grouping technique. For factor variable (where has limited set of values), dummy variable creation technique need to apply like one hot encoding.  On this subset also help heterogeneous data to transform into homogenous with clustering technique. Data inconsistencies also handle the inconsistence of data to make data in a single scale.

2.3.3 Feature engineering and selection/reduction

Feature engineering may called as attribute generation or feature extraction. Feature extraction creating new feature by reducing original feature to make simplex model. Feature engineering also do the normalized feature by producing calculative new feature. So feature engineering is a data pre-process technique where improve data quality by cleaning, integration, reduction, transformation and scaling.

Feature selections reduce the multicollinearity or high correlated data and make model simple. Main two type of feature selection technique are supervised and unsupervised. Principal Components Analysis (PCA) is an unsupervised feature reduction/ feature selection technique and LDA is a Linear Discriminant analysis supervised technique mainly use for classification problem. LDA analyse by comparing mean of the variables. Supervised technique is three types filter, wrapper and ensemble method. Filter method is easy to implement but wrapper is costly method and ensemble use inside a model.

2.4 Model

2.4.1 Model Selection Technique

Model selection techniques are influence by accuracy and performance.  Because recommendation need better performance but banking fraud detection needs better accuracy technique.  Model is mainly subdivided into two category supervised learning where predict an output variable according to given an input variable and unsupervised learning where has not output variable.

On supervised learning if an output variable is categorical than it is classification problem like two classes or multiclass classification problem. If an output variable is continuous (numerical) then the problem is called prediction problem. If need to recommending according to relevant information is called recommendation problem or if need to retrieve data according to relevance data is called retrieval problem.

On unsupervised learning where target or output variable is not present. On this technique all variable is treated as an input variable. Unsupervised learning also called clustering problem where clustering the dataset for future decision.

Reinforcement learning agent solves the problem by getting reward for success and penalty for any failure. And semi-supervised learning is a process to solve the problem by combining supervised and unsupervised learning method. On semi-supervised, a problem solved by apply unsupervised clustering technique then for each cluster apply different type of supervised machine learning algorithm like linear algorithm, neural network, K nearest  neighbour etc.

On data mining model selection technique, where output variable is known, then need to implement supervised learning.  Regression is the first choice where interpretation of parameter is important. If response variable is continuous then linear regression or if response variable is discrete with 2 categories value then logistic regression or if response variable is discrete with more than 2 categorical values then multinomial or ordinal regression or if response variable is count then poission where mean is equal to variance or negative binomial regression where variance is grater then mean or if response variable contain excessive zero values then need to choose Zero inflated poission (ZIP) or Zero inflated negative binomial (ZINB).

On supervised technique except regression technique all other technique can be used for both continuous or categorical response variable like KNN (K-Nearest Neighbour),  Naïve Bays, Black box techniques (Neural network, Support vector machine), Ensemble Techniques (Stacking, Bagging like random forest, Boosting like Decision tree, Gradient boosting, XGB, Adaboost).

When response variable is unknown then need to implement unsupervised learning. Unsupervised learning for row reduction is K-Means, Hierarchical etc., for columns reduction or dimension reduction PCA (principal component analysis), LDA (Linear Discriminant analysis), SVD (singular value decomposition) etc. On market basket analysis or association rules where measure are support and confidence then lift ration to determine which rules is important. There are recommendation systems, text analysis and NLP (Natural language processing) also unsupervised learning technique.

For time series need to select forecasting technique. Where forecasting may model based or data based. For Trend under model based need to use linear, exponential, quadratic techniques. And for seasonality need to use additive, multiplicative techniques. On data base approaches used auto regressive, moving average, last sample, exponential smoothing (e.g. SES – simple exponential smoothing, double exponential smoothing, and winters method).

2.4.2 Model building

After selection model according to model criterion model is need to be build. On model building provided data is subdivided with training, validation and testing.  But sometime data is subdivided just training and testing where information may leak from testing data to training data and cause an overfitting problem. So training dataset should be divided into training and validation whereas training model is tested with validation data and if need any tuning to do according to feedback from validation dataset. If accuracy is acceptable and error is reasonable then combine the training and validation data and build the model and test it on unknown testing dataset. If the training error and testing error is minimal or reasonable then the model is right fit or if the training error is low and testing error is high then model is over fitted (Variance) or if training error is high and testing error is also high then model is under fitted (bias). When model is over fitted then need to implement regularization technique (e.g. linear – lasso, ridge regression, Decision tree – pre-pruning, post-pruning, Knn – K value, Naïve Bays – Laplace, Neural network – dropout, drop connect, batch normalization, SVM –  kernel trick)

When data is balance then split the data training, validation and testing and here training is larger dataset then validation and testing. If data set is imbalance then need to use random resampling (over and under) by artificially increases training dataset. On random resampling by randomly partitioning data and for each partition implement the model and taking the average of accuracy. Under K fold cross validation creating K times cross dataset and creating model for every dataset and validate, after validation taking the average of accuracy of all model. There is more technique for imbalance dataset like SMOTH (synthetic minority oversampling technique), cluster based sampling, ensemble techniques e.g. Bagging, Boosting (Ada Boost, XGBoost).

2.4.3 Model evaluation and Tuning

On this stage model evaluate according to errors and accuracy and tune the error and accuracy for acceptable manner. For continuous outcome variable there are several way to measure the error like mean error, mean absolute deviation, Mean squared error, Root mean squared error, Mean percentage error and Mean absolute percentage error but more acceptable way is Mean absolute percentage error. For this continuous data if error is known then it is easy to find out the accuracy because accuracy and error combining value is one. The error function also called cost function or loss function.

For discrete output variable model, for evaluation and tuning need to use confusion matrix or cross table. From confusion matrix, by measuring accuracy, error, precision, sensitivity, specificity, F1 help to take decision about model fitness. ROC curve (Receiver operating characteristic curve), AUC curve (Area under the ROC curve) also evaluate the discrete output variable. AUC and ROC curve plot of sensitivity (true positive rate) vs 1-specificity (false positive rate).  Here sensitivity is a positive recall and  recall is basically out of all positive samples, how sample classifier able to identify. Specificity is negative recall here recall is out of all negative samples, how many sample classifier able to identify.  On AUC where more the area under the ROC is represent better accuracy. On ROC were step bend it’s indicate the cut off value.

2.4.4 Model Assessment

There is several ways to assess the model. First it is need to verify model performance and success according to desire achievement. It needs to identify the implemented model result according to accuracy where accuracy is repeatable and reproducible. It is also need to identify that the model is scalable, maintainable, robust and easy to deploy. On assessment identify that the model evaluation about satisfactory results (identify the precision, recall, sensitivity are balance) and meet business requirements.

2.5 Evaluation

On evaluation steps, all models which are built with same dataset, given a rank to find out the best model by assessing model quality of result and simplicity of algorithm and also cost of deployment. Evaluation part contains the data sufficiency report according to model result and also contain suggestion, feedback and recommendation from solutions team and SMEs (Subject matter experts) and record all these under OPA (organizational process assets).

2.6 Deployment

Deployment process needs to monitor under PEST (political economical social technological) changes within the organization and outside of the organization. PEST is similar to SWOT (strength weakness opportunity and thread) where SW represents the changes of internal and OT represents external changes.

On this deployment steps model should be seamless (like same environment, same result etc.) from development to production. Deployment plan contain the details of human resources, hardware, software requirements. Deployment plan also contain maintenance and monitoring plan by checking the model result and validity and if required then implement retire, replace and update plan.

3 Summaries

CRISP-DM implementation is costly and time consuming. But CRISP-DM methodology is an umbrella for data mining process. CRISP-DM has six phases, Business understanding, Data understanding, Modelling, Evaluation and Deployment. Every phase has several individual criteria, standard and process. CRISP-DM is Guideline for data mining process so if CRISP-DM is going to implement in any project it is necessary to follow each and every single guideline and maintain standard and criteria to get required result.

4 References

  1. Fu, Y., (1997), “Data Mining: Tasks, Techniques and Applications”, Potentials, IEEE, 16: 4, 18–20.
  2. Hand, D. J., (1999), “Statistics and Data Mining: Intersecting Disciplines”, ACM SIGKDD Explorations Newsletter, 1: 1, 16 – 19.
  3. Kantardzic, M., (2003), “Data Mining: Concepts, Models, Methods, and Algorithms” John Wiley and Sons, Inc., Hoboken, New Jersey
  4. Martínez-Plumed, F., Contreras-Ochando, L., Ferri, C., Orallo, J.H., Kull, M., Lachiche, N., Quintana, M.J.R. and Flach, P.A., 2019. CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Transactions on Knowledge and Data Engineering.
  5. Read, B.J., (1999), “Data Mining and Science? Knowledge discovery in science as opposed to business”, 12th ERCIM Workshop on Database Research.

Instructions on Transformer for people outside NLP field, but with examples of NLP

I found it quite difficult to explain mathematical details of long short-term memory (LSTM) in my previous article series. But when I was studying LSTM, a new promising algorithm was already attracting attentions. The algorithm is named Transformer. Its algorithm was a first announced in a paper named “Attention Is All You Need,” and it outperformed conventional translation algorithms with lower computational costs.

In this article series, I am going to provide explanations on minimum prerequisites for understanding deep learning in NLP (natural language process) tasks, but NLP is not the main focus of this article series, and actually I do not study in NLP field. I think Transformer is going to be a new major model of deep learning as well as CNN or RNN, and the model is now being applied in various fields.

Even though Transformer is going to be a very general deep learning model, I still believe it would be an effective way to understand Transformer with some NLP because language is a good topic we have in common. Unlike my previous article series, in which I tried to explain theoretical side of RNN as precisely as possible, in this article I am going to focus on practical stuff with my toy implementations of NLP tasks, largely based on Tensorflow official tutorial. But still I will do my best to make it as straightforward as possible to understand the architecture of Transformer with various original figures.

This series is going to be composed of the articles below.

If you are in the field and can read the codes in the official tutorial with no questions, this article series is not for you, but if you want to see how a Transformer works but do not want to go too much into details of NLP, this article would be for you.