Posts

Rethinking linear algebra part two: ellipsoids in data science

1 Our expedition of eigenvectors still continues

This article is still going to be about eigenvectors and PCA, and this article still will not cover LDA (linear discriminant analysis). Hereby I would like you to have more organic links of the data science ideas with eigenvectors.

In the second article, we have covered the following points:

  • You can visualize linear transformations with matrices by calculating displacement vectors, and they usually look like vectors swirling.
  • Diagonalization is finding a direction in which the displacement vectors do not swirl, and that is equal to finding new axis/basis where you can describe its linear transformations more straightforwardly. But we have to consider diagonalizability of the matrices.
  • In linear dimension reduction such as PCA or LDA, we mainly use types of matrices called positive definite or positive semidefinite matrices.

In the last article we have seen the following points:

  • PCA is an algorithm of calculating orthogonal axes along which data “swell” the most.
  • PCA is equivalent to calculating a new orthonormal basis for the data where the covariance between components is zero.
  • You can reduced the dimension of the data in the new coordinate system by ignoring the axes corresponding to small eigenvalues.
  • Covariance matrices enable linear transformation of rotation and expansion and contraction of vectors.

I emphasized that the axes are more important than the surface of the high dimensional ellipsoids, but in this article let’s focus more on the surface of ellipsoids, or I would rather say general quadratic curves. After also seeing how to draw ellipsoids on data, you would see the following points about PCA or eigenvectors.

  • Covariance matrices are real symmetric matrices, and also they are positive semidefinite. That means you can always diagonalize covariance matrices, and their eigenvalues are all equal or greater than 0.
  • PCA is equivalent to finding axes of quadratic curves in which gradients are biggest. The values of quadratic curves increases the most in those directions, and that means the directions describe great deal of information of data distribution.
  • Intuitively dimension reduction by PCA is equal to fitting a high dimensional ellipsoid on data and cutting off the axes corresponding to small eigenvalues.

Even if you already understand PCA to some extent, I hope this article provides you with deeper insight into PCA, and at least after reading this article, I think you would be more or less able to visually control eigenvectors and ellipsoids with the Numpy and Maplotlib libraries.

*Let me first introduce some mathematical facts and how I denote them throughout this article in advance. If you are allergic to mathematics, take it easy or please go back to my former articles.

  • Any quadratic curves can be denoted as \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0, where \boldsymbol{x}\in \mathbb{R}^D , A \in \mathbb{R}^{D\times D} \boldsymbol{b}\in \mathbb{R}^D s\in \mathbb{R}.
  • When I want to clarify dimensions of variables of quadratic curves, I denote parameters as A_D, b_D.
  • If a matrix A is a real symmetric matrix, there exist a rotation matrix U such that U^T A U = \Lambda, where \Lambda = diag(\lambda_1, \dots, \lambda_D) and U = (\boldsymbol{u}_1, \dots , \boldsymbol{u}_D). \boldsymbol{u}_1, \dots , \boldsymbol{u}_D are eigenvectors corresponding to \lambda_1, \dots, \lambda_D respectively.
  • PCA corresponds to a case of diagonalizing A where A is a covariance matrix of certain data. When I want to clarify that A is a covariance matrix, I denote it as A=\Sigma.
  • Importantly covariance matrices \Sigma are positive semidefinite and real symmetric, which means you can always diagonalize \Sigma and any of their engenvalues cannot be lower than 0.

*In the last article, I denoted the covariance of data as S, based on Pattern Recognition and Machine Learning by C. M. Bishop.

*Sooner or later you are going to see that I am explaining basically the same ideas from different points of view, using the topic of PCA. However I believe they are all important when you learn linear algebra for data science of machine learning. Even you have not learnt linear algebra or if you have to teach linear algebra, I recommend you to first take a review on the idea of diagonalization, like the second article. And you should be conscious that, in the context of machine learning or data science, only a very limited type of matrices are important, which I have been explaining throughout this article.

2 Rotation or projection?

In this section I am going to talk about basic stuff found in most textbooks on linear algebra. In the last article, I mentioned that if A is a real symmetric matrix, you can diagonalize A with a rotation matrix U = (\boldsymbol{u}_1 \: \cdots \: \boldsymbol{u}_D), such that U^{-1}AU = U^{T}AU =\Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}). I also explained that PCA is a case where A=\Sigma, that is, A is the covariance matrix of certain data. \Sigma is known to be positive semidefinite and real symmetric. Thus you can always diagonalize \Sigma and any of their engenvalues cannot be lower than 0.

I think we first need to clarify the difference of rotation and projection. In order to visualize the ideas, let’s consider a case of D=3. Assume that you have got an orthonormal rotation matrix U = (\boldsymbol{u}_1 \: \boldsymbol{u}_2 \: \boldsymbol{u}_3) which diagonalizes A. In the last article I said diagonalization is equivalent to finding new orthogonal axes formed by eigenvectors, and in the case of this section you got new orthonoramal basis (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) which are in red in the figure below. Projecting a point \boldsymbol{x} = (x, y, z) on the new orthonormal basis is simple: you just have to multiply \boldsymbol{x} with U^T. Let U^T \boldsymbol{x} be (x', y', z')^T, and then \left( \begin{array}{c} x' \\ y' \\ z' \end{array} \right) = U^T\boldsymbol{x} = \left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right). You can see x', y', z' are \boldsymbol{x} projected on \boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3 respectively, and the left side of the figure below shows the idea. When you replace the orginal orthonormal basis (\boldsymbol{e}_1, \boldsymbol{e}_2, \boldsymbol{e}_3) with (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) as in the right side of the figure below, you can comprehend the projection as a rotation from (x, y, z) to (x', y', z') by a rotation matrix U^T.

Next, let’s see what rotation is. In case of rotation, you should imagine that you rotate the point \boldsymbol{x} in the same coordinate system, rather than projecting to other coordinate system. You can rotate \boldsymbol{x} by multiplying it with U. This rotation looks like the figure below.

In the initial position, the edges of the cube are aligned with the three orthogonal black axes (\boldsymbol{e}_1,  \boldsymbol{e}_2 , \boldsymbol{e}_3), with one corner of the cube located at the origin point of those axes. The purple dot denotes the corner of the cube directly opposite the origin corner. The cube is rotated in three dimensions, with the origin corner staying fixed in place. After the rotation with a pivot at the origin, the edges of the cube are now aligned with a new set of orthogonal axes (\boldsymbol{u}_1,  \boldsymbol{u}_2 , \boldsymbol{u}_3), shown in red. You might understand that more clearly with an equation: U\boldsymbol{x} = (\boldsymbol{u}_1 \: \boldsymbol{u}_2 \: \boldsymbol{u}_3) \left( \begin{array}{c} x \\ y \\ z \end{array} \right) = x\boldsymbol{u}_1 + y\boldsymbol{u}_2 + z\boldsymbol{u}_3. In short this rotation means you keep relative position of \boldsymbol{x}, I mean its coordinates (x, y, z), in the new orthonormal basis. In this article, let me call this a “cube rotation.”

The discussion above can be generalized to spaces with dimensions higher than 3. When U \in \mathbb{R}^{D \times D} is an orthonormal matrix and a vector \boldsymbol{x} \in \mathbb{R}^D, you can project \boldsymbol{x} to \boldsymbol{x}' = U^T \boldsymbol{x}or rotate it to \boldsymbol{x}'' = U \boldsymbol{x}, where \boldsymbol{x}' = (x_{1}', \dots, x_{D}')^T and \boldsymbol{x}'' = (x_{1}'', \dots, x_{D}'')^T. In other words \boldsymbol{x} = U \boldsymbol{x}', which means you can rotate back \boldsymbol{x}' to the original point \boldsymbol{x} with the rotation matrix U.

I think you at least saw that rotation and projection are basically the same, and that is only a matter of how you look at the coordinate systems. But I would say the idea of projection is more important through out this article.

Let’s consider a function f(\boldsymbol{x}; A) = \boldsymbol{x}^T A \boldsymbol{x} = (\boldsymbol{x}, A \boldsymbol{x}), where A\in \mathbb{R}^{D\times D} is a real symmetric matrix. The distribution of f(\boldsymbol{x}; A) is quadratic curves whose center point covers the origin, and it is known that you can express this distribution in a much simpler way using eigenvectors. When you project this function on eigenvectors of A, that is when you substitute U \boldsymbol{x}' for \boldsymbol{x}, you get f = (\boldsymbol{x}, A \boldsymbol{x}) =(U \boldsymbol{x}', AU \boldsymbol{x}') = (\boldsymbol{x}')^T U^TAU \boldsymbol{x}' = (\boldsymbol{x}')^T \Lambda \boldsymbol{x}' = \lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2. You can always diagonalize real symmetric matrices, so the formula implies that the shapes of quadratic curves largely depend on eigenvectors. We are going to see this in detail in the next section.

*(\boldsymbol{x}, \boldsymbol{y}) denotes an inner product of \boldsymbol{x} and \boldsymbol{y}.

*We are going to see details of the shapes of quadratic “curves” or “functions” in the next section.

To be exact, you cannot naively multiply U or U^T for rotation. Let’s take a part of data I showed in the last article as an example. In the figure below, I projected data on the basis (\boldsymbol{u}_1,  \boldsymbol{u}_2 , \boldsymbol{u}_3).

You might have noticed that you cannot do a “cube rotation” in this case. If you make the coordinate system (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) with your left hand, like you might have done in science classes in school to learn Fleming’s rule, you would soon realize that the coordinate systems in the figure above do not match. You need to flip the direction of one axis to match them.

Mathematically, you have to consider the determinant of the rotation matrix U. You can do a “cube rotation” when det(U)=1, and in the case above det(U) was -1, and you needed to flip one axis to make the determinant 1. In the example in the figure below, you can match the basis. This also can be generalized to higher dimensions, but that is also beyond the scope of this article series. If you are really interested, you should prepare some coffee and snacks and textbooks on linear algebra, and some weekends.

When you want to make general ellipsoids in a 3d space on Matplotlib, you can take advantage of rotation matrices. You first make a simple ellipsoid symmetric about xyz axis using polar coordinates, and you can rotate the whole ellipsoid with rotation matrices. I made some simple modules for drawing ellipsoid. If you put in a rotation matrix which diagonalize the covariance matrix of data and a list of three radiuses \sqrt{\lambda_1}, \sqrt{\lambda_2}, \sqrt{\lambda_3}, you can rotate the original ellipsoid so that it fits the data well.

3 Types of quadratic curves.

*This article might look like a mathematical writing, but I would say this is more about computer science. Please tolerate some inaccuracy in terms of mathematics. I gave priority to visualizing necessary mathematical ideas in my article series. If you are not sure about details, please let me know.

In linear dimension reduction, or at least in this article series you mainly have to consider ellipsoids. However ellipsoids are just one type of quadratic curves. In the last article, I mentioned that when the center of a D dimensional ellipsoid is the origin point of a normal coordinate system, the formula of the surface of the ellipsoid is as follows: (\boldsymbol{x}, A\boldsymbol{x})=1, where A satisfies certain conditions. To be concrete, when (\boldsymbol{x}, A\boldsymbol{x})=1 is the surface of a ellipsoid, A has to be diagonalizable and positive definite.

*Real symmetric matrices are diagonalizable, and positive definite matrices have only positive eigenvalues. Covariance matrices \Sigma, whose displacement vectors I visualized in the last two articles, are known to be symmetric real matrices and positive semi-defintie. However, the surface of an ellipsoid which fit the data is \boldsymbol{x}^T \Sigma ^{-1} \boldsymbol{x} = const., not \boldsymbol{x}^T \Sigma \boldsymbol{x} = const..

*You have to keep it in mind that \boldsymbol{x} are all deviations.

*You do not have to think too much about what the “semi” of the term “positive semi-definite” means fow now.

As you could imagine, this is just one simple case of richer variety of graphs. Let’s consider a 3-dimensional space. Any quadratic curves in this space can be denoted as ax^2 + by^2 + cz^2 + dxy + eyz + fxz + px + qy + rz + s = 0, where at least one of a, b, c, d, e, f, p, q, r, s is not 0.  Let \boldsymbol{x} be (x, y, z)^T, then the quadratic curves can be simply denoted with a 3\times 3 matrix A and a 3-dimensional vector \boldsymbol{b} as follows: \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0, where A = \left( \begin{array}{ccc} a & \frac{d}{2} & \frac{f}{2} \\ \frac{d}{2} & b & \frac{e}{2} \\ \frac{f}{2} & \frac{e}{2} & c \end{array} \right), \boldsymbol{b} = \left( \begin{array}{c} \frac{p}{2} \\ \frac{q}{2} \\ \frac{r}{2} \end{array} \right). General quadratic curves are roughly classified into the 9 types below.

You can shift these quadratic curves so that their center points come to the origin, without rotation, and the resulting curves are as follows. The curves can be all denoted as \boldsymbol{x}^T A\boldsymbol{x}.

As you can see, A is a real symmetric matrix. As I have mentioned repeatedly, when all the elements of a D \times D symmetric matrix A are real values and its eigen values are \lambda_{i} (i=1, \dots , D), there exist orthogonal/orthonormal matrices U such that U^{-1}AU = \Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}). Hence, you can diagonalize the A = \left( \begin{array}{ccc} a & \frac{d}{2} & \frac{f}{2} \\ \frac{d}{2} & b & \frac{e}{2} \\ \frac{f}{2} & \frac{e}{2} & c \end{array} \right) with an orthogonal matrix U. Let U be an orthogonal matrix such that U^T A U = \left( \begin{array}{ccc} \alpha  & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{array} \right) =\left( \begin{array}{ccc} \lambda_1  & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{array} \right). After you apply rotation by U to the curves (a)” ~ (i)”, those curves are symmetrically placed about the xyz axes, and their center points still cross the origin. The resulting curves look like below. Or rather I should say you projected (a)’ ~ (i)’ on their eigenvectors.

In this article mainly (a)” , (g)”, (h)”, and (i)” are important. General equations for the curves is as follows

  • (a)”: \frac{x^2}{l^2} + \frac{y^2}{m^2} + \frac{z^2}{n^2} = 1
  • (g)”: z = \frac{x^2}{l^2} + \frac{y^2}{m^2}
  • (h)”: z = \frac{x^2}{l^2} - \frac{y^2}{m^2}
  • (i)”: z = \frac{x^2}{l^2}

, where l, m, n \in \mathbb{R}^+.

Even if this section has been puzzling to you, you just have to keep one point in your mind: we have been discussing general quadratic curves, but in PCA, you only need to consider a case where A is a covariance matrix, that is A=\Sigma. PCA corresponds to the case where you shift and rotate the curve (a) into (a)”. Subtracting the mean of data from each point of data corresponds to shifting quadratic curve (a) to (a)’. Calculating eigenvectors of A corresponds to calculating a rotation matrix U such that the curve (a)’ comes to (a)” after applying the rotation, or projecting curves on eigenvectors of \Sigma. Importantly we are only discussing the covariance of certain data, not the distribution of the data itself.

*Just in case you are interested in a little more mathematical sides: it is known that if you rotate all the points \boldsymbol{x} on the curve \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0 with the rotation matrix P, those points \boldsymbol{x} are mapped into a new quadratic curve \alpha x^2 + \beta y^2 + \gamma z^2 + \lambda x + \mu y + \nu z + \rho = 0. That means the rotation of the original quadratic curve with P (or rather rotating axes) enables getting rid of the terms xy, yz, zx. Also it is known that when \alpha ' \neq 0, with proper translations and rotations, the quadratic curve \alpha x^2 + \beta y^2 + \gamma z^2 + \lambda x + \mu y + \nu z + \rho = 0 can be mapped into one of the types of quadratic curves in the figure below, depending on coefficients of the original quadratic curve. And the discussion so far can be generalized to higher dimensional spaces, but that is beyond the scope of this article series. Please consult decent textbooks on linear algebra around you for further details.

4 Eigenvectors are gradients and sometimes variances.

In the second section I explained that you can express quadratic functions f(\boldsymbol{x}; A) = \boldsymbol{x}^T A \boldsymbol{x} in a very simple way by projecting \boldsymbol{x} on eigenvectors of A.

You can comprehend what I have explained in another way: eigenvectors, to be exact eigenvectors of real symmetric matrices A, are gradients. And in case of PCA, I mean when A=\Sigma eigenvalues are also variances. Before explaining what that means, let me explain a little of the totally common facts on mathematics. If you have variables \boldsymbol{x}\in \mathbb{R}^D, I think you can comprehend functions f(\boldysmbol{x}) in two ways. One is a normal “functions” f(\boldsymbol{x}), and the others are “curves” f(\boldsymbol{x}) = const.. “Functions” get an input \boldsymbol{x} and gives out an output f(\boldsymbol{x}), just as well as normal functions you would imagine. “Curves” are rather sets of \boldsymbol{x} \in \mathbb{R}^D such that f(\boldsymbol{x}) = const..

*Please assume that the terms “functions” and “curves” are my original words. I use them just in case I fail to use functions and curves properly.

The quadratic curves in the figure above are all “curves” in my term, which can be denoted as f(\boldsymbol{x}; A_3, \boldsymbol{b}_3)=const or f(\boldsymbol{x}; A_3)=const. However if you replace z of (g)”, (h)”, and (i)” with f, you can interpret the “curves” as “functions” which are denoted as f(\boldsymbol{x}; A_2). This might sounds too obvious to you, and my point is you can visualize how values of “functions” change only when the inputs are 2 dimensional.

When a symmetric 2\times 2 real matrices A_2 have two eigenvalues \lambda_1, \lambda_2, the distribution of quadratic curves can be roughly classified to the following three types.

  • (g): Both \lambda_1 and \lambda_2 are positive or negative.
  • (h): Either of \lambda_1 or \lambda_2 is positive and the other is negative.
  • (i): Either of \lambda_1 or \lambda_2 is 0 and the other is not.

The equations of (g)” , (h)”, and (i)” correspond to each type of f=(\boldsymbol{x}; A_2), and thier curves look like the three graphs below.

And in fact, when start from the origin and go in the direction of an eigenvector \boldsymbol{u}_i, \lambda_i is the gradient of the direction. You can see that more clearly when you restrict the distribution of f=(\boldsymbol{x}; A_2) to a unit circle. Like in the figure below, in case \lambda_1 = 7, \lambda_2 = 3, which is classified to (g), the distribution looks like the left side, and if you restrict the distribution in the unit circle, the distribution looks like a bowl like the middle and the right side. When you move in the direction of \boldsymbol{u}_1, you can climb the bowl as as high as \lambda_1, in \boldsymbol{u}_2 as high as \lambda_2.

Also in case of (h), the same facts hold. But in this case, you can also descend the curve.

*You might have seen the curve above in the context of optimization with stochastic gradient descent. The origin of the curve above is a notorious saddle point, where gradients are all 0 in any directions but not a local maximum or minimum. Points can be stuck in this point during optimization.

Especially in case of PCA, A is a covariance matrix, thus A=\Sigma. Eigenvalues of \Sigma are all equal to or greater than 0. And it is known that in this case \lambda_i is the variance of data projected on its corresponding eigenvector \boldsymbol{u}_i (i=0, \dots , D). Hence, if you project f(\boldsymbol{x}; \Sigma), quadratic curves formed by a covariance matrix \Sigma, on eigenvectors of \Sigma, you get f(\boldsymbol{x}; \Sigma) = ({x'}_1 \: \dots \: {x'}_D) (\lambda_1 {x'}_1 \: \dots \: \lambda_D {x'}_D)^t =\lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2.  This shows that you can re-weight ({x'}_1 \: \dots \: {x'}_D), the coordinates of data projected projected on eigenvectors of A, with \lambda_1, \dots, \lambda_D, which are variances ({x'}_1 \: \dots \: {x'}_D). As I mentioned in an example of data of exam scores in the last article, the bigger a variance \lambda_i is, the more the feature described by \boldsymbol{u}_i vary from sample to sample. In other words, you can ignore eigenvectors corresponding to small eigenvalues.

That is a great hint why principal components corresponding to large eigenvectors contain much information of the data distribution. And you can also interpret PCA as a “climbing” a bowl of f(\boldsymbol{x}; A_D), as I have visualized in the case of (g) type curve in the figure above.

*But as I have repeatedly mentioned, ellipsoid which fit data well isf(\boldsymbol{x}; \Sigma ^{-1}) =(\boldsymbol{x}')^T diag(\frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D})\boldsymbol{x}' = \frac{({x'}_{1})^2}{\lambda_1} + \cdots + \frac{({x'}_{D})^2}{\lambda_D} = const..

*You have to be careful that even if you slice a type (h) curve f(\boldsymbol{x}; A_D) with a place z=const. the resulting cross section does not fit the original data well because the equation of the cross section is \lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2 = const. The figure below is an example of slicing the same f(\boldsymbol{x}; A_2) as the one above with z=1, and the resulting cross section.

As we have seen, \lambda_i, the eigenvalues of the covariance matrix of data are variances or data when projected on it eigenvectors. At the same time, when you fit an ellipsoid on the data, \sqrt{\lambda_i} is the radius of the ellipsoid corresponding to \boldsymbol{u}_i. Thus ignoring data projected on eigenvectors corresponding to small eigenvalues is equivalent to cutting of the axes of the ellipsoid with small radiusses.

I have explained PCA in three different ways over three articles.

  • The second article: I focused on what kind of linear transformations convariance matrices \Sigma enable, by visualizing displacement vectors. And those vectors look like swirling and extending into directions of eigenvectors of \Sigma.
  • The third article: We directly found directions where certain data distribution “swell” the most, to find that data swell the most in directions of eigenvectors.
  • In this article, we have seen PCA corresponds to only one case of quadratic functions, where the matrix A is a covariance matrix. When you go in the directions of eigenvectors corresponding to big eigenvalues, the quadratic function increases the most. Also that means data samples have bigger variances when projected on the eigenvectors. Thus you can cut off eigenvectors corresponding to small eigenvectors because they retain little information about data, and that is equivalent to fitting an ellipsoid on data and cutting off axes with small radiuses.

*Let A be a covariance matrix, and you can diagonalize it with an orthogonal matrix U as follow: U^{T}AU = \Lambda, where \Lambda = diag(\lambda_1, \dots, \lambda_D). Thus A = U \Lambda U^{T}. U is a rotation, and multiplying a \boldsymbol{x} with \Lambda means you multiply each eigenvalue to each element of \boldsymbol{x}. At the end U^T enables the reverse rotation.

If you get data like the left side of the figure below, most explanation on PCA would just fit an oval on this data distribution. However after reading this articles series so far, you would have learned to see PCA from different viewpoints like at the right side of the figure below.

 

5 Ellipsoids in Gaussian distributions.

I have explained that if the covariance of a data distribution is \boldsymbol{\Sigma}, the ellipsoid which fits the distribution the best is \bigl((\boldsymbol{x} - \boldsymbol{\mu}), \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu})\bigr) = 1. You might have seen the part \bigl((\boldsymbol{x} - \boldsymbol{\mu}), \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu})\bigr) = (\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) somewhere else. It is the exponent of general Gaussian distributions: \mathcal{N}(\boldsymbol{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\{ -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) \}.  It is known that the eigenvalues of \Sigma ^{-1} are \frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D}, and eigenvectors corresponding to each eigenvalue are also \boldsymbol{u}_1, \dots, \boldsymbol{u}_D respectively. Hence just as well as what we have seen, if you project (\boldsymbol{x} - \boldsymbol{\mu}) on each eigenvector of \Sigma ^{-1}, we can convert the exponent of the Gaussian distribution.

Let -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) be \boldsymbol{y} and U ^{-1} \boldsymbol{y}= U^{T} \boldsymbol{y} be \boldsymbol{y}', where U=(\boldsymbol{u}_1 \: \dots \: \boldsymbol{u}_D). Just as we have seen, (\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) =\boldsymbol{y}^T\Sigma^{-1} \boldsymbol{y} =(U\boldsymbol{y}')^T \Sigma^{-1} U\boldsymbol{y}' =((\boldsymbol{y}')^T U^T \Sigma^{-1} U\boldsymbol{y}' = (\boldsymbol{y}')^T diag(\frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D}) \boldsymbol{y}' = \frac{({y'}_{1})^2}{\lambda_1} + \cdots + \frac{({y'}_{D})^2}{\lambda_D}. Hence \mathcal{N}(\boldsymbol{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\{ -\frac{1}{2}(\boldsymbol{y}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{y}) \} =  \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\{ -\frac{1}{2}(\frac{({y'}_{1})^2}{\lambda_1} + \cdots + \frac{({y'}_{D})^2}{\lambda_D} ) \} =\frac{1}{(2\pi)^{1/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\biggl( -\frac{1}{2} \frac{({y'}_{1})^2}{\lambda_1} \biggl) \cdots \frac{1}{(2\pi)^{1/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\biggl( -\frac{1}{2}\frac{({y'}_{D})^2}{\lambda_D} \biggl).

*To be mathematically exact about changing variants of normal distributions, you have to consider for example Jacobian matrices.

This results above demonstrate that, by projecting data on the eigenvectors of its covariance matrix, you can factorize the original multi-dimensional Gaussian distribution into a product of Gaussian distributions which are irrelevant to each other. However, at the same time, that is the potential limit of approximating data with PCA. This idea is going to be more important when you think about more probabilistic ways to handle PCA, which is more robust to lack of data.

I have explained PCA over 3 articles from various viewpoints. If you have been patient enough to read my article series, I think you have gained some deeper insight into not only PCA, but also linear algebra, and that should be helpful when you learn or teach data science. I hope my codes also help you. In fact these are not the only topics about PCA. There are a lot of important PCA-like algorithms.

In fact our expedition of ellipsoids, or PCA still continues, just as Star Wars series still continues. Especially if I have to explain an algorithm named probabilistic PCA, I need to explain the “Bayesian world” of machine learning. Most machine learning algorithms covered by major introductory textbooks tend to be too deterministic and dependent on the size of data. Many of those algorithms have another “parallel world,” where you can handle inaccuracy in better ways. I hope I can also write about them, and I might prepare another trilogy for such PCA. But I will not disappoint you, like “The Phantom Menace.”

Appendix: making a model of a bunch of grape with ellipsoid berries.

If you can control quadratic curves, reshaping and rotating them, you can make a model of a grape of olive bunch on Matplotlib. I made a program of making a model of a bunch of berries on Matplotlib using the module to draw ellipsoids which I introduced earlier. You can check the codes in this page.

*I have no idea how many people on this earth are in need of making such models.

I made some modules so that you can see the grape bunch from several angles. This might look very simple to you, but the locations of berries are organized carefully so that it looks like they are placed around a stem and that the berries are not too close to each other.

 

The programming code I created for this article is completly available here.

[Refereces]

[1]C. M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 78-83, 559-577

[2]「理工系新課程 線形代数 基礎から応用まで」, 培風館、(2017)

[3]「これなら分かる 最適化数学 基礎原理から計算手法まで」, 金谷健一著、共立出版, (2019), pp. 17-49

[4]「これなら分かる 応用数学教室 最小二乗法からウェーブレットまで」, 金谷健一著、共立出版, (2019), pp.165-208

[5] 「サボテンパイソン 」
https://sabopy.com/

 

Multi-head attention mechanism: “queries”, “keys”, and “values,” over and over again

This is the third article of my article series named “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

In the last article, I explained how attention mechanism works in simple seq2seq models with RNNs, and it basically calculates correspondences of the hidden state at every time step, with all the outputs of the encoder. However I would say the attention mechanisms of RNN seq2seq models use only one standard for comparing them. Using only one standard is not enough for understanding languages, especially when you learn a foreign language. You would sometimes find it difficult to explain how to translate a word in your language to another language. Even if a pair of languages are very similar to each other, translating them cannot be simple switching of vocabulary. Usually a single token in one language is related to several tokens in the other language, and vice versa. How they correspond to each other depends on several criteria, for example “what”, “who”, “when”, “where”, “why”, and “how”. It is easy to imagine that you should compare tokens with several criteria.

Transformer model was first introduced in the original paper named “Attention Is All You Need,” and from the title you can easily see that attention mechanism plays important roles in this model. When you learn about Transformer model, you will see the figure below, which is used in the original paper on Transformer.  This is the simplified overall structure of one layer of Transformer model, and you stack this layer N times. In one layer of Transformer, there are three multi-head attention, which are displayed as boxes in orange. These are the very parts which compare the tokens on several standards. I made the head article of this article series inspired by this multi-head attention mechanism.

The figure below is also from the original paper on Transfromer. If you can understand how multi-head attention mechanism works with the explanations in the paper, and if you have no troubles understanding the codes in the official Tensorflow tutorial, I have to say this article is not for you. However I bet that is not true of majority of people, and at least I need one article to clearly explain how multi-head attention works. Please keep it in mind that this article covers only the architectures of the two figures below. However multi-head attention mechanisms are crucial components of Transformer model, and throughout this article, you would not only see how they work but also get a little control over it at an implementation level.

1 Multi-head attention mechanism

When you learn Transformer model, I recommend you first to pay attention to multi-head attention. And when you learn multi-head attentions, before seeing what scaled dot-product attention is, you should understand the whole structure of multi-head attention, which is at the right side of the figure above. In order to calculate attentions with a “query”, as I said in the last article, “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” Sooner or later, you will notice I would be just repeating these phrases over and over again throughout this article, in several ways.

*Even if you are not sure what “reweighting” means in this context, please keep reading. I think you would little by little see what it means especially in the next section.

The overall process of calculating multi-head attention, displayed in the figure above, is as follows (Please just keep reading. Please do not think too much.): first you split the V: “values”, K: “keys”, and Q: “queries”, and second you transform those divided “values”, “keys”, and “queries” with densely connected layers (“Linear” in the figure). Next you calculate attention weights and reweight the “values” and take the summation of the reiweighted “values”, and you concatenate the resulting summations. At the end you pass the concatenated “values” through another densely connected layers. The mechanism of scaled dot-product attention is just a matter of how to concretely calculate those attentions and reweight the “values”.

*In the last article I briefly mentioned that “keys” and “queries” can be in the same language. They can even be the same sentence in the same language, and in this case the resulting attentions are called self-attentions, which we are mainly going to see. I think most people calculate “self-attentions” unconsciously when they speak. You constantly care about what “she”, “it” , “the”, or “that” refers to in you own sentence, and we can say self-attention is how these everyday processes is implemented.

Let’s see the whole process of calculating multi-head attention at a little abstract level. From now on, we consider an example of calculating multi-head self-attentions, where the input is a sentence “Anthony Hopkins admired Michael Bay as a great director.” In this example, the number of tokens is 9, and each token is encoded as a 512-dimensional embedding vector. And the number of heads is 8. In this case, as you can see in the figure below, the input sentence “Anthony Hopkins admired Michael Bay as a great director.” is implemented as a 9\times 512 matrix. You first split each token into 512/8=64 dimensional, 8 vectors in total, as I colored in the figure below. In other words, the input matrix is divided into 8 colored chunks, which are all 9\times 64 matrices, but each colored matrix expresses the same sentence. And you calculate self-attentions of the input sentence independently in the 8 heads, and you reweight the “values” according to the attentions/weights. After this, you stack the sum of the reweighted “values”  in each colored head, and you concatenate the stacked tokens of each colored head. The size of each colored chunk does not change even after reweighting the tokens. According to Ashish Vaswani, who invented Transformer model, each head compare “queries” and “keys” on each standard. If the a Transformer model has 4 layers with 8-head multi-head attention , at least its encoder has 4\times 8 = 32 heads, so the encoder learn the relations of tokens of the input on 32 different standards.

I think you now have rough insight into how you calculate multi-head attentions. In the next section I am going to explain the process of reweighting the tokens, that is, I am finally going to explain what those colorful lines in the head image of this article series are.

*Each head is randomly initialized, so they learn to compare tokens with different criteria. The standards might be straightforward like “what” or “who”, or maybe much more complicated. In attention mechanisms in deep learning, you do not need feature engineering for setting such standards.

2 Calculating attentions and reweighting “values”

If you have read the last article or if you understand attention mechanism to some extent, you should already know that attention mechanism calculates attentions, or relevance between “queries” and “keys.” In the last article, I showed the idea of weights as a histogram, and in that case the “query” was the hidden state of the decoder at every time step, whereas the “keys” were the outputs of the encoder. In this section, I am going to explain attention mechanism in a more abstract way, and we consider comparing more general “tokens”, rather than concrete outputs of certain networks. In this section each [ \cdots ] denotes a token, which is usually an embedding vector in practice.

Please remember this mantra of attention mechanism: “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” The figure below shows an overview of a case where “Michael” is a query. In this case you compare the query with the “keys”, that is, the input sentence “Anthony Hopkins admired Michael Bay as a great director.” and you get the histogram of attentions/weights. Importantly the sum of the weights 1. With the attentions you have just calculated, you can reweight the “values,” which also denote the same input sentence. After that you can finally take a summation of the reweighted values. And you use this summation.

*I have been repeating the phrase “reweighting ‘values’  with attentions,”  but you in practice calculate the sum of those reweighted “values.”

Assume that compared to the “query”  token “Michael”, the weights of the “key” tokens “Anthony”, “Hopkins”, “admired”, “Michael”, “Bay”, “as”, “a”, “great”, and “director.” are respectively 0.06, 0.09, 0.05, 0.25, 0.18, 0.06, 0.09, 0.06, 0.15. In this case the sum of the reweighted token is 0.06″Anthony” + 0.09″Hopkins” + 0.05″admired” + 0.25″Michael” + 0.18″Bay” + 0.06″as” + 0.09″a” + 0.06″great” 0.15″director.”, and this sum is the what wee actually use.

*Of course the tokens are embedding vectors in practice. You calculate the reweighted vector in actual implementation.

You repeat this process for all the “queries.”  As you can see in the figure below, you get summations of 9 pairs of reweighted “values” because you use every token of the input sentence “Anthony Hopkins admired Michael Bay as a great director.” as a “query.” You stack the sum of reweighted “values” like the matrix in purple in the figure below, and this is the output of a one head multi-head attention.

3 Scaled-dot product

This section is a only a matter of linear algebra. Maybe this is not even so sophisticated as linear algebra. You just have to do lots of Excel-like operations. A tutorial on Transformer by Jay Alammar is also a very nice study material to understand this topic with simpler examples. I tried my best so that you can clearly understand multi-head attention at a more mathematical level, and all you need to know in order to read this section is how to calculate products of matrices or vectors, which you would see in the first some pages of textbooks on linear algebra.

We have seen that in order to calculate multi-head attentions, we prepare 8 pairs of “queries”, “keys” , and “values”, which I showed in 8 different colors in the figure in the first section. We calculate attentions and reweight “values” independently in 8 different heads, and in each head the reweighted “values” are calculated with this very simple formula of scaled dot-product: Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) =softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})\boldsymbol{V}. Let’s take an example of calculating a scaled dot-product in the blue head.

At the left side of the figure below is a figure from the original paper on Transformer, which explains one-head of multi-head attention. If you have read through this article so far, the figure at the right side would be more straightforward to understand. You divide the input sentence into 8 chunks of matrices, and you independently put those chunks into eight head. In one head, you convert the input matrix by three different fully connected layers, which is “Linear” in the figure below, and prepare three matrices Q, K, V, which are “queries”, “keys”, and “values” respectively.

*Whichever color attention heads are in, the processes are all the same.

*You divide \frac{\boldsymbol{Q} \boldsymbol{K}} ^T by \sqrt{d}_k in the formula. According to the original paper, it is known that re-scaling \frac{\boldsymbol{Q} \boldsymbol{K}} ^T by \sqrt{d}_k is found to be effective. I am not going to discuss why in this article.

As you can see in the figure below, calculating Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) is virtually just multiplying three matrices with the same size (Only K is transposed though). The resulting 9\times 64 matrix is the output of the head.

softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}) is calculated like in the figure below. The softmax function regularize each row of the re-scaled product \frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}, and the resulting 9\times 9 matrix is a kind a heat map of self-attentions.

The process of comparing one “query” with “keys” is done with simple multiplication of a vector and a matrix, as you can see in the figure below. You can get a histogram of attentions for each query, and the resulting 9 dimensional vector is a list of attentions/weights, which is a list of blue circles in the figure below. That means, in Transformer model, you can compare a “query” and a “key” only by calculating an inner product. After re-scaling the vectors by dividing them with \sqrt{d_k} and regularizing them with a softmax function, you stack those vectors, and the stacked vectors is the heat map of attentions.

You can reweight “values” with the heat map of self-attentions, with simple multiplication. It would be more straightforward if you consider a transposed scaled dot-product \boldsymbol{V}^T \cdot softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})^T. This also should be easy to understand if you know basics of linear algebra.

One column of the resulting matrix (\boldsymbol{V}^T \cdot softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})^T) can be calculated with a simple multiplication of a matrix and a vector, as you can see in the figure below. This corresponds to the process or “taking a summation of reweighted ‘values’,” which I have been repeating. And I would like you to remember that you got those weights (blue) circles by comparing a “query” with “keys.”

Again and again, let’s repeat the mantra of attention mechanism together: “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” If you have been patient enough to follow my explanations, I bet you have got a clear view on how multi-head attention mechanism works.

We have been seeing the case of the blue head, but you can do exactly the same procedures in every head, at the same time, and this is what enables parallelization of multi-head attention mechanism. You concatenate the outputs of all the heads, and you put the concatenated matrix through a fully connected layers.

If you are reading this article from the beginning, I think this section is also showing the same idea which I have repeated, and I bet more or less you no have clearer views on how multi-head attention mechanism works. In the next section we are going to see how this is implemented.

4 Tensorflow implementation of multi-head attention

Let’s see how multi-head attention is implemented in the Tensorflow official tutorial. If you have read through this article so far, this should not be so difficult. I also added codes for displaying heat maps of self attentions. With the codes in this Github page, you can display self-attention heat maps for any input sentences in English.

The multi-head attention mechanism is implemented as below. If you understand Python codes and Tensorflow to some extent, I think this part is relatively easy.  The multi-head attention part is implemented as a class because you need to train weights of some fully connected layers. Whereas, scaled dot-product is just a function.

*I am going to explain the create_padding_mask() and create_look_ahead_mask() functions in upcoming articles. You do not need them this time.

Let’s see a case of using multi-head attention mechanism on a (1, 9, 512) sized input tensor, just as we have been considering in throughout this article. The first axis of (1, 9, 512) corresponds to the batch size, so this tensor is virtually a (9, 512) sized tensor, and this means the input is composed of 9 512-dimensional vectors. In the results below, you can see how the shape of input tensor changes after each procedure of calculating multi-head attention. Also you can see that the output of the multi-head attention is the same as the input, and you get a 9\times 9 matrix of attention heat maps of each attention head.

I guess the most complicated part of this implementation above is the split_head() function, especially if you do not understand tensor arithmetic. This part corresponds to splitting the input tensor to 8 different colored matrices as in one of the figures above. If you cannot understand what is going on in the function, I recommend you to prepare a sample tensor as below.

This is just a simple (1, 9, 512) sized tensor with sequential integer elements. The first row (1, 2, …., 512) corresponds to the first input token, and (4097, 4098, … , 4608) to the last one. You should try converting this sample tensor to see how multi-head attention is implemented. For example you can try the operations below.

These operations correspond to splitting the input into 8 heads, whose sizes are all (9, 64). And the second axis of the resulting (1, 8, 9, 64) tensor corresponds to the index of the heads. Thus sample_sentence[0][0] corresponds to the first head, the blue 9\times 64 matrix. Some Tensorflow functions enable linear calculations in each attention head, independently as in the codes below.

Very importantly, we have been only considering the cases of calculating self attentions, where all “queries”, “keys”, and “values” come from the same sentence in the same language. However, as I showed in the last article, usually “queries” are in a different language from “keys” and “values” in translation tasks, and “keys” and “values” are in the same language. And as you can imagine, usualy “queries” have different number of tokens from “keys” or “values.” You also need to understand this case, which is not calculating self-attentions. If you have followed this article so far, this case is not that hard to you. Let’s briefly see an example where the input sentence in the source language is composed 9 tokens, on the other hand the output is composed 12 tokens.

As I mentioned, one of the outputs of each multi-head attention class is 9\times 9 matrix of attention heat maps, which I displayed as a matrix composed of blue circles in the last section. The the implementation in the Tensorflow official tutorial, I have added codes to display actual heat maps of any input sentences in English.

*If you want to try displaying them by yourself, download or just copy and paste codes in this Github page. Please maker “datasets” directory in the same directory as the code. Please download “spa-eng.zip” from this page, and unzip it. After that please put “spa.txt” on the “datasets” directory. Also, please download the “checkpoints_en_es” folder from this link, and place the folder in the same directory as the file in the Github page. In the upcoming articles, you would need similar processes to run my codes.

After running codes in the Github page, you can display heat maps of self attentions. Let’s input the sentence “Anthony Hopkins admired Michael Bay as a great director.” You would get a heat maps like this.

In fact, my toy implementation cannot handle proper nouns such as “Anthony” or “Michael.” Then let’s consider a simple input sentence “He admired her as a great director.” In each layer, you respectively get 8 self-attention heat maps.

I think we can see some tendencies in those heat maps. The heat maps in the early layers, which are close to the input, are blurry. And the distributions of the heat maps come to concentrate more or less diagonally. At the end, presumably they learn to pay attention to the start and the end of sentences.

You have finally finished reading this article. Congratulations.

You should be proud of having been patient, and you passed the most tiresome part of learning Transformer model. You must be ready for making a toy English-German translator in the upcoming articles. Also I am sure you have understood that Michael Bay is a great director, no matter what people say.

*Hannibal Lecter, I mean Athony Hopkins, also wrote a letter to the staff of “Breaking Bad,” and he told them the tv show let him regain his passion. He is a kind of admiring around, and I am a little worried that he might be getting senile. He played a role of a father forgetting his daughter in his new film “The Father.” I must see it to check if that is really an acting, or not.

[References]

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need” (2017)

[2] “Transformer model for language understanding,” Tensorflow Core
https://www.tensorflow.org/overview

[3] “Neural machine translation with attention,” Tensorflow Core
https://www.tensorflow.org/tutorials/text/nmt_with_attention

[4] Jay Alammar, “The Illustrated Transformer,”
http://jalammar.github.io/illustrated-transformer/

[5] “Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention,” stanfordonline, (2019)
https://www.youtube.com/watch?v=5vcj8kSwBCY

[6]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 191-193

[7]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)
https://www.youtube.com/watch?v=XXtpJxZBa2c

[8]Rosemary Rossi, “Anthony Hopkins Compares ‘Genius’ Michael Bay to Spielberg, Scorsese,” yahoo! entertainment, (2017)
https://www.yahoo.com/entertainment/anthony-hopkins-transformers-director-michael-bay-guy-genius-010058439.html

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Data Mining Process flow – Easy Understanding

1 Overview

Development of computer processing power, network and automated software completely change and give new concept of each business. And data mining play the vital part to solve, finding the hidden patterns and relationship from large dataset with business by using sophisticated data analysis tools like methodology, method, process flow etc.

On this paper, proposed a process flow followed CRISP-DM methodology and has six steps where data understanding does not considered.

Phase of new process flow given below:-

Phase 1: Involved with collection, outliner treatment, imputation, transformation, scaling, and partition dataset in to two sub-frames (Training and Testing). Here as an example for outliner treatment, imputation, transformation, scaling consider accordingly Z score, mean, One hot encoding and Min Max Scaler.

Phase 2: On this Phase training and testing data balance with same balancing algorithm but separately. As an example here SMOTE (synthetic minority oversampling technique) is considered.

Phase 3: This phase involved with reduction, selection, aggregation, extraction. But here for an example considering same feature reduction algorithm (LDA -Linear Discriminant analysis) on training and testing data set separately.

Phase 4: On this Phase Training data set again partition into two more set (Training and Validation).

Phase 5: This Phase considering several base algorithms as a base model like CNN, RNN, Random forest, MLP, Regression, Ensemble method. This phase also involve to find out best hyper parameter and sub-algorithm for each base algorithm. As an example on this paper consider two class classification problems and also consider Random forest (Included CART – Classification and Regression Tree and GINI index impurity) and MLP classifier (Included (Relu, Sigmoid, binary cross entropy, Adam – Adaptive Moment Estimation) as base algorithms.

Phase 6: First, Prediction with validation data then evaluates with Test dataset which is fully unknown for these (Random forest, MLP classifier) two base algorithms. Then calculate the confusion matrix, ROC, AUC to find the best base algorithm.

New method from phase 1 to phase 4 followed CRISP-DM methodology steps such as data collection, data preparation then phase 5 followed modelling and phase 6 followed evaluation and implementation steps.

Structure of proposed process flow for two class problem combined with algorithm and sub-algorithm display on figure – 1.

These articles mainly focus to describe all algorithms which are going to implementation for better understanding.

 

 

Data Mining Process Flow

Figure 1 – Data Mining Process Flow

2 Phase 1: Outlier treatment, Transform, Scaling, Imputation

This phase involved with outlier treatment, imputation, scaling, and transform data.

2.1 Outliner treatment: – Z score

Outlier is a data point which lies far from all other data point in a data set. Outlier need to treat because it may bias the entire result. Outlier treatment with Z score is a common technique.  Z score is a standard score in statistics.  Z score provides information about data value is smaller or grater then mean that means how many standard deviations away from the mean value. Z score equation display below:

Z = \frac{(x - \mu)}{\sigma}

Here x = data point
σ = Standard deviation
μ = mean value

Equation- 1 Z-Score

In a normal distribution Z score represent 68% data lies on +/- 1, 95% data point lies on +/- 2, 99.7% data point lies on +/- 3 standard deviation.

2.2 Imputation data: – mean

Imputation is a way to handle missing data by replacing substituted value. There are many imputation technique represent like mean, median, mode, k-nearest neighbours. Mean imputation is the technique to replacing missing information with mean value. On the mean imputation first calculate the particular features mean value and then replace the missing value with mean value. The next equation displays the mean calculation:

\mu = \frac{(\sum x)}{n}

Here x = value of each point
n = number of values
μ = mean value

Equation- 2 Mean

2.3 Transform: – One hot encoding

Encoding is a pre-processing technique which represents data in such a way that computer can understand.  For understanding of machine learning algorithm categorical columns convert to numerical columns, this process called categorical encoding. There are multiple way to handle categorical variable but most widely used techniques are label encoding and one host encoding. On label encoding give a numeric (integer number) for each category. Suppose there are 3 categories of foods like apples, orange, banana. When label encoding is used then 3 categories will get a numerical value like apples = 1, banana = 2 and orange = 3. But there is very high probability that machine learning model can capture the relationship in between categories such as apple < banana < orange or calculate average across categories like 1 +3 = 4 / 2 = 2 that means model can understand average of apple and orange together is banana which is not acceptable because model correlation calculation is wrong. For solving this problem one hot encoding appear. The following table displays the label encoding is transformed into one hot encoding.

Label Encoding and One-Hot-Encoding

Table- 1 Encoding example

On hot encoding categorical value split into columns and each column contains 0 or 1 according to columns placement.

2.4 Scaling data: – Min Max Scaler

Feature scaling method is standardized or normalization the independent variable that means it is used to scale the data in a particular range like -1 to +1 or depending on algorithm. Generally normalization used where data distribution does not follow Gaussian distribution and standardization used where data distribution follow Gaussian distribution. On standardization techniques transform data values are cantered around the mean and unit is standard deviation. Formula for standardization given below:

Standardization X = \frac{(X - \mu)}{\sigma}

Equation-3 Equations for Standardization

X represent the feature value, µ represent mean of the feature value and σ represent standard deviation of the feature value. Standardized data value does not restrict to a particular range.

Normalization techniques shifted and rescaled data value range between 0 and 1. Normalization techniques also called Min-Max scaling. Formula for normalization given below:

Normalization X = \frac{(X - X_{min})}{X_{max} - X_{min}}

Equation – 4 Equations for Normalization

Above X, Xmin, Xmax are accordingly feature values, feature minimum value and feature maximum value. On above formula when X is minimums then numerator will be 0 (  is 0) or if X is maximums then the numerator is equal to the denominator (  is 1). But when X data value between minimum and maximum then  is between 0 and 1. If ranges value of data does not normalized then bigger range can influence the result.

3 Phase 2: – Balance Data

3.1 SMOTE

SMOTE (synthetic minority oversampling technique) is an oversampling technique where synthetic observations are created based on existing minority observations. This technique operates in feature space instead of data space. Under SMOTE each minority class observation calculates k nearest neighbours and randomly chose the neighbours depending on over-sampling requirements. Suppose there are 4 data point on minority class and 10 data point on majority class. For this imbalance data set, balance by increasing minority class with synthetic data point.   SMOTE creating synthetic data point but it is necessary to consider k nearest neighbours first. If k = 3 then SMOTE consider 3 nearest neighbours. Figure-2 display SMOTE with k = 3 and x = x1, x2, x3, x4 data point denote minority class. And all circles represent majority class.

SMOTE Example

Figure- 2 SMOTE example

 

4 Phase 3: – Feature Reduction

4.1 LDA

LDA stands for Linear Discriminant analysis supervised technique are commonly used for classification problem.  On this feature reduction account continuous independent variable and output categorical variable. It is multivariate analysis technique. LDA analyse by comparing mean of the variables.  Main goal of LDA is differentiate classes in low dimension space. LDA is similar to PCA (Principal component analysis) but in addition LDA maximize the separation between multiple classes. LDA is a dimensionality reduction technique where creating synthetic feature from linear combination of original data set then discard less important feature. LDA calculate class variance, it maximize between class variance and minimize within class variance. Table-2 display the process steps of LDA.

LDA Process

Table- 2 LDA process

5 Phase 5: – Base Model

Here we consider two base model ensemble random forest and MLP classifier.

5.1 Random Forest

Random forest is an ensemble (Bagging) method where group of weak learner (decision tree) come together to form a strong leaner. Random forest is a supervised algorithm which is used for regression and classification problem. Random forests create several decisions tree for predictions and provide solution by voting (classification) or mean (regression) value. Working process of Random forest given below (Table -3).

Random Forest

Table-3 Random Forest process

When training a Random forest root node contains a sample of bootstrap dataset and the feature is as same as original dataset. Suppose the dataset is D and contain d record and m number of columns. From the dataset D random forest first randomly select sample of rows (d) with replacement and sample of features (n) and give it to the decision tree. Suppose Random forest created several decision trees like T1, T2, T3, T4 . . . Tn. Then randomly selected dataset D = d + n is given to the decision tree T1, T2, T3, T4 . . . Tn where D < D, m > n and d > d.  After taking the dataset decision tree give the prediction for binary classification 1 or 0 then aggregating the decision and select the majority voted result. Figure-3 describes the structure of random forest process.

Random Forest Process

Figure- 3 Random Forest process

On Random forest base learner Decision Tree grows complete depth where bias (properly train on training dataset) is low and variance is high (when implementing test data give big error) called overfitting. On Random forest using multiple decision trees where each Decision tree is high variance but when combining all decision trees with the respect of majority vote then high variance converted into low variance because using row and feature sampling with replacement and taking the majority vote where decision is not depend on one decision tree.

CART (Classification and Regression Tree) is binary segmentation technique. CART is a Gini’s impurity index based classical algorithm to split a dataset and build a decision tree. By splitting a selected dataset CART created two child nodes repeatedly and builds a tree until the data no longer be split. There are three steps CART algorithm follow:

  1. Find best split for each features. For each feature in binary split make two groups of the ordered classes. That means possibility of split for k classes is k-1. Find which split is maximized and contain best splits (one for each feature) result.
  2. Find the best split for nodes. From step 1 find the best one split (from all features) which maximized the splitting criterion.
  3. Split the best node from step 2 and repeat from step 1 until fulfil the stopping criterion.

 

For splitting criteria CART use GINI index impurity algorithm to calculate the purity of split in a decision tree. Gini impurity randomly classified the labels with the same distribution in the dataset. A Gini impurity of 0 (lowest) is the best possible impurity and it is achieve when everything is in a same class. Gini index varies from 0 to 1. 0 indicate the purity of class where only one class exits or all element under a specific class. 1 indicates that elements are randomly distributed across various classes. And 0.5 indicate equal elements distributed over classes. Gini index (GI) described by mathematically that sum of squared of probabilities of each class (pi) deducted from one (Equation-5).

Gini Impurities

Equation – 5 Gini impurities

Here (Equation-5) pi represent the probability (probability of p+ or yes and probability of p- or no) of distinct class with classified element. Suppose randomly selected feature (a1) which has 8 yes and 4 no. After the split right had side (b1 on equation-6) has 4 yes and 4 no and left had side (b2 on equation – 7) has 4 yes and 0 no. here b2 is a pure split (leaf node) because only one class yes is present. By using the GI (Gini index) formula for b1 and b2:-

Equation- 6 & 7 – Gini Impurity b1 & Gini Impurity b2

Here for b1 value 0.5 indicates that equal element (yes and no) distribute over classes which is not pure split. And b2 value 0 indicates pure split. On GINI impurity indicates that when probability (yes or no) increases GINI value also increases. Here 0 indicate pure split and .5 indicate equal split that means worst situation. After calculating the GINI index for b1 and b2 now calculate the reduction of impurity for data point a1. Here total yes 8 (b1 and b2 on Equation – 8) and total no 4 (b1) so total data is 12 on a1. Below display the weighted GINI index for feature a1:

Total data point on b1 with Gini index (m) = 8/12 * 0.5 = 0.3333

Total data point on b2 with Gini index (n) = 4/12 * 0 = 0

Weighted Gini index for feature a1 = m + n = 0.3333

Equation- 8 Gini Impurity b1 & b2

After computing the weighted Gini value for every feature on a dataset taking the highest value feature as first node and split accordingly in a decision tree. Gini is less costly to compute.

5.2 Multilayer Perceptron Classifier (MLP Classifier)

Multilayer perceptron classifier is a feedforward neural network utilizes supervised learning technique (backpropagation) for training. MLP Classifier combines with multiple perceptron (hidden) layers. For feedforward taking input send combining with weight bias and then activation function from one hidden layer output goes to other hidden and this process continuing until reached the output. Then output calculates the error with error algorithm. These errors send back with backpropagation for weight adjustment by decreasing the total error and process is repeated, this process is call epoch. Number of epoch is determined with the hyper-parameter and reduction rate of total error.

5.2.1 Back-Propagation

Backpropagation is supervised learning algorithm that is used to train neural network. A neural network consists of input layer, hidden layer and output layer and each layer consists of neuron. So a neural network is a circuit of neurons. Backpropagation is a method to train multilayer neural network the updating of the weights of neural network and is done in such a way so that the error observed can be reduced here, error is only observed in the output layer and that error is back propagated to the previous layers and previous layer is proportionally updated weight. Backpropagation maintain chain rule to update weight. Mainly three steps on backpropagation are (Table-4):

Step Process
Step 1 Forward Pass
Step 2 Backward Pass
Step 3 Sum of all values and calculate updated weight value with Chain – rules.

Table-4 Back-Propagation process

5.2.2 Forward pass/ Forward propagation

Forward propagation is the process where input layer send the input value with randomly selected weight and bias to connected neuron and inside neuron selected activation function combine them and forward to other connected neuron layer after layer then give an output with the help of output layer. Below (Figure-4) display the forward propagation.

Foreward Pass

Figure-4 Forward passes

Input layer take the input of X (X1, X2) combine with randomly selected weight for each connection and with fixed bias (different hidden layer has different bias) send it to first hidden layer where first multiply the input with corresponding weight and added all input with single bias then selected activation function (may different form other layer) combine all input and give output according to function and this process is going on until reach in output layer. Output layer give the output like Y (Y1, Y2) (here output is binary classification as an example) according to selected activation function.

5.2.3 Backward Pass

After calculating error (difference between Forward pass output and actual output) backward pass try to minimize the error with optimisation function by sending backward with proportionally distribution and maintain a chain rule. Backward pass distribution the error in such a way where weighted value is taking under consideration. Below (Figure-5) diagram display the Backward pass process.

Backward Pass

Figure-5 Backward passes

Backpropagation push back the error which is calculated with error function or loss function for update proportional distribution with the help of optimisation algorithm. Division of Optimisation algorithm given below on Figure – 6

Optimisation Algorithms

Figure -6 Division of Optimisation algorithms

Gradient decent calculate gradient and update value by increases or decreases opposite direction of gradients unit and try to find the minimal value. Gradient decent update just one time for whole dataset but stochastic gradient decent update on each training sample and it is faster than normal gradient decent. Gradient decent can be improve by tuning parameter like learning rate (0 to 1 mostly use 0.5). Adagrad use time step based parameter to compute learning rate for every parameter. Adam is Adaptive Moment Estimation. It calculates different parameter with different learning rate. It is faster and performance rate is higher than other optimization algorithm. On the other way Adam algorithm is squares the calculated exponential weighted moving average of gradient.

5.2.4 Chain – rules

Backpropagation maintain chain-rules to update weighted value. On chain-rules backpropagation find the derivative of error respect to any weight. Suppose E is output error. w is weight for input a and bias b and ac neuron output respect of activation function and summation of bias with weighted input (w*a) input to neuron is net. So partial derivative for error respect to weight is ∂E / ∂w display the process on figure-7.

Figure- 7 Partial derivative for error respect to weight

On the chain rules for backward pass to find (error respect to weight) ∂E / ∂w = ∂E / ∂ac * ∂ac / ∂net * ∂net / ∂w. here find to error respect to weight are error respect to output of activation function multiply by activation function output respect to input in a neuron multiply by input in a neuron respect to weight.

5.2.5 Activation function

Activation function is a function which takes the decision about neuron to activate or deactivate. If the activate function activate the neuron then it will give an output on the basis of input. Input in a activation function is sum of input multiply with corresponding weight and adding the layered bias.  The main function of a activate function is non-linearity output of a neuron.

Activation Function

Figure-8 Activation function

Figure – 8 display a neuron in a hidden layer. Here several input (1, 2, 3) with corresponding weight (w1, w2, w3) putting in a neuron input layer where layer bias add with summation of multiplication with input and weight. Equation-9 display the output of an activate function.

Output from activate function y = Activate function (Ʃ (weight * input) + bias)

y = f (Ʃ (w*x) +b)

Equation- 9 Activate function

There are many activation functions like linear function for regression problem, sigmoid function for binary classification problem where result either 0 or 1, Tanh function which is based on sigmoid function but mathematically shifted version and values line -1 to 1. RELU function is Rectified linear unit. RELU is less expensive to compute.

5.2.6 Sigmoid

Sigmoid is a squashing activate function where output range between 0 and 1. Sigmoidal name comes from Greek letter sigma which looks like letter S when graphed. Sigmoid function is a logistic type function, it mainly use in output layer in neural network. Sigmoid is non-linear, fixed output range (between 0 and 1), monotonic (never decrees or never increases) and continuously differentiated function. Sigmoid function is good at classification and output from sigmoid is nonlinear. But Sigmoid has a vanishing gradient problem because output variable is very less to change in input variable. Figure- 9 displays the output of a Sigmoid and derivative of Sigmoid. Here x is any number (positive or negative). On sigmoid function 1 is divided by exponential negative input with adding 1.

Sigmoid

Figure – 9 Sigmoid Functions

4.5.2.7 RELU

RELU stands for Rectified Linear Units it is simple, less expensive in computation and rectifies the gradient vanishing problem. RELU is nonlinear activation function. It gives output either positive (infinity) or 0. RELU has a dying problem because if neurons stop for responding to variation because of gradient is 0 or nothing has to change. Figure- 10 displays the output of an RELU and derivative of RELU. Here x is any positive input and if x is grater then 0 give the output as x or give output 0. RELU function gives the output maximum value of input, here max (0, x).

Relu Activation Function

Figure – 10 RELU Function

4.5.2.8 Cost / loss function (Binary Cross-Entropy)

Cost or loss function compare the predictive value (model outcome) with actual value and give a quantitative value which give the indication about how much good or bad the prediction is.

Cost Function

Figure- 11 Cost function work process

Figure-11 x1 and x2 are input in a activate function f(x) and output y1_out which is sum of weighted input added with bias going through activate function. After model output activate function compare the output with actual output and give a quantitative value which indicate how good or bad the prediction is.

There are many type of loss function but choosing of optimal loss function depends on the problem going to be solved such as regression or classification. For binary classification problem binary cross entropy is used to calculate cost. Equation-10 displays the binary cross entropy where y is actual binary value and yp predictive outcome range 0 and 1. And i is scalar vale range between 1 to model output size (N).

Binary Crossentropy

Equation-10 displays the binary cross entropy

6 Phase 6: – Evaluation

6.1 Confusion matrix

In a classification confusion matrix describe the performance of actual value against predictive value. Confusion Matrix does the performance measurement. So confusion matrix classifies and display predicted and actual value (Visa, S., Ramsay 2011).

Confusion Matrix

Table- 5 Confusion Matrix

Confusion Matrix (Table-5) combines with True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). True Positive is prediction positive and true. True Negative is prediction negative and that is true. False positive is prediction positive and it’s false. False negative is prediction negative and that is false. False positive is known as Type1 error and false negative is known as Type 2 error. Confusion matrix can able to calculate several list of rates which are given below on Table- 6.

Here    N = Total number of observation, TP = True Positive, TN = True Negative

FP = False Positive, FN = False Negative, Total Actual No (AN) = TN + FP,

Total Predictive Yes (PY) = FP + TP. Total Actual Yes (AY) = FN + TP

Rate

 

Description Mathematical Description
Accuracy Classifier, overall how often correctly identified  (TP+TN) / N
Misclassification Rate Classifier, overall how often wrongly identified (FP + FN) / N
True Positive Rate

(Sensitivity / Recall)

Classifier, how often predict correctly yes when it is actually yes.  TP / AY
False Positive Rate Classifier, how often predict wrongly yes when it is actually no.  FP / AN
True Negative Rate

(Specificity)

Classifier, how often predict correctly no when it is actually no.  TN / AN
Precision Classifier how often predict yes when it is correct.  TP / PY
Prevalence Yes conditions how often occur in a sample. AY / N

Table – 6 Confusion matrixes Calculation

From confusion matrix F1 score can be calculated because F1 score related to precision and recall. Higher F1 score is better. If precision or recall any one goes down F1 score also go down.

F1 = \frac{2 * Precision * Recall}{Precision + Recall}

4.6.2 ROC (Receiver Operating Characteristic) curve

In statistics ROC is represent in a graph with plotting a curve which describe a binary classifiers performance as its differentiation threshold is varied. ROC (Equation-11) curve created true positive rate (TPR) against false positive rate (FPR). True positive rate also called as Sensitivity and False positive rate also known as Probability of false alarm. False positive rate also called as a probability of false alarm and it is calculated as 1 – Specificity.

True Positive Rate = \frac{True Positive}{True Positive + False Negative} = Recall or Sensivity

False Positive Rate = \frac{True Negative}{True Negative + False Positive} = 1 - Specificity

Equation- 11 ROC

So ROC (Receiver Operation Characteristic) curve allows visual representation between sensitivity and specificity associated with different values of the test result (Grzybowski, M. and Younger, J.G., 1997)

On ROC curve each point has different Threshold level. Below (Figure – 12) display the ROC curve. Higher the area curve covers is better that means high sensitivity and high specificity represent more accuracy. ROC curve also represent that if classifier predict more often true than it has more true positive and also more false positive. If classifier predict true less often then fewer false positive and also fewer true positive.

ROC Curve

ROC Curve

Figure – 12 ROC curve description

4.6.3 AUC (Area under Curve)

Area under curve (AUC) is the area surrounded by the ROC curve and AUC also represent the degree of separability that means how good the model to distinguished between classes. Higher the AUC value represents better the model performance to separate classes. AUC = 1 for perfect classifier, AUC = 0 represent worst classifier, and AUC = 0.5 means has no class separation capacity. Suppose AUC value is 0.6 that means 60% chance that model can classify positive and negative class.

Figure- 13 to Figure – 16 displays an example of AUC where green distribution curve for positive class and blue distribution curve for negative class. Here threshold or cut-off value is 0.5 and range between ‘0’ to ‘1’. True negative = TN, True Positive = TP, False Negative = FN, False Positive = FP, True positive rate = TPR (range 0 to 1), False positive rate = FPR (range 0 to 1).

On Figure – 13 left distribution curve where two class curves does not overlap that means both class are perfectly distinguished. So this is ideal position and AUC value is 1.  On the left side ROC also display that TPR for positive class is 100% occupied.

ROC distributions (perfectly distinguished

ROC distributions (perfectly distinguished

Figure – 14 two class overlap each other and raise false positive (Type 1), false negative (Type 2) errors. Here error could be minimize or maximize according to threshold. Suppose here AUC = 0.6, that means chance of a model to distinguish two classes is 60%. On ROC curve also display the curve occupied for positive class is 60%.

ROC distributions (class partly overlap distinguished)

ROC distributions (class partly overlap distinguished)

Figure- 15 displayed that positive and negative overlap each other. Here AUC value is 0.5 or near to 0.5. On this position classifier model does not able distinguish positive and negative classes. On left side ROC curve become straight that means TPR and FPR are equal.

ROC distributions (class fully overlap distinguished)

ROC distributions (class fully overlap distinguished)

Figure- 16 positive and negative class swap position and on this position AUC = 0. That means classified model predict positive as a negative and negative as a positive. On the left ROC curve display that curve on FPR side fully fitted.

ROC distributions (class swap position distinguished)

ROC distributions (class swap position distinguished)

7 Summaries

This paper describes a data mining process flow and related model and its algorithm with textual representation. One hot encoding create dummy variable for class features and min-max scaling scale the data in a single format. Balancing by SMOTE data where Euclidian distance calculates the distance in-between nearest neighbour to produce synthetic data under minority class. LDA reduce the distance inside class and maximise distance in-between class and for two class problem give a single dimension features which is less costly to calculate accuracy by base algorithm (random forest and MLP classifier).  Confusion matrix gives the accuracy, precision, sensitivity, specificity which is help to take a decision about base algorithm. AUC and ROC curve also represent true positive rate against false positive rate which indicate base algorithm performance.

Base algorithm Random forest using CART with GINI impurity for feature selection to spread the tree. Here CART is selected because of less costly to run. Random forest algorithm is using bootstrap dataset to grow trees, and aggregation using majority vote to select accuracy.

MLP classifier is a neural network algorithm using backpropagation chain-rule to reducing error. Here inside layers using RLU activation function. Output layers using Sigmoid activation function and binary cross entropy loss function calculate the loss which is back propagate with Adam optimizer to optimize weight and reduce loss.

References:

  1. Visa, S., Ramsay, B., Ralescu, A.L. and Van Der Knaap, E., 2011. Confusion Matrix-based Feature Selection. MAICS, 710, pp.120-127.
  2. Grzybowski, M. and Younger, J.G., 1997. Statistical methodology: III. Receiver operating characteristic (ROC) curves. Academic Emergency Medicine, 4(8), pp.818-826.

CRISP-DM methodology in technical view

On this paper discuss about CRISP-DM (Cross Industry Standard Process for data mining) methodology and its steps including selecting technique to successful the data mining process. Before going to CRISP-DM it is better to understand what data mining is? So, here first I introduce the data mining and then discuss about CRISP-DM and its steps for any beginner (data scientist) need to know.

1 Data Mining

Data mining is an exploratory analysis where has no idea about interesting outcome (Kantardzic, 2003). So data mining is a process to explore by analysis a large set of data to discover meaningful information which help the business to take a proper decision. For better business decision data mining is a way to select feature, correlation, and interesting patterns from large dataset (Fu, 1997; SPSS White Paper, 1999).

Data mining is a step by step process to discover knowledge from data. Pre-processing data is vital part for a data mining. In pre-process remove noisy data, combining multiple sources of data, retrieve relevant feature and transforming data for analysis. After pre-process mining algorithm applied to extract data pattern so data mining is a step by step process and applied algorithm to find meaning full data pattern. Actually data mining is not only conventional analysis it is more than that (Read, 1999).

Data mining and statistics closely related. Main goal of data mining and statistic is find the structure of data because data mining is a part of statistics (Hand, 1999). However, data mining use tools, techniques, database, machine learning which not part of statistics but data mining use statistics algorithm to find a pattern or discover hidden decision.

Data mining objective could be prediction or description. On prediction data mining considering several features of dataset to predict unidentified future, on the other hand description involve identifying pattern of data to interpreted (Kantardzic, 2003).

From figure 1.1 shows data mining is the only one part of getting unknown information from data but it is the central process of whole process. Before data mining there are several processes need to be done like collecting data from several sources than integrated data and keep in data storage. Stored unprocessed data evaluated and selected with pre-processed activity to give a standard format than data mining algorithm to analysis for hidden pattern.

Data Mining Process

2 CRISP-DM Methodologies

Cross Industry Standard Process for data mining (CRISP-DM) is most popular and widely uses data mining methodology. CRISP-DM breaks down the data mining project life cycle into six phases and each phase consists of many second-level generic tasks. Generic task cover all possible data mining application. CRISP-DM extends KDD (Knowledge Discovery and Data Mining) into six steps which are sequence of data mining application (Martínez-Plumed 2019).

Data science and data mining project extract meaningful information from data. Data science is an art where a lot of time need to spend for understanding the business value and data before applying any algorithm then evaluate and deployed a project. CRISP-DM help any data science and data mining project from start to end by giving step by step process.

Present world every day billions of data are generating. So organisations are struggling with overwhelmed data to process and find a business goal. Comprehensive data mining methodology, CRISP-DM help business to achieve desirable goal by analysing data.

CRISP-DM (Cross Industry Standard Process for Data Mining) is well documented, freely available, data mining methodology. CRISP-DM is developed by more than 200 data mining users and many mining tool and service providers funded by European Union. CRISP-DM encourages organization for best practice and provides a structure of data mining to get better, faster result.

CRISP-DM is a step by step methodology. Figure-2.1 show the phases of CRISP-DM and process of data mining. Here one side arrow indicates the dependency between phases and double side arrow represents repeatable process. Six phases of CRISP-DM are Business understanding, Data understanding, Modelling, Evaluation and Deployment.

CRISP-DM

2.1 Business Understanding

Business Understanding or domain understanding is the first step of CRISP-DM methodology. On this stage identify the area of business which is going to transform into meaningful information by analysing, processing and implementing several algorithms. Business understanding identifies the available resource (human and hardware), problems and set a goal. Identification of business objective should be agreed with project sponsors and other unit of business which will be affected. This step also focuses about details business success criteria, requirements, constraints, risk, project plan and timeline.

2.2 Data Understanding

Data understanding is the second and closely related with the business understanding phase. This phase mainly focus on data collection and proceeds to get familiar with the data and also detect interesting subset from data. Data understanding has four subsets these are:-

2.2.1 Initial data collection

On this subset considering the data collection sources which is mainly divided into two categories like outsource data or internal source data.  If data is from outsource then it may costly, time consuming and may be low quality but if data is collected form internal source it is an easy and less costly, but it may be contain irrelevant data. If internal source data does not fulfil the interest of analysis than it is necessary to move outsource data. Data collection also give an assumption that the data is quantitative (continuous, count) or qualitative (categorical).  It also gives information about balance or imbalanced dataset.  On data collection should avoid random error, systematic error, exclusion errors, and errors of choosing.

2.2.2 Data Description

Data description performs initial analysis about data. On this stage it is going to determine about the source of data like RDBMS, SQL, NoSQL, Big data etc. then analysis and describe the data about size (large data set give more accurate result but time consuming), number of records, tables, database, variables, and data types (numeric, categorical or Boolean). On this phase examine the accessibility and availability of attributes.

2.2.3 Exploratory data analysis (EDA)

On exploratory data analysis describe the inferential statistics, descriptive statistics and graphical representation of data. Inferential statistics summarize the entire population from the sample data to perform sampling and hypothesis testing. On Parametric hypothesis testing  (Null or alternate – ANOVA, t-test, chi square test) perform for known distribution (based on population) like mean, variance, standard deviation, proportion and Non-parametric hypothesis testing perform when distribution is unknown or sample size is small. On sample dataset, random sampling implement when dataset is balance but for imbalance dataset should be follow random resampling (under  and over sampling), k fold cross validation, SMOTE (synthetic minority oversampling technique), cluster base sampling, ensemble techniques (bagging and boosting – Add boost, Gradient Tree Boosting, XG Boost) to form a balance dataset.

On descriptive statistics analysis describe about the mean, median, mode for measures of central tendency on first moment business decision. On second moment business decision describe the measure of dispersion about the variance, standard deviation and range of data.  On third and fourth moment business decision describe accordingly skewness (Positive skewness – heavier tail to the right, negative skewness – heavier tail to the left, Zero skewness – symmetric distribution) and Kurtosis (Leptokurtosis – heavy tail, platykurtosis – light tail, mesokurtic – normal distribution).

Graphical representation is divided into univariate, bivariate and multivariate analysis. Under univariate whisker plot, histogram identify the outliers and shape of distribution of data and Q-Q plot (Quantile – Quantile) plot describe the normality of data that means data is normally distribution or not.  On whisker plot if data present above of Q3 + 1.5 (IQR) and below of Q1 – 1.5 (IQR) is outlier. For Bivariate correlations identify with scatter plot which describe positive, negative or no correlation and also identify the data linearity or non-linearity. Scatter plot also describe the clusters and outliers of data.  For multivariate has no graphical analysis but used to use regression analysis, ANOVA, Hypothesis analysis.

2.2.4 Data Quality analysis

This phase identified and describes the potential errors like outliers, missing data, level of granularity, validation, reliability, bad metadata and inconsistency.  On this phase AAA (attribute agreement analysis) analysed discrete data for data error. Continuous data analysed with Gage repeatability and reproducibility (Gage R & R) which follow SOP (standard operating procedures). Here Gage R & R define the aggregation of variation in the measurement data because of the measurement system.

2.3 Data Preparation

Data Preparation is the time consuming stage for every data science project. Overall on every data science project 60% to 70% time spend on data preparation stage. Data preparation stapes are described below.

2.3.1 Data integration

Data integration involved to integrate or merged multiple dataset. Integration integrates data from different dataset where same attribute or same columns presents but when there is different attribute then merging the both dataset.

2.3.2 Data Wrangling

On this subset data are going to clean, curate and prepare for next level. Here analysis the outlier and treatment done with 3 R technique (Rectify, Remove, Retain) and for special cases if there are lots of outliner then need to treat outlier separately (upper outliner in an one dataset and lower outliner in another dataset) and alpha (significant value) trim technique use to separate the outliner from the original dataset. If dataset has a missing data then need to use imputation technique like mean, median, mode, regression, KNN etc.

If dataset is not normal or has a collinearity problem or autocorrelation then need to implement transformation techniques like log, exponential, sort, Reciprocal, Box-cox etc. On this subset use the data normalization (data –means/standard deviation) or standardization (min- max scaler) technique to make unitless and scale free data. This step also help if data required converting into categorical then need to use discretization or binning or grouping technique. For factor variable (where has limited set of values), dummy variable creation technique need to apply like one hot encoding.  On this subset also help heterogeneous data to transform into homogenous with clustering technique. Data inconsistencies also handle the inconsistence of data to make data in a single scale.

2.3.3 Feature engineering and selection/reduction

Feature engineering may called as attribute generation or feature extraction. Feature extraction creating new feature by reducing original feature to make simplex model. Feature engineering also do the normalized feature by producing calculative new feature. So feature engineering is a data pre-process technique where improve data quality by cleaning, integration, reduction, transformation and scaling.

Feature selections reduce the multicollinearity or high correlated data and make model simple. Main two type of feature selection technique are supervised and unsupervised. Principal Components Analysis (PCA) is an unsupervised feature reduction/ feature selection technique and LDA is a Linear Discriminant analysis supervised technique mainly use for classification problem. LDA analyse by comparing mean of the variables. Supervised technique is three types filter, wrapper and ensemble method. Filter method is easy to implement but wrapper is costly method and ensemble use inside a model.

2.4 Model

2.4.1 Model Selection Technique

Model selection techniques are influence by accuracy and performance.  Because recommendation need better performance but banking fraud detection needs better accuracy technique.  Model is mainly subdivided into two category supervised learning where predict an output variable according to given an input variable and unsupervised learning where has not output variable.

On supervised learning if an output variable is categorical than it is classification problem like two classes or multiclass classification problem. If an output variable is continuous (numerical) then the problem is called prediction problem. If need to recommending according to relevant information is called recommendation problem or if need to retrieve data according to relevance data is called retrieval problem.

On unsupervised learning where target or output variable is not present. On this technique all variable is treated as an input variable. Unsupervised learning also called clustering problem where clustering the dataset for future decision.

Reinforcement learning agent solves the problem by getting reward for success and penalty for any failure. And semi-supervised learning is a process to solve the problem by combining supervised and unsupervised learning method. On semi-supervised, a problem solved by apply unsupervised clustering technique then for each cluster apply different type of supervised machine learning algorithm like linear algorithm, neural network, K nearest  neighbour etc.

On data mining model selection technique, where output variable is known, then need to implement supervised learning.  Regression is the first choice where interpretation of parameter is important. If response variable is continuous then linear regression or if response variable is discrete with 2 categories value then logistic regression or if response variable is discrete with more than 2 categorical values then multinomial or ordinal regression or if response variable is count then poission where mean is equal to variance or negative binomial regression where variance is grater then mean or if response variable contain excessive zero values then need to choose Zero inflated poission (ZIP) or Zero inflated negative binomial (ZINB).

On supervised technique except regression technique all other technique can be used for both continuous or categorical response variable like KNN (K-Nearest Neighbour),  Naïve Bays, Black box techniques (Neural network, Support vector machine), Ensemble Techniques (Stacking, Bagging like random forest, Boosting like Decision tree, Gradient boosting, XGB, Adaboost).

When response variable is unknown then need to implement unsupervised learning. Unsupervised learning for row reduction is K-Means, Hierarchical etc., for columns reduction or dimension reduction PCA (principal component analysis), LDA (Linear Discriminant analysis), SVD (singular value decomposition) etc. On market basket analysis or association rules where measure are support and confidence then lift ration to determine which rules is important. There are recommendation systems, text analysis and NLP (Natural language processing) also unsupervised learning technique.

For time series need to select forecasting technique. Where forecasting may model based or data based. For Trend under model based need to use linear, exponential, quadratic techniques. And for seasonality need to use additive, multiplicative techniques. On data base approaches used auto regressive, moving average, last sample, exponential smoothing (e.g. SES – simple exponential smoothing, double exponential smoothing, and winters method).

2.4.2 Model building

After selection model according to model criterion model is need to be build. On model building provided data is subdivided with training, validation and testing.  But sometime data is subdivided just training and testing where information may leak from testing data to training data and cause an overfitting problem. So training dataset should be divided into training and validation whereas training model is tested with validation data and if need any tuning to do according to feedback from validation dataset. If accuracy is acceptable and error is reasonable then combine the training and validation data and build the model and test it on unknown testing dataset. If the training error and testing error is minimal or reasonable then the model is right fit or if the training error is low and testing error is high then model is over fitted (Variance) or if training error is high and testing error is also high then model is under fitted (bias). When model is over fitted then need to implement regularization technique (e.g. linear – lasso, ridge regression, Decision tree – pre-pruning, post-pruning, Knn – K value, Naïve Bays – Laplace, Neural network – dropout, drop connect, batch normalization, SVM –  kernel trick)

When data is balance then split the data training, validation and testing and here training is larger dataset then validation and testing. If data set is imbalance then need to use random resampling (over and under) by artificially increases training dataset. On random resampling by randomly partitioning data and for each partition implement the model and taking the average of accuracy. Under K fold cross validation creating K times cross dataset and creating model for every dataset and validate, after validation taking the average of accuracy of all model. There is more technique for imbalance dataset like SMOTH (synthetic minority oversampling technique), cluster based sampling, ensemble techniques e.g. Bagging, Boosting (Ada Boost, XGBoost).

2.4.3 Model evaluation and Tuning

On this stage model evaluate according to errors and accuracy and tune the error and accuracy for acceptable manner. For continuous outcome variable there are several way to measure the error like mean error, mean absolute deviation, Mean squared error, Root mean squared error, Mean percentage error and Mean absolute percentage error but more acceptable way is Mean absolute percentage error. For this continuous data if error is known then it is easy to find out the accuracy because accuracy and error combining value is one. The error function also called cost function or loss function.

For discrete output variable model, for evaluation and tuning need to use confusion matrix or cross table. From confusion matrix, by measuring accuracy, error, precision, sensitivity, specificity, F1 help to take decision about model fitness. ROC curve (Receiver operating characteristic curve), AUC curve (Area under the ROC curve) also evaluate the discrete output variable. AUC and ROC curve plot of sensitivity (true positive rate) vs 1-specificity (false positive rate).  Here sensitivity is a positive recall and  recall is basically out of all positive samples, how sample classifier able to identify. Specificity is negative recall here recall is out of all negative samples, how many sample classifier able to identify.  On AUC where more the area under the ROC is represent better accuracy. On ROC were step bend it’s indicate the cut off value.

2.4.4 Model Assessment

There is several ways to assess the model. First it is need to verify model performance and success according to desire achievement. It needs to identify the implemented model result according to accuracy where accuracy is repeatable and reproducible. It is also need to identify that the model is scalable, maintainable, robust and easy to deploy. On assessment identify that the model evaluation about satisfactory results (identify the precision, recall, sensitivity are balance) and meet business requirements.

2.5 Evaluation

On evaluation steps, all models which are built with same dataset, given a rank to find out the best model by assessing model quality of result and simplicity of algorithm and also cost of deployment. Evaluation part contains the data sufficiency report according to model result and also contain suggestion, feedback and recommendation from solutions team and SMEs (Subject matter experts) and record all these under OPA (organizational process assets).

2.6 Deployment

Deployment process needs to monitor under PEST (political economical social technological) changes within the organization and outside of the organization. PEST is similar to SWOT (strength weakness opportunity and thread) where SW represents the changes of internal and OT represents external changes.

On this deployment steps model should be seamless (like same environment, same result etc.) from development to production. Deployment plan contain the details of human resources, hardware, software requirements. Deployment plan also contain maintenance and monitoring plan by checking the model result and validity and if required then implement retire, replace and update plan.

3 Summaries

CRISP-DM implementation is costly and time consuming. But CRISP-DM methodology is an umbrella for data mining process. CRISP-DM has six phases, Business understanding, Data understanding, Modelling, Evaluation and Deployment. Every phase has several individual criteria, standard and process. CRISP-DM is Guideline for data mining process so if CRISP-DM is going to implement in any project it is necessary to follow each and every single guideline and maintain standard and criteria to get required result.

4 References

  1. Fu, Y., (1997), “Data Mining: Tasks, Techniques and Applications”, Potentials, IEEE, 16: 4, 18–20.
  2. Hand, D. J., (1999), “Statistics and Data Mining: Intersecting Disciplines”, ACM SIGKDD Explorations Newsletter, 1: 1, 16 – 19.
  3. Kantardzic, M., (2003), “Data Mining: Concepts, Models, Methods, and Algorithms” John Wiley and Sons, Inc., Hoboken, New Jersey
  4. Martínez-Plumed, F., Contreras-Ochando, L., Ferri, C., Orallo, J.H., Kull, M., Lachiche, N., Quintana, M.J.R. and Flach, P.A., 2019. CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Transactions on Knowledge and Data Engineering.
  5. Read, B.J., (1999), “Data Mining and Science? Knowledge discovery in science as opposed to business”, 12th ERCIM Workshop on Database Research.

Rethinking linear algebra: visualizing linear transformations and eigen vectors

In terms of calculation processes of Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), which are the dimension reduction techniques I am going to explain in the following articles, diagonalization is what they are all about. Throughout this article, I would like you to have richer insight into diagonalization in order to prepare for understanding those basic dimension reduction techniques.

When our professor started a lecture on the last chapter of our textbook on linear algebra, he said “It is no exaggeration to say that everything we have studied is for this ‘diagonalization.'” Until then we had to write tons of numerical matrices and vectors all over our notebooks, calculating those products, adding their rows or columns to other rows or columns, sometimes transposing the matrices, calculating their determinants.

It was like the scene in “The Karate Kid,” where the protagonist finally understood the profound meaning behind the prolonged and boring “wax on, wax off” training given by Miyagi (or “jacket on, jacket off” training given by Jackie Chan). We had finally understood why we had been doing those seemingly endless calculations.

Source: http://thinkbedoleadership.com/secret-success-wax-wax-off/

But usually you can do those calculations easily with functions in the Numpy library. Unlike Japanese college freshmen, I bet you are too busy to reopen textbooks on linear algebra to refresh your mathematics. Thus I am going to provide less mathematical and more intuitive explanation of diagonalization in this article.

1, The mainstream ways of explaining diagonalization.

*The statements below are very rough for mathematical topics, but I am going to give priority to offering more visual understanding on linear algebra in this article. For further understanding, please refer to textbooks on linear algebra. If you would like to have minimum understandings on linear algebra needed for machine learning, I recommend the Appendix C of Pattern Recognition and Machine Learning by C. M. Bishop.

In most textbooks on linear algebra, the explanations on dioagonalization is like this (if you are not sure what diagonalization is or if you are allergic to mathematics, you do not have to read this seriously):

Let V (dimV = D)be a vector space and let  T_A : V \rightarrow V be a mapping of V into itself,  defined as T_A(v) = A \cdot \boldsymbol{v}, where A is a D\times D matrix and \boldsymbol{v} is D dimensional vector. An element \boldsymbol{v} \in V is called an eigen vector if there exists a number \lambda such that A \cdot \boldsymbol{v}= \lambda \cdot \boldsymbol{v} and \boldsymbol{v} \neq \boldsymbol{0}. In this case \lambda is uniquely determined and is called an eigen value of A belonging to the eigen vector \boldsymbol{v}.

Any matrix A has D eigen values \lambda_{i}, belonging to \boldsymbol{v}_{i} (i=1, 2, …., D). If \boldsymbol{v}_{i} is basis of the vector space V, then A is diagonalizable.

When A is diagonalizable, with D \times D matrices P = (\boldsymbol{v}_{1}, \dots, \boldsymbol{v}_{D}) , whose column vectors are eigen vectors \boldsymbol{v}_{i} (i=1, 2, …., D), the following equation holds: P^{-1}AP = \Lambda, where \Lambda = diag(\lambda_{1}, \dots, \lambda_{D})= \begin{pmatrix} \lambda_{1} & 0& \ldots &0\\ 0 & \lambda_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_{D} \end{pmatrix}.

And when A is diagonalizable, you can diagonalize A as below.

Most textbooks keep explaining these type of stuff, but I have to say they lack efforts to make it understandable to readers with low mathematical literacy like me. Especially if you have to apply the idea to data science field, I believe you need more visual understanding of diagonalization. Therefore instead of just explaining the definitions and theorems, I would like to take a different approach. But in order to understand them in more intuitive ways, we first have to rethink waht linear transformation T_A means in more visible ways.

2, Linear transformations

Even though I did my best to make this article understandable to as little prerequisite knowledge, you at least have to understand linear transformation of numerical vectors and with matrices. Linear transformation is nothing difficult, and in this article I am going to use only 2 or 3 dimensional numerical vectors or square matrices. You can calculate linear transformation of \boldsymbol{v} by A as equations in the figure. In other words, \boldsymbol{u} is a vector transformed by A.

*I am not going to use the term “linear transformation” in a precise way in the context of linear algebra. In this article or in the context of data science or machine learning, “linear transformation” for the most part means products of matrices or vectors. 

*Forward/back propagation of deep learning is mainly composed of this linear transformation. You keep linearly transforming input vectors, frequently transforming them with activation functions, which are for the most part not linear transformation.

As you can see in the equations above, linear transformation with A transforms a vector to another vector. Assume that you have an original vector \boldsymbol{v} in grey and that the vector \boldsymbol{u} in pink is the transformed \boldsymbol{v} by A is. If you subtract \boldsymbol{v} from \boldsymbol{u}, you can get a displacement vector, which I displayed in purple. A displacement vector means the transition from a vector to another vector.

Let’s calculate the displacement vector with more vectors \boldsymbol{v}. Assume that A =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, and I prepared several grid vectors \boldsymbol{v} in grey as you can see in the figure below. If you transform those grey grid points with A, they are mapped into the vectors \boldsymbol{u} in pink. With those vectors in grey or pink, you can calculate the their displacement vectors \boldsymbol{u} = \boldsymbol{v} in purple.

I think you noticed that the displacement vectors in the figure above have some tendencies. In order to see that more clearly, let’s calculate displacement vectors with several matrices A and more grid points. Assume that you have three 2 \times 2 square matrices A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, A_2 =\begin{pmatrix} 3 & 1 \\ -1 & 1 \end{pmatrix}, A_3 =\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}, and I plotted displace vectors made by the matrices respectively in the figure below.

I think you noticed some characteristics of the displacement vectors made by those linear transformations: the vectors are swirling and many of them seem to be oriented in certain directions. To be exact, some displacement vectors have extend in the same directions as some of original vectors in grey. That means  linear transformation by A did not change the direction of the original vector \boldsymbol{v}, and the unchanged vectors are called eigen vectors. Real eigen vectors of each A are displayed as arrows in yellow in the figure above. But when it comes to A_3, the matrix does not have any real eigan values.

In linear algebra, depending on the type matrices A, you have consider various cases such as whether the matrices have real or imaginary eigen values, whether the matrices are diagonalizable, whether the eigen vectors are orthogonal, or whether they are unit vectors. But those topics are out of the scope of this article series, so please refer to textbooks on linear algebra if you are interested.

Luckily, however, in terms of PCA or LDA, you only have to consider a type of matrices named positive semidefinite matrices, which A_1 is classified to, and I am going to explain positive semidefinite matrices in the fourth section.

3, Eigen vectors as coordinate system

Source: Ian Stewart, “Professor Stewart’s Cabinet of Mathematical Curiosities,” (2008), Basic Books

Let me take Fibonacci numbers as an example to briefly see why diagonalization is useful. Fibonacci is sequence is quite simple and it is often explained using an example of pairs of rabbits increasing generation by generation. Let a_n (n=0, 1, 2, …) be the number of pairs of grown up rabbits in the n^{th} generation. One pair of grown up rabbits produce one pair of young rabbit The concrete values of a_n are a_0 = 0, a_1 = 1, a_2=1, a_3=2, a_4=3, a_5=5, a_6=8, a_7=13, \dots. Assume that A =\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} and that \begin{pmatrix} a_1 \\ a_0  \end{pmatrix} =\begin{pmatrix} 1 \\ 0  \end{pmatrix}, then you can calculate the number of the pairs of grown up rabbits in the next generation with the following recurrence relation. \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}=\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}.Let \boldsymbol{a}_n be \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}, then the recurrence relation can be written as \boldsymbol{a}_{n+1} = A \boldsymbol{a}_n, and the transition of \boldsymbol{a}_n are like purple arrows in the figure below. It seems that the changes of the purple arrows are irregular if you look at the plots in normal coordinate.

Assume that \lambda _1, \lambda_2 (\lambda _1< \lambda_2) are eigen values of A, and \boldsymbol{v}_1, \boldsymbol{v}_2 are eigen vectors belonging to them respectively. Also let \alpha, \beta scalars such that \begin{pmatrix} a_{1} \\ a_{0}  \end{pmatrix} = \begin{pmatrix} 1 \\ 0  \end{pmatrix} = \alpha \boldsymbol{v}_1 + \beta \boldsymbol{v}_2. According to the definition of eigen values and eigen vectors belonging to them, the following two equations hold: A\boldsymbol{v}_1 = \lambda_1 \boldsymbol{v}_1, A\boldsymbol{v}_2 = \lambda_2 \boldsymbol{v}_2. If you calculate \boldsymbol{a}_1 is, using eigen vectors of A, \boldsymbol{a}_1  = A\boldsymbol{a}_0 = A (\alpha \boldsymbol{v}_1 + \beta \boldsymbol{v}_2) = \alpha\lambda _1 \boldsymbol{v}_1 + \beta \lambda_2 \boldsymbol{v}_2. In the same way, \boldsymbol{a}_2 = A\boldsymbol{a}_1 = A (\alpha\lambda _1 \boldsymbol{v}_1 + \beta \lambda_2 \boldsymbol{v}_2) = \alpha\lambda _{1}^{2} \boldsymbol{v}_1 + \beta \lambda_{2}^{2} \boldsymbol{v}_2, and \boldsymbol{a}_3 = A\boldsymbol{a}_2 = A (\alpha\lambda _{1}^{2} \boldsymbol{v}_1 + \beta \lambda_{2}^{2} \boldsymbol{v}_2) = \alpha\lambda _{1}^{3} \boldsymbol{v}_1 + \beta \lambda_{2}^{3} \boldsymbol{v}_2. These equations show that in coordinate system made by eigen vectors of A, linear transformation by A is easily done by just multiplying eigen values with each eigen vector. Compared to the graph of Fibonacci numbers above, in the figure below you can see that in coordinate system made by eigen vectors the plots changes more systematically generation by generation.

 

In coordinate system made by eigen vectors of square matrices, the linear transformations by the matrices can be much more straightforward, and this is one powerful strength of eigen vectors.

*I do not major in mathematics, so I am not 100% sure, but vectors in linear algebra have more abstract meanings and various things in mathematics can be vectors, even though in machine learning or data science we  mainly use numerical vectors with more concrete elements. We can also say that matrices are a kind of maps. That is just like, at leas in my impression, even though a real town is composed of various components such as houses, smooth or bumpy roads, you can simplify its structure with simple orthogonal lines, like the map of Manhattan. But if you know what the town actually looks like, you do not have to follow the zigzag path on the map.

4, Eigen vectors of positive semidefinite matrices

In the second section of this article I told you that, even though you have to consider various elements when you discuss general diagonalization, in terms of PCA and LDA we mainly use only a type of matrices named positive semidefinite matrices. Let A be a D \times D square matrix. If \boldsymbol{x}^T A \boldsymbol{x} \geq 0 for all values of the vector \boldsymbol{x}, the A is said to be a positive semidefinite matrix. And also it is known that A being a semidefinite matrix is equivalent to \lambda _{i} \geq 0 for all the eigen values \lambda_i (i=1, \dots , D).

*I think most people first learn a type of matrices called positive definite matrices. Let A be aD \times D square matrix. If \boldsymbol{x}^T A \boldsymbol{x} > 0 for all values of the vector \boldsymbol{x}, the A is said to be a positive definite matrix. You have to keep it in mind that even if all the elements of A are positive, A is not necessarly positive definite/semidefinite.

Just as we did in the second section of this article, let’s visualize displacement vectors made by linear transformation with a 3 \times 3 square positive semidefinite matrix A.

*In fact A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, whose linear transformation I visualized the second section, is also positive semidefinite.

Let’s visualize linear transformations by a positive definite matrix A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}. I visualized the displacement vectors made by the A just as the same way as in the second section of this article. The result is as below, and you can see that, as well as the displacement vectors made by A_1, the three dimensional displacement vectors below are swirling and extending in three directions, in the directions of the three orthogonal eigen vectors \boldsymbol{v}_1, \boldsymbol{v}_2, and \boldsymbol{v}_3.

*It might seem like a weird choice of a matrix, but you are going to see why in the next article.

You might have already noticed A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix} and A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix} are both symmetric matrices and that their elements are all real values, and that their diagonal elements are all positive values. Super importantly, when all the elements of a D \times D symmetric matrix A are real values and its eigen values are \lambda_{i} (i=1, \dots , D), there exist orthonormal matrices U such that U^{-1}AU = \Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}).

*The title of this section might be misleading, but please keep it in mind that positive definite/semidefinite matrices are not necessarily real symmetric matrices. And real symmetric vectors are not necessarily positive definite/semidefinite matrices.

5, Orthonormal matrices and rotation of vectors

In this section I am gong to explain orthonormal matrices, as known as rotation matrices. If a D\times D matrix U is an orthonormal matrix, column vectors of U are orthonormal, which means U = (\boldsymbol{u}_1 \dots \boldsymbol{u}_D), where \begin{cases} \boldsymbol{u}_{i}^{T}\boldsymbol{u}_{j} = 1 \quad (i = j) \\ \boldsymbol{u}_{i}^{T}\boldsymbol{u}_{j} = 0 \quad (i\neq j) \end{cases}. In other words column vectors \boldsymbol{u}_{i} form an orthonormal coordinate system.

Orthonormal matrices U have several important matrices, and one of them is U^{-1} = U^{T}. Combining this fact with what I have told you so far, you we can reach one conclusion that you can orthogonalize a real symmetric matrix A as U^{T}AU = \Lambda. This is known as spectral decomposition or singular value decomposition.

Another important property of U is that U^{T} is also orthonormal. In other words, assume U is orthonormal and that U = (\boldsymbol{u}_1 \dots \boldsymbol{u}_D) = \begin{pmatrix} -\boldsymbol{v_1}^{T}- \\ \vdots \\ -\boldsymbol{v_D}^{T}- \end{pmatrix}, (\boldsymbol{v}_1 \dots \boldsymbol{v}_D) also forms a orthonormal coordinate system.

…It seems things are getting too mathematical and abstract (for me), thus for now I am going to wrap up what I have explained in this article .

We have seen

  • Numerical matrices linearly transform vectors.
  • Certain linear transformations do not change the direction of vectors in certain directions, which are called eigen vectors.
  • Making use of eigen vectors, you can form new coordinate system which can describe the linear transformations in a more straightforward way.
  • You can diagonalize a real symmetric matrix A with an orthonormal matrix U.

Of our current interest is what kind of linear transformation the real symmetric positive definite matrix enables. I am going to explain why the purple vectors in the figure above is swirling like that in the upcoming articles. Before that, however, we are going to  see one application of what we have seen in this article, on dimension reduction. To be concrete the next article is going to be about principal component analysis (PCA), which is very important in many fields.

*In short, the orthonormal matrix U I mentioned above enables rotation of matrix, and the diagonal matrix diag(\lambda_1, \dots, \lambda_D) expands or contracts vectors along each axis. I am going to explain that more precisely in the upcoming articles.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

*I attatched the codes I used to make the figures in this article. You can just copy, paste, and run, sometimes installing necessary libraries.

import matplotlib.pyplot as plt 
import numpy as np
import matplotlib.patches as mpatches

T_A = np.array([[1, 1], 
                [1, 0]])

total_step = 5
x = np.zeros((total_step, 2))

x[0]  = np.array([1, 0])

for i in range(total_step - 1):
    x[i + 1] = np.dot(T_A, x[i])

eigen_values, eigen_vectors = np.linalg.eig(T_A)
idx = eigen_values.argsort()[::-1]   
eigen_values = eigen_values[idx]
eigen_vectors = eigen_vectors[:,idx]
for i in range(len(eigen_vectors)):
    if(eigen_vectors.T[i][0] < 0):
        eigen_vectors.T[i] = - eigen_vectors.T[i]   

v_initial = x[0]
v_coefficients = np.zeros((total_step , 2))
v_coefficients[0] = np.dot(v_initial ,  np.linalg.inv(eigen_vectors.T)) 

for i in range(total_step-1):
    v_coefficients[i + 1] = v_coefficients[i] * eigen_values 

for i in range(total_step):
    v_1_list[i+1] = v_coefficients.T[0][i]*eigen_vectors.T[0]
    v_2_list[i+1] = v_coefficients.T[1][i]*eigen_vectors.T[1]

plt.figure(figsize=(20, 15))
fontsize = 20
small_shift = 0.2

plt.plot(x[:, 0], x[:, 1], marker='o', linestyle='none', markersize=10, color='black')

plt.arrow(0, 0, eigen_vectors.T[0][0], eigen_vectors.T[0][1], width=0.05, head_width=0.2, color='orange')
plt.arrow(0, 0, eigen_vectors.T[1][0], eigen_vectors.T[1][1], width=0.05, head_width=0.2, color='orange')

plt.text(eigen_vectors.T[0][0], eigen_vectors.T[0][1]+small_shift, r'v_{1}', va='center',ha='right', fontsize=fontsize + 10)
plt.text(eigen_vectors.T[1][0] - small_shift, eigen_vectors.T[1][1],r'v_{2}', va='center',ha='right', fontsize=fontsize + 10)

for i in range(total_step): 
    
    plt.arrow(0, 0, v_1_list[i+1][0], v_1_list[i+1][1], head_width=0.05, color='darkviolet', length_includes_head=True)
    plt.arrow(0, 0, v_2_list[i+1][0], v_2_list[i+1][1], head_width=0.05, color='darkviolet', length_includes_head=True)
    
    plt.text(v_1_list[i+1][0] + 2*small_shift , v_1_list[i+1][1]-2*small_shift,r'\alpha \cdot \lambda_{0} ^{1} \cdot v_{2}'.format(1,i+1, 1),va='center',ha='right', fontsize=fontsize)
    plt.text(v_2_list[i+1][0]-0.1, v_2_list[i+1][1],r'\beta \cdot \lambda_{0} ^{1} \cdot v_{2}'.format(2, i+1, 2),va='center',ha='right', fontsize=fontsize)

    plt.arrow(v_1_list[i+1][0], v_1_list[i+1][1], v_2_list[i+1][0], v_2_list[i+1][1], head_width=0, color='black', linestyle='--', length_includes_head=True)
    plt.arrow(v_2_list[i+1][0], v_2_list[i+1][1], v_1_list[i+1][0], v_1_list[i+1][1], head_width=0, color='black', linestyle='--', length_includes_head=True)
    
orange_patch = mpatches.Patch(color='orange', label='Eigen vectors')
purple_patch = mpatches.Patch(color='darkviolet', label='Scalar multiples of the eigen vectors')
plt.legend(handles=[orange_patch, purple_patch], fontsize=25, loc='lower right')

for i in range(total_step):
    plt.text(x[i][0]+0.1, x[i][1]-0.05, r'n={0}'.format(i), fontsize=20)

plt.grid(True)
plt.ylabel("a_{n}: n^{th} generation", fontsize=20)
plt.xlabel("a_{n+1}: n+1 ^{th} geneartion", fontsize=20)
plt.title("Fibonacci sequence and its eigen space", fontsize=30)
#plt.savefig("Fibonacci_eigen_space.png")
plt.show()
import matplotlib.pyplot as plt 
import numpy as np 
import matplotlib.patches as mpatches

const_range = 10

X = np.arange(-const_range, const_range + 1, 1)
Y = np.arange(-const_range, const_range + 1, 1)
U_x, U_y = np.meshgrid(X, Y)

T_A_0 = np.array([[3, 1], 
                [1, 2]])

T_A_1 = np.array([[3, 1],
                [-1, 1]])

T_A_2 = np.array([[1, -1], 
                [1, 1]])

T_A_list = np.array((T_A_0, T_A_1, T_A_2))

const_range = 5
plt.figure(figsize=(30, 10))
plt.subplots_adjust(wspace=0.1)
labels = ["Grids", "Displacement vectors made by A", "Real eigen vectors of A"]  
title_list = [r"A_1 has two different real eigen vectors.", r"A_2 has two identical real unit eigen vectors.",  r"A_3 has only imaginary eigen vectors."]
for idx in range(len(T_A_list)): 
    
    eigen_values, eigen_vectors = np.linalg.eig(T_A_list[idx])
    sorted_idx = eigen_values.argsort()[::-1]   
    eigen_values = eigen_values[sorted_idx]
    eigen_vectors = eigen_vectors[:,sorted_idx]
    eigen_vectors = eigen_vectors.astype(float)
        
    for i in range(len(eigen_vectors)):
        if(eigen_vectors.T[i][0] < 0):
            eigen_vectors.T[i] = - eigen_vectors.T[i]
        

    X = np.arange(-const_range, const_range + 1, 1)
    Y = np.arange(-const_range, const_range + 1, 1)
    U_x, U_y = np.meshgrid(X, Y)

    V_x = np.zeros((len(U_x), len(U_y)))
    V_y = np.zeros((len(U_x), len(U_y)))

    temp_vec = np.zeros((1, 2))

    W_x = np.zeros((len(U_x), len(U_y)))
    W_y = np.zeros((len(U_x), len(U_y)))

    plt.subplot(1, 3, idx + 1)


    for i in range(len(U_x)):
        for j in range(len(U_y)):
            temp_vec[0][0] = U_x[i][j]
            temp_vec[0][1] = U_y[i][j]
        
            temp_vec[0] = np.dot(T_A_list[idx], temp_vec[0])
        
            V_x[i][j] = temp_vec[0][0]
            V_y[i][j] = temp_vec[0][1]
        
            W_x[i][j] = V_x[i][j] - U_x[i][j]
            W_y[i][j] = V_y[i][j] - U_y[i][j]
            #plt.arrow(0, 0, V_x[i][j], V_y[i][j], head_width=0.1, color='red')
            plt.arrow(0, 0, U_x[i][j], U_y[i][j], head_width=0.3, color='dimgrey', label=labels[0])
            plt.arrow(U_x[i][j], U_y[i][j], W_x[i][j], W_y[i][j], head_width=0.3, color='darkviolet', label=labels[1])
            
            range_const = 20
            plt.xlim([-range_const, range_const])
            plt.ylim([-range_const, range_const])
            plt.title(title_list[idx], fontsize=25)
            
            if idx==2:
                continue
 
            plt.arrow(0, 0, eigen_vectors.T[0][0]*10, eigen_vectors.T[0][1]*10, head_width=1, color='orange', label=labels[2])
            plt.arrow(0, 0, eigen_vectors.T[1][0]*10, eigen_vectors.T[1][1]*10, head_width=1, color='orange', label=labels[2])

grey_patch = mpatches.Patch(color='grey', label='Grids')
purple_patch = mpatches.Patch(color='darkviolet', label='Displacement vectors made by A')
yellow_patch = mpatches.Patch(color='gold', label='Real eigen vectors of A')
plt.legend(handles=[grey_patch, purple_patch, yellow_patch], fontsize=25, loc='lower right', bbox_to_anchor=(-0.1, -.35))
#plt.savefig("linear_transformation.png")
plt.show()
import numpy as np 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.proj3d import proj_transform
from mpl_toolkits.mplot3d.axes3d import Axes3D
from matplotlib.text import Annotation
from matplotlib.patches import FancyArrowPatch
import matplotlib.patches as mpatches

class Annotation3D(Annotation):
    def __init__(self, text, xyz, *args, **kwargs):
        super().__init__(text, xy=(0,0), *args, **kwargs)
        self._xyz = xyz

    def draw(self, renderer):
        x2, y2, z2 = proj_transform(*self._xyz, renderer.M)
        self.xy=(x2,y2)
        super().draw(renderer)

def _annotate3D(ax,text, xyz, *args, **kwargs):
    '''Add anotation `text` to an `Axes3d` instance.'''

    annotation= Annotation3D(text, xyz, *args, **kwargs)
    ax.add_artist(annotation)

setattr(Axes3D,'annotate3D',_annotate3D)

class Arrow3D(FancyArrowPatch):
    def __init__(self, x, y, z, dx, dy, dz, *args, **kwargs):
        super().__init__((0,0), (0,0), *args, **kwargs)
        self._xyz = (x,y,z)
        self._dxdydz = (dx,dy,dz)

    def draw(self, renderer):
        x1,y1,z1 = self._xyz
        dx,dy,dz = self._dxdydz
        x2,y2,z2 = (x1+dx,y1+dy,z1+dz)

        xs, ys, zs = proj_transform((x1,x2),(y1,y2),(z1,z2), renderer.M)
        self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
        super().draw(renderer)

def _arrow3D(ax, x, y, z, dx, dy, dz, *args, **kwargs):
    '''Add an 3d arrow to an `Axes3D` instance.'''

    arrow = Arrow3D(x, y, z, dx, dy, dz, *args, **kwargs)
    ax.add_artist(arrow)

setattr(Axes3D,'arrow3D',_arrow3D)

T_A = np.array([[60.45, 33.63, 46.29], 
                [33.63, 68.49, 50.93], 
                [46.29, 50.93, 53.61]])

T_A = T_A/50
const_range = 2


X = np.arange(-const_range, const_range + 1, 1)
Y = np.arange(-const_range, const_range + 1, 1)
Z = np.arange(-const_range, const_range + 1, 1)

U_x, U_y, U_z = np.meshgrid(X, Y, Z)

V_x = np.zeros((len(U_x), len(U_y), len(U_z)))
V_y = np.zeros((len(U_x), len(U_y), len(U_z)))
V_z = np.zeros((len(U_x), len(U_y), len(U_z)))

temp_vec = np.zeros((1, 3))

W_x = np.zeros((len(U_x), len(U_y), len(U_z)))
W_y = np.zeros((len(U_x), len(U_y), len(U_z)))
W_z = np.zeros((len(U_x), len(U_y), len(U_z)))

eigen_values, eigen_vectors = np.linalg.eig(T_A)
sorted_idx = eigen_values.argsort()[::-1]   
eigen_values = eigen_values[sorted_idx]
eigen_vectors = eigen_vectors[:,sorted_idx]
eigen_vectors = eigen_vectors.astype(float)

fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111, projection='3d')
grid_range = const_range + 5
ax.set_xlim(-grid_range, grid_range)
ax.set_ylim(-grid_range, grid_range)
ax.set_zlim(-grid_range, grid_range)

eigen_values, eigen_vectors = np.linalg.eig(T_A)
sorted_idx = eigen_values.argsort()[::-1]   
eigen_values = eigen_values[sorted_idx]
eigen_vectors = eigen_vectors[:,sorted_idx]
eigen_vectors = eigen_vectors.astype(float)
     
    
for i in range(len(eigen_vectors)):
    if(eigen_vectors.T[i][0] < 0):
        eigen_vectors.T[i] = - eigen_vectors.T[i]

for i in range(len(U_x)):
    for j in range(len(U_x)):
        for k in range(len(U_x)):
            temp_vec[0][0] = U_x[i][j][k]
            temp_vec[0][1] = U_y[i][j][k]
            temp_vec[0][2] = U_z[i][j][k]
        
            temp_vec[0] = np.dot(T_A, temp_vec[0])
        
            V_x[i][j][k] = temp_vec[0][0]
            V_y[i][j][k] = temp_vec[0][1]
            V_z[i][j][k] = temp_vec[0][2]
        
            W_x[i][j][k] = V_x[i][j][k] - U_x[i][j][k]
            W_y[i][j][k] = V_y[i][j][k] - U_y[i][j][k]
            W_z[i][j][k] = V_z[i][j][k] - U_z[i][j][k]
            ax.arrow3D(0, 0, 0, 
                       U_x[i][j][k], U_y[i][j][k], U_z[i][j][k], 
                       mutation_scale=10, arrowstyle="-|>", fc='dimgrey', ec='dimgrey')
            #ax.arrow3D(0, 0, 0, 
            #          V_x[i][j][k], V_y[i][j][k], V_z[i][j][k], 
            #           mutation_scale=10, arrowstyle="-|>", fc='red', ec='red')
            ax.arrow3D(U_x[i][j][k], U_y[i][j][k], U_z[i][j][k], 
                       W_x[i][j][k], W_y[i][j][k], W_z[i][j][k],
                       mutation_scale=10, arrowstyle="-|>", fc='darkviolet', ec='darkviolet')
            
ax.arrow3D(0, 0, 0, eigen_vectors.T[0][0]*10, eigen_vectors.T[0][1]*10, eigen_vectors.T[0][2]*10,
                       mutation_scale=10,  arrowstyle="-|>", fc='orange', ec='orange')
ax.arrow3D(0, 0, 0, eigen_vectors.T[1][0]*10, eigen_vectors.T[1][1]*10, eigen_vectors.T[1][2]*10,
                       mutation_scale=10, arrowstyle="-|>", fc='orange', ec='orange')
ax.arrow3D(0, 0, 0, eigen_vectors.T[2][0]*10, eigen_vectors.T[2][1]*10, eigen_vectors.T[2][2]*10,
                       mutation_scale=10, arrowstyle="-|>", fc='orange', ec='orange')

ax.text(eigen_vectors.T[0][0]*8 , eigen_vectors.T[0][1]*8, eigen_vectors.T[0][2]*8+1, r'v_1', fontsize=20)
ax.text(eigen_vectors.T[1][0]*8 , eigen_vectors.T[1][1]*8, eigen_vectors.T[1][2]*8, r'v_2', fontsize=20)
ax.text(eigen_vectors.T[2][0]*8 , eigen_vectors.T[2][1]*8, eigen_vectors.T[2][2]*8, r'v_3', fontsize=20)


grey_patch = mpatches.Patch(color='grey', label='Grids')
orange_patch = mpatches.Patch(color='orange', label='Orthogonal eigen vectors of A')
purple_patch = mpatches.Patch(color='darkviolet', label='Displacement vectors made by A')
plt.legend(handles=[grey_patch, orange_patch, purple_patch], fontsize=20, loc='lower right')

ax.set_xlabel(r'x_1', fontsize=25)
ax.set_ylabel(r'x_2', fontsize=25)
ax.set_zlabel(r'x_3', fontsize=25)
#plt.savefig("symmetric_positive_definite_visualizaiton.png")
plt.show()

 

Spiky cubes, Pac-Man walking, empty M&M’s chocolate: curse of dimensionality

This is the first article of the article series Illustrative introductions on dimension reduction.

“Curse of dimensionality” means the difficulties of machine learning which arise when the dimension of data is higher. In short if the data have too many features like “weight,” “height,” “width,” “strength,” “temperature”…., that can undermine the performances of machine learning. The fact might be contrary to your image which you get from the terms “big” data or “deep” learning. You might assume that the more hints you have, the better the performances of machine learning are. There are some reasons for curse of dimensionality, and in this article I am going to introduce two major reasons below.

  1. High dimensional data usually have rich expressiveness, but usually training data are too poor for that.
  2. The behaviors of data points in high dimensional space are totally different from our common sense.

Through these topics, you will see that you always have to think about which features to use considering the number of data points.

*From now on I am going to talk about only Euclidean distance. If you are not sure what Euclidean distance means, please just keep it in mind that it is the type of distance most people wold have learnt in normal compulsory education.

1. Number of samples and degree of dimension

The most straightforward demerit of adding many features, or increasing dimensions of data, is the growth of computational costs. More importantly, however, you always have to think about the degree of dimensions in relation of the number of data points you have. Let me take a simple example in a book “Pattern Recognition and Machine Learning” by C. M. Bishop (PRML). This is an example of measurements of a pipeline. The figure below shows a comparison plot of 3 classes (red, green and blue), with parameter x_7 plotted against parameter x_6 out of 12 parameters.

* The meaning of data is not important in this article. If you are interested please refer to the appendix in PRML.

Assume that we are interested in classifying the cross in black into one of the three classes. One of the most naive ideas of this classification is dividing the graph into grids and labeling each grid depending on the number of samples in the classes (which are colored at the right side of the figure). And you can classify the test sample, the cross in black, into the class of the grid where the test sample is in. Thereby the cross is classified to the class in red.

Source: C.M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 34-35

As I mentioned in the figure above, we used only two features out of 12 features in total. When the total number of data points is fixed and you add remaining ten axes/features one after another, what would happen? Let’s see what “adding axes/features” means. If you are talking about 1, 2, or 3 dimensional grids, you can visualize them. And as you can see from the figure below, if you make each 10^1, 10^2, 100^3 grids respectively in 1, 2, 3 dimensional spaces, the number of the small regions in the grids are respectively 10, 100, 1000. Even though you cannot visualize it anymore, you can make grids for more than 3 dimensional data. If you continue increasing the degree of dimension, the number of grids increases exponentially, and that can soon surpass the number of training data points. That means there would be a lot of empty spaces in such high dimensional grids. And the classifying method above: coloring each grid and classifying unknown samples depending on the colors of the grids, does not work out anymore because there would be a lot of empty grids.

* If you are still puzzled by the idea of “more than 3 dimensional grids,” you should not think too much about that now. It is enough if you can get some understandings on high dimensional data after reading the whole article of this.

Source: Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning, (2016), MIT Press, p. 153

I said the method above is the most naive way, but other classical classification methods , for example k-nearest neighbors algorithm, are more or less base on a similar idea. Many of classical machine learning algorithms are based on the idea of smoothness prior, or local constancy prior. In short in classical ways, you  do not expect data to change so much in a small region, so you can expect unknown samples to be similar to data in vicinity. But that soon turns out to be problematic when the dimension of data is bigger because training data would be sparse because the area of multidimensional space grows exponentially as I mentioned above. And sometimes you would not be able to find training data around test data. Plus, in high dimensional data, you cannot treat distance in the same as you do in lower dimensional space. The ideas of “close,” “nearby,” or “vicinity” get more obscure in high dimensional data. That point is related to the next topic: the intuition have cultivated in normal life is not applicable to higher dimensional data.

2. Bizarre characteristics of high dimensional data

We form our sense of recognition in 3-dimensional ways in our normal life. Even though we can visualize only 1, 2, or 3 dimensional data, we can actually generalize the ideas in 1, 2, or 3 dimensional ideas to higher dimensions. For example 4 dimensional cubes, 100 dimensional spheres, or orthogonality in 255 dimensional space. Again, you cannot exactly visualize those ideas, and for many people, such high dimensional phenomenon are just imaginary matters on blackboards. Those high dimensional ideas are designed to retain some conditions just as well as 1, 2, or 3 dimensional space. Let’s take an example of spheres in several dimensional spaces. General spheres in any D-dimensional space can be defined as a set of any \boldsymbol{x}, such that |\boldsymbol{x} - \boldsymbol{c}| = r, where \boldsymbol{c} is the center point and r is length of radius. When \boldsymbol{x} is 2-dimensional, the spheres are called “circles.” When \boldsymbol{x} is 3-dimensional, the spheres are called “spheres” in our normal life, unless it is used in a conversation in a college cafeteria, by some students in mathematics department. And when \boldsymbol{x} is D-dimensional, they are called D-ball, and again, this is just a imaginary phenomenon on blackboard.

* Vectors and points are almost the same because all the vectors are denoted as “arrows” from the an origin point to sample data points.  The only difference is that when you use vectors, you have to consider their directions.

* “D-ball” is usually called “n-ball,” and in such context it is a sphere in a n-dimensional space. But please let me use the term “D-ball” in this article.

Not only spheres, but only many other ideas have been generalized to D-dimensional space, and many of them are indispensable also for data science. But there is one severe problem: the behaviors of data in high dimensional field is quite different from those in two or three dimensional space. To be concrete, in high dimensional field, cubes are spiky, you have to move like Pac-Man, and M & M’s Chocolate looks empty inside but tastes normal.

2.1: spiky cubes
Let’s take a look at an elementary-school-level example of geometry first. Assume that you have several unit squares or unit cubes like below. In each of them a circle or sphere with diameter 1 is inscribed. The length of a diagonal line in each square is \sqrt{2}, and that in each cube is \sqrt{3}.

If you stack the squares or cubes as below, what are the length of diameters of the blue circle or sphere, circumscribing all the 4 orange circles or the 8 orange spheres?

The answers are, the diameter of the blue circle is \sqrt{2} - 1, and the diameter of the blue sphere is \sqrt{3} - 1.

Next let’s think about the same situation in higher dimensional space. Assume that there are some unit D-dimensional hypercubes stacked, in each of which a D-ball with diameter 1 is inscribed, touching all the surfaces inside. Then what is the length of the diameter of  a D-ball circumscribing all the unit D-ball in the hypercubes ? Given the results above, it ca be predicted that its diameter is \sqrt{D}  -1. If that is true, there is one strange point: \sqrt{D} - 1 can soon surpass 2: that means in the chart above the blue sphere will stick out of the stacked cubes. That sounds like a paradox, but with one hypothesis, the phenomenon makes sense: cubes become more spiky as the degree of dimension grows. This hypothesis is a natural deduction because diagonal lines of hyper cubes get longer, and the the center of each surface of hypercubes still touches the unit D-ball with diameter 1, inscribing inscribing inside each unit hypercube.

If you stack 4 hypercubes, the blue sphere circumscribing them will not stick out of the stacked hypercubes anymore like the figure below.

*Of course you cannot visualize what is going on in D-dimensional space, so the figure below is just a pseudo simulation of D-dimensional space in our 3-dimensional sense. I guess you have to stack more than four hyper cubes in higher dimensional data, but you cannot easily imagine what will go on in such space anymore.

 

*You can confirm the fact that hypercube gets more spiky as the degree of dimension growth, by comparing the volume of the hypercube and the volume of the D-ball inscribed inside the hypercube. Thereby you can prove that the volume of hypercube concentrates on the corners of the hypercube. Plus, as I mentioned the longest diagonal distance of hypercube gets longer as dimension degree increases. That is why hypercube is said to be spiky. For mathematical proof, please check the Exercise 1.19 of PRML.

2.2: Pac-Man walking

Next intriguing phenomenon in high dimensional field is that most of pairs of vectors in high dimensional space are orthogonal. In other words, if you select two random vectors in high dimensional space, the angle between them are mostly close to 90^\circ. Let’s see the general meaning of angle between two vectors in any dimensional spaces. Assume that the angle between two vectors \boldsymbol{u}, and \boldsymbol{v} is \theta, then cos\theta is calculated as cos\theta = \frac{<\boldsymbol{u}, \boldsymbol{v}>}{|\boldsymbol{u}||\boldsymbol{v}|}. In 1, 2, or 3 dimensional space, you can actually see the angle, but again you can define higher dimensional angle, which you cannot visualize anymore. And angles are sometimes used as similarity of two vectors.

* <\boldsymbol{u}, \boldsymbol{v}> is the inner product of \boldsymbol{u}, and \boldsymbol{v}.

Assume that you generate a pair of two points inside a D-dimensional unit sphere and make two vectors \boldsymbol{u}, and \boldsymbol{v} by connecting the origin point and those two points respectively. When D is 2, I mean spheres are circles in this case, any \theta are equally generated as in the chart below. The fact might be the same as your intuition.   How about in 3-dimensional space? In fact the distribution of \theta is not uniform. \theta = 90^\circ is the most likely to be generated. As I explain in the figure below, if you compare the area of cross section of a hemisphere and the area of a cone whose vertex is the center point of the sphere, you can see why.

I generated 10000 random pairs of points in side a D-dimensional unit sphere, and calculated the angle between them. In other words I just randomly generated two D-dimensional vectors \boldsymbol{u} and \boldsymbol{v}, whose elements are randomly generated values between -1 and 1, and calculated the angle between them, repeating this process 10000 times. The chart below are the histograms of angle between pairs of generated vectors in respectively 2, 3, 50, and 100 dimensional space.

As I explained above, in 2-dimensional space, the distribution of \theta is almost uniform. However the distribution concentrates a little around 90^\circ in 3-dimensional space. You can see that the bigger the degree of dimension is, the more the angles of generated vectors concentrate around 90^\circ. That means most pairs of vectors in high dimensional space are close to orthogonal. Movements are also sequence of vectors, so when most pairs of movement vectors are orthogonal, that means you can only move like Pac-Man in such space.

Source: https://edition.cnn.com/style/article/pac-man-40-anniversary-history/index.html

* Of course I am talking about arcade Mac-Man game. Not Pac-Man in Super Smash Bros.  Retro RPG video games might have more similar playability, but in high dimensional space it is also difficult to turn back. At any rate, I think you have understood it is even difficult to move smoothly in high dimensional space, just like the first notorious Resident Evil on the first PS console also had terrible playability .

2.3: empty M & M’s chocolate

Let’s think about the proportion of the volume of the outermost \epsilon surface of general spheres with radius r. First, in 2 two dimensional space, spheres are circles. The area of the brown part of the circle below is \pi r^2. In order calculate the are of \epsilon \cdot r thick surface of the circle, you have only to subtract the area of \pi \{ (1 - \epsilon)\cdot r\} ^2. When \epsilon = 0.01, the area of outer most surface is \pi r^2 - \pi (0.99\cdot r)^2, and its proportion to the area of the whole circle is \frac{\pi r^2 - \pi (0.99\cdot r)^2}{\pi r^2} = 0.0199.

In case of 3-dimensional space, the value of a sphere with radius r is \frac{4}{3} \pi r^2, so the proportion of the \epsilon surface is calculated in the same way: \frac{\frac{4}{3} \pi r^3 -\frac{4}{3} \pi (0.99\cdot r)^2}{\frac{4}{3}\pi r^2} = 0.0297. Compared to the case in 2 dimensional space, the proportion is a little bigger.

How about in D-dimensional space? We have seen that even in  D-dimensional space the surface of a sphere, I mean D-ball, can be defined as a set of any points whose distance from the center point is all r. And it is known that the volume of D-ball is defined as below.

\Gamma () is called gamma function, but in this article it is not so important. The most important point now is, if you discuss any D-ball, their volume only depends on their radius r. That meas the proportion of outer \epsilon surface of D-ball is calculated as \frac{\pi r^2 - \pi \{ (1 - \epsilon)\cdot r\} ^2}{\pi r^2}. When \epsilon is 0.01, the proportion of the 1% surface of D-ball changes like in the chart below.

* And of course when D is 2,  \frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} = \pi, and when D is 3 ,  \frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} = \frac{4}{3} \pi

You can see that when D is over 400, around 90% of volume is concentrated in the very thin 1% surface of D-ball. That is why, in high dimensional space, M & M’s chocolate look empty but tastes normal: all the chocolate are concentrated beneath the sugar coating.

More interestingly, even if you choose any points as a central point of a sphere with radius r, the other points are squashed to the surface of the sphere, even if all the data points are uniformly distributed. This situation is problematic for classical machine learning algorithms, which are often based on the Euclidean distances between pairs of two sample data points: if you go from the central point to another sample point, the possibility of finding the point within (1 - \epsilon)\cdot r radius of the center is almost zero. But if you reach the outermost \epsilon part of the surface of the sphere, most data points are there. However, for one of the data points in the surface, any other data points are distant in the same way.

Inside M & M’s chocolate is a mysterious world.

Source: https://hipwallpaper.com/mms-wallpapers/

You have seen that using high dimensional data can be problematic in many ways. Data science and machine learning are largely based on one idea: you can find a lower dimensional meaningful and easier structure in data. In the next articles I am going to introduce some famous dimension reduction algorithms. And hopefully I would like to give some deeper insights in to these algorithms, in straightforward ways.

* I could not explain the relationships of variance and bias of data. This is also a very important factor when you think about dimensionality of data. I hope I can write about this topic someday. You can also look it up if you are interested.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Illustrative introductions on dimension reduction

“What is your image on dimensions?”

….That might be a cheesy question to ask to reader of Data Science Blog, but most people, with no scientific background, would answer “One dimension is a line, and two dimension is a plain, and we live in three-dimensional world.” After that if you ask “How about the fourth dimension?” many people would answer “Time?”

You can find books or writings about dimensions in various field. And you can use the word “dimension” in normal conversations, in many contexts.

*In Japanese, if you say “He likes two dimension.” that means he prefers anime characters to real women, as is often the case with Japanese computer science students.

The meanings of “dimensions” depend on the context, but in data science dimension is usually the number of rows of your Excel data.

When you study data science or machine learning, usually you should start with understanding the algorithms with 2 or 3 dimensional data, and you can apply those ideas to any D dimensional data. But of course you cannot visualize D dimensional data anymore, and you always have to be careful of what happens if you expand degree of dimension.

Conversely it is also important to reduce dimension to understand abstract high dimensional stuff in 2 or 3 dimensional space, which are close to our everyday sense. That means dimension reduction is one powerful way of data visualization.

In this blog series I am going to explain meanings of dimension itself in machine learning context and algorithms for dimension reductions, such as PCA, LDA, and t-SNE, with 2 or 3 dimensional visible data. Along with that, I am going to delve into the meaning of calculations so that you can understand them in more like everyday-life sense.

This article series is going to be roughly divided into the contents below.

  1. Curse of Dimensionality
  2. PCA, LDA (to be published soon)
  3. Rethinking eigen vectors (to be published soon)
  4. KL expansion and subspace method (to be published soon)
  5. Autoencoder as dimension reduction (to be published soon)
  6. t-SNE (to be published soon)

I hope you could see that reducing dimension is one of the fundamental approaches in data science or machine learning.

Data Analytics and Mining for Dummies

Data Analytics and Mining is often perceived as an extremely tricky task cut out for Data Analysts and Data Scientists having a thorough knowledge encompassing several different domains such as mathematics, statistics, computer algorithms and programming. However, there are several tools available today that make it possible for novice programmers or people with no absolutely no algorithmic or programming expertise to carry out Data Analytics and Mining. One such tool which is very powerful and provides a graphical user interface and an assembly of nodes for ETL: Extraction, Transformation, Loading, for modeling, data analysis and visualization without, or with only slight programming is the KNIME Analytics Platform.

KNIME, or the Konstanz Information Miner, was developed by the University of Konstanz and is now popular with a large international community of developers. Initially KNIME was originally made for commercial use but now it is available as an open source software and has been used extensively in pharmaceutical research since 2006 and also a powerful data mining tool for the financial data sector. It is also frequently used in the Business Intelligence (BI) sector.

KNIME as a Data Mining Tool

KNIME is also one of the most well-organized tools which enables various methods of machine learning and data mining to be integrated. It is very effective when we are pre-processing data i.e. extracting, transforming, and loading data.

KNIME has a number of good features like quick deployment and scaling efficiency. It employs an assembly of nodes to pre-process data for analytics and visualization. It is also used for discovering patterns among large volumes of data and transforming data into more polished/actionable information.

Some Features of KNIME:

  • Free and open source
  • Graphical and logically designed
  • Very rich in analytics capabilities
  • No limitations on data size, memory usage, or functionalities
  • Compatible with Windows ,OS and Linux
  • Written in Java and edited with Eclipse.

A node is the smallest design unit in KNIME and each node serves a dedicated task. KNIME contains graphical, drag-drop nodes that require no coding. Nodes are connected with one’s output being another’s input, as a workflow. Therefore end-to-end pipelines can be built requiring no coding effort. This makes KNIME stand out, makes it user-friendly and make it accessible for dummies not from a computer science background.

KNIME workflow designed for graduate admission prediction

KNIME workflow designed for graduate admission prediction

KNIME has nodes to carry out Univariate Statistics, Multivariate Statistics, Data Mining, Time Series Analysis, Image Processing, Web Analytics, Text Mining, Network Analysis and Social Media Analysis. The KNIME node repository has a node for every functionality you can possibly think of and need while building a data mining model. One can execute different algorithms such as clustering and classification on a dataset and visualize the results inside the framework itself. It is a framework capable of giving insights on data and the phenomenon that the data represent.

Some commonly used KNIME node groups include:

  • Input-Output or I/O:  Nodes in this group retrieve data from or to write data to external files or data bases.
  • Data Manipulation: Used for data pre-processing tasks. Contains nodes to filter, group, pivot, bin, normalize, aggregate, join, sample, partition, etc.
  • Views: This set of nodes permit users to inspect data and analysis results using multiple views. This gives a means for truly interactive exploration of a data set.
  • Data Mining: In this group, there are nodes that implement certain algorithms (like K-means clustering, Decision Trees, etc.)

Comparison with other tools 

The first version of the KNIME Analytics Platform was released in 2006 whereas Weka and R studio were released in 1997 and 1993 respectively. KNIME is a proper data mining tool whereas Weka and R studio are Machine Learning tools which can also do data mining. KNIME integrates with Weka to add machine learning algorithms to the system. The R project adds statistical functionalities as well. Furthermore, KNIME’s range of functions is impressive, with more than 1,000 modules and ready-made application packages. The modules can be further expanded by additional commercial features.

5 Things You Should Know About Data Mining

The majority of people spend about twenty-four hours online every week. In that time they give out enough information for big data to know a lot about them. Having people collecting and compiling your data might seem scary but it might have been helpful for you in the past.

 

If you have ever been surprised to find an ad targeted toward something you were talking about earlier or an invention made based on something you were googling, then you already know that data mining can be helpful. Advanced education in data mining can be an awesome resource, so it may pay to have a personal tutor skilled in the area to help you understand. 

 

It is understandable to be unsure of a system that collects all of the information online so that they can learn more about you. Luckily, so much data is put out every day it is unlikely data mining is focusing on any of your important information. Here are a few statistics you should know about mining.

 

1. Data Mining Is Used In Crime Scenes

Using a variation of earthquake prediction software and data, the Los Angeles police department and researchers were able to predict crime within five hundred feet. As they learn how to compile and understand more data patterns, crime detecting will become more accurate.

 

Using their data the Los Angeles police department was able to stop thief activity by thirty-three percent. They were also able to predict violent crime by about twenty-one percent. Those are not perfect numbers, but they are better than before and will get even more impressive as time goes on. 

 

The fact that data mining is able to pick up on crime statistics and compile all of that data to give an accurate picture of where crime is likely to occur is amazing. It gives a place to look and is able to help stop crime as it starts.

 

2. Data Mining Helps With Sales

A great story about data mining in sales is the example of Walmart putting beer near the diapers. The story claims that through measuring statistics and mining data it was found that when men purchase diapers they are also likely to buy a pack of beer. Walmart collected that data and put it to good use by putting the beer next to the diapers.

 

The amount of truth in that story/example is debatable, but it has made data mining popular in most retail stores. Finding which products are often bought together can give insight into where to put products in a store. This practice has increased sales in both items immensely just because people tend to purchase items near one another more than they would if they had to walk to get the second item. 

 

Putting a lot of stock in the data-gathering teams that big stores build does not always work. There have been plenty of times when data teams failed and sales plummeted. Often, the benefits outweigh the potential failure, however, and many stores now use data mining to make a lot of big decisions about their sales.

 

3. It’s Helping With Predicting Disease 

 

In 2009 Google began work to be able to predict the winter flu. Google went through the fifty million most searched words and then compared them with what the CDC was finding during the 2003-2008 flu seasons. With that information google was able to help predict the next winter flu outbreak even down to the states it hit the hardest. 

 

Since 2009, data mining has gotten much better at predicting disease. Since the internet is a newer invention it is still growing and data mining is still getting better. Hopefully, in the future, we will be able to predict disease breakouts quickly and accurately. 

 

With new data mining techniques and research in the medical field, there is hope that doctors will be able to narrow down problems in the heart. As the information grows and more data is entered the medical field gets closer to solving problems through data. It is something that is going to help cure diseases more quickly and find the root of a problem.

 

4. Some Data Mining Gets Ignored

Interestingly, very little of the data that companies collect from you is actually used. “Big data Companies” do not use about eighty-eight percent of the data they have. It is incredibly difficult to use all of the millions of bits of data that go through big data companies every day.

 

The more people that are used for data mining and the more data companies are actually able to filter through, the better the online experience will be. It might be a bit frightening to think of someone going through what you are doing online, but no one is touching any of the information that you keep private. Big data is using the information you put out into the world and using that data to come to conclusions and make the world a better place.

 

There is so much information being put onto the internet at all times. Twenty-four hours a week is the average amount of time a single person spends on the internet, but there are plenty of people who spend more time than that. All of that information takes a lot of people to sift through and there are not enough people in the data mining industry to currently actually go through the majority of the data being put online.

 

5. Too Many Data Mining Jobs

Interestingly, the data industry is booming. In general, there are an amazing amount of careers opening on the internet every day. The industry is growing so quickly that there are not enough people to fill the jobs that are being created.

 

The lack of talent in the industry means there is plenty of room for new people who want to go into the data mining industry. It was predicted that by 2018 there would be a shortage of 140,000 with deep analytical skills. With the lack of jobs that are being discussed, it is amazing that there is such a shortage in the data industry. 

 

If big data is only able to wade through less than half of the data being collected then we are wasting a resource. The more people who go into an analytics or computer career the more information we will be able to collect and utilize. There are currently more jobs than there are people in the data mining field and that needs to be corrected.

 

To Conclude

The data mining industry is making great strides. Big data is trying to use the information they collect to sell more things to you but also to improve the world. Also, there is something very convenient about your computer knowing the type of things you want to buy and showing you them immediately. 

 

Data mining has been able to help predict crime in Los Angeles and lower crime rates. It has also helped companies know what items are commonly purchased together so that stores can be organized more efficiently. Data mining has even been able to predict the outbreak of disease down to the state.

 

Even with so much data being ignored and so many jobs left empty, data mining is doing incredible things. The entire internet is constantly growing and the data mining is growing right along with it. As the data mining industry climbs and more people find their careers mining data the more we will learn and the more facts we will find.

 

Python vs R: Which Language to Choose for Deep Learning?

Data science is increasingly becoming essential for every business to operate efficiently in this modern world. This influences the processes composed together to obtain the required outputs for clients. While machine learning and deep learning sit at the core of data science, the concepts of deep learning become essential to understand as it can help increase the accuracy of final outputs. And when it comes to data science, R and Python are the most popular programming languages used to instruct the machines.

Python and R: Primary Languages Used for Deep Learning

Deep learning and machine learning differentiate based on the input data type they use. While machine learning depends upon the structured data, deep learning uses neural networks to store and process the data during the learning. Deep learning can be described as the subset of machine learning, where the data to be processed is defined in another structure than a normal one.

R is developed specifically to support the concepts and implementation of data science and hence, the support provided by this language is incredible as writing codes become much easier with its simple syntax.

Python is already much popular programming language that can serve more than one development niche without straining even for a bit. The implementation of Python for programming machine learning algorithms is very much popular and the results provided are accurate and faster than any other language. (C or Java). And because of its extended support for data science concept implementation, it becomes a tough competitor for R.

However, if we compare the charts of popularity, Python is obviously more popular among data scientists and developers because of its versatility and easier usage during algorithm implementation. However, R outruns Python when it comes to the packages offered to developers specifically expertise in R over Python. Therefore, to conclude which one of them is the best, let’s take an overview of the features and limits offered by both languages.

Python

Python was first introduced by Guido Van Rossum who developed it as the successor of ABC programming language. Python puts white space at the center while increasing the readability of the developed code. It is a general-purpose programming language that simply extends support for various development needs.

The packages of Python includes support for web development, software development, GUI (Graphical User Interface) development and machine learning also. Using these packages and putting the best development skills forward, excellent solutions can be developed. According to Stackoverflow, Python ranks at the fourth position as the most popular programming language among developers.

Benefits for performing enhanced deep learning using Python are:

  • Concise and Readable Code
  • Extended Support from Large Community of Developers
  • Open-source Programming Language
  • Encourages Collaborative Coding
  • Suitable for small and large-scale products

The latest and stable version of Python has been released as Python 3.8.0 on 14th October 2019. Developing a software solution using Python becomes much easier as the extended support offered through the packages drives better development and answers every need.

R

R is a language specifically used for the development of statistical software and for statistical data analysis. The primary user base of R contains statisticians and data scientists who are analyzing data. Supported by R Foundation for statistical computing, this language is not suitable for the development of websites or applications. R is also an open-source environment that can be used for mining excessive and large amounts of data.

R programming language focuses on the output generation but not the speed. The execution speed of programs written in R is comparatively lesser as producing required outputs is the aim not the speed of the process. To use R in any development or mining tasks, it is required to install its operating system specific binary version before coding to run the program directly into the command line.

R also has its own development environment designed and named RStudio. R also involves several libraries that help in crafting efficient programs to execute mining tasks on the provided data.

The benefits offered by R are pretty common and similar to what Python has to offer:

  • Open-source programming language
  • Supports all operating systems
  • Supports extensions
  • R can be integrated with many of the languages
  • Extended Support for Visual Data Mining

Although R ranks at the 17th position in Stackoverflow’s most popular programming language list, the support offered by this language has no match. After all, the R language is developed by statisticians for statisticians!

Python vs R: Should They be Really Compared?

Even when provided with the best technical support and efficient tools, a developer will not be able to provide quality outputs if he/she doesn’t possess the required skills. The point here is, technical skills rank higher than the resources provided. A comparison of these two programming languages is not advisable as they both hold their own set of advantages. However, the developers considering to use both together are less but they obtain maximum benefit from the process.

Both these languages have some features in common. For example, if a representative comes asking you if you lend technical support for developing an uber clone, you are directly going to decline as Python and R both do not support mobile app development. To benefit the most and develop excellent solutions using both these programming languages, it is advisable to stop comparing and start collaborating!

R and Python: How to Fit Both In a Single Program

Anticipating the future needs of the development industry, there has been a significant development to combine these both excellent programming languages into one. Now, there are two approaches to performing this: either we include R script into Python code or vice versa.

Using the available interfaces, packages and extended support from Python we can include R script into the code and enhance the productivity of Python code. Availability of PypeR, pyRserve and more resources helps run these two programming languages efficiently while efficiently performing the background work.

Either way, using the developed functions and packages made available for integrating Python in R are also effective at providing better results. Available R packages like rJython, rPython, reticulate, PythonInR and more, integrating Python into R language is very easy.

Therefore, using the development skills at their best and maximizing the use of such amazing resources, Python and R can be togetherly used to enhance end results and provide accurate deep learning support.

Conclusion

Python and R both are great in their own names and own places. However, because of the wide applications of Python in almost every operation, the annual packages offered to Python developers are less than the developers skilled in using R. However, this doesn’t justify the usability of R. The ultimate decision of choosing between these two languages depends upon the data scientists or developers and their mining requirements.

And if a developer or data scientist decides to develop skills for both- Python and R-based development, it turns out to be beneficial in the near future. Choosing any one or both to use in your project depends on the project requirements and expert support on hand.