The algorithm known as PCA and my taxonomy of linear dimension reductions

In one of my previous articles, I explained the importance of reducing dimensions. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are the simplest types of dimension reduction algorithms. In upcoming articles of mine, you are going to see what these algorithms do. In conclusion, diagonalization, which I mentioned in the last article, is what these algorithms are all about, but still in this article I can mainly cover only PCA.

This article is largely based on the explanations in Pattern Recognition and Machine Learning by C. M. Bishop (which is often called “PRML”), and when you search “PCA” on the Internet, you will find more or less similar explanations. However I hope I can go some steps ahead throughout this article series. I mean, I am planning to also cover more generalized versions of PCA, meanings of diagonalization, the idea of subspace. I believe this article series is also effective for refreshing your insight into linear algebra.

*This is the third article of my article series “Illustrative introductions on dimension reduction.”

1. My taxonomy on linear dimension reduction

*If you soon want to know  what the algorithm called “PCA” is, you should skip this section for now to avoid confusion.

Out of the two algorithms I mentioned, PCA is especially important and you would see the same or similar ideas in various fields such as signal processing, psychology, and structural mechanics. However in most cases, the word “PCA” refers to one certain algorithm of linear dimension reduction. Most articles or study materials only mention the “PCA,” and this article is also going to cover only the algorithm which most poeple call “PCA.” However I found that PCA is only one branch of linear dimension reduction algorithms.

*From now on all the terms “PCA” in this article means the algorithm known as PCA unless I clearly mention the generalized KL transform.

*This chart might be confusing to you. According to PRML, PCA and KL transform is identical. PCA has two formulations, maximum variance formulation and minimum error formulation, and they can give the same result. However according to a Japanese textbook, which is very precise about this topic, KL transform has two formulations, and what we call PCA is based on maximum variance formulation. I am still not sure about correct terminology, but in this article I am going to call the most general algorithm “generalized KL transform,” I mean the root of the chart above.

*Most materials just explain the most major PCA, but if you consider this generalized KL transform, I can introduce an intriguing classification algorithm called subspace method. This algorithm was invented in Japan, and this is not so popular in machine learning textbooks in general, but learning this method would give you better insight into the idea of multidimensional space in machine learning. In the future, I am planning to cover this topic in this article series.

2. PCA

When someones mention “PCA,” I am sure for the most part that means the algorithm I am going to explain in the rest of this article. The most intuitive and straightforward way to explain PCA is that, PCA (Principal Component Analysis) of two or three dimensional data is fitting an oval to two dimensional data or fitting an ellipsoid to three dimensional data. You can actually try to plot some random dots on a piece of paper, and draw an oval which fits the dots the best. Assume that you have these 2 or 3 dimensional data below, and please try to put an oval or an ellipsoid to the data.

I think this is nothing difficult, but I have a question: what was the logic behind your choice?

Some might have roughly drawn its outline. Formulas of  “the surface” of general ellipsoids can be explained in several ways, but in this article you only have to consider ellipsoids whose center is the origin point of the coordinate system. In PCA you virtually shift data so that the mean of the data comes to the origin point of the coordinate system. When A is a certain type of D\times D matrix, the formula of a D-dimensional ellipsoid whose center is identical to the origin point is as follows: (\boldsymbol{x}, A\boldsymbol{x}) = 1, where \boldsymbol{x}\in \mathbb{R}. As is always the case with formulas in data science, you can visualize such ellipsoids if you are talking about 1, 2, or 3 dimensional data like in the figure below, but in general D-dimensional space, it is theoretical/imaginary stuff on blackboards.

*In order to explain the conditions which the matrix A has to hold, I need another article, so for now please just assume that the A is a kind of magical matrix.

You might have seen equations of 2 or 3 dimensional ellipsoids in the following way: \frac{x^2}{a^2} + \frac{y^2}{b^2} = 1, where a\neq 0, b\neq 0 or \frac{x^2}{a^2} + \frac{y^2}{b^2} + \frac{z^2}{c^2}= 1, where a\neq 0, b\neq 0, c \neq 0. These are special cases of the equation (\boldsymbol{x}, A\boldsymbol{x}) = 1, where A=diag(a_1^2, \dots, a_D^2). In this case the axes of ellipsoids the same as those of the coordinate system. Thus in this simple case, A=diag(a^2, b^2) or A=diag(a^2,c^2,c^2).

I am going explain these equations in detail in the upcoming articles. But thre is one problem: how would you fit an ellipsoid when a data distribution does not look like an ellipsoid?

In fact we have to focus more on another feature of ellipsoids: all the axes of an ellipsoid are orthogonal. In conclusion the axes of the ellipsoids are more important in PCA, so I do want you to forget about the surface of ellipsoids for the time being. You might get confused if you also think about the surface of ellipsoid now. I am planning to cover this topic in the next article. I hope this article, combined with the last one and the next one, would help you have better insight into the ideas which frequently appear in data science or machine learning context.

3. Fitting orthogonal axes on data

*If you have no trouble reading the chapter 12.1 of PRML, you do not need to this section or maybe even this article, but I hope at least some charts or codes of mine would enhance your understanding on this topic.

*I must admit I wrote only the essence of PCA formulations. If that seems too abstract to you, you should just breifly read through this section and go to the next section with a more concrete example. If you are confused, there should be other good explanations on PCA on the internet, and you should also check them. But at least the visualization of PCA in the next section would be helpful.

As I implied above, all the axes of ellipsoids are orthogonal, and selecting the orthogonal axes which match data is what PCA is all about. And when you choose those orthogonal axes, it is ideal if the data look like an ellipsoid. Simply putting we want the data to “swell” along the axes.

Then let’s see how to let them “swell,” more mathematically. Assume that you have 2 dimensional data plotted on a coordinate system (\boldsymbol{e}_1, \boldsymbol{e}_2) as below (The samples are plotted in purple). Intuitively, the data “swell” the most along the vector \boldsymbol{u}_1. Also  it is clear that \boldsymbol{u}_2 is the only vector orthogonal to \boldsymbol{u}_1. We can expect that the new coordinate system (\boldsymbol{u}_1, \boldsymbol{u}_2) expresses the data in a better way, and you you can get new coordinate points of the samples by projecting them on new axes as done with yellow lines below.

Next, let’s think about a case in 3 dimensional data. When you have 3 dimensional data in a coordinate system (\boldsymbol{e}_1, \boldsymbol{e}_2,\boldsymbol{e}_3) as below,  the data “swell” the most also along \boldsymbol{u}_1. And the data swells the second most along \boldsymbol{u}_2. The two axes, or vectors span the plain in purple. If you project all the samples on the plain, you will get 2 dimensional data at the right side. It is important that we did not consider the third axis. That implies you might be able to display the data well with only 2 dimensional sapce, which is spanned by the two axes \boldsymbol{v}_1, \boldsymbol{v}_2.

 

Thus the problem is how to calculate such axis \boldsymbol{u}_1. We want the variance of data projected on \boldsymbol{u}_1 to be the biggest. The coordinate of \boldsymbol{x}_n on the axis \boldsymbol{u}_1. The coordinate of a data point \boldsymbol{x}_n on the axis \boldsymbol{u}_1 is calculated by projecting \boldsymbol{x}_n on \boldsymbol{u}_1. In data science context, such projection is synonym to taking an inner  product of \boldsymbol{x}_n and \boldsymbol{u}_1, that is calculating \boldsymbol{u}_1^T \boldsymbol{x}_n.

*Each element of \boldsymbol{x}_n is the coordinate of the data point \boldsymbol{x}_n in the original coordinate system. And the projected data on \boldsymbol{u}_1 whose coordinates are 1-dimensional correspond to only one element of transformed data.

To calculate the variance of projected data on \boldsymbol{u}_1, we just have to calculate the mean of variances of 1-dimensional data projected on \boldsymbol{u}_1. Assume that \bar{\boldsymbol{x}} is the mean of data in the original coordinate, then the deviation of \boldsymbol{x}_1 on the axis \boldsymbol{u}_1 is calculated as \boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}}, as shown in the figure. Hence the variance, I mean the mean of the deviation on is \frac{1}{N} \sum^{N}_{n}{\boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}}}, where N is the total number of data points. After some deformations, you get the next equation \frac{1}{N} \sum^{N}_{n}{\boldsymbol{u}_1^T \boldsymbol{x}_n - \boldsymbol{u}_1^T \bar{\boldsymbol{x}}} = \boldsymbol{u}_1^T S \boldsymbol{u}_1, where S = \frac{1}{N}\sum_{n=1}^{N}{(\boldsymbol{x}_n - \bar{\boldsymbol{x}})(\boldsymbol{x}_n - \bar{\boldsymbol{x}})^T}. S is known as a covariance matrix.

We are now interested in maximizing the variance of projected data on  \boldsymbol{u}_1^T S \boldsymbol{u}_1, and for mathematical derivation we need some college level calculus, so if that is too much for you, you can skip reading this part till the next section.

We now want to calculate \boldsymbol{u}_1 with which \boldsymbol{u}_1^T S \boldsymbol{u}_1 is its maximum value. General \boldsymbol{u}_i including \boldsymbol{u}_1 are just coordinate axes after PCA, so we are just interested in their directions. Thus we can set one constraint \boldsymbol{u}_1^T  \boldsymbol{u}_1 = 1. Introducing a Lagrange multiplier, we have only to optimize next problem: \boldsymbol{u}_1 ^ {*} = \mathop{\rm arg~max}\limits_{\boldsymbol{u}_1} \{ \boldsymbol{u}_1^T S \boldsymbol{u}_1 + \lambda_1 (1 - \boldsymbol{u}_1^T \boldsymbol{u}_1) \}. In conclusion \boldsymbol{u}_1 ^ {*} satisfies S\boldsymbol{u}_1 ^ {*}  = \lamba_1 \boldsymbol{u}_1 ^ {*}. If you have read my last article on eigenvectors, you wold soon realize that this is an equation for calculating eigenvectors, and that means \boldsymbol{u}_1 ^ {*} is one of eigenvectors of the covariance matrix S. Given the equation of eigenvector the next equation holds \boldsymbol{u}_1 ^ {*}^T S \boldsymbol{u}_1 ^ {*} = \lambda_1. We have seen that \boldsymbol{u}_1 ^T S \boldsymbol{u}_1 ^ is a the variance of data when projected on a vector \boldsymbol{u}_1, thus the eigenvalue \lambda_1 is the biggest variance possible when the data are projected on a vector.

Just in the same way you can calculate the next biggest eigenvalue \lambda_2, and it it the second biggest variance possible, and in this case the date are projected on \boldsymbol{u}_2, which is orthogonal to \boldsymbol{u}_1. As well you can calculate orthogonal 3rd 4th …. Dth eigenvectors.

*To be exact I have to explain the cases where we can get such D orthogonal eigenvectors, but that is going to be long. I hope I can to that in the next article.

4. Practical three dimensional example of PCA

We have seen that PCA is sequentially choosing orthogonal axes along which data points swell the most. Also we have seen that it is equal to calculating eigenvalues of the covariance matrix of the data from the largest to smallest one. From now on let’s work on a practical example of data. Assume that we have 30 students’ scores of Japanese, math, and English tests as below.

* I think the subject “Japanese” is equivalent to “English” or “language art” in English speaking countries, and maybe “Deutsch” in Germany. This example and the explanation are largely based on a Japanese textbook named 「これなら分かる応用数学教室 最小二乗法からウェーブレットまで」. This is a famous textbook with cool and precise explanations on mathematics for engineering. Partly sharing this is one of purposes of this article.

At the right side of the figure below is plots of the scores with all the combinations of coordinate axes. In total 9 inverse graphs are symmetrically arranged in the figure, and it is easy to see that English & Japanese or English and math have relatively high correlation. The more two axes have linear correlations, the bigger the covariance between them is.

In the last article, I visualized the eigenvectors of a 3\times 3 matrix A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}, and in fact the matrix is just a constant multiplication of this covariance matrix. I think now you understand that PCA is calculating the orthogonal eigenvectors of covariance matrix of data, that is diagonalizing covariance matrix with orthonormal eigenvectors. Hence we can guess that covariance matrix enables a type of linear transformation of rotation and expansion and contraction of vectors. And data points swell along eigenvectors of such matrix.

Then why PCA is useful? In order to see that at first, for simplicity assume that x, y, z denote Japanese, Math, English scores respectively. The mean of the data is \left( \begin{array}{c} \bar{x} \\ \bar{y} \\ \bar{z} \end{array} \right) = \left( \begin{array}{c} 58.1 \\ 61.8 \\ 67.3 \end{array} \right), and the covariance matrix of data in the original coordinate system is V_{xyz} = \begin{pmatrix} 60.45 & 33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}. The eigenvalues of  V_{xyz} are \lambda_1=148.34, \lambda_2 = 30.62, and \lambda_3 = 3.60, and their corresponding unit eigenvectors are \boldsymbol{u}_1 =  \left( \begin{array}{c} 0.540 \\ 0.602 \\ 0.589 \end{array} \right) , \boldsymbol{u}_2 =  \left( \begin{array}{c} 0.736 \\ -0.677 \\ 0.0174 \end{array} \right) , \boldsymbol{u}_3 =  \left( \begin{array}{c} -0.408 \\ -0.4.23 \\ 0.809 \end{array} \right) respectively.  U = (\boldsymbol{u}_1 \quad \boldsymbol{u}_2 \quad \boldsymbol{u}_3 )  is an orthonormal matrix, where \boldsymbol{u}_i^T\boldsymbol{u}_j = \begin{cases} 1 & (i=j) \\ 0 & (otherwise) \end{cases}. As I explained in the last article, you can diagonalize V_{xyz} with U: U^T V_{xyz}U = diag(\lambda_1, \dots, \lambda_D).

In order to see how PCA is useful, assume that \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right)  = U^T \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right).

Let’s take a brief look at what a linear transformation by U^T means. Each element of \boldsymbol{x} denotes coordinate of the data point \boldsymbol{x}  in the original coordinate system (In this case the original coordinate system is composed of \boldsymbol{e}_1, \boldsymbol{e}_2, and \boldsymbol{e}_3). U = (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) enables a rotation of a rigid body, which means the shape or arrangement of data will not change after the rotation, and U^T enables a reverse rotation of the rigid body.

*Roughly putting, if you hold a bold object such as a metal ball and rotate your arm, that is a rotation of a rigid body, and your shoulder is the origin point. On the other hand, if you hold something soft like a marshmallow, it would be squashed in your hand, and that is not a not a rotation of a rigid body.

You can rotate \boldsymbol{x} with U like U^T\boldsymbol{x} = \left( \begin{array}{c} -\boldsymbol{u}_1^{T}- \\ -\boldsymbol{u}_2^{T}- \\ -\boldsymbol{u}_3^{T}- \end{array} \right)\boldsymbol{x}=\left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right), and \boldsymbol{u}_i^{T}\boldsymbol{x} is the coordinate of \boldsymbol{x} projected on the axis \boldsymbol{u}_i.

Let’s see this more visually. Assume that the data point \boldsymbol{x}  is a purple dot and its position is expressed in the original coordinate system spanned by black arrows . By multiplying \boldsymbol{x} with U^T, the purple point \boldsymbol{x} is projected on the red axes respectively, and the product \left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right) denotes the coordinate point of the purple point in the red coordinate system. \boldsymbol{x} is rotated this way, but for now I think it is better to think that the data are projected on new coordinate axes rather than the data themselves are rotating.

Now that we have seen what rotation by U means, you should have clearer image on what \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right)  = U^T \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) means. \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right) denotes the coordinates of data projected on new axes \boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3, which are unit eigenvectors of V_{xyz}. In the coordinate system spanned by the eigenvectors, the data distribute like below.

By multiplying U from both sides of the equation above, we get \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) =U \left( \begin{array}{c} \xi \\ \eta \\ \zeta \end{array} \right), which means you can express deviations of the original data as linear combinations of the three factors \xi, \eta, and \zeta. We expect that those three factors contain keys for understanding the original data more efficiently. If you concretely write down all the equations for the factors: \xi = 0.540 (x - \bar{x}) + 0.602 (y - \bar{y}) + 0.588 (z - \bar{z}), \eta = 0.736(x - \bar{x}) - 0.677 (y - \bar{y}) + 0.0174 (z - \bar{z}), and \zeta = - 0.408 (x - \bar{x}) - 0.423 (y - \bar{y}) + 0.809(z - \bar{z}). If you examine the coefficients of the deviations (x - \bar{x}), (y - \bar{y}), and (z - \bar{z}), we can observe that \eta almost equally reflects the deviation of the scores of all the subjects, thus we can say \eta is a factor indicating one’s general academic level. When it comes to \eta Japanese and Math scores are important, so we can guess that this factor indicates whether the student is at more of “scientific side” or “liberal art side.” In the same way \zeta relatively makes much of one’s English score,  so it should show one’s “internationality.” However the covariance of the data \xi, \eta, \zeta is V_{\xi \eta \zeta} = \begin{pmatrix} 148.34 & 0 & 0 \\ 0 & 30.62 & 0 \\ 0 & 0 & 3.60 \end{pmatrix}. You can see \zeta does not vary from students to students, which means it is relatively not important to describe the tendency of data. Therefore for dimension reduction you can cut off the factor \zeta.

*Assume that you can apply PCA on D-dimensional data and that you get \boldsymbol{x}', where \boldsymbol{x}' = U^T\boldsymbol{x} - \bar{\boldsymbol{x}}. The variance of data projected on new D-dimensional coordinate system is V'=\frac{1}{N}\sum{(\boldsymbol{x}')^T\boldsymbol{x}'} =\frac{1}{N}\sum{(U^T\boldsymbol{x})^T(U^T\boldsymbol{x})} =\frac{1}{N}\sum{U^T\boldsymbol{x}\boldsymbol{x}^TU} =U^T(\frac{1}{N}\sum{\boldsymbol{x}\boldsymbol{x}^T})U =U^TVU =diag(\lambda_1, \dots, \lambda_D). This means that in the new coordinate system after PCA, covariances between any pair of variants are all zero.

*As I mentioned U is a rotation of a rigid body, and U^T is the reverse rotation, hence U^TU = UU^T = I.

Hence you can approximate the original 3 dimensional data on the coordinate system (\boldsymbol{e}_1, \boldsymbol{e}_2, \boldsymbol{e}_3) from the reduced two dimensional coordinate system (\boldsymbol{u}_1, \boldsymbol{u}_2) with the following equation: \left( \begin{array}{c} x - \bar{x} \\ y - \bar{y} \\ z - \bar{z} \end{array} \right) \approx U_{reduced} \left( \begin{array}{c} \xi \\ \eta  \end{array} \right)  = (\boldsymbol{u}_1 \quad \boldsymbol{u}_2) \left( \begin{array}{c} \xi \\ \eta  \end{array} \right). Then it mathematically clearer that we can express the data with two factors: “how smart the student is” and “whether he is at scientific side or liberal art side.”

We can observe that eigenvalue \lambda_i is a statistic which indicates how much the corresponding \boldsymbol{u}_i can express the data, \frac{\lambda_i}{\sum_{j=1}^{D}{\lambda_j}} is called the contribution ratio of eigenvector \boldsymbol{u}_i. In the example above, the contribution ratios of \boldsymbol{u}_1, \boldsymbol{u}_2, and \boldsymbol{u}_3 are respectively \frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3}=0.813, \frac{\lambda_2}{\lambda_1 + \lambda_2 + \lambda_3}=0.168, \frac{\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3}=0.0197. You can decide how many degrees of dimensions you reduce based on this information.

Appendix: Playing with my toy PCA on MNIST dataset

Applying “so called” PCA on MNIST dataset is a super typical topic that many other tutorial on PCA also introduce, but I still recommend you to actually implement, or at least trace PCA implementation with MNIST dataset without using libraries like scikit-learn. While reading this article I recommend you to actually run the first and the second code below. I think you can just copy and paste them on your tool to run Python, installing necessary libraries. I wrote them on Jupyter Notebook.

In my implementation, in the simple configuration part you can set the USE_ALL_NUMBERS as True or False boolean. If you set it as True, you apply PCA on all the data of numbers from 0 to 9. If you set it as True, you can specify which digit to apply PCA on. In this article, I show the results results of PCA on the data of digit ‘3.’ The first three images of ‘3’ are as below.

You have to keep it in mind that the data are all shown as 28 by 28 pixel grayscale images, but in the process of PCA, they are all processed as 28 * 28 = 784 dimensional vectors. After applying PCA on the 784 dimensional vectors of images of ‘3,’ the first 25 eigenvectors are as below. You can see that at the beginning the eigenvectors partly retain the shapes of ‘3,’ but they are distorted as the eigenvalues get smaller. We can guess that the latter eigenvalues are not that helpful in reconstructing the shape of ‘3.’

Just as we saw in the last section, you you can cut off axes of eigenvectors with small eigenvalues and reduce the dimension of MNIST data. The figure below shows how contribution ratio of MNIST data grows. You can see that around 200 dimension degree, the contribution ratio reaches around 0.95. Then we can guess that even if we reduce the dimension of MNIST from 784 to 200 we can retain the most of the structure of original data.

Some results of reconstruction of data from 200 dimensional space are as below. You can set how many images to display by adjusting NUMBER_OF_RESULTS in the code. And if you set LATENT_DIMENSION as 784, you can completely reconstruct the data.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

*I attatched the codes I used to make the figures in this article. You can just copy, paste, and run, sometimes installing necessary libraries.

 

 

Web Scraping Using R..!

In this blog, I’ll show you, How to Web Scrape using R..?

What is R..?

R is a programming language and its environment built for statistical analysis, graphical representation & reporting. R programming is mostly preferred by statisticians, data miners, and software programmers who want to develop statistical software.

R is also available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form.

Reasons to choose R

Reasons to choose R

Let’s begin our topic of Web Scraping using R.

Step 1- Select the website & the data you want to scrape.

I picked this website “https://www.alexa.com/topsites/countries/IN” and want to scrape data of Top 50 sites in India.

Data we want to scrape

Data we want to scrape

Step 2- Get to know the HTML tags using SelectorGadget.

In my previous blog, I already discussed how to inspect & find the proper HTML tags. So, now I’ll explain an easier way to get the HTML tags.

You have to go to Google chrome extension (chrome://extensions) & search SelectorGadget. Add it to your browser, it’s a quite good CSS selector.

Step 3- R Code

Evoking Important Libraries or Packages

I’m using RVEST package to scrape the data from the webpage; it is inspired by libraries like Beautiful Soup. If you didn’t install the package yet, then follow the code in the snippet below.

Step 4- Set the url of the website

Step 5- Find the HTML tags using SelectorGadget

It’s quite easy to find the proper HTML tags in which your data is present.

Firstly, I have to click on data using SelectorGadget which I want to scrape, it automatically selects the data which are similar to selected HTML tags. Before going forward, cross-check the selected values, are they correct or some junk data is also gets selected..? If you noticed our page has only 50 values, but you can see 156 values are selected.

Selection by SelectorGadget

Selection by SelectorGadget

So I need to remove unwanted values who get selected, once you click on them to deselect it, it turns red and others will turn yellow except our primary selection which turn to green. Now you can see only 50 values are selected as per our primary requirement but it’s not enough. I have to again cross-check that some required values are not exchanged with junk values.

If we satisfy with our selection then copy the HTML tag & include it into the code, else repeat this exercise.

Modified Selection by SelectorGadget

Step 6- Include the tag in our Code

After including the tags, our code is like this.

Code Snippet

If I run the code, values in each list object will be 50.

Data Stored in List Objects

Step 7- Creating DataFrame

Now, we create a dataframe with our list-objects. So for creating a dataframe, we always need to remember one thumb rule that is the number of rows (length of all the lists) should be equal, else we get an error.

Error appears when number of rows differs

Finally, Our DataFrame will look like this:

Our Final Data

Step 8- Writing our DataFrame to CSV file

We need our scraped data to be available locally for further analysis & model building or other purposes.

Our final piece of code to write it in CSV file is:

Writing to CSV file

Step 9- Check the CSV file

Data written in CSV file

Conclusion-

I tried to explain Web Scraping using R in a simple way, Hope this will help you in understanding it better.

Find full code on

https://github.com/vgyaan/Alexa/blob/master/webscrap.R

If you have any questions about the code or web scraping in general, reach out to me on LinkedIn!

Okay, we will meet again with the new exposer.

Till then,

Happy Coding..!

Determining Your Data Pipeline Architecture and Its Efficacy

Data analytics has become a central part of how many businesses operate. If you hope to stay competitive in today’s market, you need to take advantage of all your available data. For that, you’ll need an efficient data pipeline, which is often easier said than done.

If your pipeline is too slow, your data will be all but useless by the time it’s usable. Successful analytics require an optimized pipeline, and that looks different for every company. No matter your specific circumstances, though, a traditional approach will result in inefficiencies.

Creating the most efficient pipeline architecture will require you to change how you look at the process. By understanding each stage’s role and how they serve your goals, you can optimize your data analytics.

Understanding Your Data Needs

You can’t build an optimal data pipeline if you don’t know what you need from your data. If you spend too much time collecting and organizing information you won’t use, you’ll take time away from what you need. Similarly, if you only work to meet one team’s needs, you’ll have to go back and start over to help others.

Data analytics involves multiple stakeholders, all with individual needs and expectations that you should consider. Your data engineers need your pipeline to be accessible and scalable, while analysts require visual, relevant datasets. If you consider these aspects from the beginning, you can build a pipeline that works for everyone.

Start at the earliest stage — collection. You may be collecting data from every channel you can, which could result in an information overload. Focus instead on gathering things from the most relevant sources. At the same time, ensure you can add more channels if necessary in the future.

As you reorganize your pipeline, remember that analytics are only as good as your datasets. If you put more effort into organizing and scrubbing data, helpful analytics will follow. Focus on preparing data well, and the last few stages will be smoother.

Creating a Collaborative Pipeline

When structuring your pipeline, it’s easy to focus too much on the individual stages. While seeing things as rigid steps can help you visualize them, you need something more fluid in practice. If you want the process to run as smoothly as possible, it needs to be collaborative.

Look at the software development practice of DevOps, which doubles a team’s likelihood of exceeding productivity goals. This strategy focuses on collaboration across separate teams instead of passing things back and forth between them. You can do the same thing with your data pipeline.

Instead of dividing steps between engineers and analysts, make it a single, cohesive process. Teams will still focus on different areas according to their expertise, but they’ll reduce disruption by working together instead of independently. If workers can collaborate along every step, they don’t have to go back and forth.

Simultaneously, everyone should have clearly defined responsibilities. Collaboration doesn’t mean overstepping your areas of expertise. The goal here isn’t to make everyone handle everything but to ensure they understand each other’s needs.

Eliminating the time between steps also applies to your platform. Look for or build software that integrates both refinement and data preparation. If you have to export data to various programs, it will cause unnecessary bottlenecks.

Enabling Continuous Improvement

Finally, understand that restructuring your data pipeline isn’t a one-and-done job. Another principle you can adopt from DevOps is continuous development across all sides of the process. Your engineers should keep looking for better ways to structure data as your analysts search for new applications for this information.

Make sure you always measure your throughput and efficiency. If you tweak something and you notice the process starts to slow, revert to the older method. If your changes improve the pipeline, try something similar in another area.

Optimize Your Data Pipeline

Remember to start slow when optimizing your data pipeline. Changing too much at once can cause more disruptions than it avoids, so start small with an emphasis on scalability.

The specifics of your pipeline will vary depending on your needs and circumstances. No matter what these are, though, you can benefit from collaboration and continuous development. When you start breaking down barriers between different steps and teams, you unclog your pipeline.

New Era of Data Science in Today’s World

In today’s digital world, most organizations are flooded with data, both structured and unstructured. Data is a commodity now, and organizations should know how to monetize that data and derive a profit from the deluge. And valuing data is one of the best ways enterprises can become successful in distinguishing themselves in the marketplace.

Data is the new oil

Indeed, data itself has become a commodity, and the mere possession of abundant amounts of data is not enough. But the ability to monetize data effectively (and not merely hoard it) can undoubtedly be a source of competitive advantage in the digital economy. However, we need to refine this data. And refinement of this “new oil” will take a reasonable amount of time. In my opinion, we are still not there. As a result, “data refinement” remains a key factor for successful advanced analytics.

If we talk about the level of activity in data and analytics space in the last two years, most advanced analytics evolved around three categories:

  • Descriptive, or what has happened
  • Predictive, or what could happen, and
  • Prescriptive, or what we should do.

Descriptive analytics has been the core analytics for many years. In the past, we could only describe what has happened to historical data (such as that found in a data warehouse), with dashboard reporting, using traditional analytics. But with the advent of advanced analytics, machine learning (ML), and deep learning and artificial intelligence (AI), our focus has changed to real-time analytics. In the last two years, much work has been done in predictive analytics, and as we move forward into our analytics journey, data-centric organizations will now focus on prescriptive analytics. The use of prescriptive analytics, along with predictive analytics, is very important for any organization to be successful in the future.

Current and recent trends in data and analytics

The analytics trends revolve around AI and ML. The Analytics-as-a-Service model is an essential model for any smart, data-driven organization. We can make an impact on society and try to make a better place to live with the use of advanced analytics. At NTT DATA, we strive to solve these problems to improve the quality, safety and advancement of humanity. From a business perspective, we use data analytics and predictive modeling to help companies increase their sales and revenue.

Let me give you some examples. We have been involved with several technology partners in a project for the Smart City. This project involved the use of predictive analytics for the validation of critical alerts to help reduce the time and amount of data required to be processed. It used Internet of Things (IoT) devices, high-definition video cameras, and sound sensors, as well as video and sound data captured from specific locations. Eventually, the solution also integrated with available data from data sources such as crime, weather and social media. The overall objective of the Smart City project was to use and apply advanced analytics with cognitive computing to facilitate safety decision-making, and for a responder to respond earlier based on real-time data.

Another example is the Smart ICU System developed by NTT DATA for predictive detection of threats for seriously ill patients in an ICU, based on the data. This data was consolidated from various medical devices in the ICU into one platform. From that data, we developed a model that predicts the risk of complications that might occur within the next couple of hours or so of a medical event. We have also used advanced analytics provided by weather data forecasting and used predictive models to predict natural disasters.

Data and analytics strategy

A strategy is an essential aspect of any data-driven organization. It should cover data strategy for AI, ML, statistical modeling and other data science disciplines, such as predictive and prescriptive analytics. In general, advanced analytics is more predictive and actionable than retrospective. Smart organizations see positive results when they place a strategy for data and analytics in the hands of employees who are well-positioned to make decisions, such as those who interact with customers, oversee product development, or run production processes. With data-based insight and clear decision rules, employees can deliver more meaningful services, better assess and address customer demands, and optimize production.

Smart organizations must take time to clean and update their underlying modern data architecture — along with their data governance process, for a cleaner data and analytics strategy. A modern data architecture, combined with a good governance process, can leverage AI and ML to help organizations stay ahead of their competitors.

Data analytics innovation

Machine and Deep Learning, along with AI, are all very popular, but I would like to reiterate that advanced technologies like AI and machine learning will continue to transform data analytics. The next innovation could be the use of automated analytics, which machine learning tools can use to identify hidden patterns in data. For example, customer retention issues, customer default on loans, or predicting customers who are prone to auto accidents. Also, predictive analytics and prescriptive analytics are going to be the key for any future innovations in AI and ML.

We must make targeted investments in traditional business innovation tools, along with emerging data analytics tools to derive benefits from data-driven business initiatives. We need to invest in cloud and underlying IT infrastructure to support these analytics and business initiatives. Most importantly, we also need to invest in people — cross training skilled resources and empowering the people who work closely with clients to make the right decisions for analytics.

Bias and Variance in Machine Learning

Machine learning continues to be an ever more vital component of our lives and ecosystem, whether we’re applying the techniques to answer research or business problems or in some cases even predicting the future. Machine learning models need to give accurate predictions in order to create real value for a given industry or domain.

While training a model is one of the key steps in the Data Science Project Life Cycle, how the model generalizes on unseen data is an equally important aspect that should be considered in every Data Science Project Life Cycle. We need to know whether it works and, consequently, if we can trust its predictions. Could the model be merely memorizing the data it is fed with, and therefore unable to make good predictions on future samples, or samples that it hasn’t seen before?

Let’s know the importance of evaluation with a simple example, There are two student’s Ramesh and Suresh preparing for the CAT exam to get into top IIMs (Indian Institute of Management). They both are quite good friends and stayed in the room during preparation and put an equal amount of hard work while solving numerical problems.

They both prepared for almost the same number of hours for the entire year and appeared in the final CAT exam. Surprisingly, Ramesh cleared, but Suresh did not. When asked, we got to know that there was one difference in their strategy of preparation between them, Ramesh had joined a Test Series course where he used to test his knowledge and understanding by giving mock exams and then further evaluating on which portions he is lagging and making necessary adjustments to he is preparation cycle in order to do well in those areas. But Suresh was confident, and he just kept training himself without testing on the preparation he had done.

Like the above situation we can train a Machine Learning Algorithm extensively with many parameters and new techniques, but if you are skipping its evaluation step, you cannot trust your model to perform well on the unseen data. In this article, we explain the importance of Bias, Variance and the trade-off between them in order to know how well a machine learning model generalizes to new, previously unseen data.

Training of Supervised Machine Learning

Bias

Bias is the difference between the Predicted Value and the Expected Value or how far are the predicted values from the actual values. During the training process the model makes certain assumptions on the training data provided. After Training, when it is introduced to the testing/validation data or unseen data, these assumptions may not always be correct.

If we use a large number of nearest neighbors in the K-Nearest Neighbors Algorithm, the model can totally decide that some parameters are not important at all for the modelling.  For example, it can just consider that only two predictor variables are enough to classify the data point though we have more than 10 variables.

This type of model will make very strong assumptions about the other parameters not affecting the outcome at all. You can take it as a model predicting or understanding only the simple relationship when the data points clearly indicate a more complex relationship.

When the model has high bias error, it results in a very simplistic model that does not consider the complexity of the data very well leading to Underfitting.

Variance

Variance occurs when the model performs well on the trained dataset but does not do well on an unseen data set, it is when the model considers the fluctuations or i.e. the noise as in the data as well. The model will still consider the variance as something to learn from because it learns too much from the noise inside the trained data set that it fails to perform as expected on the unseen data.

Based on the above example from Bias, if the model learns that all the ten predictor variables are important to classify a given data point then it tends to have high variance. You can take it as the model is trying to understand every minute detail making it more complex and failing to perform well on the unseen data.

When a model has High Bias error, it underfits the data and makes very simplistic assumptions on it. When a model has High Variance error, it overfits the data and learns too much from it. When a model has balanced Bias and Variance errors, it performs well on the unseen data.

Bias-Variance Trade-off

Based on the definitions of bias and variance, there is clear trade-off between bias and variance when it comes to the performance of the model. A model will have a high error if it has very high bias and low variance and have a high error if it has high variance and low bias.

A model that strikes a balance between the bias and variance can minimize the error better than those that live on extreme ends.

We can find whether the model has High Bias using the below steps:

  1. We tend to get high training errors.
  2. The validation error or test error will be similar to the training error.

We can find whether the model has High Bias using the below steps:

  1. We tend to get low training error
  2. The validation error or test error will be very high.

We can fix the High Bias using below steps:

  1. We need to gather more input features or can even try to create few using the feature engineering techniques.
  2. We can even add few polynomial features in order to increase the complexity.
  3. If we are using any regularization terms in our model, we can try to minimize it.

We can fix the High Variance using below steps:

  1. We can gather more training data so that the model can learn more on the patterns rather than the noise.
  2. We can even try to reduce the input features or do feature selection.
  3.  If we are using any regularization terms in our model we can try to maximize it.

Conclusion

In this article, we got to know the importance of the evaluation step in the Data Science Project Life Cycle, definitions of Bias and Variance, the trade-off between them and the steps we can take to fix the Underfitting and Overfitting of a Machine Learning Model.