Geschriebene Artikel über Big Data Analytics

K Nearest Neighbour For Supervised Learning

K-Nearest Neighbour (KNN) Algorithms is an easy-to-implement & advanced level supervised machine learning algorithm used for both – classification as well as regression problems. However, you can see a wide of its applications in classification problems across various industries.

If you’ve been shopping a lot in e-commerce sites like Amazon, Flipkart, Myntra, or love watching web series over Netflix and Amazon Prime, one common thing you’ve always noticed, and that is recommendations.

Are you wondering how they recommend you following your choice? They use KNN Supervised Learning to find out what you may need the next when you’re buying and recommend you with a few more products.

Imagine you’re looking for an iPhone to purchase. When you scroll down a little, you see some iPhone cases, tempered glasses – saying, “People who purchased an iPhone have also purchased these items. The same applies to Netflix and Amazon Prime. When you finished a show or a series, they give you recommendations of the same genre. And do it all using KNN supervised learning and classify the items for the best user experience.

Advantages Of KNN

  • Quickest Calculation Time
  • Simple Algorithms
  • High Accuracy
  • Versatile – best use for Regression and Classification.
  • Doesn’t make any assumptions about data.

Where KNN Are Mostly Used

  • Simple Recommendation Models
  • Image Recognition Technology
  • Decision-Making Models
  • Calculating Credit Rating

Choosing The Right Value For K

 To choose the right value of K, you have to run KNN algorithms several times with different values of K and select the value of K, which reduces the number of errors you’ve come across and come out as the most stable value for K.

Your Step-By-Step Guide For Choosing The Value Of K

  • As you decrease the value of K to 1 (K = 1), you’ll reach a query point, where you get to see many elements from class A (-) and class B (+) where (-) is the only nearest neighbor. Reasonably, you would think about the query point to be most likely the red one. As K =1, which has a blue color, KNN incorrectly predicts the wrong color blue.
  • As you increase the value of K to 2 (K=2), you get to see two elements, (-) and (+) are the only nearest neighbor. As you have two values, which are of Class A and Class B, KNN incorrectly predicts the wrong values (Blue and Red).
  • As you increase the value of K to 3 (K=3), you get to see three elements (-) and (+), (+) are the only nearest neighbor. And this time, you got three values, one from blue and two from red. As your assumption is red, KNN correctly predicts the right value (Blue and Red, Red). Your answer is more stable this time compared to previous ones.


KNN works by finding the nearest distance between a query and all the elements in the database. By choosing the value for K, we get the closest to the query. And then, KNN algorithms look for the most frequent labels in classification and averages of labels in regression.

Spiky cubes, Pac-Man walking, empty M&M’s chocolate: curse of dimensionality

This is the first article of the article series Illustrative introductions on dimension reduction.

“Curse of dimensionality” means the difficulties of machine learning which arise when the dimension of data is higher. In short if the data have too many features like “weight,” “height,” “width,” “strength,” “temperature”…., that can undermine the performances of machine learning. The fact might be contrary to your image which you get from the terms “big” data or “deep” learning. You might assume that the more hints you have, the better the performances of machine learning are. There are some reasons for curse of dimensionality, and in this article I am going to introduce two major reasons below.

  1. High dimensional data usually have rich expressiveness, but usually training data are too poor for that.
  2. The behaviors of data points in high dimensional space are totally different from our common sense.

Through these topics, you will see that you always have to think about which features to use considering the number of data points.

*From now on I am going to talk about only Euclidean distance. If you are not sure what Euclidean distance means, please just keep it in mind that it is the type of distance most people wold have learnt in normal compulsory education.

1. Number of samples and degree of dimension

The most straightforward demerit of adding many features, or increasing dimensions of data, is the growth of computational costs. More importantly, however, you always have to think about the degree of dimensions in relation of the number of data points you have. Let me take a simple example in a book “Pattern Recognition and Machine Learning” by C. M. Bishop (PRML). This is an example of measurements of a pipeline. The figure below shows a comparison plot of 3 classes (red, green and blue), with parameter x_7 plotted against parameter x_6 out of 12 parameters.

* The meaning of data is not important in this article. If you are interested please refer to the appendix in PRML.

Assume that we are interested in classifying the cross in black into one of the three classes. One of the most naive ideas of this classification is dividing the graph into grids and labeling each grid depending on the number of samples in the classes (which are colored at the right side of the figure). And you can classify the test sample, the cross in black, into the class of the grid where the test sample is in. Thereby the cross is classified to the class in red.

Source: C.M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 34-35

As I mentioned in the figure above, we used only two features out of 12 features in total. When the total number of data points is fixed and you add remaining ten axes/features one after another, what would happen? Let’s see what “adding axes/features” means. If you are talking about 1, 2, or 3 dimensional grids, you can visualize them. And as you can see from the figure below, if you make each 10^1, 10^2, 100^3 grids respectively in 1, 2, 3 dimensional spaces, the number of the small regions in the grids are respectively 10, 100, 1000. Even though you cannot visualize it anymore, you can make grids for more than 3 dimensional data. If you continue increasing the degree of dimension, the number of grids increases exponentially, and that can soon surpass the number of training data points. That means there would be a lot of empty spaces in such high dimensional grids. And the classifying method above: coloring each grid and classifying unknown samples depending on the colors of the grids, does not work out anymore because there would be a lot of empty grids.

* If you are still puzzled by the idea of “more than 3 dimensional grids,” you should not think too much about that now. It is enough if you can get some understandings on high dimensional data after reading the whole article of this.

Source: Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning, (2016), MIT Press, p. 153

I said the method above is the most naive way, but other classical classification methods , for example k-nearest neighbors algorithm, are more or less base on a similar idea. Many of classical machine learning algorithms are based on the idea of smoothness prior, or local constancy prior. In short in classical ways, you  do not expect data to change so much in a small region, so you can expect unknown samples to be similar to data in vicinity. But that soon turns out to be problematic when the dimension of data is bigger because training data would be sparse because the area of multidimensional space grows exponentially as I mentioned above. And sometimes you would not be able to find training data around test data. Plus, in high dimensional data, you cannot treat distance in the same as you do in lower dimensional space. The ideas of “close,” “nearby,” or “vicinity” get more obscure in high dimensional data. That point is related to the next topic: the intuition have cultivated in normal life is not applicable to higher dimensional data.

2. Bizarre characteristics of high dimensional data

We form our sense of recognition in 3-dimensional ways in our normal life. Even though we can visualize only 1, 2, or 3 dimensional data, we can actually generalize the ideas in 1, 2, or 3 dimensional ideas to higher dimensions. For example 4 dimensional cubes, 100 dimensional spheres, or orthogonality in 255 dimensional space. Again, you cannot exactly visualize those ideas, and for many people, such high dimensional phenomenon are just imaginary matters on blackboards. Those high dimensional ideas are designed to retain some conditions just as well as 1, 2, or 3 dimensional space. Let’s take an example of spheres in several dimensional spaces. General spheres in any D-dimensional space can be defined as a set of any \boldsymbol{x}, such that |\boldsymbol{x} - \boldsymbol{c}| = r, where \boldsymbol{c} is the center point and r is length of radius. When \boldsymbol{x} is 2-dimensional, the spheres are called “circles.” When \boldsymbol{x} is 3-dimensional, the spheres are called “spheres” in our normal life, unless it is used in a conversation in a college cafeteria, by some students in mathematics department. And when \boldsymbol{x} is D-dimensional, they are called D-ball, and again, this is just a imaginary phenomenon on blackboard.

* Vectors and points are almost the same because all the vectors are denoted as “arrows” from the an origin point to sample data points.  The only difference is that when you use vectors, you have to consider their directions.

* “D-ball” is usually called “n-ball,” and in such context it is a sphere in a n-dimensional space. But please let me use the term “D-ball” in this article.

Not only spheres, but only many other ideas have been generalized to D-dimensional space, and many of them are indispensable also for data science. But there is one severe problem: the behaviors of data in high dimensional field is quite different from those in two or three dimensional space. To be concrete, in high dimensional field, cubes are spiky, you have to move like Pac-Man, and M & M’s Chocolate looks empty inside but tastes normal.

2.1: spiky cubes
Let’s take a look at an elementary-school-level example of geometry first. Assume that you have several unit squares or unit cubes like below. In each of them a circle or sphere with diameter 1 is inscribed. The length of a diagonal line in each square is \sqrt{2}, and that in each cube is \sqrt{3}.

If you stack the squares or cubes as below, what are the length of diameters of the blue circle or sphere, circumscribing all the 4 orange circles or the 8 orange spheres?

The answers are, the diameter of the blue circle is \sqrt{2} - 1, and the diameter of the blue sphere is \sqrt{3} - 1.

Next let’s think about the same situation in higher dimensional space. Assume that there are some unit D-dimensional hypercubes stacked, in each of which a D-ball with diameter 1 is inscribed, touching all the surfaces inside. Then what is the length of the diameter of  a D-ball circumscribing all the unit D-ball in the hypercubes ? Given the results above, it ca be predicted that its diameter is \sqrt{D}  -1. If that is true, there is one strange point: \sqrt{D} - 1 can soon surpass 2: that means in the chart above the blue sphere will stick out of the stacked cubes. That sounds like a paradox, but with one hypothesis, the phenomenon makes sense: cubes become more spiky as the degree of dimension grows. This hypothesis is a natural deduction because diagonal lines of hyper cubes get longer, and the the center of each surface of hypercubes still touches the unit D-ball with diameter 1, inscribing inscribing inside each unit hypercube.

If you stack 4 hypercubes, the blue sphere circumscribing them will not stick out of the stacked hypercubes anymore like the figure below.

*Of course you cannot visualize what is going on in D-dimensional space, so the figure below is just a pseudo simulation of D-dimensional space in our 3-dimensional sense. I guess you have to stack more than four hyper cubes in higher dimensional data, but you cannot easily imagine what will go on in such space anymore.


*You can confirm the fact that hypercube gets more spiky as the degree of dimension growth, by comparing the volume of the hypercube and the volume of the D-ball inscribed inside the hypercube. Thereby you can prove that the volume of hypercube concentrates on the corners of the hypercube. Plus, as I mentioned the longest diagonal distance of hypercube gets longer as dimension degree increases. That is why hypercube is said to be spiky. For mathematical proof, please check the Exercise 1.19 of PRML.

2.2: Pac-Man walking

Next intriguing phenomenon in high dimensional field is that most of pairs of vectors in high dimensional space are orthogonal. In other words, if you select two random vectors in high dimensional space, the angle between them are mostly close to 90^\circ. Let’s see the general meaning of angle between two vectors in any dimensional spaces. Assume that the angle between two vectors \boldsymbol{u}, and \boldsymbol{v} is \theta, then cos\theta is calculated as cos\theta = \frac{<\boldsymbol{u}, \boldsymbol{v}>}{|\boldsymbol{u}||\boldsymbol{v}|}. In 1, 2, or 3 dimensional space, you can actually see the angle, but again you can define higher dimensional angle, which you cannot visualize anymore. And angles are sometimes used as similarity of two vectors.

* <\boldsymbol{u}, \boldsymbol{v}> is the inner product of \boldsymbol{u}, and \boldsymbol{v}.

Assume that you generate a pair of two points inside a D-dimensional unit sphere and make two vectors \boldsymbol{u}, and \boldsymbol{v} by connecting the origin point and those two points respectively. When D is 2, I mean spheres are circles in this case, any \theta are equally generated as in the chart below. The fact might be the same as your intuition.   How about in 3-dimensional space? In fact the distribution of \theta is not uniform. \theta = 90^\circ is the most likely to be generated. As I explain in the figure below, if you compare the area of cross section of a hemisphere and the area of a cone whose vertex is the center point of the sphere, you can see why.

I generated 10000 random pairs of points in side a D-dimensional unit sphere, and calculated the angle between them. In other words I just randomly generated two D-dimensional vectors \boldsymbol{u} and \boldsymbol{v}, whose elements are randomly generated values between -1 and 1, and calculated the angle between them, repeating this process 10000 times. The chart below are the histograms of angle between pairs of generated vectors in respectively 2, 3, 50, and 100 dimensional space.

As I explained above, in 2-dimensional space, the distribution of \theta is almost uniform. However the distribution concentrates a little around 90^\circ in 3-dimensional space. You can see that the bigger the degree of dimension is, the more the angles of generated vectors concentrate around 90^\circ. That means most pairs of vectors in high dimensional space are close to orthogonal. Movements are also sequence of vectors, so when most pairs of movement vectors are orthogonal, that means you can only move like Pac-Man in such space.


* Of course I am talking about arcade Mac-Man game. Not Pac-Man in Super Smash Bros.  Retro RPG video games might have more similar playability, but in high dimensional space it is also difficult to turn back. At any rate, I think you have understood it is even difficult to move smoothly in high dimensional space, just like the first notorious Resident Evil on the first PS console also had terrible playability .

2.3: empty M & M’s chocolate

Let’s think about the proportion of the volume of the outermost \epsilon surface of general spheres with radius r. First, in 2 two dimensional space, spheres are circles. The area of the brown part of the circle below is \pi r^2. In order calculate the are of \epsilon \cdot r thick surface of the circle, you have only to subtract the area of \pi \{ (1 - \epsilon)\cdot r\} ^2. When \epsilon = 0.01, the area of outer most surface is \pi r^2 - \pi (0.99\cdot r)^2, and its proportion to the area of the whole circle is \frac{\pi r^2 - \pi (0.99\cdot r)^2}{\pi r^2} = 0.0199.

In case of 3-dimensional space, the value of a sphere with radius r is \frac{4}{3} \pi r^2, so the proportion of the \epsilon surface is calculated in the same way: \frac{\frac{4}{3} \pi r^3 -\frac{4}{3} \pi (0.99\cdot r)^2}{\frac{4}{3}\pi r^2} = 0.0297. Compared to the case in 2 dimensional space, the proportion is a little bigger.

How about in D-dimensional space? We have seen that even in  D-dimensional space the surface of a sphere, I mean D-ball, can be defined as a set of any points whose distance from the center point is all r. And it is known that the volume of D-ball is defined as below.

\Gamma () is called gamma function, but in this article it is not so important. The most important point now is, if you discuss any D-ball, their volume only depends on their radius r. That meas the proportion of outer \epsilon surface of D-ball is calculated as \frac{\pi r^2 - \pi \{ (1 - \epsilon)\cdot r\} ^2}{\pi r^2}. When \epsilon is 0.01, the proportion of the 1% surface of D-ball changes like in the chart below.

* And of course when D is 2,  \frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} = \pi, and when D is 3 ,  \frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} = \frac{4}{3} \pi

You can see that when D is over 400, around 90% of volume is concentrated in the very thin 1% surface of D-ball. That is why, in high dimensional space, M & M’s chocolate look empty but tastes normal: all the chocolate are concentrated beneath the sugar coating.

More interestingly, even if you choose any points as a central point of a sphere with radius r, the other points are squashed to the surface of the sphere, even if all the data points are uniformly distributed. This situation is problematic for classical machine learning algorithms, which are often based on the Euclidean distances between pairs of two sample data points: if you go from the central point to another sample point, the possibility of finding the point within (1 - \epsilon)\cdot r radius of the center is almost zero. But if you reach the outermost \epsilon part of the surface of the sphere, most data points are there. However, for one of the data points in the surface, any other data points are distant in the same way.

Inside M & M’s chocolate is a mysterious world.


You have seen that using high dimensional data can be problematic in many ways. Data science and machine learning are largely based on one idea: you can find a lower dimensional meaningful and easier structure in data. In the next articles I am going to introduce some famous dimension reduction algorithms. And hopefully I would like to give some deeper insights in to these algorithms, in straightforward ways.

* I could not explain the relationships of variance and bias of data. This is also a very important factor when you think about dimensionality of data. I hope I can write about this topic someday. You can also look it up if you are interested.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Illustrative introductions on dimension reduction

“What is your image on dimensions?”

….That might be a cheesy question to ask to reader of Data Science Blog, but most people, with no scientific background, would answer “One dimension is a line, and two dimension is a plain, and we live in three-dimensional world.” After that if you ask “How about the fourth dimension?” many people would answer “Time?”

You can find books or writings about dimensions in various field. And you can use the word “dimension” in normal conversations, in many contexts.

*In Japanese, if you say “He likes two dimension.” that means he prefers anime characters to real women, as is often the case with Japanese computer science students.

The meanings of “dimensions” depend on the context, but in data science dimension is usually the number of rows of your Excel data.

When you study data science or machine learning, usually you should start with understanding the algorithms with 2 or 3 dimensional data, and you can apply those ideas to any D dimensional data. But of course you cannot visualize D dimensional data anymore, and you always have to be careful of what happens if you expand degree of dimension.

Conversely it is also important to reduce dimension to understand abstract high dimensional stuff in 2 or 3 dimensional space, which are close to our everyday sense. That means dimension reduction is one powerful way of data visualization.

In this blog series I am going to explain meanings of dimension itself in machine learning context and algorithms for dimension reductions, such as PCA, LDA, and t-SNE, with 2 or 3 dimensional visible data. Along with that, I am going to delve into the meaning of calculations so that you can understand them in more like everyday-life sense.

This article series is going to be roughly divided into the contents below.

  1. Curse of Dimensionality
  2. PCA, LDA (to be published soon)
  3. Rethinking eigen vectors (to be published soon)
  4. KL expansion and subspace method (to be published soon)
  5. Autoencoder as dimension reduction (to be published soon)
  6. t-SNE (to be published soon)

I hope you could see that reducing dimension is one of the fundamental approaches in data science or machine learning.

How the Pandemic is Changing the Data Analytics Outsourcing Industry

While media pundits have largely focused on the impact of COVID-19 as far as human health is concerned, it hasn’t been particularly good for the health of automated systems either. As cybersecurity budgets plummet in the face of dwindling finances, computer criminals have taken the opportunity to increase attacks against high value targets.

In June, an online antique store suffered a data breach that contained over 3 million records, and it’s likely that a number of similar attacks have simply gone unpublished. Fortunately, data scientists are hard at work developing new methods of fighting back against these kinds of breaches. Budget constraints and a lack of personnel as a result of the pandemic continues to be a problem, but automation has helped to assuage the issue to some degree.

AI-Driven Data Storage Systems

Big data experts have long promoted the cloud as an ideal metaphor for the way that data is stored remotely, but as a result few people today consider the physical locations that this information is stored at. All data has to be located on some sort of physical storage device. Even so-called serverless apps have to be distributed from a server unless they’re fully deployed using P2P services.

Since software can never truly replace hardware, researchers are looking at refining the various abstraction layers that exist between servers and the clients who access them. Data warehousing software has enabled computer scientists to construct centralized data storage solutions that look like traditional disk locations. This gives users the ability to securely interact with resources that are encrypted automatically.

Background services based on artificial intelligence monitor virtual data warehouse locations, which gives specialists the freedom to conduct whatever analytics they deem necessary. In some cases, a data warehouse can even anonymize information as it’s stored, which can streamline workflows involved with the analysis process.

While this level of automation has proven useful, it’s still subject to some of the problems that have occurred as a result of the pandemic. Traditional supply chains are in shambles and a large percentage of technical workers are now telecommuting. If there’s a problem with any existing big data plans, then there’s often nobody around to do any work in person.

Living with Shifting Digital Priorities

Many businesses were in the process of outsourcing their data operations even before the pandemic, and the current situation is speeding this up considerably. Initial industry estimates had projected steady growth numbers for the data analytics sector through 2025. While the current figures might not be quite as bullish, it’s likely that sales of outsourcing contracts will remain high.

That being said, firms are also shifting a large percentage of their IT spending dollars into cybersecurity projects. A recent survey found that 37 percent of business leaders said they were already going to cut their IT department budgets. The same study found that 28 percent of businesses are going to move at least some part of their data analytics programs abroad.

Those companies that can’t find an attractive outsourcing contract might start to patch their remote systems over a virtual private network. Unfortunately, this kind of technology has been strained to some degree in recent months. The virtual servers that power VPNs are flooded with requests, which in turn has brought them down in some instances. Neural networks, which utilize deep learning technology to improve themselves as time goes on, have proven more than capable of predicting when these problems are most likely to arise.

That being said, firms that deploy this kind of technology might find that it still costs more to work with automated technology on-premise compared to simply investing in an outsourcing program that works with these kinds of algorithms at an outside location.

Saving Money in the Time of Corona

Experts from Think Big Analytics pointed out how specialist organizations can deal with a much wider array of technologies than a small business ever could. Since these companies specialize in providing support for other organizations, they have a tendency to offer support for a large number of platforms.

These representatives recently opined that they could provide support for NoSQL, Presto, Apache Spark and several other emerging platforms at the same time. Perhaps most importantly, these organizations can work with Hadoop and other traditional data analysis languages.

Staffers working on data mining operations have long relied on languages like Hadoop and R to write scripts that they later use to automate the process of collecting and analyzing data. By working with an organization that already supports a language that companies rely on, they can avoid the need of changing up their existing operations.

This can help to drastically reduce the cost of migration, which is extremely important since many of the firms that need to migrate to a remote system are already suffering from budget problems. Assuming that some issues related to the pandemic continue to plague businesses for some time, it’s likely that these budget constraints will force IT departments to consider a migration even if they would have otherwise relied solely on a traditional colocation arrangement.

IT department staffers were already moving away from many rare platforms even before the COVID-19 pandemic hit, however, so this shouldn’t be as much of a herculean task as it sounds. For instance, the KNIME Analytics Platform has increased in popularity exponentially since it’s release in 2006. The fact that it supports over 1,000 plug-in modules has made it easy for smaller businesses to move toward the platform.

The road ahead isn’t going to be all that pleasant, however. COBOL and other antiquated languages still rule the roost at many governmental big data processing centers. At the same time, some small businesses have never even been able to put a big data plan into play in the first place. As the pandemic continues to wreak havoc on the world’s economy, however, it’s likely that there will be no shortage of organizations continuing to migrate to more secure third-party platforms backed by outsourcing contracts.

5 Data Privacy Predictions for 2021

2020 has been a significant year for data management. As businesses face new technological challenges amid the COVID-19 pandemic, issues of privacy have spent some time in the spotlight. In response, data privacy could see some substantial changes in 2021.

Few people will emerge from 2020 with an unchanged perception of data security. As these ideas and feelings shift, some trends will accelerate while others get replaced. Businesses will have to adapt to these changes to survive.

Here are five such changes you can expect in 2021.

International Data Privacy Standards Will Increase

Privacy concerns over Chinese-owned app TikTok caused quite a stir in 2020. With the TikTok situation bringing new attention to privacy in international services, you’ll likely see a rise in international regulations. China has already announced new security standards and asked other countries to follow.

2020 has cast doubt over a lot of international relations. More countries will likely issue new standards to ease tension and move past these doubts. This trend started before 2020, as you can see in Europe’s GDPR, but 2021 will further it.

Customers Will Demand Transparency

Governments aren’t the only ones that will expect more of tech companies’ privacy standards. Since things like TikTok have made people more aware of what apps could access, more people will demand privacy. In 2021, companies that are transparent about how they use data will likely be more successful.

According to a PwC poll, 84% of consumers said they would switch services if they don’t trust how a company uses their data. Data privacy isn’t just important to authorities or businesses anymore. The public is growing more concerned about their data, and their choices will reflect it.

Security Will Become More Automated

In response to these growing expectations, businesses will have to do more to secure people’s data. Cybersecurity companies are facing a considerable talent shortage thanks to pandemic-related complications, though. The data security world will turn to automation to fix both of these problems.

With so many businesses changing the way they operate, cybersecurity will have to become more flexible too. Automating some processes through AI will allow companies to achieve that flexibility. Security AI is still relatively new, but as it develops, it could take off in 2021.

Security Data Analytics Will Become the Norm

Big data analytics have already become standard practice in many business applications. In 2021, more companies will start using them to improve their data privacy measures, too. With major companies like Nintendo and Marriott experiencing significant data breaches this year, more will turn to analytics to find any potential shortcomings.

No one wants to be the next data breach news story, especially with more people paying attention to these issues now. Data analytics can highlight operational improvements, showing companies how to better their data security measures. With data privacy in the spotlight in 2021, taking these steps is crucial.

Third-Party Risk Assessments Will Be More Crucial

As people demand better privacy protection, businesses will have to consider their third-party partners. Consumers will be more critical of companies giving third parties access to their data. As a result, companies will have to perform more risk assessments on any third party.

Third-party data breaches affected companies like General Electric and T-Mobile in 2020, exposing thousands of records. Customers will expect businesses to hold their partners to higher standards to avoid these risks.

2021 Could Be a Landmark Year for Data Privacy

Data privacy is more prominent than ever before, mostly due to a few notable scandals. Now that the general public is more aware of these issues, businesses will have to meet higher standards for data privacy. Implementing data security processes may cause some disruption and confusion at first, but it will ultimately lead to a safer digital landscape.

All of these changes could make 2021 a turning point for data security. With higher expectations from consumers and authorities, data management will become more secure.

Data Science in Engineering Process - Product Lifecycle Management

How to develop digital products and solutions for industrial environments?

The Data Science and Engineering Process in PLM.

Huge opportunities for digital products are accompanied by huge risks

Digitalization is about to profoundly change the way we live and work. The increasing availability of data combined with growing storage capacities and computing power make it possible to create data-based products, services, and customer specific solutions to create insight with value for the business. Successful implementation requires systematic procedures for managing and analyzing data, but today such procedures are not covered in the PLM processes.

From our experience in industrial settings, organizations start processing the data that happens to be available. This data often does not fully cover the situation of interest, typically has poor quality, and in turn the results of data analysis are misleading. In industrial environments, the reliability and accuracy of results are crucial. Therefore, an enormous responsibility comes with the development of digital products and solutions. Unless there are systematic procedures in place to guide data management and data analysis in the development lifecycle, many promising digital products will not meet expectations.

Various methodologies exist but no comprehensive framework

Over the last decades, various methodologies focusing on specific aspects of how to deal with data were promoted across industries and academia. Examples are Six Sigma, CRISP-DM, JDM standard, DMM model, and KDD process. These methodologies aim at introducing principles for systematic data management and data analysis. Each methodology makes an important contribution to the overall picture of how to deal with data, but none provides a comprehensive framework covering all the necessary tasks and activities for the development of digital products. We should take these approaches as valuable input and integrate their strengths into a comprehensive Data Science and Engineering framework.

In fact, we believe it is time to establish an independent discipline to address the specific challenges of developing digital products, services and customer specific solutions. We need the same kind of professionalism in dealing with data that has been achieved in the established branches of engineering.

Data Science and Engineering as new discipline

Whereas the implementation of software algorithms is adequately guided by software engineering practices, there is currently no established engineering discipline covering the important tasks that focus on the data and how to develop causal models that capture the real world. We believe the development of industrial grade digital products and services requires an additional process area comprising best practices for data management and data analysis. This process area addresses the specific roles, skills, tasks, methods, tools, and management that are needed to succeed.

Figure: Data Science and Engineering as new engineering discipline

More than in other engineering disciplines, the outputs of Data Science and Engineering are created in repetitions of tasks in iterative cycles. The tasks are therefore organized into workflows with distinct objectives that clearly overlap along the phases of the PLM process.

Feasibility of Objectives
  Understand the business situation, confirm the feasibility of the product idea, clarify the data infrastructure needs, and create transparency on opportunities and risks related to the product idea from the data perspective.
Domain Understanding
  Establish an understanding of the causal context of the application domain, identify the influencing factors with impact on the outcomes in the operational scenarios where the digital product or service is going to be used.
Data Management
  Develop the data management strategy, define policies on data lifecycle management, design the specific solution architecture, and validate the technical solution after implementation.
Data Collection
  Define, implement and execute operational procedures for selecting, pre-processing, and transforming data as basis for further analysis. Ensure data quality by performing measurement system analysis and data integrity checks.
  Select suitable modeling techniques and create a calibrated prediction model, which includes fitting the parameters or training the model and verifying the accuracy and precision of the prediction model.
Insight Provision
  Incorporate the prediction model into a digital product or solution, provide suitable visualizations to address the information needs, evaluate the accuracy of the prediction results, and establish feedback loops.

Real business value will be generated only if the prediction model at the core of the digital product reliably and accurately reflects the real world, and the results allow to derive not only correct but also helpful conclusions. Now is the time to embrace the unique chances by establishing professionalism in data science and engineering.


Peter Louis                               

Peter Louis is working at Siemens Advanta Consulting as Senior Key Expert. He has 25 years’ experience in Project Management, Quality Management, Software Engineering, Statistical Process Control, and various process frameworks (Lean, Agile, CMMI). He is an expert on SPC, KPI systems, data analytics, prediction modelling, and Six Sigma Black Belt.

Ralf Russ    

Ralf Russ works as a Principal Key Expert at Siemens Advanta Consulting. He has more than two decades experience rolling out frameworks for development of industrial-grade high quality products, services, and solutions. He is Six Sigma Master Black Belt and passionate about process transparency, optimization, anomaly detection, and prediction modelling using statistics and data analytics.4

Process Mining mit PAFnow – Artikelserie

Artikelserie zu Process Mining Tools – PAFnow

Der zweite Artikel der Artikelserie Process Mining Tools beschäftigt sich mit dem Anbieter PAFnow. 2014 in Deutschland gegründet kann das Unternehmen PAF, dessen Kürzel für Process Analytics Factory steht, bereits auf eine beachtliche Anzahl an Projekten zurückblicken. Das klare selbst gesteckte Ziel von PAF: Mit dem eigenen Tool namens PAFnow Process Mining für jeden zugänglich machen.

PAFnow basiert auf dem bekannten BI-Tool „Power BI“. Wer sein Wissen zu Power BI noch einmal auffrischen möchte, kann das gerne in diesem Artikel aus der Artikelserie zu BI-Tools machen. Da Power BI selbst als Cloud- und On-Premise-Lösung erhältlich ist, gilt dies indirekt auch für PAFnow. Diese vier Versionen des Process Mining Tools werden von PAFnow angeboten:

Free Pro Premium Enterprise
Lizenz:  Kostenfrei
(Marketplace Power BI)
99€ pro User pro Monat 499€ pro User pro Monat Nur auf Anfrage
Zielgruppe:  Für kleine Unternehmen und Einzelanwender Für kleine bis mittlere Unternehmen Für mittlere und große Unternehmen Für mittlere und große Unternehmen
Datenquellen: Beliebig (Power BI Konnektoren), Transformationen in Power BI Beliebig (Power BI Konnektoren), Transformationen in Power BI Beliebig (Power BI Konnektoren), Transformationen in Power BI Beliebig (Power BI Konnektoren), Transformationen auch via MS SSIS
Datenvolumen: Limitiert auf 30.000 Events,
1 Visual
Unlimitierte Events,
1 Visual, 1 Report
Unlimitierte Events,
9 Visual, 10 Reports
Unlimitierte Events,
10 Visual, 10 Reports, Content Packs
Architektur: Nur On-Premise Nur On-Premise Nur On-Premise Nur On-Premise

Abbildung 1: Übersicht zu den vier verschiedenen Produktversionen des Process Mining Tools PAFnow

PAF führt auf seiner Website weitere Informationen zu den jeweiligen Versionsunterschieden an. Für diesen Artikel wird sich im weiteren Verlauf auf die Enterprise Version bezogen, wenn nicht anderes gekennzeichnet.

Bedienbarkeit und Anpassungsfähigkeit der Analysen

Das übersichtliche Userinterface von Power BI unterstützt die Analyse von Prozessen mit PAFnow. Und auch Anfänger können sich glücklich schätzen, denn es gibt eine beeindruckende Vielzahl an hochwertigen Lernvideos und Dokumentation zu Power BI. Die von PAFnow entwickelten Visuals, wie zum Beispiel der „Process Explorer“ fügt sich reibungslos zu den Power BI Visuals ein. Denn die Bedienung dieser Visuals entspricht größtenteils demselben Prinzip wie dem der Power BI Visuals. Neue Anwendungen wie beim Process Explorer der Conformance Check, werden jedoch auch von PAFnow in Lernvideos erläutert.

PAFnow Process Mining by using Power BIAbbildung 1: Userinterface von PAFnow in dem vorgefertigten Report „Discovery“

Die PAFnow Visuals werden – wie in Power BI – üblich per drag & drop platziert und mit den gewünschten Dimensionen und Measures bestückt. Die Visuals besitzen verschiedenste Einstellungsmöglichkeiten, um dem Benutzer das Visual nach seinen Vorstellungen gestallten zu lassen. Kommt man an die Grenzen der Einstellungen, lohnt sich immer ein Blick in den Marketplace von Power BI. Dort werden viele und teilweise auch technisch sehr gute Visuals kostenlos angeboten, welche viele weitere Analyseideen im Kontext der Prozessanalyse abdecken.

Die vorgefertigten Reports von PAFnow sind intuitiv zu handhaben, denn sie vermitteln dem Analysten direkt den passenden Eindruck, wie die jeweiligen Visuals am besten einzusetzen sind. Einzelne Elemente aus dem Report können gelöscht und nach Belieben ergänzt werden. Dadurch kann Zeit gespart und mit der eigentlichen Analyse schnell begonnen werden.

PAFnow Process Mining Power BI - Varienten-AnalyseAbbildung 2: Vorgefertigter Report „Variants“ an dem direkt eine Root-Cause Analyse durchgeführt werden kann

In Power BI werden die KPI’s bzw. Measures in einer von Microsoft eigens entwickelten Analysesprache namens DAX (Data Analysis Expressions) definiert. Diese Formelsprache ist ein sehr stark an Excel angelehnter Syntax und bietet für viele Nutzer in dieser Hinsicht einen guten Einstieg. Allerdings bietet der Umfang von DAX noch deutlich mehr, als es die meisten Excel Nutzer gewohnt sein werden, so können auch motivierte und technisch affine Business Experten recht tief in die Analyse abtauchen. Da es auch hier eine sehr gut aufgestellte Community als auch Dokumentation gibt, sind die Informationen zu den verborgenen Fähigkeiten von DAX meist nur ein paar Klicks entfernt.


PAF bietet für sein Process Mining Tool aktuell noch keine eigene Cloud-Lösung an und ist somit nur über Power BI selbst als Cloud-Lösung erhältlich. Anwender, die sich eine unabhängige Process Mining – Plattform wünschen, müssen sich daher mit Power BI zufriedengeben. Ob PAFnow in absehbarer Zeit diese Lücke schließen wird und die Enterprise-Readiness des Tools somit erhöhen wird, bleibt abzuwarten, wünschenswert wäre es. Mit Power BI als Cloud-Lösung ist man als Anwender jedoch in den meisten Fällen nicht schlecht vertröstet. Da Power BI sowohl als Cloud- und als On-Premise-Lösung verfügbar ist, kann hier situationsabhängig entschieden werden. An dieser Stelle gilt es abzuwägen, welche Limitationen die beiden Lösungen mit sich bringen und daher sei auch an dieser Stelle der Artikel zu Power BI aus der BI-Tool-Artikelserie empfohlen. Darüber hinaus sollte die Größe der zu analysierenden Prozessdaten berücksichtigt werden. So kann bei plötzlich zu großen Datenmengen auch später noch ein Wechsel von der recht günstigen Power BI Pro-Lizenz auf die deutlich kostenintensivere Premium-Lizenz erfordern. In der Enterprise Version von PAFnow sind zwei frei wählbare Content Packs enthalten, welche aus SAP-Konnektoren, sowie vorentwickelten SSIS Packages bestehen. Mittels Datenextraktor werden die benötigten Prozessdaten, z. B. für die Prozesse P2P (Purchase-to-Pay) und O2C (Order-to-Cash), in eine Datenbank eines MS SQL Servers geladen und dort durch die SSIS-Packages automatisch in das für die Analyse benötigte Format transformiert. SSIS ist ein ETL-Tool von Microsoft und steht für SQL Server Integration Services. SSIS ist ein Teil der Enterprise-Vollversion des Microsoft SQL Servers.

Die vorgefertigten Reports die PAFnow zur Verfügung stellt, können Projekte zusätzlich beschleunigen. Neben den zwei frei wählbaren Content Packs, die in der Enterprise Version von PAFnow enthalten sind, stellen Partner die von Ihnen selbstentwickelte Packs zur Verfügung. Diese sind sofern die zwei kostenlosen Content Packs bereits beansprucht wurden jedoch zahlungspflichtig. PAFnow profitiert von der beeindruckenden Menge an verschiedenen Konnektoren, die Microsoft in Power BI zur Verfügung stellt. So können zusätzlich Daten direkt aus den Quellsystemen in Power BI geladen werden und dem Datenmodel ggf. hinzugefügt werden. Der Vorteil liegt in der Flexibilität, Daten nicht immer zwingend über ein Data Warehouse verfügbar machen zu müssen, sondern durch den direkten Zugriff auf die Datenquellen schnelle Workarounds zu ermöglichen. Allerdings ist dieser Vorteil nur auf ergänzende Daten beschränkt, denn das Event-Log wird stets via SSIS-ETL in der Datenbank oder der sogenannten „Companion-Software“ transformiert und bereitgestellt. Da der Companion jedoch ohne Schedule-Funktion auskommt, Transformationen also manuell angestoßen werden müssen, eignet sich dieser kaum für das Monitoring von Prozessen. Falls eine hohe Aktualität der Daten gefordert ist, sollte daher auf die SSIS-Package-Funktion der Enterprise Version zurückgegriffen werden.

Ergänzende Daten können anschließend mittels einer der vielen Power BI Konnektoren auch direkt aus der Datenquelle geladen werden, um Sie anschließend mit dem Datenmodell zu verknüpfen. Dabei sollte bei der Modellierung jedoch darauf geachtet werden, dass ein entsprechender Verbindungsschlüssel besteht. Die Flexibilität, Daten aus verschiedensten Datenquellen in nahezu x-beliebigem Format der Process Mining Analyse hinzufügen zu können, ist ein klarer Pluspunkt und der große Vorteil von PAFnow, auf die erfolgreiche BI-Lösung von Microsoft aufzusetzen. Mit der Wahl von SSIS als Event-Log/ETL-Lösung, positioniert sich PAFnow noch ein deutliches Stück näher zum Microsoft Stack und erleichtert die Integration in diejenige IT-Infrastruktur, die auf eben diesen Microsoft Stack setzt.

Auch in Sachen Benutzer-Berechtigungsmanagement können die Process Mining Analysen mittels Power BI Features, wie z.B. Row-based Level Security detailliert verwaltet werden. So können ganze Reports nur für bestimmte Personen oder Gruppen zugänglich gemacht werden, aber auch Teile des Reports sowie einzelne Datenausschnitte kontrolliert definierten Rollen zugewiesen werden.


Um große Datenmengen mit Analysemethodik aus dem Process Mining analysieren zu können, muss die Software bei Bedarf skalieren. Wer mit großen Datasets in Power BI Pro lokal auf seinem Rechner schon Erfahrungen sammeln durfte, wird sicherlich schon mal an seine Grenzen gestoßen sein und Power BI nicht unbedingt als Big Data ready bezeichnen. Diese Performance spiegelt allerdings nur die untere Seite des Spektrums wider. So ist Power BI mit der Premium-Lizenz und einer ausreichend skalierten Azure SQL Data Warehouse Instanz durchaus dazu in der Lage, Daten im Petabytebereich zu analysieren. Microsoft entwickelt Power BI kontinuierlich weiter und wird mit an Sicherheit grenzender Wahrscheinlichkeit auch für weitere Performance-Verbesserung sorgen. Dabei wird MS Azure, die Cloud-Plattform von Microsoft, weiterhin eine entscheidende Rolle spielen. Hiervon wird PAFnow profitieren und attraktiv auch für Process Mining Projekte mit Big Data werden. Referenzprojekte mit besonders großen Datenmengen, die mit PAFnow analysiert wurden, sind öffentlich nicht bekannt. Im Grunde sind jegliche Skalierungsfähigkeiten jedoch nicht jene dieser Analysefunktionalität, sondern liegen im Microsoft Technology Stack mit all seinen Vor- und Nachteilen der Nutzung on-Premise oder in der Microsoft Cloud. Dabei steckt der Teufel übrigens immer im Detail und so muss z. B. stets auf die richtige Version von Power BI geachtet werden, denn es gibt für die Nutzung On-Premise mit dem Power BI Report Server als auch für jene Nutzung über Microsoft Azure unterschiedliche Versionen, die zueinander passen müssen.

Die Datenmodellierung erfolgt in der Datenbank (On-Premise oder in der Cloud) und wird dann in Power BI geladen. Das Datenmodell wird in Power BI grafisch und übersichtlich dargestellt, wodurch auch der End-Nutzer jederzeit nachvollziehen kann in welcher Beziehung die einzelnen Tabellen zueinanderstehen. Die folgende Abbildung zeigt ein beispielhaftes Datenmodel visuell in Power BI.

Data Model in Microsoft Power BIAbbildung 3: Grafische Darstellung des Datenmodels in Power BI

Zusätzliche Daten lassen sich – wie bereits erwähnt – sehr einfach hinzufügen und auch einfach anbinden, sofern ein Verbindungsschlüssel besteht. Sollten also zusätzliche Slicer benötigt werden, können diese problemlos ergänzt werden. An dieser Stelle sorgen die vielen von Power BI bereitgestellten Konnektoren für einen hohen Grad an Flexibilität. Für erfahrene Power BI Benutzer ist die Datenmodellierung also wie immer reibungslos und übersichtlich. Aber auch Neulinge sollten, sofern sie Erfahrung in der Datenmodellierung haben, hier keine Schwierigkeiten haben. Kleinere Transformationen beim Datenimport können im Query Editor von Power BI, mit Hilfe der Formelsprache Power Query (M) gemacht werden. Diese Formelsprache ist einsteigerfreundlich und ähnelt in Teilen der Programmiersprache F#. Aber auch ohne diese Formelsprache können einfache Transformationen mit Hilfe des übersichtlichen und mit vielen Funktionen ausgestatteten Userinterfaces im Query Editor intuitiv erledigt werden. Bei größeren und komplexeren Transformationen sollten die Daten jedoch auf Datenbankebene erfolgen. Dort werden die Rohdaten auch für die PAFnow Visuals vorbereitet, sofern die Enterprise-Version genutzt wird. PAFnow stellt für diese Transformationen vorgefertigte SSIS-Packages zur Verfügung, welche auch angepasst und erweitert werden können. Die Modellierung erfolgt somit in T-SQL, das in den SSIS-Queries eingebettet ist und stellt für jeden erfahrenden SQL-Anwender keine Schwierigkeiten dar. Bei der Erweiterbarkeit und Flexibilität der Datenmodelle konnte ich ebenfalls keine besonderen Einschränkungen feststellen. Einzig das Schema, welches von den PAFnow Visuals vorgegeben wird, muss eingehalten werden. Durch das Zurückgreifen auf die Abfragesprache SQL, kann bei der Modellierung auf eine sehr breite Community zurückgegriffen werden. Darüber hinaus können bestehende SQL-Skripte eingefügt und leicht angepasst werden. Und auch die Suche nach einem geeigneten Data Engineer gestaltet sich dadurch praktisch, da SQL im Generellen und der MS SQL Server im Speziellen im Einsatz sehr verbreitet sind.


Grundsätzlich verfolgt PAF nach eigener Aussage einen anderen Ansatz als der Großteil ihrer Mitbewerber: “So setzt PAF weniger auf monolithische Strukturen, sondern verfolgt einen Plattform-agnostischen Ansatz“.  Damit grenzt sich PAF von sogenannte All-in-one Lösungen ab, bei welchen alle Funktionen bereits integriert sind. Der Vorteil solcher Lösungen ist, dass sie vollumfänglich „ready-to-use“ sind, sobald sie erfolgreich implementiert wurden. Der Nachteil solcher Systeme liegt in der unzureichenden Steuerungsmöglichkeit der einzelnen Bestandteile. Microservices hingegen versprechen eben genau diese Kontrolle und erlauben es dem Anwender, nur die Funktionen, die benötigt werden nach eigenen Vorstellungen in das System zu integrieren. Auf der anderen Seite ist der Aufbau solcher agnostischen Systeme deutlich komplexer und beansprucht daher oft mehr Zeit bei der Implementierung und setzt auch ein gewissen Know-How voraus. Die Entscheidung für den einen oder anderen Ansatz gleicht ein wenig einer make-or-buy Entscheidung und muss daher in den individuellen Situationen abgewogen werden.

In den beiden Trendthemen Machine Learning und Task Mining kann PAFnow aktuell noch keine Lösungen vorzeigen. Nach eigenen Aussagen gibt es jedoch bereits einige Neuerungen in der Pipeline, welche PAFnow in Zukunft deutlich AI-getriebener gestalten werden. Näheres zu diesem Thema wollte man an dieser Stelle zum Zeitpunkt der Veröffentlichung dieses Artikels nicht verkünden. Jedoch kann der Website von PAFnow diverse Forschungsprojekte eingesehen werden, welche sich unteranderem mit KI und RPA befassen. Sicherlich profitieren PAFnow Anwender auch von der Zukunftsfähigkeit von Power BI bzw. Microsoft selbst. Inwieweit diese Entwicklungen in dieselbe Richtung gehen wie die Trends im Bereich Process Mining bleibt abzuwarten.


Der Kostenrahmen für das Process Mining Tool von PAFnow ist sehr weit gehalten. Da die Pro Version bereits für 120$ im Monat zu haben ist, spiegelt sich hier die Philosophie von PAFnow wider, Process Mining für jedermann zugänglich zu machen. Mit dieser niedrigen Einstiegshürde können Unternehmen erste Erfahrungen im Process Mining sammeln und diese ohne großes Investitionsrisiko validieren. Nicht im Preis enthalten, sind jedoch etwaige Kosten für das notwendige BI-Tool Power BI. Da jedoch auch hier der Kostenrahmen sehr weit ausfällt und mittlerweile auch im Serviceportfolio von Microsoft 365 enthalten ist, bleibt es bei einer niedrigen Einstieghürde aus finanzieller Sicht. Allerdings kann bei umfangreicher Nutzung der Preis der Power BI Lizenzgebühren auch deutlich höher ausfallen. Kommt Power BI z. B. aus Gründen der Data Governance nur als On-Premise-Lösung in Betracht, steigen die Kosten für Power BI grundsätzlich bereits auf mindestens 4.995 EUR pro Monat. Die Preisbewertung von PAFnow ist also eng verbunden mit dem Power BI Lizenzmodel und sollte im Einzelfall immer mit einbezogen werden. Wer gerne mehr zum Lizenzmodel von Power BI wissen möchte, bekommt hier eine zusammengefasste Übersicht.


Mit PAFnow ist ein durchaus erschwingliche Process Mining Tool auf dem Markt erhältlich, welches sich geschickt in den Microsoft-BI-Stack eingliedert und die Hürden für den Einstieg relativ geringhält. Unternehmen, die ohnehin Power BI als Reporting Lösung nutzen, können ohne großen Aufwand erste Projekte mit Process Mining starten und den Umfang der Funktionen über die verschiedenen Lizenzen hochskalieren. Allerdings sind dem Autor auch Unternehmen bekannt, die Power BI und den MS SQL Server explizit für die Nutzung von PAFnow erstmalig in ihre Unternehmens-IT eingeführt haben. Da Power BI bereits mit vielen Features ausgestattet ist und auch kontinuierlich weiterentwickelt wird, profitiert PAFnow von dieser Entwicklungsarbeit ungemein. Die vorgefertigten Reports von PAFnow können die Time-to-Value lukrativ verkürzen und sind flexibel erweiterbar. Für erfahrene Anwender von Power BI ist der Umgang mit den Visuals von PAF sehr intuitiv und bedarf keines großen Schulungsaufwandes. Die Datenmodellierung erfolgt auf SSIS-Basis in SQL und weist somit auch keine nennenswerten Hürden auf. Wie leistungsstark PAFnow mit großen Datenmengen umgeht kann an dieser Stelle nicht bewertet werden. PAFnow steht nicht nur in diesem Punkt in direkter Abhängigkeit von der zukünftigen Entwicklung des Microsoft Technology Stacks und insbesondere von Microsoft Power BI. Für strategische Überlegungen bzgl. der Integrationsfähigkeit in das jeweilige Unternehmen sollte dies immer berücksichtigt werden.

Simple RNN

LSTM back propagation: following the flows of variables

First of all, the summary of this article is: please just download my Power Point slides which I made and be patient, following the equations.

I am not supposed to use so many mathematics when I write articles on Data Science Blog. However using little mathematics when I talk about LSTM backprop is like writing German, never caring about “der,” “die,” “das,” or speaking little English in English classes (which most high school English teachers in Japan do) or writing Japanese without using any Chinese characters (which looks like a terrible handwriting by a drug addict). In short, that is ridiculous. And all the precise equations of LSTM backprop, written on a blog is not a comfortable thing to see. So basically the whole of this article is an advertisement on my PowerPoint slides, sponsored by DATANOMIQ, and I can just give you some tips to get ready for the most tiresome part of understanding LSTM here.

*This article is the fifth article of “A gentle introduction to the tiresome part of understanding RNN.”

 *In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

1. Chain rules

This article is virtually an article on chain rules of differentiation. Even if you have clear understandings on chain rules, I recommend you to take a look at this section. If you have written down all the equations of back propagation of DCL, you would have seen what chain rules are. Even simple chain rules for backprop of normal DCL can be difficult to some people, but when it comes to backprop of LSTM, it is a pure torture.  I think using graphical models would help you understand what chain rules are like. Graphical models are basically used to describe the relations of variables and functions in probabilistic models, so to be exact I am going to use “something like graphical models” in this article. Not that this is a common way to explain chain rules.

First, let’s think about the simplest type of chain rule. Assume that you have a function f=f(x)=f(x(y)), and relations of the functions are displayed as the graphical model at the left side of the figure below. Variables are a type of function, so you should think that every node in graphical models denotes a function. Arrows in purple in the right side of the chart show how information propagate in differentiation.

Next, if you have a function f , which has two variances  x_1 and x_2. And both of the variances also share two variances  y_1 and y_2. When you take partial differentiation of f with respect to y_1 or y_2, the formula is a little tricky. Let’s think about how to calculate \frac{\partial f}{\partial y_1}. The variance y_1 propagates to f via x_1 and x_2. In this case the partial differentiation has two terms as below.

In chain rules, you have to think about all the routes where a variance can propagate through. If you generalize chain rules as the graphical model below, the partial differentiation of f with respect to y_i is calculated as below. And you need to understand chain rules in this way to understanding any types of back propagation.

The figure above shows that if you calculate partial differentiation of f with respect to y_i, the partial differentiation has n terms in total because y_i propagates to f via n variances. In order to understand backprop of LSTM, you constantly have to care about the flows of variances, which I display as purple arrows.

2. Chain rules in LSTM

I would like you to remember the figure below, which I used in the second article to show how errors propagate backward during backprop of simple RNNs. After forward propagation, first of all, you need to calculate \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}, gradients of the error function with respect to parameters, at each time step. But you have to be careful that even though these gradients depend on time steps, the parameters \boldsymbol{\theta} do not depend on time steps.

*As I mentioned in the second article I personally think \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} should be rather denoted as (\frac{\partial J}{\partial \boldsymbol{\theta}})^{(t)} because parameters themselves do not depend on time. However even the textbook by MIT press partly use the former notation. And I think you are likely to encounter this type of notation, so I think it is not bad to get ready for both.

The errors at time step (t) propagate backward to all the \boldsymbol{h} ^{(s)} (s \leq t). Conversely, in order to calculate \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} errors flowing from J^{(s)}  (s \geq t). In the chart you need arrows of errors in purple for the gradient in a purple frame, orange arrows for gradients in orange frame, red arrows for gradients in red frame. And you need to sum up \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}} to calculate \frac{\partial J}{\partial \boldsymbol{\theta}} = \sum_{t}{\frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}}, and you need this gradient \frac{\partial J}{\partial \boldsymbol{\theta}} to renew parameters, one time.

At an RNN block level, the flows of errors and how to renew parameters are the same in LSTM backprop, but the flow of errors inside each block is much more complicated in LSTM backprop. But in order to denote errors of LSTM backprop, instead of \frac{\partial J}{\partial \boldsymbol{\theta}^{(t)}}, I use a special notation \delta \star ^{(t)} = \frac{\partial J}{\partial \star}.

* Again, please be careful of what \delta \star  ^{(t)} means. Neurons depend on time steps, but parameters do not depend on time steps. So if \star are neurons,  \delta \star  ^{(t)}= \frac{\partial J}{ \partial \star ^{(t)}}, but when \star are parameters, \delta \star  ^{(t)} should be rather denoted like \delta \star  ^{(t)}= (\frac{\partial J}{ \partial \star })^{(t)}. In the Space Odyssey paper\boldsymbol{\star} are not used as parameters, but in my PowerPoint slides and some other materials, \boldsymbol{\star} are used also as parameteres.

As I wrote in the last article, you calculate \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} as below. Unlike the last article, I also added the terms of peephole connections in the equations below, and I also introduced the variances \bar{\boldsymbol{f}}^{(t)}, \bar{\boldsymbol{i}}^{(t)}, \bar{\boldsymbol{z}}^{(t)}, \bar{\boldsymbol{o}}^{(t)} for convenience.

  • \boldsymbol{\bar{f}}^{(t)}=\boldsymbol{W}_{for} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{for}\odot \boldsymbol{c}^{(t-1)} + \boldsymbol{b}_{for}
  • \boldsymbol{\bar{i}}^{(t)}=\boldsymbol{W}_{in} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{in}\odot \boldsymbol{c}^{(t-1)} + \boldsymbol{b}_{in}
  • \boldsymbol{\bar{z}}^{(t)}=\boldsymbol{W}_z \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_z \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_z
  • \boldsymbol{\bar{o}}^{(t)}=\boldsymbol{W}_{out} \cdot \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \cdot \boldsymbol{y}^{(t-1)} + \boldsymbol{p}_{out}\odot \boldsymbol{c}^{(t)} + \boldsymbol{b}_{out}
  • \boldsymbol{f}^{(t)}=\sigma( \boldsymbol{\bar{f}}^{(t)})
  • \boldsymbol{i}^{(t)}=\sigma(\boldsymbol{\bar{i}}^{(t)})
  • \boldsymbol{z}^{(t)}=tanh(\boldsymbol{\bar{z}}^{(t)})
  • \boldsymbol{o}^{(t)}=\sigma(\boldsymbol{\bar{o}}^{(t)})

With  Hadamar product operator, the renewed cell and the output are calculated as below.

  • \boldsymbol{c}^{(t)} = \boldsymbol{z}^{(t)}\odot \boldsymbol{i}^{(t)} + \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}
  • \boldsymbol{y}^{(t)} = \boldsymbol{o}^{(t)} \odot tanh(\boldsymbol{c}^{(t)})

In this article I would rather give instructions on how to read my PowerPoint slide. Just as general backprop, you need to calculate gradients of error functions with respect to parameters, such as \delta \boldsymbol{W}_{\star}, \delta \boldsymbol{R}_{\star}, \delta \boldsymbol{p}_{\star}, \delta \boldsymbol{b}_{\star}, where \star is either of \{z, in, for, out \}. And just as backprop of simple RNNs, in order to calculate gradients with respect to parameters, you need to calculate errors of neurons, that is gradients of error functions with respect to neurons, such as \delta \boldsymbol{f}^{(t)}, \delta \boldsymbol{i}^{(t)}, \delta \boldsymbol{z}^{(t)}, \delta \boldsymbol{o}^{(t)}.

*Again and again, keep it in mind that neurons depend on time steps, but parameters do not depend on time steps.

When you calculate gradients with respect to neurons, you can first calculate \delta \boldsymbol{y}^{(t)}, but the equation for this error is the most difficult, so I recommend you to put it aside for now. After calculating \delta \boldsymbol{y}^{(t)}, you can next calculate \delta \bar{\boldsymbol{o}}^{(t)}= \frac{\partial J^{(t)}}{ \partial \bar{\boldsymbol{o}}^{(t)}}. If you see the LSTM block below as a graphical model which I introduced, the information of \bar{\boldsymbol{o}}^{(t)} flow like the purple arrows. That means, \bar{\boldsymbol{o}}^{(t)} affects J only via \boldsymbol{y}^{(t)}, and this structure is equal to the first graphical model which I have introduced above. And if you calculate \bar{\boldsymbol{o}}^{(t)} element-wise, you get the equation \delta \bar{o}_{k}^{(t)}=\frac{\partial J}{\partial \bar{o}_{k}^{(t)}}= \frac{\partial J}{\partial y_{k}^{(t)}} \frac{\partial y_{k}^{(t)}}{\partial \bar{o}_{k}^{(t)}}.

*The k is an index of an element of vectors. If you can calculate element-wise gradients, it is easy to understand that as differentiation of vectors and matrices.

Next you can calculate \delta \boldsymbol{c}^{(t)}, and chain rules are very important in this process. The flow of \delta \boldsymbol{c}^{(t)} to J can be roughly divided into two streams: the one which flows to J as \bodlsymbol{y}^{(t)}, and the one which flows to J as \bodlsymbol{c}^{(t+1)}. And the stream from \bodlsymbol{c}^{(t)} to \bodlsymbol{y}^{(t)} also have two branches: the one via \bar{\boldsymbol{o}}^{(t)} and the one which directly converges as  \bodlsymbol{y}^{(t)}. Just as well, the stream from \bodlsymbol{c}^{(t)} to \bodlsymbol{c}^{(t+1)} also have three branches: the ones via \bar{\boldsymbol{f}}^{(t)}, \bar{\boldsymbol{i}}^{(t)}, and the one which directly converges as \bodlsymbol{c}^{(t+1)}.

If you see see these flows as graphical a graphical model, that would be like the figure below.

According to this graphical model, you can calculate \delta \boldsymbol{c} ^{(t)} element-wise as below.

* TO BE VERY HONEST I still do not fully understand why we can apply chain rules like above to calculate \delta \boldsymbol{c}^{(t)}. When you apply the formula of chain rules, which I showed in the first section, to this case, you have to be careful of where to apply partial differential operators \frac{\partial}{ \partial c_{k}^{(t)}}. In the case above, in the part \frac{\partial y_{k}^{(t)}}{\partial c_{k}^{(t)}} the partial differential operator only affects tanh(c_{k}^{(t)}) of o_{k}^{(t)} \cdot tanh(c_{k}^{(t)}). And in the part \frac{\partial c_{k}^{(t+1)}}{\partial c_{k}^{(t)}}, the partial differential operator \frac{\partial}{\partial c_{k}^{(t)}} only affects the part c_{k}^{(t)} of the term c^{t}_{k} \cdot f_{k}^{(t+1)}. In the \frac{\partial \bar{o}_{k}^{(t)}}{\partial c_{k}^{(t)}} part, only (p_{out})_{k} \cdot c_{k}^{(t)},  in the \frac{\partial \bar{i}_{k}^{(t+1)}}{\partial c_{k}^{(t)}} part, only (p_{in})_{k} \cdot c_{k}^{(t)}, and in the \frac{\partial \bar{f}_{k}^{(t+1)}}{\partial c_{k}^{(t)}} part, only (p_{in})_{k} \cdot c_{k}^{(t)}. But some other parts, which are not affected by \frac{\partial}{ \partial c_{k}^{(t)}} are also functions of c_{k}^{(t)}. For example o_{k}^{(t)} of o_{k}^{(t)} \cdot tanh(c_{k}^{(t)}) is also a function of c_{k}^{(t)}. And I am still not sure about the logic behind where to affect those partial differential operators.

*But at least, these are the only decent equations for LSTM backprop which I could find, and a frequently cited paper on LSTM uses implementation based on these equations. Computer science is more of practical skills, rather than rigid mathematical logic. Also I think I have spent great deal of my time thinking about this part, and it is now time for me to move to next step. If you have any comments or advice on this point, please let me know.

Calculating \delta \bar{\boldsymbol{f}}^{(t)}, \delta \bar{\boldsymbol{i}}^{(t)}, \delta \bar{\boldsymbol{z}}^{(t)} are also relatively straigtforward as calculating \delta \bar{\boldsymbol{o}}^{(t)}. They all use the first type of chain rule in the first section. Thereby you can get these gradients: \delta \bar{f}_{k}^{(t)}=\frac{\partial J}{ \partial \bar{f}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{f}_{k}^{(t)}}, \delta \bar{i}_{k}^{(t)}=\frac{\partial J}{\partial \bar{i}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{i}_{k}^{(t)}}, and \delta \bar{z}_{k}^{(t)}=\frac{\partial J}{\partial \bar{z}_{k}^{(t)}} =\frac{\partial J}{\partial c_{k}^{(t)}} \frac{\partial c_{k}^{(t)}}{ \partial \bar{i}_{k}^{(t)}}.

All the gradients which we have calculated use the error \delta \boldsymbol{y}^{(t)}, but when it comes to calculating \delta \boldsymbol{y}^{(t)}….. I can only say “Please be patient. I did my best in my PowerPoint slides to explain that.” It is not a kind of process which I want to explain on Word Press. In conclusion you get an error like this: \delta \boldsymbol{y}^{(t)}=\frac{\partial J^{(t)}}{\partial \boldsymbol{y}^{(t)}} + \boldsymbol{R}_{for}^{T} \delta \bar{\boldsymbol{f}}^{(t+1)} + \boldsymbol{R}_{in}^{T}\delta \bar{\boldsymbol{i}}^{(t+1)} + \boldsymbol{R}_{out}^{T}\delta \bar{\boldsymbol{o}}^{(t+1)} + \boldsymbol{R}_{z}^{T}\delta \bar{\boldsymbol{z}}^{(t+1)}, and the flows of \boldsymbol{y}^{(t)} are as blow.

Combining the gradients we have got so far, we can calculate gradients with respect to parameters. For concrete results, please check the Space Odyssey paper or my PowerPoint slide.

3. How LSTMs tackle exploding/vanishing gradients problems

*If you are allergic to mathematics, you should not read this section or even download my PowerPoint slide.

*Part of this section is more or less subjective, so if you really want to know how LSTM mitigate the problems, I highly recommend you to also refer to other materials. But at least I did my best for this article.

LSTMs do not completely solve, vanishing gradient problems. They mitigate vanishing/exploding gradient problems. I am going to roughly explain why they can tackle those problems. I think you find many explanations on that topic, but many of them seems to have some mathematical mistakes (even the slide used in a lecture in Stanford University) and I could not partly agree with some statements. I also could not find any papers or materials which show the whole picture of how LSTMs can tackle those problems. So in this article I am only going to give instructions on the major way to explain this topic.

First let’s see how gradients actually “vanish” or “explode” in simple RNNs. As I in the second article of this series, simple RNNs propagate forward as the equations below.

  • \boldsymbol{a}^{(t)} = \boldsymbol{b} + \boldsymbol{W} \cdot \boldsymbol{h}^{(t-1)} + \boldsymbol{U} \cdot \boldsymbol{x}^{(t)}
  • \boldsymbol{h}^{(t)}= g(\boldsymbol{a}^{(t)})
  • \boldsymbol{o}^{(t)} = \boldsymbol{c} + \boldsymbol{V} \cdot \boldsymbol{h}^{(t)}
  • \hat{\boldsymbol{y}} ^{(t)} = f(\boldsymbol{o}^{(t)})

And every time step, you get an error function J^{(t)}. Let’s consider the gradient of J^{(t)} with respect to \boldsymbol{h}^{(k)}, that is the error flowing from J^{(t)} to \boldsymbol{h}^{(k)}. This error is the most used to calculate gradients of the parameters in the equations above.

If you calculate this error more concretely, \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}} \cdots \frac{\partial \boldsymbol{h}^{(k+2)}}{\partial \boldsymbol{h}^{(k+1)}} \frac{\partial \boldsymbol{h}^{(k+1)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \prod_{k< s \leq t} \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}}, where \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}} = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{b} + \boldsymbol{W}\cdot \boldsymbol{h}^{(s-1)} + \boldsymbol{U}\cdot \boldsymbol{x}^{(s)})) = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})).

* If you see the figure as a type of graphical model, you should be able to understand the why chain rules can be applied as the equation above.

*According to this paper \frac{\partial \boldsymbol{h}^{(s)}}{\partial \boldsymbol{h}^{(s-1)}}  = \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})), but it seems that many study materials and web sites are mistaken in this point.

Hence \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \prod_{k< s \leq t} \boldsymbol{W} ^T \cdot diag(g'(\boldsymbol{a}^{(s)})) = \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} (\boldsymbol{W} ^T )^{(t - k)} \prod_{k< s \leq t} diag(g'(\boldsymbol{a}^{(s)})). If you take norms of \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} you get an equality \left\lVert \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} \right\rVert \leq \left\lVert \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(t)}} \right\rVert \left\lVert \boldsymbol{W} ^T \right\rVert ^{(t - k)} \prod_{k< s \leq t} \left\lVert diag(g'(\boldsymbol{a}^{(s)}))\right\rVert. I will not go into detail anymore, but it is known that according to this inequality, multiplication of weight vectors exponentially converge to 0 or to infinite number.

We have seen that the error \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}} is the main factor causing vanishing/exploding gradient problems of simple RNNs. In case of LSTM, \frac{\partial J^{(t)}}{\partial \boldsymbol{c}^{(k)}} is an equivalent. For simplicity, let’s calculate only \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}, which is equivalent to \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}} of simple RNN backprop.

* Just as I noted above, you have to be careful of which part the partial differential operator \frac{\partial}{\partial \boldsymbol{c}^{(t-1)}} affects in the chain rule above. That is, you need to calculate \frac{\partial}{\partial \boldsymbol{c}^{(t-1)}} (\boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}), and the partial differential operator only affects \boldsymbol{c}^{(t-1)}. I think this is not a correct mathematical notation, but please forgive me for doing this for convenience.

If you continue calculating the equation above more concretely, you get the equation below.

I cannot mathematically explain why, but it is known that this characteristic of gradients of LSTM backprop mitigate the vanishing/exploding gradient problem. We have seen that if you take a norm of \frac{\partial J^{(t)}}{\partial \boldsymbol{h}^{(k)}}, that is equal or smaller than repeated multiplication of the norm of the same weight matrix, and that soon leads to vanishing/exploding gradient problem. But according to the equation above, even if you take a norm of repeatedly multiplied \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}, its norm cannot be evaluated with a simple value like repeated multiplication of the norm of the same weight matrix. The outputs of each gate are different from time steps to time steps, and that adjust the value of \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}.

*I personally guess the term diag(\boldsymbol{f}^{(t)}) is very effective. The unaffected value of the elements of \boldsymbol{f}^{(t)} can directly adjust the value of \frac{\partial \boldsymbol{c}^{(t)}}{\partial \boldsymbol{c}^{(t-1)}}. And as a matter of fact, it is known that performances of LSTM drop the most when you get rid of forget gates.

When it comes to tackling exploding gradient problems, there is a much easier technique called gradient clipping. This algorithm is very simple: you just have to adjust the size of gradient so that the absolute value of gradient is under a threshold at every time step. Imagine that you decide in which direction to move by calculating gradients, but when the footstep is going to be too big, you just adjust the size of footstep to the threshold size you have set. In a pseudo code, you can write a gradient clipping part only with some two line codes as below.

*\boldsymbol{g} is a gradient at the time step threshold is the maximum size of the “step.”

The figure below, cited from a deep learning text from MIT press textbook, is a good and straightforward explanation on gradient clipping.It is known that a strongly nonlinear function, such as error functions of RNN, can have very steep or plain areas. If you artificially visualize the idea in 3-dimensional space, as the surface of a loss function J with two variants w, b, that means the loss function J has plain areas and very steep cliffs like in the figure.Without gradient clipping, at the left side, you can see that the black dot all of a sudden climb the cliff and could jump to an unexpected area. But with gradient clipping, you avoid such “big jumps” on error functions.

Source: Source: Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning, (2016), MIT Press, 409p


I am glad that I have finally finished this article series. I am not sure how many of the readers would have read through all of the articles in this series, including my PowerPoint slides. I bet that is not so many. I spent a great deal of my time for making these contents, but sadly even when I was studying LSTM, it was becoming old-fashioned, at least in natural language processing (NLP) field: a very promising algorithm named Transformer has been replacing the position of LSTM. Deep learning is a very fast changing field. I also would like to make illustrative introductions on attention mechanism in NLP, from seq2seq model to Transformer. And I think LSTM would still remain as one of the algorithms in sequence data processing, such as hidden Hidden Markov model, or particle filter.


* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Hypothesis Test for real problems

Hypothesis tests are significant for evaluating answers to questions concerning samples of data.

A statistical hypothesis is a belief made about a population parameter. This belief may or might not be right. In other words, hypothesis testing is a proper technique utilized by scientist to support or reject statistical hypotheses. The foremost ideal approach to decide if a statistical hypothesis is correct is examine the whole population.

Since that’s frequently impractical, we normally take a random sample from the population and inspect the equivalent. Within the event sample data set isn’t steady with the statistical hypothesis, the hypothesis is refused.

Types of hypothesis:

There are two sort of hypothesis and both the Null Hypothesis (Ho) and Alternative Hypothesis (Ha) must be totally mutually exclusive events.

• Null hypothesis is usually the hypothesis that the event wont’t happen.

• Alternative hypothesis is a hypothesis that the event will happen.

Why we need Hypothesis Testing?

Suppose a specific cosmetic producing company needs to launch a new Shampoo in the market. For this situation they will follow Hypothesis Testing all together decide the success of new product in the market.

Where likelihood of product being ineffective in market is undertaken as Null Hypothesis and likelihood of product being profitable is undertaken as Alternative Hypothesis. By following the process of Hypothesis testing they will foresee the accomplishment.

How to Calculate Hypothesis Testing?

  • State the two theories with the goal that just one can be correct, to such an extent that the two occasions are totally unrelated.
  • Now figure a study plan, that will lay out how the data will be assessed.
  • Now complete the plan and genuinely investigate the sample dataset.
  • Finally examine the outcome and either accept or reject the null hypothesis.

Another example

Assume, Person have gone after a typing job and he has expressed in the resume that his composing speed is 70 words per minute. The recruiter might need to test his case. On the off chance that he sees his case as adequate, he will enlist him in any case reject him. Thus, he types an example letter and found that his speed is 63 words a minute. Presently, he can settle on whether to employ him or not.  In the event that he meets all other qualification measures. This procedure delineates Hypothesis Testing in layman’s terms.

In statistical terms Hypothesis his typing speed is 70 words per minute is a hypothesis to be tested so-called null hypothesis. Clearly, the alternating hypothesis his composing speed isn’t 70 words per minute.

So, normal composing speed is population parameter and sample composing speed is sample statistics.

The conditions of accepting or rejecting his case is to be chosen by the selection representative. For instance, he may conclude that an error of 6 words is alright to him so he would acknowledge his claim between 64 to 76 words per minute. All things considered, sample speed 63 words per minute will close to reject his case. Furthermore, the choice will be he was producing a fake claim.

In any case, if the selection representative stretches out his acceptance region to positive/negative 7 words that is 63 to 77 words, he would be tolerating his case.

In this way, to finish up, Hypothesis Testing is a procedure to test claims about the population dependent on sample. It is a fascinating reasonable subject with a quite statistical jargon. You have to dive more to get familiar with the details.

Significance Level and Rejection Region for Hypothesis

Type I error probability is normally indicated by α and generally set to 0.05.  The value of α is recognized as the significance level.

The rejection region is the set of sample data that prompts the rejection of the null hypothesis.  The significance level, α, decides the size of the rejection region.  Sample results in the rejection region are labelled statistically significant at level of α .

The impact of differing α is that If α is small, for example, 0.01, the likelihood of a type I error is little, and a ton of sample evidence for the alternative hypothesis is needed before the null hypothesis can be dismissed. Though, when α is bigger, for example, 0.10, the rejection region is bigger, and it is simpler to dismiss the null hypothesis.

Significance from p-values

A subsequent methodology is to evade the utilization of a significance level and rather just report how significant the sample evidence is. This methodology is as of now more widespread.  It is accomplished by method of a p value. P value is gauge of power of the evidence against null hypothesis. It is the likelihood of getting the observed value of test statistic, or value with significantly more prominent proof against null hypothesis (Ho), if the null hypothesis of an investigation question is true. The less significant the p value, the more proof there is supportive of the alternative hypothesis. Sample evidence is measurably noteworthy at the α level just if the p value is less than α. They have an association for two tail tests. When utilizing a confidence interval to playout a two-tailed hypothesis test, reject the null hypothesis if and just if the hypothesized value doesn’t lie inside a confidence interval for the parameter.

Hypothesis Tests and Confidence Intervals

Hypothesis tests and confidence intervals are cut out of the same cloth. An event whose 95% confidence interval reject the hypothesis is an event for which p<0.05 under the relating hypothesis test, and the other way around. A p value is letting you know the greatest confidence interval that despite everything prohibits the hypothesis. As such, if p<0.03 against the null hypothesis, that implies that a 97% confidence interval does exclude the null hypothesis.

Hypothesis Tests for a Population Mean

We do a t test on the ground that the population mean is unknown. The general purpose is to contrast sample mean with some hypothetical population mean, to assess whether the watched the truth is such a great amount of unique in relation to the hypothesis that we can say with assurance that the hypothetical population mean isn’t, indeed, the real population mean.

Hypothesis Tests for a Population Proportion

At the point when you have two unique populations Z test facilitates you to choose if the proportion of certain features is the equivalent or not in the two populations. For instance, if the male proportion is equivalent between two nations.

Hypothesis Test for Equal Population Variances

F Test depends on F distribution and is utilized to think about the variance of the two impartial samples. This is additionally utilized with regards to investigation of variance for making a decision about the significance of more than two sample.

T test and F test are totally two unique things. T test is utilized to evaluate the population parameter, for example, population mean, and is likewise utilized for hypothesis testing for population mean. However, it must be utilized when we don’t know about population standard deviation. On the off chance that we know the population standard deviation, we will utilize Z test. We can likewise utilize T statistic to approximate population mean. T statistic is likewise utilised for discovering the distinction in two population mean with the assistance of sample means.

Z statistic or T statistic is utilized to assess population parameters such as population mean and population proportion. It is likewise used for testing hypothesis for population mean and population proportion. In contrast to Z statistic or T statistic, where we manage mean and proportion, Chi Square or F test is utilized for seeing if there is any variance inside the samples. F test is the proportion of fluctuation of two samples.


Hypothesis encourages us to make coherent determinations, the connection among variables, and gives the course to additionally investigate. Hypothesis for the most part results from speculation concerning studied behaviour, natural phenomenon, or proven theory. An honest hypothesis ought to be clear, detailed, and reliable with the data. In the wake of building up the hypothesis, the following stage is validating or testing the hypothesis. Testing of hypothesis includes the process that empowers to concur or differ with the expressed hypothesis.

Simple RNN

Understanding LSTM forward propagation in two ways

*This article is only for the sake of understanding the equations in the second page of the paper named “LSTM: A Search Space Odyssey”. If you have no trouble understanding the equations of LSTM forward propagation, I recommend you to skip this article and go the the next article.

*This article is the fourth article of “A gentle introduction to the tiresome part of understanding RNN.”

1. Preface

I  heard that in Western culture, smart people write textbooks so that other normal people can understand difficult stuff, and that is why textbooks in Western countries tend to be bulky, but also they are not so difficult as they look. On the other hand in Asian culture, smart people write puzzling texts on esoteric topics, and normal people have to struggle to understand what noble people wanted to say. Publishers also require the authors to keep the texts as short as possible, so even though the textbooks are thin, usually students have to repeat reading the textbooks several times because usually they are too abstract.

Both styles have cons and pros, and usually I prefer Japanese textbooks because they are concise, and sometimes it is annoying to read Western style long texts with concrete straightforward examples to reach one conclusion. But a problem is that when it comes to explaining LSTM, almost all the text books are like Asian style ones. Every study material seems to skip the proper steps necessary for “normal people” to understand its algorithms. But after actually making concrete slides on mathematics on LSTM, I understood why: if you write down all the equations on LSTM forward/back propagation, that is going to be massive, and actually I had to make 100-page PowerPoint animated slides to make it understandable to people like me.

I already had a feeling that “Does it help to understand only LSTM with this precision? I should do more practical codings.” For example François Chollet, the developer of Keras, in his book, said as below.


For me that sounds like “We have already implemented RNNs for you, so just shut up and use Tensorflow/Keras.” Indeed, I have never cared about the architecture of my Mac Book Air, but I just use it every day, so I think he is to the point. To make matters worse, for me, a promising algorithm called Transformer seems to be replacing the position of LSTM in natural language processing. But in this article series and in my PowerPoint slides, I tried to explain as much as possible, contrary to his advice.

But I think, or rather hope,  it is still meaningful to understand this 23-year-old algorithm, which is as old as me. I think LSTM did build a generation of algorithms for sequence data, and actually Sepp Hochreiter, the inventor of LSTM, has received Neural Network Pioneer Award 2021 for his work.

I hope those who study sequence data processing in the future would come to this article series, and study basics of RNN just as I also study classical machine learning algorithms.

 *In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

2. Why LSTM?

First of all, let’s take a brief look at what I said about the structures of RNNs,  in the first and the second article. A simple RNN is basically densely connected network with a few layers. But the RNN gets an input every time step, and it gives out an output at the time step. Part of information in the middle layer are succeeded to the next time step, and in the next time step, the RNN also gets an input and gives out an output. Therefore, virtually a simple RNN behaves almost the same way as densely connected layers with many layers during forward/back propagation if you focus on its recurrent connections.

That is why simple RNNs suffer from vanishing/exploding gradient problems, where the information exponentially vanishes or explodes when its gradients are multiplied many times through many layers during back propagation. To be exact, I think you need to consider this problem precisely like you can see in this paper. But for now, please at least keep it in mind that when you calculate a gradient of an error function with respect to parameters of simple neural networks, you have to multiply parameters many times like below, and this type of calculation usually leads to vanishing/exploding gradient problem.

LSTM was invented as a way to tackle such problems as I mentioned in the last article.

3. How to display LSTM

I would like you to just go to image search on Google, Bing, or Yahoo!, and type in “LSTM.” I think you will find many figures, but basically LSTM charts are roughly classified into two types: in this article I call them “Space Odyssey type” and “electronic circuit type”, and in conclusion, I highly recommend you to understand LSTM as the “electronic circuit type.”

*I just randomly came up with the terms “Space Odyssey type” and “electronic circuit type” because the former one is used in the paper I mentioned, and the latter one looks like an electronic circuit to me. You do not have to take how I call them seriously.

However, not that all the well-made explanations on LSTM use the “electronic circuit type,” and I am sure you sometimes have to understand LSTM as the “space odyssey type.” And the paper “LSTM: A Search Space Odyssey,” which I learned a lot about LSTM from,  also adopts the “Space Odyssey type.”

LSTM architectur visualization

The main reason why I recommend the “electronic circuit type” is that its behaviors look closer to that of simple RNNs, which you would have seen if you read my former articles.

*Behaviors of both of them look different, but of course they are doing the same things.

If you have some understanding on DCL, I think it was not so hard to understand how simple RNNs work because simple RNNs  are mainly composed of linear connections of neurons and weights, whose structures are the same almost everywhere. And basically they had only straightforward linear connections as you can see below.

But from now on, I would like you to give up the ideas that LSTM is composed of connections of neurons like the head image of this article series. If you do that, I think that would be chaotic and I do not want to make a figure of it on Power Point. In short, sooner or later you have to understand equations of LSTM.

4. Forward propagation of LSTM in “electronic circuit type”

*For further understanding of mathematics of LSTM forward/back propagation, I recommend you to download my slides.

The behaviors of an LSTM block is quite similar to that of a simple RNN block: an RNN block gets an input every time step and gets information from the RNN block of the last time step, via recurrent connections. And the block succeeds information to the next block.

Let’s look at the simplified architecture of  an LSTM block. First of all, you should keep it in mind that LSTM have two streams of information: the one going through all the gates, and the one going through cell connections, the “highway” of LSTM block. For simplicity, we will see the architecture of an LSTM block without peephole connections, the lines in blue. The flow of information through cell connections is relatively uninterrupted. This helps LSTMs to retain information for a long time.

In a LSTM block, the input and the output of the former time step separately go through sections named “gates”: input gate, forget gate, output gate, and block input. The outputs of the forget gate, the input gate, and the block input join the highway of cell connections to renew the value of the cell.

*The small two dots on the cell connections are the “on-ramp” of cell conection highway.

*You would see the terms “input gate,” “forget gate,” “output gate” almost everywhere, but how to call the “block gate” depends on textbooks.

Let’s look at the structure of an LSTM block a bit more concretely. An LSTM block at the time step (t) gets \boldsymbol{y}^{(t-1)}, the output at the last time step,  and \boldsymbol{c}^{(t-1)}, the information of the cell at the time step (t-1), via recurrent connections. The block at time step (t) gets the input \boldsymbol{x}^{(t)}, and it separately goes through each gate, together with \boldsymbol{y}^{(t-1)}. After some calculations and activation, each gate gives out an output. The outputs of the forget gate, the input gate, the block input, and the output gate are respectively \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)}. The outputs of the gates are mixed with \boldsymbol{c}^{(t-1)} and the LSTM block gives out an output \boldsymbol{y}^{(t)}, and gives \boldsymbol{y}^{(t)} and \boldsymbol{c}^{(t)} to the next LSTM block via recurrent connections.

You calculate \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} as below.

  • \boldsymbol{f}^{(t)}= \sigma(\boldsymbol{W}_{for} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \boldsymbol{y}^{(t-1)} +  \boldsymbol{b}_{for})
  • \boldsymbol{i}^{(t)}=\sigma(\boldsymbol{W}_{in} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{in})
  • \boldsymbol{z}^{(t)}=tanh(\boldsymbol{W}_z \boldsymbol{x}^{(t)} + \boldsymbol{R}_z \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_z)
  • \boldsymbol{o}^{(t)}=\sigma(\boldsymbol{W}_{out} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{out})

*You have to keep it in mind that the equations above do not include peephole connections, which I am going to show with blue lines in the end.

The equations above are quite straightforward if you understand forward propagation of simple neural networks. You add linear products of \boldsymbol{y}^{(t)} and \boldsymbol{c}^{(t)} with different weights in each gate. What makes LSTMs different from simple RNNs is how to mix the outputs of the gates with the cell connections. In order to explain that, I need to introduce a mathematical operator called Hadamard product, which you denote as \odot. This is a very simple operator. This operator produces an elementwise product of two vectors or matrices with identical shape.

With this Hadamar product operator, the renewed cell and the output are calculated as below.

  • \boldsymbol{c}^{(t)} = \boldsymbol{z}^{(t)}\odot \boldsymbol{i}^{(t)} + \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}
  • \boldsymbol{y}^{(t)} = \boldsymbol{o}^{(t)} \odot tanh(\boldsymbol{c}^{(t)})

The values of \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} are compressed into the range of [0, 1] or [-1, 1] with activation functions. You can see that the input gate and the block input give new information to the cell. The part \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)} means that the output of the forget gate “forgets” the cell of the last time step by multiplying the values from 0 to 1 elementwise. And the cell \boldsymbol{c}^{(t)} is activated with tanh() and the output of the output gate “suppress” the activated value of \boldsymbol{c}^{(t)}. In other words, the output gatedecides how much information to give out as an output of the LSTM block. The output of every gate depends on the input \boldsymbol{x}^{(t)}, and the recurrent connection \boldsymbol{y}^{(t-1)}. That means an LSTM block learns to forget the cell of the last time step, to renew the cell, and to suppress the output. To describe in an extreme manner, if all the outputs of every gate are always (1, 1, …1)^T, LSTMs forget nothing, retain information of inputs at every time step, and gives out everything. And  if all the outputs of every gate are always (0, 0, …0)^T, LSTMs forget everything, receive no inputs, and give out nothing.

This model has one problem: the outputs of each gate do not directly depend on the information in the cell. To solve this problem, some LSTM models introduce some flows of information from the cell to each gate, which are shown as lines in blue in the figure below.

LSTM inner architecture

LSTM models, for example the one with or without peephole connection, depend on the library you use, and the model I have showed is one of standard LSTM structure. However no matter how complicated structure of an LSTM block looks, you usually cover it with a black box as below and show its behavior in a very simplified way.

5. Space Odyssey type

I personally think there is no advantages of understanding how LSTMs work with this Space Odyssey type chart, but in several cases you would have to use this type of chart. So I will briefly explain how to look at that type of chart, based on understandings of LSTMs you have gained through this article.

In Space Odyssey type of LSTM chart, at the center is a cell. Electronic circuit type of chart, which shows the flow of information of the cell as an uninterrupted “highway” in an LSTM block. On the other hand, in a Spacey Odyssey type of chart, the information of the cell rotate at the center. And each gate gets the information of the cell through peephole connections,  \boldsymbol{x}^{(t)}, the input at the time step (t) , sand \boldsymbol{y}^{(t-1)}, the output at the last time step (t-1), which came through recurrent connections. In Space Odyssey type of chart, you can more clearly see that the information of the cell go to each gate through the peephole connections in blue. Each gate calculates its output.

Just as the charts you have seen, the dotted line denote the information from the past. First, the information of the cell at the time step (t-1) goes to the forget gate and get mixed with the output of the forget cell In this process the cell is partly “forgotten.” Next, the input gate and the block input are mixed to generate part of new value of the the cell at time step  (t). And the partly “forgotten” \boldsymbol{c}^{(t-1)} goes back to the center of the block and it is mixed with the output of the input gate and the block input. That is how \boldsymbol{c}^{(t)} is renewed. And the value of new cell flow to the top of the chart, being mixed with the output of the output gate. Or you can also say the information of new cell is “suppressed” with the output gate.

I have finished the first four articles of this article series, and finally I am gong to write about back propagation of LSTM in the next article. I have to say what I have written so far is all for the next article, and my long long Power Point slides.


* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.


[1] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber, “LSTM: A Search Space Odyssey,” (2017)

[2] Francois Chollet, Deep Learning with Python,(2018), Manning , pp. 202-204

[3] “Sepp Hochreiter receives IEEE CIS Neural Networks Pioneer Award 2021”, Institute of advanced research in artificial intelligence, (2020)

[4] Oketani Takayuki, “Machine Learning Professional Series: Deep Learning,” (2015), pp. 120-125
岡谷貴之 著, 「機械学習プロフェッショナルシリーズ 深層学習」, (2015), pp. 120-125

[5] Harada Tatsuya, “Machine Learning Professional Series: Image Recognition,” (2017), pp. 252-257
原田達也 著, 「機械学習プロフェッショナルシリーズ 画像認識」, (2017), pp. 252-257

[6] “Understandable LSTM ~ With the Current Trends,” Qiita, (2015)
「わかるLSTM ~ 最近の動向と共に」, Qiita, (2015)