Understanding the “simplicity” of reinforcement learning: comprehensive tips to take the trouble out of RL

This is the first article of my article series “My elaborate study notes on reinforcement learning.”

*I adjusted the mathematical notation in this article to be as close as possible to that of “Reinforcement Learning: An Introduction.” This book by Sutton and Barto is said to be almost mandatory for anyone studying reinforcement learning. I also tried to avoid mathematical notation as much as possible, introducing some intuitive examples instead. In case any descriptions are confusing or unclear, I would appreciate it if you informed me via posts or email.

Preface

First of all, I have to emphasize that I am new to reinforcement learning (RL); my current field is object detection, or more concretely, transfer learning in object detection. Thus this article series is itself a kind of study note for me. Reinforcement learning (RL) is often briefly compared with human trial and error, and RL actually draws on neuroscience and psychology as well as neural networks (I am not sure about these fields, though). The word “reinforcement” roughly means associating rewards with certain actions. Some experiments related to RL were conducted on animals, widely known as the Skinner box or, more classically, Pavlov’s dogs. In short, you can encourage animals to do something by giving them food as a reward, just as many people might have done with their dogs. Before animals find linkages between certain actions and the food rewarded for those actions, they just keep going through trial and error. We can think of RL as a family of algorithms which mimics this behavior of animals trying to obtain as much reward as possible.

*My cats would never go out of their way to entertain me just to get food, though.

RL has shown conspicuous success in the field of video games, such as Atari, and in defeating the world champion of Go, one of the most complicated board games. RL can actually be applied not only to video games or board games, but also to various other fields, such as business intelligence, medicine, and finance, but I am still very much fascinated by its application to video games. I am now studying a field which could bridge the world of video games and the real world. I would like to mention this in one of the upcoming articles.

So far, my impression is that learning RL ideas is more challenging than learning classical machine learning or deep learning, for the following reasons.

  1. RL is a field of how to train models, rather than how to design the models themselves. That means you have to consider a variety of problem settings, and you can easily forget which situation you are discussing.
  2. You need prerequisite knowledge about the models used as components of RL, for example neural networks, which are usually the main topics in machine/deep learning textbooks.
  3. It is confusing what exactly can be learned through RL, as it depends on the type of task.
  4. Even after looking over formulations of RL, it is still hard to imagine how RL enables computers to carry out trial and error.

*For now I would like you to keep in mind that values and policies are basically what is calculated during RL.

And I personally believe you should always keep the following points in mind in order not to get lost in the process of learning RL.

  1. RL can basically be applied only to a very limited type of situation, which is called a Markov decision process (MDP). In MDP settings your next state depends only on your current state and action, regardless of what you have done so far.
  2. You are ultimately interested in learning decision-making rules in MDPs, which are called policies.
  3. In the first stage of learning RL, you consider surprisingly simple situations. They might be as simple as mazes in kids’ picture books.
  4. RL is in its early days of development.

Let me explain a bit more about what I meant by the third point above. I have been learning RL mainly with a very precise Japanese textbook named 「機械学習プロフェッショナルシリーズ 強化学習」(Machine Learning Professional Series: Reinforcement Learning). As I mentioned in an article of my series on RNNs, I sometimes dislike Western textbooks because they tend to beat around the bush with simple examples before getting to the point at a more abstract level. That is why I prefer reading books of this series in Japanese. And the RL one in the series was especially bulky, abstract, and overbearing to a spectacular degree. It had so many precise mathematical notations, leaving no room for ambiguity, that it took me a long time to notice that the book was merely discussing simple situations like mazes in kids’ picture books. I mean, the settings discussed were so simple that they can be expressed as tabular data, that is, as some Excel sheets.

*I could not notice that until the beginning of the 6th of the 8 chapters. The 6th chapter discusses the use of function approximators, with which you can approximate tabular data. My articles will not dig into this topic of approximation precisely, but the use of deep learning models, which I am going to explain someday, is a type of this approximation of RL models.

You might find that so many explanations of RL rely on examples of how to make computers navigate themselves in simple mazes or play video games, which are mostly impractical in the real world. However, as I will explain later, these are actually helpful examples for learning RL. As I show later, the relations between an agent and an environment are basically the same even in more complicated tasks. Reading some code or actually implementing RL would be very effective, especially in order to appreciate the simplicity of the situations in the beginning parts of RL textbooks.

Given that you can do a lot of impressive and practical stuff with current deep learning libraries, you might get bored or disappointed by the simple applications of RL in many textbooks. But as I mentioned above, RL is in its early days of development, at least at a public level. And in order to show its potential power, I am going to explain one of the most successful and complicated applications of RL in the next article: I am planning to explain how AlphaGo and AlphaZero, RL-based AIs, enabled computers to defeat the world champion of Go, one of the most complicated board games.

*RL was not used in the chess AI which defeated Kasparov in 1997. A combination of tree search and supercomputers, without RL, was enough for the “simplicity” of chess. But the use of a tree search technique named Monte Carlo tree search enabled AlphaGo to read some moves ahead more effectively. It is said deep learning gave AlphaGo intuition about games, Monte Carlo tree search gave it the ability to look some moves ahead, and RL gave it the ability to learn from experience.

1, What is RL?

In conclusion, as far as I can understand so far as a beginner of RL, I would interpret RL as follows: RL is a sub-field of training AI models, in which optimal rules for decision making in an environment are learned, weakly supervised by rewards over a certain period of time. When and how to evaluate decisions is task-specific, and evaluation is often realized by trial-and-error-like behaviors of agents. Rules for decision making are called policies in the context of RL. And optimization problems over policies are called sequential decision-making problems.

You are more or less going to see what I meant by this definition throughout my article series.

*An agent in RL means an entity which makes decisions, interacting with the environment through actions. And the actions are made based on policies.

You can find various types of charts explaining the relation of RL to AI, and I personally found the chart below the most plausible.

“Models” in the chart are often hyped as “AI” in the media today. But AI is a more comprehensive field, trying to realize human-like intellectual behaviors with computers. And machine learning has been the most central sub-field of AI in the last decades. Around 2006 there was a breakthrough in deep learning, due to which machine learning gained much better performance with deep learning models. I would say people have been calling the popular “models” of each era “AI.” And importantly, RL is a field of training models, alongside supervised learning and unsupervised learning, rather than a field of designing “AI” models. Some people say supervised or unsupervised learning is preferable to RL because these kinds of training are currently more likely to be successful in a wide range of fields than RL. And usually, the more data you have, the more likely supervised or unsupervised learning is to succeed.

*The word “model” is used in another meaning later. Please keep in mind that the “models” above are something like general functions, whereas the “models” which show up frequently later are functions modeling environments in RL.

*In case you’re totally new to AI and don’t understand what “supervising” means in these contexts, I think you should imagine instructing students in schools. If a teacher tells students “We have a Latin conjugation test next week, so you must check this section in the textbook,” that’s “supervised learning.” The students who take the exam are the “models.” Apt students, like good machine learning models, would show excellent performance, but they might fail to apply the knowledge elsewhere. I mean, they might fail to properly conjugate words in unseen sentences. Next, if the students share the idea “It’s comfortable to get together with people alike,” they might cluster into several groups. That might lead to a division into “cool guys” and “not cool guys.” This is done without any explicit answers, and this corresponds to “unsupervised learning.” In this case, I would say certain functions of the students’ brains, or the atmosphere there, which put similar students together, were the “models.” And finally, if teachers tell the students “Be a good student,” that’s what I meant by “weak supervision.” Most people would respond “How?” RL could correspond to such ultimate goals of education, and as in education, you have to consider how to give rewards and how to evaluate students/agents. And the “models” can vary. But such rewards often lead to unexpected results.

2, RL and Markov decision process

As I mentioned in a former section, you have to keep in mind that RL can basically be applied only to a limited class of sequential decision-making problems, namely Markov decision processes (MDPs). A Markov decision process is a type of process where the next state of an agent depends only on the current state and the action taken in the current state. I will only roughly explain MDPs in this article, with a little formulation.

You might find MDPs very simple. But some people would find that their daily lives can in fact be described well by an MDP. The figure below is a state transition diagram of an everyday routine at an office, and this is nothing but an MDP. I think many workers basically have only the four states “Chat,” “Coffee,” “Computer,” and “Home” almost every day. Numbers in black are the probabilities of transitions at each state, and each corresponding number in orange is the reward you get when the action is taken. The diagram shows that when you just keep using a computer, you are likely to get high rewards. On the other hand, chatting with your colleagues would just continue into another round of chatting with a probability of 50%, and that undermines productivity, giving a reward of -1. And having some coffee is very likely to lead to a chat. In practice, you optimize which action to take in each situation. You adjust the probabilities at each state, that is, you adjust a policy, through planning or trial and error.

Source: https://subscription.packtpub.com/book/data/9781788834247/1/ch01lvl1sec12/markov-decision-processes
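A state transition diagram like this can be written down directly as a table in code. Below is a rough sketch in Python; the state names match the diagram, but the specific probabilities and rewards are illustrative guesses of mine, not the figure's actual numbers:

```python
import random

# Hypothetical transition table for an office MDP like the one above.
# Each entry maps a state to a list of (next_state, probability, reward)
# triples. The numbers here are made up for illustration.
transitions = {
    "Coffee":   [("Chat", 0.7, -1), ("Computer", 0.2, 5), ("Coffee", 0.1, 1)],
    "Chat":     [("Chat", 0.5, -1), ("Coffee", 0.2, 1), ("Computer", 0.3, 5)],
    "Computer": [("Computer", 0.8, 5), ("Coffee", 0.1, 1), ("Home", 0.1, 0)],
    "Home":     [("Home", 1.0, 0)],  # absorbing state: the day is over
}

def step(state):
    """Sample the next state and reward given the current state."""
    options = transitions[state]
    probs = [p for _, p, _ in options]
    next_state, _, reward = random.choices(options, weights=probs, k=1)[0]
    return next_state, reward

random.seed(0)
state, total_reward = "Coffee", 0
for _ in range(20):  # simulate 20 time steps of the office routine
    state, reward = step(state)
    total_reward += reward
```

The Markov property is visible in `step`: the next state and reward depend only on the current `state`, not on the history of earlier states.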

*Even if you say “Be a good student,” school kids in puberty would act far from a Markov decision process. Even though I took the example of school earlier, I am sure education is a much more complicated process which requires constant patience.

Of course, you have to consider much more complicated MDPs in most RL problems, and in most cases you do not have known models like state transition diagrams. Or rather, I should say RL enables you to estimate such diagrams, which are usually called models in the context of RL, through trial and error. When you study RL, for the most part you will see a chart like the one below. I think it is important to understand what this kind of chart means, whatever study materials on RL you consult. I said RL is basically a training method for finding optimal decision-making rules called policies. And in RL settings, agents estimate such policies by taking actions in the environment. The environment determines a reward and the next state based on the current state and the current action of the agent.

Let’s take a closer look at the chart above in a slightly mathematical manner. I made it based on “Machine Learning Professional Series: Reinforcement Learning.” The agent exerts an action a on the environment, and the agent receives a reward r and the next state s'. r and s' are consequences of taking the action a in the state s. The action a is taken based on a conditional probability given s, denoted as \pi(a|s). This probability function \pi(a|s) is the very function representing policies, which we want to optimize in RL.
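As a minimal sketch of the notation a \sim \pi(a|s): with a tabular policy, sampling an action amounts to drawing from the distribution stored for the current state. The states, actions, and probabilities below are made up purely for illustration:

```python
import random

# A hypothetical tabular policy pi(a|s): for each state, a distribution
# over actions. The numbers are arbitrary, just for illustration.
policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.6, "right": 0.4},
}

def sample_action(state):
    """Draw an action a with probability pi(a|state)."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return random.choices(actions, weights=probs, k=1)[0]

random.seed(42)
a = sample_action("s0")  # in state "s0", "right" is drawn 90% of the time
```

Optimizing the policy then means adjusting these probabilities so that the sampled actions lead to higher returns.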

*Please do not think too much about the differences between \sim and = in the chart. Actions, rewards, and transitions of states can each be either deterministic or probabilistic. In the chart above, with the notation a \sim \pi (a|s), I meant that the action a is taken with a probability of \pi (a|s). Whether they are probabilistic or deterministic is task-specific. Also, you should keep in mind that all the values in the chart are realized values of random variables, as I show in the chart on the right side.

In the textbook “Reinforcement Learning: An Introduction” by Sutton and Barto, which is almost mandatory for all RL learners, the RL process is displayed as on the left side of the figure below. Each capital letter in the chart denotes a random variable. The relations of the random variables can also be displayed as a graphical model, like the right side of the chart. The graphical model is a time-series expansion of the chart of RL loops on the left side. The chart below shows almost the same idea as the one above; whether they use random variables or realized values is the only difference between them. My point is that decision making is simplified in RL into the models I have explained. Even if some situations are not strictly MDPs, in many cases the problems are approximated as MDPs in practice so that RL can be applied.

*I personally think you do not have to care so much about the differences between random variables and their realized values in RL unless you discuss RL mathematically. But if you do not know there are two types of notations, which are strictly different ideas, you might get confused while reading textbooks on RL. At least in my article series, I will strictly distinguish them only when their differences matter.

*In case you are not sure about the difference between random variables and their realizations, please roughly grasp the terms as follows: a random variable X is a probabilistic tool, for example a die. On the other hand, its realized values x are records of it, for example (4, 1, 6, 6, 2, 1, …). The probability that a random variable X takes on the value x is denoted as \Pr\{X = x\}. And X \sim p means the random variable X is drawn from the distribution p(x) \doteq \Pr \{X=x\}. In case X is a “die,” p(x) = \frac{1}{6} for any x.
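This distinction fits in a few lines of code: the random variable corresponds to the sampling procedure itself, while the realized values are the recorded outcomes. A small sketch, nothing RL-specific:

```python
import random
from collections import Counter

random.seed(1)
# X ~ p, where p(x) = 1/6 for x in {1, ..., 6}: the random variable is
# the sampling procedure itself.
rolls = [random.randint(1, 6) for _ in range(60000)]  # realized values x

freqs = Counter(rolls)
# Each empirical frequency should be close to p(x) = 1/6 ≈ 0.167.
for face in range(1, 7):
    assert abs(freqs[face] / len(rolls) - 1 / 6) < 0.02
```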

3, Planning and RL

We have seen that RL is a family of training algorithms which optimize the rules for choosing A_t = a in sequential decision-making problems, usually assuming them to be MDPs. However, I have to emphasize that RL is not the only way to optimize such policies. In sequential decision-making problems, when the model of the environment is known, policies can also be optimized through planning, without collecting data from the environment. On the other hand, when the model of the environment is unknown, policies have to be optimized based on data which an agent collects from the environment through trial and error. This is the very case called RL. You might find planning problems very simple and unrealistic in practical cases. But RL is based on planning for sequential decision-making problems in MDP settings, so studying planning problems is inevitable. As far as I can see so far, RL is a family of algorithms for approximating the techniques of planning problems through trial and error in environments. To be more concrete, in the next article I am going to explain dynamic programming (DP) in RL contexts as a major example of planning, where a formula called the Bellman equation plays a crucial role. And after that we are going to see that RL algorithms are more or less approximations of the Bellman equation by agents sampling data from environments.

 

As an intuitive example, I would like to take the case of navigating a robot, which is explained in a famous textbook on robotics named “Probabilistic Robotics.” In this case, the state set \mathcal{S} is the whole space on the map where the robot can move around. And the action set is \mathcal{A} = \{\rightarrow, \searrow, \downarrow, \swarrow, \leftarrow, \nwarrow, \uparrow, \nearrow \}. If the robot does not fail to take any actions and there are no unexpected obstacles, manipulating the robot on the map is an MDP. In this example, the robot has to be navigated from the start point, the green dot, to the goal, the red dot. In this case, the blue arrows can be obtained through planning or RL. Each blue arrow denotes the action taken at each place, following the estimated policy. In other words, the function \pi is the flow of the blue arrows. But policies can vary even in the same problem. If you just want the robot to reach the goal as soon as possible, you might get the blue arrows in the figure at the top after planning. But that means the robot has to pass through a narrow street, and it is likely to bump into the walls. If you prefer to avoid such risks, you should adopt a policy of choosing wider streets, like the blue arrows in the figure at the bottom.

*In the textbook on probabilistic robotics, this case is classified as a planning problem rather than an RL problem because it assumes that the robot has a complete model of the environment, and RL is not introduced in the textbook. In the case of robotics, one major way of making a model, or rather a map, is SLAM (Simultaneous Localization and Mapping). With SLAM, a map of the environment can be built based only on what has been seen with a moving camera, like in the figure below. The first half of the textbook is about the self-localization of robots and acquiring maps of environments, and the latter part is about planning in the acquired map. RL is also based on planning problems, as I explained. I would say RL is another branch of techniques for acquiring such models/maps and proper plans in the environment through trial and error.

In the example of robotics above, we have not considered the rewards R_t in the course of navigating the agent. That means the reward is given only when the robot reaches the goal. But agents can get lost if they get a reward only at the goal. Thus in many cases you optimize a policy \pi(a|s) such that it maximizes the sum of rewards R_1 + R_2 + \cdots + R_T, where T is the length of the whole MDP sequence in this case. More concretely, at every time step t, agents have to estimate G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T. This G_t is called a return. But you usually have to consider the uncertainty of future rewards, so in practice you multiply the rewards by a discount rate \gamma \quad (0\leq \gamma \leq 1) at every time step. Thus in practice agents estimate the discounted return at every time step as follows.

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma ^2 R_{t+3} + \cdots + \gamma ^ {T-t-1} R_T = \sum_{k=0}^{T-t-1}{\gamma ^{k}R_{t+k+1}}
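The discounted return G_t above can be computed directly from a list of upcoming rewards; a small sketch:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_{k=0}^{T-t-1} gamma^k * R_{t+k+1},
    given the list of upcoming rewards [R_{t+1}, ..., R_T]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# With gamma = 0.5 and upcoming rewards (1, 1, 1):
# G_t = 1 + 0.5 * 1 + 0.25 * 1 = 1.75
g = discounted_return([1, 1, 1], gamma=0.5)  # -> 1.75
```

With gamma = 1 this reduces to the plain sum of rewards; smaller gamma makes the agent care less about rewards far in the future.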

If agents blindly try to maximize the immediate upcoming reward R_t in a greedy way, that can lead to a smaller amount of reward in the long run. Policies in RL have to be optimized so that they maximize the return G_t, the sum of upcoming rewards, at every time step. But still, it is not realistic to take all the upcoming rewards R_{t+1}, R_{t+2}, \dots directly into consideration. These rewards have to be calculated recursively and probabilistically at every time step. To be exact, the values of states are calculated this way. The value of a state in the context of RL means how likely agents are to get higher returns if they start from that state. And how to calculate values is formulated as the Bellman equation.

*If you are not sure what “recursively” and “probabilistically” mean, please do not think too much. I am going to explain that as precisely as possible in the next article.

I am going to explain the Bellman equation, or the Bellman operator to be exact, in the next article. For now I would like you to keep in mind that the Bellman operator calculates the value of a state by considering future actions and the states and rewards that follow. The Bellman equation is often displayed as a decision-tree-like chart, as below. I would say planning and RL are a matter of repeatedly applying the Bellman equation to the values of states. In planning problems, the model of the environment is known. That is, all the connections of the nodes of the graph on the left side of the figure below are known. On the other hand, in RL those connections are not completely known, thus they need to be estimated in certain ways by agents collecting data from the environment.
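As a rough sketch of one such backup, assuming a known tabular model: under a policy \pi, the value of a state is v(s) = \sum_a \pi(a|s) \sum_{s'} p(s'|s,a)[r + \gamma v(s')], and repeating this backup over all states converges to v_\pi. The tiny two-state model below is entirely made up for illustration:

```python
# Hypothetical known model: model[s][a] is a list of
# (next_state, probability, reward) triples.
model = {
    "s0": {"go": [("s1", 1.0, 0.0)]},
    "s1": {"go": [("s1", 0.5, 1.0), ("s0", 0.5, 0.0)]},
}
policy = {"s0": {"go": 1.0}, "s1": {"go": 1.0}}  # pi(a|s)
GAMMA = 0.9

def bellman_backup(s, values):
    """One Bellman expectation backup for state s under the policy:
    v(s) = sum_a pi(a|s) sum_{s'} p(s'|s,a) * (r + gamma * v(s'))."""
    return sum(
        pi_a * sum(p * (r + GAMMA * values[s2]) for s2, p, r in model[s][a])
        for a, pi_a in policy[s].items()
    )

values = {"s0": 0.0, "s1": 0.0}
for _ in range(200):  # repeated backups converge to v_pi
    values = {s: bellman_backup(s, values) for s in values}
```

In planning, the `model` dictionary is given; in RL, the agent only sees samples drawn from it, and the later algorithms approximate this backup from those samples.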

 

*I guess almost no one explains RL ideas with graphs like the ones above, and actually I am in search of effective and correct ways of visualizing RL. But so far, I think the graphs above describe how values are updated in RL problem settings with discrete data. You are going to see what these graphs mean little by little in upcoming articles. I am also planning to introduce Bellman operators to formulate RL so that you do not have to think about decision-tree-like graphs all the time.

4, Examples of how RL problems are modeled

You might find that so many explanations of RL rely on examples of how to make computers navigate themselves in simple mazes or play video games, which are mostly impractical in the real world. But I think the use of RL in letting computers play video games is a good example when you study RL. The video game industry is one of the most developed and sophisticated areas which have produced environments for RL. OpenAI provides some “playgrounds” where agents can actually move around, and there are also some ports of Atari games. I guess once you understand how RL can be modeled in those simulations, it helps you understand how other, more practical tasks are implemented.
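Those environments expose a small reset/step interface. Here is a self-contained toy environment in that style (this is my own sketch of the interaction loop, not the actual OpenAI library):

```python
import random

class ToyMaze:
    """A minimal environment with a Gym-style interface. States are
    positions 0..4 on a line; reaching position 4 gives reward 1 and
    ends the episode."""

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.pos = max(0, min(4, self.pos + action))
        done = self.pos == 4
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

random.seed(0)
env = ToyMaze()
obs, total = env.reset(), 0.0
for _ in range(50):  # a random agent interacting with the environment
    obs, reward, done = env.step(random.choice([-1, 1]))
    total += reward
    if done:
        break
```

Whatever the task, the loop is the same: observe a state, choose an action, receive a reward and the next state. Only the contents of `reset` and `step` change.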

*It is a pity that there is no E.T. the Extra-Terrestrial. It is a notorious video game which put an end to the reign of Atari. And after that came the era of the Nintendo Entertainment System.

In the second section of this article, I showed the most typical diagram of the fundamental RL idea. The diagrams below show how each element of some simple RL examples corresponds to the diagram of general RL. Multi-armed bandit problems are a family of the most straightforward RL tasks, and I am going to explain them a bit more precisely later in this article. An agent solving a maze is also a very common example of an RL task. In this case, the states s\in \mathcal{S} are the locations where an agent can move. Rewards r \in \mathcal{R} are the goals or bonuses the agent gets in the course of the maze. And in this case \mathcal{A} = \{\rightarrow, \downarrow, \leftarrow, \uparrow \}.

If the environment is more complicated, deep learning is needed to build the more complicated functions which model each component of RL. Such RL is called deep reinforcement learning. The examples below are some successful cases of deep RL. I think it is easy to imagine that the case of solving a maze is close to RL playing video games. In this case \mathcal{A} is all the possible commands of an Atari controller, like in the figure below. Deep Q-Networks use deep learning in an RL algorithm named Q-learning. The development of convolutional neural networks (CNNs) enabled computers to comprehend what is displayed on video game screens. Thanks to that, video games do not need to be simplified like mazes. Even though playing video games, especially today’s complicated ones, might not strictly be an MDP, Deep Q-Networks simplify the process of playing Atari into an MDP. That is why the process of playing video games can be simplified as in the chart below, and this simplified MDP model can surpass human performance. AlphaGo and AlphaZero are other successful cases of deep RL. AlphaGo is the first RL model which defeated the world Go champion, and some of its training schemes were simplified and extended to other board games like chess in AlphaZero. Even though they caused a sensation in the media, as if they were menaces to human intelligence, they are also based on MDPs. A policy network calculates which moves to take to enhance the probability of winning board games. But they use much more sophisticated and complicated techniques, and it is almost impossible to try training them unless you own a tech company or something with servers mounted with TPUs. But I am going to roughly explain how they work in one of my upcoming articles.

5, Some keywords for organizing terms of RL

As I am also going to explain in the next two articles, RL algorithms are totally different frameworks for training machine learning models compared with supervised/unsupervised learning. I think the pairs of keywords below are helpful for classifying the RL algorithms you are going to encounter.

(1) “Model-based” or “model-free.”

I said planning problems are the basis of RL problems, and in many cases RL algorithms approximate the Bellman equation or related ideas. I also said planning problems can be solved by repeatedly applying Bellman equations to the states of a model of an environment. But in RL problems, models are usually unknown, and agents can only move in an environment which gives a reward and the next state to an agent. The agent gains richer information about the environment time step by time step in RL, but this procedure can be roughly classified into two types: model-free and model-based. In the model-free type, models of the environment are not explicitly built, and policies are updated based on data collected from the environment. On the other hand, in the model-based type, models of the environment are estimated, and policies are calculated based on the model.

 

*To be honest, I am still not sure about the differences between model-free RL and model-based RL.

*AlphaGo and AlphaZero are examples of model-based RL. Positions of board games can be modeled with CNNs. Planning in this case corresponds to reading some moves ahead in games, and it is enabled by Monte Carlo tree search. They are the only examples of model-based RL which I can come up with. And also I had the impression that many study materials on RL focus on model-free types of RL.

(2) “Values” or “policies.”

I mentioned that in RL, values and policies are optimized. Values are functions giving the value of each state. The value here means how likely an agent is to get high rewards in the future, starting from that state. Policies are functions for calculating the actions to take in each state, which I showed as the blue arrows in the example of robotics above. In RL, these two functions are updated in turn, and they often reach the optimal functions when they converge. The figure below describes the idea well.

These are the essential components of RL, and there are too many variations of how to calculate them, for example the timing of updating them, or whether to update them probabilistically or deterministically. And whatever RL algorithm I talk about, how the values and policies are updated will be of the greatest interest. Only briefly mentioning them would be more and more confusing, so let me briefly take the example of dynamic programming (DP).

Let’s consider DP on the simple grid map which I showed in the preface. This is a planning problem, and the agents have a perfect model of the map, so they do not have to actually move around in it. Agents can move to any cells except for blocks, and they get a positive reward at treasure cells and a negative reward at danger cells. With policy iteration, the agents can iteratively update the policies and values of all the states of the map. The chart below shows how the policies and values of the cells are updated.

 

You do not necessarily have to calculate policies at every iteration, and this type of DP is called value iteration. But as the chart below suggests, value iteration takes more time to converge.
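As a rough sketch of value iteration (on a tiny hand-made example, not the grid map from the preface): on a 1-D corridor of 5 cells where the rightmost cell is a treasure, the update v(s) ← max_a [r + γ v(s')] is applied to every cell repeatedly, and the greedy policy is read off from the converged values:

```python
# Tiny deterministic corridor: cells 0..4, treasure at cell 4 (reward 1).
# Actions: move left (-1) or right (+1), clamped at the edges.
GAMMA = 0.9
N = 5
TERMINAL = N - 1

def step(s, a):  # deterministic model: returns (next_state, reward)
    s2 = max(0, min(N - 1, s + a))
    reward = 1.0 if s2 == TERMINAL else 0.0
    return s2, reward

values = [0.0] * N
for _ in range(50):  # value iteration: v(s) <- max_a [r + gamma * v(s')]
    values = [
        0.0 if s == TERMINAL else
        max(r + GAMMA * values[s2] for s2, r in (step(s, -1), step(s, 1)))
        for s in range(N)
    ]

# The greedy policy then points toward the treasure from every cell.
policy = [
    max((-1, 1), key=lambda a: step(s, a)[1] + GAMMA * values[step(s, a)[0]])
    for s in range(N - 1)
]
# values -> [0.729, 0.81, 0.9, 1.0, 0.0], policy -> [1, 1, 1, 1]
```

Note that the policy is computed only once, at the end, from the converged values; in policy iteration it would be recomputed at every iteration instead.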

 

I am going to explain the differences between values and policies in DP tasks much more precisely in the next article.

(3) “Exploration” or “exploitation”

RL agents are not explicitly supervised with the correct answer for each behavior. They just receive rough signals of “good” or “bad.” One of the most typical failure cases of RL is that agents become myopic. I mean, once agents find some actions which constantly give a good reward, they tend to miss other actions which produce better rewards more effectively. One good way of avoiding this is adding some exploration, that is, taking some risks to discover other actions.

I mentioned that multi-armed bandit problems are a simple setting of RL problems. They also help in understanding the trade-off between exploration and exploitation. In a multi-armed bandit problem, an agent chooses which slot machine to run at every time step. Each slot machine gives out coins, or rewards r, with a probability p. The number of trials is limited, so the agent has to find the machine which gives out coins most efficiently within the limited number of trials. In this problem, the key is the balance between trying to find other effective slot machines and just trying to get as many coins as possible from the machine which seems to be the best so far. This is the trade-off between “exploration” and “exploitation.” One simple way to implement this trade-off is the ɛ-greedy algorithm. It is quite simple: with a probability of \epsilon, agents just randomly choose actions other than the one currently thought to be the best.
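A minimal sketch of the ɛ-greedy algorithm on a made-up three-armed bandit (the payout probabilities are arbitrary, and the value update is a simple incremental average):

```python
import random

random.seed(0)
# Hypothetical bandit: three slot machines, each paying out reward 1
# with its own probability, unknown to the agent.
payout_probs = [0.2, 0.5, 0.8]
EPSILON = 0.1

q = [0.0] * len(payout_probs)  # estimated value of each machine
n = [0] * len(payout_probs)    # number of pulls of each machine

for _ in range(5000):
    if random.random() < EPSILON:                  # explore
        arm = random.randrange(len(q))
    else:                                          # exploit the current best
        arm = max(range(len(q)), key=lambda i: q[i])
    reward = 1.0 if random.random() < payout_probs[arm] else 0.0
    n[arm] += 1
    q[arm] += (reward - q[arm]) / n[arm]           # incremental average

best = max(range(len(q)), key=lambda i: q[i])
```

With ɛ = 0 the agent can lock onto whichever arm happened to pay out first; the occasional random pulls are what let it discover the better machine.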

*Casino owners are not so stupid. Just like insurance, I am sure it is designed so that you lose in the long run, and before your “exploration” is complete, you will be “exploited.”

Let’s take a look at a simple simulation of a multi-armed bandit problem. There are two “casinos,” I mean sets of slot machines. In casino A, all the slot machines give out the same reward 1, so agents only need to find the machine which is most likely to give out coins. But casino B is not that simple. In this casino, slot machines with smaller odds give higher rewards.

I prepared four types of “multi-armed bandits,” I mean octopus agents. Each of them has its own value of \epsilon, and the \epsilons reflect their “curiosity,” or maybe “how inconsistent they are.” The graphs below show the average reward over 1000 simulations. In each simulation, each agent can try the slot machines 250 times in total. In casino A, it seems the agent with the curiosity of \epsilon = 0.3 gets the best rewards in the short term. But in the long run, the more stable agent, whose \epsilon is 0.1, gets more rewards. On the other hand, in casino B, no one seems to achieve outstanding results.

*I will not concretely explain how the values of each slot machine are updated in this article. I think I am going to explain multi-armed bandit problems with Monte Carlo tree search in one of the upcoming articles, to explain the algorithm of AlphaGo/AlphaZero.

(4)”Achievement” or “estimation”

The last pair of keywords is “achievement” or “estimation,” and it might be better to see them as a comparison of “Monte Carlo” and “temporal-difference (TD)” methods. I said RL algorithms often approximate the Bellman equation based on data an agent has collected. Agents moving around in environments can be viewed as sampling data from the environment: data on states, actions, and rewards. At the same time, agents constantly estimate the value of each state, so they can correct their estimates using values calculated from the sampled data. This is how agents make use of their “experience” in RL. There are several variations of when to update the estimates, but roughly they are classified into Monte Carlo and temporal-difference (TD) methods. Monte Carlo updates are based on the achievements of an agent after a whole episode of actions, while TD is based on constantly updating the estimates at every time step. Which approach to take depends on the task, but my impression is that most major RL algorithms adopt TD, and it is also said that evaluating actions by TD has some analogies with how the brain is “reinforced.” Above all, according to the book by Sutton and Barto, “If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.” There is also an intermediate idea between Monte Carlo and TD, which can be formulated as an eligibility trace.
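To make the contrast concrete, here is a toy sketch of the two update rules. The three-state chain, the rewards, and the step sizes are my own invention for illustration, not taken from any particular task:

```python
# Toy episode: the agent passes s0 -> s1 -> s2 and receives a reward
# of 1 at the end (all numbers here are illustrative).
episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]  # (state, reward)
alpha, gamma = 0.1, 0.9

# Monte Carlo ("achievement"): wait until the episode ends, then move
# each state's value toward the actual return G observed from it.
V_mc = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
G = 0.0
for state, reward in reversed(episode):
    G = reward + gamma * G
    V_mc[state] += alpha * (G - V_mc[state])

# TD(0) ("estimation"): update at every time step toward the one-step
# target reward + gamma * V(next state), using the current estimates.
V_td = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
for t, (state, reward) in enumerate(episode):
    v_next = V_td[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
    V_td[state] += alpha * (reward + gamma * v_next - V_td[state])

print(V_mc, V_td)
```

After a single episode, Monte Carlo has already propagated the final reward back to s0, while TD(0) has so far only updated s2; TD needs more episodes to propagate the reward backwards, but it can update online at every step.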

 

 

 

In this article I have briefly covered all the topics I am planning to explain in this series. This article is the start of a long-term journey of studying RL for me as well. Any feedback on this series, as posts or emails, would be appreciated. The next article is going to be about dynamic programming, which is a major way of solving planning problems. In the context of RL, dynamic programming problems are solved by repeatedly applying the Bellman equation to the values of states of a model of an environment. Thus I think it is no exaggeration to say that dynamic programming is the backbone of RL algorithms.

Appendix

The code I used for the multi-armed bandit simulation is below. Just copy and paste it into a Jupyter Notebook.
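The original notebook is not reproduced on this page, so the following is my minimal re-creation of the ε-greedy bandit simulation. The payout probabilities and rewards are made-up stand-ins for the two casinos described above, and the function name is my own:

```python
import random

def run_bandit(probs, rewards, epsilon, n_steps=250, seed=0):
    """Hypothetical re-creation of the simulation: probs[i] is slot
    machine i's payout probability, rewards[i] its payout."""
    rng = random.Random(seed)
    n = len(probs)
    Q = [0.0] * n          # estimated value of each machine
    counts = [0] * n       # times each machine was pulled
    total = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:                # explore: random machine
            a = rng.randrange(n)
        else:                                     # exploit the current best
            a = max(range(n), key=lambda i: Q[i])
        r = rewards[a] if rng.random() < probs[a] else 0.0
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]            # incremental sample mean
        total += r
    return total

# "Casino A": same reward 1, different odds; "Casino B": small odds,
# high rewards. The exact numbers here are made up.
casino_a = ([0.2, 0.5, 0.8], [1.0, 1.0, 1.0])
casino_b = ([0.8, 0.4, 0.1], [1.0, 2.5, 10.0])
for eps in (0.0, 0.1, 0.3, 0.5):
    avg = sum(run_bandit(*casino_a, eps, seed=s) for s in range(100)) / 100
    print(f"epsilon={eps}: average total reward in casino A = {avg:.1f}")
```

Averaging over many seeds, as in the loop above, reproduces the qualitative behavior described in the text: larger ε explores faster early on, smaller ε exploits better in the long run.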

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Predictive Maintenance – Concept and Opportunities

(Machine) time is precious. This is especially true for manufacturing companies, where every downtime of a plant costs valuable production capacity. Machine downtimes cannot be avoided entirely, but with the right predictive maintenance concept (abbreviated as PdM in the rest of this text) they can be reduced and made easier to plan. With intelligent add-ons, such as IoT connectivity of machines in an Industry 4.0 environment, and integration into a well-planned predictive maintenance system, costs can be saved.

What is predictive maintenance?

PdM refers to features in the operation of a plant or machine that learn from historical data and, combined with current or even real-time data, make predictions about upcoming events. From these calculations, for example, upcoming maintenance work or emerging component failures can be derived. Spare parts no longer have to be stocked as a precaution, but can be ordered based on actual need.

Advantages of using predictive maintenance

A further plus is the good plannability of maintenance intervals. Operators as well as manufacturers can arrange their schedules around the predicted maintenance dates. As an operator, you can correlate machine availability with the capacities needed for production. As a manufacturer, you can order spare parts on time and in the actually required quantity, and you no longer have to keep a stock of every possible spare part.

Technical concept and considerations

For the implementation of a PdM system, whether you want to build it yourself or buy it as a product, a technical concept is the starting point. Here we outline the key points of these considerations, so that they can serve as a working document for developing the concept.

A PdM concept roughly consists of the following components:

Predictive maintenance concept

  • Machine: this is where the data that is to be fed into the PdM system is generated. The data should be captured from the machines as quickly and immediately as possible, so that the values can be uploaded to the cloud and made available on the end device. Usually only limited storage is available on the machines, so data can only be buffered there briefly.
  • Data agent and transfer protocol: the transfer of the data to the cloud platform is carried out by a software component. This can either be supplied by the manufacturer and already integrated into the machine, or it is added as part of the PdM concept.
    Its task is to transfer the data securely to the cloud platform. If the network connection fails, the data can be spooled locally and then transferred in a batch afterwards.
    Besides the data transfer, the registration of new machines must also be possible at this point. It must not be possible to move the data agent to a new machine simply by cloning it; suitable protection, e.g. via hardware IDs, is required here.
  • Data aggregator: if the machine cannot or should not transfer data to the cloud directly, the data of several machines can be consolidated by a data aggregator, from which the encrypted transfer to the cloud platform is then carried out.
    Reasons for using a data aggregator could be that the end customer does not allow individual machines to transmit to the network, or that “legacy” machines without plugins for direct data transfer are to be connected, e.g. a machine with an older PLC for which direct transfer to the cloud is technically impossible.
  • Cloud / web platform: the transferred data must be stored centrally in a suitable environment. The actual insights and predictions of a PdM system are derived from this collected data. AI and self-learning algorithms can process the data further. The results obtained from the analyzed machine data are the basis of the PdM system and are presented to the users graphically or delivered as notifications and warnings via message.
  • End device: the access point for the user. The PdM data is presented as an app or as a web application.

The data agent / data aggregator can be given local intelligence by means of edge computing. Data can already be pre-evaluated and aggregated on site, which reduces the amount of data transferred.
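The spooling behavior of such a data agent can be sketched as follows. All class, field, and sensor names are invented for illustration; a real agent would also encrypt the transfer and authenticate the machine, e.g. via a hardware ID:

```python
import json
import time
from collections import deque

class DataAgent:
    """Minimal sketch of a data agent: readings are spooled locally
    and only flushed to the cloud when the network link is up."""

    def __init__(self, machine_id, upload):
        self.machine_id = machine_id        # e.g. tied to a hardware ID
        self.upload = upload                # callable that sends one batch
        self.spool = deque(maxlen=10_000)   # bounded local buffer

    def record(self, sensor, value):
        # Buffer one reading; machines have little storage, so the
        # deque drops the oldest entries once it is full.
        self.spool.append({"machine": self.machine_id, "sensor": sensor,
                           "value": value, "ts": time.time()})

    def flush(self, link_up):
        # Transfer the spooled readings as one aggregated batch, but
        # only if the network connection is available.
        if not link_up or not self.spool:
            return 0
        batch = list(self.spool)
        self.upload(json.dumps(batch))
        self.spool.clear()
        return len(batch)

sent = []
agent = DataAgent("press-01", upload=sent.append)
agent.record("temperature", 71.3)
agent.record("pressure", 5.2)
agent.flush(link_up=False)     # network down: values stay spooled
n = agent.flush(link_up=True)  # n == 2: both readings sent in one batch
```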

Which values should be transferred?

The goal of PdM is, by capturing and evaluating machine data, ultimately to interconnect with the information-processing systems of a company. That will be, for example, an ERP (enterprise resource planning) or an MES (manufacturing execution system). There, production resources and capacities are planned based on the PdM data.

Typical data transferred for PdM are:

  • Temperature
  • Pressure
  • Speeds
  • Distances traveled
  • Switching cycles
  • Viscosities
  • Fluid levels
  • Vibrations

Which data you can capture depends on whether you are a user, i.e. a production company, or a manufacturer, i.e. a machine builder. As a user, you normally have less deep access to the machines’ data and parameters; only the values provided and documented by the manufacturer are accessible, and these tend to sit at the application layer. As a manufacturer, you can access any values, including things like switching cycles or the distances traveled by motors.

Transfer to the PdM system those data from which you can determine the maintenance needs of your machine.
For a glass tempering plant, for example, these would be the operating hours of the ceramic rollers, the distances traveled by the V-belts, or the switch-on times of the heating elements.
For an automation line for PCB production, they would be the operating hours of the suction cups or the distances traveled by the drive belts.

How often should values be transferred?

If you do not have to expect any restrictions in bandwidth or data volume from the network connection, choose a relatively fine granularity for the transfer interval. Choose it so that problems on the machine can still be analyzed afterwards and their trigger identified.
In most Industry 4.0 environments, the amount of data should not play a major role. If you work in an IoT environment with, for example, a LoRaWAN connection, divide the data into categories by priority, e.g. high, medium, low, or e.g. production, standby. The transmission of the categories can then be differentiated by operating state: which category matters when, and which should be transmitted with priority.

Opportunities

Implementing a predictive maintenance concept helps you make production more agile. Schedules based on the predicted maintenance times of the plants allow more precise and tighter capacity planning. This effect has a positive impact on production costs.

A PdM system also offers great savings potential with regard to the upcoming CO2 tax models. With the collected data, you can calculate exactly how much energy was consumed per produced workpiece and thus save CO2 tax.

With smart services such as PdM, you as a machine manufacturer can earn money on a continuous basis. You generate additional revenue from your customers while increasing customer loyalty at the same time. Thanks to predictive maintenance, your customers become more satisfied with your products.

Conclusion

Predictive maintenance has potential for end users as well as for manufacturers. For the end user, the savings potential is paramount; for the manufacturer, customer satisfaction. With intelligent edge-computing components, PdM solutions scale well and reduce the amount of data transferred.

Implementing a predictive maintenance solution is not tied to the installation or development of a new plant. Machines already in operation can also easily be integrated into a PdM system.

 

Seq2seq models and simple attention mechanism: backbones of NLP tasks

This is the second article of my article series “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

1 Machine translation and seq2seq models

I think machine translation is one of the most iconic and commercialized tasks of NLP. With modern machine translation you can translate relatively complicated sentences, if you tolerate some grammatical errors. As I mentioned in the third article of my series on RNNs, research on machine translation started as early as the 1950s, focusing on translation between English and Russian, highly motivated by the Cold War. In the initial phase, machine translation was rule-based, much like what most students do in their foreign language classes: researchers simply implemented a lot of translation rules. In the next phase, machine translation was statistics-based, achieving better performance by using statistics to construct sentences. At any rate, both approaches relied heavily on feature engineering; I mean, you needed to consider numerous translation rules and implement them manually. After those endeavors, neural machine translation appeared. The advent of neural machine translation was an earthshaking change in the machine translation field: it soon outperformed the conventional techniques, and it is still the state of the art. Some of you might have felt that machine translation became more or less reliable around that time.

Source: Monty Python’s Life of Brian (1979)

I think you have learned at least one foreign or classical language in school. I don’t know how good you were in those classes, but I think you had to learn some conjugations, and I believe that was tiresome to most students. For example, as a foreigner, I still cannot use “der,” “die,” and “das” properly. Some of my friends recommended that I not worry about them for the time being while speaking, but I usually care about grammar very much. This method of learning a language is close to rule-based machine translation, and modern neural machine translation basically does not rely on such rules.

As far as I understand, machine translation is pattern recognition learned from a large corpus. Basically, no one explicitly teaches computers how grammar works. Machine translation learns a very complicated mapping from a source language to a target language, based on a lot of examples of word or sentence pairs. I am not sure, but this might be close to how bilingual kids learn how the two languages are related. You do not need to guide the translator through specific grammatical rules.

Source: Monty Python’s Flying Circus (1969)

Since machine translation does not rely on manually programmed grammatical rules, you basically do not need to prepare a different network architecture for a different pair of languages. The same method can be applied to any pair of languages, as long as you have a corpus of sufficient size. You do not have to think about translation rules between other pairs of languages.

Source: Monty Python’s Flying Circus (1969)

*I do not follow the cutting-edge studies on machine translation, so I am not sure, but I guess there are some heuristic methods for machine translation; that is, designing a network depending on the pair of languages could be effective. When it comes to grammatical word order, English and Japanese have totally different structures: English is basically SVO and Japanese is basically SOV. In many cases, the structures of sentences with the same meaning in the two languages are almost like reflections in a mirror. A lot of languages have structures similar to English, even in Asia, for example Chinese. On the other hand, relatively few languages have Japanese-like structures, for example Korean and Turkish. I guess there could be some grammatical-structure-aware machine translation networks.

Not only machine translation but also several other NLP tasks, such as summarization and question answering, use a model named the seq2seq model (sequence-to-sequence model). Like other deep learning techniques, seq2seq models are composed of an encoder and a decoder. In the case of seq2seq models, you use RNNs in both the encoder and the decoder parts. For the RNN cells, you usually use a gated RNN such as LSTM or GRU, because simple RNNs would suffer from the vanishing gradient problem when inputs or outputs are long, and those in translation tasks are long enough. In the encoder part, you just pass in the input sentences. To be exact, you input them from the first time step to the last time step, with each cell giving an output and passing information to the next cell via recurrent connections.

*I think you would be confused without some understanding of how RNNs propagate forward. You do not need to understand this part that much if you just want to learn about the Transformer; for that, the attention mechanism, which I explain in the next section, is more important. If you want to know how basic RNNs work, an article of mine should help you.

*In the encoder part of the figure below, the cells also propagate information backward. I assumed an encoder with bidirectional RNNs, which “forward propagate” information backwards. But in the code below, we do not consider such a complex situation. Please just keep in mind that a seq2seq model could use bidirectional RNNs.

At the last time step in the encoder part, you pass the hidden state of the RNN to the decoder part, which I show as a yellow cell in the figure below; this yellow cell/layer is the initial hidden layer of the first RNN cell of the decoder part. Just like normal RNNs, the decoder part starts giving out outputs and passing information via recurrent connections. At every time step, you choose a token to emit from the vocabulary you use in the task. That means each cell of the decoder RNN performs a classification task and decides which word to write out at that time step. Also, very importantly, in the decoder part the output at one time step is the input at the next time step, as I show with dotted lines in the figure below.

*The translation algorithm I explained relies on greedy decoding, which has to decide on a token at every time step. However, it is easy to imagine that this is not how you translate a sentence: you usually erase earlier words, or you keep several possibilities in your mind. For better translations you would need decoding strategies such as beam search, but that is out of the scope of at least this article. Thus we are going to make a very simplified translator based on greedy decoding.

2 Learning by making

*It would take some hours on your computer to train the translator if you do not use a GPU. I recommend starting the training first and continuing to read this article while it runs.

Seq2seq models do not have that complicated a structure, and for now you just need to understand the points I mentioned above. Rather than just formulating the model, I think it is better to understand it by actually writing code. If you copy and paste the code from this GitHub page or the official TensorFlow tutorial and install the necessary libraries, it will start training a seq2seq model for a Spanish-English translator. On the GitHub page, I just added comments to the code from the official tutorial to make it more understandable. If you can understand the code in the tutorial without difficulty, I have to say this article itself is below your level. Otherwise, I am going to help you understand the tutorial with my original figures. I wrote this article so that it would help you read the next one. If you have no idea what an RNN is, at least the second article of my RNN series should be helpful to some extent.

*If you try to read my whole article series on RNNs, I think you should get prepared. I mean, you should prepare some pieces of paper and a pen, and it would be nice to have a stock of coffee and snacks. Though I do not think you have to do that to read this article.

2.1 The corpus and datasets

In the code on the GitHub page, please ignore the parts sandwiched by “######”. Handling language data is not the focus of this article. All you have to know is that the code first creates datasets from the Spanish-English corpus at http://www.manythings.org/anki/ , and you get datasets for training the translator as the tensors below.

Each token is encoded as an integer as in the code below; thus, after encoding, the Spanish sentence “Todo sobre mi madre.” becomes [1, 74, 514, 19, 237, 3, 2].

2.2 The encoder

The encoder part is relatively simple. All you have to keep in mind is that you put in the input sentences and pass the hidden layer of the last cell to the decoder part. To be more concrete, an RNN cell receives an input word at every time step and gives out an output vector at each time step, passing hidden states to the next cell. This process builds a chain of RNN cells, as in the figure below. In this case, “time steps” means the indices of the order of the words. If you more or less understand how RNNs work, I think this is nothing difficult. The encoder part passes the hidden state, shown in yellow in the figure below, to the decoder part.

Let’s see how encoders are implemented in the code below. We use a type of RNN named GRU (Gated Recurrent Unit). A GRU is simpler than an LSTM (Long Short-Term Memory). One GRU cell gets an input at every time step and passes one hidden state via recurrent connections. Like LSTM, GRU is a gated RNN, so it can mitigate vanishing gradient problems; GRU was invented after LSTM to reduce computation costs. At time step (t), one GRU cell gets an input \boldsymbol{x}^{(t)} and passes its hidden state/vector \boldsymbol{h}^{(t)} to the next cell, as in the figure below. But in the implementation, you put in the whole input sentence as a 16-dimensional vector whose elements are integers, as you saw in the figure in subsection 2.1. That means the ‘Encoder’ class in the implementation below makes a chain of 16 GRU cells every time you put in an input sentence in Spanish, even if the input sentence has fewer than 16 tokens.
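To see what one GRU cell computes, here is a plain NumPy sketch of the standard GRU equations (one common convention; this is not the tutorial’s implementation, and the weights and dimensions are random stand-ins):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h, W, U, b):
    """x: input vector, h: previous hidden state. W, U, b each hold the
    update ('z'), reset ('r'), and candidate ('h') parts."""
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])        # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])        # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h) + b["h"])  # candidate
    return (1.0 - z) * h + z * h_tilde                   # new hidden state

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W = {k: rng.normal(size=(d_h, d_in)) * 0.1 for k in "zrh"}
U = {k: rng.normal(size=(d_h, d_h)) * 0.1 for k in "zrh"}
b = {k: np.zeros(d_h) for k in "zrh"}

# Chain the cell over a "sentence" of 5 input vectors, like the encoder
# chains its GRU cells over the time steps of an input sentence.
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h = gru_cell(x, h, W, U, b)
print(h.shape)  # the final hidden state, which would go to the decoder
```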

*To be very honest, I am not sure why the encoder part of seq2seq models is implemented this way in the code below. In the implementation, the total number of time steps in the encoder part is fixed at 16. If an input sentence has fewer than 16 tokens, the RNN cells seem to get no inputs after the time step of the token “<end>”. As far as I could check, if RNN cells get no inputs, they keep giving out similar 1024-dimensional vectors. I think in this implementation, the RNN cells after the <end> token, which I show as the dotted RNN cells in the figure above, do not change much. The encoder part then passes the hidden state of the 16th RNN cell, shown in yellow, to the decoder.

2.3 The decoder

The decoder part is also not that hard to understand. As I briefly explained in the last section, you initialize the first cell of the decoder using the hidden layer of the last cell of the encoder. During decoding, I mean while writing a translation, you first put the token “<start>” in as the first input of the decoder. Given the input “<start>”, the first cell outputs “all” in the example in the figure below, and the output “all” becomes the input of the next cell. The output of the next cell, “about”, is also passed on to the next cell, and you repeat this until the decoder gives out the token “<end>”.

A more important point is how to compute losses in the decoder part during training. We use a technique named teacher forcing when training the decoder part of a seq2seq model. This is also quite simple: you just make sure you input the correct answer to the RNN cells, regardless of the outputs the cells generated at the last time step. You force the decoder to get the correct input at every time step, and that is what teacher forcing is all about.
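The idea can be sketched as a schematic training loop. The function names here are placeholders, not the tutorial’s API; `decoder_step` stands for one decoder RNN cell plus its output layer:

```python
def decode_with_teacher_forcing(decoder_step, hidden, target_tokens):
    """Run the decoder over one target sentence with teacher forcing:
    whatever the decoder predicted at step t-1, the *ground-truth*
    token is fed in at step t. Returns (prediction, target) pairs that
    a loss function would compare."""
    loss_terms = []
    dec_input = target_tokens[0]              # always "<start>"
    for t in range(1, len(target_tokens)):
        prediction, hidden = decoder_step(dec_input, hidden)
        loss_terms.append((prediction, target_tokens[t]))
        dec_input = target_tokens[t]          # teacher forcing: feed truth
    return loss_terms
```

At inference time, the only change is `dec_input = prediction`: the decoder has to live with its own previous guesses.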

You can see how the decoder part and teacher forcing are implemented in the code below. Keep in mind that, unlike with the ‘Encoder’ class, you put one token into the ‘Decoder’ class at each time step. To be exact, you also need the outputs of the encoder part to calculate attention in the decoder part; I am going to explain that in the next subsection.

2.4 Attention mechanism

I think you have learned at least one foreign language, and you usually had to translate some sentences. Remember the process of writing a translation of a sentence into another language. Imagine that you are about to write a new word after writing some. If you are not used to translating into that language, you must have thought about which parts of the original sentence correspond to the very next word you are going to write. You have to pay “attention” to the original sentence. This is what the attention mechanism is all about.

*I would like you to pay “attention” to this section. As you can see from the fact that the original paper on Transformer model is named “Attention Is All You Need,” attention mechanism is a crucial idea of Transformer.

In the decoder part, you initialize the hidden layer with the last hidden layer of the encoder, and its first input is “<start>”. The decoder part starts decoding, as I explained in the last subsection. If you use an attention mechanism in the seq2seq model, you calculate attention at every time step. Let’s consider the example in the figure below, where the next input to the decoder is “my”. Given the token “my”, the GRU cell calculates a hidden state at that time step. This hidden state is the “query” in this case, and you compare the “query” with the 6 outputs of the encoder, which are the “keys”. You get weights/scores, I mean “attentions”, which are shown as the histogram in the figure below.

Then you reweight the “values” with the weights in the histogram. In this case, the “values” are the outputs of the encoder themselves. You use the reweighted “values” to update the hidden state of the decoder at that time step, and you use the hidden state updated by the attention to predict the next word.

*In the implementation, however, the size of the output of the ‘Encoder’ class is always (16, 1024). You calculate attention over all 16 output vectors, but effectively only the first 6 1024-dimensional output vectors are important.

To sum up the points I have explained: you compare the “query” with the “keys” and get scores/weights for the “values.” Each score/weight is, in short, the relevance between the “query” and each “key.” Then you reweight the “values” with the scores/weights and take the sum of the reweighted “values.” In the case of the attention mechanism in this article, the “values” and the “keys” are the same. You will also see that more clearly in the implementation below.
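The whole query-key-value reweighting fits in a few lines of NumPy. For simplicity, dot-product scoring stands in for Bahdanau’s additive score here, and the dimensions are made up (6 encoder outputs of size 4 instead of 16 outputs of size 1024):

```python
import numpy as np

rng = np.random.default_rng(0)
keys = values = rng.normal(size=(6, 4))  # 6 encoder outputs; keys == values
query = rng.normal(size=(4,))            # the decoder's hidden state

scores = keys @ query                    # relevance of each key to the query
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time steps
context = weights @ values               # weighted sum of the values

# The weights form the "histogram" in the figure: they are nonnegative
# and sum to 1, and the context vector feeds the next-word prediction.
assert np.isclose(weights.sum(), 1.0)
print(context.shape)
```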

You especially have to pay attention to the terms “query,” “key,” and “value.” The “keys” and “values” are basically in the same language, and in the case above they are in Spanish. The “queries” and “keys” can be in either different languages or the same one; in the example above, the “query” is in English and the “keys” are in Spanish.

You can compare a “query” with “keys” in various ways. The implementation uses the one called Bahdanau’s additive style, while in the Transformer you use a more straightforward way. You do not have to care about how Bahdanau’s additive style calculates attention; it is much more important for now to learn the relations between “queries,” “keys,” and “values.”

*One subtlety is that Bahdanau’s additive style is slightly different from the figure above. In Bahdanau’s additive style, at time step (t) in the decoder part, the query is the hidden state at time step (t-1); you will notice that if you look closely at the implementation below. As the figure above shows, you would otherwise have to calculate the hidden state of the decoder cell twice at time step (t): first to generate a “query,” and second to predict the translated word at that time step. That would not be so computationally efficient, and I guess that is why Bahdanau’s additive style uses the hidden layer at the last time step as the query rather than calculating hidden layers twice.

2.5 Translating and displaying attentions

After training the translator for 20 epochs, I could translate Spanish sentences, and the implementation also displays attention scores between the input and output sentences. For example, the translations of the inputs “Todo sobre mi madre.” and “Habre con ella.” were “all about my mother .” and “i talked to her .” respectively, and the results seem fine. One powerful advantage of using an attention mechanism is that you can easily display this type of word alignment, I mean correspondences between words in a sentence, as in the heat maps below. The yellow parts show high attention scores, and you can see that the distributions of relatively high scores are more or less diagonal, which implies that English and Spanish have similar word orders.

For other inputs like “Mujeres al borde de un ataque de nervious.” or “Volver.”, the translations are not good.

You might have noticed there is one big problem with this implementation: you can only use the words that appeared in the corpus. In fact, I had to manually add some pairs of sentences containing the word “borde” to the corpus to get the translation in the figure.

[References]

[1] “Neural machine translation with attention,” Tensorflow Core
https://www.tensorflow.org/tutorials/text/nmt_with_attention

[2]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 72-85, 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 72-85, 191-193

[3]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)
https://www.youtube.com/watch?v=XXtpJxZBa2c


AI Voice Assistants are the Next Revolution: How Prepared are You?

Voice-based shopping is predicted to rise to USD 40 billion by 2022, according to data from OC&C Strategy Consultants. We are in an era of “voice,” in which AI and voice recognition are drastically transforming the way we live.

According to the survey, the surge in voice assistants is said to be driven by the number of homes using smart speakers, with adoption expected to grow from 13% to 55%. Amazon is expected to be one of the leaders dominating the new channel, holding the largest market share.

Perhaps this is the first time you have heard about the voice revolution. Based on multiple researchers’ estimates, the voice assistant market is expected to grow from USD 2.5 billion in 2018 to USD 8 billion by 2023.

But what is voice revolution or voice assistant or voice search?

It is only recently that consumers have started learning about voice assistants, which are predicted to persist well into the future.

You’ve heard of Alexa, Cortana, Siri, and Google Assistant; these technologies are some of the world’s best-known examples of voice assistants. They will further help drive consumer behavior and push companies to adjust to industry demands. Through voice technology, consumers can now transform the way they act and search, and brands the way they advertise.

Voice search is a technology that helps users or consumers search a website by simply asking a question on their smartphone, computer, or smart device.

The voice assistant awareness: Why now?

As surveyed by PwC, 90% of respondents were aware of voice assistants; about 72% reported having used one, while merely 10% said they were clueless about voice-enabled devices and products. Notably, the adoption of voice technology was driven mainly by children, young consumers, and households earning an income above USD 100k.

Let us have a glance at the devices mainly used for voice assistance:

  • Smartphone – 57%
  • Desktop – 29%
  • Tablet – 29%
  • Laptop – 29%
  • Speaker – 27%
  • TV remote – 21%
  • Car navigation – 20%
  • Wearable – 14%

According to the survey, most consumers who use voice assistants are of the younger generation, aged 18-24.

Individuals between the ages of 25 and 49, meanwhile, were found to use these technologies most consistently and are called the “heavy users.”

Significance of mobile voice assistants: What is the need?

Although mobile devices are accessible everywhere, three out of four consumers (74%) use mobile voice assistants primarily in their household.

Mobile-based AI chatbots have taken our lives by storm, thus providing the best solution to both the customers and agents in varied areas – insurance, travel, and education, etc.

A certain group of individuals said they needed privacy while speaking to their device and that sending a voice command in public is weird.

This helps explain why the 18-24 age group uses voice assistants less frequently: this age group tends to spend more time out of their homes.

Situations where voice assistants can be used: standalone speakers vs. mobile

Cooking

  • Standalone speakers – 65%
  • Mobile – 37%

Multitasking

  • Standalone speakers – 62%
  • Mobile – 12%

Watching TV

  • Standalone speakers – 57%
  • Mobile – 43%

In bed

  • Standalone speakers – 38%
  • Mobile – 37%

Working

  • Standalone speakers – 29%
  • Mobile – 25%

Driving

  • Standalone speakers – 0%
  • Mobile – 40%

By the end of 2020, nearly half of all the searches made will be voice-based, as predicted by Comscore, a media analytics firm.

Don’t you think voice-based assistants are changing the way businesses function? Thanks to the advent of AI!

  • A 2018 study on AI chatbots and voice assistants by Spiceworks found that 24% of large businesses and 16% of smaller businesses had already started using AI technologies in their workplaces, while another 25% of the business market was expected to adopt AI within the next 12 months.

Surprisingly, voice-based assistants such as Siri, Google Assistant, and Cortana are some of the most prominent technologies these businesses are using in their workstations.

Where will the next AI voice revolution take us?

Voice-authorized transactions

PayPal, an online payment gateway, now leverages Siri’s and Alexa’s voice recognition capabilities, allowing users to make payments, check their balance, and request payments from people via voice command.

Voice remote control – AI-powered

Comcast, an American telecommunications and media conglomerate, introduced its first-ever X1 voice remote control, which provides both natural language processing and voice recognition.

With the help of deep learning, the X1 can easily come up with better search results: just press the button and tell your television what to do next.

Voice AI-enabled memos and analytics

Salesforce recently unveiled Einstein Voice, an AI assistant that enters critical data the moment it hears a voice command and can also interpret voice memos. Besides this, the accompanying Einstein Voice Bots help companies create customized voice bots to answer customer queries.

Voice-activated ordering

It is astonishing to see how Domino’s is using a voice-activated feature to automate orders made over the phone by customers. Welcome to the era of the voice revolution.

The app, developed by Nuance Communications, already has a Siri-like voice recognition feature that allows customers to place their orders just as they would at the cash counter, making ordering more efficient.

As more businesses look to break down the roadblocks between consumer and brand, voice search is projected to become an impactful technology for bridging that gap.

Simple RNN

A gentle introduction to the tiresome part of understanding RNN

Just as a normal conversation in a random pub or bar in Berlin, people often ask me “Which language do you use?” I always answer “LaTeX and PowerPoint.”

I have been doing an internship at DATANOMIQ and trying to make straightforward but precise study materials on deep learning. I myself started learning machine learning in April of 2019, and I have been self-studying during this one-year-vacation of mine in Berlin.

Many study materials give good explanations on densely connected layers or convolutional neural networks (CNNs). But when it comes to back propagation of CNN and recurrent neural networks (RNNs), I think there’s much room for improvement to make the topic understandable to learners.

Many study materials avoid the points I want to understand, and that was as frustrating to me as listening to answers to questions in the Japanese Diet, or listening to speeches from the current Japanese minister of the environment. With the slightest common sense, you would always get the feeling “How?” after reading an RNN chapter in any book.

This blog series focuses on the introductory level of recurrent neural networks. By “introductory”, I mean prerequisites for a better and more mathematical understanding of RNN algorithms.

I am going to keep these posts as visual as possible, avoiding equations, but I am also going to attach some links to check more precise mathematical explanations.

This blog series is composed of five posts:

  1. Prerequisites for understanding RNN at a more mathematical level
  2. Simple RNN: the first foothold for understanding LSTM
  3. A brief history of neural nets: everything you should know before learning LSTM
  4. Understanding LSTM forward propagation in two ways
  5. LSTM back propagation: following the flows of variables


Business Data is changing the world’s view towards Green Energy

Energy conservation is one of the most stressed points all around the globe. In the past 30 years, research in the field of energy conservation, and especially green energy, has risen to another level. The positive outcomes of this research have given us a gamut of technologies that can aid in preserving and utilizing green energy. It has also reduced companies’ over-dependency on fossil fuels such as oil, coal, and natural gas.

Business data and analytics have the power and the potential to take business organizations forward into the future and conquer new frontiers. Seizing the opportunities presented by green energy, market leaders such as Intel and Google have already implemented it, and they now enjoy the rich benefits of green energy sources.

Business data enables organizations to measure the positive outcomes of adopting green energies. According to the World Energy Outlook report, global wind energy capacity will increase by 85% by the year 2020, reaching 1,400 TWh. Moreover, at the Paris Summit, more than 170 countries around the world agreed on reducing the impact of global warming by harnessing energy from green energy sources. And for this to work, Big Data analytics will play a pivotal role.

Overview of Green energy

In simpler terms, green energy is energy coming from natural sources such as wind, sun, plants, tides, and geothermal heat. In contrast to fossil fuels, green energy resources can be replenished in a short period and used over long periods. Green energy sources have a minimal ill effect on the environment compared to fossil fuels. In addition, fossil fuels can be replaced by green energy sources in many areas, such as providing electricity and fuel for motor vehicles.

With the help of business data, organizations throughout the world can change the view of green energy. Big Data can show how different types of green energy sources can help businesses and accelerate sustainable expansion.

Below are the different types of green energy sources:

  • Wind Power
  • Solar Power
  • Geothermal Energy
  • Hydropower
  • Biofuels
  • Bio-mass

Below is a list of advantages that green energy, or renewable energy sources, have brought to new-age businesses.

Profits on the rise

If the energy produced exceeds the energy used, organizations can sell the surplus back to the grid and earn a profit from it. Green energy sources are renewable, and with precise data, companies can get an overall estimate of their energy requirements.

With Big Data, organizations can learn the history of a demographic location before setting up a factory there. For example, if your company is planning to set up a factory in a coastal region, tidal and wind energy would be more beneficial than solar power. Business data will give a complete analysis of the flow of the wind so that companies can ascertain the best location for a windmill; this will allow them to store energy in advance and use it as required. It not only saves money but also provides an extra source of income to the companies. With green energy sources, production in the company can increase to an unprecedented level and sustain growth over the years.

Synchronizing the maintenance process

If there is a rapid inflow of solar and wind energy, the amount of power produced will be huge. Many solar panels and windmills operate in a solar or wind power plant, and with so much equipment, it becomes too complex to manage. Big Data analytics will assist companies in streamlining all these everyday operations to a large extent, without any hassle.

Moreover, the analytics tool will convey the performance of renewable energy sources under different weather conditions. Thus, the companies will get the perfect idea about the performance of the green energy sources, thus enabling them to take necessary actions as and when required.

Lowering the attrition rate

Researchers have found that more employees want to be associated with companies that support green energies. By opting for green energy sources and investing in them, companies are indirectly investing in keeping the workforce intact and lowering the attrition rate. Statistics back this up: nearly 50% of working professionals, and almost two-thirds of millennials, want to be associated with companies that opt for green energy sources and have a positive impact on environmental conservation.

The employees will not only wish to stay with the organizations for a long time but will also work hard for the betterment of the organization. Therefore, you can concentrate on expanding the business rather than thinking about the replacement of the employees.

Lowering the risk due to Power Outage

Business data analytics will continuously update the power requirements needed to run the company. Thus organizations can cut down the risk of a power outage, as well as the expenses related to it. Companies will know when to halt energy transmission, as they will know whether the grid is under strain.

Business analytics and green energy enable planned power outages, which are cost-efficient and can thus decrease product development costs. Apart from this, companies can store energy for later usage. Practicing this process will help save a lot of money in the long run, proving that investment in green energy sources is a smart investment.

Reducing the maintenance cost

An increasing number of organizations are using renewable sources of energy as it plays a vital role in decreasing production and maintenance costs. The predictive analysis technology helps renewable energy sources to produce more energy at less cost, thus reducing the cost of infrastructure.

Moreover, data analytics will make green energy sources more bankable for companies. As organizations will have a concrete amount of data related to their energy sources, they can use it wisely and more productively.

Escalating Energy Storage

Green energy can be stored in bulk and used by business organizations as required. Using green energy on a larger scale will even allow companies to get rid of fossil fuels completely and thus work towards the betterment of the environment. Big Data analytics with AI and cloud-enabled systems help organizations store renewable energy such as wind and solar.

Moreover, it gathers information for businesses and gives a complete analysis of the exact amount of energy required to complete a particular task. The data will also automate cost savings, as it can predict clients’ needs. Based on business data, companies can store renewable energy more effectively.

With business data analytics, companies can store energy when it is cheap and use it as needed when energy rates go higher. Although predicting storage requirements is a complicated process, with Artificial Intelligence (AI) at work, you can analyze the data efficiently.
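The store-when-cheap, use-when-expensive idea above can be sketched as a simple threshold rule. Everything below (hourly prices, demand, battery capacity, the price cutoff) is a made-up illustration, not real tariff or plant data.

```python
# Toy sketch of price-based energy storage: charge the battery in cheap
# hours, discharge in expensive hours. All figures are illustrative.
hourly_price = [0.10, 0.08, 0.09, 0.25, 0.30, 0.12]  # EUR per kWh
hourly_demand = [5, 5, 5, 8, 8, 5]                   # kWh needed each hour
capacity = 20                                        # battery size in kWh
threshold = 0.15                                     # "cheap" price cutoff

stored = 0.0
cost = 0.0
for price, demand in zip(hourly_price, hourly_demand):
    if price < threshold:
        # Cheap hour: buy for current demand and top up the battery.
        buy = demand + (capacity - stored)
        stored = capacity
        cost += buy * price
    else:
        # Expensive hour: draw from storage first, buy only the rest.
        from_battery = min(stored, demand)
        stored -= from_battery
        cost += (demand - from_battery) * price

# Baseline: buying every hour at that hour's price, with no storage.
baseline = sum(p * d for p, d in zip(hourly_price, hourly_demand))
print(f"with storage: {cost:.2f} EUR, without: {baseline:.2f} EUR")
```

A real system would optimize against a price forecast rather than a fixed threshold (the greedy rule above even tops up the battery in a final cheap hour it never uses), but the sketch shows why storage plus price data cuts the bill.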

Bundling Up

Green energy sources will play a pivotal role in deciding the future of businesses, as fossil fuels are only available in limited quantities. Moreover, astute business data analysts will assist organizations not only in using renewable energy sources better but also in forming a formidable workforce. Data support in the green energy sector will also provide companies with sustainable growth, monitor their efforts, and assist them in the long run.

Predictive Analytics World 2020 Healthcare

Difficult times call for creative measures

Predictive Analytics World for Healthcare will go virtual and you still have time to join us!

What do you have in store for me?

We will provide a live-streamed virtual version of Predictive Analytics World for Healthcare Munich 2020 on 11-12 May, 2020: you will be able to attend sessions and to interact and connect with the speakers and fellow members of the data science community, including sponsors and exhibitors, from your home or your office.

What about the workshops?

The workshops will also be held virtually on the planned date:
13 May, 2020.

Get a complimentary virtual sneak preview!

If you would like to join us for a virtual sneak preview of the workshop „Data Thinking“ on Thursday, 16 April, please send a request to registration@risingmedia.com. This way you can familiarise yourself with the quality of the virtual edition of both conference and workshops, and with how the interaction with speakers and attendees works.

Don’t have a ticket yet?

It‘s not too late to join the data science community.
Register by 10 May to receive access to the livestream and recordings.

REGISTER HERE

We’re looking forward to seeing you – virtually!

This year Predictive Analytics World for Healthcare runs alongside Deep Learning World and Predictive Analytics World for Industry 4.0.

Customer Journey Mapping: The data-driven approach to understanding your users

Businesses across the globe are on a mission to know their customers inside out – something commonly referred to as customer-centricity. It’s an attempt to better understand the needs and wants of customers in order to provide them with a better overall experience.

But while this sounds promising in theory, it’s much harder to achieve in practice. To really know your customers you must not only understand what they want, but also home in on how they want it, when they want it, and how often.

In essence, your business should use customer journey mapping. It allows you to visualise customer feelings and behaviours through the different stages of their journey – from the first interaction, right up until the point of purchase and beyond.

The Data-Driven Approach 

To ensure your customer journey mapping is successful, you must conduct some extensive research on your customers. You can’t afford to make decisions based on feelings and emotions alone. There are two types of research that you should use for customer journey mapping – quantitative and qualitative research.

Quantitative data is best for analysing the behaviour of your customers as it identifies their habits over time. It’s also extremely useful for confirming any hypotheses you may have developed. That being so, relying solely upon quantitative data can present one major issue – it doesn’t provide you with the specific reason behind those behaviours.

That’s where qualitative data comes to the rescue. Through data collection methods like surveys, interviews and focus groups, you can figure out the reasoning behind some of your quantitative data trends. The obvious downside to qualitative data is its lack of evidence and its tendency to be subjective. Therefore, a combination of both quantitative and qualitative research is most effective.

Creating A Customer Persona

A customer persona is designed to help businesses understand the key traits of specific groups of people. For example, those defined by their age range or geographic location. A customer persona can help improve your customer journey map by providing more insight into the behavioural trends of your “ideal” customer. 

The one downside to using customer personas is that they can be over-generalised at times. Just because a group of people shares a similar age, for example, it does not mean they all share the same beliefs and interests. Nevertheless, creating a customer persona is still beneficial to customer journey mapping – especially if used in combination with the correct customer journey analytics tools.

All Roads Lead To Customer-centricity 

To achieve customer-centricity, businesses must consider using a data-driven approach to customer journey mapping. First, it requires that you achieve a balance between both quantitative and qualitative research. Quantitative research will provide you with definitive trends while qualitative data gives you the reasoning behind those trends. 

To further increase the effectiveness of your customer journey map, consider creating customer personas. They will give you further insight into the behavioural trends within specific groups. 

This article was written by TAP London. Experts in the Adobe Experience Cloud, TAP London help brands organise data to provide meaningful insight and memorable customer experiences. Find out more at wearetaplondon.com.

5 Applications for Location-Based Data in 2020

Location-based data enables giving people relevant information based on where they are at any given moment. Here are five location data applications to look for in 2020 and beyond. 

1. Increasing Sales and Reducing Frustration

One 2019 report indicated that 89% of the marketers who used geo data saw increased sales within their customer bases. Sometimes, the ideal way to boost sales is to convert what would be a frustration into something positive. 

A French campaign associated with the Actimel yogurt brand achieved this by sending targeted, encouraging messages to drivers who used the Waze navigation app and appeared to have made a wrong turn or got caught in traffic. 

For example, a driver might get a message that said, “Instead of getting mad and honking your horn, pump up the jams! #StayStrong.” The three-month campaign saw a 140% increase in ad recall. 

More recently, home furnishing brand IKEA launched a campaign in Dubai where people can get free stuff for making a long trip to a store. The freebies get more valuable as a person’s commute time increases. The catch is that participants have to activate location settings on their phones and enable Google Maps. Driving five minutes to a store got a person a free veggie hot dog, and they’d get a complimentary table for traveling 49 minutes. 

2. Offering Tailored Ad Targeting in Medical Offices

Pharmaceutical companies are starting to rely on companies that send targeted ads to patients connected to the Wi-Fi in doctors’ offices. One such provider is Semcasting. A recent effort involved sending ads to cardiology offices for a type of drug that lowers cholesterol levels in the blood. 

The company has taken a similar approach for an over-the-counter pediatric drug and a medication to relieve migraine headaches, among others. Such initiatives cause a 10% boost in the halo effect, plus a 1.5% uptick in sales. The first perk relates to the favoritism that people feel towards other products a company makes once they like one of them.

However, location data applications related to health care arguably require special attention regarding privacy. Patients may feel uneasy if they believe that companies are watching them and know they need a particular kind of medical treatment. 

3. Facilitating the Deployment of the 5G Network

The 5G network is coming soon, and network operators are working hard to roll it out. Statistics indicate that the 5G infrastructure investment will total $275 billion over seven years. Geodata can help network brands decide where to deploy 5G connectivity first.

Moreover, once a company offers 5G in an area, marketing teams can use location data to determine which neighborhoods to target when contacting potential customers. Most companies that currently have 5G within their product lineups have carefully chosen which areas are at the top of the list to receive 5G, and that practice will continue throughout 2020. 

It’s easy to envision a scenario whereby people can send error reports to 5G providers by using location data. For example, a company could say that having location data collection enabled on a 5G-powered smartphone allows a technician to determine if there’s a persistent problem with coverage.

Since the 5G network is still in its infancy, it’s impossible to predict all the ways that a telecommunications operator might use location data to make their installations maximally profitable. However, the potential is there for forward-thinking brands to seize.

4. Helping People Know About the Events in Their Areas

SoundHound, Inc. and Wcities recently announced a partnership that will rely on location-based data to keep people in the loop about upcoming local events. People can use a conversational intelligence platform that has information about more than 20,000 cities around the world. 

Users also don’t need to mention their locations in voice queries. They could say, for example, “Which bands are playing downtown tonight?” or “Can you give me some events happening on the east side tomorrow?” They can also ask something associated with a longer timespan, such as “Are there any wine festivals happening this month?”

People can say follow-up commands, too. They might ask what the weather forecast is after hearing about an outdoor event they want to attend. The system also supports booking an Uber, letting people get to the happening without hassles. 

5. Using Location-Based Data for Matchmaking

In honor of Valentine’s Day 2020, students from more than two dozen U.S. colleges signed up for a matchmaking opportunity that, at least in part, uses their location data to work.

Participants answer school-specific questions, and their responses help them find a friend or something more. The platform uses algorithms to connect people with like-minded individuals. 

However, the company that provides the service can also give a breakdown of which residence halls have the most people taking part, or whether people generally live off-campus. This example is not the first time a university used location data by any means, but it’s different from the usual approach. 

Location Data Applications Abound

These five examples show there are no limits to how a company might use location data. However, they must do so with care, protecting user privacy while maintaining a high level of data quality. 

5 Things You Should Know About Data Mining

The majority of people spend about twenty-four hours online every week. In that time they give out enough information for big data to know a lot about them. Having people collect and compile your data might seem scary, but it may well have been helpful to you in the past.


If you have ever been surprised to find an ad targeted toward something you were talking about earlier or an invention made based on something you were googling, then you already know that data mining can be helpful. Advanced education in data mining can be an awesome resource, so it may pay to have a personal tutor skilled in the area to help you understand. 


It is understandable to be unsure of a system that collects information online so that companies can learn more about you. Luckily, so much data is put out every day that it is unlikely data mining is focusing on any of your important information. Here are a few things you should know about data mining.


1. Data Mining Is Used In Crime Scenes

Using a variation of earthquake prediction software and data, the Los Angeles police department and researchers were able to predict crime within five hundred feet. As they learn how to compile and understand more data patterns, crime detecting will become more accurate.


Using this data, the Los Angeles police department was able to reduce theft by thirty-three percent and violent crime by about twenty-one percent. Those are not perfect numbers, but they are better than before and will become even more impressive as time goes on.


The fact that data mining is able to pick up on crime statistics and compile all of that data to give an accurate picture of where crime is likely to occur is amazing. It gives a place to look and is able to help stop crime as it starts.


2. Data Mining Helps With Sales

A great story about data mining in sales is the example of Walmart putting beer near the diapers. The story claims that through measuring statistics and mining data it was found that when men purchase diapers they are also likely to buy a pack of beer. Walmart collected that data and put it to good use by putting the beer next to the diapers.


The amount of truth in that story is debatable, but it has made data mining popular in most retail stores. Finding which products are often bought together can give insight into where to place products in a store. This practice has immensely increased sales of both items, simply because people tend to purchase items placed near one another more than they would if they had to walk to get the second item.
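The “bought together” analysis described above can be sketched as simple pair counting over transaction baskets. Real retail systems use association-rule algorithms such as Apriori, but the core idea fits in a few lines; the baskets below are invented for illustration.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction data: each basket is a set of purchased items.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "milk"},
    {"beer", "bread"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# "Support" of a pair = fraction of all baskets containing both items.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}

print(max(support, key=support.get))  # most frequently co-purchased pair
```

Pairs with high support are candidates for being shelved together; a production system would also compute confidence and lift before acting on a pair.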


Putting a lot of stock in the data-gathering teams that big stores build does not always work. There have been plenty of times when data teams failed and sales plummeted. Often, however, the benefits outweigh the potential for failure, and many stores now use data mining to make big decisions about their sales.


3. It’s Helping With Predicting Disease 


In 2009, Google began work on predicting the winter flu. Google went through the fifty million most-searched terms and compared them with what the CDC had recorded during the 2003-2008 flu seasons. With that information, Google was able to help predict the next winter flu outbreak, even down to the states it hit hardest.
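The comparison Google made, checking whether a search term’s weekly frequency tracks official case counts, boils down to computing a correlation between two time series. The weekly figures below are invented for illustration, not real search or CDC data.

```python
import math

# Hypothetical weekly data: how often a flu-related term was searched,
# and officially reported flu cases for the same weeks.
searches = [120, 150, 310, 480, 460, 300, 180]
cdc_cases = [1000, 1300, 2800, 4500, 4300, 2600, 1500]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(searches, cdc_cases)
print(round(r, 3))  # close to 1.0: the term tracks flu activity
```

A term whose search volume correlates strongly with case counts over past seasons becomes a candidate predictor for the next season; screening fifty million terms is just this computation repeated at scale.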


Since 2009, data mining has gotten much better at predicting disease. Since the internet is a newer invention it is still growing and data mining is still getting better. Hopefully, in the future, we will be able to predict disease breakouts quickly and accurately. 


With new data mining techniques and research in the medical field, there is hope that doctors will be able to narrow down problems such as heart disease. As the information grows and more data is entered, the medical field gets closer to solving problems through data, something that will help cure diseases more quickly and find the root of a problem.


4. Some Data Mining Gets Ignored

Interestingly, very little of the data that companies collect from you is actually used: “big data” companies do not use about eighty-eight percent of the data they have. It is incredibly difficult to use all of the millions of bits of data that flow through big data companies every day.


The more people that are used for data mining and the more data companies are actually able to filter through, the better the online experience will be. It might be a bit frightening to think of someone going through what you are doing online, but no one is touching any of the information that you keep private. Big data is using the information you put out into the world and using that data to come to conclusions and make the world a better place.


There is a huge amount of information being put onto the internet at all times. Twenty-four hours a week is the average amount of time a single person spends online, and plenty of people spend more time than that. All of that information takes a lot of people to sift through, and there are currently not enough people in the data mining industry to go through the majority of the data being put online.


5. Too Many Data Mining Jobs

Interestingly, the data industry is booming. An amazing number of careers are opening up on the internet every day, and the industry is growing so quickly that there are not enough people to fill the jobs being created.


The lack of talent in the industry means there is plenty of room for newcomers who want to go into data mining. It was predicted that by 2018 there would be a shortage of 140,000 people with deep analytical skills. Given how often job scarcity is discussed elsewhere, it is remarkable that the data industry faces such a shortage.


If big data is only able to wade through less than half of the data being collected then we are wasting a resource. The more people who go into an analytics or computer career the more information we will be able to collect and utilize. There are currently more jobs than there are people in the data mining field and that needs to be corrected.


To Conclude

The data mining industry is making great strides. Big data companies are trying to use the information they collect not only to sell more things to you but also to improve the world. And there is something very convenient about your computer knowing the types of things you want to buy and showing them to you immediately.


Data mining has been able to help predict crime in Los Angeles and lower crime rates. It has also helped companies know what items are commonly purchased together so that stores can be organized more efficiently. Data mining has even been able to predict the outbreak of disease down to the state.


Even with so much data being ignored and so many jobs left empty, data mining is doing incredible things. The entire internet is constantly growing, and data mining is growing right along with it. As the industry climbs and more people build careers mining data, we will learn more and uncover more insights.