Graphical understanding of dynamic programming and the Bellman equation: taking a typical approach at first

This is the second article of the series My elaborate study notes on reinforcement learning.

1, Before getting down to business

As the title suggests, this article is mainly about the Bellman equation and dynamic programming (DP), which are, to be honest, very typical and ordinary topics. One typical way of explaining DP in the context of reinforcement learning (RL) is to explain the Bellman equation, value iteration, and policy iteration, in this order. If you merely want to follow their pseudocode and implement them, that is not a big deal. However, even though I have studied RL for only some weeks, I have a feeling that these algorithms, especially policy iteration, are more than just single algorithms. In order not to miss the point of DP, I would like to take a different approach rather than explaining value iteration and policy iteration in the typical way. Eventually I am going to introduce DP in RL as a combination of the following key terms: the Bellman operator, the fixed point of a policy, policy evaluation, policy improvement, and existence of the optimal policy. But first, in this article, I would like to cover the basic and typical topics of DP in RL.

Many machine learning algorithms that use supervised or unsupervised learning more or less share the same ideas. You design a model and a loss function, feed in samples from data, and adjust the parameters of the model so that the loss decreases. You usually use optimization techniques like stochastic gradient descent (SGD) or ones derived from it. In practice, feature engineering is also needed to extract more meaningful information from raw data. Especially in this third AI boom, models are getting more and more complex, and I would say the effort of feature engineering has simply been replaced by that of designing neural networks. Still, once you have the whole picture of supervised/unsupervised learning, you soon realize that the various other algorithms are just a matter of replacing each component of the workflow. Reinforcement learning, however, is a different framework for training machine learning models. Richard E. Bellman’s research on DP in the 1950s is said to have laid the foundation for RL. RL has also made great progress thanks to the development of deep neural networks (DNNs), but you have to keep in mind that RL and supervised/unsupervised learning are basically different frameworks. DNNs are introduced into RL frameworks just to enable richer expression of each component of RL. Especially when RL is executed in a higher-level environment, for example screens of video games or phases of board games, DNNs are needed to process each state of the environment. Thus, first of all, I think it is urgent to see the ideas unique to RL in order to learn RL effectively. In the last article I said RL is an algorithm that enables planning by trial and error in an environment when the model of the environment is not known, and DP is a major way of solving planning problems. But in this article and the next, I am mainly going to focus on a different aspect of RL: the interaction of policies and values.

According to a famous Japanese textbook on RL named “Machine Learning Professional Series: Reinforcement Learning,” most study materials on RL, including the book by Sutton and Barto, lack explanations of the mathematical foundations of RL. That is why many people who have studied machine learning find it hard to grasp RL formulations at the beginning. The book also points out that you need to refer to other bulky books on Markov decision processes or dynamic programming to really understand the core ideas behind the algorithms introduced in RL textbooks. And I got the impression that most study materials on RL gloss over the important ideas of DP by introducing only the value iteration and policy iteration algorithms. But my opinion is that we should pay more attention to policy iteration. Actually, important RL algorithms like Q-learning, SARSA, and actor-critic methods show some analogies to policy iteration. The book by Sutton and Barto also briefly mentions: “Almost all reinforcement learning methods are well described as GPI (generalized policy iteration). That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy, as suggested by the diagram to the right side.”

Even though I arrogantly, as a beginner in this field, emphasized the “simplicity” of RL in the last article, I am conversely going to emphasize the “profoundness” of DP over this article and the next. But I do not want to cover all the exhaustive mathematical derivations for dynamic programming, which would make many readers reluctant to study RL. I tried as hard as possible to visualize the ideas in DP in simple and intuitive ways, as far as I could understand them. And as the title of this article series shows, this article is also a study note for me. Any corrections or advice would be appreciated via email or in the comments below.

2, Taking a look at what DP is like

In the last article, I said that planning or RL is a problem of finding an optimal policy \pi(a|s) for choosing which actions to take depending on where you are. I also displayed flows of blue arrows for navigating a robot as intuitive examples of optimal policies in planning or RL problems. But you cannot calculate those policies directly. Policies have to be evaluated in the long run so that they maximize returns, the sums of upcoming rewards. So in order to calculate a policy \pi(a|s), you need to calculate a value function v_{\pi}(s). v_{\pi}(s) is a function of how good it is to be in a given state s under a policy \pi. That means you are likely to get a higher return starting from s when v_{\pi}(s) is high. As illustrated in the figure below, values and policies, the two major elements of RL, are updated interactively until they converge to an optimal value and an optimal policy. The optimal value and the optimal policy are denoted as v_{\ast} and \pi_{\ast} respectively.

Dynamic programming (DP) is a family of algorithms that is effective for calculating the optimal value v_{\ast} and the optimal policy \pi_{\ast} when the complete model of the environment is given. Whether in my articles or not, the rest of the discussions on RL are more or less based on DP. RL can be viewed as a method of achieving the same effects as DP when the model of the environment is not known. And I would say the effect of imitating DP is what is often referred to as trial and error in many simplified explanations of RL. If you have studied some basics of computer science, I am quite sure you have encountered DP problems. In many textbook problems, you use DP to find an optimal path through a graph from a start to a goal, along which you maximize the sum of the scores of the edges you pass. You might remember that you could solve those problems in recursive ways, but I think many people have learned only very limited cases of DP. For the time being I would like you to forget the DP you might have learned and take it as something you are newly starting to learn in the context of RL.

*As a more advanced application of DP, you might have learned string matching. With DP you can calculate how close two strings of characters are.

The ways of calculating v_{\pi}(s) and \pi(a|s) with DP can be roughly classified into two types, policy-based and value-based. In the context of DP, the policy-based one is called policy iteration, and the value-based one is called value iteration. The biggest difference between them is, in short, that policy iteration updates the policy at every iteration, but value iteration derives it only once at the end. I said you alternate between updating v_{\pi}(s) and \pi(a|s), but in fact that is only true of policy iteration. Value iteration only updates a value function v(s). Before formulating these algorithms, I think it will be effective to take a look at how values and policies are actually updated in a very simple case. I would like to introduce a very good tool for visualizing value/policy iteration. You can customize a grid map and place any of “Treasure,” “Danger,” and “Block.” You can choose the probability of transition and either of the settings “Policy Iteration” or “Values Iteration.” Let me take an example of conducting DP on a grid map like the one below. Whichever of “Policy Iteration” or “Values Iteration” you choose, you get numbers like those below. The number in each cell is the value of that state, and you can see that when you are in states with high values, you are more likely to reach the “treasure” and avoid “dangers.” But I bet this chart does not make any sense if you have not learned RL yet. I prepared some code for visualizing the process of DP on this simulator. The code is available in this link.

*In the book by Sutton and Barto, when RL/DP is discussed at an implementation level, the estimated values of v_{\pi}(s) or v_{\ast}(s) can be denoted as an array V or V_t. But I would like you to take it easy while reading my articles. I will repeatedly mention differences in notation when they matter.

*Remember that at the beginning of studying RL, only super easy cases are considered, so a V is usually just a NumPy array or an Excel sheet.

*The chart above might also be misleading since there is something like a robot at the bottom-left corner, which might be an agent. But the agent does not actually move around the environment in planning problems because it has a perfect model of the environment in its head.

The visualization I prepared is based on the implementation of the simulator, so they should give the same outputs. When you run policy iteration on the map, the values and policies are updated as follows. The arrow in each cell is the policy in that state. At each iteration the arrows are calculated in a greedy way, and the arrow at each state shows the direction in which the agent is likely to get the highest reward. After 3 iterations, the policies and values converge, and with the policies you can navigate yourself to the “Treasure,” avoiding “Dangers.”

*I am not sure why the policies are incorrect at the leftmost side of the grid map. I might need to modify the code.

You can also update values without modifying policies, as in the chart below. In this case only the values of cells are updated. This is value iteration, and after the iteration converges, if you move to the adjacent cell with the highest value at each cell, you can also navigate yourself to the “treasure,” avoiding “dangers.”

I would like to start formulating DP little by little, based on the notations used in the RL book by Sutton and Barto. From now on, I will take the example of the 5 \times 6 grid map which I visualized above. In this case each cell is numbered from 0 to 29 as in the figure below. But the cells 7, 13, 14 are removed from the map. In this case \mathcal{S} = \{0, 1, 2, 3, 4, 6, 8, 9, 10, 11, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29\}, and \mathcal{A} = \{\uparrow, \rightarrow, \downarrow, \leftarrow \}. When you pass s=8, you get a reward r_{treasure}=1, and when you pass the states s=15 or s=19, you get a reward r_{danger}=-1. Also, the agent is encouraged to reach the goal as soon as possible, so it gets a regular reward of r_{regular} = - 0.04 at every time step.

In the last section, I mentioned that the purpose of RL is to find the optimal policy which maximizes a return, the sum of upcoming rewards R_{t+1}, R_{t+2}, \dots. A return is calculated as follows.

R_{t+1} + R_{t+2} +  R_{t+3} + \cdots + R_T

In RL a return is estimated in probabilistic ways, that is, an expectation of the return given a state S_t = s needs to be considered. And this is the value of the state. Thus the value of a state S_t = s is calculated as follows.

\mathbb{E}_{\pi}\bigl[R_{t+1} + R_{t+2} +  R_{t+3} + \cdots + R_T | S_t = s \bigr]

In order to roughly understand how this expectation is calculated, let’s take an example of the 5 \times 6 grid map above. When the current state of an agent is s=10, it can take numerous sequences of actions. For example (a) 10 - 9 - 8 - 2, (b) 10-16-15-21-20-19, (c) 10-11-17-23-29-\cdots. The rewards after each behavior are calculated as follows.

  • If you take the course (a) 10 - 9 - 8 - 2, you get a reward of r_a = -0.04 -0.04 + 1 -0.04 in total. The probability of taking the course (a) is p_a = \pi(A_t = \leftarrow | S_t = 10) \cdot p(S_{t+1} = 9 |S_t = 10, A_t = \leftarrow ) \cdot \pi(A_{t+1} = \leftarrow | S_{t+1} = 9) \cdot p(S_{t+2} = 8 |S_{t+1} = 9, A_{t+1} = \leftarrow ) \cdot \pi(A_{t+2} = \uparrow | S_{t+2} = 8) \cdot p(S_{t+3} = 2 | S_{t+2} = 8, A_{t+2} = \uparrow )
  • Just like the case of (a), the reward after taking the course (b) is r_b = - 0.04 -0.04 -1 -0.04 -0.04 -0.04 -1. The probability of taking this course can be calculated in the same way as p_b = \pi(A_t = \downarrow | S_t = 10) \cdot p(S_{t+1} = 16 |S_t = 10, A_t = \downarrow ) \cdots \pi(A_{t+4} = \leftarrow | S_{t+4} = 20) \cdot p(S_{t+5} = 19 |S_{t+4} = 20, A_{t+4} = \leftarrow ).
  • The rewards and the probability of the case (c) cannot be calculated because the future behavior of the agent is not determined.

Assume that (a) and (b) are the only possible cases starting from s=10 under the policy \pi. Then the value of s=10 can be calculated as follows, as a probabilistic sum of the rewards of the courses (a) and (b).

\mathbb{E}_{\pi}\bigl[R_{t+1} + R_{t+2} +  R_{t+3} + \cdots + R_T | S_t = s \bigr] = r_a \cdot p_a + r_b \cdot p_b
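As a quick arithmetic check of the two courses above, the sketch below adds up the rewards along (a) and (b) using the reward settings of the grid map. The probabilities p_a and p_b are placeholder values I made up only for illustration; in reality they depend on the policy and the transition probabilities, and the helper names are my own.

```python
# Reward settings taken from the grid map description above.
R_REGULAR = -0.04           # received at every time step
R_TREASURE = 1.0            # extra reward when passing s = 8
R_DANGER = -1.0             # extra reward when passing s = 15 or s = 19
TREASURE_STATES = {8}
DANGER_STATES = {15, 19}


def reward_on_entering(state):
    """Reward of one time step that ends in `state`."""
    reward = R_REGULAR
    if state in TREASURE_STATES:
        reward += R_TREASURE
    if state in DANGER_STATES:
        reward += R_DANGER
    return reward


def total_reward(course):
    """Undiscounted sum of rewards along a course; course[0] is the start state."""
    return sum(reward_on_entering(s) for s in course[1:])


course_a = [10, 9, 8, 2]
course_b = [10, 16, 15, 21, 20, 19]
r_a = total_reward(course_a)    # -0.04 - 0.04 + 1 - 0.04
r_b = total_reward(course_b)    # five regular rewards and two danger rewards

# Hypothetical probabilities, only to show the weighted sum r_a * p_a + r_b * p_b.
p_a, p_b = 0.5, 0.5
value_of_10 = r_a * p_a + r_b * p_b

print(round(r_a, 2), round(r_b, 2), round(value_of_10, 2))
```

With other values of p_a and p_b only the last line of the calculation changes, which is why the weighted sum is written separately from the sums of rewards.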

But obviously this is not how values of states are calculated in general. Starting from the state s=10, not only (a) and (b) but also numerous other behaviors of the agent can be considered. Or rather, it is almost impossible to consider all the combinations of actions, transitions, and next states. In practice it is quite difficult to calculate a sequence of upcoming rewards R_{t+1}, R_{t+2}, R_{t+3}, \cdots, because that is virtually equal to considering all the possible future cases. A very important formula named the Bellman equation effectively formulates that.

3, The Bellman equation and convergence of value functions

The Bellman equation enables estimating the values of states while considering countless future possibilities, with the following two ideas.

  1.  Returns are calculated recursively.
  2.  Returns are calculated in probabilistic ways.

First of all, I have to emphasize that a discounted return is usually used rather than a normal return, and a discounted one is defined as below

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma ^2 R_{t+3} + \cdots + \gamma ^ {T-t-1} R_T = \sum_{k=0}^{T-t-1}{\gamma ^{k}R_{t+k+1}}

, where \gamma \in (0, 1] is a discount rate. (1) As for the first point above, the discounted return can be calculated recursively as follows: G_t = R_{t + 1} + \gamma R_{t + 2} + \gamma ^2 R_{t + 3} + \gamma ^3 R_{t + 4} + \cdots = R_{t + 1} + \gamma (R_{t + 2} + \gamma R_{t + 3} + \gamma ^2 R_{t + 4} + \cdots ) = R_{t + 1} + \gamma G_{t+1}. You can postpone the calculation of future rewards corresponding to G_{t+1} this way. This might sound obvious, but this small trick is crucial for defining value functions and making update rules for them. (2) The second point might be confusing to some people, but it is the most important in this section. We took a look at a very simplified case of calculating the expectation in the last section, but let’s see how a value function v_{\pi}(s) is defined in the first place.

v_{\pi}(s) \doteq \mathbb{E}_{\pi}\bigl[G_t | S_t = s \bigr]

This equation means that the value of a state s is a probabilistic sum of all possible rewards received in the future following a policy \pi. That is, v_{\pi}(s) is an expectation of the return, starting from the state s. The definition of a value v_{\pi}(s) is written down as follows, and this is what \mathbb{E}_{\pi} means.

v_{\pi} (s)= \sum_{a}{\pi(a|s) \sum_{s', r}{p(s', r|s, a)\bigl[r + \gamma v_{\pi}(s')\bigr]}}

This is called the Bellman equation, and it is no exaggeration to say this is the foundation of many of the upcoming DP and RL ideas. The Bellman equation can also be written as v_{\pi}(s) = \sum_{s', r, a}{\pi(a|s) p(s', r|s, a)\bigl[r + \gamma v_{\pi}(s')\bigr]}. It can be comprehended this way: in the Bellman equation you calculate a probabilistic sum of r + \gamma v_{\pi}(s'), considering all the possible actions of the agent at the time step. r + \gamma v_{\pi}(s') is the sum of a reward r, which you get when you move to the state s' from s, and the discounted value of the next state s'. The probability of getting a reward r after moving from the state s to s' by taking an action a is \pi(a|s) p(s', r|s, a). Hence the right side of the Bellman equation above means the sum of \pi(a|s) p(s', r|s, a)\bigl[r + \gamma v_{\pi}(s')\bigr] over all possible combinations of s', r, and a.

*I would not say this equation is obvious, and please let me explain a proof of this equation later.
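Before moving on, the recursion G_t = R_{t+1} + \gamma G_{t+1} from point (1) can be checked numerically. The sketch below is my own minimal example: it computes a discounted return once with the direct sum and once with the recursion, using the three rewards of the course (a) from the last section and an arbitrarily chosen \gamma = 0.9.

```python
def discounted_return(rewards, gamma):
    """G_t computed directly: sum over k of gamma**k * R_{t+k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def discounted_return_recursive(rewards, gamma):
    """G_t computed with the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    if not rewards:              # past the final time step T, the return is 0
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)


# R_{t+1}, R_{t+2}, R_{t+3} along the course (a): 10 - 9 - 8 - 2.
rewards = [-0.04, 0.96, -0.04]
gamma = 0.9                      # an arbitrary discount rate for illustration

g_direct = discounted_return(rewards, gamma)
g_recursive = discounted_return_recursive(rewards, gamma)
print(g_direct, g_recursive)     # the two values agree up to rounding error
```

The recursive version never looks further than one reward ahead, and this is the same structure the Bellman equation relies on.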

The following figures are based on backup diagrams introduced in the book by Sutton and Barto. As we have just seen, the Bellman expectation equation calculates a probabilistic summation of r + \gamma v_{\pi}(s'). In order to calculate the expectation, you have to consider all the combinations of s', r, and a. The backup diagram at the left side below shows the idea as a decision-tree-like graph, and the strength of color of each arrow is the probability of taking the path.

The Bellman equation I have just introduced is, to be exact, called the Bellman expectation equation. As in the backup diagram on the right side, there is another type of Bellman equation, where at each state you consider only the action that maximizes the expected return instead of averaging over actions with \pi(a|s). This Bellman optimality equation is defined as follows.

v_{\ast}(s) \doteq \max_{a} \sum_{s', r}{p(s', r|s, a)\bigl[r + \gamma v_{\ast}(s')\bigr]}

I would like you to pay attention again to the fact that in definitions of Bellman expectation/optimality equations, v_{\pi}(s)/v_{\ast}(s) is defined recursively with v_{\pi}(s)/v_{\ast}(s). You might have thought how to calculate v_{\pi}(s)/v_{\ast}(s) is the problem in the first place.
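To make the recursive definitions a bit more concrete, here is a minimal sketch of the right sides of both equations on a tiny MDP. The MDP is hypothetical (it is not the grid map above): state 0 has two actions, "stay" and "go", state 1 is terminal, and all names and numbers are my own assumptions. For such a small case the fixed points can be solved by hand, so the code only checks that they satisfy the equations.

```python
GAMMA = 0.9

# model[(s, a)] lists (probability, next_state, reward), i.e. p(s', r | s, a).
model = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "go"): [(1.0, 1, 1.0)],
}
ACTIONS = ["stay", "go"]
# policy[s][a] is pi(a | s); here the agent picks either action with probability 0.5.
policy = {0: {"stay": 0.5, "go": 0.5}}


def backup(v, s, a):
    """sum over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(prob * (r + GAMMA * v[s_next]) for prob, s_next, r in model[(s, a)])


def expectation_rhs(v, s):
    """Right side of the Bellman expectation equation: average over pi(a | s)."""
    return sum(policy[s][a] * backup(v, s, a) for a in ACTIONS)


def optimality_rhs(v, s):
    """Right side of the Bellman optimality equation: max over actions."""
    return max(backup(v, s, a) for a in ACTIONS)


# Solved by hand: v_pi(0) = 0.5 * 1 + 0.5 * 0.9 * v_pi(0) gives 10/11,
# and v_*(0) = max(0.9 * v_*(0), 1) gives 1. The terminal state has value 0.
v_pi = {0: 10 / 11, 1: 0.0}
v_star = {0: 1.0, 1: 0.0}

print(v_pi[0], expectation_rhs(v_pi, 0))
print(v_star[0], optimality_rhs(v_star, 0))
```

Plugging any other value into v gives a right side that differs from the left side, which is what makes v_{\pi} and v_{\ast} fixed points of these equations.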

As I implied in the first section of this article, the ideas behind how to calculate v_{\pi}(s) and v_{\ast}(s) should be discussed more precisely. How to calculate v_{\pi}(s) in particular is a well-discussed topic in RL, including the cases where data is sampled from an unknown environment model. In this article we are discussing planning problems, where a model of the environment is known. In planning problems, that is, DP problems where all the probabilities of transition p(s', r | s, a) are known, a major way of calculating v_{\pi}(s) is iterative policy evaluation. With iterative policy evaluation a sequence of value functions (v_0(s), v_1(s), \dots , v_{k-1}(s), v_{k}(s)) converges to v_{\pi}(s) with the following recurrence relation

v_{k+1}(s) =\sum_{a}{\pi(a|s)\sum_{s', r}{p(s', r | s, a) [r + \gamma v_k (s')]}}.

Once v_{k}(s) converges to v_{\pi}(s), finally the equation of the definition of v_{\pi}(s) holds as follows.

v_{\pi}(s) =\sum_{a}{\pi(a|s)\sum_{s', r}{p(s', r | s, a) [r + \gamma v_{\pi} (s')]}}.

The convergence to v_{\pi}(s) is like the graph below. If you already know how to calculate forward propagation of a neural network, this should not be that hard to understand. You just expand the recurrence relation of v_{k}(s) and v_{k+1}(s) from the initial value at k=0 to the converged state at k=K. But you have to be careful about the directions of the arrows in purple. If you match the backup diagrams of the Bellman equation with the graphs below, the purple arrows point in the direction opposite to the one in which the graphs extend. This process of making an arbitrarily initialized v_0(s) converge to v_{\pi}(s) is called policy evaluation.

*\mathcal{S} and \mathcal{A} are the sets of states and actions respectively. Thus |\mathcal{S}|, the size of \mathcal{S}, is the number of white nodes in each layer, and |\mathcal{S}| \cdot |\mathcal{A}| is the number of black nodes, assuming every action is available in every state.
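The recurrence relation of iterative policy evaluation can be sketched in a few lines. This is only a minimal example under my own assumptions: a hypothetical two-state MDP instead of the grid map, a fixed policy that picks either action with probability 0.5, and synchronous updates where v_{k+1} is computed only from v_k (the pseudocode in the book by Sutton and Barto updates the array in place instead).

```python
GAMMA = 0.9

# Hypothetical MDP: model[(s, a)] lists (probability, next_state, reward).
model = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "go"): [(1.0, 1, 1.0)],
}
policy = {0: {"stay": 0.5, "go": 0.5}}   # pi(a | s) for the non-terminal state
STATES = [0, 1]                           # state 1 is terminal, its value stays 0


def policy_evaluation(policy, theta=1e-10):
    """Repeat v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v_k(s')]."""
    v = {s: 0.0 for s in STATES}          # arbitrary initial value function v_0
    history = [v[0]]                      # v_0(0), v_1(0), v_2(0), ...
    while True:
        v_new = dict(v)
        for s in policy:                  # only non-terminal states are updated
            v_new[s] = sum(
                pi_a * prob * (r + GAMMA * v[s_next])
                for a, pi_a in policy[s].items()
                for prob, s_next, r in model[(s, a)]
            )
        delta = max(abs(v_new[s] - v[s]) for s in STATES)
        v = v_new
        history.append(v[0])
        if delta < theta:                 # stop when the values barely change
            return v, history


v_pi, history = policy_evaluation(policy)
print(history[:4])    # the first few elements of the sequence v_k(0)
print(v_pi[0])        # close to 10 / 11, the solution of the Bellman equation
```

For this MDP the Bellman equation v(0) = 0.5 \cdot 1 + 0.5 \cdot 0.9 \, v(0) can be solved by hand, so you can see the sequence approaching 10/11 step by step.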

The same is true of the process of calculating an optimal value function v_{\ast}. With the following recurrence relation

v_{k+1}(s) =\max_a\sum_{s', r}{p(s', r | s, a) [r + \gamma v_k (s')]}

(v_0(s), v_1(s), \dots , v_{k-1}(s), v_{k}(s)) converges to an optimal value function v_{\ast}(s). The graph below visualizes the idea of convergence.
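The recurrence relation with the max can be sketched in the same way. The chain MDP below is a hypothetical example of mine, not the grid map: states 0 and 1 are lined up from left to right, state 2 is a terminal "treasure," and moving right from state 1 succeeds only with probability 0.8. The rewards reuse the numbers -0.04 and 1 from the grid map.

```python
GAMMA = 0.9

# model[(s, a)] lists (probability, next_state, reward), i.e. p(s', r | s, a).
model = {
    (0, "left"): [(1.0, 0, -0.04)],
    (0, "right"): [(1.0, 1, -0.04)],
    (1, "left"): [(1.0, 0, -0.04)],
    (1, "right"): [(0.8, 2, 0.96), (0.2, 1, -0.04)],
}
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
TERMINAL = {2}


def q_value(v, s, a):
    """sum over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(prob * (r + GAMMA * v[s_next]) for prob, s_next, r in model[(s, a)])


def value_iteration(theta=1e-12):
    """Repeat v_{k+1}(s) = max_a q(s, a) until the values stop changing."""
    v = {s: 0.0 for s in STATES}
    while True:
        v_new = {
            s: 0.0 if s in TERMINAL else max(q_value(v, s, a) for a in ACTIONS)
            for s in STATES
        }
        delta = max(abs(v_new[s] - v[s]) for s in STATES)
        v = v_new
        if delta < theta:
            return v


v_star = value_iteration()
print(v_star)
```

Solving the optimality equation by hand gives v_{\ast}(1) = 0.76 / 0.82 and v_{\ast}(0) = -0.04 + 0.9 \, v_{\ast}(1), and the loop should end up close to these numbers.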

4, Pseudocode of policy iteration and value iteration

I prepared pseudocode of each algorithm based on the book by Sutton and Barto. These are among the most typical DP algorithms you will encounter while studying RL, and if you just want to implement RL by yourself, this pseudocode should be enough. Or rather, it would be preferable to other more general and abstract pseudocode. But I would like to avoid explaining the pseudocode precisely, because I think we need to be more conscious of the more general ideas behind DP, which I am going to explain in the next article. I will cover only the important points of the pseudocode, and I would like to introduce some implementations of the algorithms in the latter part of the next article. I think you should read this section briefly and come back to this section or other study materials after reading the next article. In case you want to check the algorithms precisely, you can check the pseudocode I made with LaTeX in this link.

The biggest difference between policy iteration and value iteration is the timing of updating a policy. In policy iteration, a value function v(s) and a policy \pi(a|s) are arbitrarily initialized. (1) The first process is policy evaluation. The policy \pi(a|s) is fixed, and the value function v(s) approximately converges to v_{\pi}(s), which is the value function of the policy \pi. This is conducted by the iterative calculation with the recurrence relation introduced in the last section. (2) The second process is policy improvement. Based on the calculated value function v_{\pi}(s), the policy \pi(a|s) is updated as below.

\pi(a|s) \gets\text{argmax}_a {\sum_{s', r}{p(s', r|s, a)[r + \gamma V(s')]}}, \quad \forall s\in \mathcal{S}

The meaning of this update rule is quite simple: \pi(a|s) is updated in a greedy way, with the action a that maximizes \sum_{s', r}{p(s', r|s, a)[r + \gamma V(s')]}. And when the policy \pi(a|s) is not updated anymore, the policy has converged to the optimal one. At the very least, I would like you to keep in mind that a while loop of iterative calculation of v_{\pi}(s) is nested in another while loop. The outer loop continues until the policy is not updated anymore.
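The nested loops can be sketched as follows. This is a minimal example under my own assumptions rather than the pseudocode itself: the chain MDP is hypothetical (states 0 and 1, terminal state 2), the policy is stored as one greedy action per state, and the initial policy is chosen arbitrarily.

```python
GAMMA = 0.9

# Hypothetical chain MDP: model[(s, a)] lists (probability, next_state, reward).
model = {
    (0, "left"): [(1.0, 0, -0.04)],
    (0, "right"): [(1.0, 1, -0.04)],
    (1, "left"): [(1.0, 0, -0.04)],
    (1, "right"): [(0.8, 2, 0.96), (0.2, 1, -0.04)],
}
NONTERMINAL = [0, 1]
ACTIONS = ["left", "right"]


def q_value(v, s, a):
    """sum over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(prob * (r + GAMMA * v[s_next]) for prob, s_next, r in model[(s, a)])


def evaluate(policy, theta=1e-12):
    """Inner loop: iterative policy evaluation with the policy fixed."""
    v = {0: 0.0, 1: 0.0, 2: 0.0}          # state 2 is terminal and stays 0
    while True:
        delta = 0.0
        for s in NONTERMINAL:
            new_value = q_value(v, s, policy[s])
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value               # in-place update
        if delta < theta:
            return v


def policy_iteration():
    policy = {0: "left", 1: "right"}       # arbitrary initial policy
    rounds = 0
    while True:                            # outer loop
        v = evaluate(policy)               # (1) policy evaluation
        improved = {                       # (2) greedy policy improvement
            s: max(ACTIONS, key=lambda a: q_value(v, s, a)) for s in NONTERMINAL
        }
        rounds += 1
        if improved == policy:             # the policy is not updated anymore
            return policy, v, rounds
        policy = improved


policy, v, rounds = policy_iteration()
print(policy, rounds)
```

In this tiny case the policy changes only once (at state 0, from "left" to "right"), and the second round just confirms that no action is updated anymore.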

On the other hand, in value iteration there is mainly only one loop, which updates v_{k}(s) so that it converges to v_{\ast}(s). The output policy is then calculated in the same greedy way as in policy iteration, with the estimated optimal value function. According to the book by Sutton and Barto, value iteration can be comprehended as policy iteration whose policy evaluation loop is truncated after only one sweep, with the max over actions folded into the value update, so that an explicit policy is computed only once at the end.
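The relation between the two algorithms can be checked with a small experiment. In the sketch below (a hypothetical chain MDP again, with names of my own choosing), one sweep of value iteration is compared with "greedy policy improvement followed by exactly one sweep of policy evaluation," both starting from the same value function.

```python
GAMMA = 0.9

# Hypothetical chain MDP: model[(s, a)] lists (probability, next_state, reward).
model = {
    (0, "left"): [(1.0, 0, -0.04)],
    (0, "right"): [(1.0, 1, -0.04)],
    (1, "left"): [(1.0, 0, -0.04)],
    (1, "right"): [(0.8, 2, 0.96), (0.2, 1, -0.04)],
}
NONTERMINAL = [0, 1]
ACTIONS = ["left", "right"]


def q_value(v, s, a):
    """sum over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(prob * (r + GAMMA * v[s_next]) for prob, s_next, r in model[(s, a)])


def greedy_policy(v):
    """Policy improvement: pick the action with the largest q-value in each state."""
    return {s: max(ACTIONS, key=lambda a: q_value(v, s, a)) for s in NONTERMINAL}


def value_iteration_sweep(v):
    """v_{k+1}(s) = max_a q(s, a): the max is taken inside the value update."""
    new_v = dict(v)
    for s in NONTERMINAL:
        new_v[s] = max(q_value(v, s, a) for a in ACTIONS)
    return new_v


def improve_then_evaluate_once(v):
    """Greedy improvement followed by a single sweep of policy evaluation."""
    policy = greedy_policy(v)
    new_v = dict(v)
    for s in NONTERMINAL:
        new_v[s] = q_value(v, s, policy[s])
    return new_v


v = {0: 0.0, 1: 0.0, 2: 0.0}
agreements = []
for _ in range(50):
    agreements.append(value_iteration_sweep(v) == improve_then_evaluate_once(v))
    v = value_iteration_sweep(v)

final_policy = greedy_policy(v)   # the policy is extracted only once, at the end
print(all(agreements), final_policy)
```

The two updates give the same numbers at every sweep, which is why value iteration does not need to keep an explicit policy during the loop.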

As I repeated, I think policy iteration is more than just a single algorithm. And relations of values and policies should be discussed carefully rather than just following pseudocode. And whatever RL algorithms you learn, I think more or less you find some similarities to policy iteration. Thus in the next article, I would like to introduce policy iteration in more abstract ways. And I am going to take a rough look at various major RL algorithms with the keywords of “values” and “policies” in the next article.

Appendix

I mentioned that the Bellman equation is not obvious. In this section, I am going to introduce a mathematical derivation, which I think is the most straightforward. If you are allergic to mathematics, the part below is not recommended, but the Bellman equation is the core of RL. I would not say this is difficult, and if you are going to read some texts on RL that include equations, I think mastering the operations I explain below is almost mandatory.

First of all, let’s organize some important points. But please tolerate inaccuracy of mathematical notations here. I am going to follow notations in the book by Sutton and Barto.

  • Capital letters usually denote random variables. For example X, Y,Z, S_t, A_t, R_{t+1}, S_{t+1}. And corresponding small letters are realized values of the random variables. For example x, y, z, s, a, r, s'. (*Please do not think too much about the number of 's on the small letters.)
  • Conditional probabilities in general are denoted as for example \text{Pr}\{X=x, Y=y | Z=z\}. This means the probability of x, y are sampled given that z is sampled.
  • In the book by Sutton and Barto, a probabilistic function p(\cdot) means a probability of transition, but I am using p(\cdot) to denote probabilities in general. Thus p( s', a, r | s) shows the probability that, given an agent being in state s at time t, the agent will do action a, AND doing this action will cause the agent to proceed to state s' at time t+1, and receive reward r. p( s', a, r | s) is not defined in the book by Sutton and Barto.
  • The following equation holds for any conditional probabilities: p(x, y|z) = p(x|y, z)p(y|z). Thus importantly, p(s', a, r|s) = p(s', r| s, a)p(a|s)=p(s', r | s, a)\pi(a|s).
  • When random variables X, Y are discrete random variables, a conditional expectation of X given Y=y is calculated as follows: \mathbb{E}[X|Y=y] = \sum_{x}{x \, p(x|Y=y)}.
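A conditional expectation of a discrete random variable weights each realized value x by its conditional probability p(x|Y=y). As a minimal sketch, the joint distribution below is a made-up table of mine, and p(x|Y=y) is obtained from it as p(x, y) / p(y).

```python
# Hypothetical joint distribution p(x, y) of two discrete random variables.
joint = {
    (1, "a"): 0.1,
    (2, "a"): 0.3,
    (1, "b"): 0.4,
    (2, "b"): 0.2,
}


def conditional_expectation(joint, y):
    """E[X | Y = y] = sum over x of x * p(x | Y = y)."""
    p_y = sum(p for (_, y_value), p in joint.items() if y_value == y)
    return sum(x * p / p_y for (x, y_value), p in joint.items() if y_value == y)


print(conditional_expectation(joint, "a"))   # (1 * 0.1 + 2 * 0.3) / 0.4
print(conditional_expectation(joint, "b"))   # (1 * 0.4 + 2 * 0.2) / 0.6
```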

Keeping the points above in mind, let’s get down to business. First, according to the definition of a value function under a policy \pi and the linearity of expectation, the following equations hold.

v_{\pi}(s) = \mathbb{E} [G_t | S_t =s] = \mathbb{E} [R_{t+1} + \gamma G_{t+1} | S_t =s]

=\mathbb{E} [R_{t+1} | S_t =s] + \gamma \mathbb{E} [G_{t+1} | S_t =s]

Thus we need to calculate \mathbb{E} [R_{t+1} | S_t =s] and \mathbb{E} [G_{t+1} | S_t =s]. \mathbb{E} [R_{t+1} | S_t =s] is the sum of p(s', a, r |s) r over all the combinations of (s', a, r). And according to one of the points above, p(s', a, r |s) = p(s', r | s, a)p(a|s)=p(s', r | s, a)\pi(a|s). Thus the following equation holds.

\mathbb{E} [R_{t+1} | S_t =s] = \sum_{s', a, r}{p(s', a, r|s)r} = \sum_{s', a, r}{p(s', r | s, a)\pi(a|s)r}.

Next we have to calculate

\mathbb{E} [G_{t+1} | S_t =s]

= \mathbb{E} [R_{t + 2} + \gamma R_{t + 3} + \gamma ^2 R_{t + 4} + \cdots | S_t =s]

= \mathbb{E} [R_{t + 2}  | S_t =s] + \gamma \mathbb{E} [R_{t + 3} | S_t =s]  + \gamma ^2\mathbb{E} [ R_{t + 4} | S_t =s]  +\cdots.

Let’s first calculate \mathbb{E} [R_{t + 2}  | S_t =s]. \mathbb{E} [R_{t + 2}  | S_t =s] is a sum of p(s'', a', r', s', a, r|s)r' over all the combinations of (s'', a', r', s', a, r).

\mathbb{E}_{\pi} [R_{t + 2}  | S_t =s] =\sum_{s'', a', r', s', a, r}{p(s'', a', r', s', a, r|s)r'}

=\sum_{s'', a', r', s', a, r}{p(s'', a', r'| s', a, r, s)p(s', a, r|s)r'}

=\sum_{ s', a, r}{p(s', a, r|s)} \sum_{s'', a', r'}{p(s'', a', r'| s', a, r, s)r'}

I would like you to remember that in a Markov decision process the next state S_{t+1} and the reward R_{t+1} depend only on the current state S_t and the action A_t at the time step.

Thus, among the variables s', a, r, s, only s' affects the following variables r', a', s'', r'', a'', s''', \dots. And again p(s', a, r |s) = p(s', r | s, a)p(a|s). Thus the following equations hold.

\mathbb{E}_{\pi} [R_{t + 2}  | S_t =s]=\sum_{ s', a, r}{p(s', a, r|s)} \sum_{s'', a', r'}{p(s'', a', r'| s', a, r, s)r'}

=\sum_{ s', a, r}{p(s', r|s, a)\pi(a|s)} \sum_{s'', a', r'}{p(s'', a', r'| s')r'}

= \sum_{ s', a, r}{p(s', r|s, a)\pi(a|s)} \mathbb{E}_{\pi} [R_{t+2}  | S_{t+1} = s'].

\mathbb{E}_{\pi} [R_{t + 3}  | S_t =s] can be calculated the same way.

\mathbb{E}_{\pi}[R_{t + 3}  | S_t =s] =\sum_{s''', a'', r'', s'', a', r', s', a, r}{p(s''', a'', r'', s'', a', r', s', a, r|s)r''}

=\sum_{s''', a'', r'', s'', a', r', s', a, r}{p(s''', a'', r'', s'', a', r'| s', a, r, s)p(s', a, r|s)r''}

=\sum_{ s', a, r}{p(s', a, r|s)} \sum_{s''', a'', r'', s'', a', r'}{p(s''', a'', r'', s'', a', r'| s', a, r, s)r''}

=\sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} \sum_{s''', a'', r'', s'', a', r'}{p(s''', a'', r'', s'', a', r'| s')r''}

=\sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} \mathbb{E}_{\pi} [R_{t+3}  | S_{t+1} = s'].

The same is true of calculating \mathbb{E}_{\pi} [R_{t + 4}  | S_t =s], \mathbb{E}_{\pi} [R_{t + 5}  | S_t =s]\dots.  Thus

v_{\pi}(s) =\mathbb{E} [R_{t+1} | S_t =s] + \gamma \mathbb{E} [G_{t+1} | S_t =s]

=\sum_{s', a, r}{p(s', r | s, a)\pi(a|s)r} + \gamma \mathbb{E} [R_{t + 2}  | S_t =s] + \gamma ^2 \mathbb{E} [R_{t + 3} | S_t =s]  + \gamma ^3\mathbb{E} [ R_{t + 4} | S_t =s]  +\cdots

=\sum_{s', a, r}{p(s', r | s, a)\pi(a|s)r} + \gamma \sum_{ s', a, r}{p(s', r|s, a)\pi(a|s)} \mathbb{E}_{\pi} [R_{t+2}  |S_{t+1}= s'] +\gamma^2 \sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} \mathbb{E}_{\pi} [R_{t+3} |S_{t+1} =  s'] +\gamma^3 \sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} \mathbb{E}_{\pi} [ R_{t+4}|S_{t+1} =  s'] + \cdots

=\sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} [r + \gamma \mathbb{E}_{\pi} [R_{t+2}+ \gamma R_{t+3}+\gamma^2R_{t+4} + \cdots |S_{t+1} =  s'] ]

=\sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} [r + \gamma \mathbb{E}_{\pi} [G_{t+1} |S_{t+1} =  s'] ]

=\sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} [r + \gamma v_{\pi}(s') ]
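The result of this derivation can also be checked numerically on a tiny case. In the sketch below, which uses a hypothetical two-state MDP and names of my own choosing, the expectation of the return is calculated in the brute-force way of section 2, by enumerating trajectories with their probabilities, and then compared with the solution of the Bellman equation. The enumeration is cut off after 60 steps, which is an assumption that only works because longer trajectories have negligible probability in this example.

```python
GAMMA = 0.9

# Hypothetical MDP: model[(s, a)] lists (probability, next_state, reward).
model = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "go"): [(1.0, 1, 1.0)],
}
policy = {0: {"stay": 0.5, "go": 0.5}}   # pi(a | s)
TERMINAL = {1}


def trajectories(s, horizon):
    """Yield (probability, discounted return) of each trajectory starting from s."""
    if s in TERMINAL or horizon == 0:
        yield 1.0, 0.0                    # nothing more to collect
        return
    for a, pi_a in policy[s].items():
        for prob, s_next, r in model[(s, a)]:
            for p_rest, g_rest in trajectories(s_next, horizon - 1):
                # Probability multiplies along the path, the return follows
                # the recursion G_t = R_{t+1} + gamma * G_{t+1}.
                yield pi_a * prob * p_rest, r + GAMMA * g_rest


# Brute force: sum of (probability of a trajectory) * (return of the trajectory).
brute_force = sum(p * g for p, g in trajectories(0, horizon=60))

# Bellman equation for this MDP: v(0) = 0.5 * 1 + 0.5 * GAMMA * v(0).
bellman_solution = 0.5 / (1 - 0.5 * GAMMA)

print(brute_force, bellman_solution)
```

The brute-force enumeration is feasible here only because the MDP is tiny, which is exactly why the recursive form is needed in general.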

5 AI Tricks to Grow Your Online Sales

The way people shop is currently changing. This only means that online stores need optimization to stay competitive and answer to the needs of customers. In this post, we’ll bring up the five ways in which you can use artificial intelligence technology in an online store to grow your revenues. Let’s begin!

1. Personalization with AI

The first AI trend worth covering is a step up in personalization. Did you know that, according to a survey by Accenture, more than 90% of shoppers are likelier to buy from stores and brands that offer suitable product recommendations?

This is exactly where artificial intelligence can give you a big hand. Such progressive technology analyzes the behavior of your consumers individually, keeping in mind their browsing and purchasing history. After collecting all the data, AI draws the necessary conclusions and offers those product recommendations that the user might like.

Look at the example below, where the block has a carousel of neat product options. Obviously, this “move” can give a big boost to average cart sizes.

Screenshot taken on the official Reebok website

2. Smarter Search Options

With the rising popularity of AI voice assistants and the leap in technology in general, the way people look for things on the web has changed. Everything is moving towards saving time and getting faster, better results.

One such trend deals with embracing speech-to-text and image search technology. Did you notice how many search bars have “microphone icons” for speaking your request?

On a similar note, numerous sites have made a big jump forward after incorporating search by picture. In this case, uploaded photos get analyzed by artificial intelligence technology. The system studies what’s depicted on the image and cross-checks it with the products sold in the store. In several seconds the user is provided with a selection of similar products.

Without any doubt, this greatly helps users find what they were looking for faster. As you might have guessed, this is a time-saving feature. In essence, this omits the necessity to open dozens of product pages on multiple sites when seeking out a liked item that they’ve taken a screenshot or photo of.

Check out how such a feature works on the official Amazon website by taking a look at the screenshots of StyleSnap provided below.

Screenshot taken on the official Amazon StyleSnap website

3. Assisting Clients via Chatbots

The next point on the list is devoted to AI chatbots. This feature can be a real magic wand for client support, which in turn benefits online sales.

Real customer support specialists usually aren’t available 24/7. And keeping in mind that most requests are on repetitive topics, having a chatbot instantly handle many of the questions is a neat way to “unload” the work of humans.

Such chatbots use machine learning to get better at understanding and processing client queries. How do they work? They’re “taught” via scripts and scenario schemes. Therefore, the more data you supply them with, the more matters they’ll be able to cover.

Case in point, there’s such a chat available on the official Victoria’s Secret website. If the user launches the Digital Assistant, the messenger bot starts the conversation. Based on the topic the user selects from the options, the bot determines what will be discussed.

Screenshot taken on the official Victoria’s Secret website

4. Determining Top-Selling Product Combos

This AI use case for boosting online revenues is similar to the one mentioned in the first point: it becomes much easier to cross-sell products when artificial intelligence “cracks” the actual top matches. Based on the findings by Sumo, you can boost your revenues by 10 to 30% if you upsell wisely!

The product database of an online store gets larger by the month, making it harder to know for certain which items go well together and complement each other. With AI on your analytics team, you don’t have to scratch your head guessing which products people are likely to buy along with the item they’re browsing at the moment. This work of singling out data can be done for you.

As seen on the screenshot from the official MAC Cosmetics website, the upselling section on the product page presents complementary items in a carousel. Thus, the chance of these products getting added to the shopping cart increases (compared to the situation where clients would have to search the site and find these products on their own).

Screenshot taken on the official MAC Cosmetics website

5. “Try It On” with a Camera

The fifth AI technology in this list is virtual try-on, which brings the power of augmented reality to the world of sales.

Especially in fields like cosmetics or accessories, it is important to find ways to help clients make up their minds and encourage them to buy an item without testing it physically. If you want, you can play around with such real-time functionality and put on makeup using your camera on the official Maybelline New York site.

Consumers ultimately become happier because this solution removes frustration and needless doubts. With everything evident and clear, people don’t need to take a shot in the dark about what will be a good match; they can see it.

Screenshot taken on the official Maybelline New York website

In Closing

To sum up, artificial intelligence marks a turning point for online retail. Incorporating various AI-powered features into an online store can be a neat advancement leading to visible growth in conversions.

Simple RNN

A brief history of neural nets: everything you should know before learning LSTM

This is not a college course or something on deep learning with strict deadlines for assignments, so let’s take a detour from practical stuff and take a brief look at the history of neural networks.

The history of neural networks is itself a big topic, so big that it would need another article series. Usually I would be supposed to begin such an article with something like “The term ‘AI’ was first used by John McCarthy at the Dartmouth conference in 1956…” but you can find many such texts written by people with much more experience in this field. Therefore I am going to write this article from my own point of view: as an intern writing articles on RNN, as a movie buff, and as one of many Japanese men who spent a great deal of their childhood with video games.

We are now in the third AI boom, and some researchers say this boom began in 2006. A professor at my university said that we are now in a kind of bubble economy in the machine learning/data science industry, but people used to say “Stop daydreaming” to AI researchers. The second AI winter was partly due to the vanishing/exploding gradient problem of deep learning, and LSTM was invented in 1997 as one way to tackle such problems.

1, First AI boom

In the first AI boom, I think people were literally “daydreaming,” even though the applications of machine learning algorithms were limited to simple tasks like playing chess or checkers, or searching for routes through 2D mazes. The AI of this time is sometimes called GOFAI (Good Old-Fashioned AI).

Even today, when someone uses the term “AI” merely for tasks with neural networks, that amuses me, because for me deep learning is just statistically and automatically training neural networks, which are capable of universal approximation, into classifiers/regressors. The algorithms behind that are actually quite impressive, but the structure of human brains is much more complicated. The hype around “AI” already started in this first AI boom. Let me take the example of machine translation in this video. In fact, research on machine translation already started in the early 1950s, and of specific interest at the time was translation between English and Russian, due to the Cold War. In the first article of this series, I said one of the most famous applications of RNN is machine translation, such as Google Translate and DeepL. They are a type of machine translation called neural machine translation because they use neural networks, especially RNNs. Neural machine translation was an astonishing breakthrough in the machine translation field around 2014. The former major type of machine translation was statistical machine translation, based on statistical language models. And the machine translators of the first AI boom were rule-based machine translators, which are more primitive than statistical ones.

The most remarkable invention of this time was of course the perceptron by Frank Rosenblatt. Some people say that this is the first neural network. Even though you can implement a perceptron with a few lines of code in Python, obviously they did not have Jupyter Notebook in those days. The perceptron was implemented as a huge instrument named the Mark 1 Perceptron, and it was composed of randomly connected wires. I do not know precisely how it worked, but it took a huge effort to implement even the most primitive type of neural network. They needed a big lighting fixture to get a 20×20 pixel image using a 20×20 array of cadmium sulphide photocells. The research by Rosenblatt, however, was criticized by Marvin Minsky in his book because perceptrons could only be used for linearly separable data. To make matters worse, the criticism spread in the form that more general, multi-layer perceptrons were also useless for linearly inseparable data (as I mentioned in the first article, multi-layer perceptrons, namely normal neural networks, can be universal approximators, which have the potential to classify/regress various types of complex data). In case you do not know what “linearly separable” means, imagine that there are data plotted on a piece of paper. If an elementary school kid can draw a border line between two clusters of the data with a ruler and a pencil on the paper, the 2D data is “linearly separable.”
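To make “linearly separable” concrete, here is a minimal sketch of a single perceptron in plain Python. It is not a reconstruction of Rosenblatt's machine: the function name `train_perceptron`, the learning rate, and the two toy datasets (logical AND and XOR) are my own assumptions for illustration. A straight line can separate the outputs of AND, so training stops making mistakes, while no straight line can separate the outputs of XOR.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Train one perceptron with the classic perceptron learning rule.

    samples: list of ((x1, x2), target) pairs with target 0 or 1.
    Returns (weights, bias, converged), where converged is True if an
    entire pass over the data produced no misclassification.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            # Step activation: fire only if the weighted sum is positive.
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - pred
            if delta != 0:
                # Move the border line towards the misclassified point.
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:
            return w, b, True
    return w, b, False


# Toy datasets (assumptions for illustration).
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND_DATA)
_, _, xor_converged = train_perceptron(XOR_DATA)
print("AND learned without errors:", and_converged)
print("XOR learned without errors:", xor_converged)
```

On XOR the loop runs out of epochs however large you make `epochs`, which is the limitation of single perceptrons that the criticism was about.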

With big disappointments to the research on “electronic brains,” the budget of AI research was reduced and AI research entered its first winter.

I think the frame problem (1969), by John McCarthy and Patrick J. Hayes, is also an iconic theory from the end of the first AI boom. It is often told as a story about a robot that has to retrieve its battery, which sits on a wheeled wagon in a room together with a time bomb. The first prototype of the robot, named R1, naively tried to pull the wagon out of the room, and the bomb exploded. The problem was obvious: R1 was not programmed to consider the risks of each action, so the researchers made the next prototype, named R1D1, which was programmed to consider the potential risks of each action. When R1D1 tried to pull out the wagon, it realized the risk of pulling the bomb out together with the battery. But then it started considering all the potential risks, such as the risk of the ceiling falling down, the distance between the wagon and each wall, and so on, until the bomb exploded. The next problem was also obvious: R1D1 was not programmed to distinguish whether a factor is relevant or irrelevant to the main purpose, so the next prototype, R2D1, was programmed to distinguish them. This time, R2D1 started thinking about “whether the factor is irrelevant to the main purpose” for every factor it measured, and again the bomb exploded. How can we get a perfect AI, R2D2?

The situation mentioned above is a bit extreme, but it is said that AI could also get stuck when it tries to take some very simple actions, like finding a number in a phone book and making a phone call. It is difficult for an artificial intelligence to decide what is relevant and what is irrelevant, whereas humans do not get stuck on such simple stuff, and the frame problem is sometimes counted as the most difficult and essential problem in developing AI. But personally I think the original frame problem was unreasonable in that McCarthy, in his attempts to model the real world, was inflexible in his handling of the various equations involved, treating them all with equal weight regardless of the particular circumstances of a situation. Some people say that McCarthy, who was an advocate for AI, also wanted to see the field come to an end, due to its failure to meet the high expectations it once aroused.

Not only the frame problem but also many other AI-related technological/philosophical problems have been proposed, such as the Chinese room (1980) and the symbol grounding problem (1990). They are thought of as hardships in inventing artificial intelligence, but I omit those topics in this article.

*The name R2D2 did not come from the famous story of the frame problem. Daniel Dennett first told the robot story in his paper published in 1984, whereas Star Wars was first released in 1977. It is said that the name R2D2 came from “Reel 2, Dialogue 2,” which George Lucas said during film shooting. And the design of C3PO came from Maria in Metropolis (1927). It is said that the most famous AI duo in movie history was inspired by Tahei and Matashichi in The Hidden Fortress (1958), directed by Kurosawa Akira.

Interestingly, 2001: A Space Odyssey, directed by Stanley Kubrick, was released in 1968, at the end of the first AI boom. Unlike conventional fantasy-like AI characters, for example Maria in Metropolis (1927), HAL 9000 was portrayed as a very realistic AI, and the movie already pointed out the risk of an AI going insane when it gets commands from several users. HAL 9000 is still a very iconic character in the AI field. For example, when you say some quotes from 2001: A Space Odyssey to Siri, you get parody responses. I also think you should keep in mind that in order to make an AI like HAL 9000 come true, for now RNNs would be indispensable in many ways: you would need RNNs for better voice recognition, better conversational systems, and for reading lips.

*Just as you cannot understand Monty Python references in the official Python tutorials without watching Monty Python and the Holy Grail, you cannot understand many parodies in AI contexts without watching 2001: A Space Odyssey. Even though the movie originally had interview footage with researchers and some narration, Stanley Kubrick cut all of that footage and made the movie very difficult to understand. Most people did not or do not understand that it is a movie about aliens who gave human beings the homework of coming to Jupiter.

2, Second AI boom/winter

I am not going to write about the second AI boom in detail, but at least you should keep in mind that the convolutional neural network (CNN) is a keyword of this time. Neocognitron, an artificial model of how sight nerves perceive things, was invented by Kunihiko Fukushima in 1980, and the model is said to be the origin of CNN. Neocognitron was inspired by Hubel and Wiesel’s research on sight nerves. In 1989, a group at AT&T Bell Laboratories led by Yann LeCun invented the first practical CNN to read handwritten digits.

Another turning point in this second AI boom was the discovery of the back propagation algorithm, and the CNN by LeCun was also trained with back propagation. In 1998 LeCun made a deep neural network with several layers for more practical uses.

But his research did not gain as much attention as it does today, because AI research entered its second winter at the beginning of the 1990s, and that was partly due to the vanishing/exploding gradient problem of deep learning. People knew that neural networks had the potential of universal approximation, but when they tried to train naively stacked neural nets, the gradients, which you need to train neural networks, exponentially increased or decreased. Even though the CNN made by LeCun was the first successful case of “deep” neural nets which did not suffer from the vanishing/exploding gradient problem, deep learning research also stagnated in this time.

The ultimate goal of this article series is to understand LSTM at a more abstract/mathematical level because it is one of the practical RNNs, but the idea of LSTM (Long Short Term Memory) itself was already proposed in 1997 as an RNN algorithm to tackle the vanishing gradient problem. (The exploding gradient problem is solved with a technique named gradient clipping, which is easier than the techniques for preventing vanishing gradients. I am also going to explain it in the next article.) After that, some other techniques, like introducing the forget gate and peephole connections, were discovered, but basically it took some 20 years until LSTM got the attention it has today. The reasons for that were the lack of hardware and data sets, which were also major reasons for the second AI winter.

In the 1990s, the middle of the second AI winter, the Internet started spreading for commercial uses. I think one of the iconic events of this time was that the source code of the WWW (World Wide Web) was made public in 1993. Some of you might still remember that you little by little became able to transmit more data online in this time. That means people gained more and more access to various datasets in those days, which is indispensable for machine learning tasks.

After all, we could not get HAL 9000 by the end of 2001, but instead we got the Xbox console.
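The exponential growth or decay of gradients, and the idea of gradient clipping, can be illustrated with a toy sketch. It assumes, purely for illustration, that every layer multiplies the gradient by the same constant factor; real networks multiply by Jacobian matrices that differ per layer. The function names `backprop_scale` and `clip_by_norm` are hypothetical and not taken from any library.

```python
import math


def backprop_scale(factor, num_layers):
    """Size of a gradient after passing back through num_layers layers,
    assuming each layer multiplies it by the same factor (a toy model)."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= factor
    return grad


def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so that its L2 norm is at most max_norm.

    The direction of the gradient is kept; only its length is limited.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


vanishing = backprop_scale(0.5, 50)   # shrinks towards zero
exploding = backprop_scale(1.5, 50)   # blows up
clipped = clip_by_norm([300.0, 400.0], max_norm=5.0)

print("factor 0.5, 50 layers:", vanishing)
print("factor 1.5, 50 layers:", exploding)
print("clipped gradient:", clipped)
```

Clipping only caps gradients that are too large. It cannot restore a gradient that has already shrunk to almost nothing, which is one way to see why the vanishing side of the problem needed a different remedy.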

3, Video game industry and GPU

Even though research on neural networks stagnated in the 1990s, the same period witnessed an advance in the computation of massively parallel linear transformations, due to the need for them in fields such as image processing.

Computer graphics move or rotate in 3D spaces, and those are also linear transformations. When you think about a car moving in a city, it is convenient to place the car, buildings, and other objects in a fixed 3D space. But when you need to make computer graphics of scenes of the city from a viewpoint inside the car, you put a moving origin point in the car and see the city from there. The spatial information of the city is calculated as vectors from the moving origin point. Of course this is also done with linear transformations. And I am not talking about a dot or simple figures moving in 3D space. Computer graphics are composed of numerous plane panels, each of which has at least three vertices, and they move in 3D space. Depending on the viewpoint, you need to project the graphics in 3D space onto a 2D space to display them on devices. You need to calculate which part of each panel is projected to which pixel on the display, and that is called rasterization. Plus, in order to get a photorealistic image, you need to think about how light from light sources reflects on the panels and is projected on the display. You also have to put textures on groups of panels. You might also need to change color spaces, which is also a linear transformation.

My point is, in short, you really need to do numerous linear transformations in parallel in image processing.
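As a small illustration of this point, the sketch below rotates the three vertices of one panel around the z-axis with a 3×3 rotation matrix. The function name `rotate_z` and the triangle are assumptions made for this example. Every vertex is multiplied by the same matrix independently of the others, which is exactly the kind of work that can be done in parallel.

```python
import math


def rotate_z(vertices, angle_rad):
    """Rotate 3D vertices around the z-axis by angle_rad radians.

    Each vertex is multiplied by the same 3x3 rotation matrix, so the
    per-vertex computations do not depend on each other.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    matrix = [
        [c, -s, 0.0],
        [s, c, 0.0],
        [0.0, 0.0, 1.0],
    ]
    return [
        tuple(sum(matrix[i][j] * v[j] for j in range(3)) for i in range(3))
        for v in vertices
    ]


# A single triangular panel (hypothetical coordinates).
triangle = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
rotated = rotate_z(triangle, math.pi / 2)  # quarter turn

for before, after in zip(triangle, rotated):
    print(before, "->", tuple(round(x, 6) for x in after))
```

A scene in a video game has a huge number of such vertices, and they have to be transformed again for every frame.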

When it comes to the use of CGI in movies, two pioneering movies were released during this time: Jurassic Park in 1993, and Toy Story in 1995. It is well known that Pixar used to be a computer graphics division of Lucasfilm, the company founded by George Lucas that also owns ILM (Industrial Light and Magic), and that Steve Jobs bought the division. Even though the members of Pixar had never made a feature-length film in their lives, after much trial and error they made the first CGI animated feature movie. On the other hand, in order to acquire funds for the production of Schindler’s List (1993), Steven Spielberg took on Jurassic Park (1993), consequently changing the history of CGI through this “side job.”

*I think you have realized that George Lucas is mentioned almost everywhere in this article. His influence on technology is not limited to image processing, but also extends to sound measuring systems and nonlinear editing systems. Photoshop was also originally developed under his company. I need another article series for this topic, but maybe not in Data Science Blog.

Considering that the first wire-frame computer graphics made and displayed by computers appeared in the scene showing the wire-frame structure of the Death Star in a war room, in Star Wars: A New Hope, the development of CGI was already astonishing at this time. But I think deep learning owes its development more to the video game industry.

*I said that the Death Star scene is the first use of graphics made and DISPLAYED by computers, because I have to say one of the first graphics in movie MADE by computer dates back to the legendary title sequence of Vertigo(1958).

When it comes to 3D video games, the processing unit has to constantly deal with real-time commands from controllers. It is well known that the GPU was originally designed specifically for plotting computer graphics. The video game market is the biggest in the entertainment industry in general, and it is said that the quality of computer graphics has the strongest correlation with video game sales; therefore enhancing this quality is a priority for video game console manufacturers.

One good way to see how much video games have developed is to compare the original Final Fantasy 7 with its remake. The original one was released in 1997, the same year LSTM was invented, and the remake of Final Fantasy 7 was finally released this year. The original one was made with a very big budget, and it was divided into three CD-ROMs. It was also very revolutionary given that the former titles of the Final Fantasy franchise were all 2D retro-style video games. But still the computer graphics look like polygons, and in almost all scenes the camera angle was fixed in the original one. On the other hand, the remake is very photorealistic, and you can move the angle of the camera as you want while you play.

There were also fierce battles by graphic processor manufacturers in computer video game market in the 1990s, but personally I think the release of Xbox console was a turning point in the development of GPU. To be concrete, Microsoft adopted a type of NV20 GPU for Xbox consoles, and that left some room of programmability for developers. The chief architect of NV20, which was released under the brand of GeForce3, said making major changes in the company’s graphic chips was very risky. But that decision opened up possibilities of uses of GPU beyond computer graphics.

I think that the idea of a programmable GPU provided other scientific fields with more visible benefits after CUDA was launched. And GPU gained its position not only in deep learning, but also many other fields including making super computers.

*When it comes to deep learning, even GPUs have strong rivals. The TPU (Tensor Processing Unit), made by Google, is specialized for deep learning tasks and has astonishing processing speed. And the FPGA (Field Programmable Gate Array), which was originally invented as a customizable electronic circuit, proved to be efficient for reducing the electricity consumption of deep learning tasks.

*I am not so sure about this GPU part. Processing units, including GPUs, are another big topic, which is beyond my capacity to be honest. I would appreciate it if you could share your view, and some references to confirm your opinion, in the comment section or via email.

*If you are interested you should see this video of game fans’ reactions to the announcement of Final Fantasy 7. This is the industry which grew behind the development of deep learning, and many fields where you need parallel computations owe themselves to the nerds who spent a lot of money for video games, including me.

*But ironically the engineers who invented the GPU said they did not play video games simply because they were busy. If you try to study the technologies behind video games, you would not have much time playing them. That is the reality.

We have seen that in this second AI winter, the Internet and GPUs laid the foundation of the next AI boom. But still the last piece of the puzzle is missing: let’s look at the breakthrough which solved the vanishing/exploding gradient problem of deep learning in the next section.

4, Pretraining of deep belief networks: “The Dawn of Deep Learning”

Some researchers say the invention of pretraining of deep belief networks by Geoffrey Hinton was a breakthrough which put an end to the last AI winter. Deep belief networks are a different type of network from the neural networks we have discussed, but their architectures are similar to those of neural networks. It was also unknown how to train deep belief nets when they have several layers. Hinton discovered that training the networks layer by layer in advance can tackle vanishing gradient problems. And later it was discovered that you can pretrain neural networks layer by layer with autoencoders.

*Deep belief network is beyond the scope of this article series. I have to talk about generative models, Boltzmann machine, and some other topics.

The pretraining techniques of neural networks are not mainstream anymore. But I think it is very meaningful to know that major deep learning techniques, such as ReLU activation functions, optimization with Adam, dropout, and batch normalization, came up as more effective algorithms for deep learning after the advent of the pretraining techniques, and now we are in the third AI boom.

In the article after next, we are finally going to work on LSTM. Specifically, I am going to offer a clearer guide to a well-made paper on LSTM, named “LSTM: A Search Space Odyssey.”

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Prerequisites for understanding RNN at a more mathematical level

Writing an article series on recurrent neural networks (RNN), titled “A gentle introduction to the tiresome part of understanding RNN,” is nothing like a creative or ingenious idea. It is quite an ordinary topic. But still I am going to write my own new article on this ordinary topic because I have been frustrated by the lack of sufficient explanations of RNN for slow learners like me.

I think many readers of articles on this website at least know that RNN is a type of neural network used for AI tasks such as time series prediction, machine translation, and voice recognition. But if you do not understand how RNNs work, especially during back propagation, this blog series is for you.

After reading this article series, I think you will be able to understand RNN in more mathematical and abstract ways. But in case some of the readers are allergic or intolerant to mathematics, I tried to use as little mathematics as possible.

Ideal prerequisite knowledge:

  • Some understanding of densely connected layers (or fully connected layers, multilayer perceptron) and how their forward/back propagation works.
  • Some understanding of the structure of Convolutional Neural Networks.

*In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

1, Difficulty of Understanding RNN

I bet part of the difficulty of understanding RNN comes from the variety of its structures. If you search “recurrent neural network” on Google Images or something, you will see what I mean. But that cannot be helped because RNN enables a variety of tasks.

Another major difficulty of understanding RNN is understanding its back propagation algorithm. I think some of you found it hard to understand the chain rule in calculating back propagation of densely connected layers, where you have to make the most of linear algebra. And I have to say backprop of RNN, especially LSTM, is a monster of chain rules. I am planning to upload not only a blog post on RNN backprop, but also presentation slides with animations to make it more understandable, in some external links.

In order to avoid such confusions, I am going to introduce a very simplified type of RNN, which I call a “simple RNN.” The RNN displayed as the head image of this article is a simple RNN.

2, How Neurons are Connected

How to connect neurons and how to activate them is what neural networks are all about. Structures of those neurons are easy to grasp as long as they belong to DCL or CNN. But when it comes to the structure of RNN, many study materials try to avoid showing that RNNs are also connections of neurons, just like DCL or CNN. (*If you are not sure how neurons are connected in CNN, this link should be helpful. Draw a random digit in the square at the corner.) In fact the structure of RNN is the same, and as long as it is a simple RNN, it is not hard to visualize its structure.

Even though RNN is also a set of connections of neurons, most RNN charts are simplified, using black boxes. In the case of simple RNN, most study materials would display it as the chart below.

But that also cannot be helped, because fancier RNNs have more complicated connections of neurons, and there is no longer an advantage in displaying RNN as connections of neurons. You would need to understand RNN in a more abstract way, I mean, as you see in most textbooks.

I am going to explain details of simple RNN in the next article of this series.

3, Neural Networks as Mappings

If you still think that neural networks are something like magical spider webs or models of brain tissues, forget that. They are just ordinary mappings.

If you have been allergic to mathematics in your life, you might have never heard of the word “mapping.” If so, at least please keep in mind that the equation y=f(x), which most people would have seen in compulsory education, is a kind of mapping. If you give a value x, you get a value y corresponding to the x.

But in the case of deep learning, x is a vector or a tensor, and it is denoted by \boldsymbol{x}. If you have never studied linear algebra, imagine that a vector is a column of Excel data (only one column), a matrix is a sheet of Excel data (with some rows and columns), and a tensor is several sheets of Excel data (each sheet does not necessarily contain only one column).

CNNs are mainly used for image processing, so their inputs are usually image data. Image data are in many cases (3, height, width) tensors because an image usually has red, green, and blue channels, and the image in each channel can be expressed as a height*width matrix (the “height” and the “width” are numbers of pixels, so they are discrete numbers).
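The Excel analogy can be written down with plain nested Python lists. This is only a sketch: the helper `shape` and the tiny 4×5 “image” are assumptions for illustration, and real code would normally use an array library instead.

```python
def shape(data):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0]
    return tuple(dims)


height, width = 4, 5  # a tiny hypothetical image

# A vector: one column of an Excel sheet.
vector = [0.0] * width
# A matrix: one whole sheet, with rows and columns.
matrix = [[0.0] * width for _ in range(height)]
# A tensor: several sheets, here one sheet per color channel.
image = [[[0.0] * width for _ in range(height)] for _ in range(3)]

print("vector shape:", shape(vector))
print("matrix shape:", shape(matrix))
print("image shape: ", shape(image))
```

The last shape has the form (3, height, width), the same layout as the image tensors described above.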

The convolutional part of CNN (which I call “feature extraction part”) maps the tensors to a vector, and the last part is usually DCL, which works as classifier/regressor. At the end of the feature extraction part, you get a vector. I call it a “semantic vector” because the vector has information of “meaning” of the input image. In this link you can see maps of pictures plotted depending on the semantic vector. You can see that even if the pictures are not necessarily close pixelwise, they are close in terms of the “meanings” of the images.

In the example of a dog/cat classifier introduced by François Chollet, the developer of Keras, the CNN maps (3, 150, 150) tensors to 2-dimensional vectors, (1, 0) or (0, 1) for (dog, cat).

Wrapping up the points above, at least you should keep two points in mind: first, DCL is a classifier or a regressor, and CNN is a feature extractor used for image processing. And another important thing is, feature extraction parts of CNNs map images to vectors which are more related to the “meaning” of the image.

Importantly, I would like you to understand RNN this way. An RNN is also just a mapping.

*I recommend that you at least take a look at the beautiful pictures in this link. These pictures give you some insight into how CNNs perceive images.

4, Problems of DCL and CNN, and needs for RNN

Taking an example of an RNN task should be helpful for this topic. Machine translation is probably the most famous application of RNN, and it is also a good example for showing why DCL and CNN are not suitable for some tasks. Its algorithm is out of the scope of this article series, but it gives you good insight into some features of RNN. I prepared three sentences in German, English, and Japanese, which have the same meaning. Assume that each sentence is divided into some parts as shown below and that each vector corresponds to one part. In machine translation we want to convert one set of vectors into another set of vectors.

Then let’s see why DCL and CNN are not suitable for such a task.

  • The input size is fixed: In the case of the dog/cat classifier I have mentioned, even though the sizes of the input images vary, they are first molded into (3, 150, 150) tensors. But in machine translation, the length of the input is usually supposed to be flexible.
  • The order of inputs does not matter: In the case of the dog/cat classifier from the last section, there is no difference whether the inputs come as “cat,” “cat,” “dog” or as “dog,” “cat,” “cat.” And in the case of DCL, the network is symmetric, so even if you shuffle the inputs, as long as you shuffle all of the input data in the same way, the DCL gives the same outcome. But if you have learned at least one foreign language, it is easy to imagine that the order of vectors in sequence data matters in machine translation.
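The second point can be checked on a single neuron of a densely connected layer. The sketch below is one reading of that claim, under the assumption that “shuffling in the same way” means applying the same permutation to the inputs and to the weights attached to them. The function `dense_neuron` and all the numbers are made up for illustration.

```python
def dense_neuron(inputs, weights, bias):
    """Weighted sum computed by one neuron of a densely connected layer
    (activation function omitted for simplicity)."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]
bias = 0.1

# Apply the same permutation to the inputs and to their weights.
permutation = [2, 0, 1]
shuffled_inputs = [inputs[i] for i in permutation]
shuffled_weights = [weights[i] for i in permutation]

original = dense_neuron(inputs, weights, bias)
shuffled = dense_neuron(shuffled_inputs, shuffled_weights, bias)

print("original order:", original)
print("shuffled order:", shuffled)
```

Both calls give the same value up to floating-point rounding, because a sum does not care about the order of its terms. The position of an input therefore carries no meaning of its own for a dense layer, whereas in a sentence the position of a word does.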

*It is said that the English language has a phrase structure grammar, whereas the Japanese language has a dependency grammar. In English, the order of words is important, but in Japanese, as long as the particles and conjugations are correct, the order of words is very flexible. In my impression, German grammar is between them. As long as you put the verb in the second position and the cases of the words are correct, the order is also relatively flexible.

5, Sequence Data

We can say DCL and CNN are not useful when you want to process sequence data. Sequence data are a type of data which are lists of vectors. And importantly, the orders of the vectors matter. The number of vectors in sequence data is usually called time steps. A simple example of sequence data is meteorological data measured at a spot every ten minutes, for instance temperature, air pressure, wind velocity, humidity. In this case the data is recorded as 4-dimensional vector every ten minutes.
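In code, the meteorological example can be sketched as a list of 4-dimensional vectors. All the readings below are made-up numbers used only for illustration, and the variable names are my own.

```python
# Each vector: (temperature in °C, air pressure in hPa,
#               wind velocity in m/s, humidity in %).
# One vector is recorded every ten minutes (hypothetical readings).
sequence = [
    (12.1, 1013.2, 3.4, 71.0),  # time step 0
    (12.4, 1013.0, 3.9, 70.0),  # time step 1
    (12.9, 1012.7, 4.1, 68.0),  # time step 2
]

time_steps = len(sequence)      # number of vectors in the sequence
feature_dim = len(sequence[0])  # dimension of each vector

# Reversing the list keeps the same vectors but makes different sequence
# data, because the order of the vectors is part of the information.
reversed_sequence = sequence[::-1]

print("time steps:", time_steps)
print("dimension of each vector:", feature_dim)
print("same data after reversing?", reversed_sequence == sequence)
```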

But this “time step” does not necessarily mean “time.” In the case of natural language processing (including machine translation), which I mentioned in the last section, the numbering of the vectors denoting each part of a sentence gives the “time steps.”

And RNNs are mappings from one sequence to another sequence.

*At least I found a paper on the universal approximation capability of RNNs for many-to-one tasks. But I have not found any papers on universal approximation for many-to-many RNN tasks. Please let me know if you find any clue on whether such approximation is possible. I am desperate to know that.

6, Types of RNN Tasks

RNN tasks can be classified into some types depending on the lengths of the input/output sequences (the “length” means the number of time steps of the input/output sequence data).

If you want to predict the temperature in 24 hours, based on time series data points from the last 96 hours, the task is many-to-one. If you sample data every ten minutes, the input size is 96*6=576 (the input data is a list of 576 vectors), and the output size is 1 (which is a value of temperature). Another example of a many-to-one task is sentiment classification. If you want to judge whether a post on SNS is positive or negative, the input size is very flexible (the length of the post varies). But the output size is one: the output is (1, 0) or (0, 1), which denotes (positive, negative).
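The arithmetic of the temperature example can be checked with a short sketch: 96 hours at six samples per hour gives 96 × 6 = 576 input time steps. The helper `make_many_to_one_pairs` and the synthetic series are my own assumptions for illustration. For simplicity each time step is a single number here, but it could just as well be a vector.

```python
SAMPLES_PER_HOUR = 6   # one sample every ten minutes
HISTORY_HOURS = 96     # how far back the input looks
HORIZON_HOURS = 24     # how far ahead the target lies

input_steps = HISTORY_HOURS * SAMPLES_PER_HOUR    # time steps per input
horizon_steps = HORIZON_HOURS * SAMPLES_PER_HOUR  # steps between last input and target


def make_many_to_one_pairs(series, input_steps, horizon_steps):
    """Slice one long series into (input sequence, single target) pairs."""
    pairs = []
    last_start = len(series) - input_steps - horizon_steps
    for start in range(last_start + 1):
        window = series[start:start + input_steps]
        # The target lies horizon_steps after the last sample of the window.
        target = series[start + input_steps - 1 + horizon_steps]
        pairs.append((window, target))
    return pairs


# Synthetic stand-in for a temperature record (not real measurements).
temperatures = [float(i % 50) for i in range(1000)]
pairs = make_many_to_one_pairs(temperatures, input_steps, horizon_steps)
```

Each pair consists of "many" inputs (576 time steps) and "one" output, which is exactly the shape of a many-to-one task.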

*The charts in this section are simplified models of the RNNs used for each task. Please keep in mind that they are not 100% correct, but I tried to make them as exact as possible compared to those in other study materials.

Music/text generation can be one-to-many tasks. If you give the first sound/word you can generate a phrase.

Next, let’s look at many-to-many tasks. Machine translation and voice recognition are likely to be major examples of many-to-many tasks, but here named entity recognition seems to be a proper choice. Named entity recognition is the task of finding proper nouns in a sentence. For example, given the two sentences “He said, ‘Teddy bears on sale!’” and “He said, ‘Teddy Roosevelt was a great president!’”, judging whether each “Teddy” is a proper noun or a common noun is named entity recognition.

Machine translation and voice recognition, which are more popular, are also many-to-many tasks, but they use more sophisticated models. In case of machine translation, the inputs are sentences in the original language, and the outputs are sentences in another language. When it comes to voice recognition, the input is data of air pressure at several time steps, and the output is the recognized word or sentence. Again, these are out of the scope of this article but I would like to introduce the models briefly.

Machine translation uses a type of RNN named the sequence-to-sequence model (which is often called the seq2seq model). This model is also very important for other natural language processing tasks in general, such as text summarization. A seq2seq model is divided into an encoder part and a decoder part. The encoder gives out a hidden state vector, and it is used as the input of the decoder part. The decoder part then generates text, using the output of the last time step as the input of the next time step.
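The decoding loop described above, where each output is fed back as the next input, can be sketched with toy stand-ins. Everything here is an assumption for illustration: `TOY_DICTIONARY`, `encode`, `decode_step`, and `translate` are hypothetical names, and a real encoder would return a vector rather than a lookup table.

```python
START, END = "<s>", "</s>"

# Hypothetical word-for-word dictionary, standing in for a trained model.
TOY_DICTIONARY = {"ich": "I", "liebe": "love", "dich": "you"}


def encode(source_tokens):
    """Toy encoder: summarize the whole input sentence as one 'hidden state'.

    The state is a table saying which output token follows which.
    It is only an illustration and breaks on repeated words.
    """
    targets = [TOY_DICTIONARY[token] for token in source_tokens]
    chain = [START] + targets + [END]
    return dict(zip(chain[:-1], chain[1:]))


def decode_step(hidden_state, previous_token):
    """Toy decoder step: pick the next token from the state and the last output."""
    return hidden_state[previous_token]


def translate(source_tokens, max_steps=20):
    hidden_state = encode(source_tokens)  # encoder output feeds the decoder
    token = START
    output = []
    for _ in range(max_steps):
        token = decode_step(hidden_state, token)  # last output becomes next input
        if token == END:
            break
        output.append(token)
    return output


translation = translate(["ich", "liebe", "dich"])
```

The point of the sketch is only the control flow: the encoder runs once over the whole input, and the decoder is called repeatedly until it emits an end token.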

Voice recognition is also a famous application of RNN, but it also needs a special type of RNN.

*To be honest, I don’t know what the state-of-the-art voice recognition algorithm is. The example in this article is a combination of an RNN and a collapsing function based on Connectionist Temporal Classification (CTC). In this model, the output of the RNN is much longer than the recorded words or sentences, so a collapsing function reduces it to an output of normal length.
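A minimal sketch of such a collapsing function, assuming the usual CTC convention that consecutive repeated labels are merged first and a special blank label is then dropped. The blank symbol `_` and the function name `ctc_collapse` are my choices for illustration.

```python
from itertools import groupby

BLANK = "_"  # assumed symbol for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop the blank label."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# One label per audio frame; the RNN output is longer than the word itself.
word = ctc_collapse("hh_e_ll_ll_oo")
```

The blank between the two groups of “l” is what allows a genuine double letter to survive the merging step.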

You might have noticed that the RNNs in the charts above are connected in both directions. Depending on the task, you need such bidirectional RNNs. I think it is also easy to imagine that such networks are necessary. Again, machine translation is a good example.

And interestingly, image captioning, which enables a computer to describe a picture, is a one-to-many task. As the output is a sentence, it is easy to imagine that the output is “many.” If it is a one-to-many task, the input is supposed to be a vector.

Where does the input come from? I told you that I was obsessed with the beauty of the last vector of the feature extraction part of a CNN. Surprisingly, the “beautiful” vector, which I call a “semantic vector,” is the input of the image captioning task (after some transformations, depending on the network model).

I think this article includes the major things you need to know as prerequisites when you want to understand RNNs at a more mathematical level. In the next article, I would like to explain the structure of a simple RNN and how it performs forward propagation.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

As Businesses Struggle With ML, Automation Offers a Solution

In recent years, machine learning technology and the business solutions it enables have developed into a big business in and of themselves. According to the industry analysts at IDC, spending on ML and AI technology is set to grow to almost $98 billion per year by 2023. In practical terms, that figure represents a business environment where ML technology has become a key priority for companies of every kind.

That doesn’t mean that the path to adopting ML technology is easy for businesses. Far from it. In fact, survey data seems to indicate that businesses are still struggling to get their machine learning efforts up and running. According to one such survey, it currently takes the average business as many as 90 days to deploy a single machine learning model. For 20% of businesses, that number is even higher.

From the data, it seems clear that something is missing in the methodologies that most companies rely on to make meaningful use of machine learning in their business workflows. A closer look at the situation reveals that the vast majority of data workers (analysts, data scientists, etc.) spend an inordinate amount of time on infrastructure work – and not on creating and refining machine learning models.

Streamlining the ML Adoption Process

To fix that problem, businesses need to turn to another growing area of technology: automation. By leveraging the latest in automation technology, it’s now possible to build an automated machine learning pipeline (AutoML pipeline) that cuts down on the repetitive tasks that slow down ML deployments and lets data workers get back to the work they were hired to do. With the right customized solution in place, a business’s ML team can:

  • Reduce the time spent on data collection, cleaning, and ingestion
  • Minimize human errors in the development of ML models
  • Decentralize the ML development process to create an ML-as-a-service model with increased accessibility for all business stakeholders

In short, an AutoML pipeline turns the high-effort functions of the ML development process into quick, self-adjusting steps handled exclusively by machines. In some use cases, an AutoML pipeline can even allow non-technical stakeholders to self-create ML solutions tailored to specific business use cases with no expert help required. In that way, it can cut ML costs, shorten deployment time, and allow data scientists to focus on tackling more complex modelling work to develop custom ML solutions that are still outside the scope of available automation techniques.

The Parts of an AutoML Pipeline

Although the frameworks and tools used to create an AutoML pipeline can vary, they all contain elements that conform to the following areas:

  • Data Preprocessing – Taking available business data from a variety of sources, cleaning it, standardizing it, and conducting missing value imputation
  • Feature Engineering – Identifying features in the raw data set to create hypotheses for the model to base predictions on
  • Model Selection – Choosing the right ML approach or hyperparameters to produce the desired predictions
  • Tuning Hyperparameters – Determining which hyperparameters help the model achieve optimal performance

As anyone familiar with ML development can tell you, the steps in the above process tend to represent the majority of the labour and time-intensive work that goes into creating a model that’s ready for real-world business use. It is also in those steps where the lion’s share of business ML budgets get consumed, and where most of the typical delays occur.

The Limitations and Considerations for Using AutoML

Given the scope of the work that can now become part of an AutoML pipeline, it’s tempting to imagine it as a panacea – something that will allow a business to reduce its reliance on data scientists going forward. Right now, though, the technology can’t do that. At this stage, AutoML technology is still best used as a tool to augment the productivity of business data teams, not to supplant them altogether.

To that end, there are some considerations that businesses using AutoML will need to keep in mind to make sure they get reliable, repeatable, and value-generating results, including:

  • Transparency – Businesses must establish proper vetting procedures to make sure they understand the models created by their AutoML pipeline, so they can explain why it’s making the choices or predictions it’s making. In some industries, such as in medicine or finance, this could even fall under relevant regulatory requirements.
  • Extensibility – Making sure the AutoML framework may be expanded and modified to suit changing business needs or to tackle new challenges as they arise.
  • Monitoring and Maintenance – Since today’s AutoML technology isn’t a set-it-and-forget-it proposition, it’s important to establish processes for the monitoring and maintenance of the deployment so it can continue to produce useful and reliable ML models.

The Bottom Line

As it stands today, the convergence of automation and machine learning holds the promise of delivering ML models at scale for businesses, which would greatly speed up the adoption of the technology and lower barriers to entry for those who have yet to embrace it. On the whole, that’s great news both for the businesses that will benefit from increased access to ML technology, as well as for the legions of data professionals tasked with making it all work.

It’s important to note, of course, that complete end-to-end ML automation with no human intervention is still a long way off. While businesses should absolutely explore building an automated machine learning pipeline to speed up development time in their data operations, they shouldn’t lose sight of the fact that they still need plenty of high-skilled data scientists and analysts on their teams. It’s those specialists that can make appropriate and productive use of the technology. Without them, an AutoML pipeline would accomplish little more than telling the business what it wants to hear.

The good news is that the AutoML tools that exist right now are sufficient to alleviate many of the real-world problems businesses face in their road to ML adoption. As they become more commonplace, there’s little doubt that the lead time to deploy machine learning models is going to shrink correspondingly – and that businesses will enjoy higher ROI and enhanced outcomes as a result.

Interview – There is no stand-alone strategy for AI, it must be part of the company-wide strategy

Ronny Fehling is Partner and Associate Director for Artificial Intelligence at Boston Consulting Group GAMMA. With more than 20 years of continually progressive experience in leading business and technology innovation, spearheading digital transformation, and aligning corporate strategy with Artificial Intelligence, he helps industry-leading organizations grow their top line and kick-start their digital transformation.

Ronny Fehling is also a speaker at Predictive Analytics World for Industry 4.0 in May 2020.

Data Science Blog: Mr. Fehling, you are consulting companies and business leaders about AI and how to get started with it. AI as a definition is often misleading. How do you define AI?

This is a good question. I think there are two ways to answer this:

From a technical definition, I often see expressions about “simulation of human intelligence” and “acting like a human”. I find these terms misleading more often than helpful. I studied AI back when it wasn’t yet “cool” and we were still in the middle of the AI winter. And yes, we have much more compute power and access to data, but we also think about data in a very different way. For me, I typically distinguish between machine learning, which uses algorithms and statistical methods to identify patterns in data, and AI, which for me attempts to interpret the data in a given context. So machine learning can help me identify and analyze frequency patterns in text and even predict the next word I will type based on my history. AI will help me identify ‘what’ I’m writing about – even if I don’t explicitly name it. It can tell me that when I’m asking “I’m looking for a place to stay” that I might want to see a list of hotels around me. In other words: machine learning can detect correlations and similar patterns, AI uses machine learning to generate insights.

I always wondered why top executives are so frequently asking about the definition of AI because at first it seemed to me not as relevant to the discussion on how to align AI with their corporate strategy. However, I started to realize that their question is ultimately about “What is AI and what can it do for me?”.

For me, AI can do three things really well, which humans cannot really do and previous approaches couldn’t cope with:

  1. Finding similar patterns in historical data. Imagine 20 years of data like maintenance or repair documents of a manufacturing plant. Although they describe work done on a multitude of products due to a multitude of possible problems, AI can use this to look for a very similar situation based on a current problem description. This can be used to identify a common root cause as well as a common solution approach, saving valuable time for the operation.
  2. Finding correlations across time or processes. This is often used in predictive maintenance use cases. Here, the AI tries to see what similar events typically happen some time before a failure happens. This way, it can alert the operator much earlier about an impending failure, say due to a change in the vibration pattern of the machine.
  3. Finding an optimal solution path based on many constraints. There are many problems in the business world where choosing the optimal path based on complex situations is critical. Let’s say that a sudden severe weather warning at an airport forces an airline to change its scheduling because of reduced airport capacity. Delays for some aircraft can cause disruptions because passengers or personnel are no longer able to make their connections. Knowing which aircraft to delay, which to cancel, and which to switch while causing the minimal amount of disruption to passengers, crew, maintenance and ground crew is something AI can help with.

The key now is to link these fundamental capabilities with the business context of the company and how it can ultimately help transform.

Data Science Blog: Companies are still starting with their own company-wide data strategy. And now they are talking about AI strategies. Is that something which should be handled separately?

In my experience – both based on having seen the implementations of several corporate data strategies as well as my upbringing at Oracle – the data strategy and AI strategy are co-dependent and cannot be separated. Very often I hear from clients that they think they first need to bring their data in order before doing AI projects. And yes, without good data access, AI cannot really work. In fact, most of the time spent on AI is spent on processing, cleansing, understanding and contextualizing the data. However, you cannot really know what data will be needed in which form without knowing what you want to use it for. This is why strategies that handle data and AI separately mostly fail and generate huge costs.

Data Science Blog: What are the important steps for developing a good data strategy? Is there something like a general approach?

In my eyes, the AI strategy defines the data strategy step by step as more use cases are implemented. Rather than focusing too quickly on how to get all corporate data into a data lake, it is much more important to start creating a use-case, technology and data governance. This governance has to be established once the AI strategy is starting to mature, to enable scale-up and productization. The starting point is to find the (very few) use cases that can serve as lighthouse projects to demonstrate (1) value impact, (2) a way to go from MVP to pilot, and (3) how to address the data challenge. This will then more naturally identify the elements of governance, data access and technology that are required.

Data Science Blog: What are the most common questions from business leaders to you regarding AI? Why do they hesitate to get started?

By far the most common question I get is: how do I get started? The hesitations often come from multiple sources like: “We don’t have the talent in house to do AI”, “Our data is not good enough”, “We don’t know which use-case to start with”, “It’s not easy for us to embrace agile and failure culture because our products are mission critical”, “We don’t know how much value this can bring us”.

Data Science Blog: Most managers prefer to start small and with lower risk. They seem to postpone bigger ideas to a later stage, at least some milestones should be reached. Is that a good idea or should they think bigger?

AI is often associated (rightfully so) with a new way of working – agile and embracing failures. Similarly, there is also the perception of significant cost to starting with AI (talent, technology, data). These perceptions often lead managers to want to start with several use cases of smaller ambition where failure isn’t that grave. Once these have proven themselves somehow, they would then move on to bigger projects. The problem with this strategy is, on the one hand, that you fragment your few precious AI resources across too many projects, and at the same time you cannot really demonstrate an impact, since the projects weren’t chosen based on their impact potential.

The AI pioneers typically were successful by “thinking big, starting small and scaling fast”. You start by assessing the value potential of a use-case, for example: my current OEE (Overall Equipment Efficiency) is at 65%. There is an addressable loss of 25% which would grow my top line by $X. With the help of AI experts, you then create a hypothesis of how you think you can reduce that loss. This might be by choosing one specific equipment and 50% of the addressable loss. This is now the measure against which you define your failure or non-failure criteria. Once you have proven an MVP that can solve this loss, you scale up by piloting it in real-life setting and then scaling it to all the equipment. At every step of this process, you have a failure criterion that is measured by the impact value.


Predictive Analytics World for Industry 4.0: Virtual Edition, 11-12 May 2020. The premier machine learning conference for Industry 4.0.

This year Predictive Analytics World for Industry 4.0 runs alongside Deep Learning World and Predictive Analytics World for Healthcare.

AI For Advertisers: How Data Analytics Can Change The Maths Of Advertising?

All Images Credit: Freepik

The task of understanding a customer’s journey and designing your marketing strategy accordingly can be difficult in this data-driven world. Today, the customer expresses their needs in myriad forms of requests.

Consumers express their needs, wants, attitudes, and values in various forms through search, comments, blogs, Tweets, “likes,” videos, and conversations, and they do so across many channels like web, mobile, and face to face. The volume, variety, velocity and veracity of the data accumulated through these customer interactions are huge.

Big Data and data analytics can be leveraged to understand several phases of the customer journey. Using Artificial Intelligence for marketing data analysis involves risks such as data breaches and even manipulation. But AI does have bright prospects when it comes to marketing and advertising applications.

As Joshua Davidson, marketer and CEO of the technology firm Chop Dawg, puts it, “AI-powered apps are going to be the future for us, and there are several industries that are ripe for this.” The mobile-first strategy of many enterprises has powered the use of AI for digital marketing and for developing technologies and innovations that equip industries with intelligent systems.

How are AI and machine learning affecting customer journeys?

Any consumer journey begins with the recognition of a problem, followed by stages like initial consideration, active evaluation, purchase, and post-purchase, until the journey is over. Marketers therefore need to identify consumers’ purchasing patterns and needs and find the buyer personas in order to plan their marketing for them.

Need and Want Recognition:

Identifying a need is quite difficult, as it is the earliest stage of a consumer’s journey and happens at the category level rather than at the brand level. Marketers and advertisers rely on techniques like market research, web analytics, and data mining to build consumer profiles and buyer personas for understanding needs and influencing the purchase of products. Because consumers usually express their needs and wants online, AI can help identify them in real time and help build profiles more quickly.

AI technologies offered by several firms help in consumer profiling. Firms like Microsoft offer platforms such as Azure, which crunches billions of data points in seconds to determine the needs of consumers. It then personalizes web content on specific platforms in real time to align with those needs. Consumers’ digital footprints evolve through social media status updates, purchasing behavior, online comments and posts. AI updates these profiles continuously through machine learning techniques.

Initial Consideration:

A key objective of advertising is to insert a brand into the consideration set of the consumers when they are looking for deliberate offerings. Advertising includes increasing the visibility of brands and emphasizing the key reasons for consideration. Advertisers currently use search optimization, paid search advertisements, organic search, or advertisement retargeting to increase the probability of consumer consideration.

AI can leverage machine learning and data analytics to help with the search, identify and rank functions of consumer consideration, so that offerings match consumers’ considerations at any specific time. Take Google AdWords as an example: it analyzes consumer data and helps advertisers make clearer distinctions between qualified and unqualified leads for better targeting.

Google uses AI to analyze search-query data by considering not only the keywords but also context words and phrases, consumer activity data and other Big Data. Google then identifies valuable subsets of consumers, which enables more accurate targeting.

Active Evaluation: 

When consumers narrow their options down to a few brands, advertisers need to instill trust in and convey the value of their brands. A common technique is to identify the consumers most likely to purchase and win them over with persuasive content and advertisements. AI can support these tasks using some techniques:

Predictive Lead Scoring: Predictive lead scoring leverages the machine learning techniques of predictive analytics to let marketers make accurate predictions about consumers’ purchase intent. A machine learning algorithm runs through a database of existing consumer data, recognizes trends and patterns and, after processing external data on consumer activities and interests, creates robust consumer profiles for advertisers.

Natural Language Generation: By leveraging image recognition, speech recognition and natural language generation, machine learning enables marketers to curate content while learning from consumer behavior in real time, and to adjust the content according to the profiles on the fly.

Emotion AI: Marketers use emotion AI to understand consumer sentiment and feelings about the brand in general. By tapping into reviews, blogs or videos, they understand the mood of customers. Marketers also use emotion AI to pretest advertisements before their release. A famous example is Kellogg’s, which used emotion AI to help devise an advertising campaign for its cereal, eliminating advertisement executions whenever consumer engagement dropped.

Purchase: 

As the consumers decide which brands to choose and what it’s worth, advertising aims to move them out of the decision process and push for the purchase by reinforcing the value of the brand compared with its competition.

Advertisers can convey such value by emphasizing convenience, providing information about where and how to buy the product, and reassuring consumers of its value through warranties and guarantees. Many marketers also emphasize rapid return policies and purchase incentives.

AI can completely change the purchase process through dynamic pricing, which encompasses real-time price adjustments on the basis of information such as demand and other consumer-behavior variables, seasonality, and competitor activities.

Post-Purchase: 

Aftersales services can be improved through intelligent systems using AI technologies and machine learning techniques. Marketers and advertisers can hire dedicated developers to design intelligent virtual agents or chatbots that can reinforce the value and performance of a brand among consumers.

Marketers can leverage an intelligent technique known as Propensity modeling to identify the most valuable customers on the basis of lifetime value, likelihood of reengagement, propensity to churn, and other key performance measures of interest. Then advertisers can personalize their communication with these customers on the basis of these data.

Conclusion:

AI has shifted the focus of advertisers and marketers towards customer-first strategies and enhanced the heuristics of customer engagement. Machine learning and IoT (Internet of Things) have already changed the way customers interact with brands, and this transition has come at a time when advertisers and marketers are looking for new ways to tap into the customer mindset and buyer personas.

Why Retailers Are Making the Push for Stronger Data Science and AI

Retail relies on what the customer wants and needs at that moment, no matter the size of the company. Making judgments without consumer input would probably work for a little while but will fall flat as soon as the business model becomes outdated. In today’s technology-run world, things can become obsolete in a matter of days or even hours.

Retailers are the businesses most in need of capitalizing on what the customer wants in real-time. They have started to use data science and information from the Internet of Things (IoT) to not only stay in business, but also get ahead of other brands.

Artificial intelligence (AI) adds a new layer by using modern technology. The details of why retailers want to use these new practices are a bit more specific, though.

Data Targets Audiences

By using current customer data compared to information from the IoT, retailers can learn more about their audience and find better means of targeting them. Demographics like age, location and many other factors could affect advertising and even shopping, not to mention holidays throughout the year an audience celebrates.

Websites also need to be customized to suit the target audience. Those that are mobile-friendly and focused on what shoppers want can increase revenue, but the wrong approach can drive away new and existing customers. AI can help companies understand that data and present it back to the customer seamlessly, providing different options for various audiences.

Customer Base Expansion

Customer success should mean business success, as well. Growing a client base is something data science can assist with. However, helping customers grow is another type of service few companies provide but all people appreciate. A business can expand by offering new products and services that are relevant to their audience through the use of data.

Once a company learns what current customers want and begins to meet their needs, it can expand to more audiences. With data science, a business can ensure it does so slowly, giving current customers more of what they want while also finding new ones. The data can tell what sort of interests they all share so companies can capitalize on the venture.

AI Helps Customer Service

AI helps out customer service on both ends. Employees don’t have to focus on common problems that could easily be resolved, and clients often walk away happier than if they were to speak to a real person. This doesn’t work for every problem, especially ones that are specific in nature, but it can assist with more common issues. This is where chatbots enter the stage.

An AI-supported chatbot can give immediate support, provide suggestions, answer direct questions and offer almost any other form of help needed. Customers get personalized attention, and businesses can work faster toward customer loyalty.

Again, speaking to a real person when they have problems is a big plus for customers, but not for issues they know could be resolved in the time it takes to wait on the line for a representative.

Supply and Demand

Price optimization has taken on a bigger role than it has in the past. Mostly, data science is looking at supply and demand in real-time rather than having price fluctuations occur months after the business loses money. Having the right price can also help create more promotions for products and services, rewarding loyal customers for their shopping.

The data has to be gained from multiple channels by using price optimization tools, which focus on using data correctly in a company’s favor. The information doesn’t just look at supply and demand, but also examines locations, times, customer attitudes, competitor pricing and many other factors. All these pieces of information can be delivered in real-time so prices can be changed accordingly.

Taking the Competition

The thing about data science is that businesses are already utilizing it to their full potential and getting more customers than ever. The only way to get ahead of the competition is to at least start using the tools they’ve had at their disposal for years.

Target was one such company that took up the data helm. During 2012 and 2013, it saw a pretty sizeable dip in sales, but its online sales went up by almost 30% during the same time.

Data and Retail

When running a retail business, especially one that’s branching off into a franchise, using data is imperative. Data science and AI have become extremely important to companies both big and small.

Applying it correctly can help enterprises of any size and in every industry take things to the next level.

Even if a company is just starting out, sticking the first landing with a target audience is a fantastic way to begin the adventure and find success.

Glorious career paths of a Big Data Professional

Are you wondering about the career profiles you could fill if you get into the Big Data industry? If yes, then bingo! This post will tell you just that. Big Data is only an umbrella term, and many profiles and career paths are covered under it. Let us have a look at some of them.

Data Visualisation Specialist

Data visualization is becoming critical in ensuring that data-driven employees get the buy-in required to implement ambitious and important Big Data projects in their organization. Making your data tell a story, and the art of visualizing data convincingly, has become a significant part of the Big Data world, and organizations increasingly want to have these capabilities in-house. These professionals are usually expected to know how to visualize data in various tools, for example Spotfire, D3, Carto, and Tableau, among many others. Data Visualization Specialists need to be adaptable and curious so that they keep up with the latest trends and solutions and can tell their data stories in the most interesting way possible in the board room.

 

Big Data Architect

This is where the Hadoop experts come in. Typically, a Big Data architect addresses specific data problems and requirements, and is able to describe the structure and behavior of a Big Data solution using the technology in which they specialize, which is, more often than not, Hadoop.

These employees act as an important link between the organization (and its particular needs) and Data Scientists and Engineers. Any organization that wants to build a Big Data environment will need a Big Data architect who can comfortably manage the complete lifecycle of a Hadoop solution, including requirement analysis, platform selection, technical architecture design, application design and development, testing, and lastly the much-dreaded task of deployment.

Systems Architect 

This Big Data professional is in charge of how your big data systems are architected and interconnected. Their primary value to your team lies in their ability to leverage their software engineering background and experience with large-scale distributed processing systems to manage your technology choices and implementation processes. You’ll need this person to build a data architecture that aligns with the business, along with high-level planning for its development. He or she will consider various constraints, adherence to standards, and differing needs across the business.

Here are some of their responsibilities:

    • Determine structural requirements of databases by analyzing client operations, applications, and programming; review objectives with clients and evaluate current systems.
    • Develop database solutions by designing the proposed system; define the physical database structure and functional capabilities, security, backup, and recovery specifications.
    • Install database systems by developing flowcharts; apply optimal access techniques, coordinate installation actions, and document actions.
    • Maintain database performance by identifying and resolving production and application development problems; calculating optimal values for parameters; evaluating, integrating, and installing new releases; completing maintenance; and answering user questions.
    • Provide database support by coding utilities, responding to user questions, and resolving problems.


Artificial Intelligence Developer

The undeniable hype around Artificial Intelligence is also set to increase the number of jobs advertised for specialists who truly understand how to apply AI, Machine Learning, and Deep Learning techniques in the business world. Recruiters will ask for developers with extensive knowledge of a wide array of programming languages that lend themselves well to AI development, for example Lisp, Prolog, C/C++, Java, and Python.

All said and done, many people predict that this demand for AI specialists could cause something like what we call a “brain drain”, with companies poaching talented individuals away from academia. A month ago in the Financial Times, deep learning pioneer and researcher Yoshua Bengio, of the University of Montreal, stated: “The industry has been recruiting a lot of talent, so now there’s a shortage in academia, which is fine for those companies. However, it’s not great for academia.” It will be interesting to see how this contention between academia and business plays out over the next couple of years.

Data Scientist

The move of Big Data from tech hype to business reality may have accelerated, yet the demand for top Data Scientists isn’t set to change in 2020. A recent Deloitte report highlighted that the world of business will need three million Data Scientists by 2021, so if its predictions are right, there is a major talent gap in the market. This multidisciplinary profile requires technical analytical skills and technical computer science skills, as well as strong softer skills such as communication, business acumen, and intellectual curiosity.

Data Engineer

Clean, quality data is crucial to the success of Big Data projects. Consequently, we expect to see a lot of openings in 2020 for Data Engineers who have a consistent and thorough approach to data transformation and treatment. Organizations will look for these data specialists to have extensive experience in manipulating data with SQL, T-SQL, R, Hadoop, Hive, Python and Spark. Much like Data Scientists, they are also expected to be creative when it comes to comparing data with conflicting data types in order to resolve issues. They also often need to create solutions that allow organizations to capture existing data in more usable data formats, as well as performing data modeling.

IT/Operations Manager

In the Big Data industry, the IT/Operations Manager is a valuable addition to your team and will primarily be responsible for deploying, managing, and monitoring your big data systems. You’ll rely on this team member to plan and implement new equipment and services. He or she will work with business stakeholders to understand the best technology investments to address their processes and concerns, translating business requirements into technology plans. They’ll also work with project managers to implement technology and be responsible for successful transition and general operations.

Here are some of their responsibilities:

  • Manage and be proactive in reporting, resolving, and escalating issues where required
  • Lead and coordinate problem management activities, in addition to continuous process improvement initiatives
  • Proactively manage the IT infrastructure
  • Supervise and manage IT staffing, including recruitment, supervision, scheduling, development, and evaluation
  • Verify that existing business tools and processes remain optimally functional and value-added
  • Benchmark, analyze, report on, and make recommendations for the improvement and growth of the IT infrastructure and IT systems
  • Advance and maintain a corporate SLA framework

Conclusion

These are some of the best career paths that Big Data professionals can pursue after entering the industry. Honesty and hard work can always take you to the zenith of any field you choose. Also, keep upgrading your skills by earning newer certifications and learning new technologies. Good luck!

Visual Question Answering with Keras – Part 2: Making Computers Intelligent to answer from images


This is my second blog post on Visual Question Answering. In the last post, I introduced VQA, the available datasets, and some real-life applications of VQA. If you have not gone through it yet, I would highly recommend that you do. Click here for more details about it.

In this blog post, I will walk through the implementation of VQA in Keras.

You can download the dataset from here: https://visualqa.org/index.html. All my experiments were performed with VQA v2, and I used a very tiny subset of the entire dataset, i.e. all samples for training and testing come from the validation set.

Table of contents:

  1. Preprocessing Data
  2. Process overview for VQA
  3. Data Preprocessing – Images
  4. Data Preprocessing through the spaCy library- Questions
  5. Model Architecture
  6. Defining model parameters
  7. Evaluating the model
  8. Final Thoughts
  9. References

NOTE: The purpose of this blog is not to get state-of-the-art performance on VQA; the idea is to get familiar with the concept. All my experiments were performed with the validation set only.

Full code on my Github here.


1. Preprocessing Data:

If you have downloaded the dataset, you will see that the questions and answers (called annotations) are in JSON format. I have provided the code to extract the questions, annotations and other useful information in my Github repository. All extracted information is stored in .txt files. After executing the code, the preprocessing directory will have the following structure.

All text files will be used for training.
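
As a rough illustration of the extraction step described above, here is a minimal sketch and not the code from the repository. It assumes the public VQA v2 layout as I understand it: a questions file with a "questions" list and an annotations file with an "annotations" list, matched on question_id. The two in-memory JSON strings and the function name extract_rows are stand-ins made up for this example.

```python
import json

# Tiny in-memory stand-ins for the VQA v2 question and annotation files.
questions_json = json.dumps({
    "questions": [
        {"image_id": 42, "question_id": 42000, "question": "What color is the bus?"},
        {"image_id": 42, "question_id": 42001, "question": "Is it raining?"},
    ]
})
annotations_json = json.dumps({
    "annotations": [
        {"image_id": 42, "question_id": 42001, "multiple_choice_answer": "no"},
        {"image_id": 42, "question_id": 42000, "multiple_choice_answer": "red"},
    ]
})


def extract_rows(questions_text, annotations_text):
    """Pair each question with its answer and image id, matched on question_id."""
    questions = json.loads(questions_text)["questions"]
    annotations = json.loads(annotations_text)["annotations"]
    answer_by_qid = {a["question_id"]: a["multiple_choice_answer"] for a in annotations}
    return [
        (q["question"], answer_by_qid[q["question_id"]], q["image_id"])
        for q in questions
    ]


rows = extract_rows(questions_json, annotations_json)

# One entry per sample, ready to be written line by line to parallel .txt files.
question_lines = [row[0] for row in rows]
answer_lines = [row[1] for row in rows]
image_id_lines = [str(row[2]) for row in rows]
```

Keeping questions, answers and image ids in the same order means that line N of each text file refers to the same sample.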

 

2. Process overview for VQA:

As we discussed in the previous post, visual question answering can be broken down into two broad parts, i.e. vision and text. I will present a neural network approach to this problem using a Convolutional Neural Network (for image data) and a Recurrent Neural Network (for text data).

If you are not familiar with RNNs (more precisely LSTMs), I would highly recommend that you go through Colah’s blog and Andrej Karpathy’s blog. The concepts discussed in these blogs are used extensively in my post.

The main idea is to get features for the images from a CNN and features for the text from an RNN, and finally combine them to generate the answer by passing them through some fully connected layers. The figure below shows the same idea.

 

I have used VGG-16 to extract the features from the image and LSTM layers to extract the features from the questions, and combined them to get the answer.

3. Data Preprocessing – Images:

Images are one of the inputs to our model. As you may already know, before feeding images to the model we need to convert them into fixed-size vectors.

So we need to convert every image into a fixed-size vector that can then be fed to the neural network. For this, we will use the pretrained VGG-16 model. VGG-16 is trained on millions of images from the ImageNet dataset to classify an image into one of 1000 classes. Here our task is not to classify the image but to get the bottleneck features from the second-to-last layer.

Hence after removing the softmax layer, we get a 4096-dimensional vector representation (bottleneck features) for each image.

Image Source: https://www.cs.toronto.edu/~frossard/post/vgg16/

 

For the VQA dataset, the images come from the COCO dataset and each image has a unique id associated with it. All these images are passed through the VGG-16 architecture and their vector representations are stored in a “.mat” file along with the ids. So in practice we do not have to implement the VGG-16 architecture; instead, we just look up the id of the image at hand in the file and get a 4096-dimensional vector representation for the image.
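
The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the random matrix stands in for the features stored in the “.mat” file, the one-column-per-image layout and the image ids are hypothetical, and image_vectors is a name made up for this example.

```python
import numpy as np

# Stand-in for the precomputed feature matrix: one 4096-d column per image.
# In the real setup this array would be read from the ".mat" file instead.
rng = np.random.default_rng(0)
coco_ids = [9, 25, 30]                         # hypothetical COCO image ids
features = rng.random((4096, len(coco_ids)))   # shape: (feature_dim, num_images)

# Map each image id to its column in the feature matrix.
column_of = {image_id: col for col, image_id in enumerate(coco_ids)}


def image_vectors(image_ids):
    """Return an array of shape (len(image_ids), 4096) for the given ids."""
    cols = [column_of[i] for i in image_ids]
    return features[:, cols].T


batch = image_vectors([30, 9, 30])
print(batch.shape)  # (3, 4096)
```

The same image id can appear several times in a batch, because several questions usually refer to one image.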

4. Data Preprocessing through the spaCy library- Questions:

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. As we have converted images into a fixed 4096-dimensional vector we also need to convert questions into a fixed-size vector representation. For installing spaCy click here

You might know that for training word embeddings in Keras we have a layer called the Embedding layer, which takes a word and embeds it into a higher-dimensional vector representation. But by using the spaCy library we do not have to train anything to get this higher-dimensional vector representation.

 

This model is already trained on billions of tokens from a large corpus. So we just need to read the vector attribute of a spaCy token to get the vector representation for a word.

After loading the model, reading the vector attribute of the tokens of each question gives a 300-dimensional fixed representation for each word.
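
To make the shape of the question input concrete, here is a minimal sketch of turning questions into a zero-padded array of word vectors. It deliberately does not call spaCy: toy_vector is a hypothetical stand-in for the pretrained token vectors, and MAX_WORDS is an assumed cap on question length, not a value from the original code.

```python
import numpy as np

WORD_DIM = 300   # size of each word vector, as stated in the text
MAX_WORDS = 30   # hypothetical cap on the number of words per question


def toy_vector(word):
    """Hypothetical stand-in for a pretrained word vector: fixed pseudo-random values per word."""
    seed = sum(ord(ch) for ch in word)
    return np.random.default_rng(seed).random(WORD_DIM)


def questions_to_tensor(questions):
    """Stack per-word vectors into a zero-padded (num_questions, MAX_WORDS, WORD_DIM) array."""
    tensor = np.zeros((len(questions), MAX_WORDS, WORD_DIM))
    for i, question in enumerate(questions):
        words = question.lower().rstrip("?").split()
        for j, word in enumerate(words[:MAX_WORDS]):
            tensor[i, j, :] = toy_vector(word)
    return tensor


q_tensor = questions_to_tensor(["What color is the bus?", "Is it raining?"])
print(q_tensor.shape)  # (2, 30, 300)
```

With real spaCy vectors, the body of toy_vector would be replaced by the vector of the corresponding token; the padding logic stays the same.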

5. Model Architecture:

Because the input in our problem consists of two parts, i.e. an image vector and a question, we cannot use the Sequential API of the Keras library. For this reason, we use the Functional API, which allows us to create multiple models and finally merge them.

The picture below shows the high-level architecture of the submodules of the neural network.

After concatenating the 2 different models the summary will look like the following.

The below plot helps us to visualize neural network architecture and to understand the two types of input:
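
The two-input idea described in this section can be sketched with the Functional API as follows. This is a minimal sketch and not the exact model from the repository: the layer sizes, the dropout rate, the number of answer classes, the optimizer and the function name build_vqa_model are all assumptions chosen for illustration.

```python
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout, concatenate
from tensorflow.keras.models import Model


def build_vqa_model(image_dim=4096, word_dim=300, max_words=30,
                    lstm_units=512, hidden_units=1024, num_answers=1000):
    """Two-input model: image vector + sequence of word vectors -> answer class."""
    # Image branch: the precomputed 4096-d VGG-16 vector is used directly.
    image_input = Input(shape=(image_dim,), name="image_vector")

    # Question branch: an LSTM summarizes the sequence of word vectors.
    question_input = Input(shape=(max_words, word_dim), name="question_vectors")
    question_features = LSTM(lstm_units)(question_input)

    # Merge both branches and classify with fully connected layers.
    merged = concatenate([image_input, question_features])
    hidden = Dense(hidden_units, activation="tanh")(merged)
    hidden = Dropout(0.5)(hidden)
    answer = Dense(num_answers, activation="softmax")(hidden)

    model = Model(inputs=[image_input, question_input], outputs=answer)
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
    return model


model = build_vqa_model()
```

Calling model.summary() on the result prints the two input branches and the merged layers, which is a quick way to check that both inputs are wired in.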

 

6. Defining model parameters:

The hyperparameters that we are going to use for our model are defined as follows:

If you know what these parameters mean, you can play around with them and may get better results.

Time Taken: I used the GPU on https://colab.research.google.com and hence it took me approximately 2 hours to train the model for 5 epochs. However, if you train it on a PC without GPU, it could take more time depending on the configuration of your machine.

7. Evaluating the model:

Since I used a very small dataset for these experiments, I was not able to get very good accuracy. The code below calculates the accuracy of the model.
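
As an illustration of this step, here is a minimal sketch of an exact-match accuracy computation. It assumes the model outputs one row of class probabilities per sample and that the true answers are given as class indices; the numbers are made up, and this is a simple exact-match score, not the official VQA evaluation metric.

```python
import numpy as np


def accuracy(predicted_probs, true_labels):
    """Fraction of samples whose highest-scoring class equals the true class index."""
    predicted_labels = np.argmax(predicted_probs, axis=1)
    return float(np.mean(predicted_labels == np.asarray(true_labels)))


# Made-up predictions for four samples and three answer classes.
probs = np.array([
    [0.1, 0.7, 0.2],   # predicted class 1
    [0.6, 0.3, 0.1],   # predicted class 0
    [0.2, 0.2, 0.6],   # predicted class 2
    [0.5, 0.4, 0.1],   # predicted class 0
])
labels = [1, 0, 2, 1]  # the last prediction is wrong
print(accuracy(probs, labels))  # 0.75
```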

 

Since I trained the model multiple times with different parameters, you will not get exactly the same accuracy as I did. If you want, you can directly download the mode.h5 file from my Google Drive.

 

8. Final Thoughts:

One of the interesting things about VQA is that it is a completely new field, so there is no end to what you can try in order to solve this problem. Below are some tips for replicating the code.

  1. Start with a very small subset of data: When you start implementing, I suggest you start with a very small amount of data. Once the whole setup is ready, you can scale it up at any time.
  2. Understand the code: Understanding the code line by line is very helpful for connecting it to your theoretical knowledge. For that, I suggest you take very few samples (maybe 20 or fewer) and run a small chunk (2 to 3 lines) of code at a time to see what each part does.
  3. Be patient: One of the mistakes I made when starting this project was trying to do everything in one go. If you get an error while replicating the code, spend 4 to 5 days working hard on it. If you still cannot solve it after that, I would suggest you resume after a break of 1 or 2 days.

VQA is the intersection of NLP and CV, and hopefully this project will give you a better, more practical understanding of most deep learning concepts.

If you want to improve the performance of the model, below are a few tips you can try:

  1. Use larger datasets
  2. Try building more complex models, for example with attention
  3. Try using other pre-trained word embeddings like GloVe
  4. Try using a different architecture 
  5. Do more hyperparameter tuning

The list is endless and it goes on.

I have not provided the complete code in this blog post; you can get it from my Github repository.

9. References:

  1. https://blog.floydhub.com/asking-questions-to-images-with-deep-learning/
  2. https://tryolabs.com/blog/2018/03/01/introduction-to-visual-question-answering/
  3. https://github.com/sominwadhwa/vqamd_floyd