Machine Learning Archives

Tag Archive for: Machine Learning

It's All About Data: The Training Methods of Deep Learning

August 14, 2022/in Artificial Intelligence, Data Science, Deep Learning, Machine Learning, Main Category/by Benjamin Aunkofer

Deep learning offers several training methods. Which one we use in an AI project depends on the data the customer provides: how much data is there, and is it labeled or unlabeled? Or is there both labeled and unlabeled data?

Suppose our customer needs structured, labeled images for a tourism portal. The task for our AI model is then to recognize whether an image shows the bedroom, the bathroom, the spa area, the restaurant, and so on. Let's take a look at the possible training methods.

1. Supervised Learning

If our customer has many images and all of them are labeled, that is a rare stroke of luck. We can then apply supervised learning. The AI model learns the different image categories from the labeled images. For training, we therefore supply it with the training data together with the desired results.
During training, the model searches the images for patterns that match the desired results. In this way it learns the characteristic features of each category. The model can then transfer what it has learned to new, unseen data and deliver a prediction for unlabeled images, for example “bathroom 98%”.
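As a rough illustration of how a prediction such as “bathroom 98%” comes about, the sketch below turns raw model scores into class probabilities with a softmax. The category names and the score values are hypothetical and not taken from a real model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, labels):
    """Return the most likely label together with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical room categories and hypothetical raw outputs of a trained classifier.
labels = ["bedroom", "bathroom", "spa", "restaurant"]
logits = [0.5, 5.0, 0.2, 0.1]

label, confidence = predict(logits, labels)
print(f"{label} {confidence:.0%}")
```

The printed percentage is simply the softmax probability of the highest-scoring category.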

2. Unsupervised Learning

If our customer can supply many images as training data but none of them are labeled, we have to fall back on unsupervised learning. This means we cannot tell the model what it should learn (the assignment to categories); instead, it has to find regularities in the data on its own.

A currently common method of unsupervised learning is contrastive learning. Here we generate several crops from each image. The model is supposed to learn that crops of the same image are more similar to each other than to crops of other images. In short, the model learns to distinguish between similar and dissimilar images.
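The idea can be sketched with a simple contrastive loss on embedding vectors. This is a minimal illustration, assuming cosine similarity and a softmax-style (InfoNCE-like) loss; the embeddings, the temperature value, and the function names are hypothetical and not the setup used in the project described here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Low when the anchor is close to its positive and far from the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical 2-D embeddings: two crops of the same image vs. crops of other images.
anchor = [1.0, 0.0]
good = contrastive_loss(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.2]])
bad = contrastive_loss(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.2]])
print(good < bad)
```

Training pushes the model towards embeddings like the first case, where crops of the same image end up close together.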

We can obtain predictions with this method, but they generally do not reach the result quality of supervised learning.

3. Semi-supervised Learning

If our customer can provide a small amount of labeled data and a large amount of unlabeled data, we apply semi-supervised learning. This is in fact the data situation we encounter most often in practice. In almost all AI projects, a small amount of labeled data is set against a much larger amount of unstructured data.

Semi-supervised learning lets us use both data sets for training. One way to do this is to combine contrastive learning with supervised learning. We train an AI model on the labeled data to obtain predictions for room categories. At the same time, we let it learn similarities and dissimilarities in the unlabeled data and optimize itself accordingly. In this way we can ultimately also obtain good label predictions for new, unseen images.

Conclusion: Supervised vs. Unsupervised vs. Semi-supervised

Supervised learning is what everyone entrusted with an AI project wishes for. In practice it is rarely applicable, because it is rare for all training data to be well structured and labeled.

If only unstructured and unlabeled data is available, unsupervised learning at least lets us extract information from the data that our customer would not otherwise have. Compared with supervised learning, however, the quality of the results is considerably lower.

With semi-supervised learning we try to resolve the data dilemma of a small labeled portion and a large unlabeled portion. We use both data sets and can achieve good prediction results whose quality is often on par with supervised learning.

This article was written in collaboration between DATANOMIQ, a company for consulting and services around business intelligence, process mining and data science, and pixolution, a company for AI solutions in computer vision (visual image search and custom AI solutions).

Automatic Financial Trading Agent for Low-risk Portfolio Management using Deep Reinforcement Learning

June 9, 2022/in Artificial Intelligence, Data Science, Deep Learning, Machine Learning, Main Category/by Jieyun Hu

This article focuses on an autonomous trading agent that addresses the portfolio management problem in the capital market. The researchers aim to achieve a higher portfolio return while preferring lower-risk actions. They use deep reinforcement learning, specifically a Deep Q-Network (DQN), to train the agent. The main contribution of their work is the proposed target policy.

Introduction

The authors emphasize the importance of low-risk actions for two reasons: 1) the weak positive correlation between risk and profit suggests that high returns can be obtained with low-risk actions, and 2) customer satisfaction decreases as investment risk increases, which is undesirable. The authors point to a limitation of supervised learning algorithms, namely that they require domain knowledge. They therefore propose reinforcement learning as more suitable, because it only requires specifications of state, action and reward.

The study verifies the method through a back-test in the cryptocurrency market, chosen because it is extremely volatile and offers enormous and diverse data. Agents are then trained with shorter periods and tested for the same period to verify the robustness of the method.

2 Proposed Method

The overall structure of the proposed method is shown below.

The architecture of the proposed trading agent system.

2.1 Problem Definition

The portfolio consists of m assets and one base currency.

The price vector $p_t$ stores the prices of all assets at time $t$.

The portfolio vector $w_t$ stores the amount of each asset held at time $t$.

At time $t$, the total value $W_t$ of the portfolio is defined as the inner product of the price vector $p_t$ and the portfolio vector $w_t$, that is, $W_t = p_t \cdot w_t$.

Finally, the goal is to maximize the profit $P_t$ at the terminal time step $T$.
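The portfolio value can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the first entry is taken to be the base currency with a price of 1, all numbers are made up, and the profit is shown as the relative change of the portfolio value, which is only one plausible reading and not necessarily the exact definition used in the paper.

```python
def portfolio_value(prices, holdings):
    """Total value W_t as the inner product of price vector and portfolio vector."""
    if len(prices) != len(holdings):
        raise ValueError("price and portfolio vectors must have the same length")
    return sum(p * w for p, w in zip(prices, holdings))

# Hypothetical example with m = 2 assets; index 0 is the base currency (assumed price 1).
w = [1000.0, 0.05, 2.0]           # amounts held, unchanged between the two time steps
p_start = [1.0, 20000.0, 1500.0]  # prices at the first time step
p_end = [1.0, 22000.0, 1800.0]    # prices at the terminal time step

W_start = portfolio_value(p_start, w)
W_end = portfolio_value(p_end, w)

# Assumed profit definition: relative change of the total portfolio value.
relative_profit = W_end / W_start - 1.0
print(W_start, W_end, round(relative_profit, 4))
```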

2.2 Asset Data Preprocessing

1) Asset Selection
Data is drawn from the Binance Exchange API, where top m traded coins are selected as assets.

2) Data Collection
Each coin has 9 properties, shown in Table.1, so each trade history matrix has size (α * 9), where α is the size of the target period converted into minutes.

3) Zero-Padding
Pad all other coins to match the matrix size of the longest coin. (Coins have different listing days)

Comment: The authors point out that zero-padding may be insufficient, but the empirical results still confirm that their method covers the missing data well.

4) Stack Matrices
Stack the m matrices of size (α * 9) to form a block of size (m * α * 9). Then use a sliding window with window size w to create (α – w + 1) sequential blocks of size (w * m * 9).

5) Normalization
Normalize the blocks with min-max normalization. The resulting blocks are called history blocks 𝜙 and are used as the input (i.e., the state) for the agent.
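Steps 4 and 5 can be sketched as follows. This is a simplified illustration with nested lists stored time-first, and it assumes that min-max normalization is applied over all values of a window; the paper may normalize differently (for example per property). The toy data and function names are hypothetical.

```python
def sliding_windows(series, w):
    """Cut a sequence of alpha time steps into (alpha - w + 1) windows of length w."""
    if not 1 <= w <= len(series):
        raise ValueError("window size must be between 1 and the series length")
    return [series[start:start + w] for start in range(len(series) - w + 1)]

def min_max_normalize(window):
    """Scale all values of one window into [0, 1] (assumption: one min/max per window)."""
    flat = [v for step in window for asset in step for v in asset]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    return [[[(v - lo) / span if span else 0.0 for v in asset]
             for asset in step] for step in window]

# Toy data: alpha = 5 minutes, m = 2 coins, 9 properties per coin and minute.
alpha, m, w = 5, 2, 3
series = [[[float(100 * t + 10 * i + k) for k in range(9)]
           for i in range(m)] for t in range(alpha)]

windows = sliding_windows(series, w)
history_blocks = [min_max_normalize(win) for win in windows]
print(len(history_blocks), len(history_blocks[0]), len(history_blocks[0][0]))
```

Each element of `history_blocks` then has the shape (w, m, 9) and could serve as one state for the agent.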

3. Deep Q-Network

The proposed RL-based trading system follows the DQN structure.

A Deep Q-Network has two networks, the Q-network and the target network, and a component called experience replay. The Q-network is the agent that is trained to produce the optimal state-action value (also known as the q-value).

Comment: The q-value is calculated with the Bellman equation, which, in short, consists of the immediate reward from the next action and the discounted value of the next state when following the policy for all subsequent steps.

Here,
Agent: Portfolio manager
Action a: Trading strategy according to the current state
State 𝜙 : State of the capital market environment
Environment: Has all trade histories for the assets, returns the reward r and provides the next state 𝜙’ to the agent

DQN workflow:

DQN gets trained in multiple time steps of multiple episodes. Let’s look at the workflow of one episode.

Training of a Deep Q-Network

1) The agent selects an action according to the behavior policy and executes it in the environment, which returns the reward and the next state. This experience tuple ( $\phi_t, a_t, r_t, \phi_{t+1}$ ) is stored in the experience replay repository as a sample of training data.

2) From the repository of prior observations, take a random batch of samples as the input to both the Q-network and the target network. The Q-network takes the current state and action from each data sample and predicts the q-value for that particular action. This is the ‘Predicted Q-Value’.

Comment: The authors use the 𝜀-greedy algorithm as the behavior policy to select actions. To simplify, the 𝜀-greedy policy takes the optimal action if a randomly generated number is greater than 𝜀, and a random action otherwise; 𝜀 thus controls the tradeoff between exploration and exploitation.

The Target network takes the next state from each data sample and predicts the best q-value out of all actions that can be taken from that state. This is the ‘Target Q-Value’.

Comment: The authors propose a different target policy to calculate the target q-value.

3) The predicted q-value, the target q-value, and the observed reward from the data sample are used to compute the loss to train the Q-network.

Comment: Target Network is not trained. It is held constant to serve as a stable target for learning and will be updated with a frequency different from the Q-network.

4) Copy Q-network weights to Target network after n time steps and continue to next time step until this episode is finished.
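The core quantities of this workflow can be sketched without a neural network. The snippet below shows 𝜀-greedy action selection, a random batch drawn from a replay list, and the target and loss of a basic DQN with a greedy target. The q-values stand in for network outputs, and all numbers and the discount factor are hypothetical; the paper replaces the greedy target with its own target policy.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Take the best action if a random number exceeds epsilon, else a random action."""
    if rng.random() > epsilon:
        return max(range(len(q_values)), key=q_values.__getitem__)
    return rng.randrange(len(q_values))

def td_target(reward, next_q_values, gamma, done=False):
    """Target q-value of basic DQN: reward plus discounted best next q-value."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def squared_td_error(predicted_q, target_q):
    """Loss for one sample: squared difference of target and predicted q-value."""
    return (target_q - predicted_q) ** 2

rng = random.Random(0)

# Hypothetical replay repository of (state, action, reward, next_state) samples.
replay = [("phi_0", 1, 0.2, "phi_1"), ("phi_1", 0, -0.1, "phi_2"), ("phi_2", 2, 1.0, "phi_3")]
batch = rng.sample(replay, k=2)

# Stand-ins for network outputs on one sample.
predicted_q = 2.0                     # Q-network output for the stored state and action
next_q_from_target = [0.5, 2.0, -1.0] # target network outputs for the next state

target = td_target(reward=1.0, next_q_values=next_q_from_target, gamma=0.9)
loss = squared_td_error(predicted_q, target)
action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0, rng=rng)
print(action, round(target, 2), round(loss, 2))
```

In a full implementation the loss would be averaged over the batch and back-propagated through the Q-network only, while the target network stays fixed until its weights are copied over.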

4.0 Main Contribution of the Research

4.1 Action and Reward

The agent determines not only the action a but also the ratio $\sigma$ at which the action is applied.

Action:
Hold, buy and sell. Buy and sell are defined discretely for each asset. Hold holds all assets. Therefore, there are (2m + 1) actions in the action set A.

The agent obtains the q-value of each action through the Q-network and selects an action using the 𝜀-greedy algorithm as the behavior policy.
Ratio:
$\sigma$ is defined as the softmax value of the q-value of each action (e.g., if $\sigma = 0.5$ for buying the i-th asset, the i-th asset is bought using 50% of the base currency).
Reward:
The reward depends on the portfolio value before and after the trading strategy is applied. It is clipped to [-1,1] to avoid overfitting.
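A small sketch may help to make the action set and the ratio concrete. The action names, the q-values, and the helper functions below are hypothetical; the action is picked greedily here purely for illustration, and the exact reward formula of the paper is not reproduced, only the clipping.

```python
import math

def build_action_set(m):
    """One hold action plus a buy and a sell action per asset: 2m + 1 actions."""
    return ["hold"] + [f"buy_{i}" for i in range(m)] + [f"sell_{i}" for i in range(m)]

def softmax(values):
    """Softmax over the q-values; the entry of the chosen action serves as the ratio."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def clip_reward(reward):
    """Clip the reward to the interval [-1, 1]."""
    return max(-1.0, min(1.0, reward))

actions = build_action_set(3)
q_values = [0.1, 0.9, 0.2, 0.0, 0.3, 0.1, 0.05]  # hypothetical Q-network outputs
ratios = softmax(q_values)

chosen = max(range(len(q_values)), key=q_values.__getitem__)
print(actions[chosen], round(ratios[chosen], 3), clip_reward(2.5))
```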

4.2 Proposed Target Policy

The authors set the target based on the expected SARSA algorithm with some modification.

Comment: The authors claim that a greedy policy ignores the risks that may arise from outcomes other than the optimal one, which is fatal in domains where safe actions are preferred (e.g., the capital market).

The proposed policy uses the softmax algorithm, with greediness adjusted by the temperature term 𝜏. However, the softmax value is very sensitive to differences in the optimal q-value across states. To stabilize learning, and thus to get similar greediness in all states, the authors redefine 𝜏 as the mean of the absolute values of all q-values in each state multiplied by a hyperparameter 𝜏’.
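The following sketch shows one way to read this target policy: a softmax over the q-values with a state-dependent temperature, combined with an expected-SARSA-style target. It is an interpretation of the description above rather than the authors' code; the function names, the discount factor, and the uniform fallback for all-zero q-values are assumptions.

```python
import math

def adaptive_temperature(q_values, tau_prime):
    """tau = tau' times the mean of the absolute q-values of the state."""
    return tau_prime * sum(abs(q) for q in q_values) / len(q_values)

def softmax_policy(q_values, tau_prime):
    """Action probabilities from a temperature-scaled softmax over the q-values."""
    tau = adaptive_temperature(q_values, tau_prime)
    if tau == 0.0:  # assumption: fall back to a uniform policy if all q-values are zero
        return [1.0 / len(q_values)] * len(q_values)
    scaled = [q / tau for q in q_values]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_sarsa_target(reward, next_q_values, gamma, tau_prime):
    """Reward plus discounted expectation of the next q-values under the softmax policy."""
    probs = softmax_policy(next_q_values, tau_prime)
    expected_q = sum(p * q for p, q in zip(probs, next_q_values))
    return reward + gamma * expected_q

small = softmax_policy([1.0, 2.0, 3.0], tau_prime=0.25)
large = softmax_policy([10.0, 20.0, 30.0], tau_prime=0.25)
target = expected_sarsa_target(0.5, [1.0, 2.0, 3.0], gamma=0.9, tau_prime=0.25)
print([round(p, 3) for p in small], [round(p, 3) for p in large], round(target, 3))
```

Because the temperature scales with the magnitude of the q-values, both states yield the same action probabilities, which matches the stated goal of similar greediness in all states.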

4.3 Q-Network Structure

This study uses a convolutional neural network (CNN) to construct the networks. The detailed structure of the networks is shown in Table 2.

Comment: A CNN is a deep neural network method that hierarchically extracts local features through weighted filters. For more details, see: https://towardsdatascience.com/stock-market-action-prediction-with-convnet-8689238feae3.

5 Experiment and Hyperparameter Tuning

5.1 Experiment Setting

Data is collected from August 2017 to March 2018 when the price fluctuates extensively.

Three evaluation metrics are used to compare the performance of the trading agent.

Profit $P_t$, introduced in 2.1.
Sharpe Ratio: A measure of return that takes risk into account, computed as the expected return in excess of the risk-free return $P_f$, divided by the standard deviation of the return.

Comment: $P_f$ is the return of a risk-free asset, which is set to 0 here.
Maximum Drawdown: Maximum loss from a peak to a trough, taking downside risk into account.
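The two risk-aware metrics can be computed with a few lines of Python. This is a minimal sketch using the population standard deviation and made-up numbers; the paper may differ in details such as the return frequency or annualization.

```python
import statistics

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the standard deviation of the returns."""
    excess = [r - risk_free for r in returns]
    spread = statistics.pstdev(excess)
    if spread == 0.0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return statistics.mean(excess) / spread

def max_drawdown(values):
    """Largest relative loss from a running peak to a later trough."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst

# Hypothetical per-period returns and portfolio values.
print(sharpe_ratio([0.01, 0.03]))
print(max_drawdown([100.0, 120.0, 90.0, 130.0, 104.0]))
```

In the second call the largest drop is from 120 to 90, a drawdown of 25%, which exceeds the later drop from 130 to 104.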

5.2 Hyperparameter Optimization

The proposed method has a number of hyperparameters: the window size w mentioned in 2.2, 𝜏’ in the target policy, and the hyperparameters used in the DQN structure. The authors consider the former two to be key determinants for the study and perform a grid search to set w = 30, 𝜏’ = 0.25. The other hyperparameters are determined using heuristic search. Specifications of all hyperparameters are summarized at the end of this article.

Comment: Heuristic is a type of search that looks for a good solution, not necessarily a perfect one, out of the available options.

5.3 Performance Evaluation

Benchmark algorithms:

UBAH (Uniform buy and hold): Invest in all assets and hold until the end.
UCRP (Uniform Constant Rebalanced Portfolio): Rebalance portfolio uniformly for every trading period.

Methods from other studies: hyperparameters as suggested in the studies
EG (Exponential Gradient)
PAMR (Passive Aggressive Mean Reversion Strategy)

Comment: DQN basic uses greedy policy as the target policy.
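To make the two uniform benchmarks above concrete, the sketch below computes the final wealth of UBAH and UCRP from per-period price relatives (price at the end of a period divided by price at its start). It uses the standard textbook definitions, ignores transaction costs, and the toy numbers are made up.

```python
def ubah_wealth(price_relatives):
    """Uniform buy and hold: split capital equally once, then never trade."""
    m = len(price_relatives[0])
    growth = [1.0] * m
    for period in price_relatives:
        growth = [g * x for g, x in zip(growth, period)]
    return sum(growth) / m

def ucrp_wealth(price_relatives):
    """Uniform constant rebalanced portfolio: rebalance to equal weights every period."""
    wealth = 1.0
    for period in price_relatives:
        wealth *= sum(period) / len(period)
    return wealth

# Two hypothetical assets that alternately double and halve.
relatives = [[2.0, 0.5], [0.5, 2.0]]
print(ubah_wealth(relatives), ucrp_wealth(relatives))
```

In this toy market both assets end where they started, so buy-and-hold gains nothing, while rebalancing profits from the fluctuations.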

The proposed DQN method exhibits the best overall results out of the 6 methods. When the agent is trained with shorter periods, MDD increases significantly, but the agent still performs better than the benchmarks, which demonstrates its robustness.

6 Conclusion

The proposed method performs well compared to other methods, but it has two main drawbacks. First, the encoding method lacks a theoretical basis for successfully encoding the information in the capital market, and this opaqueness is a deep-rooted problem of deep learning. Second, the study focuses on its target policy, while there remains room for improvement in its neural network structure.

Specification of Hyperparameters.

References

Shin, S. Bu and S. Cho, “Automatic Financial Trading Agent for Low-risk Portfolio Management using Deep Reinforcement Learning”, https://arxiv.org/pdf/1909.03278.pdf
Li, P. Zhao, S. C. Hoi, and V. Gopalkrishnan, “PAMR: passive aggressive mean reversion strategy for portfolio selection,” Machine learning, vol. 87, pp. 221-258, 2012.
P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth, “On‐line portfolio selection using multiplicative updates,” Mathematical Finance, vol. 8, pp. 325-347, 1998.

https://deepai.org/machine-learning-glossary-and-terms/softmax-layer#:~:text=The%20softmax%20function%20is%20a,can%20be%20interpreted%20as%20probabilities.

http://www.kasimte.com/2020/02/14/how-does-temperature-affect-softmax-in-machine-learning.html

https://towardsdatascience.com/reinforcement-learning-made-simple-part-2-solution-approaches-7e37cbf2334e

https://towardsdatascience.com/reinforcement-learning-explained-visually-part-4-q-learning-step-by-step-b65efb731d3e

https://towardsdatascience.com/reinforcement-learning-explained-visually-part-3-model-free-solutions-step-by-step-c4bbb2b72dcf

https://towardsdatascience.com/reinforcement-learning-explained-visually-part-5-deep-q-networks-step-by-step-5a5317197f4b

Generative Adversarial Networks

May 20, 2022/in Artificial Intelligence, Data Mining, Deep Learning, Machine Learning, Main Category/by Sunil Yadav

After Deep Autoregressive Models, Deep Generative Modelling and Variational Autoencoders we now continue the discussion with Generative Adversarial Networks (GANs).

Introduction

So far, in the series on deep generative modelling (DGMs [Yad22a]), we have covered autoregressive modelling, which estimates the exact log likelihood defined by the model, and variational autoencoders, which use variational approximations for lower bound optimization. Both of these modelling techniques explicitly define density functions and optimize the likelihood of the training data. In this blog, however, we are going to discuss generative adversarial networks (GANs), which are likelihood-free models and do not define density functions explicitly. GANs follow a game-theoretic approach and learn to generate samples from the training distribution through the setup of a two-player game.

A two player model of GAN along with the generator and discriminators.

A GAN tries to learn the distribution of high-dimensional training data and generates high-quality synthetic data with a distribution similar to that of the training data. However, learning the training distribution is a highly complex task; therefore, GANs use a two-player game approach to overcome the high-dimensional complexity problem. A GAN has two different neural networks (as shown in Figure 1): the generator and the discriminator. The generator takes a random input $z\sim p(z)$ and produces a sample whose distribution is similar to $p_d$ . To train this network efficiently, a second network, known as the discriminator, is used as the second player. The generator network (player one) tries to fool the discriminator by generating real-looking images, while the discriminator network tries to distinguish between real images (training data $x\sim p_d(x)$ ) and fake images. Our main aim is to have a well-trained discriminator that can distinguish between real and fake images (the generator’s output) and, on the other hand, a generator that can easily fool the discriminator by generating real-looking images.

Objective function and training

Objective function

Simultaneous training of these two networks is one of the main challenges in GANs, and a minimax loss function is defined for this purpose. To understand this minimax function, we first discuss the concept of two-sample testing as described by Aditya Grover [Gro20]. Two-sample testing is a method to compute the discrepancy between the training data distribution and the generated data distribution:

(1) $\begin{equation*} \min_{p_{\theta_g}}\: \max_{D_{\theta_d}\in F} \: \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] - \mathbb{E}_{x\sim p_{\theta_g}} [D_{\theta_d}(x)], \end{equation*}$

where $p_{\theta_g}$ and $p_d$ are the distributions of the generated and training data, respectively. The term $F$ is a set of functions. The max part computes the discrepancy between the two distributions using a function $D_{\theta_d} \in F$ ; this part is very similar to the term $d$ (discrepancy measure) from our first article (Deep Generative Modelling), for which the KL divergence was used in the second article (Deep Autoregressive Models) and the third article (Variational Autoencoders). In GANs, however, for a given set of functions $F$ , we would like to find the distribution $p_{\theta_g}$ that minimizes the overall discrepancy even for the worst-case function $D_{\theta_d}\in F$ , that is, the one that maximizes the discrepancy. The objective function above does not use any likelihood function; it only uses two sets of samples, drawn from the training data and the generated data, respectively.

Combining Figure 1 and Equation 1, the first term $\mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)]$ corresponds to the discriminator, which has direct access to the training data, and the second term $\mathbb{E}_{x\sim p_{\theta_g}}[D_{\theta_d}(x)]$ represents the generator part, as it relies only on the latent space and produces synthetic data. Since a generated sample is obtained as $x = G_{\theta_g}(z)$ with $z\sim p_z$ , Equation 1 can be rewritten in the form of GAN’s two players as:

(2) $\begin{equation*} \min_{p_{\theta_g}}\: \max_{D_{\theta_d}\in F} \: \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] - \mathbb{E}_{z\sim p_z}[D_{\theta_d}(G_{\theta_g}(z))], \end{equation*}$

The above equation can be rearranged in the form of log loss:

(3) $\begin{equation*} \min_{\theta_g}\: \max_{\theta_d} \: (\mathbb{E}_{x\sim p_d} [log \: D_{\theta_d} (x)] + \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z)))]), \end{equation*}$

In the above equation, the arguments are changed from $p_{\theta_g}$ and $D_{\theta_d} \in F$ to $\theta_g$ and $\theta_d$ respectively, as we would like to approximate the network parameters, which are represented by $\theta_g$ and $\theta_d$ for the generator and the discriminator respectively. The discriminator wants to maximize the above objective with respect to $\theta_d$ such that $D_{\theta_d}(x) \approx 1$ , which indicates that the input is classified as real data. Furthermore, $D_{\theta_d}(G_{\theta_g}(z))$ should be close to zero, as it is fake data. Maximizing the above objective function with respect to $\theta_d$ therefore ensures that the discriminator separates real and fake data well. From the generator's point of view, we would like to minimize this objective function with respect to $\theta_g$ such that $D_{\theta_d}(G_{\theta_g}(z)) \approx 1$ . If this minimization is effective, the discriminator will classify fake data as real data, which means that the generator is producing almost real-looking samples.
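A small numerical sketch can make the behaviour of the objective in Equation 3 tangible. The function below estimates its value from discriminator outputs on a batch of real and a batch of generated samples. The output values are hypothetical; no networks are trained here.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs D(x) on real data and D(G(z)) on generated data.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])        # discriminator cannot tell them apart
sharp = gan_value([0.9, 0.95], [0.1, 0.05])          # discriminator separates them well
fooled = gan_value([0.5, 0.5], [0.9, 0.9])           # generator fools the discriminator
print(round(undecided, 3), round(sharp, 3), round(fooled, 3))
```

The value rises when the discriminator separates real from fake samples and falls when the generator fools it, which is the tension expressed by the min and max in Equation 3.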

Training

The training procedure of a GAN can be explained using the following visualization from Goodfellow et al. [GPAM+14]. In Figure 2(a), $z$ is a random input vector to the generator, which produces a synthetic outcome $x\sim p_{\theta_g}$ (green curve). The generated data distribution is not close to the original data distribution $p_d$ (dotted black curve). Therefore, the discriminator classifies these samples as fake and forces the generator to learn the training data distribution (Figure 2(b) and (c)). Finally, the generator produces samples that the discriminator can no longer detect as fake (Figure 2(d)).

GAN’s training visualization: the dotted black and solid green lines represent $p_d$ and $p_{\theta_g}$ respectively. The discriminator distribution is shown in dotted blue. This image is taken from Goodfellow et al. [GPAM+14].

The optimization of the objective function in Equation 3 is performed by repeating the following two steps:

1. First, gradient ascent is used to maximize the objective function with respect to $\theta_d$ for the discriminator.

(4) $\begin{equation*} \max_{\theta_d} \: (\mathbb{E}_{x\sim p_d} [log \: D_{\theta_d}(x)] + \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z)))]) \end{equation*}$

2. In the second step, the following function is minimized for the generator using gradient descent.

(5) $\begin{equation*} \min_{\theta_g} \: ( \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z)))]) \end{equation*}$

However, in practice the minimization for the generator does not work well: when $D_{\theta_d}(G_{\theta_g}(z)) \approx 1$ , the term $log \: (1-D_{\theta_d}(G_{\theta_g}(z)))$ has a dominant gradient, and when $D_{\theta_d}(G_{\theta_g}(z)) \approx 0$ , its gradient is small.

We would like the gradient to behave in exactly the opposite way, because $D_{\theta_d}(G_{\theta_g}(z)) \approx 1$ means the generator is well trained and does not require dominant gradient values, whereas $D_{\theta_d}(G_{\theta_g}(z)) \approx 0$ means the generator is not well trained and is producing low-quality outputs, so it requires a dominant gradient for efficient training. To fix this problem, gradient ascent is applied to maximize the following modified generator objective instead of minimizing Equation 5:

(6) $\begin{equation*} \max_{\theta_g} \: \mathbb{E}_{z\sim p_z}[log \: (D_{\theta_d}(G_{\theta_g}(z)))] \end{equation*}$

Therefore, during training, Equations 4 and 6 are maximized alternately using the gradient ascent algorithm until convergence.
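The gradient argument can be checked numerically. The sketch below compares the derivatives of the two generator objectives with respect to the discriminator output $D = D_{\theta_d}(G_{\theta_g}(z))$; the full gradient with respect to $\theta_g$ would additionally include the chain-rule factors of both networks, which are omitted here.

```python
def grad_original(d):
    """Derivative of log(1 - D) with respect to D (generator objective of Equation 5)."""
    return -1.0 / (1.0 - d)

def grad_modified(d):
    """Derivative of log(D) with respect to D (generator objective of Equation 6)."""
    return 1.0 / d

for d in (0.01, 0.5, 0.99):
    print(d, round(abs(grad_original(d)), 2), round(abs(grad_modified(d)), 2))
```

For a poorly trained generator ( $D \approx 0$ ) the original objective yields a gradient magnitude of about 1, whereas the modified objective yields about 100, so the modified objective provides the strong signal where it is needed.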

Results

The quality of the images generated by GANs depends on several factors. First, the joint training of GANs is not a stable procedure, which can severely decrease the quality of the outcome. Furthermore, the neural network architecture affects the image quality, depending on the sophistication of the network used. For example, the vanilla GAN [GPAM+14] uses a fully connected deep neural network and generates quite decent results, while DCGAN [RMC15] used deep convolutional networks and enhanced the quality of the outcome significantly. In addition, different types of loss functions are applied to stabilize the training procedure of GANs and to produce high-quality outcomes. As shown in Figure 3, StyleGAN [KLA19] used the Wasserstein metric [Yad22b] to generate high-resolution face images. As can be seen from Figure 3, the quality of the generated images has improved over time through more sophisticated training techniques and network architectures.

GAN timeline with different variations in terms of network architecture and loss functions.

Summary

This article covered the basics and mathematical concepts of GANs. Training two different networks simultaneously can be complex and unstable. Therefore, researchers are continuously working on better and more stable versions of GANs, for example, WGAN. Furthermore, different types of network architectures have been introduced to improve the quality of outcomes. We will discuss these variations further in an upcoming blog.

References

[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.

[Gro20] Aditya Grover. Generative adversarial networks. https://deepgenerativemodels.github.io/notes/gan/, 2020.

[KLA19] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.

[RMC15] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[Yad22a] Sunil Yadav. Deep generative modelling. https://data-scienceblog.com/blog/2022/02/19/deep-generative-modelling/, 2022.

[Yad22b] Sunil Yadav. Necessary probability concepts for deep learning: Part 2. https://medium.com/@sunil7545/kl-divergence-js-divergence-and-wasserstein-metricin-deep-learning-995560752a53, 2022.

Variational Autoencoders

April 19, 2022/in Artificial Intelligence, Data Science, Deep Learning, Machine Learning, Main Category, Use Cases/by Sunil Yadav

After Deep Autoregressive Models and Deep Generative Modelling, we continue our discussion with Variational Autoencoders (VAEs), having covered DGM basics and AGMs. VAEs are a deep learning method to produce synthetic data (images, texts) by learning the latent representations of the training data. AGMs are sequential models and generate data based on previous data points by defining tractable conditionals. VAEs, on the other hand, use latent variable models to infer hidden structure in the underlying data by using the following intractable distribution function:

(1) $\begin{equation*} p_\theta(x) = \int p_\theta(x|z)p_\theta(z) dz. \end{equation*}$

The generative process using the above equation can be expressed in the form of a directed graph as shown in Figure 1 (the decoder part), where the latent variable $z\sim p_\theta(z)$ produces meaningful information about $x \sim p_\theta(x|z)$ .

Figure 1: Architectures of AE and VAE based on the bottleneck architecture. The decoder part works as a generative model during inference.

Autoencoders

Autoencoders (AEs) are the key part of VAEs. They are an unsupervised representation learning technique and consist of two main parts, the encoder and the decoder (see Figure 1). The encoders are deep neural networks (mostly convolutional neural networks for imaging data) that learn a lower-dimensional feature representation from the training data. The learned latent feature representation $z$ usually has a much lower dimension than the input $x$ and captures the most dominant features of $x$ . The encoders learn features by performing convolutions at different levels, and compression happens via max-pooling.

The decoders, on the other hand, which are also deep convolutional neural networks, reverse the encoder’s operation. They try to reconstruct the original data $x$ from the latent representation $z$ using up-sampling convolutions. The decoders are quite similar to the generative models of VAEs, as shown in Figure 1, where synthetic images are generated from the latent variable $z$ .

During the training of autoencoders, we would like to utilize the unlabeled data and try to minimize the following quadratic loss function:

(2) $\begin{equation*} \mathcal{L}(\theta, \phi) = ||x-\hat{x}||^2, \end{equation*}$

The above equation tries to minimize the distance between the original input and reconstructed image as shown in Figure 1.

Variational autoencoders

VAEs are motivated by the decoder part of AEs, which can generate data from a latent representation. They are a probabilistic version of AEs that allows us to generate synthetic data with different attributes. A VAE can be seen as the decoder part of an AE, which learns the set of parameters $\theta$ to approximate the conditional $p_\theta(x|z)$ in order to generate images based on a sample from the true prior, $z\sim p_\theta(z)$ . The true prior $p_\theta(z)$ is generally a Gaussian distribution.

Network Architecture

The VAE has an architecture quite similar to the AE except for the bottleneck part, as shown in Figure 1. In AEs, the encoder converts high-dimensional input data into a low-dimensional latent representation in vector form. The VAE’s encoder, on the other hand, learns the mean vector and the diagonal standard deviation matrix such that $z\sim \mathcal{N}(\mu_z, \Sigma_z)$ , as it performs probabilistic generation of data. Therefore, both the encoder and the decoder should be probabilistic.

Training

Similar to the training of AGMs, we would like to maximize the likelihood of the training data. The likelihood of the data for VAEs is given in Equation 1. The first term, $p_\theta(x|z)$ , is approximated by a neural network, and the second term, the prior distribution $p_\theta(z)$ , is a Gaussian function; therefore, both of them are tractable. However, the integral is not tractable because of the high dimensionality of the data.

To solve this problem of intractability, the encoder part of AE was utilized to learn the set of parameters $\phi$ to approximate the conditional $q_\phi (z|x)$ . Furthermore, the conditional $q_\phi (z|x)$ will approximate the posterior $p_\theta (z|x)$ , which is intractable. This additional encoder part will help to derive a lower bound on the data likelihood that will make the likelihood function tractable. In the following we will derive the lower bound of the likelihood function:

(3) $\begin{equation*} \begin{aligned} log \: p_\theta (x) = & \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: \frac{p_\theta (x|z) p_\theta (z)}{p_\theta (z|x)} \: \frac{q_\phi(z|x)}{q_\phi(z|x)}\Bigg] \\ = & \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: p_\theta (x|z)\Bigg] - \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: \frac{q_\phi (z|x)} {p_\theta (z)}\Bigg] + \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: \frac{q_\phi (z|x)}{p_\theta (z|x)}\Bigg] \\ = & \mathbf{E}_{z\sim q_\phi(z|x)} \Big[log \: p_\theta (x|z)\Big] - \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z)) + \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z|x)). \end{aligned} \end{equation*}$

In the above equation, the first line starts from the logarithm of the likelihood $p_\theta (x)$ , which is expanded using Bayes' theorem and multiplied by $q_\phi(z|x)/q_\phi(z|x) = 1$ . Because $log \: p_\theta (x)$ does not depend on $z$ , taking the expectation over $z\sim q_\phi(z|x)$ leaves it unchanged. In the next line, the expression is expanded using the logarithm rules and rearranged. The last two terms in the second line match the definition of the KL divergence, and the third line expresses them in that form.

In the last line, the first term represents the reconstruction loss and is approximated by the decoder network. This term can be estimated using the reparametrization trick [KW14]. The second term is the KL divergence between the encoder function $q_\phi (z|x)$ and the prior distribution $p_\theta(z)$ ; both follow a Gaussian distribution, so this term has a closed-form solution and is tractable. The last term is intractable due to $p_\theta (z|x)$ . However, the KL divergence measures the discrepancy between two probability densities and is always non-negative. Using this property, the last term can be dropped, which yields the following lower bound:

(4) $\begin{equation*} log \: p_\theta (x)\geq \mathcal{L}(x, \phi, \theta) , \: \text{where} \: \mathcal{L}(x, \phi, \theta) = \mathbf{E}_{z\sim q_\phi(z|x)} \Big[log \: p_\theta (x|z)\Big] - \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z)). \end{equation*}$

In the above equation, the term $\mathcal{L}(x, \phi, \theta)$ is the tractable lower bound used for optimization and is also termed the ELBO (Evidence Lower Bound). During the training process, we maximize the ELBO using the following equation:

(5) $\begin{equation*} \operatorname*{argmax}_{\phi, \theta} \sum_{x\in X} \mathcal{L}(x, \phi, \theta). \end{equation*}$

Furthermore, the reconstruction loss term can be written using Equation 2, as the decoder output is assumed to follow a Gaussian distribution. Therefore, this term can easily be transformed into the mean squared error (MSE).
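As a rough illustration of how the two ELBO terms turn into a training loss, here is a minimal NumPy sketch. It is not the implementation referenced in this post. It assumes a standard normal prior $p_\theta(z) = \mathcal{N}(0, I)$, a diagonal Gaussian encoder parameterized by `mu` and `logvar`, and a unit-variance Gaussian decoder, so that the reconstruction term reduces to a squared error up to a constant. The function names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL(q || p) per sample, for q = N(mu, diag(exp(logvar)))
    and the assumed prior p = N(0, I)."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

def reparameterize(mu, logvar, rng):
    """Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def negative_elbo(x, x_recon, mu, logvar):
    """Batch mean of (squared reconstruction error + KL term).
    Minimizing this corresponds to maximizing the ELBO under the
    unit-variance Gaussian decoder assumption (constants dropped)."""
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=1)
    return float(np.mean(recon + gaussian_kl(mu, logvar)))

# Toy usage with made-up shapes: 4 samples, 8 features, 2 latent dimensions.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
mu = np.zeros((4, 2))
logvar = np.zeros((4, 2))
z = reparameterize(mu, logvar, rng)     # would be fed to the decoder
loss = negative_elbo(x, x, mu, logvar)  # perfect reconstruction, q equal to the prior
```

In this toy call both terms vanish, because the reconstruction is perfect and the encoder distribution equals the prior. In a real model `x_recon` would come from the decoder applied to `z`.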

During the implementation, the architecture part is straightforward and can be found here. The user has to define the size of latent space, which will be vital in the reconstruction process. Furthermore, the loss function can be minimized using ADAM optimizer with a fixed batch size and a fixed number of epochs.

Figure 2: The results obtained from vanilla VAE (left) and a recent VAE-based generative model NVAE (right)

The figure above shows the quality improvement since the VAE was introduced by Kingma and Welling [KW14]. NVAE is a relatively new method using a deep hierarchical VAE [VK21].

Summary

In this blog post, we discussed variational autoencoders along with the basics of autoencoders. We covered the main difference between AEs and VAEs along with the derivation of the lower bound in VAEs. Using two different VAE-based methods, we have shown that VAEs are still an active research area because, in general, they produce blurry outputs.

References

[KW14] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2014.
[VK21] Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder, 2021.

Graphical understanding of dynamic programming and the Bellman equation: taking a typical approach at first

October 5, 2021/in Artificial Intelligence, Data Mining, Data Science, Deep Learning, Main Category/by Yasuto Tamura

This is the second article of the series My elaborate study notes on reinforcement learning.

*I must admit I could not fully explain how I tried visualizing ideas of Bellman equations in this article. I highly recommend that you also take a brief look at the second section of the third article. (A comment added on 13/3/2022)

1, Before getting down to business

As the title suggests, this article is mainly about the Bellman equation and dynamic programming (DP), which are, to be honest, very typical and ordinary topics. One typical way of explaining DP in the context of reinforcement learning (RL) is to explain the Bellman equation, value iteration, and policy iteration, in this order. If you merely want to follow their pseudocode and implement them, that is, to be honest, not a big deal. However, even though I have studied RL for only a few weeks, I got the feeling that these algorithms, especially policy iteration, are more than just single algorithms. In order not to miss the point of DP, rather than explaining value iteration and policy iteration in the typical way, I would like to take a different approach. Eventually I am going to introduce DP in RL as a combination of the following key terms: the Bellman operator, the fixed point of a policy, policy evaluation, policy improvement, and the existence of the optimal policy. But first, in this article I would like to cover the basic and typical topics of DP in RL.

Many machine learning algorithms which use supervised/unsupervised learning more or less share the same ideas. You design a model and a loss function, feed in samples from data, and adjust the parameters of the model so that the loss function decreases. You usually use optimization techniques like stochastic gradient descent (SGD) or ones derived from SGD. Feature engineering is often needed to extract more meaningful information from raw data. Especially in this third AI boom, models are getting more and more complex, and I would say the effort of feature engineering has simply been replaced by that of designing neural networks. Still, once you have the whole picture of supervised/unsupervised learning, you soon realize that the various other algorithms are just a matter of replacing each component of the workflow. Reinforcement learning, however, is a different framework for training machine learning models. Richard E. Bellman’s research on DP in the 1950s is said to have laid a foundation for RL. RL has also made great progress thanks to the development of deep neural networks (DNNs), but you have to keep in mind that RL and supervised/unsupervised learning are basically different frameworks. DNNs are introduced into RL frameworks just to enable richer expression of each component of RL. Especially when RL is executed in a higher-level environment, for example screens of video games or phases of board games, DNNs are needed to process each state of the environment. Thus, first of all, I think it is important to see the ideas unique to RL in order to learn RL effectively. In the last article I said RL is an algorithm that enables planning by trial and error in an environment when the model of the environment is not known, and DP is a major way of solving planning problems. But in this article and the next, I am mainly going to focus on a different aspect of RL: the interactions of policies and values.

According to a famous Japanese textbook on RL named “Machine Learning Professional Series: Reinforcement Learning,” most study materials on RL, including the book by Sutton and Barto, lack explanations of the mathematical foundations of RL. That is why many people who have studied machine learning often find it hard to grasp RL formulations at the beginning. The book also points out that you need to refer to other bulky books on Markov decision processes or dynamic programming to really understand the core ideas behind the algorithms introduced in RL textbooks. I also got the impression that most study materials on RL skip over the important ideas of DP by only introducing the value iteration and policy iteration algorithms. In my opinion, we should pay more attention to policy iteration. In fact, important RL algorithms like Q-learning, SARSA, or actor-critic methods show some analogies to policy iteration. The book by Sutton and Barto also briefly mentions: “Almost all reinforcement learning methods are well described as GPI (generalized policy iteration). That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy, as suggested by the diagram to the right side.”

Even though I arrogantly, as a beginner in this field, emphasized the “simplicity” of RL in the last article, I am conversely going to emphasize the “profoundness” of DP over two articles. But I do not want to cover all the exhaustive mathematical derivations of dynamic programming, which would make many readers reluctant to study RL. I tried as hard as possible to visualize the ideas in DP in simple and intuitive ways, as far as I could understand them. And as the title of this article series shows, this article is also a study note for me. Any corrections or advice would be appreciated via email or the comment section below.

2, Taking a look at what DP is like

In the last article, I said that planning or RL is a problem of finding an optimal policy $\pi(a|s)$ for choosing which actions to take depending on where you are. Also in the last article I displayed flows of blue arrows for navigating a robot as intuitive examples of optimal policies in planning or RL problems. But you cannot directly calculate those policies. Policies have to be evaluated in the long run so that they maximize returns, the sum of upcoming rewards. Thus, in order to calculate a policy $\pi(a|s)$ , you need to calculate a value function $v_{\pi}(s)$ . $v_{\pi}(s)$ is a function of how good it is to be in a given state $s$ under a policy $\pi$ . That means you are likely to get a higher return starting from $s$ when $v_{\pi}(s)$ is high. As illustrated in the figure below, values and policies, which are two major elements of RL, are updated interactively until they converge to an optimal value or an optimal policy. The optimal value and the optimal policy are denoted as $v_{\ast}$ and $\pi_{\ast}$ respectively.

Dynamic programming (DP) is a family of algorithms which is effective for calculating the optimal value $v_{\ast}$ and the optimal policy $\pi_{\ast}$ when the complete model of the environment is given. Whether in my articles or not, the rest of the discussions on RL are more or less based on DP. RL can be viewed as a method of achieving the same effects as DP when the model of the environment is not known. And I would say the effects of imitating DP are often referred to as trial and error in many simplified explanations of RL. If you have studied some basics of computer science, I am quite sure you have encountered DP problems. In many textbook problems, you use DP to find the optimal path of a graph from a start to a goal, along which you maximize the sum of the scores of the edges you pass. You might remember that you could solve those problems in recursive ways, but I think many people have learned only very limited cases of DP. For the time being I would like you to forget the DP you might have learned and approach it as something you are newly starting to learn in the context of RL.

*As a more advanced application of DP, you might have learned string matching. With DP you can calculate how close two strings of characters are.

The ways of calculating $v_{\pi}(s)$ and $\pi(a|s)$ with DP can be roughly classified into two types, policy-based and value-based. In the context of DP, the policy-based one is called policy iteration, and the value-based one is called value iteration. The biggest difference between them is, in short, that policy iteration updates a policy every time step, but value iteration does so only at the last time step. I said you alternate between updating $v_{\pi}(s)$ and $\pi(a|s)$ , but in fact that is only true of policy iteration. Value iteration updates only a value function $v(s)$ . Before formulating these algorithms, I think it will be effective to take a look at how values and policies are actually updated in a very simple case. I would like to introduce a very good tool for visualizing value/policy iteration. You can customize a grid map and place any of “Treasure,” “Danger,” and “Block.” You can choose the probability of transition and either of the settings “Policy Iteration” or “Values Iteration.” Let me take an example of conducting DP on a grid map like the one below. Whichever of “Policy Iteration” or “Values Iteration” you choose, you get numbers like those below. Each number in each cell is the value of that state, and you can see that when you are in states with high values, you are more likely to reach the “treasure” and avoid “dangers.” But I bet this chart does not make any sense if you have not learned RL yet. I prepared some code for visualizing the process of DP on this simulator. The code is available in this link.

*In the book by Sutton and Barto, when RL/DP is discussed at an implementation level, the estimated values of $v_{\pi}(s)$ or $v_{\ast}(s)$ can be denoted as an array $V$ or $V_t$ . But I would like you to take it easy while reading my articles. I will repeatedly mention differences in notation when they matter.

*Remember that at the beginning of studying RL, only super easy cases are considered, so a $V$ is usually just a NumPy array or an Excel sheet.

*The chart above might also be misleading since there is something like a robot at the bottom-left corner, which might look like an agent. But the agent does not actually move around the environment in planning problems because it has a perfect model of the environment in its head.

The visualization I prepared is based on the implementation of the simulator, so the two give the same outputs. When you run policy iteration on the map, the values and policies are updated as follows. The arrow in each cell is the policy in that state. At each time step the arrows are calculated in a greedy way, and the arrow at each state shows the direction in which the agent is likely to get the highest reward. After 3 iterations, the policies and values converge, and with the policies you can navigate yourself to the “Treasure,” avoiding “Dangers.”

*I am not sure why the policies are incorrect on the leftmost side of the grid map. I might need to modify the code.

You can also update values without modifying policies, as in the chart below. In this case only the values of cells are updated. This is value iteration, and after this iteration converges, if you move to the adjacent cell with the highest value at each cell, you can also navigate yourself to the “treasure,” avoiding “dangers.”

I would like to start formulating DP little by little, based on the notations used in the RL book by Sutton and Barto. From now on, I will take the example of the $5 \times 6$ grid map which I visualized above. In this case each cell is numbered from $0$ to $29$ as in the figure below, but the cells 7, 13, and 14 are removed from the map. In this case $\mathcal{S} = \{0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29\}$ , and $\mathcal{A} = \{\uparrow, \rightarrow, \downarrow, \leftarrow \}$ . When you pass $s=8$ , you get a reward $r_{treasure}=1$ , and when you pass the states $s=15$ or $s=19$ , you get a reward $r_{danger}=-1$ . Also, the agent is encouraged to reach the goal as soon as possible, so it gets a regular reward of $r_{regular} = - 0.04$ every time step.

In the last section, I mentioned that the purpose of RL is to find the optimal policy which maximizes a return, the sum of upcoming rewards $R_t$ . A return is calculated as follows.

$R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$

In RL a return is estimated in probabilistic ways, that is, an expectation of the return given a state $S_t = s$ needs to be considered. And this is the value of the state. Thus the value of a state $S_t = s$ is calculated as follows.

$\mathbb{E}_{\pi}\bigl[R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T | S_t = s \bigr]$

In order to roughly understand how this expectation is calculated, let’s take the example of the $5 \times 6$ grid map above. When the current state of an agent is $s=10$ , it can take numerous courses of action. For example (a) $10 - 9 - 8 - 2$ , (b) $10-16-15-21-20-19$ , (c) $10-11-17-23-29-\cdots$ . The rewards after each course are calculated as follows.

If you take the course (a) $10 - 9 - 8 - 2$ , you get a reward of $r_a = -0.04 -0.04 + 1 -0.04$ in total. The probability of taking the course (a) is $p_a = \pi(A_t = \leftarrow | S_t = 10) \cdot p(S_{t+1} = 9 |S_t = 10, A_t = \leftarrow )$ $\cdot \pi(A_{t+1} = \leftarrow | S_{t+1} = 9) \cdot p(S_{t+2} = 8 |S_{t+1} = 9, A_{t+1} = \leftarrow )$ $\cdot \pi(A_{t+2} = \uparrow | S_{t+2} = 8) \cdot p(S_{t+3} = 2 | S_{t+2} = 8, A_{t+2} = \uparrow )$ .
Just like the case of (a), the reward after taking the course (b) is $r_b = - 0.04 -0.04 -1 -0.04 -0.04 -0.04 -1$ . The probability of taking the course can be calculated in the same way as $p_b = \pi(A_t = \downarrow | S_t = 10) \cdot p(S_{t+1} = 16 |S_t = 10, A_t = \downarrow )$ $\cdots \pi(A_{t+4} = \leftarrow | S_{t+4} = 20) \cdot p(S_{t+5} = 19 |S_{t+4} = 20, A_{t+4} = \leftarrow )$ .
The rewards and the probability of the case (c) cannot be calculated because the future behavior of the agent is not determined.

Assume that (a) and (b) are the only possible cases starting from $s=10$ under the policy $\pi$ . Then the value of $s=10$ can be calculated as follows, as a probabilistic sum of the rewards of each behavior (a) and (b).

$\mathbb{E}_{\pi}\bigl[R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T | S_t = s \bigr] = r_a \cdot p_a + r_b \cdot p_b$
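The toy calculation above can be written out as a short Python sketch. The reward rules follow the grid map described earlier, with the reward assumed to be granted on entering a cell. The probabilities `p_a` and `p_b` are made-up numbers, since the actual values depend on the policy and the transition probabilities. The function name is hypothetical.

```python
REGULAR, TREASURE, DANGER = -0.04, 1.0, -1.0
TREASURE_STATES = {8}
DANGER_STATES = {15, 19}

def course_return(course):
    """Undiscounted sum of rewards along a course given as a list of states.
    Assumes a reward is received each time the agent enters a state."""
    total = 0.0
    for state in course[1:]:  # skip the start state
        total += REGULAR
        if state in TREASURE_STATES:
            total += TREASURE
        if state in DANGER_STATES:
            total += DANGER
    return total

r_a = course_return([10, 9, 8, 2])             # course (a)
r_b = course_return([10, 16, 15, 21, 20, 19])  # course (b)

# Hypothetical probabilities of the two courses (they must sum to 1 here
# because (a) and (b) are assumed to be the only possible cases).
p_a, p_b = 0.7, 0.3
value_of_state_10 = r_a * p_a + r_b * p_b
```

With these assumed probabilities, the value of $s=10$ comes out slightly negative, because the large penalty of course (b) outweighs the treasure of course (a).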

But obviously this is not how values of states are calculated in general. Starting from a state $s=10$ , not only (a) and (b) but also numerous other behaviors of agents can be considered. Or rather, it is almost impossible to consider all the combinations of actions, transitions, and next states. In practice it is quite difficult to calculate a sequence of upcoming rewards $R_{t+1}, R_{t+2}, R_{t+3}, \cdots$ , because doing so is virtually equal to considering all the possible future cases. A very important formula named the Bellman equation formulates this effectively.

3, The Bellman equation and convergence of value functions

*I must admit I could not fully explain how I tried visualizing ideas of Bellman equations in this article. It might be better to also take a brief look at the second section of the third article. (A comment added on 3/3/2022)

The Bellman equation enables estimating values of states considering future countless possibilities with the following two ideas.

Returns are calculated recursively.
Returns are calculated in probabilistic ways.

First of all, I have to emphasize that a discounted return is usually used rather than a normal return, and a discounted one is defined as below

$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma ^2 R_{t+3} + \cdots + \gamma ^ {T-t-1} R_T = \sum_{k=0}^{T-t-1}{\gamma ^{k}R_{t+k+1}}$

, where $\gamma \in (0, 1]$ is a discount rate. (1) As for the first point above, the discounted return can be calculated recursively as follows: $G_t = R_{t + 1} + \gamma R_{t + 2} + \gamma ^2 R_{t + 3} + \gamma ^3 R_{t + 4} + \cdots$ $= R_{t + 1} + \gamma (R_{t + 2} + \gamma R_{t + 3} + \gamma ^2 R_{t + 4} + \cdots )$ $= R_{t + 1} + \gamma G_{t+1}$ . You can postpone the calculation of the future rewards corresponding to $G_{t+1}$ this way. This might sound obvious, but this small trick is crucial for defining value functions and making update rules for them. (2) The second point might be confusing to some people, but it is the most important in this section. We took a look at a very simplified case of calculating the expectation in the last section, but let’s see how a value function $v_{\pi}(s)$ is defined in the first place.

$v_{\pi}(s) \doteq \mathbb{E}_{\pi}\bigl[G_t | S_t = s \bigr]$

This equation means that the value of a state $s$ is a probabilistic sum of all possible rewards taken in the future following a policy $\pi$ . That is, $v_{\pi}(s)$ is an expectation of the return, starting from the state $s$ . The value $v_{\pi}(s)$ can be written down as follows, and this is what $\mathbb{E}_{\pi}$ means.

$v_{\pi} (s)= \sum_{a}{\pi(a|s) \sum_{s', r}{p(s', r|s, a)\bigl[r + \gamma v_{\pi}(s')\bigr]}}$

This is called the Bellman equation, and it is no exaggeration to say this is the foundation of many of the upcoming DP or RL ideas. The Bellman equation can also be written as $v_{\pi}(s) = \sum_{s', r, a}{\pi(a|s) p(s', r|s, a)\bigl[r + \gamma v_{\pi}(s')\bigr]}$ . It can be comprehended this way: in the Bellman equation you calculate a probabilistic sum of $r + \gamma v_{\pi}(s')$ , considering all the possible actions of the agent in the time step. $r + \gamma v_{\pi}(s')$ is the sum of a reward $r$ , which you get when you move to the state $s'$ from $s$ , and the discounted value of the next state $s'$ . The probability of getting a reward $r$ after moving from the state $s$ to $s'$ , taking an action $a$ , is $\pi(a|s) p(s', r|s, a)$ . Hence the right side of the Bellman equation above means the sum of $\pi(a|s) p(s', r|s, a)\bigl[r + \gamma v_{\pi}(s')\bigr]$ over all possible combinations of $s'$ , $r$ , and $a$ .

*I would not say this equation is obvious, and please let me explain a proof of this equation later.
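The first of the two ideas in this section, the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ , can be checked numerically. The sketch below is only an illustration with made-up rewards. It computes the discounted return of every time step by sweeping backward through an episode and compares the result with the direct sum. The function names are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Given rewards [R_1, ..., R_T], return [G_0, ..., G_{T-1}]
    using the recursion G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
    returns = [0.0] * len(rewards)
    g_next = 0.0
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns

def discounted_return_direct(rewards, gamma):
    """G_0 computed directly as sum_k gamma^k * R_{k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 2.0, 3.0]  # made-up episode
gamma = 0.5
returns = discounted_returns(rewards, gamma)
direct = discounted_return_direct(rewards, gamma)
```

Both computations give the same $G_0$ , but the backward sweep reuses $G_{t+1}$ instead of summing the whole tail again at every time step.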

The following figures are based on backup diagrams introduced in the book by Sutton and Barto. As we have just seen, the Bellman expectation equation calculates a probabilistic summation of $r + \gamma v_{\pi}(s')$ . In order to calculate the expectation, you have to consider all the combinations of $s'$ , $r$ , and $a$ . The backup diagram at the left side below shows the idea as a decision-tree-like graph, and the strength of color of each arrow is the probability of taking the path.

The Bellman equation I have just introduced is called the Bellman expectation equation, to be exact. As in the backup diagram at the right side, there is another type of Bellman equation where you consider only the action that gives the maximum value. The Bellman optimality equation is defined as follows.

$v_{\ast}(s) \doteq \max_{a} \sum_{s', r}{p(s', r|s, a)\bigl[r + \gamma v_{\ast}(s')\bigr]}$

I would like you to pay attention again to the fact that in definitions of Bellman expectation/optimality equations, $v_{\pi}(s)$ / $v_{\ast}(s)$ is defined recursively with $v_{\pi}(s)$ / $v_{\ast}(s)$ . You might have thought how to calculate $v_{\pi}(s)$ / $v_{\ast}(s)$ is the problem in the first place.

As I implied in the first section of this article, the ideas behind how to calculate these $v_{\pi}(s)$ and $v_{\ast}(s)$ should be discussed more precisely. Especially how to calculate $v_{\pi}(s)$ is a well discussed topic in RL, including the cases where data is sampled from an unknown environment model. In this article we are discussing planning problems, where a model of the environment is known. In planning problems, that is DP problems where all the probabilities of transition $p(s', r | s, a)$ are known, a major way of calculating $v_{\pi}(s)$ is iterative policy evaluation. With iterative policy evaluation a sequence of value functions $(v_0(s), v_1(s), \dots , v_{k-1}(s), v_{k}(s))$ converges to $v_{\pi}(s)$ with the following recurrence relation

$v_{k+1}(s) =\sum_{a}{\pi(a|s)\sum_{s', r}{p(s', r | s, a) [r + \gamma v_k (s')]}}$ .

Once $v_{k}(s)$ converges to $v_{\pi}(s)$ , finally the equation of the definition of $v_{\pi}(s)$ holds as follows.

$v_{\pi}(s) =\sum_{a}{\pi(a|s)\sum_{s', r}{p(s', r | s, a) [r + \gamma v_{\pi} (s')]}}$ .
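Iterative policy evaluation can be sketched in a few lines of NumPy. This is a minimal sketch and not the code linked in this article. It assumes the model is stored as arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (a deterministic reward per transition, which is a simplification of $p(s', r | s, a)$ ). The three-state MDP is a made-up toy example in which state 2 is terminal.

```python
import numpy as np

def q_from_v(P, R, v, gamma):
    """q[s, a] = sum_{s'} p(s'|s, a) * (r(s, a, s') + gamma * v(s'))."""
    return np.sum(P * (R + gamma * v[None, None, :]), axis=2)

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-10, max_sweeps=10_000):
    """Apply v_{k+1}(s) = sum_a pi(a|s) q_k(s, a) until the update is tiny."""
    v = np.zeros(P.shape[0])  # arbitrary initialization v_0
    for _ in range(max_sweeps):
        v_new = np.sum(pi * q_from_v(P, R, v, gamma), axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Hypothetical toy MDP: 3 states, 2 actions, state 2 is absorbing with zero reward.
P = np.zeros((3, 2, 3))
R = np.zeros((3, 2, 3))
P[0, 0, 1] = 1.0; R[0, 0, 1] = -0.04
P[0, 1, 0] = 0.5; R[0, 1, 0] = -0.04
P[0, 1, 2] = 0.5; R[0, 1, 2] = 1.0
P[1, 0, 2] = 1.0; R[1, 0, 2] = 1.0
P[1, 1, 0] = 1.0; R[1, 1, 0] = -0.04
P[2, :, 2] = 1.0

pi = np.full((3, 2), 0.5)  # uniformly random policy
v_pi = policy_evaluation(P, R, pi)

# At convergence, v_pi satisfies the Bellman expectation equation.
bellman_rhs = np.sum(pi * q_from_v(P, R, v_pi, 0.9), axis=1)
```

Comparing `v_pi` with `bellman_rhs` is a quick sanity check that the fixed point of the recurrence relation is the value function of the policy.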

The convergence to $v_{\pi}(s)$ is like the graph below. If you already know how to calculate forward propagation of a neural network, this should not be that hard to understand. You just unroll the recurrence relation between $v_{k}(s)$ and $v_{k+1}(s)$ from the initial value at $k=0$ to the converged state at $k=K$ . But you have to be careful about the directions of the purple arrows. If you compare the backup diagrams of the Bellman equation with the graphs below, the purple arrows point in the direction opposite to the one in which the graphs extend. This process of converging an arbitrarily initialized $v_0(s)$ to $v_{\pi}(s)$ is called policy evaluation.

* $\mathcal{S}, \mathcal{A}$ are the sets of states and actions respectively. Thus $|\mathcal{S}|$ , the size of $\mathcal{S}$ , is the number of white nodes in each layer, and $|\mathcal{A}|$ is the number of black nodes.

The same is true of the process of calculating an optimal value function $v_{\ast}$ . With the following recurrence relation

$v_{k+1}(s) =\max_a\sum_{s', r}{p(s', r | s, a) [r + \gamma v_k (s')]}$

$(v_0(s), v_1(s), \dots , v_{k-1}(s), v_{k}(s))$ converges to an optimal value function $v_{\ast}(s)$ . The graph below visualizes the idea of convergence.
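The recurrence relation with $\max_a$ can be sketched in the same style. The example below is an illustration under assumptions, not the simulator used in this article. It uses a hypothetical one-dimensional corridor of four cells with deterministic moves, a regular reward of $-0.04$ per step, a reward of $1$ on entering the last cell, and $\gamma = 0.9$ .

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-10, max_sweeps=10_000):
    """Apply v_{k+1}(s) = max_a sum_{s'} p(s'|s,a) [r + gamma v_k(s')]."""
    v = np.zeros(P.shape[0])
    for _ in range(max_sweeps):
        q = np.sum(P * (R + gamma * v[None, None, :]), axis=2)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

def corridor_mdp(n_states=4):
    """Hypothetical corridor: action 0 moves left, action 1 moves right.
    The last state is terminal (absorbing with zero reward)."""
    P = np.zeros((n_states, 2, n_states))
    R = np.zeros((n_states, 2, n_states))
    goal = n_states - 1
    for s in range(goal):
        for a, step in enumerate((-1, 1)):
            s_next = min(max(s + step, 0), goal)  # walls keep the agent in place
            P[s, a, s_next] = 1.0
            R[s, a, s_next] = -0.04 + (1.0 if s_next == goal else 0.0)
    P[goal, :, goal] = 1.0
    return P, R

P, R = corridor_mdp()
v_star = value_iteration(P, R)
```

The values increase toward the goal cell, which is the same pattern as the numbers shown on the grid map: moving to the neighbor with the highest value leads to the treasure.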

4, Pseudocode of policy iteration and value iteration

I prepared pseudocode of each algorithm based on the book by Sutton and Barto. These are among the most typical DP algorithms you will encounter while studying RL, and if you just want to implement RL by yourself, this pseudocode is enough. Or rather, it would be preferable to other more general and abstract pseudocode. But I would like to avoid explaining the pseudocode precisely, because I think we need to be more conscious of the more general ideas behind DP, which I am going to explain in the next article. I will cover only the important points of the pseudocode, and I would like to introduce some implementations of the algorithms in the latter part of the next article. I think you should briefly read this section and come back to it or to other study materials after reading the next article. In case you want to check the algorithms precisely, you can check the pseudocode I made with LaTeX in this link.

The biggest difference between policy iteration and value iteration is the timing of updating a policy. In policy iteration, a value function $v(s)$ and a policy $\pi(a|s)$ are arbitrarily initialized. (1) The first process is policy evaluation. The policy $\pi(a|s)$ is fixed, and the value function $v(s)$ approximately converges to $v_{\pi}(s)$ , which is the value function of the policy $\pi$ . This is conducted by the iterative calculation with the recurrence relation introduced in the last section. (2) The second process is policy improvement. Based on the calculated value function $v_{\pi}(s)$ , the policy $\pi(a|s)$ is updated as below.

$\pi(a|s) \gets\text{argmax}_a {\sum_{s', r}{p(s', r|s, a)[r + \gamma V(s')]}}, \quad \forall s\in \mathcal{S}$

The meaning of this update rule of a policy is quite simple: $\pi(a|s)$ is updated in a greedy way with an action $a$ such that $\sum_{s', r}{p(s', r|s, a)[r + \gamma V(s')]}$ is maximized. And when the policy $\pi(a|s)$ is not updated anymore, the policy has converged to the optimal one. At least I would like you to keep in mind that a while loop of iterative calculation of $v_{\pi}(s)$ is nested in another while loop. The outer loop continues until the policy is not updated anymore.
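The nested loops of policy iteration can be sketched as follows. This is a minimal sketch and not the pseudocode linked above. It assumes a deterministic policy stored as one action index per state and a model stored as arrays `P[s, a, s']` and `R[s, a, s']` . The MDP itself is randomly generated with a fixed seed purely for illustration.

```python
import numpy as np

def q_values(P, R, v, gamma):
    """q[s, a] = sum_{s'} p(s'|s, a) * (r(s, a, s') + gamma * v(s'))."""
    return np.sum(P * (R + gamma * v[None, None, :]), axis=2)

def policy_iteration(P, R, gamma=0.9, tol=1e-10):
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    v = np.zeros(n_states)                  # arbitrary initial values
    while True:                             # outer loop: until the policy is stable
        while True:                         # inner loop: policy evaluation
            v_new = q_values(P, R, v, gamma)[np.arange(n_states), policy]
            delta = np.max(np.abs(v_new - v))
            v = v_new
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        new_policy = np.argmax(q_values(P, R, v, gamma), axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

# Hypothetical random MDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
P = rng.random((4, 3, 4))
P /= P.sum(axis=2, keepdims=True)  # each p(.|s, a) sums to 1
R = rng.standard_normal((4, 3, 4))

policy, v = policy_iteration(P, R)
q = q_values(P, R, v, 0.9)
```

When the outer loop stops, the returned policy is greedy with respect to its own value function, so `v` agrees with `q.max(axis=1)` up to the tolerance, which is what the Bellman optimality equation requires.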

On the other hand, in value iteration there is mainly only one loop of updating $v_{k}(s)$ , which converges to $v_{\ast}(s)$ . The output policy is calculated in the same way as in policy iteration, using the estimated optimal value function. According to the book by Sutton and Barto, value iteration can be comprehended this way: the policy evaluation loop of policy iteration is truncated to only one iteration, and policy improvement is done only once at the end.

As I repeated, I think policy iteration is more than just a single algorithm. And relations of values and policies should be discussed carefully rather than just following pseudocode. And whatever RL algorithms you learn, I think more or less you find some similarities to policy iteration. Thus in the next article, I would like to introduce policy iteration in more abstract ways. And I am going to take a rough look at various major RL algorithms with the keywords of “values” and “policies” in the next article.

Appendix

I mentioned that the Bellman equation is far from obvious. In this section, I am going to introduce a mathematical derivation, which I think is the most straightforward. If you are allergic to mathematics, the part below is not recommended, but the Bellman equation is the core of RL. I would not say this is difficult, and if you are going to read some texts on RL including some equations, I think mastering the operations I explain below is almost mandatory.

First of all, let’s organize some important points. But please tolerate inaccuracy of mathematical notations here. I am going to follow notations in the book by Sutton and Barto.

Capital letters usually denote random variables. For example $X, Y,Z, S_t, A_t, R_{t+1}, S_{t+1}$ . And corresponding small letters are realized values of the random variables. For example $x, y, z, s, a, r, s'$ . (*Please do not think too much about the number of $'$ s on the small letters.)
Conditional probabilities in general are denoted as for example $\text{Pr}\{X=x, Y=y | Z=z\}$ . This means the probability that $x, y$ are sampled given that $z$ is sampled.
In the book by Sutton and Barto, a probabilistic function $p(\cdot)$ means a probability of transition, but I am using $p(\cdot)$ to denote probabilities in general. Thus $p( s', a, r | s)$ shows the probability that, given an agent being in state $s$ at time $t$ , the agent will do action $a$ , AND doing this action will cause the agent to proceed to state $s'$ at time $t+1$ , and receive reward $r$ . $p( s', a, r | s)$ is not defined in the book by Sutton and Barto.
The following equation holds for any conditional probabilities: $p(x, y|z) = p(x|y, z)p(y|z)$ . Thus importantly, $p(s', a, r|s) = p(s', r| s, a)p(a|s)=p(s', r | s, a)\pi(a|s)$ .
When random variables $X, Y$ are discrete random variables, a conditional expectation of $X$ given $Y=y$ is calculated as follows: $\mathbb{E}[X|Y=y] = \sum_{x}{x \, p(x|Y=y)}$ .
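The last point, the conditional expectation of a discrete random variable, can be illustrated with a small joint probability table. The numbers below are made up for illustration, and the function name is hypothetical.

```python
import numpy as np

x_values = np.array([0.0, 1.0, 2.0])
# joint[i, j] = Pr{X = x_values[i], Y = j}; the entries sum to 1.
joint = np.array([[0.2, 0.1],
                  [0.2, 0.1],
                  [0.1, 0.3]])

def conditional_expectation(joint, x_values, y):
    """E[X | Y = y] = sum_x x * p(x | Y = y), with p(x | y) = p(x, y) / p(y)."""
    p_y = joint[:, y].sum()
    p_x_given_y = joint[:, y] / p_y
    return float(np.sum(x_values * p_x_given_y))

e_given_0 = conditional_expectation(joint, x_values, 0)
e_given_1 = conditional_expectation(joint, x_values, 1)

# Averaging the conditional expectations over Y recovers E[X].
p_y = joint.sum(axis=0)
e_x_from_conditionals = p_y[0] * e_given_0 + p_y[1] * e_given_1
e_x_direct = float(np.sum(x_values * joint.sum(axis=1)))
```

The same pattern, conditioning on an intermediate variable and then averaging over it, is what the derivation below does with the next state $s'$ .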

Keeping the points above in mind, let’s get down to business. First, according to the definition of a value function on a policy $\pi$ and the linearity of expectation, the following equations hold.

$v_{\pi}(s) = \mathbb{E} [G_t | S_t =s] = \mathbb{E} [R_{t+1} + \gamma G_{t+1} | S_t =s]$

$=\mathbb{E} [R_{t+1} | S_t =s] + \gamma \mathbb{E} [G_{t+1} | S_t =s]$

Thus we need to calculate $\mathbb{E} [R_{t+1} | S_t =s]$ and $\mathbb{E} [G_{t+1} | S_t =s]$ . As I have explained $\mathbb{E} [R_{t+1} | S_t =s]$ is the sum of $p(s', a, r |s) r$ over all the combinations of $(s', a, r)$ . And according to one of the points above, $p(s', a, r |s) = p(s', r | s, a)p(a|s)=p(s', r | s, a)\pi(a|s)$ . Thus the following equation holds.

$\mathbb{E} [R_{t+1} | S_t =s] = \sum_{s', a, r}{p(s', a, r|s)r} =$ $\sum_{s', a, r}{p(s', r | s, a)\pi(a|s)r}$ .

Next we have to calculate

$\mathbb{E} [G_{t+1} | S_t =s]$

$= \mathbb{E} [R_{t + 2} + \gamma R_{t + 3} + \gamma ^2 R_{t + 4} + \cdots | S_t =s]$

$= \mathbb{E} [R_{t + 2} | S_t =s] + \gamma \mathbb{E} [R_{t + 3} | S_t =s] + \gamma ^2\mathbb{E} [ R_{t + 4} | S_t =s] +\cdots$ .

Let’s first calculate $\mathbb{E} [R_{t + 2} | S_t =s]$ . $\mathbb{E} [R_{t + 2} | S_t =s]$ is a sum of $p(s'', a', r', s', a, r|s)r'$ over all the combinations of $(s'', a', r', s', a, r)$ .

$\mathbb{E}_{\pi} [R_{t + 2} | S_t =s] =\sum_{s'', a', r', s', a, r}{p(s'', a', r', s', a, r|s)r'}$

$=\sum_{s'', a', r', s', a, r}{p(s'', a', r'| s', a, r, s)p(s', a, r|s)r'}$

$=\sum_{ s', a, r}{p(s', a, r|s)} \sum_{s'', a', r'}{p(s'', a', r'| s', a, r, s)r'}$

I would like you to remember that in a Markov decision process the next state $S_{t+1}$ and the reward $R_{t+1}$ depend only on the current state $S_t$ and the action $A_t$ at the time step.

Thus among the variables $s', a, r, s$ , only $s'$ affects the following variables $r', a', s'', r'', a'', s''', \dots$ . And again $p(s', a, r |s) = p(s', r | s, a)p(a|s)$ . Thus the following equations hold.

$\mathbb{E}_{\pi} [R_{t + 2} | S_t =s]=\sum_{ s', a, r}{p(s', a, r|s)} \sum_{s'', a', r'}{p(s'', a', r'| s', a, r, s)r'}$

$=\sum_{ s', a, r}{p(s', r|a, s)\pi(a|s)} \sum_{s'', a', r'}{p(s'', a', r'| s')r'}$

$= \sum_{ s', a, r}{p(s', r|a, s)\pi(a|s)} \mathbb{E}_{\pi} [R_{t+2} | S_{t+1} = s']$ .
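This factorization can be checked numerically on a small example. The sketch below is only an illustration under assumptions: it uses a randomly generated MDP with a fixed seed, stores the model as `P[s, a, s']` with a deterministic reward `R[s, a, s']` per transition, and compares a brute-force sum over all $(a, s', a', s'')$ with the factored form. All names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)    # p(s'|s, a)
R = rng.standard_normal((n_states, n_actions, n_states))
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)  # pi(a|s)

def expected_next_reward(s):
    """E[R_{t+1} | S_t = s] = sum_{a, s'} pi(a|s) p(s'|s, a) r(s, a, s')."""
    return float(np.sum(pi[s][:, None] * P[s] * R[s]))

def two_step_brute_force(s):
    """E[R_{t+2} | S_t = s] as a sum over every (a, s', a', s'')."""
    total = 0.0
    for a, s1, a1, s2 in itertools.product(
        range(n_actions), range(n_states), range(n_actions), range(n_states)
    ):
        prob = pi[s, a] * P[s, a, s1] * pi[s1, a1] * P[s1, a1, s2]
        total += prob * R[s1, a1, s2]
    return total

def two_step_factored(s):
    """sum_{s', a} p(s'|s, a) pi(a|s) * E[R_{t+2} | S_{t+1} = s']."""
    p_next = pi[s] @ P[s]  # probability of each next state s'
    return float(sum(p_next[s1] * expected_next_reward(s1)
                     for s1 in range(n_states)))

differences = [abs(two_step_brute_force(s) - two_step_factored(s))
               for s in range(n_states)]
```

The two computations agree up to floating-point error for every start state, because the model does not depend on the time step and $\mathbb{E}_{\pi} [R_{t+2} | S_{t+1} = s']$ equals the one-step expected reward from $s'$ .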

$\mathbb{E}_{\pi} [R_{t + 3} | S_t =s]$ can be calculated the same way.

$\mathbb{E}_{\pi}[R_{t + 3} | S_t =s] =\sum_{s''', a'', r'', s'', a', r', s', a, r}{p(s''', a'', r'', s'', a', r', s', a, r|s)r''}$

$=\sum_{s''', a'', r'', s'', a', r', s', a, r}{p(s''', a'', r'', s'', a', r'| s', a, r, s)p(s', a, r|s)r''}$

$=\sum_{ s', a, r}{p(s', a, r|s)} \sum_{s''', a'', r'', s'', a', r'}{p(s''', a'', r'', s'', a', r'| s', a, r, s)r''}$

$=\sum_{ s', a, r}{ p(s', r | s, a)p(a|s)} \sum_{s''', a'', r'', s'', a', r'}{p(s''', a'', r'', s'', a', r'| s')r''}$

$=\sum_{ s', a, r}{ p(s', r | s, a)p(a|s)} \mathbb{E}_{\pi} [R_{t+3} | S_{t+1} = s']$ .

The same is true of calculating $\mathbb{E}_{\pi} [R_{t + 4} | S_t =s]$ , $\mathbb{E}_{\pi} [R_{t + 5} | S_t =s]\dots$ . Thus

$v_{\pi}(s) =\mathbb{E}_{\pi} [R_{t+1} | S_t =s] + \gamma \mathbb{E}_{\pi} [G_{t+1} | S_t =s]$

$=\sum_{s', a, r}{p(s', r | s, a)\pi(a|s)r}$ $+ \gamma \mathbb{E}_{\pi} [R_{t + 2} | S_t =s] + \gamma^2 \mathbb{E}_{\pi} [R_{t + 3} | S_t =s] + \gamma ^3\mathbb{E}_{\pi} [ R_{t + 4} | S_t =s] +\cdots$

$=\sum_{s', a, r}{p(s', r | s, a)\pi(a|s)r}$ $+\gamma \sum_{ s', a, r}{p(s', r|s, a)\pi(a|s)} \mathbb{E}_{\pi} [R_{t+2} |S_{t+1}= s']$ $+\gamma^2 \sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} \mathbb{E}_{\pi} [R_{t+3} |S_{t+1} = s']$ $+\gamma^3 \sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} \mathbb{E}_{\pi} [ R_{t+4}|S_{t+1} = s'] + \cdots$

$=\sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} [r + \mathbb{E}_{\pi} [\gamma R_{t+2}+ \gamma^2 R_{t+3}+\gamma^3R_{t+4} + \cdots |S_{t+1} = s'] ]$

$=\sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} [r + \gamma \mathbb{E}_{\pi} [G_{t+1} |S_{t+1} = s'] ]$

$=\sum_{ s', a, r}{ p(s', r | s, a)\pi(a|s)} [r + \gamma v_{\pi}(s') ]$
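To see the resulting recursion at work, here is a minimal sketch of iterative policy evaluation on a made-up two-state MDP. It assumes the discount factor $\gamma$ from the definition $v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} | S_t = s] + \gamma \mathbb{E}_{\pi}[G_{t+1} | S_t = s]$. The names `dynamics`, `policy`, `GAMMA` and all numbers are assumptions for illustration. The backup repeatedly applies $\sum_{s', a, r} p(s', r|s, a)\pi(a|s)[r + \gamma v(s')]$ until the values stop changing.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical toy MDP (invented numbers):
# dynamics[(s, a)] lists (probability, next_state, reward), i.e. p(s', r | s, a).
dynamics = {
    (0, 0): [(0.7, 0, 1.0), (0.3, 1, 0.0)],
    (0, 1): [(1.0, 1, 2.0)],
    (1, 0): [(0.5, 0, -1.0), (0.5, 1, 3.0)],
    (1, 1): [(1.0, 0, 0.5)],
}
# policy[s][a] is pi(a | s).
policy = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.9, 1: 0.1}}


def bellman_backup(v, s, gamma=GAMMA):
    """Right-hand side of the recursion for state s, given value estimates v."""
    return sum(
        prob_a * prob * (r + gamma * v[s_next])
        for a, prob_a in policy[s].items()
        for prob, s_next, r in dynamics[(s, a)]
    )


def evaluate_policy(tol=1e-12, max_sweeps=10_000):
    """Apply the backup to all states until the largest change drops below tol."""
    v = {s: 0.0 for s in policy}
    for _ in range(max_sweeps):
        new_v = {s: bellman_backup(v, s) for s in v}
        delta = max(abs(new_v[s] - v[s]) for s in v)
        v = new_v
        if delta < tol:
            break
    return v


v_pi = evaluate_policy()
for state, value in v_pi.items():
    # At convergence, each value reproduces itself under the backup.
    print(state, value, bellman_backup(v_pi, state))
```

With $\gamma < 1$ the backup is a contraction, so the sweep settles on a fixed point where both printed columns agree up to the tolerance.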

My elaborate study notes on reinforcement learning

July 31, 2021/in Artificial Intelligence, Data Science, Data Science Hack, Deep Learning, Machine Learning, Main Category, Use Cases/by Yasuto Tamura

I will not tell you why, but all of a sudden I found myself needing to write an article series on reinforcement learning, even though I am a beginner in the field myself. Everything I knew came from one online lecture, delivered in a lazy tone, at my college. However, in the process of learning reinforcement learning, I found a line that could connect two dots: one is reinforcement learning, and the other is my own field of study. That is why I made up my mind to write an article series on reinforcement learning seriously.

To be a bit more concrete, I imagine that technologies in our world could be enhanced by a combination of reinforcement learning and virtual reality. That means companies like Toyota or VW might come to invest in visual effects or video game companies more seriously in the future. And I have actually been struggling with how to train deep learning models with CGI, which might bridge the virtual world and the real world.

As I am also a beginner in reinforcement learning, this article series will be a kind of study note for me. But as in my former articles, I prefer exhaustive but intuitive explanations of AI algorithms, so I will do my best to make my series as instructive and effective as existing tutorials on reinforcement learning.

This article series is going to be composed of the following contents.

Understanding the “simplicity” of reinforcement learning: comprehensive tips to take the trouble out of RL
Graphical understanding of dynamic programming and the Bellman equation: taking a typical approach at first
Four essential ideas for making reinforcement learning and dynamic programming more effective
Stop saying “trial and errors” for now: seeing reinforcement learning through some spectrums
A thaw in another winter of artificial intelligence: uses of deep learning in reinforcement learning (coming soon!)

In this article series I would like to share what I have learned about RL, and I hope you can get some hints for learning this fascinating field. If you have any comments or advice on my “study note,” leaving a comment or contacting me via email would be appreciated.

Interview – Machine Learning in Marketing and CRM

June 11, 2020/in Artificial Intelligence, Data Science, Data Science News, Gerneral, Insights, Interviews, Machine Learning, Main Category/by Redaktion

Interview with Laurenz Wuttke of datasolut GmbH on machine learning in marketing and CRM.

Laurenz Wuttke is a data scientist and the founder of datasolut GmbH. He studied business informatics at Hochschule Hannover and has worked with marketing and CRM systems and with data analysis since 2011. Today he lectures on Big Data in Marketing at Hochschule Düsseldorf and helps companies use artificial intelligence to respond individually to the needs of thousands of customers, so that every marketing manager can make each customer the right offer at the right time.

Data Science Blog: Mr. Wuttke, marketing is considered one of the pioneering areas in companies for getting started with big data analytics. How established are big data and data science in marketing today?

Many companies in Germany are only now recognizing the opportunities and the value of their data. As a result, they are investing in big data infrastructure and data science teams.

At the same time, I think that we in marketing are only at the beginning of a new data era. At the moment, big data and data science are still a topic for large corporations. Many small and medium-sized companies still have a lot of untapped potential when it comes to intelligent customer analytics.

Because the cost of customer acquisition keeps rising, maintaining and strengthening customer loyalty is becoming ever more important. This is exactly where I see the advantages of data science in marketing. Companies can respond much more precisely to customer needs, anticipate which product will be bought next, and target their marketing accordingly. This “personalized marketing” leads to significantly stronger customer loyalty and increases revenue in the long term.

Many American companies are leading the way, but German companies such as Zalando and AboutYou are also investing a lot of money in personalizing their marketing. I think the results speak for themselves.

Data Science Blog: A common concern for many marketing managers is accurate customer segmentation by a wide range of criteria. What improvements are possible here, and how can companies achieve them?

Customer segmentation is an important part of many marketing strategies. However, you can go considerably further and practice marketing in the sense of “segments of one”. That means we have an individual “next best action” and “next best offer” for every single customer.

From the marketing point of view, each customer is thus considered individually and receives individual product recommendations and marketing measures tailored to their needs.

This is also an important step toward marketing automation, because in marketing we simply cannot look after thousands of customers personally.

Data Science Blog: Once the customer clusters have been identified, the question is how to address them better. How does the customer analysis needed for this work?

It varies a lot: customer analysis differs depending on the business model and the industry. We look at various characteristics relating to historical purchasing behavior, demographics, and product usage. As a rule, this very quickly yields customer profiles or personas that can be addressed in a targeted way.

Data Science Blog: Analytics projects like these are often never started because of fears that the relevant data is not available or that the data quality is not good enough for such an analysis. Are these concerns justified?

No, because the data needed for a customer analysis or for predicting outcomes often comes from sources such as transaction data. Every company has this data available in good quality.

Of course, the analyses get better when further data sources such as product metadata, customer attributes, or click behavior are available, but that is not a must.

From my practical experience, I can say that untapped potential often lies dormant here.
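As an illustration of how far plain transaction data can go, the following minimal sketch derives recency, frequency, and monetary (RFM) features per customer. RFM is a common technique swapped in here for illustration and is not named in the interview. The transactions, the customer IDs, and the function `rfm_features` are invented assumptions.

```python
from datetime import date

# Synthetic toy transactions, invented for illustration:
# (customer_id, purchase_date, amount).
transactions = [
    ("c1", date(2020, 1, 5), 40.0),
    ("c1", date(2020, 3, 20), 25.0),
    ("c2", date(2019, 11, 2), 120.0),
    ("c3", date(2020, 4, 1), 15.0),
    ("c3", date(2020, 4, 15), 30.0),
    ("c3", date(2020, 5, 1), 22.5),
]


def rfm_features(rows, reference_date):
    """Aggregate transactions into recency, frequency, and monetary features."""
    per_customer = {}
    for customer_id, purchase_date, amount in rows:
        entry = per_customer.setdefault(
            customer_id,
            {"last_purchase": purchase_date, "frequency": 0, "monetary": 0.0},
        )
        entry["last_purchase"] = max(entry["last_purchase"], purchase_date)
        entry["frequency"] += 1
        entry["monetary"] += amount
    return {
        customer_id: {
            # Days since the most recent purchase.
            "recency_days": (reference_date - entry["last_purchase"]).days,
            # Number of purchases.
            "frequency": entry["frequency"],
            # Total amount spent.
            "monetary": entry["monetary"],
        }
        for customer_id, entry in per_customer.items()
    }


features = rfm_features(transactions, reference_date=date(2020, 6, 1))
for customer_id, values in sorted(features.items()):
    print(customer_id, values)
```

Features like these could then feed a segmentation or a predictive model. Further sources such as click behavior would simply add more columns.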

Data Science Blog: What is your experience with the interaction between marketing and business intelligence? Should marketing managers have their own data experts, or is it better to concentrate these resources centrally in a BI department?

In my view, modern marketing no longer works without a valid data basis. For this reason, cooperation between marketing and business intelligence is indispensable, especially when it comes to marketing to existing customers. Ideally, all data sources converge here in a 360-degree customer view.

This can then also serve as the data source for machine learning and data science. All important data can be converted from a structured 360-degree view into a machine learning data basis (ML feature store). That saves an enormous amount of time and money.

As for your second question: I think there are arguments for both setups, so I have no clear preference. What always matters to me is that the exchange between the technical side and the business department works well. Goals must be discussed and, if necessary, adjusted so that the work keeps heading in the right direction. If this prerequisite is met alongside a good data science infrastructure, data science becomes truly scalable.

Data Science Blog: Do companies need a customer data platform (CDP) for this, or at least a CRM? Where should companies start if they are still at the very beginning?

A customer data platform (CDP) is an advantage but not a must at the outset. A good CRM system or a well-maintained customer database is enough to begin with.

Of course, a CDP offers a decisive advantage by bringing together the online world and the CRM world. Click behavior has an enormous influence on the analytical models and helps to understand customers better and better. That is particularly important in our time, because we have less and less direct customer contact, and this will decline even further in the future.

In summary: whoever links this customer data together intelligently has a big advantage.

Data Science Blog: How do you integrate app and web tracking into your analyses?

Tracking data from apps and websites is an important component of our machine learning models. It reveals important information about customer behavior. Tracking data can therefore provide good features for use cases such as churn prediction, customer lifetime value, and next best offer.

Tracking data from different providers (Google Analytics, Piwik, etc.) often differs slightly in structure, so we have devised an intelligent approach to unify it and use it in our models.

Data Science Blog: Back to the customer. Successful companies focus on the customer's needs. Some business models are based on subscriptions or memberships. How can you help such companies?

Subscriptions are a major trend: the customer becomes a user, and a lot of customer data is generated that can be collected. Many of our customers have subscription-based or contract-based business models, which I personally find very interesting.

These companies often face the challenge of retaining their customers over the long term and building healthy customer loyalty. Acquisition is usually very expensive, so reducing customer churn is a strategic goal. These companies become really successful when the churn rate is kept low.

The solution for low customer churn, besides a good product and good customer service, is churn prediction and churn prevention measures that build on it. We take historical customer behavior, look at the customers who cancelled, and use this to model a prediction of customer churn. This allows companies to identify customers at risk of churning early and act accordingly. The decisive advantage is that you do not have to win back a customer who has already been lost.

But there are also ways to start well before the actual risk of churn, namely when inactivity looms. For a large fitness app provider, for example, we developed an alert system that automatically sends customers engagement campaigns when inactivity threatens, to draw their attention to the offers. You know this from the Netflix app, which gives you a good tip for your evening viewing every night.
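The churn prediction idea described above (learn from historical behavior and from customers who cancelled, then score current customers) can be sketched in a few lines. This is a minimal illustration and not datasolut's method. The two features, the toy data, and the choice of logistic regression with plain gradient descent are all assumptions.

```python
import math

# Synthetic toy history, invented for illustration only:
# ((days_since_last_login, sessions_last_30_days), cancelled) with cancelled in {0, 1}.
history = [
    ((2, 20), 0), ((5, 15), 0), ((1, 25), 0), ((7, 10), 0),
    ((30, 1), 1), ((45, 0), 1), ((25, 2), 1), ((60, 0), 1),
]


def scale(x):
    """Bring both features to a roughly 0..1 range (constants fit the toy data)."""
    return (x[0] / 60.0, x[1] / 25.0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, learning_rate=0.5, epochs=2000):
    """Fit logistic regression weights with batch gradient descent."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            xs = scale(x)
            error = sigmoid(w[0] * xs[0] + w[1] * xs[1] + b) - y
            grad_w[0] += error * xs[0]
            grad_w[1] += error * xs[1]
            grad_b += error
        w = [w[0] - learning_rate * grad_w[0] / n,
             w[1] - learning_rate * grad_w[1] / n]
        b -= learning_rate * grad_b / n
    return w, b


def churn_probability(model, x):
    """Estimated probability that a customer with features x cancels."""
    w, b = model
    xs = scale(x)
    return sigmoid(w[0] * xs[0] + w[1] * xs[1] + b)


model = train(history)
at_risk = churn_probability(model, (50, 0))   # long inactive, no sessions
loyal = churn_probability(model, (1, 22))     # active yesterday, many sessions
print(round(at_risk, 3), round(loyal, 3))
```

Customers whose score exceeds a chosen threshold could then be routed into a churn prevention campaign. A real project would use far richer features and a proper validation setup.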

Data Science Blog: Let's go one level higher. Many a CMO has a deal with the CFO to put only a fixed amount into marketing each year. How does data science help with distributing the budget across existing customers?

There is a simple solution for this: customer lifetime value forecasts. Machine learning is used to make a revenue prediction for each individual customer over a specific period. Existing-customer marketing can then deploy the marketing budget in a very targeted way and steer it by customer value. Let me give you an example: customer reactivation in retail. You have a fixed budget and cannot send a reactivation measure to every customer. If you have a well-calculated customer lifetime value, you can concentrate on the valuable customers and reactivate them.
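The reactivation example can be sketched as a simple ranking under a budget. This is a minimal illustration that assumes the customer lifetime value predictions already exist. The values in `predicted_clv`, the cost per contact, and the function `select_for_reactivation` are invented for the example.

```python
# Hypothetical predicted customer lifetime values (invented numbers).
predicted_clv = {"c1": 320.0, "c2": 45.0, "c3": 510.0, "c4": 150.0, "c5": 80.0}


def select_for_reactivation(clv_by_customer, budget, cost_per_contact):
    """Pick the highest-value customers that the budget can cover."""
    max_contacts = int(budget // cost_per_contact)
    ranked = sorted(clv_by_customer, key=clv_by_customer.get, reverse=True)
    return ranked[:max_contacts]


# With a budget of 10 and a cost of 4 per contact, two customers can be reached.
selected = select_for_reactivation(predicted_clv, budget=10.0, cost_per_contact=4.0)
print(selected)
```

A more careful variant would weigh the predicted value against each customer's probability of responding, which this sketch deliberately leaves out.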

Data Science Blog: Which technologies do you prefer to work with? Which tools are particularly efficient for analytical tasks in marketing?

In recent years we have focused particularly on Python and PySpark. The other environments could hardly keep up with the development of Python for data science, so in my view Python is currently the best environment for our solutions.

The cloud also plays a major role for us. As a small company, we at datasolut have focused on the AWS cloud, since we would not be in a position to host our customers' huge data volumes ourselves.

Above all in terms of the high degree of automation in data processing and machine learning, AWS offers everything a data scientist's heart desires.

Data Science Blog: What would you advise a junior marketing manager and a junior data scientist to do to advance their careers? How will these young people be professionally successful in the future?

I would always advise a junior marketing manager to acquire data analysis skills. Above all, though, they should understand what is possible with data and how it can be used. In my lecture on “Big Data in Marketing” at Hochschule Düsseldorf, I also teach students who specialize in marketing, and I always give them this advice.

With junior data scientists it is the other way round. In practice I repeatedly see data scientists who do not manage the transfer between marketing and data science well. I therefore advise every data scientist who wants to focus on marketing and sales that domain know-how is essential here. No model or score has value for a company if it is not put to good use in marketing and does not help to automate marketing processes.

Another important aspect is that data science and machine learning are changing rapidly right now. The automation of these processes (keyword: AutoML) is in the fast lane, as the major cloud providers show very clearly. We already use this technology in practice as well. What the algorithm cannot take over, however, is the transfer to and enablement of the business departments.

Data Science Blog: Finally, one more request: what is your prediction for the coming years 2021/2022? What is the next big thing in marketing analytics?

There are of course many smaller trends that will change marketing. However, I think the biggest change for companies will be a much more widespread use of machine learning in marketing. This will make competition tougher, and for many companies marketing analytics will be an essential success factor.
