Variational Autoencoders

After Deep Autoregressive Models and Deep Generative Modelling, we will continue our discussion with Variational AutoEncoders (VAEs) after covering up DGM basics and AGMs. Variational autoencoders (VAEs) are a deep learning method to produce synthetic data (images, texts) by learning the latent representations of the training data. AGMs are sequential models and generate data based on previous data points by defining tractable conditionals. On the other hand, VAEs are using latent variable models to infer hidden structure in the underlying data by using the following intractable distribution function: 

(1)   \begin{equation*} p_\theta(x) = \int p_\theta(x|z)p_\theta(z) dz. \end{equation*}

The generative process using the above equation can be expressed in the form of a directed graph as shown in Figure ?? (the decoder part), where latent variable z\sim p_\theta(z) produces meaningful information of x \sim p_\theta(x|z).

Architectures AE and VAE based on the bottleneck architecture. The decoder part work as a generative model during inference.

Figure 1: Architectures AE and VAE based on the bottleneck architecture. The decoder part work as
a generative model during inference.

Autoencoders

Autoencoders (AEs) are the key part of VAEs and are an unsupervised representation learning technique and consist of two main parts, the encoder and the decoder (see Figure ??). The encoders are deep neural networks (mostly convolutional neural networks with imaging data) to learn a lower-dimensional feature representation from training data. The learned latent feature representation z usually has a much lower dimension than input x and has the most dominant features of x. The encoders are learning features by performing the convolution at different levels and compression is happening via max-pooling.

On the other hand, the decoders, which are also a deep convolutional neural network are reversing the encoder’s operation. They try to reconstruct the original data x from the latent representation z using the up-sampling convolutions. The decoders are pretty similar to VAEs generative models as shown in Figure 1, where synthetic images will be generated using the latent variable z.

During the training of autoencoders, we would like to utilize the unlabeled data and try to minimize the following quadratic loss function:

(2)   \begin{equation*} \mathcal{L}(\theta, \phi) = ||x-\hat{x}||^2, \end{equation*}


The above equation tries to minimize the distance between the original input and reconstructed image as shown in Figure 1.

Variational autoencoders

VAEs are motivated by the decoder part of AEs which can generate the data from latent representation and they are a probabilistic version of AEs which allows us to generate synthetic data with different attributes. VAE can be seen as the decoder part of AE, which learns the set parameters \theta to approximate the conditional p_\theta(x|z) to generate images based on a sample from a true prior, z\sim p_\theta(z). The true prior p_\theta(z) are generally of Gaussian distribution.

Network Architecture

VAE has a quite similar architecture to AE except for the bottleneck part as shown in Figure 2. in AES, the encoder converts high dimensional input data to low dimensional latent representation in a vector form. On the other hand, VAE’s encoder learns the mean vector and standard deviation diagonal matrix such that z\sim \matcal{N}(\mu_z, \Sigma_x) as it will be performing probabilistic generation of data. Therefore the encoder and decoder should be probabilistic.

Training

Similar to AGMs training, we would like to maximize the likelihood of the training data. The likelihood of the data for VAEs are mentioned in Equation 1 and the first term p_\theta(x|z) will be approximated by neural network and the second term p(x) prior distribution, which is a Gaussian function, therefore, both of them are tractable. However, the integration won’t be tractable because of the high dimensionality of data.

To solve this problem of intractability, the encoder part of AE was utilized to learn the set of parameters \phi to approximate the conditional q_\phi (z|x). Furthermore, the conditional q_\phi (z|x) will approximate the posterior p_\theta (z|x), which is intractable. This additional encoder part will help to derive a lower bound on the data likelihood that will make the likelihood function tractable. In the following we will derive the lower bound of the likelihood function:

(3)   \begin{equation*} \begin{flalign} \begin{aligned} log \: p_\theta (x) = & \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: \frac{p_\theta (x|z) p_\theta (z)}{p_\theta (z|x)} \: \frac{q_\phi(z|x)}{q_\phi(z|x)}\Bigg] \\ = & \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: p_\theta (x|z)\Bigg] - \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: \frac{q_\phi (z|x)} {p_\theta (z)}\Bigg] + \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: \frac{q_\phi (z|x)}{p_\theta (z|x)}\Bigg] \\ = & \mathbf{E}_{z\sim q_\phi(z|x)} \Big[log \: p_\theta (x|z)\Big] - \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z)) + \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z|x)). \end{aligned} \end{flalign} \end{equation*}


In the above equation, the first line computes the likelihood using the logarithmic of p_\theta (x) and then it is expanded using Bayes theorem with additional constant q_\phi(z|x) multiplication. In the next line, it is expanded using the logarithmic rule and then rearranged. Furthermore, the last two terms in the second line are the definition of KL divergence and the third line is expressed in the same.

In the last line, the first term is representing the reconstruction loss and it will be approximated by the decoder network. This term can be estimated by the reparametrization trick \cite{}. The second term is KL divergence between prior distribution p_\theta(z) and the encoder function q_\phi (z|x), both of these functions are following the Gaussian distribution and has the closed-form solution and are tractable. The last term is intractable due to p_\theta (z|x). However, KL divergence computes the distance between two probability densities and it is always positive. By using this property, the above equation can be approximated as:

(4)   \begin{equation*} log \: p_\theta (x)\geq \mathcal{L}(x, \phi, \theta) , \: \text{where} \: \mathcal{L}(x, \phi, \theta) = \mathbf{E}_{z\sim q_\phi(z|x)} \Big[log \: p_\theta (x|z)\Big] - \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z)). \end{equation*}

In the above equation, the term \mathcal{L}(x, \phi, \theta) is presenting the tractable lower bound for the optimization and is also termed as ELBO (Evidence Lower Bound Optimization). During the training process, we maximize ELBO using the following equation:

(5)   \begin{equation*} \operatorname*{argmax}_{\phi, \theta} \sum_{x\in X} \mathcal{L}(x, \phi, \theta). \end{equation*}

.

Furthermore, the reconstruction loss term can be written using Equation 2 as the decoder output is assumed to be following Gaussian distribution. Therefore, this term can be easily transformed to mean squared error (MSE).

During the implementation, the architecture part is straightforward and can be found here. The user has to define the size of latent space, which will be vital in the reconstruction process. Furthermore, the loss function can be minimized using ADAM optimizer with a fixed batch size and a fixed number of epochs.

Figure 2: The results obtained from vanilla VAE (left) and a recent VAE-based generative model NVAE (right)

Figure 2: The results obtained from vanilla VAE (left) and a recent VAE-based generative
model NVAE (right)

In the above, we are showing the quality improvement since VAE was introduced by Kingma and
Welling [KW14]. NVAE is a relatively new method using a deep hierarchical VAE [VK21].

Summary

In this blog, we discussed variational autoencoders along with the basics of autoencoders. We covered
the main difference between AEs and VAEs along with the derivation of lower bound in VAEs. We
have shown using two different VAE based methods that VAE is still active research because in general,
it produces a blurry outcome.

Further readings

Here are the couple of links to learn further about VAE-related concepts:
1. To learn basics of probability concepts, which were used in this blog, you can check this article.
2. To learn more recent and effective VAE-based methods, check out NVAE.
3. To understand and utilize a more advance loss function, please refer to this article.

References

[KW14] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2014.
[VK21] Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder, 2021.

Key Points on AI’s Role In The Future Of Data Protection

Artificial Intelligence is transforming every industry as we speak, and data protection might be the biggest of them all. With a projected market size of USD 113390 Million, there’s a lot to protect—and humans won’t be able to do it all.

Luckily for us, Artificial Intelligence solutions are here to help us out. Because AI can do a lot more than just collect and analyze data — it can also protect it. In this article, we’ll explain what the role of Artificial Intelligence is in the future of data protection.

Here’s AI for data protection in summary:

3 Ways AI serves in data protection

  • AI Can Improve Compliance: from the GDPR to the CPRA, AI can help you track down gaps in your compliance with the most important data protection legislation.
  • AI as an ally against cyberattacks: cyberattacks are becoming increasingly sophisticated, but so is AI. It can help you recognize the patterns that indicate an attack is underway and put in automated reactions to minimize damage.
  • AI can protect against phishing attempts: together with ML and NLP, AI is a valuable tool in detecting phishing attempts—especially since they are becoming increasingly hard to spot.

Why AI is so valuable in the fight against cybercrime

  • AI can handle more and more complex data than humans: with the amount of data that is being processed and collected every second, it’s incredibly inefficient to not let AI do the work—and AI can cut costs drastically as well.
  • AI can quickly classify data and keep it organized: before you can protect your data, make sure it’s organized properly. No matter the amount or complexity of the structure, AI can help you stay on top of it.
  • No humans needed to keep sensitive data secure: scared of human errors and have trust issues? With AI, you don’t need to rely on people for protection and discreteness.

The threats your data faces on a daily basis

It’s not just the good guys who are using technologies like artificial intelligence to up their game—hackers and people after sensitive data can also reap the benefits of AI. There are more than 2,200 cyberattacks per day—which means one every 39 seconds, so the threat is substantial.

While the clock is ticking, research found that fewer than 25% of businesses think they’re ready to fight off a ransomware attack. That leaves 75% of organizations all the more vulnerable to data privacy threats.

Leaks of personal information, data hacks and other privacy scandals are costly: it’s estimated that cybercrime will cost companies worldwide an estimated $10.5 trillion annually by 2025, with an ​​average cost of $3.86 million per breach—not including the harm done to users and the reputation of a business.

That makes investing in a solid data protection system all the more useful, which is shown in the spending habits of businesses all over the world: global spending on privacy efforts are expected to reach $8 billion by 2022. Luckily, with the rapid developments in AI and other smart security tools, it has become more attainable—even for smaller businesses.

3 Ways AI serves in data protection

What does Artificial intelligence in data protection look like in practice? Let’s look at some of the ways AI can assist your organization in warding off cyber criminals.

1.    AI Can Improve Compliance

How compliant is your organization with all the data protection and privacy regulations? It can be incredibly hard to keep up, understand and check whether your systems are up-to-date on the latest compliance regulations.

But—no need to worry! AI has taken over the nitty-gritty of it all. It’s expected that by 2023, over 40% of privacy compliance technology will rely on AI.

What kind of legislation can you hold up with the use of AI? Two big names are the GDPR and CPRA. AI can help you identify blind spots in your data protection efforts and warn you when you’re not living up to the standards governments put in place.

One tool that does this is SECURITI.ai. With AI-driven PI data discovery, DSR automation, documented accountability you get a clearer view of your data processing activities and can make sure you’re compliant.

An alternative AI solution is Claudette, a web crawler that assesses the privacy policies using supervised machine learning technologies. After it’s done scanning and collecting information, it checks if the data is used in a way that’s GDPR proof. It shows you issues such as incomplete information, unclear language, or problematic data processing tactics.

Of course, you can’t solely rely on AI to do all the work when it comes to privacy and data protection. You and your employees also need to understand and handle data in ways that are compliant with the rules set in place.

Start with understanding what the GDPR and CPRA are all about. Osano’s guide to CPRA is a great place to start to learn what the CPRA, which will replace the CPPA on January 1, 2023, is all about. Educate yourself on the rules of data protection, and it will be even easier to select an AI tool that will help you protect your valuable data.

2.    AI as an ally against cyberattacks

With the combination of big data, artificial intelligence and machine learning, you have a great recipe for tracking down the patterns that indicate a cyberattack is happening. Why is that helpful?

It’s all about identifying patterns. When AI and ML work together, they can map out what happened during previous attacks. Together, they can identify the actions hackers have taken before and find weak spots in your security system, so you can fill those gaps and be extra alert.

AI can assist in quickly alerting the right people and systems that there’s a threat. This can even kick off a series of extra measures to be taken, so the cyberattack can be beaten back.

AI can also make sure malicious websites and unauthorized data transactions are automatically blocked before any harm can be done.

3.    AI can protect against phishing attempts

​​Sometimes its employees who unknowingly are letting the cyber criminals in. Many people roll their eyes when they hear about yet another phishing attempt—shouldn’t we all know better by now not to click on certain links? — but cyber criminals are creating increasingly sophisticated phishing attacks. Even the most tech-savvy and internet-native people are able to fall for it.

Because phishing is all about what’s happening in the details, or in the background of a message—something the untrained human eye won’t immediately see.

Ai does see it, however. With technologies like Natural Language Processing and Machine Learning, it can automatically spot if a phishing attack is at play, and warn users.

There are even AI and ML tools on the market that are able to analyze the context of a message and the relationship between the sender and receiver, for even greater accuracy.

Why AI is so valuable in the fight against cybercrime

But why AI? Can we really rely on yet another robotic system to keep a digital framework safe? Isn’t it safe to have it handled by humans? We’ll expand on the three main benefits AI offers in the data protection game.

1.    AI can handle more and more complex data than humans

With all the data that is being processed and stored nowadays, there are barely enough people on the planet to keep an eye on every sensitive piece of information.

Good data protection is extremely time-consuming, because it’s constant. Checking servers manually is virtually impossible.

AI can work automatically and 24/7, no matter how much data there is to handle. On top of that, AI can be put in place to handle the more complex data structures, which can be hard to analyze and protect for humans. All while keeping costs low.

2.    AI can quickly classify data and keep it organized

Before you can even start protecting data, you will need to put it in place—efficiently. With the large volumes of data that organizations deal with, AI comes in handy. AI can quickly classify and manage data to keep it organized.

3.    No humans needed to keep sensitive data secure

AI can work independently from humans, which means nobody necessarily needs to have direct access to the sensitive data you’re trying to predict. Not only does that decrease the changes of human error, but it also builds an extra layer of trust.

Ready to call in the help of AI for your data protection?

Start by looking at the legislations that are important for your organization, and build on the needs you have for your specific business. Want to know more about the power of AI for data driven businesses? Keep reading in our blog section dedicated to artificial intelligence!

Training of Deep Learning AI models

Ein KI Projekt richtig umsetzen : So geht’s

Sie wollen in Ihrem Unternehmen Kosten senken und effizientere Workflows einführen? Dann haben Sie vielleicht schon darüber nachgedacht, Prozesse mit Künstlicher Intelligenz zu automatisieren. Für einen gelungenen Start, besprechen wir nun, wie ein KI-Projekt abläuft und wie man es richtig umsetzt.

Wir von DATANOMIQ und pixolution teilen unsere Erfahrungen aus Deep Learning Projekten, wo es vor allem um die Optimierung und Automatisierung von Unternehmensprozessen rund um visuelle Daten geht, etwa Bilder oder Videos. Wir stellen Ihnen die einzelnen Projektschritte vor, verraten Ihnen, wo dabei die Knackpunkte liegen und wie alle Beteiligten dazu beitragen können, ein KI-Projekt zum Erfolg zu führen.

1. Erstgespräch

In einem Erstgespräch nehmen wir Ihre Anforderungen auf.

  • Bestandsaufnahme Ihrer aktuellen Prozesse und Ihrer Änderungswünsche: Wie sind Ihre aktuellen Prozesse strukturiert? An welchen Prozessen möchten Sie etwas ändern?
  • Zielformulierung: Welches Endergebnis wünschen Sie sich? Wie genau sollen die neuen Prozesse aussehen? Das Ziel sollte so detailliert wie möglich beschrieben werden.
  • Budget: Welches Budget haben Sie für dieses Projekt eingeplant? Zusammen mit dem formulierten Ziel gibt das Budget die Wege vor, die wir zusammen in dem Projekt gehen können. Meist wollen Sie durch die Einführung von KI Kosten sparen oder höhere Umsätze erreichen. Das spielt für Höhe des Budgets die entscheidende Rolle.
  • Datenlage: Haben Sie Daten, die wir für das Training verwenden können? Wenn ja, welche und wieviele Daten sind das? Ist eine kontinuierliche Datenerfassung vorhanden, die während des Projekts genutzt werden kann, oder muss dafür erst die Grundlage geschaffen werden?

2. Evaluation

In diesem Schritt evaluieren und planen wir mit Ihnen gemeinsam die Umsetzung des Projekts. Das bedeutet im Einzelnen folgendes.

Begutachtung der Daten und weitere Datenplanung

Wir sichten von Ihnen bereitgestellte Trainingsdaten, z.B. gelabelte Bilder, und machen uns ein Bild davon, ob diese für das Training sinnvoll verwendet werden können. Da man für Deep Learning sehr viele Trainingsdaten benötigt, ist das ein entscheidender Punkt. In die Begutachtung der Daten fließt auch die Beurteilung der Qualität und Ausgewogenheit ein, denn davon ist abhängig wie gut ein KI-Modell lernt und korrekte Vorhersagen trifft.

Wenn von Ihnen keinerlei Daten zum Projektstart bereitgestellt werden können, wird zuerst ein separates Projekt notwendig, das nur dazu dient, Daten zu sammeln. Das bedeutet für Sie etwa je nach Anwendbarkeit den Einkauf von Datensets oder Labeling-Dienstleistungen.
Wir stehen Ihnen dabei beratend zur Seite.

Während der gesamten Dauer des Projekts werden immer wieder neue Daten benötigt, um die Qualität des Modells weiter zu verbessern. Daher müssen wir mit Ihnen gemeinsam planen, wie Sie fortlaufend diese Daten erheben, falsche Predictions des Modells erkennen und korrigieren, sodass Sie diese uns bereitstellen können. Die richtig erkannten Daten sowie die falsch erkannten und dann korrigierten Daten werden nämlich in das nächste Training einfließen.

Definition des Minimum Viable Product (MVP)

Wir definieren mit Ihnen zusammen, wie eine minimal funktionsfähige Version der KI aussehen kann. Die Grundfrage hierbei ist: Welche Komponenten oder Features sollten als Erstes in den Produktivbetrieb gehen, sodass Sie möglichst schnell einen Mehrwert aus
der KI ziehen?

Ein Vorteil dieser Herangehensweise ist, dass Sie den neuen KI-basierten Prozess in kleinem Maßstab testen können. Gleichzeitig können wir Verbesserungen schneller identifizieren. Zu einem späteren Zeitpunkt können Sie dann skalieren und weitere Features aufnehmen. Die schlagenden Argumente, mit einem MVP zu starten, sind jedoch die Kostenreduktion und Risikominimierung. Anstatt ein riesiges Projekt umzusetzen wird ein kleines Mehrwert schaffendes Projekt geschnürt und in der Realität getestet. So werden Fehlplanungen und
-entwicklungen vermieden, die viel Geld kosten.

Definition der Key Performance Indicators (KPI)

Key Performance Indicators sind für die objektive Qualitätsmessung der KI und des Business Impacts wichtig. Diese Zielmarken definieren, was das geplante System leisten soll, damit es erfolgreich ist. Key Performance Indicators können etwa sein:

  • Durchschnittliche Zeitersparnis des Prozesses durch Teilautomatisierung
  • Garantierte Antwortzeit bei maximalem Anfrageaufkommen pro Sekunde
  • Parallel mögliche Anfragen an die KI
  • Accuracy des Modells
  • Zeit von Fertigstellung bis zur Implementierung des KI Modells

Planung in Ihr Produktivsystem

Wir planen mit Ihnen die tiefe Integration in Ihr Produktivsystem. Dabei sind etwa folgende Fragen wichtig: Wie soll die KI in der bestehenden Softwareumgebung und im Arbeitsablauf genutzt werden? Was ist notwendig, um auf die KI zuzugreifen?

Mit dem Erstgespräch und der Evaluation ist nun das Fundament für das Projekt gelegt. In den Folgeschritten treiben wir die Entwicklung nun immer weiter voran. Die Schritte 3 bis 5 werden dabei solange wiederholt bis wir von der minimal funktionsfähigen
Produktversion, dem MVP, bis zum gewünschten Endprodukt gelangt sind.

3. Iteration

Wir trainieren den Algorithmus mit dem Großteil der verfügbaren Daten. Anschließend überprüfen wir die Performance des Modells mit ungesehenen Daten.

Wie lange das Training dauert ist abhängig von der Aufgabe. Man kann jedoch sagen, dass das Trainieren eines Deep Learning Modells für Bilder oder Videos komplexer und zeitaufwändiger ist als bei textbasierten maschinellen Lernaufgaben. Das liegt daran, dass wir tiefe Modelle (mit vielen Layern) verwenden und die verarbeiteten Datenmengen in der Regel sehr groß sind.

Das Trainieren des Modells ist je nach Projekt jedoch nur ein Bruchstück des ganzen Entwicklungsprozesses, den wir leisten. Oft ist es notwendig, dass wir einen eigenen Prozess aufbauen, in den das Modell eingebettet werden kann, wie z.B. einen Webservice.

4. Integration

Ist eine akzeptable Qualitätsstufe des Modells nach dem Training erreicht, liefern wir Ihnen eine erste Produktversion aus. Üblicherweise stellen wir Ihnen die Version als Docker Image mit API zur Verfügung. Sie beginnen dann mit der Integration in Ihr System und Ihre Workflows. Wir begleiten Sie dabei.

5. Feedback erfassen

Nachdem die Integration in den Produktivbetrieb erfolgt ist, ist es sehr wichtig, dass Sie aus der Nutzung Daten sammeln. Nur so können Sie beurteilen, ob die KI funktioniert wie Sie es sich vorgestellt haben und ob es in die richtige Richtung geht. Es geht also darum, zu erfassen was das Modell im Realbetrieb kann und was nicht. Diese Daten sammeln Sie und übermitteln sie an uns. Wir speisen diese dann in nächsten Trainingslauf ein.

Es ist dabei nicht ungewöhnlich, dass diese Datenerfassung im Realbetrieb eine gewisse Zeit in Anspruch nimmt. Das ist natürlich davon abhängig, in welchem Umfang Sie Daten erfassen. Bis zum Beginn der nächsten Iteration können so üblicherweise Wochen oder sogar Monate vergehen.

Die nächste Iteration

Um mit der nächsten Iteration eine signifikante Steigerung der Ergebnisqualität zu erreichen, kann es notwendig sein, dass Sie uns mehr Daten oder andere Daten zur Verfügung stellen, die aus dem Realbetrieb anfallen.

Eine nächste Iteration kann aber auch motiviert sein durch eine Veränderung in den Anforderungen, wenn etwa bei einem Klassifikationsmodell neue Kategorien erkannt werden müssen. Das aktuelle Modell kann für solche Veränderungen dann keine guten Vorhersagen treffen und muss erst mit entsprechenden neuen Daten trainiert werden.

Tipps für ein erfolgreiches KI Projekt

Ein entscheidender Knackpunkt für ein erfolgreiches KI Projekt ist das iterative Vorgehen und schrittweise Einführen eines KI-basierten Prozesses, mit dem die Qualität und Funktionsbreite der Entwicklung gesteigert wird.

Weiterhin muss man sich darüber klar sein, dass die Bereitstellung von Trainingsdaten kein statischer Ablauf ist. Es ist ein Kreislauf, in dem Sie als Kunde eine entscheidende Rolle einnehmen. Ein letzter wichtiger Punkt ist die Messbarkeit des Projekts. Denn nur wenn die Zielwerte während des Projekts gemessen werden, können Rückschritte oder Fortschritte gesehen werden und man kann schließlich am Ziel ankommen.

Möglich wurde dieser Artikel durch die großartige Zusammenarbeit mit pixolution, einem Unternehmen für AI Solutions im Bereich Computer Vision (Visuelle Bildsuche und individuelle KI Lösungen).

Deep Autoregressive Models

Deep Autoregressive Models

In this blog article, we will discuss about deep autoregressive generative models (AGM). Autoregressive models were originated from economics and social science literature on time-series data where obser- vations from the previous steps are used to predict the value at the current and at future time steps [SS05]. Autoregression models can be expressed as:

    \begin{equation*} x_{t+1}= \sum_i^t \alpha_i x_{t-i} + c_i, \end{equation*}

where the terms \alpha and c are constants to define the contributions of previous samples x_i for the future value prediction. In the other words, autoregressive deep generative models are directed and fully observed models where outcome of the data completely depends on the previous data points as shown in Figure 1.

Autoregressive directed graph.

Figure 1: Autoregressive directed graph.

Let’s consider x \sim X, where X is a set of images and each images is n-dimensional (n pixels). Then the prediction of new data pixel will be depending all the previously predicted pixels (Figure ?? shows the one row of pixels from an image). Referring to our last blog, deep generative models (DGMs) aim to learn the data distribution p_\theta(x) of the given training data and by following the chain rule of the probability, we can express it as:

(1)   \begin{equation*} p_\theta(x) = \prod_{i=1}^n p_\theta(x_i | x_1, x_2, \dots , x_{i-1}) \end{equation*}

The above equation modeling the data distribution explicitly based on the pixel conditionals, which are tractable (exact likelihood estimation). The right hand side of the above equation is a complex distribution and can be represented by any possible distribution of n random variables. On the other hand, these kind of representation can have exponential space complexity. Therefore, in autoregressive generative models (AGM), these conditionals are approximated/parameterized by neural networks.

Training

As AGMs are based on tractable likelihood estimation, during the training process these methods maximize the likelihood of images over the given training data X and it can be expressed as:

(2)   \begin{equation*} \max_{\theta} \sum_{x\sim X} log \: p_\theta (x) = \max_{\theta} \sum_{x\sim X} \sum_{i=1}^n log \: p_\theta (x_i | x_1, x_2, \dots, x_{i-1}) \end{equation*}

The above expression is appearing because of the fact that DGMs try to minimize the distance between the distribution of the training data and the distribution of the generated data (please refer to our last blog). The distance between two distribution can be computed using KL-divergence:

(3)   \begin{equation*} \min_{\theta} d_{KL}(p_d (x),p_\theta (x)) = log\: p_d(x) - log \: p_\theta(x) \end{equation*}

In the above equation the term p_d(x) does not depend on \theta, therefore, whole equation can be shortened to Equation 2, which represents the MLE (maximum likelihood estimation) objective to learn the model parameter \theta by maximizing the log likelihood of the training images X. From implementation point of view, the MLE objective can be optimized using the variations of stochastic gradient (ADAM, RMSProp, etc.) on mini-batches.

Network Architectures

As we are discussing deep generative models, here, we would like to discuss the deep aspect of AGMs. The parameterization of the conditionals mentioned in Equation 1 can be realized by different kind of network architectures. In the literature, several network architectures are proposed to increase their receptive fields and memory, allowing more complex distributions to be learned. Here, we are mentioning a couple of well known architectures, which are widely used in deep AGMs:

  1. Fully-visible sigmoid belief network (FVSBN): FVSBN is the simplest network without any hidden units and it is a linear combination of the input elements followed by a sigmoid function to keep output between 0 and 1. The positive aspects of this network is simple design and the total number of parameters in the model is quadratic which is much smaller compared to exponential [GHCC15].
  2. Neural autoregressive density estimator (NADE): To increase the effectiveness of FVSBN, the simplest idea would be to use one hidden layer neural network instead of logistic regression. NADE is an alternate MLP-based parameterization and more effective compared to FVSBN [LM11].
  3. Masked autoencoder density distribution (MADE): Here, the standard autoencoder neural networks are modified such that it works as an efficient generative models. MADE masks the parameters to follow the autoregressive property, where the current sample is reconstructed using previous samples in a given ordering [GGML15].
  4. PixelRNN/PixelCNN: These architecture are introducced by Google Deepmind in 2016 and utilizing the sequential property of the AGMs with recurrent and convolutional neural networks.
Different autoregressive architectures

Figure 2: Different autoregressive architectures (image source from [LM11]).

Results using different architectures

Results using different architectures (images source https://deepgenerativemodels.github.io).

It uses two different RNN architectures (Unidirectional LSTM and Bidirectional LSTM) to generate pixels horizontally and horizontally-vertically respectively. Furthermore, it ulizes residual connection to speed up the convergence and masked convolution to condition the different channels of images. PixelCNN applies several convolutional layers to preserve spatial resolution and increase the receptive fields. Furthermore, masking is applied to use only the previous pixels. PixelCNN is faster in training compared to PixelRNN. However, the outcome quality is better with PixelRNN [vdOKK16].

Summary

In this blog article, we discussed about deep autoregressive models in details with the mathematical foundation. Furthermore, we discussed about the training procedure including the summary of different network architectures. We did not discuss network architectures in details, we would continue the discussion of PixelCNN and its variations in upcoming blogs.

References

[GGML15] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: masked autoencoder for distribution estimation. CoRR, abs/1502.03509, 2015.

[GHCC15] Zhe Gan, Ricardo Henao, David Carlson, and Lawrence Carin. Learning Deep Sigmoid Belief Networks with Data Augmentation. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence
and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 268–276, San Diego, California, USA, 09–12 May 2015. PMLR.

[LM11] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 29–37, Fort Lauderdale, FL, USA, 11–13 Apr 2011.
PMLR.

[SS05] Robert H. Shumway and David S. Stoffer. Time Series Analysis and Its Applications (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg, 2005.

[vdOKK16] A ̈aron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural
networks. CoRR, abs/1601.06759, 2016

How to ensure occupational safety using Deep Learning – Infographic

In cooperation between DATANOMIQ, my consulting company for data science, business intelligence and process mining, and Pixolution, a specialist for computer vision with deep learning, we have created an infographic (PDF) about a very special use case for companies with deep learning: How to ensure occupational safety through automatic risk detection using using Deep Learning AI.

How to ensure occupational safety through automatic risk detection using Deep Learning - Infographic

How to ensure occupational safety through automatic risk detection using Deep Learning – Infographic

Four essential ideas for making reinforcement learning and dynamic programming more effective

This is the third article of the series My elaborate study notes on reinforcement learning.

1, Some excuses for writing another article on the same topic

In the last article I explained policy iteration and value iteration of dynamic programming (DP) because DP is the foundation of reinforcement learning (RL). And in fact this article is a kind of a duplicate of the last one. Even though I also tried my best on the last article, I would say it was for superficial understanding of how those algorithms are implemented. I think that was not enough for the following two reasons. The first reason is that what I explained in the last article was virtually just about how to follow pseudocode of those algorithms like other study materials. I tried to explain them with a simple example and some diagrams. But in practice it is not realistic to think about such diagrams all the time. Also writing down Bellman equations every time is exhausting. Thus I would like to introduce Bellman operators, powerful tools for denoting Bellman equations briefly. Bellman operators would help you learn RL at an easier and more abstract level.

The second reason is that relations of values and policies are important points in many of RL algorithms. And simply, one article is not enough to realize this fact. In the last article I explained that policy iteration of DP separately and interactively updates a value and a policy. These procedures can be seen in many RL algorithms. Especially a family of algorithms named actor critic methods use this structure more explicitly. In the algorithms “actor” is in charge of a policy and a “critic” is in charge of a value. Just as the “critic” gives some feedback to the “actor” and the “actor” update his acting style, the value gives some signals to the policy for updating itself. Some people say RL algorithms are generally about how to design those “actors” and “critics.” In some cases actors can be very influential, but in other cases the other side is more powerful. In order to be more conscious about these interactive relations of policies and values, I have to dig the ideas behind policy iteration and value iteration, but with simpler notations.

Even though this article shares a lot with the last one, without pinning down the points I am going to explain, your study of RL could be just a repetition of following pseudocode of each algorithm. But instead I would rather prefer to make more organic links between the algorithms while studying RL. This article might be tiresome to read since it is mainly theoretical sides of DP or RL. But I would like you to patiently read through this to more effectively learn upcoming RL algorithms, and I did my best to explain them again in graphical ways.

2, RL and plannings as tree structures

Some tree structures have appeared so far in my article, but some readers might be still confused how to look at this. I must admit I lacked enough explanations on them. Thus I am going to review Bellman equation and give overall instructions on how to see my graphs. I am trying to discover effective and intuitive ways of showing DP or RL ideas. If there is something unclear of if you have any suggestions, please feel free to leave a comment or send me an email.

I got inspiration from Backup diagrams of Bellman equations introduced in the book by Barto and Sutton when I started making the graphs in this article series. The back up diagrams are basic units of tree structures in RL, and they are composed of white nodes showing states s and black nodes showing actions a. And when an agent goes from a node a to the next state s', it gets a corresponding reward r. As I explained in the second article, a value of a state s is calculated by considering all possible actions and corresponding next states s', and resulting rewards r, starting from s. And the backup diagram shows the essence of how a value of s is calculated.

*Please let me call this figure a backup diagram of “Bellman-equation-like recurrence relation,” instead of Bellman equation. Bellman equation holds only when v_{\pi}(s) is known, and v_{\pi}(s) is usually calculated from the recurrence relation. We are going to see this fact in the rest part of this article, making uses of Bellman operators.

Let’s again take a look at the definition of v_{\pi}(s), a value of a state s for a policy \pi. v_{\pi}(s) is defined as an expectation of a sum of upcoming rewards R_t, given that the state at the time step t is s. (Capital letters are random variables and small letters are their realized values.)

v_{\pi} (s)\doteq \mathbb{E}_{\pi} [ G_t | S_t =s ] =\mathbb{E}_{\pi} [ R_{t+1} + \gamma R_{t+2} + \gamma ^2 R_{t+3} + \cdots + \gamma ^{T-t -1} R_{T} |S_t =s]

*To be exact, we need to take the limit of T like T \to \infty. But the number T is limited in practical discussions, so please don’t care so much about very exact definitions of value functions in my article series.

But considering all the combinations of actions and corresponding rewards are not realistic, thus Bellman equation is defined recursively as follows.

v_{\pi} (s)= \mathbb{E}_{\pi} [ R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t =s ]

But when you want to calculate v_{\pi} (s) at the left side, v_{\pi} (s) at the right side is supposed to be unknown, so we use the following recurrence relation.

v_{k+1} (s)\doteq \mathbb{E}_{\pi} [ R_{t+1} + \gamma v_{k}(S_{t+1}) | S_t =s ]

And the operation of calculating an expectation with \mathbb{E}_{\pi}, namely a probabilistic sum of future rewards is defined as follows.

v_{k+1} (s) = \mathbb{E}_{\pi} [R_{t+1} + \gamma v_k (S_{t+1}) | S_t = s] \doteq \sum_a {\pi(a|s)} \sum_{s', r} {p(s', r|s, a)[r + \gamma v_k(s')]}

\pi(a|s) are policies, and p(s', r|s, a) are probabilities of transitions. Policies are probabilities of taking an action a given an agent being in a state s. But agents cannot necessarily move do that based on their policies. Some randomness or uncertainty of movements are taken into consideration, and they are modeled as probabilities of transitions. In my article, I would like you to see the equation above as a sum of branch(s, a) weighted by \pi(a|s) or a sum of twig(r, s') weighted by \pi(a|s), p(s' | s, a). “Branches” and “twigs” are terms which I coined.

*Even though especially values of branch(s, a) are important when you actually implement DP, they are not explicitly defined with certain functions in most study materials on DP.

I think what makes the backup diagram confusing at the first glance is that nodes of states in white have two layers, a layer s and the one of s'. But the node s is included in the nodes of s'. Let’s take an example of calculating the Bellman-equation-like recurrence relations with a grid map environment. The transitions on the backup diagram should be first seen as below to avoid confusion. Even though the original backup diagrams have only one root node and have three layers, in actual models of environments transitions of agents are modeled as arows going back and forth between white and black nodes.

But in DP values of states, namely white nodes have to be updated with older values. That is why the original backup diagrams have three layers. For exmple, the value of a value v_{k+1}(9) is calculated like in the figure below, using values of v_{k}(s'). As I explained earlier, the value of the state 9 is a sum of branch(s, a), weighted by \pi(\rightarrow | 9), \pi(\downarrow | 9), \pi(\leftarrow | 9), \pi(\uparrow | 9). And I showed the weight as strength of purple color of the arrows. r_a, r_b, r_c, r_d are corresponding rewards of each transition. And importantly, the Bellman-equation-like operation, whish is a part of DP, is conducted inside the agent. The agent does not have to actually move, and that is what planning is all about.

And DP, or more exactly policy evaluation, calculating the expectation over all the states, repeatedly. An important fact is, arrows in the backup diagram are pointing backward compared to the direction of value functions being updated, from v_{k}(s) to v_{k+1}(s). I tried to show the idea that values v_{k}(s) are backed up to calculate v_{k+1}(s). In my article series, with the right side of the figure below, I make it a rule to show the ideas that a model of an environment is known and it is updated recursively.

3, Types of policies

As I said in the first article, the ultimate purpose of DP or RL is finding the optimal policies. With optimal policies agents are the most likely to maximize rewards they get in environments. And policies \pi determine the values of states as value functions v_{\pi}(s). Or policies can be obtained from value functions. This structure of interactively updating values and policies is called general policy iteration (GPI) in the book by Barto and Sutton.

Source: Richard S. Sutton, Andrew G. Barto, “Reinforcement Learning: An Introduction,” MIT Press, (2018)

However I have been using the term “a policy” without exactly defining it. There are several types of policies, and distinguishing them is more or less important in the next sections. But I would not like you to think too much about that. In conclusion, only very limited types of policies are mainly discussed in RL. Only \Pi ^{\text{S}}, \Pi ^{\text{SD}} in the figure below are of interest when you learn RL as a beginner. I am going to explain what each set of policies means one by one.

In fact we have been discussing a set of policies \Pi ^{\text{S}}, which mean probabilistic Markov policies. Remember that in the first article I explained Markov decision processes can be described like diagrams of daily routines. For example, the diagrams below are my daily routines. The indexes t denote days. In either of states “Home,” “Lab,” and “Starbucks,” I take an action to another state. The numbers in black are probabilities of taking the actions, and those in orange are rewards of taking the actions. I also explained that the ultimate purpose of planning with DP is to find the optimal policy in this state transition diagram.

Before explaining each type of sequences of policies, let me formulate probabilistic Markov policies at first. A set of probabilistic Markov policies is defined as follows.
\Pi \doteq \biggl\{ \pi : \mathcal{A}\times\mathcal{S} \rightarrow [0, 1]: \sum_{a \in \mathcal{A}}{\pi (a|s) =1, \forall s \in \mathcal{S} } \biggr\}
This means \pi (a|s) maps any combinations of an action a\in\mathcal{A} and a state s \in\mathcal{S} to a probability. The diagram above means you choose a policy \pi from the set \Pi, and you use the policy every time step t, I mean every day. A repetitive sequence of the same probabilistic Markov policy \pi is defined as \boldsymbol{\pi}^{\text{s}} \doteq \{\pi, \pi, \dots \} \in \boldsymbol{\Pi} ^{\text{S}}. And a set of such stationary Markov policy sequences is denoted as \boldsymbol{\Pi} ^{\text{S}}.

*As I formulated in the last articles, policies are different from probabilities of transitions. Even if you take take an action probabilistically, the action cannot necessarily be finished. Thus probabilities of transitions depend on combinations of policies and the agents or the environments.

But when I just want to focus on works like a robot, I give up living my life. I abandon efforts of giving even the slightest variations to my life, and I just deterministically take next actions every day. In this case, we can say the policies are stationary and deterministic. The set of such policies is defined as below. \pi ^{\text{d}} are called deterministic policies.\Pi ^\text{d} \doteq \bigl\{ \pi ^\text{d} : \mathcal{A}\rightarrow \mathcal{S} \bigr\}

I think it is normal policies change from day to day, even if people also have only options of “Home,” “Lab,” or “Starbucks.” These cases are normal Markov policies, and you choose a policy \pi from \Pi every time step.

And the resulting sequences of policies and the set of the sequences are defined as \boldsymbol{\pi}^{\text{m}} \doteq \{\pi_0, \pi_1, \dots \} \in \boldsymbol{\Pi} ^{\text{M}}, \quad \pi_t \in \Pi.

In real world, an assumption of Markov decision process is quite unrealistic because your strategies constantly change depending on what you have done or gained so far. Possibilities of going to a Starbucks depend on what you have done in the week so far. You might order a cup of frappucino as a little something for your exhausting working days. There might be some communications on what you order then with clerks. And such experiences would affect your behaviors of going to Starbucks again. Such general and realistic policies are called history-dependent policies.

*Going to Starbucks everyday like a Markov decision process and deterministically ordering a cupt of hot black coffee is supposed to be unrealistic. Even if clerks start heating a mug as soon as I enter the shop.

In history-dependent cases, your policies depend on your states, actions, and rewards so far. In this case you take actions based on history-dependent policies \pi _{t}^{\text{h}}. However as I said, only \Pi ^{\text{S}}, \Pi ^{\text{SD}} are important in my articles. And history-dependent policies are discussed only in partially observable Markov decision process (POMDP), which this article series is not going to cover. Thus you have only to take a brief look at how history-dependent ones are defined.

History-dependent policies are the types of the most general policies. In order to formulate history-dependent policies, we first have to formulate histories. Histories h_t \in \mathcal{H}_t in the context of DP or RL are defined as follows.

h_t \doteq \{s_0, a_0, r_0, \dots , s_{t-1}, a_{t-1}, r_{t}, s_t\}

Given the histories which I have defined, a history dependent policy is defined as follows.

\pi_{t}^{\text{h}}(a|h_t) \doteq \text{Pr}(A=a | H_t = h_t)

This means a probability of taking an action a given a history h_t. It might be more understandable with the graphical model below, which I showed also in the first article. In the graphical model, H_t is a random variable, and h_t is its realized value.


A set of history-dependent policies is defined as follows.

\Pi _{t}^{\text{h}} \doteq \biggl\{ \pi _{t}^{h} : \mathcal{A}\times\mathcal{H}_t \rightarrow [0, 1]: \sum_{a \in \mathcal{A}}{\pi_{t}^{\text{h}} (a|h_{t}) =1 } \biggr\}

And a set of sequences of history-dependent policies is \boldsymbol{\pi}^{\text{h}} \doteq \{\pi^{\text{h}}_0, \pi^{\text{h}}_1, \dots \} \in \boldsymbol{\Pi} ^{\text{H}}, \quad \pi_{t}^{\text{h}} \in \Pi_{t}^{\text{h}}.

In fact I have not defined the optimal value function v_{\ast}(s) or \pi_{\ast} in my article series yet. I must admit it was not good to discuss DP without even defining the important ideas. But now that we have learnt types of policies, it should be less confusing to introduce their more precise definitions now. The optimal value function v_{\ast}: \mathcal{S} \mapsto \mathbb{R} is defined as the maximum value functions for all states s, with respect to any types of sequences of policies \boldsymbol{\pi}.

v_{\ast} \doteq \max_{\boldsymbol{\pi}\in \boldsymbol{\Pi}^{\text{H}}}{v_{\boldsymbol{\pi}(s)}}, \quad \forall s \mathbb{R}

And the optimal policy is defined as the policy which satisfies the equation below.

v_{\ast}(s) = v_{\pi ^{\ast}}(s), \quad \forall s \in \mathcal{S}

The optimal value function is optimal with respect to all the types of sequences of policies, as you can see from the definition. However in fact, it is known that the optimal policy is a deterministic Markov policy \pi ^\text{d} \in \Pi ^\text{d}. That means, in the example graphical models I displayed, you just have to deterministically go back and forth between the lab and the home in order to maximize value function, never stopping by at a Starbucks. Also you do not have to change your plans depending on days.

And when all the values of the states are maximized, you can easily calculate the optimal deterministic policy of your everyday routine. Thus in DP, you first need to maximize the values of the states. I am going to explain this fact of DP more precisely in the next section. Combined with some other important mathematical features of DP, you will have clearer vision on what DP is doing.

*I might have to precisely explain how v_{\boldsymbol{\pi}}(s) is defined. But to make things easier for now, let me skip ore precise formulations. Value functions are defined as expectations of rewards with respect to a single policy or a sequence of policies. You have only to keep it in mind that v_{\boldsymbol{\pi}}(s) is a value function resulting from taking actions based on \boldsymbol{\pi}. And v_{\pi}(s), which we have been mainly discussing, is a value function based on only a single policy \pi.

*Please keep it in mind that these diagrams are not anything like exaggeratedly simplified models for explaining RL. That is my life.

3, Key components of DP

*Even though notations on this article series are based on the book by Barto and Sutton, the discussions in this section are, based on a Japanese book named “Machine Learning Professional Series: Reinforcement Learning” by Tetsurou Morimura, which I call “the whale book.” There is a slight difference in how they calculate Bellman equations. In the book by Barto and Sutton, expectations are calculated also with respect to rewards r, but not in the whale book. I think discussions in the whale book can be extended to the cases in the book by Barto and Sutton, but just in case please bear that in mind.

In order to make organic links between the RL algorithms you are going to encounter, I think you should realize DP algorithms you have learned in the last article are composed of some essential ideas about DP. As I stressed in the first article, RL is equal to solving planning problems, including DP, by sampling data through trial-and-error-like behaviors of agents. Thus in other words, you approximate DP-like calculations with batch data or online data. In order to see how to approximate such DP-like calculations, you have to know more about features of those calculations. Those features are derived from some mathematical propositions about DP. But effortlessly introducing them one by one would be just confusing, so I tired extracting some essences. And the figures below demonstrate the ideas.

The figures above express the following facts about DP:

  1. DP is a repetition of Bellman-equation-like operations, and they can be simply denoted with Bellman operators \mathsf{B}_{\pi} or \mathsf{B}_{\ast}.
  2. The value function for a policy \pi is calculated by solving a Bellman equation, but in practice you approximately solve it by repeatedly using Bellman operators.
  3. There exists an optimal policy \pi ^{\ast} \in \Pi ^{\text{d}}, which is deterministic. And it is an optimal policy if and only if it satisfies the Bellman expectation equation v^{\ast}(s) = (\mathsf{B}_{\pi ^{\ast}} v^{\ast})(s), \quad \forall s \in \mathcal{S}, with the optimal value function v^{\ast}(s).
  4. With a better deterministic policy, you get a better value function. And eventually both the value function and the policy become optimal.

Let’s take a close look at what each of them means.

(1) Bellman operator

In the last article, I explained the Bellman equation and recurrence relations derived from it. And they are the basic ideas leading to various RL algorithms. The Bellman equation itself is not so complicated, and I showed its derivation in the last article. You just have to be careful about variables in calculation of expectations. However writing the equations or recurrence relations every time would be tiresome and confusing. And in practice we need to apply the recurrence relation many times. In order to avoid writing down the Bellman equation every time, let me introduce a powerful notation for simplifying the calculations: I am going to discuss RL making uses of Bellman operators from now on.

First of all, a Bellman expectation operator \mathsf{B}_{\pi}: \mathbb{R}^{\mathcal{S}} \rightarrow \mathbb{R}^{\mathcal{S}}, or rather an application of a Bellman expectation operator on any state functions v: \mathcal{S}\rightarrow \mathbb{R} is defined as below.

(\mathsf{B}_{\pi} (v))(s) \doteq \sum_{a}{\pi (a|s)} \sum_{s'}{p(s'| s, a) \biggl[r + \gamma v (s') \biggr]}, \quad \forall s \in \mathcal{S}

For simplicity, I am going to denote the left side of the equation as (\mathsf{B}_{\pi} (v)) (s)=\mathsf{B}_{\pi} (v) \doteq \mathsf{B}_{\pi} v. In the last article I explained that when v_{0}(s) is an arbitrarily initialized value function, a sequence of value functions (v_{0}(s), v_{1}(s), \dots, v_{k}(s), \dots) converge to v_{\pi}(s) for a fixed probabilistic policy \pi, by repeatedly applying the recurrence relation below.

v_{k+1} = \sum_{a}{\pi (a|s)} \sum_{s'}{p(s'| s, a) \biggl[r + \gamma v_{k} (s') \biggr]}

With the Bellman expectation operator, the recurrence relation above is written as follows.

v_{k+1} = \mathsf{B}_{\pi} v_{k}

Thus v_{k} is obtained by applying \mathsf{B}_{\pi} to v_{0} k times in total. Such operation is denoted as follows.

v_{k} = (\mathsf{B}_{\pi}\dots (\mathsf{B}_{\pi} v_{0})\dots) \doteq \mathsf{B}_{\pi} \dots \mathsf{B}_{\pi} v_{0} \doteq \mathsf{B}^k_{\pi} v_{0}

As I have just mentioned, \mathsf{B}^k_{\pi} v_{0} converges to v_{\pi}(s), thus the following equation holds.

\lim_{k \rightarrow \infty} \mathsf{B}^k_{\pi} v_{0} = v_{\pi}(s)

I have to admit I am merely talking about how to change notations of the discussions in the last article, but introducing Bellman operators makes it much easier to learn or explain DP or RL as the figure below shows.

Just as well, a Bellman optimality operator \mathsf{B}_{\ast}: \mathbb{R}^{\mathcal{S}} \rightarrow \mathbb{R}^{\mathcal{S}} is defined as follows.

(\mathsf{B}_{\ast} v)(s) \doteq \max_{a} \sum_{s'}{p(s' | s, a) \biggl[r + \gamma v(s') \biggr]}, \quad \forall s \in \mathcal{S}

Also the notation with a Bellman optimality operators can be simplified as (\mathsf{B}_{\ast} v)(s) \doteq \mathsf{B}_{\ast} v. With a Bellman optimality operator, you can get a recurrence relation v_{k+1} = \mathsf{B}_{\ast} v_{k}. Multiple applications of Bellman optimality operators can be written down as below.

v_{k} = (\mathsf{B}_{\ast}\dots (\mathsf{B}_{\ast} v_{0})\dots) \doteq \mathsf{B}_{\ast} \dots \mathsf{B}_{\ast} v_{0} \doteq \mathsf{B}^k_{\ast} v_{0}

Please keep it in mind that this operator does not depend on policies \pi. And an important fact is that any initial value function v_0 converges to the optimal value function v_{\ast}.

\lim_{k \rightarrow \infty} \mathsf{B}^k_{\ast} v_{0} = v_{\ast}(s)

Thus any initial value functions converge to the the optimal value function by repeatedly applying Bellman optimality operators. This is almost equal to value iteration algorithm, which I explained in the last article. And notations of value iteration can be also simplified by introducing the Bellman optimality operator like in the figure below.

Again, I would like you to pay attention to how value iteration works. The optimal value function v_{\ast}(s) is supposed to be maximum with respect to any sequences of policies \boldsymbol{\pi}, from its definition. However the optimal value function v_{\ast}(s) can be obtained with a single bellman optimality operator \mathsf{B}_{\ast} , never caring about policies. Obtaining the optimal value function is crucial in DP problems as I explain in the next topic. And at least one way to do that is guaranteed with uses of a \mathsf{B}_{\ast}.

*We have seen a case of applying the same Bellman expectation operator on a fixed policy \pi, but you can use different Bellman operators on different policies varying from time steps to time steps. To be more concrete, assume that you have a sequence of Markov policies \boldsymbol{\pi} = \{ \pi_{0},\pi_{1}, \dots, \pi_{k-1} \}\in \boldsymbol{\Pi} ^{\text{M}}. If you apply Bellman operators of the policies one by one in an order of \pi_{k-1}, \pi_{k-2}, \dots, \pi_{k-1} on a state function v, the resulting state function is calculated as below.

\mathsf{B}_{\pi_0}(\mathsf{B}_{\pi_1}\dots (\mathsf{B}_{\pi_{k-1}} v)\dots) \doteq \mathsf{B}_{\pi_0}\mathsf{B}_{\pi_1} \dots \mathsf{B}_{\pi_{k-1}} v \doteq \mathsf{B}^k_{\boldsymbol{\pi}}

When \boldsymbol{\pi} = \{ \pi_{0},\pi_{1}, \dots, \pi_{k-1} \}, we can also discuss convergence of v_{\boldsymbol{\pi}}, but that is just confusing. Please let me know if you are interested.

(2) Policy evaluation

Policy evaluation is in short calculating v_{\pi}, the value function for a policy \pi. And in theory it can be calculated by solving a Bellman expectation equation, which I have already introduced.

v(s) = \sum_{a}{\pi (a|s)} \sum_{s'}{p(s'| s, a) \biggl[r + \gamma v (s') \biggr]}

Using a Bellman operator, which I have introduced in the last topic, the equation above can be written v(s) = \mathsf{B}_{\pi} v(s). But whichever the notation is, the equation holds when the value function v(s) is v_{\pi}(s). You have already seen the major way of how to calculate v_{\pi} in (1), or also in the last article. You have only to multiply the same Belman expectation operator \mathsf{B}_{\pi} to any initial value funtions v_{initial}(s).

This process can be seen in this way: any initial value functions v_{initial}(s) little by little converge to v_{\pi}(s) as the same Bellman expectation operator \mathsf{B}_{\pi} is applied. And when a v_{initial}(s) converges to v_{\pi}(s), the value function does not change anymore because the value function already satisfies a Bellman expectation equation v(s) = \mathsf{B}_{\pi} v(s). In other words v_{\pi}(s) = \mathsf{B}^k_{\pi} v_{\pi}(s), and the v_{\pi}(s) is called the fixed point of \mathsf{B}_{\pi}. The figure below is the image of how any initial value functions converge to the fixed point unique to a certain policy \pi. Also Bellman optimality operators \mathsf{B}_{\ast} also have their fixed points because any initial value functions converge to v_{\ast}(s) by repeatedly applying \mathsf{B}_{\ast}.

I am actually just saying the same facts as in the topic (1) in another way. But I would like you to keep it in mind that the fixed point of \mathsf{B}_{\pi} is more of a “local” fixed point. On the other hand the fixed point of \mathsf{B}_{\ast} is more like “global.” Ultimately the global one is ultimately important, and the fixed point v_{\ast} can be directly reached only with the Bellman optimality operator \mathsf{B}_{\ast}. But you can also start with finding local fixed points, and it is known that the local fixed points also converge to the global one. In fact, the former case of corresponds to policy iteration, and the latter case to value iteration. At any rate, the goal for now is to find the optimal value function v_{\ast}. Once the value function is optimal, the optimal policy can be automatically obtained, and I am going to explain why in the next two topics.

(3) Existence of the optimal policy

In the first place, does the optimal policy really exist? The answer is yes, and moreover it is a stationary and deterministic policy \pi ^{\text{d}} \in \Pi^{\text{SD}}. And also, you can judge whether a policy is optimal by a Bellman expectation equation below.

v_{\ast}(s) = (\mathsf{B}_{\pi^{\ast} } v_{\ast})(s), \quad \forall s \in \mathcal{S}

In other words, the optimal value function v_{\ast}(s) has to be already obtained to judge if a policy is optimal. And the resulting optimal policy is calculated as follows.

\pi^{\text{d}}_{\ast}(s) = \text{argmax}_{a\in \matchal{A}} \sum_{s'}{p(s' | s, a) \biggl[r + \gamma v_{\ast}(s') \biggr]}, \quad \forall s \in \mathcal{S}

Let’s take an example of the state transition diagram in the last section. I added some transitions from nodes to themselves and corresponding scores. And all values of the states are initialized as v_{init.}. After some calculations, v_{init.} is optimized to v_{\ast}. And finally the optimal policy can be obtained from the equation I have just mentioned. And the conclusion is “Go to the lab wherever you are to maximize score.”


The calculation above is finding an action a which maximizes b(s, a)\doteq\sum_{s'}{p(s' | s, a) \biggl[r + \gamma v_{\ast}(s') \biggr]} = r + \gamma \sum_{s'}{p(s' | s, a) v_{\ast}(s') }. Let me call the part b(s, a) ” a value of a branch,” and finding the optimal deterministic policy is equal to choosing the maximum branch for all s. A branch corresponds to a pair of a state s, a and all the all the states s'.


*We can comprehend applications of Bellman expectation operators as probabilistically reweighting branches with policies \pi(a|s).

*The states s and s' are basically the same. They are just different in uses of indexes for referring them. That might be a confusing point of understanding Bellman equations.

Let’s see how values actually converge to the optimal values and how branches b(s, a). I implemented value iteration of the Starbucks-lab-home transition diagram and visuzlied them with Graphviz. I initialized all the states as 0, and after some iterations they converged to the optimal values. The numbers in each node are values of the sates. And the numbers next to each edge are corresponding values of branches b(a, b). After you get the optimal value, if you choose the direction with the maximum branch at each state, you get the optimal deterministic policy. And that means “Just go to the lab, not Starbucks.”

*Discussing and visualizing “branches” of Bellman equations are not normal in other study materials. But I just thought it would be better to see how they change.

(4) Policy improvement

Policy improvement means a very simple fact: in policy iteration algorithm, with a better policy, you get a better value function. That is all. In policy iteration, a policy is regarded as optimal as long as it does not updated anymore. But as far as I could see so far, there is one confusing fact. Even after a policy converges, value functions still can be updated. But from the definition, an optimal value function is determined with the optimal value function. Such facts can be seen in some of DP implementation, including grid map implementation I introduced in the last article.


Thus I am not sure if it is legitimate to say whether the policy is optimal even before getting the optimal value function. At any rate, this is my “elaborate study note,” so I conversely ask for some help to more professional someones if they come across with my series. Please forgive me for shifting to the next article, without making things clear.

4, Viewing DP algorithms in a more simple and abstract way

We have covered the four important topics for a better understanding of DP algorithms. Making use of these ideas, pseudocode of DP algorithms which I introduced in the last article can be rewritten in a more simple and abstract way. Rather than following pseudocode of DP algorithms, I would like you to see them this way: policy iteration is a repetation of finding the fixed point of a Bellman operator \mathsf{B}_{\pi}, which is a local fixed point, and updating the policy. Even if the policy converge, values have not necessarily converged to the optimal values.


When it comes to value iteration: value iteration is finding the fixed point of \mathsf{B}_{\ast}, which is global, and getting the deterministic and optimal policy.

I have written about DP in as many as two articles. But I would say that was inevitable for laying more or less solid foundation of learning RL. The last article was too superficial and ordinary, but on the other hand this one is too abstract to introduce at first. Now that I have explained essential theoretical parts of DP, I can finally move to topics unique to RL. We have been thinking the case of plannings where the models of the environemnt is known, but they are what agents have to estimate with “trial and errors.” The term “trial and errors” might have been too abstract to you when you read about RL so far. But after reading my articles, you can instead say that is a matter of how to approximate Bellman operators with batch or online data taken by agents, rather than ambiguously saying “trial and erros.” In the next article, I am going to talk about “temporal differences,” which makes RL different from other fields and can be used as data samples to approximate Bellman operators.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Deep Generative Modelling

Nowadays, we see several real-world applications of synthetically generated data (see Figure 1), for example solving the data imbalance problem in classification tasks, performing style transfer for artistic images, generating protein structure for scientific analysis, etc. In this blog, we are going to explore synthetic data generation using deep neural networks with the mathematical background.

 Synthetic images generated by deep generative models - deep learning generates images

Figure 1 – Synthetic images generated by deep generative models

What is Deep Generative modelling?

Deep generative modelling (DGM) falls in the category of unsupervised learning and addresses a challenging task of the distribution estimation of the given data. To approximate the underlying distribution of a complicated and high dimensional data, Deep generative models (DGM) utilize various deep neural networks architectures e.g., CNN and RNN. Furthermore, the trained DGMs generate samples which have the same distribution as the training data distribution. In other words, if the given training data has the distribution function 𝑝𝑑 (𝑥), then DGMs learn to
generate the samples from a distribution 𝑝𝜃 (𝑥) such that 𝑝𝑑 (𝑥) ≈ 𝑝𝜃 (𝑥).

Deep Learning as unsupervised learner - DGMs pipeline

Figure 2 – DGMs pipeline

Figure 2 represents the general idea about the deep generative modeling, where DGMs are generating data samples with distribution of 𝑝𝜃 (𝑥), which is quite similar to the data distribution of training samples 𝑝𝑑 (𝑥).

Why Deep Generative modelling is important?

DGMs are mainly used to generate synthetic data, which can be used in different applications. The followings are a few examples:

  1. To avoid the data imbalance problems in several real-life classification problems
  2. Text-to-image, image-to-image conversion, image inpainting, super-resolution
  3. Speech and music synthesis.
  4. Computer graphics: rendering, texture generation, character movement, fluid dynamics
    simulation.

How DGMs work?

The above figure is representing a complete workflow of DGMs and it is not very precise because it is combining both training and inference process. During the inference/generation, there will be a slight modification, which is shown in the following figure:

Data generation with random input and a trained DGM

Figure 3 – Data generation with random input and a trained DGM

As it is clear from the above figure, the user gives a random sample as the input to the trained generator to generate a sample which has the similar distribution to the training data. Let us consider that the random input z is sampled from a tractable distribution 𝑝(𝑧) and supported in 𝑅𝑚 and the training data distribution (intractable) is high dimensional and supported in 𝑅𝑛. Therefore, the main goal of trained generator can be written as:

    \begin{equation*} g_\theta:\mathbb{R}^m \to \mathbb{R}^n, \quad \textit{such that}, \quad \min_{\theta} d(p_d (x),p_\theta (x)) \end{equation*}

where d denotes the distance between the two probability distributions and every random vector z will mapped in an unknown vector x, which has an intractable distribution. The vector z is commonly referred as latent variable which is sample from a latent space and in general, follows a tractable Gaussian distribution. The distance minimization problem can be addressed using maximum likelihood. Let us assume that the generator function 𝑔𝜃 is known then we can compute the likelihood of the generated sample x from the latent variable z:

(1)   \begin{equation*} p_\theta (x)= \int p_\theta (x|z) p(z)dz \end{equation*}

The term 𝑝𝜃(𝑥|𝑧) measures the closeness between the generated sample 𝑔𝜃(𝑧) to the original sample x. Based on the data, the likelihood function can be Gaussian for real valued data or Bernoulli for the binary data. From the above discussion, it is clear that the approximating the generator function is most challenging task and that is performed suing deep neural network with high dimensional data. A deep neural network approximates the generator function by computing the generator parameters 𝜃.

Types of DGMs

There are several different types of DGMs to approximate the generator functions, which can generate the new data points with the similar distribution of the training data. In this series of the blogs, we will discuss these methods which are mentioned in the following figure.

In general, DGMs can be separated into implicit and explicit methods, where explicit method are basically likelihood-based methods and learn the data distribution based on an explicitly defined 𝑝𝜃(𝑥). On the other hand, implicit methods learn data distribution directly without any prior model structure. Furthermore, explicit methods are split into tractable and approximation-based methods, where tractable methods are utilizing the model structures which have exact likelihood evaluation and approximation-based methods are applying different forms of approximation in the likelihood estimation.

Summary

In this blog article, we covered the mathematical foundation of DGMs including the different types. In further blog articles, we will cover the above mentioned different DGMs with theoretical background and applications.

How Deep Learning drives businesses forward through automation – Infographic

In cooperation between DATANOMIQ, my consulting company for data science, business intelligence and process mining, and Pixolution, a specialist for computer vision with deep learning, we have created an infographic (PDF) about a very special use case for companies with deep learning: How to protect the corporate identity of any company by ensuring consistent branding with automated font recognition.

How to ensure consistent branding with automatic font recognition - Infographic

How to ensure consistent branding with automatic font recognition – Infographic

The infographic is available as PDF download:

Vorstellung des Verbundforschungsprojekts “What can AI do for me?”

Dieser Artikel ist eine Zusammenfassung der Ergebnisse einer Studie namens “What can AI do for me?” (www.whatcanaidoforme.com) Ansprechpartnerin für dieses Projekt ist Frau Carina Weber, Mitarbeiterin der Hochschule der Medien in Stuttgart.

Hintergrund zur Studie: Zu dem Thema Anwendung von Künstliche Intelligenz und ihrem Potenzial für die Wertschöpfung von Unternehmen gibt es bereits einige wenige Studien. Die wenigen Forschungsarbeiten stellen positive Auswirkungen, wie Produktoptimierung, Kosteneinsparung durch Optimierung des Ressourcenmanagements, Steigerung der allgemein Unternehmensperformance, etc. fest. Allerdings bleibt unerforscht welchen individuellen Beitrag spezifische Anwendungsfälle leisten. Dieses Wissen wird jedoch für strategische Entscheidungen bezüglich der Implementierung von AI benötigt, um beispielsweise den ROI von AI-Projekten schätzen zu können. Dazu soll die vorliegende Studie Einsicht bringen.

Darüberhinaus wurden die Ergebnisse genutzt um im Rahmen des Verbundforschungsprojekts What Can AI Do For Me? eine AI-basierte Matching-Plattform zu entwickeln. Eine bis jetzt einzigartige Anwendung, mittels derer Unternehmen individuelle AI-Anwendungsfälle mit ihren jeweiligen Potenzialen kennenlernen und sich direkt mit Lösungsanbietern verknüpfen lassen können.

Beispiele: Praktische Anwendung von AI – Mit welchen Herausforderungen sehen sich Unternehmen konfrontiert?

Schon heute stellt Artificial Intelligence, folgend abgekürzt mit AI, im unternehmerischen

Sinne eine Schlüsseltechnologie dar. Es stellt sich jedoch die Frage, inwieweit sich die Technologien rund um AI tatsächlich auf die essentiellen Unternehmensziele auswirken und mit welchen Hindernissen sich die Unternehmen bei der Implementierung konfrontiert sehen.

In der AI Value Creation Studie des Forschungsprojekts “What can AI do for me” ist man mit Unterstützung von Expertinnen und Experten, sowohl auf Anwenderseite, als auch auf der von Nutzerinnen und Nutzer, dieser Fragestellung, durch eine qualitative und quantitative Forschung nachgegangen.

Unsicher beim Einsatz von AI? Die Studie bietet Orientierungshilfe

Das Institute of Applied Artificial Intelligence (IAAI) der Hochschule der Medien entwickelt im Rahmen des oben genannten Verbundforschungsprojekts zusammen mit der thingsTHINKING GmbH und der KENBUN IT AG eine AI-basierte Matching-Plattform, mittels derer Unternehmen geeignete Anwendungsmöglichkeiten und Lösungsunternehmen finden können. Gefördert wurde das Projekt im Jahr 2021 über den KI-Innovationswettbewerb des Ministeriums für Wirtschaft, Arbeit und Tourismus Baden-Württemberg und erhielt zusätzliche Unterstützung von bekannten AI-Initiativen und Verbänden. So konnte am 19. Oktober die Inbetriebnahme der Beta-Version erfolgreich gestartet werden. Sie steht seitdem unter der Domain WhatCanAIDoForMe.com kostenfrei zur Verfügung.

Die Basis der Annahmen der Matching-Plattform bilden die Ergebnisse der AI Value Creation Studie des IAAI der Hochschule der Medien. Im Verlauf der qualitativen Forschung konnten über 90 verschiedene AI Use Cases aus der Unternehmenspraxis in über 40 Interviews mit Expertinnen und -experten vielfältigster Branchen identifiziert werden. Die erhobenen Use Cases wurden in insgesamt 19 Use Case Cluster strukturiert, um eine bessere Vergleichbarkeit zu schaffen und gleichzeitig vielfältige Anwendungsmöglichkeiten aufzuzeigen.

Es wird eine Orientierungshilfe für Unternehmen geschaffen, über die sie einen Überblick erlangen können, in welchen Unternehmensfunktionen AI bereits erfolgreich eingesetzt wird.

Des Weiteren sollen durch die Studie Potenziale von AI in Bezug auf die Wertschöpfung, im Sinne einer möglichen Umsatz-, Unternehmenswertsteigerung sowie Kostensenkung, erhoben und Hindernisse bei der Realisierung von AI Use Cases erkannt werden. Zuletzt sollen Unternehmen dazu befähigt werden Stellschrauben zu identifizieren, an welchen sie ansetzen müssen, um AI erfolgreich im Unternehmen einzusetzen.

Im Rahmen der erhobenen Studie wurde einerseits eine Dominanz der AI Use Cases im Bereich der Produktion und Supply Chain, Marketing und Sales sowie im Kundenservice deutlich. Andererseits konnten vielzählige Use Cases ermittelt werden, die crossfunktional in Unternehmen eingesetzt werden können und somit wiederkehrende Tätigkeiten, wie AI-gestützte Recherche in Datenbanken oder Sachbearbeitung von Dokumenten, in Unternehmen unterstützen.

Variierendes Wertschöpfungspotenzial je nach Einsatzbereich und Aufgabe

Gerade bei Use Cases mit AI-Anwendungen, die über verschiedeneUnternehmensfunktionen hinweg eingesetzt werden können, ist die Einschätzung des Wertschöpfungspotenzials abhängig von der individuellen Aufgabe und dem Anwendungsbereich und demnach sehr divers.

Über alle erhobenen Use Cases hinweg tendieren die befragten Personen dazu das Wertschöpfungspotenzial zur Kostenreduktion am höchsten einzuschätzen. Dieses Phänomen kann dadurch erklärt werden, dass ineffiziente Prozesse schnell zu höheren Kosten führen, bei einer beschleunigten, zuverlässigeren Ausführung durch AI das Potenzial zur Kostenersparnis schnell ersichtlich werden kann. Dadurch wurde dieses Wertschöpfungspotenzial im Vergleich zu Umsatz- und Unternehmenswertsteigerung auch häufiger von Expertinnen und Experten identifiziert. Zusätzlich zu diesen Erkenntnissen wurden in Interviews weitere Aspekte bzw. Ziele des

AI-Einsatzes in den Unternehmen abgefragt, die sich abseits schon genannten Wertschöpfungspotenziale indirekt auf die Wertschöpfung und den Unternehmenserfolg auswirken. So wurden neben Prozessoptimierung, die Steigerung der ökologischen und ökonomischen Nachhaltigkeit, die Verbesserung des Unternehmensimages und eine Steigerung der Unternehmensattraktivität genannt.

Fehlende Daten, fehlendes Personal – die Hindernisse bei der Implementierung

In der qualitativen Studie wurden neben den Potenzialen von AI auch Hindernisse und Herausforderungen. Durch eine genaue Systematisierung und Analyse wurde deutlich: der Mangel an Daten, personellen und finanziellen Ressourcen und das fehlendes Mindset machen den Unternehmen zu schaffen. Um diese Ergebnisse besser beurteilen und einschätzen zu können wurden Branchenexpertinnen und -experten gebeten, die ermittelten Herausforderungen im Rahmen einer quantitativen Studie zu bewerten. Die Stichprobe besteht aus Mitarbeiterinnen und Mitarbeiter in beratender Funktion bei AI-Projekte, Managerinnen und Manager mit Entscheidungsfunktion auf diesem Gebiet sowie Unternehmensberaterinnen  und -berater aus Beratungsfirmen mit Fokus auf AI-Projekten.

Sehr deutlich wurde hierbei der allgegenwärtige Mangel an Fachpersonal, der von weit mehr als der Hälfte der Befragten angegeben wurde. Zudem ist die gegebenen Datenqualität oft nur unzureichend und es fehlt an AI-Strategien, was sehr große Hindernisse angesehen wurden. Im Vergleich hierzu waren Hindernisse wie ein mangelnder Reifegrad der AI-Technologien und offene Rechtsfragen nur von etwas mehr als einem Drittel der Befragten angegeben worden. Was natürlich zum einen deutlich macht, dass zwar verschiedene Herausforderungen bei der AI-Implementierung gibt, es aber oft in den Händen der Unternehmen liegt inwieweit diese überwunden werden.

Weiterführende Informationen zum Forschungsbericht und dem Projekt

Weitere Ergebnisse und Informationen zur Forschungsmethode können dem Forschungsbericht der Autoren Prof. Dr. Jürgen Seitz, Katharina Willbold, Robin Haiber und Alicia Krafft entnommen werden. Dieser kann vollständig kostenlos unter https://www.hdm-stuttgart.de/iaai_download/ eingesehen werden. Weiterhin steht die AI-basierte Matching-Plattform WhatCanAIDoForMe? des IAAI der Hochschule der Medien, der thingsTHINKING Gmbh und der KENBUN IT AG kostenfrei zur Anwendung bereit.

Hier werden Unternehmen ausgehend von einer Beschreibung zur Problemstellung ihres Business Cases über ein semantisches Matching passende AI-Anwendungsfälle vorgeschlagen. Darüber hinaus wird ein numerisches Wertschöpfungspotenzial aus Basis einer Expertinnen-/ Expertenmeinung angezeigt. Dieses kann als ein erster Indikator für eine Bewertung des AI-Vorhabens herangezogen werden.

Unter der Domain WhatCanAIDoForMe.com kann die Plattform aufgerufen werden.

Autoren

Jürgen Seitz
Dr. Jürgen Seitz ist einer der führenden Professoren im Bereich Digitalisierung in Deutschland. Als Mitbegründer, Geschäftsführer und Beirat hat er geholfen, mehrere erfolgreiche digitale Unternehmen aufzubauen und zu skalieren. Seine beruflichen Stationen umfassten u.A. Microsoft, WEB.DE und die United Internet Gruppe (1&1). Heute forscht und lehrt er an der Hochschule der Medien in Stuttgart in den Bereichen Digital Marketing und Digital Business. Er ist außerdem Gründungsprofessor am Institute for Applied Artificial Intelligence (IAAI), Herausgeber der Digital Insights Studienreihe und engagiert sich für die Digitalisierung von NGOs.
Alicia Krafft
Alicia Krafft, Studentin an der Hochschule der Medien in Stuttgart, absolviert derzeit ihr Masterstudium in Unternehmenskommunikation mit den Schwerpunkten Digitale Medien und Marketing sowie Web Analytics. In den letzten Jahren half sie digitale Kommunikationsstrategien für diverse Unternehmen zu entwickeln und umzusetzen, u.a. für die ARENA2036, ein Forschungscampus der Universität Stuttgart, und zuletzt für das Forschungsteam rund um Dr. Jürgen Seitz.

Graphical understanding of dynamic programming and the Bellman equation: taking a typical approach at first

This is the second article of the series My elaborate study notes on reinforcement learning.

*I must admit I could not fully explain how I tried visualizing ideas of Bellman equations in this article. I highly recommend you to also take brief at the second section of the third article. (A comment added on 13/3/2022)

1, Before getting down on business

As the title of this article suggests, this article is going to be mainly about the Bellman equation and dynamic programming (DP), which are to be honest very typical and ordinary topics. One typical way of explaining DP in contexts of reinforcement learning (RL) would be explaining the Bellman equation, value iteration, and policy iteration, in this order. If you would like to merely follow pseudocode of them and implement them, to be honest that is not a big deal. However even though I have studied RL only for some weeks, I got a feeling that these algorithms, especially policy iteration are more than just single algorithms. In order not to miss the points of DP, rather than typically explaining value iteration and policy iteration, I would like to take a different approach. Eventually I am going to introduce DP in RL as a combination of the following key terms: the Bellman operator, the fixed point of a policy, policy evaluation, policy improvement, and existence of the optimal policy. But first, in this article I would like to cover basic and typical topics of DP in RL.

Many machine learning algorithms which use supervised/unsupervised learning more or less share the same ideas. You design a model and a loss function and input samples from data, and you adjust parameters of the model so that the loss function decreases. And you usually use optimization techniques like stochastic gradient descent (SGD) or ones derived from SGD. Actually feature engineering is needed to extract more meaningful information from raw data. Or especially in this third AI boom, the models are getting more and more complex, and I would say the efforts of feature engineering was just replaced by those of designing neural networks. But still, once you have the whole picture of supervised/unsupervised learning, you would soon realize other various algorithms is just a matter of replacing each component of the workflow. However reinforcement learning has been another framework of training machine learning models. Richard E. Bellman’s research on DP in 1950s is said to have laid a foundation for RL. RL also showed great progress thanks to development of deep neural networks (DNN), but still you have to keep it in mind that RL and supervised/unsupervised learning are basically different frameworks. DNN are just introduced in RL frameworks to enable richer expression of each component of RL. And especially when RL is executed in a higher level environment, for example screens of video games or phases of board games, DNN are needed to process each state of the environment. Thus first of all I think it is urgent to see ideas unique to RL in order to effectively learn RL. In the last article I said RL is an algorithm to enable planning by trial and error in an environment, when the model of the environment is not known. And DP is a major way of solving planning problems. But in this article and the next article, I am mainly going to focus on a different aspect of RL: interactions of policies and values.

According to a famous Japanese textbook on RL named “Machine Learning Professional Series: Reinforcement Learning,” most study materials on RL lack explanations on mathematical foundations of RL, including the book by Sutton and Barto. That is why many people who have studied machine learning often find it hard to get RL formulations at the beginning. The book also points out that you need to refer to other bulky books on Markov decision process or dynamic programming to really understand the core ideas behind algorithms introduced in RL textbooks. And I got an impression most of study materials on RL get away with the important ideas on DP with only introducing value iteration and policy iteration algorithms. But my opinion is we should pay more attention on policy iteration. And actually important RL algorithms like Q learning, SARSA, or actor critic methods show some analogies to policy iteration. Also the book by Sutton and Barto also briefly mentions “Almost all reinforcement learning methods are well described as GPI (generalized policy iteration). That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy, as suggested by the diagram to the right side.

Even though I arrogantly, as a beginner in this field, emphasized “simplicity” of RL in the last article, in this article I am conversely going to emphasize the “profoundness” of DP over two articles. But I do not want to cover all the exhaustive mathematical derivations for dynamic programming, which would let many readers feel reluctant to study RL. I tried as hard as possible to visualize the ideas in DP in simple and intuitive ways, as far as I could understand. And as the title of this article series shows, this article is also a study note for me. Any corrections or advice would be appreciated via email or comment pots below.

2, Taking a look at what DP is like

In the last article, I said that planning or RL is a problem of finding an optimal policy \pi(a|s) for choosing which actions to take depending on where you are. Also in the last article I displayed flows of blue arrows for navigating a robot as intuitive examples of optimal policies in planning or RL problems. But you cannot directly calculate those policies. Policies have to be evaluated in the long run so that they maximize returns, the sum of upcoming rewards. Then in order to calculate a policy p(a|s), you need to calculate a value functions v_{\pi}(s). v_{\pi}(s) is a function of how good it is to be in a given state s, under a policy \pi. That means it is likely you get higher return starting from s, when v_{\pi}(s) is high. As illustrated in the figure below, values and policies, which are two major elements of RL, are updated interactively until they converge to an optimal value or an optimal policy. The optimal policy and the optimal value are denoted as v_{\ast} and \pi_{\ast} respectively.

Dynamic programming (DP) is a family of algorithms which is effective for calculating the optimal value v_{\ast} and the optimal policy \pi_{\ast} when the complete model of the environment is given. Whether in my articles or not, the rest of discussions on RL are more or less based on DP. RL can be viewed as a method of achieving the same effects as DP when the model of the environment is not known. And I would say the effects of imitating DP are often referred to as trial and errors in many simplified explanations on RL. If you have studied some basics of computer science, I am quite sure you have encountered DP problems. With DP, in many problems on textbooks you find optimal paths of a graph from a start to a goal, through which you can maximizes the sum of scores of edges you pass. You might remember you could solve those problems in recursive ways, but I think many people have just learnt very limited cases of DP. For the time being I would like you to forget such DP you might have learned and comprehend it as something you newly start learning in the context of RL.

*As a more advances application of DP, you might have learned string matching. You can calculated how close two strings of characters are with DP using string matching.

The way of calculating v_{\pi}(s) and \pi(a|s) with DP can be roughly classified to two types, policy-based and value-based. Especially in the contexts of DP, the policy-based one is called policy iteration, and the values-based one is called value iteration. The biggest difference between them is, in short, policy iteration updates a policy every times step, but value iteration does it only at the last time step. I said you alternate between updating v_{\pi}(s) and \pi(a|s), but in fact that is only true of policy iteration. Value iteration updates a value function v(s). Before formulating these algorithms, I think it will be effective to take a look at how values and policies are actually updated in a very simple case. I would like to introduce a very good tool for visualizing value/policy iteration. You can customize a grid map and place either of “Treasure,” “Danger,” and “Block.” You can choose probability of transition and either of settings, “Policy Iteration” or “Values Iteration.” Let me take an example of conducting DP on a gird map like below. Whichever of “Policy Iteration” or “Values Iteration” you choose, you would get numbers like below. Each number in each cell is the value of each state, and you can see that when you are on states with high values, you are more likely to reach the “treasure” and avoid “dangers.” But I bet this chart does not make any sense if you have not learned RL yet. I prepared some code for visualizing the process of DP on this simulator. The code is available in this link.

*In the book by Sutton and Barto, when RL/DP is discussed at an implementation level, the estimated values of v_{\pi}(s) or v_{\ast}(s) can be denoted as an array V or V_t. But I would like you take it easy while reading my articles. I will repeatedly mentions differences of notations when that matters.

*Remember that at the beginning of studying RL, only super easy cases are considered, so a V is usually just a NumPy array or an Excel sheet.

*The chart above might be also misleading since there is something like a robot at the left bottom corner, which might be an agent. But the agent does not actually move around the environment in planning problems because it has a perfect model of the environment in the head.

The visualization I prepared is based on the implementation of the simulator, so they would give the same outputs. When you run policy iteration in the map, the values and polices are updated as follows. The arrow in each cell is the policy in the state. At each time step the arrows is calculated in a greedy way, and each arrow at each state shows the direction in which the agent is likely to get the highest reward. After 3 iterations, the policies and values converge, and with the policies you can navigate yourself to the “Treasure,” avoiding “Dangers.”

*I am not sure why policies are incorrect at the most left side of the grid map. I might need some modification of code.

You can also update values without modifying policies as the chart below. In this case only the values of cells are updated. This is value-iteration, and after this iteration converges, if you transit to an adjacent cell with the highest value at each cell, you can also navigate yourself to the “treasure,” avoiding “dangers.”

I would like to start formulating DP little by little,based on the notations used in the RL book by Sutton. From now on, I would take an example of the 5 \times 6 grid map which I visualized above. In this case each cell is numbered from 0 to 29 as the figure below. But the cell 7, 13, 14 are removed from the map. In this case \mathcal{S} = {0, 1, 2, 3, 4, 6, 8, 9, 10, 11, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29}, and \mathcal{A} = \{\uparrow, \rightarrow, \downarrow, \leftarrow \}. When you pass s=8, you get a reward r_{treasure}=1, and when you pass the states s=15 or s=19, you get a reward r_{danger}=-1. Also, the agent is encouraged to reach the goal as soon as possible, thus the agent gets a regular reward of r_{regular} = - 0.04 every time step.

In the last section, I mentioned that the purpose of RL is to find the optimal policy which maximizes a return, the sum of upcoming reward R_t. A return is calculated as follows.

R_{t+1} + R_{t+2} +  R_{t+3} + \cdots + R_T

In RL a return is estimated in probabilistic ways, that is, an expectation of the return given a state S_t = s needs to be considered. And this is the value of the state. Thus the value of a state S_t = s is calculated as follows.

\mathbb{E}_{\pi}\bigl[R_{t+1} + R_{t+2} +  R_{t+3} + \cdots + R_T | S_t = s \bigr]

In order to roughly understand how this expectation is calculated let’s take an example of the 5 \times 6 grid map above. When the current state of an agent is s=10, it can take numerous patterns of actions. For example (a) 10 - 9 - 8 - 2 , (b) 10-16-15-21-20-19, (c) 10-11-17-23-29-\cdots. The rewards after each behavior is calculated as follows.

  • If you take a you take the course (a) 10 - 9 - 8 - 2, you get a reward of r_a = -0.04 -0.04 + 1 -0.04 in total. The probability of taking a course of a) is p_a = \pi(A_t = \leftarrow | S_t = 10) \cdot p(S_{t+1} = 9 |S_t = 10, A_t = \leftarrow ) \cdot \pi(A_{t+1} = \leftarrow | S_{t+1} = 9) \cdot p(S_{t+2} = 8 |S_{t+1} = 9, A_{t+1} = \leftarrow ) \cdot \pi(A_{t+2} = \uparrow | S_{t+2} = 8) \cdot p(S_{t+3} = 2 | S_{t+2} = 8, A_{t+2} = \uparrow )
  • Just like the case of (a), the reward after taking the course (b) is r_b = - 0.04 -0.04 -1 -0.04 -0.04 -0.04 -1. The probability of taking the action can be calculated in the same way as p_b = \pi(A_t = \downarrow | S_t = 10) \cdot p(S_{t+1} = 16 |S_t = 10, A_t = \downarrow ) \cdots \pi(A_{t+4} = \leftarrow | S_{t+4} = 20) \cdot p(S_{t+5} = 19 |S_{t+4} = 20, A_{t+4} = \leftarrow ).
  • The rewards and the probability of the case (c) cannot be calculated because future behaviors of the agent is not confirmed.

Assume that (a) and (b) are the only possible cases starting from s, under the policy \pi, then the the value of s=10 can be calculated as follows as a probabilistic sum of rewards of each behavior (a) and (b).

\mathbb{E}_{\pi}\bigl[R_{t+1} + R_{t+2} +  R_{t+3} + \cdots + R_T | S_t = s \bigr] = r_a \cdot p_a + r_b \cdot p_b

But obviously this is not how values of states are calculated in general. Starting from a state a state s=10, not only (a) and (b), but also numerous other behaviors of agents can be considered. Or rather, it is almost impossible to consider all the combinations of actions, transition, and next states. In practice it is quite difficult to calculate a sequence of upcoming rewards R_{t+1}, \gamma R_{t+2}, R_{t+3} \cdots,and it is virtually equal to considering all the possible future cases.A very important formula named the Bellman equation effectively formulate that.

3, The Bellman equation and convergence of value functions

*I must admit I could not fully explain how I tried visualizing ideas of Bellman equations in this article. It might be better to also take brief at the second section of the third article. (A comment added on 3/3/2022)

The Bellman equation enables estimating values of states considering future countless possibilities with the following two ideas.

  1.  Returns are calculated recursively.
  2.  Returns are calculated in probabilistic ways.

First of all, I have to emphasize that a discounted return is usually used rather than a normal return, and a discounted one is defined as below

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma ^2 R_{t+3} + \cdots + \gamma ^ {T-t-1} R_T = \sum_{k=0}^{T-t-1}{\gamma ^{k}R_{t+k+1}}

, where \gamma \in (0, 1] is a discount rate. (1)As the first point above, the discounted return can be calculated recursively as follows: G_t = R_{t + 1} + \gamma R_{t + 2} + \gamma ^2 R_{t + 2} + \gamma ^3 R_{t + 3} + \cdots = R_{t + 1} + \gamma (R_{t + 2} + \gamma R_{t + 2} + \gamma ^2 R_{t + 3} + \cdots ) = R_{t + 1} + \gamma G_{t+1}. You can postpone calculation of future rewards corresponding to G_{t+1} this way. This might sound obvious, but this small trick is crucial for defining defining value functions or making update rules of them. (2)The second point might be confusing to some people, but it is the most important in this section. We took a look at a very simplified case of calculating the expectation in the last section, but let’s see how a value function v_{\pi}(s) is defined in the first place.

v_{\pi}(s) \doteq \mathbb{E}_{\pi}\bigl[G_t | S_t = s \bigr]

This equation means that the value of a state s is a probabilistic sum of all possible rewards taken in the future following a policy \pi. That is, v_{\pi}(s) is an expectation of the return, starting from the state s. The definition of a values v_{\pi}(s) is written down as follows, and this is what \mathbb{E}_{\pi} means.

v_{\pi} (s)= \sum_{a}{\pi(a|s) \sum_{s', r}{p(s', r|s, a)\bigl[r + \gamma v_{\pi}(s')\bigr]}}

This is called Bellman equation, and it is no exaggeration to say this is the foundation of many of upcoming DP or RL ideas. Bellman equation can be also written as \sum_{s', r, a}{\pi(a|s) p(s', r|s, a)\bigl[r + \gamma v_{\pi}(s')\bigr]}. It can be comprehended this way: in Bellman equation you calculate a probabilistic sum of r +v_{\pi}(s'), considering all the possible actions of the agent in the time step. r +v_{\pi}(s') is a sum of the values of the next state s' and a reward r, which you get when you transit to the state s' from s. The probability of getting a reward r after moving from the state s to s', taking an action a is \pi(a|s) p(s', r|s, a). Hence the right side of Bellman equation above means the sum of \pi(a|s) p(s', r|s, a)\bigl[r + \gamma v_{\pi}(s')\bigr], over all possible combinations of s', r, and a.

*I would not say this equation is obvious, and please let me explain a proof of this equation later.

The following figures are based on backup diagrams introduced in the book by Sutton and Barto. As we have just seen, Bellman expectation equation calculates a probabilistic summation of r + v(s'). In order to calculate the expectation, you have to consider all the combinations of s', r, and a. The backup diagram at the left side below shows the idea as a decision-tree-like graph, and strength of color of each arrow is the probability of taking the path.

The Bellman equation I have just introduced is called Bellman expectation equation to be exact. Like the backup diagram at the right side, there is another type of Bellman equation where you consider only the most possible path. Bellman optimality equation is defined as follows.

v_{\ast}(s) \doteq \max_{a} \sum_{s', r}{p(s', r|s, a)\bigl[r + \gamma v_{\ast}(s')\bigr]}

I would like you to pay attention again to the fact that in definitions of Bellman expectation/optimality equations, v_{\pi}(s)/v_{\ast}(s) is defined recursively with v_{\pi}(s)/v_{\ast}(s). You might have thought how to calculate v_{\pi}(s)/v_{\ast}(s) is the problem in the first place.

As I implied in the first section of this article, ideas behind how to calculate these v_{\pi}(s) and v_{\ast}(s) should be discussed more precisely. Especially how to calculate v_{\pi}(s) is a well discussed topic in RL, including the cases where data is sampled from an unknown environment model. In this article we are discussing planning problems, where a model an environment is known. In planning problems, that is DP problems where all the probabilities of transition p(s', r | s, a) are known, a major way of calculating v_{\pi}(s) is iterative policy evaluation. With iterative policy evaluation a sequence of value functions (v_0(s), v_1(s), \dots , v_{k-1}(s), v_{k}(s)) converges to v_{\pi}(s) with the following recurrence relation

v_{k+1}(s) =\sum_{a}{\pi(a|s)\sum_{s', r}{p(s', r | s, a) [r + \gamma v_k (s')]}}.

Once v_{k}(s) converges to v_{\pi}(s), finally the equation of the definition of v_{\pi}(s) holds as follows.

v_{\pi}(s) =\sum_{a}{\pi(a|s)\sum_{s', r}{p(s', r | s, a) [r + \gamma v_{\pi} (s')]}}.

The convergence to v_{\pi}(s) is like the graph below. If you already know how to calculate forward propagation of a neural network, this should not be that hard to understand. You just expand recurrent relation of v_{k}(s) and v_{k+1}(s) from the initial value at k=0 to the converged state at k=K. But you have to be careful abut the directions of the arrows in purple. If you correspond the backup diagrams of the Bellman equation with the graphs below, the purple arrows point to the reverse side to the direction where the graphs extend. This process of converging an arbitrarily initialized v_0(s) to v_{\pi}(s) is called policy evaluation.

*\mathcal{S}, \mathcal{A} are a set of states and actions respectively. Thus |\mathcal{S}|, the size of  \mathcal{S} is the number of white nodes in each layer, and |\mathcal{S}| the number of black nodes.

The same is true of the process of calculating an optimal value function v_{\ast}. With the following recurrence relation

v_{k+1}(s) =\max_a\sum_{s', r}{p(s', r | s, a) [r + \gamma v_k (s')]}

(v_0(s), v_1(s), \dots , v_{k-1}(s), v_{k}(s)) converges to an optimal value function v_{\ast}(s). The graph below visualized the idea of convergence.

4, Pseudocode of policy iteration and value iteration

I prepared pseudocode of each algorithm based on the book by Sutton and Barto. These would be one the most typical DP algorithms you would encounter while studying RL, and if you just want to implement RL by yourself, these pseudocode would enough. Or rather these would be preferable to other more general and abstract pseudocode. But I would like to avoid explaining these pseudocode precisely because I think we need to be more conscious about more general ideas behind DP, which I am going to explain in the next article. I will cover only the important points of these pseudocode, and I would like to introduce some implementation of the algorithms in the latter part of next article. I think you should briefly read this section and come back to this section section or other study materials after reading the next article. In case you want to check the algorithms precisely, you could check the pseudocode I made with LaTeX in this link.

The biggest difference of policy iteration and value iteration is the timings of updating a policy. In policy iteration, a value function v(s) and \pi(a|s) are arbitrarily initialized. (1)The first process is policy evaluation. The policy \pi(a|s) is fixed, and the value function v(s) approximately converge to v_{\pi}(s), which is a value function on the policy \pi. This is conducted by the iterative calculation with the reccurence relation introduced in the last section.(2) The second process is policy improvement. Based on the calculated value function v_{\pi}(s), the new policy \pi(a|s) is updated as below.

\pi(a|s) \gets\text{argmax}_a {r + \sum_{s', r}{p(s', r|s, a)[r + \gamma V(s')]}}, \quad \forall s\in \mathcal{S}

The meaning of this update rule of a policy is quite simple: \pi(a|s) is updated in a greedy way with an action a such that r + \sum_{s', r}{p(s', r|s, a)[r + \gamma V(s')]} is maximized. And when the policy \pi(a|s) is not updated anymore, the policy has converged to the optimal one. At least I would like you to keep it in mind that a while loop of itrative calculation of v_{\pi}(s) is nested in another while loop. The outer loop continues till the policy is not updated anymore.

On the other hand in value iteration, there is mainly only one loop of updating  v_{k}(s), which converge to v_{\ast}(s). And the output policy is the calculated the same way as policy iteration with the estimated optimal value function. According to the book by Sutton and Barto, value iteration can be comprehended this way: the loop of value iteration is truncated with only one iteration, and also policy improvement is done only once at the end.

As I repeated, I think policy iteration is more than just a single algorithm. And relations of values and policies should be discussed carefully rather than just following pseudocode. And whatever RL algorithms you learn, I think more or less you find some similarities to policy iteration. Thus in the next article, I would like to introduce policy iteration in more abstract ways. And I am going to take a rough look at various major RL algorithms with the keywords of “values” and “policies” in the next article.

Appendix

I mentioned the Bellman equation is nothing obvious. In this section, I am going to introduce a mathematical derivation, which I think is the most straightforward. If you are allergic to mathematics, the part blow is not recommendable, but the Bellman equation is the core of RL. I would not say this is difficult, and if you are going to read some texts on RL including some equations, I think mastering the operations I explain below is almost mandatory.

First of all, let’s organize some important points. But please tolerate inaccuracy of mathematical notations here. I am going to follow notations in the book by Sutton and Barto.

  • Capital letters usually denote random variables. For example X, Y,Z, S_t, A_t, R_{t+1}, S_{t+1}. And corresponding small letters are realized values of the random variables. For example x, y, z, s, a, r, s'. (*Please do not think too much about the number of 's on the small letters.)
  • Conditional probabilities in general are denoted as for example \text{Pr}\{X=x, Y=y | Z=z\}. This means the probability of x, y are sampled given that z is sampled.
  • In the book by Sutton and Barto, a probilistic funciton p(\cdot) means a probability of transition, but I am using p(\cdot) to denote probabilities in general. Thus p( s', a, r | s) shows the probability that, given an agent being in state s at time t, the agent will do action a, AND doing this action will cause the agent to proceed to state s' at time t+1, and receive reward r. p( s', a, r | s) is not defined in the book by Barto and Sutton.
  • The following equation holds about any conditional probabilities: p(x, y|z) = p(x|y, z)p(y|z). Thus importantly, p(s', a, r|s) = p(s', r| s, a)p(a|s)=p(s', r | s, a)\pi(a|s)
  • When random variables X, Y are discrete random variables, a conditional expectation of X given Y=y is calculated as follows: \mathbb{E}[X|Y=y] = \sum_{x}{p(x|Y=y)}.

Keeping the points above in mind, let’s get down on business. First, according to definition of a value function on a policy pi and linearity of an expectation, the following equations hold.

v_{\pi}(s) = \mathbb{E} [G_t | S_t =s] = \mathbb{E} [R_{t+1} + \gamma G_{t+1} | S_t =s]

=\mathbb{E} [R_{t+1} | S_t =s] + \gamma \mathbb{E} [G_{t+1} | S_t =s]

Thus we need to calculate \mathbb{E} [R_{t+1} | S_t =s] and \mathbb{E} [G_{t+1} | S_t =s]. As I have explained \mathbb{E} [R_{t+1} | S_t =s] is the sum of p(s', a, r |s) r over all the combinations of (s', a, r). And according to one of the points above, p(s', a, r |s) = p(s', r | s, a)p(a|s)=p(s', r | s, a)\pi(a|s). Thus the following equation holds.

\mathbb{E} [R_{t+1} | S_t =s] = \sum_{s', a, r}{p(s', a, r|s)r} = \sum_{s', a, r}{p(s', r | s, a)\pi(a|s)r}.

Next we have to calculate

\mathbb{E} [G_{t+1} | S_t =s]

= \mathbb{E} [R_{t + 2} + \gamma R_{t + 3} + \gamma ^2 R_{t + 4} + \cdots | S_t =s]

= \mathbb{E} [R_{t + 2}  | S_t =s] + \gamma \mathbb{E} [R_{t + 2} | S_t =s]  + \gamma ^2\mathbb{E} [ R_{t + 4} | S_t =s]  +\cdots.

Let’s first calculate \mathbb{E} [R_{t + 2}  | S_t =s]. Also \mathbb{E} [R_{t + 3}  | S_t =s] is a sum of p(s'', a', r', s', a, r|s)r' over all the combinations of (s”, a’, r’, s’, a, r).

\mathbb{E}_{\pi} [R_{t + 2}  | S_t =s] =\sum_{s'', a', r', s', a, r}{p(s'', a', r', s', a, r|s)r'}

=\sum_{s'', a', r', s', a, r}{p(s'', a', r'| s', a, r, s)p(s', a, r|s)r'}

=\sum_{ s', a, r}{p(s', a, r|s)} \sum_{s'', a', r'}{p(s'', a', r'| s', a, r, s)r'}

I would like you to remember that in Markov decision process the next state S_{t+1} and the reward R_t only depends on the current state S_t and the action A_t at the time step.

Thus in variables s', a, r, s, only s' have the following variables r', a', s'', r'', a'', s''', \dots.  And again p(s', a, r |s) = p(s', r | s, a)p(a|s). Thus the following equations hold.

\mathbb{E}_{\pi} [R_{t + 2}  | S_t =s]=\sum_{ s', a, r}{p(s', a, r|s)} \sum_{s'', a', r'}{p(s'', a', r'| s', a, r', s)r'}

=\sum_{ s', a, r}{p(s', r|a, s)\pi(a|s)} \sum_{s'', a', r'}{p(s'', a', r'| s')r'}

= \sum_{ s', a, r}{p(s', r|a, s)\pi(a|s)} \mathbb{E}_{\pi} [R_{t+2}  | s'].

\mathbb{E}_{\pi} [R_{t + 3}  | S_t =s] can be calculated the same way.

\mathbb{E}_{\pi}[R_{t + 3}  | S_t =s] =\sum_{s''', a'', r'', s'', a', r', s', a, r}{p(s''', a'', r'', s'', a', r', s', a, r|s)r''}

=\sum_{s''', a'', r'', s'', a', r', s', a, r}{p(s''', a'', r'', s'', a', r'| s', a, r, s)p(s', a, r|s)r''}

=\sum_{ s', a, r}{p(s', a, r|s)} \sum_{s''', a'' r'', s'', a', r'}{p(s''', a'', r'', s'', a', r'| s', a, r, s)r''}

=\sum_{ s', a, r}{ p(s', r | s, a)p(a|s)} \sum_{s''', a'' r'', s'', a', r'}{p(s''', a'', r'', s'', a', r'| s')r''}

=\sum_{ s', a, r}{ p(s', r | s, a)p(a|s)} \mathbb{E}_{\pi} [R_{t+3}  | s'].

The same is true of calculating \mathbb{E}_{\pi} [R_{t + 4}  | S_t =s], \mathbb{E}_{\pi} [R_{t + 5}  | S_t =s]\dots.  Thus

v_{\pi}(s) =\mathbb{E} [R_{t+1} | S_t =s] + \gamma \mathbb{E} [G_{t+1} | S_t =s]

=\sum_{s', a, r}{p(s', r | s, a)\pi(a|s)r} + \mathbb{E} [R_{t + 2}  | S_t =s] + \gamma \mathbb{E} [R_{t + 3} | S_t =s]  + \gamma ^2\mathbb{E} [ R_{t + 4} | S_t =s]  +\cdots

=\sum_{s, a, r}{p(s', r | s, a)\pi(a|s)r} +\sum_{ s', a, r}{p(s', r|a, s)\pi(a|s)} \mathbb{E}_{\pi} [R_{t+2}  |S_{t+1}= s'] +\gamma \sum_{ s', a, r}{ p(s', r | s, a)p(a|s)} \mathbb{E}_{\pi} [R_{t+3} |S_{t+1} =  s'] +\gamma^2 \sum_{ s', a, r}{ p(s', r | s, a)p(a|s)} \mathbb{E}_{\pi} [ R_{t+4}|S_{t+1} =  s'] + \cdots

=\sum_{ s', a, r}{ p(s', r | s, a)p(a|s)} [r + \mathbb{E}_{\pi} [\gamma R_{t+2}+ \gamma R_{t+3}+\gamma^2R_{t+4} + \cdots |S_{t+1} =  s'] ]

=\sum_{ s', a, r}{ p(s', r | s, a)p(a|s)} [r + \mathbb{E}_{\pi} [G_{t+1} |S_{t+1} =  s'] ]

=\sum_{ s', a, r}{ p(s', r | s, a)p(a|s)} [r + v_{\pi}(s') ]