Main Category Archives

Articles written about Big Data Analytics

Variational Autoencoders

April 19, 2022/in Artificial Intelligence, Data Science, Deep Learning, Machine Learning, Main Category, Use Cases/by Sunil Yadav

Having covered the basics of Deep Generative Modelling (DGM) and Deep Autoregressive Models (AGMs), we continue the series with Variational Autoencoders (VAEs). VAEs are a deep learning method that produces synthetic data (images, texts) by learning latent representations of the training data. AGMs are sequential models that generate data based on previous data points by defining tractable conditionals. VAEs, on the other hand, use latent variable models to infer hidden structure in the underlying data, based on the following intractable distribution function:

(1) $\begin{equation*} p_\theta(x) = \int p_\theta(x|z)p_\theta(z) dz. \end{equation*}$

The generative process described by the above equation can be expressed as a directed graph, as shown in Figure 1 (the decoder part), where the latent variable $z\sim p_\theta(z)$ produces meaningful information about $x \sim p_\theta(x|z)$ .

Figure 1: Architectures of AE and VAE, both based on the bottleneck architecture. The decoder part works as a generative model during inference.

Autoencoders

Autoencoders (AEs) are the key building block of VAEs. They are an unsupervised representation learning technique and consist of two main parts, the encoder and the decoder (see Figure 1). The encoder is a deep neural network (mostly a convolutional neural network for imaging data) that learns a lower-dimensional feature representation from the training data. The learned latent feature representation $z$ usually has a much lower dimension than the input $x$ and captures the most dominant features of $x$ . The encoder learns features by performing convolutions at different levels, and compression happens via max-pooling.

The decoder, which is also a deep convolutional neural network, reverses the encoder’s operation. It tries to reconstruct the original data $x$ from the latent representation $z$ using up-sampling convolutions. The decoder is very similar to the generative model of a VAE, as shown in Figure 1, where synthetic images are generated from the latent variable $z$ .

During the training of autoencoders, we would like to utilize the unlabeled data and try to minimize the following quadratic loss function:

(2) $\begin{equation*} \mathcal{L}(\theta, \phi) = ||x-\hat{x}||^2, \end{equation*}$

Here $\hat{x}$ is the decoder’s reconstruction of $x$ . Minimizing this loss reduces the distance between the original input and the reconstructed image, as shown in Figure 1.

Variational autoencoders

VAEs are motivated by the decoder part of AEs, which can generate data from a latent representation. They are a probabilistic version of AEs, which allows us to generate synthetic data with different attributes. A VAE can be seen as the decoder part of an AE that learns the set of parameters $\theta$ to approximate the conditional $p_\theta(x|z)$ and generates images based on a sample from a true prior, $z\sim p_\theta(z)$ . The true prior $p_\theta(z)$ is generally chosen to be a Gaussian distribution.

Network Architecture

A VAE has an architecture quite similar to an AE, except for the bottleneck part, as shown in Figure 1. In AEs, the encoder converts high-dimensional input data to a low-dimensional latent representation in vector form. A VAE’s encoder, on the other hand, learns a mean vector and a diagonal standard deviation matrix such that $z\sim \mathcal{N}(\mu_z, \Sigma_z)$ , because it performs probabilistic generation of data. Therefore, both the encoder and the decoder should be probabilistic.
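The sampling step at the bottleneck can be sketched as follows. This is a minimal illustration, not the article’s implementation: it assumes the encoder has already produced a mean vector and the diagonal of the standard deviation matrix, and the function name `sample_latent` and the numeric values are hypothetical. The draw is written as mean plus scaled noise, which is the form used by the reparametrization trick of [KW14].

```python
import random

def sample_latent(mu, sigma, rng):
    """Draw z_i = mu_i + sigma_i * eps_i with eps_i ~ N(0, 1), one value per latent dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)   # seeded only to make the sketch repeatable
mu = [0.5, -1.0]         # hypothetical encoder output: mean vector
sigma = [1.0, 0.1]       # hypothetical encoder output: diagonal standard deviations
z = sample_latent(mu, sigma, rng)
print(z)                 # one latent sample that would be passed to the decoder
```

Because the noise is drawn independently of the encoder outputs, the sample stays a simple function of the mean and standard deviation, which is what lets gradients flow back into the encoder during training.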

Training

Similar to the training of AGMs, we would like to maximize the likelihood of the training data. The likelihood of the data for VAEs is given in Equation 1. The first term $p_\theta(x|z)$ is approximated by a neural network, and the second term $p_\theta(z)$ is the prior distribution, which is Gaussian; therefore, both of them are tractable. However, the integral itself is not tractable, because it would have to be evaluated over the entire high-dimensional latent space.

To solve this problem of intractability, the encoder part of the AE is used to learn a set of parameters $\phi$ for the conditional $q_\phi (z|x)$ , which approximates the intractable posterior $p_\theta (z|x)$ . This additional encoder part helps to derive a lower bound on the data likelihood, which makes the objective tractable. In the following, we derive the lower bound of the likelihood function:

(3) $\begin{equation*} \begin{aligned} \log p_\theta (x) = & \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[\log \frac{p_\theta (x|z) p_\theta (z)}{p_\theta (z|x)} \: \frac{q_\phi(z|x)}{q_\phi(z|x)}\Bigg] \\ = & \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[\log p_\theta (x|z)\Bigg] - \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[\log \frac{q_\phi (z|x)} {p_\theta (z)}\Bigg] + \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[\log \frac{q_\phi (z|x)}{p_\theta (z|x)}\Bigg] \\ = & \mathbf{E}_{z\sim q_\phi(z|x)} \Big[\log p_\theta (x|z)\Big] - \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z)) + \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z|x)). \end{aligned} \end{equation*}$

In the above equation, the first line takes the logarithm of the likelihood $p_\theta (x)$ , expands it using Bayes’ theorem, and multiplies by $q_\phi(z|x)/q_\phi(z|x) = 1$ . Because $\log p_\theta (x)$ does not depend on $z$ , taking the expectation over $z\sim q_\phi(z|x)$ leaves it unchanged. The second line expands the expression using the rules of logarithms and rearranges it. The last two terms of the second line match the definition of the KL divergence, which gives the third line.

In the last line, the first term represents the reconstruction loss and is approximated by the decoder network. This term can be estimated using the reparametrization trick [KW14]. The second term is the KL divergence between the encoder distribution $q_\phi (z|x)$ and the prior distribution $p_\theta(z)$ . Both are Gaussian, so this term has a closed-form solution and is tractable. The last term is intractable because of $p_\theta (z|x)$ . However, the KL divergence measures the dissimilarity between two probability densities and is always non-negative. Using this property, the above equation can be bounded as:
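The closed-form KL term mentioned above can be sketched in a few lines. The sketch assumes that the prior is a standard normal $\mathcal{N}(0, I)$ and that the encoder distribution is a diagonal Gaussian; the article only states that both are Gaussian, so the standard normal prior and the function name `kl_to_standard_normal` are assumptions made for illustration.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian, summed over dimensions."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))

print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0: encoder distribution equals the prior
print(kl_to_standard_normal([1.0, 0.0], [1.0, 1.0]))  # 0.5: shifting one mean by 1 costs 0.5
```

The value is zero only when the encoder distribution matches the prior and grows as the mean moves away from zero or the standard deviation moves away from one.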

(4) $\begin{equation*} \log p_\theta (x)\geq \mathcal{L}(x, \phi, \theta) , \: \text{where} \: \mathcal{L}(x, \phi, \theta) = \mathbf{E}_{z\sim q_\phi(z|x)} \Big[\log p_\theta (x|z)\Big] - \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z)). \end{equation*}$

In the above equation, the term $\mathcal{L}(x, \phi, \theta)$ is the tractable lower bound used for the optimization and is known as the ELBO (Evidence Lower Bound). During training, we maximize the ELBO using the following equation:

(5) $\begin{equation*} \operatorname*{argmax}_{\phi, \theta} \sum_{x\in X} \mathcal{L}(x, \phi, \theta). \end{equation*}$

Furthermore, the reconstruction loss term can be written using Equation 2, because the decoder output is assumed to follow a Gaussian distribution. Under this assumption, the term reduces, up to constants, to the mean squared error (MSE).
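As a brief worked step for this claim, assume (for illustration; the article does not spell this out) that the decoder models $p_\theta(x|z)$ as an isotropic Gaussian whose mean is the decoder output $\hat{x}$ , with a fixed variance $\sigma^2$ and data dimension $d$ . The log-likelihood then reads:

$\begin{equation*} \log p_\theta (x|z) = -\frac{1}{2\sigma^2} ||x-\hat{x}||^2 - \frac{d}{2} \log (2\pi\sigma^2). \end{equation*}$

The second term does not depend on $\theta$ or $\phi$ , so maximizing the expected log-likelihood is equivalent to minimizing the squared error of Equation 2, scaled by $1/(2\sigma^2)$ .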

During the implementation, the architecture part is straightforward and can be found here. The user has to define the size of the latent space, which is vital for the reconstruction process. Furthermore, the loss function (the negative ELBO) can be minimized using the Adam optimizer with a fixed batch size and a fixed number of epochs.

Figure 2: The results obtained from a vanilla VAE (left) and a recent VAE-based generative model, NVAE (right).

The figure above shows the quality improvement since VAEs were introduced by Kingma and Welling [KW14]. NVAE is a relatively new method that uses a deep hierarchical VAE [VK21].

Summary

In this blog post, we discussed variational autoencoders along with the basics of autoencoders. We covered the main difference between AEs and VAEs, along with the derivation of the lower bound in VAEs. Using two different VAE-based methods, we have shown that VAEs are still an active area of research because, in general, they produce blurry outcomes.

References

[KW14] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, 2014.
[VK21] Arash Vahdat and Jan Kautz. NVAE: A Deep Hierarchical Variational Autoencoder, 2021.

Key Points on AI’s Role In The Future Of Data Protection

April 3, 2022/in Artificial Intelligence, Data Security/by Lydia Iseh

Artificial Intelligence is transforming every industry as we speak, and data protection might be where it matters most. With a projected market size of USD 113,390 million, there’s a lot to protect, and humans won’t be able to do it all.

Luckily for us, Artificial Intelligence solutions are here to help. AI can do a lot more than just collect and analyze data: it can also protect it. In this article, we’ll explain the role of Artificial Intelligence in the future of data protection.


Here’s AI for data protection in summary:

3 Ways AI serves in data protection

AI Can Improve Compliance: from the GDPR to the CPRA, AI can help you track down gaps in your compliance with the most important data protection legislation.
AI as an ally against cyberattacks: cyberattacks are becoming increasingly sophisticated, but so is AI. It can help you recognize the patterns that indicate an attack is underway and put in automated reactions to minimize damage.
AI can protect against phishing attempts: together with ML and NLP, AI is a valuable tool in detecting phishing attempts—especially since they are becoming increasingly hard to spot.

Why AI is so valuable in the fight against cybercrime

AI can handle more data, and more complex data, than humans: with the amount of data being collected and processed every second, it’s incredibly inefficient not to let AI do the work, and AI can cut costs drastically as well.
AI can quickly classify data and keep it organized: before you can protect your data, make sure it’s organized properly. No matter the amount or complexity of the structure, AI can help you stay on top of it.
No humans needed to keep sensitive data secure: worried about human error, or have trust issues? With AI, you don’t need to rely on people for protection and discretion.

The threats your data faces on a daily basis

It’s not just the good guys who are using technologies like artificial intelligence to up their game—hackers and people after sensitive data can also reap the benefits of AI. There are more than 2,200 cyberattacks per day—which means one every 39 seconds, so the threat is substantial.

While the clock is ticking, research found that fewer than 25% of businesses think they’re ready to fight off a ransomware attack. That leaves more than 75% of organizations all the more vulnerable to data privacy threats.

Leaks of personal information, data hacks and other privacy scandals are costly: it’s estimated that cybercrime will cost companies worldwide $10.5 trillion annually by 2025, with an average cost of $3.86 million per breach, not including the harm done to users and to the reputation of a business.

That makes investing in a solid data protection system all the more worthwhile, which shows in the spending habits of businesses all over the world: global spending on privacy efforts is expected to reach $8 billion by 2022. Luckily, with the rapid developments in AI and other smart security tools, solid protection has become more attainable, even for smaller businesses.


3 Ways AI serves in data protection

What does artificial intelligence in data protection look like in practice? Let’s look at some of the ways AI can assist your organization in warding off cyber criminals.

1. AI Can Improve Compliance

How compliant is your organization with all the data protection and privacy regulations? It can be incredibly hard to keep up, understand and check whether your systems are up-to-date on the latest compliance regulations.

But—no need to worry! AI has taken over the nitty-gritty of it all. It’s expected that by 2023, over 40% of privacy compliance technology will rely on AI.

Which legislation can AI help you comply with? Two big names are the GDPR and the CPRA. AI can help you identify blind spots in your data protection efforts and warn you when you’re not living up to the standards governments have put in place.

One tool that does this is SECURITI.ai. With AI-driven PI data discovery, DSR automation, and documented accountability, you get a clearer view of your data processing activities and can make sure you’re compliant.

An alternative AI solution is Claudette, a web crawler that assesses privacy policies using supervised machine learning. After it’s done scanning and collecting information, it checks whether the data is used in a GDPR-compliant way. It flags issues such as incomplete information, unclear language, or problematic data processing tactics.

Of course, you can’t solely rely on AI to do all the work when it comes to privacy and data protection. You and your employees also need to understand and handle data in ways that are compliant with the rules set in place.

Start with understanding what the GDPR and CPRA are all about. Osano’s guide to CPRA is a great place to learn about the CPRA, which will replace the CCPA on January 1, 2023. Educate yourself on the rules of data protection, and it will be even easier to select an AI tool that will help you protect your valuable data.

2. AI as an ally against cyberattacks

With the combination of big data, artificial intelligence and machine learning, you have a great recipe for tracking down the patterns that indicate a cyberattack is happening. Why is that helpful?

It’s all about identifying patterns. When AI and ML work together, they can map out what happened during previous attacks. Together, they can identify the actions hackers have taken before and find weak spots in your security system, so you can fill those gaps and be extra alert.

AI can assist in quickly alerting the right people and systems that there’s a threat. This can even kick off a series of extra measures to be taken, so the cyberattack can be beaten back.

AI can also make sure malicious websites and unauthorized data transactions are automatically blocked before any harm can be done.

3. AI can protect against phishing attempts

Sometimes it’s employees who unknowingly let the cyber criminals in. Many people roll their eyes when they hear about yet another phishing attempt (shouldn’t we all know better by now than to click on certain links?), but cyber criminals are creating increasingly sophisticated phishing attacks. Even the most tech-savvy and internet-native people can fall for them.

That’s because phishing is all about what’s happening in the details, or in the background of a message: something the untrained human eye won’t immediately see.

AI does see it, however. With technologies like Natural Language Processing and Machine Learning, it can automatically spot when a phishing attack is at play and warn users.

There are even AI and ML tools on the market that are able to analyze the context of a message and the relationship between the sender and receiver, for even greater accuracy.


Why AI is so valuable in the fight against cybercrime

But why AI? Can we really rely on yet another robotic system to keep a digital framework safe? Isn’t it safer to have it handled by humans? We’ll expand on the three main benefits AI offers in the data protection game.

1. AI can handle more data, and more complex data, than humans

With all the data that is being processed and stored nowadays, there are barely enough people on the planet to keep an eye on every sensitive piece of information.

Good data protection is extremely time-consuming, because it’s constant. Checking servers manually is virtually impossible.

AI can work automatically and 24/7, no matter how much data there is to handle. On top of that, AI can be put in place to handle the more complex data structures, which can be hard for humans to analyze and protect, all while keeping costs low.

2. AI can quickly classify data and keep it organized

Before you can even start protecting data, you will need to put it in place—efficiently. With the large volumes of data that organizations deal with, AI comes in handy. AI can quickly classify and manage data to keep it organized.

3. No humans needed to keep sensitive data secure

AI can work independently from humans, which means nobody necessarily needs to have direct access to the sensitive data you’re trying to protect. Not only does that decrease the chances of human error, but it also builds an extra layer of trust.

Ready to call in the help of AI for your data protection?

Start by looking at the legislation that is important for your organization, and build on the needs you have for your specific business. Want to know more about the power of AI for data-driven businesses? Keep reading in our blog section dedicated to artificial intelligence!

Implementing an AI Project the Right Way: Here’s How

March 30, 2022/in Artificial Intelligence, Data Science, Deep Learning, Machine Learning, Main Category/by Benjamin Aunkofer

Do you want to cut costs in your company and introduce more efficient workflows? Then you may already have thought about automating processes with artificial intelligence. To get you off to a good start, we discuss how an AI project works and how to implement it properly.

We at DATANOMIQ and pixolution share our experience from deep learning projects, which mainly deal with optimizing and automating business processes around visual data such as images or videos. We introduce the individual project steps, tell you where the sticking points are, and explain how everyone involved can help make an AI project a success.

1. Initial meeting

In an initial meeting, we record your requirements.

Inventory of your current processes and the changes you want: How are your current processes structured? Which processes would you like to change?
Goal definition: What end result do you want? What exactly should the new processes look like? The goal should be described in as much detail as possible.
Budget: What budget have you planned for this project? Together with the defined goal, the budget determines the paths we can take together in the project. Usually, you want to save costs or increase revenue by introducing AI. This is the decisive factor for the size of the budget.
Data situation: Do you have data that we can use for training? If so, which data and how much? Is continuous data collection in place that can be used during the project, or does the basis for it have to be created first?

2. Evaluation

In this step, we evaluate and plan the implementation of the project together with you. In detail, this means the following.

Review of the data and further data planning

We review the training data you provide, e.g. labeled images, and assess whether it can be used meaningfully for training. Since deep learning requires a very large amount of training data, this is a crucial point. The review also covers the quality and balance of the data, because these determine how well an AI model learns and makes correct predictions.

If you cannot provide any data at the start of the project, a separate project is needed first, solely to collect data. Depending on what is applicable, this means, for example, purchasing datasets or labeling services. We are at your side to advise you.

New data is needed again and again throughout the project to keep improving the quality of the model. We therefore need to plan with you how you will continuously collect this data and how you will detect and correct wrong predictions of the model, so that you can provide them to us. Both the correctly recognized data and the incorrectly recognized and then corrected data will feed into the next training run.

Definition of the Minimum Viable Product (MVP)

Together with you, we define what a minimally functional version of the AI can look like. The basic question here is: which components or features should go into production first, so that you gain value from the AI as quickly as possible?

One advantage of this approach is that you can test the new AI-based process on a small scale. At the same time, we can identify improvements more quickly. At a later point, you can scale up and add further features. The compelling arguments for starting with an MVP, however, are cost reduction and risk minimization. Instead of implementing a huge project, a small project that creates value is put together and tested in the real world. This avoids costly planning mistakes and misguided development.

Definition of the Key Performance Indicators (KPI)

Key performance indicators are important for objectively measuring the quality of the AI and its business impact. These targets define what the planned system has to deliver in order to be successful. Key performance indicators can include:

Average time saved in the process through partial automation
Guaranteed response time at the maximum number of requests per second
Number of requests the AI can handle in parallel
Accuracy of the model
Time from completion to implementation of the AI model

Planning the integration into your production system

We plan the deep integration into your production system with you. Questions such as the following are important: How should the AI be used within the existing software environment and workflow? What is needed to access the AI?

With the initial meeting and the evaluation, the foundation for the project has been laid. In the following steps, we drive development further and further forward. Steps 3 to 5 are repeated until we have progressed from the minimally functional product version, the MVP, to the desired end product.

3. Iteration

We train the algorithm with the majority of the available data. We then check the performance of the model on unseen data.
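The idea of holding back unseen data can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 80/20 split, the function names `train_test_split` and `accuracy`, and the toy labels are hypothetical and are not taken from the projects described here.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and hold back a fraction as unseen test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

train, test = train_test_split(range(10))     # hypothetical sample ids
print(len(train), len(test))                  # 8 2
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))   # 0.75
```

The model only ever sees `train` during training, so the accuracy measured on `test` indicates how it behaves on data it has not encountered.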

How long training takes depends on the task. It can be said, however, that training a deep learning model for images or videos is more complex and time-consuming than for text-based machine learning tasks. This is because we use deep models (with many layers) and the amounts of data processed are usually very large.

Depending on the project, however, training the model is only a fraction of the entire development process we carry out. Often we need to build a dedicated process in which the model can be embedded, such as a web service.

4. Integration

Once the model has reached an acceptable quality level after training, we deliver a first product version to you. We usually provide the version as a Docker image with an API. You then begin integrating it into your system and your workflows. We support you along the way.

5. Collecting feedback

After the integration into production, it is very important that you collect data from usage. Only then can you judge whether the AI works as you imagined and whether things are moving in the right direction. The point is to record what the model can and cannot do in real operation. You collect this data and send it to us. We then feed it into the next training run.

It is not unusual for this data collection in real operation to take some time. This naturally depends on the extent to which you collect data. Weeks or even months can typically pass before the next iteration begins.

The next iteration

To achieve a significant increase in result quality with the next iteration, it may be necessary for you to provide us with more data, or different data, arising from real operation.

A next iteration can also be motivated by a change in requirements, for example when a classification model needs to recognize new categories. The current model cannot make good predictions for such changes and first has to be trained with corresponding new data.

Tips for a successful AI project

A decisive factor for a successful AI project is the iterative approach and the step-by-step introduction of an AI-based process, which increases the quality and the range of functions of the development.

Furthermore, you need to be aware that providing training data is not a static process. It is a cycle in which you, as the customer, play a decisive role. A final important point is the measurability of the project. Only if the target values are measured during the project can setbacks or progress be seen, and only then can the goal ultimately be reached.

This article was made possible by the great collaboration with pixolution, a company for AI solutions in the field of computer vision (visual image search and custom AI solutions).

Air Quality Forecasting Python Project

March 20, 2022/in Data Mining, Data Science, Python, Use Cases/by Shivani Padaya

You will find the full Python code and all visuals for this article in this GitLab repository. The repository contains a series of analyses, transforms and forecasting models frequently used when dealing with time series. Its aim is to showcase how to model time series from scratch, using a dataset from a real use case.

This project forecasts yearly carbon dioxide (CO2) emission levels. Most organizations have to follow government norms for CO2 emissions and pay charges accordingly, so forecasting the CO2 levels lets them follow the norms and pay in advance based on the forecasted values. In any data science project the main component is data; for this project, the data was provided by the company. The dataset contains 215 entries and two columns, Year and CO2 emissions. It covers the years 1800 to 2014 and is a univariate time series, because there is only one dependent variable, CO2, which depends on time.

The dataset used: the dataset contains yearly CO2 emission levels from 1800 to 2014, sampled once per year. The dataset is non-stationary, so we have to use a differenced time series for forecasting.

After getting the data, the next step is to analyze the time series, which is done in Python. The data was stored in an Excel file, so we first read that file with pandas, a Python library, to create a pandas DataFrame. After that, preprocessing was performed, such as changing the data type of the time column from object to DateTime. A time series has four main components: level, trend, seasonality and noise. To study these components, we decompose the time series so that we can better understand it and choose the forecasting model accordingly, because each component affects the model differently. Decomposing also lets us identify whether the time series is multiplicative or additive.

CO2 emissions – plotted via python pandas / matplotlib

Decomposing the time series with the Python statsmodels library gives us the trend, seasonality and residual components separately. In a multiplicative time series the components are multiplied together, while in an additive time series they are added together.

To understand the trend component, a moving average with a window of 10 steps was applied, which shows a nonlinear upward trend; fitting a linear regression model also shows an upward trend. As for seasonality, there is a combination of multiple patterns over time, which is common in real-world time series data. Capturing the white noise is difficult in this type of data.

The time series starts in 1800, when CO2 values were below 1 because there was hardly any human activity, so the levels were low. Over time, the number of industries and the amount of human activity increased rapidly, which caused CO2 levels to rise rapidly as well. The highest CO2 emission level in the time series was 18.7, in 1979. It was challenging to decide whether to treat the values below 0.5 as white noise, because 30% of the CO2 values were below 1. In the real world, a CO2 emission level of exactly 0 is nearly impossible today, but levels as low as 0.0005 are still possible. Since every data point carries valuable information, we decided not to remove those entries.
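The moving average used to inspect the trend can be sketched in plain Python. This is a minimal illustration, not the code from the repository: the function name `moving_average` is hypothetical, the window is trailing (the article does not say whether a trailing or centered window was used), and the toy series is made up.

```python
def moving_average(values, window):
    """Trailing simple moving average; returns len(values) - window + 1 points."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]  # slide the window by one step
        averages.append(running / window)
    return averages

print(moving_average([1, 2, 3, 4, 5], 2))  # [1.5, 2.5, 3.5, 4.5]
```

With a window of 10 on yearly data, each point averages a decade, which smooths out short-term variation and makes the long-run trend easier to see.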

The next step is to create a lag plot to see the correlation between the current year’s CO2 level and the previous year’s CO2 level. The plot was linear, which indicates a high correlation, so current and previous CO2 levels are strongly related. The randomness of the data was assessed by plotting the autocorrelation graph. It shows smooth curves, which indicates that the time series is non-stationary, so the next step is to make it stationary. In a non-stationary time series, summary statistics like the mean and variance change over time.
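What the lag plot and the autocorrelation graph measure can be sketched with the standard sample autocorrelation estimator. The function name `autocorrelation` and the two toy series are hypothetical; the sketch only illustrates why a trending series shows a high lag-1 correlation.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum((values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag))
    return covariance / variance

trending = list(range(100))   # toy series with a steady upward trend
alternating = [1, -1] * 50    # toy series that flips sign every step
print(autocorrelation(trending, 1))     # close to 1: each value resembles the previous one
print(autocorrelation(alternating, 1))  # close to -1: each value opposes the previous one
```

For a trending series the autocorrelation decays slowly and smoothly as the lag grows, which is the pattern the text describes as a sign of non-stationarity.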

To make a time series stationary, we have to remove trend and seasonality from it. Before that, we use the Dickey-Fuller test to confirm that our time series is non-stationary. The test was run in Python and returns a p-value. The null hypothesis is that the data is non-stationary, and the alternative hypothesis is that the data is stationary. With a significance level of 0.05, the p-value returned by the Dickey-Fuller test is greater than 0.05, so we fail to reject the null hypothesis and treat the time series as non-stationary.

Differencing is one of the techniques for making a time series stationary. Here, first-order differencing was applied, which subtracts the previous value from the current value for every data point. Other transformations such as log, square root and reciprocal were also applied in the attempt to make the time series stationary. Smoothing techniques such as the simple moving average, the exponentially weighted moving average, simple exponential smoothing and double exponential smoothing can be applied to reduce the variation between time stamps and to obtain smooth curves.
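First-order differencing as described above can be sketched in one function. The name `difference` and the toy series are hypothetical; the sketch only shows how a steady trend disappears after differencing.

```python
def difference(values):
    """First-order differencing: current value minus previous value."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]

print(difference([2, 4, 7, 11]))     # [2, 3, 4]
print(difference([10, 12, 14, 16]))  # [2, 2, 2]: a linear trend becomes a constant
```

The differenced series is one element shorter than the original, because the first value has no predecessor.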

Smoothing techniques were also used to observe the trend in the time series and to predict future values, but other models performed better. The first 200 entries were used to train the models and the remaining entries to test their performance. Because predicting future CO2 emissions is a regression problem, performance was measured by Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). RMSE is the square root of the average squared difference between the actual values and the model’s predictions on the test data. Here, RMSE values were calculated using the Python sklearn library.

There are two approaches to model building: data-driven and model-based. Models from both approaches were applied to find the best fit. The ARIMA model gives the best results for this kind of dataset, as it was trained on the differenced time series. ARIMA predicts a time series based on its own past values. It can be used for any non-seasonal series of numbers that exhibits patterns and is not a series of random events. ARIMA takes three parameters: the AR order, the MA order and the order of differencing. Hyperparameter tuning finds the best parameters for the model by trying different sets of parameters. Alternatively, the autocorrelation and partial autocorrelation plots can be used to choose the AR and MA parameters: the partial autocorrelation function (PACF) shows the partial correlation of a stationary time series with its own lagged values, so it can be used to choose the AR order, while the autocorrelation function (ACF) shows how data points in a time series are related to their lagged values and can be used to choose the MA order.
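The two error measures can be written out directly from their definitions. The article computed them with sklearn; the plain-Python functions below are only a sketch of the same formulas, and the names `rmse` and `mae` and the toy values are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root of the average squared difference between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [1.0, 2.0, 3.0]      # toy test values
predicted = [1.0, 2.0, 5.0]   # toy model output with one miss of size 2
print(rmse(actual, predicted))
print(mae(actual, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a sense of whether a model’s error comes from a few large misses or many small ones.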

Yearly difference of CO2 emissions – ARIMA Prediction

Apart from ARIMA, a few other models were trained: AR, ARMA, simple linear regression, a quadratic method, Holt-Winters exponential smoothing, Ridge and Lasso regression, LGBM and XGBoost, a recurrent neural network (RNN) with Long Short-Term Memory (LSTM), and Fbprophet. I would like to mention my experience with LSTM here because it is another model that gives results as good as ARIMA. The reason for not choosing LSTM as the final model is its complexity: ARIMA gives appropriate results, is simple to understand and requires fewer dependencies. LSTM requires a lot of data preprocessing and additional dependencies. Because the dataset was small we were able to train the model on a CPU; otherwise a GPU is required to train an LSTM model. We faced one more challenge in the deployment part: getting the predictions back into the original form. Because the model was trained on the differenced time series, it predicts future values in differenced format. After a lot of research on the internet and a close look at the underlying mathematics, we found the solution. To invert first-order differencing, each differenced value has to be added back to the previous value of the original series, so for the forecasts we add the last value of the original time series to the running sum of the predicted differences. To create the user interface, Streamlit was used, a commonly used Python library. The pickle file of the ARIMA model was used to predict future values based on user input. The limit for forecasting is the year 2050. The project was uploaded to Google Cloud Platform. The flow is as follows: the user enters the start year and the end year of the forecast, and the prediction is made for this range. Given these inputs, the pickled model produces the future CO2 emissions in differenced format, the values are converted back to the original format, and the original values are displayed on the user interface together with an interactive line graph.
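The conversion back to the original scale can be sketched as follows, assuming the model returns forecasts of first-order differences. The function name and the numbers are hypothetical; the project's actual code may differ.

```python
from itertools import accumulate


def invert_first_difference(last_observed, predicted_diffs):
    """Turn predicted differences into values on the original scale.

    Each forecast is the last observed value plus the running sum of all
    predicted differences up to that step.
    """
    return [last_observed + total for total in accumulate(predicted_diffs)]


# Hypothetical last observation and differenced forecasts.
last_observed = 24.0
predicted_diffs = [6.0, 7.0, 8.0]
forecast = invert_first_difference(last_observed, predicted_diffs)
print(forecast)
```

The running sum matters: adding the last observed value to each predicted difference separately would only be correct for the first forecast step.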

You will find the full Python code and all visuals for this article in this GitLab repository.

Deep Autoregressive Models

March 15, 2022/in Artificial Intelligence, Data Mining, Data Science, Deep Learning, Machine Learning, Main Category/by Sunil Yadav

In this blog article, we will discuss deep autoregressive generative models (AGM). Autoregressive models originated in the economics and social science literature on time-series data, where observations from previous steps are used to predict the value at the current and at future time steps [SS05]. Autoregressive models can be expressed as:

$\begin{equation*} x_{t+1}= \sum_i^t \alpha_i x_{t-i} + c_i, \end{equation*}$

where the terms $\alpha$ and $c$ are constants that define the contributions of the previous samples $x_i$ to the prediction of the future value. In other words, autoregressive deep generative models are directed and fully observed models in which the outcome depends completely on the previous data points, as shown in Figure 1.

Figure 1: Autoregressive directed graph.

Let’s consider $x \sim X$, where $X$ is a set of images and each image is $n$-dimensional ($n$ pixels). Then the prediction of a new pixel depends on all the previously predicted pixels (Figure ?? shows one row of pixels from an image). Referring to our last blog, deep generative models (DGMs) aim to learn the data distribution $p_\theta(x)$ of the given training data, and by following the chain rule of probability we can express it as:

(1) $\begin{equation*} p_\theta(x) = \prod_{i=1}^n p_\theta(x_i | x_1, x_2, \dots , x_{i-1}) \end{equation*}$

The above equation models the data distribution explicitly through the pixel conditionals, which are tractable (exact likelihood estimation). The right-hand side of the above equation is fully general: it can represent any possible joint distribution of $n$ random variables. On the other hand, representing these conditionals exactly (for example as tables) can have exponential space complexity. Therefore, in autoregressive generative models (AGM), these conditionals are approximated/parameterized by neural networks.
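The chain-rule factorization in Equation 1 can be sketched in a few lines, assuming binary pixels and the simplest possible parameterization: each conditional is a logistic regression on the earlier pixels (the same form as the FVSBN listed later in this article). The weights and biases below are arbitrary illustrative numbers, not trained values.

```python
import itertools
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def log_likelihood(x, weights, biases):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1}) for a binary vector x."""
    total = 0.0
    for i, x_i in enumerate(x):
        # Conditional i only sees the earlier pixels x[0], ..., x[i-1].
        logit = biases[i] + sum(w * x_j for w, x_j in zip(weights[i], x[:i]))
        p_one = sigmoid(logit)
        total += math.log(p_one if x_i == 1 else 1.0 - p_one)
    return total


# Hypothetical parameters for n = 3 pixels; weights[i] has i entries.
weights = [[], [0.5], [-1.0, 2.0]]
biases = [0.0, -0.3, 0.1]

# Exact likelihood: the probabilities of all 2^n images sum to one.
total_probability = sum(
    math.exp(log_likelihood(x, weights, biases))
    for x in itertools.product([0, 1], repeat=3)
)
print(total_probability)
```

The normalization check at the end is what "tractable" means here: every image gets an exact probability without any intractable integral or partition function.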

Training

As AGMs are based on tractable likelihood estimation, during the training process these methods maximize the likelihood of images over the given training data $X$ and it can be expressed as:

(2) $\begin{equation*} \max_{\theta} \sum_{x\sim X} log \: p_\theta (x) = \max_{\theta} \sum_{x\sim X} \sum_{i=1}^n log \: p_\theta (x_i | x_1, x_2, \dots, x_{i-1}) \end{equation*}$

The above expression arises because DGMs try to minimize the distance between the distribution of the training data and the distribution of the generated data (please refer to our last blog). The distance between two distributions can be computed using the KL-divergence:

(3) $\begin{equation*} \min_{\theta} d_{KL}(p_d (x),p_\theta (x)) = \min_{\theta} \mathbb{E}_{x \sim p_d} \left[ log\: p_d(x) - log \: p_\theta(x) \right] \end{equation*}$

In the above equation the term $p_d(x)$ does not depend on $\theta$, therefore the whole equation reduces to Equation 2, which represents the MLE (maximum likelihood estimation) objective: learning the model parameter $\theta$ by maximizing the log-likelihood of the training images $X$. From an implementation point of view, the MLE objective can be optimized using variants of stochastic gradient descent (Adam, RMSProp, etc.) on mini-batches.
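The link between the KL-divergence and the MLE objective can be checked numerically on a small discrete example. The distributions below are made up for illustration; the point is only that the KL-divergence equals the cross-entropy minus the entropy of the data, and the entropy does not depend on the model.

```python
import math


def entropy(p):
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)


def cross_entropy(p, q):
    """Negative expected log-likelihood of model q under data distribution p."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def kl_divergence(p, q):
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


# Hypothetical data distribution and two candidate models.
p_data = [0.5, 0.3, 0.2]
candidates = {"model_a": [0.4, 0.4, 0.2], "model_b": [0.5, 0.25, 0.25]}

for name, p_model in candidates.items():
    kl = kl_divergence(p_data, p_model)
    expected_log_lik = -cross_entropy(p_data, p_model)
    print(name, round(kl, 4), round(expected_log_lik, 4))

best_by_kl = min(candidates, key=lambda n: kl_divergence(p_data, candidates[n]))
best_by_likelihood = max(candidates, key=lambda n: -cross_entropy(p_data, candidates[n]))
```

Both selection rules pick the same model, which is the discrete analogue of replacing the KL objective with the log-likelihood objective.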

Network Architectures

As we are discussing deep generative models, here, we would like to discuss the deep aspect of AGMs. The parameterization of the conditionals mentioned in Equation 1 can be realized by different kinds of network architectures. In the literature, several network architectures have been proposed to increase the receptive fields and memory, allowing more complex distributions to be learned. Here, we mention a few well-known architectures, which are widely used in deep AGMs:

Fully-visible sigmoid belief network (FVSBN): FVSBN is the simplest network, without any hidden units. Each conditional is a linear combination of the input elements followed by a sigmoid function to keep the output between 0 and 1. The positive aspects of this network are its simple design and the total number of parameters, which is quadratic and thus much smaller than exponential [GHCC15].
Neural autoregressive distribution estimator (NADE): To increase the effectiveness of FVSBN, the simplest idea is to use a neural network with one hidden layer instead of logistic regression. NADE is an alternative MLP-based parameterization and is more effective than FVSBN [LM11].
Masked autoencoder for distribution estimation (MADE): Here, a standard autoencoder neural network is modified so that it works as an efficient generative model. MADE masks the parameters to enforce the autoregressive property, where the current sample is reconstructed using only previous samples in a given ordering [GGML15].
PixelRNN/PixelCNN: These architectures were introduced by Google DeepMind in 2016 and exploit the sequential property of AGMs with recurrent and convolutional neural networks.

Figure 2: Different autoregressive architectures (image source from [LM11]).

Results using different architectures (images source https://deepgenerativemodels.github.io).

PixelRNN uses two different RNN architectures (unidirectional LSTM and bidirectional LSTM) to generate pixels horizontally and horizontally-vertically, respectively. Furthermore, it utilizes residual connections to speed up convergence and masked convolutions to condition on the different channels of the images. PixelCNN applies several convolutional layers that preserve the spatial resolution and increase the receptive field. Furthermore, masking is applied to use only the previous pixels. PixelCNN is faster to train than PixelRNN. However, the outcome quality is better with PixelRNN [vdOKK16].
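The masking idea can be illustrated with a small sketch that builds the spatial mask for a square convolution kernel, so that a pixel only depends on pixels above it and to its left in raster order. This is a simplified illustration: it ignores the additional masking across colour channels, and the function name is hypothetical.

```python
def causal_mask(kernel_size, include_center):
    """Return a kernel_size x kernel_size mask of 0/1 entries.

    Ones mark kernel positions that lie before the centre pixel in raster
    order (rows above, and columns to the left in the centre row).
    """
    center = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for row in range(kernel_size):
        for col in range(kernel_size):
            if row < center or (row == center and col < center):
                mask[row][col] = 1
    if include_center:
        mask[center][center] = 1
    return mask


# First layer: the centre pixel itself must stay hidden.
for row in causal_mask(3, include_center=False):
    print(row)
```

In the PixelRNN paper the variant without the centre is called mask A and is used in the first layer, while the variant with the centre (mask B) is used in later layers; multiplying the kernel weights element-wise by the mask enforces the autoregressive ordering.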

Summary

In this blog article, we discussed deep autoregressive models in detail along with their mathematical foundation. Furthermore, we discussed the training procedure and summarized different network architectures. We did not discuss the network architectures in detail; we will continue the discussion of PixelCNN and its variations in upcoming blogs.

References

[GGML15] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: masked autoencoder for distribution estimation. CoRR, abs/1502.03509, 2015.

[GHCC15] Zhe Gan, Ricardo Henao, David Carlson, and Lawrence Carin. Learning Deep Sigmoid Belief Networks with Data Augmentation. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 268–276, San Diego, California, USA, 09–12 May 2015. PMLR.

[LM11] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 29–37, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.

[SS05] Robert H. Shumway and David S. Stoffer. Time Series Analysis and Its Applications (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg, 2005.

[vdOKK16] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016.

How to ensure occupational safety using Deep Learning – Infographic

March 3, 2022/in Artificial Intelligence, Data Science, Deep Learning, Insights, Machine Learning, Main Category, Use Cases/by Benjamin Aunkofer

In cooperation between DATANOMIQ, my consulting company for data science, business intelligence and process mining, and Pixolution, a specialist for computer vision with deep learning, we have created an infographic (PDF) about a very special use case for companies with deep learning: How to ensure occupational safety through automatic risk detection using Deep Learning AI.

How to ensure occupational safety through automatic risk detection using Deep Learning – Infographic

Infographic as PDF Download

Four essential ideas for making reinforcement learning and dynamic programming more effective

March 1, 2022/in Artificial Intelligence, Data Science, Deep Learning, Machine Learning, Predictive Analytics/by Yasuto Tamura

This is the third article of the series My elaborate study notes on reinforcement learning.

1, Some excuses for writing another article on the same topic

In the last article I explained policy iteration and value iteration of dynamic programming (DP) because DP is the foundation of reinforcement learning (RL). And in fact this article is a kind of a duplicate of the last one. Even though I also tried my best on the last article, I would say it was for superficial understanding of how those algorithms are implemented. I think that was not enough for the following two reasons. The first reason is that what I explained in the last article was virtually just about how to follow pseudocode of those algorithms like other study materials. I tried to explain them with a simple example and some diagrams. But in practice it is not realistic to think about such diagrams all the time. Also writing down Bellman equations every time is exhausting. Thus I would like to introduce Bellman operators, powerful tools for denoting Bellman equations briefly. Bellman operators would help you learn RL at an easier and more abstract level.

The second reason is that relations of values and policies are important points in many RL algorithms. And simply, one article is not enough to realize this fact. In the last article I explained that policy iteration of DP separately and interactively updates a value and a policy. These procedures can be seen in many RL algorithms. Especially a family of algorithms named actor-critic methods uses this structure more explicitly. In these algorithms an “actor” is in charge of a policy and a “critic” is in charge of a value. Just as the “critic” gives some feedback to the “actor” and the “actor” updates his acting style, the value gives some signals to the policy for updating itself. Some people say RL algorithms are generally about how to design those “actors” and “critics.” In some cases actors can be very influential, but in other cases the other side is more powerful. In order to be more conscious about these interactive relations of policies and values, I have to dig into the ideas behind policy iteration and value iteration, but with simpler notations.

Even though this article shares a lot with the last one, without pinning down the points I am going to explain, your study of RL could be just a repetition of following pseudocode of each algorithm. Instead I would rather make more organic links between the algorithms while studying RL. This article might be tiresome to read since it mainly covers the theoretical sides of DP or RL. But I would like you to patiently read through this to more effectively learn upcoming RL algorithms, and I did my best to explain them again in graphical ways.

2, RL and planning as tree structures

Some tree structures have appeared so far in my articles, but some readers might still be confused about how to look at them. I must admit I did not explain them enough. Thus I am going to review the Bellman equation and give overall instructions on how to see my graphs. I am trying to discover effective and intuitive ways of showing DP or RL ideas. If there is something unclear or if you have any suggestions, please feel free to leave a comment or send me an email.

I got inspiration from the backup diagrams of Bellman equations introduced in the book by Barto and Sutton when I started making the graphs in this article series. The backup diagrams are basic units of tree structures in RL, and they are composed of white nodes showing states $s$ and black nodes showing actions $a$. And when an agent goes from a node $a$ to the next state $s'$, it gets a corresponding reward $r$. As I explained in the second article, a value of a state $s$ is calculated by considering all possible actions and corresponding next states $s'$, and resulting rewards $r$, starting from $s$. And the backup diagram shows the essence of how a value of $s$ is calculated.

*Please let me call this figure a backup diagram of a “Bellman-equation-like recurrence relation,” instead of the Bellman equation. The Bellman equation holds only when $v_{\pi}(s)$ is known, and $v_{\pi}(s)$ is usually calculated from the recurrence relation. We are going to see this fact in the rest of this article, making use of Bellman operators.

Let’s again take a look at the definition of $v_{\pi}(s)$ , a value of a state $s$ for a policy $\pi$ . $v_{\pi}(s)$ is defined as an expectation of a sum of upcoming rewards $R_t$ , given that the state at the time step $t$ is $s$ . (Capital letters are random variables and small letters are their realized values.)

$v_{\pi} (s)\doteq \mathbb{E}_{\pi} [ G_t | S_t =s ] =\mathbb{E}_{\pi} [ R_{t+1} + \gamma R_{t+2} + \gamma ^2 R_{t+3} + \cdots + \gamma ^{T-t -1} R_{T} |S_t =s]$

*To be exact, we need to take the limit of $T$ like $T \to \infty$ . But the number $T$ is limited in practical discussions, so please don’t care so much about very exact definitions of value functions in my article series.
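For a finite episode, the discounted sum inside the expectation can be computed directly. The sketch below evaluates $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ for one sampled sequence of rewards; the reward values and the discount factor are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite list."""
    g = 0.0
    # Work backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3} from one episode.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))
```

The backward loop already uses the recursive structure $G_t = R_{t+1} + \gamma G_{t+1}$, which is the same idea that the recursive form of the value function relies on. The value $v_{\pi}(s)$ is the expectation of this quantity over all episodes starting from $s$.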

But considering all the combinations of actions and corresponding rewards is not realistic, thus the Bellman equation is defined recursively as follows.

$v_{\pi} (s)= \mathbb{E}_{\pi} [ R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t =s ]$

But when you want to calculate $v_{\pi} (s)$ at the left side, $v_{\pi} (s)$ at the right side is supposed to be unknown, so we use the following recurrence relation.

$v_{k+1} (s)\doteq \mathbb{E}_{\pi} [ R_{t+1} + \gamma v_{k}(S_{t+1}) | S_t =s ]$

And the operation of calculating an expectation with $\mathbb{E}_{\pi}$ , namely a probabilistic sum of future rewards is defined as follows.

$v_{k+1} (s) = \mathbb{E}_{\pi} [R_{t+1} + \gamma v_k (S_{t+1}) | S_t = s] \doteq \sum_a {\pi(a|s)} \sum_{s', r} {p(s', r|s, a)[r + \gamma v_k(s')]}$

$\pi(a|s)$ are policies, and $p(s', r|s, a)$ are probabilities of transitions. Policies are probabilities of taking an action $a$ given an agent being in a state $s$. But agents cannot necessarily move exactly as their policies dictate. Some randomness or uncertainty of movements is taken into consideration, and it is modeled as probabilities of transitions. In my article, I would like you to see the equation above as a sum of $branch(s, a)$ weighted by $\pi(a|s)$, or a sum of $twig(r, s')$ weighted by $\pi(a|s), p(s' | s, a)$. “Branches” and “twigs” are terms which I coined.

*Even though especially values of $branch(s, a)$ are important when you actually implement DP, they are not explicitly defined with certain functions in most study materials on DP.
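One possible reading of the “branch” and “twig” terms as explicit functions is sketched below. The exact definitions live in the figures of this article, so the functions here are an interpretation: a twig is $r + \gamma v_k(s')$ and a branch is the transition-weighted sum of its twigs. The two-state model, the policy and the numbers are hypothetical.

```python
GAMMA = 0.5

# Hypothetical model: transitions[s][a] = list of (probability, next_state, reward).
transitions = {
    "home": {
        "stay": [(1.0, "home", 0.0)],
        "go": [(0.8, "lab", 1.0), (0.2, "home", 0.0)],
    },
    "lab": {
        "stay": [(1.0, "lab", 2.0)],
        "go": [(1.0, "home", 0.0)],
    },
}
# Hypothetical probabilistic policy pi(a|s).
policy = {s: {"stay": 0.5, "go": 0.5} for s in transitions}


def twig(r, s_next, v):
    return r + GAMMA * v[s_next]


def branch(s, a, v):
    """Sum of twigs weighted by the transition probabilities."""
    return sum(p * twig(r, s_next, v) for p, s_next, r in transitions[s][a])


def backup(s, v):
    """One Bellman-equation-like update: branches weighted by pi(a|s)."""
    return sum(policy[s][a] * branch(s, a, v) for a in policy[s])


v_k = {"home": 1.0, "lab": 2.0}
print(branch("home", "go", v_k), backup("home", v_k))
```

Keeping `branch` as its own function is convenient in practice, because the same quantity is reused when a policy is later improved by comparing actions.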

I think what makes the backup diagram confusing at first glance is that nodes of states in white have two layers, a layer of $s$ and one of $s'$. But the node $s$ is included in the nodes of $s'$. Let’s take an example of calculating the Bellman-equation-like recurrence relations with a grid map environment. The transitions on the backup diagram should be first seen as below to avoid confusion. Even though the original backup diagrams have only one root node and have three layers, in actual models of environments transitions of agents are modeled as arrows going back and forth between white and black nodes.

But in DP values of states, namely white nodes, have to be updated with older values. That is why the original backup diagrams have three layers. For example, the value $v_{k+1}(9)$ is calculated like in the figure below, using values of $v_{k}(s')$. As I explained earlier, the value of the state $9$ is a sum of $branch(s, a)$, weighted by $\pi(\rightarrow | 9), \pi(\downarrow | 9), \pi(\leftarrow | 9), \pi(\uparrow | 9)$. And I showed the weights as the strength of the purple color of the arrows. $r_a, r_b, r_c, r_d$ are the corresponding rewards of each transition. And importantly, the Bellman-equation-like operation, which is a part of DP, is conducted inside the agent. The agent does not have to actually move, and that is what planning is all about.

And DP, or more exactly policy evaluation, calculates the expectation over all the states, repeatedly. An important fact is that arrows in the backup diagram are pointing backward compared to the direction of value functions being updated, from $v_{k}(s)$ to $v_{k+1}(s)$. I tried to show the idea that values $v_{k}(s)$ are backed up to calculate $v_{k+1}(s)$. In my article series, with the right side of the figure below, I make it a rule to show the ideas that a model of an environment is known and it is updated recursively.

3, Types of policies

As I said in the first article, the ultimate purpose of DP or RL is finding the optimal policies. With optimal policies agents are the most likely to maximize rewards they get in environments. And policies $\pi$ determine the values of states as value functions $v_{\pi}(s)$. Or policies can be obtained from value functions. This structure of interactively updating values and policies is called generalized policy iteration (GPI) in the book by Barto and Sutton.

Source: Richard S. Sutton, Andrew G. Barto, “Reinforcement Learning: An Introduction,” MIT Press, (2018)

However I have been using the term “a policy” without exactly defining it. There are several types of policies, and distinguishing them is more or less important in the next sections. But I would not like you to think too much about that. In conclusion, only very limited types of policies are mainly discussed in RL. Only $\Pi ^{\text{S}}, \Pi ^{\text{SD}}$ in the figure below are of interest when you learn RL as a beginner. I am going to explain what each set of policies means one by one.

In fact we have been discussing a set of policies $\Pi ^{\text{S}}$, which means probabilistic Markov policies. Remember that in the first article I explained Markov decision processes can be described like diagrams of daily routines. For example, the diagrams below are my daily routines. The indexes $t$ denote days. In each of the states “Home,” “Lab,” and “Starbucks,” I take an action to move to another state. The numbers in black are probabilities of taking the actions, and those in orange are rewards of taking the actions. I also explained that the ultimate purpose of planning with DP is to find the optimal policy in this state transition diagram.

Before explaining each type of sequences of policies, let me formulate probabilistic Markov policies at first. A set of probabilistic Markov policies is defined as follows.
$\Pi \doteq \biggl\{ \pi : \mathcal{A}\times\mathcal{S} \rightarrow [0, 1]: \sum_{a \in \mathcal{A}}{\pi (a|s) =1, \forall s \in \mathcal{S} } \biggr\}$
This means $\pi (a|s)$ maps any combinations of an action $a\in\mathcal{A}$ and a state $s \in\mathcal{S}$ to a probability. The diagram above means you choose a policy $\pi$ from the set $\Pi$ , and you use the policy every time step $t$ , I mean every day. A repetitive sequence of the same probabilistic Markov policy $\pi$ is defined as $\boldsymbol{\pi}^{\text{s}} \doteq \{\pi, \pi, \dots \} \in \boldsymbol{\Pi} ^{\text{S}}$ . And a set of such stationary Markov policy sequences is denoted as $\boldsymbol{\Pi} ^{\text{S}}$ .

*As I formulated in the last articles, policies are different from probabilities of transitions. Even if you take an action probabilistically, the action cannot necessarily be finished. Thus probabilities of transitions depend on combinations of policies and the agents or the environments.

But when I just want to focus on work like a robot, I give up living my life. I abandon efforts of giving even the slightest variations to my life, and I just deterministically take next actions every day. In this case, we can say the policies are stationary and deterministic. The set of such policies is defined as below. $\pi ^{\text{d}}$ are called deterministic policies. $\Pi ^\text{d} \doteq \bigl\{ \pi ^\text{d} : \mathcal{S}\rightarrow \mathcal{A} \bigr\}$
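The difference between a probabilistic and a deterministic Markov policy can be made concrete with plain dictionaries. In the sketch below a probabilistic policy stores $\pi(a|s)$ for every state, and a deterministic policy simply maps each state to one action. The action names and probabilities are hypothetical and are not the numbers from the diagrams.

```python
def is_valid_policy(policy, tol=1e-9):
    """Check that every pi(.|s) is a probability distribution over actions."""
    return all(
        all(0.0 <= p <= 1.0 for p in action_probs.values())
        and abs(sum(action_probs.values()) - 1.0) <= tol
        for action_probs in policy.values()
    )


def to_stochastic(deterministic_policy, actions):
    """Write a deterministic policy (state -> action) as one-hot probabilities."""
    return {
        s: {a: 1.0 if a == chosen else 0.0 for a in actions}
        for s, chosen in deterministic_policy.items()
    }


actions = ["go_home", "go_lab", "go_starbucks"]
stochastic = {
    "Home": {"go_home": 0.1, "go_lab": 0.7, "go_starbucks": 0.2},
    "Lab": {"go_home": 0.6, "go_lab": 0.1, "go_starbucks": 0.3},
    "Starbucks": {"go_home": 0.5, "go_lab": 0.5, "go_starbucks": 0.0},
}
deterministic = {"Home": "go_lab", "Lab": "go_home", "Starbucks": "go_lab"}

print(is_valid_policy(stochastic))
print(to_stochastic(deterministic, actions)["Home"])
```

Seen this way, a deterministic policy is just a special case of a probabilistic one in which all the probability mass sits on a single action per state.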

I think it is normal that policies change from day to day, even if people have only the options of “Home,” “Lab,” or “Starbucks.” These cases are normal Markov policies, and you choose a policy $\pi$ from $\Pi$ every time step.

And the resulting sequences of policies and the set of the sequences are defined as $\boldsymbol{\pi}^{\text{m}} \doteq \{\pi_0, \pi_1, \dots \} \in \boldsymbol{\Pi} ^{\text{M}}, \quad \pi_t \in \Pi$ .

In the real world, an assumption of a Markov decision process is quite unrealistic because your strategies constantly change depending on what you have done or gained so far. Possibilities of going to a Starbucks depend on what you have done in the week so far. You might order a cup of Frappuccino as a little treat after your exhausting working days. There might be some communications with clerks on what you order then. And such experiences would affect your behaviors of going to Starbucks again. Such general and realistic policies are called history-dependent policies.

*Going to Starbucks every day like a Markov decision process and deterministically ordering a cup of hot black coffee is supposed to be unrealistic. Even if clerks start heating a mug as soon as I enter the shop.

In history-dependent cases, your policies depend on your states, actions, and rewards so far. In this case you take actions based on history-dependent policies $\pi _{t}^{\text{h}}$ . However as I said, only $\Pi ^{\text{S}}, \Pi ^{\text{SD}}$ are important in my articles. And history-dependent policies are discussed only in partially observable Markov decision process (POMDP), which this article series is not going to cover. Thus you have only to take a brief look at how history-dependent ones are defined.

History-dependent policies are the most general type of policies. In order to formulate history-dependent policies, we first have to formulate histories. Histories $h_t \in \mathcal{H}_t$ in the context of DP or RL are defined as follows.

$h_t \doteq \{s_0, a_0, r_1, \dots , s_{t-1}, a_{t-1}, r_{t}, s_t\}$

Given the histories which I have defined, a history dependent policy is defined as follows.

$\pi_{t}^{\text{h}}(a|h_t) \doteq \text{Pr}(A_t=a | H_t = h_t)$

This means a probability of taking an action $a$ given a history $h_t$ . It might be more understandable with the graphical model below, which I showed also in the first article. In the graphical model, $H_t$ is a random variable, and $h_t$ is its realized value.

A set of history-dependent policies is defined as follows.

$\Pi _{t}^{\text{h}} \doteq \biggl\{ \pi _{t}^{h} : \mathcal{A}\times\mathcal{H}_t \rightarrow [0, 1]: \sum_{a \in \mathcal{A}}{\pi_{t}^{\text{h}} (a|h_{t}) =1 } \biggr\}$

And a sequence of history-dependent policies and the set of such sequences are defined as $\boldsymbol{\pi}^{\text{h}} \doteq \{\pi^{\text{h}}_0, \pi^{\text{h}}_1, \dots \} \in \boldsymbol{\Pi} ^{\text{H}}, \quad \pi_{t}^{\text{h}} \in \Pi_{t}^{\text{h}}$ .

In fact I have not defined the optimal value function $v_{\ast}(s)$ or $\pi_{\ast}$ in my article series yet. I must admit it was not good to discuss DP without even defining the important ideas. But now that we have learnt types of policies, it should be less confusing to introduce their more precise definitions now. The optimal value function $v_{\ast}: \mathcal{S} \mapsto \mathbb{R}$ is defined as the maximum value functions for all states $s$ , with respect to any types of sequences of policies $\boldsymbol{\pi}$ .

$v_{\ast}(s) \doteq \max_{\boldsymbol{\pi}\in \boldsymbol{\Pi}^{\text{H}}}{v_{\boldsymbol{\pi}}(s)}, \quad \forall s \in \mathcal{S}$

And the optimal policy is defined as the policy which satisfies the equation below.

$v_{\ast}(s) = v_{\pi ^{\ast}}(s), \quad \forall s \in \mathcal{S}$

The optimal value function is optimal with respect to all the types of sequences of policies, as you can see from the definition. However in fact, it is known that the optimal policy is a deterministic Markov policy $\pi ^\text{d} \in \Pi ^\text{d}$ . That means, in the example graphical models I displayed, you just have to deterministically go back and forth between the lab and the home in order to maximize value function, never stopping by at a Starbucks. Also you do not have to change your plans depending on days.

And when all the values of the states are maximized, you can easily calculate the optimal deterministic policy of your everyday routine. Thus in DP, you first need to maximize the values of the states. I am going to explain this fact of DP more precisely in the next section. Combined with some other important mathematical features of DP, you will have a clearer vision of what DP is doing.
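Reading off a deterministic policy from maximized state values can be sketched as follows. The two-state model is hypothetical (it is not the Home/Lab/Starbucks diagram with its actual numbers), its transitions are deterministic for simplicity, and the optimal values were worked out by hand for this toy model with $\gamma = 0.5$.

```python
GAMMA = 0.5

# Hypothetical deterministic model: model[s][a] = (next_state, reward).
model = {
    "home": {"stay": ("home", 0.0), "go": ("lab", 1.0)},
    "lab": {"stay": ("lab", 2.0), "go": ("home", 0.0)},
}

# Optimal values of this toy model, computed by hand:
# v*(lab) = 2 / (1 - 0.5) = 4 and v*(home) = 1 + 0.5 * 4 = 3.
v_star = {"home": 3.0, "lab": 4.0}


def action_value(s, a, v):
    s_next, r = model[s][a]
    return r + GAMMA * v[s_next]


def greedy_policy(v):
    """Pick, in every state, the action with the largest backed-up value."""
    return {s: max(model[s], key=lambda a: action_value(s, a, v)) for s in model}


pi_star = greedy_policy(v_star)
print(pi_star)
```

In this toy model the resulting routine is to go to the lab and stay there, and the policy never needs to look at anything other than the current state.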

*I might have to precisely explain how $v_{\boldsymbol{\pi}}(s)$ is defined. But to make things easier for now, let me skip more precise formulations. Value functions are defined as expectations of rewards with respect to a single policy or a sequence of policies. You have only to keep it in mind that $v_{\boldsymbol{\pi}}(s)$ is a value function resulting from taking actions based on $\boldsymbol{\pi}$. And $v_{\pi}(s)$, which we have been mainly discussing, is a value function based on only a single policy $\pi$.

*Please keep it in mind that these diagrams are not anything like exaggeratedly simplified models for explaining RL. That is my life.

4, Key components of DP

*Even though notations in this article series are based on the book by Barto and Sutton, the discussions in this section are based on a Japanese book named “Machine Learning Professional Series: Reinforcement Learning” by Tetsurou Morimura, which I call “the whale book.” There is a slight difference in how they calculate Bellman equations. In the book by Barto and Sutton, expectations are calculated also with respect to rewards $r$, but not in the whale book. I think discussions in the whale book can be extended to the cases in the book by Barto and Sutton, but just in case please bear that in mind.

In order to make organic links between the RL algorithms you are going to encounter, I think you should realize that the DP algorithms you have learned in the last article are composed of some essential ideas about DP. As I stressed in the first article, RL is equal to solving planning problems, including DP, by sampling data through trial-and-error-like behaviors of agents. Thus in other words, you approximate DP-like calculations with batch data or online data. In order to see how to approximate such DP-like calculations, you have to know more about features of those calculations. Those features are derived from some mathematical propositions about DP. But just introducing them one by one would be confusing, so I tried extracting some essences. And the figures below demonstrate the ideas.

The figures above express the following facts about DP:

DP is a repetition of Bellman-equation-like operations, and they can be simply denoted with Bellman operators $\mathsf{B}_{\pi}$ or $\mathsf{B}_{\ast}$ .
The value function for a policy $\pi$ is calculated by solving a Bellman equation, but in practice you approximately solve it by repeatedly using Bellman operators.
There exists an optimal policy $\pi ^{\ast} \in \Pi ^{\text{d}}$ , which is deterministic. And it is an optimal policy if and only if it satisfies the Bellman expectation equation $v^{\ast}(s) = (\mathsf{B}_{\pi ^{\ast}} v^{\ast})(s), \quad \forall s \in \mathcal{S}$ , with the optimal value function $v^{\ast}(s)$ .
With a better deterministic policy, you get a better value function. And eventually both the value function and the policy become optimal.

Let’s take a close look at what each of them means.

(1) Bellman operator

In the last article, I explained the Bellman equation and recurrence relations derived from it. And they are the basic ideas leading to various RL algorithms. The Bellman equation itself is not so complicated, and I showed its derivation in the last article. You just have to be careful about variables in calculation of expectations. However writing the equations or recurrence relations every time would be tiresome and confusing. And in practice we need to apply the recurrence relation many times. In order to avoid writing down the Bellman equation every time, let me introduce a powerful notation for simplifying the calculations: I am going to discuss RL making use of Bellman operators from now on.

First of all, a Bellman expectation operator $\mathsf{B}_{\pi}: \mathbb{R}^{\mathcal{S}} \rightarrow \mathbb{R}^{\mathcal{S}}$ , or rather an application of a Bellman expectation operator on any state functions $v: \mathcal{S}\rightarrow \mathbb{R}$ is defined as below.

$(\mathsf{B}_{\pi} (v))(s) \doteq \sum_{a}{\pi (a|s)} \sum_{s'}{p(s'| s, a) \biggl[r + \gamma v (s') \biggr]}, \quad \forall s \in \mathcal{S}$

For simplicity, I am going to denote the left side of the equation as $(\mathsf{B}_{\pi} (v)) (s)=\mathsf{B}_{\pi} (v) \doteq \mathsf{B}_{\pi} v$ . In the last article I explained that when $v_{0}(s)$ is an arbitrarily initialized value function, a sequence of value functions $(v_{0}(s), v_{1}(s), \dots, v_{k}(s), \dots)$ converges to $v_{\pi}(s)$ for a fixed probabilistic policy $\pi$ , by repeatedly applying the recurrence relation below.

$v_{k+1}(s) = \sum_{a}{\pi (a|s)} \sum_{s'}{p(s'| s, a) \biggl[r + \gamma v_{k} (s') \biggr]}$

With the Bellman expectation operator, the recurrence relation above is written as follows.

$v_{k+1} = \mathsf{B}_{\pi} v_{k}$

Thus $v_{k}$ is obtained by applying $\mathsf{B}_{\pi}$ to $v_{0}$ $k$ times in total. Such operation is denoted as follows.

$v_{k} = (\mathsf{B}_{\pi}\dots (\mathsf{B}_{\pi} v_{0})\dots) \doteq \mathsf{B}_{\pi} \dots \mathsf{B}_{\pi} v_{0} \doteq \mathsf{B}^k_{\pi} v_{0}$

As I have just mentioned, $\mathsf{B}^k_{\pi} v_{0}$ converges to $v_{\pi}(s)$ , thus the following equation holds.

$\lim_{k \rightarrow \infty} \mathsf{B}^k_{\pi} v_{0} = v_{\pi}(s)$
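To make the notation concrete, here is a minimal sketch of a Bellman expectation operator in plain Python. The two-state MDP, its rewards, the discount factor and the uniformly random policy are all made up for illustration (they are not the example from this series), and the names `P`, `bellman_expectation` and `apply_k_times` are my own assumptions.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9

# Uniformly random policy: policy[s][a] = pi(a|s).
policy = {s: {a: 0.5 for a in P[s]} for s in P}


def bellman_expectation(v, policy, P, gamma):
    """Return the state function B_pi v as a new dict."""
    return {
        s: sum(
            policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }


def apply_k_times(v0, k):
    """Compute B_pi^k v0 by applying the operator k times."""
    v = dict(v0)
    for _ in range(k):
        v = bellman_expectation(v, policy, P, GAMMA)
    return v


# Two very different initial value functions end up at (numerically) the same v_pi.
v_from_zero = apply_k_times({0: 0.0, 1: 0.0}, 500)
v_from_other = apply_k_times({0: 100.0, 1: -50.0}, 500)
print(v_from_zero)
print(v_from_other)
```

In this toy setting both printed dictionaries agree up to floating-point error, which is the statement $\lim_{k \rightarrow \infty} \mathsf{B}^k_{\pi} v_{0} = v_{\pi}$ in code form.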

I have to admit I am merely talking about how to change notations of the discussions in the last article, but introducing Bellman operators makes it much easier to learn or explain DP or RL as the figure below shows.

In the same way, a Bellman optimality operator $\mathsf{B}_{\ast}: \mathbb{R}^{\mathcal{S}} \rightarrow \mathbb{R}^{\mathcal{S}}$ is defined as follows.

$(\mathsf{B}_{\ast} v)(s) \doteq \max_{a} \sum_{s'}{p(s' | s, a) \biggl[r + \gamma v(s') \biggr]}, \quad \forall s \in \mathcal{S}$

The notation with a Bellman optimality operator can also be simplified, writing $(\mathsf{B}_{\ast} v)(s)$ as $\mathsf{B}_{\ast} v$ . With a Bellman optimality operator, you get the recurrence relation $v_{k+1} = \mathsf{B}_{\ast} v_{k}$ . Multiple applications of the Bellman optimality operator can be written down as below.

$v_{k} = (\mathsf{B}_{\ast}\dots (\mathsf{B}_{\ast} v_{0})\dots) \doteq \mathsf{B}_{\ast} \dots \mathsf{B}_{\ast} v_{0} \doteq \mathsf{B}^k_{\ast} v_{0}$

Please keep in mind that this operator does not depend on any policy $\pi$ . An important fact is that, under repeated application of $\mathsf{B}_{\ast}$ , any initial value function $v_0$ converges to the optimal value function $v_{\ast}$ .

$\lim_{k \rightarrow \infty} \mathsf{B}^k_{\ast} v_{0} = v_{\ast}(s)$
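The same kind of sketch works for the Bellman optimality operator. Again, the two-state MDP and all of its numbers are hypothetical, and `bellman_optimality` is just a name I chose for this illustration. Note that no policy appears anywhere in the code.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9


def bellman_optimality(v, P, gamma):
    """Return the state function B_* v as a new dict (max over actions)."""
    return {
        s: max(
            sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s]
        )
        for s in P
    }


# Repeatedly applying B_* to an arbitrary v_0 is value iteration in operator form.
v = {s: 0.0 for s in P}
for _ in range(500):
    v = bellman_optimality(v, P, GAMMA)
print(v)
```

For this toy MDP the optimal values can be checked by hand: always taking action 1 in state 1 yields $2 / (1 - 0.9) = 20$ , and state 0 then satisfies $v = 0.8(1 + 0.9 \cdot 20) + 0.2 \cdot 0.9 v$ , i.e. roughly $18.54$ . The loop above should land on approximately these numbers.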

Thus any initial value function converges to the optimal value function by repeatedly applying the Bellman optimality operator. This is almost the same as the value iteration algorithm, which I explained in the last article. The notation of value iteration can also be simplified by introducing the Bellman optimality operator, like in the figure below.

Again, I would like you to pay attention to how value iteration works. By definition, the optimal value function $v_{\ast}(s)$ is the maximum over all sequences of policies $\boldsymbol{\pi}$ . However, the optimal value function $v_{\ast}(s)$ can be obtained with a single Bellman optimality operator $\mathsf{B}_{\ast}$ , without ever caring about policies. Obtaining the optimal value function is crucial in DP problems, as I explain in the next topic. And at least one way to do that is guaranteed by the use of $\mathsf{B}_{\ast}$ .

*We have seen the case of applying the same Bellman expectation operator for a fixed policy $\pi$ , but you can also apply Bellman operators of different policies that vary from time step to time step. To be more concrete, assume that you have a sequence of Markov policies $\boldsymbol{\pi} = \{ \pi_{0},\pi_{1}, \dots, \pi_{k-1} \}\in \boldsymbol{\Pi} ^{\text{M}}$ . If you apply the Bellman operators of these policies one by one in the order $\pi_{k-1}, \pi_{k-2}, \dots, \pi_{0}$ to a state function $v$ , the resulting state function is calculated as below.

$\mathsf{B}_{\pi_0}(\mathsf{B}_{\pi_1}\dots (\mathsf{B}_{\pi_{k-1}} v)\dots) \doteq \mathsf{B}_{\pi_0}\mathsf{B}_{\pi_1} \dots \mathsf{B}_{\pi_{k-1}} v \doteq \mathsf{B}^k_{\boldsymbol{\pi}} v$
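The order of application is easy to get wrong, so here is a small sketch of the composition above. The MDP and the two deterministic policies `always_0` and `always_1` are hypothetical, and `apply_policy_sequence` is a helper name I made up. The innermost operator $\mathsf{B}_{\pi_{k-1}}$ acts first and $\mathsf{B}_{\pi_0}$ acts last.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9

# Two deterministic policies written as probability tables policy[s][a] = pi(a|s).
always_0 = {s: {0: 1.0, 1: 0.0} for s in P}
always_1 = {s: {0: 0.0, 1: 1.0} for s in P}


def bellman_expectation(v, policy, P, gamma):
    """Return the state function B_pi v as a new dict."""
    return {
        s: sum(
            policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }


def apply_policy_sequence(v, policies, P, gamma):
    """Compute B_{pi_0} B_{pi_1} ... B_{pi_{k-1}} v for policies = [pi_0, ..., pi_{k-1}]."""
    for policy in reversed(policies):  # B_{pi_{k-1}} is applied first, B_{pi_0} last
        v = bellman_expectation(v, policy, P, gamma)
    return v


v0 = {s: 0.0 for s in P}
result_a = apply_policy_sequence(v0, [always_0, always_1], P, GAMMA)
result_b = apply_policy_sequence(v0, [always_1, always_0], P, GAMMA)
print(result_a)
print(result_b)
```

In this toy example the two orders give different state functions (roughly $0.72$ for both states in the first case, and roughly $0.8$ and $2.0$ in the second), so the operators of different policies do not commute in general.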

When $\boldsymbol{\pi} = \{ \pi_{0},\pi_{1}, \dots, \pi_{k-1} \}$ , we can also discuss convergence of $v_{\boldsymbol{\pi}}$ , but that is just confusing. Please let me know if you are interested.

(2) Policy evaluation

Policy evaluation is in short calculating $v_{\pi}$ , the value function for a policy $\pi$ . And in theory it can be calculated by solving a Bellman expectation equation, which I have already introduced.

$v(s) = \sum_{a}{\pi (a|s)} \sum_{s'}{p(s'| s, a) \biggl[r + \gamma v (s') \biggr]}$

Using a Bellman operator, which I introduced in the last topic, the equation above can be written as $v(s) = \mathsf{B}_{\pi} v(s)$ . But whichever notation is used, the equation holds when the value function $v(s)$ is $v_{\pi}(s)$ . You have already seen the major way of calculating $v_{\pi}$ in (1), and also in the last article. You only have to repeatedly apply the same Bellman expectation operator $\mathsf{B}_{\pi}$ to any initial value function $v_{initial}(s)$ .

This process can be seen in this way: any initial value function $v_{initial}(s)$ converges little by little to $v_{\pi}(s)$ as the same Bellman expectation operator $\mathsf{B}_{\pi}$ is applied. And once $v_{initial}(s)$ has converged to $v_{\pi}(s)$ , the value function does not change anymore because it already satisfies the Bellman expectation equation $v(s) = \mathsf{B}_{\pi} v(s)$ . In other words $v_{\pi}(s) = \mathsf{B}^k_{\pi} v_{\pi}(s)$ for any $k$ , and $v_{\pi}(s)$ is called the fixed point of $\mathsf{B}_{\pi}$ . The figure below is an image of how any initial value function converges to the fixed point unique to a certain policy $\pi$ . The Bellman optimality operator $\mathsf{B}_{\ast}$ also has a fixed point, because any initial value function converges to $v_{\ast}(s)$ by repeatedly applying $\mathsf{B}_{\ast}$ .
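One way to see the "little by little" convergence numerically is to track the largest difference between the current value function and the fixed point. The sketch below does this for a hypothetical two-state MDP and a uniformly random policy (all numbers are made up). It relies on a standard fact that this article does not derive: each application of $\mathsf{B}_{\pi}$ shrinks the maximum difference to $v_{\pi}$ by at least a factor of $\gamma$ .

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9
policy = {s: {a: 0.5 for a in P[s]} for s in P}  # uniformly random policy


def bellman_expectation(v, policy, P, gamma):
    """Return the state function B_pi v as a new dict."""
    return {
        s: sum(
            policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }


def max_distance(u, w):
    """Largest absolute difference between two state functions."""
    return max(abs(u[s] - w[s]) for s in u)


# Approximate the fixed point v_pi with many applications of B_pi.
v_pi = {s: 0.0 for s in P}
for _ in range(2000):
    v_pi = bellman_expectation(v_pi, policy, P, GAMMA)

# Start somewhere else and record the distance to v_pi before each application.
v = {0: 100.0, 1: -50.0}
distances = []
for _ in range(10):
    distances.append(max_distance(v, v_pi))
    v = bellman_expectation(v, policy, P, GAMMA)
print(distances)
```

The printed distances decrease monotonically, and applying $\mathsf{B}_{\pi}$ to `v_pi` itself leaves it unchanged up to rounding, which is what being a fixed point means.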

I am actually just restating the facts from topic (1) in another way. But I would like you to keep in mind that the fixed point of $\mathsf{B}_{\pi}$ is more of a “local” fixed point, whereas the fixed point of $\mathsf{B}_{\ast}$ is more like a “global” one. Ultimately the global one is what matters, and the fixed point $v_{\ast}$ can be reached directly only with the Bellman optimality operator $\mathsf{B}_{\ast}$ . But you can also start by finding local fixed points, and it is known that the local fixed points also converge to the global one. In fact, finding local fixed points first corresponds to policy iteration, and reaching the global one directly corresponds to value iteration. At any rate, the goal for now is to find the optimal value function $v_{\ast}$ . Once the value function is optimal, the optimal policy can be automatically obtained, and I am going to explain why in the next two topics.

(3) Existence of the optimal policy

In the first place, does the optimal policy really exist? The answer is yes, and moreover it is a stationary and deterministic policy $\pi ^{\text{d}} \in \Pi^{\text{SD}}$ . You can also judge whether a policy is optimal with the Bellman expectation equation below.

$v_{\ast}(s) = (\mathsf{B}_{\pi^{\ast} } v_{\ast})(s), \quad \forall s \in \mathcal{S}$

In other words, the optimal value function $v_{\ast}(s)$ has to be already obtained to judge if a policy is optimal. And the resulting optimal policy is calculated as follows.

$\pi^{\text{d}}_{\ast}(s) = \text{argmax}_{a\in \mathcal{A}} \sum_{s'}{p(s' | s, a) \biggl[r + \gamma v_{\ast}(s') \biggr]}, \quad \forall s \in \mathcal{S}$

Let’s take an example of the state transition diagram in the last section. I added some transitions from nodes to themselves and corresponding scores. And all values of the states are initialized as $v_{init.}$ . After some calculations, $v_{init.}$ is optimized to $v_{\ast}$ . And finally the optimal policy can be obtained from the equation I have just mentioned. And the conclusion is “Go to the lab wherever you are to maximize score.”

The calculation above finds an action $a$ which maximizes $b(s, a)\doteq\sum_{s'}{p(s' | s, a) \biggl[r + \gamma v_{\ast}(s') \biggr]} = r + \gamma \sum_{s'}{p(s' | s, a) v_{\ast}(s') }$ . Let me call $b(s, a)$ “a value of a branch”; finding the optimal deterministic policy is equal to choosing the maximum branch for every $s$ . A branch corresponds to a state-action pair $(s, a)$ together with all the successor states $s'$ .

*We can comprehend applications of Bellman expectation operators as probabilistically reweighting branches with policies $\pi(a|s)$ .

*The states $s$ and $s'$ are basically the same set of states. They only differ in the index used to refer to them. That might be a confusing point when trying to understand Bellman equations.
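The branch values $b(s, a)$ and the argmax over them can be sketched as follows. The two-state MDP is hypothetical (it is not the Starbucks-lab-home diagram), and `branch_value` is a helper name I introduce only for this illustration. The last part also shows the reweighting view: applying $\mathsf{B}_{\pi}$ amounts to averaging the branches of each state with weights $\pi(a|s)$ .

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9


def branch_value(s, a, v, P, gamma):
    """b(s, a) = sum over s' of p(s'|s,a) * [r + gamma * v(s')]."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


# Approximate v_* by repeatedly taking the maximum branch in every state.
v_star = {s: 0.0 for s in P}
for _ in range(500):
    v_star = {s: max(branch_value(s, a, v_star, P, GAMMA) for a in P[s]) for s in P}

# Branch values under v_*, and the deterministic policy that picks the largest one.
branches = {(s, a): branch_value(s, a, v_star, P, GAMMA) for s in P for a in P[s]}
greedy_policy = {s: max(P[s], key=lambda a: branches[(s, a)]) for s in P}
print(branches)
print(greedy_policy)

# Reweighting the same branches with a uniformly random policy gives (B_pi v_*)(s).
policy = {s: {a: 0.5 for a in P[s]} for s in P}
reweighted = {s: sum(policy[s][a] * branches[(s, a)] for a in P[s]) for s in P}
print(reweighted)
```

Because an average of branches can never exceed the largest branch, the reweighted values are at most $v_{\ast}(s)$ in every state, and they are equal only when the policy puts all of its weight on maximizing actions.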

Let’s see how values actually converge to the optimal values and how the branches $b(s, a)$ change. I implemented value iteration for the Starbucks-lab-home transition diagram and visualized it with Graphviz. I initialized all the state values as $0$ , and after some iterations they converged to the optimal values. The number in each node is the value of that state, and the number next to each edge is the value of the corresponding branch $b(s, a)$ . After you get the optimal values, if you choose the direction with the maximum branch at each state, you get the optimal deterministic policy. And that means “Just go to the lab, not Starbucks.”

*Discussing and visualizing “branches” of Bellman equations is not common in other study materials. But I just thought it would be better to see how they change.

(4) Policy improvement

Policy improvement means a very simple fact: in the policy iteration algorithm, with a better policy, you get a better value function. That is all. In policy iteration, a policy is regarded as optimal once it is no longer updated. But as far as I could see so far, there is one confusing fact. Even after a policy converges, value functions can still be updated. But by definition, an optimal policy is determined by the optimal value function. Such behavior can be seen in some DP implementations, including the grid map implementation I introduced in the last article.

Thus I am not sure if it is legitimate to say that a policy is optimal before the optimal value function has been obtained. At any rate, this is my “elaborate study note,” so I would conversely like to ask for help from more professional readers if they come across my series. Please forgive me for moving on to the next article without making things clear.
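The observation above can at least be reproduced in a toy setting. The sketch below runs value iteration on a hypothetical two-state MDP (made-up numbers, not the grid map from the last article) and records, after every sweep, the greedy policy read off from the current values and how much the values changed. It only illustrates the phenomenon and does not settle the question.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9


def branch_value(s, a, v, P, gamma):
    """b(s, a) = sum over s' of p(s'|s,a) * [r + gamma * v(s')]."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


v = {s: 0.0 for s in P}
history = []  # (sweep index, greedy policy, largest value change) per sweep
for k in range(1, 1001):
    new_v = {s: max(branch_value(s, a, v, P, GAMMA) for a in P[s]) for s in P}
    greedy = {s: max(P[s], key=lambda a: branch_value(s, a, new_v, P, GAMMA)) for s in P}
    change = max(abs(new_v[s] - v[s]) for s in P)
    history.append((k, greedy, change))
    v = new_v
    if change < 1e-6:  # stop once the values have numerically converged
        break

final_policy = history[-1][1]
# First sweep from which the greedy policy never changes again.
first_stable = next(
    k for k, _, _ in history
    if all(g == final_policy for k2, g, _ in history if k2 >= k)
)
print("greedy policy stable from sweep", first_stable)
print("values converged after sweep", len(history))
```

In this toy MDP the greedy policy stops changing long before the values stop changing, so "the policy has converged" and "the values have converged" are indeed two different events.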

4, Viewing DP algorithms in a more simple and abstract way

We have covered the four important topics for a better understanding of DP algorithms. Making use of these ideas, the pseudocode of the DP algorithms which I introduced in the last article can be rewritten in a simpler and more abstract way. Rather than following the pseudocode of DP algorithms line by line, I would like you to see them this way: policy iteration is a repetition of finding the fixed point of a Bellman operator $\mathsf{B}_{\pi}$ , which is a local fixed point, and updating the policy. Even if the policy has converged, the values have not necessarily converged to the optimal values.

When it comes to value iteration: value iteration is finding the fixed point of $\mathsf{B}_{\ast}$ , which is global, and then reading off the deterministic optimal policy.
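The operator view of policy iteration described above can be written down in a few lines. This is a minimal sketch on a hypothetical two-state MDP with made-up numbers. `evaluate` approximates the local fixed point of $\mathsf{B}_{\pi}$ with a fixed number of sweeps (an arbitrary choice of mine), and `improve` picks the maximum branch in every state.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9


def branch_value(s, a, v, P, gamma):
    """b(s, a) = sum over s' of p(s'|s,a) * [r + gamma * v(s')]."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def evaluate(policy, P, gamma, sweeps=2000):
    """Approximate the fixed point of B_pi for a deterministic policy {s: a}."""
    v = {s: 0.0 for s in P}
    for _ in range(sweeps):
        v = {s: branch_value(s, policy[s], v, P, gamma) for s in P}
    return v


def improve(v, P, gamma):
    """Return the deterministic policy that takes the maximum branch in each state."""
    return {s: max(P[s], key=lambda a: branch_value(s, a, v, P, gamma)) for s in P}


policy = {s: 0 for s in P}  # start from "always take action 0"
rounds = 0
while True:
    rounds += 1
    v = evaluate(policy, P, GAMMA)     # local fixed point of B_pi
    new_policy = improve(v, P, GAMMA)  # policy update
    if new_policy == policy:           # policy no longer changes
        break
    policy = new_policy
print(policy, v, rounds)
```

For this toy MDP the loop ends with the same values that repeated application of $\mathsf{B}_{\ast}$ would give (about $18.54$ and $20$ ), which is the sense in which the local fixed points lead to the global one.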

I have written about DP in as many as two articles. But I would say that was inevitable for laying a more or less solid foundation for learning RL. The last article was too superficial and ordinary, but on the other hand this one is too abstract to start with. Now that I have explained the essential theoretical parts of DP, I can finally move on to topics unique to RL. We have been considering planning problems, where the model of the environment is known, but in RL the model is something agents have to estimate by “trial and error.” The term “trial and error” might have been too abstract for you when you read about RL so far. But after reading my articles, you can instead say that it is a matter of how to approximate Bellman operators with batch or online data collected by agents, rather than ambiguously saying “trial and error.” In the next article, I am going to talk about “temporal differences,” which make RL different from other fields and can be used as data samples to approximate Bellman operators.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Deep Generative Modelling

February 19, 2022/in Artificial Intelligence, Data Science, Deep Learning/by Sunil Yadav

Nowadays, we see several real-world applications of synthetically generated data (see Figure 1), for example solving the data imbalance problem in classification tasks, performing style transfer for artistic images, generating protein structure for scientific analysis, etc. In this blog, we are going to explore synthetic data generation using deep neural networks with the mathematical background.


Figure 1 – Synthetic images generated by deep generative models

What is Deep Generative modelling?

Deep generative modelling (DGM) falls into the category of unsupervised learning and addresses the challenging task of estimating the distribution of the given data. To approximate the underlying distribution of complicated, high-dimensional data, deep generative models (DGMs) utilize various deep neural network architectures, e.g., CNNs and RNNs. Furthermore, trained DGMs generate samples which have approximately the same distribution as the training data. In other words, if the given training data has the distribution function $p_d(x)$ , then DGMs learn to generate samples from a distribution $p_\theta(x)$ such that $p_d(x) \approx p_\theta(x)$ .


Figure 2 – DGMs pipeline

Figure 2 represents the general idea of deep generative modelling, where DGMs generate data samples with distribution $p_\theta(x)$ , which is quite similar to the data distribution of the training samples $p_d(x)$ .

Why is Deep Generative modelling important?

DGMs are mainly used to generate synthetic data, which can be used in different applications. The following are a few examples:

To avoid data imbalance problems in real-life classification problems.
Text-to-image, image-to-image conversion, image inpainting, super-resolution.
Speech and music synthesis.
Computer graphics: rendering, texture generation, character movement, fluid dynamics simulation.

How do DGMs work?

The above figure represents the complete workflow of DGMs, but it is not very precise because it combines both the training and the inference process. During inference/generation there is a slight modification, which is shown in the following figure:

Figure 3 – Data generation with random input and a trained DGM

As is clear from the above figure, the user gives a random sample as the input to the trained generator to generate a sample whose distribution is similar to that of the training data. Let us consider that the random input $z$ is sampled from a tractable distribution $p(z)$ supported in $\mathbb{R}^m$ , and that the (intractable) training data distribution is high dimensional and supported in $\mathbb{R}^n$ . Therefore, the main goal of the trained generator can be written as:

$\begin{equation*} g_\theta:\mathbb{R}^m \to \mathbb{R}^n, \quad \textit{such that}, \quad \min_{\theta} d(p_d (x),p_\theta (x)) \end{equation*}$

where $d$ denotes a distance between the two probability distributions, and every random vector $z$ is mapped to a vector $x$ , which has an intractable distribution. The vector $z$ is commonly referred to as a latent variable; it is sampled from a latent space and, in general, follows a tractable Gaussian distribution. The distance minimization problem can be addressed using maximum likelihood. Let us assume that the generator function $g_\theta$ is known; then we can compute the likelihood of the generated sample $x$ from the latent variable $z$ :

(1) $\begin{equation*} p_\theta (x)= \int p_\theta (x|z) p(z)dz \end{equation*}$

The term $p_\theta(x|z)$ measures the closeness of the generated sample $g_\theta(z)$ to the original sample $x$ . Depending on the data, the likelihood function can be Gaussian for real-valued data or Bernoulli for binary data. From the above discussion, it is clear that approximating the generator function is the most challenging task, and for high-dimensional data it is performed using a deep neural network. A deep neural network approximates the generator function by computing the generator parameters $\theta$ .
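To make equation (1) a bit more tangible, the integral can be approximated by sampling $z$ from $p(z)$ and averaging $p_\theta(x|z)$ . The sketch below assumes a deliberately simple one-dimensional setting that is not part of the original discussion: a hypothetical linear generator $g_\theta(z) = az + b$ , a standard normal $p(z)$ and a Gaussian likelihood with standard deviation `SIGMA`. In this special case the marginal is known in closed form, $p_\theta(x) = \mathcal{N}(x; b, a^2 + \sigma^2)$ , so the estimate can be checked. A deep network as generator has no such closed form.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical generator parameters theta = (A, B) and likelihood noise level SIGMA.
A, B, SIGMA = 2.0, 1.0, 0.5


def generator(z):
    """Toy linear generator g_theta(z) = A * z + B (stands in for a neural network)."""
    return A * z + B


def marginal_likelihood_mc(x, n_samples, rng):
    """Monte Carlo estimate of p_theta(x) = integral of p_theta(x|z) p(z) dz."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)                       # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, generator(z), SIGMA)   # p_theta(x | z)
    return total / n_samples


rng = random.Random(0)  # fixed seed so the run is reproducible
x = 1.5
estimate = marginal_likelihood_mc(x, 200_000, rng)
exact = normal_pdf(x, B, math.sqrt(A ** 2 + SIGMA ** 2))
print(estimate, exact)
```

The two printed numbers should be close. For high-dimensional $z$ and a neural-network generator this naive averaging becomes very inefficient, which is one reason why the likelihood is treated as intractable in practice.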

Types of DGMs

There are several different types of DGMs to approximate the generator function, which can generate new data points with a distribution similar to that of the training data. In this series of blog posts, we will discuss the methods shown in the following figure.

In general, DGMs can be separated into implicit and explicit methods, where explicit methods are basically likelihood-based methods and learn the data distribution based on an explicitly defined $p_\theta(x)$ . On the other hand, implicit methods learn the data distribution directly without any prior model structure. Furthermore, explicit methods are split into tractable and approximation-based methods, where tractable methods utilize model structures which allow exact likelihood evaluation, and approximation-based methods apply different forms of approximation in the likelihood estimation.

Summary

In this blog article, we covered the mathematical foundation of DGMs including the different types. In further blog articles, we will cover the above-mentioned DGMs with their theoretical background and applications.

How Deep Learning drives businesses forward through automation – Infographic

February 16, 2022/in Artificial Intelligence, Deep Learning, Machine Learning, Main Category, Use Cases/by Benjamin Aunkofer

In cooperation between DATANOMIQ, my consulting company for data science, business intelligence and process mining, and Pixolution, a specialist for computer vision with deep learning, we have created an infographic (PDF) about a very special use case for companies with deep learning: How to protect the corporate identity of any company by ensuring consistent branding with automated font recognition.

How to ensure consistent branding with automatic font recognition – Infographic

The infographic is available as PDF download:

Download Infographic as PDF

What is Portfolio Risk Management in Python?

February 9, 2022/in Python/by Shannon Flynn

Data science is a crucial industry, with multiple processes today relying on it. One of its more helpful and intriguing applications is in investing, where it helps investors make more informed decisions. Practices like portfolio management in Python help take the guesswork out of this notoriously risky undertaking.

Investing is a complicated science, making it hard to do well. Some estimates hold that as much as 90% of people lose money in stocks. While stock trading will always involve some risk, Python-based portfolio management can help.

What Is Portfolio Management in Python?

Portfolio management is the process of planning, making and overseeing investments to meet your long-term investment goals. Portfolio management in Python uses data science to analyze risks and rewards to make the best investment decisions.

Since the future is uncertain, buying stocks is inherently risky, but some assets are riskier than others. For example, since many companies are trying to reach carbon neutrality by 2050, investing in sustainable technologies is a fairly sound strategy. However, that doesn’t guarantee that every eco-friendly startup will succeed, so investors need to consider more factors.

Some data scientists have found that you can use Python to understand these factors better. By plugging various figures into a Python equation, investors can chart potential risks and returns to find the best investments.

How Does Python Portfolio Management Work?

Portfolio risk management in Python operates on a principle called Modern Portfolio Theory (MPT). MPT helps investors find an optimal mix of high-risk, high-return investments and low-risk, low-return ones based on their risk tolerance. Investors can either look for the highest returns at a certain risk level or look for the lowest risk to get a certain return.

To apply this in Python, data scientists create one list for portfolio returns, one for risk and one for weights, meaning the share of the overall portfolio that each investment accounts for. They then randomly generate weights for the assets and normalize them so that they sum to one.

Data scientists then calculate the risks and returns for each asset and plug them into the different randomly generated weights. This will produce a list of various scenarios, showing how much overall risk and reward each portfolio would have.

Investors can then look at this list to see how much of each asset they should include in their portfolio. They can either use the mix that produces the greatest return or the one with the lowest risk.
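The steps above can be sketched with nothing but the Python standard library. The expected returns and the covariance matrix below are made-up numbers for three hypothetical assets, and measuring risk as the standard deviation of portfolio returns is the usual MPT convention rather than something this article spells out. Real analyses would estimate these inputs from historical price data.

```python
import math
import random

# Hypothetical annualised expected returns and covariance matrix for three assets.
expected_returns = [0.08, 0.12, 0.05]
covariance = [
    [0.040, 0.006, 0.002],
    [0.006, 0.090, 0.003],
    [0.002, 0.003, 0.010],
]


def random_weights(n, rng):
    """Draw n random positive numbers and normalize them so they sum to one."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def portfolio_return(weights):
    """Weighted average of the assets' expected returns."""
    return sum(w * r for w, r in zip(weights, expected_returns))


def portfolio_risk(weights):
    """Standard deviation of the portfolio: sqrt(w^T * covariance * w)."""
    n = len(weights)
    variance = sum(
        weights[i] * weights[j] * covariance[i][j] for i in range(n) for j in range(n)
    )
    return math.sqrt(variance)


rng = random.Random(42)  # fixed seed so the scenarios are reproducible
portfolio_weights, portfolio_returns, portfolio_risks = [], [], []
for _ in range(5000):
    w = random_weights(len(expected_returns), rng)
    portfolio_weights.append(w)
    portfolio_returns.append(portfolio_return(w))
    portfolio_risks.append(portfolio_risk(w))

# Pick the scenario with the lowest risk and the one with the highest return.
lowest_risk = min(range(len(portfolio_risks)), key=portfolio_risks.__getitem__)
highest_return = max(range(len(portfolio_returns)), key=portfolio_returns.__getitem__)
print("lowest risk:", portfolio_weights[lowest_risk], portfolio_risks[lowest_risk])
print("highest return:", portfolio_weights[highest_return], portfolio_returns[highest_return])
```

Each entry of the three lists describes one simulated portfolio, so an investor with a given risk tolerance could, for example, filter the scenarios by `portfolio_risks` and then take the one with the largest return among those that remain.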

Why Does It Matter?

Using Python for portfolio risk management helps remove a lot of the guesswork from investing. Running these calculations gives investors multiple scenarios to choose from, helping them find the best portfolio strategy for their needs and goals.

This presents a promising opportunity for data scientists. Data analytics are quickly becoming an essential part of the stock market. Algorithmic trading, which applies data and AI to MPT, already accounts for 60 to 73% of all U.S. equity trading. Portfolio management in Python could help more data scientists capitalize on this trend.

This practice is a relatively straightforward way to apply data science to stock trading. Data scientists that can make the most of that opportunity stand to make a name for themselves in investing circles.

Python Portfolio Management Can Maximize Returns

In the past, stock trading was almost akin to gambling, involving huge amounts of risk. While portfolio management in Python doesn’t remove volatility from the stock market, it helps put it in perspective. Investors can then make safer, more informed decisions to meet their investing goals.

Python-based portfolio management stands as a natural intersection between data science and stock trading. As a result, it can help both data scientists and investors achieve new success.
