Stop saying “trial and errors” for now: seeing reinforcement learning through some spectrums

*This is the fourth article of the series My elaborate study notes on reinforcement learning.

*In this article series “the book by Barto and Sutton” means “Reinforcement Learning: An Introduction second edition.” This book is said to be almost mandatory for those who seriously learn Reinforcement Learning (RL). And “the whale book” means a Japanese textbook named 「強化学習 (機械学習プロフェッショナルシリーズ)」(“Reinforcement Learning (Machine Learning Processional Series)”), by Morimura Tetsuro. I would say the former is for those who want to mainly learn how to use RL, and the latter is for more theoretical understanding. I am trying to make something between them in my series.

1, Finally to reinforcement learning

Some of you might have got away with explaining reinforcement learning (RL) only by saying an obscure thing like “RL enables computers to learn through trial and errors.” But if you have patiently read my articles so far, you might have come to say “RL is a family of algorithms which simulate procedures similar to dynamic programming (DP).” Even though my article series has not covered anything concrete and unique to RL yet, I think my series has already laid a hopefully effective foundation of discussions on RL. And in the first article, I already explained that “trial and errors” are only agents’ actions for collecting data from the environment. Such “trial and errors” lead to “experiences” of computers. And in this article we can finally start discussing how computers “experience” things in more practical and theoretical ways.

*The expression “to learn” is also frequently used in contexts of other machine learning algorithms. Thus in order to clearly separate the ideas, let me use the expression “to experience” when it comes to explaining RL. At any rate, what computers are doing is updating parameters, and in RL also updating values and policies. But some terms related to RL also use the word “experience,” for example experience replay, so “to experience” might be a preferred phrase in RL fields.

I think changing discussions on DP into those on RL is like making graphs more “open” rather than “closed.” In the second article, I explained DP problems, where the models of environments are completely known, as repeatedly updating graphs like neural networks. As I have been repeatedly saying RL, or at least model-free RL, is an approximated application of DP in the environments without a complete model. That means, connections of nodes of the graph, that is relations of actions and states, are something agents have to estimate directly or indirectly. I think that can be seen as untying connections of the graphs which I displayed when I explained DP. By doing so, I propose to see RL or more exactly model-free RL like the graph of the right side of the figure below.

*For the time being, I would prefer to use the term model-free RL rather than just RL. That is not only because this article is about model-free RL but also because I want to avoid saying inaccurate things about wider range of RL algorithms I would have to study more precisely and explain.

Some people might say these are tree structures, and that might be technically correct. But in my sense, this is more of “willows.” The cover of the second edition of the books by Barto and Sutton also looks like willows. The cover design comes from a paper on RL named “Learning to Drive a Bicycle using Reinforcement Learning and Shaping.” The paper is about learning to ride a bike in a simulator with RL. The geometric patterns are not models of human brain nerves, but trajectories of an agent learning to balance a bike. However interestingly, the trajectories of the bike, which are inscribed on a road, partly diverge but converge in a certain way as a whole, like the RL graph I propose. That is why I chose some pictures of 「花札 (hanafuda)」as the main picture of this series. Hanafuda is a Japanese gamble card game with monthly seasonal flower pictures. And the cards of June have pictures of willows.

Source: Learning to Drive a Bicycle using Reinforcement Learning and Shaping, Randløv, (1998)    Richard S. Sutton, Andrew G. Barto, “Reinforcement Learning: An Introduction,” MIT Press, (2018)

2, Untying DP graphs: planning or learning

Even though I have just loudly declared that my RL graphs are more of “willow” structures in my aesthetic sense, I must admit they should basically be discussed as popular tree structures. That is because, when you start discussing practical RL algorithms you need to see relations of states and actions as tree structures extending. If you already more or less familiar with tree structures or searching algorithms on tree graphs, learning RL with tree structures should be more or less straightforward to you. Another reason for using tree structures with nodes of states and actions is that the book by Barto and Sutton use buck up diagrams of Bellman equations which are tree graphs. But I personally think the graphs should be used more effectively, so I am trying to expand its uses to DP and RL algorithms in general. In order to avoid confusions about current discussions on RL in my article series, I would like to give an overall review on how to look at my graphs.

The graphs in the figure below are going to be used in my articles, at least when I talk about model-free RL. I made them based on the backup diagram of Bellman equation introduced in the book by Barto and Sutton. I would like you to first remember that in RL we are basically discussing Markov decision process (MDP) environment, where the next action and the resulting next states depends only on the current state. Such models are composed of white nodes representing each state s in an state space \mathcal{S}, and black nodes representing each action a, which is a member of an action space \mathcal{A}. Any behaviors of agents are represented as going back and forth between black and white nodes of the model, and that is why connections in the MDP model are bidirectional.  In my articles let me call such model of environments “a closed model.” RL or general planning problems are matters of optimizing policies in such models of environments. Optimizing the policies are roughly classified into two types, planning/searching or RL, and the main difference between them is whether connections of graphs of models are known or not. Planning or searching is conducted without actually moving in the environment. DP are family of planning algorithms which are known to converge, and so far in my articles we have seen that DP are enabled by repeatedly applying Bellman operators. But instead of considering and updating all the possible transitions in the model like DP, planning can be conducted more sparsely. Such sparse planning are often called searching, and many of them use tree structures. If you have learned any general decision making problems with tree graphs, you might be already familiar with some searching techniques like alpha-beta pruning.

*In explanations on DP in my articles, directions of connections of model graphs are confusing, so I precisely explained how to look at them in the second section in the last article.

On the other hand, RL algorithms are matters of learning the linkages of models of environments by actually moving in them. For example, when the agent in the figure below move on a grid map like the purple arrows, the movement is represented like in the closed model in the middle. However as the agent does not have the complete closed model, the agent has to move around in the environment like the tree structure at the right side to learn values of each node.

The point is, whether models of environments are known or unknown, or whether agents actually move in the environment or not, movements of agents are basically represented as going back and forth between white nodes and black nodes in closed models. And such closed models are entangled in searching or RL. They are similar operations, but they are essentially different in that searching agents do not actually move in searching but in RL they actually move.  In order to distinguish searching and learning, in my articles, trees for searching are extended vertically, trees for learning horizontally.

*DP and searching are both planning, but DP consider all the connections of actions and states by repeatedly applying Bellman operators. Thus I would not count DP as “untying” of closed models.

3, Some spectrums in RL algorithms

Starting studying actual RL algorithms also means encountering various algorithms one after another. Some of you might have already been overwhelmed by new terms coming up one after another in study materials on RL. That is because, as I explained in the first article, RL is more about how to train models of values or policies. Thus it is natural that compared to general machine learning, which more or less share the same training frameworks, RL has a variety of training procedures. Rather than independently studying each RL algorithm, I think it is more effective to see connections of each algorithm, which is linked by adjusting degrees of some important elements in RL. In fact I have already introduced those elements as some pairs of key words of RL in the first article. But it would be all the more effective to review them, especially after learning DP algorithms as representative planning methods. If you study RL that way, you would come to see trial and errors or RL as a crucial but just one aspect of RL.

I think if you care less about the trial-and-error aspect of RL that allows you to study RL more effectively in the beginning. And for the time being, you should stop viewing RL in the popular way as presented above. Not that I am encouraging you to ignore the trial and error part, namely relations of actions, rewards, and states. My point is that it is more of inside the agent that should be emphasized. Planning, including DP is conducted inside the agent, and trial and errors are collection of data from the environment for the sake of the planning. That is why in many study materials on RL, DP is first introduced. And if you see differences of RL algorithms as adjusting of some pairs of elements of planning problems, it would be less likely that you would get lost in curriculums on RL. The pairs are like some spectrums. Not that you always have to choose either of each pair, but rather ideal solutions are often in the middle of the two ends of the spectrums depending on tasks. Let’s take a look at the types of those spectrums one by one.

(1) Value-policy or actor-critic spectrum

The crucial type of spectrum you should be already familiar with is the value-policy one. I think this spectrum can be adjusted in various ways. For example, over the last two articles we have seen how values and policies reach the optimal functions in DP using policy iteration or value iteration. Policy iteration alternates between updating values and policies until convergence to the optimal policy, whereas value iteration keeps updating only values until reaching the optimal value, to get the optimal policy at the end. And similar discussions can be seen also in the upcoming RL algorithms. The book by Barto and Sutton sees such operations in general as generalized policy iteration (GPI).

Source: Richard S. Sutton, Andrew G. Barto, “Reinforcement Learning: An Introduction,” MIT Press, (2018)

You should pay attention to the idea of GPI because this is what makes RL different form other general machine learning. In many cases RL is explained as a field of machine learning which is like trial and errors, but I personally think that GPI, interactive optimization between values and policies, should be more emphasized. As I said in the first article, RL optimizes decision making rules, that is policies \pi(a|s), in MDPs. Other general machine learning algorithms have more direct supervision by loss functions and models are optimized so that loss functions are minimized. In the case of the figure below, an ML model f is optimized to f_{\ast} by optimization such as gradient descent. But on the other hand in RL policies \pi do not have direct loss functions. Then RL uses values v(s), which are functions of how good it is to be in states s. As one part of GPI, the value function v_{\pi} for the current policy \pi is calculated, and this is called estimation in the book by Barto and Sutton.  And based on the estimated value function, the policy is improved as \pi ', which is called policy improvement, and overall processes of estimation and policy improvement are called control in the book. And v_{\pi} and \pi are updated alternately this way until converging to the optimal values v_{\ast} or policies \pi_{\ast}. This interactive updates of values and policies are done inside the agent, in the dotted frame in red below. I personally think this part should be more emphasized than trial-and-error-like behaviors of agents. Once you see trial and errors of RL as crucial but just one aspect of GPI and focus more inside agents, you would see why so many study materials start explaining RL with DP.

You can explicitly model such interactions of values and policies by modeling each of them with different functions, and in this case such frameworks of RL in general are called actor-critic methods. I am gong to explain actor-critic methods in an upcoming article. Thus the value-policy spectrum also can be seen as a actor-critic spectrum. Differences between the pairs of value-policy or actor-critic spectrums are something you would little by little understand. For now I would say GPI is the most general and important idea behind RL. But practical RL algorithms are implemented as actor-critic methods. Critic parts gives some signals to actor parts, and critic parts get its consequence by actor parts taking actions in environments. Not that actors directly give feedback to critics.

*I think one of confusions in studying RL come from introducing Q-learning or SARSA at the first algorithms or a control in RL. As I have said earlier, interactive relations between values and policies or actors and critics, that is GPI, should be emphasized. And I think that is why DP is first introduced in many books. But in Q-learning or SARSA, an actor and a critic parts are combined as one module. But explicitly separating the actor and critic parts would be just too difficult at the beginning. And modeling an actor and a critic with separate modules would lead to difficulties in optimizing them together.

(2) Exploration-exploitation or on-off policy spectrum

I think the most straightforward spectrum is the exploitation-exploration spectrum. You can adjust how likely agents take random actions to collect data. Occasionally it is ideal for agents to have some degree of randomness in taking actions to explore unknown states of environments. One of the simplest algorithms to formulate randomness of actions is ε-greedy method, which I explained in the first article. In this method in short agents take a random action with a probability of ε. Instead of arbitrarily setting a hyperparameter \epsilon, randomness of actions can be also learned by modeling policies with certain functions. This randomness of functions can be also modeled in actor-critic frameworks. That means, depending on a choice of an actor, such actor can learn randomness of actions, that is explorations.

The two types of spectrums I have introduced so far lead to another type of spectrum. It is an on-off policy spectrum. Even though I explained types of policies in the last article using examples of home-lab-Starbucks diagrams, there is another way to classify policies: there are target policies and behavior policies. The former are the very policies whose optimization we have been discussing. The latter are policies for taking actions and collecting data. When agents use target policies also as behavior policies, they are on-policy algorithms. If agents use different policies for taking actions during optimization of target policies, they are off-policy methods.

Policy iteration and value iteration of DP can be also classified into on-policy or off-policy in a sense. In policy iteration values are updated using an up-to-date estimated policy, and the policy becomes optimal when it converges. Thus behavior and target policies are the same in this case. On the other hand in value iteration, values are updated with Bellman optimality operator, which updates values in a greedy way. Using greedy method means the policy \pi is not used for considering which action to take. Thus target and behavior policies are different. As you will see soon, concrete model-free RL algorithms like SARSA or Q-learning also have the same structure: the former is on-policy and the latter is off-policy. The difference of on-policy or off-policy would be more straightforward if we model behavior policies and target policies with different functions. An advantage of off-policy RL is you can model randomness of exploration of agents with extra functions. On the other hand, a disadvantage is that it would be harder to train different models at the same time. That might be a kind of tradeoff similar to an actor-critic method.

Even though this exploration-exploitation aspect of RL is relatively easy to understand, at the same time that can lead to much more complicated discussions on RL, which I would not be able to cover in this article series. I recommended you to stop seeing RL as trial and errors for the time being, but in the end trial and errors would prove to be crucial because data needed for GPI are collected mainly via trial and errors. Even if you implement some simple RL algorithms, you would soon realize it is hard to deal with unvisited states. Enough explorations need to be modeled by a behavior policy or some sophisticated heuristic techniques. I am planning to explain convergence of several RL algorithms, and they are guaranteed by sufficiently exploring all the states. However, thorough explorations of all the states lead to massive computational costs. But lack of exploration would let RL agents myopically overestimate current policies, never finding policies which pay off in the long run. That might be close to discussions on how to efficiently find a global minimum of a loss function, avoiding local minimums.

(3) TD-MonteCarlo spectrum

A variety of spectrums so far are enabled by modeling proper functions on demand. But in AI problems such functions are something which have to be automatically trained with some supervision. Instead of giving supervision explicitly with annotated data like in supervised learning of general machine learning, RL agents train models with “experiences.” As I am going to explain in the next part of this article, “experiences” in RL contexts mean making some estimations of values and adjusting such estimations based on actual rewards they get. And the timings of such feedback lead to another spectrum, which I call a TD-MonteCarlo spectrum. When the feedback happens every time an agent takes an action, it is TD method, on the other hand when that happens only at the end of an episode, that is Monte Carlo method. But it is easy to imagine that ideal solutions are usually at the middle of them. I am going to dig this topic soon in the next article. And n-step methods or TD(λ), which bridge the TD and Monte Carlo, are going to be covered in one of upcoming articles.

(4) Model free-based spectrum

The next spectrum might be relatively hard to understand, and to be honest I am still not completely sure about this topic. Please bear that in your mind. In the last section, I said RL is a kind of untying DP graphs and make them open because in RL, models of environments are unknown. However to be exact, that was mainly about model-free RL, which this article is going to cover for the time being. And I would say the graphs I showed in the last section were just two extremes of this model based-free spectrum. Some model-based RL methods exist in the middle of those two ends. In short RL agents can retain models of environments and do some plannings even when they do trial and errors. The figure below briefly compares planning, model-based RL, and model-free RL in the spectrum.

Let’s take a rough example of humans solving a huge maze. DP, which I have covered is like having a perfect map of the maze and making plans of how to move inside in advance. On the other hand, model-free reinforcement learning is like soon actually entering the maze without any plans. In model-free reinforcement learning, you only know how big the maze is, and you have a great memory for remembering in which directions to move, in all the places. However, as the model of how paths are connected is unknown, and you naively try to remember all the actions in all the places, it generally takes a longer time to solve the maze. As you could easily imagine, having some heuristic ideas about the model of the maze and taking some notes and making plans about courses would be the most efficient and the most peaceful. And such models in your head can be updated by actually moving in the maze.

*I believe that you would not say the pictures above are spoilers.

I need to more clearly talk about what a model is in RL or general planning problems. The book by Barto and Sutton simply defines a model this way: “By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions. ” The book also says such models can be also classified to distribution models and sample models. The difference between them is the former describes an environment as combinations of known models, but the latter is like a black box model of an environment. An intuitive example is, as introduced in the book by Barto and Sutton, throwing dozens of dices can be seen in the both types. If you just throw the dices, sometimes chancing numbers of dices, and record the sum of the numbers on the dices s every time, that is equal to getting the sum from a black box. But a probabilistic distribution of such sums can be actually calculated as a multinomial distribution. Just as well, you can see a probability of transitions in an RL environment as a black box, but the probability can be also modeled. Some readers might have realized that distribution or sample models can be almost the same in the end, with sufficient data. In many cases of machine learning or statistics algorithms, complicated distributions have to be approximated with samples. Or rather how to approximate them is more of interest. In the case of dozens of dices, you can analytically calculate its distribution model as a multinomial distribution. But if you throw the dices numerous times, you would get precise approximated distributions.

When we discuss model-based RL, we need to consider not only DP but also other planning algorithms. DP is a family of planning algorithms which are known to converge, and many of RL algorithms share a lot with DP at theoretical levels. But in fact DP has one shortcoming even if the MDP model of an environment is known: DP needs to consider and update all the states. When models of environments are too complicated and large, applying DP is not a good idea. Also in many of such cases, you could not even get such a huge model of the environment. You would rather get only a black box model of the environment. Such a black box model only gets a pair of current state and action (s, a), and gives out the next state s' and corresponding reward r, that is the black box is a sample model. In this case other planning methods with some searching algorithms are used, for example Monte Carlo tree search. Such search algorithms are designed to more efficiently and sparsely search states and actions of interest. Many of searching algorithms used in RL make uses of tree structures. Model-based approaches can be roughly classified into three types below based on size or complication of models.

*As you could see, differences between sample models and distribution models can be very ambiguous. So are differences between model-free and model-based RL, I guess. As a matter of fact the whale book says the distributions of models approximated in model-free RL are the same as those in model-based ones. I cannot say anything exactly anymore, but I guess model-free RL is more of “memorizing” an environment, or combinations of states and actions in the environments. But memorizing environments can be computationally problematic in many cases, so assuming some distributions of models can help. That is my impression for now.

*Tree search algorithms alone shows very impressive performances, as long as you have massive computation resources. A heuristic tree search without reinforcement learning could defeat Garri Kasparow, a former chess champion, as long as enough computation resource is available. Searching algorithms were enough for “simplicity” of chess.

*I am not sure whether model-free RL algorithms are always simpler than model-based ones. For example Deep Q-Learning, a model-free method with some neural networks can learn to play Atari or Nintendo Entertainment System. Model-based deep RL is used in more complex task like AlphaGo or AlphaZero, which can defeat world champions of various board games. AlphaGo or AlphaZero models intuitions in phases of board games with convolutional neural networks (CNN), prediction of some phases ahead with search algorithms, and learning from past experiences with RL. I am not going to cover model-based RL in general in this series, but instead I would like to explain how RL enables computers to play video games after introducing some searching algorithms.

(5) Model expressivity spectrum

No matter how impressive or dreamy RL algorithms sound, their competence largely depend on model expressivity. In the first article, I emphasized “simplicity” of RL. DP or RL algorithms so far or in upcoming several articles consider incredibly simple cases like kids playbooks. And that beginning parts of most RL study materials cover only the left side of the figure below. In order to enable RL agents with more impressive tasks such as balancing cart-pole or playing video games, we need to raise the bar of expressivity spectrum, from the left to the right side of the figure below. You need to wait until a chapter or a section on “function approximation” in order to actually feel that your computer is doing trial and errors. And such chapters finally appear after reading half of both the book by Barto and Sutton and the whale book.

*And this spectrum is also a spectrum of computation costs or convergence. The left type could be easily implemented like programming assignments of schools since it in short needs only Excel sheets, and you would soon get results. The middle type would be more challenging, but that would not b computationally too expensive. But when it comes to the type at the right side, that is not something which should be done on your local computer. At least you need a GPU. You should expect some hours or days even for training RL agents to play 8 bit video games. That is of course due to cost of training deep neural networks (DNN), especially CNN. But another factors is potential inefficiency of RL. I hope I could explain those weak points of RL and remedies for them.

We need to model values and policies with certain functions. For the time being, in my articles values and policies are just modeled as tabular data, that is some NumPy arrays or Excel sheets. These are types of cases where environments and actions are relatively simple and discrete. Thus they can be modeled with some tabular data with the same degree of freedom. Assume a case where there are only 30 grids in an environment and only 4 types of actions in every grid. In such case, values are stored as arrays with 30 elements, and so are policies. But when environments are more complex or require continuous values of some parameters, values and policies have to be approximated with some models. When only relatively few parameters need to be estimated, simple machine learning models such as softmax functions can be used as such models. But compared to the cases with tabular data, convergence of training has to be discussed more carefully. And when you need to estimate continuous values, techniques like policy gradients have to be introduced. And we can dramatically enhance expressivity of models with deep neural netowrks (DNN), and such RL is called deep RL. Deep RL has showed great progress these days, and it is capable of impressive performances. Deep RL often needs observers to process inputs like video frames, and for example convolutional neural networks (CNN) can be used to make such observers. At any rate, no matter how much expressivity RL models have, they need to be supervised with some signals just as general machine learning often need labeled data. And “experiences” give such supervisions to RL agents.

(6) Adjusting sliders of spectrum

As you might have already noticed, these spectrums are not something you can adjust independently like faders on mixing board. They are more like some sliders for adjusting colors, brightness, or chroma on painting software. If you adjust one element, other parts are more or less influenced. And even though there are a variety of colors in the world, they continuously change by adjusting those elements of colors. Just as well, even if each RL algorithms look independent, many of them share more or less the same ideas, and only some parts are different in terms of their degrees. When you get lost in the course of studying RL, I would like you to decompose the current topic into these spectrums of RL elements I have explained.

I hope my explanations so far changed how you see RL. In the first article I already said RL is approximation of DP-like procedures with data collected by trial and errors, but from now on I would explain it also this way: RL is a family of algorithms which enable GPI by adjusting some spectrums.

In the next some articles, I am going to mainly cover RL algorithms named SARSA and Q-learning. Both of them use tabular data, and they are model-free. And in values and policies, or actors and critics are together modeled as action-value functions, which I am going to explain later in this article. The only difference is SARSA is on-policy, and Q-learning is off-policy, just as I have already mentioned. And when it comes to how to train them, they both use Temporal Difference (TD), and this gives signals of “experience” to RL agents. Altering DP in to model-free RL is, in the figure above, adjusting the model-based-free and MonteCarlo-TD spectrums to the right end. And you also adjust the low-high-expressivity and value-policy spectrums to the left end. In terms of actor-critic spectrum, the actor and the critic parts are modeled as the same module. Seeing those algorithms this way would be much more effective than looking at their pseudocode independently.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Training of Deep Learning AI models

It’s All About Data: The Training of AI Models

In deep learning, there are different training methods. Which one we use in an AI project depends on the data provided by our customer: how much data is there, is it labeled or unlabeled? Or is there both labeled and unlabeled data?

Let’s say our customer needs structured, labeled images for an online tourism portal. The task for our AI model is therefore to recognize whether a picture is a bedroom, bathroom, spa area, restaurant, etc. Let’s take a look at the possible training methods.

1. Supervised Learning

If our customer has a lot of images and they are all labeled, this is a rare stroke of luck. We can then apply supervised learning. The AI model learns the different image categories based on the labeled images. For this purpose, it receives the training data with the desired results from us.

During training, the model searches for patterns in the images that match the desired results, learning the characteristics of the categories. The model can then apply what it has learned to new, unseen data and in this way provide a prediction for unlabeled images, i.e., something like “bathroom 98%.”

2. Unsupervised Learning

If our customer can provide many images as training data, but all of them are not labeled, we have to resort to unsupervised learning. This means that we cannot tell the model what it should learn (the assignment to categories), but it must find regularities in the data itself.

Contrastive learning is currently a common method of unsupervised learning. Here, we generate several sections from one image at a time. The model should learn that the sections of the same image are more similar to each other than to those of other images. Or in short, the model learns to distinguish between similar and dissimilar images.

Although we can use this method to make predictions, they can never achieve the quality of results of supervised learning.

3. Semi-supervised Learning

If our customer can provide us with few labeled data and a large amount of unlabeled data, we apply semi-supervised learning. In practice, we actually encounter this data situation most often.

With semi-supervised learning, we can use both data sets for training, the labeled and the unlabeled data. This is possible by combining contrastive learning and supervised learning, for example: we train an AI model with the labeled data to obtain predictions for room categories. At the same time, we let the model learn similarities and dissimilarities in the unlabeled data and then optimize itself. In this way, we can ultimately achieve good label predictions for new, unseen images.

Supervised vs. Unsupervised vs. Semi-supervised

Everyone who is entrusted with an AI project wants to apply supervised learning. In practice, however, this is rarely the case, as rarely all training data is well structured and labeled.

If only unstructured and unlabeled data is available, we can at least extract information from the data with unsupervised learning. These can already provide added value for our customer. However, compared to supervised learning, the quality of the results is significantly worse.

With semi-supervised learning, we try to resolve the data dilemma of small part labeled data, large part unlabeled data. We use both datasets and can obtain good prediction results whose quality is often on par with those of supervised learning. This article is written in cooperation between DATANOMIQ and pixolution, a company for computer vision and AI-bases visual search.

Buzzword Bingo: Data Science – Teil I

Rund um das Thema Data Science gibt es unglaublich viele verschiedene Buzzwords, die Ihnen sicherlich auch schon vielfach begegnet sind. Sei es der Begriff Künstliche Intelligenz, Big Data oder auch Deep Learning. Die Bedeutung dieser Begriffe ist jedoch nicht immer ganz klar und häufig werden Begriffe auch vertauscht oder in missverständlichen Zusammenhängen benutzt. Höchste Zeit also, sich einmal mit den genauen Definitionen dieser Begriffe zu beschäftigen!

Buzzword Bingo: Data Science – Teil 1: Künstliche Intelligenz, Algorithmen & Maschinelles Lernen

Im ersten Teil unserer dreiteiligen Reihe „Buzzword Bingo Data Science“ beschäftigen wir uns zunächst mit den drei Begriffen „Künstliche Intelligenz“, „Algorithmus“ und „Maschinelles Lernen“.

Künstliche Intelligenz

Der im Bereich der Data Science u. a. am häufigsten genutzte Begriff ist derjenige der „Künstlichen Intelligenz“. Viele Menschen denken bei dem Begriff sofort an hochspezialisierte Maschinen à la „The Matrix“ oder „I, Robot“. Dabei ist der Begriff deutlich älter als viele denken. Bereits 1956 wurde der englische Begriff “artificial intelligence” zum ersten Mal in einem Workshop-Titel am US-amerikanischen Dartmouth College genutzt.

Heutzutage besitzt der Begriff der künstlichen Intelligenz keine allgemeingültige Definition. Es handelt sich bei künstlicher Intelligenz grundsätzlich um ein Teilgebiet der Informatik, das sich mit der Automatisierung von intelligentem Verhalten befasst. Es geht also darum, dass ein Computerprogramm auf eine Eingabe eine intelligente Reaktion zeigt. Zu beachten ist hierbei, dass eine künstliche Intelligenz nur ein scheinbar intelligentes Verhalten zeigen kann. Künstliche Intelligenz wird heutzutage sehr weit gefasst und kann vieles umfassen: von klassischen, regelbasierten Algorithmen bis hin zu selbstlernenden künstlichen neuronalen Netzen.

Das zentrale Forschungsziel ist die Entwicklung einer sogenannten Allgemeinen Künstlichen Intelligenz, also einer Maschine, die in der Lage sein wird, autonom beliebige Probleme zu lösen. Es gibt eine fortlaufende Debatte darüber, ob dieses Ziel jemals erreicht werden kann bzw. ob es erreicht werden sollte.

In den vergangenen Jahren ist auch die sogenannte xAI (engl. Explainable AI; erklärbare künstliche Intelligenz) in den Mittelpunkt der Forschungsinteressen gerückt. Dabei geht es um die Problematik, dass künstliche Intelligenzen sogenannte Black Boxen sind. Das bedeutet, dass ein menschlicher User die Entscheidung einer künstlichen Intelligenz üblicherweise nicht nachvollziehen kann. Eine xAI wäre im Vergleich jedoch eine Glass Box, die Entscheidungen einer solchen künstlichen Intelligenz wären für Menschen also nachvollziehbar.


Algorithmen sind klar definierte, vorgegebene Prozeduren, mit denen klar definierte Aufgaben gelöst werden können. Dabei kann der Lösungsweg des Algorithmus entweder durch Menschen vorgegeben, also programmiert werden oder Algorithmen lernen durch Methoden des maschinellen Lernens selbstständig den Lösungsweg für eine Prozedur.

Im Bereich der Data Science bezeichnen wir mit Algorithmen kleine Programme, die scheinbar intelligent handeln. Dementsprechend stecken auch hinter künstlichen Intelligenzen Algorithmen. Werden Algorithmen mit klar definierten Eingaben versorgt, führen sie somit zu einem eindeutigen, konstanten Ergebnis. Dabei gilt aber leider auch der Grundsatz der Informatik „Mist rein, Mist raus“. Ein Algorithmus kann immer nur auf sinnvolle Eingaben sinnvolle Ausgaben erzeugen. Die Komplexität von Algorithmen kann sehr vielfältig sein und je komplexer ein solcher Algorithmus ist, desto „intelligenter“ erscheint er oftmals.

Maschinelles Lernen

Maschinelles Lernen ist ein Überbegriff für eine Vielzahl von Verfahren, mit denen ein Computer oder eine künstliche Intelligenz automatisch Muster in Daten erkennt. Beim maschinellen Lernen wird grundsätzlich zwischen dem überwachten und unüberwachten Lernen unterschieden.

Beim überwachten Lernen lernt ein Algorithmus den Zusammenhang zwischen bekannten Eingabe- und Ausgabewerten. Nachdem dieser Zusammenhang vom Algorithmus erlernt wurde, kann dieses maschinelle Modell dann auf neue Eingabewerte angewandt und somit unbekannte Ausgabewerte vorhergesagt werden. Beispielsweise könnte mithilfe einer Regression zunächst der Zusammenhang zwischen Lufttemperatur und dem Wochentag (jeweils bekannte Eingabewerte) sowie der Anzahl der verkauften Eiskugeln (für die Vergangenheit bekannte Ausgabewerte) in einem Freibad untersucht werden. Sobald dieser Zusammenhang einmal ausreichend genau bestimmt worden ist, kann er auch für die Zukunft fortgeschrieben werden. Das bedeutet, es wäre dann möglich, anhand des nächsten Wochentages sowie der vorhergesagten Lufttemperatur (bekannte Eingabewerte für die Zukunft) die Anzahl der verkauften Eiskugeln (unbekannte Ausgabewerte für die Zukunft) zu prognostizieren und somit die Absatzmenge genauer planen zu können.

Beim unüberwachten Lernen auf der anderen Seite sind nur Eingabedaten vorhanden, es gibt keine den Eingabedaten zugehörigen Ausgabedaten. Hier wird dann mit Methoden wie beispielsweise dem Clustering versucht, verschiedene Datenpunkte anhand ihrer Eigenschaften in verschiedene Gruppen aufzuteilen. Beispielsweise könnte ein Clustering-Algorithmus verschiedene Besucher:innen eines Webshops in verschiedene Gruppen einteilen: Es könnte beispielsweise eine Gruppe von Besucher:innen geben, die sehr zielstrebig ein einzelnes Produkt in den Warenkorb legen und ihren Kauf direkt abschließen. Andere Besucher:innen könnten allerdings viele verschiedene Produkte ansehen, in den Warenkorb legen und am Ende nur wenige oder vielleicht sogar gar keine Käufe tätigen. Wieder andere Kund:innen könnten unter Umständen lediglich auf der Suche nach Artikeln im Sale sein und keine anderen Produkte ansehen.

Aufgrund ihres Nutzungsverhaltens auf der Website könnte ein Clustering-Algorithmus mit ausreichend aufbereiteten Daten nun all diese Kund:innen in verschiedene Gruppen oder Cluster einteilen. Was der Algorithmus jedoch nicht leisten kann ist zu erklären, was die erkannten Cluster genau bedeuten. Hierfür braucht es nach wie vor menschliche Intelligenz gepaart mit Fachwissen.

Automatic Financial Trading Agent for Low-risk Portfolio Management using Deep Reinforcement Learning

This article focuses on autonomous trading agent to solve the capital market portfolio management problem. Researchers aim to achieve higher portfolio return while preferring lower-risk actions. It uses deep reinforcement learning Deep Q-Network (DQN) to train the agent. The main contribution of their work is the proposed target policy.


Author emphasizes the importance of low-risk actions for two reasons: 1) the weak positive correlation between risk and profit suggests high returns can be obtained with low-risk actions, and 2) customer satisfaction decreases with increases in investment risk, which is undesirable. Author challenges the limitation of Supervised Learning algorithm since it requires domain knowledge. Thus, they propose Reinforcement Learning to be more suitable, because it only requires state, action and reward specifications.

The study verifies the method through the back-test in the cryptocurrency market because it is extremely volatile and offers enormous and diverse data. Agents then learn with shorter periods and are tested for the same period to verify the robustness of the method. 

2 Proposed Method

The overall structure of the proposed method is shown below.

The architecutre of the proposed trading agent system.

The architecutre of the proposed trading agent system.

2.1 Problem Definition

The portfolio consists of m assets and one base currency.

The price vector p stores the price p of all assets:

The portfolio vector w stores the amount of each asset:

At time 𝑡, the total value W_t of the portfolio is defined as the inner product of the price vector p_t and the portfolio vector w_t .

Finally, the goal is to maximize the profit P_t at the terminal time step 𝑇.

2.2 Asset Data Preprocessing

1) Asset Selection
Data is drawn from the Binance Exchange API, where top m traded coins are selected as assets.

2) Data Collection
Each coin has 9 properties, shown in Table.1, so each trade history matrix has size (α * 9), where α is the size of the target period converted into minutes.

3) Zero-Padding
Pad all other coins to match the matrix size of the longest coin. (Coins have different listing days)

Comment: Author pointed out that zero-padding may be lacking, but empirical results still confirm their method covering the missing data well.

4) Stack Matrices
Stack m matrices of size (α * 9) to form a block of size (m* α * 9). Then, use sliding window method with widow size w to create (α – w + 1) number of sequential blocks with size (w *  m * 9).

5) Normalization
Normalize blocks with min-max normalization method. They are called history block 𝜙 and used as input (ie. state) for the agent.

3. Deep Q-Network

The proposed RL-based trading system follows the DQN structure.

Deep Q-Network has 2 networks, Q- and Target network, and a component called experience replay. The Q-network is the agent that is trained to produce the optimal state-action value (aka. q-value).

Comment: Q-value is calculated by the Bellman equation, which, in short, consists of the immediate reward from next action, and the discounted value of the next state by following the policy for all subsequent steps.


Agent: Portfolio manager
Action a: Trading strategy according to the current state
State 𝜙 : State of the capital market environment
Environment: Has all trade histories for assets, return reward r and provide next state 𝜙’ to agent again

DQN workflow:

DQN gets trained in multiple time steps of multiple episodes. Let’s look at the workflow of one episode.

Training of a Deep Q-Network

Training of a Deep Q-Network

1) Experience replay selects an action according to the behavior policy, executes in the environment, returns the reward and next state. This experience set (\phi_t, a_t, r_r,\phi_{t+!}) is stored in the repository as a sample of training data.

2) From the repository of prior observations, take a random batch of samples as the input to both Q- and Target network. The Q-network takes the current state and action from each data sample and predicts the q-value for that particular action. This is the ‘Predicted Q-Value’.Comment: Author uses 𝜀-greedy algorithm to calculate q-value and select action. To simplify, 𝜀-greedy policy takes the optimal action if a randomly generated number is greater than 𝜀, which represents a tradeoff between exploration and exploitation.

The Target network takes the next state from each data sample and predicts the best q-value out of all actions that can be taken from that state. This is the ‘Target Q-Value’.

Comment: Author proposes a different target policy to calculate the target q-value.

3) The Predicted q-value, Target q-value, and the observed reward from the data sample is used to compute the Loss to train the Q-network.

Comment: Target Network is not trained. It is held constant to serve as a stable target for learning and will be updated with a frequency different from the Q-network.

4) Copy Q-network weights to Target network after n time steps and continue to next time step until this episode is finished.

The architecutre of the proposed trading agent system.

4.0 Main Contribution of the Research

4.1 Action and Reward

Agent determines not only action a but ratio , at which the action is applied.

  1. Action:
    Hold, buy and sell. Buy and sell are defined discretely for each asset. Hold holds all assets. Therefore, there are (2m + 1) actions in the action set A.

    Agent obtains q-value of each action through q-network and selects action by using 𝜀-greedy algorithm as behavior policy.
  2. Ratio:
    \sigma is defined as the softmax value for the q-value of each action (ie. i-th asset at \sigma = 0.5 , then i-th asset is bought using 50% of base currency).
  3. Reward:
    Reward depends on the portfolio value before and after the trading strategy. It is clipped to [-1,1] to avoid overfitting.

4.2 Proposed Target Policy

Author sets the target based on the expected SARSA algorithm with some modification.

Comment: Author claims that greedy policy ignores the risks that may arise from exploring other outcomes other than the optimal one, which is fatal for domains where safe actions are preferred (ie. capital market).

The proposed policy uses softmax algorithm adjusted with greediness according to the temperature term 𝜏. However, softmax value is very sensitive to the differences in optimal q-value of states. To stabilize  learning, and thus to get similar greediness in all states, author redefine 𝜏 as the mean of absolute values for all q-values in each state multiplied by a hyperparameter 𝜏’.

4.3 Q-Network Structure

This study uses Convolutional Neural Network (CNN) to construct the networks. Detailed structure of the networks is shown in Table 2.

Comment: CNN is a deep neural network method that hierarchically extracts local features through a weighted filter. More details see:

5 Experiment and Hyperparameter Tuning

5.1 Experiment Setting

Data is collected from August 2017 to March 2018 when the price fluctuates extensively.

Three evaluation metrics are used to compare the performance of the trading agent.

  • Profit P_t introduced in 2.1.
  • Sharpe Ratio: A measure of return, taking risk into account.

    Comment: p_t is the standard deviation of the expected return and P_f  is the return of a risk-free asset, which is set to 0 here.
  • Maximum Drawdown: Maximum loss from a peak to a through, taking downside risk into account.

5.2 Hyperparameter Optimization

The proposed method has a number of hyperparameters: window size mentioned in 2.2,  𝜏’ in the target policy, and hyperparameters used in DQN structure. Author believes the former two are key determinants for the study and performs GridSearch to set w = 30, 𝜏’ = 0.25. The other hyperparameters are determined using heuristic search. Specifications of all hyperparameters are summarized in the last page.

Comment: Heuristic is a type of search that looks for a good solution, not necessarily a perfect one, out of the available options.

5.3 Performance Evaluation

Benchmark algorithms:

UBAH (Uniform buy and hold): Invest in all assets and hold until the end.
UCRP (Uniform Constant Rebalanced Portfolio): Rebalance portfolio uniformly for every trading period.

Methods from other studies: hyperparameters as suggested in the studies
EG (Exponential Gradient)
PAMR (Passive Aggressive Mean Reversion Strategy)

Comment: DQN basic uses greedy policy as the target policy.

The proposed DQN method exhibits the best overall results out of the 6 methods. When the agent is trained with shorter periods, although MDD increases significantly, it still performs better than benchmarks and proves its robustness.

6 Conclusion

The proposed method performs well compared to other methods, but there is a main drawback. The encoding method lacked a theoretical basis to successfully encode the information in the capital market, and this opaqueness is a rooted problem for deep learning. Second, the study focuses on its target policy, while there remains room for improvement with its neural network structure.

Specification of Hyperparameters

Specification of Hyperparameters.



  1. Shin, S. Bu and S. Cho, “Automatic Financial Trading Agent for Low-risk Portfolio Management using Deep Reinforcement Learning”,
  2. Li, P. Zhao, S. C. Hoi, and V. Gopalkrishnan, “PAMR: passive aggressive mean reversion strategy for portfolio selection,” Machine learning, vol. 87, pp. 221-258, 2012.
  3. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth, “On‐line portfolio selection using multiplicative updates,” Mathematical Finance, vol. 8, pp. 325-347, 1998.,can%20be%20interpreted%20as%20probabilities.

Wie Maschinen uns verstehen: Natural Language Understanding

Foto von Sebastian Bill auf Unsplash.

Natural Language Understanding (NLU) ist ein Teilbereich von Computer Science, der sich damit beschäftigt natürliche Sprache, also beispielsweise Texte oder Sprachaufnahmen, verstehen und verarbeiten zu können. Das Ziel ist es, dass eine Maschine in der gleichen Weise mit Menschen kommunizieren kann, wie es Menschen untereinander bereits seit Jahrhunderten tun.

Was sind die Bereiche von NLU?

Eine neue Sprache zu erlernen ist auch für uns Menschen nicht einfach und erfordert viel Zeit und Durchhaltevermögen. Wenn eine Maschine natürliche Sprache erlernen will, ist es nicht anders. Deshalb haben sich einige Teilbereiche innerhalb des Natural Language Understandings herausgebildet, die notwendig sind, damit Sprache komplett verstanden werden kann.

Diese Unterteilungen können auch unabhängig voneinander genutzt werden, um einzelne Aufgaben zu lösen:

  • Speech Recognition versucht aufgezeichnete Sprache zu verstehen und in textuelle Informationen umzuwandeln. Das macht es für nachgeschaltete Algorithmen einfacher die Sprache zu verarbeiten. Speech Recognition kann jedoch auch alleinstehend genutzt werden, beispielsweise um Diktate oder Vorlesungen in Text zu verwandeln.
  • Part of Speech Tagging wird genutzt, um die grammatikalische Zusammensetzung eines Satzes zu erkennen und die einzelnen Satzbestandteile zu markieren.
  • Named Entity Recognition versucht innerhalb eines Textes Wörter und Satzbausteine zu finden, die einer vordefinierten Klasse zugeordnet werden können. So können dann zum Beispiel alle Phrasen in einem Textabschnitt markiert werden, die einen Personennamen enthalten oder eine Zeit ausdrücken.
  • Sentiment Analysis klassifiziert das Sentiment, also die Gefühlslage, eines Textes in verschiedene Stufen. Dadurch kann beispielsweise automatisiert erkannt werden, ob eine Produktbewertung eher positiv oder eher negativ ist.
  • Natural Language Generation ist eine allgemeine Gruppe von Anwendungen mithilfe derer automatisiert neue Texte generiert werden sollen, die möglichst natürlich klingen. Zum Beispiel können mithilfe von kurzen Produkttexten ganze Marketingbeschreibungen dieses Produkts erstellt werden.

Welche Algorithmen nutzt man für NLP?

Die meisten, grundlegenden Anwendungen von NLP können mit den Python Modulen spaCy und NLTK umgesetzt werden. Diese Bibliotheken bieten weitreichende Modelle zur direkten Anwendung auf einen Text, ohne vorheriges Trainieren eines eigenen Algorithmus. Mit diesen Modulen ist ohne weiteres ein Part of Speech Tagging oder Named Entity Recognition in verschiedenen Sprachen möglich.

Der Hauptunterschied zwischen diesen beiden Bibliotheken ist die Ausrichtung. NLTK ist vor allem für Entwickler gedacht, die eine funktionierende Applikation mit Natural Language Processing Modulen erstellen wollen und dabei auf Performance und Interkompatibilität angewiesen sind. SpaCy hingegen versucht immer Funktionen bereitzustellen, die auf dem neuesten Stand der Literatur sind und macht dabei möglicherweise Einbußen bei der Performance.

Für umfangreichere und komplexere Anwendungen reichen jedoch diese Optionen nicht mehr aus, beispielsweise wenn man eine eigene Sentiment Analyse erstellen will. Je nach Anwendungsfall sind dafür noch allgemeine Machine Learning Modelle ausreichend, wie beispielsweise ein Convolutional Neural Network (CNN). Mithilfe von Tokenizern von spaCy oder NLTK können die einzelnen in Wörter in Zahlen umgewandelt werden, mit denen wiederum das CNN als Input arbeiten kann. Auf heutigen Computern sind solche Modelle mit kleinen Neuronalen Netzwerken noch schnell trainierbar und deren Einsatz sollte deshalb immer erst geprüft und möglicherweise auch getestet werden.

Jedoch gibt es auch Fälle in denen sogenannte Transformer Modelle benötigt werden, die im Bereich des Natural Language Processing aktuell state-of-the-art sind. Sie können inhaltliche Zusammenhänge in Texten besonders gut mit in die Aufgabe einbeziehen und liefern daher bessere Ergebnisse beispielsweise bei der Machine Translation oder bei Natural Language Generation. Jedoch sind diese Modelle sehr rechenintensiv und führen zu einer sehr langen Rechenzeit auf normalen Computern.

Was sind Transformer Modelle?

In der heutigen Machine Learning Literatur führt kein Weg mehr an Transformer Modellen aus dem Paper „Attention is all you need“ (Vaswani et al. (2017)) vorbei. Speziell im Bereich des Natural Language Processing sind die darin erstmals beschriebenen Transformer Modelle nicht mehr wegzudenken.

Transformer werden aktuell vor allem für Übersetzungsaufgaben genutzt, wie beispielsweise auch bei Darüber hinaus sind diese Modelle auch für weitere Anwendungsfälle innerhalb des Natural Language Understandings geeignet, wie bspw. das Beantworten von Fragen, Textzusammenfassung oder das Klassifizieren von Texten. Das GPT-2 Modell ist eine Implementierung von Transformern, dessen Anwendungen und die Ergebnisse man hier ausprobieren kann.

Was macht den Transformer so viel besser?

Soweit wir wissen, ist der Transformer jedoch das erste Transduktionsmodell, das sich ausschließlich auf die Selbstaufmerksamkeit (im Englischen: Self-Attention) stützt, um Repräsentationen seiner Eingabe und Ausgabe zu berechnen, ohne sequenzorientierte RNNs oder Faltung (im Englischen Convolution) zu verwenden.

Übersetzt aus dem englischen Originaltext: Attention is all you need (Vaswani et al. (2017)).

In verständlichem Deutsch bedeutet dies, dass das Transformer Modell die sogenannte Self-Attention nutzt, um für jedes Wort innerhalb eines Satzes die Beziehung zu den anderen Wörtern im gleichen Satz herauszufinden. Dafür müssen nicht, wie bisher, Recurrent Neural Networks oder Convolutional Neural Networks zum Einsatz kommen.

Was dieser Mechanismus konkret bewirkt und warum er so viel besser ist, als die vorherigen Ansätze wird im folgenden Beispiel deutlich. Dazu soll der folgende deutsche Satz mithilfe von Machine Learning ins Englische übersetzt werden:

„Das Mädchen hat das Auto nicht gesehen, weil es zu müde war.“

Für einen Computer ist diese Aufgabe leider nicht so einfach, wie für uns Menschen. Die Schwierigkeit an diesem Satz ist das kleine Wort „es“, dass theoretisch für das Mädchen oder das Auto stehen könnte. Aus dem Kontext wird jedoch deutlich, dass das Mädchen gemeint ist. Und hier ist der Knackpunkt: der Kontext. Wie programmieren wir einen Algorithmus, der den Kontext einer Sequenz versteht?

Vor Veröffentlichung des Papers „Attention is all you need“ waren sogenannte Recurrent Neural Networks die state-of-the-art Technologie für solche Fragestellungen. Diese Netzwerke verarbeiten Wort für Wort eines Satzes. Bis man also bei dem Wort „es“ angekommen ist, müssen erst alle vorherigen Wörter verarbeitet worden sein. Dies führt dazu, dass nur noch wenig Information des Wortes „Mädchen“ im Netzwerk vorhanden sind bis den Algorithmus überhaupt bei dem Wort „es“ angekommen ist. Die vorhergegangenen Worte „weil“ und „gesehen“ sind zu diesem Zeitpunkt noch deutlich stärker im Bewusstsein des Algorithmus. Es besteht also das Problem, dass Abhängigkeiten innerhalb eines Satzes verloren gehen, wenn sie sehr weit auseinander liegen.

Was machen Transformer Modelle anders? Diese Algorithmen prozessieren den kompletten Satz gleichzeitig und gehen nicht Wort für Wort vor. Sobald der Algorithmus das Wort „es“ in unserem Beispiel übersetzen will, wird zuerst die sogenannte Self-Attention Layer durchlaufen. Diese hilft dem Programm andere Wörter innerhalb des Satzes zu erkennen, die helfen könnten das Wort „es“ zu übersetzen. In unserem Beispiel werden die meisten Wörter innerhalb des Satzes einen niedrigen Wert für die Attention haben und das Wort Mädchen einen hohen Wert. Dadurch ist der Kontext des Satzes bei der Übersetzung erhalten geblieben.

How Do Various Actor-Critic Based Deep Reinforcement Learning Algorithms Perform on Stock Trading?

Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy


Deep Reinforcement Learning (DRL) is a blooming field famous for addressing a wide scope of complex decision-making tasks. This article would introduce and summarize the paper “Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy”, and discuss how these actor-critic based DRL learning algorithms, Proximal Policy Optimization (PPO), Advantage Actor Critic (A2C), and Deep Deterministic Policy Gradient (DDPG), act to accomplish automated stock trading by boosting investment return.

1 Motivation and Related Technology

It has long been challenging to design a comprehensive strategy for capital allocation optimization in a complex and dynamic stock market. With development of Artificial Intelligence, machine learning coupled with fundamentals analysis and alternative data has been in trend and provides better performance than conventional methodologies. Reinforcement Learning (RL) as a branch of it, is able to learn from interactions with environment, during which the agent continuously absorbs information, takes actions, and learns to improve its policy regarding rewards or losses obtained. On top of that, DRL utilizes neural networks as function approximators to approximate the Q-value (the expected reward of each action) in RL, which in return adjusts RL for large-scale data learning.

In DRL, the critic-only approach is capable for solving discrete action space problems, calculating Q-value to learn the optimal action-selection policy. On the other side, the actor-only approach, used in continuous action space environments, directly learns the optimal policy itself. Combining both, the actor-critic algorithm simultaneously updates the actor network representing the policy, and critic network representing the value function. The critic estimates the value function, while the actor updates the policy guided by the critic with policy gradients.

Overview of reinforcement learning-based stock theory.

Figure 1: Overview of reinforcement learning-based stock theory.

2 Mathematical Modeling

2.1 Stock Trading Simulation

Given the stochastic nature of stock market, the trading process is modeled as a Markov Decision Process (MDP) as follows:

  • State s = [p, h, b]: a vector describing the current state of the portfolio consists of D stocks, includes stock prices vector p, the stock shares vector h, and the remaining balance b.
  • Action a: a vector of actions which are selling, buying, or holding (Fig.2), resulting in decreasing, increasing, and no change of shares h, respectively. The number of shares been transacted is recorded as k.
  • Reward r(s, a, s’): the reward of taking action a at state s and arriving at the new state s’.
  • Policy π(s): the trading strategy at state s, which is the probability distribution of actions.
  • Q-value : the expected reward of taking action a at state s following policy π.
A starting portfolio value with three actions result in three possible portfolios.

A starting portfolio value with three actions result in three possible portfolios. Note that “hold” may lead to different portfolio values due to the changing stock prices.

Besides, several assumptions and constraints are proposed for practice:

  • Market liquidity: the orders are rapidly executed at close prices.
  • Nonnegative balance: the balance at time t+1 after taking actions at t, equals to the original balance plus the proceeds of selling minus the spendings of buying:
  • Transaction cost: assume the transaction costs to be 0.1% of the value of each trade:
  • Risk-aversion: to control the risk of stock market crash caused by major emergencies, the financial turbulence index that measures extreme asset price movements is introduced:

    where  denotes the stock returns, µ and Σ are respectively the average and covariance of historical returns. When  exceeds a threshold, buying will be halted and the agent sells all shares. Trading will be resumed once  returns to normal level.

2.2 Trading Goal: Return Maximation

The goal is to design a trading strategy that raises agent’s total cumulative compensation given by the reward function:

and then considering the transition of the shares and the balance defined as:

the reward can be further decomposed:


At inception, h and Q_{\pi}(s,a) are initialized to 0, while the policy π(s) is uniformly distributed among all actions. Afterwards, everything is updated through interacting with the stock market environment. By the Bellman Equation, Q_{\pi}(s_t, a_t) is the expectation of the sum of direct reward r(s_t,a_t,s_{t+1} and the future reqard Q_{\pi}(s{t+1}, a_{a+1}) at the next state discounted by a factor γ, resulting in the state-action value function:

2.3 Environment for Multiple Stocks

OpenAI gym is used to implement the multiple stocks trading environment and to train the agent.

  1. State Space: a vector [b_t, p_t, h_t, M_t, R_t, C_t, X_t] storing information about
    b_t: Portfolio balance
    p_t: Adjusted close prices
    h_t: Shares owned of each stock
    M_t: Moving Average Convergence Divergence
    R_t: Relative Strength Index
    C_t: Commodity Channel Index
    X_t: Average Directional Index
  2. Action Space: {−k, …, −1, 0, 1, …, k} for a single stock, whose elements representing the number of shares to buy or sell. The action space is then normalized to [−1, 1], since A2C and PPO are defined directly on a Gaussian distribution.
Overview of the load-on-demand technique.

Overview of the load-on-demand technique.

Furthermore, a load-on-demand technique is applied for efficient use of memory as shown above.

  1. Algorithms Selection

This paper mainly uses the following three actor-critic algorithms:

  • A2C: uses parallel copies of the same agent to update gradients for different data samples, and a coordinator to pass the average gradients over all agents to a global network, which can update the actor and the critic network, with the objective function:
  • where \pi_{\theta}(a_t|s_t) is the policy network, and A(S_t|a_t) is the advantage function to reduce the high variance of it:
  • V(S_t)is the value function of state S_t, regardless of actions. DDPG: combines the frameworks of Q-learning and policy gradients and uses neural networks as function approximators; it learns directly from the observations through policy gradient and deterministically map states to actions. The Q-value is updated by:
    Critic network is then updated by minimizing the loss function:
  • PPO: controls the policy gradient update to ensure that the new policy does not differ too much from the previous policy, with the estimated advantage function and a probability ratio:

    The clipped surrogate objective function:

    takes the minimum of the clipped and normal objective to restrict the policy update at each step and improve the stability of the policy.

An ensemble strategy is finally proposed to combine the three agents together to build a robust trading strategy. After training and testing the three agents concurrently, in the trading stage, the agent with the highest Sharpe ratio in one period will be automatically selected to use in the next period.

  1. Implementation: Training and Validation

The historical daily trading data comes from the 30 DJIA constituent stocks.

Stock data splitting in-sample and out-of-sample

Stock data splitting in-sample and out-of-sample.

  • In-sample training stage: data from 01/01/2009 – 09/30/2015 used to train 3 agents using PPO, A2C, and DDPG;
  • In-sample validation stage: data from 10/01/2015 – 12/31/2015 used to validate the 3 agents by 5 metrics: cumulative return, annualized return, annualized volatility, Sharpe ratio, and max drawdown; tune key parameters like learning rate and number of episodes;
  • Out-of-sample trading stage: unseen data from 01/01/2016 – 05/08/2020 to evaluate the profitability of algorithms while continuing training. In each quarter, the agent with the highest Sharpe ratio is selected to act in the next quarter, as shown below.

    Table 1 - Sharpe Ratios over time.

    Table 1 – Sharpe Ratios over time.

  1. Results Analysis and Conclusion

From Table II and Fig.5, one can notice that PPO agent is good at following trend and performs well in chasing for returns, with the highest cumulative return 83.0% and annual return 15.0% among the three agents, indicating its appropriateness in a bullish market. A2C agent is more adaptive to handle risk, with the lowest annual volatility 10.4% and max drawdown −10.2%, suggesting its capability in a bearish market. DDPG generates the lowest return among the three, but works fine under risk, with lower annual volatility and max drawdown than PPO. Apparently all three agents outperform the two benchmarks.

Table 2 - Performance Evaluation Comparison.

Table 2 – Performance Evaluation Comparison.

Moreover, it is obvious in Fig.6 that the ensemble strategy and the three agents act well during the 2020 stock market crash, when the agents successfully stops trading, thus cutting losses.

Performance during the stock market crash in the first quarter of 2020.

Performance during the stock market crash in the first quarter of 2020.

From the results, the ensemble strategy demonstrates satisfactory returns and lowest volatilities. Although its cumulative returns are lower than PPO, it has achieved the highest Sharpe ratio 1.30 among all strategies. It is reasonable that the ensemble strategy indeed performs better than the individual algorithms and baselines, since it works in a way each elemental algorithm is supplementary to others while balancing risk and return.

For further improvement, it will be inspiring to explore more models such as Asynchronous Advantage Actor-Critic (A3C) or Twin Delayed DDPG (TD3), and to take more fundamental analysis indicators or ESG factors into consideration. While more sophisticated models and larger datasets are adopted, improvement of efficiency may also be a challenge.

Generative Adversarial Networks GANs

Generative Adversarial Networks

After Deep Autoregressive Models, Deep Generative Modelling and Variational Autoencoders we now continue the discussion with Generative Adversarial Networks (GANs).


So far, in the series of deep generative modellings (DGMs [Yad22a]), we have covered autoregressive modelling, which estimates the exact log likelihood defined by the model and variational autoencoders, which was variational approximations for lower bound optimization. Both of these modelling techniques were explicitly defining density functions and optimizing the likelihood of the training data. However, in this blog, we are going to discuss generative adversarial networks (GANs), which are likelihood-free models and do not define density functions explicitly. GANs follow a game-theoretic approach and learn to generate from the training distribution through a set up of a two-player game.

A two player model of GAN along with the generator and discriminators.

A two player model of GAN along with the generator and discriminators.

GAN tries to learn the distribution of high dimensional training data and generates high-quality synthetic data which has a similar distribution to training data. However, learning the training distribution is a highly complex task therefore GAN utilizes a two-player game approach to overcome the high dimensional complexity problem. GAN has two different neural networks (as shown in Figure ??) the generator and the discriminator. The generator takes a random input z\sim p(z) and produces a sample that has a similar distribution as p_d. To train this network efficiently, there is the other network that is utilized as the second player and known as the discriminator. The generator network (player one) tries to fool the discriminator by generating real looking images. Moreover, the discriminator network tries to distinguish between real (training data x\sim p_d(x)) and fake images effectively. Our main aim is to have an efficiently trained discriminator to be able to distinguish between real and fake images (the generator’s output) and on the other hand, we would like to have a generator, which can easily fool the discriminator by generating real-looking images.

Objective function and training

Objective function

Simultaneous training of these two networks is one of the main challenges in GANs and a minimax loss function is defined for this purpose. To understand this minimax function, firstly, we would like to discuss the concept of two sample testing by Aditya grover [Gro20]. Two sample testing is a method to compute the discrepancy between the training data distribution and the generated data distribution:

(1)   \begin{equation*} \min_{p_{\theta_g}}\: \max_{D_{\theta_d}\in F} \: \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] - \mathbb{E}_{x\sim p_{\theta_g}} [D_{\theta_d}(G_{\theta_g}(x))], \end{equation*}

where p_{\theta_g} and p_d are the distribution functions of generated and training data respectively. The term F is a set of functions. The \textit{max} part is computing the discrepancies between two distribution using a function D_{\theta_d} \in F and this part is very similar to the term d (discrepancy measure) from our first article (Deep Generative Modelling) and KL-divergence is applied to compute this measure in second article (Deep Autoregressive Models) and third articles (Variational Autoencoders). However, in GANs, for a given set of functions F, we would like compute the distribution p_{\theta_g}, which minimizes the overall discrepancy even for a worse function D_{\theta_d}\in F. The above mentioned objective function does not use any likelihood function and utilizing two different data samples from training and generated data respectively.

By combining Figure ?? and Equation 1, the first term \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] corresponds to the discriminator, which has direct access to the training data and the second term \mathbb{E}_{x\sim p_{\theta_g}}[D_{\theta_d}(G_{\theta_g}(x))] represents the generator part as it relies only on the latent space and produces synthetic data. Therefore, Equation 1 can be rewritten in the form of GAN’s two players as:

(2)   \begin{equation*} \min_{p_{\theta_g}}\: \max_{D_{\theta_d}\in F} \: \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] - \mathbb{E}_{z\sim p_z}[D_{\theta_d}(G_{\theta_g}(z))], \end{equation*}

The above equation can be rearranged in the form of log loss:

(3)   \begin{equation*} \min_{\theta_g}\: \max_{\theta_d} \: (\mathbb{E}_{x\sim p_d} [log \: D_{\theta_d} (x)] + \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z))]), \end{equation*}

In the above equation, the arguments are modified from p_{\theta_g} and D_{\theta_d} in F to \theta_g and  \theta_d respectively as we would like to approximate the network parameters, which are represented by \theta_g and \theta_d for the both generator and discriminator respectively. The discriminator wants to maximize the above objective for \theta_d such that D_{\theta_d}(x) \approx 1, which indicates that the outcome is close to the real data. Furthermore, D_{\theta_d}(G_{\theta_g}(z)) should be close to zero as it is fake data, therefore, the maximization of the above objective function for \theta_d will ensure that the discriminator is performing efficiently in terms of separating real and fake data. From the generator point of view, we would like to minimize this objective function for \theta_g such that D_{\theta_d}(G_{\theta_g}(z)) \approx 1. If the minimization of the objective function happens effectively for \theta_g then the discriminator will classify a fake data into a real data that means that the generator is producing almost real-looking samples.


The training procedure of GAN can be explained by using the following visualization from Goodfellow et al. [GPAM+14]. In Figure 2(a), z is a random input vector to the generator to produce a synthetic outcome x\sim p_{\theta_g} (green curve). The generated data distribution is not close to the original data distribution p_d (dotted black curve). Therefore, the discriminator classifies this image as a fake image and forces generator to learn the training data distribution (Figure 2(b) and (c)). Finally, the generator produces the image which could not detected as a fake data by discriminator(Figure 2(d)).

GAN’s training visualization: the dotted black, solid green lines represents pd and pθ respectively. The discriminator distribution is shown in dotted blue. This image taken from Goodfellow et al.

GAN’s training visualization: the dotted black, solid green lines represents pd and pθ
respectively. The discriminator distribution is shown in dotted blue. This image taken from Goodfellow
et al. [GPAM+14].

The optimization of the objective function mentioned in Equation 3 is performed in th following two steps repeatedly:
\item Firstly, the gradient ascent is utilized to maximize the objective function for \theta_d for discriminator.

(4)   \begin{equation*} \max_{\theta_d} \: (\mathbb{E}_{x\sim p_d} [log \: D_{\theta_d}(x)] + \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z))]) \end{equation*}

\item In the second step, the following function is minimized for the generator using gradient descent.

(5)   \begin{equation*} \min_{\theta_g} \: ( \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z))]) \end{equation*}


However, in practice the minimization for the generator does now work well because when D_{\theta_d}(G_{\theta_g}(z) \approx 1 then the term log \: (1-D_{\theta_d}(G_{\theta_g}(z))) has the dominant gradient and vice versa.

However, we would like to have the gradient behaviour completely opposite because D_{\theta_d}(G_{\theta_g}(z) \approx 1 means the generator is well trained and does not require dominant gradient values. However, in case of D_{\theta_d}(G_{\theta_g}(z) \approx 0, the generator is not well trained and producing low quality outputs therefore, it requires a dominant gradient for an efficient training. To fix this problem, the gradient ascent method is applied to maximize the modified generator’s objective:
In the second step, the following function is minimized for the generator using gradient descent alternatively.

(6)   \begin{equation*} \max_{\theta_g} \: \mathbb{E}_{z\sim p_z}[log \: (D_{\theta_d}(G_{\theta_g}(z))] \end{equation*}

therefore, during the training, Equation 4 and 6 will be maximized using the gradient ascent algorithm until the convergence.


The quality of the generated images using GANs depends on several factors. Firstly, the joint training of GANs is not a stable procedure and that could severely decrease the quality of the outcome. Furthermore, the different neural network architecture will modify the quality of images based on the sophistication of the used network. For example, the vanilla GAN [GPAM+14] uses a fully connected deep neural network and generates a quite decent result. Furthermore, DCGAN [RMC15] utilized deep convolutional networks and enhanced the quality of outcome significantly. Furthermore, different types of loss functions are applied to stabilize the training procedure of GAN and to produce high-quality outcomes. As shown in Figure 3, StyleGAN [KLA19] utilized Wasserstein metric [Yad22b] to generate high-resolution face images. As it can be seen from Figure 3, the quality of the generated images are enhancing with time by applying more sophisticated training techniques and network architectures.

GAN timeline with different variations in terms of network architecture and loss functions.

GAN timeline with different variations in terms of network architecture and loss functions.


This article covered the basics and mathematical concepts of GANs. However, the training of two different networks simultaneously could be complex and unstable. Therefore, researchers are continuously working to create a better and more stable version of GANs, for example, WGAN. Furthermore, different types of network architectures are introduced to improve the quality of outcomes. We will discuss this further in the upcoming blog about these variations.


[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, DavidWarde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in
neural information processing systems, 27, 2014.

[Gro20] Aditya Grover. Generative adversarial networks., 2020.

[KLA19] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for
generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 4401–4410, 2019.

[RMC15] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation
learning with deep convolutional generative adversarial networks. arXiv preprint
arXiv:1511.06434, 2015.

[Yad22a] Sunil Yadav. Deep generative modelling. https://data-scienceblog.
com/blog/2022/02/19/deep-generative-modelling/, 2022.

[Yad22b] Sunil Yadav. Necessary probability concepts for deep learning: Part 2.
deep-learning-995560752a53, 2022.

Automated product quality monitoring using artificial intelligence deep learning

How to maintain product quality with deep learning

Deep Learning helps companies to automate operative processes in many areas. Industrial companies in particular also benefit from product quality assurance by automated failure and defect detection. Computer Vision enables automation to identify scratches and cracks on product item surfaces. You will find more information about how this works in the following infografic from DATANOMIQ and pixolution you can download using the link below.

How to maintain product quality with automatic defect detection - Infographic

How to maintain product quality with automatic defect detection – Infographic

Variational Autoencoders

After Deep Autoregressive Models and Deep Generative Modelling, we will continue our discussion with Variational AutoEncoders (VAEs) after covering up DGM basics and AGMs. Variational autoencoders (VAEs) are a deep learning method to produce synthetic data (images, texts) by learning the latent representations of the training data. AGMs are sequential models and generate data based on previous data points by defining tractable conditionals. On the other hand, VAEs are using latent variable models to infer hidden structure in the underlying data by using the following intractable distribution function: 

(1)   \begin{equation*} p_\theta(x) = \int p_\theta(x|z)p_\theta(z) dz. \end{equation*}

The generative process using the above equation can be expressed in the form of a directed graph as shown in Figure ?? (the decoder part), where latent variable z\sim p_\theta(z) produces meaningful information of x \sim p_\theta(x|z).

Architectures AE and VAE based on the bottleneck architecture. The decoder part work as a generative model during inference.

Figure 1: Architectures AE and VAE based on the bottleneck architecture. The decoder part work as
a generative model during inference.


Autoencoders (AEs) are the key part of VAEs and are an unsupervised representation learning technique and consist of two main parts, the encoder and the decoder (see Figure ??). The encoders are deep neural networks (mostly convolutional neural networks with imaging data) to learn a lower-dimensional feature representation from training data. The learned latent feature representation z usually has a much lower dimension than input x and has the most dominant features of x. The encoders are learning features by performing the convolution at different levels and compression is happening via max-pooling.

On the other hand, the decoders, which are also a deep convolutional neural network are reversing the encoder’s operation. They try to reconstruct the original data x from the latent representation z using the up-sampling convolutions. The decoders are pretty similar to VAEs generative models as shown in Figure 1, where synthetic images will be generated using the latent variable z.

During the training of autoencoders, we would like to utilize the unlabeled data and try to minimize the following quadratic loss function:

(2)   \begin{equation*} \mathcal{L}(\theta, \phi) = ||x-\hat{x}||^2, \end{equation*}

The above equation tries to minimize the distance between the original input and reconstructed image as shown in Figure 1.

Variational autoencoders

VAEs are motivated by the decoder part of AEs which can generate the data from latent representation and they are a probabilistic version of AEs which allows us to generate synthetic data with different attributes. VAE can be seen as the decoder part of AE, which learns the set parameters \theta to approximate the conditional p_\theta(x|z) to generate images based on a sample from a true prior, z\sim p_\theta(z). The true prior p_\theta(z) are generally of Gaussian distribution.

Network Architecture

VAE has a quite similar architecture to AE except for the bottleneck part as shown in Figure 2. in AES, the encoder converts high dimensional input data to low dimensional latent representation in a vector form. On the other hand, VAE’s encoder learns the mean vector and standard deviation diagonal matrix such that z\sim \matcal{N}(\mu_z, \Sigma_x) as it will be performing probabilistic generation of data. Therefore the encoder and decoder should be probabilistic.


Similar to AGMs training, we would like to maximize the likelihood of the training data. The likelihood of the data for VAEs are mentioned in Equation 1 and the first term p_\theta(x|z) will be approximated by neural network and the second term p(x) prior distribution, which is a Gaussian function, therefore, both of them are tractable. However, the integration won’t be tractable because of the high dimensionality of data.

To solve this problem of intractability, the encoder part of AE was utilized to learn the set of parameters \phi to approximate the conditional q_\phi (z|x). Furthermore, the conditional q_\phi (z|x) will approximate the posterior p_\theta (z|x), which is intractable. This additional encoder part will help to derive a lower bound on the data likelihood that will make the likelihood function tractable. In the following we will derive the lower bound of the likelihood function:

(3)   \begin{equation*} \begin{flalign} \begin{aligned} log \: p_\theta (x) = & \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: \frac{p_\theta (x|z) p_\theta (z)}{p_\theta (z|x)} \: \frac{q_\phi(z|x)}{q_\phi(z|x)}\Bigg] \\ = & \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: p_\theta (x|z)\Bigg] - \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: \frac{q_\phi (z|x)} {p_\theta (z)}\Bigg] + \mathbf{E}_{z\sim q_\phi(z|x)} \Bigg[log \: \frac{q_\phi (z|x)}{p_\theta (z|x)}\Bigg] \\ = & \mathbf{E}_{z\sim q_\phi(z|x)} \Big[log \: p_\theta (x|z)\Big] - \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z)) + \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z|x)). \end{aligned} \end{flalign} \end{equation*}

In the above equation, the first line computes the likelihood using the logarithmic of p_\theta (x) and then it is expanded using Bayes theorem with additional constant q_\phi(z|x) multiplication. In the next line, it is expanded using the logarithmic rule and then rearranged. Furthermore, the last two terms in the second line are the definition of KL divergence and the third line is expressed in the same.

In the last line, the first term is representing the reconstruction loss and it will be approximated by the decoder network. This term can be estimated by the reparametrization trick \cite{}. The second term is KL divergence between prior distribution p_\theta(z) and the encoder function q_\phi (z|x), both of these functions are following the Gaussian distribution and has the closed-form solution and are tractable. The last term is intractable due to p_\theta (z|x). However, KL divergence computes the distance between two probability densities and it is always positive. By using this property, the above equation can be approximated as:

(4)   \begin{equation*} log \: p_\theta (x)\geq \mathcal{L}(x, \phi, \theta) , \: \text{where} \: \mathcal{L}(x, \phi, \theta) = \mathbf{E}_{z\sim q_\phi(z|x)} \Big[log \: p_\theta (x|z)\Big] - \mathbf{D}_{KL}(q_\phi (z|x), p_\theta (z)). \end{equation*}

In the above equation, the term \mathcal{L}(x, \phi, \theta) is presenting the tractable lower bound for the optimization and is also termed as ELBO (Evidence Lower Bound Optimization). During the training process, we maximize ELBO using the following equation:

(5)   \begin{equation*} \operatorname*{argmax}_{\phi, \theta} \sum_{x\in X} \mathcal{L}(x, \phi, \theta). \end{equation*}


Furthermore, the reconstruction loss term can be written using Equation 2 as the decoder output is assumed to be following Gaussian distribution. Therefore, this term can be easily transformed to mean squared error (MSE).

During the implementation, the architecture part is straightforward and can be found here. The user has to define the size of latent space, which will be vital in the reconstruction process. Furthermore, the loss function can be minimized using ADAM optimizer with a fixed batch size and a fixed number of epochs.

Figure 2: The results obtained from vanilla VAE (left) and a recent VAE-based generative model NVAE (right)

Figure 2: The results obtained from vanilla VAE (left) and a recent VAE-based generative
model NVAE (right)

In the above, we are showing the quality improvement since VAE was introduced by Kingma and
Welling [KW14]. NVAE is a relatively new method using a deep hierarchical VAE [VK21].


In this blog, we discussed variational autoencoders along with the basics of autoencoders. We covered
the main difference between AEs and VAEs along with the derivation of lower bound in VAEs. We
have shown using two different VAE based methods that VAE is still active research because in general,
it produces a blurry outcome.

Further readings

Here are the couple of links to learn further about VAE-related concepts:
1. To learn basics of probability concepts, which were used in this blog, you can check this article.
2. To learn more recent and effective VAE-based methods, check out NVAE.
3. To understand and utilize a more advance loss function, please refer to this article.


[KW14] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2014.
[VK21] Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder, 2021.

Training of Deep Learning AI models

Ein KI Projekt richtig umsetzen : So geht’s

Sie wollen in Ihrem Unternehmen Kosten senken und effizientere Workflows einführen? Dann haben Sie vielleicht schon darüber nachgedacht, Prozesse mit Künstlicher Intelligenz zu automatisieren. Für einen gelungenen Start, besprechen wir nun, wie ein KI-Projekt abläuft und wie man es richtig umsetzt.

Wir von DATANOMIQ und pixolution teilen unsere Erfahrungen aus Deep Learning Projekten, wo es vor allem um die Optimierung und Automatisierung von Unternehmensprozessen rund um visuelle Daten geht, etwa Bilder oder Videos. Wir stellen Ihnen die einzelnen Projektschritte vor, verraten Ihnen, wo dabei die Knackpunkte liegen und wie alle Beteiligten dazu beitragen können, ein KI-Projekt zum Erfolg zu führen.

1. Erstgespräch

In einem Erstgespräch nehmen wir Ihre Anforderungen auf.

  • Bestandsaufnahme Ihrer aktuellen Prozesse und Ihrer Änderungswünsche: Wie sind Ihre aktuellen Prozesse strukturiert? An welchen Prozessen möchten Sie etwas ändern?
  • Zielformulierung: Welches Endergebnis wünschen Sie sich? Wie genau sollen die neuen Prozesse aussehen? Das Ziel sollte so detailliert wie möglich beschrieben werden.
  • Budget: Welches Budget haben Sie für dieses Projekt eingeplant? Zusammen mit dem formulierten Ziel gibt das Budget die Wege vor, die wir zusammen in dem Projekt gehen können. Meist wollen Sie durch die Einführung von KI Kosten sparen oder höhere Umsätze erreichen. Das spielt für Höhe des Budgets die entscheidende Rolle.
  • Datenlage: Haben Sie Daten, die wir für das Training verwenden können? Wenn ja, welche und wieviele Daten sind das? Ist eine kontinuierliche Datenerfassung vorhanden, die während des Projekts genutzt werden kann, oder muss dafür erst die Grundlage geschaffen werden?

2. Evaluation

In diesem Schritt evaluieren und planen wir mit Ihnen gemeinsam die Umsetzung des Projekts. Das bedeutet im Einzelnen folgendes.

Begutachtung der Daten und weitere Datenplanung

Wir sichten von Ihnen bereitgestellte Trainingsdaten, z.B. gelabelte Bilder, und machen uns ein Bild davon, ob diese für das Training sinnvoll verwendet werden können. Da man für Deep Learning sehr viele Trainingsdaten benötigt, ist das ein entscheidender Punkt. In die Begutachtung der Daten fließt auch die Beurteilung der Qualität und Ausgewogenheit ein, denn davon ist abhängig wie gut ein KI-Modell lernt und korrekte Vorhersagen trifft.

Wenn von Ihnen keinerlei Daten zum Projektstart bereitgestellt werden können, wird zuerst ein separates Projekt notwendig, das nur dazu dient, Daten zu sammeln. Das bedeutet für Sie etwa je nach Anwendbarkeit den Einkauf von Datensets oder Labeling-Dienstleistungen.
Wir stehen Ihnen dabei beratend zur Seite.

Während der gesamten Dauer des Projekts werden immer wieder neue Daten benötigt, um die Qualität des Modells weiter zu verbessern. Daher müssen wir mit Ihnen gemeinsam planen, wie Sie fortlaufend diese Daten erheben, falsche Predictions des Modells erkennen und korrigieren, sodass Sie diese uns bereitstellen können. Die richtig erkannten Daten sowie die falsch erkannten und dann korrigierten Daten werden nämlich in das nächste Training einfließen.

Definition des Minimum Viable Product (MVP)

Wir definieren mit Ihnen zusammen, wie eine minimal funktionsfähige Version der KI aussehen kann. Die Grundfrage hierbei ist: Welche Komponenten oder Features sollten als Erstes in den Produktivbetrieb gehen, sodass Sie möglichst schnell einen Mehrwert aus
der KI ziehen?

Ein Vorteil dieser Herangehensweise ist, dass Sie den neuen KI-basierten Prozess in kleinem Maßstab testen können. Gleichzeitig können wir Verbesserungen schneller identifizieren. Zu einem späteren Zeitpunkt können Sie dann skalieren und weitere Features aufnehmen. Die schlagenden Argumente, mit einem MVP zu starten, sind jedoch die Kostenreduktion und Risikominimierung. Anstatt ein riesiges Projekt umzusetzen wird ein kleines Mehrwert schaffendes Projekt geschnürt und in der Realität getestet. So werden Fehlplanungen und
-entwicklungen vermieden, die viel Geld kosten.

Definition der Key Performance Indicators (KPI)

Key Performance Indicators sind für die objektive Qualitätsmessung der KI und des Business Impacts wichtig. Diese Zielmarken definieren, was das geplante System leisten soll, damit es erfolgreich ist. Key Performance Indicators können etwa sein:

  • Durchschnittliche Zeitersparnis des Prozesses durch Teilautomatisierung
  • Garantierte Antwortzeit bei maximalem Anfrageaufkommen pro Sekunde
  • Parallel mögliche Anfragen an die KI
  • Accuracy des Modells
  • Zeit von Fertigstellung bis zur Implementierung des KI Modells

Planung in Ihr Produktivsystem

Wir planen mit Ihnen die tiefe Integration in Ihr Produktivsystem. Dabei sind etwa folgende Fragen wichtig: Wie soll die KI in der bestehenden Softwareumgebung und im Arbeitsablauf genutzt werden? Was ist notwendig, um auf die KI zuzugreifen?

Mit dem Erstgespräch und der Evaluation ist nun das Fundament für das Projekt gelegt. In den Folgeschritten treiben wir die Entwicklung nun immer weiter voran. Die Schritte 3 bis 5 werden dabei solange wiederholt bis wir von der minimal funktionsfähigen
Produktversion, dem MVP, bis zum gewünschten Endprodukt gelangt sind.

3. Iteration

Wir trainieren den Algorithmus mit dem Großteil der verfügbaren Daten. Anschließend überprüfen wir die Performance des Modells mit ungesehenen Daten.

Wie lange das Training dauert ist abhängig von der Aufgabe. Man kann jedoch sagen, dass das Trainieren eines Deep Learning Modells für Bilder oder Videos komplexer und zeitaufwändiger ist als bei textbasierten maschinellen Lernaufgaben. Das liegt daran, dass wir tiefe Modelle (mit vielen Layern) verwenden und die verarbeiteten Datenmengen in der Regel sehr groß sind.

Das Trainieren des Modells ist je nach Projekt jedoch nur ein Bruchstück des ganzen Entwicklungsprozesses, den wir leisten. Oft ist es notwendig, dass wir einen eigenen Prozess aufbauen, in den das Modell eingebettet werden kann, wie z.B. einen Webservice.

4. Integration

Ist eine akzeptable Qualitätsstufe des Modells nach dem Training erreicht, liefern wir Ihnen eine erste Produktversion aus. Üblicherweise stellen wir Ihnen die Version als Docker Image mit API zur Verfügung. Sie beginnen dann mit der Integration in Ihr System und Ihre Workflows. Wir begleiten Sie dabei.

5. Feedback erfassen

Nachdem die Integration in den Produktivbetrieb erfolgt ist, ist es sehr wichtig, dass Sie aus der Nutzung Daten sammeln. Nur so können Sie beurteilen, ob die KI funktioniert wie Sie es sich vorgestellt haben und ob es in die richtige Richtung geht. Es geht also darum, zu erfassen was das Modell im Realbetrieb kann und was nicht. Diese Daten sammeln Sie und übermitteln sie an uns. Wir speisen diese dann in nächsten Trainingslauf ein.

Es ist dabei nicht ungewöhnlich, dass diese Datenerfassung im Realbetrieb eine gewisse Zeit in Anspruch nimmt. Das ist natürlich davon abhängig, in welchem Umfang Sie Daten erfassen. Bis zum Beginn der nächsten Iteration können so üblicherweise Wochen oder sogar Monate vergehen.

Die nächste Iteration

Um mit der nächsten Iteration eine signifikante Steigerung der Ergebnisqualität zu erreichen, kann es notwendig sein, dass Sie uns mehr Daten oder andere Daten zur Verfügung stellen, die aus dem Realbetrieb anfallen.

Eine nächste Iteration kann aber auch motiviert sein durch eine Veränderung in den Anforderungen, wenn etwa bei einem Klassifikationsmodell neue Kategorien erkannt werden müssen. Das aktuelle Modell kann für solche Veränderungen dann keine guten Vorhersagen treffen und muss erst mit entsprechenden neuen Daten trainiert werden.

Tipps für ein erfolgreiches KI Projekt

Ein entscheidender Knackpunkt für ein erfolgreiches KI Projekt ist das iterative Vorgehen und schrittweise Einführen eines KI-basierten Prozesses, mit dem die Qualität und Funktionsbreite der Entwicklung gesteigert wird.

Weiterhin muss man sich darüber klar sein, dass die Bereitstellung von Trainingsdaten kein statischer Ablauf ist. Es ist ein Kreislauf, in dem Sie als Kunde eine entscheidende Rolle einnehmen. Ein letzter wichtiger Punkt ist die Messbarkeit des Projekts. Denn nur wenn die Zielwerte während des Projekts gemessen werden, können Rückschritte oder Fortschritte gesehen werden und man kann schließlich am Ziel ankommen.

Möglich wurde dieser Artikel durch die großartige Zusammenarbeit mit pixolution, einem Unternehmen für AI Solutions im Bereich Computer Vision (Visuelle Bildsuche und individuelle KI Lösungen).