Einführung und Vertiefung in R Statistics mit den Dortmunder R-Kursen!

Im Rahmen der Dortmunder R Kurse bieten wir unsere Expertise in Schulungen für die Programmiersprache R an. Zielgruppe unserer Fortbildungen sind nicht nur Statistiker, sondern auch Anwender jeder Fachrichtung aus Industrie und Forschungseinrichtungen, die mit R ihre Daten analysieren wollen. Die Dortmunder R-Kurse werden ausschließlich von Statistikern mit langjähriger Erfahrung angeboten. Die Referenten gehören zum engsten Kreis der internationalen R-Gemeinschaft. Die angebotenen Kurse haben sich vielfach national und international bewährt.

Unsere Termine für die Online-Durchführung in diesem Jahr:

8., 9. und 10. Juni: R-Basiskurs (jeweils 9:00 – 14:00 Uhr)

22., 23., 24. und 25. Juni: R-Vertiefungskurs (jeweils 9:00 – 13:00 Uhr)

Kosten jeweils 750.00€, bei Buchung beider Kurse im Juni erhalten Sie einen Preisnachlass von 200€.

Zur Anmeldung gelangen Sie über den nachfolgenden Link:

R Basiskurs

Das Seminar R Basiskurs für Anfänger findet am 8., 9. und 10. Juni 2020 statt. Den Teilnehmern wird der praxisrelevante Part der Programmiersprache näher gebracht, um so die Grundlagen zur ersten Datenanalyse — von Datensatz zu statistischen Kennzahlen und ersten Visualisierungen — zu schaffen. Anmeldeschluss ist der 25. Mai 2020.


  • Installation von R und zugehöriger Entwicklungsumgebung
  • Grundlagen von R: Syntax, Datentypen, Operatoren, Funktionen, Indizierung
  • R-Hilfe effektiv nutzen
  • Ein- und Ausgabe von Daten
  • Behandlung fehlender Werte
  • Statistische Kennzahlen
  • Visualisierung

R Vertiefungskurs

Das Seminar R-Vertiefungskurs für Fortgeschrittene findet am 22., 23., 24. und 25. Juni (jeweils von 9:00 – 13:00 Uhr) statt. Die Veranstaltung ist ideal für Teilnehmende mit ersten Vorkenntnissen, die ihre Analysen effizient mit R durchführen möchten. Anmeldeschluss ist der 11. Juni 2020.

Der Vertiefungskurs baut inhaltlich auf dem Basiskurs auf. Es besteht aber keine Verpflichtung, bei Besuch des Vertiefungskurses zuvor den Basiskurs zu absolvieren, wenn bereits entsprechende Vorkenntnisse in R vorhanden sind.


  • Eigene Funktionen, Schleifen vermeiden durch *apply
  • Einführung in ggplot2 und dplyr
  • Statistische Tests und Lineare Regression
  • Dynamische Berichterstellung
  • Angewandte Datenanalyse anhand von Fallbeispielen

Links zur Veranstaltung direkt:

R-Basiskurs: https://dortmunder-r-kurse.de/kurse/r-basiskurs/

R-Vertiefungskurs: https://dortmunder-r-kurse.de/kurse/r-vertiefungskurs/

Data Analytics & Artificial Intelligence Trends in 2020

Artificial intelligence has infiltrated all aspects of our lives and brought significant improvements.

Although the first thing that comes to most people’s minds when they think about AI are humanoid robots or intelligent machines from sci-fi flicks, this technology has had the most impressive advancements in the field of data science.

Big data analytics is what has already transformed the way we do business as it provides an unprecedented insight into a vast amount of unstructured, semi-structured, and structured data by analyzing, processing, and interpreting it.

Data and AI specialists and researchers are likely to have a field day in 2020, so here are some of the most important trends in this industry.

1. Predictive Analytics

As its name suggests, this trend will be all about using gargantuan data sets in order to predict outcomes and results.

This practice is slated to become one of the biggest trends in 2020 because it will help businesses improve their processes tremendously. It will find its place in optimizing customer support, pricing, supply chain, recruitment, and retail sales, to name just a few.

For example, Amazon has already been leveraging predictive analytics for its dynamic pricing model. Namely, the online retail giant uses this technology to analyze the demand for a particular product, competitors’ prices, and a number of other parameters in order to adjust its price.

According to stats, Amazon changes prices 2.5 million times a day so that a particular product’s cost fluctuates and changes every 10 minutes, which requires an extremely predictive analytics algorithm.

2. Improved Cybersecurity

In a world of advanced technologies where IoT and remotely controlled devices having top-notch protection is of critical importance.

Numerous businesses and individuals have fallen victim to ruthless criminals who can steal sensitive data or wipe out entire bank accounts. Even some big and powerful companies suffered huge financial and reputation blows due to cyber attacks they were subjected to.

This kind of crime is particularly harsh for small and medium businesses. Stats say that 60% of SMBs are forced to close down after being hit by such an attack.

AI again takes advantage of its immense potential for analyzing and processing data from different sources quickly and accurately. That’s why it’s capable of assisting cybersecurity specialists in predicting and preventing attacks.

In case that an attack emerges, the response time is significantly shorter, so that the worst-case scenario can be avoided.

When we’re talking about avoiding security risks, AI can improve enterprise risk management, too, by providing guidance and assisting risk management professionals.

3. Digital Workers

In 2020, an army of digital workers will transform the traditional workspace and take productivity to a whole new level.

Virtual assistants and chatbots are some examples of already existing digital workers, but it will be even more of them. According to research, this trend is one the rise, as it’s expected that AI software and robots will increase by 50% by 2022.

Robots will take over even some small tasks in the office. The point is to streamline the entire business process, and that can be achieved by training robots to perform small and simple tasks like human employees. The only difference will be that digital workers will do that faster and without any mistakes.

4. Hybrid Workforce

Many people worry that AI and automation will steal their jobs and render them unemployed.

Even the stats are bleak – AI will eliminate 1.8 million jobs. But, on the other hand, it will create 2.3 million new jobs.

So, our future is actually AI and humans working together, and that’s what will become the business normalcy in 2020.

Robotic process automation and different office digital workers will be in charge of tedious and repetitive tasks, while more sophisticated issues that require critical thinking and creativity will be human workers’ responsibility.

One of the most important things about creating this hybrid workforce is for businesses to openly discuss it with their employees and explain how these new technologies will be used. A regular workforce has to know that they will be working alongside machines whose job will be to speed up the processes and cut costs.

5. Process Intelligence

This AI trend will allow businesses to gain insight into their processes by using all the information contained in their system and creating an overall, real-time, and accurate visual model of all the processes.

What’s great about it is that it’s possible to see these processes from different perspectives – across departments, functions, staff, and locations.

With such a visual model, it’s possible to properly analyze these processes, identify potential bottlenecks, and eliminate them before they even begin to emerge.

Besides, as this is AI and data analytics at their best, this technology will also facilitate decision-making by predicting the future results of tech investments.

Needless to say, Process Intelligence will become an enterprise standard very soon, thanks to its ability to provide a better understanding and effective management of end-to-end processes.

As you can see, in 2020, these two advanced technologies will continue to evolve and transform the business landscape and change it for the better.

Article series: 5 Clean Coding Tips – 5.Put yourself in somebody else’s shoes

This is the fifth of the article series “5 tips for clean coding” to follow as soon as you’ve made the first steps into your coding career, in this article series. Read the introduction here, to find out why it is important to write clean code if you missed it.

It might be a bit repetitive to bring up how important the readability of the code is, let’s do it anyway. In the majority of the cases you are writing for others, therefore you need to put yourself in their shoes to be able to assess how good the readability of your code is. For you, it all might be obvious because you wrote it. But it doesn’t have to be easy to read for someone else. If you have a colleague or a friend that has a bit of time for you and is willing to give you feedback, that is great. If, however, you don’t have such a person, having a few imaginary friends might be helpful in this case. It might sound crazy, but don’t close this page just yet. Having a set of imaginary personas at your disposal, to review your work with their eyes, can help you a lot. Imagine that your code met one of those guys. What would they say about it? If you work in a team or collaborate with people, you probably don’t have to imagine them. You’ve met them.

The_PEP8_guy – He has years of experience. He is used to seeing the code in a very particular way. He quotes the style guide during lunch. His fingers make the perfect line splitting and indentation without even his thoughts reaching the conscious state. He knows that lowercase_with_underscore is for variables, UPPER_CASE_NAMES are for constants and the CapitalizedWords are for classes. He will be lost if you do it in any different way. His expectations will not meet what you wrote, and he will not understand anything, because he will be too distracted by the messed up visual. Depending on the character he might start either crying or shouting. Read the style guide and follow it. You might be able to please this guy at least a little bit with the automatic tools like pylint.

The_ grieving _widow – Imagine that something happens to you. Let’s say, that you get hit by a bus[i]. You leave behind sadness and the_ grieving_widow to manage your code, your legacy. Will the future generations be able to make use of it or were you the only one who can understand anything you wrote? That is a bit of an extreme situation, ok. Alternatively, imagine, that you go for a 5-week vacation to a silent retreat with a strict no-phone policy (or that is what you tell your colleagues). Will they be able to carry on if they cannot ask you anything about the code? Review your code and the documentation from the perspective of the poor grieving_widow.

The_not_your_domain_guy – He is from the outside of the world you are currently in and he just does not understand your jargon. He doesn’t have to know that in data science a feature, a predictor and an x probably mean the same thing. SNR might shout signal-to-noise ratio at you, it will only snort at him. You might use abbreviations that are obvious to you but not to everyone. If you think that the majority of people can understand, and it helps with the code readability keep the abbreviations but just in case, document/comment them. There might be abbreviations specific to your company and, someone from the outside, a new guy, a consultant will not get them. Put yourself in the shoes of that guy and maybe make your code a bit more democratic wherever possible.

The_foreigner– You might be working in an environment, where every single person speaks the same language you speak, and it happens not to be English. So, you and your colleagues name variables and write the comments in your language. However, unless you work in a team with rules a strict as Athletic Bilbao, there might be a foreigner joining your team in the future. It is hard to argue that English is the lingua franca in programming (and in the world), these days. So, it might be worth putting yourself in the_foreigner’s shoes, while writing your code, to avoid a huge amount of work in the future, that the translation and explanation will require. And even if you are working on your own, you might want to make your code public one day and want as many people as possible to read it.

The_hurry_up_guy – we all know this guy. Sometimes he doesn’t have a body or a face, but we can feel his presence. You might want to write a perfect solution, comment it in the best possible way and maybe add a bit of glitter on top but sometimes you just need to give in and do it his way. And that’s ok too.


[i] https://en.wikipedia.org/wiki/Bus_factor

Article series: 5 Clean Coding Tips – 4. Stop commenting the obvious

This is the fourth of the article series “5 tips for clean coding” to follow as soon as you’ve made the first steps into your coding career, in this article series. Read the introduction here, to find out why it is important to write clean code if you missed it.

Everyone will tell you that you need to comment your code. You do it for yourself, for others, it might help you to put down a structure of your code before you get down to coding properly. Writing a lot of comments might give you a false sense of confidence, that you are doing a good job. While in reality, you are commenting your code a lot with obvious, redundant statements that are not bringing any value. The role of a comment it to explain, not to describe. You need to realize that any piece of comment has to add information to the code you already have, not to double it.

Keep in mind, you are not narrating the code, adding ‘subtitles’ to python’s performance. The comments are there to clarify what is not explicit in the code itself. Adding a comment saying what the line of code does is completely redundant most of the time:

A good rule of thumb would be: if it starts to sound like an instastory, rethink it. ‘So, I am having my breakfast, with a chai latte and my friend, the cat is here as well’. No.

It is also a good thing to learn to always update necessary comments before you modify the code. It is incredibly easy to modify a line of code, move on and forget the comment. There are people who claim that there are very few crimes in the world worse than comments that contradict the code itself.

Of course, there are situations, where you might be preparing a tutorial for others and you want to narrate what the code is doing. Then writing that load function will load the data is good. It does not have to be obvious for the listener. When teaching, repetitions, and overly explicit explanations are more than welcome. Always have in mind who your reader will be.

Ein Einblick in die Aktienmärkte unter Berücksichtigung von COVID-19


Die COVID-19-Pandemie hat uns alle fest im Griff. Besonders die Wirtschaft leidet stark unter den erforderlichen Maßnahmen, die weltweit angewendet werden. Wir wollen daher die Gelegenheit nutzen einen Blick auf die Aktienkurse zu wagen und analysieren, inwieweit der Virus einen Einfluss auf das Wachstum des Marktes hat.


Zuallererst werden wir uns auf die Industrie-, Schwellenländer und Grenzmärkte konzentrieren. Dafür nutzen wir die MSCI Global Investable Market Indizes (kurz GIMI), welche die zuvor genannten Gruppen abbilden. Die MSCI Inc. ist ein US-amerikanischer Finanzdienstleister und vor allem für ihre Aktienindizes bekannt.

Aktienindizes sind Kennzahlen der Entwicklung bzw. Änderung einer Auswahl von Aktienkursen und können repräsentativ für ganze Märkte, spezifische Branchen oder Länder stehen. Der DAX ist zum Beispiel ein Index, welcher die Entwicklung der größten 30 deutschen Unternehmen zusammenfasst.

Leider sind die Daten von MSCI nicht ohne weiteres zugänglich, weshalb wir unsere Analysen mit ETFs (engl.: “Exchange Traded Fund”) durchführen werden. ETFs sind wiederum an Börsen gehandelte Fonds, die von Fondgesellschaften/-verwaltern oder Banken verwaltet werden.

Für unsere erste Analyse sollen folgende ETFs genutzt werden, welche die folgenden Indizes führen:

Index Beschreibung ETF
MSCI World über 1600 Aktienwerte aus 24 Industrieländern iShares MSCI World ETF
MSCI Emerging Markets ca. 1400 Aktienwerte aus 27 Schwellenländern iShares MSCI Emerging Markets ETF
MSCI Frontier Markets Aktienwerte aus ca. 29 Frontier-Ländern iShares MSCI Frontier 100 ETF

Tab.1: MSCI Global Investable Market Indizes mit deren repräsentativen ETFs


Zur Extraktion der ETF-Börsenkurse nehmen wir die yahoo finance API zur Hilfe. Mit den richtigen Symbolen können wir die historischen Daten unserer ETF-Auswahl ausgeben lassen. Wie unter diesem Link für den iShares MSCI World ETF zu sehen ist, gibt es mehrere Werte in den historischen Daten. Für unsere Analyse nutzen wir den Wert, nachdem die Börse geschlossen hat.

Da die ETFs in ihren Kurswerten Unterschiede haben und uns nur die relative Entwicklung interessiert, werden wir relative Werte für die Analyse nutzen. Der Startzeitpunkt soll mit dem 06.01.2020 festgelegt werden.

Die Daten über bestätigte Infektionen mit COVID-19 entnehmen wir aus der Hochrechnung der Johns Hopkins Universität.

Correlation between confirmed cases and growth of MSCI GIMI
Abb.1: Interaktives Diagramm: Wachstum der Aktienmärkte getrennt in Industrie-, Schwellen-, Frontier-Länder und deren bestätigten COVID-19 Fälle über die Zeit. Die bestätigten Fälle der jeweiligen Märkte basieren auf der Aufsummierung der Länder, welche auch in den Märkten aufzufinden sind und daher kann es zu Unterschieden bei den offiziellen Zahlen kommen.

Interpretation des Diagramms

Auf den ersten Blick sieht man deutlich, dass mit steigenden COVID-19 Fällen die Aktienkurse bis zu -31% einbrechen. (Anfangszeitpunkt: 06.01.2020 Endzeitpunkt: 09.04.2020)

Betrachten wir den Anfang des Diagramms so sehen wir einen Einbruch der Emerging Markets, welche eine Gewichtung von 39.69 % (Stand 09.04.20) chinesische Aktien haben. Am 17.01.20 verzeichnen die Emerging Marktes noch ein Plus von 3.15 % gegenüber unserem Startzeitpunkt, wohingegen wir am 01.02.2020 ein Defizit von -6.05 % gegenüber dem Startzeitpunkt haben, was ein Einbruch von -9.20 % zum 17.01.2020 entspricht. Da der Ursprung des COVID-19 Virus auch in China war, könnte man diesen Punkt als Grund des Einbruches interpretieren. Die Industrie- und  Frontier-Länder bleiben hingegen recht stabil und auch deren bestätigten Fälle sind noch sehr niedrig.

Die Industrieländer erreichen ihren Höchststand am 19.02.20 mit einem Plus von 2.80%. Danach brachen alle drei Märkte deutlich ein. Auch in diesem Zeitraum gab es die ersten Todesopfer in Europa und in den USA. Der derzeitige Tiefpunkt, welcher am 23.03.20 zu registrieren ist, beläuft sich für die Industrieländer -32.10 %, Schwellenländer 31.7 % und Frontier-Länder auf -34.88 %.

Interessanterweise steigen die Marktwerte ab diesem Zeitpunkt wieder an. Gründe könnten die Nachrichten aus China sein, welche keine weiteren Neu-Infektionen verzeichnen, die FED dem Markt bis zu 1.5 Billionen Dollar zur Verfügung stellt und/oder die Ankündigung der Europäische Zentralbank Anleihen in Höhe von 750 MRD. Euro zu kaufen. Auch in Deutschland wurden große Hilfspakete angekündigt.

Um detaillierte Aussagen treffen zu können, müssen wir uns die Kurse auf granularer Ebene anschauen. Durch eine gezieltere Betrachtung auf Länderebene könnten Zusammenhänge näher beschrieben werden.

Wenn du dich für interaktive Analysen interessierst und tiefer in die Materie eintauchen möchtest: DATANOMIQ COVID-19 Dashboard

Hier haben wir ein Dashboard speziell für Analysen für die Aktienmärkte, welches stetig verbessert wird. Auch sollen Krypto-Währungen bald implementiert werden. Habt ihr Vorschläge und Verbesserungswünsche, dann lasst gerne ein Kommentar da!

Article series: 5 Clean Coding Tips – 3. Take Advantage of the Formatting Tools.

This is the third of the article series “5 tips for clean coding” to follow as soon as you’ve made the first steps into your coding career, in this article series. Read the introduction here, to find out why it is important to write clean code if you missed it.

Unfortunately, no automatic formatting tool will correct the logic in your code, suggest meaningful names of your variables or comment the code for you. Yet. Gmail has lately started suggesting email titles based on email content. AI-powered variable naming can be next, who knows. Anyway, the visual level of the code is much easier to correct and there are tools that will do some of the code formatting on the visual level job for you. Some of them might be already existing in your IDE, you just need to look for them a bit, others need to be installed. One of the most popular formatting tools is pylint[i]. It is worth checking it out and learning to use it in an efficient way.

Beware that as convenient as it may seem to copy and paste your code into a quick online ‘beautifier’ it is not always a good idea. The online tools might store your code. If you are working on something that shouldn’t just freely float in the world wide web, stick to reliable tools like pylint, that will store the data within your working directory.

These tools can become very good friends of yours but also very annoying ones. They will not miss single whitespace and will not keep their mouth shut when your line length jumps from 79 to 80 characters. They will be shouting with an underscoring of some worrying color and/or exclamation marks. You will need to find your way to coexist and retain your sanity. It can be very distracting when you are in a working flow and warnings pop up all the time about formatting details that have nothing to do with what you are trying to solve. Sometimes, it might be better to turn those warnings off while you are in your most concentrated/creative phase of writing and turn them back on while the dust of your genius settles down a little bit. Usually the offer a lot of flexibility, regarding which warnings you want to be ignored and other features. The good thing is, they also teach you what are mistakes that you are making and after some time you will just stop making them in the first place.


[i] https://www.pylint.org/

Einführung in die Welt der Autoencoder

An wen ist der Artikel gerichtet?

In diesem Artikel wollen wir uns näher mit dem neuronalen Netz namens Autoencoder beschäftigen und wollen einen Einblick in die Grundprinzipien bekommen, die wir dann mit einem vereinfachten Programmierbeispiel festigen. Kenntnisse in Python, Tensorflow und neuronalen Netzen sind dabei sehr hilfreich.

Funktionsweise des Autoencoders

Ein Autoencoder ist ein neuronales Netz, welches versucht die Eingangsinformationen zu komprimieren und mit den reduzierten Informationen im Ausgang wieder korrekt nachzubilden.

Die Komprimierung und die Rekonstruktion der Eingangsinformationen laufen im Autoencoder nacheinander ab, weshalb wir das neuronale Netz auch in zwei Abschnitten betrachten können.




Der Encoder

Der Encoder oder auch Kodierer hat die Aufgabe, die Dimensionen der Eingangsinformationen zu reduzieren, man spricht auch von Dimensionsreduktion. Durch diese Reduktion werden die Informationen komprimiert und es werden nur die wichtigsten bzw. der Durchschnitt der Informationen weitergeleitet. Diese Methode hat wie viele andere Arten der Komprimierung auch einen Verlust.

In einem neuronalen Netz wird dies durch versteckte Schichten realisiert. Durch die Reduzierung von Knotenpunkten in den kommenden versteckten Schichten werden die Kodierung bewerkstelligt.

Der Decoder

Nachdem das Eingangssignal kodiert ist, kommt der Decoder bzw. Dekodierer zum Einsatz. Er hat die Aufgabe mit den komprimierten Informationen die ursprünglichen Daten zu rekonstruieren. Durch Fehlerrückführung werden die Gewichte des Netzes angepasst.

Ein bisschen Mathematik

Das Hauptziel des Autoencoders ist, dass das Ausgangssignal dem Eingangssignal gleicht, was bedeutet, dass wir eine Loss Funktion haben, die L(x , y) entspricht.

L(x, \hat{x})

Unser Eingang soll mit x gekennzeichnet werden. Unsere versteckte Schicht soll h sein. Damit hat unser Encoder folgenden Zusammenhang h = f(x).

Die Rekonstruktion im Decoder kann mit r = g(h) beschrieben werden. Bei unserem einfachen Autoencoder handelt es sich um ein Feed-Forward Netz ohne rückkoppelten Anteil und wird durch Backpropagation oder zu deutsch Fehlerrückführung optimiert.

Formelzeichen Bedeutung
\mathbf{x}, \hat{\mathbf{x}} Eingangs-, Ausgangssignal
\mathbf{W}, \hat{\mathbf{W}} Gewichte für En- und Decoder
\mathbf{B}, \hat{\mathbf{B}} Bias für En- und Decoder
\sigma, \hat{\sigma} Aktivierungsfunktion für En- und Decoder
L Verlustfunktion

Unsere versteckte Schicht soll mit \latex h gekennzeichnet werden. Damit besteht der Zusammenhang:

(1)   \begin{align*} \mathbf{h} &= f(\mathbf{x}) = \sigma(\mathbf{W}\mathbf{x} + \mathbf{B}) \\ \hat{\mathbf{x}} &= g(\mathbf{h}) = \hat{\sigma}(\hat{\mathbf{W}} \mathbf{h} + \hat{\mathbf{B}}) \\ \hat{\mathbf{x}} &= \hat{\sigma} \{ \hat{\mathbf{W}} \left[\sigma ( \mathbf{W}\mathbf{x} + \mathbf{B} )\right]  + \hat{\mathbf{B}} \}\\ \end{align*}

Für eine Optimierung mit der mittleren quadratischen Abweichung (MSE) könnte die Verlustfunktion wie folgt aussehen:

(2)   \begin{align*} L(\mathbf{x}, \hat{\mathbf{x}}) &= \mathbf{MSE}(\mathbf{x}, \hat{\mathbf{x}}) = \|  \mathbf{x} - \hat{\mathbf{x}} \| ^2 &=  \| \mathbf{x} - \hat{\sigma} \{ \hat{\mathbf{W}} \left[\sigma ( \mathbf{W}\mathbf{x} + \mathbf{B} )\right]  + \hat{\mathbf{B}} \} \| ^2 \end{align*}


Wir haben die Theorie und Mathematik eines Autoencoder in seiner Ursprungsform kennengelernt und wollen jetzt diese in einem (sehr) einfachen Beispiel anwenden, um zu schauen, ob der Autoencoder so funktioniert wie die Theorie es besagt.

Dazu nehmen wir einen One Hot (1 aus n) kodierten Datensatz, welcher die Zahlen von 0 bis 3 entspricht.

    \begin{align*} [1, 0, 0, 0] \ \widehat{=}  \ 0 \\ [0, 1, 0, 0] \ \widehat{=}  \ 1 \\ [0, 0, 1, 0] \ \widehat{=}  \ 2 \\ [0, 0, 0, 1] \ \widehat{=} \  3\\ \end{align*}

Diesen Datensatz könnte wie folgt kodiert werden:

    \begin{align*} [1, 0, 0, 0] \ \widehat{=}  \ 0 \ \widehat{=}  \ [0, 0] \\ [0, 1, 0, 0] \ \widehat{=}  \ 1 \ \widehat{=}  \  [0, 1] \\ [0, 0, 1, 0] \ \widehat{=}  \ 2 \ \widehat{=}  \ [1, 0] \\ [0, 0, 0, 1] \ \widehat{=} \  3 \ \widehat{=}  \ [1, 1] \\ \end{align*}

Damit hätten wir eine Dimensionsreduktion von vier auf zwei Merkmalen vorgenommen und genau diesen Vorgang wollen wir bei unserem Beispiel erreichen.

Programmierung eines einfachen Autoencoders


Typische Einsatzgebiete des Autoencoders sind neben der Dimensionsreduktion auch Bildaufarbeitung (z.B. Komprimierung, Entrauschen), Anomalie-Erkennung, Sequenz-to-Sequenz Analysen, etc.


Wir haben mit einem einfachen Beispiel die Funktionsweise des Autoencoders festigen können. Im nächsten Schritt wollen wir anhand realer Datensätze tiefer in gehen. Auch soll in kommenden Artikeln Variationen vom Autoencoder in verschiedenen Einsatzgebieten gezeigt werden.

Article series: 5 Clean Coding Tips – 2. Name Variables in a Meaningful Way

This is the second of the article series “5 tips for clean coding” to follow as soon as you’ve made the first steps into your coding career, in this article series. Read the introduction here, to find out why it is important to write clean code if you missed it.

When it comes to naming variables, there are a few official rules in the PEP8 style guide. A variable must start with an underscore or a letter and can be followed by a number of underscores or letters or digits. They cannot be reserved words: True, False, or, not, lambda etc. The preferred naming style is lowercase or lowercase_with_underscore. This all refers to variable names on a visual level. However, for readability purposes, the semantic level is as important, or maybe even more so. If it was for python, the variables could be named like this:

It wouldn’t make the slightest difference. But again, the code is not only for the interpreter to be read. It is for humans. Other people might need to look at your code to understand what you did, to be able to continue the work that you have already started. In any case, they need to be able to decipher what hides behind the variable names, that you’ve given the objects in your code. They will need to remember what they meant as they reappear in the code. And it might not be easy for them.

Remembering names is not an easy thing to do in all life situations. Let’s consider the following situation. You go to a party, there is a bunch of new people that you meet for the first time. They all have names and you try very hard to remember them all. Imagine how much easier would it be if you could call the new girl who came with John as the_girl_who_came_with_John. How much easier would it be to gossip to your friends about her? ‘Camilla is on the 5th glass of wine tonight, isn’t she?!.’ ‘Who are you talking about???’ Your friends might ask. ‘The_Girl_who_came_with_John.’ And they will all know. ‘It was nice to meet you girl_who_came_with_john, see you around.’ The good thing is that variables are not really like people. You can be a bit rude to them, they will not mind. You don’t have to force yourself or anyone else to remember an arbitrary name of a variable, that accidentally came to your mind in the moment of creation. Let your colleagues figure out what is what by a meaningful, straightforward description of it.

There is an important tradeoff to be aware of here. The lines of code should not exceed a certain length (79 characters, according to the PEP 8), therefore, it is recommended that you keep your names as short as possible. It is worth to give it a bit of thought about how you can name your variable in the most descriptive way, keeping it as short as possible. Keep in mind, that
the_blond_girl_in_a_dark_blue_dress_who_came_with_John_to_this_party might not be the best choice.

There are a few additional pieces of advice when it comes to naming your variables. First, try to always use pronounceable names. If you’ve ever been to an international party, you will know how much harder to remember is something that you cannot even repeat. Second, you probably have been taught over and over again that whenever you create a loop, you use i and j to denote the iterators.

It is probably engraved deep into the folds in your brain to write for i in…. You need to try and scrape it out of your cortex. Think about what the i stands for, what it really does and name it accordingly. Is i maybe the row_index? Is it a list_element?

Additionally, think about when to use a noun and where a verb. Variables usually are things and functions usually do things. So, it might be better to name functions with verb expressions, for example: get_id() or raise_to_power().

Moreover, it is a good practice to name constant numbers in the code. First, because when you name them you explain the meaning of the number. Second, because maybe one day you will have to change that number. If it appears multiple times in your code, you will avoid searching and changing it in every place. PEP 8 states that the constants should be named with UPPER_CASE_NAME. It is also quite common practice to explain the meaning of the constants with an inline comment at the end of the line, where the number appears. However, this approach will increase the line length and will require repeating the comment if the number appears more than one time in the code.

How Important is Customer Lifetime Value?

This is the third article of article series Getting started with the top eCommerce use cases.

Customer Lifetime Value

Many researches have shown that cost for acquiring a new customer is higher than the cost of retention of an existing customer which makes Customer Lifetime Value (CLV or LTV) one of the most important KPI’s. Marketing is about building a relationship with your customer and quality service matters a lot when it comes to customer retention. CLV is a metric which determines the total amount of money a customer is expected to spend in your business.

CLV allows marketing department of the company to understand how much money a customer is going  to spend over their  life cycle which helps them to determine on how much the company should spend to acquire each customer. Using CLV a company can better understand their customer and come up with different strategies either to retain their existing customers by sending them personalized email, discount voucher, provide them with better customer service etc. This will help a company to narrow their focus on acquiring similar customers by applying customer segmentation or look alike modeling.

One of the main focus of every company is Growth in this competitive eCommerce market today and price is not the only factor when a customer makes a decision. CLV is a metric which revolves around a customer and helps to retain valuable customers, increase revenue from less valuable customers and improve overall customer experience. Don’t look at CLV as just one metric but the journey to calculate this metric involves answering some really important questions which can be crucial for the business. Metrics and questions like:

  1. Number of sales
  2. Average number of times a customer buys
  3. Full Customer journey
  4. How many marketing channels were involved in one purchase?
  5. When the purchase was made?
  6. Customer retention rate
  7. Marketing cost
  8. Cost of acquiring a new customer

and so on are somehow associated with the calculation of CLV and exploring these questions can be quite insightful. Lately, a lot of companies have started to use this metric and shift their focuses in order to make more profit. Amazon is the perfect example for this, in 2013, a study by Consumers Intelligence Research Partners found out that prime members spends more than a non-prime member. So Amazon started focusing on Prime members to increase their profit over the past few years. The whole article can be found here.

How to calculate CLV?

There are several methods to calculate CLV and few of them are listed below.

Method 1: By calculating average revenue per customer


Figure 1: Using average revenue per customer


Let’s suppose three customers brought 745€ as profit to a company over a period of 2 months then:

CLV (2 months) = Total Profit over a period of time / Number of Customers over a period of time

CLV (2 months) = 745 / 3 = 248 €

Now the company can use this to calculate CLV for an year however, this is a naive approach and works only if the preferences of the customer are same for the same period of time. So let’s explore other approaches.

Method 2

This method requires to first calculate KPI’s like retention rate and discount rate.


CLV = Gross margin per lifespan ( Retention rate per month / 1 + Discount rate – Retention rate per month)


Retention rate = Customer at the end of the month – Customer during the month / Customer at the beginning of the month ) * 100

Method 3

This method will allow us to look at other metrics also and can be calculated in following steps:

  1. Calculate average number of transactions per month (T)
  2. Calculate average order value (OV)
  3. Calculate average gross margin (GM)
  4. Calculate customer lifespan in months (ALS)

After calculating these metrics CLV can be calculated as:


CLV = T*OV*GM*ALS / No. of Clients for the period


Transactions (T) = Total transactions / Period

Average order value (OV) = Total revenue / Total orders

Gross margin (GM) = (Total revenue – Cost of sales/ Total revenue) * 100 [but how you calculate cost of sales is debatable]

Customer lifespan in months (ALS) = 1 / Churn Rate %


CLV can be calculated using any of the above mentioned methods depending upon how robust your company wants the analysis to be. Some companies are also using Machine learning models to predict CLV, maybe not directly but they use ML models to predict customer churn rate, retention rate and other marketing KPI’s. Some companies take advantage of all the methods by taking an average at the end.

Matrix search: Finding the blocks of neighboring fields in a matrix with Python


In this article we will look at a solution in python to the following grid search task:

Find the biggest block of adjoining elements of the same kind and into how many blocks the matrix is divided. As adjoining blocks, we will consider field touching by the sides and not the corners.

Input data

For the ease of the explanation, we will be looking at a simple 3×4 matrix with elements of three different kinds, 0, 1 and 2 (see above). To test the code, we will simulate data to achieve different matrix sizes and a varied number of element types. It will also allow testing edge cases like, where all elements are the same or all elements are different.

To simulate some test data for later, we can use the numpy randint() method:

The code

How the code works

In summary, the algorithm loops through all fields of the matrix looking for unseen fields that will serve as a starting point for a local exploration of each block of color – the find_blocks() function. The local exploration is done by looking at the neighboring fields and if they are within the same kind, moving to them to explore further fields – the explore_block() function. The fields that have already been seen and counted are stored in the visited list.

find_blocks() function:

  1. Finds a starting point of a new block
  2. Runs a the explore_block() function for local exploration of the block
  3. Appends the size of the explored block
  4. Updates the list of visited points
  5. Returns the result, once all fields of the matrix have been visited.

explore_block() function:

  1. Takes the coordinates of the starting field for a new block and the list of visited points
  2. Creates the queue set with the starting point
  3. Sets the size of the current block (field_count) to 1
  4. Starts a while loop that is executed for as long as the queue is not empty
    1. Takes an element of the queue and uses its coordinates as the current location for further exploration
    2. Adds the current field to the visited list
    3. Explores the neighboring fields and if they belong to the same block, they are added to the queue
    4. The fields are taken off the queue for further exploration one by one until the queue is empty
  5. Returns the field_count of the explored block and the updated list of visited fields

Execute the function

The returned result is biggest block: 4, number of blocks: 4.

Run the test matrices:


The matrices for the article were visualized with the seaborn heatmap() method.