## A common trap when it comes to sampling from a population that intrinsically includes outliers

I will discuss a common fallacy concerning the conclusions drawn from calculating a sample mean and a sample standard deviation and more importantly how to avoid it.

Suppose you draw a random sample , , … of size and compute the ordinary (arithmetic) sample mean  and a sample standard deviation from it.  Now if (and only if) the (true) population mean µ (first moment) and population variance (second moment) obtained from the actual underlying PDF  are finite, the numbers and make the usual sense otherwise they are misleading as will be shown by an example.

By the way: The common correlation coefficient will also be undefined (or in practice always point to zero) in the presence of infinite population variances. Hopefully I will create an article discussing this related fallacy in the near future where a suitable generalization to Lévy-stable variables will be proposed.

Drawing a random sample from a heavy tailed distribution and discussing certain measures

As an example suppose you have a one dimensional random walker whose step length is distributed by a symmetric standard Cauchy distribution (Lorentz-profile) with heavy tails, i.e. an alpha-stable distribution with alpha being equal to one. The PDF of an individual independent step is given by , thus neither the first nor the second moment exist whereby the first exists and vanishes at least in the sense of a principal value due to symmetry.

Still let us generate (pseudo) standard Cauchy random numbers in R* to analyze the behavior of their sample mean and standard deviation as a function of the reduced sample size .

*The R-code is shown at the end of the article.

Here are the piecewise sample mean (in blue) and standard deviation (in red) for the mentioned Cauchy sampling. We see that both the sample mean and include jumps and do not converge.

Especially the mean deviates relatively largely from zero even after 3000 observations. The sample has no target due to the population variance being infinite.

If the data is new and no prior distribution is known, computing the sample mean and will be misleading. Astonishingly enough the sample mean itself will have the (formally exact) same distribution as the single step length . This means that the sample mean is also standard Cauchy distributed implying that with a different Cauchy sample one could have easily observed different sample means far of the presented values in blue.

What sense does it make to present the usual interval in such a case? What to do?

The sample median, median absolute difference (mad) and Inter-Quantile-Range (IQR) are more appropriate to describe such a data set including outliers intrinsically. To make this plausible I present the following plot, whereby the median is shown in black, the mad in green and the IQR in orange.

This example shows that the median, mad and IQR converge quickly against their assumed values and contain no major jumps. These quantities do an obviously better job in describing the sample. Even in the presence of outliers they remain robust, whereby the mad converges more quickly than the IQR. Note that a standard Cauchy sample will contain half of its sample in the interval median mad meaning that the IQR is twice the mad.

Drawing a random sample from a PDF that has finite moments

Just for comparison I also show the above quantities for a standard normal (pseudo) sample labeled with the same color as before as a counter example. In this case not only do both the sample mean and median but also the and mad converge towards their expected values (see plot below). Here all the quantities describe the data set properly and there is no trap since there are no intrinsic outliers. The sample mean itself follows a standard normal, so that the in deed makes sense and one could calculate a standard error from it to present the usual stochastic confidence intervals for the sample mean.

A careful observation shows that in contrast to the Cauchy case here the sampled mean and converge more quickly than the sample median and the IQR. However still the sampled mad performs about as well as the . Again the mad is twice the IQR.

And here are the graphs of the prementioned quantities for a pseudo normal sample:

The take-home-message:

Just be careful when you observe outliers and calculate sample quantities right away, you might miss something. At best one carefully observes how the relevant quantities change with sample size as demonstrated in this article.

Such curves should become of broader interest in order to improve transparency in the Data Science process and reduce fallacies as well.

P.S.: Feel free to play with the set random seed in the R-code below and observe how other quantities behave with rising sample size. Of course you can also try different PDFs at the beginning of the code. You can employ a Cauchy, Gaussian, uniform, exponential or Holtsmark (pseudo) random sample.

QUIZ: Which one of the recently mentioned random samples contains a trap** and why?

R-code used to generate the data and for producing plots:

## Cross-industry standard process for data mining

Introduced in 1996, the cross-industry standard process for data mining (CRISP-DM) became the most
common procedure for all data mining projects. This method consists of six phases: Business
understanding, Data understanding, Data preparation, Modeling, Evaluation and Deployment (see
Figure 1). It is being used not just as a reference manual but as a user guide as it explains every phase
in detail (Hipp, 2000). The six phases of this model are explained below:

Figure 1: Different phases of CRISP-DM

It includes understanding the business problem and determining the
objective of the business as well as of the project. It is also important to understand the previous work
done on the project (if any) to achieve the business goals and to examine if the scope of the project has changed.

The job of a Data Scientist is not limited to coding or just make a machine learning model and I guess that’s why this whole lifecycle was developed.  The key points a project owner should take care in this process are:

– Identify stakeholders  and involve them to define the scope your project
– Identify metrics / KPIs for measuring success

Evaluating a model is a different thing as it can only tell you how good are your predictions but identifying the success metric is really important for any data science project because when your model is deployed in production this measure will tell you if your model actually works or not. Now, let’s discuss what is this success metric
Consider that you are working in an e-commerce company where Head of finance ask you to create a machine learning model to predict if a specific product will return or not. The problem is not hard to understand, its a binary classification problem and you know you can do the job. But before you start working with the data you should define a metric to measure the success. What do you think your success metric could be? I would go with the return rate, in other words, calculate the rate for how many orders are actually coming back and if this measure is getting decrease you would know your model works and if not then FIX IT !!

## Data understanding

The initial step in this phase is to gather all the data from different sources. It is
then important to describe the data, generate graphs for distribution in order to get familiar with the
data. This phase is important as without enough data or without understanding about the data analysis
cannot be performed. In data mining terms this can be compared to Exploratory data analysis (EDA)
where techniques from descriptive statistics are used to have an insight into the data. For instance, if it is
a time series data it makes sense to know from when until when the data is available before diving deep into
the data.

## Data preparation

This phase takes most of the time in data mining project as a lot of methods from
data cleaning, feature subset, feature engineering, the transformation of data etc. are used before the final
dataset is trained for modeling purpose. The single dataset can also be prepared in different forms as some
algorithms can learn more with a certain type of data, some algorithms can deal with imbalance dataset
and for some algorithms, the target variable must be balanced. This phase also requires sometimes to
calculate new KPI’s according to the business need or sometimes to reduce the dimension of the dataset.

## Modeling and Evaluation

Various models are selected and build in this process and appropriate hyperparameters are
selected after an intensive grid search.  Once all the models are built it is now time to evaluate and compare performances of all the models.

## Deployment

A model is of no use if it is not deployed into production. Until now you have been doing the job of a data scientist but for deployment, you need some software engineering

skills. There are several ways to deploy a machine learning model or python code. Few of them are:

• Re-implement your python code in C++, Java etc. (LOL)
• Save the coefficients and use them to get predictions
• Serialized objects (REST API with flask, Django)

To understand the concept of deploying an ML model using REST API this post is highly recommended.

## Training eines Neurons mit dem Gradientenverfahren

Dies ist Artikel 3 von 6 der Artikelserie –Einstieg in Deep Learning.

Das Training von neuronalen Netzen erfolgt nach der Forward-Propagation über zwei Schritte:

1. Fehler-Rückführung über aller aktiver Neuronen aller Netz-Schichten, so dass jedes Neuron “seinen” Einfluss auf den Ausgabefehler kennt.
2. Anpassung der Gewichte entgegen den Gradienten der Fehlerfunktion

Beide Schritte werden in der Regel zusammen als Backpropagation bezeichnet. Machen wir erstmal einen Schritt vor und betrachten wir, wie ein Neuron seine Gewichtsverbindungen zu seinen Vorgängern anpasst.

Der Gradientenabstieg ist ein generalisierbarer Algorithmus zur Optimierung, der in vielen Verfahren des maschinellen Lernens zur Anwendung kommt, jedoch ganz besonders als sogenannte Backpropagation im Deep Learning den Erfolg der künstlichen neuronalen Netze erst möglich machen konnte.

Der Gradientenabstieg lässt sich vom Prinzip her leicht erklären: Angenommen, man stünde im Gebirge im dichten Nebel. Das Tal, und somit der Weg nach Hause, ist vom Nebel verdeckt. Wohin laufen wir? Wir können das Ziel zwar nicht sehen, tasten uns jedoch so heran, dass unser Gehirn den Gradienten (den Unterschied der Höhen beider Füße) berechnet, somit die Steigung des Bodens kennt und sich entgegen dieser Steigung unser Weg fortsetzt.

Konkret funktioniert der Gradientenabstieg so: Wir starten bei einem zufälligen Theta (Random Initialization). Wir berechnen die Ausgabe (Forwardpropogation) und vergleichen sie über eine Verlustfunktion (z. B. über die Funktion Mean Squared Error) mit dem tatsächlich korrekten Wert. Auf Grund der zufälligen Initialisierung haben wir eine nahe zu garantierte Falschheit der Ergebnisse und somit einen Verlust. Für die Verlustfunktion berechnen wir den Gradienten für gegebene Eingabewerte. Voraussetzung dafür ist, dass die Funktion ableitbar ist. Wir bewegen uns entgegen des Gradienten in Richtung Minimum der Verlustfunktion. Ist dieses Minimum (fast) gefunden, spricht man auch davon, dass der Lernalgorithmus konvergiert.

Das Gradientenabstiegsverfahren ist eine Möglichkeit der Gradientenverfahren, denn wollten wir maximieren, würden wir uns entlang des Gradienten bewegen, was in anderen Anwendungen sinnvoll ist.

Ob als “Cost Function” oder als “Loss Function” bezeichnet, in jedem Fall ist es eine “Error Function”, aber auf die Benennung kommen wir später zu sprechen. Jedenfalls versuchen wir die Fehlerrate zu senken! Leider sind diese Funktionen in der Praxis selten so einfach konvex (zwei Berge mit einem Tal dazwischen).

Aber Achtung: Denn befinden wir uns nur zwischen zwei Bergen, finden wir das Tal mit Sicherheit über den Gradienten. Befinden wir uns jedoch in einem richtigen Gebirge mit vielen Bergen und Tälern, gilt es, das richtige Tal zu finden. Bei der Optimierung der Gewichtungen von künstlichen neuronalen Netzen wollen wir die besten Gewichtungen finden, die uns zu den geringsten Ausgaben der Verlustfunktion führen. Wir suchen also das globale Minimum unter den vielen (lokalen) Minima.

## Programmier-Beispiel in Python

Nachfolgend ein Beispiel des Gradientenverfahrens zur Berechnung einer Regression. Wir importieren numpy und matplotlib.pyplot und erzeugen uns künstliche Datenpunkte:

Nun wollen wir einen Lernalgorithmus über das Gradientenverfahren erstellen. Im Grunde haben wir hier es bereits mit einem linear aktivierten Neuron zutun:

Bei der linearen Regression, die wir durchführen wollen, nehmen wir zwei-dimensionale Daten (wobei wir die Regression prinzipiell auch mit x-Dimensionen durchführen können, dann hätte unser Neuron weitere Eingänge). Wir empfangen einen Bias () der stets mit einer Eingangskonstante multipliziert und somit als Wert erhalten bleibt. Der Bias ist das Alpha in einer Schulmathe-tauglichen Formel wie .

Beta ist die Steigung, der Gradient, der Funktion.

Sowohl als auch sind uns unbekannt, versuchen wir jedoch über die Betrachtung unserer Prädiktion durch Berechnung der Formel und den darauffolgenden Abgleich mit dem tatsächlichen herauszufinden. Anfangs behaupten wir beispielsweise einfach, sowohl als auch seien . Folglich wird ebenfalls gleich 0.00 sein und die Fehlerfunktion (Loss Function) wird maximal sein. Dies war der erste Durchlauf des Trainings, die sogenannte erste Epoche!

Die Epochen (Durchläufe) und dazugehörige Fehlergrößen. Wenn die Fehler sinken und mit weiteren Epochen nicht mehr wesentlich besser werden, heißt es, das der Lernalogorithmus konvergiert.

Als Fehlerfunktion verwenden wir bei der Regression die MSE-Funktion (Mean Squared Error):

Um diese Funktion wird sich nun alles drehen, denn diese beschreibt den Fehler und gibt uns auch die Auskunft darüber, ob wie stark und in welche Richtung sie ansteigt, so dass wir uns entgegen der Steigung bewegen können. Wer die Regeln der Ableitung im Kopf hat, weiß, dass die Ableitung der Formel leichter wird, wenn wir sie vorher auf halbe Werte runterskalieren. Da die Proportionen dabei erhalten bleiben und uns quadrierte Fehlerwerte unserem menschlichen Verstand sowieso nicht so viel sagen (unser Gehirn denkt nunmal nicht exponential), stört das nicht:

Wenn die Mathematik der partiellen Ableitung (Ableitung einer Funktion nach jedem Gradienten) abhanden gekommen ist, bitte nochmal folgende Regeln nachschlagen, um die nachfolgende Ableitung verstehen zu können:

• Allgemeine partielle Ableitung
• Kettenregel

Ableitung der MSD-Funktion nach dem einen Gewicht bzw. partiell nach jedem vorhandenen :

Woher wir das am Ende her haben? Das ergibt sie aus der Kettenregel: Die äußere Funktion wurde abgeleitet, so wurde aus dann . Jedoch muss im Sinne eben dieser Kettenregel auch die innere Funktion abgeleitet werden. Da wir nach ableiten, bleibt nur erhalten.

Damit können wir arbeiten! So kompliziert ist die Formel nun auch wieder nicht:

Mit dieser Formel können wir unsere Gewichte an den Fehler anpassen: ( ist der Gradient der Funktion!)

### Initialisieren der Gewichtungen

Die Gewichtungen und müssen anfänglich mit Werten initialisiert werden. In der Regression bietet es sich an, die Gewichte anfänglich mit zu initialisieren.

Bei vielen neuronalen Netzen, mit nicht-linearen Aktivierungsfunktionen, ist das jedoch eher ungünstig und zufällige Werte sind initial besser. Gut erprobt sind normal-verteilte Zufallswerte.

### Lernrate

Nur eine Kleinigkeit haben wir bisher vergessen: Wir brauchen einen Faktor, mit dem wir anpassen. Hier wäre der Faktor 1. Das ist in der Regel viel zu groß. Dieser Faktor wird geläufig als Lernrate (Learning Rate) (eta) bezeichnet:

Die Lernrate ist ein Knackpunkt und der erste Parameter des Lernalgorithmus, den es anzupassen gilt, wenn das Training nicht konvergiert.

Die Lernrate darf nicht zu groß klein gewählt werden, da das Training sonst zu viele Epochen benötigt. Ungeduldige erhöhen die Lernrate möglicherweise aber so sehr, dass der Lernalgorithmus im Minimum der Fehlerfunktion vorbeiläuft und diesen stets überspringt. Hier würde der Algorithmus also sozusagen konvergieren, weil nicht mehr besser werden, aber das resultierende Modell wäre weit vom Optimum entfernt.

Beginnen wir mit der Implementierung als Python-Klasse:

Die Klasse sollte so funktionieren, bevor wir sie verwenden, sollten wir die Input-Werte standardisieren:

Bei diesem Beispiel mit künstlich erzeugten Werten ist das Standardisieren bzw. das Fehlen des Standardisierens zwar nicht kritisch, aber man sollte es sich zur Gewohnheit machen. Testweise es einfach mal weglassen 🙂

Kommen wir nun zum Einsatz der Klasse, die die Regression via Gradientenabstieg absolvieren soll:

Was tut diese Instanz der Klasse LinearRegressionGD nun eigentlich?

Bildlich gesprochen, legt sie eine Gerade auf den Boden des Koordinatensystems, denn die Gewichtungen werden mit 0.00 initialisiert, ist also gleich 0.00, egal welche Werte in enthalten sind. Der Fehler ist dann aber sehr groß (sollte maximal sein, im Vergleich zu zukünftigen Epochen). Die Gewichte werden also angepasst, die Gerade somit besser in die Punktwolke platziert. Mit jeder Epoche wird die Gerade erneut in die Punktwolke gelegt, der Gesamtfehler (über alle , da wir es hier mit dem Batch-Verfahren zutun haben) berechnet, die Werte angepasst… bis die vorgegebene Zahl an Epochen abgelaufen ist.

Schauen wir uns das Ergebnis des Trainings an:

Die Linie sieht passend aus, oder? Da wir hier nicht zu sehr in die Theorie der Regressionsanalyse abdriften möchten, lassen wir das testen und prüfen der Akkuratesse mal aus, hier möchte ich auf meinen Artikel Regressionsanalyse in Python mit Scikit-Learn verweisen.

Prüfen sollten wir hingegen mal, wie schnell der Lernalgorithmus mit der vorgegebenen Lernrate konvergiert:

Hier die Verlaufskurve der Cost Function:

Die Kurve zeigt uns, dass spätestens nach 40 Epochen kaum noch Verbesserung (im Sinne der Gesamtfehler-Minimierung) erreicht wird.

### Wichtige Hinweise

Natürlich war das nun nur ein erster kleiner Einstieg und wer es verstanden hat, hat viel gewonnen. Denn erst dann kann man sich vorstellen, wie ein einzelnen Neuron eines künstlichen neuronalen Netzes grundsätzlich trainiert werden kann.

Folgendes sollte noch beachtet werden:

• Lernrate :
Die Lernrate ist ein wichtiger Parameter. Wer das Programmier-Beispiel bei sich zum Laufen gebracht hat, einfach mal die Lernrate auf Werte zwischen 10.00 und 0.00000001 setzen, schauen was passiert 🙂
• Globale Minima vs lokale Minima:
Diese lineare zwei-dimensionale Regression ist ziemlich einfach. Neuronale Netze sind hingegen komplexer und haben nicht einfach nur eine simple konvexe Fehlerfunktion. Hier gibt es mehrere Hügel und Täler in der Fehlerfunktion und die Gefahr ist groß, in einem lokalen, nicht aber in einem globalen Minimum zu landen.
Wir haben hier das sogenannte Batch-Verfahren verwendet. Dieses ist grundsätzlich besser als die stochastische Methode. Denn beim Batch verwenden wir den gesamten Stapel an -Werten für die Fehlerbestimmung. Allerdings ist dies bei großen Daten zu rechen- und speicherintensiv. Dann werden kleinere Unter-Stapel (Sub-Batches) zufällig aus den -Werten ausgewählt, der Fehler daraus bestimmt (was nicht ganz so akkurat ist, wie als würden wir den Fehler über alle berechnen) und der Gradient bestimmt. Dies ist schon Rechen- und Speicherkapazität, erfordert aber meistens mehr Epochen.

### Buchempfehlung

Die folgenden zwei Bücher haben mir bei der Erstellung dieses Beispiels geholfen und kann ich als hilfreiche und deutlich weiterführende Lektüre empfehlen:

## Fuzzy Matching mit dem Jaro-Winkler-Score zur Auswertung von Markenbekanntheit und Werbeerinnerung

Für Unternehmen sind Markenbekanntheit und Werbeerinnerung wichtige Zielgrößen, denn anhand dieser lässt sich ableiten, ob Konsumenten ein Produkt einer Marke kaufen werden oder nicht. Zielgrößen wie diese werden von Marktforschungsinstituten über Befragungen ermittelt. Dafür wird in regelmäßigen Zeitabständen eine gleichbleibende Anzahl an Personen befragt, ob diese sich an Marken einer bestimmten Branche erinnern oder sich an Werbung erinnern. Die Personen füllen dafür in der Regel einen Onlinefragebogen aus.

Die Ergebnisse der Befragung liegen in einer Datenmatrix (siehe Tabelle) vor und müssen zur Auswertung zunächst bearbeitet werden.

 Laufende Nummer Marke 1 Marke 2 Marke 3 Marke 4 1 ING-Diba Citigroup Sparkasse 2 Sparkasse Consorsbank 3 Commerbank Deutsche Bank Sparkasse ING-DiBa 4 Sparkasse Targobank

Ziel ist es aus diesen Daten folgende 0/1 codierte Matrix zu generieren. Wenn eine Marke bekannt ist, wird in die zur Marke gehörende Spalte eine Eins eingetragen, ansonsten eine Null.

 Alle Marken ING-Diba Citigroup Sparkasse Targobank ING-Diba, Citigroup, Sparkasse 1 1 1 0 Sparkasse, Consorsbank 0 0 1 0 Commerzbank, Deutsche Bank, Sparkasse, ING-Diba 1 0 0 0 Sparkasse, Targobank 0 0 1 1

Der Workflow um diese Datentransformation durchzuführen ist oftmals mittels eines Teilstrings einer Marke zu suchen ob diese in einem über alle Nennungen hinweg zusammengeführten String vorkommt oder nicht (z.B. „argo“ bei Targobank). Das Problem dieser Herangehensweise ist, dass viele falsch geschriebenen Wörter so nicht erfasst werden und die Erfahrung zeigt, dass falsch geschriebene Marken in vielfältigster Weise auftreten. Hier mussten in der Vergangenheit Mitarbeiter sich in stundenlangem Kampf durch die Ergebnisse wühlen und falsch zugeordnete oder nicht zugeordnete Marken händisch korrigieren und alle Variationen der Wörter notieren, um für die nächste Befragung das Suchpattern zu optimieren.

Eine Alternative diesen aufwändigen Workflow stellt die Ermittlung von falsch geschriebenen Wörtern mittels des Jaro-Winkler-Scores dar. Dafür muss zunächst die Jaro-Winkler-Distanz zwischen zwei Strings berechnet werden. Diese berechnet sich wie folgt:

• m: Anzahl der übereinstimmenden Buchstaben
• s: Länge des Strings
• t: Hälfte der Anzahl der Umstellungen der Buchstaben die nötig sind, damit Strings identisch sind. („Ta“ und „gobank“ befinden sich bereits in der korrekten Reihenfolge, somit gilt: t = 0)

Aus dem Ergebnis lässt sich der Jaro-Winkler Score berechnen:

ist dabei die Jaro-Winkler-Distanz, l die Länge der übereinstimmenden Buchstaben von Beginn des Wortes bis zum maximal vierten Buchstaben und p ein konstanter Faktor von 0,1.

Für die Strings „Targobank“ und „Tangobank“ ergibt sich die Jaro-Winkler-Distanz:

Daraus wird im nächsten Schritt der Jaro-Winkler Score berechnet:

Bisherige Erfahrungen haben gezeigt, dass sich Scores ab 0,8 bzw. 0,9 am besten zur Suche von ähnlichen Wörtern eignen. Ein Schwellenwert darunter findet sehr viele Wörter, die sich z.B. auch anderen Wörtern zuordnen lassen. Ein Schwellenwert über 0,9 identifiziert falsch geschriebene Wörter oftmals nicht mehr.

Nach diesem theoretischen Exkurs möchte ich nun zeigen, wie sich das Ganze praktisch anwenden lässt. Da sich das Ganze um ein fiktives Beispiel handelt, werden zur Demonstration der Praxistauglichkeit Fakedaten mit folgendem Code erzeugt. Dabei wird angenommen, dass Personen unterschiedlich viele Banken kennen und diese mit einer bestimmten Wahrscheinlichkeit falsch schreiben.

Ausführen:

Nun werden die Inhalte der Spalten in eine einzige Spalte zusammengefasst und jede Marke per Komma getrennt.

Damit Sonderzeichen, Leerzeichen oder Groß- und Kleinschreibung keine Rolle spielen, werden alle Strings vereinheitlicht und störende Zeichen entfernt.

Im nächsten Schritt wird geprüft welche Schreibweisen überhaupt existieren. Dafür eignet sich eine Word-Frequency-Matrix, mit der alle einzigartigen Wörter und deren Häufigkeiten in einem Vektor gezählt wird.

Danach wird eine leere Liste erstellt, in der iterativ für jedes Element des Suchvektors ein Charactervektor erzeugt wird, der Wörter enthält, die einen Jaro-Winker Score von 0,9 oder höher besitzen.

Jetzt wird ein leerer DataFrame erzeugt, der die Zeilenlänge des originalen DataFrames besitzt sowie die Anzahl der Marken als Spaltenlänge.

Im nächsten Schritt wird nun aus den ähnlichen Wörtern mit einer oder-Verknüpfung einen String erzeugt, der alle durch den Jaro-Winkler-Score identifizierten Wörter beinhaltet. Wenn ein Treffer gefunden wird, wird in der Suchspalte eine Eins eingetragen, ansonsten eine Null.

Zuletzt wird eine Spalte erzeugt, in die eine Eins geschrieben wird, wenn keine der Marken gefunden wurde.

Nach der fertigen Berechnung der Matrix können nun die finalen KPI´s berechnet und als Report in eine .xlsx Datei geschrieben werden.

Dieses Vorgehen kann natürlich nicht verhindern, dass sich jemand mit kritischem Auge die Daten anschauen muss. In mehreren Tests ergaben sich bei einer Fallzahl von ~10.000 Antworten Genauigkeiten zwischen 95% und 100%, was bisherige Ansätze um ein Vielfaches übertrifft.9407407

## Big Data has reduced the boundary between demand-centric dynamic pricing and user-behavior centric pricing!

Real-time pricing is also known as Dynamic pricing, and it is a method to plan and set highly flexible prices of the services or the products. Dynamic pricing is aimed to help the online organizations modify the costs on the fly in relation to the ever changing market conditions. All sorts of modifications are managed the costing bots, who collect the information, and use the algorithms in order to regulate the costing, keeping in mind the set guidelines. With the help of data analysis, vendors can accurately forecast the best prices, and also can adjust it as per the changing needs.

What’s the role of Big Data in Dynamics pricing?

Big data strategies are made just to get the required insights which help to enhance the performance of a business. Still, companies find it difficult to understand the capabilities of analytics, and how the analytics can be used to make the process of pricing all the more powerful. Various levels of Big Data collection, and analysis result into planning a proper dynamics pricing structure. The Big Data captured by the companies hold a lot of value when it comes to devising solid, and very workable dynamics costing structures.

Each and every one of the data-oriented firms move from the basic data reporting stage via a plenty of stages to get to the utmost, desirable level of optimization that’s deemed the most sophisticated. This eventually helps to enhance the revenue management process as well.

How Big Data lessens the gap between demand-centric dynamic pricing and user-behavior centric pricing?

Big Data as we have discussed above has a major role to play when it comes to setting dynamic pricing plans. Dynamic pricing is now further categorized into different segments and two of them are demand-centric dynamic pricing and user-behavior centric pricing. Both of these hold equal importance in creating a top pricing strategy. However, one of the other important things is that, it acts as a liaison between the two as well.  It bridges the gap between the two. When it comes to demand centric costing, it is referred to as what the customer needs, and what the customer is looking for. Whereas, when it comes to user behavior pricing, it is more related to what we should be offering to the customer as per the interest levels of the customers.

Now, both of these parameters hold equal importance when it comes to making costing strategies that are fruitful. To set proper ‘demand centric pricing’ it is importance to know about the demand as well as the wants of the target audience. And, when it comes to user-behavior centric pricing, we need to know how the user is feeling, and what interest areas are. This where the role of Big Data analytics come into play.

Big Data analytics of relative information helps to find out both, the demands and well as the user behaviors. Big Data analytics done to study the target audience are a best way to get to the answers. Once we know about the demands and the user behavior we have to combine both of these to churn our better pricing strategies.

The costing plans should be taken into consideration by mapping both of these elements together. For example, even whenever we curate marketing strategies, they are basically catering to the demands of the public. But, at the same time, user-behavior is never neglected either. It’s a mix of both that we need for setting dynamic prices as well. The modifications which should be done in the pricing should be done based on collective insights gained by clubbing both the elements together.

By studying both the demands graphs as well as the user behavior reports, a company can devise plans that will turn out to be very useful when it comes to costing. Dynamic pricing is as it is a very fruitful invention, and the integration of Big Data has made it all the more powerful.

Big Data is one of those technologies which has made a lot possible in a lot of areas. Be it the pricing structures or the business strategies, Big Data analytics are used everywhere to improve the performance of the company.

# Sentiment Analysis of IMDB reviews¶

This article shows you how to build a Neural Network from scratch(no libraries) for the purpose of detecting whether a movie review on IMDB is negative or positive.

### Outline:¶

• Curating a dataset and developing a "Predictive Theory"

• Transforming Text to Numbers Creating the Input/Output Data

• Building our Neural Network

• Making Learning Faster by Reducing "Neural Noise"

• Reducing Noise by strategically reducing the vocabulary

# Curating the Dataset¶

In [3]:
def pretty_print_review_and_label(i):
print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt','r') # features of our dataset
g.close()

g = open('labels.txt','r') # labels
g.close()


Note: The data in reviews.txt we're contains only lower case characters. That's so we treat different variations of the same word, like The, the, and THE, all the same way.

### It's always a good idea to get check out your dataset before you proceed.¶

In [2]:
len(reviews) #No. of reviews

Out[2]:
25000
In [3]:
reviews[0] #first review

Out[3]:
'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '
In [4]:
labels[0] #first label

Out[4]:
'POSITIVE'

# Developing a Predictive Theory¶

### Analysing how you would go about predicting whether its a positive or a negative review.¶

In [5]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...

In [41]:
from collections import Counter
import numpy as np


We'll create three Counter objects, one for words from postive reviews, one for words from negative reviews, and one for all the words.

In [56]:
# Create three Counter objects to store positive, negative and total counts
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()


Examine all the reviews. For each word in a positive review, increase the count for that word in both your positive counter and the total words counter; likewise, for each word in a negative review, increase the count for that word in both your negative counter and the total words counter. You should use split(' ') to divide a piece of text (such as a review) into individual words.

In [57]:
# Loop over all the words in all the reviews and increment the counts in the appropriate counter objects
for i in range(len(reviews)):
if(labels[i] == 'POSITIVE'):
for word in reviews[i].split(" "):
positive_counts[word] += 1
total_counts[word] += 1
else:
for word in reviews[i].split(" "):
negative_counts[word] += 1
total_counts[word] += 1


### Most common positive & negative words¶

In [ ]:
positive_counts.most_common()


The above statement retrieves alot of words, the top 3 being : ('the', 173324), ('.', 159654), ('and', 89722),

In [ ]:
negative_counts.most_common()


The above statement retrieves alot of words, the top 3 being : ('', 561462), ('.', 167538), ('the', 163389),

As you can see, common words like "the" appear very often in both positive and negative reviews. Instead of finding the most common words in positive or negative reviews, what you really want are the words found in positive reviews more often than in negative reviews, and vice versa. To accomplish this, you'll need to calculate the ratios of word usage between positive and negative reviews.

The positive-to-negative ratio for a given word can be calculated with positive_counts[word] / float(negative_counts[word]+1). Notice the +1 in the denominator – that ensures we don't divide by zero for words that are only seen in positive reviews.

In [58]:
pos_neg_ratios = Counter()

# Calculate the ratios of positive and negative uses of the most common words
# Consider words to be "common" if they've been used at least 100 times
for term,cnt in list(total_counts.most_common()):
if(cnt > 100):
pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
pos_neg_ratios[term] = pos_neg_ratio


Examine the ratios

In [12]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

Pos-to-neg ratio for 'the' = 1.0607993145235326
Pos-to-neg ratio for 'amazing' = 4.022813688212928
Pos-to-neg ratio for 'terrible' = 0.17744252873563218


We see the following:

• Words that you would expect to see more often in positive reviews – like "amazing" – have a ratio greater than 1. The more skewed a word is toward postive, the farther from 1 its positive-to-negative ratio will be.
• Words that you would expect to see more often in negative reviews – like "terrible" – have positive values that are less than 1. The more skewed a word is toward negative, the closer to zero its positive-to-negative ratio will be.
• Neutral words, which don't really convey any sentiment because you would expect to see them in all sorts of reviews – like "the" – have values very close to 1. A perfectly neutral word – one that was used in exactly the same number of positive reviews as negative reviews – would be almost exactly 1.

Ok, the ratios tell us which words are used more often in postive or negative reviews, but the specific values we've calculated are a bit difficult to work with. A very positive word like "amazing" has a value above 4, whereas a very negative word like "terrible" has a value around 0.18. Those values aren't easy to compare for a couple of reasons:

• Right now, 1 is considered neutral, but the absolute value of the postive-to-negative rations of very postive words is larger than the absolute value of the ratios for the very negative words. So there is no way to directly compare two numbers and see if one word conveys the same magnitude of positive sentiment as another word conveys negative sentiment. So we should center all the values around netural so the absolute value fro neutral of the postive-to-negative ratio for a word would indicate how much sentiment (positive or negative) that word conveys.
• When comparing absolute values it's easier to do that around zero than one.

To fix these issues, we'll convert all of our ratios to new values using logarithms (i.e. use np.log(ratio))

In the end, extremely positive and extremely negative words will have positive-to-negative ratios with similar magnitudes but opposite signs.

In [59]:
# Convert ratios to logs
for word,ratio in pos_neg_ratios.most_common():
pos_neg_ratios[word] = np.log(ratio)


Examine the new ratios

In [14]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

Pos-to-neg ratio for 'the' = 0.05902269426102881
Pos-to-neg ratio for 'amazing' = 1.3919815802404802
Pos-to-neg ratio for 'terrible' = -1.7291085042663878


If everything worked, now you should see neutral words with values close to zero. In this case, "the" is near zero but slightly positive, so it was probably used in more positive reviews than negative reviews. But look at "amazing"'s ratio - it's above 1, showing it is clearly a word with positive sentiment. And "terrible" has a similar score, but in the opposite direction, so it's below -1. It's now clear that both of these words are associated with specific, opposing sentiments.

Run the below code to see more ratios.

It displays all the words, ordered by how associated they are with postive reviews.

In [ ]:
pos_neg_ratios.most_common()


The top most common words for the above code : ('edie', 4.6913478822291435), ('paulie', 4.0775374439057197), ('felix', 3.1527360223636558), ('polanski', 2.8233610476132043), ('matthau', 2.8067217286092401), ('victoria', 2.6810215287142909), ('mildred', 2.6026896854443837), ('gandhi', 2.5389738710582761), ('flawless', 2.451005098112319), ('superbly', 2.2600254785752498), ('perfection', 2.1594842493533721), ('astaire', 2.1400661634962708), ('captures', 2.0386195471595809), ('voight', 2.0301704926730531), ('wonderfully', 2.0218960560332353), ('powell', 1.9783454248084671), ('brosnan', 1.9547990964725592)

# Creating the Input/Output Data¶

Create a set named vocab that contains every word in the vocabulary.

In [19]:
vocab = set(total_counts.keys())


Check vocabulary size

In [20]:
vocab_size = len(vocab)
print(vocab_size)

74074


Th following image rpresents the layers of the neural network you'll be building throughout this notebook. layer_0 is the input layer, layer_1 is a hidden layer, and layer_2 is the output layer.

In [1]:


Out[1]:

TODO: Create a numpy array called layer_0 and initialize it to all zeros. Create layer_0 as a 2-dimensional matrix with 1 row and vocab_size columns.

In [21]:
layer_0 = np.zeros((1,vocab_size))


layer_0 contains one entry for every word in the vocabulary, as shown in the above image. We need to make sure we know the index of each word, so run the following cell to create a lookup table that stores the index of every word.

TODO: Complete the implementation of update_input_layer. It should count how many times each word is used in the given review, and then store those counts at the appropriate indices inside layer_0.

In [ ]:
# Create a dictionary of words in the vocabulary mapped to index positions
# (to be used in layer_0)
word2index = {}
for i,word in enumerate(vocab):
word2index[word] = i


It stores the indexes like this: 'antony': 22, 'pinjar': 23, 'helsig': 24, 'dances': 25, 'good': 26, 'willard': 71500, 'faridany': 27, 'foment': 28, 'matts': 12313,

Lets implement some functions for simplifying our inputs to the neural network.

In [25]:
def update_input_layer(review):
"""
The element at a given index of layer_0 should represent
how many times the given word occurs in the review.
"""

global layer_0

# clear out previous state, reset the layer to be all 0s
layer_0 *= 0

# count how many times each word is used in the given review and store the results in layer_0
for word in review.split(" "):
layer_0[0][word2index[word]] += 1


Run the following cell to test updating the input layer with the first review. The indices assigned may not be the same as in the solution, but hopefully you'll see some non-zero values in layer_0.

In [26]:
update_input_layer(reviews[0])
layer_0

Out[26]:
array([[ 18.,   0.,   0., ...,   0.,   0.,   0.]])

get_target_for_labels should return 0 or 1, depending on whether the given label is NEGATIVE or POSITIVE, respectively.

In [27]:
def get_target_for_label(label):
if(label == 'POSITIVE'):
return 1
else:
return 0


# Building a Neural Network¶

In [32]:
import time
import sys
import numpy as np

# Encapsulate our neural network in a class
class SentimentNetwork:
def __init__(self, reviews,labels,hidden_nodes = 10, learning_rate = 0.1):
"""
Args:
reviews(list) - List of reviews used for training
labels(list) - List of POSITIVE/NEGATIVE labels
hidden_nodes(int) - Number of nodes to create in the hidden layer
learning_rate(float) - Learning rate to use while training

"""
# Assign a seed to our random number generator to ensure we get
# reproducable results
np.random.seed(1)

# process the reviews and their associated labels so that everything
self.pre_process_data(reviews, labels)

# Build the network to have the number of hidden nodes and the learning rate that
# were passed into this initializer. Make the same number of input nodes as
# there are vocabulary words and create a single output node.
self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

def pre_process_data(self, reviews, labels):

# populate review_vocab with all of the words in the given reviews
review_vocab = set()
for review in reviews:
for word in review.split(" "):

# Convert the vocabulary set to a list so we can access words via indices
self.review_vocab = list(review_vocab)

# populate label_vocab with all of the words in the given labels.
label_vocab = set()
for label in labels:

# Convert the label vocabulary set to a list so we can access labels via indices
self.label_vocab = list(label_vocab)

# Store the sizes of the review and label vocabularies.
self.review_vocab_size = len(self.review_vocab)
self.label_vocab_size = len(self.label_vocab)

# Create a dictionary of words in the vocabulary mapped to index positions
self.word2index = {}
for i, word in enumerate(self.review_vocab):
self.word2index[word] = i

# Create a dictionary of labels mapped to index positions
self.label2index = {}
for i, label in enumerate(self.label_vocab):
self.label2index[label] = i

def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
# Set number of nodes in input, hidden and output layers.
self.input_nodes = input_nodes
self.hidden_nodes = hidden_nodes
self.output_nodes = output_nodes

# Store the learning rate
self.learning_rate = learning_rate

# Initialize weights

# These are the weights between the input layer and the hidden layer.
self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))

# These are the weights between the hidden layer and the output layer.
self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,
(self.hidden_nodes, self.output_nodes))

# The input layer, a two-dimensional matrix with shape 1 x input_nodes
self.layer_0 = np.zeros((1,input_nodes))

def update_input_layer(self,review):

# clear out previous state, reset the layer to be all 0s
self.layer_0 *= 0

for word in review.split(" "):
if(word in self.word2index.keys()):
self.layer_0[0][self.word2index[word]] += 1

def get_target_for_label(self,label):
if(label == 'POSITIVE'):
return 1
else:
return 0

def sigmoid(self,x):
return 1 / (1 + np.exp(-x))

def sigmoid_output_2_derivative(self,output):
return output * (1 - output)

def train(self, training_reviews, training_labels):

# make sure out we have a matching number of reviews and labels
assert(len(training_reviews) == len(training_labels))

# Keep track of correct predictions to display accuracy during training
correct_so_far = 0

# Remember when we started for printing time statistics
start = time.time()

# loop through all the given reviews and run a forward and backward pass,
# updating weights for every item
for i in range(len(training_reviews)):

# Get the next review and its correct label
review = training_reviews[i]
label = training_labels[i]

### Forward pass ###

# Input Layer
self.update_input_layer(review)

# Hidden layer
layer_1 = self.layer_0.dot(self.weights_0_1)

# Output layer
layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))

### Backward pass ###

# Output error
layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the difference between desired target and actual output.
layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

# Backpropagated error
layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

# Update the weights
self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate # update input-to-hidden weights with gradient descent step

# Keep track of correct predictions.
if(layer_2 >= 0.5 and label == 'POSITIVE'):
correct_so_far += 1
elif(layer_2 < 0.5 and label == 'NEGATIVE'):
correct_so_far += 1

sys.stdout.write(" #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
+ " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")

def test(self, testing_reviews, testing_labels):
"""
Attempts to predict the labels for the given testing_reviews,
and uses the test_labels to calculate the accuracy of those predictions.
"""

# keep track of how many correct predictions we make
correct = 0

# Loop through each of the given reviews and call run to predict
# its label.
for i in range(len(testing_reviews)):
pred = self.run(testing_reviews[i])
if(pred == testing_labels[i]):
correct += 1

sys.stdout.write(" #Correct:" + str(correct) + " #Tested:" + str(i+1) \
+ " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")

def run(self, review):
"""
Returns a POSITIVE or NEGATIVE prediction for the given review.
"""
# Run a forward pass through the network, like in the "train" function.

# Input Layer
self.update_input_layer(review.lower())

# Hidden layer
layer_1 = self.layer_0.dot(self.weights_0_1)

# Output layer
layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))

# Return POSITIVE for values above greater-than-or-equal-to 0.5 in the output layer;
# return NEGATIVE for other values
if(layer_2[0] >= 0.5):
return "POSITIVE"
else:
return "NEGATIVE"



Run the following code to create the network with a small learning rate, 0.001, and then train the new network. Using learning rate larger than this, for example 0.1 or even 0.01 would result in poor performance.

In [ ]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.001)
mlp.train(reviews[:-1000],labels[:-1000])


Running the above code would have given an accuracy around 62.2%

# Reducing Noise in Our Input Data¶

Counting how many times each word occured in our review might not be the most efficient way. Instead just including whether a word was there or not will improve our training time and accuracy. Hence we update our update_input_layer() function.

In [ ]:
def update_input_layer(self,review):
self.layer_0 *= 0

for word in review.split(" "):
if(word in self.word2index.keys()):
self.layer_0[0][self.word2index[word]] =1


# Reducing Noise by Strategically Reducing the Vocabulary¶

Let us put the pos to neg ratio's that we found were much more effective at detecting a positive or negative label. We could do that by a few change:

• Modify pre_process_data:
• Add two additional parameters: min_count and polarity_cutoff
• Calculate the positive-to-negative ratios of words used in the reviews.
• Change so words are only added to the vocabulary if they occur in the vocabulary more than min_count times.
• Change so words are only added to the vocabulary if the absolute value of their postive-to-negative ratio is at least polarity_cutoff
In [ ]:
def pre_process_data(self, reviews, labels, polarity_cutoff, min_count):

positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

for i in range(len(reviews)):
if(labels[i] == 'POSITIVE'):
for word in reviews[i].split(" "):
positive_counts[word] += 1
total_counts[word] += 1
else:
for word in reviews[i].split(" "):
negative_counts[word] += 1
total_counts[word] += 1

pos_neg_ratios = Counter()

for term,cnt in list(total_counts.most_common()):
if(cnt >= 50):
pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
pos_neg_ratios[term] = pos_neg_ratio

for word,ratio in pos_neg_ratios.most_common():
if(ratio > 1):
pos_neg_ratios[word] = np.log(ratio)
else:
pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))

# populate review_vocab with all of the words in the given reviews
review_vocab = set()
for review in reviews:
for word in review.split(" "):
if(total_counts[word] > min_count):
if(word in pos_neg_ratios.keys()):
if((pos_neg_ratios[word] >= polarity_cutoff) or (pos_neg_ratios[word] <= -polarity_cutoff)):
else:

# Convert the vocabulary set to a list so we can access words via indices
self.review_vocab = list(review_vocab)

# populate label_vocab with all of the words in the given labels.
label_vocab = set()
for label in labels:

# Convert the label vocabulary set to a list so we can access labels via indices
self.label_vocab = list(label_vocab)

# Store the sizes of the review and label vocabularies.
self.review_vocab_size = len(self.review_vocab)
self.label_vocab_size = len(self.label_vocab)

# Create a dictionary of words in the vocabulary mapped to index positions
self.word2index = {}
for i, word in enumerate(self.review_vocab):
self.word2index[word] = i

# Create a dictionary of labels mapped to index positions
self.label2index = {}
for i, label in enumerate(self.label_vocab):
self.label2index[label] = i


## The Inside Out of ML Based Prescriptive Analytics

With the constantly growing number of data, more and more companies are shifting towards analytic solutions. Analytic solutions help in extracting the meaning from the huge amount of data available. Thus, improving decision making.

Decision making is an important aspect of businesses, and technologies like Machine Learning are enhancing it further. The growing use of Machine Learning has changed the way of prescriptive analytics. In order to optimize the efforts, companies need to be more accurate with the historical and present data. This is because the historical and present data are the essentials of analytics. This article helps describe the inside out of Machine Learning-based prescriptive analytics.

Descriptive analytics, predictive analytics, and prescriptive analytics are the three phases of business analytics. Descriptive analytics, being the first one, deals with past performance. Historical data is mined to understand past performance. This serves as a way to look for the reasons behind past success and failure. It is a kind of post-mortem analysis and most management reporting like sales, marketing, operations, and finance etc. make use of this.

The second one is a predictive analysis which answers the question of what is likely to happen. The historical data is now combined with rules, algorithms etc. to determine the possible future outcome or likelihood of a situation occurring.

The final phase, well known to everyone, is prescriptive analytics. It can continually take in new data and re-predict and re-prescribe. This improves the accuracy of the prediction and prescribes better decision options.  Professional services or technology or their combination can be chosen to perform all the three analytics.

The analysis of business activities goes through many phases. Prescriptive analytics is one such. It is known to be the third phase of business analytics and comes after descriptive and predictive analytics. It entails the application of mathematical and computational sciences. It makes use of the results obtained from descriptive and predictive analysis to suggest decision options. It goes beyond predicting future outcomes and suggests actions to benefit from the predictions. It shows the implications of each decision option. It anticipates on what will happen when it will happen as well as why it will happen.

ML-based prescriptive analytics

Being just before the prescriptive analytics, predictive analytics is often confused with it. What actually happens is predictive analysis leads to prescriptive analysis. Thus, a Machine Learning based prescriptive analytics goes through an ML-based predictive analysis first. Therefore, it becomes necessary to consider the ML-based predictive analysis first.

ML-based predictive analytics:

A lot of things prevent businesses from achieving predictive analysis capabilities.  Machine Learning can be a great help in boosting Predictive analytics. Use of Machine Learning and Artificial Intelligence algorithms helps businesses in optimizing and uncovering the new statistical patterns. These statistical patterns form the backbone of predictive analysis. E-commerce, marketing, customer service, medical diagnosis etc. are some of the prospective use cases for Machine Learning based predictive analytics.

In E-commerce, machine learning can help in predicting the usual choices of the customer. Thus, presenting him/her according to his/her likes and dislikes. It can also help in predicting fraudulent transaction. Similarly, B2B marketing also makes good use of Machine learning based predictive analytics. Customer services and medical diagnosis also benefit from predictive analytics. Thus, a prediction and a prescription based on machine learning can boost various business functions.

Organizations and software development companies are making more and more use of machine learning based predictive analytics. The advancements like neural networks and deep learning algorithms are able to uncover hidden information. This all requires a well-researched approach. Big data and progressive IT systems also act as important factors in this.

## Language Detecting with sklearn by determining Letter Frequencies

Of course, there are better and more efficient methods to detect the language of a given text than counting its lettes. On the other hand this is a interesting little example to show the impressing ability of todays machine learning algorithms to detect hidden patterns in a given set of data.

For example take the sentence:

“Ceci est une phrase française.”

It’s not to hard to figure out that this sentence is french. But the (lowercase) letters of the same sentence in a random order look like this:

“eeasrsçneticuaicfhenrpaes”

Still sure it’s french? Regarding the fact that this string contains the letter “ç” some people could have remembered long passed french lessons back in school and though might have guessed right. But beside the fact that the french letter “ç” is also present for example in portuguese, turkish, catalan and a few other languages, this is still a easy example just to explain the problem. Just try to guess which language might have generated this:

“ogldviisnntmeyoiiesettpetorotrcitglloeleiengehorntsnraviedeenltseaecithooheinsnstiofwtoienaoaeefiitaeeauobmeeetdmsflteightnttxipecnlgtetgteyhatncdisaceahrfomseehmsindrlttdthoaranthahdgasaebeaturoehtrnnanftxndaeeiposttmnhgttagtsheitistrrcudf”

While this looks simply confusing to the human eye and it seems practically impossible to determine the language it was generated from, this string still contains as set of hidden but well defined patterns from which the language could be predictet with almost complete (ca. 98-99%) certainty.

First of all, we need a set of texts in the languages our model should be able to recognise. Luckily with the package NLTK there comes a big set of example texts which actually are protocolls of the european parliament and therefor are publicly availible in 11 differen languages:

•  Danish
•  Dutch
•  English
•  Finnish
•  French
•  German
•  Greek
•  Italian
•  Portuguese
•  Spanish
•  Swedish

Because the greek version is not written with the latin alphabet, the detection of the language greek would just be too simple, so we stay with the other 10 languages availible. To give you a idea of the used texts, here is a little sample:

“Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded ‘millennium bug’ failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.”

## Train and Test

The following code imports the nessesary modules and reads the sample texts from a set of text files into a pandas.Dataframe object and prints some statistics about the read texts:

Above you see a sample set of random rows of the created Dataframe. After removing very short text snipplets (less than 200 chars) we are left with 56481 snipplets. The function clean_eutextdf() then creates a lower case representation of the texts in the coloum ‘ltext’ to facilitate counting the chars in the next step.
The following code snipplet now extracs the features – in this case the relative frequency of each letter in every text snipplet – that are used for prediction:

Now that we have calculated the features for every text snipplet in our dataset, we can split our data set in a train and test set:

After doing that, we can train a k-nearest-neigbours classifier and test it to get the percentage of correctly predicted languages in the test data set. Because we do not know what value for k may be the best choice, we just run the training and testing with different values for k in a for loop:

As you can see in the output the reliability of the language classifier is generally very high: It starts at about 97.5% for k = 1, increases for with increasing values of k until it reaches a maximum level of about 98.5% at k ≈ 10.

## Using the Classifier to predict languages of texts

Now that we have trained and tested the classifier we want to use it to predict the language of example texts. To do that we need two more functions, shown in the following piece of code. The first one extracts the nessesary features from the sample text and predict_lang() predicts the language of a the texts:

With this classifier it is now also possible to predict the language of the randomized example snipplet from the introduction (which is acutally created from the first paragraph of this article):

The KNN classifier of sklearn also offers the possibility to predict the propability with which a given classification is made. While the probability distribution for a specific language is relativly clear for long sample texts it decreases noticeably the shorter the texts are.

## Background and Insights

Why does a relative simple model like counting letters acutally work? Every language has a specific pattern of letter frequencies which can be used as a kind of fingerprint: While there are almost no y‘s in the german language this letter is quite common in english. In french the letter k is not very common because it is replaced with q in most cases.

For a better understanding look at the output of the following code snipplet where only three letters already lead to a noticable form of clustering:

Even though every single letter frequency by itself is not a very reliable indicator, the set of frequencies of all present letters in a text is a quite good evidence because it will more or less represent the letter frequency fingerprint of the given language. Since it is quite hard to imagine or visualize the above plot in more than three dimensions, I used a little trick which shows that every language has its own typical fingerprint of letter frequencies:

## What more?

Beside the fact, that letter frequencies alone, allow us to predict the language of every example text (at least in the 10 languages with latin alphabet we trained for) with almost complete certancy there is even more information hidden in the set of sample texts.

As you might know, most languages in europe belong to either the romanian or the indogermanic language family (which is actually because the romans conquered only half of europe). The border between them could be located in belgium, between france and germany and in swiss. West of this border the romanian languages, which originate from latin, are still spoken, like spanish, portouguese and french. In the middle and northern part of europe the indogermanic languages are very common like german, dutch, swedish ect. If we plot the analysed languages with a different colour sheme this border gets quite clear and allows us to take a look back in history that tells us where our languages originate from:

As you can see the more common letters, especially the vocals like a, e, i, o and u have almost the same frequency in all of this languages. Far more interesting are letters like q, k, c and w: While k is quite common in all of the indogermanic languages it is quite rare in romanic languages because the same sound is written with the letters q or c.
As a result it could be said, that even “boring” sets of data (just give it a try and read all the texts of the protocolls of the EU parliament…) could contain quite interesting patterns which – in this case – allows us to predict quite precisely which language a given text sample is written in, without the need of any translation program or to speak the languages. And as an interesting side effect, where certain things in history happend (or not happend): After two thousand years have passed, modern machine learning techniques could easily uncover this history because even though all these different languages developed, they still have a set of hidden but common patterns that since than stayed the same.

## Sentiment Analysis using Python

One of the applications of text mining is sentiment analysis. Most of the data is getting generated in textual format and in the past few years, people are talking more about NLP. Improvement is a continuous process and many product based companies leverage these text mining techniques to examine the sentiments of the customers to find about what they can improve in the product. This information also helps them to understand the trend and demand of the end user which results in Customer satisfaction.

As text mining is a vast concept, the article is divided into two subchapters. The main focus of this article will be calculating two scores: sentiment polarity and subjectivity using python. The range of polarity is from -1 to 1(negative to positive) and will tell us if the text contains positive or negative feedback. Most companies prefer to stop their analysis here but in our second article, we will try to extend our analysis by creating some labels out of these scores. Finally, a multi-label multi-class classifier can be trained to predict future reviews.

Without any delay let’s deep dive into the code and mine some knowledge from textual data.

There are a few NLP libraries existing in Python such as Spacy, NLTK, gensim, TextBlob, etc. For this particular article, we will be using NLTK for pre-processing and TextBlob to calculate sentiment polarity and subjectivity.

The dataset is available here for download and we will be using pandas read_csv function to import the dataset. I would like to share an additional information here which I came to know about recently. Those who have already used python and pandas before they probably know that read_csv is by far one of the most used function. However, it can take a while to upload a big file. Some folks from  RISELab at UC Berkeley created Modin or Pandas on Ray which is a library that speeds up this process by changing a single line of code.

After importing the dataset it is recommended to understand it first and study the structure of the dataset. At this point we are interested to know how many columns are there and what are these columns so I am going to check the shape of the data frame and go through each column name to see if we need them or not.

There are so many columns which are not useful for our sentiment analysis and it’s better to remove these columns. There are many ways to do that: either just select the columns which you want to keep or select the columns you want to remove and then use the drop function to remove it from the data frame. I prefer the second option as it allows me to look at each column one more time so I don’t miss any important variable for the analysis.

Now let’s dive deep into the data and try to mine some knowledge from the remaining columns. The first step we would want to follow here is just to look at the distribution of the variables and try to make some notes. First, let’s look at the distribution of the ratings.

Graphs are powerful and at this point, just by looking at the above bar graph we can conclude that most people are somehow satisfied with the products offered at Amazon. The reason I am saying ‘at’ Amazon is because it is just a platform where anyone can sell their products and the user are giving ratings to the product and not to Amazon. However, if the user is satisfied with the products it also means that Amazon has a lower return rate and lower fraud case (from seller side). The job of a Data Scientist relies not only on how good a model is but also on how useful it is for the business and that’s why these business insights are really important.

## Data pre-processing for textual variables

### Lowercasing

Before we move forward to calculate the sentiment scores for each review it is important to pre-process the textual data. Lowercasing helps in the process of normalization which is an important step to keep the words in a uniform manner (Welbers, et al., 2017, pp. 245-265).

### Special characters

Special characters are non-alphabetic and non-numeric values such as {!,@#\$%^ *()~;:/<>\|+_-[]?}. Dealing with numbers is straightforward but special characters can be sometimes tricky. During tokenization, special characters create their own tokens and again not helpful for any algorithm, likewise, numbers.

### Stopwords

Stop-words being most commonly used in the English language; however, these words have no predictive power in reality. Words such as I, me, myself, he, she, they, our, mine, you, yours etc.

### Stemming

Stemming algorithm is very useful in the field of text mining and helps to gain relevant information as it reduces all words with the same roots to a common form by removing suffixes such as -action, ing, -es and -ses. However, there can be problematic where there are spelling errors.

This step is extremely useful for pre-processing textual data but it also depends on your goal. Here our goal is to calculate sentiment scores and if you look closely to the above code words like ‘inexpensive’ and ‘thrilled’ became ‘inexpens’ and ‘thrill’ after applying this technique. This will help us in text classification to deal with the curse of dimensionality but to calculate the sentiment score this process is not useful.

### Sentiment Score

It is now time to calculate sentiment scores of each review and check how these scores look like.

As it can be observed there are two scores: the first score is sentiment polarity which tells if the sentiment is positive or negative and the second score is subjectivity score to tell how subjective is the text. The whole code is available here.

In my next article, we will extend this analysis by creating labels based on these scores and finally we will train a classification model.

## Dem Wettbewerb voraus mit Künstlicher Intelligenz

### Was KI schon heute kann und was bis 2020 auf deutsche Unternehmen zukommt

Künstliche Intelligenz ist für die Menschheit wichtiger als die Erfindung von Elektrizität oder die Beherrschung des Feuers – davon sind der Google-CEO Sundar Pichai und viele weitere Experten überzeugt. Doch was steckt wirklich dahinter? Welche Anwendungsfälle funktionieren schon heute? Und was kommt bis 2020 auf deutsche Unternehmen zu?

Big Data war das Buzzword der vergangenen Jahre und war – trotz mittlerweile etablierter Tools wie SAP Hana, Hadoop und weitere – betriebswirtschaftlich zum Scheitern verurteilt. Denn Big Data ist ein passiver Begriff und löst keinesfalls alltägliche Probleme in den Unternehmen.

Dabei wird völlig verkannt, dass Big Data die Vorstufe für den eigentlichen Problemlöser ist, der gemeinhin als Künstliche Intelligenz (KI) bezeichnet wird. KI ist ein Buzzword, dessen langfristiger Erfolg und Aktivismus selbst von skeptischen Experten nicht infrage gestellt wird. Daten-Ingenieure sprechen im Kontext von KI hier aktuell bevorzugt von Deep Learning; wissenschaftlich betrachtet ein Teilgebiet der KI.

## Was KI schon heute kann

Deep Learning Algorithmen laufen bereits heute in Nischen-Anwendungen produktiv, beispielsweise im Bereich der Chatbots oder bei der Suche nach Informationen. Sie übernehmen ferner das Rating für die Kreditwürdigkeit und sperren Finanzkonten, wenn sie erlernte Betrugsmuster erkennen. Im Handel findet Deep Learning bereits die optimalen Einkaufsparameter sowie den besten Verkaufspreis.

Getrieben wird Deep Learning insbesondere durch prestigeträchtige Vorhaben wie das autonome Fahren, dabei werden die vielfältigen Anwendungen im Geschäftsbereich oft vergessen.

## Die Grenzen von Deep Learning

Und Big Data ist das Futter für Deep Learning. Daraus resultiert auch die Grenze des Möglichen, denn für strategische Entscheidungen eignet sich KI bestenfalls für das Vorbereitung einer Datengrundlage, aus denen menschliche Entscheider eine Strategie entwickeln. KI wird zumindest in dieser Dekade nur auf operativer Ebene Entscheidungen treffen können, insbesondere in der Disposition, Instandhaltung, Logistik und im Handel auch im Vertrieb – anfänglich jeweils vor allem als Assistenzsystem für die Menschen.

Genau wie das autonome Fahren mit Assistenzsystemen beginnt, wird auch im Unternehmen immer mehr die KI das Steuer übernehmen.

## Was sich hinsichtlich KI bis 2020 tun wird

Derzeit stehen wir erst am Anfang der Möglichkeiten, die Künstliche Intelligenz uns bietet. Das Markt-Wachstum für KI-Systeme und auch die Anwendungen erfolgt exponentiell. Entsprechend wird sich auch die Arbeitsweise für KI-Entwickler ändern müssen. Mit etablierten Deep Learning Frameworks, die mehrheitlich aus dem Silicon Valley stammen, zeichnet sich der Trend ab, der für die Zukunft noch weiter professionalisiert werden wird: KI-Frameworks werden Enterprise-fähig und Distributionen dieser Plattformen werden es ermöglichen, dass KI-Anwendungen als universelle Kernintelligenz für das operative Geschäft für fast alle Unternehmen binnen weniger Monate implementierbar sein werden.

Wir können bis 2020 also mit einer Alexa oder Cortana für das Unternehmen rechnen, die Unternehmensprozesse optimiert, Risiken berichtet und alle alltäglichen Fragen des Geschäftsführers beantwortet – in menschlich-verbal formulierten Sätzen.

Der Einsatz von Künstlicher Intelligenz zur Auswertung von Geschäfts- oder Maschinendaten ist auch das Leit-Thema der zweitägigen Data Leader Days 2018 in Berlin. Am 14. November 2018 sprechen renommierte Data Leader über Anwendungsfälle, Erfolge und Chancen mit Geschäfts- und Finanzdaten. Der 15. November 2018 konzentriert sich auf Automotive- und Maschinendaten mit hochrangigen Anwendern aus der produzierenden Industrie und der Automobilzuliefererindustrie. Seien Sie dabei und nutzen Sie die Chance, sich mit führenden KI-Anwendern auszutauschen.