A Gentle Introduction to Precision and Recall.

The idea of this blog post is to give an intuitive understanding of precision and recall for a binary classification problem. I will shy away from explaining them in a textbook way and rather try to give an intuition. Nevertheless, let me write the textbook formulas first:

Precision = True Positives / (True Positives + False Positives)

Recall = True Positives / (True Positives + False Negatives)

The problem with this nomenclature is that, despite being correct, it can be a bit confusing, especially for beginners. For example, 'false positives' could be understood from the classifier's point of view or from the population's point of view.

Visualizing with an example

Let's suppose we have a classifier that differentiates jeans from T-shirts in a lot of clothes. This lot has 100 pieces altogether: 70 jeans and 30 T-shirts. Until this point, we just have a collection of clothes and no classifier.

We already know that altogether we truly have 70 jeans and 30 T-shirts.

Now let's run the classifier to separate the jeans from the T-shirts. Assume it produces the following result:

We see that out of 70 jeans the classifier identifies 63 correctly as jeans and the remaining 7 as non-jeans. Out of 30 T-shirts, the classifier identifies 11 falsely as jeans and the remaining 19 correctly as non-jeans.

So recall is nothing but the proportion of correctly identified jeans out of the total number of jeans:

Recall = 63 / 70 = 0.9

Precision is the number of true jeans among all items classified as jeans:

Precision = 63 / (63 + 11) ≈ 0.85

Hence, in a way, recall reflects the classifier's ability to deal with jeans, while precision reflects its ability to deal with both jeans and non-jeans.

This seems to provide better intuition than the textbook formula.
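For readers who like to see the arithmetic in code, here is a minimal sketch in R that recomputes both metrics from the counts of the jeans example above; the variable names are purely illustrative:

TP <- 63   # jeans correctly classified as jeans
FN <- 7    # jeans wrongly classified as non-jeans
FP <- 11   # T-shirts wrongly classified as jeans

recall    <- TP / (TP + FN)   # 63 / 70  = 0.9
precision <- TP / (TP + FP)   # 63 / 74  ≈ 0.85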

Diving Deeper with another example

Let us go through one more example to cement the idea. Imagine a village with a notoriously high number of criminals. A special cop arrives to tackle the law-and-order situation. He interviews every resident and locks up some of them based on hunches.

If many criminals are still roaming the streets, recall is bad, as recall reflects the ability to deal with the class the classifier is supposed to find (in this case, the criminals).

If too many innocents are rotting in jail, precision is bad, as precision also depends on how the classifier handles the 'others', i.e. the class it is not supposed to find (in this case, the innocents).

Now we see that we want neither too many criminals roaming the streets nor many innocents rotting in jail. Hence we need both recall and precision to be high, or in other words, their mean to be high. But this cannot be the arithmetic mean. Let's see why, using an example.

Suppose a village of 2000 residents contains 100 criminals, and the cop straight away locks up all 2000 residents. The confusion matrix then looks like this:

                         Actually a criminal   Actually innocent
Classified as criminal          100                  1900
Classified as innocent            0                     0

Recall = 100 / (100 + 0) = 1

Precision = 100 / (100 + 1900) = 0.05

Arithmetic mean of precision and recall = (1 + 0.05) / 2 = 0.525

This would look like a pretty good classifier, even though we know that in reality it is a bad one (or a bad cop who just locks up every person he meets). It can be shown that the same happens in reverse: if the cop locks up almost no one, the arithmetic mean again fails to show the true picture.

That's why we use the harmonic mean instead. We call it the F1 score, and it is calculated as F1 = (2 * Precision * Recall) / (Precision + Recall) = (2 * 1 * 0.05) / (1 + 0.05) = 0.0952

Now, this looks like a more realistic score. So, the performance of a classifier can be judged with a harmonic mean between precision and recall.
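As a quick sanity check, the following short R snippet, using the numbers from the cop example above, contrasts the two means:

precision <- 100 / (100 + 1900)                       # 0.05
recall    <- 100 / (100 + 0)                          # 1
arithmetic_mean <- (precision + recall) / 2           # 0.525, looks deceptively good
f1 <- 2 * precision * recall / (precision + recall)   # ~0.095, reveals the bad classifier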

Let’s try to understand one more thing.

Often, classifiers work by returning probabilities of positives and negatives. One way to turn them into a confusion matrix is to use a threshold of 0.5. This means that if the probability of being positive is more than 0.5, we consider the case as positive (in our case a criminal). Otherwise, it is a negative.

But there might be cases where we want recall to be very high. For example, consider a classifier for identifying Ebola. We do not want any case to be missed, because otherwise we risk an outbreak of the disease with disastrous consequences.

In this case, the threshold needs to be kept really low (maybe near 0.1 or lower) so that we raise a flag for every case that has at least a 10% probability and get that person retested. This is an important measure to prevent an outbreak, despite the fact that a lot of false alarms need to be rechecked.

There are other cases with many false alarms (for example, fraudulent transactions in banks) where each alarm is of low risk and it would be expensive to investigate them all. In those cases, we might want a threshold higher than 0.5.
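To see how the threshold shifts this trade-off, here is a small illustrative R sketch; the probabilities and labels are made up solely for demonstration:

probs <- c(0.95, 0.40, 0.15, 0.80, 0.05, 0.55)   # hypothetical predicted probabilities
truth <- c(1,    1,    1,    0,    0,    0)      # hypothetical true labels (1 = criminal)

for (threshold in c(0.1, 0.5, 0.9)) {
  pred <- as.integer(probs > threshold)
  TP <- sum(pred == 1 & truth == 1)
  FP <- sum(pred == 1 & truth == 0)
  FN <- sum(pred == 0 & truth == 1)
  cat(sprintf("threshold %.1f: recall %.2f, precision %.2f\n",
              threshold, TP / (TP + FN), TP / max(TP + FP, 1)))
}

Lowering the threshold pushes recall up at the cost of precision, and vice versa.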

This gives us a taste of things to come. A classifier's performance can be plotted for different thresholds, which gives us something called a ROC curve. But let's save that for another post.

How is automation changing data science and machine learning?

We have come a long way since the introduction of data science and machine learning. A recent study found that the volume of business data doubles in less than 14 months. Today, collecting data is no longer the problem; filtering, analyzing, and maintaining the relevant information is the bigger issue.

Data science professionals demand over $100k annually, and paying that sort of money is not feasible for every organization, especially small and medium-sized companies. Google recently announced that it intends to make machine learning technology accessible to every business.

Thanks to automation, access to machine learning technology is now possible even for small businesses. Google, Microsoft, and other companies have come up with automated machine learning tools that enable small businesses to use machine learning to enhance their performance and profit.

With that said, the world still needs a lot of machine learning professionals. Many of them prefer Python for machine learning due to its features and wide range of libraries.

According to a Gartner report, around 40% of data science tasks will be automated by 2020. Data science tools can automate parts of the data science process, but this is not complete automation.

That said, automation has helped a lot in accelerating these tasks. We still need data science professionals to deal with real-world problems, because the algorithms are not yet able to handle messy data. A significant share of data science professionals prefer doing data science with Python for sophisticated tasks.

Automation in Data Science

If I had to use only one word to describe the entire data science process, I would use the word "headache." According to a recent report, the median salary of data scientists easily surpasses $100k annually, and pay will rise further in the time to come.

Getting insights from the collected data costs a lot of money and time. Data scientists spend almost 50-60% of their time on data processing and the rest on modeling and deployment.

Cloud platforms like Amazon Web Services, Google, Microsoft Azure, and so on make the job more comfortable, but there is still a lot of work involved in maintaining the data and extracting useful insights from it.

The data science process has many inefficiencies. First, data scientists need to spend over 50% of their total time on processing messy real-world data. After that, models may need to be customized to the specific problem.

The significant contribution of automation is that a large portion of the data processing is now automated. Secondly, automated platforms make it easier to track various models across multiple parameters, and the time needed to launch an algorithm is minimal.

One example of a comprehensive tool for handling a data science project is Alteryx. It has come up with powerful automated solutions that drastically reduce data processing and model development time, smoothing the entire data science workflow. The platform has been so successful that its share price doubled in little more than a year.

Some other great tools that can help you with data science automation are RapidMiner, H2O.ai, KNIME, and so on. However, the lack of skilled data scientists can still create problems despite these tools. This is where automated machine learning comes in.

How is Machine Learning Transformed by the Entrance of Automation?

The traditional machine learning process is complicated: it requires a lot of expensive machine learning professionals working for months to come up with models for machine learning tasks.

To make traditional machine learning work, one needs to gather data, standardize it, engineer features, create and train the machine learning model, validate it, and finally deploy it.

You may have heard in the past that machine learning is only for large corporations. That has drastically changed in recent times, and it is all due to automation. Keep in mind that the workflow above describes a simple model; complicated models require a lot of extra work. Even the simple ones demand so much time and money that they are out of reach for small and medium companies.

Automation in machine learning is about automating the entire process to make machine learning easier. The only thing you need to do is feed data to the system, and not even a massive volume of it: you do not even need to go beyond a three-figure number of images to get started with automated machine learning platforms.

Microsoft has its own AutoML platform, as does Google, and other AutoML platforms can do the trick for you as well. Using those platforms does not cost an arm and a leg; if you check out the prices, you will be surprised.

There is no need for you to create, test, or deploy models yourself. The algorithm does the job for you: it takes historical examples and models, processes the data, and applies a machine learning algorithm.

Even non-statisticians can implement machine learning with limited data, thanks to automation in machine learning. You can make use of predictive analytics and get easy solutions for simple prediction problems without scratching your head. Numerous libraries can assist you in the automated generation of machine learning pipelines.

How are the jobs of data scientists simplified by the introduction of automation in machine learning and data science?

It is true that the introduction of automation has drastically reduced the time data scientists need to complete their tasks. They no longer have to spend their valuable time on time-consuming, monotonous work that is necessary but does not provide a lot of value.

However, the need for skilled data scientists still exists, and it will remain in the time to come. There is challenging work that machines cannot replace, such as listening to clients, figuring out the root cause of business issues, and developing and selecting the right solution for a specific business problem.

Just like in other types of jobs, the advancement of automation technologies will modify the tasks that data scientists need to perform. They will be able to allocate more time to the things that matter rather than to monotonous tasks.

Final Verdict

The automation of machine learning and data science is still in its beginning stage, yet it is already making a massive impact on the business world. Huge corporations are investing in Big Data and Machine Learning technologies, and we can expect considerable improvements in these technologies shortly.

Soon, the competitive advantage of a business will depend on how well it can use these technologies rather than on mere access to machine learning or Big Data technologies. I hope this article was valuable to you. If you want to add something or share your thoughts, feel free to leave a comment. I will gladly read and reply to it.

A common trap when it comes to sampling from a population that intrinsically includes outliers

I will discuss a common fallacy concerning the conclusions drawn from calculating a sample mean and a sample standard deviation and more importantly how to avoid it.

Suppose you draw a random sample x_1, x_2, … x_N of size N and compute the ordinary (arithmetic) sample mean x_m and a sample standard deviation sd from it. Now if (and only if) the (true) population mean µ (first moment) and the population variance (second moment) of the actual underlying PDF are finite, the numbers x_m and sd make the usual sense; otherwise they are misleading, as will be shown by an example.

By the way: The common correlation coefficient will also be undefined (or in practice always point to zero) in the presence of infinite population variances. Hopefully I will create an article discussing this related fallacy in the near future where a suitable generalization to Lévy-stable variables will be proposed.

Drawing a random sample from a heavy-tailed distribution and discussing certain measures

As an example, suppose you have a one-dimensional random walker whose step length is distributed according to a symmetric standard Cauchy distribution (Lorentz profile) with heavy tails, i.e. an alpha-stable distribution with alpha equal to one. The PDF of an individual independent step is given by p(x) = \frac{1}{\pi (1 + x^2)}, thus neither the first nor the second moment exists in the usual sense, although the first moment vanishes at least in the sense of a principal value due to symmetry.

Still let us generate N = 3000 (pseudo) standard Cauchy random numbers in R* to analyze the behavior of their sample mean and standard deviation sd as a function of the reduced sample size n \leq N.

*The R-code is shown at the end of the article.

Here are the piecewise sample mean (in blue) and standard deviation (in red) for the mentioned Cauchy sampling. We see that both the sample mean and sd include jumps and do not converge.

In particular, the mean still deviates considerably from zero even after 3000 observations. The sample sd has no target value at all, because the population variance is infinite.

If the data is new and no prior distribution is known, computing the sample mean and sd will be misleading. Astonishingly enough, the sample mean itself has (formally exactly) the same distribution as a single step length p(x). This means that the sample mean is also standard Cauchy distributed, implying that with a different Cauchy sample one could easily have observed sample means far away from the values presented in blue.
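A minimal R sketch (independent of the full script at the end of this article) makes this plausible: simulate many Cauchy samples and look at the quartiles of their means, which stay close to the quartiles -1, 0 and 1 of a single standard Cauchy draw instead of concentrating around zero:

set.seed(42)
sample_means <- replicate(10000, mean(rcauchy(1000)))   # mean of 1000 standard Cauchy draws, repeated 10000 times
quantile(sample_means, c(0.25, 0.5, 0.75))              # close to -1, 0, 1, the quartiles of a standard Cauchy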

What sense does it make to present the usual interval x_m \pm sd / \sqrt{N} in such a case? What to do?

The sample median, the median absolute deviation (mad) and the interquartile range (IQR) are more appropriate for describing such a data set that intrinsically includes outliers. To make this plausible, I present the following plot, where the median is shown in black, the mad in green and the IQR in orange.

This example shows that the median, mad and IQR converge quickly towards their expected values and contain no major jumps. These quantities obviously do a better job of describing the sample; even in the presence of outliers they remain robust, and the mad converges more quickly than the IQR. Note that a standard Cauchy sample contains half of its values in the interval median \pm mad, which for this symmetric distribution means that the IQR is twice the mad.

Drawing a random sample from a PDF that has finite moments

Just for comparison, I also show the above quantities for a standard normal (pseudo) sample, labeled with the same colors as before, as a counter-example. In this case not only the sample mean and median but also the sd and mad converge towards their expected values (see plot below). Here all the quantities describe the data set properly and there is no trap, since there are no intrinsic outliers. The sample mean itself is normally distributed (with variance 1/N), so the sd indeed makes sense and one can calculate the standard error \frac{sd}{\sqrt{N}} from it to present the usual stochastic confidence intervals for the sample mean.

A careful observation shows that, in contrast to the Cauchy case, here the sample mean and sd converge more quickly than the sample median and the IQR. However, the sample mad still performs about as well as the sd. Again the IQR is twice the mad.

And here are the graphs of the aforementioned quantities for a pseudo-normal sample:

The take-home-message:

Just be careful when you observe outliers: if you calculate sample quantities right away, you might miss something. Ideally, one carefully observes how the relevant quantities change with sample size, as demonstrated in this article.

Such curves should become of broader interest in order to improve transparency in the Data Science process and reduce fallacies as well.

Thank you for reading.

P.S.: Feel free to play with the set random seed in the R-code below and observe how other quantities behave with rising sample size. Of course you can also try different PDFs at the beginning of the code. You can employ a Cauchy, Gaussian, uniform, exponential or Holtsmark (pseudo) random sample.

 

QUIZ: Which one of the recently mentioned random samples contains a trap** and why?

**in the context of this article

 

R-code used to generate the data and for producing plots:

 

#R-script for emphasizing convergence and divergence of sample means

####install and load relevant packages ####

#uncomment these lines if necessary
#install.packages(c('ggplot2','stabledist'))
#library(ggplot2)
#library(stabledist)

#####drawing random samples #####

#Setting a random seed for being able to reproduce results  
set.seed(1234567)   
N= 2000     #sample size

#Choose a PDF from which a sample shall be drawn
#To do so (un)comment the respective lines of following code

data <- rcauchy(N)    # option1(default): standard Cauchy sampling

#data <- rnorm(N)     #option2: standard Gaussian sampling
                               
#data <- rexp(N)    # option3: standard exponential sampling

#data <- rstable(N,alpha=1.5,beta=0)  # option4: standard symmetric Holtsmark sampling

#data <- runif(N)              #option5: standard uniform sample

#####descriptive statistics####
#preparations/declarations

SUM = vector()
sd =vector()
mean = vector()
SQ =vector()
SQUARES = vector()
median = vector()
mad =vector()
quantiles = data.frame()
sem =vector()

#piecewise calculation of descriptive quantities

for (k in 1:length(data)){              #mainloop
SUM[k] <- sum(data[1:k])            # sum of sample
mean[k] <- mean(data[1:k])          # arithmetic mean
sd[k] <- sd(data[1:k])              # standard deviation
sem[k] <- sd[k]/(sqrt(k))          #standard error of the sample mean (for finite variances)
mad[k] <- mad(data[1:k],const=1)   # median absolute deviation    

for (j in 1:5){
qq <- quantile(data[1:k],na.rm = T)
quantiles[k,j] <- qq[j]         #quantiles of sample
}
colnames(quantiles) <- c('min','Q1','median','Q3','max')

for (i in 1:length(data[1:k])){
SQUARES[i] <- data[i]*data[i]    
}
SQ[k] <- sum(SQUARES[1:k])    #sum of squares of random sample
}  #end of mainloop

#create table containing all relevant data
TABLE <-  as.data.frame(cbind(quantiles,mean,sd,SQ,SUM,sem))




#####plotting results###
x11()
print(ggplot(TABLE,aes(1:N,median))+
geom_point(size=.5)+xlab('sample size n')+ylab('sample median'))
x11()
print(ggplot(TABLE,aes(1:N,mad))+geom_point(size=.5,color ='green')+
xlab('sample size n')+ylab('sample median absolute deviation'))
x11()
print(ggplot(TABLE,aes(1:N,sd))+geom_point(size=.5,color ='red')+
xlab('sample size n')+ylab('sample standard deviation'))
x11()
print(ggplot(TABLE,aes(1:N,mean))+geom_point(size=.5, color ='blue')+
xlab('sample size n')+ylab('sample mean'))
x11()
print(ggplot(TABLE,aes(1:N,Q3-Q1))+geom_point(size=.5, color ='blue')+
xlab('sample size n')+ylab('IQR'))

#uncomment the following lines of code to see further plots

#x11()
#print(ggplot(TABLE,aes(1:N,sem))+geom_point(size=.5)+
#xlab('sample size n')+ylab('standard error of the sample mean'))
#x11()
#print(ggplot(TABLE,aes(1:N,SUM))+geom_point(size=.5)+
#xlab('sample size n')+ylab('sample sum of r.v.'))
#x11()
#print(ggplot(TABLE,aes(1:N,SQ))+geom_point(size=.5)+
#xlab('sample size n')+ylab('sample sum of squares'))

 

Fuzzy Matching with the Jaro-Winkler Score for Evaluating Brand Awareness and Advertising Recall

For companies, brand awareness and advertising recall are important target metrics, because they indicate whether consumers will buy a brand's product or not. Metrics like these are determined by market research institutes through surveys. At regular intervals, a constant number of people are asked whether they remember brands of a particular industry or remember their advertising. The respondents usually fill out an online questionnaire for this.

The results of the survey come in the form of a data matrix (see table) and must first be processed for evaluation.

Respondent no. | Brand 1 | Brand 2 | Brand 3 | Brand 4
1 | ING-Diba | Citigroup | Sparkasse |
2 | Sparkasse | Consorsbank | |
3 | Commerbank | Deutsche Bank | Sparkasse | ING-DiBa
4 | Sparkasse | Targobank | |

The goal is to generate the following 0/1-coded matrix from these data. If a brand is known, a one is entered in the column belonging to that brand, otherwise a zero.

All brands mentioned | ING-Diba | Citigroup | Sparkasse | Targobank
ING-Diba, Citigroup, Sparkasse | 1 | 1 | 1 | 0
Sparkasse, Consorsbank | 0 | 0 | 1 | 0
Commerzbank, Deutsche Bank, Sparkasse, ING-Diba | 1 | 0 | 1 | 0
Sparkasse, Targobank | 0 | 0 | 1 | 1

The usual workflow for this data transformation is to search for a substring of a brand (e.g. "argo" for Targobank) and check whether it occurs in a string that merges all of a respondent's answers. The problem with this approach is that many misspelled words are not captured, and experience shows that misspelled brand names appear in the most varied forms. In the past, employees had to fight their way through the results for hours, manually correct wrongly assigned or unassigned brands, and note down all variations of the words in order to optimize the search pattern for the next survey.

An alternative to this laborious workflow is identifying misspelled words by means of the Jaro-Winkler score. For this, the Jaro distance between two strings must first be calculated. It is computed as follows:

d_j = \frac{1}{3}(\frac{m}{|s_1|}+\frac{m}{|s_2|}+\frac{m - t}{m})

  • m: number of matching letters
  • s: length of the string
  • t: half the number of transpositions of letters needed to make the strings identical ("Ta" and "gobank" are already in the correct order, so t = 0)

From this result, the Jaro-Winkler score can be calculated:
d_w = d_j + (l \cdot p \cdot (1 - d_j))
Here d_j is the Jaro distance, l the number of matching letters from the beginning of the word up to at most the fourth letter, and p a constant factor of 0.1.

For the strings "Targobank" and "Tangobank", the Jaro distance is:

d_j = \frac{1}{3}\left(\frac{8}{9}+\frac{8}{9}+\frac{8 - 0}{8}\right) \approx 0.9259

From this, the Jaro-Winkler score is calculated in the next step:

d_w = 0.9259 + (2 \cdot 0.1 \cdot (1 - 0.9259)) = 0.9407407

Experience so far has shown that scores from about 0.8 or 0.9 upwards work best for finding similar words. A lower threshold finds a great many words that could just as well be assigned to other words, while a threshold above 0.9 often no longer identifies misspelled words at all.
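The worked example above can be verified directly in R, assuming the RecordLinkage package (which is also used in the code below) is installed:

library(RecordLinkage)
jarowinkler("Targobank", "Tangobank")   # should return roughly 0.9407, matching the calculation above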

After this theoretical excursion, I now want to show how all of this can be applied in practice. Since this is a fictitious example, fake data are generated with the following code to demonstrate its practical suitability. It is assumed that people know different numbers of banks and misspell them with a certain probability.

# Create fake survey responses
set.seed(1234)
library(stringi)
library(tidyr)
library(RecordLinkage)
library(xlsx)
library(tm)
library(qdap)
library(stringr)
library(openxlsx)

konsonant <- c("r", "n", "g", "h", "b")
vokal <- c("a", "e", "o", "i", "u")

# Function that, with a given probability, generates a random letter.
generate_wrong_words <- function(x, p, k = TRUE) {
  if(runif(1, 0, 1) > p) { # random value between 0 and 1
    if(k == TRUE) { # generate a consonant or a vowel
      string <- konsonant[sample.int(5, 1)] # random number selecting an index of the consonant vector
    } else {
      string <- vokal[sample.int(5, 1)] # random number selecting an index of the vowel vector
    }
  } else {
    string <- x
  }
  return(string)
}

randombank <- function(x) {
  random_num <- runif(1, 0, 1)
  if(random_num  > x) { ## probability that the person does not know any bank
    number <- sample.int(7, 1)
    if(number == 1) {
      bank <- paste0("Ta", generate_wrong_words(x = "r", p = 0.7), "gob", generate_wrong_words(x = "a", p = 0.9), "nk")
    } else if (number == 2) {
      bank <- paste0("Ing-di", generate_wrong_words(x = "b", p = 0.6), "a")
    } else if (number == 3) {
      bank <- paste0("com", generate_wrong_words(x = "m", p = 0.7), "erzb", generate_wrong_words(x = "a", p = 0.8), "nk")
    } else if (number == 4){
      bank <- paste0("Deutsch", generate_wrong_words(x = "e", p = 0.6, k = FALSE), " Ban", generate_wrong_words(x = "k", p = 0.8))
    } else if (number == 5) {
      bank <- paste0("Spark", generate_wrong_words(x = "a", p = 0.7, k = FALSE), "sse")
    } else if (number == 6) {
      bank <- paste0("Cons", generate_wrong_words(x = "o", p = 0.7, k = FALSE), "rsbank")
    } else {
      bank <- paste0("Cit", generate_wrong_words(x = "i", p = 0.7, k = FALSE), "gro", generate_wrong_words(x = "u", p = 0.9, k = FALSE), "p")
    }
  } else {
    bank <- "" # empty string if no bank is known
  }
  return(bank)
}


# Create a DataFrame in which the values are stored.
df_raw <- data.frame(matrix(ncol = 8, nrow = 2500))

# Generate correctly and incorrectly spelled bank names, with some variability in how many banks each person knows.
for(i in 1:2500) {
  df_raw [i, 1] <- i # respondent ID
  df_raw [i, 2] <- randombank(x = 0.05)
  if(df_raw [i, 2] == "") { df_raw [i, 3] <- "" } else {df_raw [i, 3] <- randombank(x = 0.1)}
  if(df_raw [i, 3] == "") { df_raw [i, 4] <- "" } else {df_raw [i, 4] <- randombank(x = 0.1)}
  if(df_raw [i, 4] == "") { df_raw [i, 5] <- "" } else {df_raw [i, 5] <- randombank(x = 0.15)} 
  if(df_raw [i, 5] == "") { df_raw [i, 6] <- "" } else {df_raw [i, 6] <- randombank(x = 0.15)}
  if(df_raw [i, 6] == "") { df_raw [i, 7] <- "" } else {df_raw [i, 7] <- randombank(x = 0.2)} 
  if(df_raw [i, 7] == "") { df_raw [i, 8] <- "" } else {df_raw [i, 8] <- randombank(x = 0.2)} 
}
colnames(df_raw)[1] <- "lfdn"

Run:

head(df_raw)

Now the contents of the columns are merged into a single column, with the brands separated by commas.

df <- unite(df_raw, united, c(2:ncol(df_raw)), sep = ",")
colnames(df)[2] <- "text"
# Banks to search for (correctly spelled only)
startliste <- c("Targobank", "Ing-DiBa", "Commerzbank", "Deutsche Bank", "Sparkasse", "Consorsbank", "Citigroup")

So that special characters, spaces, and capitalization do not matter, all strings are standardized and interfering characters are removed.

df$text <- tolower(df$text)
df$text <- str_trim(df$text)
df$text <- gsub(" ", "", df$text)
df$text <- gsub("[?]", "", df$text)
df$text <- gsub("[-]", "", df$text)
df$text <- gsub("[_]", "", df$text)

startliste <- tolower(startliste)
startliste <- str_trim(startliste)
startliste <- gsub(" ", "", startliste)
startliste <- gsub("[?]", "", startliste)
startliste <- gsub("[-]", "", startliste)
startliste <- gsub("[_]", "", startliste)

In the next step, we check which spellings actually exist. A word frequency matrix is suitable for this; it counts all unique words and their frequencies in a vector.

words <- as.data.frame(wfm(df$text)) # every unique word and its frequency
words <- rownames(words)             # wfm counts the frequencies and writes the words into the rownames; we only need the words themselves

Next, an empty list is created, and for each element of the search vector a character vector is built iteratively that contains all words with a Jaro-Winkler score of 0.9 or higher.

finalewortliste <- list()  # empty list for the similar words per brand
for(i in 1:length(startliste)) {
  finalewortliste[[i]] <- words[which(jarowinkler(startliste[[i]], words) > 0.9)]
}

Now an empty DataFrame is created with the same number of rows as the original DataFrame and one column per brand.

finaldf <- data.frame(matrix(nrow = nrow(df), ncol = length(startliste)))
colnames(finaldf) <- startliste

In the next step, the similar words for each brand are joined with an OR operator into a single search string that contains all words identified via the Jaro-Winkler score. If a match is found, a one is written into the brand's column, otherwise a zero.

for(i in 1:ncol(finaldf)) {
  finaldf[i] <- ifelse(str_detect(df$text, paste(finalewortliste[[i]], collapse = "|")) == TRUE, 1, 0)
}

Finally, a column is created that is set to one if none of the brands was found.

finaldf$keinedergeannten <- ifelse(rowSums(finaldf) > 0, 0, 1) # 1 if none of the searched banks is known

Once the matrix has been computed, the final KPIs can be calculated and written as a report into an .xlsx file.

# Calculate percentage shares.
anteil <- as.data.frame(t(sapply(finaldf, sum) / nrow(finaldf) * 100))

# Attach the original responses to the DataFrame.
finaldf <- cbind(df$text, finaldf)
colnames(finaldf)[1] <- "text"

# Write the results to an .xlsx file.
wb <- createWorkbook()
addWorksheet(wb, "Ergebnisse")    
writeData(wb, "Ergebnisse", anteil, startCol = 2, startRow = 1, rowNames = FALSE)
writeData(wb, "Ergebnisse", finaldf, startCol = 1, startRow = 4, rowNames = FALSE)
saveWorkbook(wb, paste0("C:/Users/User/Desktop/Results_", Sys.Date(), ".xlsx"), overwrite = TRUE)  

Of course, this approach cannot replace a critical look at the data. However, in several tests with around 10,000 responses, it achieved accuracies between 95% and 100%, which exceeds previous approaches many times over.

Big Data has reduced the boundary between demand-centric dynamic pricing and user-behavior centric pricing!

Real-time pricing, also known as dynamic pricing, is a method for planning and setting highly flexible prices for services or products. Dynamic pricing helps online organizations modify prices on the fly in response to ever-changing market conditions. These modifications are managed by pricing bots, which collect information and use algorithms to regulate prices within the set guidelines. With the help of data analysis, vendors can accurately forecast the best prices and adjust them as needs change.

What's the role of Big Data in dynamic pricing?

Big Data strategies exist to extract the insights that help enhance a business's performance. Still, companies find it difficult to understand the capabilities of analytics and how analytics can be used to make pricing all the more powerful. Various levels of Big Data collection and analysis feed into planning a proper dynamic pricing structure, and the Big Data captured by companies holds a lot of value when it comes to devising solid, workable dynamic pricing structures.

Every data-oriented firm moves from the basic data-reporting stage through a number of stages to reach the most sophisticated and desirable level: optimization. This eventually helps enhance the revenue management process as well.

How does Big Data lessen the gap between demand-centric dynamic pricing and user-behavior-centric pricing?

As discussed above, Big Data has a major role to play in setting dynamic pricing plans. Dynamic pricing is further categorized into different segments, two of which are demand-centric dynamic pricing and user-behavior-centric pricing. Both are equally important for creating a strong pricing strategy, and Big Data also acts as a liaison between the two: it bridges the gap between them. Demand-centric pricing is about what the customer needs and is looking for, whereas user-behavior pricing is more about what we should be offering the customer given their level of interest.

Both of these parameters matter equally when crafting fruitful pricing strategies. To set proper demand-centric pricing, it is important to know the demand as well as the wants of the target audience. For user-behavior-centric pricing, we need to know how the user is feeling and what their areas of interest are. This is where Big Data analytics comes into play.

Big Data analytics of the relevant information helps uncover both the demands and the user behaviors. Analyzing the target audience with Big Data is the best way to get to these answers. Once we know the demands and the user behavior, we have to combine both to produce better pricing strategies.

Pricing plans should be devised by mapping both of these elements together. For example, when we curate marketing strategies, they mainly cater to the demands of the public, but user behavior is never neglected either. It is a mix of both that we need for setting dynamic prices as well. Price modifications should be based on the collective insights gained by combining both elements.

By studying both the demand graphs and the user-behavior reports, a company can devise plans that turn out to be very useful for pricing. Dynamic pricing is already a fruitful invention, and the integration of Big Data has made it all the more powerful.

Big Data is one of those technologies that has made a lot possible in many areas. Be it pricing structures or business strategies, Big Data analytics is used everywhere to improve company performance.




The Inside Out of ML Based Prescriptive Analytics

With the constantly growing amount of data, more and more companies are shifting towards analytics solutions. Analytics solutions help extract meaning from the huge amount of data available, thus improving decision-making.

Decision-making is an important aspect of business, and technologies like Machine Learning are enhancing it further. The growing use of Machine Learning has changed the face of prescriptive analytics. In order to optimize their efforts, companies need to be accurate with historical and present data, because this data is the essential input of analytics. This article describes the inside out of Machine Learning-based prescriptive analytics.

Phases of business analytics

Descriptive analytics, predictive analytics, and prescriptive analytics are the three phases of business analytics. Descriptive analytics, the first one, deals with past performance: historical data is mined to understand the reasons behind past success and failure. It is a kind of post-mortem analysis, and most management reporting, in sales, marketing, operations, finance, etc., makes use of it.

The second phase is predictive analytics, which answers the question of what is likely to happen. Historical data is combined with rules, algorithms, etc. to determine the possible future outcome or the likelihood of a situation occurring.

The final phase, well known to everyone, is prescriptive analytics. It can continually take in new data, re-predict and re-prescribe, which improves the accuracy of the predictions and yields better decision options. Professional services, technology, or a combination of both can be chosen to perform all three kinds of analytics.

More about prescriptive analytics

The analysis of business activities goes through many phases, and prescriptive analytics is one of them. It is the third phase of business analytics, coming after descriptive and predictive analytics, and entails the application of mathematical and computational sciences. It uses the results obtained from descriptive and predictive analysis to suggest decision options, going beyond predicting future outcomes to suggest actions that benefit from the predictions and to show the implications of each decision option. It anticipates what will happen, when it will happen, and why it will happen.

ML-based prescriptive analytics

Coming just before prescriptive analytics, predictive analytics is often confused with it. What actually happens is that predictive analysis leads to prescriptive analysis. Machine Learning-based prescriptive analytics therefore goes through an ML-based predictive analysis first, so it is necessary to consider ML-based predictive analytics first.

ML-based predictive analytics:

Many things prevent businesses from achieving predictive analysis capabilities, and Machine Learning can be a great help in boosting predictive analytics. Machine Learning and Artificial Intelligence algorithms help businesses uncover and exploit new statistical patterns, which form the backbone of predictive analysis. E-commerce, marketing, customer service, medical diagnosis, etc. are some of the prospective use cases for Machine Learning-based predictive analytics.

In e-commerce, machine learning can help predict the usual choices of a customer and thus present offers according to his or her likes and dislikes. It can also help in predicting fraudulent transactions. Similarly, B2B marketing, customer service, and medical diagnosis also make good use of machine learning-based predictive analytics. Thus, predictions and prescriptions based on machine learning can boost various business functions.

Organizations and software development companies are making more and more use of machine learning-based predictive analytics. Advancements like neural networks and deep learning algorithms are able to uncover hidden information, but this requires a well-researched approach. Big Data and modern IT systems are also important factors here.

Language Detecting with sklearn by determining Letter Frequencies

Of course, there are better and more efficient methods to detect the language of a given text than counting its letters. On the other hand, this is an interesting little example that shows the impressive ability of today's machine learning algorithms to detect hidden patterns in a given set of data.

For example take the sentence:

“Ceci est une phrase française.”

It's not too hard to figure out that this sentence is French. But the (lowercase) letters of the same sentence in a random order look like this:

“eeasrsçneticuaicfhenrpaes”

Still sure it's French? Given that this string contains the letter "ç", some people might have remembered long-past French lessons from school and thus guessed right. But aside from the fact that the letter "ç" is also present in Portuguese, Turkish, Catalan and a few other languages, this is still an easy example, just meant to explain the problem. Now try to guess which language might have generated this:

“ogldviisnntmeyoiiesettpetorotrcitglloeleiengehorntsnraviedeenltseaecithooheinsnstiofwtoienaoaeefiitaeeauobmeeetdmsflteightnttxipecnlgtetgteyhatncdisaceahrfomseehmsindrlttdthoaranthahdgasaebeaturoehtrnnanftxndaeeiposttmnhgttagtsheitistrrcudf”

While this looks simply confusing to the human eye and it seems practically impossible to determine the language it was generated from, this string still contains a set of hidden but well-defined patterns from which the language can be predicted with almost complete (ca. 98-99%) certainty.

First of all, we need a set of texts in the languages our model should be able to recognise. Luckily, the NLTK package comes with a big set of example texts, which are actually protocols of the European Parliament and are therefore publicly available in 11 different languages:

  •  Danish
  •  Dutch
  •  English
  •  Finnish
  •  French
  •  German
  •  Greek
  •  Italian
  •  Portuguese
  •  Spanish
  •  Swedish

Because the Greek version is not written in the Latin alphabet, detecting Greek would just be too simple, so we stick with the other 10 available languages. To give you an idea of the texts used, here is a little sample:

“Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded ‘millennium bug’ failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.”

Train and Test

The following code imports the necessary modules, reads the sample texts from a set of text files into a pandas.DataFrame object and prints some statistics about the texts:

from pathlib import Path
import random
from collections import Counter, defaultdict
import numpy as np
import pandas as pd
from sklearn.neighbors import *
from matplotlib import pyplot as plt
from mpl_toolkits import mplot3d
%matplotlib inline


def read(file):
    '''Returns contents of a file'''
    with open(file, 'r', errors='ignore') as f:
        text = f.read()
    return text

def load_eu_texts():
    '''Read texts snipplets in 10 different languages into pd.Dataframe

    load_eu_texts() -> pd.Dataframe
    
    The text snipplets are taken from the nltk-data corpus.
    '''
    basepath = Path('/home/my_username/nltk_data/corpora/europarl_raw/langs/')
    df = pd.DataFrame(columns=['text', 'lang', 'len'])
    languages = [None]
    for lang in basepath.iterdir():
        languages.append(lang.as_posix())
        t = '\n'.join([read(p) for p in lang.glob('*')])
        d = pd.DataFrame()
        d['text'] = ''
        d['text'] = pd.Series(t.split('\n'))
        d['lang'] = lang.name.title()
        df = df.append(d.copy(), ignore_index=True)
    return df

def clean_eutextdf(df):
    '''Preprocesses the texts by doing a set of cleaning steps
    
    clean_eutextdf(df) -> cleaned_df
    '''
    # Cuts of whitespaces a the beginning and and
    df['text'] = [i.strip() for i in df['text']]
    # Generate a lowercase Version of the text column
    df['ltext'] = [i.lower() for i in df['text']]

    # Determining the length of each text
    df['len'] = [len(i) for i in df['text']]
    # Drops all texts that are not at least 200 chars long
    df = df.loc[df['len'] > 200]
    return df

# Execute the above functions to load the texts
df = clean_eutextdf(load_eu_texts())

# Print a few stats of the read texts
textline = 'Number of text snippplets: ' + str(df.shape[0])
print('\n' + textline + '\n' + ''.join(['_' for i in range(len(textline))]))
c = Counter(df['lang'])
for l in c.most_common():
    print('%-25s' % l[0] + str(l[1]))
df.sample(10)
Number of text snippplets: 56481
________________________________
French                   6466
German                   6401
Italian                  6383
Portuguese               6147
Spanish                  6016
Finnish                  5597
Swedish                  4940
Danish                   4914
Dutch                    4826
English                  4791
lang	len	text	ltext
135233	Finnish	346	Vastustan sitä , toisin kuin tämän parlamentin...	vastustan sitä , toisin kuin tämän parlamentin...
170400	Danish	243	Desuden ødelægger det centraliserede europæisk...	desuden ødelægger det centraliserede europæisk...
85466	Italian	220	In primo luogo , gli accordi di Sharm el-Sheik...	in primo luogo , gli accordi di sharm el-sheik...
15926	French	389	Pour ce qui est concrètement du barrage de Ili...	pour ce qui est concrètement du barrage de ili...
195321	English	204	Discretionary powers for national supervisory ...	discretionary powers for national supervisory ...
160557	Danish	304	Det er de spørgmål , som de lande , der udgør ...	det er de spørgmål , som de lande , der udgør ...
196310	English	355	What remains of the concept of what a company ...	what remains of the concept of what a company ...
110163	Portuguese	327	Actualmente , é do conhecimento dos senhores d...	actualmente , é do conhecimento dos senhores d...
151681	Danish	203	Dette er vigtigt for den tillid , som samfunde...	dette er vigtigt for den tillid , som samfunde...
200540	English	257	Therefore , according to proponents , such as ...	therefore , according to proponents , such as ...

Above you see a sample of random rows of the created DataFrame. After removing very short text snippets (less than 200 chars) we are left with 56481 snippets. The function clean_eutextdf() then creates a lowercase representation of the texts in the column 'ltext' to facilitate counting the chars in the next step.
The following code snippet now extracts the features, in this case the relative frequency of each letter in every text snippet, that are used for prediction:

def calc_charratios(df):
    '''Calculating ratio of any (alphabetical) char in any text of df for each lyric
    
    calc_charratios(df) -> list, pd.Dataframe
    '''
    CHARS = ''.join({c for c in ''.join(df['ltext']) if c.isalpha()})
    print('Counting Chars:')
    for c in CHARS:
        print(c, end=' ')
        df[c] = [r.count(c) for r in df['ltext']] / df['len']
    return list(CHARS), df

features, df = calc_charratios(df)

Now that we have calculated the features for every text snippet in our dataset, we can split our data set into a train and a test set:

def split_dataset(df, ratio=0.5):
    '''Split the dataset into a train and a test dataset
    
    split_dataset(featuredf, ratio) -> pd.Dataframe, pd.Dataframe
    '''
    df = df.sample(frac=1).reset_index(drop=True)
    traindf = df[:][:int(df.shape[0] * ratio)]
    testdf = df[:][int(df.shape[0] * ratio):]
    return traindf, testdf

featuredf = pd.DataFrame()
featuredf['lang'] = df['lang']
for feature in features:
    featuredf[feature] = df[feature]
traindf, testdf = split_dataset(featuredf, ratio=0.80)

x = np.array([np.array(row[1:]) for index, row in traindf.iterrows()])
y = np.array([l for l in traindf['lang']])
X = np.array([np.array(row[1:]) for index, row in testdf.iterrows()])
Y = np.array([l for l in testdf['lang']])

After doing that, we can train a k-nearest-neighbours classifier and test it to get the percentage of correctly predicted languages in the test data set. Because we do not know which value of k is the best choice, we just run the training and testing with different values of k in a for loop:

def train_knn(x, y, k):
    '''Returns the trained k nearest neighbors classifier
    
    train_knn(x, y, k) -> sklearn.neighbors.KNeighborsClassifier
    '''
    clf = KNeighborsClassifier(k)
    clf.fit(x, y)
    return clf

def test_knn(clf, X, Y):
    '''Tests a given classifier with a testset and return result
    
    text_knn(clf, X, Y) -> float
    '''
    predictions = clf.predict(X)
    ratio_correct = len([i for i in range(len(Y)) if Y[i] == predictions[i]]) / len(Y)
    return ratio_correct

print('''k\tPercentage of correctly predicted language
__________________________________________________''')
for i in range(1, 16):
    clf = train_knn(x, y, i)
    ratio_correct = test_knn(clf, X, Y)
    print(str(i) + '\t' + str(round(ratio_correct * 100, 3)) + '%')
k	Percentage of correctly predicted language
__________________________________________________
1	97.548%
2	97.38%
3	98.256%
4	98.132%
5	98.221%
6	98.203%
7	98.327%
8	98.247%
9	98.371%
10	98.345%
11	98.327%
12	98.3%
13	98.256%
14	98.274%
15	98.309%

As you can see in the output, the reliability of the language classifier is generally very high: it starts at about 97.5% for k = 1 and increases with increasing values of k until it reaches a maximum level of roughly 98.4% around k ≈ 9-10.

Using the Classifier to predict languages of texts

Now that we have trained and tested the classifier, we want to use it to predict the language of example texts. To do that we need two more functions, shown in the following piece of code. The first one extracts the necessary features from the sample text, and predict_lang() predicts the language of the text:

def extract_features(text, features):
    '''Extracts all alphabetic characters and add their ratios as feature
    
    extract_features(text, features) -> np.array
    '''
    textlen = len(text)
    ratios = []
    text = text.lower()
    for feature in features:
        ratios.append(text.count(feature) / textlen)
    return np.array(ratios)

def predict_lang(text, clf=clf):
    '''Predicts the language of a given text and classifier
    
    predict_lang(text, clf) -> str
    '''
    extracted_features = extract_features(text, features)
    return clf.predict(np.array(np.array([extracted_features])))[0]

text_sample = df.sample(10)['text']

for example_text in text_sample:
    print('%-20s'  % predict_lang(example_text, clf) + '\t' + example_text[:60] + '...')
Italian             	Auspico che i progetti riguardanti i programmi possano contr...
English             	When that time comes , when we have used up all our resource...
Portuguese          	Creio que o Parlamento protesta muitas vezes contra este mét...
Spanish             	Sobre la base de esta posición , me parece que se puede enco...
Dutch               	Ik voel mij daardoor aangemoedigd omdat ik een brede consens...
Spanish             	Señor Presidente , Señorías , antes que nada , quisiera pron...
Italian             	Ricordo altresì , signora Presidente , che durante la preced...
Swedish             	Betänkande ( A5-0107 / 1999 ) av Berend för utskottet för re...
English             	This responsibility cannot only be borne by the Commissioner...
Portuguese          	A nossa leitura comum é que esse partido tem uma posição man...

With this classifier it is now also possible to predict the language of the randomized example snippet from the introduction (which was actually created from the first paragraph of this article):

example_text = "ogldviisnntmeyoiiesettpetorotrcitglloeleiengehorntsnraviedeenltseaecithooheinsnstiofwtoienaoaeefiitaeeauobmeeetdmsflteightnttxipecnlgtetgteyhatncdisaceahrfomseehmsindrlttdthoaranthahdgasaebeaturoehtrnnanftxndaeeiposttmnhgttagtsheitistrrcudf"
predict_lang(example_text)
'English'

The KNN classifier of sklearn also offers the possibility to predict the probability with which a given classification is made. While the probability distribution for a specific language is relatively clear-cut for long sample texts, the certainty decreases noticeably the shorter the texts are.

def dict_invert(dictionary):
    ''' Inverts keys and values of a dictionary
    
    dict_invert(dictionary) -> collections.defaultdict(list)
    '''
    inverse_dict = defaultdict(list)
    for key, value in dictionary.items():
        inverse_dict[value].append(key)
    return inverse_dict

def get_propabilities(text, features=features):
    '''Prints the probability for every language of a given text
    
    get_propabilities(text, features)
    '''
    results = clf.predict_proba(extract_features(text, features=features).reshape(1, -1))
    for result in zip(clf.classes_, results[0]):
        print('%-20s' % result[0] + '%7s %%' % str(round(float(100 * result[1]), 4)))


example_text = 'ogldviisnntmeyoiiesettpetorotrcitglloeleiengehorntsnraviedeenltseaecithooheinsnstiofwtoienaoaeefiitaeeauobmeeetdmsflteightnttxipecnlgtetgteyhatncdisaceahrfomseehmsindrlttdthoaranthahdgasaebeaturoehtrnnanftxndaeeiposttmnhgttagtsheitistrrcudf'
print(example_text)
get_propabilities(example_text + '\n')
print('\n')
example_text2 = 'Dies ist ein kurzer Beispielsatz.'
print(example_text2)
get_propabilities(example_text2 + '\n')
ogldviisnntmeyoiiesettpetorotrcitglloeleiengehorntsnraviedeenltseaecithooheinsnstiofwtoienaoaeefiitaeeauobmeeetdmsflteightnttxipecnlgtetgteyhatncdisaceahrfomseehmsindrlttdthoaranthahdgasaebeaturoehtrnnanftxndaeeiposttmnhgttagtsheitistrrcudf
Danish                  0.0 %
Dutch                   0.0 %
English               100.0 %
Finnish                 0.0 %
French                  0.0 %
German                  0.0 %
Italian                 0.0 %
Portuguese              0.0 %
Spanish                 0.0 %
Swedish                 0.0 %


Dies ist ein kurzer Beispielsatz.
Danish                  0.0 %
Dutch                   0.0 %
English                 0.0 %
Finnish                 0.0 %
French              18.1818 %
German              72.7273 %
Italian              9.0909 %
Portuguese              0.0 %
Spanish                 0.0 %
Swedish                 0.0 %

Background and Insights

Why does a relatively simple model like counting letters actually work? Every language has a specific pattern of letter frequencies which can be used as a kind of fingerprint: while there are almost no y's in the German language, this letter is quite common in English. In French, the letter k is not very common because it is replaced with q in most cases.

For a better understanding, look at the output of the following code snippet, where only three letters already lead to a noticeable form of clustering:

ax = plt.axes(projection='3d')
legend = []
X, Y, Z = 'e', 'g', 'h'

def iterlog(ln):
    retvals = []
    for n in ln:
        try:
            retvals.append(np.log(n))
        except:
            retvals.append(None)
    return retvals

for X in ['t']:
    ax = plt.axes(projection='3d')
    ax.xy_viewLim.intervalx = [-3.5, -2]
    legend = []
    for lang in [l for l in df.groupby('lang') if l[0] in {'German', 'English', 'Finnish', 'French', 'Danish'}]:
        sample = lang[1].sample(4000)

        legend.append(lang[0])
        ax.scatter3D(iterlog(sample[X]), iterlog(sample[Y]), iterlog(sample[Z]))

    ax.set_title('log(10) of the Relativ Frequencies of "' + X.upper() + "', '" + Y.upper() + '" and "' + Z.upper() + '"\n\n')
    ax.set_xlabel(X.upper())
    ax.set_ylabel(Y.upper())
    ax.set_zlabel(Z.upper())
    plt.legend(legend)
    plt.show()

 

Even though every single letter frequency by itself is not a very reliable indicator, the set of frequencies of all letters present in a text is quite good evidence, because it more or less represents the letter-frequency fingerprint of the given language. Since it is quite hard to imagine or visualize the above plot in more than three dimensions, I used a little trick which shows that every language has its own typical fingerprint of letter frequencies:

legend = []
fig = plt.figure(figsize=(15, 10))
plt.axes(yscale='log')

## Mean relative frequency of every letter per language ...
langs = defaultdict(list)
for lang in df.groupby('lang'):
    for feature in 'abcdefghijklmnopqrstuvwxyz':
        langs[lang[0]].append(lang[1][feature].mean())

## ... divided by the mean frequency of that letter over all texts
mean_frequencies = {feature: df[feature].mean() for feature in 'abcdefghijklmnopqrstuvwxyz'}
for language, frequencies in langs.items():
    legend.append(language)
    ratio = np.array(frequencies) / np.array([mean_frequencies[c] for c in 'abcdefghijklmnopqrstuvwxyz'])
    plt.plot(list('abcdefghijklmnopqrstuvwxyz'), ratio)
plt.title('Relative frequencies compared to the mean frequency in all texts (log scale)')
plt.xlabel('Letters')
plt.ylabel('Language frequency / mean frequency')
plt.legend(legend)
plt.grid()
plt.show()

What else can we learn?

Besides the fact that letter frequencies alone allow us to predict the language of any example text (at least for the 10 languages with Latin alphabet we trained for) with almost complete certainty, there is even more information hidden in the set of sample texts.

As you might know, most languages in Europe belong to either the Romance or the Germanic language family (which is actually because the Romans conquered only half of Europe). The border between them runs through Belgium, between France and Germany, and through Switzerland. West of this border the Romance languages, which originate from Latin, are still spoken, like Spanish, Portuguese and French. In the middle and northern parts of Europe the Germanic languages are very common, like German, Dutch, Swedish etc. If we plot the analysed languages with a suitable colour scheme, this border becomes quite clear and allows us to take a look back at the history of where our languages originate from:

legend = []
fig = plt.figure(figsize=(15, 10))
plt.axes(yscale='linear')

langs = defaultdict(list)
for lang in [l for l in df.groupby('lang') if l[0] in {'German', 'English', 'French', 'Spanish', 'Portuguese', 'Dutch', 'Swedish', 'Danish', 'Italian'}]:
    for feature in 'abcdefghijklmnopqrstuvwxyz':
        langs[lang[0]].append(lang[1][feature].mean())

## Germanic languages get red/brown tones, Romance languages get green tones
colordict = {l[0]: l[1] for l in zip([lang for lang in langs], ['brown', 'tomato', 'orangered',
                                                                'green', 'red', 'forestgreen', 'limegreen',
                                                                'darkgreen', 'darkred'])}
mean_frequencies = {feature: df[feature].mean() for feature in 'abcdefghijklmnopqrstuvwxyz'}
for language, frequencies in langs.items():
    legend.append(language)
    ratio = np.array(frequencies) / np.array([mean_frequencies[c] for c in 'abcdefghijklmnopqrstuvwxyz'])
    plt.plot(list('abcdefghijklmnopqrstuvwxyz'), ratio, color=colordict[language])
plt.title('Relative frequencies compared to the mean frequency in all texts')
plt.xlabel('Letters')
plt.ylabel('Language frequency / mean frequency')
plt.legend(legend)
plt.grid()
plt.show()

As you can see, the more common letters, especially the vowels a, e, i, o and u, have almost the same frequency in all of these languages. Far more interesting are letters like q, k, c and w: while k is quite common in all of the Germanic languages, it is quite rare in the Romance languages, because the same sound is written with the letters q or c.
As a result it could be said that even "boring" data sets (just give it a try and read all the texts of the protocols of the EU parliament…) can contain quite interesting patterns which, in this case, allow us to predict quite precisely which language a given text sample is written in, without any translation program and without speaking the languages. And as an interesting side effect, they show where certain things in history happened (or did not happen): even after two thousand years, modern machine learning techniques can easily uncover this history, because although all these languages kept developing, they still share a set of hidden but common patterns that have stayed the same since then.
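
This difference is easy to verify directly in the training data. Assuming the df frame with one column of relative frequencies per letter, as used in the plotting code above, the mean frequency of k per language can be listed in one line:

## Mean relative frequency of the letter "k" per language; the Germanic
## languages should rank clearly above the Romance languages here.
print(df.groupby('lang')['k'].mean().sort_values(ascending=False))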

Sentiment Analysis using Python

One of the applications of text mining is sentiment analysis. Most data is generated in textual form, and in the past few years people have been talking more and more about NLP. Improvement is a continuous process, and many product-based companies leverage these text mining techniques to examine customer sentiment and find out what they can improve in their products. This information also helps them understand the trends and demands of end users, which results in customer satisfaction.

As text mining is a vast concept, the article is divided into two parts. The main focus of this article is calculating two scores, sentiment polarity and subjectivity, using Python. Polarity ranges from -1 to 1 (negative to positive) and tells us whether the text contains positive or negative feedback. Most companies prefer to stop their analysis here, but in our second article we will extend the analysis by creating labels out of these scores. Finally, a multi-label, multi-class classifier can be trained to predict future reviews.
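
To get a feeling for these two numbers before applying them to the whole dataset, here is a minimal sketch (the example sentences are my own) of what TextBlob returns for a single piece of text:

from textblob import TextBlob

## .sentiment returns (polarity, subjectivity):
## polarity lies in [-1, 1], subjectivity in [0, 1]
print(TextBlob('I love this tablet, it works great!').sentiment)
print(TextBlob('This is the worst purchase I have ever made.').sentiment)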

Without further delay, let's dive into the code and mine some knowledge from textual data.

There are several NLP libraries available in Python, such as spaCy, NLTK, Gensim and TextBlob. For this particular article, we will be using NLTK for pre-processing and TextBlob to calculate sentiment polarity and subjectivity.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer, PorterStemmer
from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob

The dataset is available here for download, and we will use the pandas read_csv function to import it. I would also like to share a piece of information I came across recently: those who have used Python and pandas before probably know that read_csv is by far one of the most used functions, but it can take a while to load a big file. Some folks from RISELab at UC Berkeley created Modin (also called Pandas on Ray), a library that speeds up this process by changing a single line of code.
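
I have not benchmarked it on this dataset myself, but according to the Modin documentation the single-line change looks roughly like this (after pip install "modin[ray]"); the rest of this article sticks with plain pandas:

## Hypothetical drop-in replacement for pandas; read_csv then runs in parallel
import modin.pandas as pd

amz_reviews = pd.read_csv("1429_1.csv")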

amz_reviews = pd.read_csv("1429_1.csv")

After importing the dataset it is recommended to study its structure first. At this point we want to know how many columns there are and what they contain, so I am going to check the shape of the data frame and go through each column name to see whether we need it or not.

amz_reviews.shape
(34660, 21)

amz_reviews.columns
Index(['id', 'name', 'asins', 'brand', 'categories', 'keys', 'manufacturer',
       'reviews.date', 'reviews.dateAdded', 'reviews.dateSeen',
       'reviews.didPurchase', 'reviews.doRecommend', 'reviews.id',
       'reviews.numHelpful', 'reviews.rating', 'reviews.sourceURLs',
       'reviews.text', 'reviews.title', 'reviews.userCity',
       'reviews.userProvince', 'reviews.username'],
      dtype='object')

 

Many of these columns are not useful for our sentiment analysis, and it's better to remove them. There are several ways to do that: either select only the columns you want to keep, or select the columns you want to remove and then use the drop function to remove them from the data frame. I prefer the second option as it lets me look at each column one more time, so I don't miss any variable that is important for the analysis.

columns = ['id','name','keys','manufacturer','reviews.dateAdded', 'reviews.date','reviews.didPurchase',
          'reviews.userCity', 'reviews.userProvince', 'reviews.dateSeen', 'reviews.doRecommend','asins',
          'reviews.id', 'reviews.numHelpful', 'reviews.sourceURLs', 'reviews.title']

df = amz_reviews.drop(columns, axis=1)  ## drop already returns a new data frame, no pd.DataFrame wrapper needed
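
For comparison, the first approach mentioned above, selecting only the columns to keep, would look like the following sketch; it leaves the same five columns as the drop call:

## Equivalent result by keeping only the columns needed for the analysis
keep = ['brand', 'categories', 'reviews.rating', 'reviews.text', 'reviews.username']
df = amz_reviews[keep].copy()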

Now let's dive deeper into the data and try to mine some knowledge from the remaining columns. The first step is to look at the distribution of the variables and make some notes. First, let's look at the distribution of the ratings.

df['reviews.rating'].value_counts().plot(kind='bar')

Graphs are powerful, and just by looking at the above bar graph we can conclude that most people are reasonably satisfied with the products offered at Amazon. The reason I am saying 'at' Amazon is that it is just a platform where anyone can sell their products, and the users are rating the products, not Amazon. However, if users are satisfied with the products, it also means that Amazon has a lower return rate and fewer fraud cases (from the seller side). The job of a data scientist depends not only on how good a model is but also on how useful it is for the business, and that's why these business insights are really important.

Data pre-processing for textual variables

Lowercasing

Before we move on to calculating the sentiment score for each review, it is important to pre-process the textual data. Lowercasing helps with normalization, which is an important step for keeping the words in a uniform format (Welbers, et al., 2017, pp. 245-265).

## Change the reviews type to string
df['reviews.text'] = df['reviews.text'].astype(str)

## Before lowercasing 
df['reviews.text'][2]
'Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it 
already...'

## Lowercase all reviews
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['reviews.text'][2] ## to see the difference
'inexpensive tablet for him to use and learn on, step up from the nabi. he was thrilled with it, learn how to skype on it 
already...'

Special characters

Special characters are non-alphabetic and non-numeric values such as {!,@#$%^ *()~;:/<>|+_-[]?}. Dealing with numbers is straightforward, but special characters can be tricky: during tokenization they create their own tokens, which, like numbers, are not helpful for most algorithms.

## remove punctuation (keep only word characters and whitespace)
df['reviews.text'] = df['reviews.text'].str.replace(r'[^\w\s]', '', regex=True)
df['reviews.text'][2]
'inexpensive tablet for him to use and learn on step up from the nabi he was thrilled with it learn how to skype on it already'

Stopwords

Stop words are among the most commonly used words in the English language; however, they have almost no predictive power. Examples are words such as I, me, myself, he, she, they, our, mine, you, yours etc.

stop = stopwords.words('english')  ## requires a one-time nltk.download('stopwords')
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['reviews.text'][2]
'inexpensive tablet use learn step nabi thrilled learn skype already'

Stemming

A stemming algorithm is very useful in text mining and helps to extract relevant information, as it reduces all words with the same root to a common form by removing suffixes such as -ation, -ing and -es. However, it can be problematic when there are spelling errors.

st = PorterStemmer()
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
df['reviews.text'][2]
'inexpens tablet use learn step nabi thrill learn skype alreadi'

This step is extremely useful for pre-processing textual data, but whether to apply it depends on your goal. Here our goal is to calculate sentiment scores, and if you look closely at the output above, words like 'inexpensive' and 'thrilled' became 'inexpens' and 'thrill' after stemming. This helps in text classification by reducing the curse of dimensionality, but for calculating the sentiment score this step is not useful, since truncated stems are often missing from the sentiment lexicon.
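
A quick way to convince yourself of this is to score the same (already cleaned) review once without and once with stemming; the exact numbers depend on TextBlob's lexicon, so this is just a sketch:

from textblob import TextBlob
from nltk.stem import PorterStemmer

st = PorterStemmer()
original = 'inexpensive tablet use learn step nabi thrilled learn skype already'
stemmed = " ".join(st.stem(word) for word in original.split())

## Compare the scores of the unstemmed and the stemmed version of the same review
print(TextBlob(original).sentiment)
print(TextBlob(stemmed).sentiment)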

Sentiment Score

It is now time to calculate the sentiment score of each review and see what these scores look like.

## Define a function which can be applied to calculate the score for the whole dataset

def senti(x):
    return TextBlob(x).sentiment  

df['senti_score'] = df['reviews.text'].apply(senti)

df.senti_score.head()

0                                   (0.3, 0.8)
1                                (0.65, 0.675)
2                                   (0.0, 0.0)
3    (0.29545454545454547, 0.6492424242424243)
4                    (0.5, 0.5827777777777777)
Name: senti_score, dtype: object

As can be observed, there are two scores: the first is the sentiment polarity, which tells whether the sentiment is positive or negative, and the second is the subjectivity score, which tells how subjective the text is.
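
As a small optional step (my own addition, not part of the original walkthrough), the two values can be split into separate columns, which makes it easier to build labels from them in the next article:

## Split the (polarity, subjectivity) pairs into two numeric columns
df['polarity'] = df['senti_score'].apply(lambda s: s.polarity)
df['subjectivity'] = df['senti_score'].apply(lambda s: s.subjectivity)
df[['polarity', 'subjectivity']].head()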

In my next article, we will extend this analysis by creating labels based on these scores and finally we will train a classification model.
