All about Big Data Storage and Analytics

The Future of AI in Dental Technology

As we develop more advanced technology, we begin to learn that artificial intelligence can have more and more of an impact on our lives and industries that we have gotten used to being the same over the past decades. One of those industries is dentistry. In your lifetime, you’ve probably not seen many changes in technology, but a boom around artificial intelligence and technology has opened the door for AI in dental technologies.

How Can AI Help?

Though dentists take a lot of pride in their craft and career, most acknowledge that AI can do some things that they can’t do or would make their job easier if they didn’t have to do. AI can perform a number of both simple and advanced tasks. Let’s take a look at some areas that many in the dental industry feel that AI can be of assistance.

Repetitive, Menial Tasks

The most obvious area that AI can help out when it comes to dentistry is with repetitive and menial simple tasks. There are many administrative tasks in the dentistry industry that can be sped up and made more cost-effective with the use of AI. If we can train a computer to do some of these tasks, we may be able to free up more time for our dentists to focus on more important matters and improve their job performance as well. One primary use of AI is virtual consultations that offices like Philly Braces are offering. This saves patients time when they come in as the Doctor already knows what the next steps in their treatment will be.

Using AI to do some basic computer tasks is already being done on a small scale by some, but we have yet to see a very large scale implementation of this technology. We would expect that to happen soon, with how promising and cost-effective the technology has proven to be.

Reducing Misdiagnosis

One area that many think that AI can help a lot in is misdiagnosis. Though dentists do their best, there is still a nearly 20% misdiagnosis rate when reading x-rays in dentistry. We like to think that a human can read an x-ray better, but this may not be the case. AI technology can certainly be trained to read an x-ray and there have been some trials to suggest that they can do it better and identify key conditions that we often misread.

A world with AI diagnosis that is accurate and quicker will save time, money, and lead to better dental health among patients. It hasn’t yet come to fruition, but this seems to be the next major step for AI in dentistry.

Artificial Intelligence Assistants

Once it has been demonstrated that AI can perform a range of tasks that are useful to dentists, the next logical step is to combine those skills to make a fully-functional AI dental assistant. A machine like this has not yet been developed, but we can imagine that it would be an interface that could be spoken to similar to Alexa. The dentist would request vital information and other health history data from a patient or set of patients to assist in the treatment process. This would undoubtedly be a huge step forward and bring a lot of computing power into the average dentist office.


It’s clear that AI has a bright future in the dental industry and has already shown some of the essential skills that it can help with in order to provide more comprehensive and accurate care to dental patients. Some offices like Westwood Orthodontics already use AI in the form of a virtual consult to diagnose issues and provide treatment options before patients actually step foot in the office. Though not nearly all applications that AI can provide have been explored, we are well on our way to discovering the vast benefits of artificial intelligence for both patients and practices in the dental healthcare industry.

The New Age of Big Data: Is It the Death of Hadoop?

Big Data had gone through several transformations through the years, growing into the phrase we identify it as today. From its first identified use on the back of Hadoop and MapReduce, a new age of Big Data has been ushered in with the spread of new technologies such as Kubernetes, Spark, and NoSQL databases.

These might not serve the exact same purpose as Hadoop individually, but they fill the same niche and do the same job with features the original platform designers never envisioned.

The multi-cloud architecture boom and increasing emphasis on real-time data may just mean the end of Big Data as we know it, and Hadoop with it.

A brief history of Big Data

The use of data for making business decisions can be traced back to ancient civilizations in Mesopotamia. However, the age of Big Data as we know it is only as old as 2005 when O’Reilly Media launched the phrase. It was used to describe the massive amounts of data that the world was beginning to produce on the internet.

The newly-dubbed Web 2.0 needed to be indexed and easily searchable, and, Yahoo, being the behemoth that it was, was just the right company for the job. Hadoop was born off the efforts of Yahoo engineers, depending on Google’s MapReduce under the hood. A new era of Big Data had begun, and Hadoop was at the forefront of the revolution.

The new technologies led to a fundamental shift in the way the world regarded data processing. Traditional assumptions of atomicity, consistency, isolation, and durability (ACID) began to fade, and new use cases for previously unusable data began to emerge.

Hadoop would begin its life as a commercial platform with the launch of Cloudera in 2008, followed by rivals such as Hortonworks, EMC and MapR. It continued its momentous run until it seemingly hit its peak in 2015, and its place in the enterprise market would never be guaranteed again

Where Hadoop Couldn’t Keep Up

Hadoop made its mark in the world of Big Data by being a platform to collect, store and analyze large swathes of data. However, not even a technology as revolutionary and versatile as Hadoop could exist without its drawbacks.

Some of these would be so costly developers would rather design whole new systems to deal with them. With time, Hadoop started to lose its charm, unable to grow past its initial vision as a Big Data software.

Hadoop is a machine made up of smaller moving parts that are incredibly efficient at what they do – crunch data. This ultimately results in one of the first drawbacks of Hadoop – it does not come with built-in support for analytics data. Hadoop works well to process your data, but not likely as you need – visual reports about how the data is being processed, for instance.

MapReduce was also built from the ground up to be file-intensive. This makes it a great piece of software for simple requests, but not so much for iterative data. For smaller datasets, it turns out to be a rather inefficient solution.

Another area Hadoop lands flat on its face is with regards to real-time processing and reporting. Hadoop suffers from the curse of time. It relies on technologies that even its very founders (Google in particular) no longer rely on.

With MapReduce, every time you want to analyze a modified dataset (say, after adding or deleting data), you have to stream over the whole dataset again. Thanks to this feature, Hadoop is horrible at real-time reporting – a feature that led to the creation of Percolator, MapReduce’s replacement within Google.

The emergence of better technology has also meant a rise in the number of threats to said technology and a corresponding increase in the emphasis that is placed on it.

Unfortunately, Hadoop is nowhere close to being secure. As a matter of fact, its security settings are off by default, and it has too much inertia to simply change that. To make things worse, plugging in security measures isn’t that much easier.

The Fall of Hadoop

With these and more shortcomings in the data science world, new tools such as Hive, Pig and Spark were created to work on top of Hadoop to overcome its weaknesses. But it simply couldn’t grow out of the shoes it had been made for.

The growth of NoSQL databases such as Hazelcast and MongoDB also meant that problems Hadoop was designed to support were now being solved by single players rather than the ‘all or nothing’ approach Hadoop was designed with. It wasn’t flexible enough to evolve beyond simply being a batch processing software.

Over time, new Big Data challenges began to emerge that a large monolithic software like Hadoop couldn’t deal with, either. Being primarily file-intensive, it couldn’t keep up with the variety of data sources that were now available, the lack of support for dynamic schemas, on-the-fly queries, and the rise of cloud infrastructure all caused people to seek different solutions. Hadoop had lost its grip on the enterprise world.

Businesses whose primary concern was dealing with Hadoop infrastructure like Cloudera and Hortonworks were seeing less and less adoption. This led to the eventual merger of the two companies in 2019, and the same message rang out from different corners of the world at the same time: ‘Hadoop is dead.’

Is Hadoop Really Dead?

Hadoop still has a place in the enterprise world – the problems it was designed to solve still exist to this day. Technologies such as Spark have largely taken over the same space that Hadoop once occupied.

The question of Hadoop or Spark is one every data scientist has to contend with at some point, and most seem to be settling in the latter of these, thanks to the great advantages is speed it offers.

It’s unlikely Hadoop will see much more adoption with newer marker entrants, especially considering the pace with which technology moves. It also doesn’t help that a lot of alternatives have a much smaller learning curve than the convoluted monolith that is Hadoop. Companies like MapR and Cloudera have also begun to pivot away from Hadoop-only infrastructure to more robust cloud-based solutions. Hadoop still has its place, but maybe not for long.

Erstellen und benutzen einer Geodatenbank

In diesem Artikel soll es im Gegensatz zum vorherigen Artikel Alles über Geodaten weniger darum gehen, was man denn alles mit Geodaten machen kann, dafür aber mehr darum wie man dies anstellt. Es wird gezeigt, wie man aus dem öffentlich verfügbaren Datensatz des OpenStreetMap-Projekts eine Geodatenbank erstellt und einige Beispiele dafür gegeben, wie man diese abfragen und benutzen kann.

Wahl der Datenbank

Prinzipiell gibt es zwei große “geo-kompatible” OpenSource-Datenbanken bzw. “Datenbank-AddOn’s”: Spatialite, welches auf SQLite aufbaut, und PostGIS, das PostgreSQL verwendet.

PostGIS bietet zum Teil eine einfachere Syntax, welche manchmal weniger Tipparbeit verursacht. So kann man zum Beispiel um die Entfernung zwischen zwei Orten zu ermitteln einfach schreiben:

während dies in Spatialite “nur” mit einer normalen Funktion möglich ist:

Trotztdem wird in diesem Artikel Spatialite (also SQLite) verwendet, da dessen Einrichtung deutlich einfacher ist (schließlich sollen interessierte sich alle Ergebnisse des Artikels problemlos nachbauen können, ohne hierfür einen eigenen Datenbankserver aufsetzen zu müssen).

Der Hauptunterschied zwischen PostgreSQL und SQLite (eigentlich der Unterschied zwischen SQLite und den meissten anderen Datenbanken) ist, dass für PostgreSQL im Hintergrund ein Server laufen muss, an welchen die entsprechenden Queries gesendet werden, während SQLite ein “normales” Programm (also kein Client-Server-System) ist welches die Queries selber auswertet.

Hierdurch fällt beim Aufsetzen der Datenbank eine ganze Menge an Konfigurationsarbeit weg: Welche Benutzer gibt es bzw. akzeptiert der Server? Welcher Benutzer bekommt welche Rechte? Über welche Verbindung wird auf den Server zugegriffen? Wie wird die Sicherheit dieser Verbindung sichergestellt? …

Während all dies bei SQLite (und damit auch Spatialite) wegfällt und die Einrichtung der Datenbank eigentlich nur “installieren und fertig” ist, muss auf der anderen Seite aber auch gesagt werden dass SQLite nicht gut für Szenarien geeignet ist, in welchen viele Benutzer gleichzeitig (insbesondere schreibenden) Zugriff auf die Datenbank benötigen.

Benötigte Software und ein Beispieldatensatz

Was wird für diesen Artikel an Software benötigt?

SQLite3 als Datenbank

libspatialite als “Geoplugin” für SQLite

spatialite-tools zum erstellen der Datenbank aus dem OpenStreetMaps (*.osm.pbf) Format

python3, die beiden GeoModule spatialite, folium und cartopy, sowie die Module pandas und matplotlib (letztere gehören im Bereich der Datenauswertung mit Python sowieso zum Standart). Für pandas gibt es noch die Erweiterung geopandas sowie eine praktisch unüberschaubare Anzahl weiterer geographischer Module aber bereits mit den genannten lassen sich eine Menge interessanter Dinge herausfinden.

– und natürlich einen Geodatensatz: Zum Beispiel sind aus dem OpenStreetMap-Projekt extrahierte Datensätze hier zu finden.

Es ist ratsam, sich hier erst einmal einen kleinen Datensatz herunterzuladen (wie zum Beispiel einen der Stadtstaaten Bremen, Hamburg oder Berlin). Zum einen dauert die Konvertierung des .osm.pbf-Formats in eine Spatialite-Datenbank bei größeren Datensätzen unter Umständen sehr lange, zum anderen ist die fertige Datenbank um ein vielfaches größer als die stark gepackte Originaldatei (für “nur” Deutschland ist die fertige Datenbank bereits ca. 30 GB groß und man lässt die Konvertierung (zumindest am eigenen Laptop) am besten über Nacht laufen – willkommen im Bereich “BigData”).

Erstellen eine Geodatenbank aus OpenStreetMap-Daten

Nach dem Herunterladen eines Datensatzes der Wahl im *.osm.pbf-Format kann hieraus recht einfach mit folgendem Befehl aus dem Paket spatialite-tools die Datenbank erstellt werden:

Erkunden der erstellten Geodatenbank

Nach Ausführen des obigen Befehls sollte nun eine Datei mit dem gewählten Namen (im Beispiel bremen-latest.sqlite) im aktuellen Ordner vorhanden sein – dies ist bereits die fertige Datenbank. Zunächst sollte man mit dieser Datenbank erst einmal dasselbe machen, wie mit jeder anderen Datenbank auch: Sich erst einmal eine Weile hinsetzen und schauen was alles an Daten in der Datenbank vorhanden und vor allem wo diese Daten in der erstellten Tabellenstruktur zu finden sind. Auch wenn dieses Umschauen prinzipiell auch vollständig über die Shell oder in Python möglich ist, sind hier Programme mit graphischer Benutzeroberfläche (z. B. spatialite-gui oder QGIS) sehr hilfreich und sparen nicht nur eine Menge Zeit sondern vor allem auch Tipparbeit. Wer dies tut, wird feststellen, dass sich in der generierten Datenbank einige dutzend Tabellen mit Namen wie pt_addresses, ln_highway und pg_boundary befinden.

Die Benennung der Tabellen folgt dem Prinzip, dass pt_*-Tabellen Punkte im Geokoordinatensystem wie z. B. Adressen, Shops, Bäckereien und ähnliches enthalten. ln_*-Tabellen enthalten hingegen geographische Entitäten, welche sich als Linien darstellen lassen, wie beispielsweise Straßen, Hochspannungsleitungen, Schienen, ect. Zuletzt gibt es die pg_*-Tabellen welche Polygone – also Flächen einer bestimmten Form enthalten. Dazu zählen Landesgrenzen, Bundesländer, Inseln, Postleitzahlengebiete, Landnutzung, aber auch Gebäude, da auch diese jeweils eine Grundfläche besitzen. In dem genannten Datensatz sind die Grundflächen von Gebäuden – zumindest in Europa – nahezu vollständig. Aber auch der Rest der Welt ist für ein “Wikipedia der Kartographie” insbesondere in halbwegs besiedelten Gebieten bemerkenswert gut erfasst, auch wenn nicht unbedingt davon ausgegangen werden kann, dass abgelegenere Gegenden (z. B. irgendwo auf dem Land in Südamerika) jedes Gebäude eingezeichnet ist.

Verwenden der Erstellten Datenbank

Auf diese Datenbank kann nun entweder direkt aus der Shell über den Befehl

zugegriffen werden oder man nutzt das gleichnamige Python-Paket:

Nach Eingabe der obigen Befehle in eine Python-Konsole, ein Jupyter-Notebook oder ein anderes Programm, welches die Anbindung an den Python-Interpreter ermöglicht, können die von der Datenbank ausgegebenen Ergebnisse nun direkt in ein Pandas Data Frame hineingeladen und verwendet/ausgewertet/analysiert werden.

Im Grunde wird hierfür “normales SQL” verwendet, wie in anderen Datenbanken auch. Der folgende Beispiel gibt einfach die fünf ersten von der Datenbank gefundenen Adressen aus der Tabelle pt_addresses aus:

Link zur Ausgabe

Es wird dem Leser sicherlich aufgefallen sein, dass die Spalte “Geometry” (zumindest für das menschliche Auge) nicht besonders ansprechend sowie auch nicht informativ aussieht: Der Grund hierfür ist, dass diese Spalte die entsprechende Position im geographischen Koordinatensystem aus Gründen wie dem deutlich kleineren Speicherplatzbedarf sowie der damit einhergehenden Optimierung der Geschwindigkeit der Datenbank selber, in binärer Form gespeichert und ohne weitere Verarbeitung auch als solche ausgegeben wird.

Glücklicherweise stellt spatialite eine ganze Reihe von Funktionen zur Verarbeitung dieser geographischen Informationen bereit, von denen im folgenden einige beispielsweise vorgestellt werden:

Für einzelne Punkte im Koordinatensystem gibt es beispielsweise die Funktionen X(geometry) und Y(geometry), welche aus diesem “binären Wirrwarr” den Längen- bzw. Breitengrad des jeweiligen Punktes als lesbare Zahlen ausgibt.

Ändert man also das obige Query nun entsprechend ab, erhält man als Ausgabe folgendes Ergebnis in welchem die Geometry-Spalte der ausgegebenen Adressen in den zwei neuen Spalten Longitude und Latitude in lesbarer Form zu finden ist:

Link zur Tabelle

Eine weitere häufig verwendete Funktion von Spatialite ist die Distance-Funktion, welche die Distanz zwischen zwei Orten berechnet.

Das folgende Beispiel sucht in der Datenbank die 10 nächstgelegenen Bäckereien zu einer frei wählbaren Position aus der Datenbank und listet diese nach zunehmender Entfernung auf (Achtung – die frei wählbare Position im Beispiel liegt in München, wer die selbe Position z. B. mit dem Bremen-Datensatz verwendet, wird vermutlich etwas weiter laufen müssen…):

Link zur Ausgabe

Ein Anwendungsfall für eine solche Liste können zum Beispiel Programme/Apps wie oder Google-Maps sein, in denen User nach Bäckereien, Geldautomaten, Supermärkten oder Apotheken “in der Nähe” suchen können sollen.

Diese Liste enthält nun alle Informationen die grundsätzlich gebraucht werden, ist soweit auch informativ und wird in den meißten Fällen der Datenauswertung auch genau so gebraucht, jedoch ist diese für das Auge nicht besonders ansprechend.

Viel besser wäre es doch, die gefundenen Positionen auf einer interaktiven Karte einzuzeichnen:

Was kann man sonst interessantes mit der erstellten Datenbank und etwas Python machen? Wer in Deutschland ein wenig herumgekommen ist, dem ist eventuell aufgefallen, dass sich die Endungen von Ortsnamen stark unterscheiden: Um München gibt es Stadteile und Dörfer namens Garching, Freising, Aubing, ect., rund um Stuttgart enden alle möglichen Namen auf “ingen” (Plieningen, Vaihningen, Echterdingen …) und in Berlin gibt es Orte wie Pankow, Virchow sowie eine bunte Auswahl weiterer *ow’s.

Das folgende Query spuckt gibt alle “village’s”, “town’s” und “city’s” aus der Tabelle pt_place, also Dörfer und Städte, aus:

Link zur Ausgabe

Graphisch mit matplotlib und cartopy in ein Koordinatensystem eingetragen sieht diese Verteilung folgendermassen aus:

Die Grafik zeigt, dass stark unterschiedliche Vorkommen der verschiedenen Ortsendungen in Deutschland (Clustering). Über das genaue Zustandekommen dieser Verteilung kann ich hier nur spekulieren, jedoch wird diese vermutlich ähnlichen Prozessen unterliegen wie beispielsweise die Entwicklung von Dialekten.

Wer sich die Karte etwas genauer anschaut wird merken, dass die eingezeichneten Landesgrenzen und Küstenlinien nicht besonders genau sind. Hieran wird ein interessanter Effekt von häufig verwendeten geographischen Entitäten, nämlich Linien und Polygonen deutlich. Im Beispiel werden durch die beiden Zeilen

die bereits im Modul cartopy hinterlegten Daten verwendet. Genaue Verläufe von Küstenlinien und Landesgrenzen benötigen mit wachsender Genauigkeit hingegen sehr viel Speicherplatz, da mehr und mehr zu speichernde Punkte benötigt werden (genaueres siehe hier).


Man kann also bereits mit einigen Grundmodulen und öffentlich verfügbaren Datensätzen eine ganze Menge im Bereich der Geodaten erkunden und entdecken. Gleichzeitig steht, insbesondere für spezielle Probleme, eine große Bandbreite weiterer Software zur Verfügung, für welche dieser Artikel zwar einen Grundsätzlichen Einstieg geben kann, die jedoch den Rahmen dieses Artikels sprengen würden.

A Bird’s Eye View: How Machine Learning Can Help You Charge Your E-Scooters

Bird scooters in Columbus, Ohio

Bird scooters in Columbus, Ohio

Ever since I started using bike-sharing to get around in Seattle, I have become fascinated with geolocation data and the transportation sharing economy. When I saw this project leveraging the mobility data RESTful API from the Los Angeles Department of Transportation, I was eager to dive in and get my hands dirty building a data product utilizing a company’s mobility data API.

Unfortunately, the major bike and scooter providers (Bird, JUMP, Lime) don’t have publicly accessible APIs. However, some folks have seemingly been able to reverse-engineer the Bird API used to populate the maps in their Android and iOS applications.

One interesting feature of this data is the nest_id, which indicates if the Bird scooter is in a “nest” — a centralized drop-off spot for charged Birds to be released back into circulation.

I set out to ask the following questions:

  1. Can real-time predictions be made to determine if a scooter is currently in a nest?
  2. For non-nest scooters, can new nest location recommendations be generated from geospatial clustering?

To answer these questions, I built a full-stack machine learning web application, NestGenerator, which provides an automated recommendation engine for new nest locations. This application can help power Bird’s internal nest location generation that runs within their Android and iOS applications. NestGenerator also provides real-time strategic insight for Bird chargers who are enticed to optimize their scooter collection and drop-off route based on proximity to scooters and nest locations in their area.


The electric scooter market has seen substantial growth with Bird’s recent billion dollar valuation  and their $300 million Series C round in the summer of 2018. Bird offers electric scooters that top out at 15 mph, cost $1 to unlock and 15 cents per minute of use. Bird scooters are in over 100 cities globally and they announced in late 2018 that they eclipsed 10 million scooter rides since their launch in 2017.

Bird scooters in Tel Aviv, Israel

Bird scooters in Tel Aviv, Israel

With all of these scooters populating cities, there’s much-needed demand for people to charge them. Since they are electric, someone needs to charge them! A charger can earn additional income for charging the scooters at their home and releasing them back into circulation at nest locations. The base price for charging each Bird is $5.00. It goes up from there when the Birds are harder to capture.

Data Collection and Machine Learning Pipeline

The full data pipeline for building “NestGenerator”


From the details here, I was able to write a Python script that returned a list of Bird scooters within a specified area, their geolocation, unique ID, battery level and a nest ID.

I collected scooter data from four cities (Atlanta, Austin, Santa Monica, and Washington D.C.) across varying times of day over the course of four weeks. Collecting data from different cities was critical to the goal of training a machine learning model that would generalize well across cities.

Once equipped with the scooter’s latitude and longitude coordinates, I was able to leverage additional APIs and municipal data sources to get granular geolocation data to create an original scooter attribute and city feature dataset.

Data Sources:

  • Walk Score API: returns a walk score, transit score and bike score for any location.
  • Google Elevation API: returns elevation data for all locations on the surface of the earth.
  • Google Places API: returns information about places. Places are defined within this API as establishments, geographic locations, or prominent points of interest.
  • Google Reverse Geocoding API: reverse geocoding is the process of converting geographic coordinates into a human-readable address.
  • Weather Company Data: returns the current weather conditions for a geolocation.
  • LocationIQ: Nearby Points of Interest (PoI) API returns specified PoIs or places around a given coordinate.
  • OSMnx: Python package that lets you download spatial geometries and model, project, visualize, and analyze street networks from OpenStreetMap’s APIs.

Feature Engineering

After extensive API wrangling, which included a four-week prolonged data collection phase, I was finally able to put together a diverse feature set to train machine learning models. I engineered 38 features to classify if a scooter is currently in a nest.

Full Feature Set

Full Feature Set

The features boiled down into four categories:

  • Amenity-based: parks within a given radius, gas stations within a given radius, walk score, bike score
  • City Network Structure: intersection count, average circuity, street length average, average streets per node, elevation level
  • Distance-based: proximity to closest highway, primary road, secondary road, residential road
  • Scooter-specific attributes: battery level, proximity to closest scooter, high battery level (> 90%) scooters within a given radius, total scooters within a given radius


Log-Scale Transformation

For each feature, I plotted the distribution to explore the data for feature engineering opportunities. For features with a right-skewed distribution, where the mean is typically greater than the median, I applied these log transformations to normalize the distribution and reduce the variability of outlier observations. This approach was used to generate a log feature for proximity to closest scooter, closest highway, primary road, secondary road, and residential road.

An example of a log transformation

Statistical Analysis: A Systematic Approach

Next, I wanted to ensure that the features I included in my model displayed significant differences when broken up by nest classification. My thinking was that any features that did not significantly differ when stratified by nest classification would not have a meaningful predictive impact on whether a scooter was in a nest or not.

Distributions of a feature stratified by their nest classification can be tested for statistically significant differences. I used an unpaired samples t-test with a 0.01% significance level to compute a p-value and confidence interval to determine if there was a statistically significant difference in means for a feature stratified by nest classification. I rejected the null hypothesis if a p-value was smaller than the 0.01% threshold and if the 99.9% confidence interval did not straddle zero. By rejecting the null-hypothesis in favor of the alternative hypothesis, it’s deemed there is a significant difference in means of a feature by nest classification.

Battery Level Distribution Stratified by Nest Classification to run a t-test

Battery Level Distribution Stratified by Nest Classification to run a t-test

Log of Closest Scooter Distribution Stratified by Nest Classification to run a t-test

Throwing Away Features

Using the approach above, I removed ten features that did not display statistically significant results.

Statistically Insignificant Features Removed Before Model Development

Model Development

I trained two models, a random forest classifier and an extreme gradient boosting classifier since tree-based models can handle skewed data, capture important feature interactions, and provide a feature importance calculation. I trained the models on 70% of the data collected for all four cities and reserved the remaining 30% for testing.

After hyper-parameter tuning the models for performance on cross-validation data it was time to run the models on the 30% of test data set aside from the initial data collection.

I also collected additional test data from other cities (Columbus, Fort Lauderdale, San Diego) not involved in training the models. I took this step to ensure the selection of a machine learning model that would generalize well across cities. The performance of each model on the additional test data determined which model would be integrated into the application development.

Performance on Additional Cities Test Data

The Random Forest Classifier displayed superior performance across the board

The Random Forest Classifier displayed superior performance across the board

I opted to move forward with the random forest model because of its superior performance on AUC score and accuracy metrics on the additional cities test data. AUC is the Area under the ROC Curve, and it provides an aggregate measure of model performance across all possible classification thresholds.

AUC Score on Test Data for each Model

AUC Score on Test Data for each Model

Feature Importance

Battery level dominated as the most important feature. Additional important model features were proximity to high level battery scooters, proximity to closest scooter, and average distance to high level battery scooters.

Feature Importance for the Random Forest Classifier

Feature Importance for the Random Forest Classifier

The Trade-off Space

Once I had a working machine learning model for nest classification, I started to build out the application using the Flask web framework written in Python. After spending a few days of writing code for the application and incorporating the trained random forest model, I had enough to test out the basic functionality. I could finally run the application locally to call the Bird API and classify scooter’s into nests in real-time! There was one huge problem, though. It took more than seven minutes to generate the predictions and populate in the application. That just wasn’t going to cut it.

The question remained: will this model deliver in a production grade environment with the goal of making real-time classifications? This is a key trade-off in production grade machine learning applications where on one end of the spectrum we’re optimizing for model performance and on the other end we’re optimizing for low latency application performance.

As I continued to test out the application’s performance, I still faced the challenge of relying on so many APIs for real-time feature generation. Due to rate-limiting constraints and daily request limits across so many external APIs, the current machine learning classifier was not feasible to incorporate into the final application.

Run-Time Compliant Application Model

After going back to the drawing board, I trained a random forest model that relied primarily on scooter-specific features which were generated directly from the Bird API.

Through a process called vectorization, I was able to transform the geolocation distance calculations utilizing NumPy arrays which enabled batch operations on the data without writing any “for” loops. The distance calculations were applied simultaneously on the entire array of geolocations instead of looping through each individual element. The vectorization implementation optimized real-time feature engineering for distance related calculations which improved the application response time by a factor of ten.

Feature Importance for the Run-time Compliant Random Forest Classifier

Feature Importance for the Run-time Compliant Random Forest Classifier

This random forest model generalized well on test-data with an AUC score of 0.95 and an accuracy rate of 91%. The model retained its prediction accuracy compared to the former feature-rich model, but it gained 60x in application performance. This was a necessary trade-off for building a functional application with real-time prediction capabilities.

Geospatial Clustering

Now that I finally had a working machine learning model for classifying nests in a production grade environment, I could generate new nest locations for the non-nest scooters. The goal was to generate geospatial clusters based on the number of non-nest scooters in a given location.

The k-means algorithm is likely the most common clustering algorithm. However, k-means is not an optimal solution for widespread geolocation data because it minimizes variance, not geodetic distance. This can create suboptimal clustering from distortion in distance calculations at latitudes far from the equator. With this in mind, I initially set out to use the DBSCAN algorithm which clusters spatial data based on two parameters: a minimum cluster size and a physical distance from each point. There were a few issues that prevented me from moving forward with the DBSCAN algorithm.

  1. The DBSCAN algorithm does not allow for specifying the number of clusters, which was problematic as the goal was to generate a number of clusters as a function of non-nest scooters.
  2. I was unable to hone in on an optimal physical distance parameter that would dynamically change based on the Bird API data. This led to suboptimal nest locations due to a distortion in how the physical distance point was used in clustering. For example, Santa Monica, where there are ~15,000 scooters, has a higher concentration of scooters in a given area whereas Brookline, MA has a sparser set of scooter locations.

An example of how sparse scooter locations vs. highly concentrated scooter locations for a given Bird API call can create cluster distortion based on a static physical distance parameter in the DBSCAN algorithm. Left:Bird scooters in Brookline, MA. Right:Bird scooters in Santa Monica, CA.

An example of how sparse scooter locations vs. highly concentrated scooter locations for a given Bird API call can create cluster distortion based on a static physical distance parameter in the DBSCAN algorithm. Left:Bird scooters in Brookline, MA. Right:Bird scooters in Santa Monica, CA.

Given the granularity of geolocation scooter data I was working with, geospatial distortion was not an issue and the k-means algorithm would work well for generating clusters. Additionally, the k-means algorithm parameters allowed for dynamically customizing the number of clusters based on the number of non-nest scooters in a given location.

Once clusters were formed with the k-means algorithm, I derived a centroid from all of the observations within a given cluster. In this case, the centroids are the mean latitude and mean longitude for the scooters within a given cluster. The centroids coordinates are then projected as the new nest recommendations.

NestGenerator showcasing non-nest scooters and new nest recommendations utilizing the K-Means algorithm

NestGenerator showcasing non-nest scooters and new nest recommendations utilizing the K-Means algorithm.

NestGenerator Application

After wrapping up the machine learning components, I shifted to building out the remaining functionality of the application. The final iteration of the application is deployed to Heroku’s cloud platform.

In the NestGenerator app, a user specifies a location of their choosing. This will then call the Bird API for scooters within that given location and generate all of the model features for predicting nest classification using the trained random forest model. This forms the foundation for map filtering based on nest classification. In the app, a user has the ability to filter the map based on nest classification.

Drop-Down Map View filtering based on Nest Classification

Drop-Down Map View filtering based on Nest Classification

Nearest Generated Nest

To see the generated nest recommendations, a user selects the “Current Non-Nest Scooters & Predicted Nest Locations” filter which will then populate the application with these nest locations. Based on the user’s specified search location, a table is provided with the proximity of the five closest nests and an address of the Nest location to help inform a Bird charger in their decision-making.

NestGenerator web-layout with nest addresses and proximity to nearest generated nests

NestGenerator web-layout with nest addresses and proximity to nearest generated nests


By accurately predicting nest classification and clustering non-nest scooters, NestGenerator provides an automated recommendation engine for new nest locations. For Bird, this application can help power their nest location generation that runs within their Android and iOS applications. NestGenerator also provides real-time strategic insight for Bird chargers who are enticed to optimize their scooter collection and drop-off route based on scooters and nest locations in their area.


The code for this project can be found on my GitHub

Comments or Questions? Please email me an E-Mail!


Interview: Does Business Intelligence benefit from Cloud Data Warehousing?

Interview with Ross Perez, Senior Director, Marketing EMEA at Snowflake

Read this article in German:
“Profitiert Business Intelligence vom Data Warehouse in der Cloud?”

Does Business Intelligence benefit from Cloud Data Warehousing?

Ross Perez is the Senior Director, Marketing EMEA at Snowflake. He leads the Snowflake marketing team in EMEA and is charged with starting the discussion about analytics, data, and cloud data warehousing across EMEA. Before Snowflake, Ross was a product marketer at Tableau Software where he founded the Iron Viz Championship, the world’s largest and longest running data visualization competition.

Data Science Blog: Ross, Business Intelligence (BI) is not really a new trend. In 2019/2020, making data available for the whole company should not be a big thing anymore. Would you agree?

BI is definitely an old trend, reporting has been around for 50 years. People are accustomed to seeing statistics and data for the company at large, and even their business units. However, using BI to deliver analytics to everyone in the organization and encouraging them to make decisions based on data for their specific area is relatively new. In a lot of the companies Snowflake works with, there is a huge new group of people who have recently received access to self-service BI and visualization tools like Tableau, Looker and Sigma, and they are just starting to find answers to their questions.

Data Science Blog: Up until today, BI was just about delivering dashboards for reporting to the business. The data warehouse (DWH) was something like the backend. Today we have increased demand for data transparency. How should companies deal with this demand?

Because more people in more departments are wanting access to data more frequently, the demand on backend systems like the data warehouse is skyrocketing. In many cases, companies have data warehouses that weren’t built to cope with this concurrent demand and that means that the experience is slow. End users have to wait a long time for their reports. That is where Snowflake comes in: since we can use the power of the cloud to spin up resources on demand, we can serve any number of concurrent users. Snowflake can also house unlimited amounts of data, of both structured and semi-structured formats.

Data Science Blog: Would you say the DWH is the key driver for becoming a data-driven organization? What else should be considered here?

Absolutely. Without having all of your data in a single, highly elastic, and flexible data warehouse, it can be a huge challenge to actually deliver insight to people in the organization.

Data Science Blog: So much for the theory, now let’s talk about specific use cases. In general, it matters a lot whether you are storing and analyzing e.g. financial data or machine data. What do we have to consider for both purposes?

Financial data and machine data do look very different, and often come in different formats. For instance, financial data is often in a standard relational format. Data like this needs to be able to be easily queried with standard SQL, something that many Hadoop and noSQL tools were unable to provide. Luckily, Snowflake is an ansi-standard SQL data warehouse so it can be used with this type of data quite seamlessly.

On the other hand, machine data is often semi-structured or even completely unstructured. This type of data is becoming significantly more common with the rise of IoT, but traditional data warehouses were very bad at dealing with it since they were optimized for relational data. Semi-structured data like JSON, Avro, XML, Orc and Parquet can be loaded into Snowflake for analysis quite seamlessly in its native format. This is important, because you don’t want to have to flatten the data to get any use from it.

Both types of data are important, and Snowflake is really the first data warehouse that can work with them both seamlessly.

Data Science Blog: Back to the common business use case: Creating sales or purchase reports for the business managers, based on data from ERP-systems such as Microsoft or SAP. Which architecture for the DWH could be the right one? How many and which database layers do you see as necessary?

The type of report largely does not matter, because in all cases you want a data warehouse that can support all of your data and serve all of your users. Ideally, you also want to be able to turn it off and on depending on demand. That means that you need a cloud-based architecture… and specifically Snowflake’s innovative architecture that separates storage and compute, making it possible to pay for exactly what you use.

Data Science Blog: Where would you implement the main part of the business logic for the report? In the DWH or in the reporting tool? Does it matter which reporting tool we choose?

The great thing is that you can choose either. Snowflake, as an ansi-Standard SQL data warehouse, can support a high degree of data modeling and business logic. But you can also utilize partners like Looker and Sigma who specialize in data modeling for BI. We think it’s best that the customer chooses what is right for them.

Data Science Blog: Snowflake enables organizations to store and manage their data in the cloud. Does it mean companies lose control over their storage and data management?

Customers have complete control over their data, and in fact Snowflake cannot see, alter or change any aspect of their data. The benefit of a cloud solution is that customers don’t have to manage the infrastructure or the tuning – they decide how they want to store and analyze their data and Snowflake takes care of the rest.

Data Science Blog: How big is the effort for smaller and medium sized companies to set up a DWH in the cloud? Does this have to be an expensive long-term project in every case?

The nice thing about Snowflake is that you can get started with a free trial in a few minutes. Now, moving from a traditional data warehouse to Snowflake can take some time, depending on the legacy technology that you are using. But Snowflake itself is quite easy to set up and very much compatible with historical tools making it relatively easy to move over.

Allgemeines über Geodaten

Dieser Artikel ist der Auftakt in einer Artikelserie zum Thema “Geodatenanalyse”.

Von den vielen Arten an Datensätzen, die öffentlich im Internet verfügbar sind, bin ich in letzter Zeit vermehrt über eine besonders interessante Gruppe gestolpert, die sich gleich für mehrere Zwecke nutzen lassen: Geodaten.

Gerade in wirtschaftlicher Hinsicht bieten sich eine ganze Reihe von Anwendungsfällen, bei denen Geodaten helfen können, Einblicke in Tatsachen zu erlangen, die ohne nicht möglich wären. Der wohl bekannteste Fall hierfür ist vermutlich die einfache Navigation zwischen zwei Punkten, die jeder kennt, der bereits ein Navigationssystem genutzt oder sich eine Route von Google Maps berechnen lassen hat.
Hiermit können nicht nur Fragen nach dem schnellsten oder Energie einsparensten (und damit gleichermaßen auch witschaftlichsten) Weg z. B. von Berlin nach Hamburg beantwortet werden, sondern auch die bestmögliche Lösung für Ausnahmesituationen wie Stau oder Vollsperrungen berechnet werden (ja, Stau ist, zumindest in der Theorie immer noch eine “Ausnahmesituation” ;-)).
Neben dieser beliebten Art Geodaten zu nutzen, gibt es eine ganze Reihe weiterer Situationen in denen deren Nutzung hilfreich bis essentiell sein kann. Als Beispiel sei hier der Einzugsbereich von in Konkurrenz stehenden Einheiten, wie z. B. Supermärkten genannt. Ohne an dieser Stelle statistische Nachweise vorlegen zu können, kaufen (zumindest meiner persönlichen Beobachtung nach) die meisten Menschen fast immer bei dem Supermarkt ein, der am bequemsten zu erreichen ist und dies ist in der Regel der am nächsten gelegene. Besitzt man nun eine Datenbank mit der Information, wo welcher Supermarkt bzw. welche Supermarktkette liegt, kann man mit so genannten Voronidiagrammen recht einfach den jeweiligen Einzugsbereich der jeweiligen Supermärkte berechnen.
Entsprechende Karten können auch von beliebigen anderen Entitäten mit fester geographischer Position gezeichnet werden: Geldautomaten, Funkmasten, öffentlicher Nahverkehr, …

Ein anderes Beispiel, das für die Datenauswertung interessant ist, ist die kartographische Auswertung von Postleitzahlen. Diese sind in fast jedem Datensatz zu Kunden, Lieferanten, ect. vorhanden, bilden jedoch weder eine ordinale, noch eine sinnvolle kategorische Größe, da es viele tausend verschiedene gibt. Zudem ist auch eine einfache Gruppierung in gröbere Kategorien wie beispielsweise Postleitzahlen des Schemas 1xxxx oft kaum sinnvoll, da diese in aller Regel kein sinnvolles Mapping auf z. B. politische Gebiete – wie beispielsweise Bundesländer – zulassen. Ein Ausweg aus diesem Dilemma ist eine einfache kartographische Übersicht, welche die einzelnen Postleitzahlengebiete in einer Farbskala zeigt.

Im gezeigten Beispiel ist die Bevölkerungsdichte Deutschlands als Karte zu sehen. Hiermit wird schnell und übersichtlich deutlich, wo in Deutschland die Bevölkerung lokalisiert ist. Ähnliche Karten können beispielsweise erstellt werden, um Fragen wie “Wie ist meine Kundschaft verteilt?” oder “Wo hat die Werbekampange XYZ besonders gut funktioniert?” zu beantworten. Bezieht man weitere Daten wie die absolute Bevölkerung oder die Bevölkerungsdichte mit ein, können auch Antworten auf Fragen wie “Welchen Anteil der Bevölkerung habe ich bereits erreicht und wo ist noch nicht genutztes Potential?” oder “Ist mein Produkt eher in städtischen oder ländlichen Gebieten gefragt?” einfach und schnell gefunden werden.
Ohne die entsprechende geographische Zusatzinformation bleiben insbesondere Postleitzahlen leider oft als “nicht sinnvoll auswertbar” bei der Datenauswertung links liegen.
Eine ganz andere Art von Vorteil der Geodaten ist der educational point of view:
  • Wer erst anfängt, sich mit Datenbanken zu beschäftigen, findet mit Straßen, Postleitzahlen und Ländern einen deutlich einfacheren und vor allem besser verständlichen Zugang zu SQL als mit abstrakten Größen und Nummern wie ProductID, CustomerID und AdressID. Zudem lassen sich Geodaten nebenbei bemerkt mittels so genannter GeoInformationSystems (*gis-Programme), erstaunlich einfach und ansprechend plotten.
  • Wer sich mit SQL bereits ein wenig auskennt, kann mit den (beispielsweise von Spatialite oder PostGIS) bereitgestellten SQL-Funktionen eine ganze Menge über Datenbanken sowie deren Möglichkeiten – aber auch über deren Grenzen – erfahren.
  • Für wen relationale Datenbanken sowie deren Funktionen schon lange nichts Neues mehr darstellen, kann sich hier (selbst mit dem eigenen Notebook) erstaunlich einfach in das Thema “Bug Data” einarbeiten, da die Menge an öffentlich vorhandenen Geodaten z.B. des OpenStreetMaps-Projektes selbst in optimal gepackten Format vielen Dutzend GB entsprechen. Gerade die Möglichkeit, die viele *gis-Programme wie beispielsweise QGIS bieten, nämlich Straßen-, Schienen- und Stromnetze “on-the-fly” zu plotten, macht die Bedeutung von richtig oder falsch gesetzten Indices in verschiedenen Datenbanken allein anhand der Geschwindigkeit mit der sich die Plots aufbauen sehr eindrucksvoll deutlich.
Um an Datensätze zu kommen, reicht es in der Regel Google mit den entsprechenden Schlagworten zu versorgen.
Neben – um einen Vergleich zu nutzen – dem Brockhaus der Karten GoogleMaps gibt es beispielsweise mit dem OpenStreetMaps-Projekt einen freien Geodatensatz, welcher in diesem Kontext etwa als das Wikipedia der Karten zu verstehen ist.
Hier findet man zum Beispiel Daten wie Straßen-, Schienen- oder dem Stromnetz, aber auch die im obigen Voronidiagramm eingezeichneten Gebäude und Supermärkte stammen aus diesem Datensatz. Hiermit lassen sich recht einfach just for fun interessante Dinge herausfinden, wie z. B., dass es in Deutschland ca. 28 Mio Gebäude gibt (ein SQL-Einzeiler), dass der Berliner Osten auch ca. 30 Jahre nach der Wende noch immer vorwiegend von der Tram versorgt wird, während im Westen hauptsächlich die U-Bahn fährt. Oder über welche Trassen der in der Nordsee von Windkraftanlagen erzeugte Strom auf das Festland kommt und von da aus weiter verteilt wird.
Eher grundlegende aber deswegen nicht weniger nützliche Datensätze lassen sich unter dem Stichwort “natural earth” finden. Hier sind Daten wie globale Küstenlinien, mittels Echolot ausgemessene Meerestiefen, aber auch von Menschen geschaffene Dinge wie Landesgrenzen und Städte sehr übersichtlich zu finden.
Im Grunde sind der Vorstellung aber keinerlei Grenzen gesetzt und fast alle denkbaren geographischen Fakten können, manchmal sogar live via Sattelit, mitverfolgt werden. So kann man sich beispielsweise neben aktueller Wolkenbedekung, Regenradar und globaler Oberflächentemperatur des Planeten auch das Abschmelzen der Polkappen seit 1970 ansehen (NSIDC) oder sich live die Blitzeinschläge auf dem gesamten Planeten anschauen – mit Vorhersage darüber, wann und wo der Donner zu hören ist (das funktioniert wirklich! Beispielsweise auf lightningmaps).
Kurzum Geodaten sind neben ihrer wirtschaftlichen Relevanz – vor allem für die Logistik – auch für angehende Data Scientists sehr aufschlussreich und ein wunderbares Spielzeug, mit dem man sich lange beschäftigen und eine Menge interessanter Dinge herausfinden kann.

Attribution Models in Marketing

Attribution Models

A Business and Statistical Case


A desire to understand the causal effect of campaigns on KPIs

Advertising and marketing costs represent a huge and ever more growing part of the budget of companies. Studies have found out this share is as high as 10% and increases with the size of companies (CMO study by American Marketing Association and Duke University, 2017). Measuring precisely the impact of a specific marketing campaign on the sales of a company is a critical step towards an efficient allocation of this budget. Would the return be higher for an euro spent on a Facebook ad, or should we better spend it on a TV spot? How much should I spend on Twitter ads given the volume of sales this channel is responsible for?

Attribution Models have lately received great attention in Marketing departments to answer these issues. The transition from offline to online marketing methods has indeed permitted the collection of multiple individual data throughout the whole customer journey, and  allowed for the development of user-centric attribution models. In short, Attribution Models use the information provided by Tracking technologies such as Google Analytics or Webtrekk to understand customer journeys from the first click on a Facebook ad to the final purchase and adequately ponderate the different marketing campaigns encountered depending on their responsibility in the final conversion.

Issues on Causal Effects

A key question then becomes: how to declare a channel is responsible for a purchase? In other words, how can we isolate the causal effect or incremental value of a campaign ?

          1. A/B-Tests

One method to estimate the pure impact of a campaign is the design of randomized experiments, wherein a control and treated groups are compared.  A/B tests belong to this broad category of randomized methods. Provided the groups are a priori similar in every aspect except for the treatment received, all subsequent differences may be attributed solely to the treatment. This method is typically used in medical studies to assess the effect of a drug to cure a disease.

Main practical issues regarding Randomized Methods are:

  • Assuring that control and treated groups are really similar before treatment. Uually a random assignment (i.e assuring that on a relevant set of observable variables groups are similar) is realized;
  • Potential spillover-effects, i.e the possibility that the treatment has an impact on the non-treated group as well (Stable unit treatment Value Assumption, or SUTVA in Rubin’s framework);
  • The costs of conducting such an experiment, and especially the costs linked to the deliberate assignment of individuals to a group with potentially lower results;
  • The number of such experiments to design if multiple treatments have to be measured;
  • Difficulties taking into account the interaction effects between campaigns or the effect of spending levels. Indeed, usually A/B tests are led by cutting off temporarily one campaign entirely and measuring the subsequent impact on KPI’s compared to the situation where this campaign is maintained;
  • The dynamical reproduction of experiments if we assume that treatment effects may change over time.

In the marketing context, multiple campaigns must be tested in a dynamical way, and treatment effect is likely to be heterogeneous among customers, leading to practical issues in the lauching of A/B tests to approximate the incremental value of all campaigns. However, sites with a lot of traffic and conversions can highly benefit from A/B testing as it provides a scientific and straightforward way to approximate a causal impact. Leading companies such as Uber, Netflix or Airbnb rely on internal tools for A/B testing automation, which allow them to basically test any decision they are about to make.



Experiment!: Website conversion rate optimization with A/B and multivariate testing, Colin McFarland, ©2013 | New Riders  

A/B testing: the most powerful way to turn clicks into customers. Dan Siroker, Pete Koomen; Wiley, 2013.



        2. Attribution models

Attribution Models do not demand to create an experimental setting. They take into account existing data and derive insights from the variability of customer journeys. One key difficulty is then to differentiate correlation and causality in the links observed between the exposition to campaigns and purchases. Indeed, selection effects may bias results as exposure to campaigns is usually dependant on user-characteristics and thus may not be necessarily independant from the customer’s baseline conversion probabilities. For example, customers purchasing from a discount price comparison website may be intrinsically different from customers buying from FB ad and this a priori difference may alone explain post-exposure differences in purchasing bahaviours. This intrinsic weakness must be remembered when interpreting Attribution Models results.

                          2.1 General Issues

The main issues regarding the implementation of Attribution Models are linked to

  • Causality and fallacious reasonning, as most models do not take into account the aforementionned selection biases.
  • Their difficult evaluation. Indeed, in almost all attribution models (except for those based on classification, where the accuracy of the model can be computed), the additionnal value brought by the use of a given attribution models cannot be evaluated using existing historical data. This additionnal value can only be approximated by analysing how the implementation of the conclusions of the attribution model have impacted a given KPI.
  • Tracking issues, leading to an uncorrect reconstruction of customer journeys
    • Cross-device journeys: cross-device issue arises from the use of different devices throughout the customer journeys, making it difficult to link datapoints. For example, if a customer searches for a product on his computer but later orders it on his mobile, the AM would then mistakenly consider it an order without prior campaign exposure. Though difficult to measure perfectly, the proportion of cross-device orders can approximate 20-30%.
    • Cookies destruction makes it difficult to track the customer his the whole journey. Both regulations and consumers’ rising concerns about data privacy issues mitigate the reliability and use of cookies.1 – From 2002 on, the EU has enacted directives concerning privacy regulation and the extended use of cookies for commercial targeting purposes, which have highly impacted marketing strategies, such as the ‘Privacy and Electronic Communications Directive’ (2002/58/EC). A research was conducted and found out that the adoption of this ‘Privacy Directive’ had led to 64% decrease in advertising methods compared to the rest of the world (Goldfarb et Tucker (2011)). The effect was stronger for generalized sites (Yahoo) than for specialized sites.2 – Users have grown more and more conscious of data privacy issues and have adopted protective measures concerning data privacy, such as automatic destruction of cookies after a session is ended, or simply giving away less personnal information (Goldfarb et Tucker (2012) ) .Valuable user information may be lost, though tracking technologies evolution have permitted to maintain tracking by other means. This issue may be particularly important in countries highly concerned with data privacy issues such as Germany.
    • Offline/Online bridge: an Attribution Model should take into account all campaigns to draw valuable insights. However, the exposure to offline campaigns (TV, newspapers) are difficult to track at the user level. One idea to tackle this issue would be to estimate the proportion of conversions led by offline campaigns through AB testing and deduce this proportion from the credit assigned to the online campaigns accounted for in the Attribution Model.
    • Touch point information available: clicks are easy to follow but irrelevant to take into account the influence of purely visual campaigns such as display ads or video.

                          2.2 Today’s main practices

Two main families of Attribution Models exist:

  • Rule-Based Attribution Models, which have been used for in the last decade but from which companies are gradualy switching.

Attribution depends on the individual journeys that have led to a purchase and is solely based on the rank of the campaign in the journey. Some models focus on a single touch points (First Click, Last Click) while others account for multi-touch journeys (Bathtube, Linear). It can be calculated at the customer level and thus doesn’t require large amounts of data points. We can distinguish two sub-groups of rule-based Attribution Models:

  • One Touch Attribution Models attribute all credit to a single touch point. The First-Click model attributes all credit for a converion to the first touch point of the customer journey; last touch attributes all credit to the last campaign.
  • Multi-touch Rule-Based Attribution Models incorporate information on the whole customer journey are thus an improvement compared to one touch models. To this family belong Linear model where credit is split equally between all channels, Bathtube model where 40% of credit is given to first and last clicks and the remaining 20% is distributed equally between the middle channels, or time-decay models where credit assigned to a click diminishes as the time between the click and the order increases..

The main advantages of rule-based models is their simplicity and cost effectiveness. The main problems are:

– They are a priori known and can thus lead to optimization strategies from competitors
– They do not take into account aggregate intelligence on customer journeys and actual incremental values.
– They tend to bias (depending on the model chosen) channels that are over-represented at the beggining or end of the funnel, according to theoretical assumptions that have no observationnal back-ups.

  • Data-Driven Attribution Models

These models take into account the weaknesses of rule-based models and make a relevant use of available data. Being data-driven, following attribution models cannot be computed using single user level data. On the contrary values are calculated through data aggregation and thus require a certain volume of customer journey information.



        3. Data-Driven Attribution Models in practice

                          3.1 Issues

Several issues arise in the computation of campaigns individual impact on a given KPI within a data-driven model.

  • Selection biases: Exposure to certain types of advertisement is usually highly correlated to non-observable variables which are in turn correlated to consumption practices. Differences in the behaviour of users exposed to different campaigns may thus only be driven by core differences in conversion probabilities between groups whether than by the campaign effect.
  • Complementarity: it may be that campaigns A and B only have an effect when combined, so that measuring their individual impact would lead to misleading conclusions. The model could then try to assess the effect of combinations of campaigns on top of the effect of individual campaigns. As the number of possible non-ordered combinations of k campaigns is 2k, it becomes clear that inclusing all possible combinations would however be time-consuming.
  • Order-sensitivity: The effect of a campaign A may depend on the place where it appears in the customer journey, meaning the rank of a campaign and not merely its presence could be accounted for in the model.
  • Relative Order-sensitivity: it may be that campaigns A and B only have an effect when one is exposed to campaign A before campaign B. If so, it could be useful to assess the effect of given combinations of campaigns as well. And this for all campaigns, leading to tremendous numbers of possible combinations.
  • All previous phenomenon may be present, increasing even more the potential complexity of a comprehensive Attribution Model. The number of all possible ordered combination of k campaigns is indeed :


                          3.2 Main models

                                  A) Logistic Regression and Classification models

If non converting journeys are available, Attribition Model can be shaped as a simple classification issue. Campaign types or campaigns combination and volume of campaign types can be included in the model along with customer or time variables. As we are interested in inference (on campaigns effect) whether than prediction, a parametric model should be used, such as Logistic Regression. Non paramatric models such as Random Forests or Neural Networks can also be used though the interpretation of campaigns value would be more difficult to derive from the model results.

A common pitfall is the usual issue of spurious correlations on one hand and the correct interpretation of coefficients in business terms.

An advantage if the possibility to evaluate the relevance of the model using common model validation methods to evaluate its predictive power (validation set \ AUC \pseudo R squared).

                                  B) Shapley Value


The Shapley Value is based on a Game Theory framework and is named after its creator, the Nobel Price Laureate Lloyd Shapley. Initially meant to calculate the marginal contribution of players in cooperative games, the model has received much attention in research and industry and has lately been applied to marketing issues. This model is typically used by Google Adords and other ad bidding vendors. Campaigns or marketing channels are in this model seen as compementary players looking forward to increasing a given KPI.
Contrarily to Logistic Regressions, it is a non-parametric model. Contrarily to Markov Chains, all results are built using existing journeys, and not simulated ones.

Channels are considered to enter the game sequentially under a certain joining order. Shapley value try to The Shapley value of channel i is the weighted sum of the marginal values that channel i adds to all possible coalitions that don’t contain channel i.
In other words, the main logic is to analyse the difference of gains when a channel i is added after a coalition Ck of k channels, k<=n. We then sum all the marginal contributions over all possible ordered combination Ck of all campaigns excluding i, with k<=n-1.

Subsets framework

A first an most usual way to compute the Shapley Vaue is to consider that when a channel enters coalition, its additionnal value is the same irrelevant of the order in which previous channels have appeared. In other words, journeys (A>B>C) and (B>A>C) trigger the same gains.
Shapley value is computed as the gains associated to adding a channel i to a subset of channels, weighted by the number of (ordered) sequences that the (unordered) subset represents, summed up on all possible subsets of the total set of campaigns where the channel i is not present.
The Shapley value of the channel 𝑥𝑗 is then:

where |S| is the number of campaigns of a coalition S and the sum extends over all subsets S that do not not contain channel j. 𝜈(𝑆)  is the value of the coalition S and 𝜈(𝑆 ∪ {𝑥𝑗})  the value of the coalition formed by adding 𝑥𝑗 to coalition S. 𝜈(𝑆 ∪ {𝑥𝑗}) − 𝜈(𝑆) is thus the marginal contribution of channel 𝑥𝑗 to the coalition S.

The formula can be rewritten and understood as:

This method is convenient when data on the gains of on all possible permutations of all unordered k subsets of the n campaigns are available. It is also more convenient if the order of campaigns prior to the introduction of a campaign is thought to have no impact.

Ordered sequences

Let us define 𝜈((A>B)) as the value of the sequence A then B. What is we let 𝜈((A>B)) be different from 𝜈((B>A)) ?
This time we would need to sum over all possible permutation of the S campaigns present before  𝑥𝑗 and the N-(S+1) campaigns after 𝑥𝑗. Doing so we will sum over all possible orderings (i.e all permutations of the n campaigns of the grand coalition containing all campaigns) and we can remove the permutation coefficient s!(p-s+1)!.

This method is convenient when the order of channels prior to and after the introduction of another channel is assumed to have an impact. It is also necessary to possess data for all possible permutations of all k subsets of the n campaigns, and not only on all (unordered) k-subsets of the n campaigns, k<=n. In other words, one must know the gains of A, B, C, A>B, B>A, etc. to compute the Shapley Value.

Differences between the two approaches

We simulate an ordered case where the value for each ordered sequence k for k<=3 is known. We compare it to the usual Shapley value calculated based on known gains of unordered subsets of campaigns. So as to compare relevant values, we have built the gains matrix so that the gains of a subset A, B i.e  𝜈({B,A}) is the average of the gains of ordered sequences made up with A and B (assuming the number of journeys where A>B equals the number of journeys where B>A, we have 𝜈({B,A})=0.5( 𝜈((A>B)) + 𝜈((B>A)) ). We let the value of the grand coalition be different depending on the order of campaigns-keeping the constraints that it averages to the value used for the unordered case.

Note: mvA refers to the marginal value of A in a given sequence.
With traditionnal unordered coalitions:

With ordered sequences used to compute the marginal values:


We can see that the two approaches yield very different results. In the unordered case, the Shapley Value campaign C is the highest, culminating at 20, while A and B have the same Shapley Value mvA=mvB=15. In the ordered case, campaign A has the highest Shapley Value and all campaigns have different Shapley Values.

This example illustrates the inherent differences between the set and sequences approach to Shapley values. Real life data is more likely to resemble the ordered case as conversion probabilities may for any given set of campaigns be influenced by the order through which the campaigns appear.


Shapley value has become popular in allocation problems in cooperative games because it is the unique allocation which satisfies different axioms:

  • Efficiency: Shaple Values of all channels add up to the total gains (here, orders) observed.
  • Symmetry: if channels A and B bring the same contribution to any coalition of campaigns, then their Shapley Value i sthe same
  • Null player: if a channel brings no additionnal gains to all coalitions, then its Shapley Value is zero
  • Strong monotony: the Shapley Value of a player increases weakly if all its marginal contributions increase weakly

These properties make the Shapley Value close to what we intuitively define as a fair attribution.


  • The Shapley Value is based on combinatory mathematics, and the number of possible coalitions and ordered sequences becomes huge when the number of campaigns increases.
  • If unordered, the Shapley Value assumes the contribution of campaign A is the same if followed by campaign B or by C.
  • If ordered, the number of combinations for which data must be available and sufficient is huge.
  • Channels rarely present or present in long journeys will be played down.
  • Generally, gains are supposed to grow with the number of players in the game. However, it is plausible that in the marketing context a journey with a high number of channels will not necessarily bring more orders than a journey with less channels involved.


R package: GameTheoryAllocation

Zhao & al, 2018 “Shapley Value Methods for Attribution Modeling in Online Advertising “

                                  B) Markov Chains

Markov Chains are used to model random processes, i.e events that occur in a sequential manner and in such a way that the probability to move to a certain state only depends on the past steps. The number of previous steps that are taken into account to model the transition probability is called the memory parameter of the sequence, and for the model to have a solution must be comprised between 0 and 4. A Markov Chain process is thus defined entirely by its Transition Matrix and its initial vector (i.e the starting point of the process).

Markov Chains are applied in many scientific fields. Typically, they are used in weather forecasting, with the sequence of Sunny and Rainy days following a Markov Process of memory parameter 0, so that for each given day the probability that the next day will be rainy or sunny only depends on the weather of the current day. Other applications can be found in sociology to understand the dynamics of social classes intergenerational reproduction. To get more both mathematical and applied illustration, I recommend the reading of this course.

In the marketing context, Markov Chains are an interesting way to model the conversion funnel. To go from the from the Markov Model to the Attribution logic, we calculate the Removal Effect of each channel, i.e the difference in conversions that happen if the channel is removed. Please read below for an introduction to the methodology.

The first step in a Markov Chains Attribution Model is to build the transition matrix that captures the transition probabilities between the campaigns accross existing customer journeys. This Matrix is to be read as a “From state A to state B” table, from the left to the right. A first difficulty is finding the right memory parameter to use. A large memory parameter would allow to take more into account interraction effects within the conversion funnel but would lead to increased computationnal time, a non-readable transition matrix, and be more sensitive to noisy data. Please note that this transition matrix provides useful information on the conversion funnel and on the relationships between campaigns and can be used as such as an analytical tool. I suggest the clear and easily R code which can be found here or here.

Here is an illustration of a Markov Chain with memory Parameter of 0: the probability to go to a certain campaign B in the next step only depend on the campaign we are currently at:

The associated Transition Matrix is then (with null probabilities left as Blank):

The second step is  to compute the actual responsibility of a channel in total conversions. As mentionned above, the main philosophy to do so is to calculate the Removal Effect of each channel, i.e the changes in the number of conversions when a channel is entirely removed. All customer journeys which went through this channel are settled out to be unsuccessful. This calculation is done by applying the transition matrix with and without the removed channels to an initial vector that contains the number of desired simulations.

Building on our current example, we can then settle an initial vector with the desired number of simulations, e.g 10 000:


It is possible at this stage to add a constraint on the maximum number of times the matrix is applied to the data, i.e on the maximal number of campaigns a simulated journey is allowed to have.


  • The dynamic journey is taken into account, as well as the transition between two states. The funnel is not assumed to be linear.
  • It is possile to build a conversion graph that maps the customer journey provides valuable insights.
  • It is possible to evaluate partly the accuracy of the Attribution Model based on Markov Chains. It is for example possible to see how well the transition matrix help predict the future by analysing the number of correct predictions at any given step over all sequences.


  • It can be somewhat difficult to set the memory parameter. Complementarity effects between channels are not well taken into account if the memory is low, but a parameter too high will lead to over-sensitivity to noise in the data and be difficult to implement if customer journeys tend to have a number of campaigns below this memory parameter.
  • Long journeys with different channels involved will be overweighted, as they will count many times in the Removal Effect.  For example, if there are n-1 channels in the customer journey, this journey will be considered as failure for the n-1 channel-RE. If the volume effects (i.e the impact of the overall number of channels in a journey, irrelevant from their type° are important then results may be biased.


R package: ChannelAttribution




“Mapping the Customer Journey: A Graph-Based Framework for Online Attribution Modeling”; Anderl, Eva and Becker, Ingo and Wangenheim, Florian V. and Schumann, Jan Hendrik, 2014. Available at SSRN: or

“Media Exposure through the Funnel: A Model of Multi-Stage Attribution”, Abhishek & al, 2012

“Multichannel Marketing Attribution Using Markov Chains”, Kakalejčík, L., Bucko, J., Resende, P.A.A. and Ferencova, M. Journal of Applied Management and Investments, Vol. 7 No. 1, pp. 49-60.  2018


                          3.3 To go further: Tackling selection biases with Quasi-Experiments

Exposure to certain types of advertisement is usually highly correlated to non-observable variables. Differences in the behaviour of users exposed to different campaigns may thus only be driven by core differences in converison probabilities between groups whether than by the campaign effect. These potential selection effects may bias the results obtained using historical data.

Quasi-Experiments can help correct this selection effect while still using available observationnal data.  These methods recreate the settings on a randomized setting. The goal is to come as close as possible to the ideal of comparing two populations that are identical in all respects except for the advertising exposure. However, populations might still differ with respect to some unobserved characteristics.

Common quasi-experimental methods used for instance in Public Policy Evaluation are:

  • Discontinuity Regressions
  • Matching Methods, such as Exact Matching,  Propensity-score matching or k-nearest neighbourghs.



“Towards a digital Attribution Model: Measuring the impact of display advertising on online consumer behaviour”, Anindya Ghose & al, MIS Quarterly Vol. 40 No. 4, pp. 1-XX, 2016

        4. First Steps towards a Practical Implementation

Identify key points of interests

  • Identify the nature of touchpoints available: is the data based on clicks? If so, is there a way to complement the data with A/B tests to measure the influence of ads without clicks (display, video) ? For example, what happens to sales when display campaign is removed? Analysing this multiplier effect would give the overall responsibility of display on sales, to be deduced from current attribution values given to click-based channels. More interestingly, what is the impact of the removal of display campaign on the occurences of click-based campaigns ? This would give us an idea of the impact of display ads on the exposure to each other campaigns, which would help correct the attribution values more precisely at the campaign level.
  • Define the KPI to track. From a pure Marketing perspective, looking at purchases may be sufficient, but from a financial perspective looking at profits, though a bit more difficult to compute, may drive more interesting results.
  • Define a customer journey. It may seem obvious, but the notion needs to be clarified at first. Would it be defined by a time limit? If so, which one? Does it end when a conversion is observed? For example, if a customer makes 2 purchases, would the campaigns he’s been exposed to before the first order still be accounted for in the second order? If so, with a time decay?
  • Define the research framework: are we interested only in customer journeys which have led to conversions or in all journeys? Keep in mind that successful customer journeys are a non-representative sample of customer journeys. Models built on the analysis of biased samples may be conservative. Take an extreme example: 80% of customers who see campaign A buy the product, VS 1% for campaign B. However, campaign B exposure is great and 100 Million people see it VS only 1M for campaign A. An Attribution Model based on successful journeys will give higher credit to campaign B which is an auguable conclusion. Taking into account costs per campaign (in the case where costs are calculated by clicks) may of course tackle this issue partly, as campaign A could then exhibit higher returns, but a serious fallacious reasonning is at stake here.

Analyse the typical customer journey    

  • Performing a duration analysis on the data may help you improve the definition of the customer journey to be used by your organization. After which days are converison probabilities null? Should we consider the effect of campaigns disappears after x days without orders? For example, if 99% of orders are placed in the 30 days following a first click, it might be interesting to define the customer journey as a 30 days time frame following the first oder.
  • Look at the distribution of the number of campaigns in a typical journey. If you choose to calculate the effect of campaigns interraction in your Attribution Model, it may indeed help you determine the maximum number of campaigns to be included in a combination. Indeed, you may not need to assess the impact of channel combinations with above than 4 different channels if 95% of orders are placed after less then 4 campaigns.
  • Transition matrixes: what if a campaign A systematically leads to a campaign B? What happens if we remove A or B? These insights would give clues to ask precise questions for a latter AB test, for example to find out if there is complementarity between channels A and B – (implying none should be removed) or mere substitution (implying one can be given up).
  • If conversion rates are available: it can be interesting to perform a survival analysis i.e to analyse the likelihood of conversion based on duration since first click. This could help us excluse potential outliers or individuals who have very low conversion probabilities.


Attribution is a complex topic which will probably never be definitively solved. Indeed, a main issue is the difficulty, or even impossibility, to evaluate precisely the accuracy of the attribution model that we’ve built. Attribution Models should be seen as a good yet always improvable approximation of the incremental values of campaigns, and be presented with their intrinsinc limits and biases.

Cloudera und Hortonworks vollenden geplante Fusion

Kombiniertes „ Open-Source-Powerhouse” wird die branchenweit erste Enterprise Data Cloud vom Netzwerk-Rand (Edge) bis hin zu künstlicher Intelligenz bauen.

München, Palo Alto (Kalifornien), 03. Januar 2019 – Cloudera, Inc. (NYSE: CLDR) hat den Abschluss seiner Fusion mit Hortonworks, Inc. bekanntgegeben. Cloudera wird die erste Enterprise Data Cloud bereitstellen, die die ganze Macht der Daten freisetzt, welche sich in einer beliebigen Cloud vom Netzwerk-Rand (Edge) bis zur KI bewegen –  all dies basierend auf einer hundertprozentigen Open-Source-Datenplattform. Die Enterprise Data Cloud unterstützt sowohl hybride als auch Multi-Cloud-Deployments. Unternehmen erhalten dadurch die nötige Flexibilität, um Machine Learning und Analysen mit ihren Daten, auf ihre Art und Weise und ohne Lock-in durchzuführen.

„Heute startet ein aufregendes neues Kapitel für Cloudera als führender Anbieter von Enterprise Data Clouds”, so Tom Reilly, Chief Executive Officer von Cloudera. „Das kombinierte Team und Technologieportfolio etabliert das neue Cloudera als klaren Marktführer mit der Größe und den Ressourcen für weitere Innovationen und Wachstum. Wir bieten unseren Kunden eine umfassende Lösung, um die richtige Datenanalyse für Daten überall dort bereitzustellen, wo das Unternehmen arbeiten muss, vom Edge bis zur KI, mit der branchenweit ersten Enterprise Data Cloud”.  

Ergänzend dazu stellte das Forschungsunternehmen Forrester fest1, dass „diese Fusion … die Messlatte für Innovationen im Big-Data-Bereich höher legen wird, insbesondere bei der Unterstützung einer durchgehenden Big-Data-Strategie in einer Hybrid- und Multi-Cloud-Umgebung. Wir glauben, dass dies eine Win-Win-Situation für Kunden, Partner und Lieferanten ist.”

Cloudera wird weiterhin unter dem Symbol „CLDR” an der New Yorker Börse gehandelt. Die Aktionäre von Hortonworks erhielten 1,305 Stammaktien von Cloudera für jede Aktie von Hortonworks.

Das Cloudera-Management wird am 10. Januar 2019 um 19:00 Uhr ein Online-Meeting veranstalten, um zu diskutieren, wie das neue Cloudera Innovationen beschleunigen und die erste Enterprise Data Cloud der Branche liefern wird. Registrieren Sie sich jetzt. Die Veranstaltung wird am 14. Januar 2019 um 14:00 Uhr auch für die EMEA-Region stattfinden. Registrieren Sie sich hier für dieses Webinar.

1 „Cloudera And Hortonworks Merger: A Win-Win For All”, Beitrag von Noel Yuhanna im Forrester-Blog (4. Oktober 2018)

Über Cloudera

Bei Cloudera glauben wir, dass Daten morgen Dinge ermöglichen werden, die heute noch unmöglich sind. Wir versetzen Menschen in die Lage, komplexe Daten in klare, umsetzbare Erkenntnisse zu transformieren. Cloudera stellt dafür eine Enterprise Data Cloud bereit – für alle Daten, jederzeit, vom Netzwerkrand (Edge) bis hin zu künstlicher Intelligenz. Mit der Innovationskraft der Open-Source-Community treibt Cloudera die digitale Transformation für die größten Unternehmen der Welt voran. Erfahren Sie mehr unter

Cloudera und damit verbundene Zeichen und Warenzeichen sind registrierte Warenzeichen der Cloudera Inc. Alle anderen Unternehmen und Produktnamen können Warenzeichen der jeweiligen Besitzer sein.

Big Data has reduced the boundary between demand-centric dynamic pricing and user-behavior centric pricing!

Real-time pricing is also known as Dynamic pricing, and it is a method to plan and set highly flexible prices of the services or the products. Dynamic pricing is aimed to help the online organizations modify the costs on the fly in relation to the ever changing market conditions. All sorts of modifications are managed the costing bots, who collect the information, and use the algorithms in order to regulate the costing, keeping in mind the set guidelines. With the help of data analysis, vendors can accurately forecast the best prices, and also can adjust it as per the changing needs.

What’s the role of Big Data in Dynamics pricing?

Big data strategies are made just to get the required insights which help to enhance the performance of a business. Still, companies find it difficult to understand the capabilities of analytics, and how the analytics can be used to make the process of pricing all the more powerful. Various levels of Big Data collection, and analysis result into planning a proper dynamics pricing structure. The Big Data captured by the companies hold a lot of value when it comes to devising solid, and very workable dynamics costing structures.

Each and every one of the data-oriented firms move from the basic data reporting stage via a plenty of stages to get to the utmost, desirable level of optimization that’s deemed the most sophisticated. This eventually helps to enhance the revenue management process as well.

How Big Data lessens the gap between demand-centric dynamic pricing and user-behavior centric pricing?

Big Data as we have discussed above has a major role to play when it comes to setting dynamic pricing plans. Dynamic pricing is now further categorized into different segments and two of them are demand-centric dynamic pricing and user-behavior centric pricing. Both of these hold equal importance in creating a top pricing strategy. However, one of the other important things is that, it acts as a liaison between the two as well.  It bridges the gap between the two. When it comes to demand centric costing, it is referred to as what the customer needs, and what the customer is looking for. Whereas, when it comes to user behavior pricing, it is more related to what we should be offering to the customer as per the interest levels of the customers.

Now, both of these parameters hold equal importance when it comes to making costing strategies that are fruitful. To set proper ‘demand centric pricing’ it is importance to know about the demand as well as the wants of the target audience. And, when it comes to user-behavior centric pricing, we need to know how the user is feeling, and what interest areas are. This where the role of Big Data analytics come into play.

Big Data analytics of relative information helps to find out both, the demands and well as the user behaviors. Big Data analytics done to study the target audience are a best way to get to the answers. Once we know about the demands and the user behavior we have to combine both of these to churn our better pricing strategies.

The costing plans should be taken into consideration by mapping both of these elements together. For example, even whenever we curate marketing strategies, they are basically catering to the demands of the public. But, at the same time, user-behavior is never neglected either. It’s a mix of both that we need for setting dynamic prices as well. The modifications which should be done in the pricing should be done based on collective insights gained by clubbing both the elements together.

By studying both the demands graphs as well as the user behavior reports, a company can devise plans that will turn out to be very useful when it comes to costing. Dynamic pricing is as it is a very fruitful invention, and the integration of Big Data has made it all the more powerful.

Big Data is one of those technologies which has made a lot possible in a lot of areas. Be it the pricing structures or the business strategies, Big Data analytics are used everywhere to improve the performance of the company.

Sentiment Analysis of IMDB reviews

Sentiment Analysis of IMDB reviews

This article shows you how to build a Neural Network from scratch(no libraries) for the purpose of detecting whether a movie review on IMDB is negative or positive.


  • Curating a dataset and developing a "Predictive Theory"

  • Transforming Text to Numbers Creating the Input/Output Data

  • Building our Neural Network

  • Making Learning Faster by Reducing "Neural Noise"

  • Reducing Noise by strategically reducing the vocabulary

Curating the Dataset

In [3]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt','r') # features of our dataset
reviews = list(map(lambda x:x[:-1],g.readlines()))

g = open('labels.txt','r') # labels
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))

Note: The data in reviews.txt we're contains only lower case characters. That's so we treat different variations of the same word, like The, the, and THE, all the same way.

It's always a good idea to get check out your dataset before you proceed.

In [2]:
len(reviews) #No. of reviews
In [3]:
reviews[0] #first review
'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '
In [4]:
labels[0] #first label

Developing a Predictive Theory

Analysing how you would go about predicting whether its a positive or a negative review.

In [5]:
print("labels.txt \t : \t reviews.txt\n")
labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
In [41]:
from collections import Counter
import numpy as np

We'll create three Counter objects, one for words from postive reviews, one for words from negative reviews, and one for all the words.

In [56]:
# Create three Counter objects to store positive, negative and total counts
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

Examine all the reviews. For each word in a positive review, increase the count for that word in both your positive counter and the total words counter; likewise, for each word in a negative review, increase the count for that word in both your negative counter and the total words counter. You should use split(' ') to divide a piece of text (such as a review) into individual words.

In [57]:
# Loop over all the words in all the reviews and increment the counts in the appropriate counter objects
for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1

Most common positive & negative words

In [ ]:

The above statement retrieves alot of words, the top 3 being : ('the', 173324), ('.', 159654), ('and', 89722),

In [ ]:

The above statement retrieves alot of words, the top 3 being : ('', 561462), ('.', 167538), ('the', 163389),

As you can see, common words like "the" appear very often in both positive and negative reviews. Instead of finding the most common words in positive or negative reviews, what you really want are the words found in positive reviews more often than in negative reviews, and vice versa. To accomplish this, you'll need to calculate the ratios of word usage between positive and negative reviews.

The positive-to-negative ratio for a given word can be calculated with positive_counts[word] / float(negative_counts[word]+1). Notice the +1 in the denominator – that ensures we don't divide by zero for words that are only seen in positive reviews.

In [58]:
pos_neg_ratios = Counter()

# Calculate the ratios of positive and negative uses of the most common words
# Consider words to be "common" if they've been used at least 100 times
for term,cnt in list(total_counts.most_common()):
    if(cnt > 100):
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio

Examine the ratios

In [12]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))
Pos-to-neg ratio for 'the' = 1.0607993145235326
Pos-to-neg ratio for 'amazing' = 4.022813688212928
Pos-to-neg ratio for 'terrible' = 0.17744252873563218

We see the following:

  • Words that you would expect to see more often in positive reviews – like "amazing" – have a ratio greater than 1. The more skewed a word is toward postive, the farther from 1 its positive-to-negative ratio will be.
  • Words that you would expect to see more often in negative reviews – like "terrible" – have positive values that are less than 1. The more skewed a word is toward negative, the closer to zero its positive-to-negative ratio will be.
  • Neutral words, which don't really convey any sentiment because you would expect to see them in all sorts of reviews – like "the" – have values very close to 1. A perfectly neutral word – one that was used in exactly the same number of positive reviews as negative reviews – would be almost exactly 1.

Ok, the ratios tell us which words are used more often in postive or negative reviews, but the specific values we've calculated are a bit difficult to work with. A very positive word like "amazing" has a value above 4, whereas a very negative word like "terrible" has a value around 0.18. Those values aren't easy to compare for a couple of reasons:

  • Right now, 1 is considered neutral, but the absolute value of the postive-to-negative rations of very postive words is larger than the absolute value of the ratios for the very negative words. So there is no way to directly compare two numbers and see if one word conveys the same magnitude of positive sentiment as another word conveys negative sentiment. So we should center all the values around netural so the absolute value fro neutral of the postive-to-negative ratio for a word would indicate how much sentiment (positive or negative) that word conveys.
  • When comparing absolute values it's easier to do that around zero than one.

To fix these issues, we'll convert all of our ratios to new values using logarithms (i.e. use np.log(ratio))

In the end, extremely positive and extremely negative words will have positive-to-negative ratios with similar magnitudes but opposite signs.

In [59]:
# Convert ratios to logs
for word,ratio in pos_neg_ratios.most_common():
    pos_neg_ratios[word] = np.log(ratio)

Examine the new ratios

In [14]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))
Pos-to-neg ratio for 'the' = 0.05902269426102881
Pos-to-neg ratio for 'amazing' = 1.3919815802404802
Pos-to-neg ratio for 'terrible' = -1.7291085042663878

If everything worked, now you should see neutral words with values close to zero. In this case, "the" is near zero but slightly positive, so it was probably used in more positive reviews than negative reviews. But look at "amazing"'s ratio - it's above 1, showing it is clearly a word with positive sentiment. And "terrible" has a similar score, but in the opposite direction, so it's below -1. It's now clear that both of these words are associated with specific, opposing sentiments.

Run the below code to see more ratios.

It displays all the words, ordered by how associated they are with postive reviews.

In [ ]:

The top most common words for the above code : ('edie', 4.6913478822291435), ('paulie', 4.0775374439057197), ('felix', 3.1527360223636558), ('polanski', 2.8233610476132043), ('matthau', 2.8067217286092401), ('victoria', 2.6810215287142909), ('mildred', 2.6026896854443837), ('gandhi', 2.5389738710582761), ('flawless', 2.451005098112319), ('superbly', 2.2600254785752498), ('perfection', 2.1594842493533721), ('astaire', 2.1400661634962708), ('captures', 2.0386195471595809), ('voight', 2.0301704926730531), ('wonderfully', 2.0218960560332353), ('powell', 1.9783454248084671), ('brosnan', 1.9547990964725592)

Transforming Text into Numbers

Creating the Input/Output Data

Create a set named vocab that contains every word in the vocabulary.

In [19]:
vocab = set(total_counts.keys())

Check vocabulary size

In [20]:
vocab_size = len(vocab)

Th following image rpresents the layers of the neural network you'll be building throughout this notebook. layer_0 is the input layer, layer_1 is a hidden layer, and layer_2 is the output layer.

In [1]:

TODO: Create a numpy array called layer_0 and initialize it to all zeros. Create layer_0 as a 2-dimensional matrix with 1 row and vocab_size columns.

In [21]:
layer_0 = np.zeros((1,vocab_size))

layer_0 contains one entry for every word in the vocabulary, as shown in the above image. We need to make sure we know the index of each word, so run the following cell to create a lookup table that stores the index of every word.

TODO: Complete the implementation of update_input_layer. It should count how many times each word is used in the given review, and then store those counts at the appropriate indices inside layer_0.

In [ ]:
# Create a dictionary of words in the vocabulary mapped to index positions 
# (to be used in layer_0)
word2index = {}
for i,word in enumerate(vocab):
    word2index[word] = i

It stores the indexes like this: 'antony': 22, 'pinjar': 23, 'helsig': 24, 'dances': 25, 'good': 26, 'willard': 71500, 'faridany': 27, 'foment': 28, 'matts': 12313,

Lets implement some functions for simplifying our inputs to the neural network.

In [25]:
def update_input_layer(review):
    The element at a given index of layer_0 should represent
    how many times the given word occurs in the review.
    global layer_0
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    # count how many times each word is used in the given review and store the results in layer_0 
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

Run the following cell to test updating the input layer with the first review. The indices assigned may not be the same as in the solution, but hopefully you'll see some non-zero values in layer_0.

In [26]:
array([[ 18.,   0.,   0., ...,   0.,   0.,   0.]])

get_target_for_labels should return 0 or 1, depending on whether the given label is NEGATIVE or POSITIVE, respectively.

In [27]:
def get_target_for_label(label):
    if(label == 'POSITIVE'):
        return 1
        return 0

Building a Neural Network

In [32]:
import time
import sys
import numpy as np

# Encapsulate our neural network in a class
class SentimentNetwork:
    def __init__(self, reviews,labels,hidden_nodes = 10, learning_rate = 0.1):
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        # Assign a seed to our random number generator to ensure we get
        # reproducable results

        # process the reviews and their associated labels so that everything
        # is ready for training
        self.pre_process_data(reviews, labels)
        # Build the network to have the number of hidden nodes and the learning rate that
        # were passed into this initializer. Make the same number of input nodes as
        # there are vocabulary words and create a single output node.
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):
        # populate review_vocab with all of the words in the given reviews
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):

        # Convert the vocabulary set to a list so we can access words via indices
        self.review_vocab = list(review_vocab)
        # populate label_vocab with all of the words in the given labels.
        label_vocab = set()
        for label in labels:
        # Convert the label vocabulary set to a list so we can access labels via indices
        self.label_vocab = list(label_vocab)
        # Store the sizes of the review and label vocabularies.
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        # Create a dictionary of words in the vocabulary mapped to index positions
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        # Create a dictionary of labels mapped to index positions
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Set number of nodes in input, hidden and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Store the learning rate
        self.learning_rate = learning_rate

        # Initialize weights

        # These are the weights between the input layer and the hidden layer.
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
        # These are the weights between the hidden layer and the output layer.
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, 
                                                (self.hidden_nodes, self.output_nodes))
        # The input layer, a two-dimensional matrix with shape 1 x input_nodes
        self.layer_0 = np.zeros((1,input_nodes))
    def update_input_layer(self,review):

        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(" "):
            if(word in self.word2index.keys()):
                self.layer_0[0][self.word2index[word]] += 1
    def get_target_for_label(self,label):
        if(label == 'POSITIVE'):
            return 1
            return 0
    def sigmoid(self,x):
        return 1 / (1 + np.exp(-x))
    def sigmoid_output_2_derivative(self,output):
        return output * (1 - output)
    def train(self, training_reviews, training_labels):
        # make sure out we have a matching number of reviews and labels
        assert(len(training_reviews) == len(training_labels))
        # Keep track of correct predictions to display accuracy during training 
        correct_so_far = 0

        # Remember when we started for printing time statistics
        start = time.time()
        # loop through all the given reviews and run a forward and backward pass,
        # updating weights for every item
        for i in range(len(training_reviews)):
            # Get the next review and its correct label
            review = training_reviews[i]
            label = training_labels[i]
            ### Forward pass ###

            # Input Layer

            # Hidden layer
            layer_1 =

            # Output layer
            layer_2 = self.sigmoid(
            ### Backward pass ###

            # Output error
            layer_2_error = layer_2 - self.get_target_for_label(label) # Output layer error is the difference between desired target and actual output.
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error
            layer_1_error = # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            self.weights_1_2 -= * self.learning_rate # update hidden-to-output weights with gradient descent step
            self.weights_0_1 -= * self.learning_rate # update input-to-hidden weights with gradient descent step

            # Keep track of correct predictions.
            if(layer_2 >= 0.5 and label == 'POSITIVE'):
                correct_so_far += 1
            elif(layer_2 < 0.5 and label == 'NEGATIVE'):
                correct_so_far += 1
            sys.stdout.write(" #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
    def test(self, testing_reviews, testing_labels):
        Attempts to predict the labels for the given testing_reviews,
        and uses the test_labels to calculate the accuracy of those predictions.
        # keep track of how many correct predictions we make
        correct = 0

        # Loop through each of the given reviews and call run to predict
        # its label. 
        for i in range(len(testing_reviews)):
            pred =[i])
            if(pred == testing_labels[i]):
                correct += 1
            sys.stdout.write(" #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    def run(self, review):
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        # Run a forward pass through the network, like in the "train" function.
        # Input Layer

        # Hidden layer
        layer_1 =

        # Output layer
        layer_2 = self.sigmoid(
        # Return POSITIVE for values above greater-than-or-equal-to 0.5 in the output layer;
        # return NEGATIVE for other values
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
            return "NEGATIVE"

Run the following code to create the network with a small learning rate, 0.001, and then train the new network. Using learning rate larger than this, for example 0.1 or even 0.01 would result in poor performance.

In [ ]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.001)

Running the above code would have given an accuracy around 62.2%

Reducing Noise in Our Input Data

Counting how many times each word occured in our review might not be the most efficient way. Instead just including whether a word was there or not will improve our training time and accuracy. Hence we update our update_input_layer() function.

In [ ]:
def update_input_layer(self,review):
    self.layer_0 *= 0
    for word in review.split(" "):
        if(word in self.word2index.keys()):
            self.layer_0[0][self.word2index[word]] =1

Creating and running our neural network again, even with a higher learning rate of 0.1 gave us a training accuracy of 83.8% and testing accuracy(testing on last 1000 reviews) of 85.7%.

Reducing Noise by Strategically Reducing the Vocabulary

Let us put the pos to neg ratio's that we found were much more effective at detecting a positive or negative label. We could do that by a few change:

  • Modify pre_process_data:
    • Add two additional parameters: min_count and polarity_cutoff
    • Calculate the positive-to-negative ratios of words used in the reviews.
    • Change so words are only added to the vocabulary if they occur in the vocabulary more than min_count times.
    • Change so words are only added to the vocabulary if the absolute value of their postive-to-negative ratio is at least polarity_cutoff
In [ ]:
def pre_process_data(self, reviews, labels, polarity_cutoff, min_count):
        positive_counts = Counter()
        negative_counts = Counter()
        total_counts = Counter()

        for i in range(len(reviews)):
            if(labels[i] == 'POSITIVE'):
                for word in reviews[i].split(" "):
                    positive_counts[word] += 1
                    total_counts[word] += 1
                for word in reviews[i].split(" "):
                    negative_counts[word] += 1
                    total_counts[word] += 1

        pos_neg_ratios = Counter()

        for term,cnt in list(total_counts.most_common()):
            if(cnt >= 50):
                pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
                pos_neg_ratios[term] = pos_neg_ratio

        for word,ratio in pos_neg_ratios.most_common():
            if(ratio > 1):
                pos_neg_ratios[word] = np.log(ratio)
                pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))

        # populate review_vocab with all of the words in the given reviews
        review_vocab = set()
        for review in reviews:
            for word in review.split(" "):
                if(total_counts[word] > min_count):
                    if(word in pos_neg_ratios.keys()):
                        if((pos_neg_ratios[word] >= polarity_cutoff) or (pos_neg_ratios[word] <= -polarity_cutoff)):

        # Convert the vocabulary set to a list so we can access words via indices
        self.review_vocab = list(review_vocab)
        # populate label_vocab with all of the words in the given labels.
        label_vocab = set()
        for label in labels:
        # Convert the label vocabulary set to a list so we can access labels via indices
        self.label_vocab = list(label_vocab)
        # Store the sizes of the review and label vocabularies.
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        # Create a dictionary of words in the vocabulary mapped to index positions
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        # Create a dictionary of labels mapped to index positions
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i

Our training accuracy increased to 85.6% after this change. As we can see our accuracy saw a huge jump by making minor changes based on our intuition. We can keep making such changes and increase the accuracy even further.


Download the Data Sources

The data sources used in this article can be downloaded here: