Moderne Business Intelligence in der Microsoft Azure Cloud

Google, Amazon und Microsoft sind die drei großen Player im Bereich Cloud Computing. Die Cloud kommt für nahezu alle möglichen Anwendungsszenarien infrage, beispielsweise dem Hosting von Unternehmenssoftware, Web-Anwendungen sowie Applikationen für mobile Endgeräte. Neben diesen Klassikern spielt die Cloud jedoch auch für Internet of Things, Blockchain oder Künstliche Intelligenz eine wichtige Rolle als Enabler. In diesem Artikel beleuchten wir den Cloud-Anbieter Microsoft Azure mit Blick auf die Möglichkeiten des Aufbaues eines modernen Business Intelligence oder Data Platform für Unternehmen.

Eine Frage der Architektur

Bei der Konzeptionierung der Architektur stellen sich viele Fragen:

  • Welche Datenbank wird für das Data Warehouse genutzt?
  • Wie sollten ETL-Pipelines erstellt und orchestriert werden?
  • Welches BI-Reporting-Tool soll zum Einsatz kommen?
  • Müssen Daten in nahezu Echtzeit bereitgestellt werden?
  • Soll Self-Service-BI zum Einsatz kommen?
  • … und viele weitere Fragen.

1 Die Referenzmodelle für Business Intelligence Architekturen von Microsoft Azure

Die vielen Dienste von Microsoft Azure erlauben unzählige Einsatzmöglichkeiten und sind selbst für Cloud-Experten nur schwer in aller Vollständigkeit zu überblicken.  Microsoft schlägt daher verschiedene Referenzmodelle für Datenplattformen oder Business Intelligence Systeme mit unterschiedlichen Ausrichtungen vor. Einige davon wollen wir in diesem Artikel kurz besprechen und diskutieren.

1a Automatisierte Enterprise BI-Instanz

Diese Referenzarchitektur für automatisierte und eher klassische BI veranschaulicht die Vorgehensweise für inkrementelles Laden in einer ELT-Pipeline mit dem Tool Data Factory. Data Factory ist der Cloud-Nachfolger des on-premise ETL-Tools SSIS (SQL Server Integration Services) und dient nicht nur zur Erstellung der Pipelines, sondern auch zur Orchestrierung (Trigger-/Zeitplan der automatisierten Ausführung und Fehler-Behandlung). Über Pipelines in Data Factory werden die jeweils neuesten OLTP-Daten inkrementell aus einer lokalen SQL Server-Datenbank (on-premise) in Azure Synapse geladen, die Transaktionsdaten dann in ein tabellarisches Modell für die Analyse transformiert, dazu wird MS Azure Analysis Services (früher SSAS on-premis) verwendet. Als Tool für die Visualisierung der Daten wird von Microsoft hier und in allen anderen Referenzmodellen MS PowerBI vorgeschlagen. MS Azure Active Directory verbindet die Tools on Azure über einheitliche User im Active Directory Verzeichnis in der Azure-Cloud.

https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/data/enterprise-bi-adfQuelle:

Einige Diskussionspunkte zur BI-Referenzarchitektur von MS Azure

Der von Microsoft vorgeschlagenen Referenzarchitektur zu folgen kann eine gute Idee sein, ist jedoch tatsächlich nur als Vorschlag – eher noch als Kaufvorschlag – zu betrachten. Denn Unternehmens-BI ist hochgradig individuell und Bedarf einiger Diskussion vor der Festlegung der Architektur.

Azure Data Factory als ETL-Tool

Azure Data Factory wird in dieser Referenzarchitektur als ETL-Tool vorgeschlagen. In der Tat ist dieses sehr mächtig und rein über Mausklicks bedienbar. Darüber hinaus bietet es die Möglichkeit z. B. über Python oder Powershell orchestriert und pipeline-modelliert zu werden. Der Clue für diese Referenzarchitektur ist der Hinweis auf die On-Premise-Datenquellen. Sollte zuvor SSIS eingesetzt werden sollen, können die SSIS-Packages zu Data Factory migriert werden.

Die Auswahl der Datenbanken

Der Vorteil dieser Referenzarchitektur ist ohne Zweifel die gute Aufstellung der Architektur im Hinblick auf vielseitige Einsatzmöglichkeiten, so werden externe Daten (in der Annahme, dass diese un- oder semi-strukturiert vorliegen) zuerst in den Azure Blob Storage oder in den auf dem Blob Storage beruhenden Azure Data Lake zwischen gespeichert, bevor sie via Data Factory in eine für Azure Synapse taugliche Struktur transformiert werden können. Möglicherweise könnte auf den Blob Storage jedoch auch gut verzichtet werden, solange nur Daten aus bekannten, strukturierten Datenbanken der Vorsysteme verarbeitet werden. Als Staging-Layer und für Datenhistorisierung sind der Azure Blob Storage oder der Azure Data Lake jedoch gute Möglichkeiten, da pro Dateneinheit besonders preisgünstig.

Azure Synapse ist eine mächtige Datenbank mindestens auf Augenhöhe mit zeilen- und spaltenorientierten, verteilten In-Memory-Datenbanken wie Amazon Redshift, Google BigQuery oder SAP Hana. Azure Synapse bietet viele etablierte Funktionen eines modernen Data Warehouses und jährlich neue Funktionen, die zuerst als Preview veröffentlicht werden, beispielsweise der Einsatz von Machine Learning direkt auf der Datenbank.

Zur Diskussion steht jedoch, ob diese Funktionen und die hohe Geschwindigkeit (bei richtiger Nutzung) von Azure Synapse die vergleichsweise hohen Kosten rechtfertigen. Alternativ können MySQL-/MariaDB oder auch PostgreSQL-Datenbanken bei MS Azure eingesetzt werden. Diese sind jedoch mit Vorsicht zu nutzen bzw. erst unter genauer Abwägung einzusetzen, da sie nicht vollständig von Azure Data Factory in der Pipeline-Gestaltung unterstützt werden. Ein guter Kompromiss kann der Einsatz von Azure SQL Database sein, der eigentliche Nachfolger der on-premise Lösung MS SQL Server. MS Azure Snypase bleibt dabei jedoch tatsächlich die Referenz, denn diese Datenbank wurde speziell für den Einsatz als Data Warehouse entwickelt.

Zentrale Cube-Generierung durch Azure Analysis Services

Zur weiteren Diskussion stehen könnte MS Azure Analysis Sevice als Cube-Engine. Diese Cube-Engine, die ursprünglich on-premise als SQL Server Analysis Service (SSAS) bekannt war, nun als Analysis Service in der Azure Cloud verfügbar ist, beruhte früher noch als SSAS auf der Sprache MDX (Multi-Dimensional Expressions), eine stark an SQL angelehnte Sprache zum Anlegen von schnellen Berechnungsformeln für Kennzahlen im Cube-Datenmodellen, die grundlegendes Verständnis für multidimensionale Abfragen mit Tupeln und Sets voraussetzt. Heute wird statt MDX die Sprache DAX (Data Analysis Expression) verwendet, die eher an Excel-Formeln erinnert (diesen aber keinesfalls entspricht), sie ist umfangreicher als MDX, jedoch für den abitionierten Anwender leichter verständlich und daher für Self-Service-BI geeignet.

Punkt der Diskussion ist, dass der Cube über den Analysis-Service selbst keine Möglichkeiten eine Self-Service-BI nicht ermöglicht, da die Bearbeitung des Cubes mit DAX nur über spezielle Entwicklungsumgebungen möglich ist (z. B. Visual Studio). MS Power BI selbst ist ebenfalls eine Instanz des Analysis Service, denn im Kern von Power BI steckt dieselbe Engine auf Basis von DAX. Power BI bietet dazu eine nutzerfreundliche UI und direkt mit mausklickbaren Elementen Daten zu analysieren und Kennzahlen mit DAX anzulegen oder zu bearbeiten. Wird im Unternehmen absehbar mit Power BI als alleiniges Analyse-Werkzeug gearbeitet, ist eine separate vorgeschaltete Instanz des Azure Analysis Services nicht notwendig. Der zur Abwägung stehende Vorteil des Analysis Service ist die Nutzung des Cubes in Microsoft Excel durch die User über Power Pivot. Dies wiederum ist eine eigene Form des sehr flexiblen Self-Service-BIs.

1b Enterprise Data Warehouse-Architektur

Eine weitere Referenz-Architektur von Microsoft auf Azure ist jene für den Einsatz als Data Warehouse, bei der Microsoft Azure Synapse den dominanten Part von der Datenintegration über die Datenspeicherung und Vor-Analyse übernimmt.https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/enterprise-data-warehouseQuelle: 

Diskussionspunkte zum Referenzmodell der Enterprise Data Warehouse Architecture

Auch diese Referenzarchitektur ist nur für bestimmte Einsatzzwecke in dieser Form sinnvoll.

Azure Synapse als ETL-Tool

Im Unterschied zum vorherigen Referenzmodell wird hier statt auf Azure Data Factory auf Azure Synapse als ETL-Tool gesetzt. Azure Synapse hat die Datenintegrationsfunktionalitäten teilweise von Azure Data Factory geerbt, wenn gleich Data Factory heute noch als das mächtigere ETL-Tool gilt. Azure Synapse entfernt sich weiter von der alten SSIS-Logik und bietet auch keine Integration von SSIS-Paketen an, zudem sind einige Anbindungen zwischen Data Factory und Synapse unterschiedlich.

Auswahl der Datenbanken

Auch in dieser Referenzarchitektur kommt der Azure Blob Storage als Zwischenspeicher bzw. Staging-Layer zum Einsatz, jedoch im Mantel des Azure Data Lakes, der den reinen Speicher um eine Benutzerebene erweitert und die Verwaltung des Speichers vereinfacht. Als Staging-Layer oder zur Datenhistorisierung ist der Blob Storage eine kosteneffiziente Methode, darf dennoch über individuelle Betrachtung in der Notwendigkeit diskutiert werden.

Azure Synapse erscheint in dieser Referenzarchitektur als die sinnvolle Lösung, da nicht nur die Pipelines von Synapse, sondern auch die SQL-Engine sowie die Spark-Engine (über Python-Notebooks) für die Anwendung von Machine Learning (z. B. für Recommender-Systeme) eingesetzt werden können. Hier spielt Azure Synpase die Möglichkeiten als Kern einer modernen, intelligentisierbaren Data Warehouse Architektur voll aus.

Azure Analysis Service

Auch hier wird der Azure Analysis Service als Cube-generierende Maschinerie von Microsoft vorgeschlagen. Hier gilt das zuvor gesagte: Für den reinen Einsatz mit Power BI ist der Analysis Service unnötig, sollen Nutzer jedoch in MS Excel komplexe, vorgerechnete Analysen durchführen können, dann zahlt sich der Analysis Service aus.

Azure Cosmos DB

Die Azure Cosmos DB ist am nächsten vergleichbar mit der MongoDB Atlas (die Cloud-Version der eigentlich on-premise zu hostenden MongoDB). Es ist eine NoSQL-Datenbank, die über Datendokumente im JSON-File-Format auch besonders große Datenmengen in sehr hoher Geschwindigkeit abfragen kann. Sie gilt als die zurzeit schnellste Datenbank in Sachen Lesezugriff und spielt dabei alle Vorteile aus, wenn es um die massenweise Bereitstellung von Daten in andere Applikationen geht. Unternehmen, die ihren Kunden mobile Anwendungen bereitstellen, die Millionen parallele Datenzugriffe benötigen, setzen auf Cosmos DB.

1c Referenzarchitektur für Realtime-Analytics

Die Referenzarchitektur von Microsoft Azure für Realtime-Analytics wird die Referenzarchitektur für Enterprise Data Warehousing ergänzt um die Aufnahme von Data Streaming.

Diskussionspunkte zum Referenzmodell für Realtime-Analytics

Diese Referenzarchitektur ist nur für Einsatzszenarios sinnvoll, in denen Data Streaming eine zentrale Rolle spielt. Bei Data Streaming handelt es sich, vereinfacht gesagt, um viele kleine, ereignis-getriggerte inkrementelle Datenlade-Vorgänge bzw. -Bedarfe (Events), die dadurch nahezu in Echtzeit ausgeführt werden können. Dies kann über Webshops und mobile Anwendungen von hoher Bedeutung sein, wenn z. B. Angebote für Kunden hochgrade-individualisiert angezeigt werden sollen oder wenn Marktdaten angezeigt und mit ihnen interagiert werden sollen (z. B. Trading von Wertpapieren). Streaming-Tools bündeln eben solche Events (bzw. deren Datenhäppchen) in Data-Streaming-Kanäle (Partitionen), die dann von vielen Diensten (Consumergruppen / Receiver) aufgegriffen werden können. Data Streaming ist insbesondere auch dann ein notwendiges Setup, wenn ein Unternehmen über eine Microservices-Architektur verfügt, in der viele kleine Dienste (meistens als Docker-Container) als dezentrale Gesamtstruktur dienen. Jeder Dienst kann über Apache Kafka als Sender- und/oder Empfänger in Erscheinung treten. Der Azure Event-Hub dient dazu, die Zwischenspeicherung und Verwaltung der Datenströme von den Event-Sendern in den Azure Blob Storage bzw. Data Lake oder in Azure Synapse zu laden und dort weiter zu reichen oder für tiefere Analysen zu speichern.

Azure Eventhub ArchitectureQuelle: https://docs.microsoft.com/de-de/azure/event-hubs/event-hubs-about

Für die Datenverarbeitung in nahezu Realtime sind der Azure Data Lake und Azure Synapse derzeitig relativ alternativlos. Günstigere Datenbank-Instanzen von MariaDB/MySQL, PostgreSQL oder auch die Azure SQL Database wären hier ein Bottleneck.

2 Fazit zu den Referenzarchitekturen

Die Referenzarchitekturen sind exakt als das zu verstehen: Als Referenz. Keinesfalls sollte diese Architektur unreflektiert für ein Unternehmen übernommen werden, sondern vorher in Einklang mit der Datenstrategie gebracht werden, dabei sollten mindestens diese Fragen geklärt werden:

  • Welche Datenquellen sind vorhanden und werden zukünftig absehbar vorhanden sein?
  • Welche Anwendungsfälle (Use Cases) habe ich für die Business Intelligence bzw. Datenplattform?
  • Über welche finanziellen und fachlichen Ressourcen darf verfügt werden?

Darüber hinaus sollten sich die Architekten bewusst sein, dass, anders als noch in der trägeren On-Premise-Welt, die Could-Dienste schnelllebig sind. So sah die Referenzarchitektur 2019/2020 noch etwas anders aus, in der Databricks on Azure als System für Advanced Analytics inkludiert wurde, heute scheint diese Position im Referenzmodell komplett durch Azure Synapse ersetzt worden zu sein.

Azure Reference Architecture BI Databrikcs 2019

Azure Reference Architecture – with Databricks, old image source: https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/modern-data-warehouse

Hinweis zu den Kosten und der Administration

Die Kosten für Cloud Computing statt für IT-Infrastruktur On-Premise sind ein zweischneidiges Schwert. Der günstige Einstieg in de Azure Cloud ist möglich, jedoch bedingt ein kosteneffizienter Betrieb viel Know-How im Umgang mit den Diensten und Konfigurationsmöglichkeiten der Azure Cloud oder des jeweiligen alternativen Anbieters. Beispielsweise können über Azure Data Factory Datenbanken über Pipelines automatisiert hochskaliert und nach nur Minuten wieder runterskaliert werden. Nur wer diese dynamischen Skaliermöglichkeiten nutzt, arbeitet effizient in der Cloud.

Ferner sind Kosten nur schwer einschätzbar, da diese mehr noch von der Nutzung (Datenmenge, CPU, RAM) als von der zeitlichen Nutzung (Lifetime) abhängig sind. Preisrechner ermöglichen zumindest eine Kosteneinschätzung: https://azure.com/e/96162a623bda4911bb8f631e317affc6

Leveraging Data Science for Vaccine Access and Administration

As people across the world become eligible for receiving their doses, governments and pharmaceutical companies must act efficiently to provide those vaccines. Alongside distribution, people need more information about these newer vaccines. As a solution for both of these obstacles, data science is a useful tool for vaccine development, distribution, and access throughout the COVID-19 pandemic.

From the initial stages of social distancing and vaccine development to broadening access to the vaccines, data science uses information from countless resources to provide evidence-based, actionable recommendations. Governments and health care providers then act upon this data to help the public and move towards the global goal of eliminating the virus.

Vaccine Development

Since the pandemic began, vaccines have been a sign of hope and a return to a new normalcy. However, getting to effective vaccine distribution first required using data science to develop the doses themselves.

The COVID-19 vaccines have been some of the fastest-developed inoculations in history, which is partly because of the efforts of data scientists. Using machine learning, researchers were able to analyze the sequences of strains of the virus and establish what parts a vaccine would best respond to. Specifically, the sequences had to be those that would be less likely to mutate in the future and less likely to cause an adverse reaction in humans with an injection.

Machine learning helped scientists predict and theorize about which proteins would be the best to work within the SARS-CoV-2 strains. From there, they proceeded with creating vaccines that are now in use all over the world, like Pfizer’s or Moderna’s.

Then, as vaccines become more available, governments again rely on data science to dictate eligibility. Data analytics systems take into account exposure risk, demographics, jobs, and health conditions, which have helped countries break up eligibility into phases.

Supply and Demand

Vaccine supply chains had rocky beginnings throughout the world. In Germany, residents faced shortages of doses, where demand far outweighed the available vials. This type of shortage is especially dangerous, as it can lead to an increase in cases or a full-on spike.

To avoid these uneven dynamics, data science can provide more accurate projections of how many vaccines regions will need on a weekly basis. Data science systems that use machine learning can account for the population that’s eligible and historical COVID-19 vaccination numbers. Then, as eligibility opens up, these systems predict how many vaccines a facility or county will need in the future.

Vaccine administrative organizations can then communicate better with the government to request the doses they will properly handle and use weekly.

In the United States, West Virginia is working with data science dashboards to identify who is most at risk of contracting the virus. Then, they can request the right amount of vials each week and give them to the people who need them the most.

Information Access

As new vaccines and government mandates come into play, residents all over the world need more information to feel safe and to know what they should do. Vaccine scams, for example, have increased with distribution. These scams will ask for personal information like a Social Security number or a form of payment.

To avoid these scams and learn about the vaccine options available, the public needs more access to information. Data science is again helpful to distribute this information.

Google has become a leader in information access with its Intelligence Vaccine Impact initiative. With this program, Google uses machine learning and artificial intelligence (AI) to process data regarding government policy changes, vaccine availability, eligibility, and demographics. That way, people know when they can receive a vaccine, the research and information that supports the vaccines, and what scams to avoid along the way.

Then, based on the vaccine information Google gathers, data scientists can provide a clearer trajectory of the pandemic globally and locally as vaccines help cases go down.

A Data-Driven Path Forward

Data science provides solutions for vaccine access, distribution, and administration. With these powerful dynamics in place, it’s clear that data will lead the world towards a healthier future. Based on evidence from the pandemic, data helps governments and health care providers offer the best solutions for eliminating the virus and protecting people everywhere.

5 AI Tricks to Grow Your Online Sales

The way people shop is currently changing. This only means that online stores need optimization to stay competitive and answer to the needs of customers. In this post, we’ll bring up the five ways in which you can use artificial intelligence technology in an online store to grow your revenues. Let’s begin!

1. Personalization with AI

Opening the list of AI trends that are certainly worth covering deals with a step up in personalization. Did you know that according to the results of a survey that was held by Accenture, more than 90% of shoppers are likelier to buy things from those stores and brands that propose suitable product recommendations?

This is exactly where artificial intelligence can give you a big hand. Such progressive technology analyzes the behavior of your consumers individually, keeping in mind their browsing and purchasing history. After collecting all the data, AI draws the necessary conclusions and offers those product recommendations that the user might like.

Look at the example below with the block has a carousel of neat product options. Obviously, this “move” can give a big boost to the average cart sizes.

Screenshot taken on the official Reebok website

Screenshot taken on the official Reebok website

2. Smarter Search Options

With the rise of the popularity of AI voice assistants and the leap in technology in general, the way people look for things on the web has changed. Everything is moving towards saving time and getting faster better results.

One of such trends deals with embracing the text to speech and image search technology. Did you notice how many search bars have “microphone icons” for talking out your request?

On a similar note, numerous sites have made a big jump forward after incorporating search by picture. In this case, uploaded photos get analyzed by artificial intelligence technology. The system studies what’s depicted on the image and cross-checks it with the products sold in the store. In several seconds the user is provided with a selection of similar products.

Without any doubt, this greatly helps users find what they were looking for faster. As you might have guessed, this is a time-saving feature. In essence, this omits the necessity to open dozens of product pages on multiple sites when seeking out a liked item that they’ve taken a screenshot or photo of.

Check out how such a feature works on the official Amazon website by taking a look at the screenshots of StyleSnap provided below.

Screenshot taken on the official Amazon StyleSnap website

Screenshot taken on the official Amazon StyleSnap website

3. Assisting Clients via Chatbots

The next point on the list is devoted to AI chatbots. This feature can be a real magic wand with client support which is also beneficial for online sales.

Real customer support specialists usually aren’t available 24/7. And keeping in mind that most requests are on repetitive topics, having a chatbot instantly handle many of the questions is a neat way to “unload” the work of humans.

Such chatbots use machine learning to get better at understanding and processing client queries. How do they work? They’re “taught” via scripts and scenario schemes. Therefore, the more data you supply them with, the more matters they’ll be able to cover.

Case in point, there’s such a chat available on the official Victoria’s Secret website. If the user launches the Digital Assistant, the messenger bot starts the conversation. Based on the selected topic the user selects from the options, the bot defines what will be discussed.

Screenshot taken on the official Victoria’s Secret website

Screenshot taken on the official Victoria’s Secret website

4. Determining Top-Selling Product Combos

A similar AI use case for boosting online revenues to the one mentioned in the first point, it becomes much easier to cross-sell products when artificial intelligence “cracks” the actual top matches. Based on the findings by Sumo, you can boost your revenues by 10 to 30% if you upsell wisely!

The product database of online stores gets larger by the month, making it harder to know for good which items go well together and complement each other. With AI on your analytics team, you don’t have to scratch your head guessing which products people are likely to additionally buy along with the item they’re browsing at the moment. This work on singling out data can be done for you.

As seen on the screenshot from the official MAC Cosmetics website, the upselling section on the product page presents supplement items in a carousel. Thus, the chance of these products getting added to the shopping cart increases (if you compare it to the situation when the client would search the site and find these products by himself).

Screenshot taken on the official MAC Cosmetics website

Screenshot taken on the official MAC Cosmetics website

5. “Try It On” with a Camera

The fifth AI technology in this list is virtual try on that borrowed the power of augmented reality technology in the world of sales.

Especially for fields like cosmetics or accessories, it is important to find ways to help clients to make up their minds and encourage them to buy an item without testing it physically. If you want, you can play around with such real-time functionality and put on makeup using your camera on the official Maybelline New York site.

Consumers, ultimately, become happier because this solution omits frustration and unneeded doubts. With everything evident and clear, people don’t have the need to take a shot in the dark what will be a good match, they can see it.

Screenshot taken on the official Maybelline New York website

Screenshot taken on the official Maybelline New York website

In Closing

To conclude everything stated in this article, artificial intelligence is a big crunch point. Incorporating various AI-powered features into an online retail store can be a neat advancement leading to a visible growth in conversions.

Multi-head attention mechanism: “queries”, “keys”, and “values,” over and over again

This is the third article of my article series named “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

In the last article, I explained how attention mechanism works in simple seq2seq models with RNNs, and it basically calculates correspondences of the hidden state at every time step, with all the outputs of the encoder. However I would say the attention mechanisms of RNN seq2seq models use only one standard for comparing them. Using only one standard is not enough for understanding languages, especially when you learn a foreign language. You would sometimes find it difficult to explain how to translate a word in your language to another language. Even if a pair of languages are very similar to each other, translating them cannot be simple switching of vocabulary. Usually a single token in one language is related to several tokens in the other language, and vice versa. How they correspond to each other depends on several criteria, for example “what”, “who”, “when”, “where”, “why”, and “how”. It is easy to imagine that you should compare tokens with several criteria.

Transformer model was first introduced in the original paper named “Attention Is All You Need,” and from the title you can easily see that attention mechanism plays important roles in this model. When you learn about Transformer model, you will see the figure below, which is used in the original paper on Transformer.  This is the simplified overall structure of one layer of Transformer model, and you stack this layer N times. In one layer of Transformer, there are three multi-head attention, which are displayed as boxes in orange. These are the very parts which compare the tokens on several standards. I made the head article of this article series inspired by this multi-head attention mechanism.

The figure below is also from the original paper on Transfromer. If you can understand how multi-head attention mechanism works with the explanations in the paper, and if you have no troubles understanding the codes in the official Tensorflow tutorial, I have to say this article is not for you. However I bet that is not true of majority of people, and at least I need one article to clearly explain how multi-head attention works. Please keep it in mind that this article covers only the architectures of the two figures below. However multi-head attention mechanisms are crucial components of Transformer model, and throughout this article, you would not only see how they work but also get a little control over it at an implementation level.

1 Multi-head attention mechanism

When you learn Transformer model, I recommend you first to pay attention to multi-head attention. And when you learn multi-head attentions, before seeing what scaled dot-product attention is, you should understand the whole structure of multi-head attention, which is at the right side of the figure above. In order to calculate attentions with a “query”, as I said in the last article, “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” Sooner or later, you will notice I would be just repeating these phrases over and over again throughout this article, in several ways.

*Even if you are not sure what “reweighting” means in this context, please keep reading. I think you would little by little see what it means especially in the next section.

The overall process of calculating multi-head attention, displayed in the figure above, is as follows (Please just keep reading. Please do not think too much.): first you split the V: “values”, K: “keys”, and Q: “queries”, and second you transform those divided “values”, “keys”, and “queries” with densely connected layers (“Linear” in the figure). Next you calculate attention weights and reweight the “values” and take the summation of the reiweighted “values”, and you concatenate the resulting summations. At the end you pass the concatenated “values” through another densely connected layers. The mechanism of scaled dot-product attention is just a matter of how to concretely calculate those attentions and reweight the “values”.

*In the last article I briefly mentioned that “keys” and “queries” can be in the same language. They can even be the same sentence in the same language, and in this case the resulting attentions are called self-attentions, which we are mainly going to see. I think most people calculate “self-attentions” unconsciously when they speak. You constantly care about what “she”, “it” , “the”, or “that” refers to in you own sentence, and we can say self-attention is how these everyday processes is implemented.

Let’s see the whole process of calculating multi-head attention at a little abstract level. From now on, we consider an example of calculating multi-head self-attentions, where the input is a sentence “Anthony Hopkins admired Michael Bay as a great director.” In this example, the number of tokens is 9, and each token is encoded as a 512-dimensional embedding vector. And the number of heads is 8. In this case, as you can see in the figure below, the input sentence “Anthony Hopkins admired Michael Bay as a great director.” is implemented as a 9\times 512 matrix. You first split each token into 512/8=64 dimensional, 8 vectors in total, as I colored in the figure below. In other words, the input matrix is divided into 8 colored chunks, which are all 9\times 64 matrices, but each colored matrix expresses the same sentence. And you calculate self-attentions of the input sentence independently in the 8 heads, and you reweight the “values” according to the attentions/weights. After this, you stack the sum of the reweighted “values”  in each colored head, and you concatenate the stacked tokens of each colored head. The size of each colored chunk does not change even after reweighting the tokens. According to Ashish Vaswani, who invented Transformer model, each head compare “queries” and “keys” on each standard. If the a Transformer model has 4 layers with 8-head multi-head attention , at least its encoder has 4\times 8 = 32 heads, so the encoder learn the relations of tokens of the input on 32 different standards.

I think you now have rough insight into how you calculate multi-head attentions. In the next section I am going to explain the process of reweighting the tokens, that is, I am finally going to explain what those colorful lines in the head image of this article series are.

*Each head is randomly initialized, so they learn to compare tokens with different criteria. The standards might be straightforward like “what” or “who”, or maybe much more complicated. In attention mechanisms in deep learning, you do not need feature engineering for setting such standards.

2 Calculating attentions and reweighting “values”

If you have read the last article or if you understand attention mechanism to some extent, you should already know that attention mechanism calculates attentions, or relevance between “queries” and “keys.” In the last article, I showed the idea of weights as a histogram, and in that case the “query” was the hidden state of the decoder at every time step, whereas the “keys” were the outputs of the encoder. In this section, I am going to explain attention mechanism in a more abstract way, and we consider comparing more general “tokens”, rather than concrete outputs of certain networks. In this section each [ \cdots ] denotes a token, which is usually an embedding vector in practice.

Please remember this mantra of attention mechanism: “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” The figure below shows an overview of a case where “Michael” is a query. In this case you compare the query with the “keys”, that is, the input sentence “Anthony Hopkins admired Michael Bay as a great director.” and you get the histogram of attentions/weights. Importantly the sum of the weights 1. With the attentions you have just calculated, you can reweight the “values,” which also denote the same input sentence. After that you can finally take a summation of the reweighted values. And you use this summation.

*I have been repeating the phrase “reweighting ‘values’  with attentions,”  but you in practice calculate the sum of those reweighted “values.”

Assume that compared to the “query”  token “Michael”, the weights of the “key” tokens “Anthony”, “Hopkins”, “admired”, “Michael”, “Bay”, “as”, “a”, “great”, and “director.” are respectively 0.06, 0.09, 0.05, 0.25, 0.18, 0.06, 0.09, 0.06, 0.15. In this case the sum of the reweighted token is 0.06″Anthony” + 0.09″Hopkins” + 0.05″admired” + 0.25″Michael” + 0.18″Bay” + 0.06″as” + 0.09″a” + 0.06″great” 0.15″director.”, and this sum is the what wee actually use.

*Of course the tokens are embedding vectors in practice. You calculate the reweighted vector in actual implementation.

You repeat this process for all the “queries.”  As you can see in the figure below, you get summations of 9 pairs of reweighted “values” because you use every token of the input sentence “Anthony Hopkins admired Michael Bay as a great director.” as a “query.” You stack the sum of reweighted “values” like the matrix in purple in the figure below, and this is the output of a one head multi-head attention.

3 Scaled-dot product

This section is a only a matter of linear algebra. Maybe this is not even so sophisticated as linear algebra. You just have to do lots of Excel-like operations. A tutorial on Transformer by Jay Alammar is also a very nice study material to understand this topic with simpler examples. I tried my best so that you can clearly understand multi-head attention at a more mathematical level, and all you need to know in order to read this section is how to calculate products of matrices or vectors, which you would see in the first some pages of textbooks on linear algebra.

We have seen that in order to calculate multi-head attentions, we prepare 8 pairs of “queries”, “keys” , and “values”, which I showed in 8 different colors in the figure in the first section. We calculate attentions and reweight “values” independently in 8 different heads, and in each head the reweighted “values” are calculated with this very simple formula of scaled dot-product: Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) =softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})\boldsymbol{V}. Let’s take an example of calculating a scaled dot-product in the blue head.

At the left side of the figure below is a figure from the original paper on Transformer, which explains one-head of multi-head attention. If you have read through this article so far, the figure at the right side would be more straightforward to understand. You divide the input sentence into 8 chunks of matrices, and you independently put those chunks into eight head. In one head, you convert the input matrix by three different fully connected layers, which is “Linear” in the figure below, and prepare three matrices Q, K, V, which are “queries”, “keys”, and “values” respectively.

*Whichever color attention heads are in, the processes are all the same.

*You divide \frac{\boldsymbol{Q}} {\boldsymbol{K}^T} by \sqrt{d}_k in the formula. According to the original paper, it is known that re-scaling \frac{\boldsymbol{Q} }{\boldsymbol{K}^T} by \sqrt{d}_k is found to be effective. I am not going to discuss why in this article.

As you can see in the figure below, calculating Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) is virtually just multiplying three matrices with the same size (Only K is transposed though). The resulting 9\times 64 matrix is the output of the head.

softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}) is calculated like in the figure below. The softmax function regularize each row of the re-scaled product \frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}, and the resulting 9\times 9 matrix is a kind a heat map of self-attentions.

The process of comparing one “query” with “keys” is done with simple multiplication of a vector and a matrix, as you can see in the figure below. You can get a histogram of attentions for each query, and the resulting 9 dimensional vector is a list of attentions/weights, which is a list of blue circles in the figure below. That means, in Transformer model, you can compare a “query” and a “key” only by calculating an inner product. After re-scaling the vectors by dividing them with \sqrt{d_k} and regularizing them with a softmax function, you stack those vectors, and the stacked vectors is the heat map of attentions.

You can reweight “values” with the heat map of self-attentions, with simple multiplication. It would be more straightforward if you consider a transposed scaled dot-product \boldsymbol{V}^T \cdot softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})^T. This also should be easy to understand if you know basics of linear algebra.

One column of the resulting matrix (\boldsymbol{V}^T \cdot softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})^T) can be calculated with a simple multiplication of a matrix and a vector, as you can see in the figure below. This corresponds to the process or “taking a summation of reweighted ‘values’,” which I have been repeating. And I would like you to remember that you got those weights (blue) circles by comparing a “query” with “keys.”

Again and again, let’s repeat the mantra of attention mechanism together: “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” If you have been patient enough to follow my explanations, I bet you have got a clear view on how multi-head attention mechanism works.

We have been seeing the case of the blue head, but you can do exactly the same procedures in every head, at the same time, and this is what enables parallelization of multi-head attention mechanism. You concatenate the outputs of all the heads, and you put the concatenated matrix through a fully connected layers.

If you are reading this article from the beginning, I think this section is also showing the same idea which I have repeated, and I bet more or less you no have clearer views on how multi-head attention mechanism works. In the next section we are going to see how this is implemented.

4 Tensorflow implementation of multi-head attention

Let’s see how multi-head attention is implemented in the Tensorflow official tutorial. If you have read through this article so far, this should not be so difficult. I also added codes for displaying heat maps of self attentions. With the codes in this Github page, you can display self-attention heat maps for any input sentences in English.

The multi-head attention mechanism is implemented as below. If you understand Python codes and Tensorflow to some extent, I think this part is relatively easy.  The multi-head attention part is implemented as a class because you need to train weights of some fully connected layers. Whereas, scaled dot-product is just a function.

*I am going to explain the create_padding_mask() and create_look_ahead_mask() functions in upcoming articles. You do not need them this time.

Let’s see a case of using multi-head attention mechanism on a (1, 9, 512) sized input tensor, just as we have been considering in throughout this article. The first axis of (1, 9, 512) corresponds to the batch size, so this tensor is virtually a (9, 512) sized tensor, and this means the input is composed of 9 512-dimensional vectors. In the results below, you can see how the shape of input tensor changes after each procedure of calculating multi-head attention. Also you can see that the output of the multi-head attention is the same as the input, and you get a 9\times 9 matrix of attention heat maps of each attention head.

I guess the most complicated part of this implementation above is the split_head() function, especially if you do not understand tensor arithmetic. This part corresponds to splitting the input tensor to 8 different colored matrices as in one of the figures above. If you cannot understand what is going on in the function, I recommend you to prepare a sample tensor as below.

This is just a simple (1, 9, 512) sized tensor with sequential integer elements. The first row (1, 2, …., 512) corresponds to the first input token, and (4097, 4098, … , 4608) to the last one. You should try converting this sample tensor to see how multi-head attention is implemented. For example you can try the operations below.

These operations correspond to splitting the input into 8 heads, whose sizes are all (9, 64). And the second axis of the resulting (1, 8, 9, 64) tensor corresponds to the index of the heads. Thus sample_sentence[0][0] corresponds to the first head, the blue 9\times 64 matrix. Some Tensorflow functions enable linear calculations in each attention head, independently as in the codes below.

Very importantly, we have been only considering the cases of calculating self attentions, where all “queries”, “keys”, and “values” come from the same sentence in the same language. However, as I showed in the last article, usually “queries” are in a different language from “keys” and “values” in translation tasks, and “keys” and “values” are in the same language. And as you can imagine, usualy “queries” have different number of tokens from “keys” or “values.” You also need to understand this case, which is not calculating self-attentions. If you have followed this article so far, this case is not that hard to you. Let’s briefly see an example where the input sentence in the source language is composed 9 tokens, on the other hand the output is composed 12 tokens.

As I mentioned, one of the outputs of each multi-head attention class is 9\times 9 matrix of attention heat maps, which I displayed as a matrix composed of blue circles in the last section. The the implementation in the Tensorflow official tutorial, I have added codes to display actual heat maps of any input sentences in English.

*If you want to try displaying them by yourself, download or just copy and paste codes in this Github page. Please maker “datasets” directory in the same directory as the code. Please download “spa-eng.zip” from this page, and unzip it. After that please put “spa.txt” on the “datasets” directory. Also, please download the “checkpoints_en_es” folder from this link, and place the folder in the same directory as the file in the Github page. In the upcoming articles, you would need similar processes to run my codes.

After running codes in the Github page, you can display heat maps of self attentions. Let’s input the sentence “Anthony Hopkins admired Michael Bay as a great director.” You would get a heat maps like this.

In fact, my toy implementation cannot handle proper nouns such as “Anthony” or “Michael.” Then let’s consider a simple input sentence “He admired her as a great director.” In each layer, you respectively get 8 self-attention heat maps.

I think we can see some tendencies in those heat maps. The heat maps in the early layers, which are close to the input, are blurry. And the distributions of the heat maps come to concentrate more or less diagonally. At the end, presumably they learn to pay attention to the start and the end of sentences.

You have finally finished reading this article. Congratulations.

You should be proud of having been patient, and you passed the most tiresome part of learning Transformer model. You must be ready for making a toy English-German translator in the upcoming articles. Also I am sure you have understood that Michael Bay is a great director, no matter what people say.

*Hannibal Lecter, I mean Athony Hopkins, also wrote a letter to the staff of “Breaking Bad,” and he told them the tv show let him regain his passion. He is a kind of admiring around, and I am a little worried that he might be getting senile. He played a role of a father forgetting his daughter in his new film “The Father.” I must see it to check if that is really an acting, or not.

[References]

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need” (2017)

[2] “Transformer model for language understanding,” Tensorflow Core
https://www.tensorflow.org/overview

[3] “Neural machine translation with attention,” Tensorflow Core
https://www.tensorflow.org/tutorials/text/nmt_with_attention

[4] Jay Alammar, “The Illustrated Transformer,”
http://jalammar.github.io/illustrated-transformer/

[5] “Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention,” stanfordonline, (2019)
https://www.youtube.com/watch?v=5vcj8kSwBCY

[6]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 191-193

[7]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)
https://www.youtube.com/watch?v=XXtpJxZBa2c

[8]Rosemary Rossi, “Anthony Hopkins Compares ‘Genius’ Michael Bay to Spielberg, Scorsese,” yahoo! entertainment, (2017)
https://www.yahoo.com/entertainment/anthony-hopkins-transformers-director-michael-bay-guy-genius-010058439.html

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Data Science in Engineering Process - Product Lifecycle Management

How to develop digital products and solutions for industrial environments?

The Data Science and Engineering Process in PLM.

Huge opportunities for digital products are accompanied by huge risks

Digitalization is about to profoundly change the way we live and work. The increasing availability of data combined with growing storage capacities and computing power make it possible to create data-based products, services, and customer specific solutions to create insight with value for the business. Successful implementation requires systematic procedures for managing and analyzing data, but today such procedures are not covered in the PLM processes.

From our experience in industrial settings, organizations start processing the data that happens to be available. This data often does not fully cover the situation of interest, typically has poor quality, and in turn the results of data analysis are misleading. In industrial environments, the reliability and accuracy of results are crucial. Therefore, an enormous responsibility comes with the development of digital products and solutions. Unless there are systematic procedures in place to guide data management and data analysis in the development lifecycle, many promising digital products will not meet expectations.

Various methodologies exist but no comprehensive framework

Over the last decades, various methodologies focusing on specific aspects of how to deal with data were promoted across industries and academia. Examples are Six Sigma, CRISP-DM, JDM standard, DMM model, and KDD process. These methodologies aim at introducing principles for systematic data management and data analysis. Each methodology makes an important contribution to the overall picture of how to deal with data, but none provides a comprehensive framework covering all the necessary tasks and activities for the development of digital products. We should take these approaches as valuable input and integrate their strengths into a comprehensive Data Science and Engineering framework.

In fact, we believe it is time to establish an independent discipline to address the specific challenges of developing digital products, services and customer specific solutions. We need the same kind of professionalism in dealing with data that has been achieved in the established branches of engineering.

Data Science and Engineering as new discipline

Whereas the implementation of software algorithms is adequately guided by software engineering practices, there is currently no established engineering discipline covering the important tasks that focus on the data and how to develop causal models that capture the real world. We believe the development of industrial grade digital products and services requires an additional process area comprising best practices for data management and data analysis. This process area addresses the specific roles, skills, tasks, methods, tools, and management that are needed to succeed.

Figure: Data Science and Engineering as new engineering discipline

More than in other engineering disciplines, the outputs of Data Science and Engineering are created in repetitions of tasks in iterative cycles. The tasks are therefore organized into workflows with distinct objectives that clearly overlap along the phases of the PLM process.

Feasibility of Objectives
  Understand the business situation, confirm the feasibility of the product idea, clarify the data infrastructure needs, and create transparency on opportunities and risks related to the product idea from the data perspective.
Domain Understanding
  Establish an understanding of the causal context of the application domain, identify the influencing factors with impact on the outcomes in the operational scenarios where the digital product or service is going to be used.
Data Management
  Develop the data management strategy, define policies on data lifecycle management, design the specific solution architecture, and validate the technical solution after implementation.
Data Collection
  Define, implement and execute operational procedures for selecting, pre-processing, and transforming data as basis for further analysis. Ensure data quality by performing measurement system analysis and data integrity checks.
Modeling
  Select suitable modeling techniques and create a calibrated prediction model, which includes fitting the parameters or training the model and verifying the accuracy and precision of the prediction model.
Insight Provision
  Incorporate the prediction model into a digital product or solution, provide suitable visualizations to address the information needs, evaluate the accuracy of the prediction results, and establish feedback loops.

Real business value will be generated only if the prediction model at the core of the digital product reliably and accurately reflects the real world, and the results allow to derive not only correct but also helpful conclusions. Now is the time to embrace the unique chances by establishing professionalism in data science and engineering.

Authors

Peter Louis                               

Peter Louis is working at Siemens Advanta Consulting as Senior Key Expert. He has 25 years’ experience in Project Management, Quality Management, Software Engineering, Statistical Process Control, and various process frameworks (Lean, Agile, CMMI). He is an expert on SPC, KPI systems, data analytics, prediction modelling, and Six Sigma Black Belt.


Ralf Russ    

Ralf Russ works as a Principal Key Expert at Siemens Advanta Consulting. He has more than two decades experience rolling out frameworks for development of industrial-grade high quality products, services, and solutions. He is Six Sigma Master Black Belt and passionate about process transparency, optimization, anomaly detection, and prediction modelling using statistics and data analytics.4


How Data Science Can Benefit Nonprofits

Image Source: https://pixabay.com/vectors/pixel-cells-pixel-creative-commons-3704068/

Data science is the poster child of the 21st century and for good reason. Data-based decisions have streamlined, automated, and made businesses more efficient than ever before, and there are practically no industries that haven’t recognized its immense potential. But when you think of data science application, sectors like marketing, finance, technology, SMEs, and even education are the first that come to mind. There’s one more sector that’s proving to be an untapped market for data—the social sector. At first, one might question why non-profit organizations even need complex data applications, but that’s just it—they don’t. What they really need is data tools that are simple and reliable, because if anything, accountability is the most important component of the way non-profits run.

Challenges for Non-profits and Data Science

If you’re wondering why many non-profits haven’t already hopped onto the data bandwagon, its because in most cases they lack one big thing—quality data.

One reason is that effective data application requires clean data, and heaps of it, something non-profits struggle with. Most don’t sell products or services, and their success is reliant on broad, long-term (sometimes decades) results and changes, which means their outcomes are highly unmeasurable. Metrics and data seem out of place when appealing to donors, who are persuaded more by emotional campaigns. Data collection is also rare, perhaps only being recorded when someone signs up to the program or leaves, and hardly any tracking in between. The result is data that’s too little and unreliable to make effective change.

Perhaps the most important phase, data collection relies heavily on accurate and organized processes. For non-profits that don’t have the resources for accurate and manual record-keeping, clean, and quality data collection is a huge pain point. However, that is an issue now easily avoidable. For instance, avoiding duplicate files, adopting record-keeping methods like off-site and cloud storage, digital retention, and of course back-up plans—are all processes that could save non-profits time, effort, and risk. On the other hand, poor record management has its consequences, namely on things like fund allocation, payroll, budgeting, and taxes. It could lead to financial risk, legal trouble, and data loss — all added worries for already under-resourced non-profit organizations.

But now, as non-governmental organizations (NGOs) and non-profits catch up and invest more in data collection processes, there’s room for data science to make its impact. A growing global movement, ‘Data For Good’ represents individuals, companies, and organizations volunteering to create or use data to help further social causes ad support non-profit organizations. This ‘Data For Good’ movement includes tools for data work that are donated or subsidized, as well as educational programs that serve marginalized communities. As the movement gains momentum, non-profits are seeing data seep into their structures and turn processes around.

How Can Data Do Social Good?

With data science set to take the non-profit sector by storm, let’s look at some of the ways data can do social good:

  1. Improving communication with donors: Knowing when to reach out to your donors is key. In between a meeting? You’re unlikely to see much enthusiasm. Once they’re at home with their families? You may see wonderful results, as pointed out in this Forbes article. The article opines that data can help non-profits understand and communicate with their donors better.
  2. Donor targetting: Cold calls are a hit and miss, and with data on their side, non-profits can discover and define their ideal donor and adapt their messaging to reach out to them for better results.
  3. Improving cost efficiency: Costs are a major priority for non-profits and every penny counts. Data can help decrease costs and streamline financial planning
  4. Increasing new member sign-ups and renewals: Through data, non-profits can reach out to the right people they want on-board, strengthen recruitment processes and keep track of volunteers reaching out to them for future events or recruitment drives.
  5. Modeling and forecasting performance: With predictive modeling tools, non-profits can make data-based decisions on where they should allocate time and money for the future, rather than go on gut instinct.
  6. Measuring return on investment: For a long time, the outcomes of social campaigns have been perceived as intangible and immeasurable—it’s hard to measure empowerment or change. With data, non-profits can measure everything from the amount a fundraiser raised against a goal, the cost of every lead in a lead generation campaign, etc
  7. Streamlining operations: Finally, non-profits can use data tools to streamline their business processes internally and invest their efforts into resources that need it.

It’s true, measuring good and having social change down to a science is a long way off — but data application is a leap forward into a more efficient future for the social sector. With mission-aligned processes, data-driven non-profits can realize their potential, redirect their focus from trivial tasks, and onto the bigger picture to drive true change.

Introduction to Recommendation Engines

This is the second article of article series Getting started with the top eCommerce use cases. If you are interested in reading the first article you can find it here.

What are Recommendation Engines?

Recommendation engines are the automated systems which helps select out similar things whenever a user selects something online. Be it Netflix, Amazon, Spotify, Facebook or YouTube etc. All of these companies are now using some sort of recommendation engine to improve their user experience. A recommendation engine not only helps to predict if a user prefers an item or not but also helps to increase sales, ,helps to understand customer behavior, increase number of registered users and helps a user to do better time management. For instance Netflix will suggest what movie you would want to watch or Amazon will suggest what kind of other products you might want to buy. All the mentioned platforms operates using the same basic algorithm in the background and in this article we are going to discuss the idea behind it.

What are the techniques?

There are two fundamental algorithms that comes into play when there’s a need to generate recommendations. In next section these techniques are discussed in detail.

Content-Based Filtering

The idea behind content based filtering is to analyse a set of features which will provide a similarity between items themselves i.e. between two movies, two products or two songs etc. These set of features once compared gives a similarity score at the end which can be used as a reference for the recommendations.

There are several steps involved to get to this similarity score and the first step is to construct a profile for each item by representing some of the important features of that item. In other terms, this steps requires to define a set of characteristics that are discovered easily. For instance, consider that there’s an article which a user has already read and once you know that this user likes this article you may want to show him recommendations of similar articles. Now, using content based filtering technique you could find the similar articles. The easiest way to do that is to set some features for this article like publisher, genre, author etc. Based on these features similar articles can be recommended to the user (as illustrated in Figure 1). There are three main similarity measures one could use to find the similar articles mentioned below.

 

Figure 1: Content-Based Filtering

 

 

Minkowski distance

Minkowski distance between two variables can be calculated as:

(x,y)= (\sum_{i=1}^{n}{|X_{i} - Y_{i}|^{p}})^{1/p}

 

Cosine Similarity

Cosine similarity between two variables can be calculated as :

  \mbox{Cosine Similarity} = \frac{\sum_{i=1}^{n}{x_{i} y_{i}}} {\sqrt{\sum_{i=1}^{n}{x_{i}^{2}}} \sqrt{\sum_{i=1}^{n}{y_{i}^{2}}}} \

 

Jaccard Similarity

 

  J(X,Y) = |X ∩ Y| / |X ∪ Y|

 

These measures can be used to create a matrix which will give you the similarity between each movie and then a function can be defined to return the top 10 similar articles.

 

Collaborative filtering

This filtering method focuses on finding how similar two users or two products are by analyzing user behavior or preferences rather than focusing on the content of the items. For instance consider that there are three users A,B and C.  We want to recommend some movies to user A, our first approach would be to find similar users and compare which movies user A has not yet watched and recommend those movies to user A.  This approach where we try to find similar users is called as User-User Collaborative Filtering.  

The other approach that could be used here is when you try to find similar movies based on the ratings given by others, this type is called as Item-Item Collaborative Filtering. The research shows that item-item collaborative filtering works better than user-user collaborative filtering as user behavior is really dynamic and changes over time. Also, there are a lot more users and increasing everyday but on the other side item characteristics remains the same. To calculate the similarities we can use Cosine distance.

 

Figure 2: Collaborative Filtering

 

Recently some companies have started to take advantage of both content based and collaborative filtering techniques to make a hybrid recommendation engine. The results from both models are combined into one hybrid model which provides more accurate recommendations. Five steps are involved to make a recommendation engine work which are collection of data, storing of data, analyzing the data, filtering the data and providing recommendations. There are a lot of attributes that are involved in order to collect user data including browsing history, page views, search logs, order history, marketing channel touch points etc. which requires a strong data architecture.  The collection of data is pretty straightforward but it can be overwhelming to analyze this amount of data. Storing this data could get tricky on the other hand as you need a scalable database for this kind of data. With the rise of graph databases this area is also improving for many use cases including recommendation engines. Graph databases like Neo4j can also help to analyze and find similar users and relationship among them. Analyzing the data can be carried in different ways, depending on how strong and scalable your architecture you can run real time, batch or near real time analysis. The fourth step involves the filtering of the data and here you can use any of the above mentioned approach to find similarities to finally provide the recommendations.

Having a good recommendation engine can be time consuming initially but it is definitely beneficial in the longer run. It not only helps to generate revenue but also helps to to improve your product catalog and customer service.

Multi-touch attribution: A data-driven approach

This is the first article of article series Getting started with the top eCommerce use cases.

What is Multi-touch attribution?

Customers shopping behavior has changed drastically when it comes to online shopping, as nowadays, customer likes to do a thorough market research about a product before making a purchase. This makes it really hard for marketers to correctly determine the contribution for each marketing channel to which a customer was exposed to. The path a customer takes from his first search to the purchase is known as a Customer Journey and this path consists of multiple marketing channels or touchpoints. Therefore, it is highly important to distribute the budget between these channels to maximize return. This problem is known as multi-touch attribution problem and the right attribution model helps to steer the marketing budget efficiently. Multi-touch attribution problem is well known among marketers. You might be thinking that if this is a well known problem then there must be an algorithm out there to deal with this. Well, there are some traditional models  but every model has its own limitation which will be discussed in the next section.

Traditional attribution models

Most of the eCommerce companies have a performance marketing department to make sure that the marketing budget is spent in an agile way. There are multiple heuristics attribution models pre-existing in google analytics however there are several issues with each one of them. These models are:

First touch attribution model

100% credit is given to the first channel as it is considered that the first marketing channel was responsible for the purchase.

Figure 1: First touch attribution model

Last touch attribution model

100% credit is given to the last channel as it is considered that the first marketing channel was responsible for the purchase.

Figure 2: Last touch attribution model

Linear-touch attribution model

In this attribution model, equal credit is given to all the marketing channels present in customer journey as it is considered that each channel is equally responsible for the purchase.

Figure 3: Linear attribution model

U-shaped or Bath tub attribution model

This is most common in eCommerce companies, this model assigns 40% to first and last touch and 20% is equally distributed among the rest.

Figure 4: Bathtub or U-shape attribution model

Data driven attribution models

Traditional attribution models follows somewhat a naive approach to assign credit to one or all the marketing channels involved. As it is not so easy for all the companies to take one of these models and implement it. There are a lot of challenges that comes with multi-touch attribution problem like customer journey duration, overestimation of branded channels, vouchers and cross-platform issue, etc.

Switching from traditional models to data-driven models gives us more flexibility and more insights as the major part here is defining some rules to prepare the data that fits your business. These rules can be defined by performing an ad hoc analysis of customer journeys. In the next section, I will discuss about Markov chain concept as an attribution model.

Markov chains

Markov chains concepts revolves around probability. For attribution problem, every customer journey can be seen as a chain(set of marketing channels) which will compute a markov graph as illustrated in figure 5. Every channel here is represented as a vertex and the edges represent the probability of hopping from one channel to another. There will be an another detailed article, explaining the concept behind different data-driven attribution models and how to apply them.

Figure 5: Markov chain example

Challenges during the Implementation

Transitioning from a traditional attribution models to a data-driven one, may sound exciting but the implementation is rather challenging as there are several issues which can not be resolved just by changing the type of model. Before its implementation, the marketers should perform a customer journey analysis to gain some insights about their customers and try to find out/perform:

  1. Length of customer journey.
  2. On an average how many branded and non branded channels (distinct and non-distinct) in a typical customer journey?
  3. Identify most upper funnel and lower funnel channels.
  4. Voucher analysis: within branded and non-branded channels.

When you are done with the analysis and able to answer all of the above questions, the next step would be to define some rules in order to handle the user data according to your business needs. Some of the issues during the implementation are discussed below along with their solution.

Customer journey duration

Assuming that you are a retailer, let’s try to understand this issue with an example. In May 2016, your company started a Fb advertising campaign for a particular product category which “attracted” a lot of customers including Chris. He saw your Fb ad while working in the office and clicked on it, which took him to your website. As soon as he registered on your website, his boss called him (probably because he was on Fb while working), he closed everything and went for the meeting. After coming back, he started working and completely forgot about your ad or products. After a few days, he received an email with some offers of your products which also he ignored until he saw an ad again on TV in Jan 2019 (after 3 years). At this moment, he started doing his research about your products and finally bought one of your products from some Instagram campaign. It took Chris almost 3 years to make his first purchase.

Figure 6: Chris journey

Now, take a minute and think, if you analyse the entire journey of customers like Chris, you would realize that you are still assigning some of the credit to the touchpoints that happened 3 years ago. This can be solved by using an attribution window. Figure 6 illustrates that 83% of the customers are making a purchase within 30 days which means the attribution window here could be 30 days. In simple words, it is safe to remove the touchpoints that happens after 30 days of purchase. This parameter can also be changed to 45 days or 60 days, depending on the use case.

Figure 7: Length of customer journey

Removal of direct marketing channel

A well known issue that every marketing analyst is aware of is, customers who are already aware of the brand usually comes to the website directly. This leads to overestimation of direct channel and branded channels start getting more credit. In this case, you can set a threshold (say 7 days) and remove these branded channels from customer journey.

Figure 8: Removal of branded channels

Cross platform problem

If some of your customers are using different devices to explore your products and you are not able to track them then it will make retargeting really difficult. In a perfect world these customers belong to same journey and if these can’t be combined then, except one, other paths would be considered as “non-converting path”. For attribution problem device could be thought of as a touchpoint to include in the path but to be able to track these customers across all devices would still be challenging. A brief introduction to deterministic and probabilistic ways of cross device tracking can be found here.

Figure 9: Cross platform clash

How to account for Vouchers?

To better account for vouchers, it can be added as a ‘dummy’ touchpoint of the type of voucher (CRM,Social media, Affiliate or Pricing etc.) used. In our case, we tried to add these vouchers as first touchpoint and also as a last touchpoint but no significant difference was found. Also, if the marketing channel of which the voucher was used was already in the path, the dummy touchpoint was not added.

Figure 10: Addition of Voucher as a touchpoint

Let me know in comments if you would like to add something or if you have a different perspective about this use case.

Getting started with the top eCommerce use cases

Nowadays, almost all the projects in eCommerce companies are data-dependent and everyone wants to leverage data science techniques to mine as much information as they can from that data. From tracking their customer’s shopping behavior to recommending them what to buy, from finding new leads for their market to calculating their lifetime value, from improving customer experience to increase their profitability. When we navigate through any website, we leave our traces and companies track these touchpoints to get insights about how we behave online. Companies sometimes have different landing pages based on the gender of the user.

This post will be focused on some of the use cases in marketing which are gaining attention over the past few years. I have been associated with different eCommerce companies as a data science consultant.

Upcoming months has a lot to offer as I will be writing blogs about the following use cases:

  1. Multi-touch attribution: A data-driven approach
  2. Introduction to Recommendation engines
  3. How Important is Customer Lifetime Value?
  4. Customer Segmentation
  5. Dynamic Pricing

 

If you are interested in reading the success story for the Multi-touch attribution project you can find it here.

Erstellen und benutzen einer Geodatenbank

In diesem Artikel soll es im Gegensatz zum vorherigen Artikel Alles über Geodaten weniger darum gehen, was man denn alles mit Geodaten machen kann, dafür aber mehr darum wie man dies anstellt. Es wird gezeigt, wie man aus dem öffentlich verfügbaren Datensatz des OpenStreetMap-Projekts eine Geodatenbank erstellt und einige Beispiele dafür gegeben, wie man diese abfragen und benutzen kann.

Wahl der Datenbank

Prinzipiell gibt es zwei große “geo-kompatible” OpenSource-Datenbanken bzw. “Datenbank-AddOn’s”: Spatialite, welches auf SQLite aufbaut, und PostGIS, das PostgreSQL verwendet.

PostGIS bietet zum Teil eine einfachere Syntax, welche manchmal weniger Tipparbeit verursacht. So kann man zum Beispiel um die Entfernung zwischen zwei Orten zu ermitteln einfach schreiben:

während dies in Spatialite “nur” mit einer normalen Funktion möglich ist:

Trotztdem wird in diesem Artikel Spatialite (also SQLite) verwendet, da dessen Einrichtung deutlich einfacher ist (schließlich sollen interessierte sich alle Ergebnisse des Artikels problemlos nachbauen können, ohne hierfür einen eigenen Datenbankserver aufsetzen zu müssen).

Der Hauptunterschied zwischen PostgreSQL und SQLite (eigentlich der Unterschied zwischen SQLite und den meissten anderen Datenbanken) ist, dass für PostgreSQL im Hintergrund ein Server laufen muss, an welchen die entsprechenden Queries gesendet werden, während SQLite ein “normales” Programm (also kein Client-Server-System) ist welches die Queries selber auswertet.

Hierdurch fällt beim Aufsetzen der Datenbank eine ganze Menge an Konfigurationsarbeit weg: Welche Benutzer gibt es bzw. akzeptiert der Server? Welcher Benutzer bekommt welche Rechte? Über welche Verbindung wird auf den Server zugegriffen? Wie wird die Sicherheit dieser Verbindung sichergestellt? …

Während all dies bei SQLite (und damit auch Spatialite) wegfällt und die Einrichtung der Datenbank eigentlich nur “installieren und fertig” ist, muss auf der anderen Seite aber auch gesagt werden dass SQLite nicht gut für Szenarien geeignet ist, in welchen viele Benutzer gleichzeitig (insbesondere schreibenden) Zugriff auf die Datenbank benötigen.

Benötigte Software und ein Beispieldatensatz

Was wird für diesen Artikel an Software benötigt?

SQLite3 als Datenbank

libspatialite als “Geoplugin” für SQLite

spatialite-tools zum erstellen der Datenbank aus dem OpenStreetMaps (*.osm.pbf) Format

python3, die beiden GeoModule spatialite, folium und cartopy, sowie die Module pandas und matplotlib (letztere gehören im Bereich der Datenauswertung mit Python sowieso zum Standart). Für pandas gibt es noch die Erweiterung geopandas sowie eine praktisch unüberschaubare Anzahl weiterer geographischer Module aber bereits mit den genannten lassen sich eine Menge interessanter Dinge herausfinden.

– und natürlich einen Geodatensatz: Zum Beispiel sind aus dem OpenStreetMap-Projekt extrahierte Datensätze hier zu finden.

Es ist ratsam, sich hier erst einmal einen kleinen Datensatz herunterzuladen (wie zum Beispiel einen der Stadtstaaten Bremen, Hamburg oder Berlin). Zum einen dauert die Konvertierung des .osm.pbf-Formats in eine Spatialite-Datenbank bei größeren Datensätzen unter Umständen sehr lange, zum anderen ist die fertige Datenbank um ein vielfaches größer als die stark gepackte Originaldatei (für “nur” Deutschland ist die fertige Datenbank bereits ca. 30 GB groß und man lässt die Konvertierung (zumindest am eigenen Laptop) am besten über Nacht laufen – willkommen im Bereich “BigData”).

Erstellen eine Geodatenbank aus OpenStreetMap-Daten

Nach dem Herunterladen eines Datensatzes der Wahl im *.osm.pbf-Format kann hieraus recht einfach mit folgendem Befehl aus dem Paket spatialite-tools die Datenbank erstellt werden:

Erkunden der erstellten Geodatenbank

Nach Ausführen des obigen Befehls sollte nun eine Datei mit dem gewählten Namen (im Beispiel bremen-latest.sqlite) im aktuellen Ordner vorhanden sein – dies ist bereits die fertige Datenbank. Zunächst sollte man mit dieser Datenbank erst einmal dasselbe machen, wie mit jeder anderen Datenbank auch: Sich erst einmal eine Weile hinsetzen und schauen was alles an Daten in der Datenbank vorhanden und vor allem wo diese Daten in der erstellten Tabellenstruktur zu finden sind. Auch wenn dieses Umschauen prinzipiell auch vollständig über die Shell oder in Python möglich ist, sind hier Programme mit graphischer Benutzeroberfläche (z. B. spatialite-gui oder QGIS) sehr hilfreich und sparen nicht nur eine Menge Zeit sondern vor allem auch Tipparbeit. Wer dies tut, wird feststellen, dass sich in der generierten Datenbank einige dutzend Tabellen mit Namen wie pt_addresses, ln_highway und pg_boundary befinden.

Die Benennung der Tabellen folgt dem Prinzip, dass pt_*-Tabellen Punkte im Geokoordinatensystem wie z. B. Adressen, Shops, Bäckereien und ähnliches enthalten. ln_*-Tabellen enthalten hingegen geographische Entitäten, welche sich als Linien darstellen lassen, wie beispielsweise Straßen, Hochspannungsleitungen, Schienen, ect. Zuletzt gibt es die pg_*-Tabellen welche Polygone – also Flächen einer bestimmten Form enthalten. Dazu zählen Landesgrenzen, Bundesländer, Inseln, Postleitzahlengebiete, Landnutzung, aber auch Gebäude, da auch diese jeweils eine Grundfläche besitzen. In dem genannten Datensatz sind die Grundflächen von Gebäuden – zumindest in Europa – nahezu vollständig. Aber auch der Rest der Welt ist für ein “Wikipedia der Kartographie” insbesondere in halbwegs besiedelten Gebieten bemerkenswert gut erfasst, auch wenn nicht unbedingt davon ausgegangen werden kann, dass abgelegenere Gegenden (z. B. irgendwo auf dem Land in Südamerika) jedes Gebäude eingezeichnet ist.

Verwenden der Erstellten Datenbank

Auf diese Datenbank kann nun entweder direkt aus der Shell über den Befehl

zugegriffen werden oder man nutzt das gleichnamige Python-Paket:

Nach Eingabe der obigen Befehle in eine Python-Konsole, ein Jupyter-Notebook oder ein anderes Programm, welches die Anbindung an den Python-Interpreter ermöglicht, können die von der Datenbank ausgegebenen Ergebnisse nun direkt in ein Pandas Data Frame hineingeladen und verwendet/ausgewertet/analysiert werden.

Im Grunde wird hierfür “normales SQL” verwendet, wie in anderen Datenbanken auch. Der folgende Beispiel gibt einfach die fünf ersten von der Datenbank gefundenen Adressen aus der Tabelle pt_addresses aus:

Link zur Ausgabe

Es wird dem Leser sicherlich aufgefallen sein, dass die Spalte “Geometry” (zumindest für das menschliche Auge) nicht besonders ansprechend sowie auch nicht informativ aussieht: Der Grund hierfür ist, dass diese Spalte die entsprechende Position im geographischen Koordinatensystem aus Gründen wie dem deutlich kleineren Speicherplatzbedarf sowie der damit einhergehenden Optimierung der Geschwindigkeit der Datenbank selber, in binärer Form gespeichert und ohne weitere Verarbeitung auch als solche ausgegeben wird.

Glücklicherweise stellt spatialite eine ganze Reihe von Funktionen zur Verarbeitung dieser geographischen Informationen bereit, von denen im folgenden einige beispielsweise vorgestellt werden:

Für einzelne Punkte im Koordinatensystem gibt es beispielsweise die Funktionen X(geometry) und Y(geometry), welche aus diesem “binären Wirrwarr” den Längen- bzw. Breitengrad des jeweiligen Punktes als lesbare Zahlen ausgibt.

Ändert man also das obige Query nun entsprechend ab, erhält man als Ausgabe folgendes Ergebnis in welchem die Geometry-Spalte der ausgegebenen Adressen in den zwei neuen Spalten Longitude und Latitude in lesbarer Form zu finden ist:

Link zur Tabelle

Eine weitere häufig verwendete Funktion von Spatialite ist die Distance-Funktion, welche die Distanz zwischen zwei Orten berechnet.

Das folgende Beispiel sucht in der Datenbank die 10 nächstgelegenen Bäckereien zu einer frei wählbaren Position aus der Datenbank und listet diese nach zunehmender Entfernung auf (Achtung – die frei wählbare Position im Beispiel liegt in München, wer die selbe Position z. B. mit dem Bremen-Datensatz verwendet, wird vermutlich etwas weiter laufen müssen…):

Link zur Ausgabe

Ein Anwendungsfall für eine solche Liste können zum Beispiel Programme/Apps wie maps.me oder Google-Maps sein, in denen User nach Bäckereien, Geldautomaten, Supermärkten oder Apotheken “in der Nähe” suchen können sollen.

Diese Liste enthält nun alle Informationen die grundsätzlich gebraucht werden, ist soweit auch informativ und wird in den meißten Fällen der Datenauswertung auch genau so gebraucht, jedoch ist diese für das Auge nicht besonders ansprechend.

Viel besser wäre es doch, die gefundenen Positionen auf einer interaktiven Karte einzuzeichnen:

Was kann man sonst interessantes mit der erstellten Datenbank und etwas Python machen? Wer in Deutschland ein wenig herumgekommen ist, dem ist eventuell aufgefallen, dass sich die Endungen von Ortsnamen stark unterscheiden: Um München gibt es Stadteile und Dörfer namens Garching, Freising, Aubing, ect., rund um Stuttgart enden alle möglichen Namen auf “ingen” (Plieningen, Vaihningen, Echterdingen …) und in Berlin gibt es Orte wie Pankow, Virchow sowie eine bunte Auswahl weiterer *ow’s.

Das folgende Query spuckt gibt alle “village’s”, “town’s” und “city’s” aus der Tabelle pt_place, also Dörfer und Städte, aus:

Link zur Ausgabe

Graphisch mit matplotlib und cartopy in ein Koordinatensystem eingetragen sieht diese Verteilung folgendermassen aus:

Die Grafik zeigt, dass stark unterschiedliche Vorkommen der verschiedenen Ortsendungen in Deutschland (Clustering). Über das genaue Zustandekommen dieser Verteilung kann ich hier nur spekulieren, jedoch wird diese vermutlich ähnlichen Prozessen unterliegen wie beispielsweise die Entwicklung von Dialekten.

Wer sich die Karte etwas genauer anschaut wird merken, dass die eingezeichneten Landesgrenzen und Küstenlinien nicht besonders genau sind. Hieran wird ein interessanter Effekt von häufig verwendeten geographischen Entitäten, nämlich Linien und Polygonen deutlich. Im Beispiel werden durch die beiden Zeilen

die bereits im Modul cartopy hinterlegten Daten verwendet. Genaue Verläufe von Küstenlinien und Landesgrenzen benötigen mit wachsender Genauigkeit hingegen sehr viel Speicherplatz, da mehr und mehr zu speichernde Punkte benötigt werden (genaueres siehe hier).

Schlussfolgerung

Man kann also bereits mit einigen Grundmodulen und öffentlich verfügbaren Datensätzen eine ganze Menge im Bereich der Geodaten erkunden und entdecken. Gleichzeitig steht, insbesondere für spezielle Probleme, eine große Bandbreite weiterer Software zur Verfügung, für welche dieser Artikel zwar einen Grundsätzlichen Einstieg geben kann, die jedoch den Rahmen dieses Artikels sprengen würden.