Tag Archive for: Data Mining

Monitoring of Jobskills with Data Engineering & AI

On own account, we from DATANOMIQ have created a web application that monitors data about job postings related to Data & AI from multiple sources (Indeed.com, Google Jobs, Stepstone.de and more).

The data is obtained from the Internet via APIs and web scraping, and the job titles and the skills listed in them are identified and extracted from them using Natural Language Processing (NLP) or more specific from Named-Entity Recognition (NER).

The skill clusters are formed via the discipline of Topic Modelling, a method from unsupervised machine learning, which show the differences in the distribution of requirements between them.

The whole web app is hosted and deployed on the Microsoft Azure Cloud via CI/CD and Infrastructure as Code (IaC).

The presentation is currently limited to the current situation on the labor market. However, we collect these over time and will make trends secure, for example how the demand for Python, SQL or specific tools such as dbt or Power BI changes.

Why we did it? It is a nice show-case many people are interested in. Over the time, it will provides you the answer on your questions related to which tool to learn! For DATANOMIQ this is a show-case of the coming Data as a Service (DaaS) Business.

Praxisbeispiel: Data Science im Banking

Wie sich mit Data Science die Profitabilität des Kreditkartengeschäfts einer Bank nachhaltig steigern lässt.

Die Fragestellung

Das Kreditkartengeschäft einer Bank brachte nicht die erhofften Gewinne ein, weshalb die Pricing-Strategie dieses Geschäftszweiges optimiert werden sollte. Hierbei sollte allerdings unbedingt vermieden werden, dass Kund:innen aufgrund erhöhter Zinskosten abspringen.

Die Frage, die sich hieraus ergab, lautete: Welche der Kund:innen würden höhere Zinskosten akzeptieren und welche würden bei einer Erhöhung der Zinsen ihre Kreditkarte kündigen? Um Kündigungen zu vermeiden, sollten deshalb zunächst eindeutige Kundensegmente identifiziert werden. Das Ziel war weiterhin, den weniger preissensitiven Kund:innen neue, lukrativere Kreditprodukte anzubieten, ohne gleichzeitig die Loyalität der Kund:innen zu gefährden.

Das Vorgehen

Um die verschiedenen Kundengruppen zu identifizieren, sollten die Kund:innen mithilfe einer Clustering-Analyse in klar voneinander abgegrenzte Segmente eingeteilt werden. Bei einer Clustering-Analyse handelt es sich um ein maschinelles Lernverfahren, bei dem Datenpunkte, in diesem Fall also Kund:innen zu Clustern oder Segmenten zusammengefasst werden. Bei einer solchen Analyse werden jene Kund:innen zu Clustern zusammengefasst, die sich in vielen Eigenschaften ähneln.

Der Vorteil an diesem Vorgehen ist, dass bei einer Clustering-Analyse eine Vielzahl an Eigenschaften gleichzeitig betrachtet werden kann. Außerdem können die erstellten Segmente dynamisch angepasst werden, wenn neue Daten in die Analyse eingehen. Zudem bietet ein Clustering-Modell die Möglichkeit, neue Kunden zu bewerten und einem bestehenden Cluster zuzuordnen, sofern die entsprechenden Daten über sie vorliegen.

Kunden segmentieren

Die Bank verfügte über vielfältige Daten den Kund:innen. Dazu gehörten persönliche Informationen wie Alter, Geschlecht, Bonität, Anzahl und Art der genutzten Kreditprodukte, Anzahl und Art der mit der Kreditkarte getätigten Transaktionen, aber auch Informationen zur bisherigen Beziehung zwischen Kund:in und Bank, wie beispielsweise Kontaktaufnahmen mit dem Kundenservice, Beschwerden, Net Promoter Score u.s.w.

Nachdem die Kund:innen anhand all dieser Eigenschaften einer Clustering-Analyse unterzogen worden waren, konnten verschiedene Gruppen identifiziert werden. Ein Vergleich dieser Gruppen untereinander ergab, dass es Kund:innen gibt, für die der Umfang der gebotenen Leistungen der Bank wichtiger war als der Zinssatz, also der Preis dieser Leistungen. Diese Kund:innen waren entsprechend als weniger preissensitiv bezüglich der Zinskosten einzuschätzen. In einem weiteren Segment wurden Kunden identifiziert, die eine Steigerung des Zinssatzes akzeptieren würden, weil sie die Kreditkarte sehr häufig verwendeten.

Durch die Bestimmung dieser wenig preissensitiven Cluster war die Bank zunächst in der Lage, diesen Kund:innen neue und lukrativere Kreditprodukte anzubieten.

Kundenloyalität messen

Darüber hinaus war der Bank wichtig, auch die Kundenzufriedenheit und -loyalität genauer zu beobachten, um Abwanderungen zu vermeiden.

Eine Möglichkeit, die Zufriedenheit und Loyalität von Kund:innen einzuschätzen besteht darin, ihre Sprache zu untersuchen, wenn sie im Austausch mit dem Kundenservice stehen. Aufgrund ihrer Wortwahl – ob mündlich oder schriftlich – können KI-Technologien den Emotionszustand der Kund:innen bestimmen. Positive Emotionen können hierbei allgemein als Zeichen der Loyalität und Zufriedenheit gedeutet werden, wohingegen negative Emotionen vor allem in Beschwerden oder schlechten Bewertungen vorkommen, die einen Kundenverlust zur Folge haben können. Das Ziel der Bank war es, Anfragen mit negativen Emotionen, also wahrscheinlich Beschwerden oder negative Bewertungen schneller zu erkennen, um diese priorisiert beantworten zu können und so einen drohenden Kundenverlust zu vermeiden.

In der Sprache ausgedrückte positive oder negative Emotionen können mit einer sogenannten Sentiment Analysis untersucht werden, wobei die Sprache der Kunden – ob schriftlich oder mündlich – mit KI-Technologien untersucht wird. Dafür kommt Natural Language Processing – eine Reihe der KI-Technologien zur Analyse menschlicher Sprache – zur Anwendung. Anhand dieser KI-Technologie wurden eingehende Nachrichten und Bewertungen einer automatischen Voruntersuchung unterzogen. Nachrichten und Bewertungen, die mit negativen Emotionen assoziiert wurden, wurden priorisiert bearbeitet. Durch die priorisierte Bearbeitung konnte eine 50%ige Reduktion der Antwortzeiten auf Beschwerden erzielt werden.

Die Ergebnisse

In diesem Projekt konnte die Bank durch verschiedene Ansätze das Kreditkartengeschäft optimieren sowie die Kundenreaktion auf die Zinssteigerung bzw. die Kundenloyalität in Echtzeit messen:

  • Mithilfe von Clustering konnten Kund:innen in Cluster eingeteilt werden, die sich in bestimmten, für die Bank wichtige Eigenschaften stark ähnelten. Durch die Bestimmung wenig preissensitiver Cluster war die Bank in der Lage, diesen Kund:innen neue und lukrativere Kreditprodukte anzubieten, was das Kreditkartengeschäft profitabler machte.
  • Mithilfe von Natural Language Processing konnten die Stimmungen der Kund:innen am Telefon mit dem Kundenservice oder per Email erfasst und ausgewertet werden. Negative Nachrichten wurden demzufolge priorisiert bearbeitet, was sich wiederum positiv auf die Kundenzufriedenheit und -loyalität auswirkte.

Neugierig geworden?

Dies ist nur eins von vielen Beispielen, wie Sie mit Data Science im Banking zu Erkenntnissen gelangen, die Sie gewinnbringend bzw. kostensparend einsetzen können.

Qualifizieren Sie sich mit den Seminaren und Trainings der Haufe Akademie rund um das Thema Data Science weiter!

Sie wollen auf Augenhöhe mit Data Scientists kommunizieren und im richtigen Moment die richtigen Fragen stellen können?

Oder Sie wollen selbst tief in die Welt der Data Science eintauchen und programmieren können? Wir bieten Ihnen die Qualifizierungen, die für Sie passen!

Aktuelle Kursangebot des Data Science Blog Sponsors, die Haufe Akademie:


Data Science – Weiterbildungen mit Coursera

Anzeige

Data Science und AI sind aufstrebende Arbeitsfelder, die sich mit der Gewinnung von Wissen aus Daten beschäftigen. Die Nachfrage nach Fähigkeiten im Bereich Data Science, aber auch in angrenzenden Bereichen wie Data Engineering oder Data Analytics, ist in den letzten Jahren explodiert, da Unternehmen versuchen, die Vorteile von Big Data und künstlicher Intelligenz (KI) zu nutzen. Es lohnt sich sehr, sich in diesen Bereich weiter zu entwickeln. Dafür eignen sich die Kurse von Coursera.org.

Online-Kurse lohnen sich dann, wenn eine Karriere im Bereich der Datenanalyse oder des maschinellen Lernens angestrebt oder einfach nur ihr Wissen in diesem Bereich erweitert werden soll.

Spezialisierungskurs – Google Data Analytics

Data Science hilft dabei, Entscheidungen auf Basis von Daten zu treffen, komplexe Probleme effektiver zu lösen und Karrierechancen zu verbessern. Die Tools von Google Cloud und Jupyter Notebook sind dafür geeignet, da sie eine leistungsstarke und skalierbare Infrastruktur sowie eine interaktive Entwicklungsplattform bieten.

Google Data Analytics Zertifikatskurs

Das Google Zertifikat für Datenanalyse behandelt neben dem Handwerkszeug für jeden Data Analyst – wie etwa SQL – auch die notwendige Datenbereinigung und Datenvisualisierung mit den Tools von Google. Es werden weder Erfahrung noch Vorkenntnisse vorausgsetzt.

Spezialisierungskurs – Google Advanced Data Analytics

Der Zertifikatskurs der erweiterten Datenanalyse von Google baut auf dem zuvorgenannten Data Analytics Kurs auf, kann jedoch auch direkt besucht werden. Hier werden grundlegende Fähigkeiten wie SQL vorausgesetzt und vertiefende Fähigkeiten vermittelt, die für einen Data Analysten nützlich sind und auch in die Data Science eintauchen.

Google Advanced Data Analytics
Dieses Kursangebot zum Aufbau erweiterter Datenanalyse-Fähigkeiten von Coursera wird ebenfalls von Google angeboten. Hier werden die Tools der Datenanalyse sowie der statistischen Handwerkzeuge für Data Science eingeführt, bis hin zum ersten Einstieg in Machine Learning.


Spezialisierungskurs – SQL für Data Science (Generalistisch)

SQL ist wichtig für etablierte und angehende Data Scientists, da es eine grundlegende Technologie für die Arbeit mit Datenbanken und relationalen Datenbankmanagementsystemen ist. SQL für Data Science ermöglicht, Daten effektiv zu organisieren und schnell Abfragen zu erstellen, um Antworten auf komplexe Fragen zu finden. Es ist auch relevant für die Arbeit mit nicht-relationalen Datenbanken und hilft Data Scientists, wertvolle Erkenntnisse aus großen Datenmengen zu gewinnen.

Auch wenn Python als Skill für einen Data Scientist ganz vorne steht, ist eine Karriere als Data Scientist ohne SQL-Kenntnisse nicht vorstellbar und dieser Kurs daher der richtige, wenn Nachbolbedarf besteht.

Spezialisierungskurs – Data Analyst Zertifikat (IBM)

Eine Karriere als Data Analyst ist attraktiv, da ihr eine hohe Nachfrage am Arbeitsmarkt gegenüber steht, die Arbeit vielfältig und herausfordernd ist, viele Weiterentwicklungsmöglichkeiten (z. B. zum Data Scientist) bietet und oft flexibel ist.

Der Online-Kurs von IBM bietet die Ausbildung der beruflichen Qualifikation zum Data Analyst. Ein weiterer Vorteil dieses Kurses ist, dass er für alle geeignet ist – unabhängig von ihrem Hintergrund oder der Vorbildung. Es sind keine Abschlüsse oder Vorkenntnisse erforderlich, was bedeutet, dass jeder, der sich für das Thema interessiert, am Kurs teilnehmen und von ihm profitieren kann.

Spezialisierungskurs – Datenverarbeitung mit Python & SQL (IBM)

Dieser Kurs bietet den Teilnehmern die Möglichkeit, ihre Kenntnisse in der Datenverarbeitung zu verbessern, eine Programmiersprache wie Python zu erlernen und grundlegende Kenntnisse in SQL zu erwerben. Diese Fähigkeiten sind für die Arbeit mit Daten unerlässlich und in der heutigen Arbeitswelt sehr gefragt. Darüber hinaus bietet der Kurs für Datenverarbeitung mit Python und SQL auch Schulungen zur Analyse und Visualisierung von Daten sowie zur Erstellung von Modellen für Maschinelles Lernen. Diese Fähigkeiten sind besonders wertvoll für die Entwicklung von Anwendungen und Systemen im Bereich der KI.

Dieser Kurs ist eine großartige Möglichkeit für alle, die ihre Kenntnisse im Bereich der Datenverarbeitung und des maschinellen Lernens verbessern möchten. Zwar werden auch hier keine Vorkenntnisse vorausgesetzt, jedoch geht der Kurs inhaltlich mehr in die Richtung Data Science als der zuvorgenannte Kurs zum Data Analyst und bietet ein umfassendes Training und Schulungen zu grundlegenden Fähigkeiten, die in der heutigen Arbeitswelt gefragt sind, und ist für jeden zugänglich, unabhängig von Hintergrund oder Erfahrung.

Spezialisierungskurs – Maschinelles Lernen (DeepLearning.AI)

Das Erlernen der Grundlagen des maschinellen Lernens (Machine Learning) ist von großer Bedeutung, da es eine der am schnellsten wachsenden und wichtigsten Technologien in der heutigen Zeit ist. Maschinelles Lernen ermöglicht es Computern, aus Erfahrung zu lernen, ohne explizit programmiert zu werden. Die Teilnehmer lernen, dem Computer das lernen zu ermöglichen.

Machinelles Lernen ist der Schlüssel zur Entwicklung von Anwendungen und Systemen im Bereich der künstlichen Intelligenz (KI) und hat Anwendungen in vielen Bereichen, von der Gesundheitsversorgung und der Finanzindustrie bis hin zur Unterhaltungsbranche und der Automobilindustrie.

Der Kurs für Maschinelles Lernen ist nicht nur ein sinnvoller Einstieg in diese Materie, sondern kann darauf aufbauend mit dem Thema Deep Learning in der Qualifikation erweitert werden.

Spezialisierungskurs – Deep Learning (DeepLearning.AI)

Das Verständnis von Deep Learning ist wichtig, da es eine Unterkategorie des maschinellen Lernens ist und viele noch mächtigere Anwendungen in verschiedenen Bereichen hat. Die populäre Applikation ChatGPT ist ein Produkt des Deep Learning. Deep Learning kann mit AI gleichgesetzt werden. Es ist eine gefragte Fähigkeit auf dem Arbeitsmarkt mit Job-Garantie.

Der Spezialisierungskurs für Deep Learning steht unabhängig für sich und erfordert keine speziellen Vorkenntnisse, darf jedoch auch als sinnvolle Ergänzung zum vorgenannten Einführungskurs in Machine Learning betrachtet werden.

Weitere Kursangebote für Data & AI auf Coursera

Die Entscheidung für ein bestimmtes Thema eines Kurses in den Bereichen Data Analytics, Data Science und AI ist eine persönliche und abhängig von den eigenen Vorkenntnissen und Vorlieben, sowie den eigenen Karrierezielen. Für die Karriere des Data Analyst sind SQL sowie allgemeine Kenntnisse rund um Data Analytics bzw. Datenverarbeitung wichtig. Von einem Data Scientist wird ferner erwartet, die theoretischen Grundlagen sowie die praktische Anwendung von Machine Learning und Deep Learning als trainierte Fähigkeit abrufbar zu haben.

Weitere Kurse von Coursera zum Thema Data & AI (link).

Dieser Artikel wurde gesponsored von Coursera.
Data Science im Vertrieb

Data Science im Vertrieb – Praxisbeispiel

Wie Sie mit einer automatisierten Lead-Priorisierung zu erfolgreichen Geschäftsabschlüssen kommen.

Die Fragestellung:

Ein Softwareunternehmen generierte durch Marketing- und Sales-Aktivitäten eine große Anzahl potenzieller Leads, die nicht alle gleichzeitig bearbeitet werden konnten. Die zentrale Frage war nun: Wie kann eine Priorisierung der Leads erfolgen, sodass erfolgsversprechende Leads zuerst bearbeitet werden können?

Definition: Ein Lead bezeichnet einen Kontakt zu einem/einer potenziellen Kund:in, die/der sich für ein Produkt oder eine Dienstleistung eines Unternehmens interessiert und deren/dessen Kontaktdaten dem Unternehmen vorliegen. Solche Leads können durch Online- und Offline-Werbemaßnahmen gewonnen werden.

In der Vergangenheit beruhte die Priorisierung und somit auch die Bearbeitung der Leads in dem Unternehmen häufig auf der persönlichen Erfahrung der zuständigen Vertriebsmitarbeiter:innen. Diese Vorgehensweise ist  jedoch sehr ressourcenintensiv und stark abhängig von der Erfahrung einzelner Vertriebsmitarbeiter:innen.

Aus diesem Grund beschloss das Unternehmen, ein KI-gestütztes System zu entwickeln, welches zum einen erfolgsversprechende Leads datenbasiert priorisiert und zum anderen Handlungsempfehlungen für die Vertriebsmitarbeiter:innen bereitstellt.

Das Vorgehen

Grundlage dieses Projektes waren bereits vorhandene Daten zu früheren Leads sowie CRM-Daten zu bereits geschlossenen Aufträgen und Deals mit diesen früheren Leads. Dazu gehörten beispielsweise:

  • Firma des Leads
  • Firmengröße des Leads
  • Branche des Leads
  • Akquisekanal, über den der Lead generiert wurde
  • Dauer bis Antwort durch Vertriebsmitarbeiter:in
  • Wochentag der Antwort
  • Kanal der Antwort

Diese Daten aus der Vergangenheit konnten zunächst einer explorativen Datenanalyse unterzogen werden, bei der untersucht wurde, inwiefern die Eigenschaften der Leads und das Verhalten der Vertriebsmitarbeiter:innen in der Vergangenheit einen Einfluss darauf hatten, ob es mit einem Lead zu einem Geschäftsabschluss kam oder nicht.

Diese Erkenntnisse aus den vergangenen Leads sollten jedoch nun auch auf aktuelle bzw. zukünftige Leads und die damit verbundenen Vertriebsaktivitäten übertragen werden. Deshalb ergaben sich aus der explorativen Datenanalyse zwei weiterführende Fragen:

  • Durch welche Merkmale zeichnen sich Leads aus, die mit einer hohen Wahrscheinlichkeit zu einem Geschäftsabschluss führen?
  • Welche Aktivitäten der Vertriebsmitarbeiter:innen führen zu einem Geschäftsabschluss?

Leads priorisieren

Durch die explorative Datenanalyse konnte das Unternehmen bereits erste Einblicke in die verschiedenen Eigenschaften der Leads erlangen. Bei einigen dieser Eigenschaften ist anzunehmen, dass sie die Wahrscheinlichkeit erhöhen, dass ein:e potenzielle:r Kund:in Interesse am Produkt des Unternehmens zeigt. Es gibt mehrere Wege, um die Erkenntnisse aus der explorativen Datenanalyse nun für zukünftiges Verhalten der Vertriebsmitarbeiter:innen zu nutzen.

Regelbasiertes Vorgehen

Auf Grundlage der explorativen Datenanalyse und der dort gewonnenen Erkenntnisse könnte das Unternehmen, z. B. dessen Vertriebsleitung, bestimmte Regeln oder Kriterien definieren, wie beispielsweise die Unternehmensgröße des Kunden oder die Branche. So könnte die Vertriebsleitung anordnen, dass Leads aus größeren Unternehmen oder aus Unternehmen aus dem Energiesektor priorisiert behandelt werden sollten, weil diese Leads auch in der Vergangenheit zu erfolgreichen Geschäftsabschlüssen geführt haben.

Der Vorteil eines solchen regelbasierten Vorgehens ist, dass es einfach zu definieren und schnell umzusetzen ist.

Der Nachteil ist jedoch, dass die hier definierten Regeln sehr starr sind und dass Menschen meist nicht in der Lage sind, mehr als zwei oder drei der Eigenschaften gleichzeitig zu betrachten. Obwohl sich die Regeln dann zwar grundsätzlich an den Erkenntnissen aus den Daten orientieren, hängen sie doch immer noch stark vom Bauchgefühl der Vertriebsleitung ab.

Clustering

Ein besserer Ansatz war es, die vergangenen Leads anhand aller verfügbaren Eigenschaften in Gruppen einzuteilen, innerhalb derer die Leads sich einander stark ähneln. Hierfür kommt ein maschinelles Lernverfahren namens Clustering zum Einsatz, welches genau dieses Ziel verfolgt: Beim Clustering werden Datenpunkte, also in diesem Falle die Leads, anhand ihrer Eigenschaften, also beispielsweise die Unternehmensgröße oder die Branche, aber auch ob es zu einem Geschäftsabschluss kam oder nicht, zusammengefasst.

Beispiel: Leads aus Unternehmen zwischen 500 und 999 Mitarbeitern aus der Energiebranche kauften 250 Lizenzen der Software A.

Kommt nun ein neuer Lead hinzu, kann er anhand seiner bereits bekannten Eigenschaften einem Cluster zugeordnet werden. Anschließend können die Vertriebsmitarbeiter:innen jene Leads priorisieren, die einem Cluster zugeordnet worden sind, in dem in der Vergangenheit bereits häufig erfolgreich Geschäfte abgeschlossen worden sind.

Der Vorteil eines solchen datenbasierten Vorgehens ist, dass eine Vielzahl an Kriterien gleichzeitig in die Priorisierung einbezogen werden kann.

Erfolgsführende Aktivitäten identifizieren

Process Mining

Im zweiten Schritt wurde eine weitere Frage gestellt: Welche Aktivitäten der Vertriebsmitarbeiter:innen führen zu einem erfolgreichen Geschäftsabschluss mit einem Lead? Dabei standen nicht nur die Leistungen einzelner Mitarbeiter:innen im Fokus, sondern auch die übergreifenden Muster, die beim Vergleich der verschiedenen Mitarbeiter:innen deutlich wurden. Mithilfe von Process Mining konnte festgestellt werden, welche Maßnahmen und Aktivitäten der Vertriebler:innen im Umgang mit einem Lead zum Erfolg bzw. zu einem Misserfolg geführt hatten. Weniger erfolgsversprechende Maßnahmen konnten somit in der Zukunft vermieden werden.

Vor allem zeitliche Aspekte spielten hierbei eine Rolle: Parameter, die aussagten, wie schnell oder an welchem Wochentag Leads eine Antwort erhielten, waren entscheidend für erfolgreiche Geschäftsabschlüsse. Diese Erkenntnisse konnte das Unternehmen dann in zukünftige Sales Trainings sowie die Sales-Strategie einfließen lassen.

Die Ergebnisse

In diesem Projekt konnte die Sales-Abteilung des Softwareunternehmens durch zwei verschiedene Ansätze die Priorisierung der Leads und damit die Geschäftsabschlüsse deutlich verbessern:

  • Priorisierung der Leads

Mithilfe des Clustering war es möglich, Leads in Gruppen einzuteilen, die sich in ihren Eigenschaften ähneln, u.a. auch in der Eigenschaft, ob es zu einem Geschäftsabschluss kommt oder nicht. Neue Leads wurden den verschiedenen Clustern zuordnen. Leads, die einem Cluster mit hoher Erfolgswahrscheinlichkeit zugeordnet wurden, konnten nun priorisiert bearbeitet werden.

  • Erfolgsversprechende Aktivitäten identifizieren

Mithilfe von Process Mining wurden erfolgsversprechende Aktivitäten der Sales-Mitarbeiter:innen identifiziert und skaliert. Umgekehrt wurden wenig erfolgsversprechende Aktivitäten erkannt und eliminiert, um Ressourcen zu sparen.

Infolgedessen konnte das Softwareunternehmen Leads erfolgreicher bearbeiten und höhere Umsätze erzielen. 

Haufe Akademie Data Science Buzzword Bingo

Buzzword Bingo: Data Science – Teil III

Im ersten Teil unserer Serie „Buzzword Bingo: Data Science“ widmeten wir uns den Begriffen Künstliche Intelligenz, Algorithmen und Maschinelles Lernen, im zweiten Teil den Begriffen Big Data, Predictive Analytics und Internet of Things. Nun geht es hier im dritten und letzten Teil weiter mit der Begriffsklärung dreier weiterer Begriffe aus dem Data Science-Umfeld.

Buzzword Bingo: Data Science – Teil III: Künstliche neuronale Netze & Deep Learning

Im dritten Teil unserer dreiteiligen Reihe „Buzzword Bingo Data Science“ beschäftigen wir uns mit den Begriffen „künstliche neuronale Netze“ und „Deep Learning“.

Künstliche neuronale Netze

Künstliche neuronale Netze beschreiben eine besondere Form des überwachten maschinellen Lernens. Das Besondere hier ist, dass mit künstlichen neuronalen Netzen versucht wird, die Funktionsweise des menschlichen Gehirns nachzuahmen. Dort können biologische Nervenzellen durch elektrische Impulse von benachbarten Neuronen erregt werden. Nach bestimmten Regeln leiten Neuronen diese elektrischen Impulse dann wiederum an benachbarte Neuronen weiter. Häufig benutzte Signalwege werden dabei verstärkt, wenig benutzte Verbindungen werden gleichzeitig im Laufe der Zeit abgeschwächt. Dies wird beim Menschen üblicherweise dann als Lernen bezeichnet.

Dasselbe geschieht auch bei künstlichen neuronalen Netzen: Künstliche Neuronen werden hier hinter- und nebeneinander geschaltet. Diese Neuronen nehmen dann Informationen auf, modifizieren und verarbeiten diese nach bestimmten Regeln und geben dann Informationen wiederum an andere Neuronen ab. Üblicherweise werden bei künstlichen neuronalen Netzen mindestens drei Schichten von Neuronen unterschieden.

  • Die Eingabeschicht nimmt Informationen aus der Umwelt auf und speist diese in das neuronale Netz ein.
  • Die verborgene(n) Schichte(n) liegen zwischen der Eingabe- und der Ausgabeschicht. Hier werden wie beschrieben die eingegebenen Informationen von den einzelnen Neuronen verarbeitet und anschließend weitergegeben. Der Name „verborgene“ Schicht betont dabei, dass für Anwender meist nicht erkennbar ist, in welcher Form ein neuronales Netz die Eingabeinformationen in den verborgenen Schichten verarbeitet.
  • Die letzte Schicht eines neuronalen Netzes ist die Ausgabeschicht. Diese beinhaltet die Ausgabeneuronen, welche die eigentliche Entscheidung, auf die das neuronale Netz trainiert wurde, als Information ausgeben.

Das besondere an neuronalen Netzen: Wie die Neuronen die Informationen zwischen den verborgenen Schichten verarbeiten und an die nächste Schicht weitergeben, erlernt ein künstliches neuronales Netz selbstständig. Hierfür werden – einfach ausgedrückt – die verschiedenen Pfade durch ein neuronales Netz, die verschiedene Entscheidungen beinhalten, häufig hintereinander ausprobiert. Führt ein bestimmter Pfad während des Trainings des neuronalen Netzes nicht zu dem vordefinierten korrekten Ergebnis, wird dieser Pfad verändert und in dieser Form zukünftig eher nicht mehr verwendet. Führt ein Pfad stattdessen erfolgreich zu dem vordefinierten Ergebnis, dann wird dieser Pfad bestärkt. Schlussendlich kann, wie bei jedem überwachten Lernprozess, ein erfolgreich trainiertes künstliches neuronales Netz auf unbekannte Eingangsdaten angewandt werden.

Auch wenn diese Funktionsweise auf den ersten Blick nicht sehr leicht verständlich ist: Am Ende handelt es sich auch hier bloß um einen Algorithmus, dessen Ziel es ist, Muster in Daten zu erkennen. Zwei Eigenschaften teilen sich künstliche neuronale Netze aber tatsächlich mit den natürlichen Vorbildern: Sie können sich besonders gut an viele verschiedene Aufgaben anpassen, benötigen dafür aber auch meistens mehr Beispiele (Daten) und Zeit als die klassischen maschinellen Lernverfahren.

Sonderform: Deep Learning

Deep Learning ist eine besondere Form von künstlichen neuronalen Netzen. Hierbei werden viele verdeckte Schichten hintereinander verwendet, wodurch ein tiefes (also „deep“) neuronales Netz entsteht.

Je tiefer ein neuronales Netz ist, umso komplexere Zusammenhänge kann es abbilden. Aber es benötigt auch deutlich mehr Rechenleistung als ein flaches neuronales Netz. Seit einigen Jahren steht diese Leistung günstig zur Verfügung, weshalb diese Form des maschinellen Lernens an Bedeutung gewonnen hat.

Data Science & Big Data

Buzzword Bingo: Data Science – Teil II

Im ersten Teil unserer Serie „Buzzword Bingo: Data Science“ widmeten wir uns den Begriffen Künstliche Intelligenz, Algorithmen und Maschinelles Lernen. Nun geht es hier im zweiten Teil weiter mit der Begriffsklärung dreier weiterer Begriffe aus dem Data Science-Umfeld.

Buzzword Bingo: Data Science – Teil II: Big Data, Predictive Analytics & Internet of Things

Im zweiten Teil unserer dreiteiligen Reihe „Buzzword Bingo Data Science“ beschäftigen wir uns mit den Begriffen „Big Data“, „Predictive Analytics“ und „Internet of Things“.

Big Data

Interaktionen auf Internetseiten und in Webshops, Likes, Shares und Kommentare in Social Media, Nutzungsdaten aus Streamingdiensten wie Netflix und Spotify, von mobilen Endgeräten wie Smartphones oder Fitnesstrackern aufgezeichnete Bewegungsdate oder Zahlungsaktivitäten mit der Kreditkarte: Wir alle produzieren in unserem Leben alltäglich immense Datenmengen.

Im Zusammenhang mit künstlicher Intelligenz wird dabei häufig von „Big Data“ gesprochen. Und weil es in der öffentlichen Diskussion um Daten häufig um personenbezogene Daten geht, ist der Begriff Big Data oft eher negativ konnotiert. Dabei ist Big Data eigentlich ein völlig wertfreier Begriff. Im Wesentlichen müssen drei Faktoren erfüllt werden, damit Daten als „big“ gelten. Da die drei Fachbegriffe im Englischen alle mit einem „V“ beginnen, wird häufig auch von den drei V der Big Data gesprochen.

Doch welche Eigenschaften sind dies?

  • Volume (Datenmenge): Unter Big Data werden Daten(-mengen) verstanden, die zu groß sind, um sie mit klassischen Methoden zu bearbeiten, weil beispielsweise ein einzelner Computer nicht in der Läge wäre, diese Datenmenge zu verarbeiten.
  • Velocity (Geschwindigkeit der Datenerfassung und -verarbeitung): Unter Big Data werden Daten(-mengen) verstanden, die in einer sehr hohen Geschwindigkeit generiert werden und dementsprechend auch in einer hohen Geschwindigkeit ausgewertet und weiterverarbeitet werden müssen, um Aktualität zu gewährleisten.
  • Variety (Datenkomplexität oder Datenvielfalt): Unter Big Data werden Daten(-mengen) verstanden, die so komplex sind, dass auf den ersten Blick keine Zusammenhänge erkennbar sind. Diese Zusammenhänge können erst mit speziellen maschinellen Lernverfahren aufgedeckt werden. Dazu gehört auch, dass ein Großteil aller Daten in unstrukturierten Formaten wie Texten, Bildern oder Videos abgespeichert ist.

Häufig werden neben diesen drei V auch weitere Faktoren aufgezählt, welche Big Data definieren. Dazu gehören Variability (Schwankungen, d.h. die Bedeutung von Daten kann sich verändern), Veracity (Wahrhaftigkeit, d.h. Big Data muss gründlich auf die Korrektheit der Daten geprüft werden), Visualization (Visualisierungen helfen, um komplexe Zusammenhänge in großen Datensets aufzudecken) und Value (Wert, d.h. die Auswertung von Big Data sollte immer mit einem unternehmerischen Vorteil einhergehen).

Predictive Analytics

  • Heute schon die Verkaufszahlen von morgen kennen, sodass eine rechtzeitige Nachbestellung knapper Produkte möglich ist?
  • Bereits am Donnerstagabend die Regenwahrscheinlichkeit für das kommende Wochenende kennen, sodass passende Kleidung für den Kurztrip gepackt werden kann?
  • Frühzeitig vor bevorstehenden Maschinenausfällen gewarnt werden, sodass die passenden Ersatzteile bestellt und das benötigte technische Personal angefragt werden kann?

Als Königsdisziplin der Data Science gilt für viele die genaue Vorhersage zukünftiger Zustände oder Ereignisse. Im Englischen wird dann von „Predictive Analytics“ gesprochen. Diese Methoden werden in vielen verschiedenen Branchen und Anwendungsfeldern genutzt. Die Prognose von Absatzzahlen, die Wettervorhersage oder Predictive Maintenance (engl. für vorausschauende Wartung) von Maschinen und Anlagen sind nur drei mögliche Beispiele.

Zu beachten ist allerdings, dass Predictive-Analytics-Modelle keine Wahrsagerei sind. Die Vorhersage zukünftiger Ereignisse beruht immer auf historischen Daten. Das bedeutet, dass maschinelle Modelle mit Methoden des überwachten maschinellen Lernens darauf trainiert werden, Zusammenhänge zwischen vielen verschiedenen Eingangseigenschaften und einer vorherzusagenden Ausgangseigenschaft zu erkennen. Im Falle der Predicitve Maintenance könnten solche Eingangseigenschaften beispielsweise das Alter einer Produktionsmaschine, der Zeitraum seit der letzten Wartung, die Umgebungstemperatur, die Produktionsgeschwindigkeit und viele weitere sein. In den historischen Daten könnte ein Algorithmus nun untersuchen, ob diese Eingangseigenschaften einen Zusammenhang damit aufweisen, ob die Maschine innerhalb der kommenden 7 Tage ausfallen wird. Hierfür muss zunächst eine ausreichend große Menge an Daten zur Verfügung stehen. Wenn ein vorherzusagendes Ereignis in der Vergangenheit nur sehr selten aufgetreten ist, dann stehen auch nur wenige Daten zur Verfügung, um dasselbe Ereignis für die Zukunft vorherzusagen. Sobald der Algorithmus einen entsprechenden Zusammenhang identifiziert hat, kann dieses trainierte maschinelle Modell nun verwendet werden, um zukünftige Maschinenausfälle rechtzeitig vorherzusagen.

Natürlich müssen solche Modelle dauerhaft darauf geprüft werden, ob sie die Realität immer noch so gut abbilden, wie zu dem Zeitpunkt, zu dem sie trainiert worden sind. Wenn sich nämlich die Umweltparameter ändern, das heißt, wenn Faktoren auftreten, die zum Trainingszeitpunkt noch nicht bekannt waren, dann muss auch das maschinelle Modell neu trainiert werden. Für unser Beispiel könnte dies bedeuten, dass wenn die Maschine für die Produktion eines neuen Produktes eingesetzt wird, auch für dieses neue Produkt zunächst geprüft werden müsste, ob die in der Vergangenheit gefundenen Zusammenhänge immer noch Bestand haben.

Internet of Things

Selbstfahrende Autos, smarte Kühlschränke, Heizungssysteme und Glühbirnen, Fitnesstracker und vieles mehr: das Buzzword „Internet of Things“ (häufig als IoT abgekürzt) beschreibt den Trend, nicht nur Computer über Netzwerke miteinander zu verbinden, sondern auch verschiedene alltägliche Objekte mit in diese Netzwerke aufzunehmen. Seinen Anfang genommen hat dieser Trend in erster Linie im Bereich der Unterhaltungselektronik. In vielen Haushalten sind schon seit Jahren Fernseher, Computer, Spielekonsole und Drucker über das Heimnetzwerk miteinander verbunden und lassen sich per Smartphone bedienen.

Damit ist das IoT natürlich eng verbunden mit Big Data, denn all diese Geräte produzieren nicht nur ständig Daten, sondern sie sind auch auf Informationen sowie auf Daten von anderen Geräten angewiesen, um zu funktionieren.

Generative Adversarial Networks GANs

Generative Adversarial Networks

After Deep Autoregressive Models, Deep Generative Modelling and Variational Autoencoders we now continue the discussion with Generative Adversarial Networks (GANs).

Introduction

So far, in the series of deep generative modellings (DGMs [Yad22a]), we have covered autoregressive modelling, which estimates the exact log likelihood defined by the model and variational autoencoders, which was variational approximations for lower bound optimization. Both of these modelling techniques were explicitly defining density functions and optimizing the likelihood of the training data. However, in this blog, we are going to discuss generative adversarial networks (GANs), which are likelihood-free models and do not define density functions explicitly. GANs follow a game-theoretic approach and learn to generate from the training distribution through a set up of a two-player game.

A two player model of GAN along with the generator and discriminators.

A two player model of GAN along with the generator and discriminators.

GAN tries to learn the distribution of high dimensional training data and generates high-quality synthetic data which has a similar distribution to training data. However, learning the training distribution is a highly complex task therefore GAN utilizes a two-player game approach to overcome the high dimensional complexity problem. GAN has two different neural networks (as shown in Figure ??) the generator and the discriminator. The generator takes a random input z\sim p(z) and produces a sample that has a similar distribution as p_d. To train this network efficiently, there is the other network that is utilized as the second player and known as the discriminator. The generator network (player one) tries to fool the discriminator by generating real looking images. Moreover, the discriminator network tries to distinguish between real (training data x\sim p_d(x)) and fake images effectively. Our main aim is to have an efficiently trained discriminator to be able to distinguish between real and fake images (the generator’s output) and on the other hand, we would like to have a generator, which can easily fool the discriminator by generating real-looking images.

Objective function and training

Objective function

Simultaneous training of these two networks is one of the main challenges in GANs and a minimax loss function is defined for this purpose. To understand this minimax function, firstly, we would like to discuss the concept of two sample testing by Aditya grover [Gro20]. Two sample testing is a method to compute the discrepancy between the training data distribution and the generated data distribution:

(1)   \begin{equation*} \min_{p_{\theta_g}}\: \max_{D_{\theta_d}\in F} \: \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] - \mathbb{E}_{x\sim p_{\theta_g}} [D_{\theta_d}(G_{\theta_g}(x))], \end{equation*}


where p_{\theta_g} and p_d are the distribution functions of generated and training data respectively. The term F is a set of functions. The \textit{max} part is computing the discrepancies between two distribution using a function D_{\theta_d} \in F and this part is very similar to the term d (discrepancy measure) from our first article (Deep Generative Modelling) and KL-divergence is applied to compute this measure in second article (Deep Autoregressive Models) and third articles (Variational Autoencoders). However, in GANs, for a given set of functions F, we would like compute the distribution p_{\theta_g}, which minimizes the overall discrepancy even for a worse function D_{\theta_d}\in F. The above mentioned objective function does not use any likelihood function and utilizing two different data samples from training and generated data respectively.

By combining Figure ?? and Equation 1, the first term \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] corresponds to the discriminator, which has direct access to the training data and the second term \mathbb{E}_{x\sim p_{\theta_g}}[D_{\theta_d}(G_{\theta_g}(x))] represents the generator part as it relies only on the latent space and produces synthetic data. Therefore, Equation 1 can be rewritten in the form of GAN’s two players as:

(2)   \begin{equation*} \min_{p_{\theta_g}}\: \max_{D_{\theta_d}\in F} \: \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] - \mathbb{E}_{z\sim p_z}[D_{\theta_d}(G_{\theta_g}(z))], \end{equation*}


The above equation can be rearranged in the form of log loss:

(3)   \begin{equation*} \min_{\theta_g}\: \max_{\theta_d} \: (\mathbb{E}_{x\sim p_d} [log \: D_{\theta_d} (x)] + \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z))]), \end{equation*}

In the above equation, the arguments are modified from p_{\theta_g} and D_{\theta_d} in F to \theta_g and  \theta_d respectively as we would like to approximate the network parameters, which are represented by \theta_g and \theta_d for the both generator and discriminator respectively. The discriminator wants to maximize the above objective for \theta_d such that D_{\theta_d}(x) \approx 1, which indicates that the outcome is close to the real data. Furthermore, D_{\theta_d}(G_{\theta_g}(z)) should be close to zero as it is fake data, therefore, the maximization of the above objective function for \theta_d will ensure that the discriminator is performing efficiently in terms of separating real and fake data. From the generator point of view, we would like to minimize this objective function for \theta_g such that D_{\theta_d}(G_{\theta_g}(z)) \approx 1. If the minimization of the objective function happens effectively for \theta_g then the discriminator will classify a fake data into a real data that means that the generator is producing almost real-looking samples.

Training

The training procedure of GAN can be explained by using the following visualization from Goodfellow et al. [GPAM+14]. In Figure 2(a), z is a random input vector to the generator to produce a synthetic outcome x\sim p_{\theta_g} (green curve). The generated data distribution is not close to the original data distribution p_d (dotted black curve). Therefore, the discriminator classifies this image as a fake image and forces generator to learn the training data distribution (Figure 2(b) and (c)). Finally, the generator produces the image which could not detected as a fake data by discriminator(Figure 2(d)).

GAN’s training visualization: the dotted black, solid green lines represents pd and pθ respectively. The discriminator distribution is shown in dotted blue. This image taken from Goodfellow et al.

GAN’s training visualization: the dotted black, solid green lines represents pd and pθ
respectively. The discriminator distribution is shown in dotted blue. This image taken from Goodfellow
et al. [GPAM+14].

The optimization of the objective function mentioned in Equation 3 is performed in th following two steps repeatedly:
\begin{enumerate}
\item Firstly, the gradient ascent is utilized to maximize the objective function for \theta_d for discriminator.

(4)   \begin{equation*} \max_{\theta_d} \: (\mathbb{E}_{x\sim p_d} [log \: D_{\theta_d}(x)] + \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z))]) \end{equation*}


\item In the second step, the following function is minimized for the generator using gradient descent.

(5)   \begin{equation*} \min_{\theta_g} \: ( \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z))]) \end{equation*}


\end{enumerate}

However, in practice the minimization for the generator does now work well because when D_{\theta_d}(G_{\theta_g}(z) \approx 1 then the term log \: (1-D_{\theta_d}(G_{\theta_g}(z))) has the dominant gradient and vice versa.

However, we would like to have the gradient behaviour completely opposite because D_{\theta_d}(G_{\theta_g}(z) \approx 1 means the generator is well trained and does not require dominant gradient values. However, in case of D_{\theta_d}(G_{\theta_g}(z) \approx 0, the generator is not well trained and producing low quality outputs therefore, it requires a dominant gradient for an efficient training. To fix this problem, the gradient ascent method is applied to maximize the modified generator’s objective:
In the second step, the following function is minimized for the generator using gradient descent alternatively.

(6)   \begin{equation*} \max_{\theta_g} \: \mathbb{E}_{z\sim p_z}[log \: (D_{\theta_d}(G_{\theta_g}(z))] \end{equation*}


therefore, during the training, Equation 4 and 6 will be maximized using the gradient ascent algorithm until the convergence.

Results

The quality of the generated images using GANs depends on several factors. Firstly, the joint training of GANs is not a stable procedure and that could severely decrease the quality of the outcome. Furthermore, the different neural network architecture will modify the quality of images based on the sophistication of the used network. For example, the vanilla GAN [GPAM+14] uses a fully connected deep neural network and generates a quite decent result. Furthermore, DCGAN [RMC15] utilized deep convolutional networks and enhanced the quality of outcome significantly. Furthermore, different types of loss functions are applied to stabilize the training procedure of GAN and to produce high-quality outcomes. As shown in Figure 3, StyleGAN [KLA19] utilized Wasserstein metric [Yad22b] to generate high-resolution face images. As it can be seen from Figure 3, the quality of the generated images are enhancing with time by applying more sophisticated training techniques and network architectures.

GAN timeline with different variations in terms of network architecture and loss functions.

GAN timeline with different variations in terms of network architecture and loss functions.

Summary

This article covered the basics and mathematical concepts of GANs. However, the training of two different networks simultaneously could be complex and unstable. Therefore, researchers are continuously working to create a better and more stable version of GANs, for example, WGAN. Furthermore, different types of network architectures are introduced to improve the quality of outcomes. We will discuss this further in the upcoming blog about these variations.

References

[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, DavidWarde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in
neural information processing systems, 27, 2014.

[Gro20] Aditya Grover. Generative adversarial networks.
https://deepgenerativemodels.github.io/notes/gan/, 2020.

[KLA19] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for
generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 4401–4410, 2019.

[RMC15] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation
learning with deep convolutional generative adversarial networks. arXiv preprint
arXiv:1511.06434, 2015.

[Yad22a] Sunil Yadav. Deep generative modelling. https://data-scienceblog.
com/blog/2022/02/19/deep-generative-modelling/, 2022.

[Yad22b] Sunil Yadav. Necessary probability concepts for deep learning: Part 2.
https://medium.com/@sunil7545/kl-divergence-js-divergence-and-wasserstein-metricin-
deep-learning-995560752a53, 2022.

Coffee Shop Location Predictor

As part of this article, we will explore the main steps involved in predicting the best location for a coffee shop in Vancouver. We will also take into consideration that the coffee shop is near a transit station, and has no Starbucks near it. Well, while at it, let us also add an extra feature where we make sure the crime in the area is lower.

Introduction

In this article, we will highlight the main steps involved to predict a location for a coffee shop in Vancouver. We also want to make sure that the coffee shop is near a transit station, and has no Starbucks near it. As an added feature, we will make sure that the crime concentration in the area is low, and the entire program should be implemented in Python. So let’s walk through the steps.

Steps Required

  • Get crime history for the last two years
  • Get locations of all transit stations and Starbucks in Vancouver
  • Check all the transit stations that do not have any Starbucks near them
  • Get all the data regarding crimes near the filtered transit stations
  • Create a grid of all possible coordinates around the transit station
  • Check crime around each created coordinate and display the top 5 locations.

Gathering Data

This covers the first two steps required to get data from the internet, both manually and automatically.

Getting all Crime History

We can get crime history for the past 14 years in Vancouver from here. This data is in raw crime.csv format, so we have to process it and filter out useless data. We then write this processed information on the crime_processed.csv file.

Note: There are 530,653 records of crime in this file

In this program, we will just use the type and coordinate of the crime. There are many crime types, but we have classified them into three major categories namely;

Theft (red), Break and Enter (orange) and Mischief (green)

These all crimes can be plotted on Graph as displayed below.

This may seem very congested and full, so let’s see a closeup image for future references.

Getting Locations of all Rapid Transit Stations

We can get the coordinates of all Transit Stations in Vancouver from here. This dataset has all coordinates of rapid transit stations in three transit lines in Vancouver. There are a total of 23 of them in Vancouver, we can then use it for further processing.

Getting Locations of all Starbucks

The Starbucks data is present here, we can scrape it easily and get the locations of all the Starbucks in Vancouver. We just need the Starbucks that is near transit stations, so we’ll filter out the rest. There are a total 24 Starbucks in Vancouver, and 10 of them are near Transit Stations.

Note: Other than the coordinates of Transit Stations and Starbucks, we also need coordinates and type of the crime.

Transit Stations with no Starbucks

As we have all the data required, now moving to the next step. We need to get to the transit Station locations that have no Starbucks near them. For that we can create an area of particular radius around each Transit Station. Then check all Starbucks locations with respect to them, whether they are within that area or not.

If none of the Starbucks are within that particular Transit Station’s area, we can append it to a list. At the end, we have a list of all Transit locations with no Starbucks near them. There are a total of 6 Transit Stations with no Starbucks near them.

Crime near Transit Stations

Now lets filter out all crime records and get just what we are interested in, which means the crime near Transit stations. For that we will plot an area of specific radius around each of them to see the crimes. These are more than 110,000 crime records.

Crime near located Transit Stations

Now that we have all the Transit Stations that don’t have any Starbucks near them and also the crime near all Transit Stations. So, let’s use this information and get crime near the located Transit Stations. These are about 44,000 crime records.

This may seem correct at first glance, but the points are overlapping due to abundance, so we can create different lists of crimes based on their types.

Theft

Break and Enter

Mischief

Generating all possible coordinates

Now finally, we have all the prerequisites and let’s get to the main task at hand, predicting the best coordinate for the coffee shop.

There may be many approaches to solve this problem, but the one I used in this program is that I will create a grid of all possible locations (coordinates) in the area of 1 km radius around each located transit station.

Initially I generated 1 coordinate for every m, this resulted in 1000,000 coordinates in every km. This is a huge number, and for the 6 located Transit stations, it becomes 6 Million. It may not seem much at first glance because computers can handle such data in a few seconds.

But for location prediction we need to compare each coordinate with crime coordinates. As the algorithm has to check for ~7,000 Thefts, ~19,000 Break ins, and ~17,000 Mischiefs around each generated coordinate. Computing this would want the program to process an estimate of 432.4 Billion times. This sort of execution takes many hours on normal computers (sometimes days).

The solution to this is to create a coordinate for each 10 m area, this results about 10,000 coordinate per km. For the above mentioned number of crimes, the estimated processes will be several Billions. That would significantly reduce the time, but is still not less.

To control this, we can remove the duplicate values in crime coordinates and those which are too close to each other ~1m. Doing so, we are left with just 816 Thefts, 2,654 Break ins, and 8,234 Mischiefs around each generated coordinate.
The precision will not be affected much but the time and computational resources required will be reduced a lot.

 

Checking Crime near Generated coordinates

Now that we have all the locations, we will start some processing on it and check each coordinate against some constraints. That are respectively;

  1. Filter out Coordinates having Theft near 1 km
    We get 122,000 coordinates with no Thefts (Below merged 1000 to 1)
  2. Filter out Coordinates having Break Ins near 200m
    We get 8000 coordinates with no Thefts (Below merged 1000 to 1)
  3. Filter out Coordinates having Mischief near 200m
    We get 6000 coordinates with no Thefts (Below merged 1000 to 1)
    Now that we have 6 Coordinates of best locations that have passed through all the constraints, we will order them.To order them, we will check their distance from the nearest transit location. The nearest will be on top of the list as the best possible location, then the second and so on. The generated List is;

    1. -123.0419406741792, 49.24824259252004
    2. -123.05887151659479, 49.24327221040713
    3. -123.05287151659476, 49.24327221040713
    4. -123.04994067417924, 49.239242592520064
    5. -123.0419406741792, 49.239242592520064
    6. -123.0409406741792, 49.239242592520064

How can MindTrades help?

MindTrades Consulting Services, a leading marketing agency provides in-depth analysis and insights for the global IT sector including leading data integration brands such as Diyotta. From Cloud Migration, Big Data, Digital Transformation, Agile Deliver, Cyber Security, to Analytics- Mind trades provides published breakthrough ideas, and prompt content delivery. For more information, refer to mindtrades.com.

Code

https://github.com/Mindtrades-Consulting/Coffee-Shop-Location-Predictor

 

Rethinking linear algebra part two: ellipsoids in data science

*This is the fourth article of my article series “Illustrative introductions on dimension reduction.”

1 Our expedition of eigenvectors still continues

This article is still going to be about eigenvectors and PCA, and this article still will not cover LDA (linear discriminant analysis). Hereby I would like you to have more organic links of the data science ideas with eigenvectors.

In the second article, we have covered the following points:

  • You can visualize linear transformations with matrices by calculating displacement vectors, and they usually look like vectors swirling.
  • Diagonalization is finding a direction in which the displacement vectors do not swirl, and that is equal to finding new axis/basis where you can describe its linear transformations more straightforwardly. But we have to consider diagonalizability of the matrices.
  • In linear dimension reduction such as PCA or LDA, we mainly use types of matrices called positive definite or positive semidefinite matrices.

In the last article we have seen the following points:

  • PCA is an algorithm of calculating orthogonal axes along which data “swell” the most.
  • PCA is equivalent to calculating a new orthonormal basis for the data where the covariance between components is zero.
  • You can reduced the dimension of the data in the new coordinate system by ignoring the axes corresponding to small eigenvalues.
  • Covariance matrices enable linear transformation of rotation and expansion and contraction of vectors.

I emphasized that the axes are more important than the surface of the high dimensional ellipsoids, but in this article let’s focus more on the surface of ellipsoids, or I would rather say general quadratic curves. After also seeing how to draw ellipsoids on data, you would see the following points about PCA or eigenvectors.

  • Covariance matrices are real symmetric matrices, and also they are positive semidefinite. That means you can always diagonalize covariance matrices, and their eigenvalues are all equal or greater than 0.
  • PCA is equivalent to finding axes of quadratic curves in which gradients are biggest. The values of quadratic curves increases the most in those directions, and that means the directions describe great deal of information of data distribution.
  • Intuitively dimension reduction by PCA is equal to fitting a high dimensional ellipsoid on data and cutting off the axes corresponding to small eigenvalues.

Even if you already understand PCA to some extent, I hope this article provides you with deeper insight into PCA, and at least after reading this article, I think you would be more or less able to visually control eigenvectors and ellipsoids with the Numpy and Maplotlib libraries.

*Let me first introduce some mathematical facts and how I denote them throughout this article in advance. If you are allergic to mathematics, take it easy or please go back to my former articles.

  • Any quadratic curves can be denoted as \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0, where \boldsymbol{x}\in \mathbb{R}^D , A \in \mathbb{R}^{D\times D} \boldsymbol{b}\in \mathbb{R}^D s\in \mathbb{R}.
  • When I want to clarify dimensions of variables of quadratic curves, I denote parameters as A_D, b_D.
  • If a matrix A is a real symmetric matrix, there exist a rotation matrix U such that U^T A U = \Lambda, where \Lambda = diag(\lambda_1, \dots, \lambda_D) and U = (\boldsymbol{u}_1, \dots , \boldsymbol{u}_D). \boldsymbol{u}_1, \dots , \boldsymbol{u}_D are eigenvectors corresponding to \lambda_1, \dots, \lambda_D respectively.
  • PCA corresponds to a case of diagonalizing A where A is a covariance matrix of certain data. When I want to clarify that A is a covariance matrix, I denote it as A=\Sigma.
  • Importantly covariance matrices \Sigma are positive semidefinite and real symmetric, which means you can always diagonalize \Sigma and any of their engenvalues cannot be lower than 0.

*In the last article, I denoted the covariance of data as S, based on Pattern Recognition and Machine Learning by C. M. Bishop.

*Sooner or later you are going to see that I am explaining basically the same ideas from different points of view, using the topic of PCA. However I believe they are all important when you learn linear algebra for data science of machine learning. Even you have not learnt linear algebra or if you have to teach linear algebra, I recommend you to first take a review on the idea of diagonalization, like the second article. And you should be conscious that, in the context of machine learning or data science, only a very limited type of matrices are important, which I have been explaining throughout this article.

2 Rotation or projection?

In this section I am going to talk about basic stuff found in most textbooks on linear algebra. In the last article, I mentioned that if A is a real symmetric matrix, you can diagonalize A with a rotation matrix U = (\boldsymbol{u}_1 \: \cdots \: \boldsymbol{u}_D), such that U^{-1}AU = U^{T}AU =\Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}). I also explained that PCA is a case where A=\Sigma, that is, A is the covariance matrix of certain data. \Sigma is known to be positive semidefinite and real symmetric. Thus you can always diagonalize \Sigma and any of their engenvalues cannot be lower than 0.

I think we first need to clarify the difference of rotation and projection. In order to visualize the ideas, let’s consider a case of D=3. Assume that you have got an orthonormal rotation matrix U = (\boldsymbol{u}_1 \: \boldsymbol{u}_2 \: \boldsymbol{u}_3) which diagonalizes A. In the last article I said diagonalization is equivalent to finding new orthogonal axes formed by eigenvectors, and in the case of this section you got new orthonoramal basis (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) which are in red in the figure below. Projecting a point \boldsymbol{x} = (x, y, z) on the new orthonormal basis is simple: you just have to multiply \boldsymbol{x} with U^T. Let U^T \boldsymbol{x} be (x', y', z')^T, and then \left( \begin{array}{c} x' \\ y' \\ z' \end{array} \right) = U^T\boldsymbol{x} = \left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right). You can see x', y', z' are \boldsymbol{x} projected on \boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3 respectively, and the left side of the figure below shows the idea. When you replace the orginal orthonormal basis (\boldsymbol{e}_1, \boldsymbol{e}_2, \boldsymbol{e}_3) with (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) as in the right side of the figure below, you can comprehend the projection as a rotation from (x, y, z) to (x', y', z') by a rotation matrix U^T.

Next, let’s see what rotation is. In case of rotation, you should imagine that you rotate the point \boldsymbol{x} in the same coordinate system, rather than projecting to other coordinate system. You can rotate \boldsymbol{x} by multiplying it with U. This rotation looks like the figure below.

In the initial position, the edges of the cube are aligned with the three orthogonal black axes (\boldsymbol{e}_1,  \boldsymbol{e}_2 , \boldsymbol{e}_3), with one corner of the cube located at the origin point of those axes. The purple dot denotes the corner of the cube directly opposite the origin corner. The cube is rotated in three dimensions, with the origin corner staying fixed in place. After the rotation with a pivot at the origin, the edges of the cube are now aligned with a new set of orthogonal axes (\boldsymbol{u}_1,  \boldsymbol{u}_2 , \boldsymbol{u}_3), shown in red. You might understand that more clearly with an equation: U\boldsymbol{x} = (\boldsymbol{u}_1 \: \boldsymbol{u}_2 \: \boldsymbol{u}_3) \left( \begin{array}{c} x \\ y \\ z \end{array} \right) = x\boldsymbol{u}_1 + y\boldsymbol{u}_2 + z\boldsymbol{u}_3. In short this rotation means you keep relative position of \boldsymbol{x}, I mean its coordinates (x, y, z), in the new orthonormal basis. In this article, let me call this a “cube rotation.”

The discussion above can be generalized to spaces with dimensions higher than 3. When U \in \mathbb{R}^{D \times D} is an orthonormal matrix and a vector \boldsymbol{x} \in \mathbb{R}^D, you can project \boldsymbol{x} to \boldsymbol{x}' = U^T \boldsymbol{x}or rotate it to \boldsymbol{x}'' = U \boldsymbol{x}, where \boldsymbol{x}' = (x_{1}', \dots, x_{D}')^T and \boldsymbol{x}'' = (x_{1}'', \dots, x_{D}'')^T. In other words \boldsymbol{x} = U \boldsymbol{x}', which means you can rotate back \boldsymbol{x}' to the original point \boldsymbol{x} with the rotation matrix U.

I think you at least saw that rotation and projection are basically the same, and that is only a matter of how you look at the coordinate systems. But I would say the idea of projection is more important through out this article.

Let’s consider a function f(\boldsymbol{x}; A) = \boldsymbol{x}^T A \boldsymbol{x} = (\boldsymbol{x}, A \boldsymbol{x}), where A\in \mathbb{R}^{D\times D} is a real symmetric matrix. The distribution of f(\boldsymbol{x}; A) is quadratic curves whose center point covers the origin, and it is known that you can express this distribution in a much simpler way using eigenvectors. When you project this function on eigenvectors of A, that is when you substitute U \boldsymbol{x}' for \boldsymbol{x}, you get f = (\boldsymbol{x}, A \boldsymbol{x}) =(U \boldsymbol{x}', AU \boldsymbol{x}') = (\boldsymbol{x}')^T U^TAU \boldsymbol{x}' = (\boldsymbol{x}')^T \Lambda \boldsymbol{x}' = \lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2. You can always diagonalize real symmetric matrices, so the formula implies that the shapes of quadratic curves largely depend on eigenvectors. We are going to see this in detail in the next section.

*(\boldsymbol{x}, \boldsymbol{y}) denotes an inner product of \boldsymbol{x} and \boldsymbol{y}.

*We are going to see details of the shapes of quadratic “curves” or “functions” in the next section.

To be exact, you cannot naively multiply U or U^T for rotation. Let’s take a part of data I showed in the last article as an example. In the figure below, I projected data on the basis (\boldsymbol{u}_1,  \boldsymbol{u}_2 , \boldsymbol{u}_3).

You might have noticed that you cannot do a “cube rotation” in this case. If you make the coordinate system (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) with your left hand, like you might have done in science classes in school to learn Fleming’s rule, you would soon realize that the coordinate systems in the figure above do not match. You need to flip the direction of one axis to match them.

Mathematically, you have to consider the determinant of the rotation matrix U. You can do a “cube rotation” when det(U)=1, and in the case above det(U) was -1, and you needed to flip one axis to make the determinant 1. In the example in the figure below, you can match the basis. This also can be generalized to higher dimensions, but that is also beyond the scope of this article series. If you are really interested, you should prepare some coffee and snacks and textbooks on linear algebra, and some weekends.

When you want to make general ellipsoids in a 3d space on Matplotlib, you can take advantage of rotation matrices. You first make a simple ellipsoid symmetric about xyz axis using polar coordinates, and you can rotate the whole ellipsoid with rotation matrices. I made some simple modules for drawing ellipsoid. If you put in a rotation matrix which diagonalize the covariance matrix of data and a list of three radiuses \sqrt{\lambda_1}, \sqrt{\lambda_2}, \sqrt{\lambda_3}, you can rotate the original ellipsoid so that it fits the data well.

3 Types of quadratic curves.

*This article might look like a mathematical writing, but I would say this is more about computer science. Please tolerate some inaccuracy in terms of mathematics. I gave priority to visualizing necessary mathematical ideas in my article series. If you are not sure about details, please let me know.

In linear dimension reduction, or at least in this article series you mainly have to consider ellipsoids. However ellipsoids are just one type of quadratic curves. In the last article, I mentioned that when the center of a D dimensional ellipsoid is the origin point of a normal coordinate system, the formula of the surface of the ellipsoid is as follows: (\boldsymbol{x}, A\boldsymbol{x})=1, where A satisfies certain conditions. To be concrete, when (\boldsymbol{x}, A\boldsymbol{x})=1 is the surface of a ellipsoid, A has to be diagonalizable and positive definite.

*Real symmetric matrices are diagonalizable, and positive definite matrices have only positive eigenvalues. Covariance matrices \Sigma, whose displacement vectors I visualized in the last two articles, are known to be symmetric real matrices and positive semi-defintie. However, the surface of an ellipsoid which fit the data is \boldsymbol{x}^T \Sigma ^{-1} \boldsymbol{x} = const., not \boldsymbol{x}^T \Sigma \boldsymbol{x} = const..

*You have to keep it in mind that \boldsymbol{x} are all deviations.

*You do not have to think too much about what the “semi” of the term “positive semi-definite” means fow now.

As you could imagine, this is just one simple case of richer variety of graphs. Let’s consider a 3-dimensional space. Any quadratic curves in this space can be denoted as ax^2 + by^2 + cz^2 + dxy + eyz + fxz + px + qy + rz + s = 0, where at least one of a, b, c, d, e, f, p, q, r, s is not 0.  Let \boldsymbol{x} be (x, y, z)^T, then the quadratic curves can be simply denoted with a 3\times 3 matrix A and a 3-dimensional vector \boldsymbol{b} as follows: \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0, where A = \left( \begin{array}{ccc} a & \frac{d}{2} & \frac{f}{2} \\ \frac{d}{2} & b & \frac{e}{2} \\ \frac{f}{2} & \frac{e}{2} & c \end{array} \right), \boldsymbol{b} = \left( \begin{array}{c} \frac{p}{2} \\ \frac{q}{2} \\ \frac{r}{2} \end{array} \right). General quadratic curves are roughly classified into the 9 types below.

You can shift these quadratic curves so that their center points come to the origin, without rotation, and the resulting curves are as follows. The curves can be all denoted as \boldsymbol{x}^T A\boldsymbol{x}.

As you can see, A is a real symmetric matrix. As I have mentioned repeatedly, when all the elements of a D \times D symmetric matrix A are real values and its eigen values are \lambda_{i} (i=1, \dots , D), there exist orthogonal/orthonormal matrices U such that U^{-1}AU = \Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}). Hence, you can diagonalize the A = \left( \begin{array}{ccc} a & \frac{d}{2} & \frac{f}{2} \\ \frac{d}{2} & b & \frac{e}{2} \\ \frac{f}{2} & \frac{e}{2} & c \end{array} \right) with an orthogonal matrix U. Let U be an orthogonal matrix such that U^T A U = \left( \begin{array}{ccc} \alpha  & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{array} \right) =\left( \begin{array}{ccc} \lambda_1  & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{array} \right). After you apply rotation by U to the curves (a)” ~ (i)”, those curves are symmetrically placed about the xyz axes, and their center points still cross the origin. The resulting curves look like below. Or rather I should say you projected (a)’ ~ (i)’ on their eigenvectors.

In this article mainly (a)” , (g)”, (h)”, and (i)” are important. General equations for the curves is as follows

  • (a)”: \frac{x^2}{l^2} + \frac{y^2}{m^2} + \frac{z^2}{n^2} = 1
  • (g)”: z = \frac{x^2}{l^2} + \frac{y^2}{m^2}
  • (h)”: z = \frac{x^2}{l^2} - \frac{y^2}{m^2}
  • (i)”: z = \frac{x^2}{l^2}

, where l, m, n \in \mathbb{R}^+.

Even if this section has been puzzling to you, you just have to keep one point in your mind: we have been discussing general quadratic curves, but in PCA, you only need to consider a case where A is a covariance matrix, that is A=\Sigma. PCA corresponds to the case where you shift and rotate the curve (a) into (a)”. Subtracting the mean of data from each point of data corresponds to shifting quadratic curve (a) to (a)’. Calculating eigenvectors of A corresponds to calculating a rotation matrix U such that the curve (a)’ comes to (a)” after applying the rotation, or projecting curves on eigenvectors of \Sigma. Importantly we are only discussing the covariance of certain data, not the distribution of the data itself.

*Just in case you are interested in a little more mathematical sides: it is known that if you rotate all the points \boldsymbol{x} on the curve \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0 with the rotation matrix P, those points \boldsymbol{x} are mapped into a new quadratic curve \alpha x^2 + \beta y^2 + \gamma z^2 + \lambda x + \mu y + \nu z + \rho = 0. That means the rotation of the original quadratic curve with P (or rather rotating axes) enables getting rid of the terms xy, yz, zx. Also it is known that when \alpha ' \neq 0, with proper translations and rotations, the quadratic curve \alpha x^2 + \beta y^2 + \gamma z^2 + \lambda x + \mu y + \nu z + \rho = 0 can be mapped into one of the types of quadratic curves in the figure below, depending on coefficients of the original quadratic curve. And the discussion so far can be generalized to higher dimensional spaces, but that is beyond the scope of this article series. Please consult decent textbooks on linear algebra around you for further details.

4 Eigenvectors are gradients and sometimes variances.

In the second section I explained that you can express quadratic functions f(\boldsymbol{x}; A) = \boldsymbol{x}^T A \boldsymbol{x} in a very simple way by projecting \boldsymbol{x} on eigenvectors of A.

You can comprehend what I have explained in another way: eigenvectors, to be exact eigenvectors of real symmetric matrices A, are gradients. And in case of PCA, I mean when A=\Sigma eigenvalues are also variances. Before explaining what that means, let me explain a little of the totally common facts on mathematics. If you have variables \boldsymbol{x}\in \mathbb{R}^D, I think you can comprehend functions f(\boldysmbol{x}) in two ways. One is a normal “functions” f(\boldsymbol{x}), and the others are “curves” f(\boldsymbol{x}) = const.. “Functions” get an input \boldsymbol{x} and gives out an output f(\boldsymbol{x}), just as well as normal functions you would imagine. “Curves” are rather sets of \boldsymbol{x} \in \mathbb{R}^D such that f(\boldsymbol{x}) = const..

*Please assume that the terms “functions” and “curves” are my original words. I use them just in case I fail to use functions and curves properly.

The quadratic curves in the figure above are all “curves” in my term, which can be denoted as f(\boldsymbol{x}; A_3, \boldsymbol{b}_3)=const or f(\boldsymbol{x}; A_3)=const. However if you replace z of (g)”, (h)”, and (i)” with f, you can interpret the “curves” as “functions” which are denoted as f(\boldsymbol{x}; A_2). This might sounds too obvious to you, and my point is you can visualize how values of “functions” change only when the inputs are 2 dimensional.

When a symmetric 2\times 2 real matrices A_2 have two eigenvalues \lambda_1, \lambda_2, the distribution of quadratic curves can be roughly classified to the following three types.

  • (g): Both \lambda_1 and \lambda_2 are positive or negative.
  • (h): Either of \lambda_1 or \lambda_2 is positive and the other is negative.
  • (i): Either of \lambda_1 or \lambda_2 is 0 and the other is not.

The equations of (g)” , (h)”, and (i)” correspond to each type of f=(\boldsymbol{x}; A_2), and thier curves look like the three graphs below.

And in fact, when start from the origin and go in the direction of an eigenvector \boldsymbol{u}_i, \lambda_i is the gradient of the direction. You can see that more clearly when you restrict the distribution of f=(\boldsymbol{x}; A_2) to a unit circle. Like in the figure below, in case \lambda_1 = 7, \lambda_2 = 3, which is classified to (g), the distribution looks like the left side, and if you restrict the distribution in the unit circle, the distribution looks like a bowl like the middle and the right side. When you move in the direction of \boldsymbol{u}_1, you can climb the bowl as as high as \lambda_1, in \boldsymbol{u}_2 as high as \lambda_2.

Also in case of (h), the same facts hold. But in this case, you can also descend the curve.

*You might have seen the curve above in the context of optimization with stochastic gradient descent. The origin of the curve above is a notorious saddle point, where gradients are all 0 in any directions but not a local maximum or minimum. Points can be stuck in this point during optimization.

Especially in case of PCA, A is a covariance matrix, thus A=\Sigma. Eigenvalues of \Sigma are all equal to or greater than 0. And it is known that in this case \lambda_i is the variance of data projected on its corresponding eigenvector \boldsymbol{u}_i (i=0, \dots , D). Hence, if you project f(\boldsymbol{x}; \Sigma), quadratic curves formed by a covariance matrix \Sigma, on eigenvectors of \Sigma, you get f(\boldsymbol{x}; \Sigma) = ({x'}_1 \: \dots \: {x'}_D) (\lambda_1 {x'}_1 \: \dots \: \lambda_D {x'}_D)^t =\lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2.  This shows that you can re-weight ({x'}_1 \: \dots \: {x'}_D), the coordinates of data projected projected on eigenvectors of A, with \lambda_1, \dots, \lambda_D, which are variances ({x'}_1 \: \dots \: {x'}_D). As I mentioned in an example of data of exam scores in the last article, the bigger a variance \lambda_i is, the more the feature described by \boldsymbol{u}_i vary from sample to sample. In other words, you can ignore eigenvectors corresponding to small eigenvalues.

That is a great hint why principal components corresponding to large eigenvectors contain much information of the data distribution. And you can also interpret PCA as a “climbing” a bowl of f(\boldsymbol{x}; A_D), as I have visualized in the case of (g) type curve in the figure above.

*But as I have repeatedly mentioned, ellipsoid which fit data well isf(\boldsymbol{x}; \Sigma ^{-1}) =(\boldsymbol{x}')^T diag(\frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D})\boldsymbol{x}' = \frac{({x'}_{1})^2}{\lambda_1} + \cdots + \frac{({x'}_{D})^2}{\lambda_D} = const..

*You have to be careful that even if you slice a type (h) curve f(\boldsymbol{x}; A_D) with a place z=const. the resulting cross section does not fit the original data well because the equation of the cross section is \lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2 = const. The figure below is an example of slicing the same f(\boldsymbol{x}; A_2) as the one above with z=1, and the resulting cross section.

As we have seen, \lambda_i, the eigenvalues of the covariance matrix of data are variances or data when projected on it eigenvectors. At the same time, when you fit an ellipsoid on the data, \sqrt{\lambda_i} is the radius of the ellipsoid corresponding to \boldsymbol{u}_i. Thus ignoring data projected on eigenvectors corresponding to small eigenvalues is equivalent to cutting of the axes of the ellipsoid with small radiusses.

I have explained PCA in three different ways over three articles.

  • The second article: I focused on what kind of linear transformations convariance matrices \Sigma enable, by visualizing displacement vectors. And those vectors look like swirling and extending into directions of eigenvectors of \Sigma.
  • The third article: We directly found directions where certain data distribution “swell” the most, to find that data swell the most in directions of eigenvectors.
  • In this article, we have seen PCA corresponds to only one case of quadratic functions, where the matrix A is a covariance matrix. When you go in the directions of eigenvectors corresponding to big eigenvalues, the quadratic function increases the most. Also that means data samples have bigger variances when projected on the eigenvectors. Thus you can cut off eigenvectors corresponding to small eigenvectors because they retain little information about data, and that is equivalent to fitting an ellipsoid on data and cutting off axes with small radiuses.

*Let A be a covariance matrix, and you can diagonalize it with an orthogonal matrix U as follow: U^{T}AU = \Lambda, where \Lambda = diag(\lambda_1, \dots, \lambda_D). Thus A = U \Lambda U^{T}. U is a rotation, and multiplying a \boldsymbol{x} with \Lambda means you multiply each eigenvalue to each element of \boldsymbol{x}. At the end U^T enables the reverse rotation.

If you get data like the left side of the figure below, most explanation on PCA would just fit an oval on this data distribution. However after reading this articles series so far, you would have learned to see PCA from different viewpoints like at the right side of the figure below.

 

5 Ellipsoids in Gaussian distributions.

I have explained that if the covariance of a data distribution is \boldsymbol{\Sigma}, the ellipsoid which fits the distribution the best is \bigl((\boldsymbol{x} - \boldsymbol{\mu}), \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu})\bigr) = 1. You might have seen the part \bigl((\boldsymbol{x} - \boldsymbol{\mu}), \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu})\bigr) = (\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) somewhere else. It is the exponent of general Gaussian distributions: \mathcal{N}(\boldsymbol{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\{ -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) \}.  It is known that the eigenvalues of \Sigma ^{-1} are \frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D}, and eigenvectors corresponding to each eigenvalue are also \boldsymbol{u}_1, \dots, \boldsymbol{u}_D respectively. Hence just as well as what we have seen, if you project (\boldsymbol{x} - \boldsymbol{\mu}) on each eigenvector of \Sigma ^{-1}, we can convert the exponent of the Gaussian distribution.

Let -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) be \boldsymbol{y} and U ^{-1} \boldsymbol{y}= U^{T} \boldsymbol{y} be \boldsymbol{y}', where U=(\boldsymbol{u}_1 \: \dots \: \boldsymbol{u}_D). Just as we have seen, (\boldsymbol{x} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) =\boldsymbol{y}^T\Sigma^{-1} \boldsymbol{y} =(U\boldsymbol{y}')^T \Sigma^{-1} U\boldsymbol{y}' =((\boldsymbol{y}')^T U^T \Sigma^{-1} U\boldsymbol{y}' = (\boldsymbol{y}')^T diag(\frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D}) \boldsymbol{y}' = \frac{({y'}_{1})^2}{\lambda_1} + \cdots + \frac{({y'}_{D})^2}{\lambda_D}. Hence \mathcal{N}(\boldsymbol{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\{ -\frac{1}{2}(\boldsymbol{y}) \boldsymbol{\Sigma}^{-1}(\boldsymbol{y}) \} =  \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\{ -\frac{1}{2}(\frac{({y'}_{1})^2}{\lambda_1} + \cdots + \frac{({y'}_{D})^2}{\lambda_D} ) \} =\frac{1}{(2\pi)^{1/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\biggl( -\frac{1}{2} \frac{({y'}_{1})^2}{\lambda_1} \biggl) \cdots \frac{1}{(2\pi)^{1/2}} \frac{1}{|\boldsymbol{\Sigma}|} exp\biggl( -\frac{1}{2}\frac{({y'}_{D})^2}{\lambda_D} \biggl).

*To be mathematically exact about changing variants of normal distributions, you have to consider for example Jacobian matrices.

This results above demonstrate that, by projecting data on the eigenvectors of its covariance matrix, you can factorize the original multi-dimensional Gaussian distribution into a product of Gaussian distributions which are irrelevant to each other. However, at the same time, that is the potential limit of approximating data with PCA. This idea is going to be more important when you think about more probabilistic ways to handle PCA, which is more robust to lack of data.

I have explained PCA over 3 articles from various viewpoints. If you have been patient enough to read my article series, I think you have gained some deeper insight into not only PCA, but also linear algebra, and that should be helpful when you learn or teach data science. I hope my codes also help you. In fact these are not the only topics about PCA. There are a lot of important PCA-like algorithms.

In fact our expedition of ellipsoids, or PCA still continues, just as Star Wars series still continues. Especially if I have to explain an algorithm named probabilistic PCA, I need to explain the “Bayesian world” of machine learning. Most machine learning algorithms covered by major introductory textbooks tend to be too deterministic and dependent on the size of data. Many of those algorithms have another “parallel world,” where you can handle inaccuracy in better ways. I hope I can also write about them, and I might prepare another trilogy for such PCA. But I will not disappoint you, like “The Phantom Menace.”

Appendix: making a model of a bunch of grape with ellipsoid berries.

If you can control quadratic curves, reshaping and rotating them, you can make a model of a grape of olive bunch on Matplotlib. I made a program of making a model of a bunch of berries on Matplotlib using the module to draw ellipsoids which I introduced earlier. You can check the codes in this page.

*I have no idea how many people on this earth are in need of making such models.

I made some modules so that you can see the grape bunch from several angles. This might look very simple to you, but the locations of berries are organized carefully so that it looks like they are placed around a stem and that the berries are not too close to each other.

 

The programming code I created for this article is completly available here.

[Refereces]

[1]C. M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 78-83, 559-577

[2]「理工系新課程 線形代数 基礎から応用まで」, 培風館、(2017)

[3]「これなら分かる 最適化数学 基礎原理から計算手法まで」, 金谷健一著、共立出版, (2019), pp. 17-49

[4]「これなら分かる 応用数学教室 最小二乗法からウェーブレットまで」, 金谷健一著、共立出版, (2019), pp.165-208

[5] 「サボテンパイソン 」
https://sabopy.com/

 

Multi-head attention mechanism: “queries”, “keys”, and “values,” over and over again

*A comment added on 04/05/2022: Thanks to a comment by Mr. Maier, I found a major mistake in my visualization. To be concrete, there is a mistake in expressing how to get each colored divided group of tokens by applying linear transformations. That corresponds to the section 3.2.2 in the paper “Attention Is All You Need.” There would be no big differences in the main point of this article, the relations of keys, queries, and values, but please bear that in your mind if you need Transformer at a practical work. Besides checking the implementation by Tensorflow, I will soon prepare a modified version of visualization. For further details, please see comments at the bottom of this article.

This is the third article of my article series named “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

In the last article, I explained how attention mechanism works in simple seq2seq models with RNNs, and it basically calculates correspondences of the hidden state at every time step, with all the outputs of the encoder. However I would say the attention mechanisms of RNN seq2seq models use only one standard for comparing them. Using only one standard is not enough for understanding languages, especially when you learn a foreign language. You would sometimes find it difficult to explain how to translate a word in your language to another language. Even if a pair of languages are very similar to each other, translating them cannot be simple switching of vocabulary. Usually a single token in one language is related to several tokens in the other language, and vice versa. How they correspond to each other depends on several criteria, for example “what”, “who”, “when”, “where”, “why”, and “how”. It is easy to imagine that you should compare tokens with several criteria.

Transformer model was first introduced in the original paper named “Attention Is All You Need,” and from the title you can easily see that attention mechanism plays important roles in this model. When you learn about Transformer model, you will see the figure below, which is used in the original paper on Transformer.  This is the simplified overall structure of one layer of Transformer model, and you stack this layer N times. In one layer of Transformer, there are three multi-head attention, which are displayed as boxes in orange. These are the very parts which compare the tokens on several standards. I made the head article of this article series inspired by this multi-head attention mechanism.

The figure below is also from the original paper on Transfromer. If you can understand how multi-head attention mechanism works with the explanations in the paper, and if you have no troubles understanding the codes in the official Tensorflow tutorial, I have to say this article is not for you. However I bet that is not true of majority of people, and at least I need one article to clearly explain how multi-head attention works. Please keep it in mind that this article covers only the architectures of the two figures below. However multi-head attention mechanisms are crucial components of Transformer model, and throughout this article, you would not only see how they work but also get a little control over it at an implementation level.

1 Multi-head attention mechanism

When you learn Transformer model, I recommend you first to pay attention to multi-head attention. And when you learn multi-head attentions, before seeing what scaled dot-product attention is, you should understand the whole structure of multi-head attention, which is at the right side of the figure above. In order to calculate attentions with a “query”, as I said in the last article, “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” Sooner or later, you will notice I would be just repeating these phrases over and over again throughout this article, in several ways.

*Even if you are not sure what “reweighting” means in this context, please keep reading. I think you would little by little see what it means especially in the next section.

The overall process of calculating multi-head attention, displayed in the figure above, is as follows (Please just keep reading. Please do not think too much.): first you split the V: “values”, K: “keys”, and Q: “queries”, and second you transform those divided “values”, “keys”, and “queries” with densely connected layers (“Linear” in the figure). Next you calculate attention weights and reweight the “values” and take the summation of the reiweighted “values”, and you concatenate the resulting summations. At the end you pass the concatenated “values” through another densely connected layers. The mechanism of scaled dot-product attention is just a matter of how to concretely calculate those attentions and reweight the “values”.

*In the last article I briefly mentioned that “keys” and “queries” can be in the same language. They can even be the same sentence in the same language, and in this case the resulting attentions are called self-attentions, which we are mainly going to see. I think most people calculate “self-attentions” unconsciously when they speak. You constantly care about what “she”, “it” , “the”, or “that” refers to in you own sentence, and we can say self-attention is how these everyday processes is implemented.

Let’s see the whole process of calculating multi-head attention at a little abstract level. From now on, we consider an example of calculating multi-head self-attentions, where the input is a sentence “Anthony Hopkins admired Michael Bay as a great director.” In this example, the number of tokens is 9, and each token is encoded as a 512-dimensional embedding vector. And the number of heads is 8. In this case, as you can see in the figure below, the input sentence “Anthony Hopkins admired Michael Bay as a great director.” is implemented as a 9\times 512 matrix. You first split each token into 512/8=64 dimensional, 8 vectors in total, as I colored in the figure below. In other words, the input matrix is divided into 8 colored chunks, which are all 9\times 64 matrices, but each colored matrix expresses the same sentence. And you calculate self-attentions of the input sentence independently in the 8 heads, and you reweight the “values” according to the attentions/weights. After this, you stack the sum of the reweighted “values”  in each colored head, and you concatenate the stacked tokens of each colored head. The size of each colored chunk does not change even after reweighting the tokens. According to Ashish Vaswani, who invented Transformer model, each head compare “queries” and “keys” on each standard. If the a Transformer model has 4 layers with 8-head multi-head attention , at least its encoder has 4\times 8 = 32 heads, so the encoder learn the relations of tokens of the input on 32 different standards.

I think you now have rough insight into how you calculate multi-head attentions. In the next section I am going to explain the process of reweighting the tokens, that is, I am finally going to explain what those colorful lines in the head image of this article series are.

*Each head is randomly initialized, so they learn to compare tokens with different criteria. The standards might be straightforward like “what” or “who”, or maybe much more complicated. In attention mechanisms in deep learning, you do not need feature engineering for setting such standards.

2 Calculating attentions and reweighting “values”

If you have read the last article or if you understand attention mechanism to some extent, you should already know that attention mechanism calculates attentions, or relevance between “queries” and “keys.” In the last article, I showed the idea of weights as a histogram, and in that case the “query” was the hidden state of the decoder at every time step, whereas the “keys” were the outputs of the encoder. In this section, I am going to explain attention mechanism in a more abstract way, and we consider comparing more general “tokens”, rather than concrete outputs of certain networks. In this section each [ \cdots ] denotes a token, which is usually an embedding vector in practice.

Please remember this mantra of attention mechanism: “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” The figure below shows an overview of a case where “Michael” is a query. In this case you compare the query with the “keys”, that is, the input sentence “Anthony Hopkins admired Michael Bay as a great director.” and you get the histogram of attentions/weights. Importantly the sum of the weights 1. With the attentions you have just calculated, you can reweight the “values,” which also denote the same input sentence. After that you can finally take a summation of the reweighted values. And you use this summation.

*I have been repeating the phrase “reweighting ‘values’  with attentions,”  but you in practice calculate the sum of those reweighted “values.”

Assume that compared to the “query”  token “Michael”, the weights of the “key” tokens “Anthony”, “Hopkins”, “admired”, “Michael”, “Bay”, “as”, “a”, “great”, and “director.” are respectively 0.06, 0.09, 0.05, 0.25, 0.18, 0.06, 0.09, 0.06, 0.15. In this case the sum of the reweighted token is 0.06″Anthony” + 0.09″Hopkins” + 0.05″admired” + 0.25″Michael” + 0.18″Bay” + 0.06″as” + 0.09″a” + 0.06″great” 0.15″director.”, and this sum is the what wee actually use.

*Of course the tokens are embedding vectors in practice. You calculate the reweighted vector in actual implementation.

You repeat this process for all the “queries.”  As you can see in the figure below, you get summations of 9 pairs of reweighted “values” because you use every token of the input sentence “Anthony Hopkins admired Michael Bay as a great director.” as a “query.” You stack the sum of reweighted “values” like the matrix in purple in the figure below, and this is the output of a one head multi-head attention.

3 Scaled-dot product

This section is a only a matter of linear algebra. Maybe this is not even so sophisticated as linear algebra. You just have to do lots of Excel-like operations. A tutorial on Transformer by Jay Alammar is also a very nice study material to understand this topic with simpler examples. I tried my best so that you can clearly understand multi-head attention at a more mathematical level, and all you need to know in order to read this section is how to calculate products of matrices or vectors, which you would see in the first some pages of textbooks on linear algebra.

We have seen that in order to calculate multi-head attentions, we prepare 8 pairs of “queries”, “keys” , and “values”, which I showed in 8 different colors in the figure in the first section. We calculate attentions and reweight “values” independently in 8 different heads, and in each head the reweighted “values” are calculated with this very simple formula of scaled dot-product: Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) =softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})\boldsymbol{V}. Let’s take an example of calculating a scaled dot-product in the blue head.

At the left side of the figure below is a figure from the original paper on Transformer, which explains one-head of multi-head attention. If you have read through this article so far, the figure at the right side would be more straightforward to understand. You divide the input sentence into 8 chunks of matrices, and you independently put those chunks into eight head. In one head, you convert the input matrix by three different fully connected layers, which is “Linear” in the figure below, and prepare three matrices Q, K, V, which are “queries”, “keys”, and “values” respectively.

*Whichever color attention heads are in, the processes are all the same.

*You divide \frac{\boldsymbol{Q}} {\boldsymbol{K}^T} by \sqrt{d}_k in the formula. According to the original paper, it is known that re-scaling \frac{\boldsymbol{Q} }{\boldsymbol{K}^T} by \sqrt{d}_k is found to be effective. I am not going to discuss why in this article.

As you can see in the figure below, calculating Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) is virtually just multiplying three matrices with the same size (Only K is transposed though). The resulting 9\times 64 matrix is the output of the head.

softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}) is calculated like in the figure below. The softmax function regularize each row of the re-scaled product \frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}, and the resulting 9\times 9 matrix is a kind a heat map of self-attentions.

The process of comparing one “query” with “keys” is done with simple multiplication of a vector and a matrix, as you can see in the figure below. You can get a histogram of attentions for each query, and the resulting 9 dimensional vector is a list of attentions/weights, which is a list of blue circles in the figure below. That means, in Transformer model, you can compare a “query” and a “key” only by calculating an inner product. After re-scaling the vectors by dividing them with \sqrt{d_k} and regularizing them with a softmax function, you stack those vectors, and the stacked vectors is the heat map of attentions.

You can reweight “values” with the heat map of self-attentions, with simple multiplication. It would be more straightforward if you consider a transposed scaled dot-product \boldsymbol{V}^T \cdot softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})^T. This also should be easy to understand if you know basics of linear algebra.

One column of the resulting matrix (\boldsymbol{V}^T \cdot softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})^T) can be calculated with a simple multiplication of a matrix and a vector, as you can see in the figure below. This corresponds to the process or “taking a summation of reweighted ‘values’,” which I have been repeating. And I would like you to remember that you got those weights (blue) circles by comparing a “query” with “keys.”

Again and again, let’s repeat the mantra of attention mechanism together: “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” If you have been patient enough to follow my explanations, I bet you have got a clear view on how multi-head attention mechanism works.

We have been seeing the case of the blue head, but you can do exactly the same procedures in every head, at the same time, and this is what enables parallelization of multi-head attention mechanism. You concatenate the outputs of all the heads, and you put the concatenated matrix through a fully connected layers.

If you are reading this article from the beginning, I think this section is also showing the same idea which I have repeated, and I bet more or less you no have clearer views on how multi-head attention mechanism works. In the next section we are going to see how this is implemented.

4 Tensorflow implementation of multi-head attention

Let’s see how multi-head attention is implemented in the Tensorflow official tutorial. If you have read through this article so far, this should not be so difficult. I also added codes for displaying heat maps of self attentions. With the codes in this Github page, you can display self-attention heat maps for any input sentences in English.

The multi-head attention mechanism is implemented as below. If you understand Python codes and Tensorflow to some extent, I think this part is relatively easy.  The multi-head attention part is implemented as a class because you need to train weights of some fully connected layers. Whereas, scaled dot-product is just a function.

*I am going to explain the create_padding_mask() and create_look_ahead_mask() functions in upcoming articles. You do not need them this time.

Let’s see a case of using multi-head attention mechanism on a (1, 9, 512) sized input tensor, just as we have been considering in throughout this article. The first axis of (1, 9, 512) corresponds to the batch size, so this tensor is virtually a (9, 512) sized tensor, and this means the input is composed of 9 512-dimensional vectors. In the results below, you can see how the shape of input tensor changes after each procedure of calculating multi-head attention. Also you can see that the output of the multi-head attention is the same as the input, and you get a 9\times 9 matrix of attention heat maps of each attention head.

I guess the most complicated part of this implementation above is the split_head() function, especially if you do not understand tensor arithmetic. This part corresponds to splitting the input tensor to 8 different colored matrices as in one of the figures above. If you cannot understand what is going on in the function, I recommend you to prepare a sample tensor as below.

This is just a simple (1, 9, 512) sized tensor with sequential integer elements. The first row (1, 2, …., 512) corresponds to the first input token, and (4097, 4098, … , 4608) to the last one. You should try converting this sample tensor to see how multi-head attention is implemented. For example you can try the operations below.

These operations correspond to splitting the input into 8 heads, whose sizes are all (9, 64). And the second axis of the resulting (1, 8, 9, 64) tensor corresponds to the index of the heads. Thus sample_sentence[0][0] corresponds to the first head, the blue 9\times 64 matrix. Some Tensorflow functions enable linear calculations in each attention head, independently as in the codes below.

Very importantly, we have been only considering the cases of calculating self attentions, where all “queries”, “keys”, and “values” come from the same sentence in the same language. However, as I showed in the last article, usually “queries” are in a different language from “keys” and “values” in translation tasks, and “keys” and “values” are in the same language. And as you can imagine, usualy “queries” have different number of tokens from “keys” or “values.” You also need to understand this case, which is not calculating self-attentions. If you have followed this article so far, this case is not that hard to you. Let’s briefly see an example where the input sentence in the source language is composed 9 tokens, on the other hand the output is composed 12 tokens.

As I mentioned, one of the outputs of each multi-head attention class is 9\times 9 matrix of attention heat maps, which I displayed as a matrix composed of blue circles in the last section. The the implementation in the Tensorflow official tutorial, I have added codes to display actual heat maps of any input sentences in English.

*If you want to try displaying them by yourself, download or just copy and paste codes in this Github page. Please maker “datasets” directory in the same directory as the code. Please download “spa-eng.zip” from this page, and unzip it. After that please put “spa.txt” on the “datasets” directory. Also, please download the “checkpoints_en_es” folder from this link, and place the folder in the same directory as the file in the Github page. In the upcoming articles, you would need similar processes to run my codes.

After running codes in the Github page, you can display heat maps of self attentions. Let’s input the sentence “Anthony Hopkins admired Michael Bay as a great director.” You would get a heat maps like this.

In fact, my toy implementation cannot handle proper nouns such as “Anthony” or “Michael.” Then let’s consider a simple input sentence “He admired her as a great director.” In each layer, you respectively get 8 self-attention heat maps.

I think we can see some tendencies in those heat maps. The heat maps in the early layers, which are close to the input, are blurry. And the distributions of the heat maps come to concentrate more or less diagonally. At the end, presumably they learn to pay attention to the start and the end of sentences.

You have finally finished reading this article. Congratulations.

You should be proud of having been patient, and you passed the most tiresome part of learning Transformer model. You must be ready for making a toy English-German translator in the upcoming articles. Also I am sure you have understood that Michael Bay is a great director, no matter what people say.

[References]

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need” (2017)

[2] “Transformer model for language understanding,” Tensorflow Core
https://www.tensorflow.org/overview

[3] “Neural machine translation with attention,” Tensorflow Core
https://www.tensorflow.org/tutorials/text/nmt_with_attention

[4] Jay Alammar, “The Illustrated Transformer,”
http://jalammar.github.io/illustrated-transformer/

[5] “Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention,” stanfordonline, (2019)
https://www.youtube.com/watch?v=5vcj8kSwBCY

[6]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 191-193

[7]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)
https://www.youtube.com/watch?v=XXtpJxZBa2c

[8]Rosemary Rossi, “Anthony Hopkins Compares ‘Genius’ Michael Bay to Spielberg, Scorsese,” yahoo! entertainment, (2017)
https://www.yahoo.com/entertainment/anthony-hopkins-transformers-director-michael-bay-guy-genius-010058439.html

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.