Tag Archive for: Python

Why using Infrastructure as Code for developing Cloud-based Data Warehouse Systems?

In the contemporary age of Big Data, Data Warehouse Systems and Data Science Analytics Infrastructures have become an essential component for organizations to store, analyze, and make data-driven decisions. With the evolution of cloud computing, many organizations are now migrating their Data Warehouse Systems to the cloud for better scalability, flexibility, and cost-efficiency. Infrastructure as Code (IaC) can be a game-changer in this scenario. By automating the provisioning and management of cloud resources through code, IaC brings a host of advantages to the development and maintenance of Data Warehouse Systems in the cloud.

So why using IaC for Cloud Data Infrastructures?

Of course you – as a human user – can always login into the admin portal of any cloud provider and manually get your resources like SQL databases, ETL tools, Virtual Networks and tools like Synapse, snowflake, BigQuery or Databrikcs in place by clicking on the right buttons….. But here is why you should better follow the idea of having your code explaining which resources are in what order in place in your cloud:

Version Control for your Cloud Infrastructure

One of the primary advantages of using IaC is version control for your Data Warehouse – or Data Lakehouse – Architecture. Whether you’re using Redshift, Snowflake, or any other cloud-based data warehouse solutions, you can codify your architecture settings, allowing you to track changes over time. This ensures a reliable and consistent development environment and makes it easier to identify issues, rollback updates, or replicate the architecture for other projects.

Scalability Tailored for Data Needs

Data Warehouse Systems often require to scale quickly to handle larger datasets or more queries. Traditional manual scaling methods are cumbersome and slow. IaC allows for efficient auto-scaling based on real-time needs. You can write scripts to automatically provision or de-provision resources depending on your data workloads, making your data warehouse highly adaptive to your organization’s changing requirements.

Cost-Efficiency in Resource Allocation

Cloud resources are priced based on usage, so efficient allocation is crucial for managing costs. IaC enables precise control over cloud resources, allowing you to turn them off when not in use or allocate more resources during peak times. For Data Warehouse Systems that often require powerful (and expensive) computing resources, this level of control can translate into significant cost savings.

Streamlined Collaboration Among Teams

Data Warehouse Systems in the cloud often involve cross-functional teams — data engineers, data scientists, and system administrators. IaC allows these teams to collaborate more effectively. Everyone works with the same infrastructure configurations, reducing discrepancies between development, staging, and production environments. This ensures that the data models and queries developed by data professionals are consistent with the underlying infrastructure.

Enhanced Security and Compliance

Data Warehouses often store sensitive information, making security a paramount concern. IaC allows security configurations to be codified and automated, ensuring that every new resource or service deployed complies with organizational and regulatory guidelines. This proactive security approach is particularly beneficial for industries that have to adhere to strict compliance rules like HIPAA or GDPR.

Reliable Environment for Data Operations

Manual configurations are prone to human error, which can compromise the reliability of a Data Warehouse System. IaC mitigates this risk by automating repetitive tasks, ensuring that the infrastructure is consistently provisioned. This brings reliability to data ETL (Extract, Transform, Load) processes, query performances, and other critical data operations.

Documentation and Disaster Recovery Made Easy

Data is the lifeblood of any organization, and losing it can be catastrophic. IaC allows for swift disaster recovery by codifying the entire infrastructure. If a disaster occurs, the infrastructure can be quickly recreated, reducing downtime and data loss.

Most common IaC solutions

The most common tools for creating Cloud Infrastructure as Code are probably Terraform and Pulumi. However, IaC solutions can be very different in their concepts. For example: While Terraform is a pure declarative configuration language that just describes how the infrastructure will look like (execution then by the Terraform-supporting Cloud Provider), Pulumi on the other hand will execute the deployment by a programming language iteratively deploying the wished cloud resources (e.g. using for loops in Python). While executing Pulumi in any supported programming language like Python or C#, Pulumi generates declarative Infrastructure build plans for the Cloud. Any IaC solution is declaring how the infrastrcture looks like.


Terraform is one of the most widely used Infrastructure as Code (IaC) tools, developed by HashiCorp. It enables users to define and provision a data center infrastructure using a declarative configuration language known as HashiCorp Configuration Language (HCL).

The following Terraform script will create an Azure Resource Group, a SQL Server, and a SQL Database. It will also output the fully qualified domain name (FQDN) of the SQL Server, which you can use to connect to the database:

The HCL code needs to be placed into the Terrafirm main.tf file. Of course, Terraform and the Azure CLI needs to be installed before.


Pulumi is a modern Infrastructure as Code (IaC) tool that sets itself apart by allowing infrastructure to be defined using general-purpose programming languages like Python, TypeScript, Go, and C#.

Example of a Pulumi Python script creating a SQL Database on Microsoft Azure Cloud:

Running the script will need the installation of Python, Pulumi and the Azure CLI.

Cloud Provider specific IaC Solutions

Cloud providers might come up with their own IaC solutions, here are the probably most common ones:

Microsoft Azure Bicep is an open-source domain-specific language (DSL) developed by Microsoft, aimed at simplifying the process of deploying Azure resources. It serves as a declarative alternative to JSON for writing Azure Resource Manager (ARM) templates. Bicep compiles down to ARM templates, offering a more concise syntax and easier tooling while leveraging the proven, underlying ARM deployment engine.

AWS CloudFormation is a service offered by Amazon Web Services (AWS) that allows you to define cloud infrastructure in JSON or YAML templates.

Google Cloud Deployment Manager is quite similar to AWS CloudFormation but tailored for Google Cloud Platform (GCP), it allows you to define and deploy resources using YAML or Python templates.

IaC Tools for Server Configuration

There are many other IaC solutions and some of them are more focused on configuration of servers. In common they offer software provisioning as well and a lot detailing in regards to micro-configuration of single applications running on the server.

The most common IaC software for Server Configuration might be Ansible, a YAML-based configuration management tool that uses an agentless architecture. It’s easy to set up and widely used for automating tasks like software provisioning and configuration management. Puppet, Chef and SaltStack are further alternatives and master-agent architecture-based.

Other types of IaC Solutions

IaC solutions with a more narrow focus are e.g. Vagrant as a primarily used IaC tool for setting up virtual development environments, especially for the automation of VM (Virtual Machine) provisioning. The widely used Docker Compose is a tool for defining and running multi-container Docker applications, which can be defined using YAML files.

Furthermore we have tools that are working closely together with IaC tooling, e.g. Prometheus as an open-source monitoring toolkit often used in conjunction with other IaC tools for monitoring deployed resources.


Infrastructure as Code significantly enhances the development and maintenance of Cloud-based Data Infrastructures. From versioning your warehouse architecture and scaling resources according to real-time data needs, to facilitating team collaboration and ensuring security compliance, IaC serves as a foundational technology that brings agility, reliability, and cost-efficiency. As organizations continue to realize the importance of data-driven decision-making, leveraging IaC for cloud-based Data Warehouse Systems will likely become a best practice in data engineering and infrastructure management.

Data Science – Weiterbildungen mit Coursera


Data Science und AI sind aufstrebende Arbeitsfelder, die sich mit der Gewinnung von Wissen aus Daten beschäftigen. Die Nachfrage nach Fähigkeiten im Bereich Data Science, aber auch in angrenzenden Bereichen wie Data Engineering oder Data Analytics, ist in den letzten Jahren explodiert, da Unternehmen versuchen, die Vorteile von Big Data und künstlicher Intelligenz (KI) zu nutzen. Es lohnt sich sehr, sich in diesen Bereich weiter zu entwickeln. Dafür eignen sich die Kurse von Coursera.org.

Online-Kurse lohnen sich dann, wenn eine Karriere im Bereich der Datenanalyse oder des maschinellen Lernens angestrebt oder einfach nur ihr Wissen in diesem Bereich erweitert werden soll.

Spezialisierungskurs – Google Data Analytics

Data Science hilft dabei, Entscheidungen auf Basis von Daten zu treffen, komplexe Probleme effektiver zu lösen und Karrierechancen zu verbessern. Die Tools von Google Cloud und Jupyter Notebook sind dafür geeignet, da sie eine leistungsstarke und skalierbare Infrastruktur sowie eine interaktive Entwicklungsplattform bieten.

Google Data Analytics Zertifikatskurs

Das Google Zertifikat für Datenanalyse behandelt neben dem Handwerkszeug für jeden Data Analyst – wie etwa SQL – auch die notwendige Datenbereinigung und Datenvisualisierung mit den Tools von Google. Es werden weder Erfahrung noch Vorkenntnisse vorausgsetzt.

Spezialisierungskurs – Google Advanced Data Analytics

Der Zertifikatskurs der erweiterten Datenanalyse von Google baut auf dem zuvorgenannten Data Analytics Kurs auf, kann jedoch auch direkt besucht werden. Hier werden grundlegende Fähigkeiten wie SQL vorausgesetzt und vertiefende Fähigkeiten vermittelt, die für einen Data Analysten nützlich sind und auch in die Data Science eintauchen.

Google Advanced Data Analytics
Dieses Kursangebot zum Aufbau erweiterter Datenanalyse-Fähigkeiten von Coursera wird ebenfalls von Google angeboten. Hier werden die Tools der Datenanalyse sowie der statistischen Handwerkzeuge für Data Science eingeführt, bis hin zum ersten Einstieg in Machine Learning.

Spezialisierungskurs – SQL für Data Science (Generalistisch)

SQL ist wichtig für etablierte und angehende Data Scientists, da es eine grundlegende Technologie für die Arbeit mit Datenbanken und relationalen Datenbankmanagementsystemen ist. SQL für Data Science ermöglicht, Daten effektiv zu organisieren und schnell Abfragen zu erstellen, um Antworten auf komplexe Fragen zu finden. Es ist auch relevant für die Arbeit mit nicht-relationalen Datenbanken und hilft Data Scientists, wertvolle Erkenntnisse aus großen Datenmengen zu gewinnen.

Auch wenn Python als Skill für einen Data Scientist ganz vorne steht, ist eine Karriere als Data Scientist ohne SQL-Kenntnisse nicht vorstellbar und dieser Kurs daher der richtige, wenn Nachbolbedarf besteht.

Spezialisierungskurs – Data Analyst Zertifikat (IBM)

Eine Karriere als Data Analyst ist attraktiv, da ihr eine hohe Nachfrage am Arbeitsmarkt gegenüber steht, die Arbeit vielfältig und herausfordernd ist, viele Weiterentwicklungsmöglichkeiten (z. B. zum Data Scientist) bietet und oft flexibel ist.

Der Online-Kurs von IBM bietet die Ausbildung der beruflichen Qualifikation zum Data Analyst. Ein weiterer Vorteil dieses Kurses ist, dass er für alle geeignet ist – unabhängig von ihrem Hintergrund oder der Vorbildung. Es sind keine Abschlüsse oder Vorkenntnisse erforderlich, was bedeutet, dass jeder, der sich für das Thema interessiert, am Kurs teilnehmen und von ihm profitieren kann.

Spezialisierungskurs – Datenverarbeitung mit Python & SQL (IBM)

Dieser Kurs bietet den Teilnehmern die Möglichkeit, ihre Kenntnisse in der Datenverarbeitung zu verbessern, eine Programmiersprache wie Python zu erlernen und grundlegende Kenntnisse in SQL zu erwerben. Diese Fähigkeiten sind für die Arbeit mit Daten unerlässlich und in der heutigen Arbeitswelt sehr gefragt. Darüber hinaus bietet der Kurs für Datenverarbeitung mit Python und SQL auch Schulungen zur Analyse und Visualisierung von Daten sowie zur Erstellung von Modellen für Maschinelles Lernen. Diese Fähigkeiten sind besonders wertvoll für die Entwicklung von Anwendungen und Systemen im Bereich der KI.

Dieser Kurs ist eine großartige Möglichkeit für alle, die ihre Kenntnisse im Bereich der Datenverarbeitung und des maschinellen Lernens verbessern möchten. Zwar werden auch hier keine Vorkenntnisse vorausgesetzt, jedoch geht der Kurs inhaltlich mehr in die Richtung Data Science als der zuvorgenannte Kurs zum Data Analyst und bietet ein umfassendes Training und Schulungen zu grundlegenden Fähigkeiten, die in der heutigen Arbeitswelt gefragt sind, und ist für jeden zugänglich, unabhängig von Hintergrund oder Erfahrung.

Spezialisierungskurs – Maschinelles Lernen (DeepLearning.AI)

Das Erlernen der Grundlagen des maschinellen Lernens (Machine Learning) ist von großer Bedeutung, da es eine der am schnellsten wachsenden und wichtigsten Technologien in der heutigen Zeit ist. Maschinelles Lernen ermöglicht es Computern, aus Erfahrung zu lernen, ohne explizit programmiert zu werden. Die Teilnehmer lernen, dem Computer das lernen zu ermöglichen.

Machinelles Lernen ist der Schlüssel zur Entwicklung von Anwendungen und Systemen im Bereich der künstlichen Intelligenz (KI) und hat Anwendungen in vielen Bereichen, von der Gesundheitsversorgung und der Finanzindustrie bis hin zur Unterhaltungsbranche und der Automobilindustrie.

Der Kurs für Maschinelles Lernen ist nicht nur ein sinnvoller Einstieg in diese Materie, sondern kann darauf aufbauend mit dem Thema Deep Learning in der Qualifikation erweitert werden.

Spezialisierungskurs – Deep Learning (DeepLearning.AI)

Das Verständnis von Deep Learning ist wichtig, da es eine Unterkategorie des maschinellen Lernens ist und viele noch mächtigere Anwendungen in verschiedenen Bereichen hat. Die populäre Applikation ChatGPT ist ein Produkt des Deep Learning. Deep Learning kann mit AI gleichgesetzt werden. Es ist eine gefragte Fähigkeit auf dem Arbeitsmarkt mit Job-Garantie.

Der Spezialisierungskurs für Deep Learning steht unabhängig für sich und erfordert keine speziellen Vorkenntnisse, darf jedoch auch als sinnvolle Ergänzung zum vorgenannten Einführungskurs in Machine Learning betrachtet werden.

Weitere Kursangebote für Data & AI auf Coursera

Die Entscheidung für ein bestimmtes Thema eines Kurses in den Bereichen Data Analytics, Data Science und AI ist eine persönliche und abhängig von den eigenen Vorkenntnissen und Vorlieben, sowie den eigenen Karrierezielen. Für die Karriere des Data Analyst sind SQL sowie allgemeine Kenntnisse rund um Data Analytics bzw. Datenverarbeitung wichtig. Von einem Data Scientist wird ferner erwartet, die theoretischen Grundlagen sowie die praktische Anwendung von Machine Learning und Deep Learning als trainierte Fähigkeit abrufbar zu haben.

Weitere Kurse von Coursera zum Thema Data & AI (link).

Dieser Artikel wurde gesponsored von Coursera.

7 Gründe, warum es sich jetzt lohnt, Python zu lernen

Hot Skill: Python

7 Gründe, warum es sich jetzt lohnt, Python zu lernen

Die digitale Transformation nimmt Fahrt auf und stellt sowohl Arbeitgeber:innen als auch Arbeitnehmer:innen vor neue Herausforderungen. Um mit dieser Entwicklung Schritt zu halten, lohnt es sich, auf den Zug aufzuspringen und das eigene Portfolio um wichtige Schlüsselkompetenzen zu erweitern. Doch in der heutigen Zeit, wo täglich mehr Lernoptionen und -angebote auf den Markt drängen, ist es besonders wichtig, die eigene, knappe Zeit in die richtigen, zukunftsträchtigen Fähigkeiten zu investieren.

Infolge des rasanten, digitalen Wandels haben sich neue, wichtige Qualifikationen herauskristallisiert, die sich langfristig für Lernwillige auszahlen. Insbesondere technische Fähigkeiten werden von Unternehmen dringend benötigt, um den eigenen Marktanteil zu verteidigen. Unter allen möglichen Qualifikationen hat sich eine bestimmte Fähigkeit in den letzten Jahren von vielversprechend zu unverzichtbar gemausert: Die Programmiersprache Python. Denn Python ist insbesondere in den vergangenen fünf Jahren dem Image des Underdogs entwachsen und hat sich zum Champion unter den Tech-Skills entwickelt.

Wer jetzt denkt, dass Python als Programmiersprache nur für ITler und Tech Nerds lohnenswert ist: Weit gefehlt! Viele Unternehmen beginnen gerade erst die wahren Möglichkeiten von Big Data und künstlicher Intelligenz zu erschließen und Führungskräfte suchen aktiv nach Mitarbeiter:innen, die in der Lage sind, diese Transformation durch technische Fähigkeiten zu unterstützen. Wenn Sie sich in diesem Jahr weiterentwickeln möchten und nach einer Fähigkeit Ausschau halten, die Ihre Karriere weiter voranbringt und langfristig sichert, dann ist dies der ideale Zeitpunkt für Sie, sich mit Python weiterzuqualifizieren.

Nicht nur für Schlangenbeschwörer: Warum es sich jetzt lohnt, Python zu lernen

Falls Sie bei dem Wort Python eher an glänzende Schuppen denken als an Programmcode, dann lassen Sie uns Ihnen etwas Kontext geben: Python ist eine Programmiersprache, die für die Entwicklung von Software genutzt wird. Als serverseitige Sprache ist sie die Logik und das Fundament hinter Benutzereingaben und der Interaktion von Datenbanken mit dem Server. Python ist Open-Source, kostenlos und kann von jedem benutzt und verändert werden, weshalb ihre Verwendung besonders in der Datenwissenschaft sehr beliebt ist. Nicht zuletzt lebt Python von seiner Community, einer engagierten Gemeinschaft rund um die Themen künstliche Intelligenz, maschinelles Lernen, Datenanalyse und -modellierung, mit umfangreichen Ressourcen und über 137.000 Bibliotheken wie TensorFlow, Scikit-learn und Keras.

In der Data Science wird Python verwendet, um große Mengen komplexer Daten zu analysieren und aus ihnen relevante Informationen abzuleiten. Lohnt es sich also, Python zu lernen? Absolut! Laut der Stack Overflow Developer Survey wurde Python 2020 als die drittbeliebteste Technologie des Jahres eingestuft. Sie gilt als eine der angesagtesten Fähigkeiten und als beliebteste Programmiersprache in der Welt nach Angaben des PYPL Popularität der Programmiersprache Index. Wir haben 7 Gründe zusammengefasst, warum es sich jetzt lohnt, Python zu lernen:.

1. An Vielseitigkeit kaum zu übertreffen

Python ist ein wahrer Allrounder unter den Hard Skills! Ein wesentlicher Vorteil von Python ist, dass es in einer Vielzahl von Fachbereichen eingesetzt werden kann. Die häufigsten Bereiche, in denen Python Verwendung findet, sind u. a.:

  • Data Analytics & Data Science
  • Mathematik
  • Web-Entwicklung
  • Finanzen und Handel
  • Automatisierung und künstliche Intelligenz
  • Spieleentwicklung

2. Zahlt sich mehrfach aus

Diejenigen, für die sich eine neue Fähigkeit doppelt lohnen soll, liegen mit Python goldrichtig. Python-Entwickler:innen zählen seit Jahren zu den Bestbezahltesten der Branche. Und auch Data Scientists, für deren Job Python unerlässlich ist, liegen im weltweiten Gehaltsrennen ganz weit vorn. Die Nachfrage nach Python-Entwickler:innen ist hoch – und wächst. Und auch für andere Abteilungen wird die Fähigkeit immer wertvoller. Wer Python beherrscht, wird nicht lange nach einem guten Job Ausschau halten müssen. Unter den Top 10 der gefragtesten Programmier-Skills nach denen Arbeitgeber:innen suchen, liegt Python auf Platz 7. Die Arbeitsmarktaussichten sind also hervorragend.

3. Schnelle Erfolge auch für Neulinge

2016 war das schillernde Jahr, in dem Python Java als beliebteste Sprache an US-Universitäten ablöste und seitdem ist die Programmiersprache besonders unter Anfänger:innen sehr beliebt. In den letzten Jahren konnte Python seine Pole Position immer weiter ausbauen. Und das mit gutem Grund: Python ist leicht zu erlernen und befähigt seine Nutzer:innen dazu, eigene Webanwendungen zu erstellen oder simple Arbeitsabläufe zu automatisieren. Dazu bringt Python eine aufgeräumte und gut lesbare Syntax mit, was sie besonders einsteigerfreundlich macht. Wer mit dem Programmieren anfängt, will nicht mit einer komplizierten Sprache mit allerhand seltsamen Ausnahmen starten. Mit Python machen Sie es sich einfach und sind dennoch effektiv. Ein Doppelsieg!

4. Ideal für Zeitsparfüchse

Mit der Python-Programmierung erwarten Sie nicht nur schnelle Lernerfolge, auch Ihre Arbeit wird effektiver und damit schneller. Im Gegensatz zu anderen Programmiersprachen, braucht die Entwicklung mit Python weniger Code und damit weniger Zeit. Für alle Fans von Effizienz ist Python wie gemacht. Und sie bietet einen weiteren großen Zeitbonus. Unliebsame, sich wiederholende Aufgaben können mithilfe von Python automatisiert werden. Wer schon einmal Stunden damit verbracht hat, Dateien umzubenennen oder Hunderte von Tabellenzeilen zu aktualisieren, der weiß, wie mühsam solche Aufgaben sein können. Umso schöner, dass diese Aufgaben von jetzt an von Ihrem Computer erledigt werden könnten.

5. Über den IT-Tellerrand hinaus

Ob im Marketing, Sales oder im Business Development, Python hat sich längst aus seiner reinen IT-Ecke heraus und in andere Unternehmensbereiche vorgewagt. Denn auch diese Abteilungen stehen vor einer Reihe an Herausforderungen, bei denen Python helfen kann: Reporting, Content-Optimierung, A/B-Tests, Kundensegmentierung, automatisierte Kampagnen, Feedback-Analyse und vieles mehr. Mit Python können Erkenntnisse aus vorliegenden Daten gewonnen werden, besser informierte, datengetriebene Entscheidungen getroffen werden, viele Routineaktivitäten automatisiert und der ROI von Kampagnen erhöht werden.

6. Programmieren für Big Player

Wollten Sie schon immer für einen Tech-Giganten wie Google oder Facebook arbeiten? Dann könnte Python Ihre goldene Eintrittskarte sein, denn viele große und vor allem technologieaffine Unternehmen wie YouTube, IBM, Dropbox oder Instagram nutzen Python für eine Vielzahl von Zwecken und sind immer auf der Suche nach Nachwuchstalenten. Dropbox verwendet Python fast für ihr gesamtes Code-Fundament, einschließlich der Analysen, der Server- und API-Backends und des Desktop-Clients. Wenn Sie Ihrem Lebenslauf einen großen Namen hinzufügen wollen, sollte Python auf demselben Blatt zu finden sein.

7. Ein Must-Have für Datenprofis

Besonders Pythons Anwendung in der Datenwissenschaft und im Data Engineering treibt seine Popularität in ungeahnte Höhen. Aber was macht Python so wichtig für Data Science und Machine Learning? Lange Zeit wurde R als die beste Sprache in diesem Spezialgebiet angesehen, doch Python bietet für die Data Science zahlreiche Vorteile. Bibliotheken und Frameworks wie PyBrain, NumPy und PyMySQL für KI sind wichtige Argumente. Außerdem können Skripte erstellt werden, um einfache Prozesse zu automatisieren. Das macht den Arbeitsalltag von Datenprofis besonders effizient.

Investieren Sie in Ihre berufliche Zukunft und starten Sie jetzt Ihre Python-Weiterbildung! Egal, ob Programmier-Neuling oder Data Nerd: Die Haufe Akademie bietet die passende Weiterbildung für Sie: spannende Online-Kurse für Vollberufstätige und Schnelldurchläufer:innen im Bereich Python, Daten und künstliche Intelligenz.

In Kooperation mit stackfuel.



Coding Nomads: “Why Learn Python? 6 Reasons Why it’s So Hot Right Now.” [11.06.2021]

Air Quality Forecasting Python Project

You will find the full python code and all visuals for this article here in this gitlab repository. The repository contains a series of analysis, transforms and forecasting models frequently used when dealing with time series. The aim of this repository is to showcase how to model time series from the scratch, for this we are using a real usecase dataset

This project forecast the Carbon Dioxide (Co2) emission levels yearly. Most of the organizations have to follow government norms with respect to Co2 emissions and they have to pay charges accordingly, so this project will forecast the Co2 levels so that organizations can follow the norms and pay in advance based on the forecasted values. In any data science project the main component is data, for this project the data was provided by the company, from here time series concept comes into the picture. The dataset for this project contains 215 entries and two components which are Year and Co2 emissions which is univariate time series as there is only one dependent variable Co2 which depends on time. from year 1800 to year 2014 Co2 levels were present in the dataset.

The dataset used: The dataset contains yearly Co2 emmisions levels. data from 1800 to 2014 sampled every 1 year. The dataset is non stationary so we have to use differenced time series for forecasting.

After getting data the next step is to analyze the time series data. This process is done by using Python. The data was present in excel file so first we need to read that excel file. This task is done by using Pandas which is python libraries to creates Pandas Data Frame. After that preprocessing like changing data types of time from object to DateTime performed for the coding purpose. Time series contain 4 main components Level, Trend, Seasonality and Noise. To study this component, we need to decompose our time series so that we can batter understand our time series and we can choose the forecasting model accordingly because each component behave different on the model. also by decomposing we can identify that the time series is multiplicative or additive.

CO2 emissions – plotted via python pandas / matplotlib

Decomposing time series using python statesmodels libraries we get to know trend, seasonality and residual component separately. the components multiply together to make the time series multiplicative and in additive time series components added together. Taking the deep dive to understand the trend component, moving average of 10 steps were applied which shows nonlinear upward trend, fit the linear regression model to check the trend which shows upward trend. talking about seasonality there were combination of multiple patterns over time period which is common in real world time series data. capturing the white noise is difficult in this type of data. the time series contains values from 1800 where the Co2 values are less then 1 because of no human activities so levels were decreasing. By the time numbers of industries and human activities are rapidly increasing which causes Co2 levels rapidly increasing. In time series the highest Co2 emission level was 18.7 in 1979. It was challenging to decide whether to consider this values which are less then 0.5 as white noise or not because 30% of the Co2 values were less then 1, in real world looking at current scenario the chances of Co2 emission level being 0 is near to impossible still there are chances that Co2 levels can be 0.0005. So considering each data point as a valuable information we refused to remove that entries.

Next step is to create Lag plot so we can see the correlation between the current year Co2 level and previous year Co2 level. the plot was linear which shows high correlation so we can say that the current Co2 levels and previous levels have strong relationship. the randomness of the data were measured by plotting autocorrelation graph. the autocorrelation graph shows smooth curves which indicates the time series is nonstationary thus next step is to make time series stationary. in nonstationary time series, summary statistics like mean and variance change over time.

To make time series stationary we have to remove trend and seasonality from it. Before that we use dickey fuller test to make sure our time series is nonstationary. the test was done by using python, and the test gives pvalue as output. here the null hypothesis is that the data is nonstationary while alternate hypothesis is that the data is stationary, in this case the significance values is 0.05 and the pvalues which is given by dickey fuller test is greater than 0.05 hence we failed to reject null hypothesis so we can say the time series is nonstationery. Differencing is one of the techniques to make time series stationary. On this time series, first order differencing technique applied to make the time series stationary. In first order differencing we have to subtract previous value from current value for all the data points. also different transformations like log, sqrt and reciprocal were applied in the context of making the time series stationary. Smoothing techniques like simple moving average, exponential weighted moving average, simple exponential smoothing and double exponential smoothing techniques can be applied to remove the variation between time stamps and to see the smooth curves.

Smoothing techniques also used to observe trend in time series as well as to predict the future values. But performance of other models was good compared to smoothing techniques. First 200 entries taken to train the model and remaining last for testing the performance of the model. performance of different models measured by Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) as we are predicting future Co2 emissions so basically it is regression problem. RMSE is calculated by root of the average of squared difference between actual values and predicted values by the model on testing data. Here RMSE values were calculated using python sklearn library. For model building two approaches are there, one is datadriven and another one is model based. models from both the approaches were applied to find the best fitted model. ARIMA model gives the best results for this kind of dataset as the model were trained on differenced time series. The ARIMA model predicts a given time series based on its own past values. It can be used for any nonseasonal series of numbers that exhibits patterns and is not a series of random events. ARIMA takes 3 parameters which are AR, MA and the order of difference. Hyper parameter tuning technique gives best parameters for the model by trying different sets of parameters. Although The autocorrelation and partial autocorrelation plots can be use to decide AR and MA parameter because partial autocorrelation function shows the partial correlation of a stationary time series with its own lagged values so using PACF we can decide the value of AR and from ACF we can decide the value of MA parameter as ACF shows how data points in a time series are related.

Yearly difference of CO2 emissions – ARIMA Prediction

Apart from ARIMA, few other model were trained which are AR, ARMA, Simple Linear Regression, Quadratic method, Holts winter exponential smoothing, Ridge and Lasso Regression, LGBM and XGboost methods, Recurrent neural network (RNN) Long Short Term Memory (LSTM) and Fbprophet. I would like to mention my experience with LSTM here because it is another model which gives good result as ARIMA. the reason for not choosing LSTM as final model is its complexity. As ARIMA is giving appropriate results and it is simple to understand and requires less dependencies. while using lstm, lot of data preprocessing and other dependencies required, the dataset was small thus we used to train the model on CPU, otherwise gpu is required to train the LSTM model. we face one more challenge in deployment part. the challenge is to get the data into original form because the model was trained on differenced time series, so it will predict the future values in differenced format. After lot of research on the internet and by deeply understanding mathematical concepts finally we got the solution for it. solution for this issue is we have to add previous value from the original data from into first order differencing and then we have to add the last value of this time series into predicted values. To create the user interface streamlit was used, it is commonly used python library. the pickle file of the ARIMA model were used to predict the future values based on user input. The limit for forecasting is the year 2050. The project was uploaded on google cloud platform. so the flow is, first the starting year from which user want to forecast was taken and the end year till which year user want to forecast was taken and then according to the range of this inputs the prediction takes place. so by taking the inputs the pickle file will produce the future Co2 emissions in differenced format, then the values will be converted to original format and then the original values will be displayed on the user interface as well as the interactive line graph were displayed on the interface.

You will find the full python code and all visuals for this article here in this gitlab repository.

What is Portfolio Risk Management in Python?

Data science is a crucial industry, with multiple processes today relying on it. One of its more helpful and intriguing applications is in investing, where it helps investors make more informed decisions. Practices like portfolio management in Python help take the guesswork out of this notoriously risky undertaking.

Investing is a complicated science, making it hard to do well. Some estimates hold that as much as 90% of people lose money in stocks. While stock trading will always involve some risk, Python-based portfolio management can help.

What Is Portfolio Management in Python?

Portfolio management is the process of planning, making and overseeing investments to meet your long-term investment goals. Portfolio management in Python uses data science to analyze risks and rewards to make the best investment decisions.

Since the future is uncertain, buying stocks is inherently risky, but some assets are riskier than others. For example, since many companies are trying to reach carbon neutrality by 2050, investing in sustainable technologies is a fairly sound strategy. However, that doesn’t guarantee that every eco-friendly startup will succeed, so investors need to consider more factors.

Some data scientists have found that you can use Python to understand these factors better. By plugging various figures into a Python equation, investors can chart potential risks and returns to find the best investments.

How Does Python Portfolio Management Work?

Portfolio risk management in Python operates on a principle called Modern Portfolio Theory (MPT). MPT helps investors find an optimal mix of high-risk, high-return investments and low-risk, low-return ones based on their risk tolerance. Investors can either look for the highest returns at a certain risk level or look for the lowest risk to get a certain return.

To apply this in Python, data scientists create one list for portfolio returns, one for risk and one for weights, or how much each investment accounts for the overall portfolio. They then randomly generate weight for the assets, then normalize it to sum to a value of one.

Data scientists then calculate the risks and returns for each asset and plug them into the different randomly generated weights. This will produce a list of various scenarios, showing how much overall risk and reward each portfolio would have.

Investors can then look at this list to see how much of each asset they should include in their portfolio. They can either use the mix that produces the greatest return or the one with the lowest risk.

Why Does It Matter?

Using Python for portfolio risk management helps remove a lot of the guesswork from investing. Running these calculations gives investors multiple scenarios to choose from, helping them find the best portfolio strategy for their needs and goals.

This presents a promising opportunity for data scientists. Data analytics are quickly becoming an essential part of the stock market. Algorithmic trading, which applies data and AI to MPT, already accounts for 60 to 73% of all U.S. equity trading. Portfolio management in Python could help more data scientists capitalize on this trend.

This practice is a relatively straightforward way to apply data science to stock trading. Data scientists that can make the most of that opportunity stand to make a name for themselves in investing circles.

Python Portfolio Management Can Maximize Returns

In the past, stock trading was almost akin to gambling, involving huge amounts of risk. While portfolio management in Python doesn’t remove volatility from the stock market, it helps put it in perspective. Investors can then make safer, more informed decisions to meet their investing goals.

Python-based portfolio management stands as a natural intersection between data science and stock trading. As a result, it can help both data scientists and investors achieve new success.

Data Science mit Python - Buchempfehlung 2021

Data Science mit Python – Aktuelle Buchempfehlungen

Als Dozent für Data Science und Python Programmierung für Hochschulen und Unternehmen (Mitarbeiter-Training) werde ich natürlich immer wieder zu Literatur-Empfehlungen in deutscher Sprache gefragt. Aus aktuellem Anlass gebe ich hiermit eine Empfehlung von Büchern, die ich auch für meine Trainingserklärungen und -beispiele verwende oder einfach generell empfehlen kann.

Das Buch Praktische Statistik für Data Scientists: 50+ essenzielle Konzepte mit R und Python (Animals) ist aktuell eines meiner Lieblinge unter den Büchern, die Statistik methodisch nicht zu trocken, aber auch nicht zu beispielorientiert erklären, sondern eine flüssig lesbare Erläuterung zu den wichtigsten Prinzipien der Statistik von der deskriptiven, induktiven und explorativen Statistik bis hin zu Machine Learning bieten. Dazu gibt es Programmiercode in R und Python, was ich an dieser Stelle eher bemängle als bewundere. Dennoch ein sehr ordentlich geschriebenes und beinahe flüssig lesbares Buch mit tollen Erklärungen.



Das Buch Einführung in Data Science: Grundprinzipien der Datenanalyse mit Python (Animals) kenne ich nur aus der ersten Auflage, die zweite wird jedoch sicher nicht schlechter sein. Dieses Buch sticht mit seiner Methodenorientiertheit hervor, denn hier geht es um die Erläuterung von Prinzipien der Data Science (Statistik, Machine Learning) mit Python, jedoch ohne besonders auf bestehende Bibliotheken zu setzen. Es geht um die Grundprinzipien der Data Science mit didaktischem Mehrwert und verleitet ein Gefühl dafür, wie die Algorithmen funktionieren.



Wer ganz auf das Wissen rund um Machine Learning setzen möchte, liegt mit dem Machine Learning mit Python und Keras, TensorFlow 2 und Scikit-Learn: Das umfassende Praxis-Handbuch für Data Science, Deep Learning und Predictive Analytics (mitp Professional) richtig. Es setzt hingegen sehr auf die Nutzung der Bibliotheken Scikit-Learn und Tensorflow, erklärt dabei die Verfahrensweise von Lernalgorithmen der Klassifikation und Regression sowie des unüberwachten maschinellen Lernens recht ausführlich und mit sehr erklärenden Abbildungen. Insbesondere wird hier auf die grundlegenden Prinzipien des Deep Learnings vom MLP zum CNN eingegangen. Es schlägt die Brücke von Python für Machine Learning zu Python für Deep Learning.


Wenn es schnell gehen soll mit dem Einstieg in Machine Learning mit Python, könnte Data Science mit Python: Das Handbuch für den Einsatz von IPython, Jupyter, NumPy, Pandas, Matplotlib und Scikit-Learn (mitp Professional) eine gute Wahl sein. Auf besonders ausführliche Erklärungen über die Algorithmen des machinellen Lernens muss man hier weitgehend verzichten, dafür sind die Beispiele, gelöst mit den typischen Python-Bibliotheken sehr umfangreich und sofort anwendbar. Dieses Buch ist etwas mehr eines über die Bibliotheken in Python für Data Science als über die dahinter liegenden Methoden.



Alternativ zum vorgenannten Buch gibt es vom konkurrierendem Verlag Datenanalyse mit Python: Auswertung von Daten mit Pandas, NumPy und IPython (Animals). Dieses eignet sich besonders zum einfachen Erlernen der Funktionsweisen der Methoden und Datenstrukturen in Python Numpy, Pandas und Matplotlib. Die klassische Datenanalyse mit deskriptiver Statistik steht hier mehr im Vordergrund als Machine Learning, sorgt jedoch auch dafür, dass die Datenanalyse mit Python sehr ausführlich erklärt wird. Es ist ebenfalls etwas mehr ein Python-Buch als ein Buch über Verfahrensweisen der Data Science. Es eignet sich meiner Meinung nach besonders gut für Python-Lerner, die es bisher gewohnt waren, Daten in SQL zu analysieren und nun auf Pandas umsteigen möchten.


Alle Buchempfehlungen basieren auf meiner Erfahrung als Dozent. Ich habe alle Bücher intensiv gelesen und genutzt.
Die Links sind sogenannte Affiliate-Links. Wenn Du als Leser auf so einen Affiliate-Link klickst und über diesen Link einkaufst, bekomme ich als Inhaber des Data Science Blogs eine Provision, ohne dass sich der Kaufpreis des Artikels ändert. Ich versichere, dass jegliche Einnahmen nach Steuer zu 100% wieder in den Data Science Blog investiert werden.

Coffee Shop Location Predictor

As part of this article, we will explore the main steps involved in predicting the best location for a coffee shop in Vancouver. We will also take into consideration that the coffee shop is near a transit station, and has no Starbucks near it. Well, while at it, let us also add an extra feature where we make sure the crime in the area is lower.


In this article, we will highlight the main steps involved to predict a location for a coffee shop in Vancouver. We also want to make sure that the coffee shop is near a transit station, and has no Starbucks near it. As an added feature, we will make sure that the crime concentration in the area is low, and the entire program should be implemented in Python. So let’s walk through the steps.

Steps Required

  • Get crime history for the last two years
  • Get locations of all transit stations and Starbucks in Vancouver
  • Check all the transit stations that do not have any Starbucks near them
  • Get all the data regarding crimes near the filtered transit stations
  • Create a grid of all possible coordinates around the transit station
  • Check crime around each created coordinate and display the top 5 locations.

Gathering Data

This covers the first two steps required to get data from the internet, both manually and automatically.

Getting all Crime History

We can get crime history for the past 14 years in Vancouver from here. This data is in raw crime.csv format, so we have to process it and filter out useless data. We then write this processed information on the crime_processed.csv file.

Note: There are 530,653 records of crime in this file

In this program, we will just use the type and coordinate of the crime. There are many crime types, but we have classified them into three major categories namely;

Theft (red), Break and Enter (orange) and Mischief (green)

These all crimes can be plotted on Graph as displayed below.

This may seem very congested and full, so let’s see a closeup image for future references.

Getting Locations of all Rapid Transit Stations

We can get the coordinates of all Transit Stations in Vancouver from here. This dataset has all coordinates of rapid transit stations in three transit lines in Vancouver. There are a total of 23 of them in Vancouver, we can then use it for further processing.

Getting Locations of all Starbucks

The Starbucks data is present here, we can scrape it easily and get the locations of all the Starbucks in Vancouver. We just need the Starbucks that is near transit stations, so we’ll filter out the rest. There are a total 24 Starbucks in Vancouver, and 10 of them are near Transit Stations.

Note: Other than the coordinates of Transit Stations and Starbucks, we also need coordinates and type of the crime.

Transit Stations with no Starbucks

As we have all the data required, now moving to the next step. We need to get to the transit Station locations that have no Starbucks near them. For that we can create an area of particular radius around each Transit Station. Then check all Starbucks locations with respect to them, whether they are within that area or not.

If none of the Starbucks are within that particular Transit Station’s area, we can append it to a list. At the end, we have a list of all Transit locations with no Starbucks near them. There are a total of 6 Transit Stations with no Starbucks near them.

Crime near Transit Stations

Now lets filter out all crime records and get just what we are interested in, which means the crime near Transit stations. For that we will plot an area of specific radius around each of them to see the crimes. These are more than 110,000 crime records.

Crime near located Transit Stations

Now that we have all the Transit Stations that don’t have any Starbucks near them and also the crime near all Transit Stations. So, let’s use this information and get crime near the located Transit Stations. These are about 44,000 crime records.

This may seem correct at first glance, but the points are overlapping due to abundance, so we can create different lists of crimes based on their types.


Break and Enter


Generating all possible coordinates

Now finally, we have all the prerequisites and let’s get to the main task at hand, predicting the best coordinate for the coffee shop.

There may be many approaches to solve this problem, but the one I used in this program is that I will create a grid of all possible locations (coordinates) in the area of 1 km radius around each located transit station.

Initially I generated 1 coordinate for every m, this resulted in 1000,000 coordinates in every km. This is a huge number, and for the 6 located Transit stations, it becomes 6 Million. It may not seem much at first glance because computers can handle such data in a few seconds.

But for location prediction we need to compare each coordinate with crime coordinates. As the algorithm has to check for ~7,000 Thefts, ~19,000 Break ins, and ~17,000 Mischiefs around each generated coordinate. Computing this would want the program to process an estimate of 432.4 Billion times. This sort of execution takes many hours on normal computers (sometimes days).

The solution to this is to create a coordinate for each 10 m area, this results about 10,000 coordinate per km. For the above mentioned number of crimes, the estimated processes will be several Billions. That would significantly reduce the time, but is still not less.

To control this, we can remove the duplicate values in crime coordinates and those which are too close to each other ~1m. Doing so, we are left with just 816 Thefts, 2,654 Break ins, and 8,234 Mischiefs around each generated coordinate.
The precision will not be affected much but the time and computational resources required will be reduced a lot.


Checking Crime near Generated coordinates

Now that we have all the locations, we will start some processing on it and check each coordinate against some constraints. That are respectively;

  1. Filter out Coordinates having Theft near 1 km
    We get 122,000 coordinates with no Thefts (Below merged 1000 to 1)
  2. Filter out Coordinates having Break Ins near 200m
    We get 8000 coordinates with no Thefts (Below merged 1000 to 1)
  3. Filter out Coordinates having Mischief near 200m
    We get 6000 coordinates with no Thefts (Below merged 1000 to 1)
    Now that we have 6 Coordinates of best locations that have passed through all the constraints, we will order them.To order them, we will check their distance from the nearest transit location. The nearest will be on top of the list as the best possible location, then the second and so on. The generated List is;

    1. -123.0419406741792, 49.24824259252004
    2. -123.05887151659479, 49.24327221040713
    3. -123.05287151659476, 49.24327221040713
    4. -123.04994067417924, 49.239242592520064
    5. -123.0419406741792, 49.239242592520064
    6. -123.0409406741792, 49.239242592520064

How can MindTrades help?

MindTrades Consulting Services, a leading marketing agency provides in-depth analysis and insights for the global IT sector including leading data integration brands such as Diyotta. From Cloud Migration, Big Data, Digital Transformation, Agile Deliver, Cyber Security, to Analytics- Mind trades provides published breakthrough ideas, and prompt content delivery. For more information, refer to mindtrades.com.




How to make a toy English-German translator with multi-head attention heat maps: the overall architecture of Transformer

If you have been patient enough to read the former articles of this article series Instructions on Transformer for people outside NLP field, but with examples of NLP, you should have already learned a great deal of Transformer model, and I hope you gained a solid foundation of learning theoretical sides on this algorithm.

This article is going to focus more on practical implementation of a transformer model. We use codes in the Tensorflow official tutorial. They are maintained well by Google, and I think it is the best practice to use widely known codes.

The figure below shows what I have explained in the articles so far. Depending on your level of understanding, you can go back to my former articles. If you are familiar with NLP with deep learning, you can start with the third article.

1 The datasets

I think this article series appears to be on NLP, and I do believe that learning Transformer through NLP examples is very effective. But I cannot delve into effective techniques of processing corpus in each language. Thus we are going to use a library named BPEmb. This library enables you to encode any sentences in various languages into lists of integers. And conversely you can decode lists of integers to the language. Thanks to this library, we do not have to do simplification of alphabets, such as getting rid of Umlaut.

*Actually, I am studying in computer vision field, so my codes would look elementary to those in NLP fields.

The official Tensorflow tutorial makes a Portuguese-English translator, but in article we are going to make an English-German translator. Basically, only the codes below are my original. As I said, this is not an article on NLP, so all you have to know is that at every iteration you get a batch of (64, 41) sized tensor as the source sentences, and a batch of (64, 42) tensor as corresponding target sentences. 41, 42 are respectively the maximum lengths of the input or target sentences, and when input sentences are shorter than them, the rest positions are zero padded, as you can see in the codes below.

*If you just replace datasets and modules for encoding, you can make translators of other pairs of languages.

We are going to train a seq2seq-like Transformer model of converting those list of integers, thus a mapping from a vector to another vector. But each word, or integer is encoded as an embedding vector, so virtually the Transformer model is going to learn a mapping from sequence data to another sequence data. Let’s formulate this into a bit more mathematics-like way: when we get a pair of sequence data \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau _x)}) and \boldsymbol{Y} = (\boldsymbol{y}^{(1)}, \dots, \boldsymbol{y}^{(\tau _y)}), where \boldsymbol{x}^{(t)} \in \mathbb{R}^{|\mathcal{V}_{\mathcal{X}}|}, \boldsymbol{x}^{(t)} \in \mathbb{R}^{|\mathcal{V}_{\mathcal{Y}}|}, respectively from English and German corpus, then we learn a mapping f: \boldsymbol{X} \to \boldsymbol{Y}.

*In this implementation the vocabulary sizes are both 10002. Thus |\mathcal{V}_{\mathcal{X}}|=|\mathcal{V}_{\mathcal{Y}}|=10002

2 The whole architecture

This article series has covered most of components of Transformer model, but you might not understand how seq2seq-like models can be constructed with them. It is very effective to understand how transformer is constructed by actually reading or writing codes, and in this article we are finally going to construct the whole architecture of a Transforme translator, following the Tensorflow official tutorial. At the end of this article, you would be able to make a toy English-German translator.

The implementation is mainly composed of 4 classes, EncoderLayer(), Encoder(), DecoderLayer(), and Decoder() class. The inclusion relations of the classes are displayed in the figure below.

To be more exact in a seq2seq-like model with Transformer, the encoder and the decoder are connected like in the figure below. The encoder part keeps converting input sentences in the original language through N layers. The decoder part also keeps converting the inputs in the target languages, also through N layers, but it receives the output of the final layer of the Encoder at every layer.

You can see how the Encoder() class and the Decoder() class are combined in Transformer in the codes below. If you have used Tensorflow or Pytorch to some extent, the codes below should not be that hard to read.

3 The encoder

*From now on “sentences” do not mean only the input tokens in natural language, but also the reweighted and concatenated “values,” which I repeatedly explained in explained in the former articles. By the end of this section, you will see that Transformer repeatedly converts sentences layer by layer, remaining the shape of the original sentence.

I have explained multi-head attention mechanism in the third article, precisely, and I explained positional encoding and masked multi-head attention in the last article. Thus if you have read them and have ever written some codes in Tensorflow or Pytorch, I think the codes of Transformer in the official Tensorflow tutorial is not so hard to read. What is more, you do not use CNNs or RNNs in this implementation. Basically all you need is linear transformations. First of all let’s see how the EncoderLayer() and the Encoder() classes are implemented in the codes below.

You might be confused what “Feed Forward” means in  this article or the original paper on Transformer. The original paper says this layer is calculated as FFN(x) = max(0, xW_1 + b_1)W_2 +b_2. In short you stack two fully connected layers and activate it with a ReLU function. Let’s see how point_wise_feed_forward_network() function works in the implementation with some simple codes. As you can see from the number of parameters in each layer of the position wise feed forward neural network, the network does not depend on the length of the sentences.

From the number of parameters of the position-wise feed forward neural networks, you can see that you share the same parameters over all the positions of the sentences. That means in the figure above, you use the same densely connected layers at all the positions, in single layer. But you also have to keep it in mind that parameters for position-wise feed-forward networks change from layer to layer. That is also true of “Layer” parts in Transformer model, including the output part of the decoder: there are no learnable parameters which cover over different positions of tokens. These facts lead to one very important feature of Transformer: the number of parameters does not depend on the length of input or target sentences. You can offset the influences of the length of sentences with multi-head attention mechanisms. Also in the decoder part, you can keep the shape of sentences, or reweighted values, layer by layer, which is expected to enhance calculation efficiency of Transformer models.

4, The decoder

The structures of DecoderLayer() and the Decoder() classes are quite similar to those of EncoderLayer() and the Encoder() classes, so if you understand the last section, you would not find it hard to understand the codes below. What you have to care additionally in this section is inter-language multi-head attention mechanism. In the third article I was repeatedly explaining multi-head self attention mechanism, taking the input sentence “Anthony Hopkins admired Michael Bay as a great director.” as an example. However, as I explained in the second article, usually in attention mechanism, you compare sentences with the same meaning in two languages. Thus the decoder part of Transformer model has not only self-attention multi-head attention mechanism of the target sentence, but also an inter-language multi-head attention mechanism. That means, In case of translating from English to German, you compare the sentence “Anthony Hopkins hat Michael Bay als einen großartigen Regisseur bewundert.” with the sentence itself in masked multi-head attention mechanism (, just as I repeatedly explained in the third article). On the other hand, you compare “Anthony Hopkins hat Michael Bay als einen großartigen Regisseur bewundert.” with “Anthony Hopkins admired Michael Bay as a great director.” in the inter-language multi-head attention mechanism (, just as you can see in the figure above).

*The “inter-language multi-head attention mechanism” is my original way to call it.

I briefly mentioned how you calculate the inter-language multi-head attention mechanism in the end of the third article, with some simple codes, but let’s see that again, with more straightforward figures. If you understand my explanation on multi-head attention mechanism in the third article, the inter-language multi-head attention mechanism is nothing difficult to understand. In the multi-head attention mechanism in encoder layers, “queries”, “keys”, and “values” come from the same sentence in English, but in case of inter-language one, only “keys” and “values” come from the original sentence, and “queries” come from the target sentence. You compare “queries” in German with the “keys” in the original sentence in English, and you re-weight the sentence in English. You use the re-weighted English sentence in the decoder part, and you do not need look-ahead mask in this inter-language multi-head attention mechanism.

Just as well as multi-head self-attention, you can calculate inter-language multi-head attention mechanism as follows: softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}). In the example above, the resulting multi-head attention map is a 10 \times 9 matrix like in the figure below.

Once you keep the points above in you mind, the implementation of the decoder part should not be that hard.

5 Masking tokens in practice

I explained masked-multi-head attention mechanism in the last article, and the ideas itself is not so difficult. However in practice this is implemented in a little tricky way. You might have realized that the size of input matrices is fixed so that it fits the longest sentence. That means, when the maximum length of the input sentences is 41, even if the sentences in a batch have less than 41 tokens, you sample (64, 41) sized tensor as a batch every time (The 64 is a batch size). Let “Anthony Hopkins admired Michael Bay as a great director.”, which has 9 tokens in total, be an input. We have been considering calculating (9, 9) sized attention maps or (10, 9) sized attention maps, but in practice you use (41, 41) or (42, 41) sized ones. When it comes to calculating self attentions in the encoder part, you zero pad self attention maps with encoder padding masks, like in the figure below. The black dots denote the zero valued elements.

As you can see in the codes below, encode padding masks are quite simple. You just multiply the padding masks with -1e9 and add them to attention maps and apply a softmax function. Thereby you can zero-pad the columns in the positions/columns where you added -1e9 to.

I explained look ahead mask in the last article, and in practice you combine normal padding masks and look ahead masks like in the figure below. You can see that you can compare each token with only its previous tokens. For example you can compare “als” only with “Anthony”, “Hopkins”, “hat”, “Michael”, “Bay”, “als”, not with “einen”, “großartigen”, “Regisseur” or “bewundert.”

Decoder padding masks are almost the same as encoder one. You have to keep it in mind that you zero pad positions which surpassed the length of the source input sentence.

6 Decoding process

In the last section we have seen that we can zero-pad columns, but still the rows are redundant. However I guess that is not a big problem because you decode the final output in the direction of the rows of attention maps. Once you decode <end> token, you stop decoding. The redundant rows would not affect the decoding anymore.

This decoding process is similar to that of seq2seq models with RNNs, and that is why you need to hide future tokens in the self-multi-head attention mechanism in the decoder. You share the same densely connected layers followed by a softmax function, at all the time steps of decoding. Transformer has to learn how to decode only based on the words which have appeared so far.

According to the original paper, “We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.” After these explanations, I think you understand the part more clearly.

The codes blow is for the decoding part. You can see that you first start decoding an output sentence with a sentence composed of only <start>, and you decide which word to decoded, step by step.

*It easy to imagine that this decoding procedure is not the best. In reality you have to consider some possibilities of decoding, and you can do that with beam search decoding.

After training this English-German translator for 30 epochs you can translate relatively simple English sentences into German. I displayed some results below, with heat maps of multi-head attention. Each colored attention maps corresponds to each head of multi-head attention. The examples below are all from the fourth (last) layer, but you can visualize maps in any layers. When it comes to look ahead attention, naturally only the lower triangular part of the maps is activated.

This article series has not covered some important topics machine translation, for example how to calculate translation errors. Actually there are many other fascinating topics related to machine translation. For example beam search decoding, which consider some decoding possibilities, or other topics like how to handle proper nouns such as “Anthony” or “Hopkins.” But this article series is not on NLP. I hope you could effectively learn the architecture of Transformer model with examples of languages so far. And also I have not explained some details of training the network, but I will not cover that because I think that depends on tasks. The next article is going to be the last one of this series, and I hope you can see how Transformer is applied in computer vision fields, in a more “linguistic” manner.

But anyway we have finally made it. In this article series we have seen that one of the earliest computers was invented to break Enigma. And today we can quickly make a more or less accurate translator on our desk. With Transformer models, you can even translate deadly funny jokes into German.

*You can train a translator with this code.

*After training a translator, you can translate English sentences into German with this code.


[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need” (2017)

[2] “Transformer model for language understanding,” Tensorflow Core

[3] Jay Alammar, “The Illustrated Transformer,”

[4] “Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention,” stanfordonline, (2019)

[5]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 191-193

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Positional encoding, residual connections, padding masks: covering the rest of Transformer components

This is the fourth article of my article series named “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

1 Wrapping points up so far

This article series has already covered a great deal of the Transformer mechanism. Whether you have read my former articles or not, I bet you are more or less lost in the course of learning Transformer model. The left side of the figure below is from the original paper on Transformer model, and my previous articles explained the parts in each colored frame. In the first article, I  mainly explained how language is encoded in deep learning task and how that is evaluated.

This is more of a matter of inputs and the outputs of deep learning networks, which are in blue dotted frames in the figure. They are not so dependent on types of deep learning NLP tasks. In the second article, I explained seq2seq models, which are encoder-decoder models used in machine translation. Seq2seq models can can be simplified like the figure in the orange frame. In the article I mainly explained seq2seq models with RNNs, but the purpose of this article series is ultimately replace them with Transformer models. In the last article, I finally wrote about some actual components of Transformer models: multi-head attention mechanism. I think this mechanism is the core of Transformed models, and I did my best to explain it with a whole single article, with a lot of visualizations. However, there are still many elements I have not explained.

First, you need to do positional encoding to the word embedding so that Transformer models can learn the relations of the positions of input tokens. At least I was too stupid to understand what this is only with the original paper on Transformer. I am going to explain this algorithm in illustrative ways, which I needed to self-teach it. The second point is residual connections.

The last article has already explained multi-head attention, as precisely as I could do, but I still have to say I covered only two multi-head attention parts in a layer of Transformer model, which are in pink frames. During training, you have to mask some tokens at the decoder part so that some of tokens are invisible, and masked multi-head attention enables that.

You might be tired of the words “queries,” “keys,” and “values,” if you read the last article. But in fact that was not enough. When you think about applying Transformer in other tasks, such as object detection or image generation, you need to reconsider what the structure of data and how “queries,” “keys,” and “values,” correspond to each elements of the data, and probably one of my upcoming articles would cover this topic.

2 Why Transformer?

One powerful strength of Transformer model is its parallelization. As you saw in the last article, Trasformer models enable calculating relations of tokens to all other tokens, on different standards, independently in each head. And each head requires very simple linear transformations. In case of RNN encoders, if an input has \tau tokens, basically you have to wait for \tau time steps to finish encoding the input sentence. Also, at the time step (\tau) the RNN cell retains the information at the time step (1) only via recurrent connections. In this way you cannot attend to tokens in the earlier time steps, and this is obviously far from how we compare tokens in a sentence. You can bring information backward by bidirectional connection s in RNN models, but that all the more deteriorate parallelization of the model. And possessing information via recurrent connections, like a telephone game, potentially has risks of vanishing gradient problems. Gated RNN, such as LSTM or GRU mitigate the problems by a lot of nonlinear functions, but that adds to computational costs. If you understand multi-head attention mechanism, I think you can see that Transformer solves those problems.

I guess this is closer to when you speak a foreign language which you are fluent in. You wan to say something in a foreign language, and you put the original sentence in your mother tongue in the “encoder” in your brain. And you decode it, word by word, in the foreign language. You do not have to wait for the word at the end in your language, or rather you have to consider the relations of of a chunk of words to another chunk of words, in forward and backward ways. This is crucial especially when Japanese people speak English. You have to make the conclusion clear in English usually with the second word, but the conclusion is usually at the end of the sentence in Japanese.

3 Positional encoding

I explained disadvantages of RNN in the last section, but RNN has been a standard algorithm of neural machine translation. As I mentioned in the fourth section of the first article of my series on RNN, other neural nets like fully connected layers or convolutional neural networks cannot handle sequence data well. I would say RNN could be one of the only algorithms to handle sequence data, including natural language data, in more of classical methods of time series data processing.

*As I explained in this article, the original idea of RNN was first proposed in 1997, and I would say the way it factorizes time series data is very classical, and you would see similar procedures in many other algorithms. I think Transformer is a successful breakthrough which gave up the idea of processing sequence data time step by time step.

You might have noticed that multi-head attention mechanism does not explicitly uses the the information of the orders or position of input data, as it basically calculates only the products of matrices. In the case where the input is “Anthony Hopkins admired Michael Bay as a great director.”, multi head attention mechanism does not uses the information that “Hopkins” is the second token, or the information that the token two time steps later is “Michael.” Transformer tackles this problem with an almost magical algorithm named positional encoding.

In order to learn positional encoding, you should first think about what kind of encoding is ideal. According to this blog post, ideal encoding of positions of tokens have the following features.

  • Positional encoding of one token deterministically represents the position of the token.
  • The actual values of positional encoding should not be too big compared to the values of elements of embedding vectors.
  • Positional encodings of different tokens should successfully express their relative positions.

The most straightforward way to give the information of position is implementing the index of times steps (t), but if you naively give the term (t) to the data, the term could get too big compared to the values of data ,for example when the sequence data is 100 time steps long. The next straightforward idea is compressing the idea of time steps to for example the range [0, 1]. With this approach, however, the resolution of encodings can vary depending on the length of the input sequence data. Thus these naive approaches do not meet the requirements above, and I guess even conventional RNN-based models were not so successful in these points.

*I guess that is why attention mechanism of RNN seq2seq models, which I explained in the second article, was successful. You can constantly calculate the relative positions of decoder tokens compared to the encoder tokens.

Positional encoding, to me almost magically, meets the points I have mentioned. However the explanation of positional encoding in the original paper of Transformer is unkindly brief. It says you can encode positions of tokens with the following vector PE_{(pos, 2i)} = sin(pos / 10000^{2i/d_model}), PE_{(pos, 2i+1)} = cos(pos / 10000^{2i/d_model}), where i = 0, 1, \dots, d_{model}/2 - 1. d_{model} is the dimension of word embedding. The heat map below is the most typical type of visualization of positional encoding you would see everywhere, and in this case d_{model}=256, and pos is discrete number which varies from 0 to 49, thus the heat map blow is equal to a 50\times 256 matrix, whose elements are from -1 to 1. Each row of the graph corresponds to one token, and you can see that lower dimensional part is constantly changing like waves. Also it is quite easy to encode an input with this positional encoding: assume that you have a matrix of an input sentence composed of 50 tokens, each of which is a 256 dimensional vector, then all you have to do is just adding the heat map below to the matrix.

Concretely writing down, the encoding of the 256-dim token at pos  is (PE_{(pos, 0)}, PE_{(pos, 1)}, \dots ,  PE_{(pos, 254)}, PE_{(pos, 255)})^T = \bigl( sin(pos / 10000^{0/256}), cos(pos / 10000^{0/256}) \bigr),  \dots , \bigl( sin(pos / 10000^{254/256}), cos(pos / 10000^{254/256}) \bigr)^T.

You should see this encoding more as d_{model} / 2 pairs of circles rather than d_{model} dimensional vectors. When you fix the i, the index of the depth of each encoding, you can extract a 2 dimensional vector \boldsymbol{PE}_i = \bigl( sin(pos / 10000^{2i/d_model}), cos(pos / 10000^{2i/d_model}) \bigr). If you constantly change the value pos, the vector \boldsymbol{PE}_i rotates clockwise on the unit circle in the figure below.

Also, the deeper the dimension of the embedding is, I mean the bigger the index i is, the smaller the frequency of rotation is. I think the video below is a more intuitive way to see how each token is encoded with positional encoding. You can see that the bigger pos is, that is the more tokens an input has, the deeper part positional encoding starts to rotate on the circles.


Very importantly, the original paper of Transformer says, “We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.” For each circle at any depth, I mean for any i, the following simple equation holds:

\left( \begin{array}{c} sin(\frac{pos+k}{10000^{2i/d_{model}}}) \\ cos(\frac{pos+k}{10000^{2i/d_{model}}}) \end{array} \right) =
\left( \begin{array}{ccc} cos(\frac{k}{10000^{2i/d_{model}}}) & sin(\frac{k}{10000^{2i/d_{model}}}) \\ -sin(\frac{k}{10000^{2i/d_{model}}}) & cos(\frac{k}{10000^{2i/d_{model}}}) \\ \end{array} \right) \cdot \left( \begin{array}{c} sin(\frac{pos}{10000^{2i/d_{model}}}) \\ cos(\frac{pos}{10000^{2i/d_{model}}}) \end{array} \right)

The matrix is a simple rotation matrix, so if i is fixed the rotation only depends on k, how many positions to move forward or backward. Then we get a very important fact: as the pos changes (pos is a discrete number), each point rotates in proportion to the offset of “pos,” with different frequencies depending on the depth of the circles. The deeper the circle is, the smaller the frequency is. That means, this type of positional encoding encourages Transformer models to learn definite and relative positions of tokens with rotations of those circles, and the values of each element of the rotation matrices are from -1 to 1, so they do not get bigger no matter how many tokens inputs have.

For example when an input is “Anthony Hopkins admired Michael Bay as a great director.”, a shift from the token “Hopkins” to “Bay” is a rotation matrix  \left( \begin{array}{ccc} cos(\frac{k}{10000^{2i/d_{model}}}) & sin(\frac{k}{10000^{2i/d_{model}}}) \\ -sin(\frac{k}{10000^{2i/d_{model}}}) & cos(\frac{k}{10000^{2i/d_{model}}}) \\ \end{array} \right), where k=3. Also the shift from “Bay” to “great” has the same rotation.

*Positional encoding reminded me of Enigma, a notorious cipher machine used by Nazi Germany. It maps alphabets to different alphabets with different rotating gear connected by cables. With constantly changing gears and keys, it changed countless patterns of alphabetical mappings, every day, which is impossible for humans to solve. One of the first form of computers was invented to break Enigma.

*As far as I could understand from “Imitation Game (2014).”

*But I would say Enigma only relied on discrete deterministic algebraic mapping of alphabets. The rotations of positional encoding is not that tricky as Enigma, but it can encode both definite and deterministic positions of much more variety of tokens. Or rather I would say AI algorithms developed enough to learn such encodings with subtle numerical changes, and I am sure development of NLP increased the possibility of breaking the Turing test in the future.

5 Residual connections

If you naively stack neural networks with simple implementation, that would suffer from vanishing gradient problems during training. Back propagation is basically multiplying many gradients, so

One way to mitigate vanishing gradient problems is quite easy: you have only to make a bypass of propagation. You would find a lot of good explanations on residual connections, so I am not going to explain how this is effective for vanishing gradient problems in this article.

In Transformer models you add positional encodings to the input only in the first layer, but I assume that the encodings remain through the layers by these bypass routes, and that might be one of reasons why Transformer models can retain information of positions of tokens.

6 Masked multi-head attention

Even though Transformer, unlike RNN, can attend to the whole input sentence at once, the decoding process of Transformer-based translator is close to RNN-based one, and you are going to see that more clearly in the codes in the next article. As I explained in the second article, RNN decoders decode each token only based on the tokens the have generated so far. Transformer decoder also predicts the output sequences autoregressively one token at a time step, just as RNN decoders. I think it easy to understand this process because RNN decoder generates tokens just as you connect RNN cells one after another, like connecting rings to a chain. In this way it is easy to make sure that generating of one token in only affected by the former tokens. On the other hand, during training Transformer decoders, you input the whole sentence at once. That means Transformer decoders can see the whole sentence during training. That is as if a student preparing for a French translation test could look at the whole answer French sentences. It is easy to imagine that you cannot prepare for the French test effectively if you study this way. Transformer decoders also have to learn to decode only based on the tokens they have generated so far.

In order to properly train a Transformer-based translator to learn such decoding, you have to hide the upcoming tokens in target sentences during training. During calculating multi-head attentions in each Transformer layer, if you keep ignoring the weights from up coming tokens like in the figure below, it is likely that Transformer models learn to decode only based on the tokens generated so far. This is called masked multi-head attention.

*I am going to take an input “Anthonly Hopkins admire Michael Bay as a great director.” as an example of calculating masked multi-head attention mechanism, but this is supposed to be in the target laguage. So when you train an translator from English to German, in practice you have to calculate masked multi-head atetntion of “Anthony Hopkins hat Michael Bay als einen großartigen Regisseur bewundert.”

As you can see from the whole architecture of Transformer, you only need to consider masked multi-head attentions of of self-attentions of the input sentences at the decoder side. In order to concretely calculate masked multi-head attentions, you need a technique named look ahead masking. This is also quite simple. Just as well as the last article, let’s take an example of calculating self attentions of an input “Anthony Hopkins admired Michael Bay as a great director.” Also in this case you just calculate multi-head attention as usual, but when you get the histograms below, you apply look ahead masking to each histogram and delete the weights from the future tokens. In the figure below the black dots denote zero, and the sum of each row of the resulting attention map is also one. In other words, you get a lower triangular matrix, the sum of whose each row is 1.

Also just as I explained in the last article, you reweight vlaues with the triangular attention map. The figure below is calculating a transposed masked multi-head attention because I think it is a more straightforward way to see how vectors are reweighted in multi-head attention mechanism.

When you closely look at how each column of the transposed multi-head attention is reweighted, you can clearly see that the token is reweighted only based on the tokens generated so far.

*If you are still not sure why you need such masking in multi-head attention of target sentences, you should proceed to the next article for now. Once you check the decoding processes of Transformer-based translators, you would see why you need masked multi-head attention mechanism on the target sentence during training.

If you have read my articles, at least this one and the last one, I think you have gained more or less clear insights into how each component of Transfomer model works. You might have realized that each components require simple calculations. Combined with the fact that multi-head attention mechanism is highly parallelizable, Transformer is easier to train, compared to RNN.

In this article, we are going to see how masking of multi-head attention is implemented and how the whole Transformer structure is constructed. By the end of the next article, you would be able to create a toy English-German translator with more or less clear understanding on its architecture.


You can visualize positional encoding the way I explained with simple Python codes below. Please just copy and paste them, importing necessary libraries. You can visualize positional encoding as both heat maps and points rotating on rings, and in this case the dimension of word embedding is 256, and the maximum length of sentences is 50.

*In fact some implementations use different type of positional encoding, as you can see in the codes below. In this case, embedding vectors are roughly divided into two parts, and each part is encoded with different sine waves. I have been using a metaphor of rotating rings or gears in this article to explain positional encoding, but to be honest that is not necessarily true of all the types of Transformer implementation. Some papers compare different types of pairs of positional encoding. The most important point is, Transformer models is navigated to learn positions of tokens with certain types of mathematical patterns.


[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need” (2017)

[2] “Transformer model for language understanding,” Tensorflow Core

[3] Jay Alammar, “The Illustrated Transformer,”

[4] “Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention,” stanfordonline, (2019)

[5]Harada Tatsuya, “Machine Learning Professional Series: Image Recognition,” (2017), pp. 191-193
原田達也 著, 「機械学習プロフェッショナルシリーズ 画像認識」, (2017), pp. 191-193

[6] Amirhossein Kazemnejad, “Transformer Architecture: The Positional Encoding
Let’s use sinusoidal functions to inject the order of words in our model”, Amirhossein Kazemnejad’s Blog, (2019)

[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko, “End-to-End Object Detection with Transformers,” (2020)

[8]中西 啓、「【第5回】機械式暗号機の傑作~エニグマ登場~」、HH News & Reports, (2011)

[9]中西 啓、「【第6回】エニグマ解読~第2次世界大戦とコンピュータの誕生~」、HH News & Reports, (2011)

[10]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 191-193

[11]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.


Multi-head attention mechanism: “queries”, “keys”, and “values,” over and over again

*A comment added on 04/05/2022: Thanks to a comment by Mr. Maier, I found a major mistake in my visualization. To be concrete, there is a mistake in expressing how to get each colored divided group of tokens by applying linear transformations. That corresponds to the section 3.2.2 in the paper “Attention Is All You Need.” There would be no big differences in the main point of this article, the relations of keys, queries, and values, but please bear that in your mind if you need Transformer at a practical work. Besides checking the implementation by Tensorflow, I will soon prepare a modified version of visualization. For further details, please see comments at the bottom of this article.

This is the third article of my article series named “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

In the last article, I explained how attention mechanism works in simple seq2seq models with RNNs, and it basically calculates correspondences of the hidden state at every time step, with all the outputs of the encoder. However I would say the attention mechanisms of RNN seq2seq models use only one standard for comparing them. Using only one standard is not enough for understanding languages, especially when you learn a foreign language. You would sometimes find it difficult to explain how to translate a word in your language to another language. Even if a pair of languages are very similar to each other, translating them cannot be simple switching of vocabulary. Usually a single token in one language is related to several tokens in the other language, and vice versa. How they correspond to each other depends on several criteria, for example “what”, “who”, “when”, “where”, “why”, and “how”. It is easy to imagine that you should compare tokens with several criteria.

Transformer model was first introduced in the original paper named “Attention Is All You Need,” and from the title you can easily see that attention mechanism plays important roles in this model. When you learn about Transformer model, you will see the figure below, which is used in the original paper on Transformer.  This is the simplified overall structure of one layer of Transformer model, and you stack this layer N times. In one layer of Transformer, there are three multi-head attention, which are displayed as boxes in orange. These are the very parts which compare the tokens on several standards. I made the head article of this article series inspired by this multi-head attention mechanism.

The figure below is also from the original paper on Transfromer. If you can understand how multi-head attention mechanism works with the explanations in the paper, and if you have no troubles understanding the codes in the official Tensorflow tutorial, I have to say this article is not for you. However I bet that is not true of majority of people, and at least I need one article to clearly explain how multi-head attention works. Please keep it in mind that this article covers only the architectures of the two figures below. However multi-head attention mechanisms are crucial components of Transformer model, and throughout this article, you would not only see how they work but also get a little control over it at an implementation level.

1 Multi-head attention mechanism

When you learn Transformer model, I recommend you first to pay attention to multi-head attention. And when you learn multi-head attentions, before seeing what scaled dot-product attention is, you should understand the whole structure of multi-head attention, which is at the right side of the figure above. In order to calculate attentions with a “query”, as I said in the last article, “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” Sooner or later, you will notice I would be just repeating these phrases over and over again throughout this article, in several ways.

*Even if you are not sure what “reweighting” means in this context, please keep reading. I think you would little by little see what it means especially in the next section.

The overall process of calculating multi-head attention, displayed in the figure above, is as follows (Please just keep reading. Please do not think too much.): first you split the V: “values”, K: “keys”, and Q: “queries”, and second you transform those divided “values”, “keys”, and “queries” with densely connected layers (“Linear” in the figure). Next you calculate attention weights and reweight the “values” and take the summation of the reiweighted “values”, and you concatenate the resulting summations. At the end you pass the concatenated “values” through another densely connected layers. The mechanism of scaled dot-product attention is just a matter of how to concretely calculate those attentions and reweight the “values”.

*In the last article I briefly mentioned that “keys” and “queries” can be in the same language. They can even be the same sentence in the same language, and in this case the resulting attentions are called self-attentions, which we are mainly going to see. I think most people calculate “self-attentions” unconsciously when they speak. You constantly care about what “she”, “it” , “the”, or “that” refers to in you own sentence, and we can say self-attention is how these everyday processes is implemented.

Let’s see the whole process of calculating multi-head attention at a little abstract level. From now on, we consider an example of calculating multi-head self-attentions, where the input is a sentence “Anthony Hopkins admired Michael Bay as a great director.” In this example, the number of tokens is 9, and each token is encoded as a 512-dimensional embedding vector. And the number of heads is 8. In this case, as you can see in the figure below, the input sentence “Anthony Hopkins admired Michael Bay as a great director.” is implemented as a 9\times 512 matrix. You first split each token into 512/8=64 dimensional, 8 vectors in total, as I colored in the figure below. In other words, the input matrix is divided into 8 colored chunks, which are all 9\times 64 matrices, but each colored matrix expresses the same sentence. And you calculate self-attentions of the input sentence independently in the 8 heads, and you reweight the “values” according to the attentions/weights. After this, you stack the sum of the reweighted “values”  in each colored head, and you concatenate the stacked tokens of each colored head. The size of each colored chunk does not change even after reweighting the tokens. According to Ashish Vaswani, who invented Transformer model, each head compare “queries” and “keys” on each standard. If the a Transformer model has 4 layers with 8-head multi-head attention , at least its encoder has 4\times 8 = 32 heads, so the encoder learn the relations of tokens of the input on 32 different standards.

I think you now have rough insight into how you calculate multi-head attentions. In the next section I am going to explain the process of reweighting the tokens, that is, I am finally going to explain what those colorful lines in the head image of this article series are.

*Each head is randomly initialized, so they learn to compare tokens with different criteria. The standards might be straightforward like “what” or “who”, or maybe much more complicated. In attention mechanisms in deep learning, you do not need feature engineering for setting such standards.

2 Calculating attentions and reweighting “values”

If you have read the last article or if you understand attention mechanism to some extent, you should already know that attention mechanism calculates attentions, or relevance between “queries” and “keys.” In the last article, I showed the idea of weights as a histogram, and in that case the “query” was the hidden state of the decoder at every time step, whereas the “keys” were the outputs of the encoder. In this section, I am going to explain attention mechanism in a more abstract way, and we consider comparing more general “tokens”, rather than concrete outputs of certain networks. In this section each [ \cdots ] denotes a token, which is usually an embedding vector in practice.

Please remember this mantra of attention mechanism: “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” The figure below shows an overview of a case where “Michael” is a query. In this case you compare the query with the “keys”, that is, the input sentence “Anthony Hopkins admired Michael Bay as a great director.” and you get the histogram of attentions/weights. Importantly the sum of the weights 1. With the attentions you have just calculated, you can reweight the “values,” which also denote the same input sentence. After that you can finally take a summation of the reweighted values. And you use this summation.

*I have been repeating the phrase “reweighting ‘values’  with attentions,”  but you in practice calculate the sum of those reweighted “values.”

Assume that compared to the “query”  token “Michael”, the weights of the “key” tokens “Anthony”, “Hopkins”, “admired”, “Michael”, “Bay”, “as”, “a”, “great”, and “director.” are respectively 0.06, 0.09, 0.05, 0.25, 0.18, 0.06, 0.09, 0.06, 0.15. In this case the sum of the reweighted token is 0.06″Anthony” + 0.09″Hopkins” + 0.05″admired” + 0.25″Michael” + 0.18″Bay” + 0.06″as” + 0.09″a” + 0.06″great” 0.15″director.”, and this sum is the what wee actually use.

*Of course the tokens are embedding vectors in practice. You calculate the reweighted vector in actual implementation.

You repeat this process for all the “queries.”  As you can see in the figure below, you get summations of 9 pairs of reweighted “values” because you use every token of the input sentence “Anthony Hopkins admired Michael Bay as a great director.” as a “query.” You stack the sum of reweighted “values” like the matrix in purple in the figure below, and this is the output of a one head multi-head attention.

3 Scaled-dot product

This section is a only a matter of linear algebra. Maybe this is not even so sophisticated as linear algebra. You just have to do lots of Excel-like operations. A tutorial on Transformer by Jay Alammar is also a very nice study material to understand this topic with simpler examples. I tried my best so that you can clearly understand multi-head attention at a more mathematical level, and all you need to know in order to read this section is how to calculate products of matrices or vectors, which you would see in the first some pages of textbooks on linear algebra.

We have seen that in order to calculate multi-head attentions, we prepare 8 pairs of “queries”, “keys” , and “values”, which I showed in 8 different colors in the figure in the first section. We calculate attentions and reweight “values” independently in 8 different heads, and in each head the reweighted “values” are calculated with this very simple formula of scaled dot-product: Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) =softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})\boldsymbol{V}. Let’s take an example of calculating a scaled dot-product in the blue head.

At the left side of the figure below is a figure from the original paper on Transformer, which explains one-head of multi-head attention. If you have read through this article so far, the figure at the right side would be more straightforward to understand. You divide the input sentence into 8 chunks of matrices, and you independently put those chunks into eight head. In one head, you convert the input matrix by three different fully connected layers, which is “Linear” in the figure below, and prepare three matrices Q, K, V, which are “queries”, “keys”, and “values” respectively.

*Whichever color attention heads are in, the processes are all the same.

*You divide \frac{\boldsymbol{Q}} {\boldsymbol{K}^T} by \sqrt{d}_k in the formula. According to the original paper, it is known that re-scaling \frac{\boldsymbol{Q} }{\boldsymbol{K}^T} by \sqrt{d}_k is found to be effective. I am not going to discuss why in this article.

As you can see in the figure below, calculating Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) is virtually just multiplying three matrices with the same size (Only K is transposed though). The resulting 9\times 64 matrix is the output of the head.

softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}) is calculated like in the figure below. The softmax function regularize each row of the re-scaled product \frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k}, and the resulting 9\times 9 matrix is a kind a heat map of self-attentions.

The process of comparing one “query” with “keys” is done with simple multiplication of a vector and a matrix, as you can see in the figure below. You can get a histogram of attentions for each query, and the resulting 9 dimensional vector is a list of attentions/weights, which is a list of blue circles in the figure below. That means, in Transformer model, you can compare a “query” and a “key” only by calculating an inner product. After re-scaling the vectors by dividing them with \sqrt{d_k} and regularizing them with a softmax function, you stack those vectors, and the stacked vectors is the heat map of attentions.

You can reweight “values” with the heat map of self-attentions, with simple multiplication. It would be more straightforward if you consider a transposed scaled dot-product \boldsymbol{V}^T \cdot softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})^T. This also should be easy to understand if you know basics of linear algebra.

One column of the resulting matrix (\boldsymbol{V}^T \cdot softmax(\frac{\boldsymbol{Q} \boldsymbol{K} ^T}{\sqrt{d}_k})^T) can be calculated with a simple multiplication of a matrix and a vector, as you can see in the figure below. This corresponds to the process or “taking a summation of reweighted ‘values’,” which I have been repeating. And I would like you to remember that you got those weights (blue) circles by comparing a “query” with “keys.”

Again and again, let’s repeat the mantra of attention mechanism together: “you compare the ‘query’ with the ‘keys’ and get scores/weights for the ‘values.’ Each score/weight is in short the relevance between the ‘query’ and each ‘key’. And you reweight the ‘values’ with the scores/weights, and take the summation of the reweighted ‘values’.” If you have been patient enough to follow my explanations, I bet you have got a clear view on how multi-head attention mechanism works.

We have been seeing the case of the blue head, but you can do exactly the same procedures in every head, at the same time, and this is what enables parallelization of multi-head attention mechanism. You concatenate the outputs of all the heads, and you put the concatenated matrix through a fully connected layers.

If you are reading this article from the beginning, I think this section is also showing the same idea which I have repeated, and I bet more or less you no have clearer views on how multi-head attention mechanism works. In the next section we are going to see how this is implemented.

4 Tensorflow implementation of multi-head attention

Let’s see how multi-head attention is implemented in the Tensorflow official tutorial. If you have read through this article so far, this should not be so difficult. I also added codes for displaying heat maps of self attentions. With the codes in this Github page, you can display self-attention heat maps for any input sentences in English.

The multi-head attention mechanism is implemented as below. If you understand Python codes and Tensorflow to some extent, I think this part is relatively easy.  The multi-head attention part is implemented as a class because you need to train weights of some fully connected layers. Whereas, scaled dot-product is just a function.

*I am going to explain the create_padding_mask() and create_look_ahead_mask() functions in upcoming articles. You do not need them this time.

Let’s see a case of using multi-head attention mechanism on a (1, 9, 512) sized input tensor, just as we have been considering in throughout this article. The first axis of (1, 9, 512) corresponds to the batch size, so this tensor is virtually a (9, 512) sized tensor, and this means the input is composed of 9 512-dimensional vectors. In the results below, you can see how the shape of input tensor changes after each procedure of calculating multi-head attention. Also you can see that the output of the multi-head attention is the same as the input, and you get a 9\times 9 matrix of attention heat maps of each attention head.

I guess the most complicated part of this implementation above is the split_head() function, especially if you do not understand tensor arithmetic. This part corresponds to splitting the input tensor to 8 different colored matrices as in one of the figures above. If you cannot understand what is going on in the function, I recommend you to prepare a sample tensor as below.

This is just a simple (1, 9, 512) sized tensor with sequential integer elements. The first row (1, 2, …., 512) corresponds to the first input token, and (4097, 4098, … , 4608) to the last one. You should try converting this sample tensor to see how multi-head attention is implemented. For example you can try the operations below.

These operations correspond to splitting the input into 8 heads, whose sizes are all (9, 64). And the second axis of the resulting (1, 8, 9, 64) tensor corresponds to the index of the heads. Thus sample_sentence[0][0] corresponds to the first head, the blue 9\times 64 matrix. Some Tensorflow functions enable linear calculations in each attention head, independently as in the codes below.

Very importantly, we have been only considering the cases of calculating self attentions, where all “queries”, “keys”, and “values” come from the same sentence in the same language. However, as I showed in the last article, usually “queries” are in a different language from “keys” and “values” in translation tasks, and “keys” and “values” are in the same language. And as you can imagine, usualy “queries” have different number of tokens from “keys” or “values.” You also need to understand this case, which is not calculating self-attentions. If you have followed this article so far, this case is not that hard to you. Let’s briefly see an example where the input sentence in the source language is composed 9 tokens, on the other hand the output is composed 12 tokens.

As I mentioned, one of the outputs of each multi-head attention class is 9\times 9 matrix of attention heat maps, which I displayed as a matrix composed of blue circles in the last section. The the implementation in the Tensorflow official tutorial, I have added codes to display actual heat maps of any input sentences in English.

*If you want to try displaying them by yourself, download or just copy and paste codes in this Github page. Please maker “datasets” directory in the same directory as the code. Please download “spa-eng.zip” from this page, and unzip it. After that please put “spa.txt” on the “datasets” directory. Also, please download the “checkpoints_en_es” folder from this link, and place the folder in the same directory as the file in the Github page. In the upcoming articles, you would need similar processes to run my codes.

After running codes in the Github page, you can display heat maps of self attentions. Let’s input the sentence “Anthony Hopkins admired Michael Bay as a great director.” You would get a heat maps like this.

In fact, my toy implementation cannot handle proper nouns such as “Anthony” or “Michael.” Then let’s consider a simple input sentence “He admired her as a great director.” In each layer, you respectively get 8 self-attention heat maps.

I think we can see some tendencies in those heat maps. The heat maps in the early layers, which are close to the input, are blurry. And the distributions of the heat maps come to concentrate more or less diagonally. At the end, presumably they learn to pay attention to the start and the end of sentences.

You have finally finished reading this article. Congratulations.

You should be proud of having been patient, and you passed the most tiresome part of learning Transformer model. You must be ready for making a toy English-German translator in the upcoming articles. Also I am sure you have understood that Michael Bay is a great director, no matter what people say.


[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need” (2017)

[2] “Transformer model for language understanding,” Tensorflow Core

[3] “Neural machine translation with attention,” Tensorflow Core

[4] Jay Alammar, “The Illustrated Transformer,”

[5] “Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention,” stanfordonline, (2019)

[6]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 191-193

[7]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)

[8]Rosemary Rossi, “Anthony Hopkins Compares ‘Genius’ Michael Bay to Spielberg, Scorsese,” yahoo! entertainment, (2017)

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.