Visual Question Answering with Keras – Part 1

This is Part 1 of 2 of the article series Visual Question Answering with Keras.

Making computers intelligent enough to answer questions about images

If we look closely at the history of Artificial Intelligence (AI), Deep Learning has gained popularity in recent years and has achieved human-level performance in tasks such as Speech Recognition, Image Classification, Object Detection, and Machine Translation. As humans, however, we perform these tasks without much effort, and so can a five-year-old child. Building systems with these capabilities has nevertheless always been considered an ambitious goal by researchers and developers alike.

In this series of blog posts, I will cover an introduction to Visual Question Answering (VQA), the available datasets, the neural network approach to VQA and its implementation in Keras, and real-life applications of this challenging problem.

Table of Contents:

1 Introduction

2 What exactly is Visual Question Answering?

3 Prerequisites

4 Datasets available for VQA

4.1 DAQUAR Dataset

4.2 CLEVR Dataset

4.3 FigureQA Dataset

4.4 VQA Dataset

5 Real-life applications of VQA

6 Conclusion

 

  1. Introduction:

Let’s say you are given the picture below along with a question. Can you answer it?

I am confident that you would all say it is the kitchen without much difficulty, which is also the right answer. Even a five-year-old child who has just started to learn things might answer this question correctly.

Alright, but can you write a computer program for this type of task, one that takes an image and a question about the image as input and gives us the answer as output?

Before the development of Deep Neural Networks, this problem was considered one of the most difficult and challenging problems in the AI research community, almost inconceivable to solve. However, thanks to recent advances in Deep Learning, systems can now answer such questions with promising results, provided the required dataset is available.

Now I hope you have at least some intuition for the problem that we are going to discuss in this series of blog posts. Let’s try to formalize the problem in the section below.

  2. What exactly is Visual Question Answering?

We can define it as follows: “Visual Question Answering (VQA) is a system that takes an image and a natural language question about the image as input and generates a natural language answer as output.”

VQA is a research area that requires an understanding of both vision (Computer Vision) and text (NLP). The main beauty of VQA is that the reasoning is performed in the context of the image. Given an image and a corresponding question, the system must be able to understand the image well in order to generate an appropriate answer. For example, if the question asks for the number of persons, the system must be able to detect the persons’ faces. To answer a question about the color of a horse, the system needs to detect the objects in the image. Many of these common problems, such as face detection, object detection, and binary object classification (yes or no), have been solved in the field of Computer Vision with good results.

To summarize, a good VQA system must be able to address the typical problems of both CV and NLP.

To get a better feel for VQA, you can try the online VQA demo by CloudCV. Just go to this link, upload a picture of your choice, and ask a question related to it; the system will generate an answer.

 

  3. Prerequisites:

In the next post, I will walk you through the code for this problem using Keras. So I assume that you are familiar with:

  1. Fundamental concepts of Machine Learning
  2. Multi-Layered Perceptron
  3. Convolutional Neural Network
  4. Recurrent Neural Network (especially LSTM)
  5. Gradient Descent and Backpropagation
  6. Transfer Learning
  7. Hyperparameter Optimization
  8. Python and Keras syntax

  4. Datasets available for VQA:

As you know, for problems related to CV or NLP, the availability of a dataset is the key to solving the problem. For complex problems like VQA, the dataset must cover all possible questions and answers in real-world scenarios. In this section, I will cover some of the datasets available for VQA.

4.1 DAQUAR Dataset:

The DAQUAR dataset is the first dataset for VQA, and it contains only indoor scenes. The human baseline on it reaches an accuracy of 50.2%. Its images come from the NYU_Depth dataset.

Example of DAQUAR dataset

The main disadvantage of DAQUAR is that the dataset is too small to capture all possible indoor scenes.

4.2 CLEVR Dataset:

The CLEVR dataset from Stanford contains questions about objects of different types, colors, shapes, sizes, and materials.

It has

  • A training set of 70,000 images and 699,989 questions
  • A validation set of 15,000 images and 149,991 questions
  • A test set of 15,000 images and 14,988 questions

Image Source: https://cs.stanford.edu/people/jcjohns/clevr/?source=post_page

 

4.3 FigureQA Dataset:

The FigureQA dataset contains questions about bar graphs, line plots, and pie charts. Its training set has 1,327,368 questions for 100,000 images.

4.4 VQA Dataset:

Compared to all the datasets we have seen so far, the VQA dataset is relatively large. It contains open-ended as well as multiple-choice questions. The VQA v2 dataset contains:

  • 82,783 training images from the COCO (Common Objects in Context) dataset
  • 40,504 validation images and 81,434 test images
  • 443,757 question-answer pairs for the training images
  • 214,354 question-answer pairs for the validation images

As you might expect, this dataset is huge: the training images alone take up 12.6 GB. I use this dataset in the next post, but only a very small subset of it.

This dataset also contains abstract cartoon images. Each image has 3 questions and each question has 10 multiple choice answers.
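To put the dataset sizes above side by side, here is a minimal sketch that computes the average number of questions per image from the counts listed in this section. The dictionary layout and the function name are my own choices for illustration; only the numbers come from the text.

```python
# Image and question counts as listed in this section: (images, questions).
DATASETS = {
    "CLEVR train": (70_000, 699_989),
    "CLEVR val": (15_000, 149_991),
    "FigureQA train": (100_000, 1_327_368),
    "VQA v2 train": (82_783, 443_757),
    "VQA v2 val": (40_504, 214_354),
}


def questions_per_image(images: int, questions: int) -> float:
    """Average number of questions attached to each image."""
    return questions / images


# Ratio for every split, keyed by the split name.
ratios = {name: questions_per_image(*counts) for name, counts in DATASETS.items()}

for name, ratio in ratios.items():
    print(f"{name:15s} {ratio:6.2f} questions per image")
```

The ratios are averages over a whole split, so they say nothing about how questions are distributed across individual images.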

  5. Real-life applications of VQA:

There are many applications of VQA. One of the best known is helping blind and visually impaired people. In 2016, Microsoft released the “Seeing AI” app, which describes the surrounding environment to visually impaired people. You can watch this video for the prototype of the Seeing AI app.

Another application could be on social media or e-commerce sites. VQA can also be used for educational purposes.

  6. Conclusion:

I hope this explanation has given you a good idea of Visual Question Answering. In the next blog post, I will walk you through the code in Keras.

If you like my explanations, please leave some feedback or comments, and stay tuned for the next post.

Whitepaper “Data Management”: When Data Goes Traveling

Data management is a complex topic these days and has long ceased to be solely the job of the IT department. Unstructured data from sensors, machines, and plants travels a long way through the company before it delivers added value.

Innovative digital customer services require new infrastructures and cloud applications for their implementation. In the practical summer edition of this whitepaper, the cheerful team of authors shows you how to take your data traveling without running unnecessary risks. It is refreshingly written, offers constructive examples, and has a visual design that makes a good impression even on the beach.

You can get the authors’ well-executed idea as a download or even have it sent to you free of charge as a paperback. You will find it on the website www.pack-die-daten-ein.de. Those who set off on the data journey quickly will be rewarded: the team of authors thanks the first 25 readers for their interest with an original OGIO travel bag. Whether as a download or a paperback, it is a very pleasant service and an enrichment for your summer trip.

Whitepaper - Pack die Daten ein

From BI to PI: The Next Step on the Path to Data-Driven Decisions

“Everything is constantly and continuously changing.” “The pace of change is increasing.” “The world is becoming ever more complex, and companies have to keep up.” Companies of every kind and size have heard these sentences many times, perhaps too often! And yet adapting to change is crucial to a company’s success.


Read this article in English: 
“From BI to PI: The Next Step in the Evolution of Data-Driven Decisions”


You need to understand the underlying organizational building blocks to ensure that the decisions you make are moving in the right direction. It is, so to speak, about the DNA of your company: the business processes on which your way of working is based and which connect everything into a harmonious whole. Understanding how these processes run and where there is room for improvement can make the difference between success and failure.

Companies that have set their focus on growth have already recognized this. In the past, Business Intelligence was regarded as the solution to this challenge. More recently, forward-looking companies have found themselves needing monitoring solutions that can keep pace with today’s rate of change. At the same time, these companies recognize that the growing complexity of business processes means conventional methods are no longer sufficient.

Adapting to a changing environment? The challenges of BI

Business Intelligence is not necessarily outdated or unnecessary. In a fast-moving and constantly changing world, however, BI tools and solutions face a number of challenges. These can include:

  • High data latency – Data latency indicates how long a user needs to retrieve data, for example via a business intelligence dashboard. In many cases this can take more than 24 hours, a business-critical period, since companies want to take advantage of opportunities that may have a limited time window.
  • Incomplete data sets – Business Intelligence takes a broad approach, so investigations may be comprehensive but not deep. This increases the likelihood that data is overlooked, especially in cases where the tools themselves make the investigation parameters hard to change.
  • Discovery instead of analysis – Business intelligence tools are primarily designed to find data, with a focus on data that may be useful to their users. This is often where the tools’ capabilities end, however, because they give users no simple options for actually analyzing the data. The chance of gaining actionable insights shrinks accordingly.
  • Limited scalability – In general, Business Intelligence remains a field for specialists and experts with the relevant know-how, which operational staff often lack. Without a broad understanding of business processes and their analysis within the company, the optimized use of a particular business intelligence tool remains limited.
  • Unconnected metrics – If metrics are used that are not linked to the business processes, Business Intelligence can hardly support positive change within a company. It is difficult for users to evaluate and understand results correctly and to put them to good use.

Process Intelligence: the next groundbreaking step

A more effective method of process analysis is needed to ensure efficient working and well-founded decision making. This is where Process Intelligence (PI) comes into play. PI provides the crucial background information for answering questions that remain unanswered with business intelligence tools.

Process Intelligence enables end-to-end visualization of process flows using raw data. With the right process intelligence tool, this raw data can be analyzed immediately, so that processes are displayed precisely. The end user can view and work with this information as needed, without having to make a preselection for the analysis.

By comparison: because Business Intelligence requires predefined analysis criteria, BI can only be truly useful once these criteria are defined. Companies can avoid delayed analyses by using Process Intelligence to determine the root cause of process problems and then selecting the right criteria to define the analysis framework.

You can then analyze your system processes and see the discrepancies and variants between the intended business process and the way your processes actually run. And the faster you gain real-time insight into your processes, the faster you can set positive changes in motion in your company.

In short: Business Intelligence is suitable for gaining a broad understanding of how a company operates. For some companies this may be sufficient. For others, an overview is not enough.

Are you looking for a way to find out how each process in your organization actually works? The answer is software: software that combines process discovery, process analysis, and conformance checking.

With the right process intelligence tools, you can not only extract data from the various IT systems in your company but also continuously monitor your end-to-end processes. This gives you insight into possible risks and potential for improvement. PI stands for a collaborative approach to process improvement that leads to a groundbreaking understanding of how your company operates and how its operations can be optimized.

Greater Potential with Signavio Process Intelligence

With Signavio Process Intelligence you gain groundbreaking insights into your processes, on the basis of which you can make better business decisions. Get a complete view of your operations and an understanding of what is actually happening in your organization.

As part of the Signavio Business Transformation Suite, Signavio Process Intelligence combines perfectly with process modeling and automation. As a fully cloud-based process mining solution, the software makes it easy to collaborate across the organization and share knowledge.

Generate new ideas, save effort and costs, and optimize your processes. Find out more about Signavio Process Intelligence.

From BI to PI: The Next Step in the Evolution of Data-Driven Decisions

“Change is a constant.” “The pace of change is accelerating.” “The world is increasingly complex, and businesses have to keep up.” Organizations of all shapes and sizes have heard these ideas over and over—perhaps too often! However, the truth remains that adaptation is crucial to a successful business.


Read this article in German: Von der Datenanalyse zur Prozessverbesserung: So gelingt eine erfolgreiche Process-Mining-Initiative

 


Of course, the only way to ensure that the decisions you make are evolving in the right way is to understand the underlying building blocks of your organization. You can think of it as DNA: the business processes that underpin the way you work and combine to create a single unified whole. Knowing how those processes operate, and where the opportunities for improvement lie, can be the difference between success and failure.

Businesses with an eye on their growth understand this already. In the past, Business Intelligence was seen as the solution to this challenge. In more recent times, forward-thinking organizations see the need for monitoring solutions that can keep up with today’s rate of change, at the same time as they recognize that increasing complexity within business processes means traditional methods are no longer sufficient.

Adapting to a changing environment? The challenges of BI

Business Intelligence itself is not necessarily defunct or obsolete. However, the tools and solutions that enable Business Intelligence face a range of challenges in a fast-paced and constantly changing world. Some of these issues may include:

  • High data latency – Data latency refers to how long it takes for a business user to retrieve data from, for example, a business intelligence dashboard. In many cases, this can take more than 24 hours, a critical time period when businesses are attempting to take advantage of opportunities that may have a limited timeframe.
  • Incomplete data sets – The broad approach of Business Intelligence means investigations may run wide but not deep. This increases the chances that data will be missed, especially in instances where the tools themselves make the parameters for investigations difficult to change.
  • Discovery, not analysis – Business intelligence tools are primarily optimized for exploration, with a focus on actually finding data that may be useful to their users. Often, this is where the tools stop, offering no simple way for users to actually analyze the data, and therefore reducing the possibility of finding actionable insights.
  • Limited scalability – In general, Business Intelligence remains an arena for specialists and experts, leaving a gap in understanding for operational staff. Without a wide appreciation for processes and their analysis within an organization, the opportunities to increase the application of a particular Business Intelligence tool will be limited.
  • Unconnected metrics – Business Intelligence can be significantly restricted in its capacity to support positive change within a business through the use of metrics that are not connected to the business context. This makes it difficult for users to interpret and understand the results of an investigation, and apply these results to a useful purpose within their organization.

Process Intelligence: the next evolutionary step

To ensure companies can work efficiently and make the best decisions, a more effective method of process discovery is needed. Process Intelligence (PI) provides the critical background to answer questions that cannot be answered with Business Intelligence tools.

Process Intelligence offers visualization of end-to-end process sequences using raw data, and the right Process Intelligence tool means analysis of that raw data can be conducted straight away, so that processes are displayed accurately. The end-user is free to view and work with this accurate information as they please, without the need to do a preselection for the analysis.

By comparison, because Business Intelligence requires predefined analysis criteria, only once the criteria are defined can BI be truly useful. Organizations can avoid delayed analysis by using Process Intelligence to identify the root causes of process problems, then selecting the right criteria to determine the analysis framework.

Then, you can analyze your system processes and see the gaps and variants between the intended business process and what you actually have. And of course, the faster you discover what you have, the faster you can apply the changes that will make a difference in your business.
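As a rough illustration of what “gaps and variants” means, the following sketch groups a tiny event log into process variants and lists the cases that deviate from the intended sequence. Everything in it is hypothetical: the event log, the activity names, and the function names are made up for this example, and it is a toy, not how Signavio Process Intelligence or any particular product works.

```python
from collections import Counter

# Hypothetical event log: (case id, activity), already ordered by time within each case.
EVENT_LOG = [
    ("order-1", "receive order"), ("order-1", "check credit"),
    ("order-1", "ship"), ("order-1", "invoice"),
    ("order-2", "receive order"), ("order-2", "ship"), ("order-2", "invoice"),
    ("order-3", "receive order"), ("order-3", "check credit"),
    ("order-3", "ship"), ("order-3", "invoice"),
]

# Hypothetical intended process: the sequence every case is supposed to follow.
INTENDED = ("receive order", "check credit", "ship", "invoice")


def traces(event_log):
    """Collect the sequence of activities observed for each case."""
    by_case = {}
    for case_id, activity in event_log:
        by_case.setdefault(case_id, []).append(activity)
    return {case_id: tuple(acts) for case_id, acts in by_case.items()}


def variant_counts(event_log):
    """Count how many cases follow each distinct activity sequence (variant)."""
    return Counter(traces(event_log).values())


def deviating_cases(event_log, intended):
    """Return the ids of cases whose trace differs from the intended sequence."""
    return sorted(c for c, t in traces(event_log).items() if t != intended)


for variant, count in variant_counts(EVENT_LOG).most_common():
    print(count, " -> ".join(variant))
print("Deviating cases:", deviating_cases(EVENT_LOG, INTENDED))
```

In this made-up log, one order skips the credit check, which is exactly the kind of gap between the intended process and the actual one that the paragraph above describes.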

In short, Business Intelligence is suitable for gaining a broad understanding of the way a business usually functions. For some businesses, this will be sufficient. For others, an overview is not enough.

They understand that true insights lie in the detail, and are looking for a way of drilling down into exactly how each process within their organization actually works. Software that combines process discovery, process analysis, and conformance checking is the answer.

The right Process Intelligence tools mean you will be able to automatically mine process models from the different IT systems operating within your business, as well as continuously monitor your end-to-end processes for insights into potential risks and ongoing improvement opportunities. All of this is in service of a collaborative approach to process improvement, which will lead to a game-changing understanding of how your business works, and how it can work better.

Early humans evolved from more primitive ancestors, and in the process, learned to use more and more sophisticated tools. For the modern human, working in a complex organization, the right tool is Process Intelligence.

Endless Potential with Signavio Process Intelligence

Signavio Process Intelligence allows you to unearth the truth about your processes and make better decisions based on true evidence found in your organization’s IT systems. Get a complete end-to-end perspective and understanding of exactly what is happening in your organization in a matter of weeks.

As part of Signavio Business Transformation Suite, Signavio Process Intelligence integrates perfectly with Signavio Process Manager and is accessible from the Signavio Collaboration Hub. As an entirely cloud-based process mining solution, the tool makes it easy to collaborate with colleagues from all over the world and harness the wisdom of the crowd.

Find out more about Signavio Process Intelligence, and see how it can help your organization generate more ideas, save time and money, and optimize processes.

The Data Scientist Job and the Future

A dramatic upswing in data science jobs is driving the rise of data science professionals needed to close the supply-demand gap.

By 2024, a shortage of 250,000 data scientists is predicted in the United States alone. Data scientist has emerged as one of the hottest careers in the data world today. With digitization on the rise, IoT and cognitive technologies have generated a large number of data sets, making it difficult for organizations to unlock the value of this data.

With the constant rise of data science, those who fail to upgrade their skill set may be putting themselves at a competitive disadvantage. No doubt data scientist is still deemed one of the best job titles today, but the battle for expert professionals in this field is fierce.

The hiring market for data science professionals has gone into overdrive, making the competition even tougher. New online institutions have come up with credible certification programs for professionals to get skilled. Organizations, for their part, are on the hunt for candidates with data science and big data analytics skills, as these are the top skills in demand in the market today. In addition, it is said that these job roles typically take around 45 days to fill, which is five days longer than the U.S. market average.

Data science

One might come across several definitions of data science; a simple one states that it is an accumulation of data that is arranged and analyzed in a manner that will have an effect on businesses. According to Google, a data scientist is someone who has the ability to analyze and interpret complex data, is able to make use of the statistics of a website, and assists in business decision making. A data scientist also needs to be able to choose and build appropriate algorithms and predictive models that help analyze data in a viable manner and uncover positive insights from it.

A data scientist job is now a buzzworthy career in the IT industry. It has driven a wider workforce to get skilled in this job role, as most organizations are becoming data-driven. It is pretty obvious that being a data professional will widen job opportunities and offer more chances of getting a lucrative salary package today. Let us look at a few points that suggest the future of data science is bright.

  • Data science is still an evolving technology

A career without upskilling often becomes redundant. To stay relevant in the industry, it is crucial that professionals keep themselves up to date with the latest technologies. Data science is expected to offer an abundance of job opportunities in the coming decade. Since the supply is low, it is a good call for professionals looking to get skilled in this field.

  • Organizations are still facing a challenge using the data that is generated

Gemalto’s 2018 Data Security Confidence Index estimated that 65% of organizations could not analyze or categorize the data they had stored. However, 89% said that if they could analyze this information properly, they would have a competitive edge. As a data science professional, one can help organizations make progress with the data that is being gathered and draw positive insights from it.

  • In-demand skill-set

Most data scientists possess the in-demand skill set required by the industry today. To be specific, it is said that there has been a 256% increase in data science jobs since 2013. Skills such as Machine Learning, R and Python programming, predictive analytics, AI, and data visualization are the most common skills that employers seek in today’s candidates.

  • A humongous amount of data growing every day

Around 5 billion consumers interact with the internet on a daily basis. This number is set to increase to 6 billion in 2025, representing three-quarters of the world’s population.

In 2018, 33 zettabytes of data were generated, and this is projected to rise to 133 zettabytes by 2025. The production of data will only keep increasing, and data scientists will be the ones who help these enterprises handle it effectively.

  • Advancement in career

According to LinkedIn, data scientist was the most promising career of 2019. The top reason this job role ranked highest is the salary people were being offered, in the range of $130,000. The study also predicts that a data scientist has a high chance of earning a promotion, giving the role a career advancement score of 9 out of 10.

In short, data science is still a sought-after job and will remain so for the foreseeable future.

Interview: Does Business Intelligence Benefit from the Data Warehouse in the Cloud?

Interview with Ross Perez, Senior Director, Marketing EMEA at Snowflake

Read this Article in English:
“Does Business Intelligence benefit from Cloud Data Warehousing?”

Does Business Intelligence benefit from Cloud Data Warehousing?

Ross Perez is Senior Director, Marketing EMEA at Snowflake. He leads the Snowflake marketing team in EMEA and is charged with advancing the discussion about analytics, data, and cloud data warehousing in EMEA. Before Snowflake, Ross was a product marketer at Tableau Software, where he founded the Iron Viz Championship, the world’s largest and most elaborate data visualization competition.

Data Science Blog: Ross, Business Intelligence (BI) is not really a new trend. In 2019/2020, making data available to the whole company should no longer be an issue. Is that right?

BI is definitely an old trend, since reporting has been around for 50 years. People are used to getting statistics and data for the entire company and even for their business units. However, using BI to deliver analytics to every employee in the company and encouraging them to make decisions based on data for their own area is relatively new. In many of the companies Snowflake works with, there is a new group of employees who have only just gained access to self-service BI and visualization tools such as Tableau, Looker, and Sigma and are now starting to find answers to their questions.

Data Science Blog: Until now, BI was mainly about building dashboards for business reports, with the data warehouse (DWH) playing the role of the backend. Today we have a much greater need for data transparency. How should companies deal with this?

As more employees in more departments want to access data more often, the demand on back-end systems such as the data warehouse is rising rapidly. In many cases, companies have data warehouses that were not built for this concurrent and heterogeneous demand. Employees’ experiences with the DWH and BI are therefore often poor, because end users have to wait a long time for their reports. This is where Snowflake comes in: because we can use the power of the cloud to provision resources on demand, we can serve any number of users at the same time. Snowflake can also store unlimited amounts of data in both structured and semi-structured formats.

Data Science Blog: Would you say the DWH is the key to becoming a data-driven company? What else should be considered?

Absolutely. Without having all your data in a single, highly elastic, and flexible data warehouse, it can be a major challenge to give employees across the company access to insights.

Data Science Blog: So viel zur Theorie, lassen Sie uns nun über spezifische Anwendungsfälle sprechen. Generell macht es einen großen Unterschied, welche Daten wir speichern und analysieren wollen, beispielsweise Finanz- oder Maschinendaten. Was dürfen wir dabei nicht vergessen, wenn es um die Erstellung eines DWHs geht?

Finanzdaten und Maschinendaten sind sehr unterschiedlich und liegen häufig in unterschiedlichen Formaten vor. Beispielsweise weisen Finanzdaten häufig ein relationales Standardformat auf. Daten wie diese müssen mit Standard-SQL einfach abgefragt werden können, was viele Hadoop- und noSQL-Tools nicht sinnvoll bereitstellen konnten. Zum Glück handelt es sich bei Snowflake um ein SQL-Data-Warehouse nach ANSI-Standard, sodass die Verwendung dieser Art von Daten problemlos möglich ist.

Zum anderen sind Maschinendaten häufig teilstrukturiert oder sogar völlig unstrukturiert. Diese Art von Daten wird mit dem Aufkommen von Internet of Things (IoT) immer häufiger, aber herkömmliche Data Warehouses haben sich bisher kaum darauf vorbereitet, da sie für relationale Daten optimiert wurden. Halbstrukturierte Daten wie JSON, Avro, XML, Orc und Parkett können in Snowflake zur Analyse nahtlos in ihrem nativen Format geladen werden. Dies ist wichtig, da Sie die Daten nicht reduzieren müssen, um sie nutzen zu können.

Beide Datentypen sind wichtig und Snowflake ist das erste Data Warehouse, das nahtlos mit beiden zusammenarbeitet.

Data Science Blog: Zurück zum gewöhnlichen Anwendungsfall im Business, also der Erstellung von Verkaufs- und Einkaufs-Berichten für die Business Manager, die auf Daten von ERP-Systemen – wie etwa von Microsoft oder SAP – basieren. Welche Architektur könnte für das DWH die richtige sein? Wie viele Layer braucht ein DWH dafür?

Die Art des Berichts spielt weitgehend keine Rolle, da Sie in jedem Fall ein Data Warehouse benötigen, das alle Ihre Daten unterstützt und alle Ihre Benutzer bedient. Idealerweise möchten Sie es auch in der Lage sein, es je nach Bedarf ein- und auszuschalten. Das bedeutet, dass Sie eine Cloud-basierte Architektur benötigen… und insbesondere die innovative Architektur von Snowflake, die Speicher und Computer voneinander trennt und es Ihnen ermöglicht, genau das zu bezahlen, was Sie verwenden.

Data Science Blog: Wo würden Sie den Hauptteil der Geschäftslogik für einen Report implementieren? Tendenziell eher im DWH oder im BI-Tool, dass für das Reporting verwendet word? Hängt es eigentlich vom BI-Tool ab?

Das Tolle ist, dass Sie es frei wählen können. Snowflake kann als Data Warehouse für SQL nach dem ANSI-Standard ein hohes Maß an Datenmodellierung und Geschäftslogik-Implementierung unterstützen. Sie können aber auch Partner wie Looker und Sigma einsetzen, die sich auf die Datenmodellierung für BI spezialisiert haben. Wir sind der Meinung, dass es am besten ist, wenn jedes Unternehmen für sich selbst entscheidet, was der individuell richtige Ansatz ist.

Data Science Blog: Snowflake enables organizations to store and manage data in the cloud. But does that also mean that companies lose some control over their own data?

Customers have complete control over their data, and Snowflake cannot see or change any part of it. The advantage of a cloud solution is that customers do not have to manage the infrastructure or the tuning. They decide how they want to store and analyze their data, and Snowflake takes care of the rest.

Data Science Blog: How much effort is it for small or medium-sized companies to set up a DWH in the cloud? And does that also mean an expensive long-term project?

The nice thing about Snowflake is that you can get started with a free trial within a few minutes. Moving from a traditional data warehouse to Snowflake can take some time, depending on the legacy technology you are using. Snowflake itself, however, is quite easy to set up and highly compatible with legacy tools. Getting started could therefore hardly be easier.

A Bird’s Eye View: How Machine Learning Can Help You Charge Your E-Scooters

Bird scooters in Columbus, Ohio

Ever since I started using bike-sharing to get around in Seattle, I have become fascinated with geolocation data and the transportation sharing economy. When I saw this project leveraging the mobility data RESTful API from the Los Angeles Department of Transportation, I was eager to dive in and get my hands dirty building a data product utilizing a company’s mobility data API.

Unfortunately, the major bike and scooter providers (Bird, JUMP, Lime) don’t have publicly accessible APIs. However, some folks have seemingly been able to reverse-engineer the Bird API used to populate the maps in their Android and iOS applications.

One interesting feature of this data is the nest_id, which indicates if the Bird scooter is in a “nest” — a centralized drop-off spot for charged Birds to be released back into circulation.

I set out to ask the following questions:

  1. Can real-time predictions be made to determine if a scooter is currently in a nest?
  2. For non-nest scooters, can new nest location recommendations be generated from geospatial clustering?

To answer these questions, I built a full-stack machine learning web application, NestGenerator, which provides an automated recommendation engine for new nest locations. This application can help power Bird’s internal nest location generation that runs within their Android and iOS applications. NestGenerator also provides real-time strategic insight for Bird chargers who are enticed to optimize their scooter collection and drop-off route based on proximity to scooters and nest locations in their area.

Bird

The electric scooter market has seen substantial growth with Bird’s recent billion-dollar valuation and their $300 million Series C round in the summer of 2018. Bird offers electric scooters that top out at 15 mph and cost $1 to unlock plus 15 cents per minute of use. Bird scooters are in over 100 cities globally, and the company announced in late 2018 that it had eclipsed 10 million scooter rides since its launch in 2017.

Bird scooters in Tel Aviv, Israel

With all of these scooters populating cities, there is real demand for people to charge them. A charger can earn additional income by charging the scooters at home and releasing them back into circulation at nest locations. The base price for charging each Bird is $5.00, and it goes up from there when the Birds are harder to capture.

Data Collection and Machine Learning Pipeline

The full data pipeline for building “NestGenerator”

Data

From the details here, I was able to write a Python script that returned a list of Bird scooters within a specified area, their geolocation, unique ID, battery level and a nest ID.

I collected scooter data from four cities (Atlanta, Austin, Santa Monica, and Washington D.C.) across varying times of day over the course of four weeks. Collecting data from different cities was critical to the goal of training a machine learning model that would generalize well across cities.

Once equipped with the scooter’s latitude and longitude coordinates, I was able to leverage additional APIs and municipal data sources to get granular geolocation data to create an original scooter attribute and city feature dataset.

Data Sources:

  • Walk Score API: returns a walk score, transit score and bike score for any location.
  • Google Elevation API: returns elevation data for all locations on the surface of the earth.
  • Google Places API: returns information about places. Places are defined within this API as establishments, geographic locations, or prominent points of interest.
  • Google Reverse Geocoding API: reverse geocoding is the process of converting geographic coordinates into a human-readable address.
  • Weather Company Data: returns the current weather conditions for a geolocation.
  • LocationIQ: Nearby Points of Interest (PoI) API returns specified PoIs or places around a given coordinate.
  • OSMnx: Python package that lets you download spatial geometries and model, project, visualize, and analyze street networks from OpenStreetMap’s APIs.

Feature Engineering

After extensive API wrangling, which included a prolonged four-week data collection phase, I was finally able to put together a diverse feature set to train machine learning models. I engineered 38 features to classify whether a scooter is currently in a nest.

Full Feature Set

The features boiled down into four categories:

  • Amenity-based: parks within a given radius, gas stations within a given radius, walk score, bike score
  • City Network Structure: intersection count, average circuity, street length average, average streets per node, elevation level
  • Distance-based: proximity to closest highway, primary road, secondary road, residential road
  • Scooter-specific attributes: battery level, proximity to closest scooter, high battery level (> 90%) scooters within a given radius, total scooters within a given radius

 

Log-Scale Transformation

For each feature, I plotted the distribution to look for feature engineering opportunities. For features with a right-skewed distribution, where the mean is typically greater than the median, I applied a log transformation to bring the distribution closer to normal and reduce the influence of outlier observations. This approach was used to generate a log feature for proximity to closest scooter, closest highway, primary road, secondary road, and residential road.
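The effect of such a transformation can be sketched in a few lines of Python. This is only an illustration under assumptions: the distance values are made up, and `log1p` (the log of one plus the value) is my choice so that a distance of zero stays defined; the original project may have used a different variant.

```python
import math
import statistics

# Hypothetical right-skewed feature: distance in meters to the closest scooter.
distances_m = [3.0, 5.0, 8.0, 12.0, 20.0, 35.0, 60.0, 150.0, 900.0]

# log1p(x) = log(1 + x), so a distance of exactly zero does not raise an error.
log_distances = [math.log1p(d) for d in distances_m]


def mean_median_gap(values):
    """Relative gap between mean and median, used here as a rough skew indicator."""
    median = statistics.median(values)
    return (statistics.mean(values) - median) / median


raw_gap = mean_median_gap(distances_m)
log_gap = mean_median_gap(log_distances)
print(f"relative mean-median gap, raw: {raw_gap:.2f}, log: {log_gap:.2f}")
```

In the raw data the single 900 m outlier pulls the mean far above the median; after the transformation the two sit much closer together.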

An example of a log transformation

Statistical Analysis: A Systematic Approach

Next, I wanted to ensure that the features I included in my model displayed significant differences when broken up by nest classification. My thinking was that any features that did not significantly differ when stratified by nest classification would not have a meaningful predictive impact on whether a scooter was in a nest or not.

Distributions of a feature stratified by nest classification can be tested for statistically significant differences. I used an unpaired samples t-test with a 0.01% significance level to compute a p-value and confidence interval to determine if there was a statistically significant difference in means for a feature stratified by nest classification. I rejected the null hypothesis if the p-value was smaller than the 0.01% threshold and the 99.9% confidence interval did not straddle zero. By rejecting the null hypothesis in favor of the alternative hypothesis, I concluded that there is a significant difference in the means of a feature by nest classification.
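The decision rule described above can be sketched as follows. Several things here are assumptions rather than the original analysis: the battery levels are simulated, the unequal-variance (Welch) form of the standard error is my choice, and the p-value and interval use a normal approximation to the t distribution, which is only reasonable for large samples. The `alpha` and `ci_level` defaults mirror the thresholds named in the text.

```python
import math
import random
import statistics
from statistics import NormalDist


def unpaired_t_test(sample_a, sample_b, alpha=0.0001, ci_level=0.999):
    """Unpaired test on the difference in means (Welch-style standard error).

    Assumption: the t distribution is approximated by a standard normal,
    which is acceptable only when both samples are large.
    """
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    std_err = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    t_stat = diff / std_err
    normal = NormalDist()
    p_value = 2.0 * (1.0 - normal.cdf(abs(t_stat)))  # two-sided
    critical = normal.inv_cdf(1.0 - (1.0 - ci_level) / 2.0)
    ci = (diff - critical * std_err, diff + critical * std_err)
    straddles_zero = ci[0] <= 0.0 <= ci[1]
    reject_null = p_value < alpha and not straddles_zero
    return t_stat, p_value, ci, reject_null


# Simulated battery levels (hypothetical numbers, not the collected data).
rng = random.Random(0)
nest_battery = [rng.gauss(95, 4) for _ in range(300)]
non_nest_battery = [rng.gauss(60, 20) for _ in range(300)]

t_stat, p_value, ci, reject_null = unpaired_t_test(nest_battery, non_nest_battery)
print(f"t = {t_stat:.1f}, reject null: {reject_null}")
```

A feature for which `reject_null` comes back false would be a candidate for removal under the approach described here.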

Battery Level Distribution Stratified by Nest Classification to run a t-test

Log of Closest Scooter Distribution Stratified by Nest Classification to run a t-test

Throwing Away Features

Using the approach above, I removed ten features that did not display statistically significant results.

Statistically Insignificant Features Removed Before Model Development

Model Development

I trained two models, a random forest classifier and an extreme gradient boosting classifier, since tree-based models can handle skewed data, capture important feature interactions, and provide a feature importance calculation. I trained the models on 70% of the data collected for all four cities and reserved the remaining 30% for testing.

After tuning the models’ hyper-parameters for performance on cross-validation data, it was time to run the models on the 30% of test data set aside from the initial data collection.

I also collected additional test data from other cities (Columbus, Fort Lauderdale, San Diego) not involved in training the models. I took this step to ensure the selection of a machine learning model that would generalize well across cities. The performance of each model on the additional test data determined which model would be integrated into the application development.

Performance on Additional Cities Test Data

The Random Forest Classifier displayed superior performance across the board

I opted to move forward with the random forest model because of its superior performance on AUC score and accuracy metrics on the additional cities test data. AUC is the Area under the ROC Curve, and it provides an aggregate measure of model performance across all possible classification thresholds.
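One way to see what AUC measures is its pairwise interpretation: it equals the probability that a randomly chosen positive example (a scooter in a nest) receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up labels and scores; it is an illustration of the metric, not the evaluation code used for the models.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical nest labels (1 = in a nest) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = auc_score(labels, scores)
print(f"AUC = {auc:.3f}")
```

Because the value depends only on how the scores rank the examples, it does not change when a different classification threshold is chosen, which is why it summarizes performance across all thresholds.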

AUC Score on Test Data for each Model

Feature Importance

Battery level dominated as the most important feature. Other important model features were proximity to high-battery-level scooters, proximity to closest scooter, and average distance to high-battery-level scooters.

Feature Importance for the Random Forest Classifier

The Trade-off Space

Once I had a working machine learning model for nest classification, I started to build out the application using Flask, a web framework written in Python. After spending a few days writing code for the application and incorporating the trained random forest model, I had enough to test the basic functionality. I could finally run the application locally to call the Bird API and classify scooters into nests in real time! There was one huge problem, though. It took more than seven minutes to generate the predictions and populate them in the application. That just wasn’t going to cut it.

The question remained: will this model deliver in a production-grade environment with the goal of making real-time classifications? This is a key trade-off in production-grade machine learning applications: at one end of the spectrum we’re optimizing for model performance, and at the other we’re optimizing for low-latency application performance.

As I continued to test out the application’s performance, I still faced the challenge of relying on so many APIs for real-time feature generation. Due to rate-limiting constraints and daily request limits across so many external APIs, the current machine learning classifier was not feasible to incorporate into the final application.

Run-Time Compliant Application Model

After going back to the drawing board, I trained a random forest model that relied primarily on scooter-specific features which were generated directly from the Bird API.

Through a process called vectorization, I was able to transform the geolocation distance calculations utilizing NumPy arrays which enabled batch operations on the data without writing any “for” loops. The distance calculations were applied simultaneously on the entire array of geolocations instead of looping through each individual element. The vectorization implementation optimized real-time feature engineering for distance related calculations which improved the application response time by a factor of ten.
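A minimal sketch of this kind of vectorization is shown below. It assumes the haversine formula for great-circle distance and a mean Earth radius; the function names are hypothetical and the original application may have computed distances differently. The point is that NumPy broadcasting produces every pairwise distance in one array expression instead of a nested loop.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def pairwise_haversine_m(lat_deg, lon_deg):
    """All pairwise great-circle distances in meters, without Python loops."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column against a row yields an (n, n) difference matrix.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (
        np.sin(dlat / 2.0) ** 2
        + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2.0) ** 2
    )
    return 2.0 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def closest_scooter_m(lat_deg, lon_deg):
    """Distance from each scooter to its nearest neighbor."""
    dist = pairwise_haversine_m(lat_deg, lon_deg)
    np.fill_diagonal(dist, np.inf)  # ignore each scooter's distance to itself
    return dist.min(axis=1)


# Three hypothetical scooters on the equator, 1 and 2 degrees of longitude apart.
lats = [0.0, 0.0, 0.0]
lons = [0.0, 1.0, 3.0]
nearest = closest_scooter_m(lats, lons)
print(nearest)
```

The full distance matrix needs memory proportional to the square of the number of scooters, which is fine for the results of a single API call but would need chunking for much larger inputs.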

Feature Importance for the Run-time Compliant Random Forest Classifier

This random forest model generalized well on test data, with an AUC score of 0.95 and an accuracy rate of 91%. The model retained its prediction accuracy compared to the former feature-rich model, but it gained 60x in application performance. This was a necessary trade-off for building a functional application with real-time prediction capabilities.

Geospatial Clustering

Now that I finally had a working machine learning model for classifying nests in a production grade environment, I could generate new nest locations for the non-nest scooters. The goal was to generate geospatial clusters based on the number of non-nest scooters in a given location.

The k-means algorithm is likely the most common clustering algorithm. However, k-means is not an optimal solution for widespread geolocation data because it minimizes variance, not geodetic distance. This can create suboptimal clustering from distortion in distance calculations at latitudes far from the equator. With this in mind, I initially set out to use the DBSCAN algorithm which clusters spatial data based on two parameters: a minimum cluster size and a physical distance from each point. There were a few issues that prevented me from moving forward with the DBSCAN algorithm.

  1. The DBSCAN algorithm does not allow for specifying the number of clusters, which was problematic as the goal was to generate a number of clusters as a function of non-nest scooters.
  2. I was unable to home in on an optimal physical distance parameter that would change dynamically based on the Bird API data. This led to suboptimal nest locations due to a distortion in how the physical distance parameter was applied in clustering. For example, Santa Monica, where there are ~15,000 scooters, has a higher concentration of scooters in a given area, whereas Brookline, MA has a sparser set of scooter locations.

An example of how sparse scooter locations vs. highly concentrated scooter locations for a given Bird API call can create cluster distortion based on a static physical distance parameter in the DBSCAN algorithm. Left: Bird scooters in Brookline, MA. Right: Bird scooters in Santa Monica, CA.

Given the granularity of geolocation scooter data I was working with, geospatial distortion was not an issue and the k-means algorithm would work well for generating clusters. Additionally, the k-means algorithm parameters allowed for dynamically customizing the number of clusters based on the number of non-nest scooters in a given location.

Once clusters were formed with the k-means algorithm, I derived a centroid from all of the observations within a given cluster. In this case, the centroids are the mean latitude and mean longitude for the scooters within a given cluster. The centroid coordinates are then projected as the new nest recommendations.
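The idea can be sketched with a small, dependency-free version of k-means. Everything specific here is an assumption for illustration: the coordinates are invented, `SCOOTERS_PER_NEST` is a hypothetical rule for deriving the number of clusters from the number of non-nest scooters, and distances are plain Euclidean distances in degrees, which is only acceptable at city scale.

```python
import random

SCOOTERS_PER_NEST = 3  # hypothetical rule: one recommended nest per three scooters


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm on (lat, lon) tuples; returns the k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for lat, lon in points:
            nearest = min(
                range(k),
                key=lambda i: (lat - centroids[i][0]) ** 2 + (lon - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((lat, lon))
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                # Centroid = mean latitude and mean longitude of the cluster.
                mean_lat = sum(p[0] for p in members) / len(members)
                mean_lon = sum(p[1] for p in members) / len(members)
                new_centroids.append((mean_lat, mean_lon))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids


# Hypothetical non-nest scooters forming two groups a few kilometers apart.
non_nest_scooters = [
    (34.010, -118.490), (34.012, -118.492), (34.011, -118.488),
    (34.040, -118.450), (34.042, -118.452), (34.041, -118.448),
]
n_clusters = max(1, len(non_nest_scooters) // SCOOTERS_PER_NEST)
recommended_nests = sorted(kmeans(non_nest_scooters, n_clusters))
print(recommended_nests)
```

Each returned centroid is then a candidate nest location; in practice a library implementation with multiple random restarts would be the more robust choice.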

NestGenerator showcasing non-nest scooters and new nest recommendations utilizing the K-Means algorithm

NestGenerator Application

After wrapping up the machine learning components, I shifted to building out the remaining functionality of the application. The final iteration of the application is deployed to Heroku’s cloud platform.

In the NestGenerator app, a user specifies a location of their choosing. The app then calls the Bird API for scooters within that location and generates all of the model features for predicting nest classification using the trained random forest model. This forms the foundation for filtering the map based on nest classification.

Drop-Down Map View filtering based on Nest Classification

Nearest Generated Nest

To see the generated nest recommendations, a user selects the “Current Non-Nest Scooters & Predicted Nest Locations” filter which will then populate the application with these nest locations. Based on the user’s specified search location, a table is provided with the proximity of the five closest nests and an address of the Nest location to help inform a Bird charger in their decision-making.
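Ranking the generated nests by proximity could look roughly like the sketch below. The haversine distance, the function names and the sample coordinates are assumptions for illustration; the address lookup shown in the app's table is left out.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2.0) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def closest_nests(search_location, nests, limit=5):
    """Return the `limit` nests nearest to the search location, with distances."""
    with_distance = [(nest, haversine_m(*search_location, *nest)) for nest in nests]
    with_distance.sort(key=lambda pair: pair[1])
    return with_distance[:limit]


# Hypothetical search location and six generated nest recommendations.
search_location = (0.0, 0.0)
generated_nests = [
    (0.0, 0.06), (0.0, 0.02), (0.0, 0.05),
    (0.0, 0.01), (0.0, 0.04), (0.0, 0.03),
]
table = closest_nests(search_location, generated_nests)
for nest, meters in table:
    print(f"{nest}: {meters:.0f} m")
```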

NestGenerator web-layout with nest addresses and proximity to nearest generated nests

Conclusion

By accurately predicting nest classification and clustering non-nest scooters, NestGenerator provides an automated recommendation engine for new nest locations. For Bird, this application can help power their nest location generation that runs within their Android and iOS applications. NestGenerator also provides real-time strategic insight for Bird chargers who are enticed to optimize their scooter collection and drop-off route based on scooters and nest locations in their area.

Code

The code for this project can be found on my GitHub.

Comments or questions? Please send me an e-mail!

 

Interview – Knowledge Graphs and Semantic Technologies

“It’s incredibly empowering when data that is clear and understood – what we call ‘beautiful data’ – is available to the data workforce.”

Juan F. Sequeda is co-founder of Capsenta, a spin-off from his research, and Senior Director of Capsenta Labs. He is an expert on knowledge graphs, semantic web, semantic & graph data management and (ontology-based) data integration. In this interview Juan lets us know how SMEs can create value from data, what makes the Knowledge Graph so important and why CDOs and CIOs should use semantic technologies.

Data Science Blog: If you had to name five things that apply to SMEs as well as enterprises as they are on their journey through digital transformation: What are the most important steps to take in order to create value from data?

I would state four things:

  1. Focus on the business problem that needs to be solved instead of the technology.
  2. Getting value out of your data is a social-technical problem. Not everything can be solved by technology and automation. It is crucial to understand the social/human aspect of the problems.
  3. Avoid boiling the ocean. Be agile and iterate.
  4. Recall that it’s a marathon, not a sprint. Hence why you shouldn’t focus on boiling the ocean.

Data Science Blog: You help companies to make their company data meaningful and thus increase their value. The magic word is the knowledge graph. What exactly is a Knowledge Graph?

Let’s recall that the term “knowledge graph”, that is being actively used today, was coined by Google in a 2012 blogpost. From an industry point of view, it’s a term that represents data integration, where not just entities but also relationships are first class citizens. In other words, it’s data integration based on graphs. That is why you see graph database companies use the term knowledge graph instead of data integration.

In the academic circle, there is a “debate” on what the term “knowledge graph” means. As academics, it’s clear that we should always strive to have well defined terms. Nevertheless, I find it ironic that academics are spending time debating on the definition of a term that appeared in a (marketing) blog post 7 years ago! I agree with Simeon Warner on this: “I care about putting more knowledge in my graph, instead of defining what is a knowledge graph”.

Whatever definition prevails, it should be open and inclusive.

On a final note, it is paramount that we remember our history in order to avoid reinventing the wheel. There is over half a century of research results that has led us to what we are calling Knowledge Graphs today. If you are interested, please check out our upcoming ISWC 2019 tutorial “Knowledge Graphs: How did we get here? A Half Day Tutorial on the History of Knowledge Graph’s Main Ideas”.

Data Science Blog: Speaking of Knowledge Graphs: According to SEMANTiCS 2019 Research and Innovation Chair Philippe Cudre-Mauroux the next generation of knowledge graphs will capture more detailed information. Towards which directions are you steering with gra.fo?

Gra.fo is a collaborative modeling tool for knowledge graph schemas (i.e., ontologies), combined with Google Docs-style features such as real-time collaboration, comments, history and search.

Designing a knowledge graph schema is just the first step. You have to do something with it! The next step is to map the knowledge graph schema to underlying data sources in order to integrate data.

We are driving Gra.fo to also be a mapping management system. We recently released our first mapping features. You now have the ability to import existing R2RML mappings. The next step will be to create the mappings between relational databases and the schema all within Gra.fo. Furthermore, we will extend it to support mappings from different types of sources.

Finally, there are so many features that our users are requesting. We are working on those and will also offer an API in order to empower users to develop their own apps and features.

Data Science Blog: At Capsenta, you are changing the way enterprises model, govern and integrate data. Put in brief, how can you explain the benefits of using semantic technologies and knowledge technologies to a CDO or CIO? Which clients could you serve and how did you help them?

Business users need to answer critical business questions quickly and accurately. However, the frequent bottleneck is the lack of understanding of the large and complex enterprise databases. Additionally, the IT experts who do understand are not always available. The ultimate goal is to empower business users to access the data in the way they think of their domain.

This is where Knowledge Graphs come into play.

At Capsenta, we use our Knowledge Graph technology to bridge this conceptualization gap between the complex and inscrutable data sources and the business intelligence and data analytics tools that domain experts use to answer critical business questions. Our goal is to deliver beautiful data so the business users and data scientists can run with the data.

We are helping large scale enterprises in e-commerce, oil & gas and life science industries to generate beautiful data.

Data Science Blog: What are reasons for which Knowledge Graphs should be part of any corporate strategy?

Graphs are very easy for people to understand, and they express the complex relationships between concepts. Bubbles and lines between them (i.e., a graph!) are what domain experts draw on the whiteboard all the time. We have even had C-level executives look at a Knowledge Graph and immediately see how it expresses a portion of their business and even offer suggestions for additional richness. Imagine that: C-level executives participating in an ontology engineering session because they understand the graph.

This is in sharp contrast to the data itself, which is almost always very difficult to understand and overwhelming in scope. Critical business value is available in a subset of this data. A Knowledge Graph bridges the conceptual gap between a critical portion of the inscrutable data itself and the business user’s view of their world.

It’s incredibly empowering when data that is clear and understood – what we call “beautiful data” – is available to the data workforce.

Data Science Blog: Data-driven process analyses require interdisciplinary knowledge. What advice would you give to a process manager who wants to familiarize her-/himself with the topic?

Domain experts/business users frequently use multiple words/phrases to mean the same thing and also a specific phrase can mean different things to different people. Also, the domain experts/business users speak a very different language than the IT database owners.

How can the business have clear, accurate answers when there’s inconsistency in what people mean and are thinking?

This is the social problem of getting everyone on the same page. We’ve seen Knowledge Graphs dramatically help with this problem. The exercise of getting people to agree upon what they mean and encoding it in an intuitive Knowledge Graph is very powerful.

The Knowledge Graph also brings the IT stakeholders into the process by clarifying exactly what data or, typically, complex calculations of data is the actual, accurate value for each and every business concept and relationship expressed in the Knowledge Graph.

It is crucial to avoid boiling the ocean. That is why we have designed a pay-as-you-go methodology to start small and provide value as quickly and accurately as possible. Ideally, the team has available what we call a “Knowledge Engineer”. This is someone who can effectively speak with the business users/domain experts and also nerd out with the database folks.

About SEMANTiCS Conference

SEMANTiCS is an established knowledge hub where technology professionals, industry experts, researchers and decision makers can learn about new technologies, innovations and enterprise implementations in the fields of Linked Data and Semantic AI. Founded in 2005, SEMANTiCS is the only European conference at the intersection of research and industry.

This year’s event is hosted by the Semantic Web Company, FIZ Karlsruhe – Leibniz Institute for Information Infrastructure GmbH, Fachhochschule St. Pölten Forschungs GmbH, KILT Competence Center am Institut für Angewandte Informatik e.V. and Vrije Universiteit Amsterdam.

Interview: Does Business Intelligence benefit from Cloud Data Warehousing?

Interview with Ross Perez, Senior Director, Marketing EMEA at Snowflake

Read this article in German:
“Profitiert Business Intelligence vom Data Warehouse in der Cloud?”

Ross Perez is the Senior Director, Marketing EMEA at Snowflake. He leads the Snowflake marketing team in EMEA and is charged with starting the discussion about analytics, data, and cloud data warehousing across EMEA. Before Snowflake, Ross was a product marketer at Tableau Software where he founded the Iron Viz Championship, the world’s largest and longest running data visualization competition.

Data Science Blog: Ross, Business Intelligence (BI) is not really a new trend. In 2019/2020, making data available for the whole company should not be a big thing anymore. Would you agree?

BI is definitely an old trend; reporting has been around for 50 years. People are accustomed to seeing statistics and data for the company at large, and even their business units. However, using BI to deliver analytics to everyone in the organization and encouraging them to make decisions based on data for their specific area is relatively new. In a lot of the companies Snowflake works with, there is a huge new group of people who have recently received access to self-service BI and visualization tools like Tableau, Looker and Sigma, and they are just starting to find answers to their questions.

Data Science Blog: Up until today, BI was just about delivering dashboards for reporting to the business. The data warehouse (DWH) was something like the backend. Today we have increased demand for data transparency. How should companies deal with this demand?

Because more people in more departments are wanting access to data more frequently, the demand on backend systems like the data warehouse is skyrocketing. In many cases, companies have data warehouses that weren’t built to cope with this concurrent demand and that means that the experience is slow. End users have to wait a long time for their reports. That is where Snowflake comes in: since we can use the power of the cloud to spin up resources on demand, we can serve any number of concurrent users. Snowflake can also house unlimited amounts of data, of both structured and semi-structured formats.

Data Science Blog: Would you say the DWH is the key driver for becoming a data-driven organization? What else should be considered here?

Absolutely. Without having all of your data in a single, highly elastic, and flexible data warehouse, it can be a huge challenge to actually deliver insight to people in the organization.

Data Science Blog: So much for the theory, now let’s talk about specific use cases. In general, it matters a lot whether you are storing and analyzing e.g. financial data or machine data. What do we have to consider for both purposes?

Financial data and machine data do look very different, and often come in different formats. For instance, financial data is often in a standard relational format. Data like this needs to be easily queryable with standard SQL, something that many Hadoop and NoSQL tools were unable to provide. Luckily, Snowflake is an ANSI-standard SQL data warehouse, so it can be used with this type of data quite seamlessly.

On the other hand, machine data is often semi-structured or even completely unstructured. This type of data is becoming significantly more common with the rise of IoT, but traditional data warehouses were very bad at dealing with it since they were optimized for relational data. Semi-structured data like JSON, Avro, XML, Orc and Parquet can be loaded into Snowflake for analysis quite seamlessly in its native format. This is important, because you don’t want to have to flatten the data to get any use from it.

Both types of data are important, and Snowflake is really the first data warehouse that can work with them both seamlessly.

Data Science Blog: Back to the common business use case: Creating sales or purchase reports for the business managers, based on data from ERP-systems such as Microsoft or SAP. Which architecture for the DWH could be the right one? How many and which database layers do you see as necessary?

The type of report largely does not matter, because in all cases you want a data warehouse that can support all of your data and serve all of your users. Ideally, you also want to be able to turn it off and on depending on demand. That means that you need a cloud-based architecture… and specifically Snowflake’s innovative architecture that separates storage and compute, making it possible to pay for exactly what you use.

Data Science Blog: Where would you implement the main part of the business logic for the report? In the DWH or in the reporting tool? Does it matter which reporting tool we choose?

The great thing is that you can choose either. Snowflake, as an ANSI-standard SQL data warehouse, can support a high degree of data modeling and business logic. But you can also utilize partners like Looker and Sigma who specialize in data modeling for BI. We think it’s best that the customer chooses what is right for them.

Data Science Blog: Snowflake enables organizations to store and manage their data in the cloud. Does it mean companies lose control over their storage and data management?

Customers have complete control over their data, and in fact Snowflake cannot see, alter or change any aspect of their data. The benefit of a cloud solution is that customers don’t have to manage the infrastructure or the tuning – they decide how they want to store and analyze their data and Snowflake takes care of the rest.

Data Science Blog: How big is the effort for smaller and medium sized companies to set up a DWH in the cloud? Does this have to be an expensive long-term project in every case?

The nice thing about Snowflake is that you can get started with a free trial in a few minutes. Now, moving from a traditional data warehouse to Snowflake can take some time, depending on the legacy technology that you are using. But Snowflake itself is quite easy to set up and very much compatible with legacy tools, making it relatively easy to move over.

General Remarks on Geodata

This article is the first in a series of articles on “geodata analysis”.

Of the many kinds of datasets that are publicly available on the internet, I have lately kept stumbling across one particularly interesting group that can be used for several purposes at once: geodata.

From an economic point of view in particular, there is a whole range of use cases in which geodata can provide insights that would not be possible without it. Probably the best-known case is simple navigation between two points, familiar to anyone who has used a navigation system or had Google Maps calculate a route.
This answers not only the question of the fastest or most energy-efficient (and therefore also most economical) route, e.g. from Berlin to Hamburg, but also yields the best possible solution for exceptional situations such as traffic jams or full road closures (yes, a traffic jam is, at least in theory, still an “exceptional situation” ;-)).
Besides this popular way of using geodata, there are many other situations in which it can be helpful or even essential. One example is the catchment area of competing units such as supermarkets. Without being able to present statistical evidence here, most people (at least in my personal observation) almost always shop at the supermarket that is most convenient to reach, and that is usually the nearest one. Given a database recording where each supermarket or supermarket chain is located, the catchment area of each supermarket can be computed quite easily with so-called Voronoi diagrams.
Such maps can also be drawn for any other entities with a fixed geographic position: ATMs, radio masts, public transport, …
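The idea behind a Voronoi diagram can be sketched in a few lines: a location belongs to the cell of whichever site is closest to it. The following is only a minimal illustration under simplifying assumptions. The supermarket names, the coordinates and the household positions are made up, and distances are measured on a flat plane, whereas real geodata would first need a suitable map projection or a geodesic distance.

```python
import math

# Hypothetical supermarket positions as (x, y) in kilometres on a flat plane.
SUPERMARKETS = {
    "Market A": (0.0, 0.0),
    "Market B": (4.0, 0.0),
    "Market C": (2.0, 3.0),
}

# Hypothetical household positions in the same coordinate system.
HOUSEHOLDS = [(0.5, 0.5), (3.5, 0.2), (2.0, 2.5), (1.0, 0.1)]


def nearest_site(point, sites):
    """Return the name of the site whose Voronoi cell contains `point`.

    By definition, a point lies in the Voronoi cell of its nearest site.
    """
    return min(sites, key=lambda name: math.dist(point, sites[name]))


def catchment_areas(points, sites):
    """Group points by the site that is closest to them."""
    cells = {name: [] for name in sites}
    for point in points:
        cells[nearest_site(point, sites)].append(point)
    return cells


if __name__ == "__main__":
    for market, households in catchment_areas(HOUSEHOLDS, SUPERMARKETS).items():
        print(market, households)
```

This assigns individual points to cells rather than constructing the cell polygons themselves; for drawing the actual diagram, a GIS program or a computational geometry library would be the more natural tool.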

Another example of interest for data analysis is the cartographic evaluation of postal codes. These appear in almost every dataset on customers, suppliers, etc., yet they form neither an ordinal nor a meaningful categorical variable, since there are many thousands of different ones. Moreover, simply grouping them into coarser categories, such as postal codes of the pattern 1xxxx, rarely makes sense, because these usually do not map meaningfully onto, for example, political regions such as federal states. One way out of this dilemma is a simple map that shows the individual postal code areas on a color scale.

The example shown is a map of Germany's population density. It makes clear, quickly and at a glance, where in Germany the population is concentrated. Similar maps can be created to answer questions such as “How are my customers distributed?” or “Where did advertising campaign XYZ work particularly well?”. If further data such as the absolute population or the population density is included, answers to questions such as “What share of the population have I already reached, and where is there still untapped potential?” or “Is my product more in demand in urban or in rural areas?” can also be found quickly and easily.
Without the corresponding geographic information, postal codes in particular are unfortunately often left aside in data analysis as “not meaningfully analyzable”.
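Before anything is drawn on a map, the question “What share of the population have I already reached?” comes down to a simple aggregation per postal code area. The sketch below shows one way this could look. All records, postal codes and population figures are invented for illustration, and the field name `plz` is an assumption rather than a fixed schema.

```python
from collections import Counter

# Hypothetical customer records; only the postal code ("plz") is used here.
CUSTOMERS = [
    {"id": 1, "plz": "10115"},
    {"id": 2, "plz": "10115"},
    {"id": 3, "plz": "10115"},
    {"id": 4, "plz": "10115"},
    {"id": 5, "plz": "20095"},
]

# Hypothetical number of residents per postal code area (invented figures).
POPULATION = {"10115": 20000, "20095": 5000, "80331": 10000}


def reach_by_postal_code(customers, population):
    """Return customers per resident for every postal code area.

    Areas without any customers get a share of 0.0, which makes
    untapped regions visible instead of silently dropping them.
    """
    counts = Counter(customer["plz"] for customer in customers)
    return {
        plz: counts.get(plz, 0) / residents
        for plz, residents in population.items()
        if residents > 0
    }


if __name__ == "__main__":
    for plz, share in sorted(reach_by_postal_code(CUSTOMERS, POPULATION).items()):
        print(f"{plz}: {share:.4%} of residents are customers")
```

The resulting value per postal code is what a map would then encode on its color scale.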
A completely different kind of advantage of geodata is the educational point of view:
  • Anyone who is just starting out with databases will find that streets, postal codes and countries offer a much easier and, above all, more intuitive way into SQL than abstract quantities and numbers such as ProductID, CustomerID and AdressID. As a side note, geodata can also be plotted surprisingly easily and attractively using so-called geographic information systems (GIS programs).
  • Anyone who already knows a little SQL can learn a great deal about databases and their capabilities, but also about their limits, from the SQL functions provided by, for example, Spatialite or PostGIS.
  • Anyone for whom relational databases and their functions have long been nothing new can use geodata to get into the topic of “Big Data” surprisingly easily (even on their own notebook), since the publicly available geodata, e.g. from the OpenStreetMap project, amounts to many dozens of GB even in an optimally packed format. In particular, the ability of many GIS programs such as QGIS to plot road, rail and power networks “on the fly” makes the significance of correctly or incorrectly set indices in various databases impressively clear, simply through the speed at which the plots build up.
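The effect of an index can be seen even without a large dataset by asking the database how it intends to execute a query. The sketch below uses SQLite from the Python standard library with a hypothetical `roads` table. The table, its columns and the index name are assumptions for illustration, and the index is an ordinary index on an attribute column, not a spatial index of the kind a GIS database would use for geometries.

```python
import sqlite3

# Hypothetical table of road segments; schema and names are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE roads (id INTEGER PRIMARY KEY, name TEXT, road_type TEXT)"
)
conn.executemany(
    "INSERT INTO roads (name, road_type) VALUES (?, ?)",
    [("A24", "motorway"), ("B5", "primary"), ("Hauptstrasse", "residential")],
)

QUERY = "SELECT name FROM roads WHERE road_type = 'motorway'"


def query_plan(connection, sql):
    """Return the human-readable steps SQLite plans to use for `sql`."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The description of each step is the last column of every row.
    return [row[-1] for row in rows]


# Without an index, SQLite has to scan the whole table.
plan_without_index = query_plan(conn, QUERY)

# With an index on the filtered column, it can search the index instead.
conn.execute("CREATE INDEX idx_roads_type ON roads (road_type)")
plan_with_index = query_plan(conn, QUERY)

if __name__ == "__main__":
    print("without index:", plan_without_index)
    print("with index:   ", plan_with_index)
```

On three rows the difference is not measurable, but on tables with millions of road segments the same change in the query plan is what separates a plot that builds up instantly from one that takes minutes.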
To get hold of datasets, it is usually enough to feed Google the appropriate keywords.
Besides Google Maps, which, to use a comparison, is the Brockhaus of maps, there is for example the OpenStreetMap project, a free geodataset that in this context can be thought of as the Wikipedia of maps.
Here you can find data such as road, rail or power networks; the buildings and supermarkets drawn in the Voronoi diagram above also come from this dataset. With it, you can quite easily find out interesting things just for fun, e.g. that there are about 28 million buildings in Germany (a SQL one-liner), or that the east of Berlin, even about 30 years after reunification, is still served predominantly by the tram, while in the west it is mainly the U-Bahn that runs. Or via which routes the electricity generated by wind turbines in the North Sea reaches the mainland and is distributed onward from there.
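To give an impression of what such a SQL one-liner might look like, here is a small sketch using SQLite from the Python standard library. The table `polygons` and its `building` column are hypothetical stand-ins. The actual table and column names depend on the tool used to import the OpenStreetMap data, and the five rows below are invented.

```python
import sqlite3

# Toy stand-in for an imported OpenStreetMap polygon table (made-up schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE polygons (osm_id INTEGER PRIMARY KEY, building TEXT)")
conn.executemany(
    "INSERT INTO polygons (osm_id, building) VALUES (?, ?)",
    [
        (1, "yes"),
        (2, "house"),
        (3, None),           # e.g. a park: a polygon, but not a building
        (4, "supermarket"),
        (5, None),           # e.g. a lake
    ],
)

# The actual one-liner: count every polygon that carries a building tag.
BUILDING_COUNT_SQL = "SELECT COUNT(*) FROM polygons WHERE building IS NOT NULL"

building_count = conn.execute(BUILDING_COUNT_SQL).fetchone()[0]

if __name__ == "__main__":
    print("Number of buildings:", building_count)
```

Run against a full import for Germany, a query of this shape would produce the kind of figure quoted above.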
More basic, but no less useful, datasets can be found under the keyword “natural earth”. Here, data such as global coastlines and ocean depths measured by echo sounder, but also man-made things such as national borders and cities, are laid out very clearly.
In principle, though, there are no limits to the imagination, and almost any conceivable geographic fact can be followed, sometimes even live via satellite. For example, in addition to current cloud cover, rain radar and the global surface temperature of the planet, you can watch the melting of the polar ice caps since 1970 (NSIDC) or follow lightning strikes across the entire planet live, complete with a prediction of when and where the thunder will be heard (this really works! For example on lightningmaps).
In short, geodata is not only economically relevant, above all for logistics, but also very instructive for aspiring data scientists, and a wonderful toy that you can spend a long time with while finding out a lot of interesting things.