Data Science Knowledge Stack – Abstraction of the Data Science Skillset

What must a Data Scientist be able to do? Which skills does as Data Scientist need to have? This question has often been asked and frequently answered by several Data Science Experts. In fact, it is now quite clear what kind of problems a Data Scientist should be able to solve and which skills are necessary for that. I would like to try to bring this consensus into a visual graph: a layer model, similar to the OSI layer model (which any data scientist should know too, by the way).
I’m giving introductory seminars in Data Science for merchants and engineers and in those seminars I always start explaining what we need to work out together in theory and practice-oriented exercises. Against this background, I came up with the idea for this layer model. Because with my seminars the problem already starts: I am giving seminars for Data Science for Business Analytics with Python. So not for medical analyzes and not with R or Julia. So I do not give a general knowledge of Data Science, but a very specific direction.

A Data Scientist must deal with problems at different levels in any Data Science project, for example, the data access does not work as planned or the data has a different structure than expected. A Data Scientist can spend hours debating its own source code or learning the ropes of new DataScience packages for its chosen programming language. Also, the right algorithms for data evaluation must be selected, properly parameterized and tested, sometimes it turns out that the selected methods were not the optimal ones. Ultimately, we are not doing Data Science all day for fun, but for generating value for a department and a data scientist is also faced with special challenges at this level, at least a basic knowledge of the expertise of that department is a must have.


Read this article in German:
“Data Science Knowledge Stack – Was ein Data Scientist können muss“


Data Science Knowledge Stack

With the Data Science Knowledge Stack, I would like to provide a structured insight into the tasks and challenges a Data Scientist has to face. The layers of the stack also represent a bidirectional flow from top to bottom and from bottom to top, because Data Science as a discipline is also bidirectional: we try to answer questions with data, or we look at the potentials in the data to answer previously unsolicited questions.

The DataScience Knowledge Stack consists of six layers:

Database Technology Knowledge

A Data Scientist works with data which is rarely directly structured in a CSV file, but usually in one or more databases that are subject to their own rules. In particular, business data, for example from the ERP or CRM system, are available in relational databases, often from Microsoft, Oracle, SAP or an open source alternative. A good Data Scientist is not only familiar with Structured Query Language (SQL), but is also aware of the importance of relational linked data models, so he also knows the principle of data table normalization.

Other types of databases, so-called NoSQL databases (Not only SQL) are based on file formats, column or graph orientation, such as MongoDB, Cassandra or GraphDB. Some of these databases use their own programming languages ​​(for example JavaScript at MongoDB or the graph-oriented database Neo4J has its own language called Cypher). Some of these databases provide alternative access via SQL (such as Hive for Hadoop).

A data scientist has to cope with different database systems and has to master at least SQL – the quasi-standard for data processing.

Data Access & Transformation Knowledge

If data are given in a database, Data Scientists can perform simple (and not so simple) analyzes directly on the database. But how do we get the data into our special analysis tools? To do this, a Data Scientist must know how to export data from the database. For one-time actions, an export can be a CSV file, but which separators and text qualifiers should be used? Possibly, the export is too large, so the file must be split.
If there is a direct and synchronous data connection between the analysis tool and the database, interfaces like REST, ODBC or JDBC come into play. Sometimes a socket connection must also be established and the principle of a client-server architecture should be known. Synchronous and asynchronous encryption methods should also be familiar to a Data Scientist, as confidential data are often used, and a minimum level of security is most important for business applications.

Many datasets are not structured in a database but are so-called unstructured or semi-structured data from documents or from Internet sources. And again we have interfaces, a frequent entry point for Data Scientists is, for example, the Twitter API. Sometimes we want to stream data in near real-time, let it be machine data or social media messages. This can be quite demanding, so the data streaming is almost a discipline with which a Data Scientist can come into contact quickly.

Programming Language Knowledge

Programming languages ​​are tools for Data Scientists to process data and automate processing. Data Scientists are usually no real software developers and they do not have to worry about software security or economy. However, a certain basic knowledge about software architectures often helps because some Data Science programs can be going to be integrated into an IT landscape of the company. The understanding of object-oriented programming and the good knowledge of the syntax of the selected programming languages ​​are essential, especially since not every programming language is the most useful for all projects.

At the level of the programming language, there is already a lot of snares in the programming language that are based on the programming language itself, as each has its own faults and details determine whether an analysis is done correctly or incorrectly: for example, whether data objects are copied or linked as reference, or how NULL/NaN values ​​are treated.

Data Science Tool & Library Knowledge

Once a data scientist has loaded the data into his favorite tool, for example, one of IBM, SAS or an open source alternative such as Octave, the core work just began. However, these tools are not self-explanatory and therefore there is a wide range of certification options for various Data Science tools. Many (if not most) Data Scientists work mostly directly with a programming language, but this alone is not enough to effectively perform statistical data analysis or machine learning: We use Data Science libraries (packages) that provide data structures and methods as a groundwork and thus extend the programming language to a real Data Science toolset. Such a library, for example Scikit-Learn for Python, is a collection of methods implemented in the programming language. The use of such libraries, however, is intended to be learned and therefore requires familiarization and practical experience for reliable application.

When it comes to Big Data Analytics, the analysis of particularly large data, we enter the field of Distributed Computing. Tools (frameworks) such as Apache Hadoop, Apache Spark or Apache Flink allows us to process and analyze data in parallel on multiple servers. These tools also provide their own libraries for machine learning, such as Mahout, MLlib and FlinkML.

Data Science Method Knowledge

A Data Scientist is not simply an operator of tools, he uses the tools to apply his analysis methods to data he has selected for to reach the project targets. These analysis methods are, for example, descriptive statistics, estimation methods or hypothesis tests. Somewhat more mathematical are methods of machine learning for data mining, such as clustering or dimensional reduction, or more toward automated decision making through classification or regression.

Machine learning methods generally do not work immediately, they have to be improved using optimization methods like the gradient method. A Data Scientist must be able to detect under- and overfitting, and he must prove that the prediction results for the planned deployment are accurate enough.

Special applications require special knowledge, which applies, for example, to the fields of image recognition (Visual Computing) or the processing of human language (Natural Language Processiong). At this point, we open the door to deep learning.

Expertise

Data Science is not an end in itself, but a discipline that would like to answer questions from other expertise fields with data. For this reason, Data Science is very diverse. Business economists need data scientists to analyze financial transactions, for example, to identify fraud scenarios or to better understand customer needs, or to optimize supply chains. Natural scientists such as geologists, biologists or experimental physicists also use Data Science to make their observations with the aim of gaining knowledge. Engineers want to better understand the situation and relationships between machinery or vehicles, and medical professionals are interested in better diagnostics and medication for their patients.

In order to support a specific department with his / her knowledge of data, tools and analysis methods, every data scientist needs a minimum of the appropriate skills. Anyone who wants to make analyzes for buyers, engineers, natural scientists, physicians, lawyers or other interested parties must also be able to understand the people’s profession.

Engere Data Science Definition

While the Data Science pioneers have long established and highly specialized teams, smaller companies are still looking for the Data Science Allrounder, which can take over the full range of tasks from the access to the database to the implementation of the analytical application. However, companies with specialized data experts have long since distinguished Data Scientists, Data Engineers and Business Analysts. Therefore, the definition of Data Science and the delineation of the abilities that a data scientist should have, varies between a broader and a more narrow demarcation.


A closer look at the more narrow definition shows, that a Data Engineer takes over the data allocation, the Data Scientist loads it into his tools and runs the data analysis together with the colleagues from the department. According to this, a Data Scientist would need no knowledge of databases or APIs, neither an expertise would be necessary …

In my experience, DataScience is not that narrow, the task spectrum covers more than just the core area. This misunderstanding comes from Data Science courses and – for me – I should point to the overall picture of Data Science again and again. In courses and seminars, which want to teach Data Science as a discipline, the focus will of course be on the core area: programming, tools and methods from mathematics & statistics.

Is Data Science the new Statistics?

Table of Contents

1 Introduction

2 Emerging of Data Science

3 Big data technologies

4 Two data worlds: Predictive vs inferential statistics

5 How to study data science

6 Conclusions

7 References

Introduction

As a student of Statistics and the winner of Data Science Scholarship I am often surrounded by computer scientists, mathematicians, physicists and of course statisticians. During conversation, I was asked questions such as “So what actually do I do? What is Data Science?”. These are some very difficult questions and as like you will see during reading this document many before me tried to answer those questions. There is a dispute between statisticians and computer scientists what is the origin of data science and who should teach it. According to the Institute of Mathematical Statistics in the: “The IMS presidential address: let us own data science” we can find a simple recipe for data scientist. [1]

“Putting the traits of Turner and Carver together gives a good portrait of a data scientist:

  • Statistics (S)
  • Domain/Science knowledge (D)
  • Computing (C)
  • Collaboration/teamwork (C)
  • Communication to outsiders (C)

That is, data science = SDCCC = S DC3

However, despite all the challenges that I will need to overcome in answering those questions I will try to do it. I will refer to ideas from several reputable sources, in which I will also tell you: what is in the data science that I am really fascinated about? What is magical in this creation of statistics and computer science that I am drawn to?

Emerging of Data Science

On Tuesday, the 8th of September 2015, University of Michigan announced the 100 million dollars “Data Science Initiative” (DSI), hired 35 new faculty members. On the DSI website we can read about this initiative:

“This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualisation, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational and interdisciplinary applications”2

But that sounds like a bread and butter for statisticians. So, is it really a new creation or is it something that exists for many years but it didn’t sound so sexy as data science? In the article written by Karl Broman, (the University of Wisconsin) we can read:

“When physicists do mathematics, they’re don’t say they’re doing “number science”. They’re doing math. If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it ‘s still statistics. If you say that one kind of data analysis is statistics and another kind is not, you’re not allowing innovation. We need to define the field broadly. You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.

Reading the definition of data science on the Data Science Association’s “Professional Code of Conduct”:

“Data scientist means a professional who uses scientific methods to liberate and create meaning from raw data”

These sound like K. Browman maybe right. Maybe I should go on MSc Statistics like many before me did. Maybe Data Science is simply a new sexy name for statistician only data is big, technology more advanced rather than it used to be so you need to have programming skills to handle the data. Maybe let say loudly data science is a modern version of statistics? But maybe not? Because we can also find statements like the following:

“Statistics is the least important part of data science”. [3]

Further, we can read:

“There ‘s so, much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. . . .”.[3]

So maybe people from computer science are right. Maybe I should go and study programming and forget about expanding my knowledge in statistics? After all, we all know that computer science always had much bigger funding and having MSc computer science was always like a magic star for employers. What should I do? Let me research further.

Big data technologies

Is the data size important to distinguish between data science and statistics? Going back to the “Let us own data science” article we can read that a statistician, Hollerith, invented the punched card reader to allow e cient compilation of a US census, the first elements of machine learning. So, no, machine learning is not an invention of computer scientists. It was well known for statistician for decades already. What about different techniques used in DOE (Design of Experiments) or sampling methods to decrease the sample size. If the data used by statisticians would be only small they wouldn’t have to discover methods such PCA (Principle component analysis) or dimensionality reduction techniques. So, no, data can be big and/or small for statisticians, so what is the difference between data science and statistics and what department should I choose?

When I spoke to computer scientists they try to convince me to choose computer science department. Their reasons being that there are many different programmes that I need to know to deal with large datasets. For instance: Java, Hadoop, SQL, Python, and much more. Moreover, programming can only be taught to the best standard through computer science courses Is it true? Can’t we do the same calculations using statistical software such as R, SAS or even Matlab? But on the other hand, doesn’t the newest technology always work faster? And if so, wouldn’t be better to use the newest technology when we program and write loops?

But, I don’t want to underestimate the effort made by statisticians and data analyst over last 50 years in developing statistical programmes. Their efforts have resulted in the emergence of today’s technology. Early statistical packages such as SPSS or Minitab (from 1960’s) allowed to develop more advanced programmes having roots in mini computer era such as STATA or my favourite R which in turn allowed progress to advanced technology even further and create Python, Hadoop, SQL and so on. Becker and Chambers (with S) and later Ihaka, Gentleman, and members of the R Core team (with R) worked on developing the statistical software. These names should be convincing about how powerful statistical programming languages can be. Many operations that we can do in Hadoop or SQL we can also do easily in R.

Two data worlds: Predictive vs inferential statistics

So maybe Data Science is a creature merged by statisticians working on computer science department? Maybe there are two different approaches to statistics: mathematical statistics and computer science statistics and the computer science statisticians are data scientists because according to Yanir Seroussi in his blog:

“A successful data scientist needs to be able to “become one with the data” by exploring it and applying rigorous statistical analysis (right-hand side of the continuum). But good data scientists also understand what it takes to deploy production systems, and are ready to get their hands dirty by writing code that cleans up the data or performs core system functionality (lefthand side of the continuum). Gaining all these skills takes time.”[4]

Okay, so my reasoning that some statisticians work on computer science department is right, as well as there exists subject like computational statistics, so maybe I should go for computer science department but study statistics.

In fact, I am not the first one to arrive at the conclusion. Everything started from a confession made by John Tukey in “The Future of Data Analysis” article published in “The Annals of Mathematical Statistics” :

For a long time, I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. … All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data

If I am right then above confession was a critical moment. The time when mathematical statistics become more inferential and computational statistics concentrated more on predictive statistics. Applied statisticians working on predictive analytics that are more interested in applying the knowledge rather than developing long proofs decided to move on computer science department.

Additionally, the following is crucial discussion made by Leo Biermann in his paper published in Statistical Science titled “Statistical modelling: the two cultures”. It enables us to understand and differentiate views from both types of statistician, namely mathematical and statistical.

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables … There are two goals in analyzing the data:

  • Prediction. To be able to predict what the responses are going to be to future input variables
  • InferenceTo [infer] how nature is associating the response variables to the input variables.”

Furthermore, in the same dispute we can read:

“The statistical community has been committed to the almost exclusive use of [generative] models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. [Predictive] modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on [generative] models …”

So, we can say that Data Science evolved from Predictive Analytics which in turn evolved from Statistics but it becomes separate science. Tukey and Wilk 1969 compared this new science to established sciences and further circumscribed the role of Statistics within it:

“ … data analysis is a very di cult field. It must adapt itself to what people can and need to do with data. In the sense that biology is more complex than physics, and the behavioural sciences are more complex than either, it is likely that the general problems of data analysis are more complex than those of all three. It is too much to ask for close and effective guidance for data analysis from any highly formalized structure, either now or in the near future. Data analysis can gain much from formal statistics, but only if the connection is kept adequately loose”

How to study data science

So, what is exactly predictive analytics culture? I think that everyone who used Kaggle competition before can agree with me that description of common task framework (CTF) formulated by Marc Liberman in 2009 is a perfect description of Kaggle competitions, and hackathons events; where latter has worked as training sessions for newbies in the data world. An instance of the CTF has these ingredients:

  1. A publicly available training data set involving, for each observation, a list of (possibly many) feature measurements, and a class label for that observation.
  2. A set of enrolled competitors whose common task is to infer a class prediction rule from the training data.
  3. A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule

Kaggle competitions are not only training platforms for newbies like me but also very challenging statistical competitions where experienced statisticians can win “pocket money”. A famous example is the Netflix Challenge where the common task was to predict Netflix user movie selection. The winning team (which included ATT Statistician Bob Bell) won 1 mln dollars.

Comparing modules that are available on master in data science at University of Berkley[6]:

  1. Both
  • Applied machine learning
  • Experiments and causality
  1. Statistics
  • Research design and application for data and analysis
  • Statistics for Data Science
  • Behind the data: humans and values
  • Statistical methods for discrete response, Time Series and panel data
  • Data visualisation
  1. Computer Science
  • Python for Data Science
  • Storing and Retrieving Data
  • Scalling up! Really Big Data
  • Machine Learning at scale
  • Natural Language Processing with Deep Learning

We can really see that data science is a subject that demands skills from both computer science and statistics. So, it is another confirmation for me that it is the best time to change department for my postgraduate study, that is, to study statistics on computer science department.

In the 50 Years of Data Science article we can read: “The activities of Greater Data Science are classified into 6 divisions:

  1. Data exploration and preparation
  2. Data representation and transformation
  3. Computing with data
  4. Data visualization and presentation
  5. Data Modelling
  6. Science about data science [5]

I will quickly go through all of them using my Ebola research example, this required using machine learning on time series data.

  1. The most demanding part. Many people told me before starting this project that: collecting, cleaning, wrangling and preparing data take 60% of all the time that you need to spend on data science project. I didn’t realise how much this 60% means in real time. I didn ‘t realise that the 60 percent will take so much time and that after this I will be exhausted. Exhausted but ready for the next step.
  2. This point is actually part of the first one, or maybe just like many other things in statistics: everything is one huge connected bunch.Data that you can find can be very nice, well behaving, written in CSV or JSON or any other format file that you can quickly download and use, but what if not? What if your data is ‘dirty’and not stored as a file (e.g. only appear on a website)? What if data is coded? Do you need to decode it?
  3. The even bigger challenge, but what a fun? You need to know a few different programming languages or least as I do know a little bit of R, a little bit of Python, quite well Tableau and Excel. So you can use different program in different scenarios or for different tasks. For example, using Panda to do EDA and ggplot 2 to do data vis.
  4. Graphs are pretty, right? If you are still reading my article, I bet you know what is heat map, spatial vis in big cities or different infographics. Surely, I would like to highlight, that we respect only the ones that are not only pretty but also valid. Nevertheless, time that is required to create these visualisations is another matter.
  5. The data modelling, finally? I don’t need to say a lot about this. All forms of inferential and predictive analytic are allowed and accepted.
  6. My favourite part, not the end yet. All the conferences and meetups that I can attend on. All the seminars where we all present our current projects.

Conclusions

After graduation, I will be graduated Statistician. Even more, I will be a mathematical statistician whom mostly during degree dealt with inferential statistics. On the other hand, winning data science scholarship gave me exposure to predictive analytic which I highly enjoyed. Therefore, for my next stage, I will just change my department and concentrate more on predictive analytic. There are many statisticians working on computer science department. They possess both statistical knowledge and advanced software engineering skills, they are called data scientists. It would be a pleasure for me to join them. I don’t mind if it will be MSc. Computer Science, MSc. Data Science, MSc. Big Data or whatever the name will be. I do mind to have sufficient exposure to deal with “dirty” data using statistical modelling and machine learning using modern technology. This is what data science is for me. Maybe for you, it will be something else. Maybe you will be more satisfied with expanding massively programming skills. But for me, programming is a tool, modern technology is my friend and my bread and butter will be predictive analytic.

References

  1. IMS Presidential Address: Let us own data science
  2. Data science is statistics
  3. A Gelman, Columbia University
  4. Yanir Seroussi: What is data Science?
  5. 50 Years Data Science
  6. Curriculum: data science@Berkley

What makes a good Data Scientist? Answered by leading Data Officers!

What makes a good Data Scientist? A question I got asked recently a lot by data science newbies as well as long-established CIOs and my answer ist probably not what you think:
In my opinion is a good Data Scientist somebody with, at least, a good knowledge of computer programming, statistics and the ability of understanding the customer´s business. Above all stands a strong interest in finding value in distributed data sources.

Debatable? Maybe. That’s why I forwarded this question to five other leading Data Scientists and Chief Data Officers in Germany, let’s have a look on their answers to this question and create your own idea of what a good Data Scientist might be:


Dr. Andreas Braun – Head of Global Data & Analytics @ Allianz SE

A data scientist connects thorough analytical and methodological understanding  with a technical hands-on/ engineering mentality.
Data scientists bridge between analytics, tech, and business. “New methods”, such as machine learning, AI, deep learning etc. are crucial and are continuously challenged and improved. (14 February 2017)


Dr. Helmut Linde – Head of Data Science @ SAP SE

The ideal data scientist is a thought leader who creates value from analytics, starting from a vision for improved business processes and an algorithmic concept, down to the technical realization in productive software. (09 February 2017)


Klaas Bollhoefer – Chief Data Scientist @ The unbelievable Machine Company

For me a data scientist thinks ahead, thinks about and thinks in-between. He/she is a motivated, open-minded, enthusiastic and unconventional problem solver and tinkerer. Being a team player and a lone wolf are two sides of the same coin and he/she definitely hates unicorns and nerd shirts. (27 March 2017)

 


Wolfgang Hauner – Chief Data Officer @ Munich Re

A data scientist is, from their very nature, interested in data and its underlying relationship and has the cognitive, methodical and technical skills to find these relationships, even in unstructured data. The essential prerequisites to achieve this are curiosity, a logical mind-set and a passion for learning, as well as an affinity for team interaction in the work place. (08 February 2017)

 


Dr. Florian Neukart – Principal Data Scientist @ Volkswagen Group of America

In my opinion, the most important trait seems to be driven by an irresistible urge to understand fundamental relations and things, whereby I summarize both an atom and a complex machine among “things”. People with this trait are usually persistent, can solve a new problem even with little practical experience, and strive for the necessary training or appropriate quantitative knowledge autodidactically. (08 February 2017)

Background idea:
That I am writing about atoms and complex machines has to do with the fact that I have been able to analyze the most varied data through my second job at the university, and that I am given a chance to making significant contributions to both machine learning and physics, is primarily rooted in curiosity. Mathematics, physics, neuroscience, computer science, etc. are the fundamentals that someone will acquire if she wants to understand. In the beginning, there is only curiosity… I hope this is not too out of the way, but I’ve done a lot of job interviews and worked with lots of smart people, and it has turned out that quantitative knowledge alone is not enough. If someone is not burning for understanding, she may be able to program a Convolutional Network from the ground but will not come up with new ideas.

 


Interview – Using Decision Science to forecast customer behaviour

Interview with Dr. Eva-Marie Müller-Stüler from KPMG about how to use Decision Science to forecast customer behaviour

Dr. Eva-Marie Müller-Stüler is Chief Data Scientist and Associate Director in Decision Science at KPMG LLP in London. She graduated as a mathematician at the Technical University of Munich with a year abroad in Tokyo, and completed her Doctorate at the Philipp University in Marburg.

linkedin-button xing-button

Read this article in German:
“Interview – Mit Data Science Kundenverhalten vorhersagen “

Data Science Blog: Ms Dr. Müller-Stüler, which path led you to the top of Analytics for KPMG?

I always enjoyed analytical questions, and have a great interest in people and finance. For me, understanding how people work and make decisions is incredibly exciting. In my Master’s and my PhD theses I had to analyse large amounts of data and had to program various algorithms. Now, combining a solid mathematical education with specific industry and business knowledge enables me to understand my clients’ businesses and to develop methods that disrupt the market and uncover new business strategies.

Data Science Blog: What kind of analytical solutions do you offer your clients? What benefits do you generate for them?

Our team focuses on Behaviour and Customer Science under a mantra and mission: “We understand human behaviour and we change it”. We look at all the data artefacts a person (for example, the customer or the employee) leaves behind and try to solve the question of how to change their behaviour or to predict future behaviour. With advanced analytics and data science we develop “always-on” forecasting models, which enable our clients to act in advance. This could be forecasting customer demand at a particular location, how it can be improved or influenced in the desired direction, or which kind of promotions work best for which customer. Also the challenge of predicting where, and with what product mix, a new store should be opened can be solved much more accurately with Predictive Analytics than by conventional methods.


Data Science Blog: What prerequisites must be fulfilled to ensure that predictive analyses work adequately for customer behaviour?

The data must, of course, have a certain quality and history to recognize trends and cycles. Often, however, one can also create an advantage by using additional new data sources. Experience and creativity are enormously important to understand what is possible and how to improve the quality of our work, or whether something only increases the noise.

Data Science Blog: What external data sources do you need to integrate? How do you handle unstructured data?

As far as external data sources are concerned, we are very spoiled here in England. We use about 10,000 different signals on average, and which vary depending on the question. These might include signals that show the composition of the population, local traffic information, the proximity of sights, hospitals, schools, crime rates and many more. The influence of each signal is also different for each problem. So, a high number of pick pocketing incidences can be a positive sign of the vibrancy of an area, and that people carry a lot of cash on average. For a fast food retailer with a presence in the city centre, for example, this could have a positive influence on a decision to invest in a new outlet in the area, in another area the opposite.

Data Science Blog: What possibilities does data science provide for forensics or fraud detection?

Every customer is surrounded by thousands of data signals and produces and transmits more by through his behaviour. This enables us to get a pretty good picture about the person online. As every kind of person also has a certain behavioural pattern (and this also applies to fraudsters) it is possible to recognise or predict these patterns in time.

Data Science Blog: What tools do you use in your work? When do you rely on proprietary software or on open source?

This depends on what stage we are in the process and the goal defined. We differentiate our team into different groups: Our Data Wranglers (who are responsible for extracting, generating and processing the data) work with other tools than our Data Modellers. Basically our tool kit covers the entire range of SQL Server, R, Python, but sometimes also Matlab or SAS. More and more, we are working with cloud-based solutions. Data visualization and dashboards in Qlik, Tableau or Alteryx are usually passed on to other teams.

Data Science Blog: What does your working day as a data scientist look like from after the morning café until the end of the evening?

My role is perhaps best described as the player’s coach. At the beginning of a project, it is primarily about working with the client to understand and develop the project. New ideas and methods have to be developed. During a project, I manage the teams and knowledge transfer; the review and the questioning of the models are my main tasks. In the end I do the final sign-off of the project. Since I often run several projects at different stages at the same time, it is guaranteed never boring.

Data Science Blog: Are good Data Scientists of your experience more likely to be consultant types or introvert nerds?

That depends upon what one is focused. A Data Visualizer or Data Artist reduces the information and visualise it in a great and understandable way. This requires creativity, a good understanding of business and safe handling of the tools.

The Data Analyst is more concerned with the “Slicing and Dicing” of data. The aim is to analyse the past and to recognize relationships. It is important to have good mathematical and statistical abilities in addition to the financial knowledge.

The Data Scientist is the most mathematical type. His job is to recognize deeper connections in the data and to make predictions. This involves the development of complicated models or Machine Learning Algorithms. Without a good mathematical education and programming skills it is unfortunately not possible to understand the risk of potential errors in full depth. The danger of drawing wrong conclusions or interpreting correlations counterfactually is very great. A simple example of this is that, in summer, when the weather is beautiful, more people eat ice cream and go swimming. Therefore, there is a strong correlation between eating ice and the number of drowned people, although eating ice cream does not lead to drowning. The influencing variable is the temperature. To minimise the risk for wrong conclusions I think it is important have worked and studied mathematics, data science, machine learning and statistics in depth – this usually means a PhD in science related subject.

Beyond that, business and industry knowledge is also important for a Data Scientist. His solutions must be relevant to the client and solve their problems or improve their processes. The best AI machine does not give any bank a competitive advantage if it predicts the sale of ice cream based on the weather. This may be 100% correct, but has no relevance for the client.

It is quite similar to other areas (e.g., medicine) too. There are many different areas, but for serious problems it is best to ask a specialist so that you do not draw wrong conclusions.

Data Science Blog: For all students who have soon finished their bachelor’s degree in computer science, mathematics, or economics, what would they advise these young ladies how to become good Data Scientists?

Never stop learning! The market is currently developing incredibly fast and has so many great areas to focus on. You should dive into it with passion, enthusiasm and creativity and have fun with the recognition of patterns and relationships. If you also surround yourself with interesting and inspiring people from whom you can learn more, I predict that you’ll do well.

This interview is also available in German: https://data-science-blog.com/de/blog/2016/11/10/interview-mit-advanced-analytics-kundenverhalten-verstehen/

A review of Language Understanding tools – IBM Conversation

In the first part of this series, we saw how top firms with their different assistants are vying to acquire a space in the dialogue market. In this second and final part of this blog-series on Conversational AI, I go more technical to discuss the fundamentals of the underlying concept behind building a Dialogue system i.e. the cornerstone of any Language Understanding tool. Moreover, I explain this by reviewing one such Language Understanding tool as an example that is available in the IBM Bluemix suite, called as IBM Conversation.

IBM Conversation within Bluemix

IBM Conversation was built on the lines of IBM Watson from the IBM Bluemix suite. It is now the for dialogue construction after IBM Dialog was deprecated.We start off by searching and then creating a dedicated environment in the console.

ibm-bluemix-screenshot

Setting up IBM Conversation from the Bluemix Catalog/Console

Basics

Conversation component in IBM Bluemix  is based on the Intent, Entity and Dialogue architecture. And the same is the case with Microsoft LUIS (LUIS stands for Language Understanding Intelligent Service). One of the key components involves doing what is termed as Natural Language Understanding or NLU for short. It extracts words from a textual sentence to understand the grammar dependencies to construct high level semantic information that identifies the underlying intent and entity in the given utterance. It returns a confidence measure i.e. the top-most extracted intent out of the many pre-specified intents that gives us the most likely intent from the given utterance as per our trained model.

These are all statistically/machine learned based on the training data. Go over the demo, tutorial and documentation to get a more in-depth view of things at IBM Conversation.

The intent, entity and dialogue based architecture forms the crux of any SLU system to extract semantic information from speech and enables such a system to be generic across the various Language Understanding toolkits.

alexa-interaction-model-ask-screenshot

The Alexa Interaction model based on intent and slots in ASK

Another huge advantage that ASK provides for building such an architecture, is that it has multi-lingual support.

Conceptual Mapping

Intents can be thought of as classes where one classifies the input examples into one of them. For example,

Call Mark is mapped to the MOBILE class and Navigate to Munich is mapped to the ROUTE class

The entities are labels, so e.g. from above, you can have

Mark as a PERSON and Munich as a CITY.

Major advantage and drawback

Both Conversation and LUIS use a non-Machine Learning based approach for software developers or business users to create a fast prototype. It is definitely easy to begin with and gives a lot of options to create drag and drop based dialogue system. However, it can’t scale up to large data. A hybrid approach that can combine or build a dynamic system on top of this static approach is needed for scalable industry solutions.

Extensions

Moreover, an end to end workflow can be built by plugging in components from Node-RED and introduction to the same can be viewed in the below video.

What’s good is that they have a component for Conversation as well. So, we can build a complete chatbot starting from a speech to text component to get the human commands translated to text, followed by a conversation component to build up the dialog and lastly by a text to speech component to translate this textual dialogue back to speech to be spoken by a humanoid or a mobile device!

Missing components and possible future work

It is not possible to add entities/intent dynamically through the UI after the initial workspace is constructed. The advanced response tab doesn’t allow to edit (add) the entities in the response field, like for example adding variables to the context. We can edit it (highlighted in orange) but it doesn’t save or get reflected.

{
“output”: {
“text”: “I understand you want me to turn on something. You can say turn on the wipers or switch on the lights.”
},
“context”: {
“toppings”: “<? context.toppings.append( ‘onions’ ) ?>”
},
“entities”: {
   “appliance”: “<? entities.appliance.append( ‘mobile’ ) ?>”
}
}

Moreover, the link which only mentions accessing intents and entities but not modifying them.

watson-developer-cloud-screenshot watson-developer-cloud-screenshot2

The only place to add the intent, entities is back in the work space and not programmatically at run time. Perhaps, a possible solution can be to use UI with DB data to save the intermediate and newly discovered intent/entity values and then update the workspace later.

As I end this blog, perhaps there would be another AI assistant released that has moved beyond its embryonic stage to conquer real life application scenarios. Conversational AI is hot property, so dive in to reap its benefits, both from an end user and developer’s perspective!

Note: Hope you enjoyed the read. I have deliberately kept the content a mix of non technical and technical to build the excitement and buzz going around this exciting field of conversational AI! Publishing this blog was on my list as I was compiling lot of facts since last few weeks but I had to hurry even more, given the recent news surrounding this upsurge. As always, any feedback as a comment below or through a message are more than welcome!

A quick primer on TensorFlow – Google’s machine learning workhorse

Introducing Google Brains‘ TensorFlow™

This week started with major news for the machine learning and data science community: the Google Brain Team announced the open sourcing of TensorFlow, their numerical library for tensor network computations. This software is actively developed (and used!) within Google and builds on many of Google’s large scale neural network applications such as automatic image labeling and captioning as well as the speech recognition in Google’s apps.

TensorFlow in bullet points

Here are the main features:

  • Supports deep neural networks – and much more machine learning approaches
  • Highly scalable across many machines and huge data sets
  • Runs on desktops, servers, in cloud and even mobile devices
  • Computation can run on CPUs, GPUs or both
  • All this flexibility is covered by a single API making the execution very streamlined
  • Available interfaces: C++ and Python. More will follow (Java, R, Lua, Go…)
  • Comes with many tools helping to build and visualize the data flow networks
  • Includes a powerful gradient based optimizer with auto-differentiation
  • Extensible with C++
  • Usable for commercial applications – released under Apache Software Licence 2.0

Tensor, what? Tensor, why?

„Numerical library for tensor network computations“ maybe doesn’t sound too exciting, but let’s  consider the implications.

Application of tensors and their networks is a relatively new (but fast evolving) approach in machine learning. Tensors, if you recall your algebra classes, are simply n-dimensional data arrays (so a scalar is a 0th order tensor, a vector is 1st order, and a matrix a 2nd order matrix).

A simple practical example of is color image’s RGB layers (essentially three 2D matrices combined into a 3rd order tensor). Or a more business minded example – if your data source generates a table (a 2D array) every hour, you can look at the full data set as a 3rd order tensor – time being the extra dimension.

Tensor networks then represent “data flow graphs”, where the edges are your multi-dimensional data sets and nodes are the mathematical operations on this data.

Example of of a data flow graph with multiple nodes (data operations). Notice how the execution of nodes is asynchronous. This allows incredible scalability across many machines. Image Source.

Looking at your data through the tensor formalism gives you a lot of powerful tools that were already developed for tensor algebra, allowing fast, complex computations.  

Tensor networks are also a natural fit for computations done on graphical processing units (GPUs) as they are built exactly for the purpose of very fast numerical operations on such a data – speeding up your calculations significantly compared to standard CPU execution!

The importance of flexible architecture & scaling

The data flow graph approach has also further advantages. Most notably, you can split the design of your data flows (i.e. data cleaning, processing, transformations, model building etc.) from its execution. You first build up the graph of your data flow and then you send it to for execution: either on the CPUs of your machines (and it can be your laptop just as well as cluster) or GPUs or a combination. This happens through a single interface that hides all the complexities from you.

Since the execution is asynchronous it scales across many machines and can deal with huge amounts of data.

You can count on the Google guys to build tools not only for academic use, but also heavy-duty operations in the industry!

Is this just another deep learning library?

TensorFlow is of course not the first library to embrace the tensor formalism and GPU execution. The nearest comparisons (and competitors) are Theano, Torch and CGT (Caffe to a limited degree).

While there are significant overlaps between the libraries, TensorFlow tries to provide a broader framework. It is not only a deep learning library – the Data Flow Graphs can incorporate any data processing/analysis applications. It also comes with a very powerful gradient based optimizer with automatic calculations of derivatives offering huge flexibility.

Given this broad vision the closest competitor is probably Theano (while Caffe and the existing Theano wrappers have a narrower focus on deep learning). TensorFlow’s distinguishing feature is that by design its focus is on large, scalable architectures with a complete flexibility in the hardware, best suited for industry/operational use, whereas the other libraries have more academic pedigrees.

Initial analyses also indicate that TensorFlow should bring also performance improvements compared to Theano, although no comprehensive benchmarks have yet been published.

As the other packages are out already for a while, they have large, active communities and often additional supporting software (examples are the very useful wrappers around Theano like Lasagne, Keras and Blocks that provider higher level abstractions to its engine).

Of course, with Google’s gravitas, one can expect that TensorFlow’s open source community will grow very fast and the contributors will quickly add a lot of additional features (and find hidden bugs).

Finally, keep in mind, that while Google provided us with this great data processing framework and some of its machine learning capabilities, it is likely that the most powerful machine learning algorithms still remain Google’s proprietary secret.

Nonetheless, TensorFlow is a huge and very welcome contribution to the open source machine learning world!

Where to go next?

You can find Google’s getting started guide here. The TensorFlow white paper is worth a read too. Source code can be found at the Github page. There is also a Vagrant virtual machine with TensorFlow pre-installed available here.