Data Science Knowledge Stack – Abstraction of the Data Science Skillset

What must a Data Scientist be able to do? Which skills does as Data Scientist need to have? This question has often been asked and frequently answered by several Data Science Experts. In fact, it is now quite clear what kind of problems a Data Scientist should be able to solve and which skills are necessary for that. I would like to try to bring this consensus into a visual graph: a layer model, similar to the OSI layer model (which any data scientist should know too, by the way).
I’m giving introductory seminars in Data Science for merchants and engineers and in those seminars I always start explaining what we need to work out together in theory and practice-oriented exercises. Against this background, I came up with the idea for this layer model. Because with my seminars the problem already starts: I am giving seminars for Data Science for Business Analytics with Python. So not for medical analyzes and not with R or Julia. So I do not give a general knowledge of Data Science, but a very specific direction.

A Data Scientist must deal with problems at different levels in any Data Science project, for example, the data access does not work as planned or the data has a different structure than expected. A Data Scientist can spend hours debating its own source code or learning the ropes of new DataScience packages for its chosen programming language. Also, the right algorithms for data evaluation must be selected, properly parameterized and tested, sometimes it turns out that the selected methods were not the optimal ones. Ultimately, we are not doing Data Science all day for fun, but for generating value for a department and a data scientist is also faced with special challenges at this level, at least a basic knowledge of the expertise of that department is a must have.


Read this article in German:
“Data Science Knowledge Stack – Was ein Data Scientist können muss“


Data Science Knowledge Stack

With the Data Science Knowledge Stack, I would like to provide a structured insight into the tasks and challenges a Data Scientist has to face. The layers of the stack also represent a bidirectional flow from top to bottom and from bottom to top, because Data Science as a discipline is also bidirectional: we try to answer questions with data, or we look at the potentials in the data to answer previously unsolicited questions.

The DataScience Knowledge Stack consists of six layers:

Database Technology Knowledge

A Data Scientist works with data which is rarely directly structured in a CSV file, but usually in one or more databases that are subject to their own rules. In particular, business data, for example from the ERP or CRM system, are available in relational databases, often from Microsoft, Oracle, SAP or an open source alternative. A good Data Scientist is not only familiar with Structured Query Language (SQL), but is also aware of the importance of relational linked data models, so he also knows the principle of data table normalization.

Other types of databases, so-called NoSQL databases (Not only SQL) are based on file formats, column or graph orientation, such as MongoDB, Cassandra or GraphDB. Some of these databases use their own programming languages ​​(for example JavaScript at MongoDB or the graph-oriented database Neo4J has its own language called Cypher). Some of these databases provide alternative access via SQL (such as Hive for Hadoop).

A data scientist has to cope with different database systems and has to master at least SQL – the quasi-standard for data processing.

Data Access & Transformation Knowledge

If data are given in a database, Data Scientists can perform simple (and not so simple) analyzes directly on the database. But how do we get the data into our special analysis tools? To do this, a Data Scientist must know how to export data from the database. For one-time actions, an export can be a CSV file, but which separators and text qualifiers should be used? Possibly, the export is too large, so the file must be split.
If there is a direct and synchronous data connection between the analysis tool and the database, interfaces like REST, ODBC or JDBC come into play. Sometimes a socket connection must also be established and the principle of a client-server architecture should be known. Synchronous and asynchronous encryption methods should also be familiar to a Data Scientist, as confidential data are often used, and a minimum level of security is most important for business applications.

Many datasets are not structured in a database but are so-called unstructured or semi-structured data from documents or from Internet sources. And again we have interfaces, a frequent entry point for Data Scientists is, for example, the Twitter API. Sometimes we want to stream data in near real-time, let it be machine data or social media messages. This can be quite demanding, so the data streaming is almost a discipline with which a Data Scientist can come into contact quickly.

Programming Language Knowledge

Programming languages ​​are tools for Data Scientists to process data and automate processing. Data Scientists are usually no real software developers and they do not have to worry about software security or economy. However, a certain basic knowledge about software architectures often helps because some Data Science programs can be going to be integrated into an IT landscape of the company. The understanding of object-oriented programming and the good knowledge of the syntax of the selected programming languages ​​are essential, especially since not every programming language is the most useful for all projects.

At the level of the programming language, there is already a lot of snares in the programming language that are based on the programming language itself, as each has its own faults and details determine whether an analysis is done correctly or incorrectly: for example, whether data objects are copied or linked as reference, or how NULL/NaN values ​​are treated.

Data Science Tool & Library Knowledge

Once a data scientist has loaded the data into his favorite tool, for example, one of IBM, SAS or an open source alternative such as Octave, the core work just began. However, these tools are not self-explanatory and therefore there is a wide range of certification options for various Data Science tools. Many (if not most) Data Scientists work mostly directly with a programming language, but this alone is not enough to effectively perform statistical data analysis or machine learning: We use Data Science libraries (packages) that provide data structures and methods as a groundwork and thus extend the programming language to a real Data Science toolset. Such a library, for example Scikit-Learn for Python, is a collection of methods implemented in the programming language. The use of such libraries, however, is intended to be learned and therefore requires familiarization and practical experience for reliable application.

When it comes to Big Data Analytics, the analysis of particularly large data, we enter the field of Distributed Computing. Tools (frameworks) such as Apache Hadoop, Apache Spark or Apache Flink allows us to process and analyze data in parallel on multiple servers. These tools also provide their own libraries for machine learning, such as Mahout, MLlib and FlinkML.

Data Science Method Knowledge

A Data Scientist is not simply an operator of tools, he uses the tools to apply his analysis methods to data he has selected for to reach the project targets. These analysis methods are, for example, descriptive statistics, estimation methods or hypothesis tests. Somewhat more mathematical are methods of machine learning for data mining, such as clustering or dimensional reduction, or more toward automated decision making through classification or regression.

Machine learning methods generally do not work immediately, they have to be improved using optimization methods like the gradient method. A Data Scientist must be able to detect under- and overfitting, and he must prove that the prediction results for the planned deployment are accurate enough.

Special applications require special knowledge, which applies, for example, to the fields of image recognition (Visual Computing) or the processing of human language (Natural Language Processiong). At this point, we open the door to deep learning.

Expertise

Data Science is not an end in itself, but a discipline that would like to answer questions from other expertise fields with data. For this reason, Data Science is very diverse. Business economists need data scientists to analyze financial transactions, for example, to identify fraud scenarios or to better understand customer needs, or to optimize supply chains. Natural scientists such as geologists, biologists or experimental physicists also use Data Science to make their observations with the aim of gaining knowledge. Engineers want to better understand the situation and relationships between machinery or vehicles, and medical professionals are interested in better diagnostics and medication for their patients.

In order to support a specific department with his / her knowledge of data, tools and analysis methods, every data scientist needs a minimum of the appropriate skills. Anyone who wants to make analyzes for buyers, engineers, natural scientists, physicians, lawyers or other interested parties must also be able to understand the people’s profession.

Engere Data Science Definition

While the Data Science pioneers have long established and highly specialized teams, smaller companies are still looking for the Data Science Allrounder, which can take over the full range of tasks from the access to the database to the implementation of the analytical application. However, companies with specialized data experts have long since distinguished Data Scientists, Data Engineers and Business Analysts. Therefore, the definition of Data Science and the delineation of the abilities that a data scientist should have, varies between a broader and a more narrow demarcation.


A closer look at the more narrow definition shows, that a Data Engineer takes over the data allocation, the Data Scientist loads it into his tools and runs the data analysis together with the colleagues from the department. According to this, a Data Scientist would need no knowledge of databases or APIs, neither an expertise would be necessary …

In my experience, DataScience is not that narrow, the task spectrum covers more than just the core area. This misunderstanding comes from Data Science courses and – for me – I should point to the overall picture of Data Science again and again. In courses and seminars, which want to teach Data Science as a discipline, the focus will of course be on the core area: programming, tools and methods from mathematics & statistics.

Is Data Science the new Statistics?

Table of Contents

1 Introduction

2 Emerging of Data Science

3 Big data technologies

4 Two data worlds: Predictive vs inferential statistics

5 How to study data science

6 Conclusions

7 References

Introduction

As a student of Statistics and the winner of Data Science Scholarship I am often surrounded by computer scientists, mathematicians, physicists and of course statisticians. During conversation, I was asked questions such as “So what actually do I do? What is Data Science?”. These are some very difficult questions and as like you will see during reading this document many before me tried to answer those questions. There is a dispute between statisticians and computer scientists what is the origin of data science and who should teach it. According to the Institute of Mathematical Statistics in the: “The IMS presidential address: let us own data science” we can find a simple recipe for data scientist. [1]

“Putting the traits of Turner and Carver together gives a good portrait of a data scientist:

  • Statistics (S)
  • Domain/Science knowledge (D)
  • Computing (C)
  • Collaboration/teamwork (C)
  • Communication to outsiders (C)

That is, data science = SDCCC = S DC3

However, despite all the challenges that I will need to overcome in answering those questions I will try to do it. I will refer to ideas from several reputable sources, in which I will also tell you: what is in the data science that I am really fascinated about? What is magical in this creation of statistics and computer science that I am drawn to?

Emerging of Data Science

On Tuesday, the 8th of September 2015, University of Michigan announced the 100 million dollars “Data Science Initiative” (DSI), hired 35 new faculty members. On the DSI website we can read about this initiative:

“This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualisation, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational and interdisciplinary applications”2

But that sounds like a bread and butter for statisticians. So, is it really a new creation or is it something that exists for many years but it didn’t sound so sexy as data science? In the article written by Karl Broman, (the University of Wisconsin) we can read:

“When physicists do mathematics, they’re don’t say they’re doing “number science”. They’re doing math. If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it ‘s still statistics. If you say that one kind of data analysis is statistics and another kind is not, you’re not allowing innovation. We need to define the field broadly. You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.

Reading the definition of data science on the Data Science Association’s “Professional Code of Conduct”:

“Data scientist means a professional who uses scientific methods to liberate and create meaning from raw data”

These sound like K. Browman maybe right. Maybe I should go on MSc Statistics like many before me did. Maybe Data Science is simply a new sexy name for statistician only data is big, technology more advanced rather than it used to be so you need to have programming skills to handle the data. Maybe let say loudly data science is a modern version of statistics? But maybe not? Because we can also find statements like the following:

“Statistics is the least important part of data science”. [3]

Further, we can read:

“There ‘s so, much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. . . .”.[3]

So maybe people from computer science are right. Maybe I should go and study programming and forget about expanding my knowledge in statistics? After all, we all know that computer science always had much bigger funding and having MSc computer science was always like a magic star for employers. What should I do? Let me research further.

Big data technologies

Is the data size important to distinguish between data science and statistics? Going back to the “Let us own data science” article we can read that a statistician, Hollerith, invented the punched card reader to allow e cient compilation of a US census, the first elements of machine learning. So, no, machine learning is not an invention of computer scientists. It was well known for statistician for decades already. What about different techniques used in DOE (Design of Experiments) or sampling methods to decrease the sample size. If the data used by statisticians would be only small they wouldn’t have to discover methods such PCA (Principle component analysis) or dimensionality reduction techniques. So, no, data can be big and/or small for statisticians, so what is the difference between data science and statistics and what department should I choose?

When I spoke to computer scientists they try to convince me to choose computer science department. Their reasons being that there are many different programmes that I need to know to deal with large datasets. For instance: Java, Hadoop, SQL, Python, and much more. Moreover, programming can only be taught to the best standard through computer science courses Is it true? Can’t we do the same calculations using statistical software such as R, SAS or even Matlab? But on the other hand, doesn’t the newest technology always work faster? And if so, wouldn’t be better to use the newest technology when we program and write loops?

But, I don’t want to underestimate the effort made by statisticians and data analyst over last 50 years in developing statistical programmes. Their efforts have resulted in the emergence of today’s technology. Early statistical packages such as SPSS or Minitab (from 1960’s) allowed to develop more advanced programmes having roots in mini computer era such as STATA or my favourite R which in turn allowed progress to advanced technology even further and create Python, Hadoop, SQL and so on. Becker and Chambers (with S) and later Ihaka, Gentleman, and members of the R Core team (with R) worked on developing the statistical software. These names should be convincing about how powerful statistical programming languages can be. Many operations that we can do in Hadoop or SQL we can also do easily in R.

Two data worlds: Predictive vs inferential statistics

So maybe Data Science is a creature merged by statisticians working on computer science department? Maybe there are two different approaches to statistics: mathematical statistics and computer science statistics and the computer science statisticians are data scientists because according to Yanir Seroussi in his blog:

“A successful data scientist needs to be able to “become one with the data” by exploring it and applying rigorous statistical analysis (right-hand side of the continuum). But good data scientists also understand what it takes to deploy production systems, and are ready to get their hands dirty by writing code that cleans up the data or performs core system functionality (lefthand side of the continuum). Gaining all these skills takes time.”[4]

Okay, so my reasoning that some statisticians work on computer science department is right, as well as there exists subject like computational statistics, so maybe I should go for computer science department but study statistics.

In fact, I am not the first one to arrive at the conclusion. Everything started from a confession made by John Tukey in “The Future of Data Analysis” article published in “The Annals of Mathematical Statistics” :

For a long time, I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. … All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data

If I am right then above confession was a critical moment. The time when mathematical statistics become more inferential and computational statistics concentrated more on predictive statistics. Applied statisticians working on predictive analytics that are more interested in applying the knowledge rather than developing long proofs decided to move on computer science department.

Additionally, the following is crucial discussion made by Leo Biermann in his paper published in Statistical Science titled “Statistical modelling: the two cultures”. It enables us to understand and differentiate views from both types of statistician, namely mathematical and statistical.

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables … There are two goals in analyzing the data:

  • Prediction. To be able to predict what the responses are going to be to future input variables
  • InferenceTo [infer] how nature is associating the response variables to the input variables.”

Furthermore, in the same dispute we can read:

“The statistical community has been committed to the almost exclusive use of [generative] models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. [Predictive] modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on [generative] models …”

So, we can say that Data Science evolved from Predictive Analytics which in turn evolved from Statistics but it becomes separate science. Tukey and Wilk 1969 compared this new science to established sciences and further circumscribed the role of Statistics within it:

“ … data analysis is a very di cult field. It must adapt itself to what people can and need to do with data. In the sense that biology is more complex than physics, and the behavioural sciences are more complex than either, it is likely that the general problems of data analysis are more complex than those of all three. It is too much to ask for close and effective guidance for data analysis from any highly formalized structure, either now or in the near future. Data analysis can gain much from formal statistics, but only if the connection is kept adequately loose”

How to study data science

So, what is exactly predictive analytics culture? I think that everyone who used Kaggle competition before can agree with me that description of common task framework (CTF) formulated by Marc Liberman in 2009 is a perfect description of Kaggle competitions, and hackathons events; where latter has worked as training sessions for newbies in the data world. An instance of the CTF has these ingredients:

  1. A publicly available training data set involving, for each observation, a list of (possibly many) feature measurements, and a class label for that observation.
  2. A set of enrolled competitors whose common task is to infer a class prediction rule from the training data.
  3. A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule

Kaggle competitions are not only training platforms for newbies like me but also very challenging statistical competitions where experienced statisticians can win “pocket money”. A famous example is the Netflix Challenge where the common task was to predict Netflix user movie selection. The winning team (which included ATT Statistician Bob Bell) won 1 mln dollars.

Comparing modules that are available on master in data science at University of Berkley[6]:

  1. Both
  • Applied machine learning
  • Experiments and causality
  1. Statistics
  • Research design and application for data and analysis
  • Statistics for Data Science
  • Behind the data: humans and values
  • Statistical methods for discrete response, Time Series and panel data
  • Data visualisation
  1. Computer Science
  • Python for Data Science
  • Storing and Retrieving Data
  • Scalling up! Really Big Data
  • Machine Learning at scale
  • Natural Language Processing with Deep Learning

We can really see that data science is a subject that demands skills from both computer science and statistics. So, it is another confirmation for me that it is the best time to change department for my postgraduate study, that is, to study statistics on computer science department.

In the 50 Years of Data Science article we can read: “The activities of Greater Data Science are classified into 6 divisions:

  1. Data exploration and preparation
  2. Data representation and transformation
  3. Computing with data
  4. Data visualization and presentation
  5. Data Modelling
  6. Science about data science [5]

I will quickly go through all of them using my Ebola research example, this required using machine learning on time series data.

  1. The most demanding part. Many people told me before starting this project that: collecting, cleaning, wrangling and preparing data take 60% of all the time that you need to spend on data science project. I didn’t realise how much this 60% means in real time. I didn ‘t realise that the 60 percent will take so much time and that after this I will be exhausted. Exhausted but ready for the next step.
  2. This point is actually part of the first one, or maybe just like many other things in statistics: everything is one huge connected bunch.Data that you can find can be very nice, well behaving, written in CSV or JSON or any other format file that you can quickly download and use, but what if not? What if your data is ‘dirty’and not stored as a file (e.g. only appear on a website)? What if data is coded? Do you need to decode it?
  3. The even bigger challenge, but what a fun? You need to know a few different programming languages or least as I do know a little bit of R, a little bit of Python, quite well Tableau and Excel. So you can use different program in different scenarios or for different tasks. For example, using Panda to do EDA and ggplot 2 to do data vis.
  4. Graphs are pretty, right? If you are still reading my article, I bet you know what is heat map, spatial vis in big cities or different infographics. Surely, I would like to highlight, that we respect only the ones that are not only pretty but also valid. Nevertheless, time that is required to create these visualisations is another matter.
  5. The data modelling, finally? I don’t need to say a lot about this. All forms of inferential and predictive analytic are allowed and accepted.
  6. My favourite part, not the end yet. All the conferences and meetups that I can attend on. All the seminars where we all present our current projects.

Conclusions

After graduation, I will be graduated Statistician. Even more, I will be a mathematical statistician whom mostly during degree dealt with inferential statistics. On the other hand, winning data science scholarship gave me exposure to predictive analytic which I highly enjoyed. Therefore, for my next stage, I will just change my department and concentrate more on predictive analytic. There are many statisticians working on computer science department. They possess both statistical knowledge and advanced software engineering skills, they are called data scientists. It would be a pleasure for me to join them. I don’t mind if it will be MSc. Computer Science, MSc. Data Science, MSc. Big Data or whatever the name will be. I do mind to have sufficient exposure to deal with “dirty” data using statistical modelling and machine learning using modern technology. This is what data science is for me. Maybe for you, it will be something else. Maybe you will be more satisfied with expanding massively programming skills. But for me, programming is a tool, modern technology is my friend and my bread and butter will be predictive analytic.

References

  1. IMS Presidential Address: Let us own data science
  2. Data science is statistics
  3. A Gelman, Columbia University
  4. Yanir Seroussi: What is data Science?
  5. 50 Years Data Science
  6. Curriculum: data science@Berkley

Data Science and Predictive Analytics in Healthcare

Doing data science in a healthcare company can save lives. Whether it’s by predicting which patients have a tumor on an MRI, are at risk of re-admission, or have misclassified diagnoses in electronic medical records are all examples of how predictive models can lead to better health outcomes and improve the quality of life of patients.  Nevertheless, the healthcare industry presents many unique challenges and opportunities for data scientists.

The impact of data science in healthcare

Healthcare providers have a plethora of important but sensitive data. Medical records include a diverse set of data such as basic demographics, diagnosed illnesses, and a wealth of clinical information such as lab test results. For patients with chronic diseases, there could be a long and detailed history of data available on a number of health indicators due to the frequency of visits to a healthcare provider. Information from medical records can often be combined with outside data as well. For example, a patient’s address can be combined with other publicly available information to determine the number of surgeons that practice near a patient or other relevant information about the type of area that patients reside in.

With this rich data about a patient as well as their surroundings, models can be built and trained to predict many outcomes of interest. One important area of interest is models predicting disease progression, which can be used for disease management and planning. For example, at Fresenius Medical Care (where we primarily care for patients with chronic conditions such as kidney disease), we use a Chronic Kidney Disease progression model that can predict the trajectory of a patient’s condition to help clinicians decide whether and when to proceed to the next stage in their medical care. Predictive models can also notify clinicians about patients who may require interventions to reduce risk of negative outcomes. For instance, we use models to predict which patients are at risk for hospitalization or missing a dialysis treatment. These predictions, along with the key factors driving the prediction, are presented to clinicians who can decide if certain interventions might help reduce the patient’s risk.

Challenges of data science in healthcare

One challenge is that the healthcare industry is far behind other sectors in terms of adopting the latest technology and analytics tools. This does present some challenges, and data scientists should be aware that the data infrastructure and development environment at many healthcare companies will not be at the bleeding edge of the field. However it also means there are a lot of opportunities for improvement, and even small simple models can yield vast improvements over current methods.

Another challenge in the healthcare sector arises from the sensitive nature of medical information. Due to concerns over data privacy, it can often be difficult to obtain access to data that the company has. For this reason, data scientists considering a position at a healthcare company should be aware of whether there is already an established protocol for data professionals to get access to the data. If there isn’t, be aware that simply getting access to the data may be a major effort in itself.

Finally, it is important to keep in mind the end-use of any predictive model. In many cases, there are very different costs to false-negatives and false-positives. A false-negative may be detrimental to a patient’s health, while too many false-positives may lead to many costly and unnecessary treatments (also to the detriment of patients’ health for certain treatments as well as economy overall). Education about the proper use of predictive models and their limitations is essential for end-users. Finally, making sure the output of a predictive model is actionable is important. Predicting that a patient is at high-risk is only useful if the model outputs is interpretable enough to explain what factors are putting that patient at risk. Furthermore, if the model is being used to plan interventions, the factors that can be changed need to be highlighted in some way – telling a clinician that a patient is at risk because of their age is not useful if the point of the prediction is to lower risk through intervention.

The future of data science in the healthcare sector

The future holds a lot of promise for data science in healthcare. Wearable devices that track all kinds of activity and biometric data are becoming more sophisticated and more common. Streaming data coming from either wearables or devices providing treatment (such as dialysis machines) could eventually be used to provide real-time alerts to patients or clinicians about health events outside of the hospital.

Currently, a major issue facing medical providers is that patients’ data tends to exist in silos. There is little integration across electronic medical record systems (both between and within medical providers), which can lead to fragmented care. This can lead to clinicians receiving out of date or incomplete information about a patient, or to duplication of treatments. Through a major data engineering effort, these systems could (and should) be integrated. This would vastly increase the potential of data scientists and data engineers, who could then provide analytics services that took into account the whole patients’ history to provide a level of consistency across care providers. Data workers could use such an integrated record to alert clinicians to duplications of procedures or dangerous prescription drug combinations.

Data scientists have a lot to offer in the healthcare industry. The advances of machine learning and data science can and should be adopted in a space where the health of individuals can be improved. The opportunities for data scientists in this sector are nearly endless, and the potential for good is enormous.