Data Science Modeling and Featurization

Overview

Data modeling is an essential part of the data science pipeline. This, combined with the fact that it is a very rewarding process, makes it the one that often receives the most attention among data science learners. However, things are not as simple as they may seem, since there is much more to it than applying a function from a particular class of a package and applying it on the data available.

A big part of data science modeling involves evaluating a model, for example, making sure that it is robust and therefore reliable. Also, data science modeling is closely linked to creating an information rich feature set. Moreover, it entails a variety of other processes that ensure that the data at hand is harnessed as much as possible.

What Is a Robust Data Model?

When it comes to robust models, worthy of making it to production, these need to tick several boxes. First of all, they need to have a good performance, based on various metrics. Oftentimes a single metric can be misleading, as how well a model performs has many aspects, especially for classification problems.

In addition, a robust model has good generalization. This means that the model performs well for various datasets, not just the one it has been trained on.

Sensitivity analysis is another aspect of a data science modeling, something essential for thoroughly testing a model to ensure it is robust enough. Sensitivity is a condition whereby a model’s output is bound to change significantly if the inputs change even slightly. This is quite undesirable and needs to be checked since a robust model ought to be stable.

Finally, interpretability is an important aspect too, though it’s not always possible. This has to do with how easy it is to interpret a model’s results. Many modern models, however, are more like black boxes, making it particularly difficult to interpret them. Nevertheless, it is often preferable to opt for an interpretable model, especially if we need to defend its outputs to others.

How Is Featurization Accomplished?

In order for a model to maximize its potential, it needs an information rich set of features. The latter can be created in various ways. Whatever the case, cleaning up the data is a prerequisite. This involves removing or correcting problematic data points, filling in missing values wherever possible, and in some cases removing noisy variables.

Before you can use variables in a model, you need to perform normalization on them. This is usually accomplished through a linear transformation ensuring that the variable’s values are around a certain range. Oftentimes, normalization is sufficient for turning your variables into features, once they are cleaned.

Binning is another process that can aid in featurization. This entails creating nominal (discreet) variables, which can in turn be broken down into binary features, to be used in a data model.

Finally, some dimensionality reduction method (e.g. PCA) can be instrumental in shaping up your feature-set. This has to do with creating linear combinations of features, aka meta-features, which express the same information in fewer dimensions.

Some Useful Considerations

Beyond these basic attributes of data science modeling there several more that every data scientist has in mind in order to create something of value from the available data. Things like in-depth testing using sensitivity analysis, specialized sampling, and various aspects of model performance (as well as tweaking the model to optimize for a particular performance metric) are parts of data science modeling that require meticulous study and ample practice. After all, even though this part of data science is fairly easy to pick up, it takes a while to master, while performing well in it is something that every organization can benefit from.

Resources

To delve more into all this, there are various relevant resources you can leverage, helping you in not just the methodologies involved but also in the mindset behind them. Here are two of the most useful ones.

  1. Data Science Modeling Tutorial on the Safari platform
  2. Data Science Mindset, Methodologies and Misconceptions book (Technics Publications)

Stichwort Datenkompetenz: Von Big Data zu Big Insights

Anzeige – Artikel des Data Science Blog Sponsors Qlik.com

Wer in einer Organisation mit Daten arbeiten möchte, sollte dazu befähigt werden – sonst bleiben wertvolle Einblicke unter Umständen verborgen.

Aus der reinen Technologie-Perspektive ist Big Data nahezu grenzenlos: Prozessoren arbeiten immer schneller, die Kosten für Speicherplatz sinken kontinuierlich, Cloud-Dienste stellen ad hoc und flexibel auch riesige Speichervolumen zur Verfügung. Beste Voraussetzungen also für Big-Data-Enthusiasten? Könnte man meinen. Doch Big Data hat nicht von Haus aus Wert, Sinn oder Geschäftsnutzen. Der stellt sich erst ein, wenn die vielen verfügbaren Daten assoziativ und ohne Denk- oder Infrastruktur-Hürden neu kombiniert, analysiert und visualisiert – also wirklich smart – werden. Der Schlüssel dazu liegt in moderner Data Analytics Software, die unterschiedlichste Datenquellen und -formate verarbeiten und in Beziehung setzen kann – und so wertvolle neue Einsichten offenbart, die ohne Data Analytics im (Big-)Data-Lake abtauchen würden.

Reich an Daten – arm an Einsichten?

Entscheidend für den Erfolg von Big-Data-Projekten ist es, aus der Datenfülle die wirklich relevanten Zusammenhänge zu evaluieren – und nicht um des Sammelns willen Daten zu horten, die neue Einsichten eher zu- als aufdecken. Viele Organisationen befinden sich leider nach wie vor an diesem Punkt. Sie sind reich an Daten, aber nicht in der Lage, neue Informationen daraus zu extrahieren, die gute Ideen anstoßen, Innovation fördern und das Unternehmen nachhaltig weiterbringen. Es herrscht weitgehende Überforderung mit dem eigenen Datenschatz.

Wer in Big-Data-Technologien investiert, fragt früher oder später nach dem ROI seiner Investitionen. Dieser wird umso günstiger ausfallen, je leichter und passgenauer der Datennutzen an möglichst vielen Stellen im Unternehmen verfügbar ist. Hier gilt es zu erkennen, dass fast jeder im Unternehmen Daten gut nutzen kann und sich im Umgang mit ihnen sicher fühlen möchte, um seine Arbeit noch erfolgreicher zu machen – eine neue Untersuchung des Business-Intelligence-Experten Qlik beweist das.

88 Prozent sind überzeugt: Mit Daten läuft es besser

Demnach würden 66 Prozent der Befragten gerne mehr Zeit und Energie in ihre Datenkompetenz investieren – wenn es die Gelegenheit dazu gäbe. 88 Prozent der befragten Sachbearbeiter und ausführenden Kräfte sind überzeugt davon, dass sie mit adäquatem Datenzugang sowie mit den nötigen Befugnissen und Kompetenzen bessere Resultate im Job erreichen könnten. Doch nur 55 Prozent fühlen sich tatsächlich demensprechend ausgestattet und befähigt. Ganz anders das Bild unter Führungskräften: Unter diesen sind zwar 83 Prozent überzeugt davon, guten Zugang zu Daten zu haben – allerdings haben nur 26 Prozent der Chefs wirklich einen Ansatz gefunden, wie sie nutzbringend mit den Daten arbeiten können.

Das bedeutet: Zur datengetriebenen Arbeit sowie zur Unternehmenssteuerung und -entwicklung auf der Basis von Daten braucht nicht jeder im Unternehmen die gleichen Daten und Dashboards. Jedoch braucht jeder Mitarbeiter in der Organisation gleichermaßen die Möglichkeiten und Fähigkeiten, unkompliziert in den Daten zu forschen, die ihm persönlich helfen, seine Arbeit zu verbessern. Welche Ideen und Anschlussfragen die assoziative Data Discovery im Selfservice auslöst, ist vorher schwer zu sagen – Assoziation ist spontan. Daher gilt: Die Erkenntnis kommt beim Tun.

Aus diesem Grund verlangt wirkliche Innovation nach schrankenloser und intuitiver Datenarbeit, die Platz lässt für Ideen, für ungewöhnliche Datenkombinationen und für ein erfindungsreiches „Um-die-Ecke-Denken“. Lineare SQL-Abfragen können das nicht leisten – und entsprechen in ihren vordefinierten Pfaden nicht der wertvollen Kombinationskompetenz, die das menschliche Gehirn von Natur aus mitbringt.

Zukunftsweisende Data Analytics und Advanced Analytics versucht nicht, das Denken und Assoziieren zu ersetzen – sondern die kognitiven Prozesse des Anwenders zu unterstützen, sie zu erweitern und in ihren Möglichkeiten zu vervollständigen. So entsteht Augmented Intelligence: die intelligente Verknüpfung von menschlicher Ratio und technologischer Schnelligkeit, bzw. Vollständigkeit.

Zentral gemanagte Governance

Natürlich soll assoziatives und individuelles Daten-Handling nicht zum digitalen Selbstbedienungsladen führen. Um dennoch assoziative Analysen und freies Forschen in relevanten Daten zu gewährleisten, bewährt sich in der Selfservice-Datenanalyse zentral gesteuerte Governance mit rollenbasierter Datenverfügbarkeit und individuellen Zugriffsrechten als ideale Lösung.

Data Science vs Data Engineering

The job of the Data Scientist is actually a fairly new trend, and yet other job titles are coming to us. “Is this really necessary?”, Some will ask. But the answer is clear: yes!

There are situations, every Data Scientist know: a recruiter calls, speaks about a great new challenge for a Data Scientist as you obviously claim on your LinkedIn profile, but in the discussion of the vacancy it quickly becomes clear that you have almost none of the required skills. This mismatch is mainly due to the fact that under the job of the Data Scientist all possible activity profiles, method and tool knowledge are summarized, which a single person can hardly learn in his life. Many open jobs, which are to be called under the name Data Science, describe rather the professional image of the Data Engineer.


Read this article in German:
“Data Science vs Data Engineering – Wo liegen die Unterschiede?“


What is a Data Engineer?

Data engineering is primarily about collecting or generating data, storing, historicalizing, processing, adapting and submitting data to subsequent instances. A Data Engineer, often also named as Big Data Engineer or Big Data Architect, models scalable database and data flow architectures, develops and improves the IT infrastructure on the hardware and software side, deals with topics such as IT Security , Data Security and Data Protection. A Data Engineer is, as required, a partial administrator of the IT systems and also a software developer, since he or she extends the software landscape with his own components. In addition to the tasks in the field of ETL / Data Warehousing, he also carries out analyzes, for example, to investigate data quality or user access. A Data Engineer mainly works with databases and data warehousing tools.

A Data Engineer is talented as an educated engineer or computer scientist and rather far away from the actual core business of the company. The Data Engineer’s career stages are usually something like:

  1. (Big) Data Architect
  2. BI Architect
  3. Senior Data Engineer
  4. Data Engineer

What makes a Data Scientist?

Although there may be many intersections with the Data Engineer’s field of activity, the Data Scientist can be distinguished by using his working time as much as possible to analyze the available data in an exploratory and targeted manner, to visualize the analysis results and to convert them into a red thread (storytelling). Unlike the Data Engineer, a data scientist rarely sees into a data center, because he picks up data via interfaces provided by the Data Engineer or provides by other resources.

A Data Scientist deals with mathematical models, works mainly with statistical procedures, and applies them to the data to generate knowledge. Common methods of Data Mining, Machine Learning and Predictive Modeling should be known to a Data Scientist. Data Scientists basically work close to the department and need appropriate expertise. Data Scientists use proprietary tools (e.g. Tools by IBM, SAS or Qlik) and program their own analyzes, for example, in Scala, Java, Python, Julia, or R. Using such programming languages and data science libraries (e.g. Mahout, MLlib, Scikit-Learn or TensorFlow) is often considered as advanced data science.

Data Scientists can have diverse academic backgrounds, some are computer scientists or engineers for electrical engineering, others are physicists or mathematicians, not a few have economical backgrounds. Common career levels could be:

  1. Chief Data Scientist
  2. Senior Data Scientist
  3. Data Scientist
  4. Data Analyst oder Junior Data Scientist

Data Scientist vs Data Analyst

I am often asked what the difference between a Data Scientist and a Data Analyst would be, or whether there would be a distinction criterion at all:

In my experience, the term Data Scientist stands for the new challenges for the classical concept of Data Analysts. A Data Analyst performs data analysis like a Data Scientist. More complex topics such as predictive analytics, machine learning or artificial intelligence are topics for a Data Scientist. In other words, a Data Scientist is a Data Analyst++ (one step above the Data Analyst).

And how about being a Business Analyst?

Business Analysts can (but need not) be Data Analysts. In any case, they have a very strong relationship with the core business of the company. Business Analytics is about analyzing business models and business successes. The analysis of business success is usually carried out by IT, and many business analysts are starting a career as Data Analyst now. Dashboards, KPIs and SQL are the tools of a good business analyst, but there might be a lot business analysts, who are just analysing business models by reading the newspaper…