Glorious career paths of a Big Data Professional

Are you wondering about the career profiles you may get to fill if you get into Big Data industry? If yes, then Bingo! This is the post that will inform you just about that. Big data is just an umbrella term. There are a lot of profiles and career paths that are covered under this umbrella term. Let us have a look at some of these profiles.

Data Visualisation Specialist

The process of visualizing data is turning out to be critical in guaranteeing information-driven representatives get the upfront investment required to actualize goal-oriented and significant Big Data extends in their organization. Making your data to tell a story and the craft of envisioning information convincingly has turned into a significant piece of the Big Data world and progressively associations need to have these capacities in-house. Besides, as a rule, these experts are relied upon to realize how to picture in different instruments, for example, Spotfire, D3, Carto, and Tableau – among numerous others. Information Visualization Specialists should be versatile and inquisitive to guarantee they stay aware of most recent patterns and answers for a recount to their information stories in the most intriguing manner conceivable with regards to the board room. 

 

Big Data Architect

This is the place the Hadoop specialists come in. Ordinarily, a Big Data planner tends to explicit information issues and necessities, having the option to portray the structure and conduct of a Big Data arrangement utilizing the innovation wherein they practice – which is, as a rule, mostly Hadoop.

These representatives go about as a significant connection between the association (and its specific needs) and Data Scientists and Engineers. Any organization that needs to assemble a Big Data condition will require a Big Data modeler who can serenely deal with the total lifecycle of a Hadoop arrangement – including necessity investigation, stage determination, specialized engineering structure, application plan, and advancement, testing the much-dreaded task of deploying lastly.

Systems Architect 

This Big data professional is in charge of how your enormous information frameworks are architected and interconnected. Their essential incentive to your group lies in their capacity to use their product building foundation and involvement with huge scale circulated handling frameworks to deal with your innovation decisions and execution forms. You’ll need this individual to construct an information design that lines up with the business, alongside abnormal state anticipating the improvement. The person in question will consider different limitations, adherence to gauges, and varying needs over the business.

Here are some responsibilities that they play:

    • Determine auxiliary prerequisites of databases by investigating customer tasks, applications, and programming; audit targets with customers and assess current frameworks.
    • Develop database arrangements by planning proposed framework; characterize physical database structure and utilitarian abilities, security, back-up and recuperation particulars.
    • Install database frameworks by creating flowcharts; apply ideal access methods, arrange establishment activities, and record activities.
    • Maintain database execution by distinguishing and settling generation and application advancement issues, figuring ideal qualities for parameters; assessing, incorporating, and putting in new discharges, finishing support and responding to client questions.
    • Provide database support by coding utilities, reacting to client questions, and settling issues.


Artificial Intelligence Developer

The certain promotion around Artificial Intelligence is additionally set to quicken the number of jobs publicized for masters who truly see how to apply AI, Machine Learning, and Deep Learning strategies in the business world. Selection representatives will request designers with broad learning of a wide exhibit of programming dialects which loan well to AI improvement, for example, Lisp, Prolog, C/C++, Java, and Python.

All said and done; many people estimate that this popular demand for AI specialists could cause a something like what we call a “Brain Drain” organizations poaching talented individuals away from the universe of the scholarly world. A month ago in the Financial Times, profound learning pioneer and specialist Yoshua Bengio, of the University of Montreal expressed: “The industry has been selecting a ton of ability — so now there’s a lack in the scholarly world, which is fine for those organizations. However, it’s not extraordinary for the scholarly world.” It ; howeverusiasm to perceive how this contention among the scholarly world and business is rotated in the following couple of years.

Data Scientist

The move of Big Data from tech publicity to business reality may have quickened, yet the move away from enrolling top Data Scientists isn’t set to change in 2020. An ongoing Deloitte report featured that the universe of business will require three million Data Scientists by 2021, so if their expectations are right, there’s a major ability hole in the market. This multidisciplinary profile requires specialized logical aptitudes, specialized software engineering abilities just as solid gentler abilities, for example, correspondence, business keenness, and scholarly interest.

Data Engineer

Clean and quality data is crucial in the accomplishment of Big Data ventures. Consequently, we hope to see a lot of opening in 2020 for Data Engineers who have a predictable and awesome way to deal with information transformation and treatment. Organizations will search for these special data masters to have broad involvement in controlling data with SQL, T-SQL, R, Hadoop, Hive, Python and Spark. Much like Data Scientists. They are likewise expected to be innovative with regards to contrasting information with clashing information types with have the option to determine issues. They additionally frequently need to make arrangements which enable organizations to catch existing information in increasingly usable information groups – just as performing information demonstrations and their modeling.

IT/Operations Manager Job Description

In Big data industry, the IT/Operations Manager is a profitable expansion to your group and will essentially be in charge of sending, overseeing, and checking your enormous information frameworks. You’ll depend on this colleague to plan and execute new hardware and administrations. The person in question will work with business partners to comprehend the best innovation ventures to address their procedures and concerns—interpreting business necessities to innovation plans. They’ll likewise work with venture chiefs to actualize innovation and be in charge of effective progress and general activities.

Here are some responsibilities that they play:

  • Manage and be proactive in announcing, settling and raising issues where required 
  • Lead and co-ordinate issue the executive’s exercises, notwithstanding ceaseless procedure improvement activities  
  • Proactively deal with our IT framework 
  • Supervise and oversee IT staffing, including enrollment, supervision, planning, advancement, and assessment
  • Verify existing business apparatuses and procedures remain ideally practical and worth included 
  • Benchmark, dissect, report on and make suggestions for the improvement and development of the IT framework and IT frameworks 
  • Advance and keep up a corporate SLA structure

Conclusion

These are some of the best career paths that big data professionals can play after entering the industry. Honesty and hard work can always take you to the zenith of any field that you choose to be in. Also, keep upgrading your skills by taking newer certifications and technologies. Good Luck 

6 Important Reasons for the Java Experts to learn Hadoop Skills

You must be well aware of the fact that Java and Hadoop Skills are in high demand these days. Gone are the days when advancement work moved around Java and social database. Today organizations are managing big information. It is genuinely big. From gigabytes to petabytes in size and social databases are exceptionally restricted to store it. Additionally, organizations are progressively outsourcing the Java development jobs to different groups who are as of now having big data experts.

Ever wondered what your future would have in store for you if you possess Hadoop as well as Java skills? No? Let us take a look. Today we shall discuss the point that why is it preferable for Java Developers to learn Hadoop.

Hadoop is the Future Java-based Framework that Leads the Industry

Data analysis is the current marketing strategy that the companies are adopting these days. What’s more, Hadoop is to process and comprehend all the Big Data that is generated all the time. As a rule, Hadoop is broadly utilized by practically all organizations from big and small and in practically all business spaces. It is an open-source stage where Java owes a noteworthy segment of its success

The processing channel of Hadoop, which is MapReduce, is written in Java. Thus, a Hadoop engineer needs to compose MapReduce contents in Java for Big data analysis. Notwithstanding that, HDFS, which is the record arrangement of Hadoop, is additionally Java-based programming language at its core. Along these lines, a Hadoop developer needs to compose documents from local framework to HDFS through deployment, which likewise includes Java programming.

Learn Hadoop: It is More Comfortable for a Java Developer

Hadoop is more of an environment than a standalone innovation. Also, Hadoop is a Java-based innovation. Regardless of whether it is Hadoop 1 which was about HDFS and MapReduce or Hadoop2 biological system that spreads HDFS, Spark, Yarn, MapReduce, Tez, Flink, Giraph, Storm, JVM is the base for all. Indeed, even a portion of the broadly utilized programming languages utilized in a portion of the Hadoop biological system segments like Spark is JVM based. The run of the mill models is Scala and Clojure.

Consequently, if you have a Java foundation, understanding Hadoop is progressively easier for you. Also, here, a Hadoop engineer needs Java programming information to work in MapReduce or Spark structure. Thus, if you are as of now a Java designer with a logical twist of the brain, you are one stage ahead to turn into a Hadoop developer.

IT Industry is looking for Professionals with Java and Hadoop Skills

If you pursue the expected set of responsibilities and range of abilities required for a Hadoop designer in places of work, wherever you will watch the reference of Java. As Hadoop needs solid Java foundation, from this time forward associations are searching for Java designers as the best substitution for Hadoop engineers. It is savvy asset usage for organizations as they don’t have to prepare Java for new recruits to learn Hadoop for tasks.

Nonetheless, the accessible market asset for Hadoop is less. Therefore, there is a noteworthy possibility for Java designers in the Hadoop occupation field. Henceforth, as a Java designer, on the off chance that you are not yet arrived up in your fantasy organization, learning Hadoop, will without a doubt help you to discover the chance to one of your top picks.

Combined Java and Hadoop Skills Means Better Pay Packages

You will be progressively keen on learning Hadoop on the off chance that you investigate Gartner report on big information industry. According to the report, the Big Data industry has just come to the $50 billion points. Additionally, over 64% of the main 720 organizations worldwide are prepared to put resources into big information innovation. Notwithstanding that when you are a mix of a Java and Hadoop engineer, you can appreciate 250% pay climb with a normal yearly compensation of $150,000.It is about the yearly pay of a senior Hadoop developer.

Besides, when you change to Big Data Hadoop, it very well may be useful to improve the nature of work. You will manage unpredictable and greater tasks. It does not just give you a better extension to demonstrate your expertise yet, in addition, to set up yourself as a profitable asset who can have any kind of effect.

Adapting Big Data Hadoop can be exceptionally advantageous because it will assist you in dealing with greater, complex activities a lot simpler and convey preferable yield over your associates. To be considered for examinations, you should be somebody who can have any kind of effect in the group, and that is the thing that Hadoop lets you be.

Learning Hadoop will open New Opportunities to Other Lucrative Fields

Big data is only not going to learn Hadoop. When you are in Big information space, you have sufficient chance to jump other Java and Hadoop engineer. There are different exceedingly requesting zones in big information like Artificial Intelligence, Machine Learning, Data Science. You can utilize your Java and Hadoop engineer expertise as a springboard to take your vocation to the following level. In any case, the move will give you the best outcome once you move from Java to Hadoop and increase fundamental working knowledge.

Java with Hadoop opens new skylines of occupation jobs, for example, data scientist, data analyst business intelligence analyst, DBA, etc.

Premier organizations prefer Hadoop Developers with Java skills

Throughout the years the Internet has been the greatest driver of information, and the new data produced in 2012 remained at 2500 Exabyte. The computerized world developed by 62% a year ago to 800K petabytes and will keep on developing to the tune of 1.2 zeta bytes during the present year. Gartner gauges the market of Hadoop Ecosystem to $77 million and predicts it will come to the $813 million marks by 2016.

A review of LinkedIn profiles referencing Hadoop as their abilities uncovered that just about 17000 individuals are working in Companies like Cisco, HP, TCS, Oracle, Amazon, Yahoo, and Facebook, and so on. Aside from this Java proficient who learn Hadoop can begin their vocations with numerous new businesses like Platfora, Alpine information labs, Trifacta, Datatorrent, and so forth.

Conclusion

You can see that combining your Java skills with Hadoop skills can open the doors of several new opportunities for you. You can get better remuneration for your efforts, and you will always be in high demand. It is high time to learn Hadoop online now if you are a java developer.

Interview – Customer Data Platform, more than CRM 2.0?

Interview with David M. Raab from the CDP Institute

David M. Raab is as a consultant specialized in marketing software and service vendor selection, marketing analytics and marketing technology assessment. Furthermore he is the founder of the Customer Data Platform Institute which is a vendor-neutral educational project to help marketers build a unified customer view that is available to all of their company systems.

Furthermore he is a Keynote-Speaker for the Predictive Analytics World Event 2019 in Berlin.

Data Science Blog: Mr. Raab, what exactly is a Customer Data Platform (CDP)? And where is the need for it?

The CDP Institute defines a Customer Data Platform as „packaged software that builds a unified, persistent customer database that is accessible by other systems“.  In plainer language, a CDP assembles customer data from all sources, combines it into customer profiles, and makes the profiles available for any use.  It’s important because customer data is collected in so many different systems today and must be unified to give customers the experience they expect.

Data Science Blog: Is it something like a CRM System 2.0? What Use Cases can be realized by a Customer Data Platform?

CRM systems are used to interact directly with customers, usually by telephone or in the field.  They work almost exclusively with data that is entered during those interactions.  This gives a very limited view of the customer since interactions through other channels such as order processing or Web sites are not included.  In fact, one common use case for CDP is to give CRM users a view of all customer interactions, typically by opening a window into the CDP database without needing to import the data into the CRM.  There are many other use cases for unified data, including customer segmentation, journey analysis, and personalization.  Anything that requires sharing data across different systems is a CDP use case.

Data Science Blog: When does a CDP make sense for a company? It is more relevant for retail and financial companies than for industrial companies, isn´t it?

CDP has been adopted most widely in retail and online media, where each customer has many interactions and there are many products to choose from.  This is a combination that can make good use of predictive modeling, which benefits greatly from having more complete data.  Financial services was slower to adopt, probably because they have fewer products but also because they already had pretty good customer data systems.  B2B has also been slow to adopt because so much of their customer relationship is handled by sales people.  We’ve more recently been seeing growth in additional sectors such as travel, healthcare, and education.  Those involve fewer transactions than retail but also rely on building strong customer relationships based on good data.

Data Science Blog: There are several providers for CDPs. Adobe, Tealium, Emarsys or Dynamic Yield, just to name some of them. Do they differ a lot between each other?

Yes they do.  All CDPs build the customer profiles I mentioned.  But some do more things, such as predictive modeling, message selection, and, increasingly, message delivery.  Of course they also vary in the industries they specialize in, regions they support, size of clients they work with, and many technical details.  This makes it hard to buy a CDP but also means buyers are more likely to find a system that fits their needs.

Data Science Blog: How established is the concept of the CDP in Europe in general? And how in comparison with the United States?

CDP is becoming more familiar in Europe but is not as well understood as in the U.S.  The European market spent a lot of money on Data Management Platforms (DMPs) which promised to do much of what a CDP does but were not able to because they do not store the level of detail that a CDP does.  Many DMPs also don’t work with personally identifiable data because the DMPs primarily support Web advertising, where many customers are anonymous.  The failures of DMPs have harmed CDPs because they have made buyers skeptical that any system can meet their needs, having already failed once.  But we are overcoming this as the market becomes better educated and more success stories are available.  What’s the same in Europe and the U.S. is that marketers face the same needs.  This will push European marketers towards CDPs as the best solution in many cases.

Data Science Blog: What are coming trends? What will be the main topic 2020?

We see many CDPs with broader functions for marketing execution: campaign management, personalization, and message delivery in particular.  This is because marketers would like to buy as few systems as possible, so they want broader scope in each systems.  We’re seeing expansion into new industries such as financial services, travel, telecommunications, healthcare, and education.  Perhaps most interesting will be the entry of Adobe, Salesforce, and Oracle, who have all promised CDP products late this year or early next year.  That will encourage many more people to consider buying CDPs.  We expect that market will expand quite rapidly, so current CDP vendors will be able to grow even as Adobe, Salesforce, and Oracle make new CDP sales.


You want to get in touch with Daniel M. Raab and understand more about the concept of a CDP? Meet him at the Predictive Analytics World 18th and 19th November 2019 in Berlin, Germany. As a Keynote-Speaker, he will introduce the concept of a Customer Data Platform in the light of Predictive Analytics. Click here to see the agenda of the event.

 


 

Interview: Does Business Intelligence benefit from Cloud Data Warehousing?

Interview with Ross Perez, Senior Director, Marketing EMEA at Snowflake

Read this article in German:
“Profitiert Business Intelligence vom Data Warehouse in der Cloud?”

Does Business Intelligence benefit from Cloud Data Warehousing?

Ross Perez is the Senior Director, Marketing EMEA at Snowflake. He leads the Snowflake marketing team in EMEA and is charged with starting the discussion about analytics, data, and cloud data warehousing across EMEA. Before Snowflake, Ross was a product marketer at Tableau Software where he founded the Iron Viz Championship, the world’s largest and longest running data visualization competition.

Data Science Blog: Ross, Business Intelligence (BI) is not really a new trend. In 2019/2020, making data available for the whole company should not be a big thing anymore. Would you agree?

BI is definitely an old trend, reporting has been around for 50 years. People are accustomed to seeing statistics and data for the company at large, and even their business units. However, using BI to deliver analytics to everyone in the organization and encouraging them to make decisions based on data for their specific area is relatively new. In a lot of the companies Snowflake works with, there is a huge new group of people who have recently received access to self-service BI and visualization tools like Tableau, Looker and Sigma, and they are just starting to find answers to their questions.

Data Science Blog: Up until today, BI was just about delivering dashboards for reporting to the business. The data warehouse (DWH) was something like the backend. Today we have increased demand for data transparency. How should companies deal with this demand?

Because more people in more departments are wanting access to data more frequently, the demand on backend systems like the data warehouse is skyrocketing. In many cases, companies have data warehouses that weren’t built to cope with this concurrent demand and that means that the experience is slow. End users have to wait a long time for their reports. That is where Snowflake comes in: since we can use the power of the cloud to spin up resources on demand, we can serve any number of concurrent users. Snowflake can also house unlimited amounts of data, of both structured and semi-structured formats.

Data Science Blog: Would you say the DWH is the key driver for becoming a data-driven organization? What else should be considered here?

Absolutely. Without having all of your data in a single, highly elastic, and flexible data warehouse, it can be a huge challenge to actually deliver insight to people in the organization.

Data Science Blog: So much for the theory, now let’s talk about specific use cases. In general, it matters a lot whether you are storing and analyzing e.g. financial data or machine data. What do we have to consider for both purposes?

Financial data and machine data do look very different, and often come in different formats. For instance, financial data is often in a standard relational format. Data like this needs to be able to be easily queried with standard SQL, something that many Hadoop and noSQL tools were unable to provide. Luckily, Snowflake is an ansi-standard SQL data warehouse so it can be used with this type of data quite seamlessly.

On the other hand, machine data is often semi-structured or even completely unstructured. This type of data is becoming significantly more common with the rise of IoT, but traditional data warehouses were very bad at dealing with it since they were optimized for relational data. Semi-structured data like JSON, Avro, XML, Orc and Parquet can be loaded into Snowflake for analysis quite seamlessly in its native format. This is important, because you don’t want to have to flatten the data to get any use from it.

Both types of data are important, and Snowflake is really the first data warehouse that can work with them both seamlessly.

Data Science Blog: Back to the common business use case: Creating sales or purchase reports for the business managers, based on data from ERP-systems such as Microsoft or SAP. Which architecture for the DWH could be the right one? How many and which database layers do you see as necessary?

The type of report largely does not matter, because in all cases you want a data warehouse that can support all of your data and serve all of your users. Ideally, you also want to be able to turn it off and on depending on demand. That means that you need a cloud-based architecture… and specifically Snowflake’s innovative architecture that separates storage and compute, making it possible to pay for exactly what you use.

Data Science Blog: Where would you implement the main part of the business logic for the report? In the DWH or in the reporting tool? Does it matter which reporting tool we choose?

The great thing is that you can choose either. Snowflake, as an ansi-Standard SQL data warehouse, can support a high degree of data modeling and business logic. But you can also utilize partners like Looker and Sigma who specialize in data modeling for BI. We think it’s best that the customer chooses what is right for them.

Data Science Blog: Snowflake enables organizations to store and manage their data in the cloud. Does it mean companies lose control over their storage and data management?

Customers have complete control over their data, and in fact Snowflake cannot see, alter or change any aspect of their data. The benefit of a cloud solution is that customers don’t have to manage the infrastructure or the tuning – they decide how they want to store and analyze their data and Snowflake takes care of the rest.

Data Science Blog: How big is the effort for smaller and medium sized companies to set up a DWH in the cloud? Does this have to be an expensive long-term project in every case?

The nice thing about Snowflake is that you can get started with a free trial in a few minutes. Now, moving from a traditional data warehouse to Snowflake can take some time, depending on the legacy technology that you are using. But Snowflake itself is quite easy to set up and very much compatible with historical tools making it relatively easy to move over.

Bringing intelligence to where data lives: Python & R embedded in T-SQL

Introduction

Did you know that you can write R and Python code within your T-SQL statements? Machine Learning Services in SQL Server eliminates the need for data movement. Instead of transferring large and sensitive data over the network or losing accuracy with sample csv files, you can have your R/Python code execute within your database. Easily deploy your R/Python code with SQL stored procedures making them accessible in your ETL processes or to any application. Train and store machine learning models in your database bringing intelligence to where your data lives.

You can install and run any of the latest open source R/Python packages to build Deep Learning and AI applications on large amounts of data in SQL Server. We also offer leading edge, high-performance algorithms in Microsoft’s RevoScaleR and RevoScalePy APIs. Using these with the latest innovations in the open source world allows you to bring unparalleled selection, performance, and scale to your applications.

If you are excited to try out SQL Server Machine Learning Services, check out the hands on tutorial below. If you do not have Machine Learning Services installed in SQL Server,you will first want to follow the getting started tutorial I published here: 

How-To Tutorial

In this tutorial, I will cover the basics of how to Execute R and Python in T-SQL statements. If you prefer learning through videos, I also published the tutorial on YouTube.

Basics

Open up SQL Server Management Studio and make a connection to your server. Open a new query and paste this basic example: (While I use Python in these samples, you can do everything with R as well)

EXEC sp_execute_external_script @language = N'Python',
@script = N'print(3+4)'

Sp_execute_external_script is a special system stored procedure that enables R and Python execution in SQL Server. There is a “language” parameter that allows us to choose between Python and R. There is a “script” parameter where we can paste R or Python code. If you do not see an output print 7, go back and review the setup steps in this article.

Parameter Introduction

Now that we discussed a basic example, let’s start adding more pieces:

EXEC sp_execute_external_script  @language =N'Python', 
@script = N' 
OutputDataSet = InputDataSet;
',
@input_data_1 =N'SELECT 1 AS Col1';

Machine Learning Services provides more natural communications between SQL and R/Python with an input data parameter that accepts any SQL query. The input parameter name is called “input_data_1”.
You can see in the python code that there are default variables defined to pass data between Python and SQL. The default variable names are “OutputDataSet” and “InputDataSet” You can change these default names like this example:

EXEC sp_execute_external_script  @language =N'Python', 
@script = N' 
MyOutput = MyInput;
',
@input_data_1_name = N'MyInput',
@input_data_1 =N'SELECT 1 AS foo',
@output_data_1_name =N'MyOutput';

As you executed these examples, you might have noticed that they each return a result with “(No column name)”? You can specify a name for the columns that are returned by adding the WITH RESULT SETS clause to the end of the statement which is a comma separated list of columns and their datatypes.

EXEC sp_execute_external_script  @language =N'Python', 
@script=N' 
MyOutput = MyInput;
',
@input_data_1_name = N'MyInput',
@input_data_1 =N'
SELECT 1 AS foo,
2 AS bar
',
@output_data_1_name =N'MyOutput'
WITH RESULT SETS ((MyColName int, MyColName2 int));

Input/Output Data Types

Alright, let’s discuss a little more about the input/output data types used between SQL and Python. Your input SQL SELECT statement passes a “Dataframe” to python relying on the Python Pandas package. Your output from Python back to SQL also needs to be in a Pandas Dataframe object. If you need to convert scalar values into a dataframe here is an example:

EXEC sp_execute_external_script  @language =N'Python', 
@script=N' 
import pandas as pd
c = 1/2
d = 1*2
s = pd.Series([c,d])
df = pd.DataFrame(s)
OutputDataSet = df
'

Variables c and d are both scalar values, which you can add to a pandas Series if you like, and then convert them to a pandas dataframe. This one shows a little bit more complicated example, go read up on the python pandas package documentation for more details and examples:

EXEC sp_execute_external_script  @language =N'Python', 
@script=N' 
import pandas as pd
s = {"col1": [1, 2], "col2": [3, 4]}
df = pd.DataFrame(s)
OutputDataSet = df
'

You now know the basics to execute Python in T-SQL!

Did you know you can also write your R and Python code in your favorite IDE like RStudio and Jupyter Notebooks and then remotely send the execution of that code to SQL Server? Check out these documentation links to learn more: https://aka.ms/R-RemoteSQLExecution https://aka.ms/PythonRemoteSQLExecution

Check out the SQL Server Machine Learning Services documentation page for more documentation, samples, and solutions. Check out these E2E tutorials on github as well.

Would love to hear from you! Leave a comment below to ask a question, or start a discussion!

Data Science Modeling and Featurization

Overview

Data modeling is an essential part of the data science pipeline. This, combined with the fact that it is a very rewarding process, makes it the one that often receives the most attention among data science learners. However, things are not as simple as they may seem, since there is much more to it than applying a function from a particular class of a package and applying it on the data available.

A big part of data science modeling involves evaluating a model, for example, making sure that it is robust and therefore reliable. Also, data science modeling is closely linked to creating an information rich feature set. Moreover, it entails a variety of other processes that ensure that the data at hand is harnessed as much as possible.

What Is a Robust Data Model?

When it comes to robust models, worthy of making it to production, these need to tick several boxes. First of all, they need to have a good performance, based on various metrics. Oftentimes a single metric can be misleading, as how well a model performs has many aspects, especially for classification problems.

In addition, a robust model has good generalization. This means that the model performs well for various datasets, not just the one it has been trained on.

Sensitivity analysis is another aspect of a data science modeling, something essential for thoroughly testing a model to ensure it is robust enough. Sensitivity is a condition whereby a model’s output is bound to change significantly if the inputs change even slightly. This is quite undesirable and needs to be checked since a robust model ought to be stable.

Finally, interpretability is an important aspect too, though it’s not always possible. This has to do with how easy it is to interpret a model’s results. Many modern models, however, are more like black boxes, making it particularly difficult to interpret them. Nevertheless, it is often preferable to opt for an interpretable model, especially if we need to defend its outputs to others.

How Is Featurization Accomplished?

In order for a model to maximize its potential, it needs an information rich set of features. The latter can be created in various ways. Whatever the case, cleaning up the data is a prerequisite. This involves removing or correcting problematic data points, filling in missing values wherever possible, and in some cases removing noisy variables.

Before you can use variables in a model, you need to perform normalization on them. This is usually accomplished through a linear transformation ensuring that the variable’s values are around a certain range. Oftentimes, normalization is sufficient for turning your variables into features, once they are cleaned.

Binning is another process that can aid in featurization. This entails creating nominal (discreet) variables, which can in turn be broken down into binary features, to be used in a data model.

Finally, some dimensionality reduction method (e.g. PCA) can be instrumental in shaping up your feature-set. This has to do with creating linear combinations of features, aka meta-features, which express the same information in fewer dimensions.

Some Useful Considerations

Beyond these basic attributes of data science modeling there several more that every data scientist has in mind in order to create something of value from the available data. Things like in-depth testing using sensitivity analysis, specialized sampling, and various aspects of model performance (as well as tweaking the model to optimize for a particular performance metric) are parts of data science modeling that require meticulous study and ample practice. After all, even though this part of data science is fairly easy to pick up, it takes a while to master, while performing well in it is something that every organization can benefit from.

Resources

To delve more into all this, there are various relevant resources you can leverage, helping you in not just the methodologies involved but also in the mindset behind them. Here are two of the most useful ones.

  1. Data Science Modeling Tutorial on the Safari platform
  2. Data Science Mindset, Methodologies and Misconceptions book (Technics Publications)

Data Science vs Data Engineering

The job of the Data Scientist is actually a fairly new trend, and yet other job titles are coming to us. “Is this really necessary?”, Some will ask. But the answer is clear: yes!

There are situations, every Data Scientist know: a recruiter calls, speaks about a great new challenge for a Data Scientist as you obviously claim on your LinkedIn profile, but in the discussion of the vacancy it quickly becomes clear that you have almost none of the required skills. This mismatch is mainly due to the fact that under the job of the Data Scientist all possible activity profiles, method and tool knowledge are summarized, which a single person can hardly learn in his life. Many open jobs, which are to be called under the name Data Science, describe rather the professional image of the Data Engineer.


Read this article in German:
“Data Science vs Data Engineering – Wo liegen die Unterschiede?“


What is a Data Engineer?

Data engineering is primarily about collecting or generating data, storing, historicalizing, processing, adapting and submitting data to subsequent instances. A Data Engineer, often also named as Big Data Engineer or Big Data Architect, models scalable database and data flow architectures, develops and improves the IT infrastructure on the hardware and software side, deals with topics such as IT Security , Data Security and Data Protection. A Data Engineer is, as required, a partial administrator of the IT systems and also a software developer, since he or she extends the software landscape with his own components. In addition to the tasks in the field of ETL / Data Warehousing, he also carries out analyzes, for example, to investigate data quality or user access. A Data Engineer mainly works with databases and data warehousing tools.

A Data Engineer is talented as an educated engineer or computer scientist and rather far away from the actual core business of the company. The Data Engineer’s career stages are usually something like:

  1. (Big) Data Architect
  2. BI Architect
  3. Senior Data Engineer
  4. Data Engineer

What makes a Data Scientist?

Although there may be many intersections with the Data Engineer’s field of activity, the Data Scientist can be distinguished by using his working time as much as possible to analyze the available data in an exploratory and targeted manner, to visualize the analysis results and to convert them into a red thread (storytelling). Unlike the Data Engineer, a data scientist rarely sees into a data center, because he picks up data via interfaces provided by the Data Engineer or provides by other resources.

A Data Scientist deals with mathematical models, works mainly with statistical procedures, and applies them to the data to generate knowledge. Common methods of Data Mining, Machine Learning and Predictive Modeling should be known to a Data Scientist. Data Scientists basically work close to the department and need appropriate expertise. Data Scientists use proprietary tools (e.g. Tools by IBM, SAS or Qlik) and program their own analyzes, for example, in Scala, Java, Python, Julia, or R. Using such programming languages and data science libraries (e.g. Mahout, MLlib, Scikit-Learn or TensorFlow) is often considered as advanced data science.

Data Scientists can have diverse academic backgrounds, some are computer scientists or engineers for electrical engineering, others are physicists or mathematicians, not a few have economical backgrounds. Common career levels could be:

  1. Chief Data Scientist
  2. Senior Data Scientist
  3. Data Scientist
  4. Data Analyst oder Junior Data Scientist

Data Scientist vs Data Analyst

I am often asked what the difference between a Data Scientist and a Data Analyst would be, or whether there would be a distinction criterion at all:

In my experience, the term Data Scientist stands for the new challenges for the classical concept of Data Analysts. A Data Analyst performs data analysis like a Data Scientist. More complex topics such as predictive analytics, machine learning or artificial intelligence are topics for a Data Scientist. In other words, a Data Scientist is a Data Analyst++ (one step above the Data Analyst).

And how about being a Business Analyst?

Business Analysts can (but need not) be Data Analysts. In any case, they have a very strong relationship with the core business of the company. Business Analytics is about analyzing business models and business successes. The analysis of business success is usually carried out by IT, and many business analysts are starting a career as Data Analyst now. Dashboards, KPIs and SQL are the tools of a good business analyst, but there might be a lot business analysts, who are just analysing business models by reading the newspaper…