Data Scientist: Rock the Tech World

It’s almost 2020! Are you a data Rockstar or a laggard?

IDC agrees to the fact that the global data, 33 zettabytes in 2018 is predicted to grow to 175 zettabytes by 2025. That’s like ten times bigger the amount of data seen in 2017.

Isn’t this an exciting analysis? 

Hold on! Are all the industries set for a digitally transformed future? 

A digital transformed future is an opportunity of historic proportions. The way data is consumed today changes the way we work, live, and play. Businesses across the globe are now using data to transform themselves to adapt to these changes, become agile, improve customer experience, and to introduce new business models. 

With the full dependency on online channels, connectivity with friends and family around the world has increased the consumption of data. Today, the entire economy is reliant on data. Without data, you’re lost. 

Leverage the benefits of the data era

At the outset, with not many big data industries to be found, we can still agree to the fact that the knowledge for data skills is still early for professionals in the big data realm.

  • Big data assisting the humanitarian aid 
    • Case study: During a disaster

Be it natural or conflict-driven – if the response is driven quickly, it minimizes problems that are predicted to happen. In such instances, big data could be of great help in helping improve the responses of the aid organizations. 

Data scientists can easily use analytics and algorithms to provide actionable insights that can be used during emergencies to identify patterns in the data that is generated by online connected devices or from other related sources. 

During a 2015 refugee crisis in Germany, the Sweden Migration Board saw 10,000 asylum seekers every week up from 2,500 asylum seekers they saw in a month. A critical situation where other organizations could have faced challenges in dealing with the problem. However, with the help of big data this agency could cope up with the challenges. The challenges were addressed by ensuring extra staff was hired and of securing housing started early. Big data was of aid to this agency, meaning since they were users of this the preprocessing technology for quite a long time, the predictions were given well ahead of time. 

Earlier the results were not easy to extract due to obstruction such as not finding the relevant data. However, now with the launch of open data initiatives, the process has become easy. 

  • Tapping into talents of data scientist 

The Defence Science and Technology Laboratory (Dstl) along with other government partners launched “The Data Science Challenge.” This is done to harness the skills of data science professionals, to check their capability of tackling real-world problems people face daily. 

The challenge is part of a wider program set out majorly in the Defence Innovation Initiative.

It is an open data science challenge that welcome entrants from all facets of background and specialization to demonstrate their skills. The challenge is to acknowledge that the best of minds need not necessarily be the ones that work for you. 

 

  • The challenge comprises of two competitions each offering an award of £40,000
  1. First competition – this analyzes the ability to analyze data that is in documents i.e. media reports. This helps the data scientist have a deeper understanding of a political situation like it occurs for those on the ground level and even for those assisting it from afar. 
  2. Second competition – the second test involves creating possible ways to detect and classify vehicles like buses, cars, and motorbikes easily from aerial imagery. A solution to be used for aiding the safe journey of vehicles going through conflict zones.  

What makes the data world significant?

In all aspects, the upshot of the paradigm shift is that data has become a critical influencer in businesses as well as our lives. Internet of things (IoT) and embedded devices are already pacing their way in boosting the big data world. 

Some great key findings based on research by IDC: –

  • 75% of the population that interacts with data is estimated to stay connected by 2025.
  • The number of embedded devices that can be found on driverless cars, manufacturing floors, and smart buildings is estimated to grow from less than one per person to more than four in the next decade. 
  • In 2025, the amount of real-time data created in the data sphere is estimated to be over 25% while IoT real-time data will be more than 95%. 

With the data science industry becoming the top-end of the pyramid, a certified data scientist plays an imperative role today. 

In recent times, it is seen that big data has emerged to be the célèbre in the tech industry, generating several job opportunities.

What do you consider yourself to be today? 

Defining a data scientist is tough and finding one is tougher!

 

The importance of being Data Scientist

Header-Image by Clint Adair on Unsplash.

The incredible results of Machine Learning and Artificial Intelligence, Deep Learning in particular, could give the impression that Data Scientist are like magician. Just think of it. Recognising faces of people, translating from one language to another, diagnosing diseases from images, computing which product should be shown for us next to buy and so on from numbers only. Numbers which existed for centuries. What a perfect illusion. But it is only an illusion, as Data Scientist existed as well for centuries. However, there is a difference between the one from today compared to the one from the past: evolution.

The main activity of Data Scientist is to work with information also called data. Records of data are as old as mankind, but only within the 16 century did it include also numeric forms — as numbers started to gain more and more ground developing their own symbols. Numerical data, from a given phenomenon — being an experiment or the counts of sheep sold by week over the year –, was from early on saved in tabular form. Such a way to record data is interlinked with the supposition that information can be extracted from it, that knowledge — in form of functions — is hidden and awaits to be discovered. Collecting data and determining the function best fitting them let scientist to new insight into the law of nature right away: Galileo’s velocity law, Kepler’s planetary law, Newton theory of gravity etc.

Such incredible results where not possible without the data. In the past, one was able to collect data only as a scientist, an academic. In many instances, one needed to perform the experiment by himself. Gathering data was tiresome and very time consuming. No sensor which automatically measures the temperature or humidity, no computer on which all the data are written with the corresponding time stamp and are immediately available to be analysed. No, everything was performed manually: from the collection of the data to the tiresome computation.

More then that. Just think of Michael Faraday and Hermann Hertz and there experiments. Such endeavour where what we will call today an one-man-show. Both of them developed parts of the needed physics and tools, detailed the needed experiment settings, conducting the experiment and collect the data and, finally, computing the results. The same is true for many other experiments of their time. In biology Charles Darwin makes its case regarding evolution from the data collected in his expeditions on board of the Beagle over a period of 5 years, or Gregor Mendel which carry out a study of pea regarding the inherence of traits. In physics Blaise Pascal used the barometer to determine the atmospheric pressure or in chemistry Antoine Lavoisier discovers from many reaction in closed container that the total mass does not change over time. In that age, one person was enough to perform everything and was the reason why the last part, of a data scientist, could not be thought of without the rest. It was inseparable from the rest of the phenomenon.

With the advance of technology, theory and experimental tools was a specialisation gradually inescapable. As the experiments grow more and more complex, the background and condition in which the experiments were performed grow more and more complex. Newton managed to make first observation on light with a simple prism, but observing the line and bands from the light of the sun more than a century and half later by Joseph von Fraunhofer was a different matter. The small improvements over the centuries culminated in experiments like CERN or the Human Genome Project which would be impossible to be carried out by one person alone. Not only was it necessary to assign a different person with special skills for a separate task or subtask, but entire teams. CERN employs today around 17 500 people. Only in such a line of specialisation can one concentrate only on one task alone. Thus, some will have just the knowledge about the theory, some just of the tools of the experiment, other just how to collect the data and, again, some other just how to analyse best the recorded data.

If there is a specialisation regarding every part of the experiment, what makes Data Scientist so special? It is impossible to validate a theory, deciding which market strategy is best without the work of the Data Scientist. It is the reason why one starts today recording data in the first place. Not only the size of the experiment has grown in the past centuries, but also the size of the data. Gauss manage to determine the orbit of Ceres with less than 20 measurements, whereas the new picture about the black hole took 5 petabytes of recorded data. To put this in perspective, 1.5 petabytes corresponds to 33 billion photos or 66.5 years of HD-TV videos. If one includes also the time to eat and sleep, than 5 petabytes would be enough for a life time.

For Faraday and Hertz, and all the other scientist of their time, the goal was to find some relationship in the scarce data they painstakingly recorded. Due to time limitations, no special skills could be developed regarding only the part of analysing data. Not only are Data Scientist better equipped as the scientist of the past in analysing data, but they managed to develop new methods like Deep Learning, which have no mathematical foundation yet in spate of their success. Data Scientist developed over the centuries to the seldom branch of science which bring together what the scientific specialisation was forced to split.

What was impossible to conceive in the 19 century, became more and more a reality at the end of the 20 century and developed to a stand alone discipline at the beginning of the 21 century. Such a development is not only natural, but also the ground for the development of A.I. in general. The mathematical tools needed for such an endeavour where already developed by the half of the 20 century in the period when computing power was scars. Although the mathematical methods were present for everyone, to understand them and learn how to apply them developed quite differently within every individual field in which Machine Learning/A.I. was applied. The way the same method would be applied by a physicist, a chemist, a biologist or an economist would differ so radical, that different words emerged which lead to different langues for similar algorithms. Even today, when Data Science has became a independent branch, two different Data Scientists from different application background could find it difficult to understand each other only from a language point of view. The moment they look at the methods and code the differences will slowly melt away.

Finding a universal language for Data Science is one of the next important steps in the development of A.I. Then it would be possible for a Data Scientist to successfully finish a project in industry, turn to a new one in physics, then biology and returning to industry without much need to learn special new languages in order to be able to perform each tasks. It would be possible to concentrate on that what a Data Scientist does best: find the best algorithm. In other words, a Data Scientist could resolve problems independent of the background the problem was stated.

This is the most important aspect that distinguish the Data Scientist. A mathematician is limited to solve problems in mathematics alone, a physicist is able to solve problems only in physics, a biologist problems only in biology. With a unique language regarding the methods and strategies to solve Machine Learning/A.I. problems, a Data Scientist can solve a problem independent of the field. Specialisation put different branches of science at drift from each other, but it is the evolution of the role of the Data Scientist to synthesize from all of them and find the quintessence in a language which transpire beyond all the field of science. The emerging language of Data Science is a new building block, a new mathematical language of nature.

Although such a perspective does not yet exists, the principal component of Machine Learning/A.I. already have such proprieties partially in form of data. Because predicting for example the numbers of eggs sold by a company or the numbers of patients which developed immune bacteria to a specific antibiotic in all hospital in a country can be performed by the same prediction method. The data do not carry any information about the entities which are being predicted. It does not matter anymore if the data are from Faraday’s experiment, CERN of Human Genome. The same data set and its corresponding prediction could stand literary for anything. Thus, the result of the prediction — what we would call for a human being intuition and/or estimation — would be independent of the domain, the area of knowledge it originated.

It also lies at the very heart of A.I., the dream of researcher to create self acting entities, that is machines with consciousness. This implies that the algorithms must be able to determine which task, model is relevant at a given moment. It would be to cumbersome to have a model for every task and and every field and then try to connect them all in one. The independence of scientific language, like of data, is thus a mandatory step. It also means that developing A.I. is not only connected to develop a new consciousness, but, and most important, to the development of our one.

Accelerate your AI Skills Today: A Million Dollar Job!

The skyrocketing salaries ($1m per year) of AI engineers is not a hype. It is the fact of current corporate world, where you will witness a shift that is inevitable.

We’ve already set our feet at the edge of the technological revolution. A revolution that is at the verge of altering the way we live and work. As the fact suggests, humanity has fundamentally developed human production in three revolutions, and we’re now entering the fourth revolution. In its scope, the fourth revolution projects a transformation that is unlike anything we humans have ever experienced.

  • The first revolution had the world transformed from rural to urban
  • the emergence of mass production in the second revolution
  • third introduced the digital revolution
  • The fourth industrial revolution is anxious to integrate technologies into our lives.

And all thanks to artificial intelligence (AI). An advanced technology that surrounds us, from virtual assistants to software that translates to self-driving cars.

The rise of AI at an exponential rate has disrupted almost every industry. So much so that AI is being rated as one-million-dollar profession.

Did this grab your attention? It did?

Now, what if we were to tell you that the salary compensation for AI experts has grown dramatically. AI and machine learning are fields that have a mountain of demand in the tech industry today but has sparse supply.

AI field is growing at a quicker pace and salaries are skyrocketing! Read it for yourself to know what AI experts, AI researchers and any other AI talent are commanding today.

  • A top-class AI research laboratory, OpenAI says that techies in the AI field are projected to earn a salary compensation ranging between $300 to $500k for fresh graduates. However, expert professionals could earn anywhere up to $1m.
  • Whopping salary package of above 100 million yen that amounts to $1m is being offered to AI geniuses by a Japanese firm, Start Today. A firm that operates a fashion shopping website named Zozotown.

Does this leave you with a question – Is this a right opportunity for you to jump in the field and make hay while the sun is shining? 

And the answer to this question is – yes, it is the right opportunity for any developer seeking a role in the AI industry. It can be your chance to bridge the skill shortage in the AI field either by upskilling or reskilling yourself in the field of AI.

There are a wide varieties of roles available for an AI enthusiast like you. And certain areas are like AI Engineers and AI Researchers are high in demand, as there are not many professionals who have robust AI knowledge.

According to a job report, “The Future of Jobs 2018,” a prediction was made suggesting that machines and algorithms will create around 133 million new job roles by 2022.

AI and machine learning will dominate the tech world. The World Economic Forum says that several sectors have started embracing AI and machine learning to tackle challenges in certain fields such as advertising, supply chain, manufacturing, smart cities, drones, and cybersecurity.

Unraveling the AI realm

From chatbots to financial planners, AI is impacting the way businesses function on a day-today basis. AI makes the work simpler, as it provides variables, which makes the work more streamlined.

Alright! You know that

  • the demand for AI professionals is rising exponentially and that there is just a trickle of supply
  • the AI professionals are demanding skyrocketing salaries

However, beyond that how much more do you know about AI?

Considering the fact that our lives have already been touched by AI (think Alexa, and Siri), it is just a matter of time when AI will become an indispensable part of our lives.

As Gartner predicts that 2020 will be an important year for business growth in AI. Thus, it is possible to witness significant sparks for employment growth. Though AI predicts to diminish 1.8 million jobs, it is also said to replace it with 2.3 million jobs that will be created. As we look forward to stepping into 2020, AI-related job roles are set to make positive progress of achieving 2 million net-new employments by 2025.

With AI promising to score fat paychecks that would reach millions, AI experts are struggling to find new ways to pick up nouveau skills. However, one of the biggest impacts that affect the job market today is the scarcity of talent in this field.

The best way to stay relevant and employable in AI is probably by “reskilling,” and “upskilling.” And  AI certifications is considered ideal for those in the current workforce.

Looking to upskill yourself – here’s how you can become an AI engineer today.

Top three ways to enhance your artificial intelligence career:

  1. Acquire skills in Statistics and Machine Learning: If you’re getting into the field of machine learning, it is crucial that you have in-depth knowledge of statistics. Statistics is considered a prerequisite to the ML field. Both the fields are tightly related. Machine learning models are created to make accurate predictions while statistical models do the job of interpreting the relationship between variables. Many ML techniques heavily rely on the theory obtained through statistics. Thus, having extensive knowledge in statistics help initiate the first step towards an AI career.
  2. Online certification programs in AI skills: Opting for AI certifications will boost your credibility amongst potential employers. Certifications will also enhance your earning potential and increase your marketability. If you’re looking for a change and to be a part of something impactful; join the AI bandwagon. The IT industry is growing at breakneck speed; it is now that businesses are realizing how important it is to hire professionals with certain skillsets. Specifically, those who are certified in AI are becoming sought after in the job market.
  3. Hands-on experience: There’s a vast difference in theory and practical knowledge. One needs to familiarize themselves with the latest tools and technologies used by the industry. This is possible only if the individual is willing to work on projects and build things from scratch.

Despite all the promises, AI does prove to be a threat to job holders, if they don’t upskill or reskill themselves. The upcoming AI revolution will definitely disrupt the way we work, however, it will leave room for humans to perform more creative jobs in the future corporate world.

So a word of advice is to be prepared and stay future ready.

The Data Scientist Job and the Future

A dramatic upswing of data science jobs facilitating the rise of data science professionals to encounter the supply-demand gap.

By 2024, a shortage of 250,000 data scientists is predicted in the United States alone. Data scientists have emerged as one of the hottest careers in the data world today. With digitization on the rise, IoT and cognitive technologies have generated a large number of data sets, thus, making it difficult for an organization to unlock the value of these data.

With the constant rise in data science, those fail to upgrade their skill set may be putting themselves at a competitive disadvantage. No doubt data science is still deemed as one of the best job titles today, but the battles for expert professionals in this field is fierce.

The hiring market for a data science professional has gone into overdrive making the competition even tougher. New online institutions have come up with credible certification programs for professionals to get skilled. Not to forget, organizations are in a hunt to hire candidates with data science and big data analytics skills, as these are the top skills that are going around in the market today. In addition to this, it is also said that typically it takes around 45 days for these job roles to be filled, which is five days longer than the average U.S. market.

Data science

One might come across several definitions for data science, however, a simple definition states that it is an accumulation of data, which is arranged and analyzed in a manner that will have an effect on businesses. According to Google, a data scientist is one who has the ability to analyze and interpret complex data, being able to make use of the statistic of a website and assist in business decision making. Also, one needs to be able to choose and build appropriate algorithms and predictive models that will help analyze data in a viable manner to uncover positive insights from it.

A data scientist job is now a buzzworthy career in the IT industry. It has driven a wider workforce to get skilled in this job role, as most organizations are becoming data-driven. It’s pretty obnoxious being a data professional will widen job opportunities and offer more chances of getting lucrative salary packages today. Similarly, let us look at a few points that define the future of data science to be bright.

  • Data science is still an evolving technology

A career without upskilling often remains redundant. To stay relevant in the industry, it is crucial that professionals get themselves upgraded in the latest technologies. Data science evolves to have an abundance of job opportunities in the coming decade. Since, the supply is low, it is a good call for professionals looking to get skilled in this field.

  • Organizations are still facing a challenge using data that is generated

Research by 2018 Data Security Confidence from Gemalto estimated that 65% of the organizations could not analyze or categorized the data they had stored. However, 89% said they could easily analyze the information prior they have a competitive edge. Being a data science professional, one can help organizations make progress with the data that is being gathered to draw positive insights.

  • In-demand skill-set

Most of the data scientists possess to have the in-demand skill set required by the current industry today. To be specific, since 2013 it is said that there has been a 256% increase in the data science jobs. Skills such as Machine Learning, R and Python programming, Predictive analytics, AI, and Data Visualization are the most common skills that employers seek from the candidates of today.

  • A humongous amount of data growing everyday

There are around 5 billion consumers that interact with the internet on a daily basis, this number is set to increase to 6 billion in 2025, thus, representing three-quarters of the world’s population.

In 2018, 33 zettabytes of data were generated and projected to rise to 133 zettabytes by 2025. The production of data will only keep increasing and data scientists will be the ones standing to guard these enterprises effectively.

  • Advancement in career

According to LinkedIn, data scientist was found to be the most promising career of 2019. The top reason for this job role to be ranked the highest is due to the salary compensation people were being awarded, a range of $130,000. The study also predicts that being a data scientist, there are high chances or earning a promotion giving a career advancement score of 9 out of 10.

Precisely, data science is still a fad job and will not cease until the foreseeable future.

Closing the AI-skills gap with Upskilling

Closing the AI-skills gap with Upskilling

Artificial Intelligent or as it is fancily referred as AI, has garnered huge popularity worldwide.  And given the career prospects it has, it definitely should. Almost everyone interested in technology sector has them rushing towards it, especially young and motivated fresh computer science graduates. Compared to other IT-related jobs AI pays way higher salary and have opportunities. According to a Glassdoor report, Data Scientist, one of the many related jobs, is the number one job with good salary, job openings and more. AI-related jobs include Data Scientists, Analysts, Machine Learning Engineer, NLP experts etc.

AI has found applications in almost every industry and thus it has picked up demand. Home assistants – Siri, Ok Google, Amazon Echo — chatbots, and more some of the popular applications of AI.

Increasing adoption of AI across Industry

The advantages of AI like increased productivity has increased its adoption among companies. According to Gartner, 37 percent of enterprise currently use AI in one way or the other. In fact, in the last four year adoption of AI technologies among companies has increased by 270 percent. In telecommunications, for instance, 52 percent of companies have chatbots deployed for better and smoother customer experience. Now, about 49 percent of businesses are now on their way to alter business models to integrate and adopt AI-driven processes. Further, industry leaders have gone beyond and voiced their concerns about companies that are lagging in AI adoption.

Unfortunately, it has been extremely difficult for employers to find right skilled or qualified candidates for AI-related positions. A reports suggests that there are total 300,000 AI professionals are available worldwide, while there’s demand for millions. In a recent survey conducted by Ernst & Young, 51 percent AI professionals told that lack of talent was the biggest impediment in AI adoption.

Further, O’Reilly, in 2018 conducted a survey, which found the lack of AI skills, among other things, was the major reason that was holding companies back from implementing AI.
The major reason for this is the lack of skills among people who aspire to get into AI-related jobs. According to a report, there demand for millions for jobs in AI. However, only a handful of qualified people are available.

Bridging the skill gap in AI-related jobs

Top companies and government around the world have taken up initiatives to close this gap. Google and Amazon, for instance, have dedicated facilities which trains in AI skills.  Google’s Brain Toronto is a dedicated facility to expand their talent in AI.  Similarly, Amazon has facility near University of Cambridge which is dedicated to AI. Most companies either already have a facility or are in the process of setting up one.

In addition to this, governments around the world are also taking initiatives to address the skill gap. For instance, government across the world are pushing towards AI advancement and are develop collaborative plans which aims at delivering more AI skilled professionals. Recently, the white house launched ai.gov which is further helping to promote AI in the US. The website will offer updates related to AI projects across different sectors.

Other than these, companies have taken this upon themselves to reskills their employees and prepare them for future roles. According to a report from Towards Data Science, about 63 percent of companies have in-house training programs to train employees in AI-related skills.

Overall, though there is demand for AI professionals, lack of skilled talent is a major problem.

Roles in Artificial Intelligence
Artificial Intelligence is the most dominant role for which companies hire across artificial Intelligence. Other than that, following are some of the popular roles:

  1. Machine learning Engineer: These are the people who make machines learn with complex algorithms. On advance level, Machine learning engineers are required to have good knowledge of computer vision. According to Indeed, in the last year, demand for Machine Learning Engineer has grown by 344 percent.
  2. NLP Experts: These experts are equipped with the understanding of making machines computer understand human language. Their expertise includes knowledge of how machines understand human language. Text-to-speech technologies are the common areas which require NLP experts. Demand for engineers who can program computers to understand human speech is growing continuously. It was the fast growing skills in Upwork’s list of in-demand freelancing skills. In Q4, 2016, it had grown 200 percent and since then has been on continuously growing.
  3. Big Data Engineers: This is majorly an analytics role. These gather huge amount of data available from sources and analyze it to derive insights and understand patter, which may be further used for machine learning, prediction modelling, natural language processing. In Mckinsey annual report 2018, it had reported that there was shortage of 190,000 big data professionals in the US alone.

Other roles like Data Scientists, Analysts, and more also in great demand. Then, again due to insufficient talent in the market, companies are struggling to hire for these roles.

Self-learning and upskilling
Artificial Intelligence is a continuously growing field and it has been advancing at a very fast pace, and it makes extremely difficult to keep up with in-demand skills. Hence, it is imperative to keep yourself up with demand of the industry, or it is just a matter of time before one becomes redundant.

On an individual level, learning new skills is necessary. One has to be agile and keep learning, and be ready to adapt new technologies. For this, AI training programs and certifications are ideal.  There are numerous AI programs which individuals can take to further learn new skills. AI certifications can immensely boost career opportunities. Certification programs offer a structured approach to learning which benefits in learning mostly practical and executional skills while keeping fluff away. It is more hands-on. Plus, certifications programs qualify only when one has passed practical test which is very advantageous in tech. AI certifications like AIE (Artificial Intelligence Engineer) are quite popular.

Online learning platforms also offer good a resource to learn artificial intelligence. Most schools haven’t yet adapted their curriculum to skill for AI, while most universities and grad schools are in their way to do so. In the meantime, online learning platforms offer a good way to learn AI skills, where one can start from basic and reach to advance skills.

Business Intelligence Organizations

I am often asked how the Business Intelligence department should be set up and how it should interact and collaborate with other departments. First and foremost: There is no magic recipe here, but every company must find the right organization for itself.

Before we can talk about organization of BI, we need to have a clear definition of roles for team members within a BI department.

A Data Engineer (also Database Developer) uses databases to save structured, semi-structured and unstructured data. He or she is responsible for data cleaning, data availability, data models and also for the database performance. Furthermore, a good Data Engineer has at least basic knowledge about data security and data privacy. A Data Engineer uses SQL and NoSQL-Technologies.

A Data Analyst (also BI Analyst or BI Consultant) uses the data delivered by the Data Engineer to create or adjust data models and implementing business logic in those data models and BI dashboards. He or she needs to understand the needs of the business. This job requires good communication and consulting skills as well as good developing skills in SQL and BI Tools such like MS Power BI, Tableau or Qlik.

A Business Analyst (also Business Data Analyst) is a person form any business department who has basic knowledge in data analysis. He or she has good knowledge in MS Excel and at least basic knowledge in data analysis and BI Tools. A Business Analyst will not create data models in databases but uses existing data models to create dashboards or to adjust existing data analysis applications. Good Business Analyst have SQL Skills.

A Data Scientist is a Data Analyst with extended skills in statistics and machine learning. He or she can use very specific tools and analytical methods for finding pattern in unknow or big data (Data Mining) or to predict events based on pattern calculated by using historized data (Predictive Analytics). Data Scientists work mostly with Python or R programming.

Organization Type 1 – Central Approach (Data Lab)

The first type of organization is the data lab approach. This organization form is easy to manage because it’s focused and therefore clear in terms of budgeting. The data delivery is done centrally by experts and their method and technology knowledge. Consequently, the quality expectation of data delivery and data analysis as well as the whole development process is highest here. Also the data governance is simple and the responsibilities clearly adjustable. Not to be underestimated is the aspect of recruiting, because new employees and qualified applicants like to join a central team of experts.

However, this form of organization requires that the company has the right working attitude, especially in the business intelligence department. A centralized business intelligence department acts as a shared service. Accordingly, customer-oriented thinking becomes a prerequisite for the company’s success – and customers here are the other departments that need access to the capacities of those centralized data experts. Communication boundaries must be overcome and ways of simple and effective communication must be found.

Organization Type 2 – Stakeholder Focus Approach

Other companies want to shift more responsibility for data governance, and especially data use and analytics, to those departments where data plays a key role right now. A central business intelligence department manages its own projects, which have a meaning for the entire company. The specialist departments, which have a special need for data analysis, have their own data experts who carry out critical projects for the specialist department. The central Business Intelligence department does not only provide the technical delivery of data, but also through methodical consulting. Although most of the responsibility lies with the Business Intelligence department, some other data-focused departments are at least co-responsible.

The advantage is obvious: There are special data experts who work deeper in the actual departments and feel more connected and responsible to them. The technical-business focus lies on pain points of the company.

However, this form of Ogranization also has decisive disadvantages: The danger of developing isolated solutions that are so special in some specific areas that they will not really work company-wide increases. Typically the company has to deal with asymmetrical growth of data analytics
know-how. Managing data governance is more complex and recruitment is becoming more difficult as the business intelligence department is weakened and smaller, and data professionals for other departments need to have more business focus, which means they are looking for more specialized profiles.

Organization Type 3 – Decentral Approach

Some companies are also taking a more extreme approach in the other direction. The Business Intelligence department now has only Data Engineers building and maintaining the data warehouse or data lake. As a result, the central department only provides data; it is used and analyzed in all other departments, specifically for the respective applications.

The advantage lies in the personal responsibility of the respective departments as „pain points“ of the company are in focus in belief that business departments know their problems and solutions better than any other department does. Highly specialized data experts can understand colleagues of their own department well and there is no no shared service mindset neccessary, except for the data delivery.

Of course, this organizational form has clear disadvantages since many isolated solutions are unavoidable and the development process of each data-driven solution will be inefficient. These insular solutions may work with luck for your own department, but not for the whole company. There is no one single source of truth. The recruiting process is more difficult as it requires more specialized data experts with more business background. We have to expect an asymmetrical growth of data analytics know-how and a difficult data governance.

 

The 6 most in-demand AI jobs and how to get them

A press release issued in December 2017 by Gartner, Inc explicitly states, 2020 will be a pivotal year in Artificial Intelligence-related employment dynamics. It states AI will become “a positive job motivator”.

However, the Gartner report also sounds some alarm bells. “The number of jobs affected by AI will vary by industry-through 2019, healthcare, the public sector and education will see continuously growing job demand while manufacturing will be hit the hardest. Starting in 2020, AI-related job creation will cross into positive territory, reaching two million net-new jobs in 2025,” the press release adds.

This phenomenon is expected to strike worldwide, as a report carried by a leading Indian financial daily, The Hindu BusinessLine states. “The year 2018 will see a sharp increase in demand for professionals with skills in emerging technologies such as Artificial Intelligence (AI) and machine learning, even as people with capabilities in Big Data and Analytics will continue to be the most sought after by companies across sectors, say sources in the recruitment industry,” this news article says.

Before we proceed, let us understand what exactly does Artificial Intelligence or AI mean.

Understanding Artificial Intelligence

Encyclopedia Britannica explains AI as: “The ability of a digital computer or computer-controlled robot to perform tasks commonly associated with human beings.” Classic examples of AI are computer games that can be played solo on a computer. Of these, one can be a human while the other is the reasoning, analytical and other intellectual property a computer. Chess is one example of such a game. While playing Chess with a computer, AI will analyze your moves. It will predict and reason why you made them and respond accordingly.

Similarly, AI imitates functions of the human brain to a very great extent. Of course, AI can never match the prowess of humans but it can come fairly close.

What this means?

This means that AI technology will advance exponentially. The main objective for developing AI will not aim at reducing dependence on humans that can result in loss of jobs or mass retrenchment of employees. Having a large population of unemployed people is harmful to economy of any country. Secondly, people without money will not be able to utilize most functions that are performed through AI, which will render the technology useless.

The advent and growing popularity of AI can be summarized in words of Bill Gates. According to the founder of Microsoft, AI will have a positive impact on people’s lives. In an interview with Fox Business, he said, people would have more spare time that would eventually lead to happier life. However he cautions, it would be long before AI starts making any significant impact on our daily activities and jobs.

Career in AI

Since AI primarily aims at making human life better, several companies are testing the technology. Global online retailer Amazon is one amongst these. Banks and financial institutions, service providers and several other industries are expected to jump on the AI bandwagon in 2018 and coming years. Hence, this is the right time to aim for a career in AI. Currently, there exists a great demand for AI professionals. Here, we look at the top six employment opportunities in Artificial Intelligence.

Computer Vision Research Engineer

 A Computer Vision Research Engineer’s work includes research and analysis, developing software and tools, and computer vision technologies. The primary role of this job is to ensure customer experience that equals human interaction.

Business Intelligence Engineer

As the job designation implies, the role of a Business Intelligence Engineer is to gather data from multiple functions performed by AI such as marketing and collecting payments. It also involves studying consumer patterns and bridging gaps that AI leaves.

Data Scientist

A posting for Data Scientist on recruitment website Indeed describes Data Scientist in these words: “ A mixture between a statistician, scientist, machine learning expert and engineer: someone who has the passion for building and improving Internet-scale products informed by data. The ideal candidate understands human behavior and knows what to look for in the data.

Research and Development Engineer (AI)

Research & Development Engineers are needed to find ways and means to improve functions performed through Artificial Intelligence. They research voice and text chat conversations conducted by bots or robotic intelligence with real-life persons to ensure there are no glitches. They also develop better solutions to eliminate the gap between human and AI interactions.

Machine Learning Specialist

The job of a Machine Learning Specialist is rather complex. They are required to study patterns such as the large-scale use of data, uploads, common words used in any language and how it can be incorporated into AI functions as well as analyzing and improving existing techniques.

Researchers

Researchers in AI is perhaps the best-paid lot. They are required to research into various aspects of AI in any organization. Their role involves researching usage patterns, AI responses, data analysis, data mining and research, linguistic differences based on demographics and almost every human function that AI is expected to perform.

As with any other field, there are several other designations available in AI. However, these will depend upon your geographic location. The best way to find the demand for any AI job is to look for good recruitment or job posting sites, especially those specific to your region.

In conclusion

Since AI is a technology that is gathering momentum, it will be some years before there is a flood of people who can be hired as fresher or expert in this field. Consequently, the demand for AI professionals is rather high. Median salaries these jobs mentioned above range between US$ 100,000 to US$ 150,000 per year.

However, before leaping into AI, it is advisable to find out what other qualifications are required by employers. As with any job, some companies need AI experts that hold specific engineering degrees combined with additional qualifications in IT and a certificate that states you hold the required AI training. Despite, this is the best time to make a career in the AI sector.

Data Science vs Data Engineering

The job of the Data Scientist is actually a fairly new trend, and yet other job titles are coming to us. “Is this really necessary?”, Some will ask. But the answer is clear: yes!

There are situations, every Data Scientist know: a recruiter calls, speaks about a great new challenge for a Data Scientist as you obviously claim on your LinkedIn profile, but in the discussion of the vacancy it quickly becomes clear that you have almost none of the required skills. This mismatch is mainly due to the fact that under the job of the Data Scientist all possible activity profiles, method and tool knowledge are summarized, which a single person can hardly learn in his life. Many open jobs, which are to be called under the name Data Science, describe rather the professional image of the Data Engineer.


Read this article in German:
“Data Science vs Data Engineering – Wo liegen die Unterschiede?“


What is a Data Engineer?

Data engineering is primarily about collecting or generating data, storing, historicalizing, processing, adapting and submitting data to subsequent instances. A Data Engineer, often also named as Big Data Engineer or Big Data Architect, models scalable database and data flow architectures, develops and improves the IT infrastructure on the hardware and software side, deals with topics such as IT Security , Data Security and Data Protection. A Data Engineer is, as required, a partial administrator of the IT systems and also a software developer, since he or she extends the software landscape with his own components. In addition to the tasks in the field of ETL / Data Warehousing, he also carries out analyzes, for example, to investigate data quality or user access. A Data Engineer mainly works with databases and data warehousing tools.

A Data Engineer is talented as an educated engineer or computer scientist and rather far away from the actual core business of the company. The Data Engineer’s career stages are usually something like:

  1. (Big) Data Architect
  2. BI Architect
  3. Senior Data Engineer
  4. Data Engineer

What makes a Data Scientist?

Although there may be many intersections with the Data Engineer’s field of activity, the Data Scientist can be distinguished by using his working time as much as possible to analyze the available data in an exploratory and targeted manner, to visualize the analysis results and to convert them into a red thread (storytelling). Unlike the Data Engineer, a data scientist rarely sees into a data center, because he picks up data via interfaces provided by the Data Engineer or provides by other resources.

A Data Scientist deals with mathematical models, works mainly with statistical procedures, and applies them to the data to generate knowledge. Common methods of Data Mining, Machine Learning and Predictive Modeling should be known to a Data Scientist. Data Scientists basically work close to the department and need appropriate expertise. Data Scientists use proprietary tools (e.g. Tools by IBM, SAS or Qlik) and program their own analyzes, for example, in Scala, Java, Python, Julia, or R. Using such programming languages and data science libraries (e.g. Mahout, MLlib, Scikit-Learn or TensorFlow) is often considered as advanced data science.

Data Scientists can have diverse academic backgrounds, some are computer scientists or engineers for electrical engineering, others are physicists or mathematicians, not a few have economical backgrounds. Common career levels could be:

  1. Chief Data Scientist
  2. Senior Data Scientist
  3. Data Scientist
  4. Data Analyst oder Junior Data Scientist

Data Scientist vs Data Analyst

I am often asked what the difference between a Data Scientist and a Data Analyst would be, or whether there would be a distinction criterion at all:

In my experience, the term Data Scientist stands for the new challenges for the classical concept of Data Analysts. A Data Analyst performs data analysis like a Data Scientist. More complex topics such as predictive analytics, machine learning or artificial intelligence are topics for a Data Scientist. In other words, a Data Scientist is a Data Analyst++ (one step above the Data Analyst).

And how about being a Business Analyst?

Business Analysts can (but need not) be Data Analysts. In any case, they have a very strong relationship with the core business of the company. Business Analytics is about analyzing business models and business successes. The analysis of business success is usually carried out by IT, and many business analysts are starting a career as Data Analyst now. Dashboards, KPIs and SQL are the tools of a good business analyst, but there might be a lot business analysts, who are just analysing business models by reading the newspaper…

Data Science Knowledge Stack – Abstraction of the Data Science Skillset

What must a Data Scientist be able to do? Which skills does as Data Scientist need to have? This question has often been asked and frequently answered by several Data Science Experts. In fact, it is now quite clear what kind of problems a Data Scientist should be able to solve and which skills are necessary for that. I would like to try to bring this consensus into a visual graph: a layer model, similar to the OSI layer model (which any data scientist should know too, by the way).
I’m giving introductory seminars in Data Science for merchants and engineers and in those seminars I always start explaining what we need to work out together in theory and practice-oriented exercises. Against this background, I came up with the idea for this layer model. Because with my seminars the problem already starts: I am giving seminars for Data Science for Business Analytics with Python. So not for medical analyzes and not with R or Julia. So I do not give a general knowledge of Data Science, but a very specific direction.

A Data Scientist must deal with problems at different levels in any Data Science project, for example, the data access does not work as planned or the data has a different structure than expected. A Data Scientist can spend hours debating its own source code or learning the ropes of new DataScience packages for its chosen programming language. Also, the right algorithms for data evaluation must be selected, properly parameterized and tested, sometimes it turns out that the selected methods were not the optimal ones. Ultimately, we are not doing Data Science all day for fun, but for generating value for a department and a data scientist is also faced with special challenges at this level, at least a basic knowledge of the expertise of that department is a must have.


Read this article in German:
“Data Science Knowledge Stack – Was ein Data Scientist können muss“


Data Science Knowledge Stack

With the Data Science Knowledge Stack, I would like to provide a structured insight into the tasks and challenges a Data Scientist has to face. The layers of the stack also represent a bidirectional flow from top to bottom and from bottom to top, because Data Science as a discipline is also bidirectional: we try to answer questions with data, or we look at the potentials in the data to answer previously unsolicited questions.

The DataScience Knowledge Stack consists of six layers:

Database Technology Knowledge

A Data Scientist works with data which is rarely directly structured in a CSV file, but usually in one or more databases that are subject to their own rules. In particular, business data, for example from the ERP or CRM system, are available in relational databases, often from Microsoft, Oracle, SAP or an open source alternative. A good Data Scientist is not only familiar with Structured Query Language (SQL), but is also aware of the importance of relational linked data models, so he also knows the principle of data table normalization.

Other types of databases, so-called NoSQL databases (Not only SQL) are based on file formats, column or graph orientation, such as MongoDB, Cassandra or GraphDB. Some of these databases use their own programming languages ​​(for example JavaScript at MongoDB or the graph-oriented database Neo4J has its own language called Cypher). Some of these databases provide alternative access via SQL (such as Hive for Hadoop).

A data scientist has to cope with different database systems and has to master at least SQL – the quasi-standard for data processing.

Data Access & Transformation Knowledge

If data are given in a database, Data Scientists can perform simple (and not so simple) analyzes directly on the database. But how do we get the data into our special analysis tools? To do this, a Data Scientist must know how to export data from the database. For one-time actions, an export can be a CSV file, but which separators and text qualifiers should be used? Possibly, the export is too large, so the file must be split.
If there is a direct and synchronous data connection between the analysis tool and the database, interfaces like REST, ODBC or JDBC come into play. Sometimes a socket connection must also be established and the principle of a client-server architecture should be known. Synchronous and asynchronous encryption methods should also be familiar to a Data Scientist, as confidential data are often used, and a minimum level of security is most important for business applications.

Many datasets are not structured in a database but are so-called unstructured or semi-structured data from documents or from Internet sources. And again we have interfaces, a frequent entry point for Data Scientists is, for example, the Twitter API. Sometimes we want to stream data in near real-time, let it be machine data or social media messages. This can be quite demanding, so the data streaming is almost a discipline with which a Data Scientist can come into contact quickly.

Programming Language Knowledge

Programming languages ​​are tools for Data Scientists to process data and automate processing. Data Scientists are usually no real software developers and they do not have to worry about software security or economy. However, a certain basic knowledge about software architectures often helps because some Data Science programs can be going to be integrated into an IT landscape of the company. The understanding of object-oriented programming and the good knowledge of the syntax of the selected programming languages ​​are essential, especially since not every programming language is the most useful for all projects.

At the level of the programming language, there is already a lot of snares in the programming language that are based on the programming language itself, as each has its own faults and details determine whether an analysis is done correctly or incorrectly: for example, whether data objects are copied or linked as reference, or how NULL/NaN values ​​are treated.

Data Science Tool & Library Knowledge

Once a data scientist has loaded the data into his favorite tool, for example, one of IBM, SAS or an open source alternative such as Octave, the core work just began. However, these tools are not self-explanatory and therefore there is a wide range of certification options for various Data Science tools. Many (if not most) Data Scientists work mostly directly with a programming language, but this alone is not enough to effectively perform statistical data analysis or machine learning: We use Data Science libraries (packages) that provide data structures and methods as a groundwork and thus extend the programming language to a real Data Science toolset. Such a library, for example Scikit-Learn for Python, is a collection of methods implemented in the programming language. The use of such libraries, however, is intended to be learned and therefore requires familiarization and practical experience for reliable application.

When it comes to Big Data Analytics, the analysis of particularly large data, we enter the field of Distributed Computing. Tools (frameworks) such as Apache Hadoop, Apache Spark or Apache Flink allows us to process and analyze data in parallel on multiple servers. These tools also provide their own libraries for machine learning, such as Mahout, MLlib and FlinkML.

Data Science Method Knowledge

A Data Scientist is not simply an operator of tools, he uses the tools to apply his analysis methods to data he has selected for to reach the project targets. These analysis methods are, for example, descriptive statistics, estimation methods or hypothesis tests. Somewhat more mathematical are methods of machine learning for data mining, such as clustering or dimensional reduction, or more toward automated decision making through classification or regression.

Machine learning methods generally do not work immediately, they have to be improved using optimization methods like the gradient method. A Data Scientist must be able to detect under- and overfitting, and he must prove that the prediction results for the planned deployment are accurate enough.

Special applications require special knowledge, which applies, for example, to the fields of image recognition (Visual Computing) or the processing of human language (Natural Language Processiong). At this point, we open the door to deep learning.

Expertise

Data Science is not an end in itself, but a discipline that would like to answer questions from other expertise fields with data. For this reason, Data Science is very diverse. Business economists need data scientists to analyze financial transactions, for example, to identify fraud scenarios or to better understand customer needs, or to optimize supply chains. Natural scientists such as geologists, biologists or experimental physicists also use Data Science to make their observations with the aim of gaining knowledge. Engineers want to better understand the situation and relationships between machinery or vehicles, and medical professionals are interested in better diagnostics and medication for their patients.

In order to support a specific department with his / her knowledge of data, tools and analysis methods, every data scientist needs a minimum of the appropriate skills. Anyone who wants to make analyzes for buyers, engineers, natural scientists, physicians, lawyers or other interested parties must also be able to understand the people’s profession.

Engere Data Science Definition

While the Data Science pioneers have long established and highly specialized teams, smaller companies are still looking for the Data Science Allrounder, which can take over the full range of tasks from the access to the database to the implementation of the analytical application. However, companies with specialized data experts have long since distinguished Data Scientists, Data Engineers and Business Analysts. Therefore, the definition of Data Science and the delineation of the abilities that a data scientist should have, varies between a broader and a more narrow demarcation.


A closer look at the more narrow definition shows, that a Data Engineer takes over the data allocation, the Data Scientist loads it into his tools and runs the data analysis together with the colleagues from the department. According to this, a Data Scientist would need no knowledge of databases or APIs, neither an expertise would be necessary …

In my experience, DataScience is not that narrow, the task spectrum covers more than just the core area. This misunderstanding comes from Data Science courses and – for me – I should point to the overall picture of Data Science again and again. In courses and seminars, which want to teach Data Science as a discipline, the focus will of course be on the core area: programming, tools and methods from mathematics & statistics.

Is Data Science the new Statistics?

Table of Contents

1 Introduction

2 Emerging of Data Science

3 Big data technologies

4 Two data worlds: Predictive vs inferential statistics

5 How to study data science

6 Conclusions

7 References

Introduction

As a student of Statistics and the winner of Data Science Scholarship I am often surrounded by computer scientists, mathematicians, physicists and of course statisticians. During conversation, I was asked questions such as “So what actually do I do? What is Data Science?”. These are some very difficult questions and as like you will see during reading this document many before me tried to answer those questions. There is a dispute between statisticians and computer scientists what is the origin of data science and who should teach it. According to the Institute of Mathematical Statistics in the: “The IMS presidential address: let us own data science” we can find a simple recipe for data scientist. [1]

“Putting the traits of Turner and Carver together gives a good portrait of a data scientist:

  • Statistics (S)
  • Domain/Science knowledge (D)
  • Computing (C)
  • Collaboration/teamwork (C)
  • Communication to outsiders (C)

That is, data science = SDCCC = S DC3

However, despite all the challenges that I will need to overcome in answering those questions I will try to do it. I will refer to ideas from several reputable sources, in which I will also tell you: what is in the data science that I am really fascinated about? What is magical in this creation of statistics and computer science that I am drawn to?

Emerging of Data Science

On Tuesday, the 8th of September 2015, University of Michigan announced the 100 million dollars “Data Science Initiative” (DSI), hired 35 new faculty members. On the DSI website we can read about this initiative:

“This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualisation, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational and interdisciplinary applications”2

But that sounds like a bread and butter for statisticians. So, is it really a new creation or is it something that exists for many years but it didn’t sound so sexy as data science? In the article written by Karl Broman, (the University of Wisconsin) we can read:

“When physicists do mathematics, they’re don’t say they’re doing “number science”. They’re doing math. If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it ‘s still statistics. If you say that one kind of data analysis is statistics and another kind is not, you’re not allowing innovation. We need to define the field broadly. You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.

Reading the definition of data science on the Data Science Association’s “Professional Code of Conduct”:

“Data scientist means a professional who uses scientific methods to liberate and create meaning from raw data”

These sound like K. Browman maybe right. Maybe I should go on MSc Statistics like many before me did. Maybe Data Science is simply a new sexy name for statistician only data is big, technology more advanced rather than it used to be so you need to have programming skills to handle the data. Maybe let say loudly data science is a modern version of statistics? But maybe not? Because we can also find statements like the following:

“Statistics is the least important part of data science”. [3]

Further, we can read:

“There ‘s so, much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. . . .”.[3]

So maybe people from computer science are right. Maybe I should go and study programming and forget about expanding my knowledge in statistics? After all, we all know that computer science always had much bigger funding and having MSc computer science was always like a magic star for employers. What should I do? Let me research further.

Big data technologies

Is the data size important to distinguish between data science and statistics? Going back to the “Let us own data science” article we can read that a statistician, Hollerith, invented the punched card reader to allow e cient compilation of a US census, the first elements of machine learning. So, no, machine learning is not an invention of computer scientists. It was well known for statistician for decades already. What about different techniques used in DOE (Design of Experiments) or sampling methods to decrease the sample size. If the data used by statisticians would be only small they wouldn’t have to discover methods such PCA (Principle component analysis) or dimensionality reduction techniques. So, no, data can be big and/or small for statisticians, so what is the difference between data science and statistics and what department should I choose?

When I spoke to computer scientists they try to convince me to choose computer science department. Their reasons being that there are many different programmes that I need to know to deal with large datasets. For instance: Java, Hadoop, SQL, Python, and much more. Moreover, programming can only be taught to the best standard through computer science courses Is it true? Can’t we do the same calculations using statistical software such as R, SAS or even Matlab? But on the other hand, doesn’t the newest technology always work faster? And if so, wouldn’t be better to use the newest technology when we program and write loops?

But, I don’t want to underestimate the effort made by statisticians and data analyst over last 50 years in developing statistical programmes. Their efforts have resulted in the emergence of today’s technology. Early statistical packages such as SPSS or Minitab (from 1960’s) allowed to develop more advanced programmes having roots in mini computer era such as STATA or my favourite R which in turn allowed progress to advanced technology even further and create Python, Hadoop, SQL and so on. Becker and Chambers (with S) and later Ihaka, Gentleman, and members of the R Core team (with R) worked on developing the statistical software. These names should be convincing about how powerful statistical programming languages can be. Many operations that we can do in Hadoop or SQL we can also do easily in R.

Two data worlds: Predictive vs inferential statistics

So maybe Data Science is a creature merged by statisticians working on computer science department? Maybe there are two different approaches to statistics: mathematical statistics and computer science statistics and the computer science statisticians are data scientists because according to Yanir Seroussi in his blog:

“A successful data scientist needs to be able to “become one with the data” by exploring it and applying rigorous statistical analysis (right-hand side of the continuum). But good data scientists also understand what it takes to deploy production systems, and are ready to get their hands dirty by writing code that cleans up the data or performs core system functionality (lefthand side of the continuum). Gaining all these skills takes time.”[4]

Okay, so my reasoning that some statisticians work on computer science department is right, as well as there exists subject like computational statistics, so maybe I should go for computer science department but study statistics.

In fact, I am not the first one to arrive at the conclusion. Everything started from a confession made by John Tukey in “The Future of Data Analysis” article published in “The Annals of Mathematical Statistics” :

For a long time, I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. … All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data

If I am right then above confession was a critical moment. The time when mathematical statistics become more inferential and computational statistics concentrated more on predictive statistics. Applied statisticians working on predictive analytics that are more interested in applying the knowledge rather than developing long proofs decided to move on computer science department.

Additionally, the following is crucial discussion made by Leo Biermann in his paper published in Statistical Science titled “Statistical modelling: the two cultures”. It enables us to understand and differentiate views from both types of statistician, namely mathematical and statistical.

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables … There are two goals in analyzing the data:

  • Prediction. To be able to predict what the responses are going to be to future input variables
  • InferenceTo [infer] how nature is associating the response variables to the input variables.”

Furthermore, in the same dispute we can read:

“The statistical community has been committed to the almost exclusive use of [generative] models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. [Predictive] modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on [generative] models …”

So, we can say that Data Science evolved from Predictive Analytics which in turn evolved from Statistics but it becomes separate science. Tukey and Wilk 1969 compared this new science to established sciences and further circumscribed the role of Statistics within it:

“ … data analysis is a very di cult field. It must adapt itself to what people can and need to do with data. In the sense that biology is more complex than physics, and the behavioural sciences are more complex than either, it is likely that the general problems of data analysis are more complex than those of all three. It is too much to ask for close and effective guidance for data analysis from any highly formalized structure, either now or in the near future. Data analysis can gain much from formal statistics, but only if the connection is kept adequately loose”

How to study data science

So, what is exactly predictive analytics culture? I think that everyone who used Kaggle competition before can agree with me that description of common task framework (CTF) formulated by Marc Liberman in 2009 is a perfect description of Kaggle competitions, and hackathons events; where latter has worked as training sessions for newbies in the data world. An instance of the CTF has these ingredients:

  1. A publicly available training data set involving, for each observation, a list of (possibly many) feature measurements, and a class label for that observation.
  2. A set of enrolled competitors whose common task is to infer a class prediction rule from the training data.
  3. A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule

Kaggle competitions are not only training platforms for newbies like me but also very challenging statistical competitions where experienced statisticians can win “pocket money”. A famous example is the Netflix Challenge where the common task was to predict Netflix user movie selection. The winning team (which included ATT Statistician Bob Bell) won 1 mln dollars.

Comparing modules that are available on master in data science at University of Berkley[6]:

  1. Both
  • Applied machine learning
  • Experiments and causality
  1. Statistics
  • Research design and application for data and analysis
  • Statistics for Data Science
  • Behind the data: humans and values
  • Statistical methods for discrete response, Time Series and panel data
  • Data visualisation
  1. Computer Science
  • Python for Data Science
  • Storing and Retrieving Data
  • Scalling up! Really Big Data
  • Machine Learning at scale
  • Natural Language Processing with Deep Learning

We can really see that data science is a subject that demands skills from both computer science and statistics. So, it is another confirmation for me that it is the best time to change department for my postgraduate study, that is, to study statistics on computer science department.

In the 50 Years of Data Science article we can read: “The activities of Greater Data Science are classified into 6 divisions:

  1. Data exploration and preparation
  2. Data representation and transformation
  3. Computing with data
  4. Data visualization and presentation
  5. Data Modelling
  6. Science about data science [5]

I will quickly go through all of them using my Ebola research example, this required using machine learning on time series data.

  1. The most demanding part. Many people told me before starting this project that: collecting, cleaning, wrangling and preparing data take 60% of all the time that you need to spend on data science project. I didn’t realise how much this 60% means in real time. I didn ‘t realise that the 60 percent will take so much time and that after this I will be exhausted. Exhausted but ready for the next step.
  2. This point is actually part of the first one, or maybe just like many other things in statistics: everything is one huge connected bunch.Data that you can find can be very nice, well behaving, written in CSV or JSON or any other format file that you can quickly download and use, but what if not? What if your data is ‘dirty’and not stored as a file (e.g. only appear on a website)? What if data is coded? Do you need to decode it?
  3. The even bigger challenge, but what a fun? You need to know a few different programming languages or least as I do know a little bit of R, a little bit of Python, quite well Tableau and Excel. So you can use different program in different scenarios or for different tasks. For example, using Panda to do EDA and ggplot 2 to do data vis.
  4. Graphs are pretty, right? If you are still reading my article, I bet you know what is heat map, spatial vis in big cities or different infographics. Surely, I would like to highlight, that we respect only the ones that are not only pretty but also valid. Nevertheless, time that is required to create these visualisations is another matter.
  5. The data modelling, finally? I don’t need to say a lot about this. All forms of inferential and predictive analytic are allowed and accepted.
  6. My favourite part, not the end yet. All the conferences and meetups that I can attend on. All the seminars where we all present our current projects.

Conclusions

After graduation, I will be graduated Statistician. Even more, I will be a mathematical statistician whom mostly during degree dealt with inferential statistics. On the other hand, winning data science scholarship gave me exposure to predictive analytic which I highly enjoyed. Therefore, for my next stage, I will just change my department and concentrate more on predictive analytic. There are many statisticians working on computer science department. They possess both statistical knowledge and advanced software engineering skills, they are called data scientists. It would be a pleasure for me to join them. I don’t mind if it will be MSc. Computer Science, MSc. Data Science, MSc. Big Data or whatever the name will be. I do mind to have sufficient exposure to deal with “dirty” data using statistical modelling and machine learning using modern technology. This is what data science is for me. Maybe for you, it will be something else. Maybe you will be more satisfied with expanding massively programming skills. But for me, programming is a tool, modern technology is my friend and my bread and butter will be predictive analytic.

References

  1. IMS Presidential Address: Let us own data science
  2. Data science is statistics
  3. A Gelman, Columbia University
  4. Yanir Seroussi: What is data Science?
  5. 50 Years Data Science
  6. Curriculum: data science@Berkley