Data Science on a large scale – can it be done?
Analytics drives business
In today’s digital world, data has become a crucial success factor for businesses seeking to maintain a competitive advantage, and there are numerous examples of companies that have found smart ways of monetizing their data and deriving value from it.
On the one hand, many companies use data analytics to streamline production lines, optimize marketing channels, minimize logistics costs and improve customer retention rates. These use cases are often described under the umbrella term of operational BI, where decisions are based on data to improve a company’s internal operations, whether that be a company in the manufacturing industry or an e-commerce platform.
On the other hand, over the last few years a whole range of new service-oriented companies have popped up whose revenue models depend entirely on data analytics. These Data-Driven Businesses have contributed significantly to the ongoing development of new technologies for processing and analyzing large amounts of data to find the right insights. The better they leverage these technologies, the greater the value they add and the more successful their business. Indeed, without data and data analytics, they have no business at all.
Data Science – hype or has it always been around?
In my opinion, there is too much buzz around the new era of data scientists. Ten years ago, people simply called it data mining, and the term described similar skills and methods. What has actually changed is not the statistical methodology but the data sources: businesses are now confronted with new types of data from mobile devices and data-driven applications. I described that idea in detail in my recent post Let’s replace the Vs of Big Data with a single D.
But, of course, you cannot deny that the importance of these data crunchers has increased significantly. The art of mining data mountains (or perhaps I should say “diving through data lakes”) for the right insights and models, and thereby answering urgent, business-critical questions, has become very popular these days.
The challenge: Data Science with large volumes?
Michael Stonebraker, winner of the Turing Award 2014, has been quoted as saying: “The change will come when business analysts who work with SQL on large amounts of data give way to data scientists, which will involve more sophisticated analysis, predictive modeling, regressions and Bayesian classification. That stuff at scale doesn’t work well on anyone’s engine right now. If you want to do complex analytics on big data, you have a big problem right now.”
And if you look at the limitations of the existing statistical environments built on R, Python, Java, Julia and other languages, I think he is absolutely right. Once data scientists have to handle larger volumes, the tools are simply not powerful and scalable enough. The data must then be sampled or aggregated before statistical algorithms can be applied at all.
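To make the workaround concrete: when a data set no longer fits in the memory of a single-node tool, a common fallback is to stream over it once and keep only a uniform random sample, then run the statistics on that sample. Here is a minimal sketch in plain Python using reservoir sampling; the function names and sizes are illustrative, not tied to any particular tool:

```python
import random
import statistics

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown size."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing element with decreasing probability k/(i+1).
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

# Pretend this generator is a data set too large to load at once.
large_stream = (x * 0.5 for x in range(1_000_000))

sample = reservoir_sample(large_stream, k=10_000)
print(len(sample))                     # 10000
print(statistics.mean(sample))         # close to the true mean of ~250000
```

The price, of course, is exactly the one described above: every estimate carries sampling error, and any analysis that needs the raw detail (outliers, rare segments) is lost.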
A new architecture for “Big Data Science”
We at EXASOL have worked hard to develop a smart answer to this challenge. Imagine being able to run intelligent statistical models on the raw data of very large data sets, directly where the data is stored: processed in-memory for optimal performance, distributed across a powerful MPP cluster of servers, in an environment where you can “install” the programming language of your choice.
Sounds far-fetched? If you are not convinced, I highly recommend you take a look at our brand-new in-database analytic programming platform, which is deeply integrated into our parallel in-memory engine and extensible with nearly any programming language and statistical library.
For further information on our approach to big data science, go ahead and download a copy of our technical whitepaper: Big Data Science – The future of analytics.