Posts

Interview – Python as productive data science environment

Miroslav Šedivý is a Senior Software Architect at UBIMET GmbH, using Python to make the sun shine and the wind blow. He is an enthusiast of both human and programming languages and found Python as his language of choice to setup very productive environments. Mr. Šedivý was born in Czechoslovakia, studied in France and is now living in Germany. Furthermore, he helps in the organization of the events PyCon.DE and Polyglot Gathering.


On 26th June 2018 he will explain at the Python@DWX conference why “Lifelong Text Hackers Use Vim and Python”. Insert the promotion code PY18science to unlock your 10% discount on all tickets. More info and tickets on python-con.com.


Data Science Blog: Mr. Šedivý, how did you find the way to Python as your favorite programming language?

Apart from traditional languages taught at school (Basic, Pascal, C, Java), some twenty years ago I learned Perl to hack a dynamic web site and used it to automate my daily tasks. Later I used it professionally for scientific calculations in the production. This was later replaced by Python, its newer versions and more advanced libraries. Nowadays Python has almost completely replaced Perl as my principal language and I use Perl just to hack some command line filters and to impress colleagues.

Data Science Blog: Python is one of the most popular programming language for data scientists. This is remarkable as it is originally not designed for doing data science with it. What made it a competitor to languages like R or Julia?

Python is the most powerful programming language that is still legible. This appeals to data scientists who can enter each line interactively, and immediately see what happens, because each line actually does something. They can inspect their data easily and build automating systems to process their data transparently.

Data Science Blog: Is there anything you could do better with another programming language?

Sometimes I’m playing with some functional languages that would allow me to write code that is easier to test and parallelize.

Data Science Blog: Which libraries are the most important ones for your daily business?

The whole Pandas ecosystem with Numpy and Scipy. Matplotlib for plots, PyTables and Psycopg2 for storage. I’m also importing a few async libs for webservices and similar network-based software.

I also enjoy discovering the world of Unicode and Timezones – both of them are the spots where the programmers absolutely have to obey the chaotic reality of the outside world.

Data Science Blog: Which editor do you use? And how to set it up as a productive environment?

I tried several editors and IDEs, but always came back to Vi or Vim. This is an extremely powerful editor that is around since over forty years, which was probably before most of today’s active developers learned to type. I’m using it for all text editing tasks, which I’m actually going to show in my talk at DWX [Lifelong Text Hackers Use Vim and Python]. Steep learning curve is not an argument against a tool you can grok during your entire career.

Data Science Blog: In your opinion: For all developers and data scientists, who are used to Java, Scala, R oder Perl, is Python easy to learn? Could it be too late to switch for somebody?

Python is a great general language that can be learned rapidly to a usable level. It’s different from the aforementioned languages. I remember my switching process from Perl to Python over ten years ago with a book “Perl to Python Migration”, which forced me to switch my way of thinking. From the question “Why do I have to import ‘re’ for regular expressions if Perl uses them natively?” to “Actually, I can solve this problem without regular expressions.”.

New Sponsor: Snowflake

Dear readers,

we have good news again: Now we welcome snowflake as our new Data Science Blog Sponsor! So we are booked out for the moment regarding sponsoring. Snowflake provides data warehousing for the cloud and has an unique data, access and feature model, the snowflake. Now we are looking forward to editorial contributions by snowflake.

Snowflake is the only data warehouse built for the cloud. Snowflake delivers the performance, concurrency and simplicity needed to store and analyze all data available to an organization in one location. Snowflake’s technology combines the power of data warehousing, the flexibility of big data platforms, the elasticity of the cloud, and live data sharing at a fraction of the cost of traditional solutions. Snowflake: Your data, no limits. Find out more at snowflake.net.

Furthermore, snowflake will also sponsor our Data Leader Days 2018 in November in Berlin!

Show your Data Science Workplace!

The job of a data scientist is often a mystery to outsiders. Of course, you do not really need much more than a medium-sized notebook to use data science methods for finding value in data. Nevertheless, data science workplaces can look so different and, let’s say, interesting. And that’s why I want to launch a blog parade – which I want to start with this article – where you as a Data Scientist or Data Engineer can show your workplace and explain what tools a data scientist in your opinion really needs.

I am very curious how many monitors you prefer, whether you use Apple, Dell, HP or Lenovo, MacOS, Linux or Windows, etc., etc. And of course, do you like a clean or messy desk?

What is a Blog Parade?

A blog parade is a call to blog owners to report on a specific topic. Everyone who participates in the blog parade, write on their blog a contribution to the topic. The organizer of the blog parade collects all the articles and will recap those articles in a short form together, of course with links to the articles.

How can I participate?

Write an article on your blog! Mention this blog parade here, show and explain your workplace (your desk with your technical equipment) in an article. If you’re missing your own blog, articles can also be posted directly to LinkedIn (LinkedIn has its own blogging feature that every LinkedIn member can use). Alternative – as a last resort – it would also be possible to send me your article with a photo about your workplace directly to: redaktion@data-science-blog.com.
Please make me aware of an article, via e-mail or with a comment (below) on this article.

Who can participate?

Any data scientist or anyone close to Data Science: Everyone concerned with topics such as data analytics, data engineering or data security. Please do not over-define data science here, but keep it in a nutshell, so that all professionals who manage and analyze data can join in with a clear conscience.

And yes, I will participate too. I will propably be the first who write an article about my workplace (I just need a new photo of my desk).

When does the article have to be finished?

By 31/12/2017, the article must have been published on your blog (or LinkedIn or wherever) and the release has to be reported to me.
But beware: Anyone who has previously written an article will also be linked earlier. After all, reporting on your article will take place immediately after I hear about it.
If you publish an artcile tomorrow, it will be shown the day after tomorrow here on the Data Science Blog.

What is in it for me to join?

Nothing! Except perhaps the fun factor of sharing your idea of ​​a nice desk for a data expert with others, so as to share creativity or a certain belief in what a data scientist needs.
Well and for bloggers: There is a great backlink from this data science blog for you 🙂

What should I write? What are the minimum requirements of content?

The article does not have to (but may be) particularly long. Anyway, here on this data science blog only a shortened version of your article will appear (with a link, of course).

Minimum requirments:

  • Show a photo (at least one!) of your workplace desk!
  • And tell us something about:
    • How many monitors do you use (or wish to have)?
    • What hardware do you use? Apple? Dell? Lenovo? Others?
    • Which OS do you use (or prefer)? MacOS, Linux, Windows? Virtual Machines?
    • What are your favorite databases, programming languages and tools? (e.g. Python, R, SAS, Postgre, Neo4J,…)
    • Which data dou you analyze on your local hardware? Which in server clusters or clouds?
    • If you use clouds, do you prefer Azure, AWS, Google oder others?
    • Where do you make your notes/memos/sketches. On paper or digital?

Not allowed:
Of course, please do not provide any information, which could endanger your company`s IT security.

Absolutly allowed:
Bringing some joke into the matter 🙂 We are happy to vote in the comments on the best or funniest desk for election, there may be also a winner later!


The resulting Blog Posts: https://data-science-blog.com/data-science-insights/show-your-desk/


 

Data Science vs Data Engineering

The job of the Data Scientist is actually a fairly new trend, and yet other job titles are coming to us. “Is this really necessary?”, Some will ask. But the answer is clear: yes!

There are situations, every Data Scientist know: a recruiter calls, speaks about a great new challenge for a Data Scientist as you obviously claim on your LinkedIn profile, but in the discussion of the vacancy it quickly becomes clear that you have almost none of the required skills. This mismatch is mainly due to the fact that under the job of the Data Scientist all possible activity profiles, method and tool knowledge are summarized, which a single person can hardly learn in his life. Many open jobs, which are to be called under the name Data Science, describe rather the professional image of the Data Engineer.


Read this article in German:
“Data Science vs Data Engineering – Wo liegen die Unterschiede?“


What is a Data Engineer?

Data engineering is primarily about collecting or generating data, storing, historicalizing, processing, adapting and submitting data to subsequent instances. A Data Engineer, often also named as Big Data Engineer or Big Data Architect, models scalable database and data flow architectures, develops and improves the IT infrastructure on the hardware and software side, deals with topics such as IT Security , Data Security and Data Protection. A Data Engineer is, as required, a partial administrator of the IT systems and also a software developer, since he or she extends the software landscape with his own components. In addition to the tasks in the field of ETL / Data Warehousing, he also carries out analyzes, for example, to investigate data quality or user access. A Data Engineer mainly works with databases and data warehousing tools.

A Data Engineer is talented as an educated engineer or computer scientist and rather far away from the actual core business of the company. The Data Engineer’s career stages are usually something like:

  1. (Big) Data Architect
  2. BI Architect
  3. Senior Data Engineer
  4. Data Engineer

What makes a Data Scientist?

Although there may be many intersections with the Data Engineer’s field of activity, the Data Scientist can be distinguished by using his working time as much as possible to analyze the available data in an exploratory and targeted manner, to visualize the analysis results and to convert them into a red thread (storytelling). Unlike the Data Engineer, a data scientist rarely sees into a data center, because he picks up data via interfaces provided by the Data Engineer or provides by other resources.

A Data Scientist deals with mathematical models, works mainly with statistical procedures, and applies them to the data to generate knowledge. Common methods of Data Mining, Machine Learning and Predictive Modeling should be known to a Data Scientist. Data Scientists basically work close to the department and need appropriate expertise. Data Scientists use proprietary tools (e.g. Tools by IBM, SAS or Qlik) and program their own analyzes, for example, in Scala, Java, Python, Julia, or R. Using such programming languages and data science libraries (e.g. Mahout, MLlib, Scikit-Learn or TensorFlow) is often considered as advanced data science.

Data Scientists can have diverse academic backgrounds, some are computer scientists or engineers for electrical engineering, others are physicists or mathematicians, not a few have economical backgrounds. Common career levels could be:

  1. Chief Data Scientist
  2. Senior Data Scientist
  3. Data Scientist
  4. Data Analyst oder Junior Data Scientist

Data Scientist vs Data Analyst

I am often asked what the difference between a Data Scientist and a Data Analyst would be, or whether there would be a distinction criterion at all:

In my experience, the term Data Scientist stands for the new challenges for the classical concept of Data Analysts. A Data Analyst performs data analysis like a Data Scientist. More complex topics such as predictive analytics, machine learning or artificial intelligence are topics for a Data Scientist. In other words, a Data Scientist is a Data Analyst++ (one step above the Data Analyst).

And how about being a Business Analyst?

Business Analysts can (but need not) be Data Analysts. In any case, they have a very strong relationship with the core business of the company. Business Analytics is about analyzing business models and business successes. The analysis of business success is usually carried out by IT, and many business analysts are starting a career as Data Analyst now. Dashboards, KPIs and SQL are the tools of a good business analyst, but there might be a lot business analysts, who are just analysing business models by reading the newspaper…

Data Science on a large scale – can it be done?

Analytics drives business

In today’s digital world, data has become the crucial success factor for businesses as they seek to maintain a competitive advantage, and there are numerous examples of how companies have found smart ways of monetizing data and deriving value accordingly.

On the one hand, many companies use data analytics to streamline production lines, optimize marketing channels, minimize logistics costs and improve customer retention rates.  These use cases are often described under the umbrella term of operational BI, where decisions are based on data to improve a company’s internal operations, whether that be a company in the manufacturing industry or an e-commerce platform.

On the other hand, over the last few years, a whole range of new service-oriented companies have popped up whose revenue models wholly depend on data analytics.  These Data-Driven Businesses have contributed largely to the ongoing development of new technologies that make it possible to process and analyze large amounts of data to find the right insights.  The better these technologies are leveraged, the better their value-add and the better for their business success.  Indeed, without data and data analytics, they don’t have a business.

Data Science – hype or has it always been around?Druck

In my opinion, there is too much buzz around the new era of data scientists.  Ten years ago, people simply called it data mining, describing similar skills and methods.  What has actually changed is the fact that businesses are now confronted with new types of data sources such as mobile devices and data-driven applications rather than statistical methodologies.  I described that idea in detail in my recent post Let’s replace the Vs of Big Data with a single D.

But, of course, you cannot deny that the importance of these data crunchers has increased significantly. The art of mining data mountains (or perhaps I should say “diving through data lakes”) to find appropriate insights and models and then find the right answers to urgent, business-critical questions has become very popular these days.

The challenge: Data Science with large volumes?

Michael Stonebraker, winner of the Turing Award 2014, has been quoted as saying: “The change will come when business analysts who work with SQL on large amounts of data give way to data EXASOL Pipelinescientists, which will involve more sophisticated analysis, predictive modeling, regressions and Bayesian classification. That stuff at scale doesn’t work well on anyone’s engine right now. If you want to do complex analytics on big data, you have a big problem right now.”

And if you look at the limitations of existing statistical environments out there using R, Python, Java, Julia and other languages, I think he is absolutely right.  Once the data scientists have to handle larger volumes, the tools are just not powerful and scalable enough.  This results in data sampling or aggregation to make statistical algorithms applicable at all.

A new architecture for “Big Data Science”

We at EXASOL have worked hard to develop a smart solution to respond to this challenge.  Imagine that it is possible to use raw data and intelligent statistical models on very large data sets, directly at the place where the data is stored.  Where the data is processed in-memory to achieve optimal performance, all distributed across a powerful MPP cluster of servers, in an environment where you can now “install” the programming language of your choice.

Sounds far-fetched?  If you are not convinced, then I highly recommend you have a look at our brand-new in-database analytic programming platform, which is deeply integrated in our parallel in-memory engine and extensible through using nearly any programming language and statistical library.

For further information on our approach to big data science, go ahead and download a copy of our technical whitepaper:  Big Data Science – The future of analytics.