Tag Archive for: Data Engineer

Job Profile of the Data Engineer

Why data engineering has long been stealing the show from data science in importance and career prospects, while itself remaining subject to constant change.

What a Data Engineer Really Needs to Be Able to Do

The data scientist as the sexiest job of the 21st century? Perhaps, because the job has its own particular appeal, not least because of its position at the interface between technology and domain expertise. But the spotlight of the coming years already belongs to another role in the data value chain, and that shows in the salaries as well.

Many companies are currently on their way to becoming a data-driven business: a style of management that bases its decisions on transparent data and automates operational processes as far as possible using business intelligence, data science, and automation with deep learning and RPA. The solution to these challenges is often sought mainly among experts for process automation and data science. Success, however, depends far more on obtaining valid data in the first place, and therefore on a quite different key position in the workflow of data-driven decision-making: the data engineer.

Data Engineer, the Most In-Demand Job of the 21st Century?

The job of the data scientist, meanwhile, is still more popular than ever among students and graduates of STEM subjects, as shown by the daily rush of graduates from data science degree programs on such job postings. There is also no longer much of a shortage of international applicants specializing in statistics and machine learning. The solidly trained and ideally German-speaking data scientist is still hard to come by, but good candidates overall are no longer all that difficult to find. For years, many training offerings for students and professionals have been available online cheaply and flexibly, without any loss in the reputation of these training and continuing education programs.

We have discussed in detail what expertise a data scientist has to cover in our look at the Data Science Knowledge Stack.

But what use is a data scientist who does not have the data needed for the job? Preparing and presenting their own projects is certainly part of every data scientist's work. Procuring and managing large volumes of data in an enterprise-grade architecture, however, is generally not their focus, and they often lack the necessary permissions in an enterprise IT environment. The need for data procurement and preparation becomes even more concrete in business intelligence, which requires fixed structures such as a data warehouse for sustainable reporting.

The Data Engineer's Profile: Big Data High Tech

Even though data engineering is still somewhat neglected by universities and training providers, the market outlines the deployment and resulting requirements profile of a data engineer quite clearly. At their core, the use cases for these data engineers (Dateningenieure is a perfectly acceptable German term too) are building data warehouse and data lake systems, nowadays mostly on cloud platforms. They develop these systems to tap internal and external data sources, and they prepare the resulting data structurally and in terms of content so that other employees of the company can use it appropriately.

Enabler für Business Intelligence, Process Mining und Data Science

No data engineer should lose sight of the actual consumer of the data, for whom the data is to be merged, cleaned and brought into the target format by every trick in the book. Traditionally, engineers work on data warehousing for business intelligence or for process mining, which requires more and more event logs. A data warehouse is the much larger, submerged part of the business intelligence (BI) iceberg that supplies reports with qualified data. This iceberg analogy also applies to data engineering as a whole, which is mostly barely visible to end users at the top of the data food chain, because they only see the finished analyses and not the data stores prepared for them.

Figure 1 - Data engineering is the center of every data platform. Whether for data science, BI, process mining or even RPA, data delivery requires good data engineers who can dive all the way down to the cloud infrastructure.

Databases Are the Source and Target of Data Engineers

Data is rarely available neatly structured in a single CSV file. It usually comes from one or more databases, each of which follows its own rules. Business data, for example from ERP or CRM systems, resides in relational databases, often from Microsoft, Oracle or SAP, or in an open-source alternative. Particularly popular at the moment are the cloud-native databases BigQuery from Google, Redshift from Amazon and Synapse from Microsoft, as well as the cloud-independent database Snowflake. These are joined by databases such as PostgreSQL, MariaDB or Microsoft SQL Server, as well as Cosmos DB, and by simpler cloud storage such as Microsoft Azure Blob Storage, Amazon S3 or Google Cloud Storage. Whichever database is the right choice for the company, nothing in data engineering works without SQL and an understanding of normalized data.
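
To make the point about SQL and normalized data concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are invented for illustration and do not come from any real system. Customers and orders live in separate tables and are only brought together by a join at query time.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical normalized schema: customer names are stored once,
# and orders reference them by key instead of repeating the name.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount_eur  REAL NOT NULL
    );
""")

conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Alpha GmbH"), (2, "Beta AG")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0)],
)


def revenue_per_customer(connection):
    """Join the two normalized tables and sum order amounts per customer."""
    query = """
        SELECT c.name, SUM(o.amount_eur) AS revenue
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return connection.execute(query).fetchall()


revenue = revenue_per_customer(conn)
```

The same join and aggregation pattern carries over to the larger relational systems named above, although each has its own SQL dialect and connection library.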

Other kinds of databases, so-called NoSQL databases, are based on document formats, on column orientation or on graph orientation. Examples of widely used NoSQL databases are MongoDB, CouchDB, Cassandra and Neo4j. These databases do not exist merely to entertain bored nerds. They have very specific areas of application in which each offers the best performance for reading or writing data.

A data engineer therefore has to cope with different database systems, some of which live on different cloud platforms.

Data Engineers Need Hacker Qualities

Once data is in a database, analysts with access can already run simple analyses directly on that database. But how do we get the data into our specialized analysis tools? This is where the engineer comes in and has to be able to export the data from the database. Direct data connections bring APIs and interfaces such as REST, ODBC or JDBC into play, and a good data engineer needs programming skills, preferably in Python, to work with them. Some knowledge of socket connections and client-server architectures sometimes pays off here. Every data engineer should also be familiar with symmetric and asymmetric encryption methods, because the work usually involves confidential data. A minimum standard of security is part of data engineering and must not be left to data security experts alone. An affinity for network security or even penetration testing is a plus, but clean permission management, at the very least, is a basic skill. Much data is not stored in structured form in a database but is so-called unstructured or semi-structured data from documents or internet sources. With methods such as web scraping and data crawling, and by automating data retrieval, outstanding data engineers even show real hacker qualities.
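
As a small illustration of the scraping idea mentioned above, the following sketch pulls link targets out of an HTML page using only Python's standard library. The page content and URLs are made up for the example. A real scraper would first fetch the page over HTTP and would have to respect the site's terms of use, neither of which is shown here.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Parse an HTML string and return all link targets in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical page content, standing in for a downloaded document.
SAMPLE_PAGE = """
<html><body>
  <a href="https://example.com/reports/2023.csv">Report 2023</a>
  <a href="/reports/2024.csv">Report 2024</a>
  <a>no target</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE)
```

The resulting list of targets could then feed an automated download step, which is where scraping turns into a repeatable data retrieval job.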

Conductor of Data: Orchestrating Data Flows

One of the core tasks of the data engineer is developing ETL pipelines to extract data from sources, transform it into the desired target format and finally load it into the target database. This may sound simple at first, but it becomes a real challenge when many ETL processes combine into entire ETL chains and networks that still have to perform well despite frequent data queries. The orchestration of data flows can usually be divided into several stages: from the source into the data warehouse, between the layers within the data warehouse, and from the data warehouse into downstream systems, through to processed data flowing back into the data warehouse (reverse ETL).
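
The three ETL steps can be sketched in a few lines of Python with the standard library. This is only a toy pipeline under stated assumptions: the semicolon-separated input, the column names and the target table are invented for illustration, and a production pipeline would add logging, error handling and scheduling.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with decimal commas and untidy whitespace.
RAW_CSV = """order_id;customer;amount
1; Alpha GmbH ;120,50
2;Beta AG;80,00
3;;15,00
"""


def extract(text):
    """Extract: read semicolon-separated rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def transform(rows):
    """Transform: trim names, convert decimal commas, drop rows without a customer."""
    cleaned = []
    for row in rows:
        customer = row["customer"].strip()
        if not customer:
            continue
        amount = float(row["amount"].replace(",", "."))
        cleaned.append((int(row["order_id"]), customer, amount))
    return cleaned


def load(rows, connection):
    """Load: write the cleaned rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
```

Keeping the three steps as separate functions makes it easier to chain several such pipelines later and to test each step on its own.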

Close to the Border of DevOps: Automation in Cloud Architectures

In recent years, the demands on data engineers have risen considerably. Besides actually managing data stocks and data streams for analytical purposes, a data engineer is increasingly expected to manage cloud resources as well, or at least the databases and ETL resources. Beyond that, data engineers are increasingly required to understand IT networks and to automate the whole interplay of resources as infrastructure as code. The automated deployment of data architectures via CI/CD pipelines also turns the data engineer more and more into a DevOps engineer.

Future and Salary Prospects

Compared with the data scientist, who needs a particularly strong understanding of methods for data analysis and statistics as well as of the domain under investigation, data engineers are more oriented towards tools and platforms. A data scientist who has understood deep learning can quickly apply that knowledge with TensorFlow as well as with PyTorch. A data engineer, by contrast, works more intensively with the tools themselves, and these evolve much faster over the years. A data engineer for Google Cloud will need more time to get up to speed if they suddenly have to work on AWS or Azure.

In Germany, an entry-level data engineer with good prior knowledge and some initial experience can expect a gross annual salary of between 45,000 and 55,000 EUR. Companies tend to reward more than two years of concrete data engineering experience with salaries between 50,000 and 80,000 EUR. Above that are usually only the data architects, who are found mainly in large companies and whose role requires a great deal of experience. Further career paths for data engineers are consulting careers or management positions.

Anyone who has hired a data engineer on a permanent contract should not feel too safe, though, because recruiters lie in wait for these skilled professionals at every corner of social media. Especially in big cities such as Berlin, far from all companies manage to keep every data engineer employed for years. With so many jobs and challenges to choose from, these data experts have no trouble proactively driving up their salaries by changing jobs.

Data Warehousing Basics

Data Warehousing is applied Big Data Management and a key success factor in almost every company. Without a data warehouse, no company today can control its processes and make the right decisions on a strategic level, as decision makers would lack data transparency. Bigger companies even have multiple data warehouses for different purposes.

In this series of articles I would like to explain what a data warehouse actually is and how it is set up. However, I would also like to explain basic topics regarding Data Engineering and concepts about databases and data flows.

To do this, we tick off the following points step by step:

 

Data Security for Data Scientists & Co. – Infographic

Data becomes information and information becomes knowledge. For this reason, companies are nowadays also evaluated with regard to their data and their data quality. Data is also the raw material for management decisions and artificial intelligence. IT security is therefore very important, and specialized consulting and auditing companies offer services specifically for the security of IT systems.

However, Data Scientists, Data Analysts and Data Engineers rarely work with open data alone. More often they work intensively with customer data. Every expert in the storage and analysis of data should therefore have at least a basic knowledge of data security and work according to certain principles in order to guarantee the security of the data and the legality of the data processing.

There are a number of rules and principles for data security that must be observed. We at DATANOMIQ have summarized the ones we consider most important in an infographic for Data Scientists, Data Analysts and Data Engineers. You can download the infographic here: DataSecurity_Infographic

Data Security for Data Scientists, Data Analysts and Data Engineers

Download Infographic as PDF

Infographic - Data Security for Data Scientists, Data Analysts and Data Engineers

Determining Your Data Pipeline Architecture and Its Efficacy

Data analytics has become a central part of how many businesses operate. If you hope to stay competitive in today’s market, you need to take advantage of all your available data. For that, you’ll need an efficient data pipeline, which is often easier said than done.

If your pipeline is too slow, your data will be all but useless by the time it’s usable. Successful analytics require an optimized pipeline, and that looks different for every company. No matter your specific circumstances, though, a traditional approach will result in inefficiencies.

Creating the most efficient pipeline architecture will require you to change how you look at the process. By understanding each stage’s role and how they serve your goals, you can optimize your data analytics.

Understanding Your Data Needs

You can’t build an optimal data pipeline if you don’t know what you need from your data. If you spend too much time collecting and organizing information you won’t use, you’ll take time away from what you need. Similarly, if you only work to meet one team’s needs, you’ll have to go back and start over to help others.

Data analytics involves multiple stakeholders, all with individual needs and expectations that you should consider. Your data engineers need your pipeline to be accessible and scalable, while analysts require visual, relevant datasets. If you consider these aspects from the beginning, you can build a pipeline that works for everyone.

Start at the earliest stage — collection. You may be collecting data from every channel you can, which could result in an information overload. Focus instead on gathering things from the most relevant sources. At the same time, ensure you can add more channels if necessary in the future.

As you reorganize your pipeline, remember that analytics are only as good as your datasets. If you put more effort into organizing and scrubbing data, helpful analytics will follow. Focus on preparing data well, and the last few stages will be smoother.

Creating a Collaborative Pipeline

When structuring your pipeline, it’s easy to focus too much on the individual stages. While seeing things as rigid steps can help you visualize them, you need something more fluid in practice. If you want the process to run as smoothly as possible, it needs to be collaborative.

Look at the software development practice of DevOps, which doubles a team’s likelihood of exceeding productivity goals. This strategy focuses on collaboration across separate teams instead of passing things back and forth between them. You can do the same thing with your data pipeline.

Instead of dividing steps between engineers and analysts, make it a single, cohesive process. Teams will still focus on different areas according to their expertise, but they’ll reduce disruption by working together instead of independently. If workers can collaborate along every step, they don’t have to go back and forth.

Simultaneously, everyone should have clearly defined responsibilities. Collaboration doesn’t mean overstepping your areas of expertise. The goal here isn’t to make everyone handle everything but to ensure they understand each other’s needs.

Eliminating the time between steps also applies to your platform. Look for or build software that integrates both refinement and data preparation. If you have to export data to various programs, it will cause unnecessary bottlenecks.

Enabling Continuous Improvement

Finally, understand that restructuring your data pipeline isn’t a one-and-done job. Another principle you can adopt from DevOps is continuous development across all sides of the process. Your engineers should keep looking for better ways to structure data as your analysts search for new applications for this information.

Make sure you always measure your throughput and efficiency. If you tweak something and you notice the process starts to slow, revert to the older method. If your changes improve the pipeline, try something similar in another area.
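
As a rough sketch of what measuring throughput and deciding whether to keep a change could look like, the following example times one pipeline stage and compares two rates. The function names, the sample stage and the 5 percent tolerance are assumptions made for illustration, not a prescribed method.

```python
import time


def measure_throughput(stage, records):
    """Run one pipeline stage over all records and return (results, records per second)."""
    start = time.perf_counter()
    results = [stage(record) for record in records]
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast runs.
    rate = len(records) / elapsed if elapsed > 0 else float("inf")
    return results, rate


def keep_change(baseline_rate, candidate_rate, tolerance=0.05):
    """Keep a tweak unless it is more than `tolerance` slower than the baseline."""
    return candidate_rate >= baseline_rate * (1 - tolerance)


# Hypothetical stage: doubling each value stands in for a real transformation.
records = list(range(10_000))
results, rate = measure_throughput(lambda x: x * 2, records)
```

A single timing run is noisy, so in practice you would repeat the measurement several times and compare averages before reverting or keeping a change.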

Optimize Your Data Pipeline

Remember to start slow when optimizing your data pipeline. Changing too much at once can cause more disruptions than it avoids, so start small with an emphasis on scalability.

The specifics of your pipeline will vary depending on your needs and circumstances. No matter what these are, though, you can benefit from collaboration and continuous development. When you start breaking down barriers between different steps and teams, you unclog your pipeline.

Closing the AI-skills gap with Upskilling

Artificial intelligence, or AI as it is commonly called, has gained huge popularity worldwide, and given the career prospects it offers, deservedly so. Almost everyone interested in the technology sector is rushing towards it, especially young and motivated computer science graduates. Compared with other IT-related jobs, AI roles pay considerably higher salaries and offer more opportunities. According to a Glassdoor report, Data Scientist, one of the many related jobs, is the number one job in terms of salary, job openings and more. AI-related jobs include Data Scientists, Analysts, Machine Learning Engineers, NLP experts and others.

AI has found applications in almost every industry, and demand has picked up accordingly. Home assistants (Siri, Ok Google, Amazon Echo), chatbots and more are some of the popular applications of AI.

Increasing adoption of AI across industries

The advantages of AI, such as increased productivity, have driven its adoption among companies. According to Gartner, 37 percent of enterprises currently use AI in one way or another. In fact, in the last four years the adoption of AI technologies among companies has increased by 270 percent. In telecommunications, for instance, 52 percent of companies have deployed chatbots for a better and smoother customer experience. About 49 percent of businesses are now on their way to altering their business models to integrate and adopt AI-driven processes. Further, industry leaders have voiced their concerns about companies that are lagging in AI adoption.

Unfortunately, it has been extremely difficult for employers to find suitably skilled or qualified candidates for AI-related positions. One report suggests that a total of 300,000 AI professionals are available worldwide, while there is demand for millions. In a recent survey conducted by Ernst & Young, 51 percent of AI professionals said that a lack of talent was the biggest impediment to AI adoption.

Further, a survey conducted by O’Reilly in 2018 found that the lack of AI skills, among other things, was the major reason holding companies back from implementing AI.
The major reason for this is the lack of skills among people who aspire to get into AI-related jobs. According to a report, there is demand for millions of jobs in AI. However, only a handful of qualified people are available.

Bridging the skill gap in AI-related jobs

Top companies and governments around the world have taken up initiatives to close this gap. Google and Amazon, for instance, have dedicated facilities that train people in AI skills. Google’s Brain Toronto is a dedicated facility to expand its AI talent. Similarly, Amazon has a facility near the University of Cambridge that is dedicated to AI. Most companies either already have such a facility or are in the process of setting one up.

In addition, governments around the world are taking initiatives to address the skill gap. For instance, they are pushing AI advancement and developing collaborative plans aimed at producing more AI-skilled professionals. Recently, the White House launched ai.gov, which is further helping to promote AI in the US. The website will offer updates on AI projects across different sectors.

Beyond this, companies have taken it upon themselves to reskill their employees and prepare them for future roles. According to a report from Towards Data Science, about 63 percent of companies have in-house training programs to train employees in AI-related skills.

Overall, although there is demand for AI professionals, the lack of skilled talent is a major problem.

Roles in Artificial Intelligence
Artificial Intelligence Engineer is the most common role for which companies hire across artificial intelligence. Beyond that, the following are some of the popular roles:

  1. Machine Learning Engineer: These are the people who make machines learn with complex algorithms. At an advanced level, machine learning engineers are required to have good knowledge of computer vision. According to Indeed, demand for Machine Learning Engineers has grown by 344 percent in the last year.
  2. NLP Experts: These experts understand how to make computers understand human language. Text-to-speech technologies are a common area that requires NLP experts. Demand for engineers who can program computers to understand human speech is growing continuously. It was the fastest-growing skill in Upwork’s list of in-demand freelancing skills: in Q4 2016 it had grown 200 percent, and it has kept growing since.
  3. Big Data Engineers: This is mainly an analytics role. They gather the huge amounts of data available from sources and analyze them to derive insights and understand patterns, which may then be used for machine learning, prediction modelling and natural language processing. McKinsey's 2018 annual report stated that there was a shortage of 190,000 big data professionals in the US alone.

Other roles such as Data Scientists, Analysts and more are also in great demand. Again, due to insufficient talent in the market, companies are struggling to hire for these roles.

Self-learning and upskilling
Artificial intelligence is a continuously growing field that is advancing at a very fast pace, which makes it extremely difficult to keep up with in-demand skills. It is therefore imperative to keep up with the demands of the industry. Otherwise it is just a matter of time before one becomes redundant.

On an individual level, learning new skills is necessary. One has to be agile, keep learning and be ready to adopt new technologies. For this, AI training programs and certifications are ideal. There are numerous AI programs that individuals can take to learn new skills, and AI certifications can immensely boost career opportunities. Certification programs offer a structured, hands-on approach to learning that focuses on practical, applicable skills while keeping fluff away. In addition, certification programs only award the qualification once a practical test has been passed, which is very advantageous in tech. AI certifications such as AIE (Artificial Intelligence Engineer) are quite popular.

Online learning platforms also offer a good resource for learning artificial intelligence. Most schools have not yet adapted their curricula to teach AI skills, while most universities and grad schools are on their way to doing so. In the meantime, online learning platforms offer a good way to learn AI skills, starting from the basics and progressing to advanced skills.

My Desk for Data Science

In my last post I announced a blog parade about what a data scientist’s workplace might look like.

Here are some photos of my desk and my answers to the questions:

How many monitors do you use (or wish to have)?

I am mostly working at my desk in my office with a tower PC and three monitors.
I definitely need at least three monitors to work productively as a data scientist. Who doesn't know the situation: the data model is displayed on the left monitor, the data mapping on the right one, and in the middle I do my work, programming the analysis scripts.

What hardware do you use? Apple? Dell? Lenovo? Others?

I am not an Apple guy. When I need to work on the move, I like to use ThinkPad notebooks. ThinkPads are (in my experience) very robust and therefore particularly well suited to mobile work. Besides, those notebooks look conservative, so I’m not sad if the notebook gets a scratch. However, I do not solve particularly challenging analysis tasks on a notebook, because I need my monitors for that.

Which OS do you use (or prefer)? MacOS, Linux, Windows? Virtual Machines?

As a data scientist, I have to be able to communicate well with my clients, and they usually use Microsoft Windows as their operating system. I also use Windows as my main operating system. Of course, all our servers run on Debian Linux, but most of my tasks are done directly on Windows.
For some notebooks, I have set up a dual boot, because sometimes I need to start native Linux. In all other cases I work with virtual machines (Ubuntu or Linux Mint).

What are your favorite databases, programming languages and tools?

I prefer Microsoft SQL Server (T-SQL), C# and Python (pandas, numpy, scikit-learn). This is my world. But my customers are kings, so I also work with PostgreSQL, MongoDB, Neo4j, Tableau, Qlik Sense, Celonis and a lot more. I like getting used to new tools and technologies again and again. This is one of the benefits of being a data scientist.

Which data do you analyze on your local hardware? Which in server clusters or clouds?

So far there have been only a few cases where I analyzed really big data. For big data we use horizontally scalable systems like Hadoop and Spark. But we also have customers analyzing middle-sized data (more than 10 TB but less than 100 TB) on one big server that is vertically scalable. Most of my customers just want to gather data to answer questions on not so big amounts of data. Everything below 10 TB we can handle on a high-end workstation.

If you use clouds, do you prefer Azure, AWS, Google or others?

Microsoft Azure! I am used to tools provided by Microsoft and I think Azure is a well preconfigured cloud solution.

Where do you make your notes/memos/sketches. On paper or digital?

My calendar is managed digitally, because I need to know my appointments wherever I am. But I prefer to write down my thoughts on paper, and that's why I have several paper notebooks.

Now it is your turn: Join our Blog Parade!

So what does your workplace look like? Show your desk on your blog until 31/12/2017 and we will show a short introduction of your post here on the Data Science Blog!

 

Show your Data Science Workplace!

The job of a data scientist is often a mystery to outsiders. Of course, you do not really need much more than a medium-sized notebook to use data science methods for finding value in data. Nevertheless, data science workplaces can look so different and, let’s say, interesting. And that’s why I want to launch a blog parade – which I want to start with this article – where you as a Data Scientist or Data Engineer can show your workplace and explain what tools a data scientist in your opinion really needs.

I am very curious how many monitors you prefer, whether you use Apple, Dell, HP or Lenovo, MacOS, Linux or Windows, etc., etc. And of course, do you like a clean or messy desk?

What is a Blog Parade?

A blog parade is a call to blog owners to report on a specific topic. Everyone who participates in the blog parade writes a contribution on the topic on their own blog. The organizer of the blog parade collects all the articles and recaps them together in short form, with links to the articles of course.

How can I participate?

Write an article on your blog! Mention this blog parade, and show and explain your workplace (your desk with your technical equipment) in the article. If you do not have your own blog, articles can also be posted directly to LinkedIn (LinkedIn has its own blogging feature that every LinkedIn member can use). Alternatively, as a last resort, you can send me your article with a photo of your workplace directly at: redaktion@data-science-blog.com.
Please let me know about your article, via e-mail or with a comment (below) on this article.

Who can participate?

Any data scientist or anyone close to Data Science: Everyone concerned with topics such as data analytics, data engineering or data security. Please do not over-define data science here, but keep it in a nutshell, so that all professionals who manage and analyze data can join in with a clear conscience.

And yes, I will participate too. I will probably be the first to write an article about my workplace (I just need a new photo of my desk).

When does the article have to be finished?

By 31/12/2017, the article must have been published on your blog (or LinkedIn or wherever) and its release reported to me.
But note: the earlier you publish your article, the earlier it will be linked. After all, I will report on your article as soon as I hear about it.
If you publish an article tomorrow, it will be shown here on the Data Science Blog the day after tomorrow.

What is in it for me to join?

Nothing! Except perhaps the fun of sharing your idea of a nice desk for a data expert with others, and with it some creativity or a certain belief in what a data scientist needs.
Well and for bloggers: There is a great backlink from this data science blog for you 🙂

What should I write? What are the minimum requirements of content?

The article does not have to be particularly long (but it may be). In any case, only a shortened version of your article will appear here on this data science blog (with a link, of course).

Minimum requirements:

  • Show a photo (at least one!) of your workplace desk!
  • And tell us something about:
    • How many monitors do you use (or wish to have)?
    • What hardware do you use? Apple? Dell? Lenovo? Others?
    • Which OS do you use (or prefer)? MacOS, Linux, Windows? Virtual Machines?
    • What are your favorite databases, programming languages and tools? (e.g. Python, R, SAS, PostgreSQL, Neo4j,…)
    • Which data do you analyze on your local hardware? Which in server clusters or clouds?
    • If you use clouds, do you prefer Azure, AWS, Google or others?
    • Where do you make your notes/memos/sketches. On paper or digital?

Not allowed:
Of course, please do not provide any information that could endanger your company's IT security.

Absolutely allowed:
Bringing some humor into the matter 🙂 We are happy to vote in the comments on the best or funniest desk, and there may also be a winner later!


The resulting Blog Posts: https://data-science-blog.com/data-science-insights/show-your-desk/


 

Data Science vs Data Engineering

The job of the Data Scientist is actually a fairly new trend, and yet more job titles are already coming our way. “Is this really necessary?”, some will ask. But the answer is clear: yes!

There is a situation every Data Scientist knows: a recruiter calls and talks about a great new challenge for a Data Scientist, which is what you claim to be on your LinkedIn profile, but while discussing the vacancy it quickly becomes clear that you have almost none of the required skills. This mismatch arises mainly because the job title Data Scientist lumps together every conceivable activity profile, method and tool, more than a single person can learn in a lifetime. Many open positions advertised under the name Data Science actually describe the profile of a Data Engineer.


Read this article in German:
“Data Science vs Data Engineering – Wo liegen die Unterschiede?“


What is a Data Engineer?

Data engineering is primarily about collecting or generating data, and about storing, historizing, processing, adapting and delivering it to downstream systems. A Data Engineer, often also called a Big Data Engineer or Big Data Architect, models scalable database and data flow architectures, develops and improves the IT infrastructure on the hardware and software side, and deals with topics such as IT Security, Data Security and Data Protection. A Data Engineer is, as required, a part-time administrator of the IT systems and also a software developer, since he or she extends the software landscape with custom components. In addition to the tasks in the field of ETL / Data Warehousing, a Data Engineer also carries out analyses, for example to investigate data quality or user access. A Data Engineer mainly works with databases and data warehousing tools.

A Data Engineer is typically a trained engineer or computer scientist and works rather far away from the actual core business of the company. The Data Engineer’s career stages are usually something like:

  1. (Big) Data Architect
  2. BI Architect
  3. Senior Data Engineer
  4. Data Engineer

What makes a Data Scientist?

Although there may be many overlaps with the Data Engineer’s field of activity, the Data Scientist can be distinguished by spending as much working time as possible on analyzing the available data in an exploratory and targeted manner, visualizing the analysis results and turning them into a coherent narrative (storytelling). Unlike the Data Engineer, a Data Scientist rarely sees the inside of a data center, because he picks up data via interfaces provided by the Data Engineer or by other sources.

A Data Scientist deals with mathematical models, works mainly with statistical procedures, and applies them to the data to generate knowledge. Common methods of Data Mining, Machine Learning and Predictive Modeling should be known to a Data Scientist. Data Scientists generally work close to the business department and need the corresponding domain expertise. Data Scientists use proprietary tools (e.g. tools from IBM, SAS or Qlik) and program their own analyses, for example in Scala, Java, Python, Julia, or R. Using such programming languages and data science libraries (e.g. Mahout, MLlib, Scikit-Learn or TensorFlow) is often considered advanced data science.

Data Scientists can have diverse academic backgrounds: some are computer scientists or electrical engineers, others are physicists or mathematicians, and quite a few have a background in economics. Common career levels could be:

  1. Chief Data Scientist
  2. Senior Data Scientist
  3. Data Scientist
  4. Data Analyst or Junior Data Scientist

Data Scientist vs Data Analyst

I am often asked what the difference between a Data Scientist and a Data Analyst would be, or whether there would be a distinction criterion at all:

In my experience, the term Data Scientist reflects the new challenges facing the classical Data Analyst role. A Data Analyst performs data analysis, just as a Data Scientist does. More complex topics such as predictive analytics, machine learning or artificial intelligence are the domain of the Data Scientist. In other words, a Data Scientist is a Data Analyst++ (one step above the Data Analyst).

And how about being a Business Analyst?

Business Analysts can (but need not) be Data Analysts. In any case, they have a very strong relationship with the core business of the company. Business Analytics is about analyzing business models and business success. The analysis of business success is usually carried out with the help of IT, and many Business Analysts are now starting a career as Data Analysts. Dashboards, KPIs and SQL are the tools of a good Business Analyst, but there may be a lot of Business Analysts who just analyze business models by reading the newspaper…

Data Science Knowledge Stack – Abstraction of the Data Science Skillset

What must a Data Scientist be able to do? Which skills does a Data Scientist need to have? This question has often been asked and frequently answered by several Data Science experts. In fact, it is now quite clear what kind of problems a Data Scientist should be able to solve and which skills are necessary for that. I would like to try to bring this consensus into a visual graph: a layer model, similar to the OSI layer model (which any data scientist should know too, by the way).
I give introductory seminars in Data Science for business people and engineers, and in those seminars I always start by explaining what we need to work out together in theory and in practice-oriented exercises. Against this background, I came up with the idea for this layer model, because the problem already starts with my seminars: I give seminars on Data Science for Business Analytics with Python. So not for medical analyses and not with R or Julia. I therefore do not teach general Data Science knowledge, but a very specific direction.

A Data Scientist must deal with problems at different levels in any Data Science project: for example, the data access does not work as planned or the data has a different structure than expected. A Data Scientist can spend hours debugging his own source code or learning the ropes of new Data Science packages for his chosen programming language. The right algorithms for data evaluation must also be selected, properly parameterized and tested, and sometimes it turns out that the selected methods were not the optimal ones. Ultimately, we are not doing Data Science all day for fun, but to generate value for a business department. A Data Scientist faces special challenges at this level too, so at least a basic knowledge of that department's domain is a must-have.


Read this article in German:
“Data Science Knowledge Stack – Was ein Data Scientist können muss“


Data Science Knowledge Stack

With the Data Science Knowledge Stack, I would like to provide a structured insight into the tasks and challenges a Data Scientist has to face. The layers of the stack also represent a bidirectional flow from top to bottom and from bottom to top, because Data Science as a discipline is also bidirectional: we try to answer questions with data, or we look at the potential in the data to answer previously unasked questions.

The Data Science Knowledge Stack consists of six layers:

Database Technology Knowledge

A Data Scientist works with data that rarely comes neatly structured in a CSV file, but usually resides in one or more databases that are subject to their own rules. In particular, business data, for example from the ERP or CRM system, is available in relational databases, often from Microsoft, Oracle, SAP or an open source alternative. A good Data Scientist is not only familiar with Structured Query Language (SQL), but is also aware of the importance of relationally linked data models, and therefore also knows the principle of table normalization.

Other types of databases, so-called NoSQL databases (Not only SQL), are document-, column- or graph-oriented, such as MongoDB, Cassandra or GraphDB. Some of these databases use their own query languages (for example, MongoDB uses JavaScript, and the graph database Neo4j has its own language called Cypher). Some of these databases provide alternative access via SQL (such as Hive for Hadoop).

A data scientist has to cope with different database systems and has to master at least SQL – the quasi-standard for data processing.
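To make the idea of normalized, relationally linked tables a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns and sample rows are purely hypothetical; a real ERP or CRM schema will look different.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two normalized tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount REAL NOT NULL
        );
        INSERT INTO customers (id, name) VALUES (1, 'Alice'), (2, 'Bob');
        INSERT INTO orders (customer_id, amount)
            VALUES (1, 120.0), (1, 30.0), (2, 80.0);
    """)
    return conn


def revenue_per_customer(conn):
    """Join the two tables on the foreign key and sum order amounts per customer."""
    query = """
        SELECT c.name, SUM(o.amount) AS revenue
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY revenue DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for name, revenue in revenue_per_customer(build_demo_db()):
        print(name, revenue)
```

The customer name is stored only once, and the orders refer to it via customer_id. That is the point of normalization, and it is why a JOIN is needed to bring the information back together for analysis.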

Data Access & Transformation Knowledge

If data is available in a database, Data Scientists can perform simple (and not so simple) analyses directly on the database. But how do we get the data into our special analysis tools? To do this, a Data Scientist must know how to export data from the database. For one-time actions, an export can be a CSV file, but which separators and text qualifiers should be used? The export may also be too large, so the file must be split.
If there is a direct and synchronous data connection between the analysis tool and the database, interfaces like REST, ODBC or JDBC come into play. Sometimes a socket connection must also be established, and the principle of a client-server architecture should be known. Symmetric and asymmetric encryption methods should also be familiar to a Data Scientist, as confidential data is often involved and a minimum level of security is essential for business applications.

Many datasets are not structured in a database but are so-called unstructured or semi-structured data from documents or from Internet sources. Here we have interfaces again: a frequent entry point for Data Scientists is, for example, the Twitter API. Sometimes we want to stream data in near real-time, be it machine data or social media messages. This can be quite demanding, so data streaming is almost a discipline of its own, and one that a Data Scientist can quickly come into contact with.
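The question of separators, text qualifiers and splitting large exports can be illustrated with Python's standard csv module. The following is only a sketch: the function name export_chunks, the semicolon separator and the chunk size are assumptions chosen for illustration, not a recommendation for any particular database or tool.

```python
import csv
import io


def export_chunks(rows, header, chunk_size, delimiter=";", quotechar='"'):
    """Yield CSV text chunks, each with the header and at most chunk_size data rows.

    Fields containing the delimiter or the quote character are wrapped in the
    text qualifier (quotechar); embedded quote characters are doubled.
    """
    for start in range(0, len(rows), chunk_size):
        buffer = io.StringIO()
        writer = csv.writer(
            buffer,
            delimiter=delimiter,
            quotechar=quotechar,
            quoting=csv.QUOTE_MINIMAL,
            lineterminator="\n",
        )
        writer.writerow(header)
        writer.writerows(rows[start:start + chunk_size])
        yield buffer.getvalue()


if __name__ == "__main__":
    demo_rows = [(1, "Smith; John"), (2, 'He said "hi"'), (3, "plain")]
    for number, chunk in enumerate(export_chunks(demo_rows, ["id", "name"], 2)):
        print(f"--- chunk {number} ---")
        print(chunk, end="")
```

In practice each chunk would be written to its own file. The important detail is that the reading side has to use the same separator and text qualifier, otherwise a value like "Smith; John" is split into two columns.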

Programming Language Knowledge

Programming languages are tools for Data Scientists to process data and automate processing. Data Scientists are usually not real software developers, and they do not have to worry as much about software security or resource efficiency. However, a certain basic knowledge of software architectures often helps, because some Data Science programs will be integrated into the company's IT landscape. An understanding of object-oriented programming and a good knowledge of the syntax of the chosen programming languages are essential, especially since no single programming language is the best fit for every project.

Even at the level of the programming language there are plenty of pitfalls that stem from the language itself. Each language has its own quirks, and details determine whether an analysis is done correctly or incorrectly: for example, whether data objects are copied or linked by reference, or how NULL/NaN values are treated.
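Both pitfalls mentioned above (copy versus reference, and the treatment of NaN values) can be shown in a few lines of plain Python. This is a small sketch with made-up values; the helper mean_ignoring_nan is a hypothetical name, and other languages and libraries behave differently in the details.

```python
import copy
import math

original = {"values": [1.0, 2.0, float("nan")]}

alias = original                 # no copy at all: both names point to the same dict
shallow = copy.copy(original)    # new dict, but the inner list is still shared
deep = copy.deepcopy(original)   # fully independent copy, including the inner list

original["values"].append(4.0)   # alias and shallow see this change, deep does not

nan = float("nan")
nan_equals_itself = (nan == nan)           # False: NaN never compares equal
naive_sum = sum(original["values"])        # NaN silently propagates through sum()


def mean_ignoring_nan(values):
    """Average only the valid numbers; return NaN if nothing valid is left."""
    clean = [v for v in values if not math.isnan(v)]
    return sum(clean) / len(clean) if clean else math.nan


if __name__ == "__main__":
    print(len(alias["values"]), len(shallow["values"]), len(deep["values"]))
    print(nan_equals_itself, math.isnan(naive_sum))
    print(mean_ignoring_nan(original["values"]))
```

Whether missing values should be ignored, replaced or treated as an error is an analysis decision. The code only shows that the decision has to be made explicitly.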

Data Science Tool & Library Knowledge

Once a Data Scientist has loaded the data into his favorite tool, for example one from IBM or SAS or an open source alternative such as Octave, the core work has only just begun. These tools are not self-explanatory, however, which is why there is a wide range of certification options for various Data Science tools. Many (if not most) Data Scientists work directly with a programming language, but this alone is not enough to perform statistical data analysis or machine learning effectively: we use Data Science libraries (packages) that provide data structures and methods as a foundation and thus extend the programming language into a real Data Science toolset. Such a library, for example Scikit-Learn for Python, is a collection of methods implemented in the programming language. Using such libraries has to be learned, however, and reliable application requires familiarization and practical experience.

When it comes to Big Data Analytics, the analysis of particularly large data, we enter the field of Distributed Computing. Tools (frameworks) such as Apache Hadoop, Apache Spark or Apache Flink allow us to process and analyze data in parallel on multiple servers. These tools also provide their own libraries for machine learning, such as Mahout, MLlib and FlinkML.

Data Science Method Knowledge

A Data Scientist is not simply an operator of tools; he uses the tools to apply his analysis methods to the data he has selected in order to reach the project goals. These analysis methods are, for example, descriptive statistics, estimation methods or hypothesis tests. Somewhat more mathematical are the methods of machine learning for data mining, such as clustering or dimensionality reduction, or methods geared more toward automated decision making through classification or regression.

Machine learning methods generally do not work out of the box; they have to be improved using optimization methods such as gradient descent. A Data Scientist must be able to detect under- and overfitting, and must demonstrate that the prediction results are accurate enough for the planned deployment.

Special applications require special knowledge, which applies, for example, to the fields of image recognition (Visual Computing) or the processing of human language (Natural Language Processing). At this point, we open the door to deep learning.
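As a rough illustration of gradient-based optimization and of checking the result on data the model has not seen, here is a minimal pure-Python sketch that fits a straight line by gradient descent on the mean squared error. The data points, the learning rate and the number of steps are arbitrary assumptions for this toy example; in a real project a library would normally do this work.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w * x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    # Toy data generated from y = 2x + 1 (hypothetical, noise-free).
    train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
    valid_x, valid_y = [4.0, 5.0], [9.0, 11.0]

    w, b = fit_line(train_x, train_y)
    print("w, b:", w, b)
    print("training error:  ", mse(train_x, train_y, w, b))
    print("validation error:", mse(valid_x, valid_y, w, b))
```

Comparing the error on the training data with the error on held-out validation data is the simplest way to spot trouble: a high error on both hints at underfitting, while a low training error combined with a much higher validation error hints at overfitting.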

Expertise

Data Science is not an end in itself, but a discipline that would like to answer questions from other expertise fields with data. For this reason, Data Science is very diverse. Business economists need data scientists to analyze financial transactions, for example, to identify fraud scenarios or to better understand customer needs, or to optimize supply chains. Natural scientists such as geologists, biologists or experimental physicists also use Data Science to make their observations with the aim of gaining knowledge. Engineers want to better understand the situation and relationships between machinery or vehicles, and medical professionals are interested in better diagnostics and medication for their patients.

In order to support a specific department with his or her knowledge of data, tools and analysis methods, every Data Scientist needs at least a minimum of the relevant domain knowledge. Anyone who wants to produce analyses for buyers, engineers, natural scientists, physicians, lawyers or other interested parties must also be able to understand those people's professions.

The Narrower Definition of Data Science

While the Data Science pioneers have long had established and highly specialized teams, smaller companies are still looking for the Data Science all-rounder who can take over the full range of tasks, from database access to the implementation of the analytical application. Companies with specialized data experts, however, have long distinguished between Data Scientists, Data Engineers and Business Analysts. The definition of Data Science, and the delineation of the abilities a Data Scientist should have, therefore varies between a broader and a narrower demarcation.


A closer look at the narrower definition shows that a Data Engineer takes care of providing the data, while the Data Scientist loads it into his tools and runs the data analysis together with the colleagues from the business department. According to this view, a Data Scientist would need no knowledge of databases or APIs, nor would any domain expertise be necessary …

In my experience, Data Science is not that narrow; the range of tasks covers more than just the core area. This misunderstanding comes from Data Science courses, and I find that I have to point to the overall picture of Data Science again and again. In courses and seminars that aim to teach Data Science as a discipline, the focus will of course be on the core area: programming, tools and methods from mathematics & statistics.