Posts

Jobprofil des Data Engineers

Warum Data Engineering der Data Science in Bedeutung und Berufschancen längst die Show stiehlt, dabei selbst ebenso einem stetigen Wandel unterliegt.

Was ein Data Engineer wirklich können muss

Der Data Scientist als sexiest Job des 21. Jahrhunderts? Mag sein, denn der Job hat seinen ganz speziellen Reiz, auch auf Grund seiner Schnittstellenfunktion zwischen Technik und Fachexpertise. Doch das Spotlight der kommenden Jahre gehört längst einem anderen Berufsbild aus der Datenwertschöpfungskette – das zeigt sich auch bei den Gehältern.

Viele Unternehmen sind gerade auf dem Weg zum Data-Driven Business, einer Unternehmensführung, die für ihre Entscheidungen auf transparente Datengrundlagen setzt und unter Einsatz von Business Intelligence, Data Science sowie der Automatisierung mit Deep Learning und RPA operative Prozesse so weit wie möglich automatisiert. Die Lösung für diese Aufgabenstellungen werden oft vor allem bei den Experten für Prozessautomatisierung und Data Science gesucht, dabei hängt der Erfolg jedoch gerade viel eher von der Beschaffung valider Datengrundlagen ab, und damit von einer ganz anderen entscheidenden Position im Workflow datengetriebener Entscheidungsprozesse, dem Data Engineer.

Data Engineer, der gefragteste Job des 21sten Jahrhunderts?

Der Job des Data Scientists hingegen ist nach wie vor unter Studenten und Absolventen der MINT-Fächer gerade so gefragt wie nie, das beweist der tägliche Ansturm der vielen Absolventen aus Studiengängen rund um die Data Science auf derartige Stellenausschreibungen. Auch mangelt es gerade gar nicht mehr so sehr an internationalen Bewerben mit Schwerpunkt auf Statistik und Machine Learning. Der solide ausgebildete und bestenfalls noch deutschsprachige Data Scientist findet sich zwar nach wie vor kaum im Angebot, doch insgesamt gute Kandidaten sind nicht mehr allzu schwer zu finden. Seit Jahren sind viele Qualifizierungsangebote für Studenten sowie Arbeitskräfte am Markt auch günstig und ganz flexibel online verfügbar, ohne dabei Abstriche bei beim Ansehen dieser Aus- und Fortbildungsmaßnahmen in Kauf nehmen zu müssen.

Was ein Data Scientist fachlich in Sachen Expertise alles abdecken muss, hatten wir ganz ausführlich über Betrachtung des Data Science Knowledge Stack besprochen.

Doch was bringt ein Data Scientist, wenn dieser gar nicht über die Daten verfügt, die für seine Aufgaben benötigt werden? Sicherlich ist die Aufgabe eines jeden Data Scientists auch die Vorbereitung und Präsentation seiner Vorhaben. Die Heranschaffung und Verwaltung großer Datenmengen in einer Enterprise-fähigen Architektur ist jedoch grundsätzlich nicht sein Schwerpunkt und oft fehlen ihm dafür auch die Berechtigungen in einer Enterprise-IT. Noch konkreter wird der Bedarf an Datenbeschaffung und -aufbereitung in der Business Intelligence, denn diese benötigt für nachhaltiges Reporting feste Strukturen wie etwa ein Data Warehouse.

Das Profil des Data Engineers: Big Data High-Tech

Auch wenn Data Engineering von Hochschulen und Fortbildungsanbietern gerade noch etwas stiefmütterlich behandelt werden, werden der Einsatz und das daraus resultierende Anforderungsprofil eines Data Engineers am Markt recht eindeutig skizziert. Einsatzszenarien für diese Dateningenieure – auch auf Deutsch eine annehmbare Benennung – sind im Kern die Erstellung von Data Warehouse und Data Lake Systeme, mittlerweile vor allem auf Cloud-Plattformen. Sie entwickeln diese für das Anzapfen von unternehmensinternen sowie -externen Datenquellen und bereiten die gewonnenen Datenmengen strukturell und inhaltlich so auf, dass diese von anderen Mitarbeitern des Unternehmens zweckmäßig genutzt werden können.

Enabler für Business Intelligence, Process Mining und Data Science

Kein Data Engineer darf den eigentlichen Verbraucher der Daten aus den Augen verlieren, für den die Daten nach allen Regeln der Kunst zusammengeführt, bereinigt und in das Zielformat gebracht werden sollen. Klassischerweise arbeiten die Engineers am Data Warehousing für Business Intelligence oder Process Mining, wofür immer mehr Event Logs benötigt werden. Ein Data Warehouse ist der unter Wasser liegende, viel größere Teil des Eisbergs der Business Intelligence (BI), der die Reports mit qualifizierten Daten versorgt. Diese Eisberg-Analogie lässt sich auch insgesamt auf das Data Engineering übertragen, der für die Endanwender am oberen Ende der Daten-Nahrungskette meistens kaum sichtbar ist, denn diese sehen nur die fertigen Analysen und nicht die dafür vorbereiteten Datentöpfe.

Abbildung 1 - Data Engineering ist der Mittelpunkt einer jeden Datenplattform. Egal ob für Data Science, BI, Process Mining oder sogar RPA, die Datenanlieferung bedingt gute Dateningenieure, die bis hin zur Cloud Infrastructure abtauchen können.

Abbildung 1 – Data Engineering ist der Mittelpunkt einer jeden Datenplattform. Egal ob für Data Science, BI, Process Mining oder sogar RPA, die Datenanlieferung bedingt gute Dateningenieure, die bis hin zur Cloud Infrastructure abtauchen können.

Datenbanken sind Quelle und Ziel der Data Engineers

Daten liegen selten direkt in einer einzigen CSV-Datei strukturiert vor, sondern entstammen einer oder mehreren Datenbanken, die ihren eigenen Regeln unterliegen. Geschäftsdaten, beispielsweise aus ERP- oder CRM-Systemen, liegen in relationalen Datenbanken vor, oftmals von Microsoft, Oracle, SAP oder als eine Open-Source-Alternative. Besonders im Trend liegen derzeitig die Cloud-nativen Datenbanken BigQuery von Google, Redshift von Amazon und Synapse von Microsoft sowie die cloud-unabhängige Datenbank snowflake. Dazu gesellen sich Datenbanken wie der PostgreSQL, Maria DB oder Microsoft SQL Server sowie CosmosDB oder einfachere Cloud-Speicher wie der Microsoft Blobstorage, Amazon S3 oder Google Cloud Storage. Welche Datenbank auch immer die passende Wahl für das Unternehmen sein mag, ohne SQL und Verständnis für normalisierte Daten läuft im Data Engineering nichts.

Andere Arten von Datenbanken, sogenannte NoSQL-Datenbanken beruhen auf Dateiformaten, einer Spalten- oder einer Graphenorientiertheit. Beispiele für verbreitete NoSQL-Datenbanken sind MongoDB, CouchDB, Cassandra oder Neo4J. Diese Datenbanken exisiteren nicht nur als Unterhaltungswert gelangweilter Nerds, sondern haben ganz konkrete Einsatzgebiete, in denen sie jeweils die beste Performance im Lesen oder Schreiben der Daten bieten.

Ein Data Engineer muss demnach mit unterschiedlichen Datenbanksystemen zurechtkommen, die teilweise auf unterschiedlichen Cloud Plattformen heimisch sind.

Data Engineers brauchen Hacker-Qualitäten

Liegen Daten in einer Datenbank vor, können Analysten mit Zugriff einfache Analysen bereits direkt auf der Datenbank ausführen. Doch wie bekommen wir die Daten in unsere speziellen Analyse-Tools? Hier muss der Engineer seinen Dienst leisten und die Daten aus der Datenbank exportieren können. Bei direkten Datenanbindungen kommen APIs, also Schnittstellen wie REST, ODBC oder JDBC ins Spiel und ein guter Data Engineer benötigt Programmierkenntnisse, bevorzugt in Python, diese APIs ansprechen zu können. Etwas Kenntnis über Socket-Verbindungen und Client-Server-Architekturen zahlt sich dabei manchmal aus. Ferner sollte jeder Data Engineer mit synchronen und asynchronen Verschlüsselungsverfahren vertraut sein, denn in der Regel wird mit vertraulichen Daten gearbeitet. Ein Mindeststandard an Sicherheit gehört zum Data Engineering und darf keinesfalls nur Datensicherheitsexperten überlassen werden, eine Affinität zu Netzwerksicherheit oder gar Penetration-Testing ist positiv zu bewerten, mindestens aber ein sauberes Berechtigungsmanagement gehört zu den Grundfähigkeiten. Viele Daten liegen nicht strukturiert in einer Datenbank vor, sondern sind sogenannte unstrukturierte oder semi-strukturierte Daten aus Dokumenten oder aus Internetquellen. Mit Methoden wie Data Web Scrapping und Data Crawling sowie der Automatisierung von Datenabrufen beweisen herausragende Data Engineers sogar echte Hacker-Qualitäten.

Dirigent der Daten: Orchestrierung von Datenflüssen

Eine der Kernaufgaben des Data Engineers ist die Entwicklung von ETL-Strecken, um Daten aus Quellen zu Extrahieren, zu in das gewünschte Zielformat zu Transformieren und schließlich in die Zieldatenbank zu Laden. Dies mag erstmal einfach klingen, wird jedoch zur echten Herausforderung, wenn viele ETL-Prozesse sich zu ganzen ETL-Ketten und -Netzwerken zusammenfügen, diese dabei trotz hochfrequentierter Datenabfrage performant laufen müssen. Die Orchestrierung der Datenflüsse kann in der Regel in mehrere Etappen unterschieden werden, von der Quelle ins Data Warehouse, zwischen den Ebenen im Data Warehouse sowie vom Data Warehouse in weiterführende Systeme, bis hin zum Zurückfließen verarbeiteter Daten in das Data Warehouse (Reverse ETL).

Hart an der Grenze zu DevOp: Automatisierung in Cloud-Architekturen

In den letzten Jahren sind Anforderungen an Data Engineers deutlich gestiegen, denn neben dem eigentlichen Verwalten von Datenbeständen und -strömen für Analysezwecke wird zunehmend erwartet, dass ein Data Engineer auch Ressourcen in der Cloud managen, mindestens jedoch die Datenbanken und ETL-Ressourcen. Darüber hinaus wird zunehmend jedoch verlangt, IT-Netzwerke zu verstehen und das ganze Zusammenspiel der Ressourcen auch als Infrastructure as Code zu automatisieren. Auch das automatisierte Deployment von Datenarchitekturen über CI/CD-Pipelines macht einen Data Engineer immer mehr zum DevOp.

Zukunfts- und Gehaltsaussichten

Im Vergleich zum Data Scientist, der besonders viel Methodenverständnis für Datenanalyse, Statistik und auch für das zu untersuchende Fachgebiet benötigt, sind Data Engineers mehr an Tools und Plattformen orientiert. Ein Data Scientist, der Deep Learning verstanden hat, kann sein Wissen zügig sowohl mit TensorFlow als auch mit PyTorch anwenden. Ein Data Engineer hingegen arbeitet intensiver mit den Tools, die sich über die Jahre viel zügiger weiterentwickeln. Ein Data Engineer für die Google Cloud wird mehr Einarbeitung benötigen, sollte er plötzlich auf AWS oder Azure arbeiten müssen.

Ein Data Engineer kann in Deutschland als Einsteiger mit guten Vorkenntnissen und erster Erfahrung mit einem Bruttojahresgehalt zwischen 45.000 und 55.000 EUR rechnen. Mehr als zwei Jahre konkrete Erfahrung im Data Engineering wird von Unternehmen gerne mit Gehältern zwischen 50.000 und 80.000 EUR revanchiert. Darüber liegen in der Regel nur die Data Architects / Datenarchitekten, die eher in großen Unternehmen zu finden sind und besonders viel Erfahrung voraussetzen. Weitere Aufstiegschancen für Data Engineers sind Berater-Karrieren oder Führungspositionen.

Wer einen Data Engineer in Festanstellung gebracht hat, darf sich jedoch nicht all zu sicher fühlen, denn Personalvermittler lauern diesen qualifizierten Fachkräften an jeder Ecke des Social Media auf. Gerade in den Metropolen wie Berlin schaffen es längst nicht alle Unternehmen, jeden Data Engineer über Jahre hinweg zu beschäftigen. Bei der großen Auswahl an Jobs und Herausforderungen fällt diesen Datenexperten nicht schwer, seine Gehaltssteigerungen durch Jobwechsel proaktiv voranzutreiben.

6 Steps of Process Mining – Infographic

Many Process Mining projects mainly revolve around the selection and introduction of the right Process Mining tools. Relying on the right tool is of course an important aspect in the Process Mining project. Depending on whether the process analysis project is a one-time affair or daily process monitoring, different tools are pre-selected. Whether, for example, a BI system has already been established and whether a sophisticated authorization concept is required for the process analyzes also play a role in the selection, as do many other factors.

Nevertheless, it should not be forgotten that process mining is not primarily a tool, but an analysis method, in which the first part is about the reconstruction of the processes from operational IT systems in a resulting process log (event log), the second step is about a (core) graph analysis to visualize the process flows with additional analysis/reporting elements. If this perspective on process mining is not lost sight of, companies can save a lot of costs because it allows them to concentrate on solution-oriented concepts.

However, completely independent of the tools, there is a very general procedure in this data-driven process analysis you should understand and which we would like to describe with the following infographic:

DATANOMIQ Process Mining - 6 Steps of Doing Process Mining Analysis

6 Steps of Process Mining – Infographic PDF Download.

Interested in introducing Process Mining to your organization? Do not hesitate to get in touch with us!

DATANOMIQ is the independent consulting and service partner for business intelligence, process mining and data science. We are opening up the diverse possibilities offered by big data and artificial intelligence in all areas of the value chain. We rely on the best minds and the most comprehensive method and technology portfolio for the use of data for business optimization.

5 Apache Spark Best Practices

Already familiar with the term big data, right? Despite the fact that we would all discuss Big Data, it takes a very long time before you confront it in your career. Apache Spark is a Big Data tool that aims to handle large datasets in a parallel and distributed manner. Apache Spark began as a research project at UC Berkeley’s AMPLab, a student, researcher, and faculty collaboration centered on data-intensive application domains, in 2009. 

Introduction

Spark’s aim is to create a new framework that was optimized for quick iterative processing, such as machine learning and interactive data analysis while retaining Hadoop MapReduce’s scalability and fault-tolerant. Spark outperforms Hadoop in many ways, reaching performance levels that are nearly 100 times higher in some cases. Spark has a number of components for various types of processing, all of which are based on Spark Core. Today we will be going to discuss in brief the Apache  Spark and 5 of its best practices to look forward to-

What is Apache Spark?

Apache Spark is an open-source distributed system for big data workforces. For fast analytic queries against another size of data, it uses in-memory caching and optimised query execution. It is a parallel processing framework for grouped computers to operate large-scale data analytics applications. This could handle packet and real-time data processing and predictive analysis workloads.

It claims to support code reuse all over multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing—and offers development APIs in Java, Scala, Python, and R. With 365,000 meetup members in 2017, Apache Spark is becoming one of the most renowned big data distributed processing frameworks. Explore for Apache Spark Tutorial for more information.

5 best practices of Apache Spark

1. Begin with a small sample of the data.

Because we want to make big data work, we need to start with a small sample of data to see if we’re on the right track. In my project, I sampled 10% of the data and verified that the pipelines were working properly. This allowed me to use the SQL section of the Spark UI to watch the numbers grow throughout the flow while not having to wait too long for it to complete.

In my experience, if you attain your preferred runtime with a small sample, scaling up is usually simple.

2. Spark troubleshooting

For transformations, Spark seems to have a lazy loading behaviour. That is, it will not initiate the transformation computation; instead, it will keep records of the transformation requested. This makes it difficult to determine where in our code there are bugs or areas that need to be optimised. Splitting the code into sections with df.cache() and then using df.count() to force Spark to calculate the df at every section was one practise that we found useful.

Spark actions seem to be keen in that they cause the underlying action to perform a computation. So, if you’ve had a Spark action which you only call when it’s required, pay attention. A Spark action, for instance, is count() on a dataset. You can now inspect the computation of each section using the spark UI and identify any issues. It’s important to note that if you don’t use the sampling we mentioned in (1), you’ll probably end up with a very long runtime that’s difficult to debug.

Check out Apache Spark Training & Certification Course to get yourself certified in Apache Spark with industry-level skills.

3. Finding and resolving Skewness is a difficult task.

Having to look at the stage specifics in the spark UI and looking for just a major difference between both the max and median can help you find the Skewness:

Let’s begin with a definition of Skewness. As previously stated, our data is divided into partitions, and the size of each partition will most likely change as the progress of transformation. This can result in a large difference in size between partitions, indicating that our data is skew. This implies that a few of the tasks were markedly slower than the rest.

Why is this even a bad thing? Because it may cause other stages to stand in line for these few tasks, leaving cores idle. If you understand where all the Skewness has been coming from, you can fix it right away by changing the partitioning.

4. Appropriately cache

Spark allows you to cache datasets in memory. There are a variety of options to choose from:

  • Since the same operation has been computed several times in the pipeline flow, cache it.
  • To allow the required cache setting, use the persist API to enable caching (persist to disc or not; serialized or not).
  • Be cognizant of lazy loading and, if necessary, prime cache up front. Some APIs are eager, while others aren’t.
  • To see information about the datasets you’ve cached, go to the Storage tab in the Spark UI.
  • It’s a good idea to unpersist your cached datasets after you’ve finished using them to free up resources, especially if other people are using the cluster.

5. Spark has issues with iterative code.

It was particularly difficult. Spark uses lazy evaluation so that when the code is run, it only creates a computational graph, a DAG. Once you have an iterative process, however, this method can be very problematic so because DAG finally opens the prior iteration and then becomes extremely large, we mean extremely large. This may be too large for the driver to remember. Because the application is stuck, this makes it appear in the spark UI as if no jobs are running (which is correct) for an extended period of time — until the driver crashes.

This seems to be presently an obvious issue with Spark, and the workaround that worked for me was to use df.checkpoint() / df.reset() / df.reset() / df.reset() / df.reset() / df. every 5–6 iterations, call localCheckpoint() (find your number by experimenting a bit). This works because, unlike cache(), checkpoint() breaks the lineage and the DAG, saves the results and starts from a new checkpoint. The disadvantage is that you don’t have the entire DAG to recreate the df if something goes wrong.

Conclusion

Spark is now one of the most popular projects inside the Hadoop ecosystem, with many companies using it in conjunction with Hadoop to process large amounts of data. In June 2013, Spark was acknowledged into the Apache Software Foundation’s (ASF) entrepreneurial context, and in February 2014, it was designated as an Apache Top-Level Project. Spark could indeed run by itself, on Apache Mesos, or on Apache Hadoop, which is the most common. Spark is used by large enterprises working with big data applications because of its speed and ability to connect multiple types of databases and run various types of analytics applications.

Learning how to make Spark work its magic takes time, but these 5 practices will help you move your project forward and sprinkle some spark charm on your code.

6 Ways to Optimize Your Database for Performance

Knowing how to optimize your organization’s database for maximum performance can lead to greater efficiency, productivity, and user satisfaction. While it may seem challenging at first, there are a few easy performance tuning tips that you can get started with.

1. Use Indexing

Indexing is one of the core ways to give databases a performance boost. There are different ways of approaching indexing, but they all have the same goal: decreasing query wait time by making it easier to find and access data.

Indexes have a search key attached to a value or data reference. The index file will direct a query to a record, “bucket” of data, or group of data, depending on the indexing method used. Choosing a good indexing method for your specific needs will reduce strain on your system by making it much easier for data to be located, since a uniform, systematic organization is applied to the entire database.

2. Avoid Using Loops

Many coders learn early on that loops can be both useful and dangerous. It is all too easy to accidentally create an infinite loop and crash your whole system.

Loops are problematic when it comes to database performance because they often are looping redundantly. That is not to say that loops should never be used; they are useful sometimes. It simply depends on the specific case, and removing or minimizing unnecessary loops will help increase performance.

For example, having SQL queries inside of loops is not generally advised, because the system is running the same query numerous times rather than just once. A good rule of thumb is that the more data you have in a loop, the slower it is going to be.

3. Get a Stronger CPU

This fix is a classic in computer science. A CPU with better specs will increase system performance. There are ways, like those above, to increase performance within your system’s organization and coding. However, if you find that your database is consistently struggling to keep up, your hardware might be in need of an upgrade.

Even if the CPU you have seems like it should be sufficient, a CPU that is more powerful than your minimum requirements will be able to handle waves of queries with ease. The more data you are working with and the more queries you need to manage, the stronger your CPU needs to be.

4. Defragment Data

Data defragmentation is a common solution for performance issues. When data gets accessed, written, and rewritten many times, it can get fragmented from all that copying. It is good practice to go in and clean things up on a regular basis.

One symptom of fragmented data is clogged memory, where tables are taking up more room than they should. Crammed memory, as discussed below, is another common cause of a low-performing database.

5. Optimize Queries

There are many ways to go about optimizing queries, depending on the indexing method and the specific needs of your database. When queries aren’t being handled efficiently, the whole system can get backed up, leading to longer wait times for query results. Causes may include duplicate or overlapping indexes and keys or queries that return data that isn’t relevant.

Optimizing queries can be a complex process, but there are some easy steps you can take to work out the best plan for your database and identify its specific inefficiencies.

6. Optimize Memory

Another hardware fix that may help under-performing databases is additional memory space. Databases need some memory “wiggle room” to operate quickly. When your memory is nearly or completely full, things get backed up while the system struggles to find room for creating temporary files and moving things around. It’s a bit like trying to reorganize a living room that is packed floor-to-ceiling with boxes. You need plenty of empty floor space to maneuver and shuffle things around.

Databases work the same way. Increasing your database’s memory capacity will allow it more flexibility and operating room, reducing stress on the system so it can run more efficiently.

Keep It Simple

Many of the tips above focus on simplifying and cleaning up your database. Keep your coding as straightforward and easy-to-navigate as possible. Databases are all about accessing information, so your main priority should simply be to have a well-organized system. Keeping these tips in mind will help you do just that and get your database running at top-notch performance!

ACID vs BASE Concepts

Understanding databases for storing, updating and analyzing data requires the understanding of two concepts: ACID and BASE. This is the first article of the article series Data Warehousing Basics.

The properties of ACID are being applied for databases in order to fulfill enterprise requirements of reliability and consistency.

ACID is an acronym, and stands for:

  • Atomicity – Each transaction is either properly executed completely or does not happen at all. If the transaction was not finished the process reverts the database back to the state before the transaction started. This ensures that all data in the database is valid even if we execute big transactions which include multiple statements (e. g. SQL) composed into one transaction updating many data rows in the database. If one statement fails, the entire transaction will be aborted, and hence, no changes will be made.
  • Consistency – Databases are governed by specific rules defined by table formats (data types) and table relations as well as further functions like triggers. The consistency of data will stay reliable if transactions never endanger the structural integrity of the database. Therefor, it is not allowed to save data of different types into the same single column, to use written primary key values again or to delete data from a table which is strictly related to data in another table.
  • Isolation – Databases are multi-user systems where multiple transactions happen at the same time. With Isolation, transactions cannot compromise the integrity of other transactions by interacting with them while they are still in progress. It guarantees data tables will be in the same states with several transactions happening concurrently as they happen sequentially.
  • Durability – The data related to the completed transaction will persist even in cases of network or power outages. Databases that guarante Durability save data inserted or updated permanently, save all executed and planed transactions in a recording and ensure availability of the data committed via transaction even after a power failure or other system failures If a transaction fails to complete successfully because of a technical failure, it will not transform the targeted data.

ACID Databases

The ACID transaction model ensures that all performed transactions will result in reliable and consistent databases. This suits best for businesses which use OLTP (Online Transaction Processing) for IT-Systems such like ERP- or CRM-Systems. Furthermore, it can also be a good choice for OLAP (Online Analytical Processing) which is used in Data Warehouses. These applications need backend database systems which can handle many small- or medium-sized transactions occurring simultaneous by many users. An interrupted transaction with write-access must be removed from the database immediately as it could cause negative side effects impacting the consistency(e.g., vendors could be deleted although they still have open purchase orders or financial payments could be debited from one account and due to technical failure, never credited to another).

The speed of the querying should be as fast as possible, but even more important for those applications is zero tolerance for invalid states which is prevented by using ACID-conform databases.

BASE Concept

ACID databases have their advantages but also one big tradeoff: If all transactions need to be committed and checked for consistency correctly, the databases are slow in reading and writing data. Furthermore, they demand more effort if it comes to storing new data in new formats.

In chemistry, a base is the opposite to acid. The database concepts of BASE and ACID have a similar relationship. The BASE concept provides several benefits over ACID compliant databases asthey focus more intensely on data availability of database systems without guarantee of safety from network failures or inconsistency.

The acronym BASE is even more confusing than ACID as BASE relates to ACID indirectly. The words behind BASE suggest alternatives to ACID.

BASE stands for:

  • Basically Available – Rather than enforcing consistency in any case, BASE databases will guarantee availability of data by spreading and replicating it across the nodes of the database cluster. Basic read and write functionality is provided without liabilityfor consistency. In rare cases it could happen that an insert- or update-statement does not result in persistently stored data. Read queries might not provide the latest data.
  • Soft State – Databases following this concept do not check rules to stay write-consistent or mutually consistent. The user can toss all data into the database, delegating the responsibility of avoiding inconsistency or redundancy to developers or users.
  • Eventually Consistent –No guarantee of enforced immediate consistency does not mean that the database never achieves it. The database can become consistent over time. After a waiting period, updates will ripple through all cluster nodes of the database. However, reading data out of it will stay always be possible, it is just not certain if we always get the last refreshed data.

All the three above mentioned properties of BASE-conforming databases sound like disadvantages. So why would you choose BASE? There is a tradeoff compared to ACID. If databases do not have to follow ACID properties then the database can work much faster in terms of writing and reading from the database. Further, the developers have more freedom to implement data storage solutions or simplify data entry into the database without thinking about formats and structure beforehand.

BASE Databases

While ACID databases are mostly RDBMS, most other database types, known as NoSQL databases, tend more to conform to BASE principles. Redis, CouchDB, MongoDB, Cosmos DB, Cassandra, ElasticSearch, Neo4J, OrientDB or ArangoDB are just some popular examples. But other than ACID, BASE is not a strict approach. Some NoSQL databases apply at least partly to ACID rules or provide optional functions to get almost or even full ACID compatibility. These databases provide different level of freedom which can be useful for the Staging Layer in Data Warehouses or as a Data Lake, but they are not the recommended choice for applications which need data environments guaranteeing strict consistency.

Data Warehousing Basiscs

Data Warehousing is applied Big Data Management and a key success factor in almost every company. Without a data warehouse, no company today can control its processes and make the right decisions on a strategic level as there would be a lack of data transparency for all decision makers. Bigger comanies even have multiple data warehouses for different purposes.

In this series of articles I would like to explain what a data warehouse actually is and how it is set up. However, I would also like to explain basic topics regarding Data Engineering and concepts about databases and data flows.

To do this, we tick off the following points step by step:

 

In-memory Caching in Finance

Big data has been gradually creeping into a number of industries through the years, and it seems there are no exceptions when it comes to what type of business it plans to affect. Businesses, understandably, are scrambling to catch up to new technological developments and innovations in the areas of data processing, storage, and analytics. Companies are in a race to discover how they can make big data work for them and bring them closer to their business goals. On the other hand, consumers are more concerned than ever about data privacy and security, taking every step to minimize the data they provide to the companies whose services they use. In today’s ever-connected, always online landscape, however, every company and consumer engages with data in one way or another, even if indirectly so.

Despite the reluctance of consumers to share data with businesses and online financial service providers, it is actually in their best interest to do so. It ensures that they are provided the best experience possible, using historical data, browsing histories, and previous purchases. This is why it is also vital for businesses to find ways to maximize the use of data so they can provide the best customer experience each time. Even the more traditional industries like finance have gradually been exploring the benefits they can gain from big data. Big data in the financial services industry refers to complex sets of data that can help provide solutions to the business challenges financial institutions and banking companies have faced through the years. Considered today as a business imperative, data management is increasingly leveraged in finance to enhance processes, their organization, and the industry in general.

How Caching Can Boost Performance in Finance

In computing, caching is a method used to manage frequently accessed data saved in a system’s main memory (RAM). By using RAM, this method allows quick access to data without placing too much load on the main data stores. Caching also addresses the problems of high latency, network congestion, and high concurrency. Batch jobs are also done faster because request run times are reduced—from hours to minutes and from minutes to mere seconds. This is especially important today, when a host of online services are available and accessible to users. A delay of even a few seconds can lead to lost business, making both speed and performance critical factors to business success. Scalability is another aspect that caching can help improve by allowing finance applications to scale elastically. Elastic scalability ensures that a business is equipped to handle usage peaks without impacting performance and with the minimum required effort.

Below are the main benefits of big data and in-memory caching to financial services:

  • Big data analytics integration with financial models
    Predictive modeling can be improved significantly with big data analytics so it can better estimate business outcomes. Proper management of data helps improve algorithmic understanding so the business can make more accurate predictions and mitigate inherent risks related to financial trading and other financial services.
    Predictive modeling can be improved significantly with big data analytics so it can better estimate business outcomes. Proper management of data helps improve algorithmic understanding so the business can make more accurate predictions and mitigate inherent risks related to financial trading and other financial services.
  • Real-time stock market insights
    As data volumes grow, data management becomes a vital factor to business success. Stock markets and investors around the globe now rely on advanced algorithms to find patterns in data that will help enable computers to make human-like decisions and predictions. Working in conjunction with algorithmic trading, big data can help provide optimized insights to maximize portfolio returns. Caching can consequently make the process smoother by making access to needed data easier, quicker, and more efficient.
  • Customer analytics
    Understanding customer needs and preferences is the heart and soul of data management, and, ultimately, it is the goal of transforming complex datasets into actionable insights. In banking and finance, big data initiatives focus on customer analytics and providing the best customer experience possible. By focusing on the customer, companies are able to Ieverage new technologies and channels to anticipate future behaviors and enhance products and services accordingly. By building meaningful customer relationships, it becomes easier to create customer-centric financial products and seize market opportunities.
  • Fraud detection and risk management
    In the finance industry, risk is the primary focus of big data analytics. It helps in identifying fraud and mitigating operational risk while ensuring regulatory compliance and maintaining data integrity. In this aspect, an in-memory cache can help provide real-time data that can help in identifying fraudulent activities and the vulnerabilities that caused them so that they can be avoided in the future.

What Does This Mean for the Finance Industry?

Big data is set to be a disruptor in the finance sector, with 70% of companies citing big data as a critical factor of the business. In 2015 alone, financial service providers spent $6.4 billion on data-related applications, with this spending predicted to increase at a rate of 26% per year. The ability to anticipate risk and pre-empt potential problems are arguably the main reasons why the finance industry in general is leaning toward a more data-centric and customer-focused model. Data analysis is also not limited to customer data; getting an overview of business processes helps managers make informed operational and long-term decisions that can bring the company closer to its objectives. The challenge is taking a strategic approach to data management, choosing and analyzing the right data, and transforming it into useful, actionable insights.

Web Scraping Using R..!

In this blog, I’ll show you, How to Web Scrape using R..?

What is R..?

R is a programming language and its environment built for statistical analysis, graphical representation & reporting. R programming is mostly preferred by statisticians, data miners, and software programmers who want to develop statistical software.

R is also available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form.

Reasons to choose R

Reasons to choose R

Let’s begin our topic of Web Scraping using R.

Step 1- Select the website & the data you want to scrape.

I picked this website “https://www.alexa.com/topsites/countries/IN” and want to scrape data of Top 50 sites in India.

Data we want to scrape

Data we want to scrape

Step 2- Get to know the HTML tags using SelectorGadget.

In my previous blog, I already discussed how to inspect & find the proper HTML tags. So, now I’ll explain an easier way to get the HTML tags.

You have to go to Google chrome extension (chrome://extensions) & search SelectorGadget. Add it to your browser, it’s a quite good CSS selector.

Step 3- R Code

Evoking Important Libraries or Packages

I’m using RVEST package to scrape the data from the webpage; it is inspired by libraries like Beautiful Soup. If you didn’t install the package yet, then follow the code in the snippet below.

Step 4- Set the url of the website

Step 5- Find the HTML tags using SelectorGadget

It’s quite easy to find the proper HTML tags in which your data is present.

Firstly, I have to click on data using SelectorGadget which I want to scrape, it automatically selects the data which are similar to selected HTML tags. Before going forward, cross-check the selected values, are they correct or some junk data is also gets selected..? If you noticed our page has only 50 values, but you can see 156 values are selected.

Selection by SelectorGadget

Selection by SelectorGadget

So I need to remove unwanted values who get selected, once you click on them to deselect it, it turns red and others will turn yellow except our primary selection which turn to green. Now you can see only 50 values are selected as per our primary requirement but it’s not enough. I have to again cross-check that some required values are not exchanged with junk values.

If we satisfy with our selection then copy the HTML tag & include it into the code, else repeat this exercise.

Modified Selection by SelectorGadget

Step 6- Include the tag in our Code

After including the tags, our code is like this.

Code Snippet

If I run the code, values in each list object will be 50.

Data Stored in List Objects

Step 7- Creating DataFrame

Now, we create a dataframe with our list-objects. So for creating a dataframe, we always need to remember one thumb rule that is the number of rows (length of all the lists) should be equal, else we get an error.

Error appears when number of rows differs

Finally, Our DataFrame will look like this:

Our Final Data

Step 8- Writing our DataFrame to CSV file

We need our scraped data to be available locally for further analysis & model building or other purposes.

Our final piece of code to write it in CSV file is:

Writing to CSV file

Step 9- Check the CSV file

Data written in CSV file

Conclusion-

I tried to explain Web Scraping using R in a simple way, Hope this will help you in understanding it better.

Find full code on

https://github.com/vgyaan/Alexa/blob/master/webscrap.R

If you have any questions about the code or web scraping in general, reach out to me on LinkedIn!

Okay, we will meet again with the new exposer.

Till then,

Happy Coding..!

Data Science in Engineering Process - Product Lifecycle Management

How to develop digital products and solutions for industrial environments?

The Data Science and Engineering Process in PLM.

Huge opportunities for digital products are accompanied by huge risks

Digitalization is about to profoundly change the way we live and work. The increasing availability of data combined with growing storage capacities and computing power make it possible to create data-based products, services, and customer specific solutions to create insight with value for the business. Successful implementation requires systematic procedures for managing and analyzing data, but today such procedures are not covered in the PLM processes.

From our experience in industrial settings, organizations start processing the data that happens to be available. This data often does not fully cover the situation of interest, typically has poor quality, and in turn the results of data analysis are misleading. In industrial environments, the reliability and accuracy of results are crucial. Therefore, an enormous responsibility comes with the development of digital products and solutions. Unless there are systematic procedures in place to guide data management and data analysis in the development lifecycle, many promising digital products will not meet expectations.

Various methodologies exist but no comprehensive framework

Over the last decades, various methodologies focusing on specific aspects of how to deal with data were promoted across industries and academia. Examples are Six Sigma, CRISP-DM, JDM standard, DMM model, and KDD process. These methodologies aim at introducing principles for systematic data management and data analysis. Each methodology makes an important contribution to the overall picture of how to deal with data, but none provides a comprehensive framework covering all the necessary tasks and activities for the development of digital products. We should take these approaches as valuable input and integrate their strengths into a comprehensive Data Science and Engineering framework.

In fact, we believe it is time to establish an independent discipline to address the specific challenges of developing digital products, services and customer specific solutions. We need the same kind of professionalism in dealing with data that has been achieved in the established branches of engineering.

Data Science and Engineering as new discipline

Whereas the implementation of software algorithms is adequately guided by software engineering practices, there is currently no established engineering discipline covering the important tasks that focus on the data and how to develop causal models that capture the real world. We believe the development of industrial grade digital products and services requires an additional process area comprising best practices for data management and data analysis. This process area addresses the specific roles, skills, tasks, methods, tools, and management that are needed to succeed.

Figure: Data Science and Engineering as new engineering discipline

More than in other engineering disciplines, the outputs of Data Science and Engineering are created in repetitions of tasks in iterative cycles. The tasks are therefore organized into workflows with distinct objectives that clearly overlap along the phases of the PLM process.

Feasibility of Objectives
  Understand the business situation, confirm the feasibility of the product idea, clarify the data infrastructure needs, and create transparency on opportunities and risks related to the product idea from the data perspective.
Domain Understanding
  Establish an understanding of the causal context of the application domain, identify the influencing factors with impact on the outcomes in the operational scenarios where the digital product or service is going to be used.
Data Management
  Develop the data management strategy, define policies on data lifecycle management, design the specific solution architecture, and validate the technical solution after implementation.
Data Collection
  Define, implement and execute operational procedures for selecting, pre-processing, and transforming data as basis for further analysis. Ensure data quality by performing measurement system analysis and data integrity checks.
Modeling
  Select suitable modeling techniques and create a calibrated prediction model, which includes fitting the parameters or training the model and verifying the accuracy and precision of the prediction model.
Insight Provision
  Incorporate the prediction model into a digital product or solution, provide suitable visualizations to address the information needs, evaluate the accuracy of the prediction results, and establish feedback loops.

Real business value will be generated only if the prediction model at the core of the digital product reliably and accurately reflects the real world, and the results allow to derive not only correct but also helpful conclusions. Now is the time to embrace the unique chances by establishing professionalism in data science and engineering.

Authors

Peter Louis                               

Peter Louis is working at Siemens Advanta Consulting as Senior Key Expert. He has 25 years’ experience in Project Management, Quality Management, Software Engineering, Statistical Process Control, and various process frameworks (Lean, Agile, CMMI). He is an expert on SPC, KPI systems, data analytics, prediction modelling, and Six Sigma Black Belt.


Ralf Russ    

Ralf Russ works as a Principal Key Expert at Siemens Advanta Consulting. He has more than two decades experience rolling out frameworks for development of industrial-grade high quality products, services, and solutions. He is Six Sigma Master Black Belt and passionate about process transparency, optimization, anomaly detection, and prediction modelling using statistics and data analytics.4


Test-data management  support in Test Automation Development

Data is centric in testing of several applications because data is critical to organizations. Businesses are becoming more data-driven, and hence it is imperative that as Automation Test developers, the value of the test-data is understood and  completely harnessed during Test Automation development. The test-data involved in both Manual/Automation testing encompasses the test-data inputs, test-data outputs, and the test-data flow.

TestProject.io is the world’s first free cloud-based, community-powered test automation platform which caters to this important aspect of Test Automation development. The tool successfully adheres to the importance of keeping test-data centric in Automation Test solutions.

To start with, organizing and managing test data is very easy in TestProject. We are aware that as an application gets bigger and more tests are added, test data management becomes more difficult. This tool allows easy and clear management of the elements, tests, parameters by helping the Automation Test Developer associate data, be as an input or output in the UI as follows:

The tool makes the tests maintainable by allowing the Test data to be easily added, deleted, modified  making it  flexible in the perspective when business  requirements change. It also allows test data to be associated with Web, Android and iOS apps, allowing several types of input – web pages, JSON, PDFs etc. The test data can be also tested on several browsers such as Chrome, Firefox, Safari, Edge, Internet Explorer.

TestProject enables easy collaboration in a test automation team- by allowing/dis-allowing sharing of the test cases, test data etc as and when applicable. Eventually the team has shareable test repository which can be easily managed and controlled.

Sharing of parameters is available in levels –Test level and Project level. For example,

Hence, because of this, the test data can be easily re-usable, without having to mention the same test data repeatedly in some cases.

TestProject also has a “Secret Parameter” feature built in the smart test recorder that allows storing sensitive test data in an encrypted state.

There are also powerful Addons available in TestProject that can help the Automation Developers complete their tasks easily and quickly .For example, there are several  Random Data Generator Addons available. ‘Random Login Credentials Addon’ is one such Addon which generates random credentials to be entered for several tests.  Similarly, there are many more Random data generators available, such as for generating random dates, character/word/number etc as per several requirements. This definitely makes the job of an Automation developer much easier, and helps save time.

In TestProject, we can choose the input data source to be the default input parameters or to be associated with the data- driven method as follows :

The Data-driven Testing method of testing is necessarily important in cases when the coverage of any data variable comes into picture. We are aware that Data driven tests are tests that run multiple times, but with different values for some of the variables in the test. For example if you wanted to test that the username field on a login page could handle several different types of inputs you could create a separate test for each input, or you could use a data driven tests to drive the same login test multiple times, but just using a different username input each time. We are aware that Data-driven Testing is a very good approach if you have huge volumes of data to be tested for the same scripts.

One such support for Data driven testing in this tool is the Parameterization of variables. Once the parameters are added, like in the screenshot below, the parameter can be navigated to and picked for use.

In order to run a ‘Data-driven’ test, the Automation Developer would need to associate the test with various Data Sources. One such example is as follows, where the Developer can associate the test with the input CSV data source as follows:

Since it supports Data-driven test development, it results in stronger Test Coverage. That is, large volume of data can be managed and executed thereby improving regression testing and better coverage.

Speaking about data sources, TestProject also provides addons that help to work with several database as PostgreSQL, MySQL, MSSQL, Db2, Oracle. The tool can be easily linked with the databases by providing details as:

All this also shows the fact that the tool clearly separates the test cases and the test data and hence allows testers to test their applications using different data values and parameters without the need for changing test script/cases. While making a change in data sets such as addition, or deletion, doesn’t have implication with test cases.

Also, once the test is generated by the Automation developer, it can be viewed both in the ‘Manual Test’ view or the ‘Test document’ view. In both cases, once either of the options are chosen and they are downloaded, the test data is clearly mentioned in their respective columns in the documents.

For example, the ‘Manual Test’ document that gets generated automatically shows the Test Data used as,

And, the ‘Test’ document that gets generated automatically shows the Test Data’s default values used as,

While assesing the test results,  the tool clearly gives details on failures, helping the automation developer to easily debug the issue/ decide to open a defect. For example, the details are clearly showed as :

TestProject.io tool can also be easily integrated with many other tools, such as Jenkins, qTest, Slack etc, and the testcases/test data etc are easily synced during this association. Example, in the cases of Jenkins, we can associate the build step by linking it with the TestProject data source as follows:

Eventually, TestProject has emerged as a powerful test Automation framework, having very attractive features especially to the fact that it imparts the value of Test-data being centric in the  Automation Test tasks. Along with the fact that the tool supports the ideology of having the test-data to be the driving base to the whole Test Automation framework process, it  also enables sharing and syncing with other teams and tools during the development, management and execution of the Test Automation Solution.