Posts

ACID vs BASE Concepts

Understanding databases for storing, updating and analyzing data requires an understanding of two concepts: ACID and BASE. This is the first article of the series Data Warehousing Basics.

ACID properties are applied to databases to meet enterprise requirements for reliability and consistency.

ACID is an acronym, and stands for:

  • Atomicity – Each transaction is either executed completely or not at all. If a transaction cannot finish, the database reverts to the state it was in before the transaction started. This keeps the data valid even for large transactions that combine multiple statements (e.g. SQL) updating many rows. If one statement fails, the entire transaction is aborted and no changes are made.
  • Consistency – Databases are governed by rules defined by table formats (data types), table relations and further mechanisms such as triggers. Data stays consistent as long as transactions never endanger the structural integrity of the database. Therefore, it is not allowed to store data of different types in the same column, to reuse primary key values that are already in use, or to delete data from a table while data in another table still strictly depends on it.
  • Isolation – Databases are multi-user systems in which multiple transactions happen at the same time. With isolation, transactions cannot compromise the integrity of other transactions by interacting with them while they are still in progress. It guarantees that running several transactions concurrently leaves the tables in the same state as running them one after another.
  • Durability – The data of a completed transaction persists even in the event of network or power outages. Databases that guarantee durability store inserted or updated data permanently, record all executed and planned transactions in a log, and keep committed data available even after a power failure or other system failure. If a transaction fails to complete because of a technical failure, it does not change the targeted data.
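The atomicity and consistency properties above can be illustrated with a minimal sketch. It uses SQLite through Python's built-in sqlite3 module only because it needs no server; the accounts table, the names alice and bob, and the helper functions make_bank, transfer and balances are assumptions made for illustration.

```python
import sqlite3


def make_bank():
    """Create a hypothetical in-memory bank with two accounts."""
    conn = sqlite3.connect(":memory:")
    # The CHECK constraint is the consistency rule: no negative balances.
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, sender, receiver, amount):
    """Move money between two accounts; return True if committed."""
    try:
        # The connection as context manager commits if the block succeeds
        # and rolls back every statement in it if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit made just
        # before it was rolled back as well.
        return False


def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If a transfer of 500 is attempted from an account holding 100, the credit to the receiver is applied first, the debit then fails, and the rollback undoes both, so neither balance changes.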

ACID Databases

The ACID transaction model ensures that all performed transactions result in reliable and consistent databases. It is best suited to businesses that use OLTP (Online Transaction Processing) for IT systems such as ERP or CRM systems. It can also be a good choice for OLAP (Online Analytical Processing), which is used in data warehouses. These applications need backend database systems that can handle many small or medium-sized transactions issued simultaneously by many users. An interrupted transaction with write access must be rolled back immediately, as it could otherwise cause side effects that damage consistency (e.g., vendors could be deleted although they still have open purchase orders, or a payment could be debited from one account and, due to a technical failure, never credited to another).

Queries should be as fast as possible, but even more important for those applications is zero tolerance for invalid states, which ACID-compliant databases prevent.

BASE Concept

ACID databases have their advantages but also one big tradeoff: because every transaction has to be checked for consistency and committed correctly, reads and writes tend to be slower. They also demand more effort when it comes to storing new data in new formats.

In chemistry, a base is the opposite of an acid. The database concepts of BASE and ACID have a similar relationship. BASE databases offer several benefits over ACID-compliant databases, as they focus more on data availability, without guaranteeing safety from network failures or inconsistency.

The acronym BASE is even more confusing than ACID as BASE relates to ACID indirectly. The words behind BASE suggest alternatives to ACID.

BASE stands for:

  • Basically Available – Rather than enforcing consistency in every case, BASE databases guarantee availability of data by spreading and replicating it across the nodes of the database cluster. Basic read and write functionality is provided without any guarantee of consistency. In rare cases, an insert or update statement may not result in persistently stored data, and read queries might not return the latest data.
  • Soft State – Databases following this concept do not check rules to stay write-consistent or mutually consistent. The user can toss all data into the database, delegating the responsibility of avoiding inconsistency or redundancy to developers or users.
  • Eventually Consistent – The lack of enforced immediate consistency does not mean that the database never achieves it. The database can become consistent over time: after a waiting period, updates ripple through all cluster nodes. Reading data always remains possible; it is just not certain that we always get the most recent data.
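The behaviour described by these three properties can be sketched with a toy model. This is not how any particular NoSQL database replicates data; the classes Replica and EventuallyConsistentStore and the explicit propagate step are assumptions that stand in for background replication between cluster nodes.

```python
class Replica:
    """One node of the hypothetical cluster, holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy cluster: a write lands on one node and reaches the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) not yet applied

    def write(self, key, value):
        # The write is acknowledged as soon as the first node has it.
        self.replicas[0].data[key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        # Every node always answers, but the answer may be stale.
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Stand-in for background replication: apply queued updates in order.
        for index, key, value in self.pending:
            self.replicas[index].data[key] = value
        self.pending.clear()
```

Between write and propagate, a read from the first node returns the new value while reads from the other nodes still return the old one. After propagate, all nodes agree.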

All three properties of BASE-conforming databases sound like disadvantages. So why would you choose BASE? There is a tradeoff compared to ACID: if a database does not have to follow ACID properties, it can write and read data much faster. Furthermore, developers have more freedom to implement data storage solutions or to simplify data entry without thinking about formats and structure beforehand.

BASE Databases

While ACID databases are mostly RDBMS, most other database types, known as NoSQL databases, tend to conform to BASE principles. Redis, CouchDB, MongoDB, Cosmos DB, Cassandra, Elasticsearch, Neo4j, OrientDB and ArangoDB are just some popular examples. Unlike ACID, however, BASE is not a strict approach. Some NoSQL databases follow ACID rules at least in part or provide optional functions for almost or even full ACID compliance. These databases provide different levels of freedom, which can be useful for the staging layer in data warehouses or for a data lake, but they are not the recommended choice for applications that need data environments guaranteeing strict consistency.

Data Warehousing Basics

Data Warehousing is applied Big Data Management and a key success factor in almost every company. Without a data warehouse, no company today can control its processes and make the right decisions on a strategic level, as there would be a lack of data transparency for all decision makers. Bigger companies even have multiple data warehouses for different purposes.

In this series of articles I would like to explain what a data warehouse actually is and how it is set up. However, I would also like to explain basic topics regarding Data Engineering and concepts about databases and data flows.

To do this, we tick off the following points step by step:

 

What Is Data Lake Architecture?

The volume of information produced by everyone in the world is growing exponentially. To put it in perspective, it’s estimated that by 2023 the big data analytics market will reach $103 billion.

Finding workable solutions for storing big data is a challenge. It’s no easy task to hold enormous amounts of information, clean it and transform it into understandable subsets, so it’s best to take one step at a time.

Some reasons why companies access their big data are to:

  • Improve their consumer experience
  • Draw conclusions and make data-driven decisions
  • Identify potential problems
  • Create innovative products

There are ways to help define big data. Combining its characteristics with storage management methods helps experts make their clients’ information digestible and understandable. Cue data lakes, which are repositories for big data in its native form.

Think of an actual lake with multiple water sources around the perimeter flowing into it. Picture these as three types of data: structured, semi-structured and unstructured. All this information can remain in a data lake and be accessed in its raw form at any time, making it an attractive storage method.

Here’s how data lakes are created, some of their components and how to avoid common pitfalls.

Creating a Data Lake

One benefit of creating and implementing a data lake is that structuring becomes much more manageable.  Pulling necessary information from a lake allows analysts to compare and contrast data and communicate any connections between datasets to their client.

There are four steps to follow when setting up a data lake:

  1. Choosing a software solution: Microsoft, Amazon and Google are cloud vendors that allow developers to create data lakes without using servers.
  2. Identifying where data is sourced: Where is your information coming from? Once sources are identified, determine how your data will be cleaned or transformed.
  3. Defining process and automation: It’s vital to outline how information should be processed once the data lake ingests it. This creates consistency for businesses.
  4. Establishing retrieval governance: Choosing who has access to what types of information is crucial for companies with multiple locations and departments. It helps with overall organization. For this reason, data lakes are primarily accessed by data scientists.

The next step would be to determine the extract, transform and load (ETL) process. ETL pulls data out of its sources, converts it into a consistent format and loads it into a target system. When information from a data lake is sent to a warehouse this way, it can be analyzed.
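A small ETL run might look like the following sketch. The raw order data, the column names and the functions extract, transform and load are all assumptions made for illustration; a real pipeline would read from files or source systems and load into an actual warehouse instead of an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export as it might sit in a data lake: inconsistent
# spacing and capitalization, and one row with a missing amount.
RAW_ORDERS = """order_id,customer,amount
1, Alice ,19.99
2,bob,
3,Carol,5.00
"""


def extract(raw_text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop incomplete rows, tidy names, store amounts in cents."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows without an amount
        cleaned.append(
            (
                int(row["order_id"]),
                row["customer"].strip().title(),
                round(float(amount) * 100),
            )
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        " order_id INTEGER PRIMARY KEY,"
        " customer TEXT,"
        " amount_cents INTEGER)"
    )
    with conn:  # commit all rows together or none of them
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
```

Calling load(transform(extract(RAW_ORDERS)), sqlite3.connect(":memory:")) runs the three steps in order. Keeping them as separate functions makes it easier to test and replace each step on its own.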

Components of a Data Lake

Here is what happens to information once a data lake is created:

  • Collection: Data comes in from various sources.
  • Ingestion: Data is processed using management software.
  • Blending: Data is combined from multiple sources.
  • Transformation: Data is analyzed and made sense of.
  • Publication: Data can be used to drive business decisions.

There are other aspects of a data lake to keep in mind. These are the critical components that help provide business solutions:

  • Security: Data lakes require security to protect information — they do not have built-in safety measures.
  • Governance: Determine who can check on the quality of data and perform measurements.
  • Metadata: This provides information about other data to improve understanding.
  • Stewardship: Choose one or more employees to take on the responsibility of managing data.
  • Monitoring: Employ other software to perform the ETL process.

Big data lends itself to incorporating multiple processes to make it usable for companies. The volume of information one company produces is massive — to manage it, experts need to consider these components and steps when building a data lake.

What to Avoid When Using Data Lakes

The last thing people want for their data lake is to see it turn into a swamp. When big data is processed incorrectly, its value decreases, making it useless to the business sourcing it.

The first step in avoiding a common pitfall is to consider the sustainability of the data lake. Planning processes are necessary to ensure it’s secure, and governing and regulating incoming information will allow for long-term use.

A lack of security is another problem that can arise in data lakes. Safety measures must be implemented. Because enterprises will build data lakes for different purposes, it’s easy for information to become unorganized and vulnerable to hacking. With security, the likelihood of data breaches decreases, and the quality of data remains high.

The most important thing to remember about data lakes is the planning stage. Without proper preparation, they tend to be overwhelming due to their size and complexity. Taking the time and care to establish the processes ahead of time is vital.

Using Data Lake Architecture for Business

Data lakes store massive amounts of information to be used later on to create subsets, analyze metadata and more. Their advantages allow businesses to be flexible, save money and have access to raw information at all times.

In-memory Caching in Finance

Big data has been gradually creeping into a number of industries through the years, and it seems there are no exceptions when it comes to what type of business it plans to affect. Businesses, understandably, are scrambling to catch up to new technological developments and innovations in the areas of data processing, storage, and analytics. Companies are in a race to discover how they can make big data work for them and bring them closer to their business goals. On the other hand, consumers are more concerned than ever about data privacy and security, taking every step to minimize the data they provide to the companies whose services they use. In today’s ever-connected, always online landscape, however, every company and consumer engages with data in one way or another, even if indirectly so.

Despite the reluctance of consumers to share data with businesses and online financial service providers, it is actually in their best interest to do so. It ensures that they are provided the best experience possible, using historical data, browsing histories, and previous purchases. This is why it is also vital for businesses to find ways to maximize the use of data so they can provide the best customer experience each time. Even the more traditional industries like finance have gradually been exploring the benefits they can gain from big data. Big data in the financial services industry refers to complex sets of data that can help provide solutions to the business challenges financial institutions and banking companies have faced through the years. Considered today as a business imperative, data management is increasingly leveraged in finance to enhance processes, their organization, and the industry in general.

How Caching Can Boost Performance in Finance

In computing, caching is a method used to manage frequently accessed data saved in a system’s main memory (RAM). By using RAM, this method allows quick access to data without placing too much load on the main data stores. Caching also addresses the problems of high latency, network congestion, and high concurrency. Batch jobs are also done faster because request run times are reduced—from hours to minutes and from minutes to mere seconds. This is especially important today, when a host of online services are available and accessible to users. A delay of even a few seconds can lead to lost business, making both speed and performance critical factors to business success. Scalability is another aspect that caching can help improve by allowing finance applications to scale elastically. Elastic scalability ensures that a business is equipped to handle usage peaks without impacting performance and with the minimum required effort.
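The idea can be sketched as a small in-memory cache placed in front of a slower data store. The class TTLCache, its loader callback and the expiry time are assumptions made for illustration and do not reflect a specific caching product.

```python
import time


class TTLCache:
    """Keep recently loaded values in memory for a limited time."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader      # function that queries the slow data store
        self._ttl = ttl_seconds    # how long a cached value stays valid
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]  # served from RAM, the data store is not touched
        # Missing or expired: load from the data store and remember the result.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value
```

Repeated requests for the same key within the expiry window are answered from memory, which is where the reduced latency and the lower load on the main data store come from. The expiry time limits how stale a cached value can become.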

Below are the main benefits of big data and in-memory caching to financial services:

  • Big data analytics integration with financial models
    Predictive modeling can be improved significantly with big data analytics so it can better estimate business outcomes. Proper management of data helps improve algorithmic understanding so the business can make more accurate predictions and mitigate inherent risks related to financial trading and other financial services.
  • Real-time stock market insights
    As data volumes grow, data management becomes a vital factor to business success. Stock markets and investors around the globe now rely on advanced algorithms to find patterns in data that will help enable computers to make human-like decisions and predictions. Working in conjunction with algorithmic trading, big data can help provide optimized insights to maximize portfolio returns. Caching can consequently make the process smoother by making access to needed data easier, quicker, and more efficient.
  • Customer analytics
    Understanding customer needs and preferences is the heart and soul of data management, and, ultimately, it is the goal of transforming complex datasets into actionable insights. In banking and finance, big data initiatives focus on customer analytics and providing the best customer experience possible. By focusing on the customer, companies are able to leverage new technologies and channels to anticipate future behaviors and enhance products and services accordingly. By building meaningful customer relationships, it becomes easier to create customer-centric financial products and seize market opportunities.
  • Fraud detection and risk management
    In the finance industry, risk is the primary focus of big data analytics. It helps in identifying fraud and mitigating operational risk while ensuring regulatory compliance and maintaining data integrity. In this aspect, an in-memory cache can help provide real-time data that can help in identifying fraudulent activities and the vulnerabilities that caused them so that they can be avoided in the future.

What Does This Mean for the Finance Industry?

Big data is set to be a disruptor in the finance sector, with 70% of companies citing big data as a critical factor of the business. In 2015 alone, financial service providers spent $6.4 billion on data-related applications, with this spending predicted to increase at a rate of 26% per year. The ability to anticipate risk and pre-empt potential problems is arguably the main reason why the finance industry in general is leaning toward a more data-centric and customer-focused model. Data analysis is also not limited to customer data; getting an overview of business processes helps managers make informed operational and long-term decisions that can bring the company closer to its objectives. The challenge is taking a strategic approach to data management, choosing and analyzing the right data, and transforming it into useful, actionable insights.

Operational Data Store vs. Data Warehouse

One of the main problems with large amounts of data, especially in this age of data-driven tools and near-instant results, is how to store the data. With proper storage also comes the challenge of keeping the data updated, and this is the reason why organizations focus on solutions that will help make data processing faster and more efficient. For many, a digital transformation is on their roadmap, thanks in large part to the changes brought about by the global COVID-19 pandemic. The problem is that organizations often assume that it’s similar to traditional change initiatives, which couldn’t be further from the truth. Digital transformations come with a number of challenges to prepare for, and without proper planning, non-unified data storage systems and systems of record implemented through the years can slow down or even hinder the process.

Businesses have relied on two main solutions for data storage for many years: traditional data warehouses and operational data stores (ODS). These key data structures provide assistance when it comes to boosting business intelligence so that the business can make sound corporate decisions based on data. Before considering which one will work for your business, it’s important to understand the main differences between the two.

What is a Data Warehouse?

Data warehousing is a common practice because a data warehouse is designed to support business intelligence tools and activities. It’s subject-oriented so data is centered on customers, products, sales, or other subjects that contribute to the business bottom line. Because data comes from a multitude of sources, a data warehouse is also designed to consolidate large amounts of data in a variety of formats, including flat files, legacy database management systems, and relational database management systems. It’s considered an organization’s single source of truth because it houses historical records built through time, which could become invaluable as a source of actionable insights.

One of the main disadvantages of a data warehouse is its non-volatile nature. Non-volatile data is read-only and, therefore, not frequently updated or deleted over time. This leads to some time variance, which means that a data warehouse only stores a time series of periodic data snapshots that show the state of data during specific periods. As such, data loading and data retrieval are the most vital operations for a data warehouse.

What is an Operational Data Store?

Forward-thinking companies turn to an operational data store to resolve the issues with data warehousing, primarily the issue of always keeping data up to date. Similar to a data warehouse, an ODS can aggregate data from multiple sources and report across multiple systems of record to provide a more comprehensive view of the data. It’s essentially a staging area that can receive operational data from transactional sources and can be queried directly. This allows data analytics tools to query ODS data as it’s received from the respective source systems. This offloads the burden from the transactional systems by only providing access to current data that’s queried in an integrated manner. This makes an ODS the ideal solution for those looking for near-real-time data that’s processed quickly and efficiently.
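The difference between the two structures can be sketched in a few lines. The classes OperationalDataStore and DataWarehouse and their methods are simplified assumptions: the ODS overwrites each record with its latest state, while the warehouse only ever appends dated snapshots.

```python
from datetime import date


class OperationalDataStore:
    """Holds only the current state of each record."""

    def __init__(self):
        self.current = {}

    def upsert(self, key, record):
        # Overwrite in place: the previous state is gone after this call.
        self.current[key] = record


class DataWarehouse:
    """Append-only history made of periodic snapshots."""

    def __init__(self):
        self.snapshots = []

    def load_snapshot(self, as_of, ods):
        # Copy the state of every ODS record, stamped with the snapshot date.
        for key, record in ods.current.items():
            self.snapshots.append({"as_of": as_of, "key": key, **record})

    def history(self, key):
        # All recorded states of one record, in load order.
        return [row for row in self.snapshots if row["key"] == key]
```

Querying the ODS answers the question of what a record looks like right now, while the warehouse history shows how it looked at each load date.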

Traditional ODS solutions, however, typically suffer from high latency because they are based on either relational databases or disk-based NoSQL databases. These systems simply can’t handle large amounts of data and provide high performance at the same time, which is a common requirement of most modern applications. The limited scalability of traditional systems also leads to performance issues when multiple users access the data store all at the same time. As such, traditional ODS solutions are incapable of providing real-time API services for accessing systems of record.

A Paradigm Shift

As modern real-time digital applications replace previously offline services, companies are going through a paradigm shift and venturing beyond what traditional data storage systems can offer. This has led to the rise of a new breed of ODS solutions that Gartner refers to as digital integration hubs. It’s a cost-effective solution because it doesn’t require a rip-and-replace if you already have a traditional ODS in place. Adopting a digital integration hub can be as simple as augmenting your current system with the missing layers, including the microservices API, smart cache, and event-driven architecture.

While sticking with a data warehouse or traditional ODS may not necessarily hurt your business, the benefits of modernization via a digital integration hub are too great to ignore. Significant improvements in throughput, availability, and scalability will help organizations become more agile so they can drive innovation quicker, helping their industry and pushing the limits of technology further to open up possibilities never before discovered.

How the Pandemic is Changing the Data Analytics Outsourcing Industry

While media pundits have largely focused on the impact of COVID-19 as far as human health is concerned, it hasn’t been particularly good for the health of automated systems either. As cybersecurity budgets plummet in the face of dwindling finances, computer criminals have taken the opportunity to increase attacks against high value targets.

In June, an online antique store suffered a data breach that exposed over 3 million records, and it’s likely that a number of similar attacks have simply gone unpublished. Fortunately, data scientists are hard at work developing new methods of fighting back against these kinds of breaches. Budget constraints and a lack of personnel as a result of the pandemic continue to be a problem, but automation has helped to assuage the issue to some degree.

AI-Driven Data Storage Systems

Big data experts have long promoted the cloud as an ideal metaphor for the way that data is stored remotely, but as a result few people today consider where this information is physically stored. All data has to be located on some sort of physical storage device. Even so-called serverless apps have to be distributed from a server unless they’re fully deployed using P2P services.

Since software can never truly replace hardware, researchers are looking at refining the various abstraction layers that exist between servers and the clients who access them. Data warehousing software has enabled computer scientists to construct centralized data storage solutions that look like traditional disk locations. This gives users the ability to securely interact with resources that are encrypted automatically.

Background services based on artificial intelligence monitor virtual data warehouse locations, which gives specialists the freedom to conduct whatever analytics they deem necessary. In some cases, a data warehouse can even anonymize information as it’s stored, which can streamline workflows involved with the analysis process.

While this level of automation has proven useful, it’s still subject to some of the problems that have occurred as a result of the pandemic. Traditional supply chains are in shambles and a large percentage of technical workers are now telecommuting. If there’s a problem with any existing big data plans, then there’s often nobody around to do any work in person.

Living with Shifting Digital Priorities

Many businesses were in the process of outsourcing their data operations even before the pandemic, and the current situation is speeding this up considerably. Initial industry estimates had projected steady growth numbers for the data analytics sector through 2025. While the current figures might not be quite as bullish, it’s likely that sales of outsourcing contracts will remain high.

That being said, firms are also shifting a large percentage of their IT spending dollars into cybersecurity projects. A recent survey found that 37 percent of business leaders said they were already going to cut their IT department budgets. The same study found that 28 percent of businesses are going to move at least some part of their data analytics programs abroad.

Those companies that can’t find an attractive outsourcing contract might start to patch their remote systems over a virtual private network. Unfortunately, this kind of technology has been strained to some degree in recent months. The virtual servers that power VPNs are flooded with requests, which in turn has brought them down in some instances. Neural networks, which utilize deep learning technology to improve themselves as time goes on, have proven more than capable of predicting when these problems are most likely to arise.

That being said, firms that deploy this kind of technology might find that it still costs more to work with automated technology on-premise compared to simply investing in an outsourcing program that works with these kinds of algorithms at an outside location.

Saving Money in the Time of Corona

Experts from Think Big Analytics pointed out how specialist organizations can deal with a much wider array of technologies than a small business ever could. Since these companies specialize in providing support for other organizations, they have a tendency to offer support for a large number of platforms.

These representatives recently opined that they could provide support for NoSQL, Presto, Apache Spark and several other emerging platforms at the same time. Perhaps most importantly, these organizations can work with Hadoop and other established data analysis tools.

Staffers working on data mining operations have long relied on tools like Hadoop and languages like R to write scripts that automate the process of collecting and analyzing data. By working with an organization that already supports the technology a company relies on, the company can avoid having to change its existing operations.

This can help to drastically reduce the cost of migration, which is extremely important since many of the firms that need to migrate to a remote system are already suffering from budget problems. Assuming that some issues related to the pandemic continue to plague businesses for some time, it’s likely that these budget constraints will force IT departments to consider a migration even if they would have otherwise relied solely on a traditional colocation arrangement.

IT department staffers were already moving away from many rare platforms even before the COVID-19 pandemic hit, however, so this shouldn’t be as much of a herculean task as it sounds. For instance, the KNIME Analytics Platform has increased in popularity exponentially since its release in 2006. The fact that it supports over 1,000 plug-in modules has made it easy for smaller businesses to move toward the platform.

The road ahead isn’t going to be all that pleasant, however. COBOL and other antiquated languages still rule the roost at many governmental big data processing centers. At the same time, some small businesses have never even been able to put a big data plan into play in the first place. As the pandemic continues to wreak havoc on the world’s economy, however, it’s likely that there will be no shortage of organizations continuing to migrate to more secure third-party platforms backed by outsourcing contracts.

Six Properties of Modern Business Intelligence

Regardless of the industry you work in, you need information systems that evaluate your business data and give you a basis for decisions. These systems are commonly referred to as Business Intelligence (BI). In practice, most BI systems suffer from shortcomings that can be remedied. Beyond that, modern BI can partially automate decisions and enable comprehensive analyses with a high degree of flexibility in use.


Let us discuss the six properties that characterize modern Business Intelligence. They involve attention to technical detail, but always in the context of a larger vision for the company's own BI:

1.      A Uniform, High-Quality Data Basis (Single Source of Truth)

Surely every managing director knows the situation in which the managers cannot agree on how much cost and revenue is actually incurred in detail and what the margins per category really look like. And when they do agree, this information is often available only months too late.

In every company, hundreds or even thousands of decisions have to be made at the operational level every day. With good information, the bulk of them can be made on a much sounder basis, which increases revenue and saves costs. On the other side, however, stand many source systems from the company's internal IT landscape as well as further external data sources. Gathering and consolidating information often occupies entire groups of employees and leaves plenty of room for human error.

What is needed is a system that provides at least the data most relevant for steering the business, at the right time and in good quality, in a trusted data zone that serves as the single source of truth, also called a single point of truth (SPOT). SPOT is the core of modern Business Intelligence.

Beyond that, further data may also be made available through the BI system, for example data that is useful for specialized analyses and for data scientists. The especially trusted zone, however, is the one through which all decision makers across the company can align with one another.

2.      Flexible use by different stakeholders

Even if all employees across the company should be able to access central, trusted data, a clever architecture does not rule out that every department gets its own views of this data, and even that every suitably qualified employee gets their own view of the data and can create it themselves.

Many BI systems fail to gain company-wide acceptance because certain departments or functionally defined groups of employees are largely excluded from the BI.

Modern BI systems enable views, and the data integration they require, for all stakeholders in the company who depend on information and who benefit equally from the SPOT approach.

3.      Efficient options for extension (Time to Market)

Among the core users of a BI system, dissatisfaction sets in above all when extending or partially redesigning the information system takes a long time. BI systems that have grown historically, were designed poorly and are not particularly adaptable often keep a whole team of IT staff busy and generate a steady stream of tickets with change requests.

Good BI sees itself as a service for its stakeholders with a short time to market. The right design, the right choice of software and the right implementation of data flows and data models lead to considerably shorter development and implementation times for improvements and new features.

Furthermore, not only the technology but also the choice of organizational form is decisive, including how roles and responsibilities are defined, from the technical connection of systems through data provisioning and preparation to analysis and support for end users.

4.      Integrated capabilities for Data Science and AI

Business Intelligence and Data Science are often regarded and managed as separate disciplines. On the one hand, this is because data scientists are often reluctant to work with what they see as boring data models and prepared data. On the other hand, BI is usually already established in the company as a traditional system, despite the many teething problems it still has today.

Data Science, often also called Advanced Analytics, deals with diving deep into data by means of exploratory statistics and methods of data mining (unsupervised machine learning) as well as predictive analytics (supervised machine learning). Deep learning is a subfield of machine learning and is likewise used for data mining or predictive analytics. Machine learning, in turn, is a subfield of Artificial Intelligence (AI).

In the future, BI and Data Science or AI will continue to grow together, because at the latest after going live, the prediction results and their models flow back into Business Intelligence. BI will probably evolve into ABI (Artificial Business Intelligence). Even today, however, many companies already use data mining and predictive analytics, relying on unified or separate platforms with or without integration into the BI.

Modern BI systems also offer data scientists a platform for accessing high-quality data as well as more granular raw data.

5.      Sufficiently high performance

Most readers of these six points will probably have had some experience with slow BI. In many classic BI systems, loading a report that is used daily takes several minutes. If loading a dashboard can be combined with a short coffee break, that may still be acceptable now and then for certain reports. With frequent use at the latest, however, long loading times and unreliable reports are no longer acceptable.

One reason for poor performance is the hardware, which with the use of cloud systems can already be scaled almost linearly to larger data volumes and greater analytical complexity. Using the cloud also makes it possible to separate storage and compute from the data and applications in a modular way, and it is therefore generally recommended. It is not necessarily the right choice for every company, however, and has to fit the company's philosophy.

In fact, performance does not depend on hardware alone. The right choice of software and the right design of data models and data flows play an even more decisive role. While hardware can be replaced or upgraded relatively easily, changing the architecture involves far more effort and BI expertise. Unsuitable data models or data flows will certainly bring even the latest hardware in its maximum configuration to its knees.

6.      Cost-efficient operation and conclusion

Professional cloud platforms that can be used for BI systems, such as Microsoft Azure, Amazon Web Services and Google Cloud, offer total cost calculators. With these calculators, and with guidance from an experienced BI expert, you can not only estimate the costs of using hardware but also evaluate ideas for cost optimization. Nevertheless, the cloud is still not the right solution for every company, and classic calculations for on-premise solutions remain necessary and are also easier to plan than cloud costs.

Cost efficiency can also be increased by choosing suitable software. Proprietary solutions are tied to different licensing models and can only be compared with one another on the basis of usage scenarios. Apart from that, there are also good open source solutions that can largely be used free of charge and are suitable for many use cases without compromise.

The total cost of ownership (TCO) is part of BI management and should always be kept in focus. It would be wrong, however, to assess the costs of a BI only by the costs of hardware and software. A substantial part of cost efficiency goes hand in hand with the performance aspects of the BI system, because suboptimal architectures work wastefully and need more, and more expensive, hardware than cleanly tuned architectures. Providing data centrally in adequate quality can eliminate many unnecessary data preparation processes, and flexible analysis options can make redundant systems unnecessary outright, which leads to savings.

In any case, for companies with many operational processes, having a BI is always cheaper than having none. Today, nothing could be more expensive for a company than being steered by gut feeling alone, because the market does not work that way and offers a great deal of transparency.

Nevertheless, existing BI architectures should be questioned from time to time. On closer inspection with BI expertise, gains in cost efficiency and data transparency are often possible.

Process Mining Tools: Article Series

Process Mining is no longer just a buzzword but a relevant part of Business Intelligence. It comprises the analysis of processes and can be applied to all industries and departments that have operational processes recorded by operational IT systems. To understand the growing importance of this data discipline, a look at the development of worldwide data generation is enough: while the volume was still 2 zettabytes (ZB) in 2010, more than 50 ZB of data are expected for 2020 according to Statista. For 2025, a volume of 175 ZB is anticipated.

[Image: data volume by year]

Figure 1 shows the development of the worldwide data volume (as of 2018). Source: https://www.statista.com/statistics/871513/worldwide-data-created/

Why Process Mining now?

But why does Process Mining in particular benefit from this development? The reason lies in the disorder of this mass of data. The challenge many companies face lies precisely in analyzing this unstructured data. In addition, almost every process leaves data traces in information systems. Looking at processes at the data level therefore holds enormous potential, which is becoming increasingly important in view of this development.

What was Process Mining again?

Process Mining is an analysis methodology that makes it possible to reconstruct real processes from the data traces stored in information systems. These processes can then be displayed and evaluated as a process flow diagram. The classic use cases range from detecting unknown processes (Discovery) through a target/actual comparison (Conformance) to adapting or improving existing processes (Enhancement). In addition, many companies now rely on integrating RPA and Data Science into Process Mining. The depth of analysis will increase further, down to the analysis of individual clicks, which is currently referred to as “Task Mining”.
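To make the Discovery use case more concrete, here is a minimal Python sketch that counts directly-follows relations in an event log, one common starting point for drawing a process flow diagram. The event tuples, the activity names and the function name `directly_follows` are illustrative assumptions and are not taken from any of the tools discussed in this series.

```python
from collections import Counter, defaultdict


def directly_follows(events):
    """Count how often one activity is directly followed by another within a case.

    events: iterable of (case_id, activity, timestamp) tuples.
    Timestamps are assumed to be ISO 8601 strings, so they sort chronologically.
    """
    traces = defaultdict(list)
    for case_id, activity, timestamp in events:
        traces[case_id].append((timestamp, activity))

    edges = Counter()
    for steps in traces.values():
        steps.sort()  # order the events of each case by timestamp
        activities = [activity for _, activity in steps]
        for current, following in zip(activities, activities[1:]):
            edges[(current, following)] += 1
    return edges


# Hypothetical shipping events: (case id, activity, timestamp)
EVENTS = [
    ("A1", "Order placed", "2020-01-01T09:00"),
    ("A1", "Goods packed", "2020-01-01T12:00"),
    ("A1", "Delivered", "2020-01-02T10:00"),
    ("A2", "Order placed", "2020-01-01T10:00"),
    ("A2", "Delivered", "2020-01-03T08:00"),
]

for (source, target), count in sorted(directly_follows(EVENTS).items()):
    print(f"{source} -> {target}: {count}")
```

Each counted pair corresponds to one arrow in a process flow diagram, and the count to its frequency. Real tools add filtering, performance metrics and layout on top of this idea.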

[Image: a typical Process Mining workflow]

Figure 2 shows the typical workflow of a Process Mining project. The ERP system often serves as the central data source. The extracted event logs are then visualized with a Process Mining tool.

In any case, the bulk of the work usually lies in providing and preparing the data and transforming it into so-called “event logs”, which are the input for Process Mining tools. That is why many vendors of Process Mining tools have long been working on solutions to ease the time-consuming and labor-intensive steps associated with data preparation. While almost all tool vendors offer prebuilt event logs for standard processes, some go further and offer full platform solutions that promise efficient integration of the laborious ETL processes. The functional scope of Process Mining tools therefore now goes well beyond pure visualization and, where applicable, also covers new trends and lowers the barriers to entry.

Motivation for this article series

The motivation for writing this article is not to explain the method of Process Mining. There are numerous sources of information on that by now. One particularly recommendable source is the book “Process Mining” by Wil van der Aalst, one of the founding fathers of Process Mining. Rather, the motivation lies in examining the numerous Process Mining tools on the market. As a data consultant, I very often see Process Mining projects dominated up front by the question of the “best” tool. By its nature, this question always has to be answered individually. Since individual projects require an individual choice of tools, I usually work with a broad spectrum of Process Mining tools. In this article series, I therefore want to work out a general overview of the common Process Mining tools. I do not want to rely on personal experience, but to subject the tools to a practical comparison based on test data that readers can follow.

To limit the scope of the article series, the tools are applied and compared only in their core functions. Outstanding functions or characteristics of individual tools are noted, however, and may be explored in more depth in other articles. The aim of this series is to give readers a first insight into the tools available on the market. It therefore addresses beginners in particular, but also advanced Process Mining practitioners who appreciate an overview of the tools and may want to look beyond their own horizons.

The Tools

The group of tools under consideration consists of the following well-known applications:

The selection of tools is based on Gartner's “Market Guide for Process Mining 2019”. I excluded those tools with which I have had little or no contact so far. In my opinion, this selection promises an exciting insight into various Process Mining tools on the market.

Application in practice

To compare the tools realistically, all tools will use the same data basis. This data basis is therefore used for the tool demonstrations throughout the entire article series. I will explain it briefly in the next article.

The aim of the practical investigation is to load the sample data into the various tools in order to visualize the process it contains. In doing so, I want to pay particular attention to how usable and how adaptable or flexible the tools appear to me. At this point I want to state clearly that this comparison and its assessment reflect my opinion and make no claim to completeness. Since the market is in motion, I also reserve the right to update this article series regularly.

The Criteria

In addition to the usability and adaptability of the tools, I want to consider several further aspects. The complete list of criteria is:

  • Usability: How easy is it to perform the analyses? How easy is it to get started?
  • Adaptability: How flexibly does the tool respond to my data and analysis requirements?
  • Integration capability: Which interfaces does the tool provide? Does it run in the cloud as well, or only in the cloud?
  • Scalability: Is the tool able to process large and heterogeneous data?
  • Future viability: What about machine learning, ETL modeling or Task Mining?
  • Pricing: Which model determines the price?

The Data Basis

The data basis is a demo dataset kindly provided by Celonis for the entire article series. This dataset represents a shipping process from the time of purchase to delivery to the customer. The following figure shows the target process.

[Image: variant 1 of the Celonis demo data shown as a graphic]

Figure 4 shows the desired shipping process of the data basis, from the purchase of the product to delivery.

The data basis consists of a 60 GB event log held locally in a Microsoft SQL database. Since this table contains more than 600 million events, the data basis for analyzing the individual tools is limited to an extract of 60 million events. To test the performance of the individual tools, however, the full data basis is used. The extract of the event log table contains 919 different variants and thus offers sufficient complexity to be analyzed with the various tools.
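A variant is a distinct sequence of activities that at least one case follows from start to end. As a rough illustration of how such a variant count can be derived from an event log, here is a small Python sketch. The sample events and the function name `count_variants` are hypothetical; they do not reproduce the Celonis demo dataset or its 919 variants.

```python
from collections import Counter, defaultdict


def count_variants(events):
    """Group events by case and count how many cases share each activity sequence.

    events: iterable of (case_id, activity, timestamp) tuples.
    Timestamps are assumed to be ISO 8601 strings, so they sort chronologically.
    """
    traces = defaultdict(list)
    for case_id, activity, timestamp in events:
        traces[case_id].append((timestamp, activity))

    variants = Counter()
    for steps in traces.values():
        # The ordered tuple of activities identifies the variant of this case.
        variant = tuple(activity for _, activity in sorted(steps))
        variants[variant] += 1
    return variants


# Hypothetical shipping events: (case id, activity, timestamp)
EVENTS = [
    ("A1", "Order placed", "2020-01-01T09:00"),
    ("A1", "Goods packed", "2020-01-01T12:00"),
    ("A1", "Delivered", "2020-01-02T10:00"),
    ("A2", "Order placed", "2020-01-01T10:00"),
    ("A2", "Delivered", "2020-01-03T08:00"),
    ("A3", "Order placed", "2020-01-02T08:00"),
    ("A3", "Goods packed", "2020-01-02T11:00"),
    ("A3", "Delivered", "2020-01-04T09:00"),
]

variants = count_variants(EVENTS)
print(len(variants), "distinct variants")
for variant, cases in variants.most_common():
    print(cases, "case(s):", " -> ".join(variant))
```

On a log with hundreds of millions of events, this grouping would normally be done in the database or in the tool itself rather than in memory. The sketch only shows the logic.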

The following publication schedule applies to this article series and is linked with each publication:

  1. Celonis
  2. PAFnow
  3. MEHRWERK
  4. Fluxicon Disco
  5. Lana Labs (coming soon)
  6. Signavio (coming soon)
  7. Process Gold (coming soon)
  8. Aris Process Mining by Software AG (coming soon)

Simplify Vendor Onboarding with Automated Data Integration

Vendor onboarding is a key business process that involves collecting and processing large data volumes from one or multiple vendors. Business users need vendor information in a standardized format to use it for subsequent data processes. However, consolidating and standardizing data for each new vendor requires IT teams to write code for custom integration flows, which can be a time-consuming and challenging task.

In this blog post, we will talk about automated vendor onboarding and how it is far more efficient and quicker than manually updating integration flows.

Problems with Manual Integration for Vendor Onboarding

During the onboarding process, vendor data needs to be extracted, validated, standardized, transformed, and loaded into the target system for further processing. An integration task like this involves coding, updating, and debugging manual ETL pipelines, which can take days or even weeks.

Every time a vendor comes on board, this process is repeated to load the information for that vendor into the unified business system. Moreover, because vendor data often arrives from disparate sources in a variety of formats (CSV, text, Excel), these ETL pipelines frequently break and require manual fixes.

This level of effort is not sustainable, particularly for large-scale businesses that onboard hundreds of vendors each month. Luckily, there is a faster alternative available that involves no code-writing.

Automated Data Integration

The manual onboarding process can be automated using purpose-built data integration tools.

To help you better understand the advantages, here is a step-by-step guide on how automated data integration for vendor onboarding works:

  1. Vendor data is retrieved from heterogeneous sources such as databases, FTP servers, and web APIs through built-in connectors available in the solution.
  2. The data from each file is validated by passing it through a set of predefined quality rules – this step helps in eliminating records with missing, duplicate, or incorrect data.
  3. Transformations are applied to convert input data into the desired output format or screen vendors based on business criteria. For example, if the vendor data is stored in Excel sheets and the business uses SQL Server for data storage, then the data has to be mapped to the relevant fields in the SQL Server database, which is the destination.
  4. The standardized, validated data is then loaded into a unified enterprise database that you can use as the source of information for business processes. In some cases, this can be a staging database where you can perform further filtering and aggregation to build a consolidated vendor database.
  5. This entire ETL pipeline (Step 1 through Step 4) can then be automated through event-based or time-based triggers in a workflow. For instance, you may want to run the pipeline once every day, or once a new file/data point is available in your FTP server.
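Purpose-built tools hide these steps behind connectors and configuration. To make the flow itself concrete, the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the quality rules and the `vendors` table are hypothetical, an inline CSV string stands in for the vendor file, and an in-memory SQLite database stands in for the enterprise target such as SQL Server.

```python
import csv
import io
import sqlite3

# Hypothetical vendor file; in practice this would come from FTP, an API or a database.
RAW_CSV = """Vendor Name,Tax ID,Credit Score
Acme Supplies,DE123,710
Acme Supplies,DE123,710
Nordwind GmbH,,640
Blue Harbor Ltd,GB987,abc
Cedar Works,US555,580
"""


def extract(text):
    """Step 1: read the raw vendor records."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(rows):
    """Step 2: drop records with missing, incorrect or duplicate data."""
    seen = set()
    valid = []
    for row in rows:
        tax_id = row["Tax ID"].strip()
        score = row["Credit Score"].strip()
        if not tax_id or not score.isdigit():
            continue  # missing tax ID or non-numeric credit score
        if tax_id in seen:
            continue  # duplicate vendor
        seen.add(tax_id)
        valid.append(row)
    return valid


def transform(rows):
    """Step 3: map the source columns to the fields of the target table."""
    return [
        (row["Tax ID"].strip(), row["Vendor Name"].strip(), int(row["Credit Score"]))
        for row in rows
    ]


def load(conn, records):
    """Step 4: write the standardized records to the target database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vendors ("
        "tax_id TEXT PRIMARY KEY, name TEXT NOT NULL, credit_score INTEGER NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO vendors VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn, text):
    """Steps 1 to 4 in one call, so that a trigger or scheduler can invoke it."""
    records = transform(validate(extract(text)))
    load(conn, records)
    return len(records)


conn = sqlite3.connect(":memory:")
loaded = run_pipeline(conn, RAW_CSV)
print(loaded, "vendors loaded")
```

Step 5 is not implemented here. A time-based or event-based trigger would simply call `run_pipeline` whenever a new file arrives or a schedule fires.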

Why Build a Consolidated Database for Vendors?

Once the ETL pipeline runs, you will end up with a consolidated database with complete vendor information. The main benefit of a unified database is that it holds vendor information that has already been filtered.

Most businesses have a strict process for screening vendors that follows a set of predefined rules. For example, you may want to automatically reject vendors that have a poor credit history. With manual data integration, you would need to perform this filtering by writing code. Automated data integration allows you to apply pre-built filters directly within your ETL pipeline to flag or remove vendors with a credit score lower than the specified threshold.

This is just one example; you can perform a wide range of tasks at this level in your ETL pipeline including vendor scoring (calculated based on multiple fields in your data), filtering (based on rules applied to your data), and data aggregation (to add measures to your data) to build a robust vendor database for decision-making and subsequent processes.
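As a rough sketch of what such rules amount to, the following Python example screens vendors by a credit score threshold, computes a simple vendor score from two fields, and aggregates the score per region. The threshold of 600, the weighting, the assumed score range of 300 to 850 and all field names are hypothetical choices for illustration, not rules taken from any particular integration tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Vendor:
    name: str
    region: str
    credit_score: int
    on_time_rate: float  # share of deliveries made on time, 0.0 to 1.0


MIN_CREDIT_SCORE = 600  # hypothetical screening threshold


def vendor_score(vendor):
    """Scoring: combine several fields into one number (hypothetical weights)."""
    credit_part = (vendor.credit_score - 300) / 550  # assumes a 300 to 850 scale
    return round(0.6 * credit_part + 0.4 * vendor.on_time_rate, 3)


def screen(vendors, threshold=MIN_CREDIT_SCORE):
    """Filtering: split vendors into accepted and flagged by credit score."""
    accepted = [v for v in vendors if v.credit_score >= threshold]
    flagged = [v for v in vendors if v.credit_score < threshold]
    return accepted, flagged


def average_score_by_region(vendors):
    """Aggregation: average vendor score per region."""
    scores = defaultdict(list)
    for vendor in vendors:
        scores[vendor.region].append(vendor_score(vendor))
    return {region: round(sum(vals) / len(vals), 3) for region, vals in scores.items()}


# Hypothetical vendor records
VENDORS = [
    Vendor("Acme Supplies", "EU", 710, 0.95),
    Vendor("Cedar Works", "US", 580, 0.90),
    Vendor("Blue Harbor Ltd", "EU", 640, 0.80),
]

accepted, flagged = screen(VENDORS)
print("accepted:", [v.name for v in accepted])
print("flagged:", [v.name for v in flagged])
print("average score by region:", average_score_by_region(accepted))
```

In an automated integration tool, the same rules would typically be configured as filter and expression steps in the pipeline instead of being written by hand.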

Conclusion

Automated vendor onboarding offers cost and time benefits to your organization. Making use of enterprise-grade data integration tools ensures a seamless business-to-vendor data exchange without the need to rework and upgrade your ETL pipelines.

Interview: Does Business Intelligence benefit from Cloud Data Warehousing?

Interview with Ross Perez, Senior Director, Marketing EMEA at Snowflake

Read this article in German:
“Profitiert Business Intelligence vom Data Warehouse in der Cloud?”

Ross Perez is the Senior Director, Marketing EMEA at Snowflake. He leads the Snowflake marketing team in EMEA and is charged with starting the discussion about analytics, data, and cloud data warehousing across EMEA. Before Snowflake, Ross was a product marketer at Tableau Software where he founded the Iron Viz Championship, the world’s largest and longest running data visualization competition.

Data Science Blog: Ross, Business Intelligence (BI) is not really a new trend. In 2019/2020, making data available for the whole company should not be a big thing anymore. Would you agree?

BI is definitely an old trend; reporting has been around for 50 years. People are accustomed to seeing statistics and data for the company at large, and even their business units. However, using BI to deliver analytics to everyone in the organization and encouraging them to make decisions based on data for their specific area is relatively new. In a lot of the companies Snowflake works with, there is a huge new group of people who have recently received access to self-service BI and visualization tools like Tableau, Looker and Sigma, and they are just starting to find answers to their questions.

Data Science Blog: Up until today, BI was just about delivering dashboards for reporting to the business. The data warehouse (DWH) was something like the backend. Today we have increased demand for data transparency. How should companies deal with this demand?

Because more people in more departments want access to data more frequently, the demand on backend systems like the data warehouse is skyrocketing. In many cases, companies have data warehouses that weren’t built to cope with this concurrent demand, and that means that the experience is slow. End users have to wait a long time for their reports. That is where Snowflake comes in: since we can use the power of the cloud to spin up resources on demand, we can serve any number of concurrent users. Snowflake can also house unlimited amounts of data, in both structured and semi-structured formats.

Data Science Blog: Would you say the DWH is the key driver for becoming a data-driven organization? What else should be considered here?

Absolutely. Without having all of your data in a single, highly elastic, and flexible data warehouse, it can be a huge challenge to actually deliver insight to people in the organization.

Data Science Blog: So much for the theory, now let’s talk about specific use cases. In general, it matters a lot whether you are storing and analyzing e.g. financial data or machine data. What do we have to consider for both purposes?

Financial data and machine data do look very different, and often come in different formats. For instance, financial data is often in a standard relational format. Data like this needs to be easy to query with standard SQL, something that many Hadoop and NoSQL tools were unable to provide. Luckily, Snowflake is an ANSI-standard SQL data warehouse, so it can be used with this type of data quite seamlessly.

On the other hand, machine data is often semi-structured or even completely unstructured. This type of data is becoming significantly more common with the rise of IoT, but traditional data warehouses were very bad at dealing with it since they were optimized for relational data. Semi-structured data like JSON, Avro, XML, ORC and Parquet can be loaded into Snowflake for analysis quite seamlessly in its native format. This is important, because you don’t want to have to flatten the data to get any use from it.

Both types of data are important, and Snowflake is really the first data warehouse that can work with them both seamlessly.

Data Science Blog: Back to the common business use case: Creating sales or purchase reports for the business managers, based on data from ERP-systems such as Microsoft or SAP. Which architecture for the DWH could be the right one? How many and which database layers do you see as necessary?

The type of report largely does not matter, because in all cases you want a data warehouse that can support all of your data and serve all of your users. Ideally, you also want to be able to turn it off and on depending on demand. That means that you need a cloud-based architecture… and specifically Snowflake’s innovative architecture that separates storage and compute, making it possible to pay for exactly what you use.

Data Science Blog: Where would you implement the main part of the business logic for the report? In the DWH or in the reporting tool? Does it matter which reporting tool we choose?

The great thing is that you can choose either. Snowflake, as an ANSI-standard SQL data warehouse, can support a high degree of data modeling and business logic. But you can also utilize partners like Looker and Sigma who specialize in data modeling for BI. We think it’s best that the customer chooses what is right for them.

Data Science Blog: Snowflake enables organizations to store and manage their data in the cloud. Does it mean companies lose control over their storage and data management?

Customers have complete control over their data, and in fact Snowflake cannot see, alter or change any aspect of their data. The benefit of a cloud solution is that customers don’t have to manage the infrastructure or the tuning – they decide how they want to store and analyze their data and Snowflake takes care of the rest.

Data Science Blog: How big is the effort for smaller and medium sized companies to set up a DWH in the cloud? Does this have to be an expensive long-term project in every case?

The nice thing about Snowflake is that you can get started with a free trial in a few minutes. Now, moving from a traditional data warehouse to Snowflake can take some time, depending on the legacy technology that you are using. But Snowflake itself is quite easy to set up and very much compatible with existing tools, making it relatively easy to move over.