Cloud Archives

Tag Archive for: Cloud

Process Mining – Ist Celonis wirklich so gut? Ein Praxisbericht.

September 3, 2024/in Data Engineering, Data Warehousing, Insights, Machine Learning, Main Category, Process Mining, Tool Introduction/by Benjamin Aunkofer

Diese Artikel wird viel gelesen werden. Von Process Mining Kunden, von Process Mining Beratern und von Process Mining Software-Anbietern. Und ganz besonders von Celonis.

Der Gartner´s Magic Quadrant zu Process Mining Tools für 2024 zeigt einige Movements im Vergleich zu 2023. Jeder kennt den Gartner Magic Quadrant, nicht nur für Process Mining Tools sondern für viele andere Software-Kategorien und auch für Dienstleistungen/Beratungen. Gartner gilt längst als der relevanteste und internationale Benchmark.

Process Mining – Wo stehen wir heute?

Eine Einschränkung dazu vorweg: Ich kann nur für den deutschen Markt sprechen. Zwar verfolge ich mit Spannung die ersten Erfolge von Celonis in den USA und in Japan, aber ich bin dort ja nicht selbst tätig. Ich kann lediglich für den Raum D/A/CH sprechen, in dem ich für Unternehmen in nahezu allen Branchen zu Process Mining Beratung und gemeinsam mit meinem Team Implementierung anbiete. Dabei arbeiten wir technologie-offen und mit nahezu allen Tools – Und oft in enger Verbindung mit Initiativen der Business Intelligence und Data Science. Wir sind neutral und haben keine “Aktien” in irgendeinem Process Mining Tool!

Process Mining wird heute in allen DAX-Konzernen und auch in allen MDAX-Unternehmen eingesetzt. Teilweise noch als Nischenanalytik, teilweise recht großspurig wie es z. B. die Deutsche Telekom oder die Lufthansa tun.

Mittelständische Unternehmen sind hingegen noch wenig erschlossen in Sachen Process Mining, wobei das nicht ganz richtig ist, denn vieles entwickelt sich – so unsere Erfahrung – aus BI / Data Science Projekten heraus dann doch noch in kleinere Process Mining Applikationen, oft ganz unter dem Radar. In Zukunft – da habe ich keinen Zweifel – wird Process Mining jedoch in jedem Unternehmen mit mehr als 1.000 Mitarbeitern ganz selbstverständlich und quasi nebenbei gemacht werden.

Process Mining Software – Was sagt Gartner?

Ich habe mal die Gartner Charts zu Process Mining Tools von 2023 und 2024 übereinandergelegt und erkenne daraus die folgende Entwicklung:

– Celonis bleibt der Spitzenreiter nach Gartner, gerät jedoch zunehmend unter Druck auf dieser Spitzenposition.

– SAP hatte mit dem Kauf von Signavio vermutlich auf das richtige Pferd gesetzt, die Enterprise-Readiness für SAP-Kunden ist leicht erahnbar.

– Die Software AG ist schon lange mit Process Mining am Start, kann sich in ihrer Positionierung nur leicht verbessern.

– Ähnlich wenig Bewegung bei UiPath, in Sachen Completness of Vision immer noch deutlich hinter der Software AG.

– Interessant ist die Entwicklung des deutschen Anbieters MEHRWERK Process Mining (MPM), bei Completness of Vision verschlechtert, bei Ability to Execute verbessert.

– Der deutsche Anbieter process.science, mit MEHRWERK und dem früheren (von Celonis gekauften) PAFnow mindestens vergleichbar, ist hier noch immer nicht aufgeführt.

– Microsoft Process Mining ist der relative Sieger in Sachen Aufholjagd mit ihrer eigenen Lösung (die zum Teil auf dem eingekauften Tool namens Minit basiert). Process Mining wurde kürzlich in die Power Automate Plattform und in Power BI integriert.

– Fluxicon (Disco) ist vom Chart verschwunden. Das ist schade, vom Tool her recht gut mit dem aufgekauften Minit vergleichbar (reine Desktop-Applikation).

Process Mining Tool im Gartner Magic Quadrant Chart – 2023 vs 2024

Auch wenn ich große Ehrfurcht gegenüber Gartner als Quelle habe, bin ich jedoch nicht sicher, wie weit die Datengrundlage für die Feststellung geht. Ich vertraue soweit der Reputation von Gartner, möchte aber als neutraler Process Mining Experte mit Einblick in den deutschen Markt dazu Stellung beziehen.

Process Mining Tools – Unterschiedliche Erfolgsstories

Aber fangen wir erstmal von vorne an, denn Process Mining Tools haben ihre ganz eigene Geschichte und diese zu kennen, hilft bei der Einordnung von Marktbewegungen etwas und mein Process Mining Software Vergleich auf CIO.de von 2019 ist mittlerweile etwas in die Jahre gekommen. Und Unterhaltungswert haben diese Stories auch, beispielsweise wie ganze Gründer und Teams von diesen Software-Anbietern wie Celonis, UiPath (ehemals ProcessGold), PAFnow (jetzt Celonis), Signavio (jetzt SAP) und Minit (jetzt Microsoft) teilweise im Streit auseinandergingen, eigene Process Mining Tools entwickelt und dann wieder Know How verloren oder selbst aufgekauft wurden – Unter Insidern ist der Gesprächsstoff mit Unterhaltungswert sehr groß.

Dabei darf gerne in Erinnerung gerufen werden, dass Process Mining im Kern eine Graphenanalyse ist, die ein Event Log in Graphen umwandelt, Aktivitäten (Events) stellen dabei die Knoten und die Prozesszeiten die Kanten dar, zumindest ist das grundsätzlich so. Es handelt sich dabei also um eine Analysemethodik und nicht um ein Tool. Ein Process Mining Tool nutzt diese Methodik, stellt im Zweifel aber auch nur exakt diese Visualisierung der Prozessgraphen zur Verfügung oder ein ganzes Tool-Werk von der Datenanbindung und -aufbereitung in ein Event Log bis hin zu weiterführenden Analysen in Richtung des BI-Reportings oder der Data Science.

Im Grunde kann man aber folgende große Herkunftskategorien ausmachen:

1. Process Mining Tools, die als pure Process Mining Software gestartet sind

Hierzu gehört Celonis, das drei-köpfige und sehr geschäftstüchtige Gründer-Team, das ich im Jahr 2012 persönlich kennenlernen durfte. Aber Celonis war nicht das erste Process Mining Unternehmen. Es gab noch einige mehr. Hier fällt mir z. B. das kleine und sympathische Unternehmen Fluxicon ein, dass mit seiner Lösung Disco auch heute noch einen leichtfüßigen Einstieg in Process Mining bietet.

2. Process Mining Tools, die eigentlich aus der Prozessmodellierung oder -automatisierung kommen

Einige Software-Anbieter erkannten frühzeitig (oder zumindest rechtzeitig?), dass Process Mining vielleicht nicht das Kerngeschäft, jedoch eine sinnvolle Ergänzung zu ihrem Portfolio an Software für Prozessmodellierung, -dokumentations oder -automatisierung bietet. Hierzu gehört die Software AG, die eigentlich für ihre ARIS-Prozessmodellierung bekannt war. Und hierzu zählt auch Signavio, die ebenfalls ein reines Prozessmodellierungsprogramm waren und von kurzem von SAP aufgekauft wurden. Aber auch das für RPA bekannte Unternehmen UiPath verleibte sich Process Mining durch den Zukauf von ehemals Process Gold.

3. Process Mining Tools, die Business Intelligence Software erweitern

Und dann gibt es noch diejenigen Anbieter, die bestehende BI Tools mit Erweiterungen zum Process Mining Analysewerkzeug machen. Einer der ersten dieser Anbieter war das Unternehmen PAF (Process Analytics Factory) mit dem Power BI Plugin namens PAFnow, welches von Celonis aufgekauft wurde und heute anscheinend (?) nicht mehr weiterentwickelt wird. Das Unternehmen MEHRWERK, eigentlich ein BI-Dienstleister mit Fokus auf QlikTech-Produkte, bietet für das BI-Tool Qlik Sense ebenfalls eine Erweiterung für Process Mining an und das Unternehmen mit dem unscheinbaren Namen process.science bietet Erweiterungen sowohl für Power BI als auch für Qlik Sense, zukünftig ist eine Erweiterung für Tableu geplant. Process.science fehlt im Gartner Magic Quadrant bis jetzt leider gänzlich, trotz bestehender Marktrelevanz (nach meiner Beobachtung).

Process Mining Tools in der Praxis – Ein Einblick

DAX-Konzerne setzen vor allem auf Celonis. Das Gründer-Team, das starke Vertriebsteam und die Medienpräsenz erst als Unicorn, dann als Decacorn, haben die Türen zu Vorstandsetagen zumindest im mitteleuropäischen Raum geöffnet. Und ganz ehrlich: Dass Celonis ein deutsches Decacorn ist, ist einfach wunderbar. Es ist das erste Decacorn aus Deutschland, das zurzeit wertvollste StartUp in Deutschland und wir können – für den Standort Deutschland – nur hoffen, dass dieser Erfolg bleibt.

Doch wie weit vorne ist Process Mining mit Celonis nun wirklich im Praxiseinsatz? Und ist Celonis für jedes Unternehmen der richtige Einstieg in Process Mining?

Celonis unterscheidet sich von den meisten anderen Tools noch dahingehend, dass es versucht, die ganze Kette des Process Minings in einer einzigen und ausschließlichen Cloud-Anwendung in einer Suite bereitzustellen. Während vor zehn Jahren ich für Celonis noch eine Installation erst einer MS SQL Server Datenbank, etwas später dann bevorzugt eine SAP Hana Datenbank auf einem on-prem Server beim Kunden voraussetzend installieren musste, bevor ich dann zur Installation der Celonis ServerAnwendung selbst kam, ist es heute eine 100% externe Cloud-Lösung. Dies hat anfangs für große Widerstände bei einigen Kunden verursacht, die ehrlicherweise heute jedoch kaum noch eine Rolle spielen. Cloud ist heute selbst für viele mitteleuropäische Unternehmen zum Standard geworden und wird kaum noch infrage gestellt. Vielleicht haben wir auch das ein Stück weit Celonis zu verdanken.

Celonis bietet eine bereits sehr umfassende Anbindung von Datenquellen z. B. für SAP oder Oracle ERP an, mit vordefinierten Event Log SQL Skripten für viele Standard-Prozesse, insbesondere Procure-to-Pay und Order-to-Cash. Aber auch andere Prozesse für andere Geschäftsprozesse z. B. von SalesForce CRM sind bereits verfügbar. Celonis ist zudem der erste Anbieter, der diese Prozessaufbereitung und weiterführende Dashboards in einem App-Store anbietet und so zu einer Plattform wird. Hinzu kommen auch die zuvor als Action Engine bezeichnete Prozessautomation, die mit Lösungen wie Power Automate von Microsoft vergleichbar sind.

Celonis schafft es oftmals in größere Konzerne, ist jedoch selten dann das einzige eingesetzte Process Mining Tool. Meine Kunden und Kontakte aus unterschiedlichsten Unternhemen in Deutschland berichten in Sachen Celonis oft von zu hohen Kosten für die Lizensierung und den Betrieb, zu viel Sales im Vergleich zur Leistung sowie von hohen Aufwänden, wenn der Fokus nicht auf Standardprozesse liegt. Demgegenüber steht jedoch die Tatsache, dass Celonis zumindest für die Standardprozesse bereits viel mitbringt und hier definitiv allen anderen Tool-Anbietern voraus ist und den wohl besten Service bietet.

SAP Signavio rückt nach

Mit dem Aufkauf von Signavio von SAP hat sich SAP meiner Meinung nach an eine gute Position katapultiert. Auch wenn ich vor Jahren noch hätte Wetten können, dass Celonis mal von SAP gekauft wird, scheint der Move mit Signavio nicht schlecht zu wirken, denn ich sehe das Tool bei Kunden mit SAP-Liebe bereits erfolgreich im Einsatz. Dabei scheint SAP nicht den Anspruch zu haben, Signavio zur Plattform für Analytics ausbauen zu wollen, um 1:1 mit Celonis gleichzuziehen, so ist dies ja auch nicht notwendig, wenn Signavio mit SAP Hana und der SAP Datasphere Cloud besser integriert werden wird.

Unternehmen, die am liebsten nur Software von SAP einsetzen, werden also mittlerweile bedient.

Mircosoft holt bei Process Mining auf

Ein absoluter Newcomer unter den Großen Anbietern im praktischen Einsatz bei Unternehmen ist sicherlich Microsoft Process Mining. Ich betreue bereits selbst Kunden, die auf Microsoft setzen und beobachte in meinem Netzwerk ein hohes Interesse an der Lösung von Microsoft. Was als logischer Schritt von Microsoft betrachtet werden kann, ist in der Praxis jedoch noch etwas hakelig, da Microsoft – und ich weiß wovon ich spreche – aktuell noch ein recht komplexes Zusammenspiel aus dem eigentlichen Process Mining Client (ehemals Minit) und der Power Automate Plattform sowie Power BI bereitstellt. Sehr hilfreich ist die Weiterführung der Process Mining Analyse vom Client-Tool dann direkt in der PowerBI Cloud. Das Ganze hat definitiv Potenzial, hängt aber in Details in 2024 noch etwas in diesem Zusammenspiel an verschiedenen Tools, die kein einfaches Setup für den User darstellen.

Doch wenn diese Integration besser funktioniert, und das ist in Kürze zu erwarten, dann bringt das viele Anbieter definitiv in Bedrängnis, denn den Microsoft Stack nutzen die meisten Unternehmen sowieso. Somit wäre kein weiteres Tool für datengetriebene Prozessanalysen mehr notwendig.

Process Mining – Und wie steht es um Machine Learning?

Obwohl ich mich gemeinsam mit Kunden besonders viel mit Machine Learning befasse, sind die Beispiele mit Process Mining noch recht dünn gesäht, dennoch gibt es etwa seit 2020 in Sachen Machine Learning für Process Mining auch etwas zu vermelden.

Celonis versucht Machine Learning innerhalb der Plattform aus einer Hand anzubieten und hat auch eigene Python-Bibleotheken dafür entwickelt. Bisher dreht sich hier viel eher noch um z. B. die Vorhersage von Prozesszeiten oder um die Erkennung von Doppelvorgängen. Die Erkennung von Doppelzahlungen ist sogar eine der penetrantesten Werbeversprechen von Celonis, obwohl eigentlich bereits mit viel einfacherer Analytik effektiv zu bewerkstelligen.

Von Kunden bisher über meinen Geschäftskanal nachgefragte und umgesetzte Machine Learning Funktionen sind u.a. die Anomalie-Erkennung in Prozessdaten, die möglichst frühe Vorhersage von Prozesszeiten (oder -kosten) und die Impact-Prediction auf den Prozess, wenn ein bestimmtes Event eintritt.

Umgesetzt werden diese Anwendungsfälle bisher vor allem auf dritten Plattformen, wie z. B. auf den Analyse-Ressourcen der Microsoft Azure Cloud oder in auf der databricks-Plattform.

Während das nun Anwendungsfälle auf der Prozessanalyse-Seite sind, kann Machine Learning jedoch auf der anderen Seite zur Anwendung kommen: Mit NER-Verfahren (Named Entity Recognition) aus dem NLP-Baukasten (Natural Language Processing) können Event Logs aus unstrukturierten Daten gewonnen werden, z. B. aus Texten in E-Mails oder Tickets.

Data Lakehouse – Event Logs außerhalb des Process Mining Tools

Auch wenn die vorbereitete Anbindung von Standard-ERP-Systemen und deren Standard-Prozesse durch Celonis einen echten Startvorteil bietet, so schwenken Unternehmen immer mehr auf die Etablierung eines unternehmensinternen Data Warehousing oder Data Lakehousing Prozesses, der die Daten als “Data Middlelayer” vorhält und Process Mining Applikationen bereitstellt.

Ich selbst habe diese Beobachtung bereits bei Unternehmen der industriellen Produktion, Handel, Finanzdienstleister und Telekommunikation gemacht und teilweise selbst diese Projekte betreut und/oder umgesetzt. Recht unterschiedlich hingegen ist die interne Benennung dieser Architektur, von “Middlelayer” über “Data Lakehouse” oder “Event Log Layer” bis “Data Hub” waren sehr unterschiedliche Bezeichnungen dabei. Gemeinsam haben sie alle die Funktion als Zwischenebene zwischen den Datenquellen und den Process Mining, BI und Data Science Applikationen.

DATANOMIQ Cloud Architecture for Data Mesh - Process Mining, BI and Data Science Applications

Prinzipielle Architektur-Darstellung eines Data Lakehouse Systems unter Einsatz von Databricks auf der Goolge / Amazon / Microsoft Azure Cloud nach dem Data Mesh Konzept zur Bereitstellung von Data Products für Process Mining, BI und Data Science Applikationen. Alternativ zu Databricks können auch andere Data Warehouse Datenbankplattformen zur Anwendung kommen, beispielsweise auch snowflake mit dbt.

Das Kernziel der Zwischenschicht erstellt für die Process Mining Vohaben die benötigten Event Logs, kann jedoch diesselben Daten für ganz andere Vorhaben und Applikationen zur Verfügung zu stellen.

Vorteile des Data Lakehousing

Die Vorteile einer Daten-Zwischenschicht in Form eines Data Warehouses oder Data Lakehouses sind – je nach unternehmensinterner Ausrichtung – beispielsweise die folgenden:

Keine doppelte Datenhaltung, denn Daten können zentral gehalten werden und in Views speziellen Applikationen der BI, Data Science, KI und natürlich auch für Process Mining genutzt werden.
Einfachere Data Governance, denn eine zentrale Datenschicht zwischen den Applikationen erleichtert die Übersicht und die Aussteuerung der Datenzugriffsberechtigung.
Reduzierte Cloud Kosten, denn Cloud Tools berechnen Gebühren für die Speicherung von Daten. Müssen Rohdatentabellen in die Analyse-Tools wie z. B. Celonis geladen werden, kann dies unnötig hohe Kosten verursachen.
Reduzierte Personalkosten, sind oft dann gegeben, wenn interne Data Engineers verfügbar sind, die die Datenmodelle intern entwickeln.
Höhere Data Readiness, denn für eine zentrale Datenplattform lohn es sich eher, Daten aus weniger genutzten Quellen anzuschließen. Hier ergeben sich oft neue Chancen der Datenfusion für nützliche Analysen, die vorher nicht angedacht waren, weil sich der Aufwand nur hierfür speziell nicht lohne.
Große Datenmodelle werden möglich und das Investment in diese lohnt sich nun, da sie für verschiedene Process Mining Tools ausgeliefert werden können, oder auch nur Sichten (Views) auf Prozess-Perspektiven. So wird Object-centric Process Mining annäherend mit jedem Tool möglich.
Nutzung von heterogenen Datenquellen, denn mit einem Data Lakehouse ist auch die Nutzung von unstrukturierten Daten leicht möglich, davon wird in Zukunft auch Process Mining profitieren. Denn dank KI und NLP (Data Science) können auch Event Logs aus unstrukturierten Daten generiert werden.
Unabhängigkeit von Tool-Anbietern, denn wenn die zentrale Datenschicht die Daten in Datenmodelle aufbereitet (im Falle von Process Mining oft in normalisierten Event Logs), können diese allen Tools zur Verfügung gestellt werden. Dies sorgt für Unabhängigkeit gegenüber einzelnen Tool-Anbietern.
Data Science und KI wird erleichtert, denn die Data Science und das Training im Machine Learning kann direkt mit dem reichhaltigen Pool an Daten erfolgen, auch direkt mit den Daten der Event Logs und losgelöst vom Process Mining Analyse-Tool, z. B. in Databricks oder den KI-Tools von Google, AWS und Mircosoft Azure (Azure Cognitive Services, Azure Machine Learning etc.).

Unter diesen Aspekten wird die Tool-Auswahl für die Prozessanalyse selbst in ihrer Relevanz abgemildert, da diese Tools schneller ausgetauscht werden können. Dies könnte auch bedeuten, dass sich für Unternehmen die Lösung von Microsoft besonders anbietet, da das Data Engineering und die Data Science sowieso über andere Cloud Services abgebildet wird, jedoch kein weiterer Tool-Anbieter eingebunden werden muss.

Process Mining Software – Fazit

Es ist viel Bewegung am Markt und bietet dem Beobachter auch tatsächlich etwas Entertainment. Celonis ist weiterhin der Platzhirsch und wir können sehr froh sein, dass wir es hier mit einem deutschen Start-Up zutun haben. Für Unternehmen, die gleich voll in Process Mining reinsteigen möchten und keine Scheu vor einem möglichen Vendor-Lock-In, bietet Celonis meiner Ansicht nach immer noch das beste Angebot, wenn auch nicht die günstigste Lösung. Die anderen Tools können ebenfalls eine passende Lösung sein, nicht nur aus preislichen Gründen, sondern vor allem im Kontext der zu untersuchenden Prozesse, der Datenquellen und der bestehenden Tool-Landschaft. Dies sollte im Einzelfall geprüft werden.

Die Datenbereitstellung und -aufbereitung sollte idealerweise nicht im Process Mining Tool erfolgen, sondern auf einer zentralen Datenschicht als Data Warehouse oder Data Lakehouse für Process Mining. Die damit gewonnene Data Readiness zahlt nicht nur auf datengetriebene Prozessanalysen ein, sondern kommt dem ganzen Unternehmen zu Gute und ermöglicht zukünftige Projekte mit Daten, an die vorher oder bisher gar nicht zu denken waren.

Dieser Artikel wurde von Benjamin Aunkofer, einem neutralen Process Mining Berater, ohne KI (ohne ChatGPT etc.) verfasst!

Benjamin Aunkofer im Interview mit Atreus Interim Management über Daten & KI in Unternehmen

Video Interview – Interim Management für Daten & KI

March 13, 2024/in Interviews/by Redaktion

Data & AI im Unternehmen zu etablieren ist ein Prozess, der eine fachlich kompetente Führung benötigt. Hier kann Interim Management die Lösung sein.

Unternehmer stehen dabei vor großen Herausforderungen und stellen sich oft diese oder ähnliche Fragen:

Welche Top-Level Strategie brauche ich?
Wo und wie finde ich die ersten Show Cases im Unternehmen?
Habe ich aktuell den richtigen Daten back-bone?

Diese Fragen beantwortet Benjamin Aunkofer (Gründer von DATANOMIQ und AUDAVIS) im Interview mit Atreus Interim Management. Er erläutert, wie Unternehmen die Disziplinen Data Science, Business Intelligence, Process Mining und KI zusammenführen können, und warum Interim Management dazu eine gute Idee sein kann.

Video Interview “Meet the Manager” auf Youtube mit Franz Kubbillum von Atreus Interim Management und Benjamin Aunkofer von DATANOMIQ.

Über Benjamin Aunkofer

Benjamin Aunkofer – Interim Manager für Data & AI, Gründer von DATANOMIQ und AUDAVIS.

Benjamin Aunkofer ist Gründer des Beratungs- und Implementierungspartners für Daten- und KI-Lösungen namens DATANOMIQ sowie Co-Gründer der AUDAVIS, einem AI as a Service für die Wirtschaftsprüfung.

Nach seiner Ausbildung zum Software-Entwickler (FI-AE IHK) und seinem Einstieg als Consultant bei Deloitte, gründete er 2015 die DATANOMIQ GmbH in Berlin und unterstütze mit mehreren kleinen Teams Unternehmen aus unterschiedlichen Branchen wie Handel, eCommerce, Finanzdienstleistungen und der produzierenden Industrie (Pharma, Automobilzulieferer, Maschinenbau). Er partnert mit anderen Unternehmensberatungen und unterstütze als externer Dienstleister auch Wirtschaftsprüfungsgesellschaften.

Der Projekteinstieg in Unternehmen erfolgte entweder rein projekt-basiert (Projektangebot) oder über ein Interim Management z. B. als Head of Data & AI, Chief Data Scientist oder Head of Process Mining.

Im Jahr 2023 gründete Benjamin Aunkofer mit zwei Mitgründern die AUDAVIS GmbH, die eine Software as a Service Cloud-Plattform bietet für Wirtschaftsprüfungsgesellschaften, Interne Revisionen von Konzernen oder für staatliche Prüfung von Finanztransaktionen.

Object-centric Data Modelling for Process Mining and BI

Object-centric Process Mining on Data Mesh Architectures

November 15, 2023/in Artificial Intelligence, Big Data, Business Analytics, Business Intelligence, Cloud, Data Engineering, Data Mining, Data Science, Data Warehousing, Industrie 4.0, Machine Learning, Main Category, Predictive Analytics, Process Mining/by Benjamin Aunkofer

In addition to Business Intelligence (BI), Process Mining is no longer a new phenomenon, but almost all larger companies are conducting this data-driven process analysis in their organization.

The database for Process Mining is also establishing itself as an important hub for Data Science and AI applications, as process traces are very granular and informative about what is really going on in the business processes.

The trend towards powerful in-house cloud platforms for data and analysis ensures that large volumes of data can increasingly be stored and used flexibly. This aspect can be applied well to Process Mining, hand in hand with BI and AI.

New big data architectures and, above all, data sharing concepts such as Data Mesh are ideal for creating a common database for many data products and applications.

The Event Log Data Model for Process Mining

Process Mining as an analytical system can very well be imagined as an iceberg. The tip of the iceberg, which is visible above the surface of the water, is the actual visual process analysis. In essence, a graph analysis that displays the process flow as a flow chart. This is where the processes are filtered and analyzed.

The lower part of the iceberg is barely visible to the normal analyst on the tool interface, but is essential for implementation and success: this is the Event Log as the data basis for graph and data analysis in Process Mining. The creation of this data model requires the data connection to the source system (e.g. SAP ERP), the extraction of the data and, above all, the data modeling for the event log.

Simple Data Model for a Process Mining Event Log.

As part of data engineering, the data traces that indicate process activities are brought into a log-like schema. A simple event log is therefore a simple table with the minimum requirement of a process number (case ID), a time stamp and an activity description.

Example Event Log for Process Mining

An Event Log can be seen as one big data table containing all the process information. Splitting this big table into several data tables is due to the goal of increasing the efficiency of storing the data in a normalized database.

The following example SQL-query is inserting Event-Activities from a SAP ERP System into an existing event log database table (one big table). It shows that events are based on timestamps (CPUDT, CPUTM) and refer each to one of a list of possible activities (dependent on VGABE).

Inserting Events of Material Movement of Purchasing Processes

INSERT INTO Event_Log

SELECT EKBE.EBELN AS Einkaufsbeleg

,EKBE.EBELN + EKBE.EBELP AS PurchaseOrderPosition-- <-- Case ID of this Purchasing Process

,MSEG_MKPF.AUFNR AS CustomerOrder

,NULL AS CustomerOrderPosition

,CASE -- <-- Activitiy Description dependent on a flag

WHEN MSEG_MKPF.VGABE = 'WA' THEN 'Warehouse Outbound for Customer Order'

WHEN MSEG_MKPF.VGABE = 'WF' THEN 'Warehouse Inbound for Customer Order'

WHEN MSEG_MKPF.VGABE = 'WO' THEN 'Material Movement for Manufacturing'

WHEN MSEG_MKPF.VGABE = 'WE' THEN 'Warehouse Inbound for Purchase Order'

WHEN MSEG_MKPF.VGABE = 'WQ' THEN 'Material Movement for Stock'

WHEN MSEG_MKPF.VGABE = 'WR' THEN 'Material Movement after Manufacturing'

ELSE 'Material Movement (other)'

END AS Activity

,EKPO.MATNR AS Material -- <--

,NULL AS StorageType

,MSEG_MKPF.LGORT AS StorageLocation

,SUBSTRING(MSEG_MKPF.CPUDT ,1,2) + '-' + SUBSTRING(MSEG_MKPF.CPUDT,4,2) + '-' + SUBSTRING(MSEG_MKPF.CPUDT,7,4) + ' '

+ SUBSTRING(MSEG_MKPF.CPUTM,1,8) + '.0000' AS EventTime

,'020' AS Sorting

,MSEG_MKPF.USNAM AS EventUser

,EKBE.MATNR AS Material

,MSEG_MKPF.BWART AS MovementType

,MSEG_MKPF.MANDT AS Mandant

FROM SAP.EKBE

LEFT JOIN SAP.EKPO

ON EKBE.MANDT = EKPO.MANDT

AND EKBE.BUKRS = EKPO.BURKSEKBE.EBELN = EKPO.EBELN

AND EKBE.Pos = EKPO.Pos

LEFT JOIN SAP.MSEG_MKPF AS MSEG_MKPF -- <-- Here as a pre-join of MKPF & MSEG table

ON EKBE.MANDT = MSEG_MKPF.MANDT

AND EKBE.BURKS = MSEG.BUKRSMSEG_MKPF.MATNR = EKBE.MATNR

AND MSEG_MKPF.EBELP = EKBE.EBELP

WHERE EKBE.VGABE= '1' -- <-- OR EKBE.VGABE= '2' -- Warehouse Outbound -> VGABE = 1, Invoice Inbound -> VGABE = 2

Attention: Please see this SQL as a pure example of event mining for a classic (single table) event log! It is based on a German SAP ERP configuration with customized processes.

An Event Log can also include many other columns (attributes) that describe the respective process activity in more detail or the higher-level process context.

Incidentally, Process Mining can also work with more than just one timestamp per activity. Even the small Process Mining tool Fluxicon Disco made it possible to handle two activities from the outset. For example, when creating an order in the ERP system, the opening and closing of an input screen could be recorded as a timestamp and the execution time of the micro-task analyzed. This concept is continued as so-called task mining.

Task Mining

Task Mining is a subtype of Process Mining and can utilize user interaction data, which includes keystrokes, mouse clicks or data input on a computer. It can also include user recordings and screenshots with different timestamp intervals.

As Task Mining provides a clearer insight into specific sub-processes, program managers and HR managers can also understand which parts of the process can be automated through tools such as RPA. So whenever you hear that Process Mining can prepare RPA definitions you can expect that Task Mining is the real deal.

Machine Learning for Process and Task Mining on Text and Video Data

Process Mining and Task Mining is already benefiting a lot from Text Recognition (Named-Entity Recognition, NER) by Natural Lamguage Processing (NLP) by identifying events of processes e.g. in text of tickets or e-mails. And even more Task Mining will benefit form Computer Vision since videos of manufacturing processes or traffic situations can be read out. Even MTM analysis can be done with Computer Vision which detects movement and actions in video material.

Object-Centric Process Mining

Object-centric Process Data Modeling is an advanced approach of dynamic data modelling for analyzing complex business processes, especially those involving multiple interconnected entities. Unlike classical process mining, which focuses on linear sequences of activities of a specific process chain, object-centric process mining delves into the intricacies of how different entities, such as orders, items, and invoices, interact with each other. This method is particularly effective in capturing the complexities and many-to-many relationships inherent in modern business processes.

Note from the author: The concept and name of object-centric process mining was introduced by Wil M.P. van der Aalst 2019 and as a product feature term by Celonis in 2022 and is used extensively in marketing. This concept is based on dynamic data modelling. I probably developed my first event log made of dynamic data models back in 2016 and used it for an industrial customer. At that time, I couldn’t use the Celonis tool for this because you could only model very dedicated event logs for Celonis and the tool couldn’t remap the attributes of the event log while on the other hand a tool like Fluxicon disco could easily handle all kinds of attributes in an event log and allowed switching the event perspective e.g. from sales order number to material number or production order number easily.

An object-centric data model is a big deal because it offers the opportunity for a holistic approach and as a database a single source of truth for Process Mining but also for other types of analytical applications.

Enhancement of the Data Model for Obect-Centricity

The Event Log is a data model that stores events and their related attributes. A classic Event Log has next to the Case ID, the timestamp and a activity description also process related attributes containing information e.g. about material, department, user, amounts, units, prices, currencies, volume, volume classes and much much more. This is something we can literally objectify!

The problem of this classic event log approach is that this information is transformed and joined to the Event Log specific to the process it is designed for.

An object-centric event log is a central data store for all kind of events mapped to all relevant objects to these events. For that reason our event log – that brings object into the center of gravity – we need a relational bridge table (Event_Object_Relation) into the focus. This tables creates the n to m relation between events (with their timestamps and other event-specific values) and all objects.

For fulfillment of relational database normalization the object table contains the object attributes only but relates their object attribut values from another table to these objects.

Advanced Event Log with dynamic Relations between Objects and Events

The above showed data model is already object-centric but still can become more dynamic in order to object attributes by object type (e.g. the type material will have different attributes then the type invoice or department). Furthermore the problem that not just events and their activities have timestamps but also objects can have specific timestamps (e.g. deadline or resignation dates).

Advanced Event Log with dynamic Relations between Objects and Events and dynamic bounded attributes and their values to Events – And the same for Objects.

A last step makes the event log data model more easy to analyze with BI tools: Adding a classical time dimension adding information about each timestamp (by date, not by time of day), e.g. weekdays or public holidays.

Advanced Event Log with dynamic Relations between Objects and Events and dynamic bounded attributes and their values to Events and Objects. The measured timestamps (and duration times in case of Task Mining) are enhanced with a time-dimension for BI applications.

For analysis the way of Business Intelligence this normalized data model can already be used. On the other hand it is also possible to transform it into a fact-dimensional data model like the star schema (Kimball approach). Also Data Science related use cases will find granular data e.g. for training a regression model for predicting duration times by process.

Note from the author: Process Mining is often regarded as a separate discipline of analysis and this is a justified classification, as process mining is essentially a graph analysis based on the event log. Nevertheless, process mining can be considered a sub-discipline of business intelligence. It is therefore hardly surprising that some process mining tools are actually just a plugin for Power BI, Tableau or Qlik.

Storing the Object-Centrc Analytical Data Model on Data Mesh Architecture

Central data models, particularly when used in a Data Mesh in the Enterprise Cloud, are highly beneficial for Process Mining, Business Intelligence, Data Science, and AI Training. They offer consistency and standardization across data structures, improving data accuracy and integrity. This centralized approach streamlines data governance and management, enhancing efficiency. The scalability and flexibility provided by data mesh architectures on the cloud are very beneficial for handling large datasets useful for all analytical applications.

Note from the author: Process Mining data models are very similar to normalized data models for BI reporting according to Bill Inmon (as a counterpart to Ralph Kimball), but are much more granular. While classic BI is satisfied with the header and item data of orders, process mining also requires all changes to these orders. Process mining therefore exceeds this data requirement. Furthermore, process mining is complementary to data science, for example the prediction of process runtimes or failures. It is therefore all the more important that these efforts in this treasure trove of data are centrally available to the company.

Central single source of truth models also foster collaboration, providing a common data language for cross-functional teams and reducing redundancy, leading to cost savings. They enable quicker data processing and decision-making, support advanced analytics and AI with standardized data formats, and are adaptable to changing business needs.

DATANOMIQ Data Mesh Cloud Architecture – This image is animated! Click to enlarge!

Central data models in a cloud-based Data Mesh Architecture (e.g. on Microsoft Azure, AWS, Google Cloud Platform or SAP Dataverse) significantly improve data utilization and drive effective business outcomes. And that´s why you should host any object-centric data model not in a dedicated tool for analysis but centralized on a Data Lakehouse System.

About the Process Mining Tool for Object-Centric Process Mining

Celonis is the first tool that can handle object-centric dynamic process mining event logs natively in the event collection. However, it is not neccessary to have Celonis for using object-centric process mining if you have the dynamic data model on your own cloud distributed with the concept of a data mesh. Other tools for process mining such as Signavio, UiPath, and process.science or even the simple desktop tool Fluxicon Disco can be used as well. The important point is that the data mesh approach allows you to easily generate classic event logs for each analysis perspective using the dynamic object-centric data model which can be used for all tools of process visualization…

… and you can also use this central data model to generate data extracts for all other data applications (BI, Data Science, and AI training) as well!

Data Mesh Architecture on Cloud for BI, Data Science and Process Mining

July 23, 2023/in Artificial Intelligence, Big Data, Business Analytics, Business Intelligence, Cloud, Data Engineering, Data Science, Machine Learning, Main Category, Predictive Analytics, Process Mining, Tool Introduction, Use Cases/by Benjamin Aunkofer

Companies use Business Intelligence (BI), Data Science, and Process Mining to leverage data for better decision-making, improve operational efficiency, and gain a competitive edge. BI provides real-time data analysis and performance monitoring, while Data Science enables a deep dive into dependencies in data with data mining and automates decision making with predictive analytics and personalized customer experiences. Process Mining offers process transparency, compliance insights, and process optimization. The integration of these technologies helps companies harness data for growth and efficiency.

Applications of BI, Data Science and Process Mining grow together

More and more all these disciplines are growing together as they need to be combined in order to get the best insights. So while Process Mining can be seen as a subpart of BI while both are using Machine Learning for better analytical results. Furthermore all theses analytical methods need more or less the same data sources and even the same datasets again and again.

Bring separate(d) applications together with Data Mesh

While all these analytical concepts grow together, they are often still seen as separated applications. There often remains the question of responsibility in a big organization. If this responsibility is decided as not being a central one, Data Mesh could be a solution.

Data Mesh is an architectural approach for managing data within organizations. It advocates decentralizing data ownership to domain-oriented teams. Each team becomes responsible for its Data Products, and a self-serve data infrastructure is established. This enables scalability, agility, and improved data quality while promoting data democratization.

In the context of a Data Mesh, a Data Product refers to a valuable dataset or data service that is managed and owned by a specific domain-oriented team within an organization. It is one of the key concepts in the Data Mesh architecture, where data ownership and responsibility are distributed across domain teams rather than centralized in a single data team.

A Data Product can take various forms, depending on the domain’s requirements and the data it manages. It could be a curated dataset, a machine learning model, an API that exposes data, a real-time data stream, a data visualization dashboard, or any other data-related asset that provides value to the organization.

However, successful implementation requires addressing cultural, governance, and technological aspects. One of this aspect is the cloud architecture for the realization of Data Mesh.

Example of a Data Mesh on Microsoft Azure Cloud using Databricks

The following image shows an example of a Data Mesh created and managed by DATANOMIQ for an organization which uses and re-uses datasets from various data sources (ERP, CRM, DMS, IoT,..) in order to provide the data as well as suitable data models as data products to applications of Data Science, Process Mining (Celonis, UiPath, Signavio & more) and Business Intelligence (Tableau, Power BI, Qlik & more).

Data Mesh on Azure Cloud with Databricks and Delta Lake for Applications of Business Intelligence, Data Science and Process Mining.

Microsoft Azure Cloud is favored by many companies, especially for European industrial companies, due to its scalability, flexibility, and industry-specific solutions. It offers robust IoT and edge computing capabilities, advanced data analytics, and AI services. Azure’s strong focus on security, compliance, and global presence, along with hybrid cloud capabilities and cost management tools, make it an ideal choice for industrial firms seeking to modernize, innovate, and improve efficiency. However, this concept on the Azure Cloud is just an example and can easily be implemented on the Google Cloud (GCP), Amazon Cloud (AWS) and now even on the SAP Cloud (Datasphere) using Databricks.

Databricks is an ideal tool for realizing a Data Mesh due to its unified data platform, scalability, and performance. It enables data collaboration and sharing, supports Delta Lake for data quality, and ensures robust data governance and security. With real-time analytics, machine learning integration, and data visualization capabilities, Databricks facilitates the implementation of a decentralized, domain-oriented data architecture we need for Data Mesh.

Furthermore there are also alternate architectures without Databricks but more cloud-specific resources possible, for Microsoft Azure e.g. using Azure Synapse instead. See this as an example which has many possible alternatives.

Summary – What value can you expect?

With the concept of Data Mesh you will be able to access all your organizational internal and external data sources once and provides the data as several data models for all your analytical applications. The data models are seen as data products with defined value, costs and ownership. Each applications has its own data model. While Data Science Applications have more raw data, BI applications get their well prepared star schema galaxy models, and Process Mining apps get normalized event logs. Using data sharing (in Databricks: Delta Sharing) data products or single datasets can be shared through applications and owners.

Jurek Dörner @ DATANOMIQ

How to reduce costs for Process Mining

June 21, 2023/in Big Data, Business Analytics, Business Intelligence, Data Mining, Data Science, Insights, Main Category, Process Mining/by Jurek Dörner

Process mining has emerged as a powerful Business Process Intelligence discipline (BPI) for analyzing and improving business processes. It involves extracting data from source systems to gain insights into process behavior and uncover opportunities for optimization. While there are many approaches to create value with process mining, organizations often face challenges when it comes to the cost of implementing the necessary solution. In this article, we will highlight the key elements when it comes to process mining architectures as well as the most common mistakes, to help organizations leverage the power of process mining while maintain cost control.

Process Mining – Elements of Process Mining and their cost aspects

Data Extraction for process mining

Most process mining projects underestimate the complexity of data extraction. Even for well-known sources like SAP-ERP’s, the extraction often consumes 50% of the first pilot’s resources. As a result, the extraction pipelines are often built with the credo of “asap” and this is where the cost-drama begins. Process Mining demands Big Data in 99% of the cases, releasing bad developed extraction jobs will end in big cost chunks down the value stream. Frequently organizations perform full loads of big SAP tables, causing source system performance impact, increasing maintenance, and moving hundred GB’s of data on daily basis without any new value. Other organizations fall for the connectors, provided by some process mining platform tools, promising time-to-value being the best. Against all odds the data is getting extracted then into costly third-party platforms where they can be only consumed by the platforms process mining tool itself. On top of that, these organizations often perform more than one Business Process Intelligence discipline, resulting in extracting the exact same data multiple times.

Process Mining – Data Extraction

The data extraction for process mining should be well planed and match the data strategy of the organization. By considering lightweighted data preprocessing techniques organizations can save both time and money. When accepting the investment character of big data extractions, the investment should be done properly in the beginning and therefore cost beneficial in the long term.

Cloud-Based infrastructure with process mining?

Depending on the data strategy of one organization, one cost-effective approach to process mining could be to leverage cloud computing resources. Cloud platforms, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP), provide scalable and flexible infrastructure options. By using cloud services, organizations can avoid the upfront investment in hardware and maintenance costs associated with on-premises infrastructure. They can pay for resources on a pay-as-you-go basis, scaling up or down as needed, which can significantly reduce costs. When dealing with big data in the cloud, meeting the performance requirements while keeping cost control can be a balancing act, that requires a high skillset in cloud technologies. Depending the organization situation and data strategy, on premises or hybrid approaches should be also considered. But costs won’t decrease only migrating from on-premises to cloud and vice versa. What makes the difference is a smart ETL design capturing the nature of process mining data.

Process Mining Cloud Architecture on “pay as you go” base.

Storage for process mining data

Storing data is a crucial aspect of process mining, as in most cases big data is involved. Instead of investing in expensive data storage solutions, which some process mining solutions offer, organizations can opt for cost-effective alternatives. Cloud storage services like Amazon S3, Azure Blob Storage, or Google Cloud Storage provide highly scalable and durable storage options at a fraction of the cost of process mining storage systems. By utilizing these services, organizations can store large volumes of event data without incurring substantial expenses. Moreover, when big data engineering technics, consider profound process mining logics the storage cost cut down can be tremendous.

Process Mining – Infrastructure Cost Curve: On-Premise vs Cloud

Process Mining Tools

While some commercial process mining tools can be expensive, there are several powerful more economical alternatives available. Tools like Process Science, ProM, and Disco provide comprehensive process mining capabilities without the hefty price tag. These tools offer functionalities such as event log import, process discovery, conformance checking, and performance analysis. Organizations often mismanage the fact, that there can and should be more then one process mining tool available. As expensive solutions like Celonis have their benefits, not all use cases make up for the price of these tools. As a result, these low ROI-use cases will eat up the margin, or (and that’s even more critical) little promising use cases won’t be investigated on and therefore high hanging fruits never discovered. Leveraging process mining tools can significantly reduce costs while still enabling organizations to achieve valuable process insights.

Process Mining Tool Landscape (examples shown)

Collaboration

Another cost-saving aspect is to encourage collaboration within the organization itself. Most process mining initiatives require the input from process experts and often involve multiple stakeholders across different departments. By establishing cross-functional teams and supporting collaboration, organizations can share resources and distribute the cost burden. This approach allows for the pooling of expertise, reduces duplication of efforts, and facilitates knowledge exchange, all while keeping costs low.

Process Mining Team Structure

Conclusion

Process mining offers tremendous potential for organizations seeking to optimize their business processes. While many organizations start process mining projects euphorically, the costs set an abrupt end to the party. Implementing a low-cost and collaborative architecture can help to create a sustainable value for the organization. By leveraging cloud-based infrastructure, cost-effective storage solutions, big data engineering techniques, process mining tools, well developed data extractions, lightweight data preprocessing techniques, and fostering collaboration, organizations can embark on process mining initiatives without straining their budgets. With the right approach, organizations can unlock the power of process mining and drive operational excellence without losing cost control.

One might argue that implementing process mining is not only about the costs. In the end each organization must consider the long-term benefits and return on investment (ROI). But with a cost controlled and sustainable process mining approach, return on investment is likely higher and less risky.

This article provides general information for process mining cost reduction. Specific strategic decisions should always consider the unique requirements and restrictions of individual organizations.

Cloud Data Platform for Shopfloor Management

How Cloud Data Platforms improve Shopfloor Management

February 5, 2023/in Big Data, Data Engineering, Data Migration, Data Mining, Data Science, Data Warehousing, Database, Deep Learning, ETL, Machine Learning, Main Category, Predictive Analytics/by Benjamin Aunkofer

In the era of Industry 4.0, linking data from MES (Manufacturing Execution System) with that from ERP, CRM and PLM systems plays an important role in creating integrated monitoring and control of business processes.

ERP (Enterprise Resource Planning) systems contain information about finance, supplier management, human resources and other operational processes, while CRM (Customer Relationship Management) systems provide data about customer relationships, marketing and sales activities. PLM (Product Lifecycle Management) systems contain information about products, development, design and engineering.

By linking this data with the data from MES, companies can obtain a more complete picture of their business operations and thus achieve better monitoring and control of their business processes. Of central importance here are the OEE (Overall Equipment Effectiveness) KPIs that are so important in production, as well as the key figures from financial controlling, such as contribution margins. The fusion of data in a central platform enables smooth analysis to optimize processes and increase business efficiency in the world of Industry 4.0 using methods from business intelligence, process mining and data science. Companies also significantly increase their enterprise value with the linking of this data, thanks to the data and information transparency gained.

Cloud Data Platform for shopfloor management and data sources such like MES, ERP, PLM and machine data. Copyright by DATANOMIQ.

If the data sources are additionally expanded to include the machines of production and logistics, much more in-depth analyses for error detection and prevention as well as for optimizing the factory in its dynamic environment become possible. The machine sensor data can be monitored directly in real time via respective data pipelines (real-time stream analytics) or brought into an overall picture of aggregated key figures (reporting). The readers of this data are not only people, but also individual machines or entire production plants that can react to this data.

As a central data architecture there are dozens of analytical applications which can be fed with data:

– OEE key figures for Shopfloor reporting
– Process Mining (e.g. material flow analysis) for manufacturing and supply chain.
– Detection of anomalies on the shopfloor or on individual machines.
– Predictive maintenance for individual machines or entire production lines.

This solution scales completely automatically in terms of both performance and cost. It looks beyond individual problems since it offers universal and flexible scope for action. In other words, it will result in a “god mode” for the management being able to drill-down from a specific client project to insights into single machines involved into each project.

Are you interested in scalable data architectures for your shopfloor management? Or would you like to discuss a specific problem with us? Or maybe you are interested in an individual data strategy? Then get in touch with me! 🙂

Google Cloud run with Infrastructure by Code using Terraform

Google Cloud Run – Tutorial

November 8, 2022/in Cloud, Data Science Hack, DevOps, Infrastructure as Code, Terraform/by Otrek Wilke

Es gibt Gelegenheiten, da ist eine oder mehrere serverlose Funktionen nicht ausreichend, um einen Service darzustellen. Für diese Fälle gibt es auf der Google Cloud Plattform Google Cloud Run. Cloud Run bietet zwei Möglichkeiten Container auszuführen. Services und Jobs. In diesem Beispiel wird ein Google Cloud Run Service mittels Terraform definiert, welcher auf Basis eines Scheduler Jobs regelmäßig aufgerufen wird. Cloud Build wird dazu genutzt den aktuellen Code auf den Service zu veröffentlichen.

Der nachfolgende Quellcode ist in GitHub verfügbar: https://github.com/fingineering/GCPCloudRunDemo

tl;dr

Mittels Terraform können alle notwendigen Komponenten erstellt werden, um einen Cloud Run Services aus einem Github Repository kontinuierlich zu aktualisieren. Als Beispiel wird ein stark vereinfachter Flask Webservice verwendet. Durch den Cloud Scheduler wird dieser Service regelmäßig aufgerufen.

Voraussetzungen

Bevor der Service aufgesetzt werden kann, müssen einige Voraussetzungen erfüllt sein. Es wird ein Google Cloud Project benötigt, sowie ein Github Account. Auf dem Computer, welcher zur Entwicklung verwendet werden soll, müssen Terraform, Google Cloud SDK, git, Docker und Python installiert sein. In diesem Beispiel wird Python verwendet, es ist aber mit jeder Sprache möglich, mit der ein WebServer erstellt werden kann.

Ein Google Cloud Projekt kann bei Google Cloud Plattform erstellt werden. Es wird nur ein Google Account benötigt, Neukunden erhalten ein kostenloses Guthaben von 300€ für 90 Tage.
Wenn noch nicht vorhanden sollte ein kostenloser Github Account auf Github erstellt werden. Das Beispiel kann auch mittels Google Cloud Source Repositories umgesetzt werden. Als Alternative zu Cloud Build kann Github Actions eingesetzt werden.
Terraform, Google Cloud SDK, Docker und Python müssen auf dem verwendeten Computer installiert werden, hierzu empfiehlt sich ein Paketmanager wie Homebrew oder Chocolatey. Linux Nutzer verwenden am besten den in ihrer Distribution mitgelieferten.
Soll nichts installiert werden, dann kann auch die Google Cloud Shell verwendet werden, diese findet sich im in der Cloud Console

APIs die in Google Cloud aktiviert werden müssen

Neben den Voraussetzungen zur Software müssen auf der Google Cloud Plattform einige APIs aktiviert werden:

Cloud Run API
Cloud Build API
Artifact Registry API
Cloud Scheduler API
Cloud Logging API
Identity and Access Management API

Die Cloud Run API wird benötigt um einen Cloud Run Service zu erstellen, die Artifact Registry wird benötigt, um die Container Abbilder zu speichern. Die Cloud Build API und die Identity and Access Management API werden benötigt, um eine CI/CD Pipeline zu implementieren. Cloud Scheduler wird in diesem Beispiel verwendet, um den Service regelmäßig aufzurufen. Da alle Services via Terraform erstellt und verwaltet werden, werden die APIs benötigt.

Infrastruktur

Das Erstellen der Infrastruktur läuft in mehreren Schritten ab, nur wenn App Code und Container bereits vorhanden sind, kann der gesamte Prozess automatisiert werden. Für die Definition der Infrastruktur wird ein Ordner “Infrastructure” erstellt und darin die Dateien main.tf, variables.tfund terraform.tfvars.

mkdir Infrastructure

cd Infrastructure

touch main.tf

touch terraform.tfvars

Insgesamt werden vier Komponenten erstellt, das Artifact Registry Repository, ein Cloud Run Service, ein Cloud Build Trigger und ein Cloud Scheduler Job. Bevor die eigentliche Infrastruktur erzeugt werden kann, müssen einige Service Accounts definiert werden. Ziel ist es, das jedes Asset eine eigene Identität zugewiesen werden kann. Daher werden drei Service Accounts erstellt, für den Run Service, den Build Trigger und den Scheduler Job.

resource "google_service_account" "cloud_run_demo_sa" {

project = var.project_name

account_id = "cloudrundemosa"

description = "Running Cloud Run Scheduled Job to update data in BigQuery under this account"

display_name = "Cloud Run Demo Service Account"

}

# Service account for scheduler to call the function

resource "google_service_account" "cloud_shedule_caller" {

project = var.project_name

account_id = "cloudscheduler"

description = "A service account to run the scheduler under an invoke the cloud run service"

display_name = "Cloud Scheduler Demo User"

}

# Service Account to run the cloud build trigger

resource "google_service_account" "cloud_run_demo_builder" {

project = var.project_name

account_id = "cloudrundemobuilder"

description = "A service account to run the cloud build trigger"

display_name = "Cloud Run Demo Build Trigger"

}

Da der erste Schritt die Einrichtung der Artifact Registry für den Cloud Run Service ist, wird dieser wie folgt hinzugefügt:

resource "google_artifact_registry_repository" "container_demo_repo" {

location = var.location

project = var.project_name

repository_id = var.container_name

description = "Demo docker repository"

format = "DOCKER"

}

// create the url of a container in the registry, by name of container

data "google_container_registry_image" "demo_container" {

name = var.container_name

project = var.project_name

}

Nun kann der erste Schritt zum Aufsetzen der Infrastruktur mittels des Terraform Dreiklangs durchgeführt werden:

# initialize the terraform project

terraform init

# check if everything will go as planned

terraform plan

# create the infrastructure

terraform apply

Die Beispiel Flask Anwendung

Als Beispielanwendung wird hier eine sehr simple Flask Webapp verwendet. Die Webapp beinhaltet eine einzige Route, es wird Hello World bei einem GET Request zurück gegeben und mittels POST kann die Nachricht personalisiert werden. Die Anwendung dient nur der Demonstration, es können fast beliebige Funktionalitäten umgesetzt werden.

Es ist auch nur zwingend notwendig flask oder Python zu verwenden, es kann jede Sprache und jedes Framework eingesetzt werden, welches einen Webserver implementieren kann und auf HTTP Anfragen reagieren kann. Flask selbst ist ein sogenanntes Micro Framework und kann flexibel eingesetzt werden, für mehr Informationen empfiehlt sich z.b. das Flask Mega Tutorial

Im Projektverzeichnis muss ein neuer Ordner App erzeugt werden, in diesem Ordner wird die Python Datei main.py, sowie die requirements.txt Datei erzeugt.

from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])

def main():

if request.method == 'POST':

# get some incomming data

data = request.get_json()

# This would be the place where could call your code to run.

# But that wouldn't be a good idea.

# Therefore we create a script package, register the scripts and run

# them all in here

return f"Hello {data['name']}"

return "Hello World"

if __name__ == '__main__':

"""

Creating a very basic debug app. Don't use this in production

"""

app.run(

debug=True,

port=8000,

host="0.0.0.0"

)

Dieser minimale Webservice kann local ausgeführt werden indem ein virtual environment erzeugt und die in requirements.txt spezifizierten Pakete installiert werden.

# create a virtual environment

python -m venv venv

# activate it

source venv/bin/activate

# install required packages

pip install -r requirements.txt

Die App kann nun local ausgeführt werden mittels:

				
				1

						python main.py

Zum Testen kann im Browser die Adresse localhost:8080 aufgerufen werden oder mittels curl ein POST Request an den Service gesendet werden.

				
				1
2
3

						curl -X POST -H "Content-Type: application/json" \
    -d '{"name": "FooBar"}' \
    http://localhost:8080

Der mit flask mitgelieferte Web Server sollte nur zu Entwicklungszwecken verwendet werden, in produktiven Umgebungen kann z.b. gunicorn eingesetzt werden. Gunicorn wird in diesem Beispiel später auch im Container verwendet werden.

				
				1

						gunicorn --bind 0.0.0.0:8080 --workers 1 main:app

Docker Container erstellen, ausführen und deployen

Um einen Service in Cloud Run auszuführen, muss dieser in einem Docker Container vorliegen. Dazu wird zunächst ein Dockerfile im App Ordner erstellt. Diese ist einfach gehalten, es basiert auf einem Python Container, kopiert die Dateien aus dem App Ordner und definiert das Start Kommando für Gunicorn.

In vielen Fällen sind im lokalen Entwicklungsordner Dateien vorhanden die nicht in den Container veröffentlicht werden sollten, damit Dateien explizit aus der Containererzeugung ausgeschlossen werden können kann eine .dockerignore Datei hinzugefügt werden. Diese funktioniert analog der .gitignore Dateien.

Um das Veröffentlichen auf die Artifact Registry zu vereinfachen, kann es eine gute Idee sein dem Namensschema der Registry zu folgen: location—docker.pkg.dev/your-project-id/registryname/containername:latest. Mittels docker build kann das Container Image erstellt werden:

				
				1

						docker build -t LOCATION-docker-pkg.dev/your-project-name/democontainer/democontainer:latest .

Um das erstellt Image local zu testen, kann dieses mittels docker run auch lokal ausgeführt werden. Wichtig ist dabei zu beachten die notwendigen Umgebungsvariablen mit zugeben und den Port zu exponieren.

				
				1

						docker run -e PORT=8080 -p 8080:8080 <image_id>

Werden im Webservice Google Identitäten verwendet, dann müssen Informationen über den zu verwendenden Google Account mitgegeben werden. Wie dies im Detail funktioniert findet sich unter Cloud Run lokal testen.

Den ersten Container manuell deployen

Bevor es möglich ist den Cloud Run Service zu erstellen muss das Container Abbild einmal manuell in die Artifact Registry veröffentlicht werden.

				
				1

						docker push europe-west3-docker.pkg.dev/snappy-nature-350016/democontainer/democontainer:latest

Erstellen und veröffentlichen des Container Abbilds kann auch in einem Kommando erfolgen:

				
				1
2

						docker buildx build -t name_of_image —push -f Dockerfile

Cloud Run Service aufsetzen

Da der initiale Container nun in der Aritfact Registry vorhanden ist, kann der Cloud Run Service daraus erstellt werden. Für den Cloud Run Service wird eine neue Resource im main.tf erstellt.

resource "google_cloud_run_service" "cloud_run_demo" {

project = var.project_name

name = var.RunJobName

location = var.location

template {

spec {

containers {

image = "europe-west3-docker.pkg.dev/${var.project_name}/${var.container_name}/${var.container_name}:latest"

}

traffic {

percent = 100

latest_revision = true

}

# A Schedule to run the job on

resource "google_cloud_scheduler_job" "timer_trigger" {

name = "scheduled-cloud-run-job"

project = var.project_name

description = "Invoke a Cloud Run container on a schedule."

schedule = "*/8 * * * *"

time_zone = "Europe/Berlin"

attempt_deadline = "320s"

retry_config {

retry_count = 2

}

http_target {

http_method = "POST"

uri = google_cloud_run_service.cloud_run_demo.status[0].url

oidc_token {

service_account_email = google_service_account.cloud_shedule_caller.email

}

Dieser Service ist zunächst privat, alle Identitäten die diesen Service aufrufen wollen benötigen die Rolle roles/run.invoker. Da der Cloud Scheduler den Service regelmäßig aufrufen soll, muss die Identität des Schedulers Mitglied der Rolle sein.

				
					
				1
2
3
4
5
6
7
8
9
10
11

						# allow shedule to run the cloud run service
resource "google_cloud_run_service_iam_binding" "invoker" {
  location = var.location
  project  = var.project_name
  service  = google_cloud_run_service.cloud_run_demo.name
  role     = "roles/run.invoker"
  members = [
    "user:${var.service_owner}",
    "serviceAccount:${google_service_account.cloud_shedule_caller.email}"
  ]
}

					

			

Nun kann der Terraform Dreiklang verwendet werden den Service und den Scheduler Job zu erstellen.

Cloud Build Trigger erstellen

Der finale Schritt um ein kontinuierliches deployment des Services zu erreichen ist einen Cloud Build Trigger einzurichten. Der Cloud Build Trigger beobachtet Veränderungen am GitHub Repository und erstellt bei jedem neuen Commit auf dem main branch eine neue Version des Cloud Run Services. Die hier vorgestellte Pipeline beinhaltet nur das Erstellen und Veröffentlichen des Containers. Für eine produktive Implementation ist unbedingt zu empfehlen auch noch ein Testschritt mit einzufügen. Mit dem folgenden Code wird der Trigger mittels Terraform erstellt:

				
					
				1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

						resource "google_cloudbuild_trigger" "demo_cloud_build" {
  name            = "CloudRundDemo"
  project         = var.project_name
  description     = "Triggers a new cloud run image to be build, when new code is published"
  service_account = google_service_account.cloud_run_demo_builder.id
  github {
    owner = "fingineering"
    name  = "GCPCloudRunDemo"
    push {
      branch = "main"
    }
  }
  filename = "cloudbuild.yaml"
 
  substitutions = {
    "_REPO"         = var.container_name
    "_IMAGE_NAME"   = var.container_name
    "_LOCATION"     = var.location
    "_SERVICE_NAME" = google_cloud_run_service.cloud_run_demo.name
  }
}
 
# IAM Binding for Cloud Run
# Allow Cloud Build to edit the cloud run service
resource "google_cloud_run_service_iam_member" "triggerRunServiceDeveloper" {
  location = var.location
  project  = var.project_name
  service  = google_cloud_run_service.cloud_run_demo.name
  role     = "roles/run.admin"
  member   = "serviceAccount:${google_service_account.cloud_run_demo_builder.email}"
}

					

			

Der Cloud Build Trigger nutzt zur Definition des Deployment Processes die Datei cloudbuild.yaml, diese enthält die drei Schritte zum Erstellen des Container Abbilds, Veröffentlichen des Abbilds und Erzeugen einer neuen Version des Cloud Run Services.

				
					
				1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

						steps:
# build step
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', '$_LOCATION-docker.pkg.dev/$PROJECT_ID/$_REPO/$_IMAGE_NAME:latest', './App' ]
# push to artifact registry
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', '$_LOCATION-docker.pkg.dev/$PROJECT_ID/$_REPO/$_IMAGE_NAME:latest']
# Deploy Container to Cloud Run
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  entrypoint: gcloud
  args: ['run', 'deploy', '$_SERVICE_NAME', '--image', '$_LOCATION-docker.pkg.dev/$PROJECT_ID/$_REPO/$_IMAGE_NAME:latest', '--region', '$_LOCATION']
images:
- '$_LOCATION-docker.pkg.dev/$PROJECT_ID/$_REPO/$_IMAGE_NAME:latest'
options:
  logging: CLOUD_LOGGING_ONLY

					

			

Erstellen des Container Abbilds mittels des Docker build Tools. Das erste Argument ist die Aktion build, das zweite und dritte beziehen sich auf den Tag des Images und das vierte gibt den Ort vor an dem nach dem Dockerfile gesucht wird
Veröffentlichen des Container Images in die Artifact Registry mittels des Cloud Build Docker Tools. Als Argumente werden die Aktion push und das Ziel mitgegeben.
Veröffentlichen der neuen Version des Images im Cloud Run Service mittels des Google Cloud SDK Tools gcloud

Alle drei Schritte nutzen Variablen, um Projektziel, Image Tag und Ort flexibel durch den Build Trigger zu steuern. Diese Variable werden bei der Erstellung des Triggers mit definiert, d.h. dieser finden sich in der Definition des Cloud Build Triggers in der Terraform Datei. Die Variablen werden substitutions genannt, es ist zu beachten, das nutzerdefinierte Variablen mit einem Unterstrich beginnen müssen, nur Systemvariablen, wie die PROJECT_ID.

Das Erstellen des Cloud Build Triggers kann versagen, in diesem Falle sollten die Einstellungen von Cloud Build in der Cloud Console geprüft werden. Die Service Account Berechtigungen für Cloud Run und Service Accounts müssen aktiviert sein, wie im Bild unten.

Service Account Berechtigungen für Cloud Run und Service Accounts

Zusammenfassung

Mit Hilfe von Terraform ist es möglich ein vollständig in Code definierten, kontinuierlich veröffentlichten Google Cloud Run Service zu erstellen. Dazu werden GCP Services verwendet, eine Flask Webapp in einem Container zu verpacken und diesen auf Cloud Run zu veröffentlichen.

Für Fragen erstellt gerne ein Issue oder ihr findet mich auf LinkedIn.

Der Quellcode ist in GitHub verfügbar: https://github.com/fingineering/GCPCloudRunDemo

7 Reasons Why You Need Cloud Cost Management Tool

November 2, 2022/in Cloud/by Nadica Metuleva

Many businesses today use a CCM (Cloud Cost Management) tool to optimize their spending on the cloud. Read this article to learn why.

7 Reasons Why You Need Cloud Cost Management Tool (And Where to Find One)

The Cloud. Do you remember the time when this sounded like a silly, fiction-like buzzword? No one imagined that a few years later, we’d be amazed by the technology. Or, that the clouds will be used by millions of people and businesses.

As time passes, more and more businesses shift to the cloud. They use clouds for data warehousing and to share information with remote workers and clients. They use it to keep information easily accessible and safely stored in a digital format.

Clouds have become irreplaceable in the world. In 2020, the total worth of this market reached $371.4 billion. With a compound annual growth rate of 17.5%, the projections tell us that this market will reach $832 billion by 2025.

Today, zettabytes of data are placed in the top storage services such as Google Drive, Google Workspace, and Dropbox, as well as private IT clouds.

Source: https://www.cloudwards.net/cloud-computing-statistics/

In a few years from now, there will be over a hundred zettabytes of data placed in the cloud, which amounts to nearly half of the global data storage projected for 2025. At this point, it is safe to say that clouds are trending in the business world and we can only expect these numbers to grow.

This raises a very big question – how will businesses handle their cloud usage?

One of the biggest challenges that companies face when it comes to the cloud is controlling cloud costs. Fortunately, there’s such a thing as cloud cost optimization, used to help businesses minimize their expenses and achieve better cloud performance.

This article will tell you all about cloud cost optimization and the tools used to make this happen. Read on to learn about the top 7 reasons why you need to invest in a good CCM tool.

What is cloud cost optimization?

Before we jump at the benefits of using a CCM, let’s discuss cloud cost optimization a bit. CCO refers to the act of reducing resource waste on clouds by scaling and choosing the resources required for cloud functions.

Data analytics in the cloud can transform your business altogether. While this is impossible to analyze on your own, certain tools can help you find ways to optimize the cloud performance.

Depending on what you use for this purpose, the success of your CCO strategy can vary. Nevertheless, with a good tool, you can get insight into what you’re doing right and what you need to change to get the most out of your cloud investment.

Choosing one of the best cloud cost management tools

When it comes to choosing cloud cost management tools, Zluri is the answer to all your questions. The popular SaaS management platform has a full range of top-ranked tools to be used for cloud cost optimization.

According to Zluri, the top choice available to businesses today is Harness.

The number one CCM tool will proactively detect anomalies and tell you all you need to know about your cloud resources. It will also predict your cloud spending and help you make smart data-based decisions. Lastly, Harness allows users to automate idle resource management.

Source: https://www.zluri.com/blog/cloud-cost-management-tools/

Some of the things to look for when choosing a cloud cost management tool are:

Automated cost optimizatione. the option to schedule resources to turn on and off to reduce costs and manual efforts.
Visibility of resources. The tool you use should make various resources used in the cloud environment visible such as applications, containers, and servers.
State of resources. A good CCM will tell you about the state of resources, how they are used, and whether they are underutilized.
Multi-cloud support. Many businesses store their data across different cloud providers. If you are one of them, you need a tool that offers multi-cloud support.

Naturally, the pricing and ease of use are also important when you’re making your choice.

Reasons to invest in a cloud cost management tool

Now that you know where to find the best CCM, let’s talk about what you get if you use a cloud cost management tool.

It all starts with what the CCM can do for you. Here are just a few of the most popular functions:

Forecast and budget the cloud spend with a great deal of accuracy
Inform on the current cloud costs
Discover areas that can increase profitability and find the least profitable projects within the cloud
Give you tips on adjusting the pricing structure and re-thinking unnecessary features

https://www.jamcracker.com/blogs/fundamentals-of-cloud-cost-management

With the right tool in your arsenal, you can reap many benefits. We present you with the top seven.

https://www.techtarget.com/searchcloudcomputing/feature/5-ways-to-reduce-cloud-costs

1. Define the cloud costs

451Research reported that 73% of US cloud users look at their cloud expenses as a fixed cost. But, these expenses aren’t fixed – they are variables. That’s why businesses often come across cost inefficiencies. Cloud service charging changes every day and interpreting their plans is not as straightforward as it might seem.

However, with a good CCM tool, you can tie all the expenses to pricing and value and figure out a fixed quote for cloud spending. You can use this data to plan and manage your budget or figure out ways to make your cloud use more cost-conscious.

2. More transparency

A good model for cost optimization will positively impact your operational, as well as financial aspects. Thanks to the analytics feature in a CCM tool, you can monitor and better organize your cloud expenditures. You’ll know how your cloud is used, how much every action costs, and what you can do to optimize this. More transparency leads to better decision-making.

3. Reduce waste and correct defects

Let’s say that you invest in a cloud cost management tool. It will discover the resources you are spending and detect features that are underused. You can use this information to manage neglected tools and eliminate the unnecessary. You can correct the defects and reduce waste.

As a result of the transparency and analytics, users of a cloud can find a better cost-performance balance and optimize their cloud.

4. Smart predictions

A cloud cost management tool will provide you with more than waste information. The usage patterns you’ll get access to can be analyzed and used to make smart predictions. You can use the data to perform consolidated cost analysis and predict future expenses for the cloud. Base this on additional cloud services you’ll need, increased cloud usage, as well as scalability.

5. Timely reviews

By performing audits of the cloud usage and resource, teams can get better control over this matter. A CCM tool can perform audits every couple of months and hand out reports to the company about cloud usage.

These reports are shared within organizations and used to categorize expenses and drop under-utilized resources in the cloud.

6. Memory and storage management

Another great benefit of using a CCM tool is that it helps you manage your storage. With it, you can detect unused data, manage regular data backups, and check the usage of the cloud.

Going through everything stored in your cloud can take forever, but not with these tools. With these tools, organizations can find and discard data that is useless and save a lot of money in the process. It’s amazing how much is stored in clouds without actual use. The right cloud cost management tool can point you in the direction of things you don’t need and remind you to cut them loose.

7. Multi-cloud synchronization

Chances are, your business is using more than one cloud service to store its data. After a while, your data is scattered across different clouds and you cannot find or manage information as easily as you should. When you’re handling more clouds, it gets much harder to optimize the costs or reduce the expenses.

But, not if you have a good CCM tool.

A good CCM tool offers synchronization between at least the most-used cloud services out there. This will allow you to optimize your costs across different clouds, figure out where you can save some money, and find a way to organize your cloud usage more efficiently.

The best time to optimize your cloud is now!

Who would have thought that clouds will become such a big part of our business operations? If you are using clouds for your business, investing in a cloud cost management tool is the next smartest step to take. Thanks to a small investment in a good tool, you can save a fortune on cloud costs.

Kubernetes – der Steuermann für dein Big Data Projekt!

February 5, 2022/in Data Engineering, Data Migration, Data Warehousing, Insights, Main Category, Software Development, Tool Introduction/by Niklas Lang

Kubernetes ist ein Container-Orchestrierungssystem. Damit lassen sich also Anwendungen auf verschiedene Container aufteilen, wodurch sie effizient und ausfallsicher ausgeführt werden können. Kubernetes ist ein Open-Source-Projekt und wurde erstmals im Jahr 2014 veröffentlicht. Es ist sehr leistungsfähig und kann verteilte Systeme, die über Tausende von Rechnern verstreut sind, verwalten.

In diesem und in vielen anderen Beiträgen zum Thema Kubernetes wird die Abkürzung k8s genutzt. Sie kommt daher, dass das Wort Kubernetes mit k beginnt, mit s endet und dazwischen 8 Buchstaben stehen. Bevor wir beginnen, noch eine kleine Anmerkung, woher der Name Kubernetes eigentlich stammt: Das griechische Wort „Kubernetes“ bedeutet Steuermann und beschreibt genau das, was Kubernetes macht, es steuert. Es steuert verschiedene sogenannte Container und koordiniert deren Ausführung.

Was sind Container und warum brauchen wir sie?

Eines der bestimmenden Merkmale von Big Data oder Machine Learning Projekte ist, dass ein einzelner Computer in vielen Fällen nicht ausreicht, um die gewaltigen Rechenlasten bewältigen zu können. Deshalb ist es notwendig, mehrere Computer zu verwenden, die sich die Arbeit teilen können. Zusätzlich können durch ein solches System auch Ausfälle von einzelnen Computern kompensiert werden, wodurch wiederum sichergestellt ist, dass die Anwendung durchgehend erreichbar ist. Wir bezeichnen eine solche Anordnung von Computern als Computing-Cluster oder verteiltes System für paralleles Rechnen.

Im Mittelpunkt des Open Source Projektes Docker stehen die sogenannten Container. Container sind alleinstehende Einheiten, die unabhängig voneinander ausgeführt werden und immer gleich ablaufen. Docker-Container können wir uns tatsächlich relativ praktisch wie einen Frachtcontainer vorstellen. Angenommen, in diesem Container arbeiten drei Menschen an einer bestimmten Aufgabe (Ich weiß, dass dies wahrscheinlich gegen jedes geltende Arbeitsschutzgesetz verstößt, aber es passt nun mal sehr gut in unser Beispiel).

In ihrem Container finden sie alle Ressourcen und Maschinen, die sie für ihre Aufgabe benötigen. Über eine bestimmte Lucke im Container bekommen sie die Rohstoffe geliefert, die sie benötigen, und über eine andere Lucke geben sie das fertige Produkt heraus. Unser Schiffscontainer kann dadurch ungestört und weitestgehend autark arbeiten. Den Menschen darin wird es nicht auffallen, ob sich das Schiff inklusive Container gerade im Hamburger Hafen, in Brasilien oder irgendwo bei ruhigem Seegang auf offenem Meer befindet. Solange sie kontinuierlich Rohstoffe geliefert bekommen, führen sie ihre Aufgabe aus, egal wo sie sind.

Foto von Ian Taylor auf Unsplash

Genauso verhält es sich mit Docker Containern im Softwareumfeld. Es handelt sich dabei um genau definierte, abgeschlossene Applikationen, die auf verschiedenen Maschinen/Rechnern laufen können. Solange sie die festgelegten Inputs kontinuierlich erhalten, können sie auch kontinuierlich weiterarbeiten, unabhängig von ihrer Umgebung.

Was macht Kubernetes?

Wir nutzen Computing-Cluster, um rechenintensive Projekte, wie Machine Learning Modelle, auf mehreren Rechnern zuverlässig und effizient laufen lassen zu können. In Containern wiederum programmieren wir Unteraufgaben, die in sich abgeschlossen sein können und die immer gleich ablaufen, egal ob auf Rechner 1 oder Rechner 2. Das klingt doch eigentlich ausreichend, oder?

Verteilte Systeme bieten gegenüber Einzelrechnern neben Vorteilen auch zusätzliche Herausforderungen, beispielsweise bei der gemeinsamen Nutzung von Daten oder der Kommunikation zwischen den Rechnern innerhalb des Clusters. Kubernetes übernimmt die Arbeit die Container auf das Cluster zu verteilen und sorgt für den reibungslosen Ablauf des Programmes. Dadurch können wir uns auf das eigentliche Problem, also unseren konkreten Anwendungsfall, konzentrieren.

Kubernetes ist also wie der Kapitän, oder Steuermann, auf dem großen Containerschiff, der die einzelnen Container auf seinem Schiff richtig platziert und koordiniert.

Aufbau eines Kubernetes Clusters

Kubernetes wird normalerweise auf einem Cluster von Computern installiert. Jeder Computer in diesem Cluster wird als Node bezeichnet. Auf einem Computer bzw. Node wiederum laufen mehrere sogenannte Pods. Auf den Pods sind die schlussendlichen Container mit den kleineren Applikationen installiert und können in einem lokalen System kommunizieren.

Damit die Pods und die Container darin ohne Komplikationen laufen können, gibt es einige Hilfsfunktionen und -komponenten im Kubernetes Cluster, die dafür sorgen, dass alle Systeme reibungslos funktionieren:

Aufbau Kubernetes Cluster | Abbildung: Kubernetes

Control Plane: Das ist der Rechner, welcher das komplette Cluster überwacht. Auf diesem laufen keine Pods für die Anwendung. Stattdessen werden den einzelnen Pods die Container zugewiesen, die auf ihnen laufen sollen.
Sched: Der Scheduler hält innerhalb des Clusters Ausschau nach neu erstellen Pods und teilt diese zu bestehenden Nodes zu.
ETCD: Ein Speicher für alle Informationen, die im Cluster anfallen und aufbewahrt werden müssen, bspw. Metadaten zur Konfiguration.
Cloud Controller Manager (CCM): Wenn ein Teil des Systems auf Cloud Ressourcen läuft, kommt diese Komponente zum Einsatz und übernimmt die Kommunikation und Koordination mit der Cloud.
Controller Manager (CM): Die wichtigste Komponente im Kubernetes Cluster überwacht das Cluster und sucht nach ausgefallenen Nodes, um dann die Container und Pods neu zu verteilen.
API: Diese Schnittstelle ermöglicht die Kommunikation zwischen den Nodes und dem Control Plane.

Die Nodes sind deutlich schlanker aufgebaut als das Control Plane und enthalten neben den Pods zwei wesentliche Komponenten zur Überwachung:

Kubelet: Es ist das Control Plane innerhalb eines Nodes und sorgt dafür, dass alle Pods einwandfrei laufen.
Kube-Proxy (k-proxy): Diese Komponente verteilt den eingehenden Node Traffic an die Pods, indem es das Netzwerk innerhalb des Nodes erstellt.

Fazit

Ein Netzwerk aus verschiedenen Computern wird als Cluster bezeichnet und wird genutzt, um große Rechenlasten auf mehrere Computer aufteilen und dadurch effizienter gestalten zu können. Die kleinste Einheit, in die man eine Applikation aufteilen kann, ist der Docker Container. Dieser beinhaltet eine Unteraufgabe des Programms, die autark, also unabhängig vom System, ausgeführt wird.

Da es in einem Computing-Cluster sehr viele dieser Container geben kann, übernimmt Kubernetes für uns das Management der Container, also unter anderem deren Kommunikation und Koordinierung. Das Kubernetes Cluster hat dazu verschiedene Komponenten die dafür sorgen, dass alle Container laufen und das System einwandfrei funktioniert.

AI Platforms – A Comprehensive Guide

July 26, 2021/in Tool Introduction/by Jaimin Dave

A comprehensive guide compiled to introduce readers to AI platforms, their types, and benefits. A concluding section to discuss AI platform selection strategy with Attri’s Best of Breed approach to build AI platforms.

Don’t you think that this century is really fortunate? In my opinion, the answer is yes; we witnessed technological transformations and their miracles that created substantial changes in our lifestyle. While talking about these life-changing technological revolutions, AI or artificial intelligence deserves a front seat due to its incredible contribution and capabilities. Now everyone knows AI has limitless potential simply from creating funny faces in mobile to taking informed and intelligent business decisions. In the last 50 years, we have progressed by leaps and bounds to give machines the ability to understand, help and mimic us.

Artificial intelligence enables machines to imitate human intelligence across a variety of domains ranging from problem-solving and reasoning to General Intelligence and in-depth knowledge representation. With tremendous progress in AI, another enabler came into existence and received attention—AI platforms. AI-platform is a layer that integrates all the tools and processes required to build, deploy and monitor ML models. In this article, we shall go through the various aspects of AI platforms covering a range of topics like AI Platform types, the benefits such platforms entail, selection strategy in detail as well as a brief look into Attri’s industry contribution with an Open AI Platform.

Diving Deeper With AI Platforms

The AI Platform acts as a layer over your current AI infrastructure and integrates all the tools and processes required to develop ML models. It provides you the flexibility to integrate all your ML models under a single roof. With this flexibility, you can create and deploy several ML models over the platform. Further, you can even monitor these models to confirm that they are serving their intended purpose. AI platform makes your AI adoption easy by attaining the following requirements–

Use of vast data to develop ML solutions.
Ensure transparency and reproducibility within a project
Accelerate collaboration and governance within teams
Ensure scalability for ever-growing machine learning demands

An ideal AI platform should ensure the following features for better addressing different challenges.

Seamless access control: Ensure robust access control to team members in order to conquer the challenge of centralized data access with AI projects.
Excellent monitoring: Integrate top-notch observability practices while developing ML models.
Data and technology-agnostic integration: Seamless experience to enterprises with infrastructure set up responsibility handed over to platform providers
All-inclusive Platform: Single platform to facilitate all underlying tasks from data preparation to model deployment
Continuous Improvement: Ability to produce and deploy models as a reproducible package and thereby integrate changes with models that are already in production
Rapid Processing: Faster data preparation and powerful visual interfaces

AI Platform Classification

With loads of AI platform providers available in the market, AI platform classification becomes a tough job, as it requires thinking separately on each platform’s offerings, its features, and cost factors. Also, you need to check whether AI solutions are open source AI platforms or proprietary offerings.

We have decided to present an AI platform classification based on its striking features and offerings. With this, we have classified AI platforms across three main classes—

AI cloud-based platforms
AI conversational platforms
No code AI platforms

Cloud based AI Platforms

All major cloud providers offer cloud-based AI platforms to boost businesses with AI capabilities. With cloud AI platforms, enterprises can leverage cloud providers’ matchless technical expertise to overcome affordability and data requirement challenges associated with AI implementation. Cloud-based AI offerings benefit businesses with economic AI solutions, defined and pre-packaged services, lower risks, and modern technology.

Amazon Web Services

AWS offers a comprehensive set of AI solutions to conquer major hurdles in the AI adoption journey of businesses. AWS has been recognized as the topmost cloud AI partner with its broad capable portfolio. AWS pre-trained models cater to diverse use cases like forecasting, recommendations, computer vision, language interpretation, customer engagement, and safety for deploying ML models at scale. Amazon also provides text analytics, NLP, chatbots, and document analysis solutions. Fully managed AWS packages amplify your experience with minimum resource requirements and wizard-based friendly model development experience. Hence, AWS is one of the top cloud AI partners that cater to your AI adoption needs.

Google cloud

The Google Cloud Platform (GCP) is a Google offering for cloud-driven computing services devised to support multiple use cases such as hosting containerized applications, massive-scale data analytics platforms, and even applying ML and AI for business use cases. Google AI Platform is a Google Cloud offering that helps build, deploy and manage machine learning models in the cloud.

Google leverages enterprise AI experience through its consumer-facing products. Google helps improve customer satisfaction through Contact Center AI. Google offering DialogFlow CX is used to create advanced chatbots that handle customer messaging, response, and voice recognition. Digiflow is applied to create virtual agents for messaging services, mobile apps, and IoT devices.

Google’s Cloud Vision API is beneficial to recognize objects, logos, and landmarks within content or images. Google provides Natural Language API to bring more clarity in content classification, entities, syntax, and sentiments. Further, Google speech API helps in converting audio to text and recognizing 110 languages.

Google’s Cloud ML services facilitate better decision-making with end-to-end ML solutions. Google offers an all-inclusive ML development platform that enables effective decision-making backed by explainable AI, continuous evaluation, data labeling, pipelines, training, and what-if tool. This platform is based on the TensorFlow framework and it enables building predictive models for various scenarios.

Kubeflow is a Cloud-Native and open-source platform that helps you build portable ML pipelines that can be executed on-premises or on the cloud. With this, you can access Google technologies like TPUs, TensorFlow, and TFX tools as you deploy your ML models in production.

For expert ML developers, Google provides an Open Source AI platform with TensorFlow models that are trained for various scenarios. It offers an excellent prediction service using trained models.

Microsoft Azure

Similar to Amazon Web Services, Microsoft Azure ML capabilities are based on its real-time and live applications. Azure provides superior machine learning capabilities to develop, train, and deploy machine learning models through Azure Machine Learning, Azure Databricks, and ONNX.

Azure Machine Learning

A Python-based ML service to facilitate automated machine learning.

ONNX

An open-source model format enables machine learning through various frameworks and hardware platforms of the user’s choice.

Azure Cognitive Search

Formerly known as Azure Search,this is the only cloud search service that allows built-in AI capabilities to explore content effectively at scale. Microsoft empowers the user with cognitive search services like text analytics, translation, document analytics, custom vision, and Azure Machine Learning solutions.

IBM Cloud

IBM has brought Watson studio a data analysis application to accelerate innovation and ML-centric practices in business. IBM Cloud AI Platform offers 170 services with more emphasis on data-speech conversions and analytics. Watson Studio offers an all-inclusive suite to work with data and train, build and deploy ML models.

An innovative giant IBM also brought AI based learning platform recently to aid academic stakeholder like students, researchers and teachers.

AI Conversational Platforms

Conversational AI opens new doors for automated conversations between an enterprise and its customers. These conversations include messaging or voice-based communication platforms to enable text or audio-based conversation.

Conversational platforms leverage your customer experience with a range of applications such as follow-up, guidance, or the resolution of customer queries and round-the-clock support. These platforms are beneficial to drive more leads, increase conversions by cross-selling and upselling, promotional efforts, customer research, queries resolution and customer feedback handling, etc.

AI technology helps systems to mimic human conversations to a certain level and with great accuracy. An AI offering- Natural Language processing is used to shape these conversations by understanding intent, text, speech, and languages.

Intelligent Virtual Assistants

The intelligent virtual assistants represent an advanced level of Conversational AI and their discussion is incomplete without a mention to Siri and Alexa. Most popular intelligent virtual assistants include Siri by Apple, Alexa by Amazon, Google Assistant, and Bixby by Samsung. While Alexa performs as a voice assistant for the home, Siri and Bixby stand as mobile assistants with numerous operations support like navigation, text-to-speech, response to weather, quick reply, and address search.

SAP Conversational AI

SAP Conversational AI is one of the leading conversational AI platforms. With its friendly UI and multiple versioning, it offers a better experience of mimicking human conversations. SAP Conversational AI Platform uses NLP to facilitate developing chatbot that works more humanely and serves your customers 24*7. Its striking features include—

Simple integration
NLP capabilities
Analytics tools to help you
Multi-language support

Clinc

A powerful self-learning Conversational AI Platform enriched with NLP capabilities and machine learning. It secures top position in the Conversational AI Platform list due to its learning from previous conversations and improving responses over time. Its feature set include—

No technical expertise required
Self-learning abilities
NLP capabilities

Kore.ai

An enterprise-grade Conversational AI Platform to cater to your consumer as well as staff needs. It helps to build a virtual chatbot for any suitable platform without compromising the safety and security standards. Its major features cover—

The high degree of customization for chatbots
Comprehensive analytics with FAQs and alerts
Simple integration with ML models and channels
Flexible deployment
Supported with a multi-pronged NLP engine

Mindmeld

It is an excellent option as a Deep-Domain Conversational AI Platform with NLP capabilities. It can be used for both text-based and voice-based virtual assistants. This platform effectively caters to multiple industries and their numerous use cases. Check its striking features list—

Open-source platform
NLP capabilities
Supports discovering on-demand video or music
Quick chat-based transactions

No Code AI Platforms

As discussed above, AI platform classification necessitates platform considerations from various perspectives. We are introducing another category of AI platforms—No Code AI Platforms. The motivation behind introducing these platforms is to encourage enterprise AI adoption while keeping AI implementation costs low and minimizing dependencies on skilled professionals. Many IT giants are now offering no-code AI Platforms to enterprises for their AI adoption.

Google ML Kit

Google ML Kit comes with Android and iOS and it facilitates the integration of functions with lesser codes or with minimum knowledge of machine learning algorithms. This open source AI Platform supports different features such as text recognition, face detection, and landmark recognition.

RapidMiner Studio

RapidMiner Studio enables powerful data analytics with drag and drop features. Rapidminer Studio allows easy integration with databases, warehouses, social media for easy data access by authorized persons.

ML Platform Selection Strategy

Having discussed so many types of ML platforms, their features, and offerings, the next question is–how to select the best ML Platform for an enterprise AI adoption. Well, to answer this Million-Dollar question, we need to consider a few key aspects, such as

Who will use and benefit from the AI Platform? It is required to find out AI platform users here, the data science team, analytics team, developers, and how the platform will benefit each stakeholder.
The next aspect is to explore the skill levels of AI platform users, are they competent to handle ML development and analytics requirements with years of experience
Proficiency of users with programming languages
The next point in finalizing the AI platform strategy is to conclude code-first or code-free approaches to streamline AI workflows. This aspect can be studied by thinking about different attributes such as data preparation ease, feature engineering automation, ML algorithms, Model Deployment ease, and platform integration aspects.

Once you come up with answers to these queries, you will be able to finalize the best AI Platform Selection strategy for your enterprise. It can be a unique cloud platform, or even it can be a hybrid solution with a “best-of-breed” approach.

All-in-one platform strategy involves getting one end-to-end platform for the entire AI project lifecycle from raw data prep to ETL to building and operationalizing models followed by monitoring and governance of systems.

The best-of-breed approach allows using the preferred and custom tools for each phase of the lifecycle and aligning these tools together to build a customized platform solution for AI adoption.

This approach offers an excellent AI platform solution for organizations looking for flexible, inexpensive, change-oriented AI solutions and having a DIY spirit. With this mix-and-match approach, you can combine APIs offered by different cloud platforms and deliver AI solutions that cater to your AI use cases. Organizations using the best-of-breed approach are more comfortable with technology shifts with their abilities to use, adopt and swap out tools as requirement changes.

Business Process AI Transformation Simplified With Attri’s Open AI Platform

At Attri, we provide AI platform solutions to diverse industry verticals. With our flagship Open AI Platform, we heighten your AI adoption experience with a rich array of platform features like—

Customizable best-of-breed architecture
Utilize existing infrastructure
AI as a platform solution
Reduced effort in migrating to a new technology
Centralized Monitoring and Governance
Explainable and Responsible AI

We help you achieve your business process transformation goals with our unique AI offerings such as Open AI Platform and Open AI solutions.

Our AI platform assures multiple benefits to your enterprise while keeping AI adoptions costs low and ensuring faster AI implementations. We can summarize the benefits of Attri Open AI Platform as under–

No efforts in reinventing complete AI suites

Attri’s AI Platform integrates multiple AI services and eliminates the need for reinventing complete AI suites. The platform delights enterprises with scalability, the ability to reuse current infrastructure, and customizable architecture.

Accelerated Go To Market

Attri’s Open AI Platform ensures accelerated GTM with a sincere approach to testing, reviewing, and finalizing reference templates for different industries.

No vendor lock-in

With Open AI Platform, we bring client-friendly policies such as no vendor lock-in and flexibility to choose their preferred tools and technology.

High reliability

We keep our AI Platform highly reliable with a comprehensive testing approach. We also meet the growing requirements of enterprises by ensuring high scalability with our open AI platform.

Get connected with us for your enterprise AI adoption requirements.

Know more about our Open AI Platform…

Interesting links

Pages

Categories

Archive