CRISP-DM methodology in technical view

On this paper discuss about CRISP-DM (Cross Industry Standard Process for data mining) methodology and its steps including selecting technique to successful the data mining process. Before going to CRISP-DM it is better to understand what data mining is? So, here first I introduce the data mining and then discuss about CRISP-DM and its steps for any beginner (data scientist) need to know.

1 Data Mining

Data mining is an exploratory analysis where has no idea about interesting outcome (Kantardzic, 2003). So data mining is a process to explore by analysis a large set of data to discover meaningful information which help the business to take a proper decision. For better business decision data mining is a way to select feature, correlation, and interesting patterns from large dataset (Fu, 1997; SPSS White Paper, 1999).

Data mining is a step by step process to discover knowledge from data. Pre-processing data is vital part for a data mining. In pre-process remove noisy data, combining multiple sources of data, retrieve relevant feature and transforming data for analysis. After pre-process mining algorithm applied to extract data pattern so data mining is a step by step process and applied algorithm to find meaning full data pattern. Actually data mining is not only conventional analysis it is more than that (Read, 1999).

Data mining and statistics closely related. Main goal of data mining and statistic is find the structure of data because data mining is a part of statistics (Hand, 1999). However, data mining use tools, techniques, database, machine learning which not part of statistics but data mining use statistics algorithm to find a pattern or discover hidden decision.

Data mining objective could be prediction or description. On prediction data mining considering several features of dataset to predict unidentified future, on the other hand description involve identifying pattern of data to interpreted (Kantardzic, 2003).

From figure 1.1 shows data mining is the only one part of getting unknown information from data but it is the central process of whole process. Before data mining there are several processes need to be done like collecting data from several sources than integrated data and keep in data storage. Stored unprocessed data evaluated and selected with pre-processed activity to give a standard format than data mining algorithm to analysis for hidden pattern.

Data Mining Process

2 CRISP-DM Methodologies

Cross Industry Standard Process for data mining (CRISP-DM) is most popular and widely uses data mining methodology. CRISP-DM breaks down the data mining project life cycle into six phases and each phase consists of many second-level generic tasks. Generic task cover all possible data mining application. CRISP-DM extends KDD (Knowledge Discovery and Data Mining) into six steps which are sequence of data mining application (Martínez-Plumed 2019).

Data science and data mining project extract meaningful information from data. Data science is an art where a lot of time need to spend for understanding the business value and data before applying any algorithm then evaluate and deployed a project. CRISP-DM help any data science and data mining project from start to end by giving step by step process.

Present world every day billions of data are generating. So organisations are struggling with overwhelmed data to process and find a business goal. Comprehensive data mining methodology, CRISP-DM help business to achieve desirable goal by analysing data.

CRISP-DM (Cross Industry Standard Process for Data Mining) is well documented, freely available, data mining methodology. CRISP-DM is developed by more than 200 data mining users and many mining tool and service providers funded by European Union. CRISP-DM encourages organization for best practice and provides a structure of data mining to get better, faster result.

CRISP-DM is a step by step methodology. Figure-2.1 show the phases of CRISP-DM and process of data mining. Here one side arrow indicates the dependency between phases and double side arrow represents repeatable process. Six phases of CRISP-DM are Business understanding, Data understanding, Modelling, Evaluation and Deployment.

CRISP-DM

2.1 Business Understanding

Business Understanding or domain understanding is the first step of CRISP-DM methodology. On this stage identify the area of business which is going to transform into meaningful information by analysing, processing and implementing several algorithms. Business understanding identifies the available resource (human and hardware), problems and set a goal. Identification of business objective should be agreed with project sponsors and other unit of business which will be affected. This step also focuses about details business success criteria, requirements, constraints, risk, project plan and timeline.

2.2 Data Understanding

Data understanding is the second and closely related with the business understanding phase. This phase mainly focus on data collection and proceeds to get familiar with the data and also detect interesting subset from data. Data understanding has four subsets these are:-

2.2.1 Initial data collection

On this subset considering the data collection sources which is mainly divided into two categories like outsource data or internal source data.  If data is from outsource then it may costly, time consuming and may be low quality but if data is collected form internal source it is an easy and less costly, but it may be contain irrelevant data. If internal source data does not fulfil the interest of analysis than it is necessary to move outsource data. Data collection also give an assumption that the data is quantitative (continuous, count) or qualitative (categorical).  It also gives information about balance or imbalanced dataset.  On data collection should avoid random error, systematic error, exclusion errors, and errors of choosing.

2.2.2 Data Description

Data description performs initial analysis about data. On this stage it is going to determine about the source of data like RDBMS, SQL, NoSQL, Big data etc. then analysis and describe the data about size (large data set give more accurate result but time consuming), number of records, tables, database, variables, and data types (numeric, categorical or Boolean). On this phase examine the accessibility and availability of attributes.

2.2.3 Exploratory data analysis (EDA)

On exploratory data analysis describe the inferential statistics, descriptive statistics and graphical representation of data. Inferential statistics summarize the entire population from the sample data to perform sampling and hypothesis testing. On Parametric hypothesis testing  (Null or alternate – ANOVA, t-test, chi square test) perform for known distribution (based on population) like mean, variance, standard deviation, proportion and Non-parametric hypothesis testing perform when distribution is unknown or sample size is small. On sample dataset, random sampling implement when dataset is balance but for imbalance dataset should be follow random resampling (under  and over sampling), k fold cross validation, SMOTE (synthetic minority oversampling technique), cluster base sampling, ensemble techniques (bagging and boosting – Add boost, Gradient Tree Boosting, XG Boost) to form a balance dataset.

On descriptive statistics analysis describe about the mean, median, mode for measures of central tendency on first moment business decision. On second moment business decision describe the measure of dispersion about the variance, standard deviation and range of data.  On third and fourth moment business decision describe accordingly skewness (Positive skewness – heavier tail to the right, negative skewness – heavier tail to the left, Zero skewness – symmetric distribution) and Kurtosis (Leptokurtosis – heavy tail, platykurtosis – light tail, mesokurtic – normal distribution).

Graphical representation is divided into univariate, bivariate and multivariate analysis. Under univariate whisker plot, histogram identify the outliers and shape of distribution of data and Q-Q plot (Quantile – Quantile) plot describe the normality of data that means data is normally distribution or not.  On whisker plot if data present above of Q3 + 1.5 (IQR) and below of Q1 – 1.5 (IQR) is outlier. For Bivariate correlations identify with scatter plot which describe positive, negative or no correlation and also identify the data linearity or non-linearity. Scatter plot also describe the clusters and outliers of data.  For multivariate has no graphical analysis but used to use regression analysis, ANOVA, Hypothesis analysis.

2.2.4 Data Quality analysis

This phase identified and describes the potential errors like outliers, missing data, level of granularity, validation, reliability, bad metadata and inconsistency.  On this phase AAA (attribute agreement analysis) analysed discrete data for data error. Continuous data analysed with Gage repeatability and reproducibility (Gage R & R) which follow SOP (standard operating procedures). Here Gage R & R define the aggregation of variation in the measurement data because of the measurement system.

2.3 Data Preparation

Data Preparation is the time consuming stage for every data science project. Overall on every data science project 60% to 70% time spend on data preparation stage. Data preparation stapes are described below.

2.3.1 Data integration

Data integration involved to integrate or merged multiple dataset. Integration integrates data from different dataset where same attribute or same columns presents but when there is different attribute then merging the both dataset.

2.3.2 Data Wrangling

On this subset data are going to clean, curate and prepare for next level. Here analysis the outlier and treatment done with 3 R technique (Rectify, Remove, Retain) and for special cases if there are lots of outliner then need to treat outlier separately (upper outliner in an one dataset and lower outliner in another dataset) and alpha (significant value) trim technique use to separate the outliner from the original dataset. If dataset has a missing data then need to use imputation technique like mean, median, mode, regression, KNN etc.

If dataset is not normal or has a collinearity problem or autocorrelation then need to implement transformation techniques like log, exponential, sort, Reciprocal, Box-cox etc. On this subset use the data normalization (data –means/standard deviation) or standardization (min- max scaler) technique to make unitless and scale free data. This step also help if data required converting into categorical then need to use discretization or binning or grouping technique. For factor variable (where has limited set of values), dummy variable creation technique need to apply like one hot encoding.  On this subset also help heterogeneous data to transform into homogenous with clustering technique. Data inconsistencies also handle the inconsistence of data to make data in a single scale.

2.3.3 Feature engineering and selection/reduction

Feature engineering may called as attribute generation or feature extraction. Feature extraction creating new feature by reducing original feature to make simplex model. Feature engineering also do the normalized feature by producing calculative new feature. So feature engineering is a data pre-process technique where improve data quality by cleaning, integration, reduction, transformation and scaling.

Feature selections reduce the multicollinearity or high correlated data and make model simple. Main two type of feature selection technique are supervised and unsupervised. Principal Components Analysis (PCA) is an unsupervised feature reduction/ feature selection technique and LDA is a Linear Discriminant analysis supervised technique mainly use for classification problem. LDA analyse by comparing mean of the variables. Supervised technique is three types filter, wrapper and ensemble method. Filter method is easy to implement but wrapper is costly method and ensemble use inside a model.

2.4 Model

2.4.1 Model Selection Technique

Model selection techniques are influence by accuracy and performance.  Because recommendation need better performance but banking fraud detection needs better accuracy technique.  Model is mainly subdivided into two category supervised learning where predict an output variable according to given an input variable and unsupervised learning where has not output variable.

On supervised learning if an output variable is categorical than it is classification problem like two classes or multiclass classification problem. If an output variable is continuous (numerical) then the problem is called prediction problem. If need to recommending according to relevant information is called recommendation problem or if need to retrieve data according to relevance data is called retrieval problem.

On unsupervised learning where target or output variable is not present. On this technique all variable is treated as an input variable. Unsupervised learning also called clustering problem where clustering the dataset for future decision.

Reinforcement learning agent solves the problem by getting reward for success and penalty for any failure. And semi-supervised learning is a process to solve the problem by combining supervised and unsupervised learning method. On semi-supervised, a problem solved by apply unsupervised clustering technique then for each cluster apply different type of supervised machine learning algorithm like linear algorithm, neural network, K nearest  neighbour etc.

On data mining model selection technique, where output variable is known, then need to implement supervised learning.  Regression is the first choice where interpretation of parameter is important. If response variable is continuous then linear regression or if response variable is discrete with 2 categories value then logistic regression or if response variable is discrete with more than 2 categorical values then multinomial or ordinal regression or if response variable is count then poission where mean is equal to variance or negative binomial regression where variance is grater then mean or if response variable contain excessive zero values then need to choose Zero inflated poission (ZIP) or Zero inflated negative binomial (ZINB).

On supervised technique except regression technique all other technique can be used for both continuous or categorical response variable like KNN (K-Nearest Neighbour),  Naïve Bays, Black box techniques (Neural network, Support vector machine), Ensemble Techniques (Stacking, Bagging like random forest, Boosting like Decision tree, Gradient boosting, XGB, Adaboost).

When response variable is unknown then need to implement unsupervised learning. Unsupervised learning for row reduction is K-Means, Hierarchical etc., for columns reduction or dimension reduction PCA (principal component analysis), LDA (Linear Discriminant analysis), SVD (singular value decomposition) etc. On market basket analysis or association rules where measure are support and confidence then lift ration to determine which rules is important. There are recommendation systems, text analysis and NLP (Natural language processing) also unsupervised learning technique.

For time series need to select forecasting technique. Where forecasting may model based or data based. For Trend under model based need to use linear, exponential, quadratic techniques. And for seasonality need to use additive, multiplicative techniques. On data base approaches used auto regressive, moving average, last sample, exponential smoothing (e.g. SES – simple exponential smoothing, double exponential smoothing, and winters method).

2.4.2 Model building

After selection model according to model criterion model is need to be build. On model building provided data is subdivided with training, validation and testing.  But sometime data is subdivided just training and testing where information may leak from testing data to training data and cause an overfitting problem. So training dataset should be divided into training and validation whereas training model is tested with validation data and if need any tuning to do according to feedback from validation dataset. If accuracy is acceptable and error is reasonable then combine the training and validation data and build the model and test it on unknown testing dataset. If the training error and testing error is minimal or reasonable then the model is right fit or if the training error is low and testing error is high then model is over fitted (Variance) or if training error is high and testing error is also high then model is under fitted (bias). When model is over fitted then need to implement regularization technique (e.g. linear – lasso, ridge regression, Decision tree – pre-pruning, post-pruning, Knn – K value, Naïve Bays – Laplace, Neural network – dropout, drop connect, batch normalization, SVM –  kernel trick)

When data is balance then split the data training, validation and testing and here training is larger dataset then validation and testing. If data set is imbalance then need to use random resampling (over and under) by artificially increases training dataset. On random resampling by randomly partitioning data and for each partition implement the model and taking the average of accuracy. Under K fold cross validation creating K times cross dataset and creating model for every dataset and validate, after validation taking the average of accuracy of all model. There is more technique for imbalance dataset like SMOTH (synthetic minority oversampling technique), cluster based sampling, ensemble techniques e.g. Bagging, Boosting (Ada Boost, XGBoost).

2.4.3 Model evaluation and Tuning

On this stage model evaluate according to errors and accuracy and tune the error and accuracy for acceptable manner. For continuous outcome variable there are several way to measure the error like mean error, mean absolute deviation, Mean squared error, Root mean squared error, Mean percentage error and Mean absolute percentage error but more acceptable way is Mean absolute percentage error. For this continuous data if error is known then it is easy to find out the accuracy because accuracy and error combining value is one. The error function also called cost function or loss function.

For discrete output variable model, for evaluation and tuning need to use confusion matrix or cross table. From confusion matrix, by measuring accuracy, error, precision, sensitivity, specificity, F1 help to take decision about model fitness. ROC curve (Receiver operating characteristic curve), AUC curve (Area under the ROC curve) also evaluate the discrete output variable. AUC and ROC curve plot of sensitivity (true positive rate) vs 1-specificity (false positive rate).  Here sensitivity is a positive recall and  recall is basically out of all positive samples, how sample classifier able to identify. Specificity is negative recall here recall is out of all negative samples, how many sample classifier able to identify.  On AUC where more the area under the ROC is represent better accuracy. On ROC were step bend it’s indicate the cut off value.

2.4.4 Model Assessment

There is several ways to assess the model. First it is need to verify model performance and success according to desire achievement. It needs to identify the implemented model result according to accuracy where accuracy is repeatable and reproducible. It is also need to identify that the model is scalable, maintainable, robust and easy to deploy. On assessment identify that the model evaluation about satisfactory results (identify the precision, recall, sensitivity are balance) and meet business requirements.

2.5 Evaluation

On evaluation steps, all models which are built with same dataset, given a rank to find out the best model by assessing model quality of result and simplicity of algorithm and also cost of deployment. Evaluation part contains the data sufficiency report according to model result and also contain suggestion, feedback and recommendation from solutions team and SMEs (Subject matter experts) and record all these under OPA (organizational process assets).

2.6 Deployment

Deployment process needs to monitor under PEST (political economical social technological) changes within the organization and outside of the organization. PEST is similar to SWOT (strength weakness opportunity and thread) where SW represents the changes of internal and OT represents external changes.

On this deployment steps model should be seamless (like same environment, same result etc.) from development to production. Deployment plan contain the details of human resources, hardware, software requirements. Deployment plan also contain maintenance and monitoring plan by checking the model result and validity and if required then implement retire, replace and update plan.

3 Summaries

CRISP-DM implementation is costly and time consuming. But CRISP-DM methodology is an umbrella for data mining process. CRISP-DM has six phases, Business understanding, Data understanding, Modelling, Evaluation and Deployment. Every phase has several individual criteria, standard and process. CRISP-DM is Guideline for data mining process so if CRISP-DM is going to implement in any project it is necessary to follow each and every single guideline and maintain standard and criteria to get required result.

4 References

  1. Fu, Y., (1997), “Data Mining: Tasks, Techniques and Applications”, Potentials, IEEE, 16: 4, 18–20.
  2. Hand, D. J., (1999), “Statistics and Data Mining: Intersecting Disciplines”, ACM SIGKDD Explorations Newsletter, 1: 1, 16 – 19.
  3. Kantardzic, M., (2003), “Data Mining: Concepts, Models, Methods, and Algorithms” John Wiley and Sons, Inc., Hoboken, New Jersey
  4. Martínez-Plumed, F., Contreras-Ochando, L., Ferri, C., Orallo, J.H., Kull, M., Lachiche, N., Quintana, M.J.R. and Flach, P.A., 2019. CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Transactions on Knowledge and Data Engineering.
  5. Read, B.J., (1999), “Data Mining and Science? Knowledge discovery in science as opposed to business”, 12th ERCIM Workshop on Database Research.

Business Intelligence Organizations

I am often asked how the Business Intelligence department should be set up and how it should interact and collaborate with other departments. First and foremost: There is no magic recipe here, but every company must find the right organization for itself.

Before we can talk about organization of BI, we need to have a clear definition of roles for team members within a BI department.

A Data Engineer (also Database Developer) uses databases to save structured, semi-structured and unstructured data. He or she is responsible for data cleaning, data availability, data models and also for the database performance. Furthermore, a good Data Engineer has at least basic knowledge about data security and data privacy. A Data Engineer uses SQL and NoSQL-Technologies.

A Data Analyst (also BI Analyst or BI Consultant) uses the data delivered by the Data Engineer to create or adjust data models and implementing business logic in those data models and BI dashboards. He or she needs to understand the needs of the business. This job requires good communication and consulting skills as well as good developing skills in SQL and BI Tools such like MS Power BI, Tableau or Qlik.

A Business Analyst (also Business Data Analyst) is a person form any business department who has basic knowledge in data analysis. He or she has good knowledge in MS Excel and at least basic knowledge in data analysis and BI Tools. A Business Analyst will not create data models in databases but uses existing data models to create dashboards or to adjust existing data analysis applications. Good Business Analyst have SQL Skills.

A Data Scientist is a Data Analyst with extended skills in statistics and machine learning. He or she can use very specific tools and analytical methods for finding pattern in unknow or big data (Data Mining) or to predict events based on pattern calculated by using historized data (Predictive Analytics). Data Scientists work mostly with Python or R programming.

Organization Type 1 – Central Approach (Data Lab)

The first type of organization is the data lab approach. This organization form is easy to manage because it’s focused and therefore clear in terms of budgeting. The data delivery is done centrally by experts and their method and technology knowledge. Consequently, the quality expectation of data delivery and data analysis as well as the whole development process is highest here. Also the data governance is simple and the responsibilities clearly adjustable. Not to be underestimated is the aspect of recruiting, because new employees and qualified applicants like to join a central team of experts.

However, this form of organization requires that the company has the right working attitude, especially in the business intelligence department. A centralized business intelligence department acts as a shared service. Accordingly, customer-oriented thinking becomes a prerequisite for the company’s success – and customers here are the other departments that need access to the capacities of those centralized data experts. Communication boundaries must be overcome and ways of simple and effective communication must be found.

Organization Type 2 – Stakeholder Focus Approach

Other companies want to shift more responsibility for data governance, and especially data use and analytics, to those departments where data plays a key role right now. A central business intelligence department manages its own projects, which have a meaning for the entire company. The specialist departments, which have a special need for data analysis, have their own data experts who carry out critical projects for the specialist department. The central Business Intelligence department does not only provide the technical delivery of data, but also through methodical consulting. Although most of the responsibility lies with the Business Intelligence department, some other data-focused departments are at least co-responsible.

The advantage is obvious: There are special data experts who work deeper in the actual departments and feel more connected and responsible to them. The technical-business focus lies on pain points of the company.

However, this form of Ogranization also has decisive disadvantages: The danger of developing isolated solutions that are so special in some specific areas that they will not really work company-wide increases. Typically the company has to deal with asymmetrical growth of data analytics
know-how. Managing data governance is more complex and recruitment is becoming more difficult as the business intelligence department is weakened and smaller, and data professionals for other departments need to have more business focus, which means they are looking for more specialized profiles.

Organization Type 3 – Decentral Approach

Some companies are also taking a more extreme approach in the other direction. The Business Intelligence department now has only Data Engineers building and maintaining the data warehouse or data lake. As a result, the central department only provides data; it is used and analyzed in all other departments, specifically for the respective applications.

The advantage lies in the personal responsibility of the respective departments as „pain points“ of the company are in focus in belief that business departments know their problems and solutions better than any other department does. Highly specialized data experts can understand colleagues of their own department well and there is no no shared service mindset neccessary, except for the data delivery.

Of course, this organizational form has clear disadvantages since many isolated solutions are unavoidable and the development process of each data-driven solution will be inefficient. These insular solutions may work with luck for your own department, but not for the whole company. There is no one single source of truth. The recruiting process is more difficult as it requires more specialized data experts with more business background. We have to expect an asymmetrical growth of data analytics know-how and a difficult data governance.

 

The importance of domain knowledge – A healthcare data science perspective

Data scientists have (and need) many skills. They are frequently either former academic researchers or software engineers, with knowledge and skills in statistics, programming, machine learning, and many other domains of mathematics and computer science. These skills are general and allow data scientists to offer valuable services to almost any field. However, data scientists in some cases find themselves in industries they have relatively little knowledge of.

This is especially true in the healthcare field. In healthcare, there is an enormous amount of important clinical knowledge that might be relevant to a data scientist. It is unreasonable to expect a data scientist to not only have all of the skills typically required of a data scientist, but to also have all of the knowledge a medical professional may have.

Why is domain knowledge necessary?

This lack of domain knowledge, while perfectly understandable, can be a major barrier to healthcare data scientists. For one thing, it’s difficult to come up with project ideas in a domain that you don’t know much about. It can also be difficult to determine the type of data that may be helpful for a project – if you want to build a model to predict a health outcome (for example, whether a patient has or is likely to develop a gastrointestinal bleed), you need to know what types of variables might be related to this outcome so you can make sure to gather the right data.

Knowing the domain is useful not only for figuring out projects and how to approach them, but also for having rules of thumb for sanity checks on the data. Knowing how data is captured (is it hand-entered? Is it from machines that can give false readings for any number of reasons?) can help a data scientist with data cleaning and from going too far down the wrong path. It can also inform what true outliers are and which values might just be due to measurement error.

Often the most challenging part of building a machine learning model is feature engineering. Understanding clinical variables and how they relate to a health outcome is extremely important for this. Is a long history of high blood pressure important for predicting heart problems, or is only very recent history? How long a time horizon is considered ‘long’ or ‘short’ in this context? What other variables might be related to this health outcome? Knowing the domain can help direct the data exploration and greatly speed (and enhance) the feature engineering process.

Once features are generated, knowing what relationships between variables are plausible helps for basic sanity checks. If you’re finding the best predictor of hospitalization is the patient’s eye color, this might indicate an issue with your code. Being able to glance at the outcome of a model and determine if they make sense goes a long way for quality assurance of any analytical work.

Finally, one of the biggest reasons a strong understanding of the data is important is because you have to interpret the results of analyses and modeling work. Knowing what results are important and which are trivial is important for the presentation and communication of results. An analysis that determines there is a strong relationship between age and mortality is probably well-known to clinicians, while weaker but more surprising associations may be of more use. It’s also important to know what results are actionable. An analysis that finds that patients who are elderly are likely to end up hospitalized is less useful for trying to determine the best way to reduce hospitalizations (at least, without further context).

How do you get domain knowledge?

In some industries, such as tech, it’s fairly easy and straightforward to see an end-user’s prospective. By simply viewing a website or piece of software from the user’s point of view, a data scientist can gain a lot of the needed context and background knowledge needed to understand where their data is coming from and how their model output is being used. In the healthcare industry, it’s more difficult. A data scientist can’t easily choose to go through med school or the experience of being treated for a chronic illness. This means there is no easy single answer to where to gain domain knowledge. However, there are many avenues available.

Reading literature and attending presentations can boost one’s domain knowledge. However, it’s often difficult to find resources that are penetrable for someone who is not already a clinician. To gain deep knowledge, one needs to be steeped in the topic. One important avenue to doing this is through the establishment of good relationships with clinicians. Clinicians can be powerful allies that can help point you in the right direction for understanding your data, and simply by chatting with them you can gain important insights. They can also help you visit the clinics or practices to interact with the people that perform the procedures or even watch the procedures being done. At Fresenius Medical Care, where I work, members of my team regularly visit clinics. I have in the last year visited one of our dialysis clinics, a nephrology practice, and a vascular care unit. These experiences have been invaluable to me in developing my knowledge of the treatment of chronic illnesses.

In conclusion, it is crucial for data scientists to acquire basic familiarity in the field they are working in and in being part of collaborative teams that include people who are technically knowledgeable in the field they work in. This said, acquiring even an essential understanding (such as “Medicine 101”) may go a long way for the data scientists in being able to become self-sufficient in essential feature selection and design.

 

Data Science Knowledge Stack – Abstraction of the Data Science Skillset

What must a Data Scientist be able to do? Which skills does as Data Scientist need to have? This question has often been asked and frequently answered by several Data Science Experts. In fact, it is now quite clear what kind of problems a Data Scientist should be able to solve and which skills are necessary for that. I would like to try to bring this consensus into a visual graph: a layer model, similar to the OSI layer model (which any data scientist should know too, by the way).
I’m giving introductory seminars in Data Science for merchants and engineers and in those seminars I always start explaining what we need to work out together in theory and practice-oriented exercises. Against this background, I came up with the idea for this layer model. Because with my seminars the problem already starts: I am giving seminars for Data Science for Business Analytics with Python. So not for medical analyzes and not with R or Julia. So I do not give a general knowledge of Data Science, but a very specific direction.

A Data Scientist must deal with problems at different levels in any Data Science project, for example, the data access does not work as planned or the data has a different structure than expected. A Data Scientist can spend hours debating its own source code or learning the ropes of new DataScience packages for its chosen programming language. Also, the right algorithms for data evaluation must be selected, properly parameterized and tested, sometimes it turns out that the selected methods were not the optimal ones. Ultimately, we are not doing Data Science all day for fun, but for generating value for a department and a data scientist is also faced with special challenges at this level, at least a basic knowledge of the expertise of that department is a must have.


Read this article in German:
“Data Science Knowledge Stack – Was ein Data Scientist können muss“


Data Science Knowledge Stack

With the Data Science Knowledge Stack, I would like to provide a structured insight into the tasks and challenges a Data Scientist has to face. The layers of the stack also represent a bidirectional flow from top to bottom and from bottom to top, because Data Science as a discipline is also bidirectional: we try to answer questions with data, or we look at the potentials in the data to answer previously unsolicited questions.

The DataScience Knowledge Stack consists of six layers:

Database Technology Knowledge

A Data Scientist works with data which is rarely directly structured in a CSV file, but usually in one or more databases that are subject to their own rules. In particular, business data, for example from the ERP or CRM system, are available in relational databases, often from Microsoft, Oracle, SAP or an open source alternative. A good Data Scientist is not only familiar with Structured Query Language (SQL), but is also aware of the importance of relational linked data models, so he also knows the principle of data table normalization.

Other types of databases, so-called NoSQL databases (Not only SQL) are based on file formats, column or graph orientation, such as MongoDB, Cassandra or GraphDB. Some of these databases use their own programming languages ​​(for example JavaScript at MongoDB or the graph-oriented database Neo4J has its own language called Cypher). Some of these databases provide alternative access via SQL (such as Hive for Hadoop).

A data scientist has to cope with different database systems and has to master at least SQL – the quasi-standard for data processing.

Data Access & Transformation Knowledge

If data are given in a database, Data Scientists can perform simple (and not so simple) analyzes directly on the database. But how do we get the data into our special analysis tools? To do this, a Data Scientist must know how to export data from the database. For one-time actions, an export can be a CSV file, but which separators and text qualifiers should be used? Possibly, the export is too large, so the file must be split.
If there is a direct and synchronous data connection between the analysis tool and the database, interfaces like REST, ODBC or JDBC come into play. Sometimes a socket connection must also be established and the principle of a client-server architecture should be known. Synchronous and asynchronous encryption methods should also be familiar to a Data Scientist, as confidential data are often used, and a minimum level of security is most important for business applications.

Many datasets are not structured in a database but are so-called unstructured or semi-structured data from documents or from Internet sources. And again we have interfaces, a frequent entry point for Data Scientists is, for example, the Twitter API. Sometimes we want to stream data in near real-time, let it be machine data or social media messages. This can be quite demanding, so the data streaming is almost a discipline with which a Data Scientist can come into contact quickly.

Programming Language Knowledge

Programming languages ​​are tools for Data Scientists to process data and automate processing. Data Scientists are usually no real software developers and they do not have to worry about software security or economy. However, a certain basic knowledge about software architectures often helps because some Data Science programs can be going to be integrated into an IT landscape of the company. The understanding of object-oriented programming and the good knowledge of the syntax of the selected programming languages ​​are essential, especially since not every programming language is the most useful for all projects.

At the level of the programming language, there is already a lot of snares in the programming language that are based on the programming language itself, as each has its own faults and details determine whether an analysis is done correctly or incorrectly: for example, whether data objects are copied or linked as reference, or how NULL/NaN values ​​are treated.

Data Science Tool & Library Knowledge

Once a data scientist has loaded the data into his favorite tool, for example, one of IBM, SAS or an open source alternative such as Octave, the core work just began. However, these tools are not self-explanatory and therefore there is a wide range of certification options for various Data Science tools. Many (if not most) Data Scientists work mostly directly with a programming language, but this alone is not enough to effectively perform statistical data analysis or machine learning: We use Data Science libraries (packages) that provide data structures and methods as a groundwork and thus extend the programming language to a real Data Science toolset. Such a library, for example Scikit-Learn for Python, is a collection of methods implemented in the programming language. The use of such libraries, however, is intended to be learned and therefore requires familiarization and practical experience for reliable application.

When it comes to Big Data Analytics, the analysis of particularly large data, we enter the field of Distributed Computing. Tools (frameworks) such as Apache Hadoop, Apache Spark or Apache Flink allows us to process and analyze data in parallel on multiple servers. These tools also provide their own libraries for machine learning, such as Mahout, MLlib and FlinkML.

Data Science Method Knowledge

A Data Scientist is not simply an operator of tools, he uses the tools to apply his analysis methods to data he has selected for to reach the project targets. These analysis methods are, for example, descriptive statistics, estimation methods or hypothesis tests. Somewhat more mathematical are methods of machine learning for data mining, such as clustering or dimensional reduction, or more toward automated decision making through classification or regression.

Machine learning methods generally do not work immediately, they have to be improved using optimization methods like the gradient method. A Data Scientist must be able to detect under- and overfitting, and he must prove that the prediction results for the planned deployment are accurate enough.

Special applications require special knowledge, which applies, for example, to the fields of image recognition (Visual Computing) or the processing of human language (Natural Language Processiong). At this point, we open the door to deep learning.

Expertise

Data Science is not an end in itself, but a discipline that would like to answer questions from other expertise fields with data. For this reason, Data Science is very diverse. Business economists need data scientists to analyze financial transactions, for example, to identify fraud scenarios or to better understand customer needs, or to optimize supply chains. Natural scientists such as geologists, biologists or experimental physicists also use Data Science to make their observations with the aim of gaining knowledge. Engineers want to better understand the situation and relationships between machinery or vehicles, and medical professionals are interested in better diagnostics and medication for their patients.

In order to support a specific department with his / her knowledge of data, tools and analysis methods, every data scientist needs a minimum of the appropriate skills. Anyone who wants to make analyzes for buyers, engineers, natural scientists, physicians, lawyers or other interested parties must also be able to understand the people’s profession.

Engere Data Science Definition

While the Data Science pioneers have long established and highly specialized teams, smaller companies are still looking for the Data Science Allrounder, which can take over the full range of tasks from the access to the database to the implementation of the analytical application. However, companies with specialized data experts have long since distinguished Data Scientists, Data Engineers and Business Analysts. Therefore, the definition of Data Science and the delineation of the abilities that a data scientist should have, varies between a broader and a more narrow demarcation.


A closer look at the more narrow definition shows, that a Data Engineer takes over the data allocation, the Data Scientist loads it into his tools and runs the data analysis together with the colleagues from the department. According to this, a Data Scientist would need no knowledge of databases or APIs, neither an expertise would be necessary …

In my experience, DataScience is not that narrow, the task spectrum covers more than just the core area. This misunderstanding comes from Data Science courses and – for me – I should point to the overall picture of Data Science again and again. In courses and seminars, which want to teach Data Science as a discipline, the focus will of course be on the core area: programming, tools and methods from mathematics & statistics.

Establish a Collaborative Culture – Process Mining Rule 4 of 4

This is article no. 4 of the four-part article series Privacy, Security and Ethics in Process Mining.

Read this article in German:
Datenschutz, Sicherheit und Ethik beim Process Mining – Regel 4 von 4

Perhaps the most important ingredient in creating a responsible process mining environment is to establish a collaborative culture within your organization. Process mining can make the flaws in your processes very transparent, much more transparent than some people may be comfortable with. Therefore, you should include change management professionals, for example, Lean practitioners who know how to encourage people to tell each other “the truth”, in your team.

Furthermore, be careful how you communicate the goals of your process mining project and involve relevant stakeholders in a way that ensures their perspective is heard. The goal is to create an atmosphere, where people are not blamed for their mistakes (which only leads to them hiding what they do and working against you) but where everyone is on board with the goals of the project and where the analysis and process improvement is a joint effort.

Do:

  • Make sure that you verify the data quality before going into the data analysis, ideally by involving a domain expert already in the data validation step. This way, you can build trust among the process managers that the data reflects what is actually happening and ensure that you have the right understanding of what the data represents.
  • Work in an iterative way and present your findings as a starting point for discussion in each iteration. Give people the chance to explain why certain things are happening and let them ask additional questions (to be picked up in the next iteration). This will help to improve the quality and relevance of your analysis as well as increase the buy-in of the process stakeholders in the final results of the project.

Don’t:

  • Jump to conclusions. You can never assume that you know everything about the process. For example, slower teams may be handling the difficult cases, people may deviate from the process for good reasons, and you may not see everything in the data (for example, there might be steps that are performed outside of the system). By consistently using your observations as a starting point for discussion, and by allowing people to join in the interpretation, you can start building trust and the collaborative culture that process mining needs to thrive.
  • Force any conclusions that you expect, or would like to have, by misrepresenting the data (or by stating things that are not actually supported by the data). Instead, keep track of the steps that you have taken in the data preparation and in your process mining analysis. If there are any doubts about the validity or questions about the basis of your analysis, you can always go back and show, for example, which filters have been applied to the data to come to the particular process view that you are presenting.

Artificial Intelligence and Data Science in the Automotive Industry

Data science and machine learning are the key technologies when it comes to the processes and products with automatic learning and optimization to be used in the automotive industry of the future. This article defines the terms “data science” (also referred to as “data analytics”) and “machine learning” and how they are related. In addition, it defines the term “optimizing analytics“ and illustrates the role of automatic optimization as a key technology in combination with data analytics. It also uses examples to explain the way that these technologies are currently being used in the automotive industry on the basis of the major subprocesses in the automotive value chain (development, procurement; logistics, production, marketing, sales and after-sales, connected customer). Since the industry is just starting to explore the broad range of potential uses for these technologies, visionary application examples are used to illustrate the revolutionary possibilities that they offer. Finally, the article demonstrates how these technologies can make the automotive industry more efficient and enhance its customer focus throughout all its operations and activities, extending from the product and its development process to the customers and their connection to the product.

Read this article in German:
“Künstliche Intelligenz und Data Science in der Automobilindustrie“


Table of Contents

1 Introduction

2 The Data Mining Process

3 The pillars of artificial intelligence
3.1 Maschine Learning
3.2 Computer Vision
3.3 Inference and decision-making
3.4 Language and communication
3.5 Agents and Actions

4 Data mining and artificial intelligence in the automotive industry
4.1 Development
4.2 Procurement
4.3 Logistics
4.4 Production
4.5 Marketing
4.6 Sales, After Sales, and Retail
4.7 Connected Customer

5 Vision
5.1 Autonomous vehicles
5.2 Integrated factory optimization
5.3 Companies acting autonomously

6 Conclusions

7 Authors

8 Sources

Authors:

  • Dr. Martin Hofmann (CIO – Volkswagen AG)
  • Dr. Florian Neukart (Principal Data Scientist – Volkswagen AG)
  • Prof. Dr. Thomas Bäck (Universität Leiden)

1 Introduction

Data science and machine learning are now key technologies in our everyday lives, as we can see in a multitude of applications, such as voice recognition in vehicles and on cell phones, automatic facial and traffic sign recognition, as well as chess and, more recently, Go machine algorithms[1] which humans can no longer beat. The analysis of large data volumes based on search, pattern recognition, and learning algorithms provides insights into the behavior of processes, systems, nature, and ultimately people, opening the door to a world of fundamentally new possibilities. In fact, the now already implementable idea of autonomous driving is virtually a tangible reality for many drivers today with the help of lane keeping assistance and adaptive cruise control systems in the vehicle.

The fact that this is just the tip of the iceberg, even in the automotive industry, becomes readily apparent when one considers that, at the end of 2015, Toyota and Tesla’s founder, Elon Musk, each announced investments amounting to one billion US dollars in artificial intelligence research and development almost at the same time. The trend towards connected, autonomous, and artificially intelligent systems that continuously learn from data and are able to make optimal decisions is advancing in ways that are simply revolutionary, not to mention fundamentally important to many industries. This includes the automotive industry, one of the key industries in Germany, in which international competitiveness will be influenced by a new factor in the near future – namely the new technical and service offerings that can be provided with the help of data science and machine learning.

This article provides an overview of the corresponding methods and some current application examples in the automotive industry. It also outlines the potential applications to be expected in this industry very soon. Accordingly, sections 2 and 3 begin by addressing the subdomains of data mining (also referred to as “big data analytics”) and artificial intelligence, briefly summarizing the corresponding processes, methods, and areas of application and presenting them in context. Section 4 then provides an overview of current application examples in the automotive industry based on the stages in the industry’s value chain –from development to production and logistics through to the end customer. Based on such an example, section 5 describes the vision for future applications using three examples: one in which vehicles play the role of autonomous agents that interact with each other in cities, one that covers integrated production optimization, and one that describes companies themselves as autonomous agents.

Whether these visions will become a reality in this or any other way cannot be said with certainty at present – however, we can safely predict that the rapid rate of development in this area will lead to the creation of completely new products, processes, and services, many of which we can only imagine today. This is one of the conclusions drawn in section 6, together with an outlook regarding the potential future effects of the rapid rate of development in this area.

2 The data mining process

Gartner uses the term “prescriptive analytics“ to describe the highest level of ability to make business decisions on the basis of data-based analyses. This is illustrated by the question “what should I do?” and prescriptive analytics supplies the required decision-making support, if a person is still involved, or automation if this is no longer the case.

The levels below this, in ascending order in terms of the use and usefulness of AI and data science, are defined as follows: descriptive analytics (“what has happened?”), diagnostic analytics (“why did it happen?”), and predictive analytics (“what will happen?”) (see Figure 1). The last two levels are based on data science technologies, including data mining and statistics, while descriptive analytics essentially uses traditional business intelligence concepts (data warehouse, OLAP).

In this article, we seek to replace the term “prescriptive analytics“ with the term “optimizing analytics.“ The reason for this is that a technology can “prescribe” many things, while, in terms of implementation within a company, the goal is always to make something “better” with regard to target criteria or quality criteria. This optimization can be supported by search algorithms, such as evolutionary algorithms in nonlinear cases and operation research (OR) methods in – much rarer – linear cases. It can also be supported by application experts who take the results from the data mining process and use them to draw conclusions regarding process improvement. One good example are the decision trees learned from data, which application experts can understand, reconcile with their own expert knowledge, and then implement in an appropriate manner. Here too, the application is used for optimizing purposes, admittedly with an intermediate human step.

Within this context, another important aspect is the fact that multiple criteria required for the relevant application often need to be optimized at the same time, meaning that multi-criteria optimization methods – or, more generally, multi-criteria decision-making support methods – are necessary. These methods can then be used in order to find the best possible compromises between conflicting goals. The examples mentioned include the frequently occurring conflicts between cost and quality, risk and profit, and, in a more technical example, between the weight and passive occupant safety of a body.

Figure 1: The four levels of data analysis usage within a company

These four levels form a framework, within which it is possible to categorize data analysis competence and potential benefits for a company in general. This framework is depicted in Figure 1 and shows the four layers which build upon each other, together with the respective technology category required for implementation.

The traditional Cross-Industry Standard Process for Data Mining (CRISP-DM)[2] includes no optimization or decision-making support whatsoever. Instead, based on the business understanding, data understanding, data preparation, modeling, and evaluation sub-steps, CRISP proceeds directly to the deployment of results in business processes. Here too, we propose an additional optimization step that in turn comprises multi-criteria optimization and decision-making support. This approach is depicted schematically in Figure 2.

Figure 2: Traditional CRISP-DM process with an additional optimization step

It is important to note that the original CRISP model deals with a largely iterative approach used by data scientists to analyze data manually, which is reflected in the iterations between business understanding and data understanding as well as data preparation and modeling. However, evaluating the modeling results with the relevant application experts in the evaluation step can also result in having to start the process all over again from the business understanding sub-step, making it necessary to go through all the sub-steps again partially or completely (e.g., if additional data needs to be incorporated).

The manual, iterative procedure is also due to the fact that the basic idea behind this approach  – as up-to-date as it may be for the majority of applications – is now almost 20 years old and certainly only partially compatible with a big data strategy. The fact is that, in addition to the use of nonlinear modeling methods (in contrast to the usual generalized linear models derived from statistical modeling) and knowledge extraction from data, data mining rests on the fundamental idea that models can be derived from data with the help of algorithms and that this modeling process can run automatically for the most part – because the algorithm “does the work.”

In applications where a large number of models need to be created, for example for use in making forecasts (e.g., sales forecasts for individual vehicle models and markets based on historical data), automatic modeling plays an important role. The same applies to the use of online data mining, in which, for example, forecast models (e.g., for forecasting product quality) are not only constantly used for a production process, but also adapted (i.e., retrained) continuously whenever individual process aspects change (e.g., when a new raw material batch is used). This type of application requires the technical ability to automatically generate data, and integrate and process it in such a way that data mining algorithms can be applied to it. In addition, automatic modeling and automatic optimization are necessary in order to update models and use them as a basis for generating optimal proposed actions in online applications. These actions can then be communicated to the process expert as a suggestion or – especially in the case of continuous production processes – be used directly to control the respective process. If sensor systems are also integrated directly into the production process – to collect data in real time – this results in a self-learning cyber-physical system [3] that facilitates implementation of the Industry 4.0[4] vision in the field of production engineering.

Figure 3: Architecture of an Industry 4.0 model for optimizing analytics

This approach is depicted schematically in Figure 3. Data from the system is acquired with the help of sensors and integrated into the data management system. Using this as a basis, forecast models for the system’s relevant outputs (quality, deviation from target value, process variance, etc.) are used continuously in order to forecast the system’s output. Other machine learning options can be used within this context in order, for example, to predict maintenance results (predictive maintenance) or to identify anomalies in the process. The corresponding models are monitored continuously and, if necessary, automatically retrained if any process drift is observed. Finally, the multi-criteria optimization uses the models to continuously compute optimum setpoints for the system control.  Human process experts can also be integrated here by using the system as a suggestion generator so that a process expert can evaluate the generated suggestions before they are implemented in the original system.

In order to differentiate it from “traditional” data mining, the term “big data” is frequently defined now with three (sometimes even four or five) essential characteristics: volume, velocity, and variety, which refer to the large volume of data, the speed at which data is generated, and the heterogeneity of the data to be analyzed, which can no longer be categorized into the conventional relational database schema. Veracity, i.e., the fact that large uncertainties may also be hidden in the data (e.g., measurement inaccuracies), and finally value, i.e., the value that the data and its analysis represents for a company’s business processes, are often cited as additional characteristics. So it is not just the pure data volume that distinguishes previous data analytics methods from big data, but also other technical factors that require the use of new methods– such as Hadoop and MapReduce – with appropriately adapted data analysis algorithms in order to allow the data to be saved and processed. In addition, so-called “in-memory databases” now also make it possible to apply traditional learning and modeling algorithms in main memory to large data volumes.

This means that if one were to establish a hierarchy of data analysis and modeling methods and techniques, then, in very simplistic terms, statistics would be a subset of data mining, which in turn would be a subset of big data. Not every application requires the use of data mining or big data technologies. However, a clear trend can be observed, which indicates that the necessities and possibilities involved in the use of data mining and big data are growing at a very rapid pace as increasingly large data volumes are being collected and linked across all processes and departments of a company. Nevertheless, conventional hardware architecture with additional main memory is often more than sufficient for analyzing large data volumes in the gigabyte range.

Although optimizing analytics is of tremendous importance, it is also crucial to always be open to the broad variety of applications when using artificial intelligence and machine learning algorithms. The wide range of learning and search methods, with potential use in applications such as image and language recognition, knowledge learning, control and planning in areas such as production and logistics, among many others, can only be touched upon within the scope of this article.

3 The pillars of artificial intelligence

An early definition of artificial intelligence from the IEEE Neural Networks Council was “the study of how to make computers do things at which, at the moment, people are better.”[5] Although this still applies, current research is also focused on improving the way that software does things at which computers have always been better, such as analyzing large amounts of data. Data is also the basis for developing artificially intelligent software systems not only to collect information, but also to:

  • Learn
  • Understand and interpret information
  • Behave adaptively
  • Plan
  • Make inferences
  • Solve problems
  • Think abstractly
  • Understand and interpret ideas and language

3.1 Machine learning

At the most general level, machine learning (ML) algorithms can be subdivided into two categories: supervised and unsupervised, depending on whether or not the respective algorithm requires a target variable to be specified.

Supervised learning algorithms

Apart from the input variables (predictors), supervised learning algorithms also require the known target values (labels) for a problem. In order to train an ML model to identify traffic signs using cameras, images of traffic signs – preferably with a variety of configurations – are required as input variables. In this case, light conditions, angles, soiling, etc. are compiled as noise or blurring in the data; nonetheless, it must be possible to recognize a traffic sign in rainy conditions with the same accuracy as when the sun is shining. The labels, i.e., the correct designations, for such data are normally assigned manually. This correct set of input variables and their correct classification constitute a training data set. Although we only have one image per training data set in this case, we still speak of multiple input variables, since ML algorithms find relevant features in training data and learn how these features and the class assignment for the classification task indicated in the example are associated. Supervised learning is used primarily to predict numerical values (regression) and for classification purposes (predicting the appropriate class), and the corresponding data is not limited to a specific format – ML algorithms are more than capable of processing images, audio files, videos, numerical data, and text. Classification examples include object recognition (traffic signs, objects in front of a vehicle, etc.), face recognition, credit risk assessment, voice recognition, and customer churn, to name but a few.

Regression examples include determining continuous numerical values on the basis of multiple (sometimes hundreds or thousands) input variables, such as a self-driving car calculating its ideal speed on the basis of road and ambient conditions, determining a financial indicator such as gross domestic product based on a changing number of input variables (use of arable land, population education levels, industrial production, etc.), and determining potential market shares with the introduction of new models. Each of these problems is highly complex and cannot be represented by simple, linear relationships in simple equations. Or, to put it another way that more accurately represents the enormous challenge involved: the necessary expertise does not even exist.

Unsupervised learning algorithms

Unsupervised learning algorithms do not focus on individual target variables, but instead have the goal of characterizing a data set in general. Unsupervised ML algorithms are often used to group (cluster) data sets, i.e., to identify relationships between individual data points (that can consist of any number of attributes) and group them into clusters. In certain cases, the output from unsupervised ML algorithms can in turn be used as an input for supervised methods. Examples of unsupervised learning include forming customer groups based on their buying behavior or demographic data, or clustering time series in order to group millions of time series from sensors into groups that were previously not obvious.

In other words, machine learning is the area of artificial intelligence (AI) that enables computers to learn without being programmed explicitly. Machine learning focuses on developing programs that grow and change by themselves as soon as new data is provided. Accordingly, processes that can be represented in a flowchart are not suitable candidates for machine learning – in contrast, everything that requires dynamic and changing solution strategies and cannot be constrained to static rules is potentially suitable for solution with ML. For example, ML is used when:

  • No relevant human expertise exists
  • People are unable to express their expertise
  • The solution changes over time
  • The solution needs to be adapted to specific cases

In contrast to statistics, which follows the approach of making inferences based on samples, computer science is interested in developing efficient algorithms for solving optimization problems, as well as in developing a representation of the model for evaluating inferences. Methods frequently used for optimization in this context include so-called “evolutionary algorithms” (genetic algorithms, evolution strategies), the basic principles of which emulate natural evolution[6]. These methods are very efficient when applied to complex, nonlinear optimization problems.

Even though ML is used in certain data mining applications, and both look for patterns in data, ML and data mining are not the same thing. Instead of extracting data that people can understand, as is the case with data mining, ML methods are used by programs to improve their own understanding of the data provided. Software that implements ML methods recognizes patterns in data and can dynamically adjust the behavior based on them. If, for example, a self-driving car (or the software that interprets the visual signal from the corresponding camera) has been trained to initiate a braking maneuver if a pedestrian appears in front it, this must work with all pedestrians regardless of whether they are short, tall, fat, thin, clothed, coming from the left, coming from the right, etc. In turn, the vehicle must not brake if there is a stationary garbage bin on the side of the road.

The level of complexity in the real world is often greater than the level of complexity of an ML model, which is why, in most cases, an attempt is made to subdivide problems into subproblems and then apply ML models to these subproblems.  The output from these models is then integrated in order to permit complex tasks, such as autonomous vehicle operation, in structured and unstructured environments.

3.2 Computer vision

Computer vision (CV) is a very wide field of research that merges scientific theories from various fields (as is often the case with AI), starting from biology, neuroscience, and psychology and extending all the way to computer science, mathematics, and physics. First, it is important to know how an image is produced physically. Before light hits sensors in a two-dimensional array, it is refracted, absorbed, scattered, or reflected, and an image is produced by measuring the intensity of the light beams through each element in the image (pixel). The three primary focuses of CV are:

  • Reconstructing a scene and the point from which the scene is observed based on an image, an image sequence, or a video.
  • Emulating biological visual perception in order to better understand which physical and biological processes are involved, how the wetware works, and how the corresponding interpretation and understanding work.
  • Technical research and development focuses on efficient, algorithmic solutions – when it comes to CV software, problem-specific solutions that only have limited commonalities with the visual perception of biological organisms are often developed.

All three areas overlap and influence each other. If, for example, the focus in an application is on obstacle recognition in order to initiate an automated braking maneuver in the event of a pedestrian appearing in front of the vehicle, the most important thing is to identify the pedestrian as an obstacle. Interpreting the entire scene – e.g., understanding that the vehicle is moving towards a family having a picnic in a field – is not necessary in this case. In contrast, understanding a scene is an essential prerequisite if context is a relevant input, such as is the case when developing domestic robots that need to understand that an occupant who is lying on the floor not only represents an obstacle that needs to be evaded, but is also probably not sleeping and a medical emergency is occurring.

Vision in biological organisms is regarded as an active process that includes controlling the sensor and is tightly linked to successful performance of an action[7]. Consequently,[8] CV systems are not passive either. In other words, the system must:

  • Be continuously provided with data via sensors (streaming)
  • Act based on this data stream

Having said that, the goal of CV systems is not to understand scenes in images – first and foremost, the systems must extract the relevant information for a specific task from the scene. This means that they must identify a “region of interest” that will be used for processing. Moreover, these systems must feature short response times, since it is probable that scenes will change over time and that a heavily delayed action will not achieve the desired effect. Many different methods have been proposed for object recognition purposes (“what” is located “where” in a scene), including:

  • Object detectors, in which case a window moves over the image and a filter response is determined for each position by comparing a template and the sub-image (window content), with each new object parameterization requiring a separate scan. More sophisticated algorithms simultaneously make calculations based on various scales and apply filters that have been learned from a large number of images.
  • Segment-based techniques extract a geometrical description of an object by grouping pixels that define the dimensions of an object in an image. Based on this, a fixed feature set is computed, i.e., the features in the set retain the same values even when subjected to various image transformations, such as changes in light conditions, scaling, or rotation. These features are used to clearly identify objects or object classes, one example being the aforementioned identification of traffic signs.
  • Alignment-based methods use parametric object models that are trained on data[9],[10]. Algorithms search for parameters, such as scaling, translation, or rotation, that adapt a model optimally to the corresponding features in the image, whereby an approximated solution can be found by means of a reciprocal process, i.e., by features, such as contours, corners, or others, “selecting” characteristic points in the image for parameter solutions that are compatible with the found feature.

With object recognition, it is necessary to decide whether algorithms need to process 2-D or 3-D representations of objects – 2-D representations are very frequently a good compromise between accuracy and availability. Current research (deep learning) shows that even distances between two points based on two 2-D images captured from different points can be accurately determined as an input. In daylight conditions and with reasonably good visibility, this input can be used in addition to data acquired with laser and radar equipment in order to increase accuracy – moreover, a single camera is sufficient to generate the required data. In contrast to 3-D objects, no shape, depth, or orientation information is directly encoded in 2-D images. Depth can be encoded in a variety of ways, such as with the use of laser or stereo cameras (emulating human vision) and structured light approaches (such as Kinect). At present, the most intensively pursued research direction involves the use of superquadrics – geometric shapes defined with formulas, which use any number of exponents to identify structures such as cylinders, cubes, and cones with round or sharp edges. This allows a large variety of different basic shapes to be described with a small set of parameters. If 3-D images are acquired using stereo cameras, statistical methods (such as generating a stereo point cloud) are used instead of the aforementioned shape-based methods, because the data quality achieved with stereo cameras is poorer than that achieved with laser scans.

Other research directions include tracking[11],[12], contextual scene understanding,[13],[14] and monitoring[15], although these aspects are currently of secondary importance to the automotive industry.

3.3 Inference and decision-making

This field of research, referred to in the literature as “knowledge representation & reasoning” (KRR), focuses on designing and developing data structures and inference algorithms. Problems solved by making inferences are very often found in applications that require interaction with the physical world (humans, for example), such as generating diagnostics, planning, processing natural languages, answering questions, etc. KRR forms the basis for AI at the human level.

Making inferences is the area of KRR in which data-based answers need to be found without human intervention or assistance, and for which data is normally presented in a formal system with distinct and clear semantics. Since 1980, it has been assumed that the data involved is a mixture of simple and complex structures, with the former having a low degree of computational complexity and forming the basis for research involving large databases. The latter are presented in a language with more expressive power, which requires less space for representation, and they correspond to generalizations and fine-grained information.

Decision-making is a type of inference that revolves primarily around answering questions regarding preferences between activities, for example when an autonomous agent attempts to fulfill a task for a person. Such decisions are very frequently made in a dynamic domain which changes over the course of time and when actions are executed. An example of this is a self-driving car that needs to react to changes in traffic.

Logic and combinatorics

Mathematical logic is the formal basis for many applications in the real world, including calculation theory, our legal system and corresponding arguments, and theoretical developments and evidence in the field of research and development. The initial vision was to represent every type of knowledge in the form of logic and use universal algorithms to make inferences from it, but a number of challenges arose – for example, not all types of knowledge can be represented simply. Moreover, compiling the knowledge required for complex applications can become very complex, and it is not easy to learn this type of knowledge in a logical, highly expressive language.[16] In addition, it is not easy to make inferences with the required highly expressive language – in extreme cases, such scenarios cannot be implemented computationally, even if the first two challenges are overcome. Currently, there are three ongoing debates on this subject, with the first one focusing on the argument that logic is unable to represent many concepts, such as space, analogy, shape, uncertainty, etc., and consequently cannot be included as an active part in developing AI to a human level. The counterargument states that logic is simply one of many tools. At present, the combination of representative expressiveness, flexibility, and clarity cannot be achieved with any other method or system. The second debate revolves around the argument that logic is too slow for making inferences and will therefore never play a role in a productive system. The counterargument here is that ways exist to approximate the inference process with logic, so processing is drawing close to remaining within the required time limits, and progress is being made with regard to logical inference. Finally, the third debate revolves around the argument that it is extremely difficult, or even impossible, to develop systems based on logical axioms into applications for the real world. The counterarguments in this debate are primarily based on the research of individuals currently researching techniques for learning logical axioms from natural-language texts.

In principle, a distinction is made between four different types of logic[17] which are not discussed any further in this article:

  • Propositional logic
  • First-order predicate logic
  • Modal logic
  • Non-monotonic logic

Automated decision-making, such as that found in autonomous robots (vehicles), WWW agents, and communications agents, is also worth mentioning at this point. This type of decision-making is particularly relevant when it comes to representing expert decision-making processes with logic and automating them. Very frequently, this type of decision-making process takes account of the dynamics of the surroundings, for example when a transport robot in a production plant needs to evade another transport robot. However, this is not a basic prerequisite, for example, if a decision-making process without a clearly defined direction is undertaken in future, e.g., the decision to rent a warehouse at a specific price at a specific location. Decision-making as a field of research encompasses multiple domains, such as computer science, psychology, economics, and all engineering disciplines. Several fundamental questions need to be answered to enable development of automated decision-making systems:

  • Is the domain dynamic to the extent that a sequence of decisions is required or static in the sense that a single decision or multiple simultaneous decisions need to be made?
  • Is the domain deterministic, non-deterministic, or stochastic?
  • Is the objective to optimize benefits or to achieve a goal?
  • Is the domain known to its full extent at all times? Or is it only partially known?

Logical decision-making problems are non-stochastic in nature as far as planning and conflicting behavior are concerned. Both require that the available information regarding the initial and intermediate states be complete, that actions have exclusively deterministic, known effects, and that a specific defined goal exists. These problem types are often applied in the real world, for example in robot control, logistics, complex behavior in the WWW, and in computer and network security.

In general, planning problems consist of an initial (known) situation, a defined goal, and a set of permitted actions or transitions between steps. The result of a planning process is a sequence or set of actions that, when executed correctly, change the executing entity from an initial state to a state that meets the target conditions. Computationally speaking, planning is a difficult problem, even if simple problem specification languages are used. Even when relatively simple problems are involved, the search for a plan cannot run through all state-space representations, as these are exponentially large in the number of states that define the domains. Consequently, the aim is to develop efficient algorithms that represent sub-representations in order to search through these with the hope of achieving the relevant goal. Current research is focused on developing new search methods and new representations for actions and states, which will make planning easier. Particularly when one or more agents acting against each other are taken into account, it is crucial to find a balance between learning and decision-making – exploration for the sake of learning while decisions are being made can lead to undesirable results.

Many problems in the real world are problems with dynamics of a stochastic nature. One example of this is buying a vehicle with features that affect its value, of which we are unaware. These dependencies influence the buying decision, so it is necessary to allow risks and uncertainties to be considered. For all intents and purposes, stochastic domains are more challenging when it comes to making decisions, but they are also more flexible than deterministic domains with regard to approximations – in other words, simplifying practical assumptions makes automated decision-making possible in practice. A great number of problem formulations exist, which can be used to represent various aspects and decision-making processes in stochastic domains, with the best-known being decision networks and Markov decision processes.

Many applications require a combination of logical (non-stochastic) and stochastic elements, for example when the control of robots requires high-level specifications in logic and low-level representations for a probabilistic sensor model. Processing natural languages is another area in which this assumption applies, since high-level knowledge in logic needs to be combined with low-level models of text and spoken signals.

3.4 Language and communication

In the field of artificial intelligence, processing language is considered to be of fundamental importance, with a distinction being made here between two fields: computational linguistics (CL) and natural language processing (NLP). In short, the difference is that CL research focuses on using computers for language processing purposes, while NLP consists of all applications, including machine translation (MT), Q&A, document summarization, information extraction, to name but a few. In other words, NLP requires a specific task and is not a research discipline per se. NLP comprises:

  • Part-of-speech tagging
  • Natural language understanding
  • Natural language generation
  • Automatic summarization
  • Named-entity recognition
  • Parsing
  • Voice recognition
  • Sentiment analysis
  • Language, topic, and word segmentation
  • Co-reference resolution
  • Discourse analysis
  • Machine translation
  • Word sense disambiguation
  • Morphological segmentation
  • Answers to questions
  • Relationship extraction
  • Sentence splitting

The core vision of AI says that a version of first-order predicate logic (“first-order predicate calculus” or “FOPC”) supported by the necessary mechanisms for the respective problem is sufficient for representing language and knowledge. This thesis says that logic can and should supply the semantics underlying natural language. Although attempts to use a form of logical semantics as the key to representing contents have made progress in the field of AI and linguistics, they have had little success with regard to a program that can translate English into formal logic. To date, the field of psychology has also failed to provide proof that this type of translation into logic corresponds to the way in which people store and manipulate “meaning.” Consequently, the ability to translate a language into FOPC continues to be an elusive goal. Without a doubt, there are NLP applications that need to establish logical inferences between sentence representations, but if these are only one part of an application, it is not clear that they have anything to do with the underlying meaning of the corresponding natural language (and consequently with CL/NLP), since the original task for logical structures was inference. These and other considerations have crystallized into three different positions:

  • Position 1: Logical inferences are tightly linked to the meaning of sentences, because knowing their meaning is equivalent to deriving inferences and logic is the best way to do this.
  • Position 2: A meaning exists outside logic, which postulates a number of semantic markers or primes that are appended to words in order to express their meaning – this is prevalent today in the form of annotations.
  • Position 3: In general, the predicates of logic and formal systems only appear to be different from human language, but their terms are in actuality the words as which they appear

The introduction of statistical and AI methods into the field is the latest trend within this context. The general strategy is to learn how language is processed – ideally in the way that humans do this, although this is not a basic prerequisite. In terms of ML, this means learning based on extremely large corpora that have been translated manually by humans. This often means that it is necessary to learn (algorithmically) how annotations are assigned or how part-of-speech categories (the classification of words and punctuation marks in a text into word types) or semantic markers or primes are added to corpora, all based on corpora that have been prepared by humans (and are therefore correct). In the case of supervised learning, and with reference to ML, it is possible to learn potential associations of part-of-speech tags with words that have been annotated by humans in the text, so that the algorithms are also able to annotate new, previously unknown texts. [18] This works the same way for lightly supervised and unsupervised learning, such as when no annotations have been made by humans and the only data presented is a text in a language with texts with identical contents in other languages or when relevant clusters are found in thesaurus data without there being a defined goal. [19] With regard to AI and language, information retrieval (IR) and information extraction (IE) play a major role and correlate very strongly with each other. One of the main tasks of IR is grouping texts based on their content, whereas IE extracts similarly factual elements from texts or is used to be able to answer questions concerning text contents. These fields therefore correlate very strongly with each other, since individual sentences (not only long texts) can also be regarded as documents. These methods are used, for example, in interactions between users and systems, such as when a driver asks the on-board computer a question regarding the owner’s manual during a journey – once the language input has been converted into text, the question’s semantic content is used as the basis for finding the answer in the manual, and then for extracting the answer and returning it to the driver.

3.5 Agents and actions

In traditional AI, people focused primarily on individual, isolated software systems that acted relatively inflexibly to predefined rules. However, new technologies and applications have established a need for artificial entities that are more flexible, adaptive, and autonomous, and that act as social units in multi-agent systems. In traditional AI (see also “physical symbol system hypothesis”[20] that has been embedded into so-called “deliberative” systems), an action theory that establishes how systems make decisions and act is represented logically in individual systems that must execute actions. Based on these rules, the system must prove a theorem – the prerequisite here being that the system must receive a description of the world in which it currently finds itself, the desired target state, and a set of actions, together with the prerequisites for executing these actions and a list of the results for each action. It turned out that the computational complexity involved rendered any system with time limits useless even when dealing with simple problems, which had an enormous impact on symbolic AI, resulting in the development of reactive architectures. These architectures follow if-then rules that translate inputs directly into tasks. Such systems are extremely simple, although they can solve very complex tasks. The problem is that such systems learn procedures rather than declarative knowledge, i.e., they learn attributes that cannot easily be generalized for similar situations. Many attempts have been made to combine deliberative and reactive systems, but it appears that it is necessary to focus either on impractical deliberative systems or on very loosely developed reactive systems – focusing on both is not optimal.

Principles of the new, agent-centered approach

The agent-oriented approach is characterized by the following principles:

  • Autonomous behavior:

“Autonomy” describes the ability of systems to make their own decisions and execute tasks on behalf of the system designer. The goal is to allow systems to act autonomously in scenarios where controlling them directly is difficult. Traditional software systems execute methods after these methods have been called, i.e., they have no choice, whereas agents make decisions based on their beliefs, desires, and intentions (BDI)[21].

  • Adaptive behavior:

Since it is impossible to predict all the situations that agents will encounter, these agents must be able to act flexibly. They must be able to learn from and about their environment and adapt accordingly. This task is all the more difficult if not only nature is a source of uncertainty, but the agent is also part of a multi-agent system. Only environments that are not static and self-contained allow for an effective use of BDI agents – for example, reinforcement learning can be used to compensate for a lack of knowledge of the world. Within this context, agents are located in an environment that is described by a set of possible states. Every time an agent executes an action, it is “rewarded” with a numerical value that expresses how good or bad the action was. This results in a series of states, actions, and rewards, and the agent is compelled to determine a course of action that entails maximization of the reward.

  • Social behavior:

In an environment where various entities act, it is necessary for agents to recognize their adversaries and form groups if this is required by a common goal. Agent-oriented systems are used for personalizing user interfaces, as middleware, and in competitions such as the RoboCup. In a scenario where there are only self-driving cars on roads, the individual agent’s autonomy is not the only indispensable component – car2car communications, i.e., the exchange of information between vehicles and acting as a group on this basis, are just as important. Coordination between the agents results in an optimized flow of traffic, rendering traffic jams and accidents virtually impossible (see also section 5.1, “Vehicles as autonomous, adaptive, and social agents & cities as super-agents”).

In summary, this agent-oriented approach is accepted within the AI community as the direction of the future.

Multi-agent behavior

Various approaches are being pursued for implementing multi-agent behavior, with the primary difference being in the degree of control that designers have over individual agents.[22],[23],[24] A distinction is made here between:

  • Distributed problem-solving systems (DPS)
  • Multi-agent systems (MAS)

DPS systems allow the designer to control each individual agent in the domain, with the solution to the task being distributed among multiple agents. In contrast, MAS systems have multiple designers, each of whom can only influence their own agents with no access to the design of any other agent. In this case, the design of the interaction protocols is extremely important. In DPS systems, agents jointly attempt to achieve a goal or solve a problem, whereas, in MAS systems, each agent is individually motivated and wants to achieve its own goal and maximize its own benefit. The goal of DPS research is to find collaboration strategies for problem-solving, while minimizing the level of communication required for this purpose. Meanwhile, MAS research is looking at coordinated interaction, i.e., how autonomous agents can be brought to find a common basis for communication and undertake consistent actions.[25] Ideally, a world in which only self-driving cars use the road would be a DPS world. However, the current competition between OEMs means that a MAS world will come into being first. In other words, communication and negotiation between agents will take center stage (see also Nash equilibrium).

Multi-agent learning

Multi-agent learning (MAL) has only relatively recently been bestowed a certain degree of attention.[26],[27],[28],[29] The key problems in this area include determining which techniques should be used and what exactly “multi-agent learning” means. Current ML approaches were developed in order to train individual agents, whereas MAL focuses first and foremost on distributed learning. “Distributed” does not necessarily mean that a neural network is used, in which many identical operations run during training and can accordingly be parallelized, but instead that:

  • A problem is split into subproblems and individual agents learn these subproblems in order to solve the main problem using their combined knowledge OR
  • Many agents try to solve the same problem independently of each other by competing with each other

Reinforcement learning is one of the approaches being used in this context.[30]

4 Data mining and artificial intelligence in the automotive industry

At a high level of abstraction, the value chain in the automotive industry can broadly be described with the following subprocesses:

  1. Development
  2. Procurement
  3. Logistics
  4. Production
  5. Marketing
  6. Sales, after-sales, and retail
  7. Connected customer

Each of these areas already features a significant level of complexity, so the following description of data mining and artificial intelligence applications has necessarily been restricted to an overview.

4.1 Development

Vehicle development has become a largely virtual process that is now the accepted state of the art for all manufacturers. CAD models and simulations (typically of physical processes, such as mechanics, flow, acoustics, vibration, etc., on the basis of finite element models) are used extensively in all stages of the development process.

The subject of optimization (often with the use of evolution strategies[31] or genetic algorithms and related methods) is usually less well covered, even though it is precisely here in the development process that it can frequently yield impressive results. Multi-disciplinary optimization, in which multiple development disciplines (such as occupant safety and noise, vibration, and harshness (NVH)) are combined and optimized simultaneously, is still rarely used in many cases due to supposedly excessive computation time requirements. However, precisely this approach offers enormous potential when it comes to agreeing more quickly and efficiently across the departments involved on a common design that is optimal in terms of the requirements of multiple departments.

In terms of the analysis and further use of simulation results, data mining is already being used frequently to generate co-called “response surfaces.” In this application, data mining methods (the entire spectrum, ranging from linear models to Gaussian processes, support vector machines, and random forests) are used in order to learn a nonlinear regression model as an approximation of the representation of the input vectors for the simulation based on the relevant (numerical) simulation results[32]. Since this model needs to have good interpolation characteristics, cross-validation methods that allow the model’s prediction quality for new input vectors to be estimated are typically used for training the algorithms. The goal behind this use of supervised learning methods is frequently to replace computation-time-consuming simulations with a fast approximation model that, for example, represents a specific component and can be used in another application. In addition, this allows time-consuming adjustment processes to be carried out faster and with greater transparency during development.

One example: It is desirable to be able to immediately evaluate the forming feasibility[33] of geometric variations in components during the course of an interdepartmental meeting instead of having to run complex simulations and wait one or two days for the results. A response surface model that has been previously trained using simulations can immediately provide a very good approximation of the risk of excessive thinning or cracks in this type of meeting, which can then be used immediately for evaluating the corresponding geometry.

These applications are frequently focused on or limited to specific development areas, which, among other reasons, is due to the fact that simulation data management, in its role as a central interface between data generation and data usage and analysis, constitutes a bottleneck. This applies especially when simulation data is intended for use across multiple departments, variants, and model series, as is essential for real use of data in the sense of a continuously learning development organization. The current situation in practice is that department-specific simulation data is often organized in the form of file trees in the respective file system within a department, which makes it difficult to access for an evaluation based on machine learning methods. In addition, simulation data may already be very voluminous for an individual simulation (in the range of terabytes for the latest CFD simulations), so efficient storage solutions are urgently required for machine-learning-based analyses.

While simulation and the use of nonlinear regression models limited to individual applications have become the standard, the opportunities offered by optimizing analytics are rarely being exploited. Particularly with regard to such important issues as multi-disciplinary (as well as cross-departmental) machine learning, learning based on historical data (in other words, learning from current development projects for future projects), and cross-model learning, there is an enormous and completely untapped potential for increasing efficiency.

4.2 Procurement

The procurement process uses a wide variety of data concerning suppliers, purchase prices, discounts, delivery reliability, hourly rates, raw material specifications, and other variables. Consequently, computing KPIs for the purpose of evaluating and ranking suppliers poses no problem whatsoever today. Data mining methods allow the available data to be used, for example, to generate forecasts, to identify important supplier characteristics with the greatest impact on performance criteria, or to predict delivery reliability. In terms of optimizing analytics, the specific parameters that an automotive manufacturer can influence in order to achieve optimum conditions are also important.

Overall, the finance business area is a very good field for optimizing analytics, because the available data contains information about the company’s main success factors. Continuous monitoring[34] is worth a brief mention as an example, here with reference to controlling. This monitoring is based on finance and controlling data, which is continuously prepared and reported. This data can also be used in the sense of predictive analytics in order to automatically generate forecasts for the upcoming week or month. In terms of optimizing analytics, analyses of the key influencing parameters, together with suggested optimizing actions, can also be added to the aforementioned forecasts.

These subject areas are more of a vision than a reality at present, but they do convey an idea of what could be possible in the fields of procurement, finance, and controlling.

4.3 Logistics

In the field of logistics, a distinction can be made between procurement logistics, production logistics, distribution logistics, and spare parts logistics.

Procurement logistics considers the process chain extending from the purchasing of goods through to shipment of the material to the receiving warehouse. When it comes to the purchasing of goods, a large amount of historical price information is available for data mining purposes, which can be used to generate price forecasts and, in combination with delivery reliability data, to analyze supplier performance. As for shipment, optimizing analytics can be used to identify and optimize the key cost factors.

A similar situation applies to production logistics, which deals with planning, controlling, and monitoring internal transportation, handling, and storage processes. Depending on the granularity of the available data, it is possible to identify bottlenecks, optimize stock levels, and minimize the time required, for example here.

Distribution logistics deals with all aspects involved in transporting products to customers, and can refer to both new and used vehicles for OEMs. Since the primary considerations here are the relevant costs and delivery reliability, all the subcomponents of the multimodal supply chain need to be taken into account – from rail to ship and truck transportation through to subaspects such as the optimal combination of individual vehicles on a truck. In terms of used-vehicle logistics, optimizing analytics can be used to assign vehicles to individual distribution channels (e.g., auctions, Internet) on the basis of a suitable, vehicle-specific resale value forecast in order to maximize total sale proceeds. GM implemented this approach as long ago as 2003 in combination with a forecast of expected vehicle-specific sales revenue[35].

In spare parts logistics, i.e., the provision of spare parts and their storage, data-driven forecasts of the number of spare parts needing to be held in stock depending on model age and model (or sold volumes) are one important potential application area for data mining, because it can significantly decrease the storage costs.

As the preceding examples show, data analytics and optimization must frequently be coupled with simulations in the field of logistics, because specific aspects of the logistics chain need to be simulated in order to evaluate and optimize scenarios. Another example is the supplier network, which, when understood in greater depth, can be used to identify and avoid critical paths in the logistics chain, if possible. This is particularly important, as the failure of a supplier to make a delivery on the critical path would result in a production stoppage for the automaker. Simulating the supplier network not only allows this type of bottleneck to be identified, but also countermeasures to be optimized. In order to facilitate a simulation that is as detailed and accurate as possible, experience has shown that mapping all subprocesses and interactions between suppliers in detail becomes too complex, as well as nontransparent for the automobile manufacturer, as soon as attempts are made to include Tier 2 and Tier 3 suppliers as well.

This is why data-driven modeling should be considered as an alternative. When this approach is used, a model is learned from the available data about the supplier network (suppliers, products, dates, delivery periods, etc.) and the logistics (stock levels, delivery frequencies, production sequences) by means of data mining methods. The model can then be used as a forecast model in order, for example, to predict the effects of a delivery delay for specific parts on the production process. Furthermore, the use of optimizing analytics in this case makes it possible to perform a worst-case analysis, i.e., to identify the parts and suppliers that would bring about production stoppages the fastest if their delivery were to be delayed. This example very clearly shows that optimization, in the sense of scenario analysis, can also be used to determine the worst-case scenario for an automaker (and then to optimize countermeasures in future).

4.4 Production

Every sub-step of the production process will benefit from the consistent use of data mining. It is therefore essential for all manufacturing process parameters to be continuously recorded and stored. Since the main goal of optimization is usually to improve quality or reduce the incidence of defects, data concerning the defects that occur and the type of defect is required, and it must be possible to clearly assign this data to the process parameters. This approach can be used to achieve significant improvements, particularly in new types of production process – one example being CFRP[36]. Other potential optimization areas include energy consumption and the throughput of a production process per time unit. Optimizing analytics can be applied both offline and online in this context.

When used in offline applications, the analysis identifies variables that have a significant influence on the process. Furthermore, correlations are derived between these influencing variables and their targets (quality, etc.) and, if applicable, actions are also derived from this, which can improve the targets. Frequently, such analyses focus on a specific problem or an urgent issue with the process and can deliver a solution very efficiently – however, they are not geared towards continuous process optimization. Conducting the analyses and interpreting and implementing the results consistently requires manual sub-steps that can be carried out by data scientists or statisticians – usually in consultation with the respective process experts.

In the case of online applications, there is a very significant difference in the fact that the procedure is automated, resulting in completely new challenges for data acquisition and integration, data pre-processing, modeling, and optimization. In these applications, even the provision of process and quality data needs to be automated, as this provides integrated data that can be used as a basis for modeling at any time. This is crucial given that modeling always needs to be performed when changes to the process (including drift) are detected. The resulting forecast models are then used automatically for optimization purposes and are capable of, e.g., forecasting the quality and suggesting (or directly implementing) actions for optimizing the relevant target variable (quality in this case) even further. This implementation of optimizing analytics, with automatic modeling and optimization, is technically available, although it is more a vision than a reality for most users today.

The potential applications include forming technology (conventional as well as for new materials), car body manufacture, corrosion protection, painting, drive trains, and final assembly, and can be adapted to all sub-steps. An integrated analysis of all process steps, including an analysis of all potential influencing factors and their impact on overall quality, is also conceivable in future – in this case, it would be necessary to integrate the data from all subprocesses.

4.5 Marketing

The focus in marketing is to reach the end customer as efficiently as possible and to convince people either to become customers of the company or to remain customers. The success of marketing activities can be measured in sales figures, whereby it is important to differentiate marketing effects from other effects, such as the general financial situation of customers. Measuring the success of marketing activities can therefore be a complex endeavor, since multivariate influencing factors can be involved.

It would also be ideal if optimizing analytics could always be used in marketing, because optimization goals, such as maximizing return business from a marketing activity, maximizing sales figures while minimizing the budget employed, optimizing the marketing mix, and optimizing the order in which things are done, are all vital concerns. Forecast models, such as those for predicting additional sales figures over time as a result of a specific marketing campaign, are only one part of the required data mining results – multi-criteria decision-making support also plays a decisive role in this context.

Two excellent examples of the use of data mining in marketing are the issues of churn (customer turnover) and customer loyalty. In a saturated market, the top priority for automakers is to prevent loss of custom, i.e., to plan and implement optimal countermeasures. This requires information that is as individualized as possible concerning the customer, the customer segment to which the customer belongs, the customer’s satisfaction and experience with their current vehicle, and data concerning competitors, their models, and prices. Due to the subjectivity of some of this data (e.g., satisfaction surveys, individual satisfaction values), individualized churn predictions and optimal countermeasures (e.g., personalized discounts, refueling or cash rewards, incentives based on additional features) are a complex subject that is always relevant.

Since maximum data confidentiality is guaranteed and no personal data is recorded – unless the customer gives their explicit consent in order to receive offers as individually tailored as possible – such analyses and optimizations are only possible at the level of customer segments that represent the characteristics of an anonymous customer subset.

Customer loyalty is closely related to this subject, and takes on board the question of how to retain and optimize, i.e., increase the loyalty of existing customers. Likewise, the topic of “upselling,” i.e., the idea of offering existing customers a higher-value vehicle as their next one and being successful with this offer, is always associated with this. It is obvious that such issues are complex, as they require information about customer segments, marketing campaigns, and correlated sales successes in order to facilitate analysis. However, this data is mostly not available, difficult to collect systematically, and characterized by varying levels of veracity, i.e., uncertainty in the data.

Similar considerations apply to optimizing the marketing mix, including the issue of trade fair participation. In this case, data needs to be collected over longer periods of time, so that it can be evaluated and conclusions can be drawn. For individual marketing campaigns such as mailing campaigns, evaluating the return business rate with regard to the characteristics of the selected target group is a much more likely objective of a data analysis and corresponding campaign optimization.

In principle, very promising potential applications for optimizing analytics can also be found in the marketing field. However, the complexity involved in data collection and data protection, as well as the partial inaccuracy of data collected, means that a long-term approach with careful planning of the data collection strategy is required. The issue becomes even more complex if “soft” factors such as brand image also need to be taken into account in the data mining process – in this case, all data has a certain level of uncertainty, and the corresponding analyses (“What are the most important brand image drivers?” “How can the brand image be improved?”) are more suitable for determining trends than drawing quantitative conclusions. Nevertheless, within the scope of optimization, it is possible to determine whether an action will have a positive or negative impact, thereby allowing the direction to be determined, in which actions should go.

4.6 Sales, after-sales, and retail

The diversity of potential applications and existing applications in this area is significant. Since the “human factor,” embodied by the end customer, plays a crucial role within this context, it is not only necessary to take into account objective data such as sales figures, individual price discounts, and dealer campaigns; subjective customer data such as customer satisfaction analyses based on surveys or third-party market studies covering such subjects as brand image, breakdown rates, brand loyalty, and many others may also be required. At the same time, it is often necessary to procure and integrate a variety of data sources, make them accessible for analysis, and finally analyze them correctly in terms of the potential subjectivity of the evaluations[37] – a process that currently depends to a large extent on the expertise of the data scientists conducting the analysis.

The field of sales itself is closely intermeshed with marketing. After all, the ultimate objective is to measure the success of marketing activities in terms of turnover based on sales figures. A combined analysis of marketing activities (including distribution among individual media, placement frequency, costs of the respective marketing activities, etc.) and sales can be used to optimize market activities in terms of cost and effectiveness, in which case a portfolio-based approach is always used. This means that the optimum selection of a portfolio of marketing activities and their scheduling – and not just focusing on a single marketing activity – is the main priority. Accordingly, the problem here comes from the field of multi-criteria decision-making support, in which decisive breakthroughs have been made in recent years thanks to the use of evolutionary algorithms and new, portfolio-based optimization criteria. However, applications in the automotive industry are still restricted to a very limited scope.

Similarly, customer feedback, warranty repairs, and production are potentially intermeshed as well, since customer satisfaction can be used to derive soft factors and warranty repairs can be used to derive hard factors, which can then be coupled with vehicle-specific production data and analyzed. In this way, factors that affect the occurrence of quality defects not present or foreseeable at the factory can be determined. This makes it possible to forecast such quality defects and use optimizing analytics to reduce their occurrence. Nevertheless, it is also necessary to combine data from completely different areas – production, warranty, and after-sales – in order to make it accessible to the analysis.

In the case of used vehicles, residual value plays a vital role in a company’s fleet or rental car business, as the corresponding volumes of tens of thousands of vehicles are entered into the balance sheet as assets with the corresponding residual value. Today, OEMs typically transfer this risk to banks or leasing companies, although these companies may in turn be part of the OEM’s corporate group. Data mining and, above all, predictive analytics can play a decisive role here in the correct evaluation of assets, as shown by an American OEM as long as ten years ago[38]. Nonlinear forecasting models can be used with the company’s own sales data to generate individualized, equipment-specific residual value forecasts at the vehicle level, which are much more accurate than the models currently available as a market standard. This also makes it possible to optimize distribution channels – even as far as geographically assigning used vehicles to individual auction sites at the vehicle level – in such a way as to maximize a company’s overall sales success on a global basis.

Considering sales operations in greater detail, it is obvious that knowledge regarding each individual customer’s interests and preferences when buying a vehicle or, in future, temporarily using available vehicles, is an important factor. The more individualized the knowledge concerning the sociodemographic factors for a customer, their buying behavior, or even their clicking behavior on the OEM’s website, as well as their driving behavior and individual use of a vehicle, the more accurately it will be possible to address their needs and provide them with an optimal offer for a vehicle (suitable model with appropriate equipment features) and its financing.

4.7 Connected customer

While this term is not yet established as such at present, it does describe a future in which both the customer and their vehicle are fully integrated with state-of-the-art information technology. This aspect is closely linked to marketing and sales issues, such as customer loyalty, personalized user interfaces, vehicle behavior in general, and other visionary aspects (see also section 5). With a connection to the Internet and by using intelligent algorithms, a vehicle can react to spoken commands and search for answers that, for example, can communicate directly with the navigation system and change the destination. Communication between vehicles makes it possible to collect and exchange information on road and traffic conditions, which is much more precise and up-to-date than that which can be obtained via centralized systems. One example is the formation of black ice, which is often very localized and temporary, and which can be detected and communicated in the form of a warning to other vehicles very easily today.

5 Vision

Vehicle development already makes use of “modular systems” that allow components to be used across multiple model series. At the same time, development cycles are becoming increasingly shorter. Nevertheless, the field of virtual vehicle development has not yet seen any effective attempts to use machine learning methods in order to facilitate automatic learning that extracts both knowledge that is built upon other historical knowledge and knowledge that applies to more than one model series so as to assist with future development projects and organizing them more efficiently. This topic is tightly intermeshed with that of data management, the complexity of data mining in simulation and optimization data, and the difficulty in defining a suitable representation of knowledge concerning vehicle development aspects. Furthermore, this approach is restricted by the organizational limitations of the vehicle development process, which is often still exclusively oriented towards the model being developed. Moreover, due to the heterogeneity of data (often numerical data, but also images and videos, e.g., from flow fields) and the volume of data (now in the terabyte range for a single simulation), the issue of “data mining in simulation data” is extremely complex and, at best, the object of tentative research approaches at this time[39].

New services are becoming possible due to the use of predictive maintenance. Automatically learned knowledge regarding individual driving behavior – i.e., annual, seasonal, or even monthly mileages, as well as the type of driving – can be used to forecast intervals for required maintenance work (brake pads, filters, oil, etc.) with great accuracy. Drivers can use this information to schedule garage appointments in a timely manner, and the vision of a vehicle that can schedule garage appointments by itself in coordination with the driver’s calendar – which is accessible to the on-board computer via appropriate protocols – is currently more realistic than the often cited refrigerator that automatically reorders groceries.

In combination with automatic optimization, the local authorized repair shop, in its role as a central coordination point where individual vehicle service requests arrive via an appropriate telematics interface, can optimally schedule service appointments in real time – keeping workloads as evenly distributed as possible while taking staff availability into account, for example.

With regard to the vehicle’s learning and adaptation abilities, there is virtually limitless potential. Vehicles can identify and classify their drivers’ driving behavior – i.e., assign them to a specific driver type. Based on this, the vehicles themselves can make adjustments to systems ranging from handling to the electronic user interface – in other words, they can offer drivers individualization and adaptation options that extend far beyond mere equipment features. The learned knowledge about the driver can then be transferred to a new vehicle when one is purchased, ensuring that the driver’s familiar environment is immediately available again.

5.1 Vision – Vehicles as autonomous, adaptive, and social agents & cities as super-agents

Research into self-driving cars is here to stay in the automotive industry, and the “mobile living room” is no longer an implausible scenario, but is instead finding a more and more positive response. Today, the focus of development is on autonomy, and for good reason: In most parts of the world, self-driving cars are not permitted on roads, and if they are, they are not widespread. This means that the vehicle as an agent cannot communicate with all other vehicles, and that vehicles driven by humans adjust their behavior based on the events in their drivers’ field of view. Navigation systems offer support by indicating traffic congestion and suggesting alternative routes. However, we now assume that every vehicle is a fully connected agent, with the two primary goals of:

  • Contributing to optimizing the flow of traffic
  • Preventing accidents

In this scenario, agents communicate with each other and negotiate routes with the goal of minimizing total travel time (obvious parameters being, for example, the route distance, the possible speed, roadworks, etc.). Unforeseeable events are minimized, although not eliminated completely – for example, storm damage would still result in a road being blocked. This type of information would then need to be communicated immediately to all vehicles in the relevant action area, after which a new optimization cycle would be required. Moreover, predictive maintenance minimizes damage to vehicles. Historical data is analyzed and used to predict when a defect would be highly likely to occur, and the vehicle (the software in the vehicle, i.e., the agent) makes a service appointment without requiring any input from the passenger and then drives itself to the repair shop –all made possible with access to the passenger’s calendar. In the event of damage making it impossible to continue a journey, this would also be communicated as quickly as possible – either with a “breakdown” broadcast or to a control center, and a self-driving tow truck would be immediately available to provide assistance, ideally followed by a (likewise self-driving) replacement vehicle. In other words, vehicles act:

  • Autonomously in the sense that they automatically follow a route to a destination
  • Adaptively in the sense that they can react to unforeseen events, such as road closures and breakdowns
  • Socially in the sense that they work together to achieve the common goals of optimizing the flow of traffic and preventing accidents (although the actual situation is naturally more complex and many subgoals need to be defined in order for this to be achieved).

In combination with taxi services that would also have a self-driving fleet, it would be possible to use travel data and information on the past use of taxi services (provided that the respective user gives their consent) in order to send taxis to potential customers at specific locations without the need for these customers to actively request the taxis. In a simplified form, it would also be possible to implement this in an anonymized manner, for example, by using data to identify locations where taxis are frequently needed (as identified with the use of clusters; see also section 3.1, “Machine learning”) at specific times or for specific events (rush hour, soccer matches, etc.).

If roads become digital as well, i.e., if asphalt roads are replaced with glass and supplemented with OLED technology, dynamic changes to traffic management would also be possible. From a materials engineering perspective, this is feasible:

  • The surface structure of glass can be developed in such a way as to be skid resistant, even in the rain.
  • Glass can be designed to be so flexible and sturdy that it will not break, even when trucks drive over it.
  • The waste heat emitted by displays can be used to heat roads and prevent ice from forming during winter.

In this way, cities themselves can be embedded as agents in the multi-agent environment and help achieve the defined goals.

5.2 Vision – integrated factory optimization

By using software to analyze customer and repair shop reports and repair data concerning defects occurring in the field, we can already automatically analyze whether an increase in defects can be expected for specific vehicle models or installed parts. The goal here is to identify and avoid potential problems at an early stage, before large-scale recall actions need to be initiated. Causes of defects in the field can be manifold, including deficient quality of the parts being used or errors during production, which, together with the fact that thousands of vehicles leave Volkswagen production plants every day, makes it clear that acting quickly is of utmost importance. Assuming that at present (November 2015), the linguistic analysis of customer statements and repair shop reports shows that a significant increase in right-hand-side parking light failures can be expected for model x, platform C vehicles delivered from July 2015 onwards. In this case, “significant” means that a statistically verifiable upwards trend (increase in reported malfunctions) can be extracted based on vehicle sales between January 2015 and November 2015. By analyzing fault chains and repair chains, it is possible to determine which events will result in a fault or defect or which other models are or will be affected. If the error in production is caused by a production robot, for example, this can be traced back to a hardware fault and/or software error or to an incorrect or incomplete configuration. In the worst-case scenario, it may even be necessary to update the control system in order to eliminate the error. In addition, it is not possible to update software immediately, because work on patches can only begin after the manufacturer has received and reviewed the problem report. Likewise, reconfiguring the robot can be a highly complex task due to the degrees of freedom (axes of rotation) that multi-axis manipulators have at their disposal. In short, making such corrections is time-consuming, demanding, and, in all but ideal scenarios, results in subsequent issues.

Artificial intelligence (AI) approaches can be used to optimize this process at several points.

One of the areas being addressed by AI research is enabling systems (where the term “system” is a synonym for “a whole consisting of multiple individual parts,” especially in the case of software or combinations of hardware and software, such as those found in industrial robots) to automatically extract and interpret knowledge from data, although the extent of this is still limited at present.

Figure 4 – Data, knowledge, action

In contrast to data, knowledge can form the basis for an action, and the result of an action can be fed back into data, which then forms the basis for new knowledge, and so on.

If an agent with the ability to learn and interpret data is supplied with the results (state of the world before the action, state of the world after the action; see also section 3.5) of its own actions or of the actions of other agents, the agent, provided it has a goal and the freedom to adapt as necessary, will attempt to achieve its goal autonomously. Biological agents, such as humans and animals, do this intuitively without needing to actively control or monitor the process of transforming data into knowledge. If, for instance, wood in a DIY project splits because we hammered in a nail too hard at an excessively acute angle, our brain subconsciously transforms the angle, the material’s characteristics, and the force of the hammer blow into knowledge and experience, minimizing the likelihood of us repeating the same mistake.

In the previously discussed, specific area of artificial intelligence referred to as “machine learning,” research is focused on emulating such behavior. Using ML to enable software to learn from data in a specific problem domain and to infer how to solve new events on the basis of past events opens up a world of new possibilities. ML is nothing new in the field of data analysis, where it has been used for many years now. What is new is the possibility to compute highly complex models with data volumes in the petabyte range within a specific time limit. If one thinks of a production plant as an organism pursuing the objective of producing defect-free vehicles, it is clear that granting this organism access to relevant data would help the organism with its own development and improvement, provided, of course, that this organism has the aforementioned capabilities.

Two stages of development are relevant in this case:

Stage 1 – Learning from data and applying experiences

In order to learn from data, a robot must not just operate according to static programming, it must also be able to use ML methods to work autonomously towards defined learning goals. With regard to any production errors that may occur, this means, first and foremost, that the actions being carried out that result in these errors will have been learned, and not programmed based on a flowchart and an event diagram. Assume, for example, that the aforementioned parking light problem has not only been identified, but that its cause can also been traced back to an issue in production, e.g., a robot that is pushing a headlamp into its socket too hard. All that’s now required is to define the learning goal for the corrective measure. Let us also assume that the production error is not occurring with robots in other production plants, and that left-hand headlamps are being installed correctly in general.  In the best-case scenario, we, as humans, would be able to visually recognize and interpret the difference between robots that are working correctly and robots that are not – and the robot making the mistake should be able to learn in a similar way. The difference here is in the type of perception involved – digital systems can “see” much better than us in such cases. Even though the internal workings of ML methods implemented by means of software are rarely completely transparent during the learning process – even for the developer of the learning system – due to the stochastic components and complexity involved, the action itself is transparent, i.e., not how a system does something, but what it does. These signals need to be used in order to initiate the learning process anew and to adapt the control system of the problematic robot. In the aforementioned case, these would be the manipulator and effector motion signals of a robot that is working correctly, which can be measured and defined with any desired level of accuracy. This does not require any human intervention, as the system’s complete transparency is ensured by continuously securing and analyzing the data accrued in the production process. Neither is any human analysis required in the identification and transmission of defects from the field. Based on linguistic analyses of repair shop and customer reports, together with repair data, we can already swiftly identify which problems are attributable to production. The corresponding delivery of this data to the relevant agents (the data situation makes it possible to determine exactly which machine needs to correct itself) allows these agents to learn from the defects and correct themselves.

Stage 2 – Overcoming the limitations of programming – smart factories as individuals

What if the production plant needs to learn things for which even the flexibility of one or more ML methods used by individual agents (such as production or handling robots) is insufficient? Just like a biological organism, a production plant could act as a separate entity composed of subcomponents, similarly to a human, who can be addressed using natural language, understands context, and is capable of interpreting this. The understanding and interpretation of context have always been a challenge in the field of AI research. AI theory views context as a shared (or common) interpretation of a situation, with the context of a situation and the context of an entity relative to a situation being relevant here. Contexts relevant to a production plant include everything that is relevant to production when expressed in natural language or any other way. The following simplified scenario helps in understanding the concept: Let us assume that the final design for a car body is agreed upon by a committee during a meeting.

“We decided on the following body for the Golf 15 facelift. Please build a prototype accordingly based on the Golf 15,” says Mr. Müller while looking at the 3-D model that seems to be floating in front of everyone at the meeting and can only be seen with augmented reality glasses.

In this scenario, use of evolutionary algorithms for simulation is conceivable, limited to the possible combinations that can actually be built. Provided that the required computing power is available and the parameters involved have been reduced, this can cut simulation times from several hours to minutes, making dynamic morphing of components or component combinations possible during a meeting.

Factory : “Based on the model input, I determined that it will take 26 minutes to adjust the programming of my robots. In order to assemble the floor assembly, tools x1, y1 must be replaced with tools x2, y2 on robots x, y. Production of the prototype will be completed in 6 hours and 37 minutes.”

Of course, this scenario is greatly simplified, but it should still show what the future may hold. In order to understand what needs to be done, the production plant must understand what a car body is, what a facelift is, etc. and interpret the parameters and output from a simulation in such a way that they can be converted into production steps. This conversion into production steps requires the training of individual ML components on the robots or the adaptation/enhancement of their programs based on the simulation data, so that all steps can be carried out, from cutting sheet metal to assembling and integrating the (still fictitious) Golf 15 basic variant.  And while this encompasses a wide range of methods, extending from natural language understanding and language generation through to planning, optimization, and autonomous model generation, it is by no means mere science fiction.

5.3 Vision – companies acting autonomously

When planning marketing activities or customer requirements, for example, it is imperative for companies of all types to monitor how sales change over time, to predict how markets will develop and which customers will potentially be lost, to respond to financial crises, and to quickly interpret the potential impact of catastrophes or political structures. We already do all this today, and what we need for it is data. We are not interested in the personal data of individuals, but in what can be derived from many individual components. For example, by analyzing over 1,600 indicators, we can predict how certain financial indicators for markets will move and respond accordingly or we can predict, with a high probability of being correct, which customer groups find models currently in pre-production development appealing and then derive marketing actions accordingly. In fact, we can go so far as to determine fully configured models to suit the tastes of specific customer groups. With this knowledge, we make decisions, adjust our production levels, prepare marketing plans, and propose fully configured models appropriate for specific customer groups in the configurator.

Preparing a marketing plan sometimes follows a static process (what needs to be done), but how something is done remains variable. As soon as it is possible to explain to another person how and why something is being done, this information can also be made available to algorithms. Breaking it down into an example, we can predict that one of our competitors opening a new production plant in a country where we already have manufacturing operations would result in us having to expect a drop in our sales. We can even predict (within a certain fluctuation range) how large this expected drop in sales would be. In this case, “we” refers to the algorithms that we have developed for this specific use case. Since this situation occurs more than once and requires (virtually) identical input parameters every time, we can use the same algorithms to predict events in other countries. This makes it possible to use knowledge from past marketing campaigns in order to conduct future campaigns. In short, algorithms would prepare the marketing plans.

When provided with a goal, such as maximizing the benefit for our customers while taking account of cost-effectiveness, algorithm sub-classes would be able to take internal data (such as sales figures and configurator data) and external data (such as stock market trends, financial indicators, political structures) to autonomously generate output for a “marketing program” or “GDP program.” If a company were allowed to use its resources and act autonomously, then it would be able to react autonomously to fluctuations in markets, subsidize vulnerable suppliers, and much more. The possibilities are wide-ranging, and such scenarios can already be technically conceived today. Continuously monitoring stock prices, understanding and interpreting news items, and taking into account demographic changes are only a few of the areas that are relevant and that, among other things, require a combination of natural language understanding, knowledge-based systems, and the ability to make logical inferences. In the field of AI research, language and visual information are very frequently used as the basis for understanding things, because we humans also learn and understand a great deal using language and visual stimuli.

6 Conclusions

Artificial intelligence has already found its way into our daily lives, and is no longer solely the subject of science fiction novels. At present, AI is used primarily in the following areas:

  • Analytical data processing
  • Domains in which qualified decisions need to be made quickly on the basis of a large amount of (often heterogeneous) data
  • Monotonous activities that still require constant alertness

In the field of analytical data processing, the next few years will see us transition from exclusive use of decision-support systems to additional use of systems that make decisions on our behalf. Particularly in the field of data analysis, we are currently developing individual analytical solutions for specific problems, although these solutions cannot be used across different contexts – for example, a solution developed to detect anomalies in stock price movements cannot be used to understand the contents of images. This will remain the case in the future, although AI systems will integrate individual interacting components and consequently be able to take care of increasingly complex tasks that are currently reserved exclusively for humans – a clear trend that we can already observe today. A system that not only processes current data regarding stock markets, but that also follows and analyzes the development of political structures based on news texts or videos, extracts sentiments from texts in blogs or social networks, monitors and predicts relevant financial indicators, etc. requires the integration of many different subcomponents – getting these to interact and cooperate is the subject of current research, and new advances in this field are being published every week. In a world where AI systems are able to improve themselves continuously and, for example, manage companies more effectively than humans, what would be left for humans? Time for expanding one’s knowledge, improving society, eradicating hunger, eliminating diseases, and spreading our species beyond our own solar system.[40] Some theories say that quantum computers are required in order to develop powerful AI systems[41], and only a very careless person would suggest than an effective quantum computer will be available within the next 10 years. Then again, only a careless person would suggest that it will not. Regardless of this, and as history has taught us time and time again with the majority of relevant scientific accomplishments, caution will also have to be exercised when implementing artificial intelligence – systems capable of making an exponentially larger number of decisions in extremely short times as hardware performance improves can achieve many positive things, but they can also be misused.

Authors

Dr. Martin Hofmann

Dr. Martin Hofmann is Executive Vice President of the Volkswagen Group and Group CIO at Volkswagen AG. One of his most prominent initiatives has been the creation of Information Technology Labs in San Francisco, Berlin and Munich, which are considered leading in the field of data science, applied AI and Machine Learning in the automotive industry. He has a long history at the Volkswagen Group, joining in 2001, where he took charge of Group Procurement Process and Information Management.

Previously, he worked at the international IT service provider Electronic Data Systems Corporation (EDS) where he held several senior management positions and served as Executive Director Digital Supply Chain in the United States.

Dr. Hofmann graduated Harvard Business School AMP, has a PhD in engineering from the ETH Zurich and a degree in business computer science and business administration from the University of Mannheim.

Dr. Florian Neukart

Dr. Florian Neukart is Principal Data Scientist at Volkswagen Group of America. He is lecturer and scientist in the fields of quantum computers and artificial intelligence at University of Leiden. Furthermore, he is the author of “Reverse Engineering the Mind – Consciously Acting Machines and Accelerated Evolution”.

Prof. Dr. Thomas Bäck

Prof. Dr. Thomas Bäck is scientist for global optimization, predictive analytics and Industry 4.0. As an entrepreneur, he has more than 20 years experience in industrial applications with companies such as 3M, Air Liquide, BMW, Daimler, Unilever and Volkswagen. Furthermore, he is an author of more than 300 scientific publications, e.g. “Evolutionary Algorithms in Theory and Practice” and co-inventor of 4 patents.

Sources

[1] D. Silver et. al.: Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature 529, 484-489 (January 28, 2016).

[2] https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

[3] Systems “in which information and software components are connected to mechanical and electronic components and in which data is transferred and exchanged, and monitoring and control tasks are carried out, in real-time using infrastructures such as the Internet.” (Translation of the following article in Gabler Wirtschaftslexikon, Springer:  http://wirtschaftslexikon.gabler.de/Definition/cyber-physische-systeme.html).

[4] Industry 4.0 is defined therein as “a marketing term that is also used in science communication and refers to a ‘future project’ of the German federal government. The so-called ‘Fourth Industrial Revolution’ is characterized by the customization and hybridization of products and the integration of customers and business partners into business processes.” (Translation of the following article in Gabler Wirtschaftslexikon, Springer: http://wirtschaftslexikon.gabler.de/Definition/industrie-4-0.html).

[5] E. Rich, K. Knight: Artificial Intelligence, 5, 1990

[6] Th. Bäck, D.B. Fogel, Z. Michalewicz: Handbook of Evolutionary Computation, Institute of Physics Publishing, New York, 1997.

[7] R. Bajcsy: Active perception, Proceedings of the IEEE, 76:996-1005, 1988

[8] J. L. Crowley, H. I. Christensen: Vision as a Process: Basic Research on Computer Vision Systems, Berlin: Springer, 1995

[9] D. P. Huttenlocher, S. Ulman: Recognizing Solid Objects by Alignment with an Image, International Journal of Computer Vision, 5: 195-212, 1990

[10] K. Frankish, W. M. Ramsey: The Cambridge Handbook of Artificial Intelligence, Cambridge: Cambridge University Press, 2014

[11] F. Chaumette, S. Hutchinson: Visual Servo Control I: Basic Approaches, IEEE Robotics and Automation Magazine, 13(4): 82-90, 2006

[12] E. D. Dickmanns: Dynamic Vision for Perception and Control of Motion, London: Springer, 2007

[13] T. M. Straat, M. A. Fischler: Context-Based Vision: Recognizing Objects Using Information from Both 2D and 3D Imagery, IEEE Transactions on Pattern Analysis and Machine Intelligence, 13: 1050-65, 1991

[14] D. Hoiem, A. A. Efros, M. Hebert: Putting Objects in Perspective, Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2137-44, 2006

[15] H. Buxton: Learning and Understanding Dynamic Scene Activity: A Review, Vision Computing, 21: 125-36, 2003

[16] N. Lavarac, S. Dzeroski: Inductive Logic Programming, Vol. 3: Non-Monotonic Reasoning and Uncertain Reasoning, Oxford University Press: Oxford, 1994

[17] K. Frankish, W. M. Ramsey: The Cambridge Handbook of Artificial Intelligence, Cambridge: Cambridge University Press, 2014

[18] G. Leech, R. Garside, M. Bryant: . CLAWS4: The Tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94) Kyoto, Japan, pp. 622-628, 1994

[19] K. Spärck Jones: Information Retrieval and Artificial Intelligence, Artificial Intelligence 141: 257-81, 1999

[20] A. Newell, H. A. Simon: Computer Science as Empirical Enquiry: Symbols and Search, Communications of the ACM 19:113-26

[21] M. Bratman, D. J. Israel, M. E. Pollack: Plans and Resource-Bounded Practical Reasoning, Computational Intelligence, 4: 156-72, 1988

[22] H. Ah. Bond, L. Gasser: Readings in Distributed Artificial Intelligence, San Mateo, CA: Morgan Kaufmann, 1988

[23] E. H. Durfee: Coordination for Distributed Problem Solvers, Boston, MA: Kluwer Academic, 1988

[24] G. Weiss: Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, Cambridge, MA: MIT Press, 1999

[25] K. Frankish, W. M. Ramsey: The Cambridge Handbook of Artificial Intelligence, Cambridge: Cambridge University Press, 2014

[26] E. Alonso: Multi-Agent Learning, Special Issue of Autonomous Agents and Multi-Agent Systems 15(1), 2007

[27] E. Alonso: M. d’Inverno, D. Kudenko, M. Luck, J. Noble: Learning in Multi-Agent Systems, Knowledge Engineering Review 16: 277-84, 2001

[28] M. Veloso, P. Stone: Multiagent Systems: A Survey from a Machine Learning Perspective, Autonomous Robots 8: 345-83, 2000

[29] R. Vohra, M. Wellmann: Foundations of Multi-Agent Learning, Artificial Intelligence 171:363-4, 2007

[30] L. Busoniu, R. Babuska, B. De Schutter: A Comprehensive Survey of Multi-Agent Reinforcement Learning, IEEE Transactions on Systems, Man, and Cybernetics – Part C: Applications and Reviews 38: 156-72, 2008

[31] “Evolution strategies” are a variant of “evolutionary algorithms,” which has been developed in Germany. They offer significantly better performance than genetic algorithms for this type of task. See also Th. Bäck: Evolutionary Algorithms in Theory and Practice, Oxford University Press, NY, 1996.

[32] For example: Th. Bäck, C. Foussette, P. Krause: Automatische Metamodellierung von CAE-Simulationsmodellen (Automatic Meta-Modeling of CAE Simulation Models), ATZ – Automobiltechnische Zeitschrift 117(5), 64-69, 2015.

[33] For details: http://www.divis-gmbh.de/fileadmin/download/fallbeispiele/140311_Fallbeispiel_BMW_Machbarkeitsbewertung_Umformsimulation_DE.pdf

[34] This term is used in a great many ways and is actually very general in nature. We use it here in the sense of continuous monitoring of company KPIs that are automatically generated and analyzed on a weekly basis, for example.

[35] http://www.automotive-fleet.com/news/story/2003/02/gm-installs-nutechs-vehicle-distribution-system.aspx

[36] C. Sorg: Data Mining als Methode zur Industrialisierung und Qualifizierung neuer Fertigungsprozesse für CFK-Bauteile in automobiler Großserienproduktion (Data Mining as a Method for the Industrialization and Qualification of New Production Processes for CFRP Components in Large-Scale Automotive Production). Dissertation, Technical University of Munich, 2014. Dr. Hut Verlag.

[37] One example can be found in this article: http://www.enbis.org/activities/events/current/214_ENBIS_12_in_Ljubljana/programmeitem/1183_Ask_the_Right_Questions__or_Apply_Involved_Statistics__Thoughts_on_the_Analysis_of_Customer_Satisfaction_Data.

[38] http://www.syntragy.com/doc/q3-05%5B1%5D.pdf

[39] L. Gräning, B. Sendhoff: Shape Mining: A Holistic Data Mining Approach to Engineering Design. Advanced Engineering Informatics 28(2), 166-185, 2014.

[40] F. Neukart: Reverse-Engineering the Mind; see https://www.scribd.com/mobile/doc/195264056/Reverse-Engineering-the-Mind, 2014

[41] F. Neukart: On Quantum Computers and Artificial Neural Networks, Signal Processing Research 2(1), 2013

Responsible Handling of Data – Process Mining Rule 2 of 4

This is article no. 2 of the four-part article series Privacy, Security and Ethics in Process Mining.

Read this article in German:
Datenschutz, Sicherheit und Ethik beim Process Mining – Regel 2 von 4

Like in any other data analysis technique, you must be careful with the data once you have obtained it. In many projects, nobody thinks about the data handling until it is brought up by the security department. Be that person who thinks about the appropriate level of protection and has a clear plan already prior to the collection of the data.

Do:

  • Have external parties sign a Non Disclosure Agreement (NDA) to ensure the confidentiality of the data. This holds, for example, for consultants you have hired to perform the process mining analysis for you, or for researchers who are participating in your project. Contact your legal department for this. They will have standard NDAs that you can use.
  • Make sure that the hard drive of your laptop, external hard drives, and USB sticks that you use to transfer the data and your analysis results are encrypted.

Don’t:

  • Give the data set to your co-workers before you have checked what is actually in the data. For example, it could be that the data set contains more information than you requested, or that it contains sensitive data that you did not think about. For example, the names of doctors and nurses might be mentioned in a free-text medical notes attribute. Make sure you remove or anonymize (see guideline No. 3) all sensitive data before you pass it on.
  • Upload your data to a cloud-based process mining tool without checking that your organization allows you to upload this kind of data. Instead, use a desktop-based process mining tool (like Disco [3] or ProM [4]) to analyze your data locally or get the cloud-based process mining vendor to set-up an on-premise version of their software within your organization. This is also true for cloud-based storage services like Dropbox: Don’t just store data or analysis results in the cloud even if it is convenient.

Five Illusions about Big Data you can’t help but believe in

Big Data is a smorgasbord of data. Even the marketing world has acknowledged the gravity of Big Data. But alas! Instead of having such a resplendent data power by our side, we are no closer to construct smart marketing decisions than before, when the concept was not well known.

So, something is definitely not right, right? Not all information derived from this industry is precise and to address this issue, I have highlighted five common misconceptions about Big Data. Know it, work on it and gain from it.

 

Misconception 1: Human touch surpasses automation

Entrepreneurs are the ones who pull their weight. The human effort they offer yields potential success for the firm, only if it is backed by meaningful data.
“One of the most common misconceptions is that people believe they will always outperform computers in their decision-making process. That may have been the case in the past, but with the complexity of today’s markets and the advancement of technology, this assumption no longer holds true,” says Victor Rosenman, CEO of Feedvisor, the pioneer of Algo-Commerce. He added, “All business owners are constantly required to make critical decisions, and the most effective decisions are not based on gut feelings, but on facts and data.”

Misconception 2: Data leads to more costs

Money makes a business. It is also the other way round. Using artificial intelligence, small business-owners benefit the most. AI saves time and money both, thus helps in raising the revenues. You need to understand that big data wouldn’t be enjoying the current hot seat status, if it was that expensive to implement. They are low on cost now, even getting lower. Moreover, besides being inexpensive, big data also aid in curbing other costs that the company would have to bear otherwise.

Misconception 3: Data takes the lead in big changes

“The view of cognitive systems as brains that automatically solve any problem is a popular misconception.” – IBM’s Brandon Buckner recently said. Integrated tools are mostly implemented to do stuffs like gauge human expertise and enhance human intelligence. By this, he meant that technologies actually support your business instead of taking the lead. With data, business-owners enjoy better decision-making capabilities, which is propitious for future business endeavours.

Misconception 4: Little data is too little to make any impact

Though big data arrests the glowing eyes, little data seizes the mind.  Little data is a small set of data. We know that people always look for a bulk of information, but at times, quality is not what they seek. Sometimes, little data can do the job, which bulk data fail to do. The information in little data is more restrained, clean and unprecedented.

Misconception 5: Big data for big businesses

No more, you need to shell out ludicrous amounts of money to acquire big data technologies. Non- Fortune 500 companies are also introducing big data in their systems. And the best part is that it is no more confined to a single sector, it is omnipresent in almost every industry.

In 2011 McKinsey Global Institute report called “Big data: The next frontier for innovation, competition, and productivity” revealed: “The use of big data will become a key basis of competition and growth for individual firms.” Now it is 2017, so just think how big Big Data must have grown in size and scope over the past 6 years.

Clarify Goal of the Analysis – Process Mining Rule 1 of 4

This is article no. 1 of the four-part article series Privacy, Security and Ethics in Process Mining.

Read this article in German:
Datenschutz, Sicherheit und Ethik beim Process Mining – Regel 1 von 4

Clarify Goal of the Analysis

The good news is that in most situations Process Mining does not need to evaluate personal information, because it usually focuses on the internal organizational processes rather than, for example, on customer profiles. Furthermore, you are investigating the overall process patterns. For example, a process miner is typically looking for ways to organize the process in a smarter way to avoid unnecessary idle times rather than trying to make people work faster.

However, as soon as you would like to better understand the performance of a particular process, you often need to know more about other case attributes that could explain variations in process behaviours or performance. And people might become worried about where this will lead them.

Therefore, already at the very beginning of the process mining project, you should think about the goal of the analysis. Be clear about how the results will be used. Think about what problem are you trying to solve and what data you need to solve this problem.

Do:

  • Check whether there are legal restrictions regarding the data. For example, in Germany employee-related data cannot be used and typically simply would not be extracted in the first place. If your project relates to analyzing customer data, make sure you understand the restrictions and consider anonymization options (see guideline No. 3).
  • Consider establishing an ethical charter that states the goal of the project, including what will and what will not be done based on the analysis. For example, you can clearly state that the goal is not to evaluate the performance of the employees. Communicate to the people who are responsible for extracting the data what these goals are and ask for their assistance to prepare the data accordingly.

Don’t:

  • Start out with a fuzzy idea and simply extract all the data you can get. Instead, think about what problem are you trying to solve? And what data do you actually need to solve this problem? Your project should focus on business goals that can get the support of the process managers you work with (see guideline No. 4).
  • Make your first project too big. Instead, focus on one process with a clear goal. If you make the scope of your project too big, people might block it or work against you while they do not yet even understand what process mining can do.