Posts

Five ways Data Science is used in Fintech

Data science experts process and act upon data that digital resources produce. In the fintech world, data comes from mobile apps, transactions, conversations and financial standings. With this data for fintech, experts can improve the experience and success of businesses and customers alike.

Apps like PayPal, Venmo and Cash App have led the way for other fintech organizations, big and small, to grow. In fact, roughly 65% of Americans are already using digital banking in some capacity, whether it’s an app or online service. This growth, in turn, brings benefits. From personalization to integrating robotic advisors, here are five ways data scientists help fintech brands.

1. Personalization

Finance is one of the most personal industries out there as it deals with your private accounts and data. To match this uniqueness, fintechs can use data science for personalization. That way, customer service caters to individual needs.

As the fintech company gathers data from individual transactions, communications, behavior and interests, data scientists can then use said data to curate a better experience for the customer. They can advertise products and services that the customer may need to help with savings, for instance.

Contis is one example of a fintech that has integrated personalization into its services. Customers receive specific recommendations to create an efficient experience.

2. Fundraising

Fundraising had an interesting year in 2020. Amid racial justice protests and movements, crowdfunding took off on fintechs like GoFundMe and Kickstarter. These platforms helped provide funding for those who needed it. From here, data scientists can use fundraising in unique ways.

They can help raise money by targeting people who have donated in the past, or who are likely to donate based on spending habits. This data provides a more well-rounded fundraising campaign.

Then, once they do have donors, they can again use data to segment contributors by interest, demographic or engagement history. This segmentation helps advertise in a more personal, interest-specific way.

3. Fraud Detection

Cybercriminals thrive on an abundance of digital interactions. With the rise in digital banking — and the pandemic-driven shift to technology — fintechs could potentially see high rates of fraud. In fact, by the end of 2020, the United States saw about $11 billion in lost funds from credit card fraud alone.

Data for fintech brands will help address and prevent fraud like this in the future. As customers produce data from their transactions and interactions, it provides a better picture of their behavior. If there’s deviance, the data then shows potential fraud may be occurring.

If fraud does occur, data scientists can then use that instance to learn and properly recognize how data behaves during cybercriminal activity.

4. Robo-Advisors

With more people using fintech services, employees have a lot on their hands. They must properly address the customers’ needs and provide solutions. However, in the online world, employees are now getting some robotic assistance.

Robo-advisors use machine learning algorithms to interact with customers online or on mobile apps. They ask questions, understand the problems and provide solutions. They also collect data like customer goals and financial plans, which they can report back to data scientists for analysis.

Overall, roughly 75% and 46% of large and small banks, respectively, are implementing artificial intelligence to some degree. This data-driven revolution is one to keep your eye on.

5. Blockchain Governance

Blockchain governance is a somewhat newer way that experts can use data for fintech services. The blockchain is commonly known for its support of cryptocurrency services. Though crypto assets like Bitcoin and Ethereum are on the rise, the blockchain itself is still getting its footing.

Now, fintechs like PayPal are offering crypto services, which means data scientists will be able to expand what’s possible for digital banking. As customers transfer crypto funds, data scientists can monitor their activity and get a better handle on the data that exists on the blockchain. From there, they can provide personalization and prevent fraud in the same ways as they would with standard digital banking.

A Changing Landscape

As data scientists continue to help fintech services grow, you’ll notice each of these five areas begins to become more common. Some, like personalization and fraud detection, are already key focuses for fintech companies. However, alongside robo-advisor, fundraising and blockchain, they all have room to grow through the use of data science.

Data Mining Process flow – Easy Understanding

1 Overview

Development of computer processing power, network and automated software completely change and give new concept of each business. And data mining play the vital part to solve, finding the hidden patterns and relationship from large dataset with business by using sophisticated data analysis tools like methodology, method, process flow etc.

On this paper, proposed a process flow followed CRISP-DM methodology and has six steps where data understanding does not considered.

Phase of new process flow given below:-

Phase 1: Involved with collection, outliner treatment, imputation, transformation, scaling, and partition dataset in to two sub-frames (Training and Testing). Here as an example for outliner treatment, imputation, transformation, scaling consider accordingly Z score, mean, One hot encoding and Min Max Scaler.

Phase 2: On this Phase training and testing data balance with same balancing algorithm but separately. As an example here SMOTE (synthetic minority oversampling technique) is considered.

Phase 3: This phase involved with reduction, selection, aggregation, extraction. But here for an example considering same feature reduction algorithm (LDA -Linear Discriminant analysis) on training and testing data set separately.

Phase 4: On this Phase Training data set again partition into two more set (Training and Validation).

Phase 5: This Phase considering several base algorithms as a base model like CNN, RNN, Random forest, MLP, Regression, Ensemble method. This phase also involve to find out best hyper parameter and sub-algorithm for each base algorithm. As an example on this paper consider two class classification problems and also consider Random forest (Included CART – Classification and Regression Tree and GINI index impurity) and MLP classifier (Included (Relu, Sigmoid, binary cross entropy, Adam – Adaptive Moment Estimation) as base algorithms.

Phase 6: First, Prediction with validation data then evaluates with Test dataset which is fully unknown for these (Random forest, MLP classifier) two base algorithms. Then calculate the confusion matrix, ROC, AUC to find the best base algorithm.

New method from phase 1 to phase 4 followed CRISP-DM methodology steps such as data collection, data preparation then phase 5 followed modelling and phase 6 followed evaluation and implementation steps.

Structure of proposed process flow for two class problem combined with algorithm and sub-algorithm display on figure – 1.

These articles mainly focus to describe all algorithms which are going to implementation for better understanding.

 

 

Data Mining Process Flow

Figure 1 – Data Mining Process Flow

2 Phase 1: Outlier treatment, Transform, Scaling, Imputation

This phase involved with outlier treatment, imputation, scaling, and transform data.

2.1 Outliner treatment: – Z score

Outlier is a data point which lies far from all other data point in a data set. Outlier need to treat because it may bias the entire result. Outlier treatment with Z score is a common technique.  Z score is a standard score in statistics.  Z score provides information about data value is smaller or grater then mean that means how many standard deviations away from the mean value. Z score equation display below:

Z = \frac{(x - \mu)}{\sigma}

Here x = data point
σ = Standard deviation
μ = mean value

Equation- 1 Z-Score

In a normal distribution Z score represent 68% data lies on +/- 1, 95% data point lies on +/- 2, 99.7% data point lies on +/- 3 standard deviation.

2.2 Imputation data: – mean

Imputation is a way to handle missing data by replacing substituted value. There are many imputation technique represent like mean, median, mode, k-nearest neighbours. Mean imputation is the technique to replacing missing information with mean value. On the mean imputation first calculate the particular features mean value and then replace the missing value with mean value. The next equation displays the mean calculation:

\mu = \frac{(\sum x)}{n}

Here x = value of each point
n = number of values
μ = mean value

Equation- 2 Mean

2.3 Transform: – One hot encoding

Encoding is a pre-processing technique which represents data in such a way that computer can understand.  For understanding of machine learning algorithm categorical columns convert to numerical columns, this process called categorical encoding. There are multiple way to handle categorical variable but most widely used techniques are label encoding and one host encoding. On label encoding give a numeric (integer number) for each category. Suppose there are 3 categories of foods like apples, orange, banana. When label encoding is used then 3 categories will get a numerical value like apples = 1, banana = 2 and orange = 3. But there is very high probability that machine learning model can capture the relationship in between categories such as apple < banana < orange or calculate average across categories like 1 +3 = 4 / 2 = 2 that means model can understand average of apple and orange together is banana which is not acceptable because model correlation calculation is wrong. For solving this problem one hot encoding appear. The following table displays the label encoding is transformed into one hot encoding.

Label Encoding and One-Hot-Encoding

Table- 1 Encoding example

On hot encoding categorical value split into columns and each column contains 0 or 1 according to columns placement.

2.4 Scaling data: – Min Max Scaler

Feature scaling method is standardized or normalization the independent variable that means it is used to scale the data in a particular range like -1 to +1 or depending on algorithm. Generally normalization used where data distribution does not follow Gaussian distribution and standardization used where data distribution follow Gaussian distribution. On standardization techniques transform data values are cantered around the mean and unit is standard deviation. Formula for standardization given below:

Standardization X = \frac{(X - \mu)}{\sigma}

Equation-3 Equations for Standardization

X represent the feature value, µ represent mean of the feature value and σ represent standard deviation of the feature value. Standardized data value does not restrict to a particular range.

Normalization techniques shifted and rescaled data value range between 0 and 1. Normalization techniques also called Min-Max scaling. Formula for normalization given below:

Normalization X = \frac{(X - X_{min})}{X_{max} - X_{min}}

Equation – 4 Equations for Normalization

Above X, Xmin, Xmax are accordingly feature values, feature minimum value and feature maximum value. On above formula when X is minimums then numerator will be 0 (  is 0) or if X is maximums then the numerator is equal to the denominator (  is 1). But when X data value between minimum and maximum then  is between 0 and 1. If ranges value of data does not normalized then bigger range can influence the result.

3 Phase 2: – Balance Data

3.1 SMOTE

SMOTE (synthetic minority oversampling technique) is an oversampling technique where synthetic observations are created based on existing minority observations. This technique operates in feature space instead of data space. Under SMOTE each minority class observation calculates k nearest neighbours and randomly chose the neighbours depending on over-sampling requirements. Suppose there are 4 data point on minority class and 10 data point on majority class. For this imbalance data set, balance by increasing minority class with synthetic data point.   SMOTE creating synthetic data point but it is necessary to consider k nearest neighbours first. If k = 3 then SMOTE consider 3 nearest neighbours. Figure-2 display SMOTE with k = 3 and x = x1, x2, x3, x4 data point denote minority class. And all circles represent majority class.

SMOTE Example

Figure- 2 SMOTE example

 

4 Phase 3: – Feature Reduction

4.1 LDA

LDA stands for Linear Discriminant analysis supervised technique are commonly used for classification problem.  On this feature reduction account continuous independent variable and output categorical variable. It is multivariate analysis technique. LDA analyse by comparing mean of the variables.  Main goal of LDA is differentiate classes in low dimension space. LDA is similar to PCA (Principal component analysis) but in addition LDA maximize the separation between multiple classes. LDA is a dimensionality reduction technique where creating synthetic feature from linear combination of original data set then discard less important feature. LDA calculate class variance, it maximize between class variance and minimize within class variance. Table-2 display the process steps of LDA.

LDA Process

Table- 2 LDA process

5 Phase 5: – Base Model

Here we consider two base model ensemble random forest and MLP classifier.

5.1 Random Forest

Random forest is an ensemble (Bagging) method where group of weak learner (decision tree) come together to form a strong leaner. Random forest is a supervised algorithm which is used for regression and classification problem. Random forests create several decisions tree for predictions and provide solution by voting (classification) or mean (regression) value. Working process of Random forest given below (Table -3).

Random Forest

Table-3 Random Forest process

When training a Random forest root node contains a sample of bootstrap dataset and the feature is as same as original dataset. Suppose the dataset is D and contain d record and m number of columns. From the dataset D random forest first randomly select sample of rows (d) with replacement and sample of features (n) and give it to the decision tree. Suppose Random forest created several decision trees like T1, T2, T3, T4 . . . Tn. Then randomly selected dataset D = d + n is given to the decision tree T1, T2, T3, T4 . . . Tn where D < D, m > n and d > d.  After taking the dataset decision tree give the prediction for binary classification 1 or 0 then aggregating the decision and select the majority voted result. Figure-3 describes the structure of random forest process.

Random Forest Process

Figure- 3 Random Forest process

On Random forest base learner Decision Tree grows complete depth where bias (properly train on training dataset) is low and variance is high (when implementing test data give big error) called overfitting. On Random forest using multiple decision trees where each Decision tree is high variance but when combining all decision trees with the respect of majority vote then high variance converted into low variance because using row and feature sampling with replacement and taking the majority vote where decision is not depend on one decision tree.

CART (Classification and Regression Tree) is binary segmentation technique. CART is a Gini’s impurity index based classical algorithm to split a dataset and build a decision tree. By splitting a selected dataset CART created two child nodes repeatedly and builds a tree until the data no longer be split. There are three steps CART algorithm follow:

  1. Find best split for each features. For each feature in binary split make two groups of the ordered classes. That means possibility of split for k classes is k-1. Find which split is maximized and contain best splits (one for each feature) result.
  2. Find the best split for nodes. From step 1 find the best one split (from all features) which maximized the splitting criterion.
  3. Split the best node from step 2 and repeat from step 1 until fulfil the stopping criterion.

 

For splitting criteria CART use GINI index impurity algorithm to calculate the purity of split in a decision tree. Gini impurity randomly classified the labels with the same distribution in the dataset. A Gini impurity of 0 (lowest) is the best possible impurity and it is achieve when everything is in a same class. Gini index varies from 0 to 1. 0 indicate the purity of class where only one class exits or all element under a specific class. 1 indicates that elements are randomly distributed across various classes. And 0.5 indicate equal elements distributed over classes. Gini index (GI) described by mathematically that sum of squared of probabilities of each class (pi) deducted from one (Equation-5).

Gini Impurities

Equation – 5 Gini impurities

Here (Equation-5) pi represent the probability (probability of p+ or yes and probability of p- or no) of distinct class with classified element. Suppose randomly selected feature (a1) which has 8 yes and 4 no. After the split right had side (b1 on equation-6) has 4 yes and 4 no and left had side (b2 on equation – 7) has 4 yes and 0 no. here b2 is a pure split (leaf node) because only one class yes is present. By using the GI (Gini index) formula for b1 and b2:-

Equation- 6 & 7 – Gini Impurity b1 & Gini Impurity b2

Here for b1 value 0.5 indicates that equal element (yes and no) distribute over classes which is not pure split. And b2 value 0 indicates pure split. On GINI impurity indicates that when probability (yes or no) increases GINI value also increases. Here 0 indicate pure split and .5 indicate equal split that means worst situation. After calculating the GINI index for b1 and b2 now calculate the reduction of impurity for data point a1. Here total yes 8 (b1 and b2 on Equation – 8) and total no 4 (b1) so total data is 12 on a1. Below display the weighted GINI index for feature a1:

Total data point on b1 with Gini index (m) = 8/12 * 0.5 = 0.3333

Total data point on b2 with Gini index (n) = 4/12 * 0 = 0

Weighted Gini index for feature a1 = m + n = 0.3333

Equation- 8 Gini Impurity b1 & b2

After computing the weighted Gini value for every feature on a dataset taking the highest value feature as first node and split accordingly in a decision tree. Gini is less costly to compute.

5.2 Multilayer Perceptron Classifier (MLP Classifier)

Multilayer perceptron classifier is a feedforward neural network utilizes supervised learning technique (backpropagation) for training. MLP Classifier combines with multiple perceptron (hidden) layers. For feedforward taking input send combining with weight bias and then activation function from one hidden layer output goes to other hidden and this process continuing until reached the output. Then output calculates the error with error algorithm. These errors send back with backpropagation for weight adjustment by decreasing the total error and process is repeated, this process is call epoch. Number of epoch is determined with the hyper-parameter and reduction rate of total error.

5.2.1 Back-Propagation

Backpropagation is supervised learning algorithm that is used to train neural network. A neural network consists of input layer, hidden layer and output layer and each layer consists of neuron. So a neural network is a circuit of neurons. Backpropagation is a method to train multilayer neural network the updating of the weights of neural network and is done in such a way so that the error observed can be reduced here, error is only observed in the output layer and that error is back propagated to the previous layers and previous layer is proportionally updated weight. Backpropagation maintain chain rule to update weight. Mainly three steps on backpropagation are (Table-4):

Step Process
Step 1 Forward Pass
Step 2 Backward Pass
Step 3 Sum of all values and calculate updated weight value with Chain – rules.

Table-4 Back-Propagation process

5.2.2 Forward pass/ Forward propagation

Forward propagation is the process where input layer send the input value with randomly selected weight and bias to connected neuron and inside neuron selected activation function combine them and forward to other connected neuron layer after layer then give an output with the help of output layer. Below (Figure-4) display the forward propagation.

Foreward Pass

Figure-4 Forward passes

Input layer take the input of X (X1, X2) combine with randomly selected weight for each connection and with fixed bias (different hidden layer has different bias) send it to first hidden layer where first multiply the input with corresponding weight and added all input with single bias then selected activation function (may different form other layer) combine all input and give output according to function and this process is going on until reach in output layer. Output layer give the output like Y (Y1, Y2) (here output is binary classification as an example) according to selected activation function.

5.2.3 Backward Pass

After calculating error (difference between Forward pass output and actual output) backward pass try to minimize the error with optimisation function by sending backward with proportionally distribution and maintain a chain rule. Backward pass distribution the error in such a way where weighted value is taking under consideration. Below (Figure-5) diagram display the Backward pass process.

Backward Pass

Figure-5 Backward passes

Backpropagation push back the error which is calculated with error function or loss function for update proportional distribution with the help of optimisation algorithm. Division of Optimisation algorithm given below on Figure – 6

Optimisation Algorithms

Figure -6 Division of Optimisation algorithms

Gradient decent calculate gradient and update value by increases or decreases opposite direction of gradients unit and try to find the minimal value. Gradient decent update just one time for whole dataset but stochastic gradient decent update on each training sample and it is faster than normal gradient decent. Gradient decent can be improve by tuning parameter like learning rate (0 to 1 mostly use 0.5). Adagrad use time step based parameter to compute learning rate for every parameter. Adam is Adaptive Moment Estimation. It calculates different parameter with different learning rate. It is faster and performance rate is higher than other optimization algorithm. On the other way Adam algorithm is squares the calculated exponential weighted moving average of gradient.

5.2.4 Chain – rules

Backpropagation maintain chain-rules to update weighted value. On chain-rules backpropagation find the derivative of error respect to any weight. Suppose E is output error. w is weight for input a and bias b and ac neuron output respect of activation function and summation of bias with weighted input (w*a) input to neuron is net. So partial derivative for error respect to weight is ∂E / ∂w display the process on figure-7.

Figure- 7 Partial derivative for error respect to weight

On the chain rules for backward pass to find (error respect to weight) ∂E / ∂w = ∂E / ∂ac * ∂ac / ∂net * ∂net / ∂w. here find to error respect to weight are error respect to output of activation function multiply by activation function output respect to input in a neuron multiply by input in a neuron respect to weight.

5.2.5 Activation function

Activation function is a function which takes the decision about neuron to activate or deactivate. If the activate function activate the neuron then it will give an output on the basis of input. Input in a activation function is sum of input multiply with corresponding weight and adding the layered bias.  The main function of a activate function is non-linearity output of a neuron.

Activation Function

Figure-8 Activation function

Figure – 8 display a neuron in a hidden layer. Here several input (1, 2, 3) with corresponding weight (w1, w2, w3) putting in a neuron input layer where layer bias add with summation of multiplication with input and weight. Equation-9 display the output of an activate function.

Output from activate function y = Activate function (Ʃ (weight * input) + bias)

y = f (Ʃ (w*x) +b)

Equation- 9 Activate function

There are many activation functions like linear function for regression problem, sigmoid function for binary classification problem where result either 0 or 1, Tanh function which is based on sigmoid function but mathematically shifted version and values line -1 to 1. RELU function is Rectified linear unit. RELU is less expensive to compute.

5.2.6 Sigmoid

Sigmoid is a squashing activate function where output range between 0 and 1. Sigmoidal name comes from Greek letter sigma which looks like letter S when graphed. Sigmoid function is a logistic type function, it mainly use in output layer in neural network. Sigmoid is non-linear, fixed output range (between 0 and 1), monotonic (never decrees or never increases) and continuously differentiated function. Sigmoid function is good at classification and output from sigmoid is nonlinear. But Sigmoid has a vanishing gradient problem because output variable is very less to change in input variable. Figure- 9 displays the output of a Sigmoid and derivative of Sigmoid. Here x is any number (positive or negative). On sigmoid function 1 is divided by exponential negative input with adding 1.

Sigmoid

Figure – 9 Sigmoid Functions

4.5.2.7 RELU

RELU stands for Rectified Linear Units it is simple, less expensive in computation and rectifies the gradient vanishing problem. RELU is nonlinear activation function. It gives output either positive (infinity) or 0. RELU has a dying problem because if neurons stop for responding to variation because of gradient is 0 or nothing has to change. Figure- 10 displays the output of an RELU and derivative of RELU. Here x is any positive input and if x is grater then 0 give the output as x or give output 0. RELU function gives the output maximum value of input, here max (0, x).

Relu Activation Function

Figure – 10 RELU Function

4.5.2.8 Cost / loss function (Binary Cross-Entropy)

Cost or loss function compare the predictive value (model outcome) with actual value and give a quantitative value which give the indication about how much good or bad the prediction is.

Cost Function

Figure- 11 Cost function work process

Figure-11 x1 and x2 are input in a activate function f(x) and output y1_out which is sum of weighted input added with bias going through activate function. After model output activate function compare the output with actual output and give a quantitative value which indicate how good or bad the prediction is.

There are many type of loss function but choosing of optimal loss function depends on the problem going to be solved such as regression or classification. For binary classification problem binary cross entropy is used to calculate cost. Equation-10 displays the binary cross entropy where y is actual binary value and yp predictive outcome range 0 and 1. And i is scalar vale range between 1 to model output size (N).

Binary Crossentropy

Equation-10 displays the binary cross entropy

6 Phase 6: – Evaluation

6.1 Confusion matrix

In a classification confusion matrix describe the performance of actual value against predictive value. Confusion Matrix does the performance measurement. So confusion matrix classifies and display predicted and actual value (Visa, S., Ramsay 2011).

Confusion Matrix

Table- 5 Confusion Matrix

Confusion Matrix (Table-5) combines with True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). True Positive is prediction positive and true. True Negative is prediction negative and that is true. False positive is prediction positive and it’s false. False negative is prediction negative and that is false. False positive is known as Type1 error and false negative is known as Type 2 error. Confusion matrix can able to calculate several list of rates which are given below on Table- 6.

Here    N = Total number of observation, TP = True Positive, TN = True Negative

FP = False Positive, FN = False Negative, Total Actual No (AN) = TN + FP,

Total Predictive Yes (PY) = FP + TP. Total Actual Yes (AY) = FN + TP

Rate

 

Description Mathematical Description
Accuracy Classifier, overall how often correctly identified  (TP+TN) / N
Misclassification Rate Classifier, overall how often wrongly identified (FP + FN) / N
True Positive Rate

(Sensitivity / Recall)

Classifier, how often predict correctly yes when it is actually yes.  TP / AY
False Positive Rate Classifier, how often predict wrongly yes when it is actually no.  FP / AN
True Negative Rate

(Specificity)

Classifier, how often predict correctly no when it is actually no.  TN / AN
Precision Classifier how often predict yes when it is correct.  TP / PY
Prevalence Yes conditions how often occur in a sample. AY / N

Table – 6 Confusion matrixes Calculation

From confusion matrix F1 score can be calculated because F1 score related to precision and recall. Higher F1 score is better. If precision or recall any one goes down F1 score also go down.

F1 = \frac{2 * Precision * Recall}{Precision + Recall}

4.6.2 ROC (Receiver Operating Characteristic) curve

In statistics ROC is represent in a graph with plotting a curve which describe a binary classifiers performance as its differentiation threshold is varied. ROC (Equation-11) curve created true positive rate (TPR) against false positive rate (FPR). True positive rate also called as Sensitivity and False positive rate also known as Probability of false alarm. False positive rate also called as a probability of false alarm and it is calculated as 1 – Specificity.

True Positive Rate = \frac{True Positive}{True Positive + False Negative} = Recall or Sensivity

False Positive Rate = \frac{True Negative}{True Negative + False Positive} = 1 - Specificity

Equation- 11 ROC

So ROC (Receiver Operation Characteristic) curve allows visual representation between sensitivity and specificity associated with different values of the test result (Grzybowski, M. and Younger, J.G., 1997)

On ROC curve each point has different Threshold level. Below (Figure – 12) display the ROC curve. Higher the area curve covers is better that means high sensitivity and high specificity represent more accuracy. ROC curve also represent that if classifier predict more often true than it has more true positive and also more false positive. If classifier predict true less often then fewer false positive and also fewer true positive.

ROC Curve

ROC Curve

Figure – 12 ROC curve description

4.6.3 AUC (Area under Curve)

Area under curve (AUC) is the area surrounded by the ROC curve and AUC also represent the degree of separability that means how good the model to distinguished between classes. Higher the AUC value represents better the model performance to separate classes. AUC = 1 for perfect classifier, AUC = 0 represent worst classifier, and AUC = 0.5 means has no class separation capacity. Suppose AUC value is 0.6 that means 60% chance that model can classify positive and negative class.

Figure- 13 to Figure – 16 displays an example of AUC where green distribution curve for positive class and blue distribution curve for negative class. Here threshold or cut-off value is 0.5 and range between ‘0’ to ‘1’. True negative = TN, True Positive = TP, False Negative = FN, False Positive = FP, True positive rate = TPR (range 0 to 1), False positive rate = FPR (range 0 to 1).

On Figure – 13 left distribution curve where two class curves does not overlap that means both class are perfectly distinguished. So this is ideal position and AUC value is 1.  On the left side ROC also display that TPR for positive class is 100% occupied.

ROC distributions (perfectly distinguished

ROC distributions (perfectly distinguished

Figure – 14 two class overlap each other and raise false positive (Type 1), false negative (Type 2) errors. Here error could be minimize or maximize according to threshold. Suppose here AUC = 0.6, that means chance of a model to distinguish two classes is 60%. On ROC curve also display the curve occupied for positive class is 60%.

ROC distributions (class partly overlap distinguished)

ROC distributions (class partly overlap distinguished)

Figure- 15 displayed that positive and negative overlap each other. Here AUC value is 0.5 or near to 0.5. On this position classifier model does not able distinguish positive and negative classes. On left side ROC curve become straight that means TPR and FPR are equal.

ROC distributions (class fully overlap distinguished)

ROC distributions (class fully overlap distinguished)

Figure- 16 positive and negative class swap position and on this position AUC = 0. That means classified model predict positive as a negative and negative as a positive. On the left ROC curve display that curve on FPR side fully fitted.

ROC distributions (class swap position distinguished)

ROC distributions (class swap position distinguished)

7 Summaries

This paper describes a data mining process flow and related model and its algorithm with textual representation. One hot encoding create dummy variable for class features and min-max scaling scale the data in a single format. Balancing by SMOTE data where Euclidian distance calculates the distance in-between nearest neighbour to produce synthetic data under minority class. LDA reduce the distance inside class and maximise distance in-between class and for two class problem give a single dimension features which is less costly to calculate accuracy by base algorithm (random forest and MLP classifier).  Confusion matrix gives the accuracy, precision, sensitivity, specificity which is help to take a decision about base algorithm. AUC and ROC curve also represent true positive rate against false positive rate which indicate base algorithm performance.

Base algorithm Random forest using CART with GINI impurity for feature selection to spread the tree. Here CART is selected because of less costly to run. Random forest algorithm is using bootstrap dataset to grow trees, and aggregation using majority vote to select accuracy.

MLP classifier is a neural network algorithm using backpropagation chain-rule to reducing error. Here inside layers using RLU activation function. Output layers using Sigmoid activation function and binary cross entropy loss function calculate the loss which is back propagate with Adam optimizer to optimize weight and reduce loss.

References:

  1. Visa, S., Ramsay, B., Ralescu, A.L. and Van Der Knaap, E., 2011. Confusion Matrix-based Feature Selection. MAICS, 710, pp.120-127.
  2. Grzybowski, M. and Younger, J.G., 1997. Statistical methodology: III. Receiver operating characteristic (ROC) curves. Academic Emergency Medicine, 4(8), pp.818-826.

Bias and Variance in Machine Learning

Machine learning continues to be an ever more vital component of our lives and ecosystem, whether we’re applying the techniques to answer research or business problems or in some cases even predicting the future. Machine learning models need to give accurate predictions in order to create real value for a given industry or domain.

While training a model is one of the key steps in the Data Science Project Life Cycle, how the model generalizes on unseen data is an equally important aspect that should be considered in every Data Science Project Life Cycle. We need to know whether it works and, consequently, if we can trust its predictions. Could the model be merely memorizing the data it is fed with, and therefore unable to make good predictions on future samples, or samples that it hasn’t seen before?

Let’s know the importance of evaluation with a simple example, There are two student’s Ramesh and Suresh preparing for the CAT exam to get into top IIMs (Indian Institute of Management). They both are quite good friends and stayed in the room during preparation and put an equal amount of hard work while solving numerical problems.

They both prepared for almost the same number of hours for the entire year and appeared in the final CAT exam. Surprisingly, Ramesh cleared, but Suresh did not. When asked, we got to know that there was one difference in their strategy of preparation between them, Ramesh had joined a Test Series course where he used to test his knowledge and understanding by giving mock exams and then further evaluating on which portions he is lagging and making necessary adjustments to he is preparation cycle in order to do well in those areas. But Suresh was confident, and he just kept training himself without testing on the preparation he had done.

Like the above situation we can train a Machine Learning Algorithm extensively with many parameters and new techniques, but if you are skipping its evaluation step, you cannot trust your model to perform well on the unseen data. In this article, we explain the importance of Bias, Variance and the trade-off between them in order to know how well a machine learning model generalizes to new, previously unseen data.

Training of Supervised Machine Learning

Bias

Bias is the difference between the Predicted Value and the Expected Value or how far are the predicted values from the actual values. During the training process the model makes certain assumptions on the training data provided. After Training, when it is introduced to the testing/validation data or unseen data, these assumptions may not always be correct.

If we use a large number of nearest neighbors in the K-Nearest Neighbors Algorithm, the model can totally decide that some parameters are not important at all for the modelling.  For example, it can just consider that only two predictor variables are enough to classify the data point though we have more than 10 variables.

This type of model will make very strong assumptions about the other parameters not affecting the outcome at all. You can take it as a model predicting or understanding only the simple relationship when the data points clearly indicate a more complex relationship.

When the model has high bias error, it results in a very simplistic model that does not consider the complexity of the data very well leading to Underfitting.

Variance

Variance occurs when the model performs well on the trained dataset but does not do well on an unseen data set, it is when the model considers the fluctuations or i.e. the noise as in the data as well. The model will still consider the variance as something to learn from because it learns too much from the noise inside the trained data set that it fails to perform as expected on the unseen data.

Based on the above example from Bias, if the model learns that all the ten predictor variables are important to classify a given data point then it tends to have high variance. You can take it as the model is trying to understand every minute detail making it more complex and failing to perform well on the unseen data.

When a model has High Bias error, it underfits the data and makes very simplistic assumptions on it. When a model has High Variance error, it overfits the data and learns too much from it. When a model has balanced Bias and Variance errors, it performs well on the unseen data.

Bias-Variance Trade-off

Based on the definitions of bias and variance, there is clear trade-off between bias and variance when it comes to the performance of the model. A model will have a high error if it has very high bias and low variance and have a high error if it has high variance and low bias.

A model that strikes a balance between the bias and variance can minimize the error better than those that live on extreme ends.

We can find whether the model has High Bias using the below steps:

  1. We tend to get high training errors.
  2. The validation error or test error will be similar to the training error.

We can find whether the model has High Bias using the below steps:

  1. We tend to get low training error
  2. The validation error or test error will be very high.

We can fix the High Bias using below steps:

  1. We need to gather more input features or can even try to create few using the feature engineering techniques.
  2. We can even add few polynomial features in order to increase the complexity.
  3. If we are using any regularization terms in our model, we can try to minimize it.

We can fix the High Variance using below steps:

  1. We can gather more training data so that the model can learn more on the patterns rather than the noise.
  2. We can even try to reduce the input features or do feature selection.
  3.  If we are using any regularization terms in our model we can try to maximize it.

Conclusion

In this article, we got to know the importance of the evaluation step in the Data Science Project Life Cycle, definitions of Bias and Variance, the trade-off between them and the steps we can take to fix the Underfitting and Overfitting of a Machine Learning Model.

Rethinking linear algebra: visualizing linear transformations and eigen vectors

In terms of calculation processes of Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), which are the dimension reduction techniques I am going to explain in the following articles, diagonalization is what they are all about. Throughout this article, I would like you to have richer insight into diagonalization in order to prepare for understanding those basic dimension reduction techniques.

When our professor started a lecture on the last chapter of our textbook on linear algebra, he said “It is no exaggeration to say that everything we have studied is for this ‘diagonalization.'” Until then we had to write tons of numerical matrices and vectors all over our notebooks, calculating those products, adding their rows or columns to other rows or columns, sometimes transposing the matrices, calculating their determinants.

It was like the scene in “The Karate Kid,” where the protagonist finally understood the profound meaning behind the prolonged and boring “wax on, wax off” training given by Miyagi (or “jacket on, jacket off” training given by Jackie Chan). We had finally understood why we had been doing those seemingly endless calculations.

Source: http://thinkbedoleadership.com/secret-success-wax-wax-off/

But usually you can do those calculations easily with functions in the Numpy library. Unlike Japanese college freshmen, I bet you are too busy to reopen textbooks on linear algebra to refresh your mathematics. Thus I am going to provide less mathematical and more intuitive explanation of diagonalization in this article.

1, The mainstream ways of explaining diagonalization.

*The statements below are very rough for mathematical topics, but I am going to give priority to offering more visual understanding on linear algebra in this article. For further understanding, please refer to textbooks on linear algebra. If you would like to have minimum understandings on linear algebra needed for machine learning, I recommend the Appendix C of Pattern Recognition and Machine Learning by C. M. Bishop.

In most textbooks on linear algebra, the explanations on dioagonalization is like this (if you are not sure what diagonalization is or if you are allergic to mathematics, you do not have to read this seriously):

Let V (dimV = D)be a vector space and let  T_A : V \rightarrow V be a mapping of V into itself,  defined as T_A(v) = A \cdot \boldsymbol{v}, where A is a D\times D matrix and \boldsymbol{v} is D dimensional vector. An element \boldsymbol{v} \in V is called an eigen vector if there exists a number \lambda such that A \cdot \boldsymbol{v}= \lambda \cdot \boldsymbol{v} and \boldsymbol{v} \neq \boldsymbol{0}. In this case \lambda is uniquely determined and is called an eigen value of A belonging to the eigen vector \boldsymbol{v}.

Any matrix A has D eigen values \lambda_{i}, belonging to \boldsymbol{v}_{i} (i=1, 2, …., D). If \boldsymbol{v}_{i} is basis of the vector space V, then A is diagonalizable.

When A is diagonalizable, with D \times D matrices P = (\boldsymbol{v}_{1}, \dots, \boldsymbol{v}_{D}) , whose column vectors are eigen vectors \boldsymbol{v}_{i} (i=1, 2, …., D), the following equation holds: P^{-1}AP = \Lambda, where \Lambda = diag(\lambda_{1}, \dots, \lambda_{D})= \begin{pmatrix} \lambda_{1} & 0& \ldots &0\\ 0 & \lambda_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_{D} \end{pmatrix}.

And when A is diagonalizable, you can diagonalize A as below.

Most textbooks keep explaining these type of stuff, but I have to say they lack efforts to make it understandable to readers with low mathematical literacy like me. Especially if you have to apply the idea to data science field, I believe you need more visual understanding of diagonalization. Therefore instead of just explaining the definitions and theorems, I would like to take a different approach. But in order to understand them in more intuitive ways, we first have to rethink waht linear transformation T_A means in more visible ways.

2, Linear transformations

Even though I did my best to make this article understandable to as little prerequisite knowledge, you at least have to understand linear transformation of numerical vectors and with matrices. Linear transformation is nothing difficult, and in this article I am going to use only 2 or 3 dimensional numerical vectors or square matrices. You can calculate linear transformation of \boldsymbol{v} by A as equations in the figure. In other words, \boldsymbol{u} is a vector transformed by A.

*I am not going to use the term “linear transformation” in a precise way in the context of linear algebra. In this article or in the context of data science or machine learning, “linear transformation” for the most part means products of matrices or vectors. 

*Forward/back propagation of deep learning is mainly composed of this linear transformation. You keep linearly transforming input vectors, frequently transforming them with activation functions, which are for the most part not linear transformation.

As you can see in the equations above, linear transformation with A transforms a vector to another vector. Assume that you have an original vector \boldsymbol{v} in grey and that the vector \boldsymbol{u} in pink is the transformed \boldsymbol{v} by A is. If you subtract \boldsymbol{v} from \boldsymbol{u}, you can get a displacement vector, which I displayed in purple. A displacement vector means the transition from a vector to another vector.

Let’s calculate the displacement vector with more vectors \boldsymbol{v}. Assume that A =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, and I prepared several grid vectors \boldsymbol{v} in grey as you can see in the figure below. If you transform those grey grid points with A, they are mapped into the vectors \boldsymbol{u} in pink. With those vectors in grey or pink, you can calculate the their displacement vectors \boldsymbol{u} = \boldsymbol{v} in purple.

I think you noticed that the displacement vectors in the figure above have some tendencies. In order to see that more clearly, let’s calculate displacement vectors with several matrices A and more grid points. Assume that you have three 2 \times 2 square matrices A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, A_2 =\begin{pmatrix} 3 & 1 \\ -1 & 1 \end{pmatrix}, A_3 =\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}, and I plotted displace vectors made by the matrices respectively in the figure below.

I think you noticed some characteristics of the displacement vectors made by those linear transformations: the vectors are swirling and many of them seem to be oriented in certain directions. To be exact, some displacement vectors have extend in the same directions as some of original vectors in grey. That means  linear transformation by A did not change the direction of the original vector \boldsymbol{v}, and the unchanged vectors are called eigen vectors. Real eigen vectors of each A are displayed as arrows in yellow in the figure above. But when it comes to A_3, the matrix does not have any real eigan values.

In linear algebra, depending on the type matrices A, you have consider various cases such as whether the matrices have real or imaginary eigen values, whether the matrices are diagonalizable, whether the eigen vectors are orthogonal, or whether they are unit vectors. But those topics are out of the scope of this article series, so please refer to textbooks on linear algebra if you are interested.

Luckily, however, in terms of PCA or LDA, you only have to consider a type of matrices named positive semidefinite matrices, which A_1 is classified to, and I am going to explain positive semidefinite matrices in the fourth section.

3, Eigen vectors as coordinate system

Source: Ian Stewart, “Professor Stewart’s Cabinet of Mathematical Curiosities,” (2008), Basic Books

Let me take Fibonacci numbers as an example to briefly see why diagonalization is useful. Fibonacci is sequence is quite simple and it is often explained using an example of pairs of rabbits increasing generation by generation. Let a_n (n=0, 1, 2, …) be the number of pairs of grown up rabbits in the n^{th} generation. One pair of grown up rabbits produce one pair of young rabbit The concrete values of a_n are a_0 = 0, a_1 = 1, a_2=1, a_3=2, a_4=3, a_5=5, a_6=8, a_7=13, \dots. Assume that A =\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} and that \begin{pmatrix} a_1 \\ a_0  \end{pmatrix} =\begin{pmatrix} 1 \\ 0  \end{pmatrix}, then you can calculate the number of the pairs of grown up rabbits in the next generation with the following recurrence relation. \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}=\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}.Let \boldsymbol{a}_n be \begin{pmatrix} a_{n+1} \\ a_{n}  \end{pmatrix}, then the recurrence relation can be written as \boldsymbol{a}_{n+1} = A \boldsymbol{a}_n, and the transition of \boldsymbol{a}_n are like purple arrows in the figure below. It seems that the changes of the purple arrows are irregular if you look at the plots in normal coordinate.

Assume that \lambda _1, \lambda_2 (\lambda _1< \lambda_2) are eigen values of A, and \boldsymbol{v}_1, \boldsymbol{v}_2 are eigen vectors belonging to them respectively. Also let \alpha, \beta scalars such that \begin{pmatrix} a_{1} \\ a_{0}  \end{pmatrix} = \begin{pmatrix} 1 \\ 0  \end{pmatrix} = \alpha \boldsymbol{v}_1 + \beta \boldsymbol{v}_2. According to the definition of eigen values and eigen vectors belonging to them, the following two equations hold: A\boldsymbol{v}_1 = \lambda_1 \boldsymbol{v}_1, A\boldsymbol{v}_2 = \lambda_2 \boldsymbol{v}_2. If you calculate \boldsymbol{a}_1 is, using eigen vectors of A, \boldsymbol{a}_1  = A\boldsymbol{a}_0 = A (\alpha \boldsymbol{v}_1 + \beta \boldsymbol{v}_2) = \alpha\lambda _1 \boldsymbol{v}_1 + \beta \lambda_2 \boldsymbol{v}_2. In the same way, \boldsymbol{a}_2 = A\boldsymbol{a}_1 = A (\alpha\lambda _1 \boldsymbol{v}_1 + \beta \lambda_2 \boldsymbol{v}_2) = \alpha\lambda _{1}^{2} \boldsymbol{v}_1 + \beta \lambda_{2}^{2} \boldsymbol{v}_2, and \boldsymbol{a}_3 = A\boldsymbol{a}_2 = A (\alpha\lambda _{1}^{2} \boldsymbol{v}_1 + \beta \lambda_{2}^{2} \boldsymbol{v}_2) = \alpha\lambda _{1}^{3} \boldsymbol{v}_1 + \beta \lambda_{2}^{3} \boldsymbol{v}_2. These equations show that in coordinate system made by eigen vectors of A, linear transformation by A is easily done by just multiplying eigen values with each eigen vector. Compared to the graph of Fibonacci numbers above, in the figure below you can see that in coordinate system made by eigen vectors the plots changes more systematically generation by generation.

 

In coordinate system made by eigen vectors of square matrices, the linear transformations by the matrices can be much more straightforward, and this is one powerful strength of eigen vectors.

*I do not major in mathematics, so I am not 100% sure, but vectors in linear algebra have more abstract meanings and various things in mathematics can be vectors, even though in machine learning or data science we  mainly use numerical vectors with more concrete elements. We can also say that matrices are a kind of maps. That is just like, at leas in my impression, even though a real town is composed of various components such as houses, smooth or bumpy roads, you can simplify its structure with simple orthogonal lines, like the map of Manhattan. But if you know what the town actually looks like, you do not have to follow the zigzag path on the map.

4, Eigen vectors of positive semidefinite matrices

In the second section of this article I told you that, even though you have to consider various elements when you discuss general diagonalization, in terms of PCA and LDA we mainly use only a type of matrices named positive semidefinite matrices. Let A be a D \times D square matrix. If \boldsymbol{x}^T A \boldsymbol{x} \geq 0 for all values of the vector \boldsymbol{x}, the A is said to be a positive semidefinite matrix. And also it is known that A being a semidefinite matrix is equivalent to \lambda _{i} \geq 0 for all the eigen values \lambda_i (i=1, \dots , D).

*I think most people first learn a type of matrices called positive definite matrices. Let A be aD \times D square matrix. If \boldsymbol{x}^T A \boldsymbol{x} > 0 for all values of the vector \boldsymbol{x}, the A is said to be a positive definite matrix. You have to keep it in mind that even if all the elements of A are positive, A is not necessarly positive definite/semidefinite.

Just as we did in the second section of this article, let’s visualize displacement vectors made by linear transformation with a 3 \times 3 square positive semidefinite matrix A.

*In fact A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, whose linear transformation I visualized the second section, is also positive semidefinite.

Let’s visualize linear transformations by a positive definite matrix A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix}. I visualized the displacement vectors made by the A just as the same way as in the second section of this article. The result is as below, and you can see that, as well as the displacement vectors made by A_1, the three dimensional displacement vectors below are swirling and extending in three directions, in the directions of the three orthogonal eigen vectors \boldsymbol{v}_1, \boldsymbol{v}_2, and \boldsymbol{v}_3.

*It might seem like a weird choice of a matrix, but you are going to see why in the next article.

You might have already noticed A_1 =\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix} and A = \frac{1}{50} \begin{pmatrix} 60.45 &  33.63 & 46.29 \\33.63 & 68.49 & 50.93 \\ 46.29 & 50.93 & 53.61 \end{pmatrix} are both symmetric matrices and that their elements are all real values, and that their diagonal elements are all positive values. Super importantly, when all the elements of a D \times D symmetric matrix A are real values and its eigen values are \lambda_{i} (i=1, \dots , D), there exist orthonormal matrices U such that U^{-1}AU = \Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}).

*The title of this section might be misleading, but please keep it in mind that positive definite/semidefinite matrices are not necessarily real symmetric matrices. And real symmetric vectors are not necessarily positive definite/semidefinite matrices.

5, Orthonormal matrices and rotation of vectors

In this section I am gong to explain orthonormal matrices, as known as rotation matrices. If a D\times D matrix U is an orthonormal matrix, column vectors of U are orthonormal, which means U = (\boldsymbol{u}_1 \dots \boldsymbol{u}_D), where \begin{cases} \boldsymbol{u}_{i}^{T}\boldsymbol{u}_{j} = 1 \quad (i = j) \\ \boldsymbol{u}_{i}^{T}\boldsymbol{u}_{j} = 0 \quad (i\neq j) \end{cases}. In other words column vectors \boldsymbol{u}_{i} form an orthonormal coordinate system.

Orthonormal matrices U have several important matrices, and one of them is U^{-1} = U^{T}. Combining this fact with what I have told you so far, you we can reach one conclusion that you can orthogonalize a real symmetric matrix A as U^{T}AU = \Lambda. This is known as spectral decomposition or singular value decomposition.

Another important property of U is that U^{T} is also orthonormal. In other words, assume U is orthonormal and that U = (\boldsymbol{u}_1 \dots \boldsymbol{u}_D) = \begin{pmatrix} -\boldsymbol{v_1}^{T}- \\ \vdots \\ -\boldsymbol{v_D}^{T}- \end{pmatrix}, (\boldsymbol{v}_1 \dots \boldsymbol{v}_D) also forms a orthonormal coordinate system.

…It seems things are getting too mathematical and abstract (for me), thus for now I am going to wrap up what I have explained in this article .

We have seen

  • Numerical matrices linearly transform vectors.
  • Certain linear transformations do not change the direction of vectors in certain directions, which are called eigen vectors.
  • Making use of eigen vectors, you can form new coordinate system which can describe the linear transformations in a more straightforward way.
  • You can diagonalize a real symmetric matrix A with an orthonormal matrix U.

Of our current interest is what kind of linear transformation the real symmetric positive definite matrix enables. I am going to explain why the purple vectors in the figure above is swirling like that in the upcoming articles. Before that, however, we are going to  see one application of what we have seen in this article, on dimension reduction. To be concrete the next article is going to be about principal component analysis (PCA), which is very important in many fields.

*In short, the orthonormal matrix U I mentioned above enables rotation of matrix, and the diagonal matrix diag(\lambda_1, \dots, \lambda_D) expands or contracts vectors along each axis. I am going to explain that more precisely in the upcoming articles.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

*I attatched the codes I used to make the figures in this article. You can just copy, paste, and run, sometimes installing necessary libraries.

import matplotlib.pyplot as plt 
import numpy as np
import matplotlib.patches as mpatches

T_A = np.array([[1, 1], 
                [1, 0]])

total_step = 5
x = np.zeros((total_step, 2))

x[0]  = np.array([1, 0])

for i in range(total_step - 1):
    x[i + 1] = np.dot(T_A, x[i])

eigen_values, eigen_vectors = np.linalg.eig(T_A)
idx = eigen_values.argsort()[::-1]   
eigen_values = eigen_values[idx]
eigen_vectors = eigen_vectors[:,idx]
for i in range(len(eigen_vectors)):
    if(eigen_vectors.T[i][0] < 0):
        eigen_vectors.T[i] = - eigen_vectors.T[i]   

v_initial = x[0]
v_coefficients = np.zeros((total_step , 2))
v_coefficients[0] = np.dot(v_initial ,  np.linalg.inv(eigen_vectors.T)) 

for i in range(total_step-1):
    v_coefficients[i + 1] = v_coefficients[i] * eigen_values 

for i in range(total_step):
    v_1_list[i+1] = v_coefficients.T[0][i]*eigen_vectors.T[0]
    v_2_list[i+1] = v_coefficients.T[1][i]*eigen_vectors.T[1]

plt.figure(figsize=(20, 15))
fontsize = 20
small_shift = 0.2

plt.plot(x[:, 0], x[:, 1], marker='o', linestyle='none', markersize=10, color='black')

plt.arrow(0, 0, eigen_vectors.T[0][0], eigen_vectors.T[0][1], width=0.05, head_width=0.2, color='orange')
plt.arrow(0, 0, eigen_vectors.T[1][0], eigen_vectors.T[1][1], width=0.05, head_width=0.2, color='orange')

plt.text(eigen_vectors.T[0][0], eigen_vectors.T[0][1]+small_shift, r'v_{1}', va='center',ha='right', fontsize=fontsize + 10)
plt.text(eigen_vectors.T[1][0] - small_shift, eigen_vectors.T[1][1],r'v_{2}', va='center',ha='right', fontsize=fontsize + 10)

for i in range(total_step): 
    
    plt.arrow(0, 0, v_1_list[i+1][0], v_1_list[i+1][1], head_width=0.05, color='darkviolet', length_includes_head=True)
    plt.arrow(0, 0, v_2_list[i+1][0], v_2_list[i+1][1], head_width=0.05, color='darkviolet', length_includes_head=True)
    
    plt.text(v_1_list[i+1][0] + 2*small_shift , v_1_list[i+1][1]-2*small_shift,r'\alpha \cdot \lambda_{0} ^{1} \cdot v_{2}'.format(1,i+1, 1),va='center',ha='right', fontsize=fontsize)
    plt.text(v_2_list[i+1][0]-0.1, v_2_list[i+1][1],r'\beta \cdot \lambda_{0} ^{1} \cdot v_{2}'.format(2, i+1, 2),va='center',ha='right', fontsize=fontsize)

    plt.arrow(v_1_list[i+1][0], v_1_list[i+1][1], v_2_list[i+1][0], v_2_list[i+1][1], head_width=0, color='black', linestyle='--', length_includes_head=True)
    plt.arrow(v_2_list[i+1][0], v_2_list[i+1][1], v_1_list[i+1][0], v_1_list[i+1][1], head_width=0, color='black', linestyle='--', length_includes_head=True)
    
orange_patch = mpatches.Patch(color='orange', label='Eigen vectors')
purple_patch = mpatches.Patch(color='darkviolet', label='Scalar multiples of the eigen vectors')
plt.legend(handles=[orange_patch, purple_patch], fontsize=25, loc='lower right')

for i in range(total_step):
    plt.text(x[i][0]+0.1, x[i][1]-0.05, r'n={0}'.format(i), fontsize=20)

plt.grid(True)
plt.ylabel("a_{n}: n^{th} generation", fontsize=20)
plt.xlabel("a_{n+1}: n+1 ^{th} geneartion", fontsize=20)
plt.title("Fibonacci sequence and its eigen space", fontsize=30)
#plt.savefig("Fibonacci_eigen_space.png")
plt.show()
import matplotlib.pyplot as plt 
import numpy as np 
import matplotlib.patches as mpatches

const_range = 10

X = np.arange(-const_range, const_range + 1, 1)
Y = np.arange(-const_range, const_range + 1, 1)
U_x, U_y = np.meshgrid(X, Y)

T_A_0 = np.array([[3, 1], 
                [1, 2]])

T_A_1 = np.array([[3, 1],
                [-1, 1]])

T_A_2 = np.array([[1, -1], 
                [1, 1]])

T_A_list = np.array((T_A_0, T_A_1, T_A_2))

const_range = 5
plt.figure(figsize=(30, 10))
plt.subplots_adjust(wspace=0.1)
labels = ["Grids", "Displacement vectors made by A", "Real eigen vectors of A"]  
title_list = [r"A_1 has two different real eigen vectors.", r"A_2 has two identical real unit eigen vectors.",  r"A_3 has only imaginary eigen vectors."]
for idx in range(len(T_A_list)): 
    
    eigen_values, eigen_vectors = np.linalg.eig(T_A_list[idx])
    sorted_idx = eigen_values.argsort()[::-1]   
    eigen_values = eigen_values[sorted_idx]
    eigen_vectors = eigen_vectors[:,sorted_idx]
    eigen_vectors = eigen_vectors.astype(float)
        
    for i in range(len(eigen_vectors)):
        if(eigen_vectors.T[i][0] < 0):
            eigen_vectors.T[i] = - eigen_vectors.T[i]
        

    X = np.arange(-const_range, const_range + 1, 1)
    Y = np.arange(-const_range, const_range + 1, 1)
    U_x, U_y = np.meshgrid(X, Y)

    V_x = np.zeros((len(U_x), len(U_y)))
    V_y = np.zeros((len(U_x), len(U_y)))

    temp_vec = np.zeros((1, 2))

    W_x = np.zeros((len(U_x), len(U_y)))
    W_y = np.zeros((len(U_x), len(U_y)))

    plt.subplot(1, 3, idx + 1)


    for i in range(len(U_x)):
        for j in range(len(U_y)):
            temp_vec[0][0] = U_x[i][j]
            temp_vec[0][1] = U_y[i][j]
        
            temp_vec[0] = np.dot(T_A_list[idx], temp_vec[0])
        
            V_x[i][j] = temp_vec[0][0]
            V_y[i][j] = temp_vec[0][1]
        
            W_x[i][j] = V_x[i][j] - U_x[i][j]
            W_y[i][j] = V_y[i][j] - U_y[i][j]
            #plt.arrow(0, 0, V_x[i][j], V_y[i][j], head_width=0.1, color='red')
            plt.arrow(0, 0, U_x[i][j], U_y[i][j], head_width=0.3, color='dimgrey', label=labels[0])
            plt.arrow(U_x[i][j], U_y[i][j], W_x[i][j], W_y[i][j], head_width=0.3, color='darkviolet', label=labels[1])
            
            range_const = 20
            plt.xlim([-range_const, range_const])
            plt.ylim([-range_const, range_const])
            plt.title(title_list[idx], fontsize=25)
            
            if idx==2:
                continue
 
            plt.arrow(0, 0, eigen_vectors.T[0][0]*10, eigen_vectors.T[0][1]*10, head_width=1, color='orange', label=labels[2])
            plt.arrow(0, 0, eigen_vectors.T[1][0]*10, eigen_vectors.T[1][1]*10, head_width=1, color='orange', label=labels[2])

grey_patch = mpatches.Patch(color='grey', label='Grids')
purple_patch = mpatches.Patch(color='darkviolet', label='Displacement vectors made by A')
yellow_patch = mpatches.Patch(color='gold', label='Real eigen vectors of A')
plt.legend(handles=[grey_patch, purple_patch, yellow_patch], fontsize=25, loc='lower right', bbox_to_anchor=(-0.1, -.35))
#plt.savefig("linear_transformation.png")
plt.show()
import numpy as np 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.proj3d import proj_transform
from mpl_toolkits.mplot3d.axes3d import Axes3D
from matplotlib.text import Annotation
from matplotlib.patches import FancyArrowPatch
import matplotlib.patches as mpatches

class Annotation3D(Annotation):
    def __init__(self, text, xyz, *args, **kwargs):
        super().__init__(text, xy=(0,0), *args, **kwargs)
        self._xyz = xyz

    def draw(self, renderer):
        x2, y2, z2 = proj_transform(*self._xyz, renderer.M)
        self.xy=(x2,y2)
        super().draw(renderer)

def _annotate3D(ax,text, xyz, *args, **kwargs):
    '''Add anotation `text` to an `Axes3d` instance.'''

    annotation= Annotation3D(text, xyz, *args, **kwargs)
    ax.add_artist(annotation)

setattr(Axes3D,'annotate3D',_annotate3D)

class Arrow3D(FancyArrowPatch):
    def __init__(self, x, y, z, dx, dy, dz, *args, **kwargs):
        super().__init__((0,0), (0,0), *args, **kwargs)
        self._xyz = (x,y,z)
        self._dxdydz = (dx,dy,dz)

    def draw(self, renderer):
        x1,y1,z1 = self._xyz
        dx,dy,dz = self._dxdydz
        x2,y2,z2 = (x1+dx,y1+dy,z1+dz)

        xs, ys, zs = proj_transform((x1,x2),(y1,y2),(z1,z2), renderer.M)
        self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
        super().draw(renderer)

def _arrow3D(ax, x, y, z, dx, dy, dz, *args, **kwargs):
    '''Add an 3d arrow to an `Axes3D` instance.'''

    arrow = Arrow3D(x, y, z, dx, dy, dz, *args, **kwargs)
    ax.add_artist(arrow)

setattr(Axes3D,'arrow3D',_arrow3D)

T_A = np.array([[60.45, 33.63, 46.29], 
                [33.63, 68.49, 50.93], 
                [46.29, 50.93, 53.61]])

T_A = T_A/50
const_range = 2


X = np.arange(-const_range, const_range + 1, 1)
Y = np.arange(-const_range, const_range + 1, 1)
Z = np.arange(-const_range, const_range + 1, 1)

U_x, U_y, U_z = np.meshgrid(X, Y, Z)

V_x = np.zeros((len(U_x), len(U_y), len(U_z)))
V_y = np.zeros((len(U_x), len(U_y), len(U_z)))
V_z = np.zeros((len(U_x), len(U_y), len(U_z)))

temp_vec = np.zeros((1, 3))

W_x = np.zeros((len(U_x), len(U_y), len(U_z)))
W_y = np.zeros((len(U_x), len(U_y), len(U_z)))
W_z = np.zeros((len(U_x), len(U_y), len(U_z)))

eigen_values, eigen_vectors = np.linalg.eig(T_A)
sorted_idx = eigen_values.argsort()[::-1]   
eigen_values = eigen_values[sorted_idx]
eigen_vectors = eigen_vectors[:,sorted_idx]
eigen_vectors = eigen_vectors.astype(float)

fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111, projection='3d')
grid_range = const_range + 5
ax.set_xlim(-grid_range, grid_range)
ax.set_ylim(-grid_range, grid_range)
ax.set_zlim(-grid_range, grid_range)

eigen_values, eigen_vectors = np.linalg.eig(T_A)
sorted_idx = eigen_values.argsort()[::-1]   
eigen_values = eigen_values[sorted_idx]
eigen_vectors = eigen_vectors[:,sorted_idx]
eigen_vectors = eigen_vectors.astype(float)
     
    
for i in range(len(eigen_vectors)):
    if(eigen_vectors.T[i][0] < 0):
        eigen_vectors.T[i] = - eigen_vectors.T[i]

for i in range(len(U_x)):
    for j in range(len(U_x)):
        for k in range(len(U_x)):
            temp_vec[0][0] = U_x[i][j][k]
            temp_vec[0][1] = U_y[i][j][k]
            temp_vec[0][2] = U_z[i][j][k]
        
            temp_vec[0] = np.dot(T_A, temp_vec[0])
        
            V_x[i][j][k] = temp_vec[0][0]
            V_y[i][j][k] = temp_vec[0][1]
            V_z[i][j][k] = temp_vec[0][2]
        
            W_x[i][j][k] = V_x[i][j][k] - U_x[i][j][k]
            W_y[i][j][k] = V_y[i][j][k] - U_y[i][j][k]
            W_z[i][j][k] = V_z[i][j][k] - U_z[i][j][k]
            ax.arrow3D(0, 0, 0, 
                       U_x[i][j][k], U_y[i][j][k], U_z[i][j][k], 
                       mutation_scale=10, arrowstyle="-|>", fc='dimgrey', ec='dimgrey')
            #ax.arrow3D(0, 0, 0, 
            #          V_x[i][j][k], V_y[i][j][k], V_z[i][j][k], 
            #           mutation_scale=10, arrowstyle="-|>", fc='red', ec='red')
            ax.arrow3D(U_x[i][j][k], U_y[i][j][k], U_z[i][j][k], 
                       W_x[i][j][k], W_y[i][j][k], W_z[i][j][k],
                       mutation_scale=10, arrowstyle="-|>", fc='darkviolet', ec='darkviolet')
            
ax.arrow3D(0, 0, 0, eigen_vectors.T[0][0]*10, eigen_vectors.T[0][1]*10, eigen_vectors.T[0][2]*10,
                       mutation_scale=10,  arrowstyle="-|>", fc='orange', ec='orange')
ax.arrow3D(0, 0, 0, eigen_vectors.T[1][0]*10, eigen_vectors.T[1][1]*10, eigen_vectors.T[1][2]*10,
                       mutation_scale=10, arrowstyle="-|>", fc='orange', ec='orange')
ax.arrow3D(0, 0, 0, eigen_vectors.T[2][0]*10, eigen_vectors.T[2][1]*10, eigen_vectors.T[2][2]*10,
                       mutation_scale=10, arrowstyle="-|>", fc='orange', ec='orange')

ax.text(eigen_vectors.T[0][0]*8 , eigen_vectors.T[0][1]*8, eigen_vectors.T[0][2]*8+1, r'v_1', fontsize=20)
ax.text(eigen_vectors.T[1][0]*8 , eigen_vectors.T[1][1]*8, eigen_vectors.T[1][2]*8, r'v_2', fontsize=20)
ax.text(eigen_vectors.T[2][0]*8 , eigen_vectors.T[2][1]*8, eigen_vectors.T[2][2]*8, r'v_3', fontsize=20)


grey_patch = mpatches.Patch(color='grey', label='Grids')
orange_patch = mpatches.Patch(color='orange', label='Orthogonal eigen vectors of A')
purple_patch = mpatches.Patch(color='darkviolet', label='Displacement vectors made by A')
plt.legend(handles=[grey_patch, orange_patch, purple_patch], fontsize=20, loc='lower right')

ax.set_xlabel(r'x_1', fontsize=25)
ax.set_ylabel(r'x_2', fontsize=25)
ax.set_zlabel(r'x_3', fontsize=25)
#plt.savefig("symmetric_positive_definite_visualizaiton.png")
plt.show()

 

K Nearest Neighbour For Supervised Learning

K-Nearest Neighbour (KNN) Algorithms is an easy-to-implement & advanced level supervised machine learning algorithm used for both – classification as well as regression problems. However, you can see a wide of its applications in classification problems across various industries.

If you’ve been shopping a lot in e-commerce sites like Amazon, Flipkart, Myntra, or love watching web series over Netflix and Amazon Prime, one common thing you’ve always noticed, and that is recommendations.

Are you wondering how they recommend you following your choice? They use KNN Supervised Learning to find out what you may need the next when you’re buying and recommend you with a few more products.

Imagine you’re looking for an iPhone to purchase. When you scroll down a little, you see some iPhone cases, tempered glasses – saying, “People who purchased an iPhone have also purchased these items. The same applies to Netflix and Amazon Prime. When you finished a show or a series, they give you recommendations of the same genre. And do it all using KNN supervised learning and classify the items for the best user experience.

Advantages Of KNN

  • Quickest Calculation Time
  • Simple Algorithms
  • High Accuracy
  • Versatile – best use for Regression and Classification.
  • Doesn’t make any assumptions about data.

Where KNN Are Mostly Used

  • Simple Recommendation Models
  • Image Recognition Technology
  • Decision-Making Models
  • Calculating Credit Rating

Choosing The Right Value For K

 To choose the right value of K, you have to run KNN algorithms several times with different values of K and select the value of K, which reduces the number of errors you’ve come across and come out as the most stable value for K.

Your Step-By-Step Guide For Choosing The Value Of K

  • As you decrease the value of K to 1 (K = 1), you’ll reach a query point, where you get to see many elements from class A (-) and class B (+) where (-) is the only nearest neighbor. Reasonably, you would think about the query point to be most likely the red one. As K =1, which has a blue color, KNN incorrectly predicts the wrong color blue.
  • As you increase the value of K to 2 (K=2), you get to see two elements, (-) and (+) are the only nearest neighbor. As you have two values, which are of Class A and Class B, KNN incorrectly predicts the wrong values (Blue and Red).
  • As you increase the value of K to 3 (K=3), you get to see three elements (-) and (+), (+) are the only nearest neighbor. And this time, you got three values, one from blue and two from red. As your assumption is red, KNN correctly predicts the right value (Blue and Red, Red). Your answer is more stable this time compared to previous ones.

Conclusion

KNN works by finding the nearest distance between a query and all the elements in the database. By choosing the value for K, we get the closest to the query. And then, KNN algorithms look for the most frequent labels in classification and averages of labels in regression.

Spiky cubes, Pac-Man walking, empty M&M’s chocolate: curse of dimensionality

This is the first article of the article series Illustrative introductions on dimension reduction.

“Curse of dimensionality” means the difficulties of machine learning which arise when the dimension of data is higher. In short if the data have too many features like “weight,” “height,” “width,” “strength,” “temperature”…., that can undermine the performances of machine learning. The fact might be contrary to your image which you get from the terms “big” data or “deep” learning. You might assume that the more hints you have, the better the performances of machine learning are. There are some reasons for curse of dimensionality, and in this article I am going to introduce two major reasons below.

  1. High dimensional data usually have rich expressiveness, but usually training data are too poor for that.
  2. The behaviors of data points in high dimensional space are totally different from our common sense.

Through these topics, you will see that you always have to think about which features to use considering the number of data points.

*From now on I am going to talk about only Euclidean distance. If you are not sure what Euclidean distance means, please just keep it in mind that it is the type of distance most people wold have learnt in normal compulsory education.

1. Number of samples and degree of dimension

The most straightforward demerit of adding many features, or increasing dimensions of data, is the growth of computational costs. More importantly, however, you always have to think about the degree of dimensions in relation of the number of data points you have. Let me take a simple example in a book “Pattern Recognition and Machine Learning” by C. M. Bishop (PRML). This is an example of measurements of a pipeline. The figure below shows a comparison plot of 3 classes (red, green and blue), with parameter x_7 plotted against parameter x_6 out of 12 parameters.

* The meaning of data is not important in this article. If you are interested please refer to the appendix in PRML.

Assume that we are interested in classifying the cross in black into one of the three classes. One of the most naive ideas of this classification is dividing the graph into grids and labeling each grid depending on the number of samples in the classes (which are colored at the right side of the figure). And you can classify the test sample, the cross in black, into the class of the grid where the test sample is in. Thereby the cross is classified to the class in red.

Source: C.M. Bishop, “Pattern Recognition and Machine Learning,” (2006), Springer, pp. 34-35

As I mentioned in the figure above, we used only two features out of 12 features in total. When the total number of data points is fixed and you add remaining ten axes/features one after another, what would happen? Let’s see what “adding axes/features” means. If you are talking about 1, 2, or 3 dimensional grids, you can visualize them. And as you can see from the figure below, if you make each 10^1, 10^2, 100^3 grids respectively in 1, 2, 3 dimensional spaces, the number of the small regions in the grids are respectively 10, 100, 1000. Even though you cannot visualize it anymore, you can make grids for more than 3 dimensional data. If you continue increasing the degree of dimension, the number of grids increases exponentially, and that can soon surpass the number of training data points. That means there would be a lot of empty spaces in such high dimensional grids. And the classifying method above: coloring each grid and classifying unknown samples depending on the colors of the grids, does not work out anymore because there would be a lot of empty grids.

* If you are still puzzled by the idea of “more than 3 dimensional grids,” you should not think too much about that now. It is enough if you can get some understandings on high dimensional data after reading the whole article of this.

Source: Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning, (2016), MIT Press, p. 153

I said the method above is the most naive way, but other classical classification methods , for example k-nearest neighbors algorithm, are more or less base on a similar idea. Many of classical machine learning algorithms are based on the idea of smoothness prior, or local constancy prior. In short in classical ways, you  do not expect data to change so much in a small region, so you can expect unknown samples to be similar to data in vicinity. But that soon turns out to be problematic when the dimension of data is bigger because training data would be sparse because the area of multidimensional space grows exponentially as I mentioned above. And sometimes you would not be able to find training data around test data. Plus, in high dimensional data, you cannot treat distance in the same as you do in lower dimensional space. The ideas of “close,” “nearby,” or “vicinity” get more obscure in high dimensional data. That point is related to the next topic: the intuition have cultivated in normal life is not applicable to higher dimensional data.

2. Bizarre characteristics of high dimensional data

We form our sense of recognition in 3-dimensional ways in our normal life. Even though we can visualize only 1, 2, or 3 dimensional data, we can actually generalize the ideas in 1, 2, or 3 dimensional ideas to higher dimensions. For example 4 dimensional cubes, 100 dimensional spheres, or orthogonality in 255 dimensional space. Again, you cannot exactly visualize those ideas, and for many people, such high dimensional phenomenon are just imaginary matters on blackboards. Those high dimensional ideas are designed to retain some conditions just as well as 1, 2, or 3 dimensional space. Let’s take an example of spheres in several dimensional spaces. General spheres in any D-dimensional space can be defined as a set of any \boldsymbol{x}, such that |\boldsymbol{x} - \boldsymbol{c}| = r, where \boldsymbol{c} is the center point and r is length of radius. When \boldsymbol{x} is 2-dimensional, the spheres are called “circles.” When \boldsymbol{x} is 3-dimensional, the spheres are called “spheres” in our normal life, unless it is used in a conversation in a college cafeteria, by some students in mathematics department. And when \boldsymbol{x} is D-dimensional, they are called D-ball, and again, this is just a imaginary phenomenon on blackboard.

* Vectors and points are almost the same because all the vectors are denoted as “arrows” from the an origin point to sample data points.  The only difference is that when you use vectors, you have to consider their directions.

* “D-ball” is usually called “n-ball,” and in such context it is a sphere in a n-dimensional space. But please let me use the term “D-ball” in this article.

Not only spheres, but only many other ideas have been generalized to D-dimensional space, and many of them are indispensable also for data science. But there is one severe problem: the behaviors of data in high dimensional field is quite different from those in two or three dimensional space. To be concrete, in high dimensional field, cubes are spiky, you have to move like Pac-Man, and M & M’s Chocolate looks empty inside but tastes normal.

2.1: spiky cubes
Let’s take a look at an elementary-school-level example of geometry first. Assume that you have several unit squares or unit cubes like below. In each of them a circle or sphere with diameter 1 is inscribed. The length of a diagonal line in each square is \sqrt{2}, and that in each cube is \sqrt{3}.

If you stack the squares or cubes as below, what are the length of diameters of the blue circle or sphere, circumscribing all the 4 orange circles or the 8 orange spheres?

The answers are, the diameter of the blue circle is \sqrt{2} - 1, and the diameter of the blue sphere is \sqrt{3} - 1.

Next let’s think about the same situation in higher dimensional space. Assume that there are some unit D-dimensional hypercubes stacked, in each of which a D-ball with diameter 1 is inscribed, touching all the surfaces inside. Then what is the length of the diameter of  a D-ball circumscribing all the unit D-ball in the hypercubes ? Given the results above, it ca be predicted that its diameter is \sqrt{D}  -1. If that is true, there is one strange point: \sqrt{D} - 1 can soon surpass 2: that means in the chart above the blue sphere will stick out of the stacked cubes. That sounds like a paradox, but with one hypothesis, the phenomenon makes sense: cubes become more spiky as the degree of dimension grows. This hypothesis is a natural deduction because diagonal lines of hyper cubes get longer, and the the center of each surface of hypercubes still touches the unit D-ball with diameter 1, inscribing inscribing inside each unit hypercube.

If you stack 4 hypercubes, the blue sphere circumscribing them will not stick out of the stacked hypercubes anymore like the figure below.

*Of course you cannot visualize what is going on in D-dimensional space, so the figure below is just a pseudo simulation of D-dimensional space in our 3-dimensional sense. I guess you have to stack more than four hyper cubes in higher dimensional data, but you cannot easily imagine what will go on in such space anymore.

 

*You can confirm the fact that hypercube gets more spiky as the degree of dimension growth, by comparing the volume of the hypercube and the volume of the D-ball inscribed inside the hypercube. Thereby you can prove that the volume of hypercube concentrates on the corners of the hypercube. Plus, as I mentioned the longest diagonal distance of hypercube gets longer as dimension degree increases. That is why hypercube is said to be spiky. For mathematical proof, please check the Exercise 1.19 of PRML.

2.2: Pac-Man walking

Next intriguing phenomenon in high dimensional field is that most of pairs of vectors in high dimensional space are orthogonal. In other words, if you select two random vectors in high dimensional space, the angle between them are mostly close to 90^\circ. Let’s see the general meaning of angle between two vectors in any dimensional spaces. Assume that the angle between two vectors \boldsymbol{u}, and \boldsymbol{v} is \theta, then cos\theta is calculated as cos\theta = \frac{<\boldsymbol{u}, \boldsymbol{v}>}{|\boldsymbol{u}||\boldsymbol{v}|}. In 1, 2, or 3 dimensional space, you can actually see the angle, but again you can define higher dimensional angle, which you cannot visualize anymore. And angles are sometimes used as similarity of two vectors.

* <\boldsymbol{u}, \boldsymbol{v}> is the inner product of \boldsymbol{u}, and \boldsymbol{v}.

Assume that you generate a pair of two points inside a D-dimensional unit sphere and make two vectors \boldsymbol{u}, and \boldsymbol{v} by connecting the origin point and those two points respectively. When D is 2, I mean spheres are circles in this case, any \theta are equally generated as in the chart below. The fact might be the same as your intuition.   How about in 3-dimensional space? In fact the distribution of \theta is not uniform. \theta = 90^\circ is the most likely to be generated. As I explain in the figure below, if you compare the area of cross section of a hemisphere and the area of a cone whose vertex is the center point of the sphere, you can see why.

I generated 10000 random pairs of points in side a D-dimensional unit sphere, and calculated the angle between them. In other words I just randomly generated two D-dimensional vectors \boldsymbol{u} and \boldsymbol{v}, whose elements are randomly generated values between -1 and 1, and calculated the angle between them, repeating this process 10000 times. The chart below are the histograms of angle between pairs of generated vectors in respectively 2, 3, 50, and 100 dimensional space.

As I explained above, in 2-dimensional space, the distribution of \theta is almost uniform. However the distribution concentrates a little around 90^\circ in 3-dimensional space. You can see that the bigger the degree of dimension is, the more the angles of generated vectors concentrate around 90^\circ. That means most pairs of vectors in high dimensional space are close to orthogonal. Movements are also sequence of vectors, so when most pairs of movement vectors are orthogonal, that means you can only move like Pac-Man in such space.

Source: https://edition.cnn.com/style/article/pac-man-40-anniversary-history/index.html

* Of course I am talking about arcade Mac-Man game. Not Pac-Man in Super Smash Bros.  Retro RPG video games might have more similar playability, but in high dimensional space it is also difficult to turn back. At any rate, I think you have understood it is even difficult to move smoothly in high dimensional space, just like the first notorious Resident Evil on the first PS console also had terrible playability .

2.3: empty M & M’s chocolate

Let’s think about the proportion of the volume of the outermost \epsilon surface of general spheres with radius r. First, in 2 two dimensional space, spheres are circles. The area of the brown part of the circle below is \pi r^2. In order calculate the are of \epsilon \cdot r thick surface of the circle, you have only to subtract the area of \pi \{ (1 - \epsilon)\cdot r\} ^2. When \epsilon = 0.01, the area of outer most surface is \pi r^2 - \pi (0.99\cdot r)^2, and its proportion to the area of the whole circle is \frac{\pi r^2 - \pi (0.99\cdot r)^2}{\pi r^2} = 0.0199.

In case of 3-dimensional space, the value of a sphere with radius r is \frac{4}{3} \pi r^2, so the proportion of the \epsilon surface is calculated in the same way: \frac{\frac{4}{3} \pi r^3 -\frac{4}{3} \pi (0.99\cdot r)^2}{\frac{4}{3}\pi r^2} = 0.0297. Compared to the case in 2 dimensional space, the proportion is a little bigger.

How about in D-dimensional space? We have seen that even in  D-dimensional space the surface of a sphere, I mean D-ball, can be defined as a set of any points whose distance from the center point is all r. And it is known that the volume of D-ball is defined as below.

\Gamma () is called gamma function, but in this article it is not so important. The most important point now is, if you discuss any D-ball, their volume only depends on their radius r. That meas the proportion of outer \epsilon surface of D-ball is calculated as \frac{\pi r^2 - \pi \{ (1 - \epsilon)\cdot r\} ^2}{\pi r^2}. When \epsilon is 0.01, the proportion of the 1% surface of D-ball changes like in the chart below.

* And of course when D is 2,  \frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} = \pi, and when D is 3 ,  \frac{\pi ^{(\frac{D}{2})}}{\Gamma (\frac{D}{2} + 1)} = \frac{4}{3} \pi

You can see that when D is over 400, around 90% of volume is concentrated in the very thin 1% surface of D-ball. That is why, in high dimensional space, M & M’s chocolate look empty but tastes normal: all the chocolate are concentrated beneath the sugar coating.

More interestingly, even if you choose any points as a central point of a sphere with radius r, the other points are squashed to the surface of the sphere, even if all the data points are uniformly distributed. This situation is problematic for classical machine learning algorithms, which are often based on the Euclidean distances between pairs of two sample data points: if you go from the central point to another sample point, the possibility of finding the point within (1 - \epsilon)\cdot r radius of the center is almost zero. But if you reach the outermost \epsilon part of the surface of the sphere, most data points are there. However, for one of the data points in the surface, any other data points are distant in the same way.

Inside M & M’s chocolate is a mysterious world.

Source: https://hipwallpaper.com/mms-wallpapers/

You have seen that using high dimensional data can be problematic in many ways. Data science and machine learning are largely based on one idea: you can find a lower dimensional meaningful and easier structure in data. In the next articles I am going to introduce some famous dimension reduction algorithms. And hopefully I would like to give some deeper insights in to these algorithms, in straightforward ways.

* I could not explain the relationships of variance and bias of data. This is also a very important factor when you think about dimensionality of data. I hope I can write about this topic someday. You can also look it up if you are interested.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Illustrative introductions on dimension reduction

“What is your image on dimensions?”

….That might be a cheesy question to ask to reader of Data Science Blog, but most people, with no scientific background, would answer “One dimension is a line, and two dimension is a plain, and we live in three-dimensional world.” After that if you ask “How about the fourth dimension?” many people would answer “Time?”

You can find books or writings about dimensions in various field. And you can use the word “dimension” in normal conversations, in many contexts.

*In Japanese, if you say “He likes two dimension.” that means he prefers anime characters to real women, as is often the case with Japanese computer science students.

The meanings of “dimensions” depend on the context, but in data science dimension is usually the number of rows of your Excel data.

When you study data science or machine learning, usually you should start with understanding the algorithms with 2 or 3 dimensional data, and you can apply those ideas to any D dimensional data. But of course you cannot visualize D dimensional data anymore, and you always have to be careful of what happens if you expand degree of dimension.

Conversely it is also important to reduce dimension to understand abstract high dimensional stuff in 2 or 3 dimensional space, which are close to our everyday sense. That means dimension reduction is one powerful way of data visualization.

In this blog series I am going to explain meanings of dimension itself in machine learning context and algorithms for dimension reductions, such as PCA, LDA, and t-SNE, with 2 or 3 dimensional visible data. Along with that, I am going to delve into the meaning of calculations so that you can understand them in more like everyday-life sense.

This article series is going to be roughly divided into the contents below.

  1. Curse of Dimensionality
  2. PCA, LDA (to be published soon)
  3. Rethinking eigen vectors (to be published soon)
  4. KL expansion and subspace method (to be published soon)
  5. Autoencoder as dimension reduction (to be published soon)
  6. t-SNE (to be published soon)

I hope you could see that reducing dimension is one of the fundamental approaches in data science or machine learning.

Simple RNN

Understanding LSTM forward propagation in two ways

*This article is only for the sake of understanding the equations in the second page of the paper named “LSTM: A Search Space Odyssey”. If you have no trouble understanding the equations of LSTM forward propagation, I recommend you to skip this article and go the the next article.

1. Preface

I  heard that in Western culture, smart people write textbooks so that other normal people can understand difficult stuff, and that is why textbooks in Western countries tend to be bulky, but also they are not so difficult as they look. On the other hand in Asian culture, smart people write puzzling texts on esoteric topics, and normal people have to struggle to understand what noble people wanted to say. Publishers also require the authors to keep the texts as short as possible, so even though the textbooks are thin, usually students have to repeat reading the textbooks several times because usually they are too abstract.

Both styles have cons and pros, and usually I prefer Japanese textbooks because they are concise, and sometimes it is annoying to read Western style long texts with concrete straightforward examples to reach one conclusion. But a problem is that when it comes to explaining LSTM, almost all the text books are like Asian style ones. Every study material seems to skip the proper steps necessary for “normal people” to understand its algorithms. But after actually making concrete slides on mathematics on LSTM, I understood why: if you write down all the equations on LSTM forward/back propagation, that is going to be massive, and actually I had to make 100-page PowerPoint animated slides to make it understandable to people like me.

I already had a feeling that “Does it help to understand only LSTM with this precision? I should do more practical codings.” For example François Chollet, the developer of Keras, in his book, said as below.

 

For me that sounds like “We have already implemented RNNs for you, so just shut up and use Tensorflow/Keras.” Indeed, I have never cared about the architecture of my Mac Book Air, but I just use it every day, so I think he is to the point. To make matters worse, for me, a promising algorithm called Transformer seems to be replacing the position of LSTM in natural language processing. But in this article series and in my PowerPoint slides, I tried to explain as much as possible, contrary to his advice.

But I think, or rather hope,  it is still meaningful to understand this 23-year-old algorithm, which is as old as me. I think LSTM did build a generation of algorithms for sequence data, and actually Sepp Hochreiter, the inventor of LSTM, has received Neural Network Pioneer Award 2021 for his work.

I hope those who study sequence data processing in the future would come to this article series, and study basics of RNN just as I also study classical machine learning algorithms.

 *In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

2. Why LSTM?

First of all, let’s take a brief look at what I said about the structures of RNNs,  in the first and the second article. A simple RNN is basically densely connected network with a few layers. But the RNN gets an input every time step, and it gives out an output at the time step. Part of information in the middle layer are succeeded to the next time step, and in the next time step, the RNN also gets an input and gives out an output. Therefore, virtually a simple RNN behaves almost the same way as densely connected layers with many layers during forward/back propagation if you focus on its recurrent connections.

That is why simple RNNs suffer from vanishing/exploding gradient problems, where the information exponentially vanishes or explodes when its gradients are multiplied many times through many layers during back propagation. To be exact, I think you need to consider this problem precisely like you can see in this paper. But for now, please at least keep it in mind that when you calculate a gradient of an error function with respect to parameters of simple neural networks, you have to multiply parameters many times like below, and this type of calculation usually leads to vanishing/exploding gradient problem.

LSTM was invented as a way to tackle such problems as I mentioned in the last article.

3. How to display LSTM

I would like you to just go to image search on Google, Bing, or Yahoo!, and type in “LSTM.” I think you will find many figures, but basically LSTM charts are roughly classified into two types: in this article I call them “Space Odyssey type” and “electronic circuit type”, and in conclusion, I highly recommend you to understand LSTM as the “electronic circuit type.”

*I just randomly came up with the terms “Space Odyssey type” and “electronic circuit type” because the former one is used in the paper I mentioned, and the latter one looks like an electronic circuit to me. You do not have to take how I call them seriously.

However, not that all the well-made explanations on LSTM use the “electronic circuit type,” and I am sure you sometimes have to understand LSTM as the “space odyssey type.” And the paper “LSTM: A Search Space Odyssey,” which I learned a lot about LSTM from,  also adopts the “Space Odyssey type.”

LSTM architectur visualization

The main reason why I recommend the “electronic circuit type” is that its behaviors look closer to that of simple RNNs, which you would have seen if you read my former articles.

*Behaviors of both of them look different, but of course they are doing the same things.

If you have some understanding on DCL, I think it was not so hard to understand how simple RNNs work because simple RNNs  are mainly composed of linear connections of neurons and weights, whose structures are the same almost everywhere. And basically they had only straightforward linear connections as you can see below.

But from now on, I would like you to give up the ideas that LSTM is composed of connections of neurons like the head image of this article series. If you do that, I think that would be chaotic and I do not want to make a figure of it on Power Point. In short, sooner or later you have to understand equations of LSTM.

4. Forward propagation of LSTM in “electronic circuit type”

*For further understanding of mathematics of LSTM forward/back propagation, I recommend you to download my slides.

The behaviors of an LSTM block is quite similar to that of a simple RNN block: an RNN block gets an input every time step and gets information from the RNN block of the last time step, via recurrent connections. And the block succeeds information to the next block.

Let’s look at the simplified architecture of  an LSTM block. First of all, you should keep it in mind that LSTM have two streams of information: the one going through all the gates, and the one going through cell connections, the “highway” of LSTM block. For simplicity, we will see the architecture of an LSTM block without peephole connections, the lines in blue. The flow of information through cell connections is relatively uninterrupted. This helps LSTMs to retain information for a long time.

In a LSTM block, the input and the output of the former time step separately go through sections named “gates”: input gate, forget gate, output gate, and block input. The outputs of the forget gate, the input gate, and the block input join the highway of cell connections to renew the value of the cell.

*The small two dots on the cell connections are the “on-ramp” of cell conection highway.

*You would see the terms “input gate,” “forget gate,” “output gate” almost everywhere, but how to call the “block gate” depends on textbooks.

Let’s look at the structure of an LSTM block a bit more concretely. An LSTM block at the time step (t) gets \boldsymbol{y}^{(t-1)}, the output at the last time step,  and \boldsymbol{c}^{(t-1)}, the information of the cell at the time step (t-1), via recurrent connections. The block at time step (t) gets the input \boldsymbol{x}^{(t)}, and it separately goes through each gate, together with \boldsymbol{y}^{(t-1)}. After some calculations and activation, each gate gives out an output. The outputs of the forget gate, the input gate, the block input, and the output gate are respectively \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)}. The outputs of the gates are mixed with \boldsymbol{c}^{(t-1)} and the LSTM block gives out an output \boldsymbol{y}^{(t)}, and gives \boldsymbol{y}^{(t)} and \boldsymbol{c}^{(t)} to the next LSTM block via recurrent connections.

You calculate \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} as below.

  • \boldsymbol{f}^{(t)}= \sigma(\boldsymbol{W}_{for} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{for} \boldsymbol{y}^{(t-1)} +  \boldsymbol{b}_{for})
  • \boldsymbol{i}^{(t)}=\sigma(\boldsymbol{W}_{in} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{in} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{in})
  • \boldsymbol{z}^{(t)}=tanh(\boldsymbol{W}_z \boldsymbol{x}^{(t)} + \boldsymbol{R}_z \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_z)
  • \boldsymbol{o}^{(t)}=\sigma(\boldsymbol{W}_{out} \boldsymbol{x}^{(t)} + \boldsymbol{R}_{out} \boldsymbol{y}^{(t-1)} + \boldsymbol{b}_{out})

*You have to keep it in mind that the equations above do not include peephole connections, which I am going to show with blue lines in the end.

The equations above are quite straightforward if you understand forward propagation of simple neural networks. You add linear products of \boldsymbol{y}^{(t)} and \boldsymbol{c}^{(t)} with different weights in each gate. What makes LSTMs different from simple RNNs is how to mix the outputs of the gates with the cell connections. In order to explain that, I need to introduce a mathematical operator called Hadamard product, which you denote as \odot. This is a very simple operator. This operator produces an elementwise product of two vectors or matrices with identical shape.

With this Hadamar product operator, the renewed cell and the output are calculated as below.

  • \boldsymbol{c}^{(t)} = \boldsymbol{z}^{(t)}\odot \boldsymbol{i}^{(t)} + \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)}
  • \boldsymbol{y}^{(t)} = \boldsymbol{o}^{(t)} \odot tanh(\boldsymbol{c}^{(t)})

The values of \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{z}^{(t)}, \boldsymbol{o}^{(t)} are compressed into the range of [0, 1] or [-1, 1] with activation functions. You can see that the input gate and the block input give new information to the cell. The part \boldsymbol{c}^{(t-1)} \odot \boldsymbol{f}^{(t)} means that the output of the forget gate “forgets” the cell of the last time step by multiplying the values from 0 to 1 elementwise. And the cell \boldsymbol{c}^{(t)} is activated with tanh() and the output of the output gate “suppress” the activated value of \boldsymbol{c}^{(t)}. In other words, the output gatedecides how much information to give out as an output of the LSTM block. The output of every gate depends on the input \boldsymbol{x}^{(t)}, and the recurrent connection \boldsymbol{y}^{(t-1)}. That means an LSTM block learns to forget the cell of the last time step, to renew the cell, and to suppress the output. To describe in an extreme manner, if all the outputs of every gate are always (1, 1, …1)^T, LSTMs forget nothing, retain information of inputs at every time step, and gives out everything. And  if all the outputs of every gate are always (0, 0, …0)^T, LSTMs forget everything, receive no inputs, and give out nothing.

This model has one problem: the outputs of each gate do not directly depend on the information in the cell. To solve this problem, some LSTM models introduce some flows of information from the cell to each gate, which are shown as lines in blue in the figure below.

LSTM inner architecture

LSTM models, for example the one with or without peephole connection, depend on the library you use, and the model I have showed is one of standard LSTM structure. However no matter how complicated structure of an LSTM block looks, you usually cover it with a black box as below and show its behavior in a very simplified way.

5. Space Odyssey type

I personally think there is no advantages of understanding how LSTMs work with this Space Odyssey type chart, but in several cases you would have to use this type of chart. So I will briefly explain how to look at that type of chart, based on understandings of LSTMs you have gained through this article.

In Space Odyssey type of LSTM chart, at the center is a cell. Electronic circuit type of chart, which shows the flow of information of the cell as an uninterrupted “highway” in an LSTM block. On the other hand, in a Spacey Odyssey type of chart, the information of the cell rotate at the center. And each gate gets the information of the cell through peephole connections,  \boldsymbol{x}^{(t)}, the input at the time step (t) , sand \boldsymbol{y}^{(t-1)}, the output at the last time step (t-1), which came through recurrent connections. In Space Odyssey type of chart, you can more clearly see that the information of the cell go to each gate through the peephole connections in blue. Each gate calculates its output.

Just as the charts you have seen, the dotted line denote the information from the past. First, the information of the cell at the time step (t-1) goes to the forget gate and get mixed with the output of the forget cell In this process the cell is partly “forgotten.” Next, the input gate and the block input are mixed to generate part of new value of the the cell at time step  (t). And the partly “forgotten” \boldsymbol{c}^{(t-1)} goes back to the center of the block and it is mixed with the output of the input gate and the block input. That is how \boldsymbol{c}^{(t)} is renewed. And the value of new cell flow to the top of the chart, being mixed with the output of the output gate. Or you can also say the information of new cell is “suppressed” with the output gate.

I have finished the first four articles of this article series, and finally I am gong to write about back propagation of LSTM in the next article. I have to say what I have written so far is all for the next article, and my long long Power Point slides.

 

[References]

[1] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber, “LSTM: A Search Space Odyssey,” (2017)

[2] Francois Chollet, Deep Learning with Python,(2018), Manning , pp. 202-204

[3] “Sepp Hochreiter receives IEEE CIS Neural Networks Pioneer Award 2021”, Institute of advanced research in artificial intelligence, (2020)
URL: https://www.iarai.ac.at/news/sepp-hochreiter-receives-ieee-cis-neural-networks-pioneer-award-2021/?fbclid=IwAR27cwT5MfCw4Tqzs3MX_W9eahYDcIFuoGymATDR1A-gbtVmDpb8ExfQ87A

[4] Oketani Takayuki, “Machine Learning Professional Series: Deep Learning,” (2015), pp. 120-125
岡谷貴之 著, 「機械学習プロフェッショナルシリーズ 深層学習」, (2015), pp. 120-125

[5] Harada Tatsuya, “Machine Learning Professional Series: Image Recognition,” (2017), pp. 252-257
原田達也 著, 「機械学習プロフェッショナルシリーズ 画像認識」, (2017), pp. 252-257

[6] “Understandable LSTM ~ With the Current Trends,” Qiita, (2015)
「わかるLSTM ~ 最近の動向と共に」, Qiita, (2015)
URL: https://qiita.com/t_Signull/items/21b82be280b46f467d1b

 How Text to Speech Voices Are Used In Data Science

To speak on voices, text to speech platforms are bringing versatility to a new scale by implementing voices that sound more personal and less like a robot. As these services gain traction, vocal quality, and implementation improve to give sounds that feel like they’re speaking to you from a human mouth.

The intention of most text to speech platforms have always been to provide experiences that users feel comfortable using. Voices are a huge part of that, so great strides have been taken to ensure that they sound right.

How Voices are Utilized

Voices in the text to speech are generated by a computer itself. As the computer-generated voices transcribe the text into oral responses, they make up what we hear as dialogue read to us. These voice clips initially had the problem of sounding robotic and unpersonable as they were pulled digitally together. Lately, though, the technology has improved to bring faster response time in transcribing words, as well as seamlessly stringing together. This has brought the advantage of making a computer-generated voice sound much more natural and human. As people seek to connect more with the works they read, having a human-sounding voice is a huge step in letting listeners relate to their works.

To give an example of where this works, you might have a GPS in your car. The GPS has a function where it will transcribe the car’s route and tell you each instruction. Some GPS companies have made full use of this feature and added fun voices to help entertain drivers. These include Darth Vader and Yoda from Star Wars or having Morgan Freeman and Homer Simpson narrate your route. Different voice types are utilized in services depending on the situation. Professional uses like customer service centers will keep automated voices sounding professional and courteous when assisting customers. Educational systems will keep softer and kinder sounding voices to help sound more friendly with students.

When compared to older solutions, the rise of vocal variety in Text to Speech services has taken huge leaps as more people see the value of having a voice that they can connect to. Expressive voices and emotional variance are being applied to voices to help further convey this, with happy or sad sounding voices being implemented wherever appropriate. As time goes on, these services will get better at the reading context within sentences to apply emotion and tone at the correct times, and improve overall vocal quality as well. These reinvent past methods by advancing the once static and robotic sounds that used to be commonplace among text to voice services.

More infrastructures adopt these services to expand their reach to consumers who might not have the capabilities to utilize their offerings.  Having clear and relatable voices matter because customers and users will be drawn to them considerably more than if they chose not to offer them at all. In the near future text to speech voices will develop even further, enhancing the way people of all kinds connect to the words they read.

As Businesses Struggle With ML, Automation Offers a Solution

In recent years, machine learning technology and the business solutions it enables has developed into a big business in and of itself. According to the industry analysts at IDC, spending on ML and AI technology is set to grow to almost $98 billion per year by 2023. In practical terms, that figure represents a business environment where ML technology has become a key priority for companies of every kind.

That doesn’t mean that the path to adopting ML technology is easy for businesses. Far from it. In fact, survey data seems to indicate that businesses are still struggling to get their machine learning efforts up and running. According to one such survey, it currently takes the average business as many as 90 days to deploy a single machine learning model. For 20% of businesses, that number is even higher.

From the data, it seems clear that something is missing in the methodologies that most companies rely on to make meaningful use of machine learning in their business workflows. A closer look at the situation reveals that the vast majority of data workers (analysts, data scientists, etc.) spend an inordinate amount of time on infrastructure work – and not on creating and refining machine learning models.

Streamlining the ML Adoption Process

To fix that problem, businesses need to turn to another growing area of technology: automation. By leveraging the latest in automation technology, it’s now possible to build an automated machine learning pipeline (AutoML pipeline) that cuts down on the repetitive tasks that slow down ML deployments and lets data workers get back to the work they were hired to do. With the right customized solution in place, a business’s ML team can:

  • Reduce the time spent on data collection, cleaning, and ingestion
  • Minimize human errors in the development of ML models
  • Decentralize the ML development process to create an ML-as-a-service model with increased accessibility for all business stakeholders

In short, an AutoML pipeline turns the high-effort functions of the ML development process into quick, self-adjusting steps handled exclusively by machines. In some use cases, an AutoML pipeline can even allow non-technical stakeholders to self-create ML solutions tailored to specific business use cases with no expert help required. In that way, it can cut ML costs, shorten deployment time, and allow data scientists to focus on tackling more complex modelling work to develop custom ML solutions that are still outside the scope of available automation techniques.

The Parts of an AutoML Pipeline

Although the frameworks and tools used to create an AutoML pipeline can vary, they all contain elements that conform to the following areas:

  • Data Preprocessing – Taking available business data from a variety of sources, cleaning it, standardizing it, and conducting missing value imputation
  • Feature Engineering – Identifying features in the raw data set to create hypotheses for the model to base predictions on
  • Model Selection – Choosing the right ML approach or hyperparameters to produce the desired predictions
  • Tuning Hyperparameters – Determining which hyperparameters help the model achieve optimal performance

As anyone familiar with ML development can tell you, the steps in the above process tend to represent the majority of the labour and time-intensive work that goes into creating a model that’s ready for real-world business use. It is also in those steps where the lion’s share of business ML budgets get consumed, and where most of the typical delays occur.

The Limitations and Considerations for Using AutoML

Given the scope of the work that can now become part of an AutoML pipeline, it’s tempting to imagine it as a panacea – something that will allow a business to reduce its reliance on data scientists going forward. Right now, though, the technology can’t do that. At this stage, AutoML technology is still best used as a tool to augment the productivity of business data teams, not to supplant them altogether.

To that end, there are some considerations that businesses using AutoML will need to keep in mind to make sure they get reliable, repeatable, and value-generating results, including:

  • Transparency – Businesses must establish proper vetting procedures to make sure they understand the models created by their AutoML pipeline, so they can explain why it’s making the choices or predictions it’s making. In some industries, such as in medicine or finance, this could even fall under relevant regulatory requirements.
  • Extensibility – Making sure the AutoML framework may be expanded and modified to suit changing business needs or to tackle new challenges as they arise.
  • Monitoring and Maintenance – Since today’s AutoML technology isn’t a set-it-and-forget-it proposition, it’s important to establish processes for the monitoring and maintenance of the deployment so it can continue to produce useful and reliable ML models.

The Bottom Line

As it stands today, the convergence of automation and machine learning holds the promise of delivering ML models at scale for businesses, which would greatly speed up the adoption of the technology and lower barriers to entry for those who have yet to embrace it. On the whole, that’s great news both for the businesses that will benefit from increased access to ML technology, as well as for the legions of data professionals tasked with making it all work.

It’s important to note, of course, that complete end-to-end ML automation with no human intervention is still a long way off. While businesses should absolutely explore building an automated machine learning pipeline to speed up development time in their data operations, they shouldn’t lose sight of the fact that they still need plenty of high-skilled data scientists and analysts on their teams. It’s those specialists that can make appropriate and productive use of the technology. Without them, an AutoML pipeline would accomplish little more than telling the business what it wants to hear.

The good news is that the AutoML tools that exist right now are sufficient to alleviate many of the real-world problems businesses face in their road to ML adoption. As they become more commonplace, there’s little doubt that the lead time to deploy machine learning models is going to shrink correspondingly – and that businesses will enjoy higher ROI and enhanced outcomes as a result.