AI For Advertisers: How Data Analytics Can Change The Maths Of Advertising?

All Images Credit: Freepik

The task of understanding a customer’s journey and designing your marketing strategy accordingly can be difficult in this data-driven world. Today, the customer expresses their needs in myriad forms of requests.

Consumers express their needs and want attitudes, and values in various forms through search, comments, blogs, Tweets, “likes,” videos, and conversations and access such data across many channels like web, mobile, and face to face. Volume, variety, velocity and veracity of the data accumulated through these customer interactions are huge.

BigData and data analytics can be leveraged to understand several phases of the customer journey. There are risks involved in using Artificial Intelligence for the marketing data analysis of data breach and even manipulation. But, AI do have brighter prospects when it comes to marketing and advertiser applications.

As the CEO of a technology firm Chop Dawg and marketer, Joshua Davidson puts it, “AI-powered apps are going to be the future for us, and there are several industries that are ripe for this.” The mobile-first strategy of many enterprises has powered the use of AI for digital marketing and developing technologies and innovations to power industries with intelligent systems.

How AI and Machine learning are affecting customer journeys?

Any consumer journey begins with the recognition of a problem and then stages like initial consideration, active evaluation, purchase, and postpurchase come through up till the consumer journey is over. The need for identifying the purchasing and need patterns of the consumers and finding the buyer personas to strategize the marketing for them.

Need and Want Recognition:

Identifying a need is quite difficult as it is the most initial level of a consumer’s journey and it is more on the category level than at a brand level. Marketers and advertisers are relying on techniques like market research, web analytics, and data mining to build consumer profiles and buyer’s persona for understanding the needs and influencing the purchase of products. AI can help identify these wants and needs in real-time as the consumers usually express their needs and wants online and help build profiles more quickly.

AI technologies offered by several firms help in consumer profiling. Firms like Microsoft offers Azure that crunches billions of data points in seconds to determine the needs of consumers. It then personalizes web content on specific platforms in real-time to align with those status-updates. Consumer digital footprints are evolving through social media status updates, purchasing behavior, online comments and posts. Ai tends to update these profiles continuously through machine learning techniques.

Initial Consideration:

A key objective of advertising is to insert a brand into the consideration set of the consumers when they are looking for deliberate offerings. Advertising includes increasing the visibility of brands and emphasize on the key reasons for consideration. Advertisers currently use search optimization, paid search advertisements, organic search, or advertisement retargeting for finding the consideration and increase the probability of consumer consideration.

AI can leverage machine learning and data analytics to help with search, identify and rank functions of consumer consideration that can match the real-time considerations at any specific time. Take an example of Google Adwords, it analyzes the consumer data and helps advertisers make clearer distinctions between qualified and unqualified leads for better targeting.

Google uses AI to analyze the search-query data by considering, not only the keywords but also context words and phrases, consumer activity data and other BigData. Then, Google identifies valuable subsets of consumers and more accurate targeting.

Active Evaluation: 

When consumers narrow it down to a few choices of brands, advertisers need to insert trust and value among the consumers for brands. A common technique is to identify the higher purchase consumers and persuade them through persuasive content and advertisement. AI can support these tasks using some techniques:

Predictive Lead Scoring: Predictive lead scoring by leveraging machine learning techniques of predictive analytics to allow marketers to make accurate predictions related to the intent of purchase for consumers. A machine learning algorithm runs through a database of existing consumer data, then recognize trends and patterns and after processing the external data on consumer activities and interests, creates robust consumer profiles for advertisers.

Natural Language Generation: By leveraging the image, speech recognition and natural language generation, machine learning enables marketers to curate content while learning from the consumer behavior in real-time scenarios and adjusts the content according to the profiles on the fly.

Emotion AI: Marketers use emotion AI to understand consumer sentiment and feel about the brand in general. By tapping into the reviews, blogs or videos they understand the mood of customers. Marketers also use emotion AI to pretest advertisements before its release. The famous example of Kelloggs, which used emotion AI to help devise an advertising campaign for their cereal, eliminating the advertisement executions whenever the consumer engagement dropped.

Purchase: 

As the consumers decide which brands to choose and what it’s worth, advertising aims to move them out of the decision process and push for the purchase by reinforcing the value of the brand compared with its competition.

Advertisers can insert such value by emphasizing convenience and information about where to buy the product, how to buy the product and reassuring the value through warranties and guarantees. Many marketers also emphasize on rapid return policies and purchase incentives.

AI can completely change the purchase process through dynamic pricing, which encompasses real-time price adjustments on the basis of information such as demand and other consumer-behavior variables, seasonality, and competitor activities.

Post-Purchase: 

Aftersales services can be improved through intelligent systems using AI technologies and machine learning techniques. Marketers and advertisers can hire dedicated developers to design intelligent virtual agents or chatbots that can reinforce the value and performance of a brand among consumers.

Marketers can leverage an intelligent technique known as Propensity modeling to identify the most valuable customers on the basis of lifetime value, likelihood of reengagement, propensity to churn, and other key performance measures of interest. Then advertisers can personalize their communication with these customers on the basis of these data.

Conclusion:

AI has shifted the focus of advertisers and marketers towards the customer-first strategies and enhanced the heuristics of customer engagement. Machine learning and IoT(Internet of Things) has already changed the way customer interact with the brands and this transition has come at a time when advertisers and marketers are looking for new ways to tap into the customer mindset and buyer’s persona.

All Images Credit: Freepik

Best machine learning algorithms you should know

Machine learning is a key technology tool businesses use to build tools that enhance their operations. To do that, they take advantage of machine learning algorithms that come in different shapes and sizes, servicing different purposes and working on different data sets. Choosing the right algorithm for the job is what makes machine learning and deep learning projects successful. That’s why being aware of all the different types of machine learning algorithms is so important – that’s how you get better results and build more advanced solutions.

Here’s an overview of the best machine learning algorithms you should know before starting your project.

What is meant by machine learning algorithms?

First things first, what is machine learning and how do algorithms fit into the picture? A machine learning (ML) algorithm is a process or set of procedures that allow a model to adapt to the data with a specific objective set as the goal.

An ML algorithm specifies how the data is transformed from the input to output, helping the model to learn the appropriate mapping from input to output. That model specifies the mapping functions and holds the parameters in place, while the machine learning algorithm updates the parameters to help the model match its goal.

What are the algorithms used in machine learning?

Algorithms can model problems in many different ways. The easiest way to differentiate between different ML algorithms is by comparing them by learning styles that they can adapt. Generally, machine learning algorithms can adapt to several learning styles that help to solve different problems.

Here are four learning styles in machine learning you need to know:

1 Supervised learning

In supervised learning, the input data serves as training data and comes with a known label or result – for example, the price at a time or spam/not-spam.

In this variant, the training process is critical for preparing a model that makes predictions and then is corrected when the predictions are wrong. The training process continues until the model achieves the appropriate level of accuracy. Classification and regression are examples of problems for this learning type.

 

2 Unsupervised learning

In unsupervised learning, input data isn’t labeled and doesn’t come with a known result. Data scientists prepare models by deducing the structures in the input data to extract general rules or reduce redundancy through mathematical processes. Unsupervised learning addresses problems such as association rule learning, dimensionality reduction, and clustering.

3 Semi-supervised learning

In this learning style, the input data is a mixture of labeled and unlabeled examples. The prediction problem is known, but the model needs to learn the structures for organizing data and making predictions on its own. This learning style is used to address problems such as regression and classification.

4 Reinforcement learning

One of three basic machine learning paradigms together with supervised learning and unsupervised learning, reinforcement learning (RL) is an area of machine learning that focuses on the ways in which software agents should take actions to maximize a specified notion of cumulative reward in a given environment.

The best machine learning algorithms you should know

1 Linear Regression

Linear regression is an algorithm that correlates between two variables in the data set, examining the input and output sets to show a relationship between them. For example, the algorithm can show how changing one of the input variables affects the other variable. The relationship is represented by plotting a line on the graph.

Linear regression is one of the most popular algorithms in machine learning because it’s transparent and requires no tuning to work. Practical applications of this algorithm are risk assessment or sales forecasting solutions.

2 Logistic regression

Logistic regression is a type of constrained Linear Regression with a non-linearity application after you apply weights. Note that this algorithm is used for classification, not regression. The algorithm restricts the outputs close to +/- classes (and 1 and 0 in the case of sigmoid) and can be trained with Gradient Descent or L-BFGS.

Logistic regression is used in Natural Language Processing (NLP) applications, where it often appears under the name of Maximum Entropy Classifier.

3 Principal component analysis (PCA or LDA)

Principal component analysis is an unsupervised method that helps data scientists to understand better the global properties of a data set that consists of vectors. It analyzes the covariance matrix of data points to learn which dimensions/data points have high variance among themselves and low covariance with others. The algorithm helps data scientists to get data points with reduced dimensions.

4 K-means clustering

K- means clustering is a type of unsupervised clustering algorithm that sorts data sets through defined clusters. It offers results in the form of groups based on internal patterns.

For example, you can use a K-means algorithm for sorting web results for the word “cat,” and it will show all the results in the form of groups. The main advantage of this algorithm is its accuracy as it provides data groupings faster than other algorithms.

 

5 Decision trees

A decision tree is made of various branches that represent the outcome of many decisions. This algorithm collects and graphs data in multiple branches to predict response variables on the basis of past decisions. It comes in handy for mapping our decisions and presents results visually to communicate findings easily.

Decision trees work best for smaller data sets and relatively low-stake decisions – otherwise, the long-tail visuals can be hard to decipher. The key advantage of this algorithm is that it allows showing multiple outcomes and tests without having to involve data scientists – it’s easy to use.

6 Random forests

A random forest consists of a great number of individual decision trees where they all operate as an ensemble. An individual tree in the random forest generates a class prediction – the class which receives the highest number of votes becomes the model’s prediction. Having many relatively uncorrelated models (trees) operating as a committee easily outperforms individual constituent models.

The low correlation between these models is the strength of this approach because it allows producing ensemble predictions that are far more accurate than individual predictions. Note that decisions trees protect each other from individual errors. While some trees may generate false predictions, others will generate the right ones – as a group; they will be able to move in the right direction.

7 Support Vector Machine

Support Vector Machines (SVMs) are linear models similar to linear or logistic regression we’ve discussed earlier. However, there’s one difference – they have a different margin-based loss function, which can be optimized by using methods such as L-BFGS or SGD. SVMs internally analyze data sets into classes, which is helpful for future classifications.

The main idea behind SVM is separating data into classes and maximizing the margins of entering future data into classes. This type of algorithm works best for training data. However, it can also serve as a tool for processing nonlinear data. The financial sector makes use of Support Vector Machines thanks to its accuracy in classifying both current and future data sets.

8 Apriori

The Apriori algorithm is used a lot in market analysis. It’s based on the principle of Apriori and checks for positive and negative correlations between products after analyzing values in data sets.

For example, if two values often correlate in a data set, the algorithm will conclude that A will often lead to B, referring to the information in data sets. For example, if customers often buy product A and product B together, this relation will hold a high percentage and help companies like Google or Amazon to predict product searches and purchases.

9 Naive Bayes Classifier

This handy classification technique is based on Bayes’ Theorem, which assumes independence among predictors. The algorithm will assume that the presence of a specific feature in a class is not related to the presence of any other feature in the same class.

For example, a fruit may be considered a banana if it’s yellow, curved, and about 15 cm long. These features depend on each other, and on the existence of hooter features, they all independently contribute to the probability that this fruit is a banana. That’s why the algorithm bears the name “Naive.”

The algorithm offers a model that is easy to build and helpful in handling very large data sets. It can outperform the most sophisticated classification methods.

10 K-Nearest Neighbors (KNN)

This is one of the simplest algorithm types used in machine learning for classification and regression. KNN algorithms classify new data points on the basis of similarity measures, such as the distance function. They perform classification by using a majority vote of the data points’ neighbors. They then assign data to the class, which has the nearest neighbors. Together with increasing the number of nearest neighbors (the value of k), the accuracy may increase as well.

11 Ordinary Least Squares Regression (OLSR)

Ordinary Least Squares Regression (OLSR) is a generalized linear modeling technique data scientists use for estimating unknown parameters that are part of a linear regression model. OLSR describes the relationship between a dependent variable and one or more of its independent variables.

The algorithm is applied in diverse fields such as economics, finance, medicine, and social sciences. Companies use it in machine learning and predictive analytics to dynamically predict specific outcomes on the basis of variables that change dynamically.

We hope that this machine learning algorithms list helps you pick the right tools of the trade for your next machine learning project. If you’d like to learn more about Machine Learning, Data Science and Web Development, visit the Sunscrapers company blog.

Process Paradise by the Dashboard Light

The right questions drive business success. Questions like, “How can I make sure my product is the best of its kind?” “How can I get the edge over my competitors?” and “How can I keep growing my organization?” Modern businesses take their questions further, focusing on the details of how they actually function. At this level, the questions become, “How can I make my business as efficient as possible?” “How can I improve the way my company does business?” and even, “Why aren’t my company’s processes working as they should?”


Read this article in German:

Mit Dashboards zur Prozessoptimierung


To discover the answers to these questions (and many others!), more and more businesses are turning to process mining. Process mining helps organizations unlock hidden value by automatically collecting information on process models from across the different IT systems operating within a business. This allows for continuous monitoring of an organization’s end-to-end process landscape, meaning managers and staff gain specific operational insights into potential risks—as well as ongoing improvement opportunities.

However, process mining is not a silver bullet that turns data into insights at the push of a button. Process mining software is simply a tool that produces information, which then must be analyzed and acted upon by real people. For this to happen, the information produced must be available to decision-makers in an understandable format.

For most process mining tools, the emphasis remains on the sophistication of analysis capabilities, with the resulting data needing to be interpreted by a select group of experts or specialists within an organization. This necessarily creates a delay between the data being produced, the analysis completed, and actions taken in response.

Process mining software that supports a more collaborative approach by reducing the need for specific expertise can help bridge this gap. Only if hypotheses, analysis, and discoveries are shared, discussed, and agreed upon with a wide range of people can really meaningful insights be generated.

Of course, process mining software is currently capable of generating standardized reports and readouts, but in a business environment where the pace of change is constantly increasing, this may not be sufficient for very much longer. For truly effective process mining, the secret to success will be anticipating challenges and opportunities, then dealing with them as they arise in real time.

Dashboards of the future

To think about how process mining could improve, let’s consider an analog example. Technology evolves to make things easier—think of the difference between keeping track of expenditure using a written ledger vs. an electronic spreadsheet. Now imagine the spreadsheet could tell you exactly when you needed to read it, and where to start, as well as alerting you to errors and omissions before you were even aware you’d made them.

Advances in process mining make this sort of enhanced assistance possible for businesses seeking to improve the way they work. With the right process mining software, companies can build tailored operational cockpits that unite real-time operational data with process management. This allows for the usual continuous monitoring of individual processes and outcomes, but it also offers even clearer insights into an organization’s overall process health.

Combining process mining with an organization’s existing process models in the right way turns these models from static representations of the way a particular process operates, into dynamic dashboards that inform, guide and warn managers and staff about problems in real time. And remember, dynamic doesn’t have to mean distracting—the right process mining software cuts into your processes to reveal an all-new analytical layer of process transparency, making things easier to understand, not harder.

As a result, business transformation initiatives and other improvement plans and can be adapted and restructured on the go, while decision-makers can create automated messages to immediately be advised of problems and guided to where the issues are occurring, allowing corrective action to be completed faster than ever. This rapid evaluation and response across any process inefficiencies will help organizations save time and money by improving wasted cycle times, locating bottlenecks, and uncovering non-compliance across their entire process landscape.

Dynamic dashboards with Signavio

To see for yourself how the most modern and advanced process mining software can help you reveal actionable insights into the way your business works, give Signavio Process Intelligence a try. With Signavio’s Live Insights, all your process information can be visualized in one place, represented through a traffic light system. Simply decide which processes and which activities within them you want to monitor or understand, place the indicators, choose the thresholds, and let Signavio Process Intelligence connect your process models to the data.

Banish multiple tabs and confusing layouts, amaze your colleagues and managers with fact-based insights to support your business transformation, and reduce the time it takes to deliver value from your process management initiatives. To find out more about Signavio Process Intelligence, or sign up for a free 30-day trial, visit www.signavio.com/try.

Process mining is a powerful analysis tool, giving you the visibility, quantifiable numbers, and information you need to improve your business processes. Would you like to read more? With this guide to managing successful process mining initiatives, you will learn that how to get started, how to get the right people on board, and the right project approach.

The importance of being Data Scientist

Header-Image by Clint Adair on Unsplash.

The incredible results of Machine Learning and Artificial Intelligence, Deep Learning in particular, could give the impression that Data Scientist are like magician. Just think of it. Recognising faces of people, translating from one language to another, diagnosing diseases from images, computing which product should be shown for us next to buy and so on from numbers only. Numbers which existed for centuries. What a perfect illusion. But it is only an illusion, as Data Scientist existed as well for centuries. However, there is a difference between the one from today compared to the one from the past: evolution.

The main activity of Data Scientist is to work with information also called data. Records of data are as old as mankind, but only within the 16 century did it include also numeric forms — as numbers started to gain more and more ground developing their own symbols. Numerical data, from a given phenomenon — being an experiment or the counts of sheep sold by week over the year –, was from early on saved in tabular form. Such a way to record data is interlinked with the supposition that information can be extracted from it, that knowledge — in form of functions — is hidden and awaits to be discovered. Collecting data and determining the function best fitting them let scientist to new insight into the law of nature right away: Galileo’s velocity law, Kepler’s planetary law, Newton theory of gravity etc.

Such incredible results where not possible without the data. In the past, one was able to collect data only as a scientist, an academic. In many instances, one needed to perform the experiment by himself. Gathering data was tiresome and very time consuming. No sensor which automatically measures the temperature or humidity, no computer on which all the data are written with the corresponding time stamp and are immediately available to be analysed. No, everything was performed manually: from the collection of the data to the tiresome computation.

More then that. Just think of Michael Faraday and Hermann Hertz and there experiments. Such endeavour where what we will call today an one-man-show. Both of them developed parts of the needed physics and tools, detailed the needed experiment settings, conducting the experiment and collect the data and, finally, computing the results. The same is true for many other experiments of their time. In biology Charles Darwin makes its case regarding evolution from the data collected in his expeditions on board of the Beagle over a period of 5 years, or Gregor Mendel which carry out a study of pea regarding the inherence of traits. In physics Blaise Pascal used the barometer to determine the atmospheric pressure or in chemistry Antoine Lavoisier discovers from many reaction in closed container that the total mass does not change over time. In that age, one person was enough to perform everything and was the reason why the last part, of a data scientist, could not be thought of without the rest. It was inseparable from the rest of the phenomenon.

With the advance of technology, theory and experimental tools was a specialisation gradually inescapable. As the experiments grow more and more complex, the background and condition in which the experiments were performed grow more and more complex. Newton managed to make first observation on light with a simple prism, but observing the line and bands from the light of the sun more than a century and half later by Joseph von Fraunhofer was a different matter. The small improvements over the centuries culminated in experiments like CERN or the Human Genome Project which would be impossible to be carried out by one person alone. Not only was it necessary to assign a different person with special skills for a separate task or subtask, but entire teams. CERN employs today around 17 500 people. Only in such a line of specialisation can one concentrate only on one task alone. Thus, some will have just the knowledge about the theory, some just of the tools of the experiment, other just how to collect the data and, again, some other just how to analyse best the recorded data.

If there is a specialisation regarding every part of the experiment, what makes Data Scientist so special? It is impossible to validate a theory, deciding which market strategy is best without the work of the Data Scientist. It is the reason why one starts today recording data in the first place. Not only the size of the experiment has grown in the past centuries, but also the size of the data. Gauss manage to determine the orbit of Ceres with less than 20 measurements, whereas the new picture about the black hole took 5 petabytes of recorded data. To put this in perspective, 1.5 petabytes corresponds to 33 billion photos or 66.5 years of HD-TV videos. If one includes also the time to eat and sleep, than 5 petabytes would be enough for a life time.

For Faraday and Hertz, and all the other scientist of their time, the goal was to find some relationship in the scarce data they painstakingly recorded. Due to time limitations, no special skills could be developed regarding only the part of analysing data. Not only are Data Scientist better equipped as the scientist of the past in analysing data, but they managed to develop new methods like Deep Learning, which have no mathematical foundation yet in spate of their success. Data Scientist developed over the centuries to the seldom branch of science which bring together what the scientific specialisation was forced to split.

What was impossible to conceive in the 19 century, became more and more a reality at the end of the 20 century and developed to a stand alone discipline at the beginning of the 21 century. Such a development is not only natural, but also the ground for the development of A.I. in general. The mathematical tools needed for such an endeavour where already developed by the half of the 20 century in the period when computing power was scars. Although the mathematical methods were present for everyone, to understand them and learn how to apply them developed quite differently within every individual field in which Machine Learning/A.I. was applied. The way the same method would be applied by a physicist, a chemist, a biologist or an economist would differ so radical, that different words emerged which lead to different langues for similar algorithms. Even today, when Data Science has became a independent branch, two different Data Scientists from different application background could find it difficult to understand each other only from a language point of view. The moment they look at the methods and code the differences will slowly melt away.

Finding a universal language for Data Science is one of the next important steps in the development of A.I. Then it would be possible for a Data Scientist to successfully finish a project in industry, turn to a new one in physics, then biology and returning to industry without much need to learn special new languages in order to be able to perform each tasks. It would be possible to concentrate on that what a Data Scientist does best: find the best algorithm. In other words, a Data Scientist could resolve problems independent of the background the problem was stated.

This is the most important aspect that distinguish the Data Scientist. A mathematician is limited to solve problems in mathematics alone, a physicist is able to solve problems only in physics, a biologist problems only in biology. With a unique language regarding the methods and strategies to solve Machine Learning/A.I. problems, a Data Scientist can solve a problem independent of the field. Specialisation put different branches of science at drift from each other, but it is the evolution of the role of the Data Scientist to synthesize from all of them and find the quintessence in a language which transpire beyond all the field of science. The emerging language of Data Science is a new building block, a new mathematical language of nature.

Although such a perspective does not yet exists, the principal component of Machine Learning/A.I. already have such proprieties partially in form of data. Because predicting for example the numbers of eggs sold by a company or the numbers of patients which developed immune bacteria to a specific antibiotic in all hospital in a country can be performed by the same prediction method. The data do not carry any information about the entities which are being predicted. It does not matter anymore if the data are from Faraday’s experiment, CERN of Human Genome. The same data set and its corresponding prediction could stand literary for anything. Thus, the result of the prediction — what we would call for a human being intuition and/or estimation — would be independent of the domain, the area of knowledge it originated.

It also lies at the very heart of A.I., the dream of researcher to create self acting entities, that is machines with consciousness. This implies that the algorithms must be able to determine which task, model is relevant at a given moment. It would be to cumbersome to have a model for every task and and every field and then try to connect them all in one. The independence of scientific language, like of data, is thus a mandatory step. It also means that developing A.I. is not only connected to develop a new consciousness, but, and most important, to the development of our one.

Visual Question Answering with Keras – Part 2: Making Computers Intelligent to answer from images

Making Computers Intelligent to answer from images

This is my second blog on Visual Question Answering, in the last blog, I have introduced to VQA, available datasets and some of the real-life applications of VQA. If you have not gone through then I would highly recommend you to go through it. Click here for more details about it.

In this blog post, I will walk through the implementation of VQA in Keras.

You can download the dataset from here: https://visualqa.org/index.html. All my experiments were performed with VQA v2 and I have used a very tiny subset of entire dataset i.e all samples for training and testing from the validation set.

Table of contents:

  1. Preprocessing Data
  2. Process overview for VQA
  3. Data Preprocessing – Images
  4. Data Preprocessing through the spaCy library- Questions
  5. Model Architecture
  6. Defining model parameters
  7. Evaluating the model
  8. Final Thought
  9. References

NOTE: The purpose of this blog is not to get the state-of-art performance on VQA. But the idea is to get familiar with the concept. All my experiments were performed with the validation set only.

Full code on my Github here.


1. Preprocessing Data:

If you have downloaded the dataset then the question and answers (called as annotations) are in JSON format. I have provided the code to extract the questions, annotations and other useful information in my Github repository. All extracted information is stored in .txt file format. After executing code the preprocessing directory will have the following structure.

All text files will be used for training.

 

2. Process overview for VQA:

As we have discussed in previous post visual question answering is broken down into 2 broad-spectrum i.e. vision and text.  I will represent the Neural Network approach to this problem using the Convolutional Neural Network (for image data) and Recurrent Neural Network(for text data). 

If you are not familiar with RNN (more precisely LSTM) then I would highly recommend you to go through Colah’s blog and Andrej Karpathy blog. The concepts discussed in this blogs are extensively used in my post.

The main idea is to get features for images from CNN and features for the text from RNN and finally combine them to generate the answer by passing them through some fully connected layers. The below figure shows the same idea.

 

I have used VGG-16 to extract the features from the image and LSTM layers to extract the features from questions and combining them to get the answer.

3. Data Preprocessing – Images:

Images are nothing but one of the input to our model. But as you already may know that before feeding images to the model we need to convert into the fixed-size vector.

So we need to convert every image into a fixed-size vector then it can be fed to the neural network. For this, we will use the VGG-16 pretrained model. VGG-16 model architecture is trained on millions on the Imagenet dataset to classify the image into one of 1000 classes. Here our task is not to classify the image but to get the bottleneck features from the second last layer.

Hence after removing the softmax layer, we get a 4096-dimensional vector representation (bottleneck features) for each image.

Image Source: https://www.cs.toronto.edu/~frossard/post/vgg16/

 

For the VQA dataset, the images are from the COCO dataset and each image has unique id associated with it. All these images are passed through the VGG-16 architecture and their vector representation is stored in the “.mat” file along with id. So in actual, we need not have to implement VGG-16 architecture instead we just do look up into file with the id of the image at hand and we will get a 4096-dimensional vector representation for the image.

4. Data Preprocessing through the spaCy library- Questions:

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. As we have converted images into a fixed 4096-dimensional vector we also need to convert questions into a fixed-size vector representation. For installing spaCy click here

You might know that for training word embeddings in Keras we have a layer called an Embedding layer which takes a word and embeds it into a higher dimensional vector representation. But by using the spaCy library we do not have to train the get the vector representation in higher dimensions.

 

This model is actually trained on billions of tokens of the large corpus. So we just need to call the vector method of spaCy class and will get vector representation for word.

After fitting, the vector method on tokens of each question will get the 300-dimensional fixed representation for each word.

5. Model Architecture:

In our problem the input consists of two parts i.e an image vector, and a question, we cannot use the Sequential API of the Keras library. For this reason, we use the Functional API which allows us to create multiple models and finally merge models.

The below picture shows the high-level architecture idea of submodules of neural network.

After concatenating the 2 different models the summary will look like the following.

The below plot helps us to visualize neural network architecture and to understand the two types of input:

 

6. Defining model parameters:

The hyperparameters that we are going to use for our model is defined as follows:

If you know what this parameter means then you can play around it and can get better results.

Time Taken: I used the GPU on https://colab.research.google.com and hence it took me approximately 2 hours to train the model for 5 epochs. However, if you train it on a PC without GPU, it could take more time depending on the configuration of your machine.

7. Evaluating the model:

Since I have used the very small dataset for performing these experiments I am not able to get very good accuracy. The below code will calculate the accuracy of the model.

 

Since I have trained a model multiple times with different parameters you will not get the same accuracy as me. If you want you can directly download mode.h5 file from my google drive.

 

8. Final Thoughts:

One of the interesting thing about VQA is that it a completely new field. So there is absolutely no end to what you can do to solve this problem. Below are some tips while replicating the code.

  1. Start with a very small subset of data: When you start implementing I suggest you start with a very small amount of data. Because once you are ready with the whole setup then you can scale it any time.
  2. Understand the code: Understanding code line by line is very much helpful to match your theoretical knowledge. So for that, I suggest you can take very few samples(maybe 20 or less) and run a small chunk (2 to 3 lines) of code to get the functionality of each part.
  3. Be patient: One of the mistakes that I did while starting with this project was to do everything at one go. If you get some error while replicating code spend 4 to 5 days harder on that. Even after that if you won’t able to solve, I would suggest you resume after a break of 1 or 2 days. 

VQA is the intersection of NLP and CV and hopefully, this project will give you a better understanding (more precisely practically) with most of the deep learning concepts.

If you want to improve the performance of the model below are few tips you can try:

  1. Use larger datasets
  2. Try Building more complex models like Attention, etc
  3. Try using other pre-trained word embeddings like Glove 
  4. Try using a different architecture 
  5. Do more hyperparameter tuning

The list is endless and it goes on.

In the blog, I have not provided the complete code you can get it from my Github repository.

9. References:

  1. https://blog.floydhub.com/asking-questions-to-images-with-deep-learning/
  2. https://tryolabs.com/blog/2018/03/01/introduction-to-visual-question-answering/
  3. https://github.com/sominwadhwa/vqamd_floyd

Marketing Attribution Models

Why do we need attribution?

Attributionis the process of distributing the value of a purchase between the various channels, used in the funnel chain. It allows you to determine the role of each channel in profit. It is used to assess the effectiveness of campaigns, to identify more priority sources. The competent choice of the model makes it possible to optimally distribute the advertising budget. As a result, the business gets more profit and less expenses.

What models of attribution exist

The choice of the appropriate model is an important issue, because depending on the business objectives, it is better to fit something different. For example, for companies that have long been present in the industry, the priority is to know which sources contribute to the purchase. Recognition is the importance for brands entering the market. Thus, incorrect prioritization of sources may cause a decrease in efficiency. Below are the models that are widely used in the market. Each of them is guided by its own logic, it is better suited for different businesses.

First Interaction (First Click)

The value is given to the first touch. It is suitable only for several purposes and does not make it possible to evaluate the role of each component in making a purchase. It is chosen by brands who want to increase awareness and reach.

Advantages

It does not require knowledge of programming, so the introduction of a business is not difficult. A great option that effectively assesses campaigns, aimed at creating awareness and demand for new products.

Disadvantages

It limits the ability to analyze comprehensively all channels that is used to promote a brand. It gives value to the first interaction channel, ignoring the rest.

Who is suitable for?

Suitable for those who use the promotion to increase awareness, the formation of a positive image. Also allows you to find the most effective source.

Last Interaction (Last Click)

It gives value to the last channel with which the consumer interacted before making the purchase. It does not take into account the actions that the user has done up to this point, what marketing activities he encountered on the way to conversion.

Advantages

The tool is widely used in the market, it is not difficult. It solves the problem of small advertising campaigns, where is no more than 3 sources.

Disadvantages

There is no way to track how other channels have affected the acquisition.

Who is suitable for?

It is suitable for business models that have a short purchase cycle. This may be souvenirs, seasonal offers, etc.

Last Non-Direct Click

It is the default in Google Analytics. 100% of the  conversion value gives the last channel that interacted with the buyer before the conversion. However, if this source is Direct, then assumptions are counted.

Suppose a person came from an email list, bookmarked a product, because at that time it was not possible to place an order. After a while he comes back and makes a purchase. In this case, email as a channel for attracting users would be underestimated without this model.

Who is suitable for?

It is perfect for beginners who are afraid of making a mistake in the assessment. Because it allows you to form a general idea of ​​the effectiveness of all the involved channels.

Linear model attribution (Linear model)

The value of the conversion is divided in equal parts between all available channels.

Linear model attribution (Linear model)

Advantages

More advanced model than previous ones, however, characterized by simplicity. It takes into account all the visits before the acquisition.

Disadvantages

Not suitable for reallocating the budget between the channels. This is due to the fact that the effectiveness of sources may differ significantly and evenly divide – it is not the best idea. 

Who is suitable for?

It is performing well for businesses operating in the B2B sector, which plays a great importance to maintain contact with the customer during the entire cycle of the funnel.

Taking into account the interaction duration (Time Decay)

A special feature of the model is the distribution of the value of the purchase between the available channels by increment. Thus, the source, that is at the beginning of the chain, is given the least value, the channel at the end deserves the greatest value.  

Advantages

Value is shared between all channel. The highest value is given to the source that pushed the user to make a purchase.

Disadvantages

There is no fair assessment of the effectiveness of the channels, that have made efforts to obtain the desired result.

Who is suitable for?

It is ideal for evaluating the effectiveness of advertising campaigns with a limited duration.

Position-Based or U-Shaped

40% receive 2 channels, which led the user and pushed him to purchase. 20% share among themselves the intermediate sources that participated in the chain.

Advantages

Most of the value is divided equally between the key channels – the fact that attracted the user and closed the deal..

Disadvantages

Underestimated intermediate channels.It happens that they make it possible to more effectively promote the user chain.. Because they allow you to subscribe to the newsletter or start following the visitor for price reduction, etc.

Who is suitable for?

Interesting for businesses that focus on attracting new audiences, as well as pushing existing customers to buy.

Cons of standard attribution models

According to statistics, only 44% of foreign experts use attribution on the last interaction. Speaking about the domestic market, we can announce the numbers are much higher. However, only 18% of marketers use more complex models. There is also evidence which demonstrates that 72.4% of those who use attribution based on the last interaction, they use it not because of efficiency, but because it is simple.

What leads to a similar state of affairs?

Experts do not understand the effectiveness. Ignorance of how more complex models work leads to a lack of understanding of the real benefits for the business.

Attribution management is distributed among several employees. In view of this, different models can be used simultaneously. This approach greatly distorts the data obtained, not allowing an objective assessment of the effect of channels.

No comprehensive data storage. Information is stored in different places and does not take into account other channels. Using the analytics of the advertising office, it is impossible to work with customers in retail outlets.

You may find ways to eliminate these moments and attribution will work for the benefit of the business.

What algorithmic attribution models exist

Using one channel, there is no need to enable complex models. Attribution will be enough for the last interaction. It has everything to evaluate the effectiveness of the campaign, determine the profitability, understand the benefits for the business.

Moreover, if the number of channels increases significantly, and goals are already far beyond recognition, it will be better to give preference to more complex models. They allow you to collect all the information in one place, open up limitless monitoring capabilities, make it clear how one channel affects the other and which bundles work better together.

Below are the well-known and widely used today algorithmic attribution models.

Data-Driven Attribution

A model that allows you to track all the way that the consumer has done before making a purchase. It objectively evaluates each channel and does not take into account the position of the source in the funnel. It demonstrates how a certain interaction affected the outcome. Data-Driven attribution model is used in Google Analytics 360.

With it, you can work efficiently with channels that are underestimated in simpler models. It gives the opportunity to distribute the advertising budget correctly.

Attribution based on Markov’s Chains (Markov Chains)

Markov’s chain has been used for a long time to predict weather, matches, etc. The model allows you to find out, how the lack of a channel will affect sales. Its advantage is the ability to assess the impact of the source on the conversion, to find out which channel brings the best results.

A great option for companies that store data in one service. To implement requires knowledge of programming. It has one drawback in the form of underestimating the first channel in the chain. 

OWOX BI Attribution

OWOX BI Attribution helps you assess the mutual influence of channels on encouraging a customer through the funnel and achieving a conversion.

What information can be processed:

  • Upload user data from Google Analytics using flexible built-in tools.
  • Process information from various advertising services.
  • Integrate the model with CRM systems.

This approach makes it possible not to lose sight of any channel. Analyze the complex impact of marketing tools, correctly distributing the advertising budget.

The model uses CRM information, which makes it possible to do end-to-end analytics. Each user is assigned an identifier, so no matter what device he came from, you can track the chain of actions and understand that it is him. This allows you to see the overall effect of each channel on the conversion.

Advantages

Provides an integrated approach to assessing the effectiveness of channels, allows you to identify consumers, even with different devices, view all visits. It helps to determine where the user came from, what prompted him to do so. With it, you can control the execution of orders in CRM, to estimate the margin. To evaluate in combination with other models in order to determine the highest priority advertising campaigns that bring the most profit.

Disadvantages

It is impossible to objectively evaluate the first step of the chain.

Who is suitable for?

Suitable for all businesses that aim to account for each step of the chain and the qualitative assessment of all advertising channels.

Conclusion

The above-mentioned Ad Roll study shows that 70% of marketing managers find it difficult to use the results obtained from attribution. Moreover, there will be no result without it.

To obtain a realistic assessment of the effectiveness of marketing activities, do the following:

  • Determine priority KPIs.
  • Appoint a person responsible for evaluating advertising campaigns.
  • Define a user funnel chain.
  • Keep track of all data, online and offline. 
  • Make a diagnosis of incoming data.
  • Find the best attribution model for your business.
  • Use the data to make decisions.

Visual Question Answering with Keras – Part 1

This is Part I of II of the Article Series Visual Question Answering with Keras

Making Computers Intelligent to answer from images

If we look closer in the history of Artificial Intelligence (AI), the Deep Learning has gained more popularity in the recent years and has achieved the human-level performance in the tasks such as Speech Recognition, Image Classification, Object Detection, Machine Translation and so on. However, as humans, not only we but also a five-year child can normally perform these tasks without much inconvenience. But the development of such systems with these capabilities has always considered an ambitious goal for the researchers as well as for developers.

In this series of blog posts, I will cover an introduction to something called VQA (Visual Question Answering), its available datasets, the Neural Network approach for VQA and its implementation in Keras and the applications of this challenging problem in real life. 

Table of Contents:

1 Introduction

2 What is exactly Visual Question Answering?

3 Prerequisites

4 Datasets available for VQA

4.1 DAQUAR Dataset

4.2 CLEVR Dataset

4.3 FigureQA Dataset

4.4 VQA Dataset

5 Real-life applications of VQA

6 Conclusion

 

  1. Introduction:

Let’s say you are given a below picture along with one question. Can you answer it?

I expect confidently you all say it is the Kitchen without much inconvenience which is also the right answer. Even a five-year child who just started to learn things might answer this question correctly.

Alright, but can you write a computer program for such type of task that takes image and question about the image as an input and gives us answer as output?

Before the development of the Deep Neural Network, this problem was considered as one of the difficult, inconceivable and challenging problem for the AI researcher’s community. However, due to the recent advancement of Deep Learning the systems are capable of answering these questions with the promising result if we have a required dataset.

Now I hope you have got at least some intuition of a problem that we are going to discuss in this series of blog posts. Let’s try to formalize the problem in the below section.

  1. What is exactly Visual Question Answering?:

We can define, “Visual Question Answering(VQA) is a system that takes an image and natural language question about the image as an input and generates natural language answer as an output.”

VQA is a research area that requires an understanding of vision(Computer Vision)  as well as text(NLP). The main beauty of VQA is that the reasoning part is performed in the context of the image. So if we have an image with the corresponding question then the system must able to understand the image well in order to generate an appropriate answer. For example, if the question is the number of persons then the system must able to detect faces of the persons. To answer the color of the horse the system need to detect the objects in the image. Many of these common problems such as face detection, object detection, binary object classification(yes or no), etc. have been solved in the field of Computer Vision with good results.

To summarize a good VQA system must be able to address the typical problems of CV as well as NLP.

To get a better feel of VQA you can try online VQA demo by CloudCV. You just go to this link and try uploading the picture you want and ask the related question to the picture, the system will generate the answer to it.

 

  1. Prerequisites:

In the next post, I will walk you through the code for this problem using Keras. So I assume that you are familiar with:

  1. Fundamental concepts of Machine Learning
  2. Multi-Layered Perceptron
  3. Convolutional Neural Network
  4. Recurrent Neural Network (especially LSTM)
  5. Gradient Descent and Backpropagation
  6. Transfer Learning
  7. Hyperparameter Optimization
  8. Python and Keras syntax
  1. Datasets available for VQA:

As you know problems related to the CV or NLP the availability of the dataset is the key to solve the problem. The complex problems like VQA, the dataset must cover all possibilities of questions answers in real-world scenarios. In this section, I will cover some of the datasets available for VQA.

4.1 DAQUAR Dataset:

The DAQUAR dataset is the first dataset for VQA that contains only indoor scenes. It shows the accuracy of 50.2% on the human baseline. It contains images from the NYU_Depth dataset.

Example of DAQUAR dataset

Example of DAQUAR dataset

The main disadvantage of DAQUAR is the size of the dataset is very small to capture all possible indoor scenes.

4.2 CLEVR Dataset:

The CLEVR Dataset from Stanford contains the questions about the object of a different type, colors, shapes, sizes, and material.

It has

  • A training set of 70,000 images and 699,989 questions
  • A validation set of 15,000 images and 149,991 questions
  • A test set of 15,000 images and 14,988 questions

Image Source: https://cs.stanford.edu/people/jcjohns/clevr/?source=post_page

 

4.3 FigureQA Dataset:

FigureQA Dataset contains questions about the bar graphs, line plots, and pie charts. It has 1,327,368 questions for 100,000 images in the training set.

4.4 VQA Dataset:

As comapred to all datasets that we have seen so far VQA dataset is relatively larger. The VQA dataset contains open ended as well as multiple choice questions. VQA v2 dataset contains:

  • 82,783 training images from COCO (common objects in context) dataset
  • 40, 504 validation images and 81,434 validation images
  • 443,757 question-answer pairs for training images
  • 214,354 question-answer pairs for validation images.

As you might expect this dataset is very huge and contains 12.6 GB of training images only. I have used this dataset in the next post but a very small subset of it.

This dataset also contains abstract cartoon images. Each image has 3 questions and each question has 10 multiple choice answers.

  1. Real-life applications of VQA:

There are many applications of VQA. One of the famous applications is to help visually impaired people and blind peoples. In 2016, Microsoft has released the “Seeing AI” app for visually impaired people to describe the surrounding environment around them. You can watch this video for the prototype of the Seeing AI app.

Another application could be on social media or e-commerce sites. VQA can be also used for educational purposes.

  1. Conclusion:

I hope this explanation will give you a good idea of Visual Question Answering. In the next blog post, I will walk you through the code in Keras.

If you like my explanations, do provide some feedback, comments, etc. and stay tuned for the next post.

A Bird’s Eye View: How Machine Learning Can Help You Charge Your E-Scooters

Bird scooters in Columbus, Ohio

Bird scooters in Columbus, Ohio

Ever since I started using bike-sharing to get around in Seattle, I have become fascinated with geolocation data and the transportation sharing economy. When I saw this project leveraging the mobility data RESTful API from the Los Angeles Department of Transportation, I was eager to dive in and get my hands dirty building a data product utilizing a company’s mobility data API.

Unfortunately, the major bike and scooter providers (Bird, JUMP, Lime) don’t have publicly accessible APIs. However, some folks have seemingly been able to reverse-engineer the Bird API used to populate the maps in their Android and iOS applications.

One interesting feature of this data is the nest_id, which indicates if the Bird scooter is in a “nest” — a centralized drop-off spot for charged Birds to be released back into circulation.

I set out to ask the following questions:

  1. Can real-time predictions be made to determine if a scooter is currently in a nest?
  2. For non-nest scooters, can new nest location recommendations be generated from geospatial clustering?

To answer these questions, I built a full-stack machine learning web application, NestGenerator, which provides an automated recommendation engine for new nest locations. This application can help power Bird’s internal nest location generation that runs within their Android and iOS applications. NestGenerator also provides real-time strategic insight for Bird chargers who are enticed to optimize their scooter collection and drop-off route based on proximity to scooters and nest locations in their area.

Bird

The electric scooter market has seen substantial growth with Bird’s recent billion dollar valuation  and their $300 million Series C round in the summer of 2018. Bird offers electric scooters that top out at 15 mph, cost $1 to unlock and 15 cents per minute of use. Bird scooters are in over 100 cities globally and they announced in late 2018 that they eclipsed 10 million scooter rides since their launch in 2017.

Bird scooters in Tel Aviv, Israel

Bird scooters in Tel Aviv, Israel

With all of these scooters populating cities, there’s much-needed demand for people to charge them. Since they are electric, someone needs to charge them! A charger can earn additional income for charging the scooters at their home and releasing them back into circulation at nest locations. The base price for charging each Bird is $5.00. It goes up from there when the Birds are harder to capture.

Data Collection and Machine Learning Pipeline

The full data pipeline for building “NestGenerator”

Data

From the details here, I was able to write a Python script that returned a list of Bird scooters within a specified area, their geolocation, unique ID, battery level and a nest ID.

I collected scooter data from four cities (Atlanta, Austin, Santa Monica, and Washington D.C.) across varying times of day over the course of four weeks. Collecting data from different cities was critical to the goal of training a machine learning model that would generalize well across cities.

Once equipped with the scooter’s latitude and longitude coordinates, I was able to leverage additional APIs and municipal data sources to get granular geolocation data to create an original scooter attribute and city feature dataset.

Data Sources:

  • Walk Score API: returns a walk score, transit score and bike score for any location.
  • Google Elevation API: returns elevation data for all locations on the surface of the earth.
  • Google Places API: returns information about places. Places are defined within this API as establishments, geographic locations, or prominent points of interest.
  • Google Reverse Geocoding API: reverse geocoding is the process of converting geographic coordinates into a human-readable address.
  • Weather Company Data: returns the current weather conditions for a geolocation.
  • LocationIQ: Nearby Points of Interest (PoI) API returns specified PoIs or places around a given coordinate.
  • OSMnx: Python package that lets you download spatial geometries and model, project, visualize, and analyze street networks from OpenStreetMap’s APIs.

Feature Engineering

After extensive API wrangling, which included a four-week prolonged data collection phase, I was finally able to put together a diverse feature set to train machine learning models. I engineered 38 features to classify if a scooter is currently in a nest.

Full Feature Set

Full Feature Set

The features boiled down into four categories:

  • Amenity-based: parks within a given radius, gas stations within a given radius, walk score, bike score
  • City Network Structure: intersection count, average circuity, street length average, average streets per node, elevation level
  • Distance-based: proximity to closest highway, primary road, secondary road, residential road
  • Scooter-specific attributes: battery level, proximity to closest scooter, high battery level (> 90%) scooters within a given radius, total scooters within a given radius

 

Log-Scale Transformation

For each feature, I plotted the distribution to explore the data for feature engineering opportunities. For features with a right-skewed distribution, where the mean is typically greater than the median, I applied these log transformations to normalize the distribution and reduce the variability of outlier observations. This approach was used to generate a log feature for proximity to closest scooter, closest highway, primary road, secondary road, and residential road.

An example of a log transformation

Statistical Analysis: A Systematic Approach

Next, I wanted to ensure that the features I included in my model displayed significant differences when broken up by nest classification. My thinking was that any features that did not significantly differ when stratified by nest classification would not have a meaningful predictive impact on whether a scooter was in a nest or not.

Distributions of a feature stratified by their nest classification can be tested for statistically significant differences. I used an unpaired samples t-test with a 0.01% significance level to compute a p-value and confidence interval to determine if there was a statistically significant difference in means for a feature stratified by nest classification. I rejected the null hypothesis if a p-value was smaller than the 0.01% threshold and if the 99.9% confidence interval did not straddle zero. By rejecting the null-hypothesis in favor of the alternative hypothesis, it’s deemed there is a significant difference in means of a feature by nest classification.

Battery Level Distribution Stratified by Nest Classification to run a t-test

Battery Level Distribution Stratified by Nest Classification to run a t-test

Log of Closest Scooter Distribution Stratified by Nest Classification to run a t-test

Throwing Away Features

Using the approach above, I removed ten features that did not display statistically significant results.

Statistically Insignificant Features Removed Before Model Development

Model Development

I trained two models, a random forest classifier and an extreme gradient boosting classifier since tree-based models can handle skewed data, capture important feature interactions, and provide a feature importance calculation. I trained the models on 70% of the data collected for all four cities and reserved the remaining 30% for testing.

After hyper-parameter tuning the models for performance on cross-validation data it was time to run the models on the 30% of test data set aside from the initial data collection.

I also collected additional test data from other cities (Columbus, Fort Lauderdale, San Diego) not involved in training the models. I took this step to ensure the selection of a machine learning model that would generalize well across cities. The performance of each model on the additional test data determined which model would be integrated into the application development.

Performance on Additional Cities Test Data

The Random Forest Classifier displayed superior performance across the board

The Random Forest Classifier displayed superior performance across the board

I opted to move forward with the random forest model because of its superior performance on AUC score and accuracy metrics on the additional cities test data. AUC is the Area under the ROC Curve, and it provides an aggregate measure of model performance across all possible classification thresholds.

AUC Score on Test Data for each Model

AUC Score on Test Data for each Model

Feature Importance

Battery level dominated as the most important feature. Additional important model features were proximity to high level battery scooters, proximity to closest scooter, and average distance to high level battery scooters.

Feature Importance for the Random Forest Classifier

Feature Importance for the Random Forest Classifier

The Trade-off Space

Once I had a working machine learning model for nest classification, I started to build out the application using the Flask web framework written in Python. After spending a few days of writing code for the application and incorporating the trained random forest model, I had enough to test out the basic functionality. I could finally run the application locally to call the Bird API and classify scooter’s into nests in real-time! There was one huge problem, though. It took more than seven minutes to generate the predictions and populate in the application. That just wasn’t going to cut it.

The question remained: will this model deliver in a production grade environment with the goal of making real-time classifications? This is a key trade-off in production grade machine learning applications where on one end of the spectrum we’re optimizing for model performance and on the other end we’re optimizing for low latency application performance.

As I continued to test out the application’s performance, I still faced the challenge of relying on so many APIs for real-time feature generation. Due to rate-limiting constraints and daily request limits across so many external APIs, the current machine learning classifier was not feasible to incorporate into the final application.

Run-Time Compliant Application Model

After going back to the drawing board, I trained a random forest model that relied primarily on scooter-specific features which were generated directly from the Bird API.

Through a process called vectorization, I was able to transform the geolocation distance calculations utilizing NumPy arrays which enabled batch operations on the data without writing any “for” loops. The distance calculations were applied simultaneously on the entire array of geolocations instead of looping through each individual element. The vectorization implementation optimized real-time feature engineering for distance related calculations which improved the application response time by a factor of ten.

Feature Importance for the Run-time Compliant Random Forest Classifier

Feature Importance for the Run-time Compliant Random Forest Classifier

This random forest model generalized well on test-data with an AUC score of 0.95 and an accuracy rate of 91%. The model retained its prediction accuracy compared to the former feature-rich model, but it gained 60x in application performance. This was a necessary trade-off for building a functional application with real-time prediction capabilities.

Geospatial Clustering

Now that I finally had a working machine learning model for classifying nests in a production grade environment, I could generate new nest locations for the non-nest scooters. The goal was to generate geospatial clusters based on the number of non-nest scooters in a given location.

The k-means algorithm is likely the most common clustering algorithm. However, k-means is not an optimal solution for widespread geolocation data because it minimizes variance, not geodetic distance. This can create suboptimal clustering from distortion in distance calculations at latitudes far from the equator. With this in mind, I initially set out to use the DBSCAN algorithm which clusters spatial data based on two parameters: a minimum cluster size and a physical distance from each point. There were a few issues that prevented me from moving forward with the DBSCAN algorithm.

  1. The DBSCAN algorithm does not allow for specifying the number of clusters, which was problematic as the goal was to generate a number of clusters as a function of non-nest scooters.
  2. I was unable to hone in on an optimal physical distance parameter that would dynamically change based on the Bird API data. This led to suboptimal nest locations due to a distortion in how the physical distance point was used in clustering. For example, Santa Monica, where there are ~15,000 scooters, has a higher concentration of scooters in a given area whereas Brookline, MA has a sparser set of scooter locations.

An example of how sparse scooter locations vs. highly concentrated scooter locations for a given Bird API call can create cluster distortion based on a static physical distance parameter in the DBSCAN algorithm. Left:Bird scooters in Brookline, MA. Right:Bird scooters in Santa Monica, CA.

An example of how sparse scooter locations vs. highly concentrated scooter locations for a given Bird API call can create cluster distortion based on a static physical distance parameter in the DBSCAN algorithm. Left:Bird scooters in Brookline, MA. Right:Bird scooters in Santa Monica, CA.

Given the granularity of geolocation scooter data I was working with, geospatial distortion was not an issue and the k-means algorithm would work well for generating clusters. Additionally, the k-means algorithm parameters allowed for dynamically customizing the number of clusters based on the number of non-nest scooters in a given location.

Once clusters were formed with the k-means algorithm, I derived a centroid from all of the observations within a given cluster. In this case, the centroids are the mean latitude and mean longitude for the scooters within a given cluster. The centroids coordinates are then projected as the new nest recommendations.

NestGenerator showcasing non-nest scooters and new nest recommendations utilizing the K-Means algorithm

NestGenerator showcasing non-nest scooters and new nest recommendations utilizing the K-Means algorithm.

NestGenerator Application

After wrapping up the machine learning components, I shifted to building out the remaining functionality of the application. The final iteration of the application is deployed to Heroku’s cloud platform.

In the NestGenerator app, a user specifies a location of their choosing. This will then call the Bird API for scooters within that given location and generate all of the model features for predicting nest classification using the trained random forest model. This forms the foundation for map filtering based on nest classification. In the app, a user has the ability to filter the map based on nest classification.

Drop-Down Map View filtering based on Nest Classification

Drop-Down Map View filtering based on Nest Classification

Nearest Generated Nest

To see the generated nest recommendations, a user selects the “Current Non-Nest Scooters & Predicted Nest Locations” filter which will then populate the application with these nest locations. Based on the user’s specified search location, a table is provided with the proximity of the five closest nests and an address of the Nest location to help inform a Bird charger in their decision-making.

NestGenerator web-layout with nest addresses and proximity to nearest generated nests

NestGenerator web-layout with nest addresses and proximity to nearest generated nests

Conclusion

By accurately predicting nest classification and clustering non-nest scooters, NestGenerator provides an automated recommendation engine for new nest locations. For Bird, this application can help power their nest location generation that runs within their Android and iOS applications. NestGenerator also provides real-time strategic insight for Bird chargers who are enticed to optimize their scooter collection and drop-off route based on scooters and nest locations in their area.

Code

The code for this project can be found on my GitHub

Comments or Questions? Please email me an E-Mail!

 

Attribution Models in Marketing

Attribution Models

A Business and Statistical Case

INTRODUCTION

A desire to understand the causal effect of campaigns on KPIs

Advertising and marketing costs represent a huge and ever more growing part of the budget of companies. Studies have found out this share is as high as 10% and increases with the size of companies (CMO study by American Marketing Association and Duke University, 2017). Measuring precisely the impact of a specific marketing campaign on the sales of a company is a critical step towards an efficient allocation of this budget. Would the return be higher for an euro spent on a Facebook ad, or should we better spend it on a TV spot? How much should I spend on Twitter ads given the volume of sales this channel is responsible for?

Attribution Models have lately received great attention in Marketing departments to answer these issues. The transition from offline to online marketing methods has indeed permitted the collection of multiple individual data throughout the whole customer journey, and  allowed for the development of user-centric attribution models. In short, Attribution Models use the information provided by Tracking technologies such as Google Analytics or Webtrekk to understand customer journeys from the first click on a Facebook ad to the final purchase and adequately ponderate the different marketing campaigns encountered depending on their responsibility in the final conversion.

Issues on Causal Effects

A key question then becomes: how to declare a channel is responsible for a purchase? In other words, how can we isolate the causal effect or incremental value of a campaign ?

          1. A/B-Tests

One method to estimate the pure impact of a campaign is the design of randomized experiments, wherein a control and treated groups are compared.  A/B tests belong to this broad category of randomized methods. Provided the groups are a priori similar in every aspect except for the treatment received, all subsequent differences may be attributed solely to the treatment. This method is typically used in medical studies to assess the effect of a drug to cure a disease.

Main practical issues regarding Randomized Methods are:

  • Assuring that control and treated groups are really similar before treatment. Uually a random assignment (i.e assuring that on a relevant set of observable variables groups are similar) is realized;
  • Potential spillover-effects, i.e the possibility that the treatment has an impact on the non-treated group as well (Stable unit treatment Value Assumption, or SUTVA in Rubin’s framework);
  • The costs of conducting such an experiment, and especially the costs linked to the deliberate assignment of individuals to a group with potentially lower results;
  • The number of such experiments to design if multiple treatments have to be measured;
  • Difficulties taking into account the interaction effects between campaigns or the effect of spending levels. Indeed, usually A/B tests are led by cutting off temporarily one campaign entirely and measuring the subsequent impact on KPI’s compared to the situation where this campaign is maintained;
  • The dynamical reproduction of experiments if we assume that treatment effects may change over time.

In the marketing context, multiple campaigns must be tested in a dynamical way, and treatment effect is likely to be heterogeneous among customers, leading to practical issues in the lauching of A/B tests to approximate the incremental value of all campaigns. However, sites with a lot of traffic and conversions can highly benefit from A/B testing as it provides a scientific and straightforward way to approximate a causal impact. Leading companies such as Uber, Netflix or Airbnb rely on internal tools for A/B testing automation, which allow them to basically test any decision they are about to make.

References:

Books:

Experiment!: Website conversion rate optimization with A/B and multivariate testing, Colin McFarland, ©2013 | New Riders  

A/B testing: the most powerful way to turn clicks into customers. Dan Siroker, Pete Koomen; Wiley, 2013.

Blogs:

https://eng.uber.com/xp

https://medium.com/airbnb-engineering/growing-our-host-community-with-online-marketing-9b2302299324

Study:

https://cmosurvey.org/wp-content/uploads/sites/15/2018/08/The_CMO_Survey-Results_by_Firm_and_Industry_Characteristics-Aug-2018.pdf

        2. Attribution models

Attribution Models do not demand to create an experimental setting. They take into account existing data and derive insights from the variability of customer journeys. One key difficulty is then to differentiate correlation and causality in the links observed between the exposition to campaigns and purchases. Indeed, selection effects may bias results as exposure to campaigns is usually dependant on user-characteristics and thus may not be necessarily independant from the customer’s baseline conversion probabilities. For example, customers purchasing from a discount price comparison website may be intrinsically different from customers buying from FB ad and this a priori difference may alone explain post-exposure differences in purchasing bahaviours. This intrinsic weakness must be remembered when interpreting Attribution Models results.

                          2.1 General Issues

The main issues regarding the implementation of Attribution Models are linked to

  • Causality and fallacious reasonning, as most models do not take into account the aforementionned selection biases.
  • Their difficult evaluation. Indeed, in almost all attribution models (except for those based on classification, where the accuracy of the model can be computed), the additionnal value brought by the use of a given attribution models cannot be evaluated using existing historical data. This additionnal value can only be approximated by analysing how the implementation of the conclusions of the attribution model have impacted a given KPI.
  • Tracking issues, leading to an uncorrect reconstruction of customer journeys
    • Cross-device journeys: cross-device issue arises from the use of different devices throughout the customer journeys, making it difficult to link datapoints. For example, if a customer searches for a product on his computer but later orders it on his mobile, the AM would then mistakenly consider it an order without prior campaign exposure. Though difficult to measure perfectly, the proportion of cross-device orders can approximate 20-30%.
    • Cookies destruction makes it difficult to track the customer his the whole journey. Both regulations and consumers’ rising concerns about data privacy issues mitigate the reliability and use of cookies.1 – From 2002 on, the EU has enacted directives concerning privacy regulation and the extended use of cookies for commercial targeting purposes, which have highly impacted marketing strategies, such as the ‘Privacy and Electronic Communications Directive’ (2002/58/EC). A research was conducted and found out that the adoption of this ‘Privacy Directive’ had led to 64% decrease in advertising methods compared to the rest of the world (Goldfarb et Tucker (2011)). The effect was stronger for generalized sites (Yahoo) than for specialized sites.2 – Users have grown more and more conscious of data privacy issues and have adopted protective measures concerning data privacy, such as automatic destruction of cookies after a session is ended, or simply giving away less personnal information (Goldfarb et Tucker (2012) ) .Valuable user information may be lost, though tracking technologies evolution have permitted to maintain tracking by other means. This issue may be particularly important in countries highly concerned with data privacy issues such as Germany.
    • Offline/Online bridge: an Attribution Model should take into account all campaigns to draw valuable insights. However, the exposure to offline campaigns (TV, newspapers) are difficult to track at the user level. One idea to tackle this issue would be to estimate the proportion of conversions led by offline campaigns through AB testing and deduce this proportion from the credit assigned to the online campaigns accounted for in the Attribution Model.
    • Touch point information available: clicks are easy to follow but irrelevant to take into account the influence of purely visual campaigns such as display ads or video.

                          2.2 Today’s main practices

Two main families of Attribution Models exist:

  • Rule-Based Attribution Models, which have been used for in the last decade but from which companies are gradualy switching.

Attribution depends on the individual journeys that have led to a purchase and is solely based on the rank of the campaign in the journey. Some models focus on a single touch points (First Click, Last Click) while others account for multi-touch journeys (Bathtube, Linear). It can be calculated at the customer level and thus doesn’t require large amounts of data points. We can distinguish two sub-groups of rule-based Attribution Models:

  • One Touch Attribution Models attribute all credit to a single touch point. The First-Click model attributes all credit for a converion to the first touch point of the customer journey; last touch attributes all credit to the last campaign.
  • Multi-touch Rule-Based Attribution Models incorporate information on the whole customer journey are thus an improvement compared to one touch models. To this family belong Linear model where credit is split equally between all channels, Bathtube model where 40% of credit is given to first and last clicks and the remaining 20% is distributed equally between the middle channels, or time-decay models where credit assigned to a click diminishes as the time between the click and the order increases..

The main advantages of rule-based models is their simplicity and cost effectiveness. The main problems are:

– They are a priori known and can thus lead to optimization strategies from competitors
– They do not take into account aggregate intelligence on customer journeys and actual incremental values.
– They tend to bias (depending on the model chosen) channels that are over-represented at the beggining or end of the funnel, according to theoretical assumptions that have no observationnal back-ups.

  • Data-Driven Attribution Models

These models take into account the weaknesses of rule-based models and make a relevant use of available data. Being data-driven, following attribution models cannot be computed using single user level data. On the contrary values are calculated through data aggregation and thus require a certain volume of customer journey information.

References:

https://dspace.mit.edu/handle/1721.1/64920

 

        3. Data-Driven Attribution Models in practice

                          3.1 Issues

Several issues arise in the computation of campaigns individual impact on a given KPI within a data-driven model.

  • Selection biases: Exposure to certain types of advertisement is usually highly correlated to non-observable variables which are in turn correlated to consumption practices. Differences in the behaviour of users exposed to different campaigns may thus only be driven by core differences in conversion probabilities between groups whether than by the campaign effect.
  • Complementarity: it may be that campaigns A and B only have an effect when combined, so that measuring their individual impact would lead to misleading conclusions. The model could then try to assess the effect of combinations of campaigns on top of the effect of individual campaigns. As the number of possible non-ordered combinations of k campaigns is 2k, it becomes clear that inclusing all possible combinations would however be time-consuming.
  • Order-sensitivity: The effect of a campaign A may depend on the place where it appears in the customer journey, meaning the rank of a campaign and not merely its presence could be accounted for in the model.
  • Relative Order-sensitivity: it may be that campaigns A and B only have an effect when one is exposed to campaign A before campaign B. If so, it could be useful to assess the effect of given combinations of campaigns as well. And this for all campaigns, leading to tremendous numbers of possible combinations.
  • All previous phenomenon may be present, increasing even more the potential complexity of a comprehensive Attribution Model. The number of all possible ordered combination of k campaigns is indeed :

 

                          3.2 Main models

                                  A) Logistic Regression and Classification models

If non converting journeys are available, Attribition Model can be shaped as a simple classification issue. Campaign types or campaigns combination and volume of campaign types can be included in the model along with customer or time variables. As we are interested in inference (on campaigns effect) whether than prediction, a parametric model should be used, such as Logistic Regression. Non paramatric models such as Random Forests or Neural Networks can also be used though the interpretation of campaigns value would be more difficult to derive from the model results.

A common pitfall is the usual issue of spurious correlations on one hand and the correct interpretation of coefficients in business terms.

An advantage if the possibility to evaluate the relevance of the model using common model validation methods to evaluate its predictive power (validation set \ AUC \pseudo R squared).

                                  B) Shapley Value

Theory

The Shapley Value is based on a Game Theory framework and is named after its creator, the Nobel Price Laureate Lloyd Shapley. Initially meant to calculate the marginal contribution of players in cooperative games, the model has received much attention in research and industry and has lately been applied to marketing issues. This model is typically used by Google Adords and other ad bidding vendors. Campaigns or marketing channels are in this model seen as compementary players looking forward to increasing a given KPI.
Contrarily to Logistic Regressions, it is a non-parametric model. Contrarily to Markov Chains, all results are built using existing journeys, and not simulated ones.

Channels are considered to enter the game sequentially under a certain joining order. Shapley value try to The Shapley value of channel i is the weighted sum of the marginal values that channel i adds to all possible coalitions that don’t contain channel i.
In other words, the main logic is to analyse the difference of gains when a channel i is added after a coalition Ck of k channels, k<=n. We then sum all the marginal contributions over all possible ordered combination Ck of all campaigns excluding i, with k<=n-1.

Subsets framework

A first an most usual way to compute the Shapley Vaue is to consider that when a channel enters coalition, its additionnal value is the same irrelevant of the order in which previous channels have appeared. In other words, journeys (A>B>C) and (B>A>C) trigger the same gains.
Shapley value is computed as the gains associated to adding a channel i to a subset of channels, weighted by the number of (ordered) sequences that the (unordered) subset represents, summed up on all possible subsets of the total set of campaigns where the channel i is not present.
The Shapley value of the channel ???????? is then:

where |S| is the number of campaigns of a coalition S and the sum extends over all subsets S that do not not contain channel j. ????(????)  is the value of the coalition S and ????(???? ∪ {????????})  the value of the coalition formed by adding ???????? to coalition S. ????(???? ∪ {????????}) − ????(????) is thus the marginal contribution of channel ???????? to the coalition S.

The formula can be rewritten and understood as:

This method is convenient when data on the gains of on all possible permutations of all unordered k subsets of the n campaigns are available. It is also more convenient if the order of campaigns prior to the introduction of a campaign is thought to have no impact.

Ordered sequences

Let us define ????((A>B)) as the value of the sequence A then B. What is we let ????((A>B)) be different from ????((B>A)) ?
This time we would need to sum over all possible permutation of the S campaigns present before  ???????? and the N-(S+1) campaigns after ????????. Doing so we will sum over all possible orderings (i.e all permutations of the n campaigns of the grand coalition containing all campaigns) and we can remove the permutation coefficient s!(p-s+1)!.

This method is convenient when the order of channels prior to and after the introduction of another channel is assumed to have an impact. It is also necessary to possess data for all possible permutations of all k subsets of the n campaigns, and not only on all (unordered) k-subsets of the n campaigns, k<=n. In other words, one must know the gains of A, B, C, A>B, B>A, etc. to compute the Shapley Value.

Differences between the two approaches

We simulate an ordered case where the value for each ordered sequence k for k<=3 is known. We compare it to the usual Shapley value calculated based on known gains of unordered subsets of campaigns. So as to compare relevant values, we have built the gains matrix so that the gains of a subset A, B i.e  ????({B,A}) is the average of the gains of ordered sequences made up with A and B (assuming the number of journeys where A>B equals the number of journeys where B>A, we have ????({B,A})=0.5( ????((A>B)) + ????((B>A)) ). We let the value of the grand coalition be different depending on the order of campaigns-keeping the constraints that it averages to the value used for the unordered case.

Note: mvA refers to the marginal value of A in a given sequence.
With traditionnal unordered coalitions:

With ordered sequences used to compute the marginal values:

 

We can see that the two approaches yield very different results. In the unordered case, the Shapley Value campaign C is the highest, culminating at 20, while A and B have the same Shapley Value mvA=mvB=15. In the ordered case, campaign A has the highest Shapley Value and all campaigns have different Shapley Values.

This example illustrates the inherent differences between the set and sequences approach to Shapley values. Real life data is more likely to resemble the ordered case as conversion probabilities may for any given set of campaigns be influenced by the order through which the campaigns appear.

Advantages

Shapley value has become popular in allocation problems in cooperative games because it is the unique allocation which satisfies different axioms:

  • Efficiency: Shaple Values of all channels add up to the total gains (here, orders) observed.
  • Symmetry: if channels A and B bring the same contribution to any coalition of campaigns, then their Shapley Value i sthe same
  • Null player: if a channel brings no additionnal gains to all coalitions, then its Shapley Value is zero
  • Strong monotony: the Shapley Value of a player increases weakly if all its marginal contributions increase weakly

These properties make the Shapley Value close to what we intuitively define as a fair attribution.

Issues

  • The Shapley Value is based on combinatory mathematics, and the number of possible coalitions and ordered sequences becomes huge when the number of campaigns increases.
  • If unordered, the Shapley Value assumes the contribution of campaign A is the same if followed by campaign B or by C.
  • If ordered, the number of combinations for which data must be available and sufficient is huge.
  • Channels rarely present or present in long journeys will be played down.
  • Generally, gains are supposed to grow with the number of players in the game. However, it is plausible that in the marketing context a journey with a high number of channels will not necessarily bring more orders than a journey with less channels involved.

References:

R package: GameTheoryAllocation

Article:
Zhao & al, 2018 “Shapley Value Methods for Attribution Modeling in Online Advertising “
https://link.springer.com/content/pdf/10.1007/s13278-017-0480-z.pdf
Courses: https://www.lamsade.dauphine.fr/~airiau/Teaching/CoopGames/2011/coopgames-7%5b8up%5d.pdf
Blogs: https://towardsdatascience.com/one-feature-attribution-method-to-supposedly-rule-them-all-shapley-values-f3e04534983d

                                  B) Markov Chains

Markov Chains are used to model random processes, i.e events that occur in a sequential manner and in such a way that the probability to move to a certain state only depends on the past steps. The number of previous steps that are taken into account to model the transition probability is called the memory parameter of the sequence, and for the model to have a solution must be comprised between 0 and 4. A Markov Chain process is thus defined entirely by its Transition Matrix and its initial vector (i.e the starting point of the process).

Markov Chains are applied in many scientific fields. Typically, they are used in weather forecasting, with the sequence of Sunny and Rainy days following a Markov Process of memory parameter 0, so that for each given day the probability that the next day will be rainy or sunny only depends on the weather of the current day. Other applications can be found in sociology to understand the dynamics of social classes intergenerational reproduction. To get more both mathematical and applied illustration, I recommend the reading of this course.

In the marketing context, Markov Chains are an interesting way to model the conversion funnel. To go from the from the Markov Model to the Attribution logic, we calculate the Removal Effect of each channel, i.e the difference in conversions that happen if the channel is removed. Please read below for an introduction to the methodology.

The first step in a Markov Chains Attribution Model is to build the transition matrix that captures the transition probabilities between the campaigns accross existing customer journeys. This Matrix is to be read as a “From state A to state B” table, from the left to the right. A first difficulty is finding the right memory parameter to use. A large memory parameter would allow to take more into account interraction effects within the conversion funnel but would lead to increased computationnal time, a non-readable transition matrix, and be more sensitive to noisy data. Please note that this transition matrix provides useful information on the conversion funnel and on the relationships between campaigns and can be used as such as an analytical tool. I suggest the clear and easily R code which can be found here or here.

Here is an illustration of a Markov Chain with memory Parameter of 0: the probability to go to a certain campaign B in the next step only depend on the campaign we are currently at:

The associated Transition Matrix is then (with null probabilities left as Blank):

The second step is  to compute the actual responsibility of a channel in total conversions. As mentionned above, the main philosophy to do so is to calculate the Removal Effect of each channel, i.e the changes in the number of conversions when a channel is entirely removed. All customer journeys which went through this channel are settled out to be unsuccessful. This calculation is done by applying the transition matrix with and without the removed channels to an initial vector that contains the number of desired simulations.

Building on our current example, we can then settle an initial vector with the desired number of simulations, e.g 10 000:

 

It is possible at this stage to add a constraint on the maximum number of times the matrix is applied to the data, i.e on the maximal number of campaigns a simulated journey is allowed to have.

Advantages

  • The dynamic journey is taken into account, as well as the transition between two states. The funnel is not assumed to be linear.
  • It is possile to build a conversion graph that maps the customer journey provides valuable insights.
  • It is possible to evaluate partly the accuracy of the Attribution Model based on Markov Chains. It is for example possible to see how well the transition matrix help predict the future by analysing the number of correct predictions at any given step over all sequences.

Disadvantages

  • It can be somewhat difficult to set the memory parameter. Complementarity effects between channels are not well taken into account if the memory is low, but a parameter too high will lead to over-sensitivity to noise in the data and be difficult to implement if customer journeys tend to have a number of campaigns below this memory parameter.
  • Long journeys with different channels involved will be overweighted, as they will count many times in the Removal Effect.  For example, if there are n-1 channels in the customer journey, this journey will be considered as failure for the n-1 channel-RE. If the volume effects (i.e the impact of the overall number of channels in a journey, irrelevant from their type° are important then results may be biased.

References:

R package: ChannelAttribution

Git:

https://github.com/MatCyt/Markov-Chain/blob/master/README.md

Course:

https://www.ssc.wisc.edu/~jmontgom/markovchains.pdf

Article:

“Mapping the Customer Journey: A Graph-Based Framework for Online Attribution Modeling”; Anderl, Eva and Becker, Ingo and Wangenheim, Florian V. and Schumann, Jan Hendrik, 2014. Available at SSRN: https://ssrn.com/abstract=2343077 or http://dx.doi.org/10.2139/ssrn.2343077

“Media Exposure through the Funnel: A Model of Multi-Stage Attribution”, Abhishek & al, 2012

“Multichannel Marketing Attribution Using Markov Chains”, Kakalejčík, L., Bucko, J., Resende, P.A.A. and Ferencova, M. Journal of Applied Management and Investments, Vol. 7 No. 1, pp. 49-60.  2018

Blogs:

https://analyzecore.com/2016/08/03/attribution-model-r-part-1

https://analyzecore.com/2016/08/03/attribution-model-r-part-2

                          3.3 To go further: Tackling selection biases with Quasi-Experiments

Exposure to certain types of advertisement is usually highly correlated to non-observable variables. Differences in the behaviour of users exposed to different campaigns may thus only be driven by core differences in converison probabilities between groups whether than by the campaign effect. These potential selection effects may bias the results obtained using historical data.

Quasi-Experiments can help correct this selection effect while still using available observationnal data.  These methods recreate the settings on a randomized setting. The goal is to come as close as possible to the ideal of comparing two populations that are identical in all respects except for the advertising exposure. However, populations might still differ with respect to some unobserved characteristics.

Common quasi-experimental methods used for instance in Public Policy Evaluation are:

  • Discontinuity Regressions
  • Matching Methods, such as Exact Matching,  Propensity-score matching or k-nearest neighbourghs.

References:

Article:

“Towards a digital Attribution Model: Measuring the impact of display advertising on online consumer behaviour”, Anindya Ghose & al, MIS Quarterly Vol. 40 No. 4, pp. 1-XX, 2016

https://pdfs.semanticscholar.org/4fa6/1c53f281fa63a9f0617fbd794d54911a2f84.pdf

        4. First Steps towards a Practical Implementation

Identify key points of interests

  • Identify the nature of touchpoints available: is the data based on clicks? If so, is there a way to complement the data with A/B tests to measure the influence of ads without clicks (display, video) ? For example, what happens to sales when display campaign is removed? Analysing this multiplier effect would give the overall responsibility of display on sales, to be deduced from current attribution values given to click-based channels. More interestingly, what is the impact of the removal of display campaign on the occurences of click-based campaigns ? This would give us an idea of the impact of display ads on the exposure to each other campaigns, which would help correct the attribution values more precisely at the campaign level.
  • Define the KPI to track. From a pure Marketing perspective, looking at purchases may be sufficient, but from a financial perspective looking at profits, though a bit more difficult to compute, may drive more interesting results.
  • Define a customer journey. It may seem obvious, but the notion needs to be clarified at first. Would it be defined by a time limit? If so, which one? Does it end when a conversion is observed? For example, if a customer makes 2 purchases, would the campaigns he’s been exposed to before the first order still be accounted for in the second order? If so, with a time decay?
  • Define the research framework: are we interested only in customer journeys which have led to conversions or in all journeys? Keep in mind that successful customer journeys are a non-representative sample of customer journeys. Models built on the analysis of biased samples may be conservative. Take an extreme example: 80% of customers who see campaign A buy the product, VS 1% for campaign B. However, campaign B exposure is great and 100 Million people see it VS only 1M for campaign A. An Attribution Model based on successful journeys will give higher credit to campaign B which is an auguable conclusion. Taking into account costs per campaign (in the case where costs are calculated by clicks) may of course tackle this issue partly, as campaign A could then exhibit higher returns, but a serious fallacious reasonning is at stake here.

Analyse the typical customer journey    

  • Performing a duration analysis on the data may help you improve the definition of the customer journey to be used by your organization. After which days are converison probabilities null? Should we consider the effect of campaigns disappears after x days without orders? For example, if 99% of orders are placed in the 30 days following a first click, it might be interesting to define the customer journey as a 30 days time frame following the first oder.
  • Look at the distribution of the number of campaigns in a typical journey. If you choose to calculate the effect of campaigns interraction in your Attribution Model, it may indeed help you determine the maximum number of campaigns to be included in a combination. Indeed, you may not need to assess the impact of channel combinations with above than 4 different channels if 95% of orders are placed after less then 4 campaigns.
  • Transition matrixes: what if a campaign A systematically leads to a campaign B? What happens if we remove A or B? These insights would give clues to ask precise questions for a latter AB test, for example to find out if there is complementarity between channels A and B – (implying none should be removed) or mere substitution (implying one can be given up).
  • If conversion rates are available: it can be interesting to perform a survival analysis i.e to analyse the likelihood of conversion based on duration since first click. This could help us excluse potential outliers or individuals who have very low conversion probabilities.

Summary

Attribution is a complex topic which will probably never be definitively solved. Indeed, a main issue is the difficulty, or even impossibility, to evaluate precisely the accuracy of the attribution model that we’ve built. Attribution Models should be seen as a good yet always improvable approximation of the incremental values of campaigns, and be presented with their intrinsinc limits and biases.

Introduction to ROC Curve

The abbreviation ROC stands for Receiver Operating Characteristic. Its main purpose is to illustrate the diagnostic ability of classifier as the discrimination threshold is varied. It was developed during World War II when Radar operators had to decide if the blip on the screen is an enemy target, a friendly ship or just a noise.  For these purposes they measured the ability of a radar receiver operator to make these important distinctions, which was called the Receiver Operating Characteristic.

Later it was found useful in interpreting medical test results and then in Machine learning classification problems. In order to get an introduction to binary classification and terms like ‘precision’ and ‘recall’ one can look into my earlier blog  here.

True positive rate and false positive rate

Let’s imagine a situation where a fire alarm is installed in a kitchen. The alarm is supposed to emit a sound in case fire smoke is detected in the room. Unfortunately, there is a lot of cooking done in the kitchen and the alarm may trigger the sound too often. Thus, instead of serving a purpose the alarm becomes a nuisance due to a large number of false alarms. In statistical terms these types of errors are called type 1 errors, or false positives.

One way to deal with this problem is to simply decrease sensitivity of the device. We do this by increasing the trigger threshold at the alarm setting. But then, not every alarm should have the same threshold setting. Consider the same type of device but kept in a bedroom. With high threshold, the device might miss smoke from a real short-circuit in the wires which poses a real danger of fire. This kind of failure is called Type 2 error or a false negative. Although the two devices are the same, different types of threshold settings are optimal for different circumstances.

To specify this more formally, let us describe the performance of a binary classifier at a particular threshold by the following parameters:

 

These parameters take different values at different thresholds. Hence, they define the performance of the classifier at particular threshold. But we want to examine in overall how good a classifier is. Fortunately, there is a way to do that. We plot the True Positive Rate (TPR) and False Positive rate (FPR) at different thresholds and this plot is called ROC curve.

Let’s try to understand this with an example.

A case with a distinct population distribution

Let’s suppose there is a disease which can be identified with deficiency of some parameter (maybe a certain vitamin). The distribution of population with this disease has a mean vitamin concentration sharply distinct from the mean of a healthy population, as shown below.

This is result of dummy data simulating population of 2000 people,the link to the code is given  in the end of this blog.  As the two populations are distinctly separated (there is no  overlap between the two distributions), we can expect that a classifier would have an easy job distinquishing healthy from sick people. We can run a logistic regression classifier with a threshold of .5 and be 100% succesful in detecting the decease.

The confusion matrix may look something like this.

In this ideal case with a threshold  of  .5 we do not make a single wrong classification. The True positive rate and False positive rate are 1 and 0, respectively. But we can shift the threshold. In that case, we will  get different confusion matrices. First we plot threshold vs. TPR.

We see for most values of threshold the TPR is close to 1 which again proves data is easy to classify and the classifier is returning high probabilities  for the most of positives .

Similarly Let’s plot threshold vs. FPR.

For most of the data points FPR is close to zero. This is also good. Now its time to plot the ROC curve using these results (TPR vs FPR).

Let’s try to interpret  the results,  all the points lie on line x=0 and y=1, it means for all the points FPR is zero or TPR is one, making  the curve a square. which means the classifier does perfectly well.

Case with overlapping  population distribution

The above example was about a perfect classifer. However, life is often not so easy. Now let us consider another more realistic situation in which the parameter distribution of the population is not as distinct as in the previous case. Rather, the mean of the parameter with healthy and not healthy datapoints are close and the distributions overlap, as shown in the next figure.

If we set the threshold to 0.5, the confusion matrix may look like this.

Now, any new choice of threshold location will affect both false positives and false negatives. In fact, there is a trade-off. If we shift the threshold with the goal to reduce false negatives, false positives will increase. If we move the threshold to the other direction and reduce false positive, false negatives will increase.

The plots (TPR vs Threshold) , (FPR vs Threshold) are shown below

If we plot the ROC curve from these results, it looks like this:

From the curve we see the classifier does not perform as well as the earlier one.

What else can be infered from this curve? We first need to understand what the diagonal in this plot represent. The diagonal represents ‘Line of no discrimination’, which we obtain if we randomly guess. This is the ROC curve for the worst possible classifier. Therefore, by comparing the obtained ROC curve with the diagonal, we see how much better our classifer is from random guessing.

The further away ROC curve from the diagonal is (the closest it is to the top left corner) , better the classifier is.

Area Under the curve

The overall performance of the classifier is given by the area under the ROC curve and is usually denoted as AUC. Since TPR and FPR lie within the range of 0 to 1, the AUC also assumes values between 0 and 1. The higher the value of AUC, the better is the overall performance of the classifier.

Let’s see this for the two different distributions which we saw earlier.

As we know the classifier had worked perfectly in the first case with points at (0,1) the area under the curve is 1 which is perfect. In the latter case the classifier was not able to perform as good, the ROC curve is between the diagonal and left hand corner. The AUC as we can see is less than 1.

Some other general characteristics

There are still few points that needs to be discussed on a General ROC curve

  • The ROC curve does not provide information about the actual values of thresholds used for the classifier.
  • Performance of different classifiers can be compared using the AUC of different Classifier. The larger the AUC, the better the classifier.
  • The vertical distance of the ROC curve from the no discrimination line gives a measure of ‘INFORMEDNESS’. This is known as Youden’s J satistic. This statistics can take values between 0 and 1.

Youden’s  J statistic is defined for every point on the ROC curve . The point at which Youden’s  J satistics reaches its maximum for a given ROC curve can be used to guide the selection of the threshold to be used for that classifier.

I hope this post does the job of providing an understanding of ROC curves  and AUC. The  Python program for simulating the example given earlier can be found here .

Please feel free to adjust the mean of the distributions and see the changes in the plot.