How to Ensure Data Quality in an Organization?

Introduction to Data Quality

Today, the world is filled with data. It is everywhere. And, the value of any organization can be measured by the quality of its data. So, what actually is the quality of data or data quality, and why is it important? Well, data quality refers to the capability of a set of data to serve an intended purpose. 

Data quality is important to any organization because it provides timely and accurate information to manage accountability and services. It also helps to ensure and prioritize the best use of resources. Thus, high-quality data will lead to appropriate insights and valuable information for any organization. We can evaluate the quality of data in certain aspects. They include accuracy, relevancy, completeness, and uniqueness. 

Data Quality Problems

As the organizations are collecting vast amounts of data, managing its quality becomes more important every single day. In the year 2016, the costs of problems caused due to poor data quality were estimated by IBM, and it turned out to be $3.1 trillion across the U.S economy. Also, a Forrester report has stated that almost 30 percent of analysts spend 40 percent of their time validating and vetting their data prior to its utilization for strategic decision-making. These statistics indicate that the scale of the problems with data quality is vast.

So, why do these data quality problems occur? The main reasons include manual entry of data, software updates, integration of data sources, skills shortages, and insufficient testing time. Wrong decisions can be taken due to poor data management processes and poor quality of data. Because of this, many organizations lose their clients and customers. So, ensuring data quality must be given utmost importance in an organization. 

How to Ensure Data Quality?

Data quality management helps by combining data, technology, and organizational culture to deliver useful and accurate results. Good management of data quality builds a foundation for all the initiatives of a business. Now, let’s see how we can improve the data quality in an organization.

The first aspect of improving the quality of data is monitoring and cleansing data. This verifies data against standard statistical measures, validates data against matching descriptions, and uncovers relationships. This also checks the uniqueness of data and analyzes the data for its reusability. 

The second one is managing metadata centrally. Multiple people gather and clean data very often and they may work in different countries or offices. Therefore, you require clear policies on how data is gathered and managed as people in different parts of a company may misinterpret certain data terms and concepts. Centralized management of metadata is the solution to this problem as it reduces inconsistent interpretations and helps in establishing corporate standards.  

The next one is to ensure all the requirements are available and offer documentation for data processors and data providers. You have to format the specifications and offer a data dictionary and also provide training for the providers of data and all other new staff. Make sure you offer immediate help for all the data providers.

Very often, data is gathered from different sources and may include distinct spelling options. Hence, segmentation, scoring, smart lists, and many others are impacted by this. So, for entering a data point, a singular approach is essential, and data normalization provides this approach. The goal of this approach is to eliminate redundancy in data. Its advantages include easier object-to-data mapping and increased consistency.

The last aspect is to verify whether the data is consistent with the data rules and business goals, and this has to be done at regular intervals. You have to communicate the current status and data quality metrics to every stakeholder regularly to ensure the maintenance of data quality discipline across the organization.

Conclusion

Data quality is a continuous process but not a one-time project which needs the entire company to be data-focused and data-driven. It is much more than reliability and accuracy. High level of data quality can be achieved when the decision-makers have confidence in data and rely upon it. Follow the above-mentioned steps to ensure a high level of data quality in your organization. 

The Power of Analyzing Processes

Are you thinking BIG enough? Over the past few years, the quality of discussion regarding a ‘process’ and its interfaces between different departments has developed radically. Organizations increasingly reject guesswork, individual assessments, or blame-shifting and instead focus on objective facts: the display of throughput times, process variants, and their optimization.

But while data can hold valuable insights into business, users, customer bases, and markets, companies are sometimes unsure how best to analyze and harness their data. In fact, the problem isn’t usually a lack of data; it’s a breakdown in leveraging useful data. Being unsure how to interpret, explore, and analyze processes can paralyze any go-live, leading to a failure in the efficient interaction of processes and business operations. Without robust data analysis, your business could be losing money, talent, and even clients.

After all, analyzing processes is about letting data tell its true story for improved understanding.

The “as-is” processes

Analyzing the as-is current state helps organizations document, track, and optimize processes for better performance, greater efficiency, and improved outcomes. By contextualizing data, we gain the ability to navigate and organize processes to negate bottlenecks, set business preferences, and plan an optimized route through process mining initiatives. This focus can help across an entire organization, or on one or more specific processes or trends within a department or team.

There are several vital goals/motivations for implementing current state analysis, including:

  • Saving money and improving ROI;
  • Improving existing processes or creating new processes;
  • Increasing customer satisfaction and journeys;
  • Improving business coordination and organizational responsiveness;
  • Complying with new regulatory standards;
  • Adapting methods following a merger or acquisition.

The “to-be” processes

Simply put, if as-is maps where your processes are, to-be maps where you want them to… be. To-be process mapping documents what you want the process to look like, and by using the as-is diagram, you can work with stakeholders to identify developments and improvements of the current process, then outline those changes on your to-be roadmap.

This analysis can help you make optimal decisions for your business and innovative OpEx imperatives. For instance, at leading data companies like Google and Amazon, data is used in such a way that the analysis results make the decisions! Just think of the power Recommendation Engines, PageRank, and Demand Forecasting Systems have over the content we see. To achieve this, advanced techniques of machine learning and statistical modeling are applied, resulting in mechanically improved results from the data. Interestingly, because these techniques reference large-scale data sets and reflect analysis and results in real-time, they are applied to areas that extend beyond human decision-making.

Also, by analyzing and continuously monitoring qualitative and quantitative data, we gain insights across potential risks and ongoing improvement opportunities, too. The powerful combination of process discovery, process analysis, and conformance checking supports a collaborative approach to process improvement, giving you game-changing insights into your business. For example:

  • Which incidents would I like to detect and act upon proactively?
  • Where would task prioritization help improve overall performance?
  • Where do I know that increased transparency would help the company?
  • How can I utilize processes in place of gut feeling/experience?

Further, as the economic environment continues to change rapidly, and modern organizations keep adopting process-based approaches to ensure they are achieving their business goals, process analysis naturally becomes the perfect template for any company.

With this, process mining technology can help modern businesses manage process challenges beyond the boundaries of implementation. We can evaluate the proof of concept (PoC) for any proposed improvements, and extract relevant information from a homogenous data set. Of course, process modeling and business process management (BPM) are available to solve the potentially tricky integration phase.

Process mining and analysis initiatives

Process mining and discovery initiatives can also provide critical insights throughout the automation and any Robotic Process Automation (RPA) journey, from defining the strategy to continuous improvement and innovation. Data-based process mining can even extend process analysis across teams and individuals, decreasing incident resolution times, and subsequently improving working habits via the discovery and validation of automation opportunities.

A further example of where process mining and strategic process analysis/alignment is already paying dividends is IT incident management. Here, “incident” is an unplanned interruption to an IT service, which may be complete unavailability or merely a reduction in quality. The goal of the incident management process is to restore regular service operation as quickly as possible and to minimize the impact on business operations. Incident management is a critical process in Information Technology Library (ITIL).

Process mining can also further drive improvement in as-is incident management processes as well as exceptional and unwanted process steps, by increasing visibility and transparency across IT processes. Process mining will swiftly analyze the different working habits across teams and individuals, decreasing incident resolution times, and subsequently improving customer impact cases.

Positive and practical experiences with process mining across industries have also led to the further dynamic development of tools, use cases, and the end-user community. Even with very experienced process owners, the visualization of processes can skyrocket improvement via new ideas and discussion.

However, the potential performance gains are more extensive, with the benefits of using process mining for incident management, also including:

  • Finding out how escalation rules are working and how the escalation is done;
  • Calculating incident management KPIs, including SLA (%);
  • Discovering root causes for process problems;
  • Understanding the effect of the opening interface (email, web form, phone, etc.);
  • Calculating the cost of the incident process;
  • Aligning the incident management system with your incident management process.

Robotic Process Automation (RPA)

Robotic process automation (RPA) provides a virtual workforce to automatize manual, repetitive, and error-prone tasks. However, successful process automation requires specific knowledge about the intended (and potential) benefits, effective training of the robots, and continuous monitoring of their performance and processes.

With this, process mining supports organizations throughout the lifecycle of RPA initiatives by monitoring and benchmarking robots to ensure sustainable benefits. These insights are especially valuable for process miners and managers with a particular interest in process automation. By unlocking the experiences with process mining, a company better understands what is needed today, for tomorrow’s process initiatives.

To further upgrade the impact of robot-led automation, there is also a need for a solid understanding of legacy systems, and an overview of automation opportunities. Process mining tools provide key insights throughout the entire RPA journey, from defining the strategy to continuous improvement and innovation.

Benefits of process mining and analysis within the RPA lifecycle include:

  1. Overviews of processes within the company, based on specific criteria;
  2. Identification of processes suitable for RPA implementation during the preparation phase;
  3. Mining the optimal process flow/process path;
  4. Understanding the extent to which RPA can be implemented in legacy processes and systems;
  5. Monitoring and analysis of RPA performance during the transition/handover of customization;
  6. Monitoring and continuous improvement of RPA in the post-implementation phase.

The process of better business understanding

Every organization is different and brings with it a variety of process-related questions. Yet some patterns are usually repeated. For example, customers who introduce data supported process analysis as part of business transformation initiatives will typically face challenges in harmonizing processes from fragmented sectors and regional locations. Here it helps enormously to base actions on data and statistics from the respective processes, instead of relying on the instincts and estimations of individuals.

With this, process analysis which is supported by data, enables a fact-based discussion, and builds a bridge between employees, process experts and management. This helps avoid siloed thinking, as well as allowing the transparent design of handovers and process steps which cross departmental boundaries within an organization.

In other words, to unlock future success and transformation, we must be processing… today.

Find out more about process mining with Signavio Process Intelligence, and see how it can help your organization uncover the hidden value of process, generate fresh ideas, and save time and money.

From BI to PI: The Next Step in the Evolution of Data-Driven Decisions

“Change is a constant.” “The pace of change is accelerating.” “The world is increasingly complex, and businesses have to keep up.” Organizations of all shapes and sizes have heard these ideas over and over—perhaps too often! However, the truth remains that adaptation is crucial to a successful business.


Read this article in German: Von der Datenanalyse zur Prozessverbesserung: So gelingt eine erfolgreiche Process-Mining-Initiative

 


Of course, the only way to ensure that the decisions you make are evolving in the right way is to understand the underlying building blocks of your organization. You can think of it as DNA; the business processes that underpin the way you work and combine to create a single unified whole. Knowing how those processes operate, and where the opportunities for improvement lie, can be the difference between success and failure.

Businesses with an eye on their growth understand this already. In the past, Business Intelligence was seen as the solution to this challenge. In more recent times, forward-thinking organizations see the need for monitoring solutions that can keep up with today’s rate of change, at the same time as they recognize that increasing complexity within business processes means traditional methods are no longer sufficient.

Adapting to a changing environment? The challenges of BI

Business Intelligence itself is not necessarily defunct or obsolete. However, the tools and solutions that enable Business Intelligence face a range of challenges in a fast-paced and constantly changing world. Some of these issues may include:

  • High data latency – Data latency refers to how long it takes for a business user to retrieve data from, for example, a business intelligence dashboard. In many cases, this can take more than 24 hours, a critical time period when businesses are attempting to take advantage of opportunities that may have a limited timeframe.
  • Incomplete data sets – The broad approach of Business Intelligence means investigations may run wide but not deep. This increases the chances that data will be missed, especially in instances where the tools themselves make the parameters for investigations difficult to change.
  • Discovery, not analysis – Business intelligence tools are primarily optimized for exploration, with a focus on actually finding data that may be useful to their users. Often, this is where the tools stop, offering no simple way for users to actually analyze the data, and therefore reducing the possibility of finding actionable insights.
  • Limited scalability – In general, Business Intelligence remains an arena for specialists and experts, leaving a gap in understanding for operational staff. Without a wide appreciation for processes and their analysis within an organization, the opportunities to increase the application of a particular Business Intelligence tool will be limited.
  • Unconnected metrics – Business Intelligence can be significantly restricted in its capacity to support positive change within a business through the use of metrics that are not connected to the business context. This makes it difficult for users to interpret and understand the results of an investigation, and apply these results to a useful purpose within their organization.

Process Intelligence: the next evolutionary step

To ensure companies can work efficiently and make the best decisions, a more effective method of process discovery is needed. Process Intelligence (PI) provides the critical background to answer questions that cannot be answered with Business Intelligence tools.

Process Intelligence offers visualization of end-to-end process sequences using raw data, and the right Process Intelligence tool means analysis of that raw data can be conducted straight away, so that processes are displayed accurately. The end-user is free to view and work with this accurate information as they please, without the need to do a preselection for the analysis.

By comparison, because Business Intelligence requires predefined analysis criteria, only once the criteria are defined can BI be truly useful. Organizations can avoid delayed analysis by using Process Intelligence to identify the root causes of process problems, then selecting the right criteria to determine the analysis framework.

Then, you can analyze your system processes and see the gaps and variants between the intended business process and what you actually have. And of course, the faster you discover what you have, the faster you can apply the changes that will make a difference in your business.

In short, Business Intelligence is suitable for gaining a broad understanding of the way a business usually functions. For some businesses, this will be sufficient. For others, an overview is not enough.

They understand that true insights lie in the detail, and are looking for a way of drilling down into exactly how each process within their organization actually works. Software that combines process discovery, process analysis, and conformance checking is the answer.

The right Process Intelligence tools means you will be able to automatically mine process models from the different IT systems operating within your business, as well as continuously monitor your end-to-end processes for insights into potential risks and ongoing improvement opportunities. All of this is in service of a collaborative approach to process improvement, which will lead to a game-changing understanding of how your business works, and how it can work better.

Early humans evolved from more primitive ancestors, and in the process, learned to use more and more sophisticated tools. For the modern human, working in a complex organization, the right tool is Process Intelligence.

Endless Potential with Signavio Process Intelligence

Signavio Process Intelligence allows you to unearth the truth about your processes and make better decisions based on true evidence found in your organization’s IT systems. Get a complete end-to-end perspective and understanding of exactly what is happening in your organization in a matter of weeks.

As part of Signavio Business Transformation Suite, Signavio Process Intelligence integrates perfectly with Signavio Process Manager and is accessible from the Signavio Collaboration Hub. As an entirely cloud-based process mining solution, the tool makes it easy to collaborate with colleagues from all over the world and harness the wisdom of the crowd.

Find out more about Signavio Process Intelligence, and see how it can help your organization generate more ideas, save time and money, and optimize processes.

A Bird’s Eye View: How Machine Learning Can Help You Charge Your E-Scooters

Bird scooters in Columbus, Ohio

Bird scooters in Columbus, Ohio

Ever since I started using bike-sharing to get around in Seattle, I have become fascinated with geolocation data and the transportation sharing economy. When I saw this project leveraging the mobility data RESTful API from the Los Angeles Department of Transportation, I was eager to dive in and get my hands dirty building a data product utilizing a company’s mobility data API.

Unfortunately, the major bike and scooter providers (Bird, JUMP, Lime) don’t have publicly accessible APIs. However, some folks have seemingly been able to reverse-engineer the Bird API used to populate the maps in their Android and iOS applications.

One interesting feature of this data is the nest_id, which indicates if the Bird scooter is in a “nest” — a centralized drop-off spot for charged Birds to be released back into circulation.

I set out to ask the following questions:

  1. Can real-time predictions be made to determine if a scooter is currently in a nest?
  2. For non-nest scooters, can new nest location recommendations be generated from geospatial clustering?

To answer these questions, I built a full-stack machine learning web application, NestGenerator, which provides an automated recommendation engine for new nest locations. This application can help power Bird’s internal nest location generation that runs within their Android and iOS applications. NestGenerator also provides real-time strategic insight for Bird chargers who are enticed to optimize their scooter collection and drop-off route based on proximity to scooters and nest locations in their area.

Bird

The electric scooter market has seen substantial growth with Bird’s recent billion dollar valuation  and their $300 million Series C round in the summer of 2018. Bird offers electric scooters that top out at 15 mph, cost $1 to unlock and 15 cents per minute of use. Bird scooters are in over 100 cities globally and they announced in late 2018 that they eclipsed 10 million scooter rides since their launch in 2017.

Bird scooters in Tel Aviv, Israel

Bird scooters in Tel Aviv, Israel

With all of these scooters populating cities, there’s much-needed demand for people to charge them. Since they are electric, someone needs to charge them! A charger can earn additional income for charging the scooters at their home and releasing them back into circulation at nest locations. The base price for charging each Bird is $5.00. It goes up from there when the Birds are harder to capture.

Data Collection and Machine Learning Pipeline

The full data pipeline for building “NestGenerator”

Data

From the details here, I was able to write a Python script that returned a list of Bird scooters within a specified area, their geolocation, unique ID, battery level and a nest ID.

I collected scooter data from four cities (Atlanta, Austin, Santa Monica, and Washington D.C.) across varying times of day over the course of four weeks. Collecting data from different cities was critical to the goal of training a machine learning model that would generalize well across cities.

Once equipped with the scooter’s latitude and longitude coordinates, I was able to leverage additional APIs and municipal data sources to get granular geolocation data to create an original scooter attribute and city feature dataset.

Data Sources:

  • Walk Score API: returns a walk score, transit score and bike score for any location.
  • Google Elevation API: returns elevation data for all locations on the surface of the earth.
  • Google Places API: returns information about places. Places are defined within this API as establishments, geographic locations, or prominent points of interest.
  • Google Reverse Geocoding API: reverse geocoding is the process of converting geographic coordinates into a human-readable address.
  • Weather Company Data: returns the current weather conditions for a geolocation.
  • LocationIQ: Nearby Points of Interest (PoI) API returns specified PoIs or places around a given coordinate.
  • OSMnx: Python package that lets you download spatial geometries and model, project, visualize, and analyze street networks from OpenStreetMap’s APIs.

Feature Engineering

After extensive API wrangling, which included a four-week prolonged data collection phase, I was finally able to put together a diverse feature set to train machine learning models. I engineered 38 features to classify if a scooter is currently in a nest.

Full Feature Set

Full Feature Set

The features boiled down into four categories:

  • Amenity-based: parks within a given radius, gas stations within a given radius, walk score, bike score
  • City Network Structure: intersection count, average circuity, street length average, average streets per node, elevation level
  • Distance-based: proximity to closest highway, primary road, secondary road, residential road
  • Scooter-specific attributes: battery level, proximity to closest scooter, high battery level (> 90%) scooters within a given radius, total scooters within a given radius

 

Log-Scale Transformation

For each feature, I plotted the distribution to explore the data for feature engineering opportunities. For features with a right-skewed distribution, where the mean is typically greater than the median, I applied these log transformations to normalize the distribution and reduce the variability of outlier observations. This approach was used to generate a log feature for proximity to closest scooter, closest highway, primary road, secondary road, and residential road.

An example of a log transformation

Statistical Analysis: A Systematic Approach

Next, I wanted to ensure that the features I included in my model displayed significant differences when broken up by nest classification. My thinking was that any features that did not significantly differ when stratified by nest classification would not have a meaningful predictive impact on whether a scooter was in a nest or not.

Distributions of a feature stratified by their nest classification can be tested for statistically significant differences. I used an unpaired samples t-test with a 0.01% significance level to compute a p-value and confidence interval to determine if there was a statistically significant difference in means for a feature stratified by nest classification. I rejected the null hypothesis if a p-value was smaller than the 0.01% threshold and if the 99.9% confidence interval did not straddle zero. By rejecting the null-hypothesis in favor of the alternative hypothesis, it’s deemed there is a significant difference in means of a feature by nest classification.

Battery Level Distribution Stratified by Nest Classification to run a t-test

Battery Level Distribution Stratified by Nest Classification to run a t-test

Log of Closest Scooter Distribution Stratified by Nest Classification to run a t-test

Throwing Away Features

Using the approach above, I removed ten features that did not display statistically significant results.

Statistically Insignificant Features Removed Before Model Development

Model Development

I trained two models, a random forest classifier and an extreme gradient boosting classifier since tree-based models can handle skewed data, capture important feature interactions, and provide a feature importance calculation. I trained the models on 70% of the data collected for all four cities and reserved the remaining 30% for testing.

After hyper-parameter tuning the models for performance on cross-validation data it was time to run the models on the 30% of test data set aside from the initial data collection.

I also collected additional test data from other cities (Columbus, Fort Lauderdale, San Diego) not involved in training the models. I took this step to ensure the selection of a machine learning model that would generalize well across cities. The performance of each model on the additional test data determined which model would be integrated into the application development.

Performance on Additional Cities Test Data

The Random Forest Classifier displayed superior performance across the board

The Random Forest Classifier displayed superior performance across the board

I opted to move forward with the random forest model because of its superior performance on AUC score and accuracy metrics on the additional cities test data. AUC is the Area under the ROC Curve, and it provides an aggregate measure of model performance across all possible classification thresholds.

AUC Score on Test Data for each Model

AUC Score on Test Data for each Model

Feature Importance

Battery level dominated as the most important feature. Additional important model features were proximity to high level battery scooters, proximity to closest scooter, and average distance to high level battery scooters.

Feature Importance for the Random Forest Classifier

Feature Importance for the Random Forest Classifier

The Trade-off Space

Once I had a working machine learning model for nest classification, I started to build out the application using the Flask web framework written in Python. After spending a few days of writing code for the application and incorporating the trained random forest model, I had enough to test out the basic functionality. I could finally run the application locally to call the Bird API and classify scooter’s into nests in real-time! There was one huge problem, though. It took more than seven minutes to generate the predictions and populate in the application. That just wasn’t going to cut it.

The question remained: will this model deliver in a production grade environment with the goal of making real-time classifications? This is a key trade-off in production grade machine learning applications where on one end of the spectrum we’re optimizing for model performance and on the other end we’re optimizing for low latency application performance.

As I continued to test out the application’s performance, I still faced the challenge of relying on so many APIs for real-time feature generation. Due to rate-limiting constraints and daily request limits across so many external APIs, the current machine learning classifier was not feasible to incorporate into the final application.

Run-Time Compliant Application Model

After going back to the drawing board, I trained a random forest model that relied primarily on scooter-specific features which were generated directly from the Bird API.

Through a process called vectorization, I was able to transform the geolocation distance calculations utilizing NumPy arrays which enabled batch operations on the data without writing any “for” loops. The distance calculations were applied simultaneously on the entire array of geolocations instead of looping through each individual element. The vectorization implementation optimized real-time feature engineering for distance related calculations which improved the application response time by a factor of ten.

Feature Importance for the Run-time Compliant Random Forest Classifier

Feature Importance for the Run-time Compliant Random Forest Classifier

This random forest model generalized well on test-data with an AUC score of 0.95 and an accuracy rate of 91%. The model retained its prediction accuracy compared to the former feature-rich model, but it gained 60x in application performance. This was a necessary trade-off for building a functional application with real-time prediction capabilities.

Geospatial Clustering

Now that I finally had a working machine learning model for classifying nests in a production grade environment, I could generate new nest locations for the non-nest scooters. The goal was to generate geospatial clusters based on the number of non-nest scooters in a given location.

The k-means algorithm is likely the most common clustering algorithm. However, k-means is not an optimal solution for widespread geolocation data because it minimizes variance, not geodetic distance. This can create suboptimal clustering from distortion in distance calculations at latitudes far from the equator. With this in mind, I initially set out to use the DBSCAN algorithm which clusters spatial data based on two parameters: a minimum cluster size and a physical distance from each point. There were a few issues that prevented me from moving forward with the DBSCAN algorithm.

  1. The DBSCAN algorithm does not allow for specifying the number of clusters, which was problematic as the goal was to generate a number of clusters as a function of non-nest scooters.
  2. I was unable to hone in on an optimal physical distance parameter that would dynamically change based on the Bird API data. This led to suboptimal nest locations due to a distortion in how the physical distance point was used in clustering. For example, Santa Monica, where there are ~15,000 scooters, has a higher concentration of scooters in a given area whereas Brookline, MA has a sparser set of scooter locations.

An example of how sparse scooter locations vs. highly concentrated scooter locations for a given Bird API call can create cluster distortion based on a static physical distance parameter in the DBSCAN algorithm. Left:Bird scooters in Brookline, MA. Right:Bird scooters in Santa Monica, CA.

An example of how sparse scooter locations vs. highly concentrated scooter locations for a given Bird API call can create cluster distortion based on a static physical distance parameter in the DBSCAN algorithm. Left:Bird scooters in Brookline, MA. Right:Bird scooters in Santa Monica, CA.

Given the granularity of geolocation scooter data I was working with, geospatial distortion was not an issue and the k-means algorithm would work well for generating clusters. Additionally, the k-means algorithm parameters allowed for dynamically customizing the number of clusters based on the number of non-nest scooters in a given location.

Once clusters were formed with the k-means algorithm, I derived a centroid from all of the observations within a given cluster. In this case, the centroids are the mean latitude and mean longitude for the scooters within a given cluster. The centroids coordinates are then projected as the new nest recommendations.

NestGenerator showcasing non-nest scooters and new nest recommendations utilizing the K-Means algorithm

NestGenerator showcasing non-nest scooters and new nest recommendations utilizing the K-Means algorithm.

NestGenerator Application

After wrapping up the machine learning components, I shifted to building out the remaining functionality of the application. The final iteration of the application is deployed to Heroku’s cloud platform.

In the NestGenerator app, a user specifies a location of their choosing. This will then call the Bird API for scooters within that given location and generate all of the model features for predicting nest classification using the trained random forest model. This forms the foundation for map filtering based on nest classification. In the app, a user has the ability to filter the map based on nest classification.

Drop-Down Map View filtering based on Nest Classification

Drop-Down Map View filtering based on Nest Classification

Nearest Generated Nest

To see the generated nest recommendations, a user selects the “Current Non-Nest Scooters & Predicted Nest Locations” filter which will then populate the application with these nest locations. Based on the user’s specified search location, a table is provided with the proximity of the five closest nests and an address of the Nest location to help inform a Bird charger in their decision-making.

NestGenerator web-layout with nest addresses and proximity to nearest generated nests

NestGenerator web-layout with nest addresses and proximity to nearest generated nests

Conclusion

By accurately predicting nest classification and clustering non-nest scooters, NestGenerator provides an automated recommendation engine for new nest locations. For Bird, this application can help power their nest location generation that runs within their Android and iOS applications. NestGenerator also provides real-time strategic insight for Bird chargers who are enticed to optimize their scooter collection and drop-off route based on scooters and nest locations in their area.

Code

The code for this project can be found on my GitHub

Comments or Questions? Please email me an E-Mail!

 

Attribution Models in Marketing

Attribution Models

A Business and Statistical Case

INTRODUCTION

A desire to understand the causal effect of campaigns on KPIs

Advertising and marketing costs represent a huge and ever more growing part of the budget of companies. Studies have found out this share is as high as 10% and increases with the size of companies (CMO study by American Marketing Association and Duke University, 2017). Measuring precisely the impact of a specific marketing campaign on the sales of a company is a critical step towards an efficient allocation of this budget. Would the return be higher for an euro spent on a Facebook ad, or should we better spend it on a TV spot? How much should I spend on Twitter ads given the volume of sales this channel is responsible for?

Attribution Models have lately received great attention in Marketing departments to answer these issues. The transition from offline to online marketing methods has indeed permitted the collection of multiple individual data throughout the whole customer journey, and  allowed for the development of user-centric attribution models. In short, Attribution Models use the information provided by Tracking technologies such as Google Analytics or Webtrekk to understand customer journeys from the first click on a Facebook ad to the final purchase and adequately ponderate the different marketing campaigns encountered depending on their responsibility in the final conversion.

Issues on Causal Effects

A key question then becomes: how to declare a channel is responsible for a purchase? In other words, how can we isolate the causal effect or incremental value of a campaign ?

          1. A/B-Tests

One method to estimate the pure impact of a campaign is the design of randomized experiments, wherein a control and treated groups are compared.  A/B tests belong to this broad category of randomized methods. Provided the groups are a priori similar in every aspect except for the treatment received, all subsequent differences may be attributed solely to the treatment. This method is typically used in medical studies to assess the effect of a drug to cure a disease.

Main practical issues regarding Randomized Methods are:

  • Assuring that control and treated groups are really similar before treatment. Uually a random assignment (i.e assuring that on a relevant set of observable variables groups are similar) is realized;
  • Potential spillover-effects, i.e the possibility that the treatment has an impact on the non-treated group as well (Stable unit treatment Value Assumption, or SUTVA in Rubin’s framework);
  • The costs of conducting such an experiment, and especially the costs linked to the deliberate assignment of individuals to a group with potentially lower results;
  • The number of such experiments to design if multiple treatments have to be measured;
  • Difficulties taking into account the interaction effects between campaigns or the effect of spending levels. Indeed, usually A/B tests are led by cutting off temporarily one campaign entirely and measuring the subsequent impact on KPI’s compared to the situation where this campaign is maintained;
  • The dynamical reproduction of experiments if we assume that treatment effects may change over time.

In the marketing context, multiple campaigns must be tested in a dynamical way, and treatment effect is likely to be heterogeneous among customers, leading to practical issues in the lauching of A/B tests to approximate the incremental value of all campaigns. However, sites with a lot of traffic and conversions can highly benefit from A/B testing as it provides a scientific and straightforward way to approximate a causal impact. Leading companies such as Uber, Netflix or Airbnb rely on internal tools for A/B testing automation, which allow them to basically test any decision they are about to make.

References:

Books:

Experiment!: Website conversion rate optimization with A/B and multivariate testing, Colin McFarland, ©2013 | New Riders  

A/B testing: the most powerful way to turn clicks into customers. Dan Siroker, Pete Koomen; Wiley, 2013.

Blogs:

https://eng.uber.com/xp

https://medium.com/airbnb-engineering/growing-our-host-community-with-online-marketing-9b2302299324

Study:

https://cmosurvey.org/wp-content/uploads/sites/15/2018/08/The_CMO_Survey-Results_by_Firm_and_Industry_Characteristics-Aug-2018.pdf

        2. Attribution models

Attribution Models do not demand to create an experimental setting. They take into account existing data and derive insights from the variability of customer journeys. One key difficulty is then to differentiate correlation and causality in the links observed between the exposition to campaigns and purchases. Indeed, selection effects may bias results as exposure to campaigns is usually dependant on user-characteristics and thus may not be necessarily independant from the customer’s baseline conversion probabilities. For example, customers purchasing from a discount price comparison website may be intrinsically different from customers buying from FB ad and this a priori difference may alone explain post-exposure differences in purchasing bahaviours. This intrinsic weakness must be remembered when interpreting Attribution Models results.

                          2.1 General Issues

The main issues regarding the implementation of Attribution Models are linked to

  • Causality and fallacious reasonning, as most models do not take into account the aforementionned selection biases.
  • Their difficult evaluation. Indeed, in almost all attribution models (except for those based on classification, where the accuracy of the model can be computed), the additionnal value brought by the use of a given attribution models cannot be evaluated using existing historical data. This additionnal value can only be approximated by analysing how the implementation of the conclusions of the attribution model have impacted a given KPI.
  • Tracking issues, leading to an uncorrect reconstruction of customer journeys
    • Cross-device journeys: cross-device issue arises from the use of different devices throughout the customer journeys, making it difficult to link datapoints. For example, if a customer searches for a product on his computer but later orders it on his mobile, the AM would then mistakenly consider it an order without prior campaign exposure. Though difficult to measure perfectly, the proportion of cross-device orders can approximate 20-30%.
    • Cookies destruction makes it difficult to track the customer his the whole journey. Both regulations and consumers’ rising concerns about data privacy issues mitigate the reliability and use of cookies.1 – From 2002 on, the EU has enacted directives concerning privacy regulation and the extended use of cookies for commercial targeting purposes, which have highly impacted marketing strategies, such as the ‘Privacy and Electronic Communications Directive’ (2002/58/EC). A research was conducted and found out that the adoption of this ‘Privacy Directive’ had led to 64% decrease in advertising methods compared to the rest of the world (Goldfarb et Tucker (2011)). The effect was stronger for generalized sites (Yahoo) than for specialized sites.2 – Users have grown more and more conscious of data privacy issues and have adopted protective measures concerning data privacy, such as automatic destruction of cookies after a session is ended, or simply giving away less personnal information (Goldfarb et Tucker (2012) ) .Valuable user information may be lost, though tracking technologies evolution have permitted to maintain tracking by other means. This issue may be particularly important in countries highly concerned with data privacy issues such as Germany.
    • Offline/Online bridge: an Attribution Model should take into account all campaigns to draw valuable insights. However, the exposure to offline campaigns (TV, newspapers) are difficult to track at the user level. One idea to tackle this issue would be to estimate the proportion of conversions led by offline campaigns through AB testing and deduce this proportion from the credit assigned to the online campaigns accounted for in the Attribution Model.
    • Touch point information available: clicks are easy to follow but irrelevant to take into account the influence of purely visual campaigns such as display ads or video.

                          2.2 Today’s main practices

Two main families of Attribution Models exist:

  • Rule-Based Attribution Models, which have been used for in the last decade but from which companies are gradualy switching.

Attribution depends on the individual journeys that have led to a purchase and is solely based on the rank of the campaign in the journey. Some models focus on a single touch points (First Click, Last Click) while others account for multi-touch journeys (Bathtube, Linear). It can be calculated at the customer level and thus doesn’t require large amounts of data points. We can distinguish two sub-groups of rule-based Attribution Models:

  • One Touch Attribution Models attribute all credit to a single touch point. The First-Click model attributes all credit for a converion to the first touch point of the customer journey; last touch attributes all credit to the last campaign.
  • Multi-touch Rule-Based Attribution Models incorporate information on the whole customer journey are thus an improvement compared to one touch models. To this family belong Linear model where credit is split equally between all channels, Bathtube model where 40% of credit is given to first and last clicks and the remaining 20% is distributed equally between the middle channels, or time-decay models where credit assigned to a click diminishes as the time between the click and the order increases..

The main advantages of rule-based models is their simplicity and cost effectiveness. The main problems are:

– They are a priori known and can thus lead to optimization strategies from competitors
– They do not take into account aggregate intelligence on customer journeys and actual incremental values.
– They tend to bias (depending on the model chosen) channels that are over-represented at the beggining or end of the funnel, according to theoretical assumptions that have no observationnal back-ups.

  • Data-Driven Attribution Models

These models take into account the weaknesses of rule-based models and make a relevant use of available data. Being data-driven, following attribution models cannot be computed using single user level data. On the contrary values are calculated through data aggregation and thus require a certain volume of customer journey information.

References:

https://dspace.mit.edu/handle/1721.1/64920

 

        3. Data-Driven Attribution Models in practice

                          3.1 Issues

Several issues arise in the computation of campaigns individual impact on a given KPI within a data-driven model.

  • Selection biases: Exposure to certain types of advertisement is usually highly correlated to non-observable variables which are in turn correlated to consumption practices. Differences in the behaviour of users exposed to different campaigns may thus only be driven by core differences in conversion probabilities between groups whether than by the campaign effect.
  • Complementarity: it may be that campaigns A and B only have an effect when combined, so that measuring their individual impact would lead to misleading conclusions. The model could then try to assess the effect of combinations of campaigns on top of the effect of individual campaigns. As the number of possible non-ordered combinations of k campaigns is 2k, it becomes clear that inclusing all possible combinations would however be time-consuming.
  • Order-sensitivity: The effect of a campaign A may depend on the place where it appears in the customer journey, meaning the rank of a campaign and not merely its presence could be accounted for in the model.
  • Relative Order-sensitivity: it may be that campaigns A and B only have an effect when one is exposed to campaign A before campaign B. If so, it could be useful to assess the effect of given combinations of campaigns as well. And this for all campaigns, leading to tremendous numbers of possible combinations.
  • All previous phenomenon may be present, increasing even more the potential complexity of a comprehensive Attribution Model. The number of all possible ordered combination of k campaigns is indeed :

 

                          3.2 Main models

                                  A) Logistic Regression and Classification models

If non converting journeys are available, Attribition Model can be shaped as a simple classification issue. Campaign types or campaigns combination and volume of campaign types can be included in the model along with customer or time variables. As we are interested in inference (on campaigns effect) whether than prediction, a parametric model should be used, such as Logistic Regression. Non paramatric models such as Random Forests or Neural Networks can also be used though the interpretation of campaigns value would be more difficult to derive from the model results.

A common pitfall is the usual issue of spurious correlations on one hand and the correct interpretation of coefficients in business terms.

An advantage if the possibility to evaluate the relevance of the model using common model validation methods to evaluate its predictive power (validation set \ AUC \pseudo R squared).

                                  B) Shapley Value

Theory

The Shapley Value is based on a Game Theory framework and is named after its creator, the Nobel Price Laureate Lloyd Shapley. Initially meant to calculate the marginal contribution of players in cooperative games, the model has received much attention in research and industry and has lately been applied to marketing issues. This model is typically used by Google Adords and other ad bidding vendors. Campaigns or marketing channels are in this model seen as compementary players looking forward to increasing a given KPI.
Contrarily to Logistic Regressions, it is a non-parametric model. Contrarily to Markov Chains, all results are built using existing journeys, and not simulated ones.

Channels are considered to enter the game sequentially under a certain joining order. Shapley value try to The Shapley value of channel i is the weighted sum of the marginal values that channel i adds to all possible coalitions that don’t contain channel i.
In other words, the main logic is to analyse the difference of gains when a channel i is added after a coalition Ck of k channels, k<=n. We then sum all the marginal contributions over all possible ordered combination Ck of all campaigns excluding i, with k<=n-1.

Subsets framework

A first an most usual way to compute the Shapley Vaue is to consider that when a channel enters coalition, its additionnal value is the same irrelevant of the order in which previous channels have appeared. In other words, journeys (A>B>C) and (B>A>C) trigger the same gains.
Shapley value is computed as the gains associated to adding a channel i to a subset of channels, weighted by the number of (ordered) sequences that the (unordered) subset represents, summed up on all possible subsets of the total set of campaigns where the channel i is not present.
The Shapley value of the channel 𝑥𝑗 is then:

where |S| is the number of campaigns of a coalition S and the sum extends over all subsets S that do not not contain channel j. 𝜈(𝑆)  is the value of the coalition S and 𝜈(𝑆 ∪ {𝑥𝑗})  the value of the coalition formed by adding 𝑥𝑗 to coalition S. 𝜈(𝑆 ∪ {𝑥𝑗}) − 𝜈(𝑆) is thus the marginal contribution of channel 𝑥𝑗 to the coalition S.

The formula can be rewritten and understood as:

This method is convenient when data on the gains of on all possible permutations of all unordered k subsets of the n campaigns are available. It is also more convenient if the order of campaigns prior to the introduction of a campaign is thought to have no impact.

Ordered sequences

Let us define 𝜈((A>B)) as the value of the sequence A then B. What is we let 𝜈((A>B)) be different from 𝜈((B>A)) ?
This time we would need to sum over all possible permutation of the S campaigns present before  𝑥𝑗 and the N-(S+1) campaigns after 𝑥𝑗. Doing so we will sum over all possible orderings (i.e all permutations of the n campaigns of the grand coalition containing all campaigns) and we can remove the permutation coefficient s!(p-s+1)!.

This method is convenient when the order of channels prior to and after the introduction of another channel is assumed to have an impact. It is also necessary to possess data for all possible permutations of all k subsets of the n campaigns, and not only on all (unordered) k-subsets of the n campaigns, k<=n. In other words, one must know the gains of A, B, C, A>B, B>A, etc. to compute the Shapley Value.

Differences between the two approaches

We simulate an ordered case where the value for each ordered sequence k for k<=3 is known. We compare it to the usual Shapley value calculated based on known gains of unordered subsets of campaigns. So as to compare relevant values, we have built the gains matrix so that the gains of a subset A, B i.e  𝜈({B,A}) is the average of the gains of ordered sequences made up with A and B (assuming the number of journeys where A>B equals the number of journeys where B>A, we have 𝜈({B,A})=0.5( 𝜈((A>B)) + 𝜈((B>A)) ). We let the value of the grand coalition be different depending on the order of campaigns-keeping the constraints that it averages to the value used for the unordered case.

Note: mvA refers to the marginal value of A in a given sequence.
With traditionnal unordered coalitions:

With ordered sequences used to compute the marginal values:

 

We can see that the two approaches yield very different results. In the unordered case, the Shapley Value campaign C is the highest, culminating at 20, while A and B have the same Shapley Value mvA=mvB=15. In the ordered case, campaign A has the highest Shapley Value and all campaigns have different Shapley Values.

This example illustrates the inherent differences between the set and sequences approach to Shapley values. Real life data is more likely to resemble the ordered case as conversion probabilities may for any given set of campaigns be influenced by the order through which the campaigns appear.

Advantages

Shapley value has become popular in allocation problems in cooperative games because it is the unique allocation which satisfies different axioms:

  • Efficiency: Shaple Values of all channels add up to the total gains (here, orders) observed.
  • Symmetry: if channels A and B bring the same contribution to any coalition of campaigns, then their Shapley Value i sthe same
  • Null player: if a channel brings no additionnal gains to all coalitions, then its Shapley Value is zero
  • Strong monotony: the Shapley Value of a player increases weakly if all its marginal contributions increase weakly

These properties make the Shapley Value close to what we intuitively define as a fair attribution.

Issues

  • The Shapley Value is based on combinatory mathematics, and the number of possible coalitions and ordered sequences becomes huge when the number of campaigns increases.
  • If unordered, the Shapley Value assumes the contribution of campaign A is the same if followed by campaign B or by C.
  • If ordered, the number of combinations for which data must be available and sufficient is huge.
  • Channels rarely present or present in long journeys will be played down.
  • Generally, gains are supposed to grow with the number of players in the game. However, it is plausible that in the marketing context a journey with a high number of channels will not necessarily bring more orders than a journey with less channels involved.

References:

R package: GameTheoryAllocation

Article:
Zhao & al, 2018 “Shapley Value Methods for Attribution Modeling in Online Advertising “
https://link.springer.com/content/pdf/10.1007/s13278-017-0480-z.pdf
Courses: https://www.lamsade.dauphine.fr/~airiau/Teaching/CoopGames/2011/coopgames-7%5b8up%5d.pdf
Blogs: https://towardsdatascience.com/one-feature-attribution-method-to-supposedly-rule-them-all-shapley-values-f3e04534983d

                                  B) Markov Chains

Markov Chains are used to model random processes, i.e events that occur in a sequential manner and in such a way that the probability to move to a certain state only depends on the past steps. The number of previous steps that are taken into account to model the transition probability is called the memory parameter of the sequence, and for the model to have a solution must be comprised between 0 and 4. A Markov Chain process is thus defined entirely by its Transition Matrix and its initial vector (i.e the starting point of the process).

Markov Chains are applied in many scientific fields. Typically, they are used in weather forecasting, with the sequence of Sunny and Rainy days following a Markov Process of memory parameter 0, so that for each given day the probability that the next day will be rainy or sunny only depends on the weather of the current day. Other applications can be found in sociology to understand the dynamics of social classes intergenerational reproduction. To get more both mathematical and applied illustration, I recommend the reading of this course.

In the marketing context, Markov Chains are an interesting way to model the conversion funnel. To go from the from the Markov Model to the Attribution logic, we calculate the Removal Effect of each channel, i.e the difference in conversions that happen if the channel is removed. Please read below for an introduction to the methodology.

The first step in a Markov Chains Attribution Model is to build the transition matrix that captures the transition probabilities between the campaigns accross existing customer journeys. This Matrix is to be read as a “From state A to state B” table, from the left to the right. A first difficulty is finding the right memory parameter to use. A large memory parameter would allow to take more into account interraction effects within the conversion funnel but would lead to increased computationnal time, a non-readable transition matrix, and be more sensitive to noisy data. Please note that this transition matrix provides useful information on the conversion funnel and on the relationships between campaigns and can be used as such as an analytical tool. I suggest the clear and easily R code which can be found here or here.

Here is an illustration of a Markov Chain with memory Parameter of 0: the probability to go to a certain campaign B in the next step only depend on the campaign we are currently at:

The associated Transition Matrix is then (with null probabilities left as Blank):

The second step is  to compute the actual responsibility of a channel in total conversions. As mentionned above, the main philosophy to do so is to calculate the Removal Effect of each channel, i.e the changes in the number of conversions when a channel is entirely removed. All customer journeys which went through this channel are settled out to be unsuccessful. This calculation is done by applying the transition matrix with and without the removed channels to an initial vector that contains the number of desired simulations.

Building on our current example, we can then settle an initial vector with the desired number of simulations, e.g 10 000:

 

It is possible at this stage to add a constraint on the maximum number of times the matrix is applied to the data, i.e on the maximal number of campaigns a simulated journey is allowed to have.

Advantages

  • The dynamic journey is taken into account, as well as the transition between two states. The funnel is not assumed to be linear.
  • It is possile to build a conversion graph that maps the customer journey provides valuable insights.
  • It is possible to evaluate partly the accuracy of the Attribution Model based on Markov Chains. It is for example possible to see how well the transition matrix help predict the future by analysing the number of correct predictions at any given step over all sequences.

Disadvantages

  • It can be somewhat difficult to set the memory parameter. Complementarity effects between channels are not well taken into account if the memory is low, but a parameter too high will lead to over-sensitivity to noise in the data and be difficult to implement if customer journeys tend to have a number of campaigns below this memory parameter.
  • Long journeys with different channels involved will be overweighted, as they will count many times in the Removal Effect.  For example, if there are n-1 channels in the customer journey, this journey will be considered as failure for the n-1 channel-RE. If the volume effects (i.e the impact of the overall number of channels in a journey, irrelevant from their type° are important then results may be biased.

References:

R package: ChannelAttribution

Git:

https://github.com/MatCyt/Markov-Chain/blob/master/README.md

Course:

https://www.ssc.wisc.edu/~jmontgom/markovchains.pdf

Article:

“Mapping the Customer Journey: A Graph-Based Framework for Online Attribution Modeling”; Anderl, Eva and Becker, Ingo and Wangenheim, Florian V. and Schumann, Jan Hendrik, 2014. Available at SSRN: https://ssrn.com/abstract=2343077 or http://dx.doi.org/10.2139/ssrn.2343077

“Media Exposure through the Funnel: A Model of Multi-Stage Attribution”, Abhishek & al, 2012

“Multichannel Marketing Attribution Using Markov Chains”, Kakalejčík, L., Bucko, J., Resende, P.A.A. and Ferencova, M. Journal of Applied Management and Investments, Vol. 7 No. 1, pp. 49-60.  2018

Blogs:

https://analyzecore.com/2016/08/03/attribution-model-r-part-1

https://analyzecore.com/2016/08/03/attribution-model-r-part-2

                          3.3 To go further: Tackling selection biases with Quasi-Experiments

Exposure to certain types of advertisement is usually highly correlated to non-observable variables. Differences in the behaviour of users exposed to different campaigns may thus only be driven by core differences in converison probabilities between groups whether than by the campaign effect. These potential selection effects may bias the results obtained using historical data.

Quasi-Experiments can help correct this selection effect while still using available observationnal data.  These methods recreate the settings on a randomized setting. The goal is to come as close as possible to the ideal of comparing two populations that are identical in all respects except for the advertising exposure. However, populations might still differ with respect to some unobserved characteristics.

Common quasi-experimental methods used for instance in Public Policy Evaluation are:

  • Discontinuity Regressions
  • Matching Methods, such as Exact Matching,  Propensity-score matching or k-nearest neighbourghs.

References:

Article:

“Towards a digital Attribution Model: Measuring the impact of display advertising on online consumer behaviour”, Anindya Ghose & al, MIS Quarterly Vol. 40 No. 4, pp. 1-XX, 2016

https://pdfs.semanticscholar.org/4fa6/1c53f281fa63a9f0617fbd794d54911a2f84.pdf

        4. First Steps towards a Practical Implementation

Identify key points of interests

  • Identify the nature of touchpoints available: is the data based on clicks? If so, is there a way to complement the data with A/B tests to measure the influence of ads without clicks (display, video) ? For example, what happens to sales when display campaign is removed? Analysing this multiplier effect would give the overall responsibility of display on sales, to be deduced from current attribution values given to click-based channels. More interestingly, what is the impact of the removal of display campaign on the occurences of click-based campaigns ? This would give us an idea of the impact of display ads on the exposure to each other campaigns, which would help correct the attribution values more precisely at the campaign level.
  • Define the KPI to track. From a pure Marketing perspective, looking at purchases may be sufficient, but from a financial perspective looking at profits, though a bit more difficult to compute, may drive more interesting results.
  • Define a customer journey. It may seem obvious, but the notion needs to be clarified at first. Would it be defined by a time limit? If so, which one? Does it end when a conversion is observed? For example, if a customer makes 2 purchases, would the campaigns he’s been exposed to before the first order still be accounted for in the second order? If so, with a time decay?
  • Define the research framework: are we interested only in customer journeys which have led to conversions or in all journeys? Keep in mind that successful customer journeys are a non-representative sample of customer journeys. Models built on the analysis of biased samples may be conservative. Take an extreme example: 80% of customers who see campaign A buy the product, VS 1% for campaign B. However, campaign B exposure is great and 100 Million people see it VS only 1M for campaign A. An Attribution Model based on successful journeys will give higher credit to campaign B which is an auguable conclusion. Taking into account costs per campaign (in the case where costs are calculated by clicks) may of course tackle this issue partly, as campaign A could then exhibit higher returns, but a serious fallacious reasonning is at stake here.

Analyse the typical customer journey    

  • Performing a duration analysis on the data may help you improve the definition of the customer journey to be used by your organization. After which days are converison probabilities null? Should we consider the effect of campaigns disappears after x days without orders? For example, if 99% of orders are placed in the 30 days following a first click, it might be interesting to define the customer journey as a 30 days time frame following the first oder.
  • Look at the distribution of the number of campaigns in a typical journey. If you choose to calculate the effect of campaigns interraction in your Attribution Model, it may indeed help you determine the maximum number of campaigns to be included in a combination. Indeed, you may not need to assess the impact of channel combinations with above than 4 different channels if 95% of orders are placed after less then 4 campaigns.
  • Transition matrixes: what if a campaign A systematically leads to a campaign B? What happens if we remove A or B? These insights would give clues to ask precise questions for a latter AB test, for example to find out if there is complementarity between channels A and B – (implying none should be removed) or mere substitution (implying one can be given up).
  • If conversion rates are available: it can be interesting to perform a survival analysis i.e to analyse the likelihood of conversion based on duration since first click. This could help us excluse potential outliers or individuals who have very low conversion probabilities.

Summary

Attribution is a complex topic which will probably never be definitively solved. Indeed, a main issue is the difficulty, or even impossibility, to evaluate precisely the accuracy of the attribution model that we’ve built. Attribution Models should be seen as a good yet always improvable approximation of the incremental values of campaigns, and be presented with their intrinsinc limits and biases.

A common trap when it comes to sampling from a population that intrinsically includes outliers

I will discuss a common fallacy concerning the conclusions drawn from calculating a sample mean and a sample standard deviation and more importantly how to avoid it.

Suppose you draw a random sample x_1, x_2, … x_N of size N and compute the ordinary (arithmetic) sample mean  x_m and a sample standard deviation sd from it.  Now if (and only if) the (true) population mean µ (first moment) and population variance (second moment) obtained from the actual underlying PDF  are finite, the numbers x_m and sd make the usual sense otherwise they are misleading as will be shown by an example.

By the way: The common correlation coefficient will also be undefined (or in practice always point to zero) in the presence of infinite population variances. Hopefully I will create an article discussing this related fallacy in the near future where a suitable generalization to Lévy-stable variables will be proposed.

 Drawing a random sample from a heavy tailed distribution and discussing certain measures

As an example suppose you have a one dimensional random walker whose step length is distributed by a symmetric standard Cauchy distribution (Lorentz-profile) with heavy tails, i.e. an alpha-stable distribution with alpha being equal to one. The PDF of an individual independent step is given by p(x) = \frac{\pi^{-1}}{(1 + x^2)} , thus neither the first nor the second moment exist whereby the first exists and vanishes at least in the sense of a principal value due to symmetry.

Still let us generate N = 3000 (pseudo) standard Cauchy random numbers in R* to analyze the behavior of their sample mean and standard deviation sd as a function of the reduced sample size n \leq N.

*The R-code is shown at the end of the article.

Here are the piecewise sample mean (in blue) and standard deviation (in red) for the mentioned Cauchy sampling. We see that both the sample mean and sd include jumps and do not converge.

Especially the mean deviates relatively largely from zero even after 3000 observations. The sample sd has no target due to the population variance being infinite.

If the data is new and no prior distribution is known, computing the sample mean and sd will be misleading. Astonishingly enough the sample mean itself will have the (formally exact) same distribution as the single step length p(x). This means that the sample mean is also standard Cauchy distributed implying that with a different Cauchy sample one could have easily observed different sample means far of the presented values in blue.

What sense does it make to present the usual interval x_m \pm sd / \sqrt{N} in such a case? What to do?

The sample median, median absolute difference (mad) and Inter-Quantile-Range (IQR) are more appropriate to describe such a data set including outliers intrinsically. To make this plausible I present the following plot, whereby the median is shown in black, the mad in green and the IQR in orange.

This example shows that the median, mad and IQR converge quickly against their assumed values and contain no major jumps. These quantities do an obviously better job in describing the sample. Even in the presence of outliers they remain robust, whereby the mad converges more quickly than the IQR. Note that a standard Cauchy sample will contain half of its sample in the interval median \pm mad meaning that the IQR is twice the mad.

Drawing a random sample from a PDF that has finite moments

Just for comparison I also show the above quantities for a standard normal (pseudo) sample labeled with the same color as before as a counter example. In this case not only do both the sample mean and median but also the sd and mad converge towards their expected values (see plot below). Here all the quantities describe the data set properly and there is no trap since there are no intrinsic outliers. The sample mean itself follows a standard normal, so that the sd in deed makes sense and one could calculate a standard error \frac{sd}{\sqrt{N}} from it to present the usual stochastic confidence intervals for the sample mean.

A careful observation shows that in contrast to the Cauchy case here the sampled mean and sd converge more quickly than the sample median and the IQR. However still the sampled mad performs about as well as the sd. Again the mad is twice the IQR.

And here are the graphs of the prementioned quantities for a pseudo normal sample:

The take-home-message:

Just be careful when you observe outliers and calculate sample quantities right away, you might miss something. At best one carefully observes how the relevant quantities change with sample size as demonstrated in this article.

Such curves should become of broader interest in order to improve transparency in the Data Science process and reduce fallacies as well.

Thank you for reading.

P.S.: Feel free to play with the set random seed in the R-code below and observe how other quantities behave with rising sample size. Of course you can also try different PDFs at the beginning of the code. You can employ a Cauchy, Gaussian, uniform, exponential or Holtsmark (pseudo) random sample.

 

QUIZ: Which one of the recently mentioned random samples contains a trap** and why?

**in the context of this article

 

R-code used to generate the data and for producing plots:

 

#R-script for emphasizing convergence and divergence of sample means

####install and load relevant packages ####

#uncomment these lines if necessary
#install.packages(c('ggplot2',’stabledist’))
#library(ggplot2)
#library(stabledist)

#####drawing random samples #####

#Setting a random seed for being able to reproduce results  
set.seed(1234567)   
N= 2000     #sample size

#Choose a PDF from which a sample shall be drawn
#To do so (un)comment the respective lines of following code

data <- rcauchy(N)    # option1(default): standard Cauchy sampling

#data <- rnorm(N)     #option2: standard Gaussian sampling
                               
#data <- rexp(N)    # option3: standard exponential sampling

#data <- rstable(N,alpha=1.5,beta=0)  # option4: standard symmetric Holtsmark sampling

#data <- runif(N)              #option5: standard uniform sample

#####descriptive statistics####
#preparations/declarations

SUM = vector()
sd =vector()
mean = vector()
SQ =vector()
SQUARES = vector()
median = vector()
mad =vector()
quantiles = data.frame()
sem =vector()

#piecewise calculaion of descrptive quantities

for (k in 1:length(data)){              #mainloop
SUM[k] <- sum(data[1:k])            # sum of sample
mean[k] <- mean(data[1:k])          # arithmetic mean
sd[k] <- sd(data[1:k])              # standard deviation
sem[k] <- sd[k]/(sqrt(k))          #standard error of the sample mean (for finite variances)
mad[k] <- mad(data[1:k],const=1)   # median absolute deviation    

for (j in 1:5){
qq <- quantile(data[1:k],na.rm = T)
quantiles[k,j] <- qq[j]         #quantiles of sample
}
colnames(quantiles) <- c('min','Q1','median','Q3','max')

for (i in 1:length(data[1:k])){
SQUARES[i] <- data[i]*data[i]    
}
SQ[k] <- sum(SQUARES[1:k])    #sum of squares of random sample
}  #end of mainloop

#create table containing all relevant data
TABLE <-  as.data.frame(cbind(quantiles,mean,sd,SQ,SUM,sem))




#####plotting results###
x11()
print(ggplot(TABLE,aes(1:N,median))+
geom_point(size=.5)+xlab('sample size n')+ylab('sample median'))
x11()
print(ggplot(TABLE,aes(1:N,mad))+geom_point(size=.5,color ='green')+
xlab('sample size n')+ylab('sample median absolute difference'))
x11()
print(ggplot(TABLE,aes(1:N,sd))+geom_point(size=.5,color ='red')+
xlab('sample size n')+ylab('sample standard deviation'))
x11()
print(ggplot(TABLE,aes(1:N,mean))+geom_point(size=.5, color ='blue')+
xlab('sample size n')+ylab('sample mean'))
x11()
print(ggplot(TABLE,aes(1:N,Q3-Q1))+geom_point(size=.5, color ='blue')+
xlab('sample size n')+ylab('IQR'))

#uncomment the following lines of code to see further plots

#x11()
#print(ggplot(TABLE,aes(1:N,sem))+geom_point(size=.5)+
#xlab('sample size n')+ylab('sample sum of r.v.'))
#x11()
#print(ggplot(TABLE,aes(1:N,SUM))+geom_point(size=.5)+
#xlab('sample size n')+ylab('sample sum of r.v.'))
#x11()
#print(ggplot(TABLE,aes(1:N,SQ))+geom_point(size=.5)+
#xlab('sample size n')+ylab('sample sum of squares'))

 

The Inside Out of ML Based Prescriptive Analytics

With the constantly growing number of data, more and more companies are shifting towards analytic solutions. Analytic solutions help in extracting the meaning from the huge amount of data available. Thus, improving decision making.

Decision making is an important aspect of businesses, and technologies like Machine Learning are enhancing it further. The growing use of Machine Learning has changed the way of prescriptive analytics. In order to optimize the efforts, companies need to be more accurate with the historical and present data. This is because the historical and present data are the essentials of analytics. This article helps describe the inside out of Machine Learning-based prescriptive analytics.

Phases of business analytics

Descriptive analytics, predictive analytics, and prescriptive analytics are the three phases of business analytics. Descriptive analytics, being the first one, deals with past performance. Historical data is mined to understand past performance. This serves as a way to look for the reasons behind past success and failure. It is a kind of post-mortem analysis and most management reporting like sales, marketing, operations, and finance etc. make use of this.

The second one is a predictive analysis which answers the question of what is likely to happen. The historical data is now combined with rules, algorithms etc. to determine the possible future outcome or likelihood of a situation occurring.

The final phase, well known to everyone, is prescriptive analytics. It can continually take in new data and re-predict and re-prescribe. This improves the accuracy of the prediction and prescribes better decision options.  Professional services or technology or their combination can be chosen to perform all the three analytics.

More about prescriptive analytics

The analysis of business activities goes through many phases. Prescriptive analytics is one such. It is known to be the third phase of business analytics and comes after descriptive and predictive analytics. It entails the application of mathematical and computational sciences. It makes use of the results obtained from descriptive and predictive analysis to suggest decision options. It goes beyond predicting future outcomes and suggests actions to benefit from the predictions. It shows the implications of each decision option. It anticipates on what will happen when it will happen as well as why it will happen.

ML-based prescriptive analytics

Being just before the prescriptive analytics, predictive analytics is often confused with it. What actually happens is predictive analysis leads to prescriptive analysis. Thus, a Machine Learning based prescriptive analytics goes through an ML-based predictive analysis first. Therefore, it becomes necessary to consider the ML-based predictive analysis first.

ML-based predictive analytics:

A lot of things prevent businesses from achieving predictive analysis capabilities.  Machine Learning can be a great help in boosting Predictive analytics. Use of Machine Learning and Artificial Intelligence algorithms helps businesses in optimizing and uncovering the new statistical patterns. These statistical patterns form the backbone of predictive analysis. E-commerce, marketing, customer service, medical diagnosis etc. are some of the prospective use cases for Machine Learning based predictive analytics.

In E-commerce, machine learning can help in predicting the usual choices of the customer. Thus, presenting him/her according to his/her likes and dislikes. It can also help in predicting fraudulent transaction. Similarly, B2B marketing also makes good use of Machine learning based predictive analytics. Customer services and medical diagnosis also benefit from predictive analytics. Thus, a prediction and a prescription based on machine learning can boost various business functions.

Organizations and software development companies are making more and more use of machine learning based predictive analytics. The advancements like neural networks and deep learning algorithms are able to uncover hidden information. This all requires a well-researched approach. Big data and progressive IT systems also act as important factors in this.

Sentiment Analysis using Python

One of the applications of text mining is sentiment analysis. Most of the data is getting generated in textual format and in the past few years, people are talking more about NLP. Improvement is a continuous process many product based companies leverage these text mining techniques to examine the sentiments of the customers to find about what they can improve in the product. This information also helps them to understand the trend and demand of the end user which results in Customer satisfaction.

As text mining is a vast concept, the article is divided into two subchapters. The main focus of this article will be calculating two scores: sentiment polarity and subjectivity using python. The range of polarity is from -1 to 1(negative to positive) and will tell us if the text contains positive or negative feedback. Most companies prefer to stop their analysis here but in our second article, we will try to extend our analysis by creating some labels out of these scores. Finally, a multi-label multi-class classifier can be trained to predict future reviews.

Without any delay let’s deep dive into the code and mine some knowledge from textual data.

There are a few NLP libraries existing in Python such as Spacy, NLTK, gensim, TextBlob, etc. For this particular article, we will be using NLTK for pre-processing and TextBlob to calculate sentiment polarity and subjectivity.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer, PorterStemmer
from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob

The dataset is available here for download and we will be using pandas read_csv function to import the dataset. I would like to share an additional information here which I came to know about recently. Those who have already used python and pandas before they probably know that read_csv is by far one of the most used function. However, it can take a while to upload a big file. Some folks from  RISELab at UC Berkeley created Modin or Pandas on Ray which is a library that speeds up this process by changing a single line of code.

amz_reviews = pd.read_csv("1429_1.csv")

After importing the dataset it is recommended to understand it first and study the structure of the dataset. At this point we are interested to know how many columns are there and what are these columns so I am going to check the shape of the data frame and go through each column name to see if we need them or not.

amz_reviews.shape
(34660, 21)

amz_reviews.columns
Index(['id', 'name', 'asins', 'brand', 'categories', 'keys', 'manufacturer',
       'reviews.date', 'reviews.dateAdded', 'reviews.dateSeen',
       'reviews.didPurchase', 'reviews.doRecommend', 'reviews.id',
       'reviews.numHelpful', 'reviews.rating', 'reviews.sourceURLs',
       'reviews.text', 'reviews.title', 'reviews.userCity',
       'reviews.userProvince', 'reviews.username'],
      dtype='object')

 

There are so many columns which are not useful for our sentiment analysis and it’s better to remove these columns. There are many ways to do that: either just select the columns which you want to keep or select the columns you want to remove and then use the drop function to remove it from the data frame. I prefer the second option as it allows me to look at each column one more time so I don’t miss any important variable for the analysis.

columns = ['id','name','keys','manufacturer','reviews.dateAdded', 'reviews.date','reviews.didPurchase',
          'reviews.userCity', 'reviews.userProvince', 'reviews.dateSeen', 'reviews.doRecommend','asins',
          'reviews.id', 'reviews.numHelpful', 'reviews.sourceURLs', 'reviews.title']

df = pd.DataFrame(amz_reviews.drop(columns,axis=1,inplace=False))

Now let’s dive deep into the data and try to mine some knowledge from the remaining columns. The first step we would want to follow here is just to look at the distribution of the variables and try to make some notes. First, let’s look at the distribution of the ratings.

df['reviews.rating'].value_counts().plot(kind='bar')

Graphs are powerful and at this point, just by looking at the above bar graph we can conclude that most people are somehow satisfied with the products offered at Amazon. The reason I am saying ‘at’ Amazon is because it is just a platform where anyone can sell their products and the user are giving ratings to the product and not to Amazon. However, if the user is satisfied with the products it also means that Amazon has a lower return rate and lower fraud case (from seller side). The job of a Data Scientist relies not only on how good a model is but also on how useful it is for the business and that’s why these business insights are really important.

Data pre-processing for textual variables

Lowercasing

Before we move forward to calculate the sentiment scores for each review it is important to pre-process the textual data. Lowercasing helps in the process of normalization which is an important step to keep the words in a uniform manner (Welbers, et al., 2017, pp. 245-265).

## Change the reviews type to string
df['reviews.text'] = df['reviews.text'].astype(str)

## Before lowercasing 
df['reviews.text'][2]
'Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it 
already...'

## Lowercase all reviews
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['reviews.text'][2] ## to see the difference
'inexpensive tablet for him to use and learn on, step up from the nabi. he was thrilled with it, learn how to skype on it 
already...'

Special characters

Special characters are non-alphabetic and non-numeric values such as {!,@#$%^ *()~;:/<>|+_-[]?}. Dealing with numbers is straightforward but special characters can be sometimes tricky. During tokenization, special characters create their own tokens and again not helpful for any algorithm, likewise, numbers.

## remove punctuation
df['reviews.text'] = df['reviews.text'].str.replace('[^ws]','')
df['reviews.text'][2]
'inexpensive tablet for him to use and learn on step up from the nabi he was thrilled with it learn how to skype on it already'

Stopwords

Stop-words being most commonly used in the English language; however, these words have no predictive power in reality. Words such as I, me, myself, he, she, they, our, mine, you, yours etc.

stop = stopwords.words('english')
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['reviews.text'][2]
'inexpensive tablet use learn step nabi thrilled learn skype already'

Stemming

Stemming algorithm is very useful in the field of text mining and helps to gain relevant information as it reduces all words with the same roots to a common form by removing suffixes such as -action, ing, -es and -ses. However, there can be problematic where there are spelling errors.

st = PorterStemmer()
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
df['reviews.text'][2]
'inexpens tablet use learn step nabi thrill learn skype alreadi'

This step is extremely useful for pre-processing textual data but it also depends on your goal. Here our goal is to calculate sentiment scores and if you look closely to the above code words like ‘inexpensive’ and ‘thrilled’ became ‘inexpens’ and ‘thrill’ after applying this technique. This will help us in text classification to deal with the curse of dimensionality but to calculate the sentiment score this process is not useful.

Sentiment Score

It is now time to calculate sentiment scores of each review and check how these scores look like.

## Define a function which can be applied to calculate the score for the whole dataset

def senti(x):
    return TextBlob(x).sentiment  

df['senti_score'] = df['reviews.text'].apply(senti)

df.senti_score.head()

0                                   (0.3, 0.8)
1                                (0.65, 0.675)
2                                   (0.0, 0.0)
3    (0.29545454545454547, 0.6492424242424243)
4                    (0.5, 0.5827777777777777)
Name: senti_score, dtype: object

As it can be observed there are two scores: the first score is sentiment polarity which tells if the sentiment is positive or negative and the second score is subjectivity score to tell how subjective is the text.

In my next article, we will extend this analysis by creating labels based on these scores and finally we will train a classification model.

Sentiment Analysis using Python

One of the applications of text mining is sentiment analysis. Most of the data is getting generated in textual format and in the past few years, people are talking more about NLP. Improvement is a continuous process and many product based companies leverage these text mining techniques to examine the sentiments of the customers to find about what they can improve in the product. This information also helps them to understand the trend and demand of the end user which results in Customer satisfaction.

As text mining is a vast concept, the article is divided into two subchapters. The main focus of this article will be calculating two scores: sentiment polarity and subjectivity using python. The range of polarity is from -1 to 1(negative to positive) and will tell us if the text contains positive or negative feedback. Most companies prefer to stop their analysis here but in our second article, we will try to extend our analysis by creating some labels out of these scores. Finally, a multi-label multi-class classifier can be trained to predict future reviews.

Without any delay let’s deep dive into the code and mine some knowledge from textual data.

There are a few NLP libraries existing in Python such as Spacy, NLTK, gensim, TextBlob, etc. For this particular article, we will be using NLTK for pre-processing and TextBlob to calculate sentiment polarity and subjectivity.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer, PorterStemmer
from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob

The dataset is available here for download and we will be using pandas read_csv function to import the dataset. I would like to share an additional information here which I came to know about recently. Those who have already used python and pandas before they probably know that read_csv is by far one of the most used function. However, it can take a while to upload a big file. Some folks from  RISELab at UC Berkeley created Modin or Pandas on Ray which is a library that speeds up this process by changing a single line of code.

amz_reviews = pd.read_csv("1429_1.csv")

After importing the dataset it is recommended to understand it first and study the structure of the dataset. At this point we are interested to know how many columns are there and what are these columns so I am going to check the shape of the data frame and go through each column name to see if we need them or not.

amz_reviews.shape
(34660, 21)

amz_reviews.columns
Index(['id', 'name', 'asins', 'brand', 'categories', 'keys', 'manufacturer',
       'reviews.date', 'reviews.dateAdded', 'reviews.dateSeen',
       'reviews.didPurchase', 'reviews.doRecommend', 'reviews.id',
       'reviews.numHelpful', 'reviews.rating', 'reviews.sourceURLs',
       'reviews.text', 'reviews.title', 'reviews.userCity',
       'reviews.userProvince', 'reviews.username'],
      dtype='object')

 

There are so many columns which are not useful for our sentiment analysis and it’s better to remove these columns. There are many ways to do that: either just select the columns which you want to keep or select the columns you want to remove and then use the drop function to remove it from the data frame. I prefer the second option as it allows me to look at each column one more time so I don’t miss any important variable for the analysis.

columns = ['id','name','keys','manufacturer','reviews.dateAdded', 'reviews.date','reviews.didPurchase',
          'reviews.userCity', 'reviews.userProvince', 'reviews.dateSeen', 'reviews.doRecommend','asins',
          'reviews.id', 'reviews.numHelpful', 'reviews.sourceURLs', 'reviews.title']

df = pd.DataFrame(amz_reviews.drop(columns,axis=1,inplace=False))

Now let’s dive deep into the data and try to mine some knowledge from the remaining columns. The first step we would want to follow here is just to look at the distribution of the variables and try to make some notes. First, let’s look at the distribution of the ratings.

df['reviews.rating'].value_counts().plot(kind='bar')

Graphs are powerful and at this point, just by looking at the above bar graph we can conclude that most people are somehow satisfied with the products offered at Amazon. The reason I am saying ‘at’ Amazon is because it is just a platform where anyone can sell their products and the user are giving ratings to the product and not to Amazon. However, if the user is satisfied with the products it also means that Amazon has a lower return rate and lower fraud case (from seller side). The job of a Data Scientist relies not only on how good a model is but also on how useful it is for the business and that’s why these business insights are really important.

Data pre-processing for textual variables

Lowercasing

Before we move forward to calculate the sentiment scores for each review it is important to pre-process the textual data. Lowercasing helps in the process of normalization which is an important step to keep the words in a uniform manner (Welbers, et al., 2017, pp. 245-265).

## Change the reviews type to string
df['reviews.text'] = df['reviews.text'].astype(str)

## Before lowercasing 
df['reviews.text'][2]
'Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it 
already...'

## Lowercase all reviews
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['reviews.text'][2] ## to see the difference
'inexpensive tablet for him to use and learn on, step up from the nabi. he was thrilled with it, learn how to skype on it 
already...'

Special characters

Special characters are non-alphabetic and non-numeric values such as {!,@#$%^ *()~;:/<>|+_-[]?}. Dealing with numbers is straightforward but special characters can be sometimes tricky. During tokenization, special characters create their own tokens and again not helpful for any algorithm, likewise, numbers.

## remove punctuation
df['reviews.text'] = df['reviews.text'].str.replace('[^ws]','')
df['reviews.text'][2]
'inexpensive tablet for him to use and learn on step up from the nabi he was thrilled with it learn how to skype on it already'

Stopwords

Stop-words being most commonly used in the English language; however, these words have no predictive power in reality. Words such as I, me, myself, he, she, they, our, mine, you, yours etc.

stop = stopwords.words('english')
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['reviews.text'][2]
'inexpensive tablet use learn step nabi thrilled learn skype already'

Stemming

Stemming algorithm is very useful in the field of text mining and helps to gain relevant information as it reduces all words with the same roots to a common form by removing suffixes such as -action, ing, -es and -ses. However, there can be problematic where there are spelling errors.

st = PorterStemmer()
df['reviews.text'] = df['reviews.text'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
df['reviews.text'][2]
'inexpens tablet use learn step nabi thrill learn skype alreadi'

This step is extremely useful for pre-processing textual data but it also depends on your goal. Here our goal is to calculate sentiment scores and if you look closely to the above code words like ‘inexpensive’ and ‘thrilled’ became ‘inexpens’ and ‘thrill’ after applying this technique. This will help us in text classification to deal with the curse of dimensionality but to calculate the sentiment score this process is not useful.

Sentiment Score

It is now time to calculate sentiment scores of each review and check how these scores look like.

## Define a function which can be applied to calculate the score for the whole dataset

def senti(x):
    return TextBlob(x).sentiment  

df['senti_score'] = df['reviews.text'].apply(senti)

df.senti_score.head()

0                                   (0.3, 0.8)
1                                (0.65, 0.675)
2                                   (0.0, 0.0)
3    (0.29545454545454547, 0.6492424242424243)
4                    (0.5, 0.5827777777777777)
Name: senti_score, dtype: object

As it can be observed there are two scores: the first score is sentiment polarity which tells if the sentiment is positive or negative and the second score is subjectivity score to tell how subjective is the text. The whole code is available here.

In my next article, we will extend this analysis by creating labels based on these scores and finally we will train a classification model.

Bringing intelligence to where data lives: Python & R embedded in T-SQL

Introduction

Did you know that you can write R and Python code within your T-SQL statements? Machine Learning Services in SQL Server eliminates the need for data movement. Instead of transferring large and sensitive data over the network or losing accuracy with sample csv files, you can have your R/Python code execute within your database. Easily deploy your R/Python code with SQL stored procedures making them accessible in your ETL processes or to any application. Train and store machine learning models in your database bringing intelligence to where your data lives.

You can install and run any of the latest open source R/Python packages to build Deep Learning and AI applications on large amounts of data in SQL Server. We also offer leading edge, high-performance algorithms in Microsoft’s RevoScaleR and RevoScalePy APIs. Using these with the latest innovations in the open source world allows you to bring unparalleled selection, performance, and scale to your applications.

If you are excited to try out SQL Server Machine Learning Services, check out the hands on tutorial below. If you do not have Machine Learning Services installed in SQL Server,you will first want to follow the getting started tutorial I published here: 

How-To Tutorial

In this tutorial, I will cover the basics of how to Execute R and Python in T-SQL statements. If you prefer learning through videos, I also published the tutorial on YouTube.

Basics

Open up SQL Server Management Studio and make a connection to your server. Open a new query and paste this basic example: (While I use Python in these samples, you can do everything with R as well)

EXEC sp_execute_external_script @language = N'Python',
@script = N'print(3+4)'

Sp_execute_external_script is a special system stored procedure that enables R and Python execution in SQL Server. There is a “language” parameter that allows us to choose between Python and R. There is a “script” parameter where we can paste R or Python code. If you do not see an output print 7, go back and review the setup steps in this article.

Parameter Introduction

Now that we discussed a basic example, let’s start adding more pieces:

EXEC sp_execute_external_script  @language =N'Python', 
@script = N' 
OutputDataSet = InputDataSet;
',
@input_data_1 =N'SELECT 1 AS Col1';

Machine Learning Services provides more natural communications between SQL and R/Python with an input data parameter that accepts any SQL query. The input parameter name is called “input_data_1”.
You can see in the python code that there are default variables defined to pass data between Python and SQL. The default variable names are “OutputDataSet” and “InputDataSet” You can change these default names like this example:

EXEC sp_execute_external_script  @language =N'Python', 
@script = N' 
MyOutput = MyInput;
',
@input_data_1_name = N'MyInput',
@input_data_1 =N'SELECT 1 AS foo',
@output_data_1_name =N'MyOutput';

As you executed these examples, you might have noticed that they each return a result with “(No column name)”? You can specify a name for the columns that are returned by adding the WITH RESULT SETS clause to the end of the statement which is a comma separated list of columns and their datatypes.

EXEC sp_execute_external_script  @language =N'Python', 
@script=N' 
MyOutput = MyInput;
',
@input_data_1_name = N'MyInput',
@input_data_1 =N'
SELECT 1 AS foo,
2 AS bar
',
@output_data_1_name =N'MyOutput'
WITH RESULT SETS ((MyColName int, MyColName2 int));

Input/Output Data Types

Alright, let’s discuss a little more about the input/output data types used between SQL and Python. Your input SQL SELECT statement passes a “Dataframe” to python relying on the Python Pandas package. Your output from Python back to SQL also needs to be in a Pandas Dataframe object. If you need to convert scalar values into a dataframe here is an example:

EXEC sp_execute_external_script  @language =N'Python', 
@script=N' 
import pandas as pd
c = 1/2
d = 1*2
s = pd.Series([c,d])
df = pd.DataFrame(s)
OutputDataSet = df
'

Variables c and d are both scalar values, which you can add to a pandas Series if you like, and then convert them to a pandas dataframe. This one shows a little bit more complicated example, go read up on the python pandas package documentation for more details and examples:

EXEC sp_execute_external_script  @language =N'Python', 
@script=N' 
import pandas as pd
s = {"col1": [1, 2], "col2": [3, 4]}
df = pd.DataFrame(s)
OutputDataSet = df
'

You now know the basics to execute Python in T-SQL!

Did you know you can also write your R and Python code in your favorite IDE like RStudio and Jupyter Notebooks and then remotely send the execution of that code to SQL Server? Check out these documentation links to learn more: https://aka.ms/R-RemoteSQLExecution https://aka.ms/PythonRemoteSQLExecution

Check out the SQL Server Machine Learning Services documentation page for more documentation, samples, and solutions. Check out these E2E tutorials on github as well.

Would love to hear from you! Leave a comment below to ask a question, or start a discussion!