Interview: Does Business Intelligence benefit from Cloud Data Warehousing?

Interview with Ross Perez, Senior Director, Marketing EMEA at Snowflake

Read this article in German:
“Profitiert Business Intelligence vom Data Warehouse in der Cloud?”

Does Business Intelligence benefit from Cloud Data Warehousing?

Ross Perez is the Senior Director, Marketing EMEA at Snowflake. He leads the Snowflake marketing team in EMEA and is charged with starting the discussion about analytics, data, and cloud data warehousing across EMEA. Before Snowflake, Ross was a product marketer at Tableau Software where he founded the Iron Viz Championship, the world’s largest and longest running data visualization competition.

Data Science Blog: Ross, Business Intelligence (BI) is not really a new trend. In 2019/2020, making data available for the whole company should not be a big thing anymore. Would you agree?

BI is definitely an old trend, reporting has been around for 50 years. People are accustomed to seeing statistics and data for the company at large, and even their business units. However, using BI to deliver analytics to everyone in the organization and encouraging them to make decisions based on data for their specific area is relatively new. In a lot of the companies Snowflake works with, there is a huge new group of people who have recently received access to self-service BI and visualization tools like Tableau, Looker and Sigma, and they are just starting to find answers to their questions.

Data Science Blog: Up until today, BI was just about delivering dashboards for reporting to the business. The data warehouse (DWH) was something like the backend. Today we have increased demand for data transparency. How should companies deal with this demand?

Because more people in more departments are wanting access to data more frequently, the demand on backend systems like the data warehouse is skyrocketing. In many cases, companies have data warehouses that weren’t built to cope with this concurrent demand and that means that the experience is slow. End users have to wait a long time for their reports. That is where Snowflake comes in: since we can use the power of the cloud to spin up resources on demand, we can serve any number of concurrent users. Snowflake can also house unlimited amounts of data, of both structured and semi-structured formats.

Data Science Blog: Would you say the DWH is the key driver for becoming a data-driven organization? What else should be considered here?

Absolutely. Without having all of your data in a single, highly elastic, and flexible data warehouse, it can be a huge challenge to actually deliver insight to people in the organization.

Data Science Blog: So much for the theory, now let’s talk about specific use cases. In general, it matters a lot whether you are storing and analyzing e.g. financial data or machine data. What do we have to consider for both purposes?

Financial data and machine data do look very different, and often come in different formats. For instance, financial data is often in a standard relational format. Data like this needs to be able to be easily queried with standard SQL, something that many Hadoop and noSQL tools were unable to provide. Luckily, Snowflake is an ansi-standard SQL data warehouse so it can be used with this type of data quite seamlessly.

On the other hand, machine data is often semi-structured or even completely unstructured. This type of data is becoming significantly more common with the rise of IoT, but traditional data warehouses were very bad at dealing with it since they were optimized for relational data. Semi-structured data like JSON, Avro, XML, Orc and Parquet can be loaded into Snowflake for analysis quite seamlessly in its native format. This is important, because you don’t want to have to flatten the data to get any use from it.

Both types of data are important, and Snowflake is really the first data warehouse that can work with them both seamlessly.

Data Science Blog: Back to the common business use case: Creating sales or purchase reports for the business managers, based on data from ERP-systems such as Microsoft or SAP. Which architecture for the DWH could be the right one? How many and which database layers do you see as necessary?

The type of report largely does not matter, because in all cases you want a data warehouse that can support all of your data and serve all of your users. Ideally, you also want to be able to turn it off and on depending on demand. That means that you need a cloud-based architecture… and specifically Snowflake’s innovative architecture that separates storage and compute, making it possible to pay for exactly what you use.

Data Science Blog: Where would you implement the main part of the business logic for the report? In the DWH or in the reporting tool? Does it matter which reporting tool we choose?

The great thing is that you can choose either. Snowflake, as an ansi-Standard SQL data warehouse, can support a high degree of data modeling and business logic. But you can also utilize partners like Looker and Sigma who specialize in data modeling for BI. We think it’s best that the customer chooses what is right for them.

Data Science Blog: Snowflake enables organizations to store and manage their data in the cloud. Does it mean companies lose control over their storage and data management?

Customers have complete control over their data, and in fact Snowflake cannot see, alter or change any aspect of their data. The benefit of a cloud solution is that customers don’t have to manage the infrastructure or the tuning – they decide how they want to store and analyze their data and Snowflake takes care of the rest.

Data Science Blog: How big is the effort for smaller and medium sized companies to set up a DWH in the cloud? Does this have to be an expensive long-term project in every case?

The nice thing about Snowflake is that you can get started with a free trial in a few minutes. Now, moving from a traditional data warehouse to Snowflake can take some time, depending on the legacy technology that you are using. But Snowflake itself is quite easy to set up and very much compatible with historical tools making it relatively easy to move over.

How To Remotely Send R and Python Execution to SQL Server from Jupyter Notebooks

Introduction

Did you know that you can execute R and Python code remotely in SQL Server from Jupyter Notebooks or any IDE? Machine Learning Services in SQL Server eliminates the need to move data around. Instead of transferring large and sensitive data over the network or losing accuracy on ML training with sample csv files, you can have your R/Python code execute within your database. You can work in Jupyter Notebooks, RStudio, PyCharm, VSCode, Visual Studio, wherever you want, and then send function execution to SQL Server bringing intelligence to where your data lives.

This tutorial will show you an example of how you can send your python code from Juptyter notebooks to execute within SQL Server. The same principles apply to R and any other IDE as well. If you prefer to learn through videos, this tutorial is also published on YouTube here:


 

Environment Setup Prerequisites

  1. Install ML Services on SQL Server

In order for R or Python to execute within SQL, you first need the Machine Learning Services feature installed and configured. See this how-to guide.

  1. Install RevoscalePy via Microsoft’s Python Client

In order to send Python execution to SQL from Jupyter Notebooks, you need to use Microsoft’s RevoscalePy package. To get RevoscalePy, download and install Microsoft’s ML Services Python Client. Documentation Page or Direct Download Link (for Windows).

After downloading, open powershell as an administrator and navigate to the download folder. Start the installation with this command (feel free to customize the install folder): .\Install-PyForMLS.ps1 -InstallFolder “C:\Program Files\MicrosoftPythonClient”

Be patient while the installation can take a little while. Once installed navigate to the new path you installed in. Let’s make an empty folder and open Jupyter Notebooks: mkdir JupyterNotebooks; cd JupyterNotebooks; ..\Scripts\jupyter-notebook

Create a new notebook with the Python 3 interpreter:

 

To test if everything is setup, import revoscalepy in the first cell and execute. If there are no error messages you are ready to move forward.

Database Setup (Required for this tutorial only)

For the rest of the tutorial you can clone this Jupyter Notebook from Github if you don’t want to copy paste all of the code. This database setup is a one time step to ensure you have the same data as this tutorial. You don’t need to perform any of these setup steps to use your own data.

  1. Create a database

Modify the connection string for your server and use pyodbc to create a new database.

import pyodbc  
# creating a new db to load Iris sample in 
new_db_name = "MLRemoteExec" connection_string = "Driver=SQL Server;Server=localhost\MSSQLSERVER2017;Database={0};Trusted_Connection=Yes;" 

cnxn = pyodbc.connect(connection_string.format("master"), autocommit=True) 

cnxn.cursor().execute("IF EXISTS(SELECT * FROM sys.databases WHERE [name] = '{0}') DROP DATABASE {0}".format(new_db_name)) 

cnxn.cursor().execute("CREATE DATABASE " + new_db_name)

cnxn.close()

print("Database created") 
  1. Import Iris sample from SkLearn

Iris is a popular dataset for beginner data science tutorials. It is included by default in sklearn package.

from sklearn import datasetsimport pandas as pd
# SkLearn has the Iris sample dataset built in to the packageiris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
  1. Use RecoscalePy APIs to create a table and load the Iris data

(You can also do this with pyodbc, sqlalchemy or other packages)

from revoscalepy import RxSqlServerData, rx_data_step
# Example of using RX APIs to load data into SQL table. You can also do this with pyodbc
table_ref = RxSqlServerData(connection_string=connection_string.format(new_db_name), table="Iris")rx_data_step(input_data = df, output_file = table_ref, overwrite = True)print("New Table Created: Iris")
print("Sklearn Iris sample loaded into Iris table")

Define a Function to Send to SQL Server

Write any python code you want to execute in SQL. In this example we are creating a scatter matrix on the iris dataset and only returning the bytestream of the .png back to Jupyter Notebooks to render on our client.

def send_this_func_to_sql():
    from revoscalepy import RxSqlServerData, rx_import
    from pandas.tools.plotting import scatter_matrix
    import matplotlib.pyplot as plt
    import io    
# remember the scope of the variables in this func are within our SQL Server Python Runtime
    connection_string = "Driver=SQL Server;Server=localhost\MSSQLSERVER2017; Database=MLRemoteExec;Trusted_Connection=Yes;"

# specify a query and load into pandas dataframe df
    sql_query = RxSqlServerData(connection_string=connection_string, sql_query = "select * from Iris")

    df = rx_import(sql_query)
    scatter_matrix(df)

# return bytestream of image created by scatter_matrix
    buf = io.BytesIO()
    plt.savefig(buf, format="png")
    buf.seek(0)
    return buf.getvalue()

Send execution to SQL

Now that we are finally set up, check out how easy sending remote execution really is! First, import revoscalepy. Create a sql_compute_context, and then send the execution of any function seamlessly to SQL Server with RxExec. No raw data had to be transferred from SQL to the Jupyter Notebook. All computation happened within the database and only the image file was returned to be displayed.

from IPython import display
import matplotlib.pyplot as plt 
from revoscalepy import RxInSqlServer, rx_exec# create a remote compute context with connection to SQL Server

sql_compute_context = RxInSqlServer(connection_string=connection_string.format(new_db_name))

# use rx_exec to send the function execution to SQL Server

image = rx_exec(send_this_func_to_sql, compute_context=sql_compute_context)[0]

# only an image was returned to my jupyter client. All data remained secure and was manipulated in my db.

display.Image(data=image)

While this example is trivial with the Iris dataset, imagine the additional scale, performance, and security capabilities that you now unlocked. You can use any of the latest open source R/Python packages to build Deep Learning and AI applications on large amounts of data in SQL Server. We also offer leading edge, high-performance algorithms in Microsoft’s RevoScaleR and RevoScalePy APIs. Using these with the latest innovations in the open source world allows you to bring unparalleled selection, performance, and scale to your applications.

Learn More

Check out SQL Machine Learning Services Documentation to learn how you can easily deploy your R/Python code with SQL stored procedures making them accessible in your ETL processes or to any application. Train and store machine learning models in your database bringing intelligence to where your data lives.

Other YouTube Tutorials:

Consider Anonymization – Process Mining Rule 3 of 4

This is article no. 3 of the four-part article series Privacy, Security and Ethics in Process Mining.

Read this article in German:
Datenschutz, Sicherheit und Ethik beim Process Mining – Regel 3 von 4

If you have sensitive information in your data set, instead of removing it you can also consider the use of anonymization. When you anonymize a set of values, then the actual values (for example, the employee names “Mary Jones”, “Fred Smith”, etc.) will be replaced by another value (for example, “Resource 1”, “Resource 2”, etc.).

If the same original value appears multiple times in the data set, then it will be replaced with the same replacement value (“Mary Jones” will always be replaced by “Resource 1”). This way, anonymization allows you to obfuscate the original data but it preserves the patterns in the data set for your analysis. For example, you will still be able to analyze the workload distribution across all employees without seeing the actual names.

Some process mining tools (Disco and ProM) include anonymization functionality. This means that you can import your data into the process mining tool and select which data fields should be anonymized. For example, you can choose to anonymize just the Case IDs, the resource name, attribute values, or the timestamps. Then you export the anonymized data set and you can distribute it among your team for further analysis.

Do:

  • Determine which data fields are sensitive and need to be anonymized (see also the list of common process mining attributes and how they are impacted if anonymized).
  • Keep in mind that despite the anonymization certain information may still be identifiable. For example, there may be just one patient having a very rare disease, or the birthday information of your customer combined with their place of birth may narrow down the set of possible people so much that the data is not anonymous anymore.

Don’t:

  • Anonymize the data before you have cleaned your data, because after the anonymization the data cleaning may not be possible anymore. For example, imagine that slightly different customer category names are used in different regions but they actually mean the same. You would like to merge these different names in a data cleaning step. However, after you have anonymized the names as “Category 1”, “Category 2”, etc. the data cleaning cannot be done anymore.
  • Anonymize fields that do not need to be anonymized. While anonymization can help to preserve patterns in your data, you can easily lose relevant information. For example, if you anonymize the Case ID in your incident management process, then you cannot look up the ticket number of the incident in the service desk system anymore. By establishing a collaborative culture around your process mining initiative (see guideline No. 4) and by working in a responsible, goal-oriented way, you can often work openly with the original data that you have within your team.

Responsible Handling of Data – Process Mining Rule 2 of 4

This is article no. 2 of the four-part article series Privacy, Security and Ethics in Process Mining.

Read this article in German:
Datenschutz, Sicherheit und Ethik beim Process Mining – Regel 2 von 4

Like in any other data analysis technique, you must be careful with the data once you have obtained it. In many projects, nobody thinks about the data handling until it is brought up by the security department. Be that person who thinks about the appropriate level of protection and has a clear plan already prior to the collection of the data.

Do:

  • Have external parties sign a Non Disclosure Agreement (NDA) to ensure the confidentiality of the data. This holds, for example, for consultants you have hired to perform the process mining analysis for you, or for researchers who are participating in your project. Contact your legal department for this. They will have standard NDAs that you can use.
  • Make sure that the hard drive of your laptop, external hard drives, and USB sticks that you use to transfer the data and your analysis results are encrypted.

Don’t:

  • Give the data set to your co-workers before you have checked what is actually in the data. For example, it could be that the data set contains more information than you requested, or that it contains sensitive data that you did not think about. For example, the names of doctors and nurses might be mentioned in a free-text medical notes attribute. Make sure you remove or anonymize (see guideline No. 3) all sensitive data before you pass it on.
  • Upload your data to a cloud-based process mining tool without checking that your organization allows you to upload this kind of data. Instead, use a desktop-based process mining tool (like Disco [3] or ProM [4]) to analyze your data locally or get the cloud-based process mining vendor to set-up an on-premise version of their software within your organization. This is also true for cloud-based storage services like Dropbox: Don’t just store data or analysis results in the cloud even if it is convenient.

Privacy, Security and Ethics in Process Mining – Article Series

When I moved to the Netherlands 12 years ago and started grocery shopping at one of the local supermarket chains, Albert Heijn, I initially resisted getting their Bonus card (a loyalty card for discounts), because I did not want the company to track my purchases. I felt that using this information would help them to manipulate me by arranging or advertising products in a way that would make me buy more than I wanted to. It simply felt wrong.

Read this article in German:
Datenschutz, Sicherheit und Ethik beim Process Mining – Artikelserie

The truth is that no data analysis technique is intrinsically good or bad. It is always in the hands of the people using the technology to make it productive and constructive. For example, while supermarkets could use the information tracked through the loyalty cards of their customers to make sure that we have to take the longest route through the store to get our typical items (passing by as many other products as possible), they can also use this information to make the shopping experience more pleasant, and to offer more products that we like.

Most companies have started to use data analysis techniques to analyze their data in one way or the other. These data analyses can bring enormous opportunities for the companies and for their customers, but with the increased use of data science the question of ethics and responsible use also grows more dominant. Initiatives like the Responsible Data Science seminar series [1] take on this topic by raising awareness and encouraging researchers to develop algorithms that have concepts like fairness, accuracy, confidentiality, and transparency built in (see Wil van der Aalst’s presentation on Responsible Data Science at Process Mining Camp 2016).

Process Mining can provide you with amazing insights about your processes, and fuel your improvement initiatives with inspiration and enthusiasm, if you approach it in the right way. But how can you ensure that you use process mining responsibly? What should you pay attention to when you introduce process mining in your own organization?

In this article series, we provide you four guidelines that you can follow to prepare your process mining analysis in a responsible way:

Part 1 of 4: Clarify the Goal of the Analysis

Part 2 of 4: Responsible Handling of Data

Part 3 of 4: Consider Anonymization

Part 4 of 4: Establish a collaborative Culture

Acknowledgements

We would like to thank Frank van Geffen and Léonard Studer, who initiated the first discussions in the workgroup around responsible process mining in 2015. Furthermore, we would like to thank Moe Wynn, Felix Mannhardt and Wil van der Aalst for their feedback on earlier versions of this article.