
The Crucial Intersection of Generative AI and Data Quality: Ensuring Reliable Insights

In data analytics, data’s quality is the bedrock of reliable insights. Just like a skyscraper’s stability depends on a solid foundation, the accuracy and reliability of your insights rely on top-notch data quality. Enter Generative AI – a game-changing technology revolutionizing data management and utilization. Combined with strict data quality practices, Generative AI becomes an incredibly powerful tool, enabling businesses to extract actionable and trustworthy insights.

Building the Foundation: Data Quality

Data quality is the foundation of all analytical endeavors.  Poor data quality can lead to faulty analyses, misguided decisions, and ultimately, a collapse in trust. Businesses must ensure their data is clean, structured, and reliable. Without this, even the most sophisticated AI algorithms will produce skewed results.

Generative AI: The Master Craftsman

Generative AI, with its ability to create, predict, and optimize data patterns, refines raw data into valuable insights, automates repetitive tasks, and identifies hidden patterns that might elude human analysts. However, for this to work effectively, it requires high-quality raw materials – that is, impeccable data.

Imagine Generative AI as an artist creating a detailed painting. If the artist is provided with subpar paint and brushes, the resulting artwork will be flawed. Conversely, with high-quality tools, the artist can produce a masterpiece. Similarly, Generative AI needs high-quality data to generate reliable and actionable insights.

The Symbiotic Relationship

The relationship between data quality and Generative AI is symbiotic. High-quality data enhances the performance of Generative AI, while Generative AI can improve data quality through advanced data cleaning, anomaly detection, and data augmentation techniques.

For instance, Generative AI can identify and rectify inconsistencies in datasets, fill in missing values with remarkable accuracy, and generate synthetic data to enhance training datasets for machine learning models. This creates a virtuous cycle where improved data quality leads to better AI performance, which further refines data quality.
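
As a rough illustration of the imputation and augmentation steps described above, here is a minimal Python sketch. It uses classical stand-ins (a KNN imputer and noise-based resampling) rather than an actual generative model, and the table, column names, and values are purely illustrative.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# A hypothetical sensor table with gaps; column names are illustrative only.
df = pd.DataFrame({
    "temperature": [21.0, 21.5, np.nan, 22.1, 21.8],
    "humidity":    [0.40, np.nan, 0.42, 0.43, 0.41],
})

# Model-based imputation: fill missing values from the most similar rows.
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# Very simple synthetic augmentation: resample rows and add small Gaussian noise.
rng = np.random.default_rng(42)
sample = imputed.sample(n=100, replace=True, random_state=42)
synthetic = sample + rng.normal(scale=sample.std().values * 0.05, size=sample.shape)

print(imputed)
print(synthetic.describe())

A generative model could replace both steps, but the workflow stays the same: repair the raw data first, then enlarge the training set from it.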

Practical Steps for Businesses

  1. Assess Data Quality Regularly: Implement robust data quality assessment frameworks to continuously monitor and improve the quality of your data.
  2. Leverage AI for Data Management: Utilize Generative AI tools to automate data cleaning, error detection, and data augmentation processes.
  3. Invest in Training and Tools: Ensure your team is equipped with the necessary skills and tools to manage and utilize Generative AI effectively.
  4. Foster a Data-Driven Culture: Encourage a culture where data quality is prioritized, and insights are derived from reliable, high-quality data sources.


The AnalyticsCreator Advantage

AnalyticsCreator stands at the forefront of this intersection, offering solutions that seamlessly integrate data quality measures with Generative AI capabilities.  By partnering with AnalyticsCreator, businesses can ensure that their analytical foundations are solid, with Generative AI sculpting insights that drive informed decision-making.

In the rapidly evolving landscape of data analytics, the intersection of Generative AI and data quality is transformative. Ensuring high data quality while leveraging the power of Generative AI can propel businesses to new heights of efficiency and insight.

By embracing this symbiotic relationship, organizations can unlock the full potential of their data, paving the way for innovations and strategic advantages that are both reliable and groundbreaking. AnalyticsCreator is here to guide you through this journey, ensuring your data’s foundation is as strong as your vision for the future.


Looking Ahead: The Future of Data Preparation for Generative AI

Sponsored Post

Generative AI is a significant part of the technology landscape, and its effectiveness is closely linked to the data it uses. Just as a chef needs fresh ingredients to prepare a good meal, generative AI needs well-prepared, clean data to produce useful outputs. Businesses need to understand the trends in data preparation to adapt and succeed.

The Principle of “Garbage In, Garbage Out”

The principle of “garbage in, garbage out” (GIGO) remains as relevant as ever.  If you input poor-quality data into an AI system, the results will be poor. This principle highlights the need for careful data preparation, ensuring that the input data is accurate, consistent, and relevant.

Emerging Trends in Data Preparation

  1. Automated Data Cleaning

Manual data cleaning is both time-consuming and error-prone. Emerging tools now leverage AI to automate this process, identifying and correcting errors more efficiently. This shift not only saves time but also ensures a higher standard of data quality. Tools like BiG EVAL are leading the data quality field for all technical systems in which data is transported and transformed. BiG EVAL uses plausibility and validation mechanisms to provide proactive quality assurance and to enable short release cycles in agile projects as well.
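
To make the idea of automated plausibility checks concrete, here is a generic Python sketch of rule-based validation. It is not BiG EVAL's API; the file name, column names, and rules are assumptions for illustration only.

import pandas as pd

# A hypothetical orders extract; file and column names are illustrative.
orders = pd.read_csv("orders.csv")

# Plausibility rules of the kind an automated data-quality tool would apply.
rules = {
    "non_negative_amount": orders["amount"] >= 0,
    "valid_order_date": pd.to_datetime(orders["order_date"], errors="coerce").notna(),
    "customer_id_present": orders["customer_id"].notna(),
}

# Report how many rows violate each rule.
report = {name: int((~mask).sum()) for name, mask in rules.items()}
print("violations per rule:", report)

# Quarantine violating rows instead of silently dropping them.
violations = orders[~pd.concat(rules, axis=1).all(axis=1)]
violations.to_csv("orders_quarantine.csv", index=False)

Running such checks on every load, rather than once per project, is what turns data cleaning from a manual chore into a repeatable quality gate.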

  2. Real-Time Data Processing

Driven by the need for real-time insights, businesses are adopting technologies that can process and analyze data as it arrives. Real-time data preparation tools allow companies to react quickly to new information, maintaining a competitive edge in fast-paced industries.
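
A minimal sketch of real-time data preparation, using PySpark Structured Streaming with the built-in "rate" source as a stand-in for a real event stream; the derived column and the filter are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("realtime-prep").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows and stands in for a real event stream.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# A tiny preparation step: derive a field and drop obviously implausible rows.
prepared = (events
            .withColumn("amount", (F.col("value") % 500).cast("double"))
            .filter(F.col("amount") >= 0))

# Write a running windowed aggregate to the console so the effect is visible immediately.
query = (prepared
         .groupBy(F.window("timestamp", "10 seconds"))
         .agg(F.count("*").alias("events"), F.avg("amount").alias("avg_amount"))
         .writeStream.outputMode("update").format("console").start())

query.awaitTermination()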

  3. Improved Data Integration

Data often comes from various sources, and integrating this data smoothly is essential. Advanced data integration tools now facilitate the merging of different data sets, creating a cohesive and comprehensive dataset for analysis. Managing a vast array of data sources is hardly feasible without data automation tools.

  4. Augmented Data Catalogs

Modern data catalogs are becoming more intuitive and intelligent. They not only help in organizing and finding data but also in understanding its lineage and context. This contextual awareness aids in better data preparation and utilization.

Adapting to These Changes

Businesses must be proactive in adopting these emerging trends. Here are a few strategies to consider:

  1. Invest in Advanced Data Tools

Investing in modern data preparation tools can  enhance data processing capabilities. Solutions like AnalyticsCreator provide robust platforms for real-time processing and seamless integration.

  2. Foster a Data-Driven Culture

Promote a culture where data quality is a shared responsibility. Encourage teams to prioritize data accuracy and consistency at every stage of data handling.

  3. Continuous Training and Development

The field of data science is constantly evolving. Ensure your team is up-to-date with the latest trends and technologies in data preparation through continuous learning and development programs.

  4. Leverage Expert Guidance

Sometimes, navigating the complex landscape of data preparation requires expert guidance. Partnering with specialists can provide valuable insights and help in implementing best practices tailored to your business needs.

The Role of AnalyticsCreator

AnalyticsCreator helps businesses navigate the future of data preparation. By providing advanced tools and solutions, AnalyticsCreator ensures that your data is prepared, well-integrated, and ready for analysis. Its platform is designed to handle the complexities of modern data environments, offering features that align with the latest trends in data preparation.

In conclusion, as generative AI continues to influence industries, the need for high-quality data only grows. By staying informed about emerging trends and leveraging tools like AnalyticsCreator, businesses can ensure they are prepared to harness the full potential of generative AI. Just as a chef’s masterpiece depends on the quality of the ingredients, your AI outcomes will depend on the data you prepare. Investing in your data can only lead to positive results.

Why use Infrastructure as Code for developing Cloud-based Data Warehouse Systems?

In the contemporary age of Big Data, Data Warehouse Systems and Data Science Analytics Infrastructures have become an essential component for organizations to store, analyze, and make data-driven decisions. With the evolution of cloud computing, many organizations are now migrating their Data Warehouse Systems to the cloud for better scalability, flexibility, and cost-efficiency. Infrastructure as Code (IaC) can be a game-changer in this scenario. By automating the provisioning and management of cloud resources through code, IaC brings a host of advantages to the development and maintenance of Data Warehouse Systems in the cloud.

So why use IaC for Cloud Data Infrastructures?

Of course you – as a human user – can always log in to the admin portal of any cloud provider and manually put your resources like SQL databases, ETL tools, virtual networks and tools like Synapse, Snowflake, BigQuery or Databricks in place by clicking the right buttons. But here is why you should instead follow the idea of having code that declares which resources are in place in your cloud, and in what order:

Version Control for your Cloud Infrastructure

One of the primary advantages of using IaC is version control for your Data Warehouse – or Data Lakehouse – Architecture. Whether you’re using Redshift, Snowflake, or any other cloud-based data warehouse solutions, you can codify your architecture settings, allowing you to track changes over time. This ensures a reliable and consistent development environment and makes it easier to identify issues, rollback updates, or replicate the architecture for other projects.

Scalability Tailored for Data Needs

Data Warehouse Systems often need to scale quickly to handle larger datasets or more queries. Traditional manual scaling methods are cumbersome and slow. IaC allows for efficient auto-scaling based on real-time needs. You can write scripts to automatically provision or de-provision resources depending on your data workloads, making your data warehouse highly adaptive to your organization’s changing requirements.

Cost-Efficiency in Resource Allocation

Cloud resources are priced based on usage, so efficient allocation is crucial for managing costs. IaC enables precise control over cloud resources, allowing you to turn them off when not in use or allocate more resources during peak times. For Data Warehouse Systems that often require powerful (and expensive) computing resources, this level of control can translate into significant cost savings.

Streamlined Collaboration Among Teams

Data Warehouse Systems in the cloud often involve cross-functional teams — data engineers, data scientists, and system administrators. IaC allows these teams to collaborate more effectively. Everyone works with the same infrastructure configurations, reducing discrepancies between development, staging, and production environments. This ensures that the data models and queries developed by data professionals are consistent with the underlying infrastructure.

Enhanced Security and Compliance

Data Warehouses often store sensitive information, making security a paramount concern. IaC allows security configurations to be codified and automated, ensuring that every new resource or service deployed complies with organizational and regulatory guidelines. This proactive security approach is particularly beneficial for industries that have to adhere to strict compliance rules like HIPAA or GDPR.

Reliable Environment for Data Operations

Manual configurations are prone to human error, which can compromise the reliability of a Data Warehouse System. IaC mitigates this risk by automating repetitive tasks, ensuring that the infrastructure is consistently provisioned. This brings reliability to data ETL (Extract, Transform, Load) processes, query performances, and other critical data operations.

Documentation and Disaster Recovery Made Easy

Data is the lifeblood of any organization, and losing it can be catastrophic. IaC allows for swift disaster recovery by codifying the entire infrastructure. If a disaster occurs, the infrastructure can be quickly recreated, reducing downtime and data loss.

Most common IaC solutions

The most common tools for creating Cloud Infrastructure as Code are probably Terraform and Pulumi. However, IaC solutions can differ considerably in their concepts. For example: Terraform uses a purely declarative configuration language that simply describes what the infrastructure should look like (Terraform then executes the deployment via the respective cloud provider plugin), whereas Pulumi defines the deployment in a general-purpose programming language, building up the desired cloud resources iteratively (e.g. using for loops in Python). When you run Pulumi in any supported language such as Python or C#, it generates a declarative deployment plan for the cloud. In the end, every IaC solution declares what the infrastructure should look like.

Terraform

Terraform is one of the most widely used Infrastructure as Code (IaC) tools, developed by HashiCorp. It enables users to define and provision a data center infrastructure using a declarative configuration language known as HashiCorp Configuration Language (HCL).

The following Terraform script will create an Azure Resource Group, a SQL Server, and a SQL Database. It will also output the fully qualified domain name (FQDN) of the SQL Server, which you can use to connect to the database:

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "example" {
  name     = "example-resources"
  location = "East US"
}

resource "azurerm_sql_server" "example" {
  name                         = "example-sqlserver"
  resource_group_name          = azurerm_resource_group.example.name
  location                     = azurerm_resource_group.example.location
  version                      = "12.0"
  administrator_login          = "adminUser"
  administrator_login_password = "adminPassword1234!"
}

resource "azurerm_sql_database" "example" {
  name                = "example-sqldb"
  resource_group_name = azurerm_resource_group.example.name
  server_name         = azurerm_sql_server.example.name
  location            = azurerm_resource_group.example.location
  edition             = "Basic"
}

output "sql_server_fqdn" {
  value = azurerm_sql_server.example.fully_qualified_domain_name
}

The HCL code needs to be placed in Terraform's main.tf file. Of course, Terraform and the Azure CLI need to be installed beforehand.

Pulumi

Pulumi is a modern Infrastructure as Code (IaC) tool that sets itself apart by allowing infrastructure to be defined using general-purpose programming languages like Python, TypeScript, Go, and C#.

Example of a Pulumi TypeScript script creating a SQL Database on the Microsoft Azure Cloud:

import * as pulumi from "@pulumi/pulumi";
import * as azure from "@pulumi/azure";

// Create an Azure Resource Group
const resourceGroup = new azure.core.ResourceGroup("myResourceGroup", {
    location: "EastUS",
});

// Create an Azure SQL Server
const sqlServer = new azure.sql.SqlServer("mySqlServer", {
    resourceGroupName: resourceGroup.name,
    location: resourceGroup.location,
    version: "12.0",
    administratorLogin: "adminUser",
    administratorLoginPassword: "adminPassword1234!",
});

// Create an Azure SQL Database on the SQL Server
const sqlDatabase = new azure.sql.Database("mySqlDatabase", {
    resourceGroupName: resourceGroup.name,
    serverName: sqlServer.name,
    location: resourceGroup.location,
    edition: "Basic",
});

// Export connection string for the SQL Database
export const sqlConnectionString = pulumi.all([sqlServer.name, resourceGroup.name, sqlDatabase.name]).apply(([serverName, rgName, dbName]) => {
    return `Server=tcp:${serverName}.database.windows.net;initial catalog=${dbName};user ID=adminUser;password=adminPassword1234!;Min Pool Size=0;Max Pool Size=30;Persist Security Info=true;`;
});

Running the script requires Node.js, Pulumi and the Azure CLI to be installed.

Cloud Provider specific IaC Solutions

Cloud providers also offer their own IaC solutions; these are probably the most common ones:

Microsoft Azure Bicep is an open-source domain-specific language (DSL) developed by Microsoft, aimed at simplifying the process of deploying Azure resources. It serves as a declarative alternative to JSON for writing Azure Resource Manager (ARM) templates. Bicep compiles down to ARM templates, offering a more concise syntax and easier tooling while leveraging the proven, underlying ARM deployment engine.

AWS CloudFormation is a service offered by Amazon Web Services (AWS) that allows you to define cloud infrastructure in JSON or YAML templates.

Google Cloud Deployment Manager is quite similar to AWS CloudFormation but tailored for Google Cloud Platform (GCP), it allows you to define and deploy resources using YAML or Python templates.

IaC Tools for Server Configuration

There are many other IaC solutions, and some of them focus more on the configuration of servers. What they have in common is that they also offer software provisioning and a great deal of detail regarding the fine-grained configuration of individual applications running on the server.

The most common IaC software for server configuration is probably Ansible, a YAML-based configuration management tool that uses an agentless architecture. It’s easy to set up and widely used for automating tasks like software provisioning and configuration management. Puppet, Chef and SaltStack are further alternatives based on a master-agent architecture.

Other types of IaC Solutions

IaC solutions with a narrower focus include Vagrant, a tool primarily used for setting up virtual development environments, especially for automating VM (Virtual Machine) provisioning, and the widely used Docker Compose, a tool for defining and running multi-container Docker applications, which are described in YAML files.

Furthermore, there are tools that work closely together with IaC tooling, e.g. Prometheus, an open-source monitoring toolkit often used in conjunction with IaC tools to monitor the deployed resources.

Conclusion

Infrastructure as Code significantly enhances the development and maintenance of Cloud-based Data Infrastructures. From versioning your warehouse architecture and scaling resources according to real-time data needs, to facilitating team collaboration and ensuring security compliance, IaC serves as a foundational technology that brings agility, reliability, and cost-efficiency. As organizations continue to realize the importance of data-driven decision-making, leveraging IaC for cloud-based Data Warehouse Systems will likely become a best practice in data engineering and infrastructure management.

Lambda Architecture vs Kappa Architecture for Big Data Cloud Platforms? Let us discuss which architecture suits which use cases best.

Big Data – Lambda or Kappa Architecture?

Big Data Analytics stands apart from conventional data processing in its fundamental nature. In the realm of Big Data, there are two prominent architectural concepts that perplex companies embarking on the construction or restructuring of their Big Data platform: Lambda architecture or Kappa architecture. Thus, it is crucial for such companies to contemplate and decide which architectural approach best aligns with their goals.

Lambda – Architecture

Introduced in 2011 during the peak of Big Data’s prominence, the Lambda architecture remains a significant presence in the field. Despite being the older of the two architectures, it offers a more comprehensive approach by incorporating three layers: the batch layer, the speed layer (also known as the stream layer), and the serving layer.

The Batch Layer is responsible for processing the entire dataset, ensuring the generation of the most accurate results. However, this comes at the cost of higher latency due to the batch loading of data. On the flip side, the batch layer can handle complex calculations without time constraints. It stores incoming raw data and filters it for subsequent applications.

Batch runs are suitable for non-time-sensitive data that require regular updates, such as daily or weekly incremental loads. Additionally, batch runs are necessary for complete data migration or overwriting (Full Load) scenarios.

The Speed Layer operates with low latency, producing almost real-time results. It calculates real-time views that complement the batch views. The speed layer receives incoming data and provides incremental updates to the batch layer results. By implementing incremental deduction logic, the speed layer significantly reduces computational costs.

Here is a simplified depiction of the Lambda architecture, showcasing the multi-store concept and the serving layer. In this representation, there is a separate store for events within the speed layer and another store for data loaded during batch processing. The serving layer acts as a mediator, enabling subsequent applications to access the data. It is important to note that in the Lambda architecture, the serving layer can be omitted, allowing batch processing and event streaming to remain separate entities.


The batch views within the Lambda architecture allow for the application of more complex or resource-intensive rules, resulting in superior data quality and reduced bias over time. On the other hand, the real-time views provide immediate access to the most current data.

The Serving Layer serves as a conduit for various data queries originating from both the batch and speed layers. It receives batch views from the batch layer and near-real-time views from the speed layer, utilizing this data to facilitate standard reporting and ad hoc analytics.

The Lambda architecture effectively balances speed, reliability, and scalability. However, it is worth mentioning that while the batch layer and real-time stream handle different scenarios, their underlying processing logic often shares similarities. As a result, the development and maintenance efforts for both layers should not be underestimated.

Kappa – Architecture

Jay Kreps introduced the Kappa architecture in 2014 as an alternative to the Lambda architecture. It addresses the redundancy present in the Lambda architecture by completely removing the batch component. By eliminating the parallel operation of two pipelines, the Kappa architecture simplifies the overall architectural complexity.

In the Kappa architecture, only the speed layer, represented by an event-based streaming pipeline, remains. The fundamental concept is to handle real-time data processing and continuous data reprocessing using a single stream processing engine. This approach allows for the avoidance of a multi-layer lambda architecture while ensuring the quality of data processing is maintained.

Illustrated simplified Kappa Architecture. This architectural concept relies on event streaming as the core element of data delivery.


In practical implementation, the Kappa architecture is commonly deployed using Apache Kafka or Kafka-based tools. Applications can directly read from and write to Kafka or an alternative message queue tool. For existing event sources, listeners are utilized to stream writes directly from database logs or similar data stores. This approach eliminates the need for inbound batch processing and reduces resource requirements.
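
As a rough sketch of such a Kafka-based streaming pipeline, here is a minimal PySpark Structured Streaming job. The broker address, topic name, and console sink are placeholders, and running it requires the Spark Kafka connector package.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kappa-stream").getOrCreate()

# Read every event from a Kafka topic; broker and topic are placeholders.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "earliest")   # reprocessing means replaying the same log
          .load())

# Kafka delivers the payload as binary; cast it and keep the event timestamp.
parsed = events.select(F.col("value").cast("string").alias("payload"),
                       F.col("timestamp"))

# A simple continuously updated view over the stream.
counts = parsed.groupBy(F.window("timestamp", "1 minute")).count()

query = (counts.writeStream
         .outputMode("update")
         .format("console")   # in practice: a serving store such as a lakehouse table
         .start())
query.awaitTermination()

The key Kappa property is visible in the startingOffsets option: "reprocessing" is just replaying the same event log through the same code, with no separate batch pipeline.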

By treating every data point as a streaming event, the Kappa architecture enables near-real-time analytics and makes it possible to observe the state of all data in the organization at any given point. Queries can be performed at a single location, eliminating the need to compare batch and speed views.

However, there are challenges associated with this architecture. Data processing must be done as a data stream, leading to difficulties such as managing duplicate events, cross-referencing events, and maintaining correct operation order. While batch processing can handle retrospective consolidation of multiple data sets, these challenges persist in the Kappa architecture. As a result, implementing architectures based on the Kappa concept can be more complex compared to those based on the Lambda concept, even though the latter may appear clearer in architectural sketches.

The Kappa architecture is particularly suitable when event streaming or real-time processing use cases are predominant. It offers the advantage of having a single ETL platform to develop and maintain. It is well-suited for developing data systems that emphasize online learning and do not require a separate batch layer. The sequence of events and queries is not predefined but generated in later steps based on business logic, prioritizing speed.

Use cases – When to use which architecture?

It is important to note that Kappa architecture does not serve as a direct substitute for Lambda architecture, as there are certain use cases supported by Lambda that cannot be seamlessly migrated. The Lambda architecture is better suited for implementing complex data processes and ensuring consistently complete data provisioning compared to the pure event processing approach of Kappa. As a result, many Data Lakehouse systems are built upon the foundations of the Lambda architecture.

Requirements that clearly speak for Lambda:

  • If data is to be processed ad-hoc on quasi unchanging, quality-assured databases, or if the focus of the database is on data quality and the avoidance of inconsistencies.
  • When fast responses are required, but the system must be able to handle different update cycles.

Requirements that clearly speak in favor of Kappa:

  • When the algorithms applied to the real-time data and the historical data are identical.
  • If the analytics system is online learning capable and therefore does not require a batch layer.
  • The order of events and queries does not matter, but the stream processing platforms can exchange data with the database instantly at any time.

If your requirements prioritize a highly reliable Data Lakehouse update process and efficient machine learning model training for accurate event predictions, the Lambda architecture is the recommended choice. By leveraging both the batch layer and the speed layer, the Lambda architecture ensures minimal errors and optimized processing speed.

Alternatively, if you seek a streamlined Big Data architecture that excels in handling distinct and continuously emerging events (e.g., fueling data for numerous mobile applications), the Kappa architecture is the ideal solution for data platforms with the main purpose of real-time data processing. Its focus on unique, ongoing events allows for effective and responsive data processing.

Control the visibility of the PowerBI visuals based on condition

In PowerBI, there is no direct or functional mechanism to adjust the visibility (Show/Hide) of visualizations based on filter choices. There is, however, a workaround that enables us to show/hide visuals based on filter condition.

The fundamental concept behind this technique is to apply a mask to a visual and change its opacity based on a condition or filter selection.

Use Case:

I have a detail table of orders. These orders are divided into Consumer, Home Office, and Corporation categories. I use the segment as a filter. One of the requirements is to present the detail table only if the overall profit for the selected segment is less than $100,000. This task will be divided into two major parts: first, we will display the table only when the filter is selected; next, we will add the condition to the table.

Step 1: Show the table only when the filter is selected

  • Place filter (Slicer) and visual on the Report Pane.

  • Create a measure that will determine if the filter is selected or not.

Filter_Selected = IF(ISFILTERED(Orders[Segment]),1,0)

  • Add this measure to the filter pane of the table visualization and select the "show items when the value is 1" option. This ensures that when no option is selected, only the header is displayed.

  • Place the mask over the table. You only need to mask the table header; give the mask a border color that matches your background, or remove the border entirely.

  • Create a measure to change the mask's transparency. Appending two zeros to the end of any hex color code makes it fully transparent.

mask_transparency = IF([Filter_Selected], "#FFFFFF00", "#FFFFFF")

  • Use this measure for the Fill color of the mask via conditional formatting.

If the mask_transparency measure field is grayed out during the previous steps, you may need to change the measure's data type to Text.

Step 2: Add a condition to the solution

  • Create a new measure to determine if our condition is met.

condition_check = IF(CALCULATE(SUM(Orders[Profit]), FILTER(ALL(Orders), Orders[Segment] = SELECTEDVALUE(Orders[Segment]))) < 100000, 1, 0)

  • Now add this new measure to the table visual's filter pane and select the "show items when the value is 1" option. This ensures that the table appears only if the condition is met.

You can now display or hide visuals based on slicer selection and condition. If you know a better way to do this, please comment and let me know. For this article, I referred to this page.

 

5 Apache Spark Best Practices

Already familiar with the term Big Data, right? Even though we all talk about Big Data, it can take a long time before you confront it in your career. Apache Spark is a Big Data tool that aims to handle large datasets in a parallel and distributed manner. It began in 2009 as a research project at UC Berkeley’s AMPLab, a collaboration of students, researchers, and faculty centered on data-intensive application domains.

Introduction

Spark’s aim was to create a new framework optimized for fast iterative processing, such as machine learning and interactive data analysis, while retaining the scalability and fault tolerance of Hadoop MapReduce. Spark outperforms Hadoop in many ways, reaching performance levels that are nearly 100 times higher in some cases. Spark has a number of components for various types of processing, all of which are built on Spark Core. Today we will briefly discuss Apache Spark and 5 of its best practices:

What is Apache Spark?

Apache Spark is an open-source distributed system for big data workloads. For fast analytic queries against data of any size, it uses in-memory caching and optimised query execution. It is a parallel processing framework that lets clusters of computers run large-scale data analytics applications. It can handle batch as well as real-time data processing and predictive analytics workloads.

It supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing—and offers development APIs in Java, Scala, Python, and R. With 365,000 meetup members in 2017, Apache Spark has become one of the most renowned distributed processing frameworks for big data. Explore an Apache Spark tutorial for more information.

5 best practices of Apache Spark

1. Begin with a small sample of the data.

Because we want to make big data work, we need to start with a small sample of data to see if we’re on the right track. In my project, I sampled 10% of the data and verified that the pipelines were working properly. This allowed me to use the SQL section of the Spark UI to watch the numbers grow throughout the flow while not having to wait too long for it to complete.

In my experience, if you attain your preferred runtime with a small sample, scaling up is usually simple.
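
A minimal PySpark sketch of this sampling approach; the input path and column names are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sampled-dev-run").getOrCreate()

# Path and schema are placeholders; adjust to your dataset.
df = spark.read.parquet("s3://my-bucket/events/")

# Work on roughly 10% of the rows while developing the pipeline.
sample = df.sample(withReplacement=False, fraction=0.10, seed=42)

# Run the pipeline logic on the sample and watch the row counts in the Spark UI.
result = sample.filter("status = 'ok'").groupBy("country").count()
result.show()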

2. Spark troubleshooting

Spark evaluates transformations lazily: it does not execute a transformation immediately, but only records which transformation was requested. This makes it difficult to determine where in our code there are bugs or areas that need to be optimised. One practice we found useful is to split the code into sections with df.cache() and then use df.count() to force Spark to compute the DataFrame at the end of every section.

Spark actions, by contrast, are eager: they trigger the computation of the underlying transformations, so call them deliberately and only where you need them. count() on a dataset, for instance, is an action. You can then inspect the computation of each section in the Spark UI and identify any issues. It's important to note that if you don't use the sampling mentioned in (1), you'll probably end up with a very long runtime that is difficult to debug.
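
A short sketch of the cache-and-count pattern described above, continuing from the sampled DataFrame in the previous sketch; the transformations and the dim_table DataFrame are hypothetical.

# Force evaluation at section boundaries to localise slow or buggy transformations.
stage1 = df.filter("status = 'ok'")            # hypothetical section 1 transformations
stage1.cache()
print("after stage 1:", stage1.count())        # triggers computation of section 1 only

stage2 = stage1.join(dim_table, "country")     # dim_table is a placeholder DataFrame
stage2.cache()
print("after stage 2:", stage2.count())        # now the join is materialised and visible in the UI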

Check out Apache Spark Training & Certification Course to get yourself certified in Apache Spark with industry-level skills.

3. Finding and resolving Skewness is a difficult task.

Looking at the stage details in the Spark UI and checking for a large difference between the max and the median task duration can help you find skew:

Let’s begin with a definition of skewness. As previously stated, our data is divided into partitions, and the size of each partition is likely to change as transformations progress. This can result in a large difference in size between partitions, which means our data is skewed, and a few tasks will be markedly slower than the rest.

Why is this a bad thing? Because it may cause later stages to wait for these few tasks, leaving cores idle. Once you understand where the skew is coming from, you can fix it by changing the partitioning.
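
A small PySpark sketch for spotting skew per partition and one common mitigation (salting); the df and the join_key column are placeholders.

from pyspark.sql import functions as F

# Count rows per partition; a few partitions far above the median indicate skew.
sizes = (df
         .withColumn("pid", F.spark_partition_id())
         .groupBy("pid").count()
         .orderBy(F.desc("count")))
sizes.show(10)

# A common remedy: repartition on a higher-cardinality key, or add a random "salt".
salted = df.withColumn("salt", (F.rand(seed=42) * 16).cast("int"))
repartitioned = salted.repartition("join_key", "salt")   # join_key is a placeholder column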

4. Appropriately cache

Spark allows you to cache datasets in memory. There are a variety of options to choose from, as shown in the sketch after this list:

  • Since the same operation has been computed several times in the pipeline flow, cache it.
  • Use the persist API to choose the required cache setting (persist to disk or not; serialized or not).
  • Be cognizant of lazy loading and, if necessary, prime cache up front. Some APIs are eager, while others aren’t.
  • To see information about the datasets you’ve cached, go to the Storage tab in the Spark UI.
  • It’s a good idea to unpersist your cached datasets after you’ve finished using them to free up resources, especially if other people are using the cluster.
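
A minimal sketch of these caching options; the filter and the downstream aggregations are hypothetical.

from pyspark import StorageLevel

# Cache a reused DataFrame; spill to disk if it does not fit in memory.
features = df.filter("status = 'ok'")          # hypothetical reused intermediate result
features.persist(StorageLevel.MEMORY_AND_DISK)

features.count()        # an eager action primes the cache up front
model_input = features.groupBy("country").count()
report = features.groupBy("device").count()

features.unpersist()    # release executor memory once both consumers are done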

5. Spark has issues with iterative code.

This one was particularly difficult. Spark uses lazy evaluation, so when the code runs it only builds a computational graph, a DAG. With an iterative process, however, this becomes very problematic, because each iteration's DAG includes the previous one and the graph grows extremely large, too large for the driver to keep in memory. Because the application is stuck, the Spark UI shows no running jobs (which is correct) for an extended period of time, until the driver eventually crashes.

This is currently a known issue with Spark, and the workaround that worked for me was to call df.checkpoint() or df.localCheckpoint() every 5–6 iterations (find your number by experimenting a bit). This works because, unlike cache(), checkpoint() breaks the lineage and the DAG, saves the results, and starts again from the new checkpoint. The disadvantage is that you no longer have the entire DAG to recreate the df if something goes wrong.
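
A rough sketch of the checkpointing workaround in an iterative loop; iterate() and the input path are placeholders, and the checkpoint interval is something to tune.

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # required for checkpoint()

df = spark.read.parquet("s3://my-bucket/state/")   # placeholder input
for i in range(30):
    df = iterate(df)                 # hypothetical function applying one iteration's transformations
    if i % 5 == 0:
        df = df.checkpoint()         # eager by default: truncates the lineage so the DAG stops growing

Because checkpoint() is eager, the results are materialised as soon as it is called, so each checkpointed iteration also acts as a natural progress marker in the Spark UI.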

Conclusion

Spark is now one of the most popular projects in the Hadoop ecosystem, with many companies using it in conjunction with Hadoop to process large amounts of data. In June 2013, Spark entered incubation at the Apache Software Foundation (ASF), and in February 2014 it was designated an Apache Top-Level Project. Spark can run by itself, on Apache Mesos, or, most commonly, on Apache Hadoop. Spark is used by large enterprises working with big data applications because of its speed and its ability to connect multiple types of databases and run various types of analytics applications.

Learning how to make Spark work its magic takes time, but these 5 practices will help you move your project forward and sprinkle some spark charm on your code.

process.science presents a new release

Advertisement

Process Mining Tool provider process.science presents a new release

process.science, a specialist in the development of process mining plugins for BI systems, presents the upgraded version of its product ps4pbi. process.science has added the following improvements to its plug-in for Microsoft Power BI; identical upgrades will soon also be released for ps4qlk, the corresponding plug-in for Qlik Sense:

  • 3x faster performance: Improvements to the graph library make graph building approx. 300% more performant. This is particularly noticeable in complex processes
  • Navigator window: For a better overview of complex graphs, an overview window has been added that displays the entire graph and the position of the currently viewed area within the overall process
  • Activities legend: This allows activities to be assigned to specific categories and highlighted in different colors, for example according to the source system in which an activity was carried out
  • Activity drill-through: This makes it possible to carry filters that have been set for selected activities over into other dashboards
  • Value Color Scale: Activity values can be color-coded and assigned to freely selectable groupings, which makes the overview easier at first sight

process.science Process Mining on Power BI

Process mining is a business data analysis technique. The software used for this extracts the data that is already available in the source systems and visualizes it in a process graph. The aim is to ensure continuous monitoring in real time in order to identify optimization measures for processes, to simulate them and to continuously evaluate them after implementation.

The process mining tools from process.science are integrated directly into Microsoft Power BI and Qlik Sense. A corresponding plug-in for Tableau is already in development. It is therefore not a complicated, isolated solution that requires a new setup alongside existing systems. With process.science, the know-how already built up around the implemented BI system and the existing infrastructure framework can be reused.

The integration of process.science into the BI systems has no influence on day-to-day business and bears no risk of system failures, as process.science does not intervene in the source system or any other program, but extends the respective business intelligence tool with the process perspective and various related functionalities.

Contact person for inquiries:

process.science GmbH & Co. KG
Gordon Arnemann
Tel .: + 49 (231) 5869 2868
Email: ga@process.science
https://de.process.science/

My elaborate study notes on reinforcement learning

I will not tell you why, but all of a sudden I found myself needing to write an article series on reinforcement learning, even though I am a beginner in the field myself. Everything I knew came from one online lecture, delivered in a rather lazy tone, at my college. However, in the process of learning reinforcement learning I found a line that connects two dots: reinforcement learning and my own field of study. That is why I made up my mind to write an article series on reinforcement learning in earnest.

To be a bit more concrete, I imagine that technologies in our world could be enhanced by a combination of reinforcement learning and virtual reality. That means companies like Toyota or VW might come to invest in visual effects or video game companies more seriously in the future. And I have actually been struggling with how to train deep learning models with CGI, which might bridge the virtual world and the real world.

As I am a beginner in reinforcement learning, this article series will be a kind of study note for me. But as in my former articles, I prefer exhaustive yet intuitive explanations of AI algorithms, so I will do my best to make this series as instructive and effective as existing tutorials on reinforcement learning.


In this article I would like to share what I have learned about RL, and I hope you could get some hints of learning this fascinating field. In case you have any comments or advice on my “study note,” leaving a comment or contacting me via email would be appreciated.

Coffee Shop Location Predictor

As part of this article, we will explore the main steps involved in predicting the best location for a coffee shop in Vancouver. We will also take into consideration that the coffee shop should be near a transit station and have no Starbucks near it. And while we are at it, let us also add an extra feature to make sure crime in the area is low.

Introduction

In this article, we will highlight the main steps involved to predict a location for a coffee shop in Vancouver. We also want to make sure that the coffee shop is near a transit station, and has no Starbucks near it. As an added feature, we will make sure that the crime concentration in the area is low, and the entire program should be implemented in Python. So let’s walk through the steps.

Steps Required

  • Get crime history for the last two years
  • Get locations of all transit stations and Starbucks in Vancouver
  • Check all the transit stations that do not have any Starbucks near them
  • Get all the data regarding crimes near the filtered transit stations
  • Create a grid of all possible coordinates around the transit station
  • Check crime around each created coordinate and display the top 5 locations.

Gathering Data

This covers the first two steps required to get data from the internet, both manually and automatically.

Getting all Crime History

We can get crime history for the past 14 years in Vancouver from here. This data comes as a raw crime.csv file, so we have to process it and filter out useless data. We then write the processed information to the crime_processed.csv file.

Note: There are 530,653 records of crime in this file

In this program, we will just use the type and coordinates of each crime. There are many crime types, but we have classified them into three major categories, namely:

Theft (red), Break and Enter (orange) and Mischief (green)

All these crimes can be plotted on a graph, as displayed below.

This may seem very congested and full, so let's look at a close-up image for future reference.
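
A minimal pandas sketch of the kind of preprocessing described above; the column names and the category mapping are assumptions about the raw crime.csv schema, not the exact code used here.

import pandas as pd

# Column names below are illustrative; adapt them to the actual crime.csv schema.
crime = pd.read_csv("crime.csv")

categories = {
    "Theft": "Theft", "Theft from Vehicle": "Theft", "Theft of Vehicle": "Theft",
    "Break and Enter Commercial": "Break and Enter",
    "Break and Enter Residential/Other": "Break and Enter",
    "Mischief": "Mischief",
}

processed = (crime
             .assign(category=crime["TYPE"].map(categories))
             .dropna(subset=["category", "X", "Y"])    # keep only classified rows with coordinates
             [["category", "X", "Y"]])

processed.to_csv("crime_processed.csv", index=False)
print(len(processed), "crime records kept")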

Getting Locations of all Rapid Transit Stations

We can get the coordinates of all transit stations in Vancouver from here. This dataset has the coordinates of the rapid transit stations on the three transit lines in Vancouver. There are a total of 23 of them, which we can then use for further processing.

Getting Locations of all Starbucks

The Starbucks data is available here; we can scrape it easily and get the locations of all the Starbucks in Vancouver. We just need the Starbucks locations near transit stations, so we'll filter out the rest. There are a total of 24 Starbucks in Vancouver, and 10 of them are near transit stations.

Note: Other than the coordinates of Transit Stations and Starbucks, we also need coordinates and type of the crime.

Transit Stations with no Starbucks

As we have all the data required, we can now move to the next step. We need to find the transit station locations that have no Starbucks near them. For that we can create an area of a particular radius around each transit station and then check whether any Starbucks location lies within that area.

If none of the Starbucks are within that particular Transit Station’s area, we can append it to a list. At the end, we have a list of all Transit locations with no Starbucks near them. There are a total of 6 Transit Stations with no Starbucks near them.
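
A small Python sketch of this radius check using the haversine distance; the radius value and the input lists are illustrative.

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

RADIUS_KM = 0.5  # illustrative radius around each station

# stations and starbucks are lists of (lat, lon) tuples loaded earlier.
def stations_without_starbucks(stations, starbucks):
    result = []
    for s_lat, s_lon in stations:
        if all(haversine_km(s_lat, s_lon, b_lat, b_lon) > RADIUS_KM for b_lat, b_lon in starbucks):
            result.append((s_lat, s_lon))
    return result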

Crime near Transit Stations

Now let's filter the crime records down to just what we are interested in, which means the crime near transit stations. For that we will plot an area of a specific radius around each of them to see the crimes. This still leaves more than 110,000 crime records.

Crime near located Transit Stations

We now have all the transit stations that don't have any Starbucks near them, as well as the crime near all transit stations. Let's use this information to get the crime near the located transit stations. These are about 44,000 crime records.

This may seem correct at first glance, but the points are overlapping due to abundance, so we can create different lists of crimes based on their types.

Theft

Break and Enter

Mischief

Generating all possible coordinates

Now finally, we have all the prerequisites and let’s get to the main task at hand, predicting the best coordinate for the coffee shop.

There may be many approaches to solve this problem, but the one I used in this program is that I will create a grid of all possible locations (coordinates) in the area of 1 km radius around each located transit station.

Initially I generated one coordinate for every metre, which results in about 1,000,000 coordinates per square kilometre. This is a huge number, and for the 6 located transit stations it becomes 6 million. It may not seem like much at first glance, because computers can handle such data in a few seconds.

But for location prediction we need to compare each coordinate with the crime coordinates: the algorithm has to check ~7,000 thefts, ~19,000 break-ins, and ~17,000 mischief records around each generated coordinate. Computing this would require the program to run an estimated 432.4 billion comparisons. This sort of execution takes many hours on normal computers (sometimes days).

The solution is to create one coordinate for every 10 m instead, which results in about 10,000 coordinates per square kilometre. For the above-mentioned numbers of crimes, the estimate is still several billion comparisons. That significantly reduces the time, but it is still a lot.

To control this, we can remove duplicate crime coordinates as well as those that are within about 1 m of each other. Doing so, we are left with just 816 thefts, 2,654 break-ins, and 8,234 mischief records to check around each generated coordinate.
The precision will not be affected much, but the time and computational resources required will be reduced a lot.
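
A rough NumPy sketch of generating the candidate grid and thinning near-duplicate crime coordinates; the degree-per-metre conversion and the rounding precision are simplifications.

import numpy as np

STEP_M = 10               # one candidate coordinate every 10 m
RADIUS_M = 1000           # 1 km around each located transit station
DEG_PER_M = 1 / 111_320   # rough conversion near Vancouver's latitude (illustrative)

def candidate_grid(station_lat, station_lon):
    offsets = np.arange(-RADIUS_M, RADIUS_M + STEP_M, STEP_M) * DEG_PER_M
    lat_grid, lon_grid = np.meshgrid(station_lat + offsets, station_lon + offsets)
    return np.column_stack([lat_grid.ravel(), lon_grid.ravel()])

# Thin crime coordinates that are within ~1 m of each other by rounding and deduplicating.
def thin_points(points, decimals=5):      # 5 decimals of a degree is roughly 1 m
    return np.unique(np.round(points, decimals), axis=0)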

 

Checking Crime near Generated coordinates

Now that we have all the candidate locations, we will start processing them and check each coordinate against some constraints, which are, respectively:

  1. Filter out coordinates having theft within 1 km
    We get 122,000 coordinates with no thefts (below, merged 1000 to 1)
  2. Filter out coordinates having break-ins within 200 m
    We get 8,000 coordinates with no break-ins (below, merged 1000 to 1)
  3. Filter out coordinates having mischief within 200 m
    We get 6,000 coordinates with no mischief (below, merged 1000 to 1)
    Now that we have the 6 coordinates of the best locations that have passed through all the constraints, we will order them. To order them, we check their distance from the nearest transit location. The nearest will be at the top of the list as the best possible location, then the second, and so on. The generated list is:

    1. -123.0419406741792, 49.24824259252004
    2. -123.05887151659479, 49.24327221040713
    3. -123.05287151659476, 49.24327221040713
    4. -123.04994067417924, 49.239242592520064
    5. -123.0419406741792, 49.239242592520064
    6. -123.0409406741792, 49.239242592520064

How can MindTrades help?

MindTrades Consulting Services, a leading marketing agency, provides in-depth analysis and insights for the global IT sector, including leading data integration brands such as Diyotta. From cloud migration, Big Data, digital transformation, agile delivery, and cyber security to analytics, MindTrades provides published breakthrough ideas and prompt content delivery. For more information, refer to mindtrades.com.

Code

https://github.com/Mindtrades-Consulting/Coffee-Shop-Location-Predictor

 

Rethinking linear algebra part two: ellipsoids in data science

1 Our expedition of eigenvectors still continues

This article is still going to be about eigenvectors and PCA, and it still will not cover LDA (linear discriminant analysis). Here I would like you to build more organic links between data science ideas and eigenvectors.

In the second article, we have covered the following points:

  • You can visualize linear transformations with matrices by calculating displacement vectors, and they usually look like vectors swirling.
  • Diagonalization means finding a direction in which the displacement vectors do not swirl, and that is equal to finding a new axis/basis in which the linear transformation can be described more straightforwardly. But we have to consider the diagonalizability of the matrices.
  • In linear dimension reduction such as PCA or LDA, we mainly use types of matrices called positive definite or positive semidefinite matrices.

In the last article we have seen the following points:

  • PCA is an algorithm of calculating orthogonal axes along which data “swell” the most.
  • PCA is equivalent to calculating a new orthonormal basis for the data where the covariance between components is zero.
  • You can reduce the dimension of the data in the new coordinate system by ignoring the axes corresponding to small eigenvalues.
  • Covariance matrices enable linear transformation of rotation and expansion and contraction of vectors.

I emphasized that the axes are more important than the surface of the high dimensional ellipsoids, but in this article let’s focus more on the surface of ellipsoids, or I would rather say general quadratic curves. After also seeing how to draw ellipsoids on data, you would see the following points about PCA or eigenvectors.

  • Covariance matrices are real symmetric matrices, and also they are positive semidefinite. That means you can always diagonalize covariance matrices, and their eigenvalues are all equal or greater than 0.
  • PCA is equivalent to finding the axes of quadratic curves along which the gradients are largest. The values of the quadratic curves increase the most in those directions, and that means those directions describe a great deal of information about the data distribution.
  • Intuitively dimension reduction by PCA is equal to fitting a high dimensional ellipsoid on data and cutting off the axes corresponding to small eigenvalues.

Even if you already understand PCA to some extent, I hope this article provides you with deeper insight into PCA, and at least after reading this article, I think you will be more or less able to visually control eigenvectors and ellipsoids with the NumPy and Matplotlib libraries.

*Let me first introduce some mathematical facts and how I denote them throughout this article in advance. If you are allergic to mathematics, take it easy or please go back to my former articles.

  • Any quadratic curve can be denoted as \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0, where \boldsymbol{x}\in \mathbb{R}^D, A \in \mathbb{R}^{D\times D}, \boldsymbol{b}\in \mathbb{R}^D, s\in \mathbb{R}.
  • When I want to clarify dimensions of variables of quadratic curves, I denote parameters as A_D, b_D.
  • If a matrix A is a real symmetric matrix, there exists a rotation matrix U such that U^T A U = \Lambda, where \Lambda = diag(\lambda_1, \dots, \lambda_D) and U = (\boldsymbol{u}_1, \dots , \boldsymbol{u}_D). \boldsymbol{u}_1, \dots , \boldsymbol{u}_D are the eigenvectors corresponding to \lambda_1, \dots, \lambda_D respectively.
  • PCA corresponds to a case of diagonalizing A where A is a covariance matrix of certain data. When I want to clarify that A is a covariance matrix, I denote it as A=\Sigma.
  • Importantly, covariance matrices \Sigma are positive semidefinite and real symmetric, which means you can always diagonalize \Sigma and none of their eigenvalues can be lower than 0.

*In the last article, I denoted the covariance of data as S, based on Pattern Recognition and Machine Learning by C. M. Bishop.

*Sooner or later you are going to see that I am explaining basically the same ideas from different points of view, using the topic of PCA. However, I believe they are all important when you learn linear algebra for data science or machine learning. Even if you have not learnt linear algebra, or if you have to teach it, I recommend first reviewing the idea of diagonalization, as in the second article. And you should be conscious that, in the context of machine learning or data science, only a very limited type of matrices is important, which is what I have been explaining throughout this series.

2 Rotation or projection?

In this section I am going to talk about basic stuff found in most textbooks on linear algebra. In the last article, I mentioned that if A is a real symmetric matrix, you can diagonalize A with a rotation matrix U = (\boldsymbol{u}_1 \: \cdots \: \boldsymbol{u}_D), such that U^{-1}AU = U^{T}AU =\Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}). I also explained that PCA is a case where A=\Sigma, that is, A is the covariance matrix of certain data. \Sigma is known to be positive semidefinite and real symmetric. Thus you can always diagonalize \Sigma and none of its eigenvalues can be lower than 0.

I think we first need to clarify the difference between rotation and projection. In order to visualize the ideas, let’s consider a case of D=3. Assume that you have got an orthonormal rotation matrix U = (\boldsymbol{u}_1 \: \boldsymbol{u}_2 \: \boldsymbol{u}_3) which diagonalizes A. In the last article I said diagonalization is equivalent to finding new orthogonal axes formed by eigenvectors, and in the case of this section you got a new orthonormal basis (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3), which is shown in red in the figure below. Projecting a point \boldsymbol{x} = (x, y, z) on the new orthonormal basis is simple: you just have to multiply \boldsymbol{x} with U^T. Let U^T \boldsymbol{x} be (x', y', z')^T, and then \left( \begin{array}{c} x' \\ y' \\ z' \end{array} \right) = U^T\boldsymbol{x} = \left( \begin{array}{c} \boldsymbol{u}_1^{T}\boldsymbol{x} \\ \boldsymbol{u}_2^{T}\boldsymbol{x} \\ \boldsymbol{u}_3^{T}\boldsymbol{x} \end{array} \right). You can see that x', y', z' are \boldsymbol{x} projected on \boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3 respectively, and the left side of the figure below shows the idea. When you replace the original orthonormal basis (\boldsymbol{e}_1, \boldsymbol{e}_2, \boldsymbol{e}_3) with (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) as in the right side of the figure below, you can comprehend the projection as a rotation from (x, y, z) to (x', y', z') by a rotation matrix U^T.

Next, let’s see what rotation is. In case of rotation, you should imagine that you rotate the point \boldsymbol{x} in the same coordinate system, rather than projecting to other coordinate system. You can rotate \boldsymbol{x} by multiplying it with U. This rotation looks like the figure below.

In the initial position, the edges of the cube are aligned with the three orthogonal black axes (\boldsymbol{e}_1,  \boldsymbol{e}_2 , \boldsymbol{e}_3), with one corner of the cube located at the origin point of those axes. The purple dot denotes the corner of the cube directly opposite the origin corner. The cube is rotated in three dimensions, with the origin corner staying fixed in place. After the rotation with a pivot at the origin, the edges of the cube are now aligned with a new set of orthogonal axes (\boldsymbol{u}_1,  \boldsymbol{u}_2 , \boldsymbol{u}_3), shown in red. You might understand that more clearly with an equation: U\boldsymbol{x} = (\boldsymbol{u}_1 \: \boldsymbol{u}_2 \: \boldsymbol{u}_3) \left( \begin{array}{c} x \\ y \\ z \end{array} \right) = x\boldsymbol{u}_1 + y\boldsymbol{u}_2 + z\boldsymbol{u}_3. In short this rotation means you keep relative position of \boldsymbol{x}, I mean its coordinates (x, y, z), in the new orthonormal basis. In this article, let me call this a “cube rotation.”

The discussion above can be generalized to spaces with dimensions higher than 3. When U \in \mathbb{R}^{D \times D} is an orthonormal matrix and \boldsymbol{x} \in \mathbb{R}^D is a vector, you can project \boldsymbol{x} to \boldsymbol{x}' = U^T \boldsymbol{x} or rotate it to \boldsymbol{x}'' = U \boldsymbol{x}, where \boldsymbol{x}' = (x_{1}', \dots, x_{D}')^T and \boldsymbol{x}'' = (x_{1}'', \dots, x_{D}'')^T. In other words \boldsymbol{x} = U \boldsymbol{x}', which means you can rotate \boldsymbol{x}' back to the original point \boldsymbol{x} with the rotation matrix U.
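
A quick numerical check of this round trip, again only a sketch with an arbitrary symmetric matrix of my own choosing:

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                  # an arbitrary real symmetric matrix
_, U = np.linalg.eigh(A)           # orthonormal eigenvector matrix

x = rng.normal(size=4)
x_proj = U.T @ x                   # projection x'
x_rot = U @ x                      # rotation x''
print(np.allclose(U @ x_proj, x))  # True: rotating x' back recovers x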

I think you have at least seen that rotation and projection are basically the same thing, and that it is only a matter of how you look at the coordinate systems. But I would say the idea of projection is more important throughout this article.

Let’s consider a function f(\boldsymbol{x}; A) = \boldsymbol{x}^T A \boldsymbol{x} = (\boldsymbol{x}, A \boldsymbol{x}), where A\in \mathbb{R}^{D\times D} is a real symmetric matrix. The level sets of f(\boldsymbol{x}; A) are quadratic curves whose center is at the origin, and it is known that you can express this function in a much simpler way using eigenvectors. When you project this function on the eigenvectors of A, that is, when you substitute U \boldsymbol{x}' for \boldsymbol{x}, you get f = (\boldsymbol{x}, A \boldsymbol{x}) =(U \boldsymbol{x}', AU \boldsymbol{x}') = (\boldsymbol{x}')^T U^TAU \boldsymbol{x}' = (\boldsymbol{x}')^T \Lambda \boldsymbol{x}' = \lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2. You can always diagonalize real symmetric matrices, so the formula implies that the shapes of the quadratic curves largely depend on the eigenvalues, while the eigenvectors determine their orientation. We are going to see this in detail in the next section.

*(\boldsymbol{x}, \boldsymbol{y}) denotes an inner product of \boldsymbol{x} and \boldsymbol{y}.

*We are going to see details of the shapes of quadratic “curves” or “functions” in the next section.
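
As a small numerical sanity check of the formula above (my own sketch; the matrix and the point are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                        # an arbitrary real symmetric matrix

eigenvalues, U = np.linalg.eigh(A)
x = rng.normal(size=4)
x_prime = U.T @ x                        # x projected onto the eigenvectors of A

lhs = x @ A @ x                          # x^T A x
rhs = np.sum(eigenvalues * x_prime**2)   # lambda_1 x'_1^2 + ... + lambda_D x'_D^2
print(np.isclose(lhs, rhs))              # True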

To be exact, you cannot always naively multiply by U or U^T for rotation. Let’s take a part of the data I showed in the last article as an example. In the figure below, I projected the data onto the basis (\boldsymbol{u}_1,  \boldsymbol{u}_2 , \boldsymbol{u}_3).

You might have noticed that you cannot do a “cube rotation” in this case. If you form the coordinate system (\boldsymbol{u}_1, \boldsymbol{u}_2, \boldsymbol{u}_3) with your left hand, like you might have done in science classes at school to learn Fleming’s rule, you will soon realize that the coordinate systems in the figure above do not match. You need to flip the direction of one axis to match them.

Mathematically, you have to consider the determinant of the rotation matrix U. You can do a “cube rotation” only when det(U)=1; in the case above det(U) was -1, and you needed to flip one axis to make the determinant 1. In the example in the figure below, you can match the bases. This can also be generalized to higher dimensions, but that is beyond the scope of this article series. If you are really interested, you should prepare some coffee, snacks, textbooks on linear algebra, and a few free weekends.
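
In practice, an eigenvector routine does not guarantee det(U)=1, so you may have to flip one column yourself. A minimal sketch of that check (my own illustration):

import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(3, 3))
A = (B + B.T) / 2                  # an arbitrary real symmetric matrix

_, U = np.linalg.eigh(A)           # orthonormal, but det(U) may be -1
if np.linalg.det(U) < 0:
    U[:, 0] *= -1                  # flipping one axis makes det(U) = 1
print(np.isclose(np.linalg.det(U), 1.0))   # True: U is now a proper rotation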

When you want to draw general ellipsoids in a 3d space with Matplotlib, you can take advantage of rotation matrices. You first make a simple ellipsoid symmetric about the xyz axes using polar coordinates, and then you can rotate the whole ellipsoid with rotation matrices. I made some simple modules for drawing ellipsoids. If you put in a rotation matrix which diagonalizes the covariance matrix of the data and a list of three radii \sqrt{\lambda_1}, \sqrt{\lambda_2}, \sqrt{\lambda_3}, you can rotate the original ellipsoid so that it fits the data well.
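
The modules themselves are linked at the end of this article; the following is only a minimal sketch along the same lines, where plot_ellipsoid() and the covariance matrix are hypothetical examples of my own:

import numpy as np
import matplotlib.pyplot as plt

def plot_ellipsoid(ax, radii, U, center=np.zeros(3)):
    # An axis-aligned ellipsoid parameterized with spherical coordinates.
    theta, phi = np.meshgrid(np.linspace(0, np.pi, 30),
                             np.linspace(0, 2 * np.pi, 30))
    xyz = np.stack([radii[0] * np.sin(theta) * np.cos(phi),
                    radii[1] * np.sin(theta) * np.sin(phi),
                    radii[2] * np.cos(theta)])
    # Rotate every surface point with U and shift the ellipsoid to the center.
    rotated = np.einsum('ij,jkl->ikl', U, xyz) + center[:, None, None]
    ax.plot_surface(*rotated, alpha=0.3)

Sigma = np.array([[3.0, 1.0, 0.5],           # a hypothetical covariance matrix
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
eigenvalues, U = np.linalg.eigh(Sigma)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
plot_ellipsoid(ax, np.sqrt(eigenvalues), U)  # radii are sqrt(lambda_i)
plt.show()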

3 Types of quadratic curves.

*This article might look like mathematical writing, but I would say it is more about computer science. Please tolerate some inaccuracy in terms of mathematics. I give priority to visualizing the necessary mathematical ideas in my article series. If you are not sure about the details, please let me know.

In linear dimension reduction, or at least in this article series, you mainly have to consider ellipsoids. However, ellipsoids are just one type of quadratic curve. In the last article, I mentioned that when the center of a D-dimensional ellipsoid is the origin of a normal coordinate system, the formula of the surface of the ellipsoid is as follows: (\boldsymbol{x}, A\boldsymbol{x})=1, where A satisfies certain conditions. To be concrete, when (\boldsymbol{x}, A\boldsymbol{x})=1 is the surface of an ellipsoid, A has to be diagonalizable and positive definite.

*Real symmetric matrices are diagonalizable, and positive definite matrices have only positive eigenvalues. Covariance matrices \Sigma, whose displacement vectors I visualized in the last two articles, are known to be real symmetric and positive semi-definite. However, the surface of an ellipsoid which fits the data is \boldsymbol{x}^T \Sigma ^{-1} \boldsymbol{x} = const., not \boldsymbol{x}^T \Sigma \boldsymbol{x} = const..

*You have to keep in mind that the \boldsymbol{x} here are all deviations.

*You do not have to think too much about what the “semi” of the term “positive semi-definite” means for now.

As you can imagine, this is just one simple case of a richer variety of graphs. Let’s consider a 3-dimensional space. Any quadratic curve in this space can be denoted as ax^2 + by^2 + cz^2 + dxy + eyz + fxz + px + qy + rz + s = 0, where at least one of a, b, c, d, e, f, p, q, r, s is not 0. Let \boldsymbol{x} be (x, y, z)^T; then the quadratic curves can be simply denoted with a 3\times 3 matrix A and a 3-dimensional vector \boldsymbol{b} as follows: \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0, where A = \left( \begin{array}{ccc} a & \frac{d}{2} & \frac{f}{2} \\ \frac{d}{2} & b & \frac{e}{2} \\ \frac{f}{2} & \frac{e}{2} & c \end{array} \right), \boldsymbol{b} = \left( \begin{array}{c} \frac{p}{2} \\ \frac{q}{2} \\ \frac{r}{2} \end{array} \right). General quadratic curves are roughly classified into the 9 types below.
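
If you prefer code to symbols, here is a small sketch of this repacking (my own illustration; quadratic_to_matrix() and the coefficients are hypothetical):

import numpy as np

def quadratic_to_matrix(a, b, c, d, e, f, p, q, r, s):
    # Pack ax^2 + by^2 + cz^2 + dxy + eyz + fxz + px + qy + rz + s = 0
    # into x^T A x + 2 b^T x + s = 0.
    A = np.array([[a,     d / 2, f / 2],
                  [d / 2, b,     e / 2],
                  [f / 2, e / 2, c    ]])
    b_vec = np.array([p / 2, q / 2, r / 2])
    return A, b_vec, s

A, b_vec, s = quadratic_to_matrix(1, 2, 3, 0.5, 0.2, 0.1, 1, -1, 0.5, -4)
x, y, z = 0.3, -1.2, 0.7
lhs = (1*x**2 + 2*y**2 + 3*z**2 + 0.5*x*y + 0.2*y*z + 0.1*x*z
       + 1*x - 1*y + 0.5*z - 4)
v = np.array([x, y, z])
print(np.isclose(lhs, v @ A @ v + 2 * b_vec @ v + s))   # True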

You can shift these quadratic curves so that their center points come to the origin, without rotation, and the resulting curves are as follows. The curves can all be denoted as \boldsymbol{x}^T A\boldsymbol{x}.

As you can see, A is a real symmetric matrix. As I have mentioned repeatedly, when all the elements of a D \times D symmetric matrix A are real values and its eigenvalues are \lambda_{i} (i=1, \dots , D), there exist orthogonal/orthonormal matrices U such that U^{-1}AU = \Lambda, where \Lambda = diag(\lambda_{1}, \dots , \lambda_{D}). Hence, you can diagonalize A = \left( \begin{array}{ccc} a & \frac{d}{2} & \frac{f}{2} \\ \frac{d}{2} & b & \frac{e}{2} \\ \frac{f}{2} & \frac{e}{2} & c \end{array} \right) with an orthogonal matrix U. Let U be an orthogonal matrix such that U^T A U = \left( \begin{array}{ccc} \alpha  & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{array} \right) =\left( \begin{array}{ccc} \lambda_1  & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{array} \right). After you apply the rotation U^T to the curves (a)’ ~ (i)’, the resulting curves (a)” ~ (i)” are symmetrically placed about the xyz axes, and their center points still lie at the origin. The resulting curves look like the figure below. Or rather, I should say you projected (a)’ ~ (i)’ onto the eigenvectors of A.

In this article, mainly (a)”, (g)”, (h)”, and (i)” are important. The general equations for these curves are as follows:

  • (a)”: \frac{x^2}{l^2} + \frac{y^2}{m^2} + \frac{z^2}{n^2} = 1
  • (g)”: z = \frac{x^2}{l^2} + \frac{y^2}{m^2}
  • (h)”: z = \frac{x^2}{l^2} - \frac{y^2}{m^2}
  • (i)”: z = \frac{x^2}{l^2}

where l, m, n \in \mathbb{R}^+.

Even if this section has been puzzling to you, you just have to keep one point in mind: we have been discussing general quadratic curves, but in PCA you only need to consider the case where A is a covariance matrix, that is, A=\Sigma. PCA corresponds to the case where you shift and rotate the curve (a) into (a)”. Subtracting the mean of the data from each data point corresponds to shifting the quadratic curve (a) to (a)’. Calculating the eigenvectors of A corresponds to calculating a rotation matrix U such that the curve (a)’ becomes (a)” after applying the rotation, or to projecting the curves onto the eigenvectors of \Sigma. Importantly, we are only discussing the covariance of certain data, not the distribution of the data itself.
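
The whole correspondence fits in a few lines of NumPy. This is only a sketch with synthetic data of my own; the mean and covariance are arbitrary:

import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[5.0, -2.0, 1.0],       # synthetic 3D data
                            cov=[[3.0, 1.0, 0.5],
                                 [1.0, 2.0, 0.3],
                                 [0.5, 0.3, 1.0]],
                            size=500)

X_centered = X - X.mean(axis=0)            # shift: (a) -> (a)'
Sigma = np.cov(X_centered, rowvar=False)   # covariance matrix of the data
eigenvalues, U = np.linalg.eigh(Sigma)     # rotation: (a)' -> (a)''

X_projected = X_centered @ U               # coordinates on the eigenvectors
# The covariance of the projected data is (approximately) diagonal.
print(np.round(np.cov(X_projected, rowvar=False), 2))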

*Just in case you are interested in the slightly more mathematical side: it is known that if you rotate all the points \boldsymbol{x} on the curve \boldsymbol{x}^T A\boldsymbol{x} + 2\boldsymbol{b}^T\boldsymbol{x} + s = 0 with a rotation matrix P, those points \boldsymbol{x} are mapped onto a new quadratic curve \alpha x^2 + \beta y^2 + \gamma z^2 + \lambda x + \mu y + \nu z + \rho = 0. That means rotating the original quadratic curve with P (or rather rotating the axes) gets rid of the terms xy, yz, zx. It is also known that when \alpha \neq 0, with proper translations and rotations, the quadratic curve \alpha x^2 + \beta y^2 + \gamma z^2 + \lambda x + \mu y + \nu z + \rho = 0 can be mapped into one of the types of quadratic curves in the figure below, depending on the coefficients of the original quadratic curve. The discussion so far can be generalized to higher dimensional spaces, but that is beyond the scope of this article series. Please consult decent textbooks on linear algebra for further details.

4 Eigenvectors are gradients and sometimes variances.

In the second section I explained that you can express quadratic functions f(\boldsymbol{x}; A) = \boldsymbol{x}^T A \boldsymbol{x} in a very simple way by projecting \boldsymbol{x} on eigenvectors of A.

You can comprehend what I have explained in another way: eigenvectors, to be exact eigenvectors of real symmetric matrices A, are gradients; and in the case of PCA, I mean when A=\Sigma, the eigenvalues are also variances. Before explaining what that means, let me state a few totally common mathematical facts. If you have variables \boldsymbol{x}\in \mathbb{R}^D, I think you can comprehend functions f(\boldsymbol{x}) in two ways. One is as normal “functions” f(\boldsymbol{x}), and the other is as “curves” f(\boldsymbol{x}) = const.. “Functions” take an input \boldsymbol{x} and give an output f(\boldsymbol{x}), just like the normal functions you would imagine. “Curves” are rather sets of \boldsymbol{x} \in \mathbb{R}^D such that f(\boldsymbol{x}) = const..

*Please assume that the terms “functions” and “curves” are my own terminology. I use them just in case I fail to use the words function and curve properly.

The quadratic curves in the figure above are all “curves” in my terminology, which can be denoted as f(\boldsymbol{x}; A_3, \boldsymbol{b}_3)=const or f(\boldsymbol{x}; A_3)=const. However, if you replace the z of (g)”, (h)”, and (i)” with f, you can interpret the “curves” as “functions” denoted as f(\boldsymbol{x}; A_2). This might sound too obvious to you; my point is that you can visualize how the values of “functions” change only when the inputs are 2-dimensional.

When a symmetric 2\times 2 real matrix A_2 has two eigenvalues \lambda_1, \lambda_2, the resulting quadratic curves can be roughly classified into the following three types.

  • (g): Both \lambda_1 and \lambda_2 are positive or negative.
  • (h): Either of \lambda_1 or \lambda_2 is positive and the other is negative.
  • (i): Either of \lambda_1 or \lambda_2 is 0 and the other is not.

The equations of (g)”, (h)”, and (i)” correspond to each type of f(\boldsymbol{x}; A_2), and their curves look like the three graphs below.

And in fact, when you start from the origin and go in the direction of an eigenvector \boldsymbol{u}_i, \lambda_i is the gradient in that direction. You can see that more clearly when you restrict the distribution of f(\boldsymbol{x}; A_2) to the unit circle. As in the figure below, in the case \lambda_1 = 7, \lambda_2 = 3, which is classified as type (g), the distribution looks like the left side, and if you restrict the distribution to the unit circle, it looks like a bowl, as in the middle and the right side. When you move in the direction of \boldsymbol{u}_1, you can climb the bowl as high as \lambda_1, and in the direction of \boldsymbol{u}_2, as high as \lambda_2.
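
You can reproduce the numbers in this example with a few lines of NumPy. This is my own sketch; the matrix A_2 is an arbitrary symmetric matrix constructed to have the eigenvalues 7 and 3:

import numpy as np

# Build a symmetric 2x2 matrix with eigenvalues 7 and 3 (rotated by 30 degrees).
t = np.pi / 6
R = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])
A2 = R @ np.diag([7.0, 3.0]) @ R.T

eigenvalues, U = np.linalg.eigh(A2)                         # [3, 7] in ascending order
theta = np.linspace(0, 2 * np.pi, 1000)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # points on the unit circle
f = np.einsum('ij,jk,ik->i', circle, A2, circle)            # x^T A_2 x on the circle

print(round(f.max(), 3), round(f.min(), 3))   # ~7.0 and ~3.0
# The maximum is reached in the direction of the eigenvector with lambda = 7.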

Also in the case of (h), the same facts hold. But in this case, you can also descend the curve.

*You might have seen the curve above in the context of optimization with stochastic gradient descent. The origin of the curve above is a notorious saddle point, where the gradient is 0 in every direction but which is neither a local maximum nor a local minimum. Points can get stuck at this point during optimization.

Especially in the case of PCA, A is a covariance matrix, thus A=\Sigma. The eigenvalues of \Sigma are all equal to or greater than 0. And it is known that in this case \lambda_i is the variance of the data projected on its corresponding eigenvector \boldsymbol{u}_i (i=1, \dots , D). Hence, if you project f(\boldsymbol{x}; \Sigma), the quadratic curves formed by a covariance matrix \Sigma, on the eigenvectors of \Sigma, you get f(\boldsymbol{x}; \Sigma) = ({x'}_1 \: \dots \: {x'}_D) (\lambda_1 {x'}_1 \: \dots \: \lambda_D {x'}_D)^T =\lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2. This shows that you can re-weight ({x'}_1 \: \dots \: {x'}_D), the coordinates of the data projected on the eigenvectors of A, with \lambda_1, \dots, \lambda_D, which are the variances of ({x'}_1 \: \dots \: {x'}_D). As I mentioned in the example of exam score data in the last article, the bigger a variance \lambda_i is, the more the feature described by \boldsymbol{u}_i varies from sample to sample. In other words, you can ignore the eigenvectors corresponding to small eigenvalues.
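
A quick way to convince yourself that the eigenvalues are the variances of the projected data, using synthetic data of my own:

import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=np.zeros(3),           # synthetic data
                            cov=[[3.0, 1.0, 0.5],
                                 [1.0, 2.0, 0.3],
                                 [0.5, 0.3, 1.0]],
                            size=10_000)

Sigma = np.cov(X, rowvar=False)
eigenvalues, U = np.linalg.eigh(Sigma)

projected = X @ U                            # coordinates on the eigenvectors
variances = projected.var(axis=0, ddof=1)    # sample variances of x'_1, ..., x'_D
print(np.round(eigenvalues, 3))
print(np.round(variances, 3))                # matches the eigenvalues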

That is a great hint as to why the principal components corresponding to large eigenvalues contain much of the information of the data distribution. And you can also interpret PCA as “climbing” a bowl of f(\boldsymbol{x}; A_D), as I have visualized in the case of the type (g) curve in the figure above.

*But as I have repeatedly mentioned, the ellipsoid which fits the data well is f(\boldsymbol{x}; \Sigma ^{-1}) =(\boldsymbol{x}')^T diag(\frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D})\boldsymbol{x}' = \frac{({x'}_{1})^2}{\lambda_1} + \cdots + \frac{({x'}_{D})^2}{\lambda_D} = const..

*You have to be careful: even if you slice a type (g) curve f(\boldsymbol{x}; A_D) with a plane z=const., the resulting cross section does not fit the original data well, because the equation of the cross section is \lambda_1 ({x'}_1)^2 + \cdots + \lambda_D ({x'}_D)^2 = const. The figure below is an example of slicing the same f(\boldsymbol{x}; A_2) as the one above with z=1, and the resulting cross section.

As we have seen, \lambda_i, the eigenvalues of the covariance matrix of the data, are the variances of the data when projected on its eigenvectors. At the same time, when you fit an ellipsoid to the data, \sqrt{\lambda_i} is the radius of the ellipsoid corresponding to \boldsymbol{u}_i. Thus ignoring the data projected on eigenvectors corresponding to small eigenvalues is equivalent to cutting off the axes of the ellipsoid with small radii.
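
Cutting off the short axes is exactly what dimensionality reduction with PCA does. A minimal sketch with synthetic data of my own, keeping only the two longest axes of a 3D data set:

import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal(mean=np.zeros(3),        # synthetic 3D data
                            cov=[[3.0, 1.0, 0.5],
                                 [1.0, 2.0, 0.3],
                                 [0.5, 0.3, 0.1]],
                            size=1000)

Sigma = np.cov(X, rowvar=False)
eigenvalues, U = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]                # sort descending
U, eigenvalues = U[:, order], eigenvalues[order]

k = 2                                                # keep the two largest axes
X_reduced = X @ U[:, :k]                             # 3D data compressed to 2D
X_restored = X_reduced @ U[:, :k].T                  # back-projection into 3D
print(np.mean((X - X_restored) ** 2))                # small reconstruction error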

I have explained PCA in three different ways over three articles.

  • The second article: I focused on what kind of linear transformations covariance matrices \Sigma enable, by visualizing displacement vectors. Those vectors look like they are swirling and extending in the directions of the eigenvectors of \Sigma.
  • The third article: We directly searched for the directions in which a certain data distribution “swells” the most, to find that the data swell the most in the directions of the eigenvectors.
  • In this article, we have seen that PCA corresponds to only one case of quadratic functions, where the matrix A is a covariance matrix. When you go in the directions of the eigenvectors corresponding to big eigenvalues, the quadratic function increases the most. That also means data samples have bigger variances when projected on those eigenvectors. Thus you can cut off the eigenvectors corresponding to small eigenvalues because they retain little information about the data, and that is equivalent to fitting an ellipsoid to the data and cutting off the axes with small radii.

*Let A be a covariance matrix; you can diagonalize it with an orthogonal matrix U as follows: U^{T}AU = \Lambda, where \Lambda = diag(\lambda_1, \dots, \lambda_D). Thus A = U \Lambda U^{T}. Applied to a vector \boldsymbol{x}, U^T first rotates it onto the eigenvectors, multiplying by \Lambda then multiplies each element of the rotated vector by the corresponding eigenvalue, and at the end U rotates it back.
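
Seen in code, the decomposition applied step by step looks like this (my own sketch; the covariance matrix is an arbitrary example):

import numpy as np

Sigma = np.array([[3.0, 1.0, 0.5],     # an arbitrary covariance matrix
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
eigenvalues, U = np.linalg.eigh(Sigma)

x = np.array([1.0, -2.0, 0.5])
# U^T rotates x onto the eigenvectors, the eigenvalues scale each coordinate,
# and U rotates the result back: Sigma x = U Lambda U^T x.
step_by_step = U @ (eigenvalues * (U.T @ x))
print(np.allclose(Sigma @ x, step_by_step))   # True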

If you get data like the left side of the figure below, most explanations of PCA would just fit an oval to this data distribution. However, after reading this article series so far, you will have learned to see PCA from different viewpoints, as on the right side of the figure below.

 

5 Ellipsoids in Gaussian distributions.

I have explained that if the covariance of a data distribution is \boldsymbol{\Sigma}, the ellipsoid which fits the distribution the best is \bigl((\boldsymbol{x} - \boldsymbol{\mu}), \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu})\bigr) = 1. You might have seen the part \bigl((\boldsymbol{x} - \boldsymbol{\mu}), \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu})\bigr) = (\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) somewhere else. It is the exponent of the general Gaussian distribution: \mathcal{N}(\boldsymbol{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} exp\{ -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) \}. It is known that the eigenvalues of \Sigma ^{-1} are \frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D}, and the eigenvectors corresponding to each eigenvalue are also \boldsymbol{u}_1, \dots, \boldsymbol{u}_D respectively. Hence, just as we have seen, if you project (\boldsymbol{x} - \boldsymbol{\mu}) on each eigenvector of \Sigma ^{-1}, we can convert the exponent of the Gaussian distribution.

Let \boldsymbol{y} = \boldsymbol{x} - \boldsymbol{\mu} and \boldsymbol{y}' = U^{-1} \boldsymbol{y} = U^{T} \boldsymbol{y}, where U=(\boldsymbol{u}_1 \: \dots \: \boldsymbol{u}_D). Just as we have seen, (\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}) =\boldsymbol{y}^T\Sigma^{-1} \boldsymbol{y} =(U\boldsymbol{y}')^T \Sigma^{-1} U\boldsymbol{y}' =(\boldsymbol{y}')^T U^T \Sigma^{-1} U\boldsymbol{y}' = (\boldsymbol{y}')^T diag(\frac{1}{\lambda_1}, \dots, \frac{1}{\lambda_D}) \boldsymbol{y}' = \frac{({y'}_{1})^2}{\lambda_1} + \cdots + \frac{({y'}_{D})^2}{\lambda_D}. Hence, since |\boldsymbol{\Sigma}| = \lambda_1 \cdots \lambda_D, \mathcal{N}(\boldsymbol{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} exp\{ -\frac{1}{2}\boldsymbol{y}^T \boldsymbol{\Sigma}^{-1}\boldsymbol{y} \} = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} exp\{ -\frac{1}{2}(\frac{({y'}_{1})^2}{\lambda_1} + \cdots + \frac{({y'}_{D})^2}{\lambda_D} ) \} =\frac{1}{(2\pi \lambda_1)^{1/2}} exp\biggl( -\frac{({y'}_{1})^2}{2\lambda_1} \biggr) \cdots \frac{1}{(2\pi \lambda_D)^{1/2}} exp\biggl( -\frac{({y'}_{D})^2}{2\lambda_D} \biggr).
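
A short numerical check of this factorization, again only a sketch with an arbitrary mean and covariance of my own choosing:

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # Density of a multivariate Gaussian evaluated at a single point x.
    D = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[3.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
eigenvalues, U = np.linalg.eigh(Sigma)

x = np.array([2.0, 0.0, 1.0])
y_prime = U.T @ (x - mu)                     # coordinates on the eigenvectors

# Product of D independent 1D Gaussians with variances lambda_i.
factorized = np.prod(np.exp(-0.5 * y_prime**2 / eigenvalues)
                     / np.sqrt(2 * np.pi * eigenvalues))
print(np.isclose(gaussian_pdf(x, mu, Sigma), factorized))   # True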

*To be mathematically exact about changing the variables of normal distributions, you have to consider, for example, Jacobian matrices.

The results above demonstrate that, by projecting data on the eigenvectors of its covariance matrix, you can factorize the original multi-dimensional Gaussian distribution into a product of 1-dimensional Gaussian distributions which are independent of each other. However, at the same time, that is the potential limit of approximating data with PCA. This idea becomes more important when you think about more probabilistic ways to handle PCA, which are more robust to a lack of data.

I have explained PCA over 3 articles from various viewpoints. If you have been patient enough to read my article series, I think you have gained some deeper insight into not only PCA but also linear algebra, and that should be helpful when you learn or teach data science. I hope my code also helps you. In fact these are not the only topics about PCA. There are a lot of important PCA-like algorithms.

In fact our expedition into ellipsoids, or PCA, still continues, just as the Star Wars series still continues. Especially if I have to explain an algorithm named probabilistic PCA, I need to explain the “Bayesian world” of machine learning. Most machine learning algorithms covered by major introductory textbooks tend to be too deterministic and too dependent on the size of the data. Many of those algorithms have another “parallel world,” where you can handle inaccuracy in better ways. I hope I can also write about them, and I might prepare another trilogy for such PCA. But I will not disappoint you, like “The Phantom Menace.”

Appendix: making a model of a bunch of grapes with ellipsoid berries.

If you can control quadratic curves, reshaping and rotating them, you can make a model of a grape or olive bunch in Matplotlib. I made a program for making a model of a bunch of berries in Matplotlib, using the module for drawing ellipsoids which I introduced earlier. You can check the code on this page.

*I have no idea how many people on this earth are in need of making such models.

I made some modules so that you can see the grape bunch from several angles. This might look very simple to you, but the locations of the berries are organized carefully so that they look like they are placed around a stem and the berries are not too close to each other.

 

The programming code I created for this article is completely available here.
