In cooperation between DATANOMIQ, my consulting company for data science, business intelligence and process mining, and Pixolution, a specialist for computer vision with deep learning, we have created an infographic (PDF) about a very special use case for companies with deep learning: How to ensure occupational safety through automatic risk detection using using Deep Learning AI.
In cooperation between DATANOMIQ, my consulting company for data science, business intelligence and process mining, and Pixolution, a specialist for computer vision with deep learning, we have created an infographic (PDF) about a very special use case for companies with deep learning: How to protect the corporate identity of any company by ensuring consistent branding with automated font recognition.
The infographic is available as PDF download:
The process in the semiconductor industry is highly complicated and is normally under consistent observation via the monitoring of the signals coming from several sensors. Thus, it is important for the organization to detect the fault in the sensor as quickly as possible. There are existing traditional statistical based techniques however modern semiconductor industries have the ability to produce more data which is beyond the capability of the traditional process.
For this article, we will be using SECOM dataset which is available here. A lot of work has already done on this dataset by different authors and there are also some articles available online. In this article, we will focus on problem definition, data understanding, and data cleaning.
This article is only the first of three parts, in this article we will discuss the business problem in hand and clean the dataset. In second part we will do feature engineering and in the last article we will build some models and evaluate them.
This data which is collected by these sensors not only contains relevant information but also a lot of noise. The dataset contains readings from 590. Among the 1567 examples, there are only 104 fail cases which means that out target variable is imbalanced. We will look at the distribution of the dataset when we look at the python code.
NOTE: For a detailed description regarding this cases study I highly recommend to read the following research papers:
- Kerdprasop, K., & Kerdprasop, N. A Data Mining Approach to Automate Fault Detection Model Development in the Semiconductor Manufacturing Process.
- Munirathinam, S., & Ramadoss, B. Predictive Models for Equipment Fault Detection in the Semiconductor Manufacturing Process.
Data Understanding and Preparation
Let’s start exploring the dataset now. The first step as always is to import the required libraries.
import pandas as pd
import numpy as np
There are several ways to import the dataset, you can always download and then import from your working directory. However, I will directly import using the link. There are two datasets: one contains the readings from the sensors and the other one contains our target variable and a timestamp.
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
names = ["feature" + str(x) for x in range(1, 591)]
secom_var = pd.read_csv(url, sep=" ", names=names, na_values = "NaN")
url_l = "https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data"
secom_labels = pd.read_csv(url_l,sep=" ",names = ["classification","date"],parse_dates = ["date"],na_values = "NaN")
The first step before doing the analysis would be to merge the dataset and we will us pandas library to merge the datasets in just one line of code.
#1. Combined the two datasets
secom_merged = pd.merge(secom_var, secom_labels,left_index=True,right_index=True)
Now let’s check out the distribution of the target variable
secom_merged.target.value_counts().plot(kind = 'bar')
From Figure 1 it can be observed that the target variable is imbalanced and it is highly recommended to deal with this problem before the model building phase to avoid bias model. Xgboost is one of the models which can deal with imbalance classes but one needs to spend a lot of time to tune the hyper-parameters to achieve the best from the model.
The dataset in hand contains a lot of null values and the next step would be to analyse these null values and remove the columns having null values more than a certain percentage. This percentage is calculated based on 95th quantile of null values.
#2. Analyzing nulls
secom_nulls = secom_rmNa.isnull().sum()/len(secom_rmNa)
Now we calculate the 95th percentile of the null values.
x = secom_nulls.quantile(0.95)
secom_rmNa = secom_merged[secom_merged.columns[secom_nulls < x]]
From figure 3 its visible that there are still missing values in the dataset and can be dealt by using many imputation methods. The most common method is to impute these values by mean, median or mode. There also exist few sophisticated techniques like K-nearest neighbour and interpolation. We will be applying interpolation technique to our dataset.
secom_complete = secom_rmNa.interpolate()
To prepare our dataset for analysis we should remove some more unwanted columns like columns with near zero variance. For this we can calulate number of unique values in each column and if there is only one unique value we can delete the column as it holds no information.
df = secom_complete.loc[:,secom_complete.apply(pd.Series.nunique) != 1]
## Let's check the shape of the df
We have applied few data cleaning techniques and reduced the features from 590 to 444. However, In the next article we will apply some feature engineering techniques and adress problems like the curse of dimensionality and will also try to balance the target variable.
Bleiben Sie dran!!
Doing data science in a healthcare company can save lives. Whether it’s by predicting which patients have a tumor on an MRI, are at risk of re-admission, or have misclassified diagnoses in electronic medical records are all examples of how predictive models can lead to better health outcomes and improve the quality of life of patients. Nevertheless, the healthcare industry presents many unique challenges and opportunities for data scientists.
The impact of data science in healthcare
Healthcare providers have a plethora of important but sensitive data. Medical records include a diverse set of data such as basic demographics, diagnosed illnesses, and a wealth of clinical information such as lab test results. For patients with chronic diseases, there could be a long and detailed history of data available on a number of health indicators due to the frequency of visits to a healthcare provider. Information from medical records can often be combined with outside data as well. For example, a patient’s address can be combined with other publicly available information to determine the number of surgeons that practice near a patient or other relevant information about the type of area that patients reside in.
With this rich data about a patient as well as their surroundings, models can be built and trained to predict many outcomes of interest. One important area of interest is models predicting disease progression, which can be used for disease management and planning. For example, at Fresenius Medical Care (where we primarily care for patients with chronic conditions such as kidney disease), we use a Chronic Kidney Disease progression model that can predict the trajectory of a patient’s condition to help clinicians decide whether and when to proceed to the next stage in their medical care. Predictive models can also notify clinicians about patients who may require interventions to reduce risk of negative outcomes. For instance, we use models to predict which patients are at risk for hospitalization or missing a dialysis treatment. These predictions, along with the key factors driving the prediction, are presented to clinicians who can decide if certain interventions might help reduce the patient’s risk.
Challenges of data science in healthcare
One challenge is that the healthcare industry is far behind other sectors in terms of adopting the latest technology and analytics tools. This does present some challenges, and data scientists should be aware that the data infrastructure and development environment at many healthcare companies will not be at the bleeding edge of the field. However it also means there are a lot of opportunities for improvement, and even small simple models can yield vast improvements over current methods.
Another challenge in the healthcare sector arises from the sensitive nature of medical information. Due to concerns over data privacy, it can often be difficult to obtain access to data that the company has. For this reason, data scientists considering a position at a healthcare company should be aware of whether there is already an established protocol for data professionals to get access to the data. If there isn’t, be aware that simply getting access to the data may be a major effort in itself.
Finally, it is important to keep in mind the end-use of any predictive model. In many cases, there are very different costs to false-negatives and false-positives. A false-negative may be detrimental to a patient’s health, while too many false-positives may lead to many costly and unnecessary treatments (also to the detriment of patients’ health for certain treatments as well as economy overall). Education about the proper use of predictive models and their limitations is essential for end-users. Finally, making sure the output of a predictive model is actionable is important. Predicting that a patient is at high-risk is only useful if the model outputs is interpretable enough to explain what factors are putting that patient at risk. Furthermore, if the model is being used to plan interventions, the factors that can be changed need to be highlighted in some way – telling a clinician that a patient is at risk because of their age is not useful if the point of the prediction is to lower risk through intervention.
The future of data science in the healthcare sector
The future holds a lot of promise for data science in healthcare. Wearable devices that track all kinds of activity and biometric data are becoming more sophisticated and more common. Streaming data coming from either wearables or devices providing treatment (such as dialysis machines) could eventually be used to provide real-time alerts to patients or clinicians about health events outside of the hospital.
Currently, a major issue facing medical providers is that patients’ data tends to exist in silos. There is little integration across electronic medical record systems (both between and within medical providers), which can lead to fragmented care. This can lead to clinicians receiving out of date or incomplete information about a patient, or to duplication of treatments. Through a major data engineering effort, these systems could (and should) be integrated. This would vastly increase the potential of data scientists and data engineers, who could then provide analytics services that took into account the whole patients’ history to provide a level of consistency across care providers. Data workers could use such an integrated record to alert clinicians to duplications of procedures or dangerous prescription drug combinations.
Data scientists have a lot to offer in the healthcare industry. The advances of machine learning and data science can and should be adopted in a space where the health of individuals can be improved. The opportunities for data scientists in this sector are nearly endless, and the potential for good is enormous.
Mit Stolz und Freude darf ich verkünden, dass wir ausgehend von unserer Data Science Blog Community den Data Leader Day am 17. November in Berlin maßgeblich mitorganisieren werden!
Der große DataLeaderDay am 17. November 2016 in Berlin bringt das Silicon Valley nach Deutschland. Die Konferenz fokussiert dabei auf die beiden Megatrends in der Digitalwirtschaft: Data Science und Industrie 4.0. Erleben Sie auf dem Data Leader Day was jetzt möglich ist – von Pionieren und hochrangigen Anwendern.
Ein vielfältiges Programm mit Keynote, Präsentationen sowie Use & Business Cases zeigt Ihnen aus der Praxis, wie Sie die Digitalisierung im Unternehmen umsetzen und als neues Wertschöpfungsinstrument einsetzen können. Und das Wichtigste: Sie erleben, welche Wettbewerbsvorteile Sie mit diesen Technologien verwirklichen können. Der Networking-Hub bietet zudem viele Möglichkeiten um Spitzenkräfte zu treffen und um sich über neueste Technologien, Methoden und Entwicklungen auszutauschen.
Zielgruppe – und was Euch erwartet
Auf dem Event werden Entscheider in Führungsposition ihre erfolgreichen Big Data & Data Science Anwendungen präsentieren. Es wird für unterschiedliche Branchen und Fachbereiche viele Erfolgsstories geben, die Mut machen, selbst solche oder ähnliche Anwendungsfälle anzugehen. Ihr werdet mit den Entscheidern networken können!
– Persönliche Vermittlung für ein Karrieregespräch gesucht? Sprecht mich einfach an! –
Unser Data Leader Day richtet sich an Führungskräfte, die von der Digitalisierung bereits profitieren oder demnächst profitieren wollen, aber auch an technische Entwickler, die neue Impulse für erfolgreiche Big Data bzw. Smart Data Projekte mitnehmen möchten. Das Event ist exklusiv und nicht – wie sonst üblich – von Vertrieblern zum Verkauf designed, sondern von Anwendern für Anwender gemacht.
Ort, Programm und Agenda
Aktuelle Informationen zum Event finden sich auf der Event-Seite: www.dataleaderday.com
Sorry, no posts matched your criteria
Interesting linksHere are some interesting links for you! Enjoy your stay :)
- @Data Science Blog
- Autor werden!
- Become an Author
- Bootcamp Datenanalyse und Maschinelles Lernen mit Python
- CIO Interviews
- Computational and Data Science
- Coursera Data Science Specialization
- Coursera 用Python玩转数据 Data Processing Using Python
- Data Leader Day 2016 – Rabatt für Data Scientists!
- Data Science
- Data Science Business Professional
- Data Science Insights
- Data Science Partner
- DATANOMIQ Big Data & Data Science Seminare
- DATANOMIQ Process Mining Workshop
- DATANOMIQ Seminare für Führungskräfte
- DataQuest.io – Interactive Learning
- Donation / Spende
- Education / Certification
- Fraunhofer Academy Zertifikatsprogramm »Data Scientist«
- Für Presse / Redakteure
- HARVARD Data Science Certificate Courses
- Impressum / Imprint
- MapR Big Data Expert
- Masterstudiengang Data Science
- Masterstudiengang Management & Data Science
- MongoDB University Online Courses
- O’Reilly Video Training – Data Science with R
- qSkills Workshop: Industrial Security
- Science Digital Intelligence & Marketing Analytics
- Show your Desk!
- Stanford University Online -Statistical Learning
- Top Authors
- TU Chemnitz – Masterstudiengang Business Intelligence & Analytics
- TU Dortmund – Datenwissenschaft – Master of Science
- TU Dortmund berufsbegleitendes Zertifikatsstudium
- Weiterbildung mit Hochschulzertifikat Data Science & Business Analytics für Einsteiger
- WWU Münster – Zertifikatsstudiengang “Data Science”
- Zertifikatskurs „Data Science“
- Zertifizierter Business Analyst
- Apache Spark
- Artificial Intelligence
- Audit Analytics
- Big Data
- Business Analytics
- Business Intelligence
- Certification / Training
- Connected Car
- Customer Analytics
- Data Engineering
- Data Migration
- Data Mining
- Data Science
- Data Science at the Command Line
- Data Science Hack
- Data Science News
- Data Security
- Data Warehousing
- Deep Learning
- Education / Certification
- Excel / Power BI
- Graph Database
- Hadoop Framework
- Industrie 4.0
- Interview mit CIO
- Machine Learning
- Main Category
- Mobile Device Management
- Mobile Devices
- Natural Language Processing
- Predictive Analytics
- Process Mining
- R Statistics
- Realtime Analytics
- Software Development
- Sponsoring Partner Posts
- Text Mining
- Tool Introduction
- Use Case
- Use Cases
- Web- & App-Tracking
- May 2022
- April 2022
- March 2022
- February 2022
- January 2022
- December 2021
- November 2021
- October 2021
- September 2021
- August 2021
- July 2021
- June 2021
- May 2021
- April 2021
- March 2021
- February 2021
- January 2021
- December 2020
- November 2020
- October 2020
- September 2020
- August 2020
- July 2020
- June 2020
- May 2020
- April 2020
- March 2020
- February 2020
- January 2020
- December 2019
- November 2019
- October 2019
- September 2019
- August 2019
- July 2019
- June 2019
- May 2019
- April 2019
- March 2019
- February 2019
- January 2019
- December 2018
- November 2018
- October 2018
- September 2018
- August 2018
- July 2018
- June 2018
- May 2018
- April 2018
- March 2018
- February 2018
- January 2018
- December 2017
- November 2017
- October 2017
- September 2017
- August 2017
- July 2017
- June 2017
- May 2017
- April 2017
- March 2017
- February 2017
- January 2017
- December 2016
- November 2016
- October 2016
- September 2016
- August 2016
- July 2016
- June 2016
- May 2016
- April 2016
- March 2016
- February 2016
- January 2016
- December 2015
- November 2015
- October 2015
- September 2015
- August 2015
- July 2015
- June 2015
- May 2015