Digital Data Taxes in China: How Would Big Tech Be Affected?

As 2020 came to a close, Chinese officials hinted at new data regulations on the horizon. Yao Qian, a Chinese securities official, stated that China should impose a digital data tax on some tech companies. Considering big tech’s prominent presence in the country, these taxes, if enacted, could have considerable impacts on the industry.

The international technology industry is inseparable from China. Several of the world’s largest tech companies are Chinese, and many others have bases of operation in the country. As such, any legislature in the nation regarding technology has significant global implications.

The Current State of Digital Services Taxes

China wouldn’t be the first country to establish a digital services tax (DST). In Europe, eight nations have implemented a DST, and six more have proposed or announced such legislation. Not all of these include a digital data tax, but some do.

France has one of the broadest DSTs, covering revenues from a range of digital services. This includes the transmission of user data for advertising purposes, which seems similar to what China might enact. The U.S., which has no such tax, has opposed these measures, threatening France with tariffs until France agreed to postpone collecting DSTs on U.S. companies.

The United States Trade Representative (USTR) has responded similarly to other nations with DSTs. Given this precedent, it’s possible that it will do the same if China goes through with its tax proposal.

The Financial Impact of Digital Data Taxes in China

Chinese officials have said little about what their DST would include, so the impact is still uncertain. They have, however, mentioned that these taxes would specifically target tech platforms that “hold a large amount of users’ data.” This legislation would likely collect payments from online platforms like Facebook and Google based upon how much user data they hold.

Considering that Google, Facebook, Amazon and Microsoft may store as much as 1,200 petabytes between them, these taxes would quickly become expensive. While none of these are Chinese companies, they all do business in China or with Chinese customers. As a result, Big Tech, which relies heavily on collecting user data, could see substantial losses.

Massive tech corporations aren’t the only ones that could face massive payments. Typically, governments impose DSTs on gross revenue, not net income. Consequently, even smaller, unprofitable tech companies may end up paying significant sums to the Chinese government.

Potential Changes in Data Governance

This new financial burden wouldn’t be the only impact of a digital data tax. The way companies gather and handle user data could shift as they adapt to these changes. For instance, some tech companies could store less data at a time, at least for customers in China, to lower the taxes they have to pay.

Officials have said that these proposed taxes come out of concern for consumers’ data rights. This has been a rising issue as data breaches increase, although these are more often than not due to human error, not the platforms themselves. Still, in response to these regulations, Big Tech could take a step back when it comes to holding user information.

Some companies may respond by stepping out of China. Doing so would help them avoid decreased profits from these taxes but would represent a considerable loss in other areas. There are more than 883 million internet users in China, constituting the world’s largest online community. That market is likely too substantial for companies to ignore, even with more taxes.

While this wouldn’t be the first digital data tax, it would set a new precedent. Officials have proposed treating user data as a natural resource, which would represent a legislative first. As data becomes more crucial to businesses, other nations may follow suit, leading to further regulation of the tech industry.

Big Tech Can Expect Heightened Regulation in the Future

Whether or not China will implement this digital data tax remains uncertain. Even if they don’t, tech companies will likely face increased regulations as time goes on. User data is reaching new heights in both its abundance and utility, and world governments won’t likely sit idly by.

These new regulations may make conducting business, especially internationally, more challenging for Big Tech. On the other hand, they could also protect users. For now, tech companies have to stay vigilant about developing changes and be ready to adapt.

On the difficulty of language: prerequisites for NLP with deep learning

This is the first article of my article series “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

1 Preface

This section is virtually just my essay on language. You can skip this if you want to get down on more technical topic.

As I do not study in natural language processing (NLP) field, I would not be able to provide that deep insight into this fast changing deep leaning field throughout my article series. However at least I do understand language is a difficult and profound field, not only in engineering but also in many other study fields. Some people might be feeling that technologies are eliminating languages, or one’s motivations to understand other cultures. First of all, I would like you to keep it in mind that I am not a geek who is trying to turn this multilingual world into a homogeneous one and rebuild Tower of Babel, with deep learning. I would say I am more keen on social or anthropological sides of language.

I think you would think more about languages if you have mastered at least one foreign language. As my mother tongue is Japanese, which is totally different from many other Western languages in terms of characters and ambiguity, I understand translating is not what learning a language is all about. Each language has unique characteristics, and I believe they more or less influence one’s personalities. For example, many Western languages make the verb, I mean the conclusion, of sentences clear in the beginning part of the sentences. That is also true of Chinese, I heard. However in Japanese, the conclusion comes at the end, so that is likely to give an impression that Japanese people are being obscure or indecisive. Also, Japanese sentences usually omit their subjects. In German as well, the conclusion of a sentences tend to come at the end, but I am almost 100% sure that no Japanese people would feel German people make things unclear. I think that comes from the structures of German language, which tends to make the number, verb, relations of words crystal clear.

Let’s take an example to see how obscure Japanese is. A Japanese sentence 「頭が赤い魚を食べる猫」can be interpreted in five ways, depending on where you put emphases on.

Common sense tells you that the sentence is likely to mean the first two cases, but I am sure they can mean those five possibilities. There might be similarly obscure sentences in other languages, but I bet few languages can be as obscure as Japanese. Also as you can see from the last two sentences, you can omit subjects in Japanese. This rule is nothing exceptional. Japanese people usually don’t use subjects in normal conversations. And when you read classical Japanese, which Japanese high school students have to do just like Western students learn some of classical Latin, the writings omit subjects much more frequently.

*However interestingly we have rich vocabulary of subjects. The subject “I” can be translated to 「私」、「僕」、「俺」、「自分」、「うち」etc, depending on your personality, who you are talking to, and the time when it is written in.

I believe one can see the world only in the framework of their language, and it seems one’s personality changes depending on the language they use. I am not sure whether the language originally determines how they think, or how they think forms the language. But at least I would like you to keep it in mind that if you translate a conversation, for example a random conversation at a bar in Berlin, into Japanese, that would linguistically sound Japanese, but not anthropologically. Imagine that such kind of random conversation in Berlin or something is like playing a catch, I mean throwing a ball named “your opinion.” On the other hand,  normal conversations of Japanese people are in stead more of, I would say,  “resonance” of several tuning forks. They do their bests to show that they are listening to each other, by excessively nodding or just repeating “Really?”, but usually it seems hardly any constructive dialogues have been made.

*I sometimes feel you do not even need deep learning to simulate most of such Japanese conversations. Several-line Python codes would be enough.

My point is, this article series is mainly going to cover only a few techniques of NLP in deep learning field: sequence to sequence model (seq2seq model) , and especially Transformer. They are, at least for now, just mathematical models and mappings of a small part of this profound field of language (as far as I can cover in this article series). But still, examples of language would definitely help you understand Transformer model in the long run.

2 Tokens and word embedding

*Throughout my article series, “words” just means the normal words you use in daily life. “Tokens” means more general unit of NLP tasks. For example the word “Transformer” might be denoted as a single token “Transformer,” or maybe as a combination of two tokens “Trans” and “former.”

One challenging part of handling language data is its encodings. If you started learning programming in a language other than English, you would have encountered some troubles of using keyboards with different arrangements or with characters. Some comments on your codes in your native languages are sometimes not readable on some software. You can easily get away with that by using only English, but when it comes to NLP you have to deal with this difficulty seriously. How to encode characters in each language should be a first obstacle of NLP. In this article we are going to rely on a library named BPEmb, which provides word embedding in various languages, and you do not have to care so much about encodings in languages all over the world with this library.

In the first section, you might have noticed that Japanese sentence is not separated with spaces like Western languages. This is also true of Chinese language, and that means we need additional tasks of separating those sentences at least into proper chunks of words. This is not only a matter of engineering, but also of some linguistic fields. Also I think many people are not so conscious of how sentences in their native languages are grammatically separated.

The next point is, unlike other scientific data, such as temperature, velocity, voltage, or air pressure, language itself is not measured as numerical data. Thus in order to process language, including English, you first have to map language to certain numerical data, and after some processes you need to conversely map the output numerical data into language data. This section is going to be mainly about one-hot encoding and word embedding, the ways to convert word/token into numerical data. You might already have heard about this

You might have learnt about word embedding to some extent, but I hope you could get richer insight into this topic through this article.

2.1 One-hot encoding

One-hot encoding would be the most straightforward way to encode words/tokens. Assume that you have a dictionary whose size is |\mathcal{V}|, and it includes words from “a”, “ablation”, “actually” to “zombie”, “?”, “!”

In a mathematical manner, in order to choose a word out of those |\mathcal{V}| words, all you need is a |\mathcal{V}| dimensional vector, one of whose elements is 1, and the others are 0. When you want to choose the No. i word, which is “indeed” in the example below, its corresponding one-hot vector is \boldsymbol{v} = (0, \dots, 1, \dots, 0 ), where only the No. i element is 1. One-hot encoding is also easy to understand, and that’s all. It is easy to imagine that people have already come up with more complicated and better way to encoder words. And one major way to do that is word embedding.

2.2 Word embedding

Source: Francois Chollet, Deep Learning with Python,(2018), Manning

Actually word embedding is related to one-hot encoding, and if you understand how to train a simple neural network, for example densely connected layers, you would understand word embedding easily. The key idea of word embedding is denoting each token with a D dimensional vector, whose dimension is fewer than the vocabulary size |\mathcal{V}|. The elements of the resulting word embedding vector are real values, I mean not only 0 or 1. Obviously you can encode much richer variety of tokens with such vectors. The figure at the left side is from “Deep Learning with Python” by François Chollet, and I think this is an almost perfect and simple explanation of the comparison of one-hot encoding and word embedding. But the problem is how to get such convenient vectors. The answer is very simple: you have only to train a network whose inputs are one-hot vector of the vocabulary.

The figure below is a simplified model of word embedding of a certain word. When the word is input into a neural network, only the corresponding element of the one-hot vector is 1, and that virtually means the very first input layer is composed of one neuron whose value is 1. And the only one neuron propagates to the next D dimensional embedding layer. These weights are the very values which most other study materials call “an embedding vector.”

When you input each word into a certain network, for example RNN or Transformer, you map the input one-hot vector into the embedding layer/vector. The examples in the figure are how inputs are made when the input sentences are “You’ve got the touch” and “You’ve got the power.”   Assume that you have a dictionary of one-hot encoding, whose vocabulary is {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}, and the dimension of word embeding is 6. In this case |\mathcal{V}| = 9, D=6. When the inputs are “You’ve got the touch” or “You’ve got the power” , you put the one-hot vector corresponding to “You’ve”, “got”, “the”, “touch” or “You’ve”, “got”, “the”, “power” sequentially every time step t.

In order to get word embedding of certain vocabulary, you just need to train the network. We know that the words “actually” and “indeed” are used in similar ways in writings. Thus when we propagate those words into the embedding layer, we can expect that those embedding layers are similar. This is how we can mathematically get effective word embedding of certain vocabulary.

More interestingly, if word embedding is properly trained, you can mathematically “calculate” words. For example, \boldsymbol{v}_{king} - \boldsymbol{v}_{man} + \boldsymbol{v}_{woman} \approx \boldsymbol{v}_{queen}, \boldsymbol{v}_{Japan} - \boldsymbol{v}_{Tokyo} + \boldsymbol{v}_{Vietnam} \approx \boldsymbol{v}_{Hanoi}.

*I have tried to demonstrate this type of calculation on several word embedding, but none of them seem to work well. At least you should keep it in mind that word embedding learns complicated linear relations between words.

I should explain word embedding techniques such as word2vec in detail, but the main focus of this article is not NLP, so the points I have mentioned are enough to understand Transformer model with NLP examples in the upcoming articles.


3 Language model

Language models is one of the most straightforward, but crucial ideas in NLP. This is also a big topic, so this article is going to cover only basic points. Language model is a mathematical model of the probabilities of which words to come next, given a context. For example if you have a sentence “In the lecture, he opened a _.”, a language model predicts what comes at the part “_.” It is obvious that this is contextual. If you are talking about general university students, “_” would be “textbook,” but if you are talking about Japanese universities, especially in liberal art department, “_” would be more likely to be “smartphone. I think most of you use this language model everyday. When you type in something on your computer or smartphone, you would constantly see text predictions, or they might even correct your spelling or grammatical errors. This is language modelling. You can make language models in several ways, such as n-gram and neural language models, but in this article I can explain only general formulations for such models.

*I am not sure which algorithm is used in which services. That must be too fast changing and competitive for me to catch up.

As I mentioned in the first article series on RNN, a sentence is usually processed as sequence data in NLP. One single sentence is denoted as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}), a list of vectors. The vectors are usually embedding vectors, and the (t) is the index of the order of tokens. For example the sentence “You’ve go the power.” can be expressed as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}), where \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)} denote “You’ve”, “got”, “the”, “power”, “.” respectively. In this case \tau = 4.

In practice a sentence \boldsymbol{X} usually includes two tokens BOS and EOS at the beginning and the end of the sentence. They mean “Beginning Of Sentence” and “End Of Sentence” respectively. Thus in many cases \boldsymbol{X} = (\boldsymbol{BOS} , \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS} ). \boldsymbol{BOS} and \boldsymbol{EOS} are also both vectors, at least in the Tensorflow tutorial.

P(\boldsymbol{X} = (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) is the probability of incidence of the sentence. But it is easy to imagine that it would be very hard to directly calculate how likely the sentence \boldsymbol{X} appears out of all possible sentences. I would rather say it is impossible. Thus instead in NLP we calculate the probability P(\boldsymbol{X}) as a product of the probability of incidence or a certain word, given all the words so far. When you’ve got the words (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1}) so far, the probability of the incidence of \boldsymbol{x}^{(t)}, given the context is  P(\boldsymbol{x}^{(t)}|\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1)}). P(\boldsymbol{BOS}) is a probability of the the sentence \boldsymbol{X} being (\boldsymbol{BOS}), and the probability of \boldsymbol{X} being (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) can be decomposed this way: P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS})P(\boldsymbol{BOS}).

Just as well P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) = P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P( \boldsymbol{BOS}, \boldsymbol{x}^{(1)})= P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P( \boldsymbol{BOS}).

Hence, the general probability of incidence of a sentence \boldsymbol{X} is P(\boldsymbol{X})=P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(\tau -1)}, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) = P(\boldsymbol{EOS}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}) P(\boldsymbol{x}^{(\tau)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau - 1)}) \cdots P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P(\boldsymbol{BOS}).

Let \boldsymbol{x}^{(0)} be \boldsymbol{BOS} and \boldsymbol{x}^{(\tau + 1)} be \boldsymbol{EOS}. Plus, let P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]}) be P(\boldsymbol{x}^{(t+1)}|\boldsymbol{x}^{(0)}, \dots, \boldsymbol{x}^{(t)}), then P(\boldsymbol{X}) = P(\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Language models calculate which words to come sequentially in this way.

Here’s a question: how would you evaluate a language model?

I would say the answer is, when the language model generates words, the more confident the language model is, the better the language model is. Given a context, when the distribution of the next word is concentrated on a certain word, we can say the language model is confident about which word to come next, given the context.

*For some people, it would be more understandable to call this “entropy.”

Let’s take the vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”} as an example. Assume that P(\boldsymbol{X}) = P(\boldsymbol{BOS}, \boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}, \boldsymbol{EOS}) = P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}, \boldsymbol{EOS})= P(\boldsymbol{x}^{(0)})\prod_{t=0}^{4}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Given a context (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}), the probability of incidence of \boldsymbol{x}^{(2)} is P(\boldsymbol{x}^{2}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}). In the figure below, the distribution at the left side is less confident because probabilities do not spread widely, on the other hand the one at the right side is more confident that next word is “got” because the distribution concentrates on “got”.

*You have to keep it in mind that the sum of all possible probability P(\boldsymbol{x}^{(2)} | \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) is 1, that is, P(\boldsymbol{the}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + P(\boldsymbol{You've}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + \cdots + P(\boldsymbol{Boogie}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = 1.

While the language model generating the sentence “BOS You’ve got the touch EOS”, it is better if the language model keeps being confident. If it is confident, P(\boldsymbol{X})= P(\boldsymbol{BOS}) P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}}P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}) P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} gets higher. Thus (-1) \{ log_{b}{P(\boldsymbol{BOS})} + log_{b}{P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}}) + log_{b}{P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)})} + log_{b}{P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)})} + log_{b}{P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} \} gets lower, where usually b=2 or b=e.

This is how to measure how confident language models are, and the indicator of the confidence is called perplexity. Assume that you have a data set for evaluation \mathcal{D} = (\boldsymbol{X}_1, \dots, \boldsymbol{X}_n, \dots, \boldsymbol{X}_{|\mathcal{D}|}), which is composed of |\mathcal{D}| sentences in total. Each sentence \boldsymbol{X}_n = (\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau ^{(n)}}{P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})} has \tau^{(n)} tokens in total excluding \boldsymbol{BOS}, \boldsymbol{EOS}. And let |\mathcal{V}| be the size of the vocabulary of the language model. Then the perplexity of the language model is b^z, where z = \frac{-1}{|\mathcal{V}|}\sum_{n=1}^{|\mathcal{D}|}{\sum_{t=0}^{\tau ^{(n)}}{log_{b}P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})}. The b is usually 2 or e.

For example, assume that \mathcal{V} is vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}. Also assume that the evaluation data set for perplexity of a language model is \mathcal{D} = (\boldsymbol{X}_1, \boldsymbol{X}_2), where \boldsymbol{X_1} =(\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}) \boldsymbol{X_2} = (\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the }, \boldsymbol{power}). In this case |\mathcal{V}|=9, |\mathcal{D}|=2. I have already showed you how to calculate the perplexity of the sentence “You’ve got the touch.” above. You just need to do a similar thing on another sentence “You’ve got the power”, and then you can get the perplexity of the language model.

*If the network is not properly trained, it would also be confident of generating wrong outputs. However, such network still would give high perplexity because it is “confident” at any rate. I’m sorry I don’t know how to tackle the problem. Please let me put this aside, and let’s get down on Transformer model soon.


Let’s see how word embedding is implemented with a very simple example in the official Tensorflow tutorial. It is a simple binary classification task on IMDb Dataset. The dataset is composed to comments on movies by movie critics, and you have only to classify if the commentary is positive or negative about the movie. For example when you get you get an input “To be honest, Michael Bay is a terrible as an action film maker. You cannot understand what is going on during combat scenes, and his movies rely too much on advertisements. I got a headache when Mark Walberg used a Chinese cridit card in Texas. However he is very competent when it comes to humorous scenes. He is very talented as a comedy director, and I have to admit I laughed a lot.“, the neural netowork has to judge whether the statement is positive or negative.

This networks just takes an average of input embedding vectors and regress it into a one dimensional value from 0 to 1. The shape of embedding layer is (8185, 16). Weights of neural netowrks are usually implemented as matrices, and you can see that each row of the matrix corresponds to emmbedding vector of each token.

*It is easy to imagine that this technique is problematic. This network virtually taking a mean of input embedding vectors. That could mean if the input sentence includes relatively many tokens with negative meanings, it is inclined to be classified as negative. But for example, if the sentence is “This masterpiece is a dark comedy by Charlie Chaplin which depicted stupidity of the evil tyrant gaining power in the time. It thoroughly mocked Germany in the time as an absurd group of fanatics, but such propaganda could have never been made until ‘Casablanca.'” , this can be classified as negative, because only the part “masterpiece” is positive as a token, and there are much more words with negative meanings themselves.

The official Tensorflow tutorial provides visualization of word embedding with Embedding Projector, but I would like you to take more control over the data by yourself. Please just copy and paste the codes below, installing necessary libraries. You would get a map of vocabulary used in the text classification task. It seems you cannot find clear tendency of the clusters of the tokens. You can try other dimension reduction methods to get maps of the vocabulary by for example using Scikit Learn.


[1] “Word embeddings” Tensorflow Core

[2]Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 43-64, 72-85, 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 43-64, 72-85, 191-193

[3]”Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention”, stanfordonline, (2019)

[4] Francois Chollet, Deep Learning with Python,(2018), Manning , pp. 178-185

[5]”2.2. Manifold learning,” scikit-learn

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Operational Data Store vs. Data Warehouse

One of the main problems with large amounts of data, especially in this age of data-driven tools and near-instant results, is how to store the data. With proper storage also comes the challenge of keeping the data updated, and this is the reason why organizations focus on solutions that will help make data processing faster and more efficient. For many, a digital transformation is in their roadmap, thanks in large part to the changes brought about by the global COVID-19 pandemic. The problem is that organizations often assume that it’s similar to traditional change initiatives, which can’t be any further from the truth. There are a number of challenges to prepare for in digital transformations, however, and without proper planning, non-unified data storage systems and systems of record implemented through the years can slow down or even hinder the process.

Businesses have relied on two main solutions for data storage for many years: traditional data warehouses and operational data stores (ODS). These key data structures provide assistance when it comes to boosting business intelligence so that the business can make sound corporate decisions based on data. Before considering which one will work for your business, it’s important to understand the main differences between the two.

What is a Data Warehouse?

Data warehousing is a common practice because a data warehouse is designed to support business intelligence tools and activities. It’s subject-oriented so data is centered on customers, products, sales, or other subjects that contribute to the business bottom line. Because data comes from a multitude of sources, a data warehouse is also designed to consolidate large amounts of data in a variety of formats, including flat files, legacy database management systems, and relational database management systems. It’s considered an organization’s single source of truth because it houses historical records built through time, which could become invaluable as a source of actionable insights.

One of the main disadvantages of a data warehouse is its non-volatile nature. Non-volatile data is read-only and, therefore, not frequently updated or deleted over time. This leads to some time variance, which means that a data warehouse only stores a time series of periodic data snapshots that show the state of data during specific periods. As such, data loading and data retrieval are the most vital operations for a data warehouse.

What is an Operational Data Store?

Forward-thinking companies turn to an operational data store to resolve the issues with data warehousing, primarily, the issue of always keeping data up-to-date. Similar to a data warehouse, an ODS can aggregate data from multiple sources and report across multiple systems of record to provide a more comprehensive view of the data. It’s essentially a staging area that can receive operational data from transactional sources and can be queried directly. This allows data analytics tools to query ODS data as it’s received from the respective source systems. This offloads the burden from the transactional systems by only providing access to current data that’s queried in an integrated manner. This makes an ODS the ideal solution for those looking for near-real time data that’s processed quickly and efficiently.

Traditional ODS solutions, however, typically suffer from high latency because they are based on either relational databases or disk-based NoSQL databases. These systems simply can’t handle large amounts of data and provide high performance at the same time, which is a common requirement of most modern applications. The limited scalability of traditional systems also leads to performance issues when multiple users access the data store all at the same time. As such, traditional ODS solutions are incapable of providing real-time API services for accessing systems of record.

A Paradigm Shift

As modern real-time digital applications replace previously offline services, companies are going through a paradigm shift and venturing beyond what traditional data storage systems can offer. This has led to the rise of a new breed of ODS solutions that Gartner refers to as digital integration hubs. It’s a cost-effective solution because it doesn’t require a rip-and-replace if you already have a traditional ODS in place. Adopting a digital integration hub can be as simple as augmenting your current system with the missing layers, including the microservices API, smart cache, and event-driven architecture.

While sticking with a data warehouse or traditional ODS may not necessarily hurt your business, the benefits of modernization via a digital integration hub are too great to ignore. Significant improvements in throughput, availability, and scalability will help organizations become more agile so they can drive innovation quicker, helping their industry and pushing the limits of technology further to open up possibilities never before discovered.

Role of Data Science in Education

Ad / Sponsored Post

Data science is a new science that appeared thanks to a lot of reasons. The first reason is that nowadays, we have enough capacity to gather data and later work with it. The second reason is that society accumulates a lot of information every minute, and gadgets can save and then send it to data centers without any communication between people. But saving data is just one step in the world of science. The main task is how to analyze and show the results and later make a conclusion and prognoses. A team of online essay experts from a professional academic writing company say that requests for academic papers for data science research topics increase extremely. It happens because data science analysts’ knowledge is useful in a lot of spheres, and the demand for such specialists is very high. Data science is an important part of sociology, political forecasting, the theory of games, statistics and others. Students need to study it and use it for future research. That’s why the universities added the courses of big data to be modern and meet requirements. Let’s try to discover the role of data science in education to make your own conclusion about its importance.

Why is data science necessary and how to become good in it?

The improvement of the studying process.

All students are different. They use different skills for studying and perceive information differently according to a lot of nuances. For some of them, the best way of getting information from lectors is listening without interruption. Other students prefer discussion during lectures. Some prefer to make notes. Others like to listen carefully and make notes later using the audio version of the lecture. Every group is different but has the same goal, and this goal is to absorb as much information as possible. The best assistant for this goal is data science that will show the teacher the best ways of communicating with students.

Using big data for personal needs. 

Have you ever thought that every one of us is a data scientist? We all use data, analyze it, and act according to conclusions. For example, shopping. Every time you go to the grocery, you notice how many people are there and how long the line is. When you plan your next shopping, you make a prognosis according to that data — the time you need to spend in the market and make a decision if it is optimal to go right now or it is better to visit the shop later when it is almost empty. The same thing works when we talk about studying. According to your observation and experience (that are both data), you make a conclusion on how much time you need to spend on every task.

Learning data science as an additional course. 

To know data science as an additional profession nowadays is very helpful. For the employer, it will be a bonus that can be a decisive factor. The skill to analyze is essential for every profession and helps to understand the market now and necessary for sales. The hardest thing for a data scientist is to ask the right questions for collecting data, and if you are good at it, your salary will increase immensely.

How to become a data science specialist?

Big data and artificial intelligence (AI). The importance of development. We believe that AI is the future of studying. First of all, the machine can collect the information and repeat it as much as the student needs. It is studying with the group, and the quality of communication grows rapidly. The base for machine learning in AI is data science. The analysis of data gives the set of possible reactions and actions to AI that can be changed or improved according to new data that was processed by AI. But from the beginning, AI is a huge set of data. It consists of reactions to life situations, speed and timbre of the voice of the interlocutor, country, the hour of the day, and a lot of other data that finally lead to the reaction of AI to the situation. The development of AI is a question of time, and it will help us to move faster in all spheres of life.

Can we ignore data analytics and don’t take a part in it?

The only way to ignore data science is to throw away all gadgets and become citizens of the wood cabin. And even this step won’t help. Those interested in the amount of population of people who live in the woods will be happy to add you to their list and analyze your +1 using data science technology. Every time you buy bread or don’t buy something on the internet, you are counted. Later they will analyze why you became or not their customer and save for statistics you age, sex, country of request, and all other parameters they can catch. We can only accept this reality and try to use it to our needs.

Don’t be afraid to become a subject for data science. It doesn’t affect your privacy a lot because there are billions of us, and we are only one point for statistics. Thanks to such analytics, scientists can make better specific offers and content for you. On the one hand, it is a kind of manipulation, but it saves time and resources for research from another hand. Use it to make your life more comfortable but remember that there are no coincidences when we talk about data.

Ad / Sponsored Post

Digital und Data braucht Vorantreiber

2020 war das Jahr der Trendwende hin zu mehr Digitalisierung in Unternehmen: Telekommunikation und Tools für Unified Communications & Collaboration (UCC) wie etwa Microsoft Teams oder Skype boomen genauso wie der digitale Posteingang und das digitale Signieren von Dokumenten. Die  Vernetzung und Automatisierung ganz im Sinne der Industrie 4.0 finden nicht nur in der Produktion und Logistik ihren Einzug, sondern beispielsweise auch in Form der Robot Process Automation (RPA) ins Büro – bei vielen Unternehmen ein aktuelles Top-Thema. Und in Zeiten, in denen der öffentliche Verkehr zum unangenehmen Gesundheitsrisiko wird und der Individualverkehr wieder cool ist, boomen digital unterstützte Miet- und Sharing-Angebote für Automobile mehr als je zuvor, gleichwohl autonome Fahrzeuge oder post-ausliefernde Drohnen nach wie vor schmerzlich vermisst werden.

Nahezu jedes Unternehmen muss in der heutigen Zeit nicht nur mit der Digitalisierung der Gesellschaft mithalten, sondern auch sich selbst digital organisieren können und bestenfalls eigene Innovationen vorantreiben. Hierfür ist sollte es mindestens eine verantwortliche Stelle geben, den Chief Digital Officer.

Chief Digital Officer gelten spätestens seit 2020 als Problemlöser in der Krise

Einem Running Gag zufolge haben wir den letzten Digitalisierungsvorschub keinem menschlichen Innovator, sondern der Corona-Pandemie zu verdanken. Und tatsächlich erzwang die Pandemie insbesondere die verstärkte Etablierung von digitalen Alternativen für die Kommunikation und Zusammenarbeit im Unternehmen sowie noch digitalere Shop- und Lieferdiensten oder auch digitale Qualifizierungs- und Event-Angebote. Dennoch scheint die Pandemie bisher noch mit überraschend wenig Innovationskraft verbunden zu sein, denn die meisten Technologien und Konzepte der Digitalisierung waren lange vorher bereits auf dem Erfolgskurs, wenn auch ursprünglich mit dem Ziel der Effizienzsteigerung im Unternehmen statt für die Einhaltung von Abstandsregeln. Die eigentlichen Antreiber dieser Digitalisierungsvorhaben waren bereits lange vorher die Chief Digital Officer (CDO).

Zugegeben ist der Grad an Herausforderung nicht für alle CDOs der gleiche, denn aus unterschiedlichen Branchen ergeben sich unterschiedliche Schwerpunkte. Die Finanzindustrie arbeitet seit jeher im Kern nur mit Daten und betrachtet Digitalisierung eher nur aus der Software-Perspektive. Die produzierende Industrie hat mit der Industrie 4.0 auch das Themenfeld der Vernetzung größere Hürden bei der umfassenden Digitalisierung, aber auch die Logistik- und Tourismusbranchen müssen digitalisieren, um im internationalen Wettbewerb nicht den Boden zu verlieren.

Digitalisierung ist ein alter Hut, aber aktueller denn je

Immer wieder wird behauptet, Digitalisierung sei neu oder – wie zuvor bereits behauptet – im Kern durch Pandemien getrieben. Dabei ist, je nach Perspektive, der Hauptteil der Digitalisierung bereits vor Jahrzehnten mit der Einführung von Tabellenkalkulations- sowie ERP-Software vollzogen. Während in den 1980er noch Briefpapier, Schreibmaschinen, Aktenordner und Karteikarten die Bestellungen auf Kunden- wie auf Lieferantenseite beherrschten, ist jedes Unternehmen mit mehr als hundert Mitarbeiter heute grundsätzlich digital erfasst, wenn nicht gar längst digital gesteuert. Und ERP-Systeme waren nur der Anfang, es folgten – je nach Branche und Funktion – viele weitere Systeme: MES, CRM, SRM, PLM, DMS, ITS und viele mehr.

Zwischenzeitlich kamen um die 2000er Jahre das Web 2.0, eCommerce und Social Media als nächste Evolutionsstufe der Digitalisierung hinzu. Etwa ab 2007 mit der Vorstellung des Apple iPhones, verstärkt jedoch erst um die 2010er Jahre durchdrangen mobile Endgeräte und deren mobile Anwendungen als weitere Befähiger und Game-Changer der Digitalisierung den Markt, womit auch Gaming-Plattformen sich wandelten und digitale Bezahlsysteme etabliert werden konnten. Zeitlich darauf folgten die Trends Big Data, Blockchain, Kryptowährungen, Künstliche Intelligenz, aber auch eher hardware-orientierte Themen wie halb-autonom fahrende, schwimmende oder fliegende Drohnen bis heute als nächste Evolutionsschritte der Digitalisierung.

Dieses Alter der Digitalisierung sowie der anhaltende Trend zur weiteren Durchdringung und neuen Facetten zeigen jedoch auch die Beständigkeit der Digitalisierung als Form des permanenten Wandels und dem Data Driven Thinking. Denn heute bestreben Unternehmen auch Mikroprozesse zu digitalisieren und diese besser mit der Welt interagieren zu lassen. Die Digitalisierung ist demzufolge bereits ein Prozess, der seit Jahrzehnten läuft, bis heute anhält und nur hinsichtlich der Umsetzungsschwerpunkte über die Jahre Verschiebungen erfährt – Daher darf dieser Digitalisierungsprozess keinesfalls aus dem Auge verloren werden. Digitalisierung ist kein Selbstzweck, sondern ein Innovationsprozess zur Erhaltung der Wettbewerbsfähigkeit am Markt.

Digital ist nicht Data, aber Data ist die Konsequenz aus Digital

Trotz der längst erreichten Etablierung des CDOs als wichtige Position im Unternehmen, gilt der Job des CDOs selbst heute noch als recht neu. Zudem hatte die Position des CDOs keinen guten Start, denn hinsichtlich der Zuständigkeit konkurriert der CDO nicht nur sowieso schon mit dem CIO oder CTO, er macht sich sogar selbst Konkurrenz, denn er ist namentlich doppelbesetzt: Neben dem Chief Digital Officer gibt es ebenso auch den noch etwas weniger verbreiteten Chief Data Officer. Doch spielt dieser kleine namentliche Unterschied eine Rolle? Ist beides nicht doch das gemeinsame Gleiche?

Die Antwort darauf lautet ja und nein. Der CDO befasst sich mit den zuvor bereits genannten Themen der Digitalisierung, wie mobile Anwendungen, Blockchain, Internet of Thing und Cyber Physical Systems bzw. deren Ausprägungen als vernetze Endgeräte entsprechend der Konzepte wie Industrie 4.0, Smart Home, Smart Grid, Smart Car und vielen mehr. Die einzelnen Bausteine dieser Konzepte generieren Daten, sind selbst jedoch Teilnehmer der Digitalisierungsevolution. Diese Teilnehmer aus Hardware und Software generieren über ihren Einsatz Daten, die wiederum in Datenbanken gespeichert werden können, bis hin zu großen Volumen aus heterogenen Datenquellen, die gelegentlich bis nahezu in Echtzeit aktualisiert werden (Big Data). Diese Daten können dann einmalig, wiederholt oder gar in nahezu Echtzeit automatisch analysiert werden (Data Science, KI) und die daraus entstehenden Einblicke und Erkenntnisse wiederum in die Verbesserung der digitalen Prozesse und Produkte fließen.

Folglich befassen sich Chief Digital Officer und Chief Data Officer grundsätzlich im Kern mit unterschiedlichen Themen. Während der Chief Digital Officer sich um die Hardware- und Software im Kontext zeitgemäßer Digitalisierungsvorhaben und deren organisatorische Einordnung befasst, tut dies der Chief Data Officer vor allem im Kontext der Speicherung und Analyse von Daten sowie der Data Governance.

Treffen werden sich Digital und Data jedoch immer wieder im Kreislauf der kontinuierlichen Verbesserung von Produkt und Prozess, insbesondere bei der Gestaltung und Analyse der Digital Journey für Mitarbeiter, Kunden und Partnern und Plattform-Entscheidungen wie etwas Cloud-Systeme.

Oftmals differenzieren Unternehmen jedoch gar nicht so genau und betrachten diese Position als Verantwortliche für sowohl Digital als auch für Data und nennen diese Position entweder nach dem einen oder nach dem anderen – jedoch mit Zuständigkeiten für beides. In der Tat verfügen heute nur sehr wenige Unternehmen über beide Rollen, sondern haben einen einzigen CDO. Für die meisten Anwender klingt das trendige Digital allerdings deutlich ansprechender als das nüchterne Data, so dass die Namensgebung der Position eher zum Chief Digital Officer tendieren mag. Nichtsdestotrotz sind Digital-Themen von den Data-Themen recht gut zu trennen und sind strategisch unterschiedlich einzuordnen. Daher benötigen Unternehmen nicht nur eine Digital-, sondern ebenso eine Datenstrategie – Doch wie bereits angedeutet, können CDOs beide Rollen übernehmen und sich für beide Strategien verantwortlich fühlen.

Die gemeinsame Verantwortung von Digital und Data kann sogar als vorteilhafte Nebenwirkung besonders konsistente Entscheidungen ermöglichen und so typische Digital-Themen wie Blockchain oder RPA mit typischen Data-Themen wie Audit-Datenanalysen oder Process Mining verbinden. Oder der Dokumenten-Digitalisierung und -Verwaltung in der kombinierten Betrachtung mit Visual Computing (Deep Learning zur Bilderkennung).

Vielfältige Kompetenzen und Verantwortlichkeiten eines CDOs

Chief Digital Officer befassen sich mit Innovationsthemen und setzen sie für ihr Unternehmen um. Sie sind folglich auch Change Manager. CDOs dürfen keinesfalls bequeme Schönwetter-Manager sein, sondern müssen den Wandel im Unternehmen vorantreiben, Hemmnissen entgegenstehen und bestehende Prozesse und Produkte hinterfragen. Die Schaffung und Nutzung von digitalen Produkten und Prozessen im eigenen Unternehmen sowie auch bei Kunden und Lieferanten generiert wiederum Daten in Massen. Der Kreislauf zwischen Digital und Data treibt einen permanenten Wandel an, den der CDO für das Unternehmen positiv nutzbar machen muss und dabei immer neue Karriereperspektiven für sich und seine Mitarbeiter schaffen kann.

Zugegeben sind das keine guten Nachrichten für Mitarbeiter, die auf Beständigkeit setzen. Die Iterationen des digitalen Wandels zirkulieren immer schneller und stellen Ingenieure, Software-Entwickler, Data Scientists und andere Technologieverantwortliche vor den Herausforderungen des permanenten und voraussichtlich lebenslangen Lernens. Umso mehr muss ein CDO hier lernbereit und dennoch standhaft bleiben, denn Gründe für den Aufschub von Veränderungen findet im Zweifel jede Belegschaft.

Ein CDO mit umfassender Verantwortung lässt auch das Thema der Datennutzung nicht aus und versteht Architekturen für Business Intelligence und Machine Learning. Um seiner Personalverantwortung gerecht zu werden, muss er sich mit diesen Themen auskennen und mit Experten für Digital und Data auf Augenhöhe sprechen können. Jeder CD sollte wissen, was zum Beispiel ein Data Engineer oder Data Scientist können muss, wie Business-Experten zu verstehen und Vorstände zu überzeugen sind – Denn als Innovator, Antreiber und Wandler fürchten gute CDOs nichts außer den Stillstand.

CRISP-DM methodology in technical view

On this paper discuss about CRISP-DM (Cross Industry Standard Process for data mining) methodology and its steps including selecting technique to successful the data mining process. Before going to CRISP-DM it is better to understand what data mining is? So, here first I introduce the data mining and then discuss about CRISP-DM and its steps for any beginner (data scientist) need to know.

1 Data Mining

Data mining is an exploratory analysis where has no idea about interesting outcome (Kantardzic, 2003). So data mining is a process to explore by analysis a large set of data to discover meaningful information which help the business to take a proper decision. For better business decision data mining is a way to select feature, correlation, and interesting patterns from large dataset (Fu, 1997; SPSS White Paper, 1999).

Data mining is a step by step process to discover knowledge from data. Pre-processing data is vital part for a data mining. In pre-process remove noisy data, combining multiple sources of data, retrieve relevant feature and transforming data for analysis. After pre-process mining algorithm applied to extract data pattern so data mining is a step by step process and applied algorithm to find meaning full data pattern. Actually data mining is not only conventional analysis it is more than that (Read, 1999).

Data mining and statistics closely related. Main goal of data mining and statistic is find the structure of data because data mining is a part of statistics (Hand, 1999). However, data mining use tools, techniques, database, machine learning which not part of statistics but data mining use statistics algorithm to find a pattern or discover hidden decision.

Data mining objective could be prediction or description. On prediction data mining considering several features of dataset to predict unidentified future, on the other hand description involve identifying pattern of data to interpreted (Kantardzic, 2003).

From figure 1.1 shows data mining is the only one part of getting unknown information from data but it is the central process of whole process. Before data mining there are several processes need to be done like collecting data from several sources than integrated data and keep in data storage. Stored unprocessed data evaluated and selected with pre-processed activity to give a standard format than data mining algorithm to analysis for hidden pattern.

Data Mining Process

2 CRISP-DM Methodologies

Cross Industry Standard Process for data mining (CRISP-DM) is most popular and widely uses data mining methodology. CRISP-DM breaks down the data mining project life cycle into six phases and each phase consists of many second-level generic tasks. Generic task cover all possible data mining application. CRISP-DM extends KDD (Knowledge Discovery and Data Mining) into six steps which are sequence of data mining application (Martínez-Plumed 2019).

Data science and data mining project extract meaningful information from data. Data science is an art where a lot of time need to spend for understanding the business value and data before applying any algorithm then evaluate and deployed a project. CRISP-DM help any data science and data mining project from start to end by giving step by step process.

Present world every day billions of data are generating. So organisations are struggling with overwhelmed data to process and find a business goal. Comprehensive data mining methodology, CRISP-DM help business to achieve desirable goal by analysing data.

CRISP-DM (Cross Industry Standard Process for Data Mining) is well documented, freely available, data mining methodology. CRISP-DM is developed by more than 200 data mining users and many mining tool and service providers funded by European Union. CRISP-DM encourages organization for best practice and provides a structure of data mining to get better, faster result.

CRISP-DM is a step by step methodology. Figure-2.1 show the phases of CRISP-DM and process of data mining. Here one side arrow indicates the dependency between phases and double side arrow represents repeatable process. Six phases of CRISP-DM are Business understanding, Data understanding, Modelling, Evaluation and Deployment.


2.1 Business Understanding

Business Understanding or domain understanding is the first step of CRISP-DM methodology. On this stage identify the area of business which is going to transform into meaningful information by analysing, processing and implementing several algorithms. Business understanding identifies the available resource (human and hardware), problems and set a goal. Identification of business objective should be agreed with project sponsors and other unit of business which will be affected. This step also focuses about details business success criteria, requirements, constraints, risk, project plan and timeline.

2.2 Data Understanding

Data understanding is the second and closely related with the business understanding phase. This phase mainly focus on data collection and proceeds to get familiar with the data and also detect interesting subset from data. Data understanding has four subsets these are:-

2.2.1 Initial data collection

On this subset considering the data collection sources which is mainly divided into two categories like outsource data or internal source data.  If data is from outsource then it may costly, time consuming and may be low quality but if data is collected form internal source it is an easy and less costly, but it may be contain irrelevant data. If internal source data does not fulfil the interest of analysis than it is necessary to move outsource data. Data collection also give an assumption that the data is quantitative (continuous, count) or qualitative (categorical).  It also gives information about balance or imbalanced dataset.  On data collection should avoid random error, systematic error, exclusion errors, and errors of choosing.

2.2.2 Data Description

Data description performs initial analysis about data. On this stage it is going to determine about the source of data like RDBMS, SQL, NoSQL, Big data etc. then analysis and describe the data about size (large data set give more accurate result but time consuming), number of records, tables, database, variables, and data types (numeric, categorical or Boolean). On this phase examine the accessibility and availability of attributes.

2.2.3 Exploratory data analysis (EDA)

On exploratory data analysis describe the inferential statistics, descriptive statistics and graphical representation of data. Inferential statistics summarize the entire population from the sample data to perform sampling and hypothesis testing. On Parametric hypothesis testing  (Null or alternate – ANOVA, t-test, chi square test) perform for known distribution (based on population) like mean, variance, standard deviation, proportion and Non-parametric hypothesis testing perform when distribution is unknown or sample size is small. On sample dataset, random sampling implement when dataset is balance but for imbalance dataset should be follow random resampling (under  and over sampling), k fold cross validation, SMOTE (synthetic minority oversampling technique), cluster base sampling, ensemble techniques (bagging and boosting – Add boost, Gradient Tree Boosting, XG Boost) to form a balance dataset.

On descriptive statistics analysis describe about the mean, median, mode for measures of central tendency on first moment business decision. On second moment business decision describe the measure of dispersion about the variance, standard deviation and range of data.  On third and fourth moment business decision describe accordingly skewness (Positive skewness – heavier tail to the right, negative skewness – heavier tail to the left, Zero skewness – symmetric distribution) and Kurtosis (Leptokurtosis – heavy tail, platykurtosis – light tail, mesokurtic – normal distribution).

Graphical representation is divided into univariate, bivariate and multivariate analysis. Under univariate whisker plot, histogram identify the outliers and shape of distribution of data and Q-Q plot (Quantile – Quantile) plot describe the normality of data that means data is normally distribution or not.  On whisker plot if data present above of Q3 + 1.5 (IQR) and below of Q1 – 1.5 (IQR) is outlier. For Bivariate correlations identify with scatter plot which describe positive, negative or no correlation and also identify the data linearity or non-linearity. Scatter plot also describe the clusters and outliers of data.  For multivariate has no graphical analysis but used to use regression analysis, ANOVA, Hypothesis analysis.

2.2.4 Data Quality analysis

This phase identified and describes the potential errors like outliers, missing data, level of granularity, validation, reliability, bad metadata and inconsistency.  On this phase AAA (attribute agreement analysis) analysed discrete data for data error. Continuous data analysed with Gage repeatability and reproducibility (Gage R & R) which follow SOP (standard operating procedures). Here Gage R & R define the aggregation of variation in the measurement data because of the measurement system.

2.3 Data Preparation

Data Preparation is the time consuming stage for every data science project. Overall on every data science project 60% to 70% time spend on data preparation stage. Data preparation stapes are described below.

2.3.1 Data integration

Data integration involved to integrate or merged multiple dataset. Integration integrates data from different dataset where same attribute or same columns presents but when there is different attribute then merging the both dataset.

2.3.2 Data Wrangling

On this subset data are going to clean, curate and prepare for next level. Here analysis the outlier and treatment done with 3 R technique (Rectify, Remove, Retain) and for special cases if there are lots of outliner then need to treat outlier separately (upper outliner in an one dataset and lower outliner in another dataset) and alpha (significant value) trim technique use to separate the outliner from the original dataset. If dataset has a missing data then need to use imputation technique like mean, median, mode, regression, KNN etc.

If dataset is not normal or has a collinearity problem or autocorrelation then need to implement transformation techniques like log, exponential, sort, Reciprocal, Box-cox etc. On this subset use the data normalization (data –means/standard deviation) or standardization (min- max scaler) technique to make unitless and scale free data. This step also help if data required converting into categorical then need to use discretization or binning or grouping technique. For factor variable (where has limited set of values), dummy variable creation technique need to apply like one hot encoding.  On this subset also help heterogeneous data to transform into homogenous with clustering technique. Data inconsistencies also handle the inconsistence of data to make data in a single scale.

2.3.3 Feature engineering and selection/reduction

Feature engineering may called as attribute generation or feature extraction. Feature extraction creating new feature by reducing original feature to make simplex model. Feature engineering also do the normalized feature by producing calculative new feature. So feature engineering is a data pre-process technique where improve data quality by cleaning, integration, reduction, transformation and scaling.

Feature selections reduce the multicollinearity or high correlated data and make model simple. Main two type of feature selection technique are supervised and unsupervised. Principal Components Analysis (PCA) is an unsupervised feature reduction/ feature selection technique and LDA is a Linear Discriminant analysis supervised technique mainly use for classification problem. LDA analyse by comparing mean of the variables. Supervised technique is three types filter, wrapper and ensemble method. Filter method is easy to implement but wrapper is costly method and ensemble use inside a model.

2.4 Model

2.4.1 Model Selection Technique

Model selection techniques are influence by accuracy and performance.  Because recommendation need better performance but banking fraud detection needs better accuracy technique.  Model is mainly subdivided into two category supervised learning where predict an output variable according to given an input variable and unsupervised learning where has not output variable.

On supervised learning if an output variable is categorical than it is classification problem like two classes or multiclass classification problem. If an output variable is continuous (numerical) then the problem is called prediction problem. If need to recommending according to relevant information is called recommendation problem or if need to retrieve data according to relevance data is called retrieval problem.

On unsupervised learning where target or output variable is not present. On this technique all variable is treated as an input variable. Unsupervised learning also called clustering problem where clustering the dataset for future decision.

Reinforcement learning agent solves the problem by getting reward for success and penalty for any failure. And semi-supervised learning is a process to solve the problem by combining supervised and unsupervised learning method. On semi-supervised, a problem solved by apply unsupervised clustering technique then for each cluster apply different type of supervised machine learning algorithm like linear algorithm, neural network, K nearest  neighbour etc.

On data mining model selection technique, where output variable is known, then need to implement supervised learning.  Regression is the first choice where interpretation of parameter is important. If response variable is continuous then linear regression or if response variable is discrete with 2 categories value then logistic regression or if response variable is discrete with more than 2 categorical values then multinomial or ordinal regression or if response variable is count then poission where mean is equal to variance or negative binomial regression where variance is grater then mean or if response variable contain excessive zero values then need to choose Zero inflated poission (ZIP) or Zero inflated negative binomial (ZINB).

On supervised technique except regression technique all other technique can be used for both continuous or categorical response variable like KNN (K-Nearest Neighbour),  Naïve Bays, Black box techniques (Neural network, Support vector machine), Ensemble Techniques (Stacking, Bagging like random forest, Boosting like Decision tree, Gradient boosting, XGB, Adaboost).

When response variable is unknown then need to implement unsupervised learning. Unsupervised learning for row reduction is K-Means, Hierarchical etc., for columns reduction or dimension reduction PCA (principal component analysis), LDA (Linear Discriminant analysis), SVD (singular value decomposition) etc. On market basket analysis or association rules where measure are support and confidence then lift ration to determine which rules is important. There are recommendation systems, text analysis and NLP (Natural language processing) also unsupervised learning technique.

For time series need to select forecasting technique. Where forecasting may model based or data based. For Trend under model based need to use linear, exponential, quadratic techniques. And for seasonality need to use additive, multiplicative techniques. On data base approaches used auto regressive, moving average, last sample, exponential smoothing (e.g. SES – simple exponential smoothing, double exponential smoothing, and winters method).

2.4.2 Model building

After selection model according to model criterion model is need to be build. On model building provided data is subdivided with training, validation and testing.  But sometime data is subdivided just training and testing where information may leak from testing data to training data and cause an overfitting problem. So training dataset should be divided into training and validation whereas training model is tested with validation data and if need any tuning to do according to feedback from validation dataset. If accuracy is acceptable and error is reasonable then combine the training and validation data and build the model and test it on unknown testing dataset. If the training error and testing error is minimal or reasonable then the model is right fit or if the training error is low and testing error is high then model is over fitted (Variance) or if training error is high and testing error is also high then model is under fitted (bias). When model is over fitted then need to implement regularization technique (e.g. linear – lasso, ridge regression, Decision tree – pre-pruning, post-pruning, Knn – K value, Naïve Bays – Laplace, Neural network – dropout, drop connect, batch normalization, SVM –  kernel trick)

When data is balance then split the data training, validation and testing and here training is larger dataset then validation and testing. If data set is imbalance then need to use random resampling (over and under) by artificially increases training dataset. On random resampling by randomly partitioning data and for each partition implement the model and taking the average of accuracy. Under K fold cross validation creating K times cross dataset and creating model for every dataset and validate, after validation taking the average of accuracy of all model. There is more technique for imbalance dataset like SMOTH (synthetic minority oversampling technique), cluster based sampling, ensemble techniques e.g. Bagging, Boosting (Ada Boost, XGBoost).

2.4.3 Model evaluation and Tuning

On this stage model evaluate according to errors and accuracy and tune the error and accuracy for acceptable manner. For continuous outcome variable there are several way to measure the error like mean error, mean absolute deviation, Mean squared error, Root mean squared error, Mean percentage error and Mean absolute percentage error but more acceptable way is Mean absolute percentage error. For this continuous data if error is known then it is easy to find out the accuracy because accuracy and error combining value is one. The error function also called cost function or loss function.

For discrete output variable model, for evaluation and tuning need to use confusion matrix or cross table. From confusion matrix, by measuring accuracy, error, precision, sensitivity, specificity, F1 help to take decision about model fitness. ROC curve (Receiver operating characteristic curve), AUC curve (Area under the ROC curve) also evaluate the discrete output variable. AUC and ROC curve plot of sensitivity (true positive rate) vs 1-specificity (false positive rate).  Here sensitivity is a positive recall and  recall is basically out of all positive samples, how sample classifier able to identify. Specificity is negative recall here recall is out of all negative samples, how many sample classifier able to identify.  On AUC where more the area under the ROC is represent better accuracy. On ROC were step bend it’s indicate the cut off value.

2.4.4 Model Assessment

There is several ways to assess the model. First it is need to verify model performance and success according to desire achievement. It needs to identify the implemented model result according to accuracy where accuracy is repeatable and reproducible. It is also need to identify that the model is scalable, maintainable, robust and easy to deploy. On assessment identify that the model evaluation about satisfactory results (identify the precision, recall, sensitivity are balance) and meet business requirements.

2.5 Evaluation

On evaluation steps, all models which are built with same dataset, given a rank to find out the best model by assessing model quality of result and simplicity of algorithm and also cost of deployment. Evaluation part contains the data sufficiency report according to model result and also contain suggestion, feedback and recommendation from solutions team and SMEs (Subject matter experts) and record all these under OPA (organizational process assets).

2.6 Deployment

Deployment process needs to monitor under PEST (political economical social technological) changes within the organization and outside of the organization. PEST is similar to SWOT (strength weakness opportunity and thread) where SW represents the changes of internal and OT represents external changes.

On this deployment steps model should be seamless (like same environment, same result etc.) from development to production. Deployment plan contain the details of human resources, hardware, software requirements. Deployment plan also contain maintenance and monitoring plan by checking the model result and validity and if required then implement retire, replace and update plan.

3 Summaries

CRISP-DM implementation is costly and time consuming. But CRISP-DM methodology is an umbrella for data mining process. CRISP-DM has six phases, Business understanding, Data understanding, Modelling, Evaluation and Deployment. Every phase has several individual criteria, standard and process. CRISP-DM is Guideline for data mining process so if CRISP-DM is going to implement in any project it is necessary to follow each and every single guideline and maintain standard and criteria to get required result.

4 References

  1. Fu, Y., (1997), “Data Mining: Tasks, Techniques and Applications”, Potentials, IEEE, 16: 4, 18–20.
  2. Hand, D. J., (1999), “Statistics and Data Mining: Intersecting Disciplines”, ACM SIGKDD Explorations Newsletter, 1: 1, 16 – 19.
  3. Kantardzic, M., (2003), “Data Mining: Concepts, Models, Methods, and Algorithms” John Wiley and Sons, Inc., Hoboken, New Jersey
  4. Martínez-Plumed, F., Contreras-Ochando, L., Ferri, C., Orallo, J.H., Kull, M., Lachiche, N., Quintana, M.J.R. and Flach, P.A., 2019. CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Transactions on Knowledge and Data Engineering.
  5. Read, B.J., (1999), “Data Mining and Science? Knowledge discovery in science as opposed to business”, 12th ERCIM Workshop on Database Research.