On the difficulty of language: prerequisites for NLP with deep learning

This is the first article of my article series “Instructions on Transformer for people outside NLP field, but with examples of NLP.”

1 Preface

This section is virtually just my essay on language. You can skip this if you want to get down to the more technical topics.

As I do not study in the natural language processing (NLP) field, I will not be able to provide deep insight into this fast-changing deep learning field throughout my article series. However, at least I do understand that language is a difficult and profound field, not only in engineering but also in many other fields of study. Some people might feel that technologies are eliminating languages, or one’s motivation to understand other cultures. First of all, I would like you to keep in mind that I am not a geek who is trying to turn this multilingual world into a homogeneous one and rebuild the Tower of Babel with deep learning. I would say I am more keen on the social or anthropological sides of language.

I think you think more about languages once you have mastered at least one foreign language. As my mother tongue is Japanese, which is totally different from many Western languages in terms of characters and ambiguity, I understand that translating is not what learning a language is all about. Each language has unique characteristics, and I believe they more or less influence one’s personality. For example, many Western languages make the verb, I mean the conclusion, of a sentence clear in the beginning part of the sentence. That is also true of Chinese, I heard. However, in Japanese the conclusion comes at the end, which is likely to give the impression that Japanese people are being obscure or indecisive. Also, Japanese sentences usually omit their subjects. In German as well, the conclusion of a sentence tends to come at the end, but I am almost 100% sure that no Japanese person would feel that German people make things unclear. I think that comes from the structure of the German language, which tends to make the number, verb, and relations of words crystal clear.

Let’s take an example to see how obscure Japanese is. A Japanese sentence 「頭が赤い魚を食べる猫」can be interpreted in five ways, depending on where you put the emphasis.

Common sense tells you that the sentence is likely to mean one of the first two cases, but I am sure it can mean any of those five possibilities. There might be similarly obscure sentences in other languages, but I bet few languages can be as obscure as Japanese. Also, as you can see from the last two interpretations, you can omit subjects in Japanese. This rule is nothing exceptional. Japanese people usually don’t use subjects in normal conversations. And when you read classical Japanese, which Japanese high school students have to do just like Western students learn some classical Latin, the writings omit subjects much more frequently.

*Interestingly, however, we have a rich vocabulary of subjects. The subject “I” can be translated as 「私」、「僕」、「俺」、「自分」、「うち」etc., depending on your personality, who you are talking to, and the period in which the text was written.

I believe one can see the world only in the framework of their language, and it seems one’s personality changes depending on the language they use. I am not sure whether the language originally determines how they think, or how they think forms the language. But at least I would like you to keep in mind that if you translate a conversation, for example a random conversation at a bar in Berlin, into Japanese, it would sound Japanese linguistically, but not anthropologically. Imagine that such a random conversation in Berlin is like playing catch, I mean throwing a ball named “your opinion.” On the other hand, normal conversations of Japanese people are instead more of, I would say, a “resonance” of several tuning forks. They do their best to show that they are listening to each other, by excessively nodding or just repeating “Really?”, but usually it seems hardly any constructive dialogue is made.

*I sometimes feel you do not even need deep learning to simulate most of such Japanese conversations. A few lines of Python code would be enough.

My point is, this article series is mainly going to cover only a few techniques of NLP in the deep learning field: the sequence to sequence model (seq2seq model), and especially Transformer. They are, at least for now, just mathematical models and mappings of a small part of this profound field of language (as far as I can cover in this article series). But still, examples of language will definitely help you understand the Transformer model in the long run.

2 Tokens and word embedding

*Throughout my article series, “words” just means the normal words you use in daily life. “Tokens” means a more general unit of NLP tasks. For example the word “Transformer” might be denoted as a single token “Transformer,” or maybe as a combination of two tokens “Trans” and “former.”

One challenging part of handling language data is its encoding. If you started learning programming in a language other than English, you would have encountered some trouble with keyboards with different layouts or characters. Comments written in your native language in your code are sometimes not readable in some software. You can easily get away with that by using only English, but when it comes to NLP you have to deal with this difficulty seriously. How to encode the characters of each language is a first obstacle of NLP. In this article we are going to rely on a library named BPEmb, which provides word embeddings in various languages, and with this library you do not have to care so much about the encodings of languages all over the world.

In the first section, you might have noticed that Japanese sentences are not separated with spaces as in Western languages. This is also true of Chinese, and that means we need an additional task of separating those sentences at least into proper chunks of words. This is not only a matter of engineering, but also of some linguistic fields. Also, I think many people are not so conscious of how sentences in their native languages are grammatically separated.

The next point is, unlike other scientific data, such as temperature, velocity, voltage, or air pressure, language itself is not measured as numerical data. Thus in order to process language, including English, you first have to map language to certain numerical data, and after some processing you need to conversely map the output numerical data back into language data. This section is going to be mainly about one-hot encoding and word embedding, the ways to convert words/tokens into numerical data.

You might have learnt about word embedding to some extent, but I hope you could get richer insight into this topic through this article.

2.1 One-hot encoding

One-hot encoding would be the most straightforward way to encode words/tokens. Assume that you have a dictionary whose size is |\mathcal{V}|, and it includes words from “a”, “ablation”, “actually” to “zombie”, “?”, “!”.

In a mathematical manner, in order to choose a word out of those |\mathcal{V}| words, all you need is a |\mathcal{V}| dimensional vector, one of whose elements is 1, and the others are 0. When you want to choose the No. i word, which is “indeed” in the example below, its corresponding one-hot vector is \boldsymbol{v} = (0, \dots, 1, \dots, 0 ), where only the No. i element is 1. One-hot encoding is easy to understand, and that’s all. It is easy to imagine that people have already come up with more complicated and better ways to encode words. And one major way to do that is word embedding.
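The definition above can be sketched in a few lines of NumPy. The tiny vocabulary and the helper name `one_hot` are assumptions made only for this illustration; a real vocabulary would be far larger.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration; a real one is far larger).
vocab = ["a", "ablation", "actually", "indeed", "zombie", "?", "!"]
token_to_index = {token: i for i, token in enumerate(vocab)}

def one_hot(token):
    """Return a |V|-dimensional vector whose only non-zero element is a 1 at the token's index."""
    vector = np.zeros(len(vocab), dtype=np.float32)
    vector[token_to_index[token]] = 1.0
    return vector

v = one_hot("indeed")
print(v)
```

Every token gets a vector of the same length |\mathcal{V}|, so the vectors grow with the vocabulary and carry no information about how similar two words are.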

2.2 Word embedding

Source: Francois Chollet, Deep Learning with Python,(2018), Manning

Actually word embedding is related to one-hot encoding, and if you understand how to train a simple neural network, for example densely connected layers, you will understand word embedding easily. The key idea of word embedding is denoting each token with a D dimensional vector, where D is smaller than the vocabulary size |\mathcal{V}|. The elements of the resulting word embedding vector are real values, I mean not only 0 or 1. Obviously you can encode a much richer variety of tokens with such vectors. The figure at the left side is from “Deep Learning with Python” by François Chollet, and I think this is an almost perfect and simple comparison of one-hot encoding and word embedding. But the problem is how to get such convenient vectors. The answer is very simple: you only have to train a network whose inputs are one-hot vectors of the vocabulary.

The figure below is a simplified model of word embedding of a certain word. When the word is input into a neural network, only the corresponding element of the one-hot vector is 1, and that virtually means the very first input layer is composed of one neuron whose value is 1. That single neuron propagates to the next D dimensional embedding layer, and the D weights connecting it to the embedding layer are the very values which most other study materials call “an embedding vector.”

When you input each word into a certain network, for example an RNN or Transformer, you map the input one-hot vector into the embedding layer/vector. The examples in the figure show how inputs are made when the input sentences are “You’ve got the touch” and “You’ve got the power.” Assume that you have a dictionary of one-hot encoding, whose vocabulary is {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}, and the dimension of word embedding is 6. In this case |\mathcal{V}| = 9, D=6. When the inputs are “You’ve got the touch” or “You’ve got the power”, you put the one-hot vectors corresponding to “You’ve”, “got”, “the”, “touch” or “You’ve”, “got”, “the”, “power” sequentially, one at every time step t.

In order to get word embedding of a certain vocabulary, you just need to train the network. We know that the words “actually” and “indeed” are used in similar ways in writing. Thus when we propagate those words into the embedding layer, we can expect that their embedding vectors are similar. This is how we can mathematically get effective word embedding of a certain vocabulary.
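As a minimal sketch of the mapping described above, multiplying a one-hot vector by the weight matrix of the embedding layer simply picks out one row of that matrix. The matrix below is filled with random numbers as a placeholder for trained weights, and the names `embedding_matrix`, `via_matmul` and `via_lookup` are assumptions for this illustration.

```python
import numpy as np

vocab = ["the", "You've", "Walberg", "touch", "power", "Nights", "got", "Mark", "Boogie"]
token_to_index = {token: i for i, token in enumerate(vocab)}
V, D = len(vocab), 6  # |V| = 9, D = 6 as in the text

# Placeholder for trained weights: one D-dimensional row per token.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(V, D))

def one_hot(token):
    vector = np.zeros(V)
    vector[token_to_index[token]] = 1.0
    return vector

sentence = ["You've", "got", "the", "touch"]

# Time step by time step: one-hot vector times the weight matrix.
via_matmul = np.stack([one_hot(token) @ embedding_matrix for token in sentence])

# Equivalent shortcut: look up the rows directly by index.
via_lookup = embedding_matrix[[token_to_index[token] for token in sentence]]

print(via_matmul.shape)                     # one 6-dimensional vector per token
print(np.allclose(via_matmul, via_lookup))  # both routes give the same vectors
```

This equivalence is why embedding layers are usually implemented as a table lookup instead of an actual matrix multiplication.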

More interestingly, if word embedding is properly trained, you can mathematically “calculate” words. For example, \boldsymbol{v}_{king} - \boldsymbol{v}_{man} + \boldsymbol{v}_{woman} \approx \boldsymbol{v}_{queen}, \boldsymbol{v}_{Tokyo} - \boldsymbol{v}_{Japan} + \boldsymbol{v}_{Vietnam} \approx \boldsymbol{v}_{Hanoi}.

*I have tried to demonstrate this type of calculation on several word embeddings, but none of them seemed to work well. At least you should keep in mind that word embedding learns complicated linear relations between words.
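To show only the mechanics of this kind of “calculation,” here is a sketch with hand-made two-dimensional vectors. These numbers are invented for illustration (one axis loosely standing for royalty, the other for gender) and are not trained embeddings, so the result says nothing about how well real embeddings behave.

```python
import numpy as np

# Hand-made toy vectors (NOT trained embeddings): axis 0 ~ royalty, axis 1 ~ gender.
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to v_a - v_b + v_c, excluding a, b and c themselves."""
    target = toy[a] - toy[b] + toy[c]
    candidates = [word for word in toy if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(toy[word], target))

result = analogy("king", "man", "woman")
print(result)
```

With trained embeddings the same nearest-neighbour search runs over the whole vocabulary, which is one reason the results are often less clean than in a toy setting.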

I should explain word embedding techniques such as word2vec in detail, but the main focus of this article is not NLP, so the points I have mentioned are enough to understand Transformer model with NLP examples in the upcoming articles.

 

3 Language model

A language model is one of the most straightforward, but crucial ideas in NLP. This is also a big topic, so this article is going to cover only basic points. A language model is a mathematical model of the probabilities of which word comes next, given a context. For example if you have a sentence “In the lecture, he opened a _.”, a language model predicts what comes at the part “_.” It is obvious that this is contextual. If you are talking about general university students, “_” would be “textbook,” but if you are talking about Japanese universities, especially in liberal arts departments, “_” would be more likely to be “smartphone.” I think most of you use language models every day. When you type something on your computer or smartphone, you constantly see text predictions, or they might even correct your spelling or grammatical errors. This is language modelling. You can make language models in several ways, such as n-gram and neural language models, but in this article I can explain only general formulations for such models.

*I am not sure which algorithm is used in which services. That must be too fast changing and competitive for me to catch up.

As I mentioned in the first article series on RNN, a sentence is usually processed as sequence data in NLP. One single sentence is denoted as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}), a list of vectors. The vectors are usually embedding vectors, and the (t) is the index of the order of tokens. For example the sentence “You’ve got the power” can be expressed as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}), where \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)} denote “You’ve”, “got”, “the”, “power” respectively. In this case \tau = 4.

In practice a sentence \boldsymbol{X} usually includes two tokens BOS and EOS at the beginning and the end of the sentence. They mean “Beginning Of Sentence” and “End Of Sentence” respectively. Thus in many cases \boldsymbol{X} = (\boldsymbol{BOS} , \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS} ). \boldsymbol{BOS} and \boldsymbol{EOS} are also both vectors, at least in the Tensorflow tutorial.

P(\boldsymbol{X}) = P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) is the probability of incidence of the sentence. But it is easy to imagine that it would be very hard to directly calculate how likely the sentence \boldsymbol{X} is to appear out of all possible sentences. I would rather say it is impossible. Thus in NLP we instead calculate the probability P(\boldsymbol{X}) as a product of the probabilities of incidence of each word, given all the words so far. When you’ve got the words (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1)}) so far, the probability of the incidence of \boldsymbol{x}^{(t)}, given the context, is P(\boldsymbol{x}^{(t)}|\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1)}). P(\boldsymbol{BOS}) is the probability of the sentence \boldsymbol{X} being (\boldsymbol{BOS}), and the probability of \boldsymbol{X} being (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) can be decomposed this way: P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS})P(\boldsymbol{BOS}).

In the same way, P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) = P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P( \boldsymbol{BOS}, \boldsymbol{x}^{(1)})= P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P( \boldsymbol{BOS}).

Hence, the general probability of incidence of a sentence \boldsymbol{X} is P(\boldsymbol{X})=P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(\tau -1)}, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) = P(\boldsymbol{EOS}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}) P(\boldsymbol{x}^{(\tau)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau - 1)}) \cdots P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P(\boldsymbol{BOS}).

Let \boldsymbol{x}^{(0)} be \boldsymbol{BOS} and \boldsymbol{x}^{(\tau + 1)} be \boldsymbol{EOS}. Plus, let P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]}) be P(\boldsymbol{x}^{(t+1)}|\boldsymbol{x}^{(0)}, \dots, \boldsymbol{x}^{(t)}), then P(\boldsymbol{X}) = P(\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. In this way language models sequentially calculate which word comes next.
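The product above can be made concrete with a tiny count-based model. Note the simplifying assumptions in this sketch: it is a bigram model, so each conditional probability looks only at the previous token instead of the whole context \boldsymbol{X}_{[0, t]}, P(\boldsymbol{BOS}) is taken to be 1, and the two-sentence corpus and the function names are made up for illustration.

```python
from collections import Counter

# Tiny made-up training corpus, with BOS/EOS already attached.
corpus = [
    ["BOS", "You've", "got", "the", "touch", "EOS"],
    ["BOS", "You've", "got", "the", "power", "EOS"],
]

# Count how often each token follows each previous token.
bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        bigram_counts[(prev, nxt)] += 1
        context_counts[prev] += 1

def next_token_prob(prev, nxt):
    """Estimate P(next | previous) by relative frequency (no smoothing)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, nxt)] / context_counts[prev]

def sentence_prob(sentence):
    """Chain rule with P(BOS) = 1: multiply the conditional probabilities."""
    prob = 1.0
    for prev, nxt in zip(sentence, sentence[1:]):
        prob *= next_token_prob(prev, nxt)
    return prob

p_touch = sentence_prob(["BOS", "You've", "got", "the", "touch", "EOS"])
print(p_touch)
```

In this corpus the only uncertain step is the word after “the,” which is “touch” in one sentence and “power” in the other, so that single factor decides the probability of the whole sentence.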

Here’s a question: how would you evaluate a language model?

I would say the answer is, when the language model generates words, the more confident the language model is, the better the language model is. Given a context, when the distribution of the next word is concentrated on a certain word, we can say the language model is confident about which word comes next.

*For some people, it would be more understandable to call this “entropy.”

Let’s take the vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”} as an example. Assume that P(\boldsymbol{X}) = P(\boldsymbol{BOS}, \boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}, \boldsymbol{EOS}) = P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}, \boldsymbol{EOS})= P(\boldsymbol{x}^{(0)})\prod_{t=0}^{4}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Given a context (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}), the probability of incidence of \boldsymbol{x}^{(2)} is P(\boldsymbol{x}^{(2)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}). In the figure below, the distribution at the left side is less confident because the probabilities spread widely over the vocabulary; on the other hand the one at the right side is more confident that the next word is “got” because the distribution concentrates on “got”.

*You have to keep in mind that the probabilities P(\boldsymbol{x}^{(2)} | \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) over all possible next words sum to 1, that is, P(\boldsymbol{the}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + P(\boldsymbol{You've}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + \cdots + P(\boldsymbol{Boogie}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = 1.

While the language model is generating the sentence “BOS You’ve got the touch EOS”, it is better if the language model keeps being confident. If it is confident, P(\boldsymbol{X})= P(\boldsymbol{BOS}) P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}) P(\boldsymbol{x}^{(2)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}) P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}) gets higher. Thus (-1) \{ \log_{b}{P(\boldsymbol{BOS})} + \log_{b}{P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS})} + \log_{b}{P(\boldsymbol{x}^{(2)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)})} + \log_{b}{P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)})} + \log_{b}{P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)})} + \log_{b}{P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} \} gets lower, where usually b=2 or b=e.
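For readers who prefer the word “entropy,” the difference between a widely spread distribution and a concentrated one can be checked numerically. The two distributions below are made-up numbers over the nine-word vocabulary, chosen only so that each sums to 1; they are not outputs of any trained model.

```python
import math

vocab = ["the", "You've", "Walberg", "touch", "power", "Nights", "got", "Mark", "Boogie"]

# Made-up next-word distributions given the context (BOS, "You've").
spread = {token: 1 / len(vocab) for token in vocab}  # unsure: uniform over 9 words
peaked = {token: 0.01 for token in vocab}            # confident: mass on "got"
peaked["got"] = 0.92

def entropy(distribution, base=2):
    """Shannon entropy; lower values mean a more concentrated distribution."""
    return -sum(p * math.log(p, base) for p in distribution.values() if p > 0)

print(sum(spread.values()), sum(peaked.values()))  # both sum to 1 up to rounding
print(entropy(spread), entropy(peaked))
```

The uniform distribution has the largest possible entropy for nine words, log base 2 of 9, while the peaked one is much lower.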

This is how to measure how confident language models are, and the indicator of the confidence is called perplexity. Assume that you have a data set for evaluation \mathcal{D} = (\boldsymbol{X}_1, \dots, \boldsymbol{X}_n, \dots, \boldsymbol{X}_{|\mathcal{D}|}), which is composed of |\mathcal{D}| sentences in total. Each sentence \boldsymbol{X}_n has \tau^{(n)} tokens in total excluding \boldsymbol{BOS}, \boldsymbol{EOS}, and its probability is P(\boldsymbol{X}_n) = P(\boldsymbol{x}_{n}^{(0)})\prod_{t=0}^{\tau ^{(n)}}{P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})}. Let N = \sum_{n=1}^{|\mathcal{D}|}{(\tau^{(n)} + 1)} be the total number of tokens the language model predicts over \mathcal{D}, counting \boldsymbol{EOS} of each sentence. Then the perplexity of the language model is b^z, where z = \frac{-1}{N}\sum_{n=1}^{|\mathcal{D}|}{\sum_{t=0}^{\tau ^{(n)}}{\log_{b}P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})}}. The b is usually 2 or e. The lower the perplexity is, the more confident the language model is about the evaluation data.

For example, assume that \mathcal{V} is the vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}. Also assume that the evaluation data set for perplexity of a language model is \mathcal{D} = (\boldsymbol{X}_1, \boldsymbol{X}_2), where \boldsymbol{X_1} =(\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}), \boldsymbol{X_2} = (\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the }, \boldsymbol{power}). In this case |\mathcal{D}|=2, \tau^{(1)} = \tau^{(2)} = 4, and N = 10. I have already shown you how to calculate the negative log probability of the sentence “You’ve got the touch” above. You just need to do a similar thing on the other sentence “You’ve got the power”, add the two values (leaving out the P(\boldsymbol{BOS}) terms), divide the sum by N, and raise b to that power to get the perplexity of the language model.
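A minimal sketch of the perplexity calculation follows. It assumes the usual convention of averaging the negative log probabilities over the number of predicted tokens (EOS included, the P(\boldsymbol{BOS}) term left out), and the per-token probabilities are invented numbers standing in for what a model would assign to the two evaluation sentences.

```python
import math

def perplexity(per_sentence_probs, base=2):
    """Compute b**z, where z is the average negative log probability per predicted token.

    per_sentence_probs: one list per sentence, holding the probability the model
    assigned to each token that actually came next (EOS included).
    """
    total_log_prob = 0.0
    num_tokens = 0
    for probs in per_sentence_probs:
        for p in probs:
            total_log_prob += math.log(p, base)
            num_tokens += 1
    z = -total_log_prob / num_tokens
    return base ** z

# Invented probabilities for "You've got the touch EOS" and "You've got the power EOS".
confident_model = [[0.9] * 5, [0.9] * 5]
unsure_model = [[1 / 9] * 5, [1 / 9] * 5]  # uniform over the 9-word vocabulary

print(perplexity(confident_model))
print(perplexity(unsure_model))
```

A model that always spreads its probability uniformly over the nine words gets a perplexity of 9, which is one way to read perplexity: the effective number of words the model is choosing between at each step.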

*If the network is not properly trained, it can also be confident about wrong outputs. Keep in mind that perplexity is calculated with the probabilities the model assigns to the tokens actually appearing in the evaluation data, so a network which is “confident” about wrong words gives low probabilities to the correct ones and ends up with high perplexity. Please let me put further details aside, and let’s get down to the Transformer model soon.

Appendix

Let’s see how word embedding is implemented with a very simple example in the official Tensorflow tutorial. It is a simple binary classification task on the IMDb Dataset. The dataset is composed of comments on movies by movie critics, and you only have to classify whether a comment is positive or negative about the movie. For example when you get an input “To be honest, Michael Bay is terrible as an action film maker. You cannot understand what is going on during combat scenes, and his movies rely too much on advertisements. I got a headache when Mark Walberg used a Chinese credit card in Texas. However he is very competent when it comes to humorous scenes. He is very talented as a comedy director, and I have to admit I laughed a lot.“, the neural network has to judge whether the statement is positive or negative.

This network just takes an average of the input embedding vectors and regresses it into a one dimensional value from 0 to 1. The shape of the embedding layer is (8185, 16). Weights of neural networks are usually implemented as matrices, and you can see that each row of the matrix corresponds to the embedding vector of each token.

*It is easy to imagine that this technique is problematic. This network is virtually taking a mean of the input embedding vectors. That could mean if the input sentence includes relatively many tokens with negative meanings, it is inclined to be classified as negative. But for example, if the sentence is “This masterpiece is a dark comedy by Charlie Chaplin which depicted stupidity of the evil tyrant gaining power at the time. It thoroughly mocked Germany at the time as an absurd group of fanatics, but such propaganda could have never been made until ‘Casablanca.'” , this can be classified as negative, because only the part “masterpiece” is positive as a token, and there are many more words with negative meanings themselves.

The official Tensorflow tutorial provides visualization of word embedding with Embedding Projector, but I would like you to take more control over the data by yourself. Please just copy and paste the code below, installing the necessary libraries. You will get a map of the vocabulary used in the text classification task. It seems you cannot find a clear tendency in the clusters of the tokens. You can try other dimension reduction methods to get maps of the vocabulary, for example by using Scikit Learn.
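The averaging idea can be sketched in NumPy as follows. This is a simplification and not the tutorial's actual model: the weights are random placeholders instead of trained values, the output is a single linear unit with a sigmoid, and the token ids are arbitrary numbers chosen for illustration.

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM = 8185, 16  # shape of the embedding layer mentioned in the text

rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.1, size=(VOCAB_SIZE, EMBED_DIM))  # untrained placeholder
w = rng.normal(scale=0.1, size=EMBED_DIM)                        # hypothetical output weights
b = 0.0

def predict(token_ids):
    """Average the embedding vectors of a review and squash the result into (0, 1)."""
    vectors = embedding[token_ids]   # (sequence_length, 16): one row per token
    pooled = vectors.mean(axis=0)    # (16,): word order is lost at this step
    logit = pooled @ w + b
    return 1.0 / (1.0 + np.exp(-logit))

score = predict([12, 407, 3051, 9])  # arbitrary token ids
print(score)
```

Because of the mean, any reordering of the same tokens gives exactly the same score.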
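As one hedged illustration of such a dimension reduction, the sketch below projects an embedding matrix onto two dimensions with PCA, implemented through NumPy's singular value decomposition. The random matrix is only a placeholder with the same shape as the embedding layer; with real data you would use the trained weights, and the function name `pca_2d` is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(8185, 16))  # placeholder for trained embedding weights

def pca_2d(matrix):
    """Project the rows of a matrix onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

coords = pca_2d(embedding)
print(coords.shape)  # one 2D point per token
```

The resulting coordinates can be passed to any plotting library as a scatter plot, with each point labelled by its token.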

[References]

[1] “Word embeddings” Tensorflow Core
https://www.tensorflow.org/tutorials/text/word_embeddings

[2] Tsuboi Yuuta, Unno Yuuya, Suzuki Jun, “Machine Learning Professional Series: Natural Language Processing with Deep Learning,” (2017), pp. 43-64, 72-85, 91-94
坪井祐太、海野裕也、鈴木潤 著, 「機械学習プロフェッショナルシリーズ 深層学習による自然言語処理」, (2017), pp. 43-64, 72-85, 191-193

[3] “Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention,” stanfordonline, (2019)
https://www.youtube.com/watch?v=XXtpJxZBa2c

[4] Francois Chollet, Deep Learning with Python, (2018), Manning, pp. 178-185

[5] “2.2. Manifold learning,” scikit-learn
https://scikit-learn.org/stable/modules/manifold.html

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, including grammatical errors, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Top 10 Python Libraries Of All Time

Python is a very popular and renowned language that has replaced several programming languages in the market. Its amazing collection of libraries makes it a convenient programming language for developers.

Python is an ocean of libraries serving an ample number of purposes, and as a developer you must possess sound knowledge of the 10 libraries below. You need to familiarize yourself with them to go on and work on different projects. For data scientists, Python has been a charmer.

Here is a curated list of 10 Python libraries that can help you, along with their significant features, when to use them, and their benefits.

10 Best Python Libraries of All Times

  1. Pandas: Pandas is an open-source library that offers high-performance data analysis and simple data structures. When can you use it? It can be used for data munging and wrangling. If you are looking for quick data visuals, aggregation, manipulation, and reading, then this library is suitable. You can impute missing data, plot the data, and edit data columns. Moreover, for renaming and merging, this tool can do wonders. It is a foundation library, and a data scientist should have in-depth knowledge of Pandas before learning any other library.
  2. TensorFlow: TensorFlow is developed by the Google Brain team. Using this tool, you can visualize any part of the computation graph. It comes with modularity and offers high flexibility in its operations. This library is ideal for running and operating in large-scale systems. It is an open-source platform, so anyone can use it freely. What is the beauty of this library? It comes with an unending list of applications associated with it.
  3. NumPy: NumPy is one of the most popular Python libraries used by developers. It is used by various other libraries for their numerical operations. What is the beauty of NumPy? The array interface is the beauty of NumPy and it is always a highlighted feature. NumPy is interactive and very simple to use. It makes complicated mathematical operations easy to implement. This interface is widely used for representing raw binary streams, sound waves, and images as arrays. If you are looking to implement machine learning, you must possess in-depth knowledge of NumPy.
  4. Keras: Are you looking for a cool Python library? Well, Keras is the coolest machine learning Python library. It runs smoothly on both CPU and GPU. Do you want to know where Keras is used? It is used in popular applications like Uber, Swiggy, Netflix, Square, and Yelp. Keras easily supports fully connected, pooling, convolutional, and recurrent neural networks. For any innovative research, it does fine because it is expressive and flexible. Keras is a completely Python-based framework, which enables easy debugging and exploring. Various large scientific organizations use Keras for innovative research.
  5. Scikit-learn: If your project deals with complex data, it has to be the Scikit-learn Python library. This Python machine learning library is associated with NumPy and SciPy. After various modifications, one of its features, cross-validation, now allows the use of more than one metric. It is used for extracting features from texts and images. It implements various machine learning algorithms. What are its functions? It is used in model selection, classification, clustering, and regression. Various training methods like nearest neighbors and logistic regression have been subject to minor improvements.
  6. PyTorch: PyTorch is a large machine learning library that performs tensor computations with GPU acceleration. Also, it solves complicated application issues that are related to neural networks. It is based on the machine learning library Torch, which is free and open source. PyTorch is new but gaining huge popularity, and it is very much a favorite among developers. Why such popularity? It comes with a hybrid front-end which ensures easy usage and flexibility. It is also used for natural language processing applications. Do you know what the best part is? It has been rivaling TensorFlow in popularity in recent times.
  7. MoviePy: MoviePy is a tool that offers unending functionality related to movies and visuals. It is used for importing, modifying, and exporting various video files. Do you want to add a title to your video or rotate it 90 degrees? Well, MoviePy helps you to do all such tasks related to videos. It is not a tool for manipulating image data like Pillow. In any task related to movies and videos in Python coding, you can no doubt rely on the functionality of MoviePy. It is designed to cover all the aspects of a standard task and can get it done quickly. For any common task associated with videos, there is MoviePy.
  8. Matplotlib: Matplotlib is no doubt a quintessential Python library whose presence can never be forgotten. You can visualize data and create innovative and interesting stories. When can you use it? You can use Matplotlib for embedding different plots into an application, as it provides an object-oriented application programming interface. With this library, you can create almost any type of visualization. Do you want to know what visualizations you can create? You can create a histogram, bar graph, pie chart, area plot, stem plot, and line plot. It also supports legends, grids, and labels.
  9. Tkinter: Tkinter is a library that can help you create a Python application with a graphical user interface. Tkinter is the most common and easy-to-use Python library for developing apps with a GUI. It binds Python to the Tk GUI toolkit, which can be used on any modern operating system. To create a Python GUI, Tkinter is the best way to start quickly.
  10. Plotly: Plotly is an essential graph plotting Python library for developers. Users can import, copy, paste, and export the data that needs to be analyzed and visualized. When can you use it? You can use Plotly to create and display figures and visual images. What is interesting is that it has amazing features for sending data to various cloud servers.

What are the visual charts prepared with Plotly? You can create line, bubble, dot, scatter, and pie charts. One can also construct financial charts, contours, maps, subplots, carpet, radar, and log plots. Do you have anything in your mind which needs to be represented visually? Use Plotly!

Finishing Up

In a nutshell, you have the best python libraries of recent times which contribute hugely to development. If your favorite python library didn’t make it in this list of the top 10 best python libraries, do not take offense.

Python comes with unending library packages, and these 10 are some of its popular and best-used ones. If you are a python developer, these are the best libraries you must have in-depth knowledge of.

Article series: 5 Clean Coding Tips – 5.Put yourself in somebody else’s shoes

This is the fifth article of the series “5 tips for clean coding”, tips to follow as soon as you’ve made the first steps into your coding career. Read the introduction here to find out why it is important to write clean code, if you missed it.

It might be a bit repetitive to bring up how important the readability of the code is, but let’s do it anyway. In the majority of cases you are writing for others, therefore you need to put yourself in their shoes to be able to assess how good the readability of your code is. For you, it all might be obvious because you wrote it. But that does not mean it is easy to read for someone else. If you have a colleague or a friend that has a bit of time for you and is willing to give you feedback, that is great. If, however, you don’t have such a person, having a few imaginary friends might be helpful in this case. It might sound crazy, but don’t close this page just yet. Having a set of imaginary personas at your disposal, to review your work with their eyes, can help you a lot. Imagine that your code met one of those guys. What would they say about it? If you work in a team or collaborate with people, you probably don’t have to imagine them. You’ve met them.

The_PEP8_guy – He has years of experience. He is used to seeing code in a very particular way. He quotes the style guide during lunch. His fingers split lines and indent perfectly before his thoughts even reach a conscious state. He knows that lowercase_with_underscore is for variables, UPPER_CASE_NAMES are for constants and CapitalizedWords are for classes. He will be lost if you do it any other way. What you wrote will not meet his expectations, and he will not understand anything, because he will be too distracted by the messed-up visuals. Depending on his character, he might start either crying or shouting. Read the style guide and follow it. You might be able to please this guy at least a little bit with automatic tools like pylint.
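As a minimal sketch of the three naming styles the_PEP8_guy expects, here is a short snippet. All names in it (MAX_RETRIES, DataLoader, count_retries_left) are invented for this illustration and do not come from any real project.

```python
MAX_RETRIES = 3  # constant: UPPER_CASE_NAMES


class DataLoader:  # class: CapitalizedWords
    """Hypothetical class, only here to show the naming style."""

    def __init__(self, file_path):
        self.file_path = file_path  # attribute: lowercase_with_underscore


def count_retries_left(attempts_made):  # function: lowercase_with_underscore
    retries_left = MAX_RETRIES - attempts_made  # variable: lowercase_with_underscore
    return max(retries_left, 0)
```

Running a checker such as pylint over a file written this way should leave the naming untouched, although what exactly gets reported depends on how the tool is configured.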

The_grieving_widow – Imagine that something happens to you. Let’s say that you get hit by a bus[i]. You leave behind sadness and the_grieving_widow to manage your code, your legacy. Will future generations be able to make use of it, or were you the only one who could understand anything you wrote? That is a bit of an extreme situation, ok. Alternatively, imagine that you go on a 5-week vacation to a silent retreat with a strict no-phone policy (or that is what you tell your colleagues). Will they be able to carry on if they cannot ask you anything about the code? Review your code and the documentation from the perspective of the poor grieving_widow.

The_not_your_domain_guy – He comes from outside the world you are currently in and he just does not understand your jargon. He doesn’t have to know that in data science a feature, a predictor and an x probably mean the same thing. SNR might shout signal-to-noise ratio at you; it will only snort at him. You might use abbreviations that are obvious to you but not to everyone. If you think that the majority of people can understand them and they help with readability, keep the abbreviations, but document or comment them just in case. There might be abbreviations specific to your company that someone from the outside, a new guy or a consultant, will not get. Put yourself in the shoes of that guy and maybe make your code a bit more democratic wherever possible.

The_foreigner – You might be working in an environment where every single person speaks the same language you speak, and it happens not to be English. So, you and your colleagues name variables and write the comments in your language. However, unless you work in a team with rules as strict as Athletic Bilbao’s, a foreigner might join your team in the future. It is hard to dispute that English is the lingua franca of programming (and of the world) these days. So, it might be worth putting yourself in the_foreigner’s shoes while writing your code, to avoid the huge amount of work that translation and explanation will require later. And even if you are working on your own, you might want to make your code public one day and want as many people as possible to read it.

The_hurry_up_guy – We all know this guy. Sometimes he doesn’t have a body or a face, but we can feel his presence. You might want to write a perfect solution, comment it in the best possible way and maybe add a bit of glitter on top, but sometimes you just need to give in and do it his way. And that’s ok too.

References:

[i] https://en.wikipedia.org/wiki/Bus_factor

Article series: 5 Clean Coding Tips – 4. Stop commenting the obvious

This is the fourth article in the series “5 tips for clean coding”, tips to follow as soon as you’ve taken the first steps in your coding career. If you missed it, read the introduction here to find out why it is important to write clean code.

Everyone will tell you that you need to comment your code. You do it for yourself and for others, and it might help you to put down a structure for your code before you get down to coding properly. Writing a lot of comments might give you a false sense of confidence that you are doing a good job, while in reality you are filling your code with obvious, redundant statements that bring no value. The role of a comment is to explain, not to describe. Any comment has to add information to the code you already have, not duplicate it.

Keep in mind, you are not narrating the code, adding ‘subtitles’ to Python’s performance. The comments are there to clarify what is not explicit in the code itself. Adding a comment that says what the line of code does is completely redundant most of the time.
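To illustrate the difference, here is a small made-up example. The sensor and its -1.0 placeholder value are assumptions invented for this sketch. Both list comprehensions do exactly the same thing; only the second comment tells the reader something the code does not.

```python
measurements = [4.2, 3.9, -1.0, 4.4]

# Redundant comment: it only repeats what the code already says.
# Loop over the measurements and keep the ones above zero.
valid_redundant = [value for value in measurements if value > 0]

# Useful comment: it adds information the code does not contain.
# The (hypothetical) sensor reports -1.0 when it has no reading,
# so negative values are placeholders, not real measurements.
valid_explained = [value for value in measurements if value > 0]
```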

A good rule of thumb would be: if it starts to sound like an instastory, rethink it. ‘So, I am having my breakfast, with a chai latte and my friend, the cat is here as well’. No.

It is also a good thing to learn to always update necessary comments before you modify the code. It is incredibly easy to modify a line of code, move on and forget the comment. There are people who claim that there are very few crimes in the world worse than comments that contradict the code itself.

Of course, there are situations where you might be preparing a tutorial for others and you want to narrate what the code is doing. In that case, writing that the load function loads the data is fine. It may not be obvious to the listener. When teaching, repetition and overly explicit explanations are more than welcome. Always keep in mind who your reader will be.

An insight into the stock markets in light of COVID-19

Introduction

The COVID-19 pandemic has us all firmly in its grip. The economy in particular is suffering badly from the necessary measures being applied worldwide. We therefore want to take the opportunity to look at share prices and analyze to what extent the virus influences the growth of the market.

Setup

First of all, we will concentrate on developed, emerging and frontier markets. To do so, we use the MSCI Global Investable Market Indexes (GIMI for short), which represent these groups. MSCI Inc. is a US financial services provider known above all for its stock indices.

Stock indices are key figures that track the development or change of a selection of share prices; they can be representative of entire markets, specific sectors or countries. The DAX, for example, is an index that summarizes the development of the 30 largest German companies.

Unfortunately, MSCI’s data is not readily accessible, which is why we will carry out our analyses with ETFs (exchange-traded funds). ETFs are funds traded on stock exchanges and managed by fund companies, fund managers or banks.

For our first analysis we use the following ETFs, which track the following indices:

Index | Description | ETF
MSCI World | more than 1,600 stocks from 24 developed countries | iShares MSCI World ETF
MSCI Emerging Markets | approx. 1,400 stocks from 27 emerging countries | iShares MSCI Emerging Markets ETF
MSCI Frontier Markets | stocks from approx. 29 frontier countries | iShares MSCI Frontier 100 ETF

Tab. 1: MSCI Global Investable Market Indexes with their representative ETFs

Data sources

To extract the ETF prices we use the Yahoo Finance API. With the right ticker symbols we can retrieve the historical data for our ETF selection. As can be seen under this link for the iShares MSCI World ETF, the historical data contain several values. For our analysis we use the closing price, that is, the value after the exchange has closed.

Since the ETFs differ in their price levels and we are only interested in the relative development, we will use relative values for the analysis. The starting point is set to 6 January 2020.
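As a minimal sketch of what “relative values” can mean here, every closing price is divided by the closing price on the start date and expressed as a percentage change. The prices below are made-up numbers, not real ETF quotes, and the function name relative_growth is an assumption for this illustration.

```python
from datetime import date

# Made-up closing prices for a single ETF (illustration only, not market data)
closing_prices = {
    date(2020, 1, 6): 100.0,
    date(2020, 1, 17): 102.0,
    date(2020, 2, 3): 95.0,
}


def relative_growth(prices, start_date):
    """Return the percentage change of each price relative to the start date."""
    baseline = prices[start_date]
    return {day: (price / baseline - 1.0) * 100.0 for day, price in prices.items()}


growth_percent = relative_growth(closing_prices, date(2020, 1, 6))
```

With this normalization, every ETF starts at 0% on the start date, so funds with very different price levels can be drawn on the same axis.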

We take the data on confirmed COVID-19 infections from the figures compiled by Johns Hopkins University.

[Figure: Correlation between confirmed cases and growth of MSCI GIMI]
Fig. 1: Interactive chart: growth of the stock markets, split into developed, emerging and frontier countries, and their confirmed COVID-19 cases over time. The confirmed cases of each market are the sum over the countries that are included in that market, so they may differ from the official figures.

Interpreting the chart

At first glance it is clear that, as COVID-19 cases rise, share prices slump by as much as 31%. (Start: 6 January 2020; end: 9 April 2020)

Looking at the beginning of the chart, we see a slump in the emerging markets, in which Chinese stocks have a weighting of 39.69% (as of 9 April 2020). On 17 January 2020 the emerging markets were still up 3.15% on our starting point, whereas on 1 February 2020 they were down 6.05% on the starting point, which corresponds to a drop of 9.20 percentage points relative to 17 January 2020. Since the COVID-19 virus also originated in China, one could interpret this as the reason for the slump. The developed and frontier countries, by contrast, remain fairly stable, and their confirmed cases are still very low.

The developed countries reach their peak on 19 February 2020 with a gain of 2.80%. After that, all three markets fell sharply. This period also saw the first deaths in Europe and the USA. The low point so far, registered on 23 March 2020, is -32.10% for the developed countries, -31.7% for the emerging countries and -34.88% for the frontier countries.

Interestingly, market values rise again from this point on. Possible reasons are the news from China reporting no further new infections, the Fed making up to 1.5 trillion dollars available to the market, and/or the European Central Bank’s announcement that it would buy bonds worth 750 billion euros. Large aid packages were announced in Germany as well.

To make more detailed statements, we have to look at prices at a more granular level. A more targeted look at the country level could describe the relationships more precisely.

If you are interested in interactive analyses and want to dive deeper into the subject: DATANOMIQ COVID-19 Dashboard

Here we have a dashboard built specifically for stock-market analyses, which is continuously being improved. Cryptocurrencies are to be added soon as well. If you have suggestions or requests for improvement, feel free to leave a comment!

Article series: 5 Clean Coding Tips – 3. Take Advantage of the Formatting Tools.

This is the third article in the series “5 tips for clean coding”, tips to follow as soon as you’ve taken the first steps in your coding career. If you missed it, read the introduction here to find out why it is important to write clean code.

Unfortunately, no automatic formatting tool will correct the logic in your code, suggest meaningful names for your variables or comment the code for you. Yet. Gmail has lately started suggesting email titles based on email content. AI-powered variable naming could be next, who knows. Anyway, the visual level of the code is much easier to correct, and there are tools that will do some of that job for you. Some of them might already exist in your IDE, you just need to look for them a bit; others need to be installed. One of the most popular tools is pylint[i], which checks your code against the style rules and points out where it deviates. It is worth checking out and learning to use efficiently.

Beware that, as convenient as it may seem to copy and paste your code into a quick online ‘beautifier’, it is not always a good idea. Online tools might store your code. If you are working on something that shouldn’t float freely around the World Wide Web, stick to reliable tools like pylint, which run locally and keep your code within your working directory.

These tools can become very good friends of yours but also very annoying ones. They will not miss a single whitespace and will not keep their mouths shut when your line length jumps from 79 to 80 characters. They will shout at you with underlining in some worrying color and/or exclamation marks. You will need to find a way to coexist and retain your sanity. It can be very distracting when you are in a working flow and warnings pop up all the time about formatting details that have nothing to do with what you are trying to solve. Sometimes it might be better to turn those warnings off while you are in your most concentrated or creative phase of writing and turn them back on once the dust of your genius has settled a little. Usually they offer a lot of flexibility regarding which warnings you want to ignore, among other features. The good thing is that they also teach you which mistakes you are making, and after some time you will simply stop making them in the first place.
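As a minimal sketch of that flexibility, pylint understands control comments placed inside the code. The snippet below silences the line-length message (line-too-long) for one single line; the function and its text are invented for this illustration.

```python
def build_report_header():
    # The trailing control comment switches off the line-length check for this one line only.
    return "Quarterly revenue, costs and margin per region, product line and sales channel (in EUR)"  # pylint: disable=line-too-long


# The same message can also be switched off for a whole run from the command line:
#   pylint --disable=line-too-long your_script.py
```

Silencing a message locally like this keeps the check active everywhere else, which tends to be safer than turning it off globally.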

References:

[i] https://www.pylint.org/

Introduction to the world of autoencoders

Who is this article for?

In this article we take a closer look at the neural network known as the autoencoder and get an insight into its basic principles, which we then consolidate with a simplified programming example. Knowledge of Python, TensorFlow and neural networks is very helpful.

How the autoencoder works

An autoencoder is a neural network that tries to compress its input information and then reproduce it correctly at the output from the reduced information.

Compression and reconstruction of the input run one after the other in the autoencoder, which is why we can look at the network in two sections.

The encoder

The encoder has the task of reducing the dimensions of the input information; this is also called dimensionality reduction. This reduction compresses the information, and only the most important information, or an average of it, is passed on. Like many other kinds of compression, this method is lossy.

In a neural network this is realized through hidden layers. The encoding is achieved by reducing the number of nodes in successive hidden layers.

The decoder

Once the input signal has been encoded, the decoder comes into play. Its task is to reconstruct the original data from the compressed information. The weights of the network are adjusted through backpropagation.

A bit of mathematics

The main goal of the autoencoder is for the output signal to match the input signal, which means we have a loss function of the form

L(x, \hat{x})

Let our input be denoted by x and our hidden layer by h. Our encoder then has the relationship h = f(x).

The reconstruction in the decoder can be described by \hat{x} = g(h). Our simple autoencoder is a feed-forward network without recurrent components and is optimized by backpropagation.

Symbol | Meaning
\mathbf{x}, \hat{\mathbf{x}} | input signal, output signal
\mathbf{W}, \hat{\mathbf{W}} | weights for encoder and decoder
\mathbf{B}, \hat{\mathbf{B}} | bias for encoder and decoder
\sigma, \hat{\sigma} | activation function for encoder and decoder
L | loss function

Let our hidden layer be denoted by \mathbf{h}. The following relationships then hold:

(1)   \begin{align*}
\mathbf{h} &= f(\mathbf{x}) = \sigma(\mathbf{W}\mathbf{x} + \mathbf{B}) \\
\hat{\mathbf{x}} &= g(\mathbf{h}) = \hat{\sigma}(\hat{\mathbf{W}} \mathbf{h} + \hat{\mathbf{B}}) \\
\hat{\mathbf{x}} &= \hat{\sigma} \{ \hat{\mathbf{W}} \left[\sigma ( \mathbf{W}\mathbf{x} + \mathbf{B} )\right] + \hat{\mathbf{B}} \}
\end{align*}

For an optimization with the mean squared error (MSE), the loss function could look as follows:

(2)   \begin{align*}
L(\mathbf{x}, \hat{\mathbf{x}}) &= \mathrm{MSE}(\mathbf{x}, \hat{\mathbf{x}}) = \| \mathbf{x} - \hat{\mathbf{x}} \|^2 \\
&= \| \mathbf{x} - \hat{\sigma} \{ \hat{\mathbf{W}} \left[\sigma ( \mathbf{W}\mathbf{x} + \mathbf{B} )\right] + \hat{\mathbf{B}} \} \|^2
\end{align*}

We have now seen the theory and mathematics of an autoencoder in its original form and want to apply it to a (very) simple example to see whether the autoencoder works as the theory says.

For this we take a one-hot (1-of-n) encoded data set representing the numbers 0 to 3.

    \begin{align*}
    [1, 0, 0, 0] \ \widehat{=} \ 0 \\
    [0, 1, 0, 0] \ \widehat{=} \ 1 \\
    [0, 0, 1, 0] \ \widehat{=} \ 2 \\
    [0, 0, 0, 1] \ \widehat{=} \ 3
    \end{align*}

This data set could be encoded as follows:

    \begin{align*}
    [1, 0, 0, 0] \ \widehat{=} \ 0 \ \widehat{=} \ [0, 0] \\
    [0, 1, 0, 0] \ \widehat{=} \ 1 \ \widehat{=} \ [0, 1] \\
    [0, 0, 1, 0] \ \widehat{=} \ 2 \ \widehat{=} \ [1, 0] \\
    [0, 0, 0, 1] \ \widehat{=} \ 3 \ \widehat{=} \ [1, 1]
    \end{align*}

This would reduce the dimensionality from four features to two, and this is exactly what we want to achieve in our example.

Programming a simple autoencoder
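The original code for this section is not reproduced here. As a stand-in, the following is a minimal sketch written from scratch in NumPy rather than TensorFlow. It implements equations (1) and (2) for the one-hot data set with two hidden nodes. The function names, the sigmoid activation for both layers, the learning rate, the number of epochs and the random seed are all assumptions made for this sketch, not the author's settings.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(x, n_hidden=2, learning_rate=0.5, epochs=10000, seed=0):
    """Train a one-hidden-layer autoencoder with plain batch gradient descent."""
    rng = np.random.RandomState(seed)
    n_inputs = x.shape[1]
    # Samples are stored as rows, so the code computes x W + B
    # instead of the column-vector form W x + B used in equation (1).
    W = rng.normal(0.0, 0.5, size=(n_inputs, n_hidden))      # encoder weights
    B = np.zeros(n_hidden)                                   # encoder bias
    W_hat = rng.normal(0.0, 0.5, size=(n_hidden, n_inputs))  # decoder weights
    B_hat = np.zeros(n_inputs)                               # decoder bias
    losses = []
    for _ in range(epochs):
        # Forward pass: encoder, then decoder
        h = sigmoid(x @ W + B)
        x_hat = sigmoid(h @ W_hat + B_hat)
        error = x_hat - x
        # Squared reconstruction error per sample, averaged over the batch
        losses.append(float(np.mean(np.sum(error ** 2, axis=1))))
        # Backpropagation through both sigmoid layers
        grad_out = 2.0 * error * x_hat * (1.0 - x_hat) / len(x)
        grad_hidden = (grad_out @ W_hat.T) * h * (1.0 - h)
        W_hat -= learning_rate * (h.T @ grad_out)
        B_hat -= learning_rate * grad_out.sum(axis=0)
        W -= learning_rate * (x.T @ grad_hidden)
        B -= learning_rate * grad_hidden.sum(axis=0)
    params = {"W": W, "B": B, "W_hat": W_hat, "B_hat": B_hat}
    return params, losses


def encode(x, params):
    return sigmoid(x @ params["W"] + params["B"])


def decode(h, params):
    return sigmoid(h @ params["W_hat"] + params["B_hat"])


one_hot_data = np.eye(4)  # the four one-hot vectors for the numbers 0 to 3
params, losses = train_autoencoder(one_hot_data)
hidden = encode(one_hot_data, params)   # shape (4, 2): the compressed code
reconstruction = decode(hidden, params)  # shape (4, 4): the rebuilt input
```

Printing np.round(hidden, 2) and np.round(reconstruction, 2) after training shows how far the network got. If training has converged, the largest entry in each reconstructed row sits at the position of the 1 in the corresponding input, but the learned two-dimensional code does not have to match the binary code shown above, and convergence depends on the chosen hyperparameters.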

 

Besides dimensionality reduction, typical applications of the autoencoder include image processing (e.g. compression, denoising), anomaly detection, sequence-to-sequence analyses, etc.

Outlook

With a simple example we have been able to consolidate how the autoencoder works. In the next step we want to go deeper using real data sets. Upcoming articles will also show variations of the autoencoder in different fields of application.

Article series: 5 Clean Coding Tips – 2. Name Variables in a Meaningful Way

This is the second article in the series “5 tips for clean coding”, tips to follow as soon as you’ve taken the first steps in your coding career. If you missed it, read the introduction here to find out why it is important to write clean code.

When it comes to naming variables, there are a few formal rules. Python itself requires that a variable name starts with an underscore or a letter, which can be followed by any number of underscores, letters or digits. Names cannot be reserved words: True, False, or, not, lambda etc. On top of that, the PEP 8 style guide prefers lowercase or lowercase_with_underscore names. All of this concerns variable names on a visual level. However, for readability purposes, the semantic level is just as important, or maybe even more so. As far as Python is concerned, the variables could just as well have completely arbitrary names.
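To make that concrete, here is a small made-up illustration; the values and names are invented for this sketch. Both halves compute exactly the same number.

```python
# Names the interpreter accepts without complaint, but that tell a reader nothing
a = 3.5
b = 40
c = a * b

# The same computation with descriptive names
hourly_rate = 3.5
hours_worked = 40
weekly_pay = hourly_rate * hours_worked
```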

It wouldn’t make the slightest difference. But again, code is not read only by the interpreter. It is for humans. Other people might need to look at your code to understand what you did, to be able to continue the work that you have already started. In any case, they need to be able to decipher what hides behind the variable names that you’ve given the objects in your code. They will need to remember what the names meant as they reappear in the code. And it might not be easy for them.

Remembering names is not easy in any life situation. Let’s consider the following one. You go to a party, and there is a bunch of new people that you meet for the first time. They all have names and you try very hard to remember them all. Imagine how much easier it would be if you could call the new girl who came with John the_girl_who_came_with_John. How much easier would it be to gossip to your friends about her? ‘Camilla is on the 5th glass of wine tonight, isn’t she?!’ ‘Who are you talking about???’ your friends might ask. ‘The_girl_who_came_with_John.’ And they will all know. ‘It was nice to meet you, girl_who_came_with_John, see you around.’ The good thing is that variables are not really like people. You can be a bit rude to them, they will not mind. You don’t have to force yourself or anyone else to remember an arbitrary variable name that happened to come to your mind in the moment of creation. Let your colleagues figure out what is what from a meaningful, straightforward description.

There is an important tradeoff to be aware of here. Lines of code should not exceed a certain length (79 characters, according to PEP 8), so it is recommended that you keep your names as short as possible. It is worth giving some thought to how you can name your variable in the most descriptive way while keeping it as short as possible. Keep in mind that the_blond_girl_in_a_dark_blue_dress_who_came_with_John_to_this_party might not be the best choice.

There are a few additional pieces of advice when it comes to naming your variables. First, try to always use pronounceable names. If you’ve ever been to an international party, you will know how much harder it is to remember something that you cannot even repeat. Second, you have probably been taught over and over again that whenever you create a loop, you use i and j to denote the iterators.

It is probably engraved deep into the folds of your brain to write for i in…. You need to try and scrape it out of your cortex. Think about what the i stands for, what it really does, and name it accordingly. Is i maybe the row_index? Is it a list_element?
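As a minimal sketch of a loop without i and j, consider a nested loop over a small table. The function name and the temperature data are invented for this illustration.

```python
def find_warmest_cell(temperature_table):
    """Return (row_index, column_index, temperature) of the highest value."""
    warmest = None
    for row_index, row in enumerate(temperature_table):
        for column_index, temperature in enumerate(row):
            if warmest is None or temperature > warmest[2]:
                warmest = (row_index, column_index, temperature)
    return warmest


warmest_cell = find_warmest_cell([[21.5, 22.0], [19.0, 18.5]])
```

Here the loop variables say what they hold, so a reader does not have to trace back what i and j were counting.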

Additionally, think about when to use a noun and where a verb. Variables usually are things and functions usually do things. So, it might be better to name functions with verb expressions, for example: get_id() or raise_to_power().

Moreover, it is a good practice to name constant numbers in the code. First, because when you name them you explain the meaning of the number. Second, because maybe one day you will have to change that number. If it appears multiple times in your code, you will avoid searching and changing it in every place. PEP 8 states that the constants should be named with UPPER_CASE_NAME. It is also quite common practice to explain the meaning of the constants with an inline comment at the end of the line, where the number appears. However, this approach will increase the line length and will require repeating the comment if the number appears more than one time in the code.
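A short sketch of named constants, with names and values that are made up for this example:

```python
SECONDS_PER_HOUR = 3600
MAX_LOGIN_ATTEMPTS = 3  # hypothetical limit; change it here and nowhere else


def to_hours(duration_seconds):
    return duration_seconds / SECONDS_PER_HOUR


def is_locked_out(failed_attempts):
    return failed_attempts >= MAX_LOGIN_ATTEMPTS
```

The name carries the explanation to every place the number is used, so no repeated inline comment is needed.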

Matrix search: Finding the blocks of neighboring fields in a matrix with Python

Task

In this article we will look at a solution in Python to the following grid search task:

Find the biggest block of adjoining elements of the same kind and determine how many blocks the matrix is divided into. Fields count as adjoining if they touch along their sides, not at their corners.

Input data

For ease of explanation, we will be looking at a simple 3×4 matrix with elements of three different kinds, 0, 1 and 2 (see above). To test the code, we will simulate data to achieve different matrix sizes and a varied number of element types. This also allows testing edge cases, such as when all elements are the same or all elements are different.

To simulate some test data for later, we can use the numpy randint() method.
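A minimal sketch of such a simulation could look like this. The helper name simulate_matrix and its parameters are assumptions for this illustration; randint() draws integers from low (inclusive) to high (exclusive).

```python
import numpy as np


def simulate_matrix(n_rows, n_cols, n_kinds, seed=None):
    """Return an n_rows x n_cols matrix with integer kinds 0 .. n_kinds - 1."""
    random_state = np.random.RandomState(seed)  # seeded so results can be reproduced
    return random_state.randint(0, n_kinds, size=(n_rows, n_cols))


test_matrix = simulate_matrix(3, 4, 3, seed=1)
single_kind_matrix = simulate_matrix(2, 2, 1)  # edge case: all elements are the same
```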

The code

How the code works

In summary, the algorithm loops through all fields of the matrix looking for unseen fields that will serve as a starting point for a local exploration of each block of color – the find_blocks() function. The local exploration is done by looking at the neighboring fields and, if they are of the same kind, moving to them to explore further fields – the explore_block() function. The fields that have already been seen and counted are stored in the visited list.

find_blocks() function:

  1. Finds a starting point of a new block
  2. Runs the explore_block() function for local exploration of the block
  3. Appends the size of the explored block
  4. Updates the list of visited points
  5. Returns the result, once all fields of the matrix have been visited.

explore_block() function:

  1. Takes the coordinates of the starting field for a new block and the list of visited points
  2. Creates the queue set with the starting point
  3. Sets the size of the current block (field_count) to 1
  4. Starts a while loop that is executed for as long as the queue is not empty
    1. Takes an element of the queue and uses its coordinates as the current location for further exploration
    2. Adds the current field to the visited list
    3. Explores the neighboring fields and if they belong to the same block, they are added to the queue
    4. The fields are taken off the queue for further exploration one by one until the queue is empty
  5. Returns the field_count of the explored block and the updated list of visited fields
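The steps above can be sketched as follows. This is a reconstruction from the description, not the article's original code: it keeps the function names find_blocks() and explore_block(), but uses a set for the visited fields and a deque as the queue, marks fields as visited when they are queued, and counts each field when it is taken off the queue. The example matrix is made up and is not the matrix from the article.

```python
from collections import deque


def explore_block(matrix, start, visited):
    """Explore the block containing `start`; return its size and the visited set."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    kind = matrix[start[0]][start[1]]
    queue = deque([start])
    visited.add(start)
    field_count = 0
    while queue:
        row, col = queue.popleft()
        field_count += 1
        # Only neighbors that share a side count, not those touching at a corner
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            neighbor = (row + d_row, col + d_col)
            inside = 0 <= neighbor[0] < n_rows and 0 <= neighbor[1] < n_cols
            if (inside and neighbor not in visited
                    and matrix[neighbor[0]][neighbor[1]] == kind):
                visited.add(neighbor)
                queue.append(neighbor)
    return field_count, visited


def find_blocks(matrix):
    """Return (size of the biggest block, number of blocks) for a non-empty matrix."""
    visited = set()
    block_sizes = []
    for row in range(len(matrix)):
        for col in range(len(matrix[0])):
            if (row, col) not in visited:
                size, visited = explore_block(matrix, (row, col), visited)
                block_sizes.append(size)
    return max(block_sizes), len(block_sizes)


# Made-up 3x4 example with three kinds of elements
example_matrix = [
    [1, 1, 0, 2],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
]
biggest_block, number_of_blocks = find_blocks(example_matrix)
```

For this made-up matrix the zeros form one connected block of five fields, the ones a block of four and the twos a block of three, so the function returns a biggest block of 5 and 3 blocks in total.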

Execute the function

The returned result is biggest block: 4, number of blocks: 4.

Run the test matrices:

Visualization

The matrices for the article were visualized with the seaborn heatmap() method.

Article series: 5 Clean Coding Tips – 1. Be Consistent

This is the first article in the series “5 tips for clean coding”, tips to follow as soon as you’ve taken the first steps in your coding career. If you missed it, read the introduction here to find out why it is important to write clean code.

Consistency is THE rule to follow if you want to make your code clean and increase readability. Not to make it sound desperate, but honestly, whatever you decide to do when it comes to the coding style, just be consistent. Whether you agree with any standards, formatting styles or don’t even know them, just be consistent. Don’t ever allow inconsistency to sneak into your script or your project. This will only bring confusion, disorientation, chaos and general misery.

The rules for how exactly to keep your code clean and visually organized might differ slightly depending on the situation you find yourself in. The PEP 8 rules can be ambiguous in some places and leave room for interpretation. For example, the question of whether you use single or double quotes to denote a string is open. It is possible that your work environment already has a standard and you just need to comply with it. No room to show off your highly unique take on it, sorry. However, if you are working on your own and there is no one to roll their eyes at your messed-up code, you need to decide for yourself. Once you do, again, be consistent at the level of the script, the project and your work in general. Otherwise, it will look messy, patchworky and simply unprofessional.

People are famously quick to ascribe intentionality, even to thermostats[i]. They will assume that the details of how you wrote your code are intentional. They will try to figure out why you are doing one thing in some places and a different thing in other places. If those differences came from you being careless and have no meaning behind them, the reader of your code will waste a lot of time trying to figure them out and end up frustrated. Remember the first few snippets of Python code you ever saw? Maybe you saw some code with double quotes and some with single quotes. You were green, knew nothing and quite possibly thought that the two had different meanings, and you spent time trying to figure out why on earth there were single quotes in some places and double quotes in others.

If those altruistic arguments do not really convince you, let’s see how consistency can serve your own benefit. First, that outsider who is looking at your code and trying very hard to figure out what on earth is going on might be you. It might sound crazy, and it is, indeed, quite sad, but most likely, after 6 months of not looking at your code, you will no longer remember what you did there if it is not documented well. Documenting in a homogeneous way can take some time and some effort. Nevertheless, in general, code gets read many times after it has been written. When in doubt, sacrifice some of your writing time to increase readability and minimize the reading time later. It will pay off in the long run.

Having a set of rules at your disposal can make your work faster. You will avoid arguing with yourself about which option is the best one: mean_income, income_mean or income_avg. You can avoid making loads of small decisions as you write your code by making a set of global rules. In that way, you can allocate your energy and resources into solving the real problem. Not the how-do-I-format-this? one.

It is not necessary that you make all those grand decisions right now. You also don’t have to make them for life, it’s ok to change your mind eventually, so don’t feel overwhelmed. But once you’ve learned this and that, spent a little time coding, have a good long look at your sprouting habits and decide what you are going to do about splitting those lines and stick to it!

References:

[i] https://en.wikipedia.org/wiki/Intentional_stance