
Simple RNN

Prerequisites for understanding RNN at a more mathematical level

Writing the article series A gentle introduction to the tiresome part of understanding RNN on recurrent neural networks (RNNs) is not a creative or ingenious idea. It is quite an ordinary topic. But I am still going to write my own articles on this ordinary topic because I have been frustrated by the lack of sufficient explanations of RNNs for slow learners like me.

I think many readers of articles on this website at least know that an RNN is a type of neural network used for AI tasks such as time series prediction, machine translation, and voice recognition. But if you do not understand how RNNs work, especially during back propagation, this blog series is for you.

After reading this article series, I think you will be able to understand RNNs in a more mathematical and abstract way. But in case some readers are allergic or intolerant to mathematics, I have tried to use as little mathematics as possible.

Ideal prerequisite knowledge:

  • Some understanding of densely connected layers (also called fully connected layers or multilayer perceptrons) and how their forward/back propagation works.
  • Some understanding of the structure of convolutional neural networks.

*In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

1, Difficulty of Understanding RNN

I bet a part of the difficulty of understanding RNNs comes from the variety of their structures. If you search “recurrent neural network” on Google Images or somewhere similar, you will see what I mean. But that cannot be helped, because RNNs enable a variety of tasks.

Another major difficulty of understanding RNNs is understanding their back propagation algorithm. I think some of you found it hard to understand the chain rules in calculating back propagation of densely connected layers, where you have to make the most of linear algebra. And I have to say the backprop of RNNs, especially LSTMs, is a monster of chain rules. I am planning to upload not only a blog post on RNN backprop, but also presentation slides with animations to make it more understandable, in some external links.

In order to avoid such confusion, I am going to introduce a very simplified type of RNN, which I call a “simple RNN.” The RNN displayed as the head image of this article is a simple RNN.

2, How Neurons are Connected


How to connect neurons and how to activate them is what neural networks are all about. The structures of those neurons are easy to grasp as long as we are talking about DCL or CNN. But when it comes to the structure of RNN, many study materials avoid showing that RNNs are also connections of neurons, just like DCL or CNN (*If you are not sure how neurons are connected in a CNN, this link should be helpful. Draw a random digit in the square at the corner.). In fact the structure of an RNN is also the same, and as long as it is a simple RNN, it is not hard to visualize its structure.

Even though an RNN is also a connection of neurons, most RNN charts are usually simplified, using black boxes. In the case of a simple RNN, most study materials would display it as the chart below.

But that also cannot be helped, because fancier RNNs have more complicated connections of neurons, so there is no longer an advantage in displaying them as connections of neurons, and you need to understand RNNs in a more abstract way, I mean, as you see in most textbooks.

I am going to explain details of simple RNN in the next article of this series.

3, Neural Networks as Mappings

If you still think that neural networks are something like magical spider webs or models of brain tissues, forget that. They are just ordinary mappings.

If you have been allergic to mathematics in your life, you might have never heard of the word “mapping.” If so, at least please keep in mind that the equation y=f(x), which most people have seen in compulsory education, is an example of a mapping. If you get a value x, you get a value y corresponding to that x.

But in the case of deep learning, x is a vector or a tensor, and it is denoted with \boldsymbol{x}. If you have never studied linear algebra, imagine that a vector is a column of Excel data (only one column), a matrix is a sheet of Excel data (with some rows and columns), and a tensor is several sheets of Excel data (each sheet does not necessarily contain only one column).
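
To make the analogy concrete, here is a minimal NumPy sketch (not part of any model discussed later) showing that a vector, a matrix, and a tensor simply differ in their number of axes:

import numpy as np

x_vector = np.array([1.0, 2.0, 3.0])            # a vector: one column of data, shape (3,)
x_matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # a matrix: one sheet with rows and columns, shape (2, 2)
x_tensor = np.zeros((3, 150, 150))              # a tensor: several "sheets" stacked together, shape (3, 150, 150)

print(x_vector.ndim, x_matrix.ndim, x_tensor.ndim)  # 1 2 3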

CNNs are mainly used for image processing, so their inputs are usually image data. Image data are in many cases (3, height, width) tensors, because an image usually has red, green, and blue channels, and the image in each channel can be expressed as a height*width matrix (the “height” and the “width” are numbers of pixels, so they are discrete numbers).

The convolutional part of a CNN (which I call the “feature extraction part”) maps the tensors to a vector, and the last part is usually a DCL, which works as a classifier/regressor. At the end of the feature extraction part, you get a vector. I call it a “semantic vector” because the vector has information on the “meaning” of the input image. In this link you can see pictures plotted according to their semantic vectors. You can see that even if the pictures are not necessarily close pixelwise, they are close in terms of the “meanings” of the images.

In the example of a dog/cat classifier introduced by François Chollet, the developer of Keras, the CNN maps (3, 150, 150) tensors to 2-dimensional vectors, (1, 0) or (0, 1) for (dog, cat).
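
Chollet’s original example actually ends in a single sigmoid unit, so the following is only a rough sketch (with made-up layer sizes) of a Keras CNN that maps an image tensor to a 2-dimensional (dog, cat) vector. Note that Keras expects channels-last input by default, so the tensor is written as (150, 150, 3) here:

from tensorflow.keras import layers, models

# A minimal sketch: feature extraction part followed by a DCL classifier.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                       # end of the feature extraction part
    layers.Dense(128, activation='relu'),   # the "semantic vector" lives around here
    layers.Dense(2, activation='softmax'),  # DCL part: (dog, cat) probabilities
])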

Wrapping up the points above, you should keep at least two things in mind: first, a DCL is a classifier or a regressor, and a CNN is a feature extractor used for image processing. Second, the feature extraction part of a CNN maps images to vectors which are more related to the “meaning” of the image.

Importantly, I would like you to understand RNN this way. An RNN is also just a mapping.

*I recommend you to at least take a look at the beautiful pictures in this link. These pictures give you some insight into how CNN perceive images.

4, Problems of DCL and CNN, and the Need for RNN

Taking an example of an RNN task should be helpful for this topic. Machine translation is probably the most famous application of RNNs, and it is also a good example of why DCL and CNN are not suited for some tasks. Its algorithm is out of the scope of this article series, but it gives you a good insight into some features of RNNs. I prepared three sentences in German, English, and Japanese, which have the same meaning. Assume that each sentence is divided into some parts as shown below and that each vector corresponds to each part. In machine translation we want to convert a set of the vectors into another set of vectors.

Then let’s see why DCL and CNN are not suited for such a task.

  • The input size is fixed: In the case of the dog/cat classifier I mentioned, even though the sizes of the input images vary, they are first molded into (3, 150, 150) tensors. But in machine translation, the length of the input is usually supposed to be flexible.
  • The order of inputs does not matter: In the case of the dog/cat classifier in the last section, there is no difference between the inputs “cat,” “cat,” “dog” and “dog,” “cat,” “cat.” And in the case of DCL, the network is symmetric, so even if you shuffle the inputs, as long as you shuffle all of the input data in the same way, the DCL gives out the same outcome. But if you have learned at least one foreign language, it is easy to imagine that the orders of vectors in sequence data matter in machine translation.

*It is said that English has a phrase structure grammar, whereas Japanese has a dependency grammar. In English, the order of words is important, but in Japanese, as long as the particles and conjugations are correct, the order of words is very flexible. In my impression, German grammar is somewhere between them: as long as you put the verb in the second position and the cases of the words are correct, the order is also relatively flexible.

5, Sequence Data

We can say DCL and CNN are not useful when you want to process sequence data. Sequence data are a type of data which are lists of vectors, and importantly, the order of the vectors matters. The number of vectors in sequence data is usually called the number of time steps. A simple example of sequence data is meteorological data measured at a spot every ten minutes, for instance temperature, air pressure, wind velocity, and humidity. In this case the data are recorded as a 4-dimensional vector every ten minutes.
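
For example, one day of such meteorological data could be stored as follows (a sketch with random numbers, just to show the shape of sequence data):

import numpy as np

time_steps = 144   # one measurement every ten minutes for 24 hours
features = 4       # temperature, air pressure, wind velocity, humidity

sequence = np.random.rand(time_steps, features)  # a list of 144 four-dimensional vectors
print(sequence.shape)  # (144, 4)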

But this “time step” does not necessarily mean “time.” In the case of natural language processing (including machine translation), which I mentioned in the last section, the numberings of the vectors denoting the parts of sentences are the “time steps.”

And RNNs are mappings from one piece of sequence data to another piece of sequence data.

*At least I found a paper on the RNN’s capability of universal approximation on many-to-one RNN tasks. But I have not found any papers on universal approximation for many-to-many RNN tasks. Please let me know if you find any clue on whether such approximation is possible. I am desperate to know that.

6, Types of RNN Tasks

RNN tasks can be classified into some types depending on the lengths of the input/output sequences (the “length” means the number of time steps of the input/output sequence data).

If you want to predict the temperature in 24 hours based on several time series data points in the last 96 hours, the task is many-to-one. If you sample data every ten minutes, the input size is 96*6=576 (the input data is a list of 576 vectors), and the output size is 1 (which is a value of temperature). Another example of a many-to-one task is sentiment classification. If you want to judge whether a post on social media is positive or negative, the input size is very flexible (the length of the post varies), but the output size is one, which is (1, 0) or (0, 1), denoting (positive, negative).
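
To make the shapes concrete, a minimal Keras sketch of such a many-to-one model could look like the following (the hidden size of 32 is arbitrary and not taken from any particular reference):

from tensorflow.keras import layers, models

# Many-to-one: a sequence of 576 four-dimensional weather vectors in, one temperature value out.
model = models.Sequential([
    layers.SimpleRNN(32, input_shape=(576, 4)),  # reads the whole sequence, returns the last hidden state
    layers.Dense(1),                             # regressor: the predicted temperature in 24 hours
])
model.compile(optimizer='adam', loss='mse')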

*The charts in this section are simplified models of the RNNs used for each task. Please keep in mind that they are not 100% correct, but I tried to make them as exact as possible compared to those in other study materials.

Music/text generation can be one-to-many tasks. If you give the first sound/word you can generate a phrase.

Next, let’s look at many-to-many tasks. Machine translation and voice recognition are likely the best-known examples of many-to-many tasks, but here named entity recognition seems to be a proper choice. Named entity recognition is the task of finding proper nouns in a sentence. For example, given the two sentences “He said, ‘Teddy bears on sale!’” and “He said, ‘Teddy Roosevelt was a great president!’”, judging whether “Teddy” is a proper noun or a normal noun is named entity recognition.

Machine translation and voice recognition, which are more popular, are also many-to-many tasks, but they use more sophisticated models. In case of machine translation, the inputs are sentences in the original language, and the outputs are sentences in another language. When it comes to voice recognition, the input is data of air pressure at several time steps, and the output is the recognized word or sentence. Again, these are out of the scope of this article but I would like to introduce the models briefly.

Machine translation uses a type of RNN named the sequence-to-sequence model (often called the seq2seq model). This model is also very important for other natural language processing tasks in general, such as text summarization. A seq2seq model is divided into an encoder part and a decoder part. The encoder gives out a hidden state vector, which is used as the input of the decoder part. The decoder part then generates text, using the output of the last time step as the input of the next time step.
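
A minimal training-time sketch of such an encoder-decoder in Keras (with made-up vocabulary and dimension sizes, and without the inference loop that feeds the decoder its own previous output) could look like this:

from tensorflow.keras import layers, models

vocab_size, embed_dim, hidden_dim = 10000, 128, 256  # made-up sizes

# Encoder: reads the source sentence and gives out its final hidden state.
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(vocab_size, embed_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(hidden_dim, return_state=True)(enc_emb)

# Decoder: generates the target sentence, initialized with the encoder's hidden state.
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(vocab_size, embed_dim)(dec_inputs)
dec_lstm_out, _, _ = layers.LSTM(hidden_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_outputs = layers.Dense(vocab_size, activation='softmax')(dec_lstm_out)

model = models.Model([enc_inputs, dec_inputs], dec_outputs)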

Voice recognition is also a famous application of RNN, but it also needs a special type of RNN.

*To be honest, I don’t know what the state-of-the-art voice recognition algorithm is. The example in this article is a combination of an RNN and a collapsing function based on Connectionist Temporal Classification (CTC). In this model, the output of the RNN is much longer than the recorded words or sentences, so a collapsing function reduces the raw output to an output of normal length.

You might have noticed that the RNNs in the charts above are connected in both directions. Depending on the RNN task, you need such bidirectional RNNs. I think it is also easy to imagine that such networks are necessary. Again, machine translation is a good example.

And interestingly, image captioning, which enables a computer to describe a picture, is a one-to-many task. As the output is a sentence, it is easy to imagine that the output is “many.” If it is a one-to-many task, the input is supposed to be a vector.

Where does the input come from? I told you that I was obsessed with the beauty of the last vector of the feature extraction part of a CNN. Surprisingly, the “beautiful” vector, which I call a “semantic vector,” is the input of the image captioning task (after some transformations, depending on the network model).

I think this article includes the major things you need to know as prerequisites when you want to understand RNNs at a more mathematical level. In the next article, I would like to explain the structure of a simple RNN and how it forward propagates.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Deep Learning World – Virtual Edition 2020!

DEEP LEARNING WORLD 2020

Virtual Edition, May 11-12, 2020

The premier conference covering the
commercial deployment of deep learning

Deep Learning is no longer the cool new discipline. Instead it has become another tool in the toolbox of the data scientist – but a very important one! Without RNNs, CNNs etc. many applications that make our daily life better or help us to improve our business wouldn’t be possible. Take for example the German Federal State NRW: they are using neural networks to detect child pornography. Other organizations use it to detect cancer, translate text or inspect machines. It’s also important to understand how Deep Learning sits alongside traditional machine learning methods. As an expert you should know when and how to apply different methods for different applications. At the Deep Learning World conference, you will learn from other practitioners why they decided on a deep, transfer or reinforcement learning approach, what the analytical and technical but also organisational and economic challenges were and how they solved them. Take this opportunity and visit the two-day event to broaden your knowledge, deepen your understanding and discuss your questions with other Deep Learning experts – see you virtually in May 2020!

Why should you participate?

We will provide a live-streamed virtual version of Deep Learning World on 11-12 May, 2020: you will be able to attend sessions and to interact and connect with the speakers and fellow members of the data science community, including sponsors and exhibitors, from your home or your office.

What about the workshops?

The workshops will also be held virtually on the planned date:
13 May, 2020.

Don’t have a ticket yet?

It’s not too late to join the data science community.
Register by 10 May to receive access to the livestream and recordings.

REGISTER HERE

We’re looking forward to seeing you – virtually!

This year Deep Learning World runs alongside Predictive Analytics World for Healthcare and Predictive Analytics World for Industry 4.0.

Predictive Analytics World for Industry 4.0

Difficult times call for creative measures

Predictive Analytics World for Industry 4.0 will go virtual and you still have time to join us!

What do you have in store for me?

We will provide a live-streamed virtual version of PAW Industry 4.0 Munich 2020 on 11-12 May, 2020: you will be able to attend sessions and to interact and connect with the speakers and fellow members of the data science community including sponsors and exhibitors from your home or your office.

What about the workshops?

The workshops will also be held virtually on the planned date:
13 May, 2020.

Get a complimentary virtual sneak preview!

If you would like to join us for a virtual sneak preview of the workshop “Data Thinking” on Thursday, April 16, so you can familiarise yourself with the quality of the virtual edition of both conference and workshops and with how the interaction with speakers and attendees works, please send a request to registration@risingmedia.com.

Don’t have a ticket yet?

It’s not too late to join the data science community.
Register by 10 May to receive access to the livestream and recordings.

REGISTER HERE

We’re looking forward to seeing you – virtually!

This year Predictive Analytics World for Industry 4.0 runs alongside Deep Learning World and Predictive Analytics World for Healthcare.

Visual Question Answering with Keras – Part 2: Making Computers Intelligent to answer from images

Making Computers Intelligent to answer from images

This is my second blog on Visual Question Answering. In the last blog, I introduced VQA, the available datasets and some of the real-life applications of VQA. If you have not gone through it yet, I would highly recommend that you do. Click here for more details about it.

In this blog post, I will walk through the implementation of VQA in Keras.

You can download the dataset from here: https://visualqa.org/index.html. All my experiments were performed with VQA v2 and I have used a very tiny subset of the entire dataset, i.e. all samples for training and testing are taken from the validation set.

Table of contents:

  1. Preprocessing Data
  2. Process overview for VQA
  3. Data Preprocessing – Images
  4. Data Preprocessing through the spaCy library- Questions
  5. Model Architecture
  6. Defining model parameters
  7. Evaluating the model
  8. Final Thought
  9. References

NOTE: The purpose of this blog is not to get state-of-the-art performance on VQA. The idea is to get familiar with the concept. All my experiments were performed with the validation set only.

Full code on my Github here.


1. Preprocessing Data:

If you have downloaded the dataset, then the questions and answers (called annotations) are in JSON format. I have provided the code to extract the questions, annotations and other useful information in my Github repository. All extracted information is stored in .txt file format. After executing the code, the preprocessing directory will have the following structure.

All text files will be used for training.

 

2. Process overview for VQA:

As we discussed in the previous post, visual question answering is broken down into two broad parts, i.e. vision and text. I will present the neural network approach to this problem using a Convolutional Neural Network (for image data) and a Recurrent Neural Network (for text data).

If you are not familiar with RNNs (more precisely LSTMs) then I would highly recommend that you go through Colah’s blog and Andrej Karpathy’s blog. The concepts discussed in these blogs are extensively used in my post.

The main idea is to get features for images from CNN and features for the text from RNN and finally combine them to generate the answer by passing them through some fully connected layers. The below figure shows the same idea.

 

I have used VGG-16 to extract the features from the image and LSTM layers to extract the features from questions and combining them to get the answer.

3. Data Preprocessing – Images:

Images are nothing but one of the inputs to our model. But as you may already know, before feeding images to the model we need to convert them into fixed-size vectors.

So we need to convert every image into a fixed-size vector, which can then be fed to the neural network. For this, we will use the VGG-16 pretrained model. The VGG-16 model architecture is trained on millions of images from the ImageNet dataset to classify an image into one of 1000 classes. Here our task is not to classify the image but to get the bottleneck features from the second-last layer.

Hence after removing the softmax layer, we get a 4096-dimensional vector representation (bottleneck features) for each image.

Image Source: https://www.cs.toronto.edu/~frossard/post/vgg16/

 

For the VQA dataset, the images are from the COCO dataset and each image has a unique id associated with it. All these images were passed through the VGG-16 architecture and their vector representations are stored in a “.mat” file along with the ids. So in practice we do not have to implement the VGG-16 architecture; instead we just look up the file with the id of the image at hand and we get the 4096-dimensional vector representation for the image.
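
In the experiments below these vectors are simply read from the precomputed “.mat” file, but if you wanted to extract them yourself, a sketch with the Keras VGG-16 implementation could look like this (the image path is just a placeholder):

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

base = VGG16(weights='imagenet')                              # full VGG-16 including the classifier
model = Model(inputs=base.input,
              outputs=base.get_layer('fc2').output)           # cut off the softmax, keep the 4096-dim layer

img = image.load_img('some_coco_image.jpg', target_size=(224, 224))  # placeholder path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
bottleneck = model.predict(x)                                 # shape (1, 4096)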

4. Data Preprocessing through the spaCy library- Questions:

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. Just as we converted images into fixed 4096-dimensional vectors, we also need to convert questions into a fixed-size vector representation. For installing spaCy click here.

You might know that for training word embeddings in Keras we have a layer called the Embedding layer, which takes a word and embeds it into a higher-dimensional vector representation. But by using the spaCy library we do not have to train anything to get vector representations in higher dimensions.

 

This model is actually trained on billions of tokens from a large corpus. So we just need to call the vector attribute of the spaCy token class and we get the vector representation for a word.

Applying this to the tokens of each question, we get a fixed 300-dimensional representation for each word.
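
A minimal sketch of this step (assuming a spaCy model that ships with 300-dimensional word vectors, e.g. en_core_web_md, is installed):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_md')    # a model with 300-dimensional word vectors
doc = nlp("What is on the table?")

question_matrix = np.array([token.vector for token in doc])
print(question_matrix.shape)          # (number_of_tokens, 300)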

5. Model Architecture:

In our problem the input consists of two parts, i.e. an image vector and a question, so we cannot use the Sequential API of the Keras library. For this reason, we use the Functional API, which allows us to create multiple models and finally merge them.

The below picture shows the high-level architecture idea of submodules of neural network.

After concatenating the 2 different models the summary will look like the following.

The below plot helps us to visualize the neural network architecture and to understand the two types of input:
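
The exact layer sizes are in the repository; a rough sketch of the two-branch idea with the Functional API (with made-up values for the question length and the number of answers) could look like this:

from tensorflow.keras import layers, models

max_question_len, num_answers = 30, 1000   # made-up values; see the repository for the real ones

# Image branch: the precomputed 4096-dimensional VGG-16 bottleneck vector.
image_input = layers.Input(shape=(4096,))
image_features = layers.Dense(512, activation='relu')(image_input)

# Question branch: a sequence of 300-dimensional spaCy word vectors fed to an LSTM.
question_input = layers.Input(shape=(max_question_len, 300))
question_features = layers.LSTM(512)(question_input)

# Merge both branches and classify over the answer vocabulary.
merged = layers.concatenate([image_features, question_features])
merged = layers.Dense(1024, activation='relu')(merged)
output = layers.Dense(num_answers, activation='softmax')(merged)

model = models.Model(inputs=[image_input, question_input], outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])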

 

6. Defining model parameters:

The hyperparameters that we are going to use for our model are defined as follows:

If you know what these parameters mean, then you can play around with them and may get better results.

Time Taken: I used the GPU on https://colab.research.google.com and hence it took me approximately 2 hours to train the model for 5 epochs. However, if you train it on a PC without GPU, it could take more time depending on the configuration of your machine.

7. Evaluating the model:

Since I have used a very small dataset for these experiments, I am not able to get very good accuracy. The below code calculates the accuracy of the model.
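
The actual code is in the repository; a hedged sketch of how such a check could look, assuming the two-input layout from the model above and hypothetical test arrays, is:

import numpy as np

# test_image_vectors, test_question_matrices and test_answers_one_hot are hypothetical
# arrays following the same layout as the training data.
predictions = model.predict([test_image_vectors, test_question_matrices])
accuracy = np.mean(np.argmax(predictions, axis=1) == np.argmax(test_answers_one_hot, axis=1))
print('Accuracy: {:.4f}'.format(accuracy))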

 

Since I have trained the model multiple times with different parameters, you will not get exactly the same accuracy as me. If you want, you can directly download the model.h5 file from my Google Drive.

 

8. Final Thoughts:

One of the interesting things about VQA is that it is a relatively new field. So there is absolutely no end to what you can do to solve this problem. Below are some tips for replicating the code.

  1. Start with a very small subset of data: When you start implementing I suggest you start with a very small amount of data. Because once you are ready with the whole setup then you can scale it any time.
  2. Understand the code: Understanding code line by line is very much helpful to match your theoretical knowledge. So for that, I suggest you can take very few samples(maybe 20 or less) and run a small chunk (2 to 3 lines) of code to get the functionality of each part.
  3. Be patient: One of the mistakes I made when starting this project was trying to do everything in one go. If you get an error while replicating the code, spend 4 to 5 days working hard on it. If you still cannot solve it after that, I would suggest taking a break of 1 or 2 days and then resuming.

VQA is the intersection of NLP and CV, and hopefully this project will give you a better understanding (especially a practical one) of most of the deep learning concepts involved.

If you want to improve the performance of the model below are few tips you can try:

  1. Use larger datasets
  2. Try Building more complex models like Attention, etc
  3. Try using other pre-trained word embeddings like GloVe
  4. Try using a different architecture 
  5. Do more hyperparameter tuning

The list is endless and it goes on.

In this blog I have not provided the complete code; you can get it from my Github repository.

9. References:

  1. https://blog.floydhub.com/asking-questions-to-images-with-deep-learning/
  2. https://tryolabs.com/blog/2018/03/01/introduction-to-visual-question-answering/
  3. https://github.com/sominwadhwa/vqamd_floyd

Accelerate your AI Skills Today: A Million Dollar Job!

The skyrocketing salaries ($1m per year) of AI engineers are not hype. They are a fact of the current corporate world, where you will witness a shift that is inevitable.

We’ve already set foot on the edge of the technological revolution, a revolution that is on the verge of altering the way we live and work. Humanity has fundamentally transformed production in three revolutions, and we’re now entering the fourth. In its scope, the fourth revolution projects a transformation that is unlike anything we humans have ever experienced.

  • The first revolution transformed the world from rural to urban
  • The second revolution brought the emergence of mass production
  • The third introduced the digital revolution
  • The fourth industrial revolution aims to integrate technologies into our lives

And all thanks to artificial intelligence (AI), an advanced technology that surrounds us, from virtual assistants to translation software to self-driving cars.

The exponential rise of AI has disrupted almost every industry, so much so that AI is being rated as a one-million-dollar profession.

Did this grab your attention? It did?

Now, what if we were to tell you that salary compensation for AI experts has grown dramatically? AI and machine learning are fields that have a mountain of demand in the tech industry today but sparse supply.

The AI field is growing at a quick pace and salaries are skyrocketing! Read it for yourself to see what AI experts, AI researchers and other AI talent are commanding today.

  • OpenAI, a top-class AI research laboratory, says that techies in the AI field are projected to earn salary compensation ranging between $300k and $500k for fresh graduates. Expert professionals could earn anywhere up to $1m.
  • A whopping salary package of above 100 million yen, which amounts to about $1m, is being offered to AI geniuses by the Japanese firm Start Today, which operates the fashion shopping website Zozotown.

Does this leave you with a question – is this the right opportunity for you to jump into the field and make hay while the sun is shining?

And the answer to this question is – yes, it is the right opportunity for any developer seeking a role in the AI industry. It can be your chance to bridge the skill shortage in the AI field either by upskilling or reskilling yourself in the field of AI.

There is a wide variety of roles available for an AI enthusiast like you. And certain roles, such as AI Engineer and AI Researcher, are in high demand, as there are not many professionals with robust AI knowledge.

According to a job report, “The Future of Jobs 2018,” a prediction was made suggesting that machines and algorithms will create around 133 million new job roles by 2022.

AI and machine learning will dominate the tech world. The World Economic Forum says that several sectors have started embracing AI and machine learning to tackle challenges in certain fields such as advertising, supply chain, manufacturing, smart cities, drones, and cybersecurity.

Unraveling the AI realm

From chatbots to financial planners, AI is impacting the way businesses function on a day-to-day basis. AI makes the work simpler, as it provides variables which make the work more streamlined.

Alright! You know that

  • the demand for AI professionals is rising exponentially and that there is just a trickle of supply
  • the AI professionals are demanding skyrocketing salaries

However, beyond that how much more do you know about AI?

Considering the fact that our lives have already been touched by AI (think Alexa, and Siri), it is just a matter of time when AI will become an indispensable part of our lives.

Gartner predicts that 2020 will be an important year for business growth in AI, so it is possible to witness significant sparks in employment growth. Though AI is predicted to eliminate 1.8 million jobs, it is also said to create 2.3 million new ones. As we look forward to stepping into 2020, AI-related job roles are set to make positive progress towards achieving 2 million net-new jobs by 2025.

With AI promising fat paychecks that reach into the millions, AI experts are finding new ways to pick up new skills. However, one of the biggest issues affecting the job market today is the scarcity of talent in this field.

The best way to stay relevant and employable in AI is probably by “reskilling” and “upskilling,” and AI certifications are considered ideal for those in the current workforce.

Looking to upskill yourself – here’s how you can become an AI engineer today.

Top three ways to enhance your artificial intelligence career:

  1. Acquire skills in Statistics and Machine Learning: If you’re getting into the field of machine learning, it is crucial that you have in-depth knowledge of statistics. Statistics is considered a prerequisite to the ML field; the two fields are tightly related. Machine learning models are created to make accurate predictions, while statistical models do the job of interpreting the relationship between variables. Many ML techniques rely heavily on theory from statistics. Thus, having extensive knowledge of statistics helps you take the first step towards an AI career.
  2. Online certification programs in AI skills: Opting for AI certifications will boost your credibility amongst potential employers. Certifications will also enhance your earning potential and increase your marketability. If you’re looking for a change and want to be part of something impactful, join the AI bandwagon. The IT industry is growing at breakneck speed, and businesses are realizing how important it is to hire professionals with certain skillsets. Specifically, those who are certified in AI are becoming sought after in the job market.
  3. Hands-on experience: There’s a vast difference between theoretical and practical knowledge. One needs to familiarize oneself with the latest tools and technologies used by the industry. This is possible only if the individual is willing to work on projects and build things from scratch.

Despite all the promises, AI does prove to be a threat to job holders, if they don’t upskill or reskill themselves. The upcoming AI revolution will definitely disrupt the way we work, however, it will leave room for humans to perform more creative jobs in the future corporate world.

So a word of advice is to be prepared and stay future ready.

Visual Question Answering with Keras – Part 1

This is Part I of II of the Article Series Visual Question Answering with Keras

Making Computers Intelligent to answer from images

If we look closely at the history of Artificial Intelligence (AI), Deep Learning has gained popularity in recent years and has achieved human-level performance in tasks such as Speech Recognition, Image Classification, Object Detection, Machine Translation and so on. As humans, not only we but even a five-year-old child can normally perform these tasks without much inconvenience. But the development of systems with these capabilities has always been considered an ambitious goal for researchers as well as for developers.

In this series of blog posts, I will cover an introduction to something called VQA (Visual Question Answering), its available datasets, the Neural Network approach for VQA and its implementation in Keras and the applications of this challenging problem in real life. 

Table of Contents:

1 Introduction

2 What is exactly Visual Question Answering?

3 Prerequisites

4 Datasets available for VQA

4.1 DAQUAR Dataset

4.2 CLEVR Dataset

4.3 FigureQA Dataset

4.4 VQA Dataset

5 Real-life applications of VQA

6 Conclusion

 

  1. Introduction:

Let’s say you are given the picture below along with one question. Can you answer it?

I confidently expect you all to say it is the kitchen, without much inconvenience, which is also the right answer. Even a five-year-old child who has just started to learn things might answer this question correctly.

Alright, but can you write a computer program for such a task that takes an image and a question about the image as input and gives us the answer as output?

Before the development of deep neural networks, this problem was considered one of the most difficult, inconceivable and challenging problems for the AI research community. However, due to the recent advancement of Deep Learning, systems are capable of answering these questions with promising results if we have the required dataset.

Now I hope you have got at least some intuition of a problem that we are going to discuss in this series of blog posts. Let’s try to formalize the problem in the below section.

  2. What is exactly Visual Question Answering?:

We can define, “Visual Question Answering(VQA) is a system that takes an image and natural language question about the image as an input and generates natural language answer as an output.”

VQA is a research area that requires an understanding of vision (Computer Vision) as well as text (NLP). The main beauty of VQA is that the reasoning part is performed in the context of the image. So if we have an image with a corresponding question, then the system must be able to understand the image well in order to generate an appropriate answer. For example, if the question is about the number of persons, then the system must be able to detect the faces of the persons. To answer the color of a horse, the system needs to detect the objects in the image. Many of these common problems such as face detection, object detection, binary object classification (yes or no), etc. have been solved in the field of Computer Vision with good results.

To summarize a good VQA system must be able to address the typical problems of CV as well as NLP.

To get a better feel for VQA you can try the online VQA demo by CloudCV. Just go to this link, try uploading a picture you want and ask a question related to the picture; the system will generate the answer to it.

 

  3. Prerequisites:

In the next post, I will walk you through the code for this problem using Keras. So I assume that you are familiar with:

  1. Fundamental concepts of Machine Learning
  2. Multi-Layered Perceptron
  3. Convolutional Neural Network
  4. Recurrent Neural Network (especially LSTM)
  5. Gradient Descent and Backpropagation
  6. Transfer Learning
  7. Hyperparameter Optimization
  8. Python and Keras syntax
  4. Datasets available for VQA:

As you know, for problems related to CV or NLP the availability of a dataset is the key to solving the problem. For complex problems like VQA, the dataset must cover all possibilities of questions and answers in real-world scenarios. In this section, I will cover some of the datasets available for VQA.

4.1 DAQUAR Dataset:

The DAQUAR dataset is the first dataset for VQA and contains only indoor scenes. The human baseline accuracy on this dataset is 50.2%. It contains images from the NYU_Depth dataset.

Example of DAQUAR dataset

The main disadvantage of DAQUAR is that the dataset is too small to capture all possible indoor scenes.

4.2 CLEVR Dataset:

The CLEVR Dataset from Stanford contains questions about objects of different types, colors, shapes, sizes, and materials.

It has

  • A training set of 70,000 images and 699,989 questions
  • A validation set of 15,000 images and 149,991 questions
  • A test set of 15,000 images and 14,988 questions

Image Source: https://cs.stanford.edu/people/jcjohns/clevr/?source=post_page

 

4.3 FigureQA Dataset:

FigureQA Dataset contains questions about the bar graphs, line plots, and pie charts. It has 1,327,368 questions for 100,000 images in the training set.

4.4 VQA Dataset:

Compared to all the datasets we have seen so far, the VQA dataset is relatively large. It contains open-ended as well as multiple-choice questions. The VQA v2 dataset contains:

  • 82,783 training images from COCO (common objects in context) dataset
  • 40,504 validation images and 81,434 test images
  • 443,757 question-answer pairs for training images
  • 214,354 question-answer pairs for validation images.

As you might expect, this dataset is huge; the training images alone take up 12.6 GB. I have used this dataset in the next post, but only a very small subset of it.

This dataset also contains abstract cartoon images. Each image has 3 questions and each question has 10 multiple choice answers.

  5. Real-life applications of VQA:

There are many applications of VQA. One of the most notable is helping visually impaired and blind people. In 2016, Microsoft released the “Seeing AI” app, which describes the surrounding environment for visually impaired people. You can watch this video for the prototype of the Seeing AI app.

Another application could be on social media or e-commerce sites. VQA can be also used for educational purposes.

  6. Conclusion:

I hope this explanation will give you a good idea of Visual Question Answering. In the next blog post, I will walk you through the code in Keras.

If you like my explanations, do provide some feedback, comments, etc. and stay tuned for the next post.

Understanding Dropout and implementing it on MNIST dataset

Over-fitting is a major problem in deep learning, and a plethora of techniques have been introduced to prevent it. One of the most effective ones is called “dropout.” Let’s use the analogy of a person going to the gym to understand this. Say the person mostly uses his dominant arm, his right arm, to pick up weights. After some time, he notices that his dominant arm is developing large muscles, but not the other arm. So, what can he do? Obviously, he needs to involve both arms while training. Sometimes he should stop using his right arm and use the left arm to lift weights, and vice versa.

Something like this commonly happens in neural networks. Sometimes one part of the network has very large weights and ends up dominating the training, while another part of the network remains weak and does not really play a role. So what dropout does to solve this problem is randomly shut off some nodes and stop the gradients flowing through them. Our forward and back propagation then happen without those nodes, and the rest of the nodes need to pick up the slack and be more active in the training. We define a probability of a node getting dropped. For example, P=0.5 means there is a 50% chance a node will be dropped.

Figure 1 demonstrates the dropout technique, taken from the original research paper.

Dropout in a neuronal Net

Our network can never rely on any given node because it can be squashed at any given time. Hence the network is forced to learn redundant representations for everything to make sure at least some of the information remains. Redundant representations make our network more robust. Dropout also acts as an ensemble of many networks: since random nodes are dropped in every epoch, each time the network is different, and ensembles of different networks perform better than a single network since they capture more randomness. Please note, only non-output nodes are dropped.

Let’s look at the Python code to implement dropout in a neural network:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)

import tensorflow as tf

# Parameters
learning_rate = 0.00001
epochs = 10
batch_size = 128

# Number of samples to calculate validation and accuracy
test_valid_size = 256

# Network Parameters
n_classes = 10  # MNIST total classes (0-9 digits)
dropout = 0.75  # Dropout, probability to keep units


# layers weight & bias
weights = {
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024, n_classes]))}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))}

#function that implements Convolution layer
def conv2d(x, W, b, strides=1):
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)

#defining a function to implement maxpool layers
def maxpool2d(x, k=2):
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

#Function that defines all the convolution layers.
def conv_net(x, weights, biases, dropout):
    # Layer 1 - 28*28*1 to 14*14*32
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    conv1 = maxpool2d(conv1, k=2)

    # Layer 2 - 14*14*32 to 7*7*64
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = maxpool2d(conv2, k=2)


    # Fully connected layer - 7*7*64 to 1024
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)  # Implementing the dropout layer

    # Output Layer - class prediction - 1024 to 10
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

# tf Graph input
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32) # Keep probability for dropout layers

# Model
logits = conv_net(x, weights, biases, keep_prob)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Accuracy
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    for epoch in range(epochs):
        for batch in range(mnist.train.num_examples//batch_size):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y, keep_prob: dropout})

            # Calculate batch loss and accuracy
            loss = sess.run(cost, feed_dict={x: batch_x, y: batch_y, keep_prob: 1.})
            valid_acc = sess.run(accuracy, feed_dict={
                x: mnist.validation.images[:test_valid_size],
                y: mnist.validation.labels[:test_valid_size],
                keep_prob: 1.})  # keep all nodes during evaluation, so keep_prob is 1.

            print('Epoch {:>2}, Batch {:>3} - Loss: {:>10.4f} Validation Accuracy: {:.6f}'.format(
                epoch + 1,
                batch + 1,
                loss,
                valid_acc))

    # Calculate Test Accuracy
    test_acc = sess.run(accuracy, feed_dict={
        x: mnist.test.images[:test_valid_size],
        y: mnist.test.labels[:test_valid_size],
        keep_prob: 1.})
    print('Testing Accuracy: {}'.format(test_acc))

 



Sentiment Analysis of IMDB reviews

 

Download the Data Sources

The data sources used in this article can be downloaded here:

Interview – The Importance of Machine Learning for the Data Driven Business

To become more data-driven, organizations must mature their analytics and automate more of their decision-making processes for innovation and differentiation. Data science seems like the right approach, yet it is a new and fast-moving field that seems to have as many dead ends as it has highways to value. Cloudera Fast Forward Labs, led by Hilary Mason, shows companies the way.

Alice Albrecht is a research engineer at Cloudera Fast Forward Labs.  She spends her days researching the latest and greatest in machine learning and artificial intelligence and bringing that knowledge to working prototypes and delivering concrete advice for clients.  Prior to joining Fast Forward Labs, Alice worked in both finance and technology companies as a practicing data scientist, data science leader, and – most recently – a data product manager.  In addition to teaching machines to do cool things, Alice is passionate about mentoring and helping others grow in their careers.  Alice holds a PhD from Yale in cognitive neuroscience where she studied how humans summarize sensory information from the world around them and the neural substrates that underlie those summaries.

Read this article in German:
“Interview – Die Bedeutung von Machine Learning für das Data Driven Business“

Data Science Blog: Ms. Albrecht, you are a well-known keynote speaker for data science and artificial intelligence. While data science has already arrived in business, deep learning seems to be the new trend. Is artificial intelligence for business already normal business or is it an overrated hype?

I’d say it isn’t either of those two options.  Data science is now widely adopted but companies still struggle to integrate this new discipline into their existing businesses.  As for deep learning, it really depends on the company that’s looking into using this technique.  I wouldn’t say that deep learning is by any means part of business as usual- nor should it be.  It’s a tool like any other and building a capacity for using a tool without clearly defined business needs is a recipe for disaster.

Data Science Blog: Just to make sure what we are talking about: What are the differences and overlaps between data analytics, data science, machine learning, deep learning and artificial intelligence?

Here at Cloudera Fast Forward Labs, we like to think of data analytics as collecting data and counting things (mostly for quick charts and reports).  Data science solves business problems by counting cleverly and predicting things with the data that’s collected.  Machine learning is about solving problems with new kinds of feedback loops that improve with more data.  Deep learning is a particular type of machine learning and is not itself a separate concept or type of tool.  Artificial intelligence taps into something more complicated than what we’re seeing today – it’s much broader than training machines to repetitively do very specialized tasks or solve very narrow problems.

Data Science Blog: And how can we add the context to big data?

From a theoretical perspective, data science has been around for decades. The building blocks for modern-day machine learning, deep learning and artificial intelligence are based on mathematical theorems that go back to the 1940’s and 1950’s. The challenge was that at the time, compute power and data storage capacity were simply too expensive for the approaches to be implemented. Today that’s all changed. Not only has the cost of data storage dropped considerably, open source technology like Apache Hadoop has made it possible to store any volume of data at costs approaching zero. Compute power, even highly specialised chip architectures, is now also available on demand and only for the time organisations need it, through public and private cloud solutions. The decreased cost of both data storage and compute power, together with a growing list of tools and resources readily available via the open source community, allows companies of any size to benefit from data (no matter the size of that data).

Data Science Blog: What are the challenges for organizations in getting started with data science?

I see two big challenges when getting started with data science.  One is ensuring that you have organizational alignment around exactly what type of work data scientists will deliver (and timing for those projects).  The second hurdle is around ensuring that you have the right data in place before you start hiring data scientists. This can be tricky if you don’t have in-house expertise in this area, so sometimes it’s better to hire a data engineer or a data strategist (or director of data science) before you ever get started building out a data science team.

Data Science Blog: There are many discussions about how to build a data-driven business. Is it just about using data science to get a better understanding of customer behavior?

No, being data driven doesn’t just mean better understanding your customers (though that is one way that data science can help in an organization).  Aside from building an organization that relies on data and analytics to help them make decisions (about customer behavior or otherwise), being a data-driven business means that data is powering your core products.

Data Science Blog: The number of technologies, tools and frameworks is increasing. For organizations this also means increasing complexity. Do companies need to stay always up-to-date or could it be an advice to wait and imitate pioneers later?

While it’s not critical (or advisable) for organizations to adopt every new advancement that comes along, it is critical for them to stay abreast of emerging frameworks. If a business waits to see what others are doing, and therefore doesn’t invest in understanding how new advancements can affect its particular business, it has likely already missed the boat.

Data Science Blog: Global players have big budgets just for doing research and setting up data labs. Middle-sized companies need to see the break even point soon. How can we accelerate the value generation of data science?

Having a team that is highly focused on a specific set of projects that are well-scoped and aligned to the business makes all the difference.  Data science and machine learning don’t have to sacrifice doing research and being innovative in order to produce value.  The biggest difference is that smaller teams will have to be more aware of how their choice of project fits into emerging frameworks and their particular acute and near term business needs.

Data Science Blog: How does Cloudera Fast Forward Labs help other organizations to accelerate their start with machine learning?

We advise organizations, based on their particular needs, on what the latest advancements are in machine learning and data science, how to build and structure their data teams to develop the capabilities they need to meet their goals, and how to quickly implement custom forward-looking solutions using their own data and in-house expertise.

Data Science Blog: Finally, a question for our younger readers who are looking for a career as a data expert: What makes a good data scientist? Do you like to work with introverted coding nerds or the data loving business experts?

A good data scientist should be deeply curious and have a love for the ways in which data can lead to new discoveries and power the next generation of products. We expect the people who thrive in this field to come from a variety of backgrounds and experiences.

Deep Learning and Human Intelligence – Part 1 of 2

Many people are under the impression that the new wave of data science, machine learning and/or digitalization is new, that it did not exist before. But its history is as long as the history of humanity and/or science itself.  The scientific discovery could hardly take place without the necessary data. Even the process of discovering the numbers included elements of machine learning: pattern recognition, comparison between different groups (ranking), clustering, etc. So what differentiates mathematical formulas from machine learning and how does it relate to artificial intelligence?

There is no difference between the two if seen from the perspective of formulas; however, such a perspective limits the type of data to which they can be applied. Data stored in tables consist of structured data and are stored in so-called relational databases. The reason for such data storage is the connection between different fields that assume a well-established structure in advance, such as a company’s sales or balance sheet. However, with the emergence of personal computers, many daily activities have been digitalized: music, pictures, movies, and so on. All this information is stored unrelated to other data and is therefore called unstructured data.

Copyright: IEEE International Conference on Computer Vision (ICCV), 2015, DOI: 10.1109/ICCV.2015.428

The essence of scientific discoveries was and will be structure. Not surprisingly, mathematical formulas revolve around relations between variables – information, in general. For example, Galileo derived the law of falling bodies by measuring the successive heights of a falling ball. The main difficulty was to obtain measurements at regular time intervals. But what if the data is not structured – which mathematical formula should be applied then? There is a distribution of people’s heights, but no distribution for the pictures taken on all holidays during the last year; there is an amplitude for acoustic signals, but no function that detects the similarity between two songs. This is one of the reasons why machine learning focuses heavily on clustering and classification.

Roughly speaking, these simple examples are enough to categorize the difference between scientific discovery and machine learning. Science is about discovering relationships between different variables; Machine Learning tries to automate processes. Every technical improvement is part of automation, so why is everything different in this case? Because the current automation deals with human intelligence. The car automates walking, the kitchen stove the fire, but Machine Learning automates parts of human intelligence. There is a difference between the previous automation steps and those of human intelligence. All the previous ones are either outside the human body – such as fire – or unconsciously executed (once learned) – walking, spinning, etc. The automation induced by Machine Learning affects a part of human intelligence that we consciously perceive. Of course, today’s machine learning tools are unable to automate all of human intelligence, but it is a fascinating step in that direction.

A breakthrough in Machine Learning tasks was achieved in 2012 when the first Deep Learning algorithm for detecting types of images reached near-human accuracy. It could estimate the likelihood that an image is a human face, a train, a ball or a fish without having “seen” the picture before. Such an algorithm can be used in various areas: personally – facial recognition in pictures and/or social media – as tagging of images or videos, medicine – cancer detection, etc. For understanding such cutting-edge classification issues, one cannot avoid understanding how Deep Learning works. To see the beauty of such algorithms and, at the same time, to be able to comprehend the difficulty of working with them, an example will be the best guide.

The building blocks of Deep Learning are neurons, operational units which perform mathematical operations or logical operations like AND, OR, etc., and are modelled after the neurons in the brain. Already in the 1950’s, two neuroscientists, Hubel and Wiesel, observed that not all neurons in the brain respond in the same fashion to visual stimuli. Some responded only to horizontal lines, whereas others to vertical lines; in other words, the brain is constructed from specialized neurons. Groups of such neurons are called, in the Machine Learning community, layers. Like in the brain, neurons with different properties are clustered in different layers. This implies that layers also have specific properties and have to be arranged in a specific way, called an architecture. It is this architecture which differentiates Deep Learning from Artificial Neural Networks (an ANN is similar to a single layer).

Unfortunately, scientists still haven’t figured out how the brain works, thus discovering how to train Deep Learning from data was not an easy task, and this is also the reason why another example is used to explain the training of Deep Learning: the eye. One always has to remember: once it is known how Deep Learning works, it is simple to find examples which illustrate the working mechanism. For such an analogy, it is sufficient for someone without any knowledge of Deep Learning to keep in mind only the elements that compose such architectures: input data, different layers of neurons, output layers, and ReLUs.

Input data are any type of information; in our example it is light. Of course, Deep Learning is not limited only to images or videos, but extends also to sound and/or time series, in which case the example would be the ear and sound waves, or the brain and numbers.

Layers can be seen as the cells in the eye. It is well known that the eye is formed of different layers connected to each other, with each of them having different properties and functionalities. The same is true for the layers of a Deep Learning architecture: one can see the neurons as the cells and the layer as the tissue. While, mathematically, the neurons are nothing more than simple operations, usually linear weight functions, they can be seen as the properties of individual cells. Each layer has one weight matrix, which gives the neurons (and the layer) specific properties depending on the data and the task at hand.

It is here that the architecture becomes very important. What Deep Learning offers is a default setting of the layers with unknown weights. One can see this as trying to build an eye knowing that there are different types of cells and different ways in which tissues of such cells can be arranged, but not knowing exactly which cell is needed (with what properties) and which arrangement of layers works best. Such an approach has the advantage that one is capable of building any type of organ desired, but the disadvantage is also very obvious: it is time-consuming to find the appropriate cell properties and layer arrangements.

Still, the strategy of Deep Learning is a significant departure from classical Machine Learning approaches. The performance of Machine Learning methods is only as good as the feature engineering performed by Data Scientists, and thus depends on the creativity of the Data Scientist. In the case of Deep Learning, the engineering of the features is performed automatically as part of the model building. This is a huge improvement, as the only difficult task is to have enough data and computing power to find the right weight matrices. Such an endeavor was also performed by nature for the eye – and this is also the reason why one can choose it as an example for Deep Learning – through evolution. It is not surprising that Deep Learning is one of the best directions scientists have toward Artificial Intelligence today.

The evolution of the eye can be seen, from the perspective of Data Scientists, as the continuous training of a Deep Learning architecture which enables it to recognize and track one or more objects. The performance of the evolutionary process can be summed up as the fine-tuning of the cells, which become more and more susceptible to light, and the adaptation of the layers to enable better vision. Different animals in different environments and with different targets – such as the hawk and the fly – developed different eyes than humans, but they all work according to the same principle. The tasks that Deep Learning is performing today are similar; for example it can be used to drive cars, but there is still a difference: there is no connection to other organs. Deep Learning is not the approximation of an Artificial Organism, like an android, but a simplified Artificial Organ that can work on its own.

Returning to the working mechanism of the Deep Learning architecture, we can already follow the analogy of what happens if a ray of light hits the eye. Once the eye is fully adapted to the task, one can follow how the information enters the Deep Learning architecture (the Artificial Eye) by penetrating the input layer. Already here the question arises: what kind of eye is best? One where a small source of light can reach as many neurons as possible, or one where the light source reaches only a few neurons? In order to make such a decision, a last piece of the puzzle is required: the ReLU. One can see these as the synapses between neurons (cells) and/or similarly for tissue. By using continuous functions, such as the shape of the letter ‘S’ (called sigmoid), the information from one neuron is distributed over a large number of other neurons. If one uses the maximum function, then only a few neurons are updated with processed information from earlier layers.

Such sparse structures between neurons were a major improvement in the development of techniques for training Deep Learning architectures. Again, it has a strong evolutionary analogy: energy efficiency. By needing fewer neurons, the tissues and architecture are both kept to a minimal size, which enables flexibility in development and requires less energy. As the information is processed by the different layers, the Artificial Eye gathers more and more complex (non-linear) structures – the adapted features – which help to decide, from past experience, what kind of object is detected.

This was part 1 of 2 of the article series. Continue with Part 2.