Geschriebene Artikel über Big Data Analytics

Visual Question Answering with Keras – Part 2: Making Computers Intelligent to answer from images

Making Computers Intelligent to answer from images

This is my second blog on Visual Question Answering, in the last blog, I have introduced to VQA, available datasets and some of the real-life applications of VQA. If you have not gone through then I would highly recommend you to go through it. Click here for more details about it.

In this blog post, I will walk through the implementation of VQA in Keras.

You can download the dataset from here: https://visualqa.org/index.html. All my experiments were performed with VQA v2 and I have used a very tiny subset of entire dataset i.e all samples for training and testing from the validation set.

Table of contents:

  1. Preprocessing Data
  2. Process overview for VQA
  3. Data Preprocessing – Images
  4. Data Preprocessing through the spaCy library- Questions
  5. Model Architecture
  6. Defining model parameters
  7. Evaluating the model
  8. Final Thought
  9. References

NOTE: The purpose of this blog is not to get the state-of-art performance on VQA. But the idea is to get familiar with the concept. All my experiments were performed with the validation set only.

Full code on my Github here.


1. Preprocessing Data:

If you have downloaded the dataset then the question and answers (called as annotations) are in JSON format. I have provided the code to extract the questions, annotations and other useful information in my Github repository. All extracted information is stored in .txt file format. After executing code the preprocessing directory will have the following structure.

All text files will be used for training.

 

2. Process overview for VQA:

As we have discussed in previous post visual question answering is broken down into 2 broad-spectrum i.e. vision and text.  I will represent the Neural Network approach to this problem using the Convolutional Neural Network (for image data) and Recurrent Neural Network(for text data). 

If you are not familiar with RNN (more precisely LSTM) then I would highly recommend you to go through Colah’s blog and Andrej Karpathy blog. The concepts discussed in this blogs are extensively used in my post.

The main idea is to get features for images from CNN and features for the text from RNN and finally combine them to generate the answer by passing them through some fully connected layers. The below figure shows the same idea.

 

I have used VGG-16 to extract the features from the image and LSTM layers to extract the features from questions and combining them to get the answer.

3. Data Preprocessing – Images:

Images are nothing but one of the input to our model. But as you already may know that before feeding images to the model we need to convert into the fixed-size vector.

So we need to convert every image into a fixed-size vector then it can be fed to the neural network. For this, we will use the VGG-16 pretrained model. VGG-16 model architecture is trained on millions on the Imagenet dataset to classify the image into one of 1000 classes. Here our task is not to classify the image but to get the bottleneck features from the second last layer.

Hence after removing the softmax layer, we get a 4096-dimensional vector representation (bottleneck features) for each image.

Image Source: https://www.cs.toronto.edu/~frossard/post/vgg16/

 

For the VQA dataset, the images are from the COCO dataset and each image has unique id associated with it. All these images are passed through the VGG-16 architecture and their vector representation is stored in the “.mat” file along with id. So in actual, we need not have to implement VGG-16 architecture instead we just do look up into file with the id of the image at hand and we will get a 4096-dimensional vector representation for the image.

4. Data Preprocessing through the spaCy library- Questions:

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. As we have converted images into a fixed 4096-dimensional vector we also need to convert questions into a fixed-size vector representation. For installing spaCy click here

You might know that for training word embeddings in Keras we have a layer called an Embedding layer which takes a word and embeds it into a higher dimensional vector representation. But by using the spaCy library we do not have to train the get the vector representation in higher dimensions.

 

This model is actually trained on billions of tokens of the large corpus. So we just need to call the vector method of spaCy class and will get vector representation for word.

After fitting, the vector method on tokens of each question will get the 300-dimensional fixed representation for each word.

5. Model Architecture:

In our problem the input consists of two parts i.e an image vector, and a question, we cannot use the Sequential API of the Keras library. For this reason, we use the Functional API which allows us to create multiple models and finally merge models.

The below picture shows the high-level architecture idea of submodules of neural network.

After concatenating the 2 different models the summary will look like the following.

The below plot helps us to visualize neural network architecture and to understand the two types of input:

 

6. Defining model parameters:

The hyperparameters that we are going to use for our model is defined as follows:

If you know what this parameter means then you can play around it and can get better results.

Time Taken: I used the GPU on https://colab.research.google.com and hence it took me approximately 2 hours to train the model for 5 epochs. However, if you train it on a PC without GPU, it could take more time depending on the configuration of your machine.

7. Evaluating the model:

Since I have used the very small dataset for performing these experiments I am not able to get very good accuracy. The below code will calculate the accuracy of the model.

 

Since I have trained a model multiple times with different parameters you will not get the same accuracy as me. If you want you can directly download mode.h5 file from my google drive.

 

8. Final Thoughts:

One of the interesting thing about VQA is that it a completely new field. So there is absolutely no end to what you can do to solve this problem. Below are some tips while replicating the code.

  1. Start with a very small subset of data: When you start implementing I suggest you start with a very small amount of data. Because once you are ready with the whole setup then you can scale it any time.
  2. Understand the code: Understanding code line by line is very much helpful to match your theoretical knowledge. So for that, I suggest you can take very few samples(maybe 20 or less) and run a small chunk (2 to 3 lines) of code to get the functionality of each part.
  3. Be patient: One of the mistakes that I did while starting with this project was to do everything at one go. If you get some error while replicating code spend 4 to 5 days harder on that. Even after that if you won’t able to solve, I would suggest you resume after a break of 1 or 2 days. 

VQA is the intersection of NLP and CV and hopefully, this project will give you a better understanding (more precisely practically) with most of the deep learning concepts.

If you want to improve the performance of the model below are few tips you can try:

  1. Use larger datasets
  2. Try Building more complex models like Attention, etc
  3. Try using other pre-trained word embeddings like Glove 
  4. Try using a different architecture 
  5. Do more hyperparameter tuning

The list is endless and it goes on.

In the blog, I have not provided the complete code you can get it from my Github repository.

9. References:

  1. https://blog.floydhub.com/asking-questions-to-images-with-deep-learning/
  2. https://tryolabs.com/blog/2018/03/01/introduction-to-visual-question-answering/
  3. https://github.com/sominwadhwa/vqamd_floyd

6 Important Reasons for the Java Experts to learn Hadoop Skills

You must be well aware of the fact that Java and Hadoop Skills are in high demand these days. Gone are the days when advancement work moved around Java and social database. Today organizations are managing big information. It is genuinely big. From gigabytes to petabytes in size and social databases are exceptionally restricted to store it. Additionally, organizations are progressively outsourcing the Java development jobs to different groups who are as of now having big data experts.

Ever wondered what your future would have in store for you if you possess Hadoop as well as Java skills? No? Let us take a look. Today we shall discuss the point that why is it preferable for Java Developers to learn Hadoop.

Hadoop is the Future Java-based Framework that Leads the Industry

Data analysis is the current marketing strategy that the companies are adopting these days. What’s more, Hadoop is to process and comprehend all the Big Data that is generated all the time. As a rule, Hadoop is broadly utilized by practically all organizations from big and small and in practically all business spaces. It is an open-source stage where Java owes a noteworthy segment of its success

The processing channel of Hadoop, which is MapReduce, is written in Java. Thus, a Hadoop engineer needs to compose MapReduce contents in Java for Big data analysis. Notwithstanding that, HDFS, which is the record arrangement of Hadoop, is additionally Java-based programming language at its core. Along these lines, a Hadoop developer needs to compose documents from local framework to HDFS through deployment, which likewise includes Java programming.

Learn Hadoop: It is More Comfortable for a Java Developer

Hadoop is more of an environment than a standalone innovation. Also, Hadoop is a Java-based innovation. Regardless of whether it is Hadoop 1 which was about HDFS and MapReduce or Hadoop2 biological system that spreads HDFS, Spark, Yarn, MapReduce, Tez, Flink, Giraph, Storm, JVM is the base for all. Indeed, even a portion of the broadly utilized programming languages utilized in a portion of the Hadoop biological system segments like Spark is JVM based. The run of the mill models is Scala and Clojure.

Consequently, if you have a Java foundation, understanding Hadoop is progressively easier for you. Also, here, a Hadoop engineer needs Java programming information to work in MapReduce or Spark structure. Thus, if you are as of now a Java designer with a logical twist of the brain, you are one stage ahead to turn into a Hadoop developer.

IT Industry is looking for Professionals with Java and Hadoop Skills

If you pursue the expected set of responsibilities and range of abilities required for a Hadoop designer in places of work, wherever you will watch the reference of Java. As Hadoop needs solid Java foundation, from this time forward associations are searching for Java designers as the best substitution for Hadoop engineers. It is savvy asset usage for organizations as they don’t have to prepare Java for new recruits to learn Hadoop for tasks.

Nonetheless, the accessible market asset for Hadoop is less. Therefore, there is a noteworthy possibility for Java designers in the Hadoop occupation field. Henceforth, as a Java designer, on the off chance that you are not yet arrived up in your fantasy organization, learning Hadoop, will without a doubt help you to discover the chance to one of your top picks.

Combined Java and Hadoop Skills Means Better Pay Packages

You will be progressively keen on learning Hadoop on the off chance that you investigate Gartner report on big information industry. According to the report, the Big Data industry has just come to the $50 billion points. Additionally, over 64% of the main 720 organizations worldwide are prepared to put resources into big information innovation. Notwithstanding that when you are a mix of a Java and Hadoop engineer, you can appreciate 250% pay climb with a normal yearly compensation of $150,000.It is about the yearly pay of a senior Hadoop developer.

Besides, when you change to Big Data Hadoop, it very well may be useful to improve the nature of work. You will manage unpredictable and greater tasks. It does not just give you a better extension to demonstrate your expertise yet, in addition, to set up yourself as a profitable asset who can have any kind of effect.

Adapting Big Data Hadoop can be exceptionally advantageous because it will assist you in dealing with greater, complex activities a lot simpler and convey preferable yield over your associates. To be considered for examinations, you should be somebody who can have any kind of effect in the group, and that is the thing that Hadoop lets you be.

Learning Hadoop will open New Opportunities to Other Lucrative Fields

Big data is only not going to learn Hadoop. When you are in Big information space, you have sufficient chance to jump other Java and Hadoop engineer. There are different exceedingly requesting zones in big information like Artificial Intelligence, Machine Learning, Data Science. You can utilize your Java and Hadoop engineer expertise as a springboard to take your vocation to the following level. In any case, the move will give you the best outcome once you move from Java to Hadoop and increase fundamental working knowledge.

Java with Hadoop opens new skylines of occupation jobs, for example, data scientist, data analyst business intelligence analyst, DBA, etc.

Premier organizations prefer Hadoop Developers with Java skills

Throughout the years the Internet has been the greatest driver of information, and the new data produced in 2012 remained at 2500 Exabyte. The computerized world developed by 62% a year ago to 800K petabytes and will keep on developing to the tune of 1.2 zeta bytes during the present year. Gartner gauges the market of Hadoop Ecosystem to $77 million and predicts it will come to the $813 million marks by 2016.

A review of LinkedIn profiles referencing Hadoop as their abilities uncovered that just about 17000 individuals are working in Companies like Cisco, HP, TCS, Oracle, Amazon, Yahoo, and Facebook, and so on. Aside from this Java proficient who learn Hadoop can begin their vocations with numerous new businesses like Platfora, Alpine information labs, Trifacta, Datatorrent, and so forth.

Conclusion

You can see that combining your Java skills with Hadoop skills can open the doors of several new opportunities for you. You can get better remuneration for your efforts, and you will always be in high demand. It is high time to learn Hadoop online now if you are a java developer.

How to Ensure Data Quality in an Organization?

Introduction to Data Quality

Today, the world is filled with data. It is everywhere. And, the value of any organization can be measured by the quality of its data. So, what actually is the quality of data or data quality, and why is it important? Well, data quality refers to the capability of a set of data to serve an intended purpose. 

Data quality is important to any organization because it provides timely and accurate information to manage accountability and services. It also helps to ensure and prioritize the best use of resources. Thus, high-quality data will lead to appropriate insights and valuable information for any organization. We can evaluate the quality of data in certain aspects. They include accuracy, relevancy, completeness, and uniqueness. 

Data Quality Problems

As the organizations are collecting vast amounts of data, managing its quality becomes more important every single day. In the year 2016, the costs of problems caused due to poor data quality were estimated by IBM, and it turned out to be $3.1 trillion across the U.S economy. Also, a Forrester report has stated that almost 30 percent of analysts spend 40 percent of their time validating and vetting their data prior to its utilization for strategic decision-making. These statistics indicate that the scale of the problems with data quality is vast.

So, why do these data quality problems occur? The main reasons include manual entry of data, software updates, integration of data sources, skills shortages, and insufficient testing time. Wrong decisions can be taken due to poor data management processes and poor quality of data. Because of this, many organizations lose their clients and customers. So, ensuring data quality must be given utmost importance in an organization. 

How to Ensure Data Quality?

Data quality management helps by combining data, technology, and organizational culture to deliver useful and accurate results. Good management of data quality builds a foundation for all the initiatives of a business. Now, let’s see how we can improve the data quality in an organization.

The first aspect of improving the quality of data is monitoring and cleansing data. This verifies data against standard statistical measures, validates data against matching descriptions, and uncovers relationships. This also checks the uniqueness of data and analyzes the data for its reusability. 

The second one is managing metadata centrally. Multiple people gather and clean data very often and they may work in different countries or offices. Therefore, you require clear policies on how data is gathered and managed as people in different parts of a company may misinterpret certain data terms and concepts. Centralized management of metadata is the solution to this problem as it reduces inconsistent interpretations and helps in establishing corporate standards.  

The next one is to ensure all the requirements are available and offer documentation for data processors and data providers. You have to format the specifications and offer a data dictionary and also provide training for the providers of data and all other new staff. Make sure you offer immediate help for all the data providers.

Very often, data is gathered from different sources and may include distinct spelling options. Hence, segmentation, scoring, smart lists, and many others are impacted by this. So, for entering a data point, a singular approach is essential, and data normalization provides this approach. The goal of this approach is to eliminate redundancy in data. Its advantages include easier object-to-data mapping and increased consistency.

The last aspect is to verify whether the data is consistent with the data rules and business goals, and this has to be done at regular intervals. You have to communicate the current status and data quality metrics to every stakeholder regularly to ensure the maintenance of data quality discipline across the organization.

Conclusion

Data quality is a continuous process but not a one-time project which needs the entire company to be data-focused and data-driven. It is much more than reliability and accuracy. High level of data quality can be achieved when the decision-makers have confidence in data and rely upon it. Follow the above-mentioned steps to ensure a high level of data quality in your organization. 

The Future of AI in Dental Technology

As we develop more advanced technology, we begin to learn that artificial intelligence can have more and more of an impact on our lives and industries that we have gotten used to being the same over the past decades. One of those industries is dentistry. In your lifetime, you’ve probably not seen many changes in technology, but a boom around artificial intelligence and technology has opened the door for AI in dental technologies.

How Can AI Help?

Though dentists take a lot of pride in their craft and career, most acknowledge that AI can do some things that they can’t do or would make their job easier if they didn’t have to do. AI can perform a number of both simple and advanced tasks. Let’s take a look at some areas that many in the dental industry feel that AI can be of assistance.

Repetitive, Menial Tasks

The most obvious area that AI can help out when it comes to dentistry is with repetitive and menial simple tasks. There are many administrative tasks in the dentistry industry that can be sped up and made more cost-effective with the use of AI. If we can train a computer to do some of these tasks, we may be able to free up more time for our dentists to focus on more important matters and improve their job performance as well. One primary use of AI is virtual consultations that offices like Philly Braces are offering. This saves patients time when they come in as the Doctor already knows what the next steps in their treatment will be.

Using AI to do some basic computer tasks is already being done on a small scale by some, but we have yet to see a very large scale implementation of this technology. We would expect that to happen soon, with how promising and cost-effective the technology has proven to be.

Reducing Misdiagnosis

One area that many think that AI can help a lot in is misdiagnosis. Though dentists do their best, there is still a nearly 20% misdiagnosis rate when reading x-rays in dentistry. We like to think that a human can read an x-ray better, but this may not be the case. AI technology can certainly be trained to read an x-ray and there have been some trials to suggest that they can do it better and identify key conditions that we often misread.

A world with AI diagnosis that is accurate and quicker will save time, money, and lead to better dental health among patients. It hasn’t yet come to fruition, but this seems to be the next major step for AI in dentistry.

Artificial Intelligence Assistants

Once it has been demonstrated that AI can perform a range of tasks that are useful to dentists, the next logical step is to combine those skills to make a fully-functional AI dental assistant. A machine like this has not yet been developed, but we can imagine that it would be an interface that could be spoken to similar to Alexa. The dentist would request vital information and other health history data from a patient or set of patients to assist in the treatment process. This would undoubtedly be a huge step forward and bring a lot of computing power into the average dentist office.

Conclusion

It’s clear that AI has a bright future in the dental industry and has already shown some of the essential skills that it can help with in order to provide more comprehensive and accurate care to dental patients. Some offices like Westwood Orthodontics already use AI in the form of a virtual consult to diagnose issues and provide treatment options before patients actually step foot in the office. Though not nearly all applications that AI can provide have been explored, we are well on our way to discovering the vast benefits of artificial intelligence for both patients and practices in the dental healthcare industry.

Interview – Customer Data Platform, more than CRM 2.0?

Interview with David M. Raab from the CDP Institute

David M. Raab is as a consultant specialized in marketing software and service vendor selection, marketing analytics and marketing technology assessment. Furthermore he is the founder of the Customer Data Platform Institute which is a vendor-neutral educational project to help marketers build a unified customer view that is available to all of their company systems.

Furthermore he is a Keynote-Speaker for the Predictive Analytics World Event 2019 in Berlin.

Data Science Blog: Mr. Raab, what exactly is a Customer Data Platform (CDP)? And where is the need for it?

The CDP Institute defines a Customer Data Platform as „packaged software that builds a unified, persistent customer database that is accessible by other systems“.  In plainer language, a CDP assembles customer data from all sources, combines it into customer profiles, and makes the profiles available for any use.  It’s important because customer data is collected in so many different systems today and must be unified to give customers the experience they expect.

Data Science Blog: Is it something like a CRM System 2.0? What Use Cases can be realized by a Customer Data Platform?

CRM systems are used to interact directly with customers, usually by telephone or in the field.  They work almost exclusively with data that is entered during those interactions.  This gives a very limited view of the customer since interactions through other channels such as order processing or Web sites are not included.  In fact, one common use case for CDP is to give CRM users a view of all customer interactions, typically by opening a window into the CDP database without needing to import the data into the CRM.  There are many other use cases for unified data, including customer segmentation, journey analysis, and personalization.  Anything that requires sharing data across different systems is a CDP use case.

Data Science Blog: When does a CDP make sense for a company? It is more relevant for retail and financial companies than for industrial companies, isn´t it?

CDP has been adopted most widely in retail and online media, where each customer has many interactions and there are many products to choose from.  This is a combination that can make good use of predictive modeling, which benefits greatly from having more complete data.  Financial services was slower to adopt, probably because they have fewer products but also because they already had pretty good customer data systems.  B2B has also been slow to adopt because so much of their customer relationship is handled by sales people.  We’ve more recently been seeing growth in additional sectors such as travel, healthcare, and education.  Those involve fewer transactions than retail but also rely on building strong customer relationships based on good data.

Data Science Blog: There are several providers for CDPs. Adobe, Tealium, Emarsys or Dynamic Yield, just to name some of them. Do they differ a lot between each other?

Yes they do.  All CDPs build the customer profiles I mentioned.  But some do more things, such as predictive modeling, message selection, and, increasingly, message delivery.  Of course they also vary in the industries they specialize in, regions they support, size of clients they work with, and many technical details.  This makes it hard to buy a CDP but also means buyers are more likely to find a system that fits their needs.

Data Science Blog: How established is the concept of the CDP in Europe in general? And how in comparison with the United States?

CDP is becoming more familiar in Europe but is not as well understood as in the U.S.  The European market spent a lot of money on Data Management Platforms (DMPs) which promised to do much of what a CDP does but were not able to because they do not store the level of detail that a CDP does.  Many DMPs also don’t work with personally identifiable data because the DMPs primarily support Web advertising, where many customers are anonymous.  The failures of DMPs have harmed CDPs because they have made buyers skeptical that any system can meet their needs, having already failed once.  But we are overcoming this as the market becomes better educated and more success stories are available.  What’s the same in Europe and the U.S. is that marketers face the same needs.  This will push European marketers towards CDPs as the best solution in many cases.

Data Science Blog: What are coming trends? What will be the main topic 2020?

We see many CDPs with broader functions for marketing execution: campaign management, personalization, and message delivery in particular.  This is because marketers would like to buy as few systems as possible, so they want broader scope in each systems.  We’re seeing expansion into new industries such as financial services, travel, telecommunications, healthcare, and education.  Perhaps most interesting will be the entry of Adobe, Salesforce, and Oracle, who have all promised CDP products late this year or early next year.  That will encourage many more people to consider buying CDPs.  We expect that market will expand quite rapidly, so current CDP vendors will be able to grow even as Adobe, Salesforce, and Oracle make new CDP sales.


You want to get in touch with Daniel M. Raab and understand more about the concept of a CDP? Meet him at the Predictive Analytics World 18th and 19th November 2019 in Berlin, Germany. As a Keynote-Speaker, he will introduce the concept of a Customer Data Platform in the light of Predictive Analytics. Click here to see the agenda of the event.

 


 

Marketing Attribution Models

Why do we need attribution?

Attributionis the process of distributing the value of a purchase between the various channels, used in the funnel chain. It allows you to determine the role of each channel in profit. It is used to assess the effectiveness of campaigns, to identify more priority sources. The competent choice of the model makes it possible to optimally distribute the advertising budget. As a result, the business gets more profit and less expenses.

What models of attribution exist

The choice of the appropriate model is an important issue, because depending on the business objectives, it is better to fit something different. For example, for companies that have long been present in the industry, the priority is to know which sources contribute to the purchase. Recognition is the importance for brands entering the market. Thus, incorrect prioritization of sources may cause a decrease in efficiency. Below are the models that are widely used in the market. Each of them is guided by its own logic, it is better suited for different businesses.

First Interaction (First Click)

The value is given to the first touch. It is suitable only for several purposes and does not make it possible to evaluate the role of each component in making a purchase. It is chosen by brands who want to increase awareness and reach.

Advantages

It does not require knowledge of programming, so the introduction of a business is not difficult. A great option that effectively assesses campaigns, aimed at creating awareness and demand for new products.

Disadvantages

It limits the ability to analyze comprehensively all channels that is used to promote a brand. It gives value to the first interaction channel, ignoring the rest.

Who is suitable for?

Suitable for those who use the promotion to increase awareness, the formation of a positive image. Also allows you to find the most effective source.

Last Interaction (Last Click)

It gives value to the last channel with which the consumer interacted before making the purchase. It does not take into account the actions that the user has done up to this point, what marketing activities he encountered on the way to conversion.

Advantages

The tool is widely used in the market, it is not difficult. It solves the problem of small advertising campaigns, where is no more than 3 sources.

Disadvantages

There is no way to track how other channels have affected the acquisition.

Who is suitable for?

It is suitable for business models that have a short purchase cycle. This may be souvenirs, seasonal offers, etc.

Last Non-Direct Click

It is the default in Google Analytics. 100% of the  conversion value gives the last channel that interacted with the buyer before the conversion. However, if this source is Direct, then assumptions are counted.

Suppose a person came from an email list, bookmarked a product, because at that time it was not possible to place an order. After a while he comes back and makes a purchase. In this case, email as a channel for attracting users would be underestimated without this model.

Who is suitable for?

It is perfect for beginners who are afraid of making a mistake in the assessment. Because it allows you to form a general idea of ​​the effectiveness of all the involved channels.

Linear model attribution (Linear model)

The value of the conversion is divided in equal parts between all available channels.

Linear model attribution (Linear model)

Advantages

More advanced model than previous ones, however, characterized by simplicity. It takes into account all the visits before the acquisition.

Disadvantages

Not suitable for reallocating the budget between the channels. This is due to the fact that the effectiveness of sources may differ significantly and evenly divide – it is not the best idea. 

Who is suitable for?

It is performing well for businesses operating in the B2B sector, which plays a great importance to maintain contact with the customer during the entire cycle of the funnel.

Taking into account the interaction duration (Time Decay)

A special feature of the model is the distribution of the value of the purchase between the available channels by increment. Thus, the source, that is at the beginning of the chain, is given the least value, the channel at the end deserves the greatest value.  

Advantages

Value is shared between all channel. The highest value is given to the source that pushed the user to make a purchase.

Disadvantages

There is no fair assessment of the effectiveness of the channels, that have made efforts to obtain the desired result.

Who is suitable for?

It is ideal for evaluating the effectiveness of advertising campaigns with a limited duration.

Position-Based or U-Shaped

40% receive 2 channels, which led the user and pushed him to purchase. 20% share among themselves the intermediate sources that participated in the chain.

Advantages

Most of the value is divided equally between the key channels – the fact that attracted the user and closed the deal..

Disadvantages

Underestimated intermediate channels.It happens that they make it possible to more effectively promote the user chain.. Because they allow you to subscribe to the newsletter or start following the visitor for price reduction, etc.

Who is suitable for?

Interesting for businesses that focus on attracting new audiences, as well as pushing existing customers to buy.

Cons of standard attribution models

According to statistics, only 44% of foreign experts use attribution on the last interaction. Speaking about the domestic market, we can announce the numbers are much higher. However, only 18% of marketers use more complex models. There is also evidence which demonstrates that 72.4% of those who use attribution based on the last interaction, they use it not because of efficiency, but because it is simple.

What leads to a similar state of affairs?

Experts do not understand the effectiveness. Ignorance of how more complex models work leads to a lack of understanding of the real benefits for the business.

Attribution management is distributed among several employees. In view of this, different models can be used simultaneously. This approach greatly distorts the data obtained, not allowing an objective assessment of the effect of channels.

No comprehensive data storage. Information is stored in different places and does not take into account other channels. Using the analytics of the advertising office, it is impossible to work with customers in retail outlets.

You may find ways to eliminate these moments and attribution will work for the benefit of the business.

What algorithmic attribution models exist

Using one channel, there is no need to enable complex models. Attribution will be enough for the last interaction. It has everything to evaluate the effectiveness of the campaign, determine the profitability, understand the benefits for the business.

Moreover, if the number of channels increases significantly, and goals are already far beyond recognition, it will be better to give preference to more complex models. They allow you to collect all the information in one place, open up limitless monitoring capabilities, make it clear how one channel affects the other and which bundles work better together.

Below are the well-known and widely used today algorithmic attribution models.

Data-Driven Attribution

A model that allows you to track all the way that the consumer has done before making a purchase. It objectively evaluates each channel and does not take into account the position of the source in the funnel. It demonstrates how a certain interaction affected the outcome. Data-Driven attribution model is used in Google Analytics 360.

With it, you can work efficiently with channels that are underestimated in simpler models. It gives the opportunity to distribute the advertising budget correctly.

Attribution based on Markov’s Chains (Markov Chains)

Markov’s chain has been used for a long time to predict weather, matches, etc. The model allows you to find out, how the lack of a channel will affect sales. Its advantage is the ability to assess the impact of the source on the conversion, to find out which channel brings the best results.

A great option for companies that store data in one service. To implement requires knowledge of programming. It has one drawback in the form of underestimating the first channel in the chain. 

OWOX BI Attribution

OWOX BI Attribution helps you assess the mutual influence of channels on encouraging a customer through the funnel and achieving a conversion.

What information can be processed:

  • Upload user data from Google Analytics using flexible built-in tools.
  • Process information from various advertising services.
  • Integrate the model with CRM systems.

This approach makes it possible not to lose sight of any channel. Analyze the complex impact of marketing tools, correctly distributing the advertising budget.

The model uses CRM information, which makes it possible to do end-to-end analytics. Each user is assigned an identifier, so no matter what device he came from, you can track the chain of actions and understand that it is him. This allows you to see the overall effect of each channel on the conversion.

Advantages

Provides an integrated approach to assessing the effectiveness of channels, allows you to identify consumers, even with different devices, view all visits. It helps to determine where the user came from, what prompted him to do so. With it, you can control the execution of orders in CRM, to estimate the margin. To evaluate in combination with other models in order to determine the highest priority advertising campaigns that bring the most profit.

Disadvantages

It is impossible to objectively evaluate the first step of the chain.

Who is suitable for?

Suitable for all businesses that aim to account for each step of the chain and the qualitative assessment of all advertising channels.

Conclusion

The above-mentioned Ad Roll study shows that 70% of marketing managers find it difficult to use the results obtained from attribution. Moreover, there will be no result without it.

To obtain a realistic assessment of the effectiveness of marketing activities, do the following:

  • Determine priority KPIs.
  • Appoint a person responsible for evaluating advertising campaigns.
  • Define a user funnel chain.
  • Keep track of all data, online and offline. 
  • Make a diagnosis of incoming data.
  • Find the best attribution model for your business.
  • Use the data to make decisions.

The Power of Analyzing Processes

Are you thinking BIG enough? Over the past few years, the quality of discussion regarding a ‘process’ and its interfaces between different departments has developed radically. Organizations increasingly reject guesswork, individual assessments, or blame-shifting and instead focus on objective facts: the display of throughput times, process variants, and their optimization.

But while data can hold valuable insights into business, users, customer bases, and markets, companies are sometimes unsure how best to analyze and harness their data. In fact, the problem isn’t usually a lack of data; it’s a breakdown in leveraging useful data. Being unsure how to interpret, explore, and analyze processes can paralyze any go-live, leading to a failure in the efficient interaction of processes and business operations. Without robust data analysis, your business could be losing money, talent, and even clients.

After all, analyzing processes is about letting data tell its true story for improved understanding.

The “as-is” processes

Analyzing the as-is current state helps organizations document, track, and optimize processes for better performance, greater efficiency, and improved outcomes. By contextualizing data, we gain the ability to navigate and organize processes to negate bottlenecks, set business preferences, and plan an optimized route through process mining initiatives. This focus can help across an entire organization, or on one or more specific processes or trends within a department or team.

There are several vital goals/motivations for implementing current state analysis, including:

  • Saving money and improving ROI;
  • Improving existing processes or creating new processes;
  • Increasing customer satisfaction and journeys;
  • Improving business coordination and organizational responsiveness;
  • Complying with new regulatory standards;
  • Adapting methods following a merger or acquisition.

The “to-be” processes

Simply put, if as-is maps where your processes are, to-be maps where you want them to… be. To-be process mapping documents what you want the process to look like, and by using the as-is diagram, you can work with stakeholders to identify developments and improvements of the current process, then outline those changes on your to-be roadmap.

This analysis can help you make optimal decisions for your business and innovative OpEx imperatives. For instance, at leading data companies like Google and Amazon, data is used in such a way that the analysis results make the decisions! Just think of the power Recommendation Engines, PageRank, and Demand Forecasting Systems have over the content we see. To achieve this, advanced techniques of machine learning and statistical modeling are applied, resulting in mechanically improved results from the data. Interestingly, because these techniques reference large-scale data sets and reflect analysis and results in real-time, they are applied to areas that extend beyond human decision-making.

Also, by analyzing and continuously monitoring qualitative and quantitative data, we gain insights across potential risks and ongoing improvement opportunities, too. The powerful combination of process discovery, process analysis, and conformance checking supports a collaborative approach to process improvement, giving you game-changing insights into your business. For example:

  • Which incidents would I like to detect and act upon proactively?
  • Where would task prioritization help improve overall performance?
  • Where do I know that increased transparency would help the company?
  • How can I utilize processes in place of gut feeling/experience?

Further, as the economic environment continues to change rapidly, and modern organizations keep adopting process-based approaches to ensure they are achieving their business goals, process analysis naturally becomes the perfect template for any company.

With this, process mining technology can help modern businesses manage process challenges beyond the boundaries of implementation. We can evaluate the proof of concept (PoC) for any proposed improvements, and extract relevant information from a homogenous data set. Of course, process modeling and business process management (BPM) are available to solve the potentially tricky integration phase.

Process mining and analysis initiatives

Process mining and discovery initiatives can also provide critical insights throughout the automation and any Robotic Process Automation (RPA) journey, from defining the strategy to continuous improvement and innovation. Data-based process mining can even extend process analysis across teams and individuals, decreasing incident resolution times, and subsequently improving working habits via the discovery and validation of automation opportunities.

A further example of where process mining and strategic process analysis/alignment is already paying dividends is IT incident management. Here, “incident” is an unplanned interruption to an IT service, which may be complete unavailability or merely a reduction in quality. The goal of the incident management process is to restore regular service operation as quickly as possible and to minimize the impact on business operations. Incident management is a critical process in Information Technology Library (ITIL).

Process mining can also further drive improvement in as-is incident management processes as well as exceptional and unwanted process steps, by increasing visibility and transparency across IT processes. Process mining will swiftly analyze the different working habits across teams and individuals, decreasing incident resolution times, and subsequently improving customer impact cases.

Positive and practical experiences with process mining across industries have also led to the further dynamic development of tools, use cases, and the end-user community. Even with very experienced process owners, the visualization of processes can skyrocket improvement via new ideas and discussion.

However, the potential performance gains are more extensive, with the benefits of using process mining for incident management, also including:

  • Finding out how escalation rules are working and how the escalation is done;
  • Calculating incident management KPIs, including SLA (%);
  • Discovering root causes for process problems;
  • Understanding the effect of the opening interface (email, web form, phone, etc.);
  • Calculating the cost of the incident process;
  • Aligning the incident management system with your incident management process.

Robotic Process Automation (RPA)

Robotic process automation (RPA) provides a virtual workforce to automatize manual, repetitive, and error-prone tasks. However, successful process automation requires specific knowledge about the intended (and potential) benefits, effective training of the robots, and continuous monitoring of their performance and processes.

With this, process mining supports organizations throughout the lifecycle of RPA initiatives by monitoring and benchmarking robots to ensure sustainable benefits. These insights are especially valuable for process miners and managers with a particular interest in process automation. By unlocking the experiences with process mining, a company better understands what is needed today, for tomorrow’s process initiatives.

To further upgrade the impact of robot-led automation, there is also a need for a solid understanding of legacy systems, and an overview of automation opportunities. Process mining tools provide key insights throughout the entire RPA journey, from defining the strategy to continuous improvement and innovation.

Benefits of process mining and analysis within the RPA lifecycle include:

  1. Overviews of processes within the company, based on specific criteria;
  2. Identification of processes suitable for RPA implementation during the preparation phase;
  3. Mining the optimal process flow/process path;
  4. Understanding the extent to which RPA can be implemented in legacy processes and systems;
  5. Monitoring and analysis of RPA performance during the transition/handover of customization;
  6. Monitoring and continuous improvement of RPA in the post-implementation phase.

The process of better business understanding

Every organization is different and brings with it a variety of process-related questions. Yet some patterns are usually repeated. For example, customers who introduce data supported process analysis as part of business transformation initiatives will typically face challenges in harmonizing processes from fragmented sectors and regional locations. Here it helps enormously to base actions on data and statistics from the respective processes, instead of relying on the instincts and estimations of individuals.

With this, process analysis which is supported by data, enables a fact-based discussion, and builds a bridge between employees, process experts and management. This helps avoid siloed thinking, as well as allowing the transparent design of handovers and process steps which cross departmental boundaries within an organization.

In other words, to unlock future success and transformation, we must be processing… today.

Find out more about process mining with Signavio Process Intelligence, and see how it can help your organization uncover the hidden value of process, generate fresh ideas, and save time and money.

Accelerate your AI Skills Today: A Million Dollar Job!

The skyrocketing salaries ($1m per year) of AI engineers is not a hype. It is the fact of current corporate world, where you will witness a shift that is inevitable.

We’ve already set our feet at the edge of the technological revolution. A revolution that is at the verge of altering the way we live and work. As the fact suggests, humanity has fundamentally developed human production in three revolutions, and we’re now entering the fourth revolution. In its scope, the fourth revolution projects a transformation that is unlike anything we humans have ever experienced.

  • The first revolution had the world transformed from rural to urban
  • the emergence of mass production in the second revolution
  • third introduced the digital revolution
  • The fourth industrial revolution is anxious to integrate technologies into our lives.

And all thanks to artificial intelligence (AI). An advanced technology that surrounds us, from virtual assistants to software that translates to self-driving cars.

The rise of AI at an exponential rate has disrupted almost every industry. So much so that AI is being rated as one-million-dollar profession.

Did this grab your attention? It did?

Now, what if we were to tell you that the salary compensation for AI experts has grown dramatically. AI and machine learning are fields that have a mountain of demand in the tech industry today but has sparse supply.

AI field is growing at a quicker pace and salaries are skyrocketing! Read it for yourself to know what AI experts, AI researchers and any other AI talent are commanding today.

  • A top-class AI research laboratory, OpenAI says that techies in the AI field are projected to earn a salary compensation ranging between $300 to $500k for fresh graduates. However, expert professionals could earn anywhere up to $1m.
  • Whopping salary package of above 100 million yen that amounts to $1m is being offered to AI geniuses by a Japanese firm, Start Today. A firm that operates a fashion shopping website named Zozotown.

Does this leave you with a question – Is this a right opportunity for you to jump in the field and make hay while the sun is shining? 

And the answer to this question is – yes, it is the right opportunity for any developer seeking a role in the AI industry. It can be your chance to bridge the skill shortage in the AI field either by upskilling or reskilling yourself in the field of AI.

There are a wide varieties of roles available for an AI enthusiast like you. And certain areas are like AI Engineers and AI Researchers are high in demand, as there are not many professionals who have robust AI knowledge.

According to a job report, “The Future of Jobs 2018,” a prediction was made suggesting that machines and algorithms will create around 133 million new job roles by 2022.

AI and machine learning will dominate the tech world. The World Economic Forum says that several sectors have started embracing AI and machine learning to tackle challenges in certain fields such as advertising, supply chain, manufacturing, smart cities, drones, and cybersecurity.

Unraveling the AI realm

From chatbots to financial planners, AI is impacting the way businesses function on a day-today basis. AI makes the work simpler, as it provides variables, which makes the work more streamlined.

Alright! You know that

  • the demand for AI professionals is rising exponentially and that there is just a trickle of supply
  • the AI professionals are demanding skyrocketing salaries

However, beyond that how much more do you know about AI?

Considering the fact that our lives have already been touched by AI (think Alexa, and Siri), it is just a matter of time when AI will become an indispensable part of our lives.

As Gartner predicts that 2020 will be an important year for business growth in AI. Thus, it is possible to witness significant sparks for employment growth. Though AI predicts to diminish 1.8 million jobs, it is also said to replace it with 2.3 million jobs that will be created. As we look forward to stepping into 2020, AI-related job roles are set to make positive progress of achieving 2 million net-new employments by 2025.

With AI promising to score fat paychecks that would reach millions, AI experts are struggling to find new ways to pick up nouveau skills. However, one of the biggest impacts that affect the job market today is the scarcity of talent in this field.

The best way to stay relevant and employable in AI is probably by “reskilling,” and “upskilling.” And  AI certifications is considered ideal for those in the current workforce.

Looking to upskill yourself – here’s how you can become an AI engineer today.

Top three ways to enhance your artificial intelligence career:

  1. Acquire skills in Statistics and Machine Learning: If you’re getting into the field of machine learning, it is crucial that you have in-depth knowledge of statistics. Statistics is considered a prerequisite to the ML field. Both the fields are tightly related. Machine learning models are created to make accurate predictions while statistical models do the job of interpreting the relationship between variables. Many ML techniques heavily rely on the theory obtained through statistics. Thus, having extensive knowledge in statistics help initiate the first step towards an AI career.
  2. Online certification programs in AI skills: Opting for AI certifications will boost your credibility amongst potential employers. Certifications will also enhance your earning potential and increase your marketability. If you’re looking for a change and to be a part of something impactful; join the AI bandwagon. The IT industry is growing at breakneck speed; it is now that businesses are realizing how important it is to hire professionals with certain skillsets. Specifically, those who are certified in AI are becoming sought after in the job market.
  3. Hands-on experience: There’s a vast difference in theory and practical knowledge. One needs to familiarize themselves with the latest tools and technologies used by the industry. This is possible only if the individual is willing to work on projects and build things from scratch.

Despite all the promises, AI does prove to be a threat to job holders, if they don’t upskill or reskill themselves. The upcoming AI revolution will definitely disrupt the way we work, however, it will leave room for humans to perform more creative jobs in the future corporate world.

So a word of advice is to be prepared and stay future ready.

Visual Question Answering with Keras – Part 1

This is Part I of II of the Article Series Visual Question Answering with Keras

Making Computers Intelligent to answer from images

If we look closer in the history of Artificial Intelligence (AI), the Deep Learning has gained more popularity in the recent years and has achieved the human-level performance in the tasks such as Speech Recognition, Image Classification, Object Detection, Machine Translation and so on. However, as humans, not only we but also a five-year child can normally perform these tasks without much inconvenience. But the development of such systems with these capabilities has always considered an ambitious goal for the researchers as well as for developers.

In this series of blog posts, I will cover an introduction to something called VQA (Visual Question Answering), its available datasets, the Neural Network approach for VQA and its implementation in Keras and the applications of this challenging problem in real life. 

Table of Contents:

1 Introduction

2 What is exactly Visual Question Answering?

3 Prerequisites

4 Datasets available for VQA

4.1 DAQUAR Dataset

4.2 CLEVR Dataset

4.3 FigureQA Dataset

4.4 VQA Dataset

5 Real-life applications of VQA

6 Conclusion

 

  1. Introduction:

Let’s say you are given a below picture along with one question. Can you answer it?

I expect confidently you all say it is the Kitchen without much inconvenience which is also the right answer. Even a five-year child who just started to learn things might answer this question correctly.

Alright, but can you write a computer program for such type of task that takes image and question about the image as an input and gives us answer as output?

Before the development of the Deep Neural Network, this problem was considered as one of the difficult, inconceivable and challenging problem for the AI researcher’s community. However, due to the recent advancement of Deep Learning the systems are capable of answering these questions with the promising result if we have a required dataset.

Now I hope you have got at least some intuition of a problem that we are going to discuss in this series of blog posts. Let’s try to formalize the problem in the below section.

  1. What is exactly Visual Question Answering?:

We can define, “Visual Question Answering(VQA) is a system that takes an image and natural language question about the image as an input and generates natural language answer as an output.”

VQA is a research area that requires an understanding of vision(Computer Vision)  as well as text(NLP). The main beauty of VQA is that the reasoning part is performed in the context of the image. So if we have an image with the corresponding question then the system must able to understand the image well in order to generate an appropriate answer. For example, if the question is the number of persons then the system must able to detect faces of the persons. To answer the color of the horse the system need to detect the objects in the image. Many of these common problems such as face detection, object detection, binary object classification(yes or no), etc. have been solved in the field of Computer Vision with good results.

To summarize a good VQA system must be able to address the typical problems of CV as well as NLP.

To get a better feel of VQA you can try online VQA demo by CloudCV. You just go to this link and try uploading the picture you want and ask the related question to the picture, the system will generate the answer to it.

 

  1. Prerequisites:

In the next post, I will walk you through the code for this problem using Keras. So I assume that you are familiar with:

  1. Fundamental concepts of Machine Learning
  2. Multi-Layered Perceptron
  3. Convolutional Neural Network
  4. Recurrent Neural Network (especially LSTM)
  5. Gradient Descent and Backpropagation
  6. Transfer Learning
  7. Hyperparameter Optimization
  8. Python and Keras syntax
  1. Datasets available for VQA:

As you know problems related to the CV or NLP the availability of the dataset is the key to solve the problem. The complex problems like VQA, the dataset must cover all possibilities of questions answers in real-world scenarios. In this section, I will cover some of the datasets available for VQA.

4.1 DAQUAR Dataset:

The DAQUAR dataset is the first dataset for VQA that contains only indoor scenes. It shows the accuracy of 50.2% on the human baseline. It contains images from the NYU_Depth dataset.

Example of DAQUAR dataset

Example of DAQUAR dataset

The main disadvantage of DAQUAR is the size of the dataset is very small to capture all possible indoor scenes.

4.2 CLEVR Dataset:

The CLEVR Dataset from Stanford contains the questions about the object of a different type, colors, shapes, sizes, and material.

It has

  • A training set of 70,000 images and 699,989 questions
  • A validation set of 15,000 images and 149,991 questions
  • A test set of 15,000 images and 14,988 questions

Image Source: https://cs.stanford.edu/people/jcjohns/clevr/?source=post_page

 

4.3 FigureQA Dataset:

FigureQA Dataset contains questions about the bar graphs, line plots, and pie charts. It has 1,327,368 questions for 100,000 images in the training set.

4.4 VQA Dataset:

As comapred to all datasets that we have seen so far VQA dataset is relatively larger. The VQA dataset contains open ended as well as multiple choice questions. VQA v2 dataset contains:

  • 82,783 training images from COCO (common objects in context) dataset
  • 40, 504 validation images and 81,434 validation images
  • 443,757 question-answer pairs for training images
  • 214,354 question-answer pairs for validation images.

As you might expect this dataset is very huge and contains 12.6 GB of training images only. I have used this dataset in the next post but a very small subset of it.

This dataset also contains abstract cartoon images. Each image has 3 questions and each question has 10 multiple choice answers.

  1. Real-life applications of VQA:

There are many applications of VQA. One of the famous applications is to help visually impaired people and blind peoples. In 2016, Microsoft has released the “Seeing AI” app for visually impaired people to describe the surrounding environment around them. You can watch this video for the prototype of the Seeing AI app.

Another application could be on social media or e-commerce sites. VQA can be also used for educational purposes.

  1. Conclusion:

I hope this explanation will give you a good idea of Visual Question Answering. In the next blog post, I will walk you through the code in Keras.

If you like my explanations, do provide some feedback, comments, etc. and stay tuned for the next post.

The New Age of Big Data: Is It the Death of Hadoop?

Big Data had gone through several transformations through the years, growing into the phrase we identify it as today. From its first identified use on the back of Hadoop and MapReduce, a new age of Big Data has been ushered in with the spread of new technologies such as Kubernetes, Spark, and NoSQL databases.

These might not serve the exact same purpose as Hadoop individually, but they fill the same niche and do the same job with features the original platform designers never envisioned.

The multi-cloud architecture boom and increasing emphasis on real-time data may just mean the end of Big Data as we know it, and Hadoop with it.

A brief history of Big Data

The use of data for making business decisions can be traced back to ancient civilizations in Mesopotamia. However, the age of Big Data as we know it is only as old as 2005 when O’Reilly Media launched the phrase. It was used to describe the massive amounts of data that the world was beginning to produce on the internet.

The newly-dubbed Web 2.0 needed to be indexed and easily searchable, and, Yahoo, being the behemoth that it was, was just the right company for the job. Hadoop was born off the efforts of Yahoo engineers, depending on Google’s MapReduce under the hood. A new era of Big Data had begun, and Hadoop was at the forefront of the revolution.

The new technologies led to a fundamental shift in the way the world regarded data processing. Traditional assumptions of atomicity, consistency, isolation, and durability (ACID) began to fade, and new use cases for previously unusable data began to emerge.

Hadoop would begin its life as a commercial platform with the launch of Cloudera in 2008, followed by rivals such as Hortonworks, EMC and MapR. It continued its momentous run until it seemingly hit its peak in 2015, and its place in the enterprise market would never be guaranteed again

Where Hadoop Couldn’t Keep Up

Hadoop made its mark in the world of Big Data by being a platform to collect, store and analyze large swathes of data. However, not even a technology as revolutionary and versatile as Hadoop could exist without its drawbacks.

Some of these would be so costly developers would rather design whole new systems to deal with them. With time, Hadoop started to lose its charm, unable to grow past its initial vision as a Big Data software.

Hadoop is a machine made up of smaller moving parts that are incredibly efficient at what they do – crunch data. This ultimately results in one of the first drawbacks of Hadoop – it does not come with built-in support for analytics data. Hadoop works well to process your data, but not likely as you need – visual reports about how the data is being processed, for instance.

MapReduce was also built from the ground up to be file-intensive. This makes it a great piece of software for simple requests, but not so much for iterative data. For smaller datasets, it turns out to be a rather inefficient solution.

Another area Hadoop lands flat on its face is with regards to real-time processing and reporting. Hadoop suffers from the curse of time. It relies on technologies that even its very founders (Google in particular) no longer rely on.

With MapReduce, every time you want to analyze a modified dataset (say, after adding or deleting data), you have to stream over the whole dataset again. Thanks to this feature, Hadoop is horrible at real-time reporting – a feature that led to the creation of Percolator, MapReduce’s replacement within Google.

The emergence of better technology has also meant a rise in the number of threats to said technology and a corresponding increase in the emphasis that is placed on it.

Unfortunately, Hadoop is nowhere close to being secure. As a matter of fact, its security settings are off by default, and it has too much inertia to simply change that. To make things worse, plugging in security measures isn’t that much easier.

The Fall of Hadoop

With these and more shortcomings in the data science world, new tools such as Hive, Pig and Spark were created to work on top of Hadoop to overcome its weaknesses. But it simply couldn’t grow out of the shoes it had been made for.

The growth of NoSQL databases such as Hazelcast and MongoDB also meant that problems Hadoop was designed to support were now being solved by single players rather than the ‘all or nothing’ approach Hadoop was designed with. It wasn’t flexible enough to evolve beyond simply being a batch processing software.

Over time, new Big Data challenges began to emerge that a large monolithic software like Hadoop couldn’t deal with, either. Being primarily file-intensive, it couldn’t keep up with the variety of data sources that were now available, the lack of support for dynamic schemas, on-the-fly queries, and the rise of cloud infrastructure all caused people to seek different solutions. Hadoop had lost its grip on the enterprise world.

Businesses whose primary concern was dealing with Hadoop infrastructure like Cloudera and Hortonworks were seeing less and less adoption. This led to the eventual merger of the two companies in 2019, and the same message rang out from different corners of the world at the same time: ‘Hadoop is dead.’

Is Hadoop Really Dead?

Hadoop still has a place in the enterprise world – the problems it was designed to solve still exist to this day. Technologies such as Spark have largely taken over the same space that Hadoop once occupied.

The question of Hadoop or Spark is one every data scientist has to contend with at some point, and most seem to be settling in the latter of these, thanks to the great advantages is speed it offers.

It’s unlikely Hadoop will see much more adoption with newer marker entrants, especially considering the pace with which technology moves. It also doesn’t help that a lot of alternatives have a much smaller learning curve than the convoluted monolith that is Hadoop. Companies like MapR and Cloudera have also begun to pivot away from Hadoop-only infrastructure to more robust cloud-based solutions. Hadoop still has its place, but maybe not for long.