All about Big Data Storage and Analytics

CAP Theorem

Understanding databases for storing, updating and analyzing data requires the understanding of the CAP Theorem. This is the second article of the article series Data Warehousing Basics.

Understanding NoSQL Databases by the CAP Theorem

CAP theorem – or Brewer’s theorem – was introduced by the computer scientist Eric Brewer at Symposium on Principles of Distributed computing in 2000. The CAP stands for Consistency, Availability and Partition tolerance.

  • Consistency: Every read receives the most recent writes or an error. Once a client writes a value to any server and gets a response, it is expected to get afresh and valid value back from any server or node of the database cluster it reads from.
    Be aware that the definition of consistency for CAP means something different than to ACID (relational consistency).
  • Availability: The database is not allowed to be unavailable because it is busy with requests. Every request received by a non-failing node in the system must result in a response. Whether you want to read or write you will get some response back. If the server has not crashed, it is not allowed to ignore the client’s requests.
  • Partition tolerance: Databases which store big data will use a cluster of nodes that distribute the connections evenly over the whole cluster. If this system has partition tolerance, it will continue to operate despite a number of messages being delayed or even lost by the network between the cluster nodes.

CAP theorem applies the logic that  for a distributed system it is only possible to simultaneously provide  two out of the above three guarantees. Eric Brewer, the father of the CAP theorem, proved that we are limited to two of three characteristics, “by explicitly handling partitions, designers can optimize consistency and availability, thereby achieving some trade-off of all three.” (Brewer, E., 2012).

CAP Theorem Triangle

To recap, with the CAP theorem in relation to Big Data distributed solutions (such as NoSQL databases), it is important to reiterate the fact, that in such distributed systems it is not possible to guarantee all three characteristics (Availability, Consistency, and Partition Tolerance) at the same time.

Database systems designed to fulfill traditional ACID guarantees like relational database (management) systems (RDBMS) choose consistency over availability, whereas NoSQL databases are mostly systems designed referring to the BASE philosophy which prefer availability over consistency.

The CAP Theorem in the real world

Lets look at some examples to understand the CAP Theorem further and provewe cannot create database systems which are being consistent, partition tolerant as well as always available simultaniously.

AP – Availability + Partition Tolerance

If we have achieved Availability (our databases will always respond to our requests) as well as Partition Tolerance (all nodes of the database will work even if they cannot communicate), it will immediately mean that we cannot provide Consistency as all nodes will go out of sync as soon as we write new information to one of the nodes. The nodes will continue to accept the database transactions each separately, but they cannot transfer the transaction between each other keeping them in synchronization. We therefore cannot fully guarantee the system consistency. When the partition is resolved, the AP databases typically resync the nodes to repair all inconsistencies in the system.

A well-known real world example of an AP system is the Domain Name System (DNS). This central network component is responsible for resolving domain names into IP addresses and focuses on the two properties of availability and failure tolerance. Thanks to the large number of servers, the system is available almost without exception. If a single DNS server fails,another one takes over. According to the CAP theorem, DNS is not consistent: If a DNS entry is changed, e.g. when a new domain has been registered or deleted, it can take a few days before this change is passed on to the entire system hierarchy and can be seen by all clients.

CA – Consistency + Availability

Guarantee of full Consistency and Availability is practically impossible to achieve in a system which distributes data over several nodes. We can have databases over more than one node online and available, and we keep the data consistent between these nodes, but the nature of computer networks (LAN, WAN) is that the connection can get interrupted, meaning we cannot guarantee the Partition Tolerance and therefor not the reliability of having the whole database service online at all times.

Database management systems based on the relational database models (RDBMS) are a good example of CA systems. These database systems are primarily characterized by a high level of consistency and strive for the highest possible availability. In case of doubt, however, availability can decrease in favor of consistency. Reliability by distributing data over partitions in order to make data reachable in any case – even if computer or network failure – meanwhile plays a subordinate role.

CP – Consistency + Partition Tolerance

If the Consistency of data is given – which means that the data between two or more nodes always contain the up-to-date information – and Partition Tolerance is given as well – which means that we are avoiding any desynchronization of our data between all nodes, then we will lose Availability as soon as only one a partition occurs between any two nodes In most distributed systems, high availability is one of the most important properties, which is why CP systems tend to be a rarity in practice. These systems prove their worth particularly in the financial sector: banking applications that must reliably debit and transfer amounts of money on the account side are dependent on consistency and reliability by consistent redundancies to always be able to rule out incorrect postings – even in the event of disruptions in the data traffic. If consistency and reliability is not guaranteed, the system might be unavailable for the users.

CAP Theorem Venn Diagram

Conclusion

The CAP Theorem is still an important topic to understand for data engineers and data scientists, but many modern databases enable us to switch between the possibilities within the CAP Theorem. For example, the Cosmos DB von Microsoft Azure offers many granular options to switch between the consistency, availability and partition tolerance . A common misunderstanding of the CAP theorem that it´s none-absoluteness: “All three properties are more continuous than binary. Availability is continuous from 0 to 100 percent, there are many levels of consistency, and even partitions have nuances. Exploring these nuances requires pushing the traditional way of dealing with partitions, which is the fundamental challenge. Because partitions are rare, CAP should allow perfect C and A most of the time, but when partitions are present or perceived, a strategy is in order.” (Brewer, E., 2012).

ACID vs BASE Concepts

Understanding databases for storing, updating and analyzing data requires the understanding of two concepts: ACID and BASE. This is the first article of the article series Data Warehousing Basics.

The properties of ACID are being applied for databases in order to fulfill enterprise requirements of reliability and consistency.

ACID is an acronym, and stands for:

  • Atomicity – Each transaction is either properly executed completely or does not happen at all. If the transaction was not finished the process reverts the database back to the state before the transaction started. This ensures that all data in the database is valid even if we execute big transactions which include multiple statements (e. g. SQL) composed into one transaction updating many data rows in the database. If one statement fails, the entire transaction will be aborted, and hence, no changes will be made.
  • Consistency – Databases are governed by specific rules defined by table formats (data types) and table relations as well as further functions like triggers. The consistency of data will stay reliable if transactions never endanger the structural integrity of the database. Therefor, it is not allowed to save data of different types into the same single column, to use written primary key values again or to delete data from a table which is strictly related to data in another table.
  • Isolation – Databases are multi-user systems where multiple transactions happen at the same time. With Isolation, transactions cannot compromise the integrity of other transactions by interacting with them while they are still in progress. It guarantees data tables will be in the same states with several transactions happening concurrently as they happen sequentially.
  • Durability – The data related to the completed transaction will persist even in cases of network or power outages. Databases that guarante Durability save data inserted or updated permanently, save all executed and planed transactions in a recording and ensure availability of the data committed via transaction even after a power failure or other system failures If a transaction fails to complete successfully because of a technical failure, it will not transform the targeted data.

ACID Databases

The ACID transaction model ensures that all performed transactions will result in reliable and consistent databases. This suits best for businesses which use OLTP (Online Transaction Processing) for IT-Systems such like ERP- or CRM-Systems. Furthermore, it can also be a good choice for OLAP (Online Analytical Processing) which is used in Data Warehouses. These applications need backend database systems which can handle many small- or medium-sized transactions occurring simultaneous by many users. An interrupted transaction with write-access must be removed from the database immediately as it could cause negative side effects impacting the consistency(e.g., vendors could be deleted although they still have open purchase orders or financial payments could be debited from one account and due to technical failure, never credited to another).

The speed of the querying should be as fast as possible, but even more important for those applications is zero tolerance for invalid states which is prevented by using ACID-conform databases.

BASE Concept

ACID databases have their advantages but also one big tradeoff: If all transactions need to be committed and checked for consistency correctly, the databases are slow in reading and writing data. Furthermore, they demand more effort if it comes to storing new data in new formats.

In chemistry, a base is the opposite to acid. The database concepts of BASE and ACID have a similar relationship. The BASE concept provides several benefits over ACID compliant databases asthey focus more intensely on data availability of database systems without guarantee of safety from network failures or inconsistency.

The acronym BASE is even more confusing than ACID as BASE relates to ACID indirectly. The words behind BASE suggest alternatives to ACID.

BASE stands for:

  • Basically Available – Rather than enforcing consistency in any case, BASE databases will guarantee availability of data by spreading and replicating it across the nodes of the database cluster. Basic read and write functionality is provided without liabilityfor consistency. In rare cases it could happen that an insert- or update-statement does not result in persistently stored data. Read queries might not provide the latest data.
  • Soft State – Databases following this concept do not check rules to stay write-consistent or mutually consistent. The user can toss all data into the database, delegating the responsibility of avoiding inconsistency or redundancy to developers or users.
  • Eventually Consistent –No guarantee of enforced immediate consistency does not mean that the database never achieves it. The database can become consistent over time. After a waiting period, updates will ripple through all cluster nodes of the database. However, reading data out of it will stay always be possible, it is just not certain if we always get the last refreshed data.

All the three above mentioned properties of BASE-conforming databases sound like disadvantages. So why would you choose BASE? There is a tradeoff compared to ACID. If databases do not have to follow ACID properties then the database can work much faster in terms of writing and reading from the database. Further, the developers have more freedom to implement data storage solutions or simplify data entry into the database without thinking about formats and structure beforehand.

BASE Databases

While ACID databases are mostly RDBMS, most other database types, known as NoSQL databases, tend more to conform to BASE principles. Redis, CouchDB, MongoDB, Cosmos DB, Cassandra, ElasticSearch, Neo4J, OrientDB or ArangoDB are just some popular examples. But other than ACID, BASE is not a strict approach. Some NoSQL databases apply at least partly to ACID rules or provide optional functions to get almost or even full ACID compatibility. These databases provide different level of freedom which can be useful for the Staging Layer in Data Warehouses or as a Data Lake, but they are not the recommended choice for applications which need data environments guaranteeing strict consistency.

Data Warehousing Basiscs

Data Warehousing is applied Big Data Management and a key success factor in almost every company. Without a data warehouse, no company today can control its processes and make the right decisions on a strategic level as there would be a lack of data transparency for all decision makers. Bigger comanies even have multiple data warehouses for different purposes.

In this series of articles I would like to explain what a data warehouse actually is and how it is set up. However, I would also like to explain basic topics regarding Data Engineering and concepts about databases and data flows.

To do this, we tick off the following points step by step:

 

How to Efficiently Manage Big Data

The benefits of big data today can’t be ignored, especially since these benefits encompass industries. Despite the misconceptions around big data, it has shown potential in helping organizations move forward and adapt to an ever-changing market, where those that can’t respond appropriately or quickly enough are left behind. Data analytics is the name of the game, and efficient data management is the main differentiator.

A digital integration hub (DIH) will provide the competitive edge an organization needs in the efficient handling of data. It’s an application architecture that aggregates operational data into a low-latency data fabric and helps in digital transformation by offloading from legacy architecture and providing a decoupled API layer that effectively supports modern online applications. Data management entails the governance, organization, and administration of large, and possibly complex, datasets. The rapid growth of data pooIs has left unprepared companies scrambling to find solutions that will help keep them above water. Data in these pools originate from a myriad of sources, including websites, social media sites, and system logs. The variety in data types and their sheer size make data management a fairly complex undertaking.

Big Data Management for Big Wins

In today’s data-driven world, the capability to efficiently analyze and store data are vital factors in enhancing current business processes and setting up new ones. Data has gone beyond the realm of data analysts and into the business mainstream. As such, businesses should add data analysis as a core competency to ensure that the entire organization is on the same page when it comes to data strategy. Below are a few ways you can make big data work for you and your business.

Define Specific Goals

The data you need to capture will depend on your business goals so it’s imperative that you know what these are and ensure that these are shared across the organization. Without definite and specific goals, you’ll end up with large pools of data and nowhere to use them in. As such, it’s advisable to involve the entire team in mapping out a data strategy based on the company’s objectives. This strategy should be part of the organization’s overall business strategy to avoid the collection of irrelevant data that has no impact on business performance. Setting the direction early on will help set you up for long-term success.

Secure Your Data

Because companies have to contend with large amounts of data each day, storage and management could become very challenging. Security is also a main concern; no organization wants to lose it’s precious data after spending time and money processing and storing them. While keeping data accessible for analysis, you should also ensure that it’s kept secure at all times. When handling data, you should have security measures in place, such as firewall security, malware scanning, and spam filtering. Data security is especially important when collecting customer data to avoid violating data privacy regulations. Ideally, it should be one of the main considerations in data management because it’s a critical factor that could mean the difference between a successful venture and a problematic one.

Interlink Your Data

Different channels can be used to access a database, but this doesn’t necessarily mean that you should use several or all of them. There’s no need to deploy different tools for each application your organization uses. One of the ways to prevent miscommunication between applications and ensure that data is synchronized at all times is to keep data interlinked. Synchronicity of data is vital if your organization or team plans to use a single database. An in-memory data grid, cloud storage system, and remote database administrator are just some of the tools a company can use to interlink data.

Ensure Compliance With Audit Regulations

One thing that could be easy to overlook is how compliant systems are to audit regulations. A database and it is conducted to check on the actions of database users and managers. It is typically done for security purposes—to ensure that data or information can be accessed only by those authorized to do so. Adhering to audit regulations is a must, even for offsite database administrators, so it’s critical that they maintain compliant database components.

Be Prepared for Change

There have been significant changes in the field of data processing and management in recent years, which indicates a promising yet constantly changing landscape. To get ahead of the data analytics game, it’s vital that you keep up with current data trends. New tools and technology are made available at an almost regular pace, and keeping abreast of them will ensure that a business keeps its database up to date. It’s also important to be flexible and be able to pivot or restrategize at a moment’s notice so the business can adapt to change accordingly.

Big Data for the Long Haul

Traditional data warehouses and relational database platforms are slowly becoming things of the past. Big data analytics has changed the game, with data management moving away from being a complex, IT-team focused function and becoming a core competency of every business. Ensuring efficient data management means giving your business a competitive edge, and implementing the tips above ensures that a business manages its data effectively. Changes in data strategies are certain, and they may come sooner than later. Equipping your people with the appropriate skills and knowledge will ensure that your business can embrace change with ease.

On the difficulty of language: prerequisites for NLP with deep learning

1 Preface

This section is virtually just my essay on language. You can skip this if you want to get down on more technical topic.

As I do not study in natural language processing (NLP) field, I would not be able to provide that deep insight into this fast changing deep leaning field throughout my article series. However at least I do understand language is a difficult and profound field, not only in engineering but also in many other study fields. Some people might be feeling that technologies are eliminating languages, or one’s motivations to understand other cultures. First of all, I would like you to keep it in mind that I am not a geek who is trying to turn this multilingual world into a homogeneous one and rebuild Tower of Babel, with deep learning. I would say I am more keen on social or anthropological sides of language.

I think you would think more about languages if you have mastered at least one foreign language. As my mother tongue is Japanese, which is totally different from many other Western languages in terms of characters and ambiguity, I understand translating is not what learning a language is all about. Each language has unique characteristics, and I believe they more or less influence one’s personalities. For example, many Western languages make the verb, I mean the conclusion, of sentences clear in the beginning part of the sentences. That is also true of Chinese, I heard. However in Japanese, the conclusion comes at the end, so that is likely to give an impression that Japanese people are being obscure or indecisive. Also, Japanese sentences usually omit their subjects. In German as well, the conclusion of a sentences tend to come at the end, but I am almost 100% sure that no Japanese people would feel German people make things unclear. I think that comes from the structures of German language, which tends to make the number, verb, relations of words crystal clear.

Let’s take an example to see how obscure Japanese is. A Japanese sentence 「頭が赤い魚を食べる猫」can be interpreted in five ways, depending on where you put emphases on.

Common sense tells you that the sentence is likely to mean the first two cases, but I am sure they can mean those five possibilities. There might be similarly obscure sentences in other languages, but I bet few languages can be as obscure as Japanese. Also as you can see from the last two sentences, you can omit subjects in Japanese. This rule is nothing exceptional. Japanese people usually don’t use subjects in normal conversations. And when you read classical Japanese, which Japanese high school students have to do just like Western students learn some of classical Latin, the writings omit subjects much more frequently.

*However interestingly we have rich vocabulary of subjects. The subject “I” can be translated to 「私」、「僕」、「俺」、「自分」、「うち」etc, depending on your personality, who you are talking to, and the time when it is written in.

I believe one can see the world only in the framework of their language, and it seems one’s personality changes depending on the language they use. I am not sure whether the language originally determines how they think, or how they think forms the language. But at least I would like you to keep it in mind that if you translate a conversation, for example a random conversation at a bar in Berlin, into Japanese, that would linguistically sound Japanese, but not anthropologically. Imagine that such kind of random conversation in Berlin or something is like playing a catch, I mean throwing a ball named “your opinion.” On the other hand,  normal conversations of Japanese people are in stead more of, I would say,  “resonance” of several tuning forks. They do their bests to show that they are listening to each other, by excessively nodding or just repeating “Really?”, but usually it seems hardly any constructive dialogues have been made.

*I sometimes feel you do not even need deep learning to simulate most of such Japanese conversations. Several-line Python codes would be enough.

My point is, this article series is mainly going to cover only a few techniques of NLP in deep learning field: sequence to sequence model (seq2seq model) , and especially Transformer. They are, at least for now, just mathematical models and mappings of a small part of this profound field of language (as far as I can cover in this article series). But still, examples of language would definitely help you understand Transformer model in the long run.

2 Tokens and word embedding

*Throughout my article series, “words” just means the normal words you use in daily life. “Tokens” means more general unit of NLP tasks. For example the word “Transformer” might be denoted as a single token “Transformer,” or maybe as a combination of two tokens “Trans” and “former.”

One challenging part of handling language data is its encodings. If you started learning programming in a language other than English, you would have encountered some troubles of using keyboards with different arrangements or with characters. Some comments on your codes in your native languages are sometimes not readable on some software. You can easily get away with that by using only English, but when it comes to NLP you have to deal with this difficulty seriously. How to encode characters in each language should be a first obstacle of NLP. In this article we are going to rely on a library named BPEmb, which provides word embedding in various languages, and you do not have to care so much about encodings in languages all over the world with this library.

In the first section, you might have noticed that Japanese sentence is not separated with spaces like Western languages. This is also true of Chinese language, and that means we need additional tasks of separating those sentences at least into proper chunks of words. This is not only a matter of engineering, but also of some linguistic fields. Also I think many people are not so conscious of how sentences in their native languages are grammatically separated.

The next point is, unlike other scientific data, such as temperature, velocity, voltage, or air pressure, language itself is not measured as numerical data. Thus in order to process language, including English, you first have to map language to certain numerical data, and after some processes you need to conversely map the output numerical data into language data. This section is going to be mainly about one-hot encoding and word embedding, the ways to convert word/token into numerical data. You might already have heard about this

You might have learnt about word embedding to some extent, but I hope you could get richer insight into this topic through this article.

2.1 One-hot encoding

One-hot encoding would be the most straightforward way to encode words/tokens. Assume that you have a dictionary whose size is |\mathcal{V}|, and it includes words from “a”, “ablation”, “actually” to “zombie”, “?”, “!”

In a mathematical manner, in order to choose a word out of those |\mathcal{V}| words, all you need is a |\mathcal{V}| dimensional vector, one of whose elements is 1, and the others are 0. When you want to choose the No. i word, which is “indeed” in the example below, its corresponding one-hot vector is \boldsymbol{v} = (0, \dots, 1, \dots, 0 ), where only the No. i element is 1. One-hot encoding is also easy to understand, and that’s all. It is easy to imagine that people have already come up with more complicated and better way to encoder words. And one major way to do that is word embedding.

2.2 Word embedding

Source: Francois Chollet, Deep Learning with Python,(2018), Manning

Actually word embedding is related to one-hot encoding, and if you understand how to train a simple neural network, for example densely connected layers, you would understand word embedding easily. The key idea of word embedding is denoting each token with a D dimensional vector, whose dimension is fewer than the vocabulary size |\mathcal{V}|. The elements of the resulting word embedding vector are real values, I mean not only 0 or 1. Obviously you can encode much richer variety of tokens with such vectors. The figure at the left side is from “Deep Learning with Python” by François Chollet, and I think this is an almost perfect and simple explanation of the comparison of one-hot encoding and word embedding. But the problem is how to get such convenient vectors. The answer is very simple: you have only to train a network whose inputs are one-hot vector of the vocabulary.

The figure below is a simplified model of word embedding of a certain word. When the word is input into a neural network, only the corresponding element of the one-hot vector is 1, and that virtually means the very first input layer is composed of one neuron whose value is 1. And the only one neuron propagates to the next D dimensional embedding layer. These weights are the very values which most other study materials call “an embedding vector.”

When you input each word into a certain network, for example RNN or Transformer, you map the input one-hot vector into the embedding layer/vector. The examples in the figure are how inputs are made when the input sentences are “You’ve got the touch” and “You’ve got the power.”   Assume that you have a dictionary of one-hot encoding, whose vocabulary is {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}, and the dimension of word embeding is 6. In this case |\mathcal{V}| = 9, D=6. When the inputs are “You’ve got the touch” or “You’ve got the power” , you put the one-hot vector corresponding to “You’ve”, “got”, “the”, “touch” or “You’ve”, “got”, “the”, “power” sequentially every time step t.

In order to get word embedding of certain vocabulary, you just need to train the network. We know that the words “actually” and “indeed” are used in similar ways in writings. Thus when we propagate those words into the embedding layer, we can expect that those embedding layers are similar. This is how we can mathematically get effective word embedding of certain vocabulary.

More interestingly, if word embedding is properly trained, you can mathematically “calculate” words. For example, \boldsymbol{v}_{king} - \boldsymbol{v}_{man} + \boldsymbol{v}_{woman} \approx \boldsymbol{v}_{queen}, \boldsymbol{v}_{Japan} - \boldsymbol{v}_{Tokyo} + \boldsymbol{v}_{Vietnam} \approx \boldsymbol{v}_{Hanoi}.

*I have tried to demonstrate this type of calculation on several word embedding, but none of them seem to work well. At least you should keep it in mind that word embedding learns complicated linear relations between words.

I should explain word embedding techniques such as word2vec in detail, but the main focus of this article is not NLP, so the points I have mentioned are enough to understand Transformer model with NLP examples in the upcoming articles.

 

3 Language model

Language models is one of the most straightforward, but crucial ideas in NLP. This is also a big topic, so this article is going to cover only basic points. Language model is a mathematical model of the probabilities of which words to come next, given a context. For example if you have a sentence “In the lecture, he opened a _.”, a language model predicts what comes at the part “_.” It is obvious that this is contextual. If you are talking about general university students, “_” would be “textbook,” but if you are talking about Japanese universities, especially in liberal art department, “_” would be more likely to be “smartphone. I think most of you use this language model everyday. When you type in something on your computer or smartphone, you would constantly see text predictions, or they might even correct your spelling or grammatical errors. This is language modelling. You can make language models in several ways, such as n-gram and neural language models, but in this article I can explain only general formulations for such models.

*I am not sure which algorithm is used in which services. That must be too fast changing and competitive for me to catch up.

As I mentioned in the first article series on RNN, a sentence is usually processed as sequence data in NLP. One single sentence is denoted as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}), a list of vectors. The vectors are usually embedding vectors, and the (t) is the index of the order of tokens. For example the sentence “You’ve go the power.” can be expressed as \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}), where \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)} denote “You’ve”, “got”, “the”, “power”, “.” respectively. In this case \tau = 4.

In practice a sentence \boldsymbol{X} usually includes two tokens BOS and EOS at the beginning and the end of the sentence. They mean “Beginning Of Sentence” and “End Of Sentence” respectively. Thus in many cases \boldsymbol{X} = (\boldsymbol{BOS} , \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS} ). \boldsymbol{BOS} and \boldsymbol{EOS} are also both vectors, at least in the Tensorflow tutorial.

P(\boldsymbol{X} = (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) is the probability of incidence of the sentence. But it is easy to imagine that it would be very hard to directly calculate how likely the sentence \boldsymbol{X} appears out of all possible sentences. I would rather say it is impossible. Thus instead in NLP we calculate the probability P(\boldsymbol{X}) as a product of the probability of incidence or a certain word, given all the words so far. When you’ve got the words (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1}) so far, the probability of the incidence of \boldsymbol{x}^{(t)}, given the context is  P(\boldsymbol{x}^{(t)}|\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(t-1)}). P(\boldsymbol{BOS}) is a probability of the the sentence \boldsymbol{X} being (\boldsymbol{BOS}), and the probability of \boldsymbol{X} being (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) can be decomposed this way: P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS})P(\boldsymbol{BOS}).

Just as well P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) = P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P( \boldsymbol{BOS}, \boldsymbol{x}^{(1)})= P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P( \boldsymbol{BOS}).

Hence, the general probability of incidence of a sentence \boldsymbol{X} is P(\boldsymbol{X})=P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(\tau -1)}, \boldsymbol{x}^{(\tau)}, \boldsymbol{EOS}) = P(\boldsymbol{EOS}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau)}) P(\boldsymbol{x}^{(\tau)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau - 1)}) \cdots P(\boldsymbol{x}^{(2)}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) P(\boldsymbol{x}^{(1)}| \boldsymbol{BOS}) P(\boldsymbol{BOS}).

Let \boldsymbol{x}^{(0)} be \boldsymbol{BOS} and \boldsymbol{x}^{(\tau + 1)} be \boldsymbol{EOS}. Plus, let P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]}) be P(\boldsymbol{x}^{(t+1)}|\boldsymbol{x}^{(0)}, \dots, \boldsymbol{x}^{(t)}), then P(\boldsymbol{X}) = P(\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Language models calculate which words to come sequentially in this way.

Here’s a question: how would you evaluate a language model?

I would say the answer is, when the language model generates words, the more confident the language model is, the better the language model is. Given a context, when the distribution of the next word is concentrated on a certain word, we can say the language model is confident about which word to come next, given the context.

*For some people, it would be more understandable to call this “entropy.”

Let’s take the vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”} as an example. Assume that P(\boldsymbol{X}) = P(\boldsymbol{BOS}, \boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}, \boldsymbol{EOS}) = P(\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)}, \boldsymbol{EOS})= P(\boldsymbol{x}^{(0)})\prod_{t=0}^{4}{P(\boldsymbol{x}^{(t+1)}|\boldsymbol{X}_{[0, t]})}. Given a context (\boldsymbol{BOS}, \boldsymbol{x}^{(1)}), the probability of incidence of \boldsymbol{x}^{(2)} is P(\boldsymbol{x}^{2}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}). In the figure below, the distribution at the left side is less confident because probabilities do not spread widely, on the other hand the one at the right side is more confident that next word is “got” because the distribution concentrates on “got”.

*You have to keep it in mind that the sum of all possible probability P(\boldsymbol{x}^{(2)} | \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) is 1, that is, P(\boldsymbol{the}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + P(\boldsymbol{You've}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) + \cdots + P(\boldsymbol{Boogie}| \boldsymbol{BOS}, \boldsymbol{x}^{(1)}) = 1.

While the language model generating the sentence “BOS You’ve got the touch EOS”, it is better if the language model keeps being confident. If it is confident, P(\boldsymbol{X})= P(\boldsymbol{BOS}) P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}}P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}) P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}) P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} gets higher. Thus (-1) \{ log_{b}{P(\boldsymbol{BOS})} + log_{b}{P(\boldsymbol{x}^{(1)}|\boldsymbol{BOS}}) + log_{b}{P(\boldsymbol{x}^{(3)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)})} + log_{b}{P(\boldsymbol{x}^{(4)}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)})} + log_{b}{P(\boldsymbol{EOS}|\boldsymbol{BOS}, \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}, \boldsymbol{x}^{(4)})} \} gets lower, where usually b=2 or b=e.

This is how to measure how confident language models are, and the indicator of the confidence is called perplexity. Assume that you have a data set for evaluation \mathcal{D} = (\boldsymbol{X}_1, \dots, \boldsymbol{X}_n, \dots, \boldsymbol{X}_{|\mathcal{D}|}), which is composed of |\mathcal{D}| sentences in total. Each sentence \boldsymbol{X}_n = (\boldsymbol{x}^{(0)})\prod_{t=0}^{\tau ^{(n)}}{P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})} has \tau^{(n)} tokens in total excluding \boldsymbol{BOS}, \boldsymbol{EOS}. And let |\mathcal{V}| be the size of the vocabulary of the language model. Then the perplexity of the language model is b^z, where z = \frac{-1}{|\mathcal{V}|}\sum_{n=1}^{|\mathcal{D}|}{\sum_{t=0}^{\tau ^{(n)}}{log_{b}P(\boldsymbol{x}_{n}^{(t+1)}|\boldsymbol{X}_{n, [0, t]})}. The b is usually 2 or e.

For example, assume that \mathcal{V} is vocabulary {“the”, “You’ve”, “Walberg”, “touch”, “power”, “Nights”, “got”, “Mark”, “Boogie”}. Also assume that the evaluation data set for perplexity of a language model is \mathcal{D} = (\boldsymbol{X}_1, \boldsymbol{X}_2), where \boldsymbol{X_1} =(\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the}, \boldsymbol{touch}) \boldsymbol{X_2} = (\boldsymbol{You've}, \boldsymbol{got}, \boldsymbol{the }, \boldsymbol{power}). In this case |\mathcal{V}|=9, |\mathcal{D}|=2. I have already showed you how to calculate the perplexity of the sentence “You’ve got the touch.” above. You just need to do a similar thing on another sentence “You’ve got the power”, and then you can get the perplexity of the language model.

*If the network is not properly trained, it would also be confident of generating wrong outputs. However, such network still would give high perplexity because it is “confident” at any rate. I’m sorry I don’t know how to tackle the problem. Please let me put this aside, and let’s get down on Transformer model soon.

Appendix

Let’s see how word embedding is implemented with a very simple example in the official Tensorflow tutorial. It is a simple binary classification task on IMDb Dataset. The dataset is composed to comments on movies by movie critics, and you have only to classify if the commentary is positive or negative about the movie. For example when you get you get an input “To be honest, Michael Bay is a terrible as an action film maker. You cannot understand what is going on during combat scenes, and his movies rely too much on advertisements. I got a headache when Mark Walberg used a Chinese cridit card in Texas. However he is very competent when it comes to humorous scenes. He is very talented as a comedy director, and I have to admit I laughed a lot.“, the neural netowork has to judge whether the statement is positive or negative.

This networks just takes an average of input embedding vectors and regress it into a one dimensional value from 0 to 1. The shape of embedding layer is (8185, 16). Weights of neural netowrks are usually implemented as matrices, and you can see that each row of the matrix corresponds to emmbedding vector of each token.

*It is easy to imagine that this technique is problematic. This network virtually taking a mean of input embedding vectors. That could mean if the input sentence includes relatively many tokens with negative meanings, it is inclined to be classified as negative. But for example, if the sentence is “This masterpiece is a dark comedy by Charlie Chaplin which depicted stupidity of the evil tyrant gaining power in the time. It thoroughly mocked Germany in the time as an absurd group of fanatics, but such propaganda could have never been made until ‘Casablanca.'” , this can be classified as negative, because only the part “masterpiece” is positive as a token, and there are much more words with negative meanings themselves.

The official Tensorflow tutorial provides visualization of word embedding with Embedding Projector, but I would like you to take more control over the data by yourself. Please just copy and paste the codes below, installing necessary libraries. You would get a map of vocabulary used in the text classification task. It seems you cannot find clear tendency of the clusters of the tokens. You can try other dimension reduction methods to get maps of the vocabulary by for example using Scikit Learn.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

(train_data, test_data), info = tfds.load(
    'imdb_reviews/subwords8k', 
    split = (tfds.Split.TRAIN, tfds.Split.TEST), 
    with_info=True, as_supervised=True)

train_batches = train_data.shuffle(1000).padded_batch(10)
test_batches = test_data.shuffle(1000).padded_batch(10)

embedding_dim=16

encoder = info.features['text'].encoder

model = keras.Sequential([
  layers.Embedding(encoder.vocab_size, embedding_dim),
  layers.GlobalAveragePooling1D(),
  layers.Dense(16, activation='relu'),
  layers.Dense(1)
])

print("\n\nThe size of the vocabulary generated from IMDb Dataset is " + str(encoder.vocab_size) + '\n\n')

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(
    train_batches,
    epochs=10,
    validation_data=test_batches, validation_steps=20)

word_embedding_vectors = model.layers[0].get_weights()[0]

print("\n\nThe shape of the trained weigths of the embedding layer is " + str(word_embedding_vectors.shape) + '\n\n')

from sklearn.manifold import TSNE
X_reduced = TSNE(n_components = 2, init='pca', random_state=0).fit_transform(word_embedding_vectors)

import numpy as np
embedding_dict = zip(encoder.subwords, np.arange(len(encoder.subwords)))
embedding_dict = dict(embedding_dict)

import matplotlib.pyplot as plt

plt.figure(figsize=(60, 45))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])

for i in range(0, len(encoder.subwords), 5):
    plt.text(X_reduced[i, 0], X_reduced[i, 1], encoder.subwords[i], fontsize=20, color='red')
plt.title("The map of vocabulary of IMDb Dataset mapped to a 2 dimensional space by t-SNE", fontsize=60)
#plt.savefig('imdb_tsne_map.png')
plt.show()

 

 

 

 

 

Turbocharge Business Analytics With In-memory Computing

One of the customer traits that’s been gradually diminishing through the years is patience; if a customer-facing website or application doesn’t deliver real-time or near-instant results, it can be a reason for a customer to look elsewhere. This trend has pushed companies to turn to in-memory computing to get the speed needed to address customer demands in real-time. It simplifies access to multiple data sources to provide super-fast performance that’s thousands of times faster than disk-based storage systems. By storing data in RAM and processing in parallel against the full dataset, in-memory computing solutions allow for real-time insights that lead to informed business decisions and improved performance.

The in-memory computing solutions market has been on the rise in recent years because it has been heralded as the platform that will accelerate IT modernization. In-memory data grids, in particular, show great promise because it addresses the main limitation of an in-memory relational database. While the latter is designed to scale up, the former is designed to scale out. This scalability is one of the main draws of an in-memory data grid, since a scale-up architecture is not sustainable in the long term and will always have a breaking point. In-memory data grids on the other hand, benefit from horizontal scalability and computing elasticity. Scaling an in-memory data grid is as simple as adding nodes to a cluster and removing it when it’s no longer needed. This is especially useful for businesses that demand speed in the management of hundreds of terabytes of data across multiple networked computers in geographically distributed data centers.

Since big data is complex and fast-moving, keeping data synchronized across data centers is vital to preserve data integrity. Keeping data in memory removes the bottleneck caused by constant access to disk -based storage and allows applications and their data to collocate in the same memory space. This allows for optimization that allows the amount of data to exceed the amount of available memory. Speed and efficiency is also improved by keeping frequently accessed data in memory and the rest on disk, consequently allowing data to reside both in memory and on disk.

Future-proofing Businesses With In-memory Computing

Data analytics is as much a part of every business as other marketing and business intelligence tools. Because data constantly grows at an exponential rate, in-memory computing serves as the enabler of data analytics because it provides speed, high availability, and straightforward scalability. Speeds more than 100 times faster than other solutions enable in-memory computing solutions to provide real-time insights that are applicable in a host of industries and use cases.

Location-based Marketing

A report from 2019 shows that location-based marketing helped 89% of marketers increase sales, 86% grow their customer base, and 84% improve customer engagement. Location data can be leveraged to identify patterns of behavior by analyzing frequently visited locations. By understanding why certain customers frequent specific locations and knowing when they are there, you can better target your marketing messages and make more strategic customer acquisitions. Location data can also be used as a demographic identifier to help you segment your customers and tailor your offers and messaging accordingly.

Fraud Detection

In-memory computing helps improve operational intelligence by detecting anomalies in transaction data immediately. Through high-speed analysis of large amounts of data, potential risks are detected early on and addressed as soon as possible. Transaction data is fast-moving and changes frequently, and in-memory computing is equipped to handle data as it changes. This is why it’s an ideal platform for payment processing; it helps make comparisons of current transactions with the history of all transactions on record in a matter of seconds. Companies typically have several fraud detection measures in place, and in-memory computing allows running these algorithms concurrently without compromising overall system performance. This ensures responsiveness of systems despite peak volume levels and avoids interruptions to customer service.

Tailored Customer Experiences

The real-time insights delivered by in-memory computing helps personalize experiences based on customer data. Because customer experiences are time-sensitive, processing and analyzing data at super-fast speeds is vital in capturing real-time event data that can be used to craft the best experience possible for each customer. Without in-memory computing, getting real-time data and other necessary information that ensures a seamless customer experience would have been near impossible.

Real-time data analytics helps provide personalized recommendations based on both existing and new customer data. By looking at historical data like previously visited pages and comparing them with newer data from the stream, businesses can craft the proper messaging and plan the next course of action. The anticipation and forecasting of customers’ future actions and behavior is the key to improving conversion rates and customer satisfaction—ultimately leading to higher revenues and more loyal customers.

Conclusion

Big data is the future, and companies that don’t use it to their advantage would find it hard to compete in this ever-connected world that demands results in an instant. Processing and analyzing data can only become more complex and challenging through time, and for this reason, in-memory computing should be a solution that companies should consider. Aside from improving their business from within, it will also help drive customer acquisition and revenue, while also providing a viable low-latency, high throughput platform for high-speed data analytics.

Simple RNN

Prerequisites for understanding RNN at a more mathematical level

Writing the A gentle introduction to the tiresome part of understanding RNN Article Series on recurrent neural network (RNN) is nothing like a creative or ingenious idea. It is quite an ordinary topic. But still I am going to write my own new article on this ordinary topic because I have been frustrated by lack of sufficient explanations on RNN for slow learners like me.

I think many of readers of articles on this website at least know that RNN is a type of neural network used for AI tasks, such as time series prediction, machine translation, and voice recognition. But if you do not understand how RNNs work, especially during its back propagation, this blog series is for you.

After reading this articles series, I think you will be able to understand RNN in more mathematical and abstract ways. But in case some of the readers are allergic or intolerant to mathematics, I tried to use as little mathematics as possible.

Ideal prerequisite knowledge:

  • Some understanding on densely connected layers (or fully connected layers, multilayer perception) and how their forward/back propagation work.
  •  Some understanding on structure of Convolutional Neural Network.

*In this article “Densely Connected Layers” is written as “DCL,” and “Convolutional Neural Network” as “CNN.”

1, Difficulty of Understanding RNN

I bet a part of difficulty of understanding RNN comes from the variety of its structures. If you search “recurrent neural network” on Google Image or something, you will see what I mean. But that cannot be helped because RNN enables a variety of tasks.

Another major difficulty of understanding RNN is understanding its back propagation algorithm. I think some of you found it hard to understand chain rules in calculating back propagation of densely connected layers, where you have to make the most of linear algebra. And I have to say backprop of RNN, especially LSTM, is a monster of chain rules. I am planing to upload not only a blog post on RNN backprop, but also a presentation slides with animations to make it more understandable, in some external links.

In order to avoid such confusions, I am going to introduce a very simplified type of RNN, which I call a “simple RNN.” The RNN displayed as the head image of this article is a simple RNN.

2, How Neurons are Connected

    \begin{equation*}   1 = 3 - 2 \end{equation*}

How to connect neurons and how to activate them is what neural networks are all about. Structures of those neurons are easy to grasp as long as that is about DCL or CNN. But when it comes to the structure of RNN, many study materials try to avoid showing that RNNs are also connections of neurons, as well as DCL or CNN(*If you are not sure how neurons are connected in CNN, this link should be helpful. Draw a random digit in the square at the corner.). In fact the structure of RNN is also the same, and as long as it is a simple RNN, and it is not hard to visualize its structure.

Even though RNN is also connections of neurons, usually most RNN charts are simplified, using blackboxes. In case of simple RNN, most study material would display it as the chart below.

But that also cannot be helped because fancier RNN have more complicated connections of neurons, and there are no longer advantages of displaying RNN as connections of neurons, and you would need to understand RNN in more abstract way, I mean, as you see in most of textbooks.

I am going to explain details of simple RNN in the next article of this series.

3, Neural Networks as Mappings

If you still think that neural networks are something like magical spider webs or models of brain tissues, forget that. They are just ordinary mappings.

If you have been allergic to mathematics in your life, you might have never heard of the word “mapping.” If so, at least please keep it in mind that the equation y=f(x), which most people would have seen in compulsory education, is a part of mapping. If you get a value x, you get a value y corresponding to the x.

But in case of deep learning, x is a vector or a tensor, and it is denoted with \boldsymbol{x} . If you have never studied linear algebra , imagine that a vector is a column of Excel data (only one column), a matrix is a sheet of Excel data (with some rows and columns), and a tensor is some sheets of Excel data (each sheet does not necessarily contain only one column.)

CNNs are mainly used for image processing, so their inputs are usually image data. Image data are in many cases (3, hight, width) tensors because usually an image has red, blue, green channels, and the image in each channel can be expressed as a hight*width matrix (the “height” and the “width” are number of pixels, so they are discrete numbers).

The convolutional part of CNN (which I call “feature extraction part”) maps the tensors to a vector, and the last part is usually DCL, which works as classifier/regressor. At the end of the feature extraction part, you get a vector. I call it a “semantic vector” because the vector has information of “meaning” of the input image. In this link you can see maps of pictures plotted depending on the semantic vector. You can see that even if the pictures are not necessarily close pixelwise, they are close in terms of the “meanings” of the images.

In the example of a dog/cat classifier introduced by François Chollet, the developer of Keras, the CNN maps (3, 150, 150) tensors to 2-dimensional vectors, (1, 0) or (0, 1) for (dog, cat).

Wrapping up the points above, at least you should keep two points in mind: first, DCL is a classifier or a regressor, and CNN is a feature extractor used for image processing. And another important thing is, feature extraction parts of CNNs map images to vectors which are more related to the “meaning” of the image.

Importantly, I would like you to understand RNN this way. An RNN is also just a mapping.

*I recommend you to at least take a look at the beautiful pictures in this link. These pictures give you some insight into how CNN perceive images.

4, Problems of DCL and CNN, and needs for RNN

Taking an example of RNN task should be helpful for this topic. Probably machine translation is the most famous application of RNN, and it is also a good example of showing why DCL and CNN are not proper for some tasks. Its algorithms is out of the scope of this article series, but it would give you a good insight of some features of RNN. I prepared three sentences in German, English, and Japanese, which have the same meaning. Assume that each sentence is divided into some parts as shown below and that each vector corresponds to each part. In machine translation we want to convert a set of the vectors into another set of vectors.

Then let’s see why DCL and CNN are not proper for such task.

  • The input size is fixed: In case of the dog/cat classifier I have mentioned, even though the sizes of the input images varies, they were first molded into (3, 150, 150) tensors. But in machine translation, usually the length of the input is supposed to be flexible.
  • The order of inputs does not mater: In case of the dog/cat classifier the last section, even if the input is “cat,” “cat,” “dog” or “dog,” “cat,” “cat” there’s no difference. And in case of DCL, the network is symmetric, so even if you shuffle inputs, as long as you shuffle all of the input data in the same way, the DCL give out the same outcome . And if you have learned at least one foreign language, it is easy to imagine that the orders of vectors in sequence data matter in machine translation.

*It is said English language has phrase structure grammar, on the other hand Japanese language has dependency grammar. In English, the orders of words are important, but in Japanese as long as the particles and conjugations are correct, the orders of words are very flexible. In my impression, German grammar is between them. As long as you put the verb at the second position and the cases of the words are correct, the orders are also relatively flexible.

5, Sequence Data

We can say DCL and CNN are not useful when you want to process sequence data. Sequence data are a type of data which are lists of vectors. And importantly, the orders of the vectors matter. The number of vectors in sequence data is usually called time steps. A simple example of sequence data is meteorological data measured at a spot every ten minutes, for instance temperature, air pressure, wind velocity, humidity. In this case the data is recorded as 4-dimensional vector every ten minutes.

But this “time step” does not necessarily mean “time.” In case of natural language processing (including machine translation), which you I mentioned in the last section, the numberings of each vector denoting each part of sentences are “time steps.”

And RNNs are mappings from a sequence data to another sequence data.

*At least I found a paper on the RNN’s capability of universal approximation on many-to-one RNN task. But I have not found any papers on universal approximation of many-to-many RNN tasks. Please let me know if you find any clue on whether such approximation is possible. I am desperate to know that. 

6, Types of RNN Tasks

RNN tasks can be classified into some types depending on the lengths of input/output sequences (the “length” means the times steps of input/output sequence data).

If you want to predict the temperature in 24 hours, based on several time series data points in the last 96 hours, the task is many-to-one. If you sample data every ten minutes, the input size is 96*6=574 (the input data is a list of 574 vectors), and the output size is 1 (which is a value of temperature). Another example of many-to-one task is sentiment classification. If you want to judge whether a post on SNS is positive or negative, the input size is very flexible (the length of the post varies.) But the output size is one, which is (1, 0) or (0, 1), which denotes (positive, negative).

*The charts in this section are simplified model of RNN used for each task. Please keep it in mind that they are not 100% correct, but I tried to make them as exact as possible compared to those in other study materials.

Music/text generation can be one-to-many tasks. If you give the first sound/word you can generate a phrase.

Next, let’s look at many-to-many tasks. Machine translation and voice recognition are likely to be major examples of many-to-many tasks, but here name entity recognition seems to be a proper choice. Name entity recognition is task of finding proper noun in a sentence . For example if you got two sentences “He said, ‘Teddy bears on sale!’ ” and ‘He said, “Teddy Roosevelt was a great president!” ‘ judging whether the “Teddy” is a proper noun or a normal noun is name entity recognition.

Machine translation and voice recognition, which are more popular, are also many-to-many tasks, but they use more sophisticated models. In case of machine translation, the inputs are sentences in the original language, and the outputs are sentences in another language. When it comes to voice recognition, the input is data of air pressure at several time steps, and the output is the recognized word or sentence. Again, these are out of the scope of this article but I would like to introduce the models briefly.

Machine translation uses a type of RNN named sequence-to-sequence model (which is often called seq2seq model). This model is also very important for other natural language processes tasks in general, such as text summarization. A seq2seq model is divided into the encoder part and the decoder part. The encoder gives out a hidden state vector and it used as the input of the decoder part. And decoder part generates texts, using the output of the last time step as the input of next time step.

Voice recognition is also a famous application of RNN, but it also needs a special type of RNN.

*To be honest, I don’t know what is the state-of-the-art voice recognition algorithm. The example in this article is a combination of RNN and a collapsing function made using Connectionist Temporal Classification (CTC). In this model, the output of RNN is much longer than the recorded words or sentences, so a collapsing function reduces the output into next output with normal length.

You might have noticed that RNNs in the charts above are connected in both directions. Depending on the RNN tasks you need such bidirectional RNNs.  I think it is also easy to imagine that such networks are necessary. Again, machine translation is a good example.

And interestingly, image captioning, which enables a computer to describe a picture, is one-to-many-task. As the output is a sentence, it is easy to imagine that the output is “many.” If it is a one-to-many task, the input is supposed to be a vector.

Where does the input come from? I told you that I was obsessed with the beauty of the last vector of the feature extraction part of CNN. Surprisingly the the “beautiful” vector, which I call a “semantic vector” is the input of image captioning task (after some transformations, depending on the network models).

I think this articles includes major things you need to know as prerequisites when you want to understand RNN at more mathematical level. In the next article, I would like to explain the structure of a simple RNN, and how it forward propagate.

* I make study materials on machine learning, sponsored by DATANOMIQ. I do my best to make my content as straightforward but as precise as possible. I include all of my reference sources. If you notice any mistakes in my materials, please let me know (email: yasuto.tamura@datanomiq.de). And if you have any advice for making my materials more understandable to learners, I would appreciate hearing it.

Integrate Unstructured Data into Your Enterprise to Drive Actionable Insights

In an ideal world, all enterprise data is structured – classified neatly into columns, rows, and tables, easily integrated and shared across the organization.

The reality is far from it! Datamation estimates that unstructured data accounts for more than 80% of enterprise data, and it is growing at a rate of 55 – 65 percent annually. This includes information stored in images, emails, spreadsheets, etc., that cannot fit into databases.

Therefore, it becomes imperative for a data-driven organization to leverage their non-traditional information assets to derive business value. We have outlined a simple 3-step process that can help organizations integrate unstructured sources into their data eco-system:

1. Determine the Challenge

The primary step is narrowing down the challenges you want to solve through the unstructured data flowing in and out of your organization. Financial organizations, for instance, use call reports, sales notes, or other text documents to get real-time insights from the data and make decisions based on the trends. Marketers make use of social media data to evaluate their customers’ needs and shape their marketing strategy.

Figuring out which process your organization is trying to optimize through unstructured data can help you reach your goal faster.

2. Map Out the Unstructured Data Sources Within the Enterprise

An actionable plan starts with identifying the range of data sources that are essential to creating a truly integrated environment. This enables organizations to align the sources with business objectives and streamline their data initiatives.

Deciding which data should be extracted, analyzed, and stored should be a primary concern in this regard. Even if you can ingest data from any source, it doesn’t mean that you should.

Collecting a large volume of unstructured data is not enough to generate insights. It needs to be properly organized and validated for quality before integration. Full, incremental, online, and offline extraction methods are generally used to mine valuable information from unstructured data sources.

3. Transform Unstructured Assets into Decision-Ready Insights

Now that you have all the puzzle pieces, the next step is to create a complete picture. This may require making changes in your organization’s infrastructure to derive meaning from your unstructured assets and get a 360-degree business view.

IDC recommends creating a company culture that promotes the collection, use, and sharing of both unstructured and structured business assets. Therefore, finding an enterprise-grade integration solution that offers enhanced connectivity to a range of data sources, ideally structured, unstructured, and semi-structured, can help organizations generate the most value out of their data assets.

Automation is another feature that can help speed up integration processes, minimize error probability, and generate time-and-cost savings. Features like job scheduling, auto-mapping, and workflow automation can optimize the process of extracting information from XML, JSON, Excel or audio files, and storing it into a relational database or generating insights.

The push to become a data-forward organization has enterprises re-evaluating the way to leverage unstructured data assets for decision-making. With an actionable plan in place to integrate these sources with the rest of the data, organizations can take advantage of the opportunities offered by analytics and stand out from the competition.

Introduction to Recommendation Engines

This is the second article of article series Getting started with the top eCommerce use cases. If you are interested in reading the first article you can find it here.

What are Recommendation Engines?

Recommendation engines are the automated systems which helps select out similar things whenever a user selects something online. Be it Netflix, Amazon, Spotify, Facebook or YouTube etc. All of these companies are now using some sort of recommendation engine to improve their user experience. A recommendation engine not only helps to predict if a user prefers an item or not but also helps to increase sales, ,helps to understand customer behavior, increase number of registered users and helps a user to do better time management. For instance Netflix will suggest what movie you would want to watch or Amazon will suggest what kind of other products you might want to buy. All the mentioned platforms operates using the same basic algorithm in the background and in this article we are going to discuss the idea behind it.

What are the techniques?

There are two fundamental algorithms that comes into play when there’s a need to generate recommendations. In next section these techniques are discussed in detail.

Content-Based Filtering

The idea behind content based filtering is to analyse a set of features which will provide a similarity between items themselves i.e. between two movies, two products or two songs etc. These set of features once compared gives a similarity score at the end which can be used as a reference for the recommendations.

There are several steps involved to get to this similarity score and the first step is to construct a profile for each item by representing some of the important features of that item. In other terms, this steps requires to define a set of characteristics that are discovered easily. For instance, consider that there’s an article which a user has already read and once you know that this user likes this article you may want to show him recommendations of similar articles. Now, using content based filtering technique you could find the similar articles. The easiest way to do that is to set some features for this article like publisher, genre, author etc. Based on these features similar articles can be recommended to the user (as illustrated in Figure 1). There are three main similarity measures one could use to find the similar articles mentioned below.

 

Figure 1: Content-Based Filtering

 

 

Minkowski distance

Minkowski distance between two variables can be calculated as:

(x,y)= (\sum_{i=1}^{n}{|X_{i} - Y_{i}|^{p}})^{1/p}

 

Cosine Similarity

Cosine similarity between two variables can be calculated as :

  \mbox{Cosine Similarity} = \frac{\sum_{i=1}^{n}{x_{i} y_{i}}} {\sqrt{\sum_{i=1}^{n}{x_{i}^{2}}} \sqrt{\sum_{i=1}^{n}{y_{i}^{2}}}} \

 

Jaccard Similarity

 

  J(X,Y) = |X ∩ Y| / |X ∪ Y|

 

These measures can be used to create a matrix which will give you the similarity between each movie and then a function can be defined to return the top 10 similar articles.

 

Collaborative filtering

This filtering method focuses on finding how similar two users or two products are by analyzing user behavior or preferences rather than focusing on the content of the items. For instance consider that there are three users A,B and C.  We want to recommend some movies to user A, our first approach would be to find similar users and compare which movies user A has not yet watched and recommend those movies to user A.  This approach where we try to find similar users is called as User-User Collaborative Filtering.  

The other approach that could be used here is when you try to find similar movies based on the ratings given by others, this type is called as Item-Item Collaborative Filtering. The research shows that item-item collaborative filtering works better than user-user collaborative filtering as user behavior is really dynamic and changes over time. Also, there are a lot more users and increasing everyday but on the other side item characteristics remains the same. To calculate the similarities we can use Cosine distance.

 

Figure 2: Collaborative Filtering

 

Recently some companies have started to take advantage of both content based and collaborative filtering techniques to make a hybrid recommendation engine. The results from both models are combined into one hybrid model which provides more accurate recommendations. Five steps are involved to make a recommendation engine work which are collection of data, storing of data, analyzing the data, filtering the data and providing recommendations. There are a lot of attributes that are involved in order to collect user data including browsing history, page views, search logs, order history, marketing channel touch points etc. which requires a strong data architecture.  The collection of data is pretty straightforward but it can be overwhelming to analyze this amount of data. Storing this data could get tricky on the other hand as you need a scalable database for this kind of data. With the rise of graph databases this area is also improving for many use cases including recommendation engines. Graph databases like Neo4j can also help to analyze and find similar users and relationship among them. Analyzing the data can be carried in different ways, depending on how strong and scalable your architecture you can run real time, batch or near real time analysis. The fourth step involves the filtering of the data and here you can use any of the above mentioned approach to find similarities to finally provide the recommendations.

Having a good recommendation engine can be time consuming initially but it is definitely beneficial in the longer run. It not only helps to generate revenue but also helps to to improve your product catalog and customer service.

Glorious career paths of a Big Data Professional

Are you wondering about the career profiles you may get to fill if you get into Big Data industry? If yes, then Bingo! This is the post that will inform you just about that. Big data is just an umbrella term. There are a lot of profiles and career paths that are covered under this umbrella term. Let us have a look at some of these profiles.

Data Visualisation Specialist

The process of visualizing data is turning out to be critical in guaranteeing information-driven representatives get the upfront investment required to actualize goal-oriented and significant Big Data extends in their organization. Making your data to tell a story and the craft of envisioning information convincingly has turned into a significant piece of the Big Data world and progressively associations need to have these capacities in-house. Besides, as a rule, these experts are relied upon to realize how to picture in different instruments, for example, Spotfire, D3, Carto, and Tableau – among numerous others. Information Visualization Specialists should be versatile and inquisitive to guarantee they stay aware of most recent patterns and answers for a recount to their information stories in the most intriguing manner conceivable with regards to the board room. 

 

Big Data Architect

This is the place the Hadoop specialists come in. Ordinarily, a Big Data planner tends to explicit information issues and necessities, having the option to portray the structure and conduct of a Big Data arrangement utilizing the innovation wherein they practice – which is, as a rule, mostly Hadoop.

These representatives go about as a significant connection between the association (and its specific needs) and Data Scientists and Engineers. Any organization that needs to assemble a Big Data condition will require a Big Data modeler who can serenely deal with the total lifecycle of a Hadoop arrangement – including necessity investigation, stage determination, specialized engineering structure, application plan, and advancement, testing the much-dreaded task of deploying lastly.

Systems Architect 

This Big data professional is in charge of how your enormous information frameworks are architected and interconnected. Their essential incentive to your group lies in their capacity to use their product building foundation and involvement with huge scale circulated handling frameworks to deal with your innovation decisions and execution forms. You’ll need this individual to construct an information design that lines up with the business, alongside abnormal state anticipating the improvement. The person in question will consider different limitations, adherence to gauges, and varying needs over the business.

Here are some responsibilities that they play:

    • Determine auxiliary prerequisites of databases by investigating customer tasks, applications, and programming; audit targets with customers and assess current frameworks.
    • Develop database arrangements by planning proposed framework; characterize physical database structure and utilitarian abilities, security, back-up and recuperation particulars.
    • Install database frameworks by creating flowcharts; apply ideal access methods, arrange establishment activities, and record activities.
    • Maintain database execution by distinguishing and settling generation and application advancement issues, figuring ideal qualities for parameters; assessing, incorporating, and putting in new discharges, finishing support and responding to client questions.
    • Provide database support by coding utilities, reacting to client questions, and settling issues.


Artificial Intelligence Developer

The certain promotion around Artificial Intelligence is additionally set to quicken the number of jobs publicized for masters who truly see how to apply AI, Machine Learning, and Deep Learning strategies in the business world. Selection representatives will request designers with broad learning of a wide exhibit of programming dialects which loan well to AI improvement, for example, Lisp, Prolog, C/C++, Java, and Python.

All said and done; many people estimate that this popular demand for AI specialists could cause a something like what we call a “Brain Drain” organizations poaching talented individuals away from the universe of the scholarly world. A month ago in the Financial Times, profound learning pioneer and specialist Yoshua Bengio, of the University of Montreal expressed: “The industry has been selecting a ton of ability — so now there’s a lack in the scholarly world, which is fine for those organizations. However, it’s not extraordinary for the scholarly world.” It ; howeverusiasm to perceive how this contention among the scholarly world and business is rotated in the following couple of years.

Data Scientist

The move of Big Data from tech publicity to business reality may have quickened, yet the move away from enrolling top Data Scientists isn’t set to change in 2020. An ongoing Deloitte report featured that the universe of business will require three million Data Scientists by 2021, so if their expectations are right, there’s a major ability hole in the market. This multidisciplinary profile requires specialized logical aptitudes, specialized software engineering abilities just as solid gentler abilities, for example, correspondence, business keenness, and scholarly interest.

Data Engineer

Clean and quality data is crucial in the accomplishment of Big Data ventures. Consequently, we hope to see a lot of opening in 2020 for Data Engineers who have a predictable and awesome way to deal with information transformation and treatment. Organizations will search for these special data masters to have broad involvement in controlling data with SQL, T-SQL, R, Hadoop, Hive, Python and Spark. Much like Data Scientists. They are likewise expected to be innovative with regards to contrasting information with clashing information types with have the option to determine issues. They additionally frequently need to make arrangements which enable organizations to catch existing information in increasingly usable information groups – just as performing information demonstrations and their modeling.

IT/Operations Manager Job Description

In Big data industry, the IT/Operations Manager is a profitable expansion to your group and will essentially be in charge of sending, overseeing, and checking your enormous information frameworks. You’ll depend on this colleague to plan and execute new hardware and administrations. The person in question will work with business partners to comprehend the best innovation ventures to address their procedures and concerns—interpreting business necessities to innovation plans. They’ll likewise work with venture chiefs to actualize innovation and be in charge of effective progress and general activities.

Here are some responsibilities that they play:

  • Manage and be proactive in announcing, settling and raising issues where required 
  • Lead and co-ordinate issue the executive’s exercises, notwithstanding ceaseless procedure improvement activities  
  • Proactively deal with our IT framework 
  • Supervise and oversee IT staffing, including enrollment, supervision, planning, advancement, and assessment
  • Verify existing business apparatuses and procedures remain ideally practical and worth included 
  • Benchmark, dissect, report on and make suggestions for the improvement and development of the IT framework and IT frameworks 
  • Advance and keep up a corporate SLA structure

Conclusion

These are some of the best career paths that big data professionals can play after entering the industry. Honesty and hard work can always take you to the zenith of any field that you choose to be in. Also, keep upgrading your skills by taking newer certifications and technologies. Good Luck