Posts

Unsupervised Learning in R: K-Means Clustering

Die Clusteranalyse ist ein gruppenbildendes Verfahren, mit dem Objekte Gruppen – sogenannten Clustern zuordnet werden. Die dem Cluster zugeordneten Objekte sollen möglichst homogen sein, wohingegen die Objekte, die unterschiedlichen Clustern zugeordnet werden möglichst heterogen sein sollen. Dieses Verfahren wird z.B. im Marketing bei der Zielgruppensegmentierung, um Angebote entsprechend anzupassen oder im User Experience Bereich zur Identifikation sog. Personas.

Es gibt in der Praxis eine Vielzahl von Cluster-Verfahren, eine der bekanntesten und gebräuchlichsten Verfahren ist das K-Means Clustering, ein sog. Partitionierendes Clusterverfahren. Das Ziel dabei ist es, den Datensatz in K Cluster zu unterteilen. Dabei werden zunächst K beliebige Punkte als Anfangszentren (sog. Zentroiden) ausgewählt und jedem dieser Punkte der Punkt zugeordnet, zu dessen Zentrum er die geringste Distanz hat. K-Means ist ein „harter“ Clusteralgorithmus, d.h. jede Beobachtung wird genau einem Cluster zugeordnet. Zur Berechnung existieren verschiedene Distanzmaße. Das gebräuchlichste Distanzmaß ist die quadrierte euklidische Distanz:

D^2 = \sum_{i=1}^{v}(x_i - y_i)^2

Nachdem jede Beobachtung einem Cluster zugeordnet wurde, wird das Clusterzentrum neu berechnet und die Punkte werden den neuen Clusterzentren erneut zugeordnet. Dieser Vorgang wird so lange durchgeführt bis die Clusterzentren stabil sind oder eine vorher bestimmte Anzahl an Iterationen durchlaufen sind.
Das komplette Vorgehen wird im Folgenden anhand eines künstlich erzeugten Testdatensatzes erläutert.

Zunächst wird ein Testdatensatz mit den Variablen „Alter“ und „Einkommen“ erzeugt, der 12 Fälle enthält. Als Schritt des „Data preprocessing“ müssen zunächst beide Variablen standardisiert werden, da ansonsten die Variable „Alter“ die Clusterbildung zu stark beeinflusst.

Das Ganze geplottet:

Wie bereits eingangs erwähnt müssen Cluster innerhalb möglichst homogen und zu Objekten anderer Cluster möglichst heterogen sein. Ein Maß für die Homogenität die „Within Cluster Sums of Squares“ (WSS), ein Maß für die Heterogenität „Between Cluster Sums of Squares“ (BSS).

Diese sind beispielsweise für eine 3-Cluster-Lösung wie folgt:

Sollte man die Anzahl der Cluster nicht bereits kennen oder sind diese extern nicht vorgegeben, dann bietet es sich an, anhand des Verhältnisses von WSS und BSS die „optimale“ Clusteranzahl zu berechnen. Dafür wird zunächst ein leerer Vektor initialisiert, dessen Werte nachfolgend über die Schleife mit dem Verhältnis von WSS und WSS gefüllt werden. Dies lässt sich anschließend per „Screeplot“ visualisieren.

Die „optimale“ Anzahl der Cluster zählt sich am Knick der Linie ablesen (auch Ellbow-Kriterium genannt). Alternativ kann man sich an dem Richtwert von 0.2 orientieren. Unterschreitet das Verhältnis von WSS und BSS diesen Wert, so hat man die beste Lösung gefunden. In diesem Beispiel ist sehr deutlich, dass eine 3-Cluster-Lösung am besten ist.

Fazit: Mit K-Means Clustering lassen sich schnell und einfach Muster in Datensätzen erkennen, die, gerade wenn mehr als zwei Variablen geclustert werden, sonst verborgen blieben. K-Means ist allerdings anfällig gegenüber Ausreißern, da Ausreißer gerne als separate Cluster betrachtet werden. Ebenfalls problematisch sind Cluster, deren Struktur nicht kugelförmig ist. Dies ist vor der Durchführung der Clusteranalyse mittels explorativer Datenanalyse zu überprüfen.

Einstieg in das Maschinelle Lernen mit Python(x,y)

Python(x,y) ist eine Python-Distribution, die speziell für wissenschaftliche Arbeiten entwickelt wurde. Es umfasst neben der Programmiersprache auch die Entwicklungsumgebung Spyder und eine Reihe integrierter Python-Bibliotheken. Mithilfe von Python(x,y) kann eine Vielzahl von Interessensbereichen bearbeitet werden. Dazu zählen unter anderem Bildverarbeitung oder auch das maschinelle Lernen. Das All-in-One-Setup für Python(x,y) ist für alle gängigen Betriebssysteme online erhältlich. Read more

Statistical Relational Learning – Part 2

In the first part of this series onAn Introduction to Statistical Relational Learning”, I touched upon the basic Machine Learning paradigms, some background and intuition of the concepts and concluded with how the MLN template looks like. In this blog, we will dive in to get an in depth knowledge on the MLN template; again with the help of sample examples. I would then conclude by highlighting the various toolkit available and some of its differentiating features.

MLN Template – explained

A Markov logic network can be thought of as a group of formulas incorporating first-order logic and also tied with a weight. But what exactly does this weight signify?

Weight Learning

According to the definition, it is the log odds between a world where F is true and a world where F is false,

and captures the marginal distribution of the corresponding predicate.

Each formula can be associated with some weight value, that is a positive or negative real number. The higher the value of weight, the stronger the constraint represented by the formula. In contrast to classical logic, all worlds (i.e., Herbrand Interpretations) are possible with a certain probability [1]. The main idea behind this is that the probability of a world increases as the number of formulas it violates decreases.

Markov logic networks with its probabilistic approach combined to logic posit that a world is less likely if it violates formulas unlike in pure logic where a world is false if it violates even a single formula. Consider the case when a formula with high weight i.e. more significance is violated implying that it is less likely in occurrence.

Another important concept during the first phase of Weight Learning while applying an MLN template is “Grounding”. Grounding means to replace each variable/function in predicate with constants from the domain.

Weight Learning – An Example

Note: All examples are highlighted in the Alchemy MLN format

Let us consider an example where we want to identify the relationship between 2 different types of verb-noun pairs i.e noun subject and direct object.

The input predicateFormula.mln file contains

  1. The predicates nsubj(verb, subject) and dobj(verb, object) and
  2. Formula of nsubj(+ver, +s) and dobj(+ver, +o)

These predicates or rules are to learn all possible SVO combinations i.e. what is the probability of a Subject-Verb-Object combination. The + sign ensures a cross product between the domains and learns all combinations. The training database consists of the nsubj and dobj tuples i.e. relations is the evidence used to learn the weights.

When we run the above command for this set of rules against the training evidence, we learn the weights as here:

Note that the formula is now grounded by all occurrences of nsubj and dobj tuples from the training database or evidence and the weights are attached to it at the start of each such combination.

But it should be noted that there is no network yet and this is just a set of weighted first-order logic formulas. The MLN template we created so far will generate Markov networks from all of our ground formulas. Internally, it is represented as a factor graph.where each ground formula is a factor and all the ground predicates found in the ground formula are linked to the factor.

Inference

The definition goes as follows:

Estimate probability distribution encoded by a graphical model, for a given data (or observation).

Out of the many Inference algorithms, the two major ones are MAP & Marginal Inference. For example, in a MAP Inference we find the most likely state of world given evidence, where y is the query and x is the evidence.

which is in turn equivalent to this formula.

Another is the Marginal Inference which computes the conditional probability of query predicates, given some evidence. Some advanced inference algorithms are Loopy Belief Propagation, Walk-SAT, MC-SAT, etc.

The probability of a world is given by the weighted sum of all true groundings of a formula i under an exponential function, divided by the partition function Z i.e. equivalent to the sum of the values of all possible assignments. The partition function acts a normalization constant to get the probability values between 0 and 1.

Inference – An Example

Let us draw inference on the the same example as earlier.

After learning the weights we run inference (with or without partial evidence) and query the relations of interest (nsubj here), to get inferred values.

Tool-kits

Let’s look at some of the MLN tool-kits at disposal to do learning and large scale inference. I have tried to make an assorted list of all tools here and tried to highlight some of its main features & problems.

For example, BUGS i.e. Bayesian Logic uses a Swift Compiler but is Not relational! ProbLog has a Python wrapper and is based on Horn clauses but has No Learning feature. These tools were invented in the initial days, much before the present day MLN looks like.

ProbCog developed at Technical University of Munich (TUM) & the AI Lab at Bremen covers not just MLN but also Bayesian Logic Networks (BLNs), Bayesian Networks & ProLog. In fact, it is now GUI based. Thebeast gives a shell to analyze & inspect model feature weights & missing features.

Alchemy from University of Washington (UoW) was the 1st First Order (FO) probabilistic logic toolkit. RockIt from University of Mannheim has an online & rest based interface and uses only Conjunctive Normal Forms (CNF) i.e. And-Or format in its formulas.

Tuffy scales this up by using a Relational Database Management System (RDBMS) whereas Felix allows Large Scale inference! Elementary makes use of secondary storage and Deep Dive is the current state of the art. All of these tools are part of the HAZY project group at Stanford University.

Lastly, LoMRF i.e. Logical Markov Random Field (MRF) is Scala based and has a feature to analyse different hypothesis by comparing the difference in .mln files!

 

Hope you enjoyed the read. The content starts from basic concepts and ends up highlighting key tools. In the final part of this 3 part blog series I would explain an application scenario and highlight the active research and industry players. Any feedback as a comment below or through a message is more than welcome!

Back to Part I – Statistical Relational Learning

Additional Links:

[1] Knowledge base files in Logical Markov Random Fields (LoMRF)

[2] (still) nothing clever Posts categorized “Machine Learning” – Markov Logic Networks

[3] A gentle introduction to statistical relational learning: maths, code, and examples

Statistical Relational Learning

An Introduction to Statistical Relational Learning – Part 1

Statistical Relational Learning (SRL) is an emerging field and one that is taking centre stage in the Data Science age. Big Data has been one of the primary reasons for the continued prominence of this relational learning approach given, the voluminous amount of data available now to learn interesting and unknown patterns from data. Moreover, the tools have also improved their processing prowess especially, in terms of scalability.

This introductory blog is a prelude on SRL and later on I would also touch base on more advanced topics, specifically Markov Logic Networks (MLN). To start off, let’s look at how SRL fits into one of the 5 different Machine Learning paradigms.

Five Machine Learning Paradigms

Lets look at the 5 Machine Learning Paradigms: Each of which is inspired by ideas from a different field!

  1. Connectionists as they are called and led by Geoffrey Hinton (University of Toronto & Google and one of the major names in the Deep Learning community) think that a learning algorithm should mimic the brain! After all it is the brain that does all the complex actions for us and, this idea stems from Neuroscience.
  2. Another group of Evolutionists whose leader is the late John Holland (from the University of Michigan) believed it is not the brain but evolution that was precedent and hence the master algorithm to build anything. And using this approach of having the fittest ones program the future they are currently building 3D prints of future robots.
  3. Another thought stems from Philosophy where Analogists like Douglas R. Hofstadter an American writer and author of popular and award winning book – Gödel, Escher, Bach: an Eternal Golden Braid believe that Analogy is the core of Cognition.
  4. Symbolists like Stephen Muggleton (Imperial College London) think Psychology is the base and by developing Rules in deductive reasoning they built Adam – a robot scientist at the University of Manchester!
  5. Lastly we have a school of thought which has its foundations rested on Statistics & Logic, which is the focal point of interest in this blog. This emerging field has started to gain prominence with the invention of Bayesian networks 2011 by Judea Pearl (University of California Los Angeles – UCLA) who was awarded with the Turing award (the highest award in Computer Science). Bayesians as they are called, are the most fanatical of the lot as they think everything can be represented by the Bayes theorem using hypothesis which can be updated based on new evidence.

SRL fits into the last paradigm of Statistics and Logic. As such it offers another alternative to the now booming Deep Learning approach inspired from Neuroscience.

Background

In many real world scenario and use cases, often the underlying data is assumed to be independent and identically distributed (i.i.d.). However, real world data is not and instead consists of many relationships. SRL as such attempts to represent, model, and learn in the relational domain!

There are 4 main Models in SRL

  1. Probabilistic Relational Models (PRM)
  2. Markov Logic Networks (MLN)
  3. Relational Dependency Networks (RDN)
  4. Bayesian Logic Programs (BLP)

It is difficult to cover all major models and hence the focus of this blog is only on the emerging field of Markov Logic Networks.
MLN is a powerful framework that combines statistics (i.e. it uses Markov Random Fields) and logical reasoning (first order logic).

 

markov-random-fields-first-order-logic

Academia

Some of the prominent names in academic and the research community in MLN include:

  1. Professor Pedro Domingos from the University of Washington is credited with introducing MLN in his paper from 2006. His group created the tool called Alchemy which was one of the first, First Order Logic tools.
  2. Another famous name – Professor Luc De Raedt from the AI group at University of Leuven in Belgium, and their team created the tool ProbLog which also has a Python Wrapper.
  3. HAZY Project (Stanford University) led by Prof. Christopher Ré from the InfoLab is doing active research in this field and Tuffy, Felix, Elementary, Deep Dive are some of the tools developed by them. More on it later!
  4. Talking about academia close by i.e. in Germany, Prof. Michael Beetz and his entire team moved from TUM to TU Bremen. Their group invented the tool – ProbCog
  5. At present, Prof. Volker Tresp from Ludwig Maximilians University (LMU), Munich & Dr. Matthias Nickles at Technical University of Munich (TUM) have research interests in SRL.

Theory & Formulation

A look at some background and theoretical concepts to understand MLN better.

A. Basics – Probabilistic Graphical Models (PGM)

The definition of a PGM goes as such:

A PGM encodes a joint p(x,y) or conditional p(y|x) probability distribution such that given some observations we are provided with a full probability distribution over all feasible solutions.

A PGM helps to encode relationships between a set of random variables. And it achieves this by making use of a graph! These graphs can be either be Directed or Undirected Graphs.

B. Markov Blanket

A Markov Blanket is a Directed Acyclic graph. It is a Bayesian network and as you can see the central node A highlighted in red is dependent on its parents and parents of descendents (moralization) by the circle drawn around it. Thus these nodes are the only knowledge needed to predict node A.

C. Markov Random Fields (MRF)

A MRF is an Undirected graphical model. Every node in an MRF satisfies the Local Markov property of Conditional Independence, i.e. a node is conditionally independent of another node, given its neighbours. And now relating it to Markov Blanket as explained previously, a Markov blanket for a node is simply its adjacent nodes!

Intuition

We now that Probability handles uncertainty whereas Logic handles complexity. So why not make use of both of them to model relationships in data that is both uncertain and complex. Markov Logic Networks (MLN) precisely does that for us!

MLN is composed of a set of pairs of  <w, F> where F is the formula (written in FO logic) and weights (real numbers identifying the strength of the constraint).

MLN basically provides a template to ground a Markov network. Grounding would be explained in detail in the next but one section on “Weight Learning”.

It can be defined as a Log linear model where probability of a world is given by the weighted sum of all true groundings of a formula i under an exponential function. It is then divided by Z which is termed as the partition function and used to normalize and get probability values between 0 and 1.

propability_of_a_world_x

The MLN Template

Rules or Predicates

The relation to be learned is expressed in FO logic. Some of the different possible FO logical connectives and quantifiers are And (^), Or (V), Implication (→), and many more. Plus, Formulas may contain one or more predicates, connected to each other with logical connectives and quantified symbols.

Evidence

Evidence represent known facts i.e. the ground predicates. Each fact is expressed with predicates that contain only constants from their corresponding domains.

Weight Learning

Discover the importance of relations based on grounded evidence.

Inference

Query relations, given partial evidence to infer a probabilistic estimate of the world.

More on Weight Learning and Inference in the next part of this series!

Hope you enjoyed the read. I have deliberately kept the content basic and a mix of non technical and technical so as to highlight first the key players and some background concepts and generate the reader’s interest in this topic, the technicalities of which can easily be read in the paper. Any feedback as a comment below or through a message are more than welcome!

Continue reading with Statistical Relational Learning – Part II.

References