
2018/2019

# Probabilistic Graphical Models

## Planning

• Bayes fundamentals
• Directed / undirected models
• Naive Bayes / MRF, CRF
• Inference
• Marginal vs. MAP
• Approximate: MCMC
• Learning
• Maximum Likelihood
• Bayesian learning

## Planning

• Latent variable models
• Mixture models, EM
• Refs:
• http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/020217.pdf
• https://ermongroup.github.io/cs228-notes/

## Why probabilistic models

• Model uncertainty and intrinsic variability
• “there is a 90% chance of rain”
• a model may generate various (random) answers for the same input
• Solid foundations to
• inject prior knowledge
• explain some forms of regularization

## The problem of features independence

• We want to detect SPAM emails, using the whole text as features.
• We can encode the occurrence of the $$i^{th}$$ entry of the vocabulary as $$X_i \in \{0,1\}$$, for $$1\leq i \leq V$$

## The problem of features independence

• The full joint $$P(x)=P(x_1,\cdots,x_V)$$ treats the two following texts as completely unrelated events:
Cheap product to sell
Buy a cheap product
• Issues:
• Exponential number of values ($$2^V$$ possible assignments) !
• No sharing of information between them

## Conditional independence

• We may assume conditional independence btw variables: $p(y,x_1,\cdots,x_V) = p(y) \prod_i^V p(x_i|y)$

• This is the Naive Bayes assumption
• With $$x_i =$$ word occurrence, this is also known as Bag-of-words

• Requires only $$O(V)$$ parameters !

## Multinomial distribution

• $$p(x_i|y)$$ models the likelihood to observe the word $$x_i$$ in SPAM (resp. non-SPAM)
• Discrete domain for $$x_i$$ $$\rightarrow$$ Multinomial distribution
• Parameterized by $$2V$$ parameters ($$V$$ is the vocabulary size, 2 is the size of the domain of $$x_i$$): $p(x_i=j|y) = \theta_{i,j}~~~~~ 1 \leq i \leq V$ $\sum_{j=1}^2 \theta_{i,j} = 1$
• (note: we could have used only $$V$$ parameters here, but in the general case with $$N$$ classes, we need $$NV$$ parameters)

## 3 ways to describe the model

• 1: describe with the equation

$p(y,x_1,\cdots,x_V) = p(y) \prod_i^V p(x_i|y)$

## 3 ways to describe the model

• 2: describe our independence assumptions via a graph

$$X_i$$ is conditionally independent from all other $$X_j$$ given $$Y$$

## Generative story

3: generative models can also be described by a generative story:

• To generate an email, we first sample whether it’s a spam or not from the Bernoulli distribution $Y \sim B(\alpha)$
• Then, for each of the $$V$$ words in the vocabulary:
• sample $$X_i$$ given $$Y$$ (i.e., decide whether word $$i$$ occurs in the email) from the Bernoulli distribution $X_i \sim B(\theta_{i,Y})$
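
A minimal sketch of this generative story in Python (the vocabulary and the parameters $$\alpha, \theta$$ below are made-up toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for a toy vocabulary of V = 4 words.
vocab = ["cheap", "product", "buy", "meeting"]
alpha = 0.3                                   # P(Y = spam)
theta = np.array([[0.1, 0.2, 0.1, 0.4],       # theta[0, i] = P(X_i = 1 | Y = 0)
                  [0.8, 0.7, 0.6, 0.1]])      # theta[1, i] = P(X_i = 1 | Y = 1)

def generate_email():
    y = rng.random() < alpha                       # sample Y ~ Bernoulli(alpha)
    x = rng.random(len(vocab)) < theta[int(y)]     # sample each X_i ~ Bernoulli(theta_{i,Y})
    return int(y), x.astype(int)

y, x = generate_email()
print("spam" if y else "ham", dict(zip(vocab, x)))
```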

## Bayesian network

It’s a pair $$(G,P)$$ where

• $$G=(V(G),A(G))$$ is an acyclic directed graph
• $$V(G)$$ is a set of vertices
• $$A(G) \subseteq V(G) \times V(G)$$ is a set of arcs
• Joint probability distribution: $P(X_{V(G)}) = \prod_{v \in V(G)} p(X_v | X_{\pi(v)})$

## Graphical models

• $$p(X_v | X_{\pi(v)})$$ is a CPD (Conditional Probability Distribution)

2 types of graphical models:

• If $$G$$ is a DAG = Bayesian network

• If $$G$$ is an undirected graph = Markov random field
• (see next)

## Bayesian network

Exercise: Write the equation of this graphical model ?

## Bayesian network

$P(X_1,X_2,X_3)=P(X_1|X_2,X_3)P(X_2|X_3)P(X_3)$

• $$X_2$$ is conditionally independent of $$X_1$$ given $$X_3$$

## Markov Random Fields

Example: Model of voting preferences

• (A,B), (B,C), (C,D) and (D,A) are friends
• They talk and influence each other when voting

## Markov Random Fields

The probability of a given vote, say $$(A,B,C,D)=(0,1,1,1)$$, is proportional to the score:

$\tilde{p}(A,B,C,D) = \phi(A,B)\phi(B,C)\phi(C,D)\phi(D,A)$

• $$\phi(X,Y)$$ = factor that assigns more weight to votes among friends:
• $$\phi(X,Y)=10$$ if X=Y=1
• $$\phi(X,Y)=5$$ if X=Y=0
• $$\phi(X,Y)=1$$ otherwise

## Markov Random Fields

We still need to normalize to get a probability distribution:

$p(A,B,C,D) = \frac 1 Z \tilde{p}(A,B,C,D)$ with $Z = \sum_{A,B,C,D} \tilde{p}(A,B,C,D)$

Would be difficult to model with a Bayesian network !
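
A small sketch that enumerates all $$2^4$$ assignments of the voting example to compute $$\tilde{p}$$ and $$Z$$ with the factor values given above (brute force is fine here, but this is exactly the sum that becomes intractable with many variables):

```python
import itertools

def phi(x, y):
    # Pairwise factor from the slides: friends tend to vote alike.
    if x == 1 and y == 1:
        return 10.0
    if x == 0 and y == 0:
        return 5.0
    return 1.0

def p_tilde(a, b, c, d):
    return phi(a, b) * phi(b, c) * phi(c, d) * phi(d, a)

Z = sum(p_tilde(*v) for v in itertools.product([0, 1], repeat=4))
print("Z =", Z)
print("p(0,1,1,1) =", p_tilde(0, 1, 1, 1) / Z)
```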

## MRF

An MRF is a probability distribution over variables $$x_1,\cdots,x_n$$ defined by an undirected graph $$G$$ in which nodes correspond to variables $$x_i$$. $$p$$ has the form: $p=\frac 1 Z \prod_{c\in C} \phi_c(x_c)$ where $$C$$ = the set of cliques of $$G$$.

Normalizing constant: $Z=\sum_{x_1,\cdots,x_n}\prod_{c\in C} \phi_c(x_c)$

## MRF vs. Bayesian Networks

• MRF can be applied to a wider range of problems
• MRF can compactly represent certain dependencies that BN cannot

• Computing $$Z$$ may require to sum over an exponential number of assignments (NP-hard)
• Easier to generate data from a BN

## MRF to/from BN

• It is always possible to transform a BN into an MRF

• The inverse is possible, but may be computationally intractable

## Conditional Random Fields

Important special case of MRF:

$P(y|x) = \frac 1 {Z(x)} \prod_{t}^T \phi(y_t|x) \prod_t^{T-1} \phi(y_t,y_{t+1}|x)$

## CRF

$P(y|x) = \frac 1 {Z(x)} \prod_{t}^T \phi(y_t|x) \prod_t^{T-1} \phi(y_t,y_{t+1}|x)$

• $$x$$ is actually composed of 2 components:
• global observed features (not shown in the graph): e.g., nb of observed letters
• local observations ($$x_t$$)
• All potentials are conditioned on $$x$$:
• $$Z(x)$$ now depends on $$x$$, because $$x$$ is assumed fixed, and so we sum only over $$y$$

## CRF as an exponential model

In most applications, we define potentials as follows:

$\phi(x_c) = \exp(w_c^T f(x_c))$

where $$f(x_c)$$ is a feature vector, e.g.:

• How many vertical lines are there ?
• Is the image closest to training example 524 ?

Ref: https://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf

## CRF as an exponential model

Why use this form $$\phi(x_c) = \exp(w_c^T f(x_c))$$ ?

Because:

• It is always positive
• The global product simplifies into a log-linear model
• Logistic regression (or Maximum Entropy model) is a special case of a CRF

Ref: https://homes.cs.washington.edu/~thickstn/docs/crf.pdf
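
A toy sketch of such a log-linear potential: with a single output variable, the product of potentials divided by $$Z(x)$$ is a softmax over classes, i.e. logistic regression (the features and weights below are invented for illustration):

```python
import numpy as np

def log_linear_posterior(f_x, W):
    """P(y | x) for a single-node CRF with potentials exp(W[y] @ f(x))."""
    scores = W @ f_x                   # one score w_y^T f(x) per class y
    scores -= scores.max()             # numerical stability
    unnorm = np.exp(scores)            # phi(y, x)
    return unnorm / unnorm.sum()       # divide by Z(x)

f_x = np.array([1.0, 3.0, 0.0])        # hypothetical feature vector f(x)
W = np.array([[0.5, -0.2, 1.0],        # hypothetical weights, one row per class
              [-0.3, 0.4, 0.1]])
print(log_linear_posterior(f_x, W))    # softmax over classes = logistic regression
```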

# Inference

## Model usage

Once a model is defined, how can it be used ?

• Marginal inference:
• Assuming that the parameters $$\theta~~$$ are known/fixed
• How to compute the distribution of variables of interest $$Y~~$$ ?
• Training:
• Assuming some annotated dataset is provided: $(X_1,Y_1),\cdots,(X_T,Y_T)$
• How to compute the optimal parameters $$\theta~~$$ ?

## Marginal Inference

• It’s a way to answer questions given a model

• Example of marginal inference
• What is the proportion of spams ? $p(y)=\sum_{x_1} \cdots \sum_{x_n} p(y,x_1,\cdots,x_n)$
• What is the probability that an email is a spam ? $p(y|x)=\frac {p(y,x_1,\cdots,x_n)}{p(x_1,\cdots,x_n)}$
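
A brute-force sketch of both queries for a tiny Naive Bayes model (hypothetical parameters): the marginal sums the joint over all $$x$$, and the posterior normalizes the joint by the evidence:

```python
import itertools
import numpy as np

# Hypothetical Naive Bayes parameters over V = 3 binary features.
p_y = np.array([0.7, 0.3])                     # p(y = 0), p(y = 1)
theta = np.array([[0.1, 0.2, 0.3],             # p(x_i = 1 | y = 0)
                  [0.8, 0.6, 0.4]])            # p(x_i = 1 | y = 1)

def joint(y, x):
    px = np.where(np.array(x) == 1, theta[y], 1 - theta[y])
    return p_y[y] * px.prod()

# Marginal p(y): sum the joint over all 2^V assignments of x.
for y in (0, 1):
    print("p(y=%d) =" % y,
          sum(joint(y, x) for x in itertools.product([0, 1], repeat=3)))

# Posterior p(y | x) for one observed email x.
x = (1, 0, 1)
evidence = joint(0, x) + joint(1, x)           # p(x_1, ..., x_n)
print("p(y=1 | x) =", joint(1, x) / evidence)
```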

## MAP Inference

Instead of computing the whole distribution as in marginal inference, it’s often easier to compute a single optimal value:

• Maximum a posteriori (MAP) inference
• What is the most likely spam message ? $\max_{x_1,\cdots,x_n} p(y=1,x_1,\cdots,x_n)$
• Classification: Is this email a spam ? EXERCISE

## MAP Inference

Instead of computing the whole distribution as in marginal inference, it’s often easier to compute a single optimal value:

• Maximum a posteriori (MAP) inference
• What is the most likely spam message ? $\max_{x_1,\cdots,x_n} p(y=1,x_1,\cdots,x_n)$
• Classification: Is this email a spam ? $\max_{y} p(y|x)$

## Exact inference in graphical models

Algorithms exist to compute exact inference in the general case:

• Variable elimination
• Better: Junction tree algorithms:
• (Loopy) Belief propagation

… But the general case is NP-hard !

• $$\simeq$$ No known algorithm with polynomial complexity
• (http://www.columbia.edu/~cs2035/courses/ieor4405.S17/np.pdf)

## Exact MAP inference

MAP inference is less ambitious, and sometimes easier (ex: MRF):

$\arg\max_x \log p(x) = \arg\max_x \sum_c \log \phi_c(x_c)$

(the $$-\log Z$$ term does not depend on $$x$$ and can be dropped)

• “one point” in general inference
• depending on the model $$\rightarrow$$ efficient specific algorithms
• Graphcuts
• Viterbi

## Inference in practice

• Either use a special / known model with its dedicated algorithms

• Or forget exact inference, and run approximate inference !

## Approximate inference

2 classes of approximate inference methods:

• Variational (see next in deep models)

• Sampling = Monte Carlo approach:
• Draw one random sample, following the model
• Repeat many times
• The target distribution is approximated by this cloud of points

## Extra slides

Optional / going further:

• Forward sampling
• Importance sampling
• MCMC methods
• Gibbs sampling
• Metropolis-Hastings

## Sampling from a Bayesian network

• Computers can generate samples from simple distributions (e.g. uniform)

• Sampling from a multinomial:
• subdivide the unit interval according to the multinomial probabilities
• sample uniformly in $$[0,1]$$ and pick the corresponding sub-interval
• Sampling from directed graphical models: forward sampling
• Sample variables in topological order (starting from variables without parents)
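
A minimal sketch of multinomial sampling by subdividing the unit interval (toy probabilities):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_multinomial(probs):
    """Subdivide [0, 1) according to probs and see where a uniform draw falls."""
    u = rng.random()
    return int(np.searchsorted(np.cumsum(probs), u))

counts = np.bincount([sample_multinomial([0.2, 0.5, 0.3]) for _ in range(10000)],
                     minlength=3)
print(counts / counts.sum())   # close to [0.2, 0.5, 0.3]
```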

## Forward sampling

• Sample the value of $$X_3=a$$ from $$P(X_3)$$
• Sample the value of $$X_2=b$$ from $$P(X_2|X_3=a)$$
• Sample the value of $$X_1=c$$ from $$P(X_1|X_3=a,X_2=b)$$
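
A sketch of forward sampling for this 3-variable network, assuming some made-up CPDs $$P(X_3)$$, $$P(X_2|X_3)$$, $$P(X_1|X_2,X_3)$$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CPDs for the binary network P(X3) P(X2|X3) P(X1|X2,X3).
p_x3 = 0.6                                    # P(X3 = 1)
p_x2 = {0: 0.3, 1: 0.8}                       # P(X2 = 1 | X3)
p_x1 = {(0, 0): 0.1, (0, 1): 0.5,             # P(X1 = 1 | X2, X3)
        (1, 0): 0.4, (1, 1): 0.9}

def forward_sample():
    x3 = int(rng.random() < p_x3)             # topological order: X3 first
    x2 = int(rng.random() < p_x2[x3])         # then X2 given X3
    x1 = int(rng.random() < p_x1[(x2, x3)])   # then X1 given both parents
    return x1, x2, x3

print(forward_sample())
```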

## Rejection sampling

What if $$X_2=x_2$$ is observed ?

• Sample the value of $$X_3=a$$ from $$P(X_3)$$
• Sample the value of $$X_2=b$$ from $$P(X_2|X_3=a)$$
• If $$b\neq x_2$$, reject sample and re-start
• Sample the value of $$X_1=c$$ from $$P(X_1|X_3=a,X_2=x_2)$$
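
The same toy network with rejection sampling, assuming $$X_2=1$$ is observed; samples that disagree with the evidence are simply re-drawn:

```python
import numpy as np

rng = np.random.default_rng(0)
p_x3 = 0.6                                    # same hypothetical CPDs as above
p_x2 = {0: 0.3, 1: 0.8}
p_x1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}

def rejection_sample(x2_obs=1):
    while True:
        x3 = int(rng.random() < p_x3)
        x2 = int(rng.random() < p_x2[x3])
        if x2 != x2_obs:                      # evidence mismatch: reject and restart
            continue
        x1 = int(rng.random() < p_x1[(x2, x3)])
        return x1, x2, x3

samples = [rejection_sample() for _ in range(5000)]
print("P(X1=1 | X2=1) ~", sum(s[0] for s in samples) / len(samples))
```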

## Usages of sampling

• Can be used to estimate a (marginal or joint) proba: $P(X=x)=\frac {N(X=x)}{N}$
• with $$N=$$ total nb of samples
• with $$N(X=x)=$$ nb of samples with $$X=x$$

## Usages of sampling

• Can be used to estimate a conditional proba:

$P(X=x|Y=y) = \frac {P(X,Y)}{P(Y)} = \frac{N(X=x,Y=y)}{N} \cdot \frac{N}{N(Y=y)}$ $P(X=x|Y=y) = \frac {N(X=x,Y=y)}{N(Y=y)}$

## Limitations of rejection sampling

Evidence $$\rightarrow$$ high rejection rate $$\rightarrow$$ very costly !

Solution: Don’t reject samples, but rather fix the values of observed nodes:

• Gibbs sampling
• Importance sampling

## Gibbs sampling

• Samples are dependent and form a Markov Chain
• = $$s_t$$ depends on $$s_{t-1}$$
• Is a type of Markov Chain Monte Carlo (MCMC) method

• Guaranteed to converge when all $$P>0$$

## Gibbs sampling

Init $$t=0$$:

• Fix the values of all observed variables
• For each latent variable:
• Initialize it randomly

## Gibbs sampling

• Intuition: update every variable one after the other, assuming all the others are given from previous iteration

• Iterate through each latent variable $$x_i$$ one at a time:
• Assume all other variables $$x_{-i}$$ are observed
• Derive the distribution of possible values that $$x_i$$ can take: $$P(x_i|x_{-i})$$
• Sample one such value: $$x'_i \sim P(x_i|x_{-i})$$
• Fix the new value of $$x_i$$ and continue to $$i+1$$
• Iterate this process long enough
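
A sketch of this loop on the voting MRF from earlier (factor values 10/5/1), where $$P(x_i|x_{-i})$$ is obtained by evaluating the unnormalized joint for each candidate value of $$x_i$$ and normalizing:

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # friendships (A,B),(B,C),(C,D),(D,A)

def phi(x, y):
    return 10.0 if (x, y) == (1, 1) else 5.0 if (x, y) == (0, 0) else 1.0

def unnorm(x):
    return np.prod([phi(x[i], x[j]) for i, j in edges])

x = rng.integers(0, 2, size=4)                    # random init of the latent variables
samples = []
for t in range(6000):
    for i in range(4):                            # update each variable in turn
        scores = []
        for v in (0, 1):                          # unnormalized joint for each value of x_i
            x[i] = v
            scores.append(unnorm(x))
        p1 = scores[1] / (scores[0] + scores[1])  # normalized P(x_i = 1 | x_-i)
        x[i] = int(rng.random() < p1)
    if t > 1000:                                  # discard burn-in samples
        samples.append(x.copy())
print("P(A=1) ~", np.mean([s[0] for s in samples]))
```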

## Gibbs sampling

• After some burn-in period, the samples are (approximately) drawn from the posterior
• The burn-in length is a new hyper-parameter
• How to derive $$P(x_i|x_{-i})$$ ?

## Gibbs sampling: derivation

• We can get back to the joint: $$P(x_i|x_{-i}) = \frac {P(x_i,x_{-i})}{P(x_{-i})}$$
• The denominator does not depend on $$x_i$$, so: $$P(x_i|x_{-i}) \propto P(x_i,x_{-i})$$

Hence, in practice, we can just:

• compute the joint successively for each possible value of $$x_i$$ (keeping all other variables at their value from the previous iteration)
• normalize s.t. the sum is 1

## Gibbs sampling: enhancements

• Ways to improve Gibbs sampling (reduce variance):
• Collapsed Gibbs sampling
• Generate multiple chains
• Pick one sample out of K

## Importance sampling

We may rewrite our objective ($$E_{x\sim p}[f(x)]$$) as: $E_{x\sim p}[f(x)] = \sum_x f(x)p(x)$ $E_{x\sim p}[f(x)] = \sum_x f(x)\frac {p(x)}{q(x)} q(x)$ $E_{x\sim p}[f(x)] = E_{x\sim q} [f(x)w(x)]$ $E_{x\sim p}[f(x)] \simeq \frac 1 T \sum_t f(x_t)w(x_t)$

= Sample according to $$q$$ and reweigh the samples with $$w(x)=\frac{p(x)}{q(x)}$$
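
A numerical sketch: estimating $$E_{x\sim p}[f(x)]$$ while sampling only from a proposal $$q$$ (both distributions below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])      # target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # proposal we can actually sample from
f = lambda x: x ** 2

xs = rng.choice(values, size=100000, p=q)   # sample according to q ...
w = p[xs] / q[xs]                           # ... and reweigh with w(x) = p(x) / q(x)
print("IS estimate:", np.mean(f(xs) * w))
print("exact      :", np.sum(f(values) * p))
```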

## Markov Chain Monte Carlo

• Allows to compute marginals and MAP inference

Markov Chain:

• sequence of state variables $$S_1,S_2,\cdots$$
• generated from $$P(S_i|S_{i-1})$$ ($$\rightarrow$$ transition matrix $$T$$)
• proba of ending up in each state after $$t$$ steps: $p_t = T^t p_0$
• When it exists, stationary distribution: $\pi = \lim_{t\rightarrow +\infty} p_t$

MCMC:

• a Markov Chain whose stationary distribution $$\pi$$ is the target distribution $$p$$

## Existence of $$\pi$$

2 sufficient conditions:

• Irreducibility: possible to get from any state $$x$$ to any $$x'$$ with proba > 0 in a finite nb of steps
• Prevents absorbing states
• Aperiodicity: possible to return to any state: $$\exists n$$ s.t. $\forall i, \forall n'\geq n ~~~ P(s_{n'}=i|s_0=i)>0$
• Prevents “oscillating” chains

Any ergodic chain has a stationary distribution.

## MCMC usage

Assume we have a Markov Chain with $$\pi=p$$, and an initial assignment $$x_0$$ of the variables of $$p$$; we can then:

• run the Markov chain from $$x_0$$ for $$B$$ burn-in steps
• run the Markov chain for $$N$$ sampling steps and collect the visited states = a sample of $$p$$

We can compute marginals from this sample, or pick the MAP sample.

## Metropolis-Hastings

$$T(x'|x)$$ is built from 2 components:

• A transition kernel $$Q(x'|x)$$ defined by the user
• An acceptance probability for moves proposed by $$Q$$: $A(x'|x)=\min\left( 1, \frac {P(x')Q(x|x')}{P(x)Q(x'|x)} \right)$

At each step of the Markov chain:

• we propose a new point $$x'$$ according to $$Q$$
• we accept $$x'$$ with probability $$A(x'|x)$$
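
A sketch of Metropolis-Hastings on a 1-D continuous target (an invented unnormalized two-mode density), with a symmetric Gaussian proposal so that the $$Q$$ terms cancel in $$A(x'|x)$$:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized target: a made-up two-mode mixture of Gaussians."""
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

x = 0.0
samples = []
for t in range(50000):
    x_new = x + rng.normal(scale=1.0)               # proposal Q: Gaussian centred on x
    accept = min(1.0, p_tilde(x_new) / p_tilde(x))  # A(x'|x); symmetric Q cancels
    if rng.random() < accept:
        x = x_new
    if t > 1000:                                    # discard burn-in
        samples.append(x)
print("mean of samples:", np.mean(samples))
```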

## Metropolis-Hastings

Assume $$Q$$ is uniform:

• if the proposed $$x'$$ is in a low-probability region of $$P$$, it is often rejected

• For continuous variables, $$Q$$ is usually a Gaussian centered on $$x$$

• $$\forall Q$$, MH ensures that $$P$$ = stationary distribution of the Markov chain

## Gibbs sampling

Special case of MH:

• Iterate through each variable one at a time:
• sample $$x'_i \sim p(x_i|x_{-i})$$
• set $$x^{t+1} = (x_1^t,\cdots,x'_i,\cdots,x_n^t)$$

Running time of MCMC:

• related to the number of burn-in steps $$B$$
• = mixing time of the Markov chain
• may be very slow:
• if the proposal does not move fast enough
• if $$P$$ has 2 well-separated narrow modes

# Linear vs. probabilistic models

## Wrap-up: parametric model

• Function $$f(X)$$ that does not know anything about the true generative process of $$X$$
• With parameters $$f_{\theta}(X)$$ to make the function compatible with observations $$X$$
• We have fitted a linear function to some “unknown” observations (“and”, “xor”…): $f_{\theta}(x) = -2 + 2x_0 + x_1$

## Where is the test set ?

• Our “and”, “xor”, “or”… data is generated by a true process that is deterministic, and there exist only a few possible input combinations.
• train and test sets are the same !

## Another model: graphical

• “xor”-type data fails with linear model
• Question: propose a graphical model

## Joint model

• $$P(x_1,x_2,y;\theta)$$ is good here, because there are very few possible combinations
• Question: how many parameters ? What do parameters represent ?

## Discriminative model

• $$P(y|x_1,x_2;\theta)$$ has the same number of parameters
• Question: how do the parameters differ ?

## Naive Bayes model

• What is the Naive Bayes model in this case ?
• How many parameters ?
• Is it good ?

# Learning in directed models

## Learning in directed models

Let us assume that:

• the domain is governed by unknown distribution $$\rho$$
• we are given a dataset $$D$$ of $$m$$ i.i.d. samples from $$\rho$$
• we are given a family of models $$M$$, defined by a set of parameters

Learning = find a model in $$M$$ that best approximates $$\rho$$

## Learning in directed models

What is best ?

• Let $$X=$$ all random variables in our model
• Let $$\theta=$$ parameters of our model
• We want the probability distribution $$p(X;\theta)$$ defined by our model to be as close as possible to the real distribution $$\rho(X)$$.
• “close” $$\rightarrow$$ we need a distance between probability distributions ?

## Entropy

Entropy of a probability distribution = measure of disorder $H(p)=-\sum_i p_i \log p_i$

## “Distance” btw distributions

• Kullback-Leibler divergence: $KL(\rho||p)=\sum_x \rho(x)\log \frac {\rho(x)}{p(x)} = -H(\rho) -E_{x\sim \rho}[\log p(x)]$

• First term indep from $$p$$
• Maximize the expected log-likelihood: $E_{x\sim \rho}[\log p(x)]$

## Maximum likelihood

But we don’t know $$\rho$$ !

So we approximate the expected log-likelihood with the empirical log-likelihood:

$E_{x\sim \rho}[\log p(x)] \simeq \frac 1 {|D|} \sum_{x\in D} \log p(x)$

Our trained model is then: $\max_{\theta} \frac 1 {|D|} \sum_{x\in D} \log p(x;\theta)$

## Example: biased coin

• Assume $$\rho(x)$$ with $$x\in \{h,t\}$$
• Class of models $$M$$ = set of all proba distributions over $$\{h,t\}$$
• Define parameters:
• $$p(x=h)=\theta$$
• $$p(x=t)=1-\theta$$
• Our log-likelihood function is: $L(\theta) = N_h \cdot \log(\theta) + N_t \cdot \log(1-\theta)$
• Optimal solution: $\hat \theta = \frac{N_h}{N_h + N_t}$

## ML loss

• Maximum Likelihood = optimizing with log loss: $-\log p(x)$

• Another loss: conditional log-likelihood: $-\log p(y|x)$
• no need to estimate the joint
• used to train CRF

## ML of BN

• Assume a BN: $p(x)=\prod_{i=1}^n \theta_{x_i | \pi(x_i)}$
• Assume iid samples: $$D=\{x^1,\cdots,x^m\}$$

Likelihood: $L(\theta,D)=\prod_i^n \prod_j^m \theta_{x_i^j|\pi(x_i^j)}$

## ML of BN

Combining same terms: $\log L(\theta,D)=\sum_i N(x_i,\pi(x_i)) \cdot \log \theta_{x_i | \pi(x_i)}$

Optimum (see Exercise 2): $\hat \theta_{x_i|\pi(x_i)} = \frac{N(x_i,\pi(x_i))}{N(\pi(x_i))}$
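
A small sketch of this counting estimator for one CPD of a BN, on made-up (parent, child) samples:

```python
from collections import Counter

# Toy samples of (parent value, child value) for one node of a BN.
data = [(0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (0, 0)]

n_pair = Counter(data)                        # N(x, pi(x)), keyed by (parent, child)
n_parent = Counter(p for p, _ in data)        # N(pi(x))

# theta_{x | pi(x)} = N(x, pi(x)) / N(pi(x))
theta = {(x, p): n_pair[(p, x)] / n_parent[p]
         for p in n_parent for x in (0, 1)}
print(theta)
```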

## Wrap-up

• 2 types of graphical models:
• Bayesian Networks
• Naive Bayes model
• Markov Random Fields
• CRF

## Wrap-up

• Inference:
• $$\theta$$ is known
• $$X$$ is observed
• Exact vs. approximate inference

## Wrap-up

• Learning:
• $$\theta$$ is unknown
• $$X$$ is observed
• We want to estimate $$\theta$$:
• Maximum Likelihood (closed-form for BN)

## Exercise 2 (TD)

• Model to predict punctuation:
• 7 classes: . ! ? ; : , AnyOtherChar
• Model: $$P(y|x_t,x_{t-1},\cdots,x_{t-n}) \sim M(\theta)$$
• $$x_{t-i}=$$ char at position $$t-i$$
• $$y=$$ punct. class computed from char at position $$t+1$$
• Example training data (French tweets):

Sinon j'espère que le nouvel OP de Jojo est mortel
Et la t'as les terroristes tranquillement qui regardent I-Télé sur leur portable en train d'apprendre la stratégie du GIPN #merciitelebfm
top merci
Ça fait peur tout ce qui se passe, c'est beaucoup trop, j'espère que ça va s'arrêter..
t srx

## Exercise 2 (TD)

• Draw a graphical model
• Is it generative or discriminative ?
• Let $$m$$ be the size of the char lexicon
• How many parameters ?
• A training corpus is composed of $$T$$ sequences $$\{x_t,x_{t-1},\cdots,x_{t-n},y\}_{t=1}^T$$
• Write the conditional likelihood $$P(Y|X;\theta)$$ over the corpus
• Derive the optimum parameters in the ML sense
• How could you handle unobserved sequences ?

## Exercise 2 (TD)

Tricks for optimizing the conditional likelihood:

• Merge together all occurrences of the same $$\theta_s$$
• Optimize the log-likelihood instead of the likelihood
• Don’t forget the constraint: sum of proba = 1
• Use Lagrange Multipliers to handle this constraint

# Learning in undirected models

## Learning in MRF / CRF

$p(x_1,\cdots,x_n)=\frac 1 {Z(\varphi)} \prod_{c\in C} \Phi_c (x_c; \varphi)$ $Z(\varphi) = \sum_{x_1,\cdots,x_n} \prod_{c\in C} \Phi_c(x_c; \varphi)$

• The partition function is very hard to evaluate and optimize:

• We can prove that $$\log Z(\varphi)$$ is convex by computing its Hessian $$\nabla_\varphi^2 \log Z(\varphi)$$
• … Despite being convex, it is very hard to optimize !
• Its gradient can be estimated with MCMC

## Learning in CRF

• Same issue as with MRF:
• Requires inference over all training samples for every evaluation of the objective / gradient
• Optimization with a quasi-Newton method (L-BFGS)
• Challenger: SGD
• Inference only on a batch of training samples

https://people.cs.pitt.edu/~milos/courses/cs3710/Lectures/Class23.pdf

# Learning with latent variables

## Example: Language models

• $$x_t$$ = words in a text document
• An LM models $$p(x_1,\cdots,x_T)$$

• A better model is conditioned on the topic: $p(x)=\sum_t p(x|t)p(t)$

• Problem for training: $$t$$ is not observed !

## Example: Gaussian Mixture Models

$p(x) = \sum_{k=1}^K p(x|z=k)p(z=k) = \sum_{k=1}^K \pi_k N(x;\mu_k,\Sigma_k)$

## Gaussian Mixture Models

• GMM are generative models:
• First sample one cluster according to $$\pi_k$$
• Then sample a point according to $$N(x;\mu_k,\Sigma_k)$$

## Marginal likelihood training

Maximizing the marginal log-likelihood of the data:

$\log p(D) = \sum_{x\in D} \log p(x) = \sum_{x\in D} \log \left( \sum_z p(x|z)p(z) \right)$

## Marginal likelihood training

• No closed-form solution, even for directed graphical models
• No more convexity properties
• Use approximate learning: Expectation-Maximization

## Expectation-Maximization

• Start with initial $$\theta_0$$

• Iterate on $$t$$:
• E-step: Compute the posterior $$p(z|x;\theta_t)$$ $$\forall x$$
• M-step: Optimize parameters: $\theta_{t+1} = \arg\max_\theta \sum_{x\in D} E_{z\sim p(z|x;\theta_t)} \log p(x,z;\theta)$

## EM

• convergence proof:
• With Jensen inequality
• cf. https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf

## Example: GMM (E-step)

E-step:

$p(z|x;\theta_t) = \frac {p(z,x;\theta_t)}{p(x;\theta_t)} = \frac{p(x|z;\theta_t)p(z;\theta_t)}{\sum_{k=1}^K p(x|z_k;\theta_t)p(z_k;\theta_t)}$

Intuitively, we successively assume that $$x$$ is generated by mixture component $$z_k$$, and compute the corresponding Gaussian likelihood.

## Example: GMM (M-step)

$\theta_{t+1} = \arg\max_\theta \sum_{x\in D} E_{z\sim p(z|x;\theta_t)} \log p(x,z;\theta)$

$\theta_{t+1} = \arg\max_\theta \sum_k \sum_{x\in D} p(z_k|x;\theta_t) \log p(x|z_k;\theta) +$ $\sum_k \sum_{x\in D} p(z_k|x;\theta_t) \log p(z_k;\theta)$

Each term can be optimized independently (because they affect different parameters)

## GMM training: emission term

$p(x|z_k;\theta) = N(x;\mu_k,\Sigma_k)$

Maximization = find the $$\mu_k,\Sigma_k$$ that maximize: $\sum_{x\in D} p(z_k|x;\theta_t) \log p(x|z_k;\theta)$

## GMM training: emission term

Solution: simply the weighted mean and (co)variance of the data: $\mu_k=\frac{\sum_{x\in D} p(z_k|x;\theta_t)\, x}{\sum_{x\in D} p(z_k|x;\theta_t)}$

$\Sigma_k = \frac{\sum_{x\in D} p(z_k|x;\theta_t)\,(x-\mu_k)(x-\mu_k)^T}{\sum_{x\in D} p(z_k|x;\theta_t)}$

## GMM training: class priors

Similarly:

$\pi_k = \frac 1 {|D|} \sum_{x\in D} p(z_k|x;\theta_t)$

• All these results may be derived from standard calculus
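
A compact sketch of these EM updates for a 1-D GMM with $$K=2$$ on synthetic data, directly implementing the E-step responsibilities and the weighted M-step formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two well-separated Gaussians.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

K = 2
pi = np.full(K, 1 / K)                        # initial theta_0
mu = rng.choice(x, K)
var = np.full(K, 1.0)

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for it in range(50):
    # E-step: responsibilities p(z = k | x; theta_t)
    r = np.stack([pi[k] * gauss(x, mu[k], var[k]) for k in range(K)])
    r /= r.sum(axis=0, keepdims=True)
    # M-step: weighted means, variances and class priors
    nk = r.sum(axis=1)
    mu = (r @ x) / nk
    var = np.array([(r[k] * (x - mu[k]) ** 2).sum() / nk[k] for k in range(K)])
    pi = nk / len(x)

print("means:", mu, "vars:", var, "priors:", pi)
```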

## Properties of EM

• Marginal likelihood increases after every EM cycle

• EM eventually converges

• But it may (most often) converge to a local optimum

• It is very sensitive to its initial parameters $$\theta_0$$

## Limits of ML

Example 1: biased coin

• XP1: 10 samples, observe 6 heads $$\rightarrow$$ $$\theta_h=0.6$$

• XP2: 100 samples, observe 60 heads $$\rightarrow$$ $$\theta_h=0.6$$

• No notion of reliable estimate !

## Limits of ML

Example 2: language model

D={“Graphical models are fun”}

 $$\theta(graphical)=0.25$$ $$\theta(models)=0.25$$ $$\theta(are)=0.25$$ $$\theta(fun)=0.25$$

• $$P(\text{Graphical models are hard}|\theta)$$ = 0 !
• Difficult to add prior information !

# Bayesian learning

## Bayesian learning

Model parameters $$\theta$$ are random variables as well !

• The prior distribution $$p(\theta)$$ encodes our initial beliefs.

• After observing data $$D$$ (=evidence), we update these beliefs with:

$p(\theta|D) = \frac {p(D|\theta)p(\theta)}{p(D)}$

## Conjugate priors

• Computing the denominator $$p(D)$$ implies an integral $$\rightarrow$$ hard !

• A parametric family $$\varphi$$ is conjugate to the likelihood $$p(D|\theta)$$ if $$p(\theta)\in\varphi \Rightarrow p(\theta|D) \in \varphi$$

• We may then use simple calculus to apply the Bayes rule in closed form

## Coin example

• $$\theta=$$ proba of heads; the data likelihood follows a Bernoulli distrib; if $$h=$$nb of heads, $$n=$$nb of samples: $p(D|\theta)=\theta^h(1-\theta)^{n-h}$

• Conjugate prior of Bernoulli = Beta: $p(\theta)=\frac{\theta^{\alpha_H-1}(1-\theta)^{\alpha_T-1}}{B(\alpha_H,\alpha_T)}$

Exercise: derive the posterior $$p(\theta|D)$$

## Coin example

Computation of the posterior: $p(\theta|D)=\frac{p(D|\theta)p(\theta)}{p(D)} = \frac{\theta^{h+\alpha_H-1}(1-\theta)^{n-h+\alpha_T-1}}{B(\alpha_H,\alpha_T)p(D)}$

• We know that $$p(\theta|D)$$ is Beta, so the denominator is $$B(h+\alpha_H,n-h+\alpha_T)$$
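
A small numerical sketch of this conjugate update (the prior pseudo-counts and the observations are made up); SciPy's Beta distribution is only used to summarize the posterior:

```python
from scipy import stats

alpha_h, alpha_t = 2.0, 2.0        # prior pseudo-counts = our initial beliefs
h, n = 6, 10                       # observed 6 heads out of 10 tosses

# Posterior is Beta(h + alpha_H, n - h + alpha_T) thanks to conjugacy.
posterior = stats.beta(h + alpha_h, n - h + alpha_t)
print("posterior mean of theta:", posterior.mean())          # (h + a_H) / (n + a_H + a_T) = 8/14
print("95% credible interval  :", posterior.interval(0.95))
```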

## Coin posterior

• Iterate with new data: the evidence will accumulate:

## Notion of reliability

• The more data, the more peaked the posterior

## Notion of prior knowledge

• Our prior distribution (a Beta) is such that we initially give equal probability to all words, including the ones that do not occur in the training set !

## Latent Dirichlet Allocation (LDA)

Topic model: cluster a set of documents into topics

Plate diagram:

## Latent Dirichlet Allocation (LDA)

• $$N$$ words in a document, $$M$$ documents, $$K$$ topics
• $$w_{m,n}$$= one observed word (multinomial)
• $$z_{m,n}$$= topic of one word (multinomial)
• $$\theta_m$$= topic distribution for document $$m$$ (Dirichlet)
• $$\varphi_k$$= word distribution for topic $$k$$ (Dirichlet)
• $$\alpha$$= param. of the Dirichlet prior
• $$\beta$$= param. of the Dirichlet prior

## Dirichlet distribution

Conjugate prior of the Multinomial distribution

## Inference or training ?

What can we do with this model ?

• Initially, we initialize this model (through $$\alpha=0.1$$ and $$\beta=0.1$$) following our prior beliefs

• But we want to update it according to new observations (words)

= train LDA in an unsupervised way with Gibbs sampling

Next: optional slides

## Inference with Gibbs sampling

$P(W,Z,\theta,\varphi;\alpha,\beta) = \prod_i^K P(\varphi_i;\beta) \prod_j^M P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j)P(W_{j,t}|\varphi_{Z_{j,t}})$

We don’t need the values of $$\varphi$$ and $$\theta$$ $$\rightarrow$$ Collapsed Gibbs sampling

$P(W,Z;\alpha,\beta) = \int_\theta \int_\varphi P(W,Z,\theta,\varphi;\alpha,\beta) d\varphi d\theta$

$=\int_\varphi \prod_i^K P(\varphi_i;\beta) \prod_j^M \prod_t^N P(W_{j,t}|\varphi_{Z_{j,t}}) d\varphi \int_\theta \prod_j^M P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta$

## Inference with Gibbs sampling

Every $$\theta_j$$ is independent from all other $$\theta$$ and $$\varphi$$, so the term for one of them is:

$\int_{\theta_j} P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta_j = \int_{\theta_j} \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \prod_i^K \theta_{j,i}^{\alpha_i -1} \prod_t^N P(Z_{j,t}|\theta_j)d\theta_j$

Let $$n_{j,r}^i$$ be the nb of occurrences of word $$r$$ in document $$j$$ with topic $$i$$ assigned at the previous Gibbs iteration, then $\prod_t^N P(Z_{j,t}|\theta_j) = \prod_i^K \theta_{j,i}^{n_{j,\cdot}^i}$

## Inference with Gibbs sampling

So

$\int_{\theta_j} P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta_j = \int_{\theta_j} \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \prod_i^K \theta_{j,i}^{n_{j,\cdot}^i+\alpha_i -1} d\theta_j$

This has the form of a Dirichlet !

So we just introduce the correct constant that makes the Dirichlet sum to 1, and all that remains is: $\int_{\theta_j} P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta_j = \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \frac{\prod_i^K \Gamma(n_{j,\cdot}^i+\alpha_i)}{\Gamma\left( \sum_i^K n_{j,\cdot}^i + \alpha_i \right)}$

## Inference with Gibbs sampling

Homework: do the same for $$\varphi$$ to get:

$P(W,Z;\alpha,\beta) = \prod_j^M \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \frac{\prod_i^K \Gamma(n_{j,\cdot}^i+\alpha_i)}{\Gamma\left( \sum_i^K n_{j,\cdot}^i + \alpha_i \right)} \times \prod_i^K \frac{\Gamma\left( \sum_r^V \beta_r \right)}{\prod_r^V \Gamma(\beta_r)} \frac{\prod_r^V \Gamma(n_{\cdot,r}^i+\beta_r)}{\Gamma\left( \sum_r^V n_{\cdot,r}^i + \beta_r \right)}$

To approximate $$P(Z|W;\alpha,\beta)$$, Gibbs sampling requires

$P(Z_{m,n}|Z_{-(m,n)},W;\alpha,\beta) = \frac{P(Z_{m,n},Z_{-(m,n)},W;\alpha,\beta)}{P(Z_{-(m,n)},W;\alpha,\beta)}$

## Inference with Gibbs sampling

We can sample from the unnormalized distribution

$P(Z_{m,n}|Z_{-(m,n)},W;\alpha,\beta) \propto P(Z_{m,n},Z_{-(m,n)},W;\alpha,\beta)$

Let $$v$$ be the word at $$(m,n)$$. Removing all factors that do not depend on $$m,n$$, it remains:

$\propto \prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) \prod_i^K \frac{\Gamma(n_{\cdot,v}^i+\beta_v)}{\Gamma\left( \sum_r^V n_{\cdot,r}^i +\beta_r\right)}$

Let’s compute it for $$v=1$$, and afterwards $$v=2$$, and so on.

## Inference with Gibbs sampling

• $$n_{m,\cdot}^i$$ is the nb of words with topic $$i$$ in document $$m$$
• Let $$n_{m,\cdot}^{i,-(m,n)}$$ be the counts excluding the $$n^{th}$$ word:

Assume that the topic of the $$n^{th}$$ word $$Z_{m,n}=k$$, so for $$i=k$$: $n_{m,\cdot}^i = n_{m,\cdot}^{i,-(m,n)} + 1$ and for $$i\neq k$$: $n_{m,\cdot}^i = n_{m,\cdot}^{i,-(m,n)}$

## Inference with Gibbs sampling

So $\prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) = \prod_{i\neq k}^K \Gamma(n_{m,\cdot}^{i,-(m,n)}+\alpha_i) \times \Gamma(n_{m,\cdot}^{k,-(m,n)}+\alpha_k+1)$

We know that $$\Gamma(1+x)=x\Gamma(x)$$, so:

$\prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) = \prod_{i\neq k}^K \Gamma(n_{m,\cdot}^{i,-(m,n)}+\alpha_i) \times \Gamma(n_{m,\cdot}^{k,-(m,n)}+\alpha_k)\times (n_{m,\cdot}^{k,-(m,n)}+\alpha_k)$

$\prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) = \prod_{i}^K \Gamma(n_{m,\cdot}^{i,-(m,n)}+\alpha_i) \times (n_{m,\cdot}^{k,-(m,n)}+\alpha_k)$

The first product does not depend on the chosen topic $$k$$, so we can remove it !

## Inference with Gibbs sampling

Similarly for the other term, we get:

$\propto \left( n_{m,\cdot}^{k,-(m,n)} + \alpha_k \right) \frac{n_{\cdot,v}^{k,-(m,n)} + \beta_v}{\sum_r^V n_{\cdot,r}^{k,-(m,n)} + \beta_r}$

We can now sample every $$Z$$ from this unnormalized posterior.
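
A compact sketch of the resulting collapsed Gibbs sampler on a toy corpus of word ids, using exactly this unnormalized update (the hyper-parameters and documents below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, alpha, beta = 2, 6, 0.1, 0.1
# Toy corpus: each document is a list of word ids in [0, V).
docs = [[0, 1, 0, 2, 1], [0, 1, 1, 2], [3, 4, 5, 4], [4, 5, 3, 3, 5]]

z = [[rng.integers(K) for _ in d] for d in docs]   # random topic init
n_mk = np.zeros((len(docs), K))                    # topic counts per document
n_kv = np.zeros((K, V))                            # word counts per topic
for m, d in enumerate(docs):
    for w, k in zip(d, z[m]):
        n_mk[m, k] += 1
        n_kv[k, w] += 1

for it in range(200):
    for m, d in enumerate(docs):
        for n, w in enumerate(d):
            k = z[m][n]
            n_mk[m, k] -= 1; n_kv[k, w] -= 1       # remove the current assignment
            # unnormalized P(Z_{m,n} = k | Z_-(m,n), W) from the slide above
            probs = (n_mk[m] + alpha) * (n_kv[:, w] + beta) \
                    / (n_kv.sum(axis=1) + V * beta)
            k = rng.choice(K, p=probs / probs.sum())
            z[m][n] = k
            n_mk[m, k] += 1; n_kv[k, w] += 1       # add the new assignment

phi_hat = (n_kv + beta) / (n_kv + beta).sum(axis=1, keepdims=True)
print(phi_hat.round(2))                            # word distribution per topic
```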

## Output of Gibbs sampling

• Gibbs sampling gives many samples of $$Z=(Z_1,\cdots,Z_T)$$

• We can estimate from these samples interesting distributions, for instance words that are the most related to a topic: $\hat \varphi_{z,w} = \frac {N(W=w,Z=z)+\beta}{\sum_v (N(W=v,Z=z) + \beta)}$

• Because we are Bayesian, our model gives us a distribution over all possible $$\varphi_z$$

• We want the most probable $$\hat \varphi_z = E_{P(\varphi_z | W_z, \beta)}[\varphi_z]$$
• $$W_z$$ are our “observations” obtained by Gibbs sampling = all the words assigned to topic $$z$$
• The posterior $$P(\varphi_z|W_z,\beta) \propto P(W_z|\varphi_z)P(\varphi_z|\beta)$$
• The 1st $$P()$$ is multinomial, the 2nd is Dirichlet $$\rightarrow$$ product = Dirichlet
• The new $$\beta$$ of the resulting Dirichlet are $$N_v + \beta_v$$ with $$N_v$$ the number of samples of word $$v$$ assigned to topic $$z$$
• The formula of the mean of a Dirichlet gives us the target

## Non-parametric models

The number of topics is fixed, but what if you process an infinite stream of text ?

• Grow the number of topics with the size of incoming data

• Non-parametric models
• The nb of parameters increases with the data
• Extensions of LDA:
• Hierarchical Dirichlet Process Mixture Model
• Hierarchical Pitman-Yor process
• Chinese Restaurant Process
• Indian Buffet Process