Machine and Deep learning

Christophe Cerisara


Probabilistic Graphical Models


  • Bayes fundamentals
    • Directed / undirected models
    • Naive Bayes / MRF, CRF
  • Inference
    • Marginal vs. MAP
    • Approximate: MCMC
  • Learning
    • Maximum Likelihood
    • Bayesian learning


  • Latent variable models
    • Mixture models, EM
  • Refs:
    • http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/020217.pdf
    • https://ermongroup.github.io/cs228-notes/

Why probabilistic models

  • Model uncertainty and intrinsic variability
    • “there’s a 90% chance of rain”
    • a model may generate different (random) answers for the same input
  • Solid foundations to
    • inject prior knowledge
    • explain some forms of regularization

The problem of features independence

  • We want to detect SPAM emails, using the whole text as features.
  • We can code the occurrence of the \(i^{th}\) entry in the vocabulary into \(X_i \in \{0,1\}\) for \(1\leq i \leq V\)

The problem of features independence

  • A full tabular joint \(P(x)=P(x_1,\cdots,x_V)\) treats the two following texts as completely unrelated:
Cheap product to sell
Buy a cheap product
  • Issues:
    • Exponential number (\(2^V\)) of possible values!
    • No sharing of information between similar texts

Conditional independence

  • We may assume conditional independence between variables: \[p(y,x_1,\cdots,x_V) = p(y) \prod_i^V p(x_i|y)\]

  • This is the Naive Bayes assumption
  • With \(x_i=\) word occurrence, aka the Bag-of-Words representation

  • Requires only \(O(V)\) parameters!

Multinomial distribution

  • \(p(x_i|y)\) models the probability of observing word \(i\) in SPAM (resp. non-SPAM)
  • Discrete domain for \(x_i\) \(\rightarrow\) Multinomial distribution
  • Parameterized by \(2V\) parameters (\(V\) is the vocabulary size, 2 is the size of the domain of \(x_i\)): \[p(x_i=j|y) = \theta_{i,j}~~~~~ 1 \leq i \leq V\] \[\sum_{j=1}^2 \theta_{i,j} = 1\]
  • (note: we could have used only \(V\) parameters here, but in the general case with \(N\) classes, we need \(NV\) parameters)
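This Naive Bayes classifier can be sketched in a few lines. A toy illustration with a made-up 5-word vocabulary and made-up parameters (all names and numbers below are hypothetical, not from the slides):

```python
import numpy as np

# Toy Naive Bayes spam model: x[i] = 1 iff vocabulary word i occurs in the email.
vocab = ["cheap", "product", "buy", "meeting", "report"]
V = len(vocab)

# theta[y][i] = p(x_i = 1 | y): one Bernoulli per word and per class (made-up values)
theta = {
    1: np.array([0.8, 0.7, 0.6, 0.05, 0.05]),  # spam
    0: np.array([0.1, 0.2, 0.1, 0.40, 0.30]),  # non-spam
}
prior = {1: 0.4, 0: 0.6}  # p(y)

def log_joint(x, y):
    """log p(y, x_1, ..., x_V) = log p(y) + sum_i log p(x_i | y)."""
    t = theta[y]
    return np.log(prior[y]) + np.sum(x * np.log(t) + (1 - x) * np.log(1 - t))

x = np.array([1, 1, 1, 0, 0])            # "buy a cheap product"
pred = max((0, 1), key=lambda y: log_joint(x, y))
```

Working in log space avoids underflow when \(V\) is large; the \(O(V)\) parameter count shows up as the two length-\(V\) rows of `theta`.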

3 ways to describe the model

  • 1: describe with the equation

\[p(y,x_1,\cdots,x_V) = p(y) \prod_i^V p(x_i|y)\]

3 ways to describe the model

  • 2: describe our independence assumptions via a graph

\(X_i\) is conditionally independent from all other \(X_j\) given \(Y\)

Generative story

3: generative models can also be described by a generative story:

  • To generate an email, we first sample whether it’s a spam or not from the Bernoulli distribution \[Y \sim B(\alpha)\]
  • Then, for each of the \(V\) words in the vocabulary:
    • sample \(X_i\) given \(Y\) (i.e., decide whether word \(i\) occurs in the email) from the Bernoulli distribution \[X_i \sim B(\theta_{i,Y})\]

Bayesian network

It’s a pair \((G,P)\) where

  • \(G=(V(G),A(G))\) is an acyclic directed graph
    • \(V(G)\) is a set of vertices
    • \(A(G) \subseteq V(G) \times V(G)\) is a set of arcs
  • Joint probability distribution: \[P(X_{V(G)}) = \prod_{v \in V(G)} p(X_v | X_{\pi(v)})\]

Graphical models

  • \(p(X_v | X_{\pi(v)})\) is a CPD (Conditional Probability Distribution)

2 types of graphical models:

  • If \(G\) is a DAG = Bayesian network

  • If \(G\) is an undirected graph = Markov random field
    • (see next)

Bayesian network

Exercise: write the equation of this graphical model.

Bayesian network


  • \(X_2\) is conditionally independent of \(X_1\) given \(X_3\)

Markov Random Fields

Example: Model of voting preferences

  • (A,B), (B,C), (C,D) and (D,A) are friends
    • They talk and influence each other when voting

Markov Random Fields

The probability of a given vote, say \((A,B,C,D)=(0,1,1,1)\), is proportional to the score:

\[\tilde{p}(A,B,C,D) = \phi(A,B)\phi(B,C)\phi(C,D)\phi(D,A)\]

  • \(\phi(X,Y)\) = factor that assigns more weight to votes among friends:
    • \(\phi(X,Y)=10\) if X=Y=1
    • \(\phi(X,Y)=5\) if X=Y=0
    • \(\phi(X,Y)=1\) otherwise

Markov Random Fields

We still need to normalize to get a probability distribution:

\[p(A,B,C,D) = \frac 1 Z \tilde{p}(A,B,C,D)\] with \[Z = \sum_{A,B,C,D} \tilde{p}(A,B,C,D)\]

Would be difficult to model with a Bayesian network !
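Because this model has only 4 binary variables, \(Z\) can be computed by brute-force enumeration; a small sketch of the voting MRF above:

```python
from itertools import product

# Pairwise factor from the slides: friends agreeing get more weight
def phi(x, y):
    if x == y == 1:
        return 10.0
    if x == y == 0:
        return 5.0
    return 1.0

def score(a, b, c, d):
    """Unnormalized p~(A,B,C,D) = phi(A,B) phi(B,C) phi(C,D) phi(D,A)."""
    return phi(a, b) * phi(b, c) * phi(c, d) * phi(d, a)

# Partition function: sum over all 2^4 assignments
Z = sum(score(*v) for v in product([0, 1], repeat=4))
p = score(0, 1, 1, 1) / Z   # probability of the vote (0,1,1,1)
```

Enumeration is exponential in the number of variables, which is exactly why computing \(Z\) is hard in general.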


An MRF is a probability distribution over variables \(x_1,\cdots,x_n\) defined by an undirected graph \(G\) in which nodes correspond to variables \(x_i\). \(p\) has the form: \[p(x_1,\cdots,x_n)=\frac 1 Z \prod_{c\in C} \phi_c(x_c)\] where \(C\) = set of cliques of \(G\).

Normalizing constant: \[Z=\sum_{x_1,\cdots,x_n}\prod_{c\in C} \phi_c(x_c)\]

MRF vs. Bayesian Networks

  • MRFs can be applied to a wider range of problems
  • MRFs can compactly represent certain dependencies that BNs cannot

  • Computing \(Z\) may require summing over an exponential number of assignments (NP-hard)
  • It is easier to generate data from a BN

MRF to/from BN

  • It is always possible to transform a BN into an MRF

  • The inverse is possible, but may be computationally intractable

Conditional Random Fields

Important special case of MRF:

\[P(y|x) = \frac 1 {Z(x)} \prod_{t}^T \phi(y_t|x) \prod_t^{T-1} \phi(y_t,y_{t+1}|x)\]


\[P(y|x) = \frac 1 {Z(x)} \prod_{t}^T \phi(y_t|x) \prod_t^{T-1} \phi(y_t,y_{t+1}|x)\]

  • \(x\) is actually composed of 2 components:
    • global observed features (not shown in the graph): e.g., number of observed letters
    • local observations (\(x_t\))
  • All potentials are conditioned on \(x\):
    • \(Z(x)\) now depends on \(x\): \(x\) is assumed fixed, so we sum only over \(y\)

CRF as an exponential model

In most applications, we define potentials as follows:

\[\phi(x_c) = \exp(w_c^T f(x_c))\]

where \(f(x_c)\) is a feature vector, e.g.:

  • How many vertical lines are there?
  • Is the image closest to training example 524?

Ref: https://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf

CRF as an exponential model

Why use this form \(\phi(x_c) = \exp(w_c^T f(x_c))\)?


  • It is always positive
  • The global product simplifies into a log-linear model
  • Logistic regression (or Maximum Entropy model) is a special case of a CRF

Ref: https://homes.cs.washington.edu/~thickstn/docs/crf.pdf


Model usage

Once a model is defined, how can it be used ?

  • Marginal inference:
    • Assuming that the parameters \(\theta\) are known/fixed
    • How to compute the distribution of variables of interest \(Y\)?
  • Training:
    • Assuming some annotated dataset is provided: \[(X_1,Y_1),\cdots,(X_T,Y_T)\]
    • How to compute the optimal parameters \(\theta\)?

Marginal Inference

  • It’s a way to answer questions given a model

  • Example of marginal inference
    • What is the proportion of spams? \[p(y)=\sum_{x_1} \cdots \sum_{x_n} p(y,x_1,\cdots,x_n)\]
    • What is the probability that an email is a spam? \[p(y|x_1,\cdots,x_n)=\frac {p(y,x_1,\cdots,x_n)}{p(x_1,\cdots,x_n)}\]

MAP Inference

Instead of computing the whole distribution as in marginal inference, it’s often easier to compute a single optimal value:

  • Maximum a posteriori (MAP) inference
    • What is the most likely spam message ? \[\max_{x_1,\cdots,x_n} p(y=1,x_1,\cdots,x_n)\]
    • Classification: Is this email a spam? (exercise)

MAP Inference

Instead of computing the whole distribution as in marginal inference, it’s often easier to compute a single optimal value:

  • Maximum a posteriori (MAP) inference
    • What is the most likely spam message ? \[\max_{x_1,\cdots,x_n} p(y=1,x_1,\cdots,x_n)\]
    • Classification: Is this email a spam ? \[\max_{y} p(y|x)\]

Exact inference in graphical models

Algorithms exist to compute exact inference in the general case:

  • Variable elimination
  • Better: Junction tree algorithms:
  • (Loopy) Belief propagation

… But the general case is NP-hard !

  • \(\simeq\) No known polynomial-time algorithm
  • (http://www.columbia.edu/~cs2035/courses/ieor4405.S17/np.pdf)

Exact MAP inference

MAP inference is less ambitious, sometimes easier (ex: MRF):

\[\arg\max_x \left(\sum_c \log \phi_c(x_c)\right) -\log Z\]

  • “one point” in general inference
  • depending on the model \(\rightarrow\) efficient specific algorithms
    • Graphcuts
    • Viterbi

Inference in practice

  • Either use a special / known model with its dedicated algorithms

  • Or forget exact inference, and run approximate inference !

Approximate inference

2 classes of approximate inference methods:

  • Variational (see next in deep models)

  • Sampling = Monte Carlo approach:
    • Draw one random sample, following the model
    • Repeat many times
    • The target distribution is approximated by this cloud of points

Extra slides

Optional / going further:

  • Forward sampling
  • Importance sampling
  • MCMC methods
  • Gibbs sampling
  • Metropolis-Hastings

Sampling from a Bayesian network

  • Computers can generate samples from simple distributions (uniform)

  • Sampling from a multinomial:
    • subdivide the unit interval according to the multinomial probabilities
    • sample uniformly, and return the sub-interval the sample falls into
  • Sampling from directed graphical models: forward sampling
    • Sample variables in topological order (starting from variables without parents)

Forward sampling

  • Sample the value of \(X_3=a\) from \(P(X_3)\)
  • Sample the value of \(X_2=b\) from \(P(X_2|X_3=a)\)
  • Sample the value of \(X_1=c\) from \(P(X_1|X_3=a,X_2=b)\)
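A sketch of forward sampling on this 3-node network, with made-up CPD tables (the slides do not specify them), using the unit-interval trick from the previous slide to sample each multinomial:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_discrete(probs):
    """Subdivide [0,1) according to probs and see where a uniform draw falls."""
    return int(np.searchsorted(np.cumsum(probs), rng.random()))

# Hypothetical CPDs for binary X3 -> X2 -> X1 (X1 also depends on X3)
P_X3 = [0.6, 0.4]                                  # p(X3)
P_X2 = [[0.9, 0.1], [0.3, 0.7]]                    # p(X2 | X3)
P_X1 = {(0, 0): [0.5, 0.5], (0, 1): [0.2, 0.8],
        (1, 0): [0.7, 0.3], (1, 1): [0.1, 0.9]}    # p(X1 | X3, X2)

def forward_sample():
    a = sample_discrete(P_X3)            # topological order: X3 first
    b = sample_discrete(P_X2[a])         # then X2 given X3
    c = sample_discrete(P_X1[(a, b)])    # then X1 given its parents
    return a, b, c

samples = [forward_sample() for _ in range(10000)]
freq_x3_0 = sum(1 for s in samples if s[0] == 0) / len(samples)
```

With enough samples, the empirical frequency of \(X_3=0\) approaches \(P(X_3=0)=0.6\).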

Rejection sampling

What if \(X_2=x_2\) is observed ?

  • Sample the value of \(X_3=a\) from \(P(X_3)\)
  • Sample the value of \(X_2=b\) from \(P(X_2|X_3=a)\)
    • If \(b\neq x_2\), reject sample and re-start
  • Sample the value of \(X_1=c\) from \(P(X_1|X_3=a,X_2=x_2)\)
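The rejection step above can be sketched as follows (same hypothetical binary CPDs as in a forward-sampling setting; all tables are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def draw(probs):
    return int(rng.random() < probs[1])   # binary sample: 1 with proba probs[1]

P_X3 = [0.6, 0.4]                                  # p(X3)
P_X2 = [[0.9, 0.1], [0.3, 0.7]]                    # p(X2 | X3)
P_X1 = {(0, 0): [0.5, 0.5], (0, 1): [0.2, 0.8],
        (1, 0): [0.7, 0.3], (1, 1): [0.1, 0.9]}    # p(X1 | X3, X2)

def rejection_sample(x2_obs):
    while True:
        a = draw(P_X3)                 # sample X3
        b = draw(P_X2[a])              # sample X2 | X3
        if b != x2_obs:                # evidence mismatch: reject, restart
            continue
        c = draw(P_X1[(a, b)])         # sample X1 | X3, X2 = x2_obs
        return a, b, c

samples = [rejection_sample(1) for _ in range(2000)]
freq_x3_1 = sum(s[0] for s in samples) / len(samples)
```

The retained samples follow the posterior given the evidence: here \(P(X_3=1|X_2=1)=0.28/0.34\approx 0.82\), which the empirical frequency approaches.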

Usages of sampling

  • Can be used to estimate a (marginal or joint) proba: \[P(X=x)=\frac {N(X=x)}{N}\]
  • with \(N=\) total nb of samples
  • with \(N(X=x)=\) nb of samples with \(X=x\)

Usages of sampling

  • Can be used to estimate a conditional proba:

\[P(X=x|Y=y) = \frac {P(X,Y)}{P(Y)} = \frac{N(X=x,Y=y)}{N} \cdot \frac{N}{N(Y=y)}\] \[P(X=x|Y=y) = \frac {N(X=x,Y=y)}{N(Y=y)}\]

Limitations of rejection sampling

Evidence \(\rightarrow\) high rejection rate \(\rightarrow\) very costly !

Solution: Don’t reject samples, but rather fix the values of observed nodes:

  • Gibbs sampling
  • Importance sampling

Gibbs sampling

  • Samples are dependent and form a Markov Chain
    • = \(s_t\) depends on \(s_{t-1}\)
  • Is a type of Markov Chain Monte Carlo (MCMC) method

  • Guaranteed to converge when all \(P>0\)

Gibbs sampling

Init \(t=0\):

  • Fix values of all observed variables
  • For each latent variable: initialize it randomly

Gibbs sampling

  • Intuition: update every variable one after the other, assuming all the others are given from previous iteration

  • Iterate through each latent variable \(x_i\) one at a time:
    • Assume all other variables \(x_{-i}\) are observed
    • Derive the distribution of possible values that \(x_i\) can take: \(P(x_i|x_{-i})\)
    • Sample one such value: \(x'_i \sim P(x_i|x_{-i})\)
    • Fix the new value of \(x_i\) and continue to \(i+1\)
  • Iterate this process long enough

Gibbs sampling

  • After some burn-in period, the samples are approximately drawn from the posterior
    • Burn-in is a new hyper-parameter
  • How to derive \(P(x_i|x_{-i})\) ?

Gibbs sampling: derivation

  • We can get back to the joint: \(P(x_i|x_{-i}) = \frac {P(x_i,x_{-i})}{P(x_{-i})}\)
  • The denominator does not depend on \(x_i\), so: \(P(x_i|x_{-i}) \propto P(x_i,x_{-i})\)

Hence, in practice, we can just:

  • compute the joint successively for each possible value of \(x_i\) (keeping all other variables at their value from the previous iteration)
  • normalize s.t. the sum is 1
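As a sketch, here is this enumerate-and-normalize Gibbs scheme applied to the 4-variable voting MRF from the earlier slides (iteration counts and seed are arbitrary choices):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)

def phi(x, y):
    return 10.0 if x == y == 1 else (5.0 if x == y == 0 else 1.0)

def joint_tilde(s):
    """Unnormalized p~(A,B,C,D) from the pairwise friendship factors."""
    a, b, c, d = s
    return phi(a, b) * phi(b, c) * phi(c, d) * phi(d, a)

def gibbs(n_iters=5000, burn_in=500):
    s = [int(v) for v in rng.integers(0, 2, size=4)]   # random init
    out = []
    for t in range(n_iters):
        for i in range(4):
            w = []
            for v in (0, 1):          # joint for each candidate value of x_i
                s[i] = v
                w.append(joint_tilde(s))
            p1 = w[1] / (w[0] + w[1])  # normalize so the two probs sum to 1
            s[i] = int(rng.random() < p1)
        if t >= burn_in:
            out.append(tuple(s))
    return out

samples = gibbs()
mode = Counter(samples).most_common(1)[0][0]
```

Since the factors strongly favor unanimous votes, the chain spends most of its time at \((1,1,1,1)\), the MAP assignment.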

Gibbs sampling: enhancements

  • Ways to improve Gibbs sampling (reduce variance):
    • Collapsed Gibbs sampling
    • Generate multiple chains
    • Pick one sample out of K

Importance sampling

We may rewrite our objective (\(E_{x\sim p}[f(x)]\)) as: \[E_{x\sim p}[f(x)] = \sum_x f(x)p(x)\] \[E_{x\sim p}[f(x)] = \sum_x f(x)\frac {p(x)}{q(x)} q(x)\] \[E_{x\sim p}[f(x)] = E_{x\sim q} [f(x)w(x)]\] \[E_{x\sim p}[f(x)] \simeq \frac 1 T \sum_t f(x_t)w(x_t)\]

= Sample according to \(q\) and reweigh them with \(w(x)\)
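A minimal sketch on a discrete toy distribution (the target \(p\), proposal \(q\), and \(f\) below are all made up):

```python
import numpy as np

rng = np.random.default_rng(4)

p = np.array([0.7, 0.2, 0.1])   # target distribution over {0, 1, 2}
q = np.array([1/3, 1/3, 1/3])   # proposal: uniform (easy to sample from)
f = np.array([0.0, 1.0, 2.0])   # f(x) = x, so E_{x~p}[f(x)] = 0.2 + 0.2 = 0.4

T = 100000
xs = rng.choice(3, size=T, p=q)  # sample according to q
w = p[xs] / q[xs]                # importance weights w(x) = p(x)/q(x)
estimate = np.mean(f[xs] * w)    # reweighted Monte Carlo estimate
```

Note that \(E_{x\sim q}[w(x)]=1\), a useful sanity check on the weights.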

Markov Chain Monte Carlo

  • Allows to compute marginals and MAP inference

Markov Chain:

  • seq. of state variables \(S_1,S_2,\cdots\)
  • generated from \(P(S_i|S_{i-1})\) (\(\rightarrow\) transition matrix \(T\))
  • proba of ending up in each state after \(t\) steps: \[p_t = T^t p_0\]
  • When it exists, stationary distribution: \[\pi = \lim_{t\rightarrow +\infty} p_t\]

MCMC = a Markov Chain whose \(\pi=p\)

Existence of \(\pi\)

2 sufficient conditions:

  • Irreducibility: possible to get from any state \(x\) to any \(x'\) with proba \(>0\) in a finite nb of steps
    • Prevents absorbing states
  • Aperiodicity: possible to return to any state: \(\exists n\) s.t. \[\forall i, \forall n'\geq n ~~~ P(s_{n'}=i|s_0=i)>0\]
    • Prevents “oscillating” chains

Any ergodic chain has a stationary distribution.

MCMC usage

Assume we have a Markov Chain with \(\pi=p\), and an initial assignment \(x_0\) of the variables of \(p\); we can then:

  • run the Markov chain from \(x_0\) for \(B\) burn-in steps
  • run the Markov chain for \(N\) sampling steps and collect the visited states = a sample of \(p\)

We can compute marginals from this sample, or pick the MAP sample.


Metropolis-Hastings

\(T(x'|x)\) is built from 2 components:

  • A transition kernel \(Q(x'|x)\) defined by the user
  • an acceptance probability for moves proposed by \(Q\): \[A(x'|x)=\min\left( 1, \frac {P(x')Q(x|x')}{P(x)Q(x'|x)} \right)\]

At each step of the Markov chain:

  • we propose a new point \(x'\) according to \(Q\)
  • we accept \(x'\) with proba \(A(x'|x)\)


Assume \(Q\) is uniform:

  • if the proposed \(x'\) is in a low proba region of \(P\), it is often rejected

  • For continuous variables, \(Q\) is usually a Gaussian centered on \(x\)

  • \(\forall Q\), MH ensures that \(P\) = stationary distribution of the Markov chain
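A sketch of MH with a symmetric Gaussian proposal centered on \(x\) (so \(Q(x|x')=Q(x'|x)\) and the acceptance ratio reduces to \(P(x')/P(x)\)); the target here is an unnormalized standard normal, and step size and chain length are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)

def p_tilde(x):
    """Unnormalized target density: standard normal (Z unknown to the sampler)."""
    return np.exp(-0.5 * x * x)

def metropolis_hastings(n=50000, step=1.0):
    x, out = 0.0, []
    for _ in range(n):
        x_new = x + step * rng.normal()                  # propose from Q(.|x)
        accept = min(1.0, p_tilde(x_new) / p_tilde(x))   # A(x'|x)
        if rng.random() < accept:
            x = x_new                                    # accept the move
        out.append(x)                                    # else keep the old state
    return np.array(out)

chain = metropolis_hastings()
tail = chain[5000:]   # discard burn-in
```

Note that only the ratio of unnormalized densities is needed, so \(Z\) never has to be computed.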

Gibbs sampling

Special case of MH:

  • Iterate through each variable \(x_i\) one at a time:
    • sample \(x'_i \sim p(x_i|x_{-i})\)
    • set \(x^{t+1} = (x_1^t,\cdots,x'_i,\cdots,x_n^t)\)

Running time of MCMC:

  • related to the number of burn-in steps \(B\)
    • = mixing time of the Markov chain
  • May be very slow
    • if proposals do not move fast enough
    • if \(P\) has 2 separated narrow modes

Linear vs. probabilistic models

Wrap-up: parametric model

  • A function \(f(X)\) that does not know anything about the true generative process of \(X\)
  • With parameters, \(f_{\theta}(X)\), to make the function compatible with observations \(X\)
  • We have fit a linear function to some “unknown” observations “and”, “xor”… \[f_{\theta}(x) = -2 + 2x_0 + x_1\]

Where is the test set ?

  • Our “and”, “xor”, “or”… data is generated by a true process that is deterministic, with only a few possible combinations.
  • train and test are the same!

Another model: graphical

  • “xor”-type data cannot be fit by a linear model
  • Question: propose a graphical model

Joint model

  • \(P(x_1,x_2,y;\theta)\) is good here, because very few possible combinations
  • Question: how many parameters ? What do parameters represent ?

Discriminative model

  • \(P(y|x_1,x_2,\theta)\) has the same number of parameters
  • Question: how do the parameters differ ?

Naive Bayes model

  • What is the Naive Bayes model in this case ?
  • How many parameters ?
  • Is it good ?

Learning in directed models

Learning in directed models

Let us assume that:

  • the domain is governed by an unknown distribution \(\rho\)
  • we are given a dataset \(D\) of \(m\) i.i.d. samples from \(\rho\)
  • we are given a family of models \(M\), defined by a set of parameters

Learning = find a model in \(M\) that best approximates \(\rho\)

Learning in directed models

What is best ?

  • Let \(X=\) all random variables in our model
  • Let \(\theta=\) parameters of our model
  • We want the proba distrib \(p(X;\theta)\) defined by our model to be as close as possible to the real distrib \(\rho(X)\).
  • “close” \(\rightarrow\) a distance between proba distribs?


Entropy of a probability distribution = measure of disorder \[H(p)=-\sum_i p_i \log p_i\]

“Distance” btw distributions

  • Kullback-Leibler divergence: \[KL(\rho||p)=\sum_x \rho(x)\log \frac {\rho(x)}{p(x)} = -H(\rho) -E_{x\sim \rho}[\log p(x)]\]

  • The first term does not depend on \(p\)
  • So minimizing the KL amounts to maximizing the expected log-likelihood: \[E_{x\sim \rho}[\log p(x)]\]

Maximum likelihood

But we don’t know \(\rho\) !

So we approximate the expected log-likelihood with the empirical log-likelihood:

\[E_{x\sim \rho}[\log p(x)] \simeq \frac 1 {|D|} \sum_{x\in D} \log p(x)\]

Our trained model is then: \[\max_{\theta} \frac 1 {|D|} \sum_{x\in D} \log p(x;\theta)\]

Example: biased coin

  • Assume \(\rho(x)\) with \(x\in \{h,t\}\)
  • Class of models \(M\) = set of all proba distributions over \(\{h,t\}\)
  • Define parameters:
    • \(p(x=h)=\theta\)
    • \(p(x=t)=1-\theta\)
  • Our log-likelihood function is: \[L(\theta) = N_h \cdot \log(\theta) + N_t \cdot \log(1-\theta)\]
  • Optimal solution: \[\hat \theta = \frac{N_h}{N_h + N_t}\]
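The closed-form solution can be checked numerically against a brute-force grid search over \(\theta\); a small sketch with made-up counts:

```python
import numpy as np

# Hypothetical observations: 60 heads, 40 tails
N_h, N_t = 60, 40

def log_likelihood(theta):
    """L(theta) = N_h log(theta) + N_t log(1 - theta)."""
    return N_h * np.log(theta) + N_t * np.log(1 - theta)

# Brute-force maximization over a fine grid of theta values
grid = np.linspace(0.001, 0.999, 9999)
theta_grid = grid[np.argmax(log_likelihood(grid))]

# Closed-form ML solution
theta_closed = N_h / (N_h + N_t)
```

Both estimates agree: the log-likelihood is concave in \(\theta\), with its unique maximum at the empirical frequency of heads.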

ML loss

  • Maximum Likelihood = optimizing with log loss: \[-\log p(x)\]

  • Another loss: conditional log-likelihood: \[-\log p(y|x)\]
    • specific prediction task
    • no need to estimate the joint
    • used to train CRF

ML of BN

  • Assume a BN: \[p(x)=\prod_{i=1}^n \theta_{x_i | \pi(x_i)}\]
  • Assume iid samples: \(D=\{x^1,\cdots,x^m\}\)

Likelihood: \[L(\theta,D)=\prod_i^n \prod_j^m \theta_{x_i^j|\pi(x_i^j)}\]

ML of BN

Combining identical terms: \[\log L(\theta,D)=\sum_i \sum_{x_i,\pi(x_i)} N(x_i,\pi(x_i)) \cdot \log \theta_{x_i | \pi(x_i)}\]

Optimum (see Exercise 2): \[\hat \theta_{x_i|\pi(x_i)} = \frac{N(x_i,\pi(x_i))}{N(\pi(x_i))}\]
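This count-ratio optimum is easy to implement; a sketch on a made-up 2-node network \(A \rightarrow B\) (the data and structure are hypothetical):

```python
from collections import Counter

# Made-up i.i.d. samples of (A, B) from a 2-node network A -> B
data = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

N_pair = Counter(data)                  # N(A=a, B=b): joint counts
N_a = Counter(a for a, _ in data)       # N(A=a): parent counts

# ML estimate of the CPD: theta_{B=b | A=a} = N(a, b) / N(a)
theta = {(a, b): N_pair[(a, b)] / N_a[a] for (a, b) in N_pair}
```

For each parent value \(a\), the estimated conditional probabilities sum to 1, as the closed form guarantees.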


  • 2 types of graphical models:
    • Bayesian Networks
      • Naive Bayes model
    • Markov Random Fields
      • CRF


  • Inference:
    • \(\theta\) is known
    • \(X\) is observed
    • Exact vs. approximate inference


  • Learning:
    • \(\theta\) is unknown
    • \(X\) is observed
    • We want to estimate \(\theta\):
      • Maximum Likelihood (closed-form for BN)

Exercise 2: Twitter posts

  • Model to predict punctuation:
    • 7 classes: . ! ? ; : , AnyOtherChar
  • Model: \(P(y|x_t,x_{t-1},\cdots,x_{t-n}) \sim M(\theta)\)
    • \(x_{t-i}=\) char at position \(t-i\)
    • \(y=\) punct. class computed from char at position \(t+1\)
Sinon j'espère que le nouvel OP de Jojo est mortel
Et la t'as les terroristes tranquillement qui regardent I-Télé sur leur portable en train d'apprendre la stratégie du GIPN #merciitelebfm
 top merci
Ça fait peur tout ce qui se passe, c'est beaucoup trop, j'espère que ça va s'arrêter..
 t srx

Exercise 2 (tutorial)

  • Draw a graphical model
    • Is it generative or discriminative ?
  • Let \(m\) be the size of the char lexicon
    • How many parameters ?
  • A training corpus is composed of \(T\) sequences \(\{x_t,x_{t-1},\cdots,x_{t-n},y\}_{t=1}^T\)
    • Write the conditional likelihood \(P(Y|X;\theta)\) over the corpus
    • Derive the optimum parameters in the ML sense
  • How could you handle unobserved sequences ?

Exercise 2 (tutorial)

Tricks for optimizing the conditional likelihood:

  • Merge together all occurrences of the same \(\theta_s\)
  • Optimize the log-likelihood instead of the likelihood
  • Don’t forget the constraint: sum of proba = 1
  • Use Lagrange Multipliers to handle this constraint

Learning in undirected models

Learning in MRF / CRF

\[p(x_1,\cdots,x_n)=\frac 1 {Z(\varphi)} \prod_{c\in C} \Phi_c (x_c; \varphi)\] \[Z(\varphi) = \sum_{x_1,\cdots,x_n} \prod_{c\in C} \Phi_c(x_c; \varphi)\]

  • The partition function is very hard to evaluate and optimize:

  • We can prove that \(\log Z(\varphi)\) is convex by computing its Hessian \(\nabla_\varphi^2 \log Z(\varphi)\)
  • … But despite being convex, it is very hard to optimize!
  • Its gradient can be estimated with MCMC

Learning in CRF

  • Same issue as with MRF:
    • Requires inference over all training samples for every evaluation of objective / gradient
  • Optim with quasi-Newton method (L-BFGS)
    • Challenger: SGD
      • Inference only on batch of training samples


Learning with latent variables

Example: Language models

  • \(x_t\) = words in a text document
  • An LM models \(p(x_1,\cdots,x_T)\)

  • A better model is conditioned on the topic \(t\): \[p(x)=\sum_t p(x|t)p(t)\]

  • Problem for training: \(t\) is not observed !

Example: Gaussian Mixture Models

\[p(x) = \sum_{k=1}^K p(x|z=k)p(z=k) = \sum_{k=1}^K \pi_k N(x;\mu_k,\Sigma_k)\]

Gaussian Mixture Models

  • GMM are generative models:
    • First sample one cluster according to \(\pi_k\)
    • Then sample a point according to \(N(x;\mu_k,\Sigma_k)\)
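This two-step generative story can be sketched in one dimension (all parameter values below are made up):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical 1-D GMM with K=2 components
pi = np.array([0.5, 0.5])       # mixture weights pi_k
mu = np.array([-2.0, 3.0])      # component means
sigma = np.array([0.5, 1.0])    # component standard deviations

def sample_gmm(n):
    z = rng.choice(len(pi), size=n, p=pi)     # first sample a cluster per point
    x = rng.normal(mu[z], sigma[z])           # then sample from that Gaussian
    return x, z

x, z = sample_gmm(10000)
```

The resulting sample is bimodal, with overall mean \(\sum_k \pi_k \mu_k = 0.5\) here.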

Marginal likelihood training

Maximizing the marginal log-likelihood of the data:

\[\log p(D) = \sum_{x\in D} \log p(x) = \sum_{x\in D} \log \left( \sum_z p(x|z)p(z) \right)\]

Marginal likelihood training

  • No closed-form solution, even for directed graphical models
  • No more convexity properties
  • Use approximate learning: Expectation-Maximization


  • Start with initial \(\theta_0\)

  • Iterate on \(t\):
    • E-step: Compute the posterior \(p(z|x;\theta_t)\) \(\forall x\)
    • M-step: Optimize parameters: \[\theta_{t+1} = \arg\max_\theta \sum_{x\in D} E_{z\sim p(z|x;\theta_t)} \log p(x,z;\theta)\]


  • convergence proof:
    • With Jensen inequality
    • cf. https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf

Example: GMM (E-step)


\[p(z|x;\theta_t) = \frac {p(z,x;\theta_t)}{p(x;\theta_t)} = \frac{p(x|z;\theta_t)p(z;\theta_t)}{\sum_{k=1}^K p(x|z_k;\theta_t)p(z_k;\theta_t)}\]

Intuitively, we successively assume \(x\) is generated by mixture \(z_k\), and compute the corresponding Gaussian.

Example: GMM (M-step)

\[\theta_{t+1} = \arg\max_\theta \sum_{x\in D} E_{z\sim p(z|x;\theta_t)} \log p(x,z;\theta)\]

\[\theta_{t+1} = \arg\max_\theta \sum_k \sum_{x\in D} p(z_k|x;\theta_t) \log p(x|z_k;\theta) +\] \[\sum_k \sum_{x\in D} p(z_k|x;\theta_t) \log p(z_k;\theta) \]

Each term can be optimized independently (because they affect different parameters)

GMM training: emission term

\[p(x|z_k;\theta) = N(x;\mu_k,\Sigma_k)\]

Maximization = find the \(\mu_k,\Sigma_k\) that maximize: \[\sum_{x\in D} p(z_k|x;\theta_t) \log p(x|z_k;\theta)\]

GMM training: emission term

Solution: simply the weighted mean and variance of the data: \[\mu_k=\frac{\sum_{x\in D} p(z_k|x;\theta_t)\, x}{\sum_{x'\in D} p(z_k|x';\theta_t)}\]

\[\Sigma_k = \frac{\sum_{x\in D} p(z_k|x;\theta_t)\, (x-\mu_k)(x-\mu_k)^T}{\sum_{x'\in D} p(z_k|x';\theta_t)}\]

GMM training: class priors


\[\pi_k = \frac 1 {|D|} \sum_{x\in D} p(z_k|x;\theta_t)\]

  • All these results may be derived from standard calculus
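Putting the E-step and the three M-step updates together gives a minimal 1-D EM loop; a sketch on synthetic data (initialization, seed and iteration count are arbitrary choices, and no numerical safeguards are included):

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic 1-D data from two well-separated modes
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(3, 1.0, 500)])

K = 2
pi = np.full(K, 1 / K)
mu = np.array([-1.0, 1.0])      # deliberately rough initialization
var = np.full(K, 1.0)

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: responsibilities p(z_k | x; theta_t) via Bayes rule
    r = pi * gauss(data[:, None], mu, var)       # shape (N, K)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted means, variances and class priors
    Nk = r.sum(axis=0)
    mu = (r * data[:, None]).sum(axis=0) / Nk
    var = (r * (data[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(data)
```

With well-separated modes, EM recovers means close to the true \(-2\) and \(3\); with a poor initialization it could instead settle in a local optimum, as noted below.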

Properties of EM

  • Marginal likelihood increases after every EM cycle

  • EM eventually converges

  • But it may (most often) converge to a local optimum

  • It is very sensitive to its initial parameters \(\theta_0\)

Limits of ML

Example 1: biased coin

  • XP1: 10 samples, observe 6 heads \(\rightarrow\) \(\theta_h=0.6\)

  • XP2: 100 samples, observe 60 heads \(\rightarrow\) \(\theta_h=0.6\)

  • No notion of reliable estimate !

Limits of ML

Example 2: language model

D={“Graphical models are fun”}

\(\theta(graphical)=0.2\) \(\theta(models)=0.2\)
\(\theta(are)=0.2\) \(\theta(fun)=0.2\)

  • \(P(\text{Graphical models are hard}|\theta)\) = 0 !
  • Difficult to add prior information !

Bayesian learning

Bayesian learning

Model parameters \(\theta\) are random variables as well !

  • The prior distribution \(p(\theta)\) encodes our initial beliefs.

  • After observing data \(D\) (=evidence), we update these beliefs with:

\[p(\theta|D) = \frac {p(D|\theta)p(\theta)}{p(D)}\]

Conjugate priors

  • Computing the denominator \(p(D)\) implies an integral \(\rightarrow\) hard !

  • A parametric family \(\varphi\) is conjugate to the likelihood \(p(D|\theta)\) if \(p(\theta)\in\varphi \Rightarrow p(\theta|D) \in \varphi\)

  • We may then use simple calculus to compute the Bayes rule

Coin example

  • \(\theta=\) proba of heads; the data likelihood follows a Bernoulli distrib; if \(h=\)nb of heads, \(n=\)nb of samples: \[p(D|\theta)=\theta^h(1-\theta)^{n-h}\]

  • Conjugate prior of Bernoulli = Beta: \[p(\theta)=\frac{\theta^{\alpha_H-1}(1-\theta)^{\alpha_T-1}}{B(\alpha_H,\alpha_T)}\]

Exercise: derive the posterior \(p(\theta|D)\)

Coin example

Computation of the posterior: \[p(\theta|D)=\frac{p(D|\theta)p(\theta)}{p(D)} = \frac{\theta^{h+\alpha_H-1}(1-\theta)^{n-h+\alpha_T-1}}{B(\alpha_H,\alpha_T)p(D)}\]

  • We know that \(p(\theta|D)\) is Beta, so the denominator is \(B(h+\alpha_H,n-h+\alpha_T)\)
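Because of conjugacy, the whole Bayesian update reduces to adding observed counts to the prior pseudo-counts; a small sketch (the prior strength and data below are made up):

```python
# Beta-Bernoulli conjugacy: prior Beta(a_H, a_T) + data (h heads in n flips)
# -> posterior Beta(h + a_H, n - h + a_T)
a_H, a_T = 2.0, 2.0        # assumed weak symmetric prior (pseudo-counts)
h, n = 60, 100             # hypothetical observations

post_H, post_T = h + a_H, n - h + a_T
posterior_mean = post_H / (post_H + post_T)   # E[theta | D] of a Beta
ml_estimate = h / n                           # plain ML, for comparison
```

The posterior mean (62/104 ≈ 0.596) is pulled slightly toward the prior mean 0.5 compared to the ML estimate 0.6; with more data the two converge.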

Coin posterior

  • Iterate with new data: evidence will cumulate:

Notion of reliability

  • The more data, the more peaked the posterior

Notion of prior knowledge

  • Our prior distribution = Beta is such that we initially give equal probability to all words, including the ones that do not occur in the training set !

Latent Dirichlet Allocation (LDA)

Topic model: cluster a set of documents into topics

Plate diagram:

Latent Dirichlet Allocation (LDA)

  • \(N\) words in a document, \(M\) documents, \(K\) topics
  • \(w_{m,n}\)= one observed word (multinomial)
  • \(z_{m,n}\)= topic of one word (multinomial)
  • \(\theta_m\)= topic distribution for document \(m\) (Dirichlet)
  • \(\varphi_k\)= word distribution for topic \(k\) (Dirichlet)
  • \(\alpha\)= param. of the Dirichlet prior
  • \(\beta\)= param. of the Dirichlet prior

Dirichlet distribution

Conjugate prior of the Multinomial distribution

Inference or training ?

What can we do with this model ?

  • Initially, we initialize this model (through \(\alpha=0.1\) and \(\beta=0.1\)) following our prior beliefs

  • But we want to update it according to new observations (words)

= train LDA in an unsupervised way with Gibbs sampling

Next: optional slides

Inference with Gibbs sampling

\[P(W,Z,\theta,\varphi;\alpha,\beta) = \prod_i^K P(\varphi_i;\beta) \prod_j^M P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j)P(W_{j,t}|\varphi_{Z_{j,t}})\]

We don’t need the values of \(\varphi\) and \(\theta\) \(\rightarrow\) Collapsed Gibbs sampling

\[P(W,Z;\alpha,\beta) = \int_\theta \int_\varphi P(W,Z,\theta,\varphi;\alpha,\beta) d\varphi d\theta\]

\[=\int_\varphi \prod_i^K P(\varphi_i;\beta) \prod_j^M \prod_t^N P(W_{j,t}|\varphi_{Z_{j,t}}) d\varphi \int_\theta \prod_j^M P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta\]

Inference with Gibbs sampling

Every \(\theta_j\) is independent from all other \(\theta\) and \(\varphi\), so the term for one of them is:

\[\int_{\theta_j} P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta_j = \int_{\theta_j} \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \prod_i^K \theta_{j,i}^{\alpha_i -1} \prod_t^N P(Z_{j,t}|\theta_j)d\theta_j\]

Let \(n_{j,r}^i\) be the nb of occurrences of word \(r\) in document \(j\) with topic \(i\) assigned at the previous Gibbs iteration, and \(n_{j,\cdot}^i=\sum_r n_{j,r}^i\); then \[\prod_t^N P(Z_{j,t}|\theta_j) = \prod_i^K \theta_{j,i}^{n_{j,\cdot}^i}\]

Inference with Gibbs sampling


\[\int_{\theta_j} P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta_j = \int_{\theta_j} \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \prod_i^K \theta_{j,i}^{n_{j,\cdot}^i+\alpha_i -1} d\theta_j\]

This has the form of a Dirichlet !

So we just introduce the correct constant that makes the Dirichlet sum to 1, and all that remains is: \[\int_{\theta_j} P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta_j = \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \frac{\prod_i^K \Gamma(n_{j,\cdot}^i+\alpha_i)}{\Gamma\left( \sum_i^K (n_{j,\cdot}^i + \alpha_i) \right)}\]

Inference with Gibbs sampling

Homework: do the same for \(\varphi\) to get:

\[P(W,Z;\alpha,\beta) = \prod_j^M \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \frac{\prod_i^K \Gamma(n_{j,\cdot}^i+\alpha_i)}{\Gamma\left( \sum_i^K (n_{j,\cdot}^i + \alpha_i) \right)} \times \prod_i^K \frac{\Gamma\left( \sum_r^V \beta_r \right)}{\prod_r^V \Gamma(\beta_r)} \frac{\prod_r^V \Gamma(n_{\cdot,r}^i+\beta_r)}{\Gamma\left( \sum_r^V (n_{\cdot,r}^i + \beta_r) \right)}\]

To approximate \(P(Z|W;\alpha,\beta)\), Gibbs sampling requires

\[P(Z_{m,n}|Z_{-(m,n)},W;\alpha,\beta) = \frac{P(Z_{m,n},Z_{-(m,n)},W;\alpha,\beta)}{P(Z_{-(m,n)},W;\alpha,\beta)}\]

Inference with Gibbs sampling

We can sample from the unnormalized distribution

\[P(Z_{m,n}|Z_{-(m,n)},W;\alpha,\beta) \propto P(Z_{m,n},Z_{-(m,n)},W;\alpha,\beta)\]

Let \(v\) be the word at \((m,n)\). Removing all factors that do not depend on \(m,n\), it remains:

\[\propto \prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) \prod_i^K \frac{\Gamma(n_{\cdot,v}^i+\beta_v)}{\Gamma\left( \sum_r^V n_{\cdot,r}^i +\beta_r\right)}\]

Let’s compute it for each possible topic assignment \(Z_{m,n}=k\), one after the other.

Inference with Gibbs sampling

  • \(n_{m,\cdot}^i\) is the nb of words with topic \(i\) in document \(m\)
  • Let \(n_{m,\cdot}^{i,-(m,n)}\) be the counts excluding the \(n^{th}\) word:

Assume that the topic of the \(n^{th}\) word \(Z_{m,n}=k\), so for \(i=k\): \[n_{m,\cdot}^i = n_{m,\cdot}^{i,-(m,n)} + 1\] and for \(i\neq k\): \[n_{m,\cdot}^i = n_{m,\cdot}^{i,-(m,n)}\]

Inference with Gibbs sampling

So \[\prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) = \prod_{i\neq k}^K \Gamma(n_{m,\cdot}^{i,-(m,n)}+\alpha_i) \times \Gamma(n_{m,\cdot}^{k,-(m,n)}+\alpha_k+1)\] We know that \(\Gamma(1+x)=x\Gamma(x)\), so: \[\prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) = \prod_{i\neq k}^K \Gamma(n_{m,\cdot}^{i,-(m,n)}+\alpha_i) \times \Gamma(n_{m,\cdot}^{k,-(m,n)}+\alpha_k)\times (n_{m,\cdot}^{k,-(m,n)}+\alpha_k)\] \[\prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) = \prod_{i}^K \Gamma(n_{m,\cdot}^{i,-(m,n)}+\alpha_i) \times (n_{m,\cdot}^{k,-(m,n)}+\alpha_k)\] The first product does not depend on the chosen topic \(k\), so we can remove it !

Inference with Gibbs sampling

Similarly for the other term, we get:

\[\propto \left( n_{m,\cdot}^{k,-(m,n)} + \alpha_k \right) \frac{n_{\cdot,v}^{k,-(m,n)} + \beta_v}{\sum_r^V n_{\cdot,r}^{k,-(m,n)} + \beta_r}\]

We can now sample every \(Z\) from this unnormalized posterior.
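The whole collapsed Gibbs sampler can be sketched from this final formula, with symmetric \(\alpha,\beta\) and a tiny made-up corpus of word ids (topic count, corpus and number of sweeps are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)

# Made-up corpus: documents as lists of word ids in a vocabulary of size V
docs = [[0, 1, 0, 2], [3, 4, 3, 4], [0, 2, 1], [4, 3, 4]]
V, K = 5, 2
alpha, beta = 0.1, 0.1

# Random topic assignments and the three count tables they induce
z = [[int(rng.integers(K)) for _ in d] for d in docs]
n_mk = np.zeros((len(docs), K))     # topic counts per document
n_kv = np.zeros((K, V))             # word counts per topic
n_k = np.zeros(K)                   # total words per topic
for m, d in enumerate(docs):
    for t, w in enumerate(d):
        k = z[m][t]
        n_mk[m, k] += 1; n_kv[k, w] += 1; n_k[k] += 1

for _ in range(200):                # Gibbs sweeps
    for m, d in enumerate(docs):
        for t, w in enumerate(d):
            k = z[m][t]             # remove current assignment from the counts
            n_mk[m, k] -= 1; n_kv[k, w] -= 1; n_k[k] -= 1
            # Unnormalized posterior over topics, as in the formula above:
            # (n_mk^- + alpha) * (n_kv^- + beta) / (n_k^- + V*beta)
            p = (n_mk[m] + alpha) * (n_kv[:, w] + beta) / (n_k + V * beta)
            p /= p.sum()
            k = int(rng.choice(K, p=p))   # sample a new topic, restore counts
            z[m][t] = k
            n_mk[m, k] += 1; n_kv[k, w] += 1; n_k[k] += 1

# Smoothed topic-word estimate, as on the next slide
phi_hat = (n_kv + beta) / (n_kv + beta).sum(axis=1, keepdims=True)
```

Note that \(\varphi\) and \(\theta\) are never sampled: only the count tables are maintained, which is exactly what “collapsed” means here.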

Output of Gibbs sampling

  • Gibbs sampling gives many samples of \(Z=(Z_1,\cdots,Z_T)\)

  • We can estimate from these samples interesting distributions, for instance words that are the most related to a topic: \[\hat \varphi_{z,w} = \frac {N(W=w,Z=z)+\beta}{\sum_v (N(W=v,Z=z) + \beta)}\]

  • Because we are Bayesian, our model gives us a distribution over all possible \(\varphi_z\)

  • We want the most probable \(\hat \varphi_z = E_{P(\varphi_z | W_z, \beta)}[\varphi_z]\)
    • \(W_z\) are our “observations” obtained by Gibbs sampling = all the words assigned to topic \(z\)
  • The posterior \(P(\varphi_z|W_z,\beta) \propto P(W_z|\varphi_z)P(\varphi_z|\beta)\)
    • The 1st \(P()\) is multinomial, the 2nd is Dirichlet \(\rightarrow\) product = Dirichlet
    • The new \(\beta\) of the resulting Dirichlet are \(N_v + \beta_v\), with \(N_v\) the number of samples of word \(v\) assigned to topic \(z\)
  • The formula of the mean of a Dirichlet gives us the target

Non-parametric models

The number of topics is fixed, but what if you process an infinite stream of text ?

  • Grow the number of topics with the size of incoming data

  • Non-parametric models
    • The nb of parameters increases with the data
  • Extensions of LDA:
    • Hierarchical Dirichlet Process Mixture Model
    • Hierarchical Pitman-Yor process
    • Chinese Restaurant Process
    • Indian Buffet Process