Christophe Cerisara

2018/2019

- Bayes fundamentals
- Directed / undirected models
- Naive Bayes / MRF, CRF

- Inference
- Marginal vs. MAP
- Approximate: MCMC

- Learning
- Maximum Likelihood
- Bayesian learning

- Latent variable models
- Mixture models, EM

- Refs:
- http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/020217.pdf
- https://ermongroup.github.io/cs228-notes/

- Model *uncertainty* and intrinsic variability: “there’s a 90% chance of rain”
- a model may generate various (random) answers for the same input

- Solid foundations to
- inject prior knowledge
- explain some forms of regularization

- Neural networks actually output the posterior \(p(y|x,D)\)
- Proof (Lippmann, 1991):
- the network minimizes the MSE = \(E_{p(x,y)}[(\hat y(x) - y)^2]\)
- one can show this is also the minimum of \(E_{p(x,y)}[(\hat y(x) - p(y|x))^2]\)

- Recent deep neural networks:
- Variational autoencoder: DNN model latent proba distributions
- Generative Adversarial Networks: training DNNs to match proba distributions

- We want to detect SPAM emails, using the whole text as features.
- We can code the occurrence of the \(i^{th}\) entry in the vocabulary into \(X_i \in \{0,1\}\) for \(1\leq i \leq V\)

- The joint \(P(x)=P(x_1,\cdots,x_V)\) assumes that both following texts are totally independent:

```
Cheap product to sell
Buy a cheap product
```

- Issues:
- Exponential number of values !
- No sharing of information between them

We may assume *conditional independence* between variables: \[p(y,x_1,\cdots,x_V) = p(y) \prod_i^V p(x_i|y)\]

- This is the *Naive Bayes* assumption
- With \(x_i=\) word, this is aka *Bag-of-words*
- Requires only \(O(V)\) parameters !

- \(p(x_i|y)\) models the likelihood to observe the word \(x_i\) in SPAM (resp. non-SPAM)
- Discrete domain for \(x_i\) \(\rightarrow\) *Multinomial* distribution
- Parameterized by \(2V\) parameters (\(V\) is the vocabulary size, 2 is the size of the domain of \(x_i\)): \[p(x_i=j|y) = \theta_{i,j}~~~~~ 1 \leq i \leq V\] \[\sum_{j=1}^2 \theta_{i,j} = 1\]
- (note: we could have used only \(V\) parameters here, but in the general case with \(N\) classes, we need \(NV\) parameters)

- 1: describe the model with an equation

\[p(y,x_1,\cdots,x_V) = p(y) \prod_i^V p(x_i|y)\]

- 2: describe our independence assumptions via a graph

\(X_i\) is conditionally independent from all other \(X_j\) given \(Y\)

- 3: generative models can also be described by a *generative story*:

- To generate an email, we first sample whether it’s a spam or not from the Bernoulli distribution \[Y \sim B(\alpha)\]
- Then, for each of the \(V\) words in the vocabulary:
- sample \(X_i\) given \(Y\) (i.e., decide whether word \(i\) occurs in the email) from the Bernoulli distribution \[X_i \sim B(\theta_{i,Y})\]
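This generative story can be simulated directly; a minimal sketch, with a made-up 4-word vocabulary and invented parameters \(\alpha\) and \(\theta\):

```python
import random

# Invented parameters for a toy 4-word vocabulary (illustration only).
alpha = 0.4                              # P(Y=1): prior probability of spam
theta = {0: [0.05, 0.1, 0.2, 0.1],       # theta[y][i] = P(X_i=1 | Y=y)
         1: [0.80, 0.7, 0.1, 0.3]}

def sample_email():
    """Follow the generative story: sample Y, then each X_i given Y."""
    y = 1 if random.random() < alpha else 0           # Y ~ B(alpha)
    x = [1 if random.random() < theta[y][i] else 0    # X_i ~ B(theta_{i,Y})
         for i in range(len(theta[y]))]
    return y, x
```

Sampling many emails and counting the labels recovers \(\alpha\) empirically.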

It’s a pair \((G,P)\) where

- \(G=(V(G),A(G))\) is an acyclic directed graph
- \(V(G)\) is a set of vertices
- \(A(G) \subseteq V(G) \times V(G)\) is a set of arcs

- Joint probability distribution: \[P(X_{V(G)}) = \prod_{v \in V(G)} p(X_v | X_{\pi(v)})\]

- \(p(X_v | X_{\pi(v)})\) is a CPD (Conditional Probability Distribution)

2 types of graphical models:

- If \(G\) is a DAG = Bayesian networks
- If \(G\) is an undirected graph = Markov random fields
- (see next)

Exercise: write the equation of this graphical model.

\[P(X_1,X_2,X_3)=P(X_1|X_2,X_3)P(X_2|X_3)P(X_3)\]

- \(X_2\) is conditionally independent of \(X_1\) given \(X_3\)

Example: Model of voting preferences

- (A,B), (B,C), (C,D) and (D,A) are friends
- They talk and influence each other when voting

The proba of observing a given vote, say \((A,B,C,D)=(0,1,1,1)\), is proportional to the score:

\[\tilde{p}(A,B,C,D) = \phi(A,B)\phi(B,C)\phi(C,D)\phi(D,A)\]

- \(\phi(X,Y)\) = factor that assigns more weight to votes among friends:
- \(\phi(X,Y)=10\) if X=Y=1
- \(\phi(X,Y)=5\) if X=Y=0
- \(\phi(X,Y)=1\) otherwise

We still need to normalize to get a probability distribution:

\[p(A,B,C,D) = \frac 1 Z \tilde{p}(A,B,C,D)\] with \[Z = \sum_{A,B,C,D} \tilde{p}(A,B,C,D)\]

Would be difficult to model with a Bayesian network !
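Since there are only \(2^4=16\) assignments here, the partition function \(Z\) can be computed by brute-force enumeration; a sketch of the voting example:

```python
from itertools import product

def phi(x, y):
    """Pairwise factor that favors agreement among friends."""
    if x == y == 1:
        return 10
    if x == y == 0:
        return 5
    return 1

def p_tilde(a, b, c, d):
    """Unnormalized score over the friendship cycle (A,B),(B,C),(C,D),(D,A)."""
    return phi(a, b) * phi(b, c) * phi(c, d) * phi(d, a)

# Partition function: sum the score over all 2^4 assignments.
Z = sum(p_tilde(*v) for v in product([0, 1], repeat=4))

def p(a, b, c, d):
    """Normalized probability of a vote."""
    return p_tilde(a, b, c, d) / Z
```

With these factors, the unanimous vote \((1,1,1,1)\) alone carries most of the probability mass.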

A MRF is a proba distribution over variables \(x_1,\cdots,x_n\) defined by an undirected graph \(G\) in which nodes correspond to variables \(x_i\). \(p\) has the form: \[p=\frac 1 Z \prod_{c\in C} \phi_c(x_c)\] where \(C\)=set of *cliques* of G.

Normalizing constant: \[Z=\sum_{x_1,\cdots,x_n}\prod_{c\in C} \phi_c(x_c)\]

- MRF can be applied to a wider range of problems
- MRF can compactly represent certain dependencies that BN cannot

- Computing \(Z\) may require summing over an exponential number of assignments (NP-hard)
- Easier to generate data from a BN

- Always possible to transform BN into MRF

- Inverse is possible, but may be computationally intractable

Important special case of MRF:

\[P(y|x) = \frac 1 {Z(x)} \prod_{t}^T \phi(y_t|x) \prod_t^{T-1} \phi(y_t,y_{t+1}|x)\]

- \(x\) is actually composed of 2 components:
- global observed features (not shown in the graph): e.g., nb of observed letters
- local observations (\(x_t\))

- All potentials are conditioned on \(x\):
- \(Z(x)\) now depends on \(x\), because \(x\) is assumed fixed, and so we sum only over \(y\)

In most applications, we define potentials as follows:

\[\phi(x_c) = \exp(w_c^T f(x_c))\]

where \(f(x_c)\) is a feature vector, e.g.:

- How many vertical lines are there ?
- Is the image closest to training example 524 ?

Ref: https://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf

Why use this form \(\phi(x_c) = \exp(w_c^T f(x_c))\) ?

Because:

- It is always positive
- The global product simplifies into a log-linear model
- Logistic regression (or Maximum Entropy model) is a special case of a CRF

Ref: https://homes.cs.washington.edu/~thickstn/docs/crf.pdf

Once a model is defined, how can it be used ?

- Marginal inference:
- Assuming that the parameters \(\theta\) are known/fixed
- How to compute the distribution of variables of interest \(Y\) ?

- Training:
- Assuming some annotated dataset is provided: \[(X_1,Y_1),\cdots,(X_T,Y_T)\]
- How to compute the optimal parameters \(\theta\) ?

It’s a way to answer questions given a model

- Example of marginal inference
- What is the proportion of spams ? \[p(y)=\sum_{x_1} \cdots \sum_{x_n} p(y,x_1,\cdots,x_n)\]
- What is the probability that an email is a spam ? \[p(y|x)=\frac {p(y,x_1,\cdots,x_n)}{p(x_1,\cdots,x_n)}\]

Instead of computing the whole distribution as in marginal inference, it’s often easier to compute a single optimal value:

- Maximum a posteriori (MAP) inference
- What is the most likely spam message ? \[\max_{x_1,\cdots,x_n} p(y=1,x_1,\cdots,x_n)\]
- Classification: Is this email a spam ? (**EXERCISE**) \[\max_{y} p(y|x)\]

Algorithms exist to compute exact inference in the general case:

- Variable elimination
- Better: Junction tree algorithms:
- (Loopy) Belief propagation

… But the general case is NP-hard !

- \(\simeq\) No known algo in polynomial complexity
- (http://www.columbia.edu/~cs2035/courses/ieor4405.S17/np.pdf)

MAP inference is less ambitious, sometimes easier (ex: MRF):

\[\arg\max_x \left(\sum_c \log \phi_c(x_c)\right) -\log Z\]

- “one point” in general inference
- depending on the model \(\rightarrow\) efficient specific algorithms
- Graphcuts
- Viterbi

Either use a special / known model with its dedicated algorithms

Or forget exact inference, and run approximate inference !

2 classes of approximate inference methods:

Variational (see next in deep models)

- Sampling = Monte Carlo approach:
- Draw one random sample, following the model
- Repeat many times
- The target distribution is approximated by this cloud of points

Optional / going further:

- Forward sampling
- Importance sampling
- MCMC methods
- Gibbs sampling
- Metropolis-Hastings

Computers can generate samples from simple distributions (uniform)

- Sampling from multinomial:
- subdivide a unit interval as the multinomial
- sample uniformly
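A minimal sketch of this procedure (subdivide the unit interval proportionally to the probabilities, then locate a uniform draw):

```python
import random

def sample_multinomial(probs, u=None):
    """Subdivide [0, 1) according to probs, then locate a uniform draw."""
    if u is None:
        u = random.random()              # one uniform sample in [0, 1)
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p                         # right boundary of the i-th segment
        if u < cum:
            return i
    return len(probs) - 1                # guard against rounding errors
```

E.g. with `probs = [0.2, 0.5, 0.3]`, draws below 0.2 map to outcome 0, draws in \([0.2, 0.7)\) to outcome 1, and the rest to outcome 2.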

- Sampling from directed graphical models: *forward* sampling
- Sample variables in topological order (starting from variables without parents)

- Sample the value of \(X_3=a\) from \(P(X_3)\)
- Sample the value of \(X_2=b\) from \(P(X_2|X_3=a)\)
- Sample the value of \(X_1=c\) from \(P(X_1|X_3=a,X_2=b)\)

What if \(X_2=x_2\) is observed ?

- Sample the value of \(X_3=a\) from \(P(X_3)\)
- Sample the value of \(X_2=b\) from \(P(X_2|X_3=a)\)
- If \(b\neq x_2\), reject sample and re-start

- Sample the value of \(X_1=c\) from \(P(X_1|X_3=a,X_2=x_2)\)

- Can be used to estimate a (marginal or joint) proba: \[P(X=x)=\frac {N(X=x)}{N}\]
- with \(N=\) total nb of samples
- with \(N(X=x)=\) nb of samples with \(X=x\)

- Can be used to estimate a conditional proba:

\[P(X=x|Y=y) = \frac {P(X,Y)}{P(Y)} = \frac{N(X=x,Y=y)}{N} \cdot \frac{N}{N(Y=y)}\] \[P(X=x|Y=y) = \frac {N(X=x,Y=y)}{N(Y=y)}\]
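A sketch of forward sampling and of these count-based estimates, on a hypothetical binary network with structure \(X_3 \rightarrow X_2\) and \((X_3,X_2) \rightarrow X_1\) (all CPD values invented):

```python
import random

# Invented CPDs for a 3-node binary network: X3 -> X2, (X3, X2) -> X1.
P_X3 = 0.6                                    # P(X3=1)
P_X2 = {0: 0.2, 1: 0.7}                       # P(X2=1 | X3)
P_X1 = {(0, 0): 0.1, (0, 1): 0.5,
        (1, 0): 0.4, (1, 1): 0.9}             # P(X1=1 | X3, X2)

def bern(p):
    return 1 if random.random() < p else 0

def forward_sample():
    """Sample variables in topological order: X3, then X2, then X1."""
    x3 = bern(P_X3)
    x2 = bern(P_X2[x3])
    x1 = bern(P_X1[(x3, x2)])
    return x1, x2, x3

def estimate(event, n=50000):
    """Monte Carlo estimate of P(event) = N(event) / N."""
    return sum(bool(event(*forward_sample())) for _ in range(n)) / n
```

Here \(P(X_2=1) = 0.6 \cdot 0.7 + 0.4 \cdot 0.2 = 0.5\), which the counts recover.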

Evidence \(\rightarrow\) high rejection rate \(\rightarrow\) very costly !

Solution: Don’t reject samples, but rather *fix* the values of observed nodes:

- Gibbs sampling
- Importance sampling

- Samples are *dependent* and form a Markov Chain
- i.e., \(s_t\) depends on \(s_{t-1}\)

Is a type of Markov Chain Monte Carlo (MCMC) method

Guaranteed to converge when all \(P>0\)

Init \(t=0\):

- Fix values of all observed variables
- For each latent variable: initialize it randomly

Intuition: update every variable one after the other, assuming all the others are given from previous iteration

- Iterate through each latent variable \(x_i\) one at a time:
- Assume all other variables \(x_{-i}\) are observed
- Derive the distribution of possible values that \(x_i\) can take: \(P(x_i|x_{-i})\)
- Sample one such value: \(x'_i \sim P(x_i|x_{-i})\)
- Fix the new value of \(x_i\) and continue to \(i+1\)

Iterate this process long enough

- After some *burn in* period, the samples are drawn from the posterior
- *Burn in* is a new hyper-parameter

- How to derive \(P(x_i|x_{-i})\) ?

- We can get back to the joint: \(P(x_i|x_{-i}) = \frac {P(x_i,x_{-i})}{P(x_{-i})}\)
- The denominator does not depend on \(x_i\), so: \(P(x_i|x_{-i}) \propto P(x_i,x_{-i})\)

Hence, in practice, we can just:

- compute the joint successively for each possible value of \(x_i\) (keeping all other variables at their value from the previous iteration)
- normalize s.t. the sum is 1
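A sketch of this procedure, reusing the voting MRF from earlier: at each step we evaluate the unnormalized joint for both values of \(x_i\), normalize, and sample:

```python
import random

def phi(x, y):
    """Pairwise factor from the voting example."""
    return 10 if x == y == 1 else (5 if x == y == 0 else 1)

def p_tilde(s):
    """Unnormalized joint over the friendship cycle (A,B),(B,C),(C,D),(D,A)."""
    a, b, c, d = s
    return phi(a, b) * phi(b, c) * phi(c, d) * phi(d, a)

def gibbs(n_sweeps=20000, burn_in=2000):
    """Gibbs sampling over the 4 binary variables of the voting MRF."""
    s = [random.randint(0, 1) for _ in range(4)]
    samples = []
    for t in range(n_sweeps):
        for i in range(4):
            # P(x_i | x_-i) ∝ joint, evaluated for each value of x_i
            w = []
            for v in (0, 1):
                s[i] = v
                w.append(p_tilde(s))
            u = random.random() * (w[0] + w[1])    # normalize and sample
            s[i] = 0 if u < w[0] else 1
        if t >= burn_in:
            samples.append(tuple(s))
    return samples
```

The fraction of collected samples with \(A=1\) approximates the exact marginal \(P(A=1)\) obtained by enumeration.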

- Ways to improve Gibbs sampling (reduce variance):
- Collapsed Gibbs sampling
- Generate multiple chains
- Pick one sample out of K
- …

We may rewrite our objective \(E_{x\sim p}[f(x)]\) as: \[E_{x\sim p}[f(x)] = \sum_x f(x)p(x)\] \[E_{x\sim p}[f(x)] = \sum_x f(x)\frac {p(x)}{q(x)} q(x)\] \[E_{x\sim p}[f(x)] = E_{x\sim q} [f(x)w(x)] ~~~~\text{with}~~ w(x)=\frac{p(x)}{q(x)}\] \[E_{x\sim p}[f(x)] \simeq \frac 1 T \sum_{t=1}^T f(x_t)w(x_t)\]

= Sample according to \(q\) and reweigh them with \(w(x)\)

- Allows to compute marginals and MAP inference
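A minimal sketch on an invented 3-valued target \(p\) with a uniform proposal \(q\):

```python
import random

# Invented discrete target p and uniform proposal q over {0, 1, 2}.
p = [0.7, 0.2, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]

def f(x):
    """Quantity of interest; here simply f(x) = x."""
    return float(x)

def importance_estimate(n=100000):
    """E_{x~p}[f(x)] ≈ (1/T) sum_t f(x_t) w(x_t), with x_t ~ q and w = p/q."""
    total = 0.0
    for _ in range(n):
        x = random.randrange(3)          # sample from the uniform proposal q
        total += f(x) * p[x] / q[x]      # reweigh with w(x) = p(x)/q(x)
    return total / n
```

The exact value is \(E_{x\sim p}[f(x)] = 0\cdot 0.7 + 1\cdot 0.2 + 2\cdot 0.1 = 0.4\).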

Markov Chain:

- seq. of state variables \(S_1,S_2,\cdots\)
- generated from \(P(S_i|S_{i-1})\) (\(\rightarrow\) transition matrix \(T\))
- proba of ending up in each state after \(t\) steps: \[p_t = T^t p_0\]
- When it exists, stationary distribution: \[\pi = \lim_{t\rightarrow +\infty} p_t\]

MCMC: a Markov Chain whose stationary distribution \(\pi\) equals the target distribution \(p\)

2 sufficient conditions:

- *Irreducibility*: possible to get from state \(x\) to any \(x'\) with proba >0 in a finite nb of steps
- prevents absorbing states
- *Aperiodicity*: possible to return to any state at any large enough time: \(\exists n\) s.t. \[\forall i, \forall n'\geq n ~~~ P(s_{n'}=i|s_0=i)>0\]
- prevents “oscillating” chains

Any *ergodic* chain has a stationary distribution.

Assume we have a Markov Chain with \(\pi=p\), and an initial assignment \(x_0\) of the variables of \(p\); we can then:

- run the Markov chain from \(x_0\) for \(B\) burn-in steps
- run the Markov chain for \(N\) sampling steps and collect the visited states = sample of \(p\)

We can compute marginals from this sample, or pick the MAP sample.

\(T(x'|x)\) is built from 2 components:

- A transition kernel \(Q(x'|x)\) defined by the user
- An acceptance probability for moves proposed by \(Q\): \[A(x'|x)=\min\left( 1, \frac {P(x')Q(x|x')}{P(x)Q(x'|x)} \right)\]

At each step of the Markov chain:

- we propose a new point \(x'\) according to \(Q\)
- we accept \(x'\) with proba \(A(x'|x)\)

Assume \(Q\) is uniform: if the proposed \(x'\) is in a low proba region of \(P\), it is often rejected

For continuous variables, \(Q\) is usually a Gaussian centered on \(x\)

For any \(Q\), MH ensures that \(P\) is the stationary distribution of the Markov chain
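A sketch of Metropolis-Hastings on an unnormalized 1-D standard Gaussian target, with a symmetric Gaussian proposal centered on \(x\) (so the \(Q\) terms cancel in the acceptance ratio):

```python
import math
import random

def p_unnorm(x):
    """Unnormalized target: standard Gaussian density, up to a constant."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n=50000, step=1.0):
    """Random-walk MH: propose x' ~ N(x, step^2), accept with proba A(x'|x)."""
    x = 0.0
    samples = []
    for _ in range(n):
        x_new = x + random.gauss(0.0, step)           # x' ~ Q(.|x), symmetric
        a = min(1.0, p_unnorm(x_new) / p_unnorm(x))   # acceptance probability
        if random.random() < a:
            x = x_new                                 # accept the move
        samples.append(x)                             # else keep the current x
    return samples
```

After discarding a burn-in prefix, the empirical mean and variance of the samples approach 0 and 1.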

Special case of MH:

- Iterate through each variable \(x_i\) one at a time:
- sample \(x'_i \sim p(x_i|x_{-i})\)
- set \(x^{t+1} = (x_1^t,\cdots,x'_i,\cdots,x_n^t)\)

Running time of MCMC:

- related to the number of burn-in steps \(B\)
- = mixing time of the Markov chain
- May be very slow:
- if proposals do not move fast enough
- if \(P\) has 2 separated narrow modes

- Function \(f(X)\) that does not know anything about the true generative process of \(X\)
- With parameters \(f_{\theta}(X)\) to make the function compatible with observations \(X\)
- We have matched a linear function to some “unknown” observations (“and”, “xor”, …): \[f_{\theta}(x) = -2 + 2x_0 + x_1\]

- Our “and”, “xor”, “or”… data is generated by a true process that is deterministic, and there exist only a few possible input combinations.
- train and test are the same !

- “xor”-type data fails with linear model
- Question: propose a graphical model

- \(P(x_1,x_2,y;\theta)\) is good here, because very few possible combinations
- Question: how many parameters ? What do parameters represent ?

- \(P(y|x_1,x_2,\theta)\) has the same number of parameters
- Question: how do the parameters differ ?

- What is the Naive Bayes model in this case ?
- How many parameters ?
- Is it good ?

Let us assume that:

- the domain is governed by unknown distribution \(\rho\)
- we are given a dataset \(D\) of \(m\) i.i.d. samples from \(\rho\)
- we are given a family of models \(M\), defined by a set of parameters

Learning = find a model in \(M\) that best approximates \(\rho\)

What is *best* ?

- Let \(X=\) *all* random variables in our model
- Let \(\theta=\) parameters of our model
- We want the proba distrib \(p(X;\theta)\) defined by our model to be as close as possible to the real distrib \(\rho(X)\)
- “close” \(\rightarrow\) distance between proba distrib ?

Entropy of a probability distribution = measure of disorder \[H(p)=-\sum_i p_i \log p_i\]

Kullback-Leibler divergence: \[KL(\rho||p)=\sum_x \rho(x)\log \frac {\rho(x)}{p(x)} = -H(\rho) -E_{x\sim \rho}[\log p(x)]\]

- First term indep from \(p\)
- Maximize the expected *log-likelihood*: \[E_{x\sim \rho}[\log p(x)]\]

But we don’t know \(\rho\) !

So we approximate the expected log-likelihood with the empirical log-likelihood:

\[E_{x\sim \rho}[\log p(x)] \simeq \frac 1 {|D|} \sum_{x\in D} \log p(x)\]

Our trained model is then: \[\max_{\theta} \frac 1 {|D|} \sum_{x\in D} \log p(x;\theta)\]

- Assume \(\rho(x)\) with \(x\in \{h,t\}\)
- Class of models \(M\) = set of all proba distributions over \(\{h,t\}\)
- Define parameters:
- \(p(x=h)=\theta\)
- \(p(x=t)=1-\theta\)

- Our log-likelihood function is: \[L(\theta) = N_h \cdot \log(\theta) + N_t \cdot \log(1-\theta)\]
- Optimal solution: \[\hat \theta = \frac{N_h}{N_h + N_t}\]
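This closed form is easy to check numerically against the log-likelihood:

```python
import math

def coin_mle(n_heads, n_tails):
    """Closed-form maximizer of L(theta) = N_h log(theta) + N_t log(1-theta)."""
    return n_heads / (n_heads + n_tails)

def log_likelihood(theta, n_heads, n_tails):
    """L(theta) for the coin model."""
    return n_heads * math.log(theta) + n_tails * math.log(1 - theta)
```

For 6 heads out of 10 we get \(\hat\theta = 0.6\), and the log-likelihood is indeed smaller at neighbouring values of \(\theta\).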

Maximum Likelihood = optimizing with log loss: \[-\log p(x)\]

- Another loss: conditional log-likelihood: \[-\log p(y|x)\]
- specific prediction task
- no need to estimate the joint
- used to train CRF

- Assume a BN: \[p(x)=\prod_{i=1}^n \theta_{x_i | \pi(x_i)}\]
- Assume iid samples: \(D=\{x^1,\cdots,x^m\}\)

Likelihood: \[L(\theta,D)=\prod_i^n \prod_j^m \theta_{x_i^j|\pi(x_i^j)}\]

Combining same terms: \[\log L(\theta,D)=\sum_i N(x_i,\pi(x_i)) \cdot \log \theta_{x_i | \pi(x_i)}\]

Optimum (see Exercise 2): \[\hat \theta_{x_i|\pi(x_i)} = \frac{N(x_i,\pi(x_i))}{N(\pi(x_i))}\]
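A sketch of this counting estimator for a single CPD, given \((x,\text{parent configuration})\) pairs:

```python
from collections import Counter

def fit_cpd(samples):
    """ML estimate theta_{x|pa} = N(x, pa) / N(pa) from (x, parents) pairs."""
    joint = Counter()
    parents = Counter()
    for x, pa in samples:
        joint[(x, pa)] += 1      # N(x_i, pi(x_i))
        parents[pa] += 1         # N(pi(x_i))
    return {(x, pa): c / parents[pa] for (x, pa), c in joint.items()}
```

E.g. from the pairs \((1,0),(1,0),(0,0),(1,1)\) we get \(\hat\theta_{1|0}=2/3\), \(\hat\theta_{0|0}=1/3\), \(\hat\theta_{1|1}=1\).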

- 2 types of graphical models:
- Bayesian Networks
- Naive Bayes model

- Markov Random Fields
- CRF

- Bayesian Networks

- Inference:
- \(\theta\) is known
- \(X\) is observed
- Exact vs. approximate inference

- Learning:
- \(\theta\) is unknown
- \(X\) is observed
- We want to estimate \(\theta\):
- Maximum Likelihood (closed-form for BN)

- Model to predict punctuation:
- 7 classes: . ! ? ; : , AnyOtherChar

- Model: \(P(y|x_t,x_{t-1},\cdots,x_{t-n}) \sim M(\theta)\)
- \(x_{t-i}=\) char at position \(t-i\)
- \(y=\) punct. class computed from char at position \(t+1\)

```
Sinon j'espère que le nouvel OP de Jojo est mortel
Et la t'as les terroristes tranquillement qui regardent I-Télé sur leur portable en train d'apprendre la stratégie du GIPN #merciitelebfm
top merci
Ça fait peur tout ce qui se passe, c'est beaucoup trop, j'espère que ça va s'arrêter..
t srx
```

- Draw a graphical model
- Is it generative or discriminative ?

- Let \(m\) be the size of the char lexicon
- How many parameters ?

- A training corpus is composed of \(T\) sequences \(\{x_t,x_{t-1},\cdots,x_{t-n},y\}_{t=1}^T\)
- Write the conditional likelihood \(P(Y|X;\theta)\) over the corpus
- Derive the optimum parameters in the ML sense

- How could you handle unobserved sequences ?

Tricks for optimizing the conditional likelihood:

- Merge together all occurrences of the same \(\theta_s\)
- Optimize the log-likelihood instead of the likelihood
- Don’t forget the constraint: sum of proba = 1
- Use Lagrange Multipliers to handle this constraint

\[p(x_1,\cdots,x_n)=\frac 1 {Z(\varphi)} \prod_{c\in C} \Phi_c (x_c; \varphi)\] \[Z(\varphi) = \sum_{x_1,\cdots,x_n} \prod_{c\in C} \Phi_c(x_c; \varphi)\]

The partition function is very hard to evaluate and optimize:

- We can prove that \(\log Z(\varphi)\) is convex by computing its Hessian \(\nabla_\varphi^2 \log Z(\varphi)\)
- … Despite being convex, it is very hard to optimize !
- Its gradient can be estimated with MCMC

- Same issue as with MRF:
- Requires inference over all training samples for every evaluation of objective / gradient

- Optim with quasi-Newton method (L-BFGS)
- Challenger: SGD
- Inference only on batch of training samples

https://people.cs.pitt.edu/~milos/courses/cs3710/Lectures/Class23.pdf

- \(x_t\) = words in a text document
- An LM models \(p(x_1,\cdots,x_T)\)

A better model is conditioned on the *topic* \(t\): \[p(x)=\sum_t p(x|t)p(t)\]

Problem for training: \(t\) is not observed !

\[p(x) = \sum_{k=1}^K p(x|z=k)p(z=k) = \sum_{k=1}^K \pi_k N(x;\mu_k,\Sigma_k)\]

- GMM are generative models:
- First sample one cluster according to \(\pi_k\)
- Then sample a point according to \(N(x;\mu_k,\Sigma_k)\)

Maximizing the marginal log-likelihood of the data:

\[\log p(D) = \sum_{x\in D} \log p(x) = \sum_{x\in D} \log \left( \sum_z p(x|z)p(z) \right)\]

- No closed-form solution, even for directed graphical models
- No more convexity properties
- Use approximate learning: Expectation-Maximization

Start with initial \(\theta_0\)

- Iterate on \(t\):
- *E-step*: compute the posterior \(p(z|x;\theta_t)\) \(\forall x\)
- *M-step*: optimize parameters: \[\theta_{t+1} = \arg\max_\theta \sum_{x\in D} E_{z\sim p(z|x;\theta_t)} \log p(x,z;\theta)\]

- convergence proof:
- With Jensen inequality
- cf. https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf

E-step:

\[p(z|x;\theta_t) = \frac {p(z,x;\theta_t)}{p(x;\theta_t)} = \frac{p(x|z;\theta_t)p(z;\theta_t)}{\sum_{k=1}^K p(x|z_k;\theta_t)p(z_k;\theta_t)}\]

Intuitively, we successively assume \(x\) is generated by mixture \(z_k\), and compute the corresponding Gaussian.

\[\theta_{t+1} = \arg\max_\theta \sum_{x\in D} E_{z\sim p(z|x;\theta_t)} \log p(x,z;\theta)\]

\[\theta_{t+1} = \arg\max_\theta \sum_k \sum_{x\in D} p(z_k|x;\theta_t) \log p(x|z_k;\theta) +\] \[\sum_k \sum_{x\in D} p(z_k|x;\theta_t) \log p(z_k;\theta) \]

Each term can be optimized independently (because affect different parameters)

\[p(x|z_k;\theta) = N(x;\mu_k,\Sigma_k)\]

Maximization = find the \(\mu_k,\Sigma_k\) that maximize: \[\sum_{x\in D} p(z_k|x;\theta_t) \log p(x|z_k;\theta)\]

Solution: simply the weighted mean and variance of the data: \[\mu_k=\frac{\sum_{x\in D} p(z_k|x;\theta_t)\, x}{\sum_{x'\in D} p(z_k|x';\theta_t)}\]

\[\Sigma_k = \frac{\sum_{x\in D} p(z_k|x;\theta_t)\, (x-\mu_k)(x-\mu_k)^T}{\sum_{x'\in D} p(z_k|x';\theta_t)}\]

Similarly:

\[\pi_k = \frac 1 {|D|} \sum_{x\in D} p(z_k|x;\theta_t)\]

- All these results may be derived from standard calculus

Marginal likelihood increases after every EM cycle

EM eventually converge

But it may (most often) converge to a local optimum

It is very sensitive to its initial parameters \(\theta_0\)

Example 1: biased coin

Exp. 1: 10 samples, observe 6 heads \(\rightarrow\) \(\theta_h=0.6\)

Exp. 2: 100 samples, observe 60 heads \(\rightarrow\) \(\theta_h=0.6\)

No notion of *reliable* estimate !

Example 2: language model

D={“Graphical models are fun”}

- \(\theta(graphical)=0.2\), \(\theta(models)=0.2\)
- \(\theta(are)=0.2\), \(\theta(fun)=0.2\)

- \(P(\text{Graphical models are hard}|\theta)\) = 0 !
- Difficult to add prior information !

Model parameters \(\theta\) are random variables as well !

The prior distribution \(p(\theta)\) encodes our initial beliefs.

After observing data \(D\) (=evidence), we update these beliefs with:

\[p(\theta|D) = \frac {p(D|\theta)p(\theta)}{p(D)}\]

Computing the denominator \(p(D)\) implies an integral \(\rightarrow\) hard !

A parametric family \(\varphi\) is conjugate to the likelihood \(p(D|\theta)\) if \(p(\theta)\in\varphi \Rightarrow p(\theta|D) \in \varphi\)

We may then use simple calculus to compute the Bayes rule

\(\theta=\) proba of heads; the data likelihood follows a Bernoulli distrib; if \(h=\)nb of heads, \(n=\)nb of samples: \[p(D|\theta)=\theta^h(1-\theta)^{n-h}\]

Conjugate prior of Bernoulli = Beta: \[p(\theta)=\frac{\theta^{\alpha_H-1}(1-\theta)^{\alpha_T-1}}{B(\alpha_H,\alpha_T)}\]

Exercise: derive the posterior \(p(\theta|D)\)

Computation of the posterior: \[p(\theta|D)=\frac{p(D|\theta)p(\theta)}{p(D)} = \frac{\theta^{h+\alpha_H-1}(1-\theta)^{n-h+\alpha_T-1}}{B(\alpha_H,\alpha_T)p(D)}\]

- We know that \(p(\theta|D)\) is Beta, so the denominator is \(B(h+\alpha_H,n-h+\alpha_T)\)

- Iterate with new data: evidence will accumulate

- The more data, the peakier the posterior
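A sketch of the conjugate update for the coin example; the variance check illustrates how the posterior gets peakier with more data:

```python
def beta_posterior(alpha_h, alpha_t, heads, n):
    """Conjugate update: Beta(a_H, a_T) prior x Bernoulli likelihood
    -> Beta(h + a_H, n - h + a_T) posterior."""
    return heads + alpha_h, n - heads + alpha_t

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def beta_var(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))
```

With a uniform Beta(1, 1) prior and 6 heads out of 10, the posterior is Beta(7, 5); observing 60 heads out of 100 gives the same mean estimate but a much smaller posterior variance.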

- Our prior distribution = Beta is such that we initially give equal probability to all words, including the ones that do not occur in the training set !

Topic model: cluster a set of documents into topics

Plate diagram:

- \(N\) words in a document, \(M\) documents, \(K\) topics
- \(w_{m,n}\)= one observed word (multinomial)
- \(z_{m,n}\)= topic of one word (multinomial)
- \(\theta_m\)= topic distribution for document \(m\) (dirichlet)
- \(\varphi_k\)= word distribution for topic \(k\) (dirichlet)
- \(\alpha\)= param. of the Dirichlet prior
- \(\beta\)= param. of the Dirichlet prior

Conjugate prior of the Multinomial distribution

What can we do with this model ?

We initialize this model (through \(\alpha=0.1\) and \(\beta=0.1\)) following our prior beliefs

But we want to update it according to new observations (words)

= train LDA in an unsupervised way with Gibbs sampling

Next: optional slides

\[P(W,Z,\theta,\varphi;\alpha,\beta) = \prod_i^K P(\varphi_i;\beta) \prod_j^M P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j)P(W_{j,t}|\varphi_{Z_{j,t}})\]

We don’t need the values of \(\varphi\) and \(\theta\) \(\rightarrow\) Collapsed Gibbs sampling

\[P(W,Z;\alpha,\beta) = \int_\theta \int_\varphi P(W,Z,\theta,\varphi;\alpha,\beta) d\varphi d\theta\]

\[=\int_\varphi \prod_i^K P(\varphi_i;\beta) \prod_j^M \prod_t^N P(W_{j,t}|\varphi_{Z_{j,t}}) d\varphi \int_\theta \prod_j^M P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta\]

Every \(\theta_j\) is independent from all other \(\theta\) and \(\varphi\), so the term for one of them is:

\[\int_{\theta_j} P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta_j = \int_{\theta_j} \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \prod_i^K \theta_{j,i}^{\alpha_i -1} \prod_t^N P(Z_{j,t}|\theta_j)d\theta_j\]

Let \(n_{j,r}^i\) be the nb of occurrences of word \(r\) in document \(j\) with topic \(i\) assigned at the previous Gibbs iteration, then \[\prod_t^N P(Z_{j,t}|\theta_j) = \prod_i^K \theta_{j,i}^{n_{j,\cdot}^i}\]

So

\[\int_{\theta_j} P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta_j = \int_{\theta_j} \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \prod_i^K \theta_{j,i}^{n_{j,\cdot}^i+\alpha_i -1} d\theta_j\]

This has the form of a Dirichlet !

So we just introduce the correct constant that makes the Dirichlet sum to 1, and all that remains is: \[\int_{\theta_j} P(\theta_j;\alpha) \prod_t^N P(Z_{j,t}|\theta_j) d\theta_j = \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \frac{\prod_i^K \Gamma(n_{j,\cdot}^i+\alpha_i)}{\Gamma\left( \sum_i^K (n_{j,\cdot}^i + \alpha_i) \right)}\]

Homework: do the same for \(\varphi\) to get:

\[P(W,Z;\alpha,\beta) = \prod_j^M \frac{\Gamma\left( \sum_i^K \alpha_i \right)}{\prod_i^K \Gamma(\alpha_i)} \frac{\prod_i^K \Gamma(n_{j,\cdot}^i+\alpha_i)}{\Gamma\left( \sum_i^K (n_{j,\cdot}^i + \alpha_i) \right)} \times \prod_i^K \frac{\Gamma\left( \sum_r^V \beta_r \right)}{\prod_r^V \Gamma(\beta_r)} \frac{\prod_r^V \Gamma(n_{\cdot,r}^i+\beta_r)}{\Gamma\left( \sum_r^V (n_{\cdot,r}^i + \beta_r) \right)}\]

To approximate \(P(Z|W;\alpha,\beta)\), Gibbs sampling requires

\[P(Z_{m,n}|Z_{-(m,n)},W;\alpha,\beta) = \frac{P(Z_{m,n},Z_{-(m,n)},W;\alpha,\beta)}{P(Z_{-(m,n)},W;\alpha,\beta)}\]

We can sample from the unnormalized distribution

\[P(Z_{m,n}|Z_{-(m,n)},W;\alpha,\beta) \propto P(Z_{m,n},Z_{-(m,n)},W;\alpha,\beta)\]

Let \(v\) be the word at \((m,n)\). Removing all factors that do not depend on \(m,n\), it remains:

\[\propto \prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) \prod_i^K \frac{\Gamma(n_{\cdot,v}^i+\beta_v)}{\Gamma\left( \sum_r^V n_{\cdot,r}^i +\beta_r\right)}\]

Let’s compute it for \(v=1\), and afterwards \(v=2\), and so on.

- \(n_{m,\cdot}^i\) is the nb of words with topic \(i\) in document \(m\)
- Let \(n_{m,\cdot}^{i,-(m,n)}\) be the counts excluding the \(n^{th}\) word:

Assume that the topic of the \(n^{th}\) word \(Z_{m,n}=k\), so for \(i=k\): \[n_{m,\cdot}^i = n_{m,\cdot}^{i,-(m,n)} + 1\] and for \(i\neq k\): \[n_{m,\cdot}^i = n_{m,\cdot}^{i,-(m,n)}\]

So \[\prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) = \prod_{i\neq k}^K \Gamma(n_{m,\cdot}^{i,-(m,n)}+\alpha_i) \times \Gamma(n_{m,\cdot}^{k,-(m,n)}+\alpha_k+1)\] We know that \(\Gamma(1+x)=x\Gamma(x)\), so: \[\prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) = \prod_{i\neq k}^K \Gamma(n_{m,\cdot}^{i,-(m,n)}+\alpha_i) \times \Gamma(n_{m,\cdot}^{k,-(m,n)}+\alpha_k)\times (n_{m,\cdot}^{k,-(m,n)}+\alpha_k)\] \[\prod_i^K \Gamma(n_{m,\cdot}^i+\alpha_i) = \prod_{i}^K \Gamma(n_{m,\cdot}^{i,-(m,n)}+\alpha_i) \times (n_{m,\cdot}^{k,-(m,n)}+\alpha_k)\] The first product does not depend on the chosen topic \(k\), so we can remove it !

Similarly for the other term, we get:

\[\propto \left( n_{m,\cdot}^{k,-(m,n)} + \alpha_k \right) \frac{n_{\cdot,v}^{k,-(m,n)} + \beta_v}{\sum_r^V n_{\cdot,r}^{k,-(m,n)} + \beta_r}\]

We can now sample every \(Z\) from this unnormalized posterior.
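A sketch of one collapsed-Gibbs update implementing this unnormalized posterior, assuming symmetric priors \(\alpha\) and \(\beta\); all count-table names are invented for the sketch:

```python
import random

def sample_topic(m, n, v, z, counts_doc, counts_word, counts_topic,
                 alpha, beta, V, K):
    """One collapsed-Gibbs update for Z_{m,n}: remove the word's current
    assignment from the counts, sample the topic from the unnormalized
    posterior (n_doc + alpha) * (n_word + beta) / (n_topic + V*beta),
    then add the new assignment back."""
    old = z[(m, n)]
    counts_doc[m][old] -= 1       # n_{m,.}^{k,-(m,n)}
    counts_word[old][v] -= 1      # n_{.,v}^{k,-(m,n)}
    counts_topic[old] -= 1        # sum_r n_{.,r}^{k,-(m,n)}
    weights = [(counts_doc[m][k] + alpha) *
               (counts_word[k][v] + beta) / (counts_topic[k] + V * beta)
               for k in range(K)]
    u = random.random() * sum(weights)   # sample without normalizing
    k, acc = 0, weights[0]
    while u > acc:
        k += 1
        acc += weights[k]
    z[(m, n)] = k
    counts_doc[m][k] += 1
    counts_word[k][v] += 1
    counts_topic[k] += 1
    return k
```

Sweeping this update over all \((m,n)\) positions, iteration after iteration, yields the Gibbs samples of \(Z\); note that the count tables stay consistent after every update.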

Gibbs sampling gives many samples of \(Z=(Z_1,\cdots,Z_T)\)

We can estimate from these samples interesting distributions, for instance words that are the most related to a topic: \[\hat \varphi_{z,w} = \frac {N(W=w,Z=z)+\beta}{\sum_v (N(W=v,Z=z) + \beta)}\]

Because we are Bayesian, our model gives us a distribution over all possible \(\varphi_z\)

- We want the most probable \(\hat \varphi_z = E_{P(\varphi_z | W_z, \beta)}[\varphi_z]\)
- \(W_z\) are our “observations” obtained by Gibbs sampling = all the words assigned to topic \(z\)

- The posterior \(P(\varphi_z|W_z,\beta) \propto P(W_z|\varphi_z)P(\varphi_z|\beta)\)
- 1st \(P()\) is multinomial, 2nd is Dirichlet \(\rightarrow\) product = Dirichlet
- The new \(\beta\) of the resulting Dirichlet are \(N_v + \beta_v\) with \(N_v\) the number of samples of word \(v\) assigned to topic \(z\)

The formula of the mean of a Dirichlet gives us the target

The number of topics is fixed, but what if you process an infinite stream of text ?

Grow the number of topics with the size of incoming data

- Non-parametric models
- The nb of parameters increases with data

- Extensions of LDA:
- Hierarchical Dirichlet Process Mixture Model
- Hierarchical Pitman-Yor process
- Chinese Restaurant Process
- Indian Buffet Process