Christophe Cerisara

2018/2019

- The plain auto-encoder cannot really generate new data:
- it requires \(z\) to come from a known \(x\)

- Enhancement: learn to generate from unseen \(z\)

- \(z\) is now drawn from a probability distribution \[z \sim N(\mu,\sigma)\]
- The encoder generates 2 vectors from \(x\): \(\mu\) and \(\sigma\)
- A new sample \(z\) is drawn from \(N(\mu,\sigma)\)
- \(z\) is passed to the decoder to generate \(\hat x\)
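The encode–sample–decode pipeline above can be sketched with toy linear layers (numpy stand-ins for the real encoder/decoder networks; all sizes and weights here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 8, 2  # illustrative input / latent dimensions

# Toy "encoder": two linear maps producing mu and log(sigma^2) from x
W_mu = rng.normal(size=(d_z, d_x))
W_logvar = rng.normal(size=(d_z, d_x))
# Toy "decoder": linear map from z back to x-space
W_dec = rng.normal(size=(d_x, d_z))

x = rng.normal(size=d_x)
mu, logvar = W_mu @ x, W_logvar @ x   # encoder generates 2 vectors from x
sigma = np.exp(0.5 * logvar)
z = rng.normal(mu, sigma)             # a new sample z is drawn from N(mu, sigma)
x_hat = W_dec @ z                     # z is passed to the decoder to generate x_hat
```

In a real VAE the linear maps are replaced by deep networks and the whole pipeline is trained end to end.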

- Let the encoder output \(q_{\theta}(z|x)\) (Gaussian)
- Because \(z\) is a random variable, so is the output of the decoder: \(p_{\phi}(x|z)\)

- The loss for datapoint \(x_i\) is: \[l_i(\theta,\phi)=-E_{z\sim q_{\theta}(z|x_i)}[\log p_{\phi}(x_i|z)] + KL(q_{\theta}(z|x_i)||p(z))\]
- Cross-entropy (in expectation over \(z\)) + regularization
- We assume \(p(z) = N(0,1)\)
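With a Gaussian \(q_{\theta}(z|x)\) and \(p(z)=N(0,1)\), the KL term has a closed form, \(-\frac{1}{2}\sum(1+\log\sigma^2-\mu^2-\sigma^2)\). A minimal numpy version of the per-datapoint loss, using a Bernoulli/cross-entropy reconstruction term as an illustrative choice:

```python
import numpy as np

def vae_loss(x, x_hat_logits, mu, logvar):
    """l_i = -E[log p(x|z)] + KL(q(z|x) || N(0,1)), with the KL in closed form."""
    # Reconstruction: binary cross-entropy between x and sigmoid(x_hat_logits)
    p = 1.0 / (1.0 + np.exp(-x_hat_logits))
    recon = -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    # Closed-form KL between N(mu, sigma) and N(0, 1)
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return recon + kl

x = np.array([1.0, 0.0, 1.0])
# With mu = 0 and logvar = 0, the KL term vanishes and only the
# reconstruction term remains: 3 * log(2) for uniform predictions
loss = vae_loss(x, np.zeros(3), np.zeros(2), np.zeros(2))
```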

- The VAE can be seen as a Bayesian network (BN): \(p(x,z)=p(x|z)p(z)\)
- Generative story:
- For each datapoint \(i\):
- Draw latent variable \(z_i \sim p(z)\)
- Draw datapoint \(x_i \sim p(x|z)\)

- Inference: we want the posterior: \[p(z|x)=\frac{p(x|z)p(z)}{p(x)}\]
- \(p(x)=\int p(x|z)p(z)\,dz\) requires exponential time to compute
- \(p(z|x)\) can be approx. with MCMC or variational inference

- Let’s approx. \(p(z|x)\) with family \(q_{\lambda}(z|x)\)
- \(\lambda=\) variational parameter
- How good is the approximation? \[KL(q_{\lambda}(z|x)||p(z|x)) = E_q[\log q_{\lambda}(z|x)] - E_q[\log p(x,z)] + \log p(x)\]

- Optimal approximate posterior: \[q^*_{\lambda}(z|x)=\arg\min_{\lambda}KL(q_{\lambda}(z|x)||p(z|x))\]
- Intractable because of \(p(x)\)
- What if we remove it?

\[ELBO(\lambda)=E_q[\log p(x,z)] - E_q[\log q_{\lambda}(z|x)]\] Rearranged: \[\log p(x) = ELBO(\lambda) + KL(q_{\lambda}(z|x)||p(z|x))\]
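The rearranged identity follows in two lines from the definition of the KL and \(p(x,z)=p(z|x)\,p(x)\) (note that \(E_q[\log p(x)]=\log p(x)\), since \(p(x)\) does not depend on \(z\)):

```latex
\begin{aligned}
KL(q_{\lambda}(z|x)\,\|\,p(z|x))
  &= E_q[\log q_{\lambda}(z|x)] - E_q[\log p(z|x)] \\
  &= E_q[\log q_{\lambda}(z|x)] - E_q[\log p(x,z)] + \log p(x) \\
  &= -ELBO(\lambda) + \log p(x)
\end{aligned}
```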

- We know \(KL\geq 0\)
- So minimizing KL is equivalent to maximizing ELBO
- ELBO = Evidence lower bound
- We can just maximize ELBO

- \(z\) is indep. from one datapoint to another, so we can decompose ELBO per datapoint: we can use SGD to optimize \(\lambda\)

\[ELBO_i(\lambda)=E_{q_{\lambda}(z|x_i)}[\log p(x_i|z)] - KL(q_{\lambda}(z|x_i)||p(z))\]

- Encoder = inference network:
- outputs parameters \(\lambda\) from \(x\), to give \(q_{\theta}(z|x,\lambda)\)

- Decoder = generative network:
- outputs parameters of Multinomial from \(z\), to give \(p_{\phi}(x|z)\)

- This gives our initial loss:

\[l_i(\theta,\phi)=-E_{z\sim q_{\theta}(z|x_i)}[\log p_{\phi}(x_i|z)] + KL(q_{\theta}(z|x_i)||p(z))\]

- Variational inference = optimizing \(\lambda\)
- Variational EM = optimizing \(\theta\), \(\phi\)

- \(z\) is sampled, so how do we compute \(\frac{\partial l(z)}{\partial \theta}\)?
- Reparameterization trick: rewrite \(z\) as \[z = \mu + \sigma \times \epsilon\]
- with \(\epsilon \sim N(0,1)\) (element-wise)
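A numerical sketch of the rewrite: sampling \(\epsilon\) first makes \(z\) a deterministic, differentiable function of \(\mu\) and \(\sigma\), so gradients can flow to the encoder (values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = np.array([0.5, -1.0]), np.array([1.0, 0.2])

eps = rng.standard_normal(2)   # noise drawn outside the computation graph
z = mu + sigma * eps           # z is now deterministic in (mu, sigma)

# Gradients of z w.r.t. the encoder outputs are well defined:
# dz/dmu = 1 and dz/dsigma = eps (element-wise)
dz_dmu, dz_dsigma = np.ones_like(mu), eps
```

Sampling \(z\) directly from \(N(\mu,\sigma)\) would have no such derivative: the randomness would sit on the path between the parameters and the loss.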

https://jaan.io/what-is-variational-autoencoder-vae-tutorial/

- With a large Y domain, the softmax becomes too costly to compute \[\sigma(z_i)=\frac{e^{z_i}}{\sum_k e^{z_k}}\]
- Class-based hierarchical softmax (Goodman, 2001) \[p(w|h) = p(C_w|h) \times p(w|C_w,h)\]
- Adaptive softmax (Grave, 2017), One-vs-Each (Titsias, 2016), SVD-softmax (Shim, 2017)…
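The class-based factorisation can be illustrated with two small softmaxes replacing one large one; the toy vocabulary, class assignment, and random scores below are arbitrary illustrative choices:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Toy vocabulary of 6 words split into 2 classes of 3 words each
word2class = np.array([0, 0, 0, 1, 1, 1])

rng = np.random.default_rng(0)
class_logits = rng.normal(size=2)      # scores for p(C|h)
word_logits = rng.normal(size=(2, 3))  # scores for p(w|C,h), one row per class

def p_word(w):
    c = word2class[w]
    # p(w|h) = p(C_w|h) * p(w|C_w,h): two softmaxes of size 2 and 3
    return softmax(class_logits)[c] * softmax(word_logits[c])[w - 3 * c]

probs = np.array([p_word(w) for w in range(6)])
```

With \(|V|\) words in \(\sqrt{|V|}\) classes of \(\sqrt{|V|}\) words, each prediction costs \(O(\sqrt{|V|})\) instead of \(O(|V|)\), and the factorised model is still a proper distribution.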

Replace the multinomial with binary classifiers

- Principle: only update a few parameters per training sample
- Skip-gram: forward pass with the central word
- Backprop from the positive word w/ target 1
- Backprop from 2-5 sampled negative words w/ target -1
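The steps above can be sketched as one skip-gram update with negative sampling; this follows the common word2vec sigmoid formulation with targets 1/0 (the slides' target \(-1\) corresponds to the negative label), and all sizes, indices, and the learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 100, 16, 0.1                      # vocab size, embedding dim, learning rate
W_in = rng.normal(scale=0.1, size=(V, d))    # input (central word) embeddings
W_out = rng.normal(scale=0.1, size=(V, d))   # output (context word) embeddings

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

center, positive = 3, 7                      # hypothetical training pair
negatives = rng.integers(0, V, size=5)       # 2-5 sampled negative words

h = W_in[center]                             # forward pass with the central word
grad_h = np.zeros(d)
for w, target in [(positive, 1.0)] + [(int(n), 0.0) for n in negatives]:
    score = sigmoid(h @ W_out[w])
    g = score - target                       # gradient of the logistic loss
    grad_h += g * W_out[w]
    W_out[w] -= lr * g * h                   # update only these few output rows
W_in[center] -= lr * grad_h                  # update only the central input row
```

Only 1 + |negatives| rows of `W_out` and one row of `W_in` are touched per sample, instead of the full \(V \times d\) softmax weights.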

With Neural Language Model:

- Generate the next word \(\hat y\)
- Append \(\hat y\) to the input history: \(w_1,\cdots,w_t,\hat y\)
- Iterate
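The generate–append–iterate loop above can be sketched as follows, with `next_word_probs` a hypothetical stand-in for the neural LM's softmax output (the vocabulary and probability table are toy assumptions):

```python
import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat"]

def next_word_probs(history):
    # Hypothetical toy LM: next-word distribution given the last word
    table = {"the": [0.0, 0.1, 0.8, 0.1],
             "cat": [0.1, 0.1, 0.1, 0.7],
             "sat": [0.9, 0.05, 0.03, 0.02]}
    return np.array(table.get(history[-1], [0.0, 1.0, 0.0, 0.0]))

history = ["the"]
while history[-1] != "<eos>" and len(history) < 10:
    y_hat = VOCAB[int(np.argmax(next_word_probs(history)))]  # generate next word
    history.append(y_hat)                                    # append to the history
```

This is greedy decoding: at each step the single most probable word is committed to, and any early mistake is fed back as input.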

Issues:

- What if \(\hat y\) is wrong?
- Discrepancy between training and test procedures (exposure bias)
- Teacher forcing: training always feeds the gold word as input, never \(\hat y\)

Viterbi: global optimum

\[\max_{w_1,\cdots,w_T} \prod_t P(w_t|w_{t-d},\cdots,w_{t-1})\]

- Beam search: keep the \(N\) best hypotheses at each step
- A*: at each step, estimate the score up to the end of the sequence
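A minimal beam search over the same kind of next-word distribution (the toy LM table and vocabulary are illustrative assumptions; real decoders usually also length-normalise the scores):

```python
import numpy as np

VOCAB = ["<eos>", "a", "b"]

def log_probs(history):
    # Hypothetical toy LM: next-word distribution given the last word
    table = {"a": [0.1, 0.2, 0.7], "b": [0.6, 0.3, 0.1]}
    return np.log(table.get(history[-1], [0.1, 0.5, 0.4]))

def beam_search(start, N=2, max_len=5):
    beams = [([start], 0.0)]                  # (hypothesis, log-score)
    for _ in range(max_len):
        candidates = []
        for hyp, score in beams:
            if hyp[-1] == "<eos>":            # finished hypotheses are kept as-is
                candidates.append((hyp, score))
                continue
            for w, lp in zip(VOCAB, log_probs(hyp)):
                candidates.append((hyp + [w], score + lp))
        # Keep only the N best hypotheses at each step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:N]
    return beams[0][0]

best = beam_search("a")
```

With \(N=1\) this degenerates to greedy decoding; larger \(N\) trades computation for a wider search of the hypothesis space.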

In practice: greedy decoding works well with RNNs!

- Andrej Karpathy https://karpathy.github.io/2015/05/21/rnn-effectiveness/
- Train on Shakespeare (4.4MB): 3 days (GPU), 1 month (CPU)

```
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.
Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.
```