
# Auto-Encoder

## Limits of auto-encoders

• Cannot truly generate new samples:
• $$z$$ must come from a known input $$x$$
• Enhancement: learn to generate from unseen $$z$$

## Variational Auto-Encoder

• $$z$$ is now a probability distribution $z \sim N(\mu,\sigma)$
• The encoder generates 2 vectors from $$x$$: $$\mu$$ and $$\sigma$$
• A new sample $$z$$ is drawn from $$N(\mu,\sigma)$$
• $$z$$ is passed to the decoder to generate $$\hat x$$
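A minimal sketch of this architecture, assuming PyTorch; the layer sizes (784-dimensional inputs, 20 latent dimensions) are illustrative choices, not from the slides:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: the encoder produces mu and log(sigma^2) for q(z|x)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)      # vector mu
        self.logvar = nn.Linear(h_dim, z_dim)  # log sigma^2 (more stable than sigma)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)             # draw z ~ N(mu, sigma)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar         # x_hat, plus stats for the loss
```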

## Variational Auto-Encoder

• Let the encoder output $$q_{\theta}(z|x)$$ (Gaussian)
• Because $$z$$ is a random variable, so is the output of the decoder: $$p_{\phi}(x|z)$$

## Variational Auto-Encoder

• The loss for datapoint $$x_i$$ is: $l_i(\theta,\phi)=-E_{z\sim q_{\theta}(z|x_i)}[\log p_{\phi}(x_i|z)] + KL(q_{\theta}(z|x_i)||p(z))$
• Cross-entropy reconstruction term (in expectation over $$z$$) + KL regularization
• We assume $$p(z) = N(0,1)$$
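A sketch of this loss for the model above, assuming a Bernoulli decoder; the KL term uses the closed form for a diagonal Gaussian against $$N(0,1)$$:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar):
    """-E[log p(x|z)], approximated with a single sample of z, plus KL(q||p)."""
    # Reconstruction: cross-entropy of x under the Bernoulli decoder output.
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    # KL(N(mu, sigma^2) || N(0, 1)) in closed form.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```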

## Graphical model perspective

• The VAE can be seen as a Bayesian network (BN): $$p(x,z)=p(x|z)p(z)$$
• Generative story:
• For each datapoint $$i$$:
• Draw latent variable $$z_i \sim p(z)$$
• Draw datapoint $$x_i \sim p(x|z)$$
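The generative story as code, a minimal NumPy sketch; the linear-plus-sigmoid decoder is an illustrative stand-in for any $$p(x|z)$$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((784, 20)) * 0.1   # stand-in decoder weights (illustrative)

def decoder(z):
    """Hypothetical p(x|z): maps z to Bernoulli parameters."""
    return 1.0 / (1.0 + np.exp(-(W @ z)))

def generate(n, z_dim=20):
    """Ancestral sampling: for each datapoint i, z_i ~ p(z), then x_i ~ p(x|z_i)."""
    xs = []
    for _ in range(n):
        z = rng.standard_normal(z_dim)           # z_i ~ N(0, I)
        xs.append(rng.random(784) < decoder(z))  # x_i ~ Bernoulli(decoder(z_i))
    return np.stack(xs)

print(generate(3).shape)  # (3, 784)
```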

## Graphical model perspective

• Inference: we want the posterior: $p(z|x)=\frac{p(x|z)p(z)}{p(x)}$
• The evidence $$p(x)=\int p(x|z)\,p(z)\,dz$$ requires exponential time to compute
• $$p(z|x)$$ can be approximated with MCMC or variational inference

## Graphical model perspective

• Let’s approximate $$p(z|x)$$ with a family $$q_{\lambda}(z|x)$$
• $$\lambda=$$ variational parameter
• How good is the approximation? $KL(q_{\lambda}(z|x)||p(z|x)) = E_q[\log q_{\lambda}(z|x)] - E_q[\log p(x,z)] + \log p(x)$
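The identity follows from Bayes’ rule, $$p(z|x)=p(x,z)/p(x)$$; a short derivation (standard, not spelled out on the slide):

```latex
\begin{aligned}
KL(q_{\lambda}(z|x)\,\|\,p(z|x))
  &= E_q\!\left[\log \frac{q_{\lambda}(z|x)}{p(z|x)}\right] \\
  &= E_q[\log q_{\lambda}(z|x)] - E_q[\log p(x,z)] + E_q[\log p(x)] \\
  &= E_q[\log q_{\lambda}(z|x)] - E_q[\log p(x,z)] + \log p(x)
\end{aligned}
```

The last step uses the fact that $$\log p(x)$$ does not depend on $$z$$.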

## Graphical model perspective

• Optimal approximate posterior: $q^*_{\lambda}(z|x)=\arg\min_{\lambda}KL(q_{\lambda}(z|x)||p(z|x))$
• Intractable because of $$p(x)$$
• What if we remove it?

## Graphical model perspective

$ELBO(\lambda)=E_q[\log p(x,z)] - E_q[\log q_{\lambda}(z|x)]$

Rearranged:

$\log p(x) = ELBO(\lambda) + KL(q_{\lambda}(z|x)||p(z|x))$

## Graphical model perspective

• We know $$KL\geq 0$$, so $$ELBO(\lambda)\leq \log p(x)$$
• Since $$\log p(x)$$ is fixed w.r.t. $$\lambda$$, minimizing the KL is equivalent to maximizing the ELBO
• ELBO = Evidence lower bound
• We can just maximize the ELBO

## Maximizing ELBO

• $$z$$ is independent from one datapoint to another, so the ELBO decomposes per datapoint: we can use SGD to optimize $$\lambda$$

$ELBO_i(\lambda)=E_{q_{\lambda}(z|x_i)}[\log p(x_i|z)] - KL(q_{\lambda}(z|x_i)||p(z))$
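The KL form follows from factorizing $$p(x_i,z)=p(x_i|z)p(z)$$ inside the ELBO (standard rewriting, not spelled out on the slide):

```latex
\begin{aligned}
ELBO_i(\lambda) &= E_q[\log p(x_i|z)] + E_q[\log p(z)] - E_q[\log q_{\lambda}(z|x_i)] \\
                &= E_q[\log p(x_i|z)] - KL(q_{\lambda}(z|x_i)\,\|\,p(z))
\end{aligned}
```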

## Back to neural nets

• Encoder = inference network:
• outputs the parameters $$\lambda$$ from $$x$$, to give $$q_{\theta}(z|x,\lambda)$$
• Decoder = generative network:
• outputs the parameters of a multinomial from $$z$$, to give $$p_{\phi}(x|z)$$

## Back to neural nets

• This gives our initial loss:

$l_i(\theta,\phi)=-E_{z\sim q_{\theta}(z|x_i)}[\log p_{\phi}(x_i|z)] + KL(q_{\theta}(z|x_i)||p(z))$

• Variational inference = optimizing $$\lambda$$
• Variational EM = optimizing $$\theta$$, $$\phi$$

## Reparametrization trick

• $$z$$ is sampled: how do we compute $$\frac{\partial l(z)}{\partial \theta}$$?
• We rewrite $$z$$ as: $z = \mu + \sigma \times \epsilon$
• with $$\epsilon \sim N(0,1)$$: the randomness now lives in $$\epsilon$$, so gradients flow through $$\mu$$ and $$\sigma$$
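A minimal PyTorch sketch; the tensor sizes and the toy loss are illustrative. Sampling $$z$$ directly would block gradients, while the reparametrized form keeps $$\mu$$ and $$\sigma$$ inside the differentiable graph:

```python
import torch

# Encoder outputs (illustrative values standing in for a real network).
mu = torch.zeros(20, requires_grad=True)
log_sigma = torch.zeros(20, requires_grad=True)

# Reparametrization: the randomness lives in eps, so z is a
# deterministic, differentiable function of mu and sigma.
eps = torch.randn(20)              # eps ~ N(0, 1), needs no gradient
z = mu + log_sigma.exp() * eps     # z ~ N(mu, sigma)

loss = (z ** 2).sum()              # stand-in for l(z)
loss.backward()                    # gradients flow back to mu and log_sigma
print(mu.grad.norm(), log_sigma.grad.norm())
```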

## Reparametrization trick

https://jaan.io/what-is-variational-autoencoder-vae-tutorial/

# Embeddings

## Word-level generation: Softmax issue

• With a large output vocabulary, the softmax becomes too costly to compute, since the denominator sums over every word: $\sigma(z_i)=\frac{e^{z_i}}{\sum_k e^{z_k}}$
• Class-based hierarchical softmax (Goodman, 2001) $p(w|h) = p(C_w|h) \times p(w|C_w,h)$
• Adaptive softmax (Grave, 2017), One-vs-Each (Titsias, 2016), SVD-softmax (Shim, 2017)…
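A NumPy sketch of the class-based factorization; the vocabulary size, class count, and random class assignment are illustrative. With roughly $$\sqrt{|V|}$$ classes, each softmax is small:

```python
import numpy as np

rng = np.random.default_rng(0)
V, C, H = 10000, 100, 64               # vocab, classes, hidden size (illustrative)
word2class = rng.integers(0, C, V)     # each word belongs to one class
class_words = [np.where(word2class == c)[0] for c in range(C)]
W_class = rng.standard_normal((C, H))  # class-score parameters
W_word = rng.standard_normal((V, H))   # word-score parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_word(h, w):
    """p(w|h) = p(C_w|h) * p(w|C_w,h): two softmaxes of ~100 items, not one over 10000."""
    c = word2class[w]
    p_c = softmax(W_class @ h)[c]      # p(C_w|h)
    siblings = class_words[c]          # words sharing w's class
    p_w = softmax(W_word[siblings] @ h)[np.where(siblings == w)[0][0]]
    return p_c * p_w

print(p_word(rng.standard_normal(H), w=42))
```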

## Noise contrastive estimation

Replace the multinomial classification over the full vocabulary with binary classifiers that discriminate observed words from sampled noise words
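A sketch of the NCE objective for one context, assuming k noise words drawn from a known noise distribution q; the scores and log-probabilities below are illustrative numbers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(s_pos, s_noise, logq_pos, logq_noise, k):
    """Binary logistic loss: did the word come from the data or from the noise q?
    The classifier's logit is the model score minus log(k * q(w))."""
    loss = -np.log(sigmoid(s_pos - np.log(k) - logq_pos))                    # data, label 1
    loss -= np.sum(np.log(1.0 - sigmoid(s_noise - np.log(k) - logq_noise)))  # noise, label 0
    return loss

print(nce_loss(2.0, np.array([0.1, -0.5]), -8.0, np.array([-6.0, -7.0]), k=2))
```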

## Negative sampling

• Principle: only update a few parameters per training sample
• Skip-gram: forward pass with the central word
• Backprop from the positive word w/ target 1
• Backprop from 2-5 sampled negative words w/ target -1
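A NumPy sketch of one skip-gram update; the dimensions, learning rate, and uniform negative draw are illustrative (word2vec samples from a smoothed unigram distribution). Targets of 0 for negatives are the logistic-loss equivalent of the slide's -1 convention:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, lr = 10000, 100, 0.025
W_in = rng.standard_normal((V, D)) * 0.01  # input (center-word) embeddings
W_out = np.zeros((V, D))                   # output (context-word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, positive, k=5):
    """Touches only 1 + k rows of W_out and a single row of W_in."""
    h = W_in[center]                        # forward pass with the central word
    pairs = [(positive, 1.0)]               # positive word, target 1
    pairs += [(w, 0.0) for w in rng.integers(0, V, k)]  # k sampled negatives
    grad_h = np.zeros(D)
    for w, t in pairs:
        err = sigmoid(W_out[w] @ h) - t     # logistic-loss gradient
        grad_h += err * W_out[w]
        W_out[w] -= lr * err * h            # update one output row
    W_in[center] -= lr * grad_h             # update the center word only

train_pair(center=10, positive=42)
```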

# Sequence generation

## Greedy Generation

With a neural language model:

• Generate the next word $$\hat y$$
• Append $$\hat y$$ to the input history: $$w_1,\cdots,w_t,\hat y$$
• Iterate
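A minimal sketch of the loop; `model(history)` is a hypothetical interface returning next-word probabilities, and the eos index is an assumption:

```python
def greedy_generate(model, history, max_len=50, eos=1):
    """Repeatedly pick the most probable next word and extend the history."""
    while len(history) < max_len:
        probs = model(history)                                 # distribution over vocab
        y_hat = max(range(len(probs)), key=probs.__getitem__)  # argmax next word
        history.append(y_hat)                                  # append y_hat to history
        if y_hat == eos:
            break
    return history

# Toy demo: a "model" that always prefers word 2 (hypothetical).
print(greedy_generate(lambda h: [0.1, 0.2, 0.7], [0], max_len=5))
```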

## Greedy Generation

Issues:

• What if $$\hat y$$ is wrong? Errors accumulate along the sequence
• Discrepancy between training and test procedures
• Teacher forcing: during training the model conditions on gold histories, never on its own predictions

## Dynamic programming

Viterbi: global optimum

$\max_{w_1,\cdots,w_T} \prod_t P(w_t|w_{t-d},\cdots,w_{t-1})$
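A sketch of Viterbi decoding for the bigram case (d = 1) over a toy vocabulary; the uniform transition matrix is an illustrative stand-in for a trained LM:

```python
import numpy as np

def viterbi(logP, T, start=0):
    """Globally best word sequence of length T under a bigram LM.
    logP[i, j] = log P(w_t = j | w_{t-1} = i)."""
    V = logP.shape[0]
    score = logP[start].copy()            # best log-prob ending in each word
    back = np.zeros((T, V), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + logP      # extend every hypothesis by every word
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    seq = [int(score.argmax())]           # best final word, then follow back-pointers
    for t in range(T - 1, 0, -1):
        seq.append(int(back[t][seq[-1]]))
    return seq[::-1]

logP = np.log(np.full((4, 4), 0.25))      # toy uniform bigram model
print(viterbi(logP, T=5))
```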

## Other best-path algorithms

• Beam search: keep the N best hypotheses at each step
• A*: at each step, also estimate the score remaining until the end

In practice: greedy works well with RNNs!
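A sketch of beam search with the same hypothetical `model(history)` interface as the greedy loop above; hypotheses ending in eos are kept frozen:

```python
import math

def beam_search(model, start, N=5, max_len=50, eos=1):
    """Keep the N best partial hypotheses (log-prob, history) at each step."""
    beams = [(0.0, list(start))]
    for _ in range(max_len):
        expanded = []
        for logp, hist in beams:
            if hist[-1] == eos:           # finished hypothesis: keep as is
                expanded.append((logp, hist))
                continue
            for w, p in enumerate(model(hist)):
                if p > 0:
                    expanded.append((logp + math.log(p), hist + [w]))
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:N]
        if all(h[-1] == eos for _, h in beams):
            break
    return beams[0][1]

# Toy demo: word 1 is eos in this hypothetical model.
print(beam_search(lambda h: [0.1, 0.3, 0.6], [0], N=2, max_len=4))
```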

## Character-LSTM

• Andrej Karpathy https://karpathy.github.io/2015/05/21/rnn-effectiveness/
• Train on Shakespeare (4.4MB): 3 days (GPU), 1 month (CPU). Sample output:
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.