Machine and Deep learning

Christophe Cerisara


Deep Neural Networks


Limits of auto-encoders

  • Cannot really generate:
    • Requires \(z\) to come from a known \(x\)
  • Enhancement: learn to generate from unseen \(z\)

Variational Auto-Encoder

  • \(z\) is now a probability distribution \[z \sim N(\mu,\sigma)\]
  • The encoder generates 2 vectors from \(x\): \(\mu\) and \(\sigma\)
  • A new sample \(z\) is drawn from \(N(\mu,\sigma)\)
  • \(z\) is passed to the decoder to generate \(\hat x\)
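These four steps can be sketched in a few lines (a minimal numpy sketch: `encoder` and `decoder` are hypothetical stand-ins for the real networks, and the encoder outputs \(\log\sigma^2\) for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    """Stand-in for the encoder network: maps x to (mu, log sigma^2)."""
    mu = 0.1 * x                     # hypothetical linear encoder
    log_var = np.zeros_like(x)       # i.e. sigma = 1
    return mu, log_var

def decoder(z):
    """Stand-in for the decoder network: maps a latent z back to x-space."""
    return 10.0 * z

x = np.array([1.0, -2.0, 0.5])
mu, log_var = encoder(x)             # the encoder produces 2 vectors from x
sigma = np.exp(0.5 * log_var)
z = rng.normal(mu, sigma)            # draw a new sample z ~ N(mu, sigma)
x_hat = decoder(z)                   # reconstruction
```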

Variational Auto-Encoder

  • Let the encoder output \(q_{\theta}(z|x)\) (Gaussian)
  • Because \(z\) is a random variable, so is the output of the decoder: \(p_{\phi}(x|z)\)

Variational Auto-Encoder

  • The loss for datapoint \(x_i\) is: \[l_i(\theta,\phi)=-E_{z\sim q_{\theta}(z|x_i)}[\log p_{\phi}(x_i|z)] + KL(q_{\theta}(z|x_i)||p(z))\]
  • Reconstruction cross-entropy (in expectation over \(z\)) + KL regularization
  • We assume \(p(z) = N(0,1)\)
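With a diagonal-Gaussian \(q_{\theta}(z|x)\) and \(p(z)=N(0,1)\), the KL regularizer has a closed form; a small numpy sketch (the function name is illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions.
    The encoder outputs log sigma^2 (log_var) for numerical stability."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# The regularizer vanishes exactly when q(z|x) equals the prior N(0, 1)
kl0 = kl_to_standard_normal(np.zeros(2), np.zeros(2))
kl1 = kl_to_standard_normal(np.ones(2), np.zeros(2))
```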

Graphical model perspective

  • The VAE can be seen as a BN: \(p(x,z)=p(x|z)p(z)\)
  • Generative story:
    • For each datapoint \(i\):
    • Draw latent variable \(z_i \sim p(z)\)
    • Draw datapoint \(x_i \sim p(x|z)\)
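The generative story is ancestral sampling; a numpy sketch with a hypothetical linear decoder standing in for \(p(x|z)\):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_datapoint(dim=2):
    z_i = rng.standard_normal(dim)        # draw z_i ~ p(z) = N(0, I)
    mean = 3.0 * z_i + 1.0                # hypothetical decoder defining p(x|z)
    x_i = rng.normal(mean, 0.5)           # draw x_i ~ p(x|z_i)
    return x_i

# for each datapoint i, first draw z_i, then draw x_i
X = np.stack([sample_datapoint() for _ in range(5)])
```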

Graphical model perspective

  • Inference: we want the posterior: \[p(z|x)=\frac{p(x|z)p(z)}{p(x)}\]
  • \(p(x)=\int p(x|z)p(z)\,dz\) requires exponential time to compute
  • \(p(z|x)\) can be approx. with MCMC or variational inference

Graphical model perspective

  • Let’s approx. \(p(z|x)\) with family \(q_{\lambda}(z|x)\)
  • \(\lambda=\) variational parameter
  • How good is the approximation? \[KL(q_{\lambda}(z|x)\,||\,p(z|x)) = E_q[\log q_{\lambda}(z|x)] - E_q[\log p(x,z)] + \log p(x)\]

Graphical model perspective

  • Optimal approximate posterior: \[q^*_{\lambda}(z|x)=\arg\min_{\lambda}KL(q_{\lambda}(z|x)||p(z|x))\]
  • Intractable because of \(p(x)\)
  • What if we remove it?

Graphical model perspective

\[ELBO(\lambda)=E_q[\log p(x,z)] - E_q[\log q_{\lambda}(z|x)]\] Rearranged: \[\log p(x) = ELBO(\lambda) + KL(q_{\lambda}(z|x)||p(z|x))\]

Graphical model perspective

  • We know \(KL\geq 0\)
  • So minimizing KL is equivalent to maximizing ELBO
  • ELBO = Evidence lower bound
  • We can just maximize ELBO

Maximizing ELBO

  • The \(z_i\) are independent across datapoints, so the ELBO decomposes into a sum over datapoints: we can use SGD to optimize \(\lambda\)

\[ELBO_i(\lambda)=E_{q_{\lambda}(z|x_i)}[\log p(x_i|z)] - KL(q_{\lambda}(z|x_i)||p(z))\]
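A Monte-Carlo estimate of \(ELBO_i\) can be sketched as follows (numpy sketch assuming a diagonal-Gaussian \(q\), a unit-variance Gaussian likelihood and \(p(z)=N(0,1)\); all names are illustrative, not the course's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x_i, mu, log_var, decode, n_samples=1000):
    """Monte-Carlo estimate of ELBO_i: E_q[log p(x_i|z)] - KL(q(z|x_i) || p(z)).
    log p(x_i|z) is a unit-variance Gaussian log-likelihood, up to a constant."""
    sigma = np.exp(0.5 * log_var)
    # KL(q(z|x_i) || N(0, I)) in closed form for diagonal Gaussians
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    rec = 0.0
    for _ in range(n_samples):
        z = mu + sigma * rng.standard_normal(mu.shape)   # z ~ q(z|x_i)
        rec += -0.5 * np.sum((x_i - decode(z)) ** 2)
    return rec / n_samples - kl

# hypothetical identity decoder, with q matching the prior
val = elbo_estimate(np.zeros(2), np.zeros(2), np.zeros(2), lambda z: z)
```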

Back to neural nets

  • Encoder = inference network:
    • outputs parameters \(\lambda\) from \(x\), to give \(q_{\theta}(z|x,\lambda)\)
  • Decoder = generative network:
    • outputs the parameters of a multinomial from \(z\), to give \(p_{\phi}(x|z)\)

Back to neural nets

  • This gives our initial loss:

\[l_i(\theta,\phi)=-E_{z\sim q_{\theta}(z|x_i)}[\log p_{\phi}(x_i|z)] + KL(q_{\theta}(z|x_i)||p(z))\]

  • Variational inference = optimizing \(\lambda\)
  • Variational EM = optimizing \(\theta\), \(\phi\)

Reparametrization trick

  • \(z\) is sampled: how can we compute \(\frac{\partial l(z)}{\partial \theta}\)?
  • We rewrite \(z\) as: \[z = \mu + \sigma \times \epsilon\]
  • with \(\epsilon \sim N(0,1)\)
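A scalar numpy sketch of the trick, with the derivatives written out by hand to show what becomes differentiable:

```python
import numpy as np

rng = np.random.default_rng(0)

# z = mu + sigma * eps moves the randomness into eps ~ N(0, 1):
# given a fixed eps, z is a deterministic function of (mu, sigma),
# so dz/dmu = 1 and dz/dsigma = eps, and gradients can flow back
# through the sampling step into the encoder parameters.
mu, sigma = 0.5, 2.0
eps = rng.standard_normal()
z = mu + sigma * eps

dz_dmu = 1.0        # partial derivative of z w.r.t. mu
dz_dsigma = eps     # partial derivative of z w.r.t. sigma
```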

Word-level generation: Softmax issue

  • With a large Y domain, the softmax becomes too costly to compute \[\sigma(z_i)=\frac{e^{z_i}}{\sum_k e^{z_k}}\]
  • Class-based hierarchical softmax (Goodman, 2001) \[p(w|h) = p(C_w|h) \times p(w|C_w,h)\]
  • Adaptive softmax (Grave, 2017), One-vs-Each (Titsias, 2016), SVD-softmax (Shim, 2017)…
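A toy numpy sketch of the two-level class-based factorization (the class assignment and logits are made up for illustration): instead of one softmax over the full vocabulary, compute a softmax over classes and a softmax over the words of the chosen class.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# 6 words partitioned into 2 classes of 3 (hypothetical partition)
classes = {0: [0, 1, 2], 1: [3, 4, 5]}
word2class = {w: c for c, ws in classes.items() for w in ws}

def p_word(w, class_logits, word_logits):
    """p(w|h) = p(C_w|h) * p(w|C_w,h): two small softmaxes instead of one big one."""
    c = word2class[w]
    p_c = softmax(class_logits)[c]
    in_class = classes[c]
    p_w = softmax(word_logits[in_class])[in_class.index(w)]
    return p_c * p_w

class_logits = np.array([0.2, -0.1])
word_logits = np.array([1.0, 0.0, -1.0, 0.5, 0.5, 0.0])
# the factorized probabilities still sum to 1 over the full vocabulary
total = sum(p_word(w, class_logits, word_logits) for w in range(6))
```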

Noise contrastive estimation

Replace the multinomial classifier with binary classifiers that discriminate true data from noise samples

Negative sampling

  • Principle: only update a few parameters per training sample
  • Skip-gram: forward pass with the central word
  • Backprop from the positive word w/ target 1
  • Backprop from 2-5 sampled negative words w/ target -1
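A minimal numpy sketch of one skip-gram negative-sampling update, using the common {1, 0} logistic targets (equivalent to the slide's \(\pm 1\) convention via \(\sigma(-x)\)); all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

dim, vocab, lr = 8, 100, 0.1
W_in = 0.01 * rng.standard_normal((vocab, dim))    # center-word vectors
W_out = 0.01 * rng.standard_normal((vocab, dim))   # context-word vectors

def sgns_step(center, positive, negatives):
    """One skip-gram negative-sampling update: only the rows of the center
    word, the positive word and the sampled negatives are modified."""
    h = W_in[center]
    grad_h = np.zeros(dim)
    for w, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(W_out[w] @ h) - label     # logistic-loss gradient
        grad_h += g * W_out[w]
        W_out[w] -= lr * g * h
    W_in[center] -= lr * grad_h

before = W_out.copy()
sgns_step(center=3, positive=7, negatives=[42, 57, 91])
```

Only 5 of the 100 output rows are touched, which is the whole point: the update cost no longer scales with the vocabulary size.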

Sequence generation

Greedy Generation

With Neural Language Model:

  • Generate the next word \(\hat y\)
  • Append \(\hat y\) to the input history: \(w_1,\cdots,w_t,\hat y\)
  • Iterate
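The greedy loop can be sketched with a toy bigram table standing in for the neural language model (the vocabulary and probabilities are made up for illustration):

```python
# hypothetical next-word distributions p(y | last word of the history)
next_word = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a":   {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def greedy_generate(max_len=10):
    history = ["<s>"]
    while len(history) < max_len:
        probs = next_word[history[-1]]
        y_hat = max(probs, key=probs.get)  # generate the most probable next word
        history.append(y_hat)              # append y_hat to the history, iterate
        if y_hat == "</s>":
            break
    return history

sentence = greedy_generate()
```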

Greedy Generation


  • What if \(\hat y\) is wrong?
  • Discrepancy between training and test procedure
    • Teacher forcing

Dynamic programming

Viterbi: global optimum

\[\max_{w_1,\cdots,w_T} \prod_t P(w_t|w_{t-d},\cdots,w_{t-1})\]

Other best-path algorithms

  • Beam search: keep the N best hypotheses at each step
  • A*: at each step, add an estimate of the remaining score to the end
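A minimal beam-search sketch over a toy bigram model (keep the N best partial hypotheses by log-probability; the table is made up for illustration):

```python
import math

# hypothetical next-word distributions; "end" terminates a hypothesis
next_word = {
    "<s>": {"the": 0.4, "a": 0.6},
    "the": {"cat": 0.9, "end": 0.1},
    "a":   {"cat": 0.3, "end": 0.7},
    "cat": {"end": 1.0},
    "end": {},
}

def beam_search(beam_size=2, max_len=5):
    beams = [(["<s>"], 0.0)]                # (hypothesis, log-probability)
    for _ in range(max_len):
        candidates = []
        for hyp, score in beams:
            succ = next_word[hyp[-1]]
            if not succ:                    # finished hypothesis: keep as-is
                candidates.append((hyp, score))
                continue
            for w, p in succ.items():       # expand with every next word
                candidates.append((hyp + [w], score + math.log(p)))
        # keep only the N best hypotheses at each step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

best = beam_search()
```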

In practice: greedy decoding works well with RNNs!


  • Andrej Karpathy https://karpathy.github.io/2015/05/21/rnn-effectiveness/
  • Train on Shakespeare (4.4 MB): 3 days on GPU, 1 month on CPU
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.