Christophe Cerisara

2018/2019

- Introduction
- Backpropagation
- Feedforward networks
- Convolutional networks
- Recurrent networks
- Generative models (incl. VAE & GAN)

Refs:

- Stanford CS224n-2017 lecture on “NLP and deep learning”
- http://www.deeplearningbook.org

*Deep Learning* is a subfield of *machine learning*. Machine learning boils down to optimization.

*Representation learning* attempts to automatically learn good features. *Deep learning* attempts to learn multiple levels of representation and an output

- From *raw* inputs, e.g., sounds, characters, pixels…
- DL is trained *end-to-end*
- DL exploits backpropagation to optimize parameters

We will focus on different kinds of *neural networks*

- We will not take a historical approach but instead focus on methods which work well now
For a long history of deep learning (starting in the 1960s), see “Deep Learning in Neural Networks: An Overview” by Jürgen Schmidhuber

G. Hinton, Y. Bengio, Y. LeCun

Manually designed features are often over-specified, incomplete and take a long time to design and validate

Learned features are easy to adapt, fast to learn

Deep learning provides a flexible learnable framework for representing world, visual and linguistic information

Deep learning can learn unsupervised (from raw text) and supervised (with specific labels)

2010: DL >> ML, why ?

- Large amount of data favors deep learning
- Faster machines and GPUs favor deep learning
- New models, ideas:
- Better learning of intermediate representations
- Effective end-to-end system learning
- Effective methods that exploit contexts and transfer between tasks

Improved performance, first in speech and vision, then in NLP and games

First breakthrough on large datasets happened in speech recognition: “Context-dependent pre-trained deep neural networks for large vocabulary speech recognition”, Dahl et al. (2010)

| WER | RT03S FSH | Hub5 SWB |
|---|---|---|
| Traditional features | 27.4 | 23.6 |
| Deep learning | 18.5 | 16.1 |

*The* breakthrough paper: “ImageNet Classification with Deep Convolutional Neural Networks”; Krizhevsky, Sutskever & Hinton, Univ. Toronto, 2012. 37% error reduction.

- Atari games (2016)
- Alpha go (2016)
- Alpha zero (2017), Dota2, poker…

“Deep Learning is dead;”

“Long live Differentiable Programming !”

- Program rules: expert systems

- Program features + optimisation

- Program function

DL & AI terms overused in media

DL = approximator of parametric function

- Input: image, video, sound, text, sensors…
- Output: score, proba distribution, image, video…

(Gary Marcus, Jan 2018)

https://arxiv.org/abs/1801.00631

AI requires much more than a function:

- World model
- Continuous unsupervised training
- Generalisation through extrapolation
- Training from definitions
- …

Race to deeper and deeper models driven by speech and vision applications

- Collobert 2011
- Mikolov (W2V) 2013
- Yoon Kim 2014
- Hoa T. Le 2018

“Deep Learning is dead;”

“Long live Differentiable Programming !”

DL is fusing with programming: Pytorch (oct. 2016)

```
x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()
out.backward()
print(x.grad)  # gradient of out w.r.t. x
```
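As a sanity check of the printed gradient: `out` is the mean of \(3(x_{ij}+2)^2\) over the four entries of a 2x2 tensor of ones, so each partial derivative should be \(\frac 1 4 \cdot 6 (x_{ij}+2) = 4.5\) at \(x_{ij}=1\). A pure-Python finite-difference check (our addition, not part of the slides):

```python
# out(x) = mean of 3*(x_ij + 2)^2, so d out / d x_ij = 1.5*(x_ij + 2) = 4.5 at 1
def out(xs):
    return sum(3 * (v + 2) ** 2 for v in xs) / len(xs)

eps = 1e-6
xs = [1.0] * 4
bumped = [xs[0] + eps] + xs[1:]
grad0 = (out(bumped) - out(xs)) / eps   # forward finite difference on x_00
print(round(grad0, 3))  # -> 4.5, matching x.grad
```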

We do not build models any more, we program parametric functions:

- python
- variables = tensors
- operators = tensor algebra
- parameters optimized automatically


Seppo Linnainmaa (1970)

- sub-gradients
- REINFORCE (Williams, 1992)
- Reparametrization trick
- …

Infinite variety of programmable functions: how to guarantee convergence of training ?

- Key asset of Deep Learning: regularization everywhere:
  - loss, dropout, minibatch, SGD, init, floating-point precision…

- (Kukacka, 2017) https://arxiv.org/abs/1710.10686
- Millions of parameters are not an issue
- Most models have more parameters than training samples

(Kawaguchi, 2016)

https://arxiv.org/abs/1605.07110

- Most \(\nabla f=0\) are saddle points
- Most local optima are close to global

(Tishby, 2015)

https://arxiv.org/abs/1503.02406

- 2 phases: overtraining, then compression

- Gets close to the theoretical compression limit
- DNN would be nearly optimal according to information theory ?

- Libraries @ range of abstraction levels:
- relu, softmax, max-pooling…
- MLP, RNN, CNN…
- inception, seq2seq…
- key-value memnet, resnet, transformer…

- Design patterns:
- attention: dynamic focus
- R/W dynamic memory
- hops: step-wise reasoning

Stéphane Mallat

Leçon inaugurale au Collège de France

- Wavelet decomposition is a special case of ConvNets
- Multiscale wavelets may encode robustness to translation / rotation
- Analysis of convnets show they do the same, and more

- Gradient descent
- Backpropagation
- Autodiff

Refs:

- Stanford CS231n-2017 lecture on “Convolutional Neural Networks for Visual Recognition”
- https://people.cs.umass.edu/~domke/courses/sml2011/03optimization.pdf

- We want to optimize some function (risk)
- But we don’t have a closed-form solution

\(\rightarrow\) Descent methods

Two problems:

Find a descent direction \(\Delta x\)

Find a step size \(t\)

Then, update the current solution:

\(x \leftarrow x + t\Delta x\)

- Exact *line search*: when possible, compute \(\min_t f(x+t\Delta x)\)
- Backtracking line search: try with “large” \(t\) and decrease \(t\) until \(f(x)\) improves enough:
  - \(t=1\)
  - While \(f(x+t\Delta x) > f(x) + \alpha t \nabla f(x)^T \Delta x\)
    - Set \(t \leftarrow \beta t\)
  - With \(0<\alpha<\frac 1 2\) and \(0 < \beta < 1\)
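The backtracking loop is short enough to sketch directly (1-D case; the function names and the \(\alpha\), \(\beta\) values are our arbitrary valid choices):

```python
def backtracking(f, grad, x, dx, alpha=0.3, beta=0.8):
    """Shrink t until f decreases 'enough' (the sufficient-decrease
    condition above); 1-D sketch."""
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * grad(x) * dx:
        t *= beta
    return t

f = lambda x: x * x          # toy objective
grad = lambda x: 2 * x
dx = -grad(2.0)              # descent direction at x = 2
t = backtracking(f, grad, 2.0, dx)
print(t, f(2.0 + t * dx) < f(2.0))  # accepted step size; f improved
```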

Any \(\Delta x\) that verifies:

\[\Delta x^T \nabla f(x) < 0\]

is a descent direction.

- For small \(t\): \[f(x+t\Delta x) \simeq f(x) + t\Delta x^T \nabla f(x)\]

How to find the descent direction ?

Simplest approach: Gradient descent: \[\Delta x = -\nabla f(x)\]

For strictly convex functions, we have

*linear convergence*: \[f(x^k)-p \leq c\left( f(x^{k-1})-p \right)\] with \(0 < c < 1\)

At each iteration, the error is multiplied by a number less than 1, so it eventually converges to 0.

\[f(x)=\frac 1 2 (x_1^2 + ax_2^2)\]

\[\nabla f(x) = [ x_1 , ax_2 ]\]

From the point (1,1), the steepest descent is (-1,-a)

But we know \(\min f\) is in (0,0) \(\rightarrow\) optimal direction (-1,-1)

- Conversely, the optimal direction is found in one step if \(a=1\):
- Gradient descent is not invariant to linear transformations!

So scaling of inputs is very important !

- If GD slow, try rescaling your data and the parameters
- You can use a fixed step size, but have to tune it
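A quick numerical illustration of this sensitivity to scaling on the quadratic above (a toy sketch; the fixed step size 0.1 is an arbitrary choice):

```python
def gd_steps(a, t=0.1, iters=100):
    """Fixed-step gradient descent on f(x) = 0.5*(x1^2 + a*x2^2),
    starting from (1, 1); returns the final distance to the optimum (0, 0)."""
    x1, x2 = 1.0, 1.0
    for _ in range(iters):
        x1 -= t * x1          # d f / d x1 = x1
        x2 -= t * a * x2      # d f / d x2 = a * x2
    return (x1 ** 2 + x2 ** 2) ** 0.5

# Well-conditioned (a = 1) converges quickly; with a = 20 the x2 update
# factor is 1 - t*a = -1, so x2 flips between +1 and -1 and never converges.
print(gd_steps(1.0))    # tiny
print(gd_steps(20.0))   # ~1: no progress along x2
```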

Multiply the gradient with the inverse Hessian:

\[\Delta x = -H(x)^{-1} \nabla f(x)\]

\[H(x)=\nabla^2 f(x) = \frac {\partial^2 f(x)}{\partial x \partial x^T}\]

- Handles relative scaling between dimensions, and so is invariant to linear changes of coordinates

If \(f(x)\) quadratic, gives the global optimum direction

- But it is not invariant to higher order (e.g. quadratic) changes of coordinates
It may be very costly (\(O(N^3)\) with \(N\)=nb of dimensions) to solve the linear system \(H(x)^{-1}\nabla f(x)\)

Approximate \(f\):

\[f^k(x+v)=f(x)+\nabla f(x)^T v + \frac 1 2 v^T B_k v\]

The direction of the descent is:

\[\Delta x = -B_k^{-1} \nabla f(x)\]

\(B_k\) is an approximation of the Hessian that is continuously updated

See also *Krylov methods* to approximate the Hessian

Special case of Quasi-Newton method: Broyden, Fletcher, Goldfarb and Shanno.

\[s_k = x_{k+1} - x_k\]

\[y_k = \nabla f(x^{k+1})-\nabla f(x^k)\]

\[\rho_k = \frac 1 {y_k^T s_k}\]

\[F_{k+1} = (I-\rho_k s_k y_k^T) F_k (I-\rho_k y_k s_k^T) + \rho_k s_k s_k^T\]

Descent direction \(-F_k \nabla f(x)\) can be computed in \(O(N^2)\)

- For quadratic fcts, search directions are successively orthogonal

- This leads to longer paths than expected

- Eliminate all of the error along one axis in one step

As well as conjugate gradient:

- In non-convex space, do not generalize well
- Solution: simulated annealing

- Requires to compute the gradient of the error over the full corpus

Assume the function to minimize has the form: \[f(x)=\frac 1 T \sum_t^T f_t(x)\]

This is the case for

- maximum likelihood training
- empirical loss minimization

Standard Gradient Descent:

\[x \leftarrow x - t\nabla f(x)\]

SGD approximates the gradient with the gradient at a single sample:

\[x \leftarrow x - t\nabla f_i(x)\]

In practice:

- Shuffle the dataset at the beginning of each epoch
- Iterate over all samples in the dataset

There are proofs that, under relatively mild assumptions, and when the learning rates \(t\) decrease with an appropriate rate, then SGD converges almost surely to a global minimum when the function is convex, and otherwise to a local minimum.
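The practical recipe above (shuffle, then step on one sample's gradient at a time) can be sketched on a toy least-squares problem (all names here are ours, not from the slides):

```python
import random

def sgd(samples, t=0.05, epochs=50):
    """Minimal SGD sketch on f(w) = (1/T) sum_t (w*x_t - y_t)^2
    (1-D least squares)."""
    w = 0.0
    for _ in range(epochs):
        random.shuffle(samples)          # shuffle at each epoch
        for x, y in samples:             # iterate over all samples
            grad = 2 * (w * x - y) * x   # gradient of the single-sample loss f_i
            w -= t * grad                # x <- x - t * grad f_i(x)
    return w

random.seed(0)
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]  # noiseless, true w = 3
print(sgd(data))  # close to 3.0
```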

SGD + back-propagation is used to train *all* deep neural nets

Recent research suggests that SGD is so successful because such an approximation of the gradient strongly *regularizes* the learning process. Some other works suggest that SGD is THE main reason behind the success of DNNs!

Momentum:

\[\Delta x \leftarrow \alpha \Delta x - t \nabla f_i(x)\]

- There’s a “memory” / “inertia” of the previous update
- This prevents oscillations
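On the ill-conditioned quadratic \(f(x)=\frac 1 2 (x_1^2 + ax_2^2)\) used earlier, the velocity term indeed damps oscillations (a toy sketch; the step size and \(\alpha\) are our arbitrary choices):

```python
def momentum_gd(a=20.0, t=0.1, alpha=0.5, iters=200):
    """Momentum on f(x) = 0.5*(x1^2 + a*x2^2) from (1, 1).
    With these values, plain gradient descent multiplies x2 by
    (1 - t*a) = -1 each step and oscillates forever; the velocity
    term damps that oscillation."""
    x1, x2 = 1.0, 1.0
    v1 = v2 = 0.0
    for _ in range(iters):
        v1 = alpha * v1 - t * x1        # dx <- alpha*dx - t*grad
        v2 = alpha * v2 - t * (a * x2)
        x1 += v1                        # x <- x + dx
        x2 += v2
    return (x1 ** 2 + x2 ** 2) ** 0.5

print(momentum_gd())  # essentially 0: converged
```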

decay: the learning rate decreases over time

Adagrad: different learning rate per dimension

RMSProp: uses a running average of recent gradients per dimension

Adam: enhancements of RMSProp

mini-batch SGD: rule of thumb: use 32, 64, 128 or 256

Objective: given a function (loss) \(f(x)\), compute its gradient \(\nabla f(x)\)

- Complex \(f(x)\) can be decomposed using intermediary variables: \[f(x,y,z)=(x+y)^2z\] is better computed as:
  - \(q=x+y\)
  - \(h=q^2\)
  - \(f=h\times z\)

Principle: back-propagate the final gradient locally, through each intermediary variable, thanks to the chain rule

(or computational graph)

*Forward pass*: computes values from inputs to outputs (in green). *Backward pass*: recursively applies the chain rule to compute gradients (in red) from the end to the inputs.

Every gate in the circuit computes its local gradient with respect to its output value.

\[\frac {df}{df} = 1\] then \[f=qz \rightarrow \frac {df}{dq} = z = -4\] \[f=qz \rightarrow \frac {df}{dz} = q = 3\] then \[q=x+y \rightarrow \frac {df}{dx} = \frac {df}{dq} \frac {dq}{dx} + \frac {df}{dz}\frac{dz}{dx} = -4 \times 1 + 0\]

- Example: the “+” gate: \(x + y = q\)
- It is equipped with local derivatives:
  - To its first input: \(\frac {dq}{dx}=1\)
  - To its second input: \(\frac {dq}{dy}=1\)
- Assume it is given a gradient value \(\frac {df}{dq}\) at its output:
  - It passes it to its first input by multiplying it with its local derivative: \(\frac {df}{dq} \times \frac {dq}{dx} = \frac {df}{dx}\)
  - It passes it to its second input by multiplying it with its local derivative: \(\frac {df}{dq} \times \frac {dq}{dy} = \frac {df}{dy}\)

\[f(w,x)=\frac 1 {1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}\]

Exercise:

- Draw a computation graph
- Write the forward values with \(w_0=2\), \(x_0=-1\), \(w_1=-3\), \(x_1=-2\) and \(w_2=-3\)
- Write the backward values

The sigmoid is made up of multiple gates:

\[f(x)=\frac 1 x ~~ \rightarrow ~~ \frac {df}{dx} = -1/x^2\] \[f_c(x)=c+x ~~ \rightarrow ~~ \frac{df}{dx} = 1\] \[f(x)=e^x ~~ \rightarrow ~~ \frac{df}{dx} = e^x\] \[f_a(x)=ax ~~ \rightarrow ~~ \frac{df}{dx} = a\]

\[\sigma(x) = \frac 1 {1+e^{-x}}\]

\[\frac {d\sigma(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2}\]

\[ = \left( \frac{1+e^{-x}-1}{1+e^{-x}}\right) \left(\frac 1 {1+e^{-x}}\right)\]

\[ = (1-\sigma(x))\sigma(x)\]

It’s simple enough to be used as a single “gate” in complex networks.

```
import math

w = [2, -3, -3] # assume some random weights and data
x = [-1, -2]
# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function
# backward pass through the neuron (backpropagation)
ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# we're done! we have the gradients on the inputs to the circuit
```

Staged backpropagation: add intermediary variable *ddot* to break down computation

\[f(x,y)=\frac {x+\sigma(y)}{\sigma(x)+(x+y)^2}\]

Deriving this function would give very complex expressions.

But we don’t need it: we can evaluate the gradient without them.

```
import math

x = 3 # example values
y = -4
# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator #(1)
num = x + sigy # numerator #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y #(4)
xpysqr = xpy**2 #(5)
den = sigx + xpysqr # denominator #(6)
invden = 1.0 / den #(7)
f = num * invden # done! #(8)
```

```
# backprop f = num * invden
dnum = invden # gradient on numerator #(8)
dinvden = num #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden #(6)
dxpysqr = (1) * dden #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr #(5)
# backprop xpy = x + y
dx = (1) * dxpy #(4)
dy = (1) * dxpy #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below #(3)
# backprop num = x + sigy
dx += (1) * dnum #(2)
dsigy = (1) * dnum #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy #(1)
```

- The *add* gate distributes the gradient evenly to all of its inputs
- The *max* gate routes the gradient to its highest input
- The *mult* gate switches its inputs and multiplies each by the gradient

Consequence for linear classifiers \(w^Tx\): if you multiply the input by 1000, the gradient on the weights will be 1000 times larger; preprocessing hence matters a lot!

- see http://cs231n.stanford.edu/vecDerivs.pdf
- see https://arxiv.org/abs/1502.05767

Refs:

- Yann LeCun. (1988) A Theoretical Framework for Back-Propagation.
- https://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule
- http://conal.net/papers/beautiful-differentiation/

Autodiff = set of techniques to numerically evaluate the derivative of a function specified by a program

Many algorithms are just autodiff: forward-backward, inside-outside…

\[f(x) = \exp(\exp(x)+\exp(x)^2) + \sin(\exp(x)+\exp(x)^2)\]

may be computed as follows:

```
from math import exp, sin

def f(x):
    a = exp(x)
    b = a ** 2
    c = a + b
    d = exp(c)
    e = sin(c)
    return d + e
```
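The gradient of such a staged program can be obtained with one reverse pass over the assignments, applying one chain-rule step per line, from output back to input (a hand-written sketch, checked against a numerical derivative):

```python
import math

def f(x):
    a = math.exp(x)
    b = a ** 2
    c = a + b
    return math.exp(c) + math.sin(c)

def df(x):
    # forward pass, keeping the intermediates
    a = math.exp(x)
    b = a ** 2
    c = a + b
    # reverse pass: one chain-rule step per assignment
    dd, de = 1.0, 1.0                          # f = d + e
    dc = dd * math.exp(c) + de * math.cos(c)   # d = exp(c), e = sin(c)
    da, db = dc, dc                            # c = a + b
    da += db * 2 * a                           # b = a**2 (note the +=)
    return da * a                              # a = exp(x): da/dx = exp(x) = a

# check against a central finite difference
x0, eps = 0.3, 1e-6
num = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
print(abs(df(x0) - num) / abs(num) < 1e-5)  # True
```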

\[\arg\max_x f\]

with constraints:

\[a=\exp(x)\] \[b=a^2\] \[c=a+b\] \[d=\exp(c)\] \[e=\sin(c)\] \[f=d+e\]

Generalisation:

The constraints are composed of 3 blocks:

- \(d\) input variables \[z_i = x_i\]
- intermediary variables \[z_i = f_i(z_{\alpha(i)})\]
- output variable \[z_n = f_n(z_{\alpha(n)})\]

\(\alpha(i)\) is a subset of the first \(n-1\) variables \(z_i\).

Assumptions:

- \(\alpha\) is acyclic: no \(z_i\) transitively depends on itself
- Single assignment: each \(z_i\) appears once in the lhs of constraints

Lagrangian:

\[L(x,z,\lambda) = z_n - \sum_{i=d+1}^n \lambda_i \cdot (z_i - f_i (z_{\alpha(i)}))\]

Intermediary variables: \[\nabla_{\lambda_i} L = z_i - f_i(z_{\alpha(i)}) = 0\]

Multipliers (except last one): \(\nabla_{z_j} L=0\) gives \[\lambda_j = \sum_{i \in \beta(j)} \lambda_i \frac{\partial f_i(z_{\alpha(i)})}{\partial z_j}\]

Output multiplier: \(\nabla_{z_n} L=0\) gives \[\lambda_n=1\]

Input multipliers: \(\nabla_x f(x) = \lambda_{1:d}\)

Input variables need to be set by gradient ascent

We solve this linear system by back-substitution when we use the backpropagation method.

Want cyclic graph of dependencies ?

Can be done with general linear system solvers (\(O(n^3)\)) instead of back-substitution (=back-propagation in \(O(n)\)).

Back-prop operates on *programs* (= with intermediary variables) and not only *functions*, so it is more than the chain rule.

Why the activation function ?

- To “normalize” the output within [-1;1] and help make a decision
- To add non-linearity (and so, richer modeling capacity)
- To be able to stack neurons

- tanh: nearly linear around 0
- relu: fast, better for vanishing gradient

\[tanh(x) = \frac {e^x - e^{-x}} {e^x + e^{-x}}\]

derivative: \[\frac {\partial tanh(x)}{\partial x} = 1-tanh^2(x)\]

- aka “feed-forward” network
- “Dense” layer = 1 matrix \(W\)

- Hidden layers are
- larger than inputs if we want to capture complex relationships or a large capacity
- smaller than inputs if we want to compress representations (auto-encoder)

- The MLP does not assume any particular structure on the data

To do regression, we need:

- As many output neurons as output dimensions
- A linear output activation (or sigmoid…)

To do binary classification, we need:

- 1 output neuron with tanh (or sigmoid)
- decision threshold = 0

For N-class classification: ??

Use 1 neuron per possible output label

Normalize the \(N\) scores into a proba distribution: \[s(x)_i = \frac {e^{x_i}}{\sum_k^N e^{x_k}}\]

decision: pick the maximum proba
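A direct implementation of this normalization; subtracting the maximum score before exponentiating is a standard stability trick that leaves the result unchanged (our addition, not from the slides):

```python
import math

def softmax(scores):
    """Normalize N scores into a probability distribution.
    Subtracting max(scores) avoids overflow in exp without
    changing the result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # probabilities summing to 1; decision = argmax
```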

- sometimes include parameters = weight matrix \(\rightarrow\) multinomial logistic regression or maximum entropy classifier
- See http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf

also used in attention models

Entropy of a probability distribution = nb of bits required to encode the distribution \[H(p)=-\sum_i p_i \log p_i\]

Cross-entropy between 2 distributions = nb of bits required to encode one distribution, when we only know another distribution \[H(p,q)=-\sum_i p_i \log q_i\]

Kullback-Leibler divergence = Xent - ent \[KL(p||q)= \sum_i p_i \log \frac {p_i}{q_i}\]
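These three quantities, and the identity KL = cross-entropy − entropy, can be checked directly (natural log here, so units are nats rather than bits):

```python
import math

def H(p):                      # entropy
    return -sum(pi * math.log(pi) for pi in p)

def Xent(p, q):                # cross-entropy H(p, q)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def KL(p, q):                  # Kullback-Leibler divergence
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
# The identity above: KL(p||q) = H(p, q) - H(p)
print(abs(KL(p, q) - (Xent(p, q) - H(p))) < 1e-12)  # True
```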

Naive DNN overfits \(\rightarrow\) we want to reduce variance

- This can be done by adding a term to the loss that limits the space where parameters can live:
- L2 regularization is the most common: \[loss = E[H(p,q)] + \lambda \sum_i \theta_i^2\]
- L2 can be interpreted as adding a Gaussian prior

- L1 \[loss = E[H(p,q)] + \lambda \sum_i |\theta_i|\]
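In code, the L2 term is just the sum of squared parameters added to the data loss (a sketch; `l2_loss` and `lam` are our names, not from the slides):

```python
def l2_loss(base_loss, params, lam=1e-4):
    """Regularized loss = data term + lambda * sum of squared parameters."""
    return base_loss + lam * sum(p * p for p in params)

print(l2_loss(0.5, [1.0, -2.0, 3.0]))  # = 0.5 + 1e-4 * (1 + 4 + 9)
```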

- Dropout
- Randomly remove \(\lambda\)% of the input and hidden neurons

- DropConnect

Artificial data expansion: Replicate dataset by adding \(\lambda\)% noise

SGD: Tune mini-batch size

Each layer of a feedforward classifier is composed of:

Linear parameters: weight matrix + bias vector

An activation function (relu)

An optional dropout “layer”

The output layer is composed of:

Linear parameters: weight matrix + bias vector

A softmax activation

The training loss = cross-entropy + L2

- 1-hidden layer MLP: 2 parameter matrices
- Regularized neg-loglike loss:

\[ J = J_{MLE} + \lambda \left( \sum_{i,j} (W_{i,j})^2 + \sum_{i,j}(W'_{i,j})^2 \right)\]

Why is initialization important?

- If the weights start too small, the signal shrinks as it passes through each layer until it's too tiny to be useful
- If the weights start too large, the signal grows as it passes through each layer until it's too large to be useful

So we want the variance of the inputs and outputs to be similar !

For each neuron with \(n_{in}\) inputs:

- Initialize its weights \(W\) from a zero-mean Gaussian with

\[Var(W)=\frac 1 {n_{in}}\]

Refs: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization

For each neuron with \(n_{in}\) inputs, going into \(n_{out}\) neurons:

- Initialize its weights \(W\) from a zero-mean Gaussian with

\[Var(W)=\frac 2 {n_{in} + n_{out}}\]
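A minimal pure-Python sketch of this initialization, sampling from a zero-mean Gaussian with the variance above (the layer dimensions are arbitrary example values):

```python
import random

def xavier_normal(n_in, n_out):
    """Sample an n_out x n_in weight matrix from a zero-mean Gaussian
    with Var(W) = 2 / (n_in + n_out)."""
    std = (2.0 / (n_in + n_out)) ** 0.5
    return [[random.gauss(0.0, std) for _ in range(n_in)]
            for _ in range(n_out)]

random.seed(0)
W = xavier_normal(256, 128)
flat = [w for row in W for w in row]
var = sum(w * w for w in flat) / len(flat)  # empirical variance (mean is 0)
print(var)  # close to 2/(256+128) ≈ 0.0052
```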

http://pytorch.org/docs/master/nn.html#torch.nn.init.xavier_normal

- You *must* normalize your inputs
- Transforms the activations of the previous layer at each batch, so that mean=0 and standard deviation=1
- May have parameters to further scale + shift
- https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
- http://pytorch.org/docs/master/nn.html?highlight=batchnorm1d#torch.nn.BatchNorm1d
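The transformation above can be sketched in a few lines for a single feature (`gamma` and `beta` stand for the optional scale and shift parameters; the names are ours):

```python
def batchnorm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to mean 0 / std 1,
    then optionally scale and shift."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return [gamma * (x - m) / (var + eps) ** 0.5 + beta for x in xs]

out = batchnorm([1.0, 2.0, 3.0, 4.0])
print(sum(out))  # ~0: zero mean after normalization
```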

MLP, simple ?

- Find a good topology, residual connections, parameter sharing…
- Find a good SGD variant
- Tune learning rate, nb of neurons…
- Tune regularization strength, dropout…
- Tune batch normalization
- Tune nb of epochs, batch size…

Andrew Ng: “If you tune only one thing, tune the learning rate !”

SGD on complex models requires a lot of tensors computations:

- A tensor = generalization of a matrix to higher dimensions
- Can be parallelized onto GPU (or TPU)
- NVidia adds 16-bit tensor processing units in its “Volta” series
- yet another regularization: 64 -> 32 -> 16 bits

Don’t reprogram from scratch: reuse toolkits

- All of them implement:
- Autograd of most tensor operators
- Wrappers to abstract model’s parts (dense, conv, lstm..)
- Optimized SGD variants
- Standard activations, normalizations, regularizations…
- Efficient initializations
- CPU + CUDA versions of all tensor operators !!

This makes life wayyy easier for deep learners…

Tensorflow, Keras, theano, tflearn…

- You define the graph (with *add_node(…)*)
- You compile the graph
- Then you pass in data, fit, iterate

Pytorch, chainer, dynet…

- You start with a Variable, and apply operations on it
- For every data chunk, a new graph is built