
2018/2019

# Deep Neural Networks

## Planning

• Introduction
• Backpropagation
• Feedforward networks
• Convolutional networks
• Recurrent networks
• Generative models (incl. VAE & GAN)

# Introduction

Refs:

• Stanford CS224n-2017 lecture on “NLP and deep learning”
• http://www.deeplearningbook.org

## What is deep learning

• Deep Learning is a subfield of machine learning

• Machine learning boils down to optimization

## What is deep learning

• Representation learning attempts to automatically learn good features
• Deep learning attempts to learn multiple levels of representation and an output
• From raw inputs, e.g., sounds, characters, pixels…
• DL is trained end-to-end
• DL exploits backpropagation to optimize parameters

## Course path

• We will focus on different kinds of neural networks

• We will not take a historical approach but instead focus on methods which work well now
• For a long history of deep learning (starting in the 1960s), see “Deep Learning in Neural Networks: An Overview” by Jürgen Schmidhuber

G. Hinton, Y. Bengio, Y. LeCun

## Reasons for exploring deep learning

• Manually designed features are often over-specified, incomplete and take a long time to design and validate

• Learned features are easy to adapt, fast to learn

• Deep learning provides a flexible learnable framework for representing world, visual and linguistic information

• Deep learning can learn unsupervised (from raw text) and supervised (with specific labels)

## Reasons for exploring deep learning

• 2010: DL >> ML, why ?

• Large amount of data favors deep learning
• Faster machines and GPUs favor deep learning
• New models, ideas:
• Better learning of intermediate representations
• Effective end-to-end system learning
• Effective methods that exploit contexts and transfer between tasks
• Improved performances, first in speech and vision, then NLP and games

## Deep learning for speech

First breakthrough on large datasets happened in speech recognition: “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition”, Dahl et al. (2010)

| WER | RT03S FSH | Hub5 SWB |
| --- | --- | --- |
| Deep learning | 18.5 | 16.1 |

## Deep learning for computer vision

The breakthrough paper: “ImageNet Classification with Deep Convolutional Neural Networks”, Krizhevsky, Sutskever & Hinton, Univ. Toronto, 2012: 37% error reduction

## Deep reinforcement learning

• Atari games (2016)
• AlphaGo (2016)
• AlphaZero (2017), Dota 2, poker…

## Yann LeCun, Jan 2018

“Long live Differentiable Programming !”

## Symbolic AI (1970-80)

• Program rules: expert systems

## Machine learning AI (1990-2000)

• Program features + optimisation

## Deep learning AI (2010-)

• Program function

## Beware of buzzwords

The terms “DL” and “AI” are overused in the media

## Limits of DL

DL = approximator of parametric function

• Input: image, video, sound, text, sensors…
• Output: score, probability distribution, image, video…

## From DL to AI

(Gary Marcus, Jan 2018)
https://arxiv.org/abs/1801.00631

AI requires much more than a function:

• World model
• Continuous unsupervised training
• Generalisation through extrapolation
• Training from definitions

## The first wave of DL (2012-2015)

Race to deeper and deeper models, driven by speech and vision applications

## But Natural Language Processing is shallow

• Collobert 2011
• Mikolov (W2V) 2013
• Yoon Kim 2014
• Hoa T. Le 2018


## DL & programming (2016-)

DL is fusing with programming: Pytorch (oct. 2016)

## Two visions of DL

#### Dynamic function



x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()
out.backward()



## DL model = program

We do not build models any more, we program parametric functions:

• python
• variables = tensors
• operators = tensor algebra
• parameters optimized automatically

## Automatic derivation


Seppo Linnainmaa (1970)

## What if $$f(x)$$ is not differentiable ?

• REINFORCE (Williams, 1992)
• Reparametrization trick

## Challenge: optimize

• Infinite variety of programmable functions: how to guarantee convergence of training ?

• Key asset of Deep Learning: regularization everywhere:
• loss, dropout, minibatch, SGD, init, floating-point precision…
• (Kukacka, 2017) https://arxiv.org/abs/1710.10686
• Millions of parameters are not an issue
• Most models have more parameters than training samples

## Convexity is not required

(Kawaguchi, 2016)
https://arxiv.org/abs/1605.07110

• Most $$\nabla f=0$$ are saddle points
• Most local optima are close to global

## Optimality of DL ?

(Tishby, 2015)
https://arxiv.org/abs/1503.02406

• 2 phases: overtraining, then compression

## Optimality of DL ?

• Gets close to the theoretical compression limit
• DNN would be nearly optimal according to information theory ?

## Ecosystem of DL programming

• Libraries @ range of abstraction levels:
• relu, softmax, max-pooling…
• MLP, RNN, CNN…
• inception, seq2seq…
• key-value memnet, resnet, transformer…
• Design patterns:
• attention: dynamic focus
• R/W dynamic memory
• hops: step-wise reasoning

## DNN: explicability

Stéphane Mallat
Inaugural lecture at the Collège de France

• Wavelet decomposition is a special case of ConvNets
• Multiscale wavelets may encode robustness to translation / rotation
• Analysis of convnets show they do the same, and more

# Backpropagation

• Backpropagation
• Autodiff

Refs:

• Stanford CS231n-2017 lecture on “Convolutional Neural Networks for Visual Recognition”
• https://people.cs.umass.edu/~domke/courses/sml2011/03optimization.pdf

## Optimization by descent methods

• We want to optimize some function (risk)
• But we don’t have a closed-form solution

$$\rightarrow$$ Descent methods

## Generic descent method

Two problems:

• Find a descent direction $$\Delta x$$

• Find a step size $$t$$

Then, update the current solution:

$$x \leftarrow x + t\Delta x$$

## Setting the step size

• Exact line search: when possible, compute $$\min_t f(x+t\Delta x)$$

• Backtracking line search: try with “large” $$t$$ and decrease $$t$$ until $$f(x)$$ improves enough:
• $$t=1$$
• While $$f(x+t\Delta x) > f(x) + \alpha t \nabla f(x)^T \Delta x$$
• Set $$t \leftarrow \beta t$$
• With $$0<\alpha<\frac 1 2$$ and $$0 < \beta < 1$$
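The backtracking loop above fits in a few lines of Python (a minimal sketch; the function names and the quadratic test function $$f(x)=x^2$$ are illustrative, not from the course):

```python
def backtracking_line_search(f, grad_f, x, dx, alpha=0.3, beta=0.8):
    """Shrink t until the sufficient-decrease (Armijo) condition holds."""
    t = 1.0
    slope = sum(g * d for g, d in zip(grad_f(x), dx))  # grad f(x)^T dx, < 0 for a descent direction
    fx = f(x)
    while f([xi + t * di for xi, di in zip(x, dx)]) > fx + alpha * t * slope:
        t *= beta
    return t

# usage on f(x) = x^2, starting at x = 1 with descent direction -grad f(x) = -2
f = lambda v: v[0] ** 2
grad_f = lambda v: [2 * v[0]]
t = backtracking_line_search(f, grad_f, [1.0], [-2.0])
```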

## Finding the descent direction

Any $$\Delta x$$ that verifies:

$\Delta x^T \nabla f(x) < 0$

is a descent direction.

• For small $$t$$: $f(x+t\Delta x) \simeq f(x) + t\Delta x^T \nabla f(x)$

• How to find the descent direction ?

• Simplest approach: Gradient descent: $\Delta x = -\nabla f(x)$

• For strictly convex functions, we have linear convergence: $f(x^k)-p \leq c\left( f(x^{k-1})-p \right)$ with $$0 < c < 1$$

• At each iteration, the error is multiplied by a number less than 1, so eventually converges to 0

## GD is not optimal

$f(x)=\frac 1 2 (x_1^2 + ax_2^2)$

$\nabla f(x) = [ x_1 , ax_2 ]$

• From the point (1,1), the steepest descent is (-1,-a)

• But we know $$\min f$$ is in (0,0) $$\rightarrow$$ optimal direction (-1,-1)
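The mismatch is easy to check in two lines (with an illustrative value $$a=10$$):

```python
a = 10.0
grad = lambda x: [x[0], a * x[1]]   # gradient of f(x) = (x1^2 + a*x2^2) / 2
d = [-g for g in grad([1.0, 1.0])]  # steepest-descent direction at (1, 1)
# d is (-1, -a): far from the optimal direction (-1, -1) toward the minimum (0, 0)
```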

## GD is not optimal

• Conversely, the optimal direction is found in one step when $$a=1$$:
• Gradient descent is not invariant to linear transformations!

So scaling of inputs is very important !

• If GD is slow, try rescaling your data and parameters
• You can use a fixed step size, but it has to be tuned

## Newton’s method

Multiply the gradient with the inverse Hessian:

$\Delta x = -H(x)^{-1} \nabla f(x)$

$H(x)=\nabla^2 f(x) = \frac {\partial^2 f(x)}{\partial x \partial x^T}$

• Handles relative scaling between dimensions, and so is invariant to linear changes of coordinates

## Newton’s method

• If $$f(x)$$ is quadratic, Newton’s method gives the direction to the global optimum

• But it is not invariant to higher order (e.g. quadratic) changes of coordinates
• It may be very costly ($$O(N^3)$$ with $$N$$ = nb of dimensions) to solve the linear system $$H(x)\Delta x = -\nabla f(x)$$
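On the quadratic $$f(x)=\frac 1 2 (x_1^2 + ax_2^2)$$ from the previous slides, the Hessian is diagonal and one Newton step lands exactly on the optimum (a minimal sketch, with an illustrative $$a=10$$):

```python
a = 10.0
x = [1.0, 1.0]
grad = [x[0], a * x[1]]           # ∇f(x)
hess_inv_diag = [1.0, 1.0 / a]    # H = diag(1, a), so H^{-1} = diag(1, 1/a)
step = [-hi * g for hi, g in zip(hess_inv_diag, grad)]  # Δx = -H^{-1} ∇f(x)
x_new = [xi + dxi for xi, dxi in zip(x, step)]
# x_new is (0, 0): the global minimum, reached in a single step
```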

## Quasi-Newton method

Approximate $$f$$:

$f^k(x+v)=f(x)+\nabla f(x)^T v + \frac 1 2 v^T B_k v$

The direction of the descent is:

$\Delta x = -B_k^{-1} \nabla f(x)$

• $$B_k$$ is an approximation of the Hessian that is continuously updated

## BFGS method

Special case of Quasi-Newton method: Broyden, Fletcher, Goldfarb and Shanno.

## BFGS method

$s_k = x_{k+1} - x_k$

$y_k = \nabla f(x_{k+1})-\nabla f(x_k)$

$\rho_k = \frac 1 {y_k^T s_k}$

$F_{k+1} = (I-\rho_k s_k y_k^T) F_k (I-\rho_k y_k s_k^T) + \rho_k s_k s_k^T$

The descent direction $$-F_k \nabla f(x)$$ can be computed in $$O(N^2)$$
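The update of $$F_k$$ can be written directly from the formula. The sketch below (pure Python, illustrative names and data) checks the secant condition $$F_{k+1} y_k = s_k$$, which the update enforces by construction:

```python
def bfgs_update(F, s, y):
    """F_{k+1} = (I - rho s y^T) F_k (I - rho y s^T) + rho s s^T"""
    n = len(s)
    rho = 1.0 / sum(yi * si for yi, si in zip(y, s))
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    A = [[I[i][j] - rho * s[i] * y[j] for j in range(n)] for i in range(n)]  # I - rho s y^T
    B = [[I[i][j] - rho * y[i] * s[j] for j in range(n)] for i in range(n)]  # I - rho y s^T
    mm = lambda X, Y: [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    AFB = mm(mm(A, F), B)
    return [[AFB[i][j] + rho * s[i] * s[j] for j in range(n)] for i in range(n)]

# one update from F_0 = I, with s and y taken from a quadratic with Hessian diag(1, 4)
s = [0.5, -0.2]
y = [0.5, -0.8]                      # y = H s
F1 = bfgs_update([[1.0, 0.0], [0.0, 1.0]], s, y)
F1y = [sum(F1[i][j] * y[j] for j in range(2)) for i in range(2)]  # should equal s
```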

## GD: orthogonal steps

• For quadratic functions, successive search directions are orthogonal

## GD: orthogonal steps

• This leads to longer paths than expected

• Each step eliminates all of the error along one axis

## GD is slow

• In non-convex spaces, GD does not generalize well
• Solution: simulated annealing
• GD requires computing the gradient of the error over the full corpus

## Stochastic gradient descent

Assume the function to minimize has the form: $f(x)=\frac 1 T \sum_t^T f_t(x)$

This is the case for

• maximum likelihood training
• empirical loss minimization

• Gradient descent uses the gradient over the full dataset:

$x \leftarrow x - t\nabla f(x)$

• SGD picks a random sample $$i$$ and uses its gradient alone:

$x \leftarrow x - t\nabla f_i(x)$

In practice:

• Shuffle the dataset at the beginning of each epoch
• Iterate over all samples in the dataset
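In code, the epoch loop looks like this (a minimal 1-D sketch with illustrative names and hyper-parameters, minimizing the empirical loss $$\frac 1 T \sum_t (x-d_t)^2$$, whose minimum is the mean of the data):

```python
import random

def sgd(grad_fi, x, data, lr=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)        # shuffle the dataset at the beginning of each epoch
        for i in order:           # iterate over all samples
            x = x - lr * grad_fi(x, data[i])
    return x

# minimize (1/T) sum_t (x - d_t)^2: grad of one term is 2*(x - d)
data = [1.0, 2.0, 3.0]
x_final = sgd(lambda x, d: 2 * (x - d), 10.0, data)
```

With a fixed learning rate, the iterate ends up oscillating in a small neighborhood of the optimum (here, the mean 2.0), which is why decaying schedules are used for exact convergence.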

## SGD convergence

• There are proofs that, under relatively mild assumptions and with learning rates $$t$$ decreasing at an appropriate rate, SGD converges almost surely to a global minimum when the function is convex, and to a local minimum otherwise.

• SGD + back-propagation is used to train all deep neural nets

## SGD convergence

• Recent research works suggest that SGD is so successful because such an approximation of the gradient strongly regularizes the learning process.

• Some other works even suggest that SGD is the main reason behind the success of DNNs!

## SGD with momentum

Momentum:

$\Delta x \leftarrow \alpha \Delta x - t \nabla f_i(x)$

• There’s a “memory” / “inertia” of the previous update
• This prevents oscillations
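A 1-D sketch of the momentum update on $$f(x)=x^2$$ (the hyper-parameters are illustrative):

```python
def sgd_momentum(grad, x, lr=0.1, alpha=0.9, steps=200):
    dx = 0.0
    for _ in range(steps):
        dx = alpha * dx - lr * grad(x)  # Δx ← αΔx - t ∇f_i(x)
        x = x + dx                      # the "inertia" term smooths successive updates
    return x

x_final = sgd_momentum(lambda x: 2 * x, 1.0)  # converges toward the minimum at 0
```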

## Other SGD extensions

• decay: the learning rate decreases over time

• RMSProp: uses a running average of recent gradients per dimension

• mini-batch SGD: rule of thumb: use 32, 64, 128 or 256

## Introduction to backpropagation

• Objective: given a function (loss) $$f(x)$$, compute its gradient $$\nabla f(x)$$

• Complex $$f(x)$$ can be decomposed using intermediary variables: $f(x,y,z)=(x+y)^2z$ is better computed as:
• $$q=x+y$$
• $$h=q^2$$
• $$f=h\times z$$

## Introduction to backpropagation

• $$q=x+y$$
• $$h=q^2$$
• $$f=h\times z$$

• Principle: back-propagate the final gradient locally, through each intermediary variable, thanks to the chain rule

## Interpretation as a circuit diagram

(or computational graph)

## Two passes

• Forward pass: computes values from inputs to outputs (in green)
• Backward pass: recursively applies chain rule to compute gradients (in red) from the end to the inputs

Every gate in the circuit computes the local gradient of its output with respect to its inputs.

## Backward pass

$\frac {df}{df} = 1$

then $f=qz \rightarrow \frac {df}{dq} = z = -4 ~~;~~ \frac {df}{dz} = q = 3$

then $q=x+y \rightarrow \frac {df}{dx} = \frac {df}{dq} \frac {dq}{dx} + \frac {df}{dz}\frac{dz}{dx} = -4 \times 1 + 0 = -4$
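These numbers can be reproduced directly (assuming, as in the CS231n circuit example, the input values $$x=-2$$, $$y=5$$, $$z=-4$$, so that $$q=3$$):

```python
x, y, z = -2.0, 5.0, -4.0
# forward pass
q = x + y            # q = 3
f = q * z            # f = -12
# backward pass
df_dq = z            # f = q*z  ->  df/dq = z = -4
df_dz = q            # f = q*z  ->  df/dz = q = 3
df_dx = df_dq * 1.0  # q = x+y  ->  dq/dx = 1; chain rule gives df/dx = -4
df_dy = df_dq * 1.0
```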

## Backprop is a local process

• Example: the “+” gate: $$x + y = q$$
• It is equipped with local derivatives:
• To its first input: $$\frac {dq}{dx}=1$$
• To its second input: $$\frac {dq}{dy}=1$$
• Assume it is given a gradient value $$\frac {df}{dq}$$ at its output:
• It passes it to its first input by multiplying it with its local derivative: $$\frac {df}{dq} \times \frac {dq}{dx} = \frac {df}{dx}$$
• It passes it to its second input by multiplying it with its local derivative: $$\frac {df}{dq} \times \frac {dq}{dy} = \frac {df}{dy}$$

## Sigmoid neuron example

$f(w,x)=\frac 1 {1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$

Exercise:

• Draw a computation graph
• Write the forward values with $$w_0=2$$, $$x_0=-1$$, $$w_1=-3$$, $$x_1=-2$$ and $$w_2=-3$$
• Write the backward values

## Sigmoid neuron example

The sigmoid is made up of multiple gates:

$f(x)=\frac 1 x ~~ \rightarrow ~~ \frac {df}{dx} = -1/x^2$

$f_c(x)=c+x ~~ \rightarrow ~~ \frac{df}{dx} = 1$

$f(x)=e^x ~~ \rightarrow ~~ \frac{df}{dx} = e^x$

$f_a(x)=ax ~~ \rightarrow ~~ \frac{df}{dx} = a$

## Sigmoid can also be derived directly

$\sigma(x) = \frac 1 {1+e^{-x}}$

$\frac {d\sigma(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2}$

$= \left( \frac{1+e^{-x}-1}{1+e^{-x}}\right) \left(\frac 1 {1+e^{-x}}\right)$

$= (1-\sigma(x))\sigma(x)$

It’s simple enough to be used as a single “gate” in complex networks.

## Implementation

import math

w=[2,-3,-3]     # assume some random weights and data
x=[-1,-2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function

# backward pass through the neuron (backpropagation)
ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# we're done! we have the gradients on the inputs to the circuit

Staged backpropagation: add intermediary variable ddot to break down computation

## Another example

$f(x,y)=\frac {x+\sigma(y)}{\sigma(x)+(x+y)^2}$

• Differentiating this function analytically would give very complex expressions.

• But we don’t need them: we can evaluate the gradient without them.

## Another example

import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator   #(1)
num = x + sigy # numerator                               #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y                                              #(4)
xpysqr = xpy**2                                          #(5)
den = sigx + xpysqr # denominator                        #(6)
invden = 1.0 / den                                       #(7)
f = num * invden # done!                                 #(8)

## Backward implementation

# backprop f = num * invden
dnum = invden # gradient on numerator                             #(8)
dinvden = num                                                     #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden                                #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                                #(6)
dxpysqr = (1) * dden                                              #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                        #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                   #(4)
dy = (1) * dxpy                                                   #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below  #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                  #(2)
dsigy = (1) * dnum                                                #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                 #(1)

• The add gate distributes evenly the gradient to all of its inputs
• The max gate routes the gradient to its highest input

• The mult gate switches its inputs and multiplies each by the gradient

• Consequence for linear classifiers $$w^Tx$$: if you multiply input by 1000, the gradient on the weights will be 1000 times larger; preprocessing hence matters a lot !

• see http://cs231n.stanford.edu/vecDerivs.pdf
• see https://arxiv.org/abs/1502.05767

## Autodiff and backpropagation

Refs:

• Yann LeCun. (1988) A Theoretical Framework for Back-Propagation.
• https://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule
• http://conal.net/papers/beautiful-differentiation/

## Autodiff

Autodiff = set of techniques to numerically evaluate the derivative of a function specified by a program

Many algorithms are just autodiff: forward-backward, inside-outside…

## Another derivation of back-propagation

$f(x) = \exp(\exp(x)+\exp(x)^2) + \sin(\exp(x)+\exp(x)^2)$

may be computed as follows:

def f(x):
    a = exp(x)
    b = a ** 2
    c = a + b
    d = exp(c)
    e = sin(c)
    return d + e
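A backward sweep over the same intermediate variables computes $$f'(x)$$ without ever forming the symbolic derivative (a sketch; the finite-difference check at the end is only for illustration):

```python
import math

def f_and_grad(x):
    # forward pass: same intermediates as f(x) above
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    f = d + e
    # backward pass, in reverse order of the assignments
    dd = 1.0                         # f = d + e
    de = 1.0
    dc = dd * d + de * math.cos(c)   # d = exp(c), e = sin(c)
    da = dc                          # c = a + b
    db = dc
    da += db * 2 * a                 # b = a**2 (note the accumulation into da)
    dx = da * a                      # a = exp(x)
    return f, dx

fx, gx = f_and_grad(0.3)
```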

## Using Lagrange Multipliers

$\arg\max_x f$

with constraints:

$a=\exp(x)$

$b=a^2$

$c=a+b$

$d=\exp(c)$

$e=\sin(c)$

$f=d+e$

## Using Lagrange Multipliers

Generalisation:

The constraints are composed of 3 blocks:

• $$d$$ input variables $z_i = x_i$
• intermediary variables $z_i = f_i(z_{\alpha(i)})$
• output variable $z_n = f_n(z_{\alpha(n)})$

$$\alpha(i)$$ is a subset of the first $$n-1$$ variables $$z_i$$.

## Using Lagrange Multipliers

Assumptions:

• $$\alpha$$ is acyclic: no $$z_i$$ transitively depends on itself
• Single assignment: each $$z_i$$ appears once in the lhs of constraints

Lagrangian:

$L(x,z,\lambda) = z_n - \sum_{i=d+1}^n \lambda_i \cdot (z_i - f_i (z_{\alpha(i)}))$

## Linear system

• Intermediary variables: $\nabla_{\lambda_i} L = z_i - f_i(z_{\alpha(i)}) = 0$

• Multipliers (except last one): $$\nabla_{z_j} L=0$$ gives $\lambda_j = \sum_{i \in \beta(j)} \lambda_i \frac{\partial f_i(z_{\alpha(i)})}{\partial z_j}$

• Output multiplier: $$\nabla_{z_n} L=0$$ gives $\lambda_n=1$

## Linear system

• Input multipliers: $$\nabla_x f(x) = \lambda_{1:d}$$

• Input variables need to be set by gradient ascent

• We solve this linear system by back-substitution when we use the backpropagation method.

## Linear system

• Want cyclic graph of dependencies ?

• Can be done with general linear system solvers ($$O(n^3)$$) instead of back-substitution (=back-propagation in $$O(n)$$).

• Back-prop operates on programs (=with intermediary variables) and not only functions, so it is more than the chain rule.

# Feedforward neural network

## Multi Layer Perceptron

Why an activation function?

• To “normalize” the output within [-1, 1] and help make a decision
• To add non-linearity (and so, richer modeling capacity)
• To be able to stack neurons

## Activation functions

• tanh: nearly linear around 0
• relu: fast, better for vanishing gradient

## tanh

$\tanh(x) = \frac {e^x - e^{-x}} {e^x + e^{-x}}$

derivative: $\frac {\partial \tanh(x)}{\partial x} = 1-\tanh^2(x)$
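The identity is easy to verify numerically (a sketch, comparing against a central finite difference at an arbitrary point):

```python
import math

x = 0.7
analytic = 1 - math.tanh(x) ** 2                        # the closed-form derivative
h = 1e-6
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)  # central difference
```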

## MLP

• aka “feed-forward” network
• “Dense” layer = 1 matrix $$W$$

## MLP

• Hidden layers are
• larger than inputs if we want to capture complex relationships or a large capacity
• smaller than inputs if we want to compress representations (auto-encoder)
• The MLP does not assume any particular structure on the data

## Output layer

To do regression, we need:

• As many output neurons as output dimensions
• A linear output activation (or sigmoid…)

To do binary classification, we need:

• 1 output neuron with tanh (or sigmoid)
• decision threshold = 0

For N-class classification: ??

## Softmax

• Use 1 neuron per possible output label

• Normalize the $$N$$ scores into a probability distribution: $s(x)_i = \frac {e^{x_i}}{\sum_k^N e^{x_k}}$

• decision: pick the maximum probability

• sometimes include parameters = weight matrix $$\rightarrow$$ multinomial logistic regression or maximum entropy classifier
• See http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf
• also used in attention models
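A direct implementation of the formula (the max-subtraction is a standard numerical-stability trick, not shown on the slide):

```python
import math

def softmax(scores):
    m = max(scores)                           # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sums to 1; the decision is the index of the maximum probability
```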

## Cross-entropy

Entropy of a probability distribution = average number of bits required to encode a sample drawn from it: $H(p)=-\sum_i p_i \log p_i$

Cross-entropy between 2 distributions = average number of bits required to encode a sample drawn from $$p$$ using a code optimized for $$q$$: $H(p,q)=-\sum_i p_i \log q_i$

Kullback-Leibler divergence = cross-entropy - entropy: $KL(p||q)= \sum_i p_i \log \frac {p_i}{q_i}$
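The three quantities and the identity $$KL(p||q) = H(p,q) - H(p)$$ in code (the two distributions are illustrative):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
```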

## Loss Regularization

• Naive DNN overfits $$\rightarrow$$ we want to reduce variance

• This can be done by adding a term to the loss that limits the space where parameters can live:
• L2 regularization is the most common: $loss = E[H(p,q)] + \lambda \sum_i \theta_i^2$
• L2 can be interpreted as adding a Gaussian prior

## Other regularizations

• L1 $loss = E[H(p,q)] + \lambda \sum_i |\theta_i|$

## Other regularizations

• Dropout
• Randomly remove $$\lambda$$% of the input and hidden neurons

• DropConnect

## Other regularizations

• Artificial data expansion: Replicate dataset by adding $$\lambda$$% noise

• SGD: Tune mini-batch size

## Putting it all together

Each layer of a feedforward classifier is composed of:

• Linear parameters: weight matrix + bias vector

• An activation function (relu)

• An optional dropout “layer”

The output layer is composed of:

• Linear parameters: weight matrix + bias vector

• A softmax activation

The training loss = cross-entropy + L2

## Computation graph for MLP

• 1-hidden layer MLP: 2 parameter matrices
• Regularized neg-loglike loss:

$J = J_{MLE} + \lambda \left( \sum_{i,j} (W_{i,j})^2 + \sum_{i,j}(W'_{i,j})^2 \right)$

## Weights initialization

Why is initialization important?

• If the weights start too small, the signal shrinks as it passes through each layer until it’s too tiny to be useful
• If the weights start too large, the signal grows as it passes through each layer until it’s too large to be useful

So we want the variance of the inputs and outputs to be similar !

## Xavier initialization

For each neuron with $$n_{in}$$ inputs:

• Initialize its weights $$W$$ from a zero-mean Gaussian with

$Var(W)=\frac 1 {n_{in}}$

Refs: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization

## Glorot initialization

For each neuron with $$n_{in}$$ inputs, going into $$n_{out}$$ neurons:

• Initialize its weights $$W$$ from a zero-mean Gaussian with

$Var(W)=\frac 2 {n_{in} + n_{out}}$

http://pytorch.org/docs/master/nn.html#torch.nn.init.xavier_normal
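A sketch of the initializer in pure Python (the layer sizes and the empirical variance check are illustrative):

```python
import math
import random

def glorot_normal(n_in, n_out, seed=0):
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (n_in + n_out))   # Var(W) = 2 / (n_in + n_out)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

W = glorot_normal(200, 200)
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
var = sum((w - mean) ** 2 for w in flat) / len(flat)
# var should be close to 2 / (200 + 200) = 0.005
```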

## Batch normalization

• You must normalize your inputs
• Transforms the activations of the previous layer at each batch so that their mean is 0 and their standard deviation is 1
• May have parameters to further scale + shift
• http://pytorch.org/docs/master/nn.html?highlight=batchnorm1d#torch.nn.BatchNorm1d
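The per-batch transform for a single activation, as a sketch (the learnable scale $$\gamma$$ and shift $$\beta$$ are the optional parameters mentioned above; $$\epsilon$$ avoids division by zero):

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    m = sum(batch) / len(batch)                          # batch mean
    var = sum((x - m) ** 2 for x in batch) / len(batch)  # batch variance
    return [gamma * (x - m) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([2.0, 4.0, 6.0, 8.0])
# out has mean ~0 and standard deviation ~1
```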

## A lot of tricks

MLP, simple ?

• Find a good topology, residual connections, parameter sharing…
• Find a good SGD variant
• Tune learning rate, nb of neurons…
• Tune regularization strength, dropout…
• Tune batch normalization
• Tune nb of epochs, batch size…

Andrew Ng: “If you tune only one thing, tune the learning rate !”

## GPU

SGD on complex models requires a lot of tensor computations:

• A tensor = generalization of a matrix to higher dimensions
• Can be parallelized onto GPU (or TPU)
• NVIDIA adds 16-bit tensor cores in its “Volta” series
• yet another regularization: 64 -> 32 -> 16 bits

## Deep Learning toolkits

• Don’t reprogram from scratch: reuse toolkits

• All of them implement:
• Autograd of most tensor operators
• Wrappers to abstract model’s parts (dense, conv, lstm..)
• Optimized SGD variants
• Standard activations, normalizations, regularizations…
• Efficient initializations
• CPU + CUDA versions of all tensor operators !!
• This makes life wayyy easier for deep learners…

## 2 types of toolkits

### Static toolkits

• Tensorflow, Keras, theano, tflearn…

• You define the graph (with add_node(…))
• You compile the graph
• Then you pass in data, fit, iterate

### Dynamic toolkits

• Pytorch, chainer, dynet…