
Machine and Deep learning

Christophe Cerisara

2018/2019

Deep Neural Networks

Outline

  • Introduction
  • Backpropagation
  • Feedforward networks
  • Convolutional networks
  • Recurrent networks
  • Generative models (incl. VAE & GAN)

Introduction

Refs:

  • Stanford CS224n-2017 lecture on “NLP and deep learning”
  • http://www.deeplearningbook.org

What is deep learning

  • Deep Learning is a subfield of machine learning

  • Machine learning boils down to optimization

Machine vs. Deep Learning

What is deep learning

What is deep learning

  • Representation learning attempts to automatically learn good features
  • Deep learning attempts to learn multiple levels of representation and an output
  • From raw inputs, e.g., sounds, characters, pixels…
  • DL is trained end-to-end
  • DL exploits backpropagation to optimize parameters

Course path

  • We will focus on different kinds of neural networks

  • We will not take a historical approach but instead focus on methods which work well now
  • For a long history of deep learning (starting in the 1960s), see “Deep Learning in Neural Networks: An Overview” by Jürgen Schmidhuber

G. Hinton, Y. Bengio, Y. LeCun

Reasons for exploring deep learning

  • Manually designed features are often over-specified, incomplete and take a long time to design and validate

  • Learned features are easy to adapt, fast to learn

  • Deep learning provides a flexible learnable framework for representing world, visual and linguistic information

  • Deep learning can learn unsupervised (from raw text) and supervised (with specific labels)

Reasons for exploring deep learning

  • 2010: DL >> ML, why ?

  • Large amount of data favors deep learning
  • Faster machines and GPUs favor deep learning
  • New models, ideas:
    • Better learning of intermediate representations
    • Effective end-to-end system learning
    • Effective methods that exploit contexts and transfer between tasks
  • Improved performances, first in speech and vision, then NLP and games

Deep learning for speech

The first breakthrough on large datasets happened in speech recognition: “Context-dependent pre-trained deep neural networks for large vocabulary speech recognition”, Dahl et al. (2010)

WER                    RT03S FSH   Hub5 SWB
Traditional features   27.4        23.6
Deep learning          18.5        16.1

Deep learning for computer vision

The breakthrough paper: “ImageNet Classification with Deep Convolutional Neural Networks”; Krizhevsky, Sutskever & Hinton, Univ. Toronto, 2012: 37% error reduction

Deep reinforcement learning

  • Atari games (2016)
  • AlphaGo (2016)
  • AlphaZero (2017), Dota2, poker…

Yann LeCun, Jan. 2018

“Deep Learning is dead;”



“Long live Differentiable Programming !”

Symbolic AI (1970-80)

  • Program rules: expert systems

Machine learning AI (1990-2000)

  • Program features + optimisation

Deep learning AI (2010-)

  • Program function

Beware of buzzwords

The terms “DL” and “AI” are overused in the media

Limits of DL

DL = approximator of parametric function

  • Input: image, video, sound, text, sensors…
  • Output: score, proba distribution, image, video…

From DL to AI

(Gary Marcus, Jan. 2018)
https://arxiv.org/abs/1801.00631

AI requires much more than a function:

  • World model
  • Continuous unsupervised training
  • Generalisation through extrapolation
  • Training from definitions

The first wave of DL (2012-2015)

Race to deeper and deeper models, driven by speech and vision applications

But Natural Language Processing is shallow

  • Collobert 2011
  • Mikolov (W2V) 2013
  • Yoon Kim 2014
  • Hoa T. Le 2018

“Deep Learning is dead;”



“Long live Differentiable Programming !”

DL & programming (2016-)

DL is fusing with programming: PyTorch (Oct. 2016)

Two visions of DL

Static model

Dynamic function



import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()
out.backward()
print(x.grad)   # gradient of out w.r.t. x, computed by autograd


DL model = program

We do not build models any more, we program parametric functions:

  • python
  • variables = tensors
  • operators = tensor algebra
  • parameters optimized automatically

Automatic derivation


Seppo Linnainmaa (1970)

What if \(f(x)\) is not differentiable ?

  • sub-gradients
  • REINFORCE (Williams, 1992)
  • Reparametrization trick
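
For instance, the reparametrization trick rewrites a random sample as a deterministic, differentiable function of the parameters plus independent noise. A minimal PyTorch sketch, assuming a Gaussian latent variable with illustrative names mu and log_var:

import torch

# Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1),
# so that gradients can flow back into mu and log_var.
mu = torch.zeros(5, requires_grad=True)
log_var = torch.zeros(5, requires_grad=True)
eps = torch.randn(5)                        # noise: no gradient needed
z = mu + torch.exp(0.5 * log_var) * eps     # differentiable w.r.t. mu and log_var
z.sum().backward()                          # mu.grad and log_var.grad are now filled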

Challenge: optimize

  • Infinite variety of programmable functions: how to guarantee convergence of training ?

  • Key asset of Deep Learning: regularization everywhere:
    • loss, dropout, minibatch, SGD, init, floating-point precision…
  • (Kukacka, 2017) https://arxiv.org/abs/1710.10686
    • Millions of parameters are not an issue
    • Most models have more parameters than training samples

Convexity is not required

(Kawaguchi, 2016)
https://arxiv.org/abs/1605.07110

  • Most points where \(\nabla f=0\) are saddle points
  • Most local optima are close to global

Optimality of DL ?

(Tishby, 2015)
https://arxiv.org/abs/1503.02406

  • 2 phases: overtraining, then compression

Optimality of DL ?

  • Gets close to the theoretical compression limit
  • DNN would be nearly optimal according to information theory ?

Ecosystem of DL programming

  • Libraries @ range of abstraction levels:
    • relu, softmax, max-pooling…
    • MLP, RNN, CNN…
    • inception, seq2seq…
    • key-value memnet, resnet, transformer…
  • Design patterns:
    • attention: dynamic focus
    • R/W dynamic memory
    • hops: step-wise reasoning

DNN + Bayes

DNN: competition / collab.

DNN: explainability

Stéphane Mallat
Inaugural lecture at the Collège de France

  • Wavelet decomposition is a special case of ConvNets
  • Multiscale wavelets may encode robustness to translation / rotation
  • Analysis of convnets shows they do the same, and more

Convergence

Backpropagation

Gradient descents

  • Gradient descent
  • Backpropagation
  • Autodiff

Refs:

  • Stanford CS231n-2017 lecture on “Convolutional Neural Networks for Visual Recognition”
  • https://people.cs.umass.edu/~domke/courses/sml2011/03optimization.pdf

Optimization by descent methods

  • We want to optimize some function (risk)
  • But we don’t have a closed-form solution

\(\rightarrow\) Descent methods

Generic descent method

Two problems:

  • Find a descent direction \(\Delta x\)

  • Find a step size \(t\)

Then, update the current solution:

\(x \leftarrow x + t\Delta x\)

Setting the step size

  • Exact line search: when possible, compute \(\min_t f(x+t\Delta x)\)

  • Backtracking line search: try with “large” \(t\) and decrease \(t\) until \(f(x)\) improves enough:
    • \(t=1\)
    • While \(f(x+t\Delta x) > f(x) + \alpha t \nabla f(x)^T \Delta x\)
      • Set \(t \leftarrow \beta t\)
    • With \(0<\alpha<\frac 1 2\) and \(0 < \beta < 1\)
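
A minimal sketch of this backtracking rule (the function f, its gradient and the descent direction are assumed given):

import numpy as np

def backtracking(f, grad_f, x, dx, alpha=0.3, beta=0.8):
    # shrink t until the sufficient-decrease condition above holds
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * grad_f(x).dot(dx):
        t = beta * t
    return t

# Example: f(x) = ||x||^2, with the negative gradient as descent direction
f = lambda x: x.dot(x)
grad_f = lambda x: 2 * x
x = np.array([1.0, 2.0])
print(backtracking(f, grad_f, x, -grad_f(x)))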

Finding the descent direction

Any \(\Delta x\) that verifies:

\[\Delta x^T \nabla f(x) < 0\]

is a descent direction.

  • For small \(t\): \[f(x+t\Delta x) \simeq f(x) + t\Delta x^T \nabla f(x)\]

Gradient descent

  • How to find the descent direction ?

  • Simplest approach: Gradient descent: \[\Delta x = -\nabla f(x)\]

  • For strictly convex functions, we have linear convergence: \[f(x^k)-p \leq c\left( f(x^{k-1})-p \right)\] with \(0 < c < 1\)

  • At each iteration, the error is multiplied by a number less than 1, so eventually converges to 0
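
A minimal fixed-step gradient descent loop (illustrative sketch, using the quadratic function of the next slide):

import numpy as np

def gradient_descent(grad_f, x0, t=0.1, n_iter=100):
    # x <- x - t * grad f(x), with a fixed step size t
    x = x0.copy()
    for _ in range(n_iter):
        x = x - t * grad_f(x)
    return x

# f(x) = 1/2 (x1^2 + a x2^2), whose minimum is (0, 0)
a = 10.0
grad_f = lambda x: np.array([x[0], a * x[1]])
print(gradient_descent(grad_f, np.array([1.0, 1.0])))   # close to [0, 0]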

GD is not optimal

\[f(x)=\frac 1 2 (x_1^2 + ax_2^2)\]

\[\nabla f(x) = [ x_1 , ax_2 ]\]

  • From the point (1,1), the steepest descent is (-1,-a)

  • But we know \(\min f\) is in (0,0) \(\rightarrow\) optimal direction (-1,-1)

GD is not optimal

GD is not optimal

  • Conversely, when \(a=1\), the optimal direction is found in one step:
    • Gradient descent is not invariant to linear transformations!

So scaling of inputs is very important !

  • If GD is slow, try rescaling your data and the parameters
  • You can use a fixed step size, but have to tune it

Newton’s method

Multiply the gradient with the inverse Hessian:

\[\Delta x = -H(x)^{-1} \nabla f(x)\]

\[H(x)=\nabla^2 f(x) = \frac {\partial^2 f(x)}{\partial x \partial x^T}\]

  • Handles relative scaling between dimensions, and so is invariant to linear changes of coordinates

Newton’s method

  • If \(f(x)\) quadratic, gives the global optimum direction

  • But it is not invariant to higher order (e.g. quadratic) changes of coordinates
  • It may be very costly (\(O(N^3)\), with \(N\) = nb of dimensions) to solve the linear system and compute \(H(x)^{-1}\nabla f(x)\)
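
In practice one solves the linear system \(H(x)\Delta x = -\nabla f(x)\) rather than inverting \(H\). A small illustrative sketch:

import numpy as np

def newton_step(grad_f, hess_f, x):
    # solve H(x) dx = -grad f(x) instead of explicitly inverting H
    dx = np.linalg.solve(hess_f(x), -grad_f(x))
    return x + dx

# Quadratic example f(x) = 1/2 (x1^2 + 10 x2^2): the minimum is reached in one step
grad_f = lambda x: np.array([x[0], 10 * x[1]])
hess_f = lambda x: np.diag([1.0, 10.0])
print(newton_step(grad_f, hess_f, np.array([1.0, 1.0])))   # [0. 0.]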

Quasi-Newton method

Approximate \(f\):

\[f^k(x+v)=f(x)+\nabla f(x)^T v + \frac 1 2 v^T B_k v\]

The direction of the descent is:

\[\Delta x = -B_k^{-1} \nabla f(x)\]

  • \(B_k\) is an approximation of the Hessian that is continuously updated

  • See also Krylov methods to approximate the Hessian

BFGS method

Special case of Quasi-Newton method: Broyden, Fletcher, Goldfarb and Shanno.

BFGS method

\[s_k = x_{k+1} - x_k\]

\[y_k = \nabla f(x^{k+1})-\nabla f(x^k)\]

\[\rho_k = \frac 1 {y_k^T s_k}\]

\[F_{k+1} = (I-\rho_k s_k y_k^T) F_k (I-\rho_k y_k s_k^T) + \rho_k s_k s_k^T\]

Descent direction \(F_k \nabla f(x)\) can be computed in \(O(N^2)\)
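
A sketch of this inverse-Hessian update with numpy (illustrative, not an optimized implementation):

import numpy as np

def bfgs_update(F, s, y):
    # one BFGS update of the inverse-Hessian approximation F, in O(N^2)
    rho = 1.0 / y.dot(s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ F @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

# Example on f(x) = 1/2 (x1^2 + 10 x2^2), starting from the identity approximation
grad_f = lambda x: np.array([x[0], 10 * x[1]])
x0, x1 = np.array([1.0, 1.0]), np.array([0.9, 0.0])
F = bfgs_update(np.eye(2), x1 - x0, grad_f(x1) - grad_f(x0))
dx = -F @ grad_f(x1)   # quasi-Newton descent direction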

GD: orthogonal steps

  • For quadratic functions, search directions are successively orthogonal

GD: orthogonal steps

  • This leads to longer paths than expected

Conjugate gradient

  • Eliminate all of the error along one axis in one step

GD is slow

As is conjugate gradient:

  • In non-convex spaces, they do not generalize well
    • Solution: simulated annealing
  • They require computing the gradient of the error over the full corpus

Stochastic Gradient Descent

Assume the function to minimize has the form: \[f(x)=\frac 1 T \sum_{t=1}^T f_t(x)\]

This is the case for

  • maximum likelihood training
  • empirical loss minimization

Stochastic Gradient Descent

Standard Gradient Descent:

\[x \leftarrow x - t\nabla f(x)\]

SGD approximates the gradient with the gradient at a single sample:

\[x \leftarrow x - t\nabla f_i(x)\]

In practice:

  • Shuffle the dataset at the beginning of each epoch
  • Iterate over all samples in the dataset
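
A minimal SGD loop following this recipe (sketch; grad_fi returns the gradient of the loss at a single sample):

import numpy as np

def sgd(grad_fi, x, data, t=0.01, n_epochs=100):
    for _ in range(n_epochs):
        np.random.shuffle(data)              # reshuffle at each epoch
        for sample in data:
            x = x - t * grad_fi(x, sample)   # gradient at a single sample
    return x

# Example: fit a slope on pairs (a, b) with per-sample loss (a*x - b)^2
data = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9]])
grad_fi = lambda x, s: 2 * s[0] * (s[0] * x - s[1])
print(sgd(grad_fi, 0.0, data))               # close to the least-squares slope (~1.99)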

SGD convergence

  • There are proofs that, under relatively mild assumptions, and when the learning rates \(t\) decrease with an appropriate rate, then SGD converges almost surely to a global minimum when the function is convex, and otherwise to a local minimum.

  • SGD + back-propagation is used to train all deep neural nets

SGD convergence

  • Recent research works suggest that SGD is so successful because such an approximation of the gradient strongly regularizes the learning process.

  • Other works suggest that SGD itself is the main reason for the success of DNNs!

SGD with momentum

Momentum:

\[\Delta x \leftarrow \alpha \Delta x - t \nabla f_i(x)\]

\[x \leftarrow x + \Delta x\]

  • There’s a “memory” / “inertia” of the previous updates
  • This prevents oscillations
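
The same loop with momentum (sketch): the velocity \(\Delta x\) keeps an exponentially decaying memory of past gradients and the parameters are then moved by \(x \leftarrow x + \Delta x\).

import numpy as np

def sgd_momentum(grad_fi, x, data, t=0.01, alpha=0.9, n_epochs=100):
    # same interface as the plain sgd() sketch above; alpha is the momentum coefficient
    dx = 0.0
    for _ in range(n_epochs):
        np.random.shuffle(data)
        for sample in data:
            dx = alpha * dx - t * grad_fi(x, sample)   # inertia from previous updates
            x = x + dx
    return x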

Other SGD extensions

  • decay: the learning rate decreases over time

  • Adagrad: different learning rate per dimension

  • RMSProp: uses a running average of recent gradients per dimension

  • Adam: an enhancement of RMSProp with momentum and bias correction

  • mini-batch SGD: rule of thumb: use 32, 64, 128 or 256
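
All these variants are available in torch.optim; a short usage sketch (the tensor w and the loss are just placeholders):

import torch

w = torch.randn(10, 5, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.01, momentum=0.9)   # SGD + momentum
# opt = torch.optim.Adagrad([w], lr=0.01)           # per-dimension learning rate
# opt = torch.optim.RMSprop([w], lr=0.001)          # running average of gradients
# opt = torch.optim.Adam([w], lr=0.001)             # RMSProp-like + momentum

loss = (w ** 2).sum()   # dummy loss
opt.zero_grad()
loss.backward()
opt.step()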

Introduction to backpropagation

  • Objective: given a function (loss) \(f(x)\), compute its gradient \(\nabla f(x)\)

  • Complex \(f(x)\) can be decomposed using intermediary variables: \[f(x,y,z)=(x+y)^2z\] is better computed as:
  • \(q=x+y\)
  • \(h=q^2\)
  • \(f=h\times z\)

Introduction to backpropagation

  • \(q=x+y\)
  • \(h=q^2\)
  • \(f=h\times z\)

  • Principle: back-propagate the final gradient locally, through each intermediary variable, thanks to the chain rule
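
Applying this principle by hand to the decomposition above (illustrative values):

# forward pass through the intermediate variables
x, y, z = 3.0, -4.0, 2.0
q = x + y            # q = -1
h = q ** 2           # h = 1
f = h * z            # f = 2

# backward pass: chain rule applied gate by gate, from f back to the inputs
dh = z               # df/dh
dz = h               # df/dz
dq = dh * 2 * q      # df/dq = df/dh * dh/dq
dx = dq * 1.0        # df/dx = df/dq * dq/dx
dy = dq * 1.0        # df/dy = df/dq * dq/dy
print(dx, dy, dz)    # -4.0 -4.0 1.0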

Interpretation as a circuit diagram

(or computational graph)

Two passes

  • Forward pass: computes values from inputs to outputs (in green)
  • Backward pass: recursively applies chain rule to compute gradients (in red) from the end to the inputs

Every gate in the circuit can compute the local gradient of its output with respect to its inputs.

Backward pass

\[\frac {df}{df} = 1\] then \[f=qz \rightarrow \frac {df}{dq} = z = -4\] \[f=qz \rightarrow \frac {df}{dz} = q = 3\] then \[q=x+y \rightarrow \frac {df}{dx} = \frac {df}{dq} \frac {dq}{dx} + \frac {df}{dz}\frac{dz}{dx} = -4 \times 1 + 0\]

Backprop is a local process

  • Example: the “+” gate: \(x + y = q\)
    • It is equipped with local derivatives:
      • To its first input: \(\frac {dq}{dx}=1\)
      • To its second input: \(\frac {dq}{dy}=1\)
  • Assume it is given a gradient value \(\frac {df}{dq}\) at its output:
    • It passes it to its first input by multiplying it with its local derivative: \(\frac {df}{dq} \times \frac {dq}{dx} = \frac {df}{dx}\)
    • It passes it to its second input by multiplying it with its local derivative: \(\frac {df}{dq} \times \frac {dq}{dy} = \frac {df}{dy}\)

Sigmoid neuron example

\[f(w,x)=\frac 1 {1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}\]

Exercise:

  • Draw a computation graph
  • Write the forward values with \(w_0=2\), \(x_0=-1\), \(w_1=-3\), \(x_1=-2\) and \(w_2=-3\)
  • Write the backward values

Sigmoid neuron example

The sigmoid is made up of multiple gates:

\[f(x)=\frac 1 x ~~ \rightarrow ~~ \frac {df}{dx} = -1/x^2\] \[f_c(x)=c+x ~~ \rightarrow ~~ \frac{df}{dx} = 1\] \[f(x)=e^x ~~ \rightarrow ~~ \frac{df}{dx} = e^x\] \[f_a(x)=ax ~~ \rightarrow ~~ \frac{df}{dx} = a\]

Sigmoid neuron example

The sigmoid can also be differentiated directly

\[\sigma(x) = \frac 1 {1+e^{-x}}\]

\[\frac {d\sigma(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2}\]

\[ = \left( \frac{1+e^{-x}-1}{1+e^{-x}}\right) \left(\frac 1 {1+e^{-x}}\right)\]

\[ = (1-\sigma(x))\sigma(x)\]

It’s simple enough to be used as a single “gate” in complex networks.

Implementation

import math

w=[2,-3,-3]     # assume some random weights and data
x=[-1,-2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function

# backward pass through the neuron (backpropagation)
ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# we're done! we have the gradients on the inputs to the circuit

Staged backpropagation: add intermediary variable ddot to break down computation

Another example

\[f(x,y)=\frac {x+\sigma(y)}{\sigma(x)+(x+y)^2}\]

  • Differentiating this function analytically would give very complex expressions.

  • But we don’t need them: backpropagation evaluates the gradient without ever writing them down.

Another example

import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator   #(1)
num = x + sigy # numerator                               #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y                                              #(4)
xpysqr = xpy**2                                          #(5)
den = sigx + xpysqr # denominator                        #(6)
invden = 1.0 / den                                       #(7)
f = num * invden # done!                                 #(8)

Backward implementation

# backprop f = num * invden
dnum = invden # gradient on numerator                             #(8)
dinvden = num                                                     #(8)
# backprop invden = 1.0 / den 
dden = (-1.0 / (den**2)) * dinvden                                #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                                #(6)
dxpysqr = (1) * dden                                              #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                        #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                   #(4)
dy = (1) * dxpy                                                   #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below  #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                  #(2)
dsigy = (1) * dnum                                                #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                 #(1)

Understanding gradient in gates

  • The add gate distributes the gradient evenly to all of its inputs
  • The max gate routes the gradient to its highest input

Understanding gradient in gates

  • The mult gate switches its inputs and multiplies each by the gradient

  • Consequence for linear classifiers \(w^Tx\): if you multiply input by 1000, the gradient on the weights will be 1000 times larger; preprocessing hence matters a lot !

Additional material

  • see http://cs231n.stanford.edu/vecDerivs.pdf
  • see https://arxiv.org/abs/1502.05767

Autodiff and backpropagation

Refs:

  • Yann LeCun. (1988) A Theoretical Framework for Back-Propagation.
  • https://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule
  • http://conal.net/papers/beautiful-differentiation/

Autodiff

Autodiff = set of techniques to numerically evaluate the derivative of a function specified by a program

Many algorithms are just autodiff: forward-backward, inside-outside…

Another derivation of back-propagation

\[f(x) = \exp\left(\exp(x)+\exp(x)^2\right) + \sin\left(\exp(x)+\exp(x)^2\right)\]

may be computed as follows:

from math import exp, sin

def f(x):
    a = exp(x)
    b = a ** 2
    c = a + b
    d = exp(c)
    e = sin(c)
    return d + e

Using Lagrange Multipliers

\[\arg\max_x f\]

with constraints:

\[a=\exp(x)\] \[b=a^2\] \[c=a+b\] \[d=\exp(c)\] \[e=\sin(c)\] \[f=d+e\]

Using Lagrange Multipliers

Generalisation:

The constraints are composed of 3 blocks:

  • \(d\) input variables \[z_i = x_i\]
  • intermediary variables \[z_i = f_i(z_{\alpha(i)})\]
  • output variable \[z_n = f_n(z_{\alpha(n)})\]

\(\alpha(i)\) is a subset of the first \(n-1\) variables \(z_i\).

Using Lagrange Multipliers

Assumptions:

  • \(\alpha\) is acyclic: no \(z_i\) transitively depends on itself
  • Single assignment: each \(z_i\) appears once on the left-hand side of the constraints

Lagrangian:

\[L(x,z,\lambda) = z_n - \sum_{i=d+1}^n \lambda_i \cdot (z_i - f_i (z_{\alpha(i)}))\]

Linear system

  • Intermediary variables: \[\nabla_{\lambda_i} L = z_i - f_i(z_{\alpha(i)}) = 0\]

  • Multipliers (except last one): \(\nabla_{z_j} L=0\) gives \[\lambda_j = \sum_{i \in \beta(j)} \lambda_i \frac{\partial f_i(z_{\alpha(i)})}{\partial z_j}\] where \(\beta(j)=\{i : j \in \alpha(i)\}\) is the set of expressions that use \(z_j\)

  • Output multiplier: \(\nabla_{z_n} L=0\) gives \[\lambda_n=1\]

Linear system

  • Input multipliers: \(\nabla_x f(x) = \lambda_{1:d}\)

  • Input variables need to be set by gradient ascent

  • We solve this linear system by back-substitution when we use the backpropagation method.

Linear system

  • Want cyclic graph of dependencies ?

  • Can be done with general linear system solvers (\(O(n^3)\)) instead of back-substitution (=back-propagation in \(O(n)\)).

  • Back-prop operates on programs (=with intermediary variables) and not only functions, so it is more than the chain rule.

Feedforward neural network

Multi Layer Perceptron

Why the activation function ?

  • To “normalize” the output within [-1;1] and help make a decision
  • To add non-linearity (and so, richer modeling capacity)
  • To be able to stack neurons

Activation functions

  • tanh: nearly linear around 0
  • relu: fast, better for vanishing gradient

tanh

\[tanh(x) = \frac {e^x - e^{-x}} {e^x + e^{-x}}\]

derivative: \[\frac {\partial tanh(x)}{\partial x} = 1-tanh^2(x)\]

MLP

  • aka “feed-forward” network
  • “Dense” layer = 1 matrix \(W\)

MLP

  • Hidden layers are
    • larger than inputs if we want to capture complex relationships or a large capacity
    • smaller than inputs if we want to compress representations (auto-encoder)
  • The MLP does not assume any particular structure on the data

Output layer

To do regression, we need:

  • As many output neurons as output dimensions
  • A linear output activation (or sigmoid…)

To do binary classification, we need:

  • 1 output neuron with tanh (or sigmoid)
  • decision threshold = 0 (or 0.5 with a sigmoid)

For N-class classification: ??

Softmax

  • Use 1 neuron per possible output label

  • Normalize the \(N\) scores into a proba distribution: \[s(x)_i = \frac {e^{x_i}}{\sum_{k=1}^N e^{x_k}}\]

  • decision: pick the maximum proba

  • sometimes include parameters = weight matrix \(\rightarrow\) multinomial logistic regression or maximum entropy classifier
    • See http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf
  • also used in attention models
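
A numerically stable softmax (sketch; subtracting the max does not change the result but avoids overflow):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # shift by the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # ~[0.66 0.24 0.10], sums to 1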

Cross-entropy

Entropy of a probability distribution = average nb of bits required to encode samples drawn from it \[H(p)=-\sum_i p_i \log p_i\]

Cross-entropy between 2 distributions = average nb of bits required to encode samples from \(p\) when using a code optimized for \(q\) \[H(p,q)=-\sum_i p_i \log q_i\]

Kullback-Leibler divergence = Xent - ent \[KL(p||q)= \sum_i p_i \log \frac {p_i}{q_i}\]
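
A small numerical check of these three definitions (illustrative distributions):

import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

ent  = -np.sum(p * np.log2(p))          # H(p)
xent = -np.sum(p * np.log2(q))          # H(p, q)
kl   =  np.sum(p * np.log2(p / q))      # KL(p || q)
print(np.isclose(kl, xent - ent))       # True: KL = cross-entropy - entropy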

Loss Regularization

  • Naive DNN overfits \(\rightarrow\) we want to reduce variance

  • This can be done by adding a term to the loss that limits the space where parameters can live:
    • L2 regularization is the most common: \[loss = E[H(p,q)] + \lambda \sum_i \theta_i^2\]
    • L2 can be interpreted as adding a Gaussian prior
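
In PyTorch, the L2 term can be added explicitly to the loss, or (roughly equivalently) through the weight_decay argument of the optimizer; a minimal sketch:

import torch

w = torch.randn(10, 5, requires_grad=True)
lam = 1e-4

data_loss = (w ** 2).mean()               # placeholder for E[H(p, q)]
loss = data_loss + lam * (w ** 2).sum()   # explicit L2 penalty on the parameters

# alternative: let the optimizer apply the L2 penalty at each update
opt = torch.optim.SGD([w], lr=0.01, weight_decay=lam)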

Other regularizations

  • L1 \[loss = E[H(p,q)] + \lambda \sum_i |\theta_i|\]

Other regularizations

  • Dropout
    • Randomly remove \(\lambda\)% of the input and hidden neurons

Other regularizations

  • DropConnect

Other regularizations

  • Artificial data expansion: Replicate dataset by adding \(\lambda\)% noise

  • SGD: Tune mini-batch size

Putting it all together

Each layer of a feedforward classifier is composed of:

  • Linear parameters: weight matrix + bias vector

  • An activation function (relu)

  • An optional dropout “layer”

The output layer is composed of:

  • Linear parameters: weight matrix + bias vector

  • A softmax activation

The training loss = cross-entropy + L2
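
A PyTorch sketch assembling these pieces (layer sizes and hyper-parameters are illustrative; nn.CrossEntropyLoss combines the softmax and the cross-entropy, and weight_decay adds the L2 term):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 50),    # weight matrix + bias vector
    nn.ReLU(),             # activation
    nn.Dropout(0.5),       # optional dropout "layer"
    nn.Linear(50, 10),     # output layer: weight matrix + bias vector
)
criterion = nn.CrossEntropyLoss()   # softmax + cross-entropy
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)   # L2

x = torch.randn(32, 100)               # a mini-batch of 32 samples
target = torch.randint(0, 10, (32,))   # class labels
loss = criterion(model(x), target)
opt.zero_grad()
loss.backward()
opt.step()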

Computation graph for MLP

  • 1-hidden layer MLP: 2 parameter matrices
  • Regularized neg-loglike loss:

\[ J = J_{MLE} + \lambda \left( \sum_{i,j} (W_{i,j})^2 + \sum_{i,j}(W'_{i,j})^2 \right)\]

Computation graph for MLP

Weights initialization

Why is initialization important ?

  • If the weights start too small, the signal shrinks as it passes through the layers until it’s too tiny to be useful
  • If the weights start too large, the signal grows as it passes through the layers until it’s too large to be useful

So we want the variance of the inputs and outputs to be similar !

Xavier initialization

For each neuron with \(n_{in}\) inputs:

  • Initialize its weights \(W\) from a zero-mean Gaussian with

\[Var(W)=\frac 1 {n_{in}}\]

Refs: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization

Glorot initialization

For each neuron with \(n_{in}\) inputs, going into \(n_{out}\) neurons:

  • Initialize its weights \(W\) from a zero-mean Gaussian with

\[Var(W)=\frac 2 {n_{in} + n_{out}}\]

http://pytorch.org/docs/master/nn.html?highlight=torch%20nn%20init#torch.nn.init.xavier_normal
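
A minimal sketch of this initialization, either by hand or with the PyTorch initializer referenced above (called xavier_normal_, with a trailing underscore, in recent versions):

import math
import torch
from torch import nn

n_in, n_out = 100, 50
# draw W from a zero-mean Gaussian with Var(W) = 2 / (n_in + n_out)
W = torch.randn(n_out, n_in) * math.sqrt(2.0 / (n_in + n_out))

# equivalent with the built-in initializer
W2 = torch.empty(n_out, n_in)
nn.init.xavier_normal_(W2)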

Batch normalization

  • You must normalize your inputs
  • Normalizes the activations of the previous layer over each mini-batch, so that their mean = 0 and standard deviation = 1
  • May have parameters to further scale + shift
    • https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
    • http://pytorch.org/docs/master/nn.html?highlight=batchnorm1d#torch.nn.BatchNorm1d
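
A short usage sketch of PyTorch's BatchNorm1d (feature size and batch size are illustrative):

import torch
from torch import nn

bn = nn.BatchNorm1d(50)    # 50 = number of features coming out of the previous layer
h = torch.randn(32, 50)    # a batch of 32 activation vectors
out = bn(h)                # per-feature mean ~0 and std ~1 over the batch
# bn.weight and bn.bias are the learnable scale and shift parameters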

A lot of tricks

MLP, simple ?

  • Find a good topology, residual connections, parameter sharing…
  • Find a good SGD variant
  • Tune learning rate, nb of neurons…
  • Tune regularization strength, dropout…
  • Tune batch normalization
  • Tune nb of epochs, batch size…

Andrew Ng: “If you tune only one thing, tune the learning rate !”

GPU

SGD on complex models requires a lot of tensor computations:

  • A tensor = generalization of a matrix to higher dimensions
  • Can be parallelized onto GPU (or TPU)
  • NVidia added 16-bit tensor cores in its “Volta” series
  • yet another regularization: 64 -> 32 -> 16 bits

Deep Learning toolkits

  • Don’t reprogram from scratch: reuse toolkits

  • All of them implement:
    • Autograd of most tensor operators
    • Wrappers to abstract model’s parts (dense, conv, lstm..)
    • Optimized SGD variants
    • Standard activations, normalizations, regularizations…
    • Efficient initializations
    • CPU + CUDA versions of all tensor operators !!
  • This makes life wayyy easier for deep learners…

Major toolkits

2 types of toolkits

Static toolkits

  • TensorFlow, Keras, Theano, TFLearn…

  • You define the graph (with add_node(…))
  • You compile the graph
  • Then you pass in data, fit, iterate

Dynamic toolkits

  • PyTorch, Chainer, DyNet…

  • You start with a Variable, and apply operations on it
  • For every data chunk, a new graph is built