Machine and Deep learning

Christophe Cerisara


Deep Neural Networks

Convolutional networks

Limits of feed-forward networks: size

Example: does person X appear in a photo?

  • small B&W photo: 300x200 = 60k inputs
  • e.g., 3 layers (60k; 1000; 1)
  • 60M parameters (= connections)!
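
The 60M figure can be checked with a few lines of arithmetic. This is only a sketch: the helper name `count_parameters` is hypothetical, and biases are ignored by default, as in the count above.

```python
def count_parameters(layer_sizes, with_bias=False):
    """Number of trainable parameters of a fully connected feed-forward net."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out      # one weight per connection
        if with_bias:
            total += n_out         # one bias per output unit
    return total

layers = [300 * 200, 1000, 1]      # the (60k; 1000; 1) example
n_weights = count_parameters(layers)
print(n_weights)                   # 60001000, i.e. about 60M
```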

Limits of feed-forward networks: no sharing

  • Sample 1: the person appears on the top-left of the photo
  • Sample 2: the same person appears on the bottom-right of the photo
  • The information common to both samples is lost: each position has its own weights!

Sharing parameters

  • Exploit the structural properties within data
  • E.g., for images: invariance to translation
    • So we can use the same set of weights at various positions of the image
    • \(\rightarrow\) convolutional networks
  • E.g., for speech: Markov property of the signal
    • \(\rightarrow\) recurrent networks
  • E.g., for graph data:
    • \(\rightarrow\) graph convolutional networks

2D convolution

Kernel size = 3x3

2D convolution

  • Input image \(I\)
  • Kernel \(K=\) matrix of size \(h \times w\)

\[(I * K)_{x,y} = \sum_{i=1}^h \sum_{j=1}^w K_{ij} \cdot I_{x+i-1,y+j-1}\]

  • If the input has \(d\) channels (e.g., red/green/blue), the kernel is extended with a matching third dimension of size \(d\)
  • A bias and an activation are also typically used:

\[conv(I,K)_{x,y} = \sigma\left(b+\sum_{i=1}^h \sum_{j=1}^w \sum_{k=1}^d K_{ijk} \cdot I_{x+i-1,y+j-1,k}\right)\]
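
A minimal sketch of the single-channel formula, in pure Python, may help to read the indices. It assumes \(d=1\), 0-based indices, and that the kernel is only applied where it fits entirely inside the image (no padding); the function name `conv2d_valid` is hypothetical.

```python
def conv2d_valid(image, kernel, bias=0.0, activation=lambda v: v):
    """Slide an h x w kernel over the image and apply bias + activation."""
    h, w = len(kernel), len(kernel[0])
    H, W = len(image), len(image[0])
    out = []
    for x in range(H - h + 1):
        row = []
        for y in range(W - w + 1):
            acc = bias
            for i in range(h):
                for j in range(w):
                    acc += kernel[i][j] * image[x + i][y + j]
            row.append(activation(acc))
        out.append(row)
    return out

image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0]]
kernel = [[1, 0],
          [0, -1]]
result = conv2d_valid(image, kernel)   # 2 x 3 output
```

The same kernel weights are reused at every position, which is the parameter sharing discussed earlier.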

Stride / Shift

From left to right: stride=1, stride=2, receptive field=3


Kernel size 3x3, stride 2, padding 1
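
Kernel size, stride and padding together determine the output size along each axis. The sketch below uses the usual formula \(\lfloor (n + 2p - k)/s \rfloor + 1\); the function name is hypothetical and the input sizes are only illustrative.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one axis for an input of length n."""
    return (n + 2 * padding - kernel) // stride + 1

print(conv_output_size(5, kernel=3, stride=2, padding=1))    # 3
print(conv_output_size(300, kernel=3, stride=1, padding=1))  # 300: size preserved
```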

Adding filters

  • Apply 5 convolutions \(\rightarrow\) 5-dim vector:

Adding filters


  • With an appropriate kernel size and padding, and stride 1, the resulting “blurred image” has the same size as the input image

  • We want to compress the information and progressively reduce the layer size down to the final layer:
    • Option 1: remove padding, increase stride


  • Option 2: downsampling = pooling


Max pooling = pick the max value within a filter size
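
As an illustration, a pure-Python sketch of max pooling on a single-channel image. The function name and the toy image are hypothetical; the default 2x2 window with stride 2 halves each dimension.

```python
def max_pool2d(image, size=2, stride=2):
    """Keep the maximum of every size x size window."""
    H, W = len(image), len(image[0])
    out = []
    for x in range(0, H - size + 1, stride):
        row = []
        for y in range(0, W - size + 1, stride):
            window = [image[x + i][y + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        out.append(row)
    return out

image = [[1, 3, 2, 4],
         [5, 6, 1, 0],
         [7, 2, 9, 8],
         [0, 1, 3, 4]]
pooled = max_pool2d(image)
print(pooled)   # [[6, 4], [7, 9]]
```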

ConvNet = stacking of various types of layers

1x1 convolution

  • reduce dims: 200x200x50 \(\rightarrow\) 200x200x\(N_f\)
  • add another non-linearity, with few parameters (deeper net)
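
A short PyTorch sketch of the dimension reduction, assuming a hypothetical \(N_f = 8\). Note that PyTorch stores images as (batch, channels, height, width), so 200x200x50 becomes (1, 50, 200, 200).

```python
import torch
import torch.nn as nn

n_f = 8   # hypothetical number of output filters N_f
conv1x1 = nn.Conv2d(in_channels=50, out_channels=n_f, kernel_size=1)

x = torch.randn(1, 50, 200, 200)      # one 200x200 map with 50 channels
y = torch.relu(conv1x1(x))            # extra non-linearity
n_params = sum(p.numel() for p in conv1x1.parameters())

print(y.shape)     # torch.Size([1, 8, 200, 200])
print(n_params)    # 408 = 50*8 weights + 8 biases
```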

Dilated convolution

  • skip 0, or 1, or 3 pixels
  • increase receptive field
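
The growth of the receptive field can be computed directly. This sketch assumes that "skipping \(s\) pixels" means leaving \(s\) pixels between two consecutive kernel taps (which corresponds to `dilation = s + 1` in PyTorch); the function name is hypothetical.

```python
def effective_kernel_size(kernel, skipped):
    """Span covered by a kernel whose taps are separated by `skipped` pixels."""
    return kernel + (kernel - 1) * skipped

for skipped in (0, 1, 3):
    print(skipped, effective_kernel_size(3, skipped))   # 3, 5, then 9 pixels
```

The number of weights stays at 3x3 in all three cases; only the area they cover grows.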

ConvNet architectures

INPUT -> [[CONV ->  RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
  • Prefer a stack of small filter CONV to one large receptive field CONV layer
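
One way to see why a stack of small filters is preferred: three stacked 3x3 convolutions cover the same 7x7 receptive field as a single 7x7 convolution, with fewer weights and more non-linearities. A sketch, assuming stride 1, the same number of channels \(C\) everywhere (64 here is hypothetical) and no biases:

```python
def stacked_receptive_field(n_layers, kernel):
    """Receptive field of n stacked stride-1 convolutions of the same size."""
    return 1 + n_layers * (kernel - 1)

def conv_weights(kernel, channels):
    """Weights of one kernel x kernel layer with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels

C = 64                                   # hypothetical channel count
small_stack = 3 * conv_weights(3, C)     # three 3x3 layers
one_large = conv_weights(7, C)           # one 7x7 layer

print(stacked_receptive_field(3, 3))     # 7
print(small_stack, one_large)            # 110592 200704
```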

Hyper-parameter choice

  • conv layers: use small filters (3x3, at most 5x5)

  • pool layer: usually 2x2 with stride 2

Model Zoo in Image Processing


  • Yann LeCun, 1990


  • Alex Krizhevsky, Ilya Sutskever, Geoff Hinton, 2012


  • Szegedy et al., 2014


Inception module:

  • Let the model choose:
    • compute multiple transformations: 1x1-conv, 3x3-conv, 5x5-conv, 3x3-pool
  • Reduce dimensionality:
    • by adding 1x1-convolutions


  • Karen Simonyan, Andrew Zisserman, 2014

  • Classical architecture but much deeper


  • Kaiming He et al., 2015


  • G. Huang et al., 2016

  • each layer directly connected to all following layers

Model Zoo in Natural Language Processing

  • Use 1D-convolutions instead of 2D-convolutions
    • shift the kernel along the time axis only

Collobert et al., 2011

Collobert et al., 2011

  • Lookup table = Word Embeddings
    • assign a single vector \(\vec w \in R^{100}\) to every word in the voc.
      • e.g. \(\vec w_{le}\), \(\vec w_{chat}\), \(\vec w_{chien}\)
    • whenever word \(w\) occurs in the input, this vector \(\vec w\) is inserted as input
      • e.g. “le chat le chien” \(\rightarrow\) \((\vec w_{le}, \vec w_{chat}, \vec w_{le}, \vec w_{chien})\)
    • the real values in these vectors are trained by back-propagation
  • The following convolutions use 100 input channels
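
The lookup table followed by a 1D convolution could be sketched as follows in PyTorch. The toy vocabulary, the 16 filters and the kernel size 3 are assumptions made for illustration, not values from Collobert et al.

```python
import torch
import torch.nn as nn

vocab = {"le": 0, "chat": 1, "chien": 2}           # toy vocabulary
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=100)
conv = nn.Conv1d(in_channels=100, out_channels=16, kernel_size=3, padding=1)

ids = torch.tensor([[vocab[w] for w in "le chat le chien".split()]])  # (1, 4)
vectors = embed(ids)                               # (batch, time, 100)
# Conv1d expects (batch, channels, time): the kernel shifts along time only
features = conv(vectors.transpose(1, 2))           # (batch, 16, time)

print(vectors.shape, features.shape)
```

Both occurrences of "le" receive the same vector, and the embedding weights are trained by back-propagation like any other parameter.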

Collobert et al., 2011

Wide and deep CNN

Yoon Kim, 2014

Not very deep, but extremely efficient!

Kalchbrenner, 2014

Zhang et al., 2015

Deep CNN: Conneau (2017)

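Transfer learning

  • ConvNets can be reused across multiple tasks
  • CNNs are rarely trained from scratch; instead, starting from a ConvNet pretrained on ImageNet:
    • Remove its last layer and train a new one for the target task
    • … or freeze the first N layers and fine-tune the others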
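
Both options can be sketched in PyTorch. The small `pretrained` network below is only a stand-in (in practice it would be a real ConvNet loaded with its ImageNet weights), and the values of `n_frozen` and the 5 target classes are hypothetical.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained ConvNet with a 1000-class output layer
pretrained = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 1000),
)

# Freeze the first N modules: their weights receive no gradient
n_frozen = 3
for layer in list(pretrained.children())[:n_frozen]:
    for p in layer.parameters():
        p.requires_grad = False

# Drop the last layer and plug in a new classifier for the target task
model = nn.Sequential(*list(pretrained.children())[:-1], nn.Linear(16, 5))

trainable = [p for p in model.parameters() if p.requires_grad]
out = model(torch.randn(2, 3, 32, 32))
print(len(trainable), out.shape)
```

Only the `trainable` parameters would then be passed to the optimizer.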

Visualize convnets

  • plot hidden weights

ex: AlexNet, 1st and 2nd layers:

Other approaches

  • plot hidden activations
  • occluding parts of the image
  • retrieve images that maximally activate a neuron
  • visualize gradients
  • DeconvNet


  • embed images into 2D s.t. their (pair-wise) distance in 2D \(\simeq\) their distance before last layer


  • Stanford CS231n
  • pytorch tutorials: https://github.com/ritchieng/the-incredible-pytorch


MNIST official example

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# `args` (parsed command-line options) and `train_loader` (MNIST DataLoader)
# are defined elsewhere in the official example.

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))   # 28x28 -> 24x24 -> 12x12
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))  # -> 8x8 -> 4x4
        x = x.view(-1, 320)                          # 20 channels * 4 * 4 = 320
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)               # log-probabilities per class

model = Net()
if args.cuda: model.cuda()

optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

def train(epoch):
    model.train()                                    # training mode: dropout active
    for batch_idx, (data, target) in enumerate(train_loader):
        if args.cuda: data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()                        # reset accumulated gradients
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()                              # back-propagation
        optimizer.step()                             # SGD update
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def test():
    model.eval()                                     # test mode: dropout disabled
    # ...
Comments on the implementation

  • Shows how to “move” the data and the model onto the GPU
  • Illustrates the SGD optimizer (here with momentum)
  • Uses the Negative Log-Likelihood loss function, which does not include the final softmax, so the (log-)softmax must be defined in the model
  • Shows a useful method, view(), which may be used to reshape tensor dimensions
  • Shows how to set the training or test mode of a model: important when using dropout!

Recurrent neural net (RNN)

Recurrent neural networks




  • \(x\) = observation/input; \(o\) = output
  • \(s\) = hidden neuron = represents the “state” (=memory)

\[s_t = f(Ux_t + W s_{t-1})\]
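
The recurrence can be unrolled by hand in a few lines of PyTorch. This is a sketch only: \(f\) is taken to be tanh, the matrices are random instead of trained, and the sizes are hypothetical.

```python
import torch

torch.manual_seed(0)
d_in, d_state = 4, 3                       # hypothetical sizes
U = torch.randn(d_state, d_in) * 0.1       # input-to-state weights
W = torch.randn(d_state, d_state) * 0.1    # state-to-state weights

def rnn_forward(xs, s0):
    """Apply s_t = tanh(U x_t + W s_{t-1}) over a sequence xs of shape (time, d_in)."""
    states, s = [], s0
    for x_t in xs:
        s = torch.tanh(U @ x_t + W @ s)    # the same U and W at every step
        states.append(s)
    return torch.stack(states)             # (time, d_state)

xs = torch.randn(5, d_in)
states = rnn_forward(xs, torch.zeros(d_state))
print(states.shape)                        # torch.Size([5, 3])
```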

RNN applications

  • RNNs compute representations \(o\) of the inputs \(x\) with memory
    • Often followed by a classifier layer
  • All domains with a “time”-axis: language, speech, finance, DNA…
    • speech recognition, translation, chatbots…

RNN training

Back-Propagation Through Time:

Back-Propagation Through Time

  • Key difference from standard backprop: the gradients are summed over all time steps

Example: gradient with respect to output 3: \[\frac{\partial E_3}{\partial W} = \sum_{k=0}^3 \frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3} \left( \prod_{j=k+1}^3 \frac {\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}\]


  • \(\nabla E_3\) contributes to steps 3, 2, 1 and 0: \[\frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3} \left( \prod_{j=k+1}^3 \frac {\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}\]
  • All steps share the same \(W\): \[\frac{\partial E_3}{\partial W} = \sum_{k=0}^3 \frac{\partial E_3}{\partial \hat y_3} \cdots \]

Vanishing gradient

  • Even in MLP: \[ l_1 = U \cdot X\] \[ h_1 = \text{tanh}(l_1)\] \[ l_2 = V \cdot h_1\] \[ y = \text{tanh}(l_2)\] \[\frac {\partial E_y}{\partial l_1} = \frac{\partial E_y}{\partial y}\frac{\partial y}{\partial l_2}\frac{\partial l_2}{\partial h_1}\frac{\partial h_1}{\partial l_1}\]

Vanishing gradient

Derivative of the \(tanh\) activation:

Vanishing gradient

  • The gradient of the activation is often close to 0
  • When many such gradients are multiplied, their product becomes smaller and smaller, until it is negligible
  • Beyond a few time steps, earlier inputs no longer contribute to training

  • Exploding gradients are also possible, but can be mitigated with gradient clipping
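
A rough numerical illustration in pure Python: the derivative of tanh is \(1 - \tanh^2(z)\), which equals 1 at \(z=0\) and drops quickly as \(|z|\) grows. The sketch ignores the weight matrices in the chain rule and assumes the same pre-activation \(z=2\) at every step, so it only shows the order of magnitude; the clipping helper is one common variant (rescaling by the norm), and both function names are hypothetical.

```python
import math

def tanh_derivative(z):
    return 1.0 - math.tanh(z) ** 2

def gradient_factor(z, steps):
    """Product of `steps` identical tanh derivatives."""
    return tanh_derivative(z) ** steps

def clip_by_norm(grad, max_norm):
    """Rescale the gradient vector if its norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

for steps in (1, 5, 10):
    print(steps, gradient_factor(2.0, steps))   # shrinks geometrically

print(clip_by_norm([3.0, 4.0], max_norm=1.0))   # direction kept, norm reduced to 1
```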

Solutions to vanishing gradient

  • Careful initialization of the \(W\) matrices
  • Careful tuning of regularization
  • Use ReLU activations instead of sigmoid/tanh
  • Use LSTM or GRU cells instead of simple neurons


  • An LSTM cell computes the new state \(s_{t+1}\) from the previous state and the input using 3 gates:
    • input gate: how much of the input-derived state we let through
    • forget gate: how much of the previous state we let through
    • output gate: how much of the overall state will be passed out

Vanishing gradient

Plot of vanishing gradient

RNN extensions

  • Bidirectional
  • Deep (bidirectional) RNNs
  • Model zoo based on RNNs:
    • seq2seq models
    • attention networks
    • key-value memory networks
    • neural Turing machines

Design patterns


Cho et al., 2014


  • Seq2Seq is also an Encoder-Decoder
model   | in/out
seq     | \(x_1,\cdots,x_t \rightarrow x_{t+1}\)
seq2seq | \(x_1,\cdots,x_t \rightarrow x_{t+1}, x_{t+2}, \cdots\)

Attention models

Given the text, predict the topic:

The Republicans started a media offensive.
This is the first time the NBA organized a game in this city.
  • Focus on the most important words: compute \(\alpha_t =P(i_t|w_*,c)\)
  • \(c=\) context vector, \(i_t=\) importance of word \(t\)
  • Probability obtained with a softmax over the scores \(i_t=w_t \cdot c\)
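
A minimal pure-Python sketch of these attention weights: one dot product between each word vector and the context vector, then a softmax over the words. The 2-dimensional toy vectors and the function name are hypothetical; in a real model both \(w_t\) and \(c\) are trained.

```python
import math

def attention_weights(word_vectors, context):
    """alpha_t = softmax over t of (w_t . c)."""
    scores = [sum(w_k * c_k for w_k, c_k in zip(w, context))
              for w in word_vectors]
    m = max(scores)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

words = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]   # toy word vectors w_t
context = [1.0, 1.0]                           # toy context vector c
alphas = attention_weights(words, context)
print(alphas)   # the second word gets most of the weight
```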

Attention models

  • Seq2seq attention
  • bi-LSTM encoder
  • \(\alpha_t\) varies at every decoding timestep


  • Attention can be used without recurrence (which greatly reduces the cost!)
  • e.g.:
    • memory networks: attention on embeddings
    • transformers: likewise

Memory networks

A primer on tensors

  • Tensor dimensions are denoted \((n,m,p,q)\)
  • Element-wise ops:
    • addition, subtraction
    • Hadamard product: torch.mul()
    • division

A primer on tensors

  • dot product
torch.dot(torch.tensor([4,2]), torch.tensor([3,1])) = tensor(14)
  • matrix-vector product
  • matrix-matrix product

A primer on tensors

  • batch matrix-matrix product
  • first dimensions = batch dimensions
  • ex: \((j,1,n,m) \times (k,m,p) \rightarrow (j,k,n,p)\)

  • batch dimensions must be broadcastable: compare dims from right to left; each pair must be equal, or one of them is 1, or one of them does not exist
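
The shape rule can be verified with `torch.matmul`, which broadcasts the batch dimensions and multiplies the last two. The concrete sizes below are arbitrary choices for \(j, k, n, m, p\).

```python
import torch

j, k, n, m, p = 2, 5, 3, 4, 6
a = torch.randn(j, 1, n, m)     # batch dims (j, 1), matrices of size (n, m)
b = torch.randn(k, m, p)        # batch dim  (k,),   matrices of size (m, p)

c = torch.matmul(a, b)          # batch dims (j, 1) and (k,) broadcast to (j, k)
print(c.shape)                  # torch.Size([2, 5, 3, 6])

# Each output matrix is an ordinary matrix-matrix product
same = torch.allclose(c[1, 3], a[1, 0] @ b[3], atol=1e-6)
print(same)
```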

A primer on tensors

  • view
  • (un)squeeze: add/del dimensions of size 1

  • concatenate, split, slice… : https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/
  • Do the Attention exercise

Another Attention

  • 3 key components (vectors or batches of vectors):
    • Query \(Q\)
    • Keys \(K\)
    • Values \(V\)
  • Compare the query with every key using a compatibility measure
  • Ex: scaled dot-product attention (one vector per row): \[\text{softmax}\left( \frac {Q K^T}{\sqrt{d_k}} \right) V\]

Another Attention

  • Exercise: implement this measure as:
class ScaledDotProductAttention(nn.Module):
    def __init__(self, scale):
  • We assume Q,K,V have same sizes batch,seq,vecdim
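
One possible solution sketch, not necessarily the intended one. It assumes that `scale` is the divisor \(\sqrt{d_k}\), that the inputs have shape (batch, seq, vecdim) so that the scores are obtained by multiplying the queries with the transposed keys, and that returning the attention weights alongside the output is acceptable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
    def __init__(self, scale):
        super().__init__()
        self.scale = scale                       # assumed to be sqrt(d_k)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq, vecdim)
        scores = torch.bmm(q, k.transpose(1, 2)) / self.scale  # (batch, seq, seq)
        attn = F.softmax(scores, dim=-1)         # weights over the keys
        return torch.bmm(attn, v), attn          # (batch, seq, vecdim), weights

batch, seq, vecdim = 2, 5, 8
x = torch.randn(batch, seq, vecdim)
attention = ScaledDotProductAttention(scale=vecdim ** 0.5)
out, attn = attention(x, x, x)
print(out.shape, attn.shape)
```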

Multi-head attention

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, d):


  • Every word in sequence compared to all other words of the sentence
  • Implemented by using the same sequence of word vectors \(X\) as query, keys and values: \(Q=K=V=X\)


Encoder: feed forward

  • 2 layers:
    • conv1D w/ kernel=1 (d -> d_hid)
    • relu
    • conv1D w/ kernel=1 (d_hid -> d)
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d, d_hid, dropout=0.1):
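
A possible sketch of this block, following the two kernel-1 convolutions listed above. It is an assumption that dropout is applied on the output only; reference transformer implementations usually also add a residual connection and layer normalization, which are left out here.

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d, d_hid, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Conv1d(d, d_hid, kernel_size=1)   # d -> d_hid
        self.w_2 = nn.Conv1d(d_hid, d, kernel_size=1)   # d_hid -> d
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, seq, d); Conv1d expects (batch, channels, seq)
        out = self.w_2(torch.relu(self.w_1(x.transpose(1, 2))))
        return self.dropout(out.transpose(1, 2))        # back to (batch, seq, d)

ffn = PositionwiseFeedForward(d=8, d_hid=32)
ffn.eval()                                              # disable dropout
x = torch.randn(2, 5, 8)
y = ffn(x)
print(y.shape)                                          # torch.Size([2, 5, 8])
```

With a kernel of size 1, each position is transformed independently of its neighbours, hence the name "position-wise".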

Encoder layer

  • Create an EncoderLayer that stacks together a multi-head attention with a position-wise feed forward
class EncoderLayer(nn.Module):
    def __init__(self, d, d_inner, n_head):

Positional encoding

  • Attention focuses on some words, based on their meaning (embedding)
  • The position of words in the sentence is also important, and there is no recurrence to capture it
  • We could add the integer position to the embedding, but large positions would rarely be seen during training because long sentences are rare
  • So we use repeating position patterns: sinusoids \[PE_{(pos,2i)} = \sin(pos/10000^{2i/d})\] \[PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d})\]


  • Simply add (sum) the word embeddings and the position encodings (they have the same size)
  • Stack multiple EncoderLayer using:
nn.ModuleList([EncoderLayer(d, d_hid, n_head) for _ in range(n_layers)])
  • Create positional Embeddings as follows:
self.pos_enc = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(max_len+1, d), freeze=True)
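
The helper `get_sinusoid_encoding_table` is not shown in these slides; the version below is a guess at what it could look like, directly following the sinusoid formulas. It assumes that `d` is even and that row `pos` holds the encoding of position `pos`; the actual helper may differ, for instance by reserving row 0 for padding.

```python
import math
import torch
import torch.nn as nn

def get_sinusoid_encoding_table(n_position, d):
    """Return a (n_position, d) table with sin on even and cos on odd columns."""
    pos = torch.arange(n_position, dtype=torch.float32).unsqueeze(1)  # (n_position, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)                # 0, 2, 4, ...
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d)              # 1 / 10000^(2i/d)
    angle = pos * inv_freq                                            # (n_position, d/2)
    table = torch.zeros(n_position, d)
    table[:, 0::2] = torch.sin(angle)
    table[:, 1::2] = torch.cos(angle)
    return table

max_len, d = 10, 8
table = get_sinusoid_encoding_table(max_len + 1, d)
pos_enc = nn.Embedding.from_pretrained(table, freeze=True)  # fixed, not trained
encodings = pos_enc(torch.tensor([[0, 1, 2]]))              # (1, 3, d)
print(encodings.shape)
```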


  • Implement the decoder similarly, test the whole model